Finally, we come to the most intriguing phase of the Gaussian splatting process: rendering! This step is arguably the most crucial, as it determines the realism of our model, yet it may also be the simplest. In parts 1 and 2 of our series we showed how to transform raw splats into a render-ready format; now we actually have to do the work and render them onto a fixed set of pixels. The authors developed a fast rendering engine in CUDA, which can be a bit tricky to follow, so I think it is helpful to first walk through the code in Python, using simple for loops for clarity. For those eager to dig deeper, all the necessary code is available on GitHub.
Let's look at how to render each individual pixel. From the previous article, we have all the necessary components: 2D points, associated colors, covariance, sorted depth order, 2D inverse covariance, minimum and maximum x and y values for each splat, and associated opacity. With these components, we can render any pixel. Given specific pixel coordinates, we iterate through all splats until we reach a saturation threshold, following the depth order of the splats relative to the camera plane (projected to the camera plane and then sorted by depth). For each splat, we first check whether the pixel coordinate lies within the bounds defined by the splat's minimum and maximum x and y values. This check determines whether we should continue rendering or ignore the splat for these coordinates. Next, we calculate the Gaussian weight of the splat at the pixel coordinate, using the splat mean, the splat covariance, and the pixel coordinates.
import torch
from tqdm import tqdm


def compute_gaussian_weight(
    pixel_coord: torch.Tensor,  # (1, 2) tensor
    point_mean: torch.Tensor,
    inverse_covariance: torch.Tensor,
) -> torch.Tensor:
    difference = point_mean - pixel_coord
    power = -0.5 * difference @ inverse_covariance @ difference.T
    return torch.exp(power).item()
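As a quick sanity check (with made-up values, purely for illustration), a pixel sitting exactly on the splat mean should receive the maximum weight of 1, and the weight should fall off as the pixel moves away from the mean:

mean = torch.tensor([[5.0, 5.0]])   # splat center in pixel space
inv_cov = torch.eye(2)              # identity inverse covariance, just for illustration
compute_gaussian_weight(torch.tensor([[5.0, 5.0]]), mean, inv_cov)  # 1.0
compute_gaussian_weight(torch.tensor([[7.0, 5.0]]), mean, inv_cov)  # exp(-2) ≈ 0.135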
We multiply this weight by the opacity of the splat to get a quantity called alpha. Before adding this new value to the pixel, we check whether we have already exceeded our saturation threshold: we don't want a splat hidden behind other splats to affect the pixel's color and consume compute if the pixel is already saturated. Therefore, we use a threshold that lets us stop rendering once it is exceeded. In practice, we start our saturation value at 1 and then multiply it by min(0.99, 1 - alpha) to get a new value. If this value is lower than our threshold (0.0001), we stop rendering that pixel and consider it complete. If not, we add the color weighted by saturation * alpha and update the saturation as new_saturation = old_saturation * (1 - alpha). Finally, we loop over every pixel (or every 16x16 tile in practice) and render. The full code is shown below.
def render_pixel(
    self,
    pixel_coords: torch.Tensor,
    points_in_tile_mean: torch.Tensor,
    colors: torch.Tensor,
    opacities: torch.Tensor,
    inverse_covariance: torch.Tensor,
    min_weight: float = 0.000001,
) -> torch.Tensor:
    total_weight = torch.ones(1).to(points_in_tile_mean.device)
    pixel_color = torch.zeros((1, 1, 3)).to(points_in_tile_mean.device)
    for point_idx in range(points_in_tile_mean.shape[0]):
        point = points_in_tile_mean[point_idx, :].view(1, 2)
        weight = compute_gaussian_weight(
            pixel_coord=pixel_coords,
            point_mean=point,
            inverse_covariance=inverse_covariance[point_idx],
        )
        alpha = weight * torch.sigmoid(opacities[point_idx])
        test_weight = total_weight * (1 - alpha)
        if test_weight < min_weight:
            return pixel_color
        pixel_color += total_weight * alpha * colors[point_idx]
        total_weight = test_weight
    # in case we never reach saturation
    return pixel_color
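To make the accumulation rule concrete, here is a tiny standalone version of the same front-to-back compositing recurrence on three made-up splats (the alphas and colors below are arbitrary, not from any real scene):

alphas = [0.6, 0.5, 0.9]  # sorted front to back
colors = [torch.tensor([1.0, 0.0, 0.0]),
          torch.tensor([0.0, 1.0, 0.0]),
          torch.tensor([0.0, 0.0, 1.0])]

transmittance = 1.0          # plays the role of total_weight in render_pixel
pixel = torch.zeros(3)
for alpha, color in zip(alphas, colors):
    test = transmittance * (1 - alpha)
    if test < 1e-6:          # pixel saturated: later splats are invisible
        break
    pixel += transmittance * alpha * color
    transmittance = test

print(pixel)  # tensor([0.6000, 0.2000, 0.1800]); the red splat in front dominates

Because each splat's contribution is scaled by the transmittance left over from the splats in front of it, a splat hidden behind nearly opaque ones contributes almost nothing, which is exactly why we can stop early.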
Now that we can render a single pixel, we can render a patch of the image, or what the authors call a tile.
def render_tile(
    self,
    x_min: int,
    y_min: int,
    points_in_tile_mean: torch.Tensor,
    colors: torch.Tensor,
    opacities: torch.Tensor,
    inverse_covariance: torch.Tensor,
    tile_size: int = 16,
) -> torch.Tensor:
    """Points in tile should be arranged in order of depth"""
    tile = torch.zeros((tile_size, tile_size, 3))

    # iterate over every pixel in the tile
    for pixel_x in range(x_min, x_min + tile_size):
        for pixel_y in range(y_min, y_min + tile_size):
            tile[pixel_x % tile_size, pixel_y % tile_size] = self.render_pixel(
                pixel_coords=torch.Tensor((pixel_x, pixel_y))
                .view(1, 2)
                .to(points_in_tile_mean.device),
                points_in_tile_mean=points_in_tile_mean,
                colors=colors,
                opacities=opacities,
                inverse_covariance=inverse_covariance,
            )
    return tile
And finally, we can use all of those tiles to render a complete image. Notice how we check that each Gaussian actually affects the current tile (the x_in_tile and y_in_tile masks).
def render_image(self, image_idx: int, tile_size: int = 16) -> torch.Tensor:
    """For each tile have to check if the point is in the tile"""
    preprocessed_scene = self.preprocess(image_idx)
    height = self.images[image_idx].height
    width = self.images[image_idx].width

    image = torch.zeros((width, height, 3))

    for x_min in tqdm(range(0, width, tile_size)):
        x_in_tile = (x_min >= preprocessed_scene.min_x) & (
            x_min + tile_size <= preprocessed_scene.max_x
        )
        if x_in_tile.sum() == 0:
            continue
        for y_min in range(0, height, tile_size):
            y_in_tile = (y_min >= preprocessed_scene.min_y) & (
                y_min + tile_size <= preprocessed_scene.max_y
            )
            points_in_tile = x_in_tile & y_in_tile
            if points_in_tile.sum() == 0:
                continue
            points_in_tile_mean = preprocessed_scene.points[points_in_tile]
            colors_in_tile = preprocessed_scene.colors[points_in_tile]
            opacities_in_tile = preprocessed_scene.sigmoid_opacity[points_in_tile]
            inverse_covariance_in_tile = preprocessed_scene.inverse_covariance_2d[
                points_in_tile
            ]
            image[x_min : x_min + tile_size, y_min : y_min + tile_size] = (
                self.render_tile(
                    x_min=x_min,
                    y_min=y_min,
                    points_in_tile_mean=points_in_tile_mean,
                    colors=colors_in_tile,
                    opacities=opacities_in_tile,
                    inverse_covariance=inverse_covariance_in_tile,
                    tile_size=tile_size,
                )
            )
    return image
At last, now that we have all the necessary components, we can render an image. We take all the 3D points from the Treehill dataset and initialize them as Gaussians. To avoid an expensive nearest-neighbour search, we initialize all the scale variables to 0.01 (note that with such a small variance we need a dense concentration of points in one place for them to be visible; a larger variance makes the process quite slow). Then all we need to do is call render_image with the index of the image we are trying to emulate, and, as you can see, we get a sparse set of points that resembles our image. (See the bonus section at the bottom for an equivalent CUDA kernel written with the nifty PyTorch tool that compiles CUDA code!)
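The renderer class itself isn't reproduced in this post, so the snippet below is only a sketch of what that initialization might look like; GaussianScene is a placeholder name, not the actual class from the repository:

points_3d = torch.rand(10_000, 3)       # stand-in for the Treehill point cloud
colors = torch.rand(10_000, 3)
scales = torch.full((10_000, 3), 0.01)  # fixed small scale, no nearest-neighbour search
opacities = torch.zeros(10_000, 1)      # sigmoid(0) = 0.5 initial opacity

scene = GaussianScene(  # hypothetical class holding the parameters and render_image()
    points=points_3d, colors=colors, scales=scales, opacities=opacities
)
image = scene.render_image(image_idx=0, tile_size=16)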
Although the backward pass is not covered in this tutorial, it is worth noting that while we start with just these few points, we soon have hundreds of thousands of splats for most scenes. This growth comes from splitting large splats (those with large variance along their axes) into smaller ones and removing splats with extremely low opacity. For example, if we initialized the scale to the mean distance of the three nearest neighbours, most of the space would be covered, but to capture fine detail these large splats would need to be split into much smaller ones. The optimization also needs to populate areas with very few Gaussians. The authors call these two scenarios over-reconstruction and under-reconstruction, and identify both by large gradient values on the affected splats. Depending on their size, the splats are then split or cloned (see the image below) and the optimization process continues.
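The exact densification code is out of scope for this post, but the decision rule can be sketched roughly like this (the thresholds and names are illustrative, not taken from the authors' implementation):

def densify_step(scales, grad_accum, grad_threshold=0.0002, scale_threshold=0.01):
    # Rough sketch of adaptive density control: splats whose accumulated positional
    # gradient is large get densified; large ones are split, small ones are cloned.
    needs_densify = grad_accum.norm(dim=-1) > grad_threshold
    is_large = scales.max(dim=-1).values > scale_threshold
    split_mask = needs_densify & is_large    # over-reconstruction
    clone_mask = needs_densify & ~is_large   # under-reconstruction
    return split_mask, clone_mask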
And that's an easy introduction to Gaussian splatting! You should now have a good intuition for what exactly is going on in the forward pass of a Gaussian scene representation. While it can feel a bit overwhelming, and it isn't exactly a neural network, all it takes is a bit of linear algebra and we can represent 3D geometry in 2D!
Feel free to leave comments on confusing topics or if I got something wrong, and you can always connect with me on LinkedIn or Twitter!