Accelerating Denoising at the Speed of Light

From Psych 221 Image Systems Engineering
Revision as of 16:02, 13 December 2024 by Samidhm (Conclusion and Future Work)

Introduction

In computer graphics, real-time ray tracing has become widely adopted for generating high-quality visuals in applications like gaming and interactive simulations. A significant challenge in ray tracing is that using a low number of samples per pixel often results in noisy images, limiting their practical use. Achieving high-quality images typically requires ray tracing with a large number of samples per pixel, which demands substantial computational power and makes real-time generation difficult. Consequently, there is a growing need for effective noise reduction techniques for images rendered with fewer samples per pixel. Efficient denoising can produce high-quality images that preserve scene realism while optimizing computational resources.

Background and Problem Setup

Background

While applications such as gaming typically render high-resolution images (e.g., 1080p, 4K), recent advancements in fields like robotics have created a demand for extremely fast, real-time rendering of low-resolution images [1], [2]. This project specifically addresses this challenge, focusing on developing high-quality and efficient denoising techniques for low-resolution ray-traced images.

Problem Definition

Given a 64x64 image rendered with one sample per pixel, along with other features obtainable with similar computational resources, we propose a denoising framework capable of producing a 64x64 output image that closely matches the quality of a ground-truth image rendered with 512 samples per pixel. Our framework is evaluated on two key criteria: quality and performance.

Quality

The generated image should closely replicate the realism and quality of the ground-truth image. Quality is assessed using Peak Signal-to-Noise Ratio (PSNR).

Performance

The system should be computationally efficient. Performance is evaluated by the number of frames it can denoise per second, serving as a secondary metric.

Approach

Dataset Generation

To meet the training and evaluation requirements of our framework, we have developed a comprehensive dataset featuring both low- and high-quality ray-traced images. We selected a scene rich in detail, including various objects, textures, lighting, and colors, using the exterior of the Amazon Lumberyard scene [3] as our setting. The dataset consists of 12,600 pairs of 64x64 images, where each pair includes a low-quality image rendered with one sample per pixel and a corresponding high-quality image rendered with 512 samples per pixel.

To generate these images, we strategically chose 35 key points within the scene to ensure broad coverage. At each point, we rendered images at 15 different heights and 24 distinct 3D angles. Additionally, we included auxiliary features such as depth maps, normal maps, albedo, direct gloss, and indirect gloss, all of which contribute to more accurate denoising.
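The capture grid above can be sketched as a simple enumeration. The point, height, and angle indices below are placeholders for the actual hand-picked scene coordinates:

```python
from itertools import product

# Hypothetical capture grid: 35 key points x 15 heights x 24 angles.
# The indices stand in for the actual scene coordinates, which the
# project chooses by hand for broad scene coverage.
points = range(35)
heights = range(15)
angles = range(24)

render_configs = list(product(points, heights, angles))
print(len(render_configs))  # 35 * 15 * 24 = 12,600 image pairs
```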

Amazon Lumberyard Bistro Scene

To perform this rendering, we use Blender's Cycles path tracer. Using Blender's scripting interface, we automatically iterate through the different configurations, place the camera at each position, and render the required data. Blender's compositing nodes are used to save these features automatically.

For training, validation, and testing purposes, the dataset is divided into a 60%-20%-20% ratio. However, instead of randomly splitting the examples, we carefully handpick the points for each subset to ensure a more balanced and representative distribution. The selection process is designed to maintain significant scene diversity across all splits, which is crucial for effective model evaluation.

This approach is necessary because images rendered from the same point in the scene tend to be highly similar. If the same points were included in both the training and test sets, it could lead to an unrealistic assessment of the model’s performance, as the model might memorize specific points rather than learning to generalize. By ensuring that the points in each subset are distinct, we can better evaluate the model’s ability to generalize to new, unseen viewpoints, thus providing a more reliable measure of its denoising capabilities.
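The point-level split described above can be sketched as follows. This is a random shuffle for illustration only; the project hand-picks the points to additionally balance scene diversity across subsets:

```python
import random

def split_points(points, seed=0):
    # Shuffle-and-slice shown for illustration; the project hand-picks
    # points so each subset stays diverse and representative.
    pts = list(points)
    random.Random(seed).shuffle(pts)
    n_train = int(0.6 * len(pts))
    n_val = int(0.2 * len(pts))
    train = pts[:n_train]
    val = pts[n_train:n_train + n_val]
    test = pts[n_train + n_val:]
    return train, val, test

# 35 capture points -> 21 train, 7 validation, 7 test points.
train, val, test = split_points(range(35))
```

Because the split happens at the level of capture points rather than individual images, every image rendered from a given point stays in that point's subset, so no viewpoint leaks across splits.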

A sample dataset entry

Model Architecture

We chose a simple UNet architecture [4] consisting of n encoder layers, n decoder layers, and one bottleneck layer, as shown in the architecture figure below. Each encoder layer consists of two stacked blocks of a 2D convolutional layer, batch normalization, and a ReLU activation. The bottleneck layer is identical to an encoder layer, although other layers such as RNNs or LSTMs could serve as the bottleneck depending on application requirements. Each decoder layer consists of two parts: a 2D transposed convolutional layer and a convolutional layer. The output of the transposed convolution is concatenated with the corresponding encoder output (skip connection) and passed into the convolutional layer.

Simple UNet architecture
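A minimal PyTorch sketch of a hypothetical n = 1 instance of this architecture, assuming 12 input channels (image, albedo, roughness, and normal at 3 channels each) and an illustrative base width of 32:

```python
import torch
import torch.nn as nn

class ConvBlock(nn.Module):
    """Two stacked (Conv2d -> BatchNorm -> ReLU) blocks, as described above."""
    def __init__(self, in_ch, out_ch):
        super().__init__()
        self.block = nn.Sequential(
            nn.Conv2d(in_ch, out_ch, 3, padding=1),
            nn.BatchNorm2d(out_ch),
            nn.ReLU(inplace=True),
            nn.Conv2d(out_ch, out_ch, 3, padding=1),
            nn.BatchNorm2d(out_ch),
            nn.ReLU(inplace=True),
        )

    def forward(self, x):
        return self.block(x)

class UNet1(nn.Module):
    """One encoder, one bottleneck, one decoder (n = 1)."""
    def __init__(self, in_ch=12, out_ch=3, base=32):
        super().__init__()
        self.enc = ConvBlock(in_ch, base)
        self.pool = nn.MaxPool2d(2)
        self.bottleneck = ConvBlock(base, base * 2)  # same structure as an encoder
        self.up = nn.ConvTranspose2d(base * 2, base, 2, stride=2)
        self.dec = ConvBlock(base * 2, base)  # 2x channels from the skip concat
        self.out = nn.Conv2d(base, out_ch, 1)

    def forward(self, x):
        e = self.enc(x)                                  # skip connection source
        b = self.bottleneck(self.pool(e))
        d = self.dec(torch.cat([self.up(b), e], dim=1))  # upsample + skip concat
        return self.out(d)

x = torch.randn(2, 12, 64, 64)
y = UNet1()(x)  # denoised RGB: shape (2, 3, 64, 64)
```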

Model Training

Input Feature Design

Along with the 1 sample-per-pixel image, the model is provided with additional features that aid denoising: the depth map, normal, relative normal, albedo, and material roughness. The depth map and normal are extracted directly during rendering. The relative normal is computed by applying the view-space matrix to the true normal. The albedo is computed by adding the rendered glossy and diffuse colors. The roughness is calculated by inverting the sum of the direct and indirect gloss rendered by Blender. In the feature ablation study below, we evaluate the efficacy of each extracted feature; only the normal, albedo, and roughness meaningfully improve the model's performance.
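The derived features can be sketched as below. This is illustrative arithmetic, not the project's exact pipeline: "inverting" the gloss sum is interpreted here as subtracting it from one, and the pass names are placeholders for Blender's render passes:

```python
import numpy as np

def derived_features(glossy_color, diffuse_color,
                     direct_gloss, indirect_gloss,
                     normal, view_matrix):
    # Albedo: sum of the rendered glossy and diffuse color passes.
    albedo = glossy_color + diffuse_color
    # Roughness: "inverting" the total gloss, read here as 1 - sum
    # (assumes the gloss passes are normalized to [0, 1]).
    roughness = 1.0 - (direct_gloss + indirect_gloss)
    # Relative normal: world-space normal rotated into view space by the
    # rotational part of the 4x4 view matrix.
    h, w, _ = normal.shape
    rel_normal = (normal.reshape(-1, 3) @ view_matrix[:3, :3].T).reshape(h, w, 3)
    return albedo, roughness, rel_normal
```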

Loss Function

The training process consists of two phases, each employing different loss functions. Initially, a linear combination of L1 and HFEN loss [5] is used, followed by L2 loss in the second phase. This training schedule is based on the approach described in [6].

Phase 1: L1 and HFEN Loss

In the first phase, the loss function is defined as:

$$\text{HFEN} = \frac{\sum_{i=1}^{n} \left\| \text{LoG}(\text{output}_i) - \text{LoG}(\text{target}_i) \right\|_2^2}{\sum_{i=1}^{n} \left\| \text{LoG}(\text{target}_i) \right\|_2^2}$$

$$L = \alpha \cdot \frac{1}{n} \sum_{i=1}^{n} \left| y_i - \hat{y}_i \right| + (1 - \alpha) \cdot \text{HFEN}$$

Where:

  • LoG is a convolution with a scaled integer approximation of the Laplacian-of-Gaussian kernel (σ = 1.4).
  • L1 loss enables robust model training and is resistant to outliers [5].
  • HFEN loss [7] aims to reconstruct high-frequency details in the output image. However, as our loss function study shows, HFEN does not significantly improve the model's performance.
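The HFEN term can be sketched with SciPy's floating-point LoG filter (a stand-in for the scaled integer kernel approximation used in the project):

```python
import numpy as np
from scipy.ndimage import gaussian_laplace

def hfen(output, target, sigma=1.4):
    # Ratio of LoG-filtered error energy to LoG-filtered target energy.
    log_out = gaussian_laplace(output.astype(np.float64), sigma)
    log_tgt = gaussian_laplace(target.astype(np.float64), sigma)
    return np.sum((log_out - log_tgt) ** 2) / np.sum(log_tgt ** 2)
```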

Phase 2: L2 Loss

In the second phase, the model is fine-tuned using L2 loss to improve edge sharpness and color accuracy in the denoised image. The loss function is defined as:

$$L = \frac{1}{n} \sum_{i=1}^{n} (y_i - \hat{y}_i)^2$$

This phase refines the output, producing images with improved visual fidelity.

Training Schedule

We run phase 1 for 30 epochs and phase 2 for 20 epochs. We use an Adam optimizer with a learning rate of 0.001 and default beta values. We have a fixed batch size of 32 for training.
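The two-phase schedule can be sketched as follows, with a stand-in convolutional model and synthetic data in place of the actual UNet and dataset; α = 1 reflects the best setting from the loss study:

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

model = nn.Conv2d(12, 3, 3, padding=1)                # stand-in for the UNet
opt = torch.optim.Adam(model.parameters(), lr=0.001)  # default betas (0.9, 0.999)
alpha = 1.0                                           # L1 weight in phase 1

# Tiny synthetic stand-in for the DataLoader (real training uses batch size 32).
loader = [(torch.randn(2, 12, 16, 16), torch.randn(2, 3, 16, 16))]

for epoch in range(50):
    for noisy, clean in loader:
        out = model(noisy)
        if epoch < 30:
            # Phase 1: alpha * L1 + (1 - alpha) * HFEN.
            # The HFEN term vanishes at alpha = 1, so it is omitted here.
            loss = alpha * F.l1_loss(out, clean)
        else:
            # Phase 2: L2 fine-tuning for edge sharpness and color accuracy.
            loss = F.mse_loss(out, clean)
        opt.zero_grad()
        loss.backward()
        opt.step()
```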

Performance Optimization

We used torch.compile which is aimed at optimizing the performance of models by transforming them into more efficient representations. It makes PyTorch code run faster by JIT-compiling the code into optimized kernels, which are lower-level representations that can be intermediate representations or even directly machine code all while requiring minimal code changes. We also employ half precision floating point numbers since this makes computations faster while also reducing the model's memory footprint.
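Both optimizations amount to two calls on the trained model. A sketch with a stand-in model (torch.compile only wraps the model here; the optimized kernels are generated lazily at the first forward call):

```python
import torch
import torch.nn as nn

model = nn.Sequential(                    # stand-in for the trained UNet
    nn.Conv2d(12, 32, 3, padding=1),
    nn.ReLU(),
    nn.Conv2d(32, 3, 3, padding=1),
)

# Half precision: halves the parameter memory footprint and speeds up
# math on hardware with fast fp16 units. Inputs must be cast to match.
model = model.half()

# JIT-compile into optimized kernels; creating the wrapper is cheap,
# and the actual compilation happens on the first call.
model = torch.compile(model)
```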

We also attempted explicit layer fusion, in which a sequence of convolution, batch normalization, and ReLU layers is fused into a single layer. However, this optimization offers no benefit beyond model compilation, confirming our expectation that torch.compile already performs layer fusion implicitly.

Evaluation and Results

Evaluation Metrics

We use Peak Signal-to-Noise Ratio (PSNR) to evaluate the quality of the output images denoised by the model. The higher the PSNR, the closer our denoised image is to the ground truth 512 samples-per-pixel image. PSNR is defined by first calculating the Mean Squared Error (MSE) between the two images and then the PSNR metric as follows:

$$MSE = \frac{1}{n} \sum_{i=1}^{n} (\text{img1}_i - \text{img2}_i)^2$$

$$PSNR = \begin{cases} \infty & \text{if } MSE = 0 \\ 20 \log_{10}\!\left(\dfrac{255}{\sqrt{MSE}}\right) & \text{otherwise} \end{cases}$$

We measure the average latency of batch inference in milliseconds (ms) to analyze the model's performance. The latency is further used to estimate the frame rate that our model can support using the formula below:

$$\text{Frame Rate} = \frac{1000}{\text{latency\_in\_ms}} \times \text{Batch\_Size}$$

In addition to the above metrics, we also measure the size of the model in bytes.
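Both metrics translate directly into code:

```python
import numpy as np

def psnr(img1, img2, peak=255.0):
    # MSE over all pixels, then PSNR in dB; identical images give infinity.
    mse = np.mean((img1.astype(np.float64) - img2.astype(np.float64)) ** 2)
    if mse == 0:
        return float("inf")
    return 20.0 * np.log10(peak / np.sqrt(mse))

def frame_rate(latency_in_ms, batch_size):
    # Frames per second supported at the measured batch inference latency.
    return (1000.0 / latency_in_ms) * batch_size
```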

Loss function and training schedule study

In the first phase, the L1 loss is weighted by the parameter alpha and the HFEN loss by the corresponding (1 - alpha). We swept alpha from 0 to 1 in increments of 0.25 and observed that alpha = 1 yields the best result, while alpha = 0 performs significantly worse than all other runs. This implies that the HFEN loss does not significantly improve the model's performance, whereas L1 plays a crucial role.

Next, we investigate the role of the second phase of training. The L2 loss function is sensitive to outliers due to the squaring of errors, making it suitable for applications where minimizing large errors is crucial [6]. However, due to this sensitivity, L2 loss should be applied only once the model has reached a relatively stable region of the optimization landscape; hence the model is first trained with L1 loss.

As observed in the validation PSNR graph in the figure below, the L2 loss plays an important role in improving the model's overall quality.

Validation PSNR vs Epochs
Training loss in Phase 1
Training loss in Phase 2

Model size exploration

We study the impact of the number of encoder and decoder layers in the UNet model, testing models with 1, 2, and 4 layers. We record PSNRs of 72.27, 72.29, and 72.37 and batch inference latencies of 0.0469 ms, 0.0727 ms, and 0.0824 ms respectively, for a batch size of 256 on an NVIDIA T4 GPU. The PSNR increases only slightly, while the inference latency increases considerably. We therefore conclude that a 1-layer model yields satisfactory results while remaining computationally efficient.


(from left to right) 1 spp ray-traced image, 512 spp ray-traced image, 1, 2, and 4 layer denoised images

Feature ablation study

In this study, we attempt to understand the impact of each input feature in an effort to identify the minimum viable subset of features that attains a high PSNR while limiting the model's size, thereby improving inference latency. The set of available features includes the original image (3 channels), roughness (3 channels), albedo (3 channels), normal (3 channels), relative normal (3 channels), and depth (1 channel).

Original Image (3 channels)

  • Captures the standard RGB representation of the scene.
  • Useful for visual reference and texture analysis.

Roughness (3 channels)

  • Encodes surface texture to define how light scatters.

Albedo (3 channels)

  • Represents the base color of the surface without lighting.
  • Essential for separating texture color from illumination effects.

Normal (3 channels)

  • Provides surface orientation for accurate lighting computation.
  • Enhances realism by simulating fine surface details.

Relative Normal (3 channels)

  • Adjusted surface orientation relative to a specific reference.
  • Useful for detecting deviations or localized anomalies.

Depth (1 channel)

  • Encodes distance from the camera to the surface in grayscale.


The original input image is required, so we run experiments removing every other feature one at a time. We also assess the impact of normal-related features as a group by removing both the normal and relative normal. Our findings establish the following feature importance order: albedo >> roughness > normal > relative normal > depth. Furthermore, removing both forms of normal does negatively impact the model's performance. Based on these results, our final model includes only the albedo, roughness, and normal as input features. Interestingly, this model beats even the model that includes all input features, implying that feature selection plays an important part in the model's quality.

PSNR values for different missing features studied on a 1 layer UNet

Latency and memory optimization study

To assess the impact of model compilation and half-precision floating point, we measure the model size and batch inference latency. Both optimizations improve the latency of the model, resulting in higher throughput. We perform latency measurements on a 1-layer model with the roughness, albedo, and depth input features. The latency is measured on an NVIDIA A100 40GB GPU, using a batch size of 1260. The results are tabulated in the table below.

Latency and throughput for 1 layer, 3 features UNet

Conclusion and Future Work

In summary, we trained a 1-layer UNet that successfully denoises an input image ray traced at 1 spp. The model takes the input image along with three features (albedo, roughness, and normal) and produces a denoised output, as shown in the two figures below. The model has been reduced to half precision, and we used torch.compile() to perform code optimizations. Its performance characteristics are summarized in the table below; all inference latencies were measured on a batch of size 1260.

The next steps would be to explore quantizing the model to int8 using the TensorRT library (the PyTorch quantization library does not support a GPU backend). We expect this to perform well given the availability of fast integer matrix units. We also observed that the denoised images from our model tend to look blurry compared to the ground truth; to address this, we intend to incorporate supersampled auxiliary features as inputs to improve the output PSNR.

Finally, an interesting direction would be to study metrics other than PSNR for analyzing denoising quality. A study of the tolerance within which a denoised image can be accepted as a true representation of the scene could also be done; such a study would depend heavily on the target application, whether gaming, interactive simulations, or robotic training data.


(from left to right) 1 spp ray-traced image, 512 spp ray-traced image, 1 layer albedo, roughness, normal
(from left to right) 1 spp ray-traced image, 512 spp ray-traced image, 1 layer albedo, roughness, normal


Performance Metrics


We also study the correlation coefficient (R) and obtain a strong correlation between the 512 spp image and the denoised output. This suggests that such a denoiser could be used with fixed-tolerance acceptance levels.

Correlation of 512 spp v/s denoised values

References

[1] Sreedharan, S., Radhakrishnan, G., Gupta, D., & Sudarshan, T. (2014). Analysis of robotic environment using low resolution image sequence. 2014 International Conference on Contemporary Computing and Informatics (IC3I), 495–499.

[2] Juřík, M., Šmídl, V., Kuthan, J., & Mach, F. (2019). Trade-off between resolution and frame rate for visual tracking of mini-robots on planar surfaces. 2019 International Conference on Manipulation, Automation and Robotics at Small Scales (MARSS), 1–6.

[3] Lumberyard, A. (2017, July). Amazon Lumberyard Bistro. Open Research Content Archive (ORCA). Retrieved from http://developer.nvidia.com/orca/amazon-lumberyard-bistro

[4] Ronneberger, O., Fischer, P., & Brox, T. (2015). U-Net: Convolutional networks for biomedical image segmentation. In Medical Image Computing and Computer-Assisted Intervention (MICCAI 2015), 234–241. Springer.

[5] Chaitanya, C. R. A., Kaplanyan, A. S., Schied, C., Salvi, M., Lefohn, A., Nowrouzezahrai, D., & Aila, T. (2017). Interactive reconstruction of Monte Carlo image sequences using a recurrent denoising autoencoder. ACM Transactions on Graphics, 36(4).

[6] Zhao, H., Gallo, O., Frosio, I., & Kautz, J. (2016). Loss functions for image restoration with neural networks. IEEE Transactions on Computational Imaging, 3(1), 47–57.

[7] Ravishankar, S., & Bresler, Y. (2011). MR image reconstruction from highly undersampled k-space data by dictionary learning. IEEE Transactions on Medical Imaging, 30(5), 1028–1041.