Accelerating Denoising at the Speed of Light

From Psych 221 Image Systems Engineering

Revision as of 07:30, 13 December 2024

Introduction

In computer graphics, real-time ray tracing has become widely adopted for generating high-quality visuals in applications like gaming and interactive simulations. A significant challenge in ray tracing is that using a low number of samples per pixel often results in noisy images, limiting their practical use. Achieving high-quality images typically requires ray tracing with a large number of samples per pixel, which demands substantial computational power and makes real-time generation difficult. Consequently, there is a growing need for effective noise reduction techniques for images rendered with fewer samples per pixel. Efficient denoising can produce high-quality images that preserve scene realism while optimizing computational resources.

Background and Problem Setup

Background

While applications such as gaming typically render high-resolution images (e.g., 1080p, 4K), recent advancements in fields like robotics have created a demand for extremely fast, real-time rendering of low-resolution images [1], [2]. This project specifically addresses this challenge, focusing on developing high-quality and efficient denoising techniques for low-resolution ray-traced images.

Problem Definition

Given a 64x64 image rendered with one sample per pixel, along with other features that can be obtained using similar computational resources, we propose a denoising framework capable of producing a 64x64 output image that closely matches the quality of a ground-truth image rendered with 512 samples per pixel. Our framework is evaluated primarily based on two key criteria:

Quality

The generated image should closely replicate the realism and quality of the ground-truth image. Quality is assessed using Peak Signal-to-Noise Ratio (PSNR).

Performance

The system should be computationally efficient. Performance is evaluated by the number of frames it can denoise per second, serving as a secondary metric.

Approach

Dataset Generation

To meet the training and evaluation requirements of our framework, we have developed a comprehensive dataset featuring both low- and high-quality ray-traced images. We selected a scene rich in detail, including various objects, textures, lighting, and colors, using the exterior of the Amazon Lumberyard Bistro scene [3] as our setting. The dataset consists of 12,600 pairs of 64x64 images, where each pair includes a low-quality image rendered with one sample per pixel and a corresponding high-quality image rendered with 512 samples per pixel.

To generate these images, we strategically chose 35 key points within the scene to ensure broad coverage. At each point, we rendered images at 15 different heights and 24 distinct 3D angles. Additionally, we included auxiliary features such as depth maps, normal maps, albedo, direct gloss, and indirect gloss reflecting light within the scene, all of which contribute to more accurate denoising.

To perform this rendering, we use Blender's Cycles path tracer. Using Blender's scripting interface, we automatically iterate through the different configurations, place the camera at each position, and render the required data. To enable automatic saving of these features, we use Blender's compositing nodes.

For training, validation, and testing purposes, the dataset is divided into a 60%-20%-20% ratio. However, instead of randomly splitting the examples, we carefully handpick the points for each subset to ensure a more balanced and representative distribution. The selection process is designed to maintain significant scene diversity across all splits, which is crucial for effective model evaluation.

This approach is necessary because images rendered from the same point in the scene tend to be highly similar. If the same points were included in both the training and test sets, it could lead to an unrealistic assessment of the model’s performance, as the model might memorize specific points rather than learning to generalize. By ensuring that the points in each subset are distinct, we can better evaluate the model’s ability to generalize to new, unseen viewpoints, thus providing a more reliable measure of its denoising capabilities.
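The point-level split described above can be sketched as follows. This is illustrative only: the project hand-picks the points for each subset, whereas this hypothetical `split_points` helper assigns them with a seeded shuffle.

```python
import random

def split_points(num_points=35, ratios=(0.6, 0.2, 0.2), seed=0):
    """Assign whole scene points (not individual frames) to train/val/test.

    All 15 heights x 24 angles rendered from one point stay in a single
    subset, so near-duplicate views never leak across splits.
    """
    points = list(range(num_points))
    random.Random(seed).shuffle(points)
    n_train = round(ratios[0] * num_points)
    n_val = round(ratios[1] * num_points)
    return (points[:n_train],
            points[n_train:n_train + n_val],
            points[n_train + n_val:])

train_pts, val_pts, test_pts = split_points()  # 21 / 7 / 7 points
```

Because each point contributes 360 renders, splitting at the point level (rather than the image level) is what prevents the model from being evaluated on viewpoints it has effectively memorized.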

Model Architecture

We chose a simple UNet architecture [4] that consists of $n$ encoder layers, $n$ decoder layers, and one bottleneck layer, as shown in figure \ref{fig:unet}. Each encoder consists of two stacked blocks, each containing a 2D convolutional layer, batch normalization, and a ReLU activation. The bottleneck layer is chosen to be the same as an encoder layer; however, other layers such as an RNN or LSTM can be used as the bottleneck depending on the application requirements. A decoder layer consists of two parts: a 2D transposed convolutional layer and a convolutional layer. The output of the transposed convolution is concatenated with the corresponding encoder output (skip connection) and passed into the convolutional layer.
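A minimal PyTorch sketch of this architecture for $n = 1$. The 16-channel input (the 1-spp image plus the auxiliary features described below) and the 32 base filters are our assumptions, not the project's exact configuration.

```python
import torch
import torch.nn as nn

def conv_block(in_ch, out_ch):
    # Two stacked (Conv2d -> BatchNorm -> ReLU) units, as in each encoder.
    return nn.Sequential(
        nn.Conv2d(in_ch, out_ch, 3, padding=1),
        nn.BatchNorm2d(out_ch),
        nn.ReLU(inplace=True),
        nn.Conv2d(out_ch, out_ch, 3, padding=1),
        nn.BatchNorm2d(out_ch),
        nn.ReLU(inplace=True),
    )

class UNet(nn.Module):
    """Minimal n = 1 UNet: one encoder, a bottleneck, one decoder."""
    def __init__(self, in_ch=16, base=32):
        super().__init__()
        self.enc = conv_block(in_ch, base)
        self.pool = nn.MaxPool2d(2)
        self.bottleneck = conv_block(base, base * 2)  # same block as an encoder
        self.up = nn.ConvTranspose2d(base * 2, base, 2, stride=2)
        self.dec = conv_block(base * 2, base)  # doubled channels from the skip concat
        self.head = nn.Conv2d(base, 3, 1)      # predict the denoised RGB image

    def forward(self, x):
        e = self.enc(x)
        b = self.bottleneck(self.pool(e))
        # Skip connection: concatenate the upsampled bottleneck with the
        # encoder output before the decoder's convolutional layer.
        d = self.dec(torch.cat([self.up(b), e], dim=1))
        return self.head(d)
```

For larger $n$, the encoder/pool and upsample/decoder pairs repeat, with one skip connection per level.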


Model Training

Input Feature Design

Along with the 1 sample-per-pixel image provided as input, the model is given additional features that are useful in denoising the image. The features we include are the depth map, normal, relative normal, albedo, and material roughness. The depth map and normal are extracted directly during rendering. The relative normal is computed by applying the view-space matrix to the true normal. The albedo is computed by adding the rendered glossy and diffuse colors. The roughness is calculated by inverting the sum of the direct and indirect gloss rendered by Blender. In section 4.2, we perform ablations to evaluate the efficacy of each extracted feature; only the normal, albedo, and roughness meaningfully improve the model's performance.
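The channel stacking and the two derived features can be sketched as below. `build_input` and `derived_features` are hypothetical helpers; in particular, reading "inverting" as `1 - (direct + indirect gloss)` is our assumption.

```python
import torch

def derived_features(glossy_color, diffuse_color, direct_gloss, indirect_gloss):
    """Derive albedo and roughness from Blender's render passes.

    Albedo = glossy + diffuse color (as described in the text); treating
    'inverting' as 1 - (direct + indirect gloss) is an assumption.
    """
    albedo = glossy_color + diffuse_color
    roughness = 1.0 - (direct_gloss + indirect_gloss)
    return albedo, roughness

def build_input(image, depth, normal, rel_normal, albedo, roughness):
    """Stack per-pixel features along the channel axis.

    Assumed shapes: image/normal/rel_normal/albedo/roughness are (3, H, W),
    depth is (1, H, W); the result is a (16, H, W) tensor fed to the UNet.
    """
    return torch.cat([image, depth, normal, rel_normal, albedo, roughness], dim=0)
```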

Loss Function

The training process consists of two phases, each employing a different loss function. Initially, a linear combination of L1 and HFEN loss [5] is used, followed by L2 loss in the second phase. This training schedule is based on the approach described in [6].

Phase 1: L1 and HFEN Loss

In the first phase, the loss function is defined as:

$$\mathrm{HFEN} = \frac{\sum_{i=1}^{n} \left\lVert \mathrm{LoG}(\mathrm{output}_i) - \mathrm{LoG}(\mathrm{target}_i) \right\rVert_2^2}{\sum_{i=1}^{n} \left\lVert \mathrm{LoG}(\mathrm{target}_i) \right\rVert_2^2}$$

$$\mathcal{L} = \alpha \cdot \frac{1}{n} \sum_{i=1}^{n} \left| y_i - \hat{y}_i \right| + (1 - \alpha) \cdot \mathrm{HFEN}$$

Where:

LoG is a convolution operation with a scaled integer approximation of the Laplacian-of-Gaussian kernel (σ = 1.4). L1 loss enables robust model training and is resistant to outliers [5]. HFEN loss [7] aims to reconstruct high-frequency details in the output image. However, as demonstrated in Section 4.2, HFEN does not significantly improve the model's performance.
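A sketch of the phase-1 loss in PyTorch. The 9x9 kernel size, the zero-mean normalization, and using a floating-point (rather than scaled-integer) LoG kernel are our assumptions.

```python
import torch
import torch.nn.functional as F

def log_kernel(size=9, sigma=1.4):
    """Laplacian-of-Gaussian kernel with sigma = 1.4, normalized to zero mean."""
    ax = torch.arange(size, dtype=torch.float32) - size // 2
    xx, yy = torch.meshgrid(ax, ax, indexing="ij")
    r2 = xx ** 2 + yy ** 2
    k = (r2 - 2 * sigma ** 2) / sigma ** 4 * torch.exp(-r2 / (2 * sigma ** 2))
    return (k - k.mean()).view(1, 1, size, size)

def hfen(output, target, kernel):
    """Ratio of high-frequency residual energy to the target's HF energy."""
    c = output.shape[1]
    k = kernel.expand(c, 1, -1, -1)  # depthwise: one LoG filter per channel
    pad = kernel.shape[-1] // 2
    ho = F.conv2d(output, k, groups=c, padding=pad)
    ht = F.conv2d(target, k, groups=c, padding=pad)
    return ((ho - ht) ** 2).sum() / (ht ** 2).sum().clamp_min(1e-12)

def phase1_loss(output, target, kernel, alpha=0.75):
    # alpha weights L1; (1 - alpha) weights HFEN, matching the equation above.
    return alpha * F.l1_loss(output, target) + (1 - alpha) * hfen(output, target, kernel)
```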

Phase 2: L2 Loss

In the second phase, the model is fine-tuned using L2 loss to improve edge sharpness and color accuracy in the denoised image. The loss function is defined as:

$$\mathcal{L} = \frac{1}{n} \sum_{i=1}^{n} \left( y_i - \hat{y}_i \right)^2$$

This phase refines the output, producing images with improved visual fidelity.


Training Schedule

We run phase 1 for 30 epochs and phase 2 for 20 epochs. We use an Adam optimizer with a learning rate of 0.001 and default beta values. We have a fixed batch size of 32 for training.
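The schedule can be sketched as a single loop that switches loss functions after phase 1. `train` is a hypothetical helper; `phase1_loss_fn` stands for any callable implementing the L1+HFEN combination defined above, and the batch size of 32 is assumed to be set on the DataLoader.

```python
import torch

def train(model, loader, phase1_loss_fn, phase1_epochs=30, phase2_epochs=20):
    """Two-phase schedule: 30 epochs of L1+HFEN, then 20 epochs of L2.

    Adam with lr = 1e-3 and default betas, per the training schedule above.
    """
    opt = torch.optim.Adam(model.parameters(), lr=1e-3)
    mse = torch.nn.functional.mse_loss
    for epoch in range(phase1_epochs + phase2_epochs):
        # Switch to L2 once the model is in a stable region of the landscape.
        loss_fn = phase1_loss_fn if epoch < phase1_epochs else mse
        for x, y in loader:
            opt.zero_grad()
            loss_fn(model(x), y).backward()
            opt.step()
```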

Performance Optimization

We use torch.compile, which speeds up PyTorch models by JIT-compiling them into optimized kernels: lower-level representations ranging from intermediate representations to machine code, all while requiring minimal code changes. We also employ half-precision floating-point numbers, which makes computations faster while reducing the model's memory footprint.
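A sketch of applying both optimizations, using a single convolution as a stand-in for the UNet. Gating fp16 on GPU availability is our addition, since CPU half-precision support is limited.

```python
import torch

model = torch.nn.Conv2d(16, 3, 3, padding=1).eval()  # stand-in for the UNet

# Half precision halves the memory footprint and speeds up inference on
# GPUs with fp16 support; we convert only when a GPU is available.
device = "cuda" if torch.cuda.is_available() else "cpu"
model = model.to(device)
if device == "cuda":
    model = model.half()

# torch.compile JIT-compiles the model into optimized kernels on the
# first call, with no other code changes (available in PyTorch >= 2.0).
if hasattr(torch, "compile"):
    model = torch.compile(model)

x = torch.randn(32, 16, 64, 64, device=device)
if device == "cuda":
    x = x.half()
with torch.no_grad():
    y = model(x)
```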

We also attempted explicit layer fusion, where a sequence of convolution, batch normalization, and ReLU layers is fused into a single layer. However, we observe that this optimization offers no benefit beyond model compilation, confirming our expectation that model compilation implicitly performs layer fusion.

Evaluation and Results

Evaluation Metrics

We use Peak Signal-to-Noise Ratio (PSNR) to evaluate the quality of the output images denoised by the model. The higher the PSNR, the closer our denoised image is to the ground truth 512 samples-per-pixel image. PSNR is defined by first calculating the Mean Squared Error (MSE) between the two images and then the PSNR metric as follows:

$$\mathrm{MSE} = \frac{1}{n} \sum_{i=1}^{n} \left( \mathrm{img1}_i - \mathrm{img2}_i \right)^2$$

$$\mathrm{PSNR} = \begin{cases} \infty & \text{if } \mathrm{MSE} = 0 \\ 20 \log_{10}\!\left( \dfrac{255}{\sqrt{\mathrm{MSE}}} \right) & \text{otherwise} \end{cases}$$

We measure the average latency of batch inference in milliseconds (ms) to analyze the model's performance. The latency is further used to estimate the frame rate that can be supported by our model using the formula below:

$$\mathrm{Frame\ Rate} = \frac{1000}{\mathrm{latency\_in\_ms}} \times \mathrm{Batch\_Size}$$

In addition to the above metrics, we also measure the size of the model in bytes.
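The two metrics can be computed directly from the formulas above. `psnr` and `frame_rate` are hypothetical helpers; `psnr` operates on flat 8-bit pixel sequences for simplicity.

```python
import math

def psnr(img1, img2, max_val=255.0):
    """PSNR between two images given as same-length flat pixel sequences."""
    n = len(img1)
    mse = sum((a - b) ** 2 for a, b in zip(img1, img2)) / n
    if mse == 0:
        return math.inf  # identical images: infinite PSNR by convention
    return 20 * math.log10(max_val / math.sqrt(mse))

def frame_rate(latency_ms, batch_size):
    """Frames denoised per second, from average batch inference latency."""
    return (1000.0 / latency_ms) * batch_size
```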

Loss function and training schedule study

In the first phase, the L1 loss is weighted by the parameter $\alpha$ and the HFEN loss by the corresponding $1 - \alpha$. We swept $\alpha$ in increments of 0.25 and observed that $\alpha = 1$ yields the best result, while $\alpha = 0$ performs significantly worse than the other runs. This implies that the HFEN loss does not significantly improve the model's performance, whereas L1 plays a crucial role.

Next, we investigate the role of the second phase of training. The L2 loss function is sensitive to outliers due to the squaring of errors, making it suitable for applications where minimizing large errors is crucial [6]. However, due to this sensitivity, L2 loss should only be applied once the model is in a relatively stable region of the optimization landscape; hence the model is first trained with the L1-based loss.

As observed in the validation PSNR graph in figure \ref{fig:validation_psnr_epochs_study}, the L2 loss plays an important role in improving the model's overall quality.


Model size exploration

We study the impact of the number of encoder and decoder layers within the UNet model. We test the model with 1, 2, and 4 layers, recording PSNRs of 72.27, 72.29, and 72.37 and batch inference latencies of 0.0469 ms, 0.0727 ms, and 0.0824 ms respectively (for a batch size of 256 on an NVIDIA T4 GPU, https://www.nvidia.com/en-us/data-center/tesla-t4/). We observe only a slight increase in PSNR but a considerable increase in inference latency. Thus, we conclude that a 1-layer model yields satisfactory results while being computationally efficient.


Feature ablation study

In this study, we attempt to understand the impact of each input feature in an effort to identify the minimum viable subset of features that attains a high PSNR while limiting the model's size, thereby improving inference latency. The set of available features includes the original image (3 channels), roughness (3 channels), albedo (3 channels), normal (3 channels), relative normal (3 channels), and depth (1 channel). The original input image is a requirement, so we run experiments by removing each of the other features one at a time. Furthermore, we assess the impact of any normal-related features by removing both the normal and relative normal.

Our findings establish the following feature importance order: albedo >> roughness > normal > relative normal > depth. We also conclude that removing both forms of normal does negatively impact the model's performance. Based on these results, we select our final model to include only the albedo, roughness, and normal as input features. Interestingly, this model beats even the model that includes all input features, implying that feature selection plays an important part in the model's quality.

Latency and memory optimization study

Conclusion and Future Work


References

[1] Sreedharan, S., Radhakrishnan, G., Gupta, D., & Sudarshan, T. (2014). Analysis of robotic environment using low resolution image sequence. 2014 International Conference on Contemporary Computing and Informatics (IC3I), 495–499.

[2] Juřík, M., Šmídl, V., Kuthan, J., & Mach, F. (2019). Trade-off between resolution and frame rate for visual tracking of mini-robots on planar surfaces. 2019 International Conference on Manipulation, Automation and Robotics at Small Scales (MARSS), 1–6.

[3] Lumberyard, A. (2017, July). Amazon Lumberyard Bistro. Open Research Content Archive (ORCA). Retrieved from http://developer.nvidia.com/orca/amazon-lumberyard-bistro

[4] Ronneberger, O., Fischer, P., & Brox, T. (2015). U-Net: Convolutional networks for biomedical image segmentation. In Medical Image Computing and Computer-Assisted Intervention – MICCAI 2015 (pp. 234–241). Springer.

[5] Chaitanya, C. R. A., Kaplanyan, A. S., Schied, C., Salvi, M., Lefohn, A., Nowrouzezahrai, D., & Aila, T. (2017). Interactive reconstruction of Monte Carlo image sequences using a recurrent denoising autoencoder. ACM Transactions on Graphics, 36(4).

[6] Zhao, H., Gallo, O., Frosio, I., & Kautz, J. (2016). Loss functions for image restoration with neural networks. IEEE Transactions on Computational Imaging, 3(1), 47–57.

[7] Ravishankar, S., & Bresler, Y. (2011). MR image reconstruction from highly undersampled k-space data by dictionary learning. IEEE Transactions on Medical Imaging, 30(5), 1028–1041.