Accelerating Denoising at the Speed of Light

From Psych 221 Image Systems Engineering
Samidhm (talk | contribs)
Revision as of 07:15, 13 December 2024

Introduction

In computer graphics, real-time ray tracing has become widely adopted for generating high-quality visuals in applications like gaming and interactive simulations. A significant challenge in ray tracing is that using a low number of samples per pixel often results in noisy images, limiting their practical use. Achieving high-quality images typically requires ray tracing with a large number of samples per pixel, which demands substantial computational power and makes real-time generation difficult. Consequently, there is a growing need for effective noise reduction techniques for images rendered with fewer samples per pixel. Efficient denoising can produce high-quality images that preserve scene realism while optimizing computational resources.

Background and Problem Setup

Background

While applications such as gaming typically render high-resolution images (e.g., 1080p, 4K), recent advancements in fields like robotics have created a demand for extremely fast, real-time rendering of low-resolution images [1], [2]. This project specifically addresses this challenge, focusing on developing high-quality and efficient denoising techniques for low-resolution ray-traced images.

Problem Definition

Given a 64x64 image rendered with one sample per pixel, along with other features that can be obtained using similar computational resources, we propose a denoising framework capable of producing a 64x64 output image that closely matches the quality of a ground-truth image rendered with 512 samples per pixel. Our framework is evaluated primarily based on two key criteria:

Quality

The generated image should closely replicate the realism and quality of the ground-truth image. Quality is assessed using Peak Signal-to-Noise Ratio (PSNR).
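As a concrete reference, PSNR is derived from the mean squared error between the denoised output and the 512-spp ground truth. A minimal sketch, assuming images are stored as arrays normalized to [0, 1] (the `max_val` default is an assumption, not stated in the write-up):

```python
import numpy as np

def psnr(denoised, reference, max_val=1.0):
    # Peak signal-to-noise ratio between the denoised output and the
    # 512-spp reference image, in decibels; higher is better.
    mse = np.mean((denoised.astype(np.float64) - reference.astype(np.float64)) ** 2)
    if mse == 0:
        return float("inf")  # identical images
    return 10.0 * np.log10(max_val ** 2 / mse)
```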

Performance

The system should be computationally efficient. Performance is evaluated by the number of frames it can denoise per second, serving as a secondary metric.

Approach

Dataset Generation

To meet the training and evaluation requirements of our framework, we have developed a comprehensive dataset featuring both low- and high-quality ray-traced images. We selected a scene rich in detail, including various objects, textures, lighting, and colors, using the exterior of the Amazon Lumberyard Bistro scene [3] as our setting. The dataset consists of 12,600 pairs of 64x64 images, where each pair includes a low-quality image rendered with one sample per pixel and a corresponding high-quality image rendered with 512 samples per pixel.

To generate these images, we strategically chose 35 key points within the scene to ensure broad coverage. At each point, we rendered images at 15 different heights and 24 distinct 3D angles. Additionally, we included auxiliary features such as depth maps, normal maps, albedo, direct gloss, and indirect gloss reflecting light within the scene, all of which contribute to more accurate denoising.

To perform this rendering, we use Blender's Cycles path tracer. Using Blender's scripting interface, we automatically iterate through the different configurations, place the camera at each position, and render the required data. Blender's compositing nodes are used to enable automatic saving of these features.
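The configuration sweep above can be sketched as a simple Cartesian product. The sketch below uses placeholder indices: the actual key-point coordinates, heights, and angles are scene-specific and not listed in the write-up.

```python
from itertools import product

# Placeholder grids (assumptions): the real values are 3D positions and
# camera orientations chosen inside the Bistro scene.
KEY_POINTS = list(range(35))  # 35 hand-picked positions in the scene
HEIGHTS = list(range(15))     # 15 camera heights per point
ANGLES = list(range(24))      # 24 distinct 3D viewing angles

def camera_configs():
    # One (point, height, angle) tuple per rendered image pair; inside
    # Blender, each tuple would position the camera and trigger both a
    # 1-spp and a 512-spp render plus the auxiliary feature passes.
    return list(product(KEY_POINTS, HEIGHTS, ANGLES))
```

The product of 35 × 15 × 24 configurations accounts for the 12,600 image pairs in the dataset.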

For training, validation, and testing purposes, the dataset is divided into a 60%-20%-20% ratio. However, instead of randomly splitting the examples, we carefully handpick the points for each subset to ensure a more balanced and representative distribution. The selection process is designed to maintain significant scene diversity across all splits, which is crucial for effective model evaluation.

This approach is necessary because images rendered from the same point in the scene tend to be highly similar. If the same points were included in both the training and test sets, it could lead to an unrealistic assessment of the model’s performance, as the model might memorize specific points rather than learning to generalize. By ensuring that the points in each subset are distinct, we can better evaluate the model’s ability to generalize to new, unseen viewpoints, thus providing a more reliable measure of its denoising capabilities.
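A point-level split along these lines can be sketched as follows. The write-up hand-picks points per subset; a seeded shuffle stands in for that manual selection here, and the seed and 60/20/20 rounding behavior are assumptions.

```python
import random

def split_points(points, seed=0):
    # Split scene points (not individual images) into train/val/test so that
    # all images rendered from one point land in exactly one subset.
    pts = list(points)
    random.Random(seed).shuffle(pts)  # stand-in for manual hand-picking
    n_train = round(0.6 * len(pts))
    n_val = round(0.2 * len(pts))
    return pts[:n_train], pts[n_train:n_train + n_val], pts[n_train + n_val:]
```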

Model Architecture

We chose a simple U-Net architecture [4] that consists of n encoder layers, n decoder layers, and one bottleneck layer, as shown in the U-Net architecture figure. Each encoder consists of two stacked sets of a 2D convolutional layer, batch normalization, and a ReLU activation. The bottleneck layer is identical to an encoder layer, though other layers such as an RNN or LSTM could serve as the bottleneck depending on application requirements. A decoder layer consists of two parts: a 2D transposed convolutional layer and a convolutional layer. The output of the transposed convolutional layer is concatenated with the corresponding encoder output (skip connection) and passed into the convolutional layer.
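A minimal PyTorch sketch of this layout is below. The channel widths, depth n = 2, and kernel sizes are assumptions; the write-up does not specify them.

```python
import torch
import torch.nn as nn

class ConvBlock(nn.Module):
    # Two stacked sets of Conv2d -> BatchNorm -> ReLU, as in each encoder.
    def __init__(self, in_ch, out_ch):
        super().__init__()
        self.net = nn.Sequential(
            nn.Conv2d(in_ch, out_ch, 3, padding=1), nn.BatchNorm2d(out_ch), nn.ReLU(),
            nn.Conv2d(out_ch, out_ch, 3, padding=1), nn.BatchNorm2d(out_ch), nn.ReLU(),
        )
    def forward(self, x):
        return self.net(x)

class UNet(nn.Module):
    def __init__(self, in_ch=3, out_ch=3, base=32, n=2):
        super().__init__()
        chs = [base * 2 ** i for i in range(n)]
        self.encoders = nn.ModuleList()
        prev = in_ch
        for c in chs:
            self.encoders.append(ConvBlock(prev, c))
            prev = c
        self.pool = nn.MaxPool2d(2)
        self.bottleneck = ConvBlock(prev, prev * 2)  # same structure as an encoder
        self.upconvs, self.decoders = nn.ModuleList(), nn.ModuleList()
        d = prev * 2
        for c in reversed(chs):
            self.upconvs.append(nn.ConvTranspose2d(d, c, 2, stride=2))
            self.decoders.append(ConvBlock(2 * c, c))  # 2*c after skip concat
            d = c
        self.head = nn.Conv2d(d, out_ch, 1)

    def forward(self, x):
        skips = []
        for enc in self.encoders:
            x = enc(x)
            skips.append(x)       # saved for the skip connection
            x = self.pool(x)
        x = self.bottleneck(x)
        for up, dec, skip in zip(self.upconvs, self.decoders, reversed(skips)):
            x = up(x)
            x = dec(torch.cat([x, skip], dim=1))
        return self.head(x)
```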


Model Training

Input Feature Design

Along with the 1 sample-per-pixel image provided as input, the model is given additional features that are useful for denoising. The features we include are the depth map, normal, relative normal, albedo, and material roughness. The depth map and normal are direct features extracted during rendering. The relative normal is computed by applying the view-space matrix to the true normal. The albedo is computed by adding the rendered glossy and diffuse colors. The roughness is calculated by inverting the sum of the direct and indirect gloss rendered by Blender. In Section 4.2, we perform ablations to evaluate the efficacy of each of the extracted features. Only the normal, albedo, and roughness meaningfully improve the model's performance.
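Assembling these features into one input tensor can be sketched as below. The array shapes, the exact Blender pass names, and the 1 − gloss form of the "inversion" are assumptions made for illustration.

```python
import numpy as np

def build_input(noisy_rgb, depth, normal, view_matrix,
                glossy_color, diffuse_color, direct_gloss, indirect_gloss):
    # noisy_rgb/normal/glossy_color/diffuse_color: (H, W, 3) arrays;
    # depth/direct_gloss/indirect_gloss: (H, W); view_matrix: (4, 4).
    h, w, _ = noisy_rgb.shape
    # Relative normal: true normal rotated into view space.
    rel = (normal.reshape(-1, 3) @ view_matrix[:3, :3].T).reshape(h, w, 3)
    albedo = glossy_color + diffuse_color
    roughness = 1.0 - (direct_gloss + indirect_gloss)  # assumed inversion
    # Stack everything channel-wise: 3 + 1 + 3 + 3 + 3 + 1 = 14 channels.
    return np.concatenate(
        [noisy_rgb, depth[..., None], normal, rel, albedo, roughness[..., None]],
        axis=-1,
    )
```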

Loss Function

The training process consists of two phases, each employing a different loss function. Initially, a linear combination of L1 and HFEN loss [5] is used, followed by L2 loss in the second phase. This training schedule is based on the approach described in [6].

Phase 1: L1 and HFEN Loss

In the first phase, the loss function is defined as:

$$\mathrm{HFEN} = \frac{\sum_{i=1}^{n} \lVert \mathrm{LoG}(\mathrm{output}_i) - \mathrm{LoG}(\mathrm{target}_i) \rVert_2^2}{\sum_{i=1}^{n} \lVert \mathrm{LoG}(\mathrm{target}_i) \rVert_2^2}$$

$$L = \alpha \cdot \frac{1}{n} \sum_{i=1}^{n} \lvert y_i - \hat{y}_i \rvert + (1 - \alpha) \cdot \mathrm{HFEN}$$

Where:

LoG is a convolution operation with a scaled integer approximation of the Laplacian-of-Gaussian kernel (σ = 1.4). L1 loss enables robust model training and is resistant to outliers [5]. HFEN loss [7] aims to reconstruct high-frequency details in the output image. However, as demonstrated in Section 4.2, HFEN does not significantly improve the model's performance.
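The phase-1 loss can be sketched as follows for single-channel images. This uses SciPy's Gaussian-Laplace filter rather than the write-up's scaled integer kernel approximation, and the α = 0.8 weighting is an assumption.

```python
import numpy as np
from scipy.ndimage import gaussian_laplace

def hfen(output, target, sigma=1.4):
    # Ratio of squared LoG-difference energy to squared LoG-target energy.
    num = np.sum((gaussian_laplace(output, sigma) - gaussian_laplace(target, sigma)) ** 2)
    den = np.sum(gaussian_laplace(target, sigma) ** 2)
    return num / den

def phase1_loss(output, target, alpha=0.8):
    # Linear combination of mean-absolute-error (L1) and HFEN.
    l1 = np.mean(np.abs(output - target))
    return alpha * l1 + (1 - alpha) * hfen(output, target)
```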

Phase 2: L2 Loss

In the second phase, the model is fine-tuned using L2 loss to improve edge sharpness and color accuracy in the denoised image. The loss function is defined as:

$$L = \frac{1}{n} \sum_{i=1}^{n} (y_i - \hat{y}_i)^2$$

This phase refines the output, producing images with improved visual fidelity.


Training Schedule

Performance Optimization

Results

Conclusions


References

[1] Sreedharan, S., Radhakrishnan, G., Gupta, D., & Sudarshan, T. (2014). Analysis of robotic environment using low resolution image sequence. 2014 International Conference on Contemporary Computing and Informatics (IC3I), 495–499.

[2] Juřík, M., Šmídl, V., Kuthan, J., & Mach, F. (2019). Trade-off between resolution and frame rate for visual tracking of mini-robots on planar surfaces. 2019 International Conference on Manipulation, Automation and Robotics at Small Scales (MARSS), 1–6.

[3] Lumberyard, A. (2017, July). Amazon Lumberyard Bistro. Open Research Content Archive (ORCA). Retrieved from http://developer.nvidia.com/orca/amazon-lumberyard-bistro

[4] Ronneberger, O., Fischer, P., & Brox, T. (2015). U-Net: Convolutional networks for biomedical image segmentation. Medical Image Computing and Computer-Assisted Intervention – MICCAI 2015, Lecture Notes in Computer Science, 9351, 234–241.

[5] Chaitanya, C. R. A., Kaplanyan, A. S., Schied, C., Salvi, M., Lefohn, A., Nowrouzezahrai, D., & Aila, T. (2017). Interactive reconstruction of Monte Carlo image sequences using a recurrent denoising autoencoder. ACM Transactions on Graphics, 36(4).

[6] Zhao, H., Gallo, O., Frosio, I., & Kautz, J. (2016). Loss functions for image restoration with neural networks. IEEE Transactions on Computational Imaging, 3(1), 47–57.

[7] Ravishankar, S., & Bresler, Y. (2011). MR image reconstruction from highly undersampled k-space data by dictionary learning. IEEE Transactions on Medical Imaging, 30(5), 1028–1041.