Training the linear, local, learned (L3) algorithm for imaging in low light conditions

From Psych 221 Image Systems Engineering

Introduction

Imaging in low light (LL) conditions has long been challenging because of the lack of photons and the resulting low signal-to-noise ratio (SNR). Despite extensive research, it largely remains an open problem; only in recent years has concrete progress been made on extreme LL imaging, mostly with convolutional neural networks [6-10]. In this project, we propose a different algorithm based on classifying local image patches and learning a linear transformation for each class (hence the name "linear, local, learned") to address extreme LL imaging. We achieve significant PSNR improvement even at illuminance levels around 0.1 cd/m2, at small computational cost. Additionally, we use the open source Image Systems Engineering Toolbox (ISET) to construct a scene directly from high light (HL) images and then reprocess it through a realistic imaging pipeline to obtain the corresponding LL images. This avoids the problems associated with most synthetic datasets and provides us with reliable training data.

Related Work

Classical denoising algorithms

Before neural networks were introduced to denoising, many algorithms had been proposed. Parametric methods include total variation denoising [1], wavelet denoising [2], and anisotropic diffusion [3], while nonparametric methods include BM3D [4] and nonlocal means (NLM) [5]. Although these classical algorithms are relatively simple, they are effective and can outperform modern neural network methods under certain conditions [11].

Denoising based on neural networks

Applying neural networks to LL imaging has been a popular topic in the past several years. Stacked sparse denoising auto-encoders (SSDA) [6], trainable nonlinear reaction diffusion (TNRD) [7], and CNN-based denoisers [8,9] are comparatively successful attempts. However, these algorithms share some disadvantages [10]. First, most neural network methods are time-consuming and require powerful hardware. Second, most of them are trained on synthetic data, for example images with added Gaussian noise. Even the widely used RENOIR dataset contains HL/LL image pairs with spatial misalignment, which is detrimental to network training.

Low light enhancement methods

Another approach to LL imaging is low light contrast enhancement. The general idea is to compress the brighter parts of the image and boost the dimmer parts to achieve a better balance [12,13]. The problem with these works is that they do not handle noise explicitly and assume only a moderately low illuminant level, so they break down under extreme LL conditions.

Method

Pipeline

The high level idea of our approach is illustrated below in Fig 1. Our training data contain two parts. The first part is the sensor data captured in the dark environment. The label (ground truth) is the target RGB image of the same scene captured at a brighter illuminant level. Because the overall mapping is globally non-linear, we crop the sensor data and the corresponding image pixels into small patches so that the mapping becomes approximately linear within each patch. The underlying assumption is that, for a single pixel, most of the useful information is contained in its neighboring pixels; this is the intuition behind making the problem local by cropping. After cropping, we classify the sensor patches based on their average signal level and group patches with similar signal levels together.

Fig 1. Processing pipelines of the L3 approach


For each class, we map each patch in the class to a single RGB pixel value located at the center position of the patch. The target RGB pixel value is taken from the target RGB image captured under the brighter illuminant. We use one linear transform kernel per class to map patches into RGB pixels. Specifically, if the patch size is 5x5 and the output pixel value is 1x3 (one value per RGB channel), the kernel size is 5x5x3. We use Ridge Regression to minimize the error between the value predicted with the kernels and the ground truth target pixel value.

After training the model, given a new set of sensor data, we render the image with a process similar to the one used during training. The sensor data are first cropped into small patches, and the patches are classified according to the same classification strategy. With the learned kernels, we map each patch from the new image to a pixel value in RGB space. After reconstructing all the pixels, we obtain the rendered image.
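To make the rendering step concrete, here is a minimal sketch in Python (not our released code; the function names, the dictionary layout of the kernels, and the 2x2-tile assumption for center pixel types are illustrative):

```python
import numpy as np

def render(sensor, kernels, classify, patch_size=5):
    """Render an RGB image from raw sensor data using trained L3 kernels.

    sensor   : 2-D array of raw sensor responses
    kernels  : dict mapping (center_pixel_type, signal_class) -> (25, 3) matrix
    classify : function mapping a patch's mean signal level to a class index
    """
    half = patch_size // 2
    h, w = sensor.shape
    rgb = np.zeros((h, w, 3))
    for r in range(half, h - half):
        for c in range(half, w - half):
            patch = sensor[r - half:r + half + 1, c - half:c + half + 1]
            center_type = (r % 2, c % 2)           # position within the 2x2 Bayer tile
            sig_class = classify(patch.mean())     # same rule as used during training
            W = kernels[(center_type, sig_class)]  # linear kernel for this class
            rgb[r, c] = patch.ravel() @ W          # one linear transform per pixel
    return rgb
```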

Sensor Pattern

Here we use the widely used Bayer pattern to generate the sensor data. The typical pattern is illustrated below in Fig 2, where the color of each block indicates which color filter is applied to that pixel. For example, the blue blocks represent pixels covered by blue filters, which only allow blue light to pass through.

Fig 2. Structure of Bayer pattern
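As a concrete illustration, the channel layout of a Bayer sensor can be generated from its repeating 2x2 tile. The sketch below assumes an RGGB tile; the actual layout is whatever the ISET sensor model specifies.

```python
import numpy as np

def bayer_mask(h, w):
    """Return an h x w array of channel indices (0=R, 1=G, 2=B) for an RGGB Bayer tile."""
    mask = np.empty((h, w), dtype=int)
    mask[0::2, 0::2] = 0  # red
    mask[0::2, 1::2] = 1  # green
    mask[1::2, 0::2] = 1  # green
    mask[1::2, 1::2] = 2  # blue
    return mask
```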


Sensor Filter

In this project, we use standard XYZ color filters, which cast the image into the XYZ color space, a standard color representation for display. The image is then transformed back to RGB space with a standard linear transform. The transmissivity of the filters is shown in Fig 3.

Fig 3. Transmissivity of the XYZ filter.
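The XYZ-to-RGB step is a single 3x3 matrix multiply. As an example, the standard matrix for converting XYZ to linear sRGB (D65 white point) can be applied as follows; our pipeline uses the display transform provided by ISET, so this is only illustrative:

```python
import numpy as np

# Standard XYZ -> linear sRGB matrix (D65 white point)
XYZ_TO_SRGB = np.array([
    [ 3.2406, -1.5372, -0.4986],
    [-0.9689,  1.8758,  0.0415],
    [ 0.0557, -0.2040,  1.0570],
])

def xyz_to_linear_rgb(xyz_image):
    """Convert an (H, W, 3) XYZ image to linear sRGB with a single matrix multiply."""
    return np.clip(xyz_image @ XYZ_TO_SRGB.T, 0.0, 1.0)
```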


Patch Classification

We set the patch size to 5x5. In this situation, there are four possible center pixel types (R, B, and the two G positions in the Bayer tile), and we first classify patches by center pixel type. Also, since the illuminant distribution is non-uniform within a picture, the sensor response varies across the image, so the next level of classification is based on the strength of the sensor signal. We propose two ways of doing this classification, as shown in Fig 4. The most direct signal is the sensor voltage, so the first approach sets cut points at different voltage levels; after several attempts, we chose to space the cut points logarithmically. The second approach classifies based on the number of electrons, because under LL conditions the number of electrons excited by the incoming photons is very small compared with everyday conditions.

Fig 4. Two ways of patch classification.
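A minimal sketch of the signal-level classification, assuming log-spaced cut points derived from the training patches; the conversion gain value and the function names are illustrative:

```python
import numpy as np

def make_classifier(train_patches, n_classes=12, use_electrons=False, conversion_gain=1e-4):
    """Build a signal-level classifier from log-spaced cut points.

    train_patches  : (N, 5, 5) array of training patches (sensor voltages)
    use_electrons  : if True, classify on mean electron count instead of mean voltage
    conversion_gain: assumed volts per electron for the sensor (illustrative value)
    """
    means = train_patches.reshape(len(train_patches), -1).mean(axis=1)
    if use_electrons:
        means = means / conversion_gain
    # Cut points spaced uniformly in log scale between the smallest and largest mean signal
    cuts = np.logspace(np.log10(means.min() + 1e-12), np.log10(means.max()), n_classes + 1)

    def classify(mean_signal):
        if use_electrons:
            mean_signal = mean_signal / conversion_gain
        return int(np.clip(np.searchsorted(cuts, mean_signal) - 1, 0, n_classes - 1))

    return classify, cuts
```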


Ridge Regression

The Ridge Regression problem in this setting can be expressed as

$$ W = \arg\min_W \|XW - Y\|_2^2 + \lambda \|W\|_2^2, $$

where $W$ is the kernel to be learned, $Y$ is the matrix of target pixel values, $X$ is the matrix of stacked patches in the same class, and $\lambda$ is the regularization weight. The closed-form solution to this problem is

$$ W = (X^\top X + \lambda I)^{-1} X^\top Y. $$

To avoid computing a matrix inverse explicitly, which loses accuracy especially when the matrix is close to singular, we use the singular value decomposition (SVD) of $X$, that is, $X = U \Sigma V^\top$. The kernels can then be calculated with the expression

$$ W = V (\Sigma^2 + \lambda I)^{-1} \Sigma\, U^\top Y. $$
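The SVD-based solve amounts to a few lines per class; a minimal sketch (names illustrative):

```python
import numpy as np

def fit_ridge_kernel(X, Y, lam=1e-3):
    """Solve W = argmin ||XW - Y||^2 + lam*||W||^2 via the SVD of X.

    X   : (N, 25) stacked 5x5 patches for one class
    Y   : (N, 3) target RGB pixel values for the same patches
    lam : ridge regularization weight
    Returns W of shape (25, 3), one column per output channel.
    """
    U, s, Vt = np.linalg.svd(X, full_matrices=False)   # X = U diag(s) Vt
    d = s / (s**2 + lam)                               # filtered inverse singular values
    return Vt.T @ (d[:, None] * (U.T @ Y))             # W = V (S^2 + lam I)^-1 S U^T Y
```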

Generating the Dataset

The process used to generate the dataset is illustrated in Fig 5. We collect images from the public COCO dataset. Using the open source Image Systems Engineering Toolbox (ISET), we convert each image into a scene, in which we know the spectral power distribution at each point, the direction of the light, and so on. Here we can also select the illuminant level of the scene. In the final step, we use the sensor module in ISET to compute the sensor data. For the target data, we set the sensor in ISET to be noise-free and use a brighter illuminant level for the scene.

The sensor data look greenish because the Bayer pattern contains twice as many green filters as red or blue ones.

We use about 30k patches and the corresponding pixel values as training samples.

Fig 5. Process of dataset generation.


Experiment

Summary of Experiment Parameters

The parameters of our experiment are summarized in Table 1. For the illuminant, we select levels ranging from 0.05 cd/m2 to 0.4 cd/m2, which are very low illumination levels.

Training results

We start with an analysis of our trained kernels. Taking the kernels in the first class as an example, the kernels obtained with the two classification approaches are depicted in Fig 6.

Fig 6. Kernel and linearity evaluation for (a) sensor volts based class and (b) electron number based class.


The plots with blue dots show the linearity between the predicted pixel value (y axis) and the target pixel value (x axis). As can be seen, the electron-number-based classes give a better estimate of the pixel value. The reason is that, for the electron-based approach, the first class contains patches with an average electron count smaller than 10. However, when we convert the voltages in the first class of the voltage-based approach into electron counts using the conversion gain set in the sensor, we find that this class contains patches with an average electron count smaller than 1. That means the first class in the voltage-based approach contains essentially only noise. This indicates that we should use the number of electrons, rather than the sensor voltage, when determining the cut points for the classes.

The learned kernels also support this claim, as the kernels from the electron-based approach make more sense. For example, the left kernel in Fig 6(b) shows the mapping for a patch whose center sensor has a blue filter, mapped into the blue channel. In this kernel, the center pixel value carries most of the weight, and the surrounding sensors with blue filters are weighted more heavily than those with red or green filters. In contrast, the kernels from the voltage-based classes appear mostly random and cannot easily be interpreted.

We further explored different numbers of classes, as listed in the third and fourth rows of Table 1. The kernels and the linearity check are shown in Fig 8.

Fig 8. Comparison of the linearity and kernels for (a) 12-class and (b) 4-class classification.


L3 rendered image

Based on the previous discussion, we rendered images with our trained model. The results are shown in Fig 7.

As can be seen, the L3 rendered image is less noisy than the image rendered with the conventional image processing pipeline. For instance, the price label can hardly be read in the conventionally processed image, whereas it is legible in the L3 rendered image. These results serve as evidence of the performance of our method.

Fig 7. Comparison of the L3 rendered image with the ground truth and the image rendered with the conventional image processing pipeline.


PSNR and sCIELAB Evaluation

In the final step, we conducted PSNR and sCIELAB analyses to evaluate how our results differ from the ground truth image. Fig 9 shows the L3 rendered image and its sCIELAB ΔE map relative to the ground truth. The mean sCIELAB ΔE and PSNR for the different classification methods are summarized in Table 2. In the sCIELAB ΔE map, the error mostly comes from the bright regions, which suggests that future work should focus on reducing the error in brighter regions.

Fig 9. sCIELAB distribution of L3 rendered images.
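For reference, PSNR between the rendered image and the ground truth is computed from the mean squared error; a minimal sketch is below (sCIELAB requires the full spatial-CIELAB pipeline and is omitted here):

```python
import numpy as np

def psnr(rendered, groundtruth, max_val=1.0):
    """Peak signal-to-noise ratio in dB between two images scaled to [0, max_val]."""
    mse = np.mean((rendered.astype(float) - groundtruth.astype(float)) ** 2)
    return 10.0 * np.log10(max_val ** 2 / mse)
```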



Conclusion and Future Work

In conclusion, we demonstrated a new method for extreme LL imaging, L3, which is fast, simple, and effective. More importantly, unlike most neural network algorithms, L3 is based only on linear transforms within each class and contains no non-linear elements, which makes its processing much easier to understand. We compared different classification methods (electron based vs. voltage based) and different cut point strategies (large vs. small steps) and arrived at a preliminarily optimized result. Notably, our final result achieves nearly the best denoising in dark regions while remaining weaker in brighter ones, which is somewhat counterintuitive and calls for future research.

Because of the limited scope of this project, we constrained our analysis to Bayer sensors and did not explore other filter types. Also, we used the same cut points for all RGB channels, which is not necessarily optimal. In fact, it is already evident in Fig 8 that the similarity of kernels across different intensity classes depends on the color channel. For example, in Fig 8(a) there is a large discrepancy between the three kernels in the first column, while the three kernels in the last column are much more similar to each other. Therefore, it might be acceptable to merge the classes in the last column into one, but merging the three smaller classes in the first column into one large class would cause unwanted degradation.

References

[1] L. I. Rudin, S. Osher, and E. Fatemi. "Nonlinear total variation based noise removal algorithms". Physica D, 1992.

[2] E. P. Simoncelli and E. H. Adelson. "Noise removal via Bayesian wavelet coring". ICIP, 1996.

[3] P. Perona and J. Malik. "Scale-space and edge detection using anisotropic diffusion". TPAMI, 1990.

[4] K. Dabov, A. Foi, V. Katkovnik, and K. Egiazarian. "Image denoising by sparse 3D transform-domain collaborative filtering". TIP, 2007.

[5] A. Buades, B. Coll, and J. M. Morel. "A non-local algorithm for image denoising". CVPR, 2005.

[6] F. Agostinelli, M. R. Anderson, and H. Lee. "Adaptive multi-column deep neural networks with application to robust image denoising". NIPS, 2013.

[7] Y. Chen and T. Pock. "Trainable nonlinear reaction diffusion: A flexible framework for fast and effective image restoration". IEEE Transactions on Pattern Analysis and Machine Intelligence, 39(6), 2017.

[8] V. Jain and H. S. Seung. "Natural image denoising with convolutional networks". NIPS, 2008.

[9] K. Zhang, W. Zuo, Y. Chen, D. Meng, and L. Zhang. "Beyond a Gaussian denoiser: Residual learning of deep CNN for image denoising". IEEE Transactions on Image Processing, 26(7), 2017.

[10] C. Chen, Q. Chen, J. Xu, and V. Koltun. "Learning to see in the dark". CVPR, 2018.

[11] T. Plotz and S. Roth. "Benchmarking denoising algorithms with real photographs". CVPR, 2017.

[12] X. Dong, G. Wang, Y. Pang, W. Li, J. Wen, W. Meng, and Y. Lu. "Fast efficient algorithm for enhancement of low lighting video". IEEE International Conference on Multimedia and Expo, 2011.

[13] X. Guo, Y. Li, and H. Ling. "LIME: Low-light image enhancement via illumination map estimation". IEEE Transactions on Image Processing, 26(2), 2017.

Appendix

Source code link for L3: [16]