Image Upsampling using L3

Introduction

Image resolution enhancement is an important problem with many practical applications. It enables efficient image compression for storage and transfer over a network. It also offers the possibility of enhancing an image captured using a lower resolution camera before displaying it on a bigger or higher resolution screen. With the ubiquitous cellphone cameras and high resolution displays nowadays, having a fast and accurate technique to upsample a low resolution image has become quite important.

L3 Method

L3 stands for Linear, Local and Learned. In a real world image there is usually a strong correlation between the neighboring pixels. L3 uses this correlation to generate a higher resolution image from the lower resolution image pixels. It uses machine learning to efficiently learn this dependence present in the data. L3 consists of two steps: Rendering and Learning. The rendering step adaptively selects from a stored table of linear transforms to convert the low resolution pixel data into higher resolution image pixels. The learning step learns and stores these linear transforms used in the rendering step. The algorithm is illustrated in the figure shown below.

Rendering

In the rendering step, a $N * N$ patch, $n (x, y, p)$ is selected, centered around the pixel $(x, y)$ in the low resolution sensor data. Then we classify the pixel into one of the predefined classes, $c$ , based on the mean intensity level, pixel color and contrast. Finally, we apply the appropriate linear transform, $T (c, r)$ , for the class $c$ and output channel $r$ , to get the rendered output $o (x, y, r)$ .

$o (x, y, r) = T (c, r) n (x, y, p)$

The computation is repeated independently for each pixel in the low resolution sensor image. The figure below shows the output of the rendering step for a Bayer pattern sensor array input. In the figure below, the big RGGB (red, green, blue) pixels in the background correspond to the low resolution sensor data. For each of these four input pixels, there are four output pixels (R, G, G, B), shown by the smaller squares in the foreground.

Learning

We use the Image Systems Engineering Toolbox (ISET) to produce the training data for our machine learning model. Using ISET, we generate the low and high resolution sensor data and images for the scenes in the training dataset. The purpose of the training step is to learn the transforms for the various classes and the output channels. First, we classify all the input pixels into their respective classes and then compute the transforms for each class independently to minimize a predefined loss function (error) $L$ , between the target image and the transformed sensor data.

$m i n i m i z e_{{T_{i}}} \sum_{j \in C_{i}}^{} L (y_{j}, x_{j} T_{i})$

Here, $x_{j}$ is a $N^{2}$ dimensional row vector containing the $j^{t h}$ patch data from the sensor belonging to the class $C_{i}$ and $y_{j}$ is a row vector containing the corresponding output values in the target upsampled image. Let $X_{i}$ and $Y_{i}$ be the matrices obtained by stacking the corresponding $x_{j}$ and $y_{j}$ data. Now, we define the loss function $L$ using a regularized RMSE (root mean square error) as follows:

$L_{i} = {‖ Y_{i} - X_{i} T_{i} ‖}^{2} + λ {‖ T_{i} ‖}^{2}$ ,

where $λ$ is a parameter used to regularize the kernel coefficients and avoid noise magnification. Now, the above error can be minimized using the following closed form expression for $T_{i}$

$T_{i} = (X_{i}^{T} X_{i} + λ I)^{- 1} X_{i}^{T} Y_{i}$ .

Results

We use the following two approaches based on the L3 technique as described above to enhance the spatial resolution of an input image.

Approach 1

Here, the rendering step takes the low resolution sensor data to produce sensor data corresponding a higher spatial resolution. Then, we use the standard ISET image processing functionalities to convert the upsampled sensor data to the target color space. For illustration, we quadruple the total number of pixels i.e., the upsampled image has twice the number of rows and columns. We use a $5 * 5$ sized patch, $10 * 4$ classes corresponding to 10 linearly spaced illuminant levels and 4 color filters in the Bayer array. The algorithm is trained on 5 scenes containing different faces. The figure below shows the upsampled image produced by the trained algorithm on a test scene (not present in the training set). The left panel shows the low resolution image, the middle panel is the rendered upsampled image and the right panel is the target higher resolution image.

The CIELAB error $Δ E$ between the target and rendered upsampled image is plotted below. As expected, the error is larger in the regions containing higher spatial variations (e.g. near the flowers on the right). For this particular test image $Δ E_{a v} = 2.36$ . Overall, the rendered image is an improvement over the lower resolution image.

Approach 2

Here, the rendering step takes the low resolution sensor data to directly produce the pixel values in the target color space for the higher resolution image. Unlike the previous approach, image processing step is included in the rendering step. Here, we use the existing L3 library and add a couple more classes to include the upsampling feature. As, the previous approach, the number of rows and columns in the upsampled image is twice the dimensions of the input image. We use a $5 * 5$ sized patch, $20 * 2 * 4$ classes corresponding to 20 illuminant levels, 2 contrast levels and 4 color filters in the Bayer array. The algorithm is trained on 5 scenes at 3 exposure times (15 input images). The figure below shows the upsampled image produced by the trained algorithm on a test scene (not present in the training set). The left panel shows the low resolution image, the middle panel is the rendered upsampled image and the right panel is the target higher resolution image.

The CIELAB error $Δ E$ between the target and rendered upsampled image is plotted below. Again, there is considerable error in the regions containing higher spatial variations (e.g. near the flower). For this particular test image $Δ E_{a v} = 0.44$ and $Δ E_{m a x} = 4.78$ . This approach performs much better than the previous approach, which is expected as there are more classes here and also as it includes the image processing step.

Image Upsampling using L3

Contents

Introduction

L3 Method

Results

Conclusions

Appendix

Navigation menu

Image Upsampling using L3

Introduction

L3 Method

Results

Conclusions

Appendix

Navigation menu

Search