CNN Prediction

From Psych 221 Image Systems Engineering

Introduction

We compare different approaches to upsampling sensor voltage data without performing a full simulation: nearest neighbor interpolation, bilinear interpolation, and convolutional neural networks. These approaches are analyzed quantitatively through several metrics, including mean squared error, delta E 2000 values, and color histograms, as well as qualitatively from an aesthetics perspective. The strong performance of the convolutional neural network indicates that such an algorithm may be used to approximate the responses of various sensor topologies given prior sensor data.

Background

As the pixel counts of recent imaging sensors approach hundreds of millions, accurately simulating each sensor variation across a variety of scenes may become too computationally intensive. If we can approximate a sensor response from a prior sensor response, we may be able to iterate through the responses of different sensor variations, including pixel pitch, pixel shape, color filter arrays, and pixel arrangement, which could lead to novel sensor designs. In this project we are given prior sensor voltage data for a given scene, and we attempt to predict the sensor response as if the pixel width were halved - in effect, the resolution of an image generated from the sensor data would be doubled, or upsampled.

Image upsampling has long been studied, with many known algorithms, including nearest neighbor and bilinear interpolation (both of which are examined in this project), bicubic interpolation, sinc and Lanczos resampling, box sampling, mipmapping, and many others [1], with applications ranging from image restoration and analysis to the reduction of data in storage and transmission. However, we note that this project acts on sensor data as a prior rather than on the RGB values of an image, which avoids the information loss and artifacts introduced by demosaicking, illuminant correction, color space conversion, and data compression, and could therefore produce a higher quality image once the sensor data is processed. Nevertheless, we can still draw inspiration from existing image upsampling algorithms to produce higher resolution sensor data. Moreover, recent advancements in machine learning, specifically around convolutional neural networks, may provide state-of-the-art approximation of sensor responses, which we examine in this project. We also attempt to provide an objective means of analyzing the quality of the images generated from the sensor response using concepts learned through PSYCH 221.

Methods

We attempt to predict how a given sensor with a Bayer color filter array responds to a given scene as if the width of its pixels were halved, without performing a full simulation. Two linear transformations, inspired by common upsampling techniques, are used as naive baselines for predicting the upsampled sensor response. In addition, we apply a neural network architecture inspired by state-of-the-art designs that have demonstrated success at a variety of image processing tasks, including image upsampling. We then compare the quality of the predicted upsampled sensor voltage data to the target sensor voltage data using mean squared error, delta E 2000, color histograms, and edge detection.

Dataset and Software Packages

A subset of roughly 2700 images from the COCO 2017 Train set was collected, which included a variety of objects in various settings under different lighting scenarios, reflecting the wide range of scenes that may be captured by a camera in the real world. In addition, a few images of various facial profiles with a color checker were derived from the Image Systems Engineering Toolbox for Biology (ISETBIO) repository.

Keras, TensorFlow, and Scikit-learn were used to construct, train, visualize, and test the convolutional neural network. NumPy was used for data manipulation.

Generating Sensor Data

These collected images were then processed as scenes using ISETCAM to produce sensor responses, represented in volts, for both a 100 x 120 sensor and a 202 x 242 sensor, which were used as the input and the target respectively. The two extra pixels in each dimension arise because, when the pixel size is halved, ISETCAM produces half a Bayer tile at the edges of the sensor response, which we believe to be an artifact. This sensor data was split into two-thirds for training and one-third for testing, as sketched below.
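A minimal sketch of the split using scikit-learn (the array names are placeholders, not the project's actual variables):

  from sklearn.model_selection import train_test_split

  # low_res_voltages / high_res_voltages are assumed stacks of the 100 x 120
  # inputs and 202 x 242 targets produced by ISETCAM.
  x_train, x_test, y_train, y_test = train_test_split(
      low_res_voltages, high_res_voltages,
      test_size=1/3, random_state=0)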

Upsampling Algorithms

Nearest Neighbor Interpolation

A naive technique for predicting upsampled sensor voltage data, the nearest neighbor algorithm simply takes the prior sensor response for a 4x4 "tile" and copies the values to the right, down, and diagonally toward the bottom-right to double the size of the sensor response data. Nearest neighbor assumes that the additional pixels in the denser sensor would have captured the same information regardless. We use the predicted sensor values of the nearest neighbor algorithm as a performance baseline.

Visualization of the nearest neighbor algorithm. Priors (highlighted in bold) are replicated to increase resolution.
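To make the tile copying concrete, here is a minimal NumPy sketch (not the project's actual code) that replicates each CFA tile into a 2x2 block of identical tiles, doubling the resolution while preserving the mosaic pattern; the tile size t is an assumption (t=2 for a standard RGGB Bayer tile):

  import numpy as np

  def tile_nearest_neighbor(v, t=2):
      """Copy each t x t tile right, down, and diagonally to double resolution."""
      h, w = v.shape
      tiles = v.reshape(h // t, t, w // t, t)
      # Duplicate every tile row and tile column.
      tiles = np.repeat(np.repeat(tiles, 2, axis=0), 2, axis=2)
      return tiles.reshape(2 * h, 2 * w)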

Bilinear Interpolation

A slightly less naive technique, bilinear interpolation is inspired by the basic demosaicking algorithm of the same name. This algorithm averages the sensor voltage values of adjacent prior 4x4 tiles to produce new voltage values. A prior tile is considered adjacent to a predicted tile if it is immediately present in the vertical, horizontal, or diagonal direction at a distance of 1. Predicted tiles at the edge of the sensor data consist of the average of whichever prior tiles exist.

Visualization of the bilinear interpolation algorithm. Priors (highlighted in bold) are averaged to increase resolution.
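A minimal sketch of the tile-level averaging, under the same assumptions as the nearest neighbor sketch: prior tiles land at even tile coordinates, interstitial tiles average their adjacent priors, and edge tiles fall back to the nearest existing prior:

  import numpy as np

  def tile_bilinear(v, t=2):
      """Interpolate new t x t tiles as averages of adjacent prior tiles."""
      h, w = v.shape
      tiles = v.reshape(h // t, t, w // t, t).transpose(0, 2, 1, 3)  # (ny, nx, t, t)
      ny, nx = tiles.shape[:2]
      right = tiles[:, np.r_[1:nx, nx - 1]]        # right neighbor, clamped at the edge
      out = np.empty((2 * ny, 2 * nx, t, t), dtype=float)
      out[0::2, 0::2] = tiles                      # priors keep their values
      out[0::2, 1::2] = (tiles + right) / 2        # horizontal averages
      down = out[0::2][np.r_[1:ny, ny - 1]]        # next prior row, clamped at the edge
      out[1::2] = (out[0::2] + down) / 2           # vertical and diagonal averages
      return out.transpose(0, 2, 1, 3).reshape(2 * h, 2 * w)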

Convolutional Neural Networks

Inspired by state-of-the-art advances in image processing, we develop a convolutional neural network architecture for predicting upsampled sensor data given no further information. Popularized in 2012 by Krizhevsky et al. [2], convolutional neural networks have been successfully applied to many other image processing tasks, including semantic segmentation [3], object detection, and image upsampling [4]. In contrast to traditional fully connected neural networks, convolutional neural networks are especially suited to image processing tasks due to the lower number of weights to train: a naive single-layer fully connected network for this task would require hundreds of millions of weights (100 x 120 inputs by 202 x 242 outputs is roughly 587 million), while a single 3x3 kernel requires only 9. We must recognize the underlying assumption that the information required to predict the upsampled sensor data can be derived locally through the trained kernels.

A visualization of a convolutional neural network with a kernel size of 3x3 and 3 filters. The input on the left is multiplied by the kernel values in the middle column, which are slid across the input to produce the values on the right.
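As a toy illustration of this weight economy (the numbers and kernel here are illustrative, not from the project), a fully connected layer would need one weight per input-output pair, while a convolution slides a single small kernel across the whole input:

  import numpy as np
  from scipy.signal import convolve2d

  n_in, n_out = 100 * 120, 202 * 242
  print(f"fully connected weights: {n_in * n_out:,}")  # 586,608,000

  x = np.arange(16.0).reshape(4, 4)    # toy "sensor" patch
  k = np.ones((3, 3)) / 9.0            # one 3x3 kernel: just 9 weights
  y = convolve2d(x, k, mode="same")    # the kernel is iterated across the input
  print(y.shape)                       # (4, 4): one feature map per kernel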
Deconvolutional Neural Networks

Deconvolutional neural networks were originally introduced to extract low- and mid-level image representations beyond edge primitives - in effect, an input is mapped to increasingly sparse representations which can form the elements of an image [5]. We leverage this concept in our neural network design to naturally increase the size of an input using kernels, illustrated below:

Illustration of the behavior seen in a deconvolutional neural network layer. source
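A quick shape check (the stride and padding here are assumptions) shows how a stride-2 transposed convolution doubles each spatial dimension; note that the project's 202 x 242 target has two extra pixels per dimension that this simple setting does not reproduce:

  import tensorflow as tf

  x = tf.zeros((1, 100, 120, 1))       # batch of low-resolution sensor data
  up = tf.keras.layers.Conv2DTranspose(
      filters=32, kernel_size=8, strides=2, padding="same")(x)
  print(up.shape)                      # (1, 200, 240, 32)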
Architecture

Since the popularization of convolutional neural networks, a number of architectures have been published for image super-resolution (or upsampling) tasks [4][6]. While we could leverage more advanced neural network concepts such as network-in-network, skip connections, or simply a large number of layers, we prefer a simpler architecture so that we can focus on developing other parts of the project. The model parameters used are described below, with the novel introduction of a deconvolutional neural network layer (other approaches assume an input that has already been upsampled through another method, such as bilinear interpolation).

Neural Network Architecture

Layer type       Kernel size   Filters   Padding   Activation   Params
Deconvolution    8x8           32        -         ReLU         2080
2D convolution   3x3           16        same      ReLU         12816
2D convolution   5x5           1         same      linear       401

The 8x8 kernel size for the initial deconvolution layer was chosen because it captures information from 4 complete prior 4x4 Bayer tiles, similar to the bilinear interpolation algorithm, where values of the neighboring tiles are used to generate the new tiles. Notably, there are only about 15,300 trainable parameters, indicating that the neural network is relatively easy to train (other state-of-the-art neural networks can contain millions of trainable parameters).
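A minimal Keras sketch of the table above (the strides, padding, and input shape are assumptions; note also that the reported 12816 parameters for the middle layer would arise from a 5x5 kernel over 32 input channels, so the table's kernel sizes may be approximate):

  from tensorflow import keras
  from tensorflow.keras import layers

  model = keras.Sequential([
      layers.Input(shape=(100, 120, 1)),          # low-resolution voltages
      layers.Conv2DTranspose(32, 8, strides=2, padding="same",
                             activation="relu"),  # deconvolution layer
      layers.Conv2D(16, 3, padding="same", activation="relu"),
      layers.Conv2D(1, 5, padding="same", activation="linear"),
  ])
  model.summary()  # compare against the parameter counts in the table above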

Loss function

Since the input and output of the neural network contain sensor voltage values (as opposed to RGB values), mean squared error was used as the loss function between the target high resolution sensor voltage data generated by ISETCAM and the output generated by the neural network, which can be defined as:

\mathrm{MSE} = \frac{1}{N} \sum_{i=1}^{N} (v_i - \hat{v}_i)^2

where v_i is the target sensor voltage at pixel i, \hat{v}_i is the predicted voltage, and N is the number of pixels.

While numerous loss functions exist, mean squared error was chosen for its simplicity and ease of calculation. We must acknowledge, however, that mean squared error does not factor in how a human would compare two images, which involves sharpness, noise, dynamic range, color difference, illumination, etc., some of which are discussed in the results section.

Training Parameters

The neural network was trained with the following parameters (a training sketch follows the list):

  • Adam optimizer with a learning rate of 0.03
  • 40 epochs with early stopping (minimum delta of 0.001 with a patience of 1) and a batch size of 16
  • no validation set used
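A sketch of this training configuration (x_train and y_train are assumed arrays of low and high resolution voltage data; the early-stopping delta is interpreted as Keras's min_delta):

  from tensorflow import keras

  model.compile(optimizer=keras.optimizers.Adam(learning_rate=0.03),
                loss="mse")
  early_stop = keras.callbacks.EarlyStopping(monitor="loss",   # no validation set,
                                             min_delta=0.001,  # so monitor training loss
                                             patience=1)
  model.fit(x_train, y_train, epochs=40, batch_size=16,
            callbacks=[early_stop])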

Postprocessing

The upsampled sensor data was then transformed into PNG images using the ISETCAM function ipCompute with the following parameters:

  • bilinear demosaicing
  • conversion to XYZ representation
  • no illuminant correction

With these rendered images, we can further compare the images generated from the simulated high resolution sensor voltage data with those predicted by the algorithms.

Analysis

OpenCV's Canny edge detection implementation was used to perform edge detection, and OpenCV was also used to generate color histograms. ISETCAM's S-CIELAB implementation was used to calculate delta E 2000 values between the predicted and target generated images.
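A sketch of the edge-detection step (the file names and Canny hysteresis thresholds are assumptions; OpenCV's Canny expects 8-bit input):

  import cv2

  predicted = cv2.imread("predicted.png", cv2.IMREAD_GRAYSCALE)
  target = cv2.imread("target.png", cv2.IMREAD_GRAYSCALE)
  edges_predicted = cv2.Canny(predicted, 100, 200)  # lower/upper hysteresis thresholds
  edges_target = cv2.Canny(target, 100, 200)
  cv2.imwrite("edges_predicted.png", edges_predicted)
  cv2.imwrite("edges_target.png", edges_target)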

Results

Edges that have been detected on the predicted and reference image outputs.

The results of nearest neighbor and bilinear interpolation were rather unsurprising. Qualitatively, the nearest neighbor algorithm produced a large number of false color artifacts, introduced unnatural edges, and resulted in "pixelation" of the image. Bilinear interpolation produced similar color artifacts to a lesser degree - however, due to the naive averaging of priors, edges are drastically reduced as sharp color boundaries are smoothed over. Using edge detection, we can visualize how many edges are introduced by nearest neighbor interpolation and how edges are removed by bilinear interpolation relative to the target image.

In contrast, despite being shallow, the convolutional neural network performed remarkably well given its simplicity. The sharpness of the image was close to that of the image generated from the simulated sensor values, with few obvious false color artifacts and edges that appear natural. In our edge visualization, the edges for the convolutional neural network remain similar to those of the target image.

Average color histogram for all images in the test set. Note that green dominates, which biases the neural network toward green hues.

Quantitatively, we can assess the quality of the predicted sensor values using a few metrics, including mean squared error and the mean delta E 2000 generated by S-CIELAB.

Test Set Error

Algorithm                Mean Squared Error (volts²)   Mean Delta E 2000 (S-CIELAB)
Nearest Neighbor         4.618 x 10^-3                 4.0266
Bilinear Interpolation   3.518 x 10^-3                 5.3534
CNN                      0.5970 x 10^-3                3.7440

Unsurprisingly, the mean squared error for sensor values produced by the CNN was an order of magnitude lower than that of bilinear interpolation and nearest neighbor, given that MSE was the loss function optimized during training. However, the low mean squared error did not necessarily translate to good color reproduction - we can see from the delta E 2000 values that the convolutional neural network did not perform appreciably better than nearest neighbor, despite the images produced by the neural network appearing qualitatively superior.

We can see this as a green overlay on some of the images compared to the target image, which we hypothesize results from the frequency of green pixel values in the sensor data - that is, the mean squared error loss function naively emphasizes the error in green pixel values over red or blue values by a factor of two, since a Bayer tile contains two green pixels for every red and blue pixel. Moreover, since the images in the COCO dataset capture our everyday environment, a color histogram demonstrates that the dominant color in our dataset is green, which biases the neural network toward green hues. We can visualize and quantitatively extract this information by producing color histograms of the predicted and simulated images (notably, we could have also performed this on the sensor response itself) using OpenCV, as sketched below.
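A sketch of the averaged per-channel histogram computation (the images list of loaded BGR arrays is an assumption):

  import cv2
  import numpy as np

  channels = ("b", "g", "r")                      # OpenCV loads images as BGR
  hists = {c: np.zeros(256) for c in channels}
  for img in images:
      for i, c in enumerate(channels):
          hists[c] += cv2.calcHist([img], [i], None, [256], [0, 256]).ravel()
  for c in channels:
      hists[c] /= len(images)                     # average histogram per channel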

Sample Output

Below we can see the sample outputs from the algorithms that were developed.

Image from low resolution sensor data
Images produced from high resolution sensor data. For the image produced by the convolutional neural network, note the green hue compared to the target image. The color checker in the background is also a useful qualitative reference. a=nearest neighbor, b=convolutional neural network, c=bilinear interpolation, d=target

Other examples can be found here, with the ordering remaining the same as denoted above:

Kernel Visualization

Visualizing the kernels allows us to qualitatively observe what appear to be the most pertinent features that may be derived from sensor voltage data to produce higher resolution data. In other words, the kernels represent sparse elemental representations of the voltage data that may be recombined to form an enlarged representation. Output from the second hidden layer reveals that the convolution kernels consume local information in a geometric, grid-like pattern, with the striping indicating that some kernels appear to ignore or consume certain column or row patterns. This perhaps implies that the individual kernels naturally extract information about spatial frequency beyond the RGB voltage values. We also note that some of the outputs of the second layer are sparse, which suggests that the number of filters could be reduced for more rapid training and that the number of useful features may be limited.

The image above shows the output from the second hidden convolutional layer, which consists of 16 filters. Colors represent the magnitude of the values only.

Other examples:

Failure Cases

While the convolutional neural network performed fairly well, there was a peculiar phenomenon where strong blue tones disappeared from the image, resulting in a completely different color:

The row above shows the images from the simulated sensor data while those below are produced by the convolutional neural network. Note the mistranslation of strong blue colors.

This failure may be due to the infrequency of strong, isolated blue tones in the dataset, which may bias the neural network to reduce the strength of blue voltage values. The nearest neighbor and bilinear interpolation algorithms did not demonstrate this behavior.

Conclusions

Convolutional neural networks can perform remarkably well at image processing tasks: despite the relatively low number of trainable parameters, the unoptimized training, and the quite simple architecture (i.e., no network-in-network layers, dropout, etc.), the CNN developed in this project upscaled sensor voltage data which, after post-processing, produced images that had few artifacts and were visually pleasing. We suggest that convolutional neural networks, with sufficient hyperparameter tuning and training, may prove able to predict the sensor responses of various topologies given a prior sensor data response. Additional investment into tuning the hyperparameters of the neural network, or the introduction of novel architectures, may help eliminate some of the aberrations observed in the predicted response.

Future Work

Developing a custom loss function which equalizes the weighting of each RGB value (since Bayer tiles produce two green values for every red and blue value, green is overrepresented in the loss function) may reduce the tendency of the neural network to produce scenes with a green hue; a sketch of such a loss follows below. Furthermore, if one were to implement a loss function which considered the mean delta E 2000 between the target and predicted image, a Taylor approximation could be used to reduce processing time, and this would likely result in significantly better color reproduction. We may also seek additional quantitative metrics to objectively determine the quality of prediction, which could also be used as part of a composite loss function to reduce specific aberrations.
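As a starting point, here is a hedged sketch of such a weighted loss, assuming a standard 2x2 RGGB Bayer layout where green sites lie at positions with odd row-plus-column parity; green error is halved so each color contributes equally:

  import tensorflow as tf

  def bayer_weighted_mse(y_true, y_pred):
      """MSE with green Bayer sites down-weighted by half (assumed RGGB layout)."""
      h = tf.shape(y_true)[1]
      w = tf.shape(y_true)[2]
      row = tf.range(h)[:, None]
      col = tf.range(w)[None, :]
      green = tf.cast((row + col) % 2 == 1, tf.float32)  # G sites in an RGGB tile
      weights = 1.0 - 0.5 * green                        # 1.0 at R/B sites, 0.5 at G sites
      sq_err = tf.square(y_true - y_pred)[..., 0]        # drop the channel axis
      return tf.reduce_mean(weights[None, ...] * sq_err)

  # Usage: model.compile(optimizer="adam", loss=bayer_weighted_mse)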

In addition, other non-neural-network algorithms should be evaluated for predicting the upsampled sensor data, since nearest neighbor and bilinear interpolation are considered naive. A comparison with more sophisticated upsampling approaches, such as bicubic interpolation, would better bolster (or undermine) the argument that convolutional neural networks are meaningfully more performant than other algorithms.

Appendix

Github: https://github.com/gnedster/psych221

References

[1] https://en.wikipedia.org/wiki/Image_scaling

[2] https://papers.nips.cc/paper/4824-imagenet-classification-with-deep-convolutional-neural-networks.pdf

[3] https://www.cv-foundation.org/openaccess/content_cvpr_2015/html/Long_Fully_Convolutional_Networks_2015_CVPR_paper.html

[4] http://personal.ie.cuhk.edu.hk/~ccloy/files/eccv_2014_deepresolution.pdf

[5] https://ftp.cs.nyu.edu/~fergus/papers/matt_cvpr10.pdf

[6] https://arxiv.org/abs/1707.05425