Deep Learning for Illuminant Estimation


Authors: Xuerong Xiao, Jennifer Li

Introduction

Our project aims to estimate the illuminant of a scene using several machine learning methods, with a focus on convolutional neural networks (CNNs), a deep learning method. Illuminant estimation is of interest because it has applications in areas such as image reproduction and image retrieval. For image reproduction, an image may be captured under one illuminant but rendered under a different one. In image retrieval and computer vision, the illumination of the scene needs to be estimated so that it can be accounted for when images of objects are acquired under different lighting. [1]

In past research on illuminant estimation, various methods have been used to recover the illuminant. These methods include gamut mapping and random forest, an ensemble machine learning method, and are described in more detail in the Background section.

Background

In related literature, gamut mapping has been used to recover the illuminant of a scene. [1] In this method, the gamut of each illuminant, that is, the range of sensor responses that can occur under that illuminant, is precomputed using a database of measured reflectance spectra. The sensor values of each image are then compared to the illuminant gamuts using a correlation coefficient. This method classified blackbody radiator illuminants from 2500K to 8500K correctly to within a few hundred kelvin.

A drawback of this method is the precomputation of all the illuminant gamuts. This can be avoided by directly comparing feature vectors of images through machine learning. The paper "Illuminant Classification Based on Random Forest" [2] uses the ensemble machine learning method of random forests to estimate the illuminant. In this method, random feature vectors of the image are used to create the decision branches of the trees that classify the illuminant of the image. The results of this research were measured in terms of angular error, the angle between the ground-truth illuminant color vector and the estimated one. These angular errors were comparable to those obtained with gamut mapping.

Another line of work brings up an interesting application of illuminant estimation: detecting digital forgery [3]. In this method, the image is segmented into regions of similar chromaticity, and the differences between these regions are analyzed. The dissimilarity between objects across region boundaries is compared to the dissimilarity between neighboring objects within a region to detect whether two regions are under different illuminants. If regions of the image are estimated to be lit by different illuminants, this could indicate a digital forgery.

In our method, we have chosen to use machine learning algorithms, as in [2], to estimate the illuminant, so that we can focus on feature extraction and the learning algorithm without the steps of precomputing gamuts or segmenting the image into regions.

Methods

Data Generation

For our machine learning methods, we first need to generate sets of images, which will be used to train and test our learning algorithms. The pipeline for generating our images is shown below.

Data Generation Pipeline


In the first step, we obtain images from available databases. Ideally, we would use a database of hyperspectral scene images, that is, images for which the surface reflectance is known at every pixel and wavelength. However, no large database of such images exists, so we use standard RGB images. The assumption is that this three-dimensional subspace is a close approximation to real-world reflectances.

Given an RGB image, we know the emission spectra of the red, green, and blue primaries of the monitor. To model the radiance that reaches our eyes, we represent each pixel's spectrum as a weighted sum of the three primary spectra, with the pixel's RGB values as the weights. In addition, we assume the illuminant is the white point of the monitor, the spectrum obtained by setting all RGB values to 255. Dividing the pixel radiance by this illuminant spectrum then gives an estimate of the surface reflectance at each pixel. Once the reflectance is known, the scene can be re-rendered under any light spectrum.
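
To make this step concrete, the following MATLAB sketch estimates per-pixel reflectances from an RGB image and the display primary spectra. It is a minimal illustration, not the project code (see Appendix I): the primary spectra are random placeholders standing in for measured data, and the image file is just a MATLAB demo image.

 wave = 400:10:700;                             % wavelength samples (nm)
 primaries = rand(numel(wave), 3);              % placeholder for measured R, G, B primary spectra
 rgbImage  = im2double(imread('peppers.png'));  % example image, values in [0, 1]
 [rows, cols, ~] = size(rgbImage);
 rgbVec = reshape(rgbImage, rows*cols, 3);      % one RGB triplet per row
 % Radiance leaving the display: weighted sum of the three primary spectra
 radiance = rgbVec * primaries';                % (pixels x wavelengths)
 % Assumed illuminant: the display white point (all primaries at full intensity)
 whitePoint = sum(primaries, 2)';               % (1 x wavelengths)
 % Reflectance estimate: divide out the illuminant at each wavelength
 reflectance = radiance ./ whitePoint;          % implicit row-wise expansion
 % Re-rendering under a new illuminant amounts to multiplying the estimated
 % reflectances by the new illuminant spectrum.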

An RGB database we used is the Caltech101 database, which contains 9142 different images. We then use ISET to render each of these images under the different illuminants we have chosen to classify: a fluorescent illuminant and nine blackbody radiation illuminants with temperatures from 2000K to 10000K in 1000K increments. This generates 91420 images in total from the Caltech101 database. The figures below show an example of an original image in the database under the different illuminants.
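
As an illustration of this rendering step, here is a hedged MATLAB/ISET sketch of re-rendering one image under the blackbody illuminants. The function names (sceneFromFile, blackbody, sceneAdjustIlluminant) follow ISET conventions, but the file names, display calibration file, and loop structure are assumptions rather than the project's exact code.

 fname = 'image_0001.jpg';                                   % example Caltech101 image name
 scene = sceneFromFile(fname, 'rgb', [], 'LCD-Apple.mat');   % RGB image -> spectral scene
 wave  = sceneGet(scene, 'wave');                            % wavelength samples of the scene
 for temp = 2000:1000:10000                                  % blackbody temperatures (K)
     illEnergy = blackbody(wave, temp, 'energy');            % blackbody illuminant spectrum
     sceneT = sceneAdjustIlluminant(scene, illEnergy);       % swap illuminant, reflectances kept
     % ... save or pass sceneT on to the camera simulation ...
 end
 % The fluorescent illuminant is handled the same way, using a measured
 % fluorescent spectrum instead of a blackbody spectrum.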


Original scene from Caltech 101.


Original scene under different illuminants: fluorescent and blackbody radiators with temperatures ranging from 2000K to 10,000K


These images are then also rendered under a camera simulation in ISET. The simulation involves the optical image irradiance calculation on the sensor plane, the sensor response from optical image data, and the image processing pipeline (denoising, interpolation, etc.) from the image sensor to a display (see Appendix I for specific parameters). The camera that we simulated was the Nikon D1. White balancing was turned off, since this is the exact function we wish to implement. The following figures show the same images after being rendered under the camera simulation.
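
The sketch below outlines this camera simulation with standard ISET calls (oiCreate/oiCompute, sensorCreate/sensorCompute, ipCreate/ipCompute), starting from a scene sceneT produced in the previous step. The exposure time shown is an arbitrary example; the actual Nikon D1 parameters are those listed in Appendix I.

 oi = oiCreate;                                  % optics and optical image structure
 oi = oiCompute(oi, sceneT);                     % irradiance on the sensor plane
 sensor = sensorCreate;                          % image sensor (parameters set to mimic the Nikon D1)
 sensor = sensorSet(sensor, 'exp time', 0.02);   % example exposure time (s), an assumption
 sensor = sensorCompute(sensor, oi);             % sensor response from the optical image
 rawVolts = sensorGet(sensor, 'volts');          % RAW data: sensor values, no processing applied
 ip = ipCreate;                                  % image processing pipeline (demosaicing, etc.)
 ip = ipCompute(ip, sensor);                     % processed image; white balancing left off
 rgbRendered = ipGet(ip, 'result');              % rendered RGB image for the display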


Input scenes under different illuminants after camera simulation of optics, sensor, and processing.


We also created RAW images. These images do not go through any processing and are what would be used in practice to make inferences about the illuminant. A total of 235,780 images are rendered; half of these are RGB and half are RAW images. The following figures show some RAW images that did not go through the processing step.


Input scenes under different illuminants after camera simulation of optics and sensor.


After these steps, the dataset is split into training, validation, and test data for the k-Nearest Neighbors and Convolutional Neural Network learning algorithms. The training data is used to train the classifier, the validation data is used to tune the parameters of the classifier algorithm, and the test data is used to evaluate the final classifier. The following table shows how the data was partitioned.

Partition of generated image data. The same numbers apply to both RGB and RAW images.
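
As a sketch of how such a partition can be produced, the snippet below randomly splits the image indices into the three sets. The 70/15/15 proportions are purely illustrative; the actual counts are those in the table above.

 nImages = 91420;                      % number of rendered images of one type (RGB or RAW)
 idx     = randperm(nImages);          % random permutation of the image indices
 nTrain  = round(0.70 * nImages);      % hypothetical 70% training share
 nVal    = round(0.15 * nImages);      % hypothetical 15% validation share
 trainIdx = idx(1 : nTrain);
 valIdx   = idx(nTrain+1 : nTrain+nVal);
 testIdx  = idx(nTrain+nVal+1 : end);  % remaining images form the test set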

k-Nearest Neighbors Baseline

To set a baseline for the results of our main Convolutional Neural Network (CNN) learning algorithm, we also implemented the k-Nearest Neighbors (k-NN) classifier. Unlike a CNN, k-NN is a non-parametric, instance-based learning algorithm: no explicit model is trained. Instead, a query image is compared to the labeled training images, and its class is decided by a majority vote among its k nearest neighbors in feature space. The results largely depend on the number of nearest neighbors k, which are found using a distance metric. The distance metric used in this project is the Euclidean distance:

d(\mathbf{x}, \mathbf{y}) = \sqrt{\sum_i (x_i - y_i)^2}

We selected histograms as the features of each image for the classification task. For RGB images, color histograms with 3 * 256 bins are computed; for RAW data, gray-level histograms with 256 bins are computed. The Euclidean distance between the histograms of two images I1 and I2 is then

d(I_1, I_2) = \sqrt{\sum_{h=1}^{N_{\mathrm{bins}}} \left( H_{I_1}(h) - H_{I_2}(h) \right)^2}

where the index h runs over all N_bins histogram bins.

To avoid overfitting, different values of k are tested on the validation data. The value of k that results in the highest validation accuracy is then applied to the test data.
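
The following MATLAB sketch shows the essence of this classifier: Euclidean distances between histogram features, a majority vote over the k nearest training images, and a loop over candidate values of k on the validation set. Variable names (trainHists, valHists, and so on) are illustrative; the actual code is in Appendix I.

 % trainHists, valHists: (numImages x numBins) histogram features per image
 % trainLabels, valLabels: integer illuminant class labels (1..10)
 D = pdist2(valHists, trainHists, 'euclidean');    % distances to all training images
 [~, order] = sort(D, 2, 'ascend');                % nearest training images first
 bestAcc = 0;
 for k = [1 5 10 15 30 50]                         % candidate numbers of neighbors
     nearest = order(:, 1:k);                      % indices of the k nearest neighbors
     pred = mode(trainLabels(nearest), 2);         % majority vote among their labels
     acc = mean(pred == valLabels);                % validation accuracy for this k
     if acc > bestAcc, bestAcc = acc; bestK = k; end
 end
 % The best k on the validation set (bestK) is then applied once to the test set.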

Convolutional Neural Network Deep Learning

CNNs are similar to ordinary neural networks, except that they are designed specifically for image inputs, which is why we have chosen this method as our main implementation. A CNN consists of many layers whose neurons are arranged in three dimensions (height, width, and depth). In our case, the input depth is the number of color channels in the input images: in this project, the (resized and cropped) input RGB images have a volume of 227 * 227 * 3, and the RAW images 227 * 227 * 1.

The project builds on AlexNet [4], whose architecture contains five types of layers: convolutional (Conv) layers, rectified linear unit (ReLU) layers, local response normalization (LRN) layers, max pooling layers, and fully connected (FC) layers. We implemented the network using the deep learning framework Caffe [5].

Each layer in the network performs a different function. The Conv layer extracts features of the input through the convolution operation: dot products are computed between a learned kernel (filter) and small patches of the input as the kernel slides across the input at a certain stride. Stacked convolutional layers extract both low-level and high-level features of the data.
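
A toy MATLAB example of this operation is shown below: a single 3x3 kernel slides over a 6x6 single-channel input with stride 1 (real Conv layers use many kernels, multiple channels, and sometimes larger strides).

 input  = magic(6);                                % example 6x6 single-channel input
 kernel = [1 0 -1; 2 0 -2; 1 0 -1];                % example 3x3 kernel (an edge-detecting filter)
 % conv2 flips the kernel, so flipping it back gives the cross-correlation
 % ("convolution" in CNN terminology); 'valid' keeps only full-overlap positions.
 featureMap = conv2(input, rot90(kernel, 2), 'valid');   % 4x4 feature map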

The next layer is the ReLU layer, which applies an element-wise activation function to increase the nonlinearity of the network. The LRN layer normalizes the activity of a neuron by

b^{j}_{x,y} = a^{j}_{x,y} \Big/ \left( k + \alpha \sum_{i=\max(0,\, j-n/2)}^{\min(N-1,\, j+n/2)} \left( a^{i}_{x,y} \right)^2 \right)^{\beta}

where a^{j}_{x,y} is the activity computed by applying kernel j at position (x, y), and the sum runs over n neighboring kernel maps out of the N maps in the layer [4]. The hyperparameters k, α, and β are tuned using the validation dataset. The max pooling layer computes the maximum value of a particular feature over a region, which is used for downsampling. The FC layer has connections to all previous activations and computes the class scores. Dropout layers are also included to prevent overfitting in the FC layers.

The CNN then calculates the probability of each class given the input using forward propagation. To optimize this output, the CNN uses backward propagation to compute the gradients of the loss function with respect to the parameters. The loss function over a dataset D can be written as a function of the weights W [6]:

L(W) = \frac{1}{|D|} \sum_{i}^{|D|} f_W\!\left( X^{(i)} \right) + \lambda\, r(W)

where f_W(X^{(i)}) is the loss on data sample X^{(i)}, computed in the forward propagation, r(W) is a regularization term, and λ is the weight of the regularization.

Stochastic gradient descent (SGD) is then applied on mini-batches of size N, since the dataset in this project is very large. The loss is approximated by [6]

L(W) \approx \frac{1}{N} \sum_{i}^{N} f_W\!\left( X^{(i)} \right) + \lambda\, r(W)

The gradient of the loss is computed using backward propagation. SGD with momentum then updates the weights W as

W_{t+1} = W_t + V_{t+1}, \qquad V_{t+1} = \mu V_t - \alpha \nabla L(W_t)

where V is the update (velocity) term.

The learning rate α and the momentum μ are also hyperparameters to be tuned using the validation data.
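
The snippet below illustrates this momentum update on a toy quadratic loss (whose gradient is simply W - Wtarget); in the real network the gradient comes from back-propagation through all layers. It is only a sketch of the update rule, not the Caffe implementation.

 alpha   = 0.01;                      % learning rate (base_lr in the solver file)
 mu      = 0.9;                       % momentum
 Wtarget = randn(10, 1);              % toy optimum of the quadratic loss
 W = zeros(10, 1);                    % weights
 V = zeros(10, 1);                    % velocity (previous update)
 for iter = 1:200
     gradLoss = W - Wtarget;          % gradient of the toy loss 0.5*||W - Wtarget||^2
     V = mu * V - alpha * gradLoss;   % V_{t+1} = mu*V_t - alpha*grad L(W_t)
     W = W + V;                       % W_{t+1} = W_t + V_{t+1}
 end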

Another layer, the softmax loss layer, computes the multinomial logistic loss of the softmax of the class scores. The class with the highest probability among the 10 mutually exclusive classes becomes the output estimated class.
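
For one input with 10 class scores (the outputs of the last FC layer), the softmax probabilities and the multinomial logistic loss can be sketched as follows; the scores and true class here are arbitrary examples.

 scores    = randn(10, 1);                  % example class scores from the last FC layer
 trueClass = 3;                             % example ground-truth illuminant class
 expScores = exp(scores - max(scores));     % subtract the max for numerical stability
 probs     = expScores / sum(expScores);    % softmax probabilities (sum to 1)
 loss      = -log(probs(trueClass));        % multinomial logistic (cross-entropy) loss
 [~, predictedClass] = max(probs);          % estimated illuminant class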

Support Vector Machine Using Bag of Features

Besides our main focus on the CNN method, we also wanted to explore other machine learning algorithms, including support vector machines (SVM). For this method, we only ran a test on a small directory of 350 images to get a general idea of the results.

In the SVM learning method, each point to be classified is represented as a p-dimensional vector whose elements are the features of the point, in our case the image. A classifier, which can be thought of as a (p-1)-dimensional hyperplane, is used to separate the points into classes. There are many ways to construct such a hyperplane; the SVM chooses the one that separates the points of different classes while maximizing the distance from the hyperplane to the points closest to it. These closest points are known as the support vectors.

The following image (obtained from Wikipedia) illustrates these concepts for classifying points with two dimensional feature vectors into two classes.

Example of Classifiers


Here, H3 is the best classifier line for SVM, because it separates the two classes and maximizes the distance between the support vectors and the line.


For illuminant classification, we can use the SVM classifier built into the MATLAB machine learning toolbox, but we still need to choose which features to extract from the image. One example is the Histogram of Oriented Gradients (HOG) features of an image. Another, which we decided to explore, is the Bag of Features (BOF) extraction method, which is built on Speeded-Up Robust Features (SURF). We chose these features because they were readily available in the MATLAB toolbox, but as explained later, they did not produce good results.

To apply the SVM to our image classification problem, we used a small database of images split into the illuminant classes. We extracted the bag of features from the training set and fed these features, together with the corresponding illuminant class label of each image, into the SVM classifier. We then evaluated the classifier on the training set and on the test set. The training set consisted of 60% of the images and the test set of the remaining 40%.
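
A hedged MATLAB sketch of this pipeline, using the Computer Vision Toolbox functions bagOfFeatures and trainImageCategoryClassifier, is shown below. The folder layout (one subfolder per illuminant class) is an assumption about how the 350-image set was organized, not a detail given in the report.

 imds = imageDatastore('illuminantImages', ...      % one subfolder per illuminant class (assumed layout)
     'IncludeSubfolders', true, 'LabelSource', 'foldernames');
 [trainSet, testSet] = splitEachLabel(imds, 0.6, 'randomized');     % 60% train / 40% test
 bag = bagOfFeatures(trainSet);                              % SURF-based visual vocabulary
 classifier = trainImageCategoryClassifier(trainSet, bag);   % multiclass SVM on BOF histograms
 confTrain = evaluate(classifier, trainSet);                 % confusion matrix on the training data
 confTest  = evaluate(classifier, testSet);                  % confusion matrix on the test data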

Results

k-NN Results

In k-NN classifiers, the choice of k has a great impact on performance, so choosing the optimal k is important to obtaining the best classifier. The figure below summarizes the validation accuracy of the classifier at different values of k.

Validation accuracies for RGB and RAW images using k-NN algorithm


The plot shows validation accuracy at different values of k. Although accuracies up to k = 500 were computed, only the most representative range is shown. The validation accuracy for RGB data peaks at k = 15, and for RAW data at k = 30. Since the color histograms are larger than the gray histograms, it is expected that the accuracies for RGB data are higher than those for RAW data. A better choice of k might be achieved with cross-validation.

The following table shows the training, validation and test accuracies of RGB and RAW data at the respective peak values of k.

The training error being larger than the test error indicates high bias in the classifier. To mitigate this, larger sets of more complex features are needed, so we turn to deep learning methods such as CNNs.

CNN Results

For CNNs, the tuning of the hyperparameters mentioned in the Methods section plays an important role in the performance of the classifier. In our experiments, the architecture-related hyperparameters are not significantly modified, although the input dimensions are changed for the RAW data. AlexNet [4] is adopted for the project, with the number of outputs of the last FC layer changed to 10. The learning-related hyperparameters are shown in the table below; these are the values set in the solver files of the code and are also included in Appendix I.

Hyperparameters in the solver files


For both RGB and RAW data, the first trial was initiated with a base learning rate of 0.0001. The rate is tuned so that it is neither so high that the loss diverges nor so low that learning becomes inefficient. Gamma is the factor by which the learning rate drops every stepsize iterations; a smaller stepsize makes the learning rate decrease faster.
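
For reference, a Caffe solver file of the kind being tuned looks roughly like the sketch below. The values shown are illustrative placeholders (only the base learning rate of 0.0001 is taken from the text); the project's actual settings are those in the table above and in Appendix I.

 net: "train_val.prototxt"     # network definition (AlexNet with 10 outputs)
 base_lr: 0.0001               # initial learning rate
 lr_policy: "step"             # drop the learning rate in steps
 gamma: 0.1                    # factor applied to the learning rate at each drop
 stepsize: 20000               # iterations between learning-rate drops
 momentum: 0.9
 weight_decay: 0.0005          # regularization weight
 max_iter: 100000
 snapshot: 10000               # save intermediate models (used here to resume and adjust training)
 snapshot_prefix: "illuminant_alexnet"
 solver_mode: GPU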

We first adjusted the learning-rate-related hyperparameters. We then experimented with the training batch size and found that increasing it from 2 to 10 improves the test accuracy by 5% at 20,000 iterations. However, a batch size of 10 is the maximum before the computing resource runs out of memory, so all later training runs use a batch size of 10. Changing the weight decay over the range 0.0001 to 0.005 does not produce an obvious change in performance, and the dropout ratios in the FC layers also have a negligible effect overall.

For the RGB dataset, besides using the initial base learning rate, the pre-trained weights from AlexNet were also used and resulted in better performance. In fine tuning, the base learning rate is decreased, and the learning rate of the last FC layer is increased so that it learns faster while the rest of the model changes slowly.

Due to limited time, not all possible tuning was carried out: each training run takes around 20 hours for 100,000 iterations with the maximum batch size of 10 that the GPU could handle, so there may be more optimal hyperparameter values left to explore.

The following plots compare the results of the RAW and RGB data. Test accuracy vs. iterations in stochastic gradient descent for the best models for RGB and RAW data are shown.

Test accuracy vs. iterations in SGD for RGB images, fine-tuned from the pre-trained model


Test accuracy vs. iterations in SGD for RAW images, trained from scratch


We see that both curves converge after many iterations. There is a second plateau in the RGB plot that is due to an increase of learning rate in the middle of training. This was implemented using snapshots in Caffe.

The following table shows the accuracies of the classifier on the RGB and RAW data. The accuracy is higher on the training data, indicating a problem of overfitting. This could be addressed by reducing the number of Conv layers in the architecture; other approaches, such as increasing the regularization and the dropout ratios, were tried with little effect.

Training, validation and test accuracies of RGB and RAW data using CNNs


When compared to the k-NN method, the accuracies for the RAW images are slightly better, but they are still very low. This may indicate that the architecture is more suitable for RGB images that have gone through processing.

The confusion matrices shown below were also obtained for the RGB and RAW images to give better insight into the predictions of the classifier. The horizontal axes of the confusion matrices represent the true illuminant labels, and the vertical axes the predicted labels. The entries are normalized by the total number of images in the sets.
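
A minimal sketch of how such a normalized confusion matrix can be computed in MATLAB is given below; the random labels are stand-ins for the classifier's actual predictions.

 trueLabels      = randi(10, 1000, 1);               % stand-in for the true illuminant classes
 predictedLabels = randi(10, 1000, 1);               % stand-in for the predicted classes
 C = confusionmat(trueLabels, predictedLabels);      % rows: true class, columns: predicted class
 Cnorm = C' / numel(trueLabels);                     % transpose so columns index the true labels,
                                                     % and normalize by the total number of images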

Test Confusion Matrix for RGB data


Test Confusion Matrix for RAW data


We see that most of the misclassification occurs between neighboring classes, which is expected because neighboring classes differ less. Another promising aspect is that the performance on the first few classes is better than on the last few. Since the perceptual difference between blackbody illuminants decreases as the temperature increases, the first few classes are more distinct perceptually, so the classifier behaves similarly to human perception.

SVM Bag of Features Results

The following figures show the confusion matrices for the SVM classifier using Bag of Features (the illuminant class names have been abbreviated). The first matrix shows the result when the classifier is applied to the data used to train it; as expected, the accuracy is high. The second matrix shows the results of the classifier on the test data. We see a case of overfitting, since the accuracy on the test data is extremely low.


Classifier used on training data.



Classifier used on test data.


This shows that SVM with BOF is not a useful algorithm for this problem. BOF is better suited to object recognition in scenes, which is not the same problem as identifying scene illuminants.

Conclusions and Future Work

Comparing the baseline k-NN classifier and the CNN classifier for illuminant estimation, the deep learning CNN achieved higher accuracy. This is expected, because the network is able to learn features of an image at multiple levels. One major limitation of this project was time: as mentioned, the long training time hinders the hyperparameter search needed to select the most effective classifier model.

Besides trying different datasets to work with, future work for the CNN classifier should focus on implementing a better CNN architecture for classifying RAW images. Another aspect to experiment with would be scrambling the pixels of input images to investigate the influence of spatial arrangement on the classifier performance.

For the SVM, besides increasing the dataset size, further work could explore which feature vectors are best suited to this application. The Bag of Features method did not work well for illuminant classification, since it is better at detecting textures and objects in the scene. The Histogram of Oriented Gradients feature set may also not work well, since those features mainly capture edge structure. Since we are classifying illuminants, the range of colors is a better feature space to focus on; color features such as color histograms could be used with the SVM to estimate the illuminant.

Acknowledgements

We would like to thank Brian Wandell, Joyce Farrell, and Rosemary Le for general support throughout the class and project, and special thanks to Henryk Blasinski for help with the project.

References

[1] Tominaga, Shoji, Satoru Ebisui, and Brian A. Wandell. "Scene Illuminant Classification: Brighter Is Better." Osapublishing.org. OSA Publishing, Jan. 2001. Web. Accessed Dec. 2015.

[2] Qiu, Guoping, and Bozhi Liu. "Illuminant Classification Based on Random Forest." IEEE Xplore. IEEE, May 2015. Web. Dec. 2015.

[3] Rajapriya, S., and S. Nima Judith Vinmathi. "Detection of Digital Image Forgeries by Illuminant Color Estimation and Classification." International Journal of Innovative Research in Computer and Communication Engineering 2.1 (2014): 248-54. Mar. 2014. Web. Dec. 2015.

[4] Krizhevsky, A., Sutskever, I., and Hinton, G. E. (2012), "ImageNet Classification with Deep Convolutional Neural Networks," NIPS 2012: Neural Information Processing Systems, Lake Tahoe, Nevada.

[5] http://caffe.berkeleyvision.org

[6] Jia, Y., Shelhamer, E., Donahue, J., Karayev, S., Long, J., Girshick, R., Guadarrama, S., and Darrell, T. (2014), "Caffe: Convolutional Architecture for Fast Feature Embedding," arXiv preprint arXiv:1408.5093.

Appendix I

Example Code for Generating Images

File:Image generation.zip


Code for k-NN and CNN

File:K-NN CNN code.zip


Code for SVM using BOF

File:SVM BOF.zip


RGB images File:RGB.zip

RAW images File:RAW.zip

Appendix II

Work Breakdown:

Xuerong - RAW and RGB image generation, KNN algorithm, CNN algorithm

Jennifer - Related work research, SVM algorithm, help with RGB image generation