Image classification with a five band camera


Introduction

Image classification is a central topic in computer vision in which images are grouped into categories via comparison and prediction. Many state-of-the-art classification systems follow a broadly similar procedure: feature extraction via dense grid descriptor sampling, coding into higher dimensions, pooling, and finally classification. Integrated computer vision techniques like object and scene recognition are also gaining traction in consumer-grade electronics. Two now-common examples in digital cameras are (1) scene recognition, in which appropriate aperture, shutter speed, and nonlinear gamma settings are selected, and (2) face detection, which is used to determine autofocus parameters and locations for redeye removal.

Aside from pure software-based developments, data acquisition is central to this problem. Most digital cameras use silicon CCD or CMOS sensor chips that are inherently sensitive to a broad spectrum of light, starting at short visible wavelengths and reaching into the near-infrared (NIR). Most of these sensors sit behind a series of filters that band-limit or remove light, yielding three channels of digital data corresponding to the red, green, and blue (RGB) color channels. Such cameras are often termed "three band". It is argued here that a camera with sensitivity over a broader spectrum of wavelengths could be helpful to many computer vision techniques.

To test this hypothesis, a set of 72 images of four different pieces of fruit was taken with a prototype five band camera donated to Stanford's Center for Image Systems Engineering (SCIEN) lab, sampled using SIFT feature descriptors, and matched using a distance ratio test and RANSAC.

Background

In 2011, Brown and Susstrunk [1] used a modified single lens reflex (SLR) camera to capture several hundred three band color and NIR image pairs to show that the addition of NIR image information leads to significantly improved performance in scene recognition tasks. The authors tested multiple feature descriptors in their analysis, including color-specific SIFT, multispectral SIFT, GIST (feature descriptors generated using three scales and eight orientations per scale), and HMAX (a hierarchy of filtering and max-pooling operations computed independently and concatenated for each band). In each case, the authors found a general trend that adding more information led to better recognition performance, and that the SIFT-based descriptors showed the greatest improvements (recognition rates of 59.6% vs. 73.1% for RGB versus RGB + NIR, respectively).

In 2012, Namin and Petersson [2] presented a method of distinguishing between different materials occurring in natural scenes using a seven band camera. Instead of using SIFT for object detection and localization, the authors used a texture-based approach considering Fourier spectrum features from gray-level co-occurrence matrices (GLCMs), and classifiers built using support vector machines (SVM) and AdaBoost. Such an approach treats the entire image as one class of data, such as an image of grass or bushes, rather than identifying particular objects in an image. They found that the extra spectral information helped them discriminate materials better, achieving average classification accuracies of 91.9% and 89.1%, respectively, for a ten-class problem.

In support of the recent success of image classification using multispectral imaging, many groups have begun adapting commonly used techniques for multi-band image data. For example, Saleem and Sablatnig [3] and Xiao, Wu, and Yuan [4] developed adapted feature sampling and extraction techniques for multi-band images. As another example, Raja and Kolekar [5] and Salamati, Germain, and Susstrunk [6] both developed improved shadow removal and image restoration techniques based on the use of extra spectral information.

Methods

Instruments

The camera used in this experiment was a prototype five band camera donated to our lab. Because it is a prototype, many details about the camera are unknown, but we do know a few. The sensor has a resolution of 1920 x 1080 pixels with a depth of 12 bits. When positioning, aiming, and focusing the camera, we could use all of its pixels in "1080P mode" using feedback from the preview software on the connected laptop. During data acquisition, however, the saved images were only 1280 x 720 per channel.

Raw files generated from the Matlab acquisition software are linear (i.e., no gamma transform has been applied), 12-bit arrays with five channels per pixel. Each pixel measures one color (red, green, blue, orange, or cyan) depending on its position in the color filter array. The demosaicing operation is unknown, but it is known to take in 12-bit data and return 16-bit upscaled data. This data is then scaled down to 8 bits for RGB image formation, as sketched below.
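To make the bit-depth handling concrete, here is a minimal Matlab sketch of how a linear 12-bit raw frame might be upscaled to 16 bits and then reduced to 8 bits for display. This is an assumption for illustration, not the vendor pipeline (which is unknown to us); the array sizes and variable names are hypothetical.

 % Hypothetical 720 x 1280 x 5 linear raw frame with 12-bit values (0..4095)
 raw12 = uint16(randi([0 4095], 720, 1280, 5));
 
 % Upscale 12-bit -> 16-bit by left-shifting 4 bits (multiply by 16)
 raw16 = bitshift(raw12, 4);
 
 % Downscale 16-bit -> 8-bit for RGB image formation, keeping the data linear (no gamma)
 img8 = uint8(double(raw16) / 65535 * 255);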

The camera's spectral sensitivities were supplied with the camera and, unsurprisingly, indicate the highest peak sensitivity for the green channel and the lowest peak sensitivity for the blue channel. Usable sensitivity begins at about 450 nm (except for the blue sensor, which appears to extend below the 380 nm limit of the supplied data) and ends at about 680 nm. It is also important to note how much overlap there is between the green channel and the two "extra" cyan and orange channels: the better the separation between those channels, the more independent information the five band data set will contain. As with most consumer-grade cameras, sensitivity in the NIR is diminished, either because of the sensor itself or because of a "hot mirror" between the lens and sensor that filters out these longer wavelengths.

During image acquisition, three different lighting setups were used to see how light directionality, shadows, and sensitivity might affect classification results. One setup used the standard incandescent room lights; the other two used incandescent lamps. All used tungsten filaments, whose relative spectral luminous flux peaks in the NIR. This can be seen in the data supplied by the lab, although the measured spectrum ends at the NIR; if the measurements were continued, the flux would fall off slowly, reaching roughly a six-fold reduction by 2500 nm.

Other than the camera and lighting elements, the only other items used were a PC laptop controlling the camera, a white paper backdrop for low-feature image backgrounds, and a Mac laptop for processing the images and running the image processing / computer vision software.

Calibration

Camera calibration was handled via a simplified method inspired by Park et al. [7], in which the authors recovered spectral reflectances in a scene using a multispectral camera and light source. In our case, we assumed that each measured pixel value m was composed of a gain g, the spectral responsivity s, the spectral illumination i, the spectral reflectance r, and an offset b, so that for each band k:

m_k = g_k ∫ s_k(λ) i(λ) r(λ) dλ + b_k

The first two terms to be determined are the offset and the gain. If we first take an image with the lens cover on, the illumination and reflectance terms both drop to zero over all wavelengths, directly giving us the offset term.

If we then take an image of a pure white reflectance target, we may assume the reflectance is constant and equal to one over all wavelengths. Combining this with the offset term obtained in the first step and the spectral output data for our tungsten lamp, we can solve for the gain directly.

The last step is to solve for reflectance values and compare them to real-world data. For this, we imaged a Macbeth color calibration target, a cardboard-framed array of 24 painted squares whose colors are intended to mimic the spectral reflectances of natural objects. These reference reflectance data are shown below. To solve for our observed reflectance values, we use the tungsten illumination and camera responsivity data on hand, together with the calculated gain and offset terms.

We are somewhat limited by how many unknowns we are solving for relative to how many observations we can make. Park et al. proposed a linear model built from a set of orthogonal spectral basis functions, i.e., eigenfunctions of a correlation matrix derived from measured reflectances of color chips. I opted for a much simpler method, calibrating against the grayscale patches of the chart and taking advantage of the fact that their reflectances are roughly constant over our spectral bandwidth. By assuming these values are constant, we can solve for all values over the spectrum at once, and we find that our reflectance values match those from the chart quite well, deviating no more than 10% from the calibrated measurements. A minimal sketch of this gray-patch calibration is given below.
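The following Matlab sketch summarizes the calibration chain under the assumptions above (a dark frame for the offset, a white target for the gain, and spectrally flat gray patches for the reflectance check). It is illustrative only; the variable names (darkImg, whiteImg, grayPatchImg, s, illum) are hypothetical stand-ins for the lab data, with s a 5 x Nlambda responsivity matrix and illum a 1 x Nlambda lamp spectrum sampled at the same wavelengths.

 % Step 1: offset b from a lens-cap ("dark") image -- no light reaches the sensor
 b = squeeze(mean(mean(darkImg, 1), 2));            % 5 x 1 per-band offset
 
 % Step 2: gain g from an image of a white target, assuming r(lambda) = 1.
 % The sum over sampled wavelengths approximates the integral in the model.
 mWhite = squeeze(mean(mean(whiteImg, 1), 2));      % 5 x 1 mean measurement
 g = (mWhite - b) ./ (s * illum');                  % solve m = g*sum(s.*i) + b
 
 % Step 3: recover an (assumed spectrally flat) reflectance for a gray patch
 mGray = squeeze(mean(mean(grayPatchImg, 1), 2));   % 5 x 1 mean measurement
 rGray = (mGray - b) ./ (g .* (s * illum'));        % one estimate per band
 rEstimate = mean(rGray);                           % combine the five estimates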

(Macbeth ColorChecker reference reflectances, grouped by patch type: natural, miscellaneous, primary, and gray.)

Image demosaicing

Image demosaicing is a digital image processing algorithm used to reconstruct a full-color image from an incomplete color sampling. This sampling is the result of an image sensor overlaid with a color filter array (CFA). A good demosaicer is one that reconstructs a color image while (1) not introducing false-color artifacts such as aliasing or fringing, (2) preserving spatial resolution, (3) keeping computational complexity low for fast processing times (especially important if the software ships as in-product firmware), and (4) being open enough to allow analysis of its noise and error.

The most common commercially used CFA configuration is the Bayer filter [11]. In this case, each 2x2 cell contains one blue (short-pass) filter, one red (long-pass) filter, and two green (band-pass) filters. Other commonly used CFAs are RGBE, CYYM, CYGM, RGBW Bayer, and RGBW #1-3. The Bayer filter's widespread use is attributed to it being designed to mimic the spectral sensitivity of the human eye, which is most sensitive to green wavelengths. A sketch of a simple demosaicer for a Bayer mosaic is shown below.
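For reference, here is a minimal bilinear demosaicer for a standard RGGB Bayer mosaic, written in Matlab. This is not the camera's five-band demosaicer (which is unknown to us); it simply illustrates the kind of interpolation any demosaicer performs: each channel's missing samples are filled in from its nearest recorded neighbors.

 function rgb = bilinear_demosaic(mosaic)             % mosaic: M x N double, RGGB layout
     [M, N] = size(mosaic);
     [cols, rows] = meshgrid(1:N, 1:M);
     Rmask = mod(rows, 2) == 1 & mod(cols, 2) == 1;   % R at odd rows, odd cols
     Bmask = mod(rows, 2) == 0 & mod(cols, 2) == 0;   % B at even rows, even cols
     Gmask = ~Rmask & ~Bmask;                         % G on the remaining checkerboard
 
     kRB = [1 2 1; 2 4 2; 1 2 1] / 4;                 % bilinear kernel for sparse R and B
     kG  = [0 1 0; 1 4 1; 0 1 0] / 4;                 % bilinear kernel for checkerboard G
 
     rgb = zeros(M, N, 3);
     rgb(:,:,1) = conv2(mosaic .* Rmask, kRB, 'same');
     rgb(:,:,2) = conv2(mosaic .* Gmask, kG,  'same');
     rgb(:,:,3) = conv2(mosaic .* Bmask, kRB, 'same');
 end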

Although many consumer-grade digital cameras can now save images in raw format and let the user demosaic the data with third-party software, we opted to use the supplied demosaicer, which is what would ship as built-in firmware should a product like this one reach the mass market.

Data

The data used in this experiment is a set of 72 image acquisitions, for a total of 144 images, of four pieces of fruit: a ripe banana (mostly yellow), an underripe banana (green and yellow), a Red Delicious apple, and a Gala apple. The set was used to determine not only whether we could distinguish between major classes (apple vs. banana), but also between subclasses (ripe vs. underripe, or type of apple), indicating further sensitivity to color differences. Images were taken from six different spatial locations (with varied XY location and distance from the subject) and under three different lighting conditions: (1) incandescent room lights on, (2) a diffused and reflected lamp to the left of the subject, and (3) a lamp to the right of the subject. Both lamps used tungsten filaments.

Each image was acquired using all five available bands of the camera. The data were then either combined using the supplied software (and thus not readily open to scrutiny or optimization), which merges all five bands into three RGB channels in some unknown way, or selectively chosen (the red, green, and blue channels were preserved while the cyan and orange channels were discarded) to simulate an image taken with a standard three-channel camera. These data types are termed 5-band and 3-band, respectively. A sketch of the channel selection used to form the 3-band data is shown below.
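The channel selection itself amounts to a one-line operation, sketched here under the assumption that the demosaiced data arrive as an M x N x 5 array ordered red, green, blue, cyan, orange (the actual channel layout is not documented, so this ordering is hypothetical):

 % Keep the red, green, and blue planes; drop cyan and orange
 rgbOnly = data5band(:, :, 1:3);    % simulated 3-band image
 % data5band(:, :, 4:5) -- the cyan and orange planes -- are simply discarded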

Below is an example from each class of data (ripe banana, underripe banana, Red Delicious apple, and Gala apple) for each type of data (3-band on top, 5-band on bottom). It is visually apparent that the 3-band data is darker, and for good reason: we have essentially thrown away pixel data by discarding two of our five channels.


Scale-invariant feature transform

Scale-invariant feature transform (SIFT) is a widely used algorithm for detecting and describing local image features, introduced in 1999 by David Lowe [8]. The basic idea is to detect and localize distinctive points on objects in an image that describe the image well. This information is then used for comparison later on during classification.

The main steps of the algorithm are as follows: (1) Decompose the image into a difference of Gaussians (DoG) scale-space representation. This amounts to filtering the image with Gaussian filters of different widths and taking the difference on a pixel-by-pixel basis, which highlights blob-like keypoints, as we are essentially doing here. (2) Detect and localize minima and maxima across scales. These locations can be found with sub-pixel accuracy by fitting a 3D quadratic function. (3) Eliminate edge responses using the Hessian matrix, the square matrix of second-order partial derivatives, whose eigenvalue ratio distinguishes poorly localized edge points from well-localized keypoints. (4) Assign an orientation to each remaining keypoint and compute its descriptor from local gradient histograms.

What we are left with is a set of keypoints with values tied to the strength of each maximum/minimum (indicated by the size of the circles in the images below) as well as an orientation term (indicated by the radius marking). As an example, below are 3-band and 5-band images of a Gala apple using different threshold values for the SIFT procedure. With the higher threshold, a higher percentage of keypoints land on the subject as intended, at the cost of a drop in the overall number of keypoints.

(SIFT keypoints detected with no peak threshold and with a peak threshold of 1.5.)

SIFT was chosen for this project for its invariance to differences in contrast, brightness, Gaussian noise, and blurring, as shown below. These calculations were originally done by me for a homework assignment in EE368 Image Processing and are reproduced here. For these tests, I took an original image and varied brightness offsets, gamma contrast mapping, Gaussian white noise, and blurring with varying kernel sizes. While repeatability was strong for the first three cases, SIFT turns out to be quite sensitive to blurring. This can come into play if the subject is slightly out of focus, or if the focus changes from one camera position to another.

This approach was implemented using the open-source VLFeat library [12], originally written in C and interfaced through Matlab. A sketch of the keypoint extraction call is shown below.
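The extraction step itself is a single call to VLFeat's vl_sift, which expects a single-precision grayscale image and accepts the peak threshold discussed above as an option (0 keeps everything). The file name here is hypothetical.

 % Load an image, convert to single-precision grayscale as vl_sift requires
 img = im2single(rgb2gray(imread('gala_apple_3band.png')));
 
 % Detect keypoints and compute descriptors with a peak threshold of 1.5
 [frames, descriptors] = vl_sift(img, 'PeakThresh', 1.5);
 % frames:      4 x K matrix, one column [x; y; scale; orientation] per keypoint
 % descriptors: 128 x K matrix of SIFT descriptors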

Distance Ratio Test and Random sample consensus

Below are the two methods I used for eliminating bad keypoint matches, applied as a first pass and a second pass. Both were implemented using the open-source VLFeat library.

Distance Ratio Test

Five years after introducing SIFT in 1999, Lowe demonstrated a method for identifying and rejecting bad matches based on a distance ratio threshold [9]. Here the strongest matches have the lowest descriptor distances, and the ratio is calculated as the distance to the best match divided by the distance to the second-best match. All feature pairs are compared iteratively, and matches with ratios greater than 0.8 are rejected. In his study, this removed 90% of the false matches while discarding less than 5% of the correct ones.

In this study, a similar, albeit simpler, approach is used as a first pass for eliminating incorrect matches. In the original approach, efficiency was improved by using a best-bin-first algorithm that was cut off after checking the first 200 nearest-neighbor candidates; in my approach, I checked all candidates. A minimal sketch of this first pass is shown below.
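Here is a minimal Matlab sketch of the ratio test, written out explicitly (rather than through VLFeat's matcher) so that every candidate is checked, in keeping with the exhaustive search described above. It is an illustrative sketch, not my exact implementation; d1 and d2 are 128 x K descriptor matrices from the two images.

 function matches = ratio_test(d1, d2, maxRatio)       % e.g. maxRatio = 0.8
     matches = zeros(2, 0);
     d1 = double(d1); d2 = double(d2);
     for i = 1:size(d1, 2)
         % Euclidean distance from descriptor i to every descriptor in d2
         % (uses implicit expansion, Matlab R2016b or later)
         dists = sqrt(sum((d2 - d1(:, i)).^2, 1));
         [sorted, order] = sort(dists, 'ascend');
         % accept only if the best match is clearly better than the runner-up
         if sorted(1) / sorted(2) < maxRatio
             matches(:, end+1) = [i; order(1)];
         end
     end
 end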

Random sample consensus

Random sample consensus (RANSAC) is an iterative method for estimating model parameters from observed data, first introduced by Fischler and Bolles in 1981. A basic assumption of the approach is that the observed data set contains both inliers and outliers, and that the inliers fit a certain model. In our case, the model is a homography matrix that uses projective geometry to map one image onto another.

In this approach, a subset of k correspondences is randomly selected, and the geometric mapping parameters are computed from them using a linear solve. The geometric mapping is then applied to all keypoints, and we count the number of inliers, i.e., keypoints that land closer than some value sigma to their corresponding keypoint; sigma values between one and three pixels are typical. We repeat this process a total of S times, keeping the geometric mapping with the largest number of inliers. The required number of trials is determined via

S = log(1 - P) / log(1 - q^k),

where P is the total probability of success and q is the probability of a valid correspondence. In this case we expect q to be relatively small. A sketch of this procedure is given below.
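The following Matlab sketch shows the RANSAC loop described above, including the trial-count formula. It is an illustrative sketch rather than my exact implementation; fit_homography is a standard direct linear transform (DLT) solve, and p1, p2 are 2 x N matrices of matched keypoint coordinates.

 function [bestH, bestInliers] = ransac_homography(p1, p2, sigma, P, q)
     k = 4;                                        % correspondences per homography sample
     S = ceil(log(1 - P) / log(1 - q^k));          % e.g. P = 0.99, q = 0.5 -> S = 72 trials
     N = size(p1, 2);
     bestInliers = false(1, N);
     bestH = eye(3);
     for trial = 1:S
         idx = randperm(N, k);                     % random minimal sample
         H = fit_homography(p1(:, idx), p2(:, idx));
         proj = H * [p1; ones(1, N)];              % map all keypoints from image 1 to image 2
         proj = proj(1:2, :) ./ proj(3, :);
         inliers = sqrt(sum((proj - p2).^2, 1)) < sigma;
         if nnz(inliers) > nnz(bestInliers)        % keep the mapping with the most inliers
             bestInliers = inliers;
             bestH = H;
         end
     end
 end
 
 function H = fit_homography(p1, p2)               % direct linear transform
     n = size(p1, 2);
     A = zeros(2 * n, 9);
     for i = 1:n
         x = p1(1, i); y = p1(2, i); u = p2(1, i); v = p2(2, i);
         A(2*i-1, :) = [-x -y -1  0  0  0  u*x  u*y  u];
         A(2*i,   :) = [ 0  0  0 -x -y -1  v*x  v*y  v];
     end
     [~, ~, V] = svd(A);
     H = reshape(V(:, end), 3, 3)';                % null vector of A, reshaped row-wise
 end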

An example of this process is shown below. In the top image, we see two images of the same pieces of fruit under different lighting conditions and taken from different locations. The second image shows the keypoints that were detected using no peak threshold. Next, we eliminate matches whose distance ratios are greater than 0.8 via the distance ratio test. Finally, we eliminate outliers via RANSAC.

K-fold cross-validation

As is common in image matching studies, some form of cross-validation is needed to validate our process and assess how well specific results might translate to more general cases. In other words, we would like to estimate how accurately our predictive model will perform in practice. Assessing only a handful of hand-picked matches undersamples our data and likely skews the perceived results; instead, we aim to sample good and bad cases thoroughly by assessing performance over many different partitions of our data set.

All forms of cross-validation involve partitioning a sample of data into subsets, analyzing the training set, and validating on the complementary test set. Performing this partitioning and testing over multiple rounds with different partitions reduces variability, since results can be averaged across the rounds.

In this experiment I opted for K-fold cross-validation, visualized on the right [10]. The original sample is randomly partitioned into k equal-sized subsamples. Of the k subsamples, a single subsample is retained as the query set and the remaining k-1 subsamples are used as training data. The process is then repeated k times, with each of the k subsamples used exactly once as the query set.

The k-fold process is fairly flexible, with the value of k chosen by the experimenter. If k equals the number of observations in the original sample, the procedure is called leave-one-out cross-validation (LOOCV). This approach is very thorough, but iterative and slow. For example, my original experiment used k = 18, which amounted to LOOCV over my acquisition groups, and each data set (3-band or 5-band) took over three hours to test. When I started experimenting with parameters, I chose a lower value of k, trading slightly less precise accuracy numbers for much improved computation time.

Although Matlab has some built-in cross-validation support, I opted to implement the partitioning myself because of how my data was organized. A minimal sketch is shown below.
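Here is a minimal Matlab sketch of that manual k-fold partitioning. The function and variable names are hypothetical; 72 is the number of acquisitions in this data set.

 function folds = kfold_partition(nAcq, k)
     order = randperm(nAcq);                  % shuffle acquisition indices
     folds = cell(1, k);
     for i = 1:k
         folds{i} = order(i:k:end);           % every k-th index -> roughly equal folds
     end
 end
 
 % Usage: hold out one fold as the query set, match against the rest
 % folds = kfold_partition(72, 10);
 % for i = 1:numel(folds)
 %     querySet = folds{i};
 %     trainSet = setdiff(1:72, querySet);
 %     % ... run SIFT matching between querySet and trainSet ...
 % end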

Results

My overall best image matching results for this project were surprising and probably not "correct," but they are relatively explainable. The first and only LOOCV run I attempted (because of the computational requirements and time) used no SIFT keypoint thresholding (to get as many "hits" as possible) and discarded many-to-one matches in RANSAC (to help combat the resulting increase in false positives). Over all eighteen iterations, average performance for the 5-band data, 48.8%, was slightly worse than for the 3-band data, 51.2%, a difference of about 2.4 percentage points. This finding comes with an important caveat: accuracy was calculated as the fraction of iterations in which every one of the four images in the query set was matched to its corresponding image in the test set, with no errors whatsoever. This is not the typical definition of accuracy for image matching, and it is harsher than most standards, which is why my numbers are so low. While I did not compute accuracy using the more typical formula (number of correct matches minus number of incorrect matches, divided by total attempts), I can conservatively say that those numbers would be around 25% higher than what I chose to report. The strict metric is sketched below.
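In code form, the strict metric amounts to the following, where correctMatch is a hypothetical (iterations x 4) logical matrix recording whether each of the four query images matched its true counterpart in a given iteration:

 % An iteration only counts as correct if all four query images matched
 strictAccuracy = mean(all(correctMatch, 2));
 
 % A more conventional per-image rate, for comparison
 perImageAccuracy = mean(correctMatch(:));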

My other attempts varied parameters such as the SIFT threshold, the distance ratio value, and the RANSAC threshold, and although overall success rates varied, none matched the performance reported above, and none achieved higher matching performance for the 5-band data than for the 3-band data. For example, when I increased the SIFT peak threshold from zero to 1.5 but included many-to-one matches in RANSAC, the average matching performance dropped to 39.7% for 3-band data and 31.4% for 5-band data. This experiment used 10-fold cross-validation to keep computation times low and allow different parameter settings to be tried in a shorter amount of time.

My results indicate a few important findings. First, they contradict what has been reported in the literature, where every paper I could find reported increased performance when more spectral information was available.

This leads to the next finding: I need a better, more transparent method for using the 5-band data. As stated before, the demosaicing algorithm used to translate 5-band data into RGB images is unknown to me, and thus cannot be trusted to be accurate or ideal. In fact, it is quite likely that because we are trying to combine so much information into such a small representation, we are losing data that is critical for SIFT to work properly. As I found before, SIFT is highly sensitive to spatial blurring, and it also seems to be sensitive to spectral resolution. By stuffing five channels of information into space enough for three channels' worth, we may be shooting ourselves in the foot. It may be that other groups discovered or predicted this before publishing their results, because many of them combined the extra information in different ways; some even modified their SIFT algorithms to be multispectral, whereas the one implemented here works in grayscale. Therefore, by the time my data was analyzed by SIFT, it had already been demosaiced and down-converted. This is far from ideal, but I had not anticipated it hurting my results as much as I now suspect it did.

The last factor that may have hurt results was unintentional, but unavoidable, image blurring in my single round of image acquisitions. None of my data looked blurred at the time, and on a macro scale it does not seem blurry, but I remain convinced that with a little help from autofocusing software I could have achieved consistently sharper images. One way to do this would have been an active system in which an infrared signal is sent from atop the camera, bounced off the target, and returned to a sensor that records the actual distance; a motor could then adjust the focus accordingly. Another method would have been a passive approach: imaging a test target with sharp black and white lines and running a simple algorithm to search for maximum intensity differences. This again could drive a motorized focus control, or simply provide visual feedback for manual focus control if operated in real time. I make this point because SIFT is notoriously sensitive to blur, as discussed before. During my image acquisition, I was limited to manual focus with feedback from a macro-scale representation of the image, using the entire field of view. Should this experiment be repeated in the future, implementing a fix for more consistency here would be an important step forward.

Conclusions

In conclusion, I demonstrated the image matching performance of a five band camera, but achieved rather inconclusive final results. I discussed how ubiquitous computer vision approaches are becoming, even in consumer-grade electronics where real-world advantages are gained from clever software implementation. I gave a quick overview of the groups working on this specific problem, and how their findings were more successful and expected than my own. My experiment used a prototype 5-band camera that was calibrated using Macbeth reflectance data and the supplied responsivity data. I discussed the importance of image demosaicing and of transparency and accuracy in that operation. I discussed and implemented SIFT keypoint extraction, distance ratio tests, RANSAC model parameter estimation for eliminating outliers, and k-fold cross-validation for performance estimation. My best image matching/classifying performance was 51.2% accuracy for the 3-band data and 48.8% accuracy for the 5-band data.

I think the biggest lesson I learned is that an experiment such as this one is much like a chain: any particular link can introduce kinks. This kind of experiment incorporates many different considerations, including image calibration, acquisition, accuracy, consistency, software approaches, and validation schemes. I also found that I did not initially give the required computation time as much respect as it deserves, with each run of LOOCV taking about six hours in total.

If I were to take a second attempt at this project, or if another student were to try it, there are a number of changes I would make. First, I would spend more time on image acquisition and implement some sort of feedback loop for determining the best attainable focus; the emphasis would be on consistency. The feedback system could be active or (more likely for a class project) passive, but either would be much better than "eyeballing" it, as I did. Second, I think SIFT keypoint detection was a good choice for this project, but I would try a quicker approach for keypoint matching. For instance, I might try image retrieval using vocabulary trees, which is orders of magnitude faster at comparing images, although slightly less accurate; the improvement in computation time would have allowed me to try many more optimization configurations. Finally, I would try to find a better way of applying SIFT to the extra bands of information, being careful not to lose or compress details. This likely could be done by performing multiple passes of SIFT detection on the extra bands and either combining keypoint locations or running band-wise comparisons.

All in all, I view this project as a success despite findings that don't match what's been reported. I would be excited to tackle this project again in the future using the modified approaches listed above.

References

[1] Brown, M.; Susstrunk, S., "Multi-spectral SIFT for scene category recognition," Computer Vision and Pattern Recognition (CVPR), 2011 IEEE Conference on, pp. 177-184, 20-25 June 2011, doi: 10.1109/CVPR.2011.5995637. <http://ieeexplore.ieee.org/stamp/stamp.jsp?tp=&arnumber=5995637&isnumber=5995307>

[2] Namin, S.T.; Petersson, L., "Classification of materials in natural scenes using multi-spectral images," Intelligent Robots and Systems (IROS), 2012 IEEE/RSJ International Conference on, pp. 1393-1398, 7-12 Oct. 2012, doi: 10.1109/IROS.2012.6386074. <http://ieeexplore.ieee.org/stamp/stamp.jsp?tp=&arnumber=6386074&isnumber=6385431>

[3] Saleem, S.; Sablatnig, R., "A Robust SIFT Descriptor for Multispectral Images," Signal Processing Letters, IEEE, vol. 21, no. 4, pp. 400-403, April 2014, doi: 10.1109/LSP.2014.2304073. <http://ieeexplore.ieee.org/stamp/stamp.jsp?tp=&arnumber=6730675&isnumber=6732989>

[4] Xiao, Y.; Wu, J.; Yuan, J., "mCENTRIST: A Multi-Channel Feature Generation Mechanism for Scene Categorization," Image Processing, IEEE Transactions on, vol. 23, no. 2, pp. 823-836, Feb. 2014, doi: 10.1109/TIP.2013.2295756. <http://ieeexplore.ieee.org/stamp/stamp.jsp?tp=&arnumber=6690151&isnumber=6685907>

[5] Lloyds Raja, G.; Kolekar, M.H., "Illumination normalization for image restoration using modified retinex algorithm," India Conference (INDICON), 2012 Annual IEEE, pp. 941-946, 7-9 Dec. 2012, doi: 10.1109/INDCON.2012.6420752. <http://ieeexplore.ieee.org/stamp/stamp.jsp?tp=&arnumber=6420752&isnumber=6420575>

[6] Salamati, N.; Germain, A.; Susstrunk, S., "Removing shadows from images using color and near-infrared," Image Processing (ICIP), 2011 18th IEEE International Conference on, pp. 1713-1716, 11-14 Sept. 2011, doi: 10.1109/ICIP.2011.6115788. <http://ieeexplore.ieee.org/stamp/stamp.jsp?tp=&arnumber=6115788&isnumber=6115588>

[7] Park, J.-I.; Lee, M.-H.; Grossberg, M.D.; Nayar, S.K., "Multispectral Imaging Using Multiplexed Illumination," Computer Vision, 2007. ICCV 2007. IEEE 11th International Conference on, pp. 1-8, 14-21 Oct. 2007, doi: 10.1109/ICCV.2007.4409090. <http://ieeexplore.ieee.org/stamp/stamp.jsp?tp=&arnumber=4409090&isnumber=4408819>

[8] Lowe, D.G., "Object recognition from local scale-invariant features," Computer Vision, 1999. The Proceedings of the Seventh IEEE International Conference on, vol. 2, pp. 1150-1157, 1999, doi: 10.1109/ICCV.1999.790410. <http://ieeexplore.ieee.org/stamp/stamp.jsp?tp=&arnumber=790410&isnumber=17141>

[9] Lowe, D. G., “Distinctive Image Features from Scale-Invariant Keypoints”, International Journal of Computer Vision, 60, 2, pp. 91-110, 2004.

[10] <http://www.imtech.res.in/raghava/gpsr/Evaluation_Bioinformatics_Methods.htm>

[11] <http://en.wikipedia.org/wiki/Demosaicing>

[12] <http://www.vlfeat.org/>

Appendix I

Source Code

The code used for this project is included here in the interest of transparency. A quick readme file is included. All other software needed can be obtained directly from VLFeat's website [12].

File:Source Code Loewke.zip

Test Images

The following compressed data contains all 144 test images used during this experiment, organized by data type:

File:Test Image Data.zip

Additionally, the raw data captured from the camera can be accessed via the following 1.06GB download:

File:PSYCH221 Data Loewke.zip

Presentation

The following short presentation, saved in PDF form for online viewing, was given in Stanford's PSYCH 221 on March 18, 2014. It aims to cover as much of the information presented in this wiki as possible in under 10 minutes.

File:PSYCH221 Loewke.pdf