Image classification with a five band camera

Introduction

Image classification is a central topic in computer vision in which images are sorted into categories via comparison and prediction. Many state-of-the-art classification systems follow a now-standard procedure of feature extraction via dense grid descriptor sampling, coding into higher dimensions, pooling, and finally classification. Integrated computer vision techniques like object and scene recognition are also gaining traction in consumer-grade electronics. Two now-common examples in digital cameras are (1) scene recognition, in which appropriate aperture, shutter speed, and nonlinear gamma settings are selected, and (2) face detection, which is used to determine appropriate autofocus parameters and locations for redeye removal.

Aside from pure software-based developments, data acquisition is central to this problem. Most digital cameras use silicon-based CCD or CMOS sensor chips that are inherently sensitive to a broad spectrum of light, starting at short visible wavelengths and reaching into the near-infrared (NIR). Many of these sensors sit behind a series of filters that discretize or remove light, yielding three channels of digital data corresponding to the red, green, and blue (RGB) color channels. Such cameras are often termed "three band". It is argued here that a camera with sensitivity sampled more finely over a broader spectrum of wavelengths could be helpful to many computer vision techniques.

To test this hypothesis, a set of 72 images of four different pieces of fruit was taken with a prototype five band camera donated to Stanford's Center for Image Systems Engineering (SCIEN) lab, sampled using SIFT feature descriptors, and compared using RANSAC.

Background

In 2011, Brown and Susstrunk [1] used a modified single lens reflex (SLR) camera to capture several hundred three band color and NIR image pairs to show that the addition of NIR image information leads to significantly improved performance in scene recognition tasks. The authors tested multiple feature descriptors in their analysis, including color-specific SIFT, multispectral SIFT, GIST (feature descriptors generated using three scales and eight orientations per scale), and HMAX (a hierarchy of filtering and max-pooling operations computed independently and concatenated for each band). In each case, the authors found a general trend that adding more information led to better recognition performance, and that the SIFT-based descriptors showed the greatest improvements (recognition rates of 59.6% vs. 73.1% for RGB versus RGB + NIR, respectively).

In 2012, Namin and Petersson [2] presented a method of distinguishing between different materials occurring in natural scenes using a seven band camera. Instead of using SIFT for object detection and localization, the authors used a texture-based approach considering Fourier spectrum features from gray-level co-occurrence matrices (GLCMs), with classifiers built using support vector machines (SVM) and AdaBoost. Such an approach treats the entire image as one class of data, such as an image of grass or bushes, rather than identifying particular objects in an image. They found that adding extra spectral information improved their ability to discriminate between materials, achieving average classification accuracies of 91.9% and 89.1% for the SVM and AdaBoost classifiers, respectively, on a ten-class problem.

In light of these recent successes of image classification using multispectral imaging, many groups have begun adapting commonly used techniques to multi-band image data. For example, Saleem and Sablatnig [3] and Xiao, Wu, and Yuan [4] developed adapted feature sampling and extraction techniques for multi-band images. As another example, Raja and Kolekar [5] and Salamati, Germain, and Susstrunk [6] both developed improved shadow removal and image restoration techniques based on the use of extra spectral information.

Methods

Instruments

The camera used in this experiment was a prototype five band camera donated to our lab by Olympus. Because it is a prototype, many details about the camera are unknown, but we do know a few. The sensor itself is 1920 x 1080 pixels with a depth of 12 bits. When positioning, aiming, and focusing the camera, we were able to use all of its pixels in "1080P mode" using feedback from the preview software on the connected laptop. During data acquisition, however, the saved images were only 1280 x 720 pixels per channel.

Raw files generated by the Matlab acquisition software are linear (i.e., no gamma transform has been applied), 12-bit arrays with five channels per pixel location. Each sensor pixel measures one color (red, green, blue, orange, or cyan) determined by its position in the color filter array. The demosaicing operation is unknown, but it takes in 12-bit data and returns 16-bit upscaled data, which is then scaled down to 8 bits for RGB image formation.

The camera's spectral sensitivities were supplied with the camera and, unsurprisingly, indicate the highest peak sensitivity for the green channel and the lowest peak sensitivity for the blue channel. Usable sensitivity begins at about 450 nm (with the exception of the blue sensor, which appears to extend beyond the 380 nm limit of the supplied data) and ends at about 680 nm. It is also important to note how much overlap there is between the green channel and the two "extra" cyan and orange channels: the better the separation between those channels, the more additional information the five band data set contains. As with most consumer-grade cameras, sensitivity in the NIR is diminished, either because of the sensor itself or because of a "hot mirror" between the lens and sensor that filters out these longer wavelengths.

During image acquisition, three different lighting setups were used to see how light directionality, shadows, and sensitivity might affect classification results. One setup used the standard incandescent room lights; the other two used incandescent lamps. All used tungsten filaments, with a relative spectral luminous flux peaking in the NIR. This can be seen in the data supplied by the lab, although the measured spectrum ends at the NIR; if the measurements were continued, the flux would drop off slowly, falling roughly six-fold by 2500 nm.

Other than the camera and lighting elements, the only other items used were a PC laptop controlling the camera, a white paper backdrop for low-feature image backgrounds, and a Mac laptop for processing the images and running the image processing / computer vision software.

Calibration

Camera calibration was handled via a simplified method inspired by Park et al. [7], in which the authors recovered spectral reflectances in a scene using a multispectral camera and light source. In our case, we assumed that each image measurement m is composed of a gain g, spectral responsivity s, spectral illumination i, spectral reflectance r, and offset b.
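In the standard linear form implied by these terms (a hedged reconstruction, with the channel index left implicit), the model is

  m = g \int s(\lambda)\, i(\lambda)\, r(\lambda)\, d\lambda + b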

The first two terms to be determined are the gain and offset. If we first take an image with the lens cover on, the illumination and reflectance terms both drop to zero over all wavelengths, so the measurement directly gives us the offset term.

If we then take an image of a pure white reflectance target, we may assume that the reflectance remains constant and equal to one over all wavelengths. Combining this information with the offset term obtained from the first step and the spectral output data for our tungsten lamp, we can solve for the gain rather straightforwardly.
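Written out under those two assumptions (a dark frame, then an ideal white target with r(\lambda) = 1), a sketch of the solution is

  b = m_{\mathrm{dark}}, \qquad g = \frac{m_{\mathrm{white}} - b}{\int s(\lambda)\, i(\lambda)\, d\lambda}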

The last step is to solve for reflectance values and compare them to real-world data. For this, we imaged a Macbeth color calibration target, a cardboard-framed array of 24 painted sample squares intended to mimic the spectral reflectances of natural objects. The reference reflectance data are shown below. To obtain our observed reflectance values, we combine the tungsten illumination and camera sensor responsivity data on hand with our calculated gain and offset terms, and solve the model above for reflectance.

We are somewhat limited in terms of how many unknowns we are solving for versus how many observations we can make. Park et al. proposed using a linear model built from a set of orthogonal spectral basis functions, i.e., eigenfunctions of a correlation matrix derived from measured reflectances of color chips. I opted for a much simpler method, calibrating against the grayscale colors of the chart and taking advantage of the fact that their reflectances are roughly constant over our spectral bandwidth. By assuming they are constant, we can solve for all values over the spectrum at once, and we find that our reflectance values match those from the chart quite well, deviating by no more than 10% from the calibrated measurements.

Reference reflectance spectra for the Macbeth chart samples, grouped by row: natural, miscellaneous, primary, and gray.
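As a concrete illustration, a minimal MATLAB sketch of this gray-patch step under the constant-reflectance assumption (the variable names are hypothetical, not those of the actual calibration script):

  % m_gray: 1x5 mean raw values over one gray patch; b, g: 1x5 offset and gain terms;
  % S: 5xK channel responsivities sampled at K wavelengths; illum: Kx1 tungsten spectrum.
  w     = S * illum;                    % per-channel integral of s(lambda) * i(lambda)
  r_hat = (m_gray - b) ./ (g .* w');    % estimated (constant) reflectance for each channel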

Image demosaicing

Image demosaicing is a digital image processing algorithm used to reconstruct a full-color image from an incomplete color sampling. This sampling is the result of an image sensor array overlaid with a color filter array (CFA). A good demosaicer is one that reconstructs a color image while (1) not introducing false color artifacts such as aliasing or fringing, (2) preserving spatial resolution, (3) keeping computational complexity low for fast processing times (especially important if the software runs as in-product firmware), and (4) being open enough to allow analysis of noise and error.

The most common commercially used CFA configuration is the Bayer filter shown here. In this case, each 2x2 cell contains one blue short-pass filter, one red long-pass filter, and two green band-pass filters. Other commonly used CFAs are RGBE, CYYM, CYGM, RGBW Bayer, and RGBWs #1-3. The Bayer filter's widespread use is attributed to it being designed to mimic the spectral sensitivity of the human eye, which has higher sensitivity to green wavelengths.

Although many consumer-grade digital cameras can now save images in raw format and allow the user to demosaic their data with third-party software, we opted to use Olympus' demosaicer, which would be supplied in the form of built-in firmware should a product like this one reach the mass market.
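For reference, a minimal MATLAB sketch of a generic Bayer demosaic step (this uses the Image Processing Toolbox demosaic function and a hypothetical file name; it is not the Olympus five band pipeline, whose internals are unknown):

  mosaic = imread('bayer_raw.png');     % hypothetical single-channel CFA image (uint8 or uint16)
  rgb    = demosaic(mosaic, 'rggb');    % interpolate the two missing color samples at each pixel
  imshow(rgb);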

Data

The data used in this experiment are a set of 72 image acquisitions, for a total of 144 images, of four pieces of fruit: a ripe banana (mostly yellow in appearance), an underripe banana (green and yellow in appearance), a red delicious apple, and a gala apple. The set was used to determine not only whether we could distinguish between major classes (apple vs. banana), but also between subclasses (ripe vs. underripe banana, or type of apple), which would indicate further sensitivity to color differences. Images were taken from six different spatial locations (with varied XY position and distance from the subject) and under three different lighting conditions: (1) incandescent room lights on, (2) a diffused and reflected lamp to the left of the subject, and (3) a lamp to the right of the subject. Both lamps used tungsten filaments.

Each image was acquired using all five available bands of camera sensitivity. The data were then either combined using software provided by Olympus (and thus not readily open to scrutiny or optimization), which merges all five bands into three RGB channels in some unknown way, or selectively chosen (i.e., the red, green, and blue channels were preserved while the cyan and orange channels were discarded) to simulate an image taken with a standard three channel camera. These data types are termed 5-band and 3-band, respectively.
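A minimal MATLAB sketch of the 3-band simulation step (the file names, variable name, and channel ordering are assumptions for illustration only):

  raw   = load('acquisition_001.mat');          % hypothetical file holding an H x W x 5 array 'bands'
  bands = double(raw.bands) / 4095;             % scale linear 12-bit values to [0, 1]
  rgb3  = bands(:, :, 1:3);                     % keep red, green, blue; discard cyan and orange
  imwrite(rgb3, 'acquisition_001_3band.png');   % simulated three channel image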

An example set of each class of data (ripe banana, underripe banana, red delicious apple, and gala apple) is shown below for each type of data (3-band on top, 5-band on bottom). It is visually apparent that the three band data are darker, and for good reason: we have essentially thrown away pixels by discarding two of our five channels of data.

Class 1 (ripe banana), Class 2 (underripe banana), Class 3 (red delicious apple), Class 4 (gala apple).

Scale-invariant feature transform

Scale-invariant feature transform (SIFT) is a very widely used algorithm for detecting and describing local digital image features, introduced in 1999 by David Lowe [8]. The basic idea is to detect and localize interesting or descriptive points on objects in an image that can describe the image well. This information is then used for comparison later on during classification.

The main steps of the algorithm are as follows. (1) Decompose the image into a difference-of-Gaussians (DoG) scale-space representation. This amounts to filtering the image with two Gaussian filters of different sizes and taking the difference on a pixel-by-pixel basis, a standard way of detecting blob-like interest points, which is essentially what we are doing here. (2) Detect and localize minima and maxima across scales. These locations can be determined with sub-pixel accuracy by fitting a 3D quadratic function. (3) Eliminate edge responses using the Hessian matrix, a square matrix of second-order partial derivatives whose ratio of principal curvatures distinguishes poorly localized edge responses from well-localized keypoints.
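A minimal MATLAB sketch of step (1), the difference-of-Gaussians computation at one pair of scales (Image Processing Toolbox; the file name is hypothetical):

  I     = im2double(rgb2gray(imread('gala_apple_3band.png')));
  sigma = 1.6;  k = sqrt(2);                                   % two adjacent scales within one octave
  dog   = imgaussfilt(I, k * sigma) - imgaussfilt(I, sigma);   % pixel-by-pixel difference of two blurs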

What we are left with is a set of keypoints, each with a value corresponding to the strength of its maximum/minimum (indicated by the size of the circles in the images below) as well as an orientation term (indicated by the radius marking). As an example, below are a set of 3-band and 5-band images of a gala apple processed using different peak thresholds in the SIFT procedure. With the higher threshold, a higher percentage of keypoints lands on the subject as intended, at the cost of a drop in the overall number of keypoints.

SIFT keypoints computed with no peak threshold and with a peak threshold of 1.5.
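A minimal sketch of how these keypoints can be computed with the VLFeat Matlab interface (the peak threshold value matches the figure above; the file name is hypothetical):

  im = imread('gala_apple_3band.png');       % hypothetical example image
  I  = single(rgb2gray(im));                 % vl_sift expects a single-precision grayscale image
  [f1, d1] = vl_sift(I);                     % no peak threshold
  [f2, d2] = vl_sift(I, 'PeakThresh', 1.5);  % peak threshold = 1.5
  imshow(im); hold on;
  vl_plotframe(f2);                          % circle size = scale, radius marking = orientation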

SIFT was chosen for this project for its invariance to differences in contrast, brightness, and Gaussian noise, as shown below. These calculations were originally done by me for a homework assignment in EE368 Image Processing and are reproduced here. For these tests, I took an original image and varied brightness offsets, gamma contrast mapping, Gaussian white noise, and blurring with varying kernel sizes. While repeatability was strong for the first three cases, SIFT turns out to be quite sensitive to blurring. This can come into play if the subject is slightly out of focus, or if the focus changes from one camera position to another.

This approach was implemented using the open source VLFeat library, originally written in C, and interfaced in Matlab.

Distance Ratio Test and Random sample consensus

Below are the two methods I used for eliminating bad keypoint matches, applied as a first-pass, second-pass scheme. Both were implemented using the open source VLFeat library.

Distance Ratio Test

Five years after introducing SIFT in 1999, Lowe demonstrated a method for identifying and rejecting bad matches in which a distance ratio is calculated and used as a threshold [9]. Here the strongest matches have the lowest distances, and the ratio is calculated as the distance of the best match divided by the distance of the second-best match. All feature pairs are compared iteratively, and pairs with ratios greater than 0.8 are rejected. In his study, this removed 90% of the false matches while discarding less than 5% of the correct ones.

In this study, a similar, albeit simpler, approach is used as a first pass for eliminating incorrect matches. In the original approach, efficiency was improved by using a best-bin-first algorithm that was cut off after checking the first 200 nearest neighbor candidates. In my approach, I checked all candidates.
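A minimal Matlab sketch of this first-pass ratio test, checking all candidates by brute force (the variable names are hypothetical; d1 and d2 stand for the 128 x N descriptor matrices returned by vl_sift for the two images being compared):

  matches = zeros(2, 0);                       % columns of [index in image 1; index in image 2]
  for i = 1:size(d1, 2)
      dists = sqrt(sum((double(d2) - double(d1(:, i))).^2, 1));  % distance to every candidate in image 2
      [sortedDists, order] = sort(dists, 'ascend');
      if sortedDists(1) / sortedDists(2) < 0.8                   % keep only unambiguous matches (ratio test)
          matches(:, end + 1) = [i; order(1)];
      end
  end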

Random sample consensus

Random sample consensus (RANSAC) is an iterative method for estimating model parameters from data observations, first introduced by Fischler and Bolles in 1981. A basic assumption in this approach is that our observed data set has both inliers and outliers, and that the inliers fit a certain model. In this case, our model is a homography matrix using projective geometry to map one image to another.

In this approach, a subset of k correspondences is randomly selected, and geometric mapping parameters are computed using linear regression. The geometric mapping is then applied to all keypoints, and we count the number of inliers, i.e., keypoints that land within some distance sigma of their corresponding keypoints. Sigma values between one and three pixels are typical. We repeat this process a total of S times, keeping the geometric mapping with the largest number of inliers. The required number of trials is determined as follows.
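In its standard form (a hedged reconstruction from the surrounding definitions, with k the number of correspondences drawn per trial), this is

  S = \frac{\log(1 - P)}{\log\!\left(1 - q^{k}\right)}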

where P is the total probability of success and q is the probability of valid correspondence. In this case we expect q to be relatively small.

An example of this process is shown below.
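As a concrete illustration, a minimal Matlab sketch of such a RANSAC loop for a homography (this is an illustrative reimplementation under the assumptions above, not the exact code used; x1 and x2 are 3 x N homogeneous coordinates of the first-pass matches, and the function names are hypothetical):

  function [Hbest, inliersBest] = ransac_homography(x1, x2, S, sigma)
      % Keep the homography with the largest inlier count over S random trials.
      N = size(x1, 2);
      inliersBest = [];
      Hbest = eye(3);
      for trial = 1:S
          idx = randperm(N, 4);                       % k = 4 correspondences define a homography
          H   = fit_homography(x1(:, idx), x2(:, idx));
          p   = H * x1;
          p   = p(1:2, :) ./ p(3, :);                 % map every keypoint with the candidate homography
          q   = x2(1:2, :) ./ x2(3, :);
          d   = sqrt(sum((p - q).^2, 1));             % reprojection distance for every match
          inliers = find(d < sigma);
          if numel(inliers) > numel(inliersBest)
              inliersBest = inliers;
              Hbest = H;
          end
      end
  end

  function H = fit_homography(p, q)
      % Direct linear transform from four correspondences p -> q (both 3 x 4, homogeneous).
      A = zeros(8, 9);
      for j = 1:4
          X = p(:, j)';
          x = q(1, j) / q(3, j);  y = q(2, j) / q(3, j);
          A(2*j - 1, :) = [zeros(1, 3), -X, y * X];
          A(2*j,     :) = [X, zeros(1, 3), -x * X];
      end
      [~, ~, V] = svd(A);
      H = reshape(V(:, end), 3, 3)';                  % last right singular vector, rows of H stacked
  end

In practice, the winning homography is often refit to all of its inliers with a final least-squares step before being used for comparison.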

K-folds cross-validation

Results

- Organize your results in a good logical order (not necessarily historical order). Include relevant graphs and/or images. Make sure graph axes are labeled. Make sure you draw the reader's attention to the key element of the figure. The key aspect should be the most visible element of the figure or graph. Help the reader by writing a clear figure caption.

Conclusions

- Describe what you learned. What worked? What didn't? Why? What should someone next year try?

References

[1] Brown, M.; Susstrunk, S., "Multi-spectral SIFT for scene category recognition," IEEE Conference on Computer Vision and Pattern Recognition (CVPR), pp. 177-184, June 2011. doi: 10.1109/CVPR.2011.5995637. <http://ieeexplore.ieee.org/stamp/stamp.jsp?tp=&arnumber=5995637&isnumber=5995307>

[2] Namin, S.T.; Petersson, L., "Classification of materials in natural scenes using multi-spectral images," IEEE/RSJ International Conference on Intelligent Robots and Systems (IROS), pp. 1393-1398, Oct. 2012. doi: 10.1109/IROS.2012.6386074. <http://ieeexplore.ieee.org/stamp/stamp.jsp?tp=&arnumber=6386074&isnumber=6385431>

[3] Saleem, S.; Sablatnig, R., "A Robust SIFT Descriptor for Multispectral Images," IEEE Signal Processing Letters, vol. 21, no. 4, pp. 400-403, April 2014. doi: 10.1109/LSP.2014.2304073. <http://ieeexplore.ieee.org/stamp/stamp.jsp?tp=&arnumber=6730675&isnumber=6732989>

[4] Xiao, Y.; Wu, J.; Yuan, J., "mCENTRIST: A Multi-Channel Feature Generation Mechanism for Scene Categorization," IEEE Transactions on Image Processing, vol. 23, no. 2, pp. 823-836, Feb. 2014. doi: 10.1109/TIP.2013.2295756. <http://ieeexplore.ieee.org/stamp/stamp.jsp?tp=&arnumber=6690151&isnumber=6685907>

[5] Lloyds Raja, G.; Kolekar, M.H., "Illumination normalization for image restoration using modified retinex algorithm," Annual IEEE India Conference (INDICON), pp. 941-946, Dec. 2012. doi: 10.1109/INDCON.2012.6420752. <http://ieeexplore.ieee.org/stamp/stamp.jsp?tp=&arnumber=6420752&isnumber=6420575>

[6] Salamati, N.; Germain, A.; Susstrunk, S., "Removing shadows from images using color and near-infrared," 18th IEEE International Conference on Image Processing (ICIP), pp. 1713-1716, Sept. 2011. doi: 10.1109/ICIP.2011.6115788. <http://ieeexplore.ieee.org/stamp/stamp.jsp?tp=&arnumber=6115788&isnumber=6115588>

[7] Park, J.-I.; Lee, M.-H.; Grossberg, M.D.; Nayar, S.K., "Multispectral Imaging Using Multiplexed Illumination," IEEE 11th International Conference on Computer Vision (ICCV), pp. 1-8, Oct. 2007. doi: 10.1109/ICCV.2007.4409090. <http://ieeexplore.ieee.org/stamp/stamp.jsp?tp=&arnumber=4409090&isnumber=4408819>

[8] Lowe, D.G., "Object recognition from local scale-invariant features," Proceedings of the Seventh IEEE International Conference on Computer Vision (ICCV), vol. 2, pp. 1150-1157, 1999. doi: 10.1109/ICCV.1999.790410. <http://ieeexplore.ieee.org/stamp/stamp.jsp?tp=&arnumber=790410&isnumber=17141>

[9] Lowe, D.G., "Distinctive Image Features from Scale-Invariant Keypoints," International Journal of Computer Vision, vol. 60, no. 2, pp. 91-110, 2004.

Appendix I

Source Code

Test Images

The following compressed data contains all 144 test images used during this experiment, organized by data type:

File:Test Image Data.zip

Additionally, the raw data captured from the camera can be accessed via the following 1.06GB download:

File:PSYCH221 Data Loewke.zip

Presentation

The following short presentation, saved in PDF form for online viewing, was given in Stanford's PSYCH 221 on March 18, 2014. It aims to cover as much of the information presented in this wiki as possible in under 10 minutes.

File:PSYCH221 Loewke.pdf