David Knight
Title
Skin Segmentation of Noisy, Low-Resolution RGBZ Hand Images: A Color Space Approach
Background
Introduction
The latest generation of video game consoles has brought with it new user input devices that can read the user's body posture and motion. One such device is the Microsoft Kinect, which is equipped with both a color camera and an infrared depth sensing system. The depth data sensed by the Kinect aids computer vision tasks such as segmentation, skeletal fitting, and pose estimation. However, the spatial coarseness of the depth data makes it difficult to extract detailed information about extremities, such as the hands, when the user is farther away from the sensor. This project aims to use the color data returned from the Kinect in conjunction with the depth data to segment skin regions out of hand images. Extracting accurate skin regions around the hands provides a step toward building a software system that can robustly interpret hand gestures using the Microsoft Kinect.
Data
The image data set was output from a skeletal fitting pipeline previously created by Hendrik Dahlkamp and Christian Plagemann. The skeletal fitting stage operates on the full 640 x 480 pixel frames output by the Kinect hardware and produces a cropped image of only the right hand region. An image sequence of 561 frames was gathered indoors with the test subject standing three meters away from the Kinect. Hands are roughly centered in all the images, and the images are 64 x 64 pixels in size with four 8-bit channels: red, green, blue, and depth. The depth channel is calibrated such that the hand depth is centered about level 127 (of 255), and each level represents 3 mm.
Methods
Operations are organized into the processing pipeline represented in Figure 1. The 8-bit values of the images in the range [0, 255] are converted to floating point values in the range [0, 1] at the start of the pipeline. All image levels referred to from this point onward assume a floating point range of [0, 1].
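As a minimal sketch of this conversion step (not the project code), assume one RGBZ frame has already been loaded into the MATLAB workspace as a 64 x 64 x 4 uint8 array named `frame`; the variable names are illustrative only:

```matlab
% Convert 8-bit levels [0, 255] to floating point values in [0, 1]
frame = im2double(frame);      % im2double rescales uint8 data to [0, 1]
rgb   = frame(:, :, 1:3);      % red, green, blue channels
depth = frame(:, :, 4);        % depth channel, hand centered near 0.5
```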
Denoising
Because the overall goal of this pipeline is to extract a region of the image (shape data), the objective of this denoising stage differs from denoising images meant for human consumption. An ideal image for extracting a region boundary would be a cartoon-like image with hard edges between segments and solid swaths of color within segments. A bilateral filter is used with heavy blurring parameters to remove both noise and textures on otherwise solid color surfaces while preserving hard edges [1]. Example before and after images are shown in Figure 2.
In the before example image, there is a clear presence of both shot noise and a horizontal zig-zag pattern on edges. The shot noise is removed by the heavy blurring from the bilateral filter, but the zig-zag noise remains. The pipeline does not address the zig-zag noise, but it might be due to the demosaicking algorithm used by the Kinect, or possibly due to interlacing as was suggested during the presentation. Unfortunately, this pipeline does not have control over the demosaicking method used because it does not handle the acquisition of raw image data from the Kinect unit.
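A hedged sketch of the bilateral filtering stage is shown below. It uses MATLAB's built-in `imbilatfilt` (R2018a and later) applied to the `rgb` channels from the earlier sketch, rather than the File Exchange implementation listed under Software, and the smoothing parameters are illustrative guesses rather than the values used in the project:

```matlab
% Heavy bilateral smoothing: flatten texture and shot noise while keeping edges
degreeOfSmoothing = 0.1;   % range (intensity) smoothing; large value = heavy blur
spatialSigma      = 3;     % spatial extent of the filter kernel in pixels
rgbDenoised = imbilatfilt(rgb, degreeOfSmoothing, spatialSigma);
```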
Skin Classification
Skin classification is broken down into training and prediction. Training occurs before the pipeline runs, and prediction occurs within the skin classification block of the pipeline.
Training
The training process involves using skin reflectance data to generate a simulated camera image that is used to pick four reference skin colors in the CIELAB space. These reference skin colors are then used by the pipeline to calculate distance values between the reference skin colors and each pixel's color in a hand image.
The ISET MATLAB toolbox is used to generate a simulated camera image of a scene containing test colors specified by reflectance. The s_reflectanceCharts and s_SimulateSystem demos included with the ISET toolbox are used as a foundation for the simulations. The default Nikon D100 spectral quantum efficiency profile was replaced with an estimated spectral quantum efficiency profile of the Aptina MT9M112 sensor used in the Kinect (see Appendix I for the datasheet). This profile was estimated by tracing the area under the curve of the normalized spectral quantum efficiency graph in the MT9M112 datasheet and summing vertical pixels in ranges along the horizontal wavelength axis. The values for each wavelength were scaled such that the peak quantum efficiency of the green channel matched the peak green quantum efficiency of the Nikon D100 profile. Other sensor parameters from s_reflectanceCharts and s_SimulateSystem were not changed.
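The rescaling step might look like the following sketch, where `kinectQE` and `nikonQE` are assumed wavelength-by-channel (R, G, B) matrices holding the traced MT9M112 profile and the default Nikon D100 profile:

```matlab
% Scale the traced Kinect QE curves so the green peak matches the Nikon green peak
scale    = max(nikonQE(:, 2)) / max(kinectQE(:, 2));   % column 2 = green channel
kinectQE = kinectQE * scale;                           % apply to all channels
```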
Two color chart scenes are simulated using D65 lighting: one of skin colors and one of primary colors. The skin reflectance data comes from an ISO dataset of 8,570 skin samples [2]. The primary color reflectance data comes from the ISET DupontPaintChip_Vhrel set. Figure 3 shows a screenshot of the skin scene simulation including the post-processing settings used. In particular, gray world color correction is disabled.
A reference white color needs to be selected from the simulated primary color image in order to convert skin colors from RGB to CIELAB space. This RGB white value is converted into CIE XYZ space using ISET's generic CRT monitor phosphor power spectral densities. Next, the simulated skin image is cropped to produce an image of only skin colors (without the black background). The RGB values of each pixel in this skin color image are converted to the CIELAB space using the XYZ reference white value and the same ISET CRT monitor phosphor power spectral densities. The CIELAB values for each pixel are then clustered using K-means into four reference skin colors. These four CIELAB color values are saved and used by the pipeline. Note that K-means clusters on Euclidean distance, so applying K-means to CIELAB values inherently groups them by CIELAB distance. Figure 4 shows example reference skin colors as they might appear under D65 lighting conditions.
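A hedged sketch of the clustering step is shown below, using standard Image Processing and Statistics Toolbox functions (`rgb2xyz`, `rgb2lab`, `kmeans`) in place of the ISET conversion routines; `skinRGB` (the cropped simulated skin image) and `whiteRGB` (the selected reference white as a 1 x 3 vector) are assumed inputs:

```matlab
% Convert the reference white to XYZ, convert skin pixels to CIELAB, and cluster
whiteXYZ = rgb2xyz(whiteRGB);                         % 1 x 3 reference white in XYZ
labImg   = rgb2lab(skinRGB, 'WhitePoint', whiteXYZ);  % per-pixel CIELAB values
labPix   = reshape(labImg, [], 3);                    % one row per pixel
% kmeans minimizes Euclidean distance, which here is CIELAB distance
[~, refSkinLab] = kmeans(labPix, 4, 'Replicates', 5); % 4 x 3 reference skin colors
```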
Prediction
In order to make comparisons between hand image color values and the reference skin colors, all of the hand image color pixel values are converted from RGB to CIELAB space. An RGB white value of (0.5490, 0.5216, 0.5529) was hand-selected from the hand image sequence, but ideally an algorithm would be used to select a white point from a full 640 x 480 pixel Kinect image.
As shown in Equation 1, a modified metric is used to judge color similarities.
$$D_k(i,j) = \sqrt{\big(L(i,j) - L_k\big)^2 + \big(a(i,j) - a_k\big)^2 + \big(b(i,j) - b_k\big)^2 + d_x(i,j)^2 + d_y(i,j)^2} \qquad \text{(Equation 1)}$$

- $L(i,j)$, $a(i,j)$, $b(i,j)$ -- CIELAB values of the hand image at pixel $(i,j)$
- $L_k$, $a_k$, $b_k$ -- CIELAB values of reference skin color $k$
- $d_x(i,j)$ -- horizontal distance to nearest estimated skin region
- $d_y(i,j)$ -- vertical distance to nearest estimated skin region
It is known that the depth layer is calibrated such that the hand is centered around a value of 0.5. Thresholding the depth data on the range [0.2, 0.8] provides a rough estimate of the hand foreground region. A distance transform is applied to the thresholded depth data, replacing each background pixel with its minimum Euclidean distance to the foreground region. These distances are included in the metric as the $d_x$ and $d_y$ terms, which penalize pixels outside the depth-estimated hand region because such pixels are less likely to be skin.
Using this metric, a $D_k$ image is generated for each reference skin color, where each pixel value is the $D_k$ value between that reference skin color and the CIELAB value of the hand image at that pixel. For each of the four $D_k$ images, the minimum value and its pixel location are stored. Figure 5 shows an example false-color $D_k$ image. Figure 6 shows a false-color image of the minimum spatial distance of each pixel to the hand foreground region estimated from the depth data.
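The prediction step might be sketched as follows, reusing `depth` and `refSkinLab` from the earlier sketches and assuming `labHand` is the 64 x 64 x 3 CIELAB version of the denoised hand image; the exact construction of the $d_x$ and $d_y$ terms is an assumption based on the description above:

```matlab
% Estimate the hand region from depth and build the distance terms
handMask = depth >= 0.2 & depth <= 0.8;            % rough foreground estimate
[dist, nearestIdx] = bwdist(handMask);             % Euclidean distance transform
[rows, cols] = ndgrid(1:size(handMask, 1), 1:size(handMask, 2));
[nr, nc] = ind2sub(size(handMask), double(nearestIdx));
dy = rows - nr;                                    % vertical distance to hand region
dx = cols - nc;                                    % horizontal distance to hand region

% Build one D_k image per reference skin color (Equation 1)
D = zeros([size(handMask), 4]);
DkMin = zeros(1, 4);  minIdx = zeros(1, 4);
for k = 1:4
    dLab = labHand - reshape(refSkinLab(k, :), 1, 1, 3);     % CIELAB differences
    D(:, :, k) = sqrt(sum(dLab.^2, 3) + dx.^2 + dy.^2);      % modified metric
    [DkMin(k), minIdx(k)] = min(reshape(D(:, :, k), [], 1)); % most skin-like pixel
end
```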
Region Highlighting
Region highlighting is performed by operating on the Cr channel after converting the RGB hand image data to the YCbCr color space. This approach is motivated by the fact that human skin has a strong red color component, as shown by the RGB skin-color histograms of Jones and Rehg [3]. The faint outline of a hand can be seen in Figure 7, which depicts an untouched Cr channel of an example image.
A highlighted image layer is generated using both the Cr channel and the $D_k$ images generated during the skin classification stage. For each reference skin color, the Cr value of the pixel with the minimum $D_k$ is recorded; in other words, the Cr value of the most skin-like pixel is located. A weighted sum of Cr difference layers is generated as shown in Equation 2. Each difference layer is the absolute difference between the most skin-like Cr value and each pixel of the Cr channel. The weight for each reference skin color's difference layer is inversely proportional to the minimum $D_k$ for that reference skin color. Thus, lower weights are applied to difference layers generated from reference skin colors that do not closely match the skin color of the hand image.
$$S(i,j) = \sum_{k=1}^{4} \frac{1}{D_k^{\min}} \left| \mathrm{Cr}(i,j) - \mathrm{Cr}^{*}_{k} \right| \qquad \text{(Equation 2)}$$

- $\mathrm{Cr}(i,j)$ -- pixel at row $i$, column $j$ of the Cr channel
- $D_k^{\min}$ -- minimum $D_k$ value to reference skin color $k$
- $\mathrm{Cr}^{*}_{k}$ -- Cr value at the pixel location of $D_k^{\min}$
- $S(i,j)$ -- weighted sum of Cr difference images
The image generated from Equation 2 is inverted and normalized such that white pixels are most likely to be skin and black pixels are least likely. Finally, this highlighted image is point-wise multiplied by a normalized version of the distance transform image previously depicted in Figure 6. This multiplication again penalizes pixels outside the hand region expected from the depth data. Figure 8 shows an example false-color image output by this stage after inversion and normalization.
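A hedged sketch of this highlighting stage (Equation 2 followed by inversion, normalization, and the distance-image weighting) is shown below, reusing `rgbDenoised`, `DkMin`, `minIdx`, and `dist` from the earlier sketches; inverting the normalized distance image so that the hand region carries a weight of 1 is an assumption about how the multiplication penalizes non-hand pixels:

```matlab
% Weighted sum of Cr difference layers (Equation 2)
ycbcr = rgb2ycbcr(rgbDenoised);
Cr    = ycbcr(:, :, 3);
S = zeros(size(Cr));
for k = 1:4
    CrStar = Cr(minIdx(k));                            % Cr value of most skin-like pixel
    S = S + (1 / (DkMin(k) + eps)) * abs(Cr - CrStar); % weight ~ 1 / minimum D_k
end

% Invert and normalize so white = most skin-like, then weight by the depth distance
S = 1 - (S - min(S(:))) / (max(S(:)) - min(S(:)));
distWeight  = 1 - dist / max(dist(:));                 % 1 inside hand region, falls off outside
highlighted = S .* distWeight;                         % penalize pixels outside the hand region
```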
Thresholding
The highlighted image output from the previous stage is thresholded to generate a final binary image. Otsu's Method is a common image processing technique to threshold grayscale images based on minimizing a cost function of the foreground and background variances [4]. Equation 3 shows the cost function that is minimized.
$$C(t) = \frac{N_b(t)}{N}\,\sigma_b^2(t) + \frac{N_f(t)}{N}\,\sigma_f^2(t) \qquad \text{(Equation 3)}$$

- $N$ -- total number of pixels in the image
- $N_b(t)$ -- number of background pixels for a given threshold $t$
- $N_f(t)$ -- number of foreground pixels for a given threshold $t$
- $\sigma_b^2(t)$ -- variance of background pixel values for a given threshold $t$
- $\sigma_f^2(t)$ -- variance of foreground pixel values for a given threshold $t$
- $C(t)$ -- cost function for a given threshold $t$
Because of the small image size, the minimization is done naively by stepping through threshold values with a step size of 0.01.
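The naive search might look like the following sketch, where `highlighted` is the output of the region highlighting stage:

```matlab
% Sweep thresholds in 0.01 steps and keep the one minimizing Equation 3
N = numel(highlighted);
bestCost = Inf;  bestT = 0;
for t = 0:0.01:1
    fg = highlighted(highlighted >  t);               % foreground values
    bg = highlighted(highlighted <= t);               % background values
    if isempty(fg) || isempty(bg), continue; end
    cost = (numel(bg) / N) * var(bg) + (numel(fg) / N) * var(fg);
    if cost < bestCost
        bestCost = cost;  bestT = t;
    end
end
skinMask = highlighted > bestT;                       % final binary skin region
```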
Results
Test Sets
The primary test set consists of 12 hand images taken every 50 frames from the indoor image sequence. A ground truth test set was generated by hand labeling skin regions. A depth-only segmentation results set was generated by thresholding the depth channel on the range [0.2, 0.8]. Finally, a color space segmentation results set was generated using the pipeline described in the Methods section.
UPDATE: A more challenging test set of 5 hand images taken every 50 frames from an outdoor image sequence has been added. In addition to being captured outdoors in overcast weather, these images have more varied backgrounds than the indoor test set. Ground truth, depth-only thresholding, and color space segmentation result sets were generated with the same methods as for the indoor test images.
Visual Comparison
Indoor Images
Outdoor Images
Update: The zig-zag effect along edges (seen previously in Figure 2) is even stronger in the outdoor test images than it is in the indoor test images. Also of note is that the zig-zag is only horizontal, making it more likely that the effect is due to interlacing.
Quantitative Results
Indoor Images
Quantitative results are based on counting false positive and false negative labeled pixels in both the depth-only results set and the color space results set. A false positive is a pixel labeled as skin that is labeled as not-skin in the ground truth set, and a false negative is a pixel labeled as not-skin that is labeled as skin in the ground truth set.
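As a small sketch of the counting, assuming `skinMask` is a result-set label image and `gtMask` the corresponding hand-labeled ground truth (both logical 64 x 64 masks):

```matlab
falsePositives = nnz(skinMask & ~gtMask);   % labeled skin, ground truth not-skin
falseNegatives = nnz(~skinMask & gtMask);   % labeled not-skin, ground truth skin
```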
Color space segmentation reduces false region labeling for every test image except number 12, which saw an increase in the number of false positive labels. The visual comparisons show that these false positives are due to a red region in the background being falsely labeled as skin. Other issues are that test image 1 has a high percentage of false positives because red and white background content is labeled as skin, and test image 3 has a high number of false negatives because the detected skin region is tighter than the ground truth region.
Outdoor Images
Across all test images, false positives occur much more frequently than false negatives. The color space segmentation pipeline produced more false positives than the depth-only thresholding method in test images 3, 4, and 5. In contrast, the color space segmentation method produced slightly fewer false negatives across all test images.
Runtime Performance
Average measurements of the time needed to run each stage of the pipeline are shown in Table 1. Averages were taken across the 12 indoor test images.
Pipeline Stage | Average Run Time (s) | Percent of Total |
Denoising | 0.7207 | 89.8% |
Skin Classification | 0.0454 | 5.7% |
Region Highlighting | 0.0147 | 1.8% |
Thresholding | 0.0215 | 2.7% |
Table 1
The denoising stage is the primary contributor of delay when running the pipeline.
Conclusions
For most of the indoor test images, the color space pipeline produces tighter skin regions than a fixed threshold on the depth data. Additionally, counting pixel region labels shows that the color space pipeline produces fewer false negative and false positive pixel labels per image than the fixed depth threshold. However, in two test images the color space method falsely labels red patches in the scene background as skin because of its reliance on the Cr channel during the highlighting stage.
Unfortunately, the color space pipeline produces unrecognizable results on the outdoor test images. Looking at the resulting regions, it is clear that significant portions of reddish background are incorrectly labeled as skin, accounting for the unacceptably high number of false positive skin labels. However, the number of false negative pixels is less than 10% across all the outdoor test images, indicating that highlighting actual skin regions in outdoor lighting is not an issue.
Currently, highlighting is performed primarily with information from the Cr channel. Multiplying the distance transform image of the depth-thresholded region with the Cr difference layers is not selective enough to fully reject reddish objects in scene backgrounds. Additionally, the $D_k$ images are only used to find pixels that closely match the reference skin colors. A separate rejection stage might be needed in order to reject highlighted regions that appear skin-like in color but have a depth value that is too far from or too close to the Kinect.
Another issue is that the current implementation of the pipeline takes almost one second to execute, which is not fast enough to run in real-time on a video stream. The major obstacle preventing the pipeline from running in real-time is the expense of running the bilateral filter. The current bilateral filter implementation is a non-compiled MATLAB function. Using a compiled MEX implementation might provide a significant speed improvement. Additionally, thresholding with Otsu's method can be performed with fewer computations by working on an equivalent maximization problem and updating parameters incrementally instead of recalculating statistics each iteration. The rest of the pipeline without the bilateral filter still takes roughly 0.08 seconds to complete, which is not fast enough to process incoming video at 30 frames per second. Reimplementing the pipeline in C with OpenCV might decrease the processing time enough to allow for processing to take place on a video stream.
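As an illustrative sketch (not the project code) of that equivalent formulation, the threshold can be found from a histogram by maximizing the between-class variance while updating the class sums incrementally:

```matlab
% Incremental Otsu: maximize between-class variance over a fixed histogram
nBins  = 101;                                        % 0.01 steps over [0, 1]
counts = histcounts(highlighted(:), linspace(0, 1, nBins + 1));
levels = (0:nBins - 1) / (nBins - 1);                % representative bin levels
total  = sum(counts);  totalSum = sum(counts .* levels);
wB = 0;  sumB = 0;  bestVar = -1;  bestT = 0;
for i = 1:nBins
    wB   = wB + counts(i);                           % background weight grows by one bin
    sumB = sumB + counts(i) * levels(i);             % background intensity sum grows
    wF   = total - wB;
    if wB == 0 || wF == 0, continue; end
    muB = sumB / wB;  muF = (totalSum - sumB) / wF;  % class means
    varBetween = wB * wF * (muB - muF)^2;            % maximized instead of minimizing Eq. 3
    if varBetween > bestVar
        bestVar = varBetween;  bestT = levels(i);
    end
end
```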
Other future work encompasses many areas. Broader testing is needed to see whether the color space method is robust against additional indoor lighting schemes and against users with different skin tones. Using a probabilistic graphical model to temporally track features such as the most skin-like pixel location and the threshold value might also be beneficial, since these parameters should not change very quickly from frame to frame. Also, the zig-zag edge effect should be addressed, possibly by applying a deinterlacing algorithm, so long as adding this step does not significantly increase the time needed to execute the pipeline.
Overall, the color space pipeline produces skin regions with few false positives and false negatives under simple conditions, but the method is currently not robust against backgrounds containing reddish content.
References
Papers
- C. Tomasi, R. Manduchi. "Bilateral Filtering for Gray and Color Images." In Proceedings of the IEEE International Conference on Computer Vision, 1998, pp. 839-846.
- "Graphic Technology - Standard object colour spectra database for colour reproduction evaluation." International Standards Organization, Geneva, Switzerland, ISO/TR 16066:2003(E), 2003.
- M. J. Jones, J. M. Rehg. "Statistical Color Models with Application to Skin Detection." International Journal of Computer Vision, vol. 46, pp. 81-96, 2002.
- N. Otsu. "A threshold selection method from gray-level histograms." IEEE Transactions on Systems, Man and Cybernetics, vol. 9, pp. 62-66, January 1979.
Software
- D. Lanman. "Bilateral Filtering," Internet: http://www.mathworks.com/matlabcentral/fileexchange/12191, 6 September 2006 [19 February 2011].
- M. A. Ruzon. "RGB2Lab, Lab2RGB," Internet: http://robotics.stanford.edu/~ruzon/software/rgblab.html, 6 May 2009 [12 March 2011].
Appendix I - Files
- MATLAB Files
- Indoor Test Image Data Set
- Outdoor Test Image Data Set
- Aptina MT9M112 Sensor Datasheet
Acknowledgements
- Matt Tang: Working on the same problem using probabilistic graphical models (CS 228)
- Hendrik Dahlkamp: Project mentor (CS Department)
- Christian Plagemann: Project mentor (CS Department)