Depth Mapping Algorithm Performance Analysis

From Psych 221 Image Systems Engineering

Introduction

Depth mapping of scenes has many applications in computer vision and 3D scene reconstruction, and has recently become more relevant with the new technological developments in these areas. However, algorithms to extract depth information from different imaging systems often have vastly different parameters and approaches, which result in very different performance in terms of depth map quality and runtime.

The goal of our project is to implement several disparity estimation algorithms on stereo image pairs and compare their performance, with the aim of producing an accurate depth map. Rectification and related preprocessing were used to remove all differences between an image pair other than horizontal disparity, so that finding the depth at an image point reduces to finding its disparity value; this justifies our focus on disparity accuracy. Throughout this project, disparity mapping is therefore treated as synonymous with depth mapping.

Background

Disparity and Depth

Depth information about a scene can be captured using a stereo camera (two cameras that are separated horizontally but aligned vertically). The stereo image pair taken by the stereo camera contains this depth information in the horizontal differences between the two views: objects closer to the camera are more horizontally displaced. These differences (also called disparities) can be used to determine the relative distance from the camera for different objects in the scene. Figure 1 (left) shows such differences in an anaglyph, where the red and blue channels of the two views do not line up.

Figure 1. Anaglyph of stereo image pair (left) and example disparity map computed from the same stereo image pair (right) [6]


Disparity and depth are related by the equation x - x' = Bf/z, where x - x' is the disparity, z is the depth, f is the focal length, and B is the interocular distance (baseline). Since depth is just a scaled reciprocal of disparity, accurately computing the disparity is the central task, and disparity calculation can be treated as interchangeable with depth calculation.

Figure 2. Diagram to Calculate Disparity and Depth [7]
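
As a quick illustration of this relationship, the minimal MATLAB sketch below converts a measured disparity into a depth estimate; the baseline and focal length values are placeholders rather than parameters of either dataset used in this project.

  % Minimal sketch: recovering depth from disparity via z = B*f / (x - x').
  % The baseline B and focal length f are illustrative placeholder values,
  % not calibration parameters from our datasets.
  B = 0.1;                      % baseline between the two cameras, in meters
  f = 1200;                     % focal length, in pixels
  d = 48;                       % measured disparity (x - x'), in pixels
  z = B * f / d;                % depth in meters (2.5 m for these values)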

Image Rectification

In order to extract depth information, the stereo image pair must first be rectified (i.e. the images must be transformed in some way such that the only differences that remain are horizontal differences corresponding to the distance of the object from the camera). Rectification can be accomplished both with and without camera calibration. If the corresponding camera intrinsics and extrinsics are given for the stereo image pair, then calibration is not necessary. If they are not given, but photos of a checkerboard or some other calibration object are provided, then the calibration parameters can be calculated.

If not enough camera parameters are given and there are no checkerboard images to be used for calibration, then the following 4 steps can be used for rectification of a given stereo image pair.

  • First, we detect SURF keypoints in each stereo image (SURF is a scale- and rotation-invariant feature detector) and extract the feature vector of each keypoint.
  • We then find matching keypoints between the images using a similarity metric; MATLAB's uncalibrated rectification workflow uses the sum of absolute differences metric for this matching.
  • We then remove outliers (incorrect matches) using an epipolar constraint. Looking at Figure 3, this means that for a keypoint x on the left image, the matching keypoint on the right image must lie on the corresponding epipolar line defined by the intersection of the epipolar plane and the image plane. [8]
  • Finally, using the remaining correct matches, we compute a 2D projective geometric transformation to apply to one or both of the stereo images. After this transformation, the stereo images should no longer have any vertical displacement (a MATLAB sketch of these four steps follows Figure 3 below).
Figure 3. Applying epipolar constraint to keypoint matches.
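
A minimal MATLAB sketch of these four steps, assuming the Computer Vision System Toolbox is available, is shown below; the image filenames are placeholders.

  % Sketch of uncalibrated rectification: SURF keypoints -> matching ->
  % epipolar outlier rejection -> projective rectification. Assumes the
  % Computer Vision System Toolbox; the filenames are placeholders.
  I1 = rgb2gray(imread('left.png'));
  I2 = rgb2gray(imread('right.png'));

  % 1) Detect SURF keypoints and extract a feature vector for each
  pts1 = detectSURFFeatures(I1);
  pts2 = detectSURFFeatures(I2);
  [f1, vpts1] = extractFeatures(I1, pts1);
  [f2, vpts2] = extractFeatures(I2, pts2);

  % 2) Match keypoints between the two images (SAD similarity metric)
  idxPairs = matchFeatures(f1, f2, 'Metric', 'SAD');
  matched1 = vpts1(idxPairs(:, 1));
  matched2 = vpts2(idxPairs(:, 2));

  % 3) Remove outlier matches using the epipolar constraint
  [fMatrix, inliers] = estimateFundamentalMatrix(matched1, matched2, ...
      'Method', 'RANSAC', 'NumTrials', 2000);
  inlier1 = matched1(inliers);
  inlier2 = matched2(inliers);

  % 4) Estimate projective transforms that remove vertical displacement
  [t1, t2] = estimateUncalibratedRectification(fMatrix, ...
      inlier1.Location, inlier2.Location, size(I2));
  [I1Rect, I2Rect] = rectifyStereoImages(I1, I2, ...
      projective2d(t1), projective2d(t2));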

Datasets

SYNS Dataset

The Southampton-York Natural Scenes (SYNS) dataset contains image and 3D range data measured from different rural and urban locations [9]. Each sample contains the following data:

  • LiDAR depth information (360 x 135 degree field of view)
  • Panoramic HDR image captured with a SpheroCam (Nikkor fish-eye lens) (360 x 180 degree image)
  • Stereo image pairs captured with 2 Nikon DSLR cameras (each image pair was captured at a different rotation of the camera such that all of them covered a 360 degree view of the surroundings)

This dataset did not include any intrinsics or extrinsics, or even any calibration photos from which we could extract this data. Therefore, we used the uncalibrated rectification method described above on the stereo images in order to compute disparity maps. An example of a rectified stereo image pair is shown below in Figure 4 (left). We then used MATLAB's disparity mapping algorithm on these rectified stereo images to generate the disparity map shown below in Figure 4 (right).


Figure 4. Rectified stereo images (left) used to compute disparity map (right)
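
Generating the disparity map itself is then a single toolbox call on the rectified pair. The sketch below assumes the rectified images from the previous step; the disparity range is an illustrative value rather than one tuned for the SYNS scenes.

  % Sketch: disparity map from the rectified pair using MATLAB's built-in
  % disparity function. The disparity range is an assumed placeholder and
  % would need tuning per scene.
  disparityMap = disparity(I1Rect, I2Rect, 'DisparityRange', [0 64]);
  imshow(disparityMap, [0 64]);    % scale the display to the search range
  colormap jet; colorbar;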

Our goals for this dataset were to compute depth maps using various algorithms (described in the Methods section below) and then compare the results to the LiDAR information (ground truth). To do a quantitative evaluation of this comparison, we would need to match the stereo view to the exact corresponding region of interest in the LiDAR data. We could create a projection of the LiDAR point cloud if we had information about the position and angle of the camera relative to the scene. However, this information was not provided, so we would instead have to use a more brute-force method to determine which area of the LiDAR data corresponds to the stereo view being considered: iterate over the HDR image and quantitatively compare it to the stereo images using a method such as a least-squares comparison. We used a MATLAB script to scale the panoramic LiDAR information to the panoramic HDR image, so that we could find the LiDAR information corresponding to a region of interest found in the HDR image. A sample output of this script is shown in Figure 5. We could then extract the relevant LiDAR information and compare it to the computed depth map.


Figure 5. LiDAR info (left) mapped to HDR panorama (right)
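
We did not carry this search out to completion, but a minimal sketch of the brute-force comparison we had in mind, sliding a stereo-sized window across the panorama and scoring each position with a sum of squared differences, might look like the following. The filenames, resizing factor, and stride are assumptions for illustration, and this simple version ignores the fish-eye warping discussed below.

  % Sketch of a brute-force region search: slide a window the size of the
  % stereo view across the HDR panorama and keep the position with the
  % lowest sum-of-squared-differences score. All names are illustrative,
  % and fish-eye warping of the panorama is not accounted for.
  pano   = im2double(rgb2gray(imread('hdr_panorama.png')));
  stereo = im2double(rgb2gray(imread('left.png')));
  stereo = imresize(stereo, 0.25);          % assumed scale factor
  [hS, wS] = size(stereo);
  [hP, wP] = size(pano);
  bestScore = inf; bestPos = [1 1];
  for r = 1:8:(hP - hS + 1)                 % coarse stride to limit runtime
      for c = 1:8:(wP - wS + 1)
          window = pano(r:r+hS-1, c:c+wS-1);
          score  = sum((window(:) - stereo(:)).^2);    % SSD score
          if score < bestScore
              bestScore = score;
              bestPos   = [r c];            % best-matching region so far
          end
      end
  end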

When trying to implement this procedure, we came across several obstacles. One of the main obstacles was the significant image warping introduced by the fish eye lens used for the panoramic HDR that was not present in the stereo images. With camera calibration parameters, we could undo this warping but the dataset lacks the means for us to accomplish this (as discussed above). Another related obstacle is introduced by the need to use an uncalibrated image rectification process. Since the camera extrinsics/intrinsics are not taken into account during rectification, the best geometric transform for the stereo images might include some shear and rotation. Both of these obstacles would make it very difficult to run a brute force algorithm that could accurately compare the stereo images to the HDR images in order to find the matching region of interest in the HDR image.

In order to be able to reach the goals of the project given these obstacles, we decided to switch to the Middlebury Dataset (described below) and adapt our experimental procedure accordingly.

Middlebury Dataset

The Middlebury dataset was created by Nera Nesic, Porter Westling, Xi Wang, York Kitajima, Greg Krathwohl, and Daniel Scharstein at Middlebury College during 2011-2013, and refined with Heiko Hirschmüller at the DLR Germany during 2014 [2]. It comprises 33 stereo datasets, 20 of which were produced using the new Middlebury Stereo Evaluation (10 each for the training and test sets). A detailed description of the acquisition process can be found in the GCPR 2014 paper [4].

Figure 6: 10 of the datasets, displaying one side of a stereo pair along with the pair's corresponding "ground truth" disparity map [3]


This dataset is highly cited and provides details about all the intrinsic and extrinsic parameters of the cameras used, as well as images with "perfect" and "imperfect" rectification. It does not have "ground truth" depth as with the SYNS LiDAR data; however, it provides highly accurate "ground truth" disparity maps, acquired using multiple iterations and refinements as detailed in Figure 7 below:

Figure 7: Block diagram of how highly accurate "ground truth" was made [4]

With all of this information available, the issues encountered with the SYNS dataset could be bypassed. We therefore used the "perfectly" rectified stereo images from this dataset, which simplified implementation and allowed direct comparisons between the algorithms described in the following sections.

Methods: Block Matching and Similarity Metrics

We compared three methods for extracting the disparity between two stereo images. All of them rely on block/window matching to produce their results.

Block/Window Matching

The image is broken down into subsections called blocks or windows, and these subsections are what get compared between the two stereo images.

Figure 8: Example original picture, left stereo image [5]

A block is defined in the left stereo image (Figure 9a). Since a "perfectly" rectified image pair should differ only horizontally, a horizontal search region can be defined in the right stereo image around the position of the reference block from the left image (Figure 9b).

Figure 9a: Zoomed-in region with block selected for left stereo image. Left image is used as reference
Figure 9b: Same zoomed-in region in right stereo image with horizontal sweep bounds shown and original block position in white

The reference block from the left stereo image must now be compared with each candidate block from the right stereo image. To do this, the color image is converted to grayscale so that each pixel within the block has a numerical intensity value (Figure 10). Depending on the algorithm, these values are used to score the similarity between the blocks, and the best-scoring candidate gives the disparity between the two images at that point in the reference image.

Figure 10: Grayscale pixel intensity values within a block [5]

SAD

The sum of absolute differences (SAD) is a measure of the similarity between image blocks. It is calculated by taking the absolute difference between each pixel in the reference block and the corresponding pixel in the candidate block, and summing these differences over the block. [3]

SSD

The sum of squared differences (SSD) instead sums the squared differences between corresponding pixels in the two blocks. This measure has a higher computational complexity than SAD, since it requires a multiplication for every pixel comparison. [3]
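
A compact MATLAB sketch of the horizontal sweep for a single reference block, scoring candidate blocks with both SAD and SSD, is shown below; the block size, reference location, and disparity range are illustrative values.

  % Sketch: compare one reference block from the left image against blocks
  % along the same row of the right image, scoring with SAD and SSD.
  % Block size, reference location, and disparity range are placeholders.
  L = im2double(rgb2gray(imread('left.png')));
  R = im2double(rgb2gray(imread('right.png')));
  half = 5;                        % 11x11 blocks
  row = 200; col = 300;            % reference block center (placeholder)
  maxDisp = 64;
  refBlock = L(row-half:row+half, col-half:col+half);
  sadCost = zeros(1, maxDisp + 1);
  ssdCost = zeros(1, maxDisp + 1);
  for d = 0:maxDisp
      c = col - d;                 % candidate block shifted left by d pixels
      candBlock = R(row-half:row+half, c-half:c+half);
      diffs = refBlock - candBlock;
      sadCost(d+1) = sum(abs(diffs(:)));   % sum of absolute differences
      ssdCost(d+1) = sum(diffs(:).^2);     % sum of squared differences
  end
  [~, idx] = min(sadCost);
  disparitySAD = idx - 1;          % disparity with the lowest SAD cost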

CT

The census transform (CT) encodes the relative brightness of each pixel with respect to its neighbors as a bit-string, and compares these encodings between windows of the left and right images (a sketch follows the list below). The algorithm is as follows:

  • Compute bit-string for pixel p, based on intensities of neighboring pixels
  • Compare left image bit-string with bit-strings from a range of windows in the right image
  • Choose disparity d with lowest Hamming distance
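
A minimal MATLAB sketch of this procedure for a single pixel, assuming a 5x5 census window and an illustrative search range, might look like the following.

  % Sketch of the census transform comparison for one pixel: build a
  % bit-string from a 5x5 neighborhood in each image, then pick the
  % disparity with the smallest Hamming distance. Window size, pixel
  % location, and search range are placeholders.
  L = im2double(rgb2gray(imread('left.png')));
  R = im2double(rgb2gray(imread('right.png')));
  half = 2; row = 200; col = 300; maxDisp = 64;

  % Bit-string: 1 where a neighbor is brighter than the center pixel p
  census = @(I, r, c) reshape(I(r-half:r+half, c-half:c+half) > I(r, c), [], 1);

  refBits = census(L, row, col);
  hammingDist = zeros(1, maxDisp + 1);
  for d = 0:maxDisp
      candBits = census(R, row, col - d);
      hammingDist(d+1) = sum(refBits ~= candBits);   % Hamming distance
  end
  [~, idx] = min(hammingDist);
  disparityCT = idx - 1;           % disparity with the lowest Hamming distance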

Methods: Algorithm Evaluation

Test 1:

We focused on one image for our analysis of these algorithms. The same block size was used for all tests so that only the algorithms themselves were being compared. Ideally, all resulting disparity maps would be compared numerically with the "ground truth" and performance quantified as an error rate, but for this test the resulting disparity maps were compared visually.

Test 2:

Since the block size affects the accuracy of the resulting disparity map, the SAD algorithm was used to test the effect of varying the block size. In addition to block size, Semi-Global Matching (SGM) was also investigated to see its effect across the different block sizes. SGM is based on the idea of pixelwise matching of mutual information and approximating a global, 2D smoothness constraint by combining many 1D constraints [10].
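
A hedged MATLAB sketch of how such a sweep can be set up with the toolbox disparity function, which offers both a 'BlockMatching' and a 'SemiGlobal' method, is shown below; the ground-truth variable, bad-pixel threshold, and plotting details are illustrative rather than our exact evaluation script.

  % Sketch: sweep block sizes for block matching with and without semi-global
  % smoothing, scoring each result against a ground-truth disparity map.
  % groundTruth, the threshold, and the disparity range are placeholders.
  blockSizes = 5:2:21;
  badPixelThresh = 2;              % disparity error (pixels) counted as "bad"
  errBM  = zeros(size(blockSizes));
  errSGM = zeros(size(blockSizes));
  for k = 1:numel(blockSizes)
      dBM  = disparity(I1Rect, I2Rect, 'Method', 'BlockMatching', ...
                       'BlockSize', blockSizes(k), 'DisparityRange', [0 64]);
      dSGM = disparity(I1Rect, I2Rect, 'Method', 'SemiGlobal', ...
                       'BlockSize', blockSizes(k), 'DisparityRange', [0 64]);
      valid = isfinite(groundTruth);       % ignore pixels without ground truth
      errBM(k)  = mean(abs(dBM(valid)  - groundTruth(valid)) > badPixelThresh);
      errSGM(k) = mean(abs(dSGM(valid) - groundTruth(valid)) > badPixelThresh);
  end
  plot(blockSizes, errBM, '-o', blockSizes, errSGM, '-s');
  legend('SAD block matching', 'SAD + semi-global smoothing');
  xlabel('Block size'); ylabel('Bad-pixel error rate');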

Results

Reference Image

Reference image

Sum of Squared Differences

Disparity Map from SSD

Sum of Absolute Differences

Performance with default parameters

Disparity Map without semi-global matching
Disparity Map with semi-global matching

Effect of Block Size and Smoothing

We tested the effect of altering the block size of the block matching + SAD algorithm, with and without semi-global smoothing. The chart below shows the resulting average error rates over 15 images, along with the corresponding graph.

Figure 6: Average error rates over 15 images for varying block sizes, for SAD with and without semi-global smoothing

Overall, the semi-global smoothing had a significantly lower error rate than regular SAD. Furthermore, the block size affects the performance of each differently -- semi-global SAD produces lower error rates with smaller block sizes, and SAD has a local optimum around a block size of 11.

Comparing the computational time of each algorithm over all block sizes tested, SAD with semi-global smoothing took an average of 0.9598 seconds per image, while SAD took an average of 0.3016 seconds. Thus, regular SAD is more than three times as fast.

Census Transformation

Disparity Map from Census Transformation

Conclusions

Available Datasets

Running algorithms and experiments with the SYNS and Middlebury datasets highlighted the tradeoffs between the different stereo image and depth-related datasets that currently exist. SYNS provides a fantastic range of practical real-world scenes, and even includes the ideal ground truth LiDAR information for each scene. However, the lack of camera intrinsics and extrinsics or even just checkerboard images from which we could extract the calibration parameters proved to be prohibitive when trying to map the stereo images to the LiDAR information. This makes the dataset very difficult to use for depth mapping projects.

The Middlebury dataset is currently one of the most widely used datasets for depth mapping, since it has thoroughly documented calibration information as well as images rectified using multiple different methods. It also provides a highly accurate disparity map to be used as a ground truth when running our own depth mapping algorithms. However, it would be even better if this dataset also included true depth information (like LiDAR) to serve as the ultimate ground truth.

Our conclusion with regard to the datasets is that an ideal new dataset would be comprehensive both in its documentation of calibration details and scene capture setup, and in its inclusion of true depth and disparity information for comparative studies.

Smoothing in Disparity Algorithms

From the comparison of the SAD algorithm with and without semi-global smoothing, it appears that overall, smoothing increases the accuracy of the disparity map. This is most likely due to the fact that in block matching, it is difficult to identify the disparity between regions of uniform intensity. Thus, uniform regions contain more noise and inaccuracies in the final disparity map. By forcing similar disparity on neighboring blocks, these regions reflect the actual disparity more closely. Furthermore, as shown in Figure 6, while small block size increased the error of regular SAD, the error rate of SAD with smoothing consistently decreased with smaller block size. This may indicate that block matching may not work very well with the limited scope of a small block, but the smoothing reduces the errors caused by this.

Which Algorithm Performs 'Best'?

In general, the census transform algorithm performed best when tested on the 15 training images in the Middlebury dataset, with a 20% error rate. However, it took an average of 42.9 seconds to run on each image. On the other hand, SAD took, on average, only 0.3 seconds per image. Thus, under a time constraint, block matching with SAD or SSD may be better suited. SSD is more sensitive to differences between blocks, so it may be more accurate when computing disparity, but the squaring operation is more computationally expensive than the absolute value operation in SAD. Thus, the optimal algorithm depends on the demands of one's application.

References

[1] Daniel Scharstein, et al., "High-Resolution Stereo Datasets with Subpixel-Accurate Ground Truth," German Conference on Pattern Recognition. 2014. http://www.cs.middlebury.edu/~schar/papers/datasets-gcpr2014.pdf

[2] Jongchul Lee, et al., "Improved census transform for noise robust stereo matching," Optical Engineering 55(6), 063107 (16 June 2016). http://dx.doi.org/10.1117/1.OE.55.6.063107

[3] Pranoti Dhole, et al., "Depth Map Estimation Using SIMULINK Tool," 3rd International Conference on Signal Processing and Integrated Networks. 2016. http://ieeexplore.ieee.org/stamp/stamp.jsp?arnumber=7566714

[4] D. Scharstein, H. Hirschmüller, Y. Kitajima, G. Krathwohl, N. Nesic, X. Wang, and P. Westling. High-resolution stereo datasets with subpixel-accurate ground truth. In German Conference on Pattern Recognition (GCPR 2014), Münster, Germany, September 2014.

[5] Chris McCormick, Stereo Vision Tutorial - Part I. 2014. http://mccormickml.com/2014/01/10/stereo-vision-tutorial-part-i/

[6] Stereo Vision for Depth Estimation. https://www.mathworks.com/discovery/stereo-vision.html

[7] Depth Map from Stereo Images. 2013. http://opencv-python-tutroals.readthedocs.io/en/latest/py_tutorials/py_calib3d/py_depthmap/py_depthmap.html

[8] Robyn Owens, Epipolar Geometry. 1997. http://homepages.inf.ed.ac.uk/rbf/CVonline/LOCAL_COPIES/OWENS/LECT10/node3.html

[9] Adams, W.J., Elder, J.H., Graf, E.W., Leyland, J., Lugtigheid, A.J., Muryy, A. (2016). The Southampton-York Natural Scenes (SYNS) dataset: Statistics of surface attitude. Scientific Reports, 6, 35805.

[10] Heiko Hirschmuller, "Accurate and Efficient Stereo Processing by Semi-Global Matching and Mutual Information," IEEE Conference on Computer Vision and Pattern Recognition (CVPR). 2005.

Appendix I

Appendix II

We divided the work as follows:

  • Oscar Guerrero: Researched disparity algorithm code, ran experiments, worked on presentation and wiki
  • Deepti Mahajan: Researched and implemented disparity algorithm code, ran experiments and gathered results, worked on presentation and wiki
  • Shalini Ranmuthu: Implemented algorithms for SYNS dataset, researched and implemented image rectification methods, worked on presentation and wiki