Stereo Depth Estimation Algorithms
Introduction
The brain is a remarkable supercomputer with unmatched 3D object recognition and depth estimation. The ocular system has over 120 million rods and 6 million cones whose signals are funneled into the visual cortex, giving us the capability to see. The massive amounts of data involved and the processing power of the brain make it a fascinating phenomenon. Stereo depth is a major part of this machinery, and in this paper we delve into the topic by investigating existing computer algorithms. Depth extraction from stereo images boils down to triangulation: all the solutions surveyed take a point in the scene, locate its position in the two images along the epipolar line, and solve for depth.
The paper seeks to explain the primary approaches and details surrounding stereo depth extraction.
Background
Computer vision has seen explosive growth since the turn of the century; stereo depth, in particular, has seen its fair share of research due to its applications in panorama mosaicking, improved face tracking, facial recognition, autonomous driving, high-resolution geometric sampling and many others. The result of this high interest is a plethora of algorithms and papers describing various ways of estimating depth from two images.
A typical stereo algorithm consists of four steps: matching cost computation, cost aggregation, optimization and disparity refinement [1]. Stereo correspondence, the process of matching pixels between stereo images, is essentially an optimization problem, and whether an algorithm is classed as local or global is determined by how it solves that problem. In this paper, we focus on the stereo correspondence aspect of stereo depth calculation.
Methods
"The wide range of modern dense matching algorithms can be assigned to two big categories: local methods, which evaluate the correspondences one point at a time with a winner take all strategy, and global methods where some constraint on the regularity of results are enforced in the estimation"[2]. However, since the advent of the MC-CNN paper [3], a whole host of learning-based algorithms have dominated the KITTI data set.
Image Rectification
Two images taken from stereo cameras can suffer many types of transformation, for example translation, rotation, shear, and more general rigid and affine transforms. Matching pixels between images is already a difficult problem, so it is common practice to first rectify, or align, the images along their epipolar lines before processing them.
Below we go over the types of depth estimation methods, assuming a rectified image pair has been provided.
Local or Block Method
One image is picked as a reference, and a block around each pixel is searched along the epipolar line to identify the corresponding pixel in the paired image. Different cost functions can be used to determine the similarity between pixels, and a winner-take-all strategy picks the location of the corresponding pixel in the search space.
Examples of cost functions
For a reference pixel (x, y), a window W and a candidate disparity d, two common window-based costs (used in the experiments later in this paper) are:
SAD(x, y, d) = Σ_(i,j)∈W |I_L(x+i, y+j) − I_R(x−d+i, y+j)|
SSD(x, y, d) = Σ_(i,j)∈W (I_L(x+i, y+j) − I_R(x−d+i, y+j))²
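These window costs are straightforward to sketch in code. The snippet below is illustrative (the function names are our own, and ZNCC is included only as a common brightness-invariant alternative not tested in this paper):

```python
import numpy as np

def sad(left_patch, right_patch):
    # Sum of Absolute Differences: cheap and reasonably robust to outliers.
    return np.sum(np.abs(left_patch.astype(float) - right_patch.astype(float)))

def ssd(left_patch, right_patch):
    # Sum of Squared Differences: penalizes large deviations more heavily.
    d = left_patch.astype(float) - right_patch.astype(float)
    return np.sum(d * d)

def zncc(left_patch, right_patch):
    # Zero-mean Normalized Cross-Correlation: invariant to affine brightness
    # changes; higher means more similar (range [-1, 1]).
    l = left_patch.astype(float) - left_patch.mean()
    r = right_patch.astype(float) - right_patch.mean()
    denom = np.sqrt((l * l).sum() * (r * r).sum())
    return (l * r).sum() / denom if denom > 0 else 0.0
```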
This can be very efficient and fast, making it particularly suitable for real-time applications [2]. However, block-based algorithms are susceptible to images with:
- Sudden depth variations
- Higher degree of occlusions
- Low texture regions.
Local solutions do not take advantage of the smoothing terms described under the global optimization solutions below. Most methods also use a "left-right consistency" disparity refinement step to improve the computed disparity map. This step helps mitigate some of the occlusion issues by re-running the algorithm after swapping the reference image and discarding matches that disagree.
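Putting the pieces together, a minimal (and deliberately unoptimized) sketch of winner-take-all block matching with a left-right consistency check might look like this; the function names and the choice of SAD cost are illustrative, not a specific published implementation:

```python
import numpy as np

def block_match(left, right, max_disp, half=3):
    # Winner-take-all block matching along horizontal epipolar lines.
    # For each reference pixel, slide a window over the candidate
    # disparities and keep the disparity with the lowest SAD cost.
    h, w = left.shape
    disp = np.zeros((h, w), dtype=int)
    L, R = left.astype(float), right.astype(float)
    for y in range(half, h - half):
        for x in range(half, w - half):
            best_cost, best_d = np.inf, 0
            for d in range(min(max_disp, x - half) + 1):
                lw = L[y - half:y + half + 1, x - half:x + half + 1]
                rw = R[y - half:y + half + 1, x - d - half:x - d + half + 1]
                cost = np.abs(lw - rw).sum()   # SAD cost
                if cost < best_cost:
                    best_cost, best_d = cost, d
            disp[y, x] = best_d
    return disp

def left_right_check(disp_l, disp_r, tol=1):
    # Disparity refinement: a match is kept only if the right image's
    # disparity map points back to (roughly) the same left pixel.
    h, w = disp_l.shape
    valid = np.zeros((h, w), dtype=bool)
    for y in range(h):
        for x in range(w):
            d = disp_l[y, x]
            if x - d >= 0 and abs(disp_r[y, x - d] - d) <= tol:
                valid[y, x] = True
    return valid
```

The triple loop makes the greedy, per-pixel nature of the method explicit; practical implementations vectorize the disparity loop or use integral images to share window sums.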
Semi-Global Matching (SGM)
"Local methods, which are based on correlation can have very efficient implementations that are suitable for real-time applications. However, these methods assume constant disparities within a correlation window, which is incorrect at discontinuities and leads to blurred object boundaries"[8]. The semi-global matching approach is a variation on block matching that remedies blurred depth-discontinuity boundaries by adding penalties for disparity changes on top of the pixelwise matching cost, aggregated along multiple image directions. The other change is the use of mutual information as a cost function, instead of the previously mentioned intensity error functions.
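The core idea can be sketched for a single aggregation direction, assuming a precomputed cost volume of shape (height, width, disparities); this is an illustrative simplification, whereas Hirschmuller [8] aggregates 8 or 16 directions and sums them:

```python
import numpy as np

def aggregate_left_to_right(cost, P1=1.0, P2=8.0):
    # One SGM path aggregation: sweep each row left to right, adding a
    # small penalty P1 for disparity changes of one level and a larger
    # penalty P2 for bigger jumps. Summing several such directional
    # passes approximates a 2D smoothness term.
    h, w, D = cost.shape
    agg = np.zeros_like(cost, dtype=float)
    agg[:, 0, :] = cost[:, 0, :]
    for x in range(1, w):
        prev = agg[:, x - 1, :]                      # shape (h, D)
        prev_min = prev.min(axis=1, keepdims=True)   # best cost at x-1
        # candidate transition costs for arriving at each disparity d
        same = prev                                  # stay at d
        up = np.roll(prev, 1, axis=1) + P1           # come from d-1
        up[:, 0] = np.inf
        down = np.roll(prev, -1, axis=1) + P1        # come from d+1
        down[:, -1] = np.inf
        jump = prev_min + P2                         # large disparity jump
        best = np.minimum(np.minimum(same, up), np.minimum(down, jump))
        # subtracting prev_min keeps values bounded, as in Hirschmuller
        agg[:, x, :] = cost[:, x, :] + best - prev_min
    return agg
```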
Global Method
Global methods handle occlusion and depth discontinuity much better than the previous methods because the optimization is done over the whole image. A global method attempts to minimize an energy function defined over a cost value per pixel (x, y) and disparity d. The problem space quickly blows up if you consider an exhaustive search over all E(x, y, d) values. Since an exhaustive search is not tractable, the global method is instead characterized by a data term that penalizes mismatched pixels and a smoothness term that models the similarity in depth between neighbouring pixels.
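Following the taxonomy of [1], the energy that global methods minimize is typically written as a data term plus a weighted smoothness term over neighbouring pixel pairs:

```latex
E(d) \;=\; \sum_{p} C\!\left(p,\, d_p\right)
\;+\; \lambda \sum_{(p,q)\,\in\,\mathcal{N}} \rho\!\left(d_p - d_q\right)
```

where C(p, d_p) is the matching cost of assigning disparity d_p to pixel p, N is the set of neighbouring pixel pairs, and the function ρ penalizes disparity differences between neighbours. The methods below differ only in how they search for a minimum of this energy.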
Simulated Annealing
The simulated annealing implementation was first proposed by Stuart Geman and Donald Geman for restoring images using a Bayesian approach [9] and later modified for use with stereo vision by Stephen T. Barnard [10]. The simulated annealing solution to depth estimation gradually reduces a temperature parameter governing a Gibbs distribution over states. In the initial stages, big swings in x, y and d (the state) are allowed so as to avoid getting trapped in local minima of the energy. The idea is to pick a random state, make a small or large change to propose a new state, and accept the new state if it lowers the energy (and occasionally even when it does not). The method repeats these steps until a global minimum is reached or the temperature runs out.
The method takes its inspiration from the physical phenomena of high molecular activity at high temperatures and the more stable, fixed positions of molecules at low temperatures. This method is guaranteed to reach the optimal solution, but the guarantee only holds in the limit of an infinitely slow cooling schedule.
The cooling rate is characterized by a temperature schedule; the schedule for which Geman and Geman prove convergence [9] is T(k) = C / log(1 + k), where k is the iteration number and C a problem-dependent constant.
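The accept/reject loop described above can be sketched generically. This is a toy illustration rather than Barnard's stereo formulation [10]; the parameter defaults, function names and logarithmic schedule are our own choices:

```python
import math
import random

def simulated_annealing(energy, init_state, neighbors, t0=10.0, steps=20000, seed=0):
    # Metropolis-style annealing: propose a random neighbouring state and
    # accept it if it lowers the energy, or with probability exp(-delta/T)
    # otherwise, so early (hot) iterations can escape local minima.
    rng = random.Random(seed)
    state, e = init_state, energy(init_state)
    best_state, best_e = state, e
    for k in range(1, steps + 1):
        t = t0 / math.log(1 + k)              # logarithmic cooling schedule
        cand = rng.choice(neighbors(state))
        ce = energy(cand)
        delta = ce - e
        if delta <= 0 or rng.random() < math.exp(-delta / t):
            state, e = cand, ce
            if e < best_e:                    # track the best state seen
                best_state, best_e = state, e
    return best_state, best_e
```

In a stereo setting the state would be the whole disparity field and a neighbour a single-pixel disparity change; here a 1D toy energy is enough to exercise the loop.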
Markov Random Field Methods
Markov Random Fields enable a statistical formulation of building a disparity map: an energy function over the field of disparities is minimized with a stochastic modelling process to reach a global minimum.
Belief Propagation
The belief propagation implementation of energy minimization uses a message-passing algorithm. Each pixel broadcasts to its neighbours what it believes their values should be, and in turn modifies its own value after hearing the opinions of its neighbours; you can imagine a set of forces pulling one way and another until a consensus is reached. On graphs with loops, belief propagation carries no mathematical guarantee of converging to the global minimum, but it is highly performant and is used for the quality of the solutions it achieves.
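A minimal min-sum version of this message passing on a 4-connected grid might look as follows; the shapes and names are illustrative, and real stereo implementations such as [6] add coarse-to-fine hierarchies and message damping for speed and stability:

```python
import numpy as np

def loopy_bp(unary, smooth, iters=10):
    # Min-sum loopy belief propagation on a 4-connected pixel grid.
    #   unary:  (h, w, L) data cost per pixel and label (disparity)
    #   smooth: (L, L) pairwise cost between neighbouring labels
    # m[d] holds the message arriving at each pixel from its neighbour in
    # direction d: 0 = from above, 1 = from below, 2 = from the left,
    # 3 = from the right.
    h, w, L = unary.shape
    m = np.zeros((4, h, w, L))
    opposite = [1, 0, 3, 2]
    for _ in range(iters):
        nm = np.zeros_like(m)
        total = unary + m.sum(axis=0)
        for d in range(4):
            # A sender combines its data cost and all incoming messages
            # except the one previously received from the target pixel,
            # then minimizes over its own labels plus the pairwise cost.
            basis = total - m[opposite[d]]
            sent = (basis[:, :, :, None] + smooth[None, None]).min(axis=2)
            arr = np.zeros((h, w, L))
            if d == 0:
                arr[1:, :] = sent[:-1, :]     # sender is the pixel above
            elif d == 1:
                arr[:-1, :] = sent[1:, :]     # sender is the pixel below
            elif d == 2:
                arr[:, 1:] = sent[:, :-1]     # sender is the pixel to the left
            else:
                arr[:, :-1] = sent[:, 1:]     # sender is the pixel to the right
            nm[d] = arr - arr.min(axis=2, keepdims=True)  # normalize for stability
        m = nm
    beliefs = unary + m.sum(axis=0)
    return beliefs.argmin(axis=2)             # winning label per pixel
```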
Graph cuts
Hiroshi Ishikawa introduced the use of max-flow/min-cut algorithms to solve the global optimization problem exactly for convex priors in 2003. Ishikawa asserts, "Simulated annealing has been used to solve certain MRF problems, although it is also notoriously slow" [12]. These solutions use a graph with vertices corresponding to pixels and edges corresponding to connections between neighbouring pixels. Once all the disparities per pixel have been evaluated for their energy cost, we can cut connections between neighbouring pixels with large disparity jumps so as to minimize the overall cost.
Other Methods
CNN
The latest algorithms implement various approaches to learned stereo matching. The cost optimization designs, whether global or local, seek to answer whether a reference pixel is similar to the pixel under comparison. Machine learning networks improve on this comparison step by adding a priori knowledge: whereas difficult geometry and occlusions are simply costs to be minimized in the previously mentioned implementations, neural networks are trained on image snippets of varied geometry, with both positive and negative matches, so that they can answer the similarity question from learned experience.
Features are first extracted from the input images and then combined to determine a similarity score, as in the architecture of MC-CNN [3], the algorithm that started the movement.
Dynamic Programming
The 2D energy minimization problem is NP-hard. However, dynamic programming can find the global minimum for independent scanlines in polynomial time [11]. The algorithm computes matching costs between a pair of corresponding scanlines and then finds the minimum-cost path through them. Because each scanline is optimized independently, with no coupling between vertically neighbouring scanlines, the resulting disparity maps exhibit the well-known streaking artifacts.
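The per-scanline optimization can be sketched with a simple Viterbi-style recurrence; a linear disparity-change penalty P stands in for the smoothness term, and the function name is illustrative:

```python
import numpy as np

def dp_scanline(cost, P=1.0):
    # Dynamic programming along one scanline: cost[x, d] is the matching
    # cost of assigning disparity d at column x. We find the minimum-cost
    # path through this disparity-space image, charging P per unit of
    # disparity change between neighbouring columns.
    w, D = cost.shape
    acc = np.zeros((w, D))
    back = np.zeros((w, D), dtype=int)
    acc[0] = cost[0]
    penalty = P * np.abs(np.arange(D)[:, None] - np.arange(D)[None, :])
    for x in range(1, w):
        trans = acc[x - 1][:, None] + penalty     # (D_prev, D_cur)
        back[x] = trans.argmin(axis=0)            # best predecessor per d
        acc[x] = cost[x] + trans.min(axis=0)
    # backtrack the optimal disparity path
    path = np.zeros(w, dtype=int)
    path[-1] = acc[-1].argmin()
    for x in range(w - 1, 0, -1):
        path[x - 1] = back[x, path[x]]
    return path
```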
Plane Sweeping
Also referred to as a segmentation method, this solution coarsely segments the image and generates fronto-parallel planes with estimated disparities. "After a set of initial cost values for each segment has been stored into a disparity space distribution (DSD), iterative relaxation (or loopy belief propagation, in the more recent work of Zitnick and Kang (2007)) is used to adjust the disparity estimates for each segment"[11]. This is akin to moving and warping parts of an image to particular disparities and looking for a match in both image pairs: if a segmented region of the left image is translated to a depth of k and then compared against the right image for a match, we can iteratively search for matching planes until we have solved the whole image.
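A pixel-level (rather than segment-level) version of the sweep is easy to sketch: shift the right image by each candidate disparity plane, score the overlap, and take the best plane per pixel. The function below is an illustrative simplification of the segment-based methods quoted above:

```python
import numpy as np

def plane_sweep(left, right, max_disp):
    # Fronto-parallel plane sweep: for each candidate disparity d, shift
    # the right image by d and record the per-pixel absolute difference,
    # building a cost volume; the winning plane per pixel gives the
    # disparity map. (Segment-based variants warp whole segments and
    # relax the estimates iteratively instead.)
    h, w = left.shape
    volume = np.empty((max_disp + 1, h, w))
    for d in range(max_disp + 1):
        shifted = np.full((h, w), np.inf)     # columns with no data stay inf
        shifted[:, d:] = right[:, :w - d]
        volume[d] = np.abs(left - shifted)
    return volume.argmin(axis=0)
```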
Results
The experiments that were run focused on the benefits and drawbacks of local and global algorithms.
Local algorithm benefits
Due to their greedy, winner-take-all approach, local algorithms use much less memory and complete much faster than global algorithms, and they perform well on data sets with low depth discontinuity. In this section, we investigate the differences between cost functions using a friendly data set for evaluation. The performance gap was significant even though the graph algorithm was implemented in C++ and the block implementation in Matlab.
Local method with SSD vs SAD
There was a small observable difference between SAD and SSD for the local method: the SSD cost function captures slightly more background detail. Attempts at using RMS error or scatter plots to expose this difference have proven challenging. One issue is that the ground-truth disparity is given in both the left and right image contexts; another is that the errors look visually small, contradicting the increase in measured error values.
The cones data set
The Tsukuba data set
Global methods with SSD vs SAD
The global methods were much more sensitive than expected when the sum of squared differences was tested against the default absolute-difference implementation. The sum of squared differences produced more artifacts but also greater detail.
Local methods with pathological data sets
Local algorithms perform well on smooth images with low occlusion and small depth discontinuities. In this experiment we pick data sets that should result in poor performance from the local method.
As expected, the local block implementation did not do well.
Global methods ICM vs BP vs GC
In this section we run an experiment to see which energy optimization solution is relatively better. We test belief propagation (BP/LBP), graph cuts (GC) and iterated conditional modes (ICM) using tooling from Middlebury [7].
Tsukuba data set comparison between ICM and GC
It is immediately obvious that the graph-cut approach outperforms the artifact-riddled ICM implementation. The optimization method selected can make a big difference in how well we smooth our disparities.
Jade data set between ICM, GC, BP
Local method vs Global GC
For this section, we present a side-by-side comparison on the Jade dataset to highlight the large gap in performance between the two approaches when dealing with occlusion and depth discontinuities.
Global optimization does a much better job of handling occlusion and depth discontinuity. Note that the scales do not match, but there is an obvious improvement in capturing the structure with the global solution. Refer to the three images above for other implementations on the Jade dataset.
Even though graph cuts perform much better than the earlier methods, as of around 2014 CNNs have dominated the benchmarks.
Conclusions
In this paper, we have surveyed a wide range of topics related to the available solutions for stereo depth, attempting to describe the mathematical and computational components of a small piece of computer vision, namely stereo depth estimation. This is certainly not an exhaustive review, but it introduces the main approaches and their fundamental ideas.
The current state of the art is wide and varied, but as more stereo images with ground truth become available, learned approaches are quickly becoming the norm. Although we take for granted our innate capability to understand the 3-dimensional geometry of the objects we see, that capability is enabled by a huge trove of a priori knowledge on top of the available planar intersection, shadow, relative size, convergence, defocus and accommodation cues. The world of computer vision is barely scratching the surface of what nature has already perfected. The pace of innovation holds much promise that one day we will be able to outperform the human brain not only in depth perception but also in 3D reconstruction and recognition.
Future Work
A mathematical formulation for describing disparity accuracy would be a great addition to this paper. Additionally, investigating CNNs and contrasting them in the same manner would give a more exhaustive overview of stereo correspondence.
Appendix I
[1] Scharstein, D., & Szeliski, R. (2002). A taxonomy and evaluation of dense two-frame stereo correspondence algorithms. International Journal of Computer Vision, 47.
[2] Dall'Asta, E. and Roncella, R., 2014. A comparison of semiglobal and local dense matching algorithms for surface reconstruction. International Archives of Photogrammetry, Remote Sensing and Spatial Information Sciences, 40(5): 187-194.
[3] Zbontar, Jure, and Yann LeCun. "Stereo matching by training a convolutional neural network to compare image patches." Journal of Machine Learning Research 17.1-32 (2016): 2.
[4] Van Meerbergen, Geert, et al. "A hierarchical symmetric stereo algorithm using dynamic programming." International Journal of Computer Vision 47.1 (2002): 275-285.
[5] Boykov, Yuri, Olga Veksler, and Ramin Zabih. "Fast approximate energy minimization via graph cuts." IEEE Transactions on pattern analysis and machine intelligence 23.11 (2001): 1222-1239.
[6] Sun, Jian, Nan-Ning Zheng, and Heung-Yeung Shum. "Stereo matching using belief propagation." IEEE Transactions on pattern analysis and machine intelligence 25.7 (2003): 787-800.
[7] Szeliski, Richard, et al. "A comparative study of energy minimization methods for markov random fields with smoothness-based priors." IEEE transactions on pattern analysis and machine intelligence 30.6 (2008): 1068-1080.
[8] Hirschmuller, Heiko. "Accurate and efficient stereo processing by semi-global matching and mutual information." Computer Vision and Pattern Recognition, 2005. CVPR 2005. IEEE Computer Society Conference on. Vol. 2. IEEE, 2005.
[9] Geman, Stuart, and Donald Geman. "Stochastic relaxation, Gibbs distributions, and the Bayesian restoration of images." IEEE Transactions on pattern analysis and machine intelligence 6 (1984): 721-741.
[10] Barnard, Stephen T. "A stochastic approach to stereo vision." AAAI. 1986.
[11] Szeliski, Richard. Computer vision: algorithms and applications. Springer Science & Business Media, 2010.
[12] H. Ishikawa, “Exact Optimization for Markov Random Fields with Convex Priors,” IEEE Trans. Pattern Analysis and Machine Intelligence, vol. 25, no. 10, pp. 1333-1336, Oct. 2003.
Appendix II
Code and files are found at: https://github.com/yfeleke/stereo_depth
Matlab block implementation from: http://mccormickml.com/assets/StereoVision/stereoDisparity.m
Some of the content here was derived from:
- UCF Computer Vision class http://crcv.ucf.edu/courses/CAP5415/Fall2012/index.php
- Stanford CS231A http://web.stanford.edu/class/cs231a/lectures