Monocular Depth Estimation with Global Depth Hints


Introduction

For many imaging tasks, good estimates of 3D structure can lead to significant improvements in performance. Recent advances in convolutional neural networks have made it possible to estimate depth from single RGB images (“monocular depth estimation”). However, without additional information (e.g. focal length), these methods cannot resolve the inherent scale ambiguity of a single monocular image (the tradeoff between the distance to and the size of the imaged object), and therefore cannot be trusted to provide reliable global depth estimates in all circumstances.

Pulsed-light time-of-flight methods are capable of extremely high depth resolution at moderate and even long ranges. Such pulsed illumination methods rely on sensitive photodetectors with high temporal resolution, such as avalanche photodiodes (APDs) or single-photon avalanche diodes (SPADs), combined with nanosecond lasers, and are capable of estimating depth to high accuracy [1]. Experiments by Lindell et al. [2] have further demonstrated the viability of sensor fusion, combining RGB images with SPAD data quickly and efficiently using a convolutional neural network. The drawbacks of these methods are that they are expensive and require highly precise scanning setups.

We propose to address these two drawbacks with a sensor-fusion method that combines pulsed-illumination time-of-flight depth estimation with RGB monocular depth estimation. Instead of using only an RGB image to estimate depth, we augment the RGB image with a histogram of scene depths collected from a time-of-flight sensor.

Background and Related Work

Monocular Depth Estimation

The task of estimating per-pixel depth from a single image, while fundamentally an ill-posed problem, is nonetheless an area of intense interest. Intuitively, all such methods must make use of the monocular depth cues found in natural images. Saxena et al. [3] approach the problem with Markov random fields. More recently, convolutional neural networks have been successful at estimating depth in both supervised [4] [5] and unsupervised [6] learning scenarios; these methods all train a convolutional neural network to predict depth from a single input image.

Single-photon Avalanche Diodes

Single-photon Avalanche Diodes (SPADs) are a new type of sensor capable of recording the arrival times of individual photons with picosecond accuracy. This extremely high temporal resolution enables pulsed time-of-flight LIDAR by precisely recording the travel time of photons emitted from a pulsed laser into the scene and reflected back to the sensor. Using the travel time and the speed of light, one can easily compute the distance traveled by the photon, and hence the distance to the object of interest.
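Concretely, for a photon with measured round-trip travel time $\Delta t$, the distance to the reflecting surface is

$$ d = \frac{c \, \Delta t}{2}, $$

where $c$ is the speed of light and the factor of 2 accounts for the out-and-back path. Picosecond-scale timing therefore corresponds to millimeter-scale depth resolution (10 ps of timing uncertainty maps to 1.5 mm of depth).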

Global Hints

The idea of using 1D histograms in image processing and reconstruction is not new. Swoboda and Schnörr use a convex variational approach to perform various image restoration tasks (such as denoising and inpainting) by leveraging histogram information about the image [9]. In image colorization, Zhang et al. [8] use a convolutional neural network derived from the U-Net that takes in an input grayscale image, incorporates color histogram information in the bottleneck layers, and outputs a full-color image that represents an intelligent synthesis of the spatial information from the grayscale image and the color information from the histogram.

Methods

Convolutional Neural Network

The core of our method is a convolutional neural network, described below, that incorporates global depth hints and outputs per-pixel depth estimates.

The model is based on the U-Net [7], which has shown good performance on other image-related tasks such as semantic segmentation. The left branch of the network takes an RGB image as input and consists of a (Conv 3x3, ReLU) block followed by 4 downsampling stages, each of which halves the image resolution. Each downsampling stage consists of a 2x2 max pooling followed by (Conv 3x3, Batchnorm, ReLU) x2.

The lower branch of the network takes the histogram as input and performs (Conv 1x1, ReLU) x4. The output of this branch is expanded in the spatial dimensions and concatenated with the bottleneck features from the left branch before being fed into the right branch. The right branch consists of 4 upsampling layers followed by a single output Conv 1x1 layer. Each upsampling layer performs a bilinear upsampling operation, concatenates the output of the left branch at the same resolution, and then applies (Conv 3x3, Batchnorm, ReLU) x2.
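For concreteness, the following PyTorch sketch captures this architecture. The channel widths, the histogram length (n_bins), and the names HintsUNet and double_conv are our own assumptions; the writeup does not specify them.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

def double_conv(in_ch, out_ch):
    # (Conv 3x3, Batchnorm, ReLU) x2, used in every down/upsampling stage
    return nn.Sequential(
        nn.Conv2d(in_ch, out_ch, 3, padding=1), nn.BatchNorm2d(out_ch), nn.ReLU(inplace=True),
        nn.Conv2d(out_ch, out_ch, 3, padding=1), nn.BatchNorm2d(out_ch), nn.ReLU(inplace=True),
    )

class HintsUNet(nn.Module):
    def __init__(self, n_bins=64):
        super().__init__()
        self.stem = nn.Sequential(nn.Conv2d(3, 64, 3, padding=1), nn.ReLU(inplace=True))
        self.down1, self.down2 = double_conv(64, 128), double_conv(128, 256)
        self.down3, self.down4 = double_conv(256, 512), double_conv(512, 1024)
        # Lower branch: (Conv 1x1, ReLU) x4 applied to the global depth histogram.
        hints, in_ch = [], n_bins
        for _ in range(4):
            hints += [nn.Conv2d(in_ch, 1024, 1), nn.ReLU(inplace=True)]
            in_ch = 1024
        self.hints = nn.Sequential(*hints)
        # Right branch: 4 upsampling stages with skip connections, then Conv 1x1.
        self.up1 = double_conv(1024 + 1024 + 512, 512)
        self.up2 = double_conv(512 + 256, 256)
        self.up3 = double_conv(256 + 128, 128)
        self.up4 = double_conv(128 + 64, 64)
        self.out = nn.Conv2d(64, 1, 1)

    def forward(self, rgb, hist):
        x0 = self.stem(rgb)                    # full resolution
        x1 = self.down1(F.max_pool2d(x0, 2))   # 1/2
        x2 = self.down2(F.max_pool2d(x1, 2))   # 1/4
        x3 = self.down3(F.max_pool2d(x2, 2))   # 1/8
        x4 = self.down4(F.max_pool2d(x3, 2))   # 1/16 (bottleneck)
        # Tile the hint features over the bottleneck's spatial dimensions.
        h = self.hints(hist[:, :, None, None]).expand(-1, -1, x4.shape[2], x4.shape[3])
        up = lambda t: F.interpolate(t, scale_factor=2, mode="bilinear", align_corners=False)
        x = self.up1(torch.cat([up(torch.cat([x4, h], dim=1)), x3], dim=1))
        x = self.up2(torch.cat([up(x), x2], dim=1))
        x = self.up3(torch.cat([up(x), x1], dim=1))
        x = self.up4(torch.cat([up(x), x0], dim=1))
        return self.out(x)
```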

We use the Reverse Huber (or BerHu) loss on the per-pixel error $e = \hat{d} - d$ between predicted and ground-truth depth. This loss has the form:

$$ \mathcal{B}(e) = \begin{cases} |e| & \text{if } |e| \le c \\ \dfrac{e^2 + c^2}{2c} & \text{if } |e| > c, \end{cases} $$

where the threshold $c$ is commonly set to one fifth of the maximum absolute error in the batch [5]. Using this loss (or variants thereof) has been effective at helping convolutional neural networks learn to estimate depth from monocular images [5].
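A minimal PyTorch implementation of this loss under the batch-max threshold convention above (the mask argument for invalid depth pixels is our addition):

```python
import torch

def berhu_loss(pred, target, mask=None):
    # Reverse Huber: L1 for small errors, scaled L2 for large ones.
    err = (pred - target).abs()
    if mask is not None:
        err = err[mask]                                  # drop pixels without valid ground truth
    c = (0.2 * err.max()).clamp(min=1e-6).detach()       # c = max|e| / 5, per batch
    return torch.where(err <= c, err, (err**2 + c**2) / (2 * c)).mean()
```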

The network is "fully convolutional," and so may take images of any resolution. In practice, we crop our inputs to be divisible by 16 so that the 4 downsampling and 4 upsampling operations can be performed without any issues. Intuitively, the left half of the network produces a compressed feature representation of the image in the "bottleneck", which is then expanded by the right branch to form the depth map. The idea behind the bottleneck is to force the network to learn a compact feature representation of the input image. The "skip connections" from the left side to the right side allow the right side of the network to use the high-resolution features from the left side to recover fine spatial detail that would otherwise be lost in the bottleneck.
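A sketch of the cropping step (the center-crop choice and function name are ours):

```python
def crop_to_multiple(img, k=16):
    # Center-crop H and W to multiples of k so that four 2x downsamplings
    # followed by four 2x upsamplings reproduce the input resolution exactly.
    h, w = img.shape[-2:]
    h2, w2 = (h // k) * k, (w // k) * k
    top, left = (h - h2) // 2, (w - w2) // 2
    return img[..., top:top + h2, left:left + w2]
```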

SPAD Simulation

The histogram collected by a single-photon avalanche diode over a single acquisition sequence is drawn from a Poisson distribution as follows:

$$ \mathbf{h} \sim \mathcal{P}\left( N ( \eta \alpha \mathbf{s} + \eta \mathbf{a} + \mathbf{d} ) \right) $$

where

$\mathbf{h}$ is the SPAD measurement histogram,

$N$ is the number of pulses,

$\alpha$ is the attenuation coefficient that encapsulates radial falloff, object reflectance, etc.,

$\eta$ is the SPAD detection efficiency,

$\mathbf{s}$ is the average number of pulse detections from the laser,

$\mathbf{a}$ is the average number of detections due to ambient light,

$\mathbf{d}$ is the average number of dark count detections due to the SPAD hardware.

For this work, we restrict ourselves to simulating the radial falloff (photon flux decreases with the square of the distance) and variations in object reflectance (i.e. albedo). Thus, given a ground truth depth map $z$, we first calculate the effective response image as follows:

$$ r = \frac{A}{z^2} $$

where $A$ is the albedo image, collected using a technique called intrinsic imaging; we use software from Jeon et al. 2014 (CITE) to extract the albedo image. We then form a histogram of the depth values, weighting each pixel by its effective response $r$, to get the simulated hints histogram. Notice how the albedo image has had some of the bright highlights and dark shadows smoothed out, so that similar materials present more similarly.

Original RGB Image:

Albedo Image

For comparison, we also conduct the experiment inputting the raw histogram of depth values (which should, theoretically, give the cleanest signal to the neural network).
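The following NumPy sketch illustrates both histogram variants; the bin count, depth range, and noise parameters are illustrative assumptions, not values from the writeup:

```python
import numpy as np

def simulate_hints_histogram(depth, albedo=None, n_bins=64, d_max=10.0,
                             n_pulses=1000, eta=0.5):
    # depth, albedo: HxW arrays. With albedo=None this produces the "raw"
    # depth histogram; otherwise each pixel is weighted by its effective
    # response r = albedo / depth^2 (reflectance + radial falloff).
    z = depth.ravel()
    if albedo is None:
        weights = np.ones_like(z)
    else:
        weights = albedo.ravel() / np.maximum(z, 1e-3) ** 2
    counts, _ = np.histogram(z, bins=n_bins, range=(0.0, d_max), weights=weights)
    rate = n_pulses * eta * counts / (counts.sum() + 1e-9)
    noisy = np.random.poisson(rate).astype(np.float64)   # Poisson SPAD noise
    return noisy / max(noisy.sum(), 1.0)                 # normalize for the network
```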

Training

We trained our convolutional neural network on the NYU Depth v2 dataset [10] using the official train/test split; the dataset comes with a toolbox for extracting and processing the depth data. We used He initialization and a batch size of 20, and trained the network for 80 epochs, decaying the initial learning rate at the 40th epoch. We applied L2 regularization to the convolutional layer weights, and performed data augmentation by adding random flips and crops to the data.
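A sketch of this training setup, reusing HintsUNet and berhu_loss from the sketches above. The optimizer choice, learning rate, decay factor, and weight-decay strength are placeholders (the numeric values do not appear in this text), and train_loader is an assumed DataLoader yielding (rgb, hist, depth) batches of size 20:

```python
import torch
import torch.nn as nn

def he_init(m):
    # He (Kaiming) initialization on the conv layers, as described above
    if isinstance(m, nn.Conv2d):
        nn.init.kaiming_normal_(m.weight, nonlinearity="relu")
        if m.bias is not None:
            nn.init.zeros_(m.bias)

device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
model = HintsUNet().to(device)
model.apply(he_init)

# Placeholder hyperparameters; weight_decay implements the L2 regularization.
opt = torch.optim.Adam(model.parameters(), lr=1e-3, weight_decay=1e-4)
sched = torch.optim.lr_scheduler.MultiStepLR(opt, milestones=[40], gamma=0.1)

for epoch in range(80):
    for rgb, hist, depth in train_loader:   # random flips/crops applied upstream
        pred = model(rgb.to(device), hist.to(device))
        loss = berhu_loss(pred, depth.to(device))
        opt.zero_grad()
        loss.backward()
        opt.step()
    sched.step()
```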

Results

Qualitative

This is a selection of 6 photos from NYU Depth v2. On the left are the original RGB images, and on the right are the corresponding depth maps.

Monocular depth, estimated with the U-Net only (no histogram)

Monocular depth with hints, estimated with the raw depth histogram (no albedo or squared distance falloff).

Monocular depth with hints, estimated with the modified histogram that takes albedo and squared distance falloff into account.

Discussion

Qualitatively, our method performs quite reasonably. There is a clear separation between the two networks with hints and the network without hints. Furthermore, accounting for albedo and radial falloff does not result in severe decreases in depth map quality. The method is not yet competitive with the state of the art, however: while the raw-histogram variant behaves well, the simulated histogram degrades on scenes with strongly varying albedo, and many fine details are missing from the predicted depth maps.

Conclusions

In summary, we successfully demonstrated the effectiveness of adding a global hints histogram to a simple monocular depth CNN. We have shown that this method is robust enough to deal with albedo and squared-distance falloff effects, but have also shown that it has shortcomings with respect to darker-colored patches and fine details.

In the future, we will try adding hints to a state-of-the-art monocular depth estimation neural network and assess the relative performance of state-of-the-art methods with and without global hints. We will also try to improve the fine details of our depth maps and to limit the extent to which variations in albedo affect the output of the model. Finally, we will perform real experiments with SPADs and lasers to get a sense of the model's performance in the real world.

Appendix

References

[1] D. Shin et al., “Photon-Efficient Computational 3-D and Reflectivity Imaging With Single-Photon Detectors,” IEEE Trans. Computational Imaging, vol. 1, no. 2, pp. 112–125, 2015.

[2] D. B. Lindell, M. O’Toole, and G. Wetzstein, “Single-photon 3D imaging with deep sensor fusion,” ACM Trans. Graph., vol. 37, no. 4, pp. 1–12, 2018.

[3] A. Saxena, S. H. Chung, and A. Y. Ng, “Learning Depth from Single Monocular Images.” NIPS 2006.

[4] D. Eigen, C. Puhrsch, and R. Fergus, “Depth Map Prediction from a Single Image using a Multi-Scale Deep Network,” NIPS 2014.

[5] I. Laina, C. Rupprecht, V. Belagiannis, F. Tombari, and N. Navab, “Deeper Depth Prediction with Fully Convolutional Residual Networks,” 3DV 2016.

[6] C. Godard, O. Mac Aodha, and G. J. Brostow, “Unsupervised Monocular Depth Estimation with Left-Right Consistency,” CVPR 2017.

[7] O. Ronneberger, P. Fischer, and T. Brox, “U-Net: Convolutional Networks for Biomedical Image Segmentation,” MICCAI 2015.

[8] R. Zhang, J.-Y. Zhu, P. Isola, X. Geng, A. S. Lin, T. Yu, and A. A. Efros, “Real-Time User-Guided Image Colorization with Learned Deep Priors,” ACM Trans. Graph., vol. 36, no. 4, 2017.

[9] P. Swoboda and C. Schnörr, “Convex Variational Image Restoration with Histogram Priors,” SIAM J. Imaging Sci., vol. 6, no. 3, pp. 1719–1735, 2013.

[10] N. Silberman, D. Hoiem, P. Kohli, and R. Fergus, “Indoor Segmentation and Support Inference from RGBD Images,” ECCV 2012.