Monocular Depth Estimation with Global Depth Hints

From Psych 221 Image Systems Engineering

Introduction

For many imaging tasks, good estimates of 3D structure can lead to significant improvements in performance. Recent advances in convolutional neural networks have made it possible to estimate depth from a single RGB image ("monocular depth estimation"). However, these methods cannot (without additional information such as focal length) resolve the inherent scale ambiguity of a single monocular image (the tradeoff between the distance to and the size of an imaged object), and therefore cannot be trusted to provide reliable global depth estimates in all circumstances.

Pulsed-light time-of-flight methods, by contrast, are capable of extremely high depth resolution at moderate and even long ranges. Such methods rely on sensitive photodetectors with high temporal resolution, such as avalanche photodiodes (APDs) or single-photon avalanche diodes (SPADs), combined with nanosecond lasers, and can estimate depth to high accuracy (Shin et al. 2015). Experiments by Lindell et al. (2018) have further demonstrated the viability of sensor fusion, combining RGB images with SPAD data quickly and efficiently using a convolutional neural network. The drawbacks of these methods are that they are financially expensive and require highly precise scanning setups.

We propose to resolve these two drawbacks with a sensor-fusion method that combines pulsed-illumination time-of-flight depth estimation with RGB monocular depth estimation. Instead of using only an RGB image to estimate depth, we augment the RGB image with a histogram of scene depths collected from a time-of-flight sensor.

Monocular Depth Estimation

The task of estimating per-pixel depth from a single image, while fundamentally an ill-posed problem, is nonetheless an area of intense interest. Intuitively, all such methods must exploit the monocular depth cues found in natural images. Saxena et al. (CITE) estimate depth using Markov random fields. More recently, convolutional neural networks have been successful at estimating depth in both supervised (CITE Eigen, Laina) and unsupervised (CITE Godard) learning settings. These methods all train a convolutional neural network to predict depth from a single input image.

Single-photon Avalanche Diodes

Single-photon Avalanche Diodes (SPADs) are a new type of sensor capable of recording the arrival times of individual photons with picosecond accuracy. This extremely high temporal resolution enables pulsed time-of-flight LIDAR by precisely recording the travel time of photons emitted from a pulsed laser into the scene and reflected back to the sensor. Using the travel time and the speed of light, one can easily compute the distance traveled by the photon, and hence the distance to the object of interest.
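The travel-time-to-distance conversion described above can be sketched in a few lines. This is a minimal illustration, not the project's actual processing pipeline; the input array of round-trip travel times is hypothetical example data.

```python
import numpy as np

C = 299_792_458.0  # speed of light in vacuum, m/s


def tof_depth(travel_times_ps):
    """Convert round-trip photon travel times (picoseconds) to depth (meters)."""
    travel_times_s = np.asarray(travel_times_ps, dtype=float) * 1e-12
    # The photon traverses the sensor-to-object distance twice
    # (out and back), so divide the total path length by two.
    return C * travel_times_s / 2.0


# A 10 ns round trip corresponds to roughly 1.5 m.
print(tof_depth([10_000]))  # -> [1.49896229]
```

In practice a SPAD accumulates many such photon arrivals into a timing histogram per pixel, and the depth estimate comes from the peak (or a statistical fit) of that histogram rather than a single photon.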

Global Hints

The idea of using 1D histograms in image processing and reconstruction is not new. Swoboda and Schnörr use a convex variational approach to perform image restoration tasks (such as denoising and inpainting) that leverages histogram information about the image (CITE). In image colorization, Zhang et al. (CITE) use a convolutional neural network derived from the U-Net (CITE) that takes an input grayscale image, incorporates color histogram information in the bottleneck layers, and outputs a full-color image representing an intelligent synthesis of the spatial information from the grayscale image and the color information from the histogram.
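In the same spirit as Zhang et al.'s global color hints, a global depth hint can be formed by histogramming time-of-flight depth measurements into a fixed-length vector suitable for injection at a network bottleneck. The sketch below is illustrative only (the bin count, maximum range, and function name are assumptions, not the project's actual configuration):

```python
import numpy as np


def global_depth_hint(depths_m, n_bins=64, d_max=10.0):
    """Histogram scene depth samples (meters) into a fixed-length hint vector.

    Hypothetical helper for illustration: bins depths over [0, d_max]
    and normalizes the counts to a probability distribution over bins.
    """
    hist, _ = np.histogram(depths_m, bins=n_bins, range=(0.0, d_max))
    total = hist.sum()
    # Guard against an empty measurement set.
    return hist / total if total > 0 else hist.astype(float)


# Example: four depth samples produce a 64-bin hint summing to 1.
hint = global_depth_hint(np.array([1.2, 1.3, 4.8, 9.9]))
print(hint.shape)  # -> (64,)
```

Because the hint is a 1D distribution with no spatial layout, it can be broadcast (e.g., tiled) across the spatial dimensions of a bottleneck feature map, exactly as global histograms are handled in the colorization setting.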

Methods

Results

Conclusions

Appendix

You can write math equations as follows: y=x+5

You can include images as follows (you will need to upload the image first via the "Upload file" link in the toolbox on the left bar).