Monocular Depth Estimation with Global Depth Hints - Revision history

imported>Student2018: /* References */

2018-12-15T08:04:57Z

References

← Older revision		Revision as of 08:04, 15 December 2018
Line 106:		Line 106:
	== Appendix ==		== Appendix ==
	===References===		===References===
	[1~~] N. Silberman, D. Hoiem, P. Kohli, and R. Fergus, “Indoor Segmentation and Support Inference from RGBD Images,” pp. 1–14.~~
	~~[2] P. Swoboda and C. Schnörr, “Convex Variational Image Restoration with Histogram Priors,” SIAM J. Imaging Sci., vol. 6, no. 3, pp. 1719–1735, 2013~~		[1] D. Shin et al., “Photon-Efficient Computational 3-D and Reflectivity Imaging With Single-Photon Detectors,” vol. 1, no. 2, pp. 112–125, 2015.
	~~[3] C. Godard, O. Mac Aodha, and G. J. Brostow, “Unsupervised Monocular Depth Estimation with Left-Right Consistency,” Sep. 2016.~~
	~~[4] I. Laina, C. Rupprecht, V. Belagiannis, F. Tombari, and N. Navab, “Deeper depth prediction with fully convolutional residual networks,” in Proceeding~~		[2] D. B. Lindell, M. O’Toole, and G. Wetzstein, “Single-photon 3D imaging with deep sensor fusion,” ACM Trans. Graph., vol. 37, no. 4, pp. 1–12, 2018.
	[5] D. Shin et al., “Photon-Efficient Computational 3-D and Reflectivity Imaging With Single-Photon Detectors,” vol. 1, no. 2, pp. 112–125, 2015.
	[6] D. B. Lindell, M. O’Toole, and G. Wetzstein, “Single-photon 3D imaging with deep sensor fusion,” ACM Trans. Graph., vol. 37, no. 4, pp. 1–12, 2018.		[3] A. Saxena, S. H. Chung, and A. Y. Ng, “Learning Depth from Single Monocular Images.” NIPS 2006.
	[7] A. Saxena, S. H. Chung, and A. Y. Ng, “Learning Depth from Single Monocular Images.”
	[8] O. Ronneberger, P. Fischer, and T. Brox, “U-Net: Convolutional Networks for Biomedical Image Segmentation,” pp. 1–8.		[4] D. Eigen, C. Puhrsch, and R. Fergus, “Depth Map Prediction from a Single Image using a Multi-Scale Deep Network,” pp. 1–9.
	[9] R. Zhang, J. Zhu, X. Geng, A. S. Lin, T. Yu, and C. V May, “Real-Time User-Guided Image Colorization with Learned Deep Priors,” 2016.
	[10] D. ~~Eigen~~, C. ~~Puhrsch~~, and R. Fergus, ~~“Depth Map Prediction~~ from ~~a Single Image using a Multi-Scale Deep Network~~,” pp. ~~1–9~~.		[5] I. Laina, C. Rupprecht, V. Belagiannis, F. Tombari, and N. Navab, “Deeper depth prediction with fully convolutional residual networks,” 2016

			[6] C. Godard, O. Mac Aodha, and G. J. Brostow, “Unsupervised Monocular Depth Estimation with Left-Right Consistency,” Sep. 2016.

			[7] O. Ronneberger, P. Fischer, and T. Brox, “U-Net: Convolutional Networks for Biomedical Image Segmentation,” pp. 1–8, 2015.

			[8] R. Zhang, J. Zhu, X. Geng, A. S. Lin, T. Yu, and C. V May, “Real-Time User-Guided Image Colorization with Learned Deep Priors,” 2016.

			[9] P. Swoboda and C. Schnörr, “Convex Variational Image Restoration with Histogram Priors,” SIAM J. Imaging Sci., vol. 6, no. 3, pp. 1719–1735, 2013.

			[10] N. Silberman, D. Hoiem, P. Kohli, and R. Fergus, “Indoor Segmentation and Support Inference from RGBD Images,” pp. 1–14. ECCV 2012

imported>Student2018: /* References */

2018-12-15T08:01:28Z

References

← Older revision		Revision as of 08:01, 15 December 2018
Line 106:		Line 106:
	== Appendix ==		== Appendix ==
	===References===		===References===
			[1] N. Silberman, D. Hoiem, P. Kohli, and R. Fergus, “Indoor Segmentation and Support Inference from RGBD Images,” pp. 1–14.
			[2] P. Swoboda and C. Schnörr, “Convex Variational Image Restoration with Histogram Priors,” SIAM J. Imaging Sci., vol. 6, no. 3, pp. 1719–1735, 2013
			[3] C. Godard, O. Mac Aodha, and G. J. Brostow, “Unsupervised Monocular Depth Estimation with Left-Right Consistency,” Sep. 2016.
			[4] I. Laina, C. Rupprecht, V. Belagiannis, F. Tombari, and N. Navab, “Deeper depth prediction with fully convolutional residual networks,” in Proceeding
			[5] D. Shin et al., “Photon-Efficient Computational 3-D and Reflectivity Imaging With Single-Photon Detectors,” vol. 1, no. 2, pp. 112–125, 2015.
			[6] D. B. Lindell, M. O’Toole, and G. Wetzstein, “Single-photon 3D imaging with deep sensor fusion,” ACM Trans. Graph., vol. 37, no. 4, pp. 1–12, 2018.
			[7] A. Saxena, S. H. Chung, and A. Y. Ng, “Learning Depth from Single Monocular Images.”
			[8] O. Ronneberger, P. Fischer, and T. Brox, “U-Net: Convolutional Networks for Biomedical Image Segmentation,” pp. 1–8.
			[9] R. Zhang, J. Zhu, X. Geng, A. S. Lin, T. Yu, and C. V May, “Real-Time User-Guided Image Colorization with Learned Deep Priors,” 2016.
			[10] D. Eigen, C. Puhrsch, and R. Fergus, “Depth Map Prediction from a Single Image using a Multi-Scale Deep Network,” pp. 1–9.

imported>Student2018: /* Training */

2018-12-15T08:01:14Z

Training

← Older revision		Revision as of 08:01, 15 December 2018
Line 74:		Line 74:

	===Training===		===Training===
	We trained our convolutional neural network on the NYU Depth v2 dataset ~~(CITE)~~ using the official train/test split. The dataset comes with a toolbox for extracting and processing the depth data. We used He initialization ~~(CITE)~~ with a batch size of 20 and trained our network for 80 epochs with an initial learning rate of <math> 1e-3 </math>, decaying it by <math> 0.1 </math> at the 40th epoch. We used L2 regularization on the convolutional layer weights with <math> \lambda = 1e-8 </math>. Finally, we performed data augmentation by adding random flips and crops to the data.		We trained our convolutional neural network on the NYU Depth v2 dataset [10] using the official train/test split. The dataset comes with a toolbox for extracting and processing the depth data. We used He initialization with a batch size of 20 and trained our network for 80 epochs with an initial learning rate of <math> 1e-3 </math>, decaying it by <math> 0.1 </math> at the 40th epoch. We used L2 regularization on the convolutional layer weights with <math> \lambda = 1e-8 </math>. Finally, we performed data augmentation by adding random flips and crops to the data.
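The schedule and regularizer described in the Training paragraph can be sketched in plain Python. This is only an illustration under stated assumptions: the function names are hypothetical, epochs are counted from 0, "at the 40th epoch" is read as "from epoch index 40 onward", and the L2 penalty is written without a factor of 1/2, which the text does not specify.

```python
def learning_rate(epoch, base_lr=1e-3, decay=0.1, decay_epoch=40):
    """Step schedule: base_lr before decay_epoch, base_lr * decay from then on.

    Assumes 0-indexed epochs (an assumption, not stated in the text).
    """
    if epoch >= decay_epoch:
        return base_lr * decay
    return base_lr


def l2_penalty(conv_weights, weight_decay=1e-8):
    """L2 regularization term: weight_decay times the sum of squared weights."""
    return weight_decay * sum(w * w for w in conv_weights)


# Learning rate for each of the 80 training epochs.
schedule = [learning_rate(epoch) for epoch in range(80)]
```

In a real training loop the penalty would be added to the data loss for the convolutional weights only, as the paragraph states.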

	== Results ==		== Results ==

imported>Student2018: /* Global Hints */

2018-12-15T07:55:35Z

Global Hints

← Older revision		Revision as of 07:55, 15 December 2018
Line 14:		Line 14:

	===Global Hints===		===Global Hints===
	The idea of using 1D histograms in image processing and reconstruction is not new. Swoboda and Schnörr use a convex variational approach to perform various image restoration tasks (such as denoising and inpainting) leveraging histogram information about the image ~~(CITE)~~. In image colorization, Zhang et al. ~~(CITE)~~ use a convolutional neural network derived from the U-Net ~~(CITE)~~ that takes in an input grayscale image, incorporates color histogram information in the bottleneck layers, and outputs a full-color image that represents an intelligent synthesis of the spatial information from the grayscale image and the color information from the histogram.		The idea of using 1D histograms in image processing and reconstruction is not new. Swoboda and Schnörr use a convex variational approach to perform various image restoration tasks (such as denoising and inpainting) leveraging histogram information about the image [9]. In image colorization, Zhang et al. [8] use a convolutional neural network derived from the U-Net that takes in an input grayscale image, incorporates color histogram information in the bottleneck layers, and outputs a full-color image that represents an intelligent synthesis of the spatial information from the grayscale image and the color information from the histogram.

	== Methods ==		== Methods ==

imported>Student2018: /* Convolutional Neural Network */

2018-12-15T07:54:06Z

Convolutional Neural Network

← Older revision		Revision as of 07:54, 15 December 2018
Line 22:		Line 22:
	[[File:Model.png\|800px]]		[[File:Model.png\|800px]]

	The model is based on a U-Net ~~(Ronneberger et. al. 2015)~~, which has shown good performance at other image-related tasks such as semantic segmentation. The left side of the network takes as input an RGB image, and consists of a (Conv 3x3, ReLU), followed by 4 downsampling stages at which the image is halved in resolution. Each downsampling stage consists of a 2x2 max pooling, followed by (Conv 3x3, Batchnorm, ReLU) x2.		The model is based on a U-Net [7], which has shown good performance at other image-related tasks such as semantic segmentation. The left side of the network takes as input an RGB image, and consists of a (Conv 3x3, ReLU), followed by 4 downsampling stages at which the image is halved in resolution. Each downsampling stage consists of a 2x2 max pooling, followed by (Conv 3x3, Batchnorm, ReLU) x2.

	The lower branch of the network takes the histogram as input, and just performs (Conv 1x1, ReLU) x4. The output of this branch is expanded in the spatial dimension and concatenated to the features from the left branch of the network before being fed into the right branch. The right branch consists of 4 upsampling layers, followed by a single output Conv 1x1 layer. Each upsampling layer first concatenates the output of the left branch of the network at the same resolution, then applies a single bilinear upsampling operation, followed by (Conv 3x3, Batchnorm, ReLU) x2.		The lower branch of the network takes the histogram as input, and just performs (Conv 1x1, ReLU) x4. The output of this branch is expanded in the spatial dimension and concatenated to the features from the left branch of the network before being fed into the right branch. The right branch consists of 4 upsampling layers, followed by a single output Conv 1x1 layer. Each upsampling layer first concatenates the output of the left branch of the network at the same resolution, then applies a single bilinear upsampling operation, followed by (Conv 3x3, Batchnorm, ReLU) x2.
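The step where the histogram branch output is "expanded in the spatial dimension and concatenated" to the image features might look like the following sketch. It uses nested Python lists in channel-first layout (channels, height, width) instead of a deep learning framework; the function name and the layout are assumptions made for illustration, not details taken from the project code.

```python
def tile_and_concat(image_features, hist_features):
    """Broadcast a per-channel histogram feature vector over the spatial grid
    and append it to the image feature maps along the channel axis.

    image_features: list of channels, each a height x width list of lists.
    hist_features:  flat list with one value per histogram-branch channel.
    """
    height = len(image_features[0])
    width = len(image_features[0][0])
    # Each histogram feature becomes a constant height x width map.
    tiled = [[[value] * width for _ in range(height)] for value in hist_features]
    return image_features + tiled


# Two 2x3 image feature maps plus two histogram features give four channels.
bottleneck = [
    [[1.0, 2.0, 3.0], [4.0, 5.0, 6.0]],
    [[0.0, 0.0, 0.0], [1.0, 1.0, 1.0]],
]
fused = tile_and_concat(bottleneck, [5.0, 7.0])
```

Because every spatial location receives the same histogram features, the right branch sees the global depth hint at each pixel of the bottleneck.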

imported>Student2018: /* Monocular Depth Estimation */

2018-12-15T07:53:15Z

Monocular Depth Estimation

← Older revision		Revision as of 07:53, 15 December 2018
Line 8:		Line 8:
	== Background and Related Work==		== Background and Related Work==
	===Monocular Depth Estimation===		===Monocular Depth Estimation===
	The task of estimating per-pixel depth from a single image, while fundamentally an ill-posed problem, is nonetheless an area of intense interest. Intuitively, all such methods must make use of the monocular depth cues found in natural images. Saxena et al. ~~(CITE)~~ use a Markov random field model. More recently, however, convolutional neural networks have been successful at estimating depth in both supervised ~~(CITE Eigen, Laina)~~ and unsupervised ~~(CITE Godard)~~ learning scenarios. These methods all involve training a convolutional neural network to predict depth from a single input image.		The task of estimating per-pixel depth from a single image, while fundamentally an ill-posed problem, is nonetheless an area of intense interest. Intuitively, all such methods must make use of the monocular depth cues found in natural images. Saxena et al. [3] use a Markov random field model. More recently, however, convolutional neural networks have been successful at estimating depth in both supervised [4] [5] and unsupervised [6] learning scenarios. These methods all involve training a convolutional neural network to predict depth from a single input image.

	===Single-photon Avalanche Diodes===		===Single-photon Avalanche Diodes===

imported>Student2018: /* Introduction */

2018-12-15T07:50:20Z

Introduction

← Older revision		Revision as of 07:50, 15 December 2018
Line 2:		Line 2:
	For many imaging tasks, good estimates of 3D structure can lead to significant improvements in performance. Recent advances in convolutional neural networks have made it possible to estimate depth from single RGB images (“Monocular depth estimation”). However, these methods are not able (without considering e.g. focal length) to resolve the inherent scale ambiguity present in a single monocular image (i.e. the tradeoff between distance to and size of the imaged object), and therefore cannot be trusted to provide reliable global depth estimates in all circumstances.		For many imaging tasks, good estimates of 3D structure can lead to significant improvements in performance. Recent advances in convolutional neural networks have made it possible to estimate depth from single RGB images (“Monocular depth estimation”). However, these methods are not able (without considering e.g. focal length) to resolve the inherent scale ambiguity present in a single monocular image (i.e. the tradeoff between distance to and size of the imaged object), and therefore cannot be trusted to provide reliable global depth estimates in all circumstances.

	Pulsed-light time-of-flight based methods are capable of extremely high depth resolution at moderate and even long ranges. Such pulsed illumination methods rely on sensitive photodetectors with high temporal resolution, such as avalanche photodiodes (APDs) or single-photon avalanche diodes (SPADs), combined with nanosecond lasers, and are capable of estimating depth to high accuracy ~~(Shin et. al. 2015)~~. Experiments by Lindell et al. 2018 have further demonstrated the viability of sensor fusion, combining RGB images with SPAD data quickly and efficiently using a convolutional neural network. The drawbacks of these methods are that they are expensive and that they require highly precise scanning setups.		Pulsed-light time-of-flight based methods are capable of extremely high depth resolution at moderate and even long ranges. Such pulsed illumination methods rely on sensitive photodetectors with high temporal resolution, such as avalanche photodiodes (APDs) or single-photon avalanche diodes (SPADs), combined with nanosecond lasers, and are capable of estimating depth to high accuracy [1]. Experiments by Lindell et al. 2018 [2] have further demonstrated the viability of sensor fusion, combining RGB images with SPAD data quickly and efficiently using a convolutional neural network. The drawbacks of these methods are that they are expensive and that they require highly precise scanning setups.

	We propose to resolve these two final drawbacks with a sensor-fusion method for combining pulsed-illumination time-of-flight depth estimation with RGB monocular depth estimation. Instead of just using RGB images to estimate depth, we augment the RGB image with a histogram of image depths collected from a time-of-flight sensor.		We propose to resolve these two final drawbacks with a sensor-fusion method for combining pulsed-illumination time-of-flight depth estimation with RGB monocular depth estimation. Instead of just using RGB images to estimate depth, we augment the RGB image with a histogram of image depths collected from a time-of-flight sensor.
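To make the "histogram of image depths" concrete, here is a minimal sketch that bins a flat list of per-pixel depths into a fixed number of bins. The function name, the bin layout, and the optional per-pixel weights are assumptions for illustration. The weighted call shows one plausible way to mimic a sensor response (albedo divided by squared distance); it is not the project's actual simulation.

```python
def depth_histogram(depths, n_bins, max_depth, weights=None):
    """Histogram of depths over (0, max_depth] with n_bins equal-width bins.

    weights: optional per-pixel contribution; defaults to 1 per pixel,
    which gives the raw depth histogram.
    """
    if weights is None:
        weights = [1.0] * len(depths)
    hist = [0.0] * n_bins
    bin_width = max_depth / n_bins
    for depth, weight in zip(depths, weights):
        if depth <= 0.0 or depth > max_depth:
            continue  # skip invalid or out-of-range readings
        index = min(int(depth / bin_width), n_bins - 1)
        hist[index] += weight
    return hist


# Raw histogram: every pixel counts once.
raw = depth_histogram([0.5, 1.5, 1.6, 9.9], n_bins=10, max_depth=10.0)

# Hypothetical weighting: albedo / depth^2 for each pixel.
pixel_depths = [1.0, 2.0]
pixel_albedo = [0.5, 0.8]
pixel_weights = [a / (d * d) for a, d in zip(pixel_albedo, pixel_depths)]
weighted = depth_histogram(pixel_depths, n_bins=4, max_depth=4.0, weights=pixel_weights)
```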

imported>Student2018: /* Results */

2018-12-15T07:46:11Z

Results

← Older revision		Revision as of 07:46, 15 December 2018
Line 78:		Line 78:
	== Results ==		== Results ==
	===Qualitative===		===Qualitative===
	~~We show~~ a ~~table~~ of ~~figures below demonstrating~~ the ~~qualitative performance of~~ the ~~method~~.		This is a selection of 6 photos from NYU Depth v2. On the left are the original RGB images, and on the right are the corresponding depth maps.
	~~===Quantitative===~~
	~~We show a table below collecting some common metrics for evaluating~~ the ~~performance of~~ depth ~~estimation methods~~. ~~We also show Laina et~~. al.~~’s metrics for comparison~~.		[[File:Rgb.png\|400px]] [[File:Ground_truth.png\|400px]]

			Monocular depth, estimated with the U-Net only (no histogram)

			[[File:Nohints.png\|400px]]

			Monocular depth with hints, estimated with the raw depth histogram (no albedo or squared distance falloff).

			[[File:Rawhints.png\|400px]]

			Monocular depth with hints, estimated with the modified histogram that takes albedo and squared distance falloff into account.

			[[File:Hints.png\|400px]]

	===Discussion===		===Discussion===
	Qualitatively, our method performs quite reasonably. There is a clear separation between both networks with hints and the network without hints. Furthermore, it appears that accounting for albedo and radial falloff does not result in a severe decrease in depth map quality.		Qualitatively, our method performs quite reasonably. There is a clear separation between both networks with hints and the network without hints. Furthermore, it appears that accounting for albedo and radial falloff does not result in a severe decrease in depth map quality.
	~~Not~~ competitive with state of the art, unfortunately.		It's not competitive with state of the art, unfortunately.
	Issues: Raw histogram is ok, simulated histogram messes up when confronted with scenes with varying albedo (show example). Also, many fine details are missing.		Issues: Raw histogram is ok, simulated histogram messes up when confronted with scenes with varying albedo (show example). Also, many fine details are missing.

imported>Student2018: /* Convolutional Neural Network */

2018-12-15T07:30:19Z

Convolutional Neural Network

← Older revision		Revision as of 07:30, 15 December 2018
Line 27:		Line 27:

	We use the Reverse Huber (or BerHu) loss. This loss has the form:		We use the Reverse Huber (or BerHu) loss. This loss has the form:
	[[File:berhu.png\|~~200px~~]]
			[[File:berhu.png\|300px]]

			Using this loss (or variants thereof) has been effective at helping convolutional neural networks learn to estimate depth from monocular images.
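Since the formula itself lives in the berhu.png image, the sketch below writes out the commonly used form of the reverse Huber penalty, as given by Laina et al. [5]: |x| for |x| ≤ c and (x² + c²) / (2c) otherwise. Setting c to 20% of the largest absolute residual in the batch follows that paper and is an assumption here, as are the function names; the image on the page may define the threshold differently.

```python
def berhu(residual, c):
    """Reverse Huber penalty for one residual: L1 near zero, quadratic beyond c."""
    magnitude = abs(residual)
    if magnitude <= c:
        return magnitude
    return (magnitude * magnitude + c * c) / (2.0 * c)


def berhu_loss(predictions, targets, ratio=0.2):
    """Mean BerHu penalty over paired predicted and ground-truth depths.

    The threshold c = ratio * max |residual| is an assumed convention.
    """
    residuals = [p - t for p, t in zip(predictions, targets)]
    c = ratio * max(abs(r) for r in residuals)
    if c == 0.0:
        return 0.0  # perfect prediction, nothing to penalize
    return sum(berhu(r, c) for r in residuals) / len(residuals)
```

The two branches meet at |x| = c, where both evaluate to c, so the penalty is continuous.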

	The network is "fully convolutional," and so may take images of any resolution. In practice, we crop our inputs to be divisible by 16 so that the 4 downsampling and 4 upsampling operations can be performed without any issues. Intuitively, the left half of the network produces a compressed feature representation of the image in the "bottleneck", which is then expanded by the right branch to form the depth map. The idea behind the bottleneck is to force the network to learn a compact feature representation of the input image. The "skip connections" from the left side to the right side allow the right side of the network to use the high-resolution features from the left side of the network to		The network is "fully convolutional," and so may take images of any resolution. In practice, we crop our inputs to be divisible by 16 so that the 4 downsampling and 4 upsampling operations can be performed without any issues. Intuitively, the left half of the network produces a compressed feature representation of the image in the "bottleneck", which is then expanded by the right branch to form the depth map. The idea behind the bottleneck is to force the network to learn a compact feature representation of the input image. The "skip connections" from the left side to the right side allow the right side of the network to use the high-resolution features from the left side of the network to

imported>Student2018: /* Convolutional Neural Network */

2018-12-15T07:28:44Z

Convolutional Neural Network

← Older revision		Revision as of 07:28, 15 December 2018
Line 25:		Line 25:

	The lower branch of the network takes the histogram as input, and just performs (Conv 1x1, ReLU) x4. The output of this branch is expanded in the spatial dimension and concatenated to the features from the left branch of the network before being fed into the right branch. The right branch consists of 4 upsampling layers, followed by a single output Conv 1x1 layer. Each upsampling layer first concatenates the output of the left branch of the network at the same resolution, then applies a single bilinear upsampling operation, followed by (Conv 3x3, Batchnorm, ReLU) x2.		The lower branch of the network takes the histogram as input, and just performs (Conv 1x1, ReLU) x4. The output of this branch is expanded in the spatial dimension and concatenated to the features from the left branch of the network before being fed into the right branch. The right branch consists of 4 upsampling layers, followed by a single output Conv 1x1 layer. Each upsampling layer first concatenates the output of the left branch of the network at the same resolution, then applies a single bilinear upsampling operation, followed by (Conv 3x3, Batchnorm, ReLU) x2.

			We use the Reverse Huber (or BerHu) loss. This loss has the form:
			[[File:berhu.png\|200px]]

	The network is "fully convolutional," and so may take images of any resolution. In practice, we crop our inputs to be divisible by 16 so that the 4 downsampling and 4 upsampling operations can be performed without any issues. Intuitively, the left half of the network produces a compressed feature representation of the image in the "bottleneck", which is then expanded by the right branch to form the depth map. The idea behind the bottleneck is to force the network to learn a compact feature representation of the input image. The "skip connections" from the left side to the right side allow the right side of the network to use the high-resolution features from the left side of the network to		The network is "fully convolutional," and so may take images of any resolution. In practice, we crop our inputs to be divisible by 16 so that the 4 downsampling and 4 upsampling operations can be performed without any issues. Intuitively, the left half of the network produces a compressed feature representation of the image in the "bottleneck", which is then expanded by the right branch to form the depth map. The idea behind the bottleneck is to force the network to learn a compact feature representation of the input image. The "skip connections" from the left side to the right side allow the right side of the network to use the high-resolution features from the left side of the network to
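The cropping rule mentioned above (inputs divisible by 16 so that four halvings and four doublings line up) can be sketched as a small helper that computes a crop box. A centered crop and the function name are assumptions; the text only says the inputs are cropped.

```python
def center_crop_box(height, width, multiple=16):
    """Return (top, left, new_height, new_width) for the largest centered crop
    whose sides are both divisible by `multiple`."""
    new_height = height - height % multiple
    new_width = width - width % multiple
    top = (height - new_height) // 2
    left = (width - new_width) // 2
    return top, left, new_height, new_width


# A 480x640 frame already satisfies the constraint, so nothing is removed.
full_frame = center_crop_box(480, 640)

# An odd-sized input is trimmed to 416x560.
trimmed = center_crop_box(427, 561)
```

With four 2x2 poolings the bottleneck of a 480x640 input is 30x40, and the four upsampling stages return exactly to 480x640.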