Depth Estimation with an Endoscope using Front Flash
Introduction
An endoscope is an imaging device used to visualize internal human organs. It consists of a long flexible tube with a camera and a flash at its tip. This imaging technique allows for tissue biopsy of internal organs and for the detection and removal of tumors. Current endoscopes are 2-D visual systems. For this project, we calculate a depth map from two 2-D images.
Literature Review
Some research suggests that depth information (i.e., 3-D imaging) would allow for better operative safety and faster equipment training [1]. There are at least five different 3-D measuring techniques that allow for depth sensing [2].
One of them is time-of-flight measurement, in which a piece of hardware mounted next to the camera emits flashes of near-infrared light, and depth is calculated from the time it takes each pulse to return to the image sensor. Because this technique operates in the near-infrared region, it does not pass medical safety regulations. Another method uses an intentionally aberrated projected pattern to obtain depth information; it, too, requires a projection system, which comes at additional cost.
The method we propose does not involve any additional hardware. Depth maps are calculated from images taken with the camera at different positions. The technique works because the intensity of the scene changes as the position of the camera and flash changes, and this change in intensity allows a depth map to be calculated.
Method
Initial Image Data
We obtained three pairs of optical images that differ in texture, ranging from no texture to half texture to full texture, pictured below. Each pair consists of one image taken close to the scene and a second taken 50 mm farther away. In each pair the flash is at a fixed location relative to the camera, so the farther image appears darker (at a lower intensity) than the closer image. Later, we use this difference in intensity between corresponding feature points in the two images to estimate depth.
(Left to right) No texture, half texture, and full texture metronome optical images.
ISET Processing
Before we can accurately compare these images, we must convert each optical image to a display image. This is achieved through an imaging pipeline implemented in ISET. First, the optical image is passed through sensor processing, which accounts for read noise (the noise inherent in the darker portions of a detected image) and lets us adjust the exposure time, i.e., how long the flash must be on in the endoscope imaging system. The sensor data are then demosaiced; no gamma correction is applied, because the intensity data must remain linear for the depth calculation.
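The ISET pipeline itself is a MATLAB toolbox; purely as an illustration of the demosaicing step (keeping intensities linear, with no gamma correction), the sketch below performs a generic bilinear demosaic of an RGGB Bayer mosaic in Python. It is not the ISET implementation, and the RGGB layout is an assumption for the example.

```python
import numpy as np
from scipy.ndimage import convolve

def demosaic_bilinear(bayer):
    """Bilinear demosaic of an RGGB Bayer mosaic (H x W float array)."""
    H, W = bayer.shape
    # Masks selecting each color's sample locations in the RGGB pattern.
    r_mask = np.zeros((H, W)); r_mask[0::2, 0::2] = 1
    b_mask = np.zeros((H, W)); b_mask[1::2, 1::2] = 1
    g_mask = 1.0 - r_mask - b_mask
    # Interpolation kernels: green has twice the sample density of red/blue.
    k_g = np.array([[0, 1, 0], [1, 4, 1], [0, 1, 0]]) / 4.0
    k_rb = np.array([[1, 2, 1], [2, 4, 2], [1, 2, 1]]) / 4.0
    rgb = np.empty((H, W, 3))
    rgb[..., 0] = convolve(bayer * r_mask, k_rb)
    rgb[..., 1] = convolve(bayer * g_mask, k_g)
    rgb[..., 2] = convolve(bayer * b_mask, k_rb)
    return rgb  # left in linear units: no gamma correction applied
```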
SIFT Algorithm
We aim to match corresponding points in each pair of images so that we can perform perspective correction and then compare image intensities directly. We identify these correspondence points with the Scale-Invariant Feature Transform (SIFT) algorithm, implemented in MATLAB code attached in the appendix. It works through the following steps:
Detect Local Extrema
To detect local maxima and minima in a given image, the SIFT algorithm first reads the image in grayscale and computes a scale space by applying a smoothing Gaussian blur filter with different values of sigma (different amounts of blur). Scale spaces with small values of sigma are better at identifying small features of interest from the extrema in that space, while scale spaces with larger values of sigma are better at identifying larger features. This can be seen in the scale space pictured below, constructed for our non-textured pair of images.
The difference between each closest pair of these spaces, called the difference of Gaussians (DoG), is then computed, and the local extrema are extracted by comparing each pixel against its nearest neighbors. Each pixel has 26 neighbors to compare against: 8 adjacent pixels in the same DoG layer and the 9 neighboring pixels in each of the layers above and below.
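The project's SIFT code is MATLAB and is attached in the appendix; the extrema search described above can be sketched generically in Python as follows. The sigma values and contrast threshold here are illustrative choices, not the project's parameters.

```python
import numpy as np
from scipy.ndimage import gaussian_filter, maximum_filter, minimum_filter

def dog_extrema(img, sigmas=(1.0, 1.6, 2.6, 4.1), thresh=0.01):
    """Find local extrema of the difference-of-Gaussians scale space.

    img    : 2-D grayscale array
    sigmas : blur levels defining the scale space (illustrative values)
    thresh : minimum |DoG| response, rejecting low-contrast points
    """
    # Scale space: the same image at increasing amounts of blur.
    space = np.stack([gaussian_filter(img, s) for s in sigmas])
    # Difference of Gaussians between adjacent blur levels.
    dog = np.diff(space, axis=0)
    # A point is an extremum if it beats all 26 neighbors:
    # 8 in its own DoG layer plus 9 in each adjacent layer.
    mx = maximum_filter(dog, size=3)
    mn = minimum_filter(dog, size=3)
    keypoints = []
    for k in range(1, dog.shape[0] - 1):  # skip boundary layers
        layer = dog[k]
        is_ext = ((layer == mx[k]) | (layer == mn[k])) & (np.abs(layer) > thresh)
        ys, xs = np.nonzero(is_ext)
        keypoints.extend((y, x, sigmas[k]) for y, x in zip(ys, xs))
    return keypoints
```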
Correspondence point identification and filtering
Our final set of correspondence points is determined by filtering the detected extrema with a high and a low threshold on their values relative to the surrounding local extrema. A correspondence point cannot change too little relative to its neighboring extrema, since that indicates too little contrast to be distinguishable, and it cannot change too much relative to other local extrema, since that indicates the point lies on a hard-to-localize edge between two very different contrast levels in the image.
Construct correspondence point descriptors
For each point in our set of determined correspondence points, our algorithm looks at the neighboring pixels and forms a descriptor based on the grayscale distribution surrounding the correspondence point.
Match similar descriptors in two images
For a given pair of images in our data set, the descriptors should be similar enough to make a corresponding match regardless of scale, rotation, or intensity. Note that SIFT produces many more correspondence points for highly textured surfaces, owing to the larger number of maximum and minimum feature points in such images. The matches for our no-texture and full-texture images are shown below. By visual inspection alone, one can see that our algorithm is reasonably good at forming correct correspondences; a typical set of matching points contains 8-10% mismatches.
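Descriptor matching of this kind is commonly done by nearest-neighbor search with a ratio test that discards ambiguous matches (Lowe's ratio test). The sketch below is an illustrative stand-in in Python, not necessarily the matching criterion used in our MATLAB code; the ratio value is an assumption.

```python
import numpy as np

def match_descriptors(desc1, desc2, ratio=0.8):
    """Match two descriptor sets by nearest neighbor with a ratio test.

    desc1, desc2 : (N, D) arrays, one D-dimensional descriptor per keypoint.
    A match (i, j) is kept only when the best candidate in desc2 is
    clearly closer than the runner-up, which discards ambiguous matches.
    """
    matches = []
    for i, d in enumerate(desc1):
        dists = np.linalg.norm(desc2 - d, axis=1)
        order = np.argsort(dists)
        best, second = int(order[0]), int(order[1])
        if dists[best] < ratio * dists[second]:
            matches.append((i, best))
    return matches
```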
Perspective Correction
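The perspective-correction step is carried out in the attached MATLAB code. Purely as an illustrative sketch of the standard approach, one can estimate a 3x3 homography between the two views from the matched correspondence points via the direct linear transform (DLT); the Python function below is a hypothetical example, not the project's implementation.

```python
import numpy as np

def estimate_homography(src, dst):
    """Estimate a 3x3 homography mapping src -> dst via the DLT.

    src, dst : (N, 2) arrays of matched points, N >= 4.
    Each match contributes two linear constraints on the 9 entries of H.
    """
    A = []
    for (x, y), (u, v) in zip(src, dst):
        A.append([-x, -y, -1, 0, 0, 0, u * x, u * y, u])
        A.append([0, 0, 0, -x, -y, -1, v * x, v * y, v])
    # The homography is the null vector of A (smallest singular value).
    _, _, Vt = np.linalg.svd(np.asarray(A, dtype=float))
    H = Vt[-1].reshape(3, 3)
    return H / H[2, 2]
```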
Depth Estimation
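The exact depth formula is part of the attached code; the sketch below only illustrates the underlying idea. If one assumes a point flash with pure inverse-square intensity falloff on a Lambertian surface, then for a pixel at distance d from the nearer flash position, the 50 mm baseline b between the two shots gives I_near / I_far = ((d + b) / d)^2, which can be solved for d. This is a simplified model, not the project's MATLAB implementation.

```python
import numpy as np

def depth_from_intensity_ratio(I_near, I_far, baseline_mm=50.0):
    """Per-pixel depth from two flash-lit images (illustrative model).

    Assumes a Lambertian surface and pure inverse-square falloff from a
    point flash, so  I_near / I_far = ((d + baseline) / d)**2  at each
    pixel, where d is the distance from the nearer flash position.
    Solving for d:  d = baseline / (sqrt(I_near / I_far) - 1).
    """
    ratio = np.sqrt(np.maximum(I_near, 1e-12) / np.maximum(I_far, 1e-12))
    return baseline_mm / np.maximum(ratio - 1.0, 1e-12)
```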
Results
After image processing in ISET, as described above, we ran our SIFT algorithm to obtain correspondence points for each of our three sets of differently textured initial images. These points were then used for perspective correction. Our final depth maps were produced by comparing the intensities of the two images and passing the maps through a median filter, which smooths each pixel based on its 3x3 surrounding pixel area to remove some edge-based artifacts. The filtered and non-filtered images are shown below.
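As an illustration of this filtering step, a 3x3 median filter removes isolated spikes of the kind left by edge artifacts while leaving smooth regions untouched. The sketch below uses SciPy's median_filter on a made-up depth map; the values are hypothetical.

```python
import numpy as np
from scipy.ndimage import median_filter

# Hypothetical depth map (mm) with a single-pixel spike artifact.
depth = np.full((5, 5), 100.0)
depth[2, 2] = 500.0  # outlier of the kind left by edge-based artifacts

# 3x3 median filter: each output pixel is the median of its neighborhood,
# so the isolated spike is replaced while flat regions are unchanged.
smoothed = median_filter(depth, size=3)
```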
Figure 1 shows the true depth map.
Figure 2 is the no texture case: a) is the initial depth calculation based on 2-flash depth estimation, b) is the depth map after a median filter is applied.
Figure 3 is the half texture case (in which the body of the metronome is smooth): a) is the initial depth calculation based on 2-flash depth estimation, b) is the depth map after a median filter is applied.
Figure 4 is the full texture case: a) is the initial depth calculation based on 2-flash depth estimation, b) is the depth map after a median filter is applied.
Discussion
Conclusion
Future Work
References
[1] Time-of-Flight 3-D Endoscopy. http://link.springer.com/chapter/10.1007%2F978-3-642-04268-3_58
[2] Depth Measurements through Controlled Aberrations of Projected Patterns. http://www.opticsinfobase.org/oe/abstract.cfm?uri=oe-20-6-6561