Depth Estimation with an Endoscope using Front Flash

From Psych 221 Image Systems Engineering

Introduction

An endoscope is an imaging device that allows visualization of internal human organs. It consists of a long flexible tube with a camera and a flash at its tip. This imaging technique allows for tissue biopsy of internal organs and the detection and removal of tumors. Current endoscopes are 2-D visual systems. For this project, we calculate a depth map using two 2-D images.

Muneeb Ahmed and Cezanne Camacho

Literature Review

Some research suggests that having depth information (i.e. 3D imaging) would allow for better operative safety and faster equipment training [1]. There are at least five different 3D measuring techniques that allow for depth sensing [2].

One of them is time-of-flight measurement, in which a piece of hardware mounted next to the camera emits pulses of near-infrared light. Depth is then calculated from the time it takes each pulse to reflect off the scene and return to the image sensor. Because this technique operates in the near-infrared region, it does not currently pass medical safety regulations. Another method uses intentionally aberrated projected patterns to obtain depth information; it too requires a projection system, which comes at additional cost.

The method we propose does not require any additional hardware. Depth maps are calculated from images taken by the camera at different positions. The technique works because the intensity of the scene changes as the camera and flash move, and this change in intensity allows a depth map to be calculated.

Method

Depth Estimation

Andy Lai Lin has demonstrated depth estimation using a two-flash method, as shown in the figure below.

Courtesy: Andy Lai Lin

The camera takes two shots:

1 - The camera and flash are at the same position.

2 - The camera stays in the same position, but the flash is moved farther away from the scene.

The ratio of these two images allows us to calculate the depth of the scene. The formula linking the ratio of the two images to depth is as follows:

Courtesy: Andy Lai Lin
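The exact expression is given in the figure above. As a rough sketch, assuming a point-source flash, inverse-square falloff of the flash illumination, and an otherwise unchanged scene between shots, the relation has the form

  d = \frac{\Delta}{\sqrt{I_\mathrm{near} / I_\mathrm{far}} - 1}

where I_near / I_far is the pixel-wise ratio of the two images, Δ is the distance the flash moved between shots, and d is the distance from the nearer flash position to the surface.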

To apply this concept to sensing depth using an endoscope, we modified the model as follows:

As is the case with endoscopes, the flash is attached to the camera, and two images are taken at two different locations. For the distance formula above to apply, we first transform the image taken farther from the scene into an equivalent image taken from the closer position. After this transformation, the closer image is divided by the farther image pixel by pixel, and the resulting ratio is used to calculate depth with the distance formula shown above. To perform the transformation, we first find SIFT points; these serve as anchors and allow us to resize the image accordingly.
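A minimal MATLAB sketch of the ratio-to-depth step follows. The variable names are hypothetical: Inear is the linear-intensity near image, IfarCorr is the perspective-corrected far image, and delta is the 50 mm displacement between the two shots.

  % Pixel-wise ratio of the near image to the perspective-corrected far image,
  % inverted through the inverse-square relation sketched above.
  delta = 50;                              % mm moved between the two shots
  ratio = Inear ./ max(IfarCorr, eps);     % avoid division by zero
  depth = delta ./ (sqrt(ratio) - 1);      % mm from the near flash position
  depth(ratio <= 1) = NaN;                 % ratios <= 1 carry no depth information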

Initial Image Data

We obtained three pairs of optical images that differ in texture: no texture, half texture, and full texture, pictured below. Each pair consists of one image taken with the endoscope close to the scene and a second image taken 50 mm farther away. In each pair the flash stays fixed relative to the camera, so the farther image appears darker (lower intensity) than the closer image.

(Left to right) No texture, half texture, and full texture metronome optical images.

ISET Processing

Before we can accurately compare these images, we have to convert the optical image into a display image. This is achieved through an imaging pipeline implemented in ISET. First we pass the optical image through a sensor model, which accounts for read noise (the noise that dominates the darker portions of a detected image) and lets us set the exposure time, i.e. how long the flash must be on in the endoscope imaging system. We then demosaic the sensor data; no gamma correction is applied because the depth calculation requires intensity data in linear units.
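A minimal sketch of this pipeline, assuming an ISET/ISETCam installation with default sensor parameters; the exact get/set parameter names may differ between ISET versions, and oi stands for the ISET optical image of one of the two captures.

  sensor = sensorCreate;                         % default Bayer sensor with read noise
  sensor = sensorSet(sensor, 'exp time', 0.05);  % exposure / flash-on time, seconds
  sensor = sensorCompute(sensor, oi);            % simulate the detected mosaic
  ip     = ipCreate;                             % image-processing (demosaic) pipeline
  ip     = ipCompute(ip, sensor);                % demosaic; no gamma, data stay linear
  img    = ipGet(ip, 'result');                  % linear RGB image used below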

SIFT Algorithm

We aim to match corresponding points in each pair of images, so that we can do perspective correction and then direct image intensity comparison. We identify these correspondence points through a Scale Invariant Feature Transform (SIFT) algorithm. The SIFT algorithm is implemented in MATLAB code, which is attached in the appendix, and works through the following steps:

Detect Local Extrema

To detect local maxima and minima in a given image, the SIFT algorithm first reads the images in grayscale and then computes a scale space by applying a Gaussian blur filter with different values of sigma (different amounts of blur). Scale spaces with small values of sigma are better at identifying small features of interest based on the extrema in that space, while scale spaces with larger values of sigma are better at identifying larger features. This can be visualized in the scale space pictured below, constructed for our non-textured pair of images.

Then the difference between adjacent pairs of these spaces, called the difference of Gaussians, is computed, and the local extrema are extracted through an iterative search over each pixel and its nearest neighbors. At each pixel there are 26 neighboring pixels to compare against: 8 adjacent pixels in the same Gaussian space and the 9 neighboring pixels in each of the spaces above and below.
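A simplified single-octave sketch of this step in MATLAB; the blur levels are assumed values rather than those in the attached code, img is the processed image from the ISET step above, and imgaussfilt/imdilate/imerode require the Image Processing Toolbox.

  gray   = im2double(rgb2gray(img));              % work in grayscale
  sigmas = 1.6 * 2.^((0:4)/2);                    % assumed blur levels
  L = zeros([size(gray), numel(sigmas)]);
  for k = 1:numel(sigmas)
      L(:,:,k) = imgaussfilt(gray, sigmas(k));    % Gaussian scale space
  end
  D = diff(L, 1, 3);                              % difference of Gaussians
  nhood  = true(3, 3, 3);                         % 26-neighborhood in scale space
  locMax = (D == imdilate(D, nhood));             % >= all 26 neighbors
  locMin = (D == imerode(D, nhood));              % <= all 26 neighbors
  extrema = locMax | locMin;
  extrema(:,:,[1 end]) = false;                   % need a scale both above and below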

Correspondence point identification and filtering

Our final set of correspondence points is determined by filtering the calculated extrema with a low and a high threshold on their values relative to the surrounding local extrema. A point cannot have too small a change relative to its neighbors, because this indicates there is not enough contrast for it to be reliably distinguished, and it cannot have too large a change, because this indicates the point lies on a hard-to-localize edge between two very different levels of contrast in the image.
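Continuing the sketch above, the two rejection tests in their standard SIFT form; the threshold values here are assumptions, not the ones used in the attached code.

  contrastThr = 0.03;   edgeRatio = 10;            % assumed threshold values
  [rows, cols, scales] = ind2sub(size(extrema), find(extrema));
  keep = false(size(rows));
  for i = 1:numel(rows)
      r = rows(i); c = cols(i); s = scales(i);
      if r < 2 || c < 2 || r > size(D,1)-1 || c > size(D,2)-1, continue; end
      v = D(r, c, s);
      if abs(v) < contrastThr, continue; end       % too little local contrast
      % 2x2 Hessian of D in the image plane: reject poorly localized edge points
      Dxx = D(r, c+1, s) + D(r, c-1, s) - 2*v;
      Dyy = D(r+1, c, s) + D(r-1, c, s) - 2*v;
      Dxy = (D(r+1, c+1, s) - D(r+1, c-1, s) - D(r-1, c+1, s) + D(r-1, c-1, s)) / 4;
      trH = Dxx + Dyy;   detH = Dxx*Dyy - Dxy^2;
      keep(i) = detH > 0 && trH^2 / detH < (edgeRatio + 1)^2 / edgeRatio;
  end
  rows = rows(keep);  cols = cols(keep);  scales = scales(keep);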

Construct correspondence point descriptors

For each point in our set of determined correspondence points, our algorithm looks at the neighboring pixels and forms a descriptor based on the grayscale distribution surrounding the correspondence point.
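A much-simplified descriptor sketch in the spirit of this step: a normalized patch of grayscale values around each keypoint. Standard SIFT (and the attached code) builds gradient-orientation histograms instead, and keypoints too close to the image border are assumed to have been discarded already.

  w = 8;                                           % half-width of the descriptor patch
  desc = zeros(numel(rows), (2*w + 1)^2);
  for i = 1:numel(rows)
      patch = gray(rows(i)-w : rows(i)+w, cols(i)-w : cols(i)+w);
      patch = patch - mean(patch(:));              % remove the local brightness offset
      desc(i, :) = patch(:)' / max(norm(patch(:)), eps);   % unit-norm descriptor
  end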

Match similar descriptors in two images

For a given pair of images in our data set, the descriptors should be similar enough to make a correct match regardless of scale, rotation, or intensity. It is also worth noting that SIFT produces many more correspondence points for highly textured surfaces, due to the larger number of maximum and minimum feature points in such images. The matches for our no texture and full texture images are shown below. By visual inspection, the algorithm is reasonably good at forming correct correspondences; a typical set of matching points has 5-10% mismatches.

SIFT produced matches for no texture and full texture images.
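A sketch of the matching step, assuming desc1 and desc2 hold the descriptors of the two images (one row per keypoint); the 0.8 ratio-test threshold is an assumed value, and the subtraction uses implicit expansion (MATLAB R2016b or later).

  matches = zeros(0, 2);                               % [index in image 1, index in image 2]
  for i = 1:size(desc1, 1)
      d2 = sum((desc2 - desc1(i, :)).^2, 2);           % squared distances to all descriptors
      [dSorted, order] = sort(d2);
      if dSorted(1) < 0.8^2 * dSorted(2)               % best match clearly beats the 2nd best
          matches(end+1, :) = [i, order(1)];           %#ok<AGROW>
      end
  end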

Perspective Correction

After we have the SIFT points, we use one of them to measure the amount of 'stretch' in the far-away scene compared to the closer scene. First, we calculate the distance of a given SIFT feature from the center of the close-up image. Then, we calculate the distance of its corresponding SIFT point from the center of the far-away image. The ratio of these two distances determines the factor by which to resize the far-away image. After the image has been resized, we crop its boundaries to match the near image. This is illustrated in the images below.

After perspective correction, the front (near) image and the perspective-corrected back (far) image are ready to be fed into the algorithm that produces a depth map.
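A sketch of this scale-and-crop correction, where Inear and Ifar are assumed to be grayscale linear-intensity images of equal size, and pNear and pFar are the [x y] coordinates of one matched SIFT pair in the near and far images.

  ctrRC = size(Ifar) / 2;   ctr = ctrRC([2 1]);         % image center as [x y]
  scale = norm(pNear - ctr) / norm(pFar - ctr);         % stretch factor (expected > 1)
  IfarBig = imresize(Ifar, scale, 'bilinear');          % enlarge the far image
  [h, w]  = size(Inear);                                % crop back to the near image size
  r0 = floor((size(IfarBig, 1) - h) / 2);
  c0 = floor((size(IfarBig, 2) - w) / 2);
  IfarCorr = IfarBig(r0+1 : r0+h, c0+1 : c0+w);         % perspective-corrected far image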

Results

After image processing in ISET, as described above, we ran our SIFT algorithm to obtain correspondence points for each of our three sets of differently textured initial images. These points were then used for perspective correction. Our final depth maps were produced by comparing the intensities of the two images and passing the resulting maps through a median filter, which smooths each pixel based on its 3x3 neighborhood, to remove some edge-based artifacts.
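The smoothing step is essentially a single call, assuming depth is the raw depth map from the ratio sketch above (medfilt2 is in the Image Processing Toolbox).

  depthFiltered = medfilt2(depth, [3 3]);   % 3x3 median filter to suppress edge artifacts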

Depth Maps

Figure 1 shows the true depth map.

Figure 2 is the no texture case: a) is the initial depth calculation based on 2-flash depth estimation, b) is the depth map after a median filter is applied.

Figure 3 is the half texture case (in which the body of the metronome is smooth): a) is the initial depth calculation based on 2-flash depth estimation, b) is the depth map after a median filter is applied.

Figure 4 is the full texture case: a) is the initial depth calculation based on 2-flash depth estimation, b) is the depth map after a median filter is applied.

Figure 1. Ground truth depth map

Figure 2. No texture

Figure 3. Half texture

Figure 4. Full texture

Error analysis

We calculated the mean square error (MSE) between each final calculated depth map and the ground truth depth map for every texture case. As can be seen from our MSE error image for the no texture case, the error is generally small except around the edges of the metronome; in the more textured cases these feature-edge errors are more numerous.

Table 1. Mean square error for different textures of images
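The error metric itself is straightforward, assuming depthFiltered and depthTrue are the computed and ground-truth depth maps at the same resolution.

  err = depthFiltered - depthTrue;          % per-pixel depth error
  mse = mean(err(:).^2, 'omitnan');         % mean square error, ignoring invalid pixels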

Sources of Error

1. Feature edge-based error due to an approximate perspective transformation

Our perspective correction algorithm is only approximate and will introduce some error, especially around the edges of features. The corrected back flash image, although it looks similar to the front flash image when we zoom in, is missing information that is captured in the front flash image, because the features are viewed from a slightly different angle as the endoscope camera moves. This is intuitive: for example, if a metronome is toward our front right, then as we move closer we can see more of its left side. So, although this transformation could use further research and refinement, we justify the zoom-in algorithm by assuming that the endoscope moves only a small amount (around 50 mm) between front and back flashes, so the error due to the difference in viewing angle should be reasonably small. We also show that this method of correction works well for smooth surfaces, so in combination with algorithms that work well only in textured environments it could form part of a uniquely helpful depth estimation algorithm.


2. Error in SIFT correspondence points

Since our perspective correction relies on the distance between a SIFT correspondence point and the center of the image, relying on a mismatched pair for this calculation introduces a small offset error in the correction.

Conclusion

Current endoscope depth estimation technologies exist, but they rely on more complex light sources that would have to be added to the small endoscopic channel. Our algorithm works with the single built-in light source and camera of current endoscopes, and so would be ideal for integration.

Our SIFT algorithm works well to match corresponding points in a pair of front flash and back flash images taken one after the other by the camera of an endoscope; it produces only about 5-10% mismatches, with fewer mismatches for highly textured surfaces. By using these correspondence points in a scale-based perspective correction algorithm, we were able to estimate a depth map from the ratio of intensities in the two images. This perspective correction algorithm does not account for changes in viewing angle, so we see the greatest error near the edges of features, which are more numerous on highly textured surfaces.

Future Work

If we were to take this project further, we would implement a more accurate perspective correction algorithm. We would also try to use more than two images taken at varying distances to measure the depth of the scene. This would help with lowering the error rate and minimizing edge effects.

Appendix

Matlab Code for SIFT Implementation

  • note: this code is part of a larger set compiled in Visual C/MATLAB, and is meant for reference as part of a larger implementation that includes the image matrix data.

File:Sift implementation.zip

References

[1] Time-of-Flight 3-D endoscopy http://link.springer.com/chapter/10.1007%2F978-3-642-04268-3_58

[2] Depth measurements through controlled aberrations of projected patterns http://www.opticsinfobase.org/oe/abstract.cfm?uri=oe-20-6-6561