TienDaiIFHDR

From Psych 221 Image Systems Engineering

Project Title

A Versatile Image Fusion Method

Introduction

High Dynamic Range and Exposure Fusion

The dynamic range of a scene is defined as the ratio of its highest to its lowest luminance. Real-world scenes often span a very wide range of luminance, sometimes exceeding 10 orders of magnitude. Fig. 1 shows an HDR scene with a dynamic range of about 167,470:1. Reproducing such scenes is a challenge for conventional digital capture and display devices, which are limited to a dynamic range of only about 2 orders of magnitude.

The most common solution is to take a sequence of low dynamic range ( LDR ) images of the same scene at different exposure settings, so that all the radiance information is captured, and then render the captured stack for display. There are generally two pipelines. The first estimates the camera response function from the image sequence to recover the true radiance of the original scene ( recorded as a 32-bit floating-point radiance map ) [1, 2], and then tone-maps the radiance map for display on LDR reproduction media ( usually 8 bits per channel ) [3, 4, 5]. Although this approach gives very satisfying results, it is computationally expensive and time-consuming. The second fuses the captured images directly, without the intermediate step of creating a radiance map [6, 7]; this is usually referred to as "Exposure Fusion ( EF )" [7]. EF produces HDR-like images that are comparable to tone-mapped results at a much lower computational cost. Because of its effectiveness and efficiency, EF is adopted by most HDR applications on mobile platforms, which have limited computational power [8, 9].

Fig.1. Multi-exposed image stack of a high dynamic range scene.

All-in-focus Imaging

In fact, EF essentially solves the general problem of merging multiple images, and can therefore easily be extended to imaging and photography challenges beyond HDR. The most direct application is to fuse a multi-focus image stack ( Fig. 2 ) into an all-in-focus image [9].

The size of a camera's aperture trades off depth of field ( DoF ) against the amount of light captured during a given exposure time. For an image to be sharp across a large range of depths in the scene, a small aperture is required. However, decreasing the aperture size is not always feasible. On the one hand, most low-end cameras, such as cellphone cameras, have a fixed aperture. On the other hand, small apertures require slower shutter speeds, which can cause blur due to hand shake and object motion in the scene. EF successfully addresses this problem and renders all pixels in focus. It is also worth mentioning that EF can combine a flash/no-flash image pair taken under low-light conditions to reduce the artifacts caused by the flash [7].

Fig.2. Multi-focus image stack of a large DoF scene.

Project Content

In this project, we study EF from the following aspects:

1) Analyze and implement the algorithm to create HDR image.

2) Extend the algorithm to all-in-focus imaging.

Methods

EF computes the desired image by keeping only the "best" parts of the multi-exposure image stack. The final image is obtained by collapsing the stack with weighted blending, guided by simple quality measures, namely contrast, saturation and well-exposedness. The process is done in a multi-resolution fashion to avoid undesirable artifacts. It is assumed that the images are perfectly aligned, if necessary using a registration algorithm [10]. We first go through the original exposure fusion algorithm and then describe how to extend it to create an all-in-focus image.

Weighting Map

In a multi-exposure image stack, over-exposed and under-exposed regions are flat and colorless and should receive less weight during fusion, while well-exposed areas contain vivid colors and details and should be preserved with more weight. The algorithm uses the following measures to decide the weight of each pixel in the image stack.

Contrast ( C )

Under- and over-exposed regions are relatively "flat" or "uniform", with little variation in intensity, i.e. low contrast. Moreover, texture and edges are visually important elements, so pixels of high contrast should be assigned large weights. Following [11], the algorithm applies a Laplacian filter to the grayscale version of each image and takes the absolute value of the filter response as a simple contrast indicator C. Fig. 3 shows the contrast maps calculated from the image stack in Fig. 1.

Fig.3. Contrast maps calculated from image stack in Fig. 1.
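
As a rough illustration, a minimal Python sketch of this contrast measure could look as follows ( using OpenCV and NumPy; the function name and the assumption of a float RGB image in [0, 1] are our own, not taken from the reference code ):

    import cv2
    import numpy as np

    def contrast_measure(img):
        """Contrast measure C: absolute Laplacian response of the grayscale image.

        img: float32 RGB image of shape (H, W, 3), values in [0, 1].
        Returns a 2-D map with one contrast value per pixel.
        """
        gray = cv2.cvtColor(img, cv2.COLOR_RGB2GRAY)   # use luminance only
        lap = cv2.Laplacian(gray, cv2.CV_32F)          # second-derivative filter response
        return np.abs(lap)                             # magnitude indicates local contrast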

Saturation ( S )

Saturated colors are desirable and make the image look vivid. As a pixel undergoes a longer exposure, its colors become desaturated and are eventually clipped. The algorithm therefore also includes a saturation measure S, computed at each pixel as the standard deviation across the R, G and B channels. Fig. 4 shows the saturation maps calculated from the image stack in Fig. 1.

Fig.4. Saturation maps calculated from image stack in Fig. 1.
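
A one-line sketch of the saturation measure, again assuming a float RGB image in [0, 1] ( the function name is our own ):

    import numpy as np

    def saturation_measure(img):
        """Saturation measure S: per-pixel standard deviation across the R, G, B channels.

        img: float32 RGB image of shape (H, W, 3), values in [0, 1].
        """
        return img.std(axis=2)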

Well-exposedness ( E )

According to the camera response curve ( a sample is shown in Fig. 5 ), over-exposed pixels are clamped to 1 ( or 255 ) and under-exposed pixels are mapped to 0. The gray level of a pixel therefore reveals how well it is exposed. Specifically, pixel intensities around 0.5 are well exposed and should be trusted more, while those near 0 ( under-exposed ) or 1 ( over-exposed ) are poorly exposed and should receive less weight. The algorithm weights each intensity g based on how close it is to 0.5 using a Gauss curve:

E = exp( -(g - 0.5)^2 / (2 σ^2) )

with σ = 0.2 in the original paper [7]. The algorithm applies the Gauss curve to each RGB channel separately and multiplies the results, yielding the measure E.

Fig.5. A sample camera response curve.
Fig.6. Well-exposedness maps calculated from image stack in Fig. 1.
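
A minimal sketch of the well-exposedness measure ( the function name is our own; σ = 0.2 follows [7] ):

    import numpy as np

    def well_exposedness_measure(img, sigma=0.2):
        """Well-exposedness measure E: a Gauss curve centred at 0.5, applied to each
        channel and multiplied across the three channels.

        img: float32 RGB image of shape (H, W, 3), values in [0, 1].
        """
        gauss = np.exp(-((img - 0.5) ** 2) / (2 * sigma ** 2))
        return gauss.prod(axis=2)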

Map Combination

For each pixel, the algorithm combines the information from the different measures into a scalar weighting map using multiplication, controlling the influence of each measure with a power function:

W_ij,k = (C_ij,k)^ωC × (S_ij,k)^ωS × (E_ij,k)^ωE

where W, C, S and E refer to the final weighting map, contrast map, saturation map and well-exposedness map respectively, and ωC, ωS and ωE are the corresponding exponents that control how much each measure contributes to the final weighting map. If an exponent is equal to 0, the corresponding measure contributes a factor of 1 to the product and is thus not taken into account. The subscript k indicates the k-th image in the image stack, while (i, j) is the pixel coordinate. The weighting map is first normalized across the images at each pixel before being used to guide the fusion process. Fig. 7 shows the final weighting maps for each image of the stack shown in Fig. 1.

Fig.7. Weighting maps for images shown in Fig. 1.
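
Putting the three measures together, a minimal sketch of the weighting-map computation and normalization ( it reuses the hypothetical contrast_measure, saturation_measure and well_exposedness_measure functions sketched above; the eps term is our own safeguard, not part of the original description ):

    import numpy as np

    def weight_maps(stack, w_c=1.0, w_s=1.0, w_e=1.0, eps=1e-12):
        """Combine the three measures into one normalized weight map per image.

        stack: list of K float32 RGB images of the same size, values in [0, 1].
        w_c, w_s, w_e: exponents controlling each measure's influence.
        Returns an array of shape (K, H, W) whose K maps sum to 1 at every pixel.
        """
        W = np.stack([
            contrast_measure(img) ** w_c *
            saturation_measure(img) ** w_s *
            well_exposedness_measure(img) ** w_e
            for img in stack
        ])
        W += eps                                   # avoid division by zero where all weights vanish
        return W / W.sum(axis=0, keepdims=True)    # normalize across the K images at each pixel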

Naive Fusion

Having obtained the weighting maps for the image stack, the next step is to fuse the multiple images. The most intuitive way is to compute a weighted average across the images at each pixel:

R_ij = Σ_k Ŵ_ij,k · I_ij,k

where I_ij,k is the k-th image in the input sequence and Ŵ_ij,k is the weighting map normalized so that the weights sum to one at each pixel. The formula is applied to each RGB color channel separately. The process is visually demonstrated in Fig. 8, and Fig. 9 shows the resulting image. It is easy to see that the transitions between pixels are not smooth, which makes the result unappealing. Naively averaging the image set cannot guarantee a seamless blend, especially where the weights vary quickly. For instance, consider two neighboring pixels whose weights differ so dramatically that the first pixel is taken entirely from the first image and the second pixel entirely from the second image; no averaging takes place, and the result at these two pixels looks unnatural. Because the algorithm works directly on image intensities, the seam artifacts are easy to observe, and they become even more obvious in flat regions with little texture.

Fig.8. Visual demonstration of naive fusion. The first column shows the input images; the second column shows the corresponding weighting maps.
Fig.9. Resultant image from naive fusion.
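
A minimal sketch of this naive fusion, assuming the normalized weights W produced by the hypothetical weight_maps function above:

    import numpy as np

    def naive_fusion(stack, W):
        """Weighted per-pixel average of the stack, applied to R, G and B separately.

        stack: list of K float32 RGB images; W: normalized weights of shape (K, H, W).
        """
        result = np.zeros_like(stack[0])
        for img, w in zip(stack, W):
            result += w[..., None] * img   # broadcast the 2-D weight over the 3 channels
        return result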

Pyramid Fusion

To solve the seam problem, the algorithm uses a multi-resolution fusion technique. Specifically, it transforms the images into a pyramid representation [12], conducts the fusion on each level, and then reconstructs the final image from the fused pyramid.

Gaussian Pyramid and Laplacian Pyramid

The Laplacian pyramid was introduced by Burt and Adelson in the context of image compression [12]. The name is something of a misnomer: the value at each node of the pyramid is the difference between two Gaussian-like functions convolved with the original image, and this difference resembles the "Laplacian" operators commonly used in image enhancement, hence the name. The representation has the advantages that the image is only expanded to 4/3 of its original size and that the same small filter kernel can be used on every pyramid level.

There are three major operations to construct the Gaussian and Laplacian pyramid:

(1) Convolve the input signal with a smoothing kernel, then downsample the result by keeping every other sample. Blurring creates a smoother version of the original that contains fewer high-frequency components, so the blurred data can be represented with fewer samples than the original.

(2) Interpolate the blurred and down-sampled image to estimate the original image.

(3) Subtract the estimated image from the original image to get the difference.

Applying the first operation repeatedly, each time to the low-pass image produced by the previous step, creates a stack of successively smaller images in which each pixel contains a local average corresponding to a pixel neighborhood on a lower level of the pyramid. This image stack is called a Gaussian pyramid, as shown in Fig. 10. On each level of the pyramid, the second and third operations yield a difference image; the stack of difference images is a Laplacian pyramid, as shown in Fig. 11.

Fig.10. A sample Gaussian Pyramid.
Fig.11. A sample Laplacian Pyramid.
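
The three operations translate into a short pyramid-construction sketch; here we use OpenCV's pyrDown/pyrUp for the blur-downsample and interpolation steps ( function names and the level count are our own choices ):

    import cv2

    def gaussian_pyramid(img, levels):
        """Repeatedly blur and downsample (operation 1) to build a Gaussian pyramid."""
        pyr = [img]
        for _ in range(levels - 1):
            pyr.append(cv2.pyrDown(pyr[-1]))
        return pyr

    def laplacian_pyramid(img, levels):
        """Upsample each coarser level (operation 2) and subtract it from the finer
        level (operation 3); the coarsest Gaussian level is kept as the residual."""
        gp = gaussian_pyramid(img, levels)
        lp = []
        for fine, coarse in zip(gp[:-1], gp[1:]):
            h, w = fine.shape[:2]
            up = cv2.pyrUp(coarse, dstsize=(w, h))   # interpolate back to the finer size
            lp.append(fine - up)                     # difference ("Laplacian") image
        lp.append(gp[-1])                            # low-pass residual at the top
        return lp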

Pyramid Fusion

To blend the multiple images seamlessly, the algorithm performs a pyramid fusion. Specifically, it first computes the weighting map for each image in the stack as described in the "Weighting Map" section. Then a Gaussian pyramid is constructed for each weighting map and a Laplacian pyramid for each input image. On each level, the Gaussian-pyramid level of the weighting maps and the Laplacian-pyramid level of the input images are multiplied, and the results are summed across the images. This yields a new, fused Laplacian pyramid, as shown in Fig. 12 (only 2 input images are used, for easy demonstration of the idea).

The final image can easily be reconstructed from this fused Laplacian pyramid. Fig. 13 shows the resulting image from pyramid fusion, which is visually pleasing; the seam problem has clearly been solved. The reason is that the pyramid fusion technique blends image features ( edges ) rather than intensities. Sharp transitions in the weight map can therefore only take effect where sharp transitions appear in the original images. In flat regions of the original images, the coefficients of the corresponding Laplacian pyramid levels are small ( often close to zero ), so no matter how sharply the weights vary there, the result is barely affected and smooth transitions are ensured.

Fig.12. Visual demonstration of the pyramid fusion process. The first row is the first input image and its Laplacian pyramid. The second row is the weighting map of the first input image and its Gaussian pyramid. The third row is the second input image and its Laplacian pyramid. The fourth row is the weighting map of the second input image and its Gaussian pyramid. The last row is the resulting image and its Laplacian pyramid.
Fig.13. Resultant image from pyramid fusion.
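
A minimal sketch of the pyramid fusion and reconstruction, reusing the hypothetical gaussian_pyramid / laplacian_pyramid helpers above and the normalized weights W from weight_maps ( the default of 6 levels is our own choice ):

    import cv2
    import numpy as np

    def pyramid_fusion(stack, W, levels=6):
        """Blend the stack in the Laplacian domain using Gaussian pyramids of the weights.

        stack: list of K float32 RGB images; W: normalized weights of shape (K, H, W).
        """
        # Fused Laplacian pyramid: weighted sum of the input Laplacian pyramids,
        # with the weights taken from the matching Gaussian-pyramid level.
        fused = None
        for img, w in zip(stack, W):
            lp = laplacian_pyramid(img, levels)
            wp = gaussian_pyramid(w.astype(np.float32), levels)
            contrib = [wl[..., None] * ll for wl, ll in zip(wp, lp)]
            fused = contrib if fused is None else [f + c for f, c in zip(fused, contrib)]

        # Reconstruct: start from the coarsest level, repeatedly upsample and add
        # the next finer difference image.
        result = fused[-1]
        for diff in reversed(fused[:-1]):
            h, wd = diff.shape[:2]
            result = cv2.pyrUp(result, dstsize=(wd, h)) + diff
        return np.clip(result, 0.0, 1.0)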

All-in-focus Extension

As described in the previous sections, EF essentially solves the problem of fusing multiple images. It can easily be extended to other imaging problems simply by adjusting the measures that determine how much each image contributes to the final result. In the original algorithm, three measures, namely contrast, saturation and well-exposedness, decide whether a pixel is properly exposed, which solves the HDR problem. For a multi-focus image stack, the goal is instead to pick the pixels that are in good focus and thus sharp rather than blurred. The contrast measure is in fact a good sharpness measure, so if only contrast is used to construct the weighting map, the EF algorithm combines the in-focus pixels and creates an all-in-focus image, as shown in Fig. 14.

Fig.14. An all-in-focus image created from the image stack in Fig. 2 by the EF algorithm.
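
As an illustration, using the hypothetical helpers sketched in the earlier sections, the switch to all-in-focus fusion is only a change of exponents ( focus_stack is assumed to hold the aligned multi-focus images ):

    # Keep only the contrast measure by zeroing the other two exponents;
    # saturation and well-exposedness then contribute a constant factor of 1.
    W = weight_maps(focus_stack, w_c=1.0, w_s=0.0, w_e=0.0)
    all_in_focus = pyramid_fusion(focus_stack, W)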

Results

More Experiment Results

The EF algorithm can be used for a wide range of images; more experimental results are shown in Fig. 15. One thing worth noticing is that, because the EF algorithm only picks good pixels, every pixel must have at least one good version somewhere in the image stack; otherwise the algorithm does not work. Consequently, for scenes with an extremely high dynamic range or a very large depth of field, more images are needed to capture all the information in the scene. For ordinary HDR scenes, auto exposure bracketing works well.

Fig.15. More results created by the EF algorithm. The right column shows the multiple images used to compute the fused result on the left.

Comparison with Tone Mapping

It is instructive to compare the EF algorithm with state-of-the-art tone mapping algorithms. Fig. 16 shows a set of images created by EF and by several tone mapping algorithms. Bilateral-filtering tone mapping [1] is the most powerful of these and gives a result better than that of EF, but the image created by EF is comparable to those of the other tone mapping algorithms. Which is better is to some extent subjective, but the result from EF is at least acceptable, especially when it is not viewed side by side with the tone-mapped results. The biggest advantage of EF is its computational efficiency: the computation is much simpler than that of the powerful tone mapping algorithms, and it can be accelerated further by using fewer pyramid levels, with little loss of quality. Offering a good trade-off between quality and efficiency, the EF algorithm is widely used on mobile platforms, which have limited computational power; to our knowledge, almost all HDR applications on smartphones adopt the EF algorithm.

Fig.16. Comparison between the EF algorithm and tone mapping algorithms. From left to right: results of EF [7], Tian et al. [5], Durand & Dorsey [1], Ward et al. [13] and Reinhard et al. [14].

Conclusions

In this project, we analyzed and implemented the Exposure Fusion (EF) algorithm and extended it to create all-in-focus images. We tested it on several image sets and made a simple comparison with tone mapping algorithms, which demonstrates the effectiveness and efficiency of EF.

References

[1] P. E. Debevec and J. Malik, “Recovering high dynamic range radiance maps from photographs”, Proc. ACM SIGGRAPH’97, pp. 369 – 378, 1997.

[2] T. Mitsunaga and S. K. Nayar, "Radiometric self calibration", Proc. IEEE Conference on Computer Vision and Pattern Recognition, vol. 1, pp. 374–380, 1999.

[3] G. Ward, A contrast-based scalefactor for luminance display, in: Graphics Gems IV, Academic Press, 1994, pp. 415–421.

[4] F. Durand and J. Dorsey, “Fast bilateral filtering for the display of high-dynamic-range images”, ACM Trans. Graph. (special issue SIGGRAPH 2002) 21, 3, 257-266, 2002.

[5] Q. Tian, J. Duan, M. Chen and T. Peng, "Segmentation Based Tone-mapping for High Dynamic Range Images", Advances Concepts for Intelligent Vision Systems, pp.360-371, 2011.

[6] A. Goshtasby. Fusion of multi-exposure images. Image and Vision Computing, 23:611–618, 2005.

[7] T. Mertens, J. Kautz and F. Van Reeth, "Exposure fusion", Proc. 15th Pacific Conference on Computer Graphics and Applications (PG'07), pp. 382–390, 2007.

[8] Natasha Gelfand, Andrew Adams, Sung Hee Park, and Kari Pulli, “Multiexposure imaging on mobile devices,” in Proc. of the ACM Multimedia, 2010.

[9] Vaquero, D. and Gelfand, N. and Tico, M. and Pulli, K. and Turk, M., “Generalized Autofocus”, Applications of Computer Vision (WACV), 2011 IEEE Workshop on, pp. 511--518, 2011.

[10] G. Ward, "Fast, robust image registration for compositing high dynamic range photographs from hand-held exposures", Journal of Graphics Tools, 8(2):17–30, 2003.

[11] J. M. Ogden, E. H. Adelson, J. R. Bergen, and P. J. Burt. Pyramid-based computer graphics. RCA Engineer, 30(5), 1985.

[12] P. J. Burt and E. H. Adelson, "The Laplacian pyramid as a compact image code", IEEE Transactions on Communications, COM-31:532–540, 1983.

[13] G. W. Larson, H. Rushmeier, C. Piatko, “A visibility matching tone reproduction operator for high dynamic range scenes”, IEEE Trans on Visualization and Computer Graphics, vol. 3, pp. 291 –306, 1997.

[14] E. Reinhard, M. Stark, P. Shirley and J. Ferwerda, "Photographic tone reproduction for digital images", Proc. ACM SIGGRAPH 2002.

Acknowledgement

We would like to sincerely thank various authors for making their data available on the Internet for experiments. Images used in this project courtesy of corresponding author(s).

Appendix I - Code and Data

File:A versatile image fusion method presentation slides.pdf
File:Codes.zip
File:Image data.zip

Appendix II - Work Partition

Most of the work was done by Qiyuan Tian. Steve Dai contributed to understanding the algorithm and to parts of the presentation slides and the final write-up.

Link to our group's other project: Hyperspectral Waveband Registration