Fast Alignment of Image Bursts for Google's HDR+

From Psych 221 Image Systems Engineering

Introduction

Standard imaging systems often lack the dynamic range necessary to properly expose all parts of a given scene. Typically, an exposure value is chosen as a compromise between overexposure, with bright regions losing detail to saturation, and underexposure, with dark regions consisting predominantly of noise. High dynamic range (HDR) imaging refers to a family of techniques for alleviating this tradeoff. A common approach to HDR imaging is to capture the same scene multiple times using different exposure values. A short exposure produces good results in bright regions, a long exposure increases the signal-to-noise ratio in dark regions, and intermediate exposures can properly expose other parts of the scene. All of the resulting images can then be merged into a single image in which all content appears well exposed. This method is effective in certain scenarios, but it inherently struggles with scene motion and camera motion: the frames must be aligned in order to merge them, yet alignment becomes difficult when different features are prominent in each image, and the problem worsens when some or all portions of the scene move between frames.

A great deal of photography today is done with the image sensors built into mobile phones; the very existence of platforms like Instagram and Snapchat is evidence of this. These sensors offer increasingly high resolution and general performance, but they are inherently limited by their small size, the quality of the optical assemblies attached to them, the computing power available to them for image processing, the likelihood that they will move while capturing images, and the responsiveness and high throughput their users expect. The first two limitations suggest that mobile image sensors are good candidates for HDR imaging; the last three suggest that the common HDR method described above is a poor fit for this platform. In response, Google developed a different HDR algorithm optimized for mobile imaging under the constraints discussed above. This project explores a prototype Matlab implementation of frame alignment, a key step in the algorithm, with the goal of understanding which parts of the process most affect computation speed.

Background

Google's HDR+ algorithm [1] is designed to operate on bursts of raw frames directly from the image sensor, eventually merging them into a single raw image that can then follow the normal demosaicking and processing steps that a traditional single-exposure image would undergo. The process is roughly broken up into four steps: capture, align, merge, finish. Alignment is the focus of this project, but some information about all four steps is provided below for context. Each step is optimized for speed and user experience.

Capture

Instead of attempting to perform exposure bracketing, HDR+ uses a burst of intentionally-underexposed frames. These frames are captured in quick succession, and each frame is captured with the same exposure value. Underexposure increases the chances of preserving detail in the bright parts of each frame and reduces the impact of scene motion and camera motion, but it has the obvious side effect of decreasing signal in dark areas. The HDR problem is thereby transformed into a denoising problem for most parts of the image. Capturing a short burst of frames aids in denoising in two ways: it provides additional samples of the same scene that can be averaged to reduce noise, and it increases the chances that one or more of the frames was captured with minimal motion.

Align

An underlying assumption for the algorithm is that camera motion and scene motion will be present to a certain extent. Alignment between corresponding regions of all frames is therefore necessary to properly combine the information from all the frames into one usable image. Alignment is accomplished by identifying a reference frame, downsampling each frame several times to create successively smaller frames, and then aligning tiles with the corresponding reference frame tile within a local perturbation window by finding the offset that produces the least difference between the tiles. The process follows a pyramid approach, starting with coarse alignment in the smallest downsampled frame and working up to fine alignment in the largest frames, with previous results determining each new initial guess for tile alignment in the larger image.
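To make the tile search concrete, here is a minimal sketch of a single-level, brute-force L1 tile alignment in Python/NumPy (my project code is in Matlab; the function name and parameters here are illustrative only). In the full pyramid, the initial guess for each tile comes from the offset found at the previous, coarser level, scaled up appropriately.

```python
import numpy as np

def align_tile(ref, alt, y, x, tile, init_dy, init_dx, radius):
    """Find the offset of the tile at (y, x) in `alt` that best matches the
    tile at the same position in `ref`, searching +/-radius pixels around an
    initial guess (propagated from the coarser pyramid level)."""
    ref_tile = ref[y:y + tile, x:x + tile]
    best_dist, best_off = np.inf, (init_dy, init_dx)
    for dy in range(init_dy - radius, init_dy + radius + 1):
        for dx in range(init_dx - radius, init_dx + radius + 1):
            yy, xx = y + dy, x + dx
            # Skip candidate offsets that fall outside the frame.
            if yy < 0 or xx < 0 or yy + tile > alt.shape[0] or xx + tile > alt.shape[1]:
                continue
            # L1 distance between the reference tile and the shifted tile.
            dist = np.abs(ref_tile - alt[yy:yy + tile, xx:xx + tile]).sum()
            if dist < best_dist:
                best_dist, best_off = dist, (dy, dx)
    return best_off
```

Running this for every tile at every pyramid level, coarsest first, reproduces the coarse-to-fine behavior described above.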

Merge

The merge step uses the data from the alignment step to match tiles across frames. The tiles are transformed into the spatial frequency domain, where much of the high-frequency content is discarded and averaging corresponding tiles significantly reduces noise. The averaging function weights each frame's contribution by how closely its tile matches the reference tile. Merging of the raw data occurs separately in each color channel and blends overlapping tiles rather than iterating over each tile independently. The output of this stage is a new, high-dynamic-range, single-exposure raw image that can be processed as if it came directly from the sensor.
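The exact weighting function is not reproduced here, but the idea can be sketched as a per-frequency blend in which a frame's tile is pulled toward the reference in proportion to how much the two disagree (a Wiener-style shrinkage, in Python/NumPy for illustration; `noise_var` stands in for a tuned noise estimate):

```python
import numpy as np

def merge_tiles(ref_tile, alt_tiles, noise_var=1e-2):
    """Illustrative frequency-domain tile merge: frames whose coefficients
    agree with the reference are averaged in; frames that disagree (likely
    misalignment or motion) fall back toward the reference coefficient."""
    F_ref = np.fft.fft2(ref_tile)
    acc = F_ref.copy()
    for tile in alt_tiles:
        F = np.fft.fft2(tile)
        diff2 = np.abs(F_ref - F) ** 2
        # Shrinkage factor per frequency: ~0 when the tiles agree (keep the
        # frame's coefficient), ~1 when they disagree (keep the reference).
        a = diff2 / (diff2 + noise_var)
        acc += F + a * (F_ref - F)
    merged = acc / (1 + len(alt_tiles))
    return np.real(np.fft.ifft2(merged))
```

When all frames agree, this reduces to a plain average of the tiles, which is where the noise reduction comes from.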

Finish

The finishing stage can be thought of as the part of the processing pipeline that would usually occur after raw frame capture regardless of whether HDR techniques are being used. Standard sensor correction steps like black subtraction and white balancing, optics correction to compensate for vignetting and chromatic aberration, multiple forms of color correction and adjustment, and dehazing are performed, albeit sometimes in nonstandard ways. Dynamic range compression and tone mapping are also performed and are even more important than usual in this context since the image contains a higher dynamic range than typical file representations and displays can support.

Methods

My implementation of the alignment step in Matlab followed the outline above. The major components are finding a reference frame, downsampling, and performing a pyramid-based incremental alignment search.

Reference frame

To choose a reference frame while minimizing shutter lag, the algorithm applies a "lucky imaging" approach to just the first three frames of the burst. The "lucky imaging" reference [3] actually comes from astronomical image processing, where sharpness is the key metric for selecting frames. In astronomical images the subject of interest is usually a point light source, so the image with the maximum single pixel value is chosen. Ordinary scenes are more complicated, but the principle carries over. The HDR+ approach is to extract the green pixel data from the Bayer array and base a sharpness metric on gradients in that data, which makes sense because green contributes most to luminance, and the human visual system is most sensitive to contrast in luminance. The specific metric is not described, so I used the sum of the gradient image, with higher sums indicating sharper images. I also tried multiple gradient operators in an effort to find the fastest approach.
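Since the exact metric is unspecified, the following Python/NumPy sketch shows the version I converged on: score each frame by the summed gradient magnitude of its green-channel data, then take the sharpest of the first three frames. The helper names are my own, and an RGGB layout is assumed.

```python
import numpy as np

def green_sharpness(bayer):
    """Sum-of-gradient-magnitude sharpness score computed on the green
    pixels of an RGGB Bayer mosaic (higher = sharper)."""
    # Average the two green samples in each 2x2 RGGB block.
    g1 = bayer[0::2, 1::2].astype(float)  # G at (row 0, col 1)
    g2 = bayer[1::2, 0::2].astype(float)  # G at (row 1, col 0)
    green = 0.5 * (g1 + g2)
    # Central-difference gradients (one of several operators I tried).
    gy, gx = np.gradient(green)
    return float(np.sum(np.hypot(gx, gy)))

def pick_reference(frames):
    """Choose the sharpest of the first three frames, per HDR+."""
    scores = [green_sharpness(f) for f in frames[:3]]
    return int(np.argmax(scores))
```

Swapping `np.gradient` for a Sobel or Prewitt filter changes the operator being timed without changing the overall structure.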


Downsampling

To create the data for the pyramid search, HDR+ uses multiple levels of downsampling to produce frames at different resolutions. With similar tile sizes for the local minimum-distance searches at each resolution, the alignment search automatically progresses from coarse to fine perturbations, which makes the approach efficient. The first step is averaging each 2x2 RGGB Bayer block to produce a grayscale image with a quarter of the original pixel count. From this point, three more levels of downsampling are required, with scale factors in both dimensions of 2, 4, and 4. Many approaches to downsampling could be taken, but I primarily compared keeping a single pixel from each block against taking the mean of each block to see the speed tradeoffs. The intuition is that the mean should have better visual performance but the simpler pixel-discarding method should be faster.
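The pyramid construction can be sketched as follows (Python/NumPy for illustration; the project code is in Matlab, and the function names are mine). Both block-mean and pixel-keeping variants are shown:

```python
import numpy as np

def bayer_to_gray(bayer):
    """Average each 2x2 RGGB block into one grayscale pixel
    (half the size in each dimension, a quarter of the pixels)."""
    b = bayer.astype(float)
    return 0.25 * (b[0::2, 0::2] + b[0::2, 1::2] + b[1::2, 0::2] + b[1::2, 1::2])

def downsample(img, factor, method="mean"):
    """Downsample by an integer factor, either by block averaging or by
    keeping one pixel per block (the faster, lower-quality option)."""
    h, w = img.shape
    h, w = h - h % factor, w - w % factor  # trim to a multiple of factor
    img = img[:h, :w]
    if method == "pick":
        return img[::factor, ::factor]
    blocks = img.reshape(h // factor, factor, w // factor, factor)
    return blocks.mean(axis=(1, 3))

def build_pyramid(bayer):
    """Grayscale level plus three further levels at factors 2, 4, 4."""
    levels = [bayer_to_gray(bayer)]
    for f in (2, 4, 4):
        levels.append(downsample(levels[-1], f))
    return levels
```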


The HDR+ paper and its supplement [2] go to great lengths to explore optimization of the alignment search step. A tile-based search is used, with alignment from a tile in a given frame compared to the corresponding tile in the reference frame. An L2 distance metric between different tile alignment options at the coarser levels is discussed as optimal, with a 9x9 pixel displacement search area suggested for each of these levels. L1 is the preferred metric along with a 3x3 pixel displacement search at the finest alignment level. A brute force method is suggested for L1 alignment.

For the L2 search, the supplement derives an approach in which the element-wise squares of the reference tile and of the search neighborhood in the other frame are precomputed as a first step. Then, for each candidate alignment, the corresponding squared data can be pulled from the precalculated matrix. The cross-correlation of the reference tile with the searched tile is also needed, and this calculation is claimed to be much more efficient than brute force when performed as the inverse Fourier transform of the point-wise product of the Fourier transform of the searched region and the conjugate of the Fourier transform of the reference tile. As this seemed to be the most computationally involved step of the process, I decided to compare a brute force approach against the optimized matrix math approach.
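A sketch of that optimized search (Python/NumPy for illustration; the project code is in Matlab). The identity being exploited is ||R - T_uv||^2 = sum(R^2) + sum(T_uv^2) - 2*(T corr R)(u, v): the squared window sums come from a precomputed cumulative sum, and the cross term for every candidate offset comes from one FFT-based cross-correlation.

```python
import numpy as np

def l2_search_fft(ref_tile, search_area):
    """L2 distance between ref_tile and every same-size window of
    search_area, computed without an explicit per-offset loop."""
    t = ref_tile.shape[0]
    H, W = search_area.shape
    # sum(T_uv^2) for every candidate window via a 2-D sliding box sum.
    csum = np.cumsum(np.cumsum(search_area ** 2, axis=0), axis=1)
    csum = np.pad(csum, ((1, 0), (1, 0)))
    box = csum[t:, t:] - csum[:-t, t:] - csum[t:, :-t] + csum[:-t, :-t]
    # Cross term for all offsets at once: inverse FFT of the product of the
    # search area's FFT with the conjugate FFT of the (zero-padded) ref tile.
    F_area = np.fft.rfft2(search_area)
    F_ref = np.fft.rfft2(ref_tile, s=search_area.shape)
    corr = np.fft.irfft2(F_area * np.conj(F_ref), s=search_area.shape)
    corr = corr[:H - t + 1, :W - t + 1]  # keep only non-wrapping offsets
    return (ref_tile ** 2).sum() + box - 2 * corr
```

The brute-force equivalent simply loops over every offset and sums squared differences; comparing the two is exactly the timing experiment in the results below.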

Test image choice

The HDR+ paper advertises a freely available repository of test images for the public to use in algorithm development. However, as of 12/15/2017, the website that is supposed to host these images still claims the dataset is "coming soon." As a result, I used the pair of grayscale JPEGs from Figure 5 of the paper as a starting point and artificially upscaled them to create the Bayer data frames needed to run the full alignment step. I also used the Adobe Lightroom app on my phone to capture raw data in DNG format and imported those frames using the TIFF reader in Matlab. Professor Wandell also created some sample images in ISET.

Results

Performance results for each section are summarized below. My scripts were run in Matlab on a PC with a 7th generation Intel Core i7 processor and 16GB RAM. Timing varied across different runs, but the numbers below are generally representative.

Reference frame

Downsampling to green
Method   Time (s/MP)
blockproc()   2.175
For loop (pixel)   0.0746
For loop (line)   0.0121
Gradient
Operator   Time (s/MP)
Sobel   0.0171
Prewitt   0.0177
Central   0.0204
Intermediate   0.0122
Roberts   0.0208

Downsampling all frames

Downsampling one 12MP frame to all four levels
Method   Time (s/frame)
blockproc(mean)   101
For loop (pixel) (first mean)   13.6
For loop (pixel) (no mean)   2.272
Assignment (first blockproc mean)   90.3
Assignment (first mean)   26.0
Assignment (no mean)   0.0121

Aligning all frames

Aligning each set of downsampled frames
Method   Time (s/frame)
For loop with brute force   2.031
Matrix math distance calculation   7.469

Conclusions

Exploring this algorithm turned out to be far more time-consuming than I had anticipated. It was, however, very interesting to explore the choices the Google team made in developing the HDR+ algorithm. As the owner of a phone whose camera employs this algorithm, I can vouch for its practical efficacy, and I believe the team achieved the goals they set out for themselves.

In terms of my own implementation of the alignment step, my intuitions were not always predictive of the results I found. I consistently found brute force, naive for-loop implementations of each component of alignment to be faster than more elegant Matlab constructions, which is quite different from my past experience. Despite running my tests on a relatively powerful machine with plenty of available memory, I was unable in any configuration to replicate the speed results discussed in the paper to within an order of magnitude, which leads me to believe that the HDR+ algorithm relies on heavy underlying optimization for the ARM architecture it targets. Additionally, while the discrete steps of the algorithm are described in some detail, some key implementation choices are omitted; further, the test dataset used for algorithm development has still not been made available to the public, so it is difficult to make a realistic comparison of functional performance.

I am not convinced my implementation at every step was correct; my matrix math implementation almost certainly produces incorrect results. However, debugging turned out to be somewhat difficult without all the reference data available. Even creating contrived test images with obvious correct alignment values only goes so far because the official implementation is not available to check whether it actually produces the expected result. Future steps would certainly include fixing any functional problems with the algorithm implementation. I am also convinced that much better optimization for the matrix-based approach to alignment is possible, but I wonder whether the FFT operations are actually less efficient in Matlab than they would be on embedded ARM hardware. I would also like to have baseline functional performance data available for the Google implementation in order to make more direct comparisons.

Appendix

Files:
File:Hdrplus align.zip
File:Images and import.zip

References

[1] http://www.hdrplusdata.org/hdrplus.pdf (original HDR+ paper by Hasinoff et al.)
[2] http://www.hdrplusdata.org/hdrplus_supp.pdf (supplement to the HDR+ paper by Hasinoff et al.)
[3] https://www.microsoft.com/en-us/research/wp-content/uploads/2010/03/SeeingMtRainier.pdf (Joshi and Cohen, the "lucky imaging" reference)