Michelson
Background
Introduction
Since the social media revolution began, unreliable user-sourced media has increasingly become a primary source of information. As access to advanced photo-editing software becomes more widespread, it is becoming more and more necessary to be able to distinguish between genuine and forged photographs in order to curb the spread of false information. Professional forgeries can often be visually indistinguishable from real photographs, making visual inspection an invalid mechanism for detection. Luckily, forgeries often alter underlying statistics about the image, allowing for a variety of techniques to be used to expose these inconsistencies and detect forgeries. This project focuses on one of several methods to expose digital image forgeries, specifically detecting inconsistencies in color filter array (CFA) interpolated images.
CFA Interpolation
While our eyes see the world in various shades of color, the sensors typically used by cameras cannot distinguish the wavelengths of incident photons, instead measuring only intensity. In order to produce a color image, the camera's pixels are covered by an array of filters which only allow light in a specific waveband to pass through, creating the effect of 'red', 'green', and 'blue' pixels. Many varieties of color arrays are possible; the typical array, known as the Bayer array, tiles the sensor with a repeating 2x2 pattern of two green, one red, and one blue filter.
Having this array of pixels is not sufficient to produce a color picture, however. It would appear very strange to the viewer to see an image whose pixels were either pure red, pure blue, or pure green. A better method is to attempt to guess the missing color values based on the values of their neighbours. The resulting techniques are collectively known as Color Filter Array (CFA) interpolation. The simplest techniques are just a weighted average of the immediate neighbours of a pixel; however, many more advanced techniques are possible. A survey of common techniques is available in [2].
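To make the neighbour-averaging idea concrete, here is a minimal sketch of bilinear demosaicing in Python/NumPy (the project's own code is MATLAB; the RGGB layout and the function names here are illustrative assumptions, not the project's implementation):

```python
import numpy as np

def conv3(a, k):
    """3x3 correlation with zero padding (all kernels used here are symmetric)."""
    p = np.pad(a, 1)
    out = np.zeros_like(a, dtype=float)
    for i in range(3):
        for j in range(3):
            out += k[i, j] * p[i:i + a.shape[0], j:j + a.shape[1]]
    return out

def bilinear_demosaic(mosaic):
    """Fill in missing colors by averaging the known same-color neighbours,
    assuming an RGGB Bayer layout (R at even rows/cols, B at odd rows/cols)."""
    H, W = mosaic.shape
    r_mask = np.zeros((H, W)); r_mask[0::2, 0::2] = 1
    b_mask = np.zeros((H, W)); b_mask[1::2, 1::2] = 1
    g_mask = 1 - r_mask - b_mask
    box = np.ones((3, 3))
    rgb = np.zeros((H, W, 3))
    for c, mask in enumerate((r_mask, g_mask, b_mask)):
        # normalized convolution: average only the known samples in each window
        rgb[..., c] = conv3(mosaic * mask, box) / np.maximum(conv3(mask, box), 1e-9)
        rgb[..., c][mask.astype(bool)] = mosaic[mask.astype(bool)]  # keep measured values
    return rgb

# A flat gray mosaic should demosaic to a flat gray image in every channel.
rgb = bilinear_demosaic(np.full((8, 8), 0.5))
```

Note that the interpolated values are, by construction, linear combinations of their neighbours; this is exactly the correlation structure the detector looks for.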
Application to Detection
The end result of any CFA interpolation is that the RGB values of image pixels are correlated with the value of their neighbors. While a forger may succeed in not leaving behind any obvious visual errors, there is a strong chance that their forgeries will not properly preserve these correlations in the altered regions. By tracking and measuring these correlations, it is then possible to locate regions of potential forgery.
Methods
Expectation Maximization
The problem with simply verifying that correlations exist in the image is that there are many commonly employed interpolation algorithms [3], so one doesn't immediately know which interpolation method was used. The Expectation Maximization (EM) algorithm allows us to simultaneously estimate the interpolation model used and the probability that each pixel is correlated with its neighbors within a given channel.
E-Step
In order to estimate both a probability and a model, the EM algorithm alternates between two steps. The first, the E-step, estimates the probability that each sample fits a proposed correlation model, as opposed to being totally uncorrelated.
We represent our model by a matrix α; the central element of α corresponds to the sample we're investigating, and the surrounding elements give the weighting applied to the surrounding pixels (for example, α(1,1) would represent the weighting of the pixel one down and one to the right of the inspected pixel). Our assumption is that our pixels are perfectly correlated with their neighbours according to α, with variations caused by random noise (here represented by n): f(x,y) = Σ(u,v)≠(0,0) α(u,v)·f(x+u, y+v) + n(x,y).
In the E-step we first calculate the residual error for each pixel under this model: r(x,y) = f(x,y) − Σ(u,v)≠(0,0) α(u,v)·f(x+u, y+v).
We then calculate the probability that the deviation of a given sample is explained by the model, assuming that any variations are caused by Gaussian noise: P(x,y) = g(r(x,y)) / (g(r(x,y)) + p0), where g(r) = (1/(σ√(2π)))·exp(−r²/(2σ²)).
M-step
After we've estimated the probability of each sample fitting the model, we use this data to improve the model as much as possible. We refine our estimate of the variance based on our results in the E-step, and then revise α so as to minimize the residual error. The exact details of the derivation can be found in [2], but the end result is that the new α is determined by minimizing the weighted error E(α) = Σx,y P(x,y)·( f(x,y) − Σ(u,v)≠(0,0) α(u,v)·f(x+u, y+v) )², which reduces to a linear system that can be solved using least squares,
where the weights P(x,y) are the probabilities computed in the E-step, and p0 is our prior on the uncorrelated model, here equal to 1 over the range of possible values of f.
We repeat the E and M steps until our solution changes by no more than a small amount (where 'small' is at the discretion of the implementer). The algorithm is guaranteed to converge; however, what it converges to may only be a local optimum rather than the global one. Starting with random values helps mitigate pathological cases [4].
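The iteration described above can be sketched compactly. The following Python/NumPy version is a simplified reading of the algorithm in [2], not the project's MATLAB code: it fixes a 3x3 neighbourhood, and for brevity warm-starts α with an unweighted least-squares fit rather than random values.

```python
import numpy as np

def em_cfa(f, iters=30, tol=1e-6):
    """EM estimate of the interpolation weights alpha and the per-pixel
    probability map P(x,y) for a single channel f (simplified sketch of [2])."""
    H, W = f.shape
    offsets = [(u, v) for u in (-1, 0, 1) for v in (-1, 0, 1) if (u, v) != (0, 0)]
    ys, xs = np.mgrid[1:H - 1, 1:W - 1]
    Y = f[1:-1, 1:-1].ravel()
    X = np.stack([f[ys + u, xs + v].ravel() for u, v in offsets], axis=1)
    p0 = 1.0 / (f.max() - f.min())                # uniform density of the uncorrelated model
    alpha = np.linalg.lstsq(X, Y, rcond=None)[0]  # warm start (uniform weights)
    sigma2 = np.var(Y - X @ alpha)
    for _ in range(iters):
        # E-step: posterior probability that each sample fits the linear model
        r = Y - X @ alpha
        g = np.exp(-r ** 2 / (2 * sigma2)) / np.sqrt(2 * np.pi * sigma2)
        w = g / (g + p0)
        # M-step: weighted least squares for alpha, weighted variance for sigma^2
        A = X * w[:, None]
        new_alpha = np.linalg.solve(X.T @ A + 1e-9 * np.eye(len(offsets)), A.T @ Y)
        sigma2 = max(float((w * (Y - X @ new_alpha) ** 2).sum() / w.sum()), 1e-12)
        done = np.abs(new_alpha - alpha).max() < tol
        alpha = new_alpha
        if done:
            break
    r = Y - X @ alpha
    g = np.exp(-r ** 2 / (2 * sigma2)) / np.sqrt(2 * np.pi * sigma2)
    P = np.zeros((H, W))
    P[1:-1, 1:-1] = (g / (g + p0)).reshape(H - 2, W - 2)
    return alpha, P

# A box-filtered noise field is strongly neighbour-correlated, so the
# probability map should come out high almost everywhere.
rng = np.random.default_rng(0)
base = rng.random((42, 42))
f = sum(base[i:i + 40, j:j + 40] for i in range(3) for j in range(3)) / 9
alpha, P = em_cfa(f)
```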
Application to Detection
Of primary interest among the outputs of the EM algorithm is the probability map P(x,y). This gives a per-channel indication of the likelihood of each pixel being correlated with its neighbors. Naturally, large regions of low probability indicate high potential for forgery. However, it's not always trivial to visually detect issues with the probability map, and the raw map is generally not conducive to automatic detection.
Luckily, the periodic correlations imposed by CFA interpolation show up as distinct peaks in the frequency domain. Therefore, we can perform a Fourier transform on the probability map in order to better visualize the presence of correlations: correlated (non-forged) regions produce strong peaks, while forged regions do not. This provides a much cleaner method of identifying regions of suspicion in a probability map.
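As an illustration (not the project's actual figures), a probability map with period-2 structure produces a sharp peak in its magnitude spectrum. A hypothetical synthetic map makes this visible:

```python
import numpy as np

def fourier_peak_view(p_map):
    """Magnitude spectrum of a windowed, mean-subtracted probability map;
    periodic CFA correlations show up as localized peaks."""
    win = np.hanning(p_map.shape[0])[:, None] * np.hanning(p_map.shape[1])[None, :]
    return np.abs(np.fft.fftshift(np.fft.fft2((p_map - p_map.mean()) * win)))

# A synthetic map that alternates every other pixel (mimicking the period-2
# structure left by CFA interpolation) concentrates its energy at the
# highest spatial frequency, i.e. the corner of the shifted spectrum.
yy, xx = np.mgrid[0:64, 0:64]
synthetic = 0.5 + 0.5 * ((xx + yy) % 2)
spectrum = fourier_peak_view(synthetic)
peak_pos = np.unravel_index(np.argmax(spectrum), spectrum.shape)
```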
Blocked Detection Mechanism
It's unlikely for a forger to alter nearly the entirety of an image. On the contrary, forgeries may often alter only very small portions of an image. As such, examining the probability map for an entire image at once is unlikely to reveal much information about a forgery, as the effect of the uncorrelated region will be averaged out over the entire image. As such, it becomes necessary to examine the image in smaller blocks. This leads to two interesting potential avenues of exploration.
Approach A
Approach A involves performing the EM algorithm over the entire image and examining regions of the produced probability map independently. This is the more vanilla approach; in general, an image produced by a camera should exhibit the same correlations throughout. In addition, because the EM algorithm represents the bulk of the computation time for detection, approach A allows the EM algorithm to be performed only once over the entire image, after which any amount of block-level processing can be done without recomputation.
Approach B
Approach B involves performing the EM algorithm on each of the blocks independently and examining the resulting probability map for that block. Approach B is interesting because, while one might expect correlations to be consistent across an unforged image, for some of the more advanced interpolation algorithms this is not the case. It also has some computational advantages. For small enough blocksizes, the entire working set of an EM computation fit in my computer's L1 cache, which resulted in a noticeable speed-up. Additionally, the EM algorithm can take a relatively long time to converge; setting a low convergence threshold mitigates this, but a poorly converged solution can produce a poor probability map. The blockwise approach allowed me to refine the estimate further for suspicious blocks, to ensure that the suspicion was not simply the result of incomplete calculations.
Approach B was my initial approach, and can be found in the CFAdetectB.m file. However, I eventually switched to approach A. This was partially due to improved performance, but mostly because the probability maps can be cached, allowing the model to be changed and tested without recomputing the EM algorithm on the image.
Metric
In order to develop automatic detection, some metric is necessary to determine whether a given block has the potential for forgery. Since the Fourier transforms of the probability maps provide such clear insight into the correlations within a block, I opted for a technique that makes use of them. The metric involves gauging the maximum value of each peak in the Fourier domain and comparing it to the average value in the less interesting regions. For a single block, 4 scores are computed, one for each peak: each peak region is compared against the average of the surrounding background regions to compute a score.
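A sketch of such a scoring function follows; the peak coordinates and region radii here are hypothetical parameters, not the exact geometry used in the project.

```python
import numpy as np

def peak_scores(spectrum, peaks, radius=2):
    """Score each candidate peak as (max magnitude near the peak) divided by
    (mean magnitude of the remaining 'uninteresting' background)."""
    H, W = spectrum.shape
    yy, xx = np.mgrid[0:H, 0:W]
    discs = [((yy - py) ** 2 + (xx - px) ** 2) <= radius ** 2 for py, px in peaks]
    dc = ((yy - H // 2) ** 2 + (xx - W // 2) ** 2) <= radius ** 2
    background = ~(np.any(discs, axis=0) | dc)   # exclude peak discs and DC
    avg = spectrum[background].mean()
    return [spectrum[d].max() / avg for d in discs]

# Reusing the checkerboard example: all the energy sits at the corner of the
# shifted spectrum, so a peak placed there scores high while an arbitrary
# off-peak location scores low.
yy, xx = np.mgrid[0:64, 0:64]
win = np.hanning(64)[:, None] * np.hanning(64)[None, :]
p_map = 0.5 + 0.5 * ((xx + yy) % 2)
spectrum = np.abs(np.fft.fftshift(np.fft.fft2((p_map - p_map.mean()) * win)))
scores = peak_scores(spectrum, [(0, 0), (16, 48)])
```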
Detector
The detector design follows directly from the method and metric described above. After the EM algorithm is performed on the picture, a sliding window (that moves in units of half the blocksize) is used to examine blocksize chunks. The scores according to the previously defined metric are then computed for each channel.
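The window scan itself is straightforward; a small sketch (the blocksize and image dimensions are illustrative):

```python
def sliding_blocks(height, width, blocksize):
    """Yield top-left corners of blocksize x blocksize windows that
    advance by half a blocksize in each direction."""
    step = blocksize // 2
    for y in range(0, height - blocksize + 1, step):
        for x in range(0, width - blocksize + 1, step):
            yield y, x

# A 120x120 image with 60x60 blocks and half-block steps gives a 3x3 grid.
corners = list(sliding_blocks(120, 120, 60))
```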
As will be explained below, it's possible that certain channels present little information about the correlations present in the image. Therefore, to limit false positives, only one of the channels has to exhibit correlations for a block to be cleared of suspicion. It's unlikely that a forged image will exhibit correlations in one channel but not the others, so this should theoretically not limit detection accuracy significantly.
There's a choice that can be made regarding what to do with the scores within a given channel. I explored two variants. The first is the 'one-peak' method: only one of the peaks needs to be above a certain threshold in order for the block to pass. The rationale is that the existence of even one peak above the threshold is indicative of correlations in the image. However, this metric is susceptible to noise, as a sample at a peak location can exceed the average simply by chance.
The other method is the 'all-peak' method. This method requires that all of the peaks be above the threshold. This method is much less susceptible to noise, and allows for lower thresholds to be set while still maintaining reliability. However, it also has a tendency to produce false positives.
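The two variants differ only in how the four peak scores are combined; as a sketch (the threshold and score values are illustrative):

```python
def block_passes(scores, threshold, mode="all"):
    """'one': a single peak above threshold clears the block;
    'all': every peak must exceed the threshold."""
    if mode == "one":
        return max(scores) >= threshold
    return min(scores) >= threshold

# One strong peak clears the block under 'one' but not under 'all'.
scores = [2.0, 0.8, 1.5, 1.9]
```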
Difficulties
A simple detector for this mechanism would simply go through each block in an image and check that its scores are all above a certain threshold. Since non-forged images should clearly exhibit correlations and forged regions shouldn't, this should result in easy detection. Unfortunately, at smaller and smaller blocksizes, some unforged blocks do not exhibit a strong probability of belonging to a correlated model. Blocks that exhibit this behavior tend to have commonalities: they are often either uniform in appearance or otherwise bleak. These 'boring' areas cause problems for the detector.
Gauges of 'Boring' Areas
The easiest theoretical way around the 'boring' problem is to ignore these so-called boring areas during detection. This may result in false negatives; however, we are not overly concerned with forgeries in areas that are not of any particular interest to the viewer. The problem is in gauging what exactly makes a block 'boring'. I tried several different metrics:
- Variance: The variance provides a simple statistical measure of how much values change across a block. This is a very good measure for blocks that are essentially monotone. It does not capture blocks that may be two-tone (which are nearly as uninteresting), or that otherwise have patterns that cause them to fit poorly to the model.
- Local Difference: A simple difference metric where each pixel is compared to the mean of its neighbours, and the rms value across all of the pixels is computed. This captures a more local definition of uniformity. This metric was a little less clear cut than variance in terms of deciding whether a block should be ignored, but generally very low values were correlated with blocks of little interest.
- Gradient: The rms value of the block's numerical gradient was an attempt at a halfway point between the two above metrics.
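The three gauges can be sketched as follows (Python/NumPy; the function and variable names are mine, not the project's):

```python
import numpy as np

def boring_metrics(block):
    """Return (variance, local-difference rms, gradient rms) for a block;
    very low values suggest a 'boring' block that the detector should skip."""
    variance = block.var()
    # local difference: each interior pixel vs the mean of its 4 neighbours
    interior = block[1:-1, 1:-1]
    neighbour_mean = (block[:-2, 1:-1] + block[2:, 1:-1] +
                      block[1:-1, :-2] + block[1:-1, 2:]) / 4.0
    local_diff = np.sqrt(((interior - neighbour_mean) ** 2).mean())
    # gradient: rms magnitude of the numerical gradient
    gy, gx = np.gradient(block)
    grad_rms = np.sqrt((gy ** 2 + gx ** 2).mean())
    return variance, local_diff, grad_rms

flat = boring_metrics(np.zeros((8, 8)))
noisy = boring_metrics(np.random.default_rng(1).random((8, 8)))
```

A flat block scores zero on all three gauges, while a noisy block scores strictly higher; the thresholds in between are exactly the part that proved hard to set.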
The difficulty with these metrics is twofold. One is that it's difficult to set adequate thresholds. Even with the probability maps cached, doing any sort of sensitivity analysis took prohibitively long on my machine, so thresholds were set from sample values as opposed to what works best on the data. This is not purely disadvantageous, as one does not want to risk overfitting a data set, but more rigorous parameter selection would be beneficial.
Additionally, the developed metrics were helpful but not sufficient. Certain blocks passed these metrics but still failed to exhibit strong correlations. I tried a variety of other metrics, such as the second derivative, entropy, and even a grayness metric that looked across channels, but none of these provided the desired insight and merely added to the computation time. Further metrics could greatly enhance detection.
Counting technique
Another method of dealing with the 'boring' block problem, which can be used in conjunction with the block metrics, is to count the suspicious regions and flag only images that exhibit more than a certain number of them. Using this technique noticeably boosted accuracy on the training set over binary detection; however, it is also prone to overfitting, as the minimum count was selected so as to achieve the best result on the training set.
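Combined with the per-block decision, the counting rule is simple; a sketch (the data layout and names are illustrative, using the 'all-peak' rule and the any-channel clearing rule described above):

```python
def classify_image(blocks_channel_scores, threshold, min_count):
    """Count blocks where no channel exhibits correlations (all four peak
    scores below threshold in every channel) and flag the image as forged
    if the count reaches min_count."""
    suspicious = 0
    for channel_scores in blocks_channel_scores:
        # a block is cleared if ANY channel shows all four peaks
        cleared = any(min(scores) >= threshold for scores in channel_scores)
        suspicious += 0 if cleared else 1
    return suspicious >= min_count

# Three blocks, two channels each: one clearly correlated block, two with
# weak peaks in every channel.
blocks = [
    [[2.0, 1.8, 1.9, 2.1], [0.9, 0.8, 1.0, 0.7]],  # cleared by first channel
    [[0.9, 0.8, 1.0, 0.7], [1.0, 0.6, 0.9, 0.8]],  # suspicious
    [[0.5, 0.4, 0.6, 0.5], [0.7, 0.6, 0.5, 0.8]],  # suspicious
]
forged = classify_image(blocks, threshold=1.2, min_count=2)
```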
Design Space
As can be seen from the previous sections, the design space for a detector, even taking the basic model as a given, is huge. The major tunables include:
- The score threshold
- The blocksize
- The metrics for ignoring blocks
- The cutoff count for suspicious regions
In addition, other major design decisions can be changed; for example, whether we need to see just one peak in the Fourier domain or whether all 4 peaks must be present, and whether all channels must pass the test or just one. These choices create many opportunities for tuning. However, such tuning likely requires a much larger dataset; tuning on any small dataset is highly subject to overfitting.
Results
Training Set
The classifier was tested and refined on a set of 140 images. In many ways, these images represented some of the most difficult forgeries to detect. Many of the images had forged blocks as small as 60x60 pixels, and in some cases these blocks were in very uninteresting regions of the image, such as on a blank wall. Training on such cases makes for a very sensitive classifier; however, it is possibly not well tuned for real-world cases.
Training Set Accuracy
The classifier achieved 74% accuracy on the training set using two very different configurations:
- A 60x60 blocksize, a one-peak score threshold of 1.7, and a forgery count of 10 to be classified as forged.
- A 90x90 blocksize, a four-peak score threshold of 1.2, and a forgery count of 1 to be classified as forged.
The second method resulted in a false negative rate of 17% and a false positive rate of 34%. This false positive rate is possibly higher than desired, and can be reduced by altering the threshold and increasing the blocksize; however, this comes at the cost of overall accuracy.
The fact that two very different methods were able to achieve the same accuracy is a testament to the size of the legitimate design space. More exploration could likely lead to a more robust and accurate classifier.
Test Set Accuracy
The classifier achieved 66% accuracy on a test set of 12 png images using the second method outlined above. This included 3 false negatives and 1 false positive.
On the JPEG and resized images the classifier incorrectly classifies all images as forged. However, the classifier was not designed to handle these images.
Analysis
The automatic classifier achieves reasonable accuracy on a very difficult dataset. The forgeries are not visible in the image and do not necessarily occur in regions of interest, which makes the dataset not necessarily an accurate reflection of real-world forgeries. In practice, a combination of forensics techniques and human interaction could serve much better than an automatic classifier alone. Suspicious regions may be highlighted on a first pass, and a human may be able to do more advanced analysis from there. Manually inspecting probability maps after suspicious regions have been flagged often gives much better results than using the classifier alone. Additionally, boring regions can be ignored as unlikely spots of forgery, and other techniques may be used to validate the regions flagged as suspicious.
Conclusions
The classifier developed in this project achieved reasonable accuracy on a large training set. However, 74% is still far from ideal, and further refinement would be required to produce a truly useful automatic classifier. Even as it stands now, manual inspection after screening is very promising, even with such a rudimentary classifier. Further, this technique is just one tool out of a larger toolkit; joined with other techniques, a more robust and accurate classifier is likely possible.
References
[1] Farid, H. Image Forgery Detection: A Survey. IEEE Signal Processing Magazine, March 2009.
[2] Popescu, A.C. and Farid, H. Exposing Digital Forgeries in Color Filter Array Interpolated Images. IEEE Transactions on Signal Processing, Vol. 53, Issue 10, October 2005.
[3] http://en.wikipedia.org/wiki/Demosaicing
[4] http://en.wikipedia.org/wiki/Expectation-maximization_algorithm