CollinsZvinakisDanowitz

From Psych 221 Image Systems Engineering

Implementation and analysis of a perceptual metric for photo retouching

Introduction

Retouched images are everywhere today. Magazine covers feature impossibly fit and blemish-free models, and advertisements frequently show people too thin to be real. While some of these alterations could be considered comical, a growing number of studies show that these pictures lead to low self-image and other mental health problems for many of those who view them. To help address this problem, lawmakers in several countries, including France and the UK, have proposed legislation that would require publishers to label any severely retouched images, and over the last few days, Israel has passed the first law requiring labels for retouched images (in this case, for thinning the model).

Legislation requiring the labeling of modified images raises a number of issues. The first is how to define “severely retouched.” Nearly all published images are modified in some way, whether through basic cropping or color adjustments or through more significant alterations. Which, if any, of these changes are acceptable? The second problem is scale: a huge number of photographs are published every day. How can they all be analyzed for retouching in a timely, cost-effective manner?

In their 2011 paper “A perceptual metric for photo retouching,” Kee and Farid proposed a perceptual photo rating scheme to solve these problems. With their method, an algorithm would analyze the original and retouched versions of an image to determine the extent of the geometric (e.g., stretching, warping) and photometric (e.g., blurring, sharpening) changes made to the original. The results of this analysis would be compared to a database of human-rated altered images to automatically assign a perceptual modification score between 1 (“very similar”) and 5 (“very different”). This scheme, intended to deliver an objective measure of perceptual modification with minimal human involvement, would allow authorities or publishers to define a threshold for a “severely retouched” image and label them accordingly.

This project is largely an effort to reproduce the results from the Kee and Farid paper. Accordingly, the algorithm and methods described by the paper have been implemented and tested on a set of images. The rest of this report describes the algorithm implementation process. The report discusses the results of applying this algorithm to a set of retouched images, as well as potential improvements to the algorithm's effectiveness and practicality.

Methods

The retouching algorithm described in the Kee and Farid paper consists of a number of automatic and manual steps, culminating in the generation of a predicted score for the significance of the alterations made to the human subject(s) of each image. Each image is first analyzed for geometric changes and transformations. These geometric changes are processed to produce four statistics, two describing the changes to faces in the image and two describing the changes to bodies. Four more metrics are then gathered to describe photometric changes to faces in the image, using the SSIM measure of local image similarity and by determining the extent of local blurring and sharpening in the image.

Once gathered, the eight statistics corresponding to each image are then mapped to a predicted user rating using nonlinear support vector regression (SVR). The SVR models used in this process are in turn trained on the statistics and corresponding user ratings for a set of test images.

Datasets

Before/after image pairs and face/hair/torso masks

To provide a perceptual modification rating, the retouching algorithm requires both the original and retouched versions of an image. In addition, as will be discussed in later sections, the algorithm requires a large set of such image pairs to derive the necessary models. The Kee and Farid paper describes the use of 468 such image pairs for the original development and testing of the algorithm. These images were collected from a variety of online sources, primarily the websites of image retouching services. The retouched images ranged from minor to severe in the extent of retouching present (a statistical analysis of the image ratings is in the Results section).

To aid in the implementation process, the original set of 468 image pairs used in the Farid paper was acquired directly from the author. In addition, Farid provided the face, hair, and torso masks for the subjects of each image pair, the necessity of which is discussed in later sections.

Observer ratings on retouching severity

The supplementary resources included with Kee & Farid’s 2011 paper contain a data set with Amazon Mechanical Turk observer ratings and predicted ratings generated by their algorithm for each of the 468 before/after images in the photo set (the appendix contains more details about this data set). Unfortunately, the numbering of these image ratings did not match the numbering of the images in the photo set provided by Kee and Farid, so there was no direct way to correlate our data with the image ratings from their 2011 paper. Hours before the submission deadline for this report, however, Prof. Farid provided us with an updated CSV corresponding to the data in the paper, allowing us to rush to redo our analysis to incorporate this data set.

Before receiving the updated data, we generated observer ratings by rating 290 of the before/after images ourselves. Each group member rated the images on a scale of 1 (“very similar”) to 5 (“very different”) in the order they appeared in the photo set provided to us. The presentation of images was self-timed--we allowed ourselves to manually toggle between the before and after images as many times as desired. We rated only a subset of the before/after images--290 of the 468 in the provided photo set--because at the time of rating, we had only finished generating statistics for this portion of the images. Professor Farid’s new data, however, contained 50 ratings for each image gathered through Amazon’s Mechanical Turk.

Geometric model

Geometric changes affect the overall shape of the subject. Examples of geometric modifications include slimming of legs, hips, and arms, elongating the neck, improving posture, enlarging the eyes, and making faces more symmetric.

The algorithm models geometric distortion between the before and after images using the eight-parameter local fit shown in equation 1. The terms “b” and “c” account for basic brightness and contrast changes between the before and after images, while the affine parameters m1–m4 and the translation terms tx and ty describe how each local region has been warped and translated in the x and y directions.

(1)    f(x, y) = c · g(m1·x + m2·y + tx, m3·x + m4·y + ty) + b

where f and g are corresponding local regions of the retouched and original images, respectively; m1–m4 and (tx, ty) define the local affine warp, and c and b account for contrast and brightness changes.

Once image registration is complete, we can take the results of the affine fit and compute a vector field describing how much the geometry of the after image has been altered. This vector field, when computed over the face and body, provides the basis for determining how much an image has been geometrically altered.

Implementation

We used Kee & Farid's image registration code to find the affine warp. We used custom Matlab code to compute the two-dimensional vector field showing how the image has been geometrically retouched.
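
As a rough illustration, the following MATLAB sketch shows one way a per-block displacement field could be assembled from the local affine fits. The variable names, the parameter layout [m1 m2 tx; m3 m4 ty], and the block size are our assumptions for illustration, not the actual interface of the registration code.

    % Sketch: build a per-block displacement vector field from local affine fits.
    % affineParams{r,c} is assumed to hold the 2x3 local affine matrix
    % [m1 m2 tx; m3 m4 ty] estimated for each blockSize x blockSize block.
    blockSize = 16;                          % assumed local block size
    [numR, numC] = size(affineParams);
    vx = zeros(numR, numC);                  % x-component of displacement
    vy = zeros(numR, numC);                  % y-component of displacement
    for r = 1:numR
        for c = 1:numC
            A = affineParams{r,c};           % local affine warp for this block
            x = (c - 0.5) * blockSize;       % block center, image coordinates
            y = (r - 0.5) * blockSize;
            warped = A * [x; y; 1];          % where the block center maps to
            vx(r,c) = warped(1) - x;
            vy(r,c) = warped(2) - y;
        end
    end
    vmag = sqrt(vx.^2 + vy.^2);              % magnitude of geometric change
    quiver(vx, vy);                          % visualize the vector field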

Photometric model

Photometric alterations affect skin tone and texture: smoothing, sharpening, or other operations that remove or reduce wrinkles, cellulite, blemishes, freckles, and dark circles under the eyes.

Photometric changes are modeled with a local linear filter and a generic measure of local image similarity (SSIM).

The local linear filter is used to determine the extent of the sharpening or blurring that occurs over a small region of the image. This is done by estimating a matrix “H” that transforms a portion of the registered, unaltered image into the corresponding part of the altered image, as shown in equation 2. Since different parts of the image may be sharpened or blurred independently, H must be estimated separately for each local region to determine the extent of photometric alteration. For our reimplementation of the algorithm, we analyzed 4x4 pixel blocks of the images.

(2)    f_after ≈ H · f_before

where f_before is a vectorized local block of the registered original image, f_after the corresponding block of the retouched image, and H the local linear filter relating them.

We then determined the frequency response of each local filter H using equation 3. If the result D is positive, the corresponding portion of the image has been blurred; if it is negative, that part of the image has been sharpened.

(3)    [D: a scalar measure computed from the frequency response of H; D > 0 indicates local blurring, D < 0 indicates local sharpening]
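
The following MATLAB fragment is a minimal sketch of this step, using a simplified stand-in for our photometric fit: instead of the full matrix H that we estimated with CVX (see Implementation below), it fits a small 3x3 convolution kernel to a 4x4 block by Tikhonov-regularized (ridge) least squares and then compares the low- and high-frequency gain of the kernel as an illustrative version of the D measure. The block location, kernel size, regularization weight, frequency cutoffs, and sign convention are all assumptions for illustration.

    % Sketch: Tikhonov-regularized estimate of a local filter and a crude
    % blur/sharpen indicator. bY, aY: luminance of the registered before image
    % and the after image (double arrays). r, c: an example 4x4 analysis block.
    r = 60:63;  c = 80:83;                   % example block location (assumed)
    pad = 1;                                 % half-width of the assumed 3x3 kernel
    patchB = bY(r(1)-pad:r(end)+pad, c(1)-pad:c(end)+pad);
    patchA = aY(r, c);

    A = im2col(patchB, [3 3], 'sliding')';   % 16x9: each row is a 3x3 neighborhood
    yv = patchA(:);                          % 16x1: corresponding "after" pixels
    lambda = 0.1;                            % assumed regularization weight
    h = (A' * A + lambda * eye(9)) \ (A' * yv);   % ridge (Tikhonov) solution

    % Frequency response of the estimated kernel; compare low vs. high gain.
    Hf = fftshift(abs(fft2(reshape(h, 3, 3), 32, 32)));
    [u, v] = meshgrid(-16:15, -16:15);
    radius = sqrt(u.^2 + v.^2);
    lowGain  = mean(Hf(radius <= 4));        % gain near DC
    highGain = mean(Hf(radius >= 12));       % gain at high frequencies
    D = lowGain - highGain;                  % assumed convention: D > 0 -> blur,
                                             %                     D < 0 -> sharpen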

We also used a generic measure of local image similarity (SSIM) to detect photometric modifications not captured by equation 3. The SSIM measure embodies contrast and structural modifications, as shown in equation 4.
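
For reference, the standard SSIM index of Wang et al. [4], which the ssim_index.m code [3] computes over local windows, is:

(4)    SSIM(x, y) = ((2·μ_x·μ_y + C1) · (2·σ_xy + C2)) / ((μ_x² + μ_y² + C1) · (σ_x² + σ_y² + C2))

where μ_x, μ_y are the local means, σ_x², σ_y² the local variances, and σ_xy the local covariance of corresponding patches x and y, and C1, C2 are small stabilizing constants. The second factor, (2·σ_xy + C2) / (σ_x² + σ_y² + C2), is the combined contrast-and-structure term referred to above.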

Implementation

For the implementation of the local linear filters, we used Stephen Boyd’s CVX code [2] to solve for H with a Tikhonov-regularized (ridge) regression on 4x4 pixel blocks of the images, and wrote custom code to determine the frequency response of each local filter. We used Zhou Wang’s SSIM code [3] to produce an SSIM index map of the image. Since the generated SSIM map is smaller than the image, we wrote custom code to remove the padding in the SSIM map so that the regions in the SSIM map and the image would line up. All photometric computations were performed on the luminance channel of the before and after images, since Kee & Farid found that brightness did not impact observer ratings.
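
A minimal sketch of the SSIM step, assuming Zhou Wang’s ssim_index.m [3] with its default 11x11 window (so the returned map is computed over “valid” windows and is 10 pixels smaller than the input in each dimension); bY and aY are assumed to be the luminance channels of the registered before image and the after image:

    % Sketch: SSIM index map on the luminance channel, with the face mask
    % cropped to line up with the (smaller) map. bY, aY: luminance images;
    % faceMask: binary face mask provided with the data set.
    [mssim, ssimMap] = ssim_index(bY, aY);   % Zhou Wang's implementation [3]

    % The default 11x11 window means the map loses a 5-pixel border on each
    % side (an assumption based on the default window size).
    pad = 5;
    faceMaskCrop = faceMask(1+pad:end-pad, 1+pad:end-pad);
    ssimFace = ssimMap(faceMaskCrop > 0);    % SSIM values over the face region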

Perceptual distortion model

Perceptual distortion refers to the extent of photo manipulation. In our algorithm, it is encapsulated in eight statistics taken from the geometric and photometric modification models:

   Statistics from the geometric modification model:
       (1) mean and (2) standard deviation (SD) of geometric motion over the subject’s face
       (3) mean and (4) SD of geometric motion over the subject’s body
   Statistics from the photometric modification model:
       (5) mean and (6) SD of smoothing and sharpening filters over the subject’s face
       (7) mean and (8) SD of SSIM over the subject’s face


Implementation

The geometric statistics were determined by projecting the 2D vector field representing the geometric transformation onto the gradient of the image. This projection is designed to emphasize geometric distortions perpendicular to major image features, which should be the most perceptually noticeable.

For the different measurements, we masked out everything but the region of interest (face/body), using the masks provided to us by Prof. Farid; the face region was defined in terms of the face mask and the body region was defined to include the hair and torso masks. For each of these measurements, we then found the mean and standard deviation of the masked regions.
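
Assuming the displacement field (vx, vy), the per-block blur/sharpen values (Dmap), and the cropped SSIM map have all been resampled to the same resolution as the masks, the eight statistics might be assembled roughly as follows; the variable names and the exact form of the gradient projection are our assumptions:

    % Sketch: the eight perceptual-distortion statistics from the masked maps.
    [gx, gy] = gradient(bY);                 % gradient of the original luminance
    gmag = sqrt(gx.^2 + gy.^2) + eps;

    % Project the displacement field onto the (unit) gradient direction so that
    % motion perpendicular to image features is emphasized.
    proj = abs(vx .* gx + vy .* gy) ./ gmag;

    face = faceMask > 0;                     % face region
    body = hairMask > 0 | torsoMask > 0;     % body region = hair + torso masks

    stats = [ mean(proj(face))     std(proj(face)) ...      % (1), (2)
              mean(proj(body))     std(proj(body)) ...      % (3), (4)
              mean(Dmap(face))     std(Dmap(face)) ...      % (5), (6)
              mean(ssimMap(face))  std(ssimMap(face)) ];    % (7), (8)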

Support vector regression

We used nonlinear support vector regression to map the perceptual distortion statistics to our observer ratings and to generate the predicted scores for each before/after image. In particular, we used a nu-SVR model with a radial basis function kernel to estimate the relationship between the eight perceptual distortion summary statistics (each scaled to the range [-1,1]) and the average observer rating for each of the retouched images.

Kee & Farid used leave-one-out cross validation (LOOCV) for their regression analysis. In LOOCV, an SVR model is generated using all but one of the images as the training set, and this model is used to make a prediction on the remaining test image; this process is repeated using each of the images as the test image to generate predictions for all of the images.

Due to computational constraints, we were unable to use LOOCV for parameter selection and model training. We started by running the regression analysis with the whole data set used as both the training and testing set. This resulted in a regression model that was overfitted to the data, so we then split our images in half, using one half as the training set and the other half as the test set for which we wanted to generate predictions; we then trained the support vector model on this training set.

The SVR has two primary degrees of freedom, γ and C, where γ specifies the spatial extent of the kernel function and C specifies the penalty applied to deviations from the regression function. We selected C and γ using a dense 2D grid search with five-fold cross-validation on the training set. This procedure divides the training set into five subsets and tests various pairs of C and γ values on each subset; the pair of C and γ values with the best cross-validation performance (lowest mean error across the five folds) is then used to train the SVR on the entire training set. We then use the resulting model to generate predictions for the test set.
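
A minimal sketch of this procedure using libSVM’s MATLAB interface [5] is shown below; the grid ranges and the nu value are placeholders, and with '-s 4' (nu-SVR) and '-v 5', svmtrain returns the five-fold cross-validation mean squared error, which we minimize.

    % Sketch: grid search for (C, gamma) with 5-fold CV, then train and predict.
    % X: Nx8 matrix of statistics scaled to [-1,1]; y: mean observer ratings;
    % trainIdx/testIdx: an assumed split of the images into halves.
    % Note: svmtrain/svmpredict here are libSVM's MATLAB functions.
    bestMSE = Inf; bestC = 1; bestG = 1;
    for log2c = -5:2:15
        for log2g = -15:2:3
            opt = sprintf('-s 4 -t 2 -n 0.5 -c %g -g %g -v 5', 2^log2c, 2^log2g);
            mse = svmtrain(y(trainIdx), X(trainIdx,:), opt);   % 5-fold CV MSE
            if mse < bestMSE
                bestMSE = mse; bestC = 2^log2c; bestG = 2^log2g;
            end
        end
    end

    % Train nu-SVR with an RBF kernel on the training set, predict the test set.
    opt = sprintf('-s 4 -t 2 -n 0.5 -c %g -g %g', bestC, bestG);
    model = svmtrain(y(trainIdx), X(trainIdx,:), opt);
    pred = svmpredict(y(testIdx), X(testIdx,:), model);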

Implementation

We implemented our SVR algorithm using the standard libSVM tool set. Since the libSVM MATLAB interface does not include grid-search parameter selection, we also used custom MATLAB code for this step, obtained from the libSVM FAQ website [6].

Batch script implementation

When running our algorithm, we determined that image registration alone takes several hours per image. Due to the large number of photographs we needed to analyze and our time constraints, we implemented a batch script to analyze multiple photographs in parallel on Stanford’s new Farmshare cluster.

Our system starts with a generic MATLAB script designed to process a single image and extract the eight required statistical measurements. Our batch script then modifies this MATLAB code so that each job analyzes a different image from our dataset.

Our batch script then copies all of the necessary matlab library files for our implementation onto the Farmshare cluster, and issues a separate batch job to run our Matlab analysis for each photograph. Job run logs, image analysis results, and the test commands used for each batch job are stored in the cluster data space.

Our implementation of the batch system incorporated reference code from Stanford PhD students John Brunhaver and Jacob Leverich, who have built and distributed standard scripts to automate batch submission and to provide a simplified wrapper around the “qsub” batch job submission program respectively.

While using the cluster allowed us to run multiple jobs in parallel, we still ran into a few runtime issues. First, at this time of the quarter the cluster is fairly busy, and jobs sometimes have to wait for days before they are allocated to a Farmshare machine for execution. Even then, the Farmshare machines take on the order of 3-4 hours to analyze a single full-sized image. We were able to decrease this runtime to some extent by downsampling our sample images to 256x256 pixels each, but the run time for all 400 images was still on the order of days.

Results

Statistics on observer ratings data sets

The mean Amazon Mechanical Turk observer rating in the data set used in Kee & Farid’s 2011 study was 2.65 with a standard deviation of 0.63. We only ran the SVR on a subset of these rated images--290 of 468 images--because these were the only images that had the full set of summary statistics and observer ratings. For this subset of images, the mean Amazon Mechanical Turk observer rating was slightly lower than the overall mean; the subset’s mean was 2.56 with a standard deviation of 0.64.

For our self-administered observer ratings, the mean rating was 3.238 with a standard deviation of 0.71. We also found that our self-administered ratings matched up well with the Turk observer ratings; the correlation coefficient between the average ratings for the images in these two data sets was 0.844. A comparison of the rating distributions of these datasets is displayed in the figure below. This suggests that we would have been able to achieve similar predicted ratings in the SVR step of our algorithm had we been forced to use our own ratings instead of the ones obtained by Kee & Farid.

Caption: We compared the means of our ratings for each before/after image to the ratings obtained for those same images in Kee & Farid’s (2011) study.

Predicted vs. observed ratings

The output of the retouching algorithm is a predicted observer rating for each image, produced from the eight summary statistics using a nonlinear SVR. This rating is intended to match as closely as possible the mean user ratings on which the SVR models are trained. The figure below shows the results of the algorithm as applied to a subset of the image set. The high correlation between the predicted values and the observed values for each image (the mean user ratings), with an R-value of 0.924 and a mean squared error (MSE) of 0.060, indicates that an accurate SVR model was generated.

Caption: A nonlinear SVR was used to correlate the summary statistics with predicted user ratings. The SVR model was trained and tested on the same image set, with parameters determined using 5-fold cross validation.

However, these results were generated using a model both trained and tested on the same image set, suggesting an overfit model that may not generalize well. In practical applications, an SVR model would be used to predict ratings for new images on which the model was not trained. As such, the authors use leave-one-out cross-validation for parameter selection and model training, whereby an SVR model is generated from all of the images in the set except one test image. Thus, a different model is generated for each image.

As leave-one-out cross-validation proved to be too computationally expensive, the photo set was split into a training set and a testing set of roughly equal size. Parameter selection and model training were then conducted on the training set and predictions were run on the testing set; the results can be seen in the figure below. The accuracy of this model was markedly poorer, with an R-value of 0.054 and an MSE of 23.602.

Caption: A nonlinear SVR was used to correlate the summary statistics with predicted user ratings. The SVR model was trained and tested on separate but equally sized image subsets, with parameters determined using 5-fold cross validation on the training subset.

In order to isolate the cause of this poor correlation, the algorithm was applied in a similar manner using just the four geometric statistics and just the four photometric statistics, as seen below. Although the results using the four photometric statistics suggest that they alone offer poor predictive capability, visual inspection of the results generated with the four geometric statistics suggests that these statistics are potentially invalid.

Caption: A nonlinear SVR was used to correlate the four photometric statistics with predicted user ratings. The SVR model was trained and tested on separate but equally sized image subsets, with parameters determined using 5-fold cross validation on the training subset. Several out of range values were discarded.


Caption: A nonlinear SVR was used to correlate the four geometric statistics with predicted user ratings. The SVR model was trained and tested on separate but equally sized image subsets, with parameters determined using 5-fold cross validation on the training subset. Several out of range values were discarded.

Conclusions

While we were able to reimplement Kee and Farid’s algorithm, we encountered a number of problems that limited the scope of our results. The first, and most significant, was that up until a few hours before the project deadline, we were unable to correlate the images provided by Farid with the predicted and observed ratings referenced in their paper. While we rushed to incorporate Kee and Farid’s revised data as soon as it became available, the extreme time constraints on the analysis left us unable to fully debug our results.

Before receiving Prof. Farid’s revised data, we decided to obtain new observer ratings by having each group member rate the images on the same 1-to-5 rating scale described in the paper, and averaging our ratings together.

We were initially unable to correlate the images in the photo set to the rating scores in the supplemental resources of the paper because the image numbering in the photo set did not match the numbering in the data set of ratings. After looking up the photo set numbers for the ten images in Figure 4 of the paper, we found that the observer/predicted values in the .csv file for these image numbers differed from the reported observer/predicted ratings in Figure 4. Surprisingly, at least one set of measurements from Figure 4 appears to be completely absent from Kee and Farid’s original results spreadsheet.

The experience of rating each before/after image allowed us to reflect more deeply on what factors contributed to our decision to rate an image a certain way. We also noticed retouching trends across the set of images that we rated: retouching differences based on gender (women tended to be softened, whereas men tended to be sharpened), retouching differences based on skin tone (light skin tended to be made lighter, whereas dark skin tended to be made darker), and common focus areas (eyes, bust, waist, hair). The awareness of our decision-making process and the increased familiarity with common retouching procedures that resulted from rating the images ourselves allowed us to identify new ways in which the existing algorithm could be improved, which are discussed later.

We also ran into a number of issues when dealing with the implementation of the geometric transformation script. The most pressing issue for us was the length of time it took--at least 3-4 hours--to complete running on a pair of before/after images. While we tried to solve this problem by batching all of our analysis out to Farmshare, this process turned out to be unreliable, with jobs sometimes staying queued for over 24 hours. We dealt with this issue by creating scripts to process a few images in a row, connecting to multiple Corn machines, and manually running each of these scripts on each machine we connected to. This setup made the image processing stage run much faster, since the scripts ran immediately, and it allowed us to reliably monitor the progress of each image.

The next problem we ran into was that the geometric transformation code provided by Professor Farid initially suffered from several bugs. First, out of the box, the code did not return all of the parameters of the local affine fit, requiring a moderate debugging effort. Second, the code broke on images with unequal heights and widths, forcing us to resize our sample images before running them through image registration. While this requirement was simple to meet, it is never mentioned in the code’s documentation. This, combined with the other bugs in the code and with our regression results indicating that the geometric statistics appear to be invalid, leads us to believe that there are additional bugs in the image registration code.

Another problem we encountered was the slow speed of running the regression analysis with the leave-one-out cross-validation (LOOCV) procedure that Kee & Farid used in their 2011 paper. In order to quickly assess whether our generated statistics could produce valid predictions, we ran our regression analyses using all images in both the training and testing sets. Since this produced an overfitted model that may not generalize, our regression analysis could have been improved had we used the LOOCV technique instead.

Future work

While working with the algorithm, we also identified a few key areas where the algorithm could be improved. First, given our issues in running all of the jobs, we believe that optimizing the algorithm for run-time should be a high priority. If we shifted away from Matlab, and instead implemented this algorithm in C, we could most likely realize substantial performance gains. Algorithm run-time could probably also be improved by culling the image background from the photos before the image registration step.

The second major area for improvement would be to automate face and body segmentation. Currently, the algorithm relies on a user manually masking out the face and hair regions to compute the required image statistics. Automating this process by using standard face and person detection algorithms could greatly decrease the effort required to rate each image.

We also have several suggestions for improving the quality of the algorithm’s results. First, as future work, we’d like to incorporate modifications to extremities into our results. For example, in before/after image 205, Britney Spears’ calves are lengthened, creating a substantial perceptual change that is not picked up by the current algorithm.

Also, it is important to recognize that some perceptual distortions are more noticeable than others. Specifically, modifications to the eyes and mouth, such as filling in missing teeth, can result in large changes to the human-provided perceptual scores but only minor changes in the algorithmically predicted results. We propose breaking the face down into multiple regions, including the eyes, nose, and mouth, and giving added weight to these regions to bring the algorithmic estimates more in line with human perception.

Next, we recognize that removing some minor freckles and blemishes probably does not have much of a perceptual effect on the image, and should not drastically increase an image’s perceived score. Therefore, we propose adding a tunable minimum threshold below which any photometric and geometric distortions will be ignored.

Finally, we propose to improve the algorithm by adding statistical measurements for photometric modifications made to the body, which the algorithm currently does not detect. Photometric modifications to the body, such as blurring out cellulite, can have a visible perceptual effect on the image, so adding these statistics could improve the algorithm’s estimates.

References - Resources and related work

   1. Farid’s image transformation MATLAB code: 
        http://www.cs.dartmouth.edu/farid/Hany_Farid/Research/Entries/2011/5/17_Image_Registration.html
   2. CVX: http://cvxr.com/cvx/
   3. Zhou Wang’s SSIM code: https://ece.uwaterloo.ca/~z70wang/research/ssim/
   4. The SSIM implementation (and SSIM in general) is described in this paper: 
        http://www.cns.nyu.edu/pub/eero/wang03-reprint.pdf
   5. libSVM: http://www.csie.ntu.edu.tw/~cjlin/libsvm/
   6. Parameter selection in MATLAB: http://www.csie.ntu.edu.tw/~cjlin/libsvm/faq.html#f803
   7. John Brunhaver wrote the two main functions in the photo_batch script.

Appendix I - Code and Data

Data

Group Image Ratings

User ratings recorded by group members for subsets of the images

Amazon Mechanical Turk observer ratings on retouching severity

Photoset: http://www.cs.dartmouth.edu/farid/downloads/publications/pnas11/beforeafter.tar

Corresponding Masks: http://www.cs.dartmouth.edu/farid/downloads/publications/pnas11/masks.tar

File:Farid ratings.zip: User ratings data graciously provided by Prof. Farid. This set differs from the one provided with the publication in that the numbering matches the author's photo set.

Fifty observer ratings for each of the 468 before/after images used in Kee & Farid’s 2011 paper were acquired from the supplementary resources for the research paper. The authors gathered the user data using the process and observers described below:

Task: Each observer session lasted approximately 30 minutes and was structured as follows:

1. Each participant was initially shown a representative set of 20 before/after images to help them gauge the range of distortions that they could expect to see.

2. Each participant was then asked to rate 70 pairs of before/after images on a scale of 1 (“very similar”) to 5 (“very different”). The presentation of images was self-timed; participants could manually toggle between before and after images as many times as they wished. Each observer rated a random set of 5 images 3 times each to measure the consistency of their responses.

Observers: These ratings were provided by 390 observers who were recruited through Amazon’s Mechanical Turk. Each observer was paid $3 for their participation in the session. 9.5% of observers were excluded because they responded with high variance on repeated trials and frequently toggled only once between before and after images, suggesting a lack of consistency or seriousness in their rating process.

Code

zip file with custom code

This .zip contains several files, including:

photo_batch.pl: Perl script used to batch out statistics-gathering jobs to multiple machines in the Farmshare cluster. Modify/run this script to gather stats.

image_farm.m: master MATLAB code modified by photo_batch.pl to do cluster statistics gathering.

photometric.m, stats.m, vfield.m: function files called by image_farm.m or run_in_serial...m.

jsub: executable needed for the Farmshare cluster.

run_in_serial_331_340.m: serial adaptation of image_farm.m.

prepAndRunSVM_revised....m: different variants of code to run the SVR on the gathered stats. See the files for differences.

Other code needed (ssim_index.m, CVX, image registration code, libSVM) is cited in the report.

Appendix II - Work partition

Much of the work for this project was performed cooperatively, with all three group members meeting frequently to discuss and explore the algorithm and its implementation. However, each member focused on different aspects of the project. Andrew Danowitz led much of the early code exploration and did most of the implementation for the photometric (filter and SSIM) components. He also pieced together the different implementation components and set up the capability to submit much of the computational work to the Farmshare cluster. Andrew contributed a great deal to the report and presentation slides as well.

In addition to taking part in the collaborative aspects of the project, Andrea Zvinakis wrote much of the report and presentation slides. She also performed statistical analysis and, when the Farmshare cluster was unable to support our workload, was responsible for running the summary statistics generating code on hundreds of images on other computers. Andrea also set up the capacity for the group to rate certain images from the photo set.

Taking part in the collaborative aspects of the project as well, Bradley Collins also made small contributions to the report and presentation slides. Bradley also explored or implemented several parts of the geometric component of the algorithm, helped adapt the summary statistics generating code to run jobs serially on campus computers, and used that code to gather many of the image statistics. In addition, Bradley was responsible for most of the SVR implementation and running.