GellineauHuangLee

From Psych 221 Image Systems Engineering

Introduction

In recent years, advertisers and magazine editors have been accused of over-glamorizing the subjects of their photo shoots. Through photo-editing software like Adobe Photoshop, models have graced ads and magazine covers with seemingly impossibly slim and curvy figures, or impossibly smooth skin. In response to this phenomenon of 'photoshopping to the extreme,' Eric Kee and Hany Farid published a paper in 2011 detailing how computers and algorithms can be used to judge how much a picture has been altered, something that usually only a human being can perceive properly.

This project is in essence an attempt to follow and replicate the results obtained by Kee and Farid.

Methods

Our methodology was to gather statistics that describe the alterations people notice most and use them to predict user perception of a photo manipulation. Following the source paper of this project, these statistics fall into two categories: geometric distortions and photometric distortions. Geometric distortions describe modifications such as flattening the stomach, enlarging the bust, and shaping the hips, which help form our opinion of the degree of modification. The other most common distortions are classified as photometric distortions; these describe pixel-to-pixel differences between images and the effects of blurring and sharpening. We gather these statistics and compare them to data collected from real users. In this way, we can train the software not only to collect these statistics but to produce a rating that agrees with our perception of photo manipulation.

Data Collection

A set of approximately 137 before-and-after image pairs was collected from the internet from various sources, including magazine covers, artist demonstrations and collections on discussion forums. Full-body, face and torso masks were made individually and manually for each of these images. These masks were necessary in order to collect relevant statistics: if an object is added to the after photo that doesn't affect our perception, we would only introduce noise by attempting to fit statistics to it. Therefore, blurring-filter fits are only done over the face, and geometric distortions are weighted based on where they occur on the body. This image-preparation step is the only non-automatic part of the software and was a problem for our source paper as well.

To collect the user data, a website was also constructed (http://www.stanford.edu/~arion). The website allowed a user to rate the degree of image manipulation on a scale from 1 to 5 for 70 photos chosen at random from our set of 137. We then used Google's form interface to have that information sent back to us for aggregation.

The data collected to date is summarized in the figure above, which shows the mean rating and standard error for each image. The images were assigned random numbers but are sorted in the figure in order of average rating. The error bars are calculated as the standard error σ/√N, where N is the number of times a photo has been rated and σ is the standard deviation of those ratings.
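
As a concrete illustration, a small MATLAB sketch of how such error bars can be computed is given below. The variable name ratings (a cell array holding one vector of user ratings per photo) is an assumption made for this sketch, not necessarily the structure used in our scripts.

    % Sketch: per-photo mean rating and standard error (sigma / sqrt(N)).
    % 'ratings' is assumed to be a cell array with one vector of ratings per photo.
    nPhotos = numel(ratings);
    meanR   = zeros(nPhotos, 1);
    seR     = zeros(nPhotos, 1);
    for k = 1:nPhotos
        r        = ratings{k};
        meanR(k) = mean(r);
        seR(k)   = std(r) / sqrt(numel(r));    % standard error of the mean rating
    end
    [meanSorted, order] = sort(meanR);          % sort photos by average rating
    errorbar(1:nPhotos, meanSorted, seR(order), '.');
    xlabel('Photo (sorted by average rating)'); ylabel('Mean user rating');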

Some ratings are spread very widely while others have a very tight distribution. In addition, a few users disagree heavily with the average results, which implies that they may not have carried out the experiment correctly.

Geometric Distortions

We implemented an algorithm for image registration as described in reference 4. The algorithm calculates an affine distortion that maps the original image to the edited one. In this context, the affine transformation (including a contrast and brightness term, following reference 4) is modeled as

m7 f(x, y) + m8 = g(m1 x + m2 y + m5, m3 x + m4 y + m6)

where f is the original image and g is the edited image.

m7 and m8 describe the contrast and brightness alterations in the image and are included only to improve the quality of the geometric fit in the presence of contrast variations. The other terms describe the translation and stretching of the image. These values then define the vector field v(x,y), which represents the amount of local distortion between the original and edited images.
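
To make the model concrete, here is a minimal MATLAB sketch of applying one set of parameters m1..m8 to a grayscale image f (the variable name, the example parameter values, and the bilinear resampling via interp2 are illustrative assumptions, not the exact choices in reference 4).

    % Sketch: apply the 8-parameter model to a grayscale image f (double, name assumed).
    m = [1.05 0.02 -0.01 0.98 3 -2 1.1 -5];         % example parameter values (hypothetical)
    [X, Y] = meshgrid(1:size(f, 2), 1:size(f, 1));  % pixel coordinates
    Xw = m(1)*X + m(2)*Y + m(5);                    % affine-warped x
    Yw = m(3)*X + m(4)*Y + m(6);                    % affine-warped y
    warped = interp2(f, Xw, Yw, 'linear', 0);       % resample f at the warped grid
    fitted = m(7)*warped + m(8);                    % contrast (m7) and brightness (m8)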

The finer details required to implement this algorithm are in the references, and our source code is available in the appendices, but a few parts of the algorithm are worth explaining. The image below shows the performance of the algorithm: it takes the before and after images and calculates the vector field v(x,y), which describes the geometric changes between them. Notice that the magnitude of v(x,y) has hot spots where the photo editor brought in her hips. The applied fit is a copy of the original with the calculated vector field applied to it; notice that this image is now better aligned to the edited image, which is a requirement for the next step.

The algorithm does the following in order to establish a fit:

  • Both images are converted to 255×255 grayscale images. Their borders are padded with random noise and the background in each image is replaced with random noise.
  • Both images are then histogram equalized.
  • Each image is then blurred and decimated into a Gaussian pyramid.
  • At each pyramid level, we do a preliminary alignment in which the entire image is a single block, then we break the image into 9×9 blocks and perform a geometric fit per block.
  • To ensure smoothness in the m values, we add a smoothness penalty and iteratively solve a least-squares problem that minimizes both the error in estimating m and the roughness of m.
  • With this estimate of m, we then calculate the likelihood of each pixel in the original image belonging to the edited one and again iteratively solve for the m values.
  • This overall process is repeated 5 times to improve the fit before the algorithm moves on to the next level of the pyramid.
  • In the end, we collect all of the transformations and return an estimate of the vector field v(x,y).
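
A short MATLAB sketch of the first three steps above (grayscale conversion, resizing, noise padding and histogram equalization) is given below; the file names, the border width, and the specific use of imresize and histeq are illustrative assumptions rather than the exact calls in our code.

    % Sketch of the preprocessing steps (Image Processing Toolbox functions).
    A = im2double(rgb2gray(imread('before.jpg')));  % hypothetical file name
    A = imresize(A, [255 255]);                     % bring the image to 255x255
    pad = 8;                                        % assumed border width
    P = rand(255 + 2*pad);                          % noise canvas for the border
    P(pad+1:end-pad, pad+1:end-pad) = A;            % embed the image inside the noise
    P = histeq(P);                                  % histogram equalization
    % (Background replacement with noise uses the hand-made masks and is omitted here.)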

Not all images were perfectly registered in our testing. For example, in the image on the right we can see that, although the algorithm did estimate the distortion made by this photo manipulator, it did not distort the image to the same degree. We believe that our smoothness-penalty settings and number of iterations were not tuned as well for this image as for others. Convergence is not unique, and for a few images we do notice a slightly better fit on repeated tries.

Once we have a mapping v(x,y) for an image, we calculate the mean and standard deviation of the magnitude of v(x,y) over the face region. We also do the same over the entire body region, with a few exceptions: we weight modifications to the bust, waist and hips by a factor of 2, modifications to the face by a factor of 0.5, and modifications to every other body part by 1. We used this exact weighting mainly to agree with the methods of our source paper.
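
The weighting described above can be summarized in a few lines of MATLAB; the variable names for the vector field and the masks (v, faceMask, bodyMask, bustWaistHipMask) are assumptions made for this sketch.

    % Sketch: region-weighted statistics of the distortion magnitude |v(x,y)|.
    vmag = sqrt(v(:,:,1).^2 + v(:,:,2).^2);       % magnitude of the vector field
    faceStats = [mean(vmag(faceMask)), std(vmag(faceMask))];
    w = ones(size(vmag));                         % weight 1 for most body parts
    w(faceMask)         = 0.5;                    % de-emphasize the face
    w(bustWaistHipMask) = 2.0;                    % emphasize bust, waist and hips
    vw = vmag(bodyMask) .* w(bodyMask);           % weighted magnitudes over the body
    bodyStats = [mean(vw), std(vw)];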

Photometric Distortions

Besides geometric adjustments, photographers also edit photometric features to make images more stunning. Photo editors don't just make a face slimmer; they also blur freckles, remove blemishes and sharpen eyes. For our photometric model, we focus on the face, since photometric alterations there attract the most attention. The goal is to measure the many tricks one could apply while manipulating the image of a face. For example, using a sharpening filter on the eye region to make the eyes sparkle more, or a blurring filter to make the skin smoother (as shown in the following image; left: original image, right: altered image), is captured by the following photometric methods.

To detect these linear filters, we first divide the face into smaller patches in order to analyze local regions. In Kee and Farid's paper, two measures are used to evaluate the degree to which each patch has been altered: structural similarity (SSIM) and perceptual distortion (the D value). For both measures, only the luminance channel is used.


Structural Similarity (SSIM)

SSIM is a measure of how similar two images are, capturing both structural modifications and contrast changes between the two images. Following Wang et al. (reference 3), the local measure combines a contrast term and a structure term:

SSIM(x, y) = [(2 σ_x σ_y + C1) / (σ_x^2 + σ_y^2 + C1)] · [(σ_xy + C2) / (σ_x σ_y + C2)]

where

  • σ_x, σ_y and σ_xy are the standard deviations and the covariance of corresponding local regions from the two images, and C1 and C2 are stabilizing constants; we use the same constant values as Kee and Farid's paper.
  • As shown in the following image, the skin has been greatly smoothed, as indicated by large SSIM values.
  • For this algorithm we used MATLAB code from the research group in ECE at the University of Waterloo.
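
A minimal sketch of how the Waterloo code can be called on the luminance channel is shown below, assuming the standard two-argument call to ssim_index.m; the variable names geoFit, edited and faceMask are assumptions for this sketch.

    % Sketch: SSIM map between the geometrically fit image and the edited image.
    Yg = rgb2ycbcr(geoFit); Yg = double(Yg(:,:,1));    % luminance of the geometric fit
    Ye = rgb2ycbcr(edited); Ye = double(Ye(:,:,1));    % luminance of the edited photo
    [mssim, ssimMap] = ssim_index(Yg, Ye);             % global score and local SSIM map
    % The face mask is then applied to ssimMap before the facial statistics are taken.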

Perceptual Distortion

To determine the D values, we analyze the power change in the frequency domain.

  • We assume a linear filter h has been applied to each local region, either a smoothing filter (low-pass) or a sharpening filter (high-pass), so that locally f_a(x, y) ≈ h(x, y) * f_g(x, y) (a convolution), where f_g is the geometrically fit image and f_a is the altered image.
  • We compare the power change between local regions of the geometrically fit and altered images in the frequency domain; the D value is computed from the power of the estimated filter h.
  • It is important to note that the D value is not only a quantitative measurement of the filter; its sign also indicates the kind of filter. If D < 0, a sharpening filter has been applied; if D > 0, a blurring filter has been applied. Using the D values, our software can confirm that in the example image presented below, the skin regions have been smoothed with a blurring filter while the eye areas have been sharpened.
  • For our project, the local regions are 9-by-9 patches, the same local-region size used in Kee and Farid's work.

We implemented this algorithm ourselves as described by Kee and Farid and the source code is available in our appendices.
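
The exact D-value formula is given in Kee and Farid's paper and in our appendix code; as a rough stand-in (and explicitly not their formula), the following MATLAB fragment illustrates the idea for a single 9×9 patch: estimate the filter's magnitude response as a ratio of Fourier magnitudes and compare low- versus high-frequency power, so that blurring gives D > 0 and sharpening gives D < 0. The luminance images Yg and Ye carry over from the SSIM sketch above, and the patch location is made up for illustration.

    % Illustrative stand-in only (not the paper's formula), for one patch at (r, c).
    r = 101; c = 101;                                 % example patch location (hypothetical)
    pg = Yg(r:r+8, c:c+8);  pa = Ye(r:r+8, c:c+8);    % geo-fit and altered 9x9 patches
    H  = abs(fft2(pa)) ./ (abs(fft2(pg)) + eps);      % estimated |H| at each frequency
    fIdx = [0:4, -4:-1];                              % FFT frequency ordering for length 9
    [u, w] = meshgrid(fIdx, fIdx);
    rho = sqrt(u.^2 + w.^2);                          % distance from DC
    hi  = rho > 2;  lo = ~hi & rho > 0;               % crude high/low frequency split
    D   = mean(H(lo)) - mean(H(hi));                  % > 0: blurred, < 0: sharpened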

After we obtain these dense maps, we extract four statistics for the machine learning step: the mean and variance of both the SSIM and D values over the facial region.
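
In MATLAB terms this amounts to something like the following, where ssimFace and dFace (the map values restricted to the face mask) are assumed variable names.

    % The four photometric features fed to the machine learning step.
    photoFeats = [mean(ssimFace), var(ssimFace), mean(dFace), var(dFace)];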

Dataset Filtering

User Ratings

Prior to training our model, we went through all of the data, including the user ratings and the generated parameters, to check for outliers that might skew the model. For example, we would have liked to filter out users who rated an image '1' when the consensus was '4.' Due to a lack of time and manpower, we did not extensively pore over the numerous data entries we received. However, one volunteer submitted a form rating ALL images as '1'; this feedback was filtered out, as it was clear the user did not take the survey seriously.

Parameters

Due to potential implementation flaws in our algorithms and photomasks, or to the way certain images interacted with the algorithms, some of the parameter values produced were unusually large relative to the overall trend. These images were also filtered out to maintain consistency among the parameters, and so that parameter normalization would not be skewed by these large values.

In the graphs below, the sorted rating distribution and the corresponding predictions are shown first, followed by a plot of the eight parameters of the corresponding photos. In the pre-filtered selection, you can see very large and clear spikes marking our outlier parameter values. These photos are filtered out of our training on the next pass, and the result is shown in the plots on the right.

Beautification

In addition, our original set of photos included pictures that were not photoshopped to make the subject look nicer, but rather to demonstrate how much an image can be altered. To be consistent with Kee and Farid's paper, which focused only on beautification, we decided to filter out photos that 'uglified' their subjects (for example, where a subject was fattened or aged).

Supervised Learning

Following this processing of the data, all the parameters were normalized to the range [-1, 1], with the scaling done per parameter over all of the photos. These parameters were saved into a .mat file that could be loaded directly into the LIBSVM software, which was developed by Chih-Chung Chang and Chih-Jen Lin at National Taiwan University.
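
A sketch of this per-parameter rescaling is shown below; X (an nPhotos-by-8 matrix of raw parameters) is an assumed variable name, although project.mat does appear in our appendix.

    % Sketch: rescale each parameter (column of X) to the range [-1, 1].
    lo = min(X, [], 1);  hi = max(X, [], 1);
    Xn = 2 * (X - repmat(lo, size(X, 1), 1)) ./ repmat(hi - lo, size(X, 1), 1) - 1;
    save('project.mat', 'Xn');                      % later loaded by the training script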

Example of classification by a Support Vector Machine (SVM). Kernel parameters adjust how well the dividing line fits.

Classification and Regression

Kee and Farid did their training using only regression. Essentially, all of the user ratings for a single photo were averaged to produce a single value between 1 and 5, not necessarily an integer. Consequently, the model they trained could rate an image between integer values (e.g. 3.9, 4.1).

We followed this approach but, out of curiosity, also decided to train our model to perform classification, to see how it would compare with the regression model. In addition to saving the mean user rating for each photo, we also calculated the mode (the most commonly occurring rating) for each photo and used that as its label. With the classification approach, the trained model produces a whole integer rating for a photo.
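
Constructing the two kinds of labels is a one-liner each; ratings is again the assumed cell array of per-photo user ratings from the earlier sketch.

    % Regression labels: mean rating; classification labels: most common rating.
    regLabel   = cellfun(@mean, ratings);     % real-valued, between 1 and 5
    classLabel = cellfun(@mode, ratings);     % whole integer, 1 to 5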

Training

The models were trained using nu-SVC and nu-SVR, both with a Gaussian radial basis function kernel, and our results were obtained via leave-one-out cross-validation. That is, with 137 photos we would use 136 of them as the training set and validate the model on the remaining photo; this was repeated for every photo, and the classification accuracy and regression error were saved for each result. The overall predictions were then compared to the original user labels to obtain a correlation. To obtain a better model, we followed Kee and Farid and employed a grid search over the SVM cost and gamma parameters. The best correlation was obtained with cost = 21.5 and gamma = 0.736.
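
The regression half of this procedure looks roughly like the following with LIBSVM's MATLAB interface ('-s 4' selects nu-SVR and '-t 2' the RBF kernel); the feature and label variables carry over from the sketches above, and the exact option string we used may have differed.

    % Sketch: leave-one-out cross-validation of the nu-SVR model with LIBSVM.
    opts = '-s 4 -t 2 -c 21.5 -g 0.736';              % nu-SVR, RBF kernel, grid-searched c and g
    regLabel = regLabel(:);                           % LIBSVM expects a column of doubles
    pred = zeros(numel(regLabel), 1);
    for k = 1:numel(regLabel)
        trainIdx    = true(numel(regLabel), 1);
        trainIdx(k) = false;                          % hold out photo k
        model   = svmtrain(regLabel(trainIdx), Xn(trainIdx, :), opts);
        pred(k) = svmpredict(regLabel(k), Xn(k, :), model);
    end
    r = corr(pred, regLabel);                         % correlation with the user labels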

Results

Statistical Results

The classification and regression predictions from each cross-validation fold were compared to the original labels. Our model performed much better with regression than with classification, which is understandable given that using the mode to label photos may not be the best statistical choice. To compute the accuracy of classification, the number of correct predictions was divided by the total number of predictions. To compute the accuracy of the regression, the percentage error of each prediction from the original label was calculated, inverted, then averaged over all of the photos. The correlation coefficients were computed using MATLAB's corr(X,Y) function.
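
In MATLAB terms, these figures correspond to something like the following, where predClass and predReg are the cross-validated predictions (variable names assumed) and the 'inverted' percentage error is taken here to mean one minus the relative error.

    % Classification accuracy: fraction of exact matches with the mode labels.
    classAcc = mean(predClass(:) == classLabel(:));
    % Regression accuracy: average of (1 - relative error) against the mean labels.
    regAcc   = mean(1 - abs(predReg(:) - regLabel(:)) ./ regLabel(:));
    % Correlation coefficients against the user labels.
    classCorr = corr(predClass(:), classLabel(:));
    regCorr   = corr(predReg(:),   regLabel(:));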

Classification

  • Accuracy: 0.378
  • Correlation: 0.336

Regression

  • Accuracy: 0.824
  • Correlation: 0.652

While accuracy is an interesting value to look at, it is not necessarily very representative of the data because of the averaging that was done. The most important value is really the correlation, that is, how well our results fit the original data. Kee and Farid obtained an R-value of 0.80 for their regression results, which we fall short of.

Image Results

Here are some examples and comparisons of how our model's predictions compare with user ratings:

C and R indicate classification and regression results, respectively. The numbers in black are aggregate user ratings, whereas the red numbers are predictions from our model.

Our model was fairly spot-on for the first two selected photos, of Kim Kardashian and Megan Fox. We selected the photo of Kim Kardashian in particular because it was one of the photos in our possession that Kee and Farid had also rated. It is of interest that their photo was rated 3.00 by humans and 2.99 by their model, both of which are very close numerically to our own user rating and model prediction.

However, for the next three photos there were deviations of at least 0.8. Based on these results, as well as the earlier correlation plots, our model has a tendency to predict values between 2 and 3. For photos that vary very little, or that change in a manner significant to the human eye but not to geometric fitting (such as the last photo), our model has difficulty predicting the large difference. These differences may be the result of the algorithms producing poorly matching parameters, or of bad user data skewing another image that shares very similar parameters.

Conclusions

After working on this project, we have a new appreciation of how difficult it is to model human perception. A lot of work goes into accurately modelling image editing and acquiring good data. It is, however, possible, and we believe we learned how to implement it.

We had a lot of success with the algorithms we implemented. Some algorithms were taken from the cited sources, but we were very happy with what we were able to develop ourselves. It took a lot of work, but we believed it was important to understand these core methods, and we learned much more through development than we would have otherwise.

If we had one letdown on this project, it was that our fit could have been tighter given more time. We believe we would need to carefully collect many more images and recollect data from users; we may not have had the best spread or quantity of distortions. We also worry that, in our data collection, we were not careful enough to instruct users on the rules and to acclimate them to the range of images we were presenting.

With more time, we would definitely work on expanding our data set, perhaps seeking funding to pay testers so that we could motivate them to be as meticulous as possible in carrying out the tests. We would also like to automate much of the image preparation: we would investigate face recognition and edge detection to select body parts and faces for masking. Even if these required user intervention to confirm a selection, it would go a long way toward making the use of this software pain-free.

References

  1. Kee, E. and Farid, H., "A Perceptual Metric for Photo Retouching," Proceedings of the National Academy of Sciences, 2011.
  2. Chang, C.-C. and Lin, C.-J., "LIBSVM: A Library for Support Vector Machines," ACM Transactions on Intelligent Systems and Technology, 2:27:1-27:27, 2011.
  3. Wang, Z., Bovik, A. C., Sheikh, H. R. and Simoncelli, E. P., "Image Quality Assessment: From Error Visibility to Structural Similarity," IEEE Transactions on Image Processing, vol. 13, no. 4, pp. 600-612, Apr. 2004.
  4. Periaswamy, S. and Farid, H., "Medical Image Registration with Partial Data," Medical Image Analysis, vol. 10, no. 3, pp. 452-464, June 2006.

Appendix I

File:GellineauHuangLee.zip

  • Images Folder (Before, After, Masks)
  • GUI Files (PM_Main.fig, ImageEvaluate.fig)
  • Datasets (project.mat, statistics.mat, PM.xlsx)
  • Libsvm Files (svmpredict.mexw32, svmtrain.mexw32)
  • calculateD.m
  • DataFilter.m
  • DataVisualize.m
  • ELA.m
  • filesort.m
  • ftp_psych.m
  • ImageEvaluate.m
  • initialStruct.m
  • PM_Main.m
  • ReadLabelData.m
  • ReadParamData.m
  • rescaleImage.m
  • ssim.m
  • ssim_index.m
  • svm.m
  • testgeo.m
  • testsvm.m
  • Photo Metric Presentation.pptx
  • GUI User Guide (PM GUI.docx)

Appendix II

Our group met frequently to discuss the direction and requirements of our project. As a result, we assigned the following tasks to ourselves:

  • Antonio: Geometric Algorithm, website design, Main GUI, Masking in Photoshop
  • Justin: Image Acquisition, SVR, SSIM, Data Parsing, Large share of Masking in Photoshop
  • Chun-Wei: D-Value Algorithm, Additional GUIs for testing D-Values and ratings, Image Evaluation, Masking in Photoshop

We would like to note that problems came up at almost every level, and we worked together on most of them to get through it all. We also spent a lot of time together preparing and testing our code. It took half a day to analyze all 137 images, split between 4 workstations.