CollinsZvinakisDanowitz

From Psych 221 Image Systems Engineering
Revision as of 06:58, 22 March 2012

Implementation and analysis of a perceptual metric for photo retouching

Introduction

Retouched images are everywhere today. Magazine covers feature impossibly fit and blemish-free models, and advertisements frequently show people too thin to be real. While some of these alterations could be considered comical, an increasing number of studies show that these pictures lead to low self-image and other mental health problems for many of those who view them. To help address this problem, lawmakers in several countries, including France and the UK, have proposed legislation that would require publishers to label any severely retouched images, and over the last few days, Israel has passed the first law to require labels for retouched images (in this case, for thinning the model).

Legislation requiring the labeling of modified images raises a number of issues. First, how do we define “severely retouched”? Nearly all published images are modified in some way, whether through basic cropping or color adjustments or more significant alterations. Which, if any, of these changes are acceptable? Second, a huge number of photographs are published every day. How can they all be analyzed for retouching in a timely, cost-effective manner?

In their 2011 paper “A perceptual metric for photo retouching,” Kee and Farid proposed a perceptual photo rating scheme to solve these problems. With their method, an algorithm would analyze the original and retouched versions of an image to determine the extent of the geometric (e.g., stretching, warping) and photometric (e.g., blurring, sharpening) changes made to the original. The results of this analysis would be compared to a database of human-rated altered images to automatically assign a perceptual modification score between 1 (“very similar”) and 5 (“very different”). This scheme, intended to deliver an objective measure of perceptual modification with minimal human involvement, would allow authorities or publishers to define a threshold for a “severely retouched” image and label them accordingly.

This project is largely an effort to reproduce the results from the Kee and Farid paper. Accordingly, the algorithm and methods described in the paper have been implemented and tested on a set of images. The rest of this report describes the implementation process, discusses the results of applying the algorithm to a set of retouched images, and suggests changes that could improve the algorithm's effectiveness and practicality.

Methods

The retouching algorithm described in the Farid paper consists of a number of automatic and manual steps, culminating in the generation of a predicted score for the significance of alterations made to the human subject(s) in each image. Each image is first analyzed for geometric changes and transformations. These geometric changes are processed to produce a series of four statistics, two describing the changes to faces in the image and two describing the changes to bodies. Four metrics are then gathered to describe photometric changes to faces in the image, using the SSIM measure of local image similarity and by determining the extent of local blurring and sharpening in the image.

Once gathered, the eight statistics corresponding to each image are mapped to a predicted user rating using nonlinear support vector regression (SVR). The SVR models used in this process are in turn trained on the statistics and corresponding user ratings for a set of test images.

Datasets

Before/after image pairs and face/hair/torso masks

To provide a perceptual modification rating, the retouching algorithm requires both the original and retouched versions of an image. In addition, as will be discussed in later sections, the algorithm requires a large set of such image pairs to derive the necessary models. The Kee and Farid paper describes the use of 468 such image pairs for the original development and testing of the algorithm. These images were collected from a variety of online sources, primarily the websites of image retouching services. The retouched images ranged from minor to severe in the extent of retouching present (a statistical analysis of the image ratings is in the results section).
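As an illustration of this mapping step, the following Python sketch trains an RBF-kernel epsilon-SVR on synthetic statistics. Here scikit-learn stands in for the libSVM package we actually used, and the feature values, ratings, and hyperparameters are all made up for the demonstration:

```python
# Illustrative sketch: mapping eight summary statistics per image to a
# predicted rating with epsilon-SVR (scikit-learn stands in for libSVM;
# the feature values and ratings are synthetic, not from the real image set).
import numpy as np
from sklearn.svm import SVR
from sklearn.preprocessing import StandardScaler
from sklearn.pipeline import make_pipeline

rng = np.random.default_rng(0)
X = rng.random((100, 8))        # 8 geometric/photometric stats per image
y = 1 + 4 * X.mean(axis=1)      # stand-in observer ratings on the 1-5 scale

# Standardize features, then fit a nonlinear (RBF) support vector regression.
model = make_pipeline(StandardScaler(), SVR(kernel="rbf", C=10.0, epsilon=0.1))
model.fit(X, y)
pred = model.predict(X[:5])     # predicted ratings for five images
print(pred.shape)               # (5,)
```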

To aid in the implementation process, the original set of 468 image pairs used in the Farid paper was acquired directly from the author. In addition, Farid provided the face, hair, and torso masks for the subjects of each image pair, the necessity of which will be discussed in further sections.

Observer ratings on retouching severity

The supplementary resources included with Kee & Farid’s 2011 paper contain a data set with Amazon Mechanical Turk observer ratings and predicted ratings generated by their algorithm for each of the 468 before/after images in the photo set (the appendix contains more details about this data set). Unfortunately, the numbering for these image ratings did not match the numbering of the images in the photo set provided by Kee and Farid, so there was no direct way to correlate our data to the image ratings from the 2011 paper. Hours before the submission deadline for this report, however, Prof. Farid provided us with an updated CSV corresponding to the data in the paper, allowing us to redo our analysis at the last minute to incorporate this data set.

Before receiving the updated data, we generated observer ratings by rating 290 of the before/after images ourselves. Each group member rated the images on a scale of 1 (“very similar”) to 5 (“very different”) in the order they appeared in the photo set provided to us. The presentation of images was self-timed: we allowed ourselves to manually toggle between before and after images as many times as desired. We rated only a subset of the before/after images (290 of the 468 pairs in the provided photo set) because, at the time of rating, we had only finished generating statistics for this portion of the images. Professor Farid’s new data, however, contained 50 ratings for each image, gathered through Amazon’s Mechanical Turk.

Geometric model

Geometric changes affect the overall shape of the subject. Examples of geometric modifications include slimming of legs, hips, and arms, elongating the neck, improving posture, enlarging the eyes, and making faces more symmetric.

The algorithm models geometric distortion between the before and after images using the 8-parameter local fit shown in equation 1. Here, the terms b and c account for basic brightness and contrast changes between the before and after images, while the affine parameters m1 through m4 and the translation terms tx and ty determine how each pixel has been warped and translated in the x and y directions.

c · f_after(x, y) + b = f_before(m1·x + m2·y + tx, m3·x + m4·y + ty)        (1)
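Given a fitted set of parameters, the per-pixel displacement implied by the affine part of equation 1 can be sketched as follows. This is a Python/NumPy illustration with made-up parameter values; the actual fitting and warping were done with Kee & Farid's Matlab code:

```python
# Sketch of the displacement field implied by a local affine fit: each pixel
# (x, y) maps to (m1*x + m2*y + tx, m3*x + m4*y + ty), and the vector field
# is the difference from the original location. Parameters are illustrative.
import numpy as np

def displacement_field(shape, m1, m2, m3, m4, tx, ty):
    ys, xs = np.mgrid[0:shape[0], 0:shape[1]].astype(float)
    dx = (m1 * xs + m2 * ys + tx) - xs
    dy = (m3 * xs + m4 * ys + ty) - ys
    return dx, dy

# A mild horizontal stretch plus a sub-pixel translation on a 4x4 patch.
dx, dy = displacement_field((4, 4), m1=1.05, m2=0.0, m3=0.0, m4=1.0,
                            tx=-0.5, ty=0.2)
magnitude = np.hypot(dx, dy)    # per-pixel geometric change
print(round(magnitude.mean(), 3))
```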

Once image registration is complete, we can take the results of the affine fit and compute a vector field describing how much the geometry of the after image has been altered. This vector field, when computed over the face and body, provides the basis for determining how much an image has been geometrically altered.

Implementation

We used Kee & Farid's image registration code to find the affine warp. We used custom Matlab code to compute the two-dimensional vector field showing how the image has been geometrically retouched.

Photometric model

Photometric alterations affect skin tone and texture: smoothing, sharpening, or other operations that remove or reduce wrinkles, cellulite, blemishes, freckles, and dark circles under the eyes.

Photometric changes are modeled with a local linear filter and a generic measure of local image similarity (SSIM).

The local linear filter is used to determine the extent of the sharpening or blurring that occurs over a small region of the image. This computation is done by estimating a matrix “H” that transforms a portion of the registered, unaltered image into the corresponding part of the altered image, as shown in equation 2. Since different parts of the image may be sharpened and blurred independently, H must be computed on local regions only to determine the extent of photometric image alterations. For our reimplementation of the algorithm, we analyzed 4x4 pixel blocks of the images.

F_after = H · F_before        (2)

We then determined the frequency response of each of the local filters H using equation 3. If the result D is positive, the portion of the image corresponding to that H has been blurred; if it is negative, that part of the image has been sharpened.

D = Σ_ω [ F_before(ω) − H(ω) · F_before(ω) ]        (3)
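To make equations 2 and 3 concrete, here is a small Python sketch that estimates a local linear filter by Tikhonov-regularized least squares and checks the sign of D. The 3x3 kernel, the synthetic images, and the regularization weight are assumptions for the demo; our actual implementation used CVX on 4x4 blocks of the real images:

```python
# Illustrative sketch of equations 2 and 3: estimate a small linear filter h
# mapping the before image to the after image via Tikhonov-regularized least
# squares, then use its frequency response to decide blur vs. sharpen.
import numpy as np

rng = np.random.default_rng(1)
before = rng.random((32, 32))

# Synthetic "after" image: a 3x3 box blur of the before image.
true_h = np.full((3, 3), 1.0 / 9.0)
pad = np.pad(before, 1, mode="edge")
after = sum(true_h[i, j] * pad[i:i+32, j:j+32]
            for i in range(3) for j in range(3))

# Build the least-squares system A h = b over all pixels, plus Tikhonov term.
A = np.stack([pad[i:i+32, j:j+32].ravel()
              for i in range(3) for j in range(3)], axis=1)
lam = 1e-3
h = np.linalg.solve(A.T @ A + lam * np.eye(9),
                    A.T @ after.ravel()).reshape(3, 3)

# Equation 3: compare the before spectrum with the filtered spectrum. A blur
# attenuates high frequencies, so D comes out positive.
H = np.abs(np.fft.fft2(h, (32, 32)))
F = np.abs(np.fft.fft2(before))
D = np.mean(F - H * F)
print(D > 0)  # True for a blurring filter
```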

We also used a generic measure of local image similarity (SSIM) to detect photometric modifications not captured by equation 3. The SSIM measure captures contrast and structural modifications, as shown in equation 4.

C(x, y) = c(x, y) · s(x, y)        (4)

where

c(x, y) = (2·σ_a·σ_b + C2) / (σ_a² + σ_b² + C2)
s(x, y) = (σ_ab + C3) / (σ_a·σ_b + C3)

in which μ_a, σ_a and μ_b, σ_b are the means and standard deviations of the after-image region and the registered before-image region, respectively, and σ_ab is the covariance of these two image regions. We used the standard constants when calculating SSIM (Wang 2003) [4]: an exponent of 1 on each term, C2 = (0.03)², and C3 = C2/2.
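The contrast and structure terms can be sketched directly in Python for a single pair of aligned regions. The window size and the synthetic patches are assumptions here; Zhou Wang's ssim_index.m computes the same quantities over a sliding Gaussian window:

```python
# Minimal sketch of the contrast-structure similarity in equation 4 for one
# pair of aligned regions (synthetic data; values assumed to lie in [0, 1]).
import numpy as np

C2 = 0.03 ** 2
C3 = C2 / 2

def contrast_structure(a, b):
    sa, sb = a.std(), b.std()
    sab = ((a - a.mean()) * (b - b.mean())).mean()   # covariance
    c = (2 * sa * sb + C2) / (sa**2 + sb**2 + C2)    # contrast term
    s = (sab + C3) / (sa * sb + C3)                  # structure term
    return c * s

rng = np.random.default_rng(2)
region = rng.random((8, 8))
print(round(contrast_structure(region, region), 3))        # identical: 1.0
print(contrast_structure(region, 0.3 * region + 0.5) < 1)  # contrast-reduced: < 1
```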

Implementation

For the implementation of the local linear filters, we used Stephen Boyd’s CVX code [2] to solve for H with a Tikhonov regression on 4x4 pixel blocks of the images, and wrote custom code to determine the frequency response of each local filter. We used Zhou Wang’s SSIM code [3] to produce an SSIM index map of the image. Since the generated SSIM map is smaller than the image, we wrote custom code to remove the padding in the SSIM map so that the regions in the SSIM map and the image would line up. All photometric computations were performed on the luminance channel of the before and after images, since Kee & Farid found that brightness did not impact observer ratings.

Results - What you found

Caption: we compared the means of our ratings for each before/after image to the ratings obtained for those same images in Kee & Farid’s (2011) study.


Caption: A nonlinear SVR was used to correlate the summary statistics with predicted user ratings. The SVR model was trained and tested on the same image set, with parameters determined using 5-fold cross validation.

Caption: A nonlinear SVR was used to correlate the summary statistics with predicted user ratings. The SVR model was trained and tested on separate but equally sized image subsets, with parameters determined using 5-fold cross validation on the training subset.

Caption: A nonlinear SVR was used to correlate the four photometric statistics with predicted user ratings. The SVR model was trained and tested on separate but equally sized image subsets, with parameters determined using 5-fold cross validation on the training subset. Several out of range values were discarded.

Caption: A nonlinear SVR was used to correlate the four geometric statistics with predicted user ratings. The SVR model was trained and tested on separate but equally sized image subsets, with parameters determined using 5-fold cross validation on the training subset. Several out of range values were discarded.

Conclusions

While we were able to reimplement Kee and Farid’s algorithm, we encountered a number of problems that limited the scope of our results. First, and most significant, up until a few hours before the project deadline, we were unable to correlate the images provided by Farid with the predicted and observed ratings referenced in their paper. While we rushed to incorporate Kee and Farid’s revised data as soon as it became available, the extreme time constraints on the analysis left us unable to fully debug our results.

Before receiving Prof. Farid’s revised data, we decided to obtain new observer ratings by having each group member rate the images on the same 1-to-5 rating scale mentioned in the paper and averaging our ratings together.

We were initially unable to correlate the images in the photo set to the rating scores in the paper's supplementary resources because the image numbering in the photo set did not match the numbering in the data set of ratings. After looking up the photo set numbers for the ten images in Figure 4 of the paper, we found that the observer/predicted values in the .csv file for these image numbers differed from the reported observer/predicted ratings in Figure 4. Surprisingly, at least one set of measurements from Figure 4 appears to be completely absent from Kee and Farid’s original results spreadsheet.

The experience of rating each before/after image allowed us to more deeply reflect on what factors contributed to our decision to rate an image a certain way. We also noticed retouching trends across the set of images that we rated: retouching differences based on gender (women tended to be softened, whereas men tended to be sharpened), retouching differences based on skin tone (light skin tended to be made lighter, whereas dark skin tended to be made darker), and common focus areas (eyes, bust, waist, hair). Rating images ourselves made us more aware of our decision-making process and more familiar with common retouching procedures, which helped us identify new ways to improve the existing algorithm, discussed later.

We also ran into a number of issues with the implementation of the geometric transformation script. The most pressing issue for us was the length of time it took--at least 3-4 hours--to complete running on a pair of before/after images. While we tried to solve this problem by batching all of our analysis jobs out to Farmshare, this process turned out to be unreliable, with jobs sometimes staying queued for over 24 hours. We dealt with this issue by creating scripts to process a few images in a row, connecting to multiple Corn machines, and manually running each of these scripts on each machine we connected to. This setup made the image processing stage run much faster, since the scripts were run immediately, and it allowed us to reliably monitor the progress of each image.

The next problem we ran into was that the geometric transformation code provided by Professor Farid initially suffered from several bugs. First, out of the box, the code did not return all of the parameters of the local affine fit, requiring a moderate debugging effort. Second, the code broke on images with unequal heights and widths, forcing us to resize our sample images before running them through image registration. While this requirement was simple to meet, the other bugs in the code, the fact that the requirement is never mentioned in the code’s documentation, and the results of our regression analysis (which indicate that the geometric statistics appear to be invalid) lead us to believe that there are additional bugs in the image registration code.

Another problem we encountered was the slow speed of running the regression analysis using the leave-one-out cross-validation (LOOCV) procedure that Kee & Farid used in their 2011 paper. In order to quickly assess whether our generated statistics could produce valid predictions, we ran our regression analyses using all images in both the training and testing sets. Since this produced an overfitted model that may not generalize, our regression analysis could have been improved by using the LOOCV technique instead.
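For reference, the LOOCV procedure we would have liked to run can be sketched as follows, with scikit-learn standing in for libSVM and synthetic features standing in for our real statistics:

```python
# Sketch of leave-one-out cross-validation: each image's rating is predicted
# by a model trained on all the other images (synthetic features and ratings;
# scikit-learn stands in for libSVM).
import numpy as np
from sklearn.svm import SVR
from sklearn.model_selection import LeaveOneOut

rng = np.random.default_rng(3)
X = rng.random((30, 8))          # 8 summary statistics for 30 images
y = 1 + 4 * X.mean(axis=1)       # stand-in observer ratings

preds = np.empty(len(y))
for train, test in LeaveOneOut().split(X):
    model = SVR(kernel="rbf", C=10.0, epsilon=0.1).fit(X[train], y[train])
    preds[test] = model.predict(X[test])

corr = np.corrcoef(preds, y)[0, 1]   # out-of-sample correlation with ratings
print(corr > 0)
```

Unlike training and testing on the same images, every prediction here is out of sample, so the resulting correlation is an honest estimate of generalization.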

Future work

While working with the algorithm, we also identified a few key areas where the algorithm could be improved. First, given our issues in running all of the jobs, we believe that optimizing the algorithm for run-time should be a high priority. If we shifted away from Matlab, and instead implemented this algorithm in C, we could most likely realize substantial performance gains. Algorithm run-time could probably also be improved by culling the image background from the photos before the image registration step.

The second major area for improvement would be to automate face and body segmentation. Currently, the algorithm relies on a user manually masking out the face and hair regions to compute the required image statistics. Automating this process by using standard face and person detection algorithms could greatly decrease the effort required to rate each image.

We also have several suggestions for improving the quality of the algorithm’s results. First, as future work, we’d like to incorporate modifications to extremities with our results. For example, in the before/after image 205, Britney Spears’ calves are lengthened, creating a substantial perceptual change that is not picked up by the current algorithm.

Also, it is important to recognize that some perceptual distortions are more noticeable than others. Specifically, modifications to the eyes and mouth, such as filling in missing teeth, can result in huge changes to the human-provided perceptual scores but only minor changes in the algorithmically predicted results. We propose breaking the face down into multiple regions, including the eyes, nose, and mouth, and giving added weight to these regions to bring the algorithmic estimates more in line with the realities of human perception.

Next, we recognize that removing some minor freckles and blemishes probably does not have much of a perceptual effect on the image, and should not drastically increase an image’s perceived score. Therefore, we propose adding a tunable minimum threshold below which any photometric and geometric distortions will be ignored.
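The proposed threshold could be as simple as zeroing sub-threshold distortion values before the summary statistics are computed; a minimal sketch (the field values and the threshold value are arbitrary illustrations):

```python
# Sketch of the proposed minimum-distortion threshold: per-pixel distortion
# magnitudes below a tunable cutoff are zeroed so that minor blemish removal
# does not inflate the summary statistics.
import numpy as np

def apply_min_threshold(distortion, threshold):
    out = distortion.copy()
    out[np.abs(out) < threshold] = 0.0
    return out

field = np.array([0.02, 0.8, 0.05, 1.3, 0.0])
print(apply_min_threshold(field, threshold=0.1))  # small edits ignored
```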

Finally, we propose to improve the algorithm by adding statistical measurements for photometric modifications done to the body, which the algorithm currently does not detect. Photometric modifications done to the body, such as blurring cellulite out of the body, can result in a visible perceptual effect on the image, so adding these statistics could potentially improve the estimates of the algorithm.

References - Resources and related work

[1] Farid’s image transformation MATLAB code: http://www.cs.dartmouth.edu/farid/Hany_Farid/Research/Entries/2011/5/17_Image_Registration.html
[2] CVX: http://cvxr.com/cvx/
[3] Zhou Wang’s SSIM code: https://ece.uwaterloo.ca/~z70wang/research/ssim/
[4] Paper describing the implementation (and SSIM in general): http://www.cns.nyu.edu/pub/eero/wang03-reprint.pdf
[5] libSVM: http://www.csie.ntu.edu.tw/~cjlin/libsvm/
[6] Parameter selection in MATLAB: http://www.csie.ntu.edu.tw/~cjlin/libsvm/faq.html#f803

John Brunhaver wrote the two main functions in photo_batch.

Appendix I - Code and Data

Data

User ratings recorded by group members for subsets of the images

File:Farid ratings.zip: User ratings data graciously provided by Prof Farid. This set differs from that provided with the publication in that the numbering matches the author's photo sets.

Amazon Mechanical Turk observer ratings on retouching severity

Photoset: http://www.cs.dartmouth.edu/farid/downloads/publications/pnas11/beforeafter.tar

Corresponding Masks: http://www.cs.dartmouth.edu/farid/downloads/publications/pnas11/masks.tar

Fifty observer ratings for each of the 468 before/after images used in Kee & Farid’s 2011 paper were acquired from the supplementary resources for the research paper. The authors gathered the user data using the process and observers described below:

Task: Each observer session lasted approximately 30 minutes and was structured as follows:

1. Each participant was initially shown a representative set of 20 before/after images to help them gauge the range of distortions that they could expect to see.

2. Each participant was then asked to rate 70 pairs of before/after images on a scale of 1 (“very similar”) to 5 (“very different”). The presentation of images was self-timed; participants could manually toggle between before and after images as many times as they wished. Each observer rated a random set of 5 images 3 times each to measure the consistency of their responses.

Observers: These ratings were provided by 390 observers recruited through Amazon’s Mechanical Turk. Each observer was paid $3 for their participation in the session. 9.5% of observers were excluded because they responded with high variance on repeated trials and frequently toggled only once between before and after images, suggesting a lack of consistency or seriousness in their rating process.

Code

zip file with custom code

This .zip contains several files, including:

photo_batch.pl: Perl script used to batch out statistics-gathering jobs to multiple machines in the Farmshare cluster. Modify/run this script to gather stats.

image_farm.m: Master matlab code modified by photo_batch.pl to do cluster statistics gathering

photometric.m, stats.m, vfield.m: function files called by image_farm.m or run_in_serial...m.

jsub: executable necessary for the Farmshare cluster

run_in_serial_331_340.m: serial adaptation of image_farm.m

prepAndRunSVM_revised....m - different variants of code to run SVR on the gathered stats. See files for differences.

Other code needed (ssim_index.m, CVX, image registration code, libSVM) cited in the report.

Appendix II - Work partition

Much of the work for this project was performed cooperatively, with all three group members meeting together frequently to discuss and explore the algorithm and its implementation. However, each member focused on different aspects of the project. Andrew Danowitz led much of the early code exploration and did most of the implementation for the photometric (filter and SSIM) components. He also pieced together the different implementation components and set up the ability to submit much of the computational work to the Farmshare cluster. Andrew contributed a great deal to the report and presentation slides as well.

In addition to taking part in the collaborative aspects of the project, Andrea Zvinakis wrote much of the report and presentation slides. She also performed statistical analysis and, when the Farmshare cluster was unable to support our workload, was responsible for running the summary-statistics-generating code on hundreds of images on other computers. Andrea also set up the means for the group to rate certain images from the photo set.

Taking part in the collaborative aspects of the project as well, Bradley Collins also made small contributions to the report and presentation slides. Bradley explored or implemented several parts of the geometric component of the algorithm, helped adapt the summary-statistics-generating code to run jobs serially on campus computers, and used that code to gather many of the image statistics. In addition, Bradley was responsible for implementing and running most of the SVR analysis.