CollinsZvinakisDanowitz
Revision as of 06:57, 22 March 2012
Implementation and analysis of a perceptual metric for photo retouching
Introduction
Retouched images are everywhere today. Magazine covers feature impossibly fit and blemish-free models, and advertisements frequently show people too thin to be real. While some of these alterations could be considered comical, an increasing number of studies show that these pictures lead to low self-image and other mental health problems for many of those who view them. To help address this problem, lawmakers in several countries, including France and the UK, have proposed legislation that would require publishers to label any severely retouched images, and over the last few days, Israel has passed the first law requiring labels for retouched images (in this case, images in which the model has been thinned).
Legislation requiring the labeling of modified images raises a number of issues. Namely, how do we define “severely retouched”? Nearly all published images are modified in some way, whether through basic cropping or color adjustments or more significant alterations. Which, if any, of these changes are acceptable? The second problem is that there are a huge number of photographs published every day. How can they all be analyzed for retouching in a timely, cost-effective manner?
In their 2011 paper “A perceptual metric for photo retouching,” Kee and Farid proposed a perceptual photo rating scheme to solve these problems. With their method, an algorithm would analyze the original and retouched versions of an image to determine the extent of the geometric (e.g., stretching, warping) and photometric (e.g., blurring, sharpening) changes made to the original. The results of this analysis would be compared to a database of human-rated altered images to automatically assign a perceptual modification score between 1 (“very similar”) and 5 (“very different”). This scheme, intended to deliver an objective measure of perceptual modification with minimal human involvement, would allow authorities or publishers to define a threshold for a “severely retouched” image and label them accordingly.
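The labeling step that this scheme enables can be illustrated with a small sketch (in Python; the threshold value and function name here are hypothetical, since neither the paper nor any legislation fixes a cutoff):

```python
# Minimal sketch of the labeling step described above. The cutoff is an
# assumed example value; a publisher or regulator would choose their own.
SEVERE_THRESHOLD = 3.5

def label_image(predicted_score: float, threshold: float = SEVERE_THRESHOLD) -> str:
    """Map a 1-5 perceptual modification score to a publishing label."""
    if not 1.0 <= predicted_score <= 5.0:
        raise ValueError("score must lie on the 1 (very similar) to 5 (very different) scale")
    return "severely retouched" if predicted_score >= threshold else "unlabeled"

print(label_image(4.2))  # severely retouched
print(label_image(1.3))  # unlabeled
```

Authorities or publishers would tune the threshold to match their own definition of "severely retouched."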
This project is largely an effort to reproduce the results of the Kee and Farid paper. Accordingly, we implemented the algorithm and methods described in the paper and tested them on a set of images. The rest of this report describes the implementation process, discusses the results of applying the algorithm to a set of retouched images, and suggests changes that could improve the algorithm’s effectiveness and practicality.
Methods
Results
Caption: We compared the means of our ratings for each before/after image to the ratings obtained for those same images in Kee & Farid’s (2011) study.
Caption: A nonlinear SVR was used to correlate the summary statistics with predicted user ratings. The SVR model was trained and tested on the same image set, with parameters determined using 5-fold cross validation.
Caption: A nonlinear SVR was used to correlate the summary statistics with predicted user ratings. The SVR model was trained and tested on separate but equally sized image subsets, with parameters determined using 5-fold cross validation on the training subset.
Caption: A nonlinear SVR was used to correlate the four photometric statistics with predicted user ratings. The SVR model was trained and tested on separate but equally sized image subsets, with parameters determined using 5-fold cross validation on the training subset. Several out of range values were discarded.
Caption: A nonlinear SVR was used to correlate the four geometric statistics with predicted user ratings. The SVR model was trained and tested on separate but equally sized image subsets, with parameters determined using 5-fold cross validation on the training subset. Several out of range values were discarded.
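The evaluation protocol described in these captions can be sketched as follows. This is an illustrative Python version of the data splitting only; the actual SVR fitting was done with libSVM, and the function names here are ours:

```python
import random

def five_fold_indices(n_images, seed=0):
    """Shuffle image indices and deal them into 5 disjoint folds
    (used for cross-validated parameter selection)."""
    idx = list(range(n_images))
    random.Random(seed).shuffle(idx)
    return [idx[i::5] for i in range(5)]

def half_split(n_images, seed=0):
    """Split image indices into two equally sized train/test subsets."""
    idx = list(range(n_images))
    random.Random(seed).shuffle(idx)
    return idx[:n_images // 2], idx[n_images // 2:]
```

In the train/test configuration, the SVR parameters are chosen by 5-fold cross-validation inside the training half only, and the held-out half is scored once.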
Conclusions
While we were able to reimplement Kee and Farid’s algorithm, we encountered a number of problems that limited the scope of our results. First, and most significantly, until a few hours before the project deadline we were unable to correlate the images provided by Farid with the predicted and observed ratings referenced in their paper. While we rushed to incorporate Kee and Farid’s revised data as soon as it became available, the extreme time constraints on the analysis left us unable to fully debug our results.
Before receiving Prof. Farid’s revised data, we decided to obtain new observer ratings by having each group member rate each image on the same 1-to-5 scale described in the paper and averaging our ratings together.
We were initially unable to correlate the images in the photo set with the rating scores in the paper’s supplemental resources because the image numbering in the photo set did not match the numbering in the ratings data set. After looking up the photo-set numbers for the ten images in Figure 4 of the paper, we found that the observer/predicted values in the .csv file for these image numbers differed from the observer/predicted ratings reported in Figure 4. Surprisingly, at least one set of measurements from Figure 4 appears to be completely absent from Kee and Farid’s original results spreadsheet.
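The renumbering fix reduces to building a lookup table between the two numbering schemes. A minimal Python sketch, with entirely hypothetical id pairs and ratings standing in for the real spreadsheet:

```python
import csv
import io

# Hypothetical correspondence between photo-set numbers and the numbering
# used in the ratings spreadsheet; the real mapping had to be found by hand.
PHOTO_TO_CSV = {205: 17, 101: 42}

# Stand-in for the ratings .csv; values here are invented for illustration.
ratings_csv = io.StringIO("csv_id,observed,predicted\n17,4.1,3.8\n42,2.0,2.3\n")

ratings = {int(r["csv_id"]): (float(r["observed"]), float(r["predicted"]))
           for r in csv.DictReader(ratings_csv)}

def lookup(photo_id):
    """Return (observed, predicted) ratings for a photo-set image number."""
    return ratings[PHOTO_TO_CSV[photo_id]]

print(lookup(205))  # (4.1, 3.8)
```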
The experience of rating each before/after image allowed us to reflect more deeply on which factors contributed to our decision to rate an image a certain way. We also noticed retouching trends across the set of images that we rated: differences based on gender (women tended to be softened, whereas men tended to be sharpened), differences based on skin tone (light skin tended to be made lighter, whereas dark skin tended to be made darker), and common focus areas (eyes, bust, waist, hair). The awareness of our decision-making process and the increased familiarity with common retouching procedures that came from rating images ourselves allowed us to identify new ways in which the existing algorithm could be improved, which are discussed later.
We also ran into a number of issues with the implementation of the geometric transformation script. The most pressing was the length of time it took, at least 3-4 hours, to finish running on a single pair of before/after images. We first tried to solve this problem by submitting all of our analysis as batch jobs to Farmshare, but this process turned out to be unreliable, with jobs sometimes staying queued for over 24 hours. We instead dealt with the issue by creating scripts that each process a few images in a row, connecting to multiple Corn machines, and manually running one of these scripts on each machine we connected to. This setup made the image processing stage run much faster, since the scripts started immediately, and it allowed us to reliably monitor the progress of each image.
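The job splitting we did by hand amounts to chunking the image list, one chunk per machine. A Python sketch (our actual scripts were Perl and MATLAB; the names here are illustrative):

```python
def chunk(images, n_machines):
    """Divide a list of image ids into contiguous chunks, one per machine.
    Each chunk would become one serial script run on a separate Corn host."""
    size = -(-len(images) // n_machines)  # ceiling division
    return [images[i:i + size] for i in range(0, len(images), size)]

jobs = chunk(list(range(1, 21)), 4)
print(jobs[0])  # [1, 2, 3, 4, 5]
```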
The next problem we ran into was that the geometric transformation code provided by Professor Farid initially suffered from several bugs. First, out of the box, the code did not return all of the parameters of the local affine fit, requiring a moderate debugging effort. Second, the code broke on images with unequal heights and widths, forcing us to resize our sample images before running them through image registration. While this requirement was simple to meet, the fact that it is never mentioned in the code’s documentation, combined with the other bugs and with regression results indicating that the geometric statistics are invalid, leads us to believe that there are additional bugs in the image registration code.
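Equalizing height and width can be done either by resizing, as we did, or by padding. A Python sketch of the padding variant on a toy pixel grid (the function name is ours):

```python
def pad_to_square(img, fill=0):
    """Pad a 2-D pixel grid with `fill` so that height == width.
    Unlike resampling, padding leaves the original pixels untouched."""
    h, w = len(img), len(img[0])
    n = max(h, w)
    out = [row + [fill] * (n - w) for row in img]   # widen each row
    out += [[fill] * n for _ in range(n - h)]       # add rows at the bottom
    return out

sq = pad_to_square([[1, 2, 3], [4, 5, 6]])  # 2x3 grid -> 3x3 grid
```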
Another problem we encountered was the slow speed of running the regression analysis with the leave-one-out cross-validation (LOOCV) procedure that Kee & Farid used in their 2011 paper. To quickly assess whether our generated statistics could yield valid predictions, we instead ran our regression analyses using all images in both the training and testing sets. Since this produces an overfitted model that may not generalize, our regression analysis could have been improved by using the LOOCV technique instead.
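For reference, the LOOCV procedure itself is simple to express; it was only slow because each fit is expensive. A Python sketch with a toy stand-in model in place of the SVR:

```python
def loocv(xs, ys, fit, predict):
    """Leave-one-out cross-validation: for each image, train on all of the
    others and predict the held-out rating. With n images this requires n
    full model fits, which is what made it slow for us."""
    preds = []
    for i in range(len(xs)):
        train_x = xs[:i] + xs[i + 1:]
        train_y = ys[:i] + ys[i + 1:]
        model = fit(train_x, train_y)
        preds.append(predict(model, xs[i]))
    return preds

# Toy stand-in for the SVR: predict the mean rating of the training set.
fit = lambda X, y: sum(y) / len(y)
predict = lambda model, x: model
preds = loocv([1, 2, 3, 4], [2.0, 4.0, 2.0, 4.0], fit, predict)
```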
Future work
While working with the algorithm, we identified a few key areas where it could be improved. First, given our difficulty running all of the jobs, we believe that optimizing the algorithm’s run time should be a high priority. Reimplementing the algorithm in C rather than MATLAB would most likely yield substantial performance gains. Run time could probably also be reduced by culling the image background from the photos before the image registration step.
The second major area for improvement would be to automate face and body segmentation. Currently, the algorithm relies on a user manually masking out the face and hair regions to compute the required image statistics. Automating this process by using standard face and person detection algorithms could greatly decrease the effort required to rate each image.
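Once a detector supplies a bounding box, the masking itself is trivial. A Python sketch of that final step on a toy pixel grid (the detector call is omitted; standard libraries such as OpenCV provide face detection, and the function name here is ours):

```python
def apply_mask(img, box, fill=0):
    """Blank out a rectangular region (top, left, height, width) of a pixel
    grid -- the step that a face/person detector would automate."""
    top, left, h, w = box
    return [[fill if top <= r < top + h and left <= c < left + w else v
             for c, v in enumerate(row)]
            for r, row in enumerate(img)]

img = [[1] * 4 for _ in range(4)]
masked = apply_mask(img, (1, 1, 2, 2))  # blank a 2x2 face region
```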
We also have several suggestions for improving the quality of the algorithm’s results. First, as future work, we would like the algorithm to account for modifications to the extremities. For example, in before/after image 205, Britney Spears’ calves are lengthened, creating a substantial perceptual change that is not picked up by the current algorithm.
Also, it is important to recognize that some perceptual distortions are more noticeable than others. Specifically, modifications to the eyes and mouth, such as filling in missing teeth, can produce huge changes in the human-provided perceptual scores but only minor changes in the algorithmically predicted results. We propose breaking the face down into multiple regions, including the eyes, nose, and mouth, and giving added weight to these regions to bring the algorithmic estimates more in line with the realities of human perception.
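A Python sketch of the proposed region weighting; the weight values below are assumptions and would need to be fit to observer data:

```python
# Assumed weights favoring perceptually sensitive regions (hypothetical).
WEIGHTS = {"eyes": 3.0, "mouth": 3.0, "nose": 1.5, "rest": 1.0}

def weighted_score(region_distortion):
    """Weighted average of per-region distortion measurements, so that
    changes to the eyes and mouth count more than changes elsewhere."""
    total_w = sum(WEIGHTS[r] for r in region_distortion)
    return sum(WEIGHTS[r] * d for r, d in region_distortion.items()) / total_w

s = weighted_score({"eyes": 0.8, "mouth": 0.6, "nose": 0.1, "rest": 0.1})
```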
Next, we recognize that removing some minor freckles and blemishes probably does not have much of a perceptual effect on the image, and should not drastically increase an image’s perceived score. Therefore, we propose adding a tunable minimum threshold below which any photometric and geometric distortions will be ignored.
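A Python sketch of the proposed floor; the default value is an assumption and would be tuned:

```python
def threshold_stats(stats, floor=0.05):
    """Zero out any photometric or geometric distortion statistic below a
    tunable floor, so that minor blemish removal does not raise the score.
    The default floor is an illustrative value, not a fitted one."""
    return [s if abs(s) >= floor else 0.0 for s in stats]

print(threshold_stats([0.01, 0.2, -0.03, 0.5]))  # [0.0, 0.2, 0.0, 0.5]
```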
Finally, we propose adding statistical measurements for photometric modifications made to the body, which the algorithm currently does not detect. Such modifications, like blurring away cellulite, can have a visible perceptual effect on the image, so adding these statistics could improve the algorithm’s estimates.
References
Farid’s image transformation MATLAB code:
http://www.cs.dartmouth.edu/farid/Hany_Farid/Research/Entries/2011/5/17_Image_Registration.html
CVX: http://cvxr.com/cvx/
Zhou Wang’s SSIM code: https://ece.uwaterloo.ca/~z70wang/research/ssim/
Implementation (or SSIM in general) described in this paper:
http://www.cns.nyu.edu/pub/eero/wang03-reprint.pdf
libSVM: http://www.csie.ntu.edu.tw/~cjlin/libsvm/
Parameter selection in MATLAB: http://www.csie.ntu.edu.tw/~cjlin/libsvm/faq.html#f803
John Brunhaver wrote the two main functions in the photo_batch script.
Appendix I - Code and Data
Data
User ratings recorded by group members for subsets of the images
File:Farid ratings.zip: User ratings data graciously provided by Prof. Farid. This set differs from the one provided with the publication in that its numbering matches the authors’ photo sets.
Amazon Mechanical Turk observer ratings on retouching severity
Photoset: http://www.cs.dartmouth.edu/farid/downloads/publications/pnas11/beforeafter.tar
Corresponding Masks: http://www.cs.dartmouth.edu/farid/downloads/publications/pnas11/masks.tar
Fifty observer ratings for each of the 468 before/after images used in Kee & Farid’s 2011 paper were acquired from the supplementary resources for the research paper. The authors gathered the user data using the process and observers described below:
Task: Each observer session lasted approximately 30 minutes and was structured as follows:
1. Each participant was initially shown a representative set of 20 before/after images to help them gauge the range of distortions they could expect to see.
2. Each participant was then asked to rate 70 pairs of before/after images on a scale of 1 (“very similar”) to 5 (“very different”). The presentation of images was self-timed; participants could manually toggle between before and after images as many times as they wished. Each observer rated a random set of 5 images 3 times each to measure the consistency of their responses.
Observers: These ratings were provided by 390 observers who were recruited through Amazon’s Mechanical Turk. Each observer was paid $3 for participating in the session. 9.5% of observers were excluded because they responded with high variance on repeated trials and frequently toggled only once between before and after images, suggesting a lack of consistency or seriousness in their rating process.
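The exclusion rule can be sketched in Python; the variance and toggle thresholds below are our assumptions, not the authors’ exact criteria:

```python
from statistics import pvariance

def exclude_observer(repeat_ratings, toggle_counts,
                     max_variance=1.0):
    """Flag an observer whose repeated ratings of the same image vary too
    much AND who toggled only once on most trials. Thresholds are assumed
    here for illustration; the paper's exact criteria differ."""
    high_var = any(pvariance(r) > max_variance for r in repeat_ratings)
    low_toggle = sum(1 for t in toggle_counts if t <= 1) > len(toggle_counts) / 2
    return high_var and low_toggle

# An observer rating the same image as 1, 5, 1 while rarely toggling
# would be excluded; a consistent, engaged observer would be kept.
```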
Code
This .zip contains several files, including:
photo_batch.pl: Perl script used to batch out statistics-gathering jobs to multiple machines in the Farmshare cluster. Modify/run this script to gather stats.
image_farm.m: Master matlab code modified by photo_batch.pl to do cluster statistics gathering
photometric.m, stats.m, vfield.m: function files called by image_farm.m or run_in_serial...m.
jsub: executable needed for the Farmshare cluster
run_in_serial_331_340.m: serial adaptation of image_farm.m
prepAndRunSVM_revised....m - different variants of code to run SVR on the gathered stats. See files for differences.
Other code needed (ssim_index.m, CVX, image registration code, libSVM) is cited in the report.
Appendix II - Work partition
Much of the work for this project was performed cooperatively, with all three group members meeting frequently to discuss and explore the algorithm and its implementation. However, each member focused on different aspects of the project. Andrew Danowitz led much of the early code exploration and did most of the implementation of the photometric (filter and SSIM) components. He also pieced together the different implementation components and set up the ability to submit much of the computational work to the Farmshare cluster. Andrew contributed a great deal to the report and presentation slides as well.
In addition to taking part in the collaborative aspects of the project, Andrea Zvinakis wrote much of the report and presentation slides. She also performed statistical analysis and, when the Farmshare cluster was unable to support our workload, was responsible for running the summary statistics generating code on hundreds of images on other computers. Andrea also set up the capacity for the group to rate certain images from the photo set.
Taking part in the collaborative aspects of the project as well, Bradley Collins also made small contributions to the report and presentation slides. Bradley explored or implemented several parts of the geometric component of the algorithm, helped adapt the summary-statistics code to run jobs serially on campus computers, and used that code to gather many of the image statistics. In addition, Bradley was responsible for most of the SVR implementation and execution.




