Identifying Model Weaknesses Using Image System Degradations
Introduction
Background
Z-Score
The z-score is a statistical measure of how far a value deviates from the mean of a distribution, expressed in units of standard deviation. A z-score of 0 indicates that the value equals the mean, while positive and negative z-scores indicate that the value is above or below the mean, respectively. Z-scores are useful for identifying outliers and for understanding how unusual a given observation is relative to the overall data distribution. The z-score for an observation is calculated as:

$$z = \frac{x - \mu}{\sigma}$$

Where:
- $x$ is the observed value,
- $\mu$ is the mean of the distribution, and
- $\sigma$ is the standard deviation of the distribution.
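The formula above can be sketched in a few lines of NumPy; the sample values here are illustrative only:

```python
import numpy as np

def z_scores(values):
    """Return how many standard deviations each value lies from the mean."""
    values = np.asarray(values, dtype=float)
    return (values - values.mean()) / values.std()

# Example: mean is 5 and standard deviation is 2, so 2.0 -> z = -1.5, 9.0 -> z = 2.0.
scores = z_scores([2.0, 4.0, 4.0, 4.0, 5.0, 5.0, 7.0, 9.0])
print(scores)
```

By construction, the z-scores of a sample always have mean 0 and standard deviation 1.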
ROC and AUC
The Receiver Operating Characteristic (ROC) curve represents a model's classification performance across classification thresholds: it plots the True Positive Rate (TPR) against the False Positive Rate (FPR) as the positive-classification threshold is varied, making the trade-off between the two rates visible. A model with perfect classification ability has a curve that passes through the top-left corner (TPR = 1, FPR = 0), while a model that makes random predictions produces a diagonal line from (0,0) to (1,1).
The Area Under the ROC Curve (AUC) is a scalar summary of classifier performance, computed as the area under the ROC curve. The AUC ranges from 0 to 1: a value of 1 indicates a perfect classifier, a value of 0.5 indicates performance no better than random chance, and a value below 0.5 indicates the model ranks the classes worse than random guessing, i.e., it is systematically misclassifying them. For our project, we use the AUC to quantify the uncertainty introduced by degradation. By analyzing how the AUC changes under increasing degradation, we can assess how well the model continues to distinguish between specific classes: if the AUC decreases significantly as the level of degradation increases, the degradation is causing more confusion between classes.
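As a minimal sketch of computing an ROC curve and its AUC, using scikit-learn on hypothetical labels and scores (the numbers are illustrative, not from our experiments):

```python
import numpy as np
from sklearn.metrics import roc_curve, roc_auc_score

# Toy binary labels (1 = positive class) and model confidence scores.
y_true = np.array([0, 0, 0, 0, 1, 1, 1, 1])
y_score = np.array([0.1, 0.3, 0.35, 0.8, 0.4, 0.6, 0.7, 0.9])

# ROC points (FPR, TPR) at each distinct threshold, plus the scalar AUC.
fpr, tpr, thresholds = roc_curve(y_true, y_score)
auc = roc_auc_score(y_true, y_score)
print(f"AUC = {auc:.4f}")
```

The AUC equals the probability that a randomly chosen positive example is scored higher than a randomly chosen negative one, which is why 0.5 corresponds to random ranking.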
Methods
Using Z-Score to Narrow Down the Most Affected Classes
To identify which classes are most affected by a degradation, we used z-scores to quantify the degree of systematic confusion at the class level.
For each class, we measure the change in the top-1 class prediction probability for each image before and after applying degradation. The drop in the top-1 class probability is calculated as:

$$\Delta p = p_{\text{before}} - p_{\text{after}}$$

Where $p_{\text{before}}$ is the probability of the top-1 prediction before degradation, and $p_{\text{after}}$ is the probability of that same prediction after degradation. We calculate this drop for the 10 images in each class and compute the mean $\mu_{\Delta p}$ and standard deviation $\sigma_{\Delta p}$ of the probability drops pooled across all images.
Using these statistics, we compute the z-score for each image's probability drop as:

$$z = \frac{\Delta p - \mu_{\Delta p}}{\sigma_{\Delta p}}$$

To calculate the confusion for an entire class, we compute the mean of the z-scores across its 10 images. Classes with mean z-scores close to zero exhibit consistent, dataset-typical confusion, meaning the degradation affects their classification probabilities in a similar way to the other classes. Classes with higher mean z-scores exhibit larger, more erratic changes in prediction probabilities across their 10 images. Ultimately, this method allows us to pinpoint the classes that are most sensitive to specific degradations.
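The per-class computation above can be sketched as follows; class names and drop values are hypothetical, and the z-scores are normalized against the drops pooled over all classes so that per-class means are comparable:

```python
import numpy as np

def class_mean_z_scores(drops_by_class):
    """Map each class to the mean z-score of its per-image top-1 probability
    drops (p_before - p_after), normalized against the pooled distribution."""
    all_drops = np.concatenate([np.asarray(d, dtype=float)
                                for d in drops_by_class.values()])
    mu, sigma = all_drops.mean(), all_drops.std()
    return {cls: float((np.asarray(d, dtype=float) - mu).mean() / sigma)
            for cls, d in drops_by_class.items()}

# Hypothetical drops for two classes of 10 images each.
drops = {
    "fiddler_crab": [0.5] * 10,   # consistently large drops under degradation
    "goldfish":     [0.05] * 10,  # barely affected
}
mean_z = class_mean_z_scores(drops)
print(mean_z)
```

In this toy example the heavily affected class ends up with a large positive mean z-score, flagging it for the AUC analysis that follows.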
Using AUC Scores to Quantify Uncertainty
After identifying the classes most affected by the degradation using the class average z-score method, we used AUC scores to quantify the level of uncertainty introduced by the degradation for specific classes.
To do this, we analyzed how the model's top-1 prediction changed after the degradation was applied. We focused on cases where the model's new top-1 prediction differed from its original top-1 prediction. The newly predicted class after degradation was treated as the "ground truth" for AUC analysis. For each image in the affected classes, we recorded the model's predicted probabilities for this "new ground truth" label and computed an ROC curve to assess how well the model ranked this label compared to other classes.
For example, after applying the pixel size degradation, images of fiddler crabs were frequently misclassified as hermit crabs. To quantify this confusion, we treated the hermit crab as the "ground truth" label and calculated the AUC score using the model’s predicted probabilities for the hermit crab across images from both the fiddler crab class and the hermit crab class. This approach allows us to measure how well the model distinguishes the two classes under the influence of degradation.
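A minimal sketch of this fiddler-crab/hermit-crab AUC computation, with hypothetical softmax probabilities in place of real model outputs:

```python
import numpy as np
from sklearn.metrics import roc_auc_score

# Hypothetical model probabilities for the "hermit crab" label on degraded
# images: fiddler crabs (negatives) vs. actual hermit crabs (positives).
hermit_prob_fiddler_imgs = np.array([0.55, 0.48, 0.62, 0.40, 0.51])
hermit_prob_hermit_imgs = np.array([0.70, 0.66, 0.58, 0.81, 0.74])

y_true = np.concatenate([np.zeros(5), np.ones(5)])  # 1 = actual hermit crab
y_score = np.concatenate([hermit_prob_fiddler_imgs, hermit_prob_hermit_imgs])

# High AUC: the model still separates the two classes despite the degradation.
# Low AUC: the degradation has collapsed the distinction between them.
auc = roc_auc_score(y_true, y_score)
print(f"AUC for hermit-crab score under degradation: {auc:.3f}")
```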

Results
Conclusions
Appendix