Identifying Model Weaknesses Using Image System Degradations
Revision as of 07:53, 13 December 2024
Introduction
Deep learning models, particularly convolutional neural networks (CNNs), have achieved remarkable success in image classification tasks. However, a significant gap exists between the controlled environments in which these models are trained and the unpredictable conditions of the real world. Most models are trained on clean, well-lit, and high-resolution datasets, yet real-world images are often subject to a wide range of degradations. These degradations may arise from factors such as blur, poor lighting, sensor noise, and changes in image resolution. The resulting distribution shift poses a substantial challenge for model generalization, potentially leading to misclassifications and performance degradation.
Addressing this challenge requires a deeper understanding of how image system degradations impact model behavior. By systematically evaluating model performance under controlled degradations, it becomes possible to identify specific vulnerabilities and failure points. This insight not only informs the development of more robust models but also supports the design of lightweight debugging pipelines that can rapidly assess the weaknesses of existing models through targeted interventions. By enabling a faster and more focused evaluation of model robustness, lightweight debugging could serve as a practical and cost-effective tool for improving model performance under real-world conditions.
Our paper aims to bridge the gap between idealized training conditions and the realities of image system degradations, and to provide a clear pathway for efficiently diagnosing model weaknesses and informing targeted improvements. This work advances the broader goal of developing robust vision systems capable of operating effectively in real-world scenarios.
Background
Z-Score
The z-score is a statistical measure of how far a given value deviates from the mean of a distribution, expressed in units of standard deviation. A z-score of 0 indicates that the value equals the mean, while positive and negative z-scores indicate that the value lies above or below the mean, respectively. Z-scores are useful for identifying outliers and understanding how unusual a given observation is relative to the overall data distribution. The z-score for an observation is calculated as:

z = (x − μ) / σ

Where:
- x is the observed value,
- μ is the mean of the distribution, and
- σ is the standard deviation of the distribution.
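The standardization above can be sketched in a few lines of NumPy; the sample values below are illustrative, not taken from our experiments:

```python
import numpy as np

def z_scores(values):
    """Standardize values: how many standard deviations each lies from the mean."""
    values = np.asarray(values, dtype=float)
    return (values - values.mean()) / values.std()

# A value equal to the mean gets z = 0; values above/below the mean
# get positive/negative z-scores.
print(z_scores([2.0, 4.0, 4.0, 4.0, 5.0, 5.0, 7.0, 9.0]))
```

Here the mean is 5 and the (population) standard deviation is 2, so the value 9 receives a z-score of 2, marking it as the most unusual observation in the sample.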
ROC and AUC
The Receiver Operating Characteristic (ROC) curve represents a model's classification performance across different classification thresholds. Typically, the ROC is plotted as the True Positive Rate (TPR) against the False Positive Rate (FPR) for various threshold values, making visible the trade-off between the two rates as the positive classification threshold varies. A model with perfect classification ability would have a curve that passes through the top-left corner (TPR = 1, FPR = 0), while a model that makes random predictions would produce a diagonal line from (0,0) to (1,1).
The Area Under the ROC Curve (AUC) is a scalar computed as the area under the ROC curve and quantifies the overall performance of the classifier. The AUC score ranges from 0 to 1: a value of 1 indicates a perfect classifier, a value of 0.5 indicates a classifier that performs no better than random chance, and a value below 0.5 indicates a classifier worse than random guessing, i.e., one that systematically ranks the correct class below other classes. For our project, we use the AUC to quantify the uncertainty introduced by degradation. By analyzing how the AUC changes under increasing degradation, we can assess how well the model continues to distinguish between specific classes. If the AUC decreases significantly as the level of degradation increases, it indicates that the degradation is causing more confusion between classes.
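A minimal, dependency-free sketch of the AUC makes the "random chance = 0.5" interpretation concrete: the AUC equals the probability that a randomly chosen positive example is ranked above a randomly chosen negative one (ties counting half). The labels and scores below are toy values for illustration:

```python
import numpy as np

def auc_score(y_true, y_score):
    """AUC as the probability that a random positive outranks a random negative."""
    y_true = np.asarray(y_true)
    y_score = np.asarray(y_score)
    pos = y_score[y_true == 1]  # scores of positive examples
    neg = y_score[y_true == 0]  # scores of negative examples
    wins = (pos[:, None] > neg[None, :]).sum()   # positive ranked above negative
    ties = (pos[:, None] == neg[None, :]).sum()  # ties count half
    return (wins + 0.5 * ties) / (len(pos) * len(neg))

# 3 of the 4 positive/negative pairs are ranked correctly.
print(auc_score([0, 0, 1, 1], [0.1, 0.4, 0.35, 0.8]))  # 0.75
```

A perfect ranking yields 1.0, a fully reversed ranking 0.0, matching the interpretation of AUC values above and below 0.5 in the text.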
Methods
Using Z-Score to Narrow Down the Most Affected Classes
To identify which classes are most affected by a degradation, we utilized z-score to quantify the degree of systematic confusion at the class level.
For each class, we measure the change in the top-1 class prediction probability for each image before and after applying degradation. The drop in the top-1 class probability is calculated as:

Δp = p_before − p_after

Where p_before is the probability of the top-1 prediction before degradation, and p_after is the probability after degradation. We calculate this drop for the 10 images per class and compute the mean and standard deviation of these probability drops.
Using these statistics, we compute the z-score for each image's probability drop as:

z_i = (Δp_i − μ_Δp) / σ_Δp

Where Δp_i is the probability drop for image i, and μ_Δp and σ_Δp are the mean and standard deviation of the drops. To calculate the confusion for the entire class, we compute the mean of the z-scores across all 10 images. Classes with mean z-scores close to zero exhibit consistent confusion, meaning the degradation affects the classification probabilities for all images in a similar way. Classes with higher mean z-scores exhibit more random effects as the degradation causes larger and more inconsistent changes in prediction probabilities across the 10 images. Ultimately, this method allows us to pinpoint the classes that are most sensitive to specific degradations.
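The per-class computation can be sketched as follows; the probability arrays are hypothetical stand-ins for one class's 10 images, not values from our experiments:

```python
import numpy as np

# Hypothetical top-1 probabilities for the 10 images of one class,
# before and after applying a degradation.
p_before = np.array([0.92, 0.88, 0.95, 0.90, 0.85, 0.93, 0.89, 0.91, 0.87, 0.94])
p_after  = np.array([0.60, 0.55, 0.70, 0.58, 0.20, 0.65, 0.57, 0.62, 0.59, 0.68])

drops = p_before - p_after                # per-image drop in top-1 probability
z = (drops - drops.mean()) / drops.std()  # z-score of each image's drop

# Images with large |z| were affected very differently from the rest of the
# class, signalling inconsistent rather than systematic confusion.
print(z.round(2))
```

In this toy example the fifth image's much larger drop produces an outlying z-score, flagging the degradation's effect on that image as inconsistent with the rest of the class.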
Using AUC Scores to Quantify Uncertainty
After identifying the classes most affected by the degradation using the class average z-score method, we used AUC scores to quantify the level of uncertainty introduced by the degradation for specific classes.
To do this, we analyzed how the model's top-1 prediction changed after the degradation was applied. We focused on cases where the model's new top-1 prediction differed from its original top-1 prediction. The newly predicted class after degradation was treated as the "ground truth" for AUC analysis. For each image in the affected classes, we recorded the model's predicted probabilities for this "new ground truth" label and computed an ROC curve to assess how well the model ranked this label compared to other classes.
For example, after applying the pixel size degradation, images of fiddler crabs were frequently misclassified as hermit crabs. To quantify this confusion, we treated the hermit crab as the "ground truth" label and calculated the AUC score using the model’s predicted probabilities for the hermit crab across images from both the fiddler crab class and the hermit crab class. This approach allows us to measure how well the model distinguishes the two classes under the influence of degradation.
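The crab example can be sketched numerically. The probabilities below are hypothetical: we only assume that, after degradation, hermit-crab images still tend to receive somewhat higher "hermit crab" probability than fiddler-crab images:

```python
import numpy as np

# Hypothetical post-degradation probabilities for the "hermit crab" class,
# over images drawn from the fiddler-crab and hermit-crab classes.
p_hermit = np.array([0.45, 0.55, 0.40, 0.68,   # fiddler-crab images
                     0.70, 0.65, 0.80, 0.75])  # hermit-crab images
# Treat "hermit crab" as the ground-truth label for the AUC analysis:
y = np.array([0, 0, 0, 0, 1, 1, 1, 1])

pos, neg = p_hermit[y == 1], p_hermit[y == 0]
wins = (pos[:, None] > neg[None, :]).sum()   # positive outranks negative
ties = (pos[:, None] == neg[None, :]).sum()  # ties count half
auc = (wins + 0.5 * ties) / (pos.size * neg.size)
print(auc)  # 0.9375: the two classes remain mostly separable
```

An AUC near 1.0 would mean the model still separates the two classes despite the degradation; a value sliding toward 0.5 would indicate the degradation is collapsing the distinction between fiddler and hermit crabs.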

Results
Conclusions
Appendix