Identifying Model Weaknesses Using Image System Degradations

From Psych 221 Image Systems Engineering
Revision as of 12:17, 13 December 2024 by Le

Introduction

Deep learning models, particularly convolutional neural networks (CNNs), have achieved remarkable success in image classification tasks. However, a significant gap exists between the controlled environments in which these models are trained and the unpredictable conditions of the real world. Most models are trained on clean, well-lit, and high-resolution datasets, yet real-world images are often subject to a wide range of degradations. These degradations may arise from factors such as blur, poor lighting, sensor noise, and changes in image resolution. The resulting distribution shift poses a substantial challenge for model generalization, potentially leading to misclassifications and performance degradation.

Addressing this challenge requires a deeper understanding of how image system degradations impact model behavior. By systematically evaluating model performance under controlled degradations, it becomes possible to identify specific vulnerabilities and failure points. This insight not only informs the development of more robust models but also supports the design of lightweight debugging pipelines for existing models that are able to rapidly assess model weaknesses through targeted interventions. By enabling a faster and more focused evaluation of model robustness, lightweight debugging could serve as a practical and cost-effective tool for improving model performance under real-world conditions.

Our paper aims to bridge the gap between idealized training conditions and the realities of image system degradations, and provide a clear pathway for efficiently diagnosing model weaknesses and informing targeted improvements. This work advances the broader goal of developing robust vision systems capable of operating effectively in real-world scenarios.

Background

Z-Score

The z-score is a statistical measure that indicates how far a given value deviates from the mean of a distribution, expressed in units of standard deviation. A z-score of 0 indicates that the value is equal to the mean, while positive and negative z-scores indicate that the value is above or below the mean, respectively. Z-scores are useful for identifying outliers and understanding how unusual a given observation is relative to the overall data distribution. The z-score for an observation o is calculated as: z = (o - μ) / σ Where:

  • o is the observed value,
  • μ is the mean of the distribution, and
  • σ is the standard deviation of the distribution.
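
As a quick sketch, the z-score computation looks like this in NumPy (the sample values below are made up for illustration):

```python
import numpy as np

# Hypothetical observations, with one obvious outlier at 0.50
drops = np.array([0.10, 0.12, 0.08, 0.11, 0.09, 0.50, 0.10, 0.13, 0.07, 0.10])

mu = drops.mean()     # mean of the distribution
sigma = drops.std()   # standard deviation of the distribution

# z-score: how many standard deviations each observation is from the mean
z = (drops - mu) / sigma

# The 0.50 observation receives a large positive z-score, flagging it as unusual
print(z.round(2))
```

By construction, the z-scores of a sample always have zero mean, so a value with a large positive z-score stands well above the bulk of the data.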

ROC and AUC

The Receiver Operating Characteristic (ROC) curve is a representation of a model's classification performance across different classification thresholds. Typically, the ROC is plotted as the True Positive Rate (TPR) against the False Positive Rate (FPR) when sweeping various threshold values. The ROC curve provides the ability to assess the trade-off between the true positive rate and the false positive rate as we vary the positive classification threshold. A model with perfect classification ability would have a curve that passes through the top-left corner (TPR = 1, FPR = 0), while a model that makes random predictions would produce a diagonal line from (0,0) to (1,1).

The Area Under the ROC Curve (AUC) is a scalar value computed as the area under the ROC curve, and it quantifies the overall performance of the classifier. The AUC ranges from 0 to 1: a value of 1 indicates a perfect classifier, a value of 0.5 indicates a classifier that performs no better than random chance, and a value below 0.5 indicates a classifier that is worse than random guessing, systematically ranking the wrong class higher. For our project, we use the AUC to quantify the uncertainty introduced by degradation. By analyzing how the AUC changes under increasing degradation, we can assess how well the model continues to distinguish between specific classes. If the AUC decreases significantly as the level of degradation increases, the degradation is causing more confusion between those classes.
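
To make the AUC concrete, here is a minimal NumPy sketch that computes it directly from its probabilistic interpretation, namely the probability that a randomly chosen positive is scored above a randomly chosen negative (the labels and scores are invented for illustration):

```python
import numpy as np

def auc_score(labels, scores):
    """AUC via the pairwise (Mann-Whitney) formulation: the fraction of
    positive/negative pairs where the positive gets the higher score."""
    labels = np.asarray(labels, dtype=bool)
    scores = np.asarray(scores, dtype=float)
    pos, neg = scores[labels], scores[~labels]
    # Count pairwise wins; ties count as half a win
    wins = (pos[:, None] > neg[None, :]).sum() + 0.5 * (pos[:, None] == neg[None, :]).sum()
    return wins / (len(pos) * len(neg))

# A perfect classifier ranks every positive above every negative -> AUC = 1.0
print(auc_score([1, 1, 0, 0], [0.9, 0.8, 0.2, 0.1]))  # 1.0
# Fully interleaved scores give AUC = 0.5, i.e. chance-level ranking
print(auc_score([1, 0, 1, 0], [0.6, 0.7, 0.4, 0.3]))  # 0.5
```

In practice a library routine such as scikit-learn's `roc_auc_score` computes the same quantity more efficiently; the pairwise form is shown here because it matches the intuition in the text.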

Methods

Selecting the Model and Dataset

For this project, we selected ResNet-18 as the model and ImageNetV2 as the dataset. These choices ensure a robust evaluation of how degradations of high-quality images affect the performance of a widely used classification model.

ResNet-18

ResNet-18 is a convolutional neural network (CNN) released by Microsoft and is one of the smaller models in the ResNet family. We selected ResNet-18 for this project because of its lightweight architecture and its ability to provide rapid inference while still achieving strong performance on the ImageNet benchmark. Additionally, ResNet-18 has been widely used in image degradation and robustness studies, making it a strong baseline for evaluating how different types of degradation affect model performance.

ImageNetV2

ImageNetV2 is a collection of 10,000 images that serves as an extension to the original ImageNet dataset. One of the key reasons for using ImageNetV2 was to test the performance of the ResNet-18 model on a set of images that are similar in distribution to ImageNet, but not seen by any pre-trained ImageNet models. This allows us to better understand how degradations affect generalization performance. We focused on the "Matched Frequency" subset of ImageNetV2, which is designed to have a class distribution similar to the original ImageNet dataset. This alignment ensures that the degradation effects are comparable to those seen on ImageNet.

Applying Image Degradations

Pixel Size Degradation

The process of pixel size degradation involves adjusting the physical size of each pixel on the image sensor. In this project, we altered pixel sizes to 5 µm, 8 µm, and 11 µm, affecting the sensor’s light-collection ability and spatial resolution. Pixel size directly affects the signal-to-noise ratio (SNR), photon collection efficiency, and the modulation transfer function (MTF).

Pixel size refers to the physical dimensions of each individual pixel on an image sensor. Pixels convert incoming photons into an electrical signal. Smaller pixels have a smaller surface area and therefore collect fewer photons; this lower photon count increases the relative impact of photon shot noise, which follows a Poisson distribution. Conversely, larger pixels collect more light, resulting in a higher SNR but at the cost of reduced spatial resolution. In this project, the fill factor, which is the fraction of each pixel's surface area that actively collects light, was held constant, ensuring that the light-collecting area was proportional to the pixel size.
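
The relationship between pixel area, photon count, and shot noise can be sketched with a small Poisson simulation (the photon flux value below is purely illustrative and is not taken from our sensor model):

```python
import numpy as np

rng = np.random.default_rng(0)
flux = 10.0  # photons per square micron per exposure (illustrative value)

snr = {}
for pixel_size_um in (5, 8, 11):
    mean_photons = flux * pixel_size_um ** 2  # photon count scales with pixel area
    # Photon arrivals follow a Poisson distribution (shot noise)
    samples = rng.poisson(mean_photons, size=100_000)
    snr[pixel_size_um] = samples.mean() / samples.std()
    print(f"{pixel_size_um:2d} um pixel: SNR ~ {snr[pixel_size_um]:.1f}")

# For Poisson noise, SNR grows as sqrt(mean_photons), so larger pixels
# (or longer exposures) always improve SNR until the pixel saturates.
```

The simulation reproduces the square-root law: going from 5 µm to 11 µm pixels roughly doubles the SNR, at the cost of the spatial-resolution trade-off discussed below.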

Impact on Image Quality

The following table summarizes the trade-offs observed as pixel size changes. Note that the entries describe relative differences among the three pixel sizes tested.

Pixel Size (µm)   Resolution   SNR              MTF
5                 Highest      Relatively low   High MTF at high frequencies
8                 Medium       Moderate         Moderate MTF
11                Lower        Higher           Lower MTF

Discussion of Exposure Time and Clipping

Exposure time and pixel size are closely related in how they affect the image sensor's ability to capture light. Lengthening the exposure time has a similar effect to increasing the pixel size, because both increase the total number of photons collected at each pixel.

  • Longer exposure times allow the pixel to collect photons for a longer period, increasing the total photon count.
  • Larger pixel sizes collect more photons because the physical area of the pixel is larger.

Both of these approaches improve the SNR, as the total number of collected photons increases, reducing the relative impact of shot noise. However, when pixel size increases or exposure time increases, there is a higher chance that the sensor will saturate. This occurs because the maximum charge capacity of the pixel is limited. When too many photons are captured, the pixel can no longer store the charge, and it becomes clipped. This results in certain pixel intensities being capped at the maximum intensity value.

Clipping typically affects the RGB color channels independently. In this project, it was observed that certain channels were clipped during the pixel degradation process. This is because increasing the pixel size increases the photon collection rate, which can saturate any of the channels. If, for example, the red channel receives more light than the other channels (perhaps due to a stronger red component in the illumination), it may clip while the other channels remain within range. This effect is most noticeable in regions of the image where light is intense, and will appear to tint portions of the image based on the saturated channel.
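
A minimal sketch of channel-wise clipping, assuming a hypothetical well capacity and red-heavy illumination (none of these numbers come from our actual sensor simulation):

```python
import numpy as np

well_capacity = 10_000  # maximum charge a pixel can hold, in electrons (illustrative)

# Hypothetical mean photo-electron counts per channel (R, G, B) at the base
# pixel size, under illumination with a stronger red component
signal = np.array([9_000.0, 6_000.0, 4_000.0])

# Doubling the collection area (or the exposure time) doubles the collected charge
collected = signal * 2.0

# Each channel clips independently at the well capacity
clipped = np.minimum(collected, well_capacity)

print("collected:", collected)  # R and G now exceed the well capacity
print("clipped:  ", clipped)    # R and G are capped at 10000; B is unaffected
```

Because the red and green channels saturate while blue does not, the relative channel balance in bright regions shifts, producing the color tint described above.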


Preparing Images for Model Inference

After applying degradations to the images, we prepare them for input into the ResNet model by applying a series of transformations to the images.

Image Resizing

Each image must first be resized, through downsampling or upsampling, to 224x224 pixels. In this project, we used bilinear interpolation, a widely used method for image resizing in machine learning. Bilinear interpolation computes each output pixel as a weighted average of the four nearest pixels in the original image. Unlike simpler methods such as nearest-neighbor, this approach avoids hard edges and sharp transitions between pixels. However, on top of the degradations already applied to the original image, resizing further attenuates high-frequency components, losing spatial detail and introducing blur. This resizing is an image degradation in itself and must be taken into account when considering the overall degradation of images input to the model; indeed, the resizing process is analogous to reducing the resolution of a sensor.

To compute the intensity I(x,y) of a pixel at a non-integer position (x,y) in the resized image, we use the bilinear interpolation formula: I(x,y) = (1 - dx)(1 - dy)·I(x1,y1) + dx(1 - dy)·I(x2,y1) + (1 - dx)·dy·I(x1,y2) + dx·dy·I(x2,y2) Where:

  • (x1,y1),(x2,y1),(x1,y2), and (x2,y2) are the four nearest neighboring pixels in the original image.
  • dx = x - x1 is the horizontal distance from x to the pixel column on its left.
  • dy = y - y1 is the vertical distance from y to the pixel row above it.
  • I(x1,y1),I(x2,y1),I(x1,y2),I(x2,y2) are the intensities of the four neighboring pixels.

This process ensures that the closer the target pixel (x,y) is to a particular neighbor, the more influence that neighbor has on the intensity of the pixel at (x,y). The upsampling and downsampling process both use these same ideas, but they differ in how pixel information is preserved and interpolated. During downsampling, pixel information is reduced by averaging or merging neighboring pixels into a smaller number of pixels, often leading to a loss of high-frequency details like edges and fine textures. In contrast, upsampling increases the image size by creating new pixels whose values are interpolated from existing neighboring pixels, effectively filling in the gaps between existing pixel locations. While downsampling removes detail, upsampling cannot restore details that were not present in the original image, and it often results in smoother, more blurred images due to the interpolation process.
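
The formula above can be sketched directly in NumPy (a minimal single-channel implementation for illustration; in practice the resizing is done by a library routine):

```python
import numpy as np

def bilinear_sample(img, x, y):
    """Sample a single-channel image at a non-integer position (x, y)
    using bilinear interpolation over the four nearest pixels."""
    x1, y1 = int(np.floor(x)), int(np.floor(y))
    x2 = min(x1 + 1, img.shape[1] - 1)  # clamp at the image border
    y2 = min(y1 + 1, img.shape[0] - 1)
    dx, dy = x - x1, y - y1
    # Weights grow as the target point approaches each neighbor
    return ((1 - dx) * (1 - dy) * img[y1, x1] + dx * (1 - dy) * img[y1, x2]
            + (1 - dx) * dy * img[y2, x1] + dx * dy * img[y2, x2])

img = np.array([[0.0, 1.0],
                [2.0, 3.0]])
# Halfway between all four pixels: the plain average of 0, 1, 2, 3
print(bilinear_sample(img, 0.5, 0.5))  # 1.5
```

At integer positions the function returns the original pixel value, and in between it blends neighbors smoothly, which is exactly why resizing acts as a low-pass filter on high-frequency detail.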

Tensor Conversion

Next, the pixel values, which are originally stored as integers in the range [0, 255], are converted to floating-point numbers in the range [0, 1].

This step does not affect the spatial resolution or quality of the image. Instead, it ensures numerical stability in the network: keeping pixel intensities in the small range [0, 1] matches the scale the network was trained with and reduces the risk of numerical instability (such as exploding gradients during training).

Normalizing Pixel Values

The final step of this process is to normalize pixel values according to the ImageNet mean and standard deviation. This step ensures that the distribution of pixel intensities for the input image matches the distribution of images used during ResNet training. Each pixel in the image is normalized using the channel-specific mean and standard deviation for images from the ImageNet dataset. The mean and standard deviation for the RGB color channels are as follows:

Channel-wise Means: μ_red = 0.485, μ_green = 0.456, μ_blue = 0.406

Channel-wise Standard Deviations: σ_red = 0.229, σ_green = 0.224, σ_blue = 0.225

Each pixel in the image is normalized separately for each color channel. The normalization formula for a pixel p and a color channel c is given by: p_normalized(c) = (p_original(c) - μ(c)) / σ(c)

The effect of this step is to standardize inputs so that, over the ImageNet distribution, each channel has approximately zero mean and unit standard deviation. This normalization is conceptually similar to mean subtraction and variance normalization in signal processing and does not alter the quality of the image being fed into the model.
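
The tensor conversion and normalization steps can be sketched together in NumPy (a simplified stand-in for the actual preprocessing library calls; resizing to 224x224 is assumed to have already happened):

```python
import numpy as np

# ImageNet channel-wise statistics (R, G, B)
MEAN = np.array([0.485, 0.456, 0.406])
STD = np.array([0.229, 0.224, 0.225])

def prepare(img_uint8):
    """Convert an HxWx3 uint8 image into a normalized float array,
    mirroring the [0,255] -> [0,1] conversion and the per-channel
    ImageNet normalization described above."""
    img = img_uint8.astype(np.float32) / 255.0  # [0, 255] -> [0, 1]
    return (img - MEAN) / STD                   # per-channel standardization

# Example: a uniform mid-gray 224x224 image
gray = np.full((224, 224, 3), 128, dtype=np.uint8)
out = prepare(gray)
print(out[0, 0].round(3))  # three per-channel values after normalization
```

In a PyTorch pipeline the same two steps correspond to `ToTensor` followed by `Normalize` with these mean and standard-deviation constants.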

Analyzing the Results

Using Z-Score to Narrow Down the Most Affected Classes

To identify which classes are most affected by a degradation, we utilized z-score to quantify the degree of systematic confusion at the class level.

For each class, we measure the change in the top-1 class prediction probability for each image before and after applying degradation. The drop in the top-1 class probability is calculated as: drop_i = p_original - p_degraded Where p_original is the probability of the top-1 prediction before degradation, and p_degraded is the probability after degradation. We calculate this drop for the 10 images in each class, and compute the mean μ and standard deviation σ of the drops across all images in the dataset.

Using these dataset-wide statistics, we compute the z-score for each image's probability drop as: z_i = (drop_i - μ) / σ To score an entire class, we compute the mean of the z-scores across its 10 images. A mean z-score near zero indicates that the class responds to the degradation much like the dataset as a whole, while a large mean z-score indicates that the degradation causes disproportionately large drops in prediction probability for that class. Ultimately, this method allows us to pinpoint the classes that are most sensitive to specific degradations.
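
A sketch of this scoring procedure on synthetic data (the probability drops below are randomly generated, with one class deliberately made more sensitive; they are not our measured values):

```python
import numpy as np

rng = np.random.default_rng(1)
n_classes, n_per_class = 5, 10

# Hypothetical top-1 probability drops: rows = classes, columns = images
drops = rng.normal(0.1, 0.05, size=(n_classes, n_per_class))
drops[3] += 0.4  # pretend one class is hit much harder by the degradation

# z-score every drop against the dataset-wide mean and standard deviation
mu, sigma = drops.mean(), drops.std()
z = (drops - mu) / sigma

# Rank classes by mean z-score: the most affected class rises to the top
class_scores = z.mean(axis=1)
print("most affected class:", int(class_scores.argmax()))  # class 3
```

Because the z-scores are taken against dataset-wide statistics, a class whose images all receive large positive z-scores is reliably more sensitive to the degradation than the average class.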

Using AUC Scores to Quantify Uncertainty

After identifying the classes most affected by the degradation using the class average z-score method, we used AUC scores to quantify the level of uncertainty introduced by the degradation for specific classes.

To do this, we analyzed how the model's top-1 prediction changed after the degradation was applied. We focused on cases where the model's new top-1 prediction differed from its original top-1 prediction. The newly predicted class after degradation was treated as the "ground truth" for AUC analysis. For each image in the affected classes, we recorded the model's predicted probabilities for this "new ground truth" label and computed an ROC curve to assess how well the model ranked this label compared to other classes.

For example, after applying the pixel size degradation, images of fiddler crabs were frequently misclassified as hermit crabs. To quantify this confusion, we treated the hermit crab as the "ground truth" label and calculated the AUC score using the model’s predicted probabilities for the hermit crab across images from both the fiddler crab class and the hermit crab class. This approach allows us to measure how well the model distinguishes the two classes under the influence of degradation.
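
A sketch of this computation, using the rank-sum identity for AUC and invented probabilities for the hermit-crab label (not our actual model outputs):

```python
import numpy as np

def auc(labels, scores):
    """AUC via the rank-sum identity (assumes no tied scores)."""
    ranks = scores.argsort().argsort() + 1  # rank 1 = lowest score
    n_pos = labels.sum()
    n_neg = len(labels) - n_pos
    return (ranks[labels == 1].sum() - n_pos * (n_pos + 1) / 2) / (n_pos * n_neg)

# Hypothetical probabilities the model assigns to the "hermit crab" label on
# degraded images: 1 = hermit crab images, 0 = fiddler crab images
labels = np.array([1, 1, 1, 1, 0, 0, 0, 0])
p_hermit = np.array([0.9, 0.8, 0.7, 0.6, 0.5, 0.4, 0.65, 0.3])

# AUC near 1: the two classes are still separable under degradation;
# AUC near 0.5: the degradation has made them indistinguishable
print(auc(labels, p_hermit))  # 0.9375
```

One fiddler-crab image (scored 0.65) already outranks a true hermit crab here, pulling the AUC below 1; as the degradation level increases, more such inversions push the score toward 0.5.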

Figure: AUC score (y-axis) as a function of degradation level (x-axis), with hermit crab as the ground-truth class and fiddler crab as the comparison class. From left to right: no degradation (baseline), then 5 µm, 8 µm, and 11 µm pixels.

Results

Defocus Blur

The effects of applying defocus blur to images in our dataset suggest ResNet-18 relies heavily on patterns, textures, and colors rather than on shape features. This insight is critical for understanding the underlying mechanisms that drive the model's decision-making process.

Visual Evidence

To illustrate this point, we evaluated the worst-performing classes after applying the image degradation. For these classes, the model’s classification was incorrect for most of the images in the dataset. Among them, the tiger class was the most affected. Two particularly striking examples highlight this issue: in one case, the model mapped a white tiger to a Komondor dog, and in another, it mapped an orange tiger to a Leonberger. These examples are especially revealing. Despite the substantial differences in the shapes and structures of these animals, the model’s misclassifications suggest it struggles to distinguish between them, likely due to similarities in color and texture.

TODO: show image of tiger and komondor TODO: show image of tiger and leonberg

Further evidence is provided by the sharp drop in the model's confidence that the images belonged to the tiger class before and after applying the degradation. For the white tiger, the probability dropped from 0.78 to 0.003, and for the orange tiger, it dropped from 0.96 to 0.08. This dramatic reduction underscores the model's reliance on clear, high-frequency patterns. Without these patterns, the model appears to shift its reliance to surface-level color features, rather than using more abstract, shape-based distinctions.

Image          Class        P_initial   P_post
White tiger    tiger        0.78        0.003
White tiger    Komondor     0.003       0.32
Orange tiger   tiger        0.96        0.08
Orange tiger   Leonberger   0.001       0.74

Structural Distinctiveness as a Safeguard

In contrast, images of animals with distinct and rigid shape features, such as crabs, were far less affected by degradation. Even under significant image system degradations, misclassifications within the crab class were mostly intra-species errors (e.g., mistaking one type of crab for another) rather than cross-species errors. This pattern strongly suggests that ResNet-18's reliance on pattern and color features is a limiting factor, as classes with pronounced structural uniqueness remain robust to degradation.

TODO: insert table listing the top 5 worst, and the top 5 best (show that crabs fared really well because they are shape-based)

Additionally, we plotted the tiger class's AUC scores over the range of diopter values we applied, to understand how sensitive the model is to this degradation.

TODO: AUC curve for tiger

The consistent decline in AUC as blur increases (especially between blur-12 and blur-20) indicates that the model is highly sensitive to blur degradation. This suggests that the model relies heavily on fine-grained details to classify objects correctly. Of course, to fully understand the model's sensitivity, we need to evaluate it with more diopter values, especially in between the ones we already tested, to capture the specific curvature of the AUC plot.

All of these observations point to a broader conclusion: ResNet-18’s classification strategy appears to prioritize patterns and colors over shapes. This reliance makes the model vulnerable to misclassification in cases where patterns and colors are ambiguous or overlap across classes. Conversely, classes with distinct, shape-driven features demonstrate resilience to degradation, supporting the idea that shape-based recognition is more robust to visual noise. Our findings underscore the need to develop models that pay more attention to shape features, especially for applications where robustness to image degradations is essential.

Encouragingly, these findings are well-corroborated by the literature, specifically for ImageNet-trained CNNs (ResNet-18 included): https://arxiv.org/abs/1811.12231, https://arxiv.org/abs/2310.18894

Exposure Time

The effects of simulating a decreased exposure time on the images in our dataset suggest that the model is highly sensitive to noise.

To illustrate this point, we once again evaluated the worst-performing classes after applying the image degradation. The worst among them was the robin class. Once noise was applied to the images, ResNet-18 classified robins as a platypus, a hyena, and even a toaster. We attribute the model's deterioration to the injected noise: the model relies heavily on fine-grained details, and makes spurious predictions when its primary features are distorted. The simple prescription based on these results would be to fine-tune ResNet-18 on images captured under low-light conditions.

TODO: insert pictures of robin --> platypus TODO: insert pictures of robin --> hyena TODO: insert pictures of robin --> toaster

Looking at the robin AUC curve over the swept exposure times, we observe that for exposure times shorter than the baseline, model performance drops at a faster-than-linear rate, suggesting high sensitivity to noise. For exposure times longer than the baseline, the AUC score also drops, but less sharply. This suggests that the model is much more sensitive to low-light conditions (noise) than to overexposed images.

TODO: insert AUC curve

While this degradation did not provide as much insight into the model's decision-making process, it exposed how poorly the model fares when there is a perceptible, but not destructive, amount of noise.

Indeed, improving the ResNet architecture in low-light scenarios appears to be a significant area of study: https://www.researchgate.net/publication/373599865_Swin_transformer_and_ResNet_based_deep_networks_for_low-light_image_enhancement.

Pixel Size

TODO: table of exposure time top 5 and bottom 5 and pixel size top 5/bottom 5 to indicate the pixel size and exposure time classes correspond quite well to each other, which makes sense because pixel size big = exposure time long

Conclusion

Benchmarking model performance on datasets like ImageNet typically requires evaluating thousands of images, a process that is both time-consuming and computationally expensive. While effective, this approach can be inefficient for debugging specific failure points in models. Our proposed framework offers a more targeted approach by systematically applying controlled image system degradations to a much smaller set of images. This method allows for a more focused investigation of model weaknesses, enabling faster and more efficient debugging compared to traditional large-scale benchmarking.

Unlike reactive fine-tuning, which addresses model failures after they occur, our approach enables proactive diagnosis of failure modes. By exposing models to systematic degradations, we can reveal specific image properties that drive misclassifications. This understanding allows for targeted fine-tuning efforts, such as augmenting the training set with specific degradation types or modifying the model architecture to address identified weaknesses. As a result, our pipeline supports a more deliberate and resource-efficient strategy for improving model robustness.

Future work will expand this approach by incorporating a larger suite of degradation types. More targeted degradations, tailored to specific image properties, will enable a deeper exploration of model perception. Additionally, we aim to empirically validate our literature-corroborated findings by fine-tuning models based on the insights gained from systematic degradation analysis. This empirical validation will provide a clearer link between degradation-driven debugging and performance improvement, solidifying the effectiveness of the proposed pipeline.

By offering a systematic, lightweight alternative to traditional benchmarking, our approach empowers researchers and practitioners to diagnose and address model vulnerabilities with greater efficiency. This work highlights the value of controlled degradations as a tool for understanding model perception and improving robustness, thereby advancing the development of vision systems capable of withstanding the diverse and unpredictable nature of real-world imagery.
