Identifying Model Weaknesses Using Image System Degradations
== Introduction ==
Deep learning models, particularly convolutional neural networks (CNNs), have achieved remarkable success in image classification tasks. However, a significant gap exists between the controlled environments in which these models are trained and the unpredictable conditions of the real world. Most models are trained on clean, well-lit, high-resolution datasets, yet real-world images are often subject to a wide range of degradations. These degradations may arise from factors such as blur, poor lighting, sensor noise, and changes in image resolution. The resulting distribution shift poses a substantial challenge for model generalization, potentially leading to misclassifications and performance degradation.
Addressing this challenge requires a deeper understanding of how image system degradations impact model behavior. By systematically evaluating model performance under controlled degradations, it becomes possible to identify specific vulnerabilities and failure points. This insight not only informs the development of more robust models but also supports the design of lightweight debugging pipelines that can rapidly assess the weaknesses of existing models through targeted interventions. By enabling a faster and more focused evaluation of model robustness, lightweight debugging could serve as a practical and cost-effective tool for improving model performance under real-world conditions.
Our paper aims to bridge the gap between idealized training conditions and the realities of image system degradations, providing a clear pathway for efficiently diagnosing model weaknesses and informing targeted improvements. This work advances the broader goal of developing robust vision systems capable of operating effectively in real-world scenarios.
== Background ==
=== Z-Score ===
The z-score is a statistical measure that indicates how far a given value deviates from the mean of a distribution, expressed in units of standard deviation. A z-score of 0 indicates that the value equals the mean, while positive and negative z-scores indicate that the value is above or below the mean, respectively. Z-scores are useful for identifying outliers and understanding how unusual a given observation is relative to the overall data distribution. The z-score for an observation is calculated as:
<math>
z = \frac{x - \mu}{\sigma}
</math>
Where:
* <math>x</math> is the observed value,
* <math>\mu</math> is the mean of the distribution, and
* <math>\sigma</math> is the standard deviation of the distribution.
=== ROC and AUC ===
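As a minimal illustration of the formula above (written in Python purely for exposition; it is not part of our pipeline):

```python
import numpy as np

def z_score(x, values):
    """Z-score of observation x relative to the distribution `values`."""
    mu = np.mean(values)
    sigma = np.std(values)
    return (x - mu) / sigma

# Example: 12.0 lies 2 units above the mean (10.0) of this sample.
samples = np.array([8.0, 9.0, 10.0, 11.0, 12.0])
print(z_score(12.0, samples))  # 2 / sqrt(2) ≈ 1.414
```

Here the population standard deviation (`np.std` with its default `ddof=0`) is used; a sample standard deviation would simply use `ddof=1`.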
The Receiver Operating Characteristic (ROC) curve is a representation of a model's classification performance across different classification thresholds. Typically, the ROC is plotted as the True Positive Rate (TPR) against the False Positive Rate (FPR) while sweeping the threshold. The ROC curve makes it possible to assess the trade-off between the true positive rate and the false positive rate as the positive classification threshold varies. A model with perfect classification ability would have a curve that passes through the top-left corner (TPR = 1, FPR = 0), while a model that makes random predictions would produce a diagonal line from (0,0) to (1,1).
The Area Under the Curve (AUC) is a scalar value computed as the area under the ROC curve, and it quantifies the overall performance of the classifier. The AUC ranges from 0 to 1: a value of 1 indicates a perfect classifier, a value of 0.5 indicates a classifier that performs no better than random chance, and a value below 0.5 indicates a model that is worse than random guessing, i.e., one that systematically misclassifies. For our project, we use the AUC to quantify the uncertainty introduced by degradation. By analyzing how the AUC changes under increasing degradation, we can assess how well the model continues to distinguish between specific classes. If the AUC decreases significantly as the level of degradation increases, it indicates that the degradation is causing more confusion between classes.
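The AUC also has a useful rank interpretation: it equals the probability that a randomly chosen positive example is scored above a randomly chosen negative one (ties counting as half). A minimal sketch of this definition, with made-up scores for illustration:

```python
import numpy as np

def auc_score(labels, scores):
    """AUC computed directly as the probability that a randomly chosen
    positive example receives a higher score than a randomly chosen
    negative example (ties count as half a win)."""
    scores = np.asarray(scores, dtype=float)
    labels = np.asarray(labels)
    pos = scores[labels == 1]
    neg = scores[labels == 0]
    wins = (pos[:, None] > neg[None, :]).sum()
    ties = (pos[:, None] == neg[None, :]).sum()
    return (wins + 0.5 * ties) / (len(pos) * len(neg))

# A well-separated classifier scores 1.0; random scoring hovers near 0.5.
labels = [1, 1, 1, 0, 0, 0]
good = [0.9, 0.8, 0.7, 0.3, 0.2, 0.1]
print(auc_score(labels, good))  # 1.0
```

In practice a library routine such as scikit-learn's `roc_auc_score` computes the same quantity from the ROC curve.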
== Methods ==
=== Selecting the Model and Dataset ===
For this project, we selected ResNet-18 as the model and ImageNetV2 as the dataset. These choices were made to ensure a robust evaluation of how degradations of high-quality images affect the performance of a widely used classification model.
==== ResNet-18 ====
ResNet-18 is a convolutional neural network (CNN) released by Microsoft and is one of the smaller models in the ResNet family. We selected ResNet-18 for this project because of its lightweight architecture and its ability to provide rapid inference while still achieving strong performance on the ImageNet benchmark. Additionally, ResNet-18 has been widely used in image degradation and robustness studies, making it a strong baseline for evaluating how different types of degradation affect model performance.
==== ImageNetV2 ====
ImageNetV2 is a collection of 10,000 images that serves as an extension to the original ImageNet dataset. One of the key reasons for using ImageNetV2 was to test the performance of the ResNet-18 model on a set of images that are similar in distribution to ImageNet, but not seen by any pre-trained ImageNet models. This allows us to better understand how degradations affect generalization performance. We focused on the "Matched Frequency" subset of ImageNetV2, which is designed to have a class distribution similar to the original ImageNet dataset. This alignment ensures that the degradation effects are comparable to those seen on ImageNet.
=== Applying Image Degradations ===
==== Defocus Blur Degradation ====
Defocus blur is a key factor in the degradation of image quality. It occurs when light rays from a point source fail to converge at a single point on the image sensor, leading to a "blurred" appearance. The degree of defocus blur is commonly quantified in diopters (D), a unit of measurement equal to the reciprocal of the focal length (in meters) of an optical system. A higher diopter value corresponds to a shorter focal length and thus more severe defocus blur. For example, a lens with a focal length of 0.5 meters has a power of 2 D, while a lens with a focal length of 0.25 meters has a power of 4 D.
==== Zernike Polynomials and the C4 Coefficient ====
Zernike polynomials are a set of orthogonal polynomials defined on the unit disk, often used in optics to model wavefront aberrations in optical systems. Each Zernike term corresponds to a specific type of aberration, such as defocus, astigmatism, or coma. Defocus is particularly relevant to image system analysis, and it is represented by the coefficient <math> C_4 </math> in the Zernike polynomial expansion. The coefficient is directly related to the amount of defocus in the optical system.
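For reference, the defocus term (Noll index j = 4) has the radial form <math>\sqrt{3}\,(2\rho^2 - 1)</math> on the unit disk. A small sketch evaluating it, scaled by the <math>C_4</math> coefficient; note that normalization conventions for Zernike terms vary between references, so treat the <math>\sqrt{3}</math> factor as convention-dependent:

```python
import numpy as np

def zernike_defocus(rho, c4=1.0):
    """Defocus term of the Zernike expansion (Noll index j = 4),
    Z_4(rho) = sqrt(3) * (2*rho**2 - 1), scaled by coefficient C4.
    rho is the normalized radial coordinate on the unit pupil (0..1)."""
    return c4 * np.sqrt(3.0) * (2.0 * rho**2 - 1.0)

# Evaluate across the pupil radius: negative at the center, positive at the edge.
rho = np.linspace(0.0, 1.0, 5)
print(zernike_defocus(rho))
```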
==== Pixel Size Degradation ====
The process of pixel size degradation involves adjusting the physical size of each pixel on the image sensor. In this project, we altered pixel sizes to 5 µm, 8 µm, and 11 µm, affecting the sensor's light-collection ability and spatial resolution. Pixel size directly affects the signal-to-noise ratio (SNR), photon collection efficiency, and the modulation transfer function (MTF).
Pixel size refers to the physical dimensions of each individual pixel on an image sensor. Pixels convert incoming photons into an electrical signal. Smaller pixels have a smaller surface area, meaning they collect fewer photons, leading to a lower photon count. This reduction in photons increases the impact of photon shot noise, which follows a Poisson distribution. Conversely, larger pixels collect more light, resulting in a higher SNR but at the cost of reduced spatial resolution. In this project, the fill factor, which is the fraction of each pixel's surface area that actively collects light, was held constant, ensuring that the light-collecting area was proportional to the pixel size. Additionally, we used ISETCam's auto-exposure for all of the pixel sizes.
===== Impact on Image Quality =====
The following table summarizes the trade-offs observed as pixel size changes. Note that the entries describe relative changes between the pixel sizes chosen.
<center>
{| class="wikitable"
|+ Effect of Increasing Pixel Size on Various Parameters
! Pixel Size (µm) !! Resolution !! SNR !! MTF
|-
| 5 µm || Highest || Relatively low || High MTF at high frequencies
|-
| 8 µm || Medium || Moderate || Moderate MTF
|-
| 11 µm || Lower || Higher SNR || Lower MTF
|}
</center>
===== Discussion of Exposure Time and Channel Clipping =====
Exposure time and pixel size are closely related in how they affect the image sensor's ability to capture light. Lengthening the exposure time has an effect analogous to increasing the pixel size, because both methods increase the total number of photons collected at each pixel.
* Longer exposure times allow the pixel to collect photons for a longer period, increasing the total photon count.
* Larger pixels collect more photons because the physical area of the pixel is larger.
Both of these approaches improve the SNR: as the total number of collected photons increases, the relative impact of shot noise decreases. However, when pixel size or exposure time increases, there is a higher chance that the sensor will saturate. This occurs because the maximum charge capacity (full-well capacity) of the pixel is limited. When too many photons are captured, the pixel can no longer store the charge, and the value is clipped, capping certain pixel intensities at the maximum value.
Clipping typically affects the RGB color channels independently. In this project, we observed that certain channels were clipped during the pixel size degradation. Increasing the pixel size increases the photon collection rate, which can saturate any of the channels. If, for example, the red channel receives more light than the others (perhaps due to a stronger red component in the illumination), it may clip while the other channels remain within range. This effect is most noticeable in regions of the image where light is intense, and it appears as a tint in the saturated portions of the image.
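To make the pixel-size / SNR / clipping trade-off concrete, here is a toy Monte-Carlo sketch. The photon flux, pixel areas, and full-well capacity are hypothetical numbers chosen for illustration, not our ISETCam settings:

```python
import numpy as np

rng = np.random.default_rng(0)

def simulate_pixel(photon_flux, pixel_area, full_well, n_trials=10_000):
    """Simulate photon capture at one pixel: the mean photon count scales
    with pixel area, shot noise is Poisson, and counts clip at the
    full-well capacity. Returns (empirical SNR, fraction of clipped trials)."""
    mean_photons = photon_flux * pixel_area
    counts = rng.poisson(mean_photons, size=n_trials)
    counts = np.minimum(counts, full_well)  # charge saturation (clipping)
    snr = counts.mean() / counts.std()
    clipped_frac = (counts == full_well).mean()
    return snr, clipped_frac

# Larger pixel area -> more photons -> higher SNR (SNR ~ sqrt(N) for Poisson noise).
snr_small, _ = simulate_pixel(photon_flux=4.0, pixel_area=25, full_well=10_000)    # ~5 µm pixel
snr_large, _ = simulate_pixel(photon_flux=4.0, pixel_area=121, full_well=10_000)   # ~11 µm pixel
print(snr_small, snr_large)  # roughly 10 vs roughly 22

# With a small full-well capacity, the large pixel starts to clip.
_, clip_sat = simulate_pixel(photon_flux=4.0, pixel_area=121, full_well=500)
print(clip_sat)
```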
==== Exposure Time Degradation ====
In this project, we also altered the exposure time of the image sensor on its own to observe its impact on image quality and sensor performance. The exposure times tested were 0.005 s, 0.01 s, 0.1 s (baseline), and 1 s, with the pixel size set to 2 µm. By increasing exposure time, the sensor collects more photons at each pixel, which improves the signal-to-noise ratio (SNR) but increases the risk of saturation and clipping. Shorter exposure times reduce the total photon count, leading to lower SNR but protecting highlights and bright regions from overexposure.
<gallery>
File:Bird normal.jpeg|style="width:300px;height:auto;"|Original Image
File:Bird pixel size.png|style="width:300px;height:auto;"|8 µm Pixel Size
File:Bird exposure time.png|style="width:300px;height:auto;"|1 ms Exposure Time
</gallery>
=== Preparing Images for Model Inference ===
After applying degradations to the images, we prepare them for input into the ResNet model by applying a series of transformations.
==== Image Resizing ====
Each image must first be resized, by downsampling or upsampling, to 224x224 pixels. In this project, we used bilinear interpolation, a widely used method for image resizing in machine learning. Bilinear interpolation takes a weighted average of the 4 neighboring pixels in the original image to produce each output pixel. Unlike simpler methods such as nearest-neighbor, this approach avoids hard edges and sharp transitions between pixels. However, on top of the degradations already applied to the original image, resizing causes an additional loss of spatial detail and further blurs the high-frequency components. This resizing is an image degradation in itself and must be taken into account when considering the overall degradation applied to images entering the model. After all, the resizing process is analogous to reducing the resolution of a sensor.
To compute the intensity <math> I(x', y') </math> of a pixel at a non-integer position <math> (x', y') </math> in the resized image, we use the bilinear interpolation formula:
<math>
I(x', y') = (1 - dx)(1 - dy) \, I(x_1, y_1) + dx(1 - dy) \, I(x_2, y_1) + (1 - dx)dy \, I(x_1, y_2) + dx \, dy \, I(x_2, y_2)
</math>
Where:
* <math>(x_1, y_1), (x_2, y_1), (x_1, y_2)</math>, and <math>(x_2, y_2)</math> are the four nearest neighboring pixels in the original image.
* <math> dx = x' - \lfloor x' \rfloor </math> is the horizontal distance from <math>x'</math> to the left edge of the pixel.
* <math> dy = y' - \lfloor y' \rfloor </math> is the vertical distance from <math>y'</math> to the top edge of the pixel.
* <math>I(x_1, y_1), I(x_2, y_1), I(x_1, y_2), I(x_2, y_2) </math> are the intensities of the four neighboring pixels.
This process ensures that the closer the target pixel <math>(x', y')</math> is to a particular neighbor, the more influence that neighbor has on the intensity at <math>(x', y')</math>. Upsampling and downsampling both use these same ideas, but they differ in how pixel information is preserved and interpolated. During downsampling, pixel information is reduced by averaging or merging neighboring pixels into a smaller number of pixels, often leading to a loss of high-frequency details such as edges and fine textures. In contrast, upsampling increases the image size by creating new pixels whose values are interpolated from existing neighboring pixels, effectively filling in the gaps between existing pixel locations. While downsampling removes detail, upsampling cannot restore details that were not present in the original image, and it often results in smoother, more blurred images due to the interpolation process.
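The formula above can be sketched directly in code. This is a minimal single-pixel sampler for exposition, not the library routine we actually used for resizing:

```python
import numpy as np

def bilinear_sample(img, x, y):
    """Intensity at a non-integer position (x, y) via bilinear interpolation,
    following the formula above (x indexes columns, y indexes rows).
    Neighbors are clamped at the image border."""
    x1, y1 = int(np.floor(x)), int(np.floor(y))
    x2 = min(x1 + 1, img.shape[1] - 1)
    y2 = min(y1 + 1, img.shape[0] - 1)
    dx, dy = x - x1, y - y1
    return ((1 - dx) * (1 - dy) * img[y1, x1] + dx * (1 - dy) * img[y1, x2]
            + (1 - dx) * dy * img[y2, x1] + dx * dy * img[y2, x2])

img = np.array([[0.0, 1.0],
                [2.0, 3.0]])
print(bilinear_sample(img, 0.5, 0.5))  # center of the 2x2 patch -> 1.5
```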
==== Tensor Conversion ====
Additionally, the pixel values, which are originally stored as integers in the range <math>[0, 255]</math>, are converted to floating-point numbers in the range <math>[0, 1]</math>.
This step does not affect the spatial resolution or quality of the image. Instead, it ensures numerical stability in the network and matches the input scale the model was trained with; keeping pixel intensities in the small range <math>[0, 1]</math> reduces the risk of numerical instability during training and inference.
==== Normalizing Pixel Values ====
The final step of this process is to normalize pixel values according to the ImageNet mean and standard deviation. This ensures that the distribution of pixel intensities for the input image matches the distribution of images used during ResNet training. Each pixel in the image is normalized using the channel-specific mean and standard deviation for images from the ImageNet dataset. The mean and standard deviation for the RGB color channels are as follows:
Channel-wise Means:
<math>
\mu_{\text{red}} = 0.485, \mu_{\text{green}} = 0.456, \mu_{\text{blue}} = 0.406
</math>
Channel-wise Standard Deviations:
<math>
\sigma_{\text{red}} = 0.229, \sigma_{\text{green}} = 0.224, \sigma_{\text{blue}} = 0.225
</math>
Each pixel in the image is normalized separately for each color channel. The normalization formula for a pixel <math> p </math> and a color channel <math> c </math> is given by:
<math>
p_{\text{normalized}}(c) = \frac{p_{\text{original}}(c) - \mu(c)}{\sigma(c)}
</math>
The effect of this step is to standardize the input so that each channel has a mean of 0 and a standard deviation of 1. This normalization is conceptually similar to mean subtraction and variance normalization in signal processing and does not affect the image quality being fed into the model.
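The conversion and normalization steps can be sketched in a few lines. This NumPy version is illustrative only, equivalent in spirit to torchvision's ToTensor followed by Normalize (torchvision additionally reorders the axes to channels-first):

```python
import numpy as np

# ImageNet channel statistics (RGB), as listed above
MEAN = np.array([0.485, 0.456, 0.406])
STD = np.array([0.229, 0.224, 0.225])

def preprocess(img_uint8):
    """Convert an HxWx3 uint8 image to float in [0, 1], then apply
    channel-wise ImageNet normalization."""
    img = img_uint8.astype(np.float64) / 255.0
    return (img - MEAN) / STD

# A uniform mid-gray image maps to small positive values per channel.
gray = np.full((4, 4, 3), 128, dtype=np.uint8)
out = preprocess(gray)
print(out[0, 0])
```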
=== Analyzing the Results ===
==== Using Z-Score to Narrow Down the Most Affected Classes ====
To identify which classes are most affected by a degradation, we used the z-score to quantify the degree of systematic confusion at the class level.
To calculate the confusion for an entire class, we compute the mean of the z-scores across all 10 images. Classes with mean z-scores close to zero exhibit consistent confusion, meaning the degradation affects the classification probabilities of all images in a similar way. Classes with higher mean z-scores exhibit more random effects, as the degradation causes larger and more inconsistent changes in prediction probabilities across the 10 images. Ultimately, this method allows us to pinpoint the classes that are most sensitive to specific degradations.
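One way to realize this procedure is sketched below with hypothetical numbers: each image's drop in true-class probability is z-scored against the dataset-wide distribution of drops, and the z-scores are then averaged within each class. This is an illustrative reading of the method described above, not a verbatim copy of our pipeline:

```python
import numpy as np

def class_mean_z_scores(drops, class_ids):
    """Z-score each image's probability drop against the dataset-wide
    distribution of drops, then average the z-scores within each class."""
    drops = np.asarray(drops, dtype=float)
    class_ids = np.asarray(class_ids)
    z = (drops - drops.mean()) / drops.std()
    return {c: z[class_ids == c].mean() for c in np.unique(class_ids)}

# Two hypothetical classes of 10 images each: class 0 drops uniformly by 0.1,
# while class 1 shows large, erratic drops under the same degradation.
drops = [0.1] * 10 + [0.0, 0.9, 0.1, 0.8, 0.05, 0.95, 0.2, 0.7, 0.0, 1.0]
class_ids = [0] * 10 + [1] * 10
print(class_mean_z_scores(drops, class_ids))
```

With equally sized classes the per-class means must balance around zero, so the erratic class stands out with the larger mean z-score.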
==== Using AUC Scores to Quantify Uncertainty ====
After identifying the classes most affected by the degradation using the class-average z-score method, we used AUC scores to quantify the level of uncertainty introduced by the degradation for specific classes.
For example, after applying the pixel size degradation, images of fiddler crabs were frequently misclassified as hermit crabs. To quantify this confusion, we treated the hermit crab as the "ground truth" label and calculated the AUC score using the model's predicted probabilities for the hermit crab across images from both the fiddler crab class and the hermit crab class. This approach allows us to measure how well the model distinguishes the two classes under the influence of degradation.
[[File:crab_quant.png|center|600px|thumb|Figure showing the relationship between the level of degradation (x-axis) and the AUC scores (y-axis) for the hermit crab as the ground truth class, with fiddler crab as the comparison class. From left to right, we plot the AUC without any degradation (baseline), then with a 5 micron, an 8 micron, and an 11 micron pixel.]]
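The class-pair AUC computation can be sketched as follows. The probability values here are hypothetical, not our measured data; the scores are the model's predicted probability for the ground-truth class (hermit crab) on images from both classes:

```python
import numpy as np

def pairwise_auc(pos_scores, neg_scores):
    """AUC treating `pos_scores` (images of the ground-truth class, e.g.
    hermit crab) against `neg_scores` (images of the confused class, e.g.
    fiddler crab). Ties count as half a win."""
    pos = np.asarray(pos_scores, dtype=float)
    neg = np.asarray(neg_scores, dtype=float)
    wins = (pos[:, None] > neg[None, :]).mean()
    ties = (pos[:, None] == neg[None, :]).mean()
    return wins + 0.5 * ties

# Hypothetical P(hermit crab) under degradation: the model still mostly
# ranks true hermit-crab images above fiddler-crab images.
hermit = [0.9, 0.8, 0.85, 0.7]   # P(hermit) on hermit-crab images
fiddler = [0.3, 0.6, 0.75, 0.2]  # P(hermit) on fiddler-crab images
print(pairwise_auc(hermit, fiddler))  # 15/16 = 0.9375
```

Repeating this at each degradation level produces the curve shown in the figure above.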
== Results ==
=== Defocus Blur ===
The effects of applying defocus blur to images in our dataset suggest that ResNet-18 relies heavily on patterns, textures, and colors rather than on shape features. This insight is critical for understanding the underlying mechanisms that drive the model's decision-making process.
==== Visual Evidence ====
To illustrate this point, we evaluated the worst-performing classes after applying the image degradation. For these classes, the model's classification was incorrect for most of the images in the dataset. Among them, the tiger class was the most affected. Two particularly striking examples highlight this issue: in one case, the model mapped a white tiger, blurred with a diopter value of 12, to a Komondor dog, and in another image with the same amount of blurring, it mapped an orange tiger to a Leonberger. These examples are especially revealing. Despite the substantial differences in the shapes and structures of these animals, the model's misclassifications suggest it struggles to distinguish between them, likely due to similarities in color and texture.
[[File:Comp1_tiger_komondor.png|center|thumb|800px]]
[[File:Comp2.png|center|thumb|800px]]
To be clear, the Komondor and Leonberger images themselves were not chosen by the model; they are simply representations of what the animals look like.
Further evidence is provided by the sharp drop in the model's confidence that the images belonged to the tiger class after applying the degradation. For the white tiger, the probability dropped from 0.78 to 0.003, and for the orange tiger, it dropped from 0.96 to 0.08. This dramatic reduction underscores the model's reliance on clear, high-frequency patterns. Without these patterns, the model appears to shift its reliance to surface-level color features rather than more abstract, shape-based distinctions.
<center>
{| class="wikitable"
! Class !! <math> P_{\text{pre-degrade}} </math> !! <math> P_{\text{post-degrade}} </math>
|-
| Komondor || 0.003 || 0.32
|-
| Tiger (white) || 0.78 || 0.003
|-
| Leonberger || 0.001 || 0.74
|-
| Tiger (orange) || 0.96 || 0.08
|}
</center>
==== Structural Distinctiveness as a Safeguard ====
In contrast, images of animals with distinct and rigid shape features, such as crabs, were far less affected by degradation. Even under significant image system degradations, misclassifications within the crab class were mostly intra-species errors (e.g., mistaking one type of crab for another) rather than cross-species errors. This pattern strongly suggests that ResNet-18's reliance on pattern and color features is a limiting factor, as classes with pronounced structural uniqueness remain robust to degradation.
<center>
{| class="wikitable"
|+ Degradation Statistics for a Subset of Labels
! Ground Truth !! Degradation Level (diopters) !! Mean Drop !! Median Drop !! Std Drop
|-
| Tiger || 12 || 0.500600 || 0.470581 || 0.270326
|-
| Robin || 12 || 0.377581 || 0.277770 || 0.341606
|-
| Chickadee || 12 || 0.367711 || 0.206856 || 0.371604
|-
| King Crab || 12 || 0.264972 || 0.172101 || 0.309726
|}
</center>
Additionally, we plotted the tiger class's AUC scores with Komondor as the true class over the range of diopter values we applied, to understand how sensitive the model is to this degradation.
[[File:Tiger_komondor.png|center|thumb|600px|AUC values after applying varying degrees of blur to tiger and Komondor images, with Komondor as the ground truth.]]
The consistent decline in AUC as blur increases (especially between 12 and 20 diopters) indicates that the model is highly sensitive to blur degradation. This suggests that the model relies heavily on fine-grained details to classify objects correctly. Of course, to fully understand the model's sensitivity, we would need to evaluate it at more diopter values, especially between the ones we already tested, to capture the specific curvature of the AUC plot.
All of these observations point to a broader conclusion: ResNet-18's classification strategy appears to prioritize patterns and colors over shapes. This reliance makes the model vulnerable to misclassification in cases where patterns and colors are ambiguous or overlap across classes. Conversely, classes with distinct, shape-driven features demonstrate resilience to degradation, supporting the idea that shape-based recognition is more robust to visual noise. Our findings underscore the need to develop models that pay more attention to shape features, especially for applications where robustness to image degradations is essential.
Encouragingly, these findings are well corroborated by the literature, specifically for ImageNet-trained CNNs [https://arxiv.org/abs/1811.12231 (1)][https://arxiv.org/abs/2310.18894 (2)].
=== Exposure Time ===
The effects of simulating a decreased exposure time on the images in our dataset suggest that the model is highly sensitive to noise.
To illustrate this point, we once again evaluated the worst-performing classes after applying the image degradation. The worst among them was the robin class with an exposure time of <math> 5\times10^{-3} </math> seconds. Once noise was introduced into the images, ResNet-18 classified the robin as a platypus, a hyena, and a toaster. We believe the model's deterioration can be attributed to the injection of noise: it relies heavily on fine-grained details and makes spurious predictions when its primary features are distorted. There is a possibility that the model is again over-attentive to the colors of the image, which is supported by the toaster and platypus classifications. The simple prescription based on these results would be to fine-tune ResNet-18 on more low-light conditions.
[[File:Comp3.png|800px|center|thumb]]
[[File:Comp4.png|800px|center|thumb]]
[[File:Comp8.png|800px|center|thumb]]
While the platypus and hyena images were not taken from datasets, the red toaster was.
Looking at the robin AUC curve over the swept exposure times, we observe that for values to the left of the baseline, model performance drops at a faster-than-linear rate, suggesting high sensitivity to noise. To the right of the baseline exposure point, the AUC score drops, but by less. This suggests that the model is much more sensitive to low-light conditions (noise) than it is to overexposed images.
[[File:Auc2.png|center|600px|thumb]]
While this degradation did not provide as much insight into the model's decision-making process, it exposed how poorly the model fares when there is a perceptible, but not destructive, amount of noise.
Indeed, improving the ResNet architecture in low-light scenarios appears to be a significant area of study [https://www.researchgate.net/publication/373599865_Swin_transformer_and_ResNet_based_deep_networks_for_low-light_image_enhancement (3)].
=== Pixel Size ===
The relative ranking of results for pixel size closely mirrors the results for exposure time. However, the change in model confidence with increasing pixel size was much more drastic than with exposure time. We interpret this as the pixel size degradation being a composition of blurring and increased exposure time.
{| class="wikitable"
|+ Ranking of Classes by Mean Drop for Pixel Size 8 µm and Exposure Time 0.01 s
! Rank !! Ground Truth (class index) !! Mean Drop (Pixel Size 8 µm) !! Mean Drop (Exposure Time 0.01 s) !! Rank (Pixel Size 8 µm) !! Rank (Exposure Time 0.01 s)
|-
| 1 || 15 || 0.808365 || 0.566514 || 1 || 1
|-
| 2 || 292 || 0.731833 || 0.535144 || 2 || 2
|-
| 3 || 19 || 0.673408 || 0.462795 || 3 || 3
|-
| 4 || 120 || 0.653789 || 0.415552 || 4 || 4
|}
== Conclusion ==
Benchmarking model performance on datasets like ImageNet typically requires evaluating thousands of images, a process that is both time-consuming and computationally expensive. While effective, this approach can be inefficient for debugging specific failure points in models. Our proposed framework offers a more targeted approach by systematically applying controlled image system degradations to a much smaller set of images. This method allows for a more focused investigation of model weaknesses, enabling faster and more efficient debugging compared to traditional large-scale benchmarking.
Unlike reactionary fine-tuning, which addresses model failures after they occur, our approach enables proactive diagnosis of failure modes. By exposing models to systematic degradations, we can reveal specific image properties that drive misclassifications. This understanding allows for targeted fine-tuning efforts, such as augmenting the training set with specific degradation types or modifying the model architecture to address identified weaknesses. As a result, our pipeline supports a more deliberate and resource-efficient strategy for improving model robustness.
We are greatly encouraged by the fact that our conclusions about ResNet-18's behavior are well-supported by the literature, suggesting that our debugging pipeline is effective at exposing model weaknesses.
Future work will expand this approach by incorporating a larger suite of degradation types. More targeted degradations, tailored to specific image properties, will enable a deeper exploration of model perception. Additionally, we aim to empirically validate our literature-corroborated findings by fine-tuning models based on the insights gained from systematic degradation analysis. This empirical validation will provide a clearer link between degradation-driven debugging and performance improvement, solidifying the effectiveness of the proposed pipeline.
By offering a systematic, lightweight alternative to traditional benchmarking, our approach empowers researchers and practitioners to diagnose and address model vulnerabilities with greater efficiency. This work highlights the value of controlled degradations as a tool for understanding model perception and improving robustness, thereby advancing the development of vision systems capable of withstanding the diverse and unpredictable nature of real-world imagery.
== References ==
# [https://arxiv.org/abs/1811.12231 Geirhos et al., "ImageNet-trained CNNs are biased towards texture; increasing shape bias improves accuracy and robustness"]
# [https://www.researchgate.net/publication/373599865_Swin_transformer_and_ResNet_based_deep_networks_for_low-light_image_enhancement Xu et al., "Swin transformer and ResNet based deep networks for low-light image enhancement"]
# [https://arxiv.org/abs/2310.18894 Li et al., "Emergence of Shape Bias in Convolutional Neural Networks through Activation Sparsity"]
[https://chatgpt.com GPT-4] was used for writing support.
Latest revision as of 20:09, 13 December 2024
Introduction
Deep learning models, particularly convolutional neural networks (CNNs), have achieved remarkable success in image classification tasks. However, a significant gap exists between the controlled environments in which these models are trained and the unpredictable conditions of the real world. Most models are trained on clean, well-lit, and high-resolution datasets, yet real-world images are often subject to a wide range of degradations. These degradations may arise from factors such as blur, poor lighting, sensor noise, and changes in image resolution. The resulting distribution shift poses a substantial challenge for model generalization, potentially leading to misclassifications and performance degradation.
Addressing this challenge requires a deeper understanding of how image system degradations impact model behavior. By systematically evaluating model performance under controlled degradations, it becomes possible to identify specific vulnerabilities and failure points. This insight not only informs the development of more robust models but also supports the design of lightweight debugging pipelines for existing models that are able to rapidly assess model weaknesses through targeted interventions. By enabling a faster and more focused evaluation of model robustness, lightweight debugging could serve as a practical and cost-effective tool for improving model performance under real-world conditions.
Our paper aims to bridge the gap between idealized training conditions and the realities of image system degradations, and provide a clear pathway for efficiently diagnosing model weaknesses and informing targeted improvements. This work advances the broader goal of developing robust vision systems capable of operating effectively in real-world scenarios.
Background
Z-Score
The z-score is a statistical measure that indicates how many standard deviations a given value lies from the mean of a distribution. A z-score of 0 indicates that the value is equal to the mean, while positive and negative z-scores indicate that the value is above or below the mean, respectively. Z-scores are useful for identifying outliers and understanding how unusual a given observation is relative to the overall data distribution. The z-score for an observation is calculated as:

z = (x − μ) / σ

Where:
- x is the observed value,
- μ is the mean of the distribution, and
- σ is the standard deviation of the distribution.
ROC and AUC
The Receiver Operating Characteristic (ROC) curve is a representation of a model's classification performance across different classification thresholds. Typically, the ROC is plotted as the True Positive Rate (TPR) against the False Positive Rate (FPR) when sweeping various threshold values. The ROC curve provides the ability to assess the trade-off between the true positive rate and the false positive rate as we vary the positive classification threshold. A model with perfect classification ability would have a curve that passes through the top-left corner (TPR = 1, FPR = 0), while a model that makes random predictions would produce a diagonal line from (0,0) to (1,1).
The Area Under the (ROC) Curve (AUC) is a scalar value that is calculated by computing the area under the ROC curve and quantifies the overall performance of the classifier. The AUC score ranges from 0 to 1, where a value of 1 indicates a perfect classifier, a value of 0.5 indicates a classifier that performs no better than random chance, and a value below 0.5 indicates a model worse than random guessing, i.e., one that systematically ranks the wrong class higher. For our project, we use the AUC to quantify the uncertainty introduced by degradation. By analyzing how the AUC changes under increasing degradation, we can assess how well the model continues to distinguish between specific classes. If the AUC decreases significantly as the level of degradation increases, it indicates that the degradation is causing more confusion between classes.
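To make the AUC computation concrete, here is a minimal Python sketch (not the project's actual code) that computes the AUC directly from the rank identity: the AUC equals the probability that a randomly chosen positive example receives a higher score than a randomly chosen negative one.

```python
def roc_auc(labels, scores):
    # AUC equals the probability that a random positive outranks a random
    # negative (the Mann-Whitney U identity); ties count as one half.
    pos = [s for l, s in zip(labels, scores) if l == 1]
    neg = [s for l, s in zip(labels, scores) if l == 0]
    wins = sum(p > n for p in pos for n in neg)
    ties = sum(p == n for p in pos for n in neg)
    return (wins + 0.5 * ties) / (len(pos) * len(neg))

# A perfectly separable scorer reaches 1.0; shuffled scores hover near 0.5.
print(roc_auc([1, 1, 0, 0], [0.9, 0.8, 0.3, 0.1]))  # -> 1.0
print(roc_auc([1, 0, 1, 0], [0.4, 0.6, 0.7, 0.2]))  # -> 0.75
```

The pairwise comparison is quadratic in the number of images, which is fine for the small per-class samples used in this project.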
Methods
Selecting the Model and Dataset
For this project, we selected ResNet-18 as the model and ImageNetV2 as the dataset. These choices were made to ensure a robust evaluation of how degradations of high-quality images affect the performance of a widely used classification model.
ResNet-18
ResNet-18 is a convolutional neural network (CNN) released by Microsoft and is one of the smaller models in the ResNet family. We selected ResNet-18 for this project because of its lightweight architecture and its ability to provide rapid inference while still achieving strong performance on the ImageNet benchmark. Additionally, ResNet-18 has been widely used in image degradation and robustness studies, making it a strong baseline for evaluating how different types of degradation affect model performance.
ImageNetV2
ImageNetV2 is a collection of 10,000 images that serves as an extension to the original ImageNet dataset. One of the key reasons for using ImageNetV2 was to test the performance of the ResNet-18 model on a set of images that are similar in distribution to ImageNet, but not seen by any pre-trained ImageNet models. This allows us to better understand how degradations affect generalization performance. We focused on the "Matched Frequency" subset of ImageNetV2, which is designed to have a class distribution similar to the original ImageNet dataset. This alignment ensures that the degradation effects are comparable to those seen on ImageNet.
Applying Image Degradations
Defocus Blur Degradation
Defocus blur is a key factor in the degradation of image quality. It occurs when light rays from a point source fail to converge at a single point on the image sensor, leading to a "blurred" appearance. The degree of defocus blur is commonly quantified using diopters (D), a unit of measurement that describes the reciprocal of the focal length (in meters) of an optical system. A higher diopter value corresponds to a shorter focal length and thus more severe defocus blur. For example, a lens with a focal length of 0.5 meters has a diopter value of 2D, while a lens with a focal length of 0.25 meters has a diopter value of 4D.
Zernike Polynomials and the C4 Coefficient
Zernike polynomials are a set of orthogonal polynomials defined on the unit disk, often used in optics to model wavefront aberrations in optical systems. Each Zernike term corresponds to a specific type of aberration, such as defocus, astigmatism, or coma. Defocus is particularly relevant to image system analysis, and it is represented by the C4 coefficient in the Zernike polynomial expansion (defocus is term 4 in the ANSI/OSA single-index ordering). The C4 coefficient is directly related to the amount of defocus in the optical system.
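As an illustrative sketch, the defocus contribution to the wavefront can be evaluated directly from the polynomial. The form below assumes the ANSI/OSA single-index convention, in which defocus is term 4 with radial profile sqrt(3)·(2ρ² − 1):

```python
import math

def defocus_wavefront(c4, rho):
    # Zernike defocus term (ANSI/OSA single index 4):
    #   Z4(rho) = sqrt(3) * (2 * rho**2 - 1),
    # scaled by the coefficient c4. rho is the normalized pupil radius in [0, 1].
    return c4 * math.sqrt(3.0) * (2.0 * rho ** 2 - 1.0)

# The defocus aberration crosses zero at rho = 1/sqrt(2) and is largest
# in magnitude at the pupil center and edge.
for rho in (0.0, 1 / math.sqrt(2), 1.0):
    print(f"rho = {rho:.3f}: W = {defocus_wavefront(1.0, rho):+.3f}")
```

A larger C4 scales the whole defocus profile, which is why the coefficient maps directly to the severity of the blur.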
Pixel Size Degradation
The process of pixel size degradation involves adjusting the physical size of each pixel on the image sensor. In this project, we altered pixel sizes to 5 µm, 8 µm, and 11 µm, affecting the sensor’s light-collection ability and spatial resolution. Pixel size directly affects the signal-to-noise ratio (SNR), photon collection efficiency, and the modulation transfer function (MTF).
Pixel size refers to the physical dimensions of each individual pixel on an image sensor. Pixels convert incoming photons into an electrical signal. Smaller pixels have a smaller surface area, meaning they collect fewer photons, leading to a lower photon count. This reduction in photons increases the impact of photon shot noise, which follows a Poisson distribution. Conversely, larger pixels collect more light, resulting in a higher SNR but at the cost of reduced spatial resolution. In this project, the fill factor, which is the fraction of each pixel's surface area that actively collects light, was held constant, ensuring that the light-collecting area was proportional to the pixel area. Additionally, we used ISETCam's auto-exposure for all of the pixel sizes.
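The photon-count argument can be sketched numerically. Under a shot-noise-limited model with constant fill factor, the collected photon count scales with pixel area and SNR = sqrt(N); the photon flux used below is an illustrative placeholder, not a measured value:

```python
import math

def shot_noise_snr(pixel_size_um, photons_per_um2=100.0):
    # With a constant fill factor, the photon count N scales with pixel area;
    # Poisson shot noise has standard deviation sqrt(N), so SNR = N / sqrt(N) = sqrt(N).
    n = photons_per_um2 * pixel_size_um ** 2
    return math.sqrt(n)

# Under this model, SNR grows linearly with pixel pitch.
for size_um in (5, 8, 11):  # the pixel sizes used in this project
    print(f"{size_um:2d} um pixel -> SNR ~ {shot_noise_snr(size_um):.0f}")
```

This is only the shot-noise term; a real sensor model (as in ISETCam) also includes read noise, dark current, and the MTF effects discussed below.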
Impact on Image Quality
The following table summarizes the trade-offs observed as pixel size changes. Note that the changes in the table discuss the relative changes between the different pixel sizes chosen.
| Pixel Size (µm) | Resolution | SNR | MTF |
|---|---|---|---|
| 5 µm | Highest | Relatively low | High MTF at high frequencies |
| 8 µm | Medium | Moderate | Moderate MTF |
| 11 µm | Lower | Higher SNR | Lower MTF |
Discussion of Exposure Time and Channel Clipping
Exposure time and pixel size are closely related in how they affect the image sensor's ability to capture light. Lengthening the exposure time has a similar effect to increasing the pixel size, because both increase the total number of photons collected at each pixel.
- Longer exposure times allow the pixel to collect photons for a longer period, increasing the total photon count.
- Larger pixel sizes collect more photons because the physical area of the pixel is larger.
Both of these approaches improve the SNR, as the total number of collected photons increases, reducing the relative impact of shot noise. However, when pixel size increases or exposure time increases, there is a higher chance that the sensor will saturate. This occurs because the maximum charge capacity of the pixel is limited. When too many photons are captured, the pixel can no longer store the charge, and it becomes clipped. This results in certain pixel intensities being capped at the maximum intensity value.
Clipping typically affects the RGB color channels independently. In this project, it was observed that certain channels were clipped during the pixel degradation process. This is because increasing the pixel size increases the photon collection rate, which can saturate any of the channels. If, for example, the red channel receives more light than the other channels (perhaps due to a stronger red component in the illumination), it may clip while the other channels remain within range. This effect is most noticeable in regions of the image where light is intense, and will appear to tint portions of the image based on the saturated channel.
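The per-channel nature of clipping can be illustrated with a toy simulation (all scene values and gains below are assumed for illustration only): scaling a red-heavy scene until its brightest channel exceeds the maximum representable value saturates red first, tinting highlight regions.

```python
import numpy as np

rng = np.random.default_rng(0)
full_well = 255  # illustrative maximum digital value

# A small synthetic scene with a red-tinted illuminant (assumed values).
scene = rng.uniform(0.4, 0.9, size=(4, 4, 3))
scene[..., 0] *= 1.5  # boost the red channel

for gain in (100, 200, 400):  # stand-in for pixel area / exposure scaling
    clipped = np.clip(scene * gain, 0, full_well)
    # Fraction of saturated pixels in each of the R, G, B channels.
    frac = [(clipped[..., c] == full_well).mean() for c in range(3)]
    print(gain, ["%.2f" % f for f in frac])
```

At low gain no channel clips; as gain grows, the red channel saturates before green and blue, mirroring the channel-specific clipping observed in the pixel size experiments.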
Exposure Time Degradation
In this project, we also altered the exposure time of the image sensor by itself to observe its impact on image quality and sensor performance. The exposure times tested were 0.005 s, 0.01 s, 0.1 s (baseline), and 1 s, with the pixel size set to 2 µm. By increasing exposure time, the sensor collects more photons at each pixel, which improves the signal-to-noise ratio (SNR) but increases the risk of saturation and clipping. Shorter exposure times reduce the total photon count, leading to lower SNR but preserving highlights and bright regions from overexposure.
Example images (figures omitted): the original image, the 8 µm pixel size rendering, and the 1 ms exposure time rendering.
Preparing Images for Model Inference
After applying degradations to the images, we prepare them for input into the ResNet model by applying a series of transformations to the images.
Image Resizing
Each image must first be resized through downsampling or upsampling to 224x224 pixels. In this project, we used bilinear interpolation, a widely used method for image resizing in machine learning. Bilinear interpolation involves taking a weighted average of 4 neighboring pixels from the original image and combining them into one pixel. Unlike simpler methods such as nearest-neighbor, this approach avoids hard edges and sharp transitions between pixels. However, resizing either discards spatial detail (downsampling) or introduces additional blurring (upsampling), further attenuating high-frequency components on top of the degradations already applied to the image. This resizing is an image degradation in itself and must be taken into account when considering the overall degradation of the images being input into the model. After all, this resizing process is analogous to reducing the resolution of a sensor.
To compute the intensity of a pixel at a non-integer position (x, y) in the resized image, we use the bilinear interpolation formula:

I(x, y) = (1 − dx)(1 − dy)·I_11 + dx(1 − dy)·I_21 + (1 − dx)·dy·I_12 + dx·dy·I_22

Where:
- I_11, I_21, I_12, and I_22 are the intensities of the four nearest neighboring pixels in the original image (top-left, top-right, bottom-left, and bottom-right),
- dx is the horizontal distance from (x, y) to the column of the left-hand neighbors, and
- dy is the vertical distance from (x, y) to the row of the top neighbors.
This process ensures that the closer the target pixel is to a particular neighbor, the more influence that neighbor has on the interpolated intensity at (x, y). The upsampling and downsampling processes both use these same ideas, but they differ in how pixel information is preserved and interpolated. During downsampling, pixel information is reduced by averaging or merging neighboring pixels into a smaller number of pixels, often leading to a loss of high-frequency details like edges and fine textures. In contrast, upsampling increases the image size by creating new pixels whose values are interpolated from existing neighboring pixels, effectively filling in the gaps between existing pixel locations. While downsampling removes detail, upsampling cannot restore details that were not present in the original image, and it often results in smoother, more blurred images due to the interpolation process.
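The interpolation formula above can be written out directly. This minimal sketch (not the project's actual code) samples a grayscale image, stored as a list of rows, at a fractional coordinate:

```python
def bilinear_sample(img, x, y):
    # img is a 2D grayscale image indexed as img[row][col].
    # The value at a fractional position (x, y) is a weighted average of the
    # four nearest neighbors, each weighted by its proximity to (x, y).
    x0, y0 = int(x), int(y)
    x1, y1 = min(x0 + 1, len(img[0]) - 1), min(y0 + 1, len(img) - 1)
    dx, dy = x - x0, y - y0
    top = (1 - dx) * img[y0][x0] + dx * img[y0][x1]
    bottom = (1 - dx) * img[y1][x0] + dx * img[y1][x1]
    return (1 - dy) * top + dy * bottom

img = [[0, 100],
       [100, 200]]
print(bilinear_sample(img, 0.5, 0.5))   # -> 100.0 (midpoint of the four pixels)
print(bilinear_sample(img, 0.25, 0.0))  # -> 25.0 (closer to the top-left pixel)
```

A full resize maps each target pixel back to a fractional source coordinate and samples it this way; library implementations such as PIL's bilinear mode add edge handling and antialiasing on top of this core idea.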
Tensor Conversion
Additionally, the pixel values, which are originally stored as integers in the range [0, 255], are converted to floating-point numbers in the range [0, 1].
This step does not affect the spatial resolution or quality of the image. Instead, it is intended to ensure numerical stability during the forward pass of the neural network. By confining pixel intensities to the small range [0, 1], the risk of numerical instability (and, during training, gradient explosion in backpropagation) is reduced.
Normalizing Pixel Values
The final step of this process is to normalize pixel values according to the ImageNet mean and standard deviation. This step ensures that the distribution of pixel intensities for the input image matches the distribution of images used during ResNet training. Each pixel in the image is normalized using the channel-specific mean and standard deviation for images from the ImageNet dataset. The mean and standard deviation for the RGB color channels are as follows:
Channel-wise Means: μ = (0.485, 0.456, 0.406)
Channel-wise Standard Deviations: σ = (0.229, 0.224, 0.225)
Each pixel in the image is normalized separately for each color channel. The normalization formula for a pixel value x in color channel c is given by:

x̂ = (x − μ_c) / σ_c
The effect of this step is to standardize the input image so that it has a mean of 0 and a standard deviation of 1 for each channel. This normalization step is conceptually similar to mean subtraction and variance normalization in signal processing and does not have any effect on the image quality being fed into the model.
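The full conversion and normalization can be sketched in a few lines of NumPy, as a stand-in for torchvision's `ToTensor` and `Normalize` transforms (omitting the HWC-to-CHW reordering that PyTorch also performs):

```python
import numpy as np

# Channel statistics of the ImageNet training set, as used by pretrained ResNets.
MEAN = np.array([0.485, 0.456, 0.406])
STD = np.array([0.229, 0.224, 0.225])

def preprocess(img_uint8):
    # img_uint8: H x W x 3 array of integers in [0, 255].
    # Step 1: scale to floats in [0, 1]. Step 2: normalize each channel
    # using the ImageNet channel means and standard deviations.
    x = img_uint8.astype(np.float32) / 255.0
    return (x - MEAN) / STD

# A pixel sitting at the channel means maps to approximately zero everywhere.
mean_pixel = np.round(MEAN * 255).astype(np.uint8).reshape(1, 1, 3)
print(preprocess(mean_pixel))
```

Because the same statistics were used during ResNet training, this step aligns the degraded inputs with the distribution the model expects without altering image content.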
Analyzing the Results
Using Z-Score to Narrow Down the Most Affected Classes
To identify which classes are most affected by a degradation, we utilized z-score to quantify the degree of systematic confusion at the class level.
For each class, we measure the change in the top-1 class prediction probability for each image before and after applying degradation. The drop in the top-1 class probability is calculated as:

Δp = p_before − p_after

Where p_before is the probability of the top-1 prediction before degradation, and p_after is the probability after degradation. We calculate this drop for the 10 images per class and compute the mean μ_Δp and standard deviation σ_Δp of these probability drops.
Using these statistics, we compute the z-score for each image's probability drop as:

z_i = (Δp_i − μ_Δp) / σ_Δp

To calculate the confusion for the entire class, we compute the mean of the z-scores across all 10 images. Classes with mean z-scores close to zero exhibit consistent confusion, meaning the degradation affects the classification probabilities for all images in a similar way. Classes with higher mean z-scores exhibit more random effects, as the degradation causes larger and more inconsistent changes in prediction probabilities across the 10 images. Ultimately, this method allows us to pinpoint the classes that are most sensitive to specific degradations.
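A minimal sketch of this computation, using hypothetical probabilities and standardizing the drops against their own class statistics as described above:

```python
import statistics

def probability_drop_z_scores(p_before, p_after):
    # Per-image drop in top-1 probability, standardized by the mean and
    # (population) standard deviation of the drops within the class.
    drops = [b - a for b, a in zip(p_before, p_after)]
    mu = statistics.mean(drops)
    sigma = statistics.pstdev(drops)
    return [(d - mu) / sigma for d in drops]

# Hypothetical top-1 probabilities for one class, before and after degradation.
before = [0.92, 0.88, 0.95, 0.90]
after = [0.10, 0.75, 0.15, 0.82]
print(probability_drop_z_scores(before, after))
```

Images whose probability collapses much more than the class average receive large positive z-scores, flagging the class as inconsistently affected by the degradation.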
Using AUC Scores to Quantify Uncertainty
After identifying the classes most affected by the degradation using the class average z-score method, we used AUC scores to quantify the level of uncertainty introduced by the degradation for specific classes.
To do this, we analyzed how the model's top-1 prediction changed after the degradation was applied. We focused on cases where the model's new top-1 prediction differed from its original top-1 prediction. The newly predicted class after degradation was treated as the "ground truth" for AUC analysis. For each image in the affected classes, we recorded the model's predicted probabilities for this "new ground truth" label and computed an ROC curve to assess how well the model ranked this label compared to other classes.
For example, after applying the pixel size degradation, images of fiddler crabs were frequently misclassified as hermit crabs. To quantify this confusion, we treated the hermit crab as the "ground truth" label and calculated the AUC score using the model’s predicted probabilities for the hermit crab across images from both the fiddler crab class and the hermit crab class. This approach allows us to measure how well the model distinguishes the two classes under the influence of degradation.

Results
Defocus Blur
The effects of applying defocus blur to images in our dataset suggest ResNet-18 relies heavily on patterns, textures, and colors rather than on shape features. This insight is critical for understanding the underlying mechanisms that drive the model's decision-making process.
Visual Evidence
To illustrate this point, we evaluated the worst-performing classes after applying the image degradation. For these classes, the model’s classification was incorrect for most of the images in the dataset. Among them, the tiger class was the most affected. Two particularly striking examples highlight this issue: in one case, the model mapped a white tiger, blurred with a diopter value of 12, to a Komondor dog, and in another image with the same amount of blurring, it mapped an orange tiger to a Leonberger. These examples are especially revealing. Despite the substantial differences in the shapes and structures of these animals, the model’s misclassifications suggest it struggles to distinguish between them, likely due to similarities in color and texture.


To be clear, the Komondor and Leonberger images themselves were not chosen by the model; they are simply representations of what the animals look like.
Further evidence is provided by the sharp drop in the model's confidence that the images belonged to the tiger class before and after applying the degradation. For the white tiger, the probability dropped from 0.78 to 0.003, and for the orange tiger, it dropped from 0.96 to 0.08. This dramatic reduction underscores the model's reliance on clear, high-frequency patterns. Without these patterns, the model appears to shift its reliance to surface-level color features, rather than using more abstract, shape-based distinctions.
| Class | Probability (Before Blur) | Probability (After Blur) |
|---|---|---|
| Komondor | 0.003 | 0.32 |
| Tiger (white) | 0.78 | 0.003 |
| Leonberger | 0.001 | 0.74 |
| Tiger (orange) | 0.96 | 0.08 |
Structural Distinctiveness as a Safeguard
In contrast, images of animals with distinct and rigid shape features, such as crabs, were far less affected by degradation. Even under significant image system degradations, misclassifications within the crab class were mostly intra-species errors (e.g., mistaking one type of crab for another) rather than cross-species errors. This pattern strongly suggests that ResNet-18's reliance on pattern and color features is a limiting factor, as classes with pronounced structural uniqueness remain robust to degradation.
| Ground Truth | Degradation Level (diopters) | Mean Probability Drop | Median Probability Drop | Std. of Probability Drop |
|---|---|---|---|---|
| Tiger | 12 | 0.500600 | 0.470581 | 0.270326 |
| Robin | 12 | 0.377581 | 0.277770 | 0.341606 |
| Chickadee | 12 | 0.367711 | 0.206856 | 0.371604 |
| King Crab | 12 | 0.264972 | 0.172101 | 0.309726 |
Additionally, we plotted the tiger class's AUC scores with Komondor as the true class over the range of diopter values we applied, to understand how sensitive the model is to this degradation.

The consistent decline in AUC as blur increases (especially between 12 and 20 diopters) indicates that the model is highly sensitive to blur degradation. This suggests that the model relies heavily on fine-grained details to classify objects correctly. Of course, to fully characterize the model's sensitivity, we would need to evaluate additional diopter values, particularly between the ones already tested, to capture the specific curvature of the AUC plot.
All of these observations point to a broader conclusion: ResNet-18’s classification strategy appears to prioritize patterns and colors over shapes. This reliance makes the model vulnerable to misclassification in cases where patterns and colors are ambiguous or overlap across classes. Conversely, classes with distinct, shape-driven features demonstrate resilience to degradation, supporting the idea that shape-based recognition is more robust to visual noise. Our findings underscore the need to develop models that pay more attention to shape features, especially for applications where robustness to image degradations is essential.
Encouragingly, these findings are well-corroborated by the literature, specifically for ImageNet-trained CNNs [1][2].
Exposure Time
The effects of simulating a decreased exposure time on the images in our dataset suggest that the model is highly sensitive to noise.
To illustrate this point, we once again evaluated the worst-performing classes after applying the image degradation. The worst among them was the robin class at a shortened exposure time. Once noise dominated the images, ResNet-18 classified the robin as a platypus, a hyena, and a toaster. We believe the model's deterioration can be attributed to the injection of noise: it relies heavily on fine-grained details and makes spurious predictions when its primary features are distorted. It is also possible that the model is again over-attentive to the colors of the image, which is supported by the toaster and platypus classifications. The simple prescription based on these results would be to fine-tune ResNet-18 on more images captured under low-light conditions.



While the platypus and hyena images were not taken from datasets, the red toaster was.
Looking at the robin AUC curve over the swept exposure times, we observe that for the values to the left of the baseline, the model performance drops at a rate faster than linear. This suggests that it is highly sensitive to noise. To the right of the baseline exposure point, the AUC score drops, but not as much. This suggests that the model is much more sensitive to low-light conditions (noise), than it is to overexposed images.

While this degradation did not provide as much insight into the model's decision-making process, it exposed how terribly it fares when there is a perceptible, but not destructive, amount of noise.
Indeed, improving the ResNet architecture in low-light scenarios appears to be a significant area of study [3].
Pixel Size
The relative ranking of the worst-affected classes under pixel size degradation closely mirrors the ranking under exposure time degradation. However, the drop in model confidence with increasing pixel size was much more drastic than for the exposure times. We interpret this as the pixel size degradation being a composition of blurring and increased exposure time.
| Rank | Ground Truth | Mean Drop (Pixel Size 8 µm) | Mean Drop (Exposure Time 0.01 s) | Rank (Pixel Size 8 µm) | Rank (Exposure Time 0.01 s) |
|---|---|---|---|---|---|
| 1 | Robin (class 15) | 0.808365 | 0.566514 | 1 | 1 |
| 2 | Tiger (class 292) | 0.731833 | 0.535144 | 2 | 2 |
| 3 | Chickadee (class 19) | 0.673408 | 0.462795 | 3 | 3 |
| 4 | Fiddler Crab (class 120) | 0.653789 | 0.415552 | 4 | 4 |
Conclusion
Benchmarking model performance on datasets like ImageNet typically requires evaluating thousands of images, a process that is both time-consuming and computationally expensive. While effective, this approach can be inefficient for debugging specific failure points in models. Our proposed framework offers a more targeted approach by systematically applying controlled image system degradations to a much smaller set of images. This method allows for a more focused investigation of model weaknesses, enabling faster and more efficient debugging compared to traditional large-scale benchmarking.
Unlike reactive fine-tuning, which addresses model failures after they occur, our approach enables proactive diagnosis of failure modes. By exposing models to systematic degradations, we can reveal specific image properties that drive misclassifications. This understanding allows for targeted fine-tuning efforts, such as augmenting the training set with specific degradation types or modifying the model architecture to address identified weaknesses. As a result, our pipeline supports a more deliberate and resource-efficient strategy for improving model robustness.
We are greatly encouraged by the fact that our conclusions about ResNet-18's behavior are well-supported by the literature, suggesting that our debugging pipeline is effective at exposing model weaknesses.
Future work will expand this approach by incorporating a larger suite of degradation types. More targeted degradations, tailored to specific image properties, will enable a deeper exploration of model perception. Additionally, we aim to empirically validate our literature-corroborated findings by fine-tuning models based on the insights gained from systematic degradation analysis. This empirical validation will provide a clearer link between degradation-driven debugging and performance improvement, solidifying the effectiveness of the proposed pipeline.
By offering a systematic, lightweight alternative to traditional benchmarking, our approach empowers researchers and practitioners to diagnose and address model vulnerabilities with greater efficiency. This work highlights the value of controlled degradations as a tool for understanding model perception and improving robustness, thereby advancing the development of vision systems capable of withstanding the diverse and unpredictable nature of real-world imagery.
References
1. Geirhos et al., "ImageNet-trained CNNs are biased towards texture; increasing shape bias improves accuracy and robustness." https://arxiv.org/abs/1811.12231
2. Li et al., "Emergence of Shape Bias in Convolutional Neural Networks through Activation Sparsity." https://arxiv.org/abs/2310.18894
3. Xu et al., "Swin transformer and ResNet based deep networks for low-light image enhancement." https://www.researchgate.net/publication/373599865_Swin_transformer_and_ResNet_based_deep_networks_for_low-light_image_enhancement
GPT-4 was used for writing support.