<?xml version="1.0"?>
<feed xmlns="http://www.w3.org/2005/Atom" xml:lang="en">
	<id>http://vista.su.domains/psych221wiki/api.php?action=feedcontributions&amp;feedformat=atom&amp;user=Annayu</id>
	<title>Psych 221 Image Systems Engineering - User contributions [en]</title>
	<link rel="self" type="application/atom+xml" href="http://vista.su.domains/psych221wiki/api.php?action=feedcontributions&amp;feedformat=atom&amp;user=Annayu"/>
	<link rel="alternate" type="text/html" href="http://vista.su.domains/psych221wiki/index.php?title=Special:Contributions/Annayu"/>
	<updated>2026-04-18T14:36:58Z</updated>
	<subtitle>User contributions</subtitle>
	<generator>MediaWiki 1.45.3</generator>
	<entry>
		<id>http://vista.su.domains/psych221wiki/index.php?title=Neural_Network_Implementation_of_S-CIELAB_for_Perceptual_Color_Metrics&amp;diff=145793</id>
		<title>Neural Network Implementation of S-CIELAB for Perceptual Color Metrics</title>
		<link rel="alternate" type="text/html" href="http://vista.su.domains/psych221wiki/index.php?title=Neural_Network_Implementation_of_S-CIELAB_for_Perceptual_Color_Metrics&amp;diff=145793"/>
		<updated>2025-12-10T04:32:35Z</updated>

		<summary type="html">&lt;p&gt;Annayu: /* References */&lt;/p&gt;
&lt;hr /&gt;
&lt;div&gt;== Abstract ==&lt;br /&gt;
The Spatial CIELAB (S-CIELAB) metric is a widely used perceptual color-difference measure that incorporates both chromatic appearance and spatial properties of the human visual system. Despite its accuracy, S-CIELAB is computationally expensive due to its multi-stage processing pipeline, including opponent-color transformation, frequency-dependent spatial filtering, and nonlinear post-processing. Moreover, these filtering operations rely on fixed convolution kernels and nonlinearities that are typically not differentiable in a manner compatible with gradient-based optimization, making S-CIELAB difficult to integrate directly into learning-based imaging systems.&lt;br /&gt;
&lt;br /&gt;
This project investigates whether a neural network can learn a surrogate model that predicts S-CIELAB responses efficiently from local image patches. Using the TID2013 dataset, we develop two surrogate models: 1) a Multi-Layer Perceptron (MLP) trained on 18-dimensional XYZ-based statistical descriptors of local patches, and 2) a Convolutional Neural Network (CNN) trained to directly map 6-channel XYZ patch pairs to full-resolution ∆E maps. The MLP achieves R ≈ 0.94 and RMSE ≈ 0.96 for patch-mean ∆E prediction, while the CNN achieves R ≈ 0.96 and RMSE ≈ 1.85 for per-pixel ∆E prediction.&lt;br /&gt;
&lt;br /&gt;
These results show that a compact neural model can effectively approximate S-CIELAB while being fast and fully differentiable, enabling its potential use as a perceptual loss or quality metric in modern imaging pipelines.&lt;br /&gt;
&lt;br /&gt;
== Introduction ==&lt;br /&gt;
Perceptual color‐difference metrics play a fundamental role in modern imaging pipelines, image compression, and quality assessment systems. They aim to quantify human-perceived differences between a reference image and its distorted version, allowing algorithms to optimize not only pixel-wise accuracy but also perceptual fidelity. Among these metrics, S-CIELAB (Spatial CIELAB) has become one of the most influential extensions of the classic CIELAB ΔE formulation. Unlike conventional ΔE, which compares colors independently per pixel, S-CIELAB incorporates spatial filtering stages that approximate the frequency-dependent sensitivity of the human visual system (HVS). As a result, the metric aligns more closely with human perceived color differences, particularly in images containing high-frequency textures, blur, noise, or structured distortions. &lt;br /&gt;
&lt;br /&gt;
Although S-CIELAB significantly improves perceptual accuracy, its multi-stage processing pipeline involves large kernel spatial convolutions, nonlinear transformations, and piecewise or non-differentiable operations. This makes S-CIELAB slow to compute at scale and fundamentally incompatible with gradient-based optimizations.&lt;br /&gt;
&lt;br /&gt;
A patch-based formulation is particularly suitable for learning a surrogate model of S-CIELAB. S-CIELAB itself operates locally: its spatial filtering approximates the HVS contrast sensitivity over limited visual angles, and its ΔE computation depends primarily on neighborhood-level color differences rather than global structure. By extracting fixed-size patches corresponding to a constant visual angle (2°×2° in this work), we preserve the locality intrinsic to the metric while avoiding the need to model long-range correlations. Patch-based learning also increases the number of training samples significantly, improving statistical robustness, and allows the model to focus on local statistical features, such as mean chromaticity shifts or contrast changes, that most strongly influence S-CIELAB responses. This makes the surrogate easier to train, more compact, and more generalizable across diverse distortion types.&lt;br /&gt;
&lt;br /&gt;
To bridge the limitations of S-CIELAB and the needs of modern neural pipelines, recent research has explored learning surrogate models that mimic perceptual metrics while remaining computationally efficient and differentiable. A learnable surrogate model for S-CIELAB would allow imaging systems to optimize directly for perceptual color fidelity, improving their alignment with human judgments. In this project, we investigate whether a compact Multi-Layer Perceptron (MLP) or a Convolutional Neural Network (CNN) can learn to reproduce the S-CIELAB ΔE response at the patch level. Instead of training on synthetic images, we leverage the widely used TID2013 dataset, which provides 25 reference images and 24 distortion types across five severity levels.&lt;br /&gt;
&lt;br /&gt;
The objective of this work is twofold:&lt;br /&gt;
# to evaluate whether a simple neural model can approximate S-CIELAB with high accuracy, and&lt;br /&gt;
# to explore the feasibility of replacing costly perceptual metrics with lightweight, differentiable surrogates suitable for future imaging applications.&lt;br /&gt;
&lt;br /&gt;
== Background ==&lt;br /&gt;
Perceptual color‐difference metrics quantify how humans perceive changes between a reference color stimulus and a distorted one. Among these metrics, the CIE 1976 L*a*b* (CIELAB) color space and its associated ΔE formulations remain the most widely used because they provide a perceptually uniform representation of color differences under standardized viewing conditions.&lt;br /&gt;
&lt;br /&gt;
=== Computational CIE Color Models ===&lt;br /&gt;
Color difference metrics are built upon the CIE color‐appearance framework, which provides a device‐independent way of quantifying how humans perceive color. The foundational model is the CIE 1931 XYZ color space, derived from color‐matching functions that approximate the response of the human cone photoreceptors. Given a device RGB image, a calibrated 3×3 matrix (M) converts RGB intensities into XYZ tristimulus values:&lt;br /&gt;
&lt;br /&gt;
&amp;lt;div style=&amp;quot;text-align:center;&amp;quot;&amp;gt;&lt;br /&gt;
&amp;lt;math&amp;gt;&lt;br /&gt;
\begin{bmatrix}&lt;br /&gt;
X\\&lt;br /&gt;
Y\\&lt;br /&gt;
Z&lt;br /&gt;
\end{bmatrix} = M \begin{bmatrix}&lt;br /&gt;
R\\&lt;br /&gt;
G\\&lt;br /&gt;
B&lt;br /&gt;
\end{bmatrix}&lt;br /&gt;
&amp;lt;/math&amp;gt;&lt;br /&gt;
&amp;lt;/div&amp;gt;&lt;br /&gt;
&lt;br /&gt;
In this transformation, Y represents luminance, while X and Z carry chromatic information. Because XYZ is perceptually non-uniform, Euclidean distances in XYZ space do not reliably correspond to perceived color differences. To obtain a perceptually uniform space, CIE introduced CIELAB in 1976.&lt;br /&gt;
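&lt;br /&gt;
As a concrete sketch, this transform is a single matrix multiply per pixel. The matrix below is the standard sRGB-to-XYZ (D65) matrix, used here only as a stand-in for the display-calibrated M:&lt;br /&gt;

```python
import numpy as np

# Standard sRGB-to-XYZ matrix (D65 white, linear RGB). A real pipeline would
# substitute the calibrated 3x3 matrix M for the specific display.
M = np.array([[0.4124, 0.3576, 0.1805],
              [0.2126, 0.7152, 0.0722],
              [0.0193, 0.1192, 0.9505]])

def rgb_to_xyz(rgb):
    """Map linear RGB values with shape (..., 3) to XYZ tristimulus values."""
    return rgb @ M.T
```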
&lt;br /&gt;
=== CIELAB and Perceptual ΔE* Calculations ===&lt;br /&gt;
The nonlinear transformations that the CIELAB space applies to XYZ are as follows:&lt;br /&gt;
&amp;lt;div style=&amp;quot;text-align:center;&amp;quot;&amp;gt;&lt;br /&gt;
&amp;lt;math&amp;gt;&lt;br /&gt;
L^*=116f\left(\frac{Y}{Y_n}\right)-16,\quad a^*=500\left[f\left(\frac{X}{X_n}\right)-f\left(\frac{Y}{Y_n}\right)\right],\quad b^*=200\left[f\left(\frac{Y}{Y_n}\right)-f\left(\frac{Z}{Z_n}\right)\right]&lt;br /&gt;
&amp;lt;/math&amp;gt;&lt;br /&gt;
&amp;lt;/div&amp;gt;&lt;br /&gt;
&lt;br /&gt;
where&lt;br /&gt;
&lt;br /&gt;
&amp;lt;div style=&amp;quot;text-align:center;&amp;quot;&amp;gt;&lt;br /&gt;
&amp;lt;math&amp;gt;&lt;br /&gt;
f(t) =&lt;br /&gt;
\begin{cases}&lt;br /&gt;
t^{1/3}, &amp;amp; t &amp;gt; 0.008856 \\&lt;br /&gt;
7.787t+\frac{16}{116}, &amp;amp; \text{otherwise}&lt;br /&gt;
\end{cases}&lt;br /&gt;
&amp;lt;/math&amp;gt;&lt;br /&gt;
&amp;lt;/div&amp;gt;&lt;br /&gt;
&lt;br /&gt;
A reference white point (&amp;lt;math&amp;gt;X_n, Y_n, Z_n&amp;lt;/math&amp;gt;) is necessary to conduct these calculations.&lt;br /&gt;
&lt;br /&gt;
Perceptual color difference (ΔE) between two pixels is then calculated as:&lt;br /&gt;
&lt;br /&gt;
&amp;lt;div style=&amp;quot;text-align:center;&amp;quot;&amp;gt;&lt;br /&gt;
&amp;lt;math&amp;gt;&lt;br /&gt;
\Delta E^{*}_{ab}=\sqrt{(L^{*}_{1} - L^{*}_{2})^2 + (a^{*}_{1} - a^{*}_{2})^2 + (b^{*}_{1} - b^{*}_{2})^2}&lt;br /&gt;
&amp;lt;/math&amp;gt;&lt;br /&gt;
&amp;lt;/div&amp;gt;&lt;br /&gt;
&lt;br /&gt;
This ΔE* metric is widely used due to its simplicity and perceptual relevance. However, it treats each pixel independently and therefore cannot model spatial visual masking, where texture or neighboring structures influence visibility. For example, a small change in a flat region is highly visible, whereas the same change embedded in strong texture may be nearly invisible. To overcome this limitation, spatial extensions were introduced.&lt;br /&gt;
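&lt;br /&gt;
A minimal NumPy sketch of the XYZ-to-CIELAB conversion and the ΔE computation defined above (the white point used in the comments is an assumed D65 value):&lt;br /&gt;

```python
import numpy as np

def f(t):
    # Cube root above the 0.008856 threshold, linear segment below it
    return np.where(np.greater(t, 0.008856), np.cbrt(t), 7.787 * t + 16.0 / 116.0)

def xyz_to_lab(xyz, white):
    """Convert XYZ values with shape (..., 3) to CIELAB, given the reference
    white point (X_n, Y_n, Z_n), e.g. D65: (0.9505, 1.0, 1.089)."""
    fx = f(xyz[..., 0] / white[0])
    fy = f(xyz[..., 1] / white[1])
    fz = f(xyz[..., 2] / white[2])
    return np.stack([116.0 * fy - 16.0,      # L*
                     500.0 * (fx - fy),      # a*
                     200.0 * (fy - fz)],     # b*
                    axis=-1)

def delta_e(lab1, lab2):
    """Pixelwise CIE 1976 Delta E between two Lab images."""
    return np.sqrt(np.sum((lab1 - lab2) ** 2, axis=-1))
```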
&lt;br /&gt;
=== Spatial Extensions: The Principle Behind S-CIELAB ===&lt;br /&gt;
Rather than comparing pixels independently, S‐CIELAB incorporates frequency-dependent spatial filtering based on known characteristics of the human visual system:&lt;br /&gt;
* High-frequency noise is often masked by textures.&lt;br /&gt;
* Low-frequency distortions (blur, banding) are more visible.&lt;br /&gt;
* Chrominance channels have lower spatial resolution than luminance.&lt;br /&gt;
&lt;br /&gt;
S‐CIELAB applies separate low-pass filters to the L*, a*, and b* channels using empirically derived contrast sensitivity functions (CSFs). Formally:&lt;br /&gt;
&lt;br /&gt;
&amp;lt;div style=&amp;quot;text-align:center;&amp;quot;&amp;gt;&lt;br /&gt;
&amp;lt;math&amp;gt;&lt;br /&gt;
(L^*, a^*, b^*)_{\text{filtered}} = (L^*, a^*, b^*) \ast K_{CSF}&lt;br /&gt;
&amp;lt;/math&amp;gt;&lt;br /&gt;
&amp;lt;/div&amp;gt;&lt;br /&gt;
&lt;br /&gt;
where &amp;lt;math&amp;gt;\ast&amp;lt;/math&amp;gt; denotes convolution. The ΔE map is then computed over the filtered channels.&lt;br /&gt;
&lt;br /&gt;
This filtering is what makes S‐CIELAB significantly more perceptually aligned than standard ΔE*, but it is also the reason the metric is more computationally expensive, non-differentiable, and unsuitable as a loss function for neural networks. Thus, even though S-CIELAB is accurate, it is hard to use in modern imaging pipelines.&lt;br /&gt;
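&lt;br /&gt;
The per-channel filtering step can be sketched with separable Gaussian low-pass kernels. The sigmas below are illustrative placeholders, not the calibrated S-CIELAB CSF filters (which also depend on viewing distance); chroma channels are blurred more heavily, reflecting their lower spatial resolution:&lt;br /&gt;

```python
import numpy as np

def gaussian_kernel(sigma):
    """Normalized 1-D Gaussian kernel truncated at three sigmas."""
    radius = 3 * int(np.ceil(sigma))
    x = np.arange(-radius, radius + 1, dtype=float)
    k = np.exp(-0.5 * (x / sigma) ** 2)
    return k / k.sum()

def csf_filter(channel, sigma):
    """Separable 2-D low-pass filter standing in for a per-channel CSF."""
    k = gaussian_kernel(sigma)
    rows = np.apply_along_axis(np.convolve, 1, channel, k, mode="same")
    return np.apply_along_axis(np.convolve, 0, rows, k, mode="same")

def scielab_like(lab1, lab2, sigmas=(1.0, 2.0, 2.0)):
    """Filter each Lab channel, then compute the pixelwise Delta E map.
    The sigma triplet (L*, a*, b*) is a hypothetical placeholder."""
    diff = 0.0
    for c, sigma in enumerate(sigmas):
        d = csf_filter(lab1[..., c], sigma) - csf_filter(lab2[..., c], sigma)
        diff = diff + d ** 2
    return np.sqrt(diff)
```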
&lt;br /&gt;
=== Patch-Based Perspective for Learning S-CIELAB ===&lt;br /&gt;
&lt;br /&gt;
Although S-CIELAB produces a pixel-wise ΔE map, the spatial filters operate locally. The perceptual decision at any location depends mainly on the statistics within a small neighborhood. Therefore, learning S‐CIELAB from data does not require full images; instead, patches provide: &lt;br /&gt;
&lt;br /&gt;
# Locality of Human Perception&lt;br /&gt;
#: A 2°×2° visual field corresponds roughly to the scale at which the retina integrates spatial information. This aligns well with S-CIELAB’s filtering nature.&lt;br /&gt;
# Statistical Stability&lt;br /&gt;
#: Patch-level averaging reduces noise and variation across strong textures, yielding a smoother learning target (mean ΔE per patch).&lt;br /&gt;
# Efficiency&lt;br /&gt;
#: Training on patches reduces GPU/CPU memory usage, increases dataset size, and removes global image dependencies.&lt;br /&gt;
# Simplified Feature Engineering&lt;br /&gt;
#: MLPs or lightweight CNNs can learn localized perceptual behavior including patch features like lightness, chroma, chromatic contrast, and local frequency characteristics (indirectly through variance).&lt;br /&gt;
&lt;br /&gt;
Patch-level surrogates can capture essential S-CIELAB behavior without requiring full-image modeling.&lt;br /&gt;
&lt;br /&gt;
=== Dataset and Preprocessing ===&lt;br /&gt;
&lt;br /&gt;
In this project, the TID2013 image quality assessment dataset is used as the data source. It contains 25 reference images, each distorted by 24 distortion types across five severity levels. Each distorted image is paired with its corresponding reference image. For each reference–distorted pair, the ground-truth S-CIELAB ΔE map is computed using the scielabRGB function provided in ISETCam, with a display calibration file (crt.mat) and a viewing distance of 0.3 m. This produces a pixel-wise perceptual color-difference map that serves as the ground truth for neural network training.&lt;br /&gt;
&lt;br /&gt;
[[File:Fig1 tid images.PNG|750px|thumb|center|Figure 1. L to R: Hat reference photo; hat reference photo with level 5 distortion; lighthouse reference photo; lighthouse reference photo with level 4 distortion.]]&lt;br /&gt;
&lt;br /&gt;
[[File:Fig2 tid distort.PNG|600px|thumb|center|Figure 2. Examples of TID2013 reference and distorted images together with their corresponding S-CIELAB ΔE maps. (a) The original reference image and (b) its Gaussian-noise distorted version at level 1. (c) S-CIELAB ΔE heatmaps for Gaussian noise level 1 and (d) level 5.]]&lt;br /&gt;
&lt;br /&gt;
Inspecting the TID2013 distortions, we observe that:&lt;br /&gt;
* Low-level distortion mainly affects local fine textures, leading to relatively small ΔE values.&lt;br /&gt;
* High-level distortion produces much stronger responses across the entire image, with significantly larger ΔE values.&lt;br /&gt;
* The spatial distribution of ΔE closely follows visually perceived degradation, confirming the perceptual relevance of S-CIELAB.&lt;br /&gt;
This visualization also motivates the patch-based learning strategy, as the perceptual error is locally structured and varies across spatial regions.&lt;br /&gt;
&lt;br /&gt;
Color indicates perceptual color difference, where higher ΔE values represent stronger perceived distortion. As expected, the fifth level of distortion severity exhibits substantially higher ΔE values and more visible spatial error patterns compared to the first level of distortion.&lt;br /&gt;
&lt;br /&gt;
=== Patch-Based Representation ===&lt;br /&gt;
Patch-based learning is used because S-CIELAB itself operates locally through spatial filtering. Patch-level learning greatly increases the number of training samples as it divides reference images into much smaller windows while avoiding the need for larger CNNs to capture local perceptual behavior. It enables a simple, fully connected MLP instead of spatial convolutions.&lt;br /&gt;
&lt;br /&gt;
In this project, image pairs and corresponding ∆E maps are divided into 2° × 2° visual angle patches. The patch size in pixels is calculated from the horizontal field of view (HFOV) of the scene as:&lt;br /&gt;
&lt;br /&gt;
&amp;lt;div style=&amp;quot;text-align:center;&amp;quot;&amp;gt;&lt;br /&gt;
&amp;lt;math&amp;gt;&lt;br /&gt;
\text{patch pixels} = 2^{\circ} / (\text{HFOV} / \text{image width})&lt;br /&gt;
&amp;lt;/math&amp;gt;&lt;br /&gt;
&amp;lt;/div&amp;gt;&lt;br /&gt;
&lt;br /&gt;
A 50% overlap of patches increases sample count and improves statistical coverage of spatial distortions.&lt;br /&gt;
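&lt;br /&gt;
The patch-size formula and the 50%-overlap sliding window can be sketched as follows (the 20° HFOV in the test is an arbitrary example value):&lt;br /&gt;

```python
import numpy as np

def patch_size_pixels(hfov_deg, image_width, patch_deg=2.0):
    """Pixels per patch side for a patch spanning patch_deg of visual angle."""
    deg_per_pixel = hfov_deg / image_width
    return int(round(patch_deg / deg_per_pixel))

def extract_patches(img, patch, overlap=0.5):
    """Slide a square window with the given fractional overlap (50% here)."""
    step = max(1, int(patch * (1.0 - overlap)))
    h, w = img.shape[:2]
    patches = []
    for y in range(0, h - patch + 1, step):
        for x in range(0, w - patch + 1, step):
            patches.append(img[y:y + patch, x:x + patch])
    return patches
```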
&lt;br /&gt;
=== Dataset Construction and Splitting ===&lt;br /&gt;
&lt;br /&gt;
After feature extraction, a total of over 200,000 patch samples are collected. The dataset is split as 80% for training and 20% for testing. Before training, all features are standardized to zero mean and unit variance using mapstd.&lt;br /&gt;
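&lt;br /&gt;
A NumPy equivalent of the 80/20 split and the mapstd-style standardization might look like this; the scaling is fit on the training set only, and the random seed is arbitrary:&lt;br /&gt;

```python
import numpy as np

def standardize(train, test):
    """Zero-mean, unit-variance scaling fit on training data (mapstd analogue)."""
    mu = train.mean(axis=0)
    sd = train.std(axis=0) + 1e-12   # guard against zero-variance features
    return (train - mu) / sd, (test - mu) / sd

def split(features, targets, train_frac=0.8, seed=0):
    """Random 80/20 train/test split of patch samples."""
    rng = np.random.default_rng(seed)
    idx = rng.permutation(len(features))
    n = int(train_frac * len(features))
    return (features[idx[:n]], targets[idx[:n]],
            features[idx[n:]], targets[idx[n:]])
```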
&lt;br /&gt;
== MLP Implementation ==&lt;br /&gt;
=== MLP Feature Extraction (18-Dimensional Vector) ===&lt;br /&gt;
The MLP operates on engineered statistical features intended to summarize patch-level behavior relevant to S-CIELAB. For each patch, we compute:&lt;br /&gt;
* Reference XYZ: Mean (3), Standard deviation (3)&lt;br /&gt;
* Distorted XYZ: Mean (3), Standard deviation (3)&lt;br /&gt;
* Difference statistics: Mean difference (3), Standard deviation difference (3)&lt;br /&gt;
&lt;br /&gt;
This results in an 18-dimensional feature vector: &amp;lt;math&amp;gt;x=\left[ \mu_R, \sigma_R, \mu_T, \sigma_T, \Delta\mu, \Delta\sigma \right] \in \mathbb{R}^{18}&amp;lt;/math&amp;gt;&lt;br /&gt;
&lt;br /&gt;
The target label for each patch is the patch mean S-CIELAB ΔE.&lt;br /&gt;
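&lt;br /&gt;
Computing the 18-dimensional descriptor for one patch pair is straightforward; the ordering below matches the feature vector defined above:&lt;br /&gt;

```python
import numpy as np

def patch_features(ref_xyz, dist_xyz):
    """18-D descriptor: per-channel means and standard deviations of the
    reference and distorted XYZ patches, plus their differences."""
    mu_r = ref_xyz.reshape(-1, 3).mean(axis=0)
    sd_r = ref_xyz.reshape(-1, 3).std(axis=0)
    mu_t = dist_xyz.reshape(-1, 3).mean(axis=0)
    sd_t = dist_xyz.reshape(-1, 3).std(axis=0)
    return np.concatenate([mu_r, sd_r, mu_t, sd_t, mu_t - mu_r, sd_t - sd_r])
```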
&lt;br /&gt;
[[File:Fig3 mlp pipeline.PNG|750px|thumb|centre|Figure 3. MLP network structure - from an 18D input vector to 1 fully connected MLP that outputs a patch mean S-CIELAB ∆E.]]&lt;br /&gt;
&lt;br /&gt;
Instead of directly using raw pixel values as network inputs, this work adopts an 18-dimensional statistical feature representation extracted from each 2°×2° image patch. Specifically, for both the reference and distorted patches in XYZ color space, the mean and standard deviation of each channel are computed, forming 12 features. In addition, the differences between the corresponding means and standard deviations of the reference and distorted patches are calculated, resulting in a total of 18 features.&lt;br /&gt;
&lt;br /&gt;
This feature design is motivated by both perceptual and practical considerations. First, the S-CIELAB metric itself is not defined at the pixel level alone, but is constructed through spatial filtering and local color difference operations that integrate information over a neighborhood. Therefore, local statistical descriptors better reflect the perceptual behavior captured by S-CIELAB than individual pixel values.&lt;br /&gt;
&lt;br /&gt;
Second, using raw pixels would lead to an extremely high-dimensional input space. For example, a 2°×2° patch typically contains hundreds of pixels, and directly feeding these values into a neural network would significantly increase model complexity, training instability, and computational cost. In contrast, the proposed 18-dimensional feature vector provides a compact and low-dimensional representation while preserving the essential color and contrast information relevant to perceptual difference estimation.&lt;br /&gt;
&lt;br /&gt;
Third, patch-level statistical features improve the robustness of learning by suppressing pixel-level noise and small spatial misalignments. Since S-CIELAB is designed to model perceived color differences rather than exact pixel correspondences, learning from patch-wise statistics is more consistent with the perceptual objective of the metric.&lt;br /&gt;
&lt;br /&gt;
Overall, this feature representation allows the Multi-Layer Perceptron to focus on perceptually meaningful information while remaining lightweight, stable, and computationally efficient.&lt;br /&gt;
&lt;br /&gt;
=== MLP Regression Model ===&lt;br /&gt;
A Multi-Layer Perceptron (MLP) is used to learn the nonlinear mapping: &amp;lt;math&amp;gt;f_{\theta}:\mathbb{R}^{18} \rightarrow \mathbb{R}&amp;lt;/math&amp;gt;&lt;br /&gt;
&lt;br /&gt;
Network configuration:&lt;br /&gt;
* Input layer: 18 neurons&lt;br /&gt;
* Hidden layers: [64, 32]&lt;br /&gt;
* Output layer: 1 neuron (predicted ΔE)&lt;br /&gt;
* Activation: MATLAB defaults (tan-sigmoid hidden layers, linear output)&lt;br /&gt;
* Training algorithm: Levenberg–Marquardt&lt;br /&gt;
* Epochs: 200&lt;br /&gt;
* Validation split: 0.1&lt;br /&gt;
&lt;br /&gt;
The network is trained using mean squared error (MSE) loss.&lt;br /&gt;
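&lt;br /&gt;
The forward pass of this network is a small computation. The sketch below uses random placeholder weights standing in for the parameters fit with Levenberg-Marquardt, and assumes tanh hidden layers with a linear output:&lt;br /&gt;

```python
import numpy as np

rng = np.random.default_rng(0)

# Layer sizes from the text (18, 64, 32, 1); weights are random placeholders.
sizes = [18, 64, 32, 1]
weights = [rng.standard_normal((m, n)) * 0.1
           for m, n in zip(sizes[:-1], sizes[1:])]
biases = [np.zeros(n) for n in sizes[1:]]

def mlp_forward(x):
    """tanh hidden layers with a linear output layer; x has shape (batch, 18)."""
    h = x
    for W, b in zip(weights[:-1], biases[:-1]):
        h = np.tanh(h @ W + b)
    return h @ weights[-1] + biases[-1]
```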
&lt;br /&gt;
== CNN Implementation ==&lt;br /&gt;
=== CNN Input Representation ===&lt;br /&gt;
The CNN receives a 64×64×6 input tensor:&lt;br /&gt;
* Channels 1-3: reference patch (XYZ)&lt;br /&gt;
* Channels 4-6: distorted patch (XYZ)&lt;br /&gt;
This representation allows the network to learn spatial color differences directly in the tristimulus domain without engineered feature extraction.&lt;br /&gt;
&lt;br /&gt;
=== CNN Network Architecture ===&lt;br /&gt;
A lightweight UNet architecture is used to compute a per-pixel ∆E map. The encoder (2 layers) progressively reduces spatial resolution while expanding feature depth, extracting hierarchical spatial and chromatic features. The decoder (2 layers) reconstructs the spatial resolution using transposed convolutions and skip connections. Skip connections retain high-frequency information critical for modeling texture masking and spatial distortions. The CNN is trained using pixelwise MSE between predicted ∆E and ground truth ∆E.&lt;br /&gt;
&lt;br /&gt;
This architecture is well suited to perceptual tasks because it learns both fine-grained and contextual representations across spatial scales, mimicking S-CIELAB’s multiscale filtering.&lt;br /&gt;
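&lt;br /&gt;
At the level of tensor shapes, the two-level encoder/decoder flow with skip connections can be sketched as follows. Average pooling and nearest-neighbor upsampling stand in for the learned strided and transposed convolutions, so this illustrates only the data flow, not the trained network:&lt;br /&gt;

```python
import numpy as np

def down(x):
    """2x average pooling: stand-in for a strided encoder convolution."""
    h, w, c = x.shape
    return x.reshape(h // 2, 2, w // 2, 2, c).mean(axis=(1, 3))

def up(x):
    """2x nearest-neighbor upsampling: stand-in for a transposed convolution."""
    return x.repeat(2, axis=0).repeat(2, axis=1)

def tiny_unet(x):
    """Shape-level sketch of the 2-level U-Net used here; channels pass through
    unchanged where a real network would apply learned convolutions."""
    e1 = x          # 64x64 input resolution, 6 channels (ref + distorted XYZ)
    e2 = down(e1)   # 32x32
    b = down(e2)    # 16x16 bottleneck
    d2 = np.concatenate([up(b), e2], axis=-1)   # skip connection, level 2
    d1 = np.concatenate([up(d2), e1], axis=-1)  # skip connection, level 1
    return d1.mean(axis=-1)  # collapse channels into a per-pixel scalar map
```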
&lt;br /&gt;
[[File:Fig4 cnn pipeline.PNG|1000px|thumb|center|Figure 4. Simplified representation of the U-Net architecture used in the CNN implementation.]]&lt;br /&gt;
&lt;br /&gt;
== Experimental Results ==&lt;br /&gt;
=== Prediction Accuracy of the MLP Model ===&lt;br /&gt;
After training, the MLP model was evaluated on the independent test set. The predicted patch-wise mean ΔE values were compared against the ground-truth S-CIELAB values. The model achieves the following performance on the test set:&lt;br /&gt;
* Correlation Coefficient (R) ≈ 0.94&lt;br /&gt;
* RMSE ≈ 0.96&lt;br /&gt;
The scatter plot of predicted versus true ΔE values (Fig. 5) shows a strong linear relationship, indicating that the MLP is able to accurately learn the nonlinear mapping between the extracted XYZ features and the perceptual color difference measured by S-CIELAB.&lt;br /&gt;
&lt;br /&gt;
The use of the expanded 18-dimensional feature set leads to a significant improvement in prediction accuracy. This demonstrates that including both reference and distorted statistics, as well as their differences, provides richer perceptual information for learning.&lt;br /&gt;
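&lt;br /&gt;
The two reported statistics can be computed directly from vectors of predicted and ground-truth ΔE values:&lt;br /&gt;

```python
import numpy as np

def evaluate(pred, true):
    """Correlation coefficient R and RMSE between predicted and true Delta E."""
    r = np.corrcoef(pred, true)[0, 1]
    rmse = np.sqrt(np.mean((pred - true) ** 2))
    return r, rmse
```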
&lt;br /&gt;
[[File:Fig5 mlp results.png|1000px|thumb|center|Figure 5. Scatter plot comparing the MLP-predicted mean ΔE values with the ground-truth S-CIELAB ΔE computed from the TID2013 dataset. Each point represents one 2°×2° image patch. The model achieves strong correlation with the perceptual metric (R = 0.945) and low prediction error (RMSE = 0.961), demonstrating that a lightweight neural network can effectively approximate S-CIELAB for perceptual color-difference estimation.]]&lt;br /&gt;
&lt;br /&gt;
=== Prediction Accuracy of the CNN Model ===&lt;br /&gt;
After training, the CNN achieved strong predictive performance on the test set. Quantitatively, the network produced:&lt;br /&gt;
* Correlation coefficient (R) ≈ 0.96&lt;br /&gt;
* RMSE ≈ 1.85&lt;br /&gt;
The correlation coefficient of 0.96 represents a very strong linear relationship between the CNN-predicted perceptual difference values and the ground-truth S-CIELAB values. Although the CNN’s per-pixel RMSE is numerically higher than that of the MLP model, this is expected: the CNN predicts full-resolution ΔE maps rather than a smoother scalar summary of each patch.&lt;br /&gt;
&lt;br /&gt;
These quantitative metrics also show that a CNN can approximate S-CIELAB with high accuracy.&lt;br /&gt;
&lt;br /&gt;
[[File:Fig6 cnn results.png|1000px|thumb|center|Figure 6. (L) CNN training and convergence curve; (R) scatter plot of trained model correlation compared against the ideal prediction/ground truth.]]&lt;br /&gt;
&lt;br /&gt;
== Discussion ==&lt;br /&gt;
The results demonstrate that both the MLP and CNN architectures can effectively approximate the S-CIELAB perceptual color-difference metric while remaining fully differentiable and significantly more computationally efficient. &lt;br /&gt;
&lt;br /&gt;
The MLP achieves strong performance using only simple XYZ-based statistical descriptors, confirming that much of S-CIELAB’s nonlinear behavior can be captured at the patch level without explicitly modeling spatial interactions. &lt;br /&gt;
&lt;br /&gt;
The CNN learns to reproduce full-resolution ΔE maps with high spatial fidelity. Its strong correlation shows that a UNet is capable of modeling S-CIELAB’s frequency-dependent filtering and texture-masking behavior directly from data. &lt;br /&gt;
&lt;br /&gt;
Together, these results indicate that S-CIELAB is inherently local and learnable, and that neural surrogates can effectively replace the original metric in scenarios requiring differentiability, speed, or integration into end-to-end imaging systems.&lt;br /&gt;
&lt;br /&gt;
== Conclusion ==&lt;br /&gt;
&lt;br /&gt;
This work presents a neural surrogate approach for approximating the S-CIELAB perceptual color-difference metric using both an MLP and CNN. The proposed MLP and CNN architectures provide efficient, differentiable models for predicting patch-level and pixel-level ΔE, respectively, offering substantial speed improvements and enabling seamless integration into modern optimization pipelines. These results validate that neural network implementations can effectively capture both the chromatic and spatial components of perceptual color differences.&lt;br /&gt;
&lt;br /&gt;
Future work will focus on extending the models, broadening their applicability, and expanding the set of parameters they can predict. This includes:&lt;br /&gt;
* Incorporating training parameters such as FOV magnitude (e.g., 2°×2° patches vs. 5°×5°) and white point&lt;br /&gt;
* Translating the networks to Python for integration with the ISETPy repository&lt;br /&gt;
* Evaluating additional perceptual and image-quality metrics (e.g., chromaticity or lightness errors, full-image perceptual maps, and perceptual attention maps)&lt;br /&gt;
* Improving model performance by accelerating computation, expanding the training dataset, tuning hyperparameters, and experimenting with different overlapping patch sampling&lt;br /&gt;
&lt;br /&gt;
These directions will further enhance the power and generality of learned perceptual surrogates for imaging and color-processing applications.&lt;br /&gt;
&lt;br /&gt;
== Appendix ==&lt;br /&gt;
Code Repository: https://github.com/anbananna/Perceptual_Color_Metrics&lt;br /&gt;
&lt;br /&gt;
== References ==&lt;br /&gt;
[1] Farrell, J. E., Xiao, F., Catrysse, P., &amp;amp; Wandell, B. (2004). A simulation tool for evaluating digital camera image quality. In Image Quality and System Performance (Miyake &amp;amp; Rasmussen, Eds.), Proceedings of SPIE, 5294, 124–131.&lt;br /&gt;
&lt;br /&gt;
[2] Ponomarenko, N., Jin, L., Ieremeiev, O., Lukin, V., Egiazarian, K., Astola, J., Vozel, B., Chehdi, K., Carli, M., Battisti, F., &amp;amp; Kuo, C.-C. J. (2015). Image database TID2013: Peculiarities, results and perspectives. Signal Processing: Image Communication, 30, 57–77. https://doi.org/10.1016/j.image.2014.10.009&lt;br /&gt;
&lt;br /&gt;
[3] Wandell, B. A. (n.d.). ISETCam: Image Systems Engineering Toolbox for Cameras [Software]. GitHub. https://github.com/iset/isetcam&lt;br /&gt;
&lt;br /&gt;
[4] Zhang, X., &amp;amp; Wandell, B. A. (1997). A spatial extension of CIELAB for digital color image reproduction. SID Symposium Digest of Technical Papers, 28, 731–734. https://doi.org/10.1889/1.1837854&lt;br /&gt;
&lt;br /&gt;
== Work Breakdown ==&lt;br /&gt;
&lt;br /&gt;
Shu An: MLP model development, MATLAB code for MLP model, written report, and MLP presentation slides&lt;br /&gt;
&lt;br /&gt;
Anna Yu: CNN model development, MATLAB code for CNN model, written report, and presentation&lt;/div&gt;</summary>
		<author><name>Annayu</name></author>
	</entry>
	<entry>
		<id>http://vista.su.domains/psych221wiki/index.php?title=Neural_Network_Implementation_of_S-CIELAB_for_Perceptual_Color_Metrics&amp;diff=145792</id>
		<title>Neural Network Implementation of S-CIELAB for Perceptual Color Metrics</title>
		<link rel="alternate" type="text/html" href="http://vista.su.domains/psych221wiki/index.php?title=Neural_Network_Implementation_of_S-CIELAB_for_Perceptual_Color_Metrics&amp;diff=145792"/>
		<updated>2025-12-10T04:25:58Z</updated>

		<summary type="html">&lt;p&gt;Annayu: /* CNN Implementation */&lt;/p&gt;
&lt;hr /&gt;
&lt;div&gt;== Abstract ==&lt;br /&gt;
The Spatial CIELAB (S-CIELAB) metric is a widely used perceptual color-difference measure that incorporates both chromatic appearance and spatial properties of the human visual system. Despite its accuracy, S-CIELAB is computationally expensive due to its multi-stage processing pipeline, including opponent-color transformation, frequency-dependent spatial filtering, and nonlinear post-processing. Moreover, these filtering operations rely on fixed convolution kernels and nonlinearities that are typically not differentiable in a manner compatible with gradient-based optimization, making S-CIELAB difficult to integrate directly into learning-based imaging systems.&lt;br /&gt;
&lt;br /&gt;
This project investigates whether a neural network can learn a surrogate model that predicts S-CIELAB responses efficiently from local image patches. Using the TID2013 dataset, we develop two surrogate models: 1) a Multi-Layer Perceptron (MLP) trained on 18-dimensional XYZ-based statistical descriptors of local patches, and 2) a Convolutional Neural Network (CNN) trained to directly map 6-channel XYZ patch pairs to full-resolution ∆E maps. The MLP achieves R ≈ 0.94 and RMSE ≈ 0.96 for patch-mean ∆E prediction while the CNN achieves R ≈ 0.96 and RMSE ≈ 1.85 for per pixel ∆E prediction.&lt;br /&gt;
&lt;br /&gt;
These results show that a compact neural model can effectively approximate S-CIELAB while being fast and fully differentiable, enabling its potential use as a perceptual loss or quality metric in modern imaging pipelines.&lt;br /&gt;
&lt;br /&gt;
== Introduction ==&lt;br /&gt;
Perceptual color‐difference metrics play a fundamental role in modern imaging pipelines, image compression, and quality assessment systems. They aim to quantify human-perceived differences between a reference image and its distorted version, allowing algorithms to optimize not only pixel-wise accuracy but also perceptual fidelity. Among these metrics, S-CIELAB (Spatial CIELAB) has become one of the most influential extensions of the classic CIELAB ΔE formulation. Unlike conventional ΔE, which compares colors independently per pixel, S-CIELAB incorporates spatial filtering stages that approximate the frequency-dependent sensitivity of the human visual system (HVS). As a result, the metric aligns more closely with human perceived color differences, particularly in images containing high-frequency textures, blur, noise, or structured distortions. &lt;br /&gt;
&lt;br /&gt;
Although S-CIELAB significantly improves perceptual accuracy, its multi-stage processing pipeline involves large kernel spatial convolutions, nonlinear transformations, and piecewise or non-differentiable operations. This makes S-CIELAB slow to compute at scale and fundamentally incompatible with gradient-based optimizations.&lt;br /&gt;
&lt;br /&gt;
A patch-based formulation is particularly suitable for learning a surrogate model of S-CIELAB. S-CIELAB itself operates locally: its spatial filtering approximates the HVS contrast sensitivity over limited visual angles, and its ΔE computation depends primarily on neighborhood-level color differences rather than global structure. By extracting fixed-size patches corresponding to a constant visual angle (2°×2° in this work), we preserve the locality intrinsic to the metric while avoiding the need to model long-range correlations. Patch-based learning also increases the number of training samples significantly, improving statistical robustness, and allows the model to focus on local statistical features, such as mean chromaticity shifts or contrast changes, that most strongly influence S-CIELAB responses. This makes the surrogate easier to train, more compact, and more generalizable across diverse distortion types.&lt;br /&gt;
&lt;br /&gt;
To bridge the limitations of S-CIELAB and the needs of modern neural pipelines, recent research has explored learning surrogate models that mimic perceptual metrics while remaining computationally efficient and differentiable. A learnable surrogate model for S-CIELAB would allow imaging systems to optimize directly for perceptual color fidelity, improving their alignment with human judgments. In this project, we investigate whether a compact Multi-Layer Perceptron (MLP) or a Convolutional Neural Network (CNN) can learn to reproduce the S-CIELAB ΔE response at the patch level. Instead of training on synthetic images, we leverage the widely used TID2013 dataset, which provides 25 reference images and 24 distortion types across five severity levels.&lt;br /&gt;
&lt;br /&gt;
The objective of this work is twofold:&lt;br /&gt;
# to evaluate whether a simple neural model can approximate S-CIELAB with high accuracy, and&lt;br /&gt;
# to explore the feasibility of replacing costly perceptual metrics with lightweight, differentiable surrogates suitable for future imaging applications.&lt;br /&gt;
&lt;br /&gt;
== Background ==&lt;br /&gt;
Perceptual color‐difference metrics quantify how humans perceive changes between a reference color stimulus and a distorted one. Among these metrics, the CIE 1976 L*a*b* (CIELAB) color space and its associated ΔE formulations remain the most widely used because they provide a perceptually uniform representation of color differences under standardized viewing conditions.&lt;br /&gt;
&lt;br /&gt;
=== Computational CIE Color Models ===&lt;br /&gt;
Color difference metrics are built upon the CIE color‐appearance framework, which provides a device‐independent way of quantifying how humans perceive color. The foundational model is the CIE 1931 XYZ color space, derived from color‐matching functions that approximate the response of the human cone photoreceptors. Given a device RGB image, a calibrated 3×3 matrix (M) converts RGB intensities into XYZ tristimulus values:&lt;br /&gt;
&lt;br /&gt;
&amp;lt;div style=&amp;quot;text-align:center;&amp;quot;&amp;gt;&lt;br /&gt;
&amp;lt;math&amp;gt;&lt;br /&gt;
\begin{bmatrix}&lt;br /&gt;
X\\&lt;br /&gt;
Y\\&lt;br /&gt;
Z&lt;br /&gt;
\end{bmatrix} = M \begin{bmatrix}&lt;br /&gt;
R\\&lt;br /&gt;
G\\&lt;br /&gt;
B&lt;br /&gt;
\end{bmatrix}&lt;br /&gt;
&amp;lt;/math&amp;gt;&lt;br /&gt;
&amp;lt;/div&amp;gt;&lt;br /&gt;
&lt;br /&gt;
In this transformation, Y represents luminance, while X and Z carry chromatic information. Because XYZ is perceptually non-uniform, Euclidean distances in XYZ space do not reliably correspond to perceived color differences. To obtain a perceptually uniform space, CIE introduced CIELAB in 1976.&lt;br /&gt;
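The matrix transform above can be sketched in a few lines of NumPy. The project itself computes XYZ through a calibrated display model in MATLAB/ISETCam; the matrix below is the standard sRGB/D65 example matrix, used here only to illustrate the linear mapping:

```python
import numpy as np

# Example 3x3 RGB -> XYZ matrix (standard sRGB primaries, D65 white, linear RGB).
# A calibrated display would supply its own matrix M, as in this project.
M = np.array([
    [0.4124, 0.3576, 0.1805],
    [0.2126, 0.7152, 0.0722],
    [0.0193, 0.1192, 0.9505],
])

def rgb_to_xyz(rgb, M=M):
    """Map linear RGB values (..., 3) to XYZ via [X, Y, Z]^T = M [R, G, B]^T."""
    return np.asarray(rgb, dtype=float) @ M.T

# R = G = B = 1 maps to the reference white of the display.
white = rgb_to_xyz([1.0, 1.0, 1.0])
```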
&lt;br /&gt;
=== CIELAB and Perceptual ΔE* Calculations ===&lt;br /&gt;
The nonlinear transformations that the CIELAB space applies to XYZ are as follows:&lt;br /&gt;
&amp;lt;div style=&amp;quot;text-align:center;&amp;quot;&amp;gt;&lt;br /&gt;
&amp;lt;math&amp;gt;&lt;br /&gt;
L^*=116f\left(\frac{Y}{Y_n}\right)-16, \quad a^*=500\left[f\left(\frac{X}{X_n}\right)-f\left(\frac{Y}{Y_n}\right)\right], \quad b^*=200\left[f\left(\frac{Y}{Y_n}\right)-f\left(\frac{Z}{Z_n}\right)\right]&lt;br /&gt;
&amp;lt;/math&amp;gt;&lt;br /&gt;
&amp;lt;/div&amp;gt;&lt;br /&gt;
&lt;br /&gt;
where&lt;br /&gt;
&lt;br /&gt;
&amp;lt;div style=&amp;quot;text-align:center;&amp;quot;&amp;gt;&lt;br /&gt;
&amp;lt;math&amp;gt;&lt;br /&gt;
f(t) =&lt;br /&gt;
\begin{cases}&lt;br /&gt;
t^{1/3}, &amp;amp; t &amp;gt; 0.008856 \\&lt;br /&gt;
7.787t+\frac{16}{116}, &amp;amp; \text{otherwise}&lt;br /&gt;
\end{cases}&lt;br /&gt;
&amp;lt;/math&amp;gt;&lt;br /&gt;
&amp;lt;/div&amp;gt;&lt;br /&gt;
&lt;br /&gt;
A reference white point (&amp;lt;math&amp;gt;X_n, Y_n, Z_n&amp;lt;/math&amp;gt;) is necessary to conduct these calculations.&lt;br /&gt;
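The XYZ-to-CIELAB equations above translate directly into code. A minimal NumPy sketch follows, assuming a D65-like reference white for illustration (the project relies on MATLAB/ISETCam's implementation with its calibrated white point):

```python
import numpy as np

def _f(t):
    # Piecewise CIELAB nonlinearity: cube root above 0.008856, linear below.
    t = np.asarray(t, dtype=float)
    return np.where(t > 0.008856, np.cbrt(t), 7.787 * t + 16.0 / 116.0)

def xyz_to_lab(xyz, white=(0.9505, 1.0, 1.089)):
    """Convert XYZ (..., 3) to CIELAB (L*, a*, b*) for a given reference white."""
    xyz = np.asarray(xyz, dtype=float)
    fx = _f(xyz[..., 0] / white[0])
    fy = _f(xyz[..., 1] / white[1])
    fz = _f(xyz[..., 2] / white[2])
    L = 116.0 * fy - 16.0
    a = 500.0 * (fx - fy)
    b = 200.0 * (fy - fz)
    return np.stack([L, a, b], axis=-1)
```

By construction, the reference white itself maps to L* = 100 with zero chroma.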
&lt;br /&gt;
Perceptual color difference (ΔE) between two pixels is then calculated as:&lt;br /&gt;
&lt;br /&gt;
&amp;lt;div style=&amp;quot;text-align:center;&amp;quot;&amp;gt;&lt;br /&gt;
&amp;lt;math&amp;gt;&lt;br /&gt;
\Delta E_{ab}^*=\sqrt{(L_1^* - L_2^*)^2 + (a_1^* - a_2^*)^2 + (b_1^* - b_2^*)^2}&lt;br /&gt;
&amp;lt;/math&amp;gt;&lt;br /&gt;
&amp;lt;/div&amp;gt;&lt;br /&gt;
&lt;br /&gt;
This ΔE* metric is widely used due to its simplicity and perceptual relevance. However, it treats each pixel independently and therefore cannot model spatial visual masking, where texture or neighboring structures influence visibility. For example, a small change in a flat region is highly visible, while the same change embedded in strong texture may be nearly invisible. To overcome this limitation, spatial extensions were introduced.&lt;br /&gt;
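The ΔE* formula is just a pixel-wise Euclidean distance in Lab space, as a short NumPy sketch makes explicit:

```python
import numpy as np

def delta_e_ab(lab1, lab2):
    """Pixel-wise CIELAB Delta E*: Euclidean distance over (L*, a*, b*).
    Inputs are arrays of shape (..., 3); the output drops the last axis."""
    lab1 = np.asarray(lab1, dtype=float)
    lab2 = np.asarray(lab2, dtype=float)
    return np.sqrt(np.sum((lab1 - lab2) ** 2, axis=-1))
```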
&lt;br /&gt;
=== Spatial Extensions: The Principle Behind S-CIELAB ===&lt;br /&gt;
Rather than comparing pixels independently, S‐CIELAB incorporates frequency-dependent spatial filtering based on known characteristics of the human visual system:&lt;br /&gt;
* High-frequency noise is often masked by textures.&lt;br /&gt;
* Low-frequency distortions (blur, banding) are more visible.&lt;br /&gt;
* Chrominance channels have lower spatial resolution than luminance.&lt;br /&gt;
&lt;br /&gt;
S‐CIELAB applies separate low-pass filters to the L*, a*, and b* channels using empirically derived contrast sensitivity functions (CSFs). Formally:&lt;br /&gt;
&lt;br /&gt;
&amp;lt;div style=&amp;quot;text-align:center;&amp;quot;&amp;gt;&lt;br /&gt;
&amp;lt;math&amp;gt;&lt;br /&gt;
(L*, a*, b*)_{filtered} = (L*, a*, b*) \ast K_{CSF}&lt;br /&gt;
&amp;lt;/math&amp;gt;&lt;br /&gt;
&amp;lt;/div&amp;gt;&lt;br /&gt;
&lt;br /&gt;
where &amp;lt;math&amp;gt;\ast&amp;lt;/math&amp;gt; denotes convolution. The ΔE map is then computed over the filtered channels.&lt;br /&gt;
&lt;br /&gt;
This is important because filtering makes S‐CIELAB significantly more perceptually aligned than standard ΔE*, but is also the reason why it’s more computationally expensive, non-differentiable, and unsuitable as a loss function for neural networks. Thus, even though S-CIELAB is accurate, it is hard to use in modern imaging pipelines.&lt;br /&gt;
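The filter-then-difference pipeline can be illustrated with a toy sketch. Note the hedge: the real S-CIELAB uses calibrated opponent-space CSF kernels, not the simple Gaussians below; this sketch only shows the structure of the computation, with a wider blur on the chromatic channels to mimic their lower spatial resolution:

```python
import numpy as np

def gaussian_kernel(sigma):
    radius = int(3 * sigma)
    x = np.arange(-radius, radius + 1, dtype=float)
    k = np.exp(-x**2 / (2 * sigma**2))
    return k / k.sum()

def lowpass(channel, sigma):
    # Separable 2-D convolution: filter rows, then columns.
    k = gaussian_kernel(sigma)
    out = np.apply_along_axis(lambda r: np.convolve(r, k, mode="same"), 1, channel)
    return np.apply_along_axis(lambda c: np.convolve(c, k, mode="same"), 0, out)

def spatial_delta_e(lab1, lab2, sigmas=(1.0, 2.0, 2.0)):
    """Toy S-CIELAB-style map: blur each channel (chroma more than lightness),
    then take the pixel-wise Delta E of the filtered channels. The real metric
    filters in an opponent space with empirically derived CSF kernels."""
    f1 = np.stack([lowpass(lab1[..., i], s) for i, s in enumerate(sigmas)], axis=-1)
    f2 = np.stack([lowpass(lab2[..., i], s) for i, s in enumerate(sigmas)], axis=-1)
    return np.sqrt(np.sum((f1 - f2) ** 2, axis=-1))
```

Even in this simplified form, the fixed convolutions are the expensive part of the pipeline, which is what motivates the learned surrogate.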
&lt;br /&gt;
=== Patch-Based Perspective for Learning S-CIELAB ===&lt;br /&gt;
&lt;br /&gt;
Although S-CIELAB produces a pixel-wise ΔE map, the spatial filters operate locally. The perceptual decision at any location depends mainly on the statistics within a small neighborhood. Therefore, learning S‐CIELAB from data does not require full images; instead, patches provide: &lt;br /&gt;
&lt;br /&gt;
# Locality of Human Perception&lt;br /&gt;
#: A 2°×2° visual field corresponds roughly to the scale at which the retina integrates spatial information. This aligns well with S-CIELAB’s filtering nature.&lt;br /&gt;
# Statistical Stability&lt;br /&gt;
#: Patch-level averaging reduces noise and variation across strong textures, yielding a smoother learning target (mean ΔE per patch).&lt;br /&gt;
# Efficiency&lt;br /&gt;
#: Training on patches reduces GPU/CPU memory, increases dataset size, and removes global image dependencies.&lt;br /&gt;
# Simplifies Feature Engineering&lt;br /&gt;
#: MLPs or lightweight CNNs can learn localized perceptual behavior including patch features like lightness, chroma, chromatic contrast, and local frequency characteristics (indirectly through variance).&lt;br /&gt;
&lt;br /&gt;
Patch-level surrogates can capture essential S-CIELAB behavior without requiring full-image modeling.&lt;br /&gt;
&lt;br /&gt;
=== Dataset and Preprocessing ===&lt;br /&gt;
&lt;br /&gt;
In this project, the TID2013 image quality assessment dataset is used as the data source. It contains 25 reference images, each distorted by 24 distortion types across five severity levels. Each distorted image is paired with its corresponding reference image. For each reference–distorted pair, the ground-truth S-CIELAB ΔE map is computed using the scielabRGB function provided in ISETCam, with a display calibration file (crt.mat) and a viewing distance of 0.3 m. This produces a pixel-wise perceptual color-difference map that serves as the ground truth for neural network training.&lt;br /&gt;
&lt;br /&gt;
[[File:Fig1 tid images.PNG|750px|thumb|centre|Figure 1. L to R: Hat reference photo; hat reference photo with level 5 of distortion, lighthouse reference photo; lighthouse reference photo with level 4 of distortion.]]&lt;br /&gt;
&lt;br /&gt;
[[File:Fig2 tid distort.PNG|600px|thumb|centre|Figure 2. Examples of TID2013 reference and distorted images together with their corresponding S-CIELAB ΔE maps. (a) the original reference image and (b) its Gaussian noise distorted version at Level 1. (c) S-CIELAB ΔE heatmaps for Gaussian Noise Level 1 and (d) Level 5.]]&lt;br /&gt;
&lt;br /&gt;
While investigating TID2013 distortions, it can be observed that:&lt;br /&gt;
* Low-level distortion mainly affects local fine textures, leading to relatively small ΔE values.&lt;br /&gt;
* High-level distortion produces much stronger responses across the entire image, with significantly larger ΔE values.&lt;br /&gt;
* The spatial distribution of ΔE closely follows visually perceived degradation, confirming the perceptual relevance of S-CIELAB.&lt;br /&gt;
This visualization also motivates the patch-based learning strategy, as the perceptual error is locally structured and varies across spatial regions.&lt;br /&gt;
&lt;br /&gt;
Color indicates perceptual color difference, where higher ΔE values represent stronger perceived distortion. As expected, the fifth level of distortion severity exhibits substantially higher ΔE values and more visible spatial error patterns compared to the first level of distortion.&lt;br /&gt;
&lt;br /&gt;
=== Patch-Based Representation ===&lt;br /&gt;
Patch-based learning is used because S-CIELAB itself operates locally through spatial filtering. Patch-level learning greatly increases the number of training samples as it divides reference images into much smaller windows while avoiding the need for larger CNNs to capture local perceptual behavior. It enables a simple, fully connected MLP instead of spatial convolutions.&lt;br /&gt;
&lt;br /&gt;
In this project, image pairs and corresponding ∆E maps are divided into 2° × 2° visual angle patches. The patch size in pixels is calculated from the horizontal field of view (HFOV) of the scene as:&lt;br /&gt;
&lt;br /&gt;
&amp;lt;div style=&amp;quot;text-align:center;&amp;quot;&amp;gt;&lt;br /&gt;
&amp;lt;math&amp;gt;&lt;br /&gt;
\text{patch pixels} = 2^{\circ} / (\text{HFOV} / \text{image width})&lt;br /&gt;
&amp;lt;/math&amp;gt;&lt;br /&gt;
&amp;lt;/div&amp;gt;&lt;br /&gt;
&lt;br /&gt;
A 50% overlap of patches increases sample count and improves statistical coverage of spatial distortions.&lt;br /&gt;
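The patch-size formula and the 50%-overlap sampling can be sketched as follows. The project's extraction is done in MATLAB; this NumPy version is illustrative, and the 16° HFOV in the usage comment is a hypothetical example value, not a dataset parameter:

```python
import numpy as np

def patch_size_pixels(image_width, hfov_deg, patch_deg=2.0):
    """Pixels per patch so that each patch spans patch_deg degrees of visual angle:
    patch_pixels = patch_deg / (HFOV / image_width)."""
    deg_per_pixel = hfov_deg / image_width
    return int(round(patch_deg / deg_per_pixel))

def extract_patches(image, patch, overlap=0.5):
    """Slide a patch x patch window over the image with fractional overlap."""
    step = max(1, int(patch * (1 - overlap)))
    patches = []
    for y in range(0, image.shape[0] - patch + 1, step):
        for x in range(0, image.shape[1] - patch + 1, step):
            patches.append(image[y:y + patch, x:x + patch])
    return np.stack(patches)

# Example: a 512-pixel-wide image spanning a hypothetical 16 deg HFOV
# gives 64-pixel patches for a 2 deg x 2 deg window.
size = patch_size_pixels(512, 16.0)
```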
&lt;br /&gt;
=== Dataset Construction and Splitting ===&lt;br /&gt;
&lt;br /&gt;
After feature extraction, a total of over 200,000 patch samples are collected. The dataset is split as 80% for training and 20% for testing. Before training, all features are standardized to zero mean and unit variance using mapstd.&lt;br /&gt;
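MATLAB's mapstd rescales each feature to zero mean and unit variance; an equivalent NumPy sketch is below. One detail worth making explicit: the statistics should be fit on the training split only and then reused on the test split:

```python
import numpy as np

def fit_standardizer(X):
    """Per-feature zero-mean / unit-variance scaling (mapstd equivalent).
    X has shape (num_samples, num_features); fit on the training split only."""
    mu = X.mean(axis=0)
    sigma = X.std(axis=0)
    sigma[sigma == 0] = 1.0  # guard against constant features
    return mu, sigma

def apply_standardizer(X, mu, sigma):
    return (X - mu) / sigma
```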
&lt;br /&gt;
== MLP Implementation ==&lt;br /&gt;
=== MLP Feature Extraction (18-Dimensional Vector) ===&lt;br /&gt;
The MLP operates on engineered statistical features intended to summarize patch-level behavior relevant to S-CIELAB. For each patch, we compute:&lt;br /&gt;
* Reference XYZ: Mean (3), Standard deviation (3)&lt;br /&gt;
* Distorted XYZ: Mean (3), Standard deviation (3)&lt;br /&gt;
* Difference statistics: Mean difference (3), Standard deviation difference (3)&lt;br /&gt;
&lt;br /&gt;
This results in an 18-dimensional feature vector: &amp;lt;math&amp;gt;x=\left[ \mu_R, \sigma_R, \mu_T, \sigma_T, \Delta\mu, \Delta\sigma \right] \in \mathbb{R}^{18}&amp;lt;/math&amp;gt;&lt;br /&gt;
&lt;br /&gt;
The target label for each patch is the patch mean S-CIELAB ΔE.&lt;br /&gt;
&lt;br /&gt;
[[File:Fig3 mlp pipeline.PNG|750px|thumb|centre|Figure 3. MLP network structure - from an 18D input vector to 1 fully connected MLP that outputs a patch mean S-CIELAB ∆E.]]&lt;br /&gt;
&lt;br /&gt;
Instead of directly using raw pixel values as network inputs, this work adopts an 18-dimensional statistical feature representation extracted from each 2°×2° image patch. Specifically, for both the reference and distorted patches in XYZ color space, the mean and standard deviation of each channel are computed, forming 12 features. In addition, the differences between the corresponding means and standard deviations of the reference and distorted patches are calculated, resulting in a total of 18 features.&lt;br /&gt;
&lt;br /&gt;
This feature design is motivated by both perceptual and practical considerations. First, the S-CIELAB metric itself is not defined at the pixel level alone, but is constructed through spatial filtering and local color difference operations that integrate information over a neighborhood. Therefore, local statistical descriptors better reflect the perceptual behavior captured by S-CIELAB than individual pixel values.&lt;br /&gt;
&lt;br /&gt;
Second, using raw pixels would lead to an extremely high-dimensional input space. For example, a 2°×2° patch typically contains hundreds of pixels, and directly feeding these values into a neural network would significantly increase model complexity, training instability, and computational cost. In contrast, the proposed 18-dimensional feature vector provides a compact and low-dimensional representation while preserving the essential color and contrast information relevant to perceptual difference estimation.&lt;br /&gt;
&lt;br /&gt;
Third, patch-level statistical features improve the robustness of learning by suppressing pixel-level noise and small spatial misalignments. Since S-CIELAB is designed to model perceived color differences rather than exact pixel correspondences, learning from patch-wise statistics is more consistent with the perceptual objective of the metric.&lt;br /&gt;
&lt;br /&gt;
Overall, this feature representation allows the Multi-Layer Perceptron to focus on perceptually meaningful information while remaining lightweight, stable, and computationally efficient.&lt;br /&gt;
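The 18-dimensional descriptor described above can be sketched directly from a pair of XYZ patches. The sign convention for the differences (distorted minus reference) is an assumption for illustration; the report does not specify it:

```python
import numpy as np

def patch_features(ref_patch, dist_patch):
    """18-D descriptor: per-channel XYZ mean and std of the reference patch,
    of the distorted patch, and their differences (distorted - reference,
    an assumed sign convention). Patches have shape (H, W, 3)."""
    mu_r = ref_patch.mean(axis=(0, 1))
    sd_r = ref_patch.std(axis=(0, 1))
    mu_t = dist_patch.mean(axis=(0, 1))
    sd_t = dist_patch.std(axis=(0, 1))
    return np.concatenate([mu_r, sd_r, mu_t, sd_t, mu_t - mu_r, sd_t - sd_r])
```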
&lt;br /&gt;
=== MLP Regression Model ===&lt;br /&gt;
A Multi-Layer Perceptron (MLP) is used to learn the nonlinear mapping: &amp;lt;math&amp;gt;f_{\theta}:R^{18} \rightarrow R&amp;lt;/math&amp;gt;&lt;br /&gt;
&lt;br /&gt;
Network configuration:&lt;br /&gt;
* Input layer: 18 neurons&lt;br /&gt;
* Hidden layers: [64, 32]&lt;br /&gt;
* Output layer: 1 neuron (predicted ΔE)&lt;br /&gt;
* Activation: default MATLAB nonlinear activations&lt;br /&gt;
* Training algorithm: Levenberg–Marquardt&lt;br /&gt;
* Epochs: 200&lt;br /&gt;
* Validation split (0.1 for validation)&lt;br /&gt;
&lt;br /&gt;
The network is trained using mean squared error (MSE) loss.&lt;br /&gt;
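For concreteness, the forward mapping of the 18-64-32-1 regressor can be traced in NumPy. This is only a shape-level sketch with randomly initialized weights and tanh hidden units (MATLAB's default tansig activation is mathematically tanh); the actual model is trained in MATLAB with Levenberg–Marquardt, which this sketch does not reproduce:

```python
import numpy as np

rng = np.random.default_rng(0)
sizes = [18, 64, 32, 1]  # 18-D features -> hidden layers -> patch-mean Delta E
weights = [rng.standard_normal((m, n)) * 0.1 for m, n in zip(sizes[:-1], sizes[1:])]
biases = [np.zeros(n) for n in sizes[1:]]

def mlp_forward(x):
    """Forward pass: tanh hidden units, linear scalar output."""
    h = np.asarray(x, dtype=float)
    for W, b in zip(weights[:-1], biases[:-1]):
        h = np.tanh(h @ W + b)
    return h @ weights[-1] + biases[-1]
```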
&lt;br /&gt;
== CNN Implementation ==&lt;br /&gt;
=== CNN Input Representation ===&lt;br /&gt;
The CNN receives a 64x64x6 input tensor:&lt;br /&gt;
* Channels 1-3: reference patch (XYZ)&lt;br /&gt;
* Channels 4-6: distorted patch (XYZ)&lt;br /&gt;
This representation allows the network to learn spatial color differences directly in the tristimulus domain without engineered feature extraction.&lt;br /&gt;
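Assembling that input tensor is a single channel-wise concatenation, sketched here in NumPy (the project builds the equivalent tensor in MATLAB):

```python
import numpy as np

def make_cnn_input(ref_xyz, dist_xyz):
    """Stack a 64x64x3 reference XYZ patch and its distorted counterpart
    into the 64x64x6 tensor the CNN consumes."""
    return np.concatenate([ref_xyz, dist_xyz], axis=-1)
```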
&lt;br /&gt;
=== CNN Network Architecture ===&lt;br /&gt;
A lightweight UNet architecture is used to compute a per-pixel ∆E map. The encoder (2 layers) progressively reduces spatial resolution while expanding feature depth, extracting hierarchical spatial and chromatic features. The decoder (2 layers) reconstructs the spatial resolution using transposed convolutions and skip connections. Skip connections retain high-frequency information critical for modeling texture masking and spatial distortions. The CNN is trained using pixelwise MSE between predicted ∆E and ground truth ∆E.&lt;br /&gt;
&lt;br /&gt;
This architecture is well suited to perceptual tasks because it learns both fine-grained and contextual representations across spatial scales, mimicking S-CIELAB’s multiscale filtering.&lt;br /&gt;
&lt;br /&gt;
[[File:Fig4 cnn pipeline.PNG|1000px|thumb|centre|Figure 4. Simplified representation of U-Net architecture used in CNN implementation.]]&lt;br /&gt;
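The encoder/decoder flow with skip connections can be traced at the shape level. This toy NumPy trace uses mean-pooling and nearest-neighbour upsampling as stand-ins for the learned strided and transposed convolutions, and it does not expand feature depth the way the real encoder does; it only shows how resolution halves and doubles and where the skips concatenate:

```python
import numpy as np

def mean_pool2(x):
    """2x2 mean-pooling stand-in for a strided encoder convolution."""
    h, w, c = x.shape
    return x.reshape(h // 2, 2, w // 2, 2, c).mean(axis=(1, 3))

def upsample2(x):
    """Nearest-neighbour upsampling stand-in for a transposed convolution."""
    return x.repeat(2, axis=0).repeat(2, axis=1)

def unet_shape_trace(x):
    """Trace the 2-level encoder/decoder with skip connections (shapes only;
    the real model applies learned convolutions at every stage)."""
    e1 = mean_pool2(x)    # 64x64 -> 32x32
    e2 = mean_pool2(e1)   # 32x32 -> 16x16 bottleneck
    d1 = np.concatenate([upsample2(e2), e1], axis=-1)  # skip from e1
    d0 = np.concatenate([upsample2(d1), x], axis=-1)   # skip from the input
    return d0
```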
&lt;br /&gt;
== Experimental Results ==&lt;br /&gt;
=== Prediction Accuracy of the MLP Model ===&lt;br /&gt;
After training, the MLP model was evaluated on the independent test set. The predicted patch-wise mean ΔE values were compared against the ground-truth S-CIELAB values. The model achieves the following performance on the test set:&lt;br /&gt;
* Correlation Coefficient (R) ≈ 0.94&lt;br /&gt;
* RMSE ≈ 0.96&lt;br /&gt;
The scatter plot of predicted versus true ΔE values (Fig. 5) shows a strong linear relationship, indicating that the MLP is able to accurately learn the nonlinear mapping between the extracted XYZ features and the perceptual color difference measured by S-CIELAB.&lt;br /&gt;
&lt;br /&gt;
The use of the expanded 18-dimensional feature set leads to a significant improvement in prediction accuracy. This demonstrates that including both reference and distorted statistics, as well as their differences, provides richer perceptual information for learning.&lt;br /&gt;
&lt;br /&gt;
[[File:Fig5 mlp results.png|1000px|thumb|centre|Figure 5. Scatter plot comparing the MLP-predicted mean ΔE values with the ground-truth S-CIELAB ΔE computed from the TID2013 dataset. Each point represents one 2°×2° image patch. The model achieves strong correlation with the perceptual metric (R = 0.945) and low prediction error (RMSE = 0.961), demonstrating that a lightweight neural network can effectively approximate S-CIELAB for perceptual color-difference estimation.]]&lt;br /&gt;
&lt;br /&gt;
=== Prediction Accuracy of the CNN Model ===&lt;br /&gt;
After training, the CNN achieved strong predictive performance on the test set. Quantitatively, the network produced:&lt;br /&gt;
* Correlation coefficient (R) ≈ 0.96&lt;br /&gt;
* RMSE ≈ 1.85&lt;br /&gt;
The correlation coefficient of 0.96 represents a very strong linear relationship between the CNN-predicted perceptual difference values and the ground-truth S-CIELAB values. Although the CNN’s per-pixel RMSE is numerically higher than the MLP’s, this is expected: the CNN predicts full-resolution ΔE maps rather than a smoother per-patch scalar summary.&lt;br /&gt;
&lt;br /&gt;
These quantitative results also show that a CNN can approximate S-CIELAB with high accuracy.&lt;br /&gt;
&lt;br /&gt;
[[File:Fig6 cnn results.png|1000px|thumb|centre|Figure 6. (L) CNN training and convergence curve; (R) scatter plot of trained model correlation compared against the ideal prediction/ground-truth.]]&lt;br /&gt;
&lt;br /&gt;
== Discussion ==&lt;br /&gt;
The results demonstrate that both the MLP and CNN architectures can effectively approximate the S-CIELAB perceptual color-difference metric while remaining fully differentiable and significantly more computationally efficient. &lt;br /&gt;
&lt;br /&gt;
The MLP achieves strong performance using only simple XYZ-based statistical descriptors, confirming that much of S-CIELAB’s nonlinear behavior can be captured at the patch level without explicitly modeling spatial interactions. &lt;br /&gt;
&lt;br /&gt;
The CNN learns to reproduce full-resolution ΔE maps with high spatial fidelity. Its strong correlation shows that a UNet is capable of modeling S-CIELAB’s frequency-dependent filtering and texture-masking behavior directly from data. &lt;br /&gt;
&lt;br /&gt;
Together, these results indicate that S-CIELAB is inherently local and learnable, and that neural surrogates can effectively replace the original metric in scenarios requiring differentiability, speed, or integration into end-to-end imaging systems.&lt;br /&gt;
&lt;br /&gt;
== Conclusion ==&lt;br /&gt;
&lt;br /&gt;
This work presents a neural surrogate approach for approximating the S-CIELAB perceptual color-difference metric using both an MLP and CNN. The proposed MLP and CNN architectures provide efficient, differentiable models for predicting patch-level and pixel-level ΔE, respectively, offering substantial speed improvements and enabling seamless integration into modern optimization pipelines. These results validate that neural network implementations can effectively capture both the chromatic and spatial components of perceptual color differences.&lt;br /&gt;
&lt;br /&gt;
Future work will focus on extending the models and broadening their applicability and expanding possible parameters to be predicted. This includes:&lt;br /&gt;
* Incorporating training parameters such as FOV magnitude (e.g., 2°×2° patches vs. 5°×5°) and white point&lt;br /&gt;
* Translating the networks to Python for integration with the ISETPy repository&lt;br /&gt;
* Evaluating additional perceptual and image-quality metrics (e.g., chromaticity or lightness errors, full-image perceptual maps, and perceptual attention maps)&lt;br /&gt;
* Improving model performance by accelerating computation, expanding the training dataset, tuning hyperparameters, and experimenting with different overlapping patch sampling&lt;br /&gt;
&lt;br /&gt;
These directions will further enhance the power and generality of learned perceptual surrogates for imaging and color-processing applications.&lt;br /&gt;
&lt;br /&gt;
== Appendix ==&lt;br /&gt;
Code Repository: https://github.com/anbananna/Perceptual_Color_Metrics&lt;br /&gt;
&lt;br /&gt;
== References ==&lt;br /&gt;
[1] A simulation tool for evaluating digital camera image quality (2004). J. E. Farrell, F. Xiao, P. Catrysse, B. Wandell . Proc. SPIE vol. 5294, p. 124-131, Image Quality and System Performance, Miyake and Rasmussen (Eds). January 2004&lt;br /&gt;
&lt;br /&gt;
[2] Ponomarenko, N., Jin, L., Ieremeiev, O., Lukin, V., Egiazarian, K., Astola, J., Vozel, B., Chehdi, K., Carli, M., Battisti, F., &amp;amp; Kuo, C.-C. J. (2015). Image database TID2013: Peculiarities, results and perspectives. Signal Processing: Image Communication, 30, 57–77. https://doi.org/10.1016/j.image.2014.10.009 &lt;br /&gt;
&lt;br /&gt;
[3] Wandell, B. A. ISETCam: Image Systems Engineering Toolbox for Cameras. GitHub. https://github.com/iset/isetcam&lt;br /&gt;
&lt;br /&gt;
[4] Zhang, X., &amp;amp; Wandell, B. A. (1997). A spatial extension of CIELAB for digital color image reproduction. SID Symposium Digest of Technical Papers, 28, 731–734. https://doi.org/10.1889/1.1837854&lt;br /&gt;
&lt;br /&gt;
== Work Breakdown ==&lt;br /&gt;
&lt;br /&gt;
Shu An: MLP model development, MATLAB code for MLP model, written report, and MLP presentation slides&lt;br /&gt;
&lt;br /&gt;
Anna Yu: CNN model development, MATLAB code for CNN model, written report, and presentation&lt;/div&gt;</summary>
		<author><name>Annayu</name></author>
	</entry>
	<entry>
		<id>http://vista.su.domains/psych221wiki/index.php?title=Neural_Network_Implementation_of_S-CIELAB_for_Perceptual_Color_Metrics&amp;diff=145791</id>
		<title>Neural Network Implementation of S-CIELAB for Perceptual Color Metrics</title>
		<link rel="alternate" type="text/html" href="http://vista.su.domains/psych221wiki/index.php?title=Neural_Network_Implementation_of_S-CIELAB_for_Perceptual_Color_Metrics&amp;diff=145791"/>
		<updated>2025-12-10T04:17:39Z</updated>

		<summary type="html">&lt;p&gt;Annayu: /* Background */&lt;/p&gt;
&lt;hr /&gt;
&lt;div&gt;== Abstract ==&lt;br /&gt;
The Spatial CIELAB (S-CIELAB) metric is a widely used perceptual color-difference measure that incorporates both chromatic appearance and spatial properties of the human visual system. Despite its accuracy, S-CIELAB is computationally expensive due to its multi-stage processing pipeline, including opponent-color transformation, frequency-dependent spatial filtering, and nonlinear post-processing. Moreover, these filtering operations rely on fixed convolution kernels and nonlinearities that are typically not differentiable in a manner compatible with gradient-based optimization, making S-CIELAB difficult to integrate directly into learning-based imaging systems.&lt;br /&gt;
&lt;br /&gt;
This project investigates whether a neural network can learn a surrogate model that predicts S-CIELAB responses efficiently from local image patches. Using the TID2013 dataset, we develop two surrogate models: 1) a Multi-Layer Perceptron (MLP) trained on 18-dimensional XYZ-based statistical descriptors of local patches, and 2) a Convolutional Neural Network (CNN) trained to directly map 6-channel XYZ patch pairs to full-resolution ∆E maps. The MLP achieves R ≈ 0.94 and RMSE ≈ 0.96 for patch-mean ∆E prediction while the CNN achieves R ≈ 0.96 and RMSE ≈ 1.85 for per pixel ∆E prediction.&lt;br /&gt;
&lt;br /&gt;
These results show that a compact neural model can effectively approximate S-CIELAB while being fast and fully differentiable, enabling its potential use as a perceptual loss or quality metric in modern imaging pipelines.&lt;br /&gt;
&lt;br /&gt;
== Introduction ==&lt;br /&gt;
Perceptual color‐difference metrics play a fundamental role in modern imaging pipelines, image compression, and quality assessment systems. They aim to quantify human-perceived differences between a reference image and its distorted version, allowing algorithms to optimize not only pixel-wise accuracy but also perceptual fidelity. Among these metrics, S-CIELAB (Spatial CIELAB) has become one of the most influential extensions of the classic CIELAB ΔE formulation. Unlike conventional ΔE, which compares colors independently per pixel, S-CIELAB incorporates spatial filtering stages that approximate the frequency-dependent sensitivity of the human visual system (HVS). As a result, the metric aligns more closely with human perceived color differences, particularly in images containing high-frequency textures, blur, noise, or structured distortions. &lt;br /&gt;
&lt;br /&gt;
Although S-CIELAB significantly improves perceptual accuracy, its multi-stage processing pipeline involves large-kernel spatial convolutions, nonlinear transformations, and piecewise or non-differentiable operations. This makes S-CIELAB slow to compute at scale and fundamentally incompatible with gradient-based optimization.&lt;br /&gt;
&lt;br /&gt;
A patch-based formulation is particularly suitable for learning a surrogate model of S-CIELAB. S-CIELAB itself operates locally: its spatial filtering approximates the HVS contrast sensitivity over limited visual angles, and its ΔE computation depends primarily on neighborhood-level color differences rather than global structure. By extracting fixed-size patches corresponding to a constant visual angle (2°×2° in this work), we preserve the locality intrinsic to the metric while avoiding the need to model long-range correlations. Patch-based learning also increases the number of training samples significantly, improving statistical robustness, and allows the model to focus on local statistical features, such as mean chromaticity shifts or contrast changes, that most strongly influence S-CIELAB responses. This makes the surrogate easier to train, more compact, and more generalizable across diverse distortion types.&lt;br /&gt;
&lt;br /&gt;
To bridge the limitations of S-CIELAB and the needs of modern neural pipelines, recent research has explored learning surrogate models that mimic perceptual metrics while remaining computationally efficient and differentiable. A learnable surrogate model for S-CIELAB would allow imaging systems to optimize directly for perceptual color fidelity, improving their alignment with human judgments. In this project, we investigate whether a compact Multi-Layer Perceptron (MLP) or a Convolutional Neural Network (CNN) can learn to reproduce the S-CIELAB ΔE response at the patch level. Instead of training on synthetic images, we leverage the widely used TID2013 dataset, which provides 25 reference images and 24 distortion types across five severity levels.&lt;br /&gt;
&lt;br /&gt;
The objective of this work is twofold:&lt;br /&gt;
# to evaluate whether a simple neural model can approximate S-CIELAB with high accuracy, and&lt;br /&gt;
# to explore the feasibility of replacing costly perceptual metrics with lightweight, differentiable surrogates suitable for future imaging applications.&lt;br /&gt;
&lt;br /&gt;
== Background ==&lt;br /&gt;
Perceptual color‐difference metrics quantify how humans perceive changes between a reference color stimulus and a distorted one. Among these metrics, the CIE 1976 L*a*b* (CIELAB) color space and its associated ΔE formulations remain the most widely used because they provide a perceptually uniform representation of color differences under standardized viewing conditions.&lt;br /&gt;
&lt;br /&gt;
=== Computational CIE Color Models ===&lt;br /&gt;
Color difference metrics are built upon the CIE color‐appearance framework, which provides a device‐independent way of quantifying how humans perceive color. The foundational model is the CIE 1931 XYZ color space, derived from color‐matching functions that approximate the response of the human cone photoreceptors. Given a device RGB image, a calibrated 3×3 matrix (M) converts RGB intensities into XYZ tristimulus values:&lt;br /&gt;
&lt;br /&gt;
&amp;lt;div style=&amp;quot;text-align:center;&amp;quot;&amp;gt;&lt;br /&gt;
&amp;lt;math&amp;gt;&lt;br /&gt;
\begin{bmatrix}&lt;br /&gt;
X\\&lt;br /&gt;
Y\\&lt;br /&gt;
Z&lt;br /&gt;
\end{bmatrix} = M \begin{bmatrix}&lt;br /&gt;
R\\&lt;br /&gt;
G\\&lt;br /&gt;
B&lt;br /&gt;
\end{bmatrix}&lt;br /&gt;
&amp;lt;/math&amp;gt;&lt;br /&gt;
&amp;lt;/div&amp;gt;&lt;br /&gt;
&lt;br /&gt;
In this transformation, Y represents luminance, while X and Z carry chromatic information. Because XYZ is perceptually non-uniform, Euclidean distances in XYZ space do not reliably correspond to perceived color differences. To obtain a perceptually uniform space, CIE introduced CIELAB in 1976.&lt;br /&gt;
&lt;br /&gt;
=== CIELAB and Perceptual ΔE* Calculations ===&lt;br /&gt;
The nonlinear transformations that the CIELAB space applies to XYZ are as follows:&lt;br /&gt;
&amp;lt;div style=&amp;quot;text-align:center;&amp;quot;&amp;gt;&lt;br /&gt;
&amp;lt;math&amp;gt;&lt;br /&gt;
L^*=116f\left(\frac{Y}{Y_n}\right)-16, \quad a^*=500\left[f\left(\frac{X}{X_n}\right)-f\left(\frac{Y}{Y_n}\right)\right], \quad b^*=200\left[f\left(\frac{Y}{Y_n}\right)-f\left(\frac{Z}{Z_n}\right)\right]&lt;br /&gt;
&amp;lt;/math&amp;gt;&lt;br /&gt;
&amp;lt;/div&amp;gt;&lt;br /&gt;
&lt;br /&gt;
where&lt;br /&gt;
&lt;br /&gt;
&amp;lt;div style=&amp;quot;text-align:center;&amp;quot;&amp;gt;&lt;br /&gt;
&amp;lt;math&amp;gt;&lt;br /&gt;
f(t) =&lt;br /&gt;
\begin{cases}&lt;br /&gt;
t^{1/3}, &amp;amp; t &amp;gt; 0.008856 \\&lt;br /&gt;
7.787t+\frac{16}{116}, &amp;amp; \text{otherwise}&lt;br /&gt;
\end{cases}&lt;br /&gt;
&amp;lt;/math&amp;gt;&lt;br /&gt;
&amp;lt;/div&amp;gt;&lt;br /&gt;
&lt;br /&gt;
A reference white point (&amp;lt;math&amp;gt;X_n, Y_n, Z_n&amp;lt;/math&amp;gt;) is necessary to conduct these calculations.&lt;br /&gt;
&lt;br /&gt;
Perceptual color difference (ΔE) between two pixels is then calculated as:&lt;br /&gt;
&lt;br /&gt;
&amp;lt;div style=&amp;quot;text-align:center;&amp;quot;&amp;gt;&lt;br /&gt;
&amp;lt;math&amp;gt;&lt;br /&gt;
\Delta E_{ab}^*=\sqrt{(L_1^* - L_2^*)^2 + (a_1^* - a_2^*)^2 + (b_1^* - b_2^*)^2}&lt;br /&gt;
&amp;lt;/math&amp;gt;&lt;br /&gt;
&amp;lt;/div&amp;gt;&lt;br /&gt;
&lt;br /&gt;
This ΔE* metric is widely used due to its simplicity and perceptual relevance. However, it treats each pixel independently and therefore cannot model spatial visual masking, where texture or neighboring structures influence visibility. For example, a small change in a flat region is highly visible, while the same change embedded in strong texture may be nearly invisible. To overcome this limitation, spatial extensions were introduced.&lt;br /&gt;
&lt;br /&gt;
=== Spatial Extensions: The Principle Behind S-CIELAB ===&lt;br /&gt;
Rather than comparing pixels independently, S‐CIELAB incorporates frequency-dependent spatial filtering based on known characteristics of the human visual system:&lt;br /&gt;
* High-frequency noise is often masked by textures.&lt;br /&gt;
* Low-frequency distortions (blur, banding) are more visible.&lt;br /&gt;
* Chrominance channels have lower spatial resolution than luminance.&lt;br /&gt;
&lt;br /&gt;
S-CIELAB converts the image to an opponent color space and applies a separate low-pass filter to the luminance channel and to each of the two chrominance channels, using empirically derived contrast sensitivity functions (CSFs); the filtered result is then converted back to a CIELAB representation. Schematically:&lt;br /&gt;
&lt;br /&gt;
&amp;lt;div style=&amp;quot;text-align:center;&amp;quot;&amp;gt;&lt;br /&gt;
&amp;lt;math&amp;gt;&lt;br /&gt;
(L^*, a^*, b^*)_{\text{filtered}} = (L^*, a^*, b^*) \ast K_{\text{CSF}}&lt;br /&gt;
&amp;lt;/math&amp;gt;&lt;br /&gt;
&amp;lt;/div&amp;gt;&lt;br /&gt;
&lt;br /&gt;
where &amp;lt;math&amp;gt;\ast&amp;lt;/math&amp;gt; denotes convolution. The ΔE map is then computed over the filtered channels.&lt;br /&gt;
&lt;br /&gt;
This filtering makes S-CIELAB significantly better aligned with human perception than standard ΔE*, but it is also what makes the metric computationally expensive, non-differentiable, and unsuitable as a loss function for neural networks. Thus, even though S-CIELAB is accurate, it is hard to use in modern imaging pipelines.&lt;br /&gt;
&lt;br /&gt;
=== Patch-Based Perspective for Learning S-CIELAB ===&lt;br /&gt;
&lt;br /&gt;
Although S-CIELAB produces a pixel-wise ΔE map, the spatial filters operate locally. The perceptual decision at any location depends mainly on the statistics within a small neighborhood. Therefore, learning S‐CIELAB from data does not require full images; instead, patches provide: &lt;br /&gt;
&lt;br /&gt;
# Locality of Human Perception&lt;br /&gt;
#: A 2°×2° visual field corresponds roughly to the scale at which the retina integrates spatial information. This aligns well with S-CIELAB’s filtering nature.&lt;br /&gt;
# Statistical Stability&lt;br /&gt;
#: Patch-level averaging reduces noise and variation across strong textures, yielding a smoother learning target (mean ΔE per patch).&lt;br /&gt;
# Efficiency&lt;br /&gt;
#: Training on patches reduces GPU/CPU memory usage, increases dataset size, and removes global image dependencies.&lt;br /&gt;
# Simpler Feature Engineering&lt;br /&gt;
#: MLPs or lightweight CNNs can learn localized perceptual behavior including patch features like lightness, chroma, chromatic contrast, and local frequency characteristics (indirectly through variance).&lt;br /&gt;
&lt;br /&gt;
Patch-level surrogates can capture essential S-CIELAB behavior without requiring full-image modeling.&lt;br /&gt;
&lt;br /&gt;
=== Dataset and Preprocessing ===&lt;br /&gt;
&lt;br /&gt;
In this project, the TID2013 image quality assessment dataset is used as the data source. It contains 25 reference images, each distorted by 24 distortion types across five severity levels. Each distorted image is paired with its corresponding reference image. For each reference–distorted pair, the ground-truth S-CIELAB ΔE map is computed using the scielabRGB function provided in ISETCam, with a display calibration file (crt.mat) and a viewing distance of 0.3 m. This produces a pixel-wise perceptual color-difference map that serves as the ground truth for neural network training.&lt;br /&gt;
&lt;br /&gt;
[[File:Fig1 tid images.PNG|750px|thumb|centre|Figure 1. L to R: Hat reference photo; hat reference photo with level 5 of distortion, lighthouse reference photo; lighthouse reference photo with level 4 of distortion.]]&lt;br /&gt;
&lt;br /&gt;
[[File:Fig2 tid distort.PNG|600px|thumb|centre|Figure 2. Examples of TID2013 reference and distorted images together with their corresponding S-CIELAB ΔE maps. (a) the original reference image and (b) its Gaussian noise distorted version at Level 1. (c) S-CIELAB ΔE heatmaps for Gaussian Noise Level 1 and (d) Level 5.]]&lt;br /&gt;
&lt;br /&gt;
Examining the TID2013 distortions shows that:&lt;br /&gt;
* Low-level distortion mainly affects local fine textures, leading to relatively small ΔE values.&lt;br /&gt;
* High-level distortion produces much stronger responses across the entire image, with significantly larger ΔE values.&lt;br /&gt;
* The spatial distribution of ΔE closely follows visually perceived degradation, confirming the perceptual relevance of S-CIELAB.&lt;br /&gt;
This visualization also motivates the patch-based learning strategy, as the perceptual error is locally structured and varies across spatial regions.&lt;br /&gt;
&lt;br /&gt;
Color indicates perceptual color difference, where higher ΔE values represent stronger perceived distortion. As expected, the fifth level of distortion severity exhibits substantially higher ΔE values and more visible spatial error patterns compared to the first level of distortion.&lt;br /&gt;
&lt;br /&gt;
=== Patch-Based Representation ===&lt;br /&gt;
Patch-based learning is used because S-CIELAB itself operates locally through spatial filtering. Patch-level learning greatly increases the number of training samples as it divides reference images into much smaller windows while avoiding the need for larger CNNs to capture local perceptual behavior. It enables a simple, fully connected MLP instead of spatial convolutions.&lt;br /&gt;
&lt;br /&gt;
In this project, image pairs and corresponding ∆E maps are divided into 2° × 2° visual angle patches. The patch size in pixels is calculated from the horizontal field of view (HFOV) of the scene as:&lt;br /&gt;
&lt;br /&gt;
&amp;lt;div style=&amp;quot;text-align:center;&amp;quot;&amp;gt;&lt;br /&gt;
&amp;lt;math&amp;gt;&lt;br /&gt;
\text{patch pixels} = 2^{\circ} / (\text{HFOV} / \text{image width})&lt;br /&gt;
&amp;lt;/math&amp;gt;&lt;br /&gt;
&amp;lt;/div&amp;gt;&lt;br /&gt;
&lt;br /&gt;
A 50% overlap of patches increases sample count and improves statistical coverage of spatial distortions.&lt;br /&gt;
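&lt;br /&gt;
The tiling step can be sketched as follows (Python; function names are illustrative, and the HFOV is assumed to be given in degrees):&lt;br /&gt;
&lt;br /&gt;
```python
import numpy as np

def patch_size_px(image_width_px, hfov_deg, patch_deg=2.0):
    """Pixels spanned by `patch_deg` of visual angle, given the scene HFOV."""
    deg_per_px = hfov_deg / image_width_px
    return int(round(patch_deg / deg_per_px))

def extract_patches(img, patch_px):
    """Tile an (H, W, C) array into square patches with 50% overlap."""
    step = patch_px // 2
    H, W = img.shape[:2]
    return [img[r:r + patch_px, c:c + patch_px]
            for r in range(0, H - patch_px + 1, step)
            for c in range(0, W - patch_px + 1, step)]
```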
&lt;br /&gt;
=== Dataset Construction and Splitting ===&lt;br /&gt;
&lt;br /&gt;
After feature extraction, more than 200,000 patch samples are collected. The dataset is split into 80% for training and 20% for testing. Before training, all features are standardized to zero mean and unit variance using MATLAB's mapstd function.&lt;br /&gt;
&lt;br /&gt;
== MLP Implementation ==&lt;br /&gt;
=== MLP Feature Extraction (18-Dimensional Vector) ===&lt;br /&gt;
The MLP operates on engineered statistical features intended to summarize patch-level behavior relevant to S-CIELAB. For each patch, we compute:&lt;br /&gt;
* Reference XYZ: Mean (3), Standard deviation (3)&lt;br /&gt;
* Distorted XYZ: Mean (3), Standard deviation (3)&lt;br /&gt;
* Difference statistics: Mean difference (3), Standard deviation difference (3)&lt;br /&gt;
&lt;br /&gt;
This results in an 18-dimensional feature vector: &amp;lt;math&amp;gt;x=\left[ \mu_R, \sigma_R, \mu_T, \sigma_T, \Delta\mu, \Delta\sigma \right] \in \mathbb{R}^{18}&amp;lt;/math&amp;gt;&lt;br /&gt;
&lt;br /&gt;
The target label for each patch is the patch mean S-CIELAB ΔE.&lt;br /&gt;
&lt;br /&gt;
[[File:Fig3 mlp pipeline.PNG|750px|thumb|centre|Figure 3. MLP network structure: an 18-D input vector is fed to a fully connected MLP that outputs the patch-mean S-CIELAB ∆E.]]&lt;br /&gt;
&lt;br /&gt;
Instead of directly using raw pixel values as network inputs, this work adopts an 18-dimensional statistical feature representation extracted from each 2°×2° image patch. Specifically, for both the reference and distorted patches in XYZ color space, the mean and standard deviation of each channel are computed, forming 12 features. In addition, the differences between the corresponding means and standard deviations of the reference and distorted patches are calculated, resulting in a total of 18 features.&lt;br /&gt;
&lt;br /&gt;
This feature design is motivated by both perceptual and practical considerations. First, the S-CIELAB metric itself is not defined at the pixel level alone, but is constructed through spatial filtering and local color difference operations that integrate information over a neighborhood. Therefore, local statistical descriptors better reflect the perceptual behavior captured by S-CIELAB than individual pixel values.&lt;br /&gt;
&lt;br /&gt;
Second, using raw pixels would lead to an extremely high-dimensional input space. For example, a 2°×2° patch typically contains hundreds of pixels, and directly feeding these values into a neural network would significantly increase model complexity, training instability, and computational cost. In contrast, the proposed 18-dimensional feature vector provides a compact and low-dimensional representation while preserving the essential color and contrast information relevant to perceptual difference estimation.&lt;br /&gt;
&lt;br /&gt;
Third, patch-level statistical features improve the robustness of learning by suppressing pixel-level noise and small spatial misalignments. Since S-CIELAB is designed to model perceived color differences rather than exact pixel correspondences, learning from patch-wise statistics is more consistent with the perceptual objective of the metric.&lt;br /&gt;
&lt;br /&gt;
Overall, this feature representation allows the Multi-Layer Perceptron to focus on perceptually meaningful information while remaining lightweight, stable, and computationally efficient.&lt;br /&gt;
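&lt;br /&gt;
The 18-dimensional descriptor is compact enough to write down directly. A NumPy sketch (the sign convention for the difference terms is an assumption, since only their presence is specified above):&lt;br /&gt;
&lt;br /&gt;
```python
import numpy as np

def patch_features(ref_xyz, dist_xyz):
    """Build the 18-D descriptor [mu_R, sigma_R, mu_T, sigma_T, dmu, dsigma]
    from (H, W, 3) reference and distorted XYZ patches. The difference terms
    are taken as distorted minus reference (assumed sign convention)."""
    mu_r = ref_xyz.mean(axis=(0, 1))   # per-channel means, reference
    sd_r = ref_xyz.std(axis=(0, 1))    # per-channel stds, reference
    mu_t = dist_xyz.mean(axis=(0, 1))  # per-channel means, distorted
    sd_t = dist_xyz.std(axis=(0, 1))   # per-channel stds, distorted
    return np.concatenate([mu_r, sd_r, mu_t, sd_t, mu_t - mu_r, sd_t - sd_r])
```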
&lt;br /&gt;
=== MLP Regression Model ===&lt;br /&gt;
A Multi-Layer Perceptron (MLP) is used to learn the nonlinear mapping: &amp;lt;math&amp;gt;f_{\theta}:\mathbb{R}^{18} \rightarrow \mathbb{R}&amp;lt;/math&amp;gt;&lt;br /&gt;
&lt;br /&gt;
Network configuration:&lt;br /&gt;
* Input layer: 18 neurons&lt;br /&gt;
* Hidden layers: [64, 32]&lt;br /&gt;
* Output layer: 1 neuron (predicted ΔE)&lt;br /&gt;
* Activation: default MATLAB nonlinear activations&lt;br /&gt;
* Training algorithm: Levenberg–Marquardt&lt;br /&gt;
* Epochs: 200&lt;br /&gt;
* Validation split: 10% of the training data&lt;br /&gt;
&lt;br /&gt;
The network is trained using mean squared error (MSE) loss.&lt;br /&gt;
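&lt;br /&gt;
An equivalent model is easy to reproduce outside MATLAB. The sketch below uses scikit-learn on toy stand-in data; scikit-learn has no Levenberg-Marquardt solver, so its default 'adam' optimizer stands in for MATLAB's trainlm, and StandardScaler plays the role of mapstd:&lt;br /&gt;
&lt;br /&gt;
```python
import numpy as np
from sklearn.neural_network import MLPRegressor
from sklearn.preprocessing import StandardScaler

# Toy stand-in data: 18-D patch features -> patch-mean Delta E target.
rng = np.random.default_rng(0)
X = rng.normal(size=(500, 18))
y = X[:, :3].sum(axis=1) + 0.1 * rng.normal(size=500)

scaler = StandardScaler()              # analogous to MATLAB's mapstd
Xs = scaler.fit_transform(X)

# 18 -> [64, 32] -> 1, with a 10% validation split as in the configuration
# above. 'adam' replaces Levenberg-Marquardt, which scikit-learn lacks.
mlp = MLPRegressor(hidden_layer_sizes=(64, 32), max_iter=200,
                   early_stopping=True, validation_fraction=0.1,
                   random_state=0)
mlp.fit(Xs, y)
pred = mlp.predict(Xs)
```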
&lt;br /&gt;
== CNN Implementation ==&lt;br /&gt;
=== CNN Input Representation ===&lt;br /&gt;
The CNN receives a 64×64×6 input tensor:&lt;br /&gt;
* Channels 1-3: reference patch (XYZ)&lt;br /&gt;
* Channels 4-6: distorted patch (XYZ)&lt;br /&gt;
This representation allows the network to learn spatial color differences directly in the tristimulus domain without engineered feature extraction.&lt;br /&gt;
&lt;br /&gt;
=== CNN Network Architecture ===&lt;br /&gt;
A lightweight U-Net architecture is used to compute a per-pixel ∆E map. The encoder (2 layers) progressively reduces spatial resolution while expanding feature depth, extracting hierarchical spatial and chromatic features. The decoder (2 layers) reconstructs the spatial resolution using transposed convolutions and skip connections. Skip connections retain high-frequency information critical for modeling texture masking and spatial distortions. The CNN is trained using pixelwise MSE between the predicted and ground-truth ∆E maps.&lt;br /&gt;
&lt;br /&gt;
This architecture is well suited to perceptual tasks because it learns both fine-grained and contextual representations across spatial scales, mimicking S-CIELAB’s multiscale filtering.&lt;br /&gt;
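&lt;br /&gt;
A shape-only NumPy trace of this data flow (average pooling and nearest-neighbour upsampling stand in for the learned strided and transposed convolutions; channel counts are illustrative, since real convolution layers would change them):&lt;br /&gt;
&lt;br /&gt;
```python
import numpy as np

def pool2(x):
    """2x2 average pooling: the encoder's downsampling step."""
    H, W, C = x.shape
    return x.reshape(H // 2, 2, W // 2, 2, C).mean(axis=(1, 3))

def up2(x):
    """Nearest-neighbour upsampling: stand-in for a transposed convolution."""
    return x.repeat(2, axis=0).repeat(2, axis=1)

def unet_shape_trace(x):
    """Trace a 64x64x6 patch pair through a 2-level encoder/decoder with skips."""
    e1 = pool2(x)                                # 32x32 encoder feature
    e2 = pool2(e1)                               # 16x16 bottleneck
    d1 = np.concatenate([up2(e2), e1], axis=-1)  # decoder level 1 + skip
    d0 = np.concatenate([up2(d1), x], axis=-1)   # decoder level 0 + skip
    return d0.mean(axis=-1, keepdims=True)       # 1-channel Delta-E head

out = unet_shape_trace(np.zeros((64, 64, 6)))
```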
&lt;br /&gt;
[[File:Fig4 cnn pipeline.PNG|800px|thumb|centre|Figure 4. Simplified representation of U-Net architecture used in CNN implementation.]]&lt;/div&gt;</summary>
		<author><name>Annayu</name></author>
	</entry>
	<entry>
		<id>http://vista.su.domains/psych221wiki/index.php?title=File:Fig6_cnn_results.png&amp;diff=145784</id>
		<title>File:Fig6 cnn results.png</title>
		<link rel="alternate" type="text/html" href="http://vista.su.domains/psych221wiki/index.php?title=File:Fig6_cnn_results.png&amp;diff=145784"/>
		<updated>2025-12-10T03:34:14Z</updated>

		<summary type="html">&lt;p&gt;Annayu: &lt;/p&gt;
&lt;hr /&gt;
&lt;div&gt;&lt;/div&gt;</summary>
		<author><name>Annayu</name></author>
	</entry>
	<entry>
		<id>http://vista.su.domains/psych221wiki/index.php?title=File:Fig5_mlp_results.png&amp;diff=145783</id>
		<title>File:Fig5 mlp results.png</title>
		<link rel="alternate" type="text/html" href="http://vista.su.domains/psych221wiki/index.php?title=File:Fig5_mlp_results.png&amp;diff=145783"/>
		<updated>2025-12-10T03:33:37Z</updated>

		<summary type="html">&lt;p&gt;Annayu: &lt;/p&gt;
&lt;hr /&gt;
&lt;div&gt;&lt;/div&gt;</summary>
		<author><name>Annayu</name></author>
	</entry>
	<entry>
		<id>http://vista.su.domains/psych221wiki/index.php?title=File:Fig4_cnn_pipeline.PNG&amp;diff=145782</id>
		<title>File:Fig4 cnn pipeline.PNG</title>
		<link rel="alternate" type="text/html" href="http://vista.su.domains/psych221wiki/index.php?title=File:Fig4_cnn_pipeline.PNG&amp;diff=145782"/>
		<updated>2025-12-10T03:29:04Z</updated>

		<summary type="html">&lt;p&gt;Annayu: &lt;/p&gt;
&lt;hr /&gt;
&lt;div&gt;&lt;/div&gt;</summary>
		<author><name>Annayu</name></author>
	</entry>
	<entry>
		<id>http://vista.su.domains/psych221wiki/index.php?title=File:Fig3_mlp_pipeline.PNG&amp;diff=145778</id>
		<title>File:Fig3 mlp pipeline.PNG</title>
		<link rel="alternate" type="text/html" href="http://vista.su.domains/psych221wiki/index.php?title=File:Fig3_mlp_pipeline.PNG&amp;diff=145778"/>
		<updated>2025-12-10T03:28:07Z</updated>

		<summary type="html">&lt;p&gt;Annayu: &lt;/p&gt;
&lt;hr /&gt;
&lt;div&gt;&lt;/div&gt;</summary>
		<author><name>Annayu</name></author>
	</entry>
	<entry>
		<id>http://vista.su.domains/psych221wiki/index.php?title=File:Fig2_tid_distort.PNG&amp;diff=145774</id>
		<title>File:Fig2 tid distort.PNG</title>
		<link rel="alternate" type="text/html" href="http://vista.su.domains/psych221wiki/index.php?title=File:Fig2_tid_distort.PNG&amp;diff=145774"/>
		<updated>2025-12-10T03:26:55Z</updated>

		<summary type="html">&lt;p&gt;Annayu: &lt;/p&gt;
&lt;hr /&gt;
&lt;div&gt;&lt;/div&gt;</summary>
		<author><name>Annayu</name></author>
	</entry>
	<entry>
		<id>http://vista.su.domains/psych221wiki/index.php?title=File:Fig1_tid_images.PNG&amp;diff=145773</id>
		<title>File:Fig1 tid images.PNG</title>
		<link rel="alternate" type="text/html" href="http://vista.su.domains/psych221wiki/index.php?title=File:Fig1_tid_images.PNG&amp;diff=145773"/>
		<updated>2025-12-10T03:26:36Z</updated>

		<summary type="html">&lt;p&gt;Annayu: &lt;/p&gt;
&lt;hr /&gt;
&lt;div&gt;&lt;/div&gt;</summary>
		<author><name>Annayu</name></author>
	</entry>
	<entry>
		<id>http://vista.su.domains/psych221wiki/index.php?title=Neural_Network_Implementation_of_S-CIELAB_for_Perceptual_Color_Metrics&amp;diff=145768</id>
		<title>Neural Network Implementation of S-CIELAB for Perceptual Color Metrics</title>
		<link rel="alternate" type="text/html" href="http://vista.su.domains/psych221wiki/index.php?title=Neural_Network_Implementation_of_S-CIELAB_for_Perceptual_Color_Metrics&amp;diff=145768"/>
		<updated>2025-12-10T03:20:56Z</updated>

		<summary type="html">&lt;p&gt;Annayu: /* Background */&lt;/p&gt;
&lt;hr /&gt;
&lt;div&gt;== Abstract ==&lt;br /&gt;
The Spatial CIELAB (S-CIELAB) metric is a widely used perceptual color-difference measure that incorporates both chromatic appearance and spatial properties of the human visual system. Despite its accuracy, S-CIELAB is computationally expensive due to its multi-stage processing pipeline, including opponent-color transformation, frequency-dependent spatial filtering, and nonlinear post-processing. Moreover, these filtering operations rely on fixed convolution kernels and nonlinearities that are typically not differentiable in a manner compatible with gradient-based optimization, making S-CIELAB difficult to integrate directly into learning-based imaging systems.&lt;br /&gt;
&lt;br /&gt;
This project investigates whether a neural network can learn a surrogate model that predicts S-CIELAB responses efficiently from local image patches. Using the TID2013 dataset, we develop two surrogate models: 1) a Multi-Layer Perceptron (MLP) trained on 18-dimensional XYZ-based statistical descriptors of local patches, and 2) a Convolutional Neural Network (CNN) trained to directly map 6-channel XYZ patch pairs to full-resolution ∆E maps. The MLP achieves R ≈ 0.94 and RMSE ≈ 0.96 for patch-mean ∆E prediction while the CNN achieves R ≈ 0.96 and RMSE ≈ 1.85 for per pixel ∆E prediction.&lt;br /&gt;
&lt;br /&gt;
These results show that a compact neural model can effectively approximate S-CIELAB while being fast and fully differentiable, enabling its potential use as a perceptual loss or quality metric in modern imaging pipelines.&lt;br /&gt;
&lt;br /&gt;
== Introduction ==&lt;br /&gt;
Perceptual color‐difference metrics play a fundamental role in modern imaging pipelines, image compression, and quality assessment systems. They aim to quantify human-perceived differences between a reference image and its distorted version, allowing algorithms to optimize not only pixel-wise accuracy but also perceptual fidelity. Among these metrics, S-CIELAB (Spatial CIELAB) has become one of the most influential extensions of the classic CIELAB ΔE formulation. Unlike conventional ΔE, which compares colors independently per pixel, S-CIELAB incorporates spatial filtering stages that approximate the frequency-dependent sensitivity of the human visual system (HVS). As a result, the metric aligns more closely with human perceived color differences, particularly in images containing high-frequency textures, blur, noise, or structured distortions. &lt;br /&gt;
&lt;br /&gt;
Although S-CIELAB significantly improves perceptual accuracy, its multi-stage processing pipeline involves large kernel spatial convolutions, nonlinear transformations, and piecewise or non-differentiable operations. This makes S-CIELAB slow to compute at scale and fundamentally incompatible with gradient-based optimizations.&lt;br /&gt;
&lt;br /&gt;
A patch-based formulation is particularly suitable for learning a surrogate model of S-CIELAB. S-CIELAB itself operates locally: its spatial filtering approximates the HVS contrast sensitivity over limited visual angles, and its ΔE computation depends primarily on neighborhood-level color differences rather than global structure. By extracting fixed-size patches corresponding to a constant visual angle (2°×2° in this work), we preserve the locality intrinsic to the metric while avoiding the need to model long-range correlations. Patch-based learning also increases the number of training samples significantly, improving statistical robustness, and allows the model to focus on local statistical features, such as mean chromaticity shifts or contrast changes, that most strongly influence S-CIELAB responses. This makes the surrogate easier to train, more compact, and more generalizable across diverse distortion types.&lt;br /&gt;
&lt;br /&gt;
To bridge the limitations of S-CIELAB and the needs of modern neural pipelines, recent research has explored learning surrogate models that mimic perceptual metrics while remaining computationally efficient and differentiable. A learnable surrogate model for S-CIELAB would allow imaging systems to optimize directly for perceptual color fidelity, improving their alignment with human judgments. In this project, we investigate whether a compact Multi-Layer Perceptron (MLP) or a Convolutional Neural Network (CNN) can learn to reproduce the S-CIELAB ΔE response at the patch level. Instead of training on synthetic images, we leverage the widely used TID2013 dataset, which provides 25 reference images and 24 distortion types across five severity levels.&lt;br /&gt;
&lt;br /&gt;
The objective of this work is twofold:&lt;br /&gt;
# to evaluate whether a simple neural model can approximate S-CIELAB with high accuracy, and&lt;br /&gt;
# to explore the feasibility of replacing costly perceptual metrics with lightweight, differentiable surrogates suitable for future imaging applications.&lt;br /&gt;
&lt;br /&gt;
== Background ==&lt;br /&gt;
Perceptual color‐difference metrics quantify how humans perceive changes between a reference color stimulus and a distorted one. Among these metrics, the CIE 1976 L*a*b* (CIELAB) color space and its associated ΔE formulations remain the most widely used because they provide a perceptually uniform representation of color differences under standardized viewing conditions.&lt;br /&gt;
&lt;br /&gt;
=== Computational CIE Color Models ===&lt;br /&gt;
Color difference metrics are built upon the CIE color‐appearance framework, which provides a device‐independent way of quantifying how humans perceive color. The foundational model is the CIE 1931 XYZ color space, derived from color‐matching functions that approximate the response of the human cone photoreceptors. Given a device RGB image, a calibrated 3×3 matrix (M) converts RGB intensities into XYZ tristimulus values:&lt;br /&gt;
&lt;br /&gt;
&amp;lt;div style=&amp;quot;text-align:center;&amp;quot;&amp;gt;&lt;br /&gt;
&amp;lt;math&amp;gt;&lt;br /&gt;
\begin{bmatrix}&lt;br /&gt;
X\\&lt;br /&gt;
Y\\&lt;br /&gt;
Z&lt;br /&gt;
\end{bmatrix} = M \begin{bmatrix}&lt;br /&gt;
R\\&lt;br /&gt;
G\\&lt;br /&gt;
B&lt;br /&gt;
\end{bmatrix}&lt;br /&gt;
&amp;lt;/math&amp;gt;&lt;br /&gt;
&amp;lt;/div&amp;gt;&lt;br /&gt;
&lt;br /&gt;
In this transformation, Y represents luminance, while X and Z carry chromatic information. Because XYZ is perceptually non-uniform, Euclidean distances in XYZ space do not reliably correspond to perceived color differences. To obtain a perceptually uniform space, CIE introduced CIELAB in 1976.&lt;br /&gt;
&lt;br /&gt;
=== CIELAB and Perceptual ΔE* Calculations ===&lt;br /&gt;
The nonlinear transformations that the CIELAB space applies to XYZ are as follows:&lt;br /&gt;
&amp;lt;div style=&amp;quot;text-align:center;&amp;quot;&amp;gt;&lt;br /&gt;
&amp;lt;math&amp;gt;&lt;br /&gt;
L^*=116f\left(\frac{Y}{Y_n}\right)-16,\quad a^*=500\left[f\left(\frac{X}{X_n}\right)-f\left(\frac{Y}{Y_n}\right)\right],\quad b^*=200\left[f\left(\frac{Y}{Y_n}\right)-f\left(\frac{Z}{Z_n}\right)\right]&lt;br /&gt;
&amp;lt;/math&amp;gt;&lt;br /&gt;
&amp;lt;/div&amp;gt;&lt;br /&gt;
&lt;br /&gt;
where&lt;br /&gt;
&lt;br /&gt;
&amp;lt;div style=&amp;quot;text-align:center;&amp;quot;&amp;gt;&lt;br /&gt;
&amp;lt;math&amp;gt;&lt;br /&gt;
f(t) =&lt;br /&gt;
\begin{cases}&lt;br /&gt;
t^{1/3}, &amp;amp; t &amp;gt; 0.008856 \\&lt;br /&gt;
7.787t+\frac{16}{116}, &amp;amp; \text{otherwise}&lt;br /&gt;
\end{cases}&lt;br /&gt;
&amp;lt;/math&amp;gt;&lt;br /&gt;
&amp;lt;/div&amp;gt;&lt;br /&gt;
&lt;br /&gt;
A reference white point (&amp;lt;math&amp;gt;X_n, Y_n, Z_n&amp;lt;/math&amp;gt;) is necessary to conduct these calculations.&lt;br /&gt;
&lt;br /&gt;
Perceptual color difference (ΔE) between two pixels is then calculated as:&lt;br /&gt;
&lt;br /&gt;
&amp;lt;div style=&amp;quot;text-align:center;&amp;quot;&amp;gt;&lt;br /&gt;
&amp;lt;math&amp;gt;&lt;br /&gt;
\Delta E_{ab}^*=\sqrt{(L_1^* - L_2^*)^2 + (a_1^* - a_2^*)^2 + (b_1^* - b_2^*)^2}&lt;br /&gt;
&amp;lt;/math&amp;gt;&lt;br /&gt;
&amp;lt;/div&amp;gt;&lt;br /&gt;
&lt;br /&gt;
This ΔE* metric is widely used due to its simplicity and perceptual relevance. However, it treats each pixel independently and therefore cannot model spatial visual masking, where texture or neighboring structures influence visibility. For example, a small change in a flat region is highly visible, whereas the same change embedded in strong texture may be nearly invisible. To overcome this limitation, spatial extensions were introduced.&lt;br /&gt;
&lt;br /&gt;
=== Spatial Extensions: The Principle Behind S-CIELAB ===&lt;br /&gt;
Rather than comparing pixels independently, S‐CIELAB incorporates frequency-dependent spatial filtering based on known characteristics of the human visual system:&lt;br /&gt;
* High-frequency noise is often masked by textures.&lt;br /&gt;
* Low-frequency distortions (blur, banding) are more visible.&lt;br /&gt;
* Chrominance channels have lower spatial resolution than luminance.&lt;br /&gt;
&lt;br /&gt;
S-CIELAB converts the image to an opponent color space and applies a separate low-pass filter to the luminance channel and to each of the two chrominance channels, using empirically derived contrast sensitivity functions (CSFs); the filtered result is then converted back to a CIELAB representation. Schematically:&lt;br /&gt;
&lt;br /&gt;
&amp;lt;div style=&amp;quot;text-align:center;&amp;quot;&amp;gt;&lt;br /&gt;
&amp;lt;math&amp;gt;&lt;br /&gt;
(L^*, a^*, b^*)_{\text{filtered}} = (L^*, a^*, b^*) \ast K_{\text{CSF}}&lt;br /&gt;
&amp;lt;/math&amp;gt;&lt;br /&gt;
&amp;lt;/div&amp;gt;&lt;br /&gt;
&lt;br /&gt;
where &amp;lt;math&amp;gt;\ast&amp;lt;/math&amp;gt; denotes convolution. The ΔE map is then computed over the filtered channels.&lt;br /&gt;
&lt;br /&gt;
This filtering makes S-CIELAB significantly better aligned with human perception than standard ΔE*, but it is also what makes the metric computationally expensive, non-differentiable, and unsuitable as a loss function for neural networks. Thus, even though S-CIELAB is accurate, it is hard to use in modern imaging pipelines.&lt;br /&gt;
&lt;br /&gt;
=== Patch-Based Perspective for Learning S-CIELAB ===&lt;br /&gt;
&lt;br /&gt;
Although S-CIELAB produces a pixel-wise ΔE map, the spatial filters operate locally. The perceptual decision at any location depends mainly on the statistics within a small neighborhood. Therefore, learning S‐CIELAB from data does not require full images; instead, patches provide: &lt;br /&gt;
&lt;br /&gt;
# Locality of Human Perception&lt;br /&gt;
#: A 2°×2° visual field corresponds roughly to the scale at which the retina integrates spatial information. This aligns well with S-CIELAB’s filtering nature.&lt;br /&gt;
# Statistical Stability&lt;br /&gt;
#: Patch-level averaging reduces noise and variation across strong textures, yielding a smoother learning target (mean ΔE per patch).&lt;br /&gt;
# Efficiency&lt;br /&gt;
#: Training on patches reduces GPU/CPU memory usage, increases dataset size, and removes global image dependencies.&lt;br /&gt;
# Simpler Feature Engineering&lt;br /&gt;
#: MLPs or lightweight CNNs can learn localized perceptual behavior including patch features like lightness, chroma, chromatic contrast, and local frequency characteristics (indirectly through variance).&lt;br /&gt;
&lt;br /&gt;
Patch-level surrogates can capture essential S-CIELAB behavior without requiring full-image modeling.&lt;/div&gt;</summary>
		<author><name>Annayu</name></author>
	</entry>
	<entry>
		<id>http://vista.su.domains/psych221wiki/index.php?title=Neural_Network_Implementation_of_S-CIELAB_for_Perceptual_Color_Metrics&amp;diff=145754</id>
		<title>Neural Network Implementation of S-CIELAB for Perceptual Color Metrics</title>
		<link rel="alternate" type="text/html" href="http://vista.su.domains/psych221wiki/index.php?title=Neural_Network_Implementation_of_S-CIELAB_for_Perceptual_Color_Metrics&amp;diff=145754"/>
		<updated>2025-12-10T02:21:36Z</updated>

		<summary type="html">&lt;p&gt;Annayu: Created page with &amp;quot;== Abstract == The Spatial CIELAB (S-CIELAB) metric is a widely used perceptual color-difference measure that incorporates both chromatic appearance and spatial properties of the human visual system. Despite its accuracy, S-CIELAB is computationally expensive due to its multi-stage processing pipeline, including opponent-color transformation, frequency-dependent spatial filtering, and nonlinear post-processing. Moreover, these filtering operations rely on fixed convoluti...&amp;quot;&lt;/p&gt;
&lt;hr /&gt;
&lt;div&gt;== Abstract ==&lt;br /&gt;
The Spatial CIELAB (S-CIELAB) metric is a widely used perceptual color-difference measure that incorporates both chromatic appearance and spatial properties of the human visual system. Despite its accuracy, S-CIELAB is computationally expensive due to its multi-stage processing pipeline, including opponent-color transformation, frequency-dependent spatial filtering, and nonlinear post-processing. Moreover, these filtering operations rely on fixed convolution kernels and nonlinearities that are typically not differentiable in a manner compatible with gradient-based optimization, making S-CIELAB difficult to integrate directly into learning-based imaging systems.&lt;br /&gt;
&lt;br /&gt;
This project investigates whether a neural network can learn a surrogate model that predicts S-CIELAB responses efficiently from local image patches. Using the TID2013 dataset, we develop two surrogate models: 1) a Multi-Layer Perceptron (MLP) trained on 18-dimensional XYZ-based statistical descriptors of local patches, and 2) a Convolutional Neural Network (CNN) trained to directly map 6-channel XYZ patch pairs to full-resolution ∆E maps. The MLP achieves R ≈ 0.94 and RMSE ≈ 0.96 for patch-mean ∆E prediction while the CNN achieves R ≈ 0.96 and RMSE ≈ 1.85 for per pixel ∆E prediction.&lt;br /&gt;
&lt;br /&gt;
These results show that a compact neural model can effectively approximate S-CIELAB while being fast and fully differentiable, enabling its potential use as a perceptual loss or quality metric in modern imaging pipelines.&lt;br /&gt;
&lt;br /&gt;
== Introduction ==&lt;br /&gt;
Perceptual color‐difference metrics play a fundamental role in modern imaging pipelines, image compression, and quality assessment systems. They aim to quantify human-perceived differences between a reference image and its distorted version, allowing algorithms to optimize not only pixel-wise accuracy but also perceptual fidelity. Among these metrics, S-CIELAB (Spatial CIELAB) has become one of the most influential extensions of the classic CIELAB ΔE formulation. Unlike conventional ΔE, which compares colors independently per pixel, S-CIELAB incorporates spatial filtering stages that approximate the frequency-dependent sensitivity of the human visual system (HVS). As a result, the metric aligns more closely with human perceived color differences, particularly in images containing high-frequency textures, blur, noise, or structured distortions. &lt;br /&gt;
&lt;br /&gt;
Although S-CIELAB significantly improves perceptual accuracy, its multi-stage processing pipeline involves large kernel spatial convolutions, nonlinear transformations, and piecewise or non-differentiable operations. This makes S-CIELAB slow to compute at scale and fundamentally incompatible with gradient-based optimizations.&lt;br /&gt;
&lt;br /&gt;
A patch-based formulation is particularly suitable for learning a surrogate model of S-CIELAB. S-CIELAB itself operates locally: its spatial filtering approximates the HVS contrast sensitivity over limited visual angles, and its ΔE computation depends primarily on neighborhood-level color differences rather than global structure. By extracting fixed-size patches corresponding to a constant visual angle (2°×2° in this work), we preserve the locality intrinsic to the metric while avoiding the need to model long-range correlations. Patch-based learning also increases the number of training samples significantly, improving statistical robustness, and allows the model to focus on local statistical features, such as mean chromaticity shifts or contrast changes, that most strongly influence S-CIELAB responses. This makes the surrogate easier to train, more compact, and more generalizable across diverse distortion types.&lt;br /&gt;
&lt;br /&gt;
To bridge the limitations of S-CIELAB and the needs of modern neural pipelines, recent research has explored learning surrogate models that mimic perceptual metrics while remaining computationally efficient and differentiable. A learnable surrogate model for S-CIELAB would allow imaging systems to optimize directly for perceptual color fidelity, improving their alignment with human judgments. In this project, we investigate whether a compact Multi-Layer Perceptron (MLP) or a Convolutional Neural Network (CNN) can learn to reproduce the S-CIELAB ΔE response at the patch level. Instead of training on synthetic images, we leverage the widely used TID2013 dataset, which provides 25 reference images and 24 distortion types across five severity levels.&lt;br /&gt;
&lt;br /&gt;
The objective of this work is twofold:&lt;br /&gt;
# to evaluate whether a simple neural model can approximate S-CIELAB with high accuracy, and&lt;br /&gt;
# to explore the feasibility of replacing costly perceptual metrics with lightweight, differentiable surrogates suitable for future imaging applications.&lt;br /&gt;
&lt;br /&gt;
== Background ==&lt;br /&gt;
Perceptual color‐difference metrics quantify how humans perceive changes between a reference color stimulus and a distorted one. Among these metrics, the CIE 1976 L*a*b* (CIELAB) color space and its associated ΔE formulations remain the most widely used because they provide a perceptually uniform representation of color differences under standardized viewing conditions.&lt;br /&gt;
&lt;br /&gt;
=== Computational CIE Color Models ===&lt;br /&gt;
Color difference metrics are built upon the CIE color‐appearance framework, which provides a device‐independent way of quantifying how humans perceive color. The foundational model is the CIE 1931 XYZ color space, derived from color‐matching functions that approximate the response of the human cone photoreceptors. Given a device RGB image, a calibrated 3×3 matrix (M) converts RGB intensities into XYZ tristimulus values:&lt;br /&gt;
&lt;br /&gt;
&amp;lt;div style=&amp;quot;text-align:center;&amp;quot;&amp;gt;&lt;br /&gt;
&amp;lt;math&amp;gt;&lt;br /&gt;
\begin{bmatrix}&lt;br /&gt;
X\\&lt;br /&gt;
Y\\&lt;br /&gt;
Z&lt;br /&gt;
\end{bmatrix} = M \begin{bmatrix}&lt;br /&gt;
R\\&lt;br /&gt;
G\\&lt;br /&gt;
B&lt;br /&gt;
\end{bmatrix}&lt;br /&gt;
&amp;lt;/math&amp;gt;&lt;br /&gt;
&amp;lt;/div&amp;gt;&lt;br /&gt;
&lt;br /&gt;
In this transformation, Y represents luminance, while X and Z carry chromatic information. Because XYZ is perceptually non-uniform, Euclidean distances in XYZ space do not reliably correspond to perceived color differences. To obtain a perceptually uniform space, CIE introduced CIELAB in 1976.&lt;br /&gt;
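&lt;br /&gt;
As a minimal sketch of this transform, the standard linear-sRGB to XYZ (D65) matrix is used below as a stand-in for the display-specific calibrated matrix M; any real pipeline would substitute its own calibration:&lt;br /&gt;

```python
import numpy as np

# Standard linear-sRGB to XYZ (D65) matrix, used here only as an example
# of the calibrated 3x3 matrix M described in the text.
M = np.array([
    [0.4124, 0.3576, 0.1805],
    [0.2126, 0.7152, 0.0722],
    [0.0193, 0.1192, 0.9505],
])

def rgb_to_xyz(rgb):
    """Map linear RGB values with trailing axis 3 to XYZ tristimulus values."""
    return np.asarray(rgb, dtype=float) @ M.T

# A pure white pixel maps to the D65 white point, roughly (0.9505, 1.0, 1.089)
white = rgb_to_xyz([1.0, 1.0, 1.0])
```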
&lt;br /&gt;
=== CIELAB and Perceptual ΔE* Calculations ===&lt;br /&gt;
The nonlinear transformations that the CIELAB space applies to XYZ are as follows:&lt;br /&gt;
&amp;lt;div style=&amp;quot;text-align:center;&amp;quot;&amp;gt;&lt;br /&gt;
&amp;lt;math&amp;gt;&lt;br /&gt;
L^*=116f\left(\frac{Y}{Y_n}\right)-16,\quad a^*=500\left[f\left(\frac{X}{X_n}\right)-f\left(\frac{Y}{Y_n}\right)\right],\quad b^*=200\left[f\left(\frac{Y}{Y_n}\right)-f\left(\frac{Z}{Z_n}\right)\right]&lt;br /&gt;
&amp;lt;/math&amp;gt;&lt;br /&gt;
&amp;lt;/div&amp;gt;&lt;br /&gt;
&lt;br /&gt;
where&lt;br /&gt;
&lt;br /&gt;
&amp;lt;div style=&amp;quot;text-align:center;&amp;quot;&amp;gt;&lt;br /&gt;
&amp;lt;math&amp;gt;&lt;br /&gt;
f(t) =&lt;br /&gt;
\begin{cases}&lt;br /&gt;
t^{1/3}, &amp;amp; t &amp;gt; 0.008856 \\&lt;br /&gt;
7.787t+\frac{16}{116}, &amp;amp; \text{otherwise}&lt;br /&gt;
\end{cases}&lt;br /&gt;
&amp;lt;/math&amp;gt;&lt;br /&gt;
&amp;lt;/div&amp;gt;&lt;br /&gt;
&lt;br /&gt;
A reference white point (&amp;lt;math&amp;gt;X_n, Y_n, Z_n&amp;lt;/math&amp;gt;) is required for these calculations.&lt;br /&gt;
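&lt;br /&gt;
Given the reference white, the forward XYZ-to-CIELAB computation can be sketched directly from these equations; the D65 white point used as the default below is an assumed example:&lt;br /&gt;

```python
import numpy as np

def f(t):
    """CIELAB nonlinearity applied to normalized tristimulus ratios."""
    t = np.asarray(t, dtype=float)
    return np.where(t > 0.008856, np.cbrt(t), 7.787 * t + 16.0 / 116.0)

def xyz_to_lab(X, Y, Z, white=(0.9505, 1.0, 1.089)):
    """Convert XYZ to CIELAB under a reference white (D65 assumed here)."""
    Xn, Yn, Zn = white
    fx, fy, fz = f(X / Xn), f(Y / Yn), f(Z / Zn)
    L = 116.0 * fy - 16.0
    a = 500.0 * (fx - fy)
    b = 200.0 * (fy - fz)
    return L, a, b

# The reference white itself maps to (L*, a*, b*) = (100, 0, 0)
```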
&lt;br /&gt;
Perceptual color difference (ΔE) between two pixels is then calculated as:&lt;br /&gt;
&lt;br /&gt;
&amp;lt;div style=&amp;quot;text-align:center;&amp;quot;&amp;gt;&lt;br /&gt;
&amp;lt;math&amp;gt;&lt;br /&gt;
\Delta E_{ab}^*=\sqrt{(L_1^* - L_2^*)^2 + (a_1^* - a_2^*)^2 + (b_1^* - b_2^*)^2}&lt;br /&gt;
&amp;lt;/math&amp;gt;&lt;br /&gt;
&amp;lt;/div&amp;gt;&lt;br /&gt;
&lt;br /&gt;
This ΔE* metric is widely used due to its simplicity and perceptual relevance. However, it treats each pixel independently and therefore cannot model spatial visual masking, where texture or neighboring structures influence visibility. For example, a small change in a flat region is highly visible, while the same change embedded in strong texture may be nearly invisible. To overcome this limitation, spatial extensions were introduced.&lt;br /&gt;
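&lt;br /&gt;
The per-pixel ΔE computation above amounts to a Euclidean distance in CIELAB; a minimal vectorized sketch:&lt;br /&gt;

```python
import numpy as np

def delta_e_ab(lab1, lab2):
    """Euclidean CIELAB color difference between (L*, a*, b*) arrays.

    Broadcasts over leading axes, so whole images of shape (H, W, 3)
    can be compared at once.
    """
    lab1 = np.asarray(lab1, dtype=float)
    lab2 = np.asarray(lab2, dtype=float)
    return np.sqrt(np.sum((lab1 - lab2) ** 2, axis=-1))

# delta_e_ab([50, 0, 0], [50, 3, 4]) -> 5.0
```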
&lt;br /&gt;
=== Spatial Extensions: The Principle Behind S-CIELAB ===&lt;br /&gt;
Rather than comparing pixels independently, S‐CIELAB incorporates frequency-dependent spatial filtering based on known characteristics of the human visual system:&lt;br /&gt;
* High-frequency noise is often masked by textures.&lt;br /&gt;
* Low-frequency distortions (blur, banding) are more visible.&lt;br /&gt;
* Chrominance channels have lower spatial resolution than luminance.&lt;br /&gt;
&lt;br /&gt;
S‐CIELAB applies separate low-pass filters to the L*, a*, and b* channels using empirically derived contrast sensitivity functions (CSFs). Formally:&lt;br /&gt;
&lt;br /&gt;
&amp;lt;div style=&amp;quot;text-align:center;&amp;quot;&amp;gt;&lt;br /&gt;
&amp;lt;math&amp;gt;&lt;br /&gt;
(L^*, a^*, b^*)_{filtered} = (L^*, a^*, b^*) \ast K_{CSF}&lt;br /&gt;
&amp;lt;/math&amp;gt;&lt;br /&gt;
&amp;lt;/div&amp;gt;&lt;br /&gt;
&lt;br /&gt;
where &amp;lt;math&amp;gt;\ast&amp;lt;/math&amp;gt; denotes convolution. The ΔE map is then computed over the filtered channels.&lt;/div&gt;</summary>
		<author><name>Annayu</name></author>
	</entry>
	<entry>
		<id>http://vista.su.domains/psych221wiki/index.php?title=Psych221-Projects-2025-Fall&amp;diff=145722</id>
		<title>Psych221-Projects-2025-Fall</title>
		<link rel="alternate" type="text/html" href="http://vista.su.domains/psych221wiki/index.php?title=Psych221-Projects-2025-Fall&amp;diff=145722"/>
		<updated>2025-12-10T01:07:04Z</updated>

		<summary type="html">&lt;p&gt;Annayu: /* Projects for Psych 221 (2025-2026) */&lt;/p&gt;
&lt;hr /&gt;
&lt;div&gt;[http://vista.su.domains/psych221wiki/index.php?title=Main_Page#Psych221  Return to Psych 221 Main Page]&lt;br /&gt;
&lt;br /&gt;
There are two deliverables for the project:&lt;br /&gt;
# A group presentation&lt;br /&gt;
# A wiki-style project page write-up&lt;br /&gt;
&lt;br /&gt;
* The write-up should roughly follow [http://vista.su.domains/psych221wiki/index.php?title=Project_Guidelines this organization from the Project Guidelines Page]&lt;br /&gt;
* Please visit [https://www.mediawiki.org/wiki/Help:Editing_pages MediaWiki&#039;s editing help page].&lt;br /&gt;
&lt;br /&gt;
== To set up your project&#039;s page ==&lt;br /&gt;
* Log in to this wiki with the username and password you created.&lt;br /&gt;
* Edit the Projects section of this page (just below). Do this by clicking on &amp;quot;[edit]&amp;quot; to the right of each section title. &lt;br /&gt;
* Make a new line for your project using the format shown below, pasting the line for your project under the last item/group. The first part of the text within the double brackets is the name of the new page. This must be unique, and including the group member names is the safest way to ensure this. The second part, after &#039;|&#039;, is the displayed text and can be your project title.&lt;br /&gt;
* Save the Project section by clicking the Save button at the bottom of the page.&lt;br /&gt;
* Finally, click on the link for your project.  This will take you to a new blank page that you can edit.  You can use the basic format for your page that is in the Sample Project.&lt;br /&gt;
* Math tip: Use the tags &amp;amp;lt;math&amp;amp;gt; and &amp;amp;lt;/math&amp;amp;gt; to wrap an equation. For example, this code:&lt;br /&gt;
&lt;br /&gt;
&amp;lt;pre&amp;gt; &amp;lt;math&amp;gt; a + b = c^2 &amp;lt;/math&amp;gt; &amp;lt;/pre&amp;gt;&lt;br /&gt;
&lt;br /&gt;
Renders as this equation:&lt;br /&gt;
&lt;br /&gt;
:&amp;lt;math&amp;gt;a + b = c^2&amp;lt;/math&amp;gt;&lt;br /&gt;
&lt;br /&gt;
* Uploading images - Use the &amp;quot;upload file&amp;quot; link at the upper right of the page.  Do a Google Search to learn &amp;quot;What is the syntax for inserting an uploaded image file into the wikimedia page?&amp;quot;&lt;br /&gt;
&lt;br /&gt;
==Projects for Psych 221 (2025-2026)==&lt;br /&gt;
# [[WandellFarrellLian|Sample Project]]&lt;br /&gt;
#* Brian Wandell, Joyce Farrell, David Cardinal, Hyunwoo Gu.&lt;br /&gt;
# [[Pokemon Color Transfer|Pokemon Color Transfer]]&lt;br /&gt;
#* Wenxiao Cai, Yifei Deng&lt;br /&gt;
# [[ISETHDR CV Experiment|ISETHDR CV Experiment]]&lt;br /&gt;
#* Gray Kim, Louise Schul&lt;br /&gt;
# [[ISETBIO Baseball Simulation Experiment|ISETBIO Baseball Simulation Experiment]]&lt;br /&gt;
#* Alex Lipman&lt;br /&gt;
# [[Evaluation Pipeline with GenAI-Assisted Algorithm Development for Virtual Image Denoising and Pixel-Defect Correction|Evaluation Pipeline with GenAI-Assisted Algorithm Development for Virtual Image Denoising and Pixel-Defect Correction]]&lt;br /&gt;
#* Stephanie Chang, Yulin Deng&lt;br /&gt;
# [[Simulation of Reflectance in Oral Tissue Using MCMatlab]]&lt;br /&gt;
#* Sylvia Chin, Lise Brisson&lt;br /&gt;
# [[Neural Network Implementation of S-CIELAB for Perceptual Color Metrics]]&lt;br /&gt;
#* Anna Yu, Shu An&lt;/div&gt;</summary>
		<author><name>Annayu</name></author>
	</entry>
</feed>