Po-Hsiang Wang

From Psych 221 Image Systems Engineering

Introduction

The Super-Resolution Generative Adversarial Network (SRGAN) is a deep learning method for generating high-resolution (HR) images from low-resolution (LR) images [1-6]. In this work, we use SRGAN to up-scale 32x32 images to 128x128 pixels. We also evaluate the impact of different camera parameters on the quality of the final up-scaled (high-resolution) images and use these stimuli to infer what the network is able to learn.


Fig.1 SRGAN converts a LR image to HR through a generator/discriminator network.
Fig.2 Data processing and model training pipeline: the original image is processed with different camera parameters using ISET Camera Designer. These images are resized to 32x32x3 and serve as the LR input to the generator. The target HR images are the original, unprocessed images. In total, four models were trained:
1. Model_SR: SRGAN model that performs super-resolution only
2. Model_SR_Color: SRGAN model that performs super-resolution and color correction
3. Model_SR_Pixel: SRGAN model that performs super-resolution and restores the spatial resolution lost to the reduced system MTF
4. Model_SR_Deblur: SRGAN model that performs super-resolution and de-blurring


Dataset

Training

1800 cat and dog images were downloaded from Flickr and Pixabay. These images were processed with three different camera settings using ISET Camera Designer (developed by David Cardinal, 2020, F20-PSYCH-221, Stanford University) built on the ISETCam framework [7-9] link
  • Camera setting 1- F/4; aperture diameter: 1mm; pixel size: 1um x 1um; under-exposed; default image processor
  • Camera setting 2- F/4; aperture diameter: 1mm; pixel size: 25um x 25um
  • Camera setting 3- F/22; aperture diameter: 0.176mm; pixel size: 1um x 1um
Objectives for choosing these settings
  • Camera setting 1- To produce images that require color correction: because we use the default image processor while under-exposing, these images have a warm tone
  • Camera setting 2- In general, a large pixel size is desirable because it results in higher dynamic range and signal-to-noise ratio. However, the accompanying reduction in spatial resolution and system MTF introduces a severe pixelation effect
  • Camera setting 3- Images look much blurrier than the original images because the system becomes diffraction limited at this small aperture (a rough check of the numbers is given below)
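As a quick sanity check on the diffraction-limit argument, the incoherent cutoff frequency of a diffraction-limited lens is f_c = 1/(λN). Assuming a mid-visible wavelength of 550 nm (a value not specified in the settings above):

$$ f_c^{F/4} \approx \frac{1}{0.55\,\mu\mathrm{m} \times 4} \approx 455\ \mathrm{cycles/mm}, \qquad f_c^{F/22} \approx \frac{1}{0.55\,\mu\mathrm{m} \times 22} \approx 83\ \mathrm{cycles/mm} $$

so the F/22 images can carry roughly 5-6x less fine spatial detail, consistent with the visibly blurrier results.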


Fig.3 Original (baseline) images and images processed through ISET Camera Designer.

Testing

About 100 images, not necessarily of cats and dogs, are used for evaluation. Before being fed to the four trained models (Model_SR, Model_SR_Color, Model_SR_Pixel, and Model_SR_Deblur), these images went through the same camera settings and were resized to 32x32x3, just like the training dataset.
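As a minimal preprocessing sketch (assuming Pillow/NumPy and the common [-1, 1] GAN normalization; the project's exact preprocessing may differ), the resizing step could look like:

```python
# Minimal preprocessing sketch (assumed; not the exact project code):
# resize an RGB test image to the 32x32x3 LR input expected by the models.
import numpy as np
from PIL import Image

def load_lr_image(path, size=(32, 32)):
    """Load an image, force RGB, resize to 32x32, and scale to [-1, 1]."""
    img = Image.open(path).convert("RGB").resize(size, Image.BICUBIC)
    x = np.asarray(img, dtype=np.float32) / 127.5 - 1.0   # common GAN normalization
    return x[np.newaxis, ...]                              # add batch dimension -> (1, 32, 32, 3)
```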

Methods

SRGAN model

SRGAN structure

SRGAN takes a low-resolution image and up-scales it to a higher-resolution image. The input to the generator is a LR image; the generator then creates a 'fake' HR image. The input to the discriminator is either a 'real' HR image or a 'fake' HR image. The discriminator's job is to tell whether the input image was produced by the generator; its feedback (the generator did a good job if the discriminator was fooled) is back-propagated as the GAN loss to train the generator.
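A minimal sketch of one training step in Keras is shown below. The model handles (`generator`, `discriminator`, and a `combined` model that stacks the generator in front of a frozen discriminator) are assumptions for illustration, not the project's exact code:

```python
# Hypothetical single SRGAN training step in Keras (not the project's exact code).
import numpy as np

def train_step(generator, discriminator, combined, lr_batch, hr_batch):
    n = hr_batch.shape[0]
    real = np.ones((n, 1))     # label 1: 'real' HR image
    fake = np.zeros((n, 1))    # label 0: generated ('fake') HR image

    # 1) Train the discriminator to separate real HR images from generated ones
    hr_fake = generator.predict(lr_batch)
    d_loss_real = discriminator.train_on_batch(hr_batch, real)
    d_loss_fake = discriminator.train_on_batch(hr_fake, fake)

    # 2) Train the generator through the combined model (discriminator frozen):
    #    it is rewarded when the discriminator labels its output as real
    g_loss = combined.train_on_batch(lr_batch, real)
    return d_loss_real, d_loss_fake, g_loss
```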

Fig.4 Generator and Discriminator in SRGAN
Ledig, Christian, et al. "Photo-realistic single image super-resolution using a generative adversarial network." Proceedings of the IEEE conference on computer vision and pattern recognition. 2017. link
The figure on the left describes the networks from the original paper. In this work, we made small modifications in our code, as follows:
    • PixelShuffler x2: this layer performs feature-map upscaling. We instead implement it with a 2x 'deconv2d' (transposed convolution) operation in Keras.
    • PRelu (Parametric ReLU): PReLU introduces a learnable parameter so the coefficient of the negative part can be learned adaptively. We use ReLU as the activation function for simplicity.


The remaining parts of our SRGAN model follow the original implementation:
    • k3n64s1 means a 3x3 kernel filter outputting 64 channels with a stride of 1.
    • Residual blocks: deeper networks are more difficult to train; the residual learning framework eases their training and enables substantially deeper networks, leading to improved performance. A minimal Keras sketch of these building blocks is shown below.
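The following is a minimal sketch of a k3n64s1 residual block and the 2x deconvolution upsampling described above (Keras functional API; hyper-parameters not stated in the text, such as batch-normalization placement and the number of upsampling filters, are assumptions):

```python
# Minimal sketch of the building blocks described above (Keras functional API).
from tensorflow.keras import layers

def residual_block(x):
    """k3n64s1 residual block: two 3x3, 64-channel, stride-1 convolutions plus a skip connection."""
    skip = x
    x = layers.Conv2D(64, kernel_size=3, strides=1, padding="same")(x)
    x = layers.BatchNormalization()(x)
    x = layers.Activation("relu")(x)   # ReLU instead of PReLU, as noted above
    x = layers.Conv2D(64, kernel_size=3, strides=1, padding="same")(x)
    x = layers.BatchNormalization()(x)
    return layers.Add()([x, skip])     # identity shortcut that eases training of deep networks

def upsample_2x(x, filters=256):
    """2x feature-map upscaling via transposed convolution (in place of PixelShuffler x2)."""
    return layers.Conv2DTranspose(filters, kernel_size=3, strides=2,
                                  padding="same", activation="relu")(x)
```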




SRGAN loss functions

A loss function measures how well the model's predictions match the targets. Different loss functions are designed for different tasks.

Loss function of discriminator
  • Mean squared error (MSE) loss,

$$ \mathrm{MSE} = \frac{1}{N}\sum_{i=1}^{N}\left(y_i - \hat{y}_i\right)^2 $$

where y_i is the ground-truth label (0: fake, 1: real) and ŷ_i is the predicted label from the discriminator. This loss is minimized when the discriminator correctly distinguishes between 'real' and 'fake' HR images.


Loss function of generator
  • Adversarial loss: we use the 'binary cross-entropy' loss in Keras. The goal is for the discriminator to output "1" even when the input is a 'fake' HR image, so that this loss term is minimized.
  • Content loss: the L2 distance between 'real' and 'fake' HR feature maps. The feature maps are extracted from a pre-trained VGG19 network; we use the 9th convolution layer (a 3x3x256 convolution) for extraction. In the example below, we can see that these feature maps resemble images filtered at different spatial frequencies. The generator is optimized to maintain this perceptual similarity between a real HR image and its fake HR counterpart. A sketch of the content-loss computation follows this list.
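A minimal sketch of the VGG19 content loss is given below. The specific layer name (`block3_conv4`, which outputs 256-channel feature maps) is an assumption used for illustration; the project's exact layer index may differ:

```python
# Sketch of the VGG19-based content loss (layer choice is illustrative).
import tensorflow as tf
from tensorflow.keras.applications import VGG19
from tensorflow.keras.models import Model

vgg = VGG19(weights="imagenet", include_top=False, input_shape=(128, 128, 3))
vgg.trainable = False
# block3_conv4 outputs 256-channel feature maps, matching the 3x3x256 layer described above
feature_extractor = Model(inputs=vgg.input, outputs=vgg.get_layer("block3_conv4").output)

def content_loss(hr_real, hr_fake):
    """L2 distance between VGG19 feature maps of real and generated HR images."""
    return tf.reduce_mean(tf.square(feature_extractor(hr_real) - feature_extractor(hr_fake)))
```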


Fig.5 Example of feature maps in 4 different channels (out of 256) extracted from the 9th layer of the VGG19 network

Image quality evaluations

SSIM

We use the Structural Similarity (SSIM) index to measure the similarity between the model-generated HR image and the original HR image (target image). A higher value indicates higher similarity (better quality of the model-generated image).
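As a minimal sketch (using scikit-image; the project's exact settings, such as window size, are assumptions):

```python
# Example SSIM computation with scikit-image (a sketch; exact parameters may differ from the project).
from skimage.metrics import structural_similarity as ssim

def ssim_score(hr_target, hr_generated):
    """SSIM between the target HR image and a model-generated HR image (uint8 RGB arrays)."""
    return ssim(hr_target, hr_generated, channel_axis=-1, data_range=255)
```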



S-CIELAB

We also use S-CIELAB [10] link as another evaluation metric. S-CIELAB is a spatial extension of the CIE L*a*b* Delta E color-difference formula: it measures the difference between a reference image and a corresponding test image while accounting for human spatial sensitivity. The key steps in computing the S-CIELAB representation are a color transformation and spatial filtering that simulate the human visual system, followed by the standard CIELAB Delta E calculation.



Results

Effect of missing pixels

Here we randomly set 10% of the pixels to 0 and test whether SRGAN can still learn and create valid HR images. The input image is 32x32x3 with 10% missing pixels; the target output image is 128x128x3. The results below show that the model does well: the generated HR images not only have smooth edges but also restore details of the face despite the noisy pixels. It is worth mentioning that in the left-most column, since dogs wearing goggles are not common in the training examples, the eyes are not fully recovered. A sketch of the pixel-dropping corruption is shown below.
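A minimal NumPy sketch of the corruption step (the exact implementation in the project may differ):

```python
# Sketch of the 10% missing-pixel corruption applied to the LR inputs (assumed implementation).
import numpy as np

def drop_pixels(img, frac=0.1, seed=None):
    """Randomly set `frac` of the pixels (all color channels) to zero."""
    rng = np.random.default_rng(seed)
    h, w = img.shape[:2]
    mask = rng.random((h, w)) < frac   # True where a pixel is dropped
    out = img.copy()
    out[mask] = 0
    return out
```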

Fig.6 Training and testing of the SRGAN model with 10% missing pixels as input

HR images from trained SRGAN models

The following matrices summarize the results from all four trained SRGAN models. The 'input' row shows the input LR images (the original and the images processed using the three camera settings described in the Methods section). Each cell in the matrix represents the HR image generated by a given model from a given input LR image. For example, in each 4x4 matrix, the image in the 3rd row, 2nd column is the result generated by Model_SR_Pixel from the LR image processed with camera setting 1. The images on the diagonal clearly have higher quality (closer to the target images) than the off-diagonal images, because these are the conditions the models were trained for.

Key observations for each model
  • Model_SR: this model is trained to perform super-resolution only. As expected, the result looks simply upscaled, without changes to other characteristics of the input images such as color and focus.
  • Model_SR_Color: the outlines and details look similar to Model_SR. Because this model is also trained to do color correction, the color tone differs between the input and output images (it becomes 'brighter' in general).
  • Model_SR_Pixel: unlike Model_SR and Model_SR_Color, results from this model look relatively unnatural. However, when the input image comes from camera setting 2 (reduced system MTF due to the large sensor pixel), the resulting HR image improves considerably - the model learns to restore spatial resolution to some extent.
  • Model_SR_Deblur: this model successfully learned how to de-blur. It is also interesting that all of its output images appear to remain in good focus regardless of whether the input image is in or out of focus.
Atypical testing image

As mentioned, the test set includes images other than cats and dogs. The bottom-right image is an example of an atypical image that the models never saw during training. The edges were still nicely restored by Model_SR, Model_SR_Color, and Model_SR_Deblur. Model_SR_Pixel, however, created a 3D-like effect in the blue color-gradient region.



SSIM and S-CIELAB scores

Below are the SSIM scores and the average Delta E of the S-CIELAB representation for 82 test images. The input images for each model are the ones processed with the corresponding camera settings in Fig.2. Model_SR performs best on both SSIM and S-CIELAB because it has the relatively simple task of super-resolution only. When we compare the scores of the other three models against Model_SR in the x-y plots, we can see that the SSIM scores are highly correlated, i.e. the models succeed and fail together. This is expected because the models share the same structure and loss functions and therefore have similar strengths and limitations. Similar results are observed for the S-CIELAB scores. In the next section, we show a few S-CIELAB delta E maps of both good and bad generated HR images.



S-CIELAB delta E maps

The S-CIELAB delta E maps show the difference between the target images and the model-generated images. If we regard these differences as 'residue' (mainly at the edges) that the model could still remove, it might be interesting in future work to add the S-CIELAB representation to (or substitute it into) the generator's loss function. The motivation is that one of the major changes in a more advanced version of SRGAN (Enhanced SRGAN, ESRGAN [2]) is to use feature maps before activation when calculating the content loss: because we extract feature maps from a relatively deep layer of VGG19, some of the features become inactive after activation and carry less information. S-CIELAB could provide additional information, especially from the standpoint of human spatial sensitivity, to the generator during training and create a new class of super-resolution images that focus more on how accurately colors are reproduced for a human observer.



Conclusions

1. Effect of missing pixels- The SRGAN model is able to deal with missing/noisy pixels (about 10% in our experiment) and generate HR images that not only have smooth edges but also restore details.

2. HR images from trained SRGAN models- SRGAN models can be trained to perform super-resolution and image enhancement (including color correction and de-blurring) simultaneously, using a TensorFlow implementation of "Photo-Realistic Single Image Super-Resolution Using a Generative Adversarial Network" link.

3. Proposed future work for model improvement- We propose, as future work, using S-CIELAB delta E maps in the generator loss function to incorporate human spatial color sensitivity during training. This could enable a new class of super-resolution images that focus more on the reproduction of color patterns as viewed by a human observer.

References

[1] Ledig, Christian, et al. "Photo-realistic single image super-resolution using a generative adversarial network." Proceedings of the IEEE conference on computer vision and pattern recognition. 2017. link
[2] Wang, Xintao, et al. "Esrgan: Enhanced super-resolution generative adversarial networks." Proceedings of the European Conference on Computer Vision (ECCV). 2018. link
[3] Ciolino, Matthew, David Noever, and Josh Kalin. "Training set effect on super resolution for automated target recognition." Automatic Target Recognition XXX. Vol. 11394. International Society for Optics and Photonics, 2020. link
[4] Nagano, Yudai, and Yohei Kikuta. "SRGAN for super-resolving low-resolution food images." Proceedings of the Joint Workshop on Multimedia for Cooking and Eating Activities and Multimedia Assisted Dietary Management. 2018. link
[5] Pathak, Harsh Nilesh, et al. "Efficient super resolution for large-scale images using attentional GAN." 2018 IEEE International Conference on Big Data (Big Data). IEEE, 2018. link
[6] Takano, Nao, and Gita Alaghband. "Srgan: Training dataset matters." arXiv preprint arXiv:1903.09922 (2019). link
[7] Farrell, J. E., F. Xiao, P. Catrysse, and B. Wandell. "A simulation tool for evaluating digital camera image quality." Proc. SPIE Vol. 5294, pp. 124-131, Image Quality and System Performance, Miyake and Rasmussen (Eds). January 2004. link
[8] Farrell, J. E., P. B. Catrysse, and B. A. Wandell. "Digital camera simulation." Applied Optics Vol. 51, Iss. 4, pp. A80–A90. 2012. link
[9] Farrell, J. E., and B. A. Wandell. "Image Systems Simulation." Handbook of Digital Imaging (Edited by Kriss), Chapter 8. 2015. ISBN: 978-0-470-51059-9. link
[10] Zhang, Xuemei, and Brian A. Wandell. "A spatial extension of CIELAB for digital color image reproduction." SID international symposium digest of technical papers. Vol. 27. SOCIETY FOR INFORMATION DISPLAY, 1996. link

Appendix1

Github repository- https://github.com/pohwa065/SRGAN-for-Super-Resolution-and-Image-Enhancement
Presentation- [Deep Learning for Super Resolution and Image Enhancement]

Appendix2

Computer Hardware

Machine: Dell Inspiron 15 7000

Processor: Intel(R) Core(TM) i5-7300HQ CPU @ 2.50GHz, 2501 Mhz, 4 Core(s), 4 Logical Processor(s)

Memory: 8 GB

Generator and Discriminator loss over epochs during training of Model_SR