Uriel Rosa

From Psych 221 Image Systems Engineering

Performance evaluation of monochromatic and defocusing camera design pipelines in the semantic labeling of images classified by convolutional neural networks

Introduction

Cameras designed for robotic and/or autonomous vehicle vision applications have typically been adapted from pipelines intended for human viewing. However, robotic handling of images does not necessarily have to emulate the human visual system to achieve high performance or reduced cost. The semantic labeling of images classified by convolutional neural network (CNN) approaches might be substantially influenced by the design parameters of the cameras acquiring the images.

In this study, camera pipeline parameters were modified to investigate the effects of replacing typical in-focus RGB images with the same images reprocessed as monochrome images, images defocused to include chromatic aberration effects, and defocused monochrome images.

The ieCameraDesigner ISET [1] application software was configured into four distinct camera designs to generate RGB, monochrome, defocused, and combined-effect images in distinct pipelines.

Natural images of African mammals downloaded from David Cardinal’s data set [2] were screened for close similarity to produce a base dataset of images processed by a state-of-the-art CNN, ResNet-50.

The goal of this study is to evaluate the simulated effects of two camera parameters, monochrome capture and chromatic defocus, on the performance of semantic labeling computed by convolutional neural networks.

Background

Relevant literature on monochrome imaging, chromatic defocus, semantic labeling by current convolutional neural network frameworks, and autonomous vision is cited and partially edited in this section, as follows.

“Recent years have witnessed amazing progress in AI-related fields such as computer vision, machine learning and autonomous vehicles. Since the first successful demonstrations in the 1980s, great progress has been made in the field of autonomous vehicles. However, fully autonomous navigation in arbitrarily complex environments also requires informed decisions made by CNNs. Accurate perception systems, i.e. autonomous vision, are required for autonomous navigation [3].

An object detection task can be addressed with a variety of different sensors in an integrated approach. However, cameras are the cheapest and most commonly used type of sensor for the detection of objects. The visible spectrum of light is typically used for daytime detection, whereas the infrared spectrum can be used for nighttime detection [3]. A traditional detection pipeline includes object classification and verification/refinement.

Semantic segmentation is a fundamental topic in computer vision. The goal of semantic segmentation is to assign each pixel in the image a label from a predefined set of categories. Semantic segmentation is the first step towards scene understanding. It is mainly based on low-level features, such as color, edges, and brightness. Methods for feature selection have been reported for the related subtasks of lane and road detection, traffic sign recognition, and vehicle detection. Wu et al. (2016b) proposed a more efficient ResNet architecture by analyzing the effective depths of 21 residual units. They point out that ResNets behave as linear ensembles of shallow networks. Based on this understanding they design a group of relatively shallow convolutional networks for the task of semantic image segmentation [3].

The modern era of neural networks began with the pioneering work of McCulloch and Pitts (1943). They described a logical calculus of neural networks that united the studies of neurophysiology and mathematical logic. They showed that, with a sufficient number of simple units (neurons) and synaptic connections set properly and operating synchronously, a network could compute any computable function. The disciplines of neural networks and artificial intelligence were born. The properties of these machines and their behavior are inspired by facts about animal brains. In 1986 the development of the back-propagation algorithm was reported by Rumelhart, Hinton and Williams. Back-propagation learning was discovered independently in two other places at about the same time (Parker, 1985; LeCun, 1985). Convolutional neural networks allowed a significant improvement in the performance of object detection [4] and machine vision, from AlexNet in 2012 to ResNet’s further increases in layer depth [5].

Deep residual networks such as ResNet-50, introduced in “Deep Residual Learning for Image Recognition”, are currently among the state-of-the-art in image classification. Deeper neural networks are more difficult to train. Kaiming He, Xiangyu Zhang, Shaoqing Ren and Jian Sun presented a residual learning framework to ease the training of networks that are substantially deeper than those used previously. They explicitly reformulated the layers as learning residual functions with reference to the layer inputs, instead of learning unreferenced functions. They provided comprehensive empirical evidence showing that these residual networks are easier to optimize, and gain accuracy from considerably increased depth [6,7]. The method was evaluated on the ImageNet 2012 classification dataset of 1000 classes: the models were trained on the 1.28 million training images, evaluated on the 50k validation images, and tested on the 100k test images.

Pre-trained model: A pre-trained model has been previously trained on a dataset and contains the weights and biases that represent the features of that dataset. Learned features are often transferable to different data. For example, a model trained on a large dataset of animal images will contain learned features, such as edges or horizontal lines, that are transferable to our dataset. Pre-trained models are beneficial for many reasons. Using a pre-trained model saves time: time and compute resources have already been spent learning features that the new model will likely benefit from [6,7,8].

Monochrome images: Technologies used in autonomous vehicles typically include lane detection. Typical images collected by the on-board camera are color images [9,10]. Each pixel in the image is made up of three color components, R, G, and B, which together contain a large amount of information; processing these images directly makes the algorithm consume a lot of time. Image preprocessing includes grayscale conversion of the color image, gray stretch, and median filtering to eliminate image noise and other interference. Gray stretch increases the contrast between the lane and the road, which makes the lane lines more prominent. Equation (1) gives the function applied to an RGB image to convert it to grayscale.

L(x,y) = 0.21 R(x,y) + 0.72 G(x,y) + 0.07 B(x,y)     (1)

where R is the red component of the image, G the green component, B the blue component, and (x,y) the position of a pixel [9]. In the current study, gray stretch is not performed; however, color images are replaced by monochrome images to evaluate the effect of grayscale capture on CNN classification performance.
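As a minimal illustration, equation (1) can be applied to an RGB image in a few lines of MATLAB. The file name is only a placeholder, and this sketch is not the preprocessing code of the cited lane-detection systems.

% Minimal sketch: grayscale conversion using the luminance weights of equation (1).
rgbImage  = double(imread('example_animal.jpg')) / 255;   % placeholder file name
grayImage = 0.21 * rgbImage(:,:,1) + ...
            0.72 * rgbImage(:,:,2) + ...
            0.07 * rgbImage(:,:,3);
imshow(grayImage);                                         % or imagesc(grayImage); colormap gray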

Defocus aberration: In optics, defocus is the aberration in which an image is simply out of focus. Optically, defocus refers to a translation of the focus along the optical axis away from the detection surface. In general, defocus reduces the sharpness and contrast of the image. What should be sharp, high-contrast edges in a scene become gradual transitions. Fine detail in the scene is blurred or even becomes invisible. Nearly all image-forming optical devices incorporate some form of focus adjustment to minimize defocus and maximize image quality [11]. Figure 1 demonstrates typical cases of chromatic aberrations.



Figure 1. Left: intense effect of chromatic aberration (mouth); Center: diagram indicating chromatic aberration produced by the lens; Right: text image shows strong chromatic aberrations [11,12].

Image signal processing pipelines: Jiang et al. [13] introduced a method that combines machine learning and image systems simulation to automate image processing pipeline design. The approach is based on a new way of thinking of the image processing pipeline as a large collection of local linear filters. The method has been used to design pipelines for novel sensor architectures in consumer photography applications. It applies a learning-based approach to ISP pipeline design using affine mapping frameworks: image patches are clustered based on simple features, and then a per-class affine mapping learns to map the raw patches to the sRGB patches. This work combines image systems simulation technology and modern computational methods into a methodology that creates image processing pipelines.”
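To make the idea of per-class affine mappings concrete, the following MATLAB sketch groups training patches by a simple feature and fits a least-squares affine map per group. It only illustrates the concept and is not the implementation in [13]; the variables rawPatches and srgbPatches, the feature, and the number of classes are assumptions.

% Illustrative per-class affine raw-to-sRGB mapping (concept sketch, not [13]).
% rawPatches and srgbPatches are assumed N x P matrices of vectorized patches.
nClasses = 4;                                        % assumed number of patch classes
feature  = mean(rawPatches, 2);                      % simple feature: mean patch level
edges    = linspace(min(feature), max(feature), nClasses + 1);
classIdx = discretize(feature, edges);               % assign each patch to a class
maps = cell(nClasses, 1);
for c = 1:nClasses
    X = rawPatches(classIdx == c, :);                % raw patches in class c
    Y = srgbPatches(classIdx == c, :);               % corresponding sRGB patches
    maps{c} = [X, ones(size(X,1), 1)] \ Y;           % least-squares affine map for class c
end
% At render time, a raw patch x is assigned to a class and mapped with:
%   y = [x, 1] * maps{classOfX};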

Methods

In the current study, defocus aberrations are introduced in the simulated ISET/isetcam pipeline for monochrome and RGB images. The processed images allow quantifying these effects on the performance of the CNN.

The pre-trained ResNet-50 upper classification layers are retrained with four classes of wild-animal images obtained from David Cardinal’s African mammals data set. The simulated camera pipeline designs are obtained from the ISET/isetcam ieCameraDesigner ISE 2020 application.

Original dataset: Natural images of African mammals downloaded from David Cardinal’s data set [2] were screened for close similarity to produce a total of 100 images per class (50% original and 50% reflected images). A sample of animal classes containing original JPEG images is shown in figure 2.



Figure 2. Original images representing the four classes of animals selected from David Cardinal’s data set; from left to right: cheetah, hyenas, leopards and lions.

ISET dataset simulated with the ieCameraDesigner

RGB: the original images were processed through the ieCameraDesigner pipeline using the default optical image, sensor and ISP designs, except that the sensor color type was selected as RGB. A sample of the same animal classes containing the simulated RGB images is shown in figure 3.



Figure 3. ISET simulated RGB images representing the four classes of animals selected from David Cardinal’s data set; from left to right: cheetah, hyenas, leopards and lions.

Defocused RGB: the original images were processed through the ieCameraDesigner pipeline using the default optical image, sensor and ISP designs, except that the sensor color type was selected as RGB. The cameraTweak file shown in the Appendix was loaded in the ieCameraDesigner pipeline to produce a defocus of 5.5 diopters on the images. A sample of the same animal classes containing the defocused RGB images is shown in figure 4.
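The core of that tweak, condensed from the cameraTweak.m listing in the Appendix, creates a wavefront with a 5.5-diopter defocus Zernike coefficient and converts it to the optical image used by the pipeline:

% Core of cameraTweak.m (see Appendix): add 5.5 diopters of defocus
wvf1 = wvfCreate;                               % default wavefront structure
wvf1 = wvfSet(wvf1, 'zcoeffs', 5.5, 'defocus'); % set the defocus Zernike coefficient
wvf1 = wvfComputePSF(wvf1);                     % compute the point spread function
oi   = wvf2oi(wvf1);                            % shift-invariant optical image
oi   = oiSet(oi, 'name', 'Defocused');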



Figure 4. ISET simulated defocused RGB images representing the four classes of animals selected from David Cardinal’s data set; from left to right: cheetah, hyenas, leopards and lions.

Monochrome: the original images were processed through the ieCameraDesigner pipeline using the default optical image, sensor and ISP designs, except that the sensor color type was selected as monochrome. A sample of the same animal classes containing the monochrome images is shown in figure 5.
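For readers scripting outside the GUI, the monochrome choice corresponds roughly to the ISET/isetcam calls sketched below. This is an assumption about the GUI's scripted equivalent (the ieCameraDesigner may set additional parameters), and the file name is a placeholder.

% Rough scripted sketch of a monochrome capture (not the GUI's exact settings).
scene  = sceneFromFile('example_animal.jpg', 'rgb');  % placeholder file name
oi     = oiCompute(oiCreate, scene);                  % default optics
sensor = sensorCreate('monochrome');                  % single-channel sensor, no CFA
sensor = sensorCompute(sensor, oi);
sensorWindow(sensor);                                 % inspect the monochrome capture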




Figure 5. ISET simulated monochrome images representing the four classes of animals selected from David Cardinal’s data set; from left to right: cheetah, hyenas, leopards and lions.

Defocused monochrome: the original images were processed through the ieCameraDesigner pipeline using the default optical image, sensor and ISP designs, except that the sensor color type was selected as monochrome. The cameraTweak file shown in the Appendix was loaded in the ieCameraDesigner pipeline to produce a defocus of 5.5 diopters on the images. A sample of the same animal classes containing the defocused monochrome images is shown in figure 6.




Figure 6. ISET simulated defocused monochrome images representing the four classes of animals selected from David Cardinal’s data set; from left to right: cheetah, hyenas, leopards and lions.

Figure 7 shows a block diagram of the dataset obtained after processing the original data through the modified pipelines created by the ISET/isetcam ieCameraDesigner application.




Figure 7. Block diagram showing the dataset obtained after original dataset is processed by the ISET/isetcam ieCameraDesigner application.


ResNet-50 network

A total of four distinct datasets were produced by the simulated pipelines, one for each of the effects investigated: monochrome, defocused monochrome, RGB and defocused RGB. The modifications were introduced in the default pipeline to obtain these four simulated pipelines. The upper layers of the ResNet-50 network were retrained with the original and the new images by following the MLTransferLearning guidelines [14]. The 50 pipeline-processed images per class were augmented to produce another 50 images for the CNN classifications by randomly reflecting the images around the vertical axis of the frame and scaling them by a factor between 1 and 2 [figure 8]. The capability of each pipeline to correctly classify the images was used as the comparison criterion, with the classification performance obtained by training on the original images adopted as the baseline. The initial learning rate of the network was set to 0.0001.




Figure 8. The 50 pipeline-processed images per class were augmented for the CNN transfer-learning training and validation process by randomly reflecting them around the vertical axis of the frame and scaling them by a factor between 1 and 2.
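For reference, a minimal MATLAB sketch of this transfer-learning setup is given below. It is not the project's script: the folder name is a placeholder, the 80/20 split is an assumption, and the layer names are those of MATLAB's standard resnet50 model.

% Minimal transfer-learning sketch (assumed folder layout: one subfolder per class).
imds = imageDatastore('pipelineImages', ...
    'IncludeSubfolders', true, 'LabelSource', 'foldernames');
[imdsTrain, imdsVal] = splitEachLabel(imds, 0.8, 'randomized');   % assumed 80/20 split

% Augmentation: random horizontal reflection and scaling between 1 and 2.
augmenter = imageDataAugmenter('RandXReflection', true, 'RandScale', [1 2]);
inputSize = [224 224 3];
augTrain  = augmentedImageDatastore(inputSize, imdsTrain, 'DataAugmentation', augmenter);
augVal    = augmentedImageDatastore(inputSize, imdsVal);

% Replace the 1000-class head of ResNet-50 with a 4-class head.
lgraph = layerGraph(resnet50);
lgraph = replaceLayer(lgraph, 'fc1000', fullyConnectedLayer(4, 'Name', 'fc4'));
lgraph = replaceLayer(lgraph, 'ClassificationLayer_fc1000', classificationLayer('Name', 'output'));

options = trainingOptions('sgdm', ...
    'InitialLearnRate', 1e-4, ...          % learning rate used in the study
    'MaxEpochs', 200, ...                  % 200-epoch setup discussed in Results
    'ValidationData', augVal, ...
    'Plots', 'training-progress');

[net, info] = trainNetwork(augTrain, lgraph, options);
fprintf('Final validation accuracy %.1f%%, loss %.2f\n', ...
    info.FinalValidationAccuracy, info.FinalValidationLoss);

The final validation accuracy and loss printed at the end correspond to the metrics used below to compare the camera designs.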


After the training and validation process, final metrics for the network performance were reported: the “Final Validation Accuracy (%)” and the “Final Validation Loss”. These metrics are used for evaluating the performance of the different camera designs.

Results

The results shown in tables 1 and 2 help analyze how the different camera pipeline implementations, or camera models, influence the performance of the pre-trained neural networks in this study.

It can be seen from the sample figures 9 and 10 that the number of epochs used for training and concurrent validation needs to be higher than 100, and preferably around 200, to allow the network enough steps to learn. A 200-epoch run on a Dell G5 with an NVIDIA GeForce 1650 card takes about 35-40 minutes.



Figure 9. Total of 100 epochs of the Resnet-50 deepNetworkDesigner training and validation process of a monochrome camera pipeline.



Figure 10. Total of 200 epochs of the Resnet-50 deepNetworkDesigner training and validation process of a monochrome camera pipeline.


It can be seen from table 1 that the monochrome pipeline performed quite well at 200 epochs, with 88% accuracy and 0.47 loss, compared with the baseline results of the original image dataset, 91% accuracy and 0.24 loss.

Defocusing pipelines, in general, did not perform well. The networks seem to have failed to recognize features, probably because the large dataset used for pre-training consists primarily of focused images taken by and for human use. The entire network would likely need to be retrained to improve results on defocused images.

Table 1. Resnet-50 classification performance results.


The trends in tables 1 and 2 agree on the performance results obtained with ResNet-50 and SqueezeNet: the monochrome pipeline performs well and the defocused pipelines do not.

Table 2. SqueezeNet classification performance per class results.

Conclusions

The results and analysis of the current study indicate:

The simulation of different camera pipelines, followed by training and validation of pre-trained CNNs for semantic classification, is a valuable tool for assessing the performance of different camera pipeline designs.

The monochrome pipeline performed very well compared to the performance of the original image dataset.

References

[1] ISET/isetcam “aiCameraDesigner” ISE 2020 in: https://github.com/ISET/isetcam

[2] David Cardinal’s natural images of African mammals: https://canvas.stanford.edu/files/6540020/download?download_frd=1

[3] J. Janai, F. Güney, A. Behl and A. Geiger. “Computer Vision for Autonomous Vehicles: Problems, Datasets and State-of-the-Art”. Preprint submitted to ISPRS Journal of Photogrammetry and Remote Sensing, April 20, 2017.

[4] S. Haykin. “Neural Networks: A Comprehensive Foundation”. Prentice Hall, 2nd ed., 1999.

[5] https://en.wikipedia.org/wiki/AlexNet

[6] K. He, X. Zhang, S. Ren and J. Sun. “Deep Residual Learning for Image Recognition”. https://arxiv.org/abs/1512.03385

[7] Resnet-50 Kaggle Pre-trained models for keras: https://www.kaggle.com/keras/resnet50

[8] Resnet-50 Matlab: https://www.mathworks.com/help/deeplearning/ref/resnet50.html?s_tid=srchtitle

[9] V. Viswanathan, R. Hussein. “Applications of Image Processing and Real-Time Embedded Systems in Autonomous Cars: A Short Review”. International Journal of Image Processing (IJIP), Volume 11, Issue 2, 2017.

[10] H. Zhu, K. Yuen, L. Mihaylova and H. Leung. “Overview of Environment Perception for Intelligent Vehicles”. IEEE Transactions on Intelligent Transportation Systems, vol.18, No.10, October 2017.

[11] https://en.wikipedia.org/wiki/Defocus_aberration

[12] https://digital-photography-school.com/chromatic-aberration-what-is-it-and-how-to-avoid-it/

[13] H. Jiang, Q. Tian, J. Farrell and B. Wandell. “Learning the image processing pipeline”. IEEE Transactions on Image Processing, vol. 26, no. 10, pp. 5032–5042, Oct 2017.

[14] MLTransferLearning guidelines: https://stanford-pilot.hosted.panopto.com/Panopto/Pages/Viewer.aspx?id=d0f21cb7-7111-43d0-a647-ac4c017dfba0

Acknowledgements

I am thankful to the Psych221 teaching team, Brian Wandell, David Cardinal, Joyce Farrell and Zheng Lyu, for their suggestions, for developing and making available the software tools, and for the use of David Cardinal’s African Mammals image data set.

Appendix

cameraTweak.m

function [oi, sensor, ip] = cameraTweak(varargin)
%% % This file is copied into the local data store for each user
% so that they can add their own customization code to the optics, sensor, and ip
% of the camera they have begun to design in the UI.
%
% Or, they can also directly load one from scratch here, over-riding whatever is
% passed in, by simply assigning the one they load to the appropriate output variable name.
%
% Once this template is copied to the local store, changes to it will not persist,
% as it will be copied over again if it is missing. So make persistent changes to the .template
% file, not the .m file that it is copied into.
%%
%% In general, the cameraDesigner application will pass something for oi, sensor, and ip
% but this allows us to be more flexible.
% If one of those does not exist in the environment to pass in, the varargin code
% causes a default to be created, which can then be customized in code.
%%
%% You can also hard-code an image folder here, if you don't want to specify one
% each time you run an experiment. Otherwise it will be prompted when you use Evaluate...
%%
varargin = ieParamFormat(varargin);
p = inputParser;
p.addParameter('oi',oiCreate(),@(x)(isequal(class(x), 'struct')));
p.addParameter('sensor',sensorCreate(),@(x)(isequal(class(x), 'struct')));
p.addParameter('ip',ipCreate(),@(x)(isequal(class(x), 'struct')));
p.addParameter('imageFolder',"",@ischar); % or string?
p.parse(varargin{:});

oi = p.Results.oi;
sensor = p.Results.sensor;
ip = p.Results.ip;

% Put your custom code here to modify
% your optics (oi.optics)

%uar--------- Wavefront method
wvf0 = wvfCreate;

% This is how to set the focal length and pupil diameter. It is annoying
% that the diameter is millimeters. I hope to change it to meters but that
% will involve dealing with many scripts. And more patience than I have
% right now. (BW).
f_ = oiGet(oi,'optics focal length');       %uar 8e-3
a_ = oiGet(oi,'optics aperture diameter');  %uar 3.
wvf0 = wvfSet(wvf0,'focal length',f_);      % Meters
wvf0 = wvfSet(wvf0,'pupil diameter',a_);    % Millimeters

% We need to calculate the pointspread explicitly
wvf0 = wvfComputePSF(wvf0);

% Finally, we convert the wavefront representation to a shift-invariant
% optical image with this routine.
oi0 = wvf2oi(wvf0);
oiPlot(oi0,'psf 550');

% Here is the summary
fprintf('f# %0.2f and defocus %0.2f\n',oiGet(oi0,'fnumber'),oiGet(oi0,'wvf zcoeffs','defocus'));

% Now we compute with the oi as usual.
% Here is the point spread. Diffraction-limited in this case.
% Notice the Airy disk
%oiPlot(oi0,'psf 550');

%% Adjust the defocus (in diopters)
wvf1 = wvfCreate;

% Make a new one with some defocus
wvf1 = wvfSet(wvf1,'zcoeffs',5.5,'defocus');
wvf1 = wvfComputePSF(wvf1);
oi1 = wvf2oi(wvf1);
oiPlot(oi1,'psf 550');
fprintf('f# %0.2f and defocus %0.2f\n',oiGet(oi1,'fnumber'),oiGet(oi1,'wvf zcoeffs','defocus'));

% Compute
%uar oi1 = oiCompute(oi1,scene);
oi = oiSet(oi1,'name','Defocused');
%oi=oi1;
%oiWindow(oi1);
% uar -------------------------

% your sensor (sensor)
% and your image processor (ip)

% Whatever is in those three structures when this function exits
% will be used in the evaluation process


Slides

File:Ppt uar.pdf


Files

https://office365stanford-my.sharepoint.com/personal/uarosa_stanford_edu/_layouts/15/onedrive.aspx?id=%2Fpersonal%2Fuarosa%5Fstanford%5Fedu%2FDocuments%2FAttachments%2FDavidCardinal%5FAfricanMammals%5Fscreened%5Ffiles%2Ezip&parent=%2Fpersonal%2Fuarosa%5Fstanford%5Fedu%2FDocuments%2FAttachments&originalPath=aHR0cHM6Ly9vZmZpY2UzNjVzdGFuZm9yZC1teS5zaGFyZXBvaW50LmNvbS86dTovZy9wZXJzb25hbC91YXJvc2Ffc3RhbmZvcmRfZWR1L0VRNTNzNUFnb0poT2thUFI3TldXQjhFQkJMM1VsUzJQY2I1b3JzT2otd3ZWLUE_cnRpbWU9cFFFejd4R1EyRWc

https://office365stanford-my.sharepoint.com/personal/uarosa_stanford_edu/_layouts/15/onedrive.aspx?id=%2Fpersonal%2Fuarosa%5Fstanford%5Fedu%2FDocuments%2FAttachments%2FDavidCardinal%5FAfricanMammals%5Fprocessed%5Ffiles%2Ezip&parent=%2Fpersonal%2Fuarosa%5Fstanford%5Fedu%2FDocuments%2FAttachments&originalPath=aHR0cHM6Ly9vZmZpY2UzNjVzdGFuZm9yZC1teS5zaGFyZXBvaW50LmNvbS86dTovZy9wZXJzb25hbC91YXJvc2Ffc3RhbmZvcmRfZWR1L0VTV0c3TTVyb0JaQ3ZydHpzeEY5UzMwQk1ra09jdC02TVk5MVQzbW9INGhzd3c_cnRpbWU9RjdHaVZ4S1EyRWc