Accelerating L3 Processing Pipeline for Cameras with Novel CFAs on NVIDIA® Shield™ Tablets using GPUs

From Psych 221 Image Systems Engineering

Author: Negar Rahmati

Introduction

A graphics processing unit (GPU), also occasionally called a visual processing unit (VPU), is a specialized electronic circuit designed to rapidly manipulate and alter memory to accelerate the creation of images in a frame buffer intended for output to a display. GPUs are used in embedded systems, mobile phones, personal computers, workstations, and game consoles. Modern GPUs are very efficient at computer graphics and image processing, and their highly parallel structure makes them more effective than general-purpose CPUs for algorithms that process large blocks of data in parallel.

One algorithm well suited to a parallel CUDA implementation on the GPU is the L3 algorithm. To speed the development of novel camera architectures, the L3 method (Local, Linear and Learned) was proposed to automatically create an optimized image processing pipeline. The L3 method assigns each sensor pixel to one of 400 classes and applies a class-dependent local linear transform that maps the sensor data from a pixel and its neighbors into the target output (e.g., CIE XYZ rendered under a D65 illuminant). [1]
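As a concrete illustration, applying one class's transform is just a small matrix-vector product between a flattened pixel neighborhood and the stored class matrix. The sketch below is not the ISET/L3 implementation; the 5x5 patch size and the row-major layout of the transform are assumptions made for illustration:

    // Minimal sketch: apply a class-specific linear transform to a pixel's
    // neighborhood to estimate its CIE XYZ value. The neighborhood is a 5x5
    // patch flattened to 25 samples; each class stores a 3x25 matrix.
    #include <array>

    constexpr int kPatchSize = 25;                        // 5x5 neighborhood, flattened
    using Transform = std::array<float, 3 * kPatchSize>;  // rows: X, Y, Z

    std::array<float, 3> applyL3Transform(const float patch[kPatchSize],
                                          const Transform& t) {
        std::array<float, 3> xyz = {0.0f, 0.0f, 0.0f};
        for (int c = 0; c < 3; ++c)                       // one output channel at a time
            for (int i = 0; i < kPatchSize; ++i)
                xyz[c] += t[c * kPatchSize + i] * patch[i];
        return xyz;
    }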

This project aims to accelerate the L3 pipeline on NVIDIA® Shield™ Tablets using GPUs for real-time rendering of video. The CUDA code is run on the Tegra Shield tablet, a gaming-oriented tablet with 192 supercomputer-class GPU cores; it has the first GPU architecture to span supercomputers, PCs, and mobile devices. Finally, the run time on the Tegra Shield tablet is compared with the run time of the same CUDA code on a CPU and on a desktop with a GTX 770.

Background

The L3 method automatically generates an image processing pipeline (sensor correction, demosaicking, illuminant correction, noise reduction) for novel camera architectures. The method comprises two parts: the L3 processing pipeline and the L3 training module. The L3 processing pipeline first classifies each sensor pixel into one of many possible classes based on a statistical analysis of the responses at the pixel and its neighborhood. It then applies the class-associated local linear transform to map the pixel to the target output space. [1]

Fig. 1 – Illuminant correction with L3 image processing pipelines. (a) Cross-illuminant (XI) illuminant dependent L3 processing. (b) Same-illuminant (SI) illuminant-independent L3 processing followed by a linear illuminant correction transform (T).

The L3 training module pre-computes and stores the table of class-associated linear transforms that convert sensor data to the target CIE XYZ values. The transforms are learned from the training data using Wiener estimation. The training data consist of sensor responses and the desired rendering data (i.e., CIE XYZ values for consumer photography). They are generated through simulation from a camera model implemented with the Image Systems Engineering Toolbox (ISET). [1] [2]

The L3 algorithm explores an approach that separates illuminant correction from the rest of the pipeline. A same-illuminant table is learned that renders data acquired under one illuminant into XYZ values under the same illuminant. This accomplishes demosaicking, denoising, and sensor correction, but it does not perform illuminant correction (Figure 1(b)). An illuminant correction transform is then applied to obtain the final rendering. This architecture requires training and storing only one table of L3 transforms, used for all illuminants, plus one illuminant correction matrix for each illuminant of interest. [1]
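In the architecture of Fig. 1(b), the second stage reduces to a per-pixel 3x3 matrix multiply on the same-illuminant XYZ output. A minimal sketch, assuming the matrix T has been selected for the illuminant of interest:

    // Hypothetical sketch of the illuminant correction stage: a single 3x3
    // matrix T (one per illuminant) applied to the XYZ values produced by the
    // same-illuminant (SI) L3 table.
    void correctIlluminant(const float T[3][3], const float xyzIn[3], float xyzOut[3]) {
        for (int r = 0; r < 3; ++r)
            xyzOut[r] = T[r][0] * xyzIn[0] + T[r][1] * xyzIn[1] + T[r][2] * xyzIn[2];
    }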

The L3 algorithm uses simulated training data to learn a table of linear transforms. Each pixel in the simulated sensor data is assigned to a distinct class based on (a) its color type (R, G, B, W), (b) the neighborhood saturation pattern (no saturation, W-saturation, W&G-saturation), (c) the response voltage level (20 samples spanning the voltage swing), and (d) the spatial contrast (uniform or textured). The transform for a given class is derived by optimizing a Wiener filter between the data in the pixel neighborhood in the sensor and the corresponding ideal values in the target color space (XYZ). [1]
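The class transforms come from offline training rather than the run-time pipeline. As a rough sketch of a Wiener-style estimator, written here as regularized least squares with Eigen (the actual L3 training is done offline in MATLAB/ISET, so the function name and the regularizer lambda are illustrative assumptions), each transform solves a small linear system built from the class's training patches and their target XYZ values:

    // Sketch of learning one class transform by regularized least squares.
    // X holds one flattened sensor neighborhood per row, Y the matching target
    // XYZ triplets; lambda is a small regularization constant.
    #include <Eigen/Dense>

    Eigen::MatrixXf learnTransform(const Eigen::MatrixXf& X,   // N x P patches
                                   const Eigen::MatrixXf& Y,   // N x 3 targets
                                   float lambda) {
        const int P = X.cols();
        Eigen::MatrixXf A = X.transpose() * X
                          + lambda * Eigen::MatrixXf::Identity(P, P);
        // Solve (X'X + lambda*I) T = X'Y for the P x 3 transform T.
        return A.ldlt().solve(X.transpose() * Y);
    }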

The L3 pipeline can be even more efficient if implemented in CUDA, taking advantage of the parallel structure of the algorithm: the rendering can be parallelized over pixels. The rendering process identifies the class membership of each pixel in the sensor data and then applies the appropriate stored linear transform. The image processing pipeline, which performs demosaicking, denoising, and sensor conversion, comprises the application of these precomputed local, linear transforms. [1]
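This per-pixel structure maps naturally onto a CUDA kernel with one thread per pixel. The sketch below is not the project's L3.cu; the classifier, the 5x5 patch size, the 20 voltage bins, and the transform-table layout are illustrative assumptions:

    // Illustrative classifier: combines the CFA position with a coarse response
    // level; the real L3 classes also use saturation and texture cues.
    __device__ int classifyPixel(const float* sensor, int x, int y, int w) {
        int cfa = (y & 1) * 2 + (x & 1);            // position in a 2x2 CFA block
        float v = sensor[y * w + x];                // normalized response in [0,1]
        int level = min(19, (int)(v * 20.0f));      // 20 voltage-level bins
        return cfa * 20 + level;
    }

    // One thread per pixel: classify, fetch the class transform, apply it to
    // the 5x5 neighborhood, and write the rendered XYZ triplet.
    __global__ void l3RenderKernel(const float* sensor, float* xyz,
                                   const float* transforms,   // numClasses x 3 x 25
                                   int width, int height) {
        int x = blockIdx.x * blockDim.x + threadIdx.x;
        int y = blockIdx.y * blockDim.y + threadIdx.y;
        if (x < 2 || y < 2 || x >= width - 2 || y >= height - 2) return;

        int cls = classifyPixel(sensor, x, y, width);
        const float* t = transforms + cls * 3 * 25;

        float out[3] = {0.0f, 0.0f, 0.0f};
        int k = 0;
        for (int dy = -2; dy <= 2; ++dy)
            for (int dx = -2; dx <= 2; ++dx, ++k) {
                float v = sensor[(y + dy) * width + (x + dx)];
                out[0] += t[0 * 25 + k] * v;
                out[1] += t[1 * 25 + k] * v;
                out[2] += t[2 * 25 + k] * v;
            }
        int idx = (y * width + x) * 3;
        xyz[idx] = out[0]; xyz[idx + 1] = out[1]; xyz[idx + 2] = out[2];
    }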


Fig. 2 – The Tegra Shield tablet has 192 super-computer class GPU cores.


Tools

The run-time measurements were made by running the CUDA code on a Tegra Shield tablet. This tablet has 192 supercomputer-class GPU cores and is well suited to running the CUDA code. Developing Android applications requires many tools, such as the Android SDK, the Android NDK, Java, the Eclipse IDE, the Android Development Tools (ADT), and a command line (bash on Mac or Linux, Cygwin on Windows). To simplify installation, NVIDIA has created the Tegra Android Development Pack (TADP), a single file that installs everything you need; NVIDIA also provides a virtual machine image containing all the tools. [3] This pack enables CUDA programming on a compatible Android device. The CUDA platform lets the programmer leverage the parallel processing power of modern GPUs to tackle more general problems than those specific to graphics.

Method

The instructions for creating a new CUDA-compatible Android project can be found in the CUDA on Tegra section of the Tegra tutorial. [3] The structure of the project is similar to that of a normal project except that it has a folder specific to CUDA files. The Android native files are located in the JNI folder. The Java Native Interface (JNI) is a programming framework that enables Java code running in a Java Virtual Machine (JVM) to call, and be called by, native applications (programs specific to a hardware and operating-system platform) and libraries written in other languages such as C, C++, and assembly. The file nativeApp.cpp is the Android code used to communicate with the device, render frames, respond to touch inputs, etc. More importantly, this Android starter code is responsible for calling the CUDA functions that run on the device.
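A hypothetical JNI entry point, playing roughly the role nativeApp.cpp plays (the package, class, and function names below are made up for illustration), is a native function the Java side can call to render one frame through the CUDA pipeline:

    #include <jni.h>
    #include <android/log.h>

    // Implemented in the separately compiled CUDA library (see Fig. 3 makefiles).
    extern "C" void l3RenderFrameCuda(const float* sensor, float* xyz,
                                      int width, int height);

    // Called from Java as NativeRenderer.renderFrame(...); the jlong arguments
    // carry native buffer addresses previously allocated on the C++ side.
    extern "C" JNIEXPORT void JNICALL
    Java_com_example_l3demo_NativeRenderer_renderFrame(JNIEnv* env, jobject /*thiz*/,
                                                       jlong sensorPtr, jlong xyzPtr,
                                                       jint width, jint height) {
        l3RenderFrameCuda(reinterpret_cast<const float*>(sensorPtr),
                          reinterpret_cast<float*>(xyzPtr), width, height);
        __android_log_print(ANDROID_LOG_INFO, "L3", "rendered %dx%d frame", width, height);
    }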

Fig. 3 – Sample CUDA and android makefiles. The full project is available in the code section.

The CUDA code should be wrapped in a class so that the Android starter code can call it on the device. The CUDA code is then compiled separately using a makefile similar to the one in Fig. 3. Making the CUDA files with this makefile creates libraries that Android.mk, the makefile for the Android side, can use to run the code on the device. Samples of both the CUDA makefile and the Android.mk file are shown in Fig. 3. Note that the CUDA code needs to be compiled separately before the project is run as an Android application.
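One way to structure the wrapper is a small header that the Android starter code includes, while the implementation stays in the .cu file compiled by the CUDA makefile. This is a sketch; the class and member names are hypothetical:

    // L3Renderer.h -- interface the Android/JNI code sees; the definitions
    // live in the .cu file built into the prebuilt CUDA library.
    #pragma once

    class L3Renderer {
    public:
        L3Renderer(int width, int height);   // allocates device buffers
        ~L3Renderer();                       // frees device buffers
        // Copies sensor data to the GPU, runs the L3 kernels, copies XYZ back.
        void render(const float* hostSensor, float* hostXyz);
    private:
        int width_, height_;
        float* dSensor_ = nullptr;           // device pointers
        float* dXyz_ = nullptr;
        float* dTransforms_ = nullptr;
    };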

Read/write permissions on the device are restrictive, so the /sdcard/ directory is used to store files that will be manipulated. Memory is also limited, so testing should start with small chunks of data; in this project, data from one image frame were used to evaluate the L3 run time on the Tegra Shield tablet. Note that if we only want the Android starter code to render images, we can use the assets folder on the simulator instead of the SD card on the device.
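Reading the test frame from /sdcard/ can be done with ordinary C file I/O. The path and frame layout below are illustrative, and the app's manifest must grant external-storage permissions for the open to succeed:

    #include <cstdio>
    #include <vector>

    // Load one raw float frame from the SD card; returns an empty vector on
    // failure (e.g., missing permission or truncated file).
    std::vector<float> loadFrame(const char* path, size_t numPixels) {
        std::vector<float> frame(numPixels);
        FILE* f = std::fopen(path, "rb");
        if (!f) return {};
        size_t read = std::fread(frame.data(), sizeof(float), numPixels, f);
        std::fclose(f);
        if (read != numPixels) frame.clear();
        return frame;
    }
    // Usage: auto frame = loadFrame("/sdcard/l3/frame000.raw", width * height);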

Android provides the Android Debug Bridge (adb) to debug the application. Alternatively, we can use Android's built-in logging facility as follows to trace instruction execution on the device:

    // Define a logging macro that writes info-level messages to logcat,
    // tagged with APP_NAME, from native (C/C++) code.
    #include <android/log.h>
    #define LOGI(...) ((void)__android_log_print(ANDROID_LOG_INFO,  APP_NAME, __VA_ARGS__))
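A hypothetical call site, for example placed right after a kernel finishes, then shows up as an info-level message tagged APP_NAME in the logcat output:

    // Example use of the macro above; frameIdx and elapsedMs are illustrative.
    LOGI("L3 frame %d rendered in %.2f ms", frameIdx, elapsedMs);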

The link to the project on Github as well as the README file is available in the code section below.

Time is measured using regular C/C++ timestamp functions. Since some CUDA operations are non-blocking (kernel launches return before the GPU finishes), all time measurements are taken after a short-overhead blocking operation that synchronizes with the device.
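A sketch of that measurement pattern, using std::chrono and cudaDeviceSynchronize(); the render callback stands in for the project's actual kernel launches:

    #include <cuda_runtime.h>
    #include <chrono>

    // Measure one frame of GPU work. Kernel launches return immediately, so the
    // device is synchronized before and after the work to bound the interval.
    float timeRenderMs(void (*render)(void)) {
        cudaDeviceSynchronize();                            // drain prior work
        auto t0 = std::chrono::high_resolution_clock::now();
        render();                                           // launch L3 kernels
        cudaDeviceSynchronize();                            // block until done
        auto t1 = std::chrono::high_resolution_clock::now();
        return std::chrono::duration<float, std::milli>(t1 - t0).count();
    }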

Results

The L3 processing algorithm was run on a Tegra Shield tablet for 100 frames. First, the run time was measured including all I/O operations and memory allocations. As shown in Figure 4, the run time is inconsistent and larger than expected in comparison to the run times on the other devices.

Fig. 4 – L3.cu run time on the Tegra shield tablet for 100 frames. The time includes I/Os.

The I/O to and from the device's SD card introduced large inconsistencies and delays in the results. Assuming that the final product will not read from or write to the SD card, the L3 run time was measured again in the same scenario described above, this time excluding the run time of the I/O operations.

As shown in Figure 5, the new measurements are far more consistent and the run time has decreased dramatically. Reading from the SD card is a likely cause of the inconsistency and randomness in the earlier results, since the storage has to service several other requests at the same time and the reads do not have high I/O priority in the scheduler.

Fig. 5 – L3.cu run time on the Tegra shield tablet for 100 frames. The time excludes I/Os.

Below are the run times of the algorithm on two other devices for comparison. The GPU on the tablet is roughly three times faster than the desktop with a GTX 770.

Run time of the L3 processing algorithm on 3 different devices:

  • Tegra Shield Tablet: 23 ms
  • Desktop with GTX 770: 62 ms
  • CPU: 12400 ms


Future Work

Future work could focus on optimizing the existing CUDA code. Several applicable techniques are described in "CUDA optimization strategies for compute- and memory-bound neuroimaging algorithms". [4]

Acknowledgment

Special thanks to Haomiao Jiang, the mentor of the project, for providing the L3 pipeline CUDA code and its run times on the desktop and CPU, and for being extremely helpful throughout the project.

References

1. Germain, F. G., Akinola, I. A., Tian, Q., Lansel, S., & Wandell, B. A. (2015, February). Efficient illumination correction in the local, linear, learned (L3) method.
2. Farrell, J. E., Xiao, F., Catrysse, P. B., and Wandell, B. A., “A simulation tool for evaluating digital camera image quality,” in [Electronic Imaging 2004], 124–131, International Society for Optics and Photonics (2003)
3. Tegra Shield tablet Teaching material. Accessed 03/01/2015. <https://developer.nvidia.com/cuda-education>
4. Lee, Daren, et al. "CUDA optimization strategies for compute- and memory-bound neuroimaging algorithms." Computer Methods and Programs in Biomedicine 106.3 (2012): 175-187.

Code

The compiled project is available on my GitHub page: <https://github.com/negarrahmati/L3-Tegra2>