ZhuoYi

From Psych 221 Image Systems Engineering
Revision as of 05:52, 19 November 2020 by imported>Student221 (→‎2D Image Compression Quality)

Introduction

Semantic segmentation with a CNN requires a large, high-quality training dataset to achieve high performance. These requirements strain storage and compute resources. Reducing data quality eases that load, but it introduces artifacts that may destroy important image information. The goal of this project is to understand how the quality of the training data affects the performance of a semantic segmentation network.

Methods

In this project, several experiments study the connection between the performance of a semantic segmentation algorithm and the quality of 2D image and 3D lidar point cloud data. For 2D images, the compression ratio and resolution are explored. For 3D lidar data, the resolution and data channels are studied.

Compute Hardware

Machine: MacBook Pro

Processor: 2.9 GHz Quad-Core Intel Core i7

Memory: 16 GB 2133 MHz LPDDR3

Software

Program: MATLAB R2020b

Toolboxes: Deep Learning Toolbox and Lidar Toolbox

Experiment Data Flow

The following diagram shows the data flow used in all the experiments in this project. There are four stages: (1) data loading and label generation, (2) data preprocessing and partitioning, (3) network training, and (4) network evaluation.

Dataset

The experiments use two types of data: 2D images and 3D lidar point clouds. The image data was collected on a highway by a front-facing camera mounted on the ego vehicle, and the lidar data was collected by an Ouster OS1 lidar sensor on the same vehicle. The camera and lidar data are approximately time-synced, and the sensors are calibrated to estimate their intrinsic and extrinsic parameters.

Semantic segmentation requires a pixel-labeled dataset. In this project, both the 2D and 3D pixel label datasets are generated from bounding box labels. This introduces labeling artifacts, since every labeled region is rectangular rather than following the object outline. In addition, only 2 classes (car/background) are used in the 2D image experiments, and 3 classes (car/truck/background) are used in the 3D lidar experiments. Note also that background pixels dominate in most of the images.
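
The box-to-pixel labeling step can be illustrated with a minimal sketch. This is not the project's MATLAB code: the function names, the label IDs, and the (x, y, width, height) box convention are all assumptions made for illustration.

```python
from typing import List, Tuple

# Hypothetical label IDs; the report names the classes but not their encoding.
BACKGROUND, CAR = 0, 1

# Assumed box convention: (x, y, width, height) in pixels, origin at top-left.
Box = Tuple[int, int, int, int]


def boxes_to_mask(height: int, width: int, boxes: List[Box],
                  label: int = CAR) -> List[List[int]]:
    """Label every pixel inside each bounding box with the given class."""
    mask = [[BACKGROUND] * width for _ in range(height)]
    for x, y, w, h in boxes:
        # Clip the box to the image so partially visible objects are handled.
        x0, y0 = max(x, 0), max(y, 0)
        x1, y1 = min(x + w, width), min(y + h, height)
        for row in range(y0, y1):
            for col in range(x0, x1):
                mask[row][col] = label
    return mask


def class_fraction(mask: List[List[int]], label: int) -> float:
    """Fraction of pixels carrying the given label."""
    total = sum(len(row) for row in mask)
    return sum(row.count(label) for row in mask) / total
```

Calling `class_fraction(mask, BACKGROUND)` on such a mask gives a quick measure of how strongly the background class dominates an image.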

Because MATLAB training on the Mac is limited to a single CPU core, the maximum number of epochs is limited to 10 and the maximum number of iterations to 1800 to reduce compute time. For the same reason, only 600 images are selected from each of the image and lidar datasets. The last experiment, on lidar data channels, uses 800 images to improve the metrics. The images are randomly shuffled before being partitioned into training (60%), validation (20%), and testing (20%) sets. Given these limits, the results in this project should be read as relative comparisons between datasets rather than as absolute performance figures.
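
The shuffle-and-partition step could be sketched as follows. The function name `partition` and the fixed seed are illustrative assumptions; the report does not say how the shuffle was seeded.

```python
import random
from typing import List, Sequence, Tuple, TypeVar

T = TypeVar("T")


def partition(items: Sequence[T], train: float = 0.6, val: float = 0.2,
              seed: int = 0) -> Tuple[List[T], List[T], List[T]]:
    """Shuffle the items, then split them into train/validation/test lists."""
    shuffled = list(items)
    # A seeded generator keeps the split reproducible (seed is an assumption).
    random.Random(seed).shuffle(shuffled)
    n_train = round(len(shuffled) * train)
    n_val = round(len(shuffled) * val)
    train_set = shuffled[:n_train]
    val_set = shuffled[n_train:n_train + n_val]
    test_set = shuffled[n_train + n_val:]  # whatever remains
    return train_set, val_set, test_set


# 600 image indices split 60/20/20.
train_set, val_set, test_set = partition(range(600))
print(len(train_set), len(val_set), len(test_set))
```

Because the split follows a random shuffle, the class content of each set differs from run to run unless the seed is fixed, which matters when comparing per-class metrics across experiments.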

Networks and Metrics

For the 2D image experiments, a DeepLab v3+ network is built with weights initialized from a pre-trained ResNet-18 network. For the 3D lidar experiments, a SqueezeSegV2 semantic segmentation network is trained on organized 3D lidar point cloud data. All experiments use compute time, per-class accuracy, and per-class IoU (intersection over union) to evaluate how data quality affects network training and segmentation performance.
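
For reference, the two segmentation metrics can be sketched as below. It uses the usual definitions, accuracy = TP / (TP + FN) and IoU = TP / (TP + FP + FN), computed per class. The function name and the nested-list label maps are assumptions; the project itself relied on MATLAB tooling.

```python
from typing import List, Tuple


def class_metrics(truth: List[List[int]], pred: List[List[int]],
                  label: int) -> Tuple[float, float]:
    """Return (accuracy, IoU) for one class from two label maps of equal size."""
    tp = fp = fn = 0
    for truth_row, pred_row in zip(truth, pred):
        for t, p in zip(truth_row, pred_row):
            if t == label and p == label:
                tp += 1  # correctly predicted pixel of this class
            elif p == label:
                fp += 1  # predicted as this class, but truth differs
            elif t == label:
                fn += 1  # pixel of this class that was missed
    accuracy = tp / (tp + fn) if (tp + fn) else float("nan")
    iou = tp / (tp + fp + fn) if (tp + fp + fn) else float("nan")
    return accuracy, iou


truth = [[1, 1, 0, 0],
         [1, 1, 0, 0]]
pred = [[1, 0, 0, 0],
        [1, 1, 1, 0]]
print(class_metrics(truth, pred, label=1))
```

IoU penalizes false positives while class accuracy does not, so a network that over-predicts "car" can score high accuracy yet low IoU for that class.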

Results

2D Image Resolution

In this experiment, the image resolution is reduced by factors of 2 and 4 in both dimensions (horizontally and vertically). The resulting 644x482 and 322x241 images are approximately 70% and 91% smaller than the originals, respectively.
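
As a quick check of the resolution arithmetic, the sketch below computes the reduced image sizes and the corresponding drop in pixel count. The full resolution of 1288x964 is not stated in this report; it is inferred here from the two reduced sizes. The helper names are illustrative.

```python
from typing import Tuple


def downscaled_size(size: Tuple[int, int], factor: int) -> Tuple[int, int]:
    """Width and height after dividing both dimensions by the factor."""
    width, height = size
    return width // factor, height // factor


def pixel_reduction(factor: int) -> float:
    """Fraction of pixels removed when both dimensions shrink by the factor."""
    return 1 - 1 / (factor * factor)


FULL_SIZE = (1288, 964)  # inferred from 644x482 and 322x241, not from the source

for factor in (2, 4):
    print(factor, downscaled_size(FULL_SIZE, factor), pixel_reduction(factor))
```

The pixel count falls by 75% and 93.75%. The reported 70% and 91% sit slightly below those values, which would be consistent with them describing file size rather than pixel count, though that reading is an assumption.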

Training on the 644x482 dataset takes about 3 times as long as training on either of the other two resolutions. The network trained on the 644x482 dataset also appears to have better accuracy and IoU for “Car” segmentation than the other two.


2D Image Compression Quality

Ideally, a RAW image dataset would be used. Because the only data available from the source is in JPEG format, the original JPEG images serve as the Baseline and are recompressed at lower quality. The images above show the result at different JPEG quality factors. Only quality factors 50 and 25 (Q50 and Q25) are used in training; the resulting files are 24% and 56% smaller than the original JPEG images, respectively.

Training on the compressed datasets (Q50 and Q25) takes more than 13 hours. Surprisingly, the accuracies of both networks are much higher than that of the network trained on the Baseline dataset. Explaining this will require further study of the underlying network architecture.

3D Lidar Data Resolution

3D Lidar Data Channels

Conclusions

The measurements from the experiments in this project are far from conclusive. They yield a few interesting observations, but an in-depth study of the underlying network architectures is required to explain the observed network behavior.

Observations

  • A MacBook Pro is not an ideal machine for this task!
  • With the same network architecture, longer training time and a larger volume of training data result in better segmentation performance.
  • Network behavior varies with the attributes of the data, for both 2D images and 3D lidar point clouds.
  • The segmentation performance can vary between classes. This may be due to differences in the training images. (Recall that the images are shuffled before partitioning into training, validation, and testing sets.)
  • The point cloud segmentation network behaves differently from the 2D image segmentation network. Its performance is very sensitive to the volume and quality of the training data.

Future Studies

  • Why do compressed 2D training images result in better network performance, and why do they make the training process harder to converge?
  • How reliable are the performance metrics?
  • For 2D image segmentation, why does DeepLab v3+ take longer to train and perform better on the 644x482 dataset than on the higher-resolution images?

Acknowledgement

I would like to thank Brian Wandell and David Cardinal for their guidance and suggestions, with special thanks to David for pointing me to the dataset used in all the experiments in this project.

Reference

  • Suvash Sharma, Christopher Hudson, Daniel Carruth, Matt Doude, John E. Ball, Bo Tang, Chris Goodin, and Lalitha Dabbiru. "Performance analysis of semantic segmentation algorithms trained with JPEG compressed datasets", Proc. SPIE 11401, Real-Time Image Processing and Deep Learning 2020, 1140104 (22 April 2020); https://doi.org/10.1117/12.2557928
  • Chen, Liang-Chieh et al. “Encoder-Decoder with Atrous Separable Convolution for Semantic Image Segmentation.” ECCV (2018).
  • Brostow, G. J., J. Fauqueur, and R. Cipolla. "Semantic object classes in video: A high-definition ground truth database." Pattern Recognition Letters. Vol. 30, Issue 2, 2009, pp 88-97.
  • Wu, Bichen, Xuanyu Zhou, Sicheng Zhao, Xiangyu Yue, and Kurt Keutzer. “SqueezeSegV2: Improved Model Structure and Unsupervised Domain Adaptation for Road-Object Segmentation from a LiDAR Point Cloud.” In 2019 International Conference on Robotics and Automation (ICRA), 4376–82. Montreal, QC, Canada: IEEE, 2019. https://doi.org/10.1109/ICRA.2019.8793495

Appendix

Dataset Source

Code

  • Coming soon after presentation