ZhuoYi: Difference between revisions
Revision as of 05:30, 19 November 2020
Introduction
Semantic segmentation with convolutional neural networks (CNNs) needs a large, high-quality training dataset to perform well, which puts pressure on storage and compute resources. Lowering data quality to save resources, for example through compression, introduces artifacts that may destroy important image information. Addressing this trade-off requires a better understanding of how the quality of the training data affects segmentation performance. The goal of this project is to study that relationship.
Methods
In this project, several experiments are conducted to study the connection between the performance of a semantic segmentation algorithm and the quality of 2D images and 3D lidar point clouds. For 2D images, the attributes explored are compression ratio and resolution. For 3D lidar data, resolution and data channels are studied.
Compute Hardware
Machine: MacBook Pro
Processor: 2.9 GHz Quad-Core Intel Core i7
Memory: 16 GB 2133 MHz LPDDR3
Software
Program: MATLAB R2020b
Toolboxes: Deep Learning Toolbox and Lidar Toolbox
Experiment Data Flow
The following diagram shows the data flow used in all the experiments in this project. There are four stages: (1) data loading and label generation, (2) data preprocessing and partitioning, (3) network training, and (4) network evaluation.
Dataset
The experiments use two types of data: 2D images and 3D lidar point clouds. The image data was collected on a highway by a front-facing camera mounted on the ego vehicle, and the lidar data was collected by an Ouster OS1 lidar sensor on the same vehicle. The camera and lidar data are approximately time-synced, and the sensors are calibrated to estimate their intrinsic and extrinsic parameters.
Semantic segmentation requires a pixel-level label dataset. In this project, both the 2D and 3D pixel labels are generated from bounding box labels. This introduces label artifacts: every labeled region is rectangular, so pixels inside a box that do not belong to the object are still labeled as the object. In addition, only 2 classes (car/background) are used in the 2D image experiments, and 3 classes (car/truck/background) are used in the 3D lidar experiments. Note also that background pixels dominate in most of the images.
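As a rough illustration of how bounding boxes become rectangular pixel labels, and of how strongly background can dominate, here is a minimal Python sketch. It is not the project's MATLAB code: the function names, the `(x, y, w, h, class_id)` box layout, and the class ids (0 for background, 1 for car) are assumptions made for this example.

```python
def boxes_to_mask(height, width, boxes, background=0):
    """Rasterize axis-aligned boxes into a per-pixel label mask.

    boxes: list of (x, y, w, h, class_id), where (x, y) is the top-left
    corner in pixels (an assumed layout for this sketch). Later boxes
    overwrite earlier ones where they overlap.
    """
    mask = [[background] * width for _ in range(height)]
    for x, y, w, h, class_id in boxes:
        # Clip the box to the image bounds before filling it.
        x0, y0 = max(0, x), max(0, y)
        x1, y1 = min(width, x + w), min(height, y + h)
        for row in range(y0, y1):
            for col in range(x0, x1):
                mask[row][col] = class_id
    return mask


def class_fractions(mask):
    """Return the fraction of pixels assigned to each class id."""
    counts = {}
    total = 0
    for row in mask:
        for label in row:
            counts[label] = counts.get(label, 0) + 1
            total += 1
    return {label: n / total for label, n in counts.items()}


# Toy 6x8 image with one 3x2 "car" box (class id 1, assumed).
mask = boxes_to_mask(6, 8, [(2, 1, 3, 2, 1)])
fractions = class_fractions(mask)
print(fractions)
```

In this toy case 6 of the 48 pixels are labeled as car, so background accounts for 87.5% of the mask, which mirrors the class imbalance noted above.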
Because MATLAB supports only single-core CPU training on Mac, the maximum number of epochs is limited to 10 and the maximum number of iterations to 1800 to reduce compute time. For the same reason, only 600 images are selected from each of the image and lidar datasets. The last experiment, on lidar data channels, uses 800 images to improve the metrics. The images are randomly shuffled before being partitioned into training (60%), validation (20%), and test (20%) sets. Because of these limits, the results in this project should be read as relative comparisons between experiments rather than as absolute performance figures.
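The shuffle-then-partition step can be sketched as follows. This is a minimal Python illustration under stated assumptions, not the MATLAB code used in the project: the function name `partition_indices` and the fixed random seed are hypothetical and chosen only to make the example reproducible.

```python
import random


def partition_indices(n_items, train_frac=0.6, valid_frac=0.2, seed=0):
    """Shuffle item indices and split them into train/validation/test lists.

    The test set receives whatever remains after the train and
    validation portions are taken.
    """
    indices = list(range(n_items))
    random.Random(seed).shuffle(indices)  # seed is an assumption of this sketch
    n_train = round(n_items * train_frac)
    n_valid = round(n_items * valid_frac)
    train = indices[:n_train]
    valid = indices[n_train:n_train + n_valid]
    test = indices[n_train + n_valid:]
    return train, valid, test


train, valid, test = partition_indices(600)
print(len(train), len(valid), len(test))  # 360 120 120
```

Because the shuffle is random, which images land in each set changes from run to run unless the seed is fixed, so class balance can differ between the three sets.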
Networks and Metrics
For the 2D image experiments, a Deeplab v3+ network is built with weights initialized from a pre-trained ResNet-18 network. For the 3D lidar experiments, a SqueezeSegV2 semantic segmentation network is trained on organized 3D lidar point cloud data. All experiments use compute time, per-class accuracy, and intersection over union (IoU) to evaluate how data quality affects network training and segmentation performance.
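For reference, per-class accuracy and IoU can both be computed from a confusion matrix. The Python sketch below assumes the common definitions accuracy = TP / (TP + FN) and IoU = TP / (TP + FP + FN); the function names and the toy labels are illustrative and are not taken from the project's code.

```python
def confusion_matrix(truth, predicted, n_classes):
    """Count pixels by (true class, predicted class)."""
    cm = [[0] * n_classes for _ in range(n_classes)]
    for t, p in zip(truth, predicted):
        cm[t][p] += 1
    return cm


def per_class_metrics(cm):
    """Return a list of {'accuracy', 'iou'} dicts, one per class."""
    n = len(cm)
    metrics = []
    for c in range(n):
        tp = cm[c][c]
        fn = sum(cm[c]) - tp                         # true class c, predicted otherwise
        fp = sum(cm[r][c] for r in range(n)) - tp    # predicted c, true class differs
        accuracy = tp / (tp + fn) if tp + fn else float("nan")
        iou = tp / (tp + fp + fn) if tp + fp + fn else float("nan")
        metrics.append({"accuracy": accuracy, "iou": iou})
    return metrics


# Flattened toy labels: 0 = background, 1 = car (assumed ids).
truth     = [0, 0, 0, 0, 0, 0, 1, 1, 1, 1]
predicted = [0, 0, 0, 0, 0, 1, 1, 1, 1, 0]
metrics = per_class_metrics(confusion_matrix(truth, predicted, 2))
print(metrics)
```

IoU penalizes false positives as well as false negatives, so it is never higher than the per-class accuracy computed this way. In the toy example the car class has accuracy 0.75 but IoU 0.6.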
Results
Conclusions
The measurements from the experiments in this project are far from conclusive. There are a few interesting observations, but an in-depth study of the underlying network architectures is required to explain the observed network performance.
Observations
- A MacBook Pro is not an ideal machine for this task!
- With the same network architecture, longer training time and a larger volume of training data result in better segmentation performance.
- Network behavior varies with the attributes of the data, for both 2D images and 3D lidar point clouds.
- Segmentation performance can vary between classes. This may be due to differences in the training images. (Recall that the images are shuffled before partitioning into training, validation, and test sets.)
- The point cloud segmentation network behaves differently from the 2D image segmentation network. Its performance is very sensitive to the volume and quality of the training data.
Future Studies
- Why do compressed 2D training images result in better network performance, and why do they make the training process harder to converge?
- How reliable are the performance metrics?
- For 2D image segmentation, why does Deeplab v3+ take longer to train and perform better with the 644x482 image dataset than with higher-resolution images?
Acknowledgement
I would like to thank Brian Wandell and David Cardinal for their guidance and suggestions, with special thanks to David for pointing me to the dataset used in all the experiments in this project.
Reference
- Suvash Sharma, Christopher Hudson, Daniel Carruth, Matt Doude, John E. Ball, Bo Tang, Chris Goodin, and Lalitha Dabbiru "Performance analysis of semantic segmentation algorithms trained with JPEG compressed datasets", Proc. SPIE 11401, Real-Time Image Processing and Deep Learning 2020, 1140104 (22 April 2020); https://doi-org.stanford.idm.oclc.org/10.1117/12.2557928
- Chen, Liang-Chieh et al. “Encoder-Decoder with Atrous Separable Convolution for Semantic Image Segmentation.” ECCV (2018).
- Brostow, G. J., J. Fauqueur, and R. Cipolla. "Semantic object classes in video: A high-definition ground truth database." Pattern Recognition Letters. Vol. 30, Issue 2, 2009, pp 88-97.
- Wu, Bichen, Xuanyu Zhou, Sicheng Zhao, Xiangyu Yue, and Kurt Keutzer. “SqueezeSegV2: Improved Model Structure and Unsupervised Domain Adaptation for Road-Object Segmentation from a LiDAR Point Cloud.” In 2019 International Conference on Robotics and Automation (ICRA), 4376–82. Montreal, QC, Canada: IEEE, 2019. https://doi.org/10.1109/ICRA.2019.8793495.
Appendix
Dataset Source
Code
- Coming soon after presentation