ZhuoYi
Correlation Analysis between Semantic Segmentation Performance and Training Data Quality
Introduction
Semantic segmentation with convolutional neural networks (CNNs) needs a large, high-quality training dataset to achieve high performance. These requirements strain storage and compute resources. Lowering data quality to save resources, however, introduces artifacts that may destroy important image information. To manage this trade-off, we need to understand how the quality of the training data affects segmentation performance. The goal of this project is to measure that effect.
Methods
In this project, several experiments are conducted to study the connection between the performance of a semantic segmentation algorithm and the quality of 2D image and 3D lidar point cloud data. For 2D images, compression ratio and resolution are explored. For 3D lidar data, resolution and data channels are studied.
Compute Hardware
Machine: MacBook Pro
Processor: 2.9 GHz Quad-Core Intel Core i7
Memory: 16 GB 2133 MHz LPDDR3
Software
Program: MATLAB R2020b
Toolboxes: Deep Learning Toolbox and Lidar Toolbox
Experiment Data Flow
The following diagram shows the data flow used in all experiments in this project. There are four stages: (1) data loading and label generation, (2) data preprocessing and partitioning, (3) network training, and (4) network evaluation.
Dataset
The experiments use two types of data: 2D images and 3D lidar point clouds. The image data was collected on a highway by a front-facing camera mounted on the ego vehicle, and the lidar data was collected by an Ouster OS1 lidar sensor on the same vehicle. The camera and lidar data are approximately time-synced, and the sensors are calibrated to estimate their intrinsic and extrinsic parameters.
Semantic segmentation requires a pixel-labeled dataset. In this project, both the 2D and 3D pixel labels are generated from bounding box labels. This introduces labeling artifacts, since every labeled region is rectangular and so includes background pixels that fall inside the box. In addition, only 2 classes (car/background) are used in the 2D image experiments, and 3 classes (car/truck/background) are used in the 3D lidar experiments. Note also that background pixels dominate in most of the images.
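The conversion from bounding boxes to pixel labels described above can be sketched as follows. This is a minimal illustration, not the project's MATLAB code: the function names and the `(x, y, w, h, class_id)` box layout are assumptions made for the example.

```python
def boxes_to_label_mask(height, width, boxes, background=0):
    """Rasterize axis-aligned boxes into a per-pixel class-id mask.

    boxes: iterable of (x, y, w, h, class_id), with (x, y) the top-left
    corner in pixel coordinates. This layout is an assumption for the
    sketch, not taken from the project code.
    """
    mask = [[background] * width for _ in range(height)]
    for x, y, w, h, class_id in boxes:
        # Clip the box to the image so partially visible boxes still work.
        r0, r1 = max(0, y), min(height, y + h)
        c0, c1 = max(0, x), min(width, x + w)
        for r in range(r0, r1):
            for c in range(c0, c1):
                mask[r][c] = class_id
    return mask


def class_pixel_fraction(mask, class_id):
    """Fraction of all pixels in the mask that carry the given class id."""
    total = sum(len(row) for row in mask)
    hits = sum(row.count(class_id) for row in mask)
    return hits / total
```

Every pixel inside a box receives the object class, which is the source of the rectangular-label artifact, and `class_pixel_fraction` gives a quick view of how strongly the background class dominates a given mask.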
Because MATLAB supports only single-core CPU training on the Mac, the maximum number of epochs and iterations is limited to 10 and 1800, respectively, to reduce compute time. For the same reason, only 600 images are selected for each of the image and lidar datasets. The last experiment, on lidar data channels, uses 800 images to improve the metrics. The images are randomly shuffled before being partitioned into training (60%), validation (20%), and testing (20%) sets. Because of these limits, the results in this project should be read as relative rather than absolute performance metrics.
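The shuffle-then-partition step can be sketched as below. The function name and the `seed` parameter are illustrative assumptions; the source does not say whether the project fixed a random seed.

```python
import random


def shuffle_and_split(items, fractions=(0.6, 0.2, 0.2), seed=0):
    """Shuffle items, then split them into train/validation/test lists.

    The seed is a hypothetical addition: fixing it makes every run use
    the same partition.
    """
    items = list(items)
    random.Random(seed).shuffle(items)
    n = len(items)
    n_train = round(fractions[0] * n)
    n_val = round(fractions[1] * n)
    train = items[:n_train]
    val = items[n_train:n_train + n_val]
    test = items[n_train + n_val:]  # the remainder goes to the test set
    return train, val, test
```

With 600 items this yields 360/120/120, and with 800 items 480/160/160. Reusing one seed across runs would remove differences in the training images as a source of run-to-run variation.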
Networks and Metrics
For the 2D image experiments, a DeepLab v3+ network is built with weights initialized from a pre-trained ResNet-18 network. For the 3D lidar experiments, a SqueezeSegV2 semantic segmentation network is trained on 3D organized lidar point cloud data. All experiments use compute time, class accuracy, and intersection over union (IoU) to evaluate how data quality affects network training and segmentation performance.
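For reference, the two segmentation metrics can be computed from per-class counts of true positives (TP), false positives (FP), and false negatives (FN). The sketch below assumes the common definitions, class accuracy = TP / (TP + FN) and IoU = TP / (TP + FP + FN). The function name is illustrative, and the project itself relied on MATLAB's evaluation tools rather than this code.

```python
def per_class_metrics(truth, pred, class_ids):
    """Return {class_id: (accuracy, iou)} for flat sequences of labels.

    accuracy = TP / (TP + FN): share of the class's true pixels found.
    iou      = TP / (TP + FP + FN): overlap between truth and prediction.
    """
    metrics = {}
    for c in class_ids:
        tp = sum(1 for t, p in zip(truth, pred) if t == c and p == c)
        fn = sum(1 for t, p in zip(truth, pred) if t == c and p != c)
        fp = sum(1 for t, p in zip(truth, pred) if t != c and p == c)
        accuracy = tp / (tp + fn) if (tp + fn) else float("nan")
        iou = tp / (tp + fp + fn) if (tp + fp + fn) else float("nan")
        metrics[c] = (accuracy, iou)
    return metrics
```

Accuracy ignores false positives, so a network that over-predicts a class can score high accuracy with a low IoU. Reporting both helps separate these cases.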
Results
2D Image Resolution
In this experiment, image resolution is reduced by factors of 2 and 4 in both dimensions (horizontally and vertically). The resulting 644x482 and 322x241 images are approximately 70% and 91% smaller than the originals, respectively.
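As a point of comparison, the reduction in pixel count alone can be worked out directly. The sketch below assumes a 1288x964 baseline, which is inferred from the two reduced sizes and is not stated in the source.

```python
def pixel_reduction(width, height, factor):
    """Return (new_width, new_height, fractional reduction in pixel count)
    when both dimensions are divided by the same integer factor."""
    new_w, new_h = width // factor, height // factor
    reduction = 1 - (new_w * new_h) / (width * height)
    return new_w, new_h, reduction


# Assumed baseline of 1288x964 (inferred from 644x482 and 322x241).
half = pixel_reduction(1288, 964, 2)     # 644 x 482, 75% fewer pixels
quarter = pixel_reduction(1288, 964, 4)  # 322 x 241, 93.75% fewer pixels
```

Under that assumption the pixel counts drop by 75% and 93.75%, somewhat more than the reported ~70% and ~91%. One possible explanation is that the reported figures refer to file size rather than pixel count.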
Training on the 644x482 images takes about 3 times as long as training on either of the other two resolutions. The network trained on the 644x482 images also appears to achieve better accuracy and IoU for “Car” segmentation than the other two.
2D Image Compression Quality
Ideally, a RAW image dataset would be used. Because the source data is available only in JPEG format, the original JPEG images serve as the Baseline and are recompressed at lower quality. The images above show image quality at different JPEG quality factors. Only quality factors 50 (Q50) and 25 (Q25) are used; the resulting files are 24% and 56% smaller than the original JPEG images, respectively.
Training on the compressed image datasets (Q50 and Q25) takes more than 13 hours. Surprisingly, both resulting networks are considerably more accurate than the network trained on the Baseline dataset. Explaining this will require further study of the underlying network architecture.
3D Lidar Data Resolution
The lidar resolution experiment uses 3 classes. Fixed-step downsampling reduces the data to 25% (R25) and 10% (R10) of the original resolution. Unlike with the 2D images, the file size and network training time decrease as expected: training takes ~25% of the Baseline duration for R25 and ~10% for R10. The performance of the resulting networks also drops along with the resolution.
One observation is that the network trained on the 25% resolution data has the worst “truck” segmentation performance of all the runs, even though its “car” accuracy and IoU are better than those of the network trained on the 10% resolution data. One possible reason is that the two runs used different training images. (Recall that the images are shuffled before partitioning.)
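The fixed-step downsampling used in this experiment can be illustrated with a minimal sketch. It treats the point cloud as a flat sequence of points, which is an assumption: the project works on organized point clouds, and the source does not say along which axis or axes the step was applied.

```python
def fixed_step_downsample(points, keep_fraction):
    """Keep every k-th point, where k = round(1 / keep_fraction).

    keep_fraction=0.25 keeps every 4th point (R25);
    keep_fraction=0.10 keeps every 10th point (R10).
    """
    step = round(1 / keep_fraction)
    return points[::step]
```

Because the step is fixed, the retained points are spread evenly over the scan, and the number of points falls in direct proportion to `keep_fraction`. This matches the roughly proportional drop in training time reported above.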
3D Lidar Data Channels
The image pool is increased to 800 for this experiment to improve the metrics. Each original lidar image contains 5 channels: X, Y, Z, Intensity, and Range. The baseline run uses full-resolution data with all 5 channels. Each of the remaining 3 runs removes the information from one channel: X, Intensity, or Range.
All 4 runs have similar training times of about 3 hours, which shows that training time is not very sensitive to the data channels. Network performance, however, varies a lot between runs. Explaining this will require a careful study of the underlying network architecture, which is beyond the scope of this experiment.
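The source does not state whether a removed channel was dropped from the input or zeroed out, so the sketch below shows both variants. The function names and the per-point tuple layout are assumptions made for illustration.

```python
CHANNELS = ("X", "Y", "Z", "Intensity", "Range")


def drop_channel(point, name):
    """Return the point without the named channel (4 values remain).
    This variant changes the input depth the network must accept."""
    i = CHANNELS.index(name)
    return point[:i] + point[i + 1:]


def zero_channel(point, name):
    """Return the point with the named channel set to 0.0 (5 values
    remain). This variant keeps the network input shape unchanged."""
    i = CHANNELS.index(name)
    return point[:i] + (0.0,) + point[i + 1:]
```

For example, `drop_channel((1.0, 2.0, 3.0, 0.5, 3.7), "X")` gives a 4-value point, while `zero_channel` with the same arguments keeps 5 values with X set to zero.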
2D Image Vs 3D Lidar Data (Resolution)
Comparing the 2D image and 3D lidar results from the resolution experiments, 3D lidar segmentation seems to be more sensitive to the volume and quality of the training data, while 2D image segmentation performance is less predictable from the resolution of the dataset alone. To make the comparison easier in the following chart, the lidar measurement is the average of the "Car" and "Truck" accuracy metrics.
Conclusions
The measurements from the experiments in this project are far from conclusive. They yield a few interesting observations, but an in-depth study of the underlying network architectures is required to explain the observed network performance.
Observations
- Macbook Pro is not an ideal machine for this task!
- With the same network architecture, longer training time and a higher volume of training data result in better segmentation performance.
- Network behavior varies with the attributes of the data for both 2D images and 3D lidar point clouds.
- The segmentation performance for different classes can vary. This may be due to differences in the training images. (Recall that the images are shuffled before partitioning into train/valid/test sets.)
- The point cloud segmentation network behaves differently from the 2D image segmentation network. Its performance is very sensitive to the volume and quality of the training data.
- The pixel imbalance between classes affects segmentation accuracy. Class weights may be applied to improve performance.
- A network trained on lower-resolution 2D images can achieve better performance.
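One way to derive the class weights mentioned in the observations above is a simplified form of median-frequency weighting, sketched below. The function name and the nested-list mask format are assumptions, and this weighting was not part of the experiments reported here.

```python
from collections import Counter
from statistics import median


def median_frequency_weights(label_masks):
    """Weight each class by median(class frequencies) / class frequency.

    label_masks: iterable of 2D lists of class ids. Frequencies are
    computed over all pixels of all masks, a simplification of the usual
    per-image formulation.
    """
    counts = Counter()
    for mask in label_masks:
        for row in mask:
            counts.update(row)
    total = sum(counts.values())
    freq = {c: n / total for c, n in counts.items()}
    med = median(freq.values())
    return {c: med / f for c, f in freq.items()}
```

Rare classes such as "car" receive weights above 1 and the dominant background class receives a weight below 1, so errors on rare classes contribute more to a weighted loss.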
Future Studies
- Why do JPEG-compressed 2D training images result in better network performance, and why do they make the training process harder to converge?
- How does the network respond to the artifacts that result from low-quality JPEG images?
- How reliable are the performance metrics?
- Metrics may be misleading. What are the better metrics we can use for network performance under different applications?
- For 2D image segmentation, why does DeepLab v3+ take longer to train and perform better on the 644x482 image dataset than on the higher-resolution images? Does spatial aliasing play a role here?
- How does Gaussian noise in 2D images affect network performance? Could this be related to the better performance of the networks trained on lower-quality images observed in this project?
- How do other networks behave with respect to training image quality?
Acknowledgement
I would like to thank Brian Wandell and David Cardinal for their guidance and suggestions, with special thanks to David for pointing me to the dataset used in all the experiments in this project.
Reference
- Suvash Sharma, Christopher Hudson, Daniel Carruth, Matt Doude, John E. Ball, Bo Tang, Chris Goodin, and Lalitha Dabbiru "Performance analysis of semantic segmentation algorithms trained with JPEG compressed datasets", Proc. SPIE 11401, Real-Time Image Processing and Deep Learning 2020, 1140104 (22 April 2020); https://doi-org.stanford.idm.oclc.org/10.1117/12.2557928
- Chen, Liang-Chieh et al. “Encoder-Decoder with Atrous Separable Convolution for Semantic Image Segmentation.” ECCV (2018).
- Brostow, G. J., J. Fauqueur, and R. Cipolla. "Semantic object classes in video: A high-definition ground truth database." Pattern Recognition Letters. Vol. 30, Issue 2, 2009, pp 88-97.
- Wu, Bichen, Xuanyu Zhou, Sicheng Zhao, Xiangyu Yue, and Kurt Keutzer. “SqueezeSegV2: Improved Model Structure and Unsupervised Domain Adaptation for Road-Object Segmentation from a LiDAR Point Cloud.” In 2019 International Conference on Robotics and Automation (ICRA), 4376–82. Montreal, QC, Canada: IEEE, 2019. https://doi.org/10.1109/ICRA.2019.8793495.
Appendix
Dataset Source
Code
- The code to reproduce all experiments in this project can be downloaded here: FinalProject.zip