
Fusing Geometrical and Visual Information via Superpoints for the Semantic Segmentation of 3D Road Scenes

Authors: Liuyuan Deng, Ming Yang (corresponding author), Zhidong Liang, Yuesheng He, Chunxiang Wang
Department of Automation, Shanghai Jiao Tong University, Shanghai 200240, China.
Key Laboratory of System Control and Information Processing, Ministry of Education of China, Shanghai 200240, China.
Research Institute of Robotics, Shanghai Jiao Tong University, Shanghai 200240, China.

Abstract

This paper addresses the semantic segmentation of large-scale 3D road scenes by exploiting the complementary advantages of point clouds and images. To make full use of geometrical and visual information, it extracts 3D geometric features from a point cloud using a deep neural network for 3D semantic segmentation, and extracts 2D visual features from images using a Convolutional Neural Network (CNN) for 2D semantic segmentation. To bridge the two modalities, the paper uses superpoints as an intermediate representation that connects the 2D features with the 3D features, and proposes a superpoint-based pooling method to fuse the features of the two modalities for joint learning. To evaluate the approach, 3D scenes are generated from the Virtual KITTI dataset. The experimental results demonstrate that the proposed approach segments large-scale 3D road scenes based on compact and semantically homogeneous superpoints, and that it achieves considerable improvements over 2D image-based and 3D point-cloud-based semantic segmentation methods.
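
As a reading aid, the following is a minimal sketch (not the authors' released code) of the kind of superpoint-based pooling described above: per-point 3D geometric features and per-point 2D visual features, the latter obtained by projecting each point into the image and sampling the CNN feature map, are concatenated and pooled within each superpoint to form one fused descriptor for joint learning. The function name, the choice of average pooling, and all shapes are illustrative assumptions.

```python
# Minimal sketch of superpoint-based feature pooling (assumed interface, not the paper's code).
import torch

def superpoint_pool(feat_3d, feat_2d, sp_index, num_superpoints):
    """Fuse per-point features into per-superpoint descriptors.

    feat_3d:         (N, C3) geometric features from the point-cloud network
    feat_2d:         (N, C2) visual features sampled from the image CNN after
                     projecting each 3D point into the camera image
    sp_index:        (N,) superpoint id of each point, in [0, num_superpoints)
    num_superpoints: number of superpoints in the scene
    returns:         (num_superpoints, C3 + C2) fused descriptors
    """
    fused = torch.cat([feat_3d, feat_2d], dim=1)  # (N, C3 + C2)

    # Average-pool the fused features over the points of each superpoint.
    sums = torch.zeros(num_superpoints, fused.size(1), device=fused.device)
    counts = torch.zeros(num_superpoints, device=fused.device)
    sums.index_add_(0, sp_index, fused)
    counts.index_add_(0, sp_index, torch.ones_like(sp_index, dtype=fused.dtype))
    return sums / counts.clamp(min=1).unsqueeze(1)

if __name__ == "__main__":
    # Toy usage: 1000 points, 64-d geometric and 48-d visual features, 20 superpoints.
    N, C3, C2, S = 1000, 64, 48, 20
    pooled = superpoint_pool(torch.randn(N, C3), torch.randn(N, C2),
                             torch.randint(0, S, (N,)), S)
    print(pooled.shape)  # torch.Size([20, 112])
```

The per-superpoint descriptors produced this way can then be fed to a joint classifier so that both geometric and visual cues contribute to each superpoint's label.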

Keywords: deep learning, scene understanding, point cloud semantic segmentation, multi-modal information fusion


Publication history

Received: 01 April 2019
Revised: 23 July 2019
Accepted: 29 July 2019
Published: 13 January 2020
Issue date: August 2020

Copyright

© The author(s) 2020

Acknowledgements

This work was supported by the National Natural Science Foundation of China (No. U1764264/61873165), Shanghai Automotive Industry Science and Technology Development Foundation (No. 1807), and the International Chair on Automated Driving of Ground Vehicle.

Rights and permissions

The articles published in this open access journal are distributed under the terms of the Creative Commons Attribution 4.0 International License (http://creativecommons.org/licenses/by/4.0/).
