
Fusing Geometrical and Visual Information via Superpoints for the Semantic Segmentation of 3D Road Scenes

Authors: Liuyuan Deng, Ming Yang (corresponding author), Zhidong Liang, Yuesheng He, Chunxiang Wang
Department of Automation, Shanghai Jiao Tong University, Shanghai 200240, China.
Key Laboratory of System Control and Information Processing, Ministry of Education of China, Shanghai 200240, China.
Research Institute of Robotics, Shanghai Jiao Tong University, Shanghai 200240, China.

Abstract

This paper addresses the semantic segmentation of large-scale 3D road scenes by exploiting the complementary advantages of point clouds and images. To make full use of geometrical and visual information, it extracts 3D geometric features from a point cloud using a deep neural network for 3D semantic segmentation, and extracts 2D visual features from images using a Convolutional Neural Network (CNN) for 2D semantic segmentation. To bridge the two modalities, the paper uses superpoints as an intermediate representation that connects the 2D features with the 3D features, and proposes a superpoint-based pooling method to fuse the features of the two modalities for joint learning. To evaluate the approach, 3D scenes are generated from the Virtual KITTI dataset. The experimental results demonstrate that the proposed approach segments large-scale 3D road scenes based on compact and semantically homogeneous superpoints, and that it achieves considerable improvements over 2D image-based and 3D point-cloud-based semantic segmentation methods.
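
As a reading aid, the following is a minimal sketch (not the authors' released code) of the kind of superpoint-based pooling described above: per-point 3D geometric features and per-point 2D visual features, the latter obtained by projecting each point into the image and sampling the CNN feature map, are concatenated and pooled within each superpoint to form one fused descriptor for joint learning. The function name, the choice of average pooling, and all shapes are illustrative assumptions.

```python
# Minimal sketch of superpoint-based feature pooling (assumed interface, not the paper's code).
import torch

def superpoint_pool(feat_3d, feat_2d, sp_index, num_superpoints):
    """Fuse per-point features into per-superpoint descriptors.

    feat_3d:         (N, C3) geometric features from the point-cloud network
    feat_2d:         (N, C2) visual features sampled from the image CNN after
                     projecting each 3D point into the camera image
    sp_index:        (N,) superpoint id of each point, in [0, num_superpoints)
    num_superpoints: number of superpoints in the scene
    returns:         (num_superpoints, C3 + C2) fused descriptors
    """
    fused = torch.cat([feat_3d, feat_2d], dim=1)  # (N, C3 + C2)

    # Average-pool the fused features over the points of each superpoint.
    sums = torch.zeros(num_superpoints, fused.size(1), device=fused.device)
    counts = torch.zeros(num_superpoints, device=fused.device)
    sums.index_add_(0, sp_index, fused)
    counts.index_add_(0, sp_index, torch.ones_like(sp_index, dtype=fused.dtype))
    return sums / counts.clamp(min=1).unsqueeze(1)

if __name__ == "__main__":
    # Toy usage: 1000 points, 64-d geometric and 48-d visual features, 20 superpoints.
    N, C3, C2, S = 1000, 64, 48, 20
    pooled = superpoint_pool(torch.randn(N, C3), torch.randn(N, C2),
                             torch.randint(0, S, (N,)), S)
    print(pooled.shape)  # torch.Size([20, 112])
```

The per-superpoint descriptors produced this way can then be fed to a joint classifier so that both geometric and visual cues contribute to each superpoint's label.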

Keywords: deep learning, scene understanding, point cloud semantic segmentation, multi-modal information fusion


Publication history

Received: 01 April 2019
Revised: 23 July 2019
Accepted: 29 July 2019
Published: 13 January 2020
Issue date: August 2020

Copyright

© The author(s) 2020

Acknowledgements

This work was supported by the National Natural Science Foundation of China (No. U1764264/61873165), Shanghai Automotive Industry Science and Technology Development Foundation (No. 1807), and the International Chair on Automated Driving of Ground Vehicle.

Rights and permissions

The articles published in this open access journal are distributed under the terms of the Creative Commons Attribution 4.0 International License (http://creativecommons.org/licenses/by/4.0/).
