
Survey of Pedestrian Action Recognition Techniques for Autonomous Driving

Li Chen, Nan Ma, Patrick Wang, Jiahong Li, Pengfei Wang, Guilin Pang, and Xiaojun Shi
Beijing Key Laboratory of Information Service Engineering, College of Robotics, Beijing Union University, Beijing 100101, China.
Northeastern University, Boston, MA 02115, USA.
Communication and Information Center of Ministry of Emergency Management of the People’s Republic of China, Beijing 100013, China.
College of Robotics, Beijing Union University, Beijing 100101, China.

Abstract

The development of autonomous driving has brought with it requirements for intelligence, safety, and stability. One example of this is the need to construct effective forms of interactive cognition between pedestrians and vehicles in dynamic, complex, and uncertain environments. Pedestrian action detection is a form of interactive cognition that is fundamental to the success of autonomous driving technologies. Specifically, vehicles need to detect pedestrians, recognize their limb movements, and understand the meaning of their actions before making appropriate decisions in response. In this survey, we present a detailed description of the architecture for pedestrian action recognition in autonomous driving, and compare the mainstream pedestrian action recognition techniques. We also introduce several datasets commonly used in pedestrian action recognition. Finally, we present several suggestions for future research directions.

Keywords: autonomous driving, pedestrian action recognition, action datasets, two-stream network
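
For readers unfamiliar with the "two-stream network" listed in the keywords: it pairs a spatial stream, a CNN that classifies appearance from a single RGB frame, with a temporal stream, a CNN that classifies motion from a stack of optical-flow fields, and fuses the two streams' class scores (Simonyan and Zisserman, NIPS 2014). Below is a minimal PyTorch sketch of that idea, not code from the paper; the tiny StreamCNN backbone, the input sizes, and the five pedestrian action classes are illustrative assumptions.

```python
# Minimal two-stream sketch (illustrative only, not the paper's code).
import torch
import torch.nn as nn

class StreamCNN(nn.Module):
    """Tiny stand-in backbone; real two-stream models use deep CNNs."""
    def __init__(self, in_channels: int, num_classes: int):
        super().__init__()
        self.features = nn.Sequential(
            nn.Conv2d(in_channels, 32, kernel_size=3, padding=1),
            nn.ReLU(),
            nn.AdaptiveAvgPool2d(1),  # global average pooling
        )
        self.classifier = nn.Linear(32, num_classes)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        return self.classifier(self.features(x).flatten(1))

NUM_CLASSES = 5  # hypothetical labels, e.g., crossing, waiting, walking, running, standing

spatial = StreamCNN(in_channels=3, num_classes=NUM_CLASSES)    # one RGB frame
temporal = StreamCNN(in_channels=20, num_classes=NUM_CLASSES)  # 10 stacked (dx, dy) flow fields

rgb_frame = torch.randn(1, 3, 224, 224)    # appearance input
flow_stack = torch.randn(1, 20, 224, 224)  # motion input

# Late fusion: average the two streams' softmax scores.
scores = (spatial(rgb_frame).softmax(dim=1) +
          temporal(flow_stack).softmax(dim=1)) / 2
predicted_action = scores.argmax(dim=1)
```

In practice, the toy backbone above would be replaced with a deep network pretrained on ImageNet, and scores would be averaged over frames sampled across the whole video rather than computed from a single frame and flow stack.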


Publication history

Received: 08 March 2019
Revised: 26 April 2019
Accepted: 05 May 2019
Published: 13 January 2020
Issue date: August 2020

Copyright

© The author(s) 2020

Acknowledgements

We are grateful to the anonymous reviewers for their constructive suggestions. This study was partially funded by the National Natural Science Foundation of China (Nos. 61871038, 61803034, and 61672178), Beijing Natural Science Foundation (No. 4182022), and Beijing Union University Graduate Funding Project.

Rights and permissions

The articles published in this open access journal are distributed under the terms of the Creative Commons Attribution 4.0 International License (http://creativecommons.org/licenses/by/4.0/).
