
Survey of Pedestrian Action Recognition Techniques for Autonomous Driving

Li Chen, Nan Ma, Patrick Wang, Jiahong Li, Pengfei Wang, Guilin Pang, and Xiaojun Shi
Beijing Key Laboratory of Information Service Engineering, College of Robotics, Beijing Union University, Beijing 100101, China.
Northeastern University, Boston, MA 02115, USA.
Communication and Information Center of Ministry of Emergency Management of the People’s Republic of China, Beijing 100013, China.
College of Robotics, Beijing Union University, Beijing 100101, China.

Abstract

The development of autonomous driving has brought with it requirements for intelligence, safety, and stability. One example of this is the need to construct effective forms of interactive cognition between pedestrians and vehicles in dynamic, complex, and uncertain environments. Pedestrian action detection is a form of interactive cognition that is fundamental to the success of autonomous driving technologies. Specifically, vehicles need to detect pedestrians, recognize their limb movements, and understand the meaning of their actions before making appropriate decisions in response. In this survey, we present a detailed description of the architecture for pedestrian action recognition in autonomous driving, and compare the mainstream pedestrian action recognition techniques. We also introduce several datasets commonly used in pedestrian action recognition. Finally, we present several suggestions for future research directions.

Keywords: autonomous driving, pedestrian action recognition, action datasets, two-stream network
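
For readers unfamiliar with the "two-stream network" listed in the keywords: it pairs a spatial stream, a CNN that classifies appearance from a single RGB frame, with a temporal stream, a CNN that classifies motion from a stack of optical-flow fields, and fuses the two streams' class scores (Simonyan and Zisserman, NIPS 2014). Below is a minimal PyTorch sketch of that idea, not code from the paper; the tiny StreamCNN backbone, the input sizes, and the five pedestrian action classes are illustrative assumptions.

```python
# Minimal two-stream sketch (illustrative only, not the paper's code).
import torch
import torch.nn as nn

class StreamCNN(nn.Module):
    """Tiny stand-in backbone; real two-stream models use deep CNNs."""
    def __init__(self, in_channels: int, num_classes: int):
        super().__init__()
        self.features = nn.Sequential(
            nn.Conv2d(in_channels, 32, kernel_size=3, padding=1),
            nn.ReLU(),
            nn.AdaptiveAvgPool2d(1),  # global average pooling
        )
        self.classifier = nn.Linear(32, num_classes)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        return self.classifier(self.features(x).flatten(1))

NUM_CLASSES = 5  # hypothetical labels, e.g., crossing, waiting, walking, running, standing

spatial = StreamCNN(in_channels=3, num_classes=NUM_CLASSES)    # one RGB frame
temporal = StreamCNN(in_channels=20, num_classes=NUM_CLASSES)  # 10 stacked (dx, dy) flow fields

rgb_frame = torch.randn(1, 3, 224, 224)    # appearance input
flow_stack = torch.randn(1, 20, 224, 224)  # motion input

# Late fusion: average the two streams' softmax scores.
scores = (spatial(rgb_frame).softmax(dim=1) +
          temporal(flow_stack).softmax(dim=1)) / 2
predicted_action = scores.argmax(dim=1)
```

In practice, the toy backbone above would be replaced with a deep network pretrained on ImageNet, and scores would be averaged over frames sampled across the whole video rather than computed from a single frame and flow stack.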


Publication history

Received: 08 March 2019
Revised: 26 April 2019
Accepted: 05 May 2019
Published: 13 January 2020
Issue date: August 2020

Copyright

© The author(s) 2020

Acknowledgements

We are grateful to the anonymous reviewers for their constructive suggestions. This study was partially funded by the National Natural Science Foundation of China (Nos. 61871038, 61803034, and 61672178), Beijing Natural Science Foundation (No. 4182022), and Beijing Union University Graduate Funding Project.

Rights and permissions

The articles published in this open access journal are distributed under the terms of the Creative Commons Attribution 4.0 International License (http://creativecommons.org/licenses/by/4.0/).
