A Survey of Human Action Recognition and Posture Prediction

Authors: Nan Ma, Zhixuan Wu, Yiu-ming Cheung, Yuchen Guo, Yue Gao, Jiahong Li, and Beijyan Jiang
Beijing Key Laboratory of Information Service Engineering, the College of Robotics, Beijing Union University, Beijing 100101, China
Department of Computer Science, Hong Kong Baptist University, Hong Kong 999077, China
Beijing National Research Center for Information Science and Technology, Tsinghua University, Beijing 100084, China
School of Software, Tsinghua University, Beijing 100084, China
College of Robotics, Beijing Union University, Beijing 100101, China

Abstract

Human action recognition and posture prediction aim to recognize and predict, respectively, the actions and postures of persons in videos. Both are active research topics in the computer vision community and have attracted considerable attention from academia and industry. They are also preconditions for intelligent interaction and human-computer cooperation, and they help machines perceive the external environment. In the past decade, tremendous progress has been made in this field, especially after the emergence of deep learning technologies, which makes a comprehensive review of recent developments necessary. In this paper, we first present the background and discuss research progress. We then introduce datasets and typical feature representation methods, and examine advanced human action recognition and posture prediction algorithms. Finally, in view of the challenges in the field, we put forward future research focuses and illustrate the importance of action recognition and posture prediction by taking interactive cognition in self-driving vehicles as an example.

Keywords: computer vision, human action recognition, posture prediction, human-computer cooperation, interactive cognition

Publication history

Received: 31 May 2021
Accepted: 06 September 2021
Published: 21 June 2022
Issue date: December 2022

Copyright

© The author(s) 2022.

Acknowledgements

The authors wish to thank Dian’en Zhang and Wenjuan Li from Beijing Union University, Beijing, China. We also thank the anonymous reviewers for their constructive suggestions. This work was supported by the National Natural Science Foundation of China (Nos. 61871038 and 61931012), the Premium Funding Project for Academic Human Resources Development of Beijing Union University (No. BPHR2020AZ02), and the Generic Pre-research Program of the Equipment Development Department of the Military Commission (No. 41412040302).

Rights and permissions

The articles published in this open access journal are distributed under the terms of the Creative Commons Attribution 4.0 International License (http://creativecommons.org/licenses/by/4.0/).
