Open Access

Gesture Recognition with Focuses Using Hierarchical Body Part Combination

Faculty of Information Technology, Beijing University of Technology, Beijing 100124, China
Beijing Engineering Research Center for IOT Software and Systems, Beijing University of Technology, Beijing 100124, China
Faculty of Information Technology, Beijing University of Technology, Beijing 100124, China, and also with the Beijing Engineering Research Center for IOT Software and Systems, Beijing University of Technology, Beijing 100124, China

Abstract

Human gesture recognition is an important research field in human-computer interaction due to its potential applications in many areas, but existing methods still face challenges in achieving high accuracy. To address this issue, some existing studies propose fusing global features with cropped features, called focuses, on vital body parts such as the hands. However, most methods rely on experience when choosing the focuses, and the scheme of focus selection is rarely discussed in detail. In this paper, a hierarchical body part combination method is proposed that takes into account the number of body parts, their combinations, and the logical relationships between them. Multiple focuses are generated with this method, and the chart-based surface modality is employed alongside the red-green-blue (RGB) and optical flow modalities to enhance each focus. A feature-level fusion scheme based on the residual connection structure is proposed to fuse the different modalities at the convolution stages, and a focus fusion scheme is proposed to learn the relevancy of the focus channels for each gesture class individually. Experiments conducted on the ChaLearn isolated gesture dataset show that using multiple focuses in conjunction with multi-modal features and the proposed fusion strategies leads to higher gesture recognition accuracy.
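The two fusion schemes summarized above lend themselves to a compact illustration. Below is a minimal PyTorch-style sketch of how a residual feature-level modality fusion and a per-class focus fusion could look; it is an interpretation under stated assumptions, not the authors' implementation. The module names, the 1x1x1 projection convolution, the softmax normalization over focuses, and all tensor shapes are illustrative choices (the toy class count of 249 matches the ChaLearn IsoGD label set).

import torch
import torch.nn as nn

class ResidualModalityFusion(nn.Module):
    # Fuses an auxiliary modality (e.g., optical flow or surface charts)
    # into a primary stream at a convolution stage. The residual
    # connection preserves the primary features when the auxiliary
    # signal is uninformative. (Illustrative design, not the paper's.)
    def __init__(self, channels):
        super().__init__()
        self.proj = nn.Sequential(
            nn.Conv3d(channels, channels, kernel_size=1),
            nn.BatchNorm3d(channels),
            nn.ReLU(inplace=True),
        )

    def forward(self, primary, auxiliary):
        # Residual fusion: primary stream plus projected auxiliary features.
        return primary + self.proj(auxiliary)

class FocusFusion(nn.Module):
    # Combines per-focus class scores with one learnable relevancy
    # weight per (focus, class) pair, so each gesture class can weigh
    # the focus channels individually.
    def __init__(self, num_focuses, num_classes):
        super().__init__()
        self.relevancy = nn.Parameter(torch.ones(num_focuses, num_classes))

    def forward(self, focus_scores):
        # focus_scores: (batch, num_focuses, num_classes)
        weights = torch.softmax(self.relevancy, dim=0)  # normalize over focuses
        return (focus_scores * weights).sum(dim=1)      # (batch, num_classes)

# Toy usage with hypothetical shapes: (batch, channels, frames, height, width).
fusion = ResidualModalityFusion(channels=64)
rgb_feats = torch.randn(2, 64, 8, 28, 28)
flow_feats = torch.randn(2, 64, 8, 28, 28)
fused = fusion(rgb_feats, flow_feats)          # same shape as rgb_feats

focus_fusion = FocusFusion(num_focuses=4, num_classes=249)
scores = torch.randn(2, 4, 249)                # stacked per-focus logits
logits = focus_fusion(scores)                  # (2, 249)

The softmax over the focus axis keeps the per-class weights normalized; whether the paper normalizes its relevancy weights this way is not stated in the abstract, so this is one plausible choice.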

Tsinghua Science and Technology
Pages 1583-1599
Cite this article:
Zhang C, Hou Y, He J, et al. Gesture Recognition with Focuses Using Hierarchical Body Part Combination. Tsinghua Science and Technology, 2025, 30(4): 1583-1599. https://doi.org/10.26599/TST.2024.9010059


Received: 07 November 2023
Revised: 16 February 2024
Accepted: 20 March 2024
Published: 03 March 2025
© The Author(s) 2025.

The articles published in this open access journal are distributed under the terms of the Creative Commons Attribution 4.0 International License (http://creativecommons.org/licenses/by/4.0/).
