Open Access

Lightweight Multiscale Spatio-Temporal Graph Convolutional Network for Skeleton-Based Action Recognition

Lab of Cloud Computing and Big Data Processing, School of Computer and Artificial Intelligence, Zhengzhou University, Zhengzhou 450001, China

Abstract

Using skeletal information to model and recognize human actions is currently a hot research topic in the field of Human Action Recognition (HAR). Graph Convolutional Networks (GCNs) have gained popularity in this field owing to their ability to process graph-structured data efficiently. However, current models struggle to handle the long-range dependencies that commonly exist between human skeleton joints, which hinders progress in related fields. To address these problems, the Lightweight Multiscale Spatio-Temporal Graph Convolutional Network (LMSTGCN) is proposed. First, a Lightweight Multiscale Spatial Graph Convolutional Network (LMSGCN) is constructed to capture information at multiple hierarchies; multiple inner connections between skeleton joints are captured by dividing the input features into a number of subsets along the channel dimension. Second, dilated convolution is incorporated into the temporal convolution to construct a Lightweight Multiscale Temporal Convolutional Network (LMTCN), which enlarges the receptive field while keeping the convolution kernel size unchanged. Third, a Spatio-Temporal Location Attention (STLAtt) module identifies the most informative joints in the skeleton sequence at specific frames, thereby improving the model's ability to extract features and recognize actions. Finally, a multi-stream data fusion input structure is used to enhance the input data and enrich the feature information. Experiments on three public datasets demonstrate the effectiveness of the proposed network.
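The paper's implementation is not reproduced on this page, so the sketch below is purely illustrative: it demonstrates only the two mechanisms the abstract describes, namely splitting input features into channel subsets for multiscale spatial aggregation and enlarging the temporal receptive field via dilated convolution. The class name, parameter names, the learnable adjacency matrix, and the exact wiring between subsets are all hypothetical placeholders, not the authors' LMSTGCN.

```python
# Minimal sketch, assuming a PyTorch setting; not the authors' LMSTGCN code.
import torch
import torch.nn as nn


class LMSTBlockSketch(nn.Module):
    """Hypothetical block illustrating channel-subset multiscale spatial
    convolution plus dilated temporal convolution on skeleton features."""

    def __init__(self, channels: int, num_joints: int, num_subsets: int = 4,
                 temporal_kernel: int = 5, dilation: int = 2):
        super().__init__()
        assert channels % num_subsets == 0
        self.num_subsets = num_subsets
        sub = channels // num_subsets
        # One small convolution per channel subset; each subset also receives
        # the previous subset's output, yielding a hierarchy of scales.
        self.spatial_convs = nn.ModuleList(
            nn.Conv2d(sub, sub, kernel_size=1) for _ in range(num_subsets - 1)
        )
        # A learnable adjacency matrix stands in for the skeleton graph.
        self.adj = nn.Parameter(torch.eye(num_joints))
        # Dilated temporal convolution: the kernel size stays fixed while the
        # receptive field grows to dilation * (temporal_kernel - 1) + 1 frames.
        pad = dilation * (temporal_kernel - 1) // 2
        self.temporal_conv = nn.Conv2d(
            channels, channels, kernel_size=(temporal_kernel, 1),
            padding=(pad, 0), dilation=(dilation, 1),
        )
        self.relu = nn.ReLU()

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # x: (batch, channels, frames, joints)
        x = torch.einsum("nctv,vw->nctw", x, self.adj)  # graph aggregation
        subsets = torch.chunk(x, self.num_subsets, dim=1)
        out, prev = [subsets[0]], subsets[0]
        for conv, s in zip(self.spatial_convs, subsets[1:]):
            prev = self.relu(conv(s + prev))  # pass features across subsets
            out.append(prev)
        x = torch.cat(out, dim=1)
        return self.relu(self.temporal_conv(x))


# Example: 2 skeleton sequences, 64 channels, 32 frames, 25 joints
# (25 joints matches the NTU RGB+D layout used in the paper's experiments).
y = LMSTBlockSketch(channels=64, num_joints=25)(torch.randn(2, 64, 32, 25))
print(y.shape)  # torch.Size([2, 64, 32, 25])
```

With temporal_kernel = 5 and dilation = 2, the temporal receptive field covers 9 frames while the layer keeps the parameter count of a 5-tap kernel, which is the trade-off the abstract attributes to the LMTCN.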

Big Data Mining and Analytics
Pages 310–325
Cite this article:
Zheng Z, Yuan Q, Zhang H, et al. Lightweight Multiscale Spatio-Temporal Graph Convolutional Network for Skeleton-Based Action Recognition. Big Data Mining and Analytics, 2025, 8(2): 310–325. https://doi.org/10.26599/BDMA.2024.9020095


Received: 03 May 2024
Revised: 20 November 2024
Accepted: 03 December 2024
Published: 28 January 2025
© The author(s) 2025.

The articles published in this open access journal are distributed under the terms of the Creative Commons Attribution 4.0 International License (http://creativecommons.org/licenses/by/4.0/).
