Two-Stream Temporal Convolutional Networks for Skeleton-Based Human Action Recognition

Jin-Gong Jia; Yuan-Feng Zhou; Xing-Wei Hao; Feng Li; Christian Desrosiers; Cai-Ming Zhang

doi:10.1007/s11390-020-0405-6

Journal of Computer Science and Technology 2020, 35(3): 538-550 https://doi.org/10.1007/s11390-020-0405-6

Regular Paper | Published: 29 May 2020

Jin-Gong Jia¹, Yuan-Feng Zhou¹, Xing-Wei Hao¹, Feng Li¹, Christian Desrosiers², Cai-Ming Zhang¹

¹ School of Software, Shandong University, Jinan 250101, China

² Department of Software and IT Engineering, University of Quebec, Montreal H3C 3P8, Canada

Keywords:

neural network, skeleton, action recognition, temporal convolutional network (TCN), vector feature representation

Cite this article:

Jia J-G, Zhou Y-F, Hao X-W, et al. Two-Stream Temporal Convolutional Networks for Skeleton-Based Human Action Recognition. Journal of Computer Science and Technology, 2020, 35(3): 538-550. https://doi.org/10.1007/s11390-020-0405-6


Abstract

With the growing popularity of somatosensory interaction devices, human action recognition is becoming attractive in many application scenarios. Skeleton-based action recognition is effective because the skeleton represents the position and structure of the key points of the human body. In this paper, we use spatiotemporal vectors derived from skeleton sequences as the input feature representation of the network; this representation is more sensitive to changes of the human skeleton than representations based on distance and angle features. In addition, we redesign the residual blocks so that they use different strides at different depths of the network, which improves the ability of temporal convolutional networks (TCNs) to process actions with long-term temporal dependencies. We propose two-stream temporal convolutional networks (TS-TCNs) that take full advantage of both the inter-frame vector feature and the intra-frame vector feature of skeleton sequences in the spatiotemporal representations. The framework integrates these two feature representations so that each compensates for the other's shortcomings. A fusion loss function supervises the training parameters of the two branch networks. Experiments on public datasets show that our network achieves superior performance and attains an improvement of 1.2% over the recent GCN-based (BGC-LSTM) method on the NTU RGB+D dataset.
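The abstract does not define the two vector features precisely, so the following is only an illustrative sketch under stated assumptions: the inter-frame vector is taken to be the displacement of each joint between consecutive frames, and the intra-frame vector is taken to be the offset between pairs of joints within a single frame. The function names, the `bones` joint pairs, and the toy sequence are hypothetical and are not taken from the paper.

```python
from typing import List, Tuple

Joint = Tuple[float, ...]   # (x, y, z) coordinates of one joint
Frame = List[Joint]         # all joints of one skeleton frame


def inter_frame_vectors(seq: List[Frame]) -> List[List[Joint]]:
    """Assumed temporal feature: per-joint displacement between consecutive frames.

    A sequence of T frames yields T - 1 frames of vectors.
    """
    return [
        [tuple(b - a for a, b in zip(j_prev, j_next))
         for j_prev, j_next in zip(f_prev, f_next)]
        for f_prev, f_next in zip(seq, seq[1:])
    ]


def intra_frame_vectors(seq: List[Frame],
                        bones: List[Tuple[int, int]]) -> List[List[Joint]]:
    """Assumed spatial feature: vector from joint `src` to joint `dst` in each frame."""
    return [
        [tuple(d - s for s, d in zip(frame[src], frame[dst]))
         for src, dst in bones]
        for frame in seq
    ]


# Toy sequence: 3 frames, 3 joints each (made-up coordinates).
seq = [
    [(0.0, 0.0, 0.0), (0.0, 1.0, 0.0), (0.0, 2.0, 0.0)],
    [(1.0, 0.0, 0.0), (1.0, 1.0, 0.0), (1.0, 2.0, 1.0)],
    [(2.0, 0.0, 0.0), (2.0, 1.0, 0.0), (2.0, 2.0, 2.0)],
]
bones = [(0, 1), (1, 2)]  # hypothetical joint pairs, not a real skeleton layout

temporal = inter_frame_vectors(seq)        # would feed one stream
spatial = intra_frame_vectors(seq, bones)  # would feed the other stream
```

Under this reading, each stream of a two-stream network would receive one of the two feature sequences. How the streams are built and fused is described in the paper itself and is not reproduced here.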

Electronic supplementary material

File

jcst-35-3-538-Highlights.pdf (222.6 KB)

About this article

Publication history

Received: 29 February 2020

Revised: 05 April 2020

Published: 29 May 2020

Issue date: May 2020
