
Saliency guided local and global descriptors for effective action recognition

Ashwan Abdulmunem 1,2, Yu-Kun Lai 1, Xianfang Sun 1
1. School of Computer Science and Informatics, Cardiff University, Cardiff, CF24 3AA, UK.
2. Department of Computer Science, School of Science, Kerbala University, Kerbala, Iraq.

Abstract

This paper presents a novel framework for human action recognition based on salient object detection and a new combination of local and global descriptors. We first detect salient objects in video frames and only extract features for such objects. We then use a simple strategy to identify and process only those video frames that contain salient objects. Processing salient objects instead of all frames not only makes the algorithm more efficient, but more importantly also suppresses the interference of background pixels. We combine this approach with local and global descriptors, namely 3D-SIFT and histograms of oriented optical flow (HOOF), respectively. The resulting saliency guided 3D-SIFT-HOOF (SGSH) feature is used along with a multi-class support vector machine (SVM) classifier for human action recognition. Experiments conducted on the standard KTH and UCF-Sports action benchmarks show that our new method outperforms competing state-of-the-art spatiotemporal feature-based human action recognition methods.
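The pipeline sketched in the abstract has three stages: salient-object detection that selects frames and masks out background pixels, a combined local (3D-SIFT) and global (HOOF) descriptor, and multi-class SVM classification. As a rough illustration of the HOOF stage only, the Python sketch below computes a magnitude-weighted orientation histogram of dense optical flow restricted to a saliency mask. It is not the authors' implementation: Farneback flow stands in for the variational flow used in the paper, the saliency mask is assumed to come from an external detector, and the bin count is an arbitrary choice.

import cv2
import numpy as np

def hoof_descriptor(prev_gray, curr_gray, saliency_mask, n_bins=32):
    # Histogram of oriented optical flow (HOOF) restricted to salient pixels.
    # A minimal sketch under the assumptions stated above, not the paper's code.
    flow = cv2.calcOpticalFlowFarneback(prev_gray, curr_gray, None,
                                        0.5, 3, 15, 3, 5, 1.2, 0)
    mag, ang = cv2.cartToPolar(flow[..., 0], flow[..., 1])  # angle in [0, 2*pi)
    sal = saliency_mask.astype(bool)
    if not sal.any():
        return None  # frame contains no salient object, so it is skipped
    hist, _ = np.histogram(ang[sal], bins=n_bins, range=(0, 2 * np.pi),
                           weights=mag[sal])
    return hist / (hist.sum() + 1e-8)  # L1-normalised descriptor

Skipping frames whose mask is empty mirrors the frame-selection strategy mentioned in the abstract, and restricting the histogram to the mask is what suppresses the influence of background pixels.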

Keywords: classification, action recognition, saliency detection, local and global descriptors, bag of visual words (BoVWs)
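The keywords also name a bag of visual words (BoVW) representation and classification. The hypothetical sketch below shows one conventional way to turn per-video local descriptors into BoVW histograms with a k-means codebook and train a multi-class SVM on them; the codebook size, kernel, and regularisation constant are illustrative assumptions rather than values taken from the paper.

import numpy as np
from sklearn.cluster import KMeans
from sklearn.svm import SVC

def build_codebook(train_descriptors, k=1000, seed=0):
    # Cluster every training descriptor into a k-word visual vocabulary.
    return KMeans(n_clusters=k, random_state=seed, n_init=10).fit(
        np.vstack(train_descriptors))

def bovw_histogram(descriptors, codebook):
    # Quantise one video's descriptors and return a normalised word histogram.
    words = codebook.predict(descriptors)
    hist = np.bincount(words, minlength=codebook.n_clusters).astype(float)
    return hist / (hist.sum() + 1e-8)

# Hypothetical usage: train_desc / test_desc are lists of per-video descriptor
# matrices (e.g. saliency-guided local features); y_train / y_test are labels.
# codebook = build_codebook(train_desc)
# X_train = np.array([bovw_histogram(d, codebook) for d in train_desc])
# clf = SVC(kernel="rbf", C=10.0).fit(X_train, y_train)  # libsvm-backed, one-vs-one
# X_test = np.array([bovw_histogram(d, codebook) for d in test_desc])
# accuracy = clf.score(X_test, y_test)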

Publication history

Revised: 01 December 2015
Accepted: 09 December 2015
Published: 29 January 2016
Issue date: March 2016

Copyright

© The Author(s) 2016

Acknowledgements

This research is funded by the Iraqi Ministry of Higher Education and Scientific Research (MHESR).

Rights and permissions

This article is published with open access at Springerlink.com

The articles published in this journal are distributed under the terms of the Creative Commons Attribution 4.0 International License (http://creativecommons.org/licenses/by/4.0/), which permits unrestricted use, distribution, and reproduction in any medium, provided you give appropriate credit to the original author(s) and the source, provide a link to the Creative Commons license, and indicate if changes were made.

Other papers from this open access journal are available free of charge from http://www.springer.com/journal/41095. To submit a manuscript, please go to https://www.editorialmanager.com/cvmj.
