


Detecting human-object interaction with multi-level pairwise feature network

Hanchao Liu1, Tai-Jiang Mu1(✉), Xiaolei Huang2
1 Key Laboratory of Pervasive Computing, Ministry of Education, BNRist, Department of Computer Science and Technology, Tsinghua University, Beijing 100084, China
2 College of Information Sciences and Technology, Pennsylvania State University, University Park, PA 16802, USA

Abstract

Human-object interaction (HOI) detection, which aims to infer ⟨human, action, object⟩ triplets within an image, is crucial for human-centric image understanding. Recent studies often exploit visual features and the spatial configuration of a human-object pair in order to learn the action linking the human and object in the pair. We argue that such a paradigm of pairwise feature extraction and action inference can be applied not only at the whole human and object instance level, but also at the part level, at which a body part interacts with an object, and at the semantic level, by considering the semantic label of an object along with human appearance and human-object spatial configuration, to infer the action. We thus propose a multi-level pairwise feature network (PFNet) for detecting human-object interactions. The network consists of three parallel streams to characterize HOI utilizing pairwise features at the above three levels; the three streams are finally fused to give the action prediction. Extensive experiments show that our proposed PFNet outperforms other state-of-the-art methods on the V-COCO dataset and achieves comparable results to the state-of-the-art on the HICO-DET dataset.

Keywords: deep learning, human-object interaction detection, pairwise feature network, multi-level, object instance
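To make the fusion idea concrete, the following is a minimal, illustrative sketch (PyTorch-style), not the authors' released PFNet code: three parallel streams each score actions from one kind of pairwise feature, and their logits are summed before a sigmoid, since a human-object pair can carry several actions at once. All class names (e.g., PairwiseStream, MultiLevelFusion), layer sizes, and feature dimensions below are assumptions made only for illustration.

# Illustrative sketch only: three parallel pairwise-feature streams fused into
# one multi-label action prediction. Stream internals, names, and dimensions
# are assumptions, not the paper's implementation.
import torch
import torch.nn as nn

class PairwiseStream(nn.Module):
    # Scores actions from one kind of pairwise feature
    # (instance, part, or semantic level).
    def __init__(self, in_dim, num_actions, hidden=512):
        super().__init__()
        self.mlp = nn.Sequential(
            nn.Linear(in_dim, hidden), nn.ReLU(),
            nn.Linear(hidden, num_actions),
        )

    def forward(self, pairwise_feat):
        return self.mlp(pairwise_feat)  # per-pair action logits

class MultiLevelFusion(nn.Module):
    # Runs the three streams in parallel and fuses them by summing their
    # logits, then applies a sigmoid for multi-label action prediction.
    def __init__(self, dims, num_actions):
        super().__init__()
        self.instance = PairwiseStream(dims["instance"], num_actions)
        self.part = PairwiseStream(dims["part"], num_actions)
        self.semantic = PairwiseStream(dims["semantic"], num_actions)

    def forward(self, feats):
        logits = (self.instance(feats["instance"])
                  + self.part(feats["part"])
                  + self.semantic(feats["semantic"]))
        return torch.sigmoid(logits)  # action probabilities per human-object pair

In a full system, the instance-level feature would encode human and object appearance together with their spatial configuration, the part-level feature would come from body-part regions paired with the object, and the semantic-level feature would combine an object-label embedding with human appearance and spatial layout, as outlined in the abstract above.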


Publication history

Received: 26 June 2020
Accepted: 20 July 2020
Published: 19 October 2020
Issue date: June 2021

Copyright

© The Author(s) 2020

Acknowledgements

We thank the reviewers for their constructive comments. This work was supported by the National Natural Science Foundation of China (Project No. 61902210), a Research Grant of Beijing Higher Institution Engineering Research Center, and the Tsinghua-Tencent Joint Laboratory for Internet Innovation Technology.

Rights and permissions

This article is licensed under a Creative Commons Attribution 4.0 International License, which permits use, sharing, adaptation, distribution and reproduction in any medium or format, as long as you give appropriate credit to the original author(s) and the source, provide a link to the Creative Commons licence, and indicate if changes were made.

The images or other third party material in this article are included in the article’s Creative Commons licence, unless indicated otherwise in a credit line to the material. If material is not included in the article’s Creative Commons licence and your intended use is not permitted by statutory regulation or exceeds the permitted use, you will need to obtain permission directly from the copyright holder.

To view a copy of this licence, visit http://creativecommons.org/licenses/by/4.0/.

Other papers from this open access journal are available free of charge from http://www.springer.com/journal/41095. To submit a manuscript, please go to https://www.editorialmanager.com/cvmj.
