


Detecting human-object interaction with multi-level pairwise feature network

Hanchao Liu1, Tai-Jiang Mu1(✉), Xiaolei Huang2
1 Key Laboratory of Pervasive Computing, Ministry of Education, BNRist, Department of Computer Science and Technology, Tsinghua University, Beijing 100084, China
2 College of Information Sciences and Technology, Pennsylvania State University, University Park, PA 16802, USA

Abstract

Human-object interaction (HOI) detection, which aims to infer ⟨human, action, object⟩ triplets within an image, is crucial for human-centric image understanding. Recent studies often exploit visual features and the spatial configuration of a human-object pair in order to learn the action linking the human and object in the pair. We argue that such a paradigm of pairwise feature extraction and action inference can be applied not only at the whole human and object instance level, but also at the part level, at which a body part interacts with an object, and at the semantic level, by considering the semantic label of an object along with human appearance and human-object spatial configuration, to infer the action. We thus propose a multi-level pairwise feature network (PFNet) for detecting human-object interactions. The network consists of three parallel streams to characterize HOI utilizing pairwise features at the above three levels; the three streams are finally fused to give the action prediction. Extensive experiments show that our proposed PFNet outperforms other state-of-the-art methods on the V-COCO dataset and achieves comparable results to the state-of-the-art on the HICO-DET dataset.

Keywords: deep learning, human-object interaction detection, pairwise feature network, multi-level, object instance
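To make the fusion idea concrete, the following is a minimal, illustrative sketch (PyTorch-style), not the authors' released PFNet code: three parallel streams each score actions from one kind of pairwise feature, and their logits are summed before a sigmoid, since a human-object pair can carry several actions at once. All class names (e.g., PairwiseStream, MultiLevelFusion), layer sizes, and feature dimensions below are assumptions made only for illustration.

# Illustrative sketch only: three parallel pairwise-feature streams fused into
# one multi-label action prediction. Stream internals, names, and dimensions
# are assumptions, not the paper's implementation.
import torch
import torch.nn as nn

class PairwiseStream(nn.Module):
    # Scores actions from one kind of pairwise feature
    # (instance, part, or semantic level).
    def __init__(self, in_dim, num_actions, hidden=512):
        super().__init__()
        self.mlp = nn.Sequential(
            nn.Linear(in_dim, hidden), nn.ReLU(),
            nn.Linear(hidden, num_actions),
        )

    def forward(self, pairwise_feat):
        return self.mlp(pairwise_feat)  # per-pair action logits

class MultiLevelFusion(nn.Module):
    # Runs the three streams in parallel and fuses them by summing their
    # logits, then applies a sigmoid for multi-label action prediction.
    def __init__(self, dims, num_actions):
        super().__init__()
        self.instance = PairwiseStream(dims["instance"], num_actions)
        self.part = PairwiseStream(dims["part"], num_actions)
        self.semantic = PairwiseStream(dims["semantic"], num_actions)

    def forward(self, feats):
        logits = (self.instance(feats["instance"])
                  + self.part(feats["part"])
                  + self.semantic(feats["semantic"]))
        return torch.sigmoid(logits)  # action probabilities per human-object pair

In a full system, the instance-level feature would encode human and object appearance together with their spatial configuration, the part-level feature would come from body-part regions paired with the object, and the semantic-level feature would combine an object-label embedding with human appearance and spatial layout, as outlined in the abstract above.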


Publication history

Received: 26 June 2020
Accepted: 20 July 2020
Published: 19 October 2020
Issue date: June 2021

Copyright

© The Author(s) 2020

Acknowledgements

We thank the reviewers for their constructive comments. This work was supported by the National Natural Science Foundation of China (Project No. 61902210), a Research Grant of Beijing Higher Institution Engineering Research Center, and the Tsinghua-Tencent Joint Laboratory for Internet Innovation Technology.

Rights and permissions

This article is licensed under a Creative Commons Attribution 4.0 International License, which permits use, sharing, adaptation, distribution and reproduction in any medium or format, as long as you give appropriate credit to the original author(s) and the source, provide a link to the Creative Commons licence, and indicate if changes were made.

The images or other third party material in this article are included in the article’s Creative Commons licence, unless indicated otherwise in a credit line to the material. If material is not included in the article’s Creative Commons licence and your intended use is not permitted by statutory regulation or exceeds the permitted use, you will need to obtain permission directly from the copyright holder.

To view a copy of this licence, visit http://creativecommons.org/licenses/by/4.0/.

Other papers from this open access journal are available free of charge from http://www.springer.com/journal/41095. To submit a manuscript, please go to https://www.editorialmanager.com/cvmj.
