Inferring object properties from human interaction and transferring them to new motions

Qian Zheng1, Weikai Wu1, Hanting Pan1, Niloy Mitra2, Daniel Cohen-Or3, Hui Huang1 (corresponding author)
1 Shenzhen University, Shenzhen, China
2 University College London, London, UK
3 Tel Aviv University, Tel Aviv, Israel

Abstract

Humans regularly interact with their surrounding objects. Such interactions often result in strongly correlated motions between humans and the interacting objects. We thus ask: "Is it possible to infer object properties from skeletal motion alone, even without seeing the interacting object itself?" In this paper, we present a fine-grained action recognition method that learns to infer such latent object properties from human interaction motion alone. This inference allows us to disentangle the motion from the object property and transfer object properties to a given motion. We collected a large number of videos and 3D skeletal motions of performing actors using an inertial motion capture device. We analyzed similar actions and learned the subtle differences between them to reveal latent properties of the interacting objects. In particular, we learned to identify the interacting object, estimate its weight, or determine its spillability. Our results clearly demonstrate that motions and interacting objects are highly correlated and that related object latent properties can be inferred from 3D skeleton sequences alone, leading to new synthesis possibilities for motions involving human interaction. Our dataset is available at http://vcc.szu.edu.cn/research/2020/IT.html.

Keywords: human interaction motion, object property inference, motion analysis, motion synthesis
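To make the inference task concrete, the following is a minimal illustrative sketch, not the authors' published architecture: assuming a PyTorch setting, a GRU encoder reads a sequence of flattened 3D joint coordinates and a linear head classifies a latent object property (here, a hypothetical light/medium/heavy weight class). The joint count, hidden size, and class count are placeholder assumptions.

import torch
import torch.nn as nn

class SkeletonPropertyClassifier(nn.Module):
    """Illustrative sketch: infer a latent object property from 3D skeleton motion."""
    def __init__(self, num_joints=25, hidden_size=128, num_classes=3):
        super().__init__()
        # Each frame is flattened to num_joints * 3 coordinates (x, y, z per joint).
        self.gru = nn.GRU(input_size=num_joints * 3,
                          hidden_size=hidden_size,
                          batch_first=True)
        self.head = nn.Linear(hidden_size, num_classes)

    def forward(self, skeleton_seq):
        # skeleton_seq: (batch, frames, num_joints * 3)
        _, last_hidden = self.gru(skeleton_seq)    # last_hidden: (1, batch, hidden_size)
        return self.head(last_hidden.squeeze(0))   # logits: (batch, num_classes)

# Example: classify a batch of four 60-frame skeleton sequences
# into hypothetical weight classes (light / medium / heavy).
model = SkeletonPropertyClassifier()
logits = model(torch.randn(4, 60, 25 * 3))
print(logits.shape)  # torch.Size([4, 3])

In the same spirit, the property transfer mentioned in the abstract can be pictured as swapping such a latent property code between two encoded motions before decoding a new motion; the actual method and training details are given in the full text.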

Electronic supplementary material

Video: 41095_2021_218_MOESM1_ESM.mp4

Publication history

Received: 22 January 2021
Accepted: 24 February 2021
Published: 19 April 2021
Issue date: September 2021

Copyright

© The Author(s) 2021

Acknowledgements

We sincerely thank the reviewers for their valuable comments. This work was supported in part by Shenzhen Innovation Program (JCYJ20180305125709986), National Natural Science Foundation of China (61861130365, 61761146002), GD Science and Technology Program (2020A0505100064, 2015A030312015), and DEGP Key Project (2018KZDXM058).

Rights and permissions

This article is licensed under a Creative Commons Attribution 4.0 International License, which permits use, sharing, adaptation, distribution and reproduction in any medium or format, as long as you give appropriate credit to the original author(s) and the source, provide a link to the Creative Commons licence, and indicate if changes were made.

The images or other third party material in this article are included in the article’s Creative Commons licence, unless indicated otherwise in a credit line to the material. If material is not included in the article’s Creative Commons licence and your intended use is not permitted by statutory regulation or exceeds the permitted use, you will need to obtain permission directly from the copyright holder.

To view a copy of this licence, visit http://creativecommons.org/licenses/by/4.0/.

Other papers from this open access journal are available free of charge from http://www.springer.com/journal/41095. To submit a manuscript, please go to https://www.editorialmanager.com/cvmj.
