Inferring object properties from human interaction and transferring them to new motions

Qian Zheng1, Weikai Wu1, Hanting Pan1, Niloy Mitra2, Daniel Cohen-Or3, Hui Huang1 (corresponding author)
1 Shenzhen University, Shenzhen, China
2 University College London, London, UK
3 Tel Aviv University, Tel Aviv, Israel

Abstract

Humans regularly interact with their surrounding objects. Such interactions often result in strongly correlated motions between humans and the interacting objects. We thus ask: "Is it possible to infer object properties from skeletal motion alone, even without seeing the interacting object itself?" In this paper, we present a fine-grained action recognition method that learns to infer such latent object properties from human interaction motion alone. This inference allows us to disentangle the motion from the object property and transfer object properties to a given motion. We collected a large number of videos and 3D skeletal motions of performing actors using an inertial motion capture device. We analyzed similar actions and learned the subtle differences between them to reveal latent properties of the interacting objects. In particular, we learned to identify the interacting object, estimate its weight, or determine its spillability. Our results clearly demonstrate that motions and interacting objects are highly correlated and that related object latent properties can be inferred from 3D skeleton sequences alone, leading to new synthesis possibilities for motions involving human interaction. Our dataset is available at http://vcc.szu.edu.cn/research/2020/IT.html.

Keywords: human interaction motion, object property inference, motion analysis, motion synthesis
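To make the inference task concrete, the following is a minimal illustrative sketch, not the authors' published architecture: assuming a PyTorch setting, a GRU encoder reads a sequence of flattened 3D joint coordinates and a linear head classifies a latent object property (here, a hypothetical light/medium/heavy weight class). The joint count, hidden size, and class count are placeholder assumptions.

import torch
import torch.nn as nn

class SkeletonPropertyClassifier(nn.Module):
    """Illustrative sketch: infer a latent object property from 3D skeleton motion."""
    def __init__(self, num_joints=25, hidden_size=128, num_classes=3):
        super().__init__()
        # Each frame is flattened to num_joints * 3 coordinates (x, y, z per joint).
        self.gru = nn.GRU(input_size=num_joints * 3,
                          hidden_size=hidden_size,
                          batch_first=True)
        self.head = nn.Linear(hidden_size, num_classes)

    def forward(self, skeleton_seq):
        # skeleton_seq: (batch, frames, num_joints * 3)
        _, last_hidden = self.gru(skeleton_seq)    # last_hidden: (1, batch, hidden_size)
        return self.head(last_hidden.squeeze(0))   # logits: (batch, num_classes)

# Example: classify a batch of four 60-frame skeleton sequences
# into hypothetical weight classes (light / medium / heavy).
model = SkeletonPropertyClassifier()
logits = model(torch.randn(4, 60, 25 * 3))
print(logits.shape)  # torch.Size([4, 3])

In the same spirit, the property transfer mentioned in the abstract can be pictured as swapping such a latent property code between two encoded motions before decoding a new motion; the actual method and training details are given in the full text.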

Electronic supplementary material

Video: 41095_2021_218_MOESM1_ESM.mp4

Publication history

Received: 22 January 2021
Accepted: 24 February 2021
Published: 19 April 2021
Issue date: September 2021

Copyright

© The Author(s) 2021

Acknowledgements

We sincerely thank the reviewers for their valuable comments. This work was supported in part by Shenzhen Innovation Program (JCYJ20180305125709986), National Natural Science Foundation of China (61861130365, 61761146002), GD Science and Technology Program (2020A0505100064, 2015A030312015), and DEGP Key Project (2018KZDXM058).

Rights and permissions

This article is licensed under a Creative Commons Attribution 4.0 International License, which permits use, sharing, adaptation, distribution and reproduction in any medium or format, as long as you give appropriate credit to the original author(s) and the source, provide a link to the Creative Commons licence, and indicate if changes were made.

The images or other third party material in this article are included in the article’s Creative Commons licence, unless indicated otherwise in a credit line to the material. If material is not included in the article’s Creative Commons licence and your intended use is not permitted by statutory regulation or exceeds the permitted use, you will need to obtain permission directly from the copyright holder.

To view a copy of this licence, visit http://creativecommons.org/licenses/by/4.0/.

Other papers from this open access journal are available free of charge from http://www.springer.com/journal/41095. To submit a manuscript, please go to https://www.editorialmanager.com/cvmj.
