Volume 9, Issue 1




Imposing temporal consistency on deep monocular body shape and pose estimation

Alexandra Zimmer1,2, Anna Hilsmann1, Wieland Morgenstern1, Peter Eisert1,3
1 Fraunhofer Heinrich-Hertz-Institut, 10587 Berlin, Germany
2 Technische Universität Berlin, 10623 Berlin, Germany
3 Humboldt Universität zu Berlin, 10117 Berlin, Germany

Abstract

Accurate and temporally consistent modeling of human bodies is essential for a wide range of applications, including character animation, understanding human social behavior, and AR/VR interfaces. Capturing human motion accurately from a monocular image sequence remains challenging, and modeling quality is strongly influenced by the temporal consistency of the captured body motion. Our work presents an elegant solution for integrating temporal constraints during fitting, which increases both temporal consistency and robustness during optimization. In detail, we derive the parameters of a sequence of body models representing the shape and motion of a person. We optimize these parameters over the complete image sequence, fitting a single consistent body shape while imposing temporal consistency on the body motion, assuming body joint trajectories to be linear over short time intervals. Our approach enables the derivation of realistic 3D body models from image sequences, including jaw pose, facial expression, and articulated hands. Our experiments show that our approach accurately estimates body shape and motion, even for challenging movements and poses. Further, we apply it to the particular application of sign language analysis, where accurate and temporally consistent motion modeling is essential, and show that the approach is well suited to this kind of application.
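The assumption that body joint trajectories are locally linear can be expressed as a penalty on the discrete second temporal derivative (acceleration) of the joint positions: a perfectly linear trajectory has zero second difference. The sketch below illustrates this idea only; the function name and array layout are illustrative assumptions, not the authors' implementation.

```python
import numpy as np

def temporal_consistency_loss(joints):
    """Penalize deviation from locally linear joint trajectories.

    joints: array of shape (T, J, 3) holding 3D positions of J body
    joints over T frames. For a locally linear trajectory, the second
    temporal difference x[t+1] - 2*x[t] + x[t-1] (discrete
    acceleration) vanishes, so its squared norm is a natural
    temporal-consistency penalty.
    """
    accel = joints[2:] - 2.0 * joints[1:-1] + joints[:-2]  # (T-2, J, 3)
    return float(np.sum(accel ** 2))
```

In a sequence-fitting setting, a term like this would be added to the per-frame data terms and minimized jointly over all frames, so that the optimizer trades off image evidence against smooth, near-linear joint motion over short windows.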

Keywords: motion capture, surface reconstruction, face modeling, body model estimation

Video
41095_0272_ESM.mp4

Publication history

Received: 06 October 2021
Accepted: 20 January 2022
Published: 18 October 2022
Issue date: March 2023

Copyright

© The Author(s) 2022.

Acknowledgements

This work was partly funded by the European Union’s Horizon 2020 Research and Innovation Programme under Agreement No. 952147 (Invictus) as well as the German Federal Ministry of Education and Research (BMBF) through the Research Program MoDL under Contract No. 01 IS 20044.

Rights and permissions

Open Access This article is licensed under a Creative Commons Attribution 4.0 International License, which permits use, sharing, adaptation, distribution and reproduction in any medium or format, as long as you give appropriate credit to the original author(s) and the source, provide a link to the Creative Commons licence, and indicate if changes were made.

The images or other third party material in this article are included in the article’s Creative Commons licence, unless indicated otherwise in a credit line to the material. If material is not included in the article’s Creative Commons licence and your intended use is not permitted by statutory regulation or exceeds the permitted use, you will need to obtain permission directly from the copyright holder.

To view a copy of this licence, visit http://creativecommons.org/licenses/by/4.0/.

Other papers from this open access journal are available free of charge from http://www.springer.com/journal/41095. To submit a manuscript, please go to https://www.editorialmanager.com/cvmj.
