Volume 24, Issue 6




Deep Learning Based 2D Human Pose Estimation: A Survey

Qi Dang, Jianqin Yin*, Bin Wang, and Wenqing Zheng
Automation School, Beijing University of Posts and Telecommunications, Beijing 100876, China.
State Key Lab. of Intelligent Technology and Systems, Tsinghua University, Beijing 100084, China.
School of Information and Communication Engineering, Beijing University of Posts and Telecommunications, Beijing 100876, China.

Abstract

Human pose estimation has received significant attention recently due to its wide range of real-world applications. Because deep learning has substantially improved the performance of state-of-the-art human pose estimation methods, this paper presents a comprehensive survey of deep learning based human pose estimation methods and analyzes the methodologies they employ. We summarize and discuss recent works using a methodology-based taxonomy. Single-person and multi-person pipelines are first reviewed separately; the deep learning techniques applied in these pipelines are then compared and analyzed. The datasets and metrics used in this task are also discussed and compared. The aim of this survey is to make every step in the estimation pipelines interpretable and to provide readers with a readily comprehensible explanation. Moreover, the unsolved problems and challenges for future research are discussed.

Keywords: computer vision, deep learning, human pose estimation


Publication history

Received: 05 March 2018
Revised: 30 April 2018
Accepted: 03 May 2018
Published: 05 December 2019
Issue date: December 2019

Copyright

© The author(s) 2019

Acknowledgements

This work was supported by the National Natural Science Foundation of China (Nos. 61673192, 61573219, and 61472163), the Fund for Outstanding Youth of Shandong Provincial High School (No. ZR2016JL023), the National High-Tech Research and Development Plan (No. 2015AA042306), and the National Social Science Fund Project (No. 13CTQ010).
