Volume 24, Issue 6




Deep Learning Based 2D Human Pose Estimation: A Survey

Qi Dang, Jianqin Yin*, Bin Wang, and Wenqing Zheng
Automation School, Beijing University of Posts and Telecommunications, Beijing 100876, China.
State Key Lab. of Intelligent Technology and Systems, Tsinghua University, Beijing 100084, China.
School of Information and Communication Engineering, Beijing University of Posts and Telecommunications, Beijing 100876, China.

Abstract

Human pose estimation has received significant attention recently due to its wide range of real-world applications. Because deep learning has substantially improved the performance of state-of-the-art human pose estimation methods, this paper presents a comprehensive survey of deep learning based human pose estimation methods and analyzes the methodologies they employ. We summarize and discuss recent works using a methodology-based taxonomy. Single-person and multi-person pipelines are first reviewed separately; the deep learning techniques applied in these pipelines are then compared and analyzed. The datasets and metrics used in this task are also discussed and compared. The aim of this survey is to make every step in the estimation pipelines interpretable and to provide readers with a readily comprehensible explanation. Moreover, the unsolved problems and challenges for future research are discussed.

Keywords: computer vision, deep learning, human pose estimation


Publication history

Received: 05 March 2018
Revised: 30 April 2018
Accepted: 03 May 2018
Published: 05 December 2019
Issue date: December 2019

Copyright

© The author(s) 2019

Acknowledgements

This work was supported by the National Natural Science Foundation of China (Nos. 61673192, 61573219, and 61472163), the Fund for Outstanding Youth of Shandong Provincial High School (No. ZR2016JL023), the National High-Tech Research and Development Plan (No. 2015AA042306), and the National Social Science Fund Project (No. 13CTQ010).
