
Self-supervised coarse-to-fine monocular depth estimation using a lightweight attention module

Yuanzhen Li¹, Fei Luo¹ (✉), Chunxia Xiao¹ (✉)
¹School of Computer Science, Wuhan University, Wuhan 430072, China

Abstract

Self-supervised monocular depth estimation has been widely investigated and applied in previous works. However, existing methods suffer from texture copy, depth drift, and incomplete structures. It is difficult for standard CNNs to fully capture the relationship between an object and its surrounding environment, and it is hard to design a depth smoothness loss that balances smoothness against sharpness. To address these issues, we propose a coarse-to-fine method built on a normalized convolutional block attention module (NCBAM). In the coarse estimation stage, we incorporate the NCBAM into the depth and pose networks to overcome the texture-copy and depth-drift problems. In the refinement stage, a second network refines the coarse depth under the guidance of the color image, producing a structure-preserving depth map. Our method produces results competitive with state-of-the-art methods, and comprehensive experiments demonstrate the effectiveness of our two-stage method with the NCBAM.
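Since the full text is not reproduced on this page, the sketch below is only a rough, hypothetical illustration of what an NCBAM-style block could look like in PyTorch: it follows the standard CBAM design of Woo et al. (channel attention followed by spatial attention) with a normalization step added in front. The use of BatchNorm for the "normalized" part, and all module and parameter names, are assumptions made for illustration, not the authors' implementation.

import torch
import torch.nn as nn


class ChannelAttention(nn.Module):
    """Squeeze spatial dims, weight channels (standard CBAM channel branch)."""
    def __init__(self, channels, reduction=16):
        super().__init__()
        hidden = max(channels // reduction, 1)
        # Shared MLP applied to both the average- and max-pooled descriptors.
        self.mlp = nn.Sequential(
            nn.Conv2d(channels, hidden, kernel_size=1, bias=False),
            nn.ReLU(inplace=True),
            nn.Conv2d(hidden, channels, kernel_size=1, bias=False),
        )

    def forward(self, x):
        avg = self.mlp(torch.mean(x, dim=(2, 3), keepdim=True))
        mx = self.mlp(torch.amax(x, dim=(2, 3), keepdim=True))
        return torch.sigmoid(avg + mx)            # (N, C, 1, 1) channel weights


class SpatialAttention(nn.Module):
    """Squeeze channels, weight spatial positions (standard CBAM spatial branch)."""
    def __init__(self, kernel_size=7):
        super().__init__()
        self.conv = nn.Conv2d(2, 1, kernel_size, padding=kernel_size // 2, bias=False)

    def forward(self, x):
        avg = torch.mean(x, dim=1, keepdim=True)  # (N, 1, H, W)
        mx, _ = torch.max(x, dim=1, keepdim=True)
        return torch.sigmoid(self.conv(torch.cat([avg, mx], dim=1)))


class NCBAM(nn.Module):
    """CBAM preceded by a normalization step (assumed reading of 'normalized')."""
    def __init__(self, channels):
        super().__init__()
        self.norm = nn.BatchNorm2d(channels)      # assumption; the paper may normalize differently
        self.ca = ChannelAttention(channels)
        self.sa = SpatialAttention()

    def forward(self, x):
        x = self.norm(x)
        x = x * self.ca(x)                        # channel attention first, as in CBAM
        x = x * self.sa(x)                        # then spatial attention
        return x


if __name__ == "__main__":
    block = NCBAM(channels=64)
    feat = torch.randn(2, 64, 48, 160)            # e.g., an encoder feature map
    print(block(feat).shape)                      # attention preserves shape: (2, 64, 48, 160)

In the method the abstract describes, such a block would sit inside the encoders of the depth and pose networks; the exact placement and the precise normalization scheme should be taken from the paper itself.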

Keywords: monocular depth estimation, texture copy, depth drift, attention module


Publication history

Received: 07 January 2022
Accepted: 22 February 2022
Published: 16 June 2022
Issue date: December 2022

Copyright

© The Author(s) 2022.

Acknowledgements

This work was partially supported by the Key Technological Innovation Projects of Hubei Province (2018AAA062), the National Natural Science Foundation of China (61972298), and the Wuhan University-Huawei GeoInformatics Innovation Lab.

Rights and permissions

Open Access This article is licensed under a Creative Commons Attribution 4.0 International License, which permits use, sharing, adaptation, distribution and reproduction in any medium or format, as long as you give appropriate credit to the original author(s) and the source, provide a link to the Creative Commons licence, and indicate if changes were made.

The images or other third party material in this article are included in the article’s Creative Commons licence, unless indicated otherwise in a credit line to the material. If material is not included in the article’s Creative Commons licence and your intended use is not permitted by statutory regulation or exceeds the permitted use, you will need to obtain permission directly from the copyright holder.

To view a copy of this licence, visit http://creativecommons.org/licenses/by/4.0/.

Other papers from this open access journal are available free of charge from http://www.springer.com/journal/41095. To submit a manuscript, please go to https://www.editorialmanager.com/cvmj.
