Volume 9, Issue 2


Global video object segmentation with spatial constraint module

Yadang Chen1, Duolin Wang1, Zhiguo Chen1, Zhi-Xin Yang2, Enhua Wu3,4
1 Engineering Research Center of Digital Forensics, Ministry of Education, School of Computer and Software, Nanjing University of Information Science and Technology, Nanjing 210044, China
2 State Key Laboratory of Internet of Things for Smart City, Department of Electromechanical Engineering, University of Macau, Macau 999078, China
3 State Key Laboratory of Computer Science, Institute of Software, University of Chinese Academy of Sciences, Beijing 100190, China
4 Faculty of Science and Technology, University of Macau, Macau 999078, China

Abstract

We present a lightweight and efficient semi-supervised video object segmentation network based on the space-time memory framework. Our method mitigates two difficulties in traditional video object segmentation: the long computation time per frame, and the limited use of information from past frames when segmenting the current frame. The algorithm uses a global context (GC) module to achieve high-performance, real-time segmentation. The GC module integrates information from multiple frames without increasing memory use, and processes each frame in real time. Moreover, because the predicted mask of the previous frame helps segment the current frame, we feed it into a spatial constraint module (SCM), which constrains the segmented regions in the current frame. The SCM alleviates mismatching between similar targets while consuming few additional resources. We also add a refinement module to the decoder to improve boundary segmentation. Our model achieves state-of-the-art results on various datasets, scoring 80.1% on YouTube-VOS 2018 and a 𝒥&ℱ score of 78.0% on DAVIS 2017, while taking 0.05 s per frame on the DAVIS 2016 validation set.
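
The abstract does not give the GC module's equations, so the following is only a minimal sketch of how a fixed-size summary of past frames could keep memory constant as a video grows. It assumes each frame yields per-pixel key and value features, compresses them into one small context matrix, and folds new frames in with a running average. The class name `GlobalContextMemory`, the softmax normalisation, and the averaging rule are illustrative assumptions, not the authors' implementation.

```python
import numpy as np


def softmax(x, axis):
    """Numerically stable softmax along the given axis."""
    shifted = x - x.max(axis=axis, keepdims=True)
    exp = np.exp(shifted)
    return exp / exp.sum(axis=axis, keepdims=True)


class GlobalContextMemory:
    """Hypothetical fixed-size memory: one (key_dim, value_dim) matrix.

    The matrix size does not depend on how many frames have been seen,
    which is the property the abstract attributes to the GC module.
    """

    def __init__(self, key_dim, value_dim):
        self.context = np.zeros((key_dim, value_dim))
        self.num_frames = 0

    def update(self, keys, values):
        """Fold one frame into the memory.

        keys:   (num_pixels, key_dim) features of a past frame
        values: (num_pixels, value_dim) features of the same frame
        """
        # Assumption: normalise each key channel over pixel positions.
        weights = softmax(keys, axis=0)
        frame_context = weights.T @ values  # (key_dim, value_dim)
        # Running average, so no per-frame state is retained.
        self.num_frames += 1
        self.context += (frame_context - self.context) / self.num_frames

    def read(self, queries):
        """Retrieve memory features for the current frame.

        queries: (num_pixels, key_dim) features of the current frame
        returns: (num_pixels, value_dim)
        """
        # Assumption: normalise each query over key channels.
        attention = softmax(queries, axis=1)
        return attention @ self.context


if __name__ == "__main__":
    rng = np.random.default_rng(0)
    memory = GlobalContextMemory(key_dim=4, value_dim=6)
    for _ in range(5):  # five past frames of 12 pixels each
        memory.update(rng.normal(size=(12, 4)), rng.normal(size=(12, 6)))
    print(memory.context.shape)  # size is unchanged by the frame count
    print(memory.read(rng.normal(size=(12, 4))).shape)
```

In this sketch, reading costs one small matrix product per frame regardless of video length, whereas a memory that stores every past frame's features grows with the number of frames.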

Keywords: video object segmentation, semantic segmentation, global context (GC) module, spatial constraint
Received: 31 December 2021 Accepted: 05 March 2022 Published: 03 January 2023 Issue date: June 2023

Copyright

© The Author(s) 2022.

Acknowledgements

This work was partially supported by the National Natural Science Foundation of China (Grant Nos. 61802197, 62072449, and 61632003), the Science and Technology Development Fund, Macau SAR (Grant Nos. 0018/2019/AKP and SKL-IOTSC(UM)-2021-2023), the Guangdong Science and Technology Department (Grant No. 2020B1515130001), and University of Macau (Grant Nos. MYRG2020-00253-FST and MYRG2022-00059-FST).

Rights and permissions

Open Access This article is licensed under a Creative Commons Attribution 4.0 International License, which permits use, sharing, adaptation, distribution and reproduction in any medium or format, as long as you give appropriate credit to the original author(s) and the source, provide a link to the Creative Commons licence, and indicate if changes were made.

The images or other third party material in this article are included in the article’s Creative Commons licence, unless indicated otherwise in a credit line to the material. If material is not included in the article’s Creative Commons licence and your intended use is not permitted by statutory regulation or exceeds the permitted use, you will need to obtain permission directly from the copyright holder.

To view a copy of this licence, visit http://creativecommons.org/licenses/by/4.0/.

Other papers from this open access journal are available free of charge from http://www.springer.com/journal/41095. To submit a manuscript, please go to https://www.editorialmanager.com/cvmj.
