
Learning accurate template matching with differentiable coarse-to-fine correspondence refinement

Zhirui Gao, Renjiao Yi, Zheng Qin, Yunfan Ye, Chenyang Zhu, Kai Xu (corresponding author)
College of Computer, National University of Defense Technology, Changsha 410073, China

Abstract

Template matching is a fundamental task in computer vision and has been studied for decades. It plays an essential role in the manufacturing industry for estimating the poses of different parts, facilitating downstream tasks such as robotic grasping. Existing methods fail when the template and source images have different modalities, cluttered backgrounds, or weak textures. They also rarely consider geometric transformations via homographies, which commonly exist even for planar industrial parts. To tackle these challenges, we propose an accurate template matching method based on differentiable coarse-to-fine correspondence refinement. We use an edge-aware module to overcome the domain gap between the mask template and the grayscale image, allowing robust matching. An initial warp is estimated from coarse correspondences based on novel structure-aware information provided by transformers. This initial alignment is passed to a refinement network that uses the reference and aligned images to obtain sub-pixel correspondences, from which the final geometric transformation is computed. Extensive evaluation shows our method to be significantly better than state-of-the-art methods and baselines, providing good generalization ability and visually plausible results even on unseen real data.

Keywords: transformers, template matching, differentiable homography, structure-awareness
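
The abstract describes a coarse-to-fine pipeline whose final step fits a homography to refined correspondences in a differentiable way. As an illustration of that general principle only (the paper's own architecture and code are not reproduced here), the sketch below estimates a homography from weighted correspondences with a direct linear transform (DLT) solved by SVD in PyTorch; because every step is differentiable, gradients can flow back to whichever network predicted the correspondences and their confidences. The function name weighted_dlt_homography, the toy points, and the uniform weights are illustrative assumptions, not part of the paper.

```python
# Illustrative sketch only: differentiable weighted DLT for homography estimation.
import torch

def weighted_dlt_homography(src: torch.Tensor, dst: torch.Tensor,
                            w: torch.Tensor) -> torch.Tensor:
    """src, dst: (N, 2) matched points (N >= 4); w: (N,) confidences. Returns 3x3 H."""
    x, y = src[:, 0], src[:, 1]
    u, v = dst[:, 0], dst[:, 1]
    zeros, ones = torch.zeros_like(x), torch.ones_like(x)
    # Each correspondence gives two linear equations A @ h = 0 in the flattened H.
    ax = torch.stack([-x, -y, -ones, zeros, zeros, zeros, u * x, u * y, u], dim=1)
    ay = torch.stack([zeros, zeros, zeros, -x, -y, -ones, v * x, v * y, v], dim=1)
    A = torch.cat([ax, ay], dim=0)                   # (2N, 9)
    W = torch.cat([w, w], dim=0).unsqueeze(1)        # down-weight unreliable matches
    # h is the right singular vector with the smallest singular value;
    # torch.linalg.svd is differentiable, so gradients reach src, dst, and w.
    _, _, Vh = torch.linalg.svd(W * A)
    H = Vh[-1].reshape(3, 3)
    return H / H[2, 2]                               # remove the scale ambiguity

# Toy check: recover a known homography from four exact correspondences.
H_gt = torch.tensor([[1.1, 0.02, 5.0],
                     [0.01, 0.95, -3.0],
                     [1e-4, 2e-4, 1.0]], dtype=torch.float64)
src = torch.tensor([[0., 0.], [100., 0.], [100., 100.], [0., 100.]],
                   dtype=torch.float64)
dst_h = (H_gt @ torch.cat([src, torch.ones(4, 1, dtype=torch.float64)], dim=1).T).T
dst = dst_h[:, :2] / dst_h[:, 2:3]
H = weighted_dlt_homography(src, dst, torch.ones(4, dtype=torch.float64))
print(torch.allclose(H, H_gt, atol=1e-6))            # True
```

A practical version would additionally normalize the point coordinates (Hartley normalization) for numerical stability and batch the solve; the point is simply that the homography estimate remains differentiable with respect to the predicted correspondences and weights.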


Publication history

Received: 17 October 2022
Accepted: 02 January 2023
Published: 03 January 2024
Issue date: April 2024

Copyright

© The Author(s) 2023.

Acknowledgements

We thank Lintao Zheng and Jun Li for their help with dataset preparation and discussions.

Rights and permissions

Open Access This article is licensed under a Creative Commons Attribution 4.0 International License, which permits use, sharing, adaptation, distribution and reproduction in any medium or format, as long as you give appropriate credit to the original author(s) and the source, provide a link to the Creative Commons licence, and indicate if changes were made.

The images or other third party material in this article are included in the article’s Creative Commons licence, unless indicated otherwise in a credit line to the material. If material is not included in the article’s Creative Commons licence and your intended use is not permitted by statutory regulation or exceeds the permitted use, you will need to obtain permission directly from the copyright holder.

To view a copy of this licence, visit http://creativecommons.org/licenses/by/4.0/.

