Self-Aligning Multi-Modal Transformer for Oropharyngeal Swab Point Localization

Tianyu Liu1, Fuchun Sun1
1 Department of Computer Science and Technology, Tsinghua University, Beijing 100083, China

Abstract

Oropharyngeal swabbing is a pre-diagnostic procedure used to test for various respiratory diseases, including COVID-19 and influenza A (H1N1). To improve testing efficiency, robots need a real-time, accurate, and robust sampling point localization algorithm. However, current solutions rely heavily on visual input, which is not reliable enough for large-scale deployment. The transformer has significantly improved the performance of image-related tasks and challenged the dominance of traditional convolutional neural networks (CNNs) in computer vision. Inspired by its success, we propose a novel self-aligning multi-modal transformer (SAMMT) that dynamically attends to different parts of unaligned feature maps, preventing the information loss caused by perspective disparity and simplifying the overall implementation. Unlike preexisting multi-modal transformers, our attention mechanism works in image space instead of embedding space, rendering the sensor registration process unnecessary. To facilitate the multi-modal task, we collected an oropharynx localization/segmentation dataset annotated by trained medical personnel. This dataset is open-sourced and can be used for future multi-modal research. Our experiments show that our model improves the performance of the localization task by 4.2% compared with the pure visual model and reduces the pixel-wise error rate of the segmentation task by 16.7% compared with the CNN baseline.

Keywords: segmentation, localization, transformer, multi-modal perception, robotic perception
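The abstract's central architectural claim is that attention is computed in image space, so that each location of one sensor's feature map can attend over all locations of the other, unregistered sensor's feature map. The PyTorch sketch below illustrates that general idea with standard cross-attention; it is only a minimal illustration under our own assumptions (the CrossModalImageAttention name, layer sizes, and RGB plus depth as the two modalities are ours), not the authors' SAMMT implementation.

import torch
import torch.nn as nn

class CrossModalImageAttention(nn.Module):
    """Cross-attention between two unaligned feature maps (illustrative sketch only)."""
    def __init__(self, rgb_channels, depth_channels, dim=256, heads=8):
        super().__init__()
        self.q_proj = nn.Conv2d(rgb_channels, dim, 1)    # queries from the RGB map
        self.k_proj = nn.Conv2d(depth_channels, dim, 1)  # keys from the depth map
        self.v_proj = nn.Conv2d(depth_channels, dim, 1)  # values from the depth map
        self.attn = nn.MultiheadAttention(dim, heads, batch_first=True)
        self.out = nn.Conv2d(dim, rgb_channels, 1)

    def forward(self, rgb_feat, depth_feat):
        # rgb_feat: (B, C_rgb, H, W); depth_feat: (B, C_d, H', W').
        # The two maps may differ in resolution and need not be spatially registered.
        b, _, h, w = rgb_feat.shape
        q = self.q_proj(rgb_feat).flatten(2).transpose(1, 2)    # (B, H*W, dim)
        k = self.k_proj(depth_feat).flatten(2).transpose(1, 2)  # (B, H'*W', dim)
        v = self.v_proj(depth_feat).flatten(2).transpose(1, 2)  # (B, H'*W', dim)
        fused, _ = self.attn(q, k, v)       # every RGB location attends over all depth locations
        fused = fused.transpose(1, 2).reshape(b, -1, h, w)
        return rgb_feat + self.out(fused)   # residual fusion back in image space

# Hypothetical usage: a 32x32 RGB feature map fused with an unaligned 24x24 depth map.
rgb_feat = torch.randn(1, 64, 32, 32)
depth_feat = torch.randn(1, 32, 24, 24)
fusion = CrossModalImageAttention(rgb_channels=64, depth_channels=32)
print(fusion(rgb_feat, depth_feat).shape)  # torch.Size([1, 64, 32, 32])

Because each query location learns where to look in the other modality's map, a module of this form tolerates the perspective disparity between sensors rather than depending on an explicit calibration or registration step, which is the behavior the abstract attributes to SAMMT.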


Publication history

Received: 05 April 2023
Revised: 04 July 2023
Accepted: 11 July 2023
Published: 09 February 2024
Issue date: August 2024

Copyright

© The Author(s) 2024.

Acknowledgements


Rights and permissions

The articles published in this open access journal are distributed under the terms of the Creative Commons Attribution 4.0 International License (http://creativecommons.org/licenses/by/4.0/).
