Self-Aligning Multi-Modal Transformer for Oropharyngeal Swab Point Localization

Tianyu Liu1, Fuchun Sun1
1 Department of Computer Science and Technology, Tsinghua University, Beijing 100083, China

Abstract

Oropharyngeal swabbing is a pre-diagnostic procedure used to test for various respiratory diseases, including COVID-19 and influenza A (H1N1). To improve testing efficiency, robots need a real-time, accurate, and robust sampling point localization algorithm. However, current solutions rely heavily on visual input, which is not reliable enough for large-scale deployment. The transformer has significantly improved the performance of image-related tasks and challenged the dominance of traditional convolutional neural networks (CNNs) in computer vision. Inspired by its success, we propose a novel self-aligning multi-modal transformer (SAMMT) that dynamically attends to different parts of unaligned feature maps, preventing the information loss caused by perspective disparity and simplifying the overall implementation. Unlike preexisting multi-modal transformers, our attention mechanism works in image space instead of embedding space, rendering the sensor registration process unnecessary. To facilitate the multi-modal task, we collected an oropharynx localization/segmentation dataset annotated by trained medical personnel. This dataset is open-sourced and can be used for future multi-modal research. Our experiments show that our model improves the performance of the localization task by 4.2% compared with the pure visual model and reduces the pixel-wise error rate of the segmentation task by 16.7% compared with the CNN baseline.

Keywords: segmentation, localization, transformer, multi-modal perception, robotic perception
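The abstract's central architectural claim is that attention is computed in image space, so that each location of one sensor's feature map can attend over all locations of the other, unregistered sensor's feature map. The PyTorch sketch below illustrates that general idea with standard cross-attention; it is only a minimal illustration under our own assumptions (the CrossModalImageAttention name, layer sizes, and RGB plus depth as the two modalities are ours), not the authors' SAMMT implementation.

import torch
import torch.nn as nn

class CrossModalImageAttention(nn.Module):
    """Cross-attention between two unaligned feature maps (illustrative sketch only)."""
    def __init__(self, rgb_channels, depth_channels, dim=256, heads=8):
        super().__init__()
        self.q_proj = nn.Conv2d(rgb_channels, dim, 1)    # queries from the RGB map
        self.k_proj = nn.Conv2d(depth_channels, dim, 1)  # keys from the depth map
        self.v_proj = nn.Conv2d(depth_channels, dim, 1)  # values from the depth map
        self.attn = nn.MultiheadAttention(dim, heads, batch_first=True)
        self.out = nn.Conv2d(dim, rgb_channels, 1)

    def forward(self, rgb_feat, depth_feat):
        # rgb_feat: (B, C_rgb, H, W); depth_feat: (B, C_d, H', W').
        # The two maps may differ in resolution and need not be spatially registered.
        b, _, h, w = rgb_feat.shape
        q = self.q_proj(rgb_feat).flatten(2).transpose(1, 2)    # (B, H*W, dim)
        k = self.k_proj(depth_feat).flatten(2).transpose(1, 2)  # (B, H'*W', dim)
        v = self.v_proj(depth_feat).flatten(2).transpose(1, 2)  # (B, H'*W', dim)
        fused, _ = self.attn(q, k, v)       # every RGB location attends over all depth locations
        fused = fused.transpose(1, 2).reshape(b, -1, h, w)
        return rgb_feat + self.out(fused)   # residual fusion back in image space

# Hypothetical usage: a 32x32 RGB feature map fused with an unaligned 24x24 depth map.
rgb_feat = torch.randn(1, 64, 32, 32)
depth_feat = torch.randn(1, 32, 24, 24)
fusion = CrossModalImageAttention(rgb_channels=64, depth_channels=32)
print(fusion(rgb_feat, depth_feat).shape)  # torch.Size([1, 64, 32, 32])

Because each query location learns where to look in the other modality's map, a module of this form tolerates the perspective disparity between sensors rather than depending on an explicit calibration or registration step, which is the behavior the abstract attributes to SAMMT.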


Publication history

Received: 05 April 2023
Revised: 04 July 2023
Accepted: 11 July 2023
Published: 09 February 2024
Issue date: August 2024

Copyright

© The Author(s) 2024.

Acknowledgements


Rights and permissions

The articles published in this open access journal are distributed under the terms of the Creative Commons Attribution 4.0 International License (http://creativecommons.org/licenses/by/4.0/).
