Volume 28, Issue 4


Prompting and Tuning: A Two-Stage Unsupervised Domain Adaptive Person Re-identification Method on Vision Transformer Backbone

Shengming Yu, Zhaopeng Dou, and Shengjin Wang (corresponding author)
Department of Electronic Engineering, Tsinghua University, Beijing 100084, China

Abstract

This paper explores the Vision Transformer (ViT) backbone for Unsupervised Domain Adaptive (UDA) person Re-Identification (Re-ID). While some recent studies have validated ViT for supervised Re-ID, no study has yet used ViT for UDA Re-ID. We observe that the ViT structure offers a unique advantage for UDA Re-ID: it has a prompt (the learnable class token) at its bottom layer that can be used to efficiently condition the deep model on the underlying domain. To exploit this advantage, we propose a novel two-stage UDA pipeline named Prompting And Tuning (PAT), which consists of a prompt learning stage and a subsequent fine-tuning stage. In the first stage, PAT roughly adapts the model from the source to the target domain by learning a prompt for each domain; in the second stage, PAT fine-tunes the entire backbone for further adaptation to increase accuracy. Although both stages use pseudo labels for training, we show that they have different data preferences. With these two preferences, prompt learning and fine-tuning integrate well with each other and jointly yield a competitive PAT method for UDA Re-ID.
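The prompting idea described above can be illustrated with a minimal NumPy sketch. This is not the authors' PAT implementation: the "backbone" below is a stand-in single linear map (a real ViT would apply stacked self-attention blocks), and all names (`prompts`, `forward`, `W`) are hypothetical. It shows only the core mechanism: one learnable class-token-like vector per domain is prepended to the patch tokens, so the same frozen backbone produces domain-conditioned features.

```python
import numpy as np

rng = np.random.default_rng(0)
embed_dim, num_patches = 8, 4

# Stand-in for the frozen ViT backbone: one linear map over the token sequence.
# In stage 1 of a PAT-style pipeline, these weights would stay fixed.
W = rng.standard_normal((embed_dim, embed_dim)) / np.sqrt(embed_dim)

# One learnable prompt (class-token-like vector) per domain; in stage 1
# only these vectors would be updated by the training loss.
prompts = {
    "source": rng.standard_normal(embed_dim),
    "target": rng.standard_normal(embed_dim),
}

def forward(patch_tokens: np.ndarray, domain: str) -> np.ndarray:
    """Prepend the domain prompt to the patch tokens, run the backbone,
    and read the output at the prompt position as the image feature."""
    tokens = np.vstack([prompts[domain][None, :], patch_tokens])  # (1+N, D)
    out = tokens @ W  # stand-in for the transformer blocks
    return out[0]     # feature at the prompt (class-token) position

patches = rng.standard_normal((num_patches, embed_dim))
feat_src = forward(patches, "source")
feat_tgt = forward(patches, "target")
# Same image, different domain prompt -> a differently conditioned feature,
# which is the sense in which the prompt "conditions" the model on a domain.
```

Stage 2 of the pipeline would then unfreeze `W` (the whole backbone) and fine-tune it with pseudo labels on the target domain.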

Keywords:

unsupervised domain adaptation; person re-identification; transformer; prompt learning; uncertainty

Publication history

Received: 20 July 2022
Revised: 02 October 2022
Accepted: 07 October 2022
Published: 06 January 2023
Issue date: August 2023

Copyright

© The author(s) 2023.

Acknowledgements

This work was supported by the National Key Research and Development Program of China under the 13th Five-Year Plan (No. 2016YFB0801301) and the 14th Five-Year Plan (Nos. 2021YFFO602103, 2021YFF0602102, and 20210Y1702).

Rights and permissions

The articles published in this open access journal are distributed under the terms of the Creative Commons Attribution 4.0 International License (http://creativecommons.org/licenses/by/4.0/).
