Volume 28, Issue 4


Prompting and Tuning: A Two-Stage Unsupervised Domain Adaptive Person Re-identification Method on Vision Transformer Backbone

Shengming Yu, Zhaopeng Dou, and Shengjin Wang (corresponding author)
Department of Electronic Engineering, Tsinghua University, Beijing 100084, China

Abstract

This paper explores the Vision Transformer (ViT) backbone for Unsupervised Domain Adaptive (UDA) person Re-Identification (Re-ID). While some recent studies have validated ViT for supervised Re-ID, no study has yet used ViT for UDA Re-ID. We observe that the ViT structure offers a unique advantage for UDA Re-ID: it has a prompt (the learnable class token) at its bottom layer that can be used to efficiently condition the deep model on the underlying domain. To exploit this advantage, we propose a novel two-stage UDA pipeline named Prompting And Tuning (PAT), which consists of a prompt learning stage and a subsequent fine-tuning stage. In the first stage, PAT roughly adapts the model from the source to the target domain by learning a prompt for each domain; in the second stage, PAT fine-tunes the entire backbone for further adaptation to increase accuracy. Although both stages use pseudo labels for training, we show that they have different data preferences. With these two preferences, prompt learning and fine-tuning integrate well with each other and jointly yield a competitive PAT method for UDA Re-ID.
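The prompting idea described above can be illustrated with a minimal NumPy sketch. This is not the authors' PAT implementation: the "backbone" below is a stand-in single linear map (a real ViT would apply stacked self-attention blocks), and all names (`prompts`, `forward`, `W`) are hypothetical. It shows only the core mechanism: one learnable class-token-like vector per domain is prepended to the patch tokens, so the same frozen backbone produces domain-conditioned features.

```python
import numpy as np

rng = np.random.default_rng(0)
embed_dim, num_patches = 8, 4

# Stand-in for the frozen ViT backbone: one linear map over the token sequence.
# In stage 1 of a PAT-style pipeline, these weights would stay fixed.
W = rng.standard_normal((embed_dim, embed_dim)) / np.sqrt(embed_dim)

# One learnable prompt (class-token-like vector) per domain; in stage 1
# only these vectors would be updated by the training loss.
prompts = {
    "source": rng.standard_normal(embed_dim),
    "target": rng.standard_normal(embed_dim),
}

def forward(patch_tokens: np.ndarray, domain: str) -> np.ndarray:
    """Prepend the domain prompt to the patch tokens, run the backbone,
    and read the output at the prompt position as the image feature."""
    tokens = np.vstack([prompts[domain][None, :], patch_tokens])  # (1+N, D)
    out = tokens @ W  # stand-in for the transformer blocks
    return out[0]     # feature at the prompt (class-token) position

patches = rng.standard_normal((num_patches, embed_dim))
feat_src = forward(patches, "source")
feat_tgt = forward(patches, "target")
# Same image, different domain prompt -> a differently conditioned feature,
# which is the sense in which the prompt "conditions" the model on a domain.
```

Stage 2 of the pipeline would then unfreeze `W` (the whole backbone) and fine-tune it with pseudo labels on the target domain.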

Keywords:

unsupervised domain adaptation; person re-identification; transformer; prompt learning; uncertainty

Publication history

Received: 20 July 2022
Revised: 02 October 2022
Accepted: 07 October 2022
Published: 06 January 2023
Issue date: August 2023

Copyright

© The author(s) 2023.

Acknowledgements

This work was supported by the National Key Research and Development Program of China under the 13th Five-Year Plan (No. 2016YFB0801301) and the 14th Five-Year Plan (Nos. 2021YFFO602103, 2021YFF0602102, and 20210Y1702).

Rights and permissions

The articles published in this open access journal are distributed under the terms of the Creative Commons Attribution 4.0 International License (http://creativecommons.org/licenses/by/4.0/).
