Research Article | Open Access

PVT v2: Improved baselines with Pyramid Vision Transformer

Shanghai AI Laboratory, Shanghai 200232, China
Department of Computer Science and Technology, Nanjing University, Nanjing 210023, China
Department of Computer Science, the University of Hong Kong, Hong Kong 999077, China
School of Computer Science and Engineering, Nanjing University of Science and Technology, Nanjing 210014, China
Computer Vision Lab, ETH Zurich, Zurich 8092, Switzerland
SenseTime, Beijing 100080, China
Inception Institute of Artificial Intelligence, Abu Dhabi, United Arab Emirates



Transformers have recently led to encouraging progress in computer vision. In this work, we present new baselines that improve the original Pyramid Vision Transformer (PVT v1) with three designs: (i) a linear-complexity attention layer, (ii) an overlapping patch embedding, and (iii) a convolutional feed-forward network. With these modifications, PVT v2 reduces the computational complexity of PVT v1 to linear and provides significant improvements on fundamental vision tasks such as classification, detection, and segmentation. In particular, PVT v2 achieves comparable or better performance than recent work such as the Swin Transformer. We hope this work will facilitate state-of-the-art transformer research in computer vision. Code is available at
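The abstract does not spell out how the attention layer reaches linear complexity. The sketch below illustrates one common way to get there, under the assumption that keys and values are pooled to a fixed-size grid before attention, so the number of query-key pairs grows with the number of input tokens rather than with its square. The function names and the pool size of 7 are illustrative assumptions, not taken from the text above.

```python
def attention_pairs(h, w, kv_tokens=None):
    """Count the query-key pairs scored by one attention layer on an h x w map."""
    n = h * w                                   # one query per spatial position
    m = n if kv_tokens is None else kv_tokens   # number of key/value tokens
    return n * m


def full_attention(h, w):
    # Standard self-attention: every token attends to every token (quadratic in h*w).
    return attention_pairs(h, w)


def pooled_attention(h, w, pool=7):
    # Assumption: keys/values are pooled to a fixed pool x pool grid,
    # so the key/value count is constant and the cost is linear in h*w.
    return attention_pairs(h, w, pool * pool)


if __name__ == "__main__":
    for side in (14, 28, 56):
        tokens = side * side
        print(tokens, full_attention(side, side), pooled_attention(side, side))
```

Doubling the side length quadruples the token count. In this sketch the full-attention count then grows by a factor of 16, while the pooled variant grows by a factor of 4, in step with the tokens.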


Computational Visual Media
Pages 415-424
Cite this article:
Wang W, Xie E, Li X, et al. PVT v2: Improved baselines with Pyramid Vision Transformer. Computational Visual Media, 2022, 8(3): 415-424.














Received: 22 December 2021
Accepted: 08 February 2022
Published: 16 March 2022
© The Author(s) 2022.

