Research Article | Open Access

PVT v2: Improved baselines with Pyramid Vision Transformer

Shanghai AI Laboratory, Shanghai 200232, China
Department of Computer Science and Technology, Nanjing University, Nanjing 210023, China
Department of Computer Science, the University of Hong Kong, Hong Kong 999077, China
School of Computer Science and Engineering, Nanjing University of Science and Technology, Nanjing 210014, China
Computer Vision Lab, ETH Zurich, Zurich 8092, Switzerland
SenseTime, Beijing 100080, China
Inception Institute of Artificial Intelligence, Abu Dhabi, United Arab Emirates

Abstract

Transformers have recently led to encouraging progress in computer vision. In this work, we present new baselines that improve the original Pyramid Vision Transformer (PVT v1) with three designs: (i) a linear-complexity attention layer, (ii) an overlapping patch embedding, and (iii) a convolutional feed-forward network. With these modifications, PVT v2 reduces the computational complexity of PVT v1 to linear and provides significant improvements on fundamental vision tasks such as classification, detection, and segmentation. In particular, PVT v2 achieves comparable or better performance than recent work such as the Swin Transformer. We hope this work will facilitate state-of-the-art transformer research in computer vision. Code is available at https://github.com/whai362/PVT.
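
The "linear complexity" claim for design (i) can be illustrated with a small counting sketch. This is not the authors' implementation: it assumes, for illustration only, that the attention layer pools keys and values to a fixed grid (the `pooled_side` parameter and its default of 7 are assumptions, and the function names are hypothetical), so that the number of query-key pairs grows linearly with the number of tokens instead of quadratically.

```python
def attention_pair_count(num_queries: int, num_keys: int) -> int:
    """Number of query-key pairs scored by one attention head."""
    return num_queries * num_keys


def full_attention_cost(height: int, width: int) -> int:
    """Every token attends to every token, so the cost grows as N^2."""
    n = height * width
    return attention_pair_count(n, n)


def pooled_attention_cost(height: int, width: int, pooled_side: int = 7) -> int:
    """Keys/values are pooled to a fixed grid (assumed), so the cost grows as N."""
    n = height * width
    return attention_pair_count(n, pooled_side * pooled_side)


if __name__ == "__main__":
    # Doubling the feature-map side multiplies N by 4:
    # full attention grows 16x per step, pooled attention only 4x.
    for side in (56, 112, 224):
        print(side, full_attention_cost(side, side), pooled_attention_cost(side, side))
```

Under this assumption the gap widens as resolution increases, which is why linear-cost attention matters most for dense prediction tasks such as detection and segmentation, where inputs are large.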

Computational Visual Media
Pages 415-424

Cite this article:
Wang W, Xie E, Li X, et al. PVT v2: Improved baselines with Pyramid Vision Transformer. Computational Visual Media, 2022, 8(3): 415-424. https://doi.org/10.1007/s41095-022-0274-8

Views: 4761
Downloads: 279
Crossref: 2068
Web of Science: 1832
Scopus: 2202
CSCD: 90

Received: 22 December 2021
Accepted: 08 February 2022
Published: 16 March 2022
© The Author(s) 2022.

Open Access This article is licensed under a Creative Commons Attribution 4.0 International License, which permits use, sharing, adaptation, distribution and reproduction in any medium or format, as long as you give appropriate credit to the original author(s) and the source, provide a link to the Creative Commons licence, and indicate if changes were made.

The images or other third party material in this article are included in the article’s Creative Commons licence, unless indicated otherwise in a credit line to the material. If material is not included in the article’s Creative Commons licence and your intended use is not permitted by statutory regulation or exceeds the permitted use, you will need to obtain permission directly from the copyright holder.

To view a copy of this licence, visit http://creativecommons.org/licenses/by/4.0/.

Other papers from this open access journal are available free of charge from http://www.springer.com/journal/41095. To submit a manuscript, please go to https://www.editorialmanager.com/cvmj.