Scholar - SciOpen

Masked autoencoders (MAEs) have recently achieved great success in computer vision. They can automatically extract representations from unlabeled data and improve the performance of various downstream tasks. However, training an MAE model requires substantial resources, which puts it out of reach of many academic institutions: university laboratories often lack the necessary resources. This issue significantly hinders the development of the field. In this paper, we propose FastMAE, an efficient MAE approach. Inspired by the idea of offline tokenizers in natural language processing, FastMAE presents a novel way to build an offline vision tokenizer, which provides high-level semantics efficiently. Benefiting from the offline tokenizer, FastMAE becomes an efficient vision learner. Our experiments demonstrate that FastMAE can achieve 83.6% accuracy with ViT-B in only 18.8 h on 8 NVIDIA Tesla V100 GPUs, which is 31.3× faster than the original MAE, providing a resource-friendly baseline for the computer vision community. Moreover, it also achieves performance comparable to state-of-the-art methods. We hope our research will attract more people to engage in MAE-related research and that we can advance its development together.
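To put the reported figures in perspective, the short calculation below works out what the 31.3× speedup implies. It is only a back-of-the-envelope sketch: it assumes the speedup refers to wall-clock time on the same 8-GPU setup, which the abstract does not state explicitly, and the variable names are illustrative.

```python
# Figures taken from the abstract above.
fastmae_hours = 18.8   # wall-clock hours for FastMAE with ViT-B
num_gpus = 8           # NVIDIA Tesla V100 GPUs
speedup = 31.3         # reported speedup over the original MAE

# Assumption: the speedup is measured in wall-clock time on the same hardware.
baseline_hours = fastmae_hours * speedup
baseline_days = baseline_hours / 24

fastmae_gpu_hours = fastmae_hours * num_gpus
baseline_gpu_hours = baseline_hours * num_gpus

print(f"Implied baseline: {baseline_hours:.1f} h (about {baseline_days:.1f} days)")
print(f"GPU-hours: FastMAE {fastmae_gpu_hours:.1f}, implied baseline {baseline_gpu_hours:.1f}")
```

Under that assumption, the implied baseline is roughly 588 hours, or a little over three weeks of continuous training on the same machine.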

Open Access Research Article Issue

Visual attention network

Meng-Hao Guo, Cheng-Ze Lu, Zheng-Ning Liu, Ming-Ming Cheng, Shi-Min Hu

Computational Visual Media 2023, 9(4): 733-752

Published: 28 July 2023

Abstract

While originally designed for natural language processing tasks, the self-attention mechanism has recently taken various computer vision areas by storm. However, the 2D nature of images brings three challenges for applying self-attention in computer vision: (1) treating images as 1D sequences neglects their 2D structures; (2) the quadratic complexity is too expensive for high-resolution images; (3) it only captures spatial adaptability but ignores channel adaptability. In this paper, we propose a novel linear attention named large kernel attention (LKA) to enable self-adaptive and long-range correlations in self-attention while avoiding its shortcomings. Furthermore, we present a neural network based on LKA, namely Visual Attention Network (VAN). While extremely simple, VAN achieves results comparable to those of similarly sized convolutional neural networks (CNNs) and vision transformers (ViTs) in various tasks, including image classification, object detection, semantic segmentation, panoptic segmentation, pose estimation, etc. For example, VAN-B6 achieves 87.8% accuracy on the ImageNet benchmark and sets new state-of-the-art performance (58.2% PQ) for panoptic segmentation. Besides, VAN-B2 surpasses Swin-T by 4% mIoU (50.1% vs. 46.1%) for semantic segmentation on the ADE20K benchmark and by 2.6% AP (48.8% vs. 46.2%) for object detection on the COCO dataset. It provides a novel method and a simple yet strong baseline for the community. The code is available at https://github.com/Visual-Attention-Network.
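Challenge (2) above, the quadratic cost of self-attention, can be made concrete with a small calculation. The sketch below is not taken from the paper: it assumes a ViT-style tokenization with non-overlapping 16×16 patches, and `attention_cost` is a hypothetical helper that counts the pairwise attention scores for one head of one layer.

```python
def attention_cost(height, width, patch=16):
    """Return (num_tokens, num_pairwise_scores) for one attention head.

    Assumes the image is split into non-overlapping patch x patch tokens,
    as in ViT-style models; the patch size of 16 is an assumption.
    """
    tokens = (height // patch) * (width // patch)
    # Full self-attention scores every token against every token.
    return tokens, tokens * tokens


for side in (224, 448, 896):
    n, scores = attention_cost(side, side)
    print(f"{side}x{side}: {n} tokens, {scores} pairwise scores")
```

Doubling the image side length quadruples the number of tokens and multiplies the number of attention scores by 16, which is why full self-attention becomes expensive at high resolution.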

Open Access Short Communication Issue

Can attention enable MLPs to catch up with CNNs?

Meng-Hao Guo, Zheng-Ning Liu, Tai-Jiang Mu, Dun Liang, Ralph R. Martin, Shi-Min Hu

Computational Visual Media 2021, 7(3): 283-288

Published: 27 July 2021

Open Access Research Article Issue

PCT: Point cloud transformer

Meng-Hao Guo, Jun-Xiong Cai, Zheng-Ning Liu, Tai-Jiang Mu, Ralph R. Martin, Shi-Min Hu

Computational Visual Media 2021, 7(2): 187-199

Published: 10 April 2021

Abstract

The irregular domain and lack of ordering make it challenging to design deep neural networks for point cloud processing. This paper presents a novel framework named Point Cloud Transformer (PCT) for point cloud learning. PCT is based on Transformer, which has achieved huge success in natural language processing and displays great potential in image processing. It is inherently permutation invariant for processing a sequence of points, making it well-suited for point cloud learning. To better capture local context within the point cloud, we enhance input embedding with the support of farthest point sampling and nearest neighbor search. Extensive experiments demonstrate that PCT achieves state-of-the-art performance on shape classification, part segmentation, semantic segmentation, and normal estimation tasks.
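The two building blocks named above, farthest point sampling and nearest neighbor search, can be sketched in a few lines of plain Python. This is a generic illustration of the standard greedy algorithms, not the PCT implementation; the function names, the choice of starting index, and the toy points are assumptions made for the example.

```python
import math


def farthest_point_sampling(points, k, start=0):
    """Greedy FPS: return k indices, each new one farthest from those already chosen."""
    if not 0 < k <= len(points):
        raise ValueError("k must be between 1 and len(points)")
    chosen = [start]
    # min_dist[i] is the distance from point i to its nearest chosen point.
    min_dist = [math.dist(p, points[start]) for p in points]
    while len(chosen) < k:
        nxt = max(range(len(points)), key=min_dist.__getitem__)
        chosen.append(nxt)
        for i, p in enumerate(points):
            d = math.dist(p, points[nxt])
            if d < min_dist[i]:
                min_dist[i] = d
    return chosen


def k_nearest_neighbors(points, center_idx, k):
    """Return indices of the k points closest to points[center_idx], itself included."""
    center = points[center_idx]
    order = sorted(range(len(points)), key=lambda i: math.dist(points[i], center))
    return order[:k]


# Toy point cloud (hypothetical data) laid out along the x-axis.
cloud = [(0.0, 0.0, 0.0), (1.0, 0.0, 0.0), (0.1, 0.0, 0.0), (10.0, 0.0, 0.0), (5.0, 0.0, 0.0)]
centers = farthest_point_sampling(cloud, 3)
groups = [k_nearest_neighbors(cloud, c, 2) for c in centers]
print(centers, groups)
```

Sampling spreads the chosen centers across the cloud, and the neighbor search then gathers a small local group around each center, which is the kind of local context the abstract refers to.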
