Open Access Article
CamDiff: Camouflage Image Augmentation via Diffusion
CAAI Artificial Intelligence Research 2023, 2: 9150021
Published: 22 November 2023

The burgeoning field of Camouflaged Object Detection (COD) seeks to identify objects that blend into their surroundings. Despite the impressive performance of recent learning-based models, their robustness is limited: existing methods may misclassify salient objects as camouflaged ones, even though the two categories have contradictory characteristics. This limitation may stem from the lack of multi-pattern training images, leading to reduced robustness against salient objects. To overcome this scarcity, we introduce CamDiff, a novel approach inspired by AI-Generated Content (AIGC). Specifically, we leverage a latent diffusion model to synthesize salient objects in camouflaged scenes, while using the zero-shot image classification ability of the Contrastive Language-Image Pre-training (CLIP) model to prevent synthesis failures and to ensure that the synthesized objects align with the input prompt. Consequently, each synthesized image retains its original camouflage label while incorporating salient objects, yielding camouflaged scenes with richer characteristics. User studies show that the salient objects in our synthesized scenes attract more of the viewer's attention; such samples therefore pose a greater challenge to existing COD models. CamDiff enables flexible editing and efficient large-scale dataset generation at low cost. It significantly enhances both the training and testing phases of COD baselines, granting them robustness across diverse domains. Our newly generated datasets and source code are available at
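As a minimal sketch of the filtering idea described above: a synthesized region is kept only if a CLIP-style zero-shot classifier matches it to the input prompt more strongly than to every distractor class. The embeddings, margin, and `accept_synthesis` helper below are illustrative placeholders, not CamDiff's actual interface, which uses the pretrained CLIP image/text encoders on top of a latent diffusion model.

```python
import math

def cosine_similarity(a, b):
    """Cosine similarity between two embedding vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(x * x for x in b))
    return dot / (na * nb)

def accept_synthesis(image_emb, prompt_emb, distractor_embs, margin=0.05):
    """Keep a synthesized object only if its embedding matches the input
    prompt better than every distractor class embedding by `margin`."""
    target = cosine_similarity(image_emb, prompt_emb)
    return all(target >= cosine_similarity(image_emb, e) + margin
               for e in distractor_embs)
```

A failed synthesis (an object that resembles a distractor class more than the prompt) is rejected and can simply be regenerated, which is how a zero-shot check turns an unreliable generator into a usable labeling pipeline.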

Open Access Research Article
Sequential interactive image segmentation
Computational Visual Media 2023, 9 (4): 753-765
Published: 05 July 2023

Interactive image segmentation (IIS) is an important technique for obtaining pixel-level annotations. In many cases, target objects share similar semantics. However, IIS methods neglect this connection, and in particular the cues provided by representations of previously segmented objects, previous user interactions, and previous prediction masks, all of which can provide suitable priors for the current annotation. In this paper, we formulate a sequential interactive image segmentation (SIIS) task for minimizing user interaction when segmenting sequences of related images, and we provide a practical approach to this task using two pertinent designs. The first is a novel interaction mode: when annotating a new sample, our method automatically proposes an initial click based on previous annotations, dramatically reducing the interaction burden on the user. The second is an online optimization strategy, with the goal of providing semantic information when annotating specific targets, optimizing the model with dense supervision from previously labeled samples. Experiments demonstrate the effectiveness of regarding SIIS as a distinct task, and of our methods for addressing it.

Open Access Article
Polyp-PVT: Polyp Segmentation with Pyramid Vision Transformers
CAAI Artificial Intelligence Research 2023, 2: 9150015
Published: 30 June 2023

Most polyp segmentation methods use convolutional neural networks (CNNs) as their backbone, leading to two key issues when exchanging information between the encoder and decoder: (1) taking into account the differences in contribution between different-level features, and (2) designing an effective mechanism for fusing these features. Unlike existing CNN-based methods, we adopt a transformer encoder, which learns more powerful and robust representations. In addition, considering the influence of image acquisition and the elusive properties of polyps, we introduce three modules: a cascaded fusion module (CFM), a camouflage identification module (CIM), and a similarity aggregation module (SAM). Among these, the CFM collects the semantic and location information of polyps from high-level features; the CIM captures polyp information disguised in low-level features; and the SAM extends the pixel features of the polyp area, together with high-level semantic position information, to the entire polyp area, thereby effectively fusing cross-level features. The proposed model, named Polyp-PVT, effectively suppresses noise in the features and significantly improves their expressive capability. Extensive experiments on five widely adopted datasets show that the proposed model is more robust to various challenging situations (e.g., appearance changes, small objects, and rotation) than existing representative methods. The proposed model is available at

Open Access Research Article
Specificity-preserving RGB-D saliency detection
Computational Visual Media 2023, 9 (2): 297-317
Published: 03 January 2023

Salient object detection (SOD) in RGB and depth images has attracted increasing research interest. Existing RGB-D SOD models usually adopt fusion strategies to learn a shared representation from RGB and depth modalities, while few methods explicitly consider how to preserve modality-specific characteristics. In this study, we propose a novel framework, the specificity-preserving network (SPNet), which improves SOD performance by exploring both the shared information and modality-specific properties. Specifically, we use two modality-specific networks and a shared learning network to generate individual and shared saliency prediction maps. To effectively fuse cross-modal features in the shared learning network, we propose a cross-enhanced integration module (CIM) and propagate the fused feature to the next layer to integrate cross-level information. Moreover, to capture rich complementary multi-modal information to boost SOD performance, we use a multi-modal feature aggregation (MFA) module to integrate the modality-specific features from each individual decoder into the shared decoder. By using skip connections between encoder and decoder layers, hierarchical features can be fully combined. Extensive experiments demonstrate that our SPNet outperforms cutting-edge approaches on six popular RGB-D SOD and three camouflaged object detection benchmarks. The project is publicly available at
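To make the "enhance each modality with the other, then combine" idea concrete, here is a minimal sketch on flat feature vectors. This is a generic multiplicative cross-enhancement pattern, not the actual CIM, which operates on convolutional feature maps inside SPNet's shared learning network; the function name and formula are illustrative only.

```python
def cross_enhance(rgb, depth):
    """Enhance each modality with an elementwise product of the two,
    then fuse the enhanced features additively."""
    rgb_enh = [r + r * d for r, d in zip(rgb, depth)]    # RGB gated by depth
    depth_enh = [d + r * d for r, d in zip(rgb, depth)]  # depth gated by RGB
    return [a + b for a, b in zip(rgb_enh, depth_enh)]   # shared fusion
```

The multiplicative term responds only where both modalities agree, while the residual terms preserve each modality's own evidence, which is the intuition behind combining shared and modality-specific branches.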

Open Access Review Article
Light field salient object detection: A review and benchmark
Computational Visual Media 2022, 8 (4): 509-534
Published: 16 May 2022

Salient object detection (SOD) is a long-standing research topic in computer vision that has attracted increasing interest over the past decade. Since light fields record comprehensive information about natural scenes that benefits SOD in a number of ways, using light field inputs to improve saliency detection over conventional RGB inputs is an emerging trend. This paper provides the first comprehensive review and benchmark for light field SOD, which has long been lacking in the saliency community. Firstly, we introduce light fields, including their theory and data forms, and then review existing studies on light field SOD, covering ten traditional models, seven deep learning-based models, a comparative study, and a brief review. Existing datasets for light field SOD are also summarized. Secondly, we benchmark nine representative light field SOD models together with several cutting-edge RGB-D SOD models on four widely used light field datasets, providing insightful discussions and analyses, including a comparison between light field SOD and RGB-D SOD models. Because current datasets are inconsistent, we further generate complete data and supplement focal stacks, depth maps, and multi-view images for them, making them consistent and uniform; our supplemental data make a universal benchmark possible. Lastly, because of its diverse data representations and high dependency on acquisition hardware, light field SOD is a specialized problem that differs greatly from other saliency detection tasks. We provide nine observations on challenges and future directions, and outline several open issues. All materials, including models, datasets, benchmarking results, and the supplemented light field datasets, are publicly available at

Open Access Research Article
PVT v2: Improved baselines with Pyramid Vision Transformer
Computational Visual Media 2022, 8 (3): 415-424
Published: 16 March 2022

Transformers have recently led to encouraging progress in computer vision. In this work, we present new baselines by improving the original Pyramid Vision Transformer (PVT v1) with three designs: (i) a linear-complexity attention layer, (ii) overlapping patch embedding, and (iii) a convolutional feed-forward network. With these modifications, PVT v2 reduces the computational complexity of PVT v1 to linear and provides significant improvements on fundamental vision tasks such as classification, detection, and segmentation. In particular, PVT v2 achieves performance comparable to or better than recent work such as the Swin Transformer. We hope this work will facilitate state-of-the-art transformer research in computer vision. Code is available at
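The overlapping patch embedding can be illustrated with the standard strided-convolution output formula: with kernel size larger than the stride and matching zero padding, neighboring patches share pixels while the token grid stays the same size. The specific 7×7/stride-4/padding-3 values below are the commonly reported stage-1 setting and are given as an illustration, not as the exact configuration of every stage.

```python
def patch_grid(size, kernel, stride, padding):
    """Number of patches along one spatial dimension produced by a
    strided convolution (the standard conv output-size formula)."""
    return (size + 2 * padding - kernel) // stride + 1

# Non-overlapping ViT-style patches: kernel == stride, no padding.
vit_tokens = patch_grid(224, kernel=4, stride=4, padding=0)

# Overlapping patches: kernel = 2*stride - 1 with padding = stride - 1
# keeps the same grid while letting adjacent patches share pixels.
pvt2_tokens = patch_grid(224, kernel=7, stride=4, padding=3)
```

Both calls yield a 56×56 token grid, which is the point of the design: overlap adds local continuity between tokens without changing the sequence length downstream attention layers see.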

Open Access Review Article
RGB-D salient object detection: A survey
Computational Visual Media 2021, 7 (1): 37-69
Published: 07 January 2021

Salient object detection, which simulates human visual perception in locating the most significant object(s) in a scene, has been widely applied to various computer vision tasks. Now, the advent of depth sensors means that depth maps can easily be captured; this additional spatial information can boost the performance of salient object detection. Although various RGB-D based salient object detection models with promising performance have been proposed over the past several years, an in-depth understanding of these models and the challenges in this field remains lacking. In this paper, we provide a comprehensive survey of RGB-D based salient object detection models from various perspectives, and review related benchmark datasets in detail. Further, as light fields can also provide depth maps, we review salient object detection models and popular benchmark datasets from this domain too. Moreover, to investigate the ability of existing models to detect salient objects, we have carried out a comprehensive attribute-based evaluation of several representative RGB-D based salient object detection models. Finally, we discuss several challenges and open directions of RGB-D based salient object detection for future research. All collected models, benchmark datasets, datasets constructed for attribute-based evaluation, and related code are publicly available at
