Image inpainting is the task of filling in missing or masked regions of an image with semantically meaningful content. Recent methods have shown significant improvement in dealing with large missing regions. However, these methods usually require large training datasets to achieve satisfactory results, and there has been limited research into training such models on a small number of samples. To address this, we present a novel data-efficient generative residual image inpainting method that produces high-quality inpainting results. The core idea is to use an iterative residual reasoning method that incorporates convolutional neural networks (CNNs) for feature extraction and transformers for global reasoning within generative adversarial networks, along with image-level and patch-level discriminators. We also propose a novel forged-patch adversarial training strategy to create faithful textures and detailed appearances. Extensive evaluation shows that our method outperforms previous methods on the data-efficient image inpainting task, both quantitatively and qualitatively.
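For intuition, here is a minimal PyTorch sketch of the kind of generator this abstract describes: CNN feature extraction, transformer-based global reasoning, and an iterative residual update restricted to the masked region. All module names, sizes, and the number of refinement steps are illustrative assumptions, not the authors' implementation; the adversarial training with image-level and patch-level discriminators is omitted.

```python
# Hypothetical sketch only: CNN encoder + transformer reasoning + residual
# refinement inside a generator, loosely following the abstract above.
import torch
import torch.nn as nn

class ResidualInpaintingGenerator(nn.Module):
    def __init__(self, dim=256, num_layers=2):
        super().__init__()
        # CNN encoder: RGB image + binary mask (4 channels) -> feature map
        self.encoder = nn.Sequential(
            nn.Conv2d(4, dim, kernel_size=4, stride=4),
            nn.ReLU(inplace=True),
        )
        # Transformer performs global reasoning over the spatial tokens
        layer = nn.TransformerEncoderLayer(d_model=dim, nhead=8, batch_first=True)
        self.reasoning = nn.TransformerEncoder(layer, num_layers=num_layers)
        # CNN decoder predicts a residual correction for the masked region
        self.decoder = nn.Sequential(
            nn.ConvTranspose2d(dim, 64, kernel_size=4, stride=4),
            nn.ReLU(inplace=True),
            nn.Conv2d(64, 3, kernel_size=3, padding=1),
        )

    def forward(self, image, mask, steps=3):
        # Iterative residual reasoning: refine the estimate over several passes
        out = image * (1 - mask)
        for _ in range(steps):
            feat = self.encoder(torch.cat([out, mask], dim=1))
            b, c, h, w = feat.shape
            tokens = feat.flatten(2).transpose(1, 2)       # (B, H*W, C)
            tokens = self.reasoning(tokens)
            feat = tokens.transpose(1, 2).reshape(b, c, h, w)
            out = out + self.decoder(feat) * mask          # update holes only
        return out

g = ResidualInpaintingGenerator()
img, msk = torch.randn(1, 3, 128, 128), torch.zeros(1, 1, 128, 128)
msk[..., 48:80, 48:80] = 1.0                               # square hole
print(g(img, msk).shape)  # torch.Size([1, 3, 128, 128])
```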
Research Article
The ability to exhibit appropriate emotions is crucial for the expressiveness and attractiveness of facial videos. However, it is difficult to control the level of emotion, even for experienced actors and amateur podcasters on social networks. In this study, we aim to solve the novel problem of semantically amplifying the emotions of a facial video. This poses new challenges for effectively editing a sequence of video frames in terms of face semantics, emotion adaptiveness, and temporal coherence. Our approach is based on semantic face editing in the disentangled latent space of a state-of-the-art StyleGAN model. We present a new face dataset with diverse emotions to fine-tune the pretrained StyleGAN and improve the expressiveness of its original emotion-biased latent space. We construct an emotion-editing subspace that allows adaptive emotion amplification while preserving other facial attributes. We further propose an effective stitching-tuning technique to ensure temporally coherent video frames. Our method produces plausible emotion amplification for a wide range of facial videos. Qualitative and quantitative evaluations demonstrate the advantages of our method over baseline methods. The proposed dataset and research code will be made publicly available.
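The latent-space editing step can be sketched as scaling the component of a latent code that lies inside an emotion-editing subspace while leaving the orthogonal (identity and pose) part untouched. The following is a toy illustration, assuming latent codes and an orthonormal basis for that subspace are given; the StyleGAN generator, fine-tuning, and stitching-tuning are all omitted, and every name here is hypothetical.

```python
# Hypothetical illustration of latent-space emotion amplification; the real
# latents and emotion directions would come from a fine-tuned StyleGAN.
import torch

def amplify_emotion(w, emotion_basis, gain=1.5):
    """Scale the component of latent code `w` inside an emotion-editing
    subspace, leaving the orthogonal (identity/pose) part unchanged.

    w:             (B, D) latent codes, e.g., in StyleGAN's W space
    emotion_basis: (K, D) orthonormal directions spanning the subspace
    gain:          amplification factor (> 1 strengthens the emotion)
    """
    coeffs = w @ emotion_basis.T             # (B, K) subspace coordinates
    emotion_part = coeffs @ emotion_basis    # (B, D) within-subspace part
    return w + (gain - 1.0) * emotion_part   # amplify only that part

# Toy usage with random stand-ins for real latents and directions
D, K = 512, 4
basis, _ = torch.linalg.qr(torch.randn(D, K))  # orthonormal columns
w = torch.randn(2, D)
w_amp = amplify_emotion(w, basis.T, gain=2.0)
print(w_amp.shape)  # torch.Size([2, 512])
```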
Research Article
Storyboards comprising key illustrations and images help filmmakers to outline ideas, key moments, and story events when filming movies. Inspired by this, we introduce Script-to-Storyboard (Sc2St), the first contextual benchmark dataset composed of storyboards to explicitly express story structures in the movie domain, and propose a contextual retrieval task to facilitate movie story understanding. Unlike existing movie datasets, the Sc2St dataset contains fine-grained and diverse texts, annotated semantic keyframes, and coherent storylines in storyboards. The contextual retrieval task takes as input a multi-sentence movie script summary together with a keyframe history, and aims to retrieve a future keyframe, described by the corresponding sentence, to form the storyboard. Compared to classic text-based visual retrieval tasks, this requires capturing context from both the description (script) and the keyframe history. We benchmark existing text-based visual retrieval methods on the new dataset and propose a recurrent framework with three variants for effective context encoding. Comprehensive experiments demonstrate that our methods compare favourably to existing ones, and ablation studies validate the effectiveness of the proposed context encoding approaches.
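As a rough sketch of recurrent context encoding for this task, assume the script sentences and keyframes have already been embedded by off-the-shelf encoders; a GRU summarizes the keyframe history, and the fused query is matched against candidate keyframes by cosine similarity. The architecture and dimensions below are illustrative assumptions, not one of the paper's benchmarked variants.

```python
# Minimal sketch of recurrent context encoding for contextual retrieval,
# assuming precomputed text and keyframe embeddings. Names are illustrative.
import torch
import torch.nn as nn
import torch.nn.functional as F

class ContextualRetriever(nn.Module):
    def __init__(self, dim=512):
        super().__init__()
        # GRU summarizes the keyframe history into a context vector
        self.history_rnn = nn.GRU(dim, dim, batch_first=True)
        # Fuse the target sentence embedding with the history context
        self.fuse = nn.Linear(2 * dim, dim)

    def forward(self, sentence_emb, history_embs, candidate_embs):
        # sentence_emb: (B, D); history_embs: (B, T, D); candidates: (N, D)
        _, ctx = self.history_rnn(history_embs)        # ctx: (1, B, D)
        query = self.fuse(torch.cat([sentence_emb, ctx.squeeze(0)], dim=-1))
        # Cosine similarity ranks candidate keyframes for retrieval
        return F.normalize(query, dim=-1) @ F.normalize(candidate_embs, dim=-1).T

model = ContextualRetriever()
scores = model(torch.randn(2, 512), torch.randn(2, 5, 512), torch.randn(100, 512))
print(scores.shape)  # torch.Size([2, 100]): each query scored against 100 frames
```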
Review Article
Deep learning has been successfully used for tasks in the 2D image domain. Research on 3D computer vision and deep geometry learning has also attracted attention, and considerable achievements have been made in feature extraction and discrimination of 3D shapes. Following recent advances in deep generative models such as generative adversarial networks, effective generation of 3D shapes has become an active research topic. Unlike 2D images, which have a regular grid structure, 3D shapes have various representations, such as voxels, point clouds, meshes, and implicit functions. For deep learning on 3D shapes, the choice of shape representation must be taken into account, as no unified representation covers all tasks well. Factors such as how well a representation captures geometry and topology largely affect the quality of the generated 3D shapes. In this survey, we comprehensively review work on deep-learning-based 3D shape generation, classifying and discussing it in terms of the underlying shape representation and the architecture of the shape generator. We further analyze the advantages and disadvantages of each class, and consider the 3D shape datasets commonly used for shape generation. Finally, we present several potential research directions that we hope will inspire future work on this topic.
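To make the representation trade-offs concrete, the toy example below encodes a shape as an implicit function (a network mapping a latent code and a 3D query point to occupancy) and recovers a voxel grid from it by sampling a regular lattice. It is a generic illustration of the implicit representation mentioned above, not a specific method from the survey.

```python
# Toy implicit-function shape representation: latent code + query point ->
# occupancy. A voxel grid is obtained by evaluating the network on a lattice.
import torch
import torch.nn as nn

class ImplicitShapeDecoder(nn.Module):
    def __init__(self, latent_dim=128, hidden=256):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(latent_dim + 3, hidden), nn.ReLU(inplace=True),
            nn.Linear(hidden, hidden), nn.ReLU(inplace=True),
            nn.Linear(hidden, 1),  # occupancy logit per query point
        )

    def forward(self, latent, points):
        # latent: (B, latent_dim); points: (B, N, 3) query coordinates
        z = latent.unsqueeze(1).expand(-1, points.shape[1], -1)
        return self.net(torch.cat([z, points], dim=-1)).squeeze(-1)

# Query the decoder on a 32^3 lattice to voxelize the implicit shape
dec = ImplicitShapeDecoder()
grid = torch.stack(torch.meshgrid(
    *[torch.linspace(-1, 1, 32)] * 3, indexing="ij"), dim=-1).reshape(1, -1, 3)
occ = dec(torch.randn(1, 128), grid).sigmoid().reshape(32, 32, 32)
print(occ.shape)  # torch.Size([32, 32, 32]) occupancy volume
```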
Research Article
Modeling the complete geometry of general shapes from a single image is an ill-posed problem. User hints are often incorporated to resolve ambiguities and provide guidance during the modeling process. In this work, we present a novel interactive approach for extracting high-quality freeform shapes from a single image. This is inspired by the popular lofting technique in many CAD systems, and only requires minimal user input. Given an input image, the user only needs to sketch several projected cross sections, provide a "main axis", and specify some geometric relations. Our algorithm then automatically optimizes the common normal to the sections with respect to these constraints, and interpolates between the sections, resulting in a high-quality 3D model that conforms to both the original image and the user input. The entire modeling session is efficient and intuitive. We demonstrate the effectiveness of our approach based on qualitative tests on a variety of images, and quantitative comparisons with the ground truth using synthetic images.
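The interpolation step of lofting can be sketched as blending corresponding vertices of consecutive cross-sections. The toy sketch below assumes the sections are already embedded in 3D with vertex correspondence established, and omits the constrained optimization of the common normal that the abstract describes.

```python
# Hedged sketch of the lofting/interpolation step only; section placement
# and the common-normal optimization are assumed to have been done already.
import numpy as np

def loft_sections(sections, samples_per_gap=8):
    """sections: list of (N, 3) arrays with corresponding vertices, ordered
    along the main axis. Returns a dense (M, N, 3) stack of interpolated
    cross-sections approximating the lofted surface."""
    out = []
    for a, b in zip(sections[:-1], sections[1:]):
        for t in np.linspace(0.0, 1.0, samples_per_gap, endpoint=False):
            out.append((1.0 - t) * a + t * b)  # linear blend between sections
    out.append(sections[-1])
    return np.stack(out)

# Toy example: loft between two circular sections of different radii
theta = np.linspace(0.0, 2.0 * np.pi, 32, endpoint=False)
circle = lambda r, z: np.stack(
    [r * np.cos(theta), r * np.sin(theta), np.full_like(theta, z)], axis=1)
surface = loft_sections([circle(1.0, 0.0), circle(0.4, 2.0)])
print(surface.shape)  # (9, 32, 3): a tapered tube of interpolated sections
```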