Open Access Research Article
FaceCLIP: CLIP-driven accurate and detailed 3D face reconstruction from a single image
Computational Visual Media 2026, 12(1): 85-103
Published: 02 February 2026

In recent years, 3D face reconstruction has become a research hotspot in computer graphics and computer vision. Most current 3DMM-based methods focus on learning displacement maps to recover high-frequency facial details. However, they pay less attention to mid-frequency facial details, and the displacement maps they learn often contain noise, which reduces reconstruction accuracy. This work therefore presents a novel approach for regressing accurate and detailed 3D face shapes. First, we design a novel feature consistency loss to recover mid-frequency facial details. Specifically, we exploit the powerful CLIP model as a source of facial prior knowledge, extracting geometric and semantic features that guide the reconstructed 3D geometric details to match the local details of the input image. Furthermore, we propose a parameter refinement module that learns fine-grained features, yielding more accurate model parameters and improving reconstruction accuracy. Extensive experiments on the FaceScape and REALY benchmarks demonstrate that our method outperforms several state-of-the-art methods in reconstruction accuracy, and comprehensive qualitative results show that it achieves better visual quality than existing methods.
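To make the feature consistency idea concrete, below is a minimal PyTorch sketch of a CLIP-based feature consistency loss between a differentiably rendered reconstruction and the input photograph. The backbone choice (ViT-B/32), the cosine-distance formulation, and the name feature_consistency_loss are illustrative assumptions, not details taken from the paper.

import torch
import torch.nn.functional as F
import clip  # OpenAI CLIP: pip install git+https://github.com/openai/CLIP.git

# Frozen CLIP image encoder used as the face prior (assumed variant).
device = "cuda" if torch.cuda.is_available() else "cpu"
clip_model, _ = clip.load("ViT-B/32", device=device)
clip_model.eval()
for p in clip_model.parameters():
    p.requires_grad_(False)

def feature_consistency_loss(rendered, target):
    """Cosine distance between CLIP embeddings of the rendered face
    and the input image; gradients flow through `rendered` only.
    Both inputs: (B, 3, 224, 224), normalized with CLIP's mean/std.
    """
    with torch.no_grad():
        f_target = clip_model.encode_image(target)
    f_rendered = clip_model.encode_image(rendered)
    f_rendered = F.normalize(f_rendered.float(), dim=-1)
    f_target = F.normalize(f_target.float(), dim=-1)
    return (1.0 - (f_rendered * f_target).sum(dim=-1)).mean()

In practice such a term would be weighted and summed with the usual 3DMM photometric and landmark losses; patch-level CLIP features could be substituted to emphasize local detail.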

Open Access Research Article
Multi-task learning and joint refinement between camera localization and object detection
Computational Visual Media 2024, 10(5): 993-1011
Published: 08 February 2024

Visual localization and object detection both play important roles in various tasks. In many indoor application scenarios, where some detected objects have fixed positions, the two techniques work closely together. However, few researchers have considered the two tasks jointly, owing to a lack of suitable datasets and the limited attention paid to such environments. In this paper, we explore multi-task network design and joint refinement of detection and localization. To address the dataset problem, we construct a medium-scale indoor scene of an aviation exhibition hall through a semi-automatic process. The dataset provides both localization and detection annotations, and is publicly available at https://drive.google.com/drive/folders/1U28zkON4_I0dbzkqyIAKlAl5k9oUK0jI?usp=sharing for benchmarking localization and object detection tasks. Targeting this dataset, we design a multi-task network, JLDNet, based on YOLO v3, which outputs a target point cloud and object bounding boxes; in dynamic environments, the detection branch also aids the perception of scene dynamics. JLDNet comprises image feature learning, point feature learning, feature fusion, detection construction, and point cloud regression (a structural sketch follows below). Moreover, object-level bundle adjustment is used to further improve localization and detection accuracy. To evaluate JLDNet and compare it with other methods, we conducted experiments on the static 7-Scenes dataset, our constructed dataset, and the dynamic TUM RGB-D and Bonn datasets. The results show state-of-the-art accuracy on both tasks and demonstrate the benefit of solving them jointly.
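To illustrate the two-head design described above, here is a minimal PyTorch sketch of a shared backbone feeding a YOLO-style detection head and a scene-point regression head. All layer sizes, the class name JointLocDetSketch, and the grid resolution are placeholder assumptions; they do not reproduce JLDNet's actual architecture or its object-level bundle adjustment.

import torch
import torch.nn as nn

class JointLocDetSketch(nn.Module):
    """One backbone, two heads: per-cell detection outputs
    (4 box coords + objectness + class scores) and a per-cell
    3D scene point for camera localization."""
    def __init__(self, num_classes=10, grid=13):
        super().__init__()
        self.backbone = nn.Sequential(  # stand-in for YOLO v3's Darknet-53
            nn.Conv2d(3, 32, 3, stride=2, padding=1), nn.ReLU(),
            nn.Conv2d(32, 64, 3, stride=2, padding=1), nn.ReLU(),
            nn.AdaptiveAvgPool2d(grid),
        )
        self.det_head = nn.Conv2d(64, 5 + num_classes, 1)  # detection branch
        self.point_head = nn.Conv2d(64, 3, 1)              # point regression branch

    def forward(self, image):
        feat = self.backbone(image)
        return self.det_head(feat), self.point_head(feat)

# Joint training would sum the two task losses, e.g.
#   loss = det_loss(det_out, boxes_gt) + lambda_pt * point_loss(pt_out, points_gt)
# with the regressed points then feeding a PnP/RANSAC step for the camera pose.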
