Scholar - SciOpen

Generating expressive and diverse human gestures from audio is crucial in fields like human–computer interaction, virtual reality, and animation. While existing methods have achieved remarkable performance, they often exhibit limitations due to constrained dataset diversity and the restricted amount of information derived from audio inputs. To address these challenges, we present VarGes, a novel variation-driven framework designed to enhance co-speech gesture generation by integrating visual stylistic cues while maintaining naturalness. Our approach begins with a variation-enhanced feature extraction module, which incorporates style-reference video data into a 3D human pose estimation network to extract StyleCLIPS, thereby enriching the input with stylistic information. Subsequently, we employ a variation-compensation style encoder, a transformer-style encoder equipped with an additive-attention pooling layer, to robustly encode diverse StyleCLIPS representations and effectively manage stylistic variations. Finally, a variation-driven gesture predictor module fuses MFCC audio features with StyleCLIPS encodings via cross-attention and injects the fused features into a cross-conditional autoregressive model to modulate 3D human gesture generation based on audio input and stylistic cues. The efficacy of our approach is validated on benchmark datasets, on which it outperforms existing methods in terms of gesture diversity and naturalness. Our code and video results are publicly available at https://github.com/mookerr/VarGES/.
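
The abstract mentions a pooling layer based on additive attention but does not give its form. As a rough illustration only, the sketch below shows one common formulation: each frame feature h_t is scored as v · tanh(W h_t + b), the scores are normalized with a softmax, and the frames are averaged with those weights. The function name `additive_attention_pool` and the parameters `W`, `b`, and `v` are hypothetical and are not taken from the VarGes code; the authors' layer may differ.

```python
import math
from typing import List, Tuple

Vector = List[float]
Matrix = List[List[float]]


def additive_attention_pool(
    frames: List[Vector], W: Matrix, b: Vector, v: Vector
) -> Tuple[Vector, Vector]:
    """Pool a sequence of frame features into a single vector.

    Illustrative sketch (not the VarGes implementation):
      score_t  = v . tanh(W h_t + b)
      weight_t = softmax(score)_t
      pooled   = sum_t weight_t * h_t
    """
    scores = []
    for h in frames:
        # Hidden projection: tanh(W h + b), one entry per row of W.
        hidden = [
            math.tanh(sum(w_ij * h_j for w_ij, h_j in zip(row, h)) + b_i)
            for row, b_i in zip(W, b)
        ]
        scores.append(sum(v_i * u_i for v_i, u_i in zip(v, hidden)))

    # Numerically stable softmax over the time axis.
    top = max(scores)
    exps = [math.exp(s - top) for s in scores]
    total = sum(exps)
    weights = [e / total for e in exps]

    # Weighted average of the original frame features.
    dim = len(frames[0])
    pooled = [sum(wt * h[d] for wt, h in zip(weights, frames)) for d in range(dim)]
    return pooled, weights
```

With all-zero parameters every frame gets the same score, so the layer reduces to plain mean pooling; learned parameters would let it weight some frames more heavily than others.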

Open Access Review Article Issue

3D indoor scene geometry estimation from a single omnidirectional image: A comprehensive survey

Ming Meng, Yonggui Zhu, Yufei Zhao, Yufei Li, Zhe Zhu

Computational Visual Media 2025, 11(3): 431-464

Published: 04 June 2025

Abstract



This paper surveys the technology used in three-dimensional indoor scene geometry estimation from a single 360° omnidirectional image, which is pivotal in extracting 3D structural information from indoor environments. The technology transforms omnidirectional data into a 3D model, depicting spatial structure, object positions, and scene layout. Its significance spans various domains, including virtual reality (VR), augmented reality (AR), mixed reality (MR), game development, urban planning, and robot navigation. We begin by revisiting foundational concepts of omnidirectional imaging and detailing the problems, applications, and challenges in this field. Our review categorizes the fundamental tasks of structure recovery, depth estimation, and layout recovery. We also review pertinent datasets and evaluation metrics, providing the latest research as a reference. Finally, we summarize the field and discuss potential future trends to inform and guide further research.
