Existing GAN-based generative methods are typically used for semantic image synthesis. We pose the question of whether GAN-based architectures can generate plausible depth maps and find that existing methods have difficulty in generating depth maps which reasonably represent 3D scene structure due to the lack of global geometric correlations. Thus, we propose DepthGAN, a novel method of generating a depth map using a semantic layout as input to aid construction, and manipulation of well-structured 3D scene point clouds. Specifically, we first build a feature generation model with a cascade of semantically-aware transformer blocks to obtain depth features with global structural information. For our semantically aware transformer block, we propose a mixed attention module and a semantically aware layer normalization module to better exploit semantic consistency for depth features generation. Moreover, we present a novel semantically weighted depth synthesis module, which generates adaptive depth intervals for the current scene. We generate the final depth map by using a weighted combination of semantically aware depth weights for different depth ranges. In this manner, we obtain a more accurate depth map. Extensive experiments on indoor and outdoor datasets demonstrate that DepthGAN achieves superior results both quantitatively and visually for the depth generation task.
- Article type
- Year
- Co-author
The popularity of online home design and floor plan customization has been steadily increasing. However, the manual conversion of floor plan images from books or paper materials into electronic resources can be a challenging task due to the vast amount of historical data available. By leveraging neural networks to identify and parse floor plans, the process of converting these images into electronic materials can be significantly streamlined. In this paper, we present a novel learning framework for automatically parsing floor plan images. Our key insight is that the room type text is very common and crucial in floor plan images as it identifies the important semantic information of the corresponding room. However, this clue is rarely considered in previous learning-based methods. In contrast, we propose the Row and Column network (RC-Net) for recognizing floor plan elements by integrating the text feature. Specifically, we add the text feature branch in the network to extract text features corresponding to the room type for the guidance of room type predictions. More importantly, we formulate the Row and Column constraint module (RC constraint module) to share and constrain features across the entire row and column of the feature maps to ensure that only one type is predicted in each room as much as possible, making the segmentation boundaries between different rooms more regular and cleaner. Extensive experiments on three benchmark datasets validate that our framework substantially outperforms other state-of-the-art approaches in terms of the metrics of FWIoU, mACC and mIoU.
Specular highlight detection and removal is a fundamental problem in computer vision and image processing. In this paper, we present an efficient end-to-end deep learning model for automatically detecting and removing specular highlights in a single image. In particular, an encoder–decoder network is utilized to detect specular highlights, and then a novel Unet-Transformer network performs highlight removal; we append transformer modules instead of feature maps in the Unet architecture. We also introduce a highlight detection module as a mask to guide the removal task. Thus, these two networks can be jointly trained in an effective manner. Thanks to the hierarchical and global properties of the transformer mechanism, our framework is able to establish relationships between continuous self-attention layers, making it possible to directly model the mapping between the diffuse area and the specular highlight area, and reduce indeterminacy within areas containing strong specular highlight reflection. Experiments on public benchmark and real-world images demonstrate that our approach outperforms state-of-the-art methods for both highlight detection and removal tasks.
The traditional space-invariant isotropic kernel utilized by a bilateral filter (BF) frequently leads to blurry edges and gradient reversal artifacts due to the existence of a large amount of outliers in the local averaging window. However, the efficient and accurate estimation of space-variant kernels which adapt to image structures, and the fast realization of the corresponding space-variant bilateral filtering are challenging problems. To address these problems, we present a space-variant BF (SVBF), and its linear time and error-bounded acceleration method. First, we accurately estimate spacevariant anisotropic kernels that vary with image structures in linear time through structure tensor and minimum spanning tree. Second, we perform SVBF in linear time using two error-bounded approximation methods, namely, low-rank tensor approximation via higher-order singular value decomposition and exponential sum approximation. Therefore, the proposed SVBF can efficiently achieve good edge-preserving results. We validate the advantages of the proposed filter in applications including: image denoising, image enhancement, and image focus editing. Experimental results demonstrate that our fast and error-bounded SVBF is superior to state-of-the-art methods.