Open Access Issue
Dual-Modality Integration Attention with Graph-Based Feature Extraction for Visual Question and Answering
Tsinghua Science and Technology 2025, 30(5): 2133-2145
Published: 29 April 2025

Visual Question and Answering (VQA) has garnered significant attention as a domain that requires the synthesis of visual and textual information to produce accurate responses. While existing methods often rely on Convolutional Neural Networks (CNNs) for feature extraction and attention mechanisms for embedding learning, they frequently fail to capture the nuanced interactions between entities within images, leading to potential ambiguities in answer generation. In this paper, we introduce a novel network architecture, Dual-modality Integration Attention with Graph-based Feature Extraction (DIAGFE), which addresses these limitations by incorporating two key innovations: a Graph-based Feature Extraction (GFE) module that enhances the precision of visual semantics extraction, and a Dual-modality Integration Attention (DIA) mechanism that efficiently fuses visual and question features to guide the model towards more accurate answer generation. Our model is trained with a composite loss function to refine its predictive accuracy. Rigorous experiments on the VQA2.0 dataset demonstrate that DIAGFE outperforms existing methods, underscoring the effectiveness of our approach in advancing VQA research and its potential for cross-modal understanding.
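The abstract's fusion step can be pictured with a small sketch. The block below is a minimal, hypothetical PyTorch interpretation of a DIA-style block, assuming token-level question embeddings and graph-derived visual node features share one embedding dimension; the class name, gating design, and shapes are illustrative assumptions, not the paper's actual implementation.

```python
import torch
import torch.nn as nn

class DualModalityIntegrationAttention(nn.Module):
    """Sketch: question tokens attend over graph-based visual node features,
    and the attended visual summary is gated into the question representation."""
    def __init__(self, dim: int, num_heads: int = 8):
        super().__init__()
        self.cross_attn = nn.MultiheadAttention(dim, num_heads, batch_first=True)
        self.gate = nn.Sequential(nn.Linear(2 * dim, dim), nn.Sigmoid())
        self.norm = nn.LayerNorm(dim)

    def forward(self, question_feats, visual_node_feats):
        # question_feats: (B, Lq, D) question token embeddings
        # visual_node_feats: (B, Nv, D) node embeddings from an assumed upstream
        # graph-based feature extractor (the GFE module in the paper)
        attended, _ = self.cross_attn(question_feats, visual_node_feats, visual_node_feats)
        g = self.gate(torch.cat([question_feats, attended], dim=-1))  # per-token fusion gate
        fused = self.norm(question_feats + g * attended)
        return fused  # would feed an answer classifier in a full VQA model

if __name__ == "__main__":
    dia = DualModalityIntegrationAttention(dim=512)
    q = torch.randn(2, 14, 512)   # batch of 2 questions, 14 tokens each
    v = torch.randn(2, 36, 512)   # 36 visual graph nodes per image
    print(dia(q, v).shape)        # torch.Size([2, 14, 512])
```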

Open Access Issue
Generating Medical Report via Joint Probability Graph Reasoning
Tsinghua Science and Technology 2025, 30(4): 1685-1699
Published: 03 March 2025

In medical X-ray images, multiple abnormalities frequently co-occur. However, existing report generation methods cannot efficiently extract all abnormal features, which leads to incomplete disease diagnoses in the generated reports. In real medical scenarios, there are co-occurrence relations among diseases; if these relations are mined and integrated into the feature extraction process, the problem of missing abnormal features can be alleviated. Motivated by this observation, we propose a novel method that improves the extraction of abnormal image features through joint probability graph reasoning. Specifically, to reveal the co-occurrence relations among multiple diseases, we conduct statistical analyses on the dataset and encode the disease relationships into a probability graph. We then devise a graph reasoning network that performs correlation-based reasoning over the medical image features, enabling the model to capture more abnormal features. Furthermore, we introduce a gating mechanism for cross-modal feature fusion into the text generation model, which substantially improves its ability to learn and fuse information from the two modalities, medical images and text. Experimental results on the IU-X-Ray and MIMIC-CXR datasets demonstrate that our approach outperforms previous state-of-the-art methods and generates higher-quality medical image reports.
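A minimal sketch of the graph-reasoning idea is given below, assuming the dataset's disease co-occurrence statistics are available as a K x K probability matrix and that per-disease visual features have already been pooled from the X-ray image. The single GCN-style propagation layer and the feature gate shown here are illustrative assumptions, not the paper's exact architecture.

```python
import torch
import torch.nn as nn

class CooccurrenceGraphReasoning(nn.Module):
    """Sketch: propagate per-disease visual features over a fixed disease
    co-occurrence probability graph, then gate the reasoned features back
    into the original features before report decoding."""
    def __init__(self, feat_dim: int, cooccurrence: torch.Tensor):
        super().__init__()
        # Row-normalise the co-occurrence probabilities so each disease's
        # neighbour weights sum to 1 (a common GCN-style normalisation choice).
        adj = cooccurrence / cooccurrence.sum(dim=-1, keepdim=True).clamp(min=1e-6)
        self.register_buffer("adj", adj)          # (K, K) disease graph
        self.proj = nn.Linear(feat_dim, feat_dim)
        self.gate = nn.Sequential(nn.Linear(2 * feat_dim, feat_dim), nn.Sigmoid())

    def forward(self, disease_feats):
        # disease_feats: (B, K, D) per-disease visual features pooled from the image
        reasoned = torch.relu(self.adj @ self.proj(disease_feats))   # message passing over the graph
        g = self.gate(torch.cat([disease_feats, reasoned], dim=-1))  # cross-feature gate
        return g * reasoned + (1 - g) * disease_feats

if __name__ == "__main__":
    K, D = 14, 256                       # e.g., 14 chest-disease labels (assumed)
    cooc = torch.rand(K, K)              # stand-in for dataset co-occurrence statistics
    reasoner = CooccurrenceGraphReasoning(D, cooc)
    feats = torch.randn(2, K, D)
    print(reasoner(feats).shape)         # torch.Size([2, 14, 256])
```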
