Open Access

Seeing and Reasoning: A Simple Deep Learning Approach to Visual Question Answering

School of Computer Science and Engineering, University of Electronic Science and Technology of China, Chengdu 611731, China

Abstract

Visual Question Answering (VQA) is a complex task that requires a deep understanding of both visual content and natural language questions. The challenge lies in enabling models to recognize and interpret visual elements and to reason through questions in a multi-step, compositional manner. We propose a novel Transformer-based model that introduces specialized tokenization techniques to effectively capture intricate relationships between visual and textual features. The model employs an enhanced self-attention mechanism, enabling it to attend to multiple modalities simultaneously, while a co-attention unit dynamically guides focus to the most relevant image regions and question components. Additionally, a multi-step reasoning module supports iterative inference, allowing the model to excel at complex reasoning tasks. Extensive experiments on benchmark datasets demonstrate the model’s superior performance, with accuracies of 98.6% on CLEVR, 63.78% on GQA, and 68.67% on VQA v2.0. Ablation studies confirm the critical contribution of key components, such as the reasoning module and co-attention mechanism, to the model’s effectiveness. Qualitative analysis of the learned attention distributions further illustrates the model’s dynamic reasoning process, which adapts to task complexity. Overall, our study advances the adaptation of Transformer architectures for VQA, enhancing both reasoning capabilities and model interpretability in visual reasoning tasks.
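
The two ideas the abstract highlights lend themselves to a compact illustration. The following is a minimal PyTorch sketch, not the authors' released implementation: a co-attention unit in which question tokens and image-region features attend to each other, and a multi-step reasoning module that applies the unit repeatedly for iterative inference. The class names, feature dimension (512), number of reasoning steps (4), and answer-vocabulary size (3000) are illustrative assumptions.

# Minimal sketch of co-attention plus multi-step reasoning for VQA.
# All names and hyperparameters below are assumptions for illustration only.

import torch
import torch.nn as nn


class CoAttentionUnit(nn.Module):
    """One co-attention step: each modality queries the other, with residual + norm."""

    def __init__(self, dim: int = 512, heads: int = 8):
        super().__init__()
        self.txt_from_img = nn.MultiheadAttention(dim, heads, batch_first=True)
        self.img_from_txt = nn.MultiheadAttention(dim, heads, batch_first=True)
        self.txt_norm = nn.LayerNorm(dim)
        self.img_norm = nn.LayerNorm(dim)

    def forward(self, txt, img):
        # txt: (B, T, dim) question-token features; img: (B, R, dim) region features
        txt_ctx, _ = self.txt_from_img(txt, img, img)   # question attends to regions
        img_ctx, _ = self.img_from_txt(img, txt, txt)   # regions attend to question
        txt = self.txt_norm(txt + txt_ctx)
        img = self.img_norm(img + img_ctx)
        return txt, img


class IterativeReasoner(nn.Module):
    """Applies the co-attention unit for a fixed number of reasoning steps."""

    def __init__(self, dim: int = 512, steps: int = 4, num_answers: int = 3000):
        super().__init__()
        self.steps = nn.ModuleList(CoAttentionUnit(dim) for _ in range(steps))
        self.classifier = nn.Sequential(
            nn.Linear(2 * dim, dim), nn.ReLU(), nn.Linear(dim, num_answers)
        )

    def forward(self, txt, img):
        for unit in self.steps:                          # multi-step, compositional refinement
            txt, img = unit(txt, img)
        fused = torch.cat([txt.mean(dim=1), img.mean(dim=1)], dim=-1)
        return self.classifier(fused)                    # answer logits over a fixed vocabulary


if __name__ == "__main__":
    model = IterativeReasoner()
    q = torch.randn(2, 14, 512)    # e.g., 14 question tokens
    v = torch.randn(2, 36, 512)    # e.g., 36 detected image regions
    print(model(q, v).shape)       # torch.Size([2, 3000])

A fixed step count mirrors the iterative inference the abstract describes; a natural variant would adapt the number of steps to question complexity, for example by halting once the fused representation stops changing.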

Big Data Mining and Analytics, 2025, 8(2): 458–478

Cite this article:
Zakari RY, Owusu JW, Qin K, et al. Seeing and Reasoning: A Simple Deep Learning Approach to Visual Question Answering. Big Data Mining and Analytics, 2025, 8(2): 458–478. https://doi.org/10.26599/BDMA.2024.9020079


Received: 02 April 2024
Revised: 11 September 2024
Accepted: 21 October 2024
Published: 28 January 2025
© The author(s) 2025.

The articles published in this open access journal are distributed under the terms of the Creative Commons Attribution 4.0 International License (http://creativecommons.org/licenses/by/4.0/).
