Open Access

Seeing and Reasoning: A Simple Deep Learning Approach to Visual Question Answering

School of Computer Science and Engineering, University of Electronic Science and Technology of China, Chengdu 611731, China

Abstract

Visual Question Answering (VQA) is a complex task that requires a deep understanding of both visual content and natural language questions. The challenge lies in enabling models to recognize and interpret visual elements and to reason through questions in a multi-step, compositional manner. We propose a novel Transformer-based model that introduces specialized tokenization techniques to effectively capture intricate relationships between visual and textual features. The model employs an enhanced self-attention mechanism, enabling it to attend to multiple modalities simultaneously, while a co-attention unit dynamically guides focus to the most relevant image regions and question components. Additionally, a multi-step reasoning module supports iterative inference, allowing the model to excel at complex reasoning tasks. Extensive experiments on benchmark datasets demonstrate the model’s superior performance, with accuracies of 98.6% on CLEVR, 63.78% on GQA, and 68.67% on VQA v2.0. Ablation studies confirm the critical contribution of key components, such as the reasoning module and co-attention mechanism, to the model’s effectiveness. Qualitative analysis of the learned attention distributions further illustrates the model’s dynamic reasoning process, which adapts to task complexity. Overall, our study advances the adaptation of Transformer architectures for VQA, enhancing both reasoning capabilities and model interpretability in visual reasoning tasks.
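
The two ideas the abstract highlights lend themselves to a compact illustration. The following is a minimal PyTorch sketch, not the authors' released implementation: a co-attention unit in which question tokens and image-region features attend to each other, and a multi-step reasoning module that applies the unit repeatedly for iterative inference. The class names, feature dimension (512), number of reasoning steps (4), and answer-vocabulary size (3000) are illustrative assumptions.

# Minimal sketch of co-attention plus multi-step reasoning for VQA.
# All names and hyperparameters below are assumptions for illustration only.

import torch
import torch.nn as nn


class CoAttentionUnit(nn.Module):
    """One co-attention step: each modality queries the other, with residual + norm."""

    def __init__(self, dim: int = 512, heads: int = 8):
        super().__init__()
        self.txt_from_img = nn.MultiheadAttention(dim, heads, batch_first=True)
        self.img_from_txt = nn.MultiheadAttention(dim, heads, batch_first=True)
        self.txt_norm = nn.LayerNorm(dim)
        self.img_norm = nn.LayerNorm(dim)

    def forward(self, txt, img):
        # txt: (B, T, dim) question-token features; img: (B, R, dim) region features
        txt_ctx, _ = self.txt_from_img(txt, img, img)   # question attends to regions
        img_ctx, _ = self.img_from_txt(img, txt, txt)   # regions attend to question
        txt = self.txt_norm(txt + txt_ctx)
        img = self.img_norm(img + img_ctx)
        return txt, img


class IterativeReasoner(nn.Module):
    """Applies the co-attention unit for a fixed number of reasoning steps."""

    def __init__(self, dim: int = 512, steps: int = 4, num_answers: int = 3000):
        super().__init__()
        self.steps = nn.ModuleList(CoAttentionUnit(dim) for _ in range(steps))
        self.classifier = nn.Sequential(
            nn.Linear(2 * dim, dim), nn.ReLU(), nn.Linear(dim, num_answers)
        )

    def forward(self, txt, img):
        for unit in self.steps:                          # multi-step, compositional refinement
            txt, img = unit(txt, img)
        fused = torch.cat([txt.mean(dim=1), img.mean(dim=1)], dim=-1)
        return self.classifier(fused)                    # answer logits over a fixed vocabulary


if __name__ == "__main__":
    model = IterativeReasoner()
    q = torch.randn(2, 14, 512)    # e.g., 14 question tokens
    v = torch.randn(2, 36, 512)    # e.g., 36 detected image regions
    print(model(q, v).shape)       # torch.Size([2, 3000])

A fixed step count mirrors the iterative inference the abstract describes; a natural variant would adapt the number of steps to question complexity, for example by halting once the fused representation stops changing.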

Big Data Mining and Analytics, 2025, 8(2): 458–478

Cite this article:
Zakari RY, Owusu JW, Qin K, et al. Seeing and Reasoning: A Simple Deep Learning Approach to Visual Question Answering. Big Data Mining and Analytics, 2025, 8(2): 458–478. https://doi.org/10.26599/BDMA.2024.9020079


Received: 02 April 2024
Revised: 11 September 2024
Accepted: 21 October 2024
Published: 28 January 2025
© The author(s) 2025.

The articles published in this open access journal are distributed under the terms of the Creative Commons Attribution 4.0 International License (http://creativecommons.org/licenses/by/4.0/).
