Open Access

Dual-Modality Integration Attention with Graph-Based Feature Extraction for Visual Question and Answering

College of Computer Science and Technology, China University of Petroleum (East China), Qingdao 266580, China
School of Cyber Engineering, Xidian University, Xi’an 710126, China

Abstract

Visual Question and Answering (VQA) has garnered significant attention as a domain that requires the synthesis of visual and textual information to produce accurate responses. While existing methods often rely on Convolutional Neural Networks (CNNs) for feature extraction and attention mechanisms for embedding learning, they frequently fail to capture the nuanced interactions between entities within images, leading to potential ambiguities in answer generation. In this paper, we introduce a novel network architecture, Dual-modality Integration Attention with Graph-based Feature Extraction (DIAGFE), which addresses these limitations by incorporating two key innovations: a Graph-based Feature Extraction (GFE) module that enhances the precision of visual semantics extraction, and a Dual-modality Integration Attention (DIA) mechanism that efficiently fuses visual and question features to guide the model towards more accurate answer generation. Our model is trained with a composite loss function to refine its predictive accuracy. Rigorous experiments on the VQA2.0 dataset demonstrate that DIAGFE outperforms existing methods, underscoring the effectiveness of our approach in advancing VQA research and its potential for cross-modal understanding.
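The abstract describes the architecture only at a high level: a GFE module that refines region-level visual features through graph-style reasoning, and a DIA mechanism that fuses the refined visual features with the question representation before answer classification. The exact formulation is not given here, so the following is a minimal PyTorch sketch of one way such a pipeline could be wired together. It is not the authors' implementation: the module names, the 512-dimensional features, the 36 regions per image, and the 3129-way answer classifier (a vocabulary size commonly used for VQA2.0) are all illustrative assumptions.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F


class GraphFeatureExtraction(nn.Module):
    """Sketch of a graph-based feature extraction (GFE) step: region
    features act as graph nodes and are refined by one attention-weighted
    message-passing layer (illustrative, not the paper's exact module)."""
    def __init__(self, dim):
        super().__init__()
        self.proj = nn.Linear(dim, dim)

    def forward(self, regions):                  # regions: (B, N, dim)
        # Pairwise affinities between regions serve as soft edge weights.
        adj = torch.softmax(
            regions @ regions.transpose(1, 2) / regions.size(-1) ** 0.5, dim=-1)
        # Aggregate neighbour information, keeping a residual connection.
        return regions + F.relu(self.proj(adj @ regions))


class DualModalityAttention(nn.Module):
    """Sketch of a dual-modality integration attention (DIA) step: the
    question vector attends over the refined regions, and the attended
    visual summary is fused with the question representation."""
    def __init__(self, dim):
        super().__init__()
        self.score = nn.Linear(dim, 1)
        self.fuse = nn.Linear(2 * dim, dim)

    def forward(self, regions, question):        # question: (B, dim)
        weights = torch.softmax(
            self.score(regions * question.unsqueeze(1)), dim=1)  # (B, N, 1)
        visual = (weights * regions).sum(dim=1)                  # (B, dim)
        return F.relu(self.fuse(torch.cat([visual, question], dim=-1)))


class DIAGFESketch(nn.Module):
    """Toy end-to-end pipeline: GFE -> DIA -> answer classifier."""
    def __init__(self, dim=512, num_answers=3129):
        super().__init__()
        self.gfe = GraphFeatureExtraction(dim)
        self.dia = DualModalityAttention(dim)
        self.classifier = nn.Linear(dim, num_answers)

    def forward(self, regions, question):
        fused = self.dia(self.gfe(regions), question)
        return self.classifier(fused)             # answer logits


# Toy usage: 36 region features and one question embedding per image.
model = DIAGFESketch()
logits = model(torch.randn(2, 36, 512), torch.randn(2, 512))
print(logits.shape)  # torch.Size([2, 3129])
```

In this sketch the composite loss mentioned in the abstract would simply be applied to the returned logits during training; how the paper actually combines its loss terms is not specified in the material shown here.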

Tsinghua Science and Technology
Pages 2133-2145
Cite this article:
Lu J, Wu C, Wang L, et al. Dual-Modality Integration Attention with Graph-Based Feature Extraction for Visual Question and Answering. Tsinghua Science and Technology, 2025, 30(5): 2133-2145. https://doi.org/10.26599/TST.2024.9010093


Received: 30 December 2023
Revised: 24 February 2024
Accepted: 14 May 2024
Published: 29 April 2025
© The Author(s) 2025.

The articles published in this open access journal are distributed under the terms of the Creative Commons Attribution 4.0 International License (http://creativecommons.org/licenses/by/4.0/).
