Open Access

Prompting Large Language Models with Knowledge-Injection for Knowledge-Based Visual Question Answering

School of Computer Science and Engineering, Southeast University, and also with the Key Laboratory of Computer Network and Information Integration (Southeast University), Ministry of Education of the People’s Republic of China, Nanjing 211189, China
Southeast University - Monash University Joint Graduate School (Suzhou), Southeast University, Suzhou 215125, China

Abstract

Previous works employ Large Language Models (LLMs) such as GPT-3 for knowledge-based Visual Question Answering (VQA). We argue that the inferential capacity of an LLM can be enhanced through knowledge injection. Although methods that use knowledge graphs to enhance LLMs have been explored in various tasks, they have limitations, such as failing to retrieve the required knowledge. In this paper, we introduce a novel framework for knowledge-based VQA titled “Prompting Large Language Models with Knowledge-Injection” (PLLMKI). We use a vanilla VQA model to inspire the LLM and further enhance it with knowledge injection. Unlike earlier approaches, we adopt an LLM for knowledge enhancement instead of relying on knowledge graphs. Furthermore, we leverage open LLMs, incurring no additional costs. Compared with existing baselines, our approach improves accuracy by over 1.3 and 1.7 points on two knowledge-based VQA datasets, OK-VQA and A-OKVQA, respectively.
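To make the two-stage idea in the abstract concrete, the sketch below shows one plausible shape of a PLLMKI-style pipeline: candidate answers from a vanilla VQA model and knowledge generated by an open LLM (rather than retrieved from a knowledge graph) are both injected into the final answering prompt. All names here (build_knowledge_prompt, knowledge_llm, answer_llm, vqa_candidates) are hypothetical stand-ins for illustration, not the authors' released code; this is a minimal sketch under those assumptions.

```python
# Minimal sketch of a PLLMKI-style two-stage prompting pipeline.
# The model wrappers (knowledge_llm, answer_llm) and the prompt wording
# are assumptions for illustration only.

from typing import Callable, List


def build_knowledge_prompt(caption: str, question: str) -> str:
    # Stage 1: ask an open LLM to generate question-related background
    # knowledge instead of retrieving it from a knowledge graph.
    return (
        f"Context: {caption}\n"
        f"Question: {question}\n"
        "List the background facts needed to answer the question."
    )


def build_answer_prompt(caption: str, question: str,
                        candidates: List[str], knowledge: str) -> str:
    # Stage 2: inject the generated knowledge and the vanilla-VQA
    # candidate answers into the final answering prompt.
    return (
        f"Context: {caption}\n"
        f"Knowledge: {knowledge}\n"
        f"Candidate answers: {', '.join(candidates)}\n"
        f"Question: {question}\n"
        "Answer with a short phrase."
    )


def answer_question(image_caption: str, question: str,
                    vqa_candidates: List[str],
                    knowledge_llm: Callable[[str], str],
                    answer_llm: Callable[[str], str]) -> str:
    # Chain the two prompting stages: knowledge generation, then answering.
    knowledge = knowledge_llm(build_knowledge_prompt(image_caption, question))
    return answer_llm(build_answer_prompt(image_caption, question,
                                          vqa_candidates, knowledge))


if __name__ == "__main__":
    # Toy stand-in LLM so the sketch runs without any model weights:
    # it simply echoes the last line of the prompt it receives.
    echo_llm = lambda prompt: prompt.splitlines()[-1]
    print(answer_question("a man holding an umbrella in the rain",
                          "What season is it likely to be?",
                          ["spring", "autumn"], echo_llm, echo_llm))
```

In practice the two callables would wrap open LLMs (e.g., LLaMA-family chat models) and the candidate list would come from a vanilla VQA model's top predictions; the point of the sketch is only the prompt composition, not any specific model choice.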

Big Data Mining and Analytics, Pages 843-857
Cite this article:
Hu Z, Yang P, Liu F, et al. Prompting Large Language Models with Knowledge-Injection for Knowledge-Based Visual Question Answering. Big Data Mining and Analytics, 2024, 7(3): 843-857. https://doi.org/10.26599/BDMA.2024.9020026

Received: 24 February 2024
Revised: 01 April 2024
Accepted: 07 April 2024
Published: 28 August 2024
© The author(s) 2024.

The articles published in this open access journal are distributed under the terms of the Creative Commons Attribution 4.0 International License (http://creativecommons.org/licenses/by/4.0/).
