[1]
S. Antol, A. Agrawal, J. Lu, M. Mitchell, D. Batra, C. L. Zitnick, and D. Parikh, VQA: Visual question answering, in Proc. 2015 IEEE Int. Conf. Computer Vision, Santiago, Chile, 2015, pp. 2425–2433.
[2]
R. V. Yampolskiy, AI-complete, AI-hard, or AI-easy: Classification of problems in artificial intelligence, in Proc. 23rd Midwest Artificial Intelligence and Cognitive Science Conf., Cincinnati, OH, USA, https://ceur-ws.org/Vol-841/submission_3.pdf, 2012.
[3]
R. Y. Zakari, J. W. Owusu, H. Wang, K. Qin, Z. K. Lawal, and Y. Dong, VQA and visual reasoning: An overview of recent datasets, methods and challenges, arXiv preprint arXiv: 2212.13296, 2022.
[4]
Z. Yu, J. Yu, Y. Cui, D. Tao, and Q. Tian, Deep modular co-attention networks for visual question answering, in Proc. 2019 IEEE/CVF Conf. Computer Vision and Pattern Recognition, Long Beach, CA, USA, 2019, pp. 6274–6283.
[5]
Z. Yang, X. He, J. Gao, L. Deng, and A. Smola, Stacked attention networks for image question answering, in Proc. 2016 IEEE Conf. Computer Vision and Pattern Recognition, Las Vegas, NV, USA, 2016, pp. 21–29.
[6]
A. Urooj, A. Mazaheri, N. da V. Lobo, and M. Shah, MMFT-BERT: Multimodal fusion transformer with BERT encodings for visual question answering, in Proc. Findings of the Association for Computational Linguistics: EMNLP 2020, Virtual Event, 2020, pp. 4648–4660.
[9]
V. Srisupavanich, Multimodal learning and reasoning for visual question answering, master's thesis, University of Southampton, https://github.com/markvasin/MSc-Project, 2020.
[10]
J. Johnson, B. Hariharan, L. van der Maaten, F. F. Li, C. L. Zitnick, and R. Girshick, CLEVR: A diagnostic dataset for compositional language and elementary visual reasoning, in Proc. 2017 IEEE Conf. Computer Vision and Pattern Recognition, Honolulu, HI, USA, 2017, pp. 1988–1997.
[11]
V. Kazemi and A. Elqursh, Show, ask, attend, and answer: A strong baseline for visual question answering, arXiv preprint arXiv: 1704.03162, 2017.
[12]
P. Anderson, X. He, C. Buehler, D. Teney, M. Johnson, S. Gould, and L. Zhang, Bottom-up and top-down attention for image captioning and visual question answering, in Proc. 2018 IEEE/CVF Conf. Computer Vision and Pattern Recognition, Salt Lake City, UT, USA, 2018, pp. 6077–6086.
[13]
H. Xu and K. Saenko, Ask, attend and answer: Exploring question-guided spatial attention for visual question answering, in Proc. 14th European Conf. Computer Vision, Amsterdam, The Netherlands, 2016, pp. 451–466.
[14]
S. Ren, K. He, R. Girshick, and J. Sun, Faster R-CNN: Towards real-time object detection with region proposal networks, in Proc. 28th Int. Conf. Neural Information Processing Systems, Montreal, Canada, 2015, pp. 91–99.
[15]
D. A. Hudson and C. D. Manning, Compositional attention networks for machine reasoning, arXiv preprint arXiv: 1803.03067, 2018.
[16]
J. Mao, C. Gan, P. Kohli, J. B. Tenenbaum, and J. Wu, The neuro-symbolic concept learner: Interpreting scenes, words, and sentences from natural supervision, arXiv preprint arXiv: 1904.12584, 2019.
[17]
A. d'Avila Garcez, M. Gori, L. C. Lamb, L. Serafini, M. Spranger, and S. N. Tran, Neural-symbolic computing: An effective methodology for principled integration of machine learning and reasoning, arXiv preprint arXiv: 1905.06088, 2019.
[18]
A. Graves, G. Wayne, and I. Danihelka, Neural Turing machines, arXiv preprint arXiv: 1410.5401, 2014.
[19]
S. Sukhbaatar, A. Szlam, J. Weston, and R. Fergus, End-to-end memory networks, arXiv preprint arXiv: 1503.08895, 2015.
[20]
P. W. Battaglia, J. B. Hamrick, V. Bapst, A. Sanchez-Gonzalez, V. Zambaldi, M. Malinowski, A. Tacchetti, D. Raposo, A. Santoro, R. Faulkner, et al., Relational inductive biases, deep learning, and graph networks, arXiv preprint arXiv: 1806.01261, 2018.
[21]
C. Ma, C. Shen, A. Dick, Q. Wu, P. Wang, A. van den Hengel, and I. Reid, Visual question answering with memory-augmented networks, in Proc. 2018 IEEE/CVF Conf. Computer Vision and Pattern Recognition, Salt Lake City, UT, USA, 2018, pp. 6975–6984.
[22]
D. A. Hudson and C. D. Manning, Learning by abstraction: The neural state machine, arXiv preprint arXiv: 1907.03950, 2019.
[23]
J. Andreas, M. Rohrbach, T. Darrell, and D. Klein, Neural module networks, in Proc. 2016 IEEE Conf. Computer Vision and Pattern Recognition, Las Vegas, NV, USA, 2016, pp. 39–48.
[24]
J. Johnson, B. Hariharan, L. van der Maaten, J. Hoffman, F. F. Li, C. L. Zitnick, and R. Girshick, Inferring and executing programs for visual reasoning, in Proc. 2017 IEEE Int. Conf. Computer Vision, Venice, Italy, 2017, pp. 3008–3017.
[25]
R. Hu, J. Andreas, M. Rohrbach, T. Darrell, and K. Saenko, Learning to reason: End-to-end module networks for visual question answering, in Proc. 2017 IEEE Int. Conf. Computer Vision, Venice, Italy, 2017, pp. 804–813.
[26]
J. Pennington, R. Socher, and C. Manning, GloVe: Global vectors for word representation, in Proc. 2014 Conf. Empirical Methods in Natural Language Processing (EMNLP), Doha, Qatar, 2014, pp. 1532–1543.
[27]
J. Lu, D. Batra, D. Parikh, and S. Lee, ViLBERT: Pretraining task-agnostic visiolinguistic representations for vision-and-language tasks, in Proc. 33rd Int. Conf. Neural Information Processing Systems, Vancouver, Canada, 2019, pp. 13–23.
[28]
H. Tan and M. Bansal, LXMERT: Learning cross-modality encoder representations from transformers, arXiv preprint arXiv: 1908.07490, 2019.
[29]
J. Devlin, M. W. Chang, K. Lee, and K. Toutanova, BERT: Pre-training of deep bidirectional transformers for language understanding, arXiv preprint arXiv: 1810.04805, 2019.
[30]
A. Santoro, D. Raposo, D. G. T. Barrett, M. Malinowski, R. Pascanu, P. Battaglia, and T. Lillicrap, A simple neural network module for relational reasoning, in Proc. 31st Int. Conf. Neural Information Processing Systems, Long Beach, CA, USA, 2017, pp. 4974–4983.
[31]
A. Paszke, S. Gross, F. Massa, A. Lerer, J. Bradbury, G. Chanan, T. Killeen, Z. Lin, N. Gimelshein, L. Antiga, et al., PyTorch: An imperative style, high-performance deep learning library, in Proc. 33rd Int. Conf. Neural Information Processing Systems, Vancouver, Canada, 2019, pp. 8026–8037.
[32]
A. Singh, V. Goswami, V. Natarajan, Y. Jiang, X. Chen, M. Shah, M. Rohrbach, D. Batra, and D. Parikh, MMF: A multimodal framework for vision and language research, https://github.com/facebookresearch/mmf, 2020.
[33]
Z. Yu, Y. Cui, Z. Shao, P. Gao, and J. Yu, OpenVQA, https://github.com/MILVLG/openvqa, 2019.
[34]
T. Wolf, L. Debut, V. Sanh, J. Chaumond, C. Delangue, A. Moi, P. Cistac, T. Rault, R. Louf, M. Funtowicz, et al., HuggingFace's Transformers: State-of-the-art natural language processing, arXiv preprint arXiv: 1910.03771, 2020.
[35]
D. A. Hudson and C. D. Manning, GQA: A new dataset for real-world visual reasoning and compositional question answering, in Proc. 2019 IEEE/CVF Conf. Computer Vision and Pattern Recognition, Long Beach, CA, USA, 2019, pp. 6693–6702.
[36]
Y. Goyal, T. Khot, D. Summers-Stay, D. Batra, and D. Parikh, Making the V in VQA matter: Elevating the role of image understanding in visual question answering, in Proc. 2017 IEEE Conf. Computer Vision and Pattern Recognition, Honolulu, HI, USA, 2017, pp. 6325–6334.
[37]
E. Perez, F. Strub, H. de Vries, V. Dumoulin, and A. Courville, FiLM: Visual reasoning with a general conditioning layer, in Proc. 32nd AAAI Conf. Artificial Intelligence, New Orleans, LA, USA, 2018, pp. 3942–3951.
[41]
Y. Zhou, T. Ren, C. Zhu, X. Sun, J. Liu, X. Ding, M. Xu, and R. Ji, TRAR: Routing the attention spans in transformer for visual question answering, in Proc. 2021 IEEE/CVF Int. Conf. Computer Vision, Montreal, Canada, 2021, pp. 2054–2064.
[42]
W. Zhang, J. Yu, W. Zhao, and C. Ran, DMRFNet: Deep multimodal reasoning and fusion for visual question answering and explanation generation, Inf. Fusion, vol. 72, pp. 70–79, 2021.
[44]
P. Banerjee, T. Gokhale, Y. Yang, and C. Baral, Weakly supervised relative spatial reasoning for visual question answering, in Proc. 2021 IEEE/CVF Int. Conf. Computer Vision, Montreal, Canada, 2021, pp. 1988–1998.
[45]
D. K. Nguyen and T. Okatani, Improved fusion of visual and language representations by dense symmetric co-attention for visual question answering, in Proc. 2018 IEEE/CVF Conf. Computer Vision and Pattern Recognition, Salt Lake City, UT, USA, 2018, pp. 6087–6096.
[46]
Q. Sun and Y. Fu, Stacked self-attention networks for visual question answering, in Proc. 2019 Int. Conf. Multimedia Retrieval, Ottawa, Canada, 2019, pp. 207–211.
[49]
R. Cadene, H. Ben-Younes, M. Cord, and N. Thome, MUREL: Multimodal relational reasoning for visual question answering, in Proc. 2019 IEEE/CVF Conf. Computer Vision and Pattern Recognition, Long Beach, CA, USA, 2019, pp. 1989–1998.
[53]
K. Xu, J. Ba, R. Kiros, K. Cho, A. Courville, R. Salakhutdinov, R. Zemel, and Y. Bengio, Show, attend and tell: Neural image caption generation with visual attention, in Proc. 32nd Int. Conf. Machine Learning, Lille, France, 2015, pp. 2048–2057.