[2]
C. W. Kuo and Z. Kira, Beyond a pre-trained object detector: Cross-modal textual and visual context for image captioning, in Proc. IEEE/CVF Conf. Computer Vision and Pattern Recognition, New Orleans, LA, USA, 2022, pp. 17948−17958.
[3]
L. Zhou, C. Xu, P. Koch, and J. J. Corso, Watch what you just said: Image captioning with text-conditional attention, in Proc. Thematic Workshops of ACM Multimedia, Mountain View, CA, USA, 2017, pp. 305−313.
[4]
H. Diao, Y. Zhang, L. Ma, and H. Lu, Similarity reasoning and filtration for image-text matching, in Proc. 35th AAAI Conf. Artificial Intelligence, Vancouver, Canada, 2021, pp. 1218−1226.
[7]
Z. Yu, J. Yu, Y. Cui, D. Tao, and Q. Tian, Deep modular co-attention networks for visual question answering, in Proc. IEEE/CVF Conf. Computer Vision and Pattern Recognition, Long Beach, CA, USA, 2019, pp. 6274−6283.
[8]
L. Huang, W. Wang, J. Chen, and X. Y. Wei, Attention on attention for image captioning, in Proc. IEEE/CVF Int. Conf. Computer Vision, Seoul, Republic of Korea, 2019, pp. 4634−4643.
[9]
M. Ghorbani, M. S. Baghshah, and H. R. Rabiee, MGCN: Semi-supervised classification in multi-layer graphs with graph convolutional networks, in Proc. 2019 IEEE/ACM Int. Conf. Advances in Social Networks Analysis and Mining, Vancouver, Canada, 2019, pp. 208−211.
[10]
S. Antol, A. Agrawal, J. Lu, M. Mitchell, D. Batra, C. L. Zitnick, and D. Parikh, VQA: Visual question answering, in Proc. IEEE Int. Conf. Computer Vision, Santiago, Chile, 2015, pp. 2425−2433.
[11]
L. Nie, M. Wang, Z. Zha, G. Li, and T. S. Chua, Multimedia answering: Enriching text QA with media information, in Proc. 34th Int. ACM SIGIR Conf. Research and Development in Information Retrieval, Beijing, China, 2011, pp. 695−704.
[12]
W. He, Z. Li, D. Lu, E. Chen, T. Xu, B. Huai, and J. Yuan, Multimodal dialogue systems via capturing context-aware dependencies of semantic elements, in Proc. 28th ACM Int. Conf. Multimedia, Seattle, WA, USA, 2020, pp. 2755−2764.
[13]
J. H. Kim, J. Jun, and B. T. Zhang, Bilinear attention networks, in Proc. 32nd Int. Conf. Neural Information Processing Systems, Montreal, Canada, 2018, pp. 1571−1581.
[14]
P. Anderson, X. He, C. Buehler, D. Teney, M. Johnson, S. Gould, and L. Zhang, Bottom-up and top-down attention for image captioning and visual question answering, in Proc. IEEE/CVF Conf. Computer Vision and Pattern Recognition, Salt Lake City, UT, USA, 2018, pp. 6077−6086.
[15]
M. Malinowski and M. Fritz, A multi-world approach to question answering about real-world scenes based on uncertain input, in Proc. 27th Int. Conf. Neural Information Processing Systems, Montreal, Canada, 2014, pp. 1682−1690.
[16]
J. Andreas, M. Rohrbach, T. Darrell, and D. Klein, Neural module networks, in Proc. IEEE Conf. Computer Vision and Pattern Recognition, Las Vegas, NV, USA, 2016, pp. 39−48.
[17]
R. Hu, J. Andreas, M. Rohrbach, T. Darrell, and K. Saenko, Learning to reason: End-to-end module networks for visual question answering, in Proc. IEEE Int. Conf. Computer Vision, Venice, Italy, 2017, pp. 804−813.
[18]
Z. Yang, X. He, J. Gao, L. Deng, and A. Smola, Stacked attention networks for image question answering, in Proc. IEEE Conf. Computer Vision and Pattern Recognition, Las Vegas, NV, USA, 2016, pp. 21−29.
[19]
M. Malinowski, C. Doersch, A. Santoro, and P. Battaglia, Learning visual question answering by bootstrapping hard attention, in Proc. 15th European Conf. Computer Vision (ECCV), Munich, Germany, 2018, pp. 3−20.
[20]
S. Ren, K. He, R. Girshick, and J. Sun, Faster R-CNN: Towards real-time object detection with region proposal networks, in Proc. 28th Int. Conf. Neural Information Processing Systems, Montreal, Canada, 2015, pp. 91−99.
[21]
Y. Jiang, V. Natarajan, X. Chen, M. Rohrbach, D. Batra, and D. Parikh, Pythia v0.1: The winning entry to the VQA challenge 2018, arXiv preprint arXiv: 1807.09956, 2018.
[22]
A. Vaswani, N. Shazeer, N. Parmar, J. Uszkoreit, L. Jones, A. N. Gomez, Ł. Kaiser, and I. Polosukhin, Attention is all you need, in Proc. 31st Int. Conf. Neural Information Processing Systems, Long Beach, CA, USA, 2017, pp. 6000−6010.
[23]
P. Gao, Z. Jiang, H. You, P. Lu, S. C. H. Hoi, X. Wang, and H. Li, Dynamic fusion with intra- and inter-modality attention flow for visual question answering, in Proc. IEEE/CVF Conf. Computer Vision and Pattern Recognition, Long Beach, CA, USA, 2019, pp. 6632−6641.
[24]
P. Gao, H. You, Z. Zhang, X. Wang, and H. Li, Multi-modality latent interaction network for visual question answering, in Proc. IEEE/CVF Int. Conf. Computer Vision, Seoul, Republic of Korea, 2019, pp. 5824−5834.
[25]
R. Cadene, H. Ben-Younes, M. Cord, and N. Thome, MUREL: Multimodal relational reasoning for visual question answering, in Proc. IEEE/CVF Conf. Computer Vision and Pattern Recognition, Long Beach, CA, USA, 2019, pp. 1989−1998.
[27]
W. Cai, Y. Wang, J. Ma, and Q. Jin, CAN: Effective cross features by global attention mechanism and neural network for ad click prediction, Tsinghua Science and Technology, vol. 27, no. 1, pp. 186−195, 2022.
[29]
D. K. Nguyen and T. Okatani, Improved fusion of visual and language representations by dense symmetric co-attention for visual question answering, in Proc. IEEE/CVF Conf. Computer Vision and Pattern Recognition, Salt Lake City, UT, USA, 2018, pp. 6087−6096.
[30]
D. Teney, P. Anderson, X. He, and A. Van Den Hengel, Tips and tricks for visual question answering: Learnings from the 2017 challenge, in Proc. IEEE/CVF Conf. Computer Vision and Pattern Recognition, Salt Lake City, UT, USA, 2018, pp. 4223−4232.
[33]
I. Ilievski, S. Yan, and J. Feng, A focused dynamic attention model for visual question answering, arXiv preprint arXiv: 1604.01485, 2016.
[35]
C. Zhu, Y. Zhao, S. Huang, K. Tu, and Y. Ma, Structured attentions for visual question answering, in Proc. IEEE Int. Conf. Computer Vision, Venice, Italy, 2017, pp. 1300−1309.
[36]
A. Fukui, D. H. Park, D. Yang, A. Rohrbach, T. Darrell, and M. Rohrbach, Multimodal compact bilinear pooling for visual question answering and visual grounding, in Proc. Conf. Empirical Methods in Natural Language Processing, Austin, TX, USA, 2016, pp. 457−468.
[37]
Z. Zhang, L. Liao, M. Huang, X. Zhu, and T. S. Chua, Neural multimodal belief tracker with adaptive attention for dialogue systems, in Proc. World Wide Web Conf., San Francisco, CA, USA, 2019, pp. 2401−2412.
[38]
T. Rahman, S. H. Chou, L. Sigal, and G. Carenini, An improved attention for visual question answering, in Proc. IEEE/CVF Conf. Computer Vision and Pattern Recognition Workshops, Nashville, TN, USA, 2021, pp. 1653−1662.
[41]
T. Zhang, H. Huang, C. Feng, and L. Cao, Enlivening redundant heads in multi-head self-attention for machine translation, in Proc. 2021 Conf. Empirical Methods in Natural Language Processing, Punta Cana, Dominican Republic, 2021, pp. 3238−3248.
[42]
L. Guo, J. Liu, X. Zhu, P. Yao, S. Lu, and H. Lu, Normalized and geometry-aware self-attention network for image captioning, in Proc. IEEE/CVF Conf. Computer Vision and Pattern Recognition, Seattle, WA, USA, 2020, pp. 10324−10333.
[43]
H. Zhao, J. Jia, and V. Koltun, Exploring self-attention for image recognition, in Proc. IEEE/CVF Conf. Computer Vision and Pattern Recognition, Seattle, WA, USA, 2020, pp. 10073−10082.
[44]
J. Gao, X. Liu, Y. Chen, and F. Xiong, MHGCN: Multiview highway graph convolutional network for cross-lingual entity alignment, Tsinghua Science and Technology, vol. 27, no. 4, pp. 719−728, 2022.
[47]
Y. Zhong, L. Wang, J. Chen, D. Yu, and Y. Li, Comprehensive image captioning via scene graph decomposition, in Proc. 16th European Conf. Computer Vision, Glasgow, UK, 2020, pp. 211−229.
[48]
J. Gu, S. Joty, J. Cai, H. Zhao, X. Yang, and G. Wang, Unpaired image captioning via scene graph alignments, in Proc. IEEE/CVF Int. Conf. Computer Vision, Seoul, Republic of Korea, 2019, pp. 10322−10331.
[49]
D. Paschalidou, A. Kar, M. Shugrina, K. Kreis, A. Geiger, and S. Fidler, ATISS: Autoregressive transformers for indoor scene synthesis, in Proc. 35th Int. Conf. Neural Information Processing Systems, Montreal, Canada, 2021, p. 919.
[50]
C. Zhang, J. Yu, Y. Song, and W. Cai, Exploiting edge-oriented reasoning for 3D point-based scene graph analysis, in Proc. IEEE/CVF Conf. Computer Vision and Pattern Recognition, Nashville, TN, USA, 2021, pp. 9700−9710.
[52]
C. Wang, B. Samari, V. G. Kim, S. Chaudhuri, and K. Siddiqi, Affinity graph supervision for visual recognition, in Proc. IEEE/CVF Conf. Computer Vision and Pattern Recognition, Seattle, WA, USA, 2020, pp. 8244−8252.
[53]
J. Devlin, M. W. Chang, K. Lee, and K. Toutanova, BERT: Pre-training of deep bidirectional transformers for language understanding, in Proc. 2019 Conf. North American Chapter of the Association for Computational Linguistics: Human Language Technologies, Minneapolis, MN, USA, 2019, pp. 4171−4186.
[54]
R. Girshick, Fast R-CNN, in Proc. IEEE Int. Conf. Computer Vision, Santiago, Chile, 2015, pp. 1440−1448.
[55]
J. Lu, C. Wu, L. Wang, S. Yuan, and J. Wu, Nested attention network with graph filtering for visual question and answering, in Proc. 2023 IEEE Int. Conf. Acoustics, Speech and Signal Processing (ICASSP), Rhodes Island, Greece, 2023, pp. 1−5.
[56]
P. Velickovic, G. Cucurull, A. Casanova, A. Romero, P. Liò, and Y. Bengio, Graph attention networks, in Proc. 6th Int. Conf. Learning Representations, Vancouver, Canada, 2018, pp. 49−54.
[57]
T. Y. Lin, M. Maire, S. Belongie, J. Hays, P. Perona, D. Ramanan, P. Dollár, and C. L. Zitnick, Microsoft COCO: Common objects in context, in Proc. 13th European Conf. Computer Vision, Zurich, Switzerland, 2014, pp. 740−755.
[58]
Y. Goyal, T. Khot, D. Summers-Stay, D. Batra, and D. Parikh, Making the V in VQA matter: Elevating the role of image understanding in visual question answering, in Proc. IEEE Conf. Computer Vision and Pattern Recognition, Honolulu, HI, USA, 2017, pp. 6325−6334.
[59]
J. Pennington, R. Socher, and C. Manning, GloVe: Global vectors for word representation, in Proc. 2014 Conf. Empirical Methods in Natural Language Processing (EMNLP), Doha, Qatar, 2014, pp. 1532−1543.
[60]
D. P. Kingma and J. Ba, Adam: A method for stochastic optimization, arXiv preprint arXiv: 1412.6980, 2014.
[62]
J. H. Kim, K. W. On, W. Lim, J. Kim, J. W. Ha, and B. T. Zhang, Hadamard product for low-rank bilinear pooling, arXiv preprint arXiv: 1610.04325, 2016.
[63]
Z. Yu, J. Yu, J. Fan, and D. Tao, Multi-modal factorized bilinear pooling with co-attention learning for visual question answering, in Proc. IEEE Int. Conf. Computer Vision, Venice, Italy, 2017, pp. 1839−1848.
[65]
M. Stefanini, M. Cornia, L. Baraldi, and R. Cucchiara, A novel attention-based aggregation function to combine vision and language, in Proc. 25th Int. Conf. Pattern Recognition (ICPR), Milan, Italy, 2021, pp. 1212−1219.
[67]
Y. Qian, Y. Hu, R. Wang, F. Feng, and X. Wang, Question-driven graph fusion network for visual question answering, in Proc. IEEE Int. Conf. Multimedia and Expo (ICME), Taipei, China, 2022, pp. 1−6.