
Cross-Modal Complementary Network with Hierarchical Fusion for Multimodal Sentiment Classification

Cheng Peng, Chunxia Zhang (corresponding author), Xiaojun Xue, Jiameng Gao, Hongjian Liang, and Zhengdong Niu
School of Computer Science and Technology, Beijing Institute of Technology, Beijing 100081, China
School of Information, Production and Systems, Waseda University, Fukuoka 808-0135, Japan

Abstract

Multimodal Sentiment Classification (MSC) uses multimodal data, such as images and texts, to identify the sentiment polarities of users from the content they post on the Internet. MSC has attracted considerable attention because of its wide applications in social computing and opinion mining. However, improper correlation strategies can cause erroneous fusion, since texts and images that are unrelated to each other may be integrated. Moreover, simply concatenating features modality by modality, even when the correlation is genuine, cannot fully capture the features within and between modalities. To solve these problems, this paper proposes a Cross-Modal Complementary Network (CMCN) with hierarchical fusion for MSC. The CMCN is designed as a hierarchical structure with three key modules: a feature extraction module that extracts features from texts and images, a feature attention module that learns both text and image attention features guided by an image-text correlation generator, and a cross-modal hierarchical fusion module that fuses features within and between modalities. Such a CMCN provides a hierarchical fusion framework that can fully integrate different modal features and helps reduce the risk of integrating unrelated modal features. Extensive experimental results on three public datasets show that the proposed approach significantly outperforms state-of-the-art methods.
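To make the three-module design concrete, the following PyTorch-style sketch illustrates one way such a pipeline could be wired together. It is not the authors' implementation: the class names (CMCNSketch, FeatureAttention), the feature dimensions (BERT-sized text vectors, ResNet-sized image vectors), and the sigmoid-gated correlation score are illustrative assumptions based only on the abstract.

```python
# Minimal sketch of the three-module hierarchy described in the abstract
# (feature extraction -> correlation-gated feature attention -> cross-modal
# hierarchical fusion). All names, dimensions, and the gating form are
# illustrative assumptions, not the paper's code.
import torch
import torch.nn as nn


class FeatureAttention(nn.Module):
    """Re-weights text/image features with a learned image-text correlation score."""

    def __init__(self, dim: int):
        super().__init__()
        # Assumed form of an image-text correlation generator: a scalar gate in [0, 1].
        self.correlation = nn.Sequential(
            nn.Linear(2 * dim, dim), nn.ReLU(), nn.Linear(dim, 1), nn.Sigmoid()
        )
        self.text_attn = nn.Linear(dim, dim)
        self.image_attn = nn.Linear(dim, dim)

    def forward(self, text_feat: torch.Tensor, image_feat: torch.Tensor):
        score = self.correlation(torch.cat([text_feat, image_feat], dim=-1))  # (B, 1)
        text_att = torch.tanh(self.text_attn(text_feat)) * score
        image_att = torch.tanh(self.image_attn(image_feat)) * score
        return text_att, image_att


class CMCNSketch(nn.Module):
    """Hierarchical fusion: intra-modal combination first, then one cross-modal step."""

    def __init__(self, text_dim: int = 768, image_dim: int = 2048,
                 dim: int = 256, num_classes: int = 3):
        super().__init__()
        # Feature extraction module: here, projections of pre-extracted
        # BERT/ResNet features (assumed inputs).
        self.text_proj = nn.Linear(text_dim, dim)
        self.image_proj = nn.Linear(image_dim, dim)
        self.attention = FeatureAttention(dim)
        # Cross-modal hierarchical fusion: fuse within each modality, then between modalities.
        self.text_fusion = nn.Linear(2 * dim, dim)
        self.image_fusion = nn.Linear(2 * dim, dim)
        self.cross_fusion = nn.Linear(2 * dim, dim)
        self.classifier = nn.Linear(dim, num_classes)

    def forward(self, text_feat: torch.Tensor, image_feat: torch.Tensor) -> torch.Tensor:
        t = torch.relu(self.text_proj(text_feat))
        v = torch.relu(self.image_proj(image_feat))
        t_att, v_att = self.attention(t, v)
        t_fused = torch.relu(self.text_fusion(torch.cat([t, t_att], dim=-1)))   # within text
        v_fused = torch.relu(self.image_fusion(torch.cat([v, v_att], dim=-1)))  # within image
        joint = torch.relu(self.cross_fusion(torch.cat([t_fused, v_fused], dim=-1)))  # between modalities
        return self.classifier(joint)  # sentiment logits


if __name__ == "__main__":
    model = CMCNSketch()
    logits = model(torch.randn(4, 768), torch.randn(4, 2048))
    print(logits.shape)  # torch.Size([4, 3])
```

The sketch preserves the hierarchy outlined in the abstract: attended and raw features are first fused within each modality, and only then combined across modalities before classification, with the correlation gate intended to suppress fusion when the image and text are unrelated.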

Keywords: joint optimization, multimodal sentiment analysis, multimodal fusion, Cross-Modal Complementary Network (CMCN), hierarchical fusion


Publication history

Received: 14 May 2021
Revised: 15 July 2021
Accepted: 30 July 2021
Published: 09 December 2021
Issue date: August 2022

Copyright

© The author(s) 2022

Acknowledgements

The work was supported by the National Key Research and Development Program of China (No. 2020AAA0104903).

Rights and permissions

The articles published in this open access journal are distributed under the terms of the Creative Commons Attribution 4.0 International License (http://creativecommons.org/licenses/by/4.0/).
