
TACFN: Transformer-Based Adaptive Cross-Modal Fusion Network for Multimodal Emotion Recognition

Feng Liu 1, Ziwang Fu 2, Yunlong Wang 3, and Qijian Zheng 1
1. School of Computer Science and Technology, East China Normal University, Shanghai 200062, China
2. MTlab, Meitu (China) Limited, Beijing 100876, China
3. Institute of Acoustics, University of Chinese Academy of Sciences, Beijing 100084, China

Abstract

The fusion technique is key to the multimodal emotion recognition task. Recently, cross-modal attention-based fusion methods have demonstrated high performance and strong robustness. However, cross-modal attention suffers from redundant features and does not capture complementary features well. We find that it is not necessary to use all the information of one modality to reinforce the other during cross-modal interaction; the features that can reinforce a modality may be contained in only part of it. To this end, we design an innovative Transformer-based Adaptive Cross-modal Fusion Network (TACFN). Specifically, to address the redundant features, we let one modality perform intra-modal feature selection through a self-attention mechanism, so that the selected features can adaptively and efficiently interact with the other modality. To better capture the complementary information between the modalities, we obtain a fused weight vector by splicing (concatenation) and use this weight vector to achieve feature reinforcement of the modalities. We apply TACFN to the RAVDESS and IEMOCAP datasets. For a fair comparison, we use the same unimodal representations to validate the effectiveness of the proposed fusion method. The experimental results show that TACFN brings a significant performance improvement over other methods and achieves state-of-the-art performance. All code and models can be accessed at https://github.com/shuzihuaiyu/TACFN.
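The abstract describes the fusion mechanism only at a high level. Below is a minimal, hypothetical PyTorch sketch of one adaptive cross-modal block following that description: intra-modal self-attention selects a subset of source-modality features, cross-modal attention lets the target modality attend to that selection, and a weight vector derived from the spliced (concatenated) representation reinforces the target features. All class and parameter names (AdaptiveCrossModalBlock, top_k, gate, etc.) are illustrative assumptions, not the authors' released implementation; refer to the linked repository for the actual model.

```python
# Hypothetical sketch of an adaptive cross-modal fusion block (not the authors' code).
import torch
import torch.nn as nn


class AdaptiveCrossModalBlock(nn.Module):
    def __init__(self, d_model: int = 128, n_heads: int = 4, top_k: int = 16):
        super().__init__()
        # Intra-modal self-attention used to score and select source-modality features.
        self.self_attn = nn.MultiheadAttention(d_model, n_heads, batch_first=True)
        # Cross-modal attention: target queries attend to the selected source features.
        self.cross_attn = nn.MultiheadAttention(d_model, n_heads, batch_first=True)
        # Maps the spliced (concatenated) representation to an adaptive weight vector.
        self.gate = nn.Sequential(nn.Linear(2 * d_model, d_model), nn.Sigmoid())
        self.top_k = top_k

    def forward(self, target: torch.Tensor, source: torch.Tensor) -> torch.Tensor:
        # target: (B, T_t, d_model); source: (B, T_s, d_model)
        # 1) Intra-modal selection: keep only the top-k most attended source tokens.
        attended, attn_w = self.self_attn(source, source, source)   # attn_w: (B, T_s, T_s)
        scores = attn_w.mean(dim=1)                                 # (B, T_s)
        k = min(self.top_k, source.size(1))
        idx = scores.topk(k, dim=-1).indices                        # (B, k)
        idx = idx.unsqueeze(-1).expand(-1, -1, attended.size(-1))   # (B, k, d_model)
        selected = attended.gather(1, idx)                          # (B, k, d_model)

        # 2) Cross-modal interaction with the reduced source representation.
        reinforced, _ = self.cross_attn(target, selected, selected)  # (B, T_t, d_model)

        # 3) Splice and derive adaptive weights to reinforce the target modality.
        weights = self.gate(torch.cat([target, reinforced], dim=-1))  # (B, T_t, d_model)
        return target + weights * reinforced


# Usage with random stand-in features for two modalities (e.g., visual reinforcing audio):
audio, visual = torch.randn(2, 50, 128), torch.randn(2, 30, 128)
fused_audio = AdaptiveCrossModalBlock()(audio, visual)  # (2, 50, 128)
```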

Keywords: Transformer, multimodal fusion, multimodal emotion recognition, adaptive cross-modal blocks, computational perception



Publication history

Received: 07 July 2023
Accepted: 31 August 2023
Published: 27 October 2023
Issue date: December 2023

Copyright

© The author(s) 2023.

Acknowledgements


This study was supported by Beijing Key Laboratory of Behavior and Mental Health, Peking University.

Rights and permissions

The articles published in this open access journal are distributed under the terms of the Creative Commons Attribution 4.0 International License (http://creativecommons.org/licenses/by/4.0/).
