
TACFN: Transformer-Based Adaptive Cross-Modal Fusion Network for Multimodal Emotion Recognition

Feng Liu 1, Ziwang Fu 2, Yunlong Wang 3, and Qijian Zheng 1
1. School of Computer Science and Technology, East China Normal University, Shanghai 200062, China
2. MTlab, Meitu (China) Limited, Beijing 100876, China
3. Institute of Acoustics, University of Chinese Academy of Sciences, Beijing 100084, China

Abstract

The fusion technique is key to the multimodal emotion recognition task. Recently, cross-modal attention-based fusion methods have demonstrated high performance and strong robustness. However, cross-modal attention suffers from redundant features and does not capture complementary features well. We find that it is not necessary to use all the information of one modality to reinforce the other during cross-modal interaction; the features that can reinforce a modality may be contained in only part of it. To this end, we design an innovative Transformer-based Adaptive Cross-modal Fusion Network (TACFN). Specifically, to address the redundant features, we let one modality perform intra-modal feature selection through a self-attention mechanism, so that the selected features can adaptively and efficiently interact with the other modality. To better capture the complementary information between the modalities, we obtain a fused weight vector by splicing (concatenation) and use this weight vector to achieve feature reinforcement of the modalities. We apply TACFN to the RAVDESS and IEMOCAP datasets. For a fair comparison, we use the same unimodal representations to validate the effectiveness of the proposed fusion method. The experimental results show that TACFN brings a significant performance improvement over other methods and achieves state-of-the-art performance. All code and models can be accessed at https://github.com/shuzihuaiyu/TACFN.
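The abstract describes the fusion mechanism only at a high level. Below is a minimal, hypothetical PyTorch sketch of one adaptive cross-modal block following that description: intra-modal self-attention selects a subset of source-modality features, cross-modal attention lets the target modality attend to that selection, and a weight vector derived from the spliced (concatenated) representation reinforces the target features. All class and parameter names (AdaptiveCrossModalBlock, top_k, gate, etc.) are illustrative assumptions, not the authors' released implementation; refer to the linked repository for the actual model.

```python
# Hypothetical sketch of an adaptive cross-modal fusion block (not the authors' code).
import torch
import torch.nn as nn


class AdaptiveCrossModalBlock(nn.Module):
    def __init__(self, d_model: int = 128, n_heads: int = 4, top_k: int = 16):
        super().__init__()
        # Intra-modal self-attention used to score and select source-modality features.
        self.self_attn = nn.MultiheadAttention(d_model, n_heads, batch_first=True)
        # Cross-modal attention: target queries attend to the selected source features.
        self.cross_attn = nn.MultiheadAttention(d_model, n_heads, batch_first=True)
        # Maps the spliced (concatenated) representation to an adaptive weight vector.
        self.gate = nn.Sequential(nn.Linear(2 * d_model, d_model), nn.Sigmoid())
        self.top_k = top_k

    def forward(self, target: torch.Tensor, source: torch.Tensor) -> torch.Tensor:
        # target: (B, T_t, d_model); source: (B, T_s, d_model)
        # 1) Intra-modal selection: keep only the top-k most attended source tokens.
        attended, attn_w = self.self_attn(source, source, source)   # attn_w: (B, T_s, T_s)
        scores = attn_w.mean(dim=1)                                 # (B, T_s)
        k = min(self.top_k, source.size(1))
        idx = scores.topk(k, dim=-1).indices                        # (B, k)
        idx = idx.unsqueeze(-1).expand(-1, -1, attended.size(-1))   # (B, k, d_model)
        selected = attended.gather(1, idx)                          # (B, k, d_model)

        # 2) Cross-modal interaction with the reduced source representation.
        reinforced, _ = self.cross_attn(target, selected, selected)  # (B, T_t, d_model)

        # 3) Splice and derive adaptive weights to reinforce the target modality.
        weights = self.gate(torch.cat([target, reinforced], dim=-1))  # (B, T_t, d_model)
        return target + weights * reinforced


# Usage with random stand-in features for two modalities (e.g., visual reinforcing audio):
audio, visual = torch.randn(2, 50, 128), torch.randn(2, 30, 128)
fused_audio = AdaptiveCrossModalBlock()(audio, visual)  # (2, 50, 128)
```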

Keywords: Transformer, multimodal fusion, multimodal emotion recognition, adaptive cross-modal blocks, computational perception



Publication history

Received: 07 July 2023
Accepted: 31 August 2023
Published: 27 October 2023
Issue date: December 2023

Copyright

© The author(s) 2023.

Acknowledgements


This study was supported by Beijing Key Laboratory of Behavior and Mental Health, Peking University.

Rights and permissions

The articles published in this open access journal are distributed under the terms of the Creative Commons Attribution 4.0 International License (http://creativecommons.org/licenses/by/4.0/).
