Research Article | Open Access

Cross-modal learning using privileged information for long-tailed image classification

School of Software, Shandong University, Jinan 250101, China

Abstract

The prevalence of long-tailed distributions in real-world data often causes classification models to favor dominant classes while neglecting less frequent ones. Current approaches to long-tailed image classification address this imbalance by rebalancing data, optimizing weights, or augmenting information. However, these methods often struggle to balance performance between dominant and minority classes because the representations of the minority classes are inadequately learned. To address this problem, we introduce descriptional words for images as cross-modal privileged information and propose a cross-modal enhanced method for long-tailed image classification, referred to as CMLTNet. CMLTNet improves the intra-class similarity of tail-class representations through cross-modal alignment and captures the differences between head and tail classes in the semantic space through cross-modal inference. After fusing this information, CMLTNet achieved better overall performance than benchmark long-tailed and cross-modal learning methods on the long-tailed cross-modal datasets NUS-WIDE and VireoFood-172. The effectiveness of the proposed modules was further verified through ablation experiments. A case study of feature distributions showed that the proposed model learns better representations of tail classes, and experiments on model attention indicated that CMLTNet can help learn rare concepts in tail classes by mapping them to the semantic space.
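The mechanisms named above can be made concrete with a short sketch. The following minimal PyTorch example, written under our own assumptions rather than the paper's actual formulation, illustrates the general pattern: word features (the privileged information, available only during training) are aligned with image features in a shared semantic space to tighten intra-class similarity, a classifier operating in that semantic space provides cross-modal inference, and its logits are fused with those of a conventional visual classifier. All module names, dimensions (e.g., 2048-d image features, 300-d word embeddings), the cosine alignment loss, and the additive fusion are illustrative choices, not CMLTNet's specification.

import torch
import torch.nn as nn
import torch.nn.functional as F

class CrossModalSketch(nn.Module):
    # Hypothetical sketch of the abstract's three ingredients:
    # cross-modal alignment, cross-modal inference, and fusion.
    def __init__(self, img_dim=2048, txt_dim=300, sem_dim=512, num_classes=172):
        super().__init__()
        self.img_proj = nn.Linear(img_dim, sem_dim)          # image branch -> shared semantic space
        self.txt_proj = nn.Linear(txt_dim, sem_dim)          # privileged word branch -> same space
        self.cls_visual = nn.Linear(img_dim, num_classes)    # conventional visual classifier
        self.cls_semantic = nn.Linear(sem_dim, num_classes)  # inference in the semantic space

    def forward(self, img_feat, txt_feat=None):
        z_img = F.normalize(self.img_proj(img_feat), dim=-1)
        # Fusion (assumed additive): visual logits + semantic-space logits.
        logits = self.cls_visual(img_feat) + self.cls_semantic(z_img)
        align_loss = None
        if txt_feat is not None:  # privileged words exist only at training time
            z_txt = F.normalize(self.txt_proj(txt_feat), dim=-1)
            # Cosine alignment pulls an image toward its own word description,
            # tightening intra-class similarity for sparsely sampled tail classes.
            align_loss = (1.0 - (z_img * z_txt).sum(dim=-1)).mean()
        return logits, align_loss

# Toy usage with random features; 172 classes matches VireoFood-172.
imgs = torch.randn(8, 2048)
words = torch.randn(8, 300)
logits, align_loss = CrossModalSketch()(imgs, words)

At inference time the model would be called without txt_feat, so only the fused image pathway is used; this training-only availability of the word modality is the defining property of learning using privileged information.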

Computational Visual Media, Volume 10, Issue 5, Pages 981–992
Cite this article:
Li X, Zheng Y, Ma H, et al. Cross-modal learning using privileged information for long-tailed image classification. Computational Visual Media, 2024, 10(5): 981-992. https://doi.org/10.1007/s41095-023-0382-0

Article metrics: 140 Views · 5 Downloads · Citations: 3 (Crossref), 3 (Web of Science), 3 (Scopus), 0 (CSCD)

Received: 11 January 2023
Accepted: 29 September 2023
Published: 10 June 2024
© The Author(s) 2024.

This article is licensed under a Creative Commons Attribution 4.0 International License, which permits use, sharing, adaptation, distribution and reproduction in any medium or format, as long as you give appropriate credit to the original author(s) and the source, provide a link to the Creative Commons licence, and indicate if changes were made.

The images or other third party material in this article are included in the article’s Creative Commons licence, unless indicated otherwise in a credit line to the material. If material is not included in the article’s Creative Commons licence and your intended use is not permitted by statutory regulation or exceeds the permitted use, you will need to obtain permission directly from the copyright holder.

To view a copy of this licence, visit http://creativecommons.org/licenses/by/4.0/.

Other papers from this open access journal are available free of charge from http://www.springer.com/journal/41095. To submit a manuscript, please go to https://www.editorialmanager.com/cvmj.
