Joint Sample Position-Based Noise Filtering and Mean Shift Clustering for Imbalanced Classification Learning

Lilong Duan; Wei Xue; Jun Huang; Xiao Zheng

doi:10.26599/TST.2023.9010006

Tsinghua Science and Technology 2024, 29(1): 216-231 https://doi.org/10.26599/TST.2023.9010006

Open Access | Issue | Published: 21 August 2023

Lilong Duan^{1,2}, Wei Xue^{1,2}, Jun Huang^{1,2}, Xiao Zheng^{1,2}

^{1} School of Computer Science and Technology, Anhui University of Technology, Maanshan 243032, China

^{2} Institute of Artificial Intelligence, Hefei Comprehensive National Science Center, Hefei 230088, China

Keywords:

clustering, noise filtering, imbalanced data classification, oversampling

Cite this article:

Duan L, Xue W, Huang J, et al. Joint Sample Position-Based Noise Filtering and Mean Shift Clustering for Imbalanced Classification Learning. Tsinghua Science and Technology, 2024, 29(1): 216-231. https://doi.org/10.26599/TST.2023.9010006

Abstract

Imbalanced data classification has received much attention. Conventional classification algorithms are susceptible to data skew: they tend to favor majority samples and ignore minority samples. The majority weighted minority oversampling technique (MWMOTE) is an effective approach to this problem; however, it can suffer from inadequate noise filtering and can synthesize samples identical to the original minority data. To address these shortcomings, we propose an improved MWMOTE method named joint sample position-based noise filtering and mean shift clustering (SPMSC). First, to eliminate the effect of noisy samples, SPMSC uses a new noise filtering mechanism that determines whether a minority sample is noisy based on its position and distribution relative to the majority samples. Second, because MWMOTE may generate duplicate samples, SPMSC employs the mean shift algorithm to cluster the minority samples and thereby reduce duplicated synthetic samples. Finally, data cleaning is performed on the processed data to further eliminate class overlap. Experiments on extensive benchmark datasets demonstrate the effectiveness of SPMSC compared with other sampling methods.
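The clustering step described in the abstract can be illustrated with a minimal sketch. This is not the authors' implementation: it assumes a flat-kernel mean shift, a hand-picked bandwidth, and simple interpolation between two distinct members of the same cluster. The function names `mean_shift` and `oversample_within_clusters` and the toy data are hypothetical. The noise filtering and data cleaning steps are omitted because the abstract does not specify them.

```python
import math
import random


def mean_shift(points, bandwidth, max_iter=100, tol=1e-6):
    """Flat-kernel mean shift (illustrative); returns one label per point."""
    modes = []
    for p in points:
        center = list(p)
        for _ in range(max_iter):
            # Average every point lying within `bandwidth` of the center.
            near = [q for q in points if math.dist(center, q) <= bandwidth]
            if not near:  # defensive guard against rounding edge cases
                break
            new_center = [sum(c) / len(near) for c in zip(*near)]
            shift = math.dist(center, new_center)
            center = new_center
            if shift < tol:
                break
        modes.append(center)

    # Points whose modes ended up close together share a cluster label.
    labels, centers = [], []
    for m in modes:
        for i, c in enumerate(centers):
            if math.dist(m, c) <= bandwidth / 2:
                labels.append(i)
                break
        else:
            centers.append(m)
            labels.append(len(centers) - 1)
    return labels


def oversample_within_clusters(minority, labels, n_new, rng):
    """Create synthetic points between two distinct members of one cluster."""
    clusters = {}
    for p, lab in zip(minority, labels):
        clusters.setdefault(lab, []).append(p)
    usable = [c for c in clusters.values() if len(c) >= 2]

    synthetic = []
    while usable and len(synthetic) < n_new:
        a, b = rng.sample(rng.choice(usable), 2)
        # Keep t away from 0 and 1 so the new point copies neither a nor b
        # (assumes the minority set itself has no repeated points).
        t = rng.uniform(0.05, 0.95)
        synthetic.append(tuple(x + t * (y - x) for x, y in zip(a, b)))
    return synthetic


# Toy minority class made of two well-separated groups (hypothetical data).
minority = [(0.0, 0.0), (0.2, 0.1), (0.1, 0.3),
            (5.0, 5.0), (5.2, 5.1), (5.1, 4.8)]
labels = mean_shift(minority, bandwidth=1.0)
synthetic = oversample_within_clusters(minority, labels, n_new=4,
                                       rng=random.Random(0))
```

Restricting interpolation to pairs inside one cluster keeps synthetic points from landing in the gap between separate minority groups, and bounding the interpolation weight away from 0 and 1 is one simple way to avoid reproducing an original sample.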

About this article

Publication history

Received: 03 December 2022

Revised: 22 January 2023

Accepted: 31 January 2023

Published: 21 August 2023

Issue date: February 2024

Acknowledgements

This work was supported in part by the Anhui Provincial Natural Science Foundation (No. 2208085MF168) and the Program for Synergy Innovation in the Anhui Higher Education Institutions of China (Nos. GXXT-2019-025 and GXXT-2022-052).

Rights and permissions

The articles published in this open access journal are distributed under the terms of the Creative Commons Attribution 4.0 International License (http://creativecommons.org/licenses/by/4.0/).