Volume 29, Issue 1

Joint Sample Position-Based Noise Filtering and Mean Shift Clustering for Imbalanced Classification Learning

Lilong Duan (1,2), Wei Xue (1,2), Jun Huang (1,2), Xiao Zheng (1,2)
(1) School of Computer Science and Technology, Anhui University of Technology, Maanshan 243032, China
(2) Institute of Artificial Intelligence, Hefei Comprehensive National Science Center, Hefei 230088, China

Abstract

The problem of learning classifiers from imbalanced data has received much attention. Conventional classification algorithms are susceptible to data skew: they favor majority samples and neglect minority samples. The majority weighted minority oversampling technique (MWMOTE) is an effective approach to this problem; however, it can suffer from inadequate noise filtering and from synthesizing samples that are identical to the original minority data. To address these shortcomings, we propose an improved MWMOTE method named joint sample position based noise filtering and mean shift clustering (SPMSC). First, to eliminate the effect of noisy samples, SPMSC uses a new noise filtering mechanism that determines whether a minority sample is noisy based on its position and distribution relative to the majority samples. Second, because MWMOTE may generate duplicate samples, we employ the mean shift algorithm to cluster the minority samples and thereby reduce duplicated synthetic samples. Finally, data cleaning is performed on the processed data to further reduce class overlap. Experiments on extensive benchmark datasets demonstrate the effectiveness of SPMSC compared with other sampling methods.
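The first two stages described above (noise filtering, then clustering before synthesis) can be illustrated with a minimal sketch. This is not the SPMSC implementation: the abstract does not give the position-based filtering rule, so a plain k-nearest-neighbour rule is assumed in its place, MWMOTE's sample weighting is omitted, and the final data cleaning step is not shown. All function names, the toy data, and the parameters `k` and `bandwidth` are hypothetical.

```python
import math
import random


def filter_noise(minority, majority, k=3):
    """Assumed stand-in rule: drop a minority point if all of its
    k nearest neighbours belong to the majority class."""
    kept = []
    for i, p in enumerate(minority):
        neighbours = [(math.dist(p, q), "min")
                      for j, q in enumerate(minority) if j != i]
        neighbours += [(math.dist(p, q), "maj") for q in majority]
        nearest = sorted(neighbours)[:k]
        if any(label == "min" for _, label in nearest):
            kept.append(p)
    return kept


def mean_shift(points, bandwidth, max_iter=100, tol=1e-6):
    """Flat-kernel mean shift; returns one cluster label per point."""
    modes = []
    for p in points:
        x = p
        for _ in range(max_iter):
            # Move x to the mean of all points inside the window.
            window = [q for q in points if math.dist(x, q) <= bandwidth]
            new_x = tuple(sum(c) / len(window) for c in zip(*window))
            shift = math.dist(x, new_x)
            x = new_x
            if shift < tol:
                break
        modes.append(x)
    # Points whose modes nearly coincide share a cluster.
    centers, labels = [], []
    for m in modes:
        for idx, c in enumerate(centers):
            if math.dist(m, c) < bandwidth / 2:
                labels.append(idx)
                break
        else:
            centers.append(m)
            labels.append(len(centers) - 1)
    return labels


def oversample(minority, labels, n_new, rng):
    """Interpolate between two points of the same cluster and
    reject any synthetic point that already exists."""
    clusters = {}
    for p, lab in zip(minority, labels):
        clusters.setdefault(lab, []).append(p)
    usable = [c for c in clusters.values() if len(c) >= 2]
    existing = set(minority)
    synthetic = []
    if not usable:
        return synthetic
    for _ in range(20 * n_new):  # bounded number of attempts
        if len(synthetic) == n_new:
            break
        a, b = rng.sample(rng.choice(usable), 2)
        t = rng.uniform(0.05, 0.95)
        s = tuple(x + t * (y - x) for x, y in zip(a, b))
        if s not in existing:
            existing.add(s)
            synthetic.append(s)
    return synthetic


# Toy data: two minority clusters plus one minority point inside the majority.
majority = [(0.0, 0.0), (0.5, 0.2), (0.2, 0.6), (0.8, 0.8), (0.4, 0.9), (1.0, 0.3)]
minority = [(5.0, 5.0), (5.2, 5.1), (4.9, 5.3),
            (9.0, 1.0), (9.2, 1.1), (8.9, 0.8),
            (0.3, 0.4)]

clean = filter_noise(minority, majority, k=3)
labels = mean_shift(clean, bandwidth=1.0)
synthetic = oversample(clean, labels, n_new=4, rng=random.Random(0))
```

Restricting interpolation to pairs from the same mean shift cluster keeps synthetic points inside one dense minority region, and the explicit membership check is one simple way to avoid emitting a sample that duplicates an existing one.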

Keywords: clustering, noise filtering, imbalanced data classification, oversampling


Publication history

Received: 03 December 2022
Revised: 22 January 2023
Accepted: 31 January 2023
Published: 21 August 2023
Issue date: February 2024

Copyright

© The author(s) 2024.

Acknowledgements

This work was supported in part by the Anhui Provincial Natural Science Foundation (No. 2208085MF168) and the Program for Synergy Innovation in the Anhui Higher Education Institutions of China (Nos. GXXT-2019-025 and GXXT-2022-052).

Rights and permissions

The articles published in this open access journal are distributed under the terms of the Creative Commons Attribution 4.0 International License (http://creativecommons.org/licenses/by/4.0/).
