Volume 4, Issue 3

Effective Density-Based Clustering Algorithms for Incomplete Data

Zhonghao Xue1, Hongzhi Wang2
1 USC Viterbi School of Engineering, University of Southern California, Los Angeles, CA 90007, USA
2 Department of Computer Science and Technology, Harbin Institute of Technology, Harbin 150001, China

Abstract

Density-based clustering is an important category of clustering algorithms. In real applications, many datasets are incomplete. Traditional imputation techniques and other methods for handling missing values are poorly suited to density-based clustering and degrade the quality of the clustering results. To avoid these problems, we develop a novel density-based clustering approach for incomplete data based on Bayesian theory, which performs imputation and clustering concurrently and makes use of intermediate clustering results. To limit the impact of low-density areas inside non-convex clusters, we introduce a local imputation clustering algorithm, which aims to impute points into high-density local areas. The performance of the proposed algorithms is evaluated on ten synthetic datasets and five real-world datasets with induced missing values. The experimental results show the effectiveness of the proposed algorithms.
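The claim that traditional imputation is unsuitable for density-based clustering can be illustrated with a small sketch. This is not the algorithm proposed in the article; it is a toy baseline under stated assumptions. The functions `dbscan` and `mean_impute`, the two grid-shaped clusters, and the parameters `eps=1.5` and `min_pts=4` are all hypothetical choices made for illustration. One point has its x-coordinate hidden (an induced missing value), and column-mean imputation places it in the empty gap between the two clusters, where a DBSCAN-style procedure marks it as noise.

```python
import math

def dbscan(points, eps, min_pts):
    """Minimal brute-force DBSCAN; returns one label per point (-1 = noise)."""
    n = len(points)
    # eps-neighborhood of every point (a point counts as its own neighbor).
    neighbors = [[j for j in range(n) if math.dist(points[i], points[j]) <= eps]
                 for i in range(n)]
    labels = [None] * n
    cluster = -1
    for i in range(n):
        if labels[i] is not None:
            continue
        if len(neighbors[i]) < min_pts:
            labels[i] = -1  # provisionally noise; may become a border point later
            continue
        cluster += 1
        labels[i] = cluster
        frontier = list(neighbors[i])
        while frontier:
            j = frontier.pop()
            if labels[j] == -1:
                labels[j] = cluster  # noise reached from a core point becomes border
            if labels[j] is not None:
                continue
            labels[j] = cluster
            if len(neighbors[j]) >= min_pts:
                frontier.extend(neighbors[j])  # only core points expand the cluster
    return labels

def mean_impute(rows):
    """Replace each None with the mean of the observed values in its column."""
    dims = len(rows[0])
    means = []
    for d in range(dims):
        observed = [r[d] for r in rows if r[d] is not None]
        means.append(sum(observed) / len(observed))
    return [tuple(means[d] if r[d] is None else r[d] for d in range(dims))
            for r in rows]

# Toy data (assumed): two dense 3x3 grids, far apart along the x-axis.
cluster_a = [(float(x), float(y)) for x in range(3) for y in range(3)]
cluster_b = [(float(x), float(y)) for x in range(10, 13) for y in range(3)]
complete = cluster_a + cluster_b + [(0.5, 0.5)]       # last point lies inside cluster A
incomplete = cluster_a + cluster_b + [(None, 0.5)]    # its x-coordinate is hidden

labels_true = dbscan(complete, eps=1.5, min_pts=4)
imputed = mean_impute(incomplete)
labels_imputed = dbscan(imputed, eps=1.5, min_pts=4)

print("imputed point:", imputed[-1])
print("label with true value:", labels_true[-1])
print("label after mean imputation:", labels_imputed[-1])
```

In this toy setting the two grids are still recovered as separate clusters, but the imputed point is lost to the low-density region between them. Approaches that impute toward high-density areas, as the abstract describes, aim to avoid this kind of displacement.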

Keywords:

density-based clustering, incomplete data, clustering algorithm

Publication history

Received: 13 December 2020
Accepted: 13 January 2021
Published: 12 May 2021
Issue date: September 2021

Copyright

© The author(s) 2021

Acknowledgements

This work was supported by the National Natural Science Foundation of China (Nos. U1866602 and 71773025) and the National Key Research and Development Program of China (No. 2020YFB1006104).

Rights and permissions

The articles published in this open access journal are distributed under the terms of the Creative Commons Attribution 4.0 International License (http://creativecommons.org/licenses/by/4.0/).

Reprint and permission requests may be directed to the editorial office.
