Journal Home > Volume 28 , Issue 6

Constrained clustering, such as k-means with instance-level Must-Link (ML) and Cannot-Link (CL) auxiliary information as the constraints, has been extensively studied recently, due to its broad applications in data science and AI. Despite some heuristic approaches, there has not been any algorithm providing a non-trivial approximation ratio to the constrained k-means problem. To address this issue, we propose an algorithm with a provable approximation ratio of O(logk) when only ML constraints are considered. We also empirically evaluate the performance of our algorithm on real-world datasets having artificial ML and disjoint CL constraints. The experimental results show that our algorithm outperforms the existing greedy-based heuristic methods in clustering accuracy.


menu
Abstract
Full text
Outline
About this article

Efficient Algorithm for the k-Means Problem with Must-Link and Cannot-Link Constraints

Show Author's information Chaoqi Jia1Longkun Guo1,( )Kewen Liao2,Zhigang Lu3,
Faculty of Computer Science and Technology, Qilu University of Technology (Shandong Academy of Sciences), Jinan 250353, China
HilstLab, Peter Faber Business School, Australian Catholic University, Sydney 2060, Australia
Macquarie University Cyber Security Hub, Macquarie University, Sydney 2109, Australia

† Longkun Guo, Kewen Liao, and Zhigang Lu contribute equally to this paper.

Abstract

Constrained clustering, such as k-means with instance-level Must-Link (ML) and Cannot-Link (CL) auxiliary information as the constraints, has been extensively studied recently, due to its broad applications in data science and AI. Despite some heuristic approaches, there has not been any algorithm providing a non-trivial approximation ratio to the constrained k-means problem. To address this issue, we propose an algorithm with a provable approximation ratio of O(logk) when only ML constraints are considered. We also empirically evaluate the performance of our algorithm on real-world datasets having artificial ML and disjoint CL constraints. The experimental results show that our algorithm outperforms the existing greedy-based heuristic methods in clustering accuracy.

Keywords: approximation algorithm, constrained clustering, Constrained k-means, Must-Link (ML) and Cannot-Link (CL) constraints

References(26)

[1]
X. Li and H. Liu, Greedy optimization for K-means-based consensus clustering, Tsinghua Science and Technology, vol. 23, no. 2, pp. 184–194, 2018.
[2]
S. Ji, D. Xu, L. Guo, M. Li, and D. Zhang, The seeding algorithm for spherical K-means clustering with penalties, in Proc. 13th Int. Conf. Algorithmic Aspects in Information and Management, Beijing, China, 2019, pp. 149–158.
[3]
S. Har-Peled and S. Mazumdar, On coresets for k-means and k-median clustering, in Proc. 36th Annu. ACM Symp. Theory of Computing, Chicago, IL, USA, 2004, pp. 291–300.
[4]
Z. Lu and H. Shen, Differentially private k-means clustering with convergence guarantee, IEEE Transactions on Dependable and Secure Computing, vol. 18, no. 4, pp. 1541–1552, 2021.
[5]
S. Dasgupta, The Hardness of k-Means Clustering, Report, Department of Computer Science and Engineering University of California, https://cseweb.ucsd.edu/∼dasgupta/papers/kmeans.pdf, 2008.
[6]
A. Vattani, The hardness of k-means clustering in the plane, https://cseweb.ucsd.edu/∼avattani/papers/kmeans_hardness.pdf, 2009.
[7]
M. Mahajan, P. Nimbhorkar, and K. Varadarajan, The planar k-means problem is NP-hard, in Proc. 3rd Int. Workshop on Algorithms and Computation, Kolkata, India, 2009, pp. 274–285.
[8]
D. Arthur and S. Vassilvitskii, k-means++: The advantages of careful seeding, in Proc. 18th Annu. ACM-SIAM Symp. Discrete Algorithms, New Orleans, LA, USA, 2007, pp. 1027–1035.
[9]
T. Kanungo, D. M. Mount, N. S. Netanyahu, C. D. Piatko, R. Silverman, and A. Y. Wu, A local search approximation algorithm for k-means clustering, in Proc. 18th Annu. Symp. Computational Geometry, Barcelona, Spain, 2002, pp. 10–18.
[10]
S. Lattanzi and C. Sohler, A better k-means++ algorithm via local search, in Proc. 36th Int. Conf. Machine Learning, Long Beach, CA, USA, 2019, pp. 3662–3671.
[11]
S. Ahmadian, A. Norouzi-Fard, O. Svensson, and J. Ward, Better guarantees for k-means and Euclidean k-median by primal-dual algorithms, in Proc. of the 2017 IEEE 58th Annu. Symp. Foundations of Computer Science (FOCS), Berkeley, CA, USA, 2017, pp. 61–72.
[12]
K. Wagstaff and C. Cardie, Clustering with instance-level constraints, in Proc. 17th Int. Conf. Machine Learning, San Francisco, CA, USA, 2000, pp. 1103–1110.
[13]
K. Wagstaff, C. Cardie, S. Rogers, and S. Schrödl, Constrained k-means clustering with background knowledge, in Proc. 18th Int. Conf. Machine Learning, San Francisco, CA, USA, 2001, pp. 577–584.
[14]
T. Lange, M. H. C. Law, A. K. Jain, and J. M. Buhmann, Learning with constrained and unlabelled data, in Proc. of 2005 IEEE Computer Society Conf. Computer Vision and Pattern Recognition (CVPR’05), San Diego, CA, USA, 2005, pp. 731–738.
[15]
S. Basu, I. Davidson, and K. Wagstaff, Constrained Clustering: Advances in Algorithms, Theory, and Applications. Boca Raton, FL, USA: CRC Press, 2008.
DOI
[16]
Z. Li, J. Liu, and X. Tang, Constrained clustering via spectral regularization, in Proc. of 2009 IEEE Conf. Computer Vision and Pattern Recognition, Miami, FL, USA, 2009, pp. 421–428.
[17]
X. Zhai, Y. Peng, and J. Xiao, Cross-modality correlation propagation for cross-media retrieval, in Proc. of 2012 IEEE Int. Conf. Acoustics, Speech and Signal Processing (ICASSP), Kyoto, Japan, 2012, pp. 2337–2340.
[18]
H. Ding and J. Xu, A unified framework for clustering constrained data without locality property, in Proc. 26th Annu. ACM-SIAM Symp. Discrete Algorithms, San Diego, CA, USA, 2015, pp. 1471–1490.
[19]
Q. Feng, J. Hu, N. Huang, and J. Wang, Improved PTAS for the constrained k-means problem, J. Comb. Optim., vol. 37, no. 4, pp. 1091–1110, 2019.
[20]
I. Davidson and S. S. Ravi, Clustering with constraints: Feasibility issues and the k-means algorithm, in Proc. 2005 SIAM Int. Conf. Data Mining, Newport Beach, CA, USA, 2005, pp. 138–149.
[21]
P. Baumann, A binary linear programming-based k-means algorithm for clustering with must-link and cannot-link constraints, in Proc. of 2020 IEEE Int. Conf. Industrial Engineering and Engineering Management (IEEM), Singapore, 2020, pp. 324–328.
[22]
J. Edmonds, Maximum matching and a polyhedron with 0,1-vertices, Journal of Research of the National Bureau of Standards B, vol. 69, nos. 1&2, pp. 125–130, 1965.
[23]
J. B. MacQuuen, Some methods for classification and analysis of multivariate observations, in Proc. 5th Berkley Symp. Mathematical Statistics and Probability, Berkeley, CA, USA, 1967, pp. 281–297.
[24]
L. Bottou and Y. Bengio, Convergence properties of the k-means algorithms, in Proc. 7th Int. Conf. Neural Information Processing Systems, Denver, CO, USA, 1994, pp. 585–592.
[25]
M. Lichman, UCI machine learning repository, http://archive.ics.uci.edu/ml, 2013.
[26]
W. M. Rand, Objective criteria for the evaluation of clustering methods, J. Am. Stat. Assoc., vol. 66, no. 336, pp. 846–850, 1971.
Publication history
Copyright
Acknowledgements
Rights and permissions

Publication history

Received: 09 August 2022
Revised: 24 October 2022
Accepted: 20 November 2022
Published: 28 July 2023
Issue date: December 2023

Copyright

© The author(s) 2023.

Acknowledgements

This work was supported by the National Natural Science Foundation of China (Nos. 12271098 and 61772005) and the Outstanding Youth Innovation Team Project for Universities of Shandong Province (No. 2020KJN008).

Rights and permissions

The articles published in this open access journal are distributed under the terms of the Creative Commons Attribution 4.0 International License (http://creativecommons.org/licenses/by/4.0/).

Return