Efficient Algorithm for the <i>k</i>-Means Problem with Must-Link and Cannot-Link Constraints

Chaoqi Jia; Longkun Guo; Kewen Liao; Zhigang Lu

doi:10.26599/TST.2022.9010056

AI Chat Paper

Note: Please note that the following content is generated by AMiner AI. SciOpen does not take any responsibility related to this content.

Chat more with AI

| Sign up

Browse by Subject

Search for peer-reviewed journals with full access.

Journals A - Z

About Us

Discover the SciOpen Platform and Achieve Your Research Goals with Ease.

About Us

Publish with Us

Support

Journals A - Z

About Us

Publish with Us

Support

PDF (5 MB)

Cite

EndNote(RIS) BibTeX

Collect

Submit Manuscript

AI Chat Paper

Show Outline

Outline

Show full outline

Hide outline

Outline

Show full outline

Hide outline

Open Access

Efficient Algorithm for the k-Means Problem with Must-Link and Cannot-Link Constraints

Chaoqi Jia^¹, Longkun Guo^{¹^,^†}(

), Kewen Liao^{²^,^†}, Zhigang Lu^{³^,^†}

1Faculty of Computer Science and Technology, Qilu University of Technology (Shandong Academy of Sciences), Jinan 250353, China

2HilstLab, Peter Faber Business School, Australian Catholic University, Sydney 2060, Australia

3Macquarie University Cyber Security Hub, Macquarie University, Sydney 2109, Australia

† Longkun Guo, Kewen Liao, and Zhigang Lu contribute equally to this paper.

Show Author Information

Abstract

Constrained clustering, such as $k$ -means with instance-level Must-Link (ML) and Cannot-Link (CL) auxiliary information as the constraints, has been extensively studied recently, due to its broad applications in data science and AI. Despite some heuristic approaches, there has not been any algorithm providing a non-trivial approximation ratio to the constrained $k$ -means problem. To address this issue, we propose an algorithm with a provable approximation ratio of $O (\log k)$ when only ML constraints are considered. We also empirically evaluate the performance of our algorithm on real-world datasets having artificial ML and disjoint CL constraints. The experimental results show that our algorithm outperforms the existing greedy-based heuristic methods in clustering accuracy.

Keywords

approximation algorithm constrained clustering Constrained k-means Must-Link (ML) and Cannot-Link (CL) constraints

References

[1]

X. Li and H. Liu, Greedy optimization for K-means-based consensus clustering, Tsinghua Science and Technology, vol. 23, no. 2, pp. 184–194, 2018.

Crossref Google Scholar

[2]

S. Ji, D. Xu, L. Guo, M. Li, and D. Zhang, The seeding algorithm for spherical K-means clustering with penalties, in Proc. 13^th Int. Conf. Algorithmic Aspects in Information and Management, Beijing, China, 2019, pp. 149–158.

Crossref Google Scholar

[3]

S. Har-Peled and S. Mazumdar, On coresets for k-means and k-median clustering, in Proc. 36^th Annu. ACM Symp. Theory of Computing, Chicago, IL, USA, 2004, pp. 291–300.

Crossref Google Scholar

[4]

Z. Lu and H. Shen, Differentially private k-means clustering with convergence guarantee, IEEE Transactions on Dependable and Secure Computing, vol. 18, no. 4, pp. 1541–1552, 2021.

Google Scholar

[5]

S. Dasgupta, The Hardness of k-Means Clustering, Report, Department of Computer Science and Engineering University of California, https://cseweb.ucsd.edu/∼dasgupta/papers/kmeans.pdf, 2008.

[6]

A. Vattani, The hardness of k-means clustering in the plane, https://cseweb.ucsd.edu/∼avattani/papers/kmeans_hardness.pdf, 2009.

[7]

M. Mahajan, P. Nimbhorkar, and K. Varadarajan, The planar k-means problem is NP-hard, in Proc. 3^rd Int. Workshop on Algorithms and Computation, Kolkata, India, 2009, pp. 274–285.

Crossref Google Scholar

[8]

D. Arthur and S. Vassilvitskii, k-means++: The advantages of careful seeding, in Proc. 18^th Annu. ACM-SIAM Symp. Discrete Algorithms, New Orleans, LA, USA, 2007, pp. 1027–1035.

Google Scholar

[9]

T. Kanungo, D. M. Mount, N. S. Netanyahu, C. D. Piatko, R. Silverman, and A. Y. Wu, A local search approximation algorithm for k-means clustering, in Proc. 18^th Annu. Symp. Computational Geometry, Barcelona, Spain, 2002, pp. 10–18.

Crossref Google Scholar

[10]

S. Lattanzi and C. Sohler, A better k-means++ algorithm via local search, in Proc. 36^th Int. Conf. Machine Learning, Long Beach, CA, USA, 2019, pp. 3662–3671.

Google Scholar

[11]

S. Ahmadian, A. Norouzi-Fard, O. Svensson, and J. Ward, Better guarantees for k-means and Euclidean k-median by primal-dual algorithms, in Proc. of the 2017 IEEE 58^th Annu. Symp. Foundations of Computer Science (FOCS), Berkeley, CA, USA, 2017, pp. 61–72.

Crossref Google Scholar

[12]

K. Wagstaff and C. Cardie, Clustering with instance-level constraints, in Proc. 17^th Int. Conf. Machine Learning, San Francisco, CA, USA, 2000, pp. 1103–1110.

Google Scholar

[13]

K. Wagstaff, C. Cardie, S. Rogers, and S. Schrödl, Constrained k-means clustering with background knowledge, in Proc. 18^th Int. Conf. Machine Learning, San Francisco, CA, USA, 2001, pp. 577–584.

Google Scholar

[14]

T. Lange, M. H. C. Law, A. K. Jain, and J. M. Buhmann, Learning with constrained and unlabelled data, in Proc. of 2005 IEEE Computer Society Conf. Computer Vision and Pattern Recognition (CVPR’05), San Diego, CA, USA, 2005, pp. 731–738.

Google Scholar

[15]

S. Basu, I. Davidson, and K. Wagstaff, Constrained Clustering: Advances in Algorithms, Theory, and Applications. Boca Raton, FL, USA: CRC Press, 2008.

Crossref

[16]

Z. Li, J. Liu, and X. Tang, Constrained clustering via spectral regularization, in Proc. of 2009 IEEE Conf. Computer Vision and Pattern Recognition, Miami, FL, USA, 2009, pp. 421–428.

Crossref Google Scholar

[17]

X. Zhai, Y. Peng, and J. Xiao, Cross-modality correlation propagation for cross-media retrieval, in Proc. of 2012 IEEE Int. Conf. Acoustics, Speech and Signal Processing (ICASSP), Kyoto, Japan, 2012, pp. 2337–2340.

Crossref Google Scholar

[18]

H. Ding and J. Xu, A unified framework for clustering constrained data without locality property, in Proc. 26^th Annu. ACM-SIAM Symp. Discrete Algorithms, San Diego, CA, USA, 2015, pp. 1471–1490.

Crossref Google Scholar

[19]

Q. Feng, J. Hu, N. Huang, and J. Wang, Improved PTAS for the constrained k-means problem, J. Comb. Optim., vol. 37, no. 4, pp. 1091–1110, 2019.

Crossref Google Scholar

[20]

I. Davidson and S. S. Ravi, Clustering with constraints: Feasibility issues and the k-means algorithm, in Proc. 2005 SIAM Int. Conf. Data Mining, Newport Beach, CA, USA, 2005, pp. 138–149.

Crossref Google Scholar

[21]

P. Baumann, A binary linear programming-based k-means algorithm for clustering with must-link and cannot-link constraints, in Proc. of 2020 IEEE Int. Conf. Industrial Engineering and Engineering Management (IEEM), Singapore, 2020, pp. 324–328.

Crossref Google Scholar

[22]

J. Edmonds, Maximum matching and a polyhedron with 0,1-vertices, Journal of Research of the National Bureau of Standards B, vol. 69, nos. 1&2, pp. 125–130, 1965.

Crossref Google Scholar

[23]

J. B. MacQuuen, Some methods for classification and analysis of multivariate observations, in Proc. 5^th Berkley Symp. Mathematical Statistics and Probability, Berkeley, CA, USA, 1967, pp. 281–297.

Google Scholar

[24]

L. Bottou and Y. Bengio, Convergence properties of the k-means algorithms, in Proc. 7^th Int. Conf. Neural Information Processing Systems, Denver, CO, USA, 1994, pp. 585–592.

Google Scholar

[25]

M. Lichman, UCI machine learning repository, http://archive.ics.uci.edu/ml, 2013.

[26]

W. M. Rand, Objective criteria for the evaluation of clustering methods, J. Am. Stat. Assoc., vol. 66, no. 336, pp. 846–850, 1971.

Crossref Google Scholar

Tsinghua Science and Technology

Volume 28 Issue 6,
December 2023

Pages 1050-1062

DOI: 10.26599/TST.2022.9010056

Cite this article:

Jia C, Guo L, Liao K, et al. Efficient Algorithm for the k-Means Problem with Must-Link and Cannot-Link Constraints. Tsinghua Science and Technology, 2023, 28(6): 1050-1062. https://doi.org/10.26599/TST.2022.9010056

515

Views

Downloads

Crossref

Web of Science

Scopus

CSCD

Google Scholar
Citation

Altmetrics

Received: 09 August 2022

Revised: 24 October 2022

Accepted: 20 November 2022

Published: 28 July 2023

The articles published in this open access journal are distributed under the terms of the Creative Commons Attribution 4.0 International License (http://creativecommons.org/licenses/by/4.0/).