Journal Home > Volume 6 , Issue 4

With the development of information technology, a mass of data are generated every day. Collecting and analysing these data help service providers improve their services and gain an advantage in the fierce market competition. K-means clustering has been widely used for cluster analysis in real life. However, these analyses are based on users’ data, which disclose users’ privacy. Local differential privacy has attracted lots of attention recently due to its strong privacy guarantee and has been applied for clustering analysis. However, existing K-means clustering methods with local differential privacy protection cannot get an ideal clustering result due to the large amount of noise introduced to the whole dataset to ensure the privacy guarantee. To solve this problem, we propose a novel method that provides local distance privacy for users who participate in the clustering analysis. Instead of making the users’ records in-distinguish from each other in high-dimensional space, we map the user’s record into a one-dimensional distance space and make the records in such a distance space not be distinguished from each other. To be specific, we generate a noisy distance first and then synthesize the high-dimensional data record. We propose a Bounded Laplace Method (BLM) and a Cluster Indistinguishable Method (CIM) to sample such a noisy distance, which satisfies the local differential privacy guarantee and local dE-privacy guarantee, respectively. Furthermore, we introduce a way to generate synthetic data records in high-dimensional space. Our experimental evaluation results show that our methods outperform the traditional methods significantly.


menu
Abstract
Full text
Outline
About this article

K-Means Clustering with Local Distance Privacy

Show Author's information Mengmeng Yang1Longxia Huang2Chenghua Tang3( )
Data61, Commonwealth Scientific and Industrial Research Organization, Melbourne 3168, Australia
School of Computer Science and Communication Engineering, Jiangsu University, Zhenjiang 212013, China
School of Computer Science and Information Security, Guilin University of Electronic Technology, Guilin 541010, China

Abstract

With the development of information technology, a mass of data are generated every day. Collecting and analysing these data help service providers improve their services and gain an advantage in the fierce market competition. K-means clustering has been widely used for cluster analysis in real life. However, these analyses are based on users’ data, which disclose users’ privacy. Local differential privacy has attracted lots of attention recently due to its strong privacy guarantee and has been applied for clustering analysis. However, existing K-means clustering methods with local differential privacy protection cannot get an ideal clustering result due to the large amount of noise introduced to the whole dataset to ensure the privacy guarantee. To solve this problem, we propose a novel method that provides local distance privacy for users who participate in the clustering analysis. Instead of making the users’ records in-distinguish from each other in high-dimensional space, we map the user’s record into a one-dimensional distance space and make the records in such a distance space not be distinguished from each other. To be specific, we generate a noisy distance first and then synthesize the high-dimensional data record. We propose a Bounded Laplace Method (BLM) and a Cluster Indistinguishable Method (CIM) to sample such a noisy distance, which satisfies the local differential privacy guarantee and local dE-privacy guarantee, respectively. Furthermore, we introduce a way to generate synthetic data records in high-dimensional space. Our experimental evaluation results show that our methods outperform the traditional methods significantly.

Keywords: data analysis, K-means clustering, local differential privacy

References(27)

[1]
G. Guo and C. Altrjman, E-commerce customer segmentation method under improved k-means algorithm, in Proc. 4th International Conference on Multi-Modal Information Analytics, Hohhot, China, 2022, pp. 1083–1089.
[2]
M. Jebakumari, T. Palaniraja, K. A. Patrick, and Ashwini , Blocking of spam mail using k-means clustering algorithm, International Journal of Information Technology and Computer Engineering (IJITC), vol. 2, no. 3, pp. 19–24, 2022.
[3]
M. Dhalaria and E. Gandotra, Android malware detection techniques: A literature review, Recent Patents on Engineering, vol. 15, no. 2, pp. 225–245, 2021.
[4]
Y. Qu, J. Zhang, R. Li, X. Zhang, X. Zhai, and S. Yu, Generative adversarial networks enhanced location privacy in 5G networks, Science China Information Sciences, vol. 63, no. 12, p. 220303, 2020.
[5]
Y. Qu, S. Yu, W. Zhou, S. Chen, and J. Wu, Customizable reliable privacy-preserving data sharing in cyber-physical social networks, IEEE Transactions on Network Science and Engineering, vol. 8, no. 1, pp. 269–281, 2020.
[6]
Y. Qu, L. Gao, Y. Xiang, S. Shen, and S. Yu, FedTwin: Blockchain-enabled adaptive asynchronous federated learning for digital twin networks, IEEE Network, vol. 36, no. 6, pp. 183–190, 2022.
[7]
U. Erlingsson, V. Pihur, and A. Korolova, RAPPOR: Randomized aggregatable privacy-preserving ordinal response, in Proc. 2014 ACM SIGSAC Conference on Computer and Communications Security, Scottsdale, AZ, USA, 2014, pp. 1054–1067.
[8]
Apple differential privacy technical overview, https://www.apple.com/privacy/docs/Differential\_Privacy\_Overview.pdf, 2022.
[9]
B. Ding, J. Kulkarni, and S. Yekhanin, Collecting telemetry data privately, in Proc. 31st International Conference on Neural Information Processing Systems (NeurIPS), Long Beach, CA, USA, 2017, pp. 3574–3583.
[10]
C. Dwork and A. Roth, The algorithmic foundations of differential privacy, Foundations and Trends in Theoretical Computer Science, vol. 9, nos. 3&4, pp. 211–407, 2014.
[11]
S. P. Kasiviswanathan, H. K. Lee, K. Nissim, S. Raskhodnikova, and A. Smith, What can we learn privately? SIAM Journal on Computing, vol. 40, no. 3, pp. 793–826, 2011.
[12]
M. E. Andrés, N. E. Bordenabe, K. Chatzikokolakis, and C. Palamidessi, Geo-indistinguishability: Differential privacy for location-based systems, in Proc. 2013 ACM SIGSAC Conference on Computer and Communications Security, Berlin, Germany, 2013, pp. 901–914.
[13]
N. Wang, X. Xiao, Y. Yang, J. Zhao, S. C. Hui, H. Shin, J. Shin, and G. Yu, Collecting and analyzing multidimensional data with local differential privacy, in Proc. 2019 IEEE 35th International Conference on Data Engineering (ICDE), Macao, China, 2019, pp. 638–649.
[14]
J. C. Duchi, M. I. Jordan, and M. J. Wainwright, Minimax optimal procedures for locally private estimation, Journal of the American Statistical Association, vol. 113, no. 521, pp. 182–201, 2018.
[15]
A. Blum, C. Dwork, F. McSherry, and K. Nissim, Practical privacy: The SuLQ framework, in Proc. Twenty-Fourth ACM SIGMOD-SIGACT-SIGART Symposium on Principles of Database Systems, Baltimore, MD, USA, 2005, pp. 128–138.
[16]
T. Ni, M. Qiao, Z. Chen, S. Zhang, and H. Zhong, Utility-efficient differentially private k-means clustering based on cluster merging, Neurocomputing, vol. 424, pp. 205–214, 2021.
[17]
A. Smith, Privacy-preserving statistical estimation with optimal convergence rates, in Proc. 43rd Annual ACM Symposium on Theory of Computing, San Jose, CA, USA, 2011, pp. 813–822.
[18]
J. Zhang, X. Xiao, Y. Yang, Z. Zhang, and M. Winslett, PrivGene: Differentially private model fitting using genetic algorithms, in Proc. 2013 ACM SIGMOD International Conference on Management of Data, New York, NY, USA, 2013, pp. 665–676.
[19]
Z. Lu and H. Shen, A convergent differentially private k-means clustering algorithm, in Proc. 23rd Pacific-Asia Conference on Knowledge Discovery and Data Mining, Macao, China, 2019, pp. 612–624.
[20]
D. Su, J. Cao, N. Li, E. Bertino, M. Lyu, and H. Jin, Differentially private k-means clustering and a hybrid approach to private optimization, ACM Transactions on Privacy and Security, vol. 20, no. 4, pp. 1–33, 2017.
[21]
M. Yang, I. Tjuawinata, and K. Y. Lam, K-means clustering with local dx-privacy for privacy-preserving data analysis, IEEE Transactions on Information Forensics and Security, vol. 17, pp. 2524–2537, 2022.
[22]
M. Yang, L. Lyu, J. Zhao, T. Zhu, and K. Y. Lam, Local differential privacy and its applications: A comprehensive survey, arXiv preprint arXiv: 2008.03686, 2020.
[23]
K. Nissim and U. Stemmer, Clustering algorithms for the centralized and local models, Proceedings of Algorithmic Learning Theory, vol. 83, pp. 619–653, 2018.
[24]
H. Kaplan and U. Stemmer, Differentially private k-means with constant multiplicative error, in Proc. 32nd International Conference on Neural Information Processing Systems (NeurIPS), Montréal, Canada, 2018, pp. 5436–5446.
[25]
U. Stemmer, Locally private k-means clustering, in Proc. Thirty-First Annual ACM-SIAM Symposium on Discrete Algorithms, Salt Lake City, UT, USA, 2020, pp. 548–559.
[26]
C. Xia, J. Hua, W. Tong, and S. Zhong, Distributed k-means clustering guaranteeing local differential privacy, Computers & Security, vol. 90, p. 101699, 2020.
[27]
A. Chang, B. Ghazi, R. Kumar, and P. Manurangsi, Locally private k-means in one round, arXiv preprint arXiv: 2104.09734, 2021.
Publication history
Copyright
Rights and permissions

Publication history

Received: 01 August 2022
Revised: 10 October 2022
Accepted: 03 December 2022
Published: 29 August 2023
Issue date: December 2023

Copyright

© The author(s) 2023.

Rights and permissions

The articles published in this open access journal are distributed under the terms of the Creative Commons Attribution 4.0 International License (http://creativecommons.org/licenses/by/4.0/).

Return