
Fair k-Center Problem with Outliers on Massive Data

Fan Yuan1, Luhong Diao1, Donglei Du2, Lei Liu1 (corresponding author)
1 Beijing Institute for Scientific and Engineering Computing, Faculty of Science, Beijing University of Technology, Beijing 100124, China
2 Faculty of Management, University of New Brunswick, Fredericton E3B 5A3, Canada

Abstract

Clustering of big data in the era of artificial intelligence has been widely studied. Because of the huge amount of data, distributed algorithms are often used for big data problems; an attractive feature of the distributed computing model is that it can handle massive datasets that do not fit into main memory. At the same time, since many decisions in today's society are made automatically by machines, algorithmic fairness has become an important research area in machine learning. In this paper, we study two fair clustering problems: the centralized fair k-center problem with outliers and the distributed fair k-center problem with outliers. For both problems, we design algorithms with constant approximation ratios, and we give theoretical proofs and analyses of the approximation ratios and of the space used by the algorithms.
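
For context, the underlying k-center objective asks for k centers that minimize the maximum distance from any point to its nearest center. The sketch below shows the classical farthest-point greedy heuristic for the plain (non-fair, outlier-free) version of the problem, which is well known to be a 2-approximation; it is only an illustration of the objective, not the fair or distributed algorithms proposed in the paper, and the function name is chosen for illustration.

import math

def greedy_k_center(points, k):
    # Farthest-point greedy heuristic for the classical k-center problem
    # (Gonzalez, 1985): the maximum point-to-center distance of the
    # returned solution is at most twice the optimum.
    centers = [points[0]]  # start from an arbitrary point
    # distance from each point to its nearest chosen center so far
    d = [math.dist(p, centers[0]) for p in points]
    for _ in range(1, k):
        i = max(range(len(points)), key=lambda j: d[j])  # farthest point
        centers.append(points[i])
        d = [min(d[j], math.dist(points[j], points[i])) for j in range(len(points))]
    return centers

# Toy usage: two natural clusters, k = 2
pts = [(0.0, 0.0), (1.0, 0.0), (10.0, 0.0), (10.0, 1.0)]
print(greedy_k_center(pts, 2))  # e.g., [(0.0, 0.0), (10.0, 1.0)]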

Keywords: machine learning, distributed algorithm, fairness constraints, outlier constraints, k-center problem


Publication history

Received: 04 October 2022
Revised: 02 December 2022
Accepted: 13 March 2023
Published: 28 July 2023
Issue date: December 2023

Copyright

© The author(s) 2023.

Acknowledgements

This work was supported by the National Natural Science Foundation of China (Nos. 12131003, 11771386, and 11728104), the Beijing Natural Science Foundation Project (No. Z200002), the General Research Projects of Beijing Education Committee in China (No. KM201910005013), the Natural Sciences and Engineering Research Council of Canada (NSERC) (No. 06446), and the General Program of Science and Technology Development Project of Beijing Municipal Education Commission (No. KM201810005005).

Rights and permissions

The articles published in this open access journal are distributed under the terms of the Creative Commons Attribution 4.0 International License (http://creativecommons.org/licenses/by/4.0/).
