Discover the SciOpen Platform and Achieve Your Research Goals with Ease.
Search articles, authors, keywords, DOl and etc.
The advent of Big Data has led to the rapid growth in the usage of parallel clustering algorithms that work over distributed computing frameworks such as MPI, MapReduce, and Spark. An important step for any parallel clustering algorithm is the distribution of data amongst the cluster nodes. This step governs the methodology and performance of the entire algorithm. Researchers typically use random, or a spatial/geometric distribution strategy like kd-tree based partitioning and grid-based partitioning, as per the requirements of the algorithm. However, these strategies are generic and are not tailor-made for any specific parallel clustering algorithm. In this paper, we give a very comprehensive literature survey of MPI-based parallel clustering algorithms with special reference to the specific data distribution strategies they employ. We also propose three new data distribution strategies namely Parameterized Dimensional Split for parallel density-based clustering algorithms like DBSCAN and OPTICS, Cell-Based Dimensional Split for dGridSLINK, which is a grid-based hierarchical clustering algorithm that exhibits efficiency for disjoint spatial distribution, and Projection-Based Split, which is a generic distribution strategy. All of these preserve spatial locality, achieve disjoint partitioning, and ensure good data load balancing. The experimental analysis shows the benefits of using the proposed data distribution strategies for algorithms they are designed for, based on which we give appropriate recommendations for their usage.
Park H S, Jun C H. A simple and fast algorithm for K-medoids clustering. Expert Systems with Applications , 2009, 36(2): 3336–3341. DOI: 10.1016/j.eswa.2008.01.039.
Jarvis R A, Patrick E. Clustering using a similarity measure based on shared near neighbors. IEEE Trans. Computers , 1973, C-22(11): 1025–1034. DOI: 10.1109/T-C.1973.223640.
Sibson R. SLINK: An optimally efficient algorithm for the single-link cluster method. The Computer Journal , 1973, 16(1): 30–34. DOI: 10.1093/comjnl/16.1.30.
Agrawal R, Gehrke J, Gunopulos D, Raghavan P. Automatic subspace clustering of high dimensional data for data mining applications. ACM SIGMOD Record , 1998, 27(2): 94–105. DOI: 10.1145/276305.276314.
Woo K G, Lee J H, Kim M H, Lee Y J. FINDIT: A fast and intelligent subspace clustering algorithm using dimension voting. Information and Software Technology , 2004, 46(4): 255–271. DOI: 10.1016/j.infsof.2003.07.003.
Madeira S C, Oliveira A L. Biclustering algorithms for biological data analysis: A survey. IEEE/ACM Trans. Computational Biology and Bioinformatics , 2004, 1(1): 24–45. DOI: 10.1109/TCBB.2004.2.
Olman V, Mao F L, Wu H W, Xu Y. Parallel clustering algorithm for large data sets with applications in bioinformatics. IEEE/ACM Trans. Computational Biology and Bioinformatics , 2009, 6(2): 344–352. DOI: 10.1109/TCBB.2007.70272.
Zhang J, Wu G Q, Hu X G, Li S Y, Hao S L. A parallel clustering algorithm with MPI-MKmeans. Journal of Computers , 2013, 8(1): 10–17. DOI: 10.4304/jcp.8.1.10-17.
Kumar J, Mills R T, Hoffman F M, Hargrove W W. Parallel k-means clustering for quantitative ecoregion delineation using large data sets. Procedia Computer Science , 2011, 4: 1602–1611. DOI: 10.1016/j.procs.2011.04.173.
Kerdprasop K, Taokok S, Kerdprasop N. Declarative parallelized techniques for K-means data clustering. International Journal of Mathematics and Computers in Simulation , 2012, 6(5): 483–495.
Li Y J, Chung S M. Parallel bisecting k-means with prediction clustering algorithm. The Journal of Supercomputing , 2007, 39(1): 19–37. DOI: 10.1007/s11227-006-0002-7.
Xu X W, Jäger J, Kriegel H P. A fast parallel clustering algorithm for large spatial databases. Data Mining and Knowledge Discovery , 1999, 3(3): 263–290. DOI: 10.1023/ A:1009884809343.
Zhou A Y, Zhou S G, Cao J, Fan Y, Hu Y F. Approaches for scaling DBSCAN algorithm to large spatial databases. Journal of Computer Science and Technology , 2000, 15(6): 509–526. DOI: 10.1007/BF02948834.
Coppola M, Vanneschi M. High-performance data mining with skeleton-based structured parallel programming. Parallel Computing , 2002, 28(5): 793–813. DOI: 10.1016/S0167-8191(02)00095-9.
Rajasekaran S. Efficient parallel hierarchical clustering algorithms. IEEE Trans. Parallel and Distributed Systems , 2005, 16(6): 497–502. DOI: 10.1109/TPDS.2005.72.
Deb B, Srirama S N. Parallel K-Means clustering for gene expression data on SNOW. International Journal of Computer Applications , 2013, 71(24): 26–30. DOI: 10.5120/12691-9486.
Torti E, Florimbi G, Castelli F, Ortega S, Fabelo H, Callicó G M, Marrero-Martin M, Leporati F. Parallel K-means clustering for brain cancer detection using hyperspectral images. Electronics , 2018, 7(11): 283. DOI: 10.3390/electronics7110283.
Sardar T H, Ansari Z. An analysis of MapReduce efficiency in document clustering using parallel K-means algorithm. Future Computing and Informatics Journal , 2018, 3(2): 200–209. DOI: 10.1016/j.fcij.2018.03.003.
Bousbaci A, Kamel N. Efficient data distribution and results merging for parallel data clustering in MapReduce environment. Applied Intelligence , 2018, 48(8): 2408–2428. DOI: 10.1007/s10489-017-1089-7.
Bahmani B, Moseley B, Vattani A, Kumar R, Vassilvitskii S. Scalable k-means++. Proceedings of the VLDB Endowment , 2012, 5(7): 622–633. DOI: 10.14778/2180912. 2180915.
Goyal P, Challa J S, Kumar D, Balasubramaniam S, Goyal N. Grid-R-tree: A data structure for efficient neighborhood and nearest neighbor queries in data mining. International Journal of Data Science and Analytics , 2020, 10(1): 25–47. DOI: 10.1007/s41060-020-00208-2.
Chen L, Gao Y J, Huang X R, Jensen C S, Zheng B L. Efficient distributed clustering algorithms on star-schema heterogeneous graphs. IEEE Trans. Knowledge and Data Engineering , 2022, 34(10): 4781–4796. DOI: 10.1109/TKDE.2020.3047631.
Andrade G, Ramos G, Madeira D, Sachetto R, Ferreira R, Rocha L. G-DBSCAN: A GPU accelerated algorithm for density-based clustering. Procedia Computer Science , 2013, 18: 369–378. DOI: 10.1016/j.procs.2013.05.200.
Chen C C, Chen M S. HiClus: Highly scalable density-based clustering with heterogeneous cloud. Procedia Computer Science , 2015, 53: 149–157. DOI: 10.1016/j.procs.2015.07.289.
Hu X J, Liu L, Qiu N J, Yang D, Li M. A MapReduce-based improvement algorithm for DBSCAN. Journal of Algorithms & Computational Technology , 2018, 12(1): 53–61. DOI: 10.1177/1748301817735665.
Gu Y H, Ye X Y, Zhang F, Du Z H, Liu R Y, Yu L F. A parallel varied density-based clustering algorithm with optimized data partition. Journal of Spatial Science , 2018, 63(1): 93–114. DOI: 10.1080/14498596.2017.1352542.
Huang F, Zhu Q, Zhou J, Tao J, Zhou X C, Jin D, Tan X C, Wang L Z. Research on the parallelization of the DBSCAN clustering algorithm for spatial data mining based on the spark platform. Remote Sensing , 2017, 9(12): 1301. DOI: 10.3390/rs9121301.
Zhang Y F, Chen S M, Yu G. Efficient distributed density peaks for clustering large data sets in MapReduce. IEEE Trans. Knowledge and Data Engineering , 2016, 28(12): 3218–3230. DOI: 10.1109/TKDE.2016.2609423.
Gagolewski M, Bartoszuk M, Cena A. Genie: A new, fast, and outlier-resistant hierarchical clustering algorithm. Information Sciences , 2016, 363: 8–23. DOI: 10.1016/j.ins.2016.05.003.
Li X. Parallel algorithms for hierarchical clustering and cluster validity. IEEE Trans. Pattern Analysis and Machine Intelligence , 1990, 12(11): 1088–1092. DOI: 10.1109/34.61708.
Wu C H, Horng S J, Tsai H R. Efficient parallel algorithms for hierarchical clustering on arrays with reconfigurable optical buses. Journal of Parallel and Distributed Computing , 2000, 60(9): 1137–1153. DOI: 10.1006/jpdc.2000.1644.
Du Z, Lin F. A novel parallelization approach for hierarchical clustering. Parallel Computing , 2005, 31(5): 523–527. DOI: 10.1016/j.parco.2005.01.001.
Olson C F. Parallel algorithms for hierarchical clustering. Parallel Computing , 1995, 21(8): 1313–1325. DOI: 10.1016/0167-8191(95)00017-I.
Dash M, Liu H, Scheuermann P, Tan K L. Fast hierarchical clustering and its validation. Data & Knowledge Engineering , 2003, 44(1): 109–138. DOI: 10.1016/S0169-023X(02)00138-6.
Wang Y, Narasayya V, He Y Y, Chaudhuri S. PACk: An efficient partition-based distributed agglomerative hierarchical clustering algorithm for deduplication. Proceedings of the VLDB Endowment , 2022, 15(6): 1132–1145. DOI: 10.14778/3514061.3514062.
Friedman J H, Meulman J J. Clustering objects on subsets of attributes. Journal of the Royal Statistical Society Series B: Statistical Methodology , 2004, 66(4): 815–849. DOI: 10.1111/j.1467-9868.2004.02059.x.
Kaur A, Datta A. A novel algorithm for fast and scalable subspace clustering of high-dimensional data. Journal of Big Data , 2015, 2(1): Article No. 17. DOI: 10.1186/s40537-015-0027-y.
Springel V, White S D M, Jenkins A et al. Simulations of the formation, evolution and clustering of galaxies and quasars. Nature , 2005, 435(1): 629–636. DOI: 10.1038/nature03597.