Volume 2, Issue 4




A Novel Clustering Technique for Efficient Clustering of Big Data in Hadoop Ecosystem

Sunil Kumar, Maninder Singh
Directorate of Livestock Farms, Guru Angad Dev Veterinary and Animal Sciences University, Ludhiana 141001, India.
Department of Computer Science, Punjabi University, Punjab 147002, India.

Abstract

Big data analytics and data mining are techniques for analyzing data and extracting hidden information. Traditional approaches to analysis and extraction do not work well for big data because such data is complex and very high in volume. Data clustering, a major data mining technique, groups data into clusters so that information can be extracted from them more easily. However, existing clustering algorithms, such as k-means and hierarchical clustering, are not efficient because the quality of the clusters they produce is compromised. There is therefore a need for an efficient and highly scalable clustering algorithm. In this paper, we put forward a new clustering algorithm, called hybrid clustering, to overcome the disadvantages of existing clustering algorithms. We compare the new hybrid algorithm with existing algorithms on the basis of precision, recall, F-measure, execution time, and accuracy of results. The experimental results show that the proposed hybrid clustering algorithm is more accurate and has better precision, recall, and F-measure values.

Keywords: big data, clustering, Hadoop, k-means, hierarchical
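
The abstract names precision, recall, and F-measure as comparison criteria but does not say how they are computed for a clustering. One common convention, assumed here purely for illustration, is pair counting: a pair of points placed in the same cluster is a true positive if the two points also share a reference class. The function name `pairwise_scores` and the toy labels are hypothetical and do not come from the paper.

```python
from itertools import combinations


def pairwise_scores(true_labels, cluster_labels):
    """Return pair-counting (precision, recall, F-measure) for a clustering.

    Assumption: pair counting is only one of several conventions; the
    paper's abstract does not state which one the authors used.
    """
    if len(true_labels) != len(cluster_labels):
        raise ValueError("label sequences must have the same length")

    tp = fp = fn = 0
    for i, j in combinations(range(len(true_labels)), 2):
        same_class = true_labels[i] == true_labels[j]
        same_cluster = cluster_labels[i] == cluster_labels[j]
        if same_cluster and same_class:
            tp += 1  # pair grouped together, and correctly so
        elif same_cluster:
            fp += 1  # pair grouped together, but classes differ
        elif same_class:
            fn += 1  # pair split apart, but classes match

    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    total = precision + recall
    f_measure = 2 * precision * recall / total if total else 0.0
    return precision, recall, f_measure


# Toy example: six points, two reference classes, one point misassigned.
reference = [0, 0, 0, 1, 1, 1]
predicted = [0, 0, 1, 1, 1, 1]
precision, recall, f_measure = pairwise_scores(reference, predicted)
print(f"precision={precision:.3f} recall={recall:.3f} F={f_measure:.3f}")
```

Because only co-membership of pairs is compared, the scores do not depend on how the cluster identifiers are numbered, which makes this convention convenient when comparing outputs of different clustering algorithms.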


Publication history

Received: 08 November 2018
Revised: 09 January 2019
Accepted: 12 January 2019
Published: 05 August 2019
Issue date: December 2019

Copyright

© The author(s) 2019

Rights and permissions

The articles published in this open access journal are distributed under the terms of the Creative Commons Attribution 4.0 International License (http://creativecommons.org/licenses/by/4.0/).
