A Novel Clustering Technique for Efficient Clustering of Big Data in Hadoop Ecosystem

Sunil Kumar; Maninder Singh

doi:10.26599/BDMA.2018.9020037


Open Access


∙ Directorate of Livestock Farms, Guru Angad Dev Veterinary and Animal Sciences University, Ludhiana 141001, India.

∙ Department of Computer Science, Punjabi University, Punjab 147002, India.


Abstract

Big data analytics and data mining are techniques used to analyze data and to extract hidden information. Traditional approaches to analysis and extraction do not work well for big data because this data is complex and of very high volume. Data clustering, a major data mining technique, groups the data into clusters and makes it easy to extract information from these clusters. However, existing clustering algorithms, such as $k$-means and hierarchical clustering, are not efficient because the quality of the clusters they produce is compromised. Therefore, there is a need to design an efficient and highly scalable clustering algorithm. In this paper, we put forward a new clustering algorithm, called hybrid clustering, to overcome the disadvantages of existing clustering algorithms. We compare the new hybrid algorithm with existing algorithms on the basis of precision, recall, F-measure, execution time, and accuracy of results. The experimental results show that the proposed hybrid clustering algorithm is more accurate and achieves better precision, recall, and F-measure values.
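The abstract names precision, recall, and F-measure as comparison criteria but does not say how they are computed for clusters. The sketch below shows one common convention, pair counting, in which every pair of points is checked for agreement between the predicted clusters and reference labels. This is an assumption for illustration and may differ from the definition used in the paper; the function name `pairwise_scores` and its parameters are hypothetical.

```python
from itertools import combinations


def pairwise_scores(true_labels, predicted_clusters):
    """Pair-counting precision, recall, and F-measure for a clustering.

    Assumption: a pair of points counts as a true positive when both the
    reference labels and the predicted clusters place the two points together.
    """
    if len(true_labels) != len(predicted_clusters):
        raise ValueError("label sequences must have the same length")

    tp = fp = fn = 0
    for i, j in combinations(range(len(true_labels)), 2):
        same_true = true_labels[i] == true_labels[j]
        same_pred = predicted_clusters[i] == predicted_clusters[j]
        if same_pred and same_true:
            tp += 1  # grouped together, and should be
        elif same_pred:
            fp += 1  # grouped together, but should not be
        elif same_true:
            fn += 1  # separated, but should be together

    # Guard against empty denominators (e.g. every point in its own cluster).
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    denom = precision + recall
    f_measure = 2 * precision * recall / denom if denom else 0.0
    return precision, recall, f_measure


if __name__ == "__main__":
    reference = ["a", "a", "a", "b", "b"]
    predicted = [0, 0, 1, 1, 1]
    print(pairwise_scores(reference, predicted))
```

Cluster identifiers only need to be comparable for equality, so the predicted cluster numbers do not have to match the reference label names.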

Keywords

big data; clustering; Hadoop; k-means; hierarchical


Big Data Mining and Analytics

Volume 2, Issue 4, December 2019

Pages 240-247

DOI: 10.26599/BDMA.2018.9020037

Cite this article:

Kumar S, Singh M. A Novel Clustering Technique for Efficient Clustering of Big Data in Hadoop Ecosystem. Big Data Mining and Analytics, 2019, 2(4): 240-247. https://doi.org/10.26599/BDMA.2018.9020037


Received: 08 November 2018

Revised: 09 January 2019

Accepted: 12 January 2019

Published: 05 August 2019

The articles published in this open access journal are distributed under the terms of the Creative Commons Attribution 4.0 International License (http://creativecommons.org/licenses/by/4.0/).