Survey of Distributed Computing Frameworks for Supporting Big Data Analysis

Xudong Sun1, Yulin He1,2, Dingming Wu1, and Joshua Zhexue Huang1,2 (corresponding author)
1 College of Computer Science and Software Engineering, Shenzhen University, Shenzhen 518060, China
2 Guangdong Laboratory of Artificial Intelligence and Digital Economy (SZ), Shenzhen 518107, China

Abstract

Distributed computing frameworks are the fundamental components of distributed computing systems. They provide an essential way to support efficient processing of big data on clusters or in the cloud. However, the volume of big data grows faster than the processing capacity of clusters. As a result, distributed computing frameworks based on the MapReduce computing model are not adequate for big data analysis tasks, which often require running complex analytical algorithms on extremely large data sets at the terabyte scale. In performing such tasks, these frameworks face three challenges: computational inefficiency due to high I/O and communication costs, poor scalability to big data due to memory limits, and a restricted set of analytical algorithms, because many serial algorithms cannot be expressed in the MapReduce programming model. New distributed computing frameworks need to be developed to overcome these challenges. In this paper, we review the MapReduce-type distributed computing frameworks currently used to handle big data and discuss the problems they face in big data analysis. In addition, we present a non-MapReduce distributed computing framework that has the potential to overcome these challenges.
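
To make the MapReduce programming model concrete, below is a minimal, single-process sketch of the pattern in Python (word count, the canonical example). It is an illustration only, assuming nothing beyond the standard library; the names map_reduce, word_count_mapper, and word_count_reducer are our own and do not come from any particular framework. On a real cluster, the grouping step becomes a disk- and network-bound shuffle across machines, which is the source of the I/O and communication costs noted above.

from collections import defaultdict

def map_reduce(records, mapper, reducer):
    # Map phase: each input record independently emits (key, value) pairs.
    intermediate = defaultdict(list)
    for record in records:
        for key, value in mapper(record):
            intermediate[key].append(value)
    # The grouping above stands in for the shuffle; the reduce phase then
    # merges all values that share a key.
    return {key: reducer(key, values) for key, values in intermediate.items()}

def word_count_mapper(line):
    # Emit (word, 1) for every word in one line of input.
    for word in line.split():
        yield word.lower(), 1

def word_count_reducer(word, counts):
    # Sum the partial counts collected for one word.
    return sum(counts)

if __name__ == "__main__":
    lines = ["big data needs big clusters",
             "mapreduce processes big data in batches"]
    print(map_reduce(lines, word_count_mapper, word_count_reducer))
    # e.g., {'big': 3, 'data': 2, 'needs': 1, ...}

An algorithm fits this model only if it decomposes into independent per-record map calls followed by per-key reductions; serial algorithms that need shared state or many dependent iterations do not decompose this way, which is the third challenge listed in the abstract.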

Keywords: big data analysis, approximate computing, distributed computing frameworks, MapReduce computing model

Publication history

Received: 15 June 2022
Accepted: 28 June 2022
Published: 26 January 2023
Issue date: June 2023

Copyright

© The author(s) 2023.

Acknowledgements

This work was supported by the National Natural Science Foundation of China (No. 61972261) and the Basic Research Foundations of Shenzhen (Nos. JCYJ20210324093609026 and JCYJ20200813091134001).

Rights and permissions

The articles published in this open access journal are distributed under the terms of the Creative Commons Attribution 4.0 International License (http://creativecommons.org/licenses/by/4.0/).
