Survey of Distributed Computing Frameworks for Supporting Big Data Analysis

Xudong Sun1, Yulin He1,2, Dingming Wu1, and Joshua Zhexue Huang1,2 (corresponding author)
1 College of Computer Science and Software Engineering, Shenzhen University, Shenzhen 518060, China
2 Guangdong Laboratory of Artificial Intelligence and Digital Economy (SZ), Shenzhen 518107, China

Abstract

Distributed computing frameworks are the fundamental components of distributed computing systems. They provide an essential way to support efficient processing of big data on clusters or in the cloud. However, the volume of big data grows faster than the processing capacity of clusters. As a result, distributed computing frameworks based on the MapReduce computing model are not adequate for big data analysis tasks, which often require running complex analytical algorithms on extremely large data sets at the terabyte scale. In performing such tasks, these frameworks face three challenges: computational inefficiency due to high I/O and communication costs, poor scalability to big data due to memory limits, and a restricted set of analytical algorithms, because many serial algorithms cannot be expressed in the MapReduce programming model. New distributed computing frameworks need to be developed to overcome these challenges. In this paper, we review the MapReduce-type distributed computing frameworks currently used to handle big data and discuss the problems they face in big data analysis. In addition, we present a non-MapReduce distributed computing framework that has the potential to overcome these challenges.
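
To make the MapReduce programming model concrete, below is a minimal, single-process sketch of the pattern in Python (word count, the canonical example). It is an illustration only, assuming nothing beyond the standard library; the names map_reduce, word_count_mapper, and word_count_reducer are our own and do not come from any particular framework. On a real cluster, the grouping step becomes a disk- and network-bound shuffle across machines, which is the source of the I/O and communication costs noted above.

from collections import defaultdict

def map_reduce(records, mapper, reducer):
    # Map phase: each input record independently emits (key, value) pairs.
    intermediate = defaultdict(list)
    for record in records:
        for key, value in mapper(record):
            intermediate[key].append(value)
    # The grouping above stands in for the shuffle; the reduce phase then
    # merges all values that share a key.
    return {key: reducer(key, values) for key, values in intermediate.items()}

def word_count_mapper(line):
    # Emit (word, 1) for every word in one line of input.
    for word in line.split():
        yield word.lower(), 1

def word_count_reducer(word, counts):
    # Sum the partial counts collected for one word.
    return sum(counts)

if __name__ == "__main__":
    lines = ["big data needs big clusters",
             "mapreduce processes big data in batches"]
    print(map_reduce(lines, word_count_mapper, word_count_reducer))
    # e.g., {'big': 3, 'data': 2, 'needs': 1, ...}

An algorithm fits this model only if it decomposes into independent per-record map calls followed by per-key reductions; serial algorithms that need shared state or many dependent iterations do not decompose this way, which is the third challenge listed in the abstract.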

Keywords: big data analysis, approximate computing, distributed computing frameworks, MapReduce computing model

Publication history

Received: 15 June 2022
Accepted: 28 June 2022
Published: 26 January 2023
Issue date: June 2023

Copyright

© The author(s) 2023.

Acknowledgements

This work was supported by the National Natural Science Foundation of China (No. 61972261) and the Basic Research Foundations of Shenzhen (Nos. JCYJ20210324093609026 and JCYJ20200813091134001).

Rights and permissions

The articles published in this open access journal are distributed under the terms of the Creative Commons Attribution 4.0 International License (http://creativecommons.org/licenses/by/4.0/).
