References
[1]
M. Anjomshoa, M. Salleh, and M. P. Kermani, A taxonomy and survey of distributed computing systems, J. Appl. Sci., vol. 15, no. 1, pp. 46–57, 2015.
[2]
D. C. Marinescu, Parallel and distributed computing: Memories of time past and a glimpse at the future, in Proc. 2014 IEEE 13th Int. Symp. Parallel and Distributed Computing, Marseille, France, 2014, pp. 14–15.
[3]
J. Fan, F. Han, and H. Liu, Challenges of big data analysis, Natl. Sci. Rev., vol. 1, no. 2, pp. 293–314, 2014.
[4]
Z. N. Rashid, S. R. M. Zebari, K. H. Sharif, and K. Jacksi, Distributed cloud computing and distributed parallel computing: A review, in Proc. 2018 Int. Conf. Advanced Science and Engineering (ICOASE), Duhok, Iraq, 2018, pp. 167–172.
[5]
V. K. Singh, M. Taram, V. Agrawal, and B. S. Baghel, A literature review on Hadoop ecosystem and various techniques of big data optimization, in Advances in Data and Information Sciences, M. Kolhe, M. Trivedi, S. Tiwari, and V. Singh, eds. Singapore: Springer, 2018, pp. 231–240.
[6]
K. Zhang, B. Qin, and Q. C. Liu, Study of parallel computing framework based on GPU-Hadoop, (in Chinese), Applicat. Res. Comput., vol. 31, no. 8, pp. 2548–2550 & 2556, 2014.
[7]
H. Ogawa, H. Nakada, R. Takano, and T. Kudoh, SSS: An implementation of key-value store based MapReduce framework, in Proc. 2010 IEEE Second Int. Conf. Cloud Computing Technology and Science, Indianapolis, IN, USA, 2010, pp. 754–761.
[8]
S. Ghemawat, H. Gobioff, and S. T. Leung, The Google file system, ACM SIGOPS Oper. Syst. Rev., vol. 37, no. 5, pp. 29–43, 2003.
[9]
V. K. C. Bumgardner, V. W. Marek, and C. D. Hickey, Cresco: A distributed agent-based edge computing framework, in Proc. 2016 12th Int. Conf. Network and Service Management (CNSM), Montreal, Canada, 2016, pp. 400–405.
[10]
M. Grivas and D. Kehagias, A multi-platform framework for distributed computing, in Proc. 2008 Panhellenic Conf. Informatics, Samos, Greece, 2008, pp. 163–167.
[11]
K. Shvachko, H. Kuang, S. Radia, and R. Chansler, The Hadoop distributed file system, in Proc. 2010 IEEE 26th Symp. Mass Storage Systems and Technologies (MSST), Incline Village, NV, USA, 2010, pp. 1–10.
[12]
S. Salloum, J. Z. Huang, Y. He, and X. Chen, An asymptotic ensemble learning framework for big data analysis, IEEE Access, vol. 7, pp. 3675–3693, 2018.
[13]
J. Z. Huang, Y. He, C. Wei, and X. Zhang, Random sample partition data model and related technologies for big data analysis, J. Data Acquisit. Process., vol. 34, no. 3, pp. 373–385, 2019.
[14]
Y. He, J. Z. Huang, H. Long, Q. Wang, and C. Wei, I-sampling: A new block-based sampling method for large-scale dataset, in Proc. 2017 IEEE Int. Congress on Big Data (BigData Congress), Honolulu, HI, USA, 2017, pp. 360–367.
[15]
T. Biswas, P. Kuila, and A. K. Ray, Multi-level queue for task scheduling in heterogeneous distributed computing system, in Proc. 2017 4th Int. Conf. Advanced Computing and Communication Systems (ICACCS), Coimbatore, India, 2017, pp. 1–6.
[16]
L. Globa and N. Gvozdetska, Comprehensive energy efficient approach to workload processing in distributed computing environment, in Proc. 2020 IEEE Int. Black Sea Conf. Communications and Networking (BlackSeaCom), Odessa, Ukraine, 2020, pp. 1–6.
[17]
N. A. Bahnasawy, M. A. Koutb, M. Mosa, and F. Omara, A new algorithm for static task scheduling for heterogeneous distributed computing systems, Afr. J. Math. Comput. Sci. Res., vol. 4, no. 6, pp. 221–234, 2011.
[18]
M. I. Daoud and N. Kharma, A high performance algorithm for static task scheduling in heterogeneous distributed computing systems, J. Parallel Distrib. Comput., vol. 68, no. 4, pp. 399–409, 2008.
[19]
H. Lin, X. Zhu, B. Yu, X. Tang, W. Xue, W. Chen, L. Zhang, T. Hoefler, X. Ma, X. Liu, et al., ShenTu: Processing multi-trillion edge graphs on millions of cores in seconds, in Proc. Int. Conf. High Performance Computing, Networking, Storage and Analysis, Dallas, TX, USA, 2018, pp. 706–716.
[20]
X. J. Yang, X. K. Liao, K. Lu, Q. F. Hu, J. Q. Song, and J. S. Su, The TianHe-1A supercomputer: Its hardware and software, J. Comput. Sci. Technol., vol. 26, no. 3, pp. 344–351, 2011.
[21]
D. P. Anderson, E. Korpela, and R. Walton, High-performance task distribution for volunteer computing, in Proc. First Int. Conf. e-Science and Grid Computing (e-Science’05), Melbourne, Australia, 2005, p. 8.
[22]
G. Lu and W. H. Zeng, Cloud computing survey, Appl. Mechan. Mater., vols. 530–531, pp. 650–661, 2014.
[23]
K. Krauter, R. Buyya, and M. Maheswaran, A taxonomy and survey of grid resource management systems for distributed computing, Softw. Pract. Exper., vol. 32, no. 2, pp. 135–164, 2002.
[24]
S. Patidar, D. Rane, and P. Jain, A survey paper on cloud computing, in Proc. 2012 Second Int. Conf. Advanced Computing & Communication Technologies, Rohtak, India, 2012, pp. 394–398.
[25]
R. Nath and A. Nagaraju, A novel task assignment heuristic using local search in distributed computing systems, in Proc. 2017 Int. Conf. Energy, Communication, Data Analytics and Soft Computing (ICECDS), Chennai, India, 2017, pp. 2767–2771.
[26]
Z. Fadika and M. Govindaraju, DELMA: Dynamically elastic MapReduce framework for CPU-intensive applications, in Proc. 2011 11th IEEE/ACM Int. Symp. Cluster, Cloud and Grid Computing, Newport Beach, CA, USA, 2011, pp. 454–463.
[27]
K. Singh, M. Alam, and S. Kumar, A survey of static scheduling algorithm for distributed computing system, Int. J. Comput. Applicat., vol. 129, no. 2, pp. 25–30, 2015.
[28]
O. Tuncer, E. Ates, Y. Zhang, A. Turk, J. Brandt, V. J. Leung, M. Egele, and A. K. Coskun, Diagnosing performance variations in HPC applications using machine learning, in Proc. 32nd Int. Conf. High Performance Computing, Frankfurt, Germany, 2017, pp. 355–373.
[29]
G. Ramirez-Gargallo, M. Garcia-Gasulla, and F. Mantovani, Tensorflow on state-of-the-art HPC clusters: A machine learning use case, in Proc. 2019 19th IEEE/ACM Int. Symp. Cluster, Cloud and Grid Computing (CCGRID), Larnaca, Cyprus, 2019, pp. 526–533.
[30]
V. K. Naik, S. K. Setia, and M. S. Squillante, Performance analysis of job scheduling policies in parallel supercomputing environments, in Proc. 1993 ACM/IEEE Conf. Supercomputing, Portland, OR, USA, 1993, pp. 824–833.
[31]
M. A. S. Netto, R. N. Calheiros, E. R. Rodrigues, R. L. F. Cunha, and R. Buyya, HPC cloud for scientific and business applications: Taxonomy, vision, and research challenges, ACM Comput. Surv., vol. 51, no. 1, pp. 1–29, 2019.
[32]
T. D. Thanh, S. Mohan, E. Choi, S. Kim, and P. Kim, A taxonomy and survey on distributed file systems, in Proc. 2008 4th Int. Conf. Networked Computing and Advanced Information Management, Gyeongju, Republic of Korea, 2008, pp. 144–149.
[33]
J. Blomer, A survey on distributed file system technology, J. Phys. Conf. Ser., vol. 608, p. 012039, 2015.
[34]
L. Jiang, B. Li, and M. Song, The optimization of HDFS based on small files, in Proc. 2010 3rd IEEE Int. Conf. Broadband Network and Multimedia Technology (IC-BNMT), Beijing, China, 2010, pp. 912–915.
[35]
S. Zhuo, X. Wu, W. Zhang, and W. Dou, Distributed file system and classification for small images, in Proc. 2013 IEEE Int. Conf. Green Computing and Communications and IEEE Internet of Things and IEEE Cyber, Physical and Social Computing, Beijing, China, 2013, pp. 2231–2234.
[36]
S. Fu, L. He, C. Huang, X. Liao, and K. Li, Performance optimization for managing massive numbers of small files in distributed file systems, IEEE Trans. Parallel Distrib. Syst., vol. 26, no. 12, pp. 3433–3448, 2015.
[37]
H. Che and H. Zhang, Exploiting FastDFS client-based small file merging, in Proc. 2016 Int. Conf. Artificial Intelligence and Engineering Applications, Hong Kong, China, 2016, pp. 242–246.
[38]
W. K. Josephson, L. A. Bongo, K. Li, and D. Flynn, DFS: A file system for virtualized flash storage, ACM Trans. Storage, vol. 6, no. 3, p. 14, 2010.
[39]
Z. Ullah, S. Jabbar, M. H. Bin Tariq Alvi, and A. Ahmad, Analytical study on performance, challenges and future considerations of Google file system, Int. J. Computer Communicat. Eng., vol. 3, no. 4, pp. 279–284, 2014.
[40]
R. Gu, X. Yang, J. Yan, Y. Sun, B. Wang, C. Yuan, and Y. Huang, SHadoop: Improving MapReduce performance by optimizing job execution mechanism in Hadoop clusters, J. Parallel Distrib. Comput., vol. 74, no. 3, pp. 2166–2179, 2014.
[41]
I. Polato, R. Ré, A. Goldman, and F. Kon, A comprehensive view of Hadoop research—A systematic literature review, J. Network Comput. Applicat., vol. 46, pp. 1–25, 2014.
[42]
Y. Wang, W. Jiang, and G. Agrawal, SciMATE: A novel MapReduce-like framework for multiple scientific data formats, in Proc. 2012 12th IEEE/ACM Int. Symp. Cluster, Cloud and Grid Computing (CCGRID 2012), Ottawa, Canada, 2012, pp. 443–450.
[43]
J. Dean and S. Ghemawat, MapReduce: Simplified data processing on large clusters, Commun. ACM, vol. 51, no. 1, pp. 107–113, 2008.
[44]
M. R. Ghazi and D. Gangodkar, Hadoop, MapReduce and HDFS: A developer's perspective, Procedia Comput. Sci., vol. 48, pp. 45–50, 2015.
[45]
Y. Zhang, Q. Gao, L. Gao, and C. Wang, iMapReduce: A distributed computing framework for iterative computation, J. Grid Comput., vol. 10, no. 1, pp. 47–68, 2012.
[46]
H. Alshammari, J. Lee, and H. Bajwa, H2Hadoop: Improving Hadoop performance using the metadata of related jobs, IEEE Trans. Cloud Comput., vol. 6, no. 4, pp. 1031–1040, 2018.
[47]
P. S. Janardhanan and P. Samuel, Launch overheads of spark applications on standalone and Hadoop YARN clusters, in Advances in Electrical and Computer Technologies, T. Sengodan, M. Murugappan, and S. Misra, eds. Singapore: Springer, 2020, pp. 47–54.
[48]
C. Lam, Hadoop in Action. Greenwich, CT, USA: Manning Publications Co., 2010.
[49]
S. R. Alapati, Expert Hadoop Administration: Managing, Tuning, and Securing Spark, YARN, and HDFS. Palo Alto, CA, USA: Addison-Wesley Professional, 2016.
[50]
R. K. Bhatia and A. Bansal, Deploying and improving Hadoop on pseudo-distributed mode, Compusoft, vol. 3, no. 10, p. 1136, 2014.
[51]
F. Li, J. Chen, and Z. Wang, Wireless MapReduce distributed computing, IEEE Trans. Inform. Theory, vol. 65, no. 10, pp. 6101–6114, 2019.
[52]
C. F. Chiu, S. J. Hsu, and S. R. Jan, Distributed MapReduce framework using distributed hash table, in Proc. 2013 Int. Joint Conf. Awareness Science and Technology & Ubi-Media Computing (iCAST 2013 & UMEDIA 2013), Aizu-Wakamatsu, Japan, 2013, pp. 475–481.
[53]
S. D. Kavila, G. S. V. P. Raju, S. C. Satapathy, A. Machiraju, G. V. L. Kinnera, and K. Rasly, A survey on fault management techniques in distributed computing, in Proceedings of the International Conference on Frontiers of Intelligent Computing: Theory and Applications (FICTA), S. C. Satapathy, S. K. Udgata, and B. N. Biswal, eds. Berlin, Germany: Springer, 2013, pp. 593–602.
[54]
Y. C. Sun and X. F. Wang, MapReduce designed to optimize computing model based on Hadoop framework, (in Chinese), Comput. Sci., vol. 41, no. 11A, pp. 333–336, 2014.
[55]
J. Yu, J. Wu, and M. Sarwat, A demonstration of GeoSpark: A cluster computing framework for processing big spatial data, in Proc. 2016 IEEE 32nd Int. Conf. Data Engineering (ICDE), Helsinki, Finland, 2016, pp. 1410–1413.
[56]
Z. Yang, C. Zhang, M. Hu, and F. Lin, OPC: A distributed computing and memory computing-based effective solution of big data, in Proc. 2015 IEEE Int. Conf. Smart City/SocialCom/SustainCom (SmartCity), Chengdu, China, 2015, pp. 50–53.
[57]
V. Taran, O. Alienin, S. Stirenko, Y. Gordienko, and A. Rojbi, Performance evaluation of distributed computing environments with Hadoop and Spark frameworks, in Proc. 2017 IEEE Int. Young Scientists Forum on Applied Physics and Engineering (YSF), Lviv, Ukraine, 2017, pp. 80–83.
[58]
M. Armbrust, R. S. Xin, C. Lian, Y. Huai, D. Liu, J. K. Bradley, X. Meng, T. Kaftan, M. J. Franklin, A. Ghodsi, et al., Spark SQL: Relational data processing in Spark, in Proc. 2015 ACM SIGMOD Int. Conf. Management of Data, Melbourne, Australia, 2015, pp. 1383–1394.
[59]
Y. Benlachmi and M. L. Hasnaoui, Big data and Spark: Comparison with Hadoop, in Proc. 2020 Fourth World Conf. Smart Trends in Systems, Security and Sustainability (WorldS4), London, UK, 2020, pp. 811–817.
[60]
P. Karunaratne, S. Karunasekera, and A. Harwood, Distributed stream clustering using micro-clusters on Apache Storm, J. Parallel Distrib. Comput., vol. 108, pp. 74–84, 2017.
[61]
P. Carbone, S. Ewen, G. Fóra, S. Haridi, S. Richter, and K. Tzoumas, State management in Apache Flink®: Consistent stateful distributed stream processing, Proc. VLDB Endow., vol. 10, no. 12, pp. 1718–1729, 2017.
[62]
F. Hueske and V. Kalavri, Stream Processing with Apache Flink: Fundamentals, Implementation, and Operation of Streaming Applications. Sebastopol, CA, USA: O’Reilly Media, 2019.
[63]
P. Carbone, A. Katsifodimos, S. Ewen, V. Markl, S. Haridi, and K. Tzoumas, Apache Flink™: Stream and batch processing in a single engine, Bull. IEEE Comput. Soc. Technol. Committ. Data Eng., vol. 36, no. 4, pp. 28–38, 2015.
[64]
A. Katsifodimos and S. Schelter, Apache Flink: Stream analytics at scale, in Proc. 2016 IEEE Int. Conf. Cloud Engineering Workshop (IC2EW), Berlin, Germany, 2016, p. 193.
[65]
M. H. Iqbal and T. R. Soomro, Big data analysis: Apache Storm perspective, Int. J. Comput. Trends Technol., vol. 19, no. 1, pp. 9–14, 2015.
[66]
T. Da Silva Morais, Survey on frameworks for distributed computing: Hadoop, Spark and Storm, in Proc. 10th Doctoral Symp. in Informatics Engineering (DSIE'15), Porto, Portugal, 2015, pp. 95–105.
[67]
A. Thusoo, J. S. Sarma, N. Jain, Z. Shao, P. Chakka, S. Anthony, H. Liu, P. Wyckoff, and R. Murthy, Hive: A warehousing solution over a map-reduce framework, Proc. VLDB Endow., vol. 2, no. 2, pp. 1626–1629, 2009.
[68]
A. F. Gates, O. Natkovich, S. Chopra, P. Kamath, S. M. Narayanamurthy, C. Olston, B. Reed, S. Srinivasan, and U. Srivastava, Building a high-level dataflow system on top of map-reduce: The Pig experience, Proc. VLDB Endow., vol. 2, no. 2, pp. 1414–1425, 2009.
[69]
A. Thusoo, J. Sen Sarma, N. Jain, Z. Shao, P. Chakka, N. Zhang, S. Antony, H. Liu, and R. Murthy, Hive: A petabyte scale data warehouse using Hadoop, in Proc. 2010 IEEE 26th Int. Conf. Data Engineering (ICDE 2010), Long Beach, CA, USA, 2010, pp. 996–1005.
[70]
C. Olston, B. Reed, U. Srivastava, R. Kumar, and A. Tomkins, Pig Latin: A not-so-foreign language for data processing, in Proc. 2008 ACM SIGMOD Int. Conf. Management of Data, Vancouver, Canada, 2008, pp. 1099–1110.
[71]
X. Lu, H. Shi, R. Biswas, M. H. Javed, and D. K. Panda, DLoBD: A comprehensive study of deep learning over big data stacks on HPC clusters, IEEE Trans. Multi-Scale Comput. Syst., vol. 4, no. 4, pp. 635–648, 2018.
[72]
S. Sakr, Big data processing stacks, IT Profess., vol. 19, no. 1, pp. 34–41, 2017.
[73]
J. Kreps, N. Narkhede, and J. Rao, Kafka: A distributed messaging system for log processing, in Proc. 6th Int. Workshop on Networking Meets Databases, Athens, Greece, 2011, pp. 1–7.
[74]
S. Aravinth, A. H. Begam, S. Shanmugapriyaa, S. Sowmya, and E. Arun, An efficient Hadoop frameworks Sqoop and Ambari for big data processing, Int. J. Innovat. Res. Sci. Technol., vol. 1, no. 10, pp. 252–255, 2015.
[75]
M. N. Vora, Hadoop-HBase for large-scale data, in Proc. 2011 Int. Conf. Computer Science and Network Technology, Harbin, China, 2011, pp. 601–605.
[76]
J. Carlson, Redis in Action. Shelter Island, NY, USA: Manning Publications Co., 2013.
[77]
D. Huang, Q. Liu, Q. Cui, Z. Fang, X. Ma, F. Xu, L. Shen, L. Tang, Y. Zhou, M. Huang, et al., TiDB: A raft-based HTAP database, Proc. VLDB Endow., vol. 13, no. 12, pp. 3072–3084, 2020.
[78]
R. Anil, G. Capan, I. Drost-Fromm, T. Dunning, E. Friedman, T. Grant, S. Quinn, P. Ranjan, S. Schelter, and Ö. Yılmazel, Apache Mahout: Machine learning on distributed dataflow systems, J. Mach. Learn. Res., vol. 21, no. 127, pp. 1–6, 2020.
[79]
B. Quinto, Introduction to Spark and Spark MLlib, in Next-Generation Machine Learning with Spark, B. Quinto, ed. New York, NY, USA: Apress, 2020, pp. 29–96.
[80]
S. V. Ranawade, S. Navale, A. Dhamal, K. Deshpande, and C. Ghuge, Online analytical processing on Hadoop using Apache Kylin, Int. J. Appl. Inform. Syst., vol. 12, pp. 1–5, 2017.
[81]
L. Espeholt, H. Soyer, R. Munos, K. Simonyan, V. Mnih, T. Ward, Y. Doron, V. Firoiu, T. Harley, I. Dunning, et al., Impala: Scalable distributed deep-RL with importance weighted actor-learner architectures, in Proc. 35th Int. Conf. Machine Learning, Stockholm, Sweden, 2018, pp. 1407–1416.
[82]
V. K. Vavilapalli, A. C. Murthy, C. Douglas, S. Agarwal, M. Konar, R. Evans, T. Graves, J. Lowe, H. Shah, S. Seth, et al., Apache Hadoop YARN: Yet another resource negotiator, in Proc. 4th Annual Symp. Cloud Computing, Santa Clara, CA, USA, 2013, p. 5.
[83]
B. Hindman, A. Konwinski, M. Zaharia, A. Ghodsi, A. D. Joseph, R. Katz, S. Shenker, and I. Stoica, Mesos: A platform for fine-grained resource sharing in the data center, in Proc. 8th USENIX Conf. Networked Systems Design and Implementation, Boston, MA, USA, 2011, pp. 295–308.
[84]
S. Wadkar and M. Siddalingaiah, Apache Ambari, in Pro Apache Hadoop, S. Wadkar and M. Siddalingaiah, eds. Berkeley, CA, USA: Springer, 2014, pp. 399–401.
[85]
F. Junqueira and B. Reed, ZooKeeper: Distributed Process Coordination. Sebastopol, CA, USA: O’Reilly Media, 2013.
[86]
A. Y. Zomaya, Parallel and Distributed Computing Handbook. New York, NY, USA: McGraw-Hill Professional, 1995.
[87]
A. D. Kshemkalyani and M. Singhal, Distributed Computing: Principles, Algorithms, and Systems. Cambridge, UK: Cambridge University Press, 2011.
[89]
K. J. Merceedi and N. A. Sabry, A comprehensive survey for Hadoop distributed file system, Asian J. Res. Comput. Sci., vol. 11, no. 2, pp. 46–57, 2021.
[90]
J. Dean and S. Ghemawat, MapReduce: A flexible data processing tool, Commun. ACM, vol. 53, no. 1, pp. 72–77, 2010.
[91]
Y. Bu, B. Howe, M. Balazinska, and M. D. Ernst, HaLoop: Efficient iterative data processing on large clusters, Proc. VLDB Endow., vol. 3, nos. 1–2, pp. 285–296, 2010.
[92]
M. Yoon, H. I. Kim, D. H. Choi, H. Jo, and J. W. Chang, Performance analysis of MapReduce-based distributed systems for iterative data processing applications, in Mobile, Ubiquitous, and Intelligent Computing, J. J. Park, H. Adeli, N. Park, and I. Woungang, eds. Berlin, Germany: Springer, 2014, pp. 293–299.
[93]
Y. Bu, B. Howe, M. Balazinska, and M. D. Ernst, The HaLoop approach to large-scale iterative data analysis, VLDB J., vol. 21, no. 2, pp. 169–190, 2012.
[94]
S. B. Sriramoju, A review on processing big data, Int. J. Innovat. Res. Comput. Communicat. Eng., vol. 2, pp. 2672–2685, 2014.
[95]
M. Zaharia, M. Chowdhury, T. Das, A. Dave, J. Ma, M. McCauley, M. J. Franklin, S. Shenker, and I. Stoica, Resilient distributed datasets: A fault-tolerant abstraction for in-memory cluster computing, in Proc. 9th USENIX Conf. Networked Systems Design and Implementation, San Jose, CA, USA, 2012, pp. 15–28.
[96]
M. Zaharia, M. Chowdhury, M. J. Franklin, S. Shenker, and I. Stoica, Spark: Cluster computing with working sets, in Proc. 2nd USENIX Conf. Hot Topics in Cloud Computing, Boston, MA, USA, 2010, p. 10.
[97]
S. Salloum, J. Z. Huang, and Y. He, Random sample partition: A distributed data model for big data analysis, IEEE Trans. Industr. Inform., vol. 15, no. 11, pp. 5846–5854, 2019.
[98]
S. Salloum, J. Z. Huang, and Y. He, Empirical analysis of asymptotic ensemble learning for big data, in Proc. IEEE/ACM 3rd Int. Conf. Big Data Computing Applications and Technologies, Shanghai, China, 2016, pp. 8–17.
[99]
C. Wei, J. Zhang, T. Valiullin, W. Cao, Q. Wang, and H. Long, Distributed and parallel ensemble classification for big data based on Kullback-Leibler random sample partition, in Proc. Int. Conf. Algorithms and Architectures for Parallel Processing, New York, NY, USA, 2020, pp. 448–464.
[100]
C. Wei, S. Salloum, T. Z. Emara, X. Zhang, J. Z. Huang, and Y. He, A two-stage data processing algorithm to generate random sample partitions for big data analysis, in Proc. 11th Int. Conf. Cloud Computing, Seattle, WA, USA, 2018, pp. 347–364.
[101]
T. Z. Emara and J. Z. Huang, A distributed data management system to support large-scale data analysis, J. Syst. Softw., vol. 148, pp. 105–115, 2019.
[102]
T. Z. Emara and J. Z. Huang, Distributed data strategies to support large-scale data analysis across geo-distributed data centers, IEEE Access, vol. 8, pp. 178526–178538, 2020.
[103]
V. Kalavri, V. Brundza, and V. Vlassov, Block sampling: Efficient accurate online aggregation in MapReduce, in Proc. 2013 IEEE 5th Int. Conf. Cloud Computing Technology and Science, Bristol, UK, 2013, pp. 250–257.
[104]
G. Cormode and N. Duffield, Sampling for big data: A tutorial, in Proc. 20th ACM SIGKDD Int. Conf. Knowledge Discovery and Data Mining, New York, NY, USA, 2014, p. 1975.
[105]
P. Sanders, S. Lamm, L. Hübschle-Schneider, E. Schrade, and C. Dachsbacher, Efficient parallel random sampling: Vectorized, cache-efficient, and online, ACM Trans. Math. Softw., vol. 44, no. 3, p. 29, 2018.
[106]
E. Gavagsaz, A. Rezaee, and H. H. S. Javadi, Load balancing in reducers for skewed data in MapReduce systems by using scalable simple random sampling, J. Supercomput., vol. 74, no. 7, pp. 3415–3440, 2018.
[107]
O. Sagi and L. Rokach, Ensemble learning: A survey, Wiley Interdiscip. Rev. Data Min. Knowl. Discov., vol. 8, no. 4, p. e1249, 2018.
[108]
J. K. Kim and Z. Wang, Sampling techniques for big data analysis, Int. Stat. Rev., vol. 87, no. S1, pp. S177–S191, 2019.
[109]
X. Meng, Scalable simple random sampling and stratified sampling, in Proc. 30th Int. Conf. Machine Learning, Atlanta, GA, USA, 2013, pp. III-531–III-539.
[110]
M. S. Mahmud, J. Z. Huang, S. Salloum, T. Z. Emara, and K. Sadatdiynov, A survey of data partitioning and sampling methods to support big data analysis, Big Data Min. Anal., vol. 3, no. 2, pp. 85–101, 2020.
[111]
S. Chaudhuri, G. Das, and U. Srivastava, Effective use of block-level sampling in statistics estimation, in Proc. 2004 ACM SIGMOD Int. Conf. Management of Data, Paris, France, 2004, pp. 287–298.
[112]
G. Cormode, M. Garofalakis, P. J. Haas, and C. Jermaine, Synopses for massive data: Samples, histograms, wavelets, sketches, Found. Trends Databases, vol. 4, nos. 1–3, pp. 1–294, 2012.
[113]
S. Salloum, J. Z. Huang, and Y. He, Exploring and cleaning big data with random sample data blocks, J. Big Data, vol. 6, no. 1, p. 45, 2019.