Open Access
Issue
Published:
*27 February 2020*

Keywords:

big data analysis, data partitioning, data sampling, distributed and parallel computing, approximate computing
Cite this article:

Mahmud MS, Huang JZ, Salloum S, et al.
A Survey of Data Partitioning and Sampling Methods to Support Big Data Analysis.
Big Data Mining and Analytics,
2020, 3(2): 85-101.
https://doi.org/10.26599/BDMA.2019.9020015
Download citation

2609

Views

280

Downloads

Citations

159

Crossref

119

WoS

177

Scopus

0

CSCD

Computer clusters with the shared-nothing architecture are the major computing platforms for big data processing and analysis. In cluster computing, data partitioning and sampling are two fundamental strategies to speed up the computation of big data and increase scalability. In this paper, we present a comprehensive survey of the methods and techniques of data partitioning and sampling with respect to big data processing and analysis. We start with an overview of the mainstream big data frameworks on Hadoop clusters. The basic methods of data partitioning are then discussed including three classical horizontal partitioning schemes: range, hash, and random partitioning. Data partitioning on Hadoop clusters is also discussed with a summary of new strategies for big data partitioning, including the new Random Sample Partition (RSP) distributed model. The classical methods of data sampling are then investigated, including simple random sampling, stratified sampling, and reservoir sampling. Two common methods of big data sampling on computing clusters are also discussed: record-level sampling and block-level sampling. Record-level sampling is not as efficient as block-level sampling on big distributed data. On the other hand, block-level sampling on data blocks generated with the classical data partitioning methods does not necessarily produce good representative samples for approximate computing of big data. In this survey, we also summarize the prevailing strategies and related work on sampling-based approximation on Hadoop clusters. We believe that data partitioning and sampling should be considered together to build approximate cluster computing frameworks that are reliable in both the computational and statistical respects.

menu

Abstract

Full text

Outline

About this article

Computer clusters with the shared-nothing architecture are the major computing platforms for big data processing and analysis. In cluster computing, data partitioning and sampling are two fundamental strategies to speed up the computation of big data and increase scalability. In this paper, we present a comprehensive survey of the methods and techniques of data partitioning and sampling with respect to big data processing and analysis. We start with an overview of the mainstream big data frameworks on Hadoop clusters. The basic methods of data partitioning are then discussed including three classical horizontal partitioning schemes: range, hash, and random partitioning. Data partitioning on Hadoop clusters is also discussed with a summary of new strategies for big data partitioning, including the new Random Sample Partition (RSP) distributed model. The classical methods of data sampling are then investigated, including simple random sampling, stratified sampling, and reservoir sampling. Two common methods of big data sampling on computing clusters are also discussed: record-level sampling and block-level sampling. Record-level sampling is not as efficient as block-level sampling on big distributed data. On the other hand, block-level sampling on data blocks generated with the classical data partitioning methods does not necessarily produce good representative samples for approximate computing of big data. In this survey, we also summarize the prevailing strategies and related work on sampling-based approximation on Hadoop clusters. We believe that data partitioning and sampling should be considered together to build approximate cluster computing frameworks that are reliable in both the computational and statistical respects.

[1]

R. Cattell, Scalable SQL and NoSQL data stores, *ACM SIGMOD Record*, vol. 39, no. 4, pp. 12-27, 2011.

[2]

K. Bakshi, Considerations for big data: Architecture and approach, in Proc. of 2012 IEEE Aerospace Conference, Big Sky, MT, USA, 2012, pp. 1-7.

[3]

X. Chen and M. Xie, A split-and-conquer approach for analysis of extraordinarily large data, *Statistica Sinica*, vol. 24, no. 4, pp. 1655-1684, 2014.

[4]

N. Lazar, The big picture: Divide and combine to conquer big data, *Chance*, vol. 31, no. 1, pp. 57-59, 2018.

[5]

J. Dean and S. Ghemawat, MapReduce: Simplified data processing on large clusters, in Proceedings of the 6th Symposium on Operating System Design and Implementation (OSDI’04), San Francisco, CA, USA, 2004, pp. 137-150.

[6]

D. Singh and C. K. Reddy, A survey on platforms for big data analytics, *Journal of Big Data*, vol. 2, no. 1, p. 8, 2014.

[7]

H. V. Jagadish, J. Gehrke, A. Labrinidis, Y. Papakonstantinou, J. M. Patel, R. Ramakrishnan, and C. Shahabi, Big data and its technical challenges, *Communications of the ACM*, vol. 57, no. 7, pp. 86-94, 2014.

[8]

M. Zaharia, R. S. Xin, P. Wendell, T. Das, M. Armbrust, A. Dave, X. Meng, J. Rosen, S. Venkataraman, M. J. Franklin, et al., Apache spark: A unified engine for big data processing, *Communications of the ACM*, vol. 59, no. 11, pp. 56-65, 2016.

[9]

S. Salloum, R. Dautov, X. Chen, P. X. Peng, and J. Z. Huang, Big data analytics on apache spark, *International Journal of Data Science and Analytics*, vol. 1, no. 3, pp. 145-164, 2016.

[10]

K. Shvachko, H. Kuang, S. Radia, and R. Chansler, The Hadoop distributed file system, in Proceedings of the 2010 IEEE 26th Symposium on Mass Storage Systems and Technologies (MSST), Incline Village, NV, USA, 2010, pp. 1-10.

[11]

C.-T. Chu, S. K. Kim, Y.-A. Lin, Y. Yu, G. Bradski, A. Y. Ng, and K. Olukotun, Map-reduce for machine learning on multicore, in Proceedings of the 19th International Conference on Neural Information Processing Systems (NIPS’06), Cambridge, MA, USA, 2006, pp. 281-288.

[12]

D. Quoc, Approximate data analytics systems, PhD dissertation, Technische Universität Dresden, Dresden, Germany, 2017.

[13]

R. Nair, Big data needs approximate computing: Technical perspective, *Communications of the ACM*, vol. 58, no. 1, pp. 104-104, 2015.

[14]

S. Mittal, A survey of techniques for approximate computing, *ACM Computing Surveys*, vol. 48, no. 4, pp. 1-33, 2016.

[15]

D. A. Reed and J. Dongarra, Exascale computing and big data, *Communications of the ACM*, vol. 58, no. 7, pp. 56-68, 2015.

[16]

C. E. Otero and A. Peter, Research directions for engineering big data analytics software, *IEEE Intelligent Systems*, vol. 30, no. 1, pp. 13-19, 2015.

[17]

A. Agrawal, J. Choi, K. Gopalakrishnan, S. Gupta, R. Nair, J. Oh, D. A. Prener, S. Shukla, V. Srinivasan, and Z. Sura, Approximate computing: Challenges and opportunities, in Proc. of 2016 IEEE International Conference on Rebooting Computing (ICRC), San Diego, CA, USA, 2016, pp. 1-8.

[18]

S. Sidiroglou-Douskos, S. Misailovic, H. Hoffmann, and M. Rinard, Managing performance vs. accuracy trade-offs with loop perforation, in Proceedings of the 19th ACM SIGSOFT Symposium and the 13th European Conference on Foundations of Software Engineering (ESEC/FSE’11), Szeged, Hungary, 2011, pp. 124-134.

[19]

S. Agarwal, B. Mozafari, A. Panda, H. Milner, S. Madden, and I. Stoica, BlinkDB: Queries with bounded errors and bounded response times on very large data, in Proceedings of the 8th ACM European Conference on Computer Systems (EuroSys’13), Prague, Czech Republic, 2013, pp. 29-42.

[20]

I. Goiri, R. Bianchini, S. Nagarakatte, and T. D. Nguyen, ApproxHadoop: Bringing approximations to MapReduce frameworks, in Proceedings of the ACM International Conference on Architectural Support for Programming Languages and Operating Systems (ASPLOS’15), Istanbul, Turkey, 2015, pp. 383-397.

[21]

K. Li and G. Li, Approximate query processing: What is new and where to go?, *Data Science and Engineering*, vol. 3, no. 4, pp. 379-397, 2018.

[22]

S. Agarwal, H. Milner, A. Kleiner, A. Talwalkar, M. Jordan, S. Madden, B. Mozafari, and I. Stoica, Knowing when you’re wrong: Building fast and reliable approximate query processing systems, in Proceedings of the 2014 ACM SIGMOD International Conference on Management of Data (SIGMOD’14), Snowbird, UT, USA, 2014, pp. 481-492.

[23]

D. L. Quoc, R. Chen, P. Bhatotia, C. Fetzer, V. Hilt, and T. Strufe, Streamapprox: Approximate computing for stream analytics, in Proceedings of the 18th ACM/IFIP/USENIX Middleware Conference (Middleware’17), Las Vegas, NV, USA, 2017, pp. 185-197.

[24]

O. Sagi and L. Rokach, Ensemble learning: A survey, *Data Mining and Knowledge Discovery*, vol. 8, no. 4, pp. 1-18, 2018.

[25]

S. Basiri, E. Ollila, and V. Koivunen, Robust, scalable, and fast bootstrap method for analyzing large scale data, *IEEE Transactions on Signal Processing*, vol. 64, no. 4, pp. 1007-1017, 2016.

[26]

V. K. Singh, M. Taram, V. Agrawal, and B. S. Baghel, A literature review on Hadoop ecosystem and various techniques of big data optimization, in Proceedings of International Conference on Data and Information Systems (ICDIS’17), Amarkantak, India, 2017, pp. 231-240.

[27]

I. Polato, R. Ré, A. Goldman, and F. Kon, A comprehensive view of Hadoop research-systematic literature review, *Journal of Network and Computer Applications*, vol. 46, pp. 1-25, 2014.

[28]

J. Dean and S. Ghemawat, Mapreduce: Simplified data processing on large clusters, *Communications of the ACM*, vol. 51, no. 1, pp. 107-113, 2008.

[29]

R. Hasan, Z. Anwar, W. Yurcik, L. Brumbaugh, and R. Campbell, A survey of peer-to-peer storage techniques for distributed file systems, in Proceedings of the International Conference on Information Technology: Coding and Computing (ITCC’05) - Volume II, Las Vegas, NV, USA, 2005, pp. 205-213.

[30]

P. Bzoch and J. Safarik, State of the art in distributed file systems: Increasing performance, in Proc. of 2011 Second Eastern European Regional Conference on the Engineering of Computer Based Systems, Bratislava, Slovakia, 2011, pp. 153-154.

[31]

M. T. Ozsu and P. Valduriez, *Principles of Distributed Database Systems*, 3rd ed. New York, NY, USA: Springer-Verlag, 2011.

[32]

D. Taniar, High performance database processing, in Proc. of 2012 IEEE 26th International Conference on Advanced Information Networking and Applications, Fukuoka, Japan, 2012, pp. 5-6.

[33]

S. Vijayakumar, A. Bhargavi, U. Praseeda, and S. A. Ahamed, Optimizing sequence alignment in cloud using Hadoop and Mpp database, in Proceedings of the 2012 IEEE Fifth International Conference on Cloud Computing (CLOUD’12), Honolulu, HI, USA, 2012, pp. 819-827.

[34]

S. Ghemawat, H. Gobioff, and S.-T. Leung, The google file system, *ACM SIGOPS Operating Systems Review*, vol. 37, no. 5, pp. 29-43, 2003.

[35]

M. Isard, M. Budiu, Y. Yu, A. Birrell, and D. Fetterly, Dryad: Distributed data-parallel programs from sequential building blocks, in Proceedings of the 2nd ACM SIGOPS/EuroSys European Conference on Computer Systems 2007 (EuroSys’07), Lisbon, Portugal, 2007, pp. 59-72.

[36]

R. Chaiken, B. Jenkins, P.-A. Larson, B. Ramsey, D. Shakib, S. Weaver, and J. Zhou, Scope: Easy and efficient parallel processing of massive data sets, *Proceedings of the VLDB Endowment*, vol. 1, no. 2, pp. 1265-1276, 2008.

[37]

Y. Yu, M. Isard, D. Fetterly, M. Budiu, U. Erlingsson, P. K. Gunda, and J. Currey, Dryadlinq: A system for general-purpose distributed data-parallel computing using a high-level language, in Proceedings of the 8th USENIX Conference on Operating Systems Design and Implementation (OSDI’08), San Diego, CA, USA, 2008, pp. 1-14.

[38]

C. Olston, B. Reed, U. Srivastava, R. Kumar, and A. Tomkins, Pig latin: A not-so-foreign language for data processing, in Proceedings of the 2008 ACM SIGMOD International Conference on Management of Data (SIGMOD’08), Vancouver, Canada, 2008, pp. 1099-1110.

[39]

S. Melnik, A. Gubarev, J. J. Long, G. Romer, S. Shivakumar, M. Tolton, and T. Vassilakis, Dremel: Interactive analysis of web-scale datasets, *Communications of the ACM*, vol. 54, no. 6, pp. 114-123, 2011.

[40]

A. Thusoo, J. Sarma, N. Jain, Z. Shao, P. Chakka, N. Zhang, S. Antony, H. Liu, and R. Murthy, Hive a petabyte scale data warehouse using hadoop, in Proceedings of the 2010 IEEE 26th International Conference on Data Engineering (ICDE’10), Long Beach, CA, USA, 2010, pp. 996-1005.

[41]

K. S. Beyer, V. Ercegovac, R. Gemulla, A. Balmin, M. Eltabakh, C.-C. Kanne, F. Ozcan, and E. J. Shekita, Jaql: A scripting language for large scale semi structured data analysis, *Proceedings of the VLDB Endowment*, vol. 4, no. 12, pp. 1272-1283, 2011.

[42]

B. Chattopadhyay, L. Lin, W. Liu, S. Mittal, P. Aragonda, V. Lychagina, Y. Kwon, and M. Wong, Tenzing: A SQL implementation on the MapReduce framework, *Proceedings of the VLDB Endowment*, vol. 4, no. 12, pp. 1318-1327, 2011.

[43]

H. Karau, A. Konwinski, P. Wendell, and M. Zaharia, *Learning Spark: Lightning-Fast Big Data Analytics*, 1st ed. Sebastopol, CA, USA: O’Reilly Media, 2015.

[44]

M. Zaharia, M. Chowdhury, T. Das, A. Dave, J. Ma, M. McCauley, M. J. Franklin, S. Shenker, and I. Stoica, Resilient distributed datasets: A fault-tolerant abstraction for in-memory cluster computing, in Proceedings of the 9th USENIX Conference on Networked Systems Design and Implementation (NSDI’12), San Jose, CA, USA, 2012, p. 2.

[45]

S. Navathe, S. Ceri, G. Wiederhold, and J. Dou, Vertical partitioning algorithms for database design, *ACM Transactions on Database Systems*, vol. 9, no. 4, pp. 680-710, 1984.

[46]

D. W. Cornell and P. S. Yu, An effective approach to vertical partitioning for physical design of relational databases, *IEEE Transactions on Software Engineering*, vol. 16, no. 2, pp. 248-258, 1990.

[47]

W. W. Chu and I. T. Ieong, A transaction-based approach to vertical partitioning for relational database systems, *IEEE Transactions on Software Engineering*, vol. 19, no. 8, pp. 804-812, 1993.

[48]

C. Curino, E. Jones, Y. Zhang, and S. Madden, Schism: A workload-driven approach to database replication and partitioning, *Proceedings of the VLDB Endowment*, vol. 3, no. 1, pp. 48-57, 2010.

[49]

C. Curino, E. P. Jones, R. A. Popa, N. Malviya, E. Wu, S. Madden, H. Balakrishnan, and N. Zeldovich, Relational cloud: A database-as-a-service for the cloud, in Proc. of 5th Biennial Conference on Innovative Data Systems Research (CIDR), Asilomar, CA, USA, 2011, pp. 235-240.

[50]

S. Das, D. Agrawal, and A. El Abbadi, Elastras: An elastic transactional data store in the cloud, in Proceedings of the 2009 Conference on Hot Topics in Cloud Computing (HotCloud’09), San Diego, CA, USA, 2009, pp. 1-5.

[51]

J. Baker, C. Bond, J. Corbett, J. Furman, A. Khorlin, J. Larson, J.-M. Leon, Y. Li, A. Lloyd, and V. Yushprakh, Megastore: Providing scalable, highly available storage for interactive services, in Proc. of Conference on Innovative Database Research (CIDR), Asilomar, CA, USA, 2011, pp. 223-234.

[52]

J. Kamal, M. Murshed, and R. Buyya, Workload-aware incremental repartitioning of shared-nothing distributed databases for scalable OLTP applications, *Future Generation Computer Systems*, vol. 56, pp. 421-435, 2016.

[53]

S. P. Phansalkar and A. R. Dani, Transaction aware vertical partitioning of database (TAVDP) for responsive OLTP applications in cloud data stores, *Journal of Theoretical and Applied Information Technology*, vol. 59, no. 1, pp. 73-81, 2014.

[54]

P. A. Bernstein, I. Cseri, N. Dani, N. Ellis, A. Kalhan, G. Kakivaya, D. B. Lomet, R. Manne, L. Novik, and T. Talius, Adapting microsoft SQL server for cloud computing, in Proc. of 2011 IEEE 27th International Conference on Data Engineering, Hannover, Germany, 2011, pp. 1255-1263.

[55]

S. Ahirrao and R. Ingle, Scalable transactions in cloud data stores, in Proc. of 2013 3rd IEEE International Advance Computing Conference (IACC), Ghaziabad, India, 2013, pp. 116-119.

[56]

S. B. Navathe and M. Ra, Vertical partitioning for database design: A graphical algorithm, *ACM SIGMOD Record*, vol. 18, no. 2, pp. 440-450, 1989.

[57]

J. H. Son and M. H. Kim, An adaptable vertical partitioning method in distributed systems, *Journal of Systems and Software*, vol. 73, no. 3, pp. 551-561, 2004.

[58]

W. Zhao, Y. Cheng, and F. Rusu, Workload-driven vertical partitioning for effective query processing over raw data, arXiv preprint arXiv: 1503.08946, 2015.

[59]

Y.-F. Huang and C.-J. Lai, Integrating frequent pattern clustering and branch-and-bound approaches for data partitioning, *Information Sciences*, vol. 328, pp. 288-301, 2016.

[60]

M. Vojnovic, F. Xu, and J. Zhou, Sampling-based range partition methods for big data analytics, Technical Report MSR-TR-2012-18, Microsoft Research, Redmond, WA, USA, 2012.

[61]

A. Chakrabarti, S. Parthasarathy, and C. Stewart, Green- and heterogeneity-aware partitioning for data analytics, in Proc. of IEEE Conference on Computer Communications Workshops, San Francisco, CA, USA, 2016, pp. 366-371.

[62]

S. Phansalkar and S. Ahirrao, Survey of data partitioning algorithms for big data stores, in Proc. of 2016 Fourth International Conference on Parallel, Distributed and Grid Computing (PDGC), Waknaghat, India, 2016, pp. 163-168.

[63]

A. Shanbhag, A. Jindal, S. Madden, J. Quiane, and A. J. Elmore, A robust partitioning scheme for ad-hoc query workloads, in Proceedings of the 2017 Symposium on Cloud Computing (SoCC’17), Santa Clara, CA, USA, 2017, pp. 229-241.

[64]

J. Wang, Q. Xiao, and J. Yin, DRAW: A new Data-gRouping-AWare data placement scheme for data intensive applications with interest locality, *IEEE Transactions on Magnetics*, vol. 49, no. 6, pp. 2514-2520, 2013.

[65]

K. H. K. Reddy and D. S. Roy, DPPACS: A novel data partitioning and placement aware computation scheduling scheme for data-intensive cloud applications, *The Computer Journal*, vol. 59, no. 1, pp. 64-82, 2016.

[66]

D. Yuan, Y. Yang, X. Liu, and J. Chen, A data placement strategy in scientific cloud workflows, *Future Generation Computer Systems*, vol. 26, no. 8, pp. 1200-1214, 2010.

[67]

K. Slagter, C.-H. Hsu, Y.-C. Chung, and D. Zhang, An improved partitioning mechanism for optimizing massive data analysis using MapReduce, *The Journal of Supercomputing*, vol. 66, no. 1, pp. 539-555, 2013.

[68]

Q. Chen, J. Yao, and Z. Xiao, Libra: Lightweight data skew mitigation in MapReduce, *IEEE Transactions on Parallel and Distributed Systems*, vol. 26, no. 9, pp. 2520-2533, 2015.

[69]

M. He, G. Li, C. Huang, Y. Ye, and W. Tian, A comparative study of data skew in Hadoop, in Proceedings of 26th International Conference on Network, Communication and Computing (ICNCC’17), Kunming, China, 2017, pp. 1-6.

[70]

Z. Tang, W. Lv, K. Li, and K. Li, An intermediate data partition algorithm for skew mitigation in spark computing environment, *IEEE Transactions on Cloud Computing*, .

[71]

Y. Kwon, M. Balazinska, B. Howe, and J. Rolia, A study of skew in MapReduce applications, in The 5th Open Cirrus Summit, Moskow, Russia, 2011, pp. 1-5.

[72]

Z. Tang, X. Zhang, K. Li, and K. Li, An intermediate data placement algorithm for load balancing in spark computing environment, *Future Generation Computer Systems*, vol. 78, no. 1, pp. 287-301, 2018.

[73]

Y. Xu, P. Zou, W. Qu, Z. Li, K. Li, and X. Cui, Sampling-based partitioning in MapReduce for skewed data, in Proc. of 2012 Seventh ChinaGrid Annual Conference (CHINAGRID’12), Beijing, China, 2012, pp. 1-8.

[74]

S. del RÃo, V. LÃ³pez, J. M. BenÃtez, and F. Herrera, On the use of MapReduce for imbalanced big data using random forest, *Information Sciences*, vol. 285, pp. 112-137, 2014.

[75]

X. Ci and X. Meng, An efficient block sampling strategy for online aggregation in the cloud, in Proc. of International Conference on Web-Age Information Management (WAIM 2015), Qingdao, China, 2015, pp. 362-373.

[76]

S. Salloum, J. Z. Huang, Y. He, X. Zhang, T. Z. Emara, C. Wei, and H. He, A random sample partition data model for big data analysis, arXiv preprint arXiv: 1712.04146, 2017.

[77]

C. Wei, S. Salloum, T. Z. Emara, X. Zhang, J. Z. Huang, and Y. He, A two-stage data processing algorithm to generate random sample partitions for big data analysis, in Proc. of International Conference on Cloud Computing (CLOUD 2018), Seattle, WA, USA, 2018, pp. 347-364.

[78]

T. Z. Emara and J. Z. Huang, A distributed data management system to support large-scale data analysis, *The Journal of Systems and Software*, vol. 148, pp. 105-115, 2019.

[79]

T. Z. Emara and J. Z. Huang, RRPlib: A Spark library for representing HDFS blocks as a set of random sample data blocks, *Science of Computer Programming*, vol. 184, pp. 1-7, 2019.

[80]

A. Jacobs, The pathologies of big data, *Communications of the ACM*, vol. 52, no. 8, pp. 36-44, 2009.

[81]

G. Cormode and N. Duffield, Sampling for big data: A tutorial, in Proceedings of the 20th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining (KDD’14), New York, NY, USA, 2014, pp. 1975-1975.

[82]

J. Acharya, I. Diakonikolas, C. Hegde, J. Z. Li, and L. Schmidt, Fast and near-optimal algorithms for approximating distributions by histograms, in Proceedings of the 34th ACM SIGMOD-SIGACT-SIGAI Symposium on Principles of Database Systems (PODS’15), Melbourne, Australia, 2015, pp. 249-263.

[83]

B. Efron and R. J. Tibshirani, *An Introduction to the Bootstrap*. New York, NY, USA: Chapman & Hall, 1993.

[84]

F. Olken and D. Rotem, Random sampling from database files: A survey, in Proceedings of the 5th International Conference on Statistical and Scientific Database Management (SSDBM’1990), Charlotte, NC, USA, 1990, pp. 92-111.

[85]

F. Olken, Random sampling from databases, PhD dissertation, University of California at Berkeley, Berkeley, CA, USA, 1993.

[86]

P. Bruce and A. Bruce, *Practical Statistics for Data Scientists: 50 Essential Concepts*. Sebastopol, CA, USA: O’Reilly Media, Inc., 2017.

[87]

W. S. Cleveland and R. Hafen, Divide and recombine (d&r): Data science for large complex data, *Statistical Analysis and Data Mining: The ASA Data Science Journal*, vol. 7, no. 6, pp. 425-433, 2014.

[88]

P. Bailis, E. Gan, S. Madden, D. Narayanan, K. Rong, and S. Suri, Macrobase: Prioritizing attention in fast data, in Proceedings of the 2017 ACM International Conference on Management of Data (SIGMOD’17), Chicago, IL, USA, 2017, pp. 541-556.

[89]

S. Chaudhuri, R. Motwani, and V. Narasayya, Random sampling for histogram construction: How much is enough?, *ACM SIGMOD Record*, vol. 27, no. 2, pp. 436-447, 1998.

[90]

G. S. Manku, S. Rajagopalan, and B. G. Lindsay, Random sampling techniques for space efficient online computation of order statistics of large datasets, in Proceedings of the 1999 ACM SIGMOD International Conference on Management of Data (SIGMOD’99), Philadelphia, PA, USA, 1999, pp. 251-262.

[91]

M. Charikar, K. Chen, and M. Farach-Colton, Finding frequent items in data streams, in Proceedings of the 29th International Colloquium on Automata, Languages and Programming (ICALP’02), Málaga, Spain, 2002, pp. 693-703.

[92]

S. Guha and A. McGregor, Space-efficient sampling, in Proceedings of the Eleventh International Conference on Artificial Intelligence and Statistics (AISTATS 2007), San Juan, PUR, USA, 2007, pp. 169-176.

[93]

F. Kuhn, T. Locher, and S. Schmid, Distributed computation of the mode, in Proceedings of the Twenty-seventh ACM Symposium on Principles of Distributed Computing (PODC’08), Toronto, Canada, 2008, pp. 15-24.

[94]

A. Kirsch, M. Mitzenmacher, A. Pietracaprina, E. Upfal, and F. Vandin, A rigorous statistical approach for identifying significant itemsets, in Proceedings of the IEEE International Conference on Data Mining (ICDM’08), Pisa, Italy, 2008, pp. 1-10.

[95]

J. A. R. Rojas, M. Beth Kery, S. Rosenthal, and A. Dey, Sampling techniques to improve big data exploration, in Proc. of 2017 IEEE 7th Symposium on Large Data Analysis and Visualization (LDAV), Phoenix, AZ, USA, 2017, pp. 26-35.

[96]

S. Kandel, A. Paepcke, J. M. Hellerstein, and J. Heer, Enterprise data analysis and visualization: An interview study, *IEEE Transactions on Visualization and Computer Graphics*, vol. 18, no. 12, pp. 2917-2926, 2012.

[97]

J. Lin and D. Ryaboy, Scaling big data mining infrastructure: The twitter experience, *SIGKDD Explorations*, vol. 14, no. 2, pp. 6-19, 2013.

[98]

C. T. Fan, Development of sampling plans by using sequential (item by item) selection techniques and digital computers, *Publications of the American Statistical Association*, vol. 57, no. 298, pp. 387-402, 1962.

[99]

P. J. Haas, Data-stream sampling: Basic techniques and results, in Data Stream Management, Data-Centric Systems and Applications Book Series. Berlin, Heidelberg, Germany: Springer, 2016, pp. 13-44.

[100]

T. E. Oliphant, Scipy: Open source scientific tools for python, *Computing in Science and Engineering*, vol. 9, no. 3, pp. 10-20, 2007.

[101]

M. Hall, E. Frank, G. Holmes, B. Pfahringer, P. Reutemann, and I. H. Witten, The weka data mining software: An update, *ACM SIGKDD Explorations Newsletter*, vol. 11, no. 1, pp. 10-18, 2009.

[102]

F. Pedregosa, G. Varoquaux, A. Gramfort, V. Michel, B. Thirion, O. Grisel, M. Blondel, P. Prettenhofer, R. Weiss, V. Dubourg, et al., Scikit-learn: Machine learning in python, *The Journal of Machine Learning Research*, vol. 12, no. 10, pp. 2825-2830, 2011.

[103]

M. Al-Kateb and B. S. Lee, Stratified reservoir sampling over heterogeneous data streams, in Proceedings of the 22nd International Conference on Scientific and Statistical Database Management (SSDBM’10), Heidelberg, Germany, 2010, pp. 621-639.

[104]

J. S. Vitter, Random sampling with a reservoir, *ACM Transactions on Mathematical Software*, vol. 11, no. 1, pp. 37-57, 1985.

[105]

C. C. Aggarwal, On biased reservoir sampling in the presence of stream evolution, in Proceedings of the 32nd International Conference on Very Large Databases (VLDB’06), Seoul, Korea, 2006, pp. 607-618.

[106]

M. Dash and W. Ng, Efficient reservoir sampling for transactional data streams, in Proc. of Sixth IEEE International Conference on Data Mining-Workshops (ICDMW’06), Hong Kong, China, 2006, pp. 662-666.

[107]

V. Malbasa and S. Vucetic, A reservoir sampling algorithm with adaptive estimation of conditional expectation, in Proc. of International Joint Conference on Neural Networks, Orlando, FL, USA, 2007, pp. 2200-2204.

[108]

A. Kleiner, A. Talwalkar, P. Sarkar, and M. I. Jordan, The big data Bootstrap, in Proceedings of the 29th International Coference on International Conference on Machine Learning (ICML’12), Edinburgh, UK, 2012, pp. 1787-1794.

[109]

L. Breiman, J. Friedman, C. J. Stone, and R. A. Olshen, *Classification and Regression Trees*. Boca Raton, FL, USA: CRC Press, 1984.

[110]

L. Breiman, Bagging predictors, *Machine Learning*, vol. 24, no. 2, pp. 123-140, 1996.

[111]

M. I. Jordan, On statistics, computation and scalability, *Bernoulli*, vol. 19, no. 4, pp. 1378-1390, 2013.

[112]

R. Genuer, J.-M. Poggi, C. Tuleau-Malot, and N. Villa-Vialaneix, Random forests for big data, *Big Data Research*, vol. 9, pp. 28-46, 2017.

[113]

X. Meng, Scalable simple random sampling and stratified sampling, in Proceedings of the 30th International Conference on International Conference on Machine Learning (ICML’13), Atlanta, GA, USA, 2013, pp. 531-539.

[114]

P. Sanders, S. Lamm, L. Hübschle-Schneider, E. Schrade, and C. Dachsbacher, Efficient parallel random sampling-vectorized, cache-efficient, and online, *ACM Transactions on Mathematical Software*, vol. 44, no. 3, pp. 1-14, 2018.

[115]

E. Gavagsaz, A. Rezaee, and H. H. S. Javadi, Load balancing in reducers for skewed data in MapReduce systems by using scalable simple random sampling, *The Journal of Supercomputing*, vol. 74, no. 7, pp. 3415-3440, 2018.

[116]

G. Cormode, M. Garofalakis, P. J. Haas, and C. Jermaine, Synopses for massive data: Samples, histograms, wavelets, sketches, *Foundations Trends Databases*, vol. 4, no. 1, pp. 1-294, 2012.

[117]

Q. Xu, T. Mytkowicz, and N. S. Kim, Approximate computing: A survey, *IEEE Design Test*, vol. 33, no. 1, pp. 8-22, 2016.

[118]

Y. Shi, X. Meng, F. Wang, and Y. Gan, HEDC: A histogram estimator for data in the cloud, in Proceedings of the 4th International Workshop on Cloud Data Management (CloudDB’12), Maui, HI, USA, 2012, pp. 51-58.

[119]

D. R. Krishnan, D. L. Quoc, P. Bhatotia, C. Fetzer, and R. Rodrigues, IncApprox: A data analytics system for incremental approximate computing, in Proceedings of the 25th International Conference on World Wide Web (WWW’16), Montral, Canada, 2016, pp. 1133-1144.

[120]

M. N. Garofalakis and P. B. Gibbon, Approximate query processing: Taming the terabytes, in Proceedings of the 27th International Conference on Very Large Data Bases (VLDB’01), Roma, Italy, 2001, p. 725.

[121]

N. Laptev, K. Zeng, and C. Zaniolo, Early accurate results for advanced analytics on MapReduce, *Proceedings of the VLDB Endowment*, vol. 5, no. 10, pp. 1028-1039, 2012.

[122]

G. Hu, D. Zhang, S. Rigo, and T. D. Nguyen, Approximation with error bounds in spark, arXiv preprint arXiv: 1812.01823, 2018.

[123]

Y. Park, J. Qing, X. Shen, and B. Mozafari, BlinkML: Approximate machine learning with probabilistic guarantees, in Proc. of the 45th International Conference on Very Large Data Bases, Los Angeles, CA, USA, 2018, pp. 1-18.

[124]

X. Zhang, J. Wang, and J. Yin, Sapprox: Enabling efficient and accurate approximations on sub-datasets with distribution-aware online sampling, *Proceedings of the VLDB Endowment*, vol. 10, no. 3, pp. 109-120, 2016.

[125]

S. Salloum, J. Z. Huang, Y. He, and X. Chen, An asymptotic ensemble learning framework for big data analysis, *IEEE Access*, vol. 7, no. 1, pp. 3675-3693, 2019.

[126]

S. Salloum, J. Z. Huang, and Y. He, Exploring and cleaning big data with random sample data blocks, *Journal of Big Data*, vol. 6, no. 1, p. 45, 2019.

[127]

Z. Wen, D. L. Quoc, P. Bhatotia, R. Chen, and M. Lee, ApproxIoT: Approximate analytics for edge computing, in Proc. of 2018 IEEE 38th International Conference on Distributed Computing Systems (ICDCS), Vienna, Austria, 2018, pp. 411-421.

[128]

M. Elteir, H. Lin, and W. Feng, Enhancing MapReduce via asynchronous data processing, in Proc. of 2010 IEEE 16th International Conference on Parallel and Distributed Systems, Shanghai, China, 2010, pp. 397-405.

Publication history

Copyright

Acknowledgements

Rights and permissions

Received: 30 May 2019

Revised: 23 September 2019

Accepted: 25 September 2019

Published:
27 February 2020

Issue date: June 2020

© The author(s) 2020

This research was Supported in part by the National Natural Science Foundation of China (No. 61972261) and the National Key R&D Program of China (No. 2017YFC0822604-2).

The articles published in this open access journal are distributed under the terms of the Creative Commons Attribution 4.0 International License (http://creativecommons.org/licenses/by/4.0/).