A Survey of Data Partitioning and Sampling Methods to Support Big Data Analysis

Mohammad Sultan Mahmud; Joshua Zhexue Huang; Salman Salloum; Tamer Z. Emara; Kuanishbay Sadatdiynov

doi:10.26599/BDMA.2019.9020015

AI Chat Paper

Note: Please note that the following content is generated by AMiner AI. SciOpen does not take any responsibility related to this content.

Chat more with AI

| Sign up

Browse by Subject

Search for peer-reviewed journals with full access.

Journals A - Z

About Us

Discover the SciOpen Platform and Achieve Your Research Goals with Ease.

About Us

Publish with Us

Support

Journals A - Z

About Us

Publish with Us

Support

PDF (6.6 MB)

Cite

EndNote(RIS) BibTeX

Collect

Submit Manuscript

AI Chat Paper

Show Outline

Outline

Show full outline

Hide outline

Outline

Show full outline

Hide outline

Open Access

A Survey of Data Partitioning and Sampling Methods to Support Big Data Analysis

Mohammad Sultan Mahmud, Joshua Zhexue Huang, Salman Salloum(

), Tamer Z. Emara, Kuanishbay Sadatdiynov

∙ National Engineering Laboratory for Big Data System Computing Technology, Shenzhen University, Shenzhen 518060, China, and Big Data Institute, College of Computer Science and Software Engineering, Shenzhen University, Shenzhen 518060, China.

Show Author Information

Abstract

Computer clusters with the shared-nothing architecture are the major computing platforms for big data processing and analysis. In cluster computing, data partitioning and sampling are two fundamental strategies to speed up the computation of big data and increase scalability. In this paper, we present a comprehensive survey of the methods and techniques of data partitioning and sampling with respect to big data processing and analysis. We start with an overview of the mainstream big data frameworks on Hadoop clusters. The basic methods of data partitioning are then discussed including three classical horizontal partitioning schemes: range, hash, and random partitioning. Data partitioning on Hadoop clusters is also discussed with a summary of new strategies for big data partitioning, including the new Random Sample Partition (RSP) distributed model. The classical methods of data sampling are then investigated, including simple random sampling, stratified sampling, and reservoir sampling. Two common methods of big data sampling on computing clusters are also discussed: record-level sampling and block-level sampling. Record-level sampling is not as efficient as block-level sampling on big distributed data. On the other hand, block-level sampling on data blocks generated with the classical data partitioning methods does not necessarily produce good representative samples for approximate computing of big data. In this survey, we also summarize the prevailing strategies and related work on sampling-based approximation on Hadoop clusters. We believe that data partitioning and sampling should be considered together to build approximate cluster computing frameworks that are reliable in both the computational and statistical respects.

Keywords

big data analysis data partitioning data sampling distributed and parallel computing approximate computing

References

[1]

R. Cattell, Scalable SQL and NoSQL data stores, ACM SIGMOD Record, vol. 39, no. 4, pp. 12-27, 2011.

Crossref Google Scholar

[2]

K. Bakshi, Considerations for big data: Architecture and approach, in Proc. of 2012 IEEE Aerospace Conference, Big Sky, MT, USA, 2012, pp. 1-7.

Crossref

[3]

X. Chen and M. Xie, A split-and-conquer approach for analysis of extraordinarily large data, Statistica Sinica, vol. 24, no. 4, pp. 1655-1684, 2014.

Crossref Google Scholar

[4]

N. Lazar, The big picture: Divide and combine to conquer big data, Chance, vol. 31, no. 1, pp. 57-59, 2018.

Crossref Google Scholar

[5]

J. Dean and S. Ghemawat, MapReduce: Simplified data processing on large clusters, in Proceedings of the 6th Symposium on Operating System Design and Implementation (OSDI’04), San Francisco, CA, USA, 2004, pp. 137-150.

[6]

D. Singh and C. K. Reddy, A survey on platforms for big data analytics, Journal of Big Data, vol. 2, no. 1, p. 8, 2014.

Crossref Google Scholar

[7]

H. V. Jagadish, J. Gehrke, A. Labrinidis, Y. Papakonstantinou, J. M. Patel, R. Ramakrishnan, and C. Shahabi, Big data and its technical challenges, Communications of the ACM, vol. 57, no. 7, pp. 86-94, 2014.

Crossref Google Scholar

[8]

M. Zaharia, R. S. Xin, P. Wendell, T. Das, M. Armbrust, A. Dave, X. Meng, J. Rosen, S. Venkataraman, M. J. Franklin, et al., Apache spark: A unified engine for big data processing, Communications of the ACM, vol. 59, no. 11, pp. 56-65, 2016.

Crossref Google Scholar

[9]

S. Salloum, R. Dautov, X. Chen, P. X. Peng, and J. Z. Huang, Big data analytics on apache spark, International Journal of Data Science and Analytics, vol. 1, no. 3, pp. 145-164, 2016.

Crossref Google Scholar

[10]

K. Shvachko, H. Kuang, S. Radia, and R. Chansler, The Hadoop distributed file system, in Proceedings of the 2010 IEEE 26th Symposium on Mass Storage Systems and Technologies (MSST), Incline Village, NV, USA, 2010, pp. 1-10.

Crossref

[11]

C.-T. Chu, S. K. Kim, Y.-A. Lin, Y. Yu, G. Bradski, A. Y. Ng, and K. Olukotun, Map-reduce for machine learning on multicore, in Proceedings of the 19th International Conference on Neural Information Processing Systems (NIPS’06), Cambridge, MA, USA, 2006, pp. 281-288.

[12]

D. Quoc, Approximate data analytics systems, PhD dissertation, Technische Universität Dresden, Dresden, Germany, 2017.

[13]

R. Nair, Big data needs approximate computing: Technical perspective, Communications of the ACM, vol. 58, no. 1, pp. 104-104, 2015.

Crossref Google Scholar

[14]

S. Mittal, A survey of techniques for approximate computing, ACM Computing Surveys, vol. 48, no. 4, pp. 1-33, 2016.

Crossref Google Scholar

[15]

D. A. Reed and J. Dongarra, Exascale computing and big data, Communications of the ACM, vol. 58, no. 7, pp. 56-68, 2015.

Crossref Google Scholar

[16]

C. E. Otero and A. Peter, Research directions for engineering big data analytics software, IEEE Intelligent Systems, vol. 30, no. 1, pp. 13-19, 2015.

Crossref Google Scholar

[17]

A. Agrawal, J. Choi, K. Gopalakrishnan, S. Gupta, R. Nair, J. Oh, D. A. Prener, S. Shukla, V. Srinivasan, and Z. Sura, Approximate computing: Challenges and opportunities, in Proc. of 2016 IEEE International Conference on Rebooting Computing (ICRC), San Diego, CA, USA, 2016, pp. 1-8.

Crossref

[18]

S. Sidiroglou-Douskos, S. Misailovic, H. Hoffmann, and M. Rinard, Managing performance vs. accuracy trade-offs with loop perforation, in Proceedings of the 19th ACM SIGSOFT Symposium and the 13th European Conference on Foundations of Software Engineering (ESEC/FSE’11), Szeged, Hungary, 2011, pp. 124-134.

Crossref

[19]

S. Agarwal, B. Mozafari, A. Panda, H. Milner, S. Madden, and I. Stoica, BlinkDB: Queries with bounded errors and bounded response times on very large data, in Proceedings of the 8th ACM European Conference on Computer Systems (EuroSys’13), Prague, Czech Republic, 2013, pp. 29-42.

Crossref

[20]

I. Goiri, R. Bianchini, S. Nagarakatte, and T. D. Nguyen, ApproxHadoop: Bringing approximations to MapReduce frameworks, in Proceedings of the ACM International Conference on Architectural Support for Programming Languages and Operating Systems (ASPLOS’15), Istanbul, Turkey, 2015, pp. 383-397.

Crossref

[21]

K. Li and G. Li, Approximate query processing: What is new and where to go?, Data Science and Engineering, vol. 3, no. 4, pp. 379-397, 2018.

Crossref Google Scholar

[22]

S. Agarwal, H. Milner, A. Kleiner, A. Talwalkar, M. Jordan, S. Madden, B. Mozafari, and I. Stoica, Knowing when you’re wrong: Building fast and reliable approximate query processing systems, in Proceedings of the 2014 ACM SIGMOD International Conference on Management of Data (SIGMOD’14), Snowbird, UT, USA, 2014, pp. 481-492.

Crossref

[23]

D. L. Quoc, R. Chen, P. Bhatotia, C. Fetzer, V. Hilt, and T. Strufe, Streamapprox: Approximate computing for stream analytics, in Proceedings of the 18th ACM/IFIP/USENIX Middleware Conference (Middleware’17), Las Vegas, NV, USA, 2017, pp. 185-197.

Crossref

[24]

O. Sagi and L. Rokach, Ensemble learning: A survey, Data Mining and Knowledge Discovery, vol. 8, no. 4, pp. 1-18, 2018.

Crossref Google Scholar

[25]

S. Basiri, E. Ollila, and V. Koivunen, Robust, scalable, and fast bootstrap method for analyzing large scale data, IEEE Transactions on Signal Processing, vol. 64, no. 4, pp. 1007-1017, 2016.

Crossref Google Scholar

[26]

V. K. Singh, M. Taram, V. Agrawal, and B. S. Baghel, A literature review on Hadoop ecosystem and various techniques of big data optimization, in Proceedings of International Conference on Data and Information Systems (ICDIS’17), Amarkantak, India, 2017, pp. 231-240.

Crossref

[27]

I. Polato, R. Ré, A. Goldman, and F. Kon, A comprehensive view of Hadoop research-systematic literature review, Journal of Network and Computer Applications, vol. 46, pp. 1-25, 2014.

Crossref Google Scholar

[28]

J. Dean and S. Ghemawat, Mapreduce: Simplified data processing on large clusters, Communications of the ACM, vol. 51, no. 1, pp. 107-113, 2008.

Crossref Google Scholar

[29]

R. Hasan, Z. Anwar, W. Yurcik, L. Brumbaugh, and R. Campbell, A survey of peer-to-peer storage techniques for distributed file systems, in Proceedings of the International Conference on Information Technology: Coding and Computing (ITCC’05) - Volume II, Las Vegas, NV, USA, 2005, pp. 205-213.

Crossref

[30]

P. Bzoch and J. Safarik, State of the art in distributed file systems: Increasing performance, in Proc. of 2011 Second Eastern European Regional Conference on the Engineering of Computer Based Systems, Bratislava, Slovakia, 2011, pp. 153-154.

Crossref

[31]

M. T. Ozsu and P. Valduriez, Principles of Distributed Database Systems, 3rd ed. New York, NY, USA: Springer-Verlag, 2011.

Crossref

[32]

D. Taniar, High performance database processing, in Proc. of 2012 IEEE 26th International Conference on Advanced Information Networking and Applications, Fukuoka, Japan, 2012, pp. 5-6.

Crossref

[33]

S. Vijayakumar, A. Bhargavi, U. Praseeda, and S. A. Ahamed, Optimizing sequence alignment in cloud using Hadoop and Mpp database, in Proceedings of the 2012 IEEE Fifth International Conference on Cloud Computing (CLOUD’12), Honolulu, HI, USA, 2012, pp. 819-827.

Crossref

[34]

S. Ghemawat, H. Gobioff, and S.-T. Leung, The google file system, ACM SIGOPS Operating Systems Review, vol. 37, no. 5, pp. 29-43, 2003.

Crossref Google Scholar

[35]

M. Isard, M. Budiu, Y. Yu, A. Birrell, and D. Fetterly, Dryad: Distributed data-parallel programs from sequential building blocks, in Proceedings of the 2nd ACM SIGOPS/EuroSys European Conference on Computer Systems 2007 (EuroSys’07), Lisbon, Portugal, 2007, pp. 59-72.

Crossref

[36]

R. Chaiken, B. Jenkins, P.-A. Larson, B. Ramsey, D. Shakib, S. Weaver, and J. Zhou, Scope: Easy and efficient parallel processing of massive data sets, Proceedings of the VLDB Endowment, vol. 1, no. 2, pp. 1265-1276, 2008.

Crossref Google Scholar

[37]

Y. Yu, M. Isard, D. Fetterly, M. Budiu, U. Erlingsson, P. K. Gunda, and J. Currey, Dryadlinq: A system for general-purpose distributed data-parallel computing using a high-level language, in Proceedings of the 8th USENIX Conference on Operating Systems Design and Implementation (OSDI’08), San Diego, CA, USA, 2008, pp. 1-14.

[38]

C. Olston, B. Reed, U. Srivastava, R. Kumar, and A. Tomkins, Pig latin: A not-so-foreign language for data processing, in Proceedings of the 2008 ACM SIGMOD International Conference on Management of Data (SIGMOD’08), Vancouver, Canada, 2008, pp. 1099-1110.

Crossref

[39]

S. Melnik, A. Gubarev, J. J. Long, G. Romer, S. Shivakumar, M. Tolton, and T. Vassilakis, Dremel: Interactive analysis of web-scale datasets, Communications of the ACM, vol. 54, no. 6, pp. 114-123, 2011.

Crossref Google Scholar

[40]

A. Thusoo, J. Sarma, N. Jain, Z. Shao, P. Chakka, N. Zhang, S. Antony, H. Liu, and R. Murthy, Hive a petabyte scale data warehouse using hadoop, in Proceedings of the 2010 IEEE 26th International Conference on Data Engineering (ICDE’10), Long Beach, CA, USA, 2010, pp. 996-1005.

Crossref

[41]

K. S. Beyer, V. Ercegovac, R. Gemulla, A. Balmin, M. Eltabakh, C.-C. Kanne, F. Ozcan, and E. J. Shekita, Jaql: A scripting language for large scale semi structured data analysis, Proceedings of the VLDB Endowment, vol. 4, no. 12, pp. 1272-1283, 2011.

Crossref Google Scholar

[42]

B. Chattopadhyay, L. Lin, W. Liu, S. Mittal, P. Aragonda, V. Lychagina, Y. Kwon, and M. Wong, Tenzing: A SQL implementation on the MapReduce framework, Proceedings of the VLDB Endowment, vol. 4, no. 12, pp. 1318-1327, 2011.

Crossref Google Scholar

[43]

H. Karau, A. Konwinski, P. Wendell, and M. Zaharia, Learning Spark: Lightning-Fast Big Data Analytics, 1st ed. Sebastopol, CA, USA: O’Reilly Media, 2015.

[44]

M. Zaharia, M. Chowdhury, T. Das, A. Dave, J. Ma, M. McCauley, M. J. Franklin, S. Shenker, and I. Stoica, Resilient distributed datasets: A fault-tolerant abstraction for in-memory cluster computing, in Proceedings of the 9th USENIX Conference on Networked Systems Design and Implementation (NSDI’12), San Jose, CA, USA, 2012, p. 2.

[45]

S. Navathe, S. Ceri, G. Wiederhold, and J. Dou, Vertical partitioning algorithms for database design, ACM Transactions on Database Systems, vol. 9, no. 4, pp. 680-710, 1984.

Crossref Google Scholar

[46]

D. W. Cornell and P. S. Yu, An effective approach to vertical partitioning for physical design of relational databases, IEEE Transactions on Software Engineering, vol. 16, no. 2, pp. 248-258, 1990.

Crossref Google Scholar

[47]

W. W. Chu and I. T. Ieong, A transaction-based approach to vertical partitioning for relational database systems, IEEE Transactions on Software Engineering, vol. 19, no. 8, pp. 804-812, 1993.

Crossref Google Scholar

[48]

C. Curino, E. Jones, Y. Zhang, and S. Madden, Schism: A workload-driven approach to database replication and partitioning, Proceedings of the VLDB Endowment, vol. 3, no. 1, pp. 48-57, 2010.

Crossref Google Scholar

[49]

C. Curino, E. P. Jones, R. A. Popa, N. Malviya, E. Wu, S. Madden, H. Balakrishnan, and N. Zeldovich, Relational cloud: A database-as-a-service for the cloud, in Proc. of 5th Biennial Conference on Innovative Data Systems Research (CIDR), Asilomar, CA, USA, 2011, pp. 235-240.

[50]

S. Das, D. Agrawal, and A. El Abbadi, Elastras: An elastic transactional data store in the cloud, in Proceedings of the 2009 Conference on Hot Topics in Cloud Computing (HotCloud’09), San Diego, CA, USA, 2009, pp. 1-5.

Crossref

[51]

J. Baker, C. Bond, J. Corbett, J. Furman, A. Khorlin, J. Larson, J.-M. Leon, Y. Li, A. Lloyd, and V. Yushprakh, Megastore: Providing scalable, highly available storage for interactive services, in Proc. of Conference on Innovative Database Research (CIDR), Asilomar, CA, USA, 2011, pp. 223-234.

[52]

J. Kamal, M. Murshed, and R. Buyya, Workload-aware incremental repartitioning of shared-nothing distributed databases for scalable OLTP applications, Future Generation Computer Systems, vol. 56, pp. 421-435, 2016.

Crossref Google Scholar

[53]

S. P. Phansalkar and A. R. Dani, Transaction aware vertical partitioning of database (TAVDP) for responsive OLTP applications in cloud data stores, Journal of Theoretical and Applied Information Technology, vol. 59, no. 1, pp. 73-81, 2014.

Google Scholar

[54]

P. A. Bernstein, I. Cseri, N. Dani, N. Ellis, A. Kalhan, G. Kakivaya, D. B. Lomet, R. Manne, L. Novik, and T. Talius, Adapting microsoft SQL server for cloud computing, in Proc. of 2011 IEEE 27th International Conference on Data Engineering, Hannover, Germany, 2011, pp. 1255-1263.

Crossref

[55]

S. Ahirrao and R. Ingle, Scalable transactions in cloud data stores, in Proc. of 2013 3rd IEEE International Advance Computing Conference (IACC), Ghaziabad, India, 2013, pp. 116-119.

Crossref

[56]

S. B. Navathe and M. Ra, Vertical partitioning for database design: A graphical algorithm, ACM SIGMOD Record, vol. 18, no. 2, pp. 440-450, 1989.

Crossref Google Scholar

[57]

J. H. Son and M. H. Kim, An adaptable vertical partitioning method in distributed systems, Journal of Systems and Software, vol. 73, no. 3, pp. 551-561, 2004.

Crossref Google Scholar

[58]

W. Zhao, Y. Cheng, and F. Rusu, Workload-driven vertical partitioning for effective query processing over raw data, arXiv preprint arXiv: 1503.08946, 2015.

Crossref

[59]

Y.-F. Huang and C.-J. Lai, Integrating frequent pattern clustering and branch-and-bound approaches for data partitioning, Information Sciences, vol. 328, pp. 288-301, 2016.

Crossref Google Scholar

[60]

M. Vojnovic, F. Xu, and J. Zhou, Sampling-based range partition methods for big data analytics, Technical Report MSR-TR-2012-18, Microsoft Research, Redmond, WA, USA, 2012.

[61]

A. Chakrabarti, S. Parthasarathy, and C. Stewart, Green- and heterogeneity-aware partitioning for data analytics, in Proc. of IEEE Conference on Computer Communications Workshops, San Francisco, CA, USA, 2016, pp. 366-371.

Crossref

[62]

S. Phansalkar and S. Ahirrao, Survey of data partitioning algorithms for big data stores, in Proc. of 2016 Fourth International Conference on Parallel, Distributed and Grid Computing (PDGC), Waknaghat, India, 2016, pp. 163-168.

Crossref

[63]

A. Shanbhag, A. Jindal, S. Madden, J. Quiane, and A. J. Elmore, A robust partitioning scheme for ad-hoc query workloads, in Proceedings of the 2017 Symposium on Cloud Computing (SoCC’17), Santa Clara, CA, USA, 2017, pp. 229-241.

Crossref

[64]

J. Wang, Q. Xiao, and J. Yin, DRAW: A new Data-gRouping-AWare data placement scheme for data intensive applications with interest locality, IEEE Transactions on Magnetics, vol. 49, no. 6, pp. 2514-2520, 2013.

Crossref Google Scholar

[65]

K. H. K. Reddy and D. S. Roy, DPPACS: A novel data partitioning and placement aware computation scheduling scheme for data-intensive cloud applications, The Computer Journal, vol. 59, no. 1, pp. 64-82, 2016.

Crossref Google Scholar

[66]

D. Yuan, Y. Yang, X. Liu, and J. Chen, A data placement strategy in scientific cloud workflows, Future Generation Computer Systems, vol. 26, no. 8, pp. 1200-1214, 2010.

Crossref Google Scholar

[67]

K. Slagter, C.-H. Hsu, Y.-C. Chung, and D. Zhang, An improved partitioning mechanism for optimizing massive data analysis using MapReduce, The Journal of Supercomputing, vol. 66, no. 1, pp. 539-555, 2013.

Crossref Google Scholar

[68]

Q. Chen, J. Yao, and Z. Xiao, Libra: Lightweight data skew mitigation in MapReduce, IEEE Transactions on Parallel and Distributed Systems, vol. 26, no. 9, pp. 2520-2533, 2015.

Crossref Google Scholar

[69]

M. He, G. Li, C. Huang, Y. Ye, and W. Tian, A comparative study of data skew in Hadoop, in Proceedings of 26th International Conference on Network, Communication and Computing (ICNCC’17), Kunming, China, 2017, pp. 1-6.

Crossref

[70]

Z. Tang, W. Lv, K. Li, and K. Li, An intermediate data partition algorithm for skew mitigation in spark computing environment, IEEE Transactions on Cloud Computing, .

Crossref Google Scholar

[71]

Y. Kwon, M. Balazinska, B. Howe, and J. Rolia, A study of skew in MapReduce applications, in The 5th Open Cirrus Summit, Moskow, Russia, 2011, pp. 1-5.

Crossref

[72]

Z. Tang, X. Zhang, K. Li, and K. Li, An intermediate data placement algorithm for load balancing in spark computing environment, Future Generation Computer Systems, vol. 78, no. 1, pp. 287-301, 2018.

Crossref Google Scholar

[73]

Y. Xu, P. Zou, W. Qu, Z. Li, K. Li, and X. Cui, Sampling-based partitioning in MapReduce for skewed data, in Proc. of 2012 Seventh ChinaGrid Annual Conference (CHINAGRID’12), Beijing, China, 2012, pp. 1-8.

Crossref

[74]

S. del RÃo, V. LÃ³pez, J. M. BenÃtez, and F. Herrera, On the use of MapReduce for imbalanced big data using random forest, Information Sciences, vol. 285, pp. 112-137, 2014.

Crossref Google Scholar

[75]

X. Ci and X. Meng, An efficient block sampling strategy for online aggregation in the cloud, in Proc. of International Conference on Web-Age Information Management (WAIM 2015), Qingdao, China, 2015, pp. 362-373.

Crossref

[76]

S. Salloum, J. Z. Huang, Y. He, X. Zhang, T. Z. Emara, C. Wei, and H. He, A random sample partition data model for big data analysis, arXiv preprint arXiv: 1712.04146, 2017.

[77]

C. Wei, S. Salloum, T. Z. Emara, X. Zhang, J. Z. Huang, and Y. He, A two-stage data processing algorithm to generate random sample partitions for big data analysis, in Proc. of International Conference on Cloud Computing (CLOUD 2018), Seattle, WA, USA, 2018, pp. 347-364.

Crossref

[78]

T. Z. Emara and J. Z. Huang, A distributed data management system to support large-scale data analysis, The Journal of Systems and Software, vol. 148, pp. 105-115, 2019.

Crossref Google Scholar

[79]

T. Z. Emara and J. Z. Huang, RRPlib: A Spark library for representing HDFS blocks as a set of random sample data blocks, Science of Computer Programming, vol. 184, pp. 1-7, 2019.

Crossref Google Scholar

[80]

A. Jacobs, The pathologies of big data, Communications of the ACM, vol. 52, no. 8, pp. 36-44, 2009.

Crossref Google Scholar

[81]

G. Cormode and N. Duffield, Sampling for big data: A tutorial, in Proceedings of the 20th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining (KDD’14), New York, NY, USA, 2014, pp. 1975-1975.

Crossref

[82]

J. Acharya, I. Diakonikolas, C. Hegde, J. Z. Li, and L. Schmidt, Fast and near-optimal algorithms for approximating distributions by histograms, in Proceedings of the 34th ACM SIGMOD-SIGACT-SIGAI Symposium on Principles of Database Systems (PODS’15), Melbourne, Australia, 2015, pp. 249-263.

Crossref

[83]

B. Efron and R. J. Tibshirani, An Introduction to the Bootstrap. New York, NY, USA: Chapman & Hall, 1993.

Crossref

[84]

F. Olken and D. Rotem, Random sampling from database files: A survey, in Proceedings of the 5th International Conference on Statistical and Scientific Database Management (SSDBM’1990), Charlotte, NC, USA, 1990, pp. 92-111.

Crossref

[85]

F. Olken, Random sampling from databases, PhD dissertation, University of California at Berkeley, Berkeley, CA, USA, 1993.

[86]

P. Bruce and A. Bruce, Practical Statistics for Data Scientists: 50 Essential Concepts. Sebastopol, CA, USA: O’Reilly Media, Inc., 2017.

[87]

W. S. Cleveland and R. Hafen, Divide and recombine (d&r): Data science for large complex data, Statistical Analysis and Data Mining: The ASA Data Science Journal, vol. 7, no. 6, pp. 425-433, 2014.

Crossref Google Scholar

[88]

P. Bailis, E. Gan, S. Madden, D. Narayanan, K. Rong, and S. Suri, Macrobase: Prioritizing attention in fast data, in Proceedings of the 2017 ACM International Conference on Management of Data (SIGMOD’17), Chicago, IL, USA, 2017, pp. 541-556.

Crossref

[89]

S. Chaudhuri, R. Motwani, and V. Narasayya, Random sampling for histogram construction: How much is enough?, ACM SIGMOD Record, vol. 27, no. 2, pp. 436-447, 1998.

Crossref Google Scholar

[90]

G. S. Manku, S. Rajagopalan, and B. G. Lindsay, Random sampling techniques for space efficient online computation of order statistics of large datasets, in Proceedings of the 1999 ACM SIGMOD International Conference on Management of Data (SIGMOD’99), Philadelphia, PA, USA, 1999, pp. 251-262.

Crossref

[91]

M. Charikar, K. Chen, and M. Farach-Colton, Finding frequent items in data streams, in Proceedings of the 29th International Colloquium on Automata, Languages and Programming (ICALP’02), Málaga, Spain, 2002, pp. 693-703.

Crossref

[92]

S. Guha and A. McGregor, Space-efficient sampling, in Proceedings of the Eleventh International Conference on Artificial Intelligence and Statistics (AISTATS 2007), San Juan, PUR, USA, 2007, pp. 169-176.

[93]

F. Kuhn, T. Locher, and S. Schmid, Distributed computation of the mode, in Proceedings of the Twenty-seventh ACM Symposium on Principles of Distributed Computing (PODC’08), Toronto, Canada, 2008, pp. 15-24.

Crossref

[94]

A. Kirsch, M. Mitzenmacher, A. Pietracaprina, E. Upfal, and F. Vandin, A rigorous statistical approach for identifying significant itemsets, in Proceedings of the IEEE International Conference on Data Mining (ICDM’08), Pisa, Italy, 2008, pp. 1-10.

[95]

J. A. R. Rojas, M. Beth Kery, S. Rosenthal, and A. Dey, Sampling techniques to improve big data exploration, in Proc. of 2017 IEEE 7th Symposium on Large Data Analysis and Visualization (LDAV), Phoenix, AZ, USA, 2017, pp. 26-35.

[96]

S. Kandel, A. Paepcke, J. M. Hellerstein, and J. Heer, Enterprise data analysis and visualization: An interview study, IEEE Transactions on Visualization and Computer Graphics, vol. 18, no. 12, pp. 2917-2926, 2012.

Crossref Google Scholar

[97]

J. Lin and D. Ryaboy, Scaling big data mining infrastructure: The twitter experience, SIGKDD Explorations, vol. 14, no. 2, pp. 6-19, 2013.

Crossref Google Scholar

[98]

C. T. Fan, Development of sampling plans by using sequential (item by item) selection techniques and digital computers, Publications of the American Statistical Association, vol. 57, no. 298, pp. 387-402, 1962.

Crossref Google Scholar

[99]

P. J. Haas, Data-stream sampling: Basic techniques and results, in Data Stream Management, Data-Centric Systems and Applications Book Series. Berlin, Heidelberg, Germany: Springer, 2016, pp. 13-44.

Crossref

[100]

T. E. Oliphant, Scipy: Open source scientific tools for python, Computing in Science and Engineering, vol. 9, no. 3, pp. 10-20, 2007.

Crossref Google Scholar

[101]

M. Hall, E. Frank, G. Holmes, B. Pfahringer, P. Reutemann, and I. H. Witten, The weka data mining software: An update, ACM SIGKDD Explorations Newsletter, vol. 11, no. 1, pp. 10-18, 2009.

Crossref Google Scholar

[102]

F. Pedregosa, G. Varoquaux, A. Gramfort, V. Michel, B. Thirion, O. Grisel, M. Blondel, P. Prettenhofer, R. Weiss, V. Dubourg, et al., Scikit-learn: Machine learning in python, The Journal of Machine Learning Research, vol. 12, no. 10, pp. 2825-2830, 2011.

Google Scholar

[103]

M. Al-Kateb and B. S. Lee, Stratified reservoir sampling over heterogeneous data streams, in Proceedings of the 22nd International Conference on Scientific and Statistical Database Management (SSDBM’10), Heidelberg, Germany, 2010, pp. 621-639.

Crossref

[104]

J. S. Vitter, Random sampling with a reservoir, ACM Transactions on Mathematical Software, vol. 11, no. 1, pp. 37-57, 1985.

Crossref Google Scholar

[105]

C. C. Aggarwal, On biased reservoir sampling in the presence of stream evolution, in Proceedings of the 32nd International Conference on Very Large Databases (VLDB’06), Seoul, Korea, 2006, pp. 607-618.

[106]

M. Dash and W. Ng, Efficient reservoir sampling for transactional data streams, in Proc. of Sixth IEEE International Conference on Data Mining-Workshops (ICDMW’06), Hong Kong, China, 2006, pp. 662-666.

Crossref

[107]

V. Malbasa and S. Vucetic, A reservoir sampling algorithm with adaptive estimation of conditional expectation, in Proc. of International Joint Conference on Neural Networks, Orlando, FL, USA, 2007, pp. 2200-2204.

Crossref

[108]

A. Kleiner, A. Talwalkar, P. Sarkar, and M. I. Jordan, The big data Bootstrap, in Proceedings of the 29th International Coference on International Conference on Machine Learning (ICML’12), Edinburgh, UK, 2012, pp. 1787-1794.

[109]

L. Breiman, J. Friedman, C. J. Stone, and R. A. Olshen, Classification and Regression Trees. Boca Raton, FL, USA: CRC Press, 1984.

[110]

L. Breiman, Bagging predictors, Machine Learning, vol. 24, no. 2, pp. 123-140, 1996.

Crossref Google Scholar

[111]

M. I. Jordan, On statistics, computation and scalability, Bernoulli, vol. 19, no. 4, pp. 1378-1390, 2013.

Crossref Google Scholar

[112]

R. Genuer, J.-M. Poggi, C. Tuleau-Malot, and N. Villa-Vialaneix, Random forests for big data, Big Data Research, vol. 9, pp. 28-46, 2017.

Crossref Google Scholar

[113]

X. Meng, Scalable simple random sampling and stratified sampling, in Proceedings of the 30th International Conference on International Conference on Machine Learning (ICML’13), Atlanta, GA, USA, 2013, pp. 531-539.

[114]

P. Sanders, S. Lamm, L. Hübschle-Schneider, E. Schrade, and C. Dachsbacher, Efficient parallel random sampling-vectorized, cache-efficient, and online, ACM Transactions on Mathematical Software, vol. 44, no. 3, pp. 1-14, 2018.

Crossref Google Scholar

[115]

E. Gavagsaz, A. Rezaee, and H. H. S. Javadi, Load balancing in reducers for skewed data in MapReduce systems by using scalable simple random sampling, The Journal of Supercomputing, vol. 74, no. 7, pp. 3415-3440, 2018.

Crossref Google Scholar

[116]

G. Cormode, M. Garofalakis, P. J. Haas, and C. Jermaine, Synopses for massive data: Samples, histograms, wavelets, sketches, Foundations Trends Databases, vol. 4, no. 1, pp. 1-294, 2012.

Crossref Google Scholar

[117]

Q. Xu, T. Mytkowicz, and N. S. Kim, Approximate computing: A survey, IEEE Design Test, vol. 33, no. 1, pp. 8-22, 2016.

Crossref Google Scholar

[118]

Y. Shi, X. Meng, F. Wang, and Y. Gan, HEDC: A histogram estimator for data in the cloud, in Proceedings of the 4th International Workshop on Cloud Data Management (CloudDB’12), Maui, HI, USA, 2012, pp. 51-58.

Crossref

[119]

D. R. Krishnan, D. L. Quoc, P. Bhatotia, C. Fetzer, and R. Rodrigues, IncApprox: A data analytics system for incremental approximate computing, in Proceedings of the 25th International Conference on World Wide Web (WWW’16), Montral, Canada, 2016, pp. 1133-1144.

Crossref

[120]

M. N. Garofalakis and P. B. Gibbon, Approximate query processing: Taming the terabytes, in Proceedings of the 27th International Conference on Very Large Data Bases (VLDB’01), Roma, Italy, 2001, p. 725.

[121]

N. Laptev, K. Zeng, and C. Zaniolo, Early accurate results for advanced analytics on MapReduce, Proceedings of the VLDB Endowment, vol. 5, no. 10, pp. 1028-1039, 2012.

Crossref Google Scholar

[122]

G. Hu, D. Zhang, S. Rigo, and T. D. Nguyen, Approximation with error bounds in spark, arXiv preprint arXiv: 1812.01823, 2018.

Crossref

[123]

Y. Park, J. Qing, X. Shen, and B. Mozafari, BlinkML: Approximate machine learning with probabilistic guarantees, in Proc. of the 45th International Conference on Very Large Data Bases, Los Angeles, CA, USA, 2018, pp. 1-18.

Crossref

[124]

X. Zhang, J. Wang, and J. Yin, Sapprox: Enabling efficient and accurate approximations on sub-datasets with distribution-aware online sampling, Proceedings of the VLDB Endowment, vol. 10, no. 3, pp. 109-120, 2016.

Crossref Google Scholar

[125]

S. Salloum, J. Z. Huang, Y. He, and X. Chen, An asymptotic ensemble learning framework for big data analysis, IEEE Access, vol. 7, no. 1, pp. 3675-3693, 2019.

Crossref Google Scholar

[126]

S. Salloum, J. Z. Huang, and Y. He, Exploring and cleaning big data with random sample data blocks, Journal of Big Data, vol. 6, no. 1, p. 45, 2019.

Crossref Google Scholar

[127]

Z. Wen, D. L. Quoc, P. Bhatotia, R. Chen, and M. Lee, ApproxIoT: Approximate analytics for edge computing, in Proc. of 2018 IEEE 38th International Conference on Distributed Computing Systems (ICDCS), Vienna, Austria, 2018, pp. 411-421.

Crossref

[128]

M. Elteir, H. Lin, and W. Feng, Enhancing MapReduce via asynchronous data processing, in Proc. of 2010 IEEE 16th International Conference on Parallel and Distributed Systems, Shanghai, China, 2010, pp. 397-405.

Crossref

Big Data Mining and Analytics

Volume 3 Issue 2,
June 2020

Pages 85-101

DOI: 10.26599/BDMA.2019.9020015

Cite this article:

Mahmud MS, Huang JZ, Salloum S, et al. A Survey of Data Partitioning and Sampling Methods to Support Big Data Analysis. Big Data Mining and Analytics, 2020, 3(2): 85-101. https://doi.org/10.26599/BDMA.2019.9020015

2834

Views

304

Downloads

166

Crossref

125

Web of Science

187

Scopus

CSCD

Google Scholar
Citation

Altmetrics

Received: 30 May 2019

Revised: 23 September 2019

Accepted: 25 September 2019

Published: 27 February 2020

The articles published in this open access journal are distributed under the terms of the Creative Commons Attribution 4.0 International License (http://creativecommons.org/licenses/by/4.0/).