LotusSQL: SQL Engine for High-Performance Big Data Systems

Xiaohan Li; Bowen Yu; Guanyu Feng; Haojie Wang; Wenguang Chen

doi:10.26599/BDMA.2021.9020009

Open Access

Department of Computer Science and Technology, Tsinghua University, China

Abstract

In recent years, Apache Spark has become the de facto standard for big data processing. Spark SQL is a Spark module that supports relational analysis through Structured Query Language (SQL) and provides convenient data processing interfaces. Despite its efficient optimizer, Spark SQL still suffers from the inefficiencies of Spark itself, which stem from the Java virtual machine and from unnecessary data serialization and deserialization. Adopting a native language such as C++ can help avoid these bottlenecks: benefiting from a bare-metal runtime environment and the use of templates, systems with C++ interfaces usually achieve superior performance. However, the complexity of native languages also increases the programming and debugging effort required. In this work, we present LotusSQL, an engine that provides SQL support for the dataset abstraction on Lotus, a native backend. We employ a convenient SQL processing framework to handle frontend jobs and add advanced query optimization techniques to improve the quality of execution plans. On top of the storage design and user interface of the compute engine, LotusSQL implements a set of highly efficient structured dataset operations and integrates them with the frontend. Evaluation results show that LotusSQL achieves a speedup of up to $9 \times$ on certain queries and outperforms Spark SQL on a standard query benchmark by more than $2 \times$ on average.

Keywords

big data; C++; Structured Query Language (SQL); query optimization

Big Data Mining and Analytics

Volume 4, Issue 4, December 2021

Pages 252-265

DOI: 10.26599/BDMA.2021.9020009

Cite this article:

Li X, Yu B, Feng G, et al. LotusSQL: SQL Engine for High-Performance Big Data Systems. Big Data Mining and Analytics, 2021, 4(4): 252-265. https://doi.org/10.26599/BDMA.2021.9020009

Received: 11 May 2021

Accepted: 28 May 2021

Published: 26 August 2021

The articles published in this open access journal are distributed under the terms of the Creative Commons Attribution 4.0 International License (http://creativecommons.org/licenses/by/4.0/).