Volume 4, Issue 4


LotusSQL: SQL Engine for High-Performance Big Data Systems

Xiaohan Li, Bowen Yu, Guanyu Feng, Haojie Wang, Wenguang Chen
Department of Computer Science and Technology, Tsinghua University, China

Abstract

In recent years, Apache Spark has become the de facto standard for big data processing. Spark SQL is a Spark module that supports relational analysis with Structured Query Language (SQL) and provides convenient data processing interfaces. Despite its efficient optimizer, Spark SQL still suffers from inefficiencies inherited from Spark, which stem from the Java Virtual Machine (JVM) and from unnecessary data serialization and deserialization. Adopting a native language such as C++ could help avoid these bottlenecks. Benefiting from a bare-metal runtime environment and the use of templates, systems with C++ interfaces usually achieve superior performance. However, the complexity of native languages also increases the programming and debugging effort required. In this work, we present LotusSQL, an engine that provides SQL support for the dataset abstraction of Lotus, a native backend. We employ a convenient SQL processing framework to handle frontend jobs and add advanced query optimization techniques to improve the quality of execution plans. On top of the storage design and user interface of the compute engine, LotusSQL implements a set of highly efficient structured dataset operations and integrates them with the frontend. Evaluation results show that LotusSQL achieves a speedup of up to 9× on certain queries and outperforms Spark SQL on a standard query benchmark by more than 2× on average.

Keywords:

big data, C++, Structured Query Language (SQL), query optimization
Received: 11 May 2021; Accepted: 28 May 2021; Published: 26 August 2021; Issue date: December 2021

Copyright

© The author(s) 2021

Rights and permissions

The articles published in this open access journal are distributed under the terms of the Creative Commons Attribution 4.0 International License (http://creativecommons.org/licenses/by/4.0/).

Reprints and permission requests may be sought directly from the editorial office.