Banian: A Cross-Platform Interactive Query System for Structured Big Data

Tao Xu; Dongsheng Wang; Guodong Liu

doi:10.1109/TST.2015.7040514

Tsinghua Science and Technology 2015, 20(1): 62-71 https://doi.org/10.1109/TST.2015.7040514

Open Access | Issue | Published: 12 February 2015

Department of Computer Science and Technology, Tsinghua University, Beijing 100084, China.

Department of Computer Science and Technology and Tsinghua National Laboratory for Information Science and Technology, Tsinghua University, Beijing 100084, China.

Tsinghua National Laboratory for Information Science and Technology, Tsinghua University, Beijing 100084, China.

Keywords: big data, HDFS, interactive query, relational database, cross platform

Cite this article:

Xu T, Wang D, Liu G. Banian: A Cross-Platform Interactive Query System for Structured Big Data. Tsinghua Science and Technology, 2015, 20(1): 62-71. https://doi.org/10.1109/TST.2015.7040514

Abstract

The rapid growth of structured data presents new technological challenges for research on big data and relational databases. In this paper, we present Banian, an efficient system for managing and analyzing PB-level structured data. Banian overcomes the storage-structure limitations of relational databases and effectively integrates interactive query with large-scale storage management. It provides a uniform query interface for cross-platform datasets and thus offers favorable compatibility and scalability. Banian's architecture comprises three main layers: (1) a storage layer that uses HDFS for the distributed storage of massive data; (2) a scheduling and execution layer that employs the splitting and scheduling techniques of parallel databases; and (3) an application layer that provides a cross-platform query interface and supports standard SQL. We evaluate Banian using PB-level Internet data and the TPC-H benchmark. The results show that, compared with Hive, Banian improves query performance by up to 30 times and achieves better scalability and concurrency.
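
The split-and-schedule idea behind the second layer can be illustrated with a minimal sketch. This is not Banian's implementation: the in-memory `BLOCKS` list stands in for HDFS blocks, a thread pool stands in for the scheduler, and the names `scan_block` and `run_query` are hypothetical. It shows only the general pattern of splitting an aggregate query into per-block sub-queries, running them in parallel, and merging the partial results.

```python
from collections import Counter
from concurrent.futures import ThreadPoolExecutor

# Stand-in for the storage layer: each inner list plays the role of one
# data block holding (key, value) rows. The data is made up for illustration.
BLOCKS = [
    [("cn", 3), ("us", 5)],
    [("cn", 2), ("de", 7)],
    [("us", 1), ("de", 1)],
]


def scan_block(block):
    """Sub-query on one block: partial SUM(value) GROUP BY key."""
    partial = Counter()
    for key, value in block:
        partial[key] += value
    return partial


def run_query(blocks, workers=3):
    """Stand-in for the scheduling and execution layer.

    Splits the query into one sub-query per block, runs the sub-queries
    in parallel, then merges the partial aggregates into a final result.
    """
    with ThreadPoolExecutor(max_workers=workers) as pool:
        partials = list(pool.map(scan_block, blocks))
    total = Counter()
    for partial in partials:
        total.update(partial)  # Counter.update adds counts per key
    return dict(total)


# Roughly: SELECT key, SUM(value) FROM t GROUP BY key
result = run_query(BLOCKS)
```

Because each sub-query touches only its own block, adding blocks and workers scales the scan phase horizontally, and only the small partial aggregates need to be merged centrally.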

Publication history

Received: 03 December 2014

Accepted: 25 December 2014

Published: 12 February 2015

Issue date: February 2015

Acknowledgements

This work was supported by the National High-Tech Research and Development (863) Program of China (No. 2012AA012609).