Volume 20, Issue 1

Banian: A Cross-Platform Interactive Query System for Structured Big Data

Tao Xu, Dongsheng Wang, Guodong Liu
Department of Computer Science and Technology, Tsinghua University, Beijing 100084, China.
Department of Computer Science and Technology and Tsinghua National Laboratory for Information Science and Technology, Tsinghua University, Beijing 100084, China.
Tsinghua National Laboratory for Information Science and Technology, Tsinghua University, Beijing 100084, China.

Abstract

The rapid growth of structured data has created new technological challenges for research on big data and relational databases. In this paper, we present Banian, an efficient system for managing and analyzing PB-level structured data. Banian overcomes the storage structure limitations of relational databases and effectively integrates interactive query with large-scale storage management. It provides a uniform query interface for cross-platform datasets and thus offers favorable compatibility and scalability. Banian's architecture comprises three main layers: (1) a storage layer that uses HDFS for the distributed storage of massive data; (2) a scheduling and execution layer that employs the splitting and scheduling techniques of parallel databases; and (3) an application layer that provides a cross-platform query interface and supports standard SQL. We evaluate Banian using PB-level Internet data and the TPC-H benchmark. The results show that, compared with Hive, Banian improves query performance by up to 30 times and achieves better scalability and concurrency.

Keywords: big data, HDFS, interactive query, relational database, cross platform
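
The abstract does not give implementation details for the scheduling and execution layer, so the following is only a generic sketch of the split-and-merge pattern used by parallel databases, not Banian's actual code. It assumes a grouped SUM query, splits it into one fragment per data partition, runs the fragments in parallel, and merges the partial results. The names `PARTITIONS`, `run_fragment`, `merge_partials`, and `run_query` are hypothetical, and in-memory lists stand in for blocks that a real system would read from HDFS.

```python
from concurrent.futures import ThreadPoolExecutor

# Hypothetical data: each inner list stands in for one storage partition of a
# table (a real system would read these from distributed storage such as HDFS).
PARTITIONS = [
    [("cn", 10), ("us", 5)],
    [("cn", 7), ("de", 3)],
    [("us", 1), ("de", 4)],
]


def run_fragment(rows):
    """Run one query fragment: a partial SUM(value) GROUP BY key on one partition."""
    partial = {}
    for key, value in rows:
        partial[key] = partial.get(key, 0) + value
    return partial


def merge_partials(partials):
    """Combine the per-partition partial sums into the final grouped result."""
    total = {}
    for partial in partials:
        for key, value in partial.items():
            total[key] = total.get(key, 0) + value
    return total


def run_query(partitions, workers=3):
    """Split into one fragment per partition, execute in parallel, then merge."""
    with ThreadPoolExecutor(max_workers=workers) as pool:
        partials = list(pool.map(run_fragment, partitions))
    return merge_partials(partials)


result = run_query(PARTITIONS)
print(result)
```

Because SUM is associative, each fragment can aggregate its own partition independently and the merge step only touches the small partial results. This is the property that lets such a layer scale with the number of partitions.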

Publication history

Received: 03 December 2014
Accepted: 25 December 2014
Published: 12 February 2015
Issue date: February 2015

Copyright

© The authors 2015

Acknowledgements

This work was supported by the National High-Tech Research and Development (863) Program of China (No. 2012AA012609).
