Dache: A Data Aware Caching for Big-Data Applications Using the MapReduce Framework

Yaxiong Zhao; Jie Wu; Cong Liu

doi:10.1109/TST.2014.6733207

Tsinghua Science and Technology 2014, 19(1): 39-50 https://doi.org/10.1109/TST.2014.6733207

Open Access | Issue | Published: 07 February 2014

Dache: A Data Aware Caching for Big-Data Applications Using the MapReduce Framework

Show Author's Information Hide Author's Information Yaxiong Zhao(

), Jie Wu, Cong Liu

Google Inc., Mountain View, CA 94043, USA. This work was done while the author was with Temple University, Philadelphia, PA 19122, USA

Temple University, Philadelphia, PA 19122, USA

Sun Yat-Sen University, Guangzhou 510275, China

Keywords:

caching, big-data, MapReduce, Hadoop

Cite this article:

Zhao Y, Wu J, Liu C. Dache: A Data Aware Caching for Big-Data Applications Using the MapReduce Framework. Tsinghua Science and Technology, 2014, 19(1): 39-50. https://doi.org/10.1109/TST.2014.6733207

Download citation

EndNote(RIS)

BibTeX

381

Views

Downloads

Citations

Crossref

N/A

WoS

Scopus

CSCD

Abstract Full text About this article

Abstract

The buzz-word big-data refers to the large-scale distributed data processing applications that operate on exceptionally large amounts of data. Google’s MapReduce and Apache’s Hadoop, its open-source implementation, are the defacto software systems for big-data applications. An observation of the MapReduce framework is that the framework generates a large amount of intermediate data. Such abundant information is thrown away after the tasks finish, because MapReduce is unable to utilize them. In this paper, we propose Dache, a data-aware cache framework for big-data applications. In Dache, tasks submit their intermediate results to the cache manager. A task queries the cache manager before executing the actual computing work. A novel cache description scheme and a cache request and reply protocol are designed. We implement Dache by extending Hadoop. Testbed experiment results demonstrate that Dache significantly improves the completion time of MapReduce jobs.

Full text

Abstract

Full text

Outline

About this article

Dache: A Data Aware Caching for Big-Data Applications Using the MapReduce Framework

Show Author's information Hide Author's Information Yaxiong Zhao(

), Jie Wu, Cong Liu

Google Inc., Mountain View, CA 94043, USA. This work was done while the author was with Temple University, Philadelphia, PA 19122, USA

Temple University, Philadelphia, PA 19122, USA

Sun Yat-Sen University, Guangzhou 510275, China

Abstract

Keywords: caching, big-data, MapReduce, Hadoop

References(28)

[1]

J. Dean and S. Ghemawat, Mapreduce: Simplified data processing on large clusters, Commun. of ACM, vol. 51, no. 1, pp. 107-113, 2008.

DOI Google Scholar

[2]

Hadoop, http://hadoop.apache.org/, 2013.

[3]

Java programming language, http://www.java.com/, 2013.

[4]

P. Th. Eugster, P. A. Felber, R. Guerraoui, and A.-M. Kermarrec, The many faces of publish/subscribe, ACM Comput. Surv., vol. 35, no. 2, pp. 114-131, 2003.

DOI Google Scholar

[5]

Cache algorithms, http://en.wikipedia.org/wiki/Cache algorithms, 2013.

[6]

Amawon web services, http://aws.amazon.com/, 2013.

[7]

Google compute engine, http://cloud.google.com/ products/computeengine.html, 2013.

[8]

G. Ramalingam and T. Reps. A categorized bibliography on incremental computation, in Proc. of POPL ’93, New York, NY, USA, 1993.

DOI

[9]

F. Chang, J. Dean, S. Ghemawat, W. C. Hsieh, D. A. Wallach, M. Burrows, T. Chandra, A. Fikes, and R. E. Gruber. Bigtable: A distributed storage system for structured data, in Proc. of OSDI’2006, Berkeley, CA, USA, 2006.

[10]

S. Ghemawat, H. Gobioff, and S.-T. Leung, The google file system, SIGOPS Oper. Syst. Rev., vol. 37, no. 5, pp. 29-43, 2003.

DOI Google Scholar

[11]

D. Peng and F. Dabek, Largescale incremental processing using distributed transactions and notifications, in Proc. of OSDI’ 2010, Berkeley, CA, USA, 2010.

[12]

J. Ousterhout, P. Agrawal, D. Erickson, C. Kozyrakis, J. Leverich, D. Mazi‘eres, S. Mitra, A. Narayanan, D. Ongaro, G. Parulkar, M. Rosenblum, S. M. Rumble, E. Stratmann, and R. Stutsman, The case for ramcloud, Commun. of ACM, vol. 54, no. 7, pp. 121-130, 2011.

DOI Google Scholar

[13]

M. Isard, M. Budiu, Y. Yu, A. Birrell, and D. Fetterly, Dryad: Distributed data-parallel programs from sequential building blocks, SIGOPS Oper. Syst. Rev., vol. 41, no. 3, pp. 59-72, 2007.

DOI Google Scholar

[14]

L. Popa, M. Budiu, Y. Yu, and M. Isard, Dryadinc: Reusing work in large-scale computations, in Proc. of HotCloud’09, Berkeley, CA, USA, 2009.

[15]

C. Olston, G. Chiou, L. Chitnis, F. Liu, Y. Han, M. Larsson, A. Neumann, V. B. N. Rao, V. Sankarasubramanian, S. Seth, C. Tian, T. ZiCornell, and X. Wang, Nova: Continuous pig/hadoop workflows, in Proc. of SIGMOD’2011, New York, NY, USA, 2011.

[16]

C. Olston, B. Reed, U. Srivastava, R. Kumar, and A. Tomkins, Pig latin: A not-so-foreign language for data processing, in Proc. of SIGMOD’2008, New York, NY, USA, 2008.

DOI

[17]

U. A. Acar, Self-adjusting computation: An Overview, in Proc. of PEPM’09, New York, NY, USA, 2009.

DOI

[18]

T. Karagiannis, C. Gkantsidis, D. Narayanan, and A. Rowstron, Hermes: Clustering users in large-scale e-mail services, in Proc. of SoCC ’10, New York, NY, USA, 2010.

DOI

[19]

Memcached—A distributed memory object caching system, http://memcached.org/, 2013.

[20]

P. Scheuermann, G. Weikum, and P. Zabback, Data partitioning and load balancing in parallel disk systems, The VLDB Journal, vol. 7, no. 1, pp. 48-66, 1998.

DOI Google Scholar

[21]

M. Zaharia, A. Konwinski, A. D. Joseph, R. Katz, and I. Stoica, Improving mapreduce performance in heterogeneous environments, in Proc. of OSDI’2008, Berkeley, CA, USA, 2008.

[22]

H. Herodotou, F. Dong, and S. Babu, No one (cluster) size fits all: Automatic cluster sizing for data-intensive analytics, in Proc. of SOCC’2011, New York, NY, USA, 2011.

DOI

[23]

S. Wu, F. Li, S. Mehrotra, and B. C. Ooi, Query optimization for massively parallel data processing, in Proc. of SOCC’2011, New York, NY, USA, 2011.

DOI

[24]

D. Logothetis, C. Olston, B. Reed, K. C. Webb, and K. Yocum, Stateful bulk processing for incremental analytics, in Proc. of SOCC’2011, New York, NY, USA, 2010.

DOI

[25]

B. He, M. Yang, Z. Guo, R. Chen, B. Su, W. Lin, and L. Zhou, Comet: Batched stream processing for data intensive distributed computing, in Proc. of SOCC’2011, New York, NY, USA, 2010.

DOI

[26]

D. Battr’e, S. Ewen, F. Hueske, O. Kao, V. Markl, and D. Warneke, Nephele/pacts: A programming model and execution framework for web-scale analytical processing, in Proc. of SOCC’2010, New York, NY, USA, 2010.

DOI

[27]

P. Xiong, Y. Chi, S. Zhu, J. Tatemura, C. Pu, and H. HacigümüŞ, Activesla: A profit-oriented admission control framework for database-as-a-service providers, in Proc. of SOCC’2011, New York, NY, USA, 2011.

DOI

[28]

H. Gonzalez, A. Halevy, C. S. Jensen, A. Langen, J. Madhavan, R. Shapley, and W. Shen, Google fusion tables: Data management, integration and collaboration in the cloud, in Proc. of SOCC’2010, New York, NY, USA, 2010.

DOI

About this article

Publication history

Acknowledgements

Rights and permissions

Publication history

Received: 13 December 2013

Accepted: 20 December 2013

Published: 07 February 2014

Issue date: February 2014

Copyright

Acknowledgements

This research was supported in part by the Natural Science Foundation of USA (Nos. ECCS 1128209, CNS 1138963, CNS 1065444, and CCF 1028167).