Journal Home > Volume 19 , Issue 1

The buzz-word big-data refers to the large-scale distributed data processing applications that operate on exceptionally large amounts of data. Google’s MapReduce and Apache’s Hadoop, its open-source implementation, are the defacto software systems for big-data applications. An observation of the MapReduce framework is that the framework generates a large amount of intermediate data. Such abundant information is thrown away after the tasks finish, because MapReduce is unable to utilize them. In this paper, we propose Dache, a data-aware cache framework for big-data applications. In Dache, tasks submit their intermediate results to the cache manager. A task queries the cache manager before executing the actual computing work. A novel cache description scheme and a cache request and reply protocol are designed. We implement Dache by extending Hadoop. Testbed experiment results demonstrate that Dache significantly improves the completion time of MapReduce jobs.


menu
Abstract
Full text
Outline
About this article

Dache: A Data Aware Caching for Big-Data Applications Using the MapReduce Framework

Show Author's information Yaxiong Zhao( )Jie WuCong Liu
Google Inc., Mountain View, CA 94043, USA. This work was done while the author was with Temple University, Philadelphia, PA 19122, USA
Temple University, Philadelphia, PA 19122, USA
Sun Yat-Sen University, Guangzhou 510275, China

Abstract

The buzz-word big-data refers to the large-scale distributed data processing applications that operate on exceptionally large amounts of data. Google’s MapReduce and Apache’s Hadoop, its open-source implementation, are the defacto software systems for big-data applications. An observation of the MapReduce framework is that the framework generates a large amount of intermediate data. Such abundant information is thrown away after the tasks finish, because MapReduce is unable to utilize them. In this paper, we propose Dache, a data-aware cache framework for big-data applications. In Dache, tasks submit their intermediate results to the cache manager. A task queries the cache manager before executing the actual computing work. A novel cache description scheme and a cache request and reply protocol are designed. We implement Dache by extending Hadoop. Testbed experiment results demonstrate that Dache significantly improves the completion time of MapReduce jobs.

Keywords: caching, big-data, MapReduce, Hadoop

References(28)

[1]
J. Dean and S. Ghemawat, Mapreduce: Simplified data processing on large clusters, Commun. of ACM, vol. 51, no. 1, pp. 107-113, 2008.
[2]
Hadoop, http://hadoop.apache.org/, 2013.
[3]
Java programming language, http://www.java.com/, 2013.
[4]
P. Th. Eugster, P. A. Felber, R. Guerraoui, and A.-M. Kermarrec, The many faces of publish/subscribe, ACM Comput. Surv., vol. 35, no. 2, pp. 114-131, 2003.
[5]
[6]
Amawon web services, http://aws.amazon.com/, 2013.
[7]
[8]
G. Ramalingam and T. Reps. A categorized bibliography on incremental computation, in Proc. of POPL ’93, New York, NY, USA, 1993.
DOI
[9]
F. Chang, J. Dean, S. Ghemawat, W. C. Hsieh, D. A. Wallach, M. Burrows, T. Chandra, A. Fikes, and R. E. Gruber. Bigtable: A distributed storage system for structured data, in Proc. of OSDI’2006, Berkeley, CA, USA, 2006.
[10]
S. Ghemawat, H. Gobioff, and S.-T. Leung, The google file system, SIGOPS Oper. Syst. Rev., vol. 37, no. 5, pp. 29-43, 2003.
[11]
D. Peng and F. Dabek, Largescale incremental processing using distributed transactions and notifications, in Proc. of OSDI’ 2010, Berkeley, CA, USA, 2010.
[12]
J. Ousterhout, P. Agrawal, D. Erickson, C. Kozyrakis, J. Leverich, D. Mazi‘eres, S. Mitra, A. Narayanan, D. Ongaro, G. Parulkar, M. Rosenblum, S. M. Rumble, E. Stratmann, and R. Stutsman, The case for ramcloud, Commun. of ACM, vol. 54, no. 7, pp. 121-130, 2011.
[13]
M. Isard, M. Budiu, Y. Yu, A. Birrell, and D. Fetterly, Dryad: Distributed data-parallel programs from sequential building blocks, SIGOPS Oper. Syst. Rev., vol. 41, no. 3, pp. 59-72, 2007.
[14]
L. Popa, M. Budiu, Y. Yu, and M. Isard, Dryadinc: Reusing work in large-scale computations, in Proc. of HotCloud’09, Berkeley, CA, USA, 2009.
[15]
C. Olston, G. Chiou, L. Chitnis, F. Liu, Y. Han, M. Larsson, A. Neumann, V. B. N. Rao, V. Sankarasubramanian, S. Seth, C. Tian, T. ZiCornell, and X. Wang, Nova: Continuous pig/hadoop workflows, in Proc. of SIGMOD’2011, New York, NY, USA, 2011.
[16]
C. Olston, B. Reed, U. Srivastava, R. Kumar, and A. Tomkins, Pig latin: A not-so-foreign language for data processing, in Proc. of SIGMOD’2008, New York, NY, USA, 2008.
DOI
[17]
U. A. Acar, Self-adjusting computation: An Overview, in Proc. of PEPM’09, New York, NY, USA, 2009.
DOI
[18]
T. Karagiannis, C. Gkantsidis, D. Narayanan, and A. Rowstron, Hermes: Clustering users in large-scale e-mail services, in Proc. of SoCC ’10, New York, NY, USA, 2010.
DOI
[19]
Memcached—A distributed memory object caching system, http://memcached.org/, 2013.
[20]
P. Scheuermann, G. Weikum, and P. Zabback, Data partitioning and load balancing in parallel disk systems, The VLDB Journal, vol. 7, no. 1, pp. 48-66, 1998.
[21]
M. Zaharia, A. Konwinski, A. D. Joseph, R. Katz, and I. Stoica, Improving mapreduce performance in heterogeneous environments, in Proc. of OSDI’2008, Berkeley, CA, USA, 2008.
[22]
H. Herodotou, F. Dong, and S. Babu, No one (cluster) size fits all: Automatic cluster sizing for data-intensive analytics, in Proc. of SOCC’2011, New York, NY, USA, 2011.
DOI
[23]
S. Wu, F. Li, S. Mehrotra, and B. C. Ooi, Query optimization for massively parallel data processing, in Proc. of SOCC’2011, New York, NY, USA, 2011.
DOI
[24]
D. Logothetis, C. Olston, B. Reed, K. C. Webb, and K. Yocum, Stateful bulk processing for incremental analytics, in Proc. of SOCC’2011, New York, NY, USA, 2010.
DOI
[25]
B. He, M. Yang, Z. Guo, R. Chen, B. Su, W. Lin, and L. Zhou, Comet: Batched stream processing for data intensive distributed computing, in Proc. of SOCC’2011, New York, NY, USA, 2010.
DOI
[26]
D. Battr’e, S. Ewen, F. Hueske, O. Kao, V. Markl, and D. Warneke, Nephele/pacts: A programming model and execution framework for web-scale analytical processing, in Proc. of SOCC’2010, New York, NY, USA, 2010.
DOI
[27]
P. Xiong, Y. Chi, S. Zhu, J. Tatemura, C. Pu, and H. HacigümüŞ, Activesla: A profit-oriented admission control framework for database-as-a-service providers, in Proc. of SOCC’2011, New York, NY, USA, 2011.
DOI
[28]
H. Gonzalez, A. Halevy, C. S. Jensen, A. Langen, J. Madhavan, R. Shapley, and W. Shen, Google fusion tables: Data management, integration and collaboration in the cloud, in Proc. of SOCC’2010, New York, NY, USA, 2010.
DOI
Publication history
Copyright
Acknowledgements
Rights and permissions

Publication history

Received: 13 December 2013
Accepted: 20 December 2013
Published: 07 February 2014
Issue date: February 2014

Copyright

© The author(s) 2014

Acknowledgements

This research was supported in part by the Natural Science Foundation of USA (Nos. ECCS 1128209, CNS 1138963, CNS 1065444, and CCF 1028167).

Rights and permissions

Return