An Improved Algorithm for Optimizing MapReduce Based on Locality and Overlapping

Jianjiang Li; Jie Wang; Bin Lyu; Jie Wu; Xiaolei Yang

doi:10.26599/TST.2018.9010115

Tsinghua Science and Technology 2018, 23(6): 744-753 https://doi.org/10.26599/TST.2018.9010115

Open Access | Issue | Published: 15 October 2018

An Improved Algorithm for Optimizing MapReduce Based on Locality and Overlapping

Show Author's Information Hide Author's Information Jianjiang Li, Jie Wang, Bin Lyu(

), Jie Wu, Xiaolei Yang

Department of Computer Science and Technology, University of Science and Technology Beijing, Beijing 100083, China.

University of Southern California, Los Angeles, CA 90089, USA.

Department of Computer and Information Sciences, Temple University, Philadelphia, PA 19122, USA.

Keywords:

MapReduce, data locality, overlapping, load balance

Cite this article:

Li J, Wang J, Lyu B, et al. An Improved Algorithm for Optimizing MapReduce Based on Locality and Overlapping. Tsinghua Science and Technology, 2018, 23(6): 744-753. https://doi.org/10.26599/TST.2018.9010115

Download citation

EndNote(RIS)

BibTeX

532

Views

Downloads

Citations

Crossref

N/A

WoS

Scopus

CSCD

Abstract Full text About this article

Abstract

MapReduce is currently the most popular programming model for big data processing, and Hadoop is a well-known MapReduce implementation platform. However, Hadoop jobs suffer from imbalanced workloads during the reduce phase and inefficiently utilize the available computing and network resources. In some cases, these problems lead to serious performance degradation in MapReduce jobs. To resolve these problems, in this paper, we propose two algorithms, the Locality-Based Balanced Schedule (LBBS) and Overlapping-Based Resource Utilization (OBRU), that optimize the Locality-Enhanced Load Balance (LELB) and the Map, Local reduce, Shuffle, and final Reduce (MLSR) phases. The LBBS collects partition information from input data during the map phase and generates balanced schedule plans for the reduce phase. OBRU is responsible for using computing and network resources efficiently by overlapping the local reduce, shuffle, and final reduce phases. Experimental results show that the LBBS and OBRU algorithms yield significant improvements in load balancing. When LBBS and OBRU are applied, job performance increases by 15% from that of models using LELB and MLSR.

Full text

Abstract

Full text

Outline

About this article

An Improved Algorithm for Optimizing MapReduce Based on Locality and Overlapping

Show Author's information Hide Author's Information Jianjiang Li, Jie Wang, Bin Lyu(

), Jie Wu, Xiaolei Yang

Department of Computer Science and Technology, University of Science and Technology Beijing, Beijing 100083, China.

University of Southern California, Los Angeles, CA 90089, USA.

Department of Computer and Information Sciences, Temple University, Philadelphia, PA 19122, USA.

Abstract

Keywords: MapReduce, data locality, overlapping, load balance

References(22)

[1]

Hadoop H. A., http://hadoop.apache.org, 2009.

[2]

Dean J. and Ghemawat S., MapReduce: Simplified data processing on large clusters, Communications of the ACM, vol. 51, no. 1, pp. 107–113, 2008.

DOI Google Scholar

[3]

White T., Hadoop: The Definitive Guide. Sebastopol, CA, USA: O’Reilly Media, Inc, 2012.

[4]

Li J. J., Wu J., Yang X. L., and Zhong S. Q., Optimizing mapreduce based on locality of k-v pairs and overlap between shuffle and local reduce, presented at the 44th Int. Conf. on Parallel Processing, Beijing, China, 2015.

[5]

Wang L. Z.,Tao J., Ranjan R., Marten H., Streit A., Chen J. Y., and Chen D., G-Hadoop: MapReduce across distributed data centers for data-intensive computing, Future Generation Computer Systems, vol. 29, no. 3, pp. 739–750, 2013.

DOI Google Scholar

[6]

Guo Y. F., Rao J., Cheng D. Z., and Zhou X. B., Ishuffle: Improving hadoop performance with shuffle-on-write, IEEE Transactions on Parallel and Distributed Systems, vol. 28, no. 6, pp. 1649–1662, 2017.

DOI Google Scholar

[7]

Kwon Y., Balazinska M., Howe B., and Rolia J., A study of skew in MapReduce applications, in 5th Open Cirrus Summit, 2011.

[8]

Kwon Y., Balazinska M., Howe B., and Rolia J., Skew-resistant parallel processing of feature-extracting scientific user-defined functions, in Proc. 1st Int. ACM Symp. on Cloud Computing, 2010, pp. 75–86.

DOI

[9]

Kwon Y., Balazinska M., Howe B., and Rolia J., SkewTune in action: Mitigating skew in MapReduce applications, Proceedings of the VLDB Endowment Hompage Archive, vol. 5, no. 12, pp. 1934–1937, 2012.

DOI Google Scholar

[10]

Ahmad F., Lee S., Thottethodi M., and Vijaykumar T. N., MapReduce with Communication Overlap (MaRCO), Journal of Parallel and Distributed Computing, vol. 73, no. 5, pp. 608–620, 2013.

DOI Google Scholar

[11]

Tian C., Zhou H. J., He Y. Q., and Zha L., A dynamic mapreduce scheduler for heterogeneous workloads, presented at the 2009 Eighth Int. Conf. on Grid and Cooperative Computing, Lanzhou, China, 2009.

DOI

[12]

Zaharia M., Konwinski A., Joseph A. D., Katz R., and Stoica I., Improving MapReduce performance in heterogeneous environments, in Proc. 8th USENIX Conf. Operating Systems Design and Implementation, San Diego, CA, USA, 2008, pp. 29–42.

[13]

Tao D., Lin Z. W., and Wang B. X., Load feedback-based resource scheduling and dynamic migration-based data locality for virtual hadoop clusters in openstack-based clouds, Tsinghua Science and Technology, vol. 22, no. 2, pp. 149–159, 2017.

DOI Google Scholar

[14]

Wolf J.,Rajan D.,Hildrum K.,Khandekar R., Kumar V., Parekh S., Wu K. L., and Balmin A., Flex: A slot allocation scheduling optimizer for mapreduce workloads, in Proc. ACM/IFIP/USENIX 11th Int. Conf. Middleware, Bangalore, India, 2010, pp. 1–20.

DOI

[15]

Guo Z. H., Fox G., and Zhou M., Investigation of data locality in mapreduce, in Proc. 12th IEEE/ACM Int. Symp. on Cluster, Cloud and Grid Computing, Ottawa, ON, Canada, 2012, pp. 419–426.

DOI

[16]

Zhao Y. R.,Wang W. P.,Meng D., Yang X. F., Zhang S. B., Li J., and Guan G., A data locality optimization algorithm for large-scale data processing in Hadoop, presented at the 2012 IEEE Symp. on Computers and Communications (ISCC), Cappadocia, Turkey, 2012.

DOI

[17]

Shafer J., Rixner S., and Cox A. L., The hadoop distributed filesystem: Balancing portability and performance, presented at the 2010 IEEE Int. Symp. on Performance Analysis of Systems & Software, White Plains, NY, USA, 2010.

DOI

[18]

Mohandas N. and Thampi S. M., Improving Hadoop performance in handling small files, in Advances in Computing and Communications, Abraham A.,Mauri J. L.,Buford J. F.,Suzuki J.,.Thampi S. M,eds. Springer, Berlin, Heidelberg, 2011.

[19]

Liu J., Li B., and Song M. N., THE optimization of HDFS based on small files, presented at the 2010 3rd IEEE Int. Conf. on Broadband Network and Multimedia Technology (IC-BNMT), Beijing, China, 2010.

[20]

Song G.,Meng Z. D., Huet F., Magoules F., Yu L., and Lin X. L., A Hadoop Mapreduce performance prediction method, presented at the 2013 IEEE 10th Int. Conf. on High Performance Computing and Communications & 2013 IEEE Int. Conf. on Embedded and Ubiquitous Computing, Zhangjiajie, China, 2013.

DOI

[21]

Zhu H. and Chen H. P., Adaptive failure detection via heartbeat under Hadoop, presented at the 2011 IEEE Asia-Pacific Services Computing Conf., Jeju Island, R. Korea, 2012.

DOI

[22]

Wang Y. D., Que X. Y., Yu W. K., Goldenberg D., and Sehgal D., Hadoop acceleration through network levitated merge, in Proc. 2011 Int. Conf. High PERFORMANCE Computing, Networking, Storage and Analysis, New York, NY, USA, 2011.

DOI

About this article

Publication history

Acknowledgements

Rights and permissions

Publication history

Received: 23 July 2018

Accepted: 10 August 2018

Published: 15 October 2018

Issue date: December 2018

Copyright

Acknowledgements

This work was supported by the National Key R&D Program of China (Nos. 2017YFB0202104 and 2017YFB0202003).