Journal Home > Volume 23 , Issue 6

MapReduce is currently the most popular programming model for big data processing, and Hadoop is a well-known MapReduce implementation platform. However, Hadoop jobs suffer from imbalanced workloads during the reduce phase and inefficiently utilize the available computing and network resources. In some cases, these problems lead to serious performance degradation in MapReduce jobs. To resolve these problems, in this paper, we propose two algorithms, the Locality-Based Balanced Schedule (LBBS) and Overlapping-Based Resource Utilization (OBRU), that optimize the Locality-Enhanced Load Balance (LELB) and the Map, Local reduce, Shuffle, and final Reduce (MLSR) phases. The LBBS collects partition information from input data during the map phase and generates balanced schedule plans for the reduce phase. OBRU is responsible for using computing and network resources efficiently by overlapping the local reduce, shuffle, and final reduce phases. Experimental results show that the LBBS and OBRU algorithms yield significant improvements in load balancing. When LBBS and OBRU are applied, job performance increases by 15% from that of models using LELB and MLSR.


menu
Abstract
Full text
Outline
About this article

An Improved Algorithm for Optimizing MapReduce Based on Locality and Overlapping

Show Author's information Jianjiang LiJie WangBin Lyu( )Jie WuXiaolei Yang
Department of Computer Science and Technology, University of Science and Technology Beijing, Beijing 100083, China.
University of Southern California, Los Angeles, CA 90089, USA.
Department of Computer and Information Sciences, Temple University, Philadelphia, PA 19122, USA.

Abstract

MapReduce is currently the most popular programming model for big data processing, and Hadoop is a well-known MapReduce implementation platform. However, Hadoop jobs suffer from imbalanced workloads during the reduce phase and inefficiently utilize the available computing and network resources. In some cases, these problems lead to serious performance degradation in MapReduce jobs. To resolve these problems, in this paper, we propose two algorithms, the Locality-Based Balanced Schedule (LBBS) and Overlapping-Based Resource Utilization (OBRU), that optimize the Locality-Enhanced Load Balance (LELB) and the Map, Local reduce, Shuffle, and final Reduce (MLSR) phases. The LBBS collects partition information from input data during the map phase and generates balanced schedule plans for the reduce phase. OBRU is responsible for using computing and network resources efficiently by overlapping the local reduce, shuffle, and final reduce phases. Experimental results show that the LBBS and OBRU algorithms yield significant improvements in load balancing. When LBBS and OBRU are applied, job performance increases by 15% from that of models using LELB and MLSR.

Keywords: MapReduce, data locality, overlapping, load balance

References(22)

[1]
Hadoop H. A., http://hadoop.apache.org, 2009.
[2]
Dean J. and Ghemawat S., MapReduce: Simplified data processing on large clusters, Communications of the ACM, vol. 51, no. 1, pp. 107–113, 2008.
[3]
White T., Hadoop: The Definitive Guide. Sebastopol, CA, USA: O’Reilly Media, Inc, 2012.
[4]
Li J. J., Wu J., Yang X. L., and Zhong S. Q., Optimizing mapreduce based on locality of k-v pairs and overlap between shuffle and local reduce, presented at the 44th Int. Conf. on Parallel Processing, Beijing, China, 2015.
[5]
Wang L. Z.,Tao J., Ranjan R., Marten H., Streit A., Chen J. Y., and Chen D., G-Hadoop: MapReduce across distributed data centers for data-intensive computing, Future Generation Computer Systems, vol. 29, no. 3, pp. 739–750, 2013.
[6]
Guo Y. F., Rao J., Cheng D. Z., and Zhou X. B., Ishuffle: Improving hadoop performance with shuffle-on-write, IEEE Transactions on Parallel and Distributed Systems, vol. 28, no. 6, pp. 1649–1662, 2017.
[7]
Kwon Y., Balazinska M., Howe B., and Rolia J., A study of skew in MapReduce applications, in 5th Open Cirrus Summit, 2011.
[8]
Kwon Y., Balazinska M., Howe B., and Rolia J., Skew-resistant parallel processing of feature-extracting scientific user-defined functions, in Proc. 1st Int. ACM Symp. on Cloud Computing, 2010, pp. 75–86.
DOI
[9]
Kwon Y., Balazinska M., Howe B., and Rolia J., SkewTune in action: Mitigating skew in MapReduce applications, Proceedings of the VLDB Endowment Hompage Archive, vol. 5, no. 12, pp. 1934–1937, 2012.
[10]
Ahmad F., Lee S., Thottethodi M., and Vijaykumar T. N., MapReduce with Communication Overlap (MaRCO), Journal of Parallel and Distributed Computing, vol. 73, no. 5, pp. 608–620, 2013.
[11]
Tian C., Zhou H. J., He Y. Q., and Zha L., A dynamic mapreduce scheduler for heterogeneous workloads, presented at the 2009 Eighth Int. Conf. on Grid and Cooperative Computing, Lanzhou, China, 2009.
DOI
[12]
Zaharia M., Konwinski A., Joseph A. D., Katz R., and Stoica I., Improving MapReduce performance in heterogeneous environments, in Proc. 8th USENIX Conf. Operating Systems Design and Implementation, San Diego, CA, USA, 2008, pp. 29–42.
[13]
Tao D., Lin Z. W., and Wang B. X., Load feedback-based resource scheduling and dynamic migration-based data locality for virtual hadoop clusters in openstack-based clouds, Tsinghua Science and Technology, vol. 22, no. 2, pp. 149–159, 2017.
[14]
Wolf J.,Rajan D.,Hildrum K.,Khandekar R., Kumar V., Parekh S., Wu K. L., and Balmin A., Flex: A slot allocation scheduling optimizer for mapreduce workloads, in Proc. ACM/IFIP/USENIX 11th Int. Conf. Middleware, Bangalore, India, 2010, pp. 1–20.
DOI
[15]
Guo Z. H., Fox G., and Zhou M., Investigation of data locality in mapreduce, in Proc. 12th IEEE/ACM Int. Symp. on Cluster, Cloud and Grid Computing, Ottawa, ON, Canada, 2012, pp. 419–426.
DOI
[16]
Zhao Y. R.,Wang W. P.,Meng D., Yang X. F., Zhang S. B., Li J., and Guan G., A data locality optimization algorithm for large-scale data processing in Hadoop, presented at the 2012 IEEE Symp. on Computers and Communications (ISCC), Cappadocia, Turkey, 2012.
DOI
[17]
Shafer J., Rixner S., and Cox A. L., The hadoop distributed filesystem: Balancing portability and performance, presented at the 2010 IEEE Int. Symp. on Performance Analysis of Systems & Software, White Plains, NY, USA, 2010.
DOI
[18]
Mohandas N. and Thampi S. M., Improving Hadoop performance in handling small files, in Advances in Computing and Communications, Abraham A.,Mauri J. L.,Buford J. F.,Suzuki J.,.Thampi S. M,eds. Springer, Berlin, Heidelberg, 2011.
[19]
Liu J., Li B., and Song M. N., THE optimization of HDFS based on small files, presented at the 2010 3rd IEEE Int. Conf. on Broadband Network and Multimedia Technology (IC-BNMT), Beijing, China, 2010.
[20]
Song G.,Meng Z. D., Huet F., Magoules F., Yu L., and Lin X. L., A Hadoop Mapreduce performance prediction method, presented at the 2013 IEEE 10th Int. Conf. on High Performance Computing and Communications & 2013 IEEE Int. Conf. on Embedded and Ubiquitous Computing, Zhangjiajie, China, 2013.
DOI
[21]
Zhu H. and Chen H. P., Adaptive failure detection via heartbeat under Hadoop, presented at the 2011 IEEE Asia-Pacific Services Computing Conf., Jeju Island, R. Korea, 2012.
DOI
[22]
Wang Y. D., Que X. Y., Yu W. K., Goldenberg D., and Sehgal D., Hadoop acceleration through network levitated merge, in Proc. 2011 Int. Conf. High PERFORMANCE Computing, Networking, Storage and Analysis, New York, NY, USA, 2011.
DOI
Publication history
Copyright
Acknowledgements
Rights and permissions

Publication history

Received: 23 July 2018
Accepted: 10 August 2018
Published: 15 October 2018
Issue date: December 2018

Copyright

© The authors 2018

Acknowledgements

This work was supported by the National Key R&D Program of China (Nos. 2017YFB0202104 and 2017YFB0202003).

Rights and permissions

Return