Balance Resource Allocation for Spark Jobs Based on Prediction of the Optimal Resource

Zhiyao Hu, Dongsheng Li, and Deke Guo
College of Computer, National University of Defense Technology, Changsha 410073, China.
College of System Engineering, National University of Defense Technology, Changsha 410073, China.

Abstract

Apache Spark provides a well-known MapReduce computing framework designed to process big data analytics quickly in a data-parallel manner. On this platform, large input data are divided into data partitions, which are processed concurrently by multiple computation tasks, and the outputs of these tasks are transferred among multiple computers via the network. However, such a distributed computing framework suffers from system overheads, inevitably caused by communication and disk I/O operations, and these overheads take up a large proportion of the Job Completion Time (JCT). We observed that allocating excessive computational resources incurs considerable system overheads and prolongs the JCT. Over-allocating resources to individual jobs not only prolongs their own JCTs but also likely leaves other jobs under-allocated, so the average JCT is suboptimal as well. To address this problem, we propose a prediction model that estimates how the JCT of a single Spark job changes with the allocated resources. Building on this prediction model, we design a heuristic algorithm that balances the resource allocation of multiple Spark jobs to minimize the average JCT in multiple-job cases. We implemented the prediction model and the resource allocation method in ReB, a Resource-Balancer based on Apache Spark. Experimental results showed that ReB significantly outperformed the traditional max-min fairness and shortest-job-optimal methods, decreasing the average JCT by around 10%-30% compared with the existing solutions.

Keywords: Spark jobs, resource over-allocation, performance prediction
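
The abstract describes the approach only at a high level. The following minimal Python sketch illustrates the underlying idea under stated assumptions: a toy JCT model in which computation time shrinks as more cores are allocated while communication and disk I/O overheads grow, and a greedy heuristic that hands out cores to whichever job's predicted JCT benefits most. The function names, the parametric form of predicted_jct, and the example job parameters are hypothetical and are not taken from the paper or from the ReB implementation.

# Minimal, hypothetical sketch of prediction-guided resource balancing.
# All names and the JCT model below are illustrative assumptions.

def predicted_jct(job, cores):
    """Toy JCT model: computation shrinks with more cores, while
    communication/disk I/O overhead grows, so over-allocation can hurt."""
    compute = job["work"] / cores                 # data-parallel computation time
    overhead = job["overhead_per_core"] * cores   # shuffle / disk I/O overhead
    return compute + overhead + job["fixed_cost"]

def balance_allocation(jobs, total_cores):
    """Greedy heuristic: start every job at one core, then repeatedly give
    the next core to the job whose predicted JCT drops the most."""
    alloc = {j["name"]: 1 for j in jobs}
    for _ in range(total_cores - len(jobs)):
        best_job, best_gain = None, 0.0
        for j in jobs:
            c = alloc[j["name"]]
            gain = predicted_jct(j, c) - predicted_jct(j, c + 1)
            if gain > best_gain:      # skip jobs that an extra core would not help
                best_job, best_gain = j, gain
        if best_job is None:          # every job is at its predicted optimum
            break
        alloc[best_job["name"]] += 1
    return alloc

if __name__ == "__main__":
    jobs = [
        {"name": "wordcount", "work": 1200.0, "overhead_per_core": 2.0, "fixed_cost": 10.0},
        {"name": "pagerank",  "work": 4800.0, "overhead_per_core": 1.5, "fixed_cost": 15.0},
    ]
    print(balance_allocation(jobs, total_cores=32))

In this sketch, allocation stops growing for a job once an additional core no longer reduces its predicted JCT, which mirrors the over-allocation effect the abstract describes; the paper's actual prediction model and heuristic may differ.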


Publication history

Received: 30 March 2019
Revised: 24 July 2019
Accepted: 09 September 2019
Published: 13 January 2020
Issue date: August 2020

Copyright

© The author(s) 2020

Acknowledgements

This work was supported in part by the National Key R&D Program of China (No. 2018YFB2101100), the National Natural Science Foundation of China (Nos. 61932001 and 61872376), and the Hunan Provincial Innovation Foundation for Postgraduate.

Rights and permissions

The articles published in this open access journal are distributed under the terms of the Creative Commons Attribution 4.0 International License (http://creativecommons.org/licenses/by/4.0/).
