Volume 21, Issue 5




Efficient Location-Aware Data Placement for Data-Intensive Applications in Geo-distributed Scientific Data Centers

Jinghui Zhang, Jian Chen, Junzhou Luo, Aibo Song
School of Computer Science and Engineering, Southeast University, Nanjing 211189, China.

Abstract

Recent developments in cloud computing and big data have spurred the emergence of data-intensive applications in which massive scientific datasets are stored in globally distributed scientific data centers and accessed frequently by scientists worldwide. A single data processing task may request multiple associated data items that are distributed across different scientific data centers, and data placement decisions must respect the storage capacity limits of those data centers. Optimizing the data access cost when placing data items in globally distributed scientific data centers has therefore become an increasingly important goal. Existing data placement approaches for geo-distributed data items are insufficient because they either cannot account for the cost incurred by associated data access or overlook storage capacity limitations, which are a very practical constraint of scientific data centers. In this paper, inspired by applications in the field of high energy physics, we propose an integer-programming-based data placement model that formulates the above challenges as a Non-deterministic Polynomial-time (NP)-hard problem. In addition, we use a Lagrangian-relaxation-based heuristic algorithm to obtain ideal data placement solutions. Our simulation results demonstrate that our algorithm is effective and significantly reduces the overall data access cost.

Keywords: data center, Lagrangian relaxation, data placement, geo-distributed
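
To make the idea of a Lagrangian-relaxation-based heuristic for capacity-constrained placement more concrete, the following is a minimal illustrative sketch and not the authors' model or algorithm. It assumes a simplified problem in which each data item i has a size and a fixed access cost cost[i][j] for being placed in data center j, and it ignores the associated-data access cost that the paper targets. The function names (lagrangian_placement, repair), the diminishing step size, and the greedy repair step are all hypothetical choices made for illustration.

```python
from typing import List, Optional, Tuple


def repair(assign: List[int], cost, size, cap) -> Optional[List[int]]:
    """Greedily turn a relaxed assignment into one that respects capacities."""
    free = list(cap)
    placed = [-1] * len(size)
    # Place large items first; they are the hardest to fit.
    for i in sorted(range(len(size)), key=lambda i: -size[i]):
        # Prefer the relaxed choice, then fall back to cheaper centers first.
        others = sorted((j for j in range(len(cap)) if j != assign[i]),
                        key=lambda j: cost[i][j])
        for j in [assign[i]] + others:
            if size[i] <= free[j]:
                placed[i] = j
                free[j] -= size[i]
                break
        else:
            return None  # no data center has room left for item i
    return placed


def lagrangian_placement(cost, size, cap, iters: int = 100,
                         step0: float = 1.0) -> Tuple[Optional[List[int]], float, float]:
    """Assumed toy model: minimize sum of cost[i][j] over chosen placements,
    with each item placed exactly once and per-center capacity limits.
    The capacity constraints are relaxed with multipliers lam[j] >= 0."""
    n, m = len(size), len(cap)
    lam = [0.0] * m
    best_cost, best_assign = float("inf"), None
    best_bound = float("-inf")
    for k in range(iters):
        # Relaxed subproblem decomposes per item: pick the center with the
        # smallest penalized cost cost[i][j] + lam[j] * size[i].
        assign = [min(range(m), key=lambda j: cost[i][j] + lam[j] * size[i])
                  for i in range(n)]
        load = [0.0] * m
        for i, j in enumerate(assign):
            load[j] += size[i]
        # Value of the Lagrangian: a lower bound on the optimal cost.
        bound = (sum(cost[i][assign[i]] + lam[assign[i]] * size[i] for i in range(n))
                 - sum(lam[j] * cap[j] for j in range(m)))
        best_bound = max(best_bound, bound)
        # Heuristic step: repair the relaxed solution into a feasible one.
        feasible = repair(assign, cost, size, cap)
        if feasible is not None:
            total = sum(cost[i][feasible[i]] for i in range(n))
            if total < best_cost:
                best_cost, best_assign = total, feasible
        # Projected subgradient update with a diminishing step size.
        step = step0 / (k + 1)
        lam = [max(0.0, lam[j] + step * (load[j] - cap[j])) for j in range(m)]
    return best_assign, best_cost, best_bound


# Toy instance: three items of size 2, two data centers of capacity 4.
cost = [[1, 5], [2, 6], [3, 4]]
assign, total, bound = lagrangian_placement(cost, [2, 2, 2], [4, 4])
print(assign, total, bound)
```

In this toy instance every item prefers the first data center, which cannot hold all three, so the repair step moves the item whose relocation is cheapest. The returned bound is the best Lagrangian lower bound found, which gives a rough indication of how far the heuristic solution can be from the optimum of this simplified model.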


Publication history

Received: 26 July 2016
Revised: 04 August 2016
Accepted: 22 August 2016
Published: 18 October 2016
Issue date: October 2016

Copyright

© The author(s) 2016

Acknowledgements

This work was supported by the National Natural Science Foundation of China (Nos. 61320106007, 61572129, 61502097, and 61370207), the National High-Tech Research and Development (863) Program of China (No. 2013AA013503), International S&T Cooperation Program of China (No. 2015DFA10490), Jiangsu research prospective joint research project (No. BY2013073-01), Jiangsu Provincial Key Laboratory of Network and Information Security (No. BM2003201), Key Laboratory of Computer Network and Information Integration of Ministry of Education of China (No. 93K-9), and partially supported by Collaborative Innovation Center of Novel Software Technology and Industrialization and Collaborative Innovation Center of Wireless Communications Technology.
