Frequency and Similarity-Aware Partitioning for Cloud Storage Based on Space-Time Utility Maximization Model

Jianjiang Li, Jie Wu, and Zhanning Ma
Department of Computer Science and Technology, University of Science and Technology Beijing, Beijing 100083, China.
Department of Computer and Information Sciences, Temple University, Philadelphia, PA 19122, USA.

Abstract

With the rise of various cloud services, the problem of redundant data has become increasingly prominent in cloud storage systems. How to assign a set of documents to a distributed file system so as to reduce storage space while preserving access efficiency as much as possible is an urgent problem. Space efficiency mainly relies on data de-duplication technologies, while access efficiency requires that files with high similarity be gathered on the same server. Building on the study of existing data de-duplication technologies, especially the Similarity-Aware Partitioning (SAP) algorithm, this paper proposes the Frequency and Similarity-Aware Partitioning (FSAP) algorithm for cloud storage, which partitions data more reasonably than the SAP algorithm. This paper also proposes the Space-Time Utility Maximization Model (STUMM), which helps balance space efficiency against access efficiency. Finally, this paper tests the algorithms on 100 web files downloaded from CNN; the results show that, compared with the SAP-derived algorithms (the SAP-Space-Delta algorithm and the SAP-Space-Dedup algorithm), the FSAP algorithm based on STUMM achieves a higher compression ratio and a more balanced distribution of data blocks.
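To make the space/access trade-off concrete, the following is a minimal, hypothetical Python sketch of similarity-aware placement in the spirit of SAP/FSAP: files are split into chunks, chunks are fingerprinted, and each file is greedily placed on the server where its chunks deduplicate best, with a load penalty standing in for the balance that STUMM formalizes. The fixed-size chunking, the scoring function, and the weight alpha are illustrative assumptions, not the paper's actual FSAP algorithm or STUMM definition.

```python
# Hypothetical sketch of similarity-aware file placement. Not the paper's
# actual FSAP algorithm or STUMM utility: the chunking scheme, the scoring
# function, and the weight alpha are illustrative assumptions.
import hashlib

CHUNK_SIZE = 4096  # fixed-size chunking for brevity; real deduplication
                   # systems often use content-defined chunking instead

def fingerprints(data: bytes) -> set:
    """Split data into fixed-size chunks and return their SHA-1 digests."""
    return {
        hashlib.sha1(data[i:i + CHUNK_SIZE]).hexdigest()
        for i in range(0, len(data), CHUNK_SIZE)
    }

def place_files(files, num_servers, alpha=0.01):
    """Greedily place each file on the server where it deduplicates best.

    The score rewards chunk overlap (space saved by de-duplication) and
    penalizes server load (favoring a balanced block distribution), a
    crude stand-in for balancing space- and access-efficiency.
    """
    server_chunks = [set() for _ in range(num_servers)]
    placement = {}
    for name, data in sorted(files.items()):
        chunks = fingerprints(data)
        best = max(
            range(num_servers),
            key=lambda s: len(chunks & server_chunks[s])    # dedup gain
                          - alpha * len(server_chunks[s]),  # load penalty
        )
        server_chunks[best] |= chunks
        placement[name] = best
    return placement

if __name__ == "__main__":
    files = {
        "a.html": b"shared header " * 1000,   # similar to b.html
        "b.html": b"shared header " * 800,
        "c.html": b"unrelated body " * 900,
    }
    print(place_files(files, num_servers=2))
    # e.g., {'a.html': 0, 'b.html': 0, 'c.html': 1}
```

Under these assumptions, the two similar files land on the same server (so their shared chunks are stored once), while the dissimilar file goes to the less-loaded server, illustrating how a single placement score can trade de-duplication gain against block-distribution balance.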

Keywords: cloud storage, de-duplication, redundancy, frequency


Publication history

Received: 16 March 2015
Accepted: 4 May 2015
Published: 19 June 2015
Issue date: June 2015

Copyright

© The authors 2015

Acknowledgements

This work was supported by the National High-Tech Research and Development (863) Program of China (No. 2015AA01A303).
