Frequency and Similarity-Aware Partitioning for Cloud Storage Based on Space-Time Utility Maximization Model

Jianjiang Li, Jie Wu, and Zhanning Ma
Department of Computer Science and Technology, University of Science and Technology Beijing, Beijing 100083, China.
Department of Computer and Information Sciences, Temple University, Philadelphia, PA 19122, USA.

Abstract

With the rise of various cloud services, the problem of redundant data has become increasingly prominent in cloud storage systems. How to assign a set of documents to a distributed file system so as to reduce storage space while preserving access efficiency as much as possible is an urgent problem. Space efficiency mainly relies on data de-duplication technologies, while access efficiency requires that files with high similarity be gathered on the same server. Building on the study of existing data de-duplication technologies, especially the Similarity-Aware Partitioning (SAP) algorithm, this paper proposes the Frequency and Similarity-Aware Partitioning (FSAP) algorithm for cloud storage, which partitions data more reasonably than the SAP algorithm. This paper also proposes the Space-Time Utility Maximization Model (STUMM), which helps balance space efficiency against access efficiency. Finally, this paper tests the algorithms on 100 web files downloaded from CNN; the results show that, compared with the SAP-derived algorithms (the SAP-Space-Delta algorithm and the SAP-Space-Dedup algorithm), the FSAP algorithm based on STUMM achieves a higher compression ratio and a more balanced distribution of data blocks.
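To make the space/access trade-off concrete, the following is a minimal, hypothetical Python sketch of similarity-aware placement in the spirit of SAP/FSAP: files are split into chunks, chunks are fingerprinted, and each file is greedily placed on the server where its chunks deduplicate best, with a load penalty standing in for the balance that STUMM formalizes. The fixed-size chunking, the scoring function, and the weight alpha are illustrative assumptions, not the paper's actual FSAP algorithm or STUMM definition.

```python
# Hypothetical sketch of similarity-aware file placement. Not the paper's
# actual FSAP algorithm or STUMM utility: the chunking scheme, the scoring
# function, and the weight alpha are illustrative assumptions.
import hashlib

CHUNK_SIZE = 4096  # fixed-size chunking for brevity; real deduplication
                   # systems often use content-defined chunking instead

def fingerprints(data: bytes) -> set:
    """Split data into fixed-size chunks and return their SHA-1 digests."""
    return {
        hashlib.sha1(data[i:i + CHUNK_SIZE]).hexdigest()
        for i in range(0, len(data), CHUNK_SIZE)
    }

def place_files(files, num_servers, alpha=0.01):
    """Greedily place each file on the server where it deduplicates best.

    The score rewards chunk overlap (space saved by de-duplication) and
    penalizes server load (favoring a balanced block distribution), a
    crude stand-in for balancing space- and access-efficiency.
    """
    server_chunks = [set() for _ in range(num_servers)]
    placement = {}
    for name, data in sorted(files.items()):
        chunks = fingerprints(data)
        best = max(
            range(num_servers),
            key=lambda s: len(chunks & server_chunks[s])    # dedup gain
                          - alpha * len(server_chunks[s]),  # load penalty
        )
        server_chunks[best] |= chunks
        placement[name] = best
    return placement

if __name__ == "__main__":
    files = {
        "a.html": b"shared header " * 1000,   # similar to b.html
        "b.html": b"shared header " * 800,
        "c.html": b"unrelated body " * 900,
    }
    print(place_files(files, num_servers=2))
    # e.g., {'a.html': 0, 'b.html': 0, 'c.html': 1}
```

Under these assumptions, the two similar files land on the same server (so their shared chunks are stored once), while the dissimilar file goes to the less-loaded server, illustrating how a single placement score can trade de-duplication gain against block-distribution balance.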

Keywords: cloud storage, de-duplication, redundancy, frequency


Publication history

Received: 16 March 2015
Accepted: 4 May 2015
Published: 19 June 2015
Issue date: June 2015

Copyright

© The authors 2015

Acknowledgements

This work was supported by the National High-Tech Research and Development (863) Program of China (No. 2015AA01A303).
