I-sieve: An Inline High Performance Deduplication System Used in Cloud Storage

Jibin Wang; Zhigang Zhao; Zhaogang Xu; Hu Zhang; Liang Li; Ying Guo

doi:10.1109/TST.2015.7040510

Tsinghua Science and Technology 2015, 20(1): 17-27 https://doi.org/10.1109/TST.2015.7040510

Open Access | Issue | Published: 12 February 2015

Shandong Computer Science Center (National Supercomputer Center in Jinan), Shandong Provincial Key Laboratory of Computer Networks, Jinan 250101, China.

Keywords:

cloud storage, I-sieve, data deduplication

Cite this article:

Wang J, Zhao Z, Xu Z, et al. I-sieve: An Inline High Performance Deduplication System Used in Cloud Storage. Tsinghua Science and Technology, 2015, 20(1): 17-27. https://doi.org/10.1109/TST.2015.7040510

Abstract

Data deduplication is an emerging and widely employed method for current storage systems. As this technology is gradually applied in inline scenarios such as virtual machines and cloud storage systems, this study proposes a novel deduplication architecture called I-sieve. The goal of I-sieve is to realize a high-performance data sieve system based on iSCSI in the cloud storage system. We also design the corresponding index and mapping tables and present a multi-level cache that uses a solid state drive to reduce RAM consumption and optimize lookup performance. A prototype of I-sieve is implemented based on the open source iSCSI target, and experiments driven by virtual machine images and testing tools have been conducted. The evaluation results show excellent deduplication and foreground performance. More importantly, I-sieve can co-exist with existing deduplication systems as long as they support the iSCSI protocol.
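The abstract names three building blocks: a fingerprint index, a mapping table, and a multi-level cache backed by a solid state drive. The following Python sketch illustrates how such pieces could fit together on the write path of a block-level inline deduplicator. It is not the I-sieve implementation. The 4 KB block size, the SHA-1 fingerprint, the LRU eviction policy, and all class and method names (`TwoLevelIndex`, `DedupStore`, `write`, `read`) are assumptions made for illustration, and the SSD tier is only simulated with an in-memory dictionary.

```python
import hashlib
from collections import OrderedDict

BLOCK_SIZE = 4096  # assumed block size; the abstract does not state one


class TwoLevelIndex:
    """Fingerprint index: a small RAM LRU in front of a larger 'SSD' tier.

    Hypothetical structure. The SSD tier is simulated with a plain dict.
    """

    def __init__(self, ram_capacity):
        self.ram = OrderedDict()  # hot entries: fingerprint -> physical block id
        self.ssd = {}             # all entries: fingerprint -> physical block id
        self.ram_capacity = ram_capacity
        self.ram_hits = 0
        self.ssd_hits = 0

    def lookup(self, fp):
        if fp in self.ram:
            self.ram.move_to_end(fp)  # mark as most recently used
            self.ram_hits += 1
            return self.ram[fp]
        if fp in self.ssd:
            self.ssd_hits += 1
            self._promote(fp, self.ssd[fp])
            return self.ssd[fp]
        return None  # fingerprint never seen: the block is unique so far

    def insert(self, fp, pba):
        self.ssd[fp] = pba
        self._promote(fp, pba)

    def _promote(self, fp, pba):
        self.ram[fp] = pba
        self.ram.move_to_end(fp)
        while len(self.ram) > self.ram_capacity:
            self.ram.popitem(last=False)  # evict the least recently used entry


class DedupStore:
    """Inline deduplication on the write path (illustrative only)."""

    def __init__(self, ram_capacity=2):
        self.index = TwoLevelIndex(ram_capacity)
        self.mapping = {}  # mapping table: logical block address -> physical id
        self.blocks = []   # physical block store; list position = physical id

    def write(self, lba, data):
        if len(data) != BLOCK_SIZE:
            raise ValueError("data must be exactly one block")
        fp = hashlib.sha1(data).digest()  # assumed fingerprint function
        pba = self.index.lookup(fp)
        if pba is None:
            # Unique content: store it once and remember its fingerprint.
            pba = len(self.blocks)
            self.blocks.append(data)
            self.index.insert(fp, pba)
        self.mapping[lba] = pba  # duplicates only add a mapping entry
        return pba

    def read(self, lba):
        return self.blocks[self.mapping[lba]]
```

In this sketch a duplicate write costs one fingerprint computation and one index lookup, and stores no new data. Keeping only the hot fingerprints in RAM while the full index lives on the slower tier is one way to trade lookup latency against memory use. Reference counting and reclamation of overwritten blocks are deliberately left out.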

About this article

Publication history

Received: 15 November 2014

Revised: 15 December 2014

Accepted: 07 January 2015

Published: 12 February 2015

Issue date: February 2015

Acknowledgements

We thank Wei Huang for programming the I-sieve prototype. We also thank the anonymous reviewers for their valuable insights, which greatly improved the quality of the paper. This work was supported by the Young Scholars of the Shandong Academy of Science (No. 2014QN013) and the National High-Tech Research and Development (863) Program of China (No. 2012AA011202).