Volume 20, Issue 1

I-sieve: An Inline High Performance Deduplication System Used in Cloud Storage

Jibin Wang, Zhigang Zhao, Zhaogang Xu, Hu Zhang, Liang Li, Ying Guo
Shandong Computer Science Center (National Supercomputer Center in Jinan), Shandong Provincial Key Laboratory of Computer Networks, Jinan 250101, China.

Abstract

Data deduplication is widely employed in current storage systems and is increasingly applied in inline scenarios such as virtual machines and cloud storage. This study proposes a novel deduplication architecture called I-sieve, whose goal is to realize a high-performance data sieve based on iSCSI in cloud storage systems. We design the corresponding index and mapping tables and present a multi-level cache that uses a solid state drive to reduce RAM consumption and optimize lookup performance. A prototype of I-sieve is implemented on an open source iSCSI target and evaluated in experiments driven by virtual machine images and testing tools. The evaluation results show excellent deduplication and foreground performance. More importantly, I-sieve can co-exist with existing deduplication systems as long as they support the iSCSI protocol.

Keywords: cloud storage, I-sieve, data deduplication
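
To make the index table, mapping table, and multi-level cache named in the abstract more concrete, the following is a minimal illustrative sketch, not the I-sieve implementation. It assumes fixed 4 KiB blocks, SHA-1 fingerprints, an in-memory LRU as the RAM tier, and a plain dictionary standing in for the SSD tier. All class, method, and parameter names are hypothetical, and reference counting and garbage collection of overwritten blocks are omitted.

```python
import hashlib
from collections import OrderedDict

BLOCK_SIZE = 4096  # assumed block size; the abstract does not state one


class TwoLevelIndex:
    """Fingerprint -> physical block address (PBA).

    A small LRU dictionary plays the RAM tier; a plain dict stands in for
    the larger SSD-resident tier (an assumption made for illustration).
    """

    def __init__(self, ram_entries):
        self.ram = OrderedDict()
        self.ram_entries = ram_entries
        self.ssd = {}
        self.ram_hits = 0
        self.ssd_hits = 0

    def lookup(self, fp):
        if fp in self.ram:
            self.ram.move_to_end(fp)  # mark as most recently used
            self.ram_hits += 1
            return self.ram[fp]
        pba = self.ssd.get(fp)
        if pba is not None:
            self.ssd_hits += 1
            self._promote(fp, pba)  # pull a hot entry back into RAM
        return pba

    def insert(self, fp, pba):
        self.ssd[fp] = pba  # the lower tier holds every fingerprint
        self._promote(fp, pba)

    def _promote(self, fp, pba):
        self.ram[fp] = pba
        self.ram.move_to_end(fp)
        while len(self.ram) > self.ram_entries:
            self.ram.popitem(last=False)  # evict least recently used


class DedupVolume:
    """Inline deduplication of block writes addressed by logical block (LBA)."""

    def __init__(self, ram_entries=1024):
        self.index = TwoLevelIndex(ram_entries)
        self.mapping = {}  # mapping table: LBA -> PBA
        self.blocks = []   # physical store; a PBA is a position in this list

    def write(self, lba, data):
        if len(data) != BLOCK_SIZE:
            raise ValueError("writes must be exactly one block")
        fp = hashlib.sha1(data).digest()
        pba = self.index.lookup(fp)
        if pba is None:  # unseen content: store it and index it
            pba = len(self.blocks)
            self.blocks.append(data)
            self.index.insert(fp, pba)
        self.mapping[lba] = pba  # duplicates only add a mapping entry

    def read(self, lba):
        return self.blocks[self.mapping[lba]]


if __name__ == "__main__":
    vol = DedupVolume(ram_entries=2)
    vol.write(0, b"A" * BLOCK_SIZE)
    vol.write(1, b"B" * BLOCK_SIZE)
    vol.write(2, b"A" * BLOCK_SIZE)  # duplicate of the block at LBA 0
    print("logical blocks:", len(vol.mapping))
    print("physical blocks:", len(vol.blocks))
```

In this sketch a duplicate write costs one fingerprint computation and one index lookup, and only new content reaches the physical store. Keeping the full index in the lower tier and only recent fingerprints in RAM is one simple way to trade RAM for lookup latency; the paper's actual table layouts and cache policy may differ.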

Publication history

Received: 15 November 2014
Revised: 15 December 2014
Accepted: 07 January 2015
Published: 12 February 2015
Issue date: February 2015

Copyright

© The authors 2015

Acknowledgements

We thank Wei Huang for programming the I-sieve prototype. We also thank the anonymous reviewers for their valuable insights, which greatly improved the quality of the paper. This work was supported by the Young Scholars of the Shandong Academy of Science (No. 2014QN013) and the National High-Tech Research and Development (863) Program of China (No. 2012AA011202).
