Regular Paper Issue
An Enhanced Physical-Locality Deduplication System for Space Efficiency
Journal of Computer Science and Technology 2024, 39(6): 1361-1379
Published: 16 January 2025
Abstract

An abundance of data has been generated by various embedded devices, applications, and systems, and requires cost-efficient storage services. Data deduplication removes duplicate chunks and has become an important technique for storage systems to improve space efficiency. However, the stored unique chunks are heavily fragmented, which decreases restore performance and incurs high overheads for garbage collection. Existing schemes fail to achieve an efficient trade-off among deduplication, restore, and garbage collection performance because they do not explore and exploit the physical locality of different chunks. In this paper, we trace the storage patterns of the fragmented chunks in backup systems and propose a high-performance deduplication system, called HiDeStore. The main insight is to enhance the physical locality of new backup versions during the deduplication phase, which identifies hot chunks and stores them in active containers. The chunks not appearing in new backups become cold and are gathered together in archival containers. Moreover, we remove expired data with an isolated container deletion scheme, avoiding the high overheads of expired-data detection. Compared with state-of-the-art schemes, HiDeStore improves the deduplication and restore performance by up to 1.4x and 1.6x, respectively, without decreasing the deduplication ratios or incurring high garbage collection overheads.
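The hot/cold chunk separation described in the abstract can be pictured with a short Python sketch. This is only an illustration under simplifying assumptions, not the paper's implementation: the container size, the DedupStore class, and the assumption that hot chunks are exactly those referenced by the latest backup version are all illustrative.

import hashlib

CONTAINER_SIZE = 4   # chunks per archival container; tiny value for illustration

def fingerprint(chunk):
    # SHA-1 fingerprint used to detect duplicate chunks
    return hashlib.sha1(chunk).hexdigest()

class DedupStore:
    def __init__(self):
        self.active = {}     # fingerprint -> chunk; chunks referenced by the latest backup
        self.archival = []   # sealed containers holding cold chunks

    def backup(self, chunks):
        # Deduplicate one backup version and refresh physical locality.
        new_active = {}
        recipe = []          # fingerprints needed to restore this backup version
        for chunk in chunks:
            fp = fingerprint(chunk)
            recipe.append(fp)
            if fp not in new_active:
                # Hot chunk: referenced by the new version, kept in active storage.
                new_active[fp] = self.active.get(fp, chunk)
        # Chunks no longer referenced turn cold: pack them into archival containers.
        cold = [(fp, c) for fp, c in self.active.items() if fp not in new_active]
        for i in range(0, len(cold), CONTAINER_SIZE):
            self.archival.append(dict(cold[i:i + CONTAINER_SIZE]))
        self.active = new_active
        return recipe

    def expire_container(self, idx):
        # Isolated container deletion: drop a whole archival container once no
        # retained backup version references any chunk inside it.
        self.archival.pop(idx)

In this sketch, each new backup rewrites the active set so that the chunks of the latest version stay physically together, while chunks that fall out of use migrate to archival containers that can later be deleted as whole units.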

Regular Paper Issue
Approximate Similarity-Aware Compression for Non-Volatile Main Memory
Journal of Computer Science and Technology 2024, 39(1): 63-81
Published: 25 January 2024
Abstract

Image bitmaps, i.e., data containing pixels and visual perception, have been widely used in emerging applications for pixel operations, while consuming large amounts of memory space and energy. Compared with legacy DRAM (dynamic random access memory), non-volatile memories (NVMs) are suitable for bitmap storage due to their salient features of high density and intrinsic durability. However, writes to NVMs suffer from higher energy consumption and latency than reads. Existing precise or approximate compression schemes in NVM controllers show limited performance for bitmaps due to the irregular data patterns and variance within bitmaps. We observe pixel-level similarity when writing bitmaps, due to the analogous contents in adjacent pixels. By exploiting this pixel-level similarity, we propose SimCom, an approximate similarity-aware compression scheme in the NVM module controller, to efficiently compress data for each write access on-the-fly. The idea behind SimCom is to compress consecutive similar words into pairs of base words and run lengths. The storage costs for small runs are further mitigated by reusing the least significant bits of the base words. SimCom adaptively selects an appropriate compression mode for various bitmap formats, thus achieving an efficient trade-off between quality and memory performance. We implement SimCom on GEM5/zsim with NVMain and evaluate its performance with real-world image/video workloads. Our results demonstrate that SimCom delivers an efficient quality-performance trade-off.
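A rough Python sketch of the base-word-plus-run idea follows. The 32-bit word size, the similarity mask, and the number of low bits reused for the run length are illustrative assumptions, not the paper's actual encoding or the controller-level implementation.

RUN_BITS = 3                  # run length stored in the base word's least significant bits
SIMILARITY_MASK = 0xFFFFFF00  # 32-bit words are "similar" when their high 24 bits match

def compress_line(words):
    # Encode a sequence of 32-bit words as packed base words. Similar consecutive
    # words collapse into one base word; the run length overwrites the base's low
    # bits, a small loss that the approximate scheme tolerates.
    packed_words = []
    i = 0
    while i < len(words):
        base, run = words[i], 1
        while (i + run < len(words)
               and run < (1 << RUN_BITS)
               and (words[i + run] & SIMILARITY_MASK) == (base & SIMILARITY_MASK)):
            run += 1
        packed_words.append((base & ~((1 << RUN_BITS) - 1)) | (run - 1))
        i += run
    return packed_words

def decompress_line(packed_words):
    # Expand each packed word back into `run` approximate copies of its base.
    words = []
    for packed in packed_words:
        run = (packed & ((1 << RUN_BITS) - 1)) + 1
        base = packed & ~((1 << RUN_BITS) - 1)
        words.extend([base] * run)
    return words

The reconstructed words differ from the originals only in their low-order bits, which is why the approach suits approximate bitmap data: adjacent pixels with analogous contents compress into a single base word and a short run.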

Open Access Issue
Towards a Cost-Efficient MapReduce: Mitigating Power Peaks for Hadoop Clusters
Tsinghua Science and Technology 2014, 19(1): 24-32
Published: 07 February 2014
Abstract

Distributed data processing systems have become one of the most important components for data-intensive computational tasks in enterprise software infrastructure. Deploying and operating such systems incurs substantial costs, including hardware costs to build clusters and energy costs to run them. Power management has therefore become an important research problem for making these systems sustainable and scalable. In this paper, we take Hadoop as an example to illustrate the power peak problem, which causes power inefficiency, and provide an in-depth analysis of the issues in existing system designs. We propose a novel power capping module in the Hadoop scheduler to mitigate power peaks. Extensive simulation studies show that our proposed solution can effectively smooth the power consumption curve and mitigate temporary power peaks for Hadoop clusters.
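The power capping idea can be pictured as a scheduler that defers task launches whenever the projected cluster power would exceed a configured cap. The Python sketch below is a simplification under assumed numbers: the class name, power figures, and per-task power estimate are hypothetical and not taken from the paper or from Hadoop's scheduler API.

from collections import deque

POWER_CAP_WATTS = 10_000   # hypothetical cluster-wide power budget
TASK_POWER_WATTS = 150     # hypothetical average power draw of one running task

class PowerCappedScheduler:
    def __init__(self, idle_power_watts=4_000):
        self.idle_power = idle_power_watts
        self.running = 0
        self.pending = deque()

    def projected_power(self, extra_tasks=0):
        # Simple linear power model: idle power plus a fixed cost per running task.
        return self.idle_power + (self.running + extra_tasks) * TASK_POWER_WATTS

    def submit(self, task):
        self.pending.append(task)
        self._dispatch()

    def task_finished(self):
        self.running -= 1
        self._dispatch()   # freed headroom may allow deferred tasks to start

    def _dispatch(self):
        # Launch pending tasks only while the projected power stays under the cap,
        # smoothing the consumption curve instead of launching everything at once.
        while self.pending and self.projected_power(extra_tasks=1) <= POWER_CAP_WATTS:
            self.pending.popleft()
            self.running += 1

Deferring launches in this way trades a small amount of job latency for a flatter power profile, which is the trade-off the paper's simulation study evaluates.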
