AI Chat Paper
Note: Please note that the following content is generated by AMiner AI. SciOpen does not take any responsibility related to this content.
{{lang === 'zh_CN' ? '文章概述' : 'Summary'}}
{{lang === 'en_US' ? '中' : 'Eng'}}
Chat more with AI
PDF (3.6 MB)
Collect
Submit Manuscript AI Chat Paper
Show Outline
Outline
Show full outline
Hide outline
Outline
Show full outline
Hide outline
Research Article | Open Access | Online First

A Block-Wise Updating Strategy for Approximate Duplicate Detection

School of Computer Science Technology, Anhui University of Technology Ma’anshan 243032, China, and also with the Anhui Province Key Laboratory of Digital Twin Technology in Metallurgical Industry, Ma’anshan 243032, China
School of Mathematics and Statistics, Fuzhou University, Fuzhou 350108, China
School of Computer Science Engineering, Anhui University of Science and Technology, Huainan 232001, China
Show Author Information

Abstract

Approximate duplicate detection in data streams aims to determine whether an item is present within a small subsequence of the data stream. It is a fundamental query problem needed in several network applications, such as web crawling and Radio Frequency Identification (RFID) tag management. Most of the existing algorithms are not space-efficient as they overlook the distributional information of query frequency and membership likelihood. In this paper, we propose CEll Bloom Filter (CEBF) algorithm, a space-efficient data structure designed by adopting a block-wise updating strategy, to solve this problem, and two typical distributions are considered: two typical distributions: (1) uniform query frequency of items, and (2) uniform membership likelihood of items. For an arbitrary sliding window size n and an arbitrary average false positive rate ε(0,1), the proposed algorithm needs approximately O(nlog2(1/ε)) bits, which is significantly less than the number of bits used in existing methods. Space lower bound is also derived for these two typical distributions, which demonstrates that the proposed algorithm is near-optimal in terms of space consumption. Experiments are conducted on both synthetic and real datasets, and the experimental results clearly demonstrate that the proposed method is clearly and significantly more effective and promising compared to existing other methods.

References

【1】
【1】
 
 
Tsinghua Science and Technology

{{item.num}}

Comments on this article

Go to comment

< Back to all reports

Review Status: {{reviewData.commendedNum}} Commended , {{reviewData.revisionRequiredNum}} Revision Required , {{reviewData.notCommendedNum}} Not Commended Under Peer Review

Review Comment

Close
Close
Cite this article:
Wang X, Guo L, Yang G, et al. A Block-Wise Updating Strategy for Approximate Duplicate Detection. Tsinghua Science and Technology, 2025, https://doi.org/10.26599/TST.2024.9010187

918

Views

43

Downloads

0

Crossref

0

Web of Science

0

Scopus

0

CSCD

Received: 02 July 2024
Revised: 29 August 2024
Accepted: 08 October 2024
Published: 26 September 2025
© The author(s) 2026.

The articles published in this open access journal are distributed under the terms of the Creative Commons Attribution 4.0 International License (http://creativecommons.org/licenses/by/4.0/).