A Block-Wise Updating Strategy for Approximate Duplicate Detection

Xiujun Wang; Longkun Guo; Gaoming Yang; Lei Mo; Xiao Zheng

doi:10.26599/TST.2024.9010187

AI Chat Paper

Note: Please note that the following content is generated by AMiner AI. SciOpen does not take any responsibility related to this content.

Chat more with AI

| Sign up

Browse by Subject

Search for peer-reviewed journals with full access.

Journals A - Z

About Us

Discover the SciOpen Platform and Achieve Your Research Goals with Ease.

About Us

Publish with Us

Support

Search articles, authors, keywords, DOl and etc.

Published Date

Reset Search

{{expandStatus?'Exit ':''}}Advanced Search

Journals A - Z

About Us

Publish with Us

Support

PDF (3.6 MB)

Cite

EndNote(RIS) BibTeX

Collect

Submit Manuscript

AI Chat Paper

Show Outline

Outline

Show full outline

Hide outline

Outline

Show full outline

Hide outline

Research Article | Open Access | Online First

A Block-Wise Updating Strategy for Approximate Duplicate Detection

Xiujun Wang^¹, Longkun Guo^²(

), Gaoming Yang^³, Lei Mo^², Xiao Zheng^¹

1School of Computer Science Technology, Anhui University of Technology Ma’anshan 243032, China, and also with the Anhui Province Key Laboratory of Digital Twin Technology in Metallurgical Industry, Ma’anshan 243032, China

2School of Mathematics and Statistics, Fuzhou University, Fuzhou 350108, China

3School of Computer Science Engineering, Anhui University of Science and Technology, Huainan 232001, China

Show Author Information

Abstract

Approximate duplicate detection in data streams aims to determine whether an item is present within a small subsequence of the data stream. It is a fundamental query problem needed in several network applications, such as web crawling and Radio Frequency Identification (RFID) tag management. Most of the existing algorithms are not space-efficient as they overlook the distributional information of query frequency and membership likelihood. In this paper, we propose CEll Bloom Filter (CEBF) algorithm, a space-efficient data structure designed by adopting a block-wise updating strategy, to solve this problem, and two typical distributions are considered: two typical distributions: (1) uniform query frequency of items, and (2) uniform membership likelihood of items. For an arbitrary sliding window size n and an arbitrary average false positive rate $ε \in (0, 1)$ , the proposed algorithm needs approximately $O (n \log_{2} (1 / ε))$ bits, which is significantly less than the number of bits used in existing methods. Space lower bound is also derived for these two typical distributions, which demonstrates that the proposed algorithm is near-optimal in terms of space consumption. Experiments are conducted on both synthetic and real datasets, and the experimental results clearly demonstrate that the proposed method is clearly and significantly more effective and promising compared to existing other methods.

Keywords

approximate duplicate detection sliding window space lower bound Bloom filter

References

【1】

Crossref Google Scholar

Tsinghua Science and Technology

DOI: 10.26599/TST.2024.9010187

	{{item.num}}
{{version.versionName}} Author Response
{{version.versionName}} Review comment

Comments on this article

Go to comment

< Back to all reports

Review Status: {{reviewData.commendedNum}} Commended , {{reviewData.revisionRequiredNum}} Revision Required , {{reviewData.notCommendedNum}} Not Commended Under Peer Review

Review Comment

Cite this Report

. . , , {{reviewData.reportCite.doi}}

Cite this article:

Wang X, Guo L, Yang G, et al. A Block-Wise Updating Strategy for Approximate Duplicate Detection. Tsinghua Science and Technology, 2025, https://doi.org/10.26599/TST.2024.9010187

1096

Views

Downloads

Crossref

Web of Science

Scopus

CSCD

Google Scholar
Citation

Received: 02 July 2024

Revised: 29 August 2024

Accepted: 08 October 2024

Published: 26 September 2025

The articles published in this open access journal are distributed under the terms of the Creative Commons Attribution 4.0 International License (http://creativecommons.org/licenses/by/4.0/).