Graph Deep Active Learning Framework for Data Deduplication

Huan Cao; Shengdong Du; Jie Hu; Yan Yang; Shi-Jinn Horng; Tianrui Li

doi:10.26599/BDMA.2023.9020040

Open Access

Huan Cao¹, Shengdong Du¹, Jie Hu¹, Yan Yang¹, Shi-Jinn Horng², Tianrui Li¹

¹ School of Computing and Artificial Intelligence, Southwest Jiaotong University, and also with the Engineering Research Center of Sustainable Urban Intelligent Transportation, Ministry of Education, Chengdu 611756, China

² College of Information and Electric Engineering, Asia University, Chongsheng 41359, China

Abstract

With the advent of the big data era, a growing volume of duplicate data is expressed in different forms. To reduce redundant storage and improve data quality, data deduplication has never been more important. It usually requires connecting multiple data tables and identifying different records that point to the same entity, especially in multi-source data deduplication. Active learning trains a model by selecting the data items with the maximum information divergence, which reduces the amount of data to be annotated and gives it unique advantages for big data annotation. However, most current active learning methods address only classical entity matching and are rarely applied to data deduplication tasks. To fill this research gap, we propose a novel graph deep active learning framework for data deduplication. The framework combines similarity algorithms with the bidirectional encoder representations from transformers (BERT) model to extract deep similarity features of multi-source data records. It is also the first to introduce a graph active learning strategy that builds a clean graph to filter the data that needs to be labeled, which is used to delete duplicate data while retaining the most information. Experimental results on real-world datasets demonstrate that the proposed method outperforms state-of-the-art active learning models on data deduplication tasks.
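
The query-selection idea in the abstract can be illustrated with a small sketch. This is not the authors' framework: the BERT-based deep similarity features are replaced here by a simple string similarity, "most informative" is approximated by a similarity score closest to 0.5, and the graph step is reduced to a union-find structure that skips pairs already implied by earlier duplicate labels. The names `pair_similarity`, `UnionFind`, and `select_queries`, and the sample records, are hypothetical and chosen only for illustration.

```python
from difflib import SequenceMatcher
from itertools import combinations


def pair_similarity(a: str, b: str) -> float:
    """Average of a character-level ratio and token Jaccard overlap, both in [0, 1]."""
    char_sim = SequenceMatcher(None, a.lower(), b.lower()).ratio()
    ta, tb = set(a.lower().split()), set(b.lower().split())
    jaccard = len(ta & tb) / len(ta | tb) if ta | tb else 1.0
    return (char_sim + jaccard) / 2


class UnionFind:
    """Tracks clusters of records already confirmed as duplicates (assumed graph step)."""

    def __init__(self, n: int) -> None:
        self.parent = list(range(n))

    def find(self, x: int) -> int:
        while self.parent[x] != x:
            self.parent[x] = self.parent[self.parent[x]]  # path halving
            x = self.parent[x]
        return x

    def union(self, x: int, y: int) -> None:
        self.parent[self.find(x)] = self.find(y)


def select_queries(records, clusters, budget):
    """Return up to `budget` unresolved pairs whose similarity is closest to 0.5."""
    candidates = []
    for i, j in combinations(range(len(records)), 2):
        if clusters.find(i) == clusters.find(j):
            continue  # already implied by earlier labels, so no annotation is needed
        score = pair_similarity(records[i], records[j])
        candidates.append((abs(score - 0.5), i, j))
    candidates.sort()  # most uncertain pairs first
    return [(i, j) for _, i, j in candidates[:budget]]


# Hypothetical records; the first three describe the same product.
records = [
    "Apple iPhone 13 128GB",
    "iphone 13 apple 128 gb",
    "Apple iPhone 13 (128GB)",
    "Samsung Galaxy S21",
]
clusters = UnionFind(len(records))
clusters.union(0, 1)  # suppose an annotator confirmed records 0 and 1 as duplicates
clusters.union(1, 2)  # and records 1 and 2
queries = select_queries(records, clusters, budget=2)
```

After the two confirmed labels, the pair (0, 2) is never proposed for annotation because it follows by transitivity, so the remaining budget is spent only on pairs involving the unresolved record. A real system would swap `pair_similarity` for learned features and retrain the matcher after each labeling round.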

Keywords

data deduplication; active learning; similarity calculation

Big Data Mining and Analytics

Volume 7, Issue 3, September 2024

Pages 753-764

DOI: 10.26599/BDMA.2023.9020040

Cite this article:

Cao H, Du S, Hu J, et al. Graph Deep Active Learning Framework for Data Deduplication. Big Data Mining and Analytics, 2024, 7(3): 753-764. https://doi.org/10.26599/BDMA.2023.9020040

Received: 04 September 2023

Revised: 17 November 2023

Accepted: 07 December 2023

Published: 28 August 2024

The articles published in this open access journal are distributed under the terms of the Creative Commons Attribution 4.0 International License (http://creativecommons.org/licenses/by/4.0/).