376
Views
26
Downloads
25
Crossref
N/A
WoS
26
Scopus
0
CSCD
Current Conditional Functional Dependency (CFD) discovery algorithms always need a well-prepared training dataset. This condition makes them difficult to apply on large and low-quality datasets. To handle the volume issue of big data, we develop the sampling algorithms to obtain a small representative training set. We design the fault-tolerant rule discovery and conflict-resolution algorithms to address the low-quality issue of big data. We also propose parameter selection strategy to ensure the effectiveness of CFD discovery algorithms. Experimental results demonstrate that our method can discover effective CFD rules on billion-tuple data within a reasonable period.
Current Conditional Functional Dependency (CFD) discovery algorithms always need a well-prepared training dataset. This condition makes them difficult to apply on large and low-quality datasets. To handle the volume issue of big data, we develop the sampling algorithms to obtain a small representative training set. We design the fault-tolerant rule discovery and conflict-resolution algorithms to address the low-quality issue of big data. We also propose parameter selection strategy to ensure the effectiveness of CFD discovery algorithms. Experimental results demonstrate that our method can discover effective CFD rules on billion-tuple data within a reasonable period.
This paper was partially supported by the National Key R&D Program of China (No. 2018YFB1004700), the National Natural Science Foundation of China (Nos. U1509216, U1866602, and 61602129), and Microsoft Research Asia.
The articles published in this open access journal are distributed under the terms of the Creative Commons Attribution 4.0 International License (http://creativecommons.org/licenses/by/4.0/).