Journal Home > Volume 3 , Issue 1

Current Conditional Functional Dependency (CFD) discovery algorithms always need a well-prepared training dataset. This condition makes them difficult to apply on large and low-quality datasets. To handle the volume issue of big data, we develop the sampling algorithms to obtain a small representative training set. We design the fault-tolerant rule discovery and conflict-resolution algorithms to address the low-quality issue of big data. We also propose parameter selection strategy to ensure the effectiveness of CFD discovery algorithms. Experimental results demonstrate that our method can discover effective CFD rules on billion-tuple data within a reasonable period.


menu
Abstract
Full text
Outline
About this article

Mining Conditional Functional Dependency Rules on Big Data

Show Author's information Mingda LiHongzhi Wang( )Jianzhong Li
Department of Computer Science, University of California, Los Angles, CA 90095, USA.
Department of Computer Science and Technology, Harbin Institute of Technology, Harbin 150000, China.

Abstract

Current Conditional Functional Dependency (CFD) discovery algorithms always need a well-prepared training dataset. This condition makes them difficult to apply on large and low-quality datasets. To handle the volume issue of big data, we develop the sampling algorithms to obtain a small representative training set. We design the fault-tolerant rule discovery and conflict-resolution algorithms to address the low-quality issue of big data. We also propose parameter selection strategy to ensure the effectiveness of CFD discovery algorithms. Experimental results demonstrate that our method can discover effective CFD rules on billion-tuple data within a reasonable period.

Keywords: big data, data mining, conditional functional dependency, data quality

References(17)

[1]
W. F. Fan, F. Geerts, X. B. Jia, and A. Kementsietsidis, Conditional functional dependencies for capturing data inconsistencies, ACM Trans. Database Syst., vol. 33, no. 2, p. 6, 2008.
[2]
W. G. Chen, W. F. Fan, and S. Ma, Analyses and validation of conditional dependencies with built-in predicates, in Proc. 20th Int. Conf. Database and Expert Systems Applications, Linz, Austria, 2009, pp. 576-591.
DOI
[3]
B. Bollobás, Modern Graph Theory. New York, NY, USA: Springer-Verlag, 1998.
DOI
[4]
L. Golab, H. Karloff, F. Korn, D. Srivastava, and B. Yu, On generating near-optimal tableaux for conditional functional dependencies, Proc. VLDB Endowment, vol. 1, no. 1, pp. 376-390, 2008.
[5]
G. Cormode, L. Golab, K. Flip, A. McGregor, D. Srivastava, and X. Zhang, Estimating the confidence of conditional functional dependencies, in Proc. 2009 ACM SIGMOD Int. Conf. Management of Data, Providence, RI, USA, 2009, pp. 469-482.
DOI
[6]
J. S. Vitter, Random sampling with a reservoir, ACM Trans. Math. Softw., vol. 11, no. 1, pp. 37-57, 1985.
[7]
W. F. Fan, F. Geerts, J. Z. Li, and M. Xiong, Discovering conditional functional dependencies, IEEE Trans. Knowl. Data Eng., vol. 23, no. 5, pp. 683-698, 2011.
[8]
M. R. Garey and D. S. Johnson, Computers and Intractability: A Guide to the Theory of NP-Completeness. New York, NY, USA: W. H. Freeman & Co., 1979.
[9]
F. Chiang and R. J. Miller, Discovering data quality rules, Proc. VLDB Endowment, vol. 1, no. 1, pp. 1166-1177, 2008.
[10]
C. Batini and M. Scannapieco, Data Quality: Concepts, Methodologies and Techniques. Berlin, Germany: Springer, 2006.
[11]
J. Chomicki, Consistent query answering: Five easy pieces, in Proc. 11th Int. Conf. Database Theory, Barcelona, Spain, 2007, pp. 1-17.
DOI
[12]
L. Bertossi, Consistent query answering in databases, ACM SIGMOD Record, vol. 35, no. 2, pp. 68-76, 2006.
[13]
W. F. Fan, Dependencies revisited for improving data quality, in Proc. 27th ACM SIGMOD-SIGACT-SIGART Symp. Principles of Database Systems, Vancouver, Canada, 2008, pp. 159-170.
DOI
[14]
E. Rahm and H. H. Do, Data cleaning: Problems and current approaches, IEEE Data Eng. Bull., vol. 23, no. 4, pp. 3-13, 2010.
[15]
L. Bravo, W. F. Fan, and S. Ma, Extending dependencies with conditions, in Proc. 33rd Int. Conf. Very Large Data Bases, Vienna, Austria, 2007, pp. 243-254.
[16]
L. Bravo, W. F. Fan, F. Geerts, and S. Ma, Increasing the expressivity of conditional functional dependencies without extra complexity, in Proc. 24th Int. Conf. Data Engineering, Cancun, Mexico, 2008, pp. 516-525.
DOI
[17]
A. K. Kalavagattu, Mining approximate functional dependencies as condensed representations of association rules, Master dissertation, Arizona State University, Phoenix, AZ, USA, 2008.
Publication history
Copyright
Acknowledgements
Rights and permissions

Publication history

Received: 28 September 2019
Accepted: 09 October 2019
Published: 19 December 2019
Issue date: March 2020

Copyright

© The author(s) 2020

Acknowledgements

This paper was partially supported by the National Key R&D Program of China (No. 2018YFB1004700), the National Natural Science Foundation of China (Nos. U1509216, U1866602, and 61602129), and Microsoft Research Asia.

Rights and permissions

The articles published in this open access journal are distributed under the terms of the Creative Commons Attribution 4.0 International License (http://creativecommons.org/licenses/by/4.0/).

Return