Volume 23, Issue 3

Truth Discovery on Inconsistent Relational Data

Jizhou Sun, Jianzhong Li, Hong Gao, Hongzhi Wang
School of Computer Science and Technology, Harbin Institute of Technology, Harbin 150001, China.

Abstract

In this era of big data, data are often collected from multiple sources with different reliabilities, and the information these sources provide about the same object inevitably conflicts. One important task is to identify the most trustworthy value among the conflicting claims; this is known as truth discovery. Existing truth discovery methods simultaneously estimate the most trustworthy information and the reliability degree of each source, based on the idea that more reliable sources tend to provide more trustworthy information, and vice versa. However, semantic constraints are often defined on a relational database, and these can be violated even by a single data source. Removing such violations requires repairing the data so that they satisfy the constraints; this is known as data cleaning. The two problems may coexist, and considering them together can provide benefits, yet to the authors' knowledge this combination has not been the focus of any prior research. This paper therefore proposes a method based on schema decomposition that discovers the truth and cleans the data simultaneously, with the aim of improving accuracy. Experimental results on real-world data sets of notebooks and mobile phones, as well as on simulated data sets, demonstrate the effectiveness and efficiency of the proposed method.
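The principle that reliable sources and trustworthy values reinforce each other can be illustrated with a small sketch. The code below is a generic iterative weighted-voting loop, not the method proposed in the paper; the function name `discover_truth`, the smoothed-accuracy weighting, and the sample claims about notebook attributes are all assumptions made for illustration.

```python
from collections import defaultdict


def discover_truth(claims, iterations=10):
    """Generic weighted-voting truth discovery (illustrative only).

    claims: list of (source, object, value) triples.
    Returns (truths, weights): the chosen value per object and an
    estimated reliability per source.
    """
    sources = {s for s, _, _ in claims}
    weights = {s: 1.0 for s in sources}  # start with equal trust
    truths = {}
    for _ in range(iterations):
        # Step 1: each object takes the value with the largest total
        # weight among the sources that claim it.
        votes = defaultdict(lambda: defaultdict(float))
        for s, o, v in claims:
            votes[o][v] += weights[s]
        truths = {o: max(vs, key=vs.get) for o, vs in votes.items()}

        # Step 2: each source's weight becomes its smoothed accuracy
        # against the current truths (add-one smoothing is an assumption).
        correct = defaultdict(int)
        total = defaultdict(int)
        for s, o, v in claims:
            total[s] += 1
            correct[s] += int(truths[o] == v)
        weights = {s: (correct[s] + 1) / (total[s] + 2) for s in sources}
    return truths, weights


# Hypothetical claims from three sources about three notebook attributes.
claims = [
    ("A", "nb1.ram", "8GB"), ("B", "nb1.ram", "8GB"), ("C", "nb1.ram", "4GB"),
    ("A", "nb1.cpu", "i5"), ("B", "nb1.cpu", "i5"), ("C", "nb1.cpu", "i7"),
    ("A", "nb1.screen", "14in"), ("B", "nb1.screen", "15in"), ("C", "nb1.screen", "15in"),
]

truths, weights = discover_truth(claims)
print(truths)
print(weights)
```

In this toy run, source B agrees with the selected value for every attribute and therefore ends up with the highest weight. The sketch ignores integrity constraints entirely, which is the gap the abstract says the proposed method addresses.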

Keywords: truth discovery, inconsistent data, data cleaning

Publication history

Received: 14 July 2017
Accepted: 07 August 2017
Published: 02 July 2018
Issue date: June 2018

Copyright

© The author(s) 2018

Acknowledgements

This paper was partially supported by the Key Research and Development Plan of the National Ministry of Science and Technology (No. 2016YFB1000703), the Key Program of the National Natural Science Foundation of China (Nos. 61190115, 61472099, 61632010, and U1509216), the National Sci-Tech Support Plan (No. 2015BAH10F01), the Scientific Research Foundation for the Returned Overseas Chinese Scholars of Heilongjiang Province (No. LC2016026), and the MOE-Microsoft Key Laboratory of Natural Language Processing and Speech, Harbin Institute of Technology.
