Efficient Currency Determination Algorithms for Dynamic Data

Xiaoou Ding; Hongzhi Wang; Yitong Gao; Jianzhong Li; Hong Gao

doi:10.23919/TST.2017.7914196

Tsinghua Science and Technology 2017, 22(3): 227-242 https://doi.org/10.23919/TST.2017.7914196

Open Access | Published: 04 May 2017


School of Computer Science and Technology, Harbin Institute of Technology, Harbin 150001, China.

Keywords:

data quality management, data currency, dynamic determining

Cite this article:

Ding X, Wang H, Gao Y, et al. Efficient Currency Determination Algorithms for Dynamic Data. Tsinghua Science and Technology, 2017, 22(3): 227-242. https://doi.org/10.23919/TST.2017.7914196


Abstract

Data quality is an important aspect of data application and management, and currency is one of the major dimensions that influence it. In real applications, dataset timestamps are often incomplete, unavailable, or absent altogether. With the increasing need to update data in real time, existing methods can fail to determine the currency of entities adequately. Considering the velocity of big data, we propose a series of efficient algorithms for determining the currency of dynamic datasets, organized in two steps. In the preprocessing step, to better determine data currency and accelerate dataset updating, we propose a topological graph of the processing order of entity attributes. We then construct an Entity Query B-Tree (EQB-Tree) structure and an Entity Storage Dynamic Linked List (ES-DLL) to improve querying and updating of both the data currency graph and the currency scores. In the currency determination step, we define the currency score and currency information for tuples referring to the same entity, and use examples to discuss methods and algorithms for computing them. Experimental results on both real and synthetic data verify that our methods can efficiently update data in the correct order of currency.
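The abstract does not say how the processing order of attributes is derived. As an illustration only, the sketch below shows one common way to obtain such an order from a dependency graph: a topological sort using Kahn's algorithm. The attribute names, the dependency edges, and the function `processing_order` are hypothetical and are not taken from the paper; the EQB-Tree and ES-DLL structures are not modeled here.

```python
from collections import deque


def processing_order(attributes, depends_on):
    """Order attributes so each one follows every attribute it depends on.

    This is plain Kahn's algorithm, used here as an assumed stand-in for
    the paper's preprocessing step, not as the authors' actual method.
    """
    indegree = {a: 0 for a in attributes}
    successors = {a: [] for a in attributes}
    for attr, prerequisites in depends_on.items():
        for pre in prerequisites:
            successors[pre].append(attr)
            indegree[attr] += 1

    # Start with attributes that depend on nothing.
    ready = deque(a for a in attributes if indegree[a] == 0)
    order = []
    while ready:
        current = ready.popleft()
        order.append(current)
        for nxt in successors[current]:
            indegree[nxt] -= 1
            if indegree[nxt] == 0:
                ready.append(nxt)

    if len(order) != len(attributes):
        raise ValueError("cyclic dependency among attributes")
    return order


# Hypothetical example: the currency of "job" is assumed to depend on
# "status", and the currency of "salary" on "job".
attributes = ["status", "job", "salary", "address"]
depends_on = {"job": ["status"], "salary": ["job"]}
order = processing_order(attributes, depends_on)
print(order)
```

With an order like this, attributes whose currency is determined independently would be processed first, so attributes that depend on them can reuse the result; a cycle in the assumed dependencies is reported as an error rather than silently ignored.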




Publication history

Received: 26 March 2017

Revised: 06 April 2017

Accepted: 11 April 2017

Published: 04 May 2017

Issue date: June 2017


Acknowledgements

This paper was partially supported by the National Natural Science Foundation of China (Nos. U1509216 and 61472099), the National Key Technology Research and Development Program (No. 2015BAH10F01), the Scientific Research Foundation for the Returned Overseas Chinese Scholars of Heilongjiang Province (No. LC2016026), and the MOE-Microsoft Key Laboratory of Natural Language Processing and Speech, Harbin Institute of Technology.