Volume 22, Issue 3


Efficient Currency Determination Algorithms for Dynamic Data

Xiaoou Ding, Hongzhi Wang, Yitong Gao, Jianzhong Li, Hong Gao
School of Computer Science and Technology, Harbin Institute of Technology, Harbin 150001, China.

Abstract

Data quality is an important aspect of data application and management, and currency is one of the major dimensions that influence it. In real applications, dataset timestamps are often incomplete, unavailable, or absent altogether. With the growing need to update data in real time, existing methods can fail to adequately determine the currency of entities. Considering the velocity of big data, we propose a series of efficient algorithms for determining the currency of dynamic datasets, organized into two steps. In the preprocessing step, to better determine data currency and accelerate dataset updating, we propose a topological graph of the processing order of the entity attributes. We then construct an Entity Query B-Tree (EQB-Tree) structure and an Entity Storage Dynamic Linked List (ES-DLL) to improve the querying and updating of both the data currency graph and the currency scores. In the currency determination step, we define the currency score and currency information for tuples referring to the same entity, and use examples to discuss the methods and algorithms for computing them. Experimental results on both real and synthetic data verify that our methods can efficiently update data in the correct order of currency.

Keywords: data quality management, data currency, dynamic determining
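
The preprocessing step in the abstract relies on a topological graph of the processing order of entity attributes. The abstract does not specify how that graph is built, so the following is only a minimal sketch of the general idea, using a standard topological sort from Python's `graphlib` module. The function name `attribute_processing_order`, the input format, and the example attribute dependencies are all hypothetical and are not taken from the paper.

```python
from graphlib import TopologicalSorter, CycleError


def attribute_processing_order(depends_on):
    """Return one valid processing order for entity attributes.

    `depends_on` maps each attribute to the set of attributes that must be
    processed before it. This input format is an assumption made for
    illustration, not the paper's representation.
    """
    try:
        # static_order() yields every node after all of its predecessors.
        return list(TopologicalSorter(depends_on).static_order())
    except CycleError as err:
        # args[1] holds the list of nodes forming the detected cycle.
        raise ValueError(f"cyclic attribute dependencies: {err.args[1]}") from err


# Hypothetical dependencies for a "person" entity: for example, the currency
# order of marital_status is assumed to inform that of last_name and address.
deps = {
    "marital_status": set(),
    "last_name": {"marital_status"},
    "address": {"marital_status"},
    "job_title": set(),
    "salary": {"job_title"},
}

order = attribute_processing_order(deps)
print(order)
```

Any order returned this way places an attribute after every attribute it depends on; several valid orders may exist, and a cyclic dependency set is rejected with a `ValueError`.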


Publication history

Received: 26 March 2017
Revised: 06 April 2017
Accepted: 11 April 2017
Published: 04 May 2017
Issue date: June 2017

Copyright

© The authors 2017

Acknowledgements

This paper was partially supported by the National Natural Science Foundation of China (Nos. U1509216 and 61472099), the National Key Technology Research and Development Program (No. 2015BAH10F01), the Scientific Research Foundation for the Returned Overseas Chinese Scholars of Heilongjiang Province (No. LC2016026), and the MOE-Microsoft Key Laboratory of Natural Language Processing and Speech, Harbin Institute of Technology.
