AI Chat Paper
Note: Please note that the following content is generated by AMiner AI. SciOpen does not take any responsibility related to this content.
{{lang === 'zh_CN' ? '文章概述' : 'Summary'}}
{{lang === 'en_US' ? '中' : 'Eng'}}
Chat more with AI
Article Link
Collect
Submit Manuscript
Show Outline
Outline
Show full outline
Hide outline
Outline
Show full outline
Hide outline
Survey

A Survey on Blocking Technology of Entity Resolution

College of Computer Science and Technology, Nanjing University of Aeronautics and Astronautics, Nanjing 211106, China
Key Laboratory of Safety-Critical Software, Ministry of Industry and Information Technology, Nanjing 211106, China
Collaborative Innovation Center of Novel Software Technology and Industrialization, Nanjing 210000, China
Show Author Information

Abstract

Entity resolution (ER) is a significant task in data integration, which aims to detect all entity profiles that correspond to the same real-world entity. Due to its inherently quadratic complexity, blocking was proposed to ameliorate ER, and it offers an approximate solution which clusters similar entity profiles into blocks so that it suffices to perform pairwise comparisons inside each block in order to reduce the computational cost of ER. This paper presents a comprehensive survey on existing blocking technologies. We summarize and analyze all classic blocking methods with emphasis on different blocking construction and optimization techniques. We find that traditional blocking ER methods which depend on the fixed schema may not work in the context of highly heterogeneous information spaces. How to use schema information flexibly is of great significance to efficiently process data with the new features of this era. Machine learning is an important tool for ER, but end-to-end and efficient machine learning methods still need to be explored. We also sum up and provide the most promising trend for future work from the directions of real-time blocking ER, incremental blocking ER, deep learning with ER, etc.

Electronic Supplementary Material

Download File(s)
jcst-35-4-769-Highlights.pdf (227.2 KB)

References

[1]
Bhattacharya I, Getoor L. A Latent Dirichlet model for unsupervised entity resolution. In Proc. the 2006 SIAM International Conference on Data Mining, Apr. 2006, pp.47-58.
[2]

Liu X L, Wang H Z, Li J Z, Gao H. EntityManager: Managing dirty data based on entity resolution. J. Comput. Sci. Technol., 2017, 32(3): 644-662.

[3]

Winkler W E. Methods for evaluating and creating data quality. Inf. Syst., 2004, 29(7): 531-550.

[4]
Winkler WE. Overview of record linkage and current research directions. Technical Report, U.S. Bureau of the Census, 2006. http://citeseerx.ist.psu.edu/viewdoc/download;jsessionid=7D191FC85CD0D418884ACD2CECC2C190?doi=1-0.1.1.79.1519&rep=rep1&type=pdf, March 2020.
[5]

Newcombe H B, Kennedy J M, Axford S J, James A P. Automatic linkage of vital records. Science, 1959, 130(3381): 954-959.

[6]
Bhattacharya I, Getoor L. Iterative record linkage for cleaning and integration. In Proc. the 9th ACM SIGMOD Workshop on Research Issues in Data Mining and Knowledge Discovery, Jun. 2004, pp.11-18.
[7]
Pasula H, Marthi B, Milch B, Russell S J, Shpitser I. Identity uncertainty and citation matching. In Proc. the 2002 Annual Conference on Neural Information Processing Systems, Dec. 2002, pp.1401-1408.
[8]

Fan W F, Jia X B, Li J Z, Ma S. Reasoning about record matching rules. Proc. the VLDB Endowment, 2009, 2(1): 407-418.

[9]
Bilenko M, Mooney R J. Adaptive duplicate detection using learnable string similarity measures. In Proc. the 9th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, Aug. 2003, pp.39-48.
[10]
Bhattacharya I, Getoor L. Deduplication and group detection using links. In Proc. the 2004 ACM SIGKDD Workshop on Link Analysis and Group Detection, August 2004.
[11]

Getoor L, Machanavajjhala A. Entity resolution: Theory, practice & open challenges. Proc. the VLDB Endowment, 2012, 5(12): 2018-2019.

[12]

Christen P. A survey of indexing techniques for scalable record linkage and deduplication. IEEE Transactions on Knowledge and Data Engineering, 2011, 24(9): 1537-1555.

[13]
Christen P. Data Matching: Concepts and Techniques for Record Linkage, Entity Resolution, and Duplicate Detection. Springer, 2012.
[14]
Dong X L, Srivastava D. Big data integration. In Proc. the 29th International Conference on Data Engineering, Apr. 2013, pp.1245-1248.
[15]
Papadakis G, Skoutas D, Thanos E, Palpanas T. A survey of blocking and filtering techniques for entity resolution. arXiv: 1905.06167, 2019. https://arxiv.org/abs/1905.06167, March 2020.
[16]

Dunn H L. Record linkage. American Journal of Public Health and the Nation’s Health, 1946, 36(12): 1412-1416.

[17]

Christophides V, Efthymiou V, Stefanidis K. Entity resolution in the web of data. Synthesis Lectures on the Semantic Web, 2015, 5(3): 1-122.

[18]

Papadakis G, Ioannou E, Palpanas T, Niederée C, Nejdl W. A blocking framework for entity resolution in highly heterogeneous information spaces. IEEE Transactions on Knowledge and Data Engineering, 2012, 25(12): 2665-2682.

[19]
Papadakis G, Ioannou E, Niederée C, Palpanas T, Nejdl W. Eliminating the redundancy in blocking-based entity resolution methods. In Proc. the 11th Annual International ACM/IEEE Joint Conference on Digital Libraries, Jun. 2011, pp.85-94.
[20]

Papadakis G, Alexiou G, Papastefanatos G, Koutrika G. Schema-agnostic vs schema-based configurations for blocking methods on homogeneous data. Proc. the VLDB Endowment, 2015, 9(4): 312-323.

[21]

Fellegi I P, Sunter A B. A theory for record linkage. Journal of the American Statistical Association, 1969, 64(328): 1183-1210.

[22]
Aizawa A, Oyama K. A fast linkage detection scheme for multi-source information integration. In Proc. the 2005 International Workshop on Challenges in Web Information Retrieval and Integration, Apr. 2005, pp.30-39.
[23]
de Vries T, Ke H, Chawla S, Christen P. Robust record linkage blocking using suffix arrays. In Proc. the 18th ACM Conference on Information and Knowledge Management, Nov. 2009, pp.305-314.
[24]

Allam A, Skiadopoulos S, Kalnis P. Improved suffix blocking for record linkage and entity resolution. Data & Knowledge Engineering, 2018, 117: 98-113.

[25]
Gravano L, Ipeirotis P G, Jagadish H V, Koudas N, Muthukrishnan S, Srivastava D. Approximate string joins in a database (almost) for free. In Proc. the 27th International Conference on Very Large Data Bases, Sept. 2001, pp.491-500.
[26]
Baxter R, Christen P, Churches T. A comparison of fast blocking methods for record linkage. In Proc. the ACM SIGKDD 2003 Workshop on Data Cleaning, Record Linkage and Object Consolidation, Aug. 2003.
[27]

Kenig B, Gal A. MFIBlocks: An effective blocking algorithm for entity resolution. Information Systems, 2013, 38(6): 908-926.

[28]
McCallum A, Nigam K, Ungar L H. Efficient clustering of high-dimensional data sets with application to reference matching. In Proc. the 6th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, Aug. 2000, pp.169-178.
[29]

Hernández M A, Stolfo S J. The merge/purge problem for large databases. ACM SIGMOD Record, 1995, 24(2): 127-138.

[30]
Yan S, Lee D, Kan M Y, Giles L C. Adaptive sorted neighborhood methods for efficient record linkage. In Proc. the 7th ACM/IEEE Joint Conference on Digital Libraries, Jun. 2007, pp.185-194.
[31]
Draisbach U, Naumann F. A comparison and generalization of blocking and windowing algorithms for duplicate detection. In Proc. the International Workshop on Quality in Databases, Aug. 2009, pp.51-56.
[32]
Draisbach U, Naumann F, Szott S, Wonneberg O. Adaptive windows for duplicate detection. In Proc. the 28th IEEE International Conference on Data Engineering, Apr. 2012, pp.1073-1083.
[33]
Draisbach U, Naumann F. A generalization of blocking and windowing algorithms for duplicate detection. In Proc. the 2011 International Conference on Data and Knowledge Engineering, Sept. 2011, pp.18-24.
[34]
Papadakis G, Ioannou E, Niederée C, Fankhauser P. Efficient entity resolution for large heterogeneous information spaces. In Proc. the 4th International Conference on Web Search and Web Data Mining, Feb. 2011, pp.535-544.
[35]
Papadakis G, Ioannou E, Niederée C, Palpanas T, Nejdl W. Beyond 100 million entities: Large-scale blocking-based resolution for heterogeneous data. In Proc. the 5th International Conference on Web Search and Web Data Mining, Feb. 2012, pp.53-62.
[36]
Song D, Heflin J. Automatically generating data linkages using a domain-independent candidate selection approach. In Proc. the 10th International Semantic Web Conference, Oct. 2011, pp.649-664.
[37]
Nin J, Muntes-Mulero V, Martínez-Bazan N, Larriba-Pey J. On the use of semantic blocking techniques for data cleansing and integration. In Proc. the 11th International Database Engineering and Applications Symp., Sept. 2007, pp.190-198.
[38]
Bilenko M, Kamath B, Mooney R J. Adaptive blocking: Learning to scale up record linkage. In Proc. the 6th IEEE International Conference on Data Mining, Dec. 2006, pp.87-96.
[39]
Michelson M, Knoblock C A. Learning blocking schemes for record linkage. In Proc. the 21st National Conference on Artificial Intelligence and the 18th Innovative Applications of Artificial Intelligence Conference, Jul. 2006, pp.440-445.
[40]

Evangelista L O, Cortez E, da Silva A S, Meira W. Adaptive and flexible blocking for record linkage tasks. Journal of Information and Data Management, 2010, 1(2): 167-167.

[41]
Sarma A D, Jain A, Machanavajjhala A, Bohannon P. An automatic blocking mechanism for large-scale deduplication tasks. In Proc. the 21st ACM International Conference on Information and Knowledge Management, Oct. 2012, pp.1055-1064.
[42]

Giang P H. A machine learning approach to create blocking criteria for record linkage. Health Care Management Science, 2015, 18(1): 93-105.

[43]
Cao Y, Chen Z, Zhu J, Yue P, Lin C, Yu Y. Leveraging unlabeled data to scale blocking for record linkage. In Proc. the 22nd International Joint Conference on Artificial Intelligence, Jul. 2011, pp.2211-2217.
[44]
Mayank K, Daniel P M. An unsupervised algorithm for learning blocking schemes. In Proc. the 13th International Conference on Data Mining, Dec. 2019, pp.340-349.
[45]

O’Hare K, Jurek-Loughrey A, de Campos C. An unsupervised blocking technique for more efficient record linkage. Data & Knowledge Engineering, 2019, 122: 181-195.

[46]
Ma Y T, Tran T. TYPiMatch: Type-specific unsupervised learning of keys and key values for heterogeneous web data integration. In Proc. the 6th ACM International Conference on Web Search and Data Mining, Feb. 2013, pp.325-334.
[47]
Kejriwal M, Miranker D P. A two-step blocking scheme learner for scalable link discovery. In Proc. the 9th International Workshop on Ontology Matching Collocated with the 13th International Semantic Web Conference, Oct. 2014, pp.49-60.
[48]
Kejriwal M, Miranker D P. A DNF blocking scheme learner for heterogeneous datasets. arXiv: 1501.01694. https://arxiv.org/abs/1501.01694, Jan. 2020.
[49]
Papadakis G, Ioannou E, Niederée C, Palpanas T, Nejdl W. To compare or not to compare: Making entity resolution more efficient. In Proc. the International Workshop on Semantic Web Information Management, Jun. 2011, Article No. 3.
[50]

Papadakis G, Koutrika G, Palpanas T, Nejdl W. Metablocking: Taking entity resolution to the next level. IEEE Transactions on Knowledge and Data Engineering, 2013, 26(8): 1946-1960.

[51]

Ferragina P, Grossi R. The string B-tree: A new data structure for string search in external memory and its applications. Journal of the ACM, 1999, 46(2): 236-280.

[52]
Christen P. Towards parameter-free blocking for scalable record linkage. Technical Report, Faculty of Engineering and Information Technology, 2007. http://users.cecs.anu.edu.au/˜Peter.Christen/publications/tr-cs-07-03.pdf, March 2020.
[53]
Madhavan J, Jeffery S R, Cohen S, Dong X L, Ko D, Yu C, Halevy A. Web-scale data integration: You can only afford to pay as you go. In Proc. the 3rd Biennial Conference on Innovative Data Systems Research, Jan. 2007, pp.342-350.
[54]
Papadakis G, Palpanas T. Blocking for large-scale entity resolution: Challenges, algorithms, and practical examples. In Proc. the 32nd International Conference on Data Engineering, May 2016, pp.1436-1439.
[55]
Chaudhuri S, Chen B, Ganti V, Kaushik R. Example-driven design of efficient record matching queries. In Proc. the 33rd International Conference on Very Large Data Bases, Sept. 2007, pp.327-338.
[56]
Papadakis G, Demartini G, Fankhauser P, Kärger P. The missing links: Discovering hidden same-as links among a billion of triples. In Proc. the 12th International Conference on Information Integration and Web-Based Applications and Services, Nov. 2010, pp.453-460.
[57]
Winkler W E. Approximate string comparator search strategies for very large administrative lists. Technical Report, U.S. Census Bureau, 2005. http://citeseerx.ist.psu.edu/viewdoc/download?doi=10.1.1.79.402&rep=rep1&type=pdf, March 2020.
[58]

Valiant L G. A theory of the learnable. Communications of the ACM, 1984, 27(11): 1134-1142.

[59]

Suchanek F M, Abiteboul S, Senellart P. PARIS: Probabilistic alignment of relations, instances, and schema. Proceedings the VLDB Endowment, 2011, 5(3): 157-168.

[60]

Papadakis G, Svirsky J, Gal A, Palpanas T. Comparative analysis of approximate blocking techniques for entity resolution. Proceedings the VLDB Endowment, 2016, 9(9): 684-695.

[61]

Jonker R, Volgenant T. Improving the Hungarian assignment algorithm. Operations Research Letters, 1986, 5(4): 171-175.

[62]

Verikas A, Gelzinis A, Bacauskiene M. Mining data with random forests: A survey and results of new tests. Pattern Recognition, 2011, 44(2): 330-349.

[63]
Bilke A, Naumann F. Schema matching using duplicates. In Proc. the 21st International Conference on Data Engineering, Apr. 2005, pp.69-80.
[64]
Papadakis G, Papastefanatos G, Palpanas T, Koubarakis M. Scaling entity resolution to large, heterogeneous data with enhanced meta-blocking. In Proc. the 19th International Conference on Extending Database Technology, Mar. 2016, pp.221-232.
[65]
Fisher J, Christen P, Wang Q, Rahm E. A clustering-based framework to control block sizes for entity resolution. In Proc. the 21st International Conference on Knowledge Discovery and Data Mining, Aug. 2015, pp.279-288.
[66]
Whang S E, Menestrina D, Koutrika G, Theobald M, Garcia-Molina H. Entity resolution with iterative blocking. In Proc. the International Conference on Management of Data, Jun. 2009, pp.219-232.
[67]
Shu L, Chen A, Xiong M, Meng W. Efficient SPectrAl Neighborhood blocking for entity resolution. In Proc. the 27th International Conference on Data Engineering, Apr. 2011, pp.1067-1078.
[68]

Shi J, Malik J. Normalized cuts and image segmentation. IEEE Transactions on Pattern Analysis and Machine Intelligence, 2000, 22(8): 888-905.

[69]
Christen P, Gayler R, Hawking D. Similarity-aware indexing for real-time entity resolution. In Proc. the 18th ACM Conference on Information and Knowledge Management, Nov. 2009, pp.1565-1568.
[70]
Ramadan B, Christen P, Liang H, Gayler R W, Hawking D. Dynamic similarity-aware inverted indexing for real-time entity resolution. In Proc. the 2013 PAKDD Workshop on Data Mining Applications in Industry and Government, Apr. 2013, pp.47-58.
[71]
Ramadan B, Christen P. Forest-based dynamic sorted neighborhood indexing for real-time entity resolution. In Proc. the 23rd International Conference on Information and Knowledge Management, Nov. 2014, pp.1787-1790.
[72]

Ramadan B, Christen P, Liang H, Gayler R W. Dynamic sorted neighborhood indexing for real-time entity resolution. Journal of Data and Information Quality, 2015, 6(4): Article No. 15.

[73]
Rice S V. Braided AVL trees for efficient event sets and ranked sets in the SIMSCRIPT III simulation programming language. In Proc. the 2007 Western Multiconference on Computer Simulation, Jan. 2007, pp.150-155.
[74]
Ramadan B, Christen P. Unsupervised blocking key selection for real-time entity resolution. In Proc. the 19th Pacific-Asia Conference on Knowledge Discovery and Data Mining, May 2015, pp.574-585.
[75]
Liang H, Wang Y, Christen P, Gayler R. Noise-tolerant approximate blocking for dynamic real-time entity resolution. In Proc. the 18th Pacific-Asia Conference on Knowledge Discovery and Data Mining, May 2014, pp.449-460.
[76]

Benjelloun O, Garcia-Molina H, Menestrina D, Su Q, Whang S E, Widom J. Swoosh: A generic approach to entity resolution. The VLDB Journal, 2009, 18(1): 255-276.

[77]
Araújo B T, Stefanidis K, Pires C E S, Nummenmaa J, da Nóbrega P T. Incremental blocking for entity resolution over web streaming data. In Proc. the 2019 IEEE/WIC/ACM International Conference on Web Intelligence, Oct. 2019, pp.332-336.
[78]

Ebraheem M, Thirumuruganathan S, Joty S, Ouzzani M, Tang N. Distributed representations of tuples for entity resolution. Proc. the VLDB Endowment, 2018, 11(11): 1454-1467.

[79]
Mudgal S, Li H, Rekatsinas T, Doan A, Park Y, Krishnan G, Deep R, Arcaute E, Raghavendra V. Deep learning for entity matching: A design space exploration. In Proc. the 2018 International Conference on Management of Data, Jun. 2018, pp.19-34.
[80]
di Cicco V, Firmani D, Koudas N, Merialdo P, Srivastava D. Interpreting deep learning models for entity resolution: An experience report using LIME. In Proc. the 2nd International Workshop on Exploiting Artificial Intelligence Techniques for Data Management, Jul. 2019, Article No. 8.
[81]

Papenbrock T, Heise A, Naumann F. Progressive duplicate detection. IEEE Transactions on Knowledge and Data Engineering, 2014, 27(5): 1316-1329.

[82]

Whang S E, Marmaros D, Garcia-Molina H. Pay-as-you-go entity resolution. IEEE Transactions on Knowledge and Data Engineering, 2012, 25(5): 1111-1124.

[83]

Altowim Y, Kalashnikov D V, Mehrotra S. Progressive approach to relational entity resolution. Proc. the VLDB Endowment, 2014, 7(11): 999-1010.

[84]
Altowim Y, Mehrotra S. Parallel progressive approach to entity resolution using MapReduce. In Proc. the 33rd IEEE International Conference on Data Engineering, Apr. 2017, pp.909-920.
[85]

Bhattacharya I, Getoor L. Collective entity resolution in relational data. ACM Transactions on Knowledge Discovery from Data, 2007, 1(1): Article No. 5.

[86]
Globerson A, Lazic N, Chakrabarti S, Subramanya A, Ringaard M, Pereira F. Collective entity resolution with multi-focal attention. In Proc. the 54th Annual Meeting of the Association for Computational Linguistics, Aug. 2016.
[87]
Kouki P, Pujara J, Marcum C, Koehly L, Getoor L. Collective entity resolution in familial networks. In Proc. the 2017 IEEE International Conference on Data Mining, Nov. 2017, pp.227-236.
[88]
de Assis Costa G, de Oliveira J M P. A relational learning approach for collective entity resolution in the web of data. In Proc. the 5th International Workshop on Consuming Linked Data Co-Located with the 13th International Semantic Web Conference, Oct. 2014.
[89]

Kouki P, Pujara J, Marcum C, Koehly L, Getoor L. Collective entity resolution in multi-relational familial networks. Knowledge and Information Systems, 2019, 61(3): 1547-1581.

[90]
Wang J, Kraska T, Franklin M J, Feng J. CrowdER: Crowdsourcing entity resolution. arXiv: 1208.1927, 2012. http://arxiv.org/abs/1208.1927, Aug. 2018.
[91]

Vesdapunt N, Bellare K, Dalvi N. Crowdsourcing algorithms for entity resolution. Proc. the VLDB Endowment, 2014, 7(12): 1071-1082.

[92]

Gong S S, Hu W, Ge W Y, Qu Y Z. Modeling topic-based human expertise for crowd entity resolution. Journal of Computer Science and Technology, 2018, 33(6): 1204-1218.

[93]

Zhang A Z, Li J Z, Gao H, Chen Y B, Ma H Z, Bah M J. CrowdOLA: Online aggregation on duplicate data powered by crowdsourcing. Journal of Computer Science and Technology, 2018, 33(2): 366-379.

[94]
Mazumdar A, Saha B. A theoretical analysis of first heuristics of crowdsourced entity resolution. In Proc. the 31st AAAI Conference on Artificial Intelligence, Feb. 2017, pp.970-976.
[95]

Chai C, Li G, Li J, Deng D, Feng J. A partial-order-based framework for cost-effective crowdsourced entity resolution. The VLDB Journal, 2018, 27(6): 745-770.

[96]

Maskat R, Paton N W, Embury S M. Pay-as-you-go configuration of entity resolution. Transactions on Large-Scale Data-and Knowledge-Centered Systems, 2016, 29: 40-65.

[97]
Li H, Konda P, G.C. P S, Doan A, Snyder B, Park, Y, Krishnan G, Deep R, Raghavendra V. MatchCatcher: A debugger for blocking in entity matching. In Proc. the 21st International Conference on Extending Database Technology, Mar. 2018, pp.193-204.
[98]
Papadakis G, Giannakopoulos G, Niederée C, Palpanas T, Nejdl W. Detecting and exploiting stability in evolving heterogeneous information spaces. In Proc. the 11th Annual International ACM/IEEE Joint Conference on Digital Libraries, Jun. 2011, pp.95-104.
Journal of Computer Science and Technology
Pages 769-793
Cite this article:
Li B-H, Liu Y, Zhang A-M, et al. A Survey on Blocking Technology of Entity Resolution. Journal of Computer Science and Technology, 2020, 35(4): 769-793. https://doi.org/10.1007/s11390-020-0350-4

334

Views

80

Crossref

N/A

Web of Science

88

Scopus

1

CSCD

Altmetrics

Received: 30 January 2020
Revised: 08 June 2020
Published: 27 July 2020
©Institute of Computing Technology, Chinese Academy of Sciences 2020
Return