Skip to main content
Erschienen in: Knowledge and Information Systems 2/2016

01.02.2016 | Regular Paper

Efficient entity resolution based on subgraph cohesion

verfasst von: Hongzhi Wang, Jianzhong Li, Hong Gao

Erschienen in: Knowledge and Information Systems | Ausgabe 2/2016

Einloggen

Aktivieren Sie unsere intelligente Suche, um passende Fachinhalte oder Patente zu finden.

search-config
loading …

Abstract

Entity resolution has wide applications and receives considerable attentions in literature. For entity resolution, similarity functions are often used to judge whether two data objects refer to the same real-world entity. However, the similar relations determined by many commonly used similarity functions lack transitivity. This fact results in the conflict that \(A\) and \(B\) refer to the same entity and \(B\) and \(C\) refer to the same entity, but \(A\) and \(C\) do not refer to the same entity. To address this problem and make the group-wise entity resolution results consistent with pairwise entity resolution, this paper models the entity resolution problem as the partition of the vertices in a weighted graph into cohesive subgraphs, which is proven to be co-NP-complete. To solve this problem, an approximate algorithm with approximation ratio bound is proposed. For performing entity resolution on a large data set efficiently, a heuristic algorithm is developed to address this problem. In order to implement the heuristic algorithm efficiently, a similarity measure compatible with many measures in common usage is presented. With such similarity measure, indices and efficient implementations for the heuristic algorithm are proposed. Extensive experiments have been performed to verify the efficiency and effectiveness of the methods in this paper.

Sie haben noch keine Lizenz? Dann Informieren Sie sich jetzt über unsere Produkte:

Springer Professional "Wirtschaft+Technik"

Online-Abonnement

Mit Springer Professional "Wirtschaft+Technik" erhalten Sie Zugriff auf:

  • über 102.000 Bücher
  • über 537 Zeitschriften

aus folgenden Fachgebieten:

  • Automobil + Motoren
  • Bauwesen + Immobilien
  • Business IT + Informatik
  • Elektrotechnik + Elektronik
  • Energie + Nachhaltigkeit
  • Finance + Banking
  • Management + Führung
  • Marketing + Vertrieb
  • Maschinenbau + Werkstoffe
  • Versicherung + Risiko

Jetzt Wissensvorsprung sichern!

Springer Professional "Wirtschaft"

Online-Abonnement

Mit Springer Professional "Wirtschaft" erhalten Sie Zugriff auf:

  • über 67.000 Bücher
  • über 340 Zeitschriften

aus folgenden Fachgebieten:

  • Bauwesen + Immobilien
  • Business IT + Informatik
  • Finance + Banking
  • Management + Führung
  • Marketing + Vertrieb
  • Versicherung + Risiko




Jetzt Wissensvorsprung sichern!

Springer Professional "Technik"

Online-Abonnement

Mit Springer Professional "Technik" erhalten Sie Zugriff auf:

  • über 67.000 Bücher
  • über 390 Zeitschriften

aus folgenden Fachgebieten:

  • Automobil + Motoren
  • Bauwesen + Immobilien
  • Business IT + Informatik
  • Elektrotechnik + Elektronik
  • Energie + Nachhaltigkeit
  • Maschinenbau + Werkstoffe




 

Jetzt Wissensvorsprung sichern!

Literatur
4.
Zurück zum Zitat Arasu A, Ré C, Suciu D (2009) Large-scale deduplication with constraints using dedupalog. In: ICDE, pp 952–963 Arasu A, Ré C, Suciu D (2009) Large-scale deduplication with constraints using dedupalog. In: ICDE, pp 952–963
5.
Zurück zum Zitat Aslam JA, Pelekhov E, Rus D (2004) The star clustering algorithm for static and dynamic information organization. J Graph Algorithms Appl 8:95–129MathSciNetCrossRefMATH Aslam JA, Pelekhov E, Rus D (2004) The star clustering algorithm for static and dynamic information organization. J Graph Algorithms Appl 8:95–129MathSciNetCrossRefMATH
6.
Zurück zum Zitat Bayardo RJ, Ma Y, Srikant R (2007) Scaling up all pairs similarity search. In: WWW, pp 131–140 Bayardo RJ, Ma Y, Srikant R (2007) Scaling up all pairs similarity search. In: WWW, pp 131–140
7.
Zurück zum Zitat Benjelloun O, Garcia-Molina H, Menestrina D, Su Q, Whang SE, Widom J (2009) Swoosh: a generic approach to entity resolution. VLDB J 18(1):255–276CrossRef Benjelloun O, Garcia-Molina H, Menestrina D, Su Q, Whang SE, Widom J (2009) Swoosh: a generic approach to entity resolution. VLDB J 18(1):255–276CrossRef
8.
Zurück zum Zitat Bhattacharya I, Getoor L (2004) Iterative record linkage for cleaning and integration. In: DMKD, pp 11–18 Bhattacharya I, Getoor L (2004) Iterative record linkage for cleaning and integration. In: DMKD, pp 11–18
9.
Zurück zum Zitat Chaudhuri S, Chen B-C, Ganti V, Kaushik R (2007) Example-driven design of efficient record matching queries. In: VLDB, pp 327–338 Chaudhuri S, Chen B-C, Ganti V, Kaushik R (2007) Example-driven design of efficient record matching queries. In: VLDB, pp 327–338
10.
Zurück zum Zitat Chaudhuri S, Ganti V, Kaushik R (2006) A primitive operator for similarity joins in data cleaning. In: ICDE, p 5 Chaudhuri S, Ganti V, Kaushik R (2006) A primitive operator for similarity joins in data cleaning. In: ICDE, p 5
11.
Zurück zum Zitat Chaudhuri S, Sarma AD, Ganti V, Kaushik R (2007) Leveraging aggregate constraints for deduplication. In: SIGMOD conference, pp 437–448 Chaudhuri S, Sarma AD, Ganti V, Kaushik R (2007) Leveraging aggregate constraints for deduplication. In: SIGMOD conference, pp 437–448
12.
Zurück zum Zitat Chen Z, Kalashnikov DV, Mehrotra S (2009) Exploiting context analysis for combining multiple entity resolution systems. In: SIGMOD conference, pp 207–218 Chen Z, Kalashnikov DV, Mehrotra S (2009) Exploiting context analysis for combining multiple entity resolution systems. In: SIGMOD conference, pp 207–218
13.
Zurück zum Zitat Dong X, Halevy AY, Madhavan J (2005) Reference reconciliation in complex information spaces. In: SIGMOD conference, pp 85–96 Dong X, Halevy AY, Madhavan J (2005) Reference reconciliation in complex information spaces. In: SIGMOD conference, pp 85–96
14.
Zurück zum Zitat Duda R, Hart P, Stork D (2001) Pattern classification. Wiley, Hoboken, NJMATH Duda R, Hart P, Stork D (2001) Pattern classification. Wiley, Hoboken, NJMATH
15.
Zurück zum Zitat Elmagarmid AK, Ipeirotis PG, Verykios VS (2007) Duplicate record detection: a survey. IEEE Trans Knowl Data Eng 19(1):1–16CrossRef Elmagarmid AK, Ipeirotis PG, Verykios VS (2007) Duplicate record detection: a survey. IEEE Trans Knowl Data Eng 19(1):1–16CrossRef
16.
Zurück zum Zitat Hassanzadeh O, Chiang F, Miller RJ, Lee HC (2009) Framework for evaluating clustering algorithms in duplicate detection. PVLDB 2(1):1282–1293 Hassanzadeh O, Chiang F, Miller RJ, Lee HC (2009) Framework for evaluating clustering algorithms in duplicate detection. PVLDB 2(1):1282–1293
17.
Zurück zum Zitat Hassanzadeh O, Miller RJ (2009) Creating probabilistic databases from duplicated data. VLDB J 18(5):1141–1166CrossRef Hassanzadeh O, Miller RJ (2009) Creating probabilistic databases from duplicated data. VLDB J 18(5):1141–1166CrossRef
18.
Zurück zum Zitat Jiang Y, Li G, Feng J, Li W (2014) String similarity joins: an experimental evaluation. PVLDB 7(8):625–636 Jiang Y, Li G, Feng J, Li W (2014) String similarity joins: an experimental evaluation. PVLDB 7(8):625–636
19.
Zurück zum Zitat Kalashnikov DV, Mehrotra S (2006) Domain-independent data cleaning via analysis of entity-relationship graph. ACM Trans Database Syst 31(2):716–767CrossRef Kalashnikov DV, Mehrotra S (2006) Domain-independent data cleaning via analysis of entity-relationship graph. ACM Trans Database Syst 31(2):716–767CrossRef
20.
Zurück zum Zitat Kim HS, Lee D (2010) Harra: fast iterative hashed record linkage for large-scale data collections. In: EDBT, pp 525–536 Kim HS, Lee D (2010) Harra: fast iterative hashed record linkage for large-scale data collections. In: EDBT, pp 525–536
21.
Zurück zum Zitat Koudas N, Saha A, Srivastava D, Venkatasubramanian S (2009) Metric functional dependencies. In: ICDE, pp 1275–1278 Koudas N, Saha A, Srivastava D, Venkatasubramanian S (2009) Metric functional dependencies. In: ICDE, pp 1275–1278
22.
Zurück zum Zitat Manning CD, Schütze H (1999) Foundations of statistical natural language processing. The MIT Press, Cambridge, MAMATH Manning CD, Schütze H (1999) Foundations of statistical natural language processing. The MIT Press, Cambridge, MAMATH
23.
Zurück zum Zitat Menestrina D, Whang S, Garcia-Molina H (2010) Evaluating entity resolution results. PVLDB 3(1):208–219 Menestrina D, Whang S, Garcia-Molina H (2010) Evaluating entity resolution results. PVLDB 3(1):208–219
24.
Zurück zum Zitat Micali S, Vazirani VV (1980) An \(\text{ o }(\sqrt{|V|}|e|\)) algorithm for finding maximum matching in general graphs. In: FOCS, pp 17–27 Micali S, Vazirani VV (1980) An \(\text{ o }(\sqrt{|V|}|e|\)) algorithm for finding maximum matching in general graphs. In: FOCS, pp 17–27
25.
Zurück zum Zitat Michelson M, Knoblock CA (2006) Learning blocking schemes for record linkage. In: AAAI Michelson M, Knoblock CA (2006) Learning blocking schemes for record linkage. In: AAAI
26.
Zurück zum Zitat Michelson M, Knoblock CA (2009) Mining the heterogeneous transformations between data sources to aid record linkage. In: IC-AI, pp 422–428 Michelson M, Knoblock CA (2009) Mining the heterogeneous transformations between data sources to aid record linkage. In: IC-AI, pp 422–428
27.
Zurück zum Zitat Sarawagi S, Bhamidipaty A (2002) Interactive deduplication using active learning. In: KDD, pp 269–278 Sarawagi S, Bhamidipaty A (2002) Interactive deduplication using active learning. In: KDD, pp 269–278
29.
Zurück zum Zitat Shen W, DeRose P, Vu L, Doan A, Ramakrishnan R (2007) Source-aware entity matching: a compositional approach. In: ICDE, pp 196–205 Shen W, DeRose P, Vu L, Doan A, Ramakrishnan R (2007) Source-aware entity matching: a compositional approach. In: ICDE, pp 196–205
30.
Zurück zum Zitat Shen W, Li X, Doan A (2005) Constraint-based entity matching. In: AAAI, pp 862–867 Shen W, Li X, Doan A (2005) Constraint-based entity matching. In: AAAI, pp 862–867
31.
Zurück zum Zitat Whang SE, Menestrina D, Koutrika G, Theobald M, Garcia-Molina H (2009) Entity resolution with iterative blocking. In: SIGMOD conference, pp 219–232 Whang SE, Menestrina D, Koutrika G, Theobald M, Garcia-Molina H (2009) Entity resolution with iterative blocking. In: SIGMOD conference, pp 219–232
32.
Zurück zum Zitat Xiao C, Wang W, Lin X, Yu JX (2008) Efficient similarity joins for near duplicate detection. In: WWW, pp 131–140 Xiao C, Wang W, Lin X, Yu JX (2008) Efficient similarity joins for near duplicate detection. In: WWW, pp 131–140
33.
Zurück zum Zitat Yang X, Wang B, Li C (2008) Cost-based variable-length-gram selection for string collections to support approximate queries efficiently. In: SIGMOD conference, pp 353–364 Yang X, Wang B, Li C (2008) Cost-based variable-length-gram selection for string collections to support approximate queries efficiently. In: SIGMOD conference, pp 353–364
34.
Zurück zum Zitat Yannakakis M (1978) Node- and edge-deletion np-complete problems. In: STOC, pp 253–264 Yannakakis M (1978) Node- and edge-deletion np-complete problems. In: STOC, pp 253–264
35.
Zurück zum Zitat Yin X, Han J, Yu PS (2007) Object distinction: distinguishing objects with identical names. In: ICDE, pp 1242–1246 Yin X, Han J, Yu PS (2007) Object distinction: distinguishing objects with identical names. In: ICDE, pp 1242–1246
Metadaten
Titel
Efficient entity resolution based on subgraph cohesion
verfasst von
Hongzhi Wang
Jianzhong Li
Hong Gao
Publikationsdatum
01.02.2016
Verlag
Springer London
Erschienen in
Knowledge and Information Systems / Ausgabe 2/2016
Print ISSN: 0219-1377
Elektronische ISSN: 0219-3116
DOI
https://doi.org/10.1007/s10115-015-0818-7

Weitere Artikel der Ausgabe 2/2016

Knowledge and Information Systems 2/2016 Zur Ausgabe

Premium Partner