Skip to main content
Top

2018 | OriginalPaper | Chapter

Using Metric Space Indexing for Complete and Efficient Record Linkage

Authors : Özgür Akgün, Alan Dearle, Graham Kirby, Peter Christen

Published in: Advances in Knowledge Discovery and Data Mining

Publisher: Springer International Publishing

Activate our intelligent search to find suitable subject content or patents.

search-config
loading …

Abstract

Record linkage is the process of identifying records that refer to the same real-world entities in situations where entity identifiers are unavailable. Records are linked on the basis of similarity between common attributes, with every pair being classified as a link or non-link depending on their similarity. Linkage is usually performed in a three-step process: first, groups of similar candidate records are identified using indexing, then pairs within the same group are compared in more detail, and finally classified. Even state-of-the-art indexing techniques, such as locality sensitive hashing, have potential drawbacks. They may fail to group together some true matching records with high similarity, or they may group records with low similarity, leading to high computational overhead. We propose using metric space indexing (MSI) to perform complete linkage, resulting in a parameter-free process combining indexing, comparison and classification into a single step delivering complete and efficient record linkage. An evaluation on real-world data from several domains shows that linkage using MSI can yield better quality than current indexing techniques, with similar execution cost, without the need for domain knowledge or trial and error to configure the process.

Dont have a licence yet? Then find out more about our products and how to get one now:

Springer Professional "Wirtschaft+Technik"

Online-Abonnement

Mit Springer Professional "Wirtschaft+Technik" erhalten Sie Zugriff auf:

  • über 102.000 Bücher
  • über 537 Zeitschriften

aus folgenden Fachgebieten:

  • Automobil + Motoren
  • Bauwesen + Immobilien
  • Business IT + Informatik
  • Elektrotechnik + Elektronik
  • Energie + Nachhaltigkeit
  • Finance + Banking
  • Management + Führung
  • Marketing + Vertrieb
  • Maschinenbau + Werkstoffe
  • Versicherung + Risiko

Jetzt Wissensvorsprung sichern!

Springer Professional "Technik"

Online-Abonnement

Mit Springer Professional "Technik" erhalten Sie Zugriff auf:

  • über 67.000 Bücher
  • über 390 Zeitschriften

aus folgenden Fachgebieten:

  • Automobil + Motoren
  • Bauwesen + Immobilien
  • Business IT + Informatik
  • Elektrotechnik + Elektronik
  • Energie + Nachhaltigkeit
  • Maschinenbau + Werkstoffe




 

Jetzt Wissensvorsprung sichern!

Springer Professional "Wirtschaft"

Online-Abonnement

Mit Springer Professional "Wirtschaft" erhalten Sie Zugriff auf:

  • über 67.000 Bücher
  • über 340 Zeitschriften

aus folgenden Fachgebieten:

  • Bauwesen + Immobilien
  • Business IT + Informatik
  • Finance + Banking
  • Management + Führung
  • Marketing + Vertrieb
  • Versicherung + Risiko




Jetzt Wissensvorsprung sichern!

Footnotes
1
Experimental data, additional figures and source code can be downloaded from: http://​github.​com/​digitisingscotla​nd/​pakdd2018-metric-linkage.
 
2
Relatively high Levenshtein edit distances are included since Cora contains a number of low-similarity true matches.
 
3
Noting that recent research identifies some problematic aspects with using the F-measure to compare record linkage procedures at different similarity thresholds [14].
 
Literature
1.
go back to reference Bilenko, M., Kamath, B., Mooney, R.J.: Adaptive blocking: learning to scale up record linkage. In: IEEE ICDM, Hong Kong, pp. 87–96 (2006) Bilenko, M., Kamath, B., Mooney, R.J.: Adaptive blocking: learning to scale up record linkage. In: IEEE ICDM, Hong Kong, pp. 87–96 (2006)
3.
go back to reference Broder, A.: On the resemblance and containment of documents. In: IEEE Compression and Complexity of Sequences, Salerno, Italy, pp. 21–29 (1997) Broder, A.: On the resemblance and containment of documents. In: IEEE Compression and Complexity of Sequences, Salerno, Italy, pp. 21–29 (1997)
5.
go back to reference Christen, P.: A survey of indexing techniques for scalable record linkage and deduplication. IEEE TKDE 24(9), 1537–1555 (2012) Christen, P.: A survey of indexing techniques for scalable record linkage and deduplication. IEEE TKDE 24(9), 1537–1555 (2012)
6.
go back to reference Ciaccia, P., Patella, M., Rabitti, F., Zezula, P.: Indexing metric spaces with M-tree. In: Italian Symposium on Advanced Database Systems 1997, pp. 67–86 (1997) Ciaccia, P., Patella, M., Rabitti, F., Zezula, P.: Indexing metric spaces with M-tree. In: Italian Symposium on Advanced Database Systems 1997, pp. 67–86 (1997)
10.
go back to reference Dong, X.L., Srivastava, D.: Big data integration. Synth. Lect. Data Manag. 7(1), 1–198 (2015)CrossRef Dong, X.L., Srivastava, D.: Big data integration. Synth. Lect. Data Manag. 7(1), 1–198 (2015)CrossRef
11.
go back to reference Draisbach, U., Naumann, F., Szott, S., Wonneberg, O.: Adaptive windows for duplicate detection. In: IEEE ICDE, Washington, DC, pp. 1073–1083 (2012) Draisbach, U., Naumann, F., Szott, S., Wonneberg, O.: Adaptive windows for duplicate detection. In: IEEE ICDE, Washington, DC, pp. 1073–1083 (2012)
12.
go back to reference Fellegi, I.P., Sunter, A.B.: A theory for record linkage. J. Am. Stat. Assoc. 64(328), 1183–1210 (1969)CrossRef Fellegi, I.P., Sunter, A.B.: A theory for record linkage. J. Am. Stat. Assoc. 64(328), 1183–1210 (1969)CrossRef
13.
go back to reference Fisher, J., Wang, Q.: Unsupervised measuring of entity resolution consistency. In: IEEE ICDM DINA Workshop, pp. 218–221 (2015) Fisher, J., Wang, Q.: Unsupervised measuring of entity resolution consistency. In: IEEE ICDM DINA Workshop, pp. 218–221 (2015)
14.
go back to reference Hand, D., Christen, P.: A note on using the F-measure for evaluating record linkage algorithms. Stat. Comput. 28(3), 539–547 (2018)MathSciNetCrossRef Hand, D., Christen, P.: A note on using the F-measure for evaluating record linkage algorithms. Stat. Comput. 28(3), 539–547 (2018)MathSciNetCrossRef
15.
go back to reference Hjaltason, G.R., Samet, H.: Incremental distance join algorithms for spatial databases. SIGMOD Rec. 27(2), 237–248 (1998)CrossRef Hjaltason, G.R., Samet, H.: Incremental distance join algorithms for spatial databases. SIGMOD Rec. 27(2), 237–248 (1998)CrossRef
16.
go back to reference Indyk, P., Motwani, R.: Approximate nearest neighbors: towards removing the curse of dimensionality. In: ACM TOC, Dallas, pp. 604–613 (1998) Indyk, P., Motwani, R.: Approximate nearest neighbors: towards removing the curse of dimensionality. In: ACM TOC, Dallas, pp. 604–613 (1998)
17.
go back to reference Kejriwal, M., Miranker, D.P.: An unsupervised algorithm for learning blocking schemes. In: IEEE ICDM, Dallas, pp. 340–349 (2013) Kejriwal, M., Miranker, D.P.: An unsupervised algorithm for learning blocking schemes. In: IEEE ICDM, Dallas, pp. 340–349 (2013)
18.
go back to reference Kim, H., Lee, D.: HARRA: fast iterative hashed record linkage for large-scale data collections. In: EDBT, Lausanne, pp. 525–536 (2010) Kim, H., Lee, D.: HARRA: fast iterative hashed record linkage for large-scale data collections. In: EDBT, Lausanne, pp. 525–536 (2010)
19.
go back to reference Levenshtein, V.: Binary codes capable of correcting deletions, insertions and reversals. Cybern. Control Theory 10, 707–710 (1966)MathSciNetMATH Levenshtein, V.: Binary codes capable of correcting deletions, insertions and reversals. Cybern. Control Theory 10, 707–710 (1966)MathSciNetMATH
20.
go back to reference Li, C., Jin, L., Mehrotra, S.: Supporting efficient record linkage for large data sets using mapping techniques. World Wide Web 9(4), 557–584 (2006)CrossRef Li, C., Jin, L., Mehrotra, S.: Supporting efficient record linkage for large data sets using mapping techniques. World Wide Web 9(4), 557–584 (2006)CrossRef
22.
go back to reference Michelson, M., Knoblock, C.A.: Learning blocking schemes for record linkage. In: AAAI, Boston (2006) Michelson, M., Knoblock, C.A.: Learning blocking schemes for record linkage. In: AAAI, Boston (2006)
23.
go back to reference Monge, A.E., Elkan, C.P.: The field-matching problem: algorithm and applications. In: ACM SIGKDD, Portland, pp. 267–270 (1996) Monge, A.E., Elkan, C.P.: The field-matching problem: algorithm and applications. In: ACM SIGKDD, Portland, pp. 267–270 (1996)
24.
go back to reference Newcombe, H., Kennedy, J., Axford, S., James, A.: Automatic linkage of vital records. Science 130(3381), 954–959 (1959)CrossRef Newcombe, H., Kennedy, J., Axford, S., James, A.: Automatic linkage of vital records. Science 130(3381), 954–959 (1959)CrossRef
25.
go back to reference Papadakis, G., Svirsky, J., Gal, A., Palpanas, T.: Comparative analysis of approximate blocking techniques for entity resolution. PVLDB 9(9), 684–695 (2016) Papadakis, G., Svirsky, J., Gal, A., Palpanas, T.: Comparative analysis of approximate blocking techniques for entity resolution. PVLDB 9(9), 684–695 (2016)
27.
go back to reference Reid, A., Garrett, E., Davies, R., Blaikie, A.: Scottish census enumerators’ books: Skye, Kilmarnock, Rothiemay and Torthorwald, 1861–1901. Economic and Social Data Service (2006) Reid, A., Garrett, E., Davies, R., Blaikie, A.: Scottish census enumerators’ books: Skye, Kilmarnock, Rothiemay and Torthorwald, 1861–1901. Economic and Social Data Service (2006)
28.
go back to reference Reid, A., Davies, R., Garrett, E.: Nineteenth-century Scottish demography from linked censuses and civil registers: a ‘sets of related individuals’ approach. History Comput. 14(1–2), 61–86 (2002)CrossRef Reid, A., Davies, R., Garrett, E.: Nineteenth-century Scottish demography from linked censuses and civil registers: a ‘sets of related individuals’ approach. History Comput. 14(1–2), 61–86 (2002)CrossRef
Metadata
Title
Using Metric Space Indexing for Complete and Efficient Record Linkage
Authors
Özgür Akgün
Alan Dearle
Graham Kirby
Peter Christen
Copyright Year
2018
DOI
https://doi.org/10.1007/978-3-319-93040-4_8

Premium Partner