2012 | OriginalPaper | Book chapter

Statistical Perspective on Blocking Methods When Linking Large Data-sets

Written by: Nicoletta Cibella, Tiziana Tuoto

Published in: Advanced Statistical Methods for the Analysis of Large Data-Sets

Publisher: Springer Berlin Heidelberg

Abstract

The combined use of data from different sources is now widespread. Record linkage is a complex process that aims to recognize the same real-world entity when it is represented differently across data sources. Large data-sets raise many problems, both computational and statistical. The well-known blocking methods can reduce the number of record comparisons to a manageable size. Research and debate on this topic are lively among information technology scientists, whereas the statistical implications of the different blocking methods are often neglected. This work highlights the advantages and disadvantages of the main blocking methods for successfully carrying out probabilistic record linkage on large data-sets, with emphasis on the statistical point of view.
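
To make the idea of blocking concrete, the following is a minimal sketch and not the authors' implementation. It assumes two small, hypothetical files of records and a hypothetical blocking key (year of birth), and it compares the number of candidate pairs with and without blocking. The record values and the function names `full_pairs` and `blocked_pairs` are illustrative assumptions.

```python
from collections import defaultdict
from itertools import product

# Hypothetical toy records; field names and values are assumptions.
file_a = [
    {"surname": "Rossi", "year": 1970},
    {"surname": "Bianchi", "year": 1985},
    {"surname": "Russo", "year": 1970},
]
file_b = [
    {"surname": "Rossi", "year": 1970},
    {"surname": "Bianchi", "year": 1958},  # year mistyped in this source
    {"surname": "Ferrari", "year": 1970},
    {"surname": "Russo", "year": 1970},
]


def full_pairs(a, b):
    """All index pairs (i, j): the full cross product of the two files."""
    return list(product(range(len(a)), range(len(b))))


def blocked_pairs(a, b, key):
    """Index pairs (i, j) whose records share the same blocking key value."""
    index = defaultdict(list)
    for j, rec in enumerate(b):
        index[key(rec)].append(j)
    pairs = []
    for i, rec in enumerate(a):
        # Only records of b in the same block are compared with a[i].
        for j in index.get(key(rec), []):
            pairs.append((i, j))
    return pairs


def by_year(rec):
    """Hypothetical blocking key: year of birth."""
    return rec["year"]


all_pairs = full_pairs(file_a, file_b)
candidates = blocked_pairs(file_a, file_b, by_year)

print(len(all_pairs), "pairs without blocking")
print(len(candidates), "pairs after blocking on year")
print("Bianchi pair kept:", (1, 1) in candidates)
```

In this toy setting, blocking halves the number of comparisons but also drops the pair of "Bianchi" records, because an error in the blocking variable places them in different blocks. This is one simple way to see why the choice of blocking method has statistical consequences as well as computational ones.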

Metadata
Title
Statistical Perspective on Blocking Methods When Linking Large Data-sets
Written by
Nicoletta Cibella
Tiziana Tuoto
Copyright year
2012
Publisher
Springer Berlin Heidelberg
DOI
https://doi.org/10.1007/978-3-642-21037-2_8