2016 | OriginalPaper | Chapter

8. Object Identification

Authors: Carlo Batini, Monica Scannapieco

Published in: Data and Information Quality

Publisher: Springer International Publishing

Abstract

In this chapter we address object identification (IQ), the most important and most extensively investigated information quality activity. Given its importance, we dedicate two chapters of the book to object identification: this chapter focuses on consolidated techniques, and the next on recent advancements.

Title: Object Identification
Authors: Carlo Batini, Monica Scannapieco
Publisher: Springer International Publishing
Book: Data and Information Quality
Print ISBN: 978-3-319-24104-3

Electronic ISBN: 978-3-319-24106-7

Copyright Year: 2016
DOI: https://doi.org/10.1007/978-3-319-24106-7_8
