
2011 | OriginalPaper | Chapter

14. Record Linkage Methodology and Applications

Author: Ling Qin Zhang

Published in: Handbook of Data Intensive Computing

Publisher: Springer New York


Abstract

As information technology advances and the Internet grows, more and more business is conducted electronically and globally. Individuals and organizations have more channels through which to publish and gather information. As a result, they face a growing challenge: processing large volumes of data and finding relevant, high-quality information that fits their specific business needs. In addition, data gathered from multiple sources usually contains errors and duplicate information, so duplicates need to be detected and removed in the data preparation phase, before advanced data mining is performed [1, 2, 3, 4]. In other cases, data from a single source is not enough to provide a complete view of a person or entity. Such data must be linked or integrated to provide a single complete view of a person, a product, an object, a geographical area, or any other entity, to meet the needs of a specific business application [5, 6].


Literature
1. W. E. Winkler. Record Linkage Software and Methods for Merging Administrative Lists, Bureau of the Census, Statistical Research Division, Statistical Research Report Series.
2. P. Christen, A two-step classification approach to unsupervised record linkage. In AusDM’07, CRPIT vol. 70, pages 111–119, Gold Coast, Australia, 2007.
3. P. Christen and K. Goiser (2005), Assessing deduplication and data linkage quality: What to measure, in ‘Proceedings of the fourth Australasian Data Mining Conference (AusDM 2005)’, Sydney.
4. P. Christen and K. Goiser, Quality and Complexity Measures for Data Linkage and Deduplication, Accepted for Quality Measures in Data Mining, Springer, 2006.
5. L. Zhang, M. Wasson, and V. Templar, Methods and Systems for Matching Records and Normalizing Names, WO/2010/088052.
6. C. Dozier and R. Haschart, Automatic Extraction and Linking of Person Names in Legal Text, RIAO-2000 Proceedings.
7. H. L. Dunn, (1946) Record Linkage, American Journal of Public Health, 36, 1412–1415.
8. H. B. Newcombe, J. M. Kennedy, S. J. Axford, and A. P. James, Automatic Linkage of Vital Records, Science 130 (1959), 954–959.
9. H. B. Newcombe and J. M. Kennedy, Record Linkage: Making Maximum Use of the Discriminating Power of Identifying Information, Communications of the ACM, 1962, Vol. 5, Issue 11.
10. H. B. Newcombe, Record linking: The design of efficient systems for linking records into individual and family histories, Am J Hum Genet. 1967 May; 19(3 Pt 1): 335–359.
11. I. Fellegi and A. Sunter, A theory for record linkage, Journal of the American Statistical Association, 64(328):1183–1210, 1969.
12. W. Winkler. The State of Record Linkage and Current Research Problems. U.S. Bureau of the Census, Research Report, 1999.
13. K. Goiser and P. Christen, Towards Automated Record Linkage, In Proc. Fifth Australasian Data Mining Conference (AusDM 2006).
14. J. S. Lawson, Record Linkage Techniques for Improving Online Genealogical Research using Census Index Records, ASA Section on Survey Research Methods.
15. W. W. Cohen and J. Richman, Learning to Match and Cluster Large High-Dimensional Data Sets For Data Integration, SIGKDD’02.
16. P. Christen, Automatic Record Linkage using Seeded Nearest Neighbour and Support Vector Machine Classification, In ACM KDD’08, pages 151–159, Las Vegas, 2008.
17. M. G. Elfeky, T. M. Ghanem, V. S. Verykios, A. R. Huwait, and A. K. Elmagarmid, Record Linkage: A Machine Learning Approach, A Toolbox, and A Digital Government Web Service, Technical Report CSD-TR 03–024.
18. D. A. Bayliss, Database systems and methods for linking records and entity representations with sufficiently high confidence, US 2009/0271424 A1.
19. D. A. Bayliss, Database Systems and Methods for Linking Records, WO/2010/003061.
20. M. Fair, Generalized Record Linkage System – Statistics Canada’s Record Linkage Software, Austrian Journal of Statistics, Volume 33 (2004), Number 1&2, 37–53.
21. P. Christen. Febrl: An Open Source Data Cleaning, Deduplication and Record Linkage System with a Graphical User Interface. August 2008.
22. M. Elfeky, V. Verykios, and A. Elmagarmid. TAILOR: A record linkage toolbox. In ICDE’02, pages 17–28, San Jose, 2002.
23. C. W. Kelman, J. Bass, and D. Holman, (2002), ‘Research use of linked health data – A best practice protocol’, Aust NZ Journal of Public Health, vol. 26, pp. 251–255.
24. W. E. Winkler. Methods for evaluating and creating data quality, Elsevier Information Systems, 29(7):531–550, 2004.
25. M. Cochinwala, S. Dalal, A. K. Elmagarmid, V. S. Verykios, Record Matching: Past, Present and Future, Submitted to ACM Computing Surveys, 2003.
26. D. Loshin and E. Allburn, Customer Data Integration, Linkage Precision and Match Accuracy, Information Management Magazine, November 2004.
28. H. Issa, Application of Duplicate Records Detection Techniques to Duplicate Payments in a Real Business Environment, Rutgers Business School, Rutgers University, 2010.
31. L. Karl Branting, Name Matching in Law Enforcement and Counter-Terrorism, BAE Systems, Inc., Columbia, MD 21046, USA, karl.branting@baesystems.com.
32. J. Jonas and J. Harper, Effective counterterrorism and the limited role of predictive data mining, Policy Analysis, (584), 2006.
34. M.-Y. Kan and Y. F. Tan, Record Matching in Digital Library Metadata, Technical Opinion, Communications of the ACM, Vol. 51, No. 2, 02/2008.
35. O. Charif, H. Omrani, O. Klein, M. Schneider, and P. Trigano, A method and a tool for geocoding and record linkage, Working Paper No 2010-17, 07/2010.
36. C. Giraud-Carrier, J. Goodliffe, and B. Jones, Improving the Study of Campaign Contributors with Record Linkage.
37. S. J. Grannis, J. M. Overhage, and C. J. McDonald, Analysis of Identifier Performance using a Deterministic Linkage Algorithm.
38. S. Gomatam, R. Carter, M. Ariet, and G. Mitchell, An empirical comparison of record linkage procedures. Statistics in Medicine, vol. 21, no. 10, pp. 1485–1496, May 2002.
39. F. Maggi, A Survey of Probabilistic Record Matching Models, Techniques and Tools, Scientific Report TR-2008-22.
40. A. Elmagarmid, P. Ipeirotis, and V. Verykios. Duplicate record detection: A survey. IEEE Transactions on Knowledge and Data Engineering, 19(1):1–16, 2007.
41. W. E. Winkler, Using the EM algorithm for weight computation in the Fellegi-Sunter model of record linkage. Technical Report RR2000/05, US Bureau of the Census, 2000.
42. P. Christen, T. Churches, and J. X. Zhu, Probabilistic Name and Address Cleaning and Standardization, the Australasian Data Mining Workshop 2002.
43. J. Friedman, T. Hastie, and R. Tibshirani, Additive logistic regression: a statistical view of boosting, The Annals of Statistics, 28(2):337–407, 2000.
44. M. Elfeky, V. Verykios, and A. Elmagarmid. TAILOR: A record linkage toolbox. In ICDE’02, pages 17–28, San Jose, 2002.
45. S. Sarawagi and A. Bhamidipaty, Interactive deduplication using active learning. Proceedings of the 8th ACM SIGKDD conference, Edmonton, July 2002.
46. J. Rennie, Boosting with decision stumps and binary features, 2003.
47. R. E. Schapire, The Boosting Approach to Machine Learning: An Overview, in Nonlinear Estimation and Classification, Springer, 2003.
48. M. Jaro. Software Demonstrations. In Proc. of an International Workshop and Exposition – Record Linkage Techniques, Arlington, VA, USA, 1997.
49. E. Rundensteiner (Ed.), Special Issue on Data Transformation, IEEE Data Engineering Bulletin, March 1999.
50. L. Gu, R. Baxter, D. Vickers, and C. Rainsford. Record linkage: Current practice and future directions. Technical Report 03/83, CSIRO Mathematical and Information Sciences, Canberra, Australia, April 2003.
51. D. Knuth, The Art of Computer Programming, Volume III, Addison-Wesley, 1973.
52. M. Hernandez and S. Stolfo. Real World Data is Dirty: Data Cleansing and the Merge/Purge Problem. Journal of Data Mining and Knowledge Discovery, 2(1), pages 9–37, 1998.
53. A. McCallum, K. Nigam, and L. H. Ungar, “Efficient Clustering of High-Dimensional Data Sets with Application to Reference Matching,” Proc. Sixth ACM SIGKDD Int’l Conf. Knowledge Discovery and Data Mining (KDD’00), pp. 169–178, 2000.
54. R. K. Chapman, D. A. Bayliss, G. C. Halliday, Methods and Systems for Dynamically Creating Keys in a Database System, US 7739287 B1.
55. M. A. Hernandez and S. J. Stolfo. The Merge/Purge Problem for Large Databases. In Proc. of 1995 ACM SIGMOD Conf., pages 127–138, 1995.
56. V. I. Levenshtein. Binary codes capable of correcting deletions, insertions and reversals. Soviet Physics Doklady, 10:707–710, 1966.
58. M. A. Jaro, 1989. Advances in record-linkage methodology as applied to matching the 1985 census of Tampa, Florida. Journal of the American Statistical Association 84:414–420.
59. M. A. Jaro, 1995. Probabilistic linkage of large public health data files (disc: P687–689). Statistics in Medicine 14:491–498.
61. L. Gravano, P. G. Ipeirotis, H. V. Jagadish, N. Koudas, S. Muthukrishnan, and D. Srivastava. Approximate string joins in a database. In Proc. 27th Int. Conf. on Very Large Data Bases, pages 491–500, 2001.
62. W. W. Cohen, P. Ravikumar, and S. E. Fienberg. A comparison of string distance metrics for name-matching tasks. In Proceedings of the IJCAI-2003 Workshop on Information Integration on the Web (IIWeb-03), 2003. To appear.
63. Journal of the American Statistical Association, 84(406), pages 414–420, 1989.
64. S. Tejada, C. Knoblock, and S. Minton, Learning domain-independent string transformation weights for high accuracy object identification, In ACM KDD’02, pages 350–359, Edmonton, 2002.
65. U. Y. Nahm, M. Bilenko, and R. J. Mooney. Two approaches to handling noisy variation in text mining. In TextML’02, pages 18–27, Sydney, 2002.
66. L. Gu and R. Baxter. Decision models for record linkage. In Selected Papers from AusDM, Springer LNCS 3755, pages 146–160, 2006.
67. W. E. Winkler, (1995), Advanced methods for record linkage, American Statistical Association, Proceedings of the Section on Survey Research Methods, pp. 467–472.
68. R. Baxter, P. Christen, and T. Churches. A comparison of fast blocking methods for record linkage. In ACM SIGKDD workshop on Data Cleaning, Record Linkage and Object Consolidation, pages 25–27, Washington DC, 2003.
69. W. E. Winkler, (1995), Matching and Record Linkage, in B. G. Cox et al. (eds.), Business Survey Methods, New York: J. Wiley, 355–384.
Metadata
Title: Record Linkage Methodology and Applications
Author: Ling Qin Zhang
Copyright Year: 2011
Publisher: Springer New York
DOI: https://doi.org/10.1007/978-1-4614-1415-5_14
