research-article

Development and user experiences of an open source data cleaning, deduplication and record linkage system

Author:
Peter Christen

The Australian National University, Canberra, Australia

ACM SIGKDD Explorations Newsletter, Volume 11, Issue 1, June 2009, pp 39–48. https://doi.org/10.1145/1656274.1656282

Published: 16 November 2009

Abstract

Record linkage, also known as database matching or entity resolution, is now recognised as a core step in the KDD process. Data mining projects increasingly require that information from several sources is combined before the actual mining can be conducted. Also of increasing interest is the deduplication of a single database. The objectives of record linkage and deduplication are to identify, match and merge all records that relate to the same real-world entities. Because real-world data is commonly 'dirty', data cleaning is an important first step in many deduplication, record linkage, and data mining projects.

In this paper, an overview of the Febrl (Freely Extensible Biomedical Record Linkage) system is provided, and the results of a recent survey of Febrl users are discussed. Febrl includes a variety of functionalities required for data cleaning, deduplication and record linkage, and it provides a graphical user interface that facilitates its application for users who do not have programming experience.
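
To make the steps named in the abstract concrete, the following is a minimal Python sketch of deduplicating a single database: clean each record, group records into blocks, compare record pairs within each block, and classify pairs as matches. It is not Febrl's API. The field names (`given_name`, `surname`, `postcode`), the choice of postcode as blocking key, the 0.85 threshold, and the use of the standard library's `difflib.SequenceMatcher` as the string comparison are all illustrative assumptions.

```python
from collections import defaultdict
from difflib import SequenceMatcher
from itertools import combinations


def clean(record):
    """Data cleaning: lower-case each field and collapse whitespace."""
    return {k: " ".join(str(v).lower().split()) for k, v in record.items()}


def similarity(a, b):
    """Approximate string similarity in [0, 1] (assumed comparison function)."""
    return SequenceMatcher(None, a, b).ratio()


def find_duplicates(records, fields=("given_name", "surname"), threshold=0.85):
    """Return sorted index pairs whose mean field similarity meets the threshold."""
    cleaned = [clean(r) for r in records]

    # Blocking (assumed key): only compare records that share a postcode.
    blocks = defaultdict(list)
    for idx, rec in enumerate(cleaned):
        blocks[rec["postcode"]].append(idx)

    matches = []
    for indices in blocks.values():
        for i, j in combinations(indices, 2):
            score = sum(
                similarity(cleaned[i][f], cleaned[j][f]) for f in fields
            ) / len(fields)
            # Classification: a simple threshold on the mean similarity.
            if score >= threshold:
                matches.append((i, j))
    return sorted(matches)


# Hypothetical example records, invented for illustration only.
records = [
    {"given_name": "Peter", "surname": "Smith", "postcode": "2600"},
    {"given_name": " peter", "surname": "SMYTH", "postcode": "2600"},
    {"given_name": "Mary", "surname": "Jones", "postcode": "2600"},
    {"given_name": "Peter", "surname": "Smith", "postcode": "3000"},
]
print(find_duplicates(records))
```

With these example records, only the first two are flagged as a likely duplicate pair: cleaning removes the case and whitespace differences, and the surnames differ by one character. The fourth record is never compared with the first because it falls into a different block, which shows how a blocking key trades missed matches for fewer comparisons.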

Index Terms

1. Information systems
  1. Information systems applications

Published in

ACM SIGKDD Explorations Newsletter Volume 11, Issue 1
June 2009
56 pages
ISSN:1931-0145
EISSN:1931-0153
DOI:10.1145/1656274

Copyright © 2009 Author
Publisher
Association for Computing Machinery
New York, NY, United States
Author Tags
GUI
Python
data linkage
data standardisation
database matching
open source software
Qualifiers
- research-article