demonstration

Febrl: an open source data cleaning, deduplication and record linkage system with a graphical user interface

Author:
Peter Christen

The Australian National University, Canberra, Australia

KDD '08: Proceedings of the 14th ACM SIGKDD international conference on Knowledge discovery and data mining, August 2008, Pages 1065–1068, https://doi.org/10.1145/1401890.1402020

Published: 24 August 2008

ABSTRACT

Matching records that refer to the same entity across databases is becoming an increasingly important part of many data mining projects, as data from multiple sources often needs to be matched in order to enrich it or improve its quality. Significant advances in record linkage techniques have been made in recent years. However, many new techniques are either implemented only in research proof-of-concept systems, or they are hidden within expensive 'black box' commercial software. This makes it difficult for both researchers and practitioners to experiment with new record linkage techniques, and to compare existing techniques with new ones. The Febrl (Freely Extensible Biomedical Record Linkage) system aims to fill this gap. It contains many recently developed techniques for data cleaning, deduplication and record linkage, and encapsulates them in a graphical user interface (GUI). Febrl thus allows even inexperienced users to learn and experiment with both traditional and new record linkage techniques. Because Febrl is written in Python and its source code is available, it is fairly easy to integrate new record linkage techniques into it. Febrl can therefore be seen as a tool that allows researchers to compare various existing record linkage techniques with their own, enabling the record linkage research community to conduct its work more efficiently. Additionally, Febrl is suitable as a training tool for new record linkage users, and it can also be used for practical linkage projects with data sets that contain up to several hundred thousand records.
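For readers new to the task, the following is a minimal sketch of deduplication in Python. It is not Febrl code and does not use Febrl's API. The field names, the blocking key, the similarity measure (difflib from the standard library) and the 0.85 threshold are all assumptions chosen for illustration. It shows the general pattern of grouping records into blocks, comparing only pairs within a block, and classifying pairs by a similarity score.

```python
from collections import defaultdict
from difflib import SequenceMatcher
from itertools import combinations


def similarity(a, b):
    """Approximate string similarity in [0, 1], case-insensitive."""
    return SequenceMatcher(None, a.lower(), b.lower()).ratio()


def block_key(record):
    """Hypothetical blocking key: surname initial plus postcode."""
    return (record["surname"][:1].lower(), record["postcode"])


def find_duplicates(records, threshold=0.85):
    """Return pairs of record ids whose name similarity reaches the threshold.

    Only records sharing a blocking key are compared, which avoids
    comparing every record with every other record.
    """
    blocks = defaultdict(list)
    for rec_id, rec in records.items():
        blocks[block_key(rec)].append(rec_id)

    matches = []
    for ids in blocks.values():
        for id_a, id_b in combinations(sorted(ids), 2):
            a, b = records[id_a], records[id_b]
            # Illustrative score: mean similarity of the two name fields.
            score = (similarity(a["given"], b["given"])
                     + similarity(a["surname"], b["surname"])) / 2
            if score >= threshold:
                matches.append((id_a, id_b))
    return matches


# Made-up example records (not taken from any real data set).
records = {
    1: {"given": "John", "surname": "Smith", "postcode": "2600"},
    2: {"given": "Jon", "surname": "Smith", "postcode": "2600"},
    3: {"given": "Sarah", "surname": "Stone", "postcode": "2600"},
    4: {"given": "John", "surname": "Smith", "postcode": "3000"},
}

duplicates = find_duplicates(records)
print(duplicates)
```

In this sketch, records 1 and 2 are compared because they share a block, while record 4 is never compared with record 1 because its postcode places it in a different block. That illustrates the usual trade-off of blocking: fewer comparisons at the risk of missing some true matches.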

Index Terms

- Information systems
  - Information systems applications
    - Data mining

Published in
KDD '08: Proceedings of the 14th ACM SIGKDD international conference on Knowledge discovery and data mining
August 2008
1116 pages
ISBN: 9781605581934
DOI: 10.1145/1401890
General Chair: Ying Li (Microsoft adCenter Labs)
Program Chairs: Bing Liu (University of Illinois at Chicago), Sunita Sarawagi (Indian Institute of Technology, Bombay)
Copyright © 2008 ACM
Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. Copyrights for components of this work owned by others than ACM must be honored. Abstracting with credit is permitted. To copy otherwise, or republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee. Request permissions from [email protected]
Publisher
Association for Computing Machinery
New York, NY, United States
Author Tags
Python
data cleaning
data linkage
data matching
deduplication
open source software
Qualifiers
- demonstration
Acceptance Rates
KDD '08 Paper Acceptance Rate: 118 of 593 submissions, 20%
Overall Acceptance Rate: 1,133 of 8,635 submissions, 13%