DOI: 10.1145/1401890.1402020
demonstration

Febrl: an open source data cleaning, deduplication and record linkage system with a graphical user interface

Published: 24 August 2008

ABSTRACT

Matching records that refer to the same entity across databases is becoming an increasingly important part of many data mining projects, because data from multiple sources often needs to be matched in order to enrich it or improve its quality. Significant advances in record linkage techniques have been made in recent years. However, many new techniques are either implemented only in research proof-of-concept systems, or they are hidden within expensive 'black box' commercial software. This makes it difficult for both researchers and practitioners to experiment with new record linkage techniques, and to compare existing techniques with new ones. The Febrl (Freely Extensible Biomedical Record Linkage) system aims to fill this gap. It contains many recently developed techniques for data cleaning, deduplication and record linkage, and encapsulates them in a graphical user interface (GUI). Febrl thus allows even inexperienced users to learn and experiment with both traditional and new record linkage techniques. Because Febrl is written in Python and its source code is available, it is fairly easy to integrate new record linkage techniques into it. Febrl can therefore be seen as a tool that allows researchers to compare various existing record linkage techniques with their own, enabling the record linkage research community to conduct its work more efficiently. Additionally, Febrl is suitable as a training tool for new record linkage users, and it can also be used for practical linkage projects with data sets that contain up to several hundred thousand records.

Published in

KDD '08: Proceedings of the 14th ACM SIGKDD international conference on Knowledge discovery and data mining
August 2008
1116 pages
ISBN: 9781605581934
DOI: 10.1145/1401890
General Chair: Ying Li
Program Chairs: Bing Liu, Sunita Sarawagi

Copyright © 2008 ACM

Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. Copyrights for components of this work owned by others than ACM must be honored. Abstracting with credit is permitted. To copy otherwise, or republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee. Request permissions from [email protected]

Publisher

Association for Computing Machinery, New York, NY, United States

      Permissions

      Request permissions about this article.

      Request Permissions

      Check for updates

      Qualifiers

      • demonstration

Acceptance Rates

KDD '08 paper acceptance rate: 118 of 593 submissions, 20%
Overall acceptance rate: 1,133 of 8,635 submissions, 13%
