ABSTRACT
Matching records that refer to the same entity across databases is becoming an increasingly important part of many data mining projects, as data from multiple sources often needs to be matched in order to enrich it or improve its quality. Significant advances in record linkage techniques have been made in recent years. However, many new techniques are either implemented only in research proof-of-concept systems, or they are hidden within expensive 'black box' commercial software. This makes it difficult for both researchers and practitioners to experiment with new record linkage techniques, and to compare existing techniques with new ones. The Febrl (Freely Extensible Biomedical Record Linkage) system aims to fill this gap. It contains many recently developed techniques for data cleaning, deduplication and record linkage, and makes them accessible through a graphical user interface (GUI). Febrl thus allows even inexperienced users to learn and experiment with both traditional and new record linkage techniques. Because Febrl is written in Python and its source code is available, it is fairly easy to integrate new record linkage techniques into it. Febrl can therefore be seen as a tool that allows researchers to compare various existing record linkage techniques with their own, enabling the record linkage research community to conduct its work more efficiently. Additionally, Febrl is suitable as a training tool for new record linkage users, and it can also be used for practical linkage projects with data sets that contain up to several hundred thousand records.
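To make the deduplication task described above concrete, the following is a minimal sketch of the usual three steps (blocking, field comparison, and threshold classification) using only the Python standard library. It does not use or reproduce Febrl's own API; the record fields, the function names `block`, `similarity` and `deduplicate`, the choice of postcode as blocking key, and the 0.85 threshold are all assumptions made for illustration.

```python
from difflib import SequenceMatcher
from itertools import combinations

# Hypothetical toy records; the field names are illustrative assumptions.
records = [
    {"id": 1, "given": "peter", "surname": "miller", "postcode": "2600"},
    {"id": 2, "given": "pete", "surname": "miller", "postcode": "2600"},
    {"id": 3, "given": "tim", "surname": "carter", "postcode": "2000"},
    {"id": 4, "given": "peter", "surname": "millar", "postcode": "2601"},
]


def similarity(a, b):
    """Approximate string similarity in [0, 1] (stand-in comparison function)."""
    return SequenceMatcher(None, a, b).ratio()


def block(recs, key):
    """Group records by a blocking key so only records in the same block are compared."""
    blocks = {}
    for rec in recs:
        blocks.setdefault(key(rec), []).append(rec)
    return blocks


def deduplicate(recs, threshold=0.85):
    """Return (id, id, score) for record pairs whose mean field similarity meets the threshold."""
    matches = []
    for group in block(recs, key=lambda r: r["postcode"]).values():
        for a, b in combinations(group, 2):
            score = (
                similarity(a["given"], b["given"])
                + similarity(a["surname"], b["surname"])
            ) / 2
            if score >= threshold:
                matches.append((a["id"], b["id"], round(score, 2)))
    return matches


if __name__ == "__main__":
    for pair in deduplicate(records):
        print(pair)
```

With this toy data, records 1 and 2 share a block and are classified as a match, while record 4 is never compared with them because its postcode places it in a different block. That illustrates the trade-off blocking makes between the number of comparisons and the risk of missing true matches.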
Index Terms
- Febrl: an open source data cleaning, deduplication and record linkage system with a graphical user interface