Abstract
Duplicate detection is the process of identifying multiple representations of the same real-world object in a data source. It is a problem of critical importance in many applications, including customer relationship management, personal information management, and data mining.
In this paper, we present how a research prototype, namely DogmatiX, which was designed to detect duplicates in hierarchical XML data, was successfully extended and applied to a large-scale industrial relational database in cooperation with Schufa Holding AG. Schufa's main business line is to store and retrieve the credit histories of over 60 million individuals. Here, correctly identifying duplicates is critical both for individuals and for companies: On the one hand, an incorrectly identified duplicate can attach a false negative credit history to an individual, who may then be denied credit. On the other hand, it is essential for companies that Schufa detect duplicates of a person who deliberately tries to create a new identity in the database in order to obtain a clean credit history.
Besides the quality of duplicate detection, i.e., its effectiveness, scalability cannot be neglected, given the considerable size of the database. We describe our solution to both problems and present a comprehensive evaluation based on large volumes of real-world data.
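Scalability in duplicate detection is commonly achieved by comparing only candidate pairs produced by a blocking scheme rather than all record pairs. The following is a minimal sketch of one such classic scheme, the sorted-neighborhood method of Hernández and Stolfo; it is not the DogmatiX pipeline described in the paper, and the blocking key, similarity measure, and threshold are illustrative assumptions:

```python
from difflib import SequenceMatcher

def sorted_neighborhood(records, key, window=3, threshold=0.85):
    """Sorted-neighborhood duplicate detection sketch: sort records by a
    blocking key, then compare only records that fall within a sliding
    window of `window` consecutive positions, instead of all pairs."""
    ordered = sorted(records, key=key)
    candidates = set()
    for i, rec in enumerate(ordered):
        # Each record is compared with at most (window - 1) successors.
        for j in range(i + 1, min(i + window, len(ordered))):
            candidates.add((rec, ordered[j]))
    duplicates = []
    for a, b in candidates:
        # Illustrative similarity measure; real systems use tuned,
        # domain-specific measures (e.g., for names and addresses).
        sim = SequenceMatcher(None, a.lower(), b.lower()).ratio()
        if sim >= threshold:
            duplicates.append((a, b, round(sim, 2)))
    return duplicates

# Hypothetical person records, blocked by last name token.
people = ["Jon Smith", "John Smith", "Jane Doe", "J. Doe", "Mary Major"]
print(sorted_neighborhood(people, key=lambda s: s.split()[-1]))
```

With n records, this compares O(n · window) pairs instead of O(n²), which is the kind of trade-off between effectiveness and scalability the paper evaluates at industrial scale.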
Index Terms
- Industry-scale duplicate detection