Industry-scale duplicate detection

Published: 01 August 2008

Abstract

Duplicate detection is the process of identifying multiple representations of the same real-world object in a data source. It is a problem of critical importance in many applications, including customer relationship management, personal information management, and data mining.

In this paper, we present how DogmatiX, a research prototype designed to detect duplicates in hierarchical XML data, was successfully extended and applied to a large-scale industrial relational database in cooperation with Schufa Holding AG. Schufa's main line of business is to store and retrieve the credit histories of over 60 million individuals. Here, correctly identifying duplicates is critical for both individuals and companies. On the one hand, an incorrectly identified duplicate can leave an individual with a wrongly negative credit history, so that the individual is no longer granted credit. On the other hand, it is essential for companies that Schufa detects duplicates of a person who deliberately tries to create a new identity in the database in order to obtain a clean credit history.

Besides the quality of duplicate detection, i.e., its effectiveness, scalability cannot be neglected because of the considerable size of the database. We describe our solution for coping with both challenges and present a comprehensive evaluation based on large volumes of real-world data.
