Industry-scale duplicate detection

Authors:
Melanie Weis, Hasso-Plattner-Institut, Potsdam, Germany
Felix Naumann, Hasso-Plattner-Institut, Potsdam, Germany
Ulrich Jehle, SCHUFA Holding AG, Wiesbaden, Germany
Jens Lufter, SCHUFA Holding AG, Wiesbaden, Germany
Holger Schuster, SCHUFA Holding AG, Wiesbaden, Germany

Proceedings of the VLDB Endowment, Volume 1, Issue 2, pp. 1253–1264. https://doi.org/10.14778/1454159.1454165

Published: 01 August 2008

Abstract

Duplicate detection is the process of identifying multiple representations of the same real-world object in a data source. It is a problem of critical importance in many applications, including customer relationship management, personal information management, and data mining.

In this paper, we present how a research prototype, DogmatiX, which was designed to detect duplicates in hierarchical XML data, was successfully extended and applied to a large-scale industrial relational database in cooperation with Schufa Holding AG. Schufa's main line of business is to store and retrieve the credit histories of over 60 million individuals. Here, correctly identifying duplicates is critical for both individuals and companies. On the one hand, an incorrectly identified duplicate can result in a falsely negative credit history for an individual, who will then no longer be granted credit. On the other hand, it is essential for companies that Schufa detects duplicates of a person who deliberately tries to create a new identity in the database in order to obtain a clean credit history.

Besides the quality of duplicate detection, i.e., its effectiveness, scalability cannot be neglected given the considerable size of the database. We describe our solution to both problems and present a comprehensive evaluation based on large volumes of real-world data.

Index Terms

Industry-scale duplicate detection
1. Applied computing
  1. Document management and text processing
    1. Document preparation
2. Computing methodologies
  1. Machine learning
    1. Learning paradigms
      1. Unsupervised learning
        1. Cluster analysis
    2. Machine learning algorithms
      1. Feature selection

Published in
Proceedings of the VLDB Endowment, Volume 1, Issue 2
August 2008
461 pages
ISSN: 2150-8097
Editors: Peter Buneman, Beng Chin Ooi, Kenneth Ross, Gerald Weber
Publisher: VLDB Endowment