Datenbank-Spektrum

1 March 2013 | Focus article

Parallel Entity Resolution with Dedoop

Authors: Lars Kolb, Erhard Rahm

Published in: Datenbank-Spektrum | Issue 1/2013

Abstract

We provide an overview of Dedoop (Deduplication with Hadoop), a new tool for parallel entity resolution (ER) on cloud infrastructures. Dedoop supports a browser-based specification of complex ER strategies and provides a large library of blocking and matching approaches. To simplify the configuration of ER strategies with several similarity metrics, training-based machine learning approaches can be employed with Dedoop. Specified ER strategies are automatically translated into MapReduce jobs for parallel execution on different Hadoop clusters. For improved performance, Dedoop supports redundancy-free multi-pass blocking as well as advanced load balancing approaches. To illustrate the usefulness of Dedoop, we present the results of a comparative evaluation of different ER strategies on a challenging real-world dataset.

Datenbank-Spektrum

Datenbank-Spektrum is the official journal of the special interest group on Databases and Information Retrieval of the Gesellschaft für Informatik (GI) e.V. The journal covers databases, database applications, and information retrieval.

http://xmlstar.sourceforge.net/.

Internally, Dedoop prefixes each blocking key with its (zero-padded) pass number so that blocking keys of pass i are lexicographically smaller than keys of pass j > i. For readability, this prefix was omitted in the previous sections.
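The effect of the zero padding can be illustrated with a minimal Python sketch. It is not Dedoop's actual code: the function names, the `.` separator, the 1-based pass numbering, and the padding width derived from the total number of passes are all assumptions made for illustration. The sketch only shows why padding matters once there are ten or more passes.

```python
SEPARATOR = "."  # assumed delimiter; the source does not say which one Dedoop uses


def prefix_blocking_key(pass_no: int, blocking_key: str, num_passes: int) -> str:
    """Prepend the zero-padded pass number to a blocking key (hypothetical helper)."""
    width = len(str(num_passes))  # pad to the width of the largest pass number
    return f"{pass_no:0{width}d}{SEPARATOR}{blocking_key}"


def pass_order(keys):
    """Return the pass numbers in the order in which the sorted keys visit them."""
    return [int(key.split(SEPARATOR, 1)[0]) for key in sorted(keys)]


# Three example keys from passes 10, 2, and 1 of a 12-pass strategy.
examples = [(10, "aaa"), (2, "zzz"), (1, "mmm")]

# With zero padding, lexicographic order follows the pass number.
padded = [prefix_blocking_key(p, k, num_passes=12) for p, k in examples]

# Without padding, the key of pass 10 sorts before the key of pass 2.
unpadded = [f"{p}{SEPARATOR}{k}" for p, k in examples]

print(pass_order(padded))
print(pass_order(unpadded))
```

Because every prefix has the same width, a string comparison of two keys from different passes is decided within the prefix, regardless of the blocking key that follows.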

We do not have the perfect match result for this dataset, so we could not use it to evaluate match quality.

Title: Parallel Entity Resolution with Dedoop
Authors: Lars Kolb, Erhard Rahm
Publication date: 1 March 2013
Publisher: Springer-Verlag
Published in: Datenbank-Spektrum / Issue 1/2013
Print ISSN: 1618-2162
Electronic ISSN: 1610-1995
DOI: https://doi.org/10.1007/s13222-012-0110-x
