2015 | OriginalPaper | Book chapter

Cross Language Duplicate Record Detection in Big Data

Written by: Ahmed H. Yousef

Published in: Big Data in Complex Systems

Publisher: Springer International Publishing

The importance of data accuracy and quality has increased with the explosion of data size. Data quality is crucial to the success of cross-enterprise integration applications, business intelligence, and data mining solutions. Detecting duplicate data, that is, records that represent the same real-world object more than once in a dataset, is the first step toward ensuring data accuracy. This operation becomes more complicated when the same object name (person, city) is represented in multiple natural languages, owing to factors that include spelling, typographical, and pronunciation variation, dialects, special vowel and consonant distinctions, and other linguistic characteristics. It is therefore difficult to decide whether two syntactic values (names) are alternative designations of the same semantic entity. To the best of the author's knowledge, previously proposed duplicate record detection (DRD) algorithms and frameworks support only single-language duplicate record detection, or at most bilingual detection. In this paper, two available duplicate record detection tools are compared. Then, a generic cross-language duplicate record detection solution architecture is proposed, designed, and implemented to support the wide range of variations across several languages. The proposed system design uses a dictionary based on phonetic algorithms and supports different indexing/blocking techniques to allow fast processing. The framework proposes the use of several proximity matching algorithms, performance evaluation metrics, and classifiers to suit the diversity of name matching across several languages. The framework is implemented and verified empirically in several case studies. Several experiments are run to compare the advantages and disadvantages of the proposed system with those of other tools. Results showed that the proposed system offers substantial improvements over the well-known tools.
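
The combination of a phonetic key with indexing/blocking mentioned in the abstract can be illustrated with a minimal Python sketch. It is not the chapter's implementation: the abstract does not name the phonetic algorithm, so American Soundex is assumed here, `difflib.SequenceMatcher` stands in for the proximity matching step, and the function names, sample names, and threshold are all hypothetical. The sketch also handles only ASCII letters; names in other scripts would first need a transliteration step that is not shown.

```python
from collections import defaultdict
from difflib import SequenceMatcher
from itertools import combinations

# Soundex digit for each consonant group (assumed phonetic algorithm).
_CODES = {}
for digit, letters in {"1": "bfpv", "2": "cgjkqsxz", "3": "dt",
                       "4": "l", "5": "mn", "6": "r"}.items():
    for ch in letters:
        _CODES[ch] = digit


def soundex(name):
    """Return the 4-character American Soundex key of an ASCII name."""
    letters = [c for c in name.lower() if c.isalpha() and c.isascii()]
    if not letters:
        return ""
    result = letters[0].upper()
    prev = _CODES.get(letters[0], "")
    for ch in letters[1:]:
        if ch in "hw":
            continue  # h and w do not separate letters with the same code
        code = _CODES.get(ch, "")
        if code and code != prev:
            result += code
        prev = code  # a vowel resets prev, so a repeated code is kept
    return (result + "000")[:4]


def block_by_key(records):
    """Group (record_id, name) pairs by phonetic key (the blocking step)."""
    blocks = defaultdict(list)
    for rec_id, name in records:
        blocks[soundex(name)].append((rec_id, name))
    return blocks


def candidate_duplicates(records, threshold=0.75):
    """Compare names only within a block and keep pairs above the threshold."""
    pairs = []
    for block in block_by_key(records).values():
        for (id_a, name_a), (id_b, name_b) in combinations(block, 2):
            score = SequenceMatcher(None, name_a.lower(), name_b.lower()).ratio()
            if score >= threshold:
                pairs.append((id_a, id_b, score))
    return pairs


if __name__ == "__main__":
    # Hypothetical sample records, not data from the chapter.
    sample = [(1, "Mohamed"), (2, "Mohammed"), (3, "Ahmed"),
              (4, "Ahmad"), (5, "Robert")]
    for id_a, id_b, score in candidate_duplicates(sample):
        print(id_a, id_b, round(score, 2))
```

Blocking limits the pairwise comparisons to records that share a key, which is what keeps the number of comparisons manageable on large datasets; the trade-off is that a true duplicate whose key differs is never compared.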

Metadata
Title
Cross Language Duplicate Record Detection in Big Data
Written by
Ahmed H. Yousef
Copyright year
2015
DOI
https://doi.org/10.1007/978-3-319-11056-1_5