Published in: Distributed and Parallel Databases 3/2018

2 August 2018

An effective weighted rule-based method for entity resolution

Written by: Hiba Abu Ahmad, Hongzhi Wang

Abstract

Entity resolution is an important data-cleaning task that detects records belonging to the same entity. It is critical for digital libraries, where different entities can share the same name and no identifying key is available. Conventional methods use similarity measures and clustering techniques to find the records of a specific entity. Because these methods perform poorly, recent methods build rules on the attributes of records whose values are distinct for an entity, which overcomes some of the drawbacks. However, they use an insufficient set of attributes and ignore common and empty attribute values, which degrades the quality of entity resolution. In this paper, we define a multi-attributes weighted rule system (MAWR) that examines all values of records' attributes in order to represent the difficult record-entity mapping. We then propose a rule generation algorithm based on this system, and an entity resolution algorithm (MAWR-ER) that uses the generated rules to identify entities. We evaluate our method on real data, and the experimental results demonstrate its effectiveness and efficiency.
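To make the idea of weighted attribute rules concrete, here is a minimal Python sketch. It is not the MAWR or MAWR-ER algorithm from the paper, whose details the abstract does not give. It only illustrates the general technique of scoring a record against per-entity rules whose attribute-value clauses carry weights. The rule contents, attribute names, weights, threshold, and the functions `score` and `resolve` are all hypothetical.

```python
from collections import defaultdict

# Hypothetical weighted rules: for each entity, a map from
# (attribute, value) clauses to weights. All values here are
# illustrative assumptions, not taken from the paper.
RULES = {
    "e1": {("coauthor", "A. Smith"): 0.6, ("venue", "VLDB"): 0.3,
           ("title_word", "cleaning"): 0.2},
    "e2": {("coauthor", "B. Jones"): 0.6, ("venue", "CVPR"): 0.3,
           ("title_word", "vision"): 0.2},
}


def score(record, rule):
    """Sum the weights of the rule clauses whose value occurs in the record."""
    total = 0.0
    for (attr, value), weight in rule.items():
        # Missing or empty attributes simply contribute no weight.
        if value in record.get(attr, ()):
            total += weight
    return total


def resolve(records, rules, threshold=0.5):
    """Assign each record to its best-scoring entity, or to None if the
    best score stays below the (assumed) threshold."""
    clusters = defaultdict(list)
    for rid, record in records.items():
        best_entity, best_score = None, 0.0
        for entity, rule in rules.items():
            s = score(record, rule)
            if s > best_score:
                best_entity, best_score = entity, s
        key = best_entity if best_score >= threshold else None
        clusters[key].append(rid)
    return dict(clusters)


# Three records that share an author name but differ in other attributes.
records = {
    "r1": {"coauthor": ["A. Smith"], "venue": ["VLDB"],
           "title_word": ["entity", "cleaning"]},
    "r2": {"coauthor": ["B. Jones"], "venue": [], "title_word": ["vision"]},
    "r3": {"coauthor": [], "venue": ["VLDB"], "title_word": ["graphs"]},
}

result = resolve(records, RULES)
print(result)
```

In this toy setup, `r3` matches only a low-weight, common value (the venue) and is left unassigned, which hints at why weighting common and empty attribute values matters for resolution quality.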

Metadata
Title
An effective weighted rule-based method for entity resolution
Written by
Hiba Abu Ahmad
Hongzhi Wang
Publication date
2 August 2018
Publisher
Springer US
Published in
Distributed and Parallel Databases / Issue 3/2018
Print ISSN: 0926-8782
Electronic ISSN: 1573-7578
DOI
https://doi.org/10.1007/s10619-018-7240-6
