2018 | Original Paper | Book chapter

Network Metrics for Assessing the Quality of Entity Resolution Between Multiple Datasets

Authors: Al Koudous Idrissou, Frank van Harmelen, Peter van den Besselaar

Published in: Knowledge Engineering and Knowledge Management

Publisher: Springer International Publishing

Abstract

Matching entities between datasets is a crucial step for combining multiple datasets on the semantic web. A rich literature exists on different approaches to this entity resolution problem. However, much less work has been done on how to assess the quality of such entity links once they have been generated. Evaluation methods for link quality are typically limited to either comparison with a ground truth dataset (which is often not available), manual work (which is cumbersome and prone to error), or crowd sourcing (which is not always feasible, especially if expert knowledge is required). Furthermore, the problem of link evaluation is greatly exacerbated for links between more than two datasets, because the number of possible links grows rapidly with the number of datasets. In this paper, we propose a method to estimate the quality of entity links between multiple datasets. We exploit the fact that the links between entities from multiple datasets form a network, and we show how simple metrics on this network can reliably predict their quality. We verify our results in a large experimental study using six datasets from the domain of science, technology and innovation studies, for which we created a gold standard. This gold standard, available online, is an additional contribution of this paper. In addition, we evaluate our metric on a recently published gold standard to confirm our findings.
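To make the idea of "simple metrics on this network" concrete, the following is a minimal sketch that treats one cluster of entity links as an undirected graph and computes two generic graph measures: link density and the number of bridges (links whose removal disconnects the cluster). The abstract does not name the metrics the paper uses, so these two measures, the function names, and the entity identifiers below are illustrative assumptions, not the authors' method.

```python
# Illustrative sketch only: generic graph measures on a cluster of entity links.
# The measures and all identifiers here are assumptions, not the paper's metric.

def components(nodes, links):
    """Count connected components of an undirected graph via depth-first search."""
    adj = {n: set() for n in nodes}
    for a, b in links:
        adj[a].add(b)
        adj[b].add(a)
    seen, count = set(), 0
    for start in nodes:
        if start in seen:
            continue
        count += 1
        stack = [start]
        while stack:
            node = stack.pop()
            if node in seen:
                continue
            seen.add(node)
            stack.extend(adj[node] - seen)
    return count

def density(nodes, links):
    """Fraction of all possible pairwise links that are actually present."""
    n = len(nodes)
    possible = n * (n - 1) // 2
    distinct = {frozenset(link) for link in links}
    return len(distinct) / possible if possible else 1.0

def bridge_count(nodes, links):
    """Number of links whose removal splits the cluster (brute force)."""
    base = components(nodes, links)
    return sum(
        1
        for i in range(len(links))
        if components(nodes, links[:i] + links[i + 1:]) > base
    )

# Hypothetical cluster: one entity found in three datasets, fully interlinked.
nodes = ["grid:1", "orgref:7", "eter:3"]
triangle = [("grid:1", "orgref:7"), ("orgref:7", "eter:3"), ("grid:1", "eter:3")]
# Same entities, but one link is missing, leaving a chain.
chain = [("grid:1", "orgref:7"), ("orgref:7", "eter:3")]

print(density(nodes, triangle), bridge_count(nodes, triangle))
print(density(nodes, chain), bridge_count(nodes, chain))
```

Under this reading, a fully interlinked cluster has every possible link and no bridges, while a cluster with missing links has lower density and depends on single links to stay connected. Whether and how such measures map to link quality is the question the paper investigates.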

Footnotes
2
The metric value indicates the negative impact of one or more missing links in an ILN.
 
4
The information provided here about the datasets was collected in January 2018. The datasets themselves are of earlier dates: Grid: 2017.07.12; Orgref: 2017.07.03; OpenAire: 2018.08.16; OrgReg: 2017.07.18; Eter: 2014; Leiden Ranking 2015: 2017.6.16; and Cordis-H2020: 2016.12.22. All these datasets are available on the RISIS platform at http://datasets.risis.eu/.
 
12
On a 6th Gen Intel® Core™ i7 notebook with 8 GB RAM, it takes about 1:40 min to automatically evaluate all 4398 clusters of size three and above (see Fig. 4).
 
13
However, the very imbalanced character of the ground truth makes it hard to always outperform the baseline as illustrated in Table 2.
 
14
All confusion matrices supporting the analysis can be found on the RISIS project website at http://sms.risis.eu/assets/pdf/metrics-link-network.pdf.
 
Metadata
Title
Network Metrics for Assessing the Quality of Entity Resolution Between Multiple Datasets
Authors
Al Koudous Idrissou
Frank van Harmelen
Peter van den Besselaar
Copyright year
2018
DOI
https://doi.org/10.1007/978-3-030-03667-6_10