
2017 | OriginalPaper | Book chapter

Entity Deduplication on ScholarlyData

Written by: Ziqi Zhang, Andrea Giovanni Nuzzolese, Anna Lisa Gentile

Published in: The Semantic Web

Publisher: Springer International Publishing

Abstract

ScholarlyData is the new and currently the largest reference linked dataset of the Semantic Web community about papers, people, organisations, and events related to its academic conferences. Originally started from the Semantic Web Dog Food (SWDF), it addressed multiple issues of data representation and maintenance by (i) adopting a novel data model and (ii) establishing an open source workflow to support the addition of new data from the community. Nevertheless, the major issue with the current dataset is the presence of multiple URIs for the same entity, typically for persons and organisations. In this work we: (i) perform entity deduplication on the whole dataset, using supervised classification methods; (ii) devise a protocol to choose the most representative URI for an entity and deprecate the duplicated ones, while ensuring backward compatibility for them; (iii) incorporate the automatic deduplication step into the general workflow to reduce the creation of duplicate URIs when new data is added. Our early experiment focused on person and organisation URIs, and the results show significant improvement over state-of-the-art solutions. Across the entire dataset, we consolidated over 100 pairs of duplicate person URIs and over 800 pairs of duplicate organisation URIs, together with their associated triples (over 1,800 and 5,000, respectively), hence significantly improving the overall quality and connectivity of the data graph. We believe that this work, integrated into the ScholarlyData data publishing workflow, is a major step towards the creation of clean, high-quality scholarly linked data on the Semantic Web.
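
To make step (ii) more concrete, the following is a minimal Python sketch of URI consolidation, not the authors' actual protocol. It assumes that duplicate pairs have already been predicted by a classifier. The selection rule (keep the URI that appears in the most triples, ties broken alphabetically) and the use of `owl:sameAs` links for backward compatibility are illustrative assumptions, and the names `cluster_duplicates` and `consolidate` are hypothetical.

```python
from collections import defaultdict

# Assumption: deprecated URIs stay resolvable through an owl:sameAs link.
SAME_AS = "http://www.w3.org/2002/07/owl#sameAs"


def find(parent, x):
    """Return the root of x in a union-find forest, with path halving."""
    while parent[x] != x:
        parent[x] = parent[parent[x]]
        x = parent[x]
    return x


def cluster_duplicates(pairs):
    """Group URIs that are connected by predicted-duplicate pairs."""
    parent = {}
    for a, b in pairs:
        parent.setdefault(a, a)
        parent.setdefault(b, b)
        root_a, root_b = find(parent, a), find(parent, b)
        if root_a != root_b:
            parent[root_b] = root_a
    clusters = defaultdict(set)
    for uri in parent:
        clusters[find(parent, uri)].add(uri)
    return list(clusters.values())


def consolidate(triples, pairs):
    """Rewrite triples onto one representative URI per duplicate cluster."""
    # Count how often each term occurs as subject or object.
    usage = defaultdict(int)
    for s, _, o in triples:
        usage[s] += 1
        usage[o] += 1

    canonical = {}
    for cluster in cluster_duplicates(pairs):
        # Assumed rule: the most-used URI wins; ties are broken alphabetically.
        best = min(cluster, key=lambda uri: (-usage[uri], uri))
        for uri in cluster:
            canonical[uri] = best

    # Terms outside any cluster (including literals) are left unchanged.
    merged = {(canonical.get(s, s), p, canonical.get(o, o)) for s, p, o in triples}
    # Keep one link from every deprecated URI to its representative.
    links = {(uri, SAME_AS, rep) for uri, rep in canonical.items() if uri != rep}
    return merged | links
```

For example, if `ex:p1` and `ex:p2` are predicted to be the same person and `ex:p1` occurs in more triples, every triple about `ex:p2` is rewritten onto `ex:p1`, and a single `ex:p2 owl:sameAs ex:p1` triple remains so that the old URI still leads to the consolidated entity.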

Footnotes
4
cLODg is an open source tool that provides a formalised process for the conference metadata publication workflow: https://github.com/anuzzolese/cLODg2.

5
The tool has been used to generate data for ISWC2016 and EKAW2016.

16
Due to space limitations, those triples have been omitted from the example.

21
LIMES allows setting a threshold for predicted mappings. We tested thresholds from 0.1 to 1.0 in increments of 0.1 and found that LIMES-ws is sensitive to the threshold while LIMES-wc is not. For complete results and optimal thresholds, see: https://github.com/ziqizhang/scholarlydata/tree/master/data/public/soa_results.
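
The threshold sweep described in footnote 21 can be illustrated with a short Python sketch. It does not call LIMES. It assumes each candidate pair already carries a similarity score, that a pair is accepted when its score is at or above the threshold, and that the "optimal" threshold is the one with the highest F1 against a gold standard. The selection criterion and the function names are assumptions made for illustration.

```python
def f1_at_threshold(scored_pairs, gold, threshold):
    """F1 of the pairs whose score reaches the threshold, against gold pairs."""
    predicted = {pair for pair, score in scored_pairs.items() if score >= threshold}
    true_positives = len(predicted & gold)
    if true_positives == 0:
        return 0.0
    precision = true_positives / len(predicted)
    recall = true_positives / len(gold)
    return 2 * precision * recall / (precision + recall)


def sweep_thresholds(scored_pairs, gold):
    """Evaluate thresholds 0.1, 0.2, ..., 1.0 and return (best, all results)."""
    # Build the grid from integers to avoid accumulating float error.
    thresholds = [i / 10 for i in range(1, 11)]
    results = {t: f1_at_threshold(scored_pairs, gold, t) for t in thresholds}
    # On ties, max() keeps the first (lowest) threshold.
    best = max(thresholds, key=lambda t: results[t])
    return best, results
```

A system whose F1 barely changes across the ten thresholds would be called insensitive to the threshold in the sense used above.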
 
Metadata
Title
Entity Deduplication on ScholarlyData
Written by
Ziqi Zhang
Andrea Giovanni Nuzzolese
Anna Lisa Gentile
Copyright year
2017
DOI
https://doi.org/10.1007/978-3-319-58068-5_6
