2016 | OriginalPaper | Book chapter

Context Semantic Analysis: A Knowledge-Based Technique for Computing Inter-document Similarity

Written by: Fabio Benedetti, Domenico Beneventano, Sonia Bergamaschi

Published in: Similarity Search and Applications

Publisher: Springer International Publishing

Abstract

We propose a novel knowledge-based technique for inter-document similarity, called Context Semantic Analysis (CSA). Several specialized approaches built on top of a specific knowledge base (e.g. Wikipedia) exist in the literature, but CSA differs from them because it is designed to be portable to any RDF knowledge base. Our technique relies on a generic RDF knowledge base (e.g. DBpedia or Wikidata) to extract a vector that represents the context of a document. We show how such a Semantic Context Vector can be effectively exploited to compute inter-document similarity. Experimental results show that our general technique outperforms baselines built on top of traditional methods and achieves performance similar to that of specialized methods.
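As a rough illustration of the final step, the sketch below compares two documents through sparse context vectors. The abstract does not say which similarity function is used or how the vector weights are obtained, so cosine similarity, the `cosine_similarity` helper, and the toy weights keyed by DBpedia-style entity names are all assumptions made for illustration, not the authors' implementation.

```python
import math


def cosine_similarity(u, v):
    """Cosine similarity between two sparse vectors stored as dicts."""
    # Dot product over the keys the two vectors share.
    dot = sum(w * v[k] for k, w in u.items() if k in v)
    norm_u = math.sqrt(sum(w * w for w in u.values()))
    norm_v = math.sqrt(sum(w * w for w in v.values()))
    # An empty or all-zero vector has no direction; treat similarity as 0.
    if norm_u == 0.0 or norm_v == 0.0:
        return 0.0
    return dot / (norm_u * norm_v)


# Hypothetical context vectors: entity identifier -> weight (made-up values).
doc_a = {"dbr:Rome": 3.0, "dbr:Italy": 4.0}
doc_b = {"dbr:Italy": 2.0}

similarity = cosine_similarity(doc_a, doc_b)
```

Storing the vectors as dictionaries keeps the computation proportional to the number of non-zero entries, which matters when the vector space spans every entity in a large knowledge base.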

Footnotes
2
We abbreviate URI namespaces with common prefixes; see http://prefix.cc for details.
 
4
When an entity is an instance of more than one class, we use the class with the fewest instances because it better characterizes the entity. However, if we filter the knowledge bases by excluding classes defined in external sources such as YAGO, GeoNames, etc., only 6.4% of entities in DBpedia and 2.22% in Wikidata are instances of more than one class.
 
6
Implemented as in [15] (only removing the stopwords).
 
7
Unless explicitly stated otherwise, all differences in performance are statistically significant at \(p\text{-value} < 0.05\) using Fisher's Z-value transformation.
 
8
The sets of starting entities are obtained by using NER APIs.
 
9
With tf-idf as the weighting function.
 
11
We ran this experiment on an Ubuntu machine with 16 cores (Intel Xeon E312xx) and 98 GB of RAM.
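Footnote 7 mentions significance testing with Fisher's Z-value transformation. The sketch below shows one common form of that test: a two-sided comparison of two Pearson correlations. The source does not state which variant was used, so the independent-samples formula and the `fisher_z_test` name are assumptions. Correlations measured on the same document pairs are dependent and would call for a different test.

```python
import math


def fisher_z_test(r1, n1, r2, n2):
    """Two-sided test for the difference between two Pearson correlations.

    Assumes the correlations come from independent samples of sizes
    n1 and n2 (both greater than 3) and that |r1|, |r2| < 1.
    """
    # Fisher transformation: atanh(r) is approximately normally distributed.
    z1, z2 = math.atanh(r1), math.atanh(r2)
    # Standard error of the difference between the transformed values.
    se = math.sqrt(1.0 / (n1 - 3) + 1.0 / (n2 - 3))
    z = (z1 - z2) / se
    # Two-sided p-value under the standard normal distribution.
    p = math.erfc(abs(z) / math.sqrt(2.0))
    return z, p


# Made-up correlations and sample sizes, for illustration only.
z_stat, p_value = fisher_z_test(0.9, 100, 0.1, 100)
significant = p_value < 0.05
```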
 
References
1. Anyanwu, K., Maduko, A., Sheth, A.: SemRank: ranking complex relationship search results on the semantic web. In: Proceedings of the 14th International Conference on World Wide Web, pp. 117–127. ACM (2005)
2. Auer, S., Bizer, C., Kobilarov, G., Lehmann, J., Cyganiak, R., Ives, Z.: DBpedia: a nucleus for a web of open data. In: Aberer, K., et al. (eds.) ASWC/ISWC 2007. LNCS, vol. 4825, pp. 722–735. Springer, Heidelberg (2007). doi:10.1007/978-3-540-76298-0_52
3. Bär, D., Zesch, T., Gurevych, I.: A reflective view on text similarity. In: RANLP, pp. 515–520 (2011)
4. Beneventano, D., Bergamaschi, S., Sorrentino, S., Vincini, M., Benedetti, F.: Semantic annotation of the cerealab database by the agrovoc linked dataset. Ecol. Inform. 26, 119–126 (2015)
5. Bizer, C., Heath, T., Berners-Lee, T.: Linked data-the story so far. In: Sheth, A.P. (ed.) Semantic Services, Interoperability, Web Applications: Emerging Concepts, pp. 205–227. IGI Global, Hershey (2009)
6. Bos, L., Donnelly, K.: SNOMED-CT: the advanced terminology and coding system for eHealth. Stud. Health Technol. Inform. 121, 279–290 (2006)
7. Caracciolo, C., Stellato, A., Morshed, A., Johannsen, G., Rajbhandari, S., Jaques, Y., Keizer, J.: The AGROVOC linked dataset. Semant. Web 4(3), 341–348 (2013)
8. Cyganiak, R., Wood, D., Lanthaler, M.: RDF 1.1 concepts, abstract syntax. W3C Recomm. 25, 1–8 (2014)
9. Dumais, S.T.: Latent semantic analysis. Annu. Rev. Inf. Sci. Technol. 38(1), 188–230 (2004)
10. Gabrilovich, E., Markovitch, S.: Computing semantic relatedness using wikipedia-based explicit semantic analysis. IJCAI 7, 1606–1611 (2007)
11. Gomaa, W.H., Fahmy, A.A.: A survey of text similarity approaches. Int. J. Comput. Appl. 68(13), 13–18 (2013)
12. Hassan, S., Mihalcea, R.: Semantic relatedness using salient semantic analysis. In: AAAI (2011)
13. Haveliwala, T.H.: Topic-sensitive pagerank. In: Proceedings of the 11th International Conference on World Wide Web, pp. 517–526. ACM (2002)
14. Lawrence, I., Lin, K.: A concordance correlation coefficient to evaluate reproducibility. Biometrics 45, 255–268 (1989)
15. Lee, M., Pincombe, B., Welsh, M.: An empirical evaluation of models of text document similarity. In: Cognitive Science (2005)
16. Manning, C.D., Raghavan, P., Schütze, H., et al.: Introduction to Information Retrieval, vol. 1. Cambridge University Press, Cambridge (2008)
17. Mendes, P., Jakob, M., García-Silva, A., Bizer, C.: DBpedia spotlight shedding light on the web of documents. In: I-Semantics (2011)
18. Nadeau, D., Sekine, S.: A survey of named entity recognition and classification. Lingvisticae Investigationes 30(1), 3–26 (2007)
19. Nakov, P., Popova, A., Mateev, P.: Weight functions impact on LSA performance. In: EuroConference RANLP, pp. 187–193 (2001)
20. Page, L., Brin, S., Motwani, R., Winograd, T.: The pagerank citation ranking: bringing order to the web (1999)
21. Schuhmacher, M., Ponzetto, S.P.: Knowledge-based graph document modeling. In: Proceedings of the 7th ACM International Conference on Web Search and Data Mining, pp. 543–552. ACM (2014)
22. Suchanek, F.M., Kasneci, G., Weikum, G.: YAGO: a core of semantic knowledge. In: Proceedings of the 16th International Conference on World Wide Web, pp. 697–706. ACM (2007)
23. Turney, P.D., Pantel, P., et al.: From frequency to meaning: vector space models of semantics. J. Artif. Intell. Res. 37(1), 141–188 (2010)
24. Van de Cruys, T.: Two multivariate generalizations of pointwise mutual information. In: Proceedings of the Workshop on Distributional Semantics and Compositionality, pp. 16–20. Association for Computational Linguistics (2011)
25. Vrandečić, D., Krötzsch, M.: Wikidata: a free collaborative knowledgebase. Commun. ACM 57(10), 78–85 (2014)
26. Xing, W., Ghorbani, A.: Weighted pagerank algorithm. In: Second Annual Conference on Communication Networks and Services Research, 2004. Proceedings, pp. 305–314. IEEE (2004)
27. Yeh, E., Ramage, D., Manning, C.D., Agirre, E., Soroa, A.: WikiWalk: random walks on wikipedia for semantic relatedness. In: Proceedings of the 2009 Workshop on Graph-Based Methods for Natural Language Processing, pp. 41–49. Association for Computational Linguistics (2009)
28. Zhao, Y., Karypis, G.: Evaluation of hierarchical clustering algorithms for document datasets. In: Proceedings of the Eleventh International Conference on Information and Knowledge Management, pp. 515–524. ACM (2002)
Metadata
Title
Context Semantic Analysis: A Knowledge-Based Technique for Computing Inter-document Similarity
Written by
Fabio Benedetti
Domenico Beneventano
Sonia Bergamaschi
Copyright year
2016
DOI
https://doi.org/10.1007/978-3-319-46759-7_13