Published in: Journal of Intelligent Information Systems 3/2017

28.11.2016

Sentence similarity based on semantic kernels for intelligent text retrieval

Authors: Samir Amir, Adrian Tanasescu, Djamel A. Zighed



Abstract

We propose a new approach to computing the semantic similarity between sentences. It is based on the semantic kernel, composed of the subject, verb, and object, which we assume summarizes the general meaning of each sentence. Using available linguistic resources such as the Stanford Parser, many features are extracted from the semantic kernels and aggregated by means of weights. The weights are produced by a supervised machine learning technique from a training data set provided by human experts as ground truth. Cross-validation shows good performance. With this sentence similarity measure, one can build an intelligent text retrieval engine that is more sensitive to semantic content and better suited to short texts than classical methods based on bag of words. An application is being developed for highlighting parts of speech in scientific articles.
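
The weighted aggregation described in the abstract can be illustrated with a minimal sketch. It is not the authors' implementation: the `Kernel` type, the three per-slot features, the character-level string similarity used as a stand-in for a semantic measure, and the gradient-descent least-squares fit are all assumptions made for illustration. The abstract names neither the features nor the learning technique.

```python
from difflib import SequenceMatcher
from typing import List, NamedTuple, Sequence


class Kernel(NamedTuple):
    """Hypothetical container for a sentence's (subject, verb, object) triple."""
    subject: str
    verb: str
    obj: str


def slot_similarity(a: str, b: str) -> float:
    # Stand-in feature: character-level string similarity in [0, 1].
    # A real system would plug in a semantic word-similarity measure here.
    return SequenceMatcher(None, a.lower(), b.lower()).ratio()


def features(k1: Kernel, k2: Kernel) -> List[float]:
    # One feature per kernel slot: subject, verb, object.
    return [slot_similarity(x, y) for x, y in zip(k1, k2)]


def aggregate(feats: Sequence[float], weights: Sequence[float]) -> float:
    # Sentence similarity as a weighted sum of the features.
    return sum(w * f for w, f in zip(weights, feats))


def fit_weights(X: Sequence[Sequence[float]], y: Sequence[float],
                lr: float = 0.1, epochs: int = 2000) -> List[float]:
    # Assumed learner: least-squares fit by batch gradient descent, no intercept.
    # X holds one feature vector per sentence pair, y the human similarity scores.
    n_feats = len(X[0])
    w = [1.0 / n_feats] * n_feats
    for _ in range(epochs):
        grad = [0.0] * n_feats
        for row, target in zip(X, y):
            err = aggregate(row, w) - target
            for j, f in enumerate(row):
                grad[j] += 2.0 * err * f / len(X)
        w = [wj - lr * gj for wj, gj in zip(w, grad)]
    return w


if __name__ == "__main__":
    k1 = Kernel("cat", "chases", "mouse")
    k2 = Kernel("cat", "chases", "rat")
    # Illustrative weights only; in practice they would come from fit_weights.
    print(aggregate(features(k1, k2), [0.3, 0.4, 0.3]))
```

In this sketch the kernels are given by hand. Extracting them from raw sentences would require a dependency parser such as the Stanford Parser mentioned in the abstract.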

Metadata
Title: Sentence similarity based on semantic kernels for intelligent text retrieval
Authors: Samir Amir, Adrian Tanasescu, Djamel A. Zighed
Publication date: 28.11.2016
Publisher: Springer US
Published in: Journal of Intelligent Information Systems, Issue 3/2017
Print ISSN: 0925-9902
Electronic ISSN: 1573-7675
DOI: https://doi.org/10.1007/s10844-016-0434-3
