2015 | OriginalPaper | Book chapter

Low-Level Features for Paraphrase Identification

Authors: Ekaterina Pronoza, Elena Yagunova

Published in: Advances in Artificial Intelligence and Soft Computing

Publisher: Springer International Publishing

Abstract

This paper deals with the task of sentential paraphrase identification. We work with Russian, but our approach can be applied to any other language with rich morphology and free word order. As part of our ParaPhraser.ru project, we construct a paraphrase corpus and then experiment with supervised methods of paraphrase identification. In this paper we focus on low-level string, lexical and semantic features which, unlike complex deep ones, do not introduce noise and can serve as a solid basis for developing an effective paraphrase identification system. Experimental results show that the features introduced in this paper improve a paraphrase identification model that is based solely on standard low-level features or on the optimized matrix metric used for corpus construction.

Footnotes
1
In fact, our approach is not restricted to languages with these characteristics (e.g., it can also be applied to English), but the features we propose in this paper take considerable advantage of them; we therefore recommend our method for morphologically rich languages with free word order.
 
2
We follow a simplified approach and treat any notional title-cased word as a proper name.
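The heuristic in this footnote can be illustrated with a minimal Python sketch. It is not the authors' implementation: the function name `proper_name_candidates` and the tiny `FUNCTION_WORDS` list are hypothetical, and "notional" is approximated here by excluding a handful of function words, whereas a real system would more likely rely on part-of-speech tags.

```python
# Hypothetical, deliberately tiny list of function (non-notional) words.
# A real system would use a POS tagger or a full stop-word list instead.
FUNCTION_WORDS = {
    "the", "a", "an", "in", "on", "of", "and", "but", "to", "at",
    "в", "на", "и", "с", "по", "не",
}


def proper_name_candidates(tokens):
    """Return the tokens treated as proper names under the simplified rule:
    a token counts if it is title-cased and is not a known function word."""
    return [
        tok for tok in tokens
        if tok.istitle() and tok.lower() not in FUNCTION_WORDS
    ]


# A sentence-initial preposition is filtered out; the city name is kept.
print(proper_name_candidates(["В", "Москве", "прошла", "встреча"]))

# The rule over-generates: any title-cased content word is accepted.
print(proper_name_candidates(["The", "President", "visited", "Moscow", "in", "May"]))
```

The second call shows why the footnote calls the approach simplified: title-cased common nouns such as "President" are accepted alongside genuine names.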
 
3
In this section we only show that the modified metric improves over our baseline. We do not address the task of selecting the optimal classifier; we simply choose SVM because it is well known and widely used in NLP. Later, in Sect. 5, we present the results obtained with other classifiers.
 
5
In this paper we do not attempt to select the optimal classifier; we leave a thorough choice of classifier for future work.
 
Metadata
Title
Low-Level Features for Paraphrase Identification
Authors
Ekaterina Pronoza
Elena Yagunova
Copyright year
2015
DOI
https://doi.org/10.1007/978-3-319-27060-9_5