
2015 | OriginalPaper | Chapter

Low-Level Features for Paraphrase Identification

Authors: Ekaterina Pronoza, Elena Yagunova

Published in: Advances in Artificial Intelligence and Soft Computing

Publisher: Springer International Publishing


Abstract

This paper deals with sentential paraphrase identification. We work with Russian, but our approach can be applied to any other language with rich morphology and free word order. As part of our ParaPhraser.ru project, we construct a paraphrase corpus and then experiment with supervised methods of paraphrase identification. Here we focus on low-level string, lexical and semantic features which, unlike complex deep ones, do not introduce informational noise and can serve as a solid basis for an effective paraphrase identification system. Experimental results show that the features introduced in this paper improve on paraphrase identification models based solely on the standard low-level features or on the optimized matrix metric used for corpus construction.
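
To illustrate what a low-level string feature for a sentence pair can look like, here is a minimal sketch of two common word-overlap measures, the Dice coefficient and the Jaccard index. These are generic examples and are not claimed to be the features proposed in the chapter; the function names and the regex tokenizer are assumptions made for illustration.

```python
import re


def tokens(sentence):
    """Lowercase a sentence and return its set of word tokens.

    The regex tokenizer is an illustrative assumption; it handles
    Unicode letters, so Cyrillic text works as well.
    """
    return set(re.findall(r"\w+", sentence.lower()))


def dice(a, b):
    """Dice coefficient over the word sets of two sentences."""
    ta, tb = tokens(a), tokens(b)
    if not ta and not tb:
        return 1.0  # convention chosen here: two empty sentences are identical
    return 2 * len(ta & tb) / (len(ta) + len(tb))


def jaccard(a, b):
    """Jaccard index over the word sets of two sentences."""
    ta, tb = tokens(a), tokens(b)
    union = ta | tb
    return len(ta & tb) / len(union) if union else 1.0
```

Scores like these could be fed, together with other features, to a supervised classifier; for a morphologically rich language one would likely compare lemmas rather than surface forms.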


Footnotes
1
In fact, our approach is not restricted to languages with these characteristics (e.g., it can be applied to English as well), but the features we propose in this paper rely heavily on them, and we therefore recommend our method for morphologically rich languages with free word order.
 
2
We follow a simplified approach and consider any title-cased notional (content) word to be a proper name.
 
3
In this section we only show that the modified metric improves over our baseline: we do not solve the task of selecting the optimal classifier, and we simply choose SVM because it is well-known and widely used in NLP. Further in Sect. 5 we present the results obtained in the experiments with other classifiers.
 
5
In this paper we do not attempt to select the optimal classifier – we leave the elaborate choice of it for future work.
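
The simplified proper-name heuristic described in footnote 2 can be sketched as follows. This is only an illustration under stated assumptions: the `FUNCTION_WORDS` set is a hypothetical, deliberately tiny English stand-in for whatever the authors use to tell notional words from function words, and the regex tokenizer is likewise an assumption.

```python
import re

# Hypothetical, deliberately tiny function-word list (an assumption for
# illustration). A real system would rely on a POS tagger or a full
# stop-word list to decide which words are notional.
FUNCTION_WORDS = {"the", "a", "an", "in", "on", "at", "of", "and", "but", "or", "to"}


def proper_names(sentence):
    """Return title-cased words that are not function words.

    Every remaining title-cased word is treated as a proper name,
    including a notional word that merely starts the sentence.
    """
    words = re.findall(r"\w+", sentence)
    return [w for w in words if w.istitle() and w.lower() not in FUNCTION_WORDS]
```

For example, `proper_names("In Moscow the minister met Angela Merkel")` returns `["Moscow", "Angela", "Merkel"]`: the sentence-initial "In" is filtered out as a function word, while a sentence-initial content word would be kept, which is the price of the simplification.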
 
Metadata
DOI
https://doi.org/10.1007/978-3-319-27060-9_5
