2018 | OriginalPaper | Chapter

Description of Turkish Paraphrase Corpus Structure and Generation Method

Authors: Bahar Karaoglan, Tarık Kışla, Senem Kumova Metin

Published in: Computational Linguistics and Intelligent Text Processing

Publisher: Springer International Publishing

Abstract

Because developing a corpus takes a long time and a great deal of human effort, it is desirable to make it as resourceful as possible: rich in coverage, flexible, multipurpose and expandable. Here we describe the steps we took in developing a Turkish paraphrase corpus, the factors we considered, the problems we faced and how we dealt with them. The corpus currently contains nearly 4000 sentences, with a ratio of 60% paraphrase to 40% non-paraphrase sentence pairs. The sentence pairs are annotated on a five-point scale: paraphrase, encapsulating, encapsulated, non-paraphrase and opposite. The corpus is organized as a database integrated with a Turkish dictionary. The sources used so far are news texts from the Bilcon 2005 corpus, a set of professionally translated sentence pairs from the MSRP corpus, multiple Turkish translations from different languages included in the Tatoeba corpus, and user-generated paraphrases.
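
The five-label annotation scheme named in the abstract can be illustrated with a minimal sketch. It is not the authors' database schema: the `SentencePair` fields, the `source` tag and the `label_distribution` helper are hypothetical names chosen for illustration, and the example pairs are English placeholders rather than corpus data.

```python
from collections import Counter
from dataclasses import dataclass
from enum import Enum


class Label(Enum):
    """The five annotation labels listed in the abstract."""
    PARAPHRASE = "paraphrase"
    ENCAPSULATING = "encapsulating"
    ENCAPSULATED = "encapsulated"
    NON_PARAPHRASE = "non-paraphrase"
    OPPOSITE = "opposite"


@dataclass(frozen=True)
class SentencePair:
    """One annotated pair; the field names are assumptions, not the paper's schema."""
    first: str
    second: str
    label: Label
    source: str  # hypothetical tag, e.g. "news", "MSRP", "Tatoeba", "user"


def label_distribution(pairs):
    """Return the fraction of pairs carrying each label (empty dict if no pairs)."""
    total = len(pairs)
    if total == 0:
        return {}
    counts = Counter(pair.label for pair in pairs)
    return {label: counts[label] / total for label in Label}


# Placeholder pairs for illustration only; they are not taken from the corpus.
toy_pairs = [
    SentencePair("A met B.", "B was met by A.", Label.PARAPHRASE, "user"),
    SentencePair("A met B in Izmir.", "A met B.", Label.ENCAPSULATING, "user"),
    SentencePair("A met B.", "A never met B.", Label.OPPOSITE, "user"),
    SentencePair("A met B.", "A and B met.", Label.PARAPHRASE, "user"),
]
distribution = label_distribution(toy_pairs)
```

How the five labels map onto the reported 60% paraphrase to 40% non-paraphrase split is not stated in the abstract, so the sketch reports a per-label distribution instead of guessing at that grouping.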

Metadata
Copyright Year
2018
DOI
https://doi.org/10.1007/978-3-319-75477-2_13
