2019 | OriginalPaper | Book chapter

PKU Paraphrase Bank: A Sentence-Level Paraphrase Corpus for Chinese

Authors: Bowei Zhang, Weiwei Sun, Xiaojun Wan, Zongming Guo

Published in: Natural Language Processing and Chinese Computing

Publisher: Springer International Publishing

Abstract

One of the main challenges in paraphrase research is the lack of large-scale, high-quality corpora, a problem that is particularly serious for languages other than English. In this paper, we present a simple and effective unsupervised learning model that automatically extracts high-quality sentence-level paraphrases from multiple Chinese translations of the same source texts. Applying this model, we obtain a large-scale paraphrase corpus containing 509,832 pairs of paraphrased sentences, and we manually examine its quality. The model is language-independent, so paraphrase corpora for other languages can be built in the same way.

Footnotes
3
The supplementary note gives detailed information about these books.
 
4
Since the sentence pairs in our corpus have no fixed reference direction, the original PINC formula has been slightly modified: PINC is calculated once, calculated again with the two sentences exchanged, and the two results are averaged.
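
The averaging described in this footnote can be sketched as follows. This is a minimal illustration, not the authors' implementation. It assumes PINC as it is usually defined: the mean, over n-gram orders 1 to N, of the fraction of candidate n-grams that do not appear in the source. The function names, the default `max_n=4`, the use of n-gram sets, and the choice to skip orders longer than the candidate sentence are all assumptions made for this sketch. Sentences are passed in as token lists, so word segmentation of the Chinese text is left to the caller.

```python
def ngrams(tokens, n):
    """Return the set of n-grams (as tuples) found in a token list."""
    return {tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)}


def pinc(source, candidate, max_n=4):
    """One-directional PINC: higher values mean the candidate shares
    fewer n-grams with the source, i.e. it is lexically more novel."""
    scores = []
    for n in range(1, max_n + 1):
        cand = ngrams(candidate, n)
        if not cand:
            # Assumption: skip orders the candidate is too short for.
            continue
        overlap = len(cand & ngrams(source, n))
        scores.append(1.0 - overlap / len(cand))
    return sum(scores) / len(scores) if scores else 0.0


def symmetric_pinc(sent_a, sent_b, max_n=4):
    """Average of PINC computed in both directions, as in the footnote."""
    return (pinc(sent_a, sent_b, max_n) + pinc(sent_b, sent_a, max_n)) / 2.0
```

Because the two directions are averaged, `symmetric_pinc(a, b)` equals `symmetric_pinc(b, a)`, which suits a corpus in which neither sentence of a pair is the designated reference.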
 
Metadata
Title
PKU Paraphrase Bank: A Sentence-Level Paraphrase Corpus for Chinese
Authors
Bowei Zhang
Weiwei Sun
Xiaojun Wan
Zongming Guo
Copyright year
2019
DOI
https://doi.org/10.1007/978-3-030-32233-5_63