2021 | OriginalPaper | Book chapter

CTRD: A Chinese Theme-Rheme Discourse Dataset

Authors: Biao Fu, Yiqi Tong, Dawei Tian, Yidong Chen, Xiaodong Shi, Ming Zhu

Published in: Natural Language Processing and Chinese Computing

Publisher: Springer International Publishing

Abstract

Discourse topic structure is key to the cohesion of a discourse and reflects the essence of the text. Current Chinese discourse corpora are constructed mainly on the basis of rhetorical and semantic relations, and so ignore the functional information in discourse. To alleviate this problem, we introduce a new Chinese discourse analysis dataset called CTRD (Chinese Theme-Rheme Discourse dataset). Unlike previous discourse banks, CTRD is annotated according to a novel discourse annotation scheme based on Chinese theme-rheme theory and the thematic progression patterns of Halliday’s systemic functional grammar. We manually annotated 525 news documents from OntoNotes 4.0, with inter-annotator agreement (Kappa) greater than 0.6. Preliminary experiments on this corpus verify the computability of CTRD. CTRD is available at https://github.com/ydc/ctrd.
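
As an illustration of the agreement figure quoted in the abstract, the following minimal sketch computes Cohen's kappa for two annotators. It rests on assumptions: the abstract does not state which Kappa variant was used or how many annotators took part, and the function name `cohen_kappa` and the theme/rheme label sequences are hypothetical, not taken from CTRD.

```python
from collections import Counter


def cohen_kappa(labels_a, labels_b):
    """Cohen's kappa for two annotators labelling the same items."""
    if len(labels_a) != len(labels_b) or not labels_a:
        raise ValueError("label sequences must be non-empty and equally long")
    n = len(labels_a)
    # Observed agreement: fraction of items given identical labels.
    p_o = sum(a == b for a, b in zip(labels_a, labels_b)) / n
    # Chance agreement from each annotator's marginal label distribution.
    count_a, count_b = Counter(labels_a), Counter(labels_b)
    p_e = sum(count_a[k] * count_b[k] for k in count_a) / (n * n)
    if p_e == 1.0:
        return 1.0  # both annotators used one identical label throughout
    return (p_o - p_e) / (1 - p_e)


# Hypothetical labels (T = theme, R = rheme); not real CTRD annotations.
annotator_1 = ["T", "T", "R", "R", "T", "R", "R", "R", "T", "R"]
annotator_2 = ["T", "T", "R", "R", "R", "R", "R", "R", "T", "R"]
kappa = cohen_kappa(annotator_1, annotator_2)
print(f"kappa = {kappa:.3f}")
```

In this toy example the annotators disagree on one item out of ten, so observed agreement is 0.9 and chance agreement is 0.54, which gives a kappa above the 0.6 threshold mentioned in the abstract.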


Metadata
Title
CTRD: A Chinese Theme-Rheme Discourse Dataset
Authors
Biao Fu
Yiqi Tong
Dawei Tian
Yidong Chen
Xiaodong Shi
Ming Zhu
Copyright year
2021
DOI
https://doi.org/10.1007/978-3-030-88480-2_6