2021 | OriginalPaper | Chapter

CTRD: A Chinese Theme-Rheme Discourse Dataset

Authors: Biao Fu, Yiqi Tong, Dawei Tian, Yidong Chen, Xiaodong Shi, Ming Zhu

Published in: Natural Language Processing and Chinese Computing

Publisher: Springer International Publishing

Abstract

Discourse topic structure is key to the cohesion of a discourse and reflects the essence of the text. Current Chinese discourse corpora are constructed mainly on the basis of rhetorical and semantic relations, ignoring the functional information in discourse. To alleviate this problem, we introduce a new Chinese discourse analysis dataset called CTRD, which stands for Chinese Theme-Rheme Discourse dataset. Unlike previous discourse banks, CTRD was annotated according to a novel discourse annotation scheme based on the Chinese theme-rheme theory and thematic progression patterns from Halliday’s systemic functional grammar. We manually annotated 525 news documents from OntoNotes 4.0 with a Kappa value greater than 0.6. Preliminary experiments on this corpus verify the computability of CTRD. Finally, we make CTRD available at https://github.com/ydc/ctrd.
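
To make the reported agreement figure concrete, the following is a minimal sketch of how a Kappa value can be computed. It assumes Cohen's kappa for two annotators, which the abstract does not confirm, and the theme/rheme labels and the function name `cohen_kappa` are hypothetical illustrations rather than CTRD data or the authors' code.

```python
from collections import Counter


def cohen_kappa(labels_a, labels_b):
    """Cohen's kappa for two annotators labelling the same items."""
    if len(labels_a) != len(labels_b) or not labels_a:
        raise ValueError("need two equal-length, non-empty label sequences")
    n = len(labels_a)
    # Observed agreement: fraction of items given the same label by both.
    p_o = sum(a == b for a, b in zip(labels_a, labels_b)) / n
    # Chance agreement from each annotator's marginal label frequencies.
    freq_a, freq_b = Counter(labels_a), Counter(labels_b)
    p_e = sum(freq_a[k] * freq_b[k] for k in freq_a) / (n * n)
    if p_e == 1.0:
        # Degenerate case: both annotators used one identical label.
        # Kappa is undefined here; returning 1.0 is a choice made for this sketch.
        return 1.0
    return (p_o - p_e) / (1 - p_e)


# Hypothetical theme (T) / rheme (R) labels, not taken from CTRD.
annotator_1 = ["T", "T", "R", "R", "T", "R", "R", "T", "R", "R"]
annotator_2 = ["T", "T", "R", "R", "T", "R", "T", "T", "R", "R"]
kappa = cohen_kappa(annotator_1, annotator_2)
```

For these made-up labels the annotators agree on 9 of 10 items, chance agreement is 0.5, and the resulting kappa is 0.8, which would clear the 0.6 threshold mentioned in the abstract.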

Metadata

Title: CTRD: A Chinese Theme-Rheme Discourse Dataset
Authors: Biao Fu, Yiqi Tong, Dawei Tian, Yidong Chen, Xiaodong Shi, Ming Zhu
Copyright Year: 2021
DOI: https://doi.org/10.1007/978-3-030-88480-2_6