2019 | OriginalPaper | Book chapter

A Word Segmentation Method of Ancient Chinese Based on Word Alignment

Authors: Chao Che, Hanyu Zhao, Xiaoting Wu, Dongsheng Zhou, Qiang Zhang

Published in: Natural Language Processing and Chinese Computing

Publisher: Springer International Publishing

Abstract

Because no public tagged corpora are available for ancient Chinese word segmentation (CWS), state-of-the-art CWS methods cannot be applied to ancient Chinese. To address this problem, this paper proposes a word segmentation method based on word alignment (WSWA). Specifically, the method segments words according to the word alignment between modern Chinese words and ancient Chinese characters: if multiple consecutive ancient Chinese characters align to the same modern Chinese word, they are treated as one word. Because many modern Chinese words are derived from ancient Chinese, the method also exploits characters that co-occur in the modern and ancient texts to extract words for CWS. Moreover, to reduce the effect of alignment errors, the method removes word alignments that easily lead to CWS errors. We quantitatively analyze the effects of modern CWS and word alignment on the WSWA method using hand-annotated corpora. Our method outperforms state-of-the-art methods in the WSA experiment on Shiji by a large margin, which demonstrates the effectiveness of using word alignment to perform ancient CWS.
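The grouping rule in the abstract (consecutive ancient characters aligned to the same modern word form one word) can be illustrated with a minimal Python sketch. This is not the authors' implementation: the function name `segment_by_alignment`, the alignment encoding (one modern-word index per ancient character, `None` for unaligned characters), and the toy sentence are all assumptions made for illustration. The co-occurrence heuristic and the alignment filtering mentioned in the abstract are not modeled.

```python
from typing import List, Optional


def segment_by_alignment(chars: List[str],
                         align: List[Optional[int]]) -> List[str]:
    """Merge consecutive characters that share a modern-word index.

    chars: ancient Chinese characters of one sentence.
    align: for each character, the index of the modern Chinese word it
           aligns to, or None if it is unaligned (hypothetical encoding).
    """
    if len(chars) != len(align):
        raise ValueError("chars and align must have the same length")

    words: List[str] = []
    prev: Optional[int] = None
    for ch, idx in zip(chars, align):
        # Extend the current word only when this character and the
        # previous one align to the same modern word.
        if words and idx is not None and idx == prev:
            words[-1] += ch
        else:
            words.append(ch)
        prev = idx
    return words


# Toy, hand-made alignment (not taken from the paper's data):
# the first two characters align to modern word 0, the others to 1 and 2.
toy_chars = list("沛公至军")
toy_align = [0, 0, 1, 2]
print(segment_by_alignment(toy_chars, toy_align))
```

In this sketch, unaligned characters are left as single-character words, which is one possible default rather than a choice stated in the abstract.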

Metadata
Title
A Word Segmentation Method of Ancient Chinese Based on Word Alignment
Authors
Chao Che
Hanyu Zhao
Xiaoting Wu
Dongsheng Zhou
Qiang Zhang
Copyright year
2019
DOI
https://doi.org/10.1007/978-3-030-32233-5_59