ABSTRACT
In this paper, we briefly describe the procedures of our segmentation system, including the methods for resolving segmentation ambiguities and identifying unknown words. The CKIP group of Academia Sinica participated in the open and closed tracks for the Beijing University (PK) and Hong Kong CityU (HK) corpora. The evaluation results show that our system performs very well in both the HK open and HK closed tracks, but only acceptably in the PK tracks. We present some explanations and analysis of these results.