ABSTRACT
Chinese sentences are written as strings of characters, with no blanks to mark word boundaries. However, the basic unit for sentence parsing and understanding is the word. Therefore, the first step in processing Chinese sentences is to identify the words. The difficulties of word identification include (1) identifying complex words, such as determinative-measure compounds, reduplications, and derived words; (2) identifying proper names; and (3) resolving ambiguous segmentations. In this paper, we propose possible solutions to these difficulties. We adopt a matching algorithm with six heuristic rules to resolve the ambiguities and achieve a success rate of 99.77%. The statistical data support the claim that maximal matching is the most effective heuristic.
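To make the idea of maximal matching concrete, here is a minimal sketch of its simplest form, greedy forward matching against a word list. It is an illustration only, not the six-rule algorithm described in the paper: the function name `max_match`, the `max_len` parameter, and the toy lexicon are all assumptions introduced for this example.

```python
def max_match(sentence, lexicon, max_len=None):
    """Greedy forward maximal matching.

    At each position, take the longest substring found in the lexicon;
    fall back to a single character when nothing matches.
    """
    if max_len is None:
        # Longest entry in the lexicon bounds the search window.
        max_len = max((len(word) for word in lexicon), default=1)
    words = []
    i = 0
    while i < len(sentence):
        # Try the longest candidate first and shrink until one matches.
        for length in range(min(max_len, len(sentence) - i), 0, -1):
            candidate = sentence[i:i + length]
            if length == 1 or candidate in lexicon:
                words.append(candidate)
                i += length
                break
    return words


# Hypothetical toy lexicon, for illustration only.
TOY_LEXICON = {"研究", "研究生", "生命", "起源", "的"}

print(max_match("生命的起源", TOY_LEXICON))    # ['生命', '的', '起源']
print(max_match("研究生命起源", TOY_LEXICON))  # ['研究生', '命', '起源']
```

The second call shows the kind of ambiguity the abstract refers to: the greedy choice of 研究生 ("graduate student") strands 命, whereas the intended reading is 研究 / 生命 / 起源 ("study the origin of life"). Cases like this are why a plain longest-match rule is usually combined with further disambiguation heuristics.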
- Word identification for Mandarin Chinese sentences