Published in: Information Systems Frontiers

1 March 2011

Domain-specific Chinese word segmentation using suffix tree and mutual information

Authors: Daniel Zeng, Donghua Wei, Michael Chau, Feiyue Wang

Published in: Information Systems Frontiers | Issue 1/2011

Abstract

As the amount of online Chinese content grows, there is a critical need for effective Chinese word segmentation approaches to support Web computing applications in a range of domains, including terrorism informatics. Most existing Chinese word segmentation approaches are either statistics-based or dictionary-based. Purely statistical methods have lower precision, while purely dictionary-based methods cannot handle new words that are absent from the dictionary. In this paper, we propose a hybrid method that avoids the limitations of both types of approaches. By combining a suffix tree and mutual information (MI) with the dictionary, our segmenter, called IASeg, achieves high accuracy in word segmentation when domain training is available. It can also identify new words through MI-based token merging and dictionary updating. In addition, with the proposed Improved Bigram method, IASeg can process N-grams. To evaluate the performance of our segmenter, we compare it with two well-known systems, the Hylanda segmenter and the ICTCLAS segmenter, using a terrorism-centric corpus and a general corpus. The experimental results show that IASeg outperforms both benchmarks in precision and recall on the domain-specific corpus and achieves comparable performance on the general corpus.
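
The abstract does not state the formula or merging rule that IASeg uses, so the following is only an illustrative sketch of the general idea behind MI-based token merging. It assumes standard pointwise mutual information over adjacent tokens, log2(P(ab) / (P(a) P(b))), and a simple greedy left-to-right merge. The function names `pmi_table` and `merge_tokens`, the toy corpus, and the threshold value are all hypothetical and are not taken from the paper.

```python
import math
from collections import Counter


def pmi_table(corpus):
    """Compute pointwise mutual information for every adjacent token pair.

    `corpus` is a list of token lists. Probabilities are plain relative
    frequencies (an assumption; the paper may estimate them differently).
    """
    unigrams = Counter()
    bigrams = Counter()
    for tokens in corpus:
        unigrams.update(tokens)
        bigrams.update(zip(tokens, tokens[1:]))
    n_uni = sum(unigrams.values())
    n_bi = sum(bigrams.values())
    scores = {}
    for (a, b), count in bigrams.items():
        p_ab = count / n_bi
        p_a = unigrams[a] / n_uni
        p_b = unigrams[b] / n_uni
        scores[(a, b)] = math.log2(p_ab / (p_a * p_b))
    return scores


def merge_tokens(tokens, scores, threshold):
    """Greedily merge adjacent tokens whose PMI is at least `threshold`.

    Pairs never seen in the corpus are treated as unmergeable.
    """
    merged = []
    i = 0
    while i < len(tokens):
        pair = tuple(tokens[i:i + 2])
        if len(pair) == 2 and scores.get(pair, float("-inf")) >= threshold:
            merged.append(tokens[i] + tokens[i + 1])
            i += 2
        else:
            merged.append(tokens[i])
            i += 1
    return merged


# Toy usage: "a b" co-occurs more often than chance, so it is merged.
corpus = [["a", "b", "c"], ["a", "b", "d"], ["c", "d"]]
scores = pmi_table(corpus)
print(merge_tokens(["a", "b", "c"], scores, threshold=2.0))
```

In a hybrid segmenter of the kind the abstract describes, merged tokens that recur could then be added to the dictionary as candidate new words; how IASeg selects and validates such candidates is not specified here.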

Title: Domain-specific Chinese word segmentation using suffix tree and mutual information
Authors: Daniel Zeng
Donghua Wei
Michael Chau
Feiyue Wang
Publication date: 1 March 2011
Publisher: Springer US
Published in: Information Systems Frontiers / Issue 1/2011
Print ISSN: 1387-3326
Electronic ISSN: 1572-9419
DOI: https://doi.org/10.1007/s10796-010-9278-5
