Published in: Information Systems Frontiers 1/2011

1 March 2011

Domain-specific Chinese word segmentation using suffix tree and mutual information

Authors: Daniel Zeng, Donghua Wei, Michael Chau, Feiyue Wang


Abstract

As the amount of online Chinese content grows, there is a critical need for effective Chinese word segmentation approaches to facilitate Web computing applications in a range of domains, including terrorism informatics. Most existing Chinese word segmentation approaches are either statistics-based or dictionary-based. Pure statistical methods have lower precision, while pure dictionary-based methods cannot handle new words that are absent from the dictionary. In this paper, we propose a hybrid method that avoids the limitations of both types of approaches. By combining a suffix tree and mutual information (MI) with the dictionary, our segmenter, called IASeg, achieves high accuracy in word segmentation when domain training is available. It can also identify new words through MI-based token merging and dictionary updating. In addition, with the proposed Improved Bigram method, IASeg can process N-grams. To evaluate the performance of our segmenter, we compare it with two well-known systems, the Hylanda segmenter and the ICTCLAS segmenter, using a terrorism-centric corpus and a general corpus. The experimental results show that IASeg outperforms the benchmarks in both precision and recall on the domain-specific corpus and achieves comparable performance on the general corpus.
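The abstract does not give the formula or thresholds behind MI-based token merging, so the following is only a minimal sketch of the general idea, not IASeg's method. It assumes the standard pointwise mutual information, log2(P(ab) / (P(a) P(b))), estimated from raw counts, and a greedy left-to-right merge. The function names `pmi_table` and `merge_by_pmi`, the threshold value, and the toy input (Latin letters standing in for Chinese characters) are all hypothetical.

```python
import math
from collections import Counter


def pmi_table(tokens):
    """Estimate pointwise mutual information for each adjacent token pair.

    Assumption: probabilities are plain relative frequencies, no smoothing.
    """
    unigrams = Counter(tokens)
    bigrams = Counter(zip(tokens, tokens[1:]))
    n_uni = len(tokens)
    n_bi = max(len(tokens) - 1, 1)
    table = {}
    for (a, b), count in bigrams.items():
        p_ab = count / n_bi
        p_a = unigrams[a] / n_uni
        p_b = unigrams[b] / n_uni
        table[(a, b)] = math.log2(p_ab / (p_a * p_b))
    return table


def merge_by_pmi(tokens, table, threshold):
    """Greedily merge adjacent tokens whose PMI reaches the threshold."""
    merged = []
    i = 0
    while i < len(tokens):
        if i + 1 < len(tokens):
            score = table.get((tokens[i], tokens[i + 1]), float("-inf"))
        else:
            score = float("-inf")  # last token has no right neighbour
        if score >= threshold:
            merged.append(tokens[i] + tokens[i + 1])
            i += 2
        else:
            merged.append(tokens[i])
            i += 1
    return merged


# Toy input: "a" is always followed by "b", while "x" also occurs elsewhere.
tokens = list("abxabxxabxx")
table = pmi_table(tokens)
candidates = merge_by_pmi(tokens, table, threshold=1.5)
```

In this toy sequence the pair ("a", "b") scores higher than ("b", "x"), so only "ab" is merged at the chosen threshold. In a hybrid segmenter of the kind the abstract describes, merged candidates like these would presumably be checked and added to the dictionary; how IASeg does that is not stated here.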


Metadata
Title
Domain-specific Chinese word segmentation using suffix tree and mutual information
Authors
Daniel Zeng
Donghua Wei
Michael Chau
Feiyue Wang
Publication date
1 March 2011
Publisher
Springer US
Published in
Information Systems Frontiers / Issue 1/2011
Print ISSN: 1387-3326
Electronic ISSN: 1572-9419
DOI
https://doi.org/10.1007/s10796-010-9278-5
