ABSTRACT
Automatic key phrase extraction is fundamental to many recent digital library applications and semantic information retrieval techniques, and it is a difficult but essential problem in Vietnamese natural language processing (NLP). In this work, we propose a novel method for key phrase extraction from Vietnamese text that combines the assignment and extraction approaches. We also explore the NLP techniques we propose for analyzing Vietnamese texts, focusing on an advanced candidate phrase recognition phase and on part-of-speech (POS) tagging. We then propose a method that exploits specific characteristics of the Vietnamese language and uses the Vietnamese Wikipedia as an ontology for resolving key phrase ambiguity. Finally, we report the results of several experiments that examine the impact of the strategies chosen for Vietnamese key phrase extraction.
Index Terms
- Key phrase extraction: a hybrid assignment and extraction approach