ABSTRACT
This paper discusses experiments on automatic extraction of keywords from abstracts using a supervised machine learning algorithm. Its main point is that adding linguistic knowledge to the representation (such as syntactic features), rather than relying only on statistics (such as term frequency and n-grams), yields better results, as measured against keywords previously assigned by professional indexers. In more detail, extracting NP-chunks gives better precision than extracting n-grams, and adding the PoS tag(s) assigned to the term as a feature dramatically improves the results, independent of the term selection approach applied.
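To make the two term selection approaches and the PoS feature concrete, here is a minimal sketch. It is not the paper's implementation. The toy sentence and its Penn Treebank style tags are hand-made, the function names are hypothetical, and the NP-chunk rule (a run of adjectives and nouns ending in a noun) is a crude stand-in for a real chunker. The supervised classifier that would consume these features is not shown.

```python
from collections import Counter

# Hand-tagged toy sentence (hypothetical data, Penn Treebank style tags).
TAGGED = [
    ("automatic", "JJ"), ("keyword", "NN"), ("extraction", "NN"),
    ("uses", "VBZ"), ("a", "DT"), ("supervised", "JJ"),
    ("learning", "NN"), ("algorithm", "NN"), ("for", "IN"),
    ("keyword", "NN"), ("extraction", "NN"),
]


def ngram_candidates(tagged, max_n=3):
    """Return every contiguous span of 1..max_n tokens as a candidate term."""
    spans = []
    for n in range(1, max_n + 1):
        for i in range(len(tagged) - n + 1):
            spans.append(tagged[i:i + n])
    return spans


def np_chunk_candidates(tagged):
    """Return maximal adjective/noun runs, trimmed so each ends in a noun.

    Assumption: this pattern only approximates NP-chunking.
    """
    spans, run = [], []
    # The sentinel pair flushes a run that reaches the end of the input.
    for pair in tagged + [("", "END")]:
        if pair[1].startswith(("JJ", "NN")):
            run.append(pair)
            continue
        # Drop trailing adjectives so the chunk ends in a noun.
        while run and not run[-1][1].startswith("NN"):
            run.pop()
        if run:
            spans.append(run)
        run = []
    return spans


def featurize(spans):
    """Map each candidate term to its term frequency and PoS tag sequence."""
    tf = Counter(" ".join(tok for tok, _ in s) for s in spans)
    feats = {}
    for s in spans:
        term = " ".join(tok for tok, _ in s)
        feats[term] = {"tf": tf[term], "pos": " ".join(tag for _, tag in s)}
    return feats


if __name__ == "__main__":
    for name, select in [("n-grams", ngram_candidates),
                         ("NP-chunks", np_chunk_candidates)]:
        feats = featurize(select(TAGGED))
        print(name, len(feats), "distinct candidates")
```

On this toy input the n-gram selector proposes many more candidates than the chunk selector, including spans such as "uses a" that are unlikely keywords. The PoS tag sequence gives a classifier a way to tell such candidates apart from noun-like ones, whichever selector produced them.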