ABSTRACT
In this paper, we propose a multi-criteria-based active learning approach and apply it effectively to named entity recognition. Active learning aims to minimize human annotation effort by selecting the examples to be labeled. To maximize the contribution of the selected examples, we consider multiple criteria (informativeness, representativeness and diversity) and propose measures to quantify them. Going further, we combine all the criteria using two selection strategies, both of which result in lower labeling cost than a single-criterion-based method. The results of named entity recognition on both MUC-6 and GENIA show that the labeling cost can be reduced by at least 80% without degrading performance.
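The abstract does not define the three measures or the two selection strategies, so the following is only a minimal sketch of how such criteria might be combined, not the method of the paper. Everything in it is an assumption made for illustration: informativeness is taken as closeness to a classifier's decision boundary (`margins`), representativeness as average cosine similarity to the rest of the pool, and diversity as a similarity threshold `beta` against examples already chosen for the batch. The names `select_batch`, `lam` and `beta` are hypothetical.

```python
import math


def cosine(u, v):
    """Cosine similarity between two feature vectors (0.0 if either is all zeros)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0


def informativeness(margin):
    """Assumed measure: examples closer to the decision boundary score higher."""
    return 1.0 / (1.0 + abs(margin))


def density(i, pool):
    """Assumed representativeness measure: mean similarity to all other pool examples."""
    sims = [cosine(pool[i], pool[j]) for j in range(len(pool)) if j != i]
    return sum(sims) / len(sims) if sims else 0.0


def select_batch(pool, margins, batch_size, lam=0.6, beta=0.9):
    """Rank by a weighted mix of informativeness and density, then enforce
    diversity by skipping candidates too similar to ones already selected."""
    scores = [
        lam * informativeness(margins[i]) + (1.0 - lam) * density(i, pool)
        for i in range(len(pool))
    ]
    ranked = sorted(range(len(pool)), key=lambda i: scores[i], reverse=True)
    batch = []
    for i in ranked:
        if len(batch) == batch_size:
            break
        # Diversity check: similarity to every selected example must stay below beta.
        if all(cosine(pool[i], pool[j]) < beta for j in batch):
            batch.append(i)
    return batch


if __name__ == "__main__":
    # Toy pool: examples 0 and 1 are near-duplicates, so at most one is selected.
    pool = [[1.0, 0.0], [1.0, 0.01], [0.0, 1.0], [0.7, 0.7]]
    margins = [0.1, 0.0, 0.5, 2.0]
    print(select_batch(pool, margins, batch_size=2))
```

In this toy pool the two near-duplicate examples both rank highly, but the diversity check lets only one of them into the batch, so a labeling slot is not spent on redundant information.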