Published in: Knowledge and Information Systems 2/2014

01-08-2014 | Regular Paper

Exploring feature sets for two-phase biomedical named entity recognition using semi-CRFs

Authors: Li Yang, Yanhong Zhou



Abstract

This paper presents a two-phase approach based on semi-Markov conditional random fields (semi-CRFs) and explores novel feature sets for classifying the entities in text into five types: protein, DNA, RNA, cell_line and cell_type. Semi-CRFs assign a label to a segment rather than to a single word, which is more natural than in other machine learning methods such as conditional random fields (CRFs). Our approach divides the biomedical named entity recognition task into two sub-tasks: term boundary detection and semantic labeling. In the first phase, the term boundary detection sub-task detects the boundaries of the entities and assigns all of them to a single generic type C. In the second phase, the semantic labeling sub-task assigns the correct entity type to each entity detected in the first phase. We explore novel feature sets in both phases to improve performance. For comparison, experiments are conducted with both CRFs and semi-CRFs in each phase. Our experiments on the JNLPBA 2004 datasets achieve an F-score of 74.64 % with semi-CRFs, without deep domain knowledge or post-processing algorithms, which outperforms most state-of-the-art systems.
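The two-phase idea can be illustrated with a minimal sketch. This is not the authors' implementation: the scoring function, the toy term list, and the type rule below are hypothetical stand-ins for the learned feature weights the paper explores. Phase 1 runs a semi-Markov Viterbi search that labels whole segments as C (entity) or O (other); phase 2 assigns one of the five entity types to each C segment.

```python
from typing import Callable, List, Optional, Tuple

Segment = Tuple[int, int, str]  # (start, end_exclusive, label)
ScoreFn = Callable[[List[str], int, int, Optional[str], str], float]

def semi_markov_viterbi(tokens: List[str], labels: List[str],
                        score: ScoreFn, max_len: int = 4) -> List[Segment]:
    """Return the highest-scoring segmentation; each segment gets one label."""
    n = len(tokens)
    # best[i][y]: best score for tokens[:i] when the last segment has label y
    best = [dict() for _ in range(n + 1)]
    back = [dict() for _ in range(n + 1)]
    best[0] = {None: 0.0}
    for end in range(1, n + 1):
        for label in labels:
            for start in range(max(0, end - max_len), end):
                for prev, prev_score in best[start].items():
                    s = prev_score + score(tokens, start, end, prev, label)
                    if label not in best[end] or s > best[end][label]:
                        best[end][label] = s
                        back[end][label] = (start, prev)
    # Follow back-pointers from the best final label.
    label = max(best[n], key=best[n].get)
    segments, end = [], n
    while end > 0:
        start, prev = back[end][label]
        segments.append((start, end, label))
        end, label = start, prev
    segments.reverse()
    return segments

# Hypothetical toy resources; a real system would use trained feature weights.
TOY_TERMS = {"IL-2 gene", "T cells"}

def toy_boundary_score(tokens, start, end, prev, label):
    """Phase 1 stand-in: reward C segments found in the toy term list."""
    length = end - start
    if label == "O":  # non-entity segments are restricted to single tokens
        return 0.0 if length == 1 else float("-inf")
    text = " ".join(tokens[start:end])
    return float(length) if text in TOY_TERMS else -float(length)

def toy_semantic_label(segment_tokens: List[str]) -> str:
    """Phase 2 stand-in: choose a type from the last word of the segment."""
    head = segment_tokens[-1].lower()
    if head == "gene":
        return "DNA"
    if head == "cells":
        return "cell_type"
    return "protein"

tokens = ["IL-2", "gene", "expression", "in", "T", "cells"]
segments = semi_markov_viterbi(tokens, ["C", "O"], toy_boundary_score)
entities = [(" ".join(tokens[s:e]), toy_semantic_label(tokens[s:e]))
            for s, e, label in segments if label == "C"]
print(segments)
print(entities)
```

Because the score is computed per segment, features such as the full entity string or its length are available directly, which a word-level CRF cannot express without extra encoding. The `prev` argument is unused by the toy scorer but shows where a transition score between adjacent segment labels would enter.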


Footnotes
1
The CRF package supports both CRFs and semi-CRFs. It is available at http://crf.sourceforge.net.
 
Metadata
Title
Exploring feature sets for two-phase biomedical named entity recognition using semi-CRFs
Authors
Li Yang
Yanhong Zhou
Publication date
01-08-2014
Publisher
Springer London
Published in
Knowledge and Information Systems / Issue 2/2014
Print ISSN: 0219-1377
Electronic ISSN: 0219-3116
DOI
https://doi.org/10.1007/s10115-013-0637-7
