Published in: Knowledge and Information Systems 2/2014

01.08.2014 | Regular Paper

Exploring feature sets for two-phase biomedical named entity recognition using semi-CRFs

Authors: Li Yang, Yanhong Zhou

Abstract

This paper presents a two-phase approach based on semi-Markov conditional random fields (semi-CRFs) and explores novel feature sets for identifying entities in text and classifying them into five types: protein, DNA, RNA, cell_line and cell_type. Semi-CRFs assign a label to a segment rather than to a single word, which is more natural for this task than other machine learning methods such as conditional random fields (CRFs). Our approach divides the biomedical named entity recognition task into two sub-tasks: term boundary detection and semantic labeling. In the first phase, the term boundary detection sub-task detects the boundaries of the entities and assigns all of them a single generic type C. In the second phase, the semantic labeling sub-task assigns the correct entity type to each entity detected in the first phase. We explore novel feature sets in both phases to improve performance. For comparison, experiments are conducted with both CRFs and semi-CRFs in each phase. Our experiments on the JNLPBA 2004 datasets achieve an F-score of 74.64 % with semi-CRFs, without deep domain knowledge or post-processing algorithms, which outperforms most state-of-the-art systems.
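
The data flow of the two phases described in the abstract can be illustrated with a minimal Python sketch. It is not the authors' implementation: the function names, the BIO tag encoding of the input, the example sentence and the toy head-word rule that stands in for the trained second-phase model are all assumptions made for illustration. The F-score helper assumes exact matching of both boundaries and type.

```python
from typing import Callable, List, Tuple

# A segment is (start, end_exclusive, label): the unit a semi-CRF labels.
Segment = Tuple[int, int, str]


def bio_to_segments(tags: List[str]) -> List[Segment]:
    """Convert token-level BIO tags (assumed encoding) into labeled segments."""
    segments: List[Segment] = []
    start, label = None, None
    for i, tag in enumerate(tags):
        # Close the open segment unless this tag continues it.
        if start is not None and tag != f"I-{label}":
            segments.append((start, i, label))
            start, label = None, None
        if tag.startswith("B-"):
            start, label = i, tag[2:]
        elif tag.startswith("I-") and start is None:
            # Stray I- tag without a preceding B-: start a new segment.
            start, label = i, tag[2:]
    if start is not None:
        segments.append((start, len(tags), label))
    return segments


def collapse_types(segments: List[Segment]) -> List[Segment]:
    """Phase 1 target: keep the boundaries, use the single generic type C."""
    return [(s, e, "C") for s, e, _ in segments]


def label_segments(
    segments: List[Segment],
    tokens: List[str],
    classify: Callable[[List[str], int, int], str],
) -> List[Segment]:
    """Phase 2: assign an entity type to every segment found in phase 1."""
    return [(s, e, classify(tokens, s, e)) for s, e, _ in segments]


def toy_classify(tokens: List[str], start: int, end: int) -> str:
    """Hypothetical stand-in for a trained model: decide by the last word."""
    head = tokens[end - 1].lower()
    return {"gene": "DNA", "cells": "cell_type"}.get(head, "protein")


def f_score(pred: List[Segment], gold: List[Segment]) -> float:
    """Exact-match F-score: a segment counts only if span and type both match."""
    tp = len(set(pred) & set(gold))
    precision = tp / len(pred) if pred else 0.0
    recall = tp / len(gold) if gold else 0.0
    total = precision + recall
    return 2 * precision * recall / total if total else 0.0


# Made-up example sentence with gold BIO tags.
tokens = ["IL-2", "gene", "expression", "in", "T", "cells"]
tags = ["B-DNA", "I-DNA", "O", "O", "B-cell_type", "I-cell_type"]

gold = bio_to_segments(tags)                           # typed gold segments
phase1 = collapse_types(gold)                          # boundaries with type C
phase2 = label_segments(phase1, tokens, toy_classify)  # types restored
```

Here `phase1` holds the spans (0, 2) and (4, 6) with the generic label C, and `phase2` restores the types DNA and cell_type for this example, so `f_score(phase2, gold)` is 1.0. In an actual system each phase would be a trained CRF or semi-CRF model rather than a lookup rule.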

Footnotes
1
The CRF package supports both CRFs and semi-CRFs. It is available at http://crf.sourceforge.net.
Metadata
Title
Exploring feature sets for two-phase biomedical named entity recognition using semi-CRFs
Authors
Li Yang
Yanhong Zhou
Publication date
01.08.2014
Publisher
Springer London
Published in
Knowledge and Information Systems / Issue 2/2014
Print ISSN: 0219-1377
Electronic ISSN: 0219-3116
DOI
https://doi.org/10.1007/s10115-013-0637-7
