Published in: Progress in Artificial Intelligence 3/2020

22-05-2020 | Regular Paper

Utilizing external corpora through kernel function: application in biomedical named entity recognition

Authors: Rakesh Patra, Sujan Kumar Saha


Abstract

The performance of word-level sequence labelling tasks such as named entity recognition and part-of-speech tagging depends largely on the features chosen for the task. In general, however, representing a word and capturing its characteristics properly through a set of features is difficult. Moreover, external resources are often essential for building a high-performance system, yet acquiring the required knowledge demands domain-specific processing and feature engineering. Kernel functions used with a support vector machine may offer an alternative way to capture similarity between words more efficiently, using both the local context and external corpora. In this paper, we compute the similarity between words from their context information, syntactic information and occurrence statistics in external corpora. This similarity value is computed by a kernel function that combines two sub-kernels. The first captures global information through word co-occurrence statistics accumulated from a large corpus. The second captures local semantic information about the words through word-specific parse tree fragmentation. We test the proposed kernel on the JNLPBA 2004 Biomedical Named Entity Recognition and BioCreative II 2006 Gene Mention Recognition task data-sets. In our experiments, we observe that the proposed method is effective on both data-sets.
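The idea of combining a global and a local sub-kernel can be illustrated with a minimal sketch. It is not the authors' implementation: the abstract does not say how the two sub-kernels are defined or combined, so the cosine similarity over co-occurrence counts, the fragment-overlap similarity, the convex weighting `alpha`, the toy corpus and the hand-written fragment strings are all assumptions made for illustration. A convex combination is used only because it is a standard way to obtain a valid kernel from two valid kernels.

```python
from collections import Counter
from math import sqrt


def cooccurrence_vectors(sentences, window=2):
    """Count, for each word, the words seen within `window` positions of it."""
    vectors = {}
    for tokens in sentences:
        for i, word in enumerate(tokens):
            lo, hi = max(0, i - window), min(len(tokens), i + window + 1)
            context = tokens[lo:i] + tokens[i + 1:hi]
            vectors.setdefault(word, Counter()).update(context)
    return vectors


def cosine(u, v):
    """Cosine similarity between two sparse count vectors (Counters)."""
    dot = sum(u[k] * v[k] for k in u.keys() & v.keys())
    norm = sqrt(sum(c * c for c in u.values())) * sqrt(sum(c * c for c in v.values()))
    return dot / norm if norm else 0.0


def global_kernel(word_a, word_b, vectors):
    """Assumed global sub-kernel: similarity of corpus co-occurrence statistics."""
    return cosine(vectors.get(word_a, Counter()), vectors.get(word_b, Counter()))


def local_kernel(word_a, word_b, fragments):
    """Assumed local sub-kernel: overlap of parse-tree fragments around each word."""
    return cosine(Counter(fragments.get(word_a, [])), Counter(fragments.get(word_b, [])))


def combined_kernel(word_a, word_b, vectors, fragments, alpha=0.5):
    """Convex combination of the two sub-kernels (the weighting is an assumption)."""
    return (alpha * global_kernel(word_a, word_b, vectors)
            + (1 - alpha) * local_kernel(word_a, word_b, fragments))


# Toy stand-in for a large external corpus (hypothetical sentences).
corpus = [
    ["IL-2", "gene", "expression", "requires", "NF-kappaB"],
    ["IL-4", "gene", "expression", "requires", "STAT6"],
]
# Hypothetical parse-tree fragments; a real system would derive them from a parser.
fragments = {
    "IL-2": ["(NP (NN *) (NN gene))", "(S (NP *) (VP))"],
    "IL-4": ["(NP (NN *) (NN gene))", "(S (NP *) (VP))"],
    "requires": ["(VP (VBZ *) (NP))"],
}

vectors = cooccurrence_vectors(corpus)
k_same = combined_kernel("IL-2", "IL-4", vectors, fragments)
k_diff = combined_kernel("IL-2", "requires", vectors, fragments)
print(k_same, k_diff)
```

In this toy setting the two gene names share both their corpus contexts and their tree fragments, so they score higher than the pair of a gene name and a verb. To train a classifier, the pairwise values could be collected into a Gram matrix and passed to any support vector machine implementation that accepts a precomputed kernel.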


Metadata
Title
Utilizing external corpora through kernel function: application in biomedical named entity recognition
Authors
Rakesh Patra
Sujan Kumar Saha
Publication date
22-05-2020
Publisher
Springer Berlin Heidelberg
Published in
Progress in Artificial Intelligence / Issue 3/2020
Print ISSN: 2192-6352
Electronic ISSN: 2192-6360
DOI
https://doi.org/10.1007/s13748-020-00208-0
