Published in: Cluster Computing 2/2015

1 June 2015

CRFs based parallel biomedical named entity recognition algorithm employing MapReduce framework

Authors: Zhuo Tang, Lingang Jiang, Li Yang, Kenli Li, Keqin Li

Abstract

With the rapid growth of the biomedical literature, model training time in biomedical named entity recognition increases sharply when dealing with large-scale training samples. Increasing the efficiency of named entity recognition on biomedical big data has become one of the key problems in biomedical text mining. To improve recognition performance and reduce training time, this paper proposes an optimization method for two-phase recognition using conditional random fields (CRFs). In the first phase, the boundary of each named entity is detected to identify all real entities. In the second phase, the semantic class of each detected entity is labeled. To speed up training in both phases, we implement the model training process on a parallel optimization framework based on MapReduce. The training set is divided into several parts, and the iterations of the training algorithm are designed as map tasks that can be executed simultaneously in a cluster, where each map function computes the gradient vector component for one part of the training set. Our experiments show that the proposed method achieves high performance with short training time, which has important implications for current biological big data processing.
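
The parallel step described in the abstract rests on the fact that the gradient of a log-likelihood summed over training samples equals the sum of the gradients computed on each part of the training set. The following is a minimal sketch of that idea, not the authors' implementation: it assumes a binary logistic regression model as a simple log-linear stand-in for the CRF, uses Python's built-in `map` and `functools.reduce` in place of a real MapReduce cluster, and all function names and sample data are hypothetical.

```python
import math
from functools import reduce


def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))


def partial_gradient(weights, part):
    """Map task (illustrative): gradient of the negative log-likelihood
    of a logistic regression model over one part of the training set."""
    grad = [0.0] * len(weights)
    for features, label in part:
        score = sum(w * x for w, x in zip(weights, features))
        error = sigmoid(score) - label
        for j, x in enumerate(features):
            grad[j] += error * x
    return grad


def add_vectors(a, b):
    """Reduce step (illustrative): element-wise sum of two partial gradients."""
    return [x + y for x, y in zip(a, b)]


def split(samples, n_parts):
    """Divide the training set into n_parts roughly equal parts."""
    return [samples[i::n_parts] for i in range(n_parts)]


def mapreduce_gradient(weights, samples, n_parts):
    """Compute the full gradient as the sum of per-part gradients.
    In a cluster, each partial_gradient call would run as its own map task."""
    parts = split(samples, n_parts)
    partials = map(lambda p: partial_gradient(weights, p), parts)
    return reduce(add_vectors, partials)


# Hypothetical toy data: (feature vector, binary label) pairs.
samples = [
    ([1.0, 0.5], 1),
    ([1.0, -1.5], 0),
    ([1.0, 2.0], 1),
    ([1.0, -0.3], 0),
    ([1.0, 0.1], 1),
]
weights = [0.1, -0.2]

full = partial_gradient(weights, samples)            # single-machine gradient
parallel = mapreduce_gradient(weights, samples, 3)   # sum of three partial gradients
```

Up to floating-point rounding, `parallel` matches `full`, which is why an iterative optimizer can consume the reduced gradient unchanged at each iteration. A real CRF would replace `partial_gradient` with the forward-backward computation of expected feature counts over the sequences in each part.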

Metadata
Title
CRFs based parallel biomedical named entity recognition algorithm employing MapReduce framework
Authors
Zhuo Tang
Lingang Jiang
Li Yang
Kenli Li
Keqin Li
Publication date
1 June 2015
Publisher
Springer US
Published in
Cluster Computing / Issue 2/2015
Print ISSN: 1386-7857
Electronic ISSN: 1573-7543
DOI
https://doi.org/10.1007/s10586-015-0426-z