2021 | OriginalPaper | Chapter

Analysis of Contextual and Non-contextual Word Embedding Models for Hindi NER with Web Application for Data Collection

Authors: Aindriya Barua, S. Thara, B. Premjith, K. P. Soman

Published in: Advanced Computing

Publisher: Springer Singapore

Abstract

Named Entity Recognition (NER) is the process of taking a string and identifying the relevant proper nouns in it. In this paper, we report the development of a Hindi NER system, in Devanagari script, using various embedding models. (All code and datasets used in this paper are available at: https://github.com/AindriyaBarua/Contextual-vs-Non-Contextual-Word-Embeddings-For-Hindi-NER-With-WebApp.) We categorize embeddings as contextual and non-contextual, and compare them both across and within categories. Under non-contextual embeddings, we experiment with Word2Vec and FastText; under contextual embeddings, we experiment with BERT and its variants, viz. RoBERTa, ELECTRA, CamemBERT, DistilBERT, and XLM-RoBERTa. For non-contextual embeddings, we use five machine learning algorithms, namely Gaussian NB, AdaBoost classifier, multi-layer perceptron classifier, random forest classifier, and decision tree classifier, to develop ten Hindi NER systems: each algorithm is used once with FastText and once with Gensim Word2Vec word embeddings. These models are then compared with Transformer-based contextual NER models that use BERT and its variants. A comparative study of all these NER models is made. Finally, the best of these models is used to build a web app that takes a Hindi text of any length, returns an NER tag for each word, and collects feedback from the user about the correctness of the tags. This feedback aids our further data collection.
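The tag-then-collect-feedback loop described for the web app can be sketched roughly as follows. This is a minimal illustration under stated assumptions, not the authors' implementation: `toy_tagger`, `TOY_LEXICON`, `tag_text`, and `FeedbackStore` are hypothetical names, the tag labels are illustrative rather than the paper's tag set, and a dictionary lookup stands in for the trained model.

```python
from dataclasses import dataclass, field
from typing import Callable, Dict, List, Tuple

# Hypothetical stand-in for the trained model: a tiny lookup table.
# The tag names are illustrative, not the paper's tag set.
TOY_LEXICON: Dict[str, str] = {"गांधी": "PERSON", "दिल्ली": "LOCATION"}

Tagger = Callable[[List[str]], List[str]]


def toy_tagger(tokens: List[str]) -> List[str]:
    """Return one tag per token; words not in the lexicon get 'O'."""
    return [TOY_LEXICON.get(token, "O") for token in tokens]


def tag_text(text: str, tagger: Tagger) -> List[Tuple[str, str]]:
    """Split text on whitespace and pair every word with its predicted tag."""
    tokens = text.split()
    return list(zip(tokens, tagger(tokens)))


@dataclass
class FeedbackStore:
    """Accumulates user-verified (word, tag) pairs as new labelled data."""
    records: List[Tuple[str, str]] = field(default_factory=list)

    def record(self, predictions: List[Tuple[str, str]],
               corrections: Dict[int, str]) -> None:
        """Store each word with the user's corrected tag if one was given
        for that position, otherwise with the predicted tag."""
        for index, (word, predicted) in enumerate(predictions):
            self.records.append((word, corrections.get(index, predicted)))


# The toy model misses "मुंबई"; the user corrects position 1.
predictions = tag_text("गांधी मुंबई गए", toy_tagger)
store = FeedbackStore()
store.record(predictions, {1: "LOCATION"})
```

In this sketch every submitted sentence yields one labelled pair per word, so the corrected output can be appended directly to a training set; how the actual app stores and reuses feedback is not specified in the abstract.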

Metadata

Title: Analysis of Contextual and Non-contextual Word Embedding Models for Hindi NER with Web Application for Data Collection
Authors: Aindriya Barua, S. Thara, B. Premjith, K. P. Soman
Copyright Year: 2021
Publisher: Springer Singapore
DOI: https://doi.org/10.1007/978-981-16-0401-0_14
