2021 | OriginalPaper | Book chapter

Analysis of Contextual and Non-contextual Word Embedding Models for Hindi NER with Web Application for Data Collection

Written by: Aindriya Barua, S. Thara, B. Premjith, K. P. Soman

Published in: Advanced Computing

Publisher: Springer Singapore

Abstract

Named Entity Recognition (NER) is the process of taking a string and identifying the relevant proper nouns in it. In this paper, we report the development of a Hindi NER system, in Devanagari script, using various embedding models. We categorize embeddings as contextual and non-contextual, and compare them both across and within categories. Under non-contextual embeddings, we experiment with Word2Vec and FastText; under contextual embeddings, we experiment with BERT and its variants, viz. RoBERTa, ELECTRA, CamemBERT, DistilBERT, and XLM-RoBERTa. For non-contextual embeddings, we use five machine learning algorithms, namely Gaussian NB, AdaBoost Classifier, Multi-layer Perceptron Classifier, Random Forest Classifier, and Decision Tree Classifier, to develop ten Hindi NER systems: each algorithm is trained once with FastText and once with Gensim Word2Vec word embeddings. These models are then compared with Transformer-based contextual NER models that use BERT and its variants, and a comparative study among all these NER models is made. Finally, the best of these models is used to build a web app that takes a Hindi text of any length, returns NER tags for each word, and takes feedback from the user about the correctness of the tags. This feedback aids our further data collection. All code and datasets used in this paper are available at: https://github.com/AindriyaBarua/Contextual-vs-Non-Contextual-Word-Embeddings-For-Hindi-NER-With-WebApp.
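
The pairing of five classifiers with two non-contextual embeddings described in the abstract can be pictured with a minimal scikit-learn sketch. This is not the authors' code: the token vectors are random stand-ins (real ones would come from trained Word2Vec and FastText models), and the tag set, the vector dimension, the hyperparameters, and the helper name `fake_embeddings` are all assumptions made for illustration. It only shows how five algorithms and two embeddings yield ten token-level classifiers that can be scored side by side.

```python
from itertools import product

import numpy as np
from sklearn.ensemble import AdaBoostClassifier, RandomForestClassifier
from sklearn.metrics import f1_score
from sklearn.model_selection import train_test_split
from sklearn.naive_bayes import GaussianNB
from sklearn.neural_network import MLPClassifier
from sklearn.tree import DecisionTreeClassifier

rng = np.random.default_rng(0)
TAGS = ["O", "PERSON", "LOCATION"]  # hypothetical tag set, not the paper's


def fake_embeddings(labels, dim=16):
    """Stand-in for per-token vectors (assumption: one fixed vector per token)."""
    centers = rng.normal(size=(len(TAGS), dim))
    return centers[labels] + 0.5 * rng.normal(size=(len(labels), dim))


# One integer tag per token; both "embeddings" describe the same tokens.
labels = rng.integers(0, len(TAGS), size=300)
embeddings = {
    "Word2Vec": fake_embeddings(labels),
    "FastText": fake_embeddings(labels),
}

# Factories so that every (embedding, classifier) pair gets a fresh model.
classifiers = {
    "Gaussian NB": lambda: GaussianNB(),
    "AdaBoost": lambda: AdaBoostClassifier(random_state=0),
    "Multi-layer Perceptron": lambda: MLPClassifier(max_iter=500, random_state=0),
    "Random Forest": lambda: RandomForestClassifier(random_state=0),
    "Decision Tree": lambda: DecisionTreeClassifier(random_state=0),
}

results = {}
for (emb_name, X), (clf_name, make_clf) in product(
    embeddings.items(), classifiers.items()
):
    X_train, X_test, y_train, y_test = train_test_split(
        X, labels, test_size=0.25, random_state=0, stratify=labels
    )
    model = make_clf().fit(X_train, y_train)
    # Macro F1 weights every tag equally, including rare ones.
    results[(emb_name, clf_name)] = f1_score(
        y_test, model.predict(X_test), average="macro"
    )

for (emb_name, clf_name), score in sorted(results.items()):
    print(f"{emb_name:9s} {clf_name:24s} macro-F1 = {score:.3f}")
```

Because the vectors here are synthetic, the printed scores say nothing about Hindi NER; only the structure of the comparison carries over.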

Metadata
Title
Analysis of Contextual and Non-contextual Word Embedding Models for Hindi NER with Web Application for Data Collection
Written by
Aindriya Barua
S. Thara
B. Premjith
K. P. Soman
Copyright year
2021
Publisher
Springer Singapore
DOI
https://doi.org/10.1007/978-981-16-0401-0_14
