2021 | OriginalPaper | Chapter

Analysis of Contextual and Non-contextual Word Embedding Models for Hindi NER with Web Application for Data Collection

Authors : Aindriya Barua, S. Thara, B. Premjith, K. P. Soman

Published in: Advanced Computing

Publisher: Springer Singapore

Abstract

Named Entity Recognition (NER) is the task of identifying the relevant proper nouns in a string of text. In this paper, we report the development of a Hindi NER system for Devanagari script using various embedding models. (All code and datasets used in this paper are available at: https://github.com/AindriyaBarua/Contextual-vs-Non-Contextual-Word-Embeddings-For-Hindi-NER-With-WebApp) We categorize embeddings as contextual and non-contextual, and compare them both across and within categories. In the non-contextual category, we experiment with Word2Vec and FastText; in the contextual category, we experiment with BERT and its variants, viz. RoBERTa, ELECTRA, CamemBERT, DistilBERT, and XLM-RoBERTa. For the non-contextual embeddings, we use five machine learning algorithms (Gaussian NB, AdaBoost classifier, multi-layer perceptron classifier, random forest classifier, and decision tree classifier) to develop ten Hindi NER systems, training each algorithm once with FastText and once with Gensim Word2Vec word embeddings. These models are then compared with Transformer-based contextual NER models that use BERT and its variants, and a comparative study of all these NER models is made. Finally, the best of these models is used to build a web app that takes Hindi text of any length, returns an NER tag for each word, and collects feedback from the user about the correctness of the tags. This feedback aids our further data collection.
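
The ten non-contextual systems described in the abstract (five classifiers, each paired with two embedding models) can be pictured with a minimal sketch. This is not the authors' code: the tokens, the PER/LOC/O tag names, the `toy_embeddings` helper, and all hyperparameters are assumptions made for illustration, and random lookup tables stand in for trained Word2Vec and FastText vectors. The sketch relies on NumPy and scikit-learn, and it reports training-set accuracy only to show how the pieces connect.

```python
import numpy as np
from sklearn.ensemble import AdaBoostClassifier, RandomForestClassifier
from sklearn.naive_bayes import GaussianNB
from sklearn.neural_network import MLPClassifier
from sklearn.tree import DecisionTreeClassifier

# Hypothetical toy data; the paper's dataset and tag set are not reproduced here.
tokens = ["राम", "दिल्ली", "में", "रहता", "है", "सीता", "मुंबई", "गई"]
tags = ["PER", "LOC", "O", "O", "O", "PER", "LOC", "O"]


def toy_embeddings(vocab, dim, seed):
    """Stand-in for a pretrained lookup table (assumption, not a real model).

    A non-contextual embedding assigns each word one fixed vector,
    regardless of the sentence it appears in.
    """
    rng = np.random.default_rng(seed)
    return {word: rng.normal(size=dim) for word in sorted(vocab)}


def make_classifiers():
    """The five classifier families named in the abstract, with assumed settings."""
    return {
        "GaussianNB": GaussianNB(),
        "AdaBoost": AdaBoostClassifier(random_state=0),
        "MLP": MLPClassifier(max_iter=500, random_state=0),
        "RandomForest": RandomForestClassifier(random_state=0),
        "DecisionTree": DecisionTreeClassifier(random_state=0),
    }


# Two lookup tables standing in for the two non-contextual embedding models.
embedding_models = {
    "word2vec_stand_in": toy_embeddings(set(tokens), dim=16, seed=0),
    "fasttext_stand_in": toy_embeddings(set(tokens), dim=16, seed=1),
}

# 2 embedding tables x 5 classifiers = 10 token-level NER systems.
results = {}
for emb_name, table in embedding_models.items():
    X = np.stack([table[token] for token in tokens])  # one vector per token
    for clf_name, clf in make_classifiers().items():
        clf.fit(X, tags)
        results[(emb_name, clf_name)] = clf.score(X, tags)  # training accuracy

for key, accuracy in sorted(results.items()):
    print(key, round(accuracy, 2))
```

With real embeddings, the lookup tables would be replaced by vectors from trained Word2Vec and FastText models, and the scores would be computed on held-out data rather than on the training tokens.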

Title: Analysis of Contextual and Non-contextual Word Embedding Models for Hindi NER with Web Application for Data Collection
Authors: Aindriya Barua, S. Thara, B. Premjith, K. P. Soman
Publisher: Springer Singapore
Book: Advanced Computing
Print ISBN: 978-981-16-0400-3
Electronic ISBN: 978-981-16-0401-0
Copyright Year: 2021
DOI: https://doi.org/10.1007/978-981-16-0401-0_14
