
2025 | OriginalPaper | Chapter

Comparison of Perplexity Scores of Language Models for Telugu Data Corpus in the Agricultural Domain

Authors: Pooja Rajesh, Akshita Gupta, Praneeta Immadisetty

Published in: Innovative Computing and Communications

Publisher: Springer Nature Singapore


Abstract

The agricultural domain lacks readily available resources, especially in regional languages. For a country like India, where agriculture is a major source of GDP and employment, it is vital to develop a corpus spanning multiple topics within this domain. Once the data has been collected, a language model (LM) must be implemented to assess the usefulness of the collected data. Perplexity measures how well a probability distribution model predicts a sample; the LM with the lowest perplexity score is taken to perform best. This paper compares three LMs: n-gram, LSTM, and Transformer. On a Telugu-language dataset built by web-scraping links on the internet, the LSTM and Transformer models achieved perplexities of 23.127 and 12.3, respectively. The agreement between theoretical expectations and the observed perplexity scores validates that the prepared Telugu agricultural dataset can be used for further NLP applications.
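As a minimal sketch of the metric the abstract compares models on: perplexity is the exponential of the negative mean log-probability a model assigns to the tokens of a held-out sample. The function below is an illustrative implementation, not the paper's code; the probability values are hypothetical.

```python
import math

def perplexity(token_probs):
    """Perplexity of a token sequence, given the probability the model
    assigned to each token: exp of the negative mean log-probability."""
    n = len(token_probs)
    log_likelihood = sum(math.log(p) for p in token_probs)
    return math.exp(-log_likelihood / n)

# A uniform model over a 10-word vocabulary assigns p = 0.1 to every
# token, so its perplexity equals the vocabulary size.
print(perplexity([0.1] * 5))  # → 10.0
```

Lower perplexity means the model spreads less probability mass over wrong continuations, which is why the Transformer's 12.3 is read as outperforming the LSTM's 23.127.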


Metadata
Title
Comparison of Perplexity Scores of Language Models for Telugu Data Corpus in the Agricultural Domain
Authors
Pooja Rajesh
Akshita Gupta
Praneeta Immadisetty
Copyright Year
2025
Publisher
Springer Nature Singapore
DOI
https://doi.org/10.1007/978-981-97-4152-6_38