
2025 | Original Paper | Book Chapter

Comparison of Perplexity Scores of Language Models for Telugu Data Corpus in the Agricultural Domain

Authors: Pooja Rajesh, Akshita Gupta, Praneeta Immadisetty

Published in: Innovative Computing and Communications

Publisher: Springer Nature Singapore


Abstract

The agricultural domain lacks readily available resources, especially in regional languages. For a country like India, which relies heavily on agriculture as a major source of GDP and employment, it is vital to develop a corpus that spans multiple topics in this domain. Once the data is collected, a language model (LM) must be applied to assess the usability of the corpus. Perplexity measures how well a probability distribution predicts a sample; the LM with the lowest perplexity score performs best. This paper compares three different LMs: n-gram, LSTM, and Transformer. The perplexity of the LSTM and Transformer models was found to be 23.127 and 12.3, respectively, on a Telugu-language dataset built by web-scraping links on the internet. The agreement between theoretical expectations and the observed perplexity scores validates that the prepared Telugu agricultural dataset can be used for further NLP applications.
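The perplexity measure used for the comparison above can be illustrated with a minimal sketch. This is not the authors' implementation; it simply shows the standard definition: perplexity is the exponential of the average negative log-probability the model assigns to each token in a held-out sample.

```python
import math

def perplexity(token_probs):
    """Perplexity = exp of the mean negative log-probability
    assigned by the model to each token in the sample."""
    n = len(token_probs)
    nll = -sum(math.log(p) for p in token_probs) / n
    return math.exp(nll)

# A model that assigns probability 0.25 to every token in a sample
# is "as uncertain" as a uniform choice among 4 options, so its
# perplexity is about 4.
print(perplexity([0.25, 0.25, 0.25, 0.25]))  # ≈ 4.0
```

Under this definition, a lower score means the model is less "surprised" by the data, which is why the Transformer's 12.3 indicates a better fit to the Telugu corpus than the LSTM's 23.127.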


Metadata
Title
Comparison of Perplexity Scores of Language Models for Telugu Data Corpus in the Agricultural Domain
Authors
Pooja Rajesh
Akshita Gupta
Praneeta Immadisetty
Copyright Year
2025
Publisher
Springer Nature Singapore
DOI
https://doi.org/10.1007/978-981-97-4152-6_38