Published in: Education and Information Technologies 5/2016

7 December 2014

A study of readability of texts in Bangla through machine learning approaches

Authors: Manjira Sinha, Anupam Basu


Abstract

In this work, we investigate text readability in the Bangla language. Text readability indicates how suitable a given document is for a target reader group, and it therefore has a large impact on the preparation of educational content. Advances in natural language processing have enabled the automatic identification of the reading difficulty of texts and have contributed to the design and development of suitable educational materials. Although Bangla is one of the major languages of India and the official language of Bangladesh, research on text readability in Bangla is still at a nascent stage. In this paper, we present computational models that determine the readability of Bangla text documents based on syntactic properties. Because Bangla is poor in digital resources, we had to develop a novel dataset suitable for the automatic identification of text properties. Our initial experiments showed that existing English readability metrics are inapplicable to Bangla. Accordingly, we developed new models for analyzing text readability in Bangla that consider language-specific syntactic features of Bangla text. We identified the major structural contributors to text comprehensibility and subsequently developed readability models for Bangla texts. We used different machine-learning methods, such as regression, support vector machines (SVM) and support vector regression (SVR), and compared the performance of the individual models against one another. We conducted a detailed user survey for data preparation, for identifying the important structural parameters of texts, and for validating our proposed models. The work has further implications for educational research and for matching texts to readers.
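
To make the idea of structural features feeding a regression model concrete, the following is a minimal sketch only. The abstract does not list the features or model settings the authors used. The two features below (average sentence length and average word length), the helper names `structural_features` and `fit_line`, and the use of one-variable least squares in place of the SVM and SVR models are all assumptions made for illustration.

```python
import re

# Assumption: sentences end with the danda (U+0964), '?' or '!'.
SENTENCE_END = re.compile(r"[\u0964?!]+")


def structural_features(text):
    """Return (average sentence length in words,
    average word length in Unicode code points).

    Hypothetical feature set; not taken from the paper.
    """
    sentences = [s.split() for s in SENTENCE_END.split(text) if s.strip()]
    words = [w for s in sentences for w in s]
    if not words:
        return 0.0, 0.0
    avg_sentence_len = len(words) / len(sentences)
    avg_word_len = sum(len(w) for w in words) / len(words)
    return avg_sentence_len, avg_word_len


def fit_line(xs, ys):
    """Ordinary least squares for y = a + b * x; returns (a, b).

    Stands in for the 'regression' family named in the abstract.
    """
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    b = sxy / sxx
    a = mean_y - b * mean_x
    return a, b


# Usage: extract features from a short two-sentence Bangla sample.
sample = "আমি ভাত খাই। সে গান শোনে।"
avg_sentence_len, avg_word_len = structural_features(sample)
```

In a setup like the one the abstract describes, the feature values for each document would be paired with difficulty ratings collected from readers, and a call such as `fit_line(feature_values, ratings)` would give a baseline against which SVM or SVR models could be compared.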


Metadata
Title
A study of readability of texts in Bangla through machine learning approaches
Authors
Manjira Sinha
Anupam Basu
Publication date
7 December 2014
Publisher
Springer US
Published in
Education and Information Technologies / Issue 5/2016
Print ISSN: 1360-2357
Electronic ISSN: 1573-7608
DOI
https://doi.org/10.1007/s10639-014-9368-y
