A web-based Bengali news corpus for named entity recognition

Language Resources and Evaluation

Abstract

The rapid development of language resources and tools using machine learning techniques for less computerized languages requires appropriately tagged corpora. A tagged Bengali news corpus has been developed from the web archive of a widely read Bengali newspaper. A web crawler retrieves the web pages in Hyper Text Markup Language (HTML) format from the news archive. At present, the corpus contains approximately 34 million wordforms. Named Entity Recognition (NER) systems based on pattern-based shallow parsing, with and without linguistic knowledge, have been developed using a part of this corpus. The NER system that uses linguistic knowledge performed better, yielding its highest F-Score values of 75.40%, 72.30%, 71.37%, and 70.13% for person, location, organization, and miscellaneous names, respectively.
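
To make the reported per-class F-Score values concrete, the following is a minimal sketch of how such scores are commonly computed. It assumes the balanced F-score (harmonic mean of precision and recall) and exact matching of entity boundaries and labels; the abstract does not state the evaluation protocol, so both are assumptions. The function name `f_scores`, the label names, and the toy annotations are hypothetical and are not taken from the corpus.

```python
from collections import Counter


def f_scores(gold, predicted):
    """Per-class precision, recall, and balanced F-score.

    gold, predicted: sets of (start, end, label) tuples. An entity counts as
    correct only if its boundaries and label both match exactly (assumption).
    """
    correct = Counter(label for (_, _, label) in gold & predicted)
    n_gold = Counter(label for (_, _, label) in gold)
    n_pred = Counter(label for (_, _, label) in predicted)
    scores = {}
    for label in set(n_gold) | set(n_pred):
        p = correct[label] / n_pred[label] if n_pred[label] else 0.0
        r = correct[label] / n_gold[label] if n_gold[label] else 0.0
        f = 2 * p * r / (p + r) if p + r else 0.0
        scores[label] = (p, r, f)
    return scores


# Toy, made-up annotations for illustration only (not data from the corpus).
gold = {(0, 2, "PER"), (5, 6, "LOC"), (9, 11, "ORG"), (14, 15, "PER")}
pred = {(0, 2, "PER"), (5, 6, "LOC"), (9, 10, "ORG"), (14, 15, "LOC")}

for label, (p, r, f) in sorted(f_scores(gold, pred).items()):
    print(f"{label}: P={p:.2f} R={r:.2f} F={f:.2f}")
```

Scoring each class separately, as here, is what allows different values to be reported for person, location, organization, and miscellaneous names.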

Notes

  1. http://www.wacky.sslmit.unibo.it

  2. http://www.ilc.cnr.it/Eagles96/home.html

Author information

Corresponding author

Correspondence to Asif Ekbal.

About this article

Cite this article

Ekbal, A., Bandyopadhyay, S. A web-based Bengali news corpus for named entity recognition. Lang Resources & Evaluation 42, 173–182 (2008). https://doi.org/10.1007/s10579-008-9064-x

  • DOI: https://doi.org/10.1007/s10579-008-9064-x
