Fine-grained Dutch named entity recognition

  • Original Paper
  • Language Resources and Evaluation

Abstract

This paper describes the creation of a fine-grained named entity annotation scheme and corpus for Dutch, and experiments on automatic main type and subtype named entity recognition. We give an overview of existing named entity annotation schemes, and motivate our own, which describes six main types (persons, organizations, locations, products, events and miscellaneous named entities) and finer-grained information on subtypes and metonymic usage. This was applied to a one-million-word subset of the Dutch SoNaR reference corpus. The classifier for main type named entities achieves a micro-averaged F-score of 84.91%, and is publicly available, along with the corpus and annotations.
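As an illustration of what a micro-averaged F-score measures, the sketch below pools true positives, false positives and false negatives over all entity types before computing precision and recall, and contrasts this with a macro average over per-type scores. The type labels and the counts are hypothetical and chosen only for the example; they are not the labels or results of the paper, and the function names are our own.

```python
# Minimal sketch: micro- vs macro-averaged F-score from per-type counts.
# Labels and counts are hypothetical; they are NOT taken from the paper.

def f1(tp, fp, fn):
    """F-score (beta = 1) from raw counts; 0.0 when undefined."""
    denom = 2 * tp + fp + fn
    return 2 * tp / denom if denom else 0.0

def micro_f1(counts):
    """Pool the counts over all types, then compute a single F-score."""
    tp = sum(c[0] for c in counts.values())
    fp = sum(c[1] for c in counts.values())
    fn = sum(c[2] for c in counts.values())
    return f1(tp, fp, fn)

def macro_f1(counts):
    """Compute an F-score per type, then take the unweighted mean."""
    scores = [f1(*c) for c in counts.values()]
    return sum(scores) / len(scores) if scores else 0.0

# Hypothetical counts per type: (true positives, false positives, false negatives)
toy_counts = {
    "PER": (90, 10, 10),  # frequent type, recognized well
    "EVE": (1, 1, 3),     # rare type, recognized poorly
}

print(f"micro F1: {micro_f1(toy_counts):.4f}")
print(f"macro F1: {macro_f1(toy_counts):.4f}")
```

Because the counts are pooled, frequent types dominate the micro average, whereas the macro average weights every type equally regardless of its frequency.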

Notes

  1. http://ilk.uvt.nl/frog/.

  2. http://www.lt3.ugent.be/en/publications/named-entity-annotatierichtlijnen-voor-het-nederlands/ [in Dutch].

  3. http://taalunieversum.org/taal/technologie/stevin/.

  4. http://lands.let.ru.nl/projects/SoNaR/.

  5. The Dutch-Flemish agency for management, maintenance and distribution of Dutch digital language resources. See http://www.tst-centrale.org.

  6. http://ilk.uvt.nl/timbl/.

  7. http://chasen.org/taku/software/yamcha/.

  8. http://crfpp.sourceforge.net/.

  9. http://pyevolve.sourceforge.net/.

  10. http://www.cnts.ua.ac.be/conll2000/chunking/conlleval.txt.

Author information

Corresponding author

Correspondence to Bart Desmet.

About this article

Cite this article

Desmet, B., Hoste, V. Fine-grained Dutch named entity recognition. Lang Resources & Evaluation 48, 307–343 (2014). https://doi.org/10.1007/s10579-013-9255-y

  • DOI: https://doi.org/10.1007/s10579-013-9255-y
