Is it possible to create a very large wordnet in 100 days? An evaluation

  • Project Note
  • Published in: Language Resources and Evaluation

Abstract

Wordnets are large-scale lexical databases of related words and concepts, useful for language-aware software applications. They have recently been built for many languages using various approaches. The Finnish wordnet, FinnWordNet (FiWN), was created by translating the more than 200,000 word senses in the English Princeton WordNet (PWN) 3.0 in 100 days. To ensure quality, the word senses were translated by professional translators. The direct translation approach was based on the assumption that most synsets in PWN represent language-independent real-world concepts. The semantic relations between synsets were therefore also assumed to be mostly language-independent, so the structure of PWN could be reused as well. This approach allowed the creation of an extensive Finnish wordnet directly aligned with PWN and also provided us with a translation relation and thus a bilingual wordnet usable as a dictionary. In this paper, we address several concerns raised with regard to our approach, many of them for the first time. We evaluate the craftsmanship of the translators by checking the spelling and translation quality, the viability of the approach by assessing the synonym quality on both the lexeme and the concept level, and the usefulness of the resulting lexical resource both for humans and in a language-technological task. We discovered no new problems compared with those already known in PWN. As a whole, the paper contributes to the scientific discourse on what it takes to create a very large wordnet. As a side effect of the evaluation, we extended FiWN to contain 208,645 word senses in 120,449 synsets, effectively making version 2.0 of FiWN currently the largest wordnet in the world by these statistics.

Notes

  1. The full FiWN can be downloaded from http://www.ling.helsinki.fi/en/lt/research/finnwordnet/.

  2. For information on the availability and size of various wordnets, see Bond and Paik (2012) and Open Multilingual Wordnet, http://casta-net.jp/~kuribayashi/multi/.

  3. The figure excludes the suggestions for additional synonyms and “not sure” answers.

  4. “Equally frequent” meant that the words in FiWN had an average frequency in the text corpus and that we used all the words in the text corpus with at least a quarter of this average frequency.

  5. Two translators working in parallel spent 100 calendar days full-time translating PWN word senses into Finnish, so the translation effort took six person months in total.
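
The frequency criterion in note 4 can be illustrated with a minimal sketch. It assumes a tokenized corpus and a set of wordnet words; the function name `equally_frequent_words`, the toy data, and the choice to count wordnet words missing from the corpus as frequency 0 are illustrative assumptions, not details taken from the paper.

```python
from collections import Counter


def equally_frequent_words(corpus_tokens, wordnet_words):
    """Select corpus words with at least a quarter of the average
    corpus frequency of the wordnet words (hypothetical helper)."""
    if not wordnet_words:
        raise ValueError("wordnet_words must not be empty")
    counts = Counter(corpus_tokens)
    # Assumption: wordnet words absent from the corpus count as frequency 0.
    average = sum(counts[w] for w in wordnet_words) / len(wordnet_words)
    threshold = average / 4
    # Keep every corpus word that reaches the threshold.
    return {word for word, freq in counts.items() if freq >= threshold}


# Toy corpus: the two wordnet words occur 8 and 4 times, so the average is 6
# and the threshold is 1.5.
tokens = ["talo"] * 8 + ["auto"] * 4 + ["koira"] * 2 + ["kissa"]
selected = equally_frequent_words(tokens, {"talo", "auto"})
```

In this toy example, "koira" (2 occurrences) passes the threshold while "kissa" (1 occurrence) does not.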

References

  • Agirre, E., & Soroa, A. (2009). Personalizing PageRank for word sense disambiguation. In Proceedings of the 12th conference of the European chapter of the association for computational linguistics (EACL-2009) (pp. 33–41). Athens: ACL.

  • Atserias, J., Climent, S., Farreres, X., Rigau, G., & Rodríguez, H. (2000). Combining multiple methods for the automatic construction of multilingual WordNets. In N. Nicolov & R. Mitkov (Eds.), Recent advances in natural language processing. Volume II: Selected papers from RANLP’97, number 189 in current issues in linguistic theory (pp. 327–338). Amsterdam: John Benjamins.

  • Bond, F., Isahara, H., Kanzaki, K., & Uchimoto, K. (2008). Boot-strapping a WordNet using multiple existing WordNets. In N. Calzolari, K. Choukri, B. Maegaard, J. Mariani, J. Odijk, S. Piperidis, & D. Tapias (Eds.), Proceedings of the international conference on language resources and evaluation, LREC 2008 (pp. 1619–1624). Marrakech: ELRA.

  • Bond, F., & Paik, K. (2012). A survey of wordnets and their licenses. In Global WordNet Association (2012) (pp. 64–71). http://globalwordnet.org/gwa/proceedings/gwc2012.pdf.

  • Cilibrasi, R. L., & Vitányi, P. M. B. (2007). The Google similarity distance. IEEE Transactions on Knowledge and Data Engineering, 19(3), 370–383. doi:10.1109/TKDE.2007.48.

  • Fellbaum, C. (Ed.). (1998). WordNet: An electronic lexical database. Cambridge, MA: MIT Press.

  • Fišer, D., & Sagot, B. (2008). Combining multiple resources to build reliable wordnets. In P. Sojka, A. Horák, I. Kopeček, & K. Pala (Eds.), Text, speech and dialogue, Lecture Notes in Computer Science (Vol. 5246, pp. 61–68). Berlin: Springer. doi:10.1007/978-3-540-87391-4_10.

  • Global WordNet Association. (2012). Proceedings of the 6th international global wordnet conference (GWC 2012). Matsue: Global WordNet Association. http://globalwordnet.org/gwa/proceedings/gwc2012.pdf.

  • Lee, C., Lee, G. G., & Seo, J. (2004). Multiple heuristics and their combination for automatic WordNet mapping. Computers and the Humanities, 38(4), 437–455. doi:10.1007/s10579-004-1367-y.

  • Lindén, K., & Carlson, L. (2010). FinnWordNet – WordNet på finska via översättning. LexicoNordica: Nordic Journal of Lexicography, 17, 119–140. English translation “FinnWordNet—Finnish WordNet by translation” at http://www.ling.helsinki.fi/~klinden/pubs/FinnWordnetInLexicoNordica-en.pdf.

  • Lindén, K., Niemi, J., & Hyvärinen, M. (2012). Extending and updating the Finnish Wordnet. In D. Santos, K. Lindén, & W. Ng’ang’a (Eds.), Shall we play the festschrift game? Essays on the Occasion of Lauri Carlson’s 60th Birthday (pp. 67–98). Berlin: Springer. doi:10.1007/978-3-642-30773-7_7.

  • Martola, N. (2011). FinnWordNet och kulturbundna ord. LexicoNordica: Nordic Journal of Lexicography, 18, 111–133.

  • Muhonen, K., & Lindén, K. (2011). Do wordnets also improve human performance on NLP tasks? In B. S. Pedersen, G. Nešpore, & I. Skadiņa (Eds.), Proceedings of the 18th Nordic conference of computational linguistics NODALIDA 2011, NEALT proceedings series (Vol. 11, pp. 146–152). Northern European Association for Language Technology (NEALT). http://hdl.handle.net/10062/16955.

  • Niemi, J., & Lindén, K. (2012). Representing the translation relation in a bilingual wordnet. In Proceedings of the eighth international conference on language resources and evaluation (LREC’12) (pp. 2439–2446). Istanbul, Turkey. http://www.lrec-conf.org/proceedings/lrec2012/summaries/194.html.

  • Niemi, J., Lindén, K., & Hyvärinen, M. (2012). Using a bilingual resource to add synonyms to a wordnet: FinnWordNet and Wikipedia as an example. In Global WordNet Association (2012) (pp. 227–231). http://globalwordnet.org/gwa/proceedings/gwc2012.pdf.

  • Pääkkö, P., & Lindén, K. (2012). Finding a location for a new word in WordNet. In Global WordNet Association (2012) (pp. 286–293). http://globalwordnet.org/gwa/proceedings/gwc2012.pdf.

  • Pedersen, B. S., Borin, L., Forsberg, M., Kahusk, N., Lindén, K., Niemi, J., Nisbeth, N., Nygaard, L., Orav, H., Rögnvaldsson, E., Seaton, M., Vider, K., & Voionmaa, K. (2013). Nordic and Baltic wordnets aligned and compared through WordTies. In S. Oepen, K. Hagen, & J. B. Johannessen (Eds.), Proceedings of the 19th Nordic conference of computational linguistics (NODALIDA 2013), number 16 in NEALT proceedings series (pp. 147–162). Oslo University, Norway. http://www.ep.liu.se/ecp_article/index.en.aspx?issue=085;article=016.

  • Pedersen, B. S., Borin, L., Forsberg, M., Lindén, K., Orav, H., & Rögnvaldsson, E. (2012). Linking and validating Nordic and Baltic wordnets: A multilingual action in META-NORD. In Global WordNet Association (2012) (pp. 254–260). http://globalwordnet.org/gwa/proceedings/gwc2012.pdf.

  • Pianta, E., Bentivogli, L., & Girardi, C. (2002). MultiWordNet: developing an aligned multilingual database. In Proceedings of the first international conference on global WordNet (pp. 293–302). Mysore, India. http://multiwordnet.fbk.eu/paper/MWN-India-published.pdf.

  • Sagot, B., & Fišer, D. (2008). Building a free French wordnet from multilingual resources. In Proceedings of OntoLex 2008 (pp. 14–19). Marrakech, Morocco. http://hal.inria.fr/inria-00614708.

  • Saveski, M., & Trajkovski, I. (2010). Automatic construction of wordnets by using machine translation and language modeling. In T. Erjavec, & J. Žganec Gros (Eds.), Proceedings of seventh language technologies conference, 13th international multiconference information society. Ljubljana, Slovenia.

  • Thoongsup, S., Robkop, K., Mokarat, C., Sinthurahat, T., Charoenporn, T., Sornlertlamvanich, V., & Isahara, H. (2009). Thai WordNet construction. In Proceedings of the 7th workshop on Asian language resources, in conjunction with ACL-IJCNLP 2009 (pp. 139–144). Singapore: ACL.

  • Tufiş, D., Cristea, D., & Stamou, S. (2004). BalkaNet: Aims, methods, results and perspectives. A general overview. Romanian Journal of Information Science and Technology, 7(1–2), 9–43.

  • Vossen, P. (Ed.) (1998). EuroWordNet: A multilingual database with lexical semantic networks. Dordrecht: Kluwer Academic Publishers.

Acknowledgments

We are grateful to Mirka Hyvärinen, Kristiina Muhonen and Paula Pääkkö for checking the long lists of words with potential spelling errors and part-of-speech mismatches. We also thank Mirka Hyvärinen and Pinja Pennala for their valuable contribution to the creation of the word-sense disambiguated test corpus and for the many hours spent on evaluating sets of words extracted from Wikipedia and Wiktionary. Mirka Hyvärinen also conducted the crowdsourcing experiment. This work was funded by the FIN-CLARIN and META-NORD projects. The META-NORD project has received funding from the European Union’s ICT Policy Support Programme as part of the Competitiveness and Innovation Framework Programme under grant agreement no. 270899.

Author information

Correspondence to Krister Lindén.

About this article

Cite this article

Lindén, K., Niemi, J. Is it possible to create a very large wordnet in 100 days? An evaluation. Lang Resources & Evaluation 48, 191–201 (2014). https://doi.org/10.1007/s10579-013-9245-0

  • DOI: https://doi.org/10.1007/s10579-013-9245-0
