Unsupervised models for morpheme segmentation and morphology learning

Published: 02 February 2007

Abstract

We present a model family called Morfessor for the unsupervised induction of a simple morphology from raw text data. The model is formulated in a probabilistic maximum a posteriori framework. Morfessor can handle highly inflecting and compounding languages, where words can consist of lengthy sequences of morphemes. A lexicon of word segments, called morphs, is induced from the data. The lexicon stores information about both the usage and the form of the morphs. Several instances of the model are evaluated quantitatively in a morpheme segmentation task on Finnish and English data sets of different sizes. Morfessor is shown to perform very well compared to a widely known benchmark algorithm, in particular on Finnish data.
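To make the idea of segmenting words against a morph lexicon concrete, the following is a minimal sketch and not the Morfessor model itself. In a maximum a posteriori setting one looks for the lexicon that maximizes P(lexicon | corpus), which is proportional to P(corpus | lexicon) · P(lexicon). The sketch covers only the likelihood side: it assumes a fixed, hand-made lexicon of morph counts (the dictionary `counts` and the function name `segment` are hypothetical), treats morphs as independent, and uses a Viterbi-style search to find the split of a word with the lowest negative log-probability. The prior over the lexicon and the search over lexicons are left out.

```python
import math

def segment(word, morph_counts):
    """Return (morphs, cost) for the most probable split of `word` into known morphs.

    cost is the negative log-probability of the split under a unigram morph model.
    Returns (None, inf) if the word cannot be built from the lexicon.
    """
    total = sum(morph_counts.values())
    # best[i] = (cost of the best segmentation of word[:i], start of its last morph)
    best = [(0.0, 0)] + [(math.inf, 0)] * len(word)
    for end in range(1, len(word) + 1):
        for start in range(end):
            morph = word[start:end]
            if morph in morph_counts and best[start][0] < math.inf:
                cost = best[start][0] - math.log(morph_counts[morph] / total)
                if cost < best[end][0]:
                    best[end] = (cost, start)
    if best[-1][0] == math.inf:
        return None, math.inf
    # Walk the back-pointers from the end of the word to recover the morphs.
    morphs, end = [], len(word)
    while end > 0:
        start = best[end][1]
        morphs.append(word[start:end])
        end = start
    return morphs[::-1], best[-1][0]

# Toy lexicon with made-up counts, for illustration only.
counts = {"un": 5, "break": 3, "able": 4, "s": 10, "b": 1, "reak": 1}
morphs, cost = segment("unbreakable", counts)
print(morphs)  # → ['un', 'break', 'able']
```

With these counts, the split un + break + able beats un + b + reak + able because it uses fewer and more frequent morphs. A full MAP formulation would additionally weigh the cost of storing each morph in the lexicon.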

