Unsupervised models for morpheme segmentation and morphology learning

Authors:
Mathias Creutz, Helsinki University of Technology, Finland
Krista Lagus, Helsinki University of Technology, Finland

ACM Transactions on Speech and Language Processing, Volume 4, Issue 1, Article No. 3, pp. 1–34. https://doi.org/10.1145/1187415.1187418

Published: 02 February 2007

Abstract

We present a model family called Morfessor for the unsupervised induction of a simple morphology from raw text data. The model is formulated in a probabilistic maximum a posteriori framework. Morfessor can handle highly inflecting and compounding languages where words can consist of lengthy sequences of morphemes. A lexicon of word segments, called morphs, is induced from the data. The lexicon stores information about both the usage and form of the morphs. Several instances of the model are evaluated quantitatively in a morpheme segmentation task on different sized sets of Finnish as well as English data. Morfessor is shown to perform very well compared to a widely known benchmark algorithm, in particular on Finnish data.
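
As a rough illustration of what segmenting a word into morphs from a lexicon can look like, the sketch below finds the most probable split of a word under a simple unigram assumption (the score of a segmentation is the sum of the log-probabilities of its morphs). This is not the Morfessor model or its search procedure: the function name `segment`, the toy lexicon, and all probabilities are invented for the example, and the lexicon is given by hand rather than induced from data.

```python
import math


def segment(word, morph_logprob):
    """Return the most probable split of `word` into lexicon morphs.

    Assumes a unigram model: a segmentation is scored by the sum of the
    log-probabilities of its morphs. Returns None if no split exists.
    """
    n = len(word)
    best = [-math.inf] * (n + 1)  # best[i]: best log-prob of word[:i]
    back = [0] * (n + 1)          # back[i]: start index of the last morph
    best[0] = 0.0
    for end in range(1, n + 1):
        for start in range(end):
            piece = word[start:end]
            if piece in morph_logprob and best[start] > -math.inf:
                score = best[start] + morph_logprob[piece]
                if score > best[end]:
                    best[end] = score
                    back[end] = start
    if best[n] == -math.inf:
        return None
    # Follow the back-pointers from the end of the word to recover morphs.
    morphs = []
    end = n
    while end > 0:
        start = back[end]
        morphs.append(word[start:end])
        end = start
    return morphs[::-1]


# Hypothetical toy lexicon; the probabilities are made up for illustration.
toy_lexicon = {
    morph: math.log(p)
    for morph, p in {
        "talo": 0.2, "ssa": 0.2, "ni": 0.1,
        "talossa": 0.001, "t": 0.01, "alo": 0.01,
    }.items()
}

print(segment("talossani", toy_lexicon))  # ['talo', 'ssa', 'ni']
```

In this toy setting the Finnish word form "talossani" is split into three morphs because that split has a higher total probability than keeping "talossa" whole. In the article's setting the lexicon itself is learned from raw text, which this sketch does not attempt.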

Index Terms

1. Computing methodologies
  1. Artificial intelligence
    1. Natural language processing
  2. Machine learning
    1. Learning paradigms
2. Mathematics of computing
  1. Information theory

Published in

ACM Transactions on Speech and Language Processing Volume 4, Issue 1
January 2007
68 pages
ISSN:1550-4875
EISSN:1550-4883
DOI:10.1145/1187415

Copyright © 2007 ACM
Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. Copyrights for components of this work owned by others than ACM must be honored. Abstracting with credit is permitted. To copy otherwise, or republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee. Request permissions from [email protected]
Publisher
Association for Computing Machinery
New York, NY, United States
Author Tags
Efficient storage
highly inflecting and compounding languages
language independent methods
maximum a posteriori (MAP) estimation
morpheme lexicon and segmentation
unsupervised learning