Abstract
We present a model family called Morfessor for the unsupervised induction of a simple morphology from raw text data. The model is formulated in a probabilistic maximum a posteriori framework. Morfessor can handle highly inflecting and compounding languages, where words can consist of lengthy sequences of morphemes. A lexicon of word segments, called morphs, is induced from the data. The lexicon stores information about both the usage and the form of the morphs. Several instances of the model are evaluated quantitatively in a morpheme segmentation task on Finnish and English data sets of different sizes. Morfessor is shown to perform very well compared to a widely known benchmark algorithm, in particular on Finnish data.
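To make the idea of segmenting a word into morphs from an induced lexicon concrete, here is a minimal sketch. It is not the Morfessor model or its search procedure: it assumes a lexicon has already been learned, scores a segmentation simply as the product of independent morph probabilities, and finds the best split by dynamic programming. The function name `segment`, the dictionary `toy_lexicon`, and all probabilities in it are hypothetical and chosen only for illustration.

```python
import math

def segment(word, morph_logprob):
    """Return the highest-scoring split of `word` into lexicon morphs.

    Assumption: a segmentation's score is the sum of the log-probabilities
    of its morphs (a unigram model), which is a simplification.
    """
    n = len(word)
    best = [-math.inf] * (n + 1)  # best[i]: best log-prob of word[:i]
    back = [0] * (n + 1)          # back[i]: start index of the last morph
    best[0] = 0.0
    for end in range(1, n + 1):
        for start in range(end):
            piece = word[start:end]
            if piece in morph_logprob and best[start] > -math.inf:
                score = best[start] + morph_logprob[piece]
                if score > best[end]:
                    best[end] = score
                    back[end] = start
    if best[n] == -math.inf:
        return [word]  # no split is covered by the lexicon; keep the word whole
    morphs = []
    i = n
    while i > 0:
        morphs.append(word[back[i]:i])
        i = back[i]
    return morphs[::-1]

# Hypothetical toy lexicon with made-up probabilities (not from the paper).
toy_lexicon = {
    "talo": math.log(0.2),       # Finnish "house"
    "ssa": math.log(0.1),        # inessive case ending
    "ni": math.log(0.1),         # first-person possessive suffix
    "talossa": math.log(0.001),  # the same stem and ending stored as one unit
}

print(segment("talossani", toy_lexicon))  # ['talo', 'ssa', 'ni']
```

In this toy setting the three-morph analysis wins because its combined probability (0.002) exceeds that of the two-morph alternative "talossa" + "ni" (0.0001). A maximum a posteriori formulation would additionally weigh a prior over the lexicon itself, which this sketch leaves out.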
Index Terms
- Unsupervised models for morpheme segmentation and morphology learning