ABSTRACT
We present a trainable model for identifying sentence boundaries in raw text. Given a corpus annotated with sentence boundaries, our model learns to classify each occurrence of ., ?, and ! as either a valid or invalid sentence boundary. The training procedure requires no hand-crafted rules, lexica, part-of-speech tags, or domain-specific information. The model can therefore be trained easily on any genre of English, and should be trainable on any other Roman-alphabet language. Performance is comparable to or better than that of similar systems, but we emphasize the simplicity of retraining for new domains.
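To make the classification setup concrete, here is a minimal sketch of a binary maximum entropy classifier over candidate punctuation marks. It is not the authors' implementation. The abstract does not list the features, so the feature templates below (the text attached before and after the mark, and whether the next word is capitalized) are assumptions chosen for illustration, as are all function names and the toy corpus. A binary maximum entropy model has the same form as logistic regression, and this sketch fits it with plain stochastic gradient ascent rather than an iterative scaling procedure.

```python
import math
import re
from collections import defaultdict

# Every ".", "?" or "!" is a candidate sentence boundary.
CANDIDATE = re.compile(r"[.?!]")

def extract_features(text, i):
    """Binary context features for the candidate mark at text[i] (illustrative only)."""
    before, after = text[:i], text[i + 1:]
    m = re.search(r"\S+$", before)
    prefix = m.group(0) if m else ""           # characters attached before the mark
    suffix = re.match(r"\S*", after).group(0)  # characters attached after the mark
    rest = after[len(suffix):].split()
    next_word = rest[0] if rest else "<END>"
    return [
        "bias",
        "mark=" + text[i],
        "prefix=" + prefix.lower(),
        "prefix_len=" + str(min(len(prefix), 4)),
        "has_suffix=" + str(bool(suffix)),
        "next_cap=" + str(next_word[:1].isupper()),
    ]

def make_examples(sentences):
    """Turn a list of gold sentences into (features, label) pairs."""
    text = " ".join(sentences)
    ends, pos = set(), 0
    for s in sentences:
        pos += len(s)
        ends.add(pos - 1)  # index of the sentence-final mark
        pos += 1           # the joining space
    return [(extract_features(text, m.start()), 1 if m.start() in ends else 0)
            for m in CANDIDATE.finditer(text)]

def predict_proba(weights, feats):
    """P(boundary | context) under the log-linear model."""
    z = sum(weights[f] for f in feats)
    return 1.0 / (1.0 + math.exp(-z))

def train(examples, epochs=200, lr=0.5):
    """Fit the weights by stochastic gradient ascent on the log-likelihood."""
    weights = defaultdict(float)
    for _ in range(epochs):
        for feats, label in examples:
            err = label - predict_proba(weights, feats)
            for f in feats:
                weights[f] += lr * err
    return weights

def split_sentences(weights, text):
    """Cut text after every candidate mark classified as a boundary."""
    out, start = [], 0
    for m in CANDIDATE.finditer(text):
        if predict_proba(weights, extract_features(text, m.start())) > 0.5:
            out.append(text[start:m.end()].strip())
            start = m.end()
    tail = text[start:].strip()
    if tail:
        out.append(tail)
    return out

# Hypothetical toy corpus; a real system would train on a large annotated corpus.
TOY_CORPUS = [
    "Dr. Smith arrived at 3.5 p.m. yesterday.",
    "Was he late?",
    "No!",
    "Mr. Jones left early.",
]

if __name__ == "__main__":
    model = train(make_examples(TOY_CORPUS))
    for sentence in split_sentences(model, " ".join(TOY_CORPUS)):
        print(sentence)
```

Retraining for a new domain or language in this sketch amounts to calling `train(make_examples(...))` on a differently annotated corpus; nothing in the feature extraction depends on a lexicon or tag set. With a corpus this small the model only memorizes its training contexts, so results on unseen text are not meaningful.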
- A maximum entropy approach to identifying sentence boundaries