DOI: 10.3115/974557.974561

A maximum entropy approach to identifying sentence boundaries

Published: 31 March 1997

ABSTRACT

We present a trainable model for identifying sentence boundaries in raw text. Given a corpus annotated with sentence boundaries, our model learns to classify each occurrence of ., ?, and ! as either a valid or invalid sentence boundary. The training procedure requires no hand-crafted rules, lexica, part-of-speech tags, or domain-specific information. The model can therefore be trained easily on any genre of English, and should be trainable on any other Roman-alphabet language. Performance is comparable to or better than the performance of similar systems, but we emphasize the simplicity of retraining for new domains.
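The classification setup described in the abstract can be illustrated with a minimal sketch. It only shows how candidate boundaries (each ., ?, or !) might be located in raw text and turned into contextual features that a binary maximum entropy classifier could be trained on. The whitespace tokenization, the function names `find_candidates` and `features`, and the particular feature set are assumptions made for illustration; the abstract does not specify the features the paper actually uses.

```python
import re

# Punctuation marks the abstract names as potential sentence boundaries.
CANDIDATE_MARKS = ".?!"


def find_candidates(text):
    """Yield one dict per occurrence of ., ?, or ! in the raw text.

    Assumption: tokens are whitespace-delimited; this is an illustrative
    choice, not necessarily the paper's.
    """
    tokens = [(m.group(), m.start()) for m in re.finditer(r"\S+", text)]
    for i, (token, start) in enumerate(tokens):
        next_token = tokens[i + 1][0] if i + 1 < len(tokens) else ""
        for offset, ch in enumerate(token):
            if ch in CANDIDATE_MARKS:
                yield {
                    "index": start + offset,        # character position in text
                    "mark": ch,                     # the punctuation mark itself
                    "prefix": token[:offset],       # token text before the mark
                    "suffix": token[offset + 1:],   # token text after the mark
                    "next_token": next_token,       # following token, if any
                }


def features(candidate):
    """Map a candidate to a set of string-valued features (hypothetical set)."""
    next_token = candidate["next_token"]
    return {
        "mark=" + candidate["mark"],
        "prefix=" + candidate["prefix"],
        "suffix=" + candidate["suffix"],
        "next=" + next_token,
        "next_is_capitalized=" + str(next_token[:1].isupper()),
    }


if __name__ == "__main__":
    for cand in find_candidates("Mr. Smith left. Did he?"):
        print(cand["index"], sorted(features(cand)))
```

Each candidate would be paired with a valid/invalid label taken from the annotated corpus, and the resulting feature sets passed to any maximum entropy trainer. Because only surface strings around the mark are used, retraining for a new genre under this sketch would require nothing beyond a newly annotated corpus, which is the property the abstract emphasizes.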

Published in: ANLC '97: Proceedings of the fifth conference on Applied natural language processing, March 1997, 417 pages.

Publisher: Association for Computational Linguistics, United States.
