Abstract
This paper introduces a new statistical approach to automatically partitioning text into coherent segments. The approach is based on a technique that incrementally builds an exponential model to extract features that are correlated with the presence of boundaries in labeled training text. The models use two classes of features: topicality features that use adaptive language models in a novel way to detect broad changes of topic, and cue-word features that detect occurrences of specific words, which may be domain-specific, that tend to be used near segment boundaries. Assessment of our approach on quantitative and qualitative grounds demonstrates its effectiveness in two very different domains, Wall Street Journal news articles and television broadcast news story transcripts. Quantitative results on these domains are presented using a new probabilistically motivated error metric, which combines precision and recall in a natural and flexible way. This metric is used to make a quantitative assessment of the relative contributions of the different feature types, as well as a comparison with decision trees and previously proposed text segmentation algorithms.
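The probabilistically motivated error metric described above penalizes a segmenter by the probability that two positions drawn a fixed distance apart are classified inconsistently (same segment vs. different segments) by the reference and the hypothesis. A minimal sketch of such a windowed metric, assuming boundaries are represented as the indices of segment-initial positions (the function name and representation are illustrative, not from the paper):

```python
def pk_error(ref_bounds, hyp_bounds, n, k=None):
    """Windowed segmentation error: the fraction of position pairs
    (i, i+k) on which reference and hypothesis disagree about whether
    the two positions fall in the same segment.

    ref_bounds, hyp_bounds: iterables of boundary indices; a boundary
    at index b separates position b-1 from position b.
    n: total number of positions in the text.
    k: window width; by convention, half the mean reference segment length.
    """
    ref, hyp = set(ref_bounds), set(hyp_bounds)
    if k is None:
        k = max(1, n // (2 * (len(ref) + 1)))
    trials = n - k
    errors = 0
    for i in range(trials):
        # boundary sites strictly between positions i and i+k
        window = range(i + 1, i + k + 1)
        ref_same = not any(b in ref for b in window)
        hyp_same = not any(b in hyp for b in window)
        errors += ref_same != hyp_same
    return errors / trials
```

For example, with ten positions and a single reference boundary at index 5, a hypothesis that places no boundary at all is penalized only on the windows that straddle the missed boundary, so the error is strictly between 0 and 1; a hypothesis that recovers the boundary exactly scores 0. This graded behavior is what lets the metric combine precision- and recall-like failures (false alarms and misses) in a single number.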
Beeferman, D., Berger, A. & Lafferty, J. Statistical Models for Text Segmentation. Machine Learning 34, 177–210 (1999). https://doi.org/10.1023/A:1007506220214