The Penn Treebank: An Overview

Part of the book series: Text, Speech and Language Technology (TLTB, volume 20)

Abstract

The Penn Treebank, in its eight years of operation (1989–1996), produced approximately 7 million words of part-of-speech tagged text, 3 million words of skeletally parsed text, over 2 million words of text parsed for predicate-argument structure, and 1.6 million words of transcribed spoken text annotated for speech disfluencies. This paper describes the design of the three annotation schemes used by the Treebank (POS tagging, syntactic bracketing, and disfluency annotation) and the methodology employed in production. All available Penn Treebank materials are distributed by the Linguistic Data Consortium (http://www.ldc.upenn.edu).

Copyright information

© 2003 Springer Science+Business Media Dordrecht

About this chapter

Cite this chapter

Taylor, A., Marcus, M., Santorini, B. (2003). The Penn Treebank: An Overview. In: Abeillé, A. (ed.) Treebanks. Text, Speech and Language Technology, vol 20. Springer, Dordrecht. https://doi.org/10.1007/978-94-010-0201-1_1

  • DOI: https://doi.org/10.1007/978-94-010-0201-1_1

  • Publisher Name: Springer, Dordrecht

  • Print ISBN: 978-1-4020-1335-5

  • Online ISBN: 978-94-010-0201-1

  • eBook Packages: Springer Book Archive
