DOI: 10.3115/1220355.1220443

Linguistic correlates of style: authorship classification with deep linguistic analysis features

Published: 23 August 2004

ABSTRACT

The identification of authorship falls into the category of style classification, an interesting sub-field of text categorization that deals with properties of the form of linguistic expression as opposed to the content of a text. Various feature sets and classification methods have been proposed in the literature, geared towards abstracting away from the content of a text and focusing on its stylistic properties. We demonstrate that in a realistically difficult authorship attribution scenario, deep linguistic analysis features such as context-free production frequencies and semantic relationship frequencies achieve significant error reduction over more commonly used "shallow" features such as function word frequencies and part-of-speech trigrams. Modern machine learning techniques like support vector machines allow us to explore large feature vectors, combining these different feature sets to achieve high classification accuracy in style-based tasks.
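To make the "shallow" feature sets named in the abstract concrete, the following is a minimal sketch of how function word frequencies and part-of-speech trigram frequencies might be computed and concatenated into one feature vector. It is not the authors' implementation: the function word list, the function names, and the assumption that part-of-speech tags arrive precomputed from some external tagger are all illustrative choices made here.

```python
import re
from collections import Counter

# Hypothetical, tiny function word list for illustration only;
# the abstract does not say which list the paper uses.
FUNCTION_WORDS = ["the", "of", "and", "to", "a", "in", "that", "but"]


def function_word_frequencies(text):
    """Relative frequency of each function word among all tokens."""
    tokens = re.findall(r"[a-z']+", text.lower())
    counts = Counter(tokens)
    total = len(tokens) or 1  # avoid division by zero on empty input
    return [counts[w] / total for w in FUNCTION_WORDS]


def trigram_frequencies(tags, vocabulary):
    """Relative frequency of each tag trigram listed in `vocabulary`.

    `tags` is assumed to be a list of part-of-speech tags produced
    by an external tagger (not shown here).
    """
    trigrams = Counter(zip(tags, tags[1:], tags[2:]))
    total = sum(trigrams.values()) or 1
    return [trigrams[t] / total for t in vocabulary]


def combined_features(text, tags, trigram_vocabulary):
    """Concatenate both feature sets into a single vector."""
    return (function_word_frequencies(text)
            + trigram_frequencies(tags, trigram_vocabulary))
```

A vector built this way could be passed to any classifier that accepts fixed-length numeric input, such as a support vector machine. The deep features the abstract mentions (context-free production frequencies, semantic relationship frequencies) would need a parser and are not sketched here.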

Published in
COLING '04: Proceedings of the 20th international conference on Computational Linguistics
August 2004
1411 pages

Publisher
Association for Computational Linguistics, United States

Qualifiers
Article

Acceptance Rates
COLING '04 Paper Acceptance Rate: 1,411 of 1,411 submissions, 100%
Overall Acceptance Rate: 1,537 of 1,537 submissions, 100%
