Linguistic correlates of style: authorship classification with deep linguistic analysis features

Author:
Michael Gamon

Microsoft Corp., One Microsoft Way, Redmond, WA

COLING '04: Proceedings of the 20th international conference on Computational Linguistics, August 2004, Pages 611–es. https://doi.org/10.3115/1220355.1220443

Published: 23 August 2004

ABSTRACT

The identification of authorship falls into the category of style classification, an interesting sub-field of text categorization that deals with properties of the form of linguistic expression as opposed to the content of a text. Various feature sets and classification methods have been proposed in the literature, geared towards abstracting away from the content of a text and focusing on its stylistic properties. We demonstrate that in a realistically difficult authorship attribution scenario, deep linguistic analysis features such as context-free production frequencies and semantic relationship frequencies achieve significant error reduction over more commonly used "shallow" features such as function word frequencies and part-of-speech trigrams. Modern machine learning techniques like support vector machines allow us to explore large feature vectors, combining these different feature sets to achieve high classification accuracy in style-based tasks.
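
As a rough illustration of the "shallow" feature sets named in the abstract, the sketch below computes relative function word frequencies and part-of-speech trigram frequencies, then merges them into one sparse feature vector. It is a minimal sketch under stated assumptions, not the paper's implementation: the function word list, the tag names, and the helper names (`FUNCTION_WORDS`, `function_word_freqs`, `pos_trigram_freqs`, `combine`) are hypothetical, and the input is assumed to be already tokenized and tagged. The deep features (context-free production and semantic relationship frequencies) would require a parser and are not shown.

```python
from collections import Counter

# Hypothetical, tiny function word list; a real list would be much longer.
FUNCTION_WORDS = ["the", "of", "and", "to", "in"]


def function_word_freqs(tokens):
    """Relative frequency of each function word among all tokens."""
    counts = Counter(t.lower() for t in tokens)
    total = len(tokens) or 1  # avoid division by zero on empty input
    return {f"fw:{w}": counts[w] / total for w in FUNCTION_WORDS}


def pos_trigram_freqs(tags):
    """Relative frequency of each part-of-speech trigram in a tag sequence."""
    trigrams = list(zip(tags, tags[1:], tags[2:]))
    total = len(trigrams) or 1
    counts = Counter(trigrams)
    return {"pos:" + "_".join(tri): n / total for tri, n in counts.items()}


def combine(*feature_sets):
    """Merge several named feature dictionaries into one sparse vector."""
    vector = {}
    for features in feature_sets:
        vector.update(features)
    return vector


# Assumed input: a tokenized sentence with one (made-up) tag per token.
tokens = ["The", "cat", "sat", "on", "the", "mat"]
tags = ["DET", "NOUN", "VERB", "ADP", "DET", "NOUN"]

vector = combine(function_word_freqs(tokens), pos_trigram_freqs(tags))
```

A dictionary of this kind, computed per document, could then be passed to a linear classifier such as a support vector machine, which is the learning method the abstract refers to.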

Index Terms
1. Computing methodologies
  1. Artificial intelligence
  2. Machine learning
    1. Learning paradigms
      1. Supervised learning
2. Hardware
  1. Power and energy
    1. Power estimation and optimization

Published in

COLING '04: Proceedings of the 20th international conference on Computational Linguistics
August 2004
1411 pages

Publisher

Association for Computational Linguistics
United States
