ABSTRACT
The identification of authorship falls into the category of style classification, an interesting sub-field of text categorization that deals with properties of the form of linguistic expression as opposed to the content of a text. Various feature sets and classification methods have been proposed in the literature, geared towards abstracting away from the content of a text and focusing on its stylistic properties. We demonstrate that in a realistically difficult authorship attribution scenario, deep linguistic analysis features such as context-free production frequencies and semantic relationship frequencies achieve significant error reduction over more commonly used "shallow" features such as function word frequencies and part-of-speech trigrams. Modern machine learning techniques like support vector machines allow us to explore large feature vectors, combining these different feature sets to achieve high classification accuracy in style-based tasks.
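As a rough illustration of the "shallow" features named in the abstract, the sketch below computes function word frequencies and part-of-speech trigram frequencies and concatenates them into one feature vector of the kind a classifier such as a support vector machine could consume. It is not the authors' implementation: the function word list, the tag names, the trigram vocabulary, and all function names are hypothetical, and the POS tags are assumed to come from some external tagger.

```python
from collections import Counter

# Hypothetical, deliberately tiny function word list (illustration only).
FUNCTION_WORDS = ["the", "of", "and", "to", "a", "in", "that", "it"]


def function_word_frequencies(tokens):
    """Relative frequency of each function word among all tokens."""
    counts = Counter(t.lower() for t in tokens)
    total = len(tokens) or 1  # avoid division by zero on empty input
    return [counts[w] / total for w in FUNCTION_WORDS]


def pos_trigram_frequencies(tags, vocabulary):
    """Relative frequency of each POS trigram listed in `vocabulary`."""
    trigrams = Counter(zip(tags, tags[1:], tags[2:]))
    total = sum(trigrams.values()) or 1
    return [trigrams[t] / total for t in vocabulary]


def combined_features(tokens, tags, trigram_vocabulary):
    """Concatenate both feature sets into a single vector."""
    return (function_word_frequencies(tokens)
            + pos_trigram_frequencies(tags, trigram_vocabulary))


# Toy input; the tags are assumed to be supplied by a POS tagger.
tokens = ["The", "cat", "sat", "on", "the", "mat"]
tags = ["DET", "NOUN", "VERB", "ADP", "DET", "NOUN"]
vocab = [("DET", "NOUN", "VERB"), ("ADP", "DET", "NOUN")]
vector = combined_features(tokens, tags, vocab)
```

Deep features such as context-free production frequencies could, under the same assumptions, be appended to the vector in the same way once a parser supplies the productions.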
- Linguistic correlates of style: authorship classification with deep linguistic analysis features