Authorship Attribution with Support Vector Machines

Diederich, Joachim; Kindermann, Jörg; Leopold, Edda; Paass, Gerhard

doi:10.1023/A:1023824908771

Published: July 2003

Applied Intelligence, volume 19, pages 109–123 (2003)

Abstract

In this paper we explore the use of text-mining methods to identify the author of a text. We apply the support vector machine (SVM) to this problem: because it can cope with half a million inputs, it requires no feature selection and can process the frequency vector of all words of a text. We performed a number of experiments with texts from a German newspaper. With nearly perfect reliability the SVM was able to reject other authors, and it detected the target author in 60–80% of the cases. In a second experiment, we ignored nouns, verbs and adjectives and replaced them with grammatical tags and bigrams. This resulted in slightly reduced performance. Author detection with SVMs on full word forms was remarkably robust even if the author wrote about different topics.
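
The setup described in the abstract (a frequency vector over all words, fed to an SVM that separates a target author from everyone else) can be illustrated with a minimal sketch. Everything below is an assumption made for illustration and is not taken from the paper: the four toy English sentences, the function names, the hyperparameters, and the choice to train a bias-free linear SVM by plain sub-gradient descent on the regularised hinge loss rather than with a dedicated SVM package.

```python
from collections import Counter


def frequency_vector(text, vocabulary):
    """Relative frequency of every vocabulary word in `text`.

    All words are kept; no feature selection is applied.
    """
    tokens = text.lower().split()
    counts = Counter(tokens)
    total = max(len(tokens), 1)  # guard against empty texts
    return [counts[word] / total for word in vocabulary]


def train_linear_svm(X, y, lam=0.01, eta=1.0, epochs=100):
    """Linear SVM without bias term, trained by sub-gradient descent.

    Minimises lam/2 * ||w||^2 + hinge loss; labels must be +1 or -1.
    `lam`, `eta` and `epochs` are illustrative values, not tuned ones.
    """
    w = [0.0] * len(X[0])
    for _ in range(epochs):
        for x, label in zip(X, y):
            margin = label * sum(wi * xi for wi, xi in zip(w, x))
            # Shrink the weights (gradient of the regulariser).
            w = [wi * (1.0 - eta * lam) for wi in w]
            # Hinge loss is active only when the margin is below 1.
            if margin < 1.0:
                w = [wi + eta * label * xi for wi, xi in zip(w, x)]
    return w


def is_target_author(w, x):
    """True if the linear decision function assigns x to the target author."""
    return sum(wi * xi for wi, xi in zip(w, x)) > 0.0


# Hypothetical toy corpus: +1 marks the target author, -1 any other author.
texts = [
    ("indeed the market rose indeed", +1),
    ("however the market fell however", -1),
    ("the team won indeed", +1),
    ("the team lost however", -1),
]

# The vocabulary is simply every word form seen in training.
vocabulary = sorted({tok for text, _ in texts for tok in text.lower().split()})
X = [frequency_vector(text, vocabulary) for text, _ in texts]
y = [label for _, label in texts]
w = train_linear_svm(X, y)

unseen = frequency_vector("indeed the weather turned", vocabulary)
print(is_target_author(w, unseen))
```

The one-against-the-rest labelling mirrors the abstract's framing of rejecting other authors and detecting the target author. A real experiment would need far more text per author and a held-out evaluation.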

Author information

Authors and Affiliations

School of Information Technology and Electrical Engineering, The University of Queensland, Brisbane, Q-4072, Australia
Joachim Diederich
GMD-Forschungszentrum Informationstechnik, D-52754 Sankt Augustin
Jörg Kindermann, Edda Leopold & Gerhard Paass

About this article

Cite this article

Diederich, J., Kindermann, J., Leopold, E. et al. Authorship Attribution with Support Vector Machines. Applied Intelligence 19, 109–123 (2003). https://doi.org/10.1023/A:1023824908771

Issue Date: July 2003
DOI: https://doi.org/10.1023/A:1023824908771
