Published in: Information Systems and e-Business Management 1/2017

02-03-2016 | Original Article

Gender classification of microblog text based on authorial style

Authors: Shubhadeep Mukherjee, Pradip Kumar Bala



Abstract

Gender profiling of unstructured text data has several applications in areas such as marketing, advertising, legal investigation, and recommender systems. The automatic detection of gender in microblogs, such as Twitter, is a difficult task. It requires a system that can use knowledge to interpret the linguistic styles used by the genders. In this paper, we try to provide this knowledge for such a system by considering different sets of features that are relatively independent of the text's content, such as function words and part-of-speech n-grams. We test a range of feature sets using two classifiers, namely Naïve Bayes and maximum entropy. Our results show that the gender detection task benefits from the inclusion of features that capture the authorial style of the microblog authors. We achieve an accuracy of approximately 71%, which outperforms the classification accuracy of commercially available gender detection software such as Gender Genie and Gender Guesser.
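To make the idea of content-independent style features concrete, the following is a minimal sketch, not the authors' implementation. It counts a small set of function words and feeds the counts to a multinomial Naïve Bayes classifier with add-one smoothing. The word list `FUNCTION_WORDS`, the class `NaiveBayes`, the toy texts, and the placeholder labels "A" and "B" are all assumptions made for illustration; the abstract does not give the paper's actual feature lists, data, or settings, and the toy data makes no linguistic claim.

```python
import math
from collections import Counter

# Hypothetical, very small function-word list (an assumption for illustration;
# the paper's actual list is not given in the abstract).
FUNCTION_WORDS = ("i", "you", "the", "a", "of", "and", "to", "my", "we", "it")


def function_word_counts(text):
    """Count only the function words in a lowercased, whitespace-split text."""
    tokens = text.lower().split()
    return Counter(t for t in tokens if t in FUNCTION_WORDS)


class NaiveBayes:
    """Multinomial Naive Bayes over function-word counts, add-one smoothing."""

    def fit(self, texts, labels):
        self.labels = sorted(set(labels))
        self.doc_counts = Counter(labels)
        self.word_counts = {label: Counter() for label in self.labels}
        for text, label in zip(texts, labels):
            self.word_counts[label].update(function_word_counts(text))
        return self

    def log_scores(self, text):
        """Return the unnormalised log posterior of each label for a text."""
        feats = function_word_counts(text)
        total_docs = sum(self.doc_counts.values())
        scores = {}
        for label in self.labels:
            counts = self.word_counts[label]
            # Add-one smoothing over the fixed function-word vocabulary.
            denom = sum(counts.values()) + len(FUNCTION_WORDS)
            score = math.log(self.doc_counts[label] / total_docs)
            for word, n in feats.items():
                score += n * math.log((counts[word] + 1) / denom)
            scores[label] = score
        return scores

    def predict(self, text):
        scores = self.log_scores(text)
        return max(scores, key=scores.get)


# Invented toy data with placeholder class labels (not from the paper).
train_texts = [
    "i love my dog and my cat",
    "my day and my plans",
    "the state of the game",
    "the end of the match",
]
train_labels = ["A", "A", "B", "B"]

model = NaiveBayes().fit(train_texts, train_labels)
```

Calling `model.predict(text)` on a new short text returns the label whose function-word profile fits it best. Part-of-speech n-grams could be added as further count features in the same way, given a tagger; a maximum entropy model would replace the class-conditional counts with learned feature weights.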


Metadata
Title
Gender classification of microblog text based on authorial style
Authors
Shubhadeep Mukherjee
Pradip Kumar Bala
Publication date
02-03-2016
Publisher
Springer Berlin Heidelberg
Published in
Information Systems and e-Business Management / Issue 1/2017
Print ISSN: 1617-9846
Electronic ISSN: 1617-9854
DOI
https://doi.org/10.1007/s10257-016-0312-0
