Published in: Information Systems and e-Business Management 1/2017

2 March 2016 | Original Article

Gender classification of microblog text based on authorial style

Authors: Shubhadeep Mukherjee, Pradip Kumar Bala

Abstract

Gender profiling of unstructured text data has applications in areas such as marketing, advertising, legal investigation, and recommender systems. Automatically detecting gender in microblogs such as Twitter is a difficult task: it requires a system that can use knowledge to interpret the linguistic styles used by the genders. In this paper, we try to provide this knowledge by considering different sets of features that are relatively independent of the text, such as function words and part-of-speech n-grams. We test a range of feature sets using two classifiers, namely Naïve Bayes and maximum entropy. Our results show that the gender detection task benefits from the inclusion of features that capture the authorial style of microblog authors. We achieve an accuracy of approximately 71%, which exceeds the classification accuracy of commercially available gender detection software such as Gender Genie and Gender Guesser.
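To make the idea of style features concrete, the following is a minimal sketch of function-word counting combined with a multinomial Naïve Bayes classifier. It is not the authors' implementation: the function-word list, the toy samples, the placeholder labels `class_a` and `class_b`, and all function names are assumptions made for illustration, and part-of-speech n-grams and the maximum entropy classifier are left out.

```python
import math
import re
from collections import Counter

# Illustrative function-word list (an assumption; the paper's list is not shown here).
FUNCTION_WORDS = ["the", "a", "of", "and", "to", "in", "i", "you", "my", "but"]

# Hypothetical toy samples with placeholder labels, only to exercise the code.
TOY_SAMPLES = [
    ("i love my dog and my cat", "class_a"),
    ("the report of the board", "class_b"),
]


def function_word_counts(text):
    """Return the count of each function word, in FUNCTION_WORDS order."""
    tokens = re.findall(r"[a-z']+", text.lower())
    counts = Counter(t for t in tokens if t in FUNCTION_WORDS)
    return [counts[w] for w in FUNCTION_WORDS]


def train_naive_bayes(samples):
    """Fit multinomial Naive Bayes with add-one smoothing.

    samples is a list of (text, label) pairs.
    Returns (log_priors, log_likelihoods), both keyed by label.
    """
    docs_per_label = Counter(label for _, label in samples)
    totals = {label: [0] * len(FUNCTION_WORDS) for label in docs_per_label}
    for text, label in samples:
        for i, count in enumerate(function_word_counts(text)):
            totals[label][i] += count

    log_priors = {
        label: math.log(n / len(samples)) for label, n in docs_per_label.items()
    }
    log_likelihoods = {}
    for label, vector in totals.items():
        denominator = sum(vector) + len(FUNCTION_WORDS)
        log_likelihoods[label] = [
            math.log((count + 1) / denominator) for count in vector
        ]
    return log_priors, log_likelihoods


def predict(text, model):
    """Return the label with the highest posterior log-score for text."""
    log_priors, log_likelihoods = model
    features = function_word_counts(text)
    scores = {
        label: log_priors[label]
        + sum(f * ll for f, ll in zip(features, log_likelihoods[label]))
        for label in log_priors
    }
    return max(scores, key=scores.get)


if __name__ == "__main__":
    model = train_naive_bayes(TOY_SAMPLES)
    print(predict("my cat and i", model))
```

Because the features are counts of topic-neutral words rather than content words, the classifier responds to how a text is written more than to what it is about. A realistic setup would need a much larger labelled corpus and held-out evaluation.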

Metadata
Title
Gender classification of microblog text based on authorial style
Authors
Shubhadeep Mukherjee
Pradip Kumar Bala
Publication date
2 March 2016
Publisher
Springer Berlin Heidelberg
Published in
Information Systems and e-Business Management, Issue 1/2017
Print ISSN: 1617-9846
Electronic ISSN: 1617-9854
DOI
https://doi.org/10.1007/s10257-016-0312-0
