Data Mining and Knowledge Discovery

Published: 12-05-2021

What’s in a name? – gender classification of names with character based machine learning models

Authors: Yifan Hu, Changwei Hu, Thanh Tran, Tejaswi Kasturi, Elizabeth Joseph, Matt Gillingham

Published in: Data Mining and Knowledge Discovery | Issue 4/2021

Abstract

Gender information is no longer a mandatory input when registering for an account at many leading Internet companies. However, predicting demographic information such as gender and age remains an important task, especially for intervening in unintentional gender/age bias in recommender systems. It is therefore necessary to infer the gender of users who did not provide this information during registration. We consider the problem of predicting the gender of registered users based on their declared name. By analyzing the first names of 100M+ users, we found that genders can be classified very effectively using the composition of the name strings. We propose a number of character-based machine learning models and demonstrate that they infer the gender of users with much higher accuracy than baseline models. Moreover, we show that using last names in addition to first names further improves classification performance.

For example, if a user reads an article about the department store Macy’s, a categorical variable wiki_Macy’s is added to the list of features describing the user.

As an example, in the U.S. Social Security Administration (SSA) data, around 21% of people with the name “Avery” are male. In a regression setting, we can fit a model to predict a value of 0.21 given this name. In a binary classification setting, on the other hand, we seek to predict a value of 0 (female) for this name.
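The difference between the two targets can be made concrete with a minimal sketch. It is an illustration only, not the authors' code: the helper name `targets_from_counts`, the counts (21 male, 79 female), and the choice to label a name male when its male fraction is at least 0.5 are all assumptions.

```python
from typing import Tuple


def targets_from_counts(male: int, female: int) -> Tuple[float, int]:
    """Return (regression target, binary label) for one name.

    The regression target is the fraction of people with the name who are
    male. The binary label is 1 (male) if that fraction is at least 0.5,
    otherwise 0 (female). The 0.5 tie rule is an assumption.
    """
    p_male = male / (male + female)
    label = 1 if p_male >= 0.5 else 0
    return p_male, label


# Hypothetical counts chosen so that 21% of people with the name are male.
p_male, label = targets_from_counts(male=21, female=79)
print(p_male, label)
```

A regression model would be trained to output the first value for this name, while a binary classifier would be trained on the second.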

Given a predominantly female name with a male probability of \(p < 0.5\), and given \(k\) randomly selected people with this name, the probability that more than half of these people are male is \(f(p, k) = \sum_{i=\lceil k/2\rceil}^{k} \binom{k}{i} p^i (1-p)^{k-i}\). For \(p = 0.3\), we found that \(f(p,k) < 0.05\) when \(k \ge 16\). For \(p = 0.4\), we found that \(f(p,k) < 0.05\) when \(k \ge 66\).
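The tail probability in this footnote can be checked numerically. The following is a minimal sketch, not the authors' code: the function name `majority_male_prob` is hypothetical, and the sum starts at \(\lfloor k/2\rfloor + 1\) so that it counts strictly more than half of the people. That index coincides with \(\lceil k/2\rceil\) when \(k\) is odd. For even \(k\), the ceiling index would also include the case of exactly half.

```python
from math import comb


def majority_male_prob(p: float, k: int) -> float:
    """Probability that strictly more than half of k people are male,
    when each person is independently male with probability p."""
    start = k // 2 + 1  # smallest count that is strictly more than half
    return sum(
        comb(k, i) * p**i * (1 - p) ** (k - i) for i in range(start, k + 1)
    )


if __name__ == "__main__":
    # The (p, k) pairs mentioned in the footnote.
    for p, k in [(0.3, 16), (0.4, 66)]:
        print(f"p={p}, k={k}: f = {majority_male_prob(p, k):.4f}")
```

Under these assumptions, evaluating the function over a range of \(k\) shows how quickly a majority vote over sampled users converges to the predominant gender of a name.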

Title: What’s in a name? – gender classification of names with character based machine learning models
Authors: Yifan Hu, Changwei Hu, Thanh Tran, Tejaswi Kasturi, Elizabeth Joseph, Matt Gillingham
Publication date: 12-05-2021
Publisher: Springer US
Published in: Data Mining and Knowledge Discovery / Issue 4/2021
Print ISSN: 1384-5810
Electronic ISSN: 1573-756X
DOI: https://doi.org/10.1007/s10618-021-00748-6
