1 Introduction
- Women in Wikipedia are on average slightly more notable than their male counterparts. Furthermore, the gap between the number of men and women is larger for ‘local heroes’ (people who are depicted in only a few language editions) than for ‘superstars’ (people who are present in almost all language editions). These effects can be explained by interpreting Wikipedia’s entry barrier as a subtle glass ceiling. While it is obvious that very notable people should be included in Wikipedia, the decision is less clear-cut for people who are less notable. We find that bias and inequality manifest themselves in the presence of such uncertainty, as the Wikipedia editor community must make more subjective decisions about inclusion.
- The topical focus of biographical content differs by gender: gender-, family-, and relationship-related topics are more dominant in the stand-alone overviews of biographies about women in the English Wikipedia.
- Linguistic bias becomes evident when looking at the abstractness and positivity of language. Abstract terms tend to be used to describe positive aspects in biographies of men, and negative aspects in biographies of women.
- There are structural differences in terms of meta-data and hyperlinks, which have consequences for information-seeking activities.
2 Data and methods
2.1 Dataset
Language | Fraction of women | Overlap with English edition | Biographies |
---|---|---|---|
English (en) | 0.155 | – | 893,380 |
Italian (it) | 0.151 | 0.986 | 134,122 |
German (de) | 0.132 | 0.995 | 102,233 |
French (fr) | 0.136 | 0.966 | 93,400 |
Polish (pl) | 0.158 | 0.986 | 69,531 |
Spanish (es) | 0.182 | 0.980 | 66,067 |
Russian (ru) | 0.158 | 0.988 | 64,233 |
Portuguese (pt) | 0.185 | 0.989 | 44,793 |
Dutch (nl) | 0.194 | 0.993 | 38,659 |
Japanese (ja) | 0.184 | 0.991 | 31,033 |
Hungarian (hu) | 0.179 | 0.999 | 18,074 |
Bulgarian (bg) | 0.149 | 1.000 | 16,850 |
Korean (ko) | 0.226 | 0.994 | 15,921 |
Turkish (tr) | 0.175 | 0.982 | 14,399 |
Indonesian (id) | 0.151 | 0.987 | 12,401 |
Arabic (ar) | 0.199 | 0.787 | 12,030 |
Czech (cs) | 0.156 | 1.000 | 10,765 |
Catalan (ca) | 0.183 | 0.995 | 7,721 |
Greek (el) | 0.145 | 0.806 | 6,748 |
Basque (eu) | 0.179 | 0.987 | 3,449 |
2.2 Approach
2.2.1 Global notability
2.2.2 Topical and linguistic bias
- The gender topic contains words that emphasize that someone is a man or woman (e.g., man, women, mr, mrs, lady, gentleman) as well as words about sexual identity (e.g., gay, lesbian).
- The relationship topic consists of words about romantic relationships (e.g., married, divorced, couple, husband, wife).
- The family topic aggregates words about family relations (e.g., kids, children, mother, grandmother).
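
The topic assignment described in the list above can be sketched as a simple dictionary lookup. This is only an illustration: the word lists below contain just the example words quoted in the text (the full lists are not given here), and the names `TOPIC_WORDS`, `topic_of`, and `topic_shares` are hypothetical.

```python
from collections import Counter

# Hypothetical seed lists: only the example words quoted in the text.
TOPIC_WORDS = {
    "gender": {"man", "women", "mr", "mrs", "lady", "gentleman", "gay", "lesbian"},
    "relationship": {"married", "divorced", "couple", "husband", "wife"},
    "family": {"kids", "children", "mother", "grandmother"},
}


def topic_of(term: str) -> str:
    """Return the first topic whose word list contains a token of `term`."""
    tokens = term.lower().split()
    for topic, words in TOPIC_WORDS.items():
        if any(tok in words for tok in tokens):
            return topic
    return "other"


def topic_shares(terms):
    """Percentage of terms that fall into each topic."""
    counts = Counter(topic_of(t) for t in terms)
    total = sum(counts.values())
    return {topic: 100.0 * n / total for topic, n in counts.items()}


# Toy input, not data from the paper.
terms = ["her husband", "actress", "married", "grandmother", "league"]
shares = topic_shares(terms)
```

A term that matches none of the three lists is counted as "other", which mirrors the "Other" column used when the topic shares are reported.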
2.2.3 Structural properties
- Random. We shuffle the edges in the original network. For each edge \((u,v)\), we select two random nodes \((i,j)\) and replace \((u,v)\) with \((i,j)\). The resulting network is a random graph with neither the heterogeneous degree distribution nor the clustered structure that the Wikipedia graph exhibits [22].
- Degree sequence. We generate a graph that preserves both in-degree and out-degree sequences (and therefore both distributions) by shuffling the structure of the original network. For a random pair of edges \(((u,v), (i,j))\), we rewire them to \(((u,j), (i,v))\). We repeat this shuffling as many times as there are edges. Note that although the in- and out-degree of each node are unchanged, the degree correlations and the clustering are lost.
- Small world. We generate an undirected small world graph using the model by Watts and Strogatz [23]. This model interpolates between a random graph and a lattice in a way that preserves two properties of small world networks: average path length and clustering coefficient. After building the graph, we randomly assign a gender to each node, maintaining the proportions from the observed network.
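
The first two baselines above can be sketched in a few lines of plain Python. This is a minimal illustration under stated assumptions, not the implementation used for the reported results: the function names are hypothetical, the random baseline draws two distinct nodes per edge (the text does not say whether self-loops are allowed), and the degree-sequence baseline does not filter out self-loops or duplicate edges that a swap may create.

```python
import random
from collections import Counter


def random_baseline(edges, seed=None):
    """Replace every edge (u, v) with an edge between two random nodes (i, j)."""
    rng = random.Random(seed)
    nodes = sorted({n for edge in edges for n in edge})
    # Assumption: i and j are distinct, so no self-loops are produced.
    return [tuple(rng.sample(nodes, 2)) for _ in edges]


def degree_sequence_baseline(edges, seed=None):
    """Swap targets of random edge pairs: (u, v), (i, j) -> (u, j), (i, v)."""
    rng = random.Random(seed)
    edges = list(edges)
    m = len(edges)
    for _ in range(m):  # as many swaps as there are edges
        a, b = rng.sample(range(m), 2)
        (u, v), (i, j) = edges[a], edges[b]
        edges[a], edges[b] = (u, j), (i, v)
    return edges


def degrees(edges):
    """Return (in-degree, out-degree) counters for a directed edge list."""
    return Counter(v for _, v in edges), Counter(u for u, _ in edges)


# Toy directed graph, not data from the paper.
toy = [(0, 1), (0, 2), (1, 2), (2, 3), (3, 0), (3, 1)]
shuffled = degree_sequence_baseline(toy, seed=1)
randomized = random_baseline(toy, seed=1)
```

Because each swap only exchanges the targets of two edges, every node keeps its in-degree and out-degree, which is the property the degree-sequence baseline relies on.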
2.3 Tools
3 Results
3.1 Inequalities in global notability thresholds
3.1.1 Number of language editions
Variable | β (0-1899) | Std. err. (0-1899) | p (0-1899) | β (1900-present) | Std. err. (1900-present) | p (1900-present) | β (0-present) | Std. err. (0-present) | p (0-present) |
---|---|---|---|---|---|---|---|---|---|
C(class)[T. Ambassador] | 0.083 | 0.148 | 0.574 | −0.537 | 0.076 | ∗∗∗ | −0.412 | 0.068 | ∗∗∗ |
C(class)[T. Architect] | 0.355 | 0.041 | ∗∗∗ | 0.574 | 0.047 | ∗∗∗ | 0.421 | 0.031 | ∗∗∗ |
C(class)[T. Artist] | 0.853 | 0.012 | ∗∗∗ | 0.420 | 0.005 | ∗∗∗ | 0.508 | 0.005 | ∗∗∗ |
C(class)[T. Astronaut] | – | – | – | 1.403 | 0.038 | ∗∗∗ | 1.428 | 0.038 | ∗∗∗ |
C(class)[T. Athlete] | −0.344 | 0.011 | ∗∗∗ | 0.042 | 0.004 | ∗∗∗ | 0.084 | 0.003 | ∗∗∗ |
C(class)[T. BeautyQueen] | – | – | – | −0.290 | 0.035 | ∗∗∗ | −0.206 | 0.035 | ∗∗∗ |
C(class)[T. BusinessPerson] | −1.066 | 0.254 | ∗∗∗ | −0.929 | 0.173 | ∗∗∗ | −0.983 | 0.143 | ∗∗∗ |
C(class)[T. Chef] | 0.272 | 0.571 | 0.633 | −0.268 | 0.070 | ∗∗∗ | −0.217 | 0.070 | 0.002 |
C(class)[T. Cleric] | 0.545 | 0.022 | ∗∗∗ | 0.417 | 0.020 | ∗∗∗ | 0.477 | 0.015 | ∗∗∗ |
C(class)[T. Coach] | −0.932 | 0.042 | ∗∗∗ | −0.938 | 0.023 | ∗∗∗ | −0.941 | 0.020 | ∗∗∗ |
C(class)[T. Criminal] | 0.468 | 0.073 | ∗∗∗ | 0.197 | 0.030 | ∗∗∗ | 0.244 | 0.028 | ∗∗∗ |
C(class)[T. Economist] | 1.504 | 0.099 | ∗∗∗ | 0.941 | 0.045 | ∗∗∗ | 1.043 | 0.041 | ∗∗∗ |
C(class)[T. Engineer] | 0.411 | 0.054 | ∗∗∗ | 0.002 | 0.079 | 0.979 | 0.243 | 0.044 | ∗∗∗ |
C(class)[T. FictionalCharacter] | – | – | – | −1.021 | 0.418 | 0.015 | −0.969 | 0.419 | 0.021 |
C(class)[T. Historian] | −0.579 | 0.172 | 0.001 | −0.756 | 0.117 | ∗∗∗ | −0.730 | 0.097 | ∗∗∗ |
C(class)[T. HorseTrainer] | −0.983 | 0.563 | 0.081 | −0.999 | 0.107 | ∗∗∗ | −0.987 | 0.106 | ∗∗∗ |
C(class)[T. Journalist] | −0.899 | 0.176 | ∗∗∗ | −1.032 | 0.078 | ∗∗∗ | −1.005 | 0.072 | ∗∗∗ |
C(class)[T. Judge] | −0.580 | 0.055 | ∗∗∗ | −0.700 | 0.040 | ∗∗∗ | −0.677 | 0.033 | ∗∗∗ |
C(class)[T. MilitaryPerson] | −0.014 | 0.011 | 0.195 | −0.287 | 0.013 | ∗∗∗ | −0.166 | 0.008 | ∗∗∗ |
C(class)[T. Model] | −0.146 | 0.704 | 0.836 | 0.249 | 0.030 | ∗∗∗ | 0.332 | 0.030 | ∗∗∗ |
C(class)[T. Monarch] | 1.024 | 0.064 | ∗∗∗ | 1.313 | 0.119 | ∗∗∗ | 1.227 | 0.056 | ∗∗∗ |
C(class)[T. Noble] | 0.096 | 0.029 | 0.001 | 0.009 | 0.135 | 0.944 | 0.175 | 0.028 | ∗∗∗ |
C(class)[T. OfficeHolder] | 0.340 | 0.011 | ∗∗∗ | 0.300 | 0.007 | ∗∗∗ | 0.308 | 0.006 | ∗∗∗ |
C(class)[T. Philosopher] | 1.992 | 0.050 | ∗∗∗ | 1.180 | 0.040 | ∗∗∗ | 1.547 | 0.031 | ∗∗∗ |
C(class)[T. PlayboyPlaymate] | – | – | – | −0.068 | 0.078 | 0.381 | −0.014 | 0.078 | 0.854 |
C(class)[T. Politician] | 0.067 | 0.011 | ∗∗∗ | 0.098 | 0.009 | ∗∗∗ | 0.068 | 0.007 | ∗∗∗ |
C(class)[T. Presenter] | 0.121 | 0.458 | 0.792 | −0.758 | 0.068 | ∗∗∗ | −0.701 | 0.068 | ∗∗∗ |
C(class)[T. Religious] | 0.295 | 0.115 | 0.010 | 0.112 | 0.076 | 0.145 | 0.172 | 0.064 | 0.007 |
C(class)[T. Royalty] | 1.175 | 0.017 | ∗∗∗ | 1.077 | 0.029 | ∗∗∗ | 1.155 | 0.015 | ∗∗∗ |
C(class)[T. Scientist] | 1.191 | 0.014 | ∗∗∗ | 0.631 | 0.012 | ∗∗∗ | 0.854 | 0.009 | ∗∗∗ |
C(class)[T. SportsManager] | 0.306 | 0.053 | ∗∗∗ | 0.464 | 0.010 | ∗∗∗ | 0.493 | 0.010 | ∗∗∗ |
C(gender)[T. female] | −0.044 | 0.011 | ∗∗∗ | 0.116 | 0.004 | ∗∗∗ | 0.119 | 0.004 | ∗∗∗ |
birth_decade | −0.017 | 0.000 | ∗∗∗ | 0.010 | 0.001 | ∗∗∗ | −0.010 | 0.000 | ∗∗∗ |
Intercept | 4.269 | 0.060 | ∗∗∗ | −0.684 | 0.131 | ∗∗∗ | 3.022 | 0.038 | ∗∗∗ |
AIC | 660,646.944 | | | 2,206,624.237 | | | 2,873,689.603 | | |
Num. obs. | 134,306.000 | | | 456,435.000 | | | 590,741.000 | | |
3.1.2 Google search trends
Variable | β (Num. regions) | Std. err. (Num. regions) | p (Num. regions) | β (Num. months) | Std. err. (Num. months) | p (Num. months) |
---|---|---|---|---|---|---|
Intercept | 0.4417 | 0.048 | ∗∗∗ | 3.4120 | 0.021 | ∗∗∗ |
C(gender)[T.female] | 0.2792 | 0.117 | ∗ | 0.1220 | 0.052 | ∗ |
AIC | 20,939.81 | | | 53,351.84 | | |
Num. obs. | 5,998 | | | 5,998 | | |
3.2 Topical and linguistic asymmetries
3.2.1 Topical bias
- Pre-1900: the three words most strongly associated with females are her husband, women’s, and actress. The three most strongly associated with males are served, elected, and politician.
- 1900 onwards: the three words most strongly associated with females are actress, women’s, and female. The three most strongly associated with males are played, league, and football.
Group | Family | Gender | Relationship | Other |
---|---|---|---|---|
0-1900 | | | | |
Men | 0.5 | 1.5 | 0 | 98 |
Women | 5.0 | 7 | 3 | 85 |
1900-present | | | | |
Men | 0.5 | 2.5 | 0 | 97 |
Women | 3 | 4.5 | 2 | 90.5 |
3.2.2 Linguistic bias
Category | % in men | % in women | \(\boldsymbol{\chi^{2}}\) | w | % change |
---|---|---|---|---|---|
Abstract positive | 27.96 | 25.53 | 933.7∗∗∗ | 0.04 | 8.69 |
Abstract negative | 13.47 | 13.69 | 6.26∗∗ | 0.005 | −1.62 |
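
The "% change" column in the table above appears to be the difference between the two percentages, expressed relative to the percentage in men. That reading is an inference from the reported numbers rather than a definition given in the text, and the helper name `percent_change` is hypothetical.

```python
def percent_change(pct_men: float, pct_women: float) -> float:
    """Difference between the two groups, relative to the percentage in men.

    Assumption: men are the baseline; the text does not state this explicitly.
    """
    return 100.0 * (pct_men - pct_women) / pct_men


abstract_positive = percent_change(27.96, 25.53)
abstract_negative = percent_change(13.47, 13.69)
```

With the rounded inputs from the table, the first row reproduces the reported 8.69, while the second gives about −1.63 instead of −1.62, presumably because the published value was computed from unrounded percentages.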
Variable | Abstract positive | Abstract negative |
---|---|---|
(Intercept) | \( 0.63\ (0.05)^{***} \) | \( 0.25\ (0.05)^{***} \) |
G[female] | \( -0.02\ (0.00)^{***} \) | \( 0.01\ (0.00)^{**} \) |
cArchitect | 0.07 (0.05) | 0.06 (0.05) |
cArtist | 0.01 (0.04) | 0.07 (0.05) |
cAstronaut | −0.04 (0.06) | 0.00 (0.06) |
cAthlete | 0.03 (0.04) | 0.05 (0.05) |
cBeautyQueen | −0.02 (0.05) | −0.06 (0.05) |
cBusinessPerson | 0.00 (0.09) | −0.02 (0.09) |
cChef | 0.03 (0.06) | 0.01 (0.06) |
cCleric | \( -0.10\ (0.04)^{*} \) | 0.07 (0.05) |
cCoach | −0.04 (0.04) | \( 0.15\ (0.05)^{**} \) |
cCriminal | \( -0.09\ (0.04)^{*} \) | 0.09 (0.05) |
cEconomist | −0.01 (0.05) | \( 0.15\ (0.05)^{**} \) |
cEngineer | 0.01 (0.05) | 0.08 (0.05) |
cHistorian | −0.00 (0.07) | 0.10 (0.07) |
cHorseTrainer | −0.06 (0.05) | 0.03 (0.06) |
cJournalist | −0.03 (0.06) | \( 0.15\ (0.06)^{*} \) |
cJudge | \( -0.17\ (0.04)^{***} \) | 0.02 (0.05) |
cMilitaryPerson | −0.05 (0.04) | −0.02 (0.05) |
cModel | −0.03 (0.05) | 0.02 (0.06) |
cMonarch | −0.07 (0.05) | 0.01 (0.06) |
cNoble | −0.06 (0.05) | 0.03 (0.05) |
cOfficeHolder | −0.06 (0.04) | 0.04 (0.05) |
cPerson | −0.02 (0.04) | 0.07 (0.05) |
cPhilosopher | 0.05 (0.05) | \( 0.11\ (0.05)^{*} \) |
cPlayboyPlaymate | −0.06 (0.10) | −0.03 (0.10) |
cPolitician | −0.06 (0.04) | 0.05 (0.05) |
cPresenter | −0.05 (0.06) | 0.06 (0.06) |
cReligious | 0.04 (0.06) | 0.11 (0.06) |
cRoyalty | −0.07 (0.04) | 0.04 (0.05) |
cScientist | 0.05 (0.04) | \( 0.10\ (0.05)^{*} \) |
cSportsManager | 0.01 (0.04) | 0.06 (0.05) |
cent | \( -0.02\ (0.00)^{***} \) | \( -0.01\ (0.00)^{***} \) |
AIC | −20,917.94 | −21,900.42 |
Num. obs. | 50,965 | 48,942 |
3.3 Structural inequalities
3.3.1 Meta-data
Attribute | % men (0-1899) | % women (0-1899) | \(\boldsymbol{\chi^{2}}\) (0-1899) | w (0-1899) | % men (1900-present) | % women (1900-present) | \(\boldsymbol{\chi^{2}}\) (1900-present) | w (1900-present) |
---|---|---|---|---|---|---|---|---|
activeYearsEndDate | 1.68 | 0.11 | 23.25∗∗∗ | 3.84 | 2.94 | 1.67 | 0.97 | – |
activeYearsStartYear | 0.64 | 1.08 | 0.31 | – | 8.07 | 12.92 | 2.91 | – |
birthName | 0.53 | 1.02 | 0.44 | – | 2.86 | 8.45 | 10.93∗∗∗ | 1.40 |
careerStation | – | – | – | – | 8.35 | 1.08 | 48.81∗∗∗ | 2.59 |
deathDate | 15.25 | 7.10 | 9.37∗∗ | 1.07 | 12.50 | 9.27 | 1.13 | – |
deathYear | 16.15 | 7.51 | 9.94∗∗ | 1.07 | 13.09 | 9.58 | 1.29 | – |
homepage | 0.03 | 0.02 | 0 | – | 2.92 | 6.43 | 4.22∗ | 1.10 |
numberOfMatches | – | – | – | – | 8.06 | 1.02 | 48.58∗∗∗ | 2.63 |
occupation | 1.68 | 1.43 | 0.04 | – | 7.51 | 15.69 | 8.90∗∗ | 1.04 |
position | 0.61 | 0 | 513.34∗∗∗ | 29.04 | 12.54 | 1.63 | 73.10∗∗∗ | 2.59 |
spouse | 0.44 | 1.51 | 2.57 | – | 0.74 | 3.47 | 10.12∗∗ | 1.92 |
team | – | – | – | – | 12.74 | 1.78 | 67.59∗∗∗ | 2.48 |
title | 1.44 | 1.91 | 0.15 | – | 4.94 | 12.49 | 11.53∗∗∗ | 1.24 |
years | – | – | – | – | 8.34 | 1.08 | 48.82∗∗∗ | 2.59 |
- Attributes activeYearsEndDate, activeYearsStartYear, careerStation, numberOfMatches, position, team, and years are more frequently used to describe men. All of these attributes are related to sports, so the differences can be explained by the prominence of men in sports-related DBpedia classes (e.g., Athlete, SportsManager, and Coach [5]). Differences in activeYearsStartYear are only significant at the entire dataset level, and differences in activeYearsEndDate are only significant before the 20th century. The other attributes differ significantly mostly in recent times.
- Attributes deathDate and deathYear are more frequently used for men born before 1900. A possible explanation is that the lives of women were less well documented than the lives of men in the past, and therefore it is more likely that the death date or birth date is unknown for women.
- Attribute birthName is more frequently used for women in recent times. Its values refer mostly to the original names of artists, and women have a considerable presence in this class [5]. A likely explanation is that married women change their surnames to those of their husbands in some cultures.
- Attributes occupation and title are more frequently used to describe women in recent times, and seem to serve the same purpose through different mechanisms. On the one hand, title is a text description of a person’s occupation (the most common values found are Actor and Actress). On the other hand, occupation is a DBpedia resource URI (e.g., http://dbpedia.org/resource/Actress). These attributes are present in the infoboxes of art-related biographies. Conversely, the infoboxes of sport-related biographies do not contain these attributes because their templates are different and contain other attributes (like the aforementioned careerStation and position). Thus the meta-data of athletes, who are mostly men, do not contain such attributes.
- The homepage attribute is more frequently used for women in recent times. Our manual inspection showed that biographies from the Artist class tend to have homepages, which explains why the attribute is used more frequently for women.
- The spouse attribute is more frequently used for women in recent times. This attribute indicates whether the portrayed person was married, and to whom. In some cases it contains the resource URI of the spouse, while in other cases it contains the spouse’s name (e.g., when the spouse does not have a Wikipedia article) or the resource URI of the article on ‘divorced status.’ This difference is consistent with our results about topical gender differences, where terms related to relationships show a stronger association with women than with men.
3.3.2 Network structure
Network | Edges | Clust. coeff. | Edges (M to M) | Edges (M to W) | \(\boldsymbol{\chi^{2}}\) (M to W) | Edges (W to M) | Edges (W to W) | \(\boldsymbol{\chi^{2}}\) (W to W) |
---|---|---|---|---|---|---|---|---|
0-1900 | | | | | | | | |
Observed | 584,879 | 0.16 | 93.10% | 6.90% | 0.20 | 69.47% | 30.53% | 67.25∗∗∗ |
Random | 415,145 | 0.00 | 92.26% | 7.74% | 0.02 | 92.28% | 7.72% | 0.02 |
Small world | 219,058 | 0.16 | 91.89% | 8.11% | 0.00 | 91.53% | 8.47% | 0.02 |
Degree sequence | 584,879 | 0.00 | 90.22% | 9.78% | 0.37 | 90.25% | 9.75% | 0.35 |
1900-present | | | | | | | | |
Observed | 1,772,793 | 0.11 | 89.47% | 10.53% | 3.37 | 54.91% | 45.09% | 52.67∗∗∗ |
Random | 1,052,299 | 0.00 | 83.15% | 16.85% | 0.03 | 83.21% | 16.79% | 0.04 |
Small world | 647,524 | 0.11 | 82.51% | 17.49% | 0.00 | 82.48% | 17.52% | 0.00 |
Degree sequence | 1,772,793 | 0.00 | 83.00% | 17.00% | 0.02 | 83.11% | 16.89% | 0.03 |
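
The edge percentages in the table above are shares of outgoing links grouped by the gender of the source biography, so each pair (M to M, M to W) and (W to M, W to W) sums to 100%. A minimal sketch of that computation follows; the function name `edge_shares_by_source` and the toy graph are hypothetical and not taken from the paper.

```python
from collections import Counter


def edge_shares_by_source(edges, gender):
    """Percentage of each source gender's outgoing edges, by target gender.

    `edges` is a list of directed (source, target) pairs and `gender` maps
    each node to a label such as "M" or "W".
    """
    pair_counts = Counter((gender[u], gender[v]) for u, v in edges)
    out_totals = Counter(gender[u] for u, _ in edges)
    return {
        (src, dst): 100.0 * n / out_totals[src]
        for (src, dst), n in pair_counts.items()
    }


# Toy graph, not data from the paper: three men (a, b, e) and two women (c, d).
gender = {"a": "M", "b": "M", "e": "M", "c": "W", "d": "W"}
edges = [("a", "b"), ("b", "e"), ("e", "a"), ("a", "c"), ("c", "d"), ("d", "a")]
shares = edge_shares_by_source(edges, gender)
```

Applying the same function to the observed network and to each baseline graph yields directly comparable rows.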