Published in: Social Network Analysis and Mining 1/2019

1 December 2019 | Original Article

Detecting pages to protect in Wikipedia across multiple languages

Authors: Francesca Spezzano, Kelsey Suyehira, Laxmi Amulya Gundala


Abstract

Wikipedia is based on the idea that anyone can edit the website to create reliable, crowd-sourced content. Yet under the cover of internet anonymity, some users make changes that do not align with Wikipedia’s intended uses. For this reason, Wikipedia allows some pages to be protected, so that only certain users can revise them. This allows administrators to protect pages from vandalism, libel, and edit wars. However, with over five million pages on English Wikipedia, it is impossible for active editors to monitor all pages and suggest articles in need of protection. In this paper, we consider the problem of deciding whether a page should be protected in a collaborative environment such as Wikipedia. We formulate the problem as a binary classification task and propose a novel set of features to decide which pages to protect based on (1) users’ page revision behavior and (2) page categories. We tested our system, called DePP, on four Wikipedia language versions: English, German, French, and Italian. Experimental results show that DePP reaches at least 0.93 in both AUROC and average precision across the four languages and significantly outperforms baselines. Moreover, DePP works well in a more realistic, unbalanced setting, that is, when protected pages are greatly outnumbered by unprotected pages: it achieves a good AUROC, a high recall, and an average precision significantly higher than the baselines in all the settings and languages considered.
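
The two headline metrics in the abstract, AUROC and average precision, can be illustrated with a minimal pure-Python sketch. This is not the authors' evaluation code: the function names, the label convention (1 = protected, 0 = unprotected), and the toy scores are assumptions made for illustration, and the average-precision function assumes there are no tied scores.

```python
from typing import List, Sequence


def auroc(labels: Sequence[int], scores: Sequence[float]) -> float:
    """Probability that a random positive outranks a random negative.

    Ties between a positive and a negative score count as half a win.
    """
    pos: List[float] = [s for y, s in zip(labels, scores) if y == 1]
    neg: List[float] = [s for y, s in zip(labels, scores) if y == 0]
    if not pos or not neg:
        raise ValueError("need at least one positive and one negative example")
    wins = 0.0
    for p in pos:
        for n in neg:
            if p > n:
                wins += 1.0
            elif p == n:
                wins += 0.5
    return wins / (len(pos) * len(neg))


def average_precision(labels: Sequence[int], scores: Sequence[float]) -> float:
    """Mean of precision@k over the ranks k at which a positive appears.

    Assumes no tied scores; with ties the result depends on input order.
    """
    ranked = sorted(zip(scores, labels), key=lambda pair: -pair[0])
    hits = 0
    total = 0.0
    for k, (_, y) in enumerate(ranked, start=1):
        if y == 1:
            hits += 1
            total += hits / k
    if hits == 0:
        raise ValueError("need at least one positive example")
    return total / hits


# Hypothetical toy data: 3 protected pages (label 1) among 8 pages.
toy_labels = [1, 0, 1, 0, 0, 0, 1, 0]
toy_scores = [0.9, 0.8, 0.7, 0.4, 0.35, 0.3, 0.2, 0.1]
toy_auroc = auroc(toy_labels, toy_scores)
toy_ap = average_precision(toy_labels, toy_scores)
```

AUROC is insensitive to the class ratio, while average precision is not, which is one reason both are worth reporting in the unbalanced setting.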

Footnotes
1
Autoconfirmed registered users can send these requests through the Twinkle gadget (Twinkle, https://en.wikipedia.org/wiki/Wikipedia:Twinkle).
 
2
Adler et al. also included the features implemented in WikiTrust in their analysis. However, WikiTrust was discontinued as a tool to detect vandalism in 2012 due to unreliability (WikiTrust, https://en.wikipedia.org/wiki/WikiTrust; Computing Wikipedia’s authority, https://acrlog.org/2007/08/15/computing-wikipedias-authority/).
 
3
STiki uses spatio-temporal features such as edit time-of-day, edit day-of-week, time-since article edited, time-since editor registered, time-since last user-offending edit, revision comment length, registered user properties, and reputation features such as article, category, editor, and country reputation (West et al. 2010).
 
4
More details on the grid search are provided in the Appendix.
 
5
The data on whether or not an edit has been reverted by these bots/tools is directly available as metadata of the edits we crawled.
 
6
In English Wikipedia, the average number of edit wars in protected pages is 1.37 while the same number for unprotected pages is 0.06.
 
7
Because the dataset with page controversy level contains more non-controversial pages than controversial ones, we balanced the number of controversial/non-controversial training pages via majority undersampling to avoid bias towards non-controversial Wikipedia articles. The sampling was conducted 10 times and the results are averaged.
 
8
As in Sect. 4, the set of unprotected pages is sampled uniformly at random from the complete list of unprotected pages.
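
The resampling procedure described in footnote 7 (majority undersampling, repeated 10 times, with results averaged) can be sketched as follows. This is an illustrative sketch only, not the authors' code: the `Example` tuple layout, the function names, the seed, and the `evaluate` callback (a stand-in for training and scoring a classifier on the balanced set) are all hypothetical.

```python
import random
from statistics import mean
from typing import Callable, List, Sequence, Tuple

# (page identifier, label); 1 = controversial, 0 = non-controversial.
Example = Tuple[str, int]


def undersample_majority(examples: Sequence[Example],
                         rng: random.Random) -> List[Example]:
    """Keep every minority example plus an equally sized random
    subset of the majority class, drawn without replacement."""
    pos = [e for e in examples if e[1] == 1]
    neg = [e for e in examples if e[1] == 0]
    minority, majority = (pos, neg) if len(pos) <= len(neg) else (neg, pos)
    balanced = minority + rng.sample(majority, len(minority))
    rng.shuffle(balanced)
    return balanced


def averaged_over_resamples(examples: Sequence[Example],
                            evaluate: Callable[[List[Example]], float],
                            runs: int = 10,
                            seed: int = 0) -> float:
    """Repeat the undersampling `runs` times and average the scores
    returned by the (hypothetical) `evaluate` callback."""
    rng = random.Random(seed)
    return mean(evaluate(undersample_majority(examples, rng))
                for _ in range(runs))


# Hypothetical toy data: 5 controversial and 50 non-controversial pages.
pages = [(f"c{i}", 1) for i in range(5)] + [(f"n{i}", 0) for i in range(50)]
one_sample = undersample_majority(pages, random.Random(1))
# Stand-in metric: fraction of controversial pages in the balanced set.
share = averaged_over_resamples(
    pages, lambda sample: sum(y for _, y in sample) / len(sample))
```

Averaging over several independent draws reduces the dependence of the reported result on which majority-class pages happened to be kept in any single sample.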
 
References
Adler BT, De Alfaro L, Pye I (2010) Detecting wikipedia vandalism using wikitrust: lab report for PAN at CLEF. In: CLEF 2010 LABs and Workshops, Notebook Papers, 22–23 September, Padua, Italy
Adler BT, De Alfaro L, Mola-Velasco SM, Rosso P, West AG (2011) Wikipedia vandalism detection: combining natural language, metadata, and reputation features. In: Computational linguistics and intelligent text processing: 12th International Conference, CICLing 2011, Tokyo, Japan, February 20–26, 2011. Proceedings, Part II, pp 277–288
Das S, Lavoie A, Magdon-Ismail M (2016) Manipulation among the arbiters of collective intelligence: how wikipedia administrators mold public opinion. ACM Trans Web 10(4):24:1–24:25
Dori-Hacohen S, Allan J (2013) Detecting controversy on the web. In: 22nd ACM international conference on information and knowledge management, CIKM’13, San Francisco, CA, USA, October 27–November 1, 2013, pp 1845–1848
Dori-Hacohen S, Allan J (2015) Automated controversy detection on the web. In: Advances in information retrieval: 37th European Conference on IR Research, ECIR 2015, Proceedings, Vienna, Austria, March 29–April 2, pp 423–434
Dori-Hacohen S, Jensen DD, Allan J (2016) Controversy detection in wikipedia using collective classification. In: Proceedings of the 39th International ACM SIGIR conference on Research and Development in Information Retrieval, SIGIR 2016, Pisa, Italy, July 17–21, pp 797–800
Green T, Spezzano F (2017) Spam users identification in wikipedia via editing behavior. In: Proceedings of the eleventh international conference on web and social media, ICWSM 2017, Montréal, Québec, Canada, May 15–18, 2017, pp 532–535
Hill BM, Shaw AD (2015) Page protection: another missing dimension of wikipedia research. In: Proceedings of the 11th International Symposium on Open Collaboration, San Francisco, CA, USA, August 19–21, 2015, pp 15:1–15:4
Jang M, Foley J, Dori-Hacohen S, Allan J (2016) Probabilistic approaches to controversy detection. In: Proceedings of the 25th ACM international conference on information and knowledge management, CIKM 2016, Indianapolis, IN, USA, October 24–28, 2016, pp 2069–2072
Johannes K, Potthast M, Hagen M, Stein B (2017) Spatio-temporal analysis of reverted wikipedia edits. In: Proceedings of the eleventh international conference on web and social media, ICWSM 2017, Montréal, Québec, Canada, May 15–18, 2017, pp 122–131
Kittur A, Suh B, Pendleton BA, Chi EH (2007) He says, she says: conflict and coordination in wikipedia. In: Proceedings of the 2007 conference on human factors in computing systems, CHI 2007, San Jose, California, USA, April 28–May 3, 2007, pp 453–462
Kumar S, Spezzano F, Subrahmanian VS (2015) VEWS: a wikipedia vandal early warning system. In: Proceedings of the 21th ACM SIGKDD international conference on knowledge discovery and data mining, Sydney, NSW, Australia, August 10–13, 2015, pp 607–616
Kumar S, West R, Leskovec J (2016) Disinformation on the web: impact, characteristics, and detection of wikipedia hoaxes. In: Proceedings of the 25th international conference on world wide web, WWW 2016, Montreal, Canada, April 11–15, 2016, pp 591–602
McDonald DW, Javanmardi S, Zachry M (2011) Finding patterns in behavioral observations by automatically labeling forms of wikiwork in barnstars. In: Proceedings of the 7th international symposium on wikis and open collaboration, 2011, Mountain View, CA, USA, October 3–5, 2011, pp 15–24
Potthast M, Stein B, Gerling R (2008) Automatic vandalism detection in wikipedia. In: Proceedings advances in information retrieval, 30th European Conference on IR Research, ECIR 2008, Glasgow, UK, March 30–April 3, 2008, pp 663–668
Potthast M, Stein B, Holfeld T (2010) Overview of the 1st international competition on wikipedia vandalism detection. In: CLEF 2010 LABs and Workshops, Notebook Papers, 22–23 September 2010, Padua, Italy
Rad HS, Barbosa D (2012) Identifying controversial articles in wikipedia: a comparative study. In: Proceedings of the eighth annual international symposium on wikis and open collaboration, WikiSym 2012, Austria, August 27–29, 2012
Roitman H, Hummel S, Rabinovich E, Sznajder B, Slonim N, Aharoni E (2016) On the retrieval of wikipedia articles containing claims on controversial topics. In: Proceedings of the 25th International Conference on World Wide Web, WWW 2016, Montreal, Canada, April 11–15, 2016, Companion Volume, pp 991–996
Singer P, Lemmerich F, West R, Zia L, Wulczyn E, Strohmaier M, Leskovec J (2017) Why we read wikipedia. In: Proceedings of the 26th International Conference on World Wide Web, WWW 2017, Perth, Australia, April 3–7, 2017, pp 1591–1600
Solorio T, Hasan R, Mizan M (2013) A case study of sockpuppet detection in wikipedia. In: Proceedings of the workshop on language analysis in social media. Association for Computational Linguistics, Atlanta, Georgia, pp 59–68. http://aclweb.org/anthology/W13-1107
Suyehira K, Spezzano F (2016) Depp: a system for detecting pages to protect in wikipedia. In: Proceedings of the 25th ACM international conference on information and knowledge management, CIKM 2016, Indianapolis, IN, USA, October 24–28, 2016, pp 2081–2084
Tran KND (2015) Detecting vandalism on wikipedia across multiple languages
Viégas FB, Wattenberg M, McKeon MM (2007) The hidden order of wikipedia. In: International conference on online communities and social computing, Second international conference, OCSC 2007, held as part of HCI international 2007, 22–27 July 2007. Springer, Beijing, China, pp 445–454
West AG, Kannan S, Lee I (2010) Detecting wikipedia vandalism via spatio-temporal analysis of revision metadata? In: Proceedings of the third European workshop on system security, EUROSEC 2010, Paris, France. ACM, New York, pp 22–28. https://doi.org/10.1145/1752046.1752050
West AG, Agrawal A, Baker P, Exline B, Lee I (2011a) Autonomous link spam detection in purely collaborative environments. In: Proceedings of the 7th international symposium on wikis and open collaboration, 2011, Mountain View, CA, USA, October 3–5, 2011, pp 91–100
West AG, Chang J, Venkatasubramanian KK, Sokolsky O, Lee I (2011b) Link spamming wikipedia for profit. In: The 8th annual collaboration, electronic messaging, anti-abuse and spam conference, CEAS 2011, Perth, Australia, Proceedings, September 1–2, 2011, pp 152–161
Yasseri T, Sumi R, Rung A, Kornai A, Kertész J (2012) Dynamics of conflicts in wikipedia. PLOS One 7(6):1–12
Metadata
Title
Detecting pages to protect in Wikipedia across multiple languages
Authors
Francesca Spezzano
Kelsey Suyehira
Laxmi Amulya Gundala
Publication date
1 December 2019
Publisher
Springer Vienna
Published in
Social Network Analysis and Mining, Issue 1/2019
Print ISSN: 1869-5450
Electronic ISSN: 1869-5469
DOI
https://doi.org/10.1007/s13278-019-0555-0
