Published in: Social Network Analysis and Mining 1/2018

1 December 2018 | Original Article

Learning from noisy label proportions for classifying online social data

Authors: Ehsan Mohammady Ardehaly, Aron Culotta

Abstract

Inferring latent attributes (e.g., demographics) of social media users is important for improving the accuracy and validity of social media analysis methods. While most existing approaches use either heuristics or supervised classification, recent work has shown that accurate classification models can be trained using supervision from population statistics. These learning with label proportions (LLP) models are fit on bags of instances and then applied to individual accounts. However, it is well known that the users of many social media sites, such as Twitter, are not a representative sample of the population; thus, the label proportions contain many sources of noise (e.g., sampling bias), which can in turn degrade the quality of the resulting model. In this paper, we investigate classification algorithms that use population-level statistical constraints, such as demographics, names, and social network followers, to fit classifiers that predict individual user attributes. We propose LLP methods that explicitly model the noise inherent in these label proportions. On several real and synthetic datasets, we find that combining these enhancements can significantly reduce averaged classification error by 7%, resulting in methods that are robust to noise in the provided label proportions.
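
To make the bag-level training idea in the abstract concrete, here is a minimal sketch in Python of a generic LLP setup. It is not the authors' method and does not model noise in the proportions. It assumes a binary logistic regression trained by gradient descent, a cross-entropy penalty between each bag's given proportion and the model's mean predicted probability for that bag, and hypothetical synthetic data (two Gaussian clusters). The names fit_llp and make_bag, and all hyperparameters, are illustrative assumptions.

```python
import numpy as np


def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))


def fit_llp(bags, proportions, l2=0.01, lr=0.5, epochs=300):
    """Fit a logistic regression using only bag-level positive-class proportions.

    Illustrative assumption: the loss for each bag is the cross-entropy between
    the given proportion p and the mean predicted probability q of the bag.
    """
    w = np.zeros(bags[0].shape[1])
    b = 0.0
    for _ in range(epochs):
        grad_w = l2 * w
        grad_b = 0.0
        for X, p in zip(bags, proportions):
            probs = sigmoid(X @ w + b)
            q = np.clip(probs.mean(), 1e-6, 1 - 1e-6)
            # Derivative of -(p log q + (1 - p) log(1 - q)) with respect to q.
            dq = (q - p) / (q * (1.0 - q))
            # Chain rule through the bag mean and the sigmoid.
            dz = dq * probs * (1.0 - probs) / len(X)
            grad_w += X.T @ dz
            grad_b += dz.sum()
        w -= lr * grad_w / len(bags)
        b -= lr * grad_b / len(bags)
    return w, b


def make_bag(rng, n, pos_fraction):
    """Hypothetical synthetic data: two Gaussian clusters in 2-D."""
    y = (rng.random(n) < pos_fraction).astype(int)
    X = rng.normal(size=(n, 2)) + np.where(y[:, None] == 1, 1.5, -1.5)
    return X, y


rng = np.random.default_rng(0)
data = [make_bag(rng, 200, f) for f in (0.1, 0.3, 0.7, 0.9)]
bags = [X for X, _ in data]
# Only these bag-level statistics are visible to the learner, never the labels.
proportions = [y.mean() for _, y in data]

w, b = fit_llp(bags, proportions)

# The fitted model is then applied to individual instances.
X_test, y_test = make_bag(rng, 1000, 0.5)
pred = (sigmoid(X_test @ w + b) > 0.5).astype(int)
accuracy = (pred == y_test).mean()
print(f"accuracy on individual instances: {accuracy:.3f}")
```

In this sketch the proportions are computed from the true labels of each bag, so they are noise-free. Perturbing them before calling fit_llp would be one way to see how sensitive such a baseline is to the kind of noise the paper addresses.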

Metadata
Title
Learning from noisy label proportions for classifying online social data
Authors
Ehsan Mohammady Ardehaly
Aron Culotta
Publication date
1 December 2018
Publisher
Springer Vienna
Published in
Social Network Analysis and Mining / Issue 1/2018
Print ISSN: 1869-5450
Electronic ISSN: 1869-5469
DOI
https://doi.org/10.1007/s13278-017-0478-6
