
Published in: Data Mining and Knowledge Discovery | 01-09-2014

Approximating the crowd

Authors: Şeyda Ertekin, Cynthia Rudin, Haym Hirsh

Published in: Data Mining and Knowledge Discovery | Issue 5-6/2014


Abstract

The problem of “approximating the crowd” is that of estimating the crowd’s majority opinion by querying only a subset of it. Algorithms that approximate the crowd can intelligently stretch a limited budget for a crowdsourcing task. We present an algorithm, “CrowdSense,” that works in an online fashion where items come one at a time. CrowdSense dynamically samples subsets of the crowd based on an exploration/exploitation criterion. The algorithm produces a weighted combination of the subset’s votes that approximates the crowd’s opinion. We then introduce two variations of CrowdSense that make various distributional approximations to handle distinct crowd characteristics. In particular, the first algorithm makes a statistical independence approximation of the labelers for large crowds, whereas the second algorithm finds a lower bound on how often the current subcrowd agrees with the crowd’s majority vote. Our experiments on CrowdSense and several baselines demonstrate that we can reliably approximate the entire crowd’s vote by collecting opinions from a representative subset of the crowd.
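The idea in the abstract can be illustrated with a minimal sketch. This is not the published CrowdSense algorithm: the class name `CrowdApproximator`, the smoothed agreement-rate quality estimate, the seed of two top-rated labelers plus one random labeler, and the `smoothing` and `epsilon` parameters are all assumptions made for illustration. The sketch processes items one at a time, buys a few votes, forms a quality-weighted vote, and buys further votes only while the decision still looks uncertain.

```python
import random


class CrowdApproximator:
    """Toy online approximator of a crowd's majority vote (illustrative only)."""

    def __init__(self, n_labelers, smoothing=1.0, epsilon=0.1, seed=0):
        self.n = n_labelers
        self.k = smoothing              # hypothetical smoothing constant
        self.eps = epsilon              # hypothetical confidence threshold
        self.agree = [0] * n_labelers   # times a labeler agreed with our decision
        self.asked = [0] * n_labelers   # times a labeler was queried
        self.rng = random.Random(seed)

    def quality(self, i):
        # Smoothed agreement rate; 0.5 for a labeler that was never queried.
        return (self.agree[i] + self.k) / (self.asked[i] + 2 * self.k)

    def label(self, ask):
        """ask(i) returns labeler i's vote (+1 or -1) on the current item."""
        ranked = sorted(range(self.n), key=self.quality, reverse=True)
        # Exploitation: the two best-rated labelers. Exploration: one random other.
        chosen, rest = ranked[:2], ranked[2:]
        if rest:
            chosen.append(rest.pop(self.rng.randrange(len(rest))))
        votes = {i: ask(i) for i in chosen}
        score = sum(self.quality(i) * v for i, v in votes.items())
        # Buy more votes while the next-best labeler could still sway the result.
        for i in rest:
            if (abs(score) - self.quality(i)) / (len(votes) + 1) >= self.eps:
                break
            votes[i] = ask(i)
            score += self.quality(i) * votes[i]
        decision = 1 if score >= 0 else -1
        # Update quality estimates against our own decision, since the full
        # crowd's majority vote is never observed in this setting.
        for i, v in votes.items():
            self.asked[i] += 1
            self.agree[i] += (v == decision)
        return decision, len(votes)


if __name__ == "__main__":
    # Simulated crowd: each labeler reports the true label with its own accuracy.
    rng = random.Random(1)
    accuracy = [0.9, 0.85, 0.8, 0.7, 0.6, 0.55, 0.5]
    model = CrowdApproximator(len(accuracy))
    matches = bought = 0
    n_items = 500
    for _ in range(n_items):
        truth = rng.choice([-1, 1])
        crowd = [truth if rng.random() < a else -truth for a in accuracy]
        majority = 1 if sum(crowd) >= 0 else -1
        decision, cost = model.label(lambda i: crowd[i])
        matches += (decision == majority)
        bought += cost
    print("agreement with full-crowd majority:", matches / n_items)
    print("average votes bought per item:", bought / n_items)
```

The demo compares each decision with the majority vote of the full simulated crowd and reports how many votes were bought on average, which is the budget trade-off the abstract describes.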


http://vizwiz.org

http://iqengines.com/

http://github.com/CrowdSense/SupplementaryMaterial

http://www.mturk.com

http://www.netflixprize.com

Datasets are available at http://github.com/CrowdSense/Datasets

http://github.com/ipeirotis/Get-Another-Label/tree/master/data/HITspam-UsingMTurk


Title: Approximating the crowd
Authors: Şeyda Ertekin, Cynthia Rudin, Haym Hirsh
Publication date: 01-09-2014
Publisher: Springer US
Published in: Data Mining and Knowledge Discovery / Issue 5-6/2014
Print ISSN: 1384-5810
Electronic ISSN: 1573-756X
DOI: https://doi.org/10.1007/s10618-014-0354-1
