Published in: Knowledge and Information Systems 3/2017

19-04-2017 | Regular Paper

Crowd labeling latent Dirichlet allocation

Authors: Luca Pion-Tonachini, Scott Makeig, Ken Kreutz-Delgado


Abstract

Large, unlabeled datasets are abundant nowadays, but getting labels for those datasets can be expensive and time-consuming. Crowd labeling is a crowdsourcing approach for gathering such labels from workers whose suggestions are not always accurate. While a variety of algorithms exist for this purpose, we present crowd labeling latent Dirichlet allocation (CL-LDA), a generalization of latent Dirichlet allocation that can solve a more general set of crowd labeling problems. We show that it performs as well as other methods and at times better on a variety of simulated and actual datasets while treating each label as compositional rather than indicating a discrete class. In addition, prior knowledge of workers’ abilities can be incorporated into the model through a structured Bayesian framework. We then apply CL-LDA to the EEG independent component labeling dataset, using its generalizations to further explore the utility of the algorithm. We discuss prospects for creating classifiers from the generated labels.


Metadata

Title: Crowd labeling latent Dirichlet allocation
Authors: Luca Pion-Tonachini, Scott Makeig, Ken Kreutz-Delgado
Publication date: 19-04-2017
Publisher: Springer London
Published in: Knowledge and Information Systems | Issue 3/2017
Print ISSN: 0219-1377
Electronic ISSN: 0219-3116
DOI: https://doi.org/10.1007/s10115-017-1053-1
