Published in: Knowledge and Information Systems 3/2017

19.04.2017 | Regular Paper

Crowd labeling latent Dirichlet allocation

Authors: Luca Pion-Tonachini, Scott Makeig, Ken Kreutz-Delgado

Abstract

Large, unlabeled datasets are abundant nowadays, but getting labels for those datasets can be expensive and time-consuming. Crowd labeling is a crowdsourcing approach for gathering such labels from workers whose suggestions are not always accurate. While a variety of algorithms exist for this purpose, we present crowd labeling latent Dirichlet allocation (CL-LDA), a generalization of latent Dirichlet allocation that can solve a more general set of crowd labeling problems. We show that it performs as well as other methods and at times better on a variety of simulated and actual datasets while treating each label as compositional rather than indicating a discrete class. In addition, prior knowledge of workers’ abilities can be incorporated into the model through a structured Bayesian framework. We then apply CL-LDA to the EEG independent component labeling dataset, using its generalizations to further explore the utility of the algorithm. We discuss prospects for creating classifiers from the generated labels.
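
To make the idea of a compositional label concrete, the following is a minimal sketch of a simple vote-share baseline. It is not CL-LDA and is not taken from the paper: it ignores worker ability entirely and only shows how several noisy worker suggestions for one item can be summarized as a vector of class proportions rather than a single discrete class. The function name `vote_share_labels`, the smoothing pseudo-count `alpha`, the item ids, and the class names are all hypothetical and chosen for illustration.

```python
from collections import Counter


def vote_share_labels(votes, classes, alpha=1.0):
    """Turn each item's worker votes into a smoothed class-proportion vector.

    votes:   dict mapping item id -> list of class names suggested by workers
    classes: list of all class names
    alpha:   symmetric pseudo-count added to every class (an assumed prior,
             not a value from the paper)
    """
    labels = {}
    for item, item_votes in votes.items():
        counts = Counter(item_votes)
        # Normalizer: observed votes plus one pseudo-count block per class.
        total = len(item_votes) + alpha * len(classes)
        labels[item] = {c: (counts[c] + alpha) / total for c in classes}
    return labels


# Hypothetical toy data: three workers each label two items.
votes = {
    "ic_1": ["brain", "brain", "eye"],
    "ic_2": ["muscle", "eye", "eye"],
}
classes = ["brain", "eye", "muscle"]
labels = vote_share_labels(votes, classes)
```

Each entry of `labels` is a set of proportions that sums to one. A model that also accounts for worker ability, as the abstract describes, would weight the votes instead of counting them equally.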

Metadata
Title
Crowd labeling latent Dirichlet allocation
Authors
Luca Pion-Tonachini
Scott Makeig
Ken Kreutz-Delgado
Publication date
19.04.2017
Publisher
Springer London
Published in
Knowledge and Information Systems / Issue 3/2017
Print ISSN: 0219-1377
Electronic ISSN: 0219-3116
DOI
https://doi.org/10.1007/s10115-017-1053-1
