research-article

DOCS: a domain-aware crowdsourcing system using knowledge bases

Published: 01 November 2016
Abstract

Crowdsourcing is a new computing paradigm that harnesses human effort to solve computer-hard problems, such as entity resolution and photo tagging. Workers in the crowd vary in quality, so it is important to model each worker's quality effectively. Most existing worker models assume that a worker has the same quality on all tasks. In practice, however, tasks belong to a variety of domains, and a worker's quality differs from one domain to another. For example, a worker who is a basketball fan should have higher quality on a task of labeling a photo related to 'Stephen Curry' than on one related to 'Leonardo DiCaprio'. In this paper, we study how to leverage domain knowledge to accurately model a worker's quality. We examine using knowledge bases (KBs), e.g., Wikipedia and Freebase, to detect the domains of tasks and workers. We develop Domain Vector Estimation, which analyzes the domains of a task with respect to the KB. We also study Truth Inference, which uses the domain-sensitive worker model to accurately infer the true answer of a task. We design an Online Task Assignment algorithm, which judiciously and efficiently assigns tasks to appropriate workers. To implement these solutions, we have built DOCS, a system deployed on Amazon Mechanical Turk. Experiments show that DOCS outperforms state-of-the-art approaches.
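
The idea of a domain-sensitive worker model can be illustrated with a small sketch. The code below is not the DOCS algorithm; it is a simplified illustration under stated assumptions: each task carries a domain vector (a probability distribution over domains), each worker has one accuracy value per domain, a worker's expected accuracy on a task is the domain-weighted mean of those accuracies, and answers are aggregated by log-odds weighted voting. The function names, the two domains, and all numbers are hypothetical.

```python
import math
from collections import defaultdict


def task_quality(domain_vector, worker_quality):
    """Expected accuracy of a worker on a task (assumed model):
    the domain-weighted mean of the worker's per-domain accuracies."""
    return sum(r * q for r, q in zip(domain_vector, worker_quality))


def infer_truth(domain_vector, answers, worker_qualities, num_choices=2):
    """Return the choice with the highest log-odds weighted vote.

    Assumes a worker answers correctly with probability p and otherwise
    picks uniformly among the remaining choices.
    """
    scores = defaultdict(float)
    for worker, choice in answers.items():
        p = task_quality(domain_vector, worker_qualities[worker])
        p = min(max(p, 1e-6), 1 - 1e-6)  # keep the logarithm finite
        scores[choice] += math.log(p * (num_choices - 1) / (1 - p))
    return max(scores, key=scores.get)


# Hypothetical data: domains are [sports, film].
qualities = {
    "fan": [0.95, 0.55],  # strong on sports, weak on film
    "w2": [0.55, 0.90],   # weak on sports, strong on film
    "w3": [0.60, 0.60],   # mediocre on both
}
answers = {"fan": "A", "w2": "B", "w3": "B"}

sports_task = [0.9, 0.1]  # e.g. a photo related to 'Stephen Curry'
film_task = [0.1, 0.9]    # e.g. a photo related to 'Leonardo DiCaprio'

print(infer_truth(sports_task, answers, qualities))
print(infer_truth(film_task, answers, qualities))
```

With these made-up numbers, the sports task is decided by the single basketball fan even though two workers disagree, while the same answers on the film task go the other way. A worker model that assigns one quality per worker cannot make this distinction.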

Published in

Proceedings of the VLDB Endowment, Volume 10, Issue 4
November 2016
180 pages
ISSN: 2150-8097

    Publisher

    VLDB Endowment

