research-article

DOCS: a domain-aware crowdsourcing system using knowledge bases

Authors:
Yudian Zheng (The University of Hong Kong), Guoliang Li (Tsinghua University), Reynold Cheng (The University of Hong Kong)

Proceedings of the VLDB Endowment, Volume 10, Issue 4, pp 361–372. https://doi.org/10.14778/3025111.3025118

Published: 01 November 2016

Abstract

Crowdsourcing is a new computing paradigm that harnesses human effort to solve computer-hard problems, such as entity resolution and photo tagging. Workers in the crowd vary in quality, so it is important to model each worker's quality effectively. Most existing worker models assume that a worker has the same quality on every task. In practice, however, tasks belong to a variety of domains, and a worker's quality differs from one domain to another. For example, a worker who is a basketball fan should label a photo related to 'Stephen Curry' more reliably than one related to 'Leonardo DiCaprio'. In this paper, we study how to leverage domain knowledge to model a worker's quality accurately. We examine how a knowledge base (KB), e.g., Wikipedia or Freebase, can be used to detect the domains of tasks and workers. We develop Domain Vector Estimation, which analyzes the domains of a task with respect to the KB. We also study Truth Inference, which uses the domain-sensitive worker model to infer the true answer of a task accurately. We design an Online Task Assignment algorithm, which judiciously and efficiently assigns tasks to appropriate workers. To implement these solutions, we have built DOCS, a system deployed on Amazon Mechanical Turk. Experiments show that DOCS performs much better than state-of-the-art approaches.
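To make the idea of a domain-sensitive worker model concrete, the following is a minimal sketch and not the algorithm used in DOCS. It assumes that each task carries a domain vector (a probability distribution over domains), that each worker has one accuracy value per domain, and that a worker who answers wrongly picks uniformly among the other choices. The function names, the two domains, and all numbers are hypothetical and chosen only for illustration.

```python
def expected_accuracy(task_domains, worker_quality):
    """Worker accuracy on a task: per-domain accuracy weighted by the task's domain vector."""
    return sum(r * q for r, q in zip(task_domains, worker_quality))


def infer_truth(task_domains, answers, worker_qualities, num_choices):
    """Posterior over choices under a uniform prior (illustrative model, not DOCS's).

    answers maps worker id -> chosen index; worker_qualities maps
    worker id -> list of per-domain accuracies.
    """
    scores = []
    for truth in range(num_choices):
        likelihood = 1.0
        for worker, answer in answers.items():
            acc = expected_accuracy(task_domains, worker_qualities[worker])
            # Assumption: a wrong answer is spread evenly over the other choices.
            likelihood *= acc if answer == truth else (1.0 - acc) / (num_choices - 1)
        scores.append(likelihood)
    total = sum(scores)
    return [s / total for s in scores]


# Hypothetical domains: index 0 = sports, index 1 = films.
task_domains = [0.9, 0.1]  # a photo task that is mostly about basketball
worker_qualities = {
    "fan": [0.95, 0.5],    # strong on sports, guessing on films
    "buff_a": [0.5, 0.9],  # strong on films, guessing on sports
    "buff_b": [0.5, 0.9],
}
answers = {"fan": 0, "buff_a": 1, "buff_b": 1}

posterior = infer_truth(task_domains, answers, worker_qualities, num_choices=2)
best_choice = max(range(len(posterior)), key=posterior.__getitem__)
```

In this example the basketball fan is outvoted two to one, yet choice 0 ends up with a posterior of roughly 0.87, because the fan's expected accuracy on this task (0.905) is much higher than that of the two film experts (0.54). A domain-agnostic majority vote would have picked choice 1.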


Published in
Proceedings of the VLDB Endowment, Volume 10, Issue 4
November 2016
180 pages
ISSN: 2150-8097
Editors: Peter Boncz (CWI), Ken Salem (University of Waterloo)
Publisher: VLDB Endowment