FaitCrowd: Fine Grained Truth Discovery for Crowdsourced Data Aggregation

Authors:
Fenglong Ma (SUNY Buffalo, Buffalo, NY, USA)
Yaliang Li (SUNY Buffalo, Buffalo, USA)
Qi Li (SUNY Buffalo, Buffalo, USA)
Minghui Qiu (Singapore Management University, Singapore, Singapore)
Jing Gao (SUNY Buffalo, Buffalo, USA)
Shi Zhi (University of Illinois Urbana-Champaign, Urbana-Champaign, USA)
Lu Su (SUNY Buffalo, Buffalo, USA)
Bo Zhao (LinkedIn, Mountain View, USA)
Heng Ji (Rensselaer Polytechnic Institute, Troy, USA)
Jiawei Han (University of Illinois Urbana-Champaign, Urbana-Champaign, USA)

KDD '15: Proceedings of the 21th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, August 2015, Pages 745–754, https://doi.org/10.1145/2783258.2783314

Published: 10 August 2015

ABSTRACT

In crowdsourced data aggregation tasks, the answers that large numbers of sources provide to the same set of questions often conflict. The central challenge is to estimate source reliability and select the answers provided by high-quality sources. Existing work addresses this problem by simultaneously estimating sources' reliability and inferring questions' true answers (i.e., the truths). However, these methods assume that a source is equally reliable on all questions, ignoring the fact that a source's reliability may vary significantly across topics. To capture differing expertise levels on different topics, we propose FaitCrowd, a fine-grained truth discovery model for aggregating conflicting data collected from multiple users/sources. FaitCrowd jointly models the generation of question content and of sources' answers in a single probabilistic model, estimating topical expertise and true answers simultaneously. This yields a more precise estimate of source reliability, and FaitCrowd therefore recovers true answers more accurately than existing approaches. Experimental results on two real-world datasets show that FaitCrowd significantly reduces the aggregation error rate compared with state-of-the-art multi-source aggregation approaches, owing to its ability to learn topical expertise from question content and collected answers.
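
The idea of topic-dependent reliability can be illustrated with a toy sketch. This is not the FaitCrowd model: FaitCrowd is described as a probabilistic model that infers topics from question content, whereas the sketch below assumes each question's topic is already known and simply alternates between weighted voting and re-estimating a per-(source, topic) accuracy. The function name `topic_aware_vote`, its parameters, and the sample data are all hypothetical.

```python
from collections import defaultdict


def topic_aware_vote(answers, topic_of, n_iter=10, smoothing=1.0):
    """Toy topic-aware weighted voting (illustrative only, not FaitCrowd itself).

    answers:  list of (source, question, answer) triples
    topic_of: dict mapping question -> topic (assumed known here)
    """
    # Every (source, topic) pair starts with the same neutral expertise.
    expertise = defaultdict(lambda: 0.5)
    truths = {}
    for _ in range(n_iter):
        # Step 1: score each candidate answer by the summed topical
        # expertise of the sources that gave it, and keep the best one.
        scores = defaultdict(lambda: defaultdict(float))
        for source, question, answer in answers:
            scores[question][answer] += expertise[(source, topic_of[question])]
        truths = {
            q: max(sorted(cands), key=cands.get)  # sorted() makes ties deterministic
            for q, cands in scores.items()
        }
        # Step 2: re-estimate expertise as the smoothed fraction of a
        # source's answers on a topic that match the current truths.
        correct = defaultdict(float)
        total = defaultdict(float)
        for source, question, answer in answers:
            key = (source, topic_of[question])
            total[key] += 1.0
            correct[key] += float(answer == truths[question])
        expertise = defaultdict(
            lambda: 0.5,
            {k: (correct[k] + smoothing) / (total[k] + 2.0 * smoothing) for k in total},
        )
    return truths, dict(expertise)


# Hypothetical data: "yes" is the intended answer to every question.
# Source "a" is right on sports and wrong on math, "b" is the reverse,
# and "c" answers only four questions, all correctly.
answers = [
    ("a", "q1", "yes"), ("b", "q1", "no"), ("c", "q1", "yes"),
    ("a", "q2", "yes"), ("b", "q2", "no"), ("c", "q2", "yes"),
    ("a", "q3", "yes"), ("b", "q3", "no"),
    ("a", "q4", "no"), ("b", "q4", "yes"), ("c", "q4", "yes"),
    ("a", "q5", "no"), ("b", "q5", "yes"), ("c", "q5", "yes"),
    ("a", "q6", "no"), ("b", "q6", "yes"),
]
topic_of = {"q1": "sports", "q2": "sports", "q3": "sports",
            "q4": "math", "q5": "math", "q6": "math"}

truths, expertise = topic_aware_vote(answers, topic_of)
```

In this sample, only sources "a" and "b" answer q3 and q6, and they disagree. Each is right on exactly half of the questions, so a single per-source reliability score would rate them about equally; tracking expertise per topic is what lets the sketch side with "a" on the sports question and with "b" on the math question.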


Index Terms

1. Information systems
  1. Information systems applications
    1. Data mining


Published in
KDD '15: Proceedings of the 21th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining
August 2015
2378 pages
ISBN: 9781450336642
DOI: 10.1145/2783258
General Chairs: Longbing Cao (University of Technology, Sydney), Chengqi Zhang (University of Technology, Sydney)
Program Chairs: Thorsten Joachims (Cornell University), Geoff Webb (Monash University), Dragos D. Margineantu (Boeing Research), Graham Williams (Australian Taxation Office)
Copyright © 2015 ACM
Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. Copyrights for components of this work owned by others than ACM must be honored. Abstracting with credit is permitted. To copy otherwise, or republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee. Request permissions from [email protected]
Publisher
Association for Computing Machinery
New York, NY, United States
Author Tags
crowdsourcing
source reliability
truth discovery
Qualifiers
- research-article
Acceptance Rates
KDD '15 Paper Acceptance Rate: 160 of 819 submissions, 20%
Overall Acceptance Rate: 1,133 of 8,635 submissions, 13%