
A confidence-aware approach for truth discovery on long-tail data

Authors:
Qi Li (SUNY Buffalo, Buffalo, NY)
Yaliang Li (SUNY Buffalo, Buffalo, NY)
Jing Gao (SUNY Buffalo, Buffalo, NY)
Lu Su (SUNY Buffalo, Buffalo, NY)
Bo Zhao (Microsoft Research, Mountain View, CA)
Murat Demirbas (SUNY Buffalo, Buffalo, NY)
Wei Fan (Huawei Noah's Ark Lab, Hong Kong)
Jiawei Han (University of Illinois, Urbana, IL)

Proceedings of the VLDB Endowment, Volume 8, Issue 4, pp 425–436
https://doi.org/10.14778/2735496.2735505

Published: 01 December 2014


Abstract

In many real-world applications, the same item may be described by multiple sources. Conflicts among these sources are therefore inevitable, which leads to an important task: identifying which piece of information is trustworthy, i.e., the truth discovery task. Intuitively, a piece of information that comes from a reliable source is more trustworthy, and a source that provides trustworthy information is more reliable. Based on this principle, truth discovery approaches have been proposed to infer source reliability degrees and the most trustworthy information (i.e., the truth) simultaneously. However, existing approaches overlook the ubiquitous long-tail phenomenon in these tasks: most sources provide only a few claims and only a few sources make many claims, which causes the source reliability estimation for small sources to be unreasonable. To tackle this challenge, we propose a confidence-aware truth discovery (CATD) method that automatically detects truths from conflicting data exhibiting the long-tail phenomenon. The proposed method not only estimates source reliability but also considers the confidence interval of the estimate, so that it can effectively reflect real source reliability for sources with various levels of participation. Experiments on four real-world tasks as well as simulated multi-source long-tail datasets demonstrate that the proposed method outperforms existing state-of-the-art truth discovery approaches by successfully discounting the effect of small sources.
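The idea of weighting a source by a confidence bound on its reliability, rather than by a point estimate, can be illustrated with a small sketch. This is not the algorithm as published; it is a minimal illustration under assumptions that do not appear in the abstract: claims are continuous values, each source's errors are Gaussian, and a source's weight is the inverse of the upper bound of a (1 - alpha) chi-squared confidence interval on its error variance. The function names and the parameters `alpha`, `eps`, and `iterations` are hypothetical.

```python
import math
from statistics import mean


def chi2_cdf(x, k):
    """CDF of a chi-squared distribution with k degrees of freedom,
    via the series for the regularized lower incomplete gamma function."""
    a, h = k / 2.0, x / 2.0
    if h <= 0.0:
        return 0.0
    term = total = 1.0 / a
    n = 1
    while term > 1e-16 * total:
        term *= h / (a + n)
        total += term
        n += 1
    return total * math.exp(-h + a * math.log(h) - math.lgamma(a))


def chi2_quantile(p, k):
    """Invert chi2_cdf by bisection (adequate for the small p used here)."""
    lo, hi = 0.0, k + 10.0 * math.sqrt(2.0 * k) + 10.0
    for _ in range(200):
        mid = (lo + hi) / 2.0
        if chi2_cdf(mid, k) < p:
            lo = mid
        else:
            hi = mid
    return (lo + hi) / 2.0


def source_weights(claims, truths, alpha=0.05, eps=1e-9):
    """Assumed rule: weight = 1 / (upper confidence bound on error variance).

    With n claims and squared-error sum sse, that upper bound is
    sse / chi2_quantile(alpha / 2, n), so few claims give a loose bound
    and therefore a small weight."""
    weights = {}
    for source, values in claims.items():
        sse = sum((v - truths[item]) ** 2 for item, v in values.items())
        weights[source] = chi2_quantile(alpha / 2.0, len(values)) / (sse + eps)
    return weights


def discover_truths(claims, alpha=0.05, iterations=20):
    """Alternate between weighted-average truths and source weights."""
    items = {item for values in claims.values() for item in values}
    # Start from the unweighted mean of the claims on each item.
    truths = {i: mean(v[i] for v in claims.values() if i in v) for i in items}
    for _ in range(iterations):
        w = source_weights(claims, truths, alpha)
        truths = {
            i: sum(w[s] * v[i] for s, v in claims.items() if i in v)
            / sum(w[s] for s, v in claims.items() if i in v)
            for i in items
        }
    return truths, source_weights(claims, truths, alpha)
```

In this sketch, a source with ten claims and a source with one claim that have the same per-claim squared error receive very different weights, because the single-claim source's variance bound is much looser. A plain inverse-mean-squared-error weighting would treat the two sources identically.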


Index Terms

1. Information systems
  1. Information systems applications
    1. Data mining

Index terms have been assigned to the content through auto-classification.


Published in

Proceedings of the VLDB Endowment, Volume 8, Issue 4
December 2014
132 pages
ISSN: 2150-8097
Publisher: VLDB Endowment
Publication History: Published 1 December 2014
Qualifiers: research-article

Article Metrics
- Total Citations: 113
- Total Downloads: 894
- Downloads (Last 12 months): 103
- Downloads (Last 6 weeks): 16
