A Truth Discovery Approach with Theoretical Guarantee

Authors:
Houping Xiao, The State University of New York at Buffalo, Buffalo, NY, USA
Jing Gao, The State University of New York at Buffalo, Buffalo, NY, USA
Zhaoran Wang, Princeton University, Princeton, NJ, USA
Shiyu Wang, University of Illinois at Urbana-Champaign, Champaign, IL, USA
Lu Su, The State University of New York at Buffalo, Buffalo, NY, USA
Han Liu, Princeton University, Princeton, NJ, USA

KDD '16: Proceedings of the 22nd ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, August 2016, Pages 1925–1934, https://doi.org/10.1145/2939672.2939816

Published: 13 August 2016

ABSTRACT

In the information age, people can easily collect information about the same set of entities from multiple sources, among which conflicts are inevitable. This leads to an important task, truth discovery, i.e., identifying true facts (truths) by iteratively updating truths and source reliability. However, convergence to the truths has never been discussed in existing work, and thus there is no theoretical guarantee on the results of these truth discovery approaches. In contrast, in this paper we propose a truth discovery approach with a theoretical guarantee. We propose a randomized Gaussian mixture model (RGMM) to represent multi-source data, where truths are model parameters. We incorporate source bias, which captures each source's degree of reliability, into the RGMM formulation. The truth discovery task is then modeled as seeking the maximum likelihood estimate (MLE) of the truths. Based on expectation-maximization (EM) techniques, we propose population-based (i.e., in the limit of infinite data) and sample-based (i.e., on a finite set of samples) solutions for the MLE. Theoretically, we prove that both solutions are contractive to an ε-ball around the MLE, under certain conditions. Experimentally, we evaluate our method on both simulated and real-world datasets. Experimental results show that our method achieves high accuracy in identifying truths with a convergence guarantee.
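
To make the phrase "iteratively updating truths and source reliability" concrete, here is a minimal Python sketch of a generic iterative truth discovery loop for numeric claims. It is not the RGMM or the EM-based solutions proposed in the paper, whose details are not given in the abstract. The function name `discover_truths`, the `claims` layout, the median initialization, and the log-ratio weighting are all assumptions chosen for illustration.

```python
import math
from statistics import median


def discover_truths(claims, n_iters=20, eps=1e-12):
    """Generic iterative truth discovery (illustrative, not the paper's RGMM).

    claims[s][n] is the numeric value that source s reports for entity n.
    Returns (truths, weights): one estimate per entity, one weight per source.
    """
    n_sources = len(claims)
    n_entities = len(claims[0])

    # Assumed initialization: per-entity median across sources.
    truths = [median(claims[s][n] for s in range(n_sources))
              for n in range(n_entities)]
    weights = [1.0] * n_sources

    for _ in range(n_iters):
        # Reliability step: a source that deviates less from the current
        # truth estimates gets a larger weight (assumed log-ratio weighting).
        losses = [sum((claims[s][n] - truths[n]) ** 2
                      for n in range(n_entities)) + eps
                  for s in range(n_sources)]
        total_loss = sum(losses)
        weights = [max(-math.log(loss / total_loss), eps) for loss in losses]

        # Truth step: each truth becomes the weighted average of the claims.
        weight_sum = sum(weights)
        truths = [sum(weights[s] * claims[s][n] for s in range(n_sources))
                  / weight_sum
                  for n in range(n_entities)]

    return truths, weights


if __name__ == "__main__":
    # Two fairly accurate sources and one unreliable source (made-up values).
    example_claims = [
        [10.1, 19.9, 30.1],
        [9.9, 20.1, 29.9],
        [14.0, 25.0, 22.0],
    ]
    truths, weights = discover_truths(example_claims)
    print(truths)
    print(weights)
```

In this toy input the third source disagrees strongly with the other two, so it ends up with the smallest weight and the estimates stay close to the values the first two sources agree on. A heuristic loop like this comes with no convergence guarantee of its own, which is the gap the abstract says the paper addresses.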

Index Terms

1. Information systems
  1. Information systems applications
    1. Data mining

Published in
KDD '16: Proceedings of the 22nd ACM SIGKDD International Conference on Knowledge Discovery and Data Mining
August 2016
2176 pages
ISBN: 9781450342322
DOI: 10.1145/2939672
General Chairs: Balaji Krishnapuram (IBM), Mohak Shah (Bosch)
Program Chairs: Alex Smola (Amazon), Charu Aggarwal (IBM), Dou Shen (Baidu), Rajeev Rastogi (Amazon)
Copyright © 2016 ACM
Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. Copyrights for components of this work owned by others than ACM must be honored. Abstracting with credit is permitted. To copy otherwise, or republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee. Request permissions from [email protected]
Publisher
Association for Computing Machinery
New York, NY, United States
Author Tags
asymptotic consistency
mixture model
truth discovery
Qualifiers
- research-article
Acceptance Rates
KDD '16 Paper Acceptance Rate: 66 of 1,115 submissions, 6%
Overall Acceptance Rate: 1,133 of 8,635 submissions, 13%
Article Metrics
- Total Citations: 27
- Total Downloads: 679
- Downloads (Last 12 months): 80
- Downloads (Last 6 weeks): 11