ABSTRACT
In the information age, people can easily collect information about the same set of entities from multiple sources, among which conflicts are inevitable. This leads to an important task, truth discovery, i.e., identifying true facts (truths) by iteratively updating the truths and source reliability. However, convergence to the truths has not been discussed in existing work, and thus there is no theoretical guarantee for the results of these truth discovery approaches. In contrast, in this paper we propose a truth discovery approach with a theoretical guarantee. We propose a randomized Gaussian mixture model (RGMM) to represent multi-source data, where the truths are model parameters. We incorporate source bias, which captures each source's degree of reliability, into the RGMM formulation. The truth discovery task is then modeled as seeking the maximum likelihood estimate (MLE) of the truths. Based on expectation-maximization (EM) techniques, we propose population-based (i.e., in the limit of infinite data) and sample-based (i.e., on a finite set of samples) solutions for the MLE. Theoretically, we prove that both solutions are contractive to an ε-ball around the MLE, under certain conditions. Experimentally, we evaluate our method on both simulated and real-world datasets. Experimental results show that our method achieves high accuracy in identifying truths with a convergence guarantee.
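To make the phrase "iteratively updating truths and source reliability" concrete, the following is a minimal sketch of a generic alternating scheme for real-valued claims. It is not the RGMM or the EM solutions proposed in the paper: the function name `discover_truths`, the per-source Gaussian-noise assumption, the variance floor `eps`, and the requirement that every source reports on every entity are all assumptions made for illustration.

```python
def discover_truths(claims, n_iter=20, eps=1e-6):
    """Alternate between estimating truths and source reliability.

    claims[s][e] is the value that source s reports for entity e.
    Assumption (illustrative only): every source covers every entity and
    each source adds zero-mean Gaussian noise with its own variance.
    Returns (truths, variances).
    """
    n_sources = len(claims)
    n_entities = len(claims[0])

    # Initialise each truth as the unweighted mean of the claims.
    truths = [
        sum(claims[s][e] for s in range(n_sources)) / n_sources
        for e in range(n_entities)
    ]
    variances = [1.0] * n_sources

    for _ in range(n_iter):
        # Reliability step: a source's variance is its mean squared
        # deviation from the current truth estimates (floored at eps
        # to avoid division by zero in the next step).
        variances = [
            max(eps, sum((claims[s][e] - truths[e]) ** 2
                         for e in range(n_entities)) / n_entities)
            for s in range(n_sources)
        ]
        # Truth step: precision-weighted mean of the claims, so that
        # low-variance (more reliable) sources count for more.
        weights = [1.0 / v for v in variances]
        total = sum(weights)
        truths = [
            sum(w * claims[s][e] for s, w in enumerate(weights)) / total
            for e in range(n_entities)
        ]
    return truths, variances
```

Calling `discover_truths` on a list of per-source claim lists returns one estimated truth per entity and one estimated noise variance per source. Heuristics of this kind can let a single source dominate the estimate as its variance shrinks, and they run for a fixed number of iterations without any stated convergence property, which is the kind of gap the abstract says existing approaches leave open.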
Index Terms
- A Truth Discovery Approach with Theoretical Guarantee