DOI: 10.1145/2939672.2939816
research-article
Public Access

A Truth Discovery Approach with Theoretical Guarantee

Published: 13 August 2016

ABSTRACT

In the information age, people can easily collect information about the same set of entities from multiple sources, among which conflicts are inevitable. This leads to an important task, truth discovery, i.e., identifying true facts (truths) by iteratively updating truths and source reliability. However, convergence to the truths has never been discussed in existing work, so these truth discovery approaches offer no theoretical guarantee on their results. In contrast, this paper proposes a truth discovery approach with a theoretical guarantee. We propose a randomized Gaussian mixture model (RGMM) to represent multi-source data, where the truths are model parameters. We incorporate source bias, which captures a source's degree of reliability, into the RGMM formulation. The truth discovery task is then modeled as seeking the maximum likelihood estimate (MLE) of the truths. Based on expectation-maximization (EM) techniques, we propose population-based (i.e., in the limit of infinite data) and sample-based (i.e., on a finite set of samples) solutions for the MLE. Theoretically, we prove that both solutions are contractive to an ε-ball around the MLE under certain conditions. Experimentally, we evaluate our method on both simulated and real-world datasets. The results show that our method identifies truths with high accuracy and with a convergence guarantee.
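To give a feel for what a sample-based EM iteration that contracts toward the MLE can look like, here is a minimal sketch on a much simpler model than the paper's RGMM: a balanced, symmetric two-component Gaussian mixture 0.5·N(θ, σ²) + 0.5·N(−θ, σ²) with known σ. The model choice, the function names (`em_step`, `run_em`, `make_samples`), and all parameter values are assumptions made for illustration. Source bias and the randomized formulation described in the abstract are not modeled here.

```python
import math
import random


def em_step(theta, samples, sigma=1.0):
    """One sample-based EM update for 0.5*N(theta, s^2) + 0.5*N(-theta, s^2).

    The posterior weight of the +theta component for a sample x is
    sigmoid(2*theta*x/sigma^2), so (2*weight - 1) equals
    tanh(theta*x/sigma^2) and the E- and M-steps collapse into one average.
    """
    return sum(math.tanh(theta * x / sigma**2) * x for x in samples) / len(samples)


def run_em(samples, theta0, n_iter=30, sigma=1.0):
    """Iterate em_step from theta0 and return the whole trajectory."""
    trajectory = [theta0]
    for _ in range(n_iter):
        trajectory.append(em_step(trajectory[-1], samples, sigma))
    return trajectory


def make_samples(theta_star, n, sigma=1.0, seed=0):
    """Draw n points from the symmetric mixture centred at +/- theta_star."""
    rng = random.Random(seed)
    return [
        rng.choice((-1.0, 1.0)) * theta_star + rng.gauss(0.0, sigma)
        for _ in range(n)
    ]


if __name__ == "__main__":
    # Hypothetical setting: true parameter 2.0, a deliberately poor start at 0.5.
    samples = make_samples(theta_star=2.0, n=5000)
    for t, theta in enumerate(run_em(samples, theta0=0.5, n_iter=8)):
        print(t, round(theta, 4))
```

In this toy setting the iterates should move quickly from the starting point toward a value near the true parameter and then stop changing. The remaining gap comes from the finite sample, which is the intuition behind a guarantee stated in terms of an ε-ball around the MLE.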


    Published in

      KDD '16: Proceedings of the 22nd ACM SIGKDD International Conference on Knowledge Discovery and Data Mining
      August 2016
      2176 pages
      ISBN: 9781450342322
      DOI: 10.1145/2939672

      Copyright © 2016 ACM

      Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. Copyrights for components of this work owned by others than ACM must be honored. Abstracting with credit is permitted. To copy otherwise, or republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee. Request permissions from [email protected]

      Publisher

      Association for Computing Machinery

      New York, NY, United States


      Acceptance Rates

      KDD '16 Paper Acceptance Rate: 66 of 1,115 submissions, 6%
      Overall Acceptance Rate: 1,133 of 8,635 submissions, 13%
