research-article

A confidence-aware approach for truth discovery on long-tail data

Published: 01 December 2014

Abstract

In many real-world applications, the same item may be described by multiple sources. As a consequence, conflicts among these sources are inevitable, which leads to an important task: how to identify which piece of information is trustworthy, i.e., the truth discovery task. Intuitively, if a piece of information comes from a reliable source, then it is more trustworthy, and a source that provides trustworthy information is more reliable. Based on this principle, truth discovery approaches have been proposed to infer source reliability degrees and the most trustworthy information (i.e., the truth) simultaneously. However, existing approaches overlook the ubiquitous long-tail phenomenon in these tasks, i.e., most sources provide only a few claims and only a few sources make plenty of claims, which makes the source reliability estimates for small sources unreasonable. To tackle this challenge, we propose a confidence-aware truth discovery (CATD) method to automatically detect truths from conflicting data that exhibit the long-tail phenomenon. The proposed method not only estimates source reliability, but also considers the confidence interval of the estimation, so that it can effectively reflect real source reliability for sources with various levels of participation. Experiments on four real-world tasks as well as simulated multi-source long-tail datasets demonstrate that the proposed method outperforms existing state-of-the-art truth discovery approaches by successfully discounting the effect of small sources.
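
The intuition of discounting small sources through a confidence interval can be illustrated with a minimal sketch. This is not the CATD estimator: it substitutes a Wilson score lower bound on a source's accuracy for the paper's estimator, and it assumes each source's track record is already known, whereas truth discovery infers reliability without ground truth. The source names, counts, and the functions `wilson_lower_bound` and `weighted_vote` are hypothetical and chosen only for illustration.

```python
import math


def wilson_lower_bound(correct: int, total: int, z: float = 1.96) -> float:
    """Lower end of the Wilson score interval for a source's accuracy.

    With few claims the interval is wide, so the lower bound is small
    even when every observed claim was correct.
    """
    if total == 0:
        return 0.0
    p = correct / total
    denom = 1 + z * z / total
    centre = p + z * z / (2 * total)
    margin = z * math.sqrt(p * (1 - p) / total + z * z / (4 * total * total))
    return (centre - margin) / denom


def weighted_vote(claims: dict, weights: dict):
    """Return the claimed value with the largest total source weight."""
    scores = {}
    for source, value in claims.items():
        scores[value] = scores.get(value, 0.0) + weights[source]
    return max(scores, key=scores.get)


# Hypothetical track records: (correct claims, total claims) per source.
history = {"big": (90, 100), "small_a": (2, 2), "small_b": (1, 1)}

# Conflicting claims about one item.
claims = {"big": "A", "small_a": "B", "small_b": "B"}

# Point estimates only: the two tiny sources look perfectly reliable.
naive = {s: c / n for s, (c, n) in history.items()}
naive_truth = weighted_vote(claims, naive)      # "B"

# Confidence-aware weights: the tiny sources are discounted.
weights = {s: wilson_lower_bound(c, n) for s, (c, n) in history.items()}
truth = weighted_vote(claims, weights)          # "A"
```

With point estimates alone, the two sources that made one or two claims outvote the source with a hundred claims; once the interval width is taken into account, their combined weight falls below that of the large source.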

Published in

Proceedings of the VLDB Endowment, Volume 8, Issue 4
December 2014
132 pages

Publisher: VLDB Endowment
