DOI: 10.1145/1963405.1963439
research-article

Semi-supervised truth discovery

Published: 28 March 2011

ABSTRACT

Accessing online information from various data sources has become a necessary part of everyday life. Unfortunately, such information is not always trustworthy: sources vary widely in quality and often provide inaccurate and conflicting information. Existing approaches attack this problem with unsupervised learning methods that infer the confidence of each data value and the trustworthiness of each source from one another, assuming that values provided by more sources are more likely to be accurate. However, because false values can spread widely through copying among sources, and out-of-date data often overwhelm up-to-date data, such bootstrapping methods are often ineffective.
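The bootstrapping idea described above can be illustrated with a minimal sketch. It is not taken from any of the existing approaches the abstract refers to; the function name `bootstrap_truth`, the update rules (value confidence as the trust-weighted share of supporting sources, source trust as the mean confidence of its claims), and the toy claims are all assumptions chosen for illustration.

```python
from collections import defaultdict

def bootstrap_truth(claims, iterations=20):
    """Hypothetical unsupervised bootstrapping over (source, object, value) triples."""
    sources = {s for s, _, _ in claims}
    trust = {s: 1.0 for s in sources}  # start by trusting every source equally
    confidence = {}
    for _ in range(iterations):
        # Value confidence: trust-weighted share of the sources backing each value.
        score = defaultdict(float)
        total = defaultdict(float)
        for s, o, v in claims:
            score[(o, v)] += trust[s]
            total[o] += trust[s]
        confidence = {(o, v): sc / total[o] for (o, v), sc in score.items()}
        # Source trust: mean confidence of the values the source claims.
        sums = defaultdict(float)
        counts = defaultdict(int)
        for s, o, v in claims:
            sums[s] += confidence[(o, v)]
            counts[s] += 1
        # A small floor keeps every per-object total strictly positive.
        trust = {s: max(sums[s] / counts[s], 1e-9) for s in sources}
    return trust, confidence

# Toy data (assumed): three sources repeat a stale value, one reports the fresh one.
claims = [("a", "x", "old"), ("b", "x", "old"), ("c", "x", "old"), ("d", "x", "new")]
trust, confidence = bootstrap_truth(claims)
print(confidence)
print(trust)
```

In this toy run the value repeated by three sources ends up with the higher confidence, and the lone dissenting source ends up with the lowest trust, which is the failure mode the abstract attributes to copying and out-of-date data.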

In this paper, we propose a semi-supervised approach that finds true values with the help of ground truth data. Such ground truth data, even in very small amounts, can greatly help identify trustworthy data sources. Unlike existing studies that provide only iterative algorithms, we derive the optimal solution to our problem and provide an iterative algorithm that converges to it. Experiments show that our method achieves higher accuracy than existing approaches, and that it can be applied to very large data sets when implemented with MapReduce.
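To make the role of ground truth concrete, here is a minimal sketch of how a few labeled facts might redirect an iterative trust computation. It is not the paper's formulation or its optimal solution: the function name `semi_supervised_truth`, the clamping heuristic (labeled objects get confidence 1 for the true value and 0 for all others), and the toy claims are assumptions made for illustration.

```python
from collections import defaultdict

def semi_supervised_truth(claims, ground_truth=None, iterations=20):
    """Hypothetical trust iteration over (source, object, value) triples.

    ground_truth maps a few objects to their known true values; confidences
    for those objects are clamped instead of being inferred from votes.
    """
    ground_truth = ground_truth or {}
    sources = {s for s, _, _ in claims}
    trust = {s: 1.0 for s in sources}
    confidence = {}
    for _ in range(iterations):
        score = defaultdict(float)
        total = defaultdict(float)
        for s, o, v in claims:
            score[(o, v)] += trust[s]
            total[o] += trust[s]
        confidence = {}
        for (o, v), sc in score.items():
            if o in ground_truth:
                # Labeled object: clamp to the known answer.
                confidence[(o, v)] = 1.0 if v == ground_truth[o] else 0.0
            else:
                # Unlabeled object: trust-weighted share of supporting sources.
                confidence[(o, v)] = sc / total[o]
        sums = defaultdict(float)
        counts = defaultdict(int)
        for s, o, v in claims:
            sums[s] += confidence[(o, v)]
            counts[s] += 1
        # A small floor keeps every per-object total strictly positive.
        trust = {s: max(sums[s] / counts[s], 1e-9) for s in sources}
    return trust, confidence

# Toy data (assumed): sources a, b, c share stale values; d is up to date.
objects = ("x1", "x2", "x3")
claims = [(s, o, "old") for s in "abc" for o in objects]
claims += [("d", o, "new") for o in objects]

_, unsupervised = semi_supervised_truth(claims)
_, supervised = semi_supervised_truth(claims, {"x1": "new", "x2": "new"})
print(unsupervised[("x3", "new")], supervised[("x3", "new")])
```

In this toy setting, labeling two of the three objects is enough for the unlabeled object `x3` to flip from the majority value to the value given by the source that agrees with the labels.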

Published in

WWW '11: Proceedings of the 20th International Conference on World Wide Web
March 2011
840 pages
ISBN: 9781450306324
DOI: 10.1145/1963405

Copyright © 2011 ACM


Publisher: Association for Computing Machinery, New York, NY, United States


Acceptance Rates

Overall Acceptance Rate: 1,899 of 8,196 submissions, 23%
