Truth finding on the deep web: is the problem solved?

Published: 01 December 2012

Abstract

The amount of useful information available on the Web has been growing at a dramatic pace in recent years, and people rely more and more on the Web to fulfill their information needs. In this paper, we study the truthfulness of Deep Web data in two domains where we believed the data to be fairly clean and where data quality is important to people's lives: Stock and Flight. To our surprise, we observed a large amount of inconsistency among data from different sources, as well as some sources with quite low accuracy. We further applied state-of-the-art data fusion methods, which aim at resolving conflicts and finding the truth, to these two data sets, analyzed their strengths and limitations, and suggested promising research directions. We hope our study will increase awareness of the seriousness of conflicting data on the Web and in turn inspire more research in our community to tackle this problem.


Published in

Proceedings of the VLDB Endowment, Volume 6, Issue 2
December 2012
120 pages

Publisher

VLDB Endowment
