Abstract
The amount of useful information available on the Web has been growing at a dramatic pace in recent years, and people rely more and more on the Web to fulfill their information needs. In this paper, we study the truthfulness of Deep Web data in two domains where we believed data are fairly clean and data quality is important to people's lives: Stock and Flight. To our surprise, we observed a large amount of inconsistency in data from different sources, as well as some sources with quite low accuracy. We further applied state-of-the-art data fusion methods, which aim at resolving conflicts and finding the truth, to these two data sets, analyzed their strengths and limitations, and suggested promising research directions. We hope our study can increase awareness of the seriousness of conflicting data on the Web and in turn inspire more research in our community to tackle this problem.
Index Terms
- Truth finding on the deep web: is the problem solved?