ABSTRACT
Accessing online information from various data sources has become a necessary part of our everyday life. Unfortunately, such information is not always trustworthy, as sources differ widely in quality and often provide inaccurate and conflicting information. Existing approaches attack this problem with unsupervised learning methods: they infer the confidence of each data value and the trustworthiness of each source from one another, assuming that values provided by more sources are more accurate. However, because false values can spread widely through copying among sources, and out-of-date data often overwhelm up-to-date data, such bootstrapping methods are often ineffective.
In this paper, we propose a semi-supervised approach that finds true values with the help of ground truth data. Such ground truth data, even in very small amounts, can greatly help us identify trustworthy data sources. Unlike existing studies that only provide iterative algorithms, we derive the optimal solution to our problem and provide an iterative algorithm that converges to it. Experiments show that our method achieves higher accuracy than existing approaches, and that it can be applied to very large data sets when implemented with MapReduce.
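To make the intuition concrete, the following is a minimal sketch of why even a single ground-truth fact can help when stale values are copied across sources. It is not the algorithm proposed in this paper (the abstract does not specify it): it is a simplified stand-in that estimates each source's accuracy on the labelled items only and then takes a log-odds weighted vote. The function names, the `smoothing` parameter, and the toy sources `A` to `D` are all hypothetical and chosen for illustration.

```python
import math
from collections import defaultdict


def majority_vote(claims):
    """Unsupervised baseline: pick the value claimed by the most sources."""
    votes = defaultdict(lambda: defaultdict(float))
    for _source, item, value in claims:
        votes[item][value] += 1.0
    return {item: max(vals, key=vals.get) for item, vals in votes.items()}


def ground_truth_weighted_vote(claims, ground_truth, smoothing=1.0):
    """Illustrative only: weight each source by its accuracy on labelled items.

    claims       -- list of (source, item, value) triples
    ground_truth -- dict mapping a few items to their known true value
    smoothing    -- Laplace smoothing (assumed > 0) so accuracies stay in (0, 1)
    """
    correct = defaultdict(float)
    seen = defaultdict(float)
    for source, item, value in claims:
        if item in ground_truth:
            seen[source] += 1.0
            correct[source] += float(value == ground_truth[item])

    sources = {source for source, _item, _value in claims}
    # Sources with no labelled claims get accuracy 0.5, i.e. zero vote weight.
    accuracy = {
        s: (correct[s] + smoothing) / (seen[s] + 2.0 * smoothing) for s in sources
    }

    votes = defaultdict(lambda: defaultdict(float))
    for source, item, value in claims:
        a = accuracy[source]
        votes[item][value] += math.log(a / (1.0 - a))  # log-odds weight

    result = {item: max(vals, key=vals.get) for item, vals in votes.items()}
    result.update(ground_truth)  # labelled items keep their known value
    return result, accuracy


# Toy data: A, B and C copy the same out-of-date values; only D is up to date.
claims = [
    (source, item, "old") for source in ("A", "B", "C") for item in ("x", "y", "z")
] + [("D", item, "new") for item in ("x", "y", "z")]
ground_truth = {"x": "new"}  # a single labelled fact

baseline = majority_vote(claims)
semi_supervised, accuracy = ground_truth_weighted_vote(claims, ground_truth)
print(baseline)         # every item resolves to "old"
print(semi_supervised)  # every item resolves to "new"
```

In this toy setting, plain voting follows the three copiers, while the single labelled item is enough to down-weight them so that the unlabelled items `y` and `z` also resolve to the up-to-date value.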