ABSTRACT
In many applications, descriptions of the same objects or events can be obtained from a variety of sources, which inevitably leads to conflicting data or information. An important problem is to identify the true information (i.e., the truths) among the conflicting sources. It is intuitive to trust reliable sources more when deriving the truths, but it is usually unknown a priori which sources are more reliable. Moreover, each source provides a variety of properties with different data types, so an accurate estimate of source reliability has to model multiple properties in a unified way. Existing conflict resolution work either does not estimate source reliability or models multiple properties separately. In this paper, we propose to resolve conflicts among multiple sources of heterogeneous data types. We model the problem using an optimization framework in which the truths and the source reliabilities are defined as two sets of unknown variables. The objective is to minimize the overall weighted deviation between the truths and the multi-source observations, where each source is weighted by its reliability. Different loss functions can be incorporated into this framework to capture the characteristics of various data types, and efficient computation approaches are developed. Experiments on real-world weather, stock, and flight data, as well as on simulated multi-source data, demonstrate the necessity of jointly modeling different data types in the proposed framework.
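The optimization described above can be illustrated with a minimal sketch that alternates between updating the truths and updating the source weights. This is not the paper's implementation: the function name `resolve_conflicts`, the squared loss for continuous entries, the 0-1 loss for categorical entries, the variance normalization, and the negative-log weight update are all assumptions chosen for illustration, since the abstract does not specify them.

```python
import math
from collections import Counter


def resolve_conflicts(observations, kinds, iterations=10, eps=1e-9):
    """Alternate between truth and source-weight updates (illustrative only).

    observations[k][i] is source k's claim for entry i.
    kinds[i] is "continuous" or "categorical".
    Returns (truths, weights).
    """
    n_sources = len(observations)
    n_entries = len(kinds)
    weights = [1.0] * n_sources

    # Assumption: scale continuous losses by the variance of the claims so
    # that entries measured in different units contribute comparably.
    scales = []
    for i in range(n_entries):
        if kinds[i] == "continuous":
            col = [observations[k][i] for k in range(n_sources)]
            mean = sum(col) / len(col)
            var = sum((x - mean) ** 2 for x in col) / len(col)
            scales.append(var if var > 0 else 1.0)
        else:
            scales.append(1.0)

    truths = [None] * n_entries
    for _ in range(iterations):
        # Truth update: with weights fixed, squared loss is minimized by the
        # weighted mean and 0-1 loss by the weighted majority vote.
        for i in range(n_entries):
            col = [observations[k][i] for k in range(n_sources)]
            if kinds[i] == "continuous":
                truths[i] = sum(w * x for w, x in zip(weights, col)) / sum(weights)
            else:
                votes = Counter()
                for w, x in zip(weights, col):
                    votes[x] += w
                truths[i] = max(votes, key=votes.get)

        # Weight update: a source that deviates more from the current truths
        # receives a smaller weight.
        losses = []
        for k in range(n_sources):
            loss = 0.0
            for i in range(n_entries):
                if kinds[i] == "continuous":
                    loss += (observations[k][i] - truths[i]) ** 2 / scales[i]
                else:
                    loss += 0.0 if observations[k][i] == truths[i] else 1.0
            losses.append(loss)
        total = sum(losses)
        if total == 0:  # all sources agree; nothing left to resolve
            break
        # Assumed update rule: w_k = -log(loss_k / total loss).
        weights = [-math.log(max(l, eps) / (total + eps)) for l in losses]

    return truths, weights


# Hypothetical weather claims: temperature (continuous), condition (categorical).
claims = [
    [20.0, "sunny", 30.0, "rain"],   # source A
    [20.5, "sunny", 29.5, "rain"],   # source B
    [35.0, "cloudy", 10.0, "rain"],  # source C, which disagrees often
]
kinds = ["continuous", "categorical", "continuous", "categorical"]
truths, weights = resolve_conflicts(claims, kinds)
print(truths, weights)
```

Because both data types contribute to a single loss per source, the disagreement of source C on the categorical entry also lowers its influence on the continuous entries, which is the kind of joint modeling the abstract argues for.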
Index Terms
- Resolving conflicts in heterogeneous data by truth discovery and source reliability estimation