Abstract
The task of data fusion is to identify the true values of data items (e.g., the true date of birth for Tom Cruise) among multiple observed values drawn from different sources (e.g., Web sites) of varying (and unknown) reliability. A recent survey [20] has provided a detailed comparison of various fusion methods on Deep Web data. In this paper, we study the applicability and limitations of different fusion techniques on a more challenging problem: knowledge fusion. Knowledge fusion identifies true subject-predicate-object triples extracted by multiple information extractors from multiple information sources. These extractors perform the tasks of entity linkage and schema alignment, thus introducing an additional source of noise that is quite different from that traditionally considered in the data fusion literature, which focuses only on factual errors in the original sources. We adapt state-of-the-art data fusion techniques and apply them to a knowledge base with 1.6B unique knowledge triples extracted by 12 extractors from over 1B Web pages, which is three orders of magnitude larger than the data sets used in previous data fusion papers. We show that data fusion approaches hold great promise for solving the knowledge fusion problem, and we suggest interesting research directions through a detailed error analysis of the methods.
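To make the fusion task concrete, the following is a minimal sketch of iterative accuracy-weighted voting, a generic truth-discovery scheme and not necessarily any of the methods evaluated in the paper. It assumes that each observation is tagged with a provenance, modelled here as an (extractor, source) pair so that extraction noise and source errors are both reflected in the estimated accuracy. The function name `fuse`, the smoothing constants, the initial accuracy of 0.8, and every item other than the Tom Cruise example are hypothetical.

```python
from collections import defaultdict


def fuse(observations, n_iters=10, init_accuracy=0.8):
    """Fuse conflicting values by iterative accuracy-weighted voting.

    observations: iterable of (provenance, data_item, value) tuples.
    Returns (truths, accuracy): the chosen value per data item and the
    estimated accuracy per provenance.
    """
    # claims[item][value] = set of provenances asserting that value
    claims = defaultdict(lambda: defaultdict(set))
    for prov, item, value in observations:
        claims[item][value].add(prov)

    provenances = {prov for prov, _, _ in observations}
    accuracy = {prov: init_accuracy for prov in provenances}
    truths = {}

    for _ in range(n_iters):
        # Step 1: for each item, pick the value with the largest total
        # accuracy of its supporting provenances (ties: smallest value).
        for item, by_value in claims.items():
            scores = {v: sum(accuracy[p] for p in ps)
                      for v, ps in by_value.items()}
            truths[item] = max(sorted(scores), key=scores.get)

        # Step 2: re-estimate each provenance's accuracy as the smoothed
        # fraction of its claims that agree with the current choices.
        correct = defaultdict(int)
        total = defaultdict(int)
        for item, by_value in claims.items():
            for v, ps in by_value.items():
                for p in ps:
                    total[p] += 1
                    correct[p] += int(v == truths[item])
        accuracy = {p: (correct[p] + 1) / (total[p] + 2) for p in provenances}

    return truths, accuracy


# Hypothetical provenances: (extractor, source) pairs.
P1 = ("extractor_1", "site_a.example")
P2 = ("extractor_2", "site_a.example")
P3 = ("extractor_1", "site_b.example")

DOB = ("Tom Cruise", "date_of_birth")
ITEM_B = ("item_b", "attr")  # made-up data item
ITEM_C = ("item_c", "attr")  # made-up data item

observations = [
    (P1, DOB, "1962-07-03"), (P2, DOB, "1962-07-03"), (P3, DOB, "1963-07-03"),
    (P1, ITEM_B, "v1"), (P2, ITEM_B, "v1"), (P3, ITEM_B, "v2"),
    (P1, ITEM_C, "v2"), (P3, ITEM_C, "v1"),  # one-vs-one conflict
]

truths, accuracy = fuse(observations)
```

In this toy input the one-vs-one conflict on `ITEM_C` cannot be settled by counting votes alone; it is resolved in favour of `P1` only after `P3` has lost credibility on the other two items, which is the basic intuition behind weighting votes by estimated reliability.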
- Z. Bellahsene, A. Bonifati, and E. Rahm. Schema Matching and Mapping. Springer, 2011.
- L. Blanco, V. Crescenzi, P. Merialdo, and P. Papotti. Probabilistic models to reconcile complex data from inaccurate data sources. In CAiSE, 2010.
- J. Bleiholder and F. Naumann. Data fusion. ACM Computing Surveys, 41(1):1–41, 2008.
- K. Bollacker, C. Evans, P. Paritosh, T. Sturge, and J. Taylor. Freebase: a collaboratively created graph database for structuring human knowledge. In SIGMOD, pages 1247–1250, 2008.
- S. Brin and L. Page. The anatomy of a large-scale hypertextual Web search engine. Computer Networks and ISDN Systems, 30(1-7):107–117, 1998.
- M. J. Cafarella, A. Y. Halevy, D. Z. Wang, E. Wu, and Y. Zhang. Webtables: exploring the power of tables on the web. In PVLDB, 2008.
- A. Carlson, J. Betteridge, B. Kisiel, B. Settles, E. H. Jr., and T. Mitchell. Toward an architecture for never-ending language learning. In AAAI, 2010.
- N. Dalvi, A. Machanavajjhala, and B. Pang. An analysis of structured data on the web. PVLDB, 5(7):680–691, 2012.
- J. Dean and S. Ghemawat. MapReduce: Simplified data processing on large clusters. In OSDI, pages 137–149, 2004.
- X. L. Dong, L. Berti-Equille, Y. Hu, and D. Srivastava. Global detection of complex copying relationships between sources. PVLDB, 2010.
- X. L. Dong, L. Berti-Equille, and D. Srivastava. Integrating conflicting data: the role of source dependence. PVLDB, 2(1), 2009.
- X. L. Dong, L. Berti-Equille, and D. Srivastava. Truth discovery and copying detection in a dynamic world. PVLDB, 2(1), 2009.
- X. L. Dong and F. Naumann. Data fusion – resolving data conflicts for integration. PVLDB, 2009.
- X. L. Dong, B. Saha, and D. Srivastava. Less is more: Selecting sources wisely for integration. PVLDB, 6(2), 2013.
- J. Fleiss. Statistical methods for rates and proportions. John Wiley and Sons, 1981.
- L. A. Galárraga, C. Teflioudi, K. Hose, and F. Suchanek. Amie: association rule mining under incomplete evidence in ontological knowledge bases. In WWW, pages 413–422, 2013.
- A. Galland, S. Abiteboul, A. Marian, and P. Senellart. Corroborating information from disagreeing views. In WSDM, 2010.
- R. Gupta, A. Halevy, X. Wang, S. E. Whang, and F. Wu. Biperpedia: An ontology for search applications. PVLDB, 7(7), 2014.
- J. M. Kleinberg. Authoritative sources in a hyperlinked environment. In SODA, 1998.
- X. Li, X. L. Dong, K. B. Lyons, W. Meng, and D. Srivastava. Truth finding on the deep web: Is the problem solved? PVLDB, 6(2), 2013.
- X. Liu, X. L. Dong, B. chin Ooi, and D. Srivastava. Online data fusion. PVLDB, 4(12), 2011.
- J. Madhavan, D. Ko, L. Kot, V. Ganapathy, A. Rasmussen, and A. Y. Halevy. Google's deep web crawl. In PVLDB, 2008.
- M. Mintz, S. Bills, R. Snow, and D. Jurafsky. Distant supervision for relation extraction without labeled data. In Proc. Conf. Recent Advances in NLP, 2009.
- F. Niu, C. Zhang, and C. Re. Elementary: Large-scale Knowledge-base Construction via Machine Learning and Statistical Inference. Intl. J. on Semantic Web and Information Systems, 2012.
- J. Pasternack and D. Roth. Knowing what to believe (when you already know something). In COLING, pages 877–885, 2010.
- J. Pasternack and D. Roth. Making better informed trust decisions with generalized fact-finding. In IJCAI, pages 2324–2329, 2011.
- J. Pasternack and D. Roth. Latent credibility analysis. In WWW, 2013.
- R. Pochampally, A. D. Sarma, X. L. Dong, A. Meliou, and D. Srivastava. Fusing data with correlations. In SIGMOD, 2014.
- G.-J. Qi, C. Aggarwal, J. Han, and T. Huang. Mining collective intelligence in groups. In WWW, 2013.
- L. Ratinov, D. Roth, D. Downey, and M. Anderson. Local and global algorithms for disambiguation to wikipedia. In NAACL, 2011.
- A. Ritter, L. Zettlemoyer, Mausam, and O. Etzioni. Modeling missing data in distant supervision for information extraction. Trans. Assoc. Comp. Linguistics, 1, 2013.
- F. Suchanek, G. Kasneci, and G. Weikum. YAGO - A Core of Semantic Knowledge. In WWW, 2007.
- M. Wick, S. Singh, A. Kobren, and A. McCallum. Assessing confidence of knowledge base content with an experimental study in entity resolution. In AKBC workshop, 2013.
- X. Yin, J. Han, and P. S. Yu. Truth discovery with multiple conflicting information providers on the web. In Proc. of SIGKDD, 2007.
- X. Yin and W. Tan. Semi-supervised truth discovery. In WWW, pages 217–226, 2011.
- B. Zhao and J. Han. A probabilistic model for estimating real-valued truth from conflicting sources. In QDB, 2012.
- B. Zhao, B. I. P. Rubinstein, J. Gemmell, and J. Han. A bayesian approach to discovering truth from conflicting sources for data integration. PVLDB, 5(6):550–561, 2012.