research-article

From data fusion to knowledge fusion

Authors:
Xin Luna Dong (Google Inc.)
Evgeniy Gabrilovich (Google Inc.)
Geremy Heitz (Google Inc.)
Wilko Horn (Google Inc.)
Kevin Murphy (Google Inc.)
Shaohua Sun (Google Inc.)
Wei Zhang (Google Inc.)
Proceedings of the VLDB Endowment, Volume 7, Issue 10, pp. 881–892. https://doi.org/10.14778/2732951.2732962

Published: 01 June 2014


Abstract

The task of data fusion is to identify the true values of data items (e.g., the true date of birth for Tom Cruise) among multiple observed values drawn from different sources (e.g., Web sites) of varying (and unknown) reliability. A recent survey [20] has provided a detailed comparison of various fusion methods on Deep Web data. In this paper, we study the applicability and limitations of different fusion techniques on a more challenging problem: knowledge fusion. Knowledge fusion identifies true subject-predicate-object triples extracted by multiple information extractors from multiple information sources. These extractors perform the tasks of entity linkage and schema alignment, thus introducing an additional source of noise that is quite different from that traditionally considered in the data fusion literature, which only focuses on factual errors in the original sources. We adapt state-of-the-art data fusion techniques and apply them to a knowledge base with 1.6B unique knowledge triples extracted by 12 extractors from over 1B Web pages, which is three orders of magnitude larger than the data sets used in previous data fusion papers. We show great promise of the data fusion approaches in solving the knowledge fusion problem, and suggest interesting research directions through a detailed error analysis of the methods.
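The fusion task described in the abstract can be illustrated with a minimal sketch. It is not the method evaluated in the paper: it is a simplified accuracy-weighted vote, and it assumes that each data item (a subject-predicate pair) has a single true object and that each claim is attributed to a provenance, modeled here as a hypothetical (extractor, source) pair. The function name `fuse`, its parameters, and the sample claims are all illustrative.

```python
from collections import defaultdict


def fuse(claims, rounds=5, init_accuracy=0.8):
    """Pick one object per (subject, predicate) by accuracy-weighted voting.

    claims: list of (provenance, subject, predicate, obj) tuples.
    Returns (truth, accuracy), where truth maps each data item to its
    chosen object and accuracy maps each provenance to an estimate in [0, 1].
    Assumption: every data item has exactly one true object.
    """
    claims = list(claims)
    accuracy = {prov: init_accuracy for prov, _, _, _ in claims}
    truth = {}
    for _ in range(rounds):
        # Step 1: each provenance votes for its object, weighted by accuracy.
        votes = defaultdict(lambda: defaultdict(float))
        for prov, subj, pred, obj in claims:
            votes[(subj, pred)][obj] += accuracy[prov]

        # Normalize the votes of each data item into a belief distribution.
        belief = {}
        for item, by_obj in votes.items():
            total = sum(by_obj.values())
            belief[item] = {obj: v / total for obj, v in by_obj.items()}
        truth = {item: max(b, key=b.get) for item, b in belief.items()}

        # Step 2: a provenance's accuracy is the mean belief in what it claimed.
        supported = defaultdict(list)
        for prov, subj, pred, obj in claims:
            supported[prov].append(belief[(subj, pred)][obj])
        accuracy = {prov: sum(v) / len(v) for prov, v in supported.items()}
    return truth, accuracy


# Illustrative claims: (extractor, source) provenance, then the triple.
claims = [
    (("extractor_1", "site-a.example"), "Tom Cruise", "date_of_birth", "1962-07-03"),
    (("extractor_2", "site-a.example"), "Tom Cruise", "date_of_birth", "1962-07-03"),
    (("extractor_1", "site-b.example"), "Tom Cruise", "date_of_birth", "1963-07-03"),
]
truth, accuracy = fuse(claims)
print(truth)
print(accuracy)
```

Treating the extractor and the source together as one provenance is a design choice of this sketch: it lets a single accuracy estimate absorb both factual errors in the source and errors introduced during extraction, which are the two kinds of noise the abstract distinguishes.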

References

[1] Z. Bellahsene, A. Bonifati, and E. Rahm. Schema Matching and Mapping. Springer, 2011.
[2] L. Blanco, V. Crescenzi, P. Merialdo, and P. Papotti. Probabilistic models to reconcile complex data from inaccurate data sources. In CAiSE, 2010.
[3] J. Bleiholder and F. Naumann. Data fusion. ACM Computing Surveys, 41(1):1--41, 2008.
[4] K. Bollacker, C. Evans, P. Paritosh, T. Sturge, and J. Taylor. Freebase: a collaboratively created graph database for structuring human knowledge. In SIGMOD, pages 1247--1250, 2008.
[5] S. Brin and L. Page. The anatomy of a large-scale hypertextual Web search engine. Computer Networks and ISDN Systems, 30(1-7):107--117, 1998.
[6] M. J. Cafarella, A. Y. Halevy, D. Z. Wang, E. Wu, and Y. Zhang. Webtables: exploring the power of tables on the web. In PVLDB, 2008.
[7] A. Carlson, J. Betteridge, B. Kisiel, B. Settles, E. H. Jr., and T. Mitchell. Toward an architecture for never-ending language learning. In AAAI, 2010.
[8] N. Dalvi, A. Machanavajjhala, and B. Pang. An analysis of structured data on the web. PVLDB, 5(7):680--691, 2012.
[9] J. Dean and S. Ghemawat. MapReduce: Simplified data processing on large clusters. In OSDI, pages 137--149, 2004.
[10] X. L. Dong, L. Berti-Equille, Y. Hu, and D. Srivastava. Global detection of complex copying relationships between sources. PVLDB, 2010.
[11] X. L. Dong, L. Berti-Equille, and D. Srivastava. Integrating conflicting data: the role of source dependence. PVLDB, 2(1), 2009.
[12] X. L. Dong, L. Berti-Equille, and D. Srivastava. Truth discovery and copying detection in a dynamic world. PVLDB, 2(1), 2009.
[13] X. L. Dong and F. Naumann. Data fusion--resolving data conflicts for integration. PVLDB, 2009.
[14] X. L. Dong, B. Saha, and D. Srivastava. Less is more: Selecting sources wisely for integration. PVLDB, 6(2), 2013.
[15] J. Fleiss. Statistical methods for rates and proportions. John Wiley and Sons, 1981.
[16] L. A. Galárraga, C. Teflioudi, K. Hose, and F. Suchanek. Amie: association rule mining under incomplete evidence in ontological knowledge bases. In WWW, pages 413--422, 2013.
[17] A. Galland, S. Abiteboul, A. Marian, and P. Senellart. Corroborating information from disagreeing views. In WSDM, 2010.
[18] R. Gupta, A. Halevy, X. Wang, S. E. Whang, and F. Wu. Biperpedia: An ontology for search applications. PVLDB, 7(7), 2014.
[19] J. M. Kleinberg. Authoritative sources in a hyperlinked environment. In SODA, 1998.
[20] X. Li, X. L. Dong, K. B. Lyons, W. Meng, and D. Srivastava. Truth finding on the deep web: Is the problem solved? PVLDB, 6(2), 2013.
[21] X. Liu, X. L. Dong, B. Chin Ooi, and D. Srivastava. Online data fusion. PVLDB, 4(12), 2011.
[22] J. Madhavan, D. Ko, L. Kot, V. Ganapathy, A. Rasmussen, and A. Y. Halevy. Google's deep web crawl. In PVLDB, 2008.
[23] M. Mintz, S. Bills, R. Snow, and D. Jurafsky. Distant supervision for relation extraction without labeled data. In Proc. Conf. Recent Advances in NLP, 2009.
[24] F. Niu, C. Zhang, and C. Re. Elementary: Large-scale Knowledge-base Construction via Machine Learning and Statistical Inference. Intl. J. on Semantic Web and Information Systems, 2012.
[25] J. Pasternack and D. Roth. Knowing what to believe (when you already know something). In COLING, pages 877--885, 2010.
[26] J. Pasternack and D. Roth. Making better informed trust decisions with generalized fact-finding. In IJCAI, pages 2324--2329, 2011.
[27] J. Pasternack and D. Roth. Latent credibility analysis. In WWW, 2013.
[28] R. Pochampally, A. D. Sarma, X. L. Dong, A. Meliou, and D. Srivastava. Fusing data with correlations. In SIGMOD, 2014.
[29] G.-J. Qi, C. Aggarwal, J. Han, and T. Huang. Mining collective intelligence in groups. In WWW, 2013.
[30] L. Ratinov, D. Roth, D. Downey, and M. Anderson. Local and global algorithms for disambiguation to wikipedia. In NAACL, 2011.
[31] A. Ritter, L. Zettlemoyer, Mausam, and O. Etzioni. Modeling missing data in distant supervision for information extraction. Trans. Assoc. Comp. Linguistics, 1, 2013.
[32] F. Suchanek, G. Kasneci, and G. Weikum. YAGO - A Core of Semantic Knowledge. In WWW, 2007.
[33] M. Wick, S. Singh, A. Kobren, and A. McCallum. Assessing confidence of knowledge base content with an experimental study in entity resolution. In AKBC workshop, 2013.
[34] X. Yin, J. Han, and P. S. Yu. Truth discovery with multiple conflicting information providers on the web. In Proc. of SIGKDD, 2007.
[35] X. Yin and W. Tan. Semi-supervised truth discovery. In WWW, pages 217--226, 2011.
[36] B. Zhao and J. Han. A probabilistic model for estimating real-valued truth from conflicting sources. In QDB, 2012.
[37] B. Zhao, B. I. P. Rubinstein, J. Gemmell, and J. Han. A bayesian approach to discovering truth from conflicting sources for data integration. PVLDB, 5(6):550--561, 2012.


Published in
Proceedings of the VLDB Endowment Volume 7, Issue 10
June 2014
146 pages
ISSN: 2150-8097
Editors: H. V. Jagadish (University of Michigan), Aoying Zhou (East China Normal University, China)
Publisher: VLDB Endowment

Article Metrics
- Total Citations: 67
- Total Downloads: 816
- Downloads (last 12 months): 19
- Downloads (last 6 weeks): 3
