research-article

From data fusion to knowledge fusion

Published: 01 June 2014
Abstract

The task of data fusion is to identify the true values of data items (e.g., the true date of birth for Tom Cruise) among multiple observed values drawn from different sources (e.g., Web sites) of varying (and unknown) reliability. A recent survey [20] has provided a detailed comparison of various fusion methods on Deep Web data. In this paper, we study the applicability and limitations of different fusion techniques on a more challenging problem: knowledge fusion. Knowledge fusion identifies true subject-predicate-object triples extracted by multiple information extractors from multiple information sources. These extractors perform the tasks of entity linkage and schema alignment, thus introducing an additional source of noise that is quite different from that traditionally considered in the data fusion literature, which focuses only on factual errors in the original sources. We adapt state-of-the-art data fusion techniques and apply them to a knowledge base with 1.6B unique knowledge triples extracted by 12 extractors from over 1B Web pages, which is three orders of magnitude larger than the data sets used in previous data fusion papers. We show that data fusion approaches hold great promise for solving the knowledge fusion problem, and we suggest interesting research directions through a detailed error analysis of the methods.
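
To make the problem statement concrete, the following is a minimal sketch of one simple fusion scheme, iterative accuracy-weighted voting, applied to extracted triples. It is not the method evaluated in the paper. It assumes that each data item (a subject-predicate pair) has a single true value, and it treats every (extractor, source) pair as one provenance whose reliability is estimated jointly with the truth. The function name `fuse`, the parameter defaults, the provenance labels, and the toy claims are all hypothetical.

```python
from collections import defaultdict


def fuse(claims, rounds=5, init_accuracy=0.8):
    """Iterative accuracy-weighted voting (illustrative sketch only).

    claims: iterable of (provenance, data_item, value) tuples, where a
    provenance is an (extractor, source) pair, a data item is a
    (subject, predicate) pair, and the value is the object of the triple.
    Returns (belief, accuracy): belief[item][value] sums to 1 per item,
    accuracy[provenance] is the estimated reliability of that provenance.
    """
    claims = list(claims)
    # Assumption: every provenance starts with the same prior accuracy.
    accuracy = {prov: init_accuracy for prov, _, _ in claims}
    belief = {}
    for _ in range(rounds):
        # Step 1: each provenance votes for its value, weighted by accuracy.
        scores = defaultdict(lambda: defaultdict(float))
        for prov, item, value in claims:
            scores[item][value] += accuracy[prov]
        # Normalize the votes of each data item so they sum to 1.
        belief = {
            item: {v: s / sum(votes.values()) for v, s in votes.items()}
            for item, votes in scores.items()
        }
        # Step 2: a provenance's accuracy is the mean belief in its claims.
        totals, counts = defaultdict(float), defaultdict(int)
        for prov, item, value in claims:
            totals[prov] += belief[item][value]
            counts[prov] += 1
        accuracy = {prov: totals[prov] / counts[prov] for prov in totals}
    return belief, accuracy


# Toy input: three extracted claims about one data item.
dob = ("Tom Cruise", "date_of_birth")
claims = [
    (("extractor_1", "site_a.example"), dob, "1962-07-03"),
    (("extractor_2", "site_a.example"), dob, "1962-07-03"),
    (("extractor_1", "site_b.example"), dob, "1963-07-03"),
]
belief, accuracy = fuse(claims)
best_value = max(belief[dob], key=belief[dob].get)
print(best_value, accuracy)
```

Keying reliability on the (extractor, source) pair is only one possible design choice. It lets the same loop absorb both kinds of noise the abstract describes, errors in the original source and errors introduced by the extractor, but it cannot tell the two apart.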

References

  1. Z. Bellahsene, A. Bonifati, and E. Rahm. Schema Matching and Mapping. Springer, 2011.
  2. L. Blanco, V. Crescenzi, P. Merialdo, and P. Papotti. Probabilistic models to reconcile complex data from inaccurate data sources. In CAiSE, 2010.
  3. J. Bleiholder and F. Naumann. Data fusion. ACM Computing Surveys, 41(1):1--41, 2008.
  4. K. Bollacker, C. Evans, P. Paritosh, T. Sturge, and J. Taylor. Freebase: a collaboratively created graph database for structuring human knowledge. In SIGMOD, pages 1247--1250, 2008.
  5. S. Brin and L. Page. The anatomy of a large-scale hypertextual Web search engine. Computer Networks and ISDN Systems, 30(1-7):107--117, 1998.
  6. M. J. Cafarella, A. Y. Halevy, D. Z. Wang, E. Wu, and Y. Zhang. Webtables: exploring the power of tables on the web. In PVLDB, 2008.
  7. A. Carlson, J. Betteridge, B. Kisiel, B. Settles, E. H. Jr., and T. Mitchell. Toward an architecture for never-ending language learning. In AAAI, 2010.
  8. N. Dalvi, A. Machanavajjhala, and B. Pang. An analysis of structured data on the web. PVLDB, 5(7):680--691, 2012.
  9. J. Dean and S. Ghemawat. MapReduce: Simplified data processing on large clusters. In OSDI, pages 137--149, 2004.
  10. X. L. Dong, L. Berti-Equille, Y. Hu, and D. Srivastava. Global detection of complex copying relationships between sources. PVLDB, 2010.
  11. X. L. Dong, L. Berti-Equille, and D. Srivastava. Integrating conflicting data: the role of source dependence. PVLDB, 2(1), 2009.
  12. X. L. Dong, L. Berti-Equille, and D. Srivastava. Truth discovery and copying detection in a dynamic world. PVLDB, 2(1), 2009.
  13. X. L. Dong and F. Naumann. Data fusion--resolving data conflicts for integration. PVLDB, 2009.
  14. X. L. Dong, B. Saha, and D. Srivastava. Less is more: Selecting sources wisely for integration. PVLDB, 6(2), 2013.
  15. J. Fleiss. Statistical methods for rates and proportions. John Wiley and Sons, 1981.
  16. L. A. Galárraga, C. Teflioudi, K. Hose, and F. Suchanek. Amie: association rule mining under incomplete evidence in ontological knowledge bases. In WWW, pages 413--422, 2013.
  17. A. Galland, S. Abiteboul, A. Marian, and P. Senellart. Corroborating information from disagreeing views. In WSDM, 2010.
  18. R. Gupta, A. Halevy, X. Wang, S. E. Whang, and F. Wu. Biperpedia: An ontology for search applications. PVLDB, 7(7), 2014.
  19. J. M. Kleinberg. Authoritative sources in a hyperlinked environment. In SODA, 1998.
  20. X. Li, X. L. Dong, K. B. Lyons, W. Meng, and D. Srivastava. Truth finding on the deep web: Is the problem solved? PVLDB, 6(2), 2013.
  21. X. Liu, X. L. Dong, B. C. Ooi, and D. Srivastava. Online data fusion. PVLDB, 4(12), 2011.
  22. J. Madhavan, D. Ko, L. Kot, V. Ganapathy, A. Rasmussen, and A. Y. Halevy. Google's deep web crawl. In PVLDB, 2008.
  23. M. Mintz, S. Bills, R. Snow, and D. Jurafsky. Distant supervision for relation extraction without labeled data. In Proc. Conf. Recent Advances in NLP, 2009.
  24. F. Niu, C. Zhang, and C. Re. Elementary: Large-scale Knowledge-base Construction via Machine Learning and Statistical Inference. Intl. J. on Semantic Web and Information Systems, 2012.
  25. J. Pasternack and D. Roth. Knowing what to believe (when you already know something). In COLING, pages 877--885, 2010.
  26. J. Pasternack and D. Roth. Making better informed trust decisions with generalized fact-finding. In IJCAI, pages 2324--2329, 2011.
  27. J. Pasternack and D. Roth. Latent credibility analysis. In WWW, 2013.
  28. R. Pochampally, A. D. Sarma, X. L. Dong, A. Meliou, and D. Srivastava. Fusing data with correlations. In SIGMOD, 2014.
  29. G.-J. Qi, C. Aggarwal, J. Han, and T. Huang. Mining collective intelligence in groups. In WWW, 2013.
  30. L. Ratinov, D. Roth, D. Downey, and M. Anderson. Local and global algorithms for disambiguation to wikipedia. In NAACL, 2011.
  31. A. Ritter, L. Zettlemoyer, Mausam, and O. Etzioni. Modeling missing data in distant supervision for information extraction. Trans. Assoc. Comp. Linguistics, 1, 2013.
  32. F. Suchanek, G. Kasneci, and G. Weikum. YAGO - A Core of Semantic Knowledge. In WWW, 2007.
  33. M. Wick, S. Singh, A. Kobren, and A. McCallum. Assessing confidence of knowledge base content with an experimental study in entity resolution. In AKBC workshop, 2013.
  34. X. Yin, J. Han, and P. S. Yu. Truth discovery with multiple conflicting information providers on the web. In Proc. of SIGKDD, 2007.
  35. X. Yin and W. Tan. Semi-supervised truth discovery. In WWW, pages 217--226, 2011.
  36. B. Zhao and J. Han. A probabilistic model for estimating real-valued truth from conflicting sources. In QDB, 2012.
  37. B. Zhao, B. I. P. Rubinstein, J. Gemmell, and J. Han. A bayesian approach to discovering truth from conflicting sources for data integration. PVLDB, 5(6):550--561, 2012.

  • Published in

    Proceedings of the VLDB Endowment, Volume 7, Issue 10
    June 2014
    146 pages
    ISSN: 2150-8097

    Publisher

    VLDB Endowment
