
Less is more: selecting sources wisely for integration

Published: 01 December 2012

Abstract

We are often thrilled by the abundance of information surrounding us and wish to integrate data from as many sources as possible. However, understanding, analyzing, and using these data are often hard. Too much data can introduce a huge integration cost, such as expenses for purchasing data and resources for integration and cleaning. Furthermore, including low-quality data can even deteriorate the quality of integration results instead of bringing the desired quality gain. Thus, "the more the better" does not always hold for data integration and often "less is more".

In this paper, we study how to select a subset of sources before integration such that we can balance the quality of integrated data and integration cost. Inspired by the Marginalism principle in economic theory, we wish to integrate a new source only if its marginal gain, often a function of improved integration quality, is higher than the marginal cost, associated with data-purchase expense and integration resources. As a first step towards this goal, we focus on data fusion tasks, where the goal is to resolve conflicts from different sources. We propose a randomized solution for selecting sources for fusion and show empirically its effectiveness and scalability on both real-world data and synthetic data.
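
The marginalism rule stated above (add a source only while its marginal gain exceeds its marginal cost) can be illustrated with a small sketch. This is a plain greedy loop written for illustration, not the randomized solution the paper proposes. The names `select_sources`, `toy_gain`, `toy_cost`, the per-source accuracies, and the flat per-source cost are all hypothetical assumptions that do not come from the paper.

```python
import math
from typing import Callable, FrozenSet, Iterable, Set


def select_sources(
    candidates: Iterable[str],
    gain: Callable[[FrozenSet[str]], float],
    cost: Callable[[FrozenSet[str]], float],
) -> Set[str]:
    """Greedily add the source with the largest positive (marginal gain - marginal cost)."""
    selected: Set[str] = set()
    remaining = set(candidates)
    while remaining:
        current = frozenset(selected)
        best_source, best_profit = None, 0.0
        for source in sorted(remaining):  # sorted for deterministic tie-breaking
            extended = current | {source}
            marginal_gain = gain(extended) - gain(current)
            marginal_cost = cost(extended) - cost(current)
            profit = marginal_gain - marginal_cost
            if profit > best_profit:
                best_source, best_profit = source, profit
        if best_source is None:
            # No remaining source is worth its cost: stop integrating.
            break
        selected.add(best_source)
        remaining.remove(best_source)
    return selected


# Hypothetical per-source accuracies (assumed for illustration only).
accuracy = {"A": 0.8, "B": 0.6, "C": 0.3}


def toy_gain(sources: FrozenSet[str]) -> float:
    # Assumed quality model with diminishing returns: the chance that
    # at least one selected source provides the correct value.
    return 1.0 - math.prod(1.0 - accuracy[s] for s in sources)


def toy_cost(sources: FrozenSet[str]) -> float:
    # Assumed flat cost of 0.1 per integrated source.
    return 0.1 * len(sources)


print(select_sources(accuracy, toy_gain, toy_cost))
```

With these toy numbers the loop selects A and B and rejects C, whose marginal gain (0.024) falls below its marginal cost (0.1). A greedy pass like this can stop at a local optimum, which is one reason a randomized search may be preferable in practice.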


Published in: Proceedings of the VLDB Endowment, Volume 6, Issue 2, December 2012, 120 pages
Publisher: VLDB Endowment

      PDF Format

      View or Download as a PDF file.

      PDF

      eReader

      View online with eReader.

      eReader