DOI: 10.1145/2588555.2610504
research-article

Characterizing and selecting fresh data sources

Published: 18 June 2014

ABSTRACT

Data integration is a challenging task because of the large number of autonomous data sources. This necessitates the development of techniques to reason about the benefits and costs of acquiring and integrating data. Recently, the problem of source selection (i.e., identifying the subset of sources that maximizes the profit from integration) was introduced as a preprocessing step before the actual integration. That work studied static sources and used the accuracy of data fusion to quantify the integration profit.

In this paper, we study the problem of source selection for dynamic data sources whose content changes over time. We define a set of time-dependent metrics, including coverage, freshness and accuracy, to characterize the quality of integrated data. We show how statistical models for the evolution of sources can be used to estimate these metrics. While source selection is NP-complete, we show that near-optimal solutions can be found for a large class of practical cases. We propose an algorithmic framework with theoretical guarantees for our problem and show its effectiveness with an extensive experimental evaluation on both real-world and synthetic data.
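
To make the setting concrete, the following is a minimal sketch of source selection over time-stamped sources. It is not the paper's framework: the metric definitions, the equal weighting of coverage and freshness, the per-source costs, and the names `integrate`, `coverage`, `freshness`, `profit`, and `greedy_select` are all assumptions made for illustration. The selection step is a plain greedy heuristic that adds the source with the largest profit gain until no source improves the profit; it carries no approximation guarantee here.

```python
def integrate(selected_sources):
    """Merge sources, keeping the most recently observed value per entity."""
    merged = {}
    for src in selected_sources:
        for entity, (value, ts) in src.items():
            if entity not in merged or ts > merged[entity][1]:
                merged[entity] = (value, ts)
    return {entity: value for entity, (value, _) in merged.items()}


def coverage(integrated, world):
    """Assumed definition: share of real-world entities present in the result."""
    if not world:
        return 1.0
    return len(integrated.keys() & world.keys()) / len(world)


def freshness(integrated, world):
    """Assumed definition: share of real-world entities with the current value."""
    if not world:
        return 1.0
    fresh = sum(1 for entity, value in world.items()
                if integrated.get(entity) == value)
    return fresh / len(world)


def profit(selected, sources, costs, world):
    """Assumed profit: equally weighted quality gain minus total source cost."""
    merged = integrate([sources[name] for name in selected])
    gain = 0.5 * coverage(merged, world) + 0.5 * freshness(merged, world)
    return gain - sum(costs[name] for name in selected)


def greedy_select(sources, costs, world):
    """Greedy heuristic: add the best source while the profit still improves."""
    selected = set()
    best = profit(selected, sources, costs, world)
    while True:
        candidates = [(profit(selected | {name}, sources, costs, world), name)
                      for name in sorted(sources) if name not in selected]
        if not candidates:
            break
        top_profit, top_name = max(candidates)
        if top_profit <= best:
            break
        selected.add(top_name)
        best = top_profit
    return selected, best


# Hypothetical snapshot of the world at the time of integration.
world = {"a": 1, "b": 2, "c": 3, "d": 4}
# Each source maps entity -> (value, time the value was observed).
sources = {
    "s1": {"a": (1, 10), "b": (2, 10)},
    "s2": {"b": (0, 5), "c": (3, 9)},   # holds a stale value for "b"
    "s3": {"a": (1, 10)},               # redundant with s1 and more expensive
}
costs = {"s1": 0.1, "s2": 0.1, "s3": 0.3}

selected, best = greedy_select(sources, costs, world)
print(sorted(selected), round(best, 2))
```

In this toy instance the heuristic picks `s1` and `s2` and skips `s3`, whose cost outweighs its contribution because it adds no entity that `s1` does not already provide.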

      • Published in

        SIGMOD '14: Proceedings of the 2014 ACM SIGMOD International Conference on Management of Data
        June 2014
        1645 pages
        ISBN:9781450323765
        DOI:10.1145/2588555

        Copyright © 2014 ACM

        Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. Copyrights for components of this work owned by others than ACM must be honored. Abstracting with credit is permitted. To copy otherwise, or republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee. Request permissions from ACM.

        Publisher

        Association for Computing Machinery

        New York, NY, United States

        Acceptance Rates

        SIGMOD '14 Paper Acceptance Rate: 107 of 421 submissions, 25%
        Overall Acceptance Rate: 785 of 4,003 submissions, 20%
