ABSTRACT
Data integration is challenging because of the large number of autonomous data sources involved. This makes it necessary to develop techniques for reasoning about the benefits and costs of acquiring and integrating data. Recently, the problem of source selection (i.e., identifying the subset of sources that maximizes the profit from integration) was introduced as a preprocessing step before the actual integration. That problem was studied for static sources, with the accuracy of data fusion used to quantify the integration profit.
In this paper, we study the problem of source selection for dynamic data sources whose content changes over time. We define a set of time-dependent metrics, including coverage, freshness, and accuracy, to characterize the quality of integrated data, and we show how statistical models of source evolution can be used to estimate these metrics. While source selection is NP-complete, we show that near-optimal solutions can be found for a large class of practical cases. We propose an algorithmic framework with theoretical guarantees for our problem and demonstrate its effectiveness through an extensive experimental evaluation on both real-world and synthetic data.
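To make the notion of "profit from integration" concrete, the following is a minimal sketch of source selection under simplifying assumptions. It is not the algorithmic framework of the paper. Profit is assumed here to be coverage (the fraction of a known set of world entities present in at least one selected source) minus the summed cost of the selected sources, and a plain greedy heuristic stands in for the selection procedure. The names `coverage`, `greedy_select`, `sources`, `costs`, and `world_size` are all hypothetical, and the sketch ignores freshness, accuracy, and change over time.

```python
from typing import Dict, List, Set, Tuple


def coverage(selected: List[str], sources: Dict[str, Set[int]], world_size: int) -> float:
    """Fraction of world entities present in at least one selected source."""
    covered: Set[int] = set()
    for name in selected:
        covered |= sources[name]
    return len(covered) / world_size


def greedy_select(
    sources: Dict[str, Set[int]],
    costs: Dict[str, float],
    world_size: int,
) -> Tuple[List[str], float]:
    """Greedily add the source that most improves profit = coverage - total cost.

    Assumption: this profit definition and the greedy rule are illustrative only.
    The loop stops as soon as no remaining source improves the profit.
    """
    selected: List[str] = []
    remaining = set(sources)
    current_profit = 0.0  # profit of the empty selection
    while remaining:
        best_name = None
        best_profit = current_profit
        for name in sorted(remaining):  # sorted for deterministic iteration order
            candidate = selected + [name]
            profit = coverage(candidate, sources, world_size) - sum(costs[s] for s in candidate)
            if profit > best_profit + 1e-12:
                best_name, best_profit = name, profit
        if best_name is None:
            break  # no source yields a strictly positive marginal profit
        selected.append(best_name)
        remaining.remove(best_name)
        current_profit = best_profit
    return selected, current_profit


# Toy input: four sources over a world of 10 entities (made-up data).
toy_sources = {"A": {1, 2, 3, 4}, "B": {3, 4, 5}, "C": {5, 6}, "D": {1}}
toy_costs = {"A": 0.10, "B": 0.05, "C": 0.12, "D": 0.20}
chosen, profit = greedy_select(toy_sources, toy_costs, world_size=10)
print(chosen, round(profit, 2))
```

On this toy input the heuristic picks source A first and then C, and it skips B and D because their added coverage does not pay for their cost. A greedy rule of this kind carries no general optimality guarantee when the objective can decrease as sources are added, which is one reason the problem calls for more careful algorithms.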
Index Terms
- Characterizing and selecting fresh data sources