Characterizing and selecting fresh data sources

Authors:
Theodoros Rekatsinas, University of Maryland, College Park, MD, USA
Xin Luna Dong, Google Inc., Mountain View, CA, USA
Divesh Srivastava, AT&T Labs-Research, Bedminster, NJ, USA

SIGMOD '14: Proceedings of the 2014 ACM SIGMOD International Conference on Management of Data, June 2014, Pages 919–930
https://doi.org/10.1145/2588555.2610504

Published: 18 June 2014

ABSTRACT

Data integration is a challenging task due to the large number of autonomous data sources. This necessitates the development of techniques to reason about the benefits and costs of acquiring and integrating data. Recently, the problem of source selection (i.e., identifying the subset of sources that maximizes the profit from integration) was introduced as a preprocessing step before the actual integration. That problem was studied for static sources, with the accuracy of data fusion used to quantify the integration profit.

In this paper, we study the problem of source selection considering dynamic data sources whose content changes over time. We define a set of time-dependent metrics, including coverage, freshness and accuracy, to characterize the quality of integrated data. We show how statistical models for the evolution of sources can be used to estimate these metrics. Although source selection is NP-complete, we show that near-optimal solutions can be found for a large class of practical cases. We propose an algorithmic framework with theoretical guarantees for our problem and demonstrate its effectiveness with an extensive experimental evaluation on both real-world and synthetic data.
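
To make the notion of time-dependent source selection concrete, the following is a minimal Python sketch under loudly stated assumptions. It is not the algorithmic framework of the paper: it swaps in a plain greedy heuristic, and the profit function (coverage at time `t` minus the total cost of the selected sources), the data layout, and all names (`coverage_at`, `profit_at`, `greedy_select`, `example_sources`) are hypothetical choices made only for illustration.

```python
from typing import Dict, List, Set, Tuple

# Hypothetical layout: source name -> (entity -> time the source first lists it, cost).
Sources = Dict[str, Tuple[Dict[int, int], float]]


def coverage_at(selected: List[str], sources: Sources, world: Set[int], t: int) -> float:
    """Fraction of world entities listed by at least one selected source by time t."""
    covered: Set[int] = set()
    for name in selected:
        listed_at, _cost = sources[name]
        covered |= {e for e, first_seen in listed_at.items() if first_seen <= t}
    return len(covered & world) / len(world)


def profit_at(selected: List[str], sources: Sources, world: Set[int], t: int) -> float:
    """Assumed profit: coverage at time t minus the summed cost of selected sources."""
    cost = sum(sources[name][1] for name in selected)
    return coverage_at(selected, sources, world, t) - cost


def greedy_select(sources: Sources, world: Set[int], t: int) -> List[str]:
    """Repeatedly add the source with the largest positive marginal profit."""
    selected: List[str] = []
    current = 0.0  # profit of the empty selection
    while True:
        best_name, best_profit = None, current
        for name in sorted(sources):  # sorted for deterministic tie-breaking
            if name in selected:
                continue
            p = profit_at(selected + [name], sources, world, t)
            if p > best_profit + 1e-12:
                best_name, best_profit = name, p
        if best_name is None:
            return selected  # no remaining source improves the profit
        selected.append(best_name)
        current = best_profit


# Toy data (invented): ten entities and four sources with different delays and costs.
example_world = set(range(1, 11))
example_sources: Sources = {
    "A": ({e: 0 for e in range(1, 7)}, 0.1),
    "B": ({e: 2 for e in range(5, 10)}, 0.1),
    "C": ({1: 0, 2: 0}, 0.3),
    "D": ({10: 1}, 0.05),
}

if __name__ == "__main__":
    for t in (1, 3):
        print(t, greedy_select(example_sources, example_world, t))
```

In this toy setup the chosen set depends on the time point: source B lists nothing until time 2, so it only becomes worth its cost for later values of `t`. A greedy rule of this kind carries no approximation guarantee in general; it is shown here only to illustrate how a time-dependent quality metric and a cost can be traded off.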

Index Terms

Characterizing and selecting fresh data sources
1. Information systems
  1. Data management systems
2. Mathematics of computing
  1. Probability and statistics

Published in
SIGMOD '14: Proceedings of the 2014 ACM SIGMOD International Conference on Management of Data
June 2014
1645 pages
ISBN: 9781450323765
DOI: 10.1145/2588555
General Chairs: Curtis Dyreson (Utah State University, USA), Feifei Li (University of Utah, USA)
Program Chair: M. Tamer Özsu (University of Waterloo, Canada)
Copyright © 2014 ACM
Publisher
Association for Computing Machinery
New York, NY, United States
Author Tags
data integration
dynamic data sources
source selection
Qualifiers
- research-article
Acceptance Rates
SIGMOD '14 Paper Acceptance Rate: 107 of 421 submissions, 25%
Overall Acceptance Rate: 785 of 4,003 submissions, 20%