ABSTRACT
Technical developments, such as the web of data and web data extraction, combined with policy developments such as those relating to open government or open science, are leading to the availability of increasing numbers of data sources. Given these physical sources, it is also possible to create further virtual sources that integrate, aggregate or summarise the data from the original sources. As a result, there is a plethora of data sources, of which a small subset may be able to provide the information required to support a task. The number of available sources, and the rate at which they change, are likely to make manual source selection and curation by experts impractical for many applications, motivating a pay-as-you-go approach, in which crowds or data consumers annotate results based on their correctness or suitability, and the resulting annotations inform, e.g., source selection algorithms. However, for pay-as-you-go feedback collection to be cost-effective, it may be necessary to select judiciously the data items on which feedback is to be obtained. This paper describes OLBP (Ordering and Labelling By Precision), a heuristics-based approach to targeting data items for feedback to support mapping and source selection tasks, where users express their preferences as a trade-off between precision and recall. The proposed approach is evaluated on two different scenarios: mapping selection with synthetic data, and source selection with real data produced by web data extraction. The results demonstrate a significant reduction in the amount of feedback required to reach user-provided objectives when using OLBP.
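To make the precision-driven selection idea concrete, the sketch below shows one way feedback annotations could drive source selection: each candidate source's precision is estimated from (correct, incorrect) annotation counts, sources are ordered by that estimate, and sources are greedily admitted while the cumulative precision of the selected set stays at or above a user-supplied target, thereby maximising recall subject to the precision constraint. This is an illustrative assumption about how such a heuristic might look, not the OLBP algorithm itself; all names, data, and the greedy rule are hypothetical.

```python
# Hypothetical sketch of precision-ordered, feedback-driven source selection,
# in the spirit of (but not identical to) OLBP. All identifiers and the
# example data are illustrative assumptions.

def estimate_precision(feedback):
    """Estimate a source's precision from (correct, incorrect) feedback counts."""
    correct, incorrect = feedback
    total = correct + incorrect
    return correct / total if total else 0.0

def select_sources(sources, min_precision):
    """Order sources by estimated precision (descending) and greedily add
    each one whose inclusion keeps the cumulative estimated precision of
    the selected set at or above the user's target."""
    ordered = sorted(sources.items(),
                     key=lambda kv: estimate_precision(kv[1]["feedback"]),
                     reverse=True)
    selected, est_correct, total = [], 0.0, 0
    for name, info in ordered:
        est_p = estimate_precision(info["feedback"])
        new_correct = est_correct + est_p * info["size"]
        new_total = total + info["size"]
        if new_correct / new_total >= min_precision:
            selected.append(name)
            est_correct, total = new_correct, new_total
    return selected

# Hypothetical candidate sources: annotated feedback plus result-set size.
sources = {
    "A": {"feedback": (18, 2), "size": 100},   # est. precision 0.90
    "B": {"feedback": (12, 8), "size": 200},   # est. precision 0.60
    "C": {"feedback": (5, 15), "size": 300},   # est. precision 0.25
}
print(select_sources(sources, min_precision=0.65))  # → ['A', 'B']
```

A lower precision target admits more sources (higher recall); a stricter target, e.g. `min_precision=0.85`, retains only source `A`.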