ABSTRACT
Technical developments, such as the web of data and web data extraction, combined with policy developments such as those relating to open government or open science, are leading to the availability of increasing numbers of data sources. Given these physical sources, it is also possible to create further virtual sources that integrate, aggregate or summarise the data from the original sources. As a result, there is a plethora of data sources, of which a small subset may be able to provide the information required to support a task. The number of available sources, and the rate at which they change, are likely to make manual source selection and curation by experts impractical for many applications, motivating a pay-as-you-go approach, in which crowds or data consumers annotate results based on their correctness or suitability, and the resulting annotations inform, e.g., source selection algorithms. However, for pay-as-you-go feedback collection to be cost-effective, it may be necessary to select judiciously the data items on which feedback is to be obtained. This paper describes OLBP (Ordering and Labelling By Precision), a heuristics-based approach to targeting data items for feedback to support mapping and source selection tasks, where users express their preferences as a trade-off between precision and recall. The proposed approach is evaluated on two different scenarios: mapping selection with synthetic data, and source selection with real data produced by web data extraction. The results demonstrate a significant reduction in the amount of feedback required to reach user-provided objectives when using OLBP.
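To make the precision-driven selection idea concrete, the sketch below shows one way feedback annotations could drive source selection: each candidate source's precision is estimated from (correct, incorrect) annotation counts, sources are ordered by that estimate, and sources are greedily admitted while the cumulative precision of the selected set stays at or above a user-supplied target, thereby maximising recall subject to the precision constraint. This is an illustrative assumption about how such a heuristic might look, not the OLBP algorithm itself; all names, data, and the greedy rule are hypothetical.

```python
# Hypothetical sketch of precision-ordered, feedback-driven source selection,
# in the spirit of (but not identical to) OLBP. All identifiers and the
# example data are illustrative assumptions.

def estimate_precision(feedback):
    """Estimate a source's precision from (correct, incorrect) feedback counts."""
    correct, incorrect = feedback
    total = correct + incorrect
    return correct / total if total else 0.0

def select_sources(sources, min_precision):
    """Order sources by estimated precision (descending) and greedily add
    each one whose inclusion keeps the cumulative estimated precision of
    the selected set at or above the user's target."""
    ordered = sorted(sources.items(),
                     key=lambda kv: estimate_precision(kv[1]["feedback"]),
                     reverse=True)
    selected, est_correct, total = [], 0.0, 0
    for name, info in ordered:
        est_p = estimate_precision(info["feedback"])
        new_correct = est_correct + est_p * info["size"]
        new_total = total + info["size"]
        if new_correct / new_total >= min_precision:
            selected.append(name)
            est_correct, total = new_correct, new_total
    return selected

# Hypothetical candidate sources: annotated feedback plus result-set size.
sources = {
    "A": {"feedback": (18, 2), "size": 100},   # est. precision 0.90
    "B": {"feedback": (12, 8), "size": 200},   # est. precision 0.60
    "C": {"feedback": (5, 15), "size": 300},   # est. precision 0.25
}
print(select_sources(sources, min_precision=0.65))  # → ['A', 'B']
```

A lower precision target admits more sources (higher recall); a stricter target, e.g. `min_precision=0.85`, retains only source `A`.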