DOI: 10.1145/2949689.2949690

Efficient Feedback Collection for Pay-as-you-go Source Selection

Published: 18 July 2016

ABSTRACT

Technical developments, such as the web of data and web data extraction, combined with policy developments such as those relating to open government or open science, are leading to the availability of increasing numbers of data sources. Indeed, given these physical sources, it is then also possible to create further virtual sources that integrate, aggregate or summarise the data from the original sources. As a result, there is a plethora of data sources, from which a small subset may be able to provide the information required to support a task. The number and rate of change in the available sources is likely to make manual source selection and curation by experts impractical for many applications, leading to the need to pursue a pay-as-you-go approach, in which crowds or data consumers annotate results based on their correctness or suitability, with the resulting annotations used to inform, e.g., source selection algorithms. However, for pay-as-you-go feedback collection to be cost-effective, it may be necessary to select judiciously the data items on which feedback is to be obtained. This paper describes OLBP (Ordering and Labelling By Precision), a heuristics-based approach to the targeting of data items for feedback to support mapping and source selection tasks, where users express their preferences in terms of the trade-off between precision and recall. The proposed approach is then evaluated on two different scenarios, mapping selection with synthetic data, and source selection with real data produced by web data extraction. The results demonstrate a significant reduction in the amount of feedback required to reach user-provided objectives when using OLBP.
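
To make the pay-as-you-go setting concrete, the sketch below illustrates one generic way of targeting feedback: candidate sources are ordered by their current precision estimate (the fraction of annotated results judged correct), feedback is requested greedily until a budget is exhausted, and the sources meeting a user-supplied precision target are then selected. This is only an illustration of the general idea under assumed simplifications; it is not the OLBP heuristics evaluated in the paper, and all names (Source, ask_user, collect_feedback, the 80% correctness rate, the budget of 40) are hypothetical.

    # Minimal sketch of precision-targeted feedback collection (Python).
    # NOT the paper's OLBP algorithm; it only illustrates ordering candidate
    # sources by estimated precision and collecting feedback until a
    # user-supplied precision target and budget are reached.
    import random

    class Source:
        def __init__(self, name, results):
            self.name = name
            self.results = results          # data items produced by the source
            self.feedback = []              # list of (item, is_correct) pairs

        def estimated_precision(self):
            """Fraction of annotated items judged correct; optimistic prior if none."""
            if not self.feedback:
                return 1.0
            correct = sum(1 for _, ok in self.feedback if ok)
            return correct / len(self.feedback)

    def ask_user(item):
        """Stand-in for a crowd/user annotation of a single result item."""
        return random.random() < 0.8        # pretend 80% of items are correct

    def collect_feedback(sources, precision_target, budget):
        """Greedily spend the feedback budget on the sources with the highest
        current precision estimates, then return those meeting the target."""
        spent = 0
        while spent < budget:
            candidates = [s for s in sources if len(s.feedback) < len(s.results)]
            if not candidates:
                break
            best = max(candidates, key=lambda s: s.estimated_precision())
            item = best.results[len(best.feedback)]
            best.feedback.append((item, ask_user(item)))
            spent += 1
        return [s for s in sources if s.estimated_precision() >= precision_target]

    if __name__ == "__main__":
        srcs = [Source(f"s{i}", [f"s{i}_item{j}" for j in range(20)]) for i in range(5)]
        selected = collect_feedback(srcs, precision_target=0.75, budget=40)
        print([s.name for s in selected])

Concentrating the budget on the most promising candidates, rather than annotating items uniformly, is the intuition behind targeted feedback collection: fewer annotations are needed to decide which sources satisfy the user's precision objective.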


Published in
SSDBM '16: Proceedings of the 28th International Conference on Scientific and Statistical Database Management
July 2016, 290 pages
Copyright © 2016 ACM


Publisher: Association for Computing Machinery, New York, NY, United States


Qualifiers: research-article, refereed limited
Overall acceptance rate: 56 of 146 submissions (38%)
