
Crowdsourced Targeted Feedback Collection for Multicriteria Data Source Selection

Published: 04 January 2019

Abstract

A multicriteria data source selection (MCSS) scenario identifies, from a set of candidate data sources, the subset that best meets users’ needs. These needs are expressed as several criteria against which the candidate data sources are evaluated. An MCSS problem can be solved using multidimensional optimization techniques that trade off the different objectives. Sometimes, however, knowledge of how well the candidate data sources meet the criteria is uncertain. To reduce this uncertainty, one may rely on end users or crowds to annotate the data items produced by the sources in relation to the selection criteria. This article introduces a Targeted Feedback Collection (TFC) approach that aims to identify the data items on which feedback should be collected, thereby providing evidence on how well the sources satisfy the required criteria. TFC targets feedback by considering the confidence intervals around the estimated criteria values, with a view to increasing the confidence in the estimates that matter most to the multidimensional optimization. Variants of the approach have been developed for settings where feedback is expected to be reliable (e.g., provided by trusted experts) and where it is expected to be unreliable (e.g., from crowd workers). Both variants have been evaluated in experiments involving real-world datasets and crowdsourcing, with positive results reported against other approaches to feedback collection, including active learning.
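To make the confidence-interval idea concrete, the following is a minimal, hypothetical sketch of interval-guided feedback targeting in the spirit of TFC; it is not the paper’s actual algorithm. It assumes each source’s value for a single criterion (e.g., precision) is estimated as a proportion from the feedback gathered so far, uses a Wilson score interval to quantify the uncertainty of each estimate, and targets further feedback at the source whose interval overlaps most with the current leader’s, i.e., where extra evidence is most likely to change the selection. The names `wilson_interval`, `next_target`, and the example sources are illustrative only.

```python
import math

Z = 1.96  # normal quantile for a ~95% confidence interval


def wilson_interval(positives: int, n: int) -> tuple[float, float]:
    """Wilson score interval for a proportion estimated from n feedback items."""
    if n == 0:
        return (0.0, 1.0)  # no feedback yet: maximal uncertainty
    p = positives / n
    denom = 1 + Z ** 2 / n
    centre = (p + Z ** 2 / (2 * n)) / denom
    half = (Z / denom) * math.sqrt(p * (1 - p) / n + Z ** 2 / (4 * n ** 2))
    return (max(0.0, centre - half), min(1.0, centre + half))


def next_target(feedback: dict[str, tuple[int, int]]) -> str:
    """Pick the source on which to collect the next batch of feedback.

    The leader is the source with the highest interval midpoint; any other
    source whose upper bound still reaches above the leader's lower bound is
    unresolved, and the one reaching highest is the most useful target.
    """
    intervals = {s: wilson_interval(pos, n) for s, (pos, n) in feedback.items()}
    leader = max(intervals, key=lambda s: sum(intervals[s]) / 2)
    lo_leader = intervals[leader][0]
    others = [s for s in intervals if s != leader]
    if not others:
        return leader
    return max(others, key=lambda s: intervals[s][1] - lo_leader)


# (positive annotations, total feedback items) per candidate source,
# for a single selection criterion such as precision
feedback = {"src_A": (18, 20), "src_B": (8, 10), "src_C": (3, 4)}
print(next_target(feedback))  # -> 'src_C' (widest interval overlapping the leader)
```

With these example counts, `src_C` has only 4 annotations, so its wide interval still overlaps the leader `src_A`’s; collecting feedback there narrows the estimate that is most likely to affect which sources are selected.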



    • Published in

Journal of Data and Information Quality, Volume 11, Issue 1
      On the Horizon, Regular Papers and Challenge Paper
      March 2019
      60 pages
      ISSN:1936-1955
      EISSN:1936-1963
      DOI:10.1145/3303842

      Copyright © 2019 ACM

      Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. Copyrights for components of this work owned by others than ACM must be honored. Abstracting with credit is permitted. To copy otherwise, or republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee. Request permissions from [email protected]

      Publisher

      Association for Computing Machinery

      New York, NY, United States

      Publication History

      • Published: 4 January 2019
      • Accepted: 1 October 2018
      • Revised: 1 June 2018
      • Received: 1 December 2017
Published in JDIQ Volume 11, Issue 1


      Qualifiers

      • research-article
      • Research
      • Refereed
