Abstract
A multicriteria data source selection (MCSS) scenario identifies, from a set of candidate data sources, the subset that best meets users’ needs. These needs are expressed as several criteria against which the candidate data sources are evaluated. An MCSS problem can be solved using multidimensional optimization techniques that trade off the different objectives. However, knowledge of how well the candidate data sources meet the criteria may be uncertain. To reduce this uncertainty, one can rely on end users or crowds to annotate the data items produced by the sources in relation to the selection criteria. This article introduces a Targeted Feedback Collection (TFC) approach that identifies the data items on which feedback should be collected, thereby providing evidence on how well the sources satisfy the required criteria. TFC targets feedback by considering the confidence intervals around the estimated criterion values, with a view to increasing the confidence in the estimates that are most relevant to the multidimensional optimization. Variants of the TFC approach have been developed for settings where feedback is expected to be reliable (e.g., provided by trusted experts) and where it is expected to be unreliable (e.g., provided by crowd workers). Both variants have been evaluated, and positive results are reported against other approaches to feedback collection, including active learning, in experiments involving real-world datasets and crowdsourcing.
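The core idea of targeting feedback by confidence intervals can be illustrated with a minimal sketch. This is not the authors’ algorithm, only an assumed simplification: each source’s criterion value (here, a precision-like proportion) is estimated from the feedback collected so far, a normal-approximation confidence interval is attached to the estimate, and the next round of feedback is directed at the source whose interval is widest, i.e., whose estimate is least certain. The function names and the toy feedback counts are hypothetical.

```python
import math

Z = 1.96  # z-score for a 95% confidence interval

def estimate_with_ci(positives, total):
    """Estimate a proportion-style criterion value from feedback counts.

    Returns (estimate, half_width), where half_width is the
    normal-approximation confidence-interval half width.
    """
    if total == 0:
        return 0.5, 0.5  # no evidence yet: maximal uncertainty
    p = positives / total
    half_width = Z * math.sqrt(p * (1 - p) / total)
    return p, half_width

def next_target(feedback):
    """Pick the source with the widest interval to collect feedback on next.

    feedback maps source name -> (positive annotations, total annotations).
    """
    return max(feedback, key=lambda s: estimate_with_ci(*feedback[s])[1])

# Source A has much evidence, C has little, so C's interval is widest.
feedback = {"A": (18, 20), "B": (5, 10), "C": (2, 4)}
print(next_target(feedback))  # -> C
```

In the article's setting the choice is driven by the multidimensional optimization rather than raw interval width alone (only sources whose intervals overlap the trade-off frontier matter), but the sketch shows the basic mechanism: feedback is spent where it most reduces the uncertainty that affects the selection.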
Index Terms
- Crowdsourced Targeted Feedback Collection for Multicriteria Data Source Selection