Abstract
A multicriteria data source selection (MCSS) scenario identifies, from a set of candidate data sources, the subset that best meets users’ needs. These needs are expressed as several criteria against which the candidate data sources are evaluated. An MCSS problem can be solved using multidimensional optimization techniques that trade off the different objectives. However, knowledge of how well the candidate data sources meet the criteria may be uncertain. To reduce this uncertainty, one can rely on end users or crowds to annotate the data items produced by the sources in relation to the selection criteria. This article introduces a Targeted Feedback Collection (TFC) approach that identifies the data items on which feedback should be collected, thereby providing evidence on how well the sources satisfy the required criteria. TFC targets feedback by considering the confidence intervals around the estimated criterion values, with a view to increasing the confidence in the estimates that are most relevant to the multidimensional optimization. Variants of the TFC approach have been developed for settings where feedback is expected to be reliable (e.g., provided by trusted experts) and where it is expected to be unreliable (e.g., provided by crowd workers). Both variants have been evaluated, and positive results are reported against other approaches to feedback collection, including active learning, in experiments involving real-world datasets and crowdsourcing.
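The core idea of targeting feedback by confidence intervals can be illustrated with a minimal sketch. This is not the authors’ algorithm, only an assumed simplification: each source’s criterion value (here, a precision-like proportion) is estimated from the feedback collected so far, a normal-approximation confidence interval is attached to the estimate, and the next round of feedback is directed at the source whose interval is widest, i.e., whose estimate is least certain. The function names and the toy feedback counts are hypothetical.

```python
import math

Z = 1.96  # z-score for a 95% confidence interval

def estimate_with_ci(positives, total):
    """Estimate a proportion-style criterion value from feedback counts.

    Returns (estimate, half_width), where half_width is the
    normal-approximation confidence-interval half width.
    """
    if total == 0:
        return 0.5, 0.5  # no evidence yet: maximal uncertainty
    p = positives / total
    half_width = Z * math.sqrt(p * (1 - p) / total)
    return p, half_width

def next_target(feedback):
    """Pick the source with the widest interval to collect feedback on next.

    feedback maps source name -> (positive annotations, total annotations).
    """
    return max(feedback, key=lambda s: estimate_with_ci(*feedback[s])[1])

# Source A has much evidence, C has little, so C's interval is widest.
feedback = {"A": (18, 20), "B": (5, 10), "C": (2, 4)}
print(next_target(feedback))  # -> C
```

In the article's setting the choice is driven by the multidimensional optimization rather than raw interval width alone (only sources whose intervals overlap the trade-off frontier matter), but the sketch shows the basic mechanism: feedback is spent where it most reduces the uncertainty that affects the selection.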
Index Terms
- Crowdsourced Targeted Feedback Collection for Multicriteria Data Source Selection