ABSTRACT
As key decisions are often made based on information contained in a database, it is important for the database to be as complete and correct as possible. For this reason, many data cleaning tools have been developed to automatically resolve inconsistencies in databases. However, data cleaning tools provide only best-effort results and usually cannot eradicate all errors that may exist in a database. Even more importantly, existing data cleaning tools do not typically address the problem of determining what information is missing from a database.
To overcome the limitations of existing data cleaning techniques, we present QOCO, a novel query-oriented system for cleaning data with oracles. Under this framework, incorrect (resp. missing) tuples are removed from (added to) the result of a query through edits that are applied to the underlying database, where the edits are derived by interacting with domain experts which we model as oracle crowds. We show that the problem of determining minimal interactions with oracle crowds to derive database edits for removing (adding) incorrect (missing) tuples to the result of a query is NP-hard in general and present heuristic algorithms that interact with oracle crowds. Finally, we implement our algorithms in our prototype system QOCO and show that it is effective and efficient through a comprehensive suite of experiments.
- Y. Altowim, D. V. Kalashnikov, and S. Mehrotra. Progressive approach to relational entity resolution. Proceedings of the VLDB Endowment, 7(11), 2014. Google ScholarDigital Library
- Y. Amsterdamer, Y. Grossman, T. Milo, and P. Senellart. Crowd mining. In SIGMOD, pages 241--252, 2013. Google ScholarDigital Library
- T. Arora, R. Ramakrishnan, W. G. Roth, P. Seshadri, and D. Srivastava. Explaining program execution in deductive systems. In DOOD, pages 101--119, 1993.Google ScholarCross Ref
- I. Bhattacharya and L. Getoor. Collective entity resolution in relational data. ACM Transactions on Knowledge Discovery from Data (TKDD), 1(1):5, 2007. Google ScholarDigital Library
- N. Bidoit, M. Herschel, and K. Tzompanaki. Query-based why-not provenance with nedexplain. In EDBT, pages 145--156, 2014.Google Scholar
- M. Bilenko and R. J. Mooney. Adaptive duplicate detection using learnable string similarity measures. KDD, pages 39--48, 2003. Google ScholarDigital Library
- P. Buneman, J. Cheney, W. C. Tan, and S. Vansummeren. Curated databases. In ACM PODS, pages 1--12, 2008. Google ScholarDigital Library
- P. Buneman, S. Khanna, and W. C. Tan. Why and where: A characterization of data provenance. In ICDT, pages 316--330, 2001. Google ScholarDigital Library
- P. Buneman, S. Khanna, and W. C. Tan. On propagation of deletions and annotations through views. In PODS, pages 150--158, 2002. Google ScholarDigital Library
- A. Chapman and H. V. Jagadish. Why not? In SIGMOD, pages 523--534, 2009. Google ScholarDigital Library
- M. Charika, S. Chaudhuri, R. Motwani, and V. R. Narasayya. Towards estimation error guarantees for distinct values. PODS, pages 268--279, 2000. Google ScholarDigital Library
- J. Cheney, L. Chiticariu, and W. C. Tan. Provenance in databases: Why, how, and where. Foundations and Trends in Databases, 1(4):379--474, 2009. Google ScholarDigital Library
- Cia world fact book. https://www.cia.gov/library/publications/the-world-factbook/.Google Scholar
- Y. Cui and J. Widom. Run-time translation of view tuple deletions using data lineage. Technical Report 2001--24, Stanford InfoLab, 2001.Google Scholar
- Y. Cui, J. Widom, and J. L. Wiener. Tracing the lineage of view data in a warehousing environment. ACM TODS, 25(2):179--227, 2000. Google ScholarDigital Library
- T. Dasu and T. Johnson. Exploratory data mining and data cleaning. In Wiley, 2003. Google ScholarDigital Library
- P. Domingos. Multi-relational record linkage. In In Proceedings of the KDD-2004 Workshop on Multi-Relational Data Mining. Citeseer, 2004.Google Scholar
- J. Edmonds and R. M. Karp. Theoretical improvements in algorithmic efficiency for network flow problems. JACM, 19(2):284--264, 1972. Google ScholarDigital Library
- J. Evermanna and J. Fangb. Evaluating ontologies: Towards a cognitive measure of quality. Information Systems, 35(4):391--403, 2010. Google ScholarDigital Library
- W. Fan and F. Geerts. Foundations of Data Quality Management. Synthesis Lectures on Data Management. 2012. Google ScholarDigital Library
- W. Fan, J. Li, S. Ma, N. Tang, and W. Yu. Cerfix: A system for cleaning data with certain fixes. In PVLDB, pages 1375--1378, 2011.Google ScholarDigital Library
- U. M. Fayyad. Mining databases: Towards algorithms for knowledge discovery. IEEE Data Eng. Bull., 21(1):39--48, 1998.Google Scholar
- U. Feige, M. Langberg, and K. Nissim. On the hardness of approximating NP witnesses. In K. Jansen and S. Khuller, editors, Approximation Algorithms for Combinatorial Optimization, volume 1913 of LNCS, pages 120--131. Springer, 2000. Google ScholarDigital Library
- Fifa official site. http://www.fifa.com/.Google Scholar
- Fifa trivia quizzes and games. http://www.sporcle.com/games/g/fifaworldcup.Google Scholar
- Fifa world cup trivia. http://en.trivia.fifa.com/.Google Scholar
- M. J. Franklin, D. Kossmann, T. Kraska, S. Ramesh, and R. Xin. Crowddb: Answering queries with crowdsourcing. In SIGMOD, pages 61--72, 2011. Google ScholarDigital Library
- V. Ganti and A. D. Sarma. Data Cleaning: A Practical Perspective. Synthesis Lectures on Data Management. Morgan & Claypool Publishers, 2013. Google ScholarDigital Library
- S. Ghosh, N. Sharma, F. Benevenuto, N. Ganguly, and K. Gummadi. Cognos: crowdsourcing search for topic experts in microblogs. SIGIR, pages 575--590, 2012. Google ScholarDigital Library
- T. J. Green, G. Karvounarakis, and V. Tannen. Provenance semirings. In ACM PODS, pages 31--40, 2007. Google ScholarDigital Library
- Z. He and E. Lo. Answering why-not questions on top-k queries. In ICDE, pages 750--761, 2012. Google ScholarDigital Library
- M. Herschel and M. A. Hernández. Explaining missing answers to SPJUA queries. PVLDB, 3(1):185--196, 2010. Google ScholarDigital Library
- M. Herschel, M. A. Hernández, and W. C. Tan. Artemis: A system for analyzing missing answers. PVLDB, 2(2):1550--1553, 2009. Google ScholarDigital Library
- J. Huang, T. Chen, A. Doan, and J. F. Naughton. On the provenance of non-answers to queries over extracted data. PVLDB, 1(1):736--747, 2008. Google ScholarDigital Library
- N. Q. V. Hung, N. T. Tam, Z. Mikló, K. Aberer, A. Gal, and M. Weidlich. Pay-as-you-go reconciliation in schema matching networks. In ICDE, pages 220--231, 2014.Google Scholar
- P. G. Ipeirotis, F. Provost, and J. Wang. Quality management on amazon mechanical turk. HCOMP, pages 64--67, 2010. Google ScholarDigital Library
- J. Jiao, J. Yan, H. Zhao, and W. Fan. Expertrank: An expert user ranking algorithm in online communities. NISS, pages 674--679, 2009. Google ScholarDigital Library
- B. Kanagal, J. Li, and A. Deshpande. Sensitivity analysis and explanations for robust query evaluation in probabilistic databases. In ACM SIGMOD, pages 841--852. ACM, 2011. Google ScholarDigital Library
- D. R. Karger, S. Oh, and D. Shah. Iterative learning for reliable crowdsourcing systems. NIPS, pages 1953--1961, 2011.Google ScholarDigital Library
- R. M. Karp. Reducibility among combinatorial problems. In R. E. Miller and J. W. Thatcher, editors, Complexity of Computer Computations, pages 85--103, Plenum Press, New York, 1972.Google ScholarCross Ref
- B. Kimelfeld, J. Vondrák, and R. Williams. Maximizing conjunctive views in deletion propagation. In ACM PODS, pages 187--198, 2011. Google ScholarDigital Library
- A. Marcus, D. Karger, S. Madden, R. Miller, and S. Oh. Counting with the crowd. PVLDB, 6(2):109--120, 2012. Google ScholarDigital Library
- R. Mccann, W. Shen, and A. Doan. Matching schemas in online communities: A web 2.0 approach. In ICDE, pages 110--119, 2008. Google ScholarDigital Library
- A. Meliou, W. Gatterbauer, K. F. Moore, and D. Suciu. The complexity of causality and responsibility for query answers and non-answers. PVLDB, 4(1):34--45, 2010. Google ScholarDigital Library
- R. J. Miller, L. M. Haas, and M. A. Hernández. Schema mapping as query discovery. In VLDB, pages 77--88, 2000. Google ScholarDigital Library
- T. Milo and S. Zohar. Using schema matching to simplify heterogeneous data translation. In VLDB, volume 98, pages 24--27. Citeseer, 1998. Google ScholarDigital Library
- A. Parameswaran, H. Garcia-Molina, H. Park, N. Polyzotis, A. Ramesh, and J. Widom. Crowdscreen: Algorithms for filtering data with humans. In SIGMOD, pages 361--372, 2012. Google ScholarDigital Library
- A. G. Parameswaran, S. Boyd, H. Garcia-Molina, A. Gupta, N. Polyzotis, and J. Widom. Optimal crowd-powered rating and filtering algorithms. VLDB, 7(9):685--696, 2014. Google ScholarDigital Library
- H. Park, R. Pang, A. Parameswaran, H. Garcia-Molina, N. Polyzotis, and J. Widom. An overview of the deco system: data model and query language; query processing and optimization. In SIGMOD, pages 22--27, 2012. Google ScholarDigital Library
- H. Park and J. Widom. Crowdfill: collecting structured data from the crowd. In SIGMOD, pages 577--588, 2014. Google ScholarDigital Library
- M. J. Raddick, G. Bracey, P. L. Gay, C. J. Lintott, P. Murray, K. Schawinski, A. S. Szalay, and J. Vandenberg. Galaxy zoo: exploring the motivations of citizen science volunteers. Astronomy Education Review, 9(1), 2010.Google ScholarCross Ref
- E. Rahm and H. H. Do. Data cleaning: Problems and current approaches. IEEE Data Eng. Bull., 23(4):3--13, 2000.Google Scholar
- V. Raman and J. M. Hellerstein. Potter's wheel: An interactive data cleaning system. In VLDB, pages 381--390, 2001. Google ScholarDigital Library
- V. C. Raykar, S. Yu, L. H. Zhao, A. Jerebko, C. Florin, G. H. Valadez, L. Bogoni, and L. Moy. Supervised learning from multiple experts: whom to trust when everyone lies a bit. ICML, 2009. Google ScholarDigital Library
- C. Sapia, G. Höfling, M. Müller, C. Hausdorf, H. Stoyan, and U. Grimmer. On supporting the data warehouse design by data mining techniques. In Proc. GI-Workshop Data Mining and Data Warehousing, page 63. Citeseer, 1999.Google Scholar
- S. Sarawagi. Explaining differences in multidimensional aggregates. In VLDB, pages 42--53, 1999. Google ScholarDigital Library
- F. M. Suchanek, G. Kasneci, and G. Weikum. Yago: a core of semantic knowledge unifying wordnet and wikipedia. In WWW, pages 697--706, 2007. Google ScholarDigital Library
- Q. T. Tran and C.-Y. Chan. How to conquer why-not questions. In SIGMOD Conference, pages 15--26, 2010. Google ScholarDigital Library
- B. Trushkowsky, T. Kraska, M. J. Franklin, and P. Sarkar. Crowdsourced enumeration queries. In ICDE, pages 673--684, 2013. Google ScholarDigital Library
- Uniprot. http://www.uniprot.org/.Google Scholar
- J. Wang, T. Kraska, M. J. Franklin, and J. Feng. Crowder: Crowdsourcing entity resolution. PVLDB, 5(10):1483--1494, 2012. Google ScholarDigital Library
- J. Wang, S. Krishnan, M. J. Franklin, K. Goldberg, T. Kraska, and T. Milo. A sample-and-clean framework for fast and accurate query processing on dirty data. In SIGMOD, pages 469--480, 2014. Google ScholarDigital Library
- Open football. https://github.com/openfootball/.Google Scholar
- World cup history. http://www.worldcup-history.com/.Google Scholar
- S. E. Whang, H. Garcia-Molina, and P. Lofgren. Question selection for crowd entity resolution. PVLDB, 6(6):349--360, 2013. Google ScholarDigital Library
- J. Xu, D. V. Kalashnikov, and S. Mehrotra. Query aware determinization of uncertain objects. IEEE Trans. Knowl. Data Eng., 27(1):207--221, 2015.Google ScholarCross Ref
- C. J. Zhang, Z. Zhao, L. Chen, H. V. Jagadish, and C. C. Cao. Crowdmatcher: crowd-assisted schema matching. In SIGMOD, pages 721--724, 2014. Google ScholarDigital Library
Index Terms
- Query-Oriented Data Cleaning with Oracles
Recommendations
Learning Over Dirty Data Without Cleaning
SIGMOD '20: Proceedings of the 2020 ACM SIGMOD International Conference on Management of DataReal-world datasets are dirty and contain many errors, such as violations of integrity constraints and entity duplicates. Learning over dirty databases may result in inaccurate models. Data scientists spend most of their time on preparing and repairing ...
Using Recon for data cleaning
KDD'95: Proceedings of the First International Conference on Knowledge Discovery and Data MiningTo aid in making investment decisions, financial analysts purchase large amounts of data from financial data providers. They use this historical data to develop financial models for making predictions and assessing risk about current dat about current ...
Ontologies for Reusing Data Cleaning Knowledge
ICSC '12: Proceedings of the 2012 IEEE Sixth International Conference on Semantic ComputingThe emergence of new business models, namely, the establishment of partnerships between organizations, the chance that companies have of adding existing data on the web, especially in the semantic web, to their information, led to the emphasis on some ...
Comments