research-article

Query-Oriented Data Cleaning with Oracles

Authors:
Moria Bergman

Tel Aviv University, Tel Aviv, Israel

Tel Aviv University, Tel Aviv, Israel
View Profile

,
Tova Milo

Tel Aviv University, Tel Aviv, Israel

Tel Aviv University, Tel Aviv, Israel
View Profile

,
Slava Novgorodov

Tel Aviv University, Tel Aviv, Israel

Tel Aviv University, Tel Aviv, Israel
View Profile

,
Wang-Chiew Tan

UC Santa Cruz, Santa Cruz, CA, USA

UC Santa Cruz, Santa Cruz, CA, USA
View Profile

SIGMOD '15: Proceedings of the 2015 ACM SIGMOD International Conference on Management of DataMay 2015Pages 1199–1214https://doi.org/10.1145/2723372.2737786

Published:27 May 2015Publication History

SIGMOD '15: Proceedings of the 2015 ACM SIGMOD International Conference on Management of Data

Pages 1199–1214

ABSTRACT

As key decisions are often made based on information contained in a database, it is important for the database to be as complete and correct as possible. For this reason, many data cleaning tools have been developed to automatically resolve inconsistencies in databases. However, data cleaning tools provide only best-effort results and usually cannot eradicate all errors that may exist in a database. Even more importantly, existing data cleaning tools do not typically address the problem of determining what information is missing from a database.

To overcome the limitations of existing data cleaning techniques, we present QOCO, a novel query-oriented system for cleaning data with oracles. Under this framework, incorrect (resp. missing) tuples are removed from (added to) the result of a query through edits that are applied to the underlying database, where the edits are derived by interacting with domain experts which we model as oracle crowds. We show that the problem of determining minimal interactions with oracle crowds to derive database edits for removing (adding) incorrect (missing) tuples to the result of a query is NP-hard in general and present heuristic algorithms that interact with oracle crowds. Finally, we implement our algorithms in our prototype system QOCO and show that it is effective and efficient through a comprehensive suite of experiments.

References

Y. Altowim, D. V. Kalashnikov, and S. Mehrotra. Progressive approach to relational entity resolution. Proceedings of the VLDB Endowment, 7(11), 2014. Google ScholarDigital Library
Y. Amsterdamer, Y. Grossman, T. Milo, and P. Senellart. Crowd mining. In SIGMOD, pages 241--252, 2013. Google ScholarDigital Library
T. Arora, R. Ramakrishnan, W. G. Roth, P. Seshadri, and D. Srivastava. Explaining program execution in deductive systems. In DOOD, pages 101--119, 1993.Google ScholarCross Ref
I. Bhattacharya and L. Getoor. Collective entity resolution in relational data. ACM Transactions on Knowledge Discovery from Data (TKDD), 1(1):5, 2007. Google ScholarDigital Library
N. Bidoit, M. Herschel, and K. Tzompanaki. Query-based why-not provenance with nedexplain. In EDBT, pages 145--156, 2014.Google Scholar
M. Bilenko and R. J. Mooney. Adaptive duplicate detection using learnable string similarity measures. KDD, pages 39--48, 2003. Google ScholarDigital Library
P. Buneman, J. Cheney, W. C. Tan, and S. Vansummeren. Curated databases. In ACM PODS, pages 1--12, 2008. Google ScholarDigital Library
P. Buneman, S. Khanna, and W. C. Tan. Why and where: A characterization of data provenance. In ICDT, pages 316--330, 2001. Google ScholarDigital Library
P. Buneman, S. Khanna, and W. C. Tan. On propagation of deletions and annotations through views. In PODS, pages 150--158, 2002. Google ScholarDigital Library
A. Chapman and H. V. Jagadish. Why not? In SIGMOD, pages 523--534, 2009. Google ScholarDigital Library
M. Charika, S. Chaudhuri, R. Motwani, and V. R. Narasayya. Towards estimation error guarantees for distinct values. PODS, pages 268--279, 2000. Google ScholarDigital Library
J. Cheney, L. Chiticariu, and W. C. Tan. Provenance in databases: Why, how, and where. Foundations and Trends in Databases, 1(4):379--474, 2009. Google ScholarDigital Library
Cia world fact book. https://www.cia.gov/library/publications/the-world-factbook/.Google Scholar
Y. Cui and J. Widom. Run-time translation of view tuple deletions using data lineage. Technical Report 2001--24, Stanford InfoLab, 2001.Google Scholar
Y. Cui, J. Widom, and J. L. Wiener. Tracing the lineage of view data in a warehousing environment. ACM TODS, 25(2):179--227, 2000. Google ScholarDigital Library
T. Dasu and T. Johnson. Exploratory data mining and data cleaning. In Wiley, 2003. Google ScholarDigital Library
P. Domingos. Multi-relational record linkage. In In Proceedings of the KDD-2004 Workshop on Multi-Relational Data Mining. Citeseer, 2004.Google Scholar
J. Edmonds and R. M. Karp. Theoretical improvements in algorithmic efficiency for network flow problems. JACM, 19(2):284--264, 1972. Google ScholarDigital Library
J. Evermanna and J. Fangb. Evaluating ontologies: Towards a cognitive measure of quality. Information Systems, 35(4):391--403, 2010. Google ScholarDigital Library
W. Fan and F. Geerts. Foundations of Data Quality Management. Synthesis Lectures on Data Management. 2012. Google ScholarDigital Library
W. Fan, J. Li, S. Ma, N. Tang, and W. Yu. Cerfix: A system for cleaning data with certain fixes. In PVLDB, pages 1375--1378, 2011.Google ScholarDigital Library
U. M. Fayyad. Mining databases: Towards algorithms for knowledge discovery. IEEE Data Eng. Bull., 21(1):39--48, 1998.Google Scholar
U. Feige, M. Langberg, and K. Nissim. On the hardness of approximating NP witnesses. In K. Jansen and S. Khuller, editors, Approximation Algorithms for Combinatorial Optimization, volume 1913 of LNCS, pages 120--131. Springer, 2000. Google ScholarDigital Library
Fifa official site. http://www.fifa.com/.Google Scholar
Fifa trivia quizzes and games. http://www.sporcle.com/games/g/fifaworldcup.Google Scholar
Fifa world cup trivia. http://en.trivia.fifa.com/.Google Scholar
M. J. Franklin, D. Kossmann, T. Kraska, S. Ramesh, and R. Xin. Crowddb: Answering queries with crowdsourcing. In SIGMOD, pages 61--72, 2011. Google ScholarDigital Library
V. Ganti and A. D. Sarma. Data Cleaning: A Practical Perspective. Synthesis Lectures on Data Management. Morgan & Claypool Publishers, 2013. Google ScholarDigital Library
S. Ghosh, N. Sharma, F. Benevenuto, N. Ganguly, and K. Gummadi. Cognos: crowdsourcing search for topic experts in microblogs. SIGIR, pages 575--590, 2012. Google ScholarDigital Library
T. J. Green, G. Karvounarakis, and V. Tannen. Provenance semirings. In ACM PODS, pages 31--40, 2007. Google ScholarDigital Library
Z. He and E. Lo. Answering why-not questions on top-k queries. In ICDE, pages 750--761, 2012. Google ScholarDigital Library
M. Herschel and M. A. Hernández. Explaining missing answers to SPJUA queries. PVLDB, 3(1):185--196, 2010. Google ScholarDigital Library
M. Herschel, M. A. Hernández, and W. C. Tan. Artemis: A system for analyzing missing answers. PVLDB, 2(2):1550--1553, 2009. Google ScholarDigital Library
J. Huang, T. Chen, A. Doan, and J. F. Naughton. On the provenance of non-answers to queries over extracted data. PVLDB, 1(1):736--747, 2008. Google ScholarDigital Library
N. Q. V. Hung, N. T. Tam, Z. Mikló, K. Aberer, A. Gal, and M. Weidlich. Pay-as-you-go reconciliation in schema matching networks. In ICDE, pages 220--231, 2014.Google Scholar
P. G. Ipeirotis, F. Provost, and J. Wang. Quality management on amazon mechanical turk. HCOMP, pages 64--67, 2010. Google ScholarDigital Library
J. Jiao, J. Yan, H. Zhao, and W. Fan. Expertrank: An expert user ranking algorithm in online communities. NISS, pages 674--679, 2009. Google ScholarDigital Library
B. Kanagal, J. Li, and A. Deshpande. Sensitivity analysis and explanations for robust query evaluation in probabilistic databases. In ACM SIGMOD, pages 841--852. ACM, 2011. Google ScholarDigital Library
D. R. Karger, S. Oh, and D. Shah. Iterative learning for reliable crowdsourcing systems. NIPS, pages 1953--1961, 2011.Google ScholarDigital Library
R. M. Karp. Reducibility among combinatorial problems. In R. E. Miller and J. W. Thatcher, editors, Complexity of Computer Computations, pages 85--103, Plenum Press, New York, 1972.Google ScholarCross Ref
B. Kimelfeld, J. Vondrák, and R. Williams. Maximizing conjunctive views in deletion propagation. In ACM PODS, pages 187--198, 2011. Google ScholarDigital Library
A. Marcus, D. Karger, S. Madden, R. Miller, and S. Oh. Counting with the crowd. PVLDB, 6(2):109--120, 2012. Google ScholarDigital Library
R. Mccann, W. Shen, and A. Doan. Matching schemas in online communities: A web 2.0 approach. In ICDE, pages 110--119, 2008. Google ScholarDigital Library
A. Meliou, W. Gatterbauer, K. F. Moore, and D. Suciu. The complexity of causality and responsibility for query answers and non-answers. PVLDB, 4(1):34--45, 2010. Google ScholarDigital Library
R. J. Miller, L. M. Haas, and M. A. Hernández. Schema mapping as query discovery. In VLDB, pages 77--88, 2000. Google ScholarDigital Library
T. Milo and S. Zohar. Using schema matching to simplify heterogeneous data translation. In VLDB, volume 98, pages 24--27. Citeseer, 1998. Google ScholarDigital Library
A. Parameswaran, H. Garcia-Molina, H. Park, N. Polyzotis, A. Ramesh, and J. Widom. Crowdscreen: Algorithms for filtering data with humans. In SIGMOD, pages 361--372, 2012. Google ScholarDigital Library
A. G. Parameswaran, S. Boyd, H. Garcia-Molina, A. Gupta, N. Polyzotis, and J. Widom. Optimal crowd-powered rating and filtering algorithms. VLDB, 7(9):685--696, 2014. Google ScholarDigital Library
H. Park, R. Pang, A. Parameswaran, H. Garcia-Molina, N. Polyzotis, and J. Widom. An overview of the deco system: data model and query language; query processing and optimization. In SIGMOD, pages 22--27, 2012. Google ScholarDigital Library
H. Park and J. Widom. Crowdfill: collecting structured data from the crowd. In SIGMOD, pages 577--588, 2014. Google ScholarDigital Library
M. J. Raddick, G. Bracey, P. L. Gay, C. J. Lintott, P. Murray, K. Schawinski, A. S. Szalay, and J. Vandenberg. Galaxy zoo: exploring the motivations of citizen science volunteers. Astronomy Education Review, 9(1), 2010.Google ScholarCross Ref
E. Rahm and H. H. Do. Data cleaning: Problems and current approaches. IEEE Data Eng. Bull., 23(4):3--13, 2000.Google Scholar
V. Raman and J. M. Hellerstein. Potter's wheel: An interactive data cleaning system. In VLDB, pages 381--390, 2001. Google ScholarDigital Library
V. C. Raykar, S. Yu, L. H. Zhao, A. Jerebko, C. Florin, G. H. Valadez, L. Bogoni, and L. Moy. Supervised learning from multiple experts: whom to trust when everyone lies a bit. ICML, 2009. Google ScholarDigital Library
C. Sapia, G. Höfling, M. Müller, C. Hausdorf, H. Stoyan, and U. Grimmer. On supporting the data warehouse design by data mining techniques. In Proc. GI-Workshop Data Mining and Data Warehousing, page 63. Citeseer, 1999.Google Scholar
S. Sarawagi. Explaining differences in multidimensional aggregates. In VLDB, pages 42--53, 1999. Google ScholarDigital Library
F. M. Suchanek, G. Kasneci, and G. Weikum. Yago: a core of semantic knowledge unifying wordnet and wikipedia. In WWW, pages 697--706, 2007. Google ScholarDigital Library
Q. T. Tran and C.-Y. Chan. How to conquer why-not questions. In SIGMOD Conference, pages 15--26, 2010. Google ScholarDigital Library
B. Trushkowsky, T. Kraska, M. J. Franklin, and P. Sarkar. Crowdsourced enumeration queries. In ICDE, pages 673--684, 2013. Google ScholarDigital Library
Uniprot. http://www.uniprot.org/.Google Scholar
J. Wang, T. Kraska, M. J. Franklin, and J. Feng. Crowder: Crowdsourcing entity resolution. PVLDB, 5(10):1483--1494, 2012. Google ScholarDigital Library
J. Wang, S. Krishnan, M. J. Franklin, K. Goldberg, T. Kraska, and T. Milo. A sample-and-clean framework for fast and accurate query processing on dirty data. In SIGMOD, pages 469--480, 2014. Google ScholarDigital Library
Open football. https://github.com/openfootball/.Google Scholar
World cup history. http://www.worldcup-history.com/.Google Scholar
S. E. Whang, H. Garcia-Molina, and P. Lofgren. Question selection for crowd entity resolution. PVLDB, 6(6):349--360, 2013. Google ScholarDigital Library
J. Xu, D. V. Kalashnikov, and S. Mehrotra. Query aware determinization of uncertain objects. IEEE Trans. Knowl. Data Eng., 27(1):207--221, 2015.Google ScholarCross Ref
C. J. Zhang, Z. Zhao, L. Chen, H. V. Jagadish, and C. C. Cao. Crowdmatcher: crowd-assisted schema matching. In SIGMOD, pages 721--724, 2014. Google ScholarDigital Library

Index Terms

Query-Oriented Data Cleaning with Oracles
1. Information systems
  1. Data management systems

Recommendations

Learning Over Dirty Data Without Cleaning
SIGMOD '20: Proceedings of the 2020 ACM SIGMOD International Conference on Management of Data

Real-world datasets are dirty and contain many errors, such as violations of integrity constraints and entity duplicates. Learning over dirty databases may result in inaccurate models. Data scientists spend most of their time on preparing and repairing ...
Read More
Using Recon for data cleaning
KDD'95: Proceedings of the First International Conference on Knowledge Discovery and Data Mining

To aid in making investment decisions, financial analysts purchase large amounts of data from financial data providers. They use this historical data to develop financial models for making predictions and assessing risk about current dat about current ...
Read More
Ontologies for Reusing Data Cleaning Knowledge
ICSC '12: Proceedings of the 2012 IEEE Sixth International Conference on Semantic Computing

The emergence of new business models, namely, the establishment of partnerships between organizations, the chance that companies have of adding existing data on the web, especially in the semantic web, to their information, led to the emphasis on some ...
Read More

Comments

Login options

Check if you have access through your login credentials or your institution to get full access on this article.

Full Access

Get this Publication

Published in
SIGMOD '15: Proceedings of the 2015 ACM SIGMOD International Conference on Management of Data
May 2015
2110 pages
ISBN:9781450327589
DOI:10.1145/2723372
General Chair:
Timos Sellis
RMIT University, Australia
,
Program Chairs:
Susan B. Davidson
University of Pennsylvania, USA
,
Zack Ives
University of Pennsylvania, USA
Copyright © 2015 ACM
Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. Copyrights for components of this work owned by others than ACM must be honored. Abstracting with credit is permitted. To copy otherwise, or republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee. Request permissions from [email protected]
Sponsors
In-Cooperation
Publisher
Association for Computing Machinery
New York, NY, United States
Publication History
- Published: 27 May 2015
Permissions
Request permissions about this article.
Request Permissions

Check for updates
Author Tags
crowdsourcing
data cleaning
Qualifiers
- research-article
Conference

Acceptance Rates
SIGMOD '15 Paper Acceptance Rate106of415submissions,26%Overall Acceptance Rate785of4,003submissions,20%
More
Funding Sources
Other Metrics
View Article Metrics

Article Metrics
- 37
  Total Citations
  View Citations
- 675
  Total Downloads
- Downloads (Last 12 months)22
- Downloads (Last 6 weeks)3
Other Metrics
View Author Metrics
Cited By
View all

PDF Format

View or Download as a PDF file.

PDF

eReader

View online with eReader.

eReader

Query-Oriented Data Cleaning with Oracles

SIGMOD '15: Proceedings of the 2015 ACM SIGMOD International Conference on Management of Data

ABSTRACT

References

Cited By

Index Terms

Recommendations

Learning Over Dirty Data Without Cleaning

Using Recon for data cleaning

Ontologies for Reusing Data Cleaning Knowledge