Abstract
This paper explores an analysis-aware data cleaning architecture for a large class of select-project-join (SPJ) SQL queries. In particular, we propose QuERy, a novel framework for integrating entity resolution (ER) with query processing. The aim of QuERy is to answer complex queries issued on top of dirty data both correctly and efficiently. A comprehensive empirical evaluation of the proposed solution demonstrates a significant efficiency advantage over traditional techniques in the problem settings considered.