research-article

QuERy: a framework for integrating entity resolution with query processing

Published: 01 November 2015

Abstract

This paper explores an analysis-aware data cleaning architecture for a large class of select-project-join (SPJ) SQL queries. In particular, we propose QuERy, a novel framework for integrating entity resolution (ER) with query processing. The aim of QuERy is to answer complex queries issued on top of dirty data both correctly and efficiently. A comprehensive empirical evaluation of the proposed solution demonstrates a significant efficiency advantage over traditional techniques in the given problem settings.
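The abstract does not describe QuERy's internal mechanisms, so the following is only a toy illustration of the general idea of letting a query decide which records need entity resolution. It is not the paper's algorithm. Everything in it is an assumption made for illustration: the sample records, the last-name blocking key (`block_key`), the matching rule (`same_entity`), and the convention that an entity satisfies a selection if any of its duplicate records does.

```python
# Toy dirty table: (record id, name, city). All values are made up.
RECORDS = [
    (1, "Jon Smith", "Irvine"),
    (2, "John Smith", "irvine"),
    (3, "Mary Jones", "Boston"),
    (4, "Marie Jones", "Boston"),
    (5, "Ann Lee", "Irvine"),
]


def block_key(rec):
    # Hypothetical blocking function: records can only be duplicates
    # if they share a lower-cased last name.
    return rec[1].split()[-1].lower()


def same_entity(a, b, counter):
    # Hypothetical (and deliberately crude) match rule; counter[0] tracks
    # how many comparisons were performed.
    counter[0] += 1
    return block_key(a) == block_key(b) and a[1][0].lower() == b[1][0].lower()


def resolve(records, counter):
    """Cluster records into entities, comparing only within blocks."""
    blocks = {}
    for rec in records:
        blocks.setdefault(block_key(rec), []).append(rec)
    entities = []
    for block in blocks.values():
        clusters = []
        for rec in block:
            for cluster in clusters:
                if same_entity(cluster[0], rec, counter):
                    cluster.append(rec)
                    break
            else:
                clusters.append([rec])
        entities.extend(clusters)
    return entities


def matches(rec, city):
    return rec[2].strip().lower() == city.lower()


def select(entities, city):
    # Assumed semantics: an entity qualifies if any of its records does.
    return [[r[0] for r in e] for e in entities if any(matches(r, city) for r in e)]


def clean_then_query(records, city):
    """Baseline: resolve the whole table, then evaluate the selection."""
    counter = [0]
    return select(resolve(records, counter), city), counter[0]


def query_aware(records, city):
    """Resolve only blocks containing a record that satisfies the selection."""
    counter = [0]
    relevant = {block_key(r) for r in records if matches(r, city)}
    candidates = [r for r in records if block_key(r) in relevant]
    return select(resolve(candidates, counter), city), counter[0]


if __name__ == "__main__":
    print(clean_then_query(RECORDS, "Irvine"))
    print(query_aware(RECORDS, "Irvine"))
```

Under these assumptions both functions return the same entities, but the query-aware variant skips the block that cannot contribute to the answer and therefore performs fewer comparisons. Handling joins and projections, which SPJ queries also involve, is beyond this sketch.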


Published in: Proceedings of the VLDB Endowment, Volume 9, Issue 3, November 2015, 144 pages

Publisher: VLDB Endowment
