skip to main content
research-article

Stale view cleaning: getting fresh answers from stale materialized views

Published:01 August 2015Publication History
Skip Abstract Section

Abstract

Materialized views (MVs), stored pre-computed results, are widely used to facilitate fast queries on large datasets. When new records arrive at a high rate, it is infeasible to continuously update (maintain) MVs and a common solution is to defer maintenance by batching updates together. Between batches the MVs become increasingly stale with incorrect, missing, and superfluous rows leading to increasingly inaccurate query results. We propose Stale View Cleaning (SVC) which addresses this problem from a data cleaning perspective. In SVC, we efficiently clean a sample of rows from a stale MV, and use the clean sample to estimate aggregate query results. While approximate, the estimated query results reflect the most recent data. As sampling can be sensitive to long-tailed distributions, we further explore an outlier indexing technique to give increased accuracy when the data distributions are skewed. SVC complements existing deferred maintenance approaches by giving accurate and bounded query answers between maintenance. We evaluate our method on a generated dataset from the TPC-D benchmark and a real video distribution application. Experiments confirm our theoretical results: (1) cleaning an MV sample is more efficient than full view maintenance, (2) the estimated results are more accurate than using the stale MV, and (3) SVC is applicable for a wide variety of MVs.

References

  1. Conviva. http://www.conviva.com/.Google ScholarGoogle Scholar
  2. S. Agarwal, H. Milner, A. Kleiner, A. Talwalkar, M. I. Jordan, S. Madden, B. Mozafari, and I. Stoica. Knowing when you're wrong: building fast and reliable approximate query processing systems. In SIGMOD Conference, pages 481--492, 2014. Google ScholarGoogle Scholar
  3. S. Agarwal, B. Mozafari, A. Panda, H. Milner, S. Madden, and I. Stoica. Blinkdb: queries with bounded errors and bounded response times on very large data. In EuroSys, pages 29--42, 2013. Google ScholarGoogle Scholar
  4. A. Chalamalla, I. F. Ilyas, M. Ouzzani, and P. Papotti. Descriptive and prescriptive data cleaning. In SIGMOD Conference, pages 445--456, 2014. Google ScholarGoogle Scholar
  5. S. Chaudhuri, G. Das, M. Datar, R. Motwani, and V. R. Narasayya. Overcoming limitations of sampling for aggregation queries. In ICDE, pages 534--542, 2001. Google ScholarGoogle Scholar
  6. S. Chaudhuri and V. Narasayya. TPC-D data generation with skew. ftp.research.microsoft.com/users/viveknar/tpcdskew.Google ScholarGoogle Scholar
  7. R. Chirkova and J. Yang. Materialized views. Foundations and Trends in Databases, 4(4):295--405, 2012.Google ScholarGoogle Scholar
  8. A. Clauset, C. R. Shalizi, and M. E. J. Newman. Power-law distributions in empirical data. SIAM Review, 51(4):661--703, 2009. Google ScholarGoogle Scholar
  9. G. Cormode, M. N. Garofalakis, P. J. Haas, and C. Jermaine. Synopses for massive data: Samples, histograms, wavelets, sketches. Foundations and Trends in Databases, 4(1-3):1--294, 2012. Google ScholarGoogle Scholar
  10. D. R. Cox and D. V. Hinkley. Theoretical statistics. CRC Press, 1979.Google ScholarGoogle Scholar
  11. Y. Cui and J. Widom. Lineage tracing for general data warehouse transformations. VLDB J., 12(1):41--58, 2003. Google ScholarGoogle Scholar
  12. M. N. Garofalakis and P. B. Gibbons. Approximate query processing: Taming the terabytes. In VLDB, 2001. Google ScholarGoogle Scholar
  13. P. B. Gibbons, Y. Matias, and V. Poosala. Fast incremental maintenance of approximate histograms. ACM Trans. Database Syst., 27(3):261--298, 2002. Google ScholarGoogle Scholar
  14. A. Gupta and I. S. Mumick. Maintenance of materialized views: Problems, techniques, and applications. IEEE Data Eng. Bull., 18(2):3--18, 1995.Google ScholarGoogle Scholar
  15. H. Gupta and I. S. Mumick. Incremental maintenance of aggregate and outerjoin expressions. Information Systems, 31(6):435--464, 2006. Google ScholarGoogle Scholar
  16. C. Henke, C. Schmoll, and T. Zseby. Empirical evaluation of hash functions for packetid generation in sampled multipoint measurements. In Passive and Active Network Measurement, pages 197--206. Springer, 2009. Google ScholarGoogle Scholar
  17. S. Joshi and C. M. Jermaine. Materialized sample views for database approximation. IEEE Trans. Knowl. Data Eng., 20(3):337--351, 2008. Google ScholarGoogle Scholar
  18. C. Koch, Y. Ahmad, O. Kennedy, M. Nikolic, A. Nötzli, D. Lupei, and A. Shaikhha. Dbtoaster: higher-order delta processing for dynamic, frequently fresh views. VLDB J., 23(2):253--278, 2014. Google ScholarGoogle Scholar
  19. S. Krishnan, J. Wang, M. J. Franklin, K. Goldberg, and T. Kraska. Stale view cleaning: Getting fresh answers from stale materialized views. http://www.ocf.berkeley.edu/~sanjayk/pubs/svc-2014.pdf, 2014.Google ScholarGoogle Scholar
  20. P.-A. Larson, W. Lehner, J. Zhou, and P. Zabback. Cardinality estimation using sample views with quality assurance. In SIGMOD, pages 175--186, 2007. Google ScholarGoogle Scholar
  21. P.-Å. Larson and H. Z. Yang. Computing queries from derived relations. In VLDB, pages 259--269, 1985. Google ScholarGoogle Scholar
  22. P. L'Ecuyer and R. Simard. Testu01: Ac library for empirical testing of random number generators. ACM Transactions on Mathematical Software (TOMS), 33(4):22, 2007. Google ScholarGoogle Scholar
  23. Z. Liu, B. Jiang, and J. Heer. imMens: Real-time visual querying of big data. Comput. Graph. Forum, 32(3):421--430, 2013.Google ScholarGoogle Scholar
  24. A. Meliou, W. Gatterbauer, S. Nath, and D. Suciu. Tracing data errors with view-conditioned causality. In SIGMOD Conference, pages 505--516, 2011. Google ScholarGoogle Scholar
  25. M. Mitzenmacher. A brief history of generative models for power law and lognormal distributions. Internet Mathematics, 1(2):226--251, 2003.Google ScholarGoogle Scholar
  26. S. Nirkhiwale, A. Dobra, and C. M. Jermaine. A sampling algebra for aggregate estimation. PVLDB, 6(14):1798--1809, 2013. Google ScholarGoogle Scholar
  27. F. Olken. Random sampling from databases. PhD thesis, University of California, 1993.Google ScholarGoogle Scholar
  28. F. Olken and D. Rotem. Simple random sampling from relational databases. In VLDB, pages 160--169, 1986. Google ScholarGoogle Scholar
  29. F. Olken and D. Rotem. Maintenance of materialized views of sampling queries. In ICDE, pages 632--641, 1992. Google ScholarGoogle Scholar
  30. E. Rahm and H. H. Do. Data cleaning: Problems and current approaches. IEEE Data Eng. Bull., 23(4):3--13, 2000.Google ScholarGoogle Scholar
  31. V. Srinivasan and M. J. Carey. Compensation-based on-line query processing. In SIGMOD Conference, pages 331--340, 1992. Google ScholarGoogle Scholar
  32. J. Wang, S. Krishnan, M. J. Franklin, K. Goldberg, T. Kraska, and T. Milo. A sample-and-clean framework for fast and accurate query processing on dirty data. In SIGMOD Conference, pages 469--480, 2014. Google ScholarGoogle Scholar
  33. K. Weil. Rainbird: Real-time analytics at twitter. In Strata, 2011.Google ScholarGoogle Scholar
  34. E. Wu and S. Madden. Scorpion: Explaining away outliers in aggregate queries. PVLDB, 6(8):553--564, 2013. Google ScholarGoogle Scholar
  35. K. Zeng, S. Gao, B. Mozafari, and C. Zaniolo. The analytical bootstrap: a new method for fast error estimation in approximate query processing. In SIGMOD, pages 277--288, 2014. Google ScholarGoogle Scholar

Index Terms

  1. Stale view cleaning: getting fresh answers from stale materialized views
          Index terms have been assigned to the content through auto-classification.

          Recommendations

          Comments

          Login options

          Check if you have access through your login credentials or your institution to get full access on this article.

          Sign in

          Full Access

          • Published in

            cover image Proceedings of the VLDB Endowment
            Proceedings of the VLDB Endowment  Volume 8, Issue 12
            Proceedings of the 41st International Conference on Very Large Data Bases, Kohala Coast, Hawaii
            August 2015
            728 pages

            Publisher

            VLDB Endowment

            Publication History

            • Published: 1 August 2015
            Published in pvldb Volume 8, Issue 12

            Qualifiers

            • research-article

          PDF Format

          View or Download as a PDF file.

          PDF

          eReader

          View online with eReader.

          eReader