Abstract
Materialized views (MVs), stored pre-computed results, are widely used to facilitate fast queries on large datasets. When new records arrive at a high rate, it is infeasible to continuously update (maintain) MVs and a common solution is to defer maintenance by batching updates together. Between batches the MVs become increasingly stale with incorrect, missing, and superfluous rows leading to increasingly inaccurate query results. We propose Stale View Cleaning (SVC) which addresses this problem from a data cleaning perspective. In SVC, we efficiently clean a sample of rows from a stale MV, and use the clean sample to estimate aggregate query results. While approximate, the estimated query results reflect the most recent data. As sampling can be sensitive to long-tailed distributions, we further explore an outlier indexing technique to give increased accuracy when the data distributions are skewed. SVC complements existing deferred maintenance approaches by giving accurate and bounded query answers between maintenance. We evaluate our method on a generated dataset from the TPC-D benchmark and a real video distribution application. Experiments confirm our theoretical results: (1) cleaning an MV sample is more efficient than full view maintenance, (2) the estimated results are more accurate than using the stale MV, and (3) SVC is applicable for a wide variety of MVs.
- Conviva. http://www.conviva.com/.Google Scholar
- S. Agarwal, H. Milner, A. Kleiner, A. Talwalkar, M. I. Jordan, S. Madden, B. Mozafari, and I. Stoica. Knowing when you're wrong: building fast and reliable approximate query processing systems. In SIGMOD Conference, pages 481--492, 2014. Google Scholar
- S. Agarwal, B. Mozafari, A. Panda, H. Milner, S. Madden, and I. Stoica. Blinkdb: queries with bounded errors and bounded response times on very large data. In EuroSys, pages 29--42, 2013. Google Scholar
- A. Chalamalla, I. F. Ilyas, M. Ouzzani, and P. Papotti. Descriptive and prescriptive data cleaning. In SIGMOD Conference, pages 445--456, 2014. Google Scholar
- S. Chaudhuri, G. Das, M. Datar, R. Motwani, and V. R. Narasayya. Overcoming limitations of sampling for aggregation queries. In ICDE, pages 534--542, 2001. Google Scholar
- S. Chaudhuri and V. Narasayya. TPC-D data generation with skew. ftp.research.microsoft.com/users/viveknar/tpcdskew.Google Scholar
- R. Chirkova and J. Yang. Materialized views. Foundations and Trends in Databases, 4(4):295--405, 2012.Google Scholar
- A. Clauset, C. R. Shalizi, and M. E. J. Newman. Power-law distributions in empirical data. SIAM Review, 51(4):661--703, 2009. Google Scholar
- G. Cormode, M. N. Garofalakis, P. J. Haas, and C. Jermaine. Synopses for massive data: Samples, histograms, wavelets, sketches. Foundations and Trends in Databases, 4(1-3):1--294, 2012. Google Scholar
- D. R. Cox and D. V. Hinkley. Theoretical statistics. CRC Press, 1979.Google Scholar
- Y. Cui and J. Widom. Lineage tracing for general data warehouse transformations. VLDB J., 12(1):41--58, 2003. Google Scholar
- M. N. Garofalakis and P. B. Gibbons. Approximate query processing: Taming the terabytes. In VLDB, 2001. Google Scholar
- P. B. Gibbons, Y. Matias, and V. Poosala. Fast incremental maintenance of approximate histograms. ACM Trans. Database Syst., 27(3):261--298, 2002. Google Scholar
- A. Gupta and I. S. Mumick. Maintenance of materialized views: Problems, techniques, and applications. IEEE Data Eng. Bull., 18(2):3--18, 1995.Google Scholar
- H. Gupta and I. S. Mumick. Incremental maintenance of aggregate and outerjoin expressions. Information Systems, 31(6):435--464, 2006. Google Scholar
- C. Henke, C. Schmoll, and T. Zseby. Empirical evaluation of hash functions for packetid generation in sampled multipoint measurements. In Passive and Active Network Measurement, pages 197--206. Springer, 2009. Google Scholar
- S. Joshi and C. M. Jermaine. Materialized sample views for database approximation. IEEE Trans. Knowl. Data Eng., 20(3):337--351, 2008. Google Scholar
- C. Koch, Y. Ahmad, O. Kennedy, M. Nikolic, A. Nötzli, D. Lupei, and A. Shaikhha. Dbtoaster: higher-order delta processing for dynamic, frequently fresh views. VLDB J., 23(2):253--278, 2014. Google Scholar
- S. Krishnan, J. Wang, M. J. Franklin, K. Goldberg, and T. Kraska. Stale view cleaning: Getting fresh answers from stale materialized views. http://www.ocf.berkeley.edu/~sanjayk/pubs/svc-2014.pdf, 2014.Google Scholar
- P.-A. Larson, W. Lehner, J. Zhou, and P. Zabback. Cardinality estimation using sample views with quality assurance. In SIGMOD, pages 175--186, 2007. Google Scholar
- P.-Å. Larson and H. Z. Yang. Computing queries from derived relations. In VLDB, pages 259--269, 1985. Google Scholar
- P. L'Ecuyer and R. Simard. Testu01: Ac library for empirical testing of random number generators. ACM Transactions on Mathematical Software (TOMS), 33(4):22, 2007. Google Scholar
- Z. Liu, B. Jiang, and J. Heer. imMens: Real-time visual querying of big data. Comput. Graph. Forum, 32(3):421--430, 2013.Google Scholar
- A. Meliou, W. Gatterbauer, S. Nath, and D. Suciu. Tracing data errors with view-conditioned causality. In SIGMOD Conference, pages 505--516, 2011. Google Scholar
- M. Mitzenmacher. A brief history of generative models for power law and lognormal distributions. Internet Mathematics, 1(2):226--251, 2003.Google Scholar
- S. Nirkhiwale, A. Dobra, and C. M. Jermaine. A sampling algebra for aggregate estimation. PVLDB, 6(14):1798--1809, 2013. Google Scholar
- F. Olken. Random sampling from databases. PhD thesis, University of California, 1993.Google Scholar
- F. Olken and D. Rotem. Simple random sampling from relational databases. In VLDB, pages 160--169, 1986. Google Scholar
- F. Olken and D. Rotem. Maintenance of materialized views of sampling queries. In ICDE, pages 632--641, 1992. Google Scholar
- E. Rahm and H. H. Do. Data cleaning: Problems and current approaches. IEEE Data Eng. Bull., 23(4):3--13, 2000.Google Scholar
- V. Srinivasan and M. J. Carey. Compensation-based on-line query processing. In SIGMOD Conference, pages 331--340, 1992. Google Scholar
- J. Wang, S. Krishnan, M. J. Franklin, K. Goldberg, T. Kraska, and T. Milo. A sample-and-clean framework for fast and accurate query processing on dirty data. In SIGMOD Conference, pages 469--480, 2014. Google Scholar
- K. Weil. Rainbird: Real-time analytics at twitter. In Strata, 2011.Google Scholar
- E. Wu and S. Madden. Scorpion: Explaining away outliers in aggregate queries. PVLDB, 6(8):553--564, 2013. Google Scholar
- K. Zeng, S. Gao, B. Mozafari, and C. Zaniolo. The analytical bootstrap: a new method for fast error estimation in approximate query processing. In SIGMOD, pages 277--288, 2014. Google Scholar
Index Terms
- Stale view cleaning: getting fresh answers from stale materialized views
Recommendations
View-based query containment
PODS '03: Proceedings of the twenty-second ACM SIGMOD-SIGACT-SIGART symposium on Principles of database systemsQuery containment is the problem of checking whether for all databases the answer to a query is a subset of the answer to a second query. In several data management tasks, such as data integration, mobile computing, etc., the data of interest are only ...
Keyword query cleaning
Unlike traditional database queries, keyword queries do not adhere to predefined syntax and are often dirty with irrelevant words from natural languages. This makes accurate and efficient keyword query processing over databases a very challenging task.
...
View-based query processing: On the relationship between rewriting, answering and losslessness
As a result of the extensive research in view-based query processing, three notions have been identified as fundamental, namely rewriting, answering, and losslessness. Answering amounts to computing the tuples satisfying the query in all databases ...
Comments