Abstract
The ability to estimate resource consumption of SQL queries is crucial for a number of tasks in a database system such as admission control, query scheduling and costing during query optimization. Recent work has explored the use of statistical techniques for resource estimation in place of the manually constructed cost models used in query optimization. Such techniques, which require as training data examples of resource usage in queries, offer the promise of superior estimation accuracy since they can account for factors such as hardware characteristics of the system or bias in cardinality estimates. However, the proposed approaches lack robustness in that they do not generalize well to queries that are different from the training examples, resulting in significant estimation errors. Our approach aims to address this problem by combining knowledge of database query processing with statistical models. We model resource-usage at the level of individual operators, with different models and features for each operator type, and explicitly model the asymptotic behavior of each operator. This results in significantly better estimation accuracy and the ability to estimate resource usage of arbitrary plans, even when they are very different from the training instances. We validate our approach using various large scale real-life and benchmark workloads on Microsoft SQL Server.
- Decision Tree Learning, Bregman Divergences, and the Histogram Trick. In submission.Google Scholar
- Program for TPC-H data generation with Skew. ftp://ftp.research.microsoft.com/users/viveknar/TPCDSkew/.Google Scholar
- TPC-H and TPC-DS Benchmarks. http://www.tpc.org.Google Scholar
- WEKA workbench. http://www.cs.waikato.ac.nz/ml/weka/.Google Scholar
- A. Aboulnaga and S. Chaudhuri. Self-tuning Histograms: Building Histograms Without Looking at Data. In SIGMOD, pages 181--192, 1999. Google ScholarDigital Library
- M. Ahmad, S. Duan, A. Aboulnaga, and S. Babu. Predicting Completion Times of Batch Query Workloads using Interaction-aware Models and Simulation. In EDBT, pages 449--460, 2011. Google ScholarDigital Library
- M. Akdere, U. Cetintemel, M. Riondato, E. Upfal, and S. Zdonik. The Case for Predictive Database Systems: Opportunities and Challenges. In CIDR, pages 167--174, 2011.Google Scholar
- M. Akdere, U. Cetintemel, M. Riondato, E. Upfal, and S. Zdonik. Learning-based Query Performance Modeling and Prediction. In ICDE, 2012. Google ScholarDigital Library
- N. Bruno, S. Chaudhuri, and L. Gravano. STHoles: A Multidimensional Workload-Aware Histogram. In SIGMOD, pages 211--222, 2001. Google ScholarDigital Library
- C. Curino, E. P. C. Jones, S. Madden, and H. Balakrishnan. Workload-aware Database Monitoring and Consolidation. In SIGMOD, pages 313--324, 2011. Google ScholarDigital Library
- D. J. DeWitt, J. F. Naughton, and J. Burger. Nested Loops Revisited. In PDIS, pages 230--242, 1993. Google ScholarDigital Library
- J. Duggan, U. Cetintemel, O. Papaemmanouil, and E. Upfal. Performance Prediction for Concurrent Database Workloads. In SIGMOD, pages 337--348, 2011. Google ScholarDigital Library
- M. Elhemali, C. A. Galindo-Legaria, T. Grabs, and M. M. Joshi. Execution Strategies for SQL Subqueries. In SIGMOD, pages 993--1004, 2007. Google ScholarDigital Library
- J. Friedman. Greedy Function Approximation: a Gradient Boosting Machine. Annals of Statistics, 29(5):1189--1232, 2001.Google ScholarCross Ref
- A. Ganapathi, H. Kuno, U. Dayal, J. L. Wiener, A. Fox, M. Jordan, and D. Patterson. Predicting Multiple Metrics for Queries: Better Decisions Enabled by Machine Learning. In ICDE, pages 592--603, 2009. Google ScholarDigital Library
- G. Graefe. Query Evaluation Techniques for Large Databases. ACM Comput. Surv., 25(2):73--169, 1993. Google ScholarDigital Library
- H. Herodotou and S. Babu. Profiling, What-if Analysis, and Cost-based Optimization of MapReduce Programs. In VLDB, pages 1111--1122, 2011.Google ScholarDigital Library
- E. P. D. Pednault. Transform Regression and the Kolmogorov Superposition Theorem. In SDM, pages 35--46, 2006.Google ScholarCross Ref
- M. Stillger, G. M. Lohman, V. Markl, and M. Kandil. LEO - DB2's LEarning Optimizer. In VLDB, pages 19--28, 2001. Google ScholarDigital Library
- V. Vapnik. Statistical Learning Theory. Wiley, 2000.Google Scholar
- Q. Wu, C. J. Burges, K. M. Svore, and J. Gao. Ranking, Boosting, and Model Adaptation. Technical report, Microsoft Research, 2008.Google Scholar
- N. Zhang, P. J. Haas, V. Josifovski, G. M. Lohman, and C. Zhang. Statistical Learning Techniques for Costing XML Queries. In VLDB, pages 289--300, 2005. Google ScholarDigital Library
Recommendations
Online estimation for subset-based SQL queries
VLDB '05: Proceedings of the 31st international conference on Very large data basesThe largest databases in use today are so large that answering a query exactly can take minutes, hours, or even days. One way to address this problem is to make use of approximation algorithms. Previous work on online aggregation has considered how to ...
Robust variance estimation for random effects meta-analysis
In random effects meta-analysis, an overall effect is estimated using a weighted mean, with weights based on estimated marginal variances. The variance of the overall effect is often estimated using the inverse of the sum of the estimated weights, and ...
Equivalence of SQL queries in presence of embedded dependencies
PODS '09: Proceedings of the twenty-eighth ACM SIGMOD-SIGACT-SIGART symposium on Principles of database systemsWe consider the problem of finding equivalent minimal-size reformulations of SQL queries in presence of embedded dependencies [1]. Our focus is on select-project-join (SPJ) queries with equality comparisons, also known as safe conjunctive (CQ) queries, ...
Comments