Abstract
Query optimizers depend on selectivity estimates of query predicates to produce a good execution plan. When a query contains multiple predicates, today's optimizers use a variety of assumptions, such as independence between predicates, to estimate selectivity. While such techniques have the benefit of fast estimation and small memory footprint, they often incur large selectivity estimation errors. In this work, we reconsider selectivity estimation as a regression problem. We explore application of neural networks and tree-based ensembles to the important problem of selectivity estimation of multi-dimensional range predicates. While their straightforward application does not outperform even simple baselines, we propose two simple yet effective design choices, i.e., regression label transformation and feature engineering, motivated by the selectivity estimation context. Through extensive empirical evaluation across a variety of datasets, we show that the proposed models deliver both highly accurate estimates as well as fast estimation.
- https://support.microsoft.com/en-us/help/2658214/fix-poor-performance-when-you-run-a-query-that-contains-correlated-and.Google Scholar
- https://bitbucket.org/mheimel/feedback-kde/src/default/.Google Scholar
- Cardinality estimation for correlated columns in SQL Server 2016. https://blogs.msdn.microsoft.com/sql_server_team/cardinality-estimation-for-correlated-columns-in-sql-server-2016/.Google Scholar
- How the planner uses statistics. https://www.postgresql.org/docs/current/row-estimation-examples.html.Google Scholar
- Statistics in Microsoft SQL Server 2017. https://docs.microsoft.com/en-us/sql/relational-databases/statistics/statistics?view=sql-server-2017.Google Scholar
- Statistics in Microsoft SQL Server 2017: When to update statistics. https://docs.microsoft.com/en-us/sql/relational-databases/statistics/statistics?view=sql-server-2017#UpdateStatistics.Google Scholar
- Understanding optimizer statistics with oracle database 18c. https://www.oracle.com/technetwork/database/bi-datawarehousing/twp-stats-concepts-0218-4403739.pdf.Google Scholar
- UCI machine learning repository. https://archive.ics.uci.edu/ml/index.php.Google Scholar
- M. Abadi, P. Barham, J. Chen, Z. Chen, A. Davis, J. Dean, M. Devin, S. Ghemawat, G. Irving, M. Isard, et al. Tensorflow: a system for large-scale machine learning. In OSDI, 2016. Google ScholarDigital Library
- A. Aboulnaga and S. Chaudhuri. Self-tuning histograms: Building histograms without looking at data. In SIGMOD, 1999. Google ScholarDigital Library
- J. Boulos and M. Bretonneux. Selectivity Estimation Using Neural Networks. 1996.Google Scholar
- N. Bruno, S. Chaudhuri, and L. Gravano. Stholes: A multidimensional workload-aware histogram. In SIGMOD, 2001. Google ScholarDigital Library
- S. Chaudhuri. Query optimizers: Time to rethink the contract? In SIGMOD, 2009. Google ScholarDigital Library
- S. Chaudhuri, V. Narasayya, and R. Ramamurthy. A pay-as-you-go framework for query execution feedback. PVLDB, 1(1):1141--1152, 2008. Google ScholarDigital Library
- C. M. Chen and N. Roussopoulos. Adaptive selectivity estimation using query feedback. In SIGMOD, 1994.Google ScholarDigital Library
- T. Chen and C. Guestrin. Xgboost: A scalable tree boosting system. In KDD, 2016. Google ScholarDigital Library
- F. Chollet et al. Keras. https://keras.io, 2015.Google Scholar
- L. Chu, X. Hu, J. Hu, L. Wang, and J. Pei. Exact and consistent interpretation for piecewise linear neural networks: A closed form solution. In KDD, 2018. Google ScholarDigital Library
- G. Cormode, M. Garofalakis, P. J. Haas, and C. Jermaine. Synopses for massive data: Samples, histograms, wavelets, sketches. Found. Trends databases, 4(1--3), 2012. Google ScholarDigital Library
- L. Getoor, B. Taskar, and D. Koller. Selectivity estimation using probabilistic models. In SIGMOD, 2001. Google ScholarDigital Library
- D. Gunopulos, G. Kollios, J. Tsotras, and C. Domeniconi. Selectivity estimators for multidimensional range queries over real attributes. The VLDB Journal, 14(2), 2005.Google ScholarDigital Library
- M. Heimel, M. Kiefer, and V. Markl. Self-tuning, gpu-accelerated kernel density models for multidimensional selectivity estimation. In SIGMOD, 2015. Google ScholarDigital Library
- I. F. Ilyas, V. Markl, P. Haas, P. Brown, and A. Aboulnaga. Cords: Automatic discovery of correlations and soft functional dependencies. In SIGMOD, 2004. Google ScholarDigital Library
- G. Ke, Q. Meng, T. Finley, T. Wang, W. Chen, W. Ma, Q. Ye, and T.-Y. Liu. Lightgbm: A highly efficient gradient boosting decision tree. In NIPS, 2017.Google Scholar
- A. Kipf, T. Kipf, B. Radke, V. Leis, P. A. Boncz, and A. Kemper. Learned cardinalities: Estimating correlated joins with deep learning. In CIDR 2019, 2019.Google Scholar
- M. S. Lakshmi and S. Zhou. Selectivity estimation in extensible databases - a neural network approach. In VLDB, 1998. Google ScholarDigital Library
- V. Leis, B. Radke, A. Gubichev, A. Mirchev, P. Boncz, A. Kemper, and T. Neumann. Query optimization through the looking glass, and what we found running the join order benchmark. The VLDB Journal, 27(5), 2018. Google ScholarDigital Library
- H. Liu, M. Xu, Z. Yu, V. Corvinelli, and C. Zuzarte. Cardinality Estimation Using Neural Networks. In CASCON, 2015.Google Scholar
- H. Lu and R. Setiono. Effective query size estimation using neural networks. Applied Intelligence, 16(3), 2002. Google ScholarDigital Library
- V. Markl, P. J. Haas, M. Kutsch, N. Megiddo, U. Srivastava, and T. M. Tran. Consistent selectivity estimation via maximum entropy. The VLDB Journal, 16(1), 2007. Google ScholarDigital Library
- Y. Matias, J. S. Vitter, and M. Wang. Wavelet-based histograms for selectivity estimation. In SIGMOD, 1998. Google ScholarDigital Library
- G. Moerkotte, T. Neumann, and G. Steidl. Preventing bad plans by bounding the impact of cardinality estimation errors. PVLDB, 2(1):982--993, 2009. Google ScholarDigital Library
- G. Montúfar, R. Pascanu, K. Cho, and Y. Bengio. On the number of linear regions of deep neural networks. In NIPS, 2014.Google ScholarDigital Library
- M. Müller, G. Moerkotte, and O. Kolb. Improved selectivity estimation by combining knowledge from sampling and synopses. PVLDB, 11(9):1016--1028, 2018. Google ScholarDigital Library
- M. Muralikrishna and D. J. DeWitt. Equi-depth multidimensional histograms. In SIGMOD, 1988. Google ScholarDigital Library
- F. Pedregosa, G. Varoquaux, A. Gramfort, V. Michel, B. Thirion, O. Grisel, M. Blondel, P. Prettenhofer, R. Weiss, V. Dubourg, J. Vanderplas, A. Passos, D. Cournapeau, M. Brucher, M. Perrot, and E. Duchesnay. Scikit-learn: Machine learning in Python. Journal of Machine Learning Research, 12, 2011.Google Scholar
- V. Poosala and Y. E. Ioannidis. Selectivity estimation without the attribute value independence assumption. In VLDB, 1997. Google ScholarDigital Library
- M. Shekelyan, A. Dignös, and J. Gamper. Digithist: A histogram-based data summary with tight error bounds. PVLDB, 10(11):1514--1525, 2017. Google ScholarDigital Library
- U. Srivastava, P. J. Haas, V. Markl, M. Kutsch, and T. M. Tran. Isomer: Consistent histogram construction using query feedback. In ICDE, 2006. Google ScholarDigital Library
- M. Stillger, G. M. Lohman, V. Markl, and M. Kandil. Leo - db2's learning optimizer. In VLDB, 2001. Google ScholarDigital Library
- K. Tzoumas, A. Deshpande, and C. S. Jensen. Efficiently adapting graphical models for selectivity estimation. The VLDB Journal, 22(1), 2013. Google ScholarDigital Library
- Y.-L. Wu, D. Agrawal, and A. El Abbadi. Applying the golden rule of sampling for query estimation. In SIGMOD, 2001. Google ScholarDigital Library
Index Terms
- Selectivity estimation for range predicates using lightweight models
Recommendations
Deep Learning Models for Selectivity Estimation of Multi-Attribute Queries
SIGMOD '20: Proceedings of the 2020 ACM SIGMOD International Conference on Management of DataSelectivity estimation - the problem of estimating the result size of queries - is a fundamental problem in databases. Accurate estimation of query selectivity involving multiple correlated attributes is especially challenging. Poor cardinality ...
Selectivity estimation using probabilistic models
Estimating the result size of complex queries that involve selection on multiple attributes and the join of several relations is a difficult but fundamental task in database query processing. It arises in cost-based query optimization, query profiling, ...
Selectivity Estimation for String Predicates: Overcoming the Underestimation Problem
ICDE '04: Proceedings of the 20th International Conference on Data EngineeringQueries with (equality or LIKE) selection predicatesover string attributes are widely used in relationaldatabases. However, state-of-the-art techniques forestimating selectivities of string predicates are often biasedtowards severely underestimating ...
Comments