research-article

Selectivity estimation for range predicates using lightweight models

Authors:
Anshuman Dutt

Microsoft Research

Microsoft Research
View Profile

,
Chi Wang

Microsoft Research

Microsoft Research
View Profile

,
Azade Nazi

Google AI

Google AI
View Profile

,
Srikanth Kandula

Microsoft Research

Microsoft Research
View Profile

,
Vivek Narasayya

Microsoft Research

Microsoft Research
View Profile

,
Surajit Chaudhuri

Microsoft Research

Microsoft Research
View Profile

Proceedings of the VLDB Endowment Volume 12 Issue 9pp 1044–1057https://doi.org/10.14778/3329772.3329780

Published:01 May 2019Publication History

Proceedings of the VLDB Endowment

Abstract

Query optimizers depend on selectivity estimates of query predicates to produce a good execution plan. When a query contains multiple predicates, today's optimizers use a variety of assumptions, such as independence between predicates, to estimate selectivity. While such techniques have the benefit of fast estimation and small memory footprint, they often incur large selectivity estimation errors. In this work, we reconsider selectivity estimation as a regression problem. We explore application of neural networks and tree-based ensembles to the important problem of selectivity estimation of multi-dimensional range predicates. While their straightforward application does not outperform even simple baselines, we propose two simple yet effective design choices, i.e., regression label transformation and feature engineering, motivated by the selectivity estimation context. Through extensive empirical evaluation across a variety of datasets, we show that the proposed models deliver both highly accurate estimates as well as fast estimation.

References

https://support.microsoft.com/en-us/help/2658214/fix-poor-performance-when-you-run-a-query-that-contains-correlated-and.Google Scholar
https://bitbucket.org/mheimel/feedback-kde/src/default/.Google Scholar
Cardinality estimation for correlated columns in SQL Server 2016. https://blogs.msdn.microsoft.com/sql_server_team/cardinality-estimation-for-correlated-columns-in-sql-server-2016/.Google Scholar
How the planner uses statistics. https://www.postgresql.org/docs/current/row-estimation-examples.html.Google Scholar
Statistics in Microsoft SQL Server 2017. https://docs.microsoft.com/en-us/sql/relational-databases/statistics/statistics?view=sql-server-2017.Google Scholar
Statistics in Microsoft SQL Server 2017: When to update statistics. https://docs.microsoft.com/en-us/sql/relational-databases/statistics/statistics?view=sql-server-2017#UpdateStatistics.Google Scholar
Understanding optimizer statistics with oracle database 18c. https://www.oracle.com/technetwork/database/bi-datawarehousing/twp-stats-concepts-0218-4403739.pdf.Google Scholar
UCI machine learning repository. https://archive.ics.uci.edu/ml/index.php.Google Scholar
M. Abadi, P. Barham, J. Chen, Z. Chen, A. Davis, J. Dean, M. Devin, S. Ghemawat, G. Irving, M. Isard, et al. Tensorflow: a system for large-scale machine learning. In OSDI, 2016. Google ScholarDigital Library
A. Aboulnaga and S. Chaudhuri. Self-tuning histograms: Building histograms without looking at data. In SIGMOD, 1999. Google ScholarDigital Library
J. Boulos and M. Bretonneux. Selectivity Estimation Using Neural Networks. 1996.Google Scholar
N. Bruno, S. Chaudhuri, and L. Gravano. Stholes: A multidimensional workload-aware histogram. In SIGMOD, 2001. Google ScholarDigital Library
S. Chaudhuri. Query optimizers: Time to rethink the contract? In SIGMOD, 2009. Google ScholarDigital Library
S. Chaudhuri, V. Narasayya, and R. Ramamurthy. A pay-as-you-go framework for query execution feedback. PVLDB, 1(1):1141--1152, 2008. Google ScholarDigital Library
C. M. Chen and N. Roussopoulos. Adaptive selectivity estimation using query feedback. In SIGMOD, 1994.Google ScholarDigital Library
T. Chen and C. Guestrin. Xgboost: A scalable tree boosting system. In KDD, 2016. Google ScholarDigital Library
F. Chollet et al. Keras. https://keras.io, 2015.Google Scholar
L. Chu, X. Hu, J. Hu, L. Wang, and J. Pei. Exact and consistent interpretation for piecewise linear neural networks: A closed form solution. In KDD, 2018. Google ScholarDigital Library
G. Cormode, M. Garofalakis, P. J. Haas, and C. Jermaine. Synopses for massive data: Samples, histograms, wavelets, sketches. Found. Trends databases, 4(1--3), 2012. Google ScholarDigital Library
L. Getoor, B. Taskar, and D. Koller. Selectivity estimation using probabilistic models. In SIGMOD, 2001. Google ScholarDigital Library
D. Gunopulos, G. Kollios, J. Tsotras, and C. Domeniconi. Selectivity estimators for multidimensional range queries over real attributes. The VLDB Journal, 14(2), 2005.Google ScholarDigital Library
M. Heimel, M. Kiefer, and V. Markl. Self-tuning, gpu-accelerated kernel density models for multidimensional selectivity estimation. In SIGMOD, 2015. Google ScholarDigital Library
I. F. Ilyas, V. Markl, P. Haas, P. Brown, and A. Aboulnaga. Cords: Automatic discovery of correlations and soft functional dependencies. In SIGMOD, 2004. Google ScholarDigital Library
G. Ke, Q. Meng, T. Finley, T. Wang, W. Chen, W. Ma, Q. Ye, and T.-Y. Liu. Lightgbm: A highly efficient gradient boosting decision tree. In NIPS, 2017.Google Scholar
A. Kipf, T. Kipf, B. Radke, V. Leis, P. A. Boncz, and A. Kemper. Learned cardinalities: Estimating correlated joins with deep learning. In CIDR 2019, 2019.Google Scholar
M. S. Lakshmi and S. Zhou. Selectivity estimation in extensible databases - a neural network approach. In VLDB, 1998. Google ScholarDigital Library
V. Leis, B. Radke, A. Gubichev, A. Mirchev, P. Boncz, A. Kemper, and T. Neumann. Query optimization through the looking glass, and what we found running the join order benchmark. The VLDB Journal, 27(5), 2018. Google ScholarDigital Library
H. Liu, M. Xu, Z. Yu, V. Corvinelli, and C. Zuzarte. Cardinality Estimation Using Neural Networks. In CASCON, 2015.Google Scholar
H. Lu and R. Setiono. Effective query size estimation using neural networks. Applied Intelligence, 16(3), 2002. Google ScholarDigital Library
V. Markl, P. J. Haas, M. Kutsch, N. Megiddo, U. Srivastava, and T. M. Tran. Consistent selectivity estimation via maximum entropy. The VLDB Journal, 16(1), 2007. Google ScholarDigital Library
Y. Matias, J. S. Vitter, and M. Wang. Wavelet-based histograms for selectivity estimation. In SIGMOD, 1998. Google ScholarDigital Library
G. Moerkotte, T. Neumann, and G. Steidl. Preventing bad plans by bounding the impact of cardinality estimation errors. PVLDB, 2(1):982--993, 2009. Google ScholarDigital Library
G. Montúfar, R. Pascanu, K. Cho, and Y. Bengio. On the number of linear regions of deep neural networks. In NIPS, 2014.Google ScholarDigital Library
M. Müller, G. Moerkotte, and O. Kolb. Improved selectivity estimation by combining knowledge from sampling and synopses. PVLDB, 11(9):1016--1028, 2018. Google ScholarDigital Library
M. Muralikrishna and D. J. DeWitt. Equi-depth multidimensional histograms. In SIGMOD, 1988. Google ScholarDigital Library
F. Pedregosa, G. Varoquaux, A. Gramfort, V. Michel, B. Thirion, O. Grisel, M. Blondel, P. Prettenhofer, R. Weiss, V. Dubourg, J. Vanderplas, A. Passos, D. Cournapeau, M. Brucher, M. Perrot, and E. Duchesnay. Scikit-learn: Machine learning in Python. Journal of Machine Learning Research, 12, 2011.Google Scholar
V. Poosala and Y. E. Ioannidis. Selectivity estimation without the attribute value independence assumption. In VLDB, 1997. Google ScholarDigital Library
M. Shekelyan, A. Dignös, and J. Gamper. Digithist: A histogram-based data summary with tight error bounds. PVLDB, 10(11):1514--1525, 2017. Google ScholarDigital Library
U. Srivastava, P. J. Haas, V. Markl, M. Kutsch, and T. M. Tran. Isomer: Consistent histogram construction using query feedback. In ICDE, 2006. Google ScholarDigital Library
M. Stillger, G. M. Lohman, V. Markl, and M. Kandil. Leo - db2's learning optimizer. In VLDB, 2001. Google ScholarDigital Library
K. Tzoumas, A. Deshpande, and C. S. Jensen. Efficiently adapting graphical models for selectivity estimation. The VLDB Journal, 22(1), 2013. Google ScholarDigital Library
Y.-L. Wu, D. Agrawal, and A. El Abbadi. Applying the golden rule of sampling for query estimation. In SIGMOD, 2001. Google ScholarDigital Library

Index Terms

Selectivity estimation for range predicates using lightweight models
1. Information systems
  1. Data management systems
    1. Database management system engines
      1. Database query processing
2. Theory of computation
  1. Theory and algorithms for application domains
    1. Database theory
      1. Database query processing and optimization (theory)

Index terms have been assigned to the content through auto-classification.

Recommendations

Deep Learning Models for Selectivity Estimation of Multi-Attribute Queries
SIGMOD '20: Proceedings of the 2020 ACM SIGMOD International Conference on Management of Data

Selectivity estimation - the problem of estimating the result size of queries - is a fundamental problem in databases. Accurate estimation of query selectivity involving multiple correlated attributes is especially challenging. Poor cardinality ...
Read More
Selectivity estimation using probabilistic models

Estimating the result size of complex queries that involve selection on multiple attributes and the join of several relations is a difficult but fundamental task in database query processing. It arises in cost-based query optimization, query profiling, ...
Read More
Selectivity Estimation for String Predicates: Overcoming the Underestimation Problem
ICDE '04: Proceedings of the 20th International Conference on Data Engineering

Queries with (equality or LIKE) selection predicatesover string attributes are widely used in relationaldatabases. However, state-of-the-art techniques forestimating selectivities of string predicates are often biasedtowards severely underestimating ...
Read More

Comments

Login options

Check if you have access through your login credentials or your institution to get full access on this article.

Full Access

Get this Article

Published in
Proceedings of the VLDB Endowment Volume 12, Issue 9
May 2019
110 pages
ISSN:2150-8097
Editors:
Lei Chen
HKUST
,
Fatma Özcan
IBM Research - Almaden
Issue’s Table of Contents
Sponsors
In-Cooperation
Publisher
VLDB Endowment
Publication History
- Published: 1 May 2019
Published in pvldb Volume 12, Issue 9
Qualifiers
- research-article
Conference
Funding Sources
Other Metrics
View Article Metrics

Article Metrics
- 80
  Total Citations
  View Citations
- 631
  Total Downloads
- Downloads (Last 12 months)139
- Downloads (Last 6 weeks)20
Other Metrics
View Author Metrics
Cited By
View all

PDF Format

View or Download as a PDF file.

PDF

eReader

View online with eReader.

eReader

Selectivity estimation for range predicates using lightweight models

Proceedings of the VLDB Endowment

Abstract

References

Cited By

Index Terms

Recommendations

Deep Learning Models for Selectivity Estimation of Multi-Attribute Queries

Selectivity estimation using probabilistic models

Selectivity Estimation for String Predicates: Overcoming the Underestimation Problem

Comments

Login options

Full Access

Published in

Sponsors

In-Cooperation

Publisher

Publication History

Qualifiers

Conference

Funding Sources

Other Metrics

Article Metrics

Other Metrics

Cited By

PDF Format

eReader

Digital Edition

Caption

Selectivity estimation for range predicates using lightweight models

Proceedings of the VLDB Endowment

Abstract

References

Cited By

Index Terms

Recommendations

Deep Learning Models for Selectivity Estimation of Multi-Attribute Queries

Selectivity estimation using probabilistic models

Selectivity Estimation for String Predicates: Overcoming the Underestimation Problem

Comments

Login options

Full Access

Published in

Sponsors

In-Cooperation

Publisher

Publication History

Qualifiers

Conference

Funding Sources

Article Metrics

Other Metrics

PDF Format

eReader

Digital Edition

Share this Publication link

Share on Social Media