skip to main content
research-article

Selectivity estimation for range predicates using lightweight models

Published:01 May 2019Publication History
Skip Abstract Section

Abstract

Query optimizers depend on selectivity estimates of query predicates to produce a good execution plan. When a query contains multiple predicates, today's optimizers use a variety of assumptions, such as independence between predicates, to estimate selectivity. While such techniques have the benefit of fast estimation and small memory footprint, they often incur large selectivity estimation errors. In this work, we reconsider selectivity estimation as a regression problem. We explore application of neural networks and tree-based ensembles to the important problem of selectivity estimation of multi-dimensional range predicates. While their straightforward application does not outperform even simple baselines, we propose two simple yet effective design choices, i.e., regression label transformation and feature engineering, motivated by the selectivity estimation context. Through extensive empirical evaluation across a variety of datasets, we show that the proposed models deliver both highly accurate estimates as well as fast estimation.

References

  1. https://support.microsoft.com/en-us/help/2658214/fix-poor-performance-when-you-run-a-query-that-contains-correlated-and.Google ScholarGoogle Scholar
  2. https://bitbucket.org/mheimel/feedback-kde/src/default/.Google ScholarGoogle Scholar
  3. Cardinality estimation for correlated columns in SQL Server 2016. https://blogs.msdn.microsoft.com/sql_server_team/cardinality-estimation-for-correlated-columns-in-sql-server-2016/.Google ScholarGoogle Scholar
  4. How the planner uses statistics. https://www.postgresql.org/docs/current/row-estimation-examples.html.Google ScholarGoogle Scholar
  5. Statistics in Microsoft SQL Server 2017. https://docs.microsoft.com/en-us/sql/relational-databases/statistics/statistics?view=sql-server-2017.Google ScholarGoogle Scholar
  6. Statistics in Microsoft SQL Server 2017: When to update statistics. https://docs.microsoft.com/en-us/sql/relational-databases/statistics/statistics?view=sql-server-2017#UpdateStatistics.Google ScholarGoogle Scholar
  7. Understanding optimizer statistics with oracle database 18c. https://www.oracle.com/technetwork/database/bi-datawarehousing/twp-stats-concepts-0218-4403739.pdf.Google ScholarGoogle Scholar
  8. UCI machine learning repository. https://archive.ics.uci.edu/ml/index.php.Google ScholarGoogle Scholar
  9. M. Abadi, P. Barham, J. Chen, Z. Chen, A. Davis, J. Dean, M. Devin, S. Ghemawat, G. Irving, M. Isard, et al. Tensorflow: a system for large-scale machine learning. In OSDI, 2016. Google ScholarGoogle ScholarDigital LibraryDigital Library
  10. A. Aboulnaga and S. Chaudhuri. Self-tuning histograms: Building histograms without looking at data. In SIGMOD, 1999. Google ScholarGoogle ScholarDigital LibraryDigital Library
  11. J. Boulos and M. Bretonneux. Selectivity Estimation Using Neural Networks. 1996.Google ScholarGoogle Scholar
  12. N. Bruno, S. Chaudhuri, and L. Gravano. Stholes: A multidimensional workload-aware histogram. In SIGMOD, 2001. Google ScholarGoogle ScholarDigital LibraryDigital Library
  13. S. Chaudhuri. Query optimizers: Time to rethink the contract? In SIGMOD, 2009. Google ScholarGoogle ScholarDigital LibraryDigital Library
  14. S. Chaudhuri, V. Narasayya, and R. Ramamurthy. A pay-as-you-go framework for query execution feedback. PVLDB, 1(1):1141--1152, 2008. Google ScholarGoogle ScholarDigital LibraryDigital Library
  15. C. M. Chen and N. Roussopoulos. Adaptive selectivity estimation using query feedback. In SIGMOD, 1994.Google ScholarGoogle ScholarDigital LibraryDigital Library
  16. T. Chen and C. Guestrin. Xgboost: A scalable tree boosting system. In KDD, 2016. Google ScholarGoogle ScholarDigital LibraryDigital Library
  17. F. Chollet et al. Keras. https://keras.io, 2015.Google ScholarGoogle Scholar
  18. L. Chu, X. Hu, J. Hu, L. Wang, and J. Pei. Exact and consistent interpretation for piecewise linear neural networks: A closed form solution. In KDD, 2018. Google ScholarGoogle ScholarDigital LibraryDigital Library
  19. G. Cormode, M. Garofalakis, P. J. Haas, and C. Jermaine. Synopses for massive data: Samples, histograms, wavelets, sketches. Found. Trends databases, 4(1--3), 2012. Google ScholarGoogle ScholarDigital LibraryDigital Library
  20. L. Getoor, B. Taskar, and D. Koller. Selectivity estimation using probabilistic models. In SIGMOD, 2001. Google ScholarGoogle ScholarDigital LibraryDigital Library
  21. D. Gunopulos, G. Kollios, J. Tsotras, and C. Domeniconi. Selectivity estimators for multidimensional range queries over real attributes. The VLDB Journal, 14(2), 2005.Google ScholarGoogle ScholarDigital LibraryDigital Library
  22. M. Heimel, M. Kiefer, and V. Markl. Self-tuning, gpu-accelerated kernel density models for multidimensional selectivity estimation. In SIGMOD, 2015. Google ScholarGoogle ScholarDigital LibraryDigital Library
  23. I. F. Ilyas, V. Markl, P. Haas, P. Brown, and A. Aboulnaga. Cords: Automatic discovery of correlations and soft functional dependencies. In SIGMOD, 2004. Google ScholarGoogle ScholarDigital LibraryDigital Library
  24. G. Ke, Q. Meng, T. Finley, T. Wang, W. Chen, W. Ma, Q. Ye, and T.-Y. Liu. Lightgbm: A highly efficient gradient boosting decision tree. In NIPS, 2017.Google ScholarGoogle Scholar
  25. A. Kipf, T. Kipf, B. Radke, V. Leis, P. A. Boncz, and A. Kemper. Learned cardinalities: Estimating correlated joins with deep learning. In CIDR 2019, 2019.Google ScholarGoogle Scholar
  26. M. S. Lakshmi and S. Zhou. Selectivity estimation in extensible databases - a neural network approach. In VLDB, 1998. Google ScholarGoogle ScholarDigital LibraryDigital Library
  27. V. Leis, B. Radke, A. Gubichev, A. Mirchev, P. Boncz, A. Kemper, and T. Neumann. Query optimization through the looking glass, and what we found running the join order benchmark. The VLDB Journal, 27(5), 2018. Google ScholarGoogle ScholarDigital LibraryDigital Library
  28. H. Liu, M. Xu, Z. Yu, V. Corvinelli, and C. Zuzarte. Cardinality Estimation Using Neural Networks. In CASCON, 2015.Google ScholarGoogle Scholar
  29. H. Lu and R. Setiono. Effective query size estimation using neural networks. Applied Intelligence, 16(3), 2002. Google ScholarGoogle ScholarDigital LibraryDigital Library
  30. V. Markl, P. J. Haas, M. Kutsch, N. Megiddo, U. Srivastava, and T. M. Tran. Consistent selectivity estimation via maximum entropy. The VLDB Journal, 16(1), 2007. Google ScholarGoogle ScholarDigital LibraryDigital Library
  31. Y. Matias, J. S. Vitter, and M. Wang. Wavelet-based histograms for selectivity estimation. In SIGMOD, 1998. Google ScholarGoogle ScholarDigital LibraryDigital Library
  32. G. Moerkotte, T. Neumann, and G. Steidl. Preventing bad plans by bounding the impact of cardinality estimation errors. PVLDB, 2(1):982--993, 2009. Google ScholarGoogle ScholarDigital LibraryDigital Library
  33. G. Montúfar, R. Pascanu, K. Cho, and Y. Bengio. On the number of linear regions of deep neural networks. In NIPS, 2014.Google ScholarGoogle ScholarDigital LibraryDigital Library
  34. M. Müller, G. Moerkotte, and O. Kolb. Improved selectivity estimation by combining knowledge from sampling and synopses. PVLDB, 11(9):1016--1028, 2018. Google ScholarGoogle ScholarDigital LibraryDigital Library
  35. M. Muralikrishna and D. J. DeWitt. Equi-depth multidimensional histograms. In SIGMOD, 1988. Google ScholarGoogle ScholarDigital LibraryDigital Library
  36. F. Pedregosa, G. Varoquaux, A. Gramfort, V. Michel, B. Thirion, O. Grisel, M. Blondel, P. Prettenhofer, R. Weiss, V. Dubourg, J. Vanderplas, A. Passos, D. Cournapeau, M. Brucher, M. Perrot, and E. Duchesnay. Scikit-learn: Machine learning in Python. Journal of Machine Learning Research, 12, 2011.Google ScholarGoogle Scholar
  37. V. Poosala and Y. E. Ioannidis. Selectivity estimation without the attribute value independence assumption. In VLDB, 1997. Google ScholarGoogle ScholarDigital LibraryDigital Library
  38. M. Shekelyan, A. Dignös, and J. Gamper. Digithist: A histogram-based data summary with tight error bounds. PVLDB, 10(11):1514--1525, 2017. Google ScholarGoogle ScholarDigital LibraryDigital Library
  39. U. Srivastava, P. J. Haas, V. Markl, M. Kutsch, and T. M. Tran. Isomer: Consistent histogram construction using query feedback. In ICDE, 2006. Google ScholarGoogle ScholarDigital LibraryDigital Library
  40. M. Stillger, G. M. Lohman, V. Markl, and M. Kandil. Leo - db2's learning optimizer. In VLDB, 2001. Google ScholarGoogle ScholarDigital LibraryDigital Library
  41. K. Tzoumas, A. Deshpande, and C. S. Jensen. Efficiently adapting graphical models for selectivity estimation. The VLDB Journal, 22(1), 2013. Google ScholarGoogle ScholarDigital LibraryDigital Library
  42. Y.-L. Wu, D. Agrawal, and A. El Abbadi. Applying the golden rule of sampling for query estimation. In SIGMOD, 2001. Google ScholarGoogle ScholarDigital LibraryDigital Library

Index Terms

  1. Selectivity estimation for range predicates using lightweight models
      Index terms have been assigned to the content through auto-classification.

      Recommendations

      Comments

      Login options

      Check if you have access through your login credentials or your institution to get full access on this article.

      Sign in

      Full Access

      • Published in

        cover image Proceedings of the VLDB Endowment
        Proceedings of the VLDB Endowment  Volume 12, Issue 9
        May 2019
        110 pages
        ISSN:2150-8097
        Issue’s Table of Contents

        Publisher

        VLDB Endowment

        Publication History

        • Published: 1 May 2019
        Published in pvldb Volume 12, Issue 9

        Qualifiers

        • research-article

      PDF Format

      View or Download as a PDF file.

      PDF

      eReader

      View online with eReader.

      eReader