Editorial Notes
Computationally Reproducible. The experimental results of this paper were reproduced by a SIGMOD Review Committee and were found to support the central results reported in the paper. Details of the review process are found here:
http://db-reproducibility.seas.harvard.edu/#process
ABSTRACT
Quickly and accurately estimating the selectivity of multidimensional predicates is a vital part of a modern relational query optimizer. The state-of-the art in this field are multidimensional histograms, which offer good estimation quality but are complex to construct and hard to maintain. Kernel Density Estimation (KDE) is an interesting alternative that does not suffer from these problems. However, existing KDE-based selectivity estimators can hardly compete with the estimation quality of state-of-the art methods.
In this paper, we substantially expand the state-of-the-art in KDE-based selectivity estimation by improving along three dimensions: First, we demonstrate how to numerically optimize a KDE model, leading to substantially improved estimates. Second, we develop methods to continuously adapt the estimator to changes in both the database and the query workload. Finally, we show how to drastically improve the performance by pushing computations onto a GPU.
We provide an implementation of our estimator and experimentally evaluate it on a variety of datasets and workloads, demonstrating that it efficiently scales up to very large model sizes, adapts itself to database changes, and typically outperforms the estimation quality of both existing Kernel Density Estimators as well as state-of-the-art multidimensional histograms.
- Multivariate Density Estimation - Theory, Practice and Visualization. John Wiley & Sons, Inc., 1992.Google Scholar
- W. Andrzejewski, A. Gramacki, and J. Gramacki. Graphics processing units in acceleration of bandwidth selection for kernel density estimation. Int. J. Appl. Math. Comput. Sci, 23(4):869--885, 2013.Google ScholarDigital Library
- K. Bache and M. Lichman. UCI machine learning repository, 2013.Google Scholar
- B. Blohsfeld, D. Korus, and B. Seeger. A comparison of selectivity estimators for range queries on metric attributes. SIGMOD Record, 28(2):239--250, 1999. Google ScholarDigital Library
- A. W. Bowman. An alternative method of cross-validation for the smoothing of density estimates. Biometrika, 71(2):353--360, 1984.Google ScholarCross Ref
- S. Breß, M. Heimel, N. Siegmund, L. Bellatreche, and G. Saake. Gpu-accelerated database systems: Survey and open challenges. In Transactions on Large-Scale Data-and Knowledge-Centered Systems XV, pages 1--35. Springer, 2014.Google Scholar
- N. Bruno, S. Chaudhuri, and L. Gravano. Stholes: a multidimensional workload-aware histogram. SIGMOD Record, 30(2):211--222, 2001. Google ScholarDigital Library
- R. H. Byrd, P. Lu, J. Nocedal, and C. Zhu. A limited memory algorithm for bound constrained optimization. SIAM Journal on Scientific Computing, 16(5):1190--1208, 1995. Google ScholarDigital Library
- S. Chaudhuri, R. Motwani, and V. Narasayya. On random sampling over joins. SIGMOD Record, 28(2):263--274, 1999. Google ScholarDigital Library
- S. Christodoulakis. Implications of certain assumptions in database performance evauation. ACM Transactions on Database Systems (TODS), 9(2):163--186, 1984. Google ScholarDigital Library
- T. Duong and M. L. Hazelton. Cross-validation bandwidth matrices for multivariate kernel density estimation. Scandinavian Journal of Statistics, 32(3):485--506, 2005.Google ScholarCross Ref
- R. Gemulla, W. Lehner, and P. J. Haas. Maintaining bounded-size sample synopses of evolving datasets. The VLDB Journal, 17(2):173--201, 2008. Google ScholarDigital Library
- P. B. Gibbons, Y. Matias, and V. Poosala. Maintaining a random sample of a relation in a database in the presence of updates to the relation, Jan. 4 2000. US Patent 6,012,064.Google Scholar
- D. Gunopulos, G. Kollios, J. Tsotras, and C. Domeniconi. Selectivity estimators for multidimensional range queries over real attributes. The VLDB Journal, 14(2):137--154, 2005. Google ScholarDigital Library
- D. Gunopulos, G. Kollios, V. J. Tsotras, and C. Domeniconi. Approximating multi-dimensional aggregate range queries over real attributes. ACM SIGMOD Record, 29(2):463--474, 2000. Google ScholarDigital Library
- M. Heimel and V. Markl. A first step towards gpu-assisted query optimization. In International Workshop on Accelerating Data Management Systems using Modern Processor and Storage Architectures (ADMS), Istanbul, Turkey, pages 33--44, 2012.Google Scholar
- M. Heimel, M. Saecker, H. Pirk, S. Manegold, and V. Markl. Hardware-oblivious parallelism for in-memory column-stores. Proceedings of the VLDB Endowment, 6(9):709--720, 2013. Google ScholarDigital Library
- C. Heinz and B. Seeger. Cluster kernels: Resource-aware kernel density estimators over streaming data. Knowledge and Data Engineering, IEEE Transactions on, 20(7):880--893, 2008. Google ScholarDigital Library
- D. Horn. Stream reduction operations for gpgpu applications. Gpu gems, 2:573--589, 2005.Google Scholar
- Y. Ioannidis. The history of histograms (abridged). In Proceedings of the 29th VLDB conference, pages 19--30. VLDB Endowment, 2003. Google ScholarDigital Library
- Y. E. Ioannidis and S. Christodoulakis. On the propagation of errors in the size of join results. SIGMOD Record, 20(2):268--277, 1991. Google ScholarDigital Library
- S. G. Johnson. The NLopt nonlinear-optimization package. http://ab-initio.mit.edu/nlopt.Google Scholar
- M. C. Jones, J. S. Marron, and S. J. Sheather. A brief survey of bandwidth selection for density estimation. Journal of the American Statistical Association, 91(433):401--407, 1996.Google ScholarCross Ref
- A. R. Kan and G. Timmer. Stochastic global optimization methods part ii: Multi level methods. Mathematical Programming, 39(1):57--78, 1987. Google ScholarDigital Library
- P.-A. Larson, W. Lehner, J. Zhou, and P. Zabback. Cardinality estimation using sample views with quality assurance. In Proceedings of the 2007 ACM SIGMOD Conference, pages 175--186. ACM, 2007. Google ScholarDigital Library
- J.-H. Lee, D.-H. Kim, and C.-W. Chung. Multi-dimensional selectivity estimation using compressed histogram information. SIGMOD Record, 28(2):205--214, 1999. Google ScholarDigital Library
- Q. Li and J. Racine. Nonparametric estimation of distributions with categorical and continuous data. journal of multivariate analysis, 86(2):266--292, 2003. Google ScholarDigital Library
- R. J. Lipton, J. F. Naughton, and D. A. Schneider. Practical selectivity estimation through adaptive sampling, volume 19. ACM, 1990. Google ScholarDigital Library
- V. Markl, P. J. Haas, M. Kutsch, N. Megiddo, U. Srivastava, and T. M. Tran. Consistent selectivity estimation via maximum entropy. The VLDB journal, 16(1):55--76, 2007. Google ScholarDigital Library
- Y. Matias, J. S. Vitter, and M. Wang. Wavelet-based histograms for selectivity estimation. SIGMOD Record, 27(2):448--459, 1998. Google ScholarDigital Library
- G. Moerkotte, T. Neumann, and G. Steidl. Preventing bad plans by bounding the impact of cardinality estimation errors. Proceedings of the VLDB Endowment, 2(1):982--993, 2009. Google ScholarDigital Library
- M. Muralikrishna and D. J. DeWitt. Equi-depth multidimensional histograms. SIGMOD Record, 17(3):28--36, 1988. Google ScholarDigital Library
- F. Olken and D. Rotem. Maintenance of materialized views of sampling queries. In Data Engineering, 1992. Proceedings. Eighth International Conference on, pages 632--641. IEEE, 1992. Google ScholarDigital Library
- V. Poosala and Y. E. Ioannidis. Selectivity estimation without the attribute value independence assumption. In Proceedings of the 23rd VLDB conference, pages 486--495. VLDB Endowment, 1997. Google ScholarDigital Library
- N. Reddy and J. R. Haritsa. Analyzing plan diagrams of database query optimizers. In Proceedings of the 31st VLDB conference, pages 1228--1239. VLDB Endowment, 2005. Google ScholarDigital Library
- M. Riedmiller and H. Braun. A direct adaptive method for faster backpropagation learning: The rprop algorithm. In Neural Networks, 1993., IEEE International Conference on, pages 586--591. IEEE, 1993.Google ScholarCross Ref
- S. R. Sain, K. A. Baggerly, and D. W. Scott. Cross-validation of multivariate densities. Journal of the American Statistical Association, 89(427):807--817, 1994.Google ScholarCross Ref
- U. Srivastava, P. J. Haas, V. Markl, M. Kutsch, and T. M. Tran. Isomer: Consistent histogram construction using query feedback. In Data Engineering, 2006. ICDE'06. Proceedings of the 22nd International Conference on, pages 39--39. IEEE, 2006. Google ScholarDigital Library
- S. Subramaniam, T. Palpanas, D. Papadopoulos, V. Kalogeraki, and D. Gunopulos. Online outlier detection in sensor data using non-parametric models. In Proceedings of the 32nd VLDB conference, pages 187--198. VLDB Endowment, 2006. Google ScholarDigital Library
- K. Svanberg. A class of globally convergent optimization methods based on conservative convex separable approximations. SIAM journal on optimization, 12(2):555--573, 2002. Google ScholarDigital Library
- G. R. Terrell and D. W. Scott. Variable kernel density estimation. The Annals of Statistics, pages 1236--1265, 1992.Google ScholarCross Ref
- T. Tieleman and G. Hinton. Lecture 6.5-rmsprop: Divide the gradient by a running average of its recent magnitude. COURSERA: Neural Networks for Machine Learning, 4, 2012.Google Scholar
- J. S. Vitter. Random sampling with a reservoir. ACM Transactions on Mathematical Software (TOMS), 11(1):37--57, 1985. Google ScholarDigital Library
- M. Wand and M. Jones. Comparison of smoothing parameterizations in bivariate kernel density estimation. Journal of the American Statistical Association, 88(422):520--528, 1993.Google ScholarCross Ref
- M. Wand and M. Jones. Multivariate plug-in bandwidth selection. Computational Statistics, 9(2):97--116, 1994.Google Scholar
- Z. Zhang, Y. Yang, R. Cai, D. Papadias, and A. Tung. Kernel-based skyline cardinality estimation. In Proceedings of the 2009 ACM SIGMOD Conference, pages 509--522. ACM, 2009. Google ScholarDigital Library
Index Terms
- Self-Tuning, GPU-Accelerated Kernel Density Models for Multidimensional Selectivity Estimation
Recommendations
Deep Learning Models for Selectivity Estimation of Multi-Attribute Queries
SIGMOD '20: Proceedings of the 2020 ACM SIGMOD International Conference on Management of DataSelectivity estimation - the problem of estimating the result size of queries - is a fundamental problem in databases. Accurate estimation of query selectivity involving multiple correlated attributes is especially challenging. Poor cardinality ...
Kernel density estimation in accelerators
Kernel density estimation (KDE) is a popular technique used to estimate the probability density function of a random variable. KDE is considered a fundamental data smoothing algorithm, and it is a common building block in many scientific applications. ...
On the Efficacy of a Fused CPU+GPU Processor (or APU) for Parallel Computing
SAAHPC '11: Proceedings of the 2011 Symposium on Application Accelerators in High-Performance ComputingThe graphics processing unit (GPU) has made significant strides as an accelerator in parallel computing. However, because the GPU has resided out on PCIe as a discrete device, the performance of GPU applications can be bottlenecked by data transfers ...
Comments