research-article

Self-Tuning, GPU-Accelerated Kernel Density Models for Multidimensional Selectivity Estimation

Authors:
Max Heimel

Technische Universität Berlin, Berlin, Germany

Technische Universität Berlin, Berlin, Germany
View Profile

,
Martin Kiefer

Technische Universität Berlin, Berlin, Germany

Technische Universität Berlin, Berlin, Germany
View Profile

,
Volker Markl

Technische Universität Berlin, Berlin, Germany

Technische Universität Berlin, Berlin, Germany
View Profile

SIGMOD '15: Proceedings of the 2015 ACM SIGMOD International Conference on Management of DataMay 2015Pages 1477–1492https://doi.org/10.1145/2723372.2749438

Published:27 May 2015Publication History

SIGMOD '15: Proceedings of the 2015 ACM SIGMOD International Conference on Management of Data

Pages 1477–1492

Editorial Notes

Computationally Reproducible. The experimental results of this paper were reproduced by a SIGMOD Review Committee and were found to support the central results reported in the paper. Details of the review process are found here:

http://db-reproducibility.seas.harvard.edu/#process

ABSTRACT

Quickly and accurately estimating the selectivity of multidimensional predicates is a vital part of a modern relational query optimizer. The state-of-the art in this field are multidimensional histograms, which offer good estimation quality but are complex to construct and hard to maintain. Kernel Density Estimation (KDE) is an interesting alternative that does not suffer from these problems. However, existing KDE-based selectivity estimators can hardly compete with the estimation quality of state-of-the art methods.

In this paper, we substantially expand the state-of-the-art in KDE-based selectivity estimation by improving along three dimensions: First, we demonstrate how to numerically optimize a KDE model, leading to substantially improved estimates. Second, we develop methods to continuously adapt the estimator to changes in both the database and the query workload. Finally, we show how to drastically improve the performance by pushing computations onto a GPU.

We provide an implementation of our estimator and experimentally evaluate it on a variety of datasets and workloads, demonstrating that it efficiently scales up to very large model sizes, adapts itself to database changes, and typically outperforms the estimation quality of both existing Kernel Density Estimators as well as state-of-the-art multidimensional histograms.

References

Multivariate Density Estimation - Theory, Practice and Visualization. John Wiley & Sons, Inc., 1992.Google Scholar
W. Andrzejewski, A. Gramacki, and J. Gramacki. Graphics processing units in acceleration of bandwidth selection for kernel density estimation. Int. J. Appl. Math. Comput. Sci, 23(4):869--885, 2013.Google ScholarDigital Library
K. Bache and M. Lichman. UCI machine learning repository, 2013.Google Scholar
B. Blohsfeld, D. Korus, and B. Seeger. A comparison of selectivity estimators for range queries on metric attributes. SIGMOD Record, 28(2):239--250, 1999. Google ScholarDigital Library
A. W. Bowman. An alternative method of cross-validation for the smoothing of density estimates. Biometrika, 71(2):353--360, 1984.Google ScholarCross Ref
S. Breß, M. Heimel, N. Siegmund, L. Bellatreche, and G. Saake. Gpu-accelerated database systems: Survey and open challenges. In Transactions on Large-Scale Data-and Knowledge-Centered Systems XV, pages 1--35. Springer, 2014.Google Scholar
N. Bruno, S. Chaudhuri, and L. Gravano. Stholes: a multidimensional workload-aware histogram. SIGMOD Record, 30(2):211--222, 2001. Google ScholarDigital Library
R. H. Byrd, P. Lu, J. Nocedal, and C. Zhu. A limited memory algorithm for bound constrained optimization. SIAM Journal on Scientific Computing, 16(5):1190--1208, 1995. Google ScholarDigital Library
S. Chaudhuri, R. Motwani, and V. Narasayya. On random sampling over joins. SIGMOD Record, 28(2):263--274, 1999. Google ScholarDigital Library
S. Christodoulakis. Implications of certain assumptions in database performance evauation. ACM Transactions on Database Systems (TODS), 9(2):163--186, 1984. Google ScholarDigital Library
T. Duong and M. L. Hazelton. Cross-validation bandwidth matrices for multivariate kernel density estimation. Scandinavian Journal of Statistics, 32(3):485--506, 2005.Google ScholarCross Ref
R. Gemulla, W. Lehner, and P. J. Haas. Maintaining bounded-size sample synopses of evolving datasets. The VLDB Journal, 17(2):173--201, 2008. Google ScholarDigital Library
P. B. Gibbons, Y. Matias, and V. Poosala. Maintaining a random sample of a relation in a database in the presence of updates to the relation, Jan. 4 2000. US Patent 6,012,064.Google Scholar
D. Gunopulos, G. Kollios, J. Tsotras, and C. Domeniconi. Selectivity estimators for multidimensional range queries over real attributes. The VLDB Journal, 14(2):137--154, 2005. Google ScholarDigital Library
D. Gunopulos, G. Kollios, V. J. Tsotras, and C. Domeniconi. Approximating multi-dimensional aggregate range queries over real attributes. ACM SIGMOD Record, 29(2):463--474, 2000. Google ScholarDigital Library
M. Heimel and V. Markl. A first step towards gpu-assisted query optimization. In International Workshop on Accelerating Data Management Systems using Modern Processor and Storage Architectures (ADMS), Istanbul, Turkey, pages 33--44, 2012.Google Scholar
M. Heimel, M. Saecker, H. Pirk, S. Manegold, and V. Markl. Hardware-oblivious parallelism for in-memory column-stores. Proceedings of the VLDB Endowment, 6(9):709--720, 2013. Google ScholarDigital Library
C. Heinz and B. Seeger. Cluster kernels: Resource-aware kernel density estimators over streaming data. Knowledge and Data Engineering, IEEE Transactions on, 20(7):880--893, 2008. Google ScholarDigital Library
D. Horn. Stream reduction operations for gpgpu applications. Gpu gems, 2:573--589, 2005.Google Scholar
Y. Ioannidis. The history of histograms (abridged). In Proceedings of the 29th VLDB conference, pages 19--30. VLDB Endowment, 2003. Google ScholarDigital Library
Y. E. Ioannidis and S. Christodoulakis. On the propagation of errors in the size of join results. SIGMOD Record, 20(2):268--277, 1991. Google ScholarDigital Library
S. G. Johnson. The NLopt nonlinear-optimization package. http://ab-initio.mit.edu/nlopt.Google Scholar
M. C. Jones, J. S. Marron, and S. J. Sheather. A brief survey of bandwidth selection for density estimation. Journal of the American Statistical Association, 91(433):401--407, 1996.Google ScholarCross Ref
A. R. Kan and G. Timmer. Stochastic global optimization methods part ii: Multi level methods. Mathematical Programming, 39(1):57--78, 1987. Google ScholarDigital Library
P.-A. Larson, W. Lehner, J. Zhou, and P. Zabback. Cardinality estimation using sample views with quality assurance. In Proceedings of the 2007 ACM SIGMOD Conference, pages 175--186. ACM, 2007. Google ScholarDigital Library
J.-H. Lee, D.-H. Kim, and C.-W. Chung. Multi-dimensional selectivity estimation using compressed histogram information. SIGMOD Record, 28(2):205--214, 1999. Google ScholarDigital Library
Q. Li and J. Racine. Nonparametric estimation of distributions with categorical and continuous data. journal of multivariate analysis, 86(2):266--292, 2003. Google ScholarDigital Library
R. J. Lipton, J. F. Naughton, and D. A. Schneider. Practical selectivity estimation through adaptive sampling, volume 19. ACM, 1990. Google ScholarDigital Library
V. Markl, P. J. Haas, M. Kutsch, N. Megiddo, U. Srivastava, and T. M. Tran. Consistent selectivity estimation via maximum entropy. The VLDB journal, 16(1):55--76, 2007. Google ScholarDigital Library
Y. Matias, J. S. Vitter, and M. Wang. Wavelet-based histograms for selectivity estimation. SIGMOD Record, 27(2):448--459, 1998. Google ScholarDigital Library
G. Moerkotte, T. Neumann, and G. Steidl. Preventing bad plans by bounding the impact of cardinality estimation errors. Proceedings of the VLDB Endowment, 2(1):982--993, 2009. Google ScholarDigital Library
M. Muralikrishna and D. J. DeWitt. Equi-depth multidimensional histograms. SIGMOD Record, 17(3):28--36, 1988. Google ScholarDigital Library
F. Olken and D. Rotem. Maintenance of materialized views of sampling queries. In Data Engineering, 1992. Proceedings. Eighth International Conference on, pages 632--641. IEEE, 1992. Google ScholarDigital Library
V. Poosala and Y. E. Ioannidis. Selectivity estimation without the attribute value independence assumption. In Proceedings of the 23rd VLDB conference, pages 486--495. VLDB Endowment, 1997. Google ScholarDigital Library
N. Reddy and J. R. Haritsa. Analyzing plan diagrams of database query optimizers. In Proceedings of the 31st VLDB conference, pages 1228--1239. VLDB Endowment, 2005. Google ScholarDigital Library
M. Riedmiller and H. Braun. A direct adaptive method for faster backpropagation learning: The rprop algorithm. In Neural Networks, 1993., IEEE International Conference on, pages 586--591. IEEE, 1993.Google ScholarCross Ref
S. R. Sain, K. A. Baggerly, and D. W. Scott. Cross-validation of multivariate densities. Journal of the American Statistical Association, 89(427):807--817, 1994.Google ScholarCross Ref
U. Srivastava, P. J. Haas, V. Markl, M. Kutsch, and T. M. Tran. Isomer: Consistent histogram construction using query feedback. In Data Engineering, 2006. ICDE'06. Proceedings of the 22nd International Conference on, pages 39--39. IEEE, 2006. Google ScholarDigital Library
S. Subramaniam, T. Palpanas, D. Papadopoulos, V. Kalogeraki, and D. Gunopulos. Online outlier detection in sensor data using non-parametric models. In Proceedings of the 32nd VLDB conference, pages 187--198. VLDB Endowment, 2006. Google ScholarDigital Library
K. Svanberg. A class of globally convergent optimization methods based on conservative convex separable approximations. SIAM journal on optimization, 12(2):555--573, 2002. Google ScholarDigital Library
G. R. Terrell and D. W. Scott. Variable kernel density estimation. The Annals of Statistics, pages 1236--1265, 1992.Google ScholarCross Ref
T. Tieleman and G. Hinton. Lecture 6.5-rmsprop: Divide the gradient by a running average of its recent magnitude. COURSERA: Neural Networks for Machine Learning, 4, 2012.Google Scholar
J. S. Vitter. Random sampling with a reservoir. ACM Transactions on Mathematical Software (TOMS), 11(1):37--57, 1985. Google ScholarDigital Library
M. Wand and M. Jones. Comparison of smoothing parameterizations in bivariate kernel density estimation. Journal of the American Statistical Association, 88(422):520--528, 1993.Google ScholarCross Ref
M. Wand and M. Jones. Multivariate plug-in bandwidth selection. Computational Statistics, 9(2):97--116, 1994.Google Scholar
Z. Zhang, Y. Yang, R. Cai, D. Papadias, and A. Tung. Kernel-based skyline cardinality estimation. In Proceedings of the 2009 ACM SIGMOD Conference, pages 509--522. ACM, 2009. Google ScholarDigital Library

Index Terms

Self-Tuning, GPU-Accelerated Kernel Density Models for Multidimensional Selectivity Estimation
1. Information systems
  1. Data management systems
    1. Database management system engines
      1. Database query processing
2. Theory of computation
  1. Theory and algorithms for application domains
    1. Database theory
      1. Database query processing and optimization (theory)

Recommendations

Deep Learning Models for Selectivity Estimation of Multi-Attribute Queries
SIGMOD '20: Proceedings of the 2020 ACM SIGMOD International Conference on Management of Data

Selectivity estimation - the problem of estimating the result size of queries - is a fundamental problem in databases. Accurate estimation of query selectivity involving multiple correlated attributes is especially challenging. Poor cardinality ...
Read More
Kernel density estimation in accelerators

Kernel density estimation (KDE) is a popular technique used to estimate the probability density function of a random variable. KDE is considered a fundamental data smoothing algorithm, and it is a common building block in many scientific applications. ...
Read More
On the Efficacy of a Fused CPU+GPU Processor (or APU) for Parallel Computing
SAAHPC '11: Proceedings of the 2011 Symposium on Application Accelerators in High-Performance Computing

The graphics processing unit (GPU) has made significant strides as an accelerator in parallel computing. However, because the GPU has resided out on PCIe as a discrete device, the performance of GPU applications can be bottlenecked by data transfers ...
Read More

Comments

Login options

Check if you have access through your login credentials or your institution to get full access on this article.

Full Access

Get this Publication

Published in
SIGMOD '15: Proceedings of the 2015 ACM SIGMOD International Conference on Management of Data
May 2015
2110 pages
ISBN:9781450327589
DOI:10.1145/2723372
General Chair:
Timos Sellis
RMIT University, Australia
,
Program Chairs:
Susan B. Davidson
University of Pennsylvania, USA
,
Zack Ives
University of Pennsylvania, USA
Copyright © 2015 ACM
Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. Copyrights for components of this work owned by others than the author(s) must be honored. Abstracting with credit is permitted. To copy otherwise, or republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee. Request permissions from [email protected].
Sponsors
In-Cooperation
Publisher
Association for Computing Machinery
New York, NY, United States
Publication History
- Published: 27 May 2015
Permissions
Request permissions about this article.
Request Permissions

Check for updates
Badges
- Results Reproduced / v1.1
Author Tags
bandwidth selection
cardinality estimation
gpgpu
gpu-accelerated databases
gpu-assisted query optimization
graphics cards
kernel density estimator
opencl
query optimization
selectivity estimation
self-tuning databases
Qualifiers
- research-article
Conference

Acceptance Rates
SIGMOD '15 Paper Acceptance Rate106of415submissions,26%Overall Acceptance Rate785of4,003submissions,20%
More
Funding Sources
Other Metrics
View Article Metrics

Article Metrics
- 53
  Total Citations
  View Citations
- 1,037
  Total Downloads
- Downloads (Last 12 months)76
- Downloads (Last 6 weeks)11
Other Metrics
View Author Metrics
Cited By
View all

PDF Format

View or Download as a PDF file.

PDF

eReader

View online with eReader.

eReader

Self-Tuning, GPU-Accelerated Kernel Density Models for Multidimensional Selectivity Estimation

SIGMOD '15: Proceedings of the 2015 ACM SIGMOD International Conference on Management of Data

Editorial Notes

ABSTRACT

References

Cited By

Index Terms

Recommendations

Deep Learning Models for Selectivity Estimation of Multi-Attribute Queries

Kernel density estimation in accelerators

On the Efficacy of a Fused CPU+GPU Processor (or APU) for Parallel Computing