DOI: 10.1145/2723372.2749438
Research Article
Results Reproduced / v1.1

Self-Tuning, GPU-Accelerated Kernel Density Models for Multidimensional Selectivity Estimation

Published: 27 May 2015

Editorial Notes

Computationally Reproducible. The experimental results of this paper were reproduced by a SIGMOD Review Committee and were found to support the central results reported in the paper. Details of the review process can be found here:

http://db-reproducibility.seas.harvard.edu/#process

ABSTRACT

Quickly and accurately estimating the selectivity of multidimensional predicates is a vital part of a modern relational query optimizer. The state of the art in this field is multidimensional histograms, which offer good estimation quality but are complex to construct and hard to maintain. Kernel Density Estimation (KDE) is an interesting alternative that does not suffer from these problems. However, existing KDE-based selectivity estimators can hardly compete with the estimation quality of state-of-the-art methods.
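To make the KDE approach concrete, the following is a minimal sketch (not taken from the paper itself) of how a Gaussian-kernel estimator can answer a multidimensional range predicate from a uniform sample of the table; the sample size, bandwidth values, and NumPy/SciPy implementation are illustrative assumptions.

    import numpy as np
    from scipy.stats import norm

    def kde_selectivity(sample, h, lo, hi):
        # Each sampled point contributes the probability mass its Gaussian
        # kernel places inside the query box [lo, hi]; with a product kernel
        # this factorizes into per-dimension CDF differences.
        upper = norm.cdf((hi - sample) / h)   # shape (n, d)
        lower = norm.cdf((lo - sample) / h)   # shape (n, d)
        return np.prod(upper - lower, axis=1).mean()

    # Toy usage on uniform 2-D data: selectivity of 0.2 <= x < 0.5 AND 0.1 <= y < 0.9.
    rng = np.random.default_rng(0)
    data = rng.random((100_000, 2))
    sample = data[rng.choice(len(data), 1_000, replace=False)]
    est = kde_selectivity(sample, h=np.array([0.02, 0.02]),
                          lo=np.array([0.2, 0.1]), hi=np.array([0.5, 0.9]))
    true = np.mean((data[:, 0] >= 0.2) & (data[:, 0] < 0.5) &
                   (data[:, 1] >= 0.1) & (data[:, 1] < 0.9))
    print(f"KDE estimate: {est:.3f}  true selectivity: {true:.3f}")

Because the estimate is a simple sum over independent sample-point contributions, it is embarrassingly parallel, which is what makes offloading it to a GPU attractive.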

In this paper, we substantially expand the state of the art in KDE-based selectivity estimation by improving along three dimensions: First, we demonstrate how to numerically optimize a KDE model, leading to substantially improved estimates. Second, we develop methods to continuously adapt the estimator to changes in both the database and the query workload. Finally, we show how to drastically improve performance by pushing computations onto a GPU.
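As a rough illustration of the self-tuning idea, the sketch below (continuing the example above) picks per-dimension bandwidths by numerically minimizing the squared estimation error over a small log of previously executed queries with known true selectivities. The use of L-BFGS-B over log-bandwidths and this particular query-feedback loss are assumptions made for illustration, not necessarily the paper's exact formulation.

    from scipy.optimize import minimize

    def tune_bandwidths(sample, query_log, h0):
        # query_log: list of (lo, hi, observed_selectivity) feedback records.
        def loss(log_h):
            h = np.exp(log_h)                     # keep bandwidths positive
            errs = [kde_selectivity(sample, h, lo, hi) - sel
                    for lo, hi, sel in query_log]
            return float(np.sum(np.square(errs)))
        # Gradients are approximated numerically by the optimizer.
        res = minimize(loss, np.log(h0), method="L-BFGS-B")
        return np.exp(res.x)

    # Feedback from three executed range queries, with true selectivities
    # observed at run time.
    query_log = [
        (np.array([0.2, 0.1]), np.array([0.5, 0.9]), 0.24),
        (np.array([0.0, 0.0]), np.array([0.5, 0.5]), 0.25),
        (np.array([0.5, 0.5]), np.array([1.0, 1.0]), 0.25),
    ]
    h_tuned = tune_bandwidths(sample, query_log, h0=np.array([0.05, 0.05]))
    print("tuned bandwidths:", h_tuned)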

We provide an implementation of our estimator and experimentally evaluate it on a variety of datasets and workloads, demonstrating that it efficiently scales up to very large model sizes, adapts itself to database changes, and typically outperforms the estimation quality of both existing kernel density estimators and state-of-the-art multidimensional histograms.

