skip to main content
article

Optimized stratified sampling for approximate query processing

Published:01 June 2007Publication History
Skip Abstract Section

Abstract

The ability to approximately answer aggregation queries accurately and efficiently is of great benefit for decision support and data mining tools. In contrast to previous sampling-based studies, we treat the problem as an optimization problem where, given a workload of queries, we select a stratified random sample of the original data such that the error in answering the workload queries using the sample is minimized. A key novelty of our approach is that we can tailor the choice of samples to be robust, even for workloads that are “similar” but not necessarily identical to the given workload. Finally, our techniques recognize the importance of taking into account the variance in the data distribution in a principled manner. We show how our solution can be implemented on a database system, and present results of extensive experiments on Microsoft SQL Server that demonstrate the superior quality of our method compared to previous work.

References

  1. Acharya, S., Gibbons, P. B., and Poosala, V. 2000. Congressional samples for approximate answering of group-by queries. In Proceedings of the ACM SIGMOD International Conference on Management of Data. Google ScholarGoogle ScholarDigital LibraryDigital Library
  2. Acharya, S., Gibbons, P. B., Poosala, V., and Ramaswamy, S. 1999. Join synopses for approximate query answering. In Proceedings of the ACM SIGMOD International Conference on Management of Data. Google ScholarGoogle ScholarDigital LibraryDigital Library
  3. Agrawal, S., Chaudhuri, C., and Narasayya, V. 2000. Automated selection of materialized views and indexes in SQL databases. In Proceedings of the International Conference on Very Large Databases. 496--505. Google ScholarGoogle ScholarDigital LibraryDigital Library
  4. Agrawal, S., Narasayya, V., and Yang, B. 2004. Integrating vertical and horizontal partitioning into automated physical database design. In Proceedings of the ACM SIGMOD International Conference on Management of Data. 359--370. Google ScholarGoogle ScholarDigital LibraryDigital Library
  5. Babcock, B., Chaudhuri, C., and Das, G. 2003. Dynamic sample selection for approximate query processing. In Proceedings of the ACM SIGMOD International Conference on Management of Data. 539--550. Google ScholarGoogle ScholarDigital LibraryDigital Library
  6. Barbará, D. and Sullivan, M. 1997. Quasi-Cubes: Exploiting approximations in multidimensional databases. SIGMOD Rec. 26, 3. Google ScholarGoogle ScholarDigital LibraryDigital Library
  7. Barbará, D. and Wu., X. 1999. Using approximations to scale exploratory data analysis in datacubes. In Proceedings of the ACM SIGKDD International Conference on Knowledge Discovery and Data Mining. Google ScholarGoogle ScholarDigital LibraryDigital Library
  8. Bethel, J. 1989. Sample allocation in multivariate surveys. Surv. Methodol. 15, 47--57.Google ScholarGoogle Scholar
  9. Causey, B. D. 1983. Computational aspects of optimal allocation in multivariate stratified sampling. SIAM J. Sci. Statist. Comput. 4, 2.Google ScholarGoogle ScholarDigital LibraryDigital Library
  10. Chakrabarti, K., Garofalakis, M., Rastogi, R., and Shim, K. 2000. Approximate query processing using wavelets. In Proceedings of the International Conference on Very Large Databases. Google ScholarGoogle ScholarDigital LibraryDigital Library
  11. Chaudhuri, S., Das, G., Datar, M., Motwani, R., and Narasayya, V. 2001. Overcoming limitations of sampling for aggregation queries. In Proceedings of the IEEE International Conference on Data Engineering. Google ScholarGoogle ScholarDigital LibraryDigital Library
  12. Chaudhuri, S., Das, G., and Narasayya, V. 2001. A robust, optimization-based approach for approximate answering of aggregation queries. In Proceedings of the ACM SIGMOD International Conference on Management of Data. Google ScholarGoogle ScholarDigital LibraryDigital Library
  13. Chaudhuri, S., Krishnamurthy, R., Potamianos, S., and Shim, S. 1995. Optimizing queries with materialized views. In Proceedings of the IEEE International Conference on Data Engineering. Google ScholarGoogle ScholarDigital LibraryDigital Library
  14. Chaudhuri, S., Motwani, R., and Narasayya, V. 1999. Random sampling over joins. In Proceedings of the ACM SIGMOD International Conference on Management of Data. Google ScholarGoogle ScholarDigital LibraryDigital Library
  15. Chaudhuri, S. and Narasayya, V. 2006. Program for TPC-D data generation with skew. http://research.microsoft.com/dmx/Google ScholarGoogle Scholar
  16. Chaudhuri, S. and Narasayya, V. 1997. An efficient, cost-driven index selection tool for Microsoft SQL server. In Proceedings of the International Conference on Very Large Databases. Google ScholarGoogle ScholarDigital LibraryDigital Library
  17. Chaudhuri, S. and Narasayya, V. 1998. AutoAdmin what-if index analysis utility. In Proceedings of the ACM SIGMOD International Conference on Management of Data. Google ScholarGoogle ScholarDigital LibraryDigital Library
  18. Chromy, J. W. 1987. Design optimization with multiple objectives. In Proceedings of the Survey Research Section, American Statistical Association.Google ScholarGoogle Scholar
  19. Cochran, W. G. 1977. Sampling Techniques, 3rd ed. John Wiley, New York.Google ScholarGoogle Scholar
  20. Das, G. 2003. Survey of approximate query processing techniques. Tutorial at the International Conference on Scientific and Statistical Database Management.Google ScholarGoogle Scholar
  21. Fan, C. T., Muller, M. E., and Rezucha, I. 1962. Development of sampling plans by using sequential (item by item) selection techniques and digital computers. J. Amer. Statis. Assoc. 57, 298, 387--402.Google ScholarGoogle ScholarCross RefCross Ref
  22. Ganti, V., Lee M. L., and Ramakrishnan, R. 2000. ICICLES: Self-Tuning samples for approximate query answering. In Proceedings of the International Conference on Very Large Databases. Google ScholarGoogle ScholarDigital LibraryDigital Library
  23. Garofalakis, M. N. and Gibbons, P. B. 2001. Approximate query processing: Taming the terabytes. Tutorial at the International Conference on Very Large Databases. Google ScholarGoogle ScholarDigital LibraryDigital Library
  24. Garofalakis, M., Ganguly, S., Kumar, A., and Rastogi, R. 2005. Join-Distinct aggregate estimation over update streams. In Proceedings of the International Conference on the Principles of Database Systems. Google ScholarGoogle ScholarDigital LibraryDigital Library
  25. Getoor, L., Taskar, B., and Koller, D. 2001. Selectivity estimation using probabilistic model. In Proceedings of the SIGMOD International Conference on Management of Data. Google ScholarGoogle ScholarDigital LibraryDigital Library
  26. Gibbons, P. B., Matias, Y., and Poosala, V. 1997. Fast incremental maintenance of approximate histograms. In Proceedings of the 23rd International Conference on Very Large Data Bases. Google ScholarGoogle ScholarDigital LibraryDigital Library
  27. Goldstein, J. and Larson, P. 2001. Optimizing queries using materialized views: A practical, scalable solution. In Proceedings of the SIGMOD International Conference on Management of Data. Google ScholarGoogle ScholarDigital LibraryDigital Library
  28. Gray, J., Chaudhuri, S., Bosworth, A., Layman, A., Reichart, D., Venkatrao, M., Pellow, F., and Pirahesh, H. 1997. Data cube: A relational aggregation operator generalizing group-by, cross-tab, and sub-totals. J. Data Mining Knowl. Discov. 1, 1, 29--53. Google ScholarGoogle ScholarDigital LibraryDigital Library
  29. Greenwald, M. and Khanna, S. 2001. Space-Efficient online computation of quantile summaries. In Proceedings of the ACM SIGMOD International Conference on Management of Data. Google ScholarGoogle ScholarDigital LibraryDigital Library
  30. Gunopulos, D., Kollios, G., Tsotras, V. J., and Domeniconi, C. 2000. Approximating multi-dimensional aggregate range queries over real attributes. In Proceedings of the ACM SIGMOD International Conference on Management of Data. Google ScholarGoogle ScholarDigital LibraryDigital Library
  31. Halevy, A. 2001. Answering queries using views: A survey. VLDB J. Google ScholarGoogle ScholarDigital LibraryDigital Library
  32. Hellerstein, J., Haas, P., and Wang, H. 1997. Online aggregation. In Proceedings of the ACM SIGMOD International Conference on Management of Data. Google ScholarGoogle ScholarDigital LibraryDigital Library
  33. HochBaum, D. 1997. Approximation Algorithms for NP-Hard Problems. PWS Publishing. Google ScholarGoogle ScholarDigital LibraryDigital Library
  34. Ioannidis, Y. and Poosala, V. 1999. Histogram based approximations of set-valued query answers. In Proceedings of the International Conference on Very Large Databases. Google ScholarGoogle ScholarDigital LibraryDigital Library
  35. Jermaine, C. 2003. Robust estimation with sampling and approximate pre-aggregation. In Proceedings of the International Conference on Very Large Databases. 886--897. Google ScholarGoogle ScholarDigital LibraryDigital Library
  36. Jermaine, C., Pol, A., and Arumugam, S. 2004. Online maintenance of very large random samples. In Proceedings of the ACM SIGMOD International Conference on Management of Data. Google ScholarGoogle ScholarDigital LibraryDigital Library
  37. Kempe D., Dobra, A., and Gehrke, J. 2003. Gossip-Based computation of aggregate information. In Proceedings of IEEE Foundations of Computer Science. 482--491. Google ScholarGoogle ScholarDigital LibraryDigital Library
  38. Koudas, N. and Srivastava, D. 2003. Data stream query processing: A tutorial. Tutorial at the International Conference on Very Large Databases. Google ScholarGoogle ScholarDigital LibraryDigital Library
  39. Lohr, S. 1999. Sampling: Design and Analysis. Duxbury Press.Google ScholarGoogle Scholar
  40. Miller, R. G., Jr. 1981. Simultaneous Statis. Inference. Springer Series in Statistics. Springer Verlag.Google ScholarGoogle Scholar
  41. Mitchell, T. 1997. Machine Learning. McGraw-Hill, New York. Google ScholarGoogle ScholarDigital LibraryDigital Library
  42. Motwani, R. and Raghavan, P. 1995. Randomized Algorithms. Cambridge University Press, New York. Google ScholarGoogle ScholarDigital LibraryDigital Library
  43. Olken, F. 1993. Random sampling from databases. Ph.D. dissertation, Computer Science, UC Berkeley.Google ScholarGoogle Scholar
  44. Pavlov, D., Mannila, H, and Smyth, P. 2003. Beyond independence: Probabilistic methods for query approximation on binary transaction data. IEEE Trans. Knowl. Data Eng. 15, 6, 1409--1421. Google ScholarGoogle ScholarDigital LibraryDigital Library
  45. Polyzotis, N., Garofalakis, M. N., and Ioannidis, Y. 2004. Approximate XML query answers. In Proceedings of the SIGMOD International Conference on Management of Data. Google ScholarGoogle ScholarDigital LibraryDigital Library
  46. Poosala, V. and Ganti, V. 1999. Fast approximate answers to aggregate queries on a data cube. In Proceedings of the International Conference on Scientific and Statistical Database Management. Google ScholarGoogle ScholarDigital LibraryDigital Library
  47. Thisted, R. A. 1988. Elements of Statis. Comput. Chapman and Hall, London. Google ScholarGoogle ScholarDigital LibraryDigital Library
  48. Transaction Processing Performance Council. 2007. TPC benchmark R: Decision Support. Revision 1.1.0. http//www.tpc.org.Google ScholarGoogle Scholar
  49. Valliant, R. and Gentle, J. 1997. An application of mathematical programming to a sample allocation problem. Comput. Statis. Data Anal. 25, 337--360. Google ScholarGoogle ScholarDigital LibraryDigital Library
  50. Vitter, J. 1985. Random sampling with a reservoir. ACM Trans. Math. Softw. 11, 1. Google ScholarGoogle ScholarDigital LibraryDigital Library
  51. Vitter, J. and Wang, M. 1999. Approximate computation of multidimensional aggregates of sparse data using wavelet. In Proceedings of the ACM SIGMOD International Conference on Management of Data. Google ScholarGoogle ScholarDigital LibraryDigital Library
  52. Vitter, J., Wang, M., and Iyer, B. 1998. Data cube approximation and histogram via wavelets. In Proceedings of the International Conference on Information and Knowledge Management. Google ScholarGoogle ScholarDigital LibraryDigital Library
  53. Zipf, G. E. 1949. Human Behavior and the Principle of Least Effort. Addison-Wesley.Google ScholarGoogle Scholar

Index Terms

  1. Optimized stratified sampling for approximate query processing

    Recommendations

    Comments

    Login options

    Check if you have access through your login credentials or your institution to get full access on this article.

    Sign in

    Full Access

    • Published in

      cover image ACM Transactions on Database Systems
      ACM Transactions on Database Systems  Volume 32, Issue 2
      June 2007
      267 pages
      ISSN:0362-5915
      EISSN:1557-4644
      DOI:10.1145/1242524
      Issue’s Table of Contents

      Copyright © 2007 ACM

      Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. Copyrights for components of this work owned by others than ACM must be honored. Abstracting with credit is permitted. To copy otherwise, or republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee. Request permissions from [email protected]

      Publisher

      Association for Computing Machinery

      New York, NY, United States

      Publication History

      • Published: 1 June 2007
      Published in tods Volume 32, Issue 2

      Permissions

      Request permissions about this article.

      Request Permissions

      Check for updates

      Qualifiers

      • article

    PDF Format

    View or Download as a PDF file.

    PDF

    eReader

    View online with eReader.

    eReader