article

Optimized stratified sampling for approximate query processing

Authors:
Surajit Chaudhuri

Microsoft Research, Redmond, WA

Microsoft Research, Redmond, WA
View Profile

,
Gautam Das

University of Texas at Arlington, Arlington, TX

University of Texas at Arlington, Arlington, TX
View Profile

,
Vivek Narasayya

Microsoft Research, Redmond, WA

Microsoft Research, Redmond, WA
View Profile

Authors Info & Claims

ACM Transactions on Database Systems Volume 32 Issue 2pp 9–eshttps://doi.org/10.1145/1242524.1242526

Published:01 June 2007Publication History

ACM Transactions on Database Systems

Abstract

The ability to approximately answer aggregation queries accurately and efficiently is of great benefit for decision support and data mining tools. In contrast to previous sampling-based studies, we treat the problem as an optimization problem where, given a workload of queries, we select a stratified random sample of the original data such that the error in answering the workload queries using the sample is minimized. A key novelty of our approach is that we can tailor the choice of samples to be robust, even for workloads that are “similar” but not necessarily identical to the given workload. Finally, our techniques recognize the importance of taking into account the variance in the data distribution in a principled manner. We show how our solution can be implemented on a database system, and present results of extensive experiments on Microsoft SQL Server that demonstrate the superior quality of our method compared to previous work.

References

Acharya, S., Gibbons, P. B., and Poosala, V. 2000. Congressional samples for approximate answering of group-by queries. In Proceedings of the ACM SIGMOD International Conference on Management of Data. Google ScholarDigital Library
Acharya, S., Gibbons, P. B., Poosala, V., and Ramaswamy, S. 1999. Join synopses for approximate query answering. In Proceedings of the ACM SIGMOD International Conference on Management of Data. Google ScholarDigital Library
Agrawal, S., Chaudhuri, C., and Narasayya, V. 2000. Automated selection of materialized views and indexes in SQL databases. In Proceedings of the International Conference on Very Large Databases. 496--505. Google ScholarDigital Library
Agrawal, S., Narasayya, V., and Yang, B. 2004. Integrating vertical and horizontal partitioning into automated physical database design. In Proceedings of the ACM SIGMOD International Conference on Management of Data. 359--370. Google ScholarDigital Library
Babcock, B., Chaudhuri, C., and Das, G. 2003. Dynamic sample selection for approximate query processing. In Proceedings of the ACM SIGMOD International Conference on Management of Data. 539--550. Google ScholarDigital Library
Barbará, D. and Sullivan, M. 1997. Quasi-Cubes: Exploiting approximations in multidimensional databases. SIGMOD Rec. 26, 3. Google ScholarDigital Library
Barbará, D. and Wu., X. 1999. Using approximations to scale exploratory data analysis in datacubes. In Proceedings of the ACM SIGKDD International Conference on Knowledge Discovery and Data Mining. Google ScholarDigital Library
Bethel, J. 1989. Sample allocation in multivariate surveys. Surv. Methodol. 15, 47--57.Google Scholar
Causey, B. D. 1983. Computational aspects of optimal allocation in multivariate stratified sampling. SIAM J. Sci. Statist. Comput. 4, 2.Google ScholarDigital Library
Chakrabarti, K., Garofalakis, M., Rastogi, R., and Shim, K. 2000. Approximate query processing using wavelets. In Proceedings of the International Conference on Very Large Databases. Google ScholarDigital Library
Chaudhuri, S., Das, G., Datar, M., Motwani, R., and Narasayya, V. 2001. Overcoming limitations of sampling for aggregation queries. In Proceedings of the IEEE International Conference on Data Engineering. Google ScholarDigital Library
Chaudhuri, S., Das, G., and Narasayya, V. 2001. A robust, optimization-based approach for approximate answering of aggregation queries. In Proceedings of the ACM SIGMOD International Conference on Management of Data. Google ScholarDigital Library
Chaudhuri, S., Krishnamurthy, R., Potamianos, S., and Shim, S. 1995. Optimizing queries with materialized views. In Proceedings of the IEEE International Conference on Data Engineering. Google ScholarDigital Library
Chaudhuri, S., Motwani, R., and Narasayya, V. 1999. Random sampling over joins. In Proceedings of the ACM SIGMOD International Conference on Management of Data. Google ScholarDigital Library
Chaudhuri, S. and Narasayya, V. 2006. Program for TPC-D data generation with skew. http://research.microsoft.com/dmx/Google Scholar
Chaudhuri, S. and Narasayya, V. 1997. An efficient, cost-driven index selection tool for Microsoft SQL server. In Proceedings of the International Conference on Very Large Databases. Google ScholarDigital Library
Chaudhuri, S. and Narasayya, V. 1998. AutoAdmin what-if index analysis utility. In Proceedings of the ACM SIGMOD International Conference on Management of Data. Google ScholarDigital Library
Chromy, J. W. 1987. Design optimization with multiple objectives. In Proceedings of the Survey Research Section, American Statistical Association.Google Scholar
Cochran, W. G. 1977. Sampling Techniques, 3rd ed. John Wiley, New York.Google Scholar
Das, G. 2003. Survey of approximate query processing techniques. Tutorial at the International Conference on Scientific and Statistical Database Management.Google Scholar
Fan, C. T., Muller, M. E., and Rezucha, I. 1962. Development of sampling plans by using sequential (item by item) selection techniques and digital computers. J. Amer. Statis. Assoc. 57, 298, 387--402.Google ScholarCross Ref
Ganti, V., Lee M. L., and Ramakrishnan, R. 2000. ICICLES: Self-Tuning samples for approximate query answering. In Proceedings of the International Conference on Very Large Databases. Google ScholarDigital Library
Garofalakis, M. N. and Gibbons, P. B. 2001. Approximate query processing: Taming the terabytes. Tutorial at the International Conference on Very Large Databases. Google ScholarDigital Library
Garofalakis, M., Ganguly, S., Kumar, A., and Rastogi, R. 2005. Join-Distinct aggregate estimation over update streams. In Proceedings of the International Conference on the Principles of Database Systems. Google ScholarDigital Library
Getoor, L., Taskar, B., and Koller, D. 2001. Selectivity estimation using probabilistic model. In Proceedings of the SIGMOD International Conference on Management of Data. Google ScholarDigital Library
Gibbons, P. B., Matias, Y., and Poosala, V. 1997. Fast incremental maintenance of approximate histograms. In Proceedings of the 23rd International Conference on Very Large Data Bases. Google ScholarDigital Library
Goldstein, J. and Larson, P. 2001. Optimizing queries using materialized views: A practical, scalable solution. In Proceedings of the SIGMOD International Conference on Management of Data. Google ScholarDigital Library
Gray, J., Chaudhuri, S., Bosworth, A., Layman, A., Reichart, D., Venkatrao, M., Pellow, F., and Pirahesh, H. 1997. Data cube: A relational aggregation operator generalizing group-by, cross-tab, and sub-totals. J. Data Mining Knowl. Discov. 1, 1, 29--53. Google ScholarDigital Library
Greenwald, M. and Khanna, S. 2001. Space-Efficient online computation of quantile summaries. In Proceedings of the ACM SIGMOD International Conference on Management of Data. Google ScholarDigital Library
Gunopulos, D., Kollios, G., Tsotras, V. J., and Domeniconi, C. 2000. Approximating multi-dimensional aggregate range queries over real attributes. In Proceedings of the ACM SIGMOD International Conference on Management of Data. Google ScholarDigital Library
Halevy, A. 2001. Answering queries using views: A survey. VLDB J. Google ScholarDigital Library
Hellerstein, J., Haas, P., and Wang, H. 1997. Online aggregation. In Proceedings of the ACM SIGMOD International Conference on Management of Data. Google ScholarDigital Library
HochBaum, D. 1997. Approximation Algorithms for NP-Hard Problems. PWS Publishing. Google ScholarDigital Library
Ioannidis, Y. and Poosala, V. 1999. Histogram based approximations of set-valued query answers. In Proceedings of the International Conference on Very Large Databases. Google ScholarDigital Library
Jermaine, C. 2003. Robust estimation with sampling and approximate pre-aggregation. In Proceedings of the International Conference on Very Large Databases. 886--897. Google ScholarDigital Library
Jermaine, C., Pol, A., and Arumugam, S. 2004. Online maintenance of very large random samples. In Proceedings of the ACM SIGMOD International Conference on Management of Data. Google ScholarDigital Library
Kempe D., Dobra, A., and Gehrke, J. 2003. Gossip-Based computation of aggregate information. In Proceedings of IEEE Foundations of Computer Science. 482--491. Google ScholarDigital Library
Koudas, N. and Srivastava, D. 2003. Data stream query processing: A tutorial. Tutorial at the International Conference on Very Large Databases. Google ScholarDigital Library
Lohr, S. 1999. Sampling: Design and Analysis. Duxbury Press.Google Scholar
Miller, R. G., Jr. 1981. Simultaneous Statis. Inference. Springer Series in Statistics. Springer Verlag.Google Scholar
Mitchell, T. 1997. Machine Learning. McGraw-Hill, New York. Google ScholarDigital Library
Motwani, R. and Raghavan, P. 1995. Randomized Algorithms. Cambridge University Press, New York. Google ScholarDigital Library
Olken, F. 1993. Random sampling from databases. Ph.D. dissertation, Computer Science, UC Berkeley.Google Scholar
Pavlov, D., Mannila, H, and Smyth, P. 2003. Beyond independence: Probabilistic methods for query approximation on binary transaction data. IEEE Trans. Knowl. Data Eng. 15, 6, 1409--1421. Google ScholarDigital Library
Polyzotis, N., Garofalakis, M. N., and Ioannidis, Y. 2004. Approximate XML query answers. In Proceedings of the SIGMOD International Conference on Management of Data. Google ScholarDigital Library
Poosala, V. and Ganti, V. 1999. Fast approximate answers to aggregate queries on a data cube. In Proceedings of the International Conference on Scientific and Statistical Database Management. Google ScholarDigital Library
Thisted, R. A. 1988. Elements of Statis. Comput. Chapman and Hall, London. Google ScholarDigital Library
Transaction Processing Performance Council. 2007. TPC benchmark R: Decision Support. Revision 1.1.0. http//www.tpc.org.Google Scholar
Valliant, R. and Gentle, J. 1997. An application of mathematical programming to a sample allocation problem. Comput. Statis. Data Anal. 25, 337--360. Google ScholarDigital Library
Vitter, J. 1985. Random sampling with a reservoir. ACM Trans. Math. Softw. 11, 1. Google ScholarDigital Library
Vitter, J. and Wang, M. 1999. Approximate computation of multidimensional aggregates of sparse data using wavelet. In Proceedings of the ACM SIGMOD International Conference on Management of Data. Google ScholarDigital Library
Vitter, J., Wang, M., and Iyer, B. 1998. Data cube approximation and histogram via wavelets. In Proceedings of the International Conference on Information and Knowledge Management. Google ScholarDigital Library
Zipf, G. E. 1949. Human Behavior and the Principle of Least Effort. Addison-Wesley.Google Scholar

Index Terms

Optimized stratified sampling for approximate query processing
1. Information systems
  1. Data management systems

Recommendations

Approximate query processing using wavelets

Approximate query processing has emerged as a cost-effective approach for dealing with the huge data volumes and stringent response-time requirements of today's decision support systems (DSS). Most work in this area, however, has so far been limited in ...
Read More
Approximate Query Processing with Error Guarantees
Big-Data-Analytics in Astronomy, Science, and Engineering
Abstract
In recent years, with the increase of data and the sophistication of analysis requirements, query processing in databases has become more important. Recently, approximate query processing (AQP) was proposed for efficiently executing database ...
Read More
Sampling over Union of Joins
SIGMOD '23: Companion of the 2023 International Conference on Management of Data

Data scientists often draw on multiple relational data sources for analysis. A standard assumption in learning and approximate query answering is that the data is a uniform and independent sample of the underlying distribution. To avoid the cost of join ...
Read More

Comments

Login options

Check if you have access through your login credentials or your institution to get full access on this article.

Full Access

Get this Article

Published in

ACM Transactions on Database Systems Volume 32, Issue 2
June 2007
267 pages
ISSN:0362-5915
EISSN:1557-4644
DOI:10.1145/1242524
Issue’s Table of Contents

Copyright © 2007 ACM
Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. Copyrights for components of this work owned by others than ACM must be honored. Abstracting with credit is permitted. To copy otherwise, or republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee. Request permissions from [email protected]
Sponsors
In-Cooperation
Publisher
Association for Computing Machinery
New York, NY, United States
Publication History
- Published: 1 June 2007
Published in tods Volume 32, Issue 2

Permissions
Request permissions about this article.
Request Permissions

Check for updates
Author Tags
Random sampling
approximation
query processing
Qualifiers
- article
Conference
Funding Sources
Other Metrics
View Article Metrics

Article Metrics
- 176
  Total Citations
  View Citations
- 1,862
  Total Downloads
- Downloads (Last 12 months)98
- Downloads (Last 6 weeks)25
Other Metrics
View Author Metrics
Cited By
View all

PDF Format

View or Download as a PDF file.

PDF

eReader

View online with eReader.

eReader

Optimized stratified sampling for approximate query processing

ACM Transactions on Database Systems

Abstract

References

Cited By

Index Terms

Recommendations

Approximate query processing using wavelets

Approximate Query Processing with Error Guarantees

Sampling over Union of Joins

Comments

Login options

Full Access

Published in

Sponsors

In-Cooperation

Publisher

Publication History

Permissions

Check for updates

Author Tags

Qualifiers

Conference

Funding Sources

Other Metrics

Article Metrics

Other Metrics

Cited By

PDF Format

eReader

Digital Edition

Caption

Optimized stratified sampling for approximate query processing

ACM Transactions on Database Systems

Abstract

References

Cited By

Index Terms

Recommendations

Approximate query processing using wavelets

Approximate Query Processing with Error Guarantees

Sampling over Union of Joins

Comments

Login options

Full Access

Published in

Sponsors

In-Cooperation

Publisher

Publication History

Permissions

Check for updates

Author Tags

Qualifiers

Conference

Funding Sources

Article Metrics

Other Metrics

PDF Format

eReader

Digital Edition

Share this Publication link

Share on Social Media