Abstract
The ability to approximately answer aggregation queries accurately and efficiently is of great benefit for decision support and data mining tools. In contrast to previous sampling-based studies, we treat the problem as an optimization problem where, given a workload of queries, we select a stratified random sample of the original data such that the error in answering the workload queries using the sample is minimized. A key novelty of our approach is that we can tailor the choice of samples to be robust, even for workloads that are “similar” but not necessarily identical to the given workload. Finally, our techniques recognize the importance of taking into account the variance in the data distribution in a principled manner. We show how our solution can be implemented on a database system, and present results of extensive experiments on Microsoft SQL Server that demonstrate the superior quality of our method compared to previous work.
- Acharya, S., Gibbons, P. B., and Poosala, V. 2000. Congressional samples for approximate answering of group-by queries. In Proceedings of the ACM SIGMOD International Conference on Management of Data. Google ScholarDigital Library
- Acharya, S., Gibbons, P. B., Poosala, V., and Ramaswamy, S. 1999. Join synopses for approximate query answering. In Proceedings of the ACM SIGMOD International Conference on Management of Data. Google ScholarDigital Library
- Agrawal, S., Chaudhuri, C., and Narasayya, V. 2000. Automated selection of materialized views and indexes in SQL databases. In Proceedings of the International Conference on Very Large Databases. 496--505. Google ScholarDigital Library
- Agrawal, S., Narasayya, V., and Yang, B. 2004. Integrating vertical and horizontal partitioning into automated physical database design. In Proceedings of the ACM SIGMOD International Conference on Management of Data. 359--370. Google ScholarDigital Library
- Babcock, B., Chaudhuri, C., and Das, G. 2003. Dynamic sample selection for approximate query processing. In Proceedings of the ACM SIGMOD International Conference on Management of Data. 539--550. Google ScholarDigital Library
- Barbará, D. and Sullivan, M. 1997. Quasi-Cubes: Exploiting approximations in multidimensional databases. SIGMOD Rec. 26, 3. Google ScholarDigital Library
- Barbará, D. and Wu., X. 1999. Using approximations to scale exploratory data analysis in datacubes. In Proceedings of the ACM SIGKDD International Conference on Knowledge Discovery and Data Mining. Google ScholarDigital Library
- Bethel, J. 1989. Sample allocation in multivariate surveys. Surv. Methodol. 15, 47--57.Google Scholar
- Causey, B. D. 1983. Computational aspects of optimal allocation in multivariate stratified sampling. SIAM J. Sci. Statist. Comput. 4, 2.Google ScholarDigital Library
- Chakrabarti, K., Garofalakis, M., Rastogi, R., and Shim, K. 2000. Approximate query processing using wavelets. In Proceedings of the International Conference on Very Large Databases. Google ScholarDigital Library
- Chaudhuri, S., Das, G., Datar, M., Motwani, R., and Narasayya, V. 2001. Overcoming limitations of sampling for aggregation queries. In Proceedings of the IEEE International Conference on Data Engineering. Google ScholarDigital Library
- Chaudhuri, S., Das, G., and Narasayya, V. 2001. A robust, optimization-based approach for approximate answering of aggregation queries. In Proceedings of the ACM SIGMOD International Conference on Management of Data. Google ScholarDigital Library
- Chaudhuri, S., Krishnamurthy, R., Potamianos, S., and Shim, S. 1995. Optimizing queries with materialized views. In Proceedings of the IEEE International Conference on Data Engineering. Google ScholarDigital Library
- Chaudhuri, S., Motwani, R., and Narasayya, V. 1999. Random sampling over joins. In Proceedings of the ACM SIGMOD International Conference on Management of Data. Google ScholarDigital Library
- Chaudhuri, S. and Narasayya, V. 2006. Program for TPC-D data generation with skew. http://research.microsoft.com/dmx/Google Scholar
- Chaudhuri, S. and Narasayya, V. 1997. An efficient, cost-driven index selection tool for Microsoft SQL server. In Proceedings of the International Conference on Very Large Databases. Google ScholarDigital Library
- Chaudhuri, S. and Narasayya, V. 1998. AutoAdmin what-if index analysis utility. In Proceedings of the ACM SIGMOD International Conference on Management of Data. Google ScholarDigital Library
- Chromy, J. W. 1987. Design optimization with multiple objectives. In Proceedings of the Survey Research Section, American Statistical Association.Google Scholar
- Cochran, W. G. 1977. Sampling Techniques, 3rd ed. John Wiley, New York.Google Scholar
- Das, G. 2003. Survey of approximate query processing techniques. Tutorial at the International Conference on Scientific and Statistical Database Management.Google Scholar
- Fan, C. T., Muller, M. E., and Rezucha, I. 1962. Development of sampling plans by using sequential (item by item) selection techniques and digital computers. J. Amer. Statis. Assoc. 57, 298, 387--402.Google ScholarCross Ref
- Ganti, V., Lee M. L., and Ramakrishnan, R. 2000. ICICLES: Self-Tuning samples for approximate query answering. In Proceedings of the International Conference on Very Large Databases. Google ScholarDigital Library
- Garofalakis, M. N. and Gibbons, P. B. 2001. Approximate query processing: Taming the terabytes. Tutorial at the International Conference on Very Large Databases. Google ScholarDigital Library
- Garofalakis, M., Ganguly, S., Kumar, A., and Rastogi, R. 2005. Join-Distinct aggregate estimation over update streams. In Proceedings of the International Conference on the Principles of Database Systems. Google ScholarDigital Library
- Getoor, L., Taskar, B., and Koller, D. 2001. Selectivity estimation using probabilistic model. In Proceedings of the SIGMOD International Conference on Management of Data. Google ScholarDigital Library
- Gibbons, P. B., Matias, Y., and Poosala, V. 1997. Fast incremental maintenance of approximate histograms. In Proceedings of the 23rd International Conference on Very Large Data Bases. Google ScholarDigital Library
- Goldstein, J. and Larson, P. 2001. Optimizing queries using materialized views: A practical, scalable solution. In Proceedings of the SIGMOD International Conference on Management of Data. Google ScholarDigital Library
- Gray, J., Chaudhuri, S., Bosworth, A., Layman, A., Reichart, D., Venkatrao, M., Pellow, F., and Pirahesh, H. 1997. Data cube: A relational aggregation operator generalizing group-by, cross-tab, and sub-totals. J. Data Mining Knowl. Discov. 1, 1, 29--53. Google ScholarDigital Library
- Greenwald, M. and Khanna, S. 2001. Space-Efficient online computation of quantile summaries. In Proceedings of the ACM SIGMOD International Conference on Management of Data. Google ScholarDigital Library
- Gunopulos, D., Kollios, G., Tsotras, V. J., and Domeniconi, C. 2000. Approximating multi-dimensional aggregate range queries over real attributes. In Proceedings of the ACM SIGMOD International Conference on Management of Data. Google ScholarDigital Library
- Halevy, A. 2001. Answering queries using views: A survey. VLDB J. Google ScholarDigital Library
- Hellerstein, J., Haas, P., and Wang, H. 1997. Online aggregation. In Proceedings of the ACM SIGMOD International Conference on Management of Data. Google ScholarDigital Library
- HochBaum, D. 1997. Approximation Algorithms for NP-Hard Problems. PWS Publishing. Google ScholarDigital Library
- Ioannidis, Y. and Poosala, V. 1999. Histogram based approximations of set-valued query answers. In Proceedings of the International Conference on Very Large Databases. Google ScholarDigital Library
- Jermaine, C. 2003. Robust estimation with sampling and approximate pre-aggregation. In Proceedings of the International Conference on Very Large Databases. 886--897. Google ScholarDigital Library
- Jermaine, C., Pol, A., and Arumugam, S. 2004. Online maintenance of very large random samples. In Proceedings of the ACM SIGMOD International Conference on Management of Data. Google ScholarDigital Library
- Kempe D., Dobra, A., and Gehrke, J. 2003. Gossip-Based computation of aggregate information. In Proceedings of IEEE Foundations of Computer Science. 482--491. Google ScholarDigital Library
- Koudas, N. and Srivastava, D. 2003. Data stream query processing: A tutorial. Tutorial at the International Conference on Very Large Databases. Google ScholarDigital Library
- Lohr, S. 1999. Sampling: Design and Analysis. Duxbury Press.Google Scholar
- Miller, R. G., Jr. 1981. Simultaneous Statis. Inference. Springer Series in Statistics. Springer Verlag.Google Scholar
- Mitchell, T. 1997. Machine Learning. McGraw-Hill, New York. Google ScholarDigital Library
- Motwani, R. and Raghavan, P. 1995. Randomized Algorithms. Cambridge University Press, New York. Google ScholarDigital Library
- Olken, F. 1993. Random sampling from databases. Ph.D. dissertation, Computer Science, UC Berkeley.Google Scholar
- Pavlov, D., Mannila, H, and Smyth, P. 2003. Beyond independence: Probabilistic methods for query approximation on binary transaction data. IEEE Trans. Knowl. Data Eng. 15, 6, 1409--1421. Google ScholarDigital Library
- Polyzotis, N., Garofalakis, M. N., and Ioannidis, Y. 2004. Approximate XML query answers. In Proceedings of the SIGMOD International Conference on Management of Data. Google ScholarDigital Library
- Poosala, V. and Ganti, V. 1999. Fast approximate answers to aggregate queries on a data cube. In Proceedings of the International Conference on Scientific and Statistical Database Management. Google ScholarDigital Library
- Thisted, R. A. 1988. Elements of Statis. Comput. Chapman and Hall, London. Google ScholarDigital Library
- Transaction Processing Performance Council. 2007. TPC benchmark R: Decision Support. Revision 1.1.0. http//www.tpc.org.Google Scholar
- Valliant, R. and Gentle, J. 1997. An application of mathematical programming to a sample allocation problem. Comput. Statis. Data Anal. 25, 337--360. Google ScholarDigital Library
- Vitter, J. 1985. Random sampling with a reservoir. ACM Trans. Math. Softw. 11, 1. Google ScholarDigital Library
- Vitter, J. and Wang, M. 1999. Approximate computation of multidimensional aggregates of sparse data using wavelet. In Proceedings of the ACM SIGMOD International Conference on Management of Data. Google ScholarDigital Library
- Vitter, J., Wang, M., and Iyer, B. 1998. Data cube approximation and histogram via wavelets. In Proceedings of the International Conference on Information and Knowledge Management. Google ScholarDigital Library
- Zipf, G. E. 1949. Human Behavior and the Principle of Least Effort. Addison-Wesley.Google Scholar
Index Terms
- Optimized stratified sampling for approximate query processing
Recommendations
Approximate query processing using wavelets
Approximate query processing has emerged as a cost-effective approach for dealing with the huge data volumes and stringent response-time requirements of today's decision support systems (DSS). Most work in this area, however, has so far been limited in ...
Approximate Query Processing with Error Guarantees
Big-Data-Analytics in Astronomy, Science, and EngineeringAbstractIn recent years, with the increase of data and the sophistication of analysis requirements, query processing in databases has become more important. Recently, approximate query processing (AQP) was proposed for efficiently executing database ...
Sampling over Union of Joins
SIGMOD '23: Companion of the 2023 International Conference on Management of DataData scientists often draw on multiple relational data sources for analysis. A standard assumption in learning and approximate query answering is that the data is a uniform and independent sample of the underlying distribution. To avoid the cost of join ...
Comments