Sample synopses for approximate answering of group-by queries

Authors:
Philipp Rösch

Technische Universität Dresden, Dresden, Germany

Technische Universität Dresden, Dresden, Germany
View Profile

,
Wolfgang Lehner

Technische Universität Dresden, Dresden, Germany

Technische Universität Dresden, Dresden, Germany
View Profile

EDBT '09: Proceedings of the 12th International Conference on Extending Database Technology: Advances in Database TechnologyMarch 2009Pages 403–414https://doi.org/10.1145/1516360.1516408

Published:24 March 2009Publication History

EDBT '09: Proceedings of the 12th International Conference on Extending Database Technology: Advances in Database Technology

Pages 403–414

ABSTRACT

With the amount of data in current data warehouse databases growing steadily, random sampling is continuously gaining in importance. In particular, interactive analyses of large datasets can greatly benefit from the significantly shorter response times of approximate query processing. Typically, those analytical queries partition the data into groups and aggregate the values within the groups. Further, with the commonly used roll-up and drill-down operations a broad range of group-by queries is posed to the system, which makes the construction of highly-specialized synopses difficult.

In this paper, we propose a general-purpose sampling scheme that is biased in order to answer group-by queries with high accuracy. While existing techniques focus on the size of the group when computing its sample size, our technique is based on its standard deviation. The basic idea is that the more homogeneous a group is, the less representatives are required in order to give a good estimate. With an extensive set of experiments, we show that our approach reduces both the estimation error and the construction cost compared to existing techniques.

References

S. Acharya, P. Gibbons, and V. Poosala. Congressional Samples for Approximate Answering of Group-By Queries. In SIGMOD, pages 487--498, 2000. Google ScholarDigital Library
S. Acharya, P. B. Gibbons, V. Poosala, and S. Ramaswamy. The aqua approximate query answering system. In SIGMOD, pages 574--576, 1999. Google ScholarDigital Library
S. Acharya, P. B. Gibbons, V. Poosala, and S. Ramaswamy. Join Synopses for Approximate Query Answering. In SIGMOD, pages 275--286, 1999. Google ScholarDigital Library
B. Babcock, S. Chaudhuri, and G. Das. Dynamic Sample Selection for Approximate Query Processing. In SIGMOD, pages 539--550, 2003. Google ScholarDigital Library
P. Brown and P. Haas. BHUNT: Automatic Discovery of Fuzzy Algebraic Constraints in Relational Data. In VLDB, pages 668--679, 2003. Google ScholarDigital Library
J. Brutlag and T. Richardson. A block sampling approach to distinct value estimation. Technical report, University of Washington, Department of Statistics, 2000.Google Scholar
K. Chakrabarti, M. N. Garofalakis, R. Rastogi, and K. Shim. Approximate Query Processing Using Wavelets. In VLDB, pages 111--122, 2000. Google ScholarDigital Library
S. Chaudhuri, G. Das, M. Datar, and R. M. V. Narasayya. Overcoming Limitations of Sampling for Aggregation Queries. In ICDE, pages 534--544, 2001. Google ScholarDigital Library
S. Chaudhuri, G. Das, and V. Narasayya. A Robust, Optimization-Based Approach for Approximate Answering of Aggregate Queries. In SIGMOD, pages 295--306, 2001. Google ScholarDigital Library
S. Chaudhuri, G. Das, and U. Srivastava. Effective Use of Block-level Sampling in Statistics Estimation. In SIGMOD, pages 287--298, 2004. Google ScholarDigital Library
W. Cochran. Sampling Techniques. Wiley Series in Probability & Mathematical Statistics. John Wiley & Sons, 3rd edition, 1977.Google Scholar
D. DeWitt, J. Naughton, D. Schneider, and S. Seshadri. Practical Skew Handling in Parallel Joins. In VLDB, 1992. Google ScholarDigital Library
V. Ganti, M. Lee, and R. Ramakrishnan. ICICLES: Self-Tuning Samples for Approximate Query Answering. In The VLDB Journal, pages 176--187, 2000. Google ScholarDigital Library
R. Gemulla, P. Rösch, and W. Lehner. Linked Bernoulli Synopses: Sampling Along Foreign-Keys. In SSDBM, pages 6--23, 2008. Google ScholarDigital Library
I. Ilyas, V. Markl, P. Haas, P. Brown, and A. Aboulnaga. CORDS: Automatic Discovery of Correlations and Soft Functional Dependencies. In SIGMOD, pages 647--658, 2004. Google ScholarDigital Library
Y. E. Ioannidis and V. Poosala. Histogram-Based Approximation of Set-Valued Query-Answers. In VLDB, pages 174--185, 1999. Google ScholarDigital Library
C. Jermaine. Robust Estimation With Sampling and Approximate Pre-Aggregation. In VLDB, pages 886--897, 2003. Google ScholarDigital Library
T. Johnson, S. Muthukrishnan, and I. Rozenbaum. Sampling Algorithms in a Stream Operator. In SIGMOD, pages 1--12, 2005. Google ScholarDigital Library
A. Klein, R. Gemulla, P. Rösch, and W. Lehner. Derby/S: A DBMS for Sample-Based Query Answering (Demo). In SIGMOD, pages 757--759, 2006. Google ScholarDigital Library
G. Kollios, D. Gunopulos, N. Koudas, and S. Berchtold. An Efficient Approximation Scheme for Data Mining Tasks. In ICDE, pages 453--462, 2001. Google ScholarDigital Library
Y. Matias, J. S. Vitter, and M. Wang. Wavelet-Based Histograms for Selectivity Estimation. In SIGMOD, pages 448--459, 1998. Google ScholarDigital Library
V. Poosala, Y. E. Ioannidis, P. J. Haas, and E. J. Shekita. Improved Histograms for Selectivity Estimation of Range Predicates. In SIGMOD, pages 294--305, 1996. Google ScholarDigital Library
P. Rösch, R. Gemulla, and W. Lehner. Designing Random Sample Synopses with Outliers. In ICDE, pages 1400--1402, 2008. Google ScholarDigital Library
H. Toivonen. Sampling Large Databases for Association Rules. In VLDB, pages 134--145, 1996. Google ScholarDigital Library
J. Vitter. Random Sampling with a Reservoir. ACM Trans. Mathematical Software, 11(1):37--57, 1985. Google ScholarDigital Library

Sample synopses for approximate answering of group-by queries
1. Information systems
  1. Data management systems
    1. Database management system engines
2. Theory of computation
  1. Theory and algorithms for application domains
    1. Database theory

Recommendations

Congressional samples for approximate answering of group-by queries
SIGMOD '00: Proceedings of the 2000 ACM SIGMOD international conference on Management of data

In large data warehousing environments, it is often advantageous to provide fast, approximate answers to complex decision support queries using precomputed summary statistics, such as samples. Decision support queries routinely segment the data into ...
Read More
Join synopses for approximate query answering

In large data warehousing environments, it is often advantageous to provide fast, approximate answers to complex aggregate queries based on statistical summaries of the full data. In this paper, we demonstrate the difficulty of providing good ...
Read More
Join synopses for approximate query answering
SIGMOD '99: Proceedings of the 1999 ACM SIGMOD international conference on Management of data

In large data warehousing environments, it is often advantageous to provide fast, approximate answers to complex aggregate queries based on statistical summaries of the full data. In this paper, we demonstrate the difficulty of providing good ...
Read More

Comments

Login options

Check if you have access through your login credentials or your institution to get full access on this article.

Full Access

Get this Publication

Published in
EDBT '09: Proceedings of the 12th International Conference on Extending Database Technology: Advances in Database Technology
March 2009
1180 pages
ISBN:9781605584225
DOI:10.1145/1516360
Editors:
Martin Kersten
CWI, The Netherlands
,
Boris Novikov
University of Saint Petersburg, Russia
,
Jens Teubner
ETH Zurich, Switzerland
,
Vladimir Polutin
HP Labs, Russia
,
Stefan Manegold
CWI, The Netherlands
Copyright © 2009 ACM
Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. Copyrights for components of this work owned by others than ACM must be honored. Abstracting with credit is permitted. To copy otherwise, or republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee. Request permissions from [email protected]
Sponsors
In-Cooperation
Publisher
Association for Computing Machinery
New York, NY, United States
Publication History
- Published: 24 March 2009
Permissions
Request permissions about this article.
Request Permissions

Check for updates
Qualifiers
- research-article
Conference

Acceptance Rates
Overall Acceptance Rate7of10submissions,70%
Funding Sources
Other Metrics
View Article Metrics

Article Metrics
- 12
  Total Citations
  View Citations
- 293
  Total Downloads
- Downloads (Last 12 months)38
- Downloads (Last 6 weeks)6
Other Metrics
View Author Metrics
Cited By
View all

PDF Format

View or Download as a PDF file.

PDF

eReader

View online with eReader.