skip to main content
10.1145/1516360.1516408acmotherconferencesArticle/Chapter ViewAbstractPublication PagesedbtConference Proceedingsconference-collections
research-article
Free Access

Sample synopses for approximate answering of group-by queries

Published:24 March 2009Publication History

ABSTRACT

With the amount of data in current data warehouse databases growing steadily, random sampling is continuously gaining in importance. In particular, interactive analyses of large datasets can greatly benefit from the significantly shorter response times of approximate query processing. Typically, those analytical queries partition the data into groups and aggregate the values within the groups. Further, with the commonly used roll-up and drill-down operations a broad range of group-by queries is posed to the system, which makes the construction of highly-specialized synopses difficult.

In this paper, we propose a general-purpose sampling scheme that is biased in order to answer group-by queries with high accuracy. While existing techniques focus on the size of the group when computing its sample size, our technique is based on its standard deviation. The basic idea is that the more homogeneous a group is, the less representatives are required in order to give a good estimate. With an extensive set of experiments, we show that our approach reduces both the estimation error and the construction cost compared to existing techniques.

References

  1. S. Acharya, P. Gibbons, and V. Poosala. Congressional Samples for Approximate Answering of Group-By Queries. In SIGMOD, pages 487--498, 2000. Google ScholarGoogle ScholarDigital LibraryDigital Library
  2. S. Acharya, P. B. Gibbons, V. Poosala, and S. Ramaswamy. The aqua approximate query answering system. In SIGMOD, pages 574--576, 1999. Google ScholarGoogle ScholarDigital LibraryDigital Library
  3. S. Acharya, P. B. Gibbons, V. Poosala, and S. Ramaswamy. Join Synopses for Approximate Query Answering. In SIGMOD, pages 275--286, 1999. Google ScholarGoogle ScholarDigital LibraryDigital Library
  4. B. Babcock, S. Chaudhuri, and G. Das. Dynamic Sample Selection for Approximate Query Processing. In SIGMOD, pages 539--550, 2003. Google ScholarGoogle ScholarDigital LibraryDigital Library
  5. P. Brown and P. Haas. BHUNT: Automatic Discovery of Fuzzy Algebraic Constraints in Relational Data. In VLDB, pages 668--679, 2003. Google ScholarGoogle ScholarDigital LibraryDigital Library
  6. J. Brutlag and T. Richardson. A block sampling approach to distinct value estimation. Technical report, University of Washington, Department of Statistics, 2000.Google ScholarGoogle Scholar
  7. K. Chakrabarti, M. N. Garofalakis, R. Rastogi, and K. Shim. Approximate Query Processing Using Wavelets. In VLDB, pages 111--122, 2000. Google ScholarGoogle ScholarDigital LibraryDigital Library
  8. S. Chaudhuri, G. Das, M. Datar, and R. M. V. Narasayya. Overcoming Limitations of Sampling for Aggregation Queries. In ICDE, pages 534--544, 2001. Google ScholarGoogle ScholarDigital LibraryDigital Library
  9. S. Chaudhuri, G. Das, and V. Narasayya. A Robust, Optimization-Based Approach for Approximate Answering of Aggregate Queries. In SIGMOD, pages 295--306, 2001. Google ScholarGoogle ScholarDigital LibraryDigital Library
  10. S. Chaudhuri, G. Das, and U. Srivastava. Effective Use of Block-level Sampling in Statistics Estimation. In SIGMOD, pages 287--298, 2004. Google ScholarGoogle ScholarDigital LibraryDigital Library
  11. W. Cochran. Sampling Techniques. Wiley Series in Probability & Mathematical Statistics. John Wiley & Sons, 3rd edition, 1977.Google ScholarGoogle Scholar
  12. D. DeWitt, J. Naughton, D. Schneider, and S. Seshadri. Practical Skew Handling in Parallel Joins. In VLDB, 1992. Google ScholarGoogle ScholarDigital LibraryDigital Library
  13. V. Ganti, M. Lee, and R. Ramakrishnan. ICICLES: Self-Tuning Samples for Approximate Query Answering. In The VLDB Journal, pages 176--187, 2000. Google ScholarGoogle ScholarDigital LibraryDigital Library
  14. R. Gemulla, P. Rösch, and W. Lehner. Linked Bernoulli Synopses: Sampling Along Foreign-Keys. In SSDBM, pages 6--23, 2008. Google ScholarGoogle ScholarDigital LibraryDigital Library
  15. I. Ilyas, V. Markl, P. Haas, P. Brown, and A. Aboulnaga. CORDS: Automatic Discovery of Correlations and Soft Functional Dependencies. In SIGMOD, pages 647--658, 2004. Google ScholarGoogle ScholarDigital LibraryDigital Library
  16. Y. E. Ioannidis and V. Poosala. Histogram-Based Approximation of Set-Valued Query-Answers. In VLDB, pages 174--185, 1999. Google ScholarGoogle ScholarDigital LibraryDigital Library
  17. C. Jermaine. Robust Estimation With Sampling and Approximate Pre-Aggregation. In VLDB, pages 886--897, 2003. Google ScholarGoogle ScholarDigital LibraryDigital Library
  18. T. Johnson, S. Muthukrishnan, and I. Rozenbaum. Sampling Algorithms in a Stream Operator. In SIGMOD, pages 1--12, 2005. Google ScholarGoogle ScholarDigital LibraryDigital Library
  19. A. Klein, R. Gemulla, P. Rösch, and W. Lehner. Derby/S: A DBMS for Sample-Based Query Answering (Demo). In SIGMOD, pages 757--759, 2006. Google ScholarGoogle ScholarDigital LibraryDigital Library
  20. G. Kollios, D. Gunopulos, N. Koudas, and S. Berchtold. An Efficient Approximation Scheme for Data Mining Tasks. In ICDE, pages 453--462, 2001. Google ScholarGoogle ScholarDigital LibraryDigital Library
  21. Y. Matias, J. S. Vitter, and M. Wang. Wavelet-Based Histograms for Selectivity Estimation. In SIGMOD, pages 448--459, 1998. Google ScholarGoogle ScholarDigital LibraryDigital Library
  22. V. Poosala, Y. E. Ioannidis, P. J. Haas, and E. J. Shekita. Improved Histograms for Selectivity Estimation of Range Predicates. In SIGMOD, pages 294--305, 1996. Google ScholarGoogle ScholarDigital LibraryDigital Library
  23. P. Rösch, R. Gemulla, and W. Lehner. Designing Random Sample Synopses with Outliers. In ICDE, pages 1400--1402, 2008. Google ScholarGoogle ScholarDigital LibraryDigital Library
  24. H. Toivonen. Sampling Large Databases for Association Rules. In VLDB, pages 134--145, 1996. Google ScholarGoogle ScholarDigital LibraryDigital Library
  25. J. Vitter. Random Sampling with a Reservoir. ACM Trans. Mathematical Software, 11(1):37--57, 1985. Google ScholarGoogle ScholarDigital LibraryDigital Library
  1. Sample synopses for approximate answering of group-by queries

      Recommendations

      Comments

      Login options

      Check if you have access through your login credentials or your institution to get full access on this article.

      Sign in
      • Published in

        cover image ACM Other conferences
        EDBT '09: Proceedings of the 12th International Conference on Extending Database Technology: Advances in Database Technology
        March 2009
        1180 pages
        ISBN:9781605584225
        DOI:10.1145/1516360

        Copyright © 2009 ACM

        Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. Copyrights for components of this work owned by others than ACM must be honored. Abstracting with credit is permitted. To copy otherwise, or republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee. Request permissions from [email protected]

        Publisher

        Association for Computing Machinery

        New York, NY, United States

        Publication History

        • Published: 24 March 2009

        Permissions

        Request permissions about this article.

        Request Permissions

        Check for updates

        Qualifiers

        • research-article

        Acceptance Rates

        Overall Acceptance Rate7of10submissions,70%

      PDF Format

      View or Download as a PDF file.

      PDF

      eReader

      View online with eReader.

      eReader