skip to main content
10.1145/304182.304207acmconferencesArticle/Chapter ViewAbstractPublication PagesmodConference Proceedingsconference-collections
Article
Free Access

Join synopses for approximate query answering

Authors Info & Claims
Published:01 June 1999Publication History

ABSTRACT

In large data warehousing environments, it is often advantageous to provide fast, approximate answers to complex aggregate queries based on statistical summaries of the full data. In this paper, we demonstrate the difficulty of providing good approximate answers for join-queries using only statistics (in particular, samples) from the base relations. We propose join synopses as an effective solution for this problem and show how precomputing just one join synopsis for each relation suffices to significantly improve the quality of approximate answers for arbitrary queries with foreign key joins. We present optimal strategies for allocating the available space among the various join synopses when the query work load is known and identify heuristics for the common case when the work load is not known. We also present efficient algorithms for incrementally maintaining join synopses in the presence of updates to the base relations. Our extensive set of experiments on the TPC-D benchmark database show the effectiveness of join synopses and various other techniques proposed in this paper.

References

  1. AGPR99a.S. Acharya, E B. Gibbons, V. Poosala, and S. Ramaswamy. The Aqua approximate query answering system. In Proc. ACM SIGMOD International Conf. on Management of Data, June 1999. Demonstration paper. Google ScholarGoogle ScholarDigital LibraryDigital Library
  2. AGPR99b.S. Acharya, P. B. Gibbons, V. Poosala, and S. Ramaswamy. Join synopses for approximate query answering. Technical report, Bell Laboratories, Murray Hilt, New Jersey, 1999. Full version of the paper appearing in SIGMOD'99. Google ScholarGoogle ScholarDigital LibraryDigital Library
  3. AMS96.N. Alon, Y. IV.Iatias, and M. Szegedy. The space complexity of approximating the frequency moments. In Proc. 28th ACM Syrup. on the Theory of Computing, pages 20-29, May 1996. Full version to appear in JCSS special issue for STOC'96. Google ScholarGoogle ScholarDigital LibraryDigital Library
  4. APR99.S. Acharya, V. Poosala, and S. Ramaswamy. Selectivity estimation in spatial databases. In Proc. ACM SIGMOD International Conf. on Management of Data, June 1999. Google ScholarGoogle ScholarDigital LibraryDigital Library
  5. BDF+97.l). Barbarfi, W. DuMouchel, C. Faloutsos, E J. Haas, J. M. HeUerstein, Y. Ioannidis, H. V. Jagadish, T. Johnson, R. Ng, V. Poosala, K. A. Ross, and K. C. Sevcik. The New Jersey data reduction report. Bulletin of the Technical Committee on Data Engineering, 20(4):3-45, 1997.Google ScholarGoogle Scholar
  6. CR94.C.M. Chen .and N. Roussopoulos. Adaptive selectivity estimation using query feedback. In Proc. ACM SIGMOD International Conf. on Management of Data, pages 16 l- 172, May 1994. Google ScholarGoogle ScholarDigital LibraryDigital Library
  7. GGMS96.S. Ganguly, E B. Gibbons, Y. Matins, and A. Silberschatz. Bifocal sampling for skew-resistant join size estimation. In Proc. ACM SIGMOD International Conf. on Management of Data, pages 271-281, June 1996. Google ScholarGoogle ScholarDigital LibraryDigital Library
  8. GM98.P.B. Gibbons and Y. Matins. New sampling-based summary statistics for improving approximate query answers. In Proc. ACM SIGMOD International Conf. on Management of Data, pages 331-342, June 1998. Google ScholarGoogle ScholarDigital LibraryDigital Library
  9. GM99a.E B. Gibbons and Y. Matins. Selecting estimation proce.- dures and bounds for approximate answering of aggregation queries. Technical report, Bell Laboratories, Murray Hill, New Jersey, 1999.Google ScholarGoogle Scholar
  10. GM99b.E B. Gibbons and Y. Matins. Synopsis data structures for mas. sive data sets. DIMACS: Series in Discrete Mathematics and Theoretical Computer Science, 1999. To appear. Available,. as Bell Labs tech. rep., Sept. 1998, and at http://www.belllabs. corn/- pbgibbons/.Google ScholarGoogle Scholar
  11. GMP97a.E B. Gibbons, Y. Matias, and V. Poosala. Aqua project white paper. Technical report, Bell Laboratories, Murray Hill, New Jersey, December 1997.Google ScholarGoogle Scholar
  12. GMP97b.P.B. Gibbons, Y. Matias, and V. Poosala. Fast incremental maintenance of approximate histograms. In Proc. 23rd International Conf. on Very Large Data Bases, pages 466-. 475, August 1997. Google ScholarGoogle ScholarDigital LibraryDigital Library
  13. Haa96.P.J. Haas. Hoeffding inequalities for join-selectivity estimation and online aggregation. Technical Report RJ 10040, IBM Almaden Research Center, San Jose, CA, 1996.Google ScholarGoogle Scholar
  14. Haa97.P.J. Haas. Large-sample and deterministic confidence intervals for online aggregation. In Proc. 9th International Conf. on Scientific and Statistical Database Management, August 1997. Google ScholarGoogle ScholarDigital LibraryDigital Library
  15. HHW97.J.M. Hellerstein, P. J. Haas, and H. J. Wang. Online aggregation. In Proc. ACM SIGMOD international Co1~ on Management of Data, pages 171-182, May 1997. Google ScholarGoogle ScholarDigital LibraryDigital Library
  16. HNS94.P.j. Haas, J. F. Naughton, and A. N. Swami. On the relative cost of sampling for join selectivity estimation. In Proc. 13th ACM Symp. on Principles of Database Systems, pages 14-24, May 1994. Google ScholarGoogle ScholarDigital LibraryDigital Library
  17. HNSS95.P.J. Haas, J. E Naughton, S. Seshadri, and L. Stokes. Sampling-based estimation of the number of distinct values o1 an attribute. In Proc. 21st International Conf. on Very l~rge Data Bases, pages 311-322, September 1995. Google ScholarGoogle ScholarDigital LibraryDigital Library
  18. HÖT88.W.-C. Hou, G. Ozsoyo~lu, and B. K. Taneja. Statistical estimators for relational algebra expressions. In Proc. 7th ACM Syrup. on Principles of Database Systems, pages 276- 287, March 1988. Google ScholarGoogle ScholarDigital LibraryDigital Library
  19. Koo80.R.P. Kooi. The Optimization of Queries in Relational Databases. PhD thesis, Case Western Reserve University, September 1980. Google ScholarGoogle ScholarDigital LibraryDigital Library
  20. LN95.R.J. Lipton and J. E Naughton. Query size estimation by adaptive sampling. J. Computer and System Sciences, 51(1):18-25, 1995. Google ScholarGoogle ScholarDigital LibraryDigital Library
  21. LNS90.R.J. Lipton, j. E Naughton, and D. A. Schneider. Practical selectivity estimation through adaptive sampling. In Proc. ACM SIGMOD International Conf. on Management of Data, pages l- 12, May 1990. Google ScholarGoogle ScholarDigital LibraryDigital Library
  22. OR92.E Olken and D. Rotem. Maintenance of materialized views of sampling queries. In Proc. 8th IEEE International Cot, f on Data Engineering, pages 632-641, February 1992. Google ScholarGoogle ScholarDigital LibraryDigital Library
  23. PIHS96.V. Poosala, Y. E. loannidis, P. J. Haas, and E. J. Shekita. Improved histograms for selectivity estimation of range predicates. In Proc. ACM SIGMOD International Conf. on Management of Data, pages 294-305, June 1996. Google ScholarGoogle ScholarDigital LibraryDigital Library
  24. Poo97.V. Poosala. Histogram-based estimation techniques in databases. PhD thesis, Univ. of Wisconsin-Madison, 1997. Google ScholarGoogle ScholarDigital LibraryDigital Library
  25. SAC+79.P.G. Selinger, M. M. Astrahan, D. D. Chambedin, R. A. Lode, and T. T. Price. Access path selection in a relational database management system. In Proc. ACM SIGMOD International Conf. on Management of Data, pages 23-34, June 1979. Google ScholarGoogle ScholarDigital LibraryDigital Library
  26. Sch97.D. Schneider. The ins & outs (and everything in between) of data warehousing. Tutorial in the 23rd International Conf. on Very Large Data Bases, August 1997.Google ScholarGoogle Scholar
  27. VL93.S.V. Vrbsky and J. W. S. Liu. Approximate~a query processor that produces monotonically improving approximate: answers. IEEE Trans. on Knowledge and Data Engineering, 5(6):1056-1068, 1993. Google ScholarGoogle ScholarDigital LibraryDigital Library

Index Terms

  1. Join synopses for approximate query answering

          Recommendations

          Comments

          Login options

          Check if you have access through your login credentials or your institution to get full access on this article.

          Sign in
          • Published in

            cover image ACM Conferences
            SIGMOD '99: Proceedings of the 1999 ACM SIGMOD international conference on Management of data
            June 1999
            604 pages
            ISBN:1581130848
            DOI:10.1145/304182

            Copyright © 1999 ACM

            Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. Copyrights for components of this work owned by others than ACM must be honored. Abstracting with credit is permitted. To copy otherwise, or republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee. Request permissions from [email protected]

            Publisher

            Association for Computing Machinery

            New York, NY, United States

            Publication History

            • Published: 1 June 1999

            Permissions

            Request permissions about this article.

            Request Permissions

            Check for updates

            Qualifiers

            • Article

            Acceptance Rates

            Overall Acceptance Rate785of4,003submissions,20%

          PDF Format

          View or Download as a PDF file.

          PDF

          eReader

          View online with eReader.

          eReader