ABSTRACT
In large data warehousing environments, it is often advantageous to provide fast, approximate answers to complex aggregate queries based on statistical summaries of the full data. In this paper, we demonstrate the difficulty of providing good approximate answers for join-queries using only statistics (in particular, samples) from the base relations. We propose join synopses as an effective solution for this problem and show how precomputing just one join synopsis for each relation suffices to significantly improve the quality of approximate answers for arbitrary queries with foreign key joins. We present optimal strategies for allocating the available space among the various join synopses when the query work load is known and identify heuristics for the common case when the work load is not known. We also present efficient algorithms for incrementally maintaining join synopses in the presence of updates to the base relations. Our extensive set of experiments on the TPC-D benchmark database show the effectiveness of join synopses and various other techniques proposed in this paper.
- AGPR99a.S. Acharya, E B. Gibbons, V. Poosala, and S. Ramaswamy. The Aqua approximate query answering system. In Proc. ACM SIGMOD International Conf. on Management of Data, June 1999. Demonstration paper. Google ScholarDigital Library
- AGPR99b.S. Acharya, P. B. Gibbons, V. Poosala, and S. Ramaswamy. Join synopses for approximate query answering. Technical report, Bell Laboratories, Murray Hilt, New Jersey, 1999. Full version of the paper appearing in SIGMOD'99. Google ScholarDigital Library
- AMS96.N. Alon, Y. IV.Iatias, and M. Szegedy. The space complexity of approximating the frequency moments. In Proc. 28th ACM Syrup. on the Theory of Computing, pages 20-29, May 1996. Full version to appear in JCSS special issue for STOC'96. Google ScholarDigital Library
- APR99.S. Acharya, V. Poosala, and S. Ramaswamy. Selectivity estimation in spatial databases. In Proc. ACM SIGMOD International Conf. on Management of Data, June 1999. Google ScholarDigital Library
- BDF+97.l). Barbarfi, W. DuMouchel, C. Faloutsos, E J. Haas, J. M. HeUerstein, Y. Ioannidis, H. V. Jagadish, T. Johnson, R. Ng, V. Poosala, K. A. Ross, and K. C. Sevcik. The New Jersey data reduction report. Bulletin of the Technical Committee on Data Engineering, 20(4):3-45, 1997.Google Scholar
- CR94.C.M. Chen .and N. Roussopoulos. Adaptive selectivity estimation using query feedback. In Proc. ACM SIGMOD International Conf. on Management of Data, pages 16 l- 172, May 1994. Google ScholarDigital Library
- GGMS96.S. Ganguly, E B. Gibbons, Y. Matins, and A. Silberschatz. Bifocal sampling for skew-resistant join size estimation. In Proc. ACM SIGMOD International Conf. on Management of Data, pages 271-281, June 1996. Google ScholarDigital Library
- GM98.P.B. Gibbons and Y. Matins. New sampling-based summary statistics for improving approximate query answers. In Proc. ACM SIGMOD International Conf. on Management of Data, pages 331-342, June 1998. Google ScholarDigital Library
- GM99a.E B. Gibbons and Y. Matins. Selecting estimation proce.- dures and bounds for approximate answering of aggregation queries. Technical report, Bell Laboratories, Murray Hill, New Jersey, 1999.Google Scholar
- GM99b.E B. Gibbons and Y. Matins. Synopsis data structures for mas. sive data sets. DIMACS: Series in Discrete Mathematics and Theoretical Computer Science, 1999. To appear. Available,. as Bell Labs tech. rep., Sept. 1998, and at http://www.belllabs. corn/- pbgibbons/.Google Scholar
- GMP97a.E B. Gibbons, Y. Matias, and V. Poosala. Aqua project white paper. Technical report, Bell Laboratories, Murray Hill, New Jersey, December 1997.Google Scholar
- GMP97b.P.B. Gibbons, Y. Matias, and V. Poosala. Fast incremental maintenance of approximate histograms. In Proc. 23rd International Conf. on Very Large Data Bases, pages 466-. 475, August 1997. Google ScholarDigital Library
- Haa96.P.J. Haas. Hoeffding inequalities for join-selectivity estimation and online aggregation. Technical Report RJ 10040, IBM Almaden Research Center, San Jose, CA, 1996.Google Scholar
- Haa97.P.J. Haas. Large-sample and deterministic confidence intervals for online aggregation. In Proc. 9th International Conf. on Scientific and Statistical Database Management, August 1997. Google ScholarDigital Library
- HHW97.J.M. Hellerstein, P. J. Haas, and H. J. Wang. Online aggregation. In Proc. ACM SIGMOD international Co1~ on Management of Data, pages 171-182, May 1997. Google ScholarDigital Library
- HNS94.P.j. Haas, J. F. Naughton, and A. N. Swami. On the relative cost of sampling for join selectivity estimation. In Proc. 13th ACM Symp. on Principles of Database Systems, pages 14-24, May 1994. Google ScholarDigital Library
- HNSS95.P.J. Haas, J. E Naughton, S. Seshadri, and L. Stokes. Sampling-based estimation of the number of distinct values o1 an attribute. In Proc. 21st International Conf. on Very l~rge Data Bases, pages 311-322, September 1995. Google ScholarDigital Library
- HÖT88.W.-C. Hou, G. Ozsoyo~lu, and B. K. Taneja. Statistical estimators for relational algebra expressions. In Proc. 7th ACM Syrup. on Principles of Database Systems, pages 276- 287, March 1988. Google ScholarDigital Library
- Koo80.R.P. Kooi. The Optimization of Queries in Relational Databases. PhD thesis, Case Western Reserve University, September 1980. Google ScholarDigital Library
- LN95.R.J. Lipton and J. E Naughton. Query size estimation by adaptive sampling. J. Computer and System Sciences, 51(1):18-25, 1995. Google ScholarDigital Library
- LNS90.R.J. Lipton, j. E Naughton, and D. A. Schneider. Practical selectivity estimation through adaptive sampling. In Proc. ACM SIGMOD International Conf. on Management of Data, pages l- 12, May 1990. Google ScholarDigital Library
- OR92.E Olken and D. Rotem. Maintenance of materialized views of sampling queries. In Proc. 8th IEEE International Cot, f on Data Engineering, pages 632-641, February 1992. Google ScholarDigital Library
- PIHS96.V. Poosala, Y. E. loannidis, P. J. Haas, and E. J. Shekita. Improved histograms for selectivity estimation of range predicates. In Proc. ACM SIGMOD International Conf. on Management of Data, pages 294-305, June 1996. Google ScholarDigital Library
- Poo97.V. Poosala. Histogram-based estimation techniques in databases. PhD thesis, Univ. of Wisconsin-Madison, 1997. Google ScholarDigital Library
- SAC+79.P.G. Selinger, M. M. Astrahan, D. D. Chambedin, R. A. Lode, and T. T. Price. Access path selection in a relational database management system. In Proc. ACM SIGMOD International Conf. on Management of Data, pages 23-34, June 1979. Google ScholarDigital Library
- Sch97.D. Schneider. The ins & outs (and everything in between) of data warehousing. Tutorial in the 23rd International Conf. on Very Large Data Bases, August 1997.Google Scholar
- VL93.S.V. Vrbsky and J. W. S. Liu. Approximate~a query processor that produces monotonically improving approximate: answers. IEEE Trans. on Knowledge and Data Engineering, 5(6):1056-1068, 1993. Google ScholarDigital Library
Index Terms
- Join synopses for approximate query answering
Recommendations
Join synopses for approximate query answering
In large data warehousing environments, it is often advantageous to provide fast, approximate answers to complex aggregate queries based on statistical summaries of the full data. In this paper, we demonstrate the difficulty of providing good ...
TuG synopses for approximate query answering
This article introduces the Tuple Graph (TuG) synopses, a new class of data summaries that enable accurate approximate answers for complex relational queries. The proposed summarization framework adopts a “semi-structured” view of the relational ...
Synopses for query optimization: a space-complexity perspective
PODS '04: Proceedings of the twenty-third ACM SIGMOD-SIGACT-SIGART symposium on Principles of database systemsDatabase systems use precomputed synopses of data to estimate the cost of alternative plans during query optimization. A number of alternative synopsis structures have been proposed, but histograms are by far the most commonly used. While histograms ...
Comments