Abstract
Computing multidimensional aggregates in high dimensions is a performance bottleneck for many OLAP applications. Obtaining the exact answer to an aggregation query can be prohibitively expensive in terms of time and/or storage space in a data warehouse environment. It is advantageous to have fast, approximate answers to OLAP aggregation queries.
In this paper, we present a novel method that provides approximate answers to high-dimensional OLAP aggregation queries in massive sparse data sets in a time-efficient and space-efficient manner. We construct a compact data cube, which is an approximate and space-efficient representation of the underlying multidimensional array, based upon a multiresolution wavelet decomposition. In the on-line phase, each aggregation query can generally be answered using the compact data cube in one I/O or a smalll number of I/Os, depending upon the desired accuracy.
We present two I/O-efficient algorithms to construct the compact data cube for the important case of sparse high-dimensional arrays, which often arise in practice. The traditional histogram methods are infeasible for the massive high-dimensional data sets in OLAP applications. Previously developed wavelet techniques are efficient only for dense data. Our on-line query processing algorithm is very fast and capable of refining answers as the user demands more accuracy. Experiments on real data show that our method provides significantly more accurate results for typical OLAP aggregation queries than other efficient approximation techniques such as random sampling.
- AAD+96 S. Agarwal, R. Agrawal, P. Deshpande, J. Naughton, S. Sarawagi, and R. Ramakrishnan. On the computation of multidimensional aggregates. In Proceedings of the 1996 International Conference on Very Large Databases, Mumbai, India, 1996. Google ScholarDigital Library
- AV88 A. Aggaxwal and J. S. Vitter. The input/output complexity of sorting and related problems. Communications o/ the ACM, 31(9):1116-1127, 1988. Google ScholarDigital Library
- BDF+97 D . Barbara, W. DuMouchel, C. Faloutsos, P. J. H aas, J. M. Hellerstein, Y. Ioannidis, H. V. Jagadish, T. Johnson, R. Ng,, V. Poosala, K. A. Ross, and K. C. Sevcik. The New Jersey data reduction report. Bulletin o.f the Technical Committee on Data Engineering, 20(4), 1997.Google Scholar
- Bur U.S. Census Bureau. Census bureau databases. The online data are available on the web at http ://www. census, gov/.Google Scholar
- FJS97 C. Faloutsos, H. V. Jagadish, and N. D. Sidiropoulos. Recovering information from summary data. In Proceedings of the 1997 International Conference on Very Large Databases, Athens, Greece, August 1997. Google ScholarDigital Library
- GBLP96 J. Gray, A. Bosworth, A. I,ayman, and H. Pirahesh. Data cube: A relational aggregation operator general- ~zing group-by, cross-tabs and subtotals. In Proceedings of the 12th Annual IEEE Conference on Data Engineering (ICDE '96), pages 131-139, 1996. Google ScholarDigital Library
- GM98 P.B. Gibbons and Y. Matias. New sampling-based summary statistics for improving approximate query answers. In Proceedings of the 1998 A CM SIGMOD International Conference on Management of Data, Seattle, WA, June 1998. Google ScholarDigital Library
- HAMS97 C.-T. Ho, R. Agrawal, N. Megiddo, and R. Srikant. Range queries in OLAP data cubes. In Proceedings of the 1997 ACM SIGMOD International Conference on Management of Data, Tucson, AZ, May 1997. Google ScholarDigital Library
- HHW97 J. M. Hellerstein, P. J. Haas, and H. J. Wang. Online aggregation. In Proceedings of the 1997 A CM SIGMOD International Conference on Management of Data, Tucson, AZ, May 1997. Google ScholarDigital Library
- HRU96 Harinarayan, A. Rajaraman, and J. D. Ullman. Implementing data cubes efficiently. In Proceedings of the 1996 ACM SIGMOD International Conference on Management of Data, Montreal, Canada, May 1996. Google ScholarDigital Library
- JS94 B. Jawerth and W. Sweldens. An overview of wavelet based multire.solution analyses. SIAM Rev., 36(3):377- 412, 1994. Google ScholarDigital Library
- MVW98 Y. Matias, J. S. Vitter, and M. Wang. Wavelet-based histograms for selectivity estimation. In Proceedings of the 1998 A CM SIGMOD International Conference on Management of Data, pages 448-459, Seattle, WA, June 1998. Google ScholarDigital Library
- PC98 N. Pendse and R. Creeth. The OLAP report, 1998. The online report is available on the web at http ://www. olapreport, com/Analyses, htm/.Google Scholar
- PI96 ~. Poosala arid Y. E. Ioannidis. Estimation of queryresult distribution and its application in parallel-join load balancing. In Proceedings of the 1996 International Conference on Very Large Databases, Bombay, India, September 1996. Google ScholarDigital Library
- PI97 V. Poosala and Y. E. Ioannidis. Selectivity estimation without the attribute value independence assumption. In Proceeding.~ of the 1997 International Conference on Very Large Databases, Athens, Greece, August 1997. Google ScholarDigital Library
- PIHS96 V. Poosala, Y. E. Ioannidis, P. J. Haas, and E. Shekita. Improved histograms for selectivity estimation of range redicates. In Proceedings of the 1996 A CM SIGMOD International Conference on Management of Data, Montreal, Canada, May 1996. Google ScholarDigital Library
- SAC+79 P. G. Selinger, M. M. Astrahan, D. D. Chamberlin, R. A. Lorie, and T. G. Price. Access path selection in a relational database management system. In Proceedings of the 1979 ACM SIGMOD International Confernce on Management of Data, pages 23-34, 1979. Google ScholarDigital Library
- SDS96 E.J. Stollnitz, T. D. Derose, and D. H. Salesin. Wavelets for Computer Graphics. Morgan Kaufmann, 1996. Google ScholarDigital Library
- Ven94 D.E. Vengroff. A transparent parallel I/O environment. In Proceedings of the 199,{ DAGS Symposium on Parallel Computa tion, July 1994.Google Scholar
- Ven97 D.E. Vengroff. TPIE User Manual and Reference. Duke University, 1997. The manual and software distribution are available on the web at http ://www. cs. duke. edu/TPlE/.Google Scholar
- Vit99 J.S. Vitter. ExternM memory Mgorithms and data structures. In J. Abello and J. S. Vitter, editors, External Memory Algorithms and Visualization, DI- MACS series. American Mathematical Society, to appear 1999. Available via the author's web page http://www, cs. duke. edu/'j sv/. Google ScholarDigital Library
- VS94 J.S. Vitter and E. A. M, Shriver. Algorithms for parallel memory I: Two-level memories. Algorithmica, 12(2- 3):110-147, 1994. Special double issue on Large-Scale Memories.Google Scholar
- VV96 D.E. Vengroff and J. S. Vitter. I/O-efficient scientific computation using TPIE. In Proceedings of the Goddard Conference on Mass Storage Systems and T~hnologies, NASA Conference Publication 3340, VohJme II, pages 553-570, College Park, MD, September 1fi96. Google ScholarDigital Library
- VWI98 J.S. Vitter, M. Wang, and B. Iyer. Data cube approximation and histograms via wavelets. In Proceedings of Seventh International Conference on Information and Knowledge Management, pages 96-104, Washington D.C., November 1998. Google ScholarDigital Library
- ZDN97 Y. Zhao, P. M. Deshpande, and J. F. Naughton. An array-based algorithm for simultaneous multidimensional aggregates. In Proceedings of the 1997 A CM SIGMOD International Conference on Managemen~ of Data, Tucson, AZ, May 1997. Google ScholarDigital Library
Index Terms
- Approximate computation of multidimensional aggregates of sparse data using wavelets
Recommendations
Approximate query processing using wavelets
Approximate query processing has emerged as a cost-effective approach for dealing with the huge data volumes and stringent response-time requirements of today's decision support systems (DSS). Most work in this area, however, has so far been limited in ...
Approximate computation of multidimensional aggregates of sparse data using wavelets
SIGMOD '99: Proceedings of the 1999 ACM SIGMOD international conference on Management of dataComputing multidimensional aggregates in high dimensions is a performance bottleneck for many OLAP applications. Obtaining the exact answer to an aggregation query can be prohibitively expensive in terms of time and/or storage space in a data warehouse ...
Optimal and approximate computation of summary statistics for range aggregates
PODS '01: Proceedings of the twentieth ACM SIGMOD-SIGACT-SIGART symposium on Principles of database systemsFast estimates for aggregate queries are useful in database query optimization, approximate query answering and online query processing. Hence, there has been a lot of focus on “selectivity estimation”, that is, computing summary statistics on the ...
Comments