Approximate computation of multidimensional aggregates of sparse data using wavelets

Authors:
Jeffrey Scott Vitter

Center for Geometric Computing and Department of Computer Science, Duke University, Durham, NC

Center for Geometric Computing and Department of Computer Science, Duke University, Durham, NC
View Profile

,
Min Wang

Center for Geometric Computing and Department of Computer Science, Duke University, Durham, NC

Center for Geometric Computing and Department of Computer Science, Duke University, Durham, NC
View Profile

Authors Info & Claims

ACM SIGMOD Record Volume 28 Issue 2June 1999pp 193–204https://doi.org/10.1145/304181.304199

Published:01 June 1999Publication History

ACM SIGMOD Record

Abstract

Computing multidimensional aggregates in high dimensions is a performance bottleneck for many OLAP applications. Obtaining the exact answer to an aggregation query can be prohibitively expensive in terms of time and/or storage space in a data warehouse environment. It is advantageous to have fast, approximate answers to OLAP aggregation queries.

In this paper, we present a novel method that provides approximate answers to high-dimensional OLAP aggregation queries in massive sparse data sets in a time-efficient and space-efficient manner. We construct a compact data cube, which is an approximate and space-efficient representation of the underlying multidimensional array, based upon a multiresolution wavelet decomposition. In the on-line phase, each aggregation query can generally be answered using the compact data cube in one I/O or a smalll number of I/Os, depending upon the desired accuracy.

We present two I/O-efficient algorithms to construct the compact data cube for the important case of sparse high-dimensional arrays, which often arise in practice. The traditional histogram methods are infeasible for the massive high-dimensional data sets in OLAP applications. Previously developed wavelet techniques are efficient only for dense data. Our on-line query processing algorithm is very fast and capable of refining answers as the user demands more accuracy. Experiments on real data show that our method provides significantly more accurate results for typical OLAP aggregation queries than other efficient approximation techniques such as random sampling.

References

AAD+96 S. Agarwal, R. Agrawal, P. Deshpande, J. Naughton, S. Sarawagi, and R. Ramakrishnan. On the computation of multidimensional aggregates. In Proceedings of the 1996 International Conference on Very Large Databases, Mumbai, India, 1996. Google ScholarDigital Library
AV88 A. Aggaxwal and J. S. Vitter. The input/output complexity of sorting and related problems. Communications o/ the ACM, 31(9):1116-1127, 1988. Google ScholarDigital Library
BDF+97 D . Barbara, W. DuMouchel, C. Faloutsos, P. J. H aas, J. M. Hellerstein, Y. Ioannidis, H. V. Jagadish, T. Johnson, R. Ng,, V. Poosala, K. A. Ross, and K. C. Sevcik. The New Jersey data reduction report. Bulletin o.f the Technical Committee on Data Engineering, 20(4), 1997.Google Scholar
Bur U.S. Census Bureau. Census bureau databases. The online data are available on the web at http ://www. census, gov/.Google Scholar
FJS97 C. Faloutsos, H. V. Jagadish, and N. D. Sidiropoulos. Recovering information from summary data. In Proceedings of the 1997 International Conference on Very Large Databases, Athens, Greece, August 1997. Google ScholarDigital Library
GBLP96 J. Gray, A. Bosworth, A. I,ayman, and H. Pirahesh. Data cube: A relational aggregation operator general- ~zing group-by, cross-tabs and subtotals. In Proceedings of the 12th Annual IEEE Conference on Data Engineering (ICDE '96), pages 131-139, 1996. Google ScholarDigital Library
GM98 P.B. Gibbons and Y. Matias. New sampling-based summary statistics for improving approximate query answers. In Proceedings of the 1998 A CM SIGMOD International Conference on Management of Data, Seattle, WA, June 1998. Google ScholarDigital Library
HAMS97 C.-T. Ho, R. Agrawal, N. Megiddo, and R. Srikant. Range queries in OLAP data cubes. In Proceedings of the 1997 ACM SIGMOD International Conference on Management of Data, Tucson, AZ, May 1997. Google ScholarDigital Library
HHW97 J. M. Hellerstein, P. J. Haas, and H. J. Wang. Online aggregation. In Proceedings of the 1997 A CM SIGMOD International Conference on Management of Data, Tucson, AZ, May 1997. Google ScholarDigital Library
HRU96 Harinarayan, A. Rajaraman, and J. D. Ullman. Implementing data cubes efficiently. In Proceedings of the 1996 ACM SIGMOD International Conference on Management of Data, Montreal, Canada, May 1996. Google ScholarDigital Library
JS94 B. Jawerth and W. Sweldens. An overview of wavelet based multire.solution analyses. SIAM Rev., 36(3):377- 412, 1994. Google ScholarDigital Library
MVW98 Y. Matias, J. S. Vitter, and M. Wang. Wavelet-based histograms for selectivity estimation. In Proceedings of the 1998 A CM SIGMOD International Conference on Management of Data, pages 448-459, Seattle, WA, June 1998. Google ScholarDigital Library
PC98 N. Pendse and R. Creeth. The OLAP report, 1998. The online report is available on the web at http ://www. olapreport, com/Analyses, htm/.Google Scholar
PI96 ~. Poosala arid Y. E. Ioannidis. Estimation of queryresult distribution and its application in parallel-join load balancing. In Proceedings of the 1996 International Conference on Very Large Databases, Bombay, India, September 1996. Google ScholarDigital Library
PI97 V. Poosala and Y. E. Ioannidis. Selectivity estimation without the attribute value independence assumption. In Proceeding.~ of the 1997 International Conference on Very Large Databases, Athens, Greece, August 1997. Google ScholarDigital Library
PIHS96 V. Poosala, Y. E. Ioannidis, P. J. Haas, and E. Shekita. Improved histograms for selectivity estimation of range redicates. In Proceedings of the 1996 A CM SIGMOD International Conference on Management of Data, Montreal, Canada, May 1996. Google ScholarDigital Library
SAC+79 P. G. Selinger, M. M. Astrahan, D. D. Chamberlin, R. A. Lorie, and T. G. Price. Access path selection in a relational database management system. In Proceedings of the 1979 ACM SIGMOD International Confernce on Management of Data, pages 23-34, 1979. Google ScholarDigital Library
SDS96 E.J. Stollnitz, T. D. Derose, and D. H. Salesin. Wavelets for Computer Graphics. Morgan Kaufmann, 1996. Google ScholarDigital Library
Ven94 D.E. Vengroff. A transparent parallel I/O environment. In Proceedings of the 199,{ DAGS Symposium on Parallel Computa tion, July 1994.Google Scholar
Ven97 D.E. Vengroff. TPIE User Manual and Reference. Duke University, 1997. The manual and software distribution are available on the web at http ://www. cs. duke. edu/TPlE/.Google Scholar
Vit99 J.S. Vitter. ExternM memory Mgorithms and data structures. In J. Abello and J. S. Vitter, editors, External Memory Algorithms and Visualization, DI- MACS series. American Mathematical Society, to appear 1999. Available via the author's web page http://www, cs. duke. edu/'j sv/. Google ScholarDigital Library
VS94 J.S. Vitter and E. A. M, Shriver. Algorithms for parallel memory I: Two-level memories. Algorithmica, 12(2- 3):110-147, 1994. Special double issue on Large-Scale Memories.Google Scholar
VV96 D.E. Vengroff and J. S. Vitter. I/O-efficient scientific computation using TPIE. In Proceedings of the Goddard Conference on Mass Storage Systems and T~hnologies, NASA Conference Publication 3340, VohJme II, pages 553-570, College Park, MD, September 1fi96. Google ScholarDigital Library
VWI98 J.S. Vitter, M. Wang, and B. Iyer. Data cube approximation and histograms via wavelets. In Proceedings of Seventh International Conference on Information and Knowledge Management, pages 96-104, Washington D.C., November 1998. Google ScholarDigital Library
ZDN97 Y. Zhao, P. M. Deshpande, and J. F. Naughton. An array-based algorithm for simultaneous multidimensional aggregates. In Proceedings of the 1997 A CM SIGMOD International Conference on Managemen~ of Data, Tucson, AZ, May 1997. Google ScholarDigital Library

Index Terms

Approximate computation of multidimensional aggregates of sparse data using wavelets
1. Information systems
  1. Data management systems
    1. Database management system engines
      1. Database query processing
    2. Information integration
      1. Data warehouses
  2. Information systems applications
    1. Decision support systems
      1. Data warehouses
2. Theory of computation
  1. Theory and algorithms for application domains
    1. Database theory
      1. Database query processing and optimization (theory)

Recommendations

Approximate query processing using wavelets

Approximate query processing has emerged as a cost-effective approach for dealing with the huge data volumes and stringent response-time requirements of today's decision support systems (DSS). Most work in this area, however, has so far been limited in ...
Read More
Approximate computation of multidimensional aggregates of sparse data using wavelets
SIGMOD '99: Proceedings of the 1999 ACM SIGMOD international conference on Management of data

Computing multidimensional aggregates in high dimensions is a performance bottleneck for many OLAP applications. Obtaining the exact answer to an aggregation query can be prohibitively expensive in terms of time and/or storage space in a data warehouse ...
Read More
Optimal and approximate computation of summary statistics for range aggregates
PODS '01: Proceedings of the twentieth ACM SIGMOD-SIGACT-SIGART symposium on Principles of database systems

Fast estimates for aggregate queries are useful in database query optimization, approximate query answering and online query processing. Hence, there has been a lot of focus on “selectivity estimation”, that is, computing summary statistics on the ...
Read More

Comments

Login options

Check if you have access through your login credentials or your institution to get full access on this article.

Full Access

Get this Article

Published in
ACM SIGMOD Record Volume 28, Issue 2
June 1999
599 pages
ISSN:0163-5808
DOI:10.1145/304181
Chairmen:
Susan Davidson
Univ. of Pennsylvania
,
Christos Faloutsos
Carnegie Mellon Univ.
Issue’s Table of Contents
SIGMOD '99: Proceedings of the 1999 ACM SIGMOD international conference on Management of data
June 1999
604 pages
ISBN:1581130848
DOI:10.1145/304182
Chairmen:
Susan B. Davidson
Univ. of Pennsylvania, Philidelphia
,
Christos Faloutsos
Carnegie Mellon Univ., Pittsburgh
Copyright © 1999 ACM
Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. Copyrights for components of this work owned by others than ACM must be honored. Abstracting with credit is permitted. To copy otherwise, or republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee. Request permissions from [email protected]
Sponsors
In-Cooperation
Publisher
Association for Computing Machinery
New York, NY, United States
Publication History
- Published: 1 June 1999
Check for updates
Qualifiers
- article
Conference
Funding Sources
Other Metrics
View Article Metrics

Article Metrics
- 289
  Total Citations
  View Citations
- 1,386
  Total Downloads
- Downloads (Last 12 months)97
- Downloads (Last 6 weeks)18
Other Metrics
View Author Metrics
Cited By
View all

PDF Format

View or Download as a PDF file.

PDF

eReader

View online with eReader.

eReader

Approximate computation of multidimensional aggregates of sparse data using wavelets

ACM SIGMOD Record

Abstract

References

Cited By

Index Terms

Recommendations

Approximate query processing using wavelets

Approximate computation of multidimensional aggregates of sparse data using wavelets

Optimal and approximate computation of summary statistics for range aggregates