skip to main content
article
Free Access

Approximate computation of multidimensional aggregates of sparse data using wavelets

Authors Info & Claims
Published:01 June 1999Publication History
Skip Abstract Section

Abstract

Computing multidimensional aggregates in high dimensions is a performance bottleneck for many OLAP applications. Obtaining the exact answer to an aggregation query can be prohibitively expensive in terms of time and/or storage space in a data warehouse environment. It is advantageous to have fast, approximate answers to OLAP aggregation queries.

In this paper, we present a novel method that provides approximate answers to high-dimensional OLAP aggregation queries in massive sparse data sets in a time-efficient and space-efficient manner. We construct a compact data cube, which is an approximate and space-efficient representation of the underlying multidimensional array, based upon a multiresolution wavelet decomposition. In the on-line phase, each aggregation query can generally be answered using the compact data cube in one I/O or a smalll number of I/Os, depending upon the desired accuracy.

We present two I/O-efficient algorithms to construct the compact data cube for the important case of sparse high-dimensional arrays, which often arise in practice. The traditional histogram methods are infeasible for the massive high-dimensional data sets in OLAP applications. Previously developed wavelet techniques are efficient only for dense data. Our on-line query processing algorithm is very fast and capable of refining answers as the user demands more accuracy. Experiments on real data show that our method provides significantly more accurate results for typical OLAP aggregation queries than other efficient approximation techniques such as random sampling.

References

  1. AAD+96 S. Agarwal, R. Agrawal, P. Deshpande, J. Naughton, S. Sarawagi, and R. Ramakrishnan. On the computation of multidimensional aggregates. In Proceedings of the 1996 International Conference on Very Large Databases, Mumbai, India, 1996. Google ScholarGoogle ScholarDigital LibraryDigital Library
  2. AV88 A. Aggaxwal and J. S. Vitter. The input/output complexity of sorting and related problems. Communications o/ the ACM, 31(9):1116-1127, 1988. Google ScholarGoogle ScholarDigital LibraryDigital Library
  3. BDF+97 D . Barbara, W. DuMouchel, C. Faloutsos, P. J. H aas, J. M. Hellerstein, Y. Ioannidis, H. V. Jagadish, T. Johnson, R. Ng,, V. Poosala, K. A. Ross, and K. C. Sevcik. The New Jersey data reduction report. Bulletin o.f the Technical Committee on Data Engineering, 20(4), 1997.Google ScholarGoogle Scholar
  4. Bur U.S. Census Bureau. Census bureau databases. The online data are available on the web at http ://www. census, gov/.Google ScholarGoogle Scholar
  5. FJS97 C. Faloutsos, H. V. Jagadish, and N. D. Sidiropoulos. Recovering information from summary data. In Proceedings of the 1997 International Conference on Very Large Databases, Athens, Greece, August 1997. Google ScholarGoogle ScholarDigital LibraryDigital Library
  6. GBLP96 J. Gray, A. Bosworth, A. I,ayman, and H. Pirahesh. Data cube: A relational aggregation operator general- ~zing group-by, cross-tabs and subtotals. In Proceedings of the 12th Annual IEEE Conference on Data Engineering (ICDE '96), pages 131-139, 1996. Google ScholarGoogle ScholarDigital LibraryDigital Library
  7. GM98 P.B. Gibbons and Y. Matias. New sampling-based summary statistics for improving approximate query answers. In Proceedings of the 1998 A CM SIGMOD International Conference on Management of Data, Seattle, WA, June 1998. Google ScholarGoogle ScholarDigital LibraryDigital Library
  8. HAMS97 C.-T. Ho, R. Agrawal, N. Megiddo, and R. Srikant. Range queries in OLAP data cubes. In Proceedings of the 1997 ACM SIGMOD International Conference on Management of Data, Tucson, AZ, May 1997. Google ScholarGoogle ScholarDigital LibraryDigital Library
  9. HHW97 J. M. Hellerstein, P. J. Haas, and H. J. Wang. Online aggregation. In Proceedings of the 1997 A CM SIGMOD International Conference on Management of Data, Tucson, AZ, May 1997. Google ScholarGoogle ScholarDigital LibraryDigital Library
  10. HRU96 Harinarayan, A. Rajaraman, and J. D. Ullman. Implementing data cubes efficiently. In Proceedings of the 1996 ACM SIGMOD International Conference on Management of Data, Montreal, Canada, May 1996. Google ScholarGoogle ScholarDigital LibraryDigital Library
  11. JS94 B. Jawerth and W. Sweldens. An overview of wavelet based multire.solution analyses. SIAM Rev., 36(3):377- 412, 1994. Google ScholarGoogle ScholarDigital LibraryDigital Library
  12. MVW98 Y. Matias, J. S. Vitter, and M. Wang. Wavelet-based histograms for selectivity estimation. In Proceedings of the 1998 A CM SIGMOD International Conference on Management of Data, pages 448-459, Seattle, WA, June 1998. Google ScholarGoogle ScholarDigital LibraryDigital Library
  13. PC98 N. Pendse and R. Creeth. The OLAP report, 1998. The online report is available on the web at http ://www. olapreport, com/Analyses, htm/.Google ScholarGoogle Scholar
  14. PI96 ~. Poosala arid Y. E. Ioannidis. Estimation of queryresult distribution and its application in parallel-join load balancing. In Proceedings of the 1996 International Conference on Very Large Databases, Bombay, India, September 1996. Google ScholarGoogle ScholarDigital LibraryDigital Library
  15. PI97 V. Poosala and Y. E. Ioannidis. Selectivity estimation without the attribute value independence assumption. In Proceeding.~ of the 1997 International Conference on Very Large Databases, Athens, Greece, August 1997. Google ScholarGoogle ScholarDigital LibraryDigital Library
  16. PIHS96 V. Poosala, Y. E. Ioannidis, P. J. Haas, and E. Shekita. Improved histograms for selectivity estimation of range redicates. In Proceedings of the 1996 A CM SIGMOD International Conference on Management of Data, Montreal, Canada, May 1996. Google ScholarGoogle ScholarDigital LibraryDigital Library
  17. SAC+79 P. G. Selinger, M. M. Astrahan, D. D. Chamberlin, R. A. Lorie, and T. G. Price. Access path selection in a relational database management system. In Proceedings of the 1979 ACM SIGMOD International Confernce on Management of Data, pages 23-34, 1979. Google ScholarGoogle ScholarDigital LibraryDigital Library
  18. SDS96 E.J. Stollnitz, T. D. Derose, and D. H. Salesin. Wavelets for Computer Graphics. Morgan Kaufmann, 1996. Google ScholarGoogle ScholarDigital LibraryDigital Library
  19. Ven94 D.E. Vengroff. A transparent parallel I/O environment. In Proceedings of the 199,{ DAGS Symposium on Parallel Computa tion, July 1994.Google ScholarGoogle Scholar
  20. Ven97 D.E. Vengroff. TPIE User Manual and Reference. Duke University, 1997. The manual and software distribution are available on the web at http ://www. cs. duke. edu/TPlE/.Google ScholarGoogle Scholar
  21. Vit99 J.S. Vitter. ExternM memory Mgorithms and data structures. In J. Abello and J. S. Vitter, editors, External Memory Algorithms and Visualization, DI- MACS series. American Mathematical Society, to appear 1999. Available via the author's web page http://www, cs. duke. edu/'j sv/. Google ScholarGoogle ScholarDigital LibraryDigital Library
  22. VS94 J.S. Vitter and E. A. M, Shriver. Algorithms for parallel memory I: Two-level memories. Algorithmica, 12(2- 3):110-147, 1994. Special double issue on Large-Scale Memories.Google ScholarGoogle Scholar
  23. VV96 D.E. Vengroff and J. S. Vitter. I/O-efficient scientific computation using TPIE. In Proceedings of the Goddard Conference on Mass Storage Systems and T~hnologies, NASA Conference Publication 3340, VohJme II, pages 553-570, College Park, MD, September 1fi96. Google ScholarGoogle ScholarDigital LibraryDigital Library
  24. VWI98 J.S. Vitter, M. Wang, and B. Iyer. Data cube approximation and histograms via wavelets. In Proceedings of Seventh International Conference on Information and Knowledge Management, pages 96-104, Washington D.C., November 1998. Google ScholarGoogle ScholarDigital LibraryDigital Library
  25. ZDN97 Y. Zhao, P. M. Deshpande, and J. F. Naughton. An array-based algorithm for simultaneous multidimensional aggregates. In Proceedings of the 1997 A CM SIGMOD International Conference on Managemen~ of Data, Tucson, AZ, May 1997. Google ScholarGoogle ScholarDigital LibraryDigital Library

Index Terms

  1. Approximate computation of multidimensional aggregates of sparse data using wavelets

          Recommendations

          Comments

          Login options

          Check if you have access through your login credentials or your institution to get full access on this article.

          Sign in

          Full Access

          • Published in

            cover image ACM SIGMOD Record
            ACM SIGMOD Record  Volume 28, Issue 2
            June 1999
            599 pages
            ISSN:0163-5808
            DOI:10.1145/304181
            Issue’s Table of Contents
            • cover image ACM Conferences
              SIGMOD '99: Proceedings of the 1999 ACM SIGMOD international conference on Management of data
              June 1999
              604 pages
              ISBN:1581130848
              DOI:10.1145/304182

            Copyright © 1999 ACM

            Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. Copyrights for components of this work owned by others than ACM must be honored. Abstracting with credit is permitted. To copy otherwise, or republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee. Request permissions from [email protected]

            Publisher

            Association for Computing Machinery

            New York, NY, United States

            Publication History

            • Published: 1 June 1999

            Check for updates

            Qualifiers

            • article

          PDF Format

          View or Download as a PDF file.

          PDF

          eReader

          View online with eReader.

          eReader