skip to main content
article

Approximation and streaming algorithms for histogram construction problems

Published:01 March 2006Publication History
Skip Abstract Section

Abstract

Histograms and related synopsis structures are popular techniques for approximating data distributions. These have been successful in query optimization and a variety of applications, including approximate querying, similarity searching, and data mining, to name a few. Histograms were a few of the earliest synopsis structures proposed and continue to be used widely. The histogram construction problem is to construct the best histogram restricted to a space bound that reflects the data distribution most accurately under a given error measure.The histograms are used as quick and easy estimates. Thus, a slight loss of accuracy, compared to the optimal histogram under the given error measure, can be offset by fast histogram construction algorithms. A natural question arises in this context: Can we find a fast near optimal approximation algorithm for the histogram construction problem? In this article, we give the first linear time (1+ϵ)-factor approximation algorithms (for any ϵ > 0) for a large number of histogram construction problems including the use of piecewise small degree polynomials to approximate data, workloads, etc. Several of our algorithms extend to data streams.Using synthetic and real-life data sets, we demonstrate that in many scenarios the approximate histograms are almost identical to optimal histograms in quality and are significantly faster to construct.

References

  1. Acharya, S., Gibbons, P., Poosala, V., and Ramaswamy, S. 1999. The aqua approximate query answering system. In Proceedings of the ACM SIGMOD Conference. ACM, New York, 574--576. Google ScholarGoogle ScholarDigital LibraryDigital Library
  2. Alon, N., Matias, Y., and Szegedy, M. 1999. The space complexity of approximating the frequency moments. J. Comput. Syst. Sci. 58, 1, 137--147. Google ScholarGoogle ScholarDigital LibraryDigital Library
  3. Bertolotto, M. and Egenhofer, M. J. 1999. Progressive vector transmission. In Proceedings of the 7th ACM Symposium on Advances in Geographical Information Systems. ACM, New York, 152--157. Google ScholarGoogle ScholarDigital LibraryDigital Library
  4. Borodin, A. and El-Yaniv, R. 1998. Online Computation and Competitive Analysis. Cambridge University Press, Cambridge, MA. Google ScholarGoogle ScholarDigital LibraryDigital Library
  5. Datar, M., Gionis, A., Indyk, P., and Motwani, R. 2002. Maintaining stream statistics over sliding windows. In Proceedings of the ACM-SIAM Symposium on Discrete Algorithms (SODA 2002). ACM, New York, 635--644. Google ScholarGoogle ScholarDigital LibraryDigital Library
  6. Donjerkovic, D., Ioannidis, Y. E., and Ramakrishnan, R. 1999. Dynamic histograms: Capturing evolving data sets. CS-TR 99-1396, University of Wisconsin, Madison, WI. Mar.Google ScholarGoogle Scholar
  7. Feigenbaum, J., Kannan, S., Strauss, M., and Viswanathan, M. 2002. An approximate l1-difference algorithm for massive data streams. SIAM J. Comput. 32, 1, 131--151. Google ScholarGoogle ScholarDigital LibraryDigital Library
  8. Garofalakis, M. N. and Gibbons, P. B. 2002. Wavelet synopses with error guarantees. In Proceedings of the ACM SIGMOD Conference. ACM, New York, 476--487. Google ScholarGoogle ScholarDigital LibraryDigital Library
  9. Garofalakis, M. N. and Gibbons, P. B. 2004. Probabilistic wavelet synopses. ACM Trans. Datab. Syst. 29, 43--90. Google ScholarGoogle ScholarDigital LibraryDigital Library
  10. Gibbons, P. and Matias, Y. 1999. Synopsis data structures for massive data sets. In Proceedings of the ACM-SIAM Symposium on Discrete Algorithms (SODA 1999). ACM, New York, 909--910. Google ScholarGoogle ScholarDigital LibraryDigital Library
  11. Gilbert, A. C., Guha, S., Indyk, P., Kotidis, Y., Muthukrishnan, S., and Strauss, M. 2002. Fast, small-space algorithms for approximate histogram maintenance. In Proceedings of the ACM Symposium on Theory of Computing. ACM, New York, 389--398. Google ScholarGoogle ScholarDigital LibraryDigital Library
  12. Gilbert, A. C., Kotidis, Y., Muthukrishnan, S., and Strauss, M. 2001. Optimal and approximate computation of summary statistics for range aggregates. In Proceedings of the ACM Symposium on Principles of Database Systems. ACM, New York. Google ScholarGoogle ScholarDigital LibraryDigital Library
  13. Greenwald, M. and Khanna, S. 2001. Space-efficient online computation of quantile summaries. In Proceedings of the ACM SIGMOD Conference. ACM, New York. Google ScholarGoogle ScholarDigital LibraryDigital Library
  14. Guha, S. 2005. Space efficiency in synopsis construction problems. In Proceedings of the VLDB Conference. 409--420. Google ScholarGoogle ScholarDigital LibraryDigital Library
  15. Guha, S. and Harb, B. 2006. Approximation algorithms for wavelet transform coding of data streams. In Proceedings of the ACM-SIAM Symposium on Discrete Algorithms (SODA 2006). ACM, New York. Google ScholarGoogle ScholarDigital LibraryDigital Library
  16. Guha, S., Indyk, P., Muthukrishnan, S., and Strauss, M. 2002a. Histogramming data streams with fast per-item processing. In Proceedings of the ICALP. 681--692. Google ScholarGoogle ScholarDigital LibraryDigital Library
  17. Guha, S., Koudas, N., and Srivastava, D. 2002b. Fast algorithms for hierarchical range histogram construction. In Proceedings of the ACM Symposium on Principles of Database Systems. ACM, New York, 180--187. Google ScholarGoogle ScholarDigital LibraryDigital Library
  18. Guha, S. and Koudas, N. 2002. Approximating a data stream for querying and estimation: Algorithms and performance evaluation. In Proceedings of the IEEE 18th International Conference on Data Engineering (ICDE'02). IEEE Computer. Society Press, Los Alamitos, CA, 567--576. Google ScholarGoogle ScholarDigital LibraryDigital Library
  19. Guha, S., Koudas, N., and Shim, K. 2001. Data streams and histograms. In Proceedings of the ACM Symposium on Theory of Computing. ACM, New York, 471--475. Google ScholarGoogle ScholarDigital LibraryDigital Library
  20. Guha, S., Shim, K., and Woo, J. 2004. REHIST: Relative error histogram construction algorithms. In Proceedings of the VLDB Conference, 300--311. Google ScholarGoogle ScholarDigital LibraryDigital Library
  21. Hochbaum, D. 1996. Approximation Algorithms for NP Hard Problems. Brooks/Cole Pubishing Company. Google ScholarGoogle ScholarDigital LibraryDigital Library
  22. Indyk, P. 2000. Stable distributions, pseudorandom generators, embeddings and data stream computation. In Proceedings of the IEEE Symposium on Foundations of Computer Science. IEEE Computer Society Press, Los Alamitos, CA, 189--197. Google ScholarGoogle ScholarDigital LibraryDigital Library
  23. Ioannidis, Y. and Poosala, V. 1995. Balancing histogram optimality and practicality for query result size estimation. In Proceedings of the ACM SIGMOD Conference. ACM, New York, 233--244. Google ScholarGoogle ScholarDigital LibraryDigital Library
  24. Ioannidis, Y. E. 1993. Universality of serial histograms. In Proceedings of the VLDB Conference, 256--267. Google ScholarGoogle ScholarDigital LibraryDigital Library
  25. Ioannidis, Y. E. 2003. The history of histograms (abridged). In Proceedings of the VLDB Conference, 19--30. Google ScholarGoogle ScholarDigital LibraryDigital Library
  26. Jagadish, H. V., Koudas, N., Muthukrishnan, S., Poosala, V., Sevcik, K. C., and Suel, T. 1998. Optimal histograms with quality guarantees. In Proceedings of the VLDB Conference, 275--286. Google ScholarGoogle ScholarDigital LibraryDigital Library
  27. Keogh, E., Chakrabati, K., Mehrotra, S., and Pazzani, M. 2002. Locally adaptive dimensionality reduction for indexing large time series databases. ACM Trans. Datab. Syst. 27, 2, 188--228. Google ScholarGoogle ScholarDigital LibraryDigital Library
  28. Kooi, R. 1980. The optimization of queries in relational databases. Case Western Reserve University.Google ScholarGoogle Scholar
  29. Koudas, N., Muthukrishnan, S., and Srivastava, D. 2000. Optimal histograms for hierarchical range queries. In Proceedings of the ACM Principles of Database Systems. ACM, New York, 196--204. Google ScholarGoogle ScholarDigital LibraryDigital Library
  30. Manku, G. S., Rajagopalan, S., and Lindsay, B. 1998. Approximate medians and other quantiles in one pass and with limited memory. In Proceedings of the ACM SIGMOD Conference. ACM, New York, 426--435. Google ScholarGoogle ScholarDigital LibraryDigital Library
  31. Matias, Y., Vitter, J. S., and Wang, M. 1998. Wavelet-based histograms for selectivity estimation. In Proceedings of the ACM SIGMOD Conference. ACM, New York, 448--459. Google ScholarGoogle ScholarDigital LibraryDigital Library
  32. Muralikrishna, M. and DeWitt, D. J. 1988. Equi-depth histograms for estimating selectivity factors for multidimensional queries. In Proceedings of the ACM SIGMOD, Conference. ACM, New York, 28--36. Google ScholarGoogle ScholarDigital LibraryDigital Library
  33. Muthukrishnan, S. and Strauss, M. 2004. Approximate histogram and wavelet summaries of streaming data. DIMACS TR 52.Google ScholarGoogle Scholar
  34. Poosala, V., Ioannidis, Y., Haas, P., and Shekita, E. 1996. Improved histograms for selectivity estimation of range predicates. In Proceedings of the ACM SIGMOD Conference. ACM, New York, 294--305. Google ScholarGoogle ScholarDigital LibraryDigital Library
  35. Selinger, P. G., Astrahan, M. M., Chamberlin, D. D., Lorie, R. A., and Price, T. G. 1979. Access path selection in a relational database management system. In Proceedings of the ACM SIGMOD Conference. ACM, New York, 23--34. Google ScholarGoogle ScholarDigital LibraryDigital Library
  36. Thaper, N., Guha, S., Indyk, P., and Koudas, N. 2002. Dynamic multidimensional histograms. In Proceedings of the SIGMOD Conference. ACM, New York, 428--439. Google ScholarGoogle ScholarDigital LibraryDigital Library
  37. Vazirani, V. 2001. Approximation Algorithms. Springer-Verlag, New York. Google ScholarGoogle ScholarDigital LibraryDigital Library

Index Terms

  1. Approximation and streaming algorithms for histogram construction problems

        Recommendations

        Comments

        Login options

        Check if you have access through your login credentials or your institution to get full access on this article.

        Sign in

        Full Access

        • Published in

          cover image ACM Transactions on Database Systems
          ACM Transactions on Database Systems  Volume 31, Issue 1
          March 2006
          438 pages
          ISSN:0362-5915
          EISSN:1557-4644
          DOI:10.1145/1132863
          Issue’s Table of Contents

          Copyright © 2006 ACM

          Publisher

          Association for Computing Machinery

          New York, NY, United States

          Publication History

          • Published: 1 March 2006
          Published in tods Volume 31, Issue 1

          Permissions

          Request permissions about this article.

          Request Permissions

          Check for updates

          Qualifiers

          • article

        PDF Format

        View or Download as a PDF file.

        PDF

        eReader

        View online with eReader.

        eReader