ABSTRACT
Histograms are a concise and flexible way to construct summary structures for large data sets. They have attracted a lot of attention in database research due to their utility in many areas, including query optimization, and approximate query answering. They are also a basic tool for data visualization and analysis.In this paper, we present a formal study of dynamic multidimensional histogram structures over continuous data streams. At the heart of our proposal is the use of a dynamic summary data structure (vastly different from a histogram) maintaining a succinct approximation of the data distribution of the underlying continuous stream. On demand, an accurate histogram is derived from this dynamic data structure. We propose algorithms for extracting such an accurate histogram and we analyze their behavior and tradeoffs. The proposed algorithms are able to provide approximate guarantees about the quality of the estimation of the histograms they extract.We complement our analytical results with a thorough experimental evaluation using real data sets.
- A. Aboulnaga and S. Chaudhuri. Self Tuning Histograms: Building Histograms Without Looking at Data. Proceedings of ACM SIGMOD, pages 181-192, June 1999. Google ScholarDigital Library
- S. Acharya, P. Gibbons, V. Poosala, and S. Ramaswamy. The Aqua Approximate Query Answering System. Proceedings of ACM SIGMOD, Philladephia PA, pages 574-578, June 1999. Google ScholarDigital Library
- B. Babcock, M. Datar, and R. Motwani. Sampling From a Moving Window Over Streaming Data. Proceedings of the Symposium on Discrete Algorithms, 2002. Google ScholarDigital Library
- S. Babu and J. Widom. Contineous Queries Over Data Streams. SIGMOD Record, Sept. 2001. Google ScholarDigital Library
- N. Bruno, L. Gravano, and S. Chaudhuri. STHoles: A Workload Aware Multidimensional Histogram. Proceedings of ACM SIGMOD, May 2001. Google ScholarDigital Library
- M. Datar, A. Gionis, P. Indyk, and R. Motwani. Maintaining Stream Statistics over Sliding Windows. Proceedings of the Symposium on Discrete Algorithms, 2002. Google ScholarDigital Library
- P. Gibbons, Y. Mattias, and V. Poosala. Fast Incremental Maintenance of Approximate Histograms. Proceedings of VLDB, Athens Greece, pages 466-475, Aug. 1997. Google ScholarDigital Library
- A. Gilbert, S. Guha, P. Indyk, Y. Kotadis, S. Muthukrishnan, and M. Strauss. Fast, small-space algorithms for approximate histogram maintanance. Proc. STOC, 2002. Google ScholarDigital Library
- A. Gilbert, Y. Kotadis, S. Muthukrishnan, and M. Strauss. Quicksand: quick summary and analysis of network data. DIMACS tech report.Google Scholar
- A. Gilbert, Y. Kotadis, S. Muthukrishnan, and M. Strauss. Surfing Wavelets on Streams: One Pass Summaries for Approximate Aggregate Queries. Proceedings of VLDB, pages 79-88, 2001. Google ScholarDigital Library
- J. Gray, A. Bosworth, A. Leyman, and H. Pirahesh. Data Cube: A Relational Aggregation Operator Generalizing Group-by, Cross Tab and Sub Total. Proceedings of ICDE, pages 152-159, May 1996. Google ScholarDigital Library
- M. Greenwald and S. Khanna. Space-Efficient Online Computation of Quantile Summaries. Proceedings of ACM SIGMOD, Santa Barbara, May 2001. Google ScholarDigital Library
- S. Guha and N. Koudas. Approximating a Data Stream for Querying and Estimation: Algorithms and Performance Evaluation. ICDE, Feb. 2002.Google ScholarCross Ref
- S. Guha, N. Koudas, and K. Shim. Data Streams and Histograms. Symposium on the Theory of Computing (STOC), July 2001. Google ScholarDigital Library
- S. Guha, N. Mishra, R. Motwani, and L. O'callahan. Clustering Data Streams. Foundations of Computer Science (FOCS), Sept. 2000. Google ScholarDigital Library
- D. Gunopulos, G. Kollios, V. Tsotras, and C. Domeniconi. Approximating Multi-Dimensional Aggregate Range Queries Over Real Attributes. Proceedings of ACM SIGMOD, June 2000. Google ScholarDigital Library
- P. Haas, J. Naughton, S. Seshadri, and L. Stokes. Sampling Based Estimation Of the Number Of Distinct Values Of An Attribute. Proceedings of VLDB, pages 311-322, June 1995. Google ScholarDigital Library
- P. Haas, J. Naughton, S. Seshadri, and A. Swami. Fixed Precision Estimation Of Join Selectivity. Proceedings of ACM PODS, pages 190-201, June 1993. Google ScholarDigital Library
- P. Haas and A. Swami. Sequantial Sampling Procedures for Query Size Estimation. Proceedings of ACM SIGMOD, San Diego, CA, pages 341-350, June 1992. Google ScholarDigital Library
- P. Indyk. Stable Distributions, Pseudorandom Generators, Embeddings and Data Stream Computation". Foundations of Computer Science (FOCS), Sept. 2000. Google ScholarDigital Library
- Y. Ioannidis and V. Poosala. Balancing Histogram Optimality and Practicality for Query Result Size Estimation. Proceedings of ACM SIGMOD, San Hose, CA, pages 233-244, June 1995. Google ScholarDigital Library
- H. V. Jagadish, N. Koudas, S. Muthukrishnan, V. Poosala, K. C. Sevcik, and T. Suel. Optimal Histograms with Quality Guarantees. Proceedings of VLDB, pages 275-286, Aug. 1998. Google ScholarDigital Library
- S. Khanna, S. Muthukrishnan, and S. Skiena. Efficient array partitioning. Proc. ICALP, 1997. Google ScholarDigital Library
- R. P. Kooi. The Optimization of Queries in Relational Databases. PhD Thesis, Case Western Reserve University, Sept. 1980. Google ScholarDigital Library
- J. Lee, D. Kim, and C. Chung. Multidimensional Selectivity Estimation Using Compressed Histogram Information. Proceedings of ACM SIGMOD, pages 205-214, June 1999. Google ScholarDigital Library
- S. Madden and M. Franklin. Fjording the Stream: An Architecture for Queries Over Streaming Sensor Data. Proceedings of ICDE, Feb. 2002. Google ScholarDigital Library
- Y. Mattias, J. S. Vitter, and M. Wang. Wavelet-Based Histograms for Selectivity Estimation. Proc. of the 1998 ACM SIGMOD Intern. Conf. on Management of Data, June 1998. Google ScholarDigital Library
- Y. Mattias, J. S. Vitter, and M. Wang. Dynamic Maintenance of Wavelet-Based Histograms. Proceedings of the International Conference on Very Large Databases, (VLDB), Cairo, Egypt, pages 101-111, Sept. 2000. Google ScholarDigital Library
- S. Muthukrishnan, V. Poosala, and T. Suel. Partitioning two dimensional arrays: algorithms, complexity and applications. Proc. Intl Conf. Database Theory, 1998. Google ScholarDigital Library
- V. Poosala and Y. Ioannidis. Selectivity Estimation Without the Attribute Value Independence Assumption. Proceedings of VLDB, Athens Greece, pages 486-495, Aug. 1997. Google ScholarDigital Library
- V. Poosala, Y. Ioannidis, P. Haas, and E. Shekita. Improved Histograms for Selectivity Estimation of Range Predicates. Proceedings of ACM SIGMOD, Montreal Canada, pages 294-305, June 1996. Google ScholarDigital Library
- G. Singh, S. Rajagopalan, and B. Lindsay. Random Sampling Techniques For Space Efficient Computation Of Large Datasets. Proceedings of SIGMOD, Philladelphia PA, pages 251-262, June 1999. Google ScholarDigital Library
- J. Vitter and M. Wang. Approximate computation of multidimensional aggregates on sparse data using wavelets. Proceedings of SIGMOD, pages 193-204, June 1999. Google ScholarDigital Library
- J. Vitter, M. Wang, and B. R. Iyer. Data Cube Approximation and Histograms via Wavelets. Proc. of the 1998 ACM CIKM Intern. Conf. on Information and Knowledge Management, November 1998. Google ScholarDigital Library
- Y. Wu, D. Agrawal, and A. E. Abbadi. Applying the Golden Rule of Sampling for Selectivity Estimation. Proceedings of ACM SIGMOD, May 2001. Google ScholarDigital Library
Index Terms
- Dynamic multidimensional histograms
Recommendations
Dynamic Histograms: Capturing Evolving Data Sets
ICDE '00: Proceedings of the 16th International Conference on Data EngineeringConventional histograms are `static' since they cannot be updated but only recalculated. In this paper, we introduce a `dynamic' version of V-optimal histograms, which is constructed and maintained incrementally. Our experimental results indicate that a ...
Constructing Join Histograms from Histograms with q-error Guarantees
SIGMOD '16: Proceedings of the 2016 International Conference on Management of DataHistograms are implemented and used in any database system, usually defined on a single-column of a database table. However, one of the most desired statistical data in such systems are statistics on the correlation among columns. In this paper we ...
Efficient construction of histograms for multidimensional data using quad-trees
Histograms can be useful in estimating the selectivity of queries in areas such as database query optimization and data exploration. In this paper, we propose a new histogram method for multidimensional data, called the Q-Histogram, based on the use of ...
Comments