Abstract
Histograms and related synopsis structures are popular techniques for approximating data distributions. These have been successful in query optimization and a variety of applications, including approximate querying, similarity searching, and data mining, to name a few. Histograms were a few of the earliest synopsis structures proposed and continue to be used widely. The histogram construction problem is to construct the best histogram restricted to a space bound that reflects the data distribution most accurately under a given error measure.The histograms are used as quick and easy estimates. Thus, a slight loss of accuracy, compared to the optimal histogram under the given error measure, can be offset by fast histogram construction algorithms. A natural question arises in this context: Can we find a fast near optimal approximation algorithm for the histogram construction problem? In this article, we give the first linear time (1+ϵ)-factor approximation algorithms (for any ϵ > 0) for a large number of histogram construction problems including the use of piecewise small degree polynomials to approximate data, workloads, etc. Several of our algorithms extend to data streams.Using synthetic and real-life data sets, we demonstrate that in many scenarios the approximate histograms are almost identical to optimal histograms in quality and are significantly faster to construct.
- Acharya, S., Gibbons, P., Poosala, V., and Ramaswamy, S. 1999. The aqua approximate query answering system. In Proceedings of the ACM SIGMOD Conference. ACM, New York, 574--576. Google ScholarDigital Library
- Alon, N., Matias, Y., and Szegedy, M. 1999. The space complexity of approximating the frequency moments. J. Comput. Syst. Sci. 58, 1, 137--147. Google ScholarDigital Library
- Bertolotto, M. and Egenhofer, M. J. 1999. Progressive vector transmission. In Proceedings of the 7th ACM Symposium on Advances in Geographical Information Systems. ACM, New York, 152--157. Google ScholarDigital Library
- Borodin, A. and El-Yaniv, R. 1998. Online Computation and Competitive Analysis. Cambridge University Press, Cambridge, MA. Google ScholarDigital Library
- Datar, M., Gionis, A., Indyk, P., and Motwani, R. 2002. Maintaining stream statistics over sliding windows. In Proceedings of the ACM-SIAM Symposium on Discrete Algorithms (SODA 2002). ACM, New York, 635--644. Google ScholarDigital Library
- Donjerkovic, D., Ioannidis, Y. E., and Ramakrishnan, R. 1999. Dynamic histograms: Capturing evolving data sets. CS-TR 99-1396, University of Wisconsin, Madison, WI. Mar.Google Scholar
- Feigenbaum, J., Kannan, S., Strauss, M., and Viswanathan, M. 2002. An approximate l1-difference algorithm for massive data streams. SIAM J. Comput. 32, 1, 131--151. Google ScholarDigital Library
- Garofalakis, M. N. and Gibbons, P. B. 2002. Wavelet synopses with error guarantees. In Proceedings of the ACM SIGMOD Conference. ACM, New York, 476--487. Google ScholarDigital Library
- Garofalakis, M. N. and Gibbons, P. B. 2004. Probabilistic wavelet synopses. ACM Trans. Datab. Syst. 29, 43--90. Google ScholarDigital Library
- Gibbons, P. and Matias, Y. 1999. Synopsis data structures for massive data sets. In Proceedings of the ACM-SIAM Symposium on Discrete Algorithms (SODA 1999). ACM, New York, 909--910. Google ScholarDigital Library
- Gilbert, A. C., Guha, S., Indyk, P., Kotidis, Y., Muthukrishnan, S., and Strauss, M. 2002. Fast, small-space algorithms for approximate histogram maintenance. In Proceedings of the ACM Symposium on Theory of Computing. ACM, New York, 389--398. Google ScholarDigital Library
- Gilbert, A. C., Kotidis, Y., Muthukrishnan, S., and Strauss, M. 2001. Optimal and approximate computation of summary statistics for range aggregates. In Proceedings of the ACM Symposium on Principles of Database Systems. ACM, New York. Google ScholarDigital Library
- Greenwald, M. and Khanna, S. 2001. Space-efficient online computation of quantile summaries. In Proceedings of the ACM SIGMOD Conference. ACM, New York. Google ScholarDigital Library
- Guha, S. 2005. Space efficiency in synopsis construction problems. In Proceedings of the VLDB Conference. 409--420. Google ScholarDigital Library
- Guha, S. and Harb, B. 2006. Approximation algorithms for wavelet transform coding of data streams. In Proceedings of the ACM-SIAM Symposium on Discrete Algorithms (SODA 2006). ACM, New York. Google ScholarDigital Library
- Guha, S., Indyk, P., Muthukrishnan, S., and Strauss, M. 2002a. Histogramming data streams with fast per-item processing. In Proceedings of the ICALP. 681--692. Google ScholarDigital Library
- Guha, S., Koudas, N., and Srivastava, D. 2002b. Fast algorithms for hierarchical range histogram construction. In Proceedings of the ACM Symposium on Principles of Database Systems. ACM, New York, 180--187. Google ScholarDigital Library
- Guha, S. and Koudas, N. 2002. Approximating a data stream for querying and estimation: Algorithms and performance evaluation. In Proceedings of the IEEE 18th International Conference on Data Engineering (ICDE'02). IEEE Computer. Society Press, Los Alamitos, CA, 567--576. Google ScholarDigital Library
- Guha, S., Koudas, N., and Shim, K. 2001. Data streams and histograms. In Proceedings of the ACM Symposium on Theory of Computing. ACM, New York, 471--475. Google ScholarDigital Library
- Guha, S., Shim, K., and Woo, J. 2004. REHIST: Relative error histogram construction algorithms. In Proceedings of the VLDB Conference, 300--311. Google ScholarDigital Library
- Hochbaum, D. 1996. Approximation Algorithms for NP Hard Problems. Brooks/Cole Pubishing Company. Google ScholarDigital Library
- Indyk, P. 2000. Stable distributions, pseudorandom generators, embeddings and data stream computation. In Proceedings of the IEEE Symposium on Foundations of Computer Science. IEEE Computer Society Press, Los Alamitos, CA, 189--197. Google ScholarDigital Library
- Ioannidis, Y. and Poosala, V. 1995. Balancing histogram optimality and practicality for query result size estimation. In Proceedings of the ACM SIGMOD Conference. ACM, New York, 233--244. Google ScholarDigital Library
- Ioannidis, Y. E. 1993. Universality of serial histograms. In Proceedings of the VLDB Conference, 256--267. Google ScholarDigital Library
- Ioannidis, Y. E. 2003. The history of histograms (abridged). In Proceedings of the VLDB Conference, 19--30. Google ScholarDigital Library
- Jagadish, H. V., Koudas, N., Muthukrishnan, S., Poosala, V., Sevcik, K. C., and Suel, T. 1998. Optimal histograms with quality guarantees. In Proceedings of the VLDB Conference, 275--286. Google ScholarDigital Library
- Keogh, E., Chakrabati, K., Mehrotra, S., and Pazzani, M. 2002. Locally adaptive dimensionality reduction for indexing large time series databases. ACM Trans. Datab. Syst. 27, 2, 188--228. Google ScholarDigital Library
- Kooi, R. 1980. The optimization of queries in relational databases. Case Western Reserve University.Google Scholar
- Koudas, N., Muthukrishnan, S., and Srivastava, D. 2000. Optimal histograms for hierarchical range queries. In Proceedings of the ACM Principles of Database Systems. ACM, New York, 196--204. Google ScholarDigital Library
- Manku, G. S., Rajagopalan, S., and Lindsay, B. 1998. Approximate medians and other quantiles in one pass and with limited memory. In Proceedings of the ACM SIGMOD Conference. ACM, New York, 426--435. Google ScholarDigital Library
- Matias, Y., Vitter, J. S., and Wang, M. 1998. Wavelet-based histograms for selectivity estimation. In Proceedings of the ACM SIGMOD Conference. ACM, New York, 448--459. Google ScholarDigital Library
- Muralikrishna, M. and DeWitt, D. J. 1988. Equi-depth histograms for estimating selectivity factors for multidimensional queries. In Proceedings of the ACM SIGMOD, Conference. ACM, New York, 28--36. Google ScholarDigital Library
- Muthukrishnan, S. and Strauss, M. 2004. Approximate histogram and wavelet summaries of streaming data. DIMACS TR 52.Google Scholar
- Poosala, V., Ioannidis, Y., Haas, P., and Shekita, E. 1996. Improved histograms for selectivity estimation of range predicates. In Proceedings of the ACM SIGMOD Conference. ACM, New York, 294--305. Google ScholarDigital Library
- Selinger, P. G., Astrahan, M. M., Chamberlin, D. D., Lorie, R. A., and Price, T. G. 1979. Access path selection in a relational database management system. In Proceedings of the ACM SIGMOD Conference. ACM, New York, 23--34. Google ScholarDigital Library
- Thaper, N., Guha, S., Indyk, P., and Koudas, N. 2002. Dynamic multidimensional histograms. In Proceedings of the SIGMOD Conference. ACM, New York, 428--439. Google ScholarDigital Library
- Vazirani, V. 2001. Approximation Algorithms. Springer-Verlag, New York. Google ScholarDigital Library
Index Terms
- Approximation and streaming algorithms for histogram construction problems
Recommendations
Optimal Approximation Algorithms for Maximum Distance-Bounded Subgraph Problems
In this paper we study the (in)approximability of two distance-based relaxed variants of the maximum clique problem (Max Clique), named Max d-Clique and Max d-Club: A d-clique in a graph $$G = (V, E)$$G=(V,E) is a subset $$S\subseteq V$$S⊆V of vertices ...
Improved Approximation Algorithms for Minimum Cost Node-Connectivity Augmentation Problems
Let źG(s, t) denote the maximum number of pairwise internally disjoint st-paths in a graph G = (V, E). For a set T⊆V$T \subseteq V$ of terminals, G is k-T-connected if źG(s, t) ź k for all s, t ź T; if T = V then G is k-connected. Given a root node s, G ...
Optimal Approximation Algorithms for Maximum Distance-Bounded Subgraph Problems
COCOA 2015: Proceedings of the 9th International Conference on Combinatorial Optimization and Applications - Volume 9486A d-clique in a graph $$G = V, E$$G=V,E is a subset $$S\subseteq V$$S⊆V of vertices such that for pairs of vertices $$u, v\in S$$u,v∈S, the distance between u and v is at most d in G. A d-club in a graph $$G = V, E$$G=V,E is a subset $$S'\subseteq V$$S'⊆...
Comments