ABSTRACT
Histograms have been used widely to capture data distribution, to represent the data by a small number of step functions. Dynamic programming algorithms which provide optimal construction of these histograms exist, albeit running in quadratic time and linear space. In this paper we provide linear time construction of 1 + ε approximation of optimal histograms, running in polylogarithmic space.
Our results extend to the context of data-streams, and in fact generalize to give 1 + ε approximation of several problems in data-streams which require partitioning the index set into intervals. The only assumptions required are that the cost of an interval is monotonic under inclusion (larger interval has larger cost) and that the cost can be computed or approximated in small space. This exhibits a nice class of problems for which we can have near optimal data-stream algorithms.
- 1.S. Acharya, P. Gibbons, V. Poosala, and S. Ramaswamy. Join Synopses For Approximate Query Answering. Proceedings of ACM SIGMOD, pages 275-286, June 1999.]] Google ScholarDigital Library
- 2.S. Acharya, P. Gibbons, V. Poosala, and S. Ramaswamy. The Aqua Approximate Query Answering System. Proceedings of ACM SIGMOD, pages 574-578, June 1999.]] Google ScholarDigital Library
- 3.M. Bertolotto and M. J. Egenhofer. Progressive vector transmission. Proceedings of the 7th ACM symposium on Advances in Geographical Information Systems, 1999.]] Google ScholarDigital Library
- 4.S. Chaudhuri, R. Motwani, and V. Narasayya. Random sampling for histogram construction: How much is enough ? Proceedings of ACM SIGMOD, 1998.]] Google ScholarDigital Library
- 5.J. Feigenbaum, S. Kannan, M. Strauss, and M. Vishwanathan. An approximate l1-difference algorithm for massive data sets. Proceedings of 40th Annual IEEE Symposium on Foundations of Computer Science, pages 501-511, 1999.]] Google ScholarDigital Library
- 6.P. Flajolet and G. N. Martin. Probabilistic counting. Proceedings of 24th Annual IEEE Symposium on Foundations of Computer Science, pages 76-82, 1983.]]Google ScholarDigital Library
- 7.P. Gibbons, Y. Mattias, and V. Poosala. Fast Incremental Maintenance of Approximate Histograms. Proceedings of VLDB, pages 466-475, 1997.]] Google ScholarDigital Library
- 8.J. Gray, A. Bosworth, A. Leyman, and H. Pirahesh. Data Cube: A Relational Aggregation Operator Generalizing Group-by, Cross Tab and Sub Total. Proceedings of ICDE, pages 152-159, 1996.]] Google ScholarDigital Library
- 9.M. R. Henzinger, P .Raghavan, and S. Rajagopalan. Computing on data streams. Technical Report 1998-011, Digital Equipment Corporation, Systems Research Center, May, 1998.]]Google Scholar
- 10.P. Indyk. Stable distributions, pseudorandom generators, embeddings and data stream computation. Proceedings of 41st Annual Symposium on Foundations of Computer Science, 2000.]] Google ScholarDigital Library
- 11.Y. Ioannidis. Universality of Serial Histograms. Proceedings of VLDB, pages 256-277, 1993.]] Google ScholarDigital Library
- 12.Y. Ioannidis and S. Christodoulakis. Optimal Histograms for Limiting Worst-Case Error Propagation in the Size of Join Results. ACM Transactions on Database Systems, Vol. 18, No. 4, pages 709-748, December 1993.]] Google ScholarDigital Library
- 13.Y. Ioannidis and V. Poosala. Histogram Based Approximation of Set Valued Query Answers. Proceedings of VLDB, pages 174-185, 1999.]] Google ScholarDigital Library
- 14.Y. Ioannidis and V. Poosala. Balancing Histogram Optimality and Practicality for Query Result Size Estimation. Proceedings of ACM SIGMOD, 1995.]] Google ScholarDigital Library
- 15.H. V Jagadish, N. Koudas, S. Muthukrishnan, V. Poosala, K. C. Sevcik, and T. Suel. Optimal Histograms with Quality Guarantees. Proceedings of VLDB, 1998.]] Google ScholarDigital Library
- 16.H. V. Jagadish, Nick Koudas, and S. Muthukrishnan. Mining Deviants in a Time Series Database. Proceedings of VLDB, 1999.]] Google ScholarDigital Library
- 17.N. Koudas, S. Muthukrishnan, and D. Srivastava. Optimal histograms for hierarchical range queries. Proceedings of ACM PODS, 2000.]] Google ScholarDigital Library
- 18.G .S .Manku, S. Rajagopalan, and B. Lindsay. Approximate medians and other quantiles in one pass with limited memory. Proceedings of the ACM SIGMOD, 1998.]] Google ScholarDigital Library
- 19.G .S .Manku, S. Rajagopalan, and B. Lindsay. Random sampling techniques for space efficient online computation of order statistics of large databases. Proceedings of ACM SIGMOD, 1999.]] Google ScholarDigital Library
- 20.Y. Matias, J. Scott Vitter, and M. Wang. Wavelet-Based Histograms for Selectivity Estimation. Proceedings of ACM SIGMOD, 1998.]] Google ScholarDigital Library
- 21.J. I. Munro and M. S. Paterson. Selection and sorting with limited storage. Theoretical Computer Science, pages 315-323, 1980.]]Google Scholar
- 22.M. Muralikrishna and D. J. DeWitt. Equi-depth histograms for estimating selectivity factors for multidimension al queries. Proceedings of ACM SIGMOD, 1998.]]Google Scholar
- 23.S. Muthukrishnan, V. Poosala, and T. Suel. On Rectangular Partitioning In Two Dimensions: Algorithms, Complexity and Applications. Proceedings of ICDT, 1999.]] Google ScholarDigital Library
- 24.V. Poosala and V. Ganti. Fast Approximate Answers To Aggregate Queries On A Datacube. SSDBM, pages 24-33, 1999.]] Google ScholarDigital Library
- 25.V. Poosala and Y. Ioannidis. Estimation of Query-Result Distribution and Its Application In Parallel Join Load Balancing. Proceedings of VLDB, 1996.]] Google ScholarDigital Library
- 26.V. Poosala and Y. Ioannidis. Selectivity Estimation Without the Attribute Value Independence Assumption. Proceedings of VLDB, 1997.]] Google ScholarDigital Library
- 27.V. Poosala, Y. Ioannidis, P. Haas, and E. Shekita. Improved Histograms for Selectivity Estimation of Range Predicates. Proceedings of ACM SIGMOD, 1996.]] Google ScholarDigital Library
- 28.J. Vitter and M. Wang. Approximate computation of multidimensional aggregates on sparse data using wavelets. Proceedings of ACM SIGMOD, 1999.]] Google ScholarDigital Library
- 29.J.S. Vitter, M. Wang, and B. R. Iyer. Data Cube Approximation and Histograms via Wavelets. Proceedings of the 1998 ACM CIKM Intern. Conf. on Information and Knowledge Management, 1998.]] Google ScholarDigital Library
Index Terms
- Data-streams and histograms
Recommendations
Data Streams with Bounded Deletions
PODS '18: Proceedings of the 37th ACM SIGMOD-SIGACT-SIGAI Symposium on Principles of Database SystemsTwo prevalent models in the data stream literature are the insertion-only and turnstile models. Unfortunately, many important streaming problems require a Θ(log(n)) multiplicative factor more space for turnstile streams than for insertion-only streams. ...
Change detection in learning histograms from data streams
EPIA'07: Proceedings of the aritficial intelligence 13th Portuguese conference on Progress in artificial intelligenceIn this paper we study the problem of constructing histograms from high-speed time-changing data streams. Learning in this context requires the ability to process examples once at the rate they arrive, maintaining a histogram consistent with the most ...
Comments