ABSTRACT
Suppose we have a large table T of items i, each with a weight wi, e.g., people and their salary. In a general preprocessing step for estimating arbitrary subset sums, we assign each item a random priority depending on its weight. Suppose we want to estimate the sum of an arbitrary subset I ⊆ T. For any q > 2, considering only the q highest priority items from I, we obtain an unbiased estimator of the sum whose relative standard deviation is O(1/√q). Thus to get an expected approximation factor of 1 ± ε, it suffices to consider O(1/±ε2) items from I. Our estimator needs no knowledge of the number of items in the subset I, but we can also estimate that number if we want to estimate averages.The above scheme performs the same role as the on-line aggregation of Hellerstein et al. (SIGMOD'97) but it has the advantage of having expected good performance for any possible sequence of weights. In particular, the performance does not deteriorate in the common case of heavy-tailed weight distributions. This point is illustrated experimentally both with real and synthetic data.We will also show that our approach can be used to improve Cohen's size estimation framework (FOCS'94).
- N. Alon and J. H. Spencer, The Probabilistic Method, 2nd ed., Wiley, 2000.Google ScholarCross Ref
- L. Arge "External Memory Data Structures" In Handbook of Massive Data Sets, J. Abello, P. M. Pardalos, M. G. C. Resende (Eds.), Kluwer Academic Publishers, 2002, pages 313--357 Google ScholarDigital Library
- L. Arge and V. Samoladas and J. S. Vitter, On Two-Dimensional Indexability and Optimal Range Search Indexing, PODS 1999: 346--357 Google ScholarDigital Library
- S. Acharya, P. B. Gibbons, V. Poosala, S. Ramaswamy: The Aqua Approximate Query Answering System. SIGMOD Conference 1999: 574--576 Google ScholarDigital Library
- R. J. Adler, R. E. Feldman and M. S. Taqqu, (Eds.) "A practical guide to heavy tails", Birkhauser, Boston, 1998.Google ScholarDigital Library
- B. C. Arnold and N. Balakrishnan, "Relations, Bounds and Approximations for Order Statistics", Lecture Notes in Statistics, vol. 53, Springer, New York, 1988.Google Scholar
- K. R. W. Brewer and M. Hanif, "Sampling With Unequal Probabilities", Lecture Notes in Statistics, vol. 15, Springer, New York, 1983.Google ScholarCross Ref
- E. Cohen, "Size-estimation framework with applications to transitive closure and reachability", J. Comput. Syst. Sci., 55(3):441--453, 1997. Google ScholarDigital Library
- S. Chaudhuri, R. Motwani, V. R. Narasayya, "On Random Sampling over Joins", SIGMOD Conference 1999: 263--274 Google ScholarDigital Library
- H. A. David, "Order Statistics", Second Edition, Wiley Series in Probability and Mathematical Statistics, Wiley, New York, 1981Google Scholar
- N. G. Duffield, C. Lund, M. Thorup, "Learn more, sample less: control of volume and variance in network measurements", Manuscript, 2002, to appear in IEEE Trans. on Information Theory. Google ScholarDigital Library
- N. G. Duffield, C. Lund, M. Thorup, "Flow Sampling Under Hard Resource Constraints", ACM SIGMETRICS 2004, pages 85--96. Google ScholarDigital Library
- P. B. Gibbons, Y. Matias, "New Sampling-Based Summary Statistics for Improving Approximate Query Answers", SIGMOD Conference 1998: 331--342. Google ScholarDigital Library
- C. L. Hays, "What They Know About You", New York Times, November 14, 2004, Section 3, Page 1.Google Scholar
- J. M. Hellerstein, P. J. Haas, H. J. Wang, "Online Aggregation", SIGMOD Conference 1997: 171--182 Google ScholarDigital Library
- F. Olken, D. Rotem: "Random Sampling from Databases - A Survey" http://pueblo.lbl.gov/~olken/sampling.html March 1994Google Scholar
- K. Park, G. Kim, and M. Crovella, "On the Relationship Between File Sizes, Transport Protocols, and Self-Similar Network Traffic". In Proc. 4th Int. Conf. Network Protocols (ICNP), 1996. Google ScholarDigital Library
- M. Szegedy, "Near optimality of the priority sampling procedure". Electronic Colloquium on Computational Complexity Report TR05-001, 2005.Google Scholar
- J. S. Vitter, "Random Sampling with a Reservoir", ACM Trans. Math. Softw. 11(1): 37--57 (1985) Google ScholarDigital Library
Index Terms
- Estimating arbitrary subset sums with few probes
Recommendations
Priority sampling for estimation of arbitrary subset sums
From a high-volume stream of weighted items, we want to create a generic sample of a certain limited size that we can later use to estimate the total weight of arbitrary subsets. Applied to Internet traffic analysis, the items could be records ...
Sampling-based estimators for subset-based queries
We consider the problem of using sampling to estimate the result of an aggregation operation over a subset-based SQL query, where a subquery is correlated to an outer query by a NOT EXISTS, NOT IN, EXISTS or IN clause. We design an unbiased estimator ...
Efficient Stream Sampling for Variance-Optimal Estimation of Subset Sums
From a high volume stream of weighted items, we want to maintain a generic sample of a certain limited size $k$ that we can later use to estimate the total weight of arbitrary subsets. This is the classic context of on-line reservoir sampling, thinking ...
Comments