Article

Estimating arbitrary subset sums with few probes

Authors:
Noga Alon

Tel Aviv University, Tel Aviv, Israel

Tel Aviv University, Tel Aviv, Israel
View Profile

,
Nick Duffield

AT&T Labs---Research, NJ

AT&T Labs---Research, NJ
View Profile

,
Carsten Lund

AT&T Labs---Research, NJ

AT&T Labs---Research, NJ
View Profile

,
Mikkel Thorup

AT&T Labs---Research, NJ

AT&T Labs---Research, NJ
View Profile

PODS '05: Proceedings of the twenty-fourth ACM SIGMOD-SIGACT-SIGART symposium on Principles of database systemsJune 2005Pages 317–325https://doi.org/10.1145/1065167.1065209

Published:13 June 2005Publication History

PODS '05: Proceedings of the twenty-fourth ACM SIGMOD-SIGACT-SIGART symposium on Principles of database systems

Pages 317–325

ABSTRACT

Suppose we have a large table T of items i, each with a weight w_i, e.g., people and their salary. In a general preprocessing step for estimating arbitrary subset sums, we assign each item a random priority depending on its weight. Suppose we want to estimate the sum of an arbitrary subset I ⊆ T. For any q > 2, considering only the q highest priority items from I, we obtain an unbiased estimator of the sum whose relative standard deviation is O(1/√q). Thus to get an expected approximation factor of 1 ± ε, it suffices to consider O(1/±ε²) items from I. Our estimator needs no knowledge of the number of items in the subset I, but we can also estimate that number if we want to estimate averages.The above scheme performs the same role as the on-line aggregation of Hellerstein et al. (SIGMOD'97) but it has the advantage of having expected good performance for any possible sequence of weights. In particular, the performance does not deteriorate in the common case of heavy-tailed weight distributions. This point is illustrated experimentally both with real and synthetic data.We will also show that our approach can be used to improve Cohen's size estimation framework (FOCS'94).

References

N. Alon and J. H. Spencer, The Probabilistic Method, 2nd ed., Wiley, 2000.Google ScholarCross Ref
L. Arge "External Memory Data Structures" In Handbook of Massive Data Sets, J. Abello, P. M. Pardalos, M. G. C. Resende (Eds.), Kluwer Academic Publishers, 2002, pages 313--357 Google ScholarDigital Library
L. Arge and V. Samoladas and J. S. Vitter, On Two-Dimensional Indexability and Optimal Range Search Indexing, PODS 1999: 346--357 Google ScholarDigital Library
S. Acharya, P. B. Gibbons, V. Poosala, S. Ramaswamy: The Aqua Approximate Query Answering System. SIGMOD Conference 1999: 574--576 Google ScholarDigital Library
R. J. Adler, R. E. Feldman and M. S. Taqqu, (Eds.) "A practical guide to heavy tails", Birkhauser, Boston, 1998.Google ScholarDigital Library
B. C. Arnold and N. Balakrishnan, "Relations, Bounds and Approximations for Order Statistics", Lecture Notes in Statistics, vol. 53, Springer, New York, 1988.Google Scholar
K. R. W. Brewer and M. Hanif, "Sampling With Unequal Probabilities", Lecture Notes in Statistics, vol. 15, Springer, New York, 1983.Google ScholarCross Ref
E. Cohen, "Size-estimation framework with applications to transitive closure and reachability", J. Comput. Syst. Sci., 55(3):441--453, 1997. Google ScholarDigital Library
S. Chaudhuri, R. Motwani, V. R. Narasayya, "On Random Sampling over Joins", SIGMOD Conference 1999: 263--274 Google ScholarDigital Library
H. A. David, "Order Statistics", Second Edition, Wiley Series in Probability and Mathematical Statistics, Wiley, New York, 1981Google Scholar
N. G. Duffield, C. Lund, M. Thorup, "Learn more, sample less: control of volume and variance in network measurements", Manuscript, 2002, to appear in IEEE Trans. on Information Theory. Google ScholarDigital Library
N. G. Duffield, C. Lund, M. Thorup, "Flow Sampling Under Hard Resource Constraints", ACM SIGMETRICS 2004, pages 85--96. Google ScholarDigital Library
P. B. Gibbons, Y. Matias, "New Sampling-Based Summary Statistics for Improving Approximate Query Answers", SIGMOD Conference 1998: 331--342. Google ScholarDigital Library
C. L. Hays, "What They Know About You", New York Times, November 14, 2004, Section 3, Page 1.Google Scholar
J. M. Hellerstein, P. J. Haas, H. J. Wang, "Online Aggregation", SIGMOD Conference 1997: 171--182 Google ScholarDigital Library
F. Olken, D. Rotem: "Random Sampling from Databases - A Survey" http://pueblo.lbl.gov/~olken/sampling.html March 1994Google Scholar
K. Park, G. Kim, and M. Crovella, "On the Relationship Between File Sizes, Transport Protocols, and Self-Similar Network Traffic". In Proc. 4th Int. Conf. Network Protocols (ICNP), 1996. Google ScholarDigital Library
M. Szegedy, "Near optimality of the priority sampling procedure". Electronic Colloquium on Computational Complexity Report TR05-001, 2005.Google Scholar
J. S. Vitter, "Random Sampling with a Reservoir", ACM Trans. Math. Softw. 11(1): 37--57 (1985) Google ScholarDigital Library

Index Terms

Recommendations

Priority sampling for estimation of arbitrary subset sums

From a high-volume stream of weighted items, we want to create a generic sample of a certain limited size that we can later use to estimate the total weight of arbitrary subsets. Applied to Internet traffic analysis, the items could be records ...
Read More
Sampling-based estimators for subset-based queries

We consider the problem of using sampling to estimate the result of an aggregation operation over a subset-based SQL query, where a subquery is correlated to an outer query by a NOT EXISTS, NOT IN, EXISTS or IN clause. We design an unbiased estimator ...
Read More
Efficient Stream Sampling for Variance-Optimal Estimation of Subset Sums

From a high volume stream of weighted items, we want to maintain a generic sample of a certain limited size $k$ that we can later use to estimate the total weight of arbitrary subsets. This is the classic context of on-line reservoir sampling, thinking ...
Read More

Comments

Login options

Check if you have access through your login credentials or your institution to get full access on this article.

Full Access

Get this Publication

Published in
PODS '05: Proceedings of the twenty-fourth ACM SIGMOD-SIGACT-SIGART symposium on Principles of database systems
June 2005
388 pages
ISBN:1595930620
DOI:10.1145/1065167
General Chair:
Georg Gottlob
Vienna University of Technology, Austria
,
Program Chair:
Foto Afrati
National Technical University of Athens, Greece
Copyright © 2005 ACM
Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. Copyrights for components of this work owned by others than ACM must be honored. Abstracting with credit is permitted. To copy otherwise, or republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee. Request permissions from [email protected]
Sponsors
In-Cooperation
Publisher
Association for Computing Machinery
New York, NY, United States
Publication History
- Published: 13 June 2005
Permissions
Request permissions about this article.
Request Permissions

Check for updates
Author Tags
IP Flows
databases
preprocessing
sampling
Qualifiers
- Article
Conference

Acceptance Rates
Overall Acceptance Rate642of2,707submissions,24%
Funding Sources
Other Metrics
View Article Metrics

Article Metrics
- 33
  Total Citations
  View Citations
- 321
  Total Downloads
- Downloads (Last 12 months)6
- Downloads (Last 6 weeks)0
Other Metrics
View Author Metrics
Cited By
View all

PDF Format

View or Download as a PDF file.

PDF

eReader

View online with eReader.

eReader

Estimating arbitrary subset sums with few probes

PODS '05: Proceedings of the twenty-fourth ACM SIGMOD-SIGACT-SIGART symposium on Principles of database systems

ABSTRACT

References

Cited By

Index Terms

Recommendations

Priority sampling for estimation of arbitrary subset sums

Sampling-based estimators for subset-based queries

Efficient Stream Sampling for Variance-Optimal Estimation of Subset Sums

Comments

Login options

Full Access

Published in

Sponsors

In-Cooperation

Publisher

Publication History

Permissions

Check for updates

Author Tags

Qualifiers

Conference

Acceptance Rates

Funding Sources

Other Metrics

Article Metrics

Other Metrics

Cited By

PDF Format

eReader

Digital Edition

Caption

Estimating arbitrary subset sums with few probes

PODS '05: Proceedings of the twenty-fourth ACM SIGMOD-SIGACT-SIGART symposium on Principles of database systems

ABSTRACT

References

Cited By

Index Terms

Recommendations

Priority sampling for estimation of arbitrary subset sums

Sampling-based estimators for subset-based queries

Efficient Stream Sampling for Variance-Optimal Estimation of Subset Sums

Comments

Login options

Full Access

Published in

Sponsors

In-Cooperation

Publisher

Publication History

Permissions

Check for updates

Author Tags

Qualifiers

Conference

Acceptance Rates

Funding Sources

Article Metrics

Other Metrics

PDF Format

eReader

Digital Edition

Share this Publication link

Share on Social Media