skip to main content
10.1145/1065167.1065209acmconferencesArticle/Chapter ViewAbstractPublication PagespodsConference Proceedingsconference-collections
Article

Estimating arbitrary subset sums with few probes

Published:13 June 2005Publication History

ABSTRACT

Suppose we have a large table T of items i, each with a weight wi, e.g., people and their salary. In a general preprocessing step for estimating arbitrary subset sums, we assign each item a random priority depending on its weight. Suppose we want to estimate the sum of an arbitrary subset IT. For any q > 2, considering only the q highest priority items from I, we obtain an unbiased estimator of the sum whose relative standard deviation is O(1/√q). Thus to get an expected approximation factor of 1 ± ε, it suffices to consider O(1/±ε2) items from I. Our estimator needs no knowledge of the number of items in the subset I, but we can also estimate that number if we want to estimate averages.The above scheme performs the same role as the on-line aggregation of Hellerstein et al. (SIGMOD'97) but it has the advantage of having expected good performance for any possible sequence of weights. In particular, the performance does not deteriorate in the common case of heavy-tailed weight distributions. This point is illustrated experimentally both with real and synthetic data.We will also show that our approach can be used to improve Cohen's size estimation framework (FOCS'94).

References

  1. N. Alon and J. H. Spencer, The Probabilistic Method, 2nd ed., Wiley, 2000.Google ScholarGoogle ScholarCross RefCross Ref
  2. L. Arge "External Memory Data Structures" In Handbook of Massive Data Sets, J. Abello, P. M. Pardalos, M. G. C. Resende (Eds.), Kluwer Academic Publishers, 2002, pages 313--357 Google ScholarGoogle ScholarDigital LibraryDigital Library
  3. L. Arge and V. Samoladas and J. S. Vitter, On Two-Dimensional Indexability and Optimal Range Search Indexing, PODS 1999: 346--357 Google ScholarGoogle ScholarDigital LibraryDigital Library
  4. S. Acharya, P. B. Gibbons, V. Poosala, S. Ramaswamy: The Aqua Approximate Query Answering System. SIGMOD Conference 1999: 574--576 Google ScholarGoogle ScholarDigital LibraryDigital Library
  5. R. J. Adler, R. E. Feldman and M. S. Taqqu, (Eds.) "A practical guide to heavy tails", Birkhauser, Boston, 1998.Google ScholarGoogle ScholarDigital LibraryDigital Library
  6. B. C. Arnold and N. Balakrishnan, "Relations, Bounds and Approximations for Order Statistics", Lecture Notes in Statistics, vol. 53, Springer, New York, 1988.Google ScholarGoogle Scholar
  7. K. R. W. Brewer and M. Hanif, "Sampling With Unequal Probabilities", Lecture Notes in Statistics, vol. 15, Springer, New York, 1983.Google ScholarGoogle ScholarCross RefCross Ref
  8. E. Cohen, "Size-estimation framework with applications to transitive closure and reachability", J. Comput. Syst. Sci., 55(3):441--453, 1997. Google ScholarGoogle ScholarDigital LibraryDigital Library
  9. S. Chaudhuri, R. Motwani, V. R. Narasayya, "On Random Sampling over Joins", SIGMOD Conference 1999: 263--274 Google ScholarGoogle ScholarDigital LibraryDigital Library
  10. H. A. David, "Order Statistics", Second Edition, Wiley Series in Probability and Mathematical Statistics, Wiley, New York, 1981Google ScholarGoogle Scholar
  11. N. G. Duffield, C. Lund, M. Thorup, "Learn more, sample less: control of volume and variance in network measurements", Manuscript, 2002, to appear in IEEE Trans. on Information Theory. Google ScholarGoogle ScholarDigital LibraryDigital Library
  12. N. G. Duffield, C. Lund, M. Thorup, "Flow Sampling Under Hard Resource Constraints", ACM SIGMETRICS 2004, pages 85--96. Google ScholarGoogle ScholarDigital LibraryDigital Library
  13. P. B. Gibbons, Y. Matias, "New Sampling-Based Summary Statistics for Improving Approximate Query Answers", SIGMOD Conference 1998: 331--342. Google ScholarGoogle ScholarDigital LibraryDigital Library
  14. C. L. Hays, "What They Know About You", New York Times, November 14, 2004, Section 3, Page 1.Google ScholarGoogle Scholar
  15. J. M. Hellerstein, P. J. Haas, H. J. Wang, "Online Aggregation", SIGMOD Conference 1997: 171--182 Google ScholarGoogle ScholarDigital LibraryDigital Library
  16. F. Olken, D. Rotem: "Random Sampling from Databases - A Survey" http://pueblo.lbl.gov/~olken/sampling.html March 1994Google ScholarGoogle Scholar
  17. K. Park, G. Kim, and M. Crovella, "On the Relationship Between File Sizes, Transport Protocols, and Self-Similar Network Traffic". In Proc. 4th Int. Conf. Network Protocols (ICNP), 1996. Google ScholarGoogle ScholarDigital LibraryDigital Library
  18. M. Szegedy, "Near optimality of the priority sampling procedure". Electronic Colloquium on Computational Complexity Report TR05-001, 2005.Google ScholarGoogle Scholar
  19. J. S. Vitter, "Random Sampling with a Reservoir", ACM Trans. Math. Softw. 11(1): 37--57 (1985) Google ScholarGoogle ScholarDigital LibraryDigital Library

Index Terms

  1. Estimating arbitrary subset sums with few probes

                  Recommendations

                  Comments

                  Login options

                  Check if you have access through your login credentials or your institution to get full access on this article.

                  Sign in
                  • Published in

                    cover image ACM Conferences
                    PODS '05: Proceedings of the twenty-fourth ACM SIGMOD-SIGACT-SIGART symposium on Principles of database systems
                    June 2005
                    388 pages
                    ISBN:1595930620
                    DOI:10.1145/1065167
                    • General Chair:
                    • Georg Gottlob,
                    • Program Chair:
                    • Foto Afrati

                    Copyright © 2005 ACM

                    Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. Copyrights for components of this work owned by others than ACM must be honored. Abstracting with credit is permitted. To copy otherwise, or republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee. Request permissions from [email protected]

                    Publisher

                    Association for Computing Machinery

                    New York, NY, United States

                    Publication History

                    • Published: 13 June 2005

                    Permissions

                    Request permissions about this article.

                    Request Permissions

                    Check for updates

                    Qualifiers

                    • Article

                    Acceptance Rates

                    Overall Acceptance Rate642of2,707submissions,24%

                  PDF Format

                  View or Download as a PDF file.

                  PDF

                  eReader

                  View online with eReader.

                  eReader