skip to main content
10.1145/2882903.2882948acmconferencesArticle/Chapter ViewAbstractPublication PagesmodConference Proceedingsconference-collections
research-article

Augmented Sketch: Faster and More Accurate Stream Processing

Authors Info & Claims
Published:14 June 2016Publication History

ABSTRACT

Approximated algorithms are often used to estimate the frequency of items on high volume, fast data streams. The most common ones are variations of Count-Min sketch, which use sub-linear space for the count, but can produce errors in the counts of the most frequent items and can misclassify low-frequency items. In this paper, we improve the accuracy of sketch-based algorithms by increasing the frequency estimation accuracy of the most frequent items and reducing the possible misclassification of low-frequency items, while also improving the overall throughput. Our solution, called Augmented Sketch (ASketch), is based on a pre-filtering stage that dynamically identifies and aggregates the most frequent items. Items overflowing the pre-filtering stage are processed using a conventional sketch algorithm, thereby making the solution general and applicable in a wide range of contexts. The pre-filtering stage can be efficiently implemented with SIMD instructions on multi-core machines and can be further parallelized through pipeline parallelism where the filtering stage runs in one core and the sketch algorithm runs in another core.

References

  1. C. Aggarwal and K. Subbian. Event Detection in Social Streams. In SDM, 2012.Google ScholarGoogle ScholarCross RefCross Ref
  2. C. Aggarwal and P. S. Yu. On Classification of High-Cardinality Data Streams. In SDM, 2010.Google ScholarGoogle ScholarCross RefCross Ref
  3. N. Alon, Y. Matias, and M. Szegedy. The Space Complexity of Approximating the Frequency Moments. In STOC, 1996. Google ScholarGoogle ScholarDigital LibraryDigital Library
  4. R. Berinde, P. Indyk, G. Cormode, and M. J. Strauss. Space-optimal Heavy Hitters with Strong Error Bounds. ACM TODS, 35(4):26, 2010. Google ScholarGoogle ScholarDigital LibraryDigital Library
  5. D. Bitton and D. J. DeWitt. Duplicate Record Elimination in Large Data Files. ACM TODS, 8(2):255--265, 1983. Google ScholarGoogle ScholarDigital LibraryDigital Library
  6. B. H. Bloom. Space/time Trade-offs in Hash Coding with Allowable Errors. Comm. of ACM, 13:422--426, 1970. Google ScholarGoogle ScholarDigital LibraryDigital Library
  7. M. Charikar, K. Chen, and M. Farach-Colton. Finding Frequent Items in Data Streams. In ICALP, 2002. Google ScholarGoogle ScholarDigital LibraryDigital Library
  8. G. Cormode, M. Garofalakis, P. J. Haas, and C. Jermaine. Synopses for massive data: Samples, histograms, wavelets, sketches. Foundations and Trends in Databases, 4(1--3):1--294, 2012. Google ScholarGoogle ScholarDigital LibraryDigital Library
  9. G. Cormode and M. Hadjieleftheriou. Finding Frequent Items in Data Streams. In VLDB, 2008. Google ScholarGoogle ScholarDigital LibraryDigital Library
  10. G. Cormode, T. Johnson, F. Korn, S. Muthukrishnan, O. Spatscheck, and D. Srivastava. Holistic UDAFs at Streaming Speeds. In SIGMOD, 2004. Google ScholarGoogle ScholarDigital LibraryDigital Library
  11. G. Cormode and S. Muthukrishnan. An Improved Data-Stream Summary: The Count-min Sketch and its Applications. J. of Algorithms, 55(1), 2005. Google ScholarGoogle ScholarDigital LibraryDigital Library
  12. A. Dobra, M. Garofalakis, J. Gehrke, and R. Rastogi. Processing Complex Aggregate Queries over Data Streams. In SIGMOD, 2002. Google ScholarGoogle ScholarDigital LibraryDigital Library
  13. C. Estan and G. Varghese. New Directions in Traffic Measurement and Accounting: Focusing on the Elephants, Ignoring the Mice. ACM Trans. Comput. Syst., 21(3):270--313, 2003. Google ScholarGoogle ScholarDigital LibraryDigital Library
  14. S. Ganguly, M. Garofalakis, and R. Rastogi. Processing data-stream join aggregates using skimmed sketches. In Advances in Database Technology-EDBT 2004, pages 569--586. Springer, 2004.Google ScholarGoogle ScholarCross RefCross Ref
  15. S. Ganguly, P. B. Gibbons, Y. Matias, and A. Silberschatz. Bifocal Sampling for Skew-Resistant Join Size Estimation. In SIGMOD, pages 271--281, 1996. Google ScholarGoogle ScholarDigital LibraryDigital Library
  16. M. Garofalakis and P. B. Gibbons. Wavelet Synopses with Error Guarantees. In SIGMOD, 2002. Google ScholarGoogle ScholarDigital LibraryDigital Library
  17. A. C. Gilbert, Y. Kotidis, S. Muthukrishnan, and M. Strauss. Surfing Wavelets on Streams: One-Pass Summaries for Approximate Aggregate Queries. In VLDB, 2001. Google ScholarGoogle ScholarDigital LibraryDigital Library
  18. A. C. Gilbert, Y. Kotidis, S. Muthukrishnan, and M. Strauss. How to Summarize the Universe: Dynamic Maintenance of Quantiles. In VLDB, 2002. Google ScholarGoogle ScholarDigital LibraryDigital Library
  19. A. Goyal, H. D. III, and G. Cormode. Sketch Algorithms for Estimating Point Queries in NLP. In EMNLP-CoNLL, 2012. Google ScholarGoogle ScholarDigital LibraryDigital Library
  20. S. Guha, N. Koudas, and K. Shim. Data-streams and Histograms. In STOC, 2001. Google ScholarGoogle ScholarDigital LibraryDigital Library
  21. C. Koch, S. Scherzinger, and M. Schmidt. XML Prefiltering as a String Matching Problem. In ICDE, pages 626--635, 2008. Google ScholarGoogle ScholarDigital LibraryDigital Library
  22. B. Krishnamurthy, S. Sen, Y. Zhang, and Y. Chen. Sketch-based Change Detection: Methods, Evaluation, and Applications. In IMC, 2003. Google ScholarGoogle ScholarDigital LibraryDigital Library
  23. P. Larson. Grouping and Duplicate Elimination: Benefits of Early Aggregation, 1997. Technical Report, MSR-TR-97--36.Google ScholarGoogle Scholar
  24. T. Liu, Y. Sun, A. X. Liu, L. Guo, and B. Fang. A Prefiltering Approach to Regular Expression Matching for Network Security Systems. In Applied Cryptography and Network Security, pages 363--380, 2012. Google ScholarGoogle ScholarDigital LibraryDigital Library
  25. Y. Lu, A. Montanari, B. Prabhakar, S. Dharmapurikar, and A. Kabbani. Counter braids: a novel counter architecture for per-flow measurement. In SIGMETRICS, 2008. Google ScholarGoogle ScholarDigital LibraryDigital Library
  26. N. Manerikar and T. Palpanas. Frequent Items in Streaming Data: An Experimental Evaluation of the State-of-the-art. Data Knowl. Eng., 68(4):415--430, 2009. Google ScholarGoogle ScholarDigital LibraryDigital Library
  27. A. Metwally, D. Agrawal, and A. E. Abbadi. Efficient Computation of Frequent and Top-k Elements in Data Streams. In ICDT, 2005. Google ScholarGoogle ScholarDigital LibraryDigital Library
  28. J. Misra and D. Gries. Finding repeated elements. Science of computer programming, 2(2):143--152, 1982.Google ScholarGoogle Scholar
  29. S. Muthukrishnan. Data streams: Algorithms and applications. Now Publishers Inc, 2005.Google ScholarGoogle ScholarCross RefCross Ref
  30. A. Pietracaprina, M. Riondato, E. Upfal, and F. Vandin. Mining top-K Frequent Itemsets Through Progressive Sampling. Data Min. Knowl. Discov., 21(2):310--326, 2010. Google ScholarGoogle ScholarDigital LibraryDigital Library
  31. O. Rottenstreich, Y. Kanizo, and I. Keslassy. The Variable-Increment Counting Bloom Filter. IEEE/ACM Trans. Netw., 22(4):1092--1105, 2014. Google ScholarGoogle ScholarDigital LibraryDigital Library
  32. P. Roy, J. Teubner, and G. Alonso. Efficient Frequent Item Counting in Multi-Core Hardware. In KDD, 2012. Google ScholarGoogle ScholarDigital LibraryDigital Library
  33. F. Rusu and A. Dobra. Sketches for Size of Join Estimation. ACM TODS, 33(3):15, 2008. Google ScholarGoogle ScholarDigital LibraryDigital Library
  34. D. Thomas, R. Bordawekar, C. Aggarwal, and P. S. Yu. On Efficient Query Processing of Stream Counts on the Cell Processor. In ICDE, 2009. Google ScholarGoogle ScholarDigital LibraryDigital Library
  35. W. Yan and Y. X. B. Malin. Scalable Load Balancing for MapReduce-based Record Linkage. In IPCCC, 2013.Google ScholarGoogle ScholarCross RefCross Ref
  36. J. Yu, Z. Chong, H. Lu, and A. Zhou. False Positive or False Negative: Mining Frequent Itemsets from High Speed Transactional Data Streams. In VLDB, 2004. Google ScholarGoogle ScholarDigital LibraryDigital Library
  37. P. Zhao, C. C. Aggarwal, and M. Wang. gSketch: On Query Estimation in Graph Streams. In VLDB, 2012. Google ScholarGoogle ScholarDigital LibraryDigital Library

Index Terms

  1. Augmented Sketch: Faster and More Accurate Stream Processing

          Recommendations

          Comments

          Login options

          Check if you have access through your login credentials or your institution to get full access on this article.

          Sign in
          • Published in

            cover image ACM Conferences
            SIGMOD '16: Proceedings of the 2016 International Conference on Management of Data
            June 2016
            2300 pages
            ISBN:9781450335317
            DOI:10.1145/2882903

            Copyright © 2016 ACM

            Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. Copyrights for components of this work owned by others than ACM must be honored. Abstracting with credit is permitted. To copy otherwise, or republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee. Request permissions from [email protected]

            Publisher

            Association for Computing Machinery

            New York, NY, United States

            Publication History

            • Published: 14 June 2016

            Permissions

            Request permissions about this article.

            Request Permissions

            Check for updates

            Qualifiers

            • research-article

            Acceptance Rates

            Overall Acceptance Rate785of4,003submissions,20%

          PDF Format

          View or Download as a PDF file.

          PDF

          eReader

          View online with eReader.

          eReader