ABSTRACT
Approximated algorithms are often used to estimate the frequency of items on high volume, fast data streams. The most common ones are variations of Count-Min sketch, which use sub-linear space for the count, but can produce errors in the counts of the most frequent items and can misclassify low-frequency items. In this paper, we improve the accuracy of sketch-based algorithms by increasing the frequency estimation accuracy of the most frequent items and reducing the possible misclassification of low-frequency items, while also improving the overall throughput. Our solution, called Augmented Sketch (ASketch), is based on a pre-filtering stage that dynamically identifies and aggregates the most frequent items. Items overflowing the pre-filtering stage are processed using a conventional sketch algorithm, thereby making the solution general and applicable in a wide range of contexts. The pre-filtering stage can be efficiently implemented with SIMD instructions on multi-core machines and can be further parallelized through pipeline parallelism where the filtering stage runs in one core and the sketch algorithm runs in another core.
- C. Aggarwal and K. Subbian. Event Detection in Social Streams. In SDM, 2012.Google ScholarCross Ref
- C. Aggarwal and P. S. Yu. On Classification of High-Cardinality Data Streams. In SDM, 2010.Google ScholarCross Ref
- N. Alon, Y. Matias, and M. Szegedy. The Space Complexity of Approximating the Frequency Moments. In STOC, 1996. Google ScholarDigital Library
- R. Berinde, P. Indyk, G. Cormode, and M. J. Strauss. Space-optimal Heavy Hitters with Strong Error Bounds. ACM TODS, 35(4):26, 2010. Google ScholarDigital Library
- D. Bitton and D. J. DeWitt. Duplicate Record Elimination in Large Data Files. ACM TODS, 8(2):255--265, 1983. Google ScholarDigital Library
- B. H. Bloom. Space/time Trade-offs in Hash Coding with Allowable Errors. Comm. of ACM, 13:422--426, 1970. Google ScholarDigital Library
- M. Charikar, K. Chen, and M. Farach-Colton. Finding Frequent Items in Data Streams. In ICALP, 2002. Google ScholarDigital Library
- G. Cormode, M. Garofalakis, P. J. Haas, and C. Jermaine. Synopses for massive data: Samples, histograms, wavelets, sketches. Foundations and Trends in Databases, 4(1--3):1--294, 2012. Google ScholarDigital Library
- G. Cormode and M. Hadjieleftheriou. Finding Frequent Items in Data Streams. In VLDB, 2008. Google ScholarDigital Library
- G. Cormode, T. Johnson, F. Korn, S. Muthukrishnan, O. Spatscheck, and D. Srivastava. Holistic UDAFs at Streaming Speeds. In SIGMOD, 2004. Google ScholarDigital Library
- G. Cormode and S. Muthukrishnan. An Improved Data-Stream Summary: The Count-min Sketch and its Applications. J. of Algorithms, 55(1), 2005. Google ScholarDigital Library
- A. Dobra, M. Garofalakis, J. Gehrke, and R. Rastogi. Processing Complex Aggregate Queries over Data Streams. In SIGMOD, 2002. Google ScholarDigital Library
- C. Estan and G. Varghese. New Directions in Traffic Measurement and Accounting: Focusing on the Elephants, Ignoring the Mice. ACM Trans. Comput. Syst., 21(3):270--313, 2003. Google ScholarDigital Library
- S. Ganguly, M. Garofalakis, and R. Rastogi. Processing data-stream join aggregates using skimmed sketches. In Advances in Database Technology-EDBT 2004, pages 569--586. Springer, 2004.Google ScholarCross Ref
- S. Ganguly, P. B. Gibbons, Y. Matias, and A. Silberschatz. Bifocal Sampling for Skew-Resistant Join Size Estimation. In SIGMOD, pages 271--281, 1996. Google ScholarDigital Library
- M. Garofalakis and P. B. Gibbons. Wavelet Synopses with Error Guarantees. In SIGMOD, 2002. Google ScholarDigital Library
- A. C. Gilbert, Y. Kotidis, S. Muthukrishnan, and M. Strauss. Surfing Wavelets on Streams: One-Pass Summaries for Approximate Aggregate Queries. In VLDB, 2001. Google ScholarDigital Library
- A. C. Gilbert, Y. Kotidis, S. Muthukrishnan, and M. Strauss. How to Summarize the Universe: Dynamic Maintenance of Quantiles. In VLDB, 2002. Google ScholarDigital Library
- A. Goyal, H. D. III, and G. Cormode. Sketch Algorithms for Estimating Point Queries in NLP. In EMNLP-CoNLL, 2012. Google ScholarDigital Library
- S. Guha, N. Koudas, and K. Shim. Data-streams and Histograms. In STOC, 2001. Google ScholarDigital Library
- C. Koch, S. Scherzinger, and M. Schmidt. XML Prefiltering as a String Matching Problem. In ICDE, pages 626--635, 2008. Google ScholarDigital Library
- B. Krishnamurthy, S. Sen, Y. Zhang, and Y. Chen. Sketch-based Change Detection: Methods, Evaluation, and Applications. In IMC, 2003. Google ScholarDigital Library
- P. Larson. Grouping and Duplicate Elimination: Benefits of Early Aggregation, 1997. Technical Report, MSR-TR-97--36.Google Scholar
- T. Liu, Y. Sun, A. X. Liu, L. Guo, and B. Fang. A Prefiltering Approach to Regular Expression Matching for Network Security Systems. In Applied Cryptography and Network Security, pages 363--380, 2012. Google ScholarDigital Library
- Y. Lu, A. Montanari, B. Prabhakar, S. Dharmapurikar, and A. Kabbani. Counter braids: a novel counter architecture for per-flow measurement. In SIGMETRICS, 2008. Google ScholarDigital Library
- N. Manerikar and T. Palpanas. Frequent Items in Streaming Data: An Experimental Evaluation of the State-of-the-art. Data Knowl. Eng., 68(4):415--430, 2009. Google ScholarDigital Library
- A. Metwally, D. Agrawal, and A. E. Abbadi. Efficient Computation of Frequent and Top-k Elements in Data Streams. In ICDT, 2005. Google ScholarDigital Library
- J. Misra and D. Gries. Finding repeated elements. Science of computer programming, 2(2):143--152, 1982.Google Scholar
- S. Muthukrishnan. Data streams: Algorithms and applications. Now Publishers Inc, 2005.Google ScholarCross Ref
- A. Pietracaprina, M. Riondato, E. Upfal, and F. Vandin. Mining top-K Frequent Itemsets Through Progressive Sampling. Data Min. Knowl. Discov., 21(2):310--326, 2010. Google ScholarDigital Library
- O. Rottenstreich, Y. Kanizo, and I. Keslassy. The Variable-Increment Counting Bloom Filter. IEEE/ACM Trans. Netw., 22(4):1092--1105, 2014. Google ScholarDigital Library
- P. Roy, J. Teubner, and G. Alonso. Efficient Frequent Item Counting in Multi-Core Hardware. In KDD, 2012. Google ScholarDigital Library
- F. Rusu and A. Dobra. Sketches for Size of Join Estimation. ACM TODS, 33(3):15, 2008. Google ScholarDigital Library
- D. Thomas, R. Bordawekar, C. Aggarwal, and P. S. Yu. On Efficient Query Processing of Stream Counts on the Cell Processor. In ICDE, 2009. Google ScholarDigital Library
- W. Yan and Y. X. B. Malin. Scalable Load Balancing for MapReduce-based Record Linkage. In IPCCC, 2013.Google ScholarCross Ref
- J. Yu, Z. Chong, H. Lu, and A. Zhou. False Positive or False Negative: Mining Frequent Itemsets from High Speed Transactional Data Streams. In VLDB, 2004. Google ScholarDigital Library
- P. Zhao, C. C. Aggarwal, and M. Wang. gSketch: On Query Estimation in Graph Streams. In VLDB, 2012. Google ScholarDigital Library
Index Terms
- Augmented Sketch: Faster and More Accurate Stream Processing
Recommendations
Cold Filter: A Meta-Framework for Faster and More Accurate Stream Processing
SIGMOD '18: Proceedings of the 2018 International Conference on Management of DataApproximate stream processing algorithms, such as Count-Min sketch, Space-Saving, etc., support numerous applications in databases, storage systems, networking, and other domains. However, the unbalanced distribution in real data streams poses great ...
XY-Sketch: on Sketching Data Streams at Web Scale
WWW '21: Proceedings of the Web Conference 2021Conventional sketching methods on counting stream item frequencies use hash functions for mapping data items to a concise structure, e.g., a two-dimensional array, at the expense of overcounting due to hashing collisions. Despite the popularity, ...
Fast and accurate stream processing by filtering the cold
AbstractApproximate stream processing algorithms, such as Count-Min sketch, Space-Saving, support numerous applications across multiple areas such as databases, storage systems, and networking. However, the unbalanced distribution in real data streams are ...
Comments