research-article

Augmented Sketch: Faster and More Accurate Stream Processing

Authors:
Pratanu Roy

Eidgenössische Technische Hochschule Zurich, Zurich, Switzerland

Eidgenössische Technische Hochschule Zurich, Zurich, Switzerland
View Profile

,
Arijit Khan

Nanyang Technological University Singapore, Singapore, Singapore

Nanyang Technological University Singapore, Singapore, Singapore
View Profile

,
Gustavo Alonso

Eidgenössische Technische Hochschule Zurich, Zurich, Switzerland

Eidgenössische Technische Hochschule Zurich, Zurich, Switzerland
View Profile

SIGMOD '16: Proceedings of the 2016 International Conference on Management of DataJune 2016Pages 1449–1463https://doi.org/10.1145/2882903.2882948

Published:14 June 2016Publication History

SIGMOD '16: Proceedings of the 2016 International Conference on Management of Data

Pages 1449–1463

ABSTRACT

Approximated algorithms are often used to estimate the frequency of items on high volume, fast data streams. The most common ones are variations of Count-Min sketch, which use sub-linear space for the count, but can produce errors in the counts of the most frequent items and can misclassify low-frequency items. In this paper, we improve the accuracy of sketch-based algorithms by increasing the frequency estimation accuracy of the most frequent items and reducing the possible misclassification of low-frequency items, while also improving the overall throughput. Our solution, called Augmented Sketch (ASketch), is based on a pre-filtering stage that dynamically identifies and aggregates the most frequent items. Items overflowing the pre-filtering stage are processed using a conventional sketch algorithm, thereby making the solution general and applicable in a wide range of contexts. The pre-filtering stage can be efficiently implemented with SIMD instructions on multi-core machines and can be further parallelized through pipeline parallelism where the filtering stage runs in one core and the sketch algorithm runs in another core.

References

C. Aggarwal and K. Subbian. Event Detection in Social Streams. In SDM, 2012.Google ScholarCross Ref
C. Aggarwal and P. S. Yu. On Classification of High-Cardinality Data Streams. In SDM, 2010.Google ScholarCross Ref
N. Alon, Y. Matias, and M. Szegedy. The Space Complexity of Approximating the Frequency Moments. In STOC, 1996. Google ScholarDigital Library
R. Berinde, P. Indyk, G. Cormode, and M. J. Strauss. Space-optimal Heavy Hitters with Strong Error Bounds. ACM TODS, 35(4):26, 2010. Google ScholarDigital Library
D. Bitton and D. J. DeWitt. Duplicate Record Elimination in Large Data Files. ACM TODS, 8(2):255--265, 1983. Google ScholarDigital Library
B. H. Bloom. Space/time Trade-offs in Hash Coding with Allowable Errors. Comm. of ACM, 13:422--426, 1970. Google ScholarDigital Library
M. Charikar, K. Chen, and M. Farach-Colton. Finding Frequent Items in Data Streams. In ICALP, 2002. Google ScholarDigital Library
G. Cormode, M. Garofalakis, P. J. Haas, and C. Jermaine. Synopses for massive data: Samples, histograms, wavelets, sketches. Foundations and Trends in Databases, 4(1--3):1--294, 2012. Google ScholarDigital Library
G. Cormode and M. Hadjieleftheriou. Finding Frequent Items in Data Streams. In VLDB, 2008. Google ScholarDigital Library
G. Cormode, T. Johnson, F. Korn, S. Muthukrishnan, O. Spatscheck, and D. Srivastava. Holistic UDAFs at Streaming Speeds. In SIGMOD, 2004. Google ScholarDigital Library
G. Cormode and S. Muthukrishnan. An Improved Data-Stream Summary: The Count-min Sketch and its Applications. J. of Algorithms, 55(1), 2005. Google ScholarDigital Library
A. Dobra, M. Garofalakis, J. Gehrke, and R. Rastogi. Processing Complex Aggregate Queries over Data Streams. In SIGMOD, 2002. Google ScholarDigital Library
C. Estan and G. Varghese. New Directions in Traffic Measurement and Accounting: Focusing on the Elephants, Ignoring the Mice. ACM Trans. Comput. Syst., 21(3):270--313, 2003. Google ScholarDigital Library
S. Ganguly, M. Garofalakis, and R. Rastogi. Processing data-stream join aggregates using skimmed sketches. In Advances in Database Technology-EDBT 2004, pages 569--586. Springer, 2004.Google ScholarCross Ref
S. Ganguly, P. B. Gibbons, Y. Matias, and A. Silberschatz. Bifocal Sampling for Skew-Resistant Join Size Estimation. In SIGMOD, pages 271--281, 1996. Google ScholarDigital Library
M. Garofalakis and P. B. Gibbons. Wavelet Synopses with Error Guarantees. In SIGMOD, 2002. Google ScholarDigital Library
A. C. Gilbert, Y. Kotidis, S. Muthukrishnan, and M. Strauss. Surfing Wavelets on Streams: One-Pass Summaries for Approximate Aggregate Queries. In VLDB, 2001. Google ScholarDigital Library
A. C. Gilbert, Y. Kotidis, S. Muthukrishnan, and M. Strauss. How to Summarize the Universe: Dynamic Maintenance of Quantiles. In VLDB, 2002. Google ScholarDigital Library
A. Goyal, H. D. III, and G. Cormode. Sketch Algorithms for Estimating Point Queries in NLP. In EMNLP-CoNLL, 2012. Google ScholarDigital Library
S. Guha, N. Koudas, and K. Shim. Data-streams and Histograms. In STOC, 2001. Google ScholarDigital Library
C. Koch, S. Scherzinger, and M. Schmidt. XML Prefiltering as a String Matching Problem. In ICDE, pages 626--635, 2008. Google ScholarDigital Library
B. Krishnamurthy, S. Sen, Y. Zhang, and Y. Chen. Sketch-based Change Detection: Methods, Evaluation, and Applications. In IMC, 2003. Google ScholarDigital Library
P. Larson. Grouping and Duplicate Elimination: Benefits of Early Aggregation, 1997. Technical Report, MSR-TR-97--36.Google Scholar
T. Liu, Y. Sun, A. X. Liu, L. Guo, and B. Fang. A Prefiltering Approach to Regular Expression Matching for Network Security Systems. In Applied Cryptography and Network Security, pages 363--380, 2012. Google ScholarDigital Library
Y. Lu, A. Montanari, B. Prabhakar, S. Dharmapurikar, and A. Kabbani. Counter braids: a novel counter architecture for per-flow measurement. In SIGMETRICS, 2008. Google ScholarDigital Library
N. Manerikar and T. Palpanas. Frequent Items in Streaming Data: An Experimental Evaluation of the State-of-the-art. Data Knowl. Eng., 68(4):415--430, 2009. Google ScholarDigital Library
A. Metwally, D. Agrawal, and A. E. Abbadi. Efficient Computation of Frequent and Top-k Elements in Data Streams. In ICDT, 2005. Google ScholarDigital Library
J. Misra and D. Gries. Finding repeated elements. Science of computer programming, 2(2):143--152, 1982.Google Scholar
S. Muthukrishnan. Data streams: Algorithms and applications. Now Publishers Inc, 2005.Google ScholarCross Ref
A. Pietracaprina, M. Riondato, E. Upfal, and F. Vandin. Mining top-K Frequent Itemsets Through Progressive Sampling. Data Min. Knowl. Discov., 21(2):310--326, 2010. Google ScholarDigital Library
O. Rottenstreich, Y. Kanizo, and I. Keslassy. The Variable-Increment Counting Bloom Filter. IEEE/ACM Trans. Netw., 22(4):1092--1105, 2014. Google ScholarDigital Library
P. Roy, J. Teubner, and G. Alonso. Efficient Frequent Item Counting in Multi-Core Hardware. In KDD, 2012. Google ScholarDigital Library
F. Rusu and A. Dobra. Sketches for Size of Join Estimation. ACM TODS, 33(3):15, 2008. Google ScholarDigital Library
D. Thomas, R. Bordawekar, C. Aggarwal, and P. S. Yu. On Efficient Query Processing of Stream Counts on the Cell Processor. In ICDE, 2009. Google ScholarDigital Library
W. Yan and Y. X. B. Malin. Scalable Load Balancing for MapReduce-based Record Linkage. In IPCCC, 2013.Google ScholarCross Ref
J. Yu, Z. Chong, H. Lu, and A. Zhou. False Positive or False Negative: Mining Frequent Itemsets from High Speed Transactional Data Streams. In VLDB, 2004. Google ScholarDigital Library
P. Zhao, C. C. Aggarwal, and M. Wang. gSketch: On Query Estimation in Graph Streams. In VLDB, 2012. Google ScholarDigital Library

Index Terms

Augmented Sketch: Faster and More Accurate Stream Processing
1. Information systems
  1. Data management systems
    1. Database design and models
      1. Data model extensions
        Data streams
  2. Information systems applications
    1. Data mining
      1. Data stream mining
2. Theory of computation
  1. Design and analysis of algorithms
    1. Parallel algorithms
      1. Vector / streaming algorithms
    2. Streaming, sublinear and near linear time algorithms
      1. Sketching and sampling

Recommendations

Cold Filter: A Meta-Framework for Faster and More Accurate Stream Processing
SIGMOD '18: Proceedings of the 2018 International Conference on Management of Data

Approximate stream processing algorithms, such as Count-Min sketch, Space-Saving, etc., support numerous applications in databases, storage systems, networking, and other domains. However, the unbalanced distribution in real data streams poses great ...
Read More
XY-Sketch: on Sketching Data Streams at Web Scale
WWW '21: Proceedings of the Web Conference 2021

Conventional sketching methods on counting stream item frequencies use hash functions for mapping data items to a concise structure, e.g., a two-dimensional array, at the expense of overcounting due to hashing collisions. Despite the popularity, ...
Read More
Fast and accurate stream processing by filtering the cold
Abstract
Approximate stream processing algorithms, such as Count-Min sketch, Space-Saving, support numerous applications across multiple areas such as databases, storage systems, and networking. However, the unbalanced distribution in real data streams are ...
Read More

Comments

Login options

Check if you have access through your login credentials or your institution to get full access on this article.

Full Access

Get this Publication

Published in
SIGMOD '16: Proceedings of the 2016 International Conference on Management of Data
June 2016
2300 pages
ISBN:9781450335317
DOI:10.1145/2882903
General Chairs:
Fatma Özcan
IBM Research, USA
,
Georgia Koutrika
HP Labs, USA
,
Program Chair:
Sam Madden
Massachusetts Institute of Technology, USA
Copyright © 2016 ACM
Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. Copyrights for components of this work owned by others than ACM must be honored. Abstracting with credit is permitted. To copy otherwise, or republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee. Request permissions from [email protected]
Sponsors
In-Cooperation
Publisher
Association for Computing Machinery
New York, NY, United States
Publication History
- Published: 14 June 2016
Permissions
Request permissions about this article.
Request Permissions

Check for updates
Author Tags
approximated algorithms
data streams
data structures
sketch
stream summary
Qualifiers
- research-article
Conference

Acceptance Rates
Overall Acceptance Rate785of4,003submissions,20%
Funding Sources
Other Metrics
View Article Metrics

Article Metrics
- 146
  Total Citations
  View Citations
- 1,293
  Total Downloads
- Downloads (Last 12 months)112
- Downloads (Last 6 weeks)12
Other Metrics
View Author Metrics
Cited By
View all

PDF Format

View or Download as a PDF file.

PDF

eReader

View online with eReader.

eReader

Augmented Sketch: Faster and More Accurate Stream Processing

SIGMOD '16: Proceedings of the 2016 International Conference on Management of Data

ABSTRACT

References

Cited By

Index Terms

Recommendations

Cold Filter: A Meta-Framework for Faster and More Accurate Stream Processing

XY-Sketch: on Sketching Data Streams at Web Scale

Fast and accurate stream processing by filtering the cold

Comments

Login options

Full Access

Published in

Sponsors

In-Cooperation

Publisher

Publication History

Permissions

Check for updates

Author Tags

Qualifiers

Conference

Acceptance Rates

Funding Sources

Other Metrics

Article Metrics

Other Metrics

Cited By

PDF Format

eReader

Digital Edition

Caption

Augmented Sketch: Faster and More Accurate Stream Processing

SIGMOD '16: Proceedings of the 2016 International Conference on Management of Data

ABSTRACT

References

Cited By

Index Terms

Recommendations

Cold Filter: A Meta-Framework for Faster and More Accurate Stream Processing

XY-Sketch: on Sketching Data Streams at Web Scale

Fast and accurate stream processing by filtering the cold

Comments

Login options

Full Access

Published in

Sponsors

In-Cooperation

Publisher

Publication History

Permissions

Check for updates

Author Tags

Qualifiers

Conference

Acceptance Rates

Funding Sources

Article Metrics

Other Metrics

PDF Format

eReader

Digital Edition

Share this Publication link

Share on Social Media