Skip to main content
Erschienen in: Computing 6/2020

02.01.2019

Optimizing the confidence bound of count-min sketches to estimate the streaming big data query results more precisely

verfasst von: Ruixin Guo, Erkang Xue, Feng Zhang, Gansen Zhao, Guangzhi Qu

Erschienen in: Computing | Ausgabe 6/2020

Einloggen

Aktivieren Sie unsere intelligente Suche, um passende Fachinhalte oder Patente zu finden.

search-config
loading …

Abstract

A count-min sketch is a probabilistic data structure, which serves as a frequency table of events to process a stream of big data. It uses hash functions to map events to frequencies. Querying a count-min sketch returns the targeted event along with an estimated frequency, which is not less than the actual frequency. The estimated error, i.e., the difference between the estimated frequency and the actual, can be measured by a pre-defined confidence bound. However, the bound originally defined is too loose. The reason is that the Markov inequality used to derive the bound does not perform well. In this paper, based on binomial distribution and central limit theorem, we define a tighter bound. We indicate that the reliability of the bound is related to the deviation of data, which can be measured by the data’s coefficient of standard deviation. Our extensive experiments well support the effectiveness and efficiency of the new bound.

Sie haben noch keine Lizenz? Dann Informieren Sie sich jetzt über unsere Produkte:

Springer Professional "Wirtschaft"

Online-Abonnement

Mit Springer Professional "Wirtschaft" erhalten Sie Zugriff auf:

  • über 67.000 Bücher
  • über 340 Zeitschriften

aus folgenden Fachgebieten:

  • Bauwesen + Immobilien
  • Business IT + Informatik
  • Finance + Banking
  • Management + Führung
  • Marketing + Vertrieb
  • Versicherung + Risiko




Jetzt Wissensvorsprung sichern!

Springer Professional "Technik"

Online-Abonnement

Mit Springer Professional "Technik" erhalten Sie Zugriff auf:

  • über 67.000 Bücher
  • über 390 Zeitschriften

aus folgenden Fachgebieten:

  • Automobil + Motoren
  • Bauwesen + Immobilien
  • Business IT + Informatik
  • Elektrotechnik + Elektronik
  • Energie + Nachhaltigkeit
  • Maschinenbau + Werkstoffe




 

Jetzt Wissensvorsprung sichern!

Springer Professional "Wirtschaft+Technik"

Online-Abonnement

Mit Springer Professional "Wirtschaft+Technik" erhalten Sie Zugriff auf:

  • über 102.000 Bücher
  • über 537 Zeitschriften

aus folgenden Fachgebieten:

  • Automobil + Motoren
  • Bauwesen + Immobilien
  • Business IT + Informatik
  • Elektrotechnik + Elektronik
  • Energie + Nachhaltigkeit
  • Finance + Banking
  • Management + Führung
  • Marketing + Vertrieb
  • Maschinenbau + Werkstoffe
  • Versicherung + Risiko

Jetzt Wissensvorsprung sichern!

Literatur
1.
Zurück zum Zitat Chen D, Wang L, Xiaomin W, Chen J, Khan SU, Koodziej J, Tian M, Huang F, Liu W (2013) Hybrid modelling and simulation of huge crowd over a hierarchical grid architecture. Future Gener Comput Syst 29(5):1309–1317CrossRef Chen D, Wang L, Xiaomin W, Chen J, Khan SU, Koodziej J, Tian M, Huang F, Liu W (2013) Hybrid modelling and simulation of huge crowd over a hierarchical grid architecture. Future Gener Comput Syst 29(5):1309–1317CrossRef
3.
Zurück zum Zitat Cormode G, Muthukrishnan S (2005) An improved data stream summary: the count-min sketch and its applications. J. Algorithms 55(1):58–75MathSciNetCrossRef Cormode G, Muthukrishnan S (2005) An improved data stream summary: the count-min sketch and its applications. J. Algorithms 55(1):58–75MathSciNetCrossRef
4.
Zurück zum Zitat Deng Z, Wu X, Wang L, Chen X, Ranjan R, Zomaya A, Chen D (2015) Parallel processing of dynamic continuous queries over streaming data flows. IEEE Trans Parallel Distrib Syst 26(3):834–846CrossRef Deng Z, Wu X, Wang L, Chen X, Ranjan R, Zomaya A, Chen D (2015) Parallel processing of dynamic continuous queries over streaming data flows. IEEE Trans Parallel Distrib Syst 26(3):834–846CrossRef
5.
Zurück zum Zitat Deng Z, Han W, Wang L, Ranjan R, Zomaya AY, Jie W (2017) An efficient online direction-preserving compression approach for trajectory streaming data. Future Gener Comput Syst 68:150–162CrossRef Deng Z, Han W, Wang L, Ranjan R, Zomaya AY, Jie W (2017) An efficient online direction-preserving compression approach for trajectory streaming data. Future Gener Comput Syst 68:150–162CrossRef
6.
Zurück zum Zitat Dong L, Yao H, Ranjan R, Zhang F, Pan M (2017) Fast lightweight reconfiguration of virtual constellation for obtaining of earth observation big data. Clust Comput 20(3):2299–2310CrossRef Dong L, Yao H, Ranjan R, Zhang F, Pan M (2017) Fast lightweight reconfiguration of virtual constellation for obtaining of earth observation big data. Clust Comput 20(3):2299–2310CrossRef
7.
Zurück zum Zitat Everitt B, Skrondal A (2002) The Cambridge dictionary of statistics, vol 106. Cambridge University Press, CambridgeMATH Everitt B, Skrondal A (2002) The Cambridge dictionary of statistics, vol 106. Cambridge University Press, CambridgeMATH
8.
Zurück zum Zitat Ge Luo L, Wang KY, Cormode G (2016) Quantiles over data streams: experimental comparisons, new analyses, and further improvements. The VLDB J 25(4):449–472CrossRef Ge Luo L, Wang KY, Cormode G (2016) Quantiles over data streams: experimental comparisons, new analyses, and further improvements. The VLDB J 25(4):449–472CrossRef
9.
Zurück zum Zitat Goyal A, Jagarlamudi J, Daumé III, Hal VS (2010) Sketch techniques for scaling distributional similarity to the web. In: Proceedings of the 2010 workshop on geometrical models of natural language semantics, Association for Computational Linguistics, pp 51–56 Goyal A, Jagarlamudi J, Daumé III, Hal VS (2010) Sketch techniques for scaling distributional similarity to the web. In: Proceedings of the 2010 workshop on geometrical models of natural language semantics, Association for Computational Linguistics, pp 51–56
10.
Zurück zum Zitat He Z, Chonglong W, Liu G, Zheng Z, Tian Y (2015) Decomposition tree: a spatio-temporal indexing method for movement big data. Clust Comput 18(4):1481–1492CrossRef He Z, Chonglong W, Liu G, Zheng Z, Tian Y (2015) Decomposition tree: a spatio-temporal indexing method for movement big data. Clust Comput 18(4):1481–1492CrossRef
11.
Zurück zum Zitat Ippoliti D, Jiang C, Ding Z, Zhou X (2016) Online adaptive anomaly detection for augmented network flows. ACM Trans Auton Adapt Syst (TAAS) 11(3):17 Ippoliti D, Jiang C, Ding Z, Zhou X (2016) Online adaptive anomaly detection for augmented network flows. ACM Trans Auton Adapt Syst (TAAS) 11(3):17
12.
Zurück zum Zitat Khoshkbarforoushha A, Ranjan R, Gaire R, Abbasnejad E, Wang L, Zomaya AY (2017) Distribution based workload modelling of continuous queries in clouds. IEEE Trans Emerg Top Comput 5(1):120–133CrossRef Khoshkbarforoushha A, Ranjan R, Gaire R, Abbasnejad E, Wang L, Zomaya AY (2017) Distribution based workload modelling of continuous queries in clouds. IEEE Trans Emerg Top Comput 5(1):120–133CrossRef
13.
Zurück zum Zitat Leon-Garcia A (2008) Probability, statistics, and random processes for electrical engineering, 3rd edn. Pearson, London Leon-Garcia A (2008) Probability, statistics, and random processes for electrical engineering, 3rd edn. Pearson, London
14.
Zurück zum Zitat Li H, Huang H (2005) New estimation methods of count-min sketch. In: Research issues in data engineering: stream data mining and applications, 2005. RIDE-SDMA 2005. 15th international workshop on, IEEE, pp 73–80 Li H, Huang H (2005) New estimation methods of count-min sketch. In: Research issues in data engineering: stream data mining and applications, 2005. RIDE-SDMA 2005. 15th international workshop on, IEEE, pp 73–80
15.
Zurück zum Zitat Liu H, Sun Y, Kim MS (2011) Fine-grained ddos detection scheme based on bidirectional count sketch. In: Computer communications and networks (ICCCN), 2011 proceedings of 20th international conference on, IEEE, pp 1–6 Liu H, Sun Y, Kim MS (2011) Fine-grained ddos detection scheme based on bidirectional count sketch. In: Computer communications and networks (ICCCN), 2011 proceedings of 20th international conference on, IEEE, pp 1–6
16.
Zurück zum Zitat Minton GT, Price E (2014) Improved concentration bounds for count-sketch. In: Proceedings of the twenty-fifth annual ACM-SIAM symposium on discrete algorithms, society for industrial and applied mathematics, pp 669–686 Minton GT, Price E (2014) Improved concentration bounds for count-sketch. In: Proceedings of the twenty-fifth annual ACM-SIAM symposium on discrete algorithms, society for industrial and applied mathematics, pp 669–686
17.
Zurück zum Zitat Mood AMF (1950) Introduction to the theory of statistics. McGraw-hill, NYMATH Mood AMF (1950) Introduction to the theory of statistics. McGraw-hill, NYMATH
18.
Zurück zum Zitat Papapetrou O, Garofalakis M, Deligiannakis A (2015) Sketching distributed sliding-window data streams. The VLDB J 24(3):345–368CrossRef Papapetrou O, Garofalakis M, Deligiannakis A (2015) Sketching distributed sliding-window data streams. The VLDB J 24(3):345–368CrossRef
19.
Zurück zum Zitat Perera C, Ranjan R, Wang L, Khan SU, Zomaya AY (2015) Big data privacy in the internet of things era. IT Prof 17(3):32–39CrossRef Perera C, Ranjan R, Wang L, Khan SU, Zomaya AY (2015) Big data privacy in the internet of things era. IT Prof 17(3):32–39CrossRef
21.
Zurück zum Zitat Ranjan R, Wang L, Zomaya AY, Tao J, Jayaraman PP, Georgakopoulos D (2016) Advances in methods and techniques for processing streaming big data in datacentre clouds. IEEE Trans Emerg Top Comput 4(2):262–265CrossRef Ranjan R, Wang L, Zomaya AY, Tao J, Jayaraman PP, Georgakopoulos D (2016) Advances in methods and techniques for processing streaming big data in datacentre clouds. IEEE Trans Emerg Top Comput 4(2):262–265CrossRef
22.
Zurück zum Zitat Rottenstreich O, Kanizo Y, Keslassy I (2014) The variable-increment counting bloom filter. IEEE/ACM Trans Netw 22(4):1092–1105CrossRef Rottenstreich O, Kanizo Y, Keslassy I (2014) The variable-increment counting bloom filter. IEEE/ACM Trans Netw 22(4):1092–1105CrossRef
23.
Zurück zum Zitat Rusu F, Dobra A (2008) Sketches for size of join estimation. ACM Trans Database Syst (TODS) 33(3):15CrossRef Rusu F, Dobra A (2008) Sketches for size of join estimation. ACM Trans Database Syst (TODS) 33(3):15CrossRef
24.
Zurück zum Zitat Schechter S, Herley C, Mitzenmacher M (2010) Popularity is everything: a new approach to protecting passwords from statistical-guessing attacks. In: Proceedings of the 5th USENIX conference on Hot topics in security, USENIX Association, pp 1–8 Schechter S, Herley C, Mitzenmacher M (2010) Popularity is everything: a new approach to protecting passwords from statistical-guessing attacks. In: Proceedings of the 5th USENIX conference on Hot topics in security, USENIX Association, pp 1–8
25.
Zurück zum Zitat Tong D, Prasanna V (2016) High throughput sketch based online heavy hitter detection on fpga. ACM SIGARCH Comput Archit N 43(4):70–75CrossRef Tong D, Prasanna V (2016) High throughput sketch based online heavy hitter detection on fpga. ACM SIGARCH Comput Archit N 43(4):70–75CrossRef
26.
Zurück zum Zitat Wang L, Ranjan R (2015) Processing distributed internet of things data in clouds. IEEE Cloud Comput 2(1):76–80CrossRef Wang L, Ranjan R (2015) Processing distributed internet of things data in clouds. IEEE Cloud Comput 2(1):76–80CrossRef
27.
Zurück zum Zitat Yang Y, Zhu J (2016) Write skew and zipf distribution: evidence and implications. ACM Trans Storage (TOS) 12(4):21MathSciNet Yang Y, Zhu J (2016) Write skew and zipf distribution: evidence and implications. ACM Trans Storage (TOS) 12(4):21MathSciNet
28.
Zurück zum Zitat Zhang F, Gong T, Lee VE, Zhao G, Rong C, Guangzhi Q (2016) Fast algorithms to evaluate collaborative filtering recommender systems. Knowl-Based Syst 96(3):96–103 Zhang F, Gong T, Lee VE, Zhao G, Rong C, Guangzhi Q (2016) Fast algorithms to evaluate collaborative filtering recommender systems. Knowl-Based Syst 96(3):96–103
29.
Zurück zum Zitat Zhang F, Lee VE, Raymond Choo K-K (2018) Jo-dpmf: differentially private matrix factorization learning through joint optimization. Inf Sci 467(10):271–281MathSciNet Zhang F, Lee VE, Raymond Choo K-K (2018) Jo-dpmf: differentially private matrix factorization learning through joint optimization. Inf Sci 467(10):271–281MathSciNet
Metadaten
Titel
Optimizing the confidence bound of count-min sketches to estimate the streaming big data query results more precisely
verfasst von
Ruixin Guo
Erkang Xue
Feng Zhang
Gansen Zhao
Guangzhi Qu
Publikationsdatum
02.01.2019
Verlag
Springer Vienna
Erschienen in
Computing / Ausgabe 6/2020
Print ISSN: 0010-485X
Elektronische ISSN: 1436-5057
DOI
https://doi.org/10.1007/s00607-018-00695-z

Weitere Artikel der Ausgabe 6/2020

Computing 6/2020 Zur Ausgabe