Skip to main content

2019 | OriginalPaper | Buchkapitel

Parallel Streaming Random Sampling

verfasst von : Kanat Tangwongsan, Srikanta Tirthapura

Erschienen in: Euro-Par 2019: Parallel Processing

Verlag: Springer International Publishing

Aktivieren Sie unsere intelligente Suche, um passende Fachinhalte oder Patente zu finden.

search-config
loading …

Abstract

This paper investigates parallel random sampling from a potentially-unending data stream whose elements are revealed in a series of element sequences (minibatches). While sampling from a stream was extensively studied sequentially, not much has been explored in the parallel context, with prior parallel random-sampling algorithms focusing on the static batch model. We present parallel algorithms for minibatch-stream sampling in two settings: (1) sliding window, which draws samples from a prespecified number of most-recently observed elements, and (2) infinite window, which draws samples from all the elements received. Our algorithms are computationally and memory efficient: their work matches the fastest sequential counterpart, their parallel depth is small (polylogarithmic), and their memory usage matches the best known.

Sie haben noch keine Lizenz? Dann Informieren Sie sich jetzt über unsere Produkte:

Springer Professional "Wirtschaft+Technik"

Online-Abonnement

Mit Springer Professional "Wirtschaft+Technik" erhalten Sie Zugriff auf:

  • über 102.000 Bücher
  • über 537 Zeitschriften

aus folgenden Fachgebieten:

  • Automobil + Motoren
  • Bauwesen + Immobilien
  • Business IT + Informatik
  • Elektrotechnik + Elektronik
  • Energie + Nachhaltigkeit
  • Finance + Banking
  • Management + Führung
  • Marketing + Vertrieb
  • Maschinenbau + Werkstoffe
  • Versicherung + Risiko

Jetzt Wissensvorsprung sichern!

Springer Professional "Technik"

Online-Abonnement

Mit Springer Professional "Technik" erhalten Sie Zugriff auf:

  • über 67.000 Bücher
  • über 390 Zeitschriften

aus folgenden Fachgebieten:

  • Automobil + Motoren
  • Bauwesen + Immobilien
  • Business IT + Informatik
  • Elektrotechnik + Elektronik
  • Energie + Nachhaltigkeit
  • Maschinenbau + Werkstoffe




 

Jetzt Wissensvorsprung sichern!

Springer Professional "Wirtschaft"

Online-Abonnement

Mit Springer Professional "Wirtschaft" erhalten Sie Zugriff auf:

  • über 67.000 Bücher
  • über 340 Zeitschriften

aus folgenden Fachgebieten:

  • Bauwesen + Immobilien
  • Business IT + Informatik
  • Finance + Banking
  • Management + Führung
  • Marketing + Vertrieb
  • Versicherung + Risiko




Jetzt Wissensvorsprung sichern!

Fußnoten
1
One could also consider a window to be the w most recent minibatches, and similar techniques are expected to work.
 
2
The original algorithm stores the largest priorities but is equivalent to our view.
 
3
Because \(|X| \ge s\), the function \(\nu \) is always defined.
 
Literatur
1.
Zurück zum Zitat Babcock, B., Datar, M., Motwani, R.: Sampling from a moving window over streaming data. In: Proceedings of the Annual ACM-SIAM Symposium on Discrete algorithms (SODA), pp. 633–634 (2002) Babcock, B., Datar, M., Motwani, R.: Sampling from a moving window over streaming data. In: Proceedings of the Annual ACM-SIAM Symposium on Discrete algorithms (SODA), pp. 633–634 (2002)
2.
Zurück zum Zitat Blelloch, G.E., Maggs, B.M.: Chapter 10: parallel algorithms. In: The Computer Science and Engineering Handbook, 2nd edn. Chapman and Hall/CRC (2004) Blelloch, G.E., Maggs, B.M.: Chapter 10: parallel algorithms. In: The Computer Science and Engineering Handbook, 2nd edn. Chapman and Hall/CRC (2004)
3.
Zurück zum Zitat Braverman, V., Ostrovsky, R., Zaniolo, C.: Optimal sampling from sliding windows. In: Proceedings of the ACM SIGMOD-SIGACT-SIGART Symposium on Principles of Database Systems (PODS), pp. 147–156 (2009) Braverman, V., Ostrovsky, R., Zaniolo, C.: Optimal sampling from sliding windows. In: Proceedings of the ACM SIGMOD-SIGACT-SIGART Symposium on Principles of Database Systems (PODS), pp. 147–156 (2009)
4.
Zurück zum Zitat Brown, P.G., Haas, P.J.: Techniques for warehousing of sample data. In: Proceedings of the International Conference on Data Engineering (ICDE), p. 6 (2006) Brown, P.G., Haas, P.J.: Techniques for warehousing of sample data. In: Proceedings of the International Conference on Data Engineering (ICDE), p. 6 (2006)
5.
Zurück zum Zitat Chung, Y., Tirthapura, S., Woodruff, D.P.: A simple message-optimal algorithm for random sampling from a distributed stream. IEEE Trans. Knowl. Data Eng. (TKDE) 28(6), 1356–1368 (2016)CrossRef Chung, Y., Tirthapura, S., Woodruff, D.P.: A simple message-optimal algorithm for random sampling from a distributed stream. IEEE Trans. Knowl. Data Eng. (TKDE) 28(6), 1356–1368 (2016)CrossRef
6.
Zurück zum Zitat Cormode, G.: The continuous distributed monitoring model. SIGMOD Rec. 42(1), 5–14 (2013)CrossRef Cormode, G.: The continuous distributed monitoring model. SIGMOD Rec. 42(1), 5–14 (2013)CrossRef
7.
Zurück zum Zitat Cormode, G., Muthukrishnan, S., Yi, K., Zhang, Q.: Continuous sampling from distributed streams. J. ACM 59(2), 10:1–10:25 (2012)MathSciNetCrossRef Cormode, G., Muthukrishnan, S., Yi, K., Zhang, Q.: Continuous sampling from distributed streams. J. ACM 59(2), 10:1–10:25 (2012)MathSciNetCrossRef
8.
Zurück zum Zitat Das, S., Antony, S., Agrawal, D., El Abbadi, A.: Thread cooperation in multicore architectures for frequency counting over multiple data streams. Proc. VLDB Endow. (PVLDB) 2(1), 217–228 (2009)CrossRef Das, S., Antony, S., Agrawal, D., El Abbadi, A.: Thread cooperation in multicore architectures for frequency counting over multiple data streams. Proc. VLDB Endow. (PVLDB) 2(1), 217–228 (2009)CrossRef
9.
Zurück zum Zitat Efraimidis, P.S., Spirakis, P.G.: Weighted random sampling with a reservoir. Inf. Process. Lett. 97(5), 181–185 (2006)MathSciNetCrossRef Efraimidis, P.S., Spirakis, P.G.: Weighted random sampling with a reservoir. Inf. Process. Lett. 97(5), 181–185 (2006)MathSciNetCrossRef
10.
Zurück zum Zitat Gemulla, R., Lehner, W.: Sampling time-based sliding windows in bounded space. In: Proceedings of the International Conference on Management of Data (SIGMOD), pp. 379–392 (2008) Gemulla, R., Lehner, W.: Sampling time-based sliding windows in bounded space. In: Proceedings of the International Conference on Management of Data (SIGMOD), pp. 379–392 (2008)
11.
Zurück zum Zitat Gibbons, P., Tirthapura, S.: Estimating simple functions on the union of data streams. In: Proceedings of the ACM Symposium on Parallelism in Algorithms and Architectures (SPAA), pp. 281–291 (2001) Gibbons, P., Tirthapura, S.: Estimating simple functions on the union of data streams. In: Proceedings of the ACM Symposium on Parallelism in Algorithms and Architectures (SPAA), pp. 281–291 (2001)
12.
Zurück zum Zitat Gu, Y., Shun, J., Sun, Y., Blelloch, G.E.: A top-down parallel semisort. In: Proceedings of the ACM Symposium on Parallelism in Algorithms and Architectures (SPAA), pp. 24–34 (2015) Gu, Y., Shun, J., Sun, Y., Blelloch, G.E.: A top-down parallel semisort. In: Proceedings of the ACM Symposium on Parallelism in Algorithms and Architectures (SPAA), pp. 24–34 (2015)
13.
Zurück zum Zitat Johnson, T., Shkapenyuk, V.: Data stream warehousing in tidalrace. In: Proceedingsof the Conference on Innovative Data Systems Research (CIDR) (2015) Johnson, T., Shkapenyuk, V.: Data stream warehousing in tidalrace. In: Proceedingsof the Conference on Innovative Data Systems Research (CIDR) (2015)
14.
Zurück zum Zitat Reif, J.H.: An optimal parallel algorithm for integer sorting. In: Proceedings of the IEEE Annual Symposium on Foundations of Computer Science (FOCS), pp. 496–504 (1985) Reif, J.H.: An optimal parallel algorithm for integer sorting. In: Proceedings of the IEEE Annual Symposium on Foundations of Computer Science (FOCS), pp. 496–504 (1985)
15.
Zurück zum Zitat Ross, S.M.: Introduction to Probability Models, 10th edn. Academic Press, Cambridge (2009) Ross, S.M.: Introduction to Probability Models, 10th edn. Academic Press, Cambridge (2009)
16.
Zurück zum Zitat Sanders, P., Lamm, S., Hübschle-Schneider, L., Schrade, E., Dachsbacher, C.: Efficient parallel random sampling - vectorized, cache-efficient, and online. ACM Trans. Math. Softw. 44(3), 29:1–29:14 (2018)MathSciNetCrossRef Sanders, P., Lamm, S., Hübschle-Schneider, L., Schrade, E., Dachsbacher, C.: Efficient parallel random sampling - vectorized, cache-efficient, and online. ACM Trans. Math. Softw. 44(3), 29:1–29:14 (2018)MathSciNetCrossRef
17.
Zurück zum Zitat Tangwongsan, K., Pavan, A., Tirthapura, S.: Parallel triangle counting in massive streaming graphs. In: Proceedings of the ACM International Conference on Information and Knowledge Management (CIKM), pp. 781–786 (2013) Tangwongsan, K., Pavan, A., Tirthapura, S.: Parallel triangle counting in massive streaming graphs. In: Proceedings of the ACM International Conference on Information and Knowledge Management (CIKM), pp. 781–786 (2013)
19.
Zurück zum Zitat Tangwongsan, K., Tirthapura, S., Wu, K.: Parallel streaming frequency-based aggregates. In: Proceedings of the ACM Symposium on Parallelism in Algorithms and Architectures (SPAA), pp. 236–245 (2014) Tangwongsan, K., Tirthapura, S., Wu, K.: Parallel streaming frequency-based aggregates. In: Proceedings of the ACM Symposium on Parallelism in Algorithms and Architectures (SPAA), pp. 236–245 (2014)
21.
Zurück zum Zitat Xu, B., Tirthapura, S., Busch, C.: Sketching asynchronous data streams over sliding windows. Distrib. Comput. 20(5), 359–374 (2008)CrossRef Xu, B., Tirthapura, S., Busch, C.: Sketching asynchronous data streams over sliding windows. Distrib. Comput. 20(5), 359–374 (2008)CrossRef
22.
Zurück zum Zitat Zaharia, M., Das, T., Li, H., Hunter, T., Shenker, S., Stoica, I.: Discretized streams: fault-tolerant streaming computation at scale. In: Proceedings of the ACM Symposium on Operating Systems Principles (SOSP), pp. 423–438 (2013) Zaharia, M., Das, T., Li, H., Hunter, T., Shenker, S., Stoica, I.: Discretized streams: fault-tolerant streaming computation at scale. In: Proceedings of the ACM Symposium on Operating Systems Principles (SOSP), pp. 423–438 (2013)
Metadaten
Titel
Parallel Streaming Random Sampling
verfasst von
Kanat Tangwongsan
Srikanta Tirthapura
Copyright-Jahr
2019
DOI
https://doi.org/10.1007/978-3-030-29400-7_32