Published in: Cluster Computing 1/2019

19.12.2017

An iterative sampling method for online aggregation

Written by: Zhiqiang Zhang, Jianghua Hu, Xiaoqin Xie, Haiwei Pan, Xiaoning Feng


Abstract

Online aggregation (OLA) saves cost by returning acceptable approximate answers early. Compared with exact results, approximate results are cheaper to compute, especially for large-scale datasets. The user can terminate processing at any time, once satisfied with the quality of the result. The performance of OLA depends on the sampling approach and the estimation model. In a large-scale distributed computing environment, however, realizing OLA efficiently is a challenging problem. In this paper, we consider the problem of providing OLA in a distributed computing environment and propose a Hadoop-based iterative sampling method for online aggregation. The user's desired precision can be met with two sampling iterations. To avoid the effects of data bias, we propose a “layered sampling” method to ensure that the approximate aggregation result is statistically meaningful. The experimental results show that the “layered sampling” method accounts not only for time efficiency but also for the usage of Hadoop's computing and storage resources.
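
The abstract does not spell out the estimator, so the following is only a minimal single-machine sketch of the general idea of meeting a target precision in two sampling rounds. It is not the authors' Hadoop implementation and does not cover the “layered sampling” step. The function name `two_round_mean`, its parameters, and the normal-approximation confidence interval are all assumptions made for illustration: a small pilot sample gives a rough mean and variance, these determine how many records the second round needs, and the final estimate is reported with its error bound.

```python
import math
import random
import statistics


def two_round_mean(data, pilot_size, rel_error, z=1.96, seed=0):
    """Illustrative two-round estimate of the mean of `data`.

    Hypothetical helper, not taken from the paper. Returns a tuple
    (estimate, half_width, sample_size), where half_width is the
    normal-approximation confidence-interval half-width.
    """
    rng = random.Random(seed)
    shuffled = list(data)
    rng.shuffle(shuffled)  # a random permutation: any prefix is a simple random sample

    # Round 1: a small pilot sample gives a rough mean and variance.
    pilot = shuffled[:pilot_size]
    pilot_mean = statistics.fmean(pilot)
    pilot_var = statistics.variance(pilot)

    # Sample size n such that z * sqrt(var / n) <= rel_error * |mean|.
    target = rel_error * abs(pilot_mean)
    if target > 0:
        needed = math.ceil(z * z * pilot_var / (target * target))
    else:
        needed = len(shuffled)  # a relative error is undefined around zero; scan everything
    needed = min(max(needed, pilot_size), len(shuffled))

    # Round 2: extend the pilot sample to the required size and re-estimate.
    sample = shuffled[:needed]
    estimate = statistics.fmean(sample)
    half_width = z * math.sqrt(statistics.variance(sample) / needed)
    return estimate, half_width, needed
```

For example, `two_round_mean(values, pilot_size=200, rel_error=0.05)` would aim for roughly a 5% relative error at about 95% confidence, under the assumption that the sample mean is approximately normally distributed.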


Metadata
Title
An iterative sampling method for online aggregation
Written by
Zhiqiang Zhang
Jianghua Hu
Xiaoqin Xie
Haiwei Pan
Xiaoning Feng
Publication date
19.12.2017
Publisher
Springer US
Published in
Cluster Computing / Special Issue 1/2019
Print ISSN: 1386-7857
Electronic ISSN: 1573-7543
DOI
https://doi.org/10.1007/s10586-017-1451-x
