2016 | OriginalPaper | Book chapter

ResilientStore: A Heuristic-Based Data Format Selector for Intermediate Results

Authors: Rana Faisal Munir, Oscar Romero, Alberto Abelló, Besim Bilalli, Maik Thiele, Wolfgang Lehner

Published in: Model and Data Engineering

Publisher: Springer International Publishing


Abstract

Large-scale data analysis is an important activity in many organizations and typically requires the deployment of data-intensive workflows. As data is processed, these workflows generate large intermediate results, which are usually pipelined from one operator to the next. If materialized, however, these results become reusable, so subsequent workflows need not recompute them. Many existing solutions materialize intermediate results, but all of them assume a fixed data format. A fixed format may not be optimal for every situation. For example, it is well known that different data fragmentation strategies (e.g., horizontal and vertical) perform better or worse depending on the access patterns of the subsequent operations. In this paper, we present ResilientStore, which assists in selecting the most appropriate data format for materializing intermediate results. Given a workflow and a set of materialization points, it uses rule-based heuristics to choose the best storage data format based on subsequent access patterns. We have implemented ResilientStore for HDFS and three different data formats: SequenceFile, Parquet and Avro. Experimental results show that our solution gives 18% better performance than any solution based on a single fixed format.
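
The idea of choosing a format from subsequent access patterns can be illustrated with a minimal sketch. The abstract does not state the actual rules used by ResilientStore, so everything below is an assumption made for illustration: the `Consumer` type, the `choose_format` function, the 0.5 threshold, and the rules themselves are hypothetical. The sketch relies only on the general fact that Parquet is a columnar (vertical) layout, while Avro and SequenceFile store whole records (horizontal).

```python
from dataclasses import dataclass
from enum import Enum


class Format(Enum):
    SEQUENCE_FILE = "SequenceFile"
    PARQUET = "Parquet"
    AVRO = "Avro"


@dataclass(frozen=True)
class Consumer:
    """Hypothetical summary of one operator that reads a materialized result."""
    columns_read: int   # number of columns the operator actually accesses
    total_columns: int  # number of columns in the result (assumed > 0)


def choose_format(consumers, projection_threshold=0.5):
    """Pick a storage format for one materialization point.

    Illustrative rules only (not the rules from the paper):
      1. No known consumers         -> SequenceFile (arbitrary default).
      2. Consumers read, on average,
         a small fraction of columns -> Parquet (columnar layout).
      3. Otherwise                   -> Avro (row-oriented layout).
    """
    if not consumers:
        return Format.SEQUENCE_FILE

    # Fraction of columns each subsequent operator reads.
    fractions = [c.columns_read / c.total_columns for c in consumers]
    average_fraction = sum(fractions) / len(fractions)

    if average_fraction < projection_threshold:
        return Format.PARQUET
    return Format.AVRO


if __name__ == "__main__":
    # One consumer projecting 2 of 20 columns favors a columnar format.
    print(choose_format([Consumer(2, 20)]).value)
    # Consumers reading nearly all columns favor a row-oriented format.
    print(choose_format([Consumer(20, 20), Consumer(18, 20)]).value)
```

A real selector would presumably consider more than the projected column fraction (for instance selections or full scans), but a small rule table of this kind shows how a format decision can be derived per materialization point rather than fixed globally.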

Metadata
Title
ResilientStore: A Heuristic-Based Data Format Selector for Intermediate Results
Authors
Rana Faisal Munir
Oscar Romero
Alberto Abelló
Besim Bilalli
Maik Thiele
Wolfgang Lehner
Copyright year
2016
DOI
https://doi.org/10.1007/978-3-319-45547-1_4
