Skip to main content
Erschienen in: The VLDB Journal 3/2014

01.06.2014 | Regular Paper

Towards zero-overhead static and adaptive indexing in Hadoop

verfasst von: Stefan Richter, Jorge-Arnulfo Quiané-Ruiz, Stefan Schuh, Jens Dittrich

Erschienen in: The VLDB Journal | Ausgabe 3/2014

Einloggen

Aktivieren Sie unsere intelligente Suche, um passende Fachinhalte oder Patente zu finden.

search-config
loading …

Abstract

Hadoop MapReduce has evolved to an important industry standard for massive parallel data processing and has become widely adopted for a variety of use-cases. Recent works have shown that indexes can improve the performance of selective MapReduce jobs dramatically. However, one major weakness of existing approaches is high index creation costs. We present HAIL (Hadoop Aggressive Indexing Library), a novel indexing approach for HDFS and Hadoop MapReduce. HAIL creates different clustered indexes over terabytes of data with minimal, often invisible costs, and it dramatically improves runtimes of several classes of MapReduce jobs. HAIL features two different indexing pipelines, static indexing and adaptive indexing. HAIL static indexing efficiently indexes datasets while uploading them to HDFS. Thereby, HAIL leverages the default replication of Hadoop and enhances it with logical replication. This allows HAIL to create multiple clustered indexes for a dataset, e.g., one for each physical replica. Still, in terms of upload time, HAIL matches or even improves over the performance of standard HDFS. Additionally, HAIL adaptive indexing allows for automatic, incremental indexing at job runtime with minimal runtime overhead. For example, HAIL adaptive indexing can completely index a dataset as byproduct of only four MapReduce jobs while incurring an overhead as low as 11 % for the very first of those job only. In our experiments, we show that HAIL improves job runtimes by up to 68\(\times \) over Hadoop. This article is an extended version of the VLDB 2012 paper (Dittrich et al. in PVLDB 5(11):1591–1602, 2012).

Sie haben noch keine Lizenz? Dann Informieren Sie sich jetzt über unsere Produkte:

Springer Professional "Wirtschaft+Technik"

Online-Abonnement

Mit Springer Professional "Wirtschaft+Technik" erhalten Sie Zugriff auf:

  • über 102.000 Bücher
  • über 537 Zeitschriften

aus folgenden Fachgebieten:

  • Automobil + Motoren
  • Bauwesen + Immobilien
  • Business IT + Informatik
  • Elektrotechnik + Elektronik
  • Energie + Nachhaltigkeit
  • Finance + Banking
  • Management + Führung
  • Marketing + Vertrieb
  • Maschinenbau + Werkstoffe
  • Versicherung + Risiko

Jetzt Wissensvorsprung sichern!

Springer Professional "Technik"

Online-Abonnement

Mit Springer Professional "Technik" erhalten Sie Zugriff auf:

  • über 67.000 Bücher
  • über 390 Zeitschriften

aus folgenden Fachgebieten:

  • Automobil + Motoren
  • Bauwesen + Immobilien
  • Business IT + Informatik
  • Elektrotechnik + Elektronik
  • Energie + Nachhaltigkeit
  • Maschinenbau + Werkstoffe




 

Jetzt Wissensvorsprung sichern!

Springer Professional "Wirtschaft"

Online-Abonnement

Mit Springer Professional "Wirtschaft" erhalten Sie Zugriff auf:

  • über 67.000 Bücher
  • über 340 Zeitschriften

aus folgenden Fachgebieten:

  • Bauwesen + Immobilien
  • Business IT + Informatik
  • Finance + Banking
  • Management + Führung
  • Marketing + Vertrieb
  • Versicherung + Risiko




Jetzt Wissensvorsprung sichern!

Fußnoten
1
A simple example of such a use-case would be a distributed grep.
 
2
The professor is aware that for some situations, the opposite is true.
 
3
Actually, it is a split. The difference does not matter here. We will get back to this in Sect. 4.2.
 
4
Alternatively, HAIL can also suggest an appropriate schema to users through schema analysis.
 
5
Alternatively, HAIL allows Bob to specify the selection predicate and the projected attributes in the job configuration class.
 
6
A Hadoop instance responsible to execute map and reduce tasks.
 
7
That was obtained from the HAILInputFormat via getSplits().
 
8
Notice that, all map tasks (even from different MapReduce jobs) running on the same node interact with the same AdaptiveIndexer instance.
Hence, the AdaptiveIndexer can end up by indexing data blocks from different MapReduce jobs at the same time.
 
9
It is worth noting that \(T_{idxOverhead}\) denotes only the additional runtime that a MapReduce job has due to adaptive indexing.
 
10
For this cluster type, we allocate an additional large node to run the namenode and jobtracker.
 
11
This is the time a map task takes to read and process its input.
 
12
Recall that, this query projects all attributes, which is indeed more beneficial for Hadoop++ as it uses a row layout.
 
13
Although HAIL is still indexing further blocks.
 
Literatur
1.
Zurück zum Zitat Abouzied, A., Abadi, D. J., Silberschatz, A.: Invisible loading: access-driven data transfer from raw Files into database systems. In: EDBT, pp. 1–10 (2013) Abouzied, A., Abadi, D. J., Silberschatz, A.: Invisible loading: access-driven data transfer from raw Files into database systems. In: EDBT, pp. 1–10 (2013)
2.
Zurück zum Zitat Agrawal, S., et al.: Database tuning advisor for Microsoft SQL server 2005. In: VLDB, pp. 1110–1121 (2004) Agrawal, S., et al.: Database tuning advisor for Microsoft SQL server 2005. In: VLDB, pp. 1110–1121 (2004)
3.
Zurück zum Zitat Ailamaki, A., et al.: Weaving relations for Cache performance. In: VLDB, pp. 169–180 (2001) Ailamaki, A., et al.: Weaving relations for Cache performance. In: VLDB, pp. 169–180 (2001)
4.
Zurück zum Zitat Alagiannis, I., Borovica, R., Branco, M., Idreos, S., Ailamaki, A.: NoDB: Efficient query execution on raw data files. In: SIGMOD Conference, pp. 241–252 (2012) Alagiannis, I., Borovica, R., Branco, M., Idreos, S., Ailamaki, A.: NoDB: Efficient query execution on raw data files. In: SIGMOD Conference, pp. 241–252 (2012)
5.
Zurück zum Zitat Blanas, S., et al.: A comparison of join algorithms for log processing in MapReduce. In: SIGMOD, pp. 975–986 (2010) Blanas, S., et al.: A comparison of join algorithms for log processing in MapReduce. In: SIGMOD, pp. 975–986 (2010)
6.
Zurück zum Zitat Bruno, N., Chaudhuri, S.: To tune or not to tune? A lightweight physical design alerter. In: VLDB, pp. 499–510 (2006) Bruno, N., Chaudhuri, S.: To tune or not to tune? A lightweight physical design alerter. In: VLDB, pp. 499–510 (2006)
7.
Zurück zum Zitat Bruno, N., Chaudhuri, S.: An online approach to physical design tuning. In: ICDE, pp. 826–835 (2007) Bruno, N., Chaudhuri, S.: An online approach to physical design tuning. In: ICDE, pp. 826–835 (2007)
8.
Zurück zum Zitat Bruno, N., Chaudhuri, S.: Physical design refinement: the merge-reduce approach. ACM TODS 32(4), 28:1–28:41 (2007) Bruno, N., Chaudhuri, S.: Physical design refinement: the merge-reduce approach. ACM TODS 32(4), 28:1–28:41 (2007)
9.
Zurück zum Zitat Cafarella, M.J., Ré, C.: Manimal: relational optimization for data-intensive programs. In: WebDB (2010) Cafarella, M.J., Ré, C.: Manimal: relational optimization for data-intensive programs. In: WebDB (2010)
10.
Zurück zum Zitat Chaudhuri, S., Narasayya, V.R.: An efficient cost-driven index selection tool for Microsoft SQL server. In: VLDB, pp. 146–155 (1997) Chaudhuri, S., Narasayya, V.R.: An efficient cost-driven index selection tool for Microsoft SQL server. In: VLDB, pp. 146–155 (1997)
11.
Zurück zum Zitat Chaudhuri, S., Narasayya, V.R.: Self-tuning database systems: a decade of progress. In: VLDB, pp. 3–14 (2007) Chaudhuri, S., Narasayya, V.R.: Self-tuning database systems: a decade of progress. In: VLDB, pp. 3–14 (2007)
12.
Zurück zum Zitat Chen, S.: Cheetah: a high performance, custom data warehouse on top of MapReduce. PVLDB 3(1—-2), 1459–1468 (2010) Chen, S.: Cheetah: a high performance, custom data warehouse on top of MapReduce. PVLDB 3(1—-2), 1459–1468 (2010)
13.
Zurück zum Zitat Dean, J., Ghemawat, S.: MapReduce: a flexible data processing tool. CACM 53(1), 72–77 (2010)CrossRef Dean, J., Ghemawat, S.: MapReduce: a flexible data processing tool. CACM 53(1), 72–77 (2010)CrossRef
14.
Zurück zum Zitat Dittrich, J., Quiané-Ruiz, J.-A.: Efficient parallel data processing in MapReduce workflows. PVLDB 5, 2014–2015 (2012) Dittrich, J., Quiané-Ruiz, J.-A.: Efficient parallel data processing in MapReduce workflows. PVLDB 5, 2014–2015 (2012)
15.
Zurück zum Zitat Dittrich, J., Quiané-Ruiz, J.-A., Jindal, A., Kargin, Y., Setty, V., Schad, J.: Hadoop++: making a yellow elephant run like a cheetah (without it even noticing). PVLDB 3(1), 518–529 (2010) Dittrich, J., Quiané-Ruiz, J.-A., Jindal, A., Kargin, Y., Setty, V., Schad, J.: Hadoop++: making a yellow elephant run like a cheetah (without it even noticing). PVLDB 3(1), 518–529 (2010)
16.
Zurück zum Zitat Dittrich, J., Quiané-Ruiz, J.-A., Richter, S., Schuh, S., Jindal, A., Schad, J.: Only aggressive elephants are fast elephants. PVLDB 5(11), 1591–1602 (2012) Dittrich, J., Quiané-Ruiz, J.-A., Richter, S., Schuh, S., Jindal, A., Schad, J.: Only aggressive elephants are fast elephants. PVLDB 5(11), 1591–1602 (2012)
17.
Zurück zum Zitat Dittrich, J.-P., Fischer, P.M., Kossmann, D.: AGILE: Adaptive indexing for context-aware information filters. In: SIGMOD, pp. 215–226 (2005) Dittrich, J.-P., Fischer, P.M., Kossmann, D.: AGILE: Adaptive indexing for context-aware information filters. In: SIGMOD, pp. 215–226 (2005)
18.
Zurück zum Zitat Eltabakh, M.Y., et al.: CoHadoop: flexible data placement and its exploitation in hadoop. PVLDB 4(9), 575–585 (2011) Eltabakh, M.Y., et al.: CoHadoop: flexible data placement and its exploitation in hadoop. PVLDB 4(9), 575–585 (2011)
19.
Zurück zum Zitat Finkelstein, S.J., et al.: Physical database design for relational databases. ACM TODS 13(1), 91–128 (1988)CrossRefMathSciNet Finkelstein, S.J., et al.: Physical database design for relational databases. ACM TODS 13(1), 91–128 (1988)CrossRefMathSciNet
20.
Zurück zum Zitat Graefe, G., Halim, F., Idreos, S., Kuno, H.A., Manegold, S.: Concurrency control for adaptive indexing. PVLDB 5(7), 656–667 (2012) Graefe, G., Halim, F., Idreos, S., Kuno, H.A., Manegold, S.: Concurrency control for adaptive indexing. PVLDB 5(7), 656–667 (2012)
21.
Zurück zum Zitat Graefe G., Kuno, H.A.: Self-selecting, self-tuning, incrementally optimized indexes. In: EDBT, pp. 371–381 (2010) Graefe G., Kuno, H.A.: Self-selecting, self-tuning, incrementally optimized indexes. In: EDBT, pp. 371–381 (2010)
24.
Zurück zum Zitat Halim, F., et al.: Stochastic database cracking: towards robust adaptive indexing in main-memory column-stores. PVLDB 5(6), 502–513 (2012) Halim, F., et al.: Stochastic database cracking: towards robust adaptive indexing in main-memory column-stores. PVLDB 5(6), 502–513 (2012)
25.
Zurück zum Zitat Halim, F., Idreos, S., Karras, P., Yap, R.H.C.: Stochastic database cracking: towards robust adaptive indexing in main-memory column-stores. PVLDB 5(6), 502–513 (2012) Halim, F., Idreos, S., Karras, P., Yap, R.H.C.: Stochastic database cracking: towards robust adaptive indexing in main-memory column-stores. PVLDB 5(6), 502–513 (2012)
26.
Zurück zum Zitat Herodotou, H., Babu, S.: Profiling, what-if analysis, and cost-based optimization of MapReduce programs. PVLDB 4(11), 1111–1122 (2011) Herodotou, H., Babu, S.: Profiling, what-if analysis, and cost-based optimization of MapReduce programs. PVLDB 4(11), 1111–1122 (2011)
27.
Zurück zum Zitat Idreos, S., Alagiannis, I., Johnson, R., Ailamaki, A.: Here are my data files. Here are my queries. Where are my results?. In: CIDR, pp. 57–68 (2011) Idreos, S., Alagiannis, I., Johnson, R., Ailamaki, A.: Here are my data files. Here are my queries. Where are my results?. In: CIDR, pp. 57–68 (2011)
28.
Zurück zum Zitat Idreos, S., et al.: Database cracking. In: CIDR, pp. 68–78 (2007) Idreos, S., et al.: Database cracking. In: CIDR, pp. 68–78 (2007)
29.
Zurück zum Zitat Idreos, S., et al.: Self-organizing tuple reconstruction in column-stores. In: SIGMOD, pp. 297–308 (2009) Idreos, S., et al.: Self-organizing tuple reconstruction in column-stores. In: SIGMOD, pp. 297–308 (2009)
30.
Zurück zum Zitat Idreos, S., et al.: Merging what’s cracked, cracking what’s merged: adaptive indexing in main-memory column-stores. PVLDB 4(9), 586–597 (2011) Idreos, S., et al.: Merging what’s cracked, cracking what’s merged: adaptive indexing in main-memory column-stores. PVLDB 4(9), 586–597 (2011)
31.
Zurück zum Zitat Idreos, S., Kersten, M.L., Manegold, S.: Updating a cracked database. In: SIGMOD Conference, pp. 413–424 (2007) Idreos, S., Kersten, M.L., Manegold, S.: Updating a cracked database. In: SIGMOD Conference, pp. 413–424 (2007)
32.
Zurück zum Zitat Jahani, E., et al.: Automatic optimization for MapReduce programs. PVLDB 4(6), 385–396 (2011) Jahani, E., et al.: Automatic optimization for MapReduce programs. PVLDB 4(6), 385–396 (2011)
33.
Zurück zum Zitat Jiang, D., et al.: The performance of MapReduce: an in-depth study. PVLDB 3(1), 472–483 (2010) Jiang, D., et al.: The performance of MapReduce: an in-depth study. PVLDB 3(1), 472–483 (2010)
34.
Zurück zum Zitat Jindal, A., Quiané-Ruiz, J.-A., Dittrich, J.: Trojan data layouts: right shoes for a running elephant. In: SOCC (2011) Jindal, A., Quiané-Ruiz, J.-A., Dittrich, J.: Trojan data layouts: right shoes for a running elephant. In: SOCC (2011)
35.
Zurück zum Zitat Lin, J., et al.: Full-text indexing for optimizing selection operations in large-scale data analytics. In: MapReduce Workshop (2011) Lin, J., et al.: Full-text indexing for optimizing selection operations in large-scale data analytics. In: MapReduce Workshop (2011)
36.
Zurück zum Zitat Logothetis, D., et al.: In-situ MapReduce for log processing. In: USENIX (2011) Logothetis, D., et al.: In-situ MapReduce for log processing. In: USENIX (2011)
37.
Zurück zum Zitat Lühring, M., et al.: Autonomous management of soft indexes. In: ICDE Workshop on Self-Managing Database Systems, pp. 450–458 (2007) Lühring, M., et al.: Autonomous management of soft indexes. In: ICDE Workshop on Self-Managing Database Systems, pp. 450–458 (2007)
38.
Zurück zum Zitat Olston, C.: Keynote: programming and debugging large-scale data processing workflows. In: SOCC (2011) Olston, C.: Keynote: programming and debugging large-scale data processing workflows. In: SOCC (2011)
39.
Zurück zum Zitat Pavlo, A., et al.: A comparison of approaches to large-scale data analysis. In: SIGMOD, pp. 165–178 (2009) Pavlo, A., et al.: A comparison of approaches to large-scale data analysis. In: SIGMOD, pp. 165–178 (2009)
40.
Zurück zum Zitat Quiané-Ruiz, J.-A., Pinkel, C., Schad, J., Dittrich, J.: RAFTing MapReduce: fast recovery on the RAFT. In: ICDE, pp. 589–600 (2011) Quiané-Ruiz, J.-A., Pinkel, C., Schad, J., Dittrich, J.: RAFTing MapReduce: fast recovery on the RAFT. In: ICDE, pp. 589–600 (2011)
41.
Zurück zum Zitat Schad, J., Dittrich, J., Quiané-Ruiz, J.-A.: Runtime measurements in the cloud: observing, analyzing, and reducing variance. PVLDB 3(1), 460–471 (2010) Schad, J., Dittrich, J., Quiané-Ruiz, J.-A.: Runtime measurements in the cloud: observing, analyzing, and reducing variance. PVLDB 3(1), 460–471 (2010)
42.
Zurück zum Zitat Schnaitter, K., et al.: COLT: continuous on-line tuning. In: SIGMOD, pp. 793–795 (2006) Schnaitter, K., et al.: COLT: continuous on-line tuning. In: SIGMOD, pp. 793–795 (2006)
43.
Zurück zum Zitat Thusoo, A., et al.: Data warehousing and analytics infrastructure at facebook. In: SIGMOD, pp. 1013–1020 (2010) Thusoo, A., et al.: Data warehousing and analytics infrastructure at facebook. In: SIGMOD, pp. 1013–1020 (2010)
44.
Zurück zum Zitat White, T.: Hadoop: The Definitive Guide. O’Reilly, Sebastopol (2011) White, T.: Hadoop: The Definitive Guide. O’Reilly, Sebastopol (2011)
45.
Zurück zum Zitat Yang, H.-C., Parker, D.S.: Traverse: simplified indexing on large Map-Reduce-merge clusters. In: DASFAA, pp. 308–322 (2009) Yang, H.-C., Parker, D.S.: Traverse: simplified indexing on large Map-Reduce-merge clusters. In: DASFAA, pp. 308–322 (2009)
46.
Zurück zum Zitat Zaharia, M., et al.: Delay scheduling: a simple technique for achieving locality and fairness in cluster scheduling. In: EuroSys, pp. 265–278 (2010) Zaharia, M., et al.: Delay scheduling: a simple technique for achieving locality and fairness in cluster scheduling. In: EuroSys, pp. 265–278 (2010)
Metadaten
Titel
Towards zero-overhead static and adaptive indexing in Hadoop
verfasst von
Stefan Richter
Jorge-Arnulfo Quiané-Ruiz
Stefan Schuh
Jens Dittrich
Publikationsdatum
01.06.2014
Verlag
Springer Berlin Heidelberg
Erschienen in
The VLDB Journal / Ausgabe 3/2014
Print ISSN: 1066-8888
Elektronische ISSN: 0949-877X
DOI
https://doi.org/10.1007/s00778-013-0332-z

Weitere Artikel der Ausgabe 3/2014

The VLDB Journal 3/2014 Zur Ausgabe