nach oben

The VLDB Journal

Erschienen in:

01.06.2014 | Regular Paper

Towards zero-overhead static and adaptive indexing in Hadoop

verfasst von: Stefan Richter, Jorge-Arnulfo Quiané-Ruiz, Stefan Schuh, Jens Dittrich

Erschienen in: The VLDB Journal | Ausgabe 3/2014

Einloggen

Aktivieren Sie unsere intelligente Suche, um passende Fachinhalte oder Patente zu finden.

search-config

KI-gestützte Suche

Aus

Abstract

Hadoop MapReduce has evolved to an important industry standard for massive parallel data processing and has become widely adopted for a variety of use-cases. Recent works have shown that indexes can improve the performance of selective MapReduce jobs dramatically. However, one major weakness of existing approaches is high index creation costs. We present HAIL (Hadoop Aggressive Indexing Library), a novel indexing approach for HDFS and Hadoop MapReduce. HAIL creates different clustered indexes over terabytes of data with minimal, often invisible costs, and it dramatically improves runtimes of several classes of MapReduce jobs. HAIL features two different indexing pipelines, static indexing and adaptive indexing. HAIL static indexing efficiently indexes datasets while uploading them to HDFS. Thereby, HAIL leverages the default replication of Hadoop and enhances it with logical replication. This allows HAIL to create multiple clustered indexes for a dataset, e.g., one for each physical replica. Still, in terms of upload time, HAIL matches or even improves over the performance of standard HDFS. Additionally, HAIL adaptive indexing allows for automatic, incremental indexing at job runtime with minimal runtime overhead. For example, HAIL adaptive indexing can completely index a dataset as byproduct of only four MapReduce jobs while incurring an overhead as low as 11 % for the very first of those job only. In our experiments, we show that HAIL improves job runtimes by up to 68\(\times \) over Hadoop. This article is an extended version of the VLDB 2012 paper (Dittrich et al. in PVLDB 5(11):1591–1602, 2012).

Vorheriger Artikel Personalized trajectory matching in spatial networks

Nächster Artikel The backchase revisited

Sie haben noch keine Lizenz? Dann Informieren Sie sich jetzt über unsere Produkte:

Springer Professional "Wirtschaft+Technik"

Online-Abonnement

Mit Springer Professional "Wirtschaft+Technik" erhalten Sie Zugriff auf:

über 102.000 Bücher
über 537 Zeitschriften

aus folgenden Fachgebieten:

Automobil + Motoren
Bauwesen + Immobilien
Business IT + Informatik
Elektrotechnik + Elektronik
Energie + Nachhaltigkeit
Finance + Banking
Management + Führung
Marketing + Vertrieb
Maschinenbau + Werkstoffe
Versicherung + Risiko

Jetzt Wissensvorsprung sichern!

Jetzt informieren

Springer Professional "Technik"

Online-Abonnement

Mit Springer Professional "Technik" erhalten Sie Zugriff auf:

über 67.000 Bücher
über 390 Zeitschriften

aus folgenden Fachgebieten:

Automobil + Motoren
Bauwesen + Immobilien
Business IT + Informatik
Elektrotechnik + Elektronik
Energie + Nachhaltigkeit
Maschinenbau + Werkstoffe

Jetzt Wissensvorsprung sichern!

Jetzt informieren

Springer Professional "Wirtschaft"

Online-Abonnement

Mit Springer Professional "Wirtschaft" erhalten Sie Zugriff auf:

über 67.000 Bücher
über 340 Zeitschriften

aus folgenden Fachgebieten:

Bauwesen + Immobilien
Business IT + Informatik
Finance + Banking
Management + Führung
Marketing + Vertrieb
Versicherung + Risiko

Jetzt Wissensvorsprung sichern!

Jetzt informieren

A simple example of such a use-case would be a distributed grep.

The professor is aware that for some situations, the opposite is true.

Actually, it is a split. The difference does not matter here. We will get back to this in Sect. 4.2.

Alternatively, HAIL can also suggest an appropriate schema to users through schema analysis.

Alternatively, HAIL allows Bob to specify the selection predicate and the projected attributes in the job configuration class.

A Hadoop instance responsible to execute map and reduce tasks.

That was obtained from the HAILInputFormat via getSplits().

Notice that, all map tasks (even from different MapReduce jobs) running on the same node interact with the same AdaptiveIndexer instance.

Hence, the AdaptiveIndexer can end up by indexing data blocks from different MapReduce jobs at the same time.

It is worth noting that \(T_{idxOverhead}\) denotes only the additional runtime that a MapReduce job has due to adaptive indexing.

For this cluster type, we allocate an additional large node to run the namenode and jobtracker.

This is the time a map task takes to read and process its input.

Recall that, this query projects all attributes, which is indeed more beneficial for Hadoop++ as it uses a row layout.

Although HAIL is still indexing further blocks.

Abouzied, A., Abadi, D. J., Silberschatz, A.: Invisible loading: access-driven data transfer from raw Files into database systems. In: EDBT, pp. 1–10 (2013)

Agrawal, S., et al.: Database tuning advisor for Microsoft SQL server 2005. In: VLDB, pp. 1110–1121 (2004)

Ailamaki, A., et al.: Weaving relations for Cache performance. In: VLDB, pp. 169–180 (2001)

Alagiannis, I., Borovica, R., Branco, M., Idreos, S., Ailamaki, A.: NoDB: Efficient query execution on raw data files. In: SIGMOD Conference, pp. 241–252 (2012)

Blanas, S., et al.: A comparison of join algorithms for log processing in MapReduce. In: SIGMOD, pp. 975–986 (2010)

Bruno, N., Chaudhuri, S.: To tune or not to tune? A lightweight physical design alerter. In: VLDB, pp. 499–510 (2006)

Bruno, N., Chaudhuri, S.: An online approach to physical design tuning. In: ICDE, pp. 826–835 (2007)

Bruno, N., Chaudhuri, S.: Physical design refinement: the merge-reduce approach. ACM TODS 32(4), 28:1–28:41 (2007)

Cafarella, M.J., Ré, C.: Manimal: relational optimization for data-intensive programs. In: WebDB (2010)

10.

Chaudhuri, S., Narasayya, V.R.: An efficient cost-driven index selection tool for Microsoft SQL server. In: VLDB, pp. 146–155 (1997)

11.

Chaudhuri, S., Narasayya, V.R.: Self-tuning database systems: a decade of progress. In: VLDB, pp. 3–14 (2007)

12.

Chen, S.: Cheetah: a high performance, custom data warehouse on top of MapReduce. PVLDB 3(1—-2), 1459–1468 (2010)

13.

Dean, J., Ghemawat, S.: MapReduce: a flexible data processing tool. CACM 53(1), 72–77 (2010)CrossRef

14.

Dittrich, J., Quiané-Ruiz, J.-A.: Efficient parallel data processing in MapReduce workflows. PVLDB 5, 2014–2015 (2012)

15.

Dittrich, J., Quiané-Ruiz, J.-A., Jindal, A., Kargin, Y., Setty, V., Schad, J.: Hadoop++: making a yellow elephant run like a cheetah (without it even noticing). PVLDB 3(1), 518–529 (2010)

16.

Dittrich, J., Quiané-Ruiz, J.-A., Richter, S., Schuh, S., Jindal, A., Schad, J.: Only aggressive elephants are fast elephants. PVLDB 5(11), 1591–1602 (2012)

17.

Dittrich, J.-P., Fischer, P.M., Kossmann, D.: AGILE: Adaptive indexing for context-aware information filters. In: SIGMOD, pp. 215–226 (2005)

18.

Eltabakh, M.Y., et al.: CoHadoop: flexible data placement and its exploitation in hadoop. PVLDB 4(9), 575–585 (2011)

19.

Finkelstein, S.J., et al.: Physical database design for relational databases. ACM TODS 13(1), 91–128 (1988)CrossRefMathSciNet

20.

Graefe, G., Halim, F., Idreos, S., Kuno, H.A., Manegold, S.: Concurrency control for adaptive indexing. PVLDB 5(7), 656–667 (2012)

21.

Graefe G., Kuno, H.A.: Self-selecting, self-tuning, incrementally optimized indexes. In: EDBT, pp. 371–381 (2010)

22.

http://engineering.twitter.com/2010/04/hadoop-at-twitter.html

23.

Hadoop Users. http://wiki.apache.org/hadoop/PoweredBy

24.

Halim, F., et al.: Stochastic database cracking: towards robust adaptive indexing in main-memory column-stores. PVLDB 5(6), 502–513 (2012)

25.

Halim, F., Idreos, S., Karras, P., Yap, R.H.C.: Stochastic database cracking: towards robust adaptive indexing in main-memory column-stores. PVLDB 5(6), 502–513 (2012)

26.

Herodotou, H., Babu, S.: Profiling, what-if analysis, and cost-based optimization of MapReduce programs. PVLDB 4(11), 1111–1122 (2011)

27.

Idreos, S., Alagiannis, I., Johnson, R., Ailamaki, A.: Here are my data files. Here are my queries. Where are my results?. In: CIDR, pp. 57–68 (2011)

28.

Idreos, S., et al.: Database cracking. In: CIDR, pp. 68–78 (2007)

29.

Idreos, S., et al.: Self-organizing tuple reconstruction in column-stores. In: SIGMOD, pp. 297–308 (2009)

30.

Idreos, S., et al.: Merging what’s cracked, cracking what’s merged: adaptive indexing in main-memory column-stores. PVLDB 4(9), 586–597 (2011)

31.

Idreos, S., Kersten, M.L., Manegold, S.: Updating a cracked database. In: SIGMOD Conference, pp. 413–424 (2007)

32.

Jahani, E., et al.: Automatic optimization for MapReduce programs. PVLDB 4(6), 385–396 (2011)

33.

Jiang, D., et al.: The performance of MapReduce: an in-depth study. PVLDB 3(1), 472–483 (2010)

34.

Jindal, A., Quiané-Ruiz, J.-A., Dittrich, J.: Trojan data layouts: right shoes for a running elephant. In: SOCC (2011)

35.

Lin, J., et al.: Full-text indexing for optimizing selection operations in large-scale data analytics. In: MapReduce Workshop (2011)

36.

Logothetis, D., et al.: In-situ MapReduce for log processing. In: USENIX (2011)

37.

Lühring, M., et al.: Autonomous management of soft indexes. In: ICDE Workshop on Self-Managing Database Systems, pp. 450–458 (2007)

38.

Olston, C.: Keynote: programming and debugging large-scale data processing workflows. In: SOCC (2011)

39.

Pavlo, A., et al.: A comparison of approaches to large-scale data analysis. In: SIGMOD, pp. 165–178 (2009)

40.

Quiané-Ruiz, J.-A., Pinkel, C., Schad, J., Dittrich, J.: RAFTing MapReduce: fast recovery on the RAFT. In: ICDE, pp. 589–600 (2011)

41.

Schad, J., Dittrich, J., Quiané-Ruiz, J.-A.: Runtime measurements in the cloud: observing, analyzing, and reducing variance. PVLDB 3(1), 460–471 (2010)

42.

Schnaitter, K., et al.: COLT: continuous on-line tuning. In: SIGMOD, pp. 793–795 (2006)

43.

Thusoo, A., et al.: Data warehousing and analytics infrastructure at facebook. In: SIGMOD, pp. 1013–1020 (2010)

44.

White, T.: Hadoop: The Definitive Guide. O’Reilly, Sebastopol (2011)

45.

Yang, H.-C., Parker, D.S.: Traverse: simplified indexing on large Map-Reduce-merge clusters. In: DASFAA, pp. 308–322 (2009)

46.

Zaharia, M., et al.: Delay scheduling: a simple technique for achieving locality and fairness in cluster scheduling. In: EuroSys, pp. 265–278 (2010)

Titel: Towards zero-overhead static and adaptive indexing in Hadoop
verfasst von: Stefan Richter
Jorge-Arnulfo Quiané-Ruiz
Stefan Schuh
Jens Dittrich
Publikationsdatum: 01.06.2014
Verlag: Springer Berlin Heidelberg
Erschienen in: The VLDB Journal / Ausgabe 3/2014
Print ISSN: 1066-8888
Elektronische ISSN: 0949-877X
DOI: https://doi.org/10.1007/s00778-013-0332-z

Springer Professional

Abstract

Bitte loggen Sie sich ein, um Zugang zu Ihrer Lizenz zu erhalten.

Sie haben noch keine Lizenz? Dann Informieren Sie sich jetzt über unsere Produkte:

Springer Professional "Wirtschaft+Technik"

Springer Professional "Technik"

Springer Professional "Wirtschaft"

Weitere Artikel der Ausgabe 3/2014

The backchase revisited

Personalized trajectory matching in spatial networks

Approximate similarity search for online multimedia services on distributed CPU–GPU platforms

QUBLE: towards blending interactive visual subgraph search queries on large networks

Event detection over twitter social media streams

A survey of large-scale analytical query processing in MapReduce