Skip to main content
Erschienen in: The Journal of Supercomputing 2/2014

01.02.2014

Parallel labeling of massive XML data with MapReduce

verfasst von: Hyebong Choi, Kyong-Ha Lee, Yoon-Joon Lee

Erschienen in: The Journal of Supercomputing | Ausgabe 2/2014

Einloggen

Aktivieren Sie unsere intelligente Suche, um passende Fachinhalte oder Patente zu finden.

search-config
loading …

Abstract

The volume of XML data has become enormous and still grows very quickly as many data have been typed in XML by virtue of its simplicity and extensibility. While a tree labeling algorithm has a crucial role in XML query processing, conventional algorithms are all sequential so that they fail to label a large volume of XML data in a timely manner. To address this issue, we devise parallel tree labeling algorithms for massive XML data. Specifically, we focus on how to efficiently label a single large XML file in parallel. We first propose parallel versions of two prominent tree labeling schemes based on the MapReduce framework. We then present techniques for runtime workload balancing and data repartition to solve performance issues caused by data skewness and MapReduce’s inherited limitation. Through extensive experiments with synthetic and real-world datasets on 15 nodes, we show that our parallel labeling algorithms are up to 17 times faster than conventional algorithms, providing strong durability against data skewness.

Sie haben noch keine Lizenz? Dann Informieren Sie sich jetzt über unsere Produkte:

Springer Professional "Wirtschaft"

Online-Abonnement

Mit Springer Professional "Wirtschaft" erhalten Sie Zugriff auf:

  • über 67.000 Bücher
  • über 340 Zeitschriften

aus folgenden Fachgebieten:

  • Bauwesen + Immobilien
  • Business IT + Informatik
  • Finance + Banking
  • Management + Führung
  • Marketing + Vertrieb
  • Versicherung + Risiko




Jetzt Wissensvorsprung sichern!

Springer Professional "Technik"

Online-Abonnement

Mit Springer Professional "Technik" erhalten Sie Zugriff auf:

  • über 67.000 Bücher
  • über 390 Zeitschriften

aus folgenden Fachgebieten:

  • Automobil + Motoren
  • Bauwesen + Immobilien
  • Business IT + Informatik
  • Elektrotechnik + Elektronik
  • Energie + Nachhaltigkeit
  • Maschinenbau + Werkstoffe




 

Jetzt Wissensvorsprung sichern!

Springer Professional "Wirtschaft+Technik"

Online-Abonnement

Mit Springer Professional "Wirtschaft+Technik" erhalten Sie Zugriff auf:

  • über 102.000 Bücher
  • über 537 Zeitschriften

aus folgenden Fachgebieten:

  • Automobil + Motoren
  • Bauwesen + Immobilien
  • Business IT + Informatik
  • Elektrotechnik + Elektronik
  • Energie + Nachhaltigkeit
  • Finance + Banking
  • Management + Führung
  • Marketing + Vertrieb
  • Maschinenbau + Werkstoffe
  • Versicherung + Risiko

Jetzt Wissensvorsprung sichern!

Literatur
5.
Zurück zum Zitat Bairoch A et al (2005) The universal protein resource (UniProt). Nucleic Acids Res 33(suppl 1), D154–D159 Bairoch A et al (2005) The universal protein resource (UniProt). Nucleic Acids Res 33(suppl 1), D154–D159
7.
Zurück zum Zitat Beyer K, Cochrane R, Josifovski V, Kleewein J, Lapis G, Lohman G, Lyle B, Özcan F, Pirahesh H, Seemann N et al (2005) System RX: one part relational, one part XML. In: Proceedings of the 2005 ACM SIGMOD conference. ACM, New York, pp 347–358 CrossRef Beyer K, Cochrane R, Josifovski V, Kleewein J, Lapis G, Lohman G, Lyle B, Özcan F, Pirahesh H, Seemann N et al (2005) System RX: one part relational, one part XML. In: Proceedings of the 2005 ACM SIGMOD conference. ACM, New York, pp 347–358 CrossRef
8.
Zurück zum Zitat Böhme T, Rahm E (2004) Supporting efficient streaming and insertion of XML data in RDBMS. In: Proc 3rd DIWeb workshop, pp 70–81 Böhme T, Rahm E (2004) Supporting efficient streaming and insertion of XML data in RDBMS. In: Proc 3rd DIWeb workshop, pp 70–81
9.
Zurück zum Zitat Bray T, Paoli J, Sperberg-McQueen C, Maler E, Yergeau F (1997) Extensible markup language (XML). World Wide Web J 2(4):27–66 Bray T, Paoli J, Sperberg-McQueen C, Maler E, Yergeau F (1997) Extensible markup language (XML). World Wide Web J 2(4):27–66
10.
Zurück zum Zitat Bruno N et al (2002) Holistic twig joins: optimal XML pattern matching. In: Proceedings of the ACM SIGMOD conference, pp 310–321 Bruno N et al (2002) Holistic twig joins: optimal XML pattern matching. In: Proceedings of the ACM SIGMOD conference, pp 310–321
11.
Zurück zum Zitat Choi H, Lee KH, Kim SH, Lee YJ, Moon B (2012) HadoopXML: a suite for parallel processing of massive XML data with multiple twig pattern queries. In: Proceedings of the 21th ACM CIKM conference, pp 2737–2739 Choi H, Lee KH, Kim SH, Lee YJ, Moon B (2012) HadoopXML: a suite for parallel processing of massive XML data with multiple twig pattern queries. In: Proceedings of the 21th ACM CIKM conference, pp 2737–2739
12.
Zurück zum Zitat Cormen TH et al (2001) Introduction to algorithms. MIT Press, Cambridge MATH Cormen TH et al (2001) Introduction to algorithms. MIT Press, Cambridge MATH
13.
Zurück zum Zitat Dean J, Ghemawat S (2008) MapReduce: simplified data processing on large clusters. Commun ACM 51(1):107–113 CrossRef Dean J, Ghemawat S (2008) MapReduce: simplified data processing on large clusters. Commun ACM 51(1):107–113 CrossRef
14.
Zurück zum Zitat Dietz P (1982) Maintaining order in a linked list. In: Proceedings of the 14th annual ACM symposium on theory of computing. ACM, New York, pp 122–127 Dietz P (1982) Maintaining order in a linked list. In: Proceedings of the 14th annual ACM symposium on theory of computing. ACM, New York, pp 122–127
15.
Zurück zum Zitat Grün C, Gath S, Holupirek A, Scholl M (2009) XQuery full text implementation in BaseX. Database and XML technologies, pp 114–128 Grün C, Gath S, Holupirek A, Scholl M (2009) XQuery full text implementation in BaseX. Database and XML technologies, pp 114–128
16.
Zurück zum Zitat Ibrahim S, Jin H, Lu L, Wu S, He B, Qi L (2010) Leen: locality/fairness-aware key partitioning for mapreduce in the cloud. In: IEEE second international conference on cloud computing technology and science (CloudCom), 2010. IEEE Press, New York, pp 17–24 CrossRef Ibrahim S, Jin H, Lu L, Wu S, He B, Qi L (2010) Leen: locality/fairness-aware key partitioning for mapreduce in the cloud. In: IEEE second international conference on cloud computing technology and science (CloudCom), 2010. IEEE Press, New York, pp 17–24 CrossRef
17.
Zurück zum Zitat Kwon Y, Balazinska M, Howe B, Rolia J (2012) SkewTune: mitigating skew in MapReduce applications. In: Proceedings of the ACM SIGMOD conference. ACM, New York, pp 25–36 Kwon Y, Balazinska M, Howe B, Rolia J (2012) SkewTune: mitigating skew in MapReduce applications. In: Proceedings of the ACM SIGMOD conference. ACM, New York, pp 25–36
18.
Zurück zum Zitat Lee K, Choi H, Lee Y, Chung YD, Moon B (2012) Parallel data processing with MapReduce: A survey. ACM SIGMOD Record 40(4):11–20 CrossRef Lee K, Choi H, Lee Y, Chung YD, Moon B (2012) Parallel data processing with MapReduce: A survey. ACM SIGMOD Record 40(4):11–20 CrossRef
19.
Zurück zum Zitat Li Q, Moon B (2001) Indexing and querying XML data for regular path expressions. In: Proceedings of the 27th VLDB conference, pp 361–370 Li Q, Moon B (2001) Indexing and querying XML data for regular path expressions. In: Proceedings of the 27th VLDB conference, pp 361–370
20.
Zurück zum Zitat Lu W, Chiu K, Pan Y (2006) A parallel approach to XML parsing. In: 7th IEEE/ACM international conference on grid computing. IEEE Press, New York, pp 223–230 Lu W, Chiu K, Pan Y (2006) A parallel approach to XML parsing. In: 7th IEEE/ACM international conference on grid computing. IEEE Press, New York, pp 223–230
21.
Zurück zum Zitat Murthy R, Liu Z, Krishnaprasad M, Chandrasekar S, Tran A, Sedlar E, Florescu D, Kotsovolos S, Agarwal N, Arora V et al (2005) Towards an enterprise XML architecture. In: Proceedings of the ACM SIGMOD conference. ACM, New York, pp 953–957 Murthy R, Liu Z, Krishnaprasad M, Chandrasekar S, Tran A, Sedlar E, Florescu D, Kotsovolos S, Agarwal N, Arora V et al (2005) Towards an enterprise XML architecture. In: Proceedings of the ACM SIGMOD conference. ACM, New York, pp 953–957
22.
Zurück zum Zitat O’Neil P, O’Neil E, Pal S, Cseri I, Schaller G, Westbury N (2004) ORDPATHs: insert-friendly XML node labels. In: Proceedings of the ACM SIGMOD conference. ACM, New York, pp 903–908 O’Neil P, O’Neil E, Pal S, Cseri I, Schaller G, Westbury N (2004) ORDPATHs: insert-friendly XML node labels. In: Proceedings of the ACM SIGMOD conference. ACM, New York, pp 903–908
23.
Zurück zum Zitat Pan Y, Lu W, Zhang Y, Chili K (2007) A static load-balancing scheme for parallel XML parsing on multicore CPUs. In: 7th IEEE international symposium on cluster computing and the grid, 2007. CCGRID 2007. IEEE Press, New York, pp 351–362 Pan Y, Lu W, Zhang Y, Chili K (2007) A static load-balancing scheme for parallel XML parsing on multicore CPUs. In: 7th IEEE international symposium on cluster computing and the grid, 2007. CCGRID 2007. IEEE Press, New York, pp 351–362
25.
Zurück zum Zitat Schmidt A, Waas F, Kersten M, Carey M, Manolescu I, Busse R (2002) XMark: a benchmark for XML data management. In: Proceedings of the 28th VLDB conference. VLDB Endowment, pp 974–985 Schmidt A, Waas F, Kersten M, Carey M, Manolescu I, Busse R (2002) XMark: a benchmark for XML data management. In: Proceedings of the 28th VLDB conference. VLDB Endowment, pp 974–985
26.
Zurück zum Zitat Scott ML, SCOTT ML (1998) Dewey decimal classification. Libraries Unlimited Scott ML, SCOTT ML (1998) Dewey decimal classification. Libraries Unlimited
27.
Zurück zum Zitat Shah B, Rao P, Moon B, Rajagopalan M (2009) A data parallel algorithm for XML DOM parsing. Database and XML technologies pp 75–90 Shah B, Rao P, Moon B, Rajagopalan M (2009) A data parallel algorithm for XML DOM parsing. Database and XML technologies pp 75–90
28.
Zurück zum Zitat Suciu D (1992) Treebank: XML data repository Suciu D (1992) Treebank: XML data repository
29.
Zurück zum Zitat Tatarinov I, Viglas S, Beyer K, Shanmugasundaram J, Shekita E, Zhang C (2002) Storing and querying ordered XML using a relational database system. In: Proceedings of the ACM SIGMOD conference. ACM, New York, pp 204–215 Tatarinov I, Viglas S, Beyer K, Shanmugasundaram J, Shekita E, Zhang C (2002) Storing and querying ordered XML using a relational database system. In: Proceedings of the ACM SIGMOD conference. ACM, New York, pp 204–215
30.
Zurück zum Zitat Zhang C et al (2001) On supporting containment queries in relational database management systems. In: Proceedings of the ACM SIGMOD conference, pp 425–436 Zhang C et al (2001) On supporting containment queries in relational database management systems. In: Proceedings of the ACM SIGMOD conference, pp 425–436
Metadaten
Titel
Parallel labeling of massive XML data with MapReduce
verfasst von
Hyebong Choi
Kyong-Ha Lee
Yoon-Joon Lee
Publikationsdatum
01.02.2014
Verlag
Springer US
Erschienen in
The Journal of Supercomputing / Ausgabe 2/2014
Print ISSN: 0920-8542
Elektronische ISSN: 1573-0484
DOI
https://doi.org/10.1007/s11227-013-1008-6

Weitere Artikel der Ausgabe 2/2014

The Journal of Supercomputing 2/2014 Zur Ausgabe

Premium Partner