Published in: Distributed and Parallel Databases 3/2015

1 September 2015

Locality-aware allocation of multi-dimensional correlated files on the cloud platform

Authors: Xiaofei Zhang, Yongxin Tong, Lei Chen, Min Wang, Shicong Feng

Abstract

The effective management of enormous data volumes on the Cloud platform has attracted considerable research effort. In this paper, we study the problem of allocating files with multi-dimensional correlations on the Cloud platform so that files can be retrieved and processed more efficiently. Most prevailing Cloud file systems allocate data according to the principles of fault tolerance and availability, while inter-file correlations, i.e. files being correlated with each other, are often neglected. In practice, data files are commonly correlated in various ways, and correlated files are likely to be involved in the same computation. This raises a new challenge: allocating files with multi-dimensional correlations while taking “subspace locality” into consideration to improve system throughput. We propose two allocation methods for multi-dimensional correlated files stored on the Cloud platform that improve I/O efficiency and data access locality in the MapReduce processing paradigm without compromising the fault tolerance and availability properties of the underlying file system. Unlike the techniques proposed in [1,2], which quickly map the locations of the desired data for a given query \({\mathcal {Q}}\), we focus on improving system throughput for batch jobs over correlated data files. We formulate the problem and study a series of solutions on HDFS [9]. Evaluations with real application scenarios demonstrate the effectiveness of our proposals: significant I/O and network costs are saved during data retrieval and processing. For batch OLAP jobs in particular, our solution achieves a well-balanced workload among distributed computing nodes.

Footnotes
1
Small files will be placed in the same block until a block is full.
 
2
We believe the closeness measurement is application dependent and consider it as a predefined metric.
 
3
We consider every partition group in \({\mathcal {P}}_i\) to be equally important, as are the \(m\) different feature subspaces.
 
4
Two tasks are orthogonal if they are not performed on the same partition group of \({\mathcal {P}}_b\).
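
The packing rule in footnote 1 can be illustrated with a minimal sketch. This is not the authors' implementation: the function name `pack_small_files`, the arrival-order policy, and the choice to open a new block whenever the next file does not fit in the remaining space are all assumptions made for illustration.

```python
from typing import List, Sequence


def pack_small_files(sizes: Sequence[int], block_size: int) -> List[List[int]]:
    """Group file indices into blocks, filling each block in arrival order.

    Assumption (not from the paper): a file is never split across blocks,
    so a new block is opened when the next file does not fit.
    """
    blocks: List[List[int]] = []
    current: List[int] = []  # indices of files in the block being filled
    used = 0                 # space already used in the current block

    for index, size in enumerate(sizes):
        if size > block_size:
            raise ValueError(f"file {index} is larger than one block")
        if current and used + size > block_size:
            # The current block cannot take this file: close it.
            blocks.append(current)
            current, used = [], 0
        current.append(index)
        used += size

    if current:
        blocks.append(current)
    return blocks


if __name__ == "__main__":
    # Hypothetical file sizes in MB with a 64 MB block.
    print(pack_small_files([30, 30, 50, 64, 10], block_size=64))
```

Under this policy the order in which files arrive determines which files share a block, which is one reason the placement of correlated small files matters.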
 
References
1. Abouzeid, A., Bajda-Pawlikowski, K., Abadi, D., Silberschatz, A., Rasin, A.: HadoopDB: an architectural hybrid of MapReduce and DBMS technologies for analytical workloads. In: Proceedings of VLDB Endow, pp. 922–933 (2009)
2. Afrati, F.N., Ullman, J.D.: Optimizing joins in a map-reduce environment. In: Proceedings of EDBT, pp. 99–110 (2010)
4. Blanas, S., Patel, J.M., Ercegovac, V., Rao, J., Shekita, E.J., Tian, Y.: A comparison of join algorithms for log processing in MapReduce. In: Proceedings of SIGMOD, pp. 975–986 (2010)
5. Brunet, J., Tamayo, P., Golub, T.R., Mesirov, J.P.: Metagenes and molecular pattern discovery using matrix factorization. PNAS 101(12), 4164–4169 (2004)
6. Chen, Y., Wang, W., Du, X., Zhou, X.: Continuously monitoring the correlations of massive discrete streams. In: Proceedings of CIKM, pp. 1571–1576 (2011)
7. Condie, T., Conway, N., Alvaro, P., Hellerstein, J.M., Elmeleegy, K., Sears, R.: MapReduce online. In: Proceedings of NSDI, pp. 313–328 (2010)
8. Condie, T., Conway, N., Alvaro, P., Hellerstein, J.M., Gerth, J., Talbot, J., Elmeleegy, K., Sears, R.: Online aggregation and continuous query support in MapReduce. In: Proceedings of SIGMOD, pp. 1115–1118 (2010)
9. Dean, J., Ghemawat, S.: MapReduce: simplified data processing on large clusters. In: Proceedings of OSDI, pp. 137–150 (2004)
10. Dittrich, J., Quiané-Ruiz, J., Jindal, A., Kargin, Y., Setty, V., Schad, J.: Hadoop++: making a yellow elephant run like a cheetah (without it even noticing). In: Proceedings of VLDB Endow, pp. 515–529 (2010)
11. Eltabakh, M.Y., Tian, Y., Özcan, F., Gemulla, R., Krettek, A., McPherson, J.: CoHadoop: flexible data placement and its exploitation in Hadoop. In: Proceedings of VLDB Endow, pp. 575–585 (2011)
12. Ghemawat, S., Gobioff, H., Leung, S.: The Google file system. In: Proceedings of SOSP, pp. 29–43 (2003)
13. Huang, J., Abadi, D.J., Ren, K.: Scalable SPARQL querying of large RDF graphs. Proc. VLDB 4(11), 1123–1134 (2011)
15. Jiang, D., Ooi, B.C., Shi, L., Wu, S.: The performance of MapReduce: an in-depth study. In: Proceedings of VLDB Endow, pp. 472–483 (2010)
16. Jiang, D., Tung, A.K.H., Chen, G.: MAP-JOIN-REDUCE toward scalable and efficient data analysis on large clusters. IEEE Trans. Knowl. Data Eng. 23(9), 1299–1311 (2011)
17. Jolliffe, I.T.: Principal Component Analysis. Springer, New York (2002)
18. Lei, M., Vrbsky, S.V., Hong, X.: An on-line replication strategy to increase availability in Data Grids. J. Futur. Gener. Comput. Syst. 24(2), 85–98 (2008)
19. Lieberman, H., Selker, T.: Out of context: computer systems that adapt to, and learn from, context. IBM Syst. J. 39(3–4), 617–632 (2000)
20. Nehme, R., Bruno, N.: Automated partitioning design in parallel database systems. In: Proceedings of SIGMOD, pp. 1137–1148 (2011)
21. Pavlo, A., Paulson, E., Rasin, A., Abadi, D.J., DeWitt, D.J., Madden, S., Stonebraker, M.: A comparison of approaches to large-scale data analysis. In: Proceedings of SIGMOD, pp. 165–178 (2009)
22. Ranganathan, K., Iamnitchi, A., Foster, I.: Improving data availability through dynamic model-driven replication in large peer-to-peer communities. In: Proceedings of CCGRID, pp. 376–381 (2002)
23. Samet, H.: Foundations of Multi-dimensional and Metric Data Structures (The Morgan Kaufmann Series in Computer Graphics and Geometric Modeling). Morgan Kaufmann Publishers Inc. (2005)
24. Silberschatz, A., Korth, H., Sudarshan, S.: Database System Concepts, 5th edn. McGraw-Hill Inc. (2006)
25. Stonebraker, M., Abadi, D., DeWitt, D.J., Madden, S., Paulson, E., Pavlo, A., Rasin, A.: MapReduce and parallel DBMSs: friends or foes? Commun. ACM 53(1), 64–71 (2010)
29. Wang, J., Wu, S., Gao, H., Li, J., Ooi, B.C.: Indexing multi-dimensional data in a cloud system. In: Proceedings of SIGMOD, pp. 591–602 (2010)
30. Wang, J., Jea, K.: A near-optimal database allocation for reducing the average waiting time in the grid computing environment. J. Inf. Sci. 179(21), 3772–3790 (2009)
31. Weil, S.A., Brandt, S.A., Miller, E.L., Long, D.D.E., Maltzahn, C.: Ceph: a scalable, high-performance distributed file system. In: Proceedings of OSDI, pp. 307–320 (2006)
32. Zhang, X., Ai, J., Wang, Z., Lu, J., Meng, X.: An efficient multi-dimensional index for cloud data management. In: Proceedings of CloudDB, pp. 17–24 (2009)
Metadata
Title
Locality-aware allocation of multi-dimensional correlated files on the cloud platform
Authors
Xiaofei Zhang
Yongxin Tong
Lei Chen
Min Wang
Shicong Feng
Publication date
1 September 2015
Publisher
Springer US
Published in
Distributed and Parallel Databases / Issue 3/2015
Print ISSN: 0926-8782
Electronic ISSN: 1573-7578
DOI
https://doi.org/10.1007/s10619-014-7153-y