Published in: Cluster Computing 4/2014

1 December 2014

DIRAQ: scalable in situ data- and resource-aware indexing for optimized query performance

Authors: Sriram Lakshminarasimhan, Xiaocheng Zou, David A. Boyuka II, Saurabh V. Pendse, John Jenkins, Venkatram Vishwanath, Michael E. Papka, Scott Klasky, Nagiza F. Samatova

Abstract

Scientific data analytics in high-performance computing environments has been evolving along with the advancement of computing capabilities. With the onset of exascale computing, the increasing gap between compute performance and I/O bandwidth has made traditional post-simulation processing tedious. Despite the challenges due to increased data production, there exists an opportunity to benefit from “cheap” computing power to perform query-driven exploration and visualization during simulation time. To accelerate such analyses, applications traditionally augment raw data, post-simulation, with large indexes, which are then repeatedly utilized for data exploration. However, generating current state-of-the-art indexes involves compute- and memory-intensive processing, which renders them inapplicable in an in situ context. In this paper we propose DIRAQ, a parallel in situ, in-network data encoding and reorganization technique that enables the transformation of simulation output into a query-efficient form, with negligible runtime overhead to the simulation run. DIRAQ’s effective core-local, precision-based encoding approach incorporates an embedded compressed index that is 3–6× smaller than current state-of-the-art indexing schemes. Its data-aware index adjustment improves performance of group-level index layout creation by up to 35% and reduces the size of the generated index by up to 27%. Moreover, DIRAQ’s in-network index merging strategy enables the creation of aggregated indexes that speed up spatial-context query responses by up to 10× versus alternative techniques. DIRAQ’s topology-, data-, and memory-aware aggregation strategy results in efficient I/O and yields overall end-to-end encoding and I/O time that is less than that required to write the raw data with MPI collective I/O.
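
As a rough illustration of what a precision-based encoding with an embedded, compressed index could look like, the following minimal Python sketch bins double-precision values by their most significant bits and stores delta-encoded positions per bin. It is not DIRAQ's actual encoding, which the abstract does not specify. The function names, the 16-bit bin width, the delta encoding, and the restriction of queries to non-negative finite bounds are all assumptions made for illustration.

```python
import struct
from collections import defaultdict


def high_bits_key(value: float, bits: int = 16) -> int:
    """Return the top `bits` bits of the IEEE 754 double encoding of `value`."""
    (raw,) = struct.unpack(">Q", struct.pack(">d", value))
    return raw >> (64 - bits)


def build_index(values, bits: int = 16):
    """Map each bin key to the delta-encoded, ascending positions of its values."""
    bins = defaultdict(list)
    for pos, v in enumerate(values):
        bins[high_bits_key(v, bits)].append(pos)
    index = {}
    for key, positions in bins.items():
        # First entry is the absolute position; the rest are gaps to the previous one.
        index[key] = [positions[0]] + [b - a for a, b in zip(positions, positions[1:])]
    return index


def decode_positions(deltas):
    """Undo the delta encoding and return absolute positions."""
    out, total = [], 0
    for d in deltas:
        total += d
        out.append(total)
    return out


def range_query(values, index, lo: float, hi: float, bits: int = 16):
    """Return sorted positions p with lo <= values[p] <= hi.

    Assumption: lo and hi are non-negative and finite, so the ordering of
    bin keys matches the numeric ordering of the values they hold.
    """
    if not (0.0 <= lo <= hi < float("inf")):
        raise ValueError("sketch only supports 0 <= lo <= hi < inf")
    k_lo, k_hi = high_bits_key(lo, bits), high_bits_key(hi, bits)
    hits = []
    for key, deltas in index.items():
        if not (k_lo <= key <= k_hi):
            continue
        for pos in decode_positions(deltas):
            # Interior bins qualify wholesale; only boundary bins need the raw value.
            if key in (k_lo, k_hi) and not (lo <= values[pos] <= hi):
                continue
            hits.append(pos)
    return sorted(hits)


values = [1.5, 300.25, 2.75, 1.75, 0.5, 2.0]
index = build_index(values)
print(range_query(values, index, 1.0, 2.5))
```

In this sketch, bins that lie strictly inside the queried key range are answered from the index alone, and only the two boundary bins require a look at the original values. That is the general appeal of binning by precision; the real system's layout, compression, and merging steps are described in the paper itself.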

Metadata
Title
DIRAQ: scalable in situ data- and resource-aware indexing for optimized query performance
Authors
Sriram Lakshminarasimhan
Xiaocheng Zou
David A. Boyuka II
Saurabh V. Pendse
John Jenkins
Venkatram Vishwanath
Michael E. Papka
Scott Klasky
Nagiza F. Samatova
Publication date
1 December 2014
Publisher
Springer US
Published in
Cluster Computing / Issue 4/2014
Print ISSN: 1386-7857
Electronic ISSN: 1573-7543
DOI
https://doi.org/10.1007/s10586-014-0358-z
