2016 | Original Paper | Book Chapter

Analyzing Data Properties Using Statistical Sampling Techniques – Illustrated on Scientific File Formats and Compression Features

Written by: Julian M. Kunkel

Published in: High Performance Computing

Publisher: Springer International Publishing

Abstract

Understanding the characteristics of data stored in data centers helps computer scientists identify the most suitable storage infrastructure for these workloads. For example, knowing which file formats are relevant makes it possible to optimize for those formats and also helps, during procurement, to define benchmarks that cover them. Existing studies that investigate performance improvements and data reduction techniques such as deduplication and compression operate on a small set of data. Some of those studies claim that the selected data is representative and extrapolate their results to the scale of the data center. One hurdle to running novel schemes on the complete data is the vast amount of data stored and, thus, the resources required to analyze the complete data set. Even if this were feasible, the costs of running many such experiments must be justified.
This paper investigates stochastic sampling methods to compute and analyze quantities of interest, both in terms of file counts and in terms of occupied storage space. It demonstrates that on our production system, scanning 1 % of files and data volume is sufficient to draw conclusions. This speeds up the analysis process and significantly reduces the costs of such studies. The contributions of this paper are: (1) a systematic investigation of the inherent analysis error when operating only on a subset of the data, (2) a demonstration of methods that help future studies mitigate this error, and (3) an illustration of the approach in a study of scientific file types and compression for a data center.
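
The distinction between estimating a quantity by file count and by occupied storage space can be sketched as follows. This is a minimal illustration under stated assumptions, not the paper's implementation: each file is modeled as a hypothetical (type label, size in bytes) pair, the "netcdf" label, the 30 % share, and the synthetic size distribution are invented for the example, and the Wilson score interval is just one common way to attach a confidence interval to a sampled proportion.

```python
import math
import random


def wilson_interval(successes, n, z=1.96):
    """Wilson score interval for a proportion (z = 1.96 is roughly 95 %)."""
    if n <= 0:
        raise ValueError("sample size must be positive")
    p = successes / n
    denom = 1 + z * z / n
    center = (p + z * z / (2 * n)) / denom
    half = z * math.sqrt(p * (1 - p) / n + z * z / (4 * n * n)) / denom
    return center - half, center + half


def is_netcdf(entry):
    """Hypothetical property of interest: the file carries the 'netcdf' label."""
    return entry[0] == "netcdf"


def make_synthetic_files(count, rng):
    """Invented population: about 30 % 'netcdf' files, heavy-tailed sizes."""
    files = []
    for _ in range(count):
        kind = "netcdf" if rng.random() < 0.3 else "other"
        size = int(rng.lognormvariate(10, 2)) + 1  # size in bytes, always > 0
        files.append((kind, size))
    return files


def estimate_by_count(files, predicate, fraction, rng):
    """Uniform sample without replacement: estimates the share of files."""
    n = max(1, round(len(files) * fraction))
    sample = rng.sample(files, n)
    hits = sum(1 for entry in sample if predicate(entry))
    return hits / n, wilson_interval(hits, n)


def estimate_by_volume(files, predicate, fraction, rng):
    """Size-weighted sample with replacement: estimates the share of bytes.

    Drawing each file with probability proportional to its size makes the
    hit rate an unbiased estimate of the fraction of occupied storage.
    """
    n = max(1, round(len(files) * fraction))
    weights = [size for _, size in files]
    sample = rng.choices(files, weights=weights, k=n)
    hits = sum(1 for entry in sample if predicate(entry))
    return hits / n, wilson_interval(hits, n)


if __name__ == "__main__":
    rng = random.Random(42)
    files = make_synthetic_files(100_000, rng)
    total_bytes = sum(size for _, size in files)

    true_count = sum(1 for e in files if is_netcdf(e)) / len(files)
    true_volume = sum(size for kind, size in files if kind == "netcdf") / total_bytes

    # Scan 1 % of the population, as in the abstract's headline figure.
    est_c, ci_c = estimate_by_count(files, is_netcdf, 0.01, rng)
    est_v, ci_v = estimate_by_volume(files, is_netcdf, 0.01, rng)

    print(f"by count : true={true_count:.3f} est={est_c:.3f} ci={ci_c}")
    print(f"by volume: true={true_volume:.3f} est={est_v:.3f} ci={ci_v}")
```

The two estimators answer different questions: a uniform sample is dominated by the many small files, whereas the size-weighted sample reflects where the bytes are. With heavy-tailed file sizes the two shares can differ considerably, which is why the sketch keeps them separate.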

Footnotes
1
The value is an estimate based on the TCO of the system over 5 years. It is conservative and does not include secondary costs such as the jitter that the resulting I/O introduces to other models.
 
2
Obviously, if those 160 projects are not representative, deducing properties for the full data is not valid. Still, the introduced analysis and approaches are correct. The number of 10 k files was chosen because it ensures that at most 0.5 % of the files are scanned.
 
3
Among the GZIP files, the extension tar.gz is observed on 9 % of files, which account for 53 % of the overall GZIP data size. Thus, measured by size, most GZIP data also consists of TAR files.
 
Metadata
Title
Analyzing Data Properties Using Statistical Sampling Techniques – Illustrated on Scientific File Formats and Compression Features
Written by
Julian M. Kunkel
Copyright year
2016
DOI
https://doi.org/10.1007/978-3-319-46079-6_10