ABSTRACT
Experimental evaluation is key to systems research. Because modern systems are complex and non-deterministic, good experimental methodology demands that researchers account for uncertainty. To obtain valid results, they are expected to run many iterations of benchmarks, invoke virtual machines (VMs) several times, or even rebuild VM or benchmark binaries more than once. All this repetition costs time to complete experiments. Currently, many evaluations give up on sufficient repetition or rigorous statistical methods, or even run benchmarks only with training-size inputs. The results reported often lack proper variation estimates and, when a small difference between two systems is reported, are sometimes simply unreliable.
In contrast, we provide a statistically rigorous methodology for repetition and for summarising results that makes efficient use of experimentation time. Time efficiency comes from two key observations. First, a given benchmark on a given platform typically exhibits much less non-determinism than the worst cases reported in published corner-case studies. Second, repetition is most needed where most uncertainty arises, whether between builds, between executions, or between iterations. We capture experimentation cost with a novel mathematical model, which we use to identify, for each level of an experiment, the number of repetitions that is both necessary and sufficient to obtain a given level of precision.
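To make the allocation idea concrete, the following is a minimal sketch in Python of the classic optimal-allocation result for two-stage (nested) sampling, applied to a two-level experiment of VM executions and in-process iterations. The function and parameter names are our own illustrative choices, and the formula is the textbook two-stage result rather than the paper's exact derivation; it assumes both variance components are positive.

    import math

    def optimal_iterations_per_execution(var_iter, var_exec,
                                         cost_exec, cost_iter):
        """Iterations to run per VM execution in a two-level experiment
        (executions > iterations), minimising the variance of the overall
        mean under a fixed time budget.

        var_iter  - variance between iterations within one execution
        var_exec  - variance between per-execution means (must be > 0)
        cost_exec - fixed cost of one VM invocation (start-up etc.), seconds
        cost_iter - cost of one benchmark iteration, seconds
        """
        # More iterations per execution pay off when iteration-level
        # variance or per-execution start-up cost dominates.
        m = math.sqrt((var_iter * cost_exec) / (var_exec * cost_iter))
        return max(1, math.ceil(m))

For example, with cheap iterations, expensive VM start-up, and most variance arising between iterations, the formula prescribes many iterations per invocation and correspondingly fewer invocations; when executions themselves vary a lot, it pushes the budget towards more invocations instead.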
We present our methodology as a cookbook that guides researchers on the number of repetitions they should run to obtain reliable results. We also show how to present results with an effect size confidence interval. As an example, we show how to use our methodology to conduct throughput experiments with the DaCapo and SPEC CPU benchmarks on three recent platforms.
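For illustration, the sketch below shows one simple, non-parametric way to obtain an effect size confidence interval: a percentile bootstrap over per-execution means of the ratio of two systems' mean times. The paper itself derives a parametric interval, so treat this as an assumption-laden stand-in; the function and parameter names are hypothetical.

    import random

    def bootstrap_speedup_ci(old_means, new_means, reps=10000, alpha=0.05):
        """Percentile-bootstrap confidence interval for the ratio of mean
        execution times (old / new), i.e. the speedup of 'new' over 'old'.

        old_means, new_means - per-execution mean times for each system
        """
        ratios = []
        for _ in range(reps):
            # Resample executions with replacement, then compare means.
            old_sample = [random.choice(old_means) for _ in old_means]
            new_sample = [random.choice(new_means) for _ in new_means]
            ratios.append((sum(old_sample) / len(old_sample)) /
                          (sum(new_sample) / len(new_sample)))
        ratios.sort()
        lo = ratios[int((alpha / 2) * reps)]
        hi = ratios[int((1 - alpha / 2) * reps) - 1]
        return lo, hi

If the resulting interval excludes 1.0, the measured difference is distinguishable from noise at the chosen confidence level, and the interval's width directly communicates how large or small the effect may plausibly be.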