ABSTRACT
Experimental evaluation is key to systems research. Because modern systems are complex and non-deterministic, good experimental methodology demands that researchers account for uncertainty. To obtain valid results, they are expected to run many iterations of benchmarks, invoke virtual machines (VMs) several times, or even rebuild VM or benchmark binaries more than once. All this repetition costs time to complete experiments. Currently, many evaluations give up on sufficient repetition or rigorous statistical methods, or even run benchmarks only with training-size inputs. The results reported often lack proper variation estimates and, when a small difference between two systems is reported, are sometimes simply unreliable.
In contrast, we provide a statistically rigorous methodology for repetition and for summarising results that makes efficient use of experimentation time. Time efficiency comes from two key observations. First, a given benchmark on a given platform typically exhibits much less non-determinism than the worst cases reported in published corner-case studies. Second, repetition is most needed where most uncertainty arises, whether between builds, between executions, or between iterations. We capture experimentation cost with a novel mathematical model, which we use to identify, for each level of an experiment, the number of repetitions that is both necessary and sufficient to obtain a given level of precision.
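To make the allocation idea concrete, the following is a minimal sketch in Python of the classic optimal-allocation result for two-stage (nested) sampling, applied to a two-level experiment of VM executions and in-process iterations. The function and parameter names are our own illustrative choices, and the formula is the textbook two-stage result rather than the paper's exact derivation; it assumes both variance components are positive.

    import math

    def optimal_iterations_per_execution(var_iter, var_exec,
                                         cost_exec, cost_iter):
        """Iterations to run per VM execution in a two-level experiment
        (executions > iterations), minimising the variance of the overall
        mean under a fixed time budget.

        var_iter  - variance between iterations within one execution
        var_exec  - variance between per-execution means (must be > 0)
        cost_exec - fixed cost of one VM invocation (start-up etc.), seconds
        cost_iter - cost of one benchmark iteration, seconds
        """
        # More iterations per execution pay off when iteration-level
        # variance or per-execution start-up cost dominates.
        m = math.sqrt((var_iter * cost_exec) / (var_exec * cost_iter))
        return max(1, math.ceil(m))

For example, with cheap iterations, expensive VM start-up, and most variance arising between iterations, the formula prescribes many iterations per invocation and correspondingly fewer invocations; when executions themselves vary a lot, it pushes the budget towards more invocations instead.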
We present our methodology as a cookbook that guides researchers on the number of repetitions they should run to obtain reliable results. We also show how to present results with an effect size confidence interval. As an example, we show how to use our methodology to conduct throughput experiments with the DaCapo and SPEC CPU benchmarks on three recent platforms.
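For illustration, the sketch below shows one simple, non-parametric way to obtain an effect size confidence interval: a percentile bootstrap over per-execution means of the ratio of two systems' mean times. The paper itself derives a parametric interval, so treat this as an assumption-laden stand-in; the function and parameter names are hypothetical.

    import random

    def bootstrap_speedup_ci(old_means, new_means, reps=10000, alpha=0.05):
        """Percentile-bootstrap confidence interval for the ratio of mean
        execution times (old / new), i.e. the speedup of 'new' over 'old'.

        old_means, new_means - per-execution mean times for each system
        """
        ratios = []
        for _ in range(reps):
            # Resample executions with replacement, then compare means.
            old_sample = [random.choice(old_means) for _ in old_means]
            new_sample = [random.choice(new_means) for _ in new_means]
            ratios.append((sum(old_sample) / len(old_sample)) /
                          (sum(new_sample) / len(new_sample)))
        ratios.sort()
        lo = ratios[int((alpha / 2) * reps)]
        hi = ratios[int((1 - alpha / 2) * reps) - 1]
        return lo, hi

If the resulting interval excludes 1.0, the measured difference is distinguishable from noise at the chosen confidence level, and the interval's width directly communicates how large or small the effect may plausibly be.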