DOI: 10.1145/2491894.2464160

Research article

Rigorous benchmarking in reasonable time

Published: 20 June 2013

ABSTRACT

Experimental evaluation is key to systems research. Because modern systems are complex and non-deterministic, good experimental methodology demands that researchers account for uncertainty. To obtain valid results, they are expected to run many iterations of benchmarks, invoke virtual machines (VMs) several times, or even rebuild VM or benchmark binaries more than once. All this repetition costs time to complete experiments. Currently, many evaluations give up on sufficient repetition or rigorous statistical methods, or even run benchmarks only in training sizes. The results reported often lack proper variation estimates and, when a small difference between two systems is reported, some such results are simply unreliable.

In contrast, we provide a statistically rigorous methodology for repetition and summarising results that makes efficient use of experimentation time. Time efficiency comes from two key observations. First, a given benchmark on a given platform is typically prone to much less non-determinism than the common worst-case of published corner-case studies. Second, repetition is most needed where most uncertainty arises (whether between builds, between executions or between iterations). We capture experimentation cost with a novel mathematical model, which we use to identify the number of repetitions at each level of an experiment necessary and sufficient to obtain a given level of precision.
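
To make the allocation idea concrete, here is a minimal Python sketch (not taken from the paper) of repetition planning for a two-level experiment: VM executions, each containing benchmark iterations. It estimates within- and between-execution variance components from a small pilot run and then applies the classic optimal-allocation rule for two-stage sampling (cf. Cochran, 1977) together with a normal-approximation precision target. The function names, pilot data, and cost figures are all assumptions made for illustration.

```python
# Illustrative sketch only: plan repetition counts for a two-level experiment
# (VM executions x iterations per execution). The allocation rule below is the
# classic two-stage sampling result (cf. Cochran, 1977), not the paper's own
# formulas; pilot data, costs, and names are invented for the example.
import math
from statistics import mean

def variance_components(samples):
    """Moment estimators of within- and between-execution variance.

    samples[j][i] is the time of iteration i in execution j; each execution
    is assumed to have the same number of iterations.
    """
    r = len(samples)            # executions in the pilot run
    k = len(samples[0])         # iterations per execution
    exec_means = [mean(s) for s in samples]
    grand = mean(exec_means)
    msb = k * sum((m - grand) ** 2 for m in exec_means) / (r - 1)
    msw = sum((x - m) ** 2
              for s, m in zip(samples, exec_means) for x in s) / (r * (k - 1))
    s2_within = msw
    s2_between = max(0.0, (msb - msw) / k)   # truncate negative estimates
    return s2_within, s2_between

def plan_repetitions(s2_within, s2_between, cost_exec, cost_iter,
                     target_halfwidth, z=1.96):
    """Choose iterations per execution (k) and number of executions (r)."""
    if s2_between <= 0.0:
        k = 30   # no execution-level noise seen in the pilot: arbitrary cap
    else:
        # Optimal allocation: repeat more at the cheaper, noisier level.
        k = max(1, round(math.sqrt((cost_exec / cost_iter)
                                   * (s2_within / s2_between))))
    # Variance contributed by one execution's mean of k iterations.
    var_per_exec = s2_between + s2_within / k
    # Executions needed so that z * sqrt(var_per_exec / r) <= target_halfwidth.
    r = max(2, math.ceil(z ** 2 * var_per_exec / target_halfwidth ** 2))
    return r, k

if __name__ == "__main__":
    # Pilot: 3 executions x 5 iterations, times in seconds (made up).
    pilot = [[1.02, 1.01, 1.03, 1.02, 1.01],
             [1.08, 1.07, 1.09, 1.08, 1.08],
             [0.99, 1.00, 0.98, 0.99, 1.00]]
    s2_w, s2_b = variance_components(pilot)
    r, k = plan_repetitions(s2_w, s2_b, cost_exec=30.0, cost_iter=1.0,
                            target_halfwidth=0.02)
    print(f"within={s2_w:.5f} between={s2_b:.5f} -> {r} executions x {k} iterations")
```

Under such a rule, repetition is concentrated at the level where variance is large relative to its cost, which is the intuition the paper's methodology formalises with its own cost model.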

We present our methodology as a cookbook that guides researchers on the number of repetitions they should run to obtain reliable results. We also show how to present results with an effect size confidence interval. As an example, we show how to use our methodology to conduct throughput experiments with the DaCapo and SPEC CPU benchmarks on three recent platforms.
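
As an illustration of what an effect-size confidence interval looks like in practice, the sketch below computes a percentile-bootstrap interval for the ratio of two systems' mean times, resampling whole executions so that execution-level variation is reflected in the interval. This is a generic bootstrap, not the closed-form interval derived in the paper; the data and the name bootstrap_speedup_ci are invented for the example.

```python
# Illustration only: report a speedup as an effect-size confidence interval.
# Percentile bootstrap over per-execution means; this is a generic technique,
# not the closed-form interval derived in the paper. Data are fabricated.
import random
from statistics import mean

def bootstrap_speedup_ci(old_exec_means, new_exec_means,
                         n_boot=10_000, alpha=0.05, seed=42):
    """Point estimate and (1 - alpha) percentile-bootstrap CI for
    mean(old) / mean(new), resampling whole executions."""
    rng = random.Random(seed)
    ratios = []
    for _ in range(n_boot):
        old = [rng.choice(old_exec_means) for _ in old_exec_means]
        new = [rng.choice(new_exec_means) for _ in new_exec_means]
        ratios.append(mean(old) / mean(new))
    ratios.sort()
    lo = ratios[int(n_boot * alpha / 2)]
    hi = ratios[int(n_boot * (1 - alpha / 2)) - 1]
    return mean(old_exec_means) / mean(new_exec_means), lo, hi

if __name__ == "__main__":
    # Per-execution mean times in seconds for the old and new system (made up).
    old_sys = [1.05, 1.07, 1.03, 1.06, 1.08, 1.04]
    new_sys = [0.98, 1.00, 0.97, 1.01, 0.99, 0.98]
    est, lo, hi = bootstrap_speedup_ci(old_sys, new_sys)
    print(f"speedup (old/new): {est:.3f}, 95% CI [{lo:.3f}, {hi:.3f}]")
```

Reporting the result as, say, 1.06 with a 95% interval of [1.03, 1.09] tells the reader both how large the speedup is and how much it could plausibly vary, which a bare "6% faster" does not.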

Published in

ISMM '13: Proceedings of the 2013 International Symposium on Memory Management
June 2013, 140 pages
ISBN: 9781450321006
DOI: 10.1145/2491894
Copyright © 2013 ACM

Publisher

Association for Computing Machinery, New York, NY, United States

Acceptance Rates

ISMM '13 paper acceptance rate: 11 of 22 submissions, 50%. Overall acceptance rate: 72 of 156 submissions, 46%.
