Skip to main content
Erschienen in: International Journal on Software Tools for Technology Transfer 1/2019

Open Access 03.11.2017 | Regular Paper

Reliable benchmarking: requirements and solutions

verfasst von: Dirk Beyer, Stefan Löwe, Philipp Wendler

Erschienen in: International Journal on Software Tools for Technology Transfer | Ausgabe 1/2019

Einloggen

Aktivieren Sie unsere intelligente Suche, um passende Fachinhalte oder Patente zu finden.

search-config
loading …

Abstract

Benchmarking is a widely used method in experimental computer science, in particular, for the comparative evaluation of tools and algorithms. As a consequence, a number of questions need to be answered in order to ensure proper benchmarking, resource measurement, and presentation of results, all of which is essential for researchers, tool developers, and users, as well as for tool competitions. We identify a set of requirements that are indispensable for reliable benchmarking and resource measurement of time and memory usage of automatic solvers, verifiers, and similar tools, and discuss limitations of existing methods and benchmarking tools. Fulfilling these requirements in a benchmarking framework can (on Linux systems) currently only be done by using the cgroup and namespace features of the kernel. We developed BenchExec, a ready-to-use, tool-independent, and open-source implementation of a benchmarking framework that fulfills all presented requirements, making reliable benchmarking and resource measurement easy. Our framework is able to work with a wide range of different tools, has proven its reliability and usefulness in the International Competition on Software Verification, and is used by several research groups worldwide to ensure reliable benchmarking. Finally, we present guidelines on how to present measurement results in a scientifically valid and comprehensible way.

Sie haben noch keine Lizenz? Dann Informieren Sie sich jetzt über unsere Produkte:

Springer Professional "Wirtschaft+Technik"

Online-Abonnement

Mit Springer Professional "Wirtschaft+Technik" erhalten Sie Zugriff auf:

  • über 102.000 Bücher
  • über 537 Zeitschriften

aus folgenden Fachgebieten:

  • Automobil + Motoren
  • Bauwesen + Immobilien
  • Business IT + Informatik
  • Elektrotechnik + Elektronik
  • Energie + Nachhaltigkeit
  • Finance + Banking
  • Management + Führung
  • Marketing + Vertrieb
  • Maschinenbau + Werkstoffe
  • Versicherung + Risiko

Jetzt Wissensvorsprung sichern!

Springer Professional "Technik"

Online-Abonnement

Mit Springer Professional "Technik" erhalten Sie Zugriff auf:

  • über 67.000 Bücher
  • über 390 Zeitschriften

aus folgenden Fachgebieten:

  • Automobil + Motoren
  • Bauwesen + Immobilien
  • Business IT + Informatik
  • Elektrotechnik + Elektronik
  • Energie + Nachhaltigkeit
  • Maschinenbau + Werkstoffe




 

Jetzt Wissensvorsprung sichern!

Springer Professional "Wirtschaft"

Online-Abonnement

Mit Springer Professional "Wirtschaft" erhalten Sie Zugriff auf:

  • über 67.000 Bücher
  • über 340 Zeitschriften

aus folgenden Fachgebieten:

  • Bauwesen + Immobilien
  • Business IT + Informatik
  • Finance + Banking
  • Management + Führung
  • Marketing + Vertrieb
  • Versicherung + Risiko




Jetzt Wissensvorsprung sichern!

Anhänge
Nur mit Berechtigung zugänglich
Fußnoten
5
Our experience from competition organization shows that developers of complex tools are not always aware of how their system spawns child processes and how to properly terminate them.
 
7
i.e., with high cohesion and loose coupling
 
8
We experienced this when organizing SV-COMP’13, for a portfolio-based verifier. Initial CPU-time measurements were significantly too low, which was only discovered by chance. The verifier had to be patched to wait for its subprocesses and the benchmarks had to be rerun.
 
11
Systems can be even more complex and have more layers. However, the hierarchy presented here captures the facts that are most important for the performance of software from our target domain. Thus, we use this abstracted definition and nomenclature.
 
18
Or clear the caches with drop_caches.
 
19
Or use a library that does this reliably.
 
26
SV-COMP’16 for the first time required all participating teams to contribute such a module for their tool to BenchExec [6], leading to 21 new tools being integrated into BenchExec.
 
27
Tools that do not support this format can also be benchmarked. In this case, the property is not passed to the tool, but used only internally by BenchExec to determine the expected result.
 
29
For example, BenchExec is used to automatically check for regressions in the integration test-suite of CPAchecker.
 
30
We successfully use BenchExec on four different clusters, each under different administrative control and with software as old as SuSE Enterprise 11 and Linux 3.0, and on the machines of the student computer pool of our department.
 
32
Cf. SV-COMP benchmark definitions at https://​github.​com/​sosy-lab/​sv-comp
 
35
Providing a complete ready-to-use VM would achieve this, but this is typically not suited for replicating performance results.
 
40
For examples, cf. Tables 4 and 5 in [9]
 
47
cf. verify.sh in the CPBM package
 
51
Git revision 9d58031 from 2013-09-13 on https://​github.​com/​tkren/​vcwc
 
52
A utility for executing commands in a chroot environment, cf. http://​linux.​die.​net/​man/​1/​schroot
 
Literatur
1.
Zurück zum Zitat Balyo, T., Heule, M.J.H., Järvisalo, M.: SAT competition 2016: recent developments. In: Proceedings of AAAI Conference on Artificial Intelligence, pp. 5061–5063. AAAI Press (2017) Balyo, T., Heule, M.J.H., Järvisalo, M.: SAT competition 2016: recent developments. In: Proceedings of AAAI Conference on Artificial Intelligence, pp. 5061–5063. AAAI Press (2017)
2.
Zurück zum Zitat Barrett, C., Fontaine, P., Tinelli, C.: The SMT-LIB standard: version 2.5. Technical report, University of Iowa (2015). www.smt-lib.org Barrett, C., Fontaine, P., Tinelli, C.: The SMT-LIB standard: version 2.5. Technical report, University of Iowa (2015). www.​smt-lib.​org
3.
Zurück zum Zitat Beyer, D.: Competition on software verification (SV-COMP). In: Proceedings of TACAS, LNCS 7214, pp. 504–524. Springer (2012) Beyer, D.: Competition on software verification (SV-COMP). In: Proceedings of TACAS, LNCS 7214, pp. 504–524. Springer (2012)
4.
Zurück zum Zitat Beyer, D.: Second competition on software verification (Summary of SV-COMP 2013). In: Proceedings of TACAS, LNCS 7795, pp. 594–609. Springer (2013) Beyer, D.: Second competition on software verification (Summary of SV-COMP 2013). In: Proceedings of TACAS, LNCS 7795, pp. 594–609. Springer (2013)
5.
Zurück zum Zitat Beyer, D.: Software verification and verifiable witnesses (Report on SV-COMP 2015). In: Proceedings of TACAS, LNCS 9035, pp. 401–416. Springer (2015) Beyer, D.: Software verification and verifiable witnesses (Report on SV-COMP 2015). In: Proceedings of TACAS, LNCS 9035, pp. 401–416. Springer (2015)
6.
Zurück zum Zitat Beyer, D.: Reliable and reproducible competition results with BenchExec and witnesses (Report on SV-COMP 2016). In: Proceedings of TACAS, LNCS 9636, pp. 887–904. Springer (2016) Beyer, D.: Reliable and reproducible competition results with BenchExec and witnesses (Report on SV-COMP 2016). In: Proceedings of TACAS, LNCS 9636, pp. 887–904. Springer (2016)
7.
Zurück zum Zitat Beyer, D.: Software verification with validation of results (Report on SV-COMP 2017). In: Proceedings of TACAS, LNCS 10206, pp. 331–349. Springer (2017) Beyer, D.: Software verification with validation of results (Report on SV-COMP 2017). In: Proceedings of TACAS, LNCS 10206, pp. 331–349. Springer (2017)
8.
Zurück zum Zitat Beyer, D., Dresler, G., Wendler, P.: Software verification in the Google App-Engine cloud. In: Proceedings of CAV, LNCS 8559, pp. 327–333. Springer (2014) Beyer, D., Dresler, G., Wendler, P.: Software verification in the Google App-Engine cloud. In: Proceedings of CAV, LNCS 8559, pp. 327–333. Springer (2014)
9.
Zurück zum Zitat Beyer, D.. Löwe, S., Novikov, E., Stahlbauer, A., Wendler, P.: Precision reuse for efficient regression verification. In: Proceedings of FSE, pp. 389–399. ACM (2013) Beyer, D.. Löwe, S., Novikov, E., Stahlbauer, A., Wendler, P.: Precision reuse for efficient regression verification. In: Proceedings of FSE, pp. 389–399. ACM (2013)
10.
Zurück zum Zitat Beyer, D., Löwe, S., Wendler, P.: Benchmarking and resource measurement. In: Proceedings of SPIN, LNCS 9232, pp. 160–178. Springer (2015) Beyer, D., Löwe, S., Wendler, P.: Benchmarking and resource measurement. In: Proceedings of SPIN, LNCS 9232, pp. 160–178. Springer (2015)
11.
Zurück zum Zitat Brooks, A., Roper, M., Wood, M., Daly, J., Miller, J.: Replication’s role in software engineering. In: Guide to Advanced Empirical Software Engineering, pp. 365–379. Springer (2008) Brooks, A., Roper, M., Wood, M., Daly, J., Miller, J.: Replication’s role in software engineering. In: Guide to Advanced Empirical Software Engineering, pp. 365–379. Springer (2008)
12.
Zurück zum Zitat Charwat, G., Ianni, G., Krennwallner, T., Kronegger, M., Pfandler, A., Redl, C., Schwengerer, M., Spendier, L., Wallner, J., Xiao, G.: VCWC: a versioning competition workflow compiler. In: Proceedings of LPNMR, LNCS 8148, pp. 233–238. Springer (2013) Charwat, G., Ianni, G., Krennwallner, T., Kronegger, M., Pfandler, A., Redl, C., Schwengerer, M., Spendier, L., Wallner, J., Xiao, G.: VCWC: a versioning competition workflow compiler. In: Proceedings of LPNMR, LNCS 8148, pp. 233–238. Springer (2013)
13.
Zurück zum Zitat Cok, D.R., Déharbe, D., Weber, T.: The 2014 SMT competition. JSAT 9, 207–242 (2016)MathSciNet Cok, D.R., Déharbe, D., Weber, T.: The 2014 SMT competition. JSAT 9, 207–242 (2016)MathSciNet
14.
Zurück zum Zitat Collberg, C.S., Proebsting, T.A.: Repeatability in computer-systems research. Commun. ACM 59(3), 62–69 (2016)CrossRef Collberg, C.S., Proebsting, T.A.: Repeatability in computer-systems research. Commun. ACM 59(3), 62–69 (2016)CrossRef
15.
Zurück zum Zitat de Oliveira, A.B., Petkovich, J.-C., Fischmeister, S.: How much does memory layout impact performance? A wide study. In: Proceedings of REPRODUCE (2014) de Oliveira, A.B., Petkovich, J.-C., Fischmeister, S.: How much does memory layout impact performance? A wide study. In: Proceedings of REPRODUCE (2014)
16.
Zurück zum Zitat Gu, D., Verbrugge, C., Gagnon, E.: Code layout as a source of noise in JVM performance. Stud. Inform. Univ. 4(1), 83–99 (2005) Gu, D., Verbrugge, C., Gagnon, E.: Code layout as a source of noise in JVM performance. Stud. Inform. Univ. 4(1), 83–99 (2005)
17.
Zurück zum Zitat Handigol, N., Heller, B., Jeyakumar, V., Lantz, B., McKeown, N.: Reproducible network experiments using container-based emulation. In: Proceedings of CoNEXT, pp. 253–264. ACM (2012) Handigol, N., Heller, B., Jeyakumar, V., Lantz, B., McKeown, N.: Reproducible network experiments using container-based emulation. In: Proceedings of CoNEXT, pp. 253–264. ACM (2012)
18.
Zurück zum Zitat Hocko, M., Kalibera, T.: Reducing performance non-determinism via cache-aware page allocation strategies. In: Proceedings of ICPE, pp. 223–234. ACM (2010) Hocko, M., Kalibera, T.: Reducing performance non-determinism via cache-aware page allocation strategies. In: Proceedings of ICPE, pp. 223–234. ACM (2010)
19.
Zurück zum Zitat JCGM Working Group 2. International vocabulary of metrology—basic and general concepts and associated terms (VIM), 3rd edition. Technical Report JCGM 200:2012, BIPM (2012) JCGM Working Group 2. International vocabulary of metrology—basic and general concepts and associated terms (VIM), 3rd edition. Technical Report JCGM 200:2012, BIPM (2012)
20.
Zurück zum Zitat Juristo, N., Gómez, O.S.: Replication of software engineering experiments. In: Empirical Software Engineering and Verification, pp. 60–88. Springer (2012) Juristo, N., Gómez, O.S.: Replication of software engineering experiments. In: Empirical Software Engineering and Verification, pp. 60–88. Springer (2012)
21.
Zurück zum Zitat Kalibera, T., Bulej, L., Tuma, P.: Benchmark precision and random initial state. In: Proceedings of SPECTS, pp. 484–490. SCS (2005) Kalibera, T., Bulej, L., Tuma, P.: Benchmark precision and random initial state. In: Proceedings of SPECTS, pp. 484–490. SCS (2005)
22.
Zurück zum Zitat Kordon, F., Hulin-Hubard, F.: BenchKit, a tool for massive concurrent benchmarking. In: Proceedings of ACSD, pp. 159–165. IEEE (2014) Kordon, F., Hulin-Hubard, F.: BenchKit, a tool for massive concurrent benchmarking. In: Proceedings of ACSD, pp. 159–165. IEEE (2014)
23.
Zurück zum Zitat Krishnamurthi, S., Vitek, J.: The real software crisis: repeatability as a core value. Commun. ACM 58(3), 34–36 (2015)CrossRef Krishnamurthi, S., Vitek, J.: The real software crisis: repeatability as a core value. Commun. ACM 58(3), 34–36 (2015)CrossRef
24.
Zurück zum Zitat Mytkowicz, T., Diwan, A., Hauswirth, M., Sweeney, P.F.: Producing wrong data without doing anything obviously wrong! In: Proceedings of ASPLOS, pp. 265–276. ACM (2009) Mytkowicz, T., Diwan, A., Hauswirth, M., Sweeney, P.F.: Producing wrong data without doing anything obviously wrong! In: Proceedings of ASPLOS, pp. 265–276. ACM (2009)
25.
Zurück zum Zitat Petkovich, J., de Oliveira, A.B., Zhang, Y., Reidemeister, T., Fischmeister, S.: DataMill: a distributed heterogeneous infrastructure for robust experimentation. Softw. Pract. Exp. 46(10), 1411–1440 (2016) Petkovich, J., de Oliveira, A.B., Zhang, Y., Reidemeister, T., Fischmeister, S.: DataMill: a distributed heterogeneous infrastructure for robust experimentation. Softw. Pract. Exp. 46(10), 1411–1440 (2016)
26.
Zurück zum Zitat Rizzi, E.F., Elbaum, S., Dwyer, M.B.: On the techniques we create, the tools we build, and their misalignments: a study of Klee. In: Proceedings of ICSE, pp. 132–143. ACM (2016) Rizzi, E.F., Elbaum, S., Dwyer, M.B.: On the techniques we create, the tools we build, and their misalignments: a study of Klee. In: Proceedings of ICSE, pp. 132–143. ACM (2016)
27.
Zurück zum Zitat Roussel, O.: Controlling a solver execution with the runsolver tool. JSAT 7, 139–144 (2011)MathSciNetMATH Roussel, O.: Controlling a solver execution with the runsolver tool. JSAT 7, 139–144 (2011)MathSciNetMATH
28.
Zurück zum Zitat Singh, B., Srinivasan, V.: Containers: challenges with the memory resource controller and its performance. In: Proceedings of Ottawa Linux Symposium (OLS), pp. 209–222 (2007) Singh, B., Srinivasan, V.: Containers: challenges with the memory resource controller and its performance. In: Proceedings of Ottawa Linux Symposium (OLS), pp. 209–222 (2007)
29.
Zurück zum Zitat Stump, A., Sutcliffe, G., Tinelli, C.: StarExec: a cross-community infrastructure for logic solving. In: Proceedings of IJCAR, LNCS 8562, pp. 367–373. Springer (2014) Stump, A., Sutcliffe, G., Tinelli, C.: StarExec: a cross-community infrastructure for logic solving. In: Proceedings of IJCAR, LNCS 8562, pp. 367–373. Springer (2014)
30.
Zurück zum Zitat Suh, Y.-K., Snodgrass, R .T., Kececioglu, J .D., Downey, P .J., Maier, R .S., Yi, C.: EMP: execution time measurement protocol for compute-bound programs. Softw. Pract. Exp. 47(4), 559–597 (2017)CrossRef Suh, Y.-K., Snodgrass, R .T., Kececioglu, J .D., Downey, P .J., Maier, R .S., Yi, C.: EMP: execution time measurement protocol for compute-bound programs. Softw. Pract. Exp. 47(4), 559–597 (2017)CrossRef
31.
Zurück zum Zitat Tichy, W.F.: Should computer scientists experiment more? IEEE Comput. 31(5), 32–40 (1998)CrossRef Tichy, W.F.: Should computer scientists experiment more? IEEE Comput. 31(5), 32–40 (1998)CrossRef
32.
Zurück zum Zitat Visser, W., Geldenhuys, J., Dwyer, M.B.: Green: reducing, reusing and recycling constraints in program analysis. In: Proceedings of FSE, pp. 58:1–58:11. ACM (2012) Visser, W., Geldenhuys, J., Dwyer, M.B.: Green: reducing, reusing and recycling constraints in program analysis. In: Proceedings of FSE, pp. 58:1–58:11. ACM (2012)
33.
Zurück zum Zitat Vitek, J., Kalibera, T.: Repeatability, reproducibility, and rigor in systems research. In: Proceedings of EMSOFT, pp. 33–38. ACM (2011) Vitek, J., Kalibera, T.: Repeatability, reproducibility, and rigor in systems research. In: Proceedings of EMSOFT, pp. 33–38. ACM (2011)
Metadaten
Titel
Reliable benchmarking: requirements and solutions
verfasst von
Dirk Beyer
Stefan Löwe
Philipp Wendler
Publikationsdatum
03.11.2017
Verlag
Springer Berlin Heidelberg
Erschienen in
International Journal on Software Tools for Technology Transfer / Ausgabe 1/2019
Print ISSN: 1433-2779
Elektronische ISSN: 1433-2787
DOI
https://doi.org/10.1007/s10009-017-0469-y

Weitere Artikel der Ausgabe 1/2019

International Journal on Software Tools for Technology Transfer 1/2019 Zur Ausgabe