
International Journal on Software Tools for Technology Transfer


Open Access | 3 November 2017 | Regular Paper

Reliable benchmarking: requirements and solutions

Authors: Dirk Beyer, Stefan Löwe, Philipp Wendler

Published in: International Journal on Software Tools for Technology Transfer | Issue 1/2019


Abstract

Benchmarking is a widely used method in experimental computer science, in particular, for the comparative evaluation of tools and algorithms. As a consequence, a number of questions need to be answered in order to ensure proper benchmarking, resource measurement, and presentation of results, all of which is essential for researchers, tool developers, and users, as well as for tool competitions. We identify a set of requirements that are indispensable for reliable benchmarking and resource measurement of time and memory usage of automatic solvers, verifiers, and similar tools, and discuss limitations of existing methods and benchmarking tools. Fulfilling these requirements in a benchmarking framework can (on Linux systems) currently only be done by using the cgroup and namespace features of the kernel. We developed BenchExec, a ready-to-use, tool-independent, and open-source implementation of a benchmarking framework that fulfills all presented requirements, making reliable benchmarking and resource measurement easy. Our framework is able to work with a wide range of different tools, has proven its reliability and usefulness in the International Competition on Software Verification, and is used by several research groups worldwide to ensure reliable benchmarking. Finally, we present guidelines on how to present measurement results in a scientifically valid and comprehensible way.


https://www.spec.org

http://www.tpc.org

http://nlrp.ipd.kit.edu

Cf. ACM’s guideline: https://www.acm.org/publications/policies/artifact-review-badging

Our experience from competition organization shows that developers of complex tools are not always aware of how their system spawns child processes and how to properly terminate them.

https://www.sosy-lab.org/research/benchmarking

i.e., with high cohesion and loose coupling

We experienced this when organizing SV-COMP’13, for a portfolio-based verifier. Initial CPU-time measurements were significantly too low, which was only discovered by chance. The verifier had to be patched to wait for its subprocesses and the benchmarks had to be rerun.
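The effect described in this footnote can be reproduced on a small scale. The following is a minimal sketch, not the measurement code used in SV-COMP or BenchExec: it assumes a POSIX system with Python 3, and the helper names (`children_cpu_time`, `run_tool`) and the 0.3 s busy loop are invented for illustration. It relies on `resource.getrusage(RUSAGE_CHILDREN)`, which only accounts for descendants that have terminated and been waited for, so a tool that exits without waiting for its worker appears to have used far less CPU time than it actually caused.

```python
import resource
import subprocess
import sys

# Hypothetical worker: burns roughly 0.3 s of CPU time, then exits.
WORKER = (
    "import time\n"
    "end = time.process_time() + 0.3\n"
    "while time.process_time() < end:\n"
    "    pass\n"
)


def children_cpu_time():
    """CPU time (user + system) of all waited-for descendants so far."""
    usage = resource.getrusage(resource.RUSAGE_CHILDREN)
    return usage.ru_utime + usage.ru_stime


def run_tool(wait_for_worker):
    """Run a stand-in 'tool' that delegates its work to a subprocess.

    Returns the CPU time that this (measuring) process attributes to the tool.
    """
    tool = (
        "import subprocess, sys\n"
        f"p = subprocess.Popen([sys.executable, '-c', {WORKER!r}])\n"
    )
    if wait_for_worker:
        # A well-behaved tool waits, so the worker's usage is passed upward.
        tool += "p.wait()\n"
    before = children_cpu_time()
    subprocess.run(
        [sys.executable, "-c", tool],
        stdin=subprocess.DEVNULL,
        stdout=subprocess.DEVNULL,
        stderr=subprocess.DEVNULL,
        check=True,
    )
    return children_cpu_time() - before


if __name__ == "__main__":
    print(f"tool waits for worker:   {run_tool(True):.2f} s CPU time")
    print(f"tool exits without wait: {run_tool(False):.2f} s CPU time")
```

When the tool waits for its worker, the reported CPU time includes the worker's share; when it does not, only the tool's own startup cost is counted. Kernel-level accounting via cgroups, which the abstract names as the basis for reliable measurement, tracks all processes in the group and therefore does not depend on the tool waiting for its children.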

http://man7.org/linux/man-pages/man2/setrlimit.2.html

http://man7.org/linux/man-pages/man2/setpgrp.2.html

Systems can be even more complex and have more layers. However, the hierarchy presented here captures the facts that are most important for the performance of software from our target domain. Thus, we use this abstracted definition and nomenclature.

https://cpachecker.sosy-lab.org

https://svn.sosy-lab.org/software/cpachecker/trunk

https://www.sosy-lab.org/research/benchmarking#benchmarks

https://perf.wiki.kernel.org

http://libcg.sourceforge.net

https://www.kernel.org/doc/Documentation/cgroup-v1

Or clear the caches with drop_caches.

Or use a library that does this reliably.

http://man7.org/linux/man-pages/man7/namespaces.7.html

https://www.kernel.org/doc/Documentation/filesystems/overlayfs.txt

https://github.com/sosy-lab/benchexec

https://github.com/sosy-lab/benchexec/blob/master/doc/INSTALL.md

https://github.com/sosy-lab/benchexec/blob/master/doc/run-results.md

https://github.com/sosy-lab/benchexec/blob/master/doc/benchexec.md

SV-COMP’16 for the first time required all participating teams to contribute such a module for their tool to BenchExec [6], leading to 21 new tools being integrated into BenchExec.

Tools that do not support this format can also be benchmarked. In this case, the property is not passed to the tool, but used only internally by BenchExec to determine the expected result.

https://www.sosy-lab.org/research/benchmarking#tables

For example, BenchExec is used to automatically check for regressions in the integration test-suite of CPAchecker.

We successfully use BenchExec on four different clusters, each under different administrative control and with software as old as SuSE Enterprise 11 and Linux 3.0, and on the machines of the student computer pool of our department.

Cf. https://docs.docker.com/engine/security/security/#docker-daemon-attack-surface

Cf. SV-COMP benchmark definitions at https://github.com/sosy-lab/sv-comp

https://www.ctan.org/pkg/pgfplots

https://github.com/sosy-lab/benchexec/tree/master/contrib/plots

Providing a complete ready-to-use VM would achieve this, but this is typically not suited for replicating performance results.

Cf. the instructions of the publisher, for example https://www.acm.org/publications/policies/dlinclusions, http://www.ieee.org/documents/ieee-supplemental-material-overview.zip, and https://www.springer.com/gp/authors-editors/journal-author/journal-author-helpdesk/preparation/1276#c40940

For example, www.figshare.com or www.runmycode.org

https://www.ctan.org/pkg/siunitx

https://www.ctan.org/pkg/pgfplotstable

For examples, cf. Tables 4 and 5 in [9]

https://www.acm.org/publications/policies/artifact-review-badging

http://evaluate.inf.usi.ch/artifacts/aea

http://fmv.jku.at/runlim

http://alviano.net/2014/02/26

Git revision b9b2f11 from 2017-05-02 on https://github.com/alviano/python/tree/master/pyrunlim

http://www.cprover.org/software/benchmarks

Cf. verify.sh in the CPBM package

http://smt-exec.org

http://smtexec.cs.uiowa.edu/TreeLimitedRun.c

http://www.cril.univ-artois.fr/~roussel/runsolver

Git revision 9d58031 from 2013-09-13 on https://github.com/tkren/vcwc

A utility for executing commands in a chroot environment, cf. http://linux.die.net/man/1/schroot

http://www.cosyverif.org/benchkit.php

http://sebastien.godard.pagesperso-orange.fr

https://github.com/sosy-lab/benchexec#authors

https://www.open-mpi.org/projects/hwloc

1. Balyo, T., Heule, M.J.H., Järvisalo, M.: SAT competition 2016: recent developments. In: Proceedings of AAAI Conference on Artificial Intelligence, pp. 5061–5063. AAAI Press (2017)

2. Barrett, C., Fontaine, P., Tinelli, C.: The SMT-LIB standard: version 2.5. Technical report, University of Iowa (2015). www.smt-lib.org

3. Beyer, D.: Competition on software verification (SV-COMP). In: Proceedings of TACAS, LNCS 7214, pp. 504–524. Springer (2012)

4. Beyer, D.: Second competition on software verification (Summary of SV-COMP 2013). In: Proceedings of TACAS, LNCS 7795, pp. 594–609. Springer (2013)

5. Beyer, D.: Software verification and verifiable witnesses (Report on SV-COMP 2015). In: Proceedings of TACAS, LNCS 9035, pp. 401–416. Springer (2015)

6. Beyer, D.: Reliable and reproducible competition results with BenchExec and witnesses (Report on SV-COMP 2016). In: Proceedings of TACAS, LNCS 9636, pp. 887–904. Springer (2016)

7. Beyer, D.: Software verification with validation of results (Report on SV-COMP 2017). In: Proceedings of TACAS, LNCS 10206, pp. 331–349. Springer (2017)

8. Beyer, D., Dresler, G., Wendler, P.: Software verification in the Google App-Engine cloud. In: Proceedings of CAV, LNCS 8559, pp. 327–333. Springer (2014)

9. Beyer, D., Löwe, S., Novikov, E., Stahlbauer, A., Wendler, P.: Precision reuse for efficient regression verification. In: Proceedings of FSE, pp. 389–399. ACM (2013)

10. Beyer, D., Löwe, S., Wendler, P.: Benchmarking and resource measurement. In: Proceedings of SPIN, LNCS 9232, pp. 160–178. Springer (2015)

11. Brooks, A., Roper, M., Wood, M., Daly, J., Miller, J.: Replication’s role in software engineering. In: Guide to Advanced Empirical Software Engineering, pp. 365–379. Springer (2008)

12. Charwat, G., Ianni, G., Krennwallner, T., Kronegger, M., Pfandler, A., Redl, C., Schwengerer, M., Spendier, L., Wallner, J., Xiao, G.: VCWC: a versioning competition workflow compiler. In: Proceedings of LPNMR, LNCS 8148, pp. 233–238. Springer (2013)

13. Cok, D.R., Déharbe, D., Weber, T.: The 2014 SMT competition. JSAT 9, 207–242 (2016)

14. Collberg, C.S., Proebsting, T.A.: Repeatability in computer-systems research. Commun. ACM 59(3), 62–69 (2016)

15. de Oliveira, A.B., Petkovich, J.-C., Fischmeister, S.: How much does memory layout impact performance? A wide study. In: Proceedings of REPRODUCE (2014)

16. Gu, D., Verbrugge, C., Gagnon, E.: Code layout as a source of noise in JVM performance. Stud. Inform. Univ. 4(1), 83–99 (2005)

17. Handigol, N., Heller, B., Jeyakumar, V., Lantz, B., McKeown, N.: Reproducible network experiments using container-based emulation. In: Proceedings of CoNEXT, pp. 253–264. ACM (2012)

18. Hocko, M., Kalibera, T.: Reducing performance non-determinism via cache-aware page allocation strategies. In: Proceedings of ICPE, pp. 223–234. ACM (2010)

19. JCGM Working Group 2. International vocabulary of metrology—basic and general concepts and associated terms (VIM), 3rd edition. Technical Report JCGM 200:2012, BIPM (2012)

20. Juristo, N., Gómez, O.S.: Replication of software engineering experiments. In: Empirical Software Engineering and Verification, pp. 60–88. Springer (2012)

21. Kalibera, T., Bulej, L., Tuma, P.: Benchmark precision and random initial state. In: Proceedings of SPECTS, pp. 484–490. SCS (2005)

22. Kordon, F., Hulin-Hubard, F.: BenchKit, a tool for massive concurrent benchmarking. In: Proceedings of ACSD, pp. 159–165. IEEE (2014)

23. Krishnamurthi, S., Vitek, J.: The real software crisis: repeatability as a core value. Commun. ACM 58(3), 34–36 (2015)

24. Mytkowicz, T., Diwan, A., Hauswirth, M., Sweeney, P.F.: Producing wrong data without doing anything obviously wrong! In: Proceedings of ASPLOS, pp. 265–276. ACM (2009)

25. Petkovich, J., de Oliveira, A.B., Zhang, Y., Reidemeister, T., Fischmeister, S.: DataMill: a distributed heterogeneous infrastructure for robust experimentation. Softw. Pract. Exp. 46(10), 1411–1440 (2016)

26. Rizzi, E.F., Elbaum, S., Dwyer, M.B.: On the techniques we create, the tools we build, and their misalignments: a study of Klee. In: Proceedings of ICSE, pp. 132–143. ACM (2016)

27. Roussel, O.: Controlling a solver execution with the runsolver tool. JSAT 7, 139–144 (2011)

28. Singh, B., Srinivasan, V.: Containers: challenges with the memory resource controller and its performance. In: Proceedings of Ottawa Linux Symposium (OLS), pp. 209–222 (2007)

29. Stump, A., Sutcliffe, G., Tinelli, C.: StarExec: a cross-community infrastructure for logic solving. In: Proceedings of IJCAR, LNCS 8562, pp. 367–373. Springer (2014)

30. Suh, Y.-K., Snodgrass, R.T., Kececioglu, J.D., Downey, P.J., Maier, R.S., Yi, C.: EMP: execution time measurement protocol for compute-bound programs. Softw. Pract. Exp. 47(4), 559–597 (2017)

31. Tichy, W.F.: Should computer scientists experiment more? IEEE Comput. 31(5), 32–40 (1998)

32. Visser, W., Geldenhuys, J., Dwyer, M.B.: Green: reducing, reusing and recycling constraints in program analysis. In: Proceedings of FSE, pp. 58:1–58:11. ACM (2012)

33. Vitek, J., Kalibera, T.: Repeatability, reproducibility, and rigor in systems research. In: Proceedings of EMSOFT, pp. 33–38. ACM (2011)

Title: Reliable benchmarking: requirements and solutions
Authors: Dirk Beyer, Stefan Löwe, Philipp Wendler
Publication date: 3 November 2017
Publisher: Springer Berlin Heidelberg
Published in: International Journal on Software Tools for Technology Transfer / Issue 1/2019
Print ISSN: 1433-2779
Electronic ISSN: 1433-2787
DOI: https://doi.org/10.1007/s10009-017-0469-y
