Abstract
Being able to repeat experiments is considered a hallmark of the scientific method, used to confirm or refute hypotheses and previously obtained results. Such repetition can take many forms, from precise repetition using the original experimental artifacts to conceptual reproduction of the main experimental idea using new artifacts. Furthermore, the conclusions of previous work can also be corroborated using a different experimental methodology altogether. To promote a better understanding and use of such methodologies, we propose precise definitions for the different terms and suggest when and why each should be used.