
From Repeatability to Reproducibility and Corroboration

Published: 20 January 2015

Abstract

Being able to repeat experiments is considered a hallmark of the scientific method, used to confirm or refute hypotheses and previously obtained results. But this can take many forms, from precise repetition using the original experimental artifacts, to conceptual reproduction of the main experimental idea using new artifacts. Furthermore, the conclusions from previous work can also be corroborated using a different experimental methodology altogether. In order to promote a better understanding and use of such methodologies we propose precise definitions for different terms, and suggest when and why each should be used.



          Reviews

          Andrew Brooks

          The result of a single experiment is rarely believed. Mistakes might have been made when designing the experiment, conducting it, or analyzing the data. A result becomes an accepted fact only after others have successfully redone the experiment. To clarify what it means to redo an experiment, five terms are proposed and discussed. Repetition is proposed as meaning to redo the experiment exactly, using the same artifacts. Replication is proposed as meaning to redo the experiment while having access only to descriptions of the artifacts. Variation is proposed as meaning to redo the experiment with controlled modifications in order to establish the scope of the result. Reproduction is proposed as meaning to redo the experiment with conceptually similar artifacts. Corroboration is proposed as meaning to provide evidence in support of the result by using a different approach. Section 9 contains an example of a caching experiment that effectively illustrates the use of these five terms. There are many useful discussions. For example, the conditions for exact repeatability are enumerated, along with the impediments to achieving it. Because of the transient nature of independent repositories of experimental software and data, it is suggested that such repositories are best curated by professional organizations. The discussion of meta-analysis, while useful, should have been expanded by drawing on lessons learned from medical research. The case for using the five terms is well made. This paper is recommended to all those engaged in experimental work.


          • Published in

            ACM SIGOPS Operating Systems Review, Volume 49, Issue 1
            Special Issue on Repeatability and Sharing of Experimental Artifacts
            January 2015, 155 pages
            ISSN: 0163-5980
            DOI: 10.1145/2723872

            Copyright © 2015 Author

            Publisher: Association for Computing Machinery, New York, NY, United States

            Qualifiers: research-article
