Abstract
Being able to repeat experiments is considered a hallmark of the scientific method, used to confirm or refute hypotheses and previously obtained results. Such repetition can take many forms, from precise repetition using the original experimental artifacts to conceptual reproduction of the main experimental idea using new artifacts. Furthermore, the conclusions of previous work can also be corroborated using a different experimental methodology altogether. To promote a better understanding and use of such methodologies, we propose precise definitions for the different terms and suggest when and why each should be used.