A practical guide to controlled experiments of software engineering tools with human participants

Authors: Amy J. Ko, Thomas D. LaToza, Margaret M. Burnett

Published in: Empirical Software Engineering, Issue 1/2015 (01-02-2015)

Abstract

Empirical studies, often in the form of controlled experiments, have been widely adopted in software engineering research as a way to evaluate the merits of new software engineering tools. However, controlled experiments involving human participants actually using new tools are still rare and, when they are conducted, some have serious validity concerns. Recent research has also shown that many software engineering researchers view this form of tool evaluation as too risky and too difficult to conduct, since such experiments might ultimately lead to inconclusive or negative results. In this paper, we aim both to help researchers minimize the risks of this form of tool evaluation and to increase its quality, by offering practical methodological guidance on designing and running controlled experiments with developers. Our guidance fills gaps in the empirical literature by explaining, from a practical perspective, options in the recruitment and selection of human participants, informed consent, experimental procedures, demographic measurements, group assignment, training, the selection and design of tasks, the measurement of common outcome variables such as success and time on task, and study debriefing. Throughout, we situate this guidance in the results of a new systematic review of the tool evaluations published in over 1,700 software engineering papers from 2001 to 2011.
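As a concrete illustration of two of the topics the guidance covers, group assignment and the measurement of success and time on task, the sketch below shows one simple way an experimenter might script these steps. It is not taken from the paper: the blocked randomization strategy, the Python setting, and the names assign_balanced, run_task, and task_fn are illustrative assumptions.

    import random
    import time

    def assign_balanced(participant_ids, conditions=("control", "treatment"), seed=42):
        """Blocked random assignment: shuffle participants once with a fixed seed,
        then deal them round-robin into conditions so group sizes differ by at most one."""
        rng = random.Random(seed)
        ids = list(participant_ids)
        rng.shuffle(ids)
        return {pid: conditions[i % len(conditions)] for i, pid in enumerate(ids)}

    def run_task(participant_id, condition, task_fn, time_limit_s=1800):
        """Record time on task and success for one participant.
        task_fn is a stand-in for the actual instrumented task session."""
        start = time.monotonic()
        succeeded = task_fn(participant_id, condition)   # e.g., was the seeded bug fixed?
        elapsed = time.monotonic() - start
        return {
            "participant": participant_id,
            "condition": condition,
            "success": bool(succeeded),
            "time_on_task_s": min(elapsed, time_limit_s),  # cap at the session time limit
        }

Choices such as capping elapsed time at the session limit and defining success criteria before data collection are exactly the kinds of measurement decisions the paper's guidance discusses.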


Footnotes
1. Strictly speaking, an experiment is by definition quantitative (Basili 2007). Other kinds of empirical studies are not technically experiments; thus, in this paper, when we refer to experiments, we mean quantitative experiments.
2. Studies using this last method were often referred to by their authors as “case studies,” but this usage conflicts with the notion of a case study as an empirical investigation of some phenomenon within a real-life context (Yin 2003), since the tool use experience reports in these papers were not conducted in real-life contexts. “Case study” was also used to refer to evaluations without human use.
Literature
Anderson JR, Reiser BJ (1985) The LISP tutor. Byte 10:159–175
Atkins DL, Ball T, Graves TL, Mockus A (2002) Using version control data to evaluate the impact of software tools: A case study of the version editor. IEEE Trans Softw Eng 28(7):625–637
Bangor A, Kortum PT, Miller JT (2008) An empirical evaluation of the system usability scale. Int J Human-Comput Interact 24(6):574–594
Basili VR (1993) The experimental paradigm in software engineering. Int Work Exp Eng Issues: Crit Assess Futur Dir 706:3–12
Basili VR (1996) The role of experimentation in software engineering: Past, current, and future. International Conference on Software Engineering, 442–449
Basili VR (2007) The role of controlled experiments in software engineering research. Empirical Software Engineering Issues, LNCS 4336, Basili V et al. (Eds.), Springer-Verlag, 33–37
Basili VR, Selby RW, Hutchens DH (1986) Experimentation in software engineering. IEEE Trans Softw Eng, 733–743, July
Basili VR, Caldiera G, Rombach HD (1994) The goal question metric approach. In Encyclopedia of Software Engineering, John Wiley and Sons, 528–532
Beringer P (2004) Using students as subjects in requirements prioritization. International Symposium on Empirical Software Engineering, 167–176
Boehm BW, Papaccio PN (1988) Understanding and controlling software costs. IEEE Trans Softw Eng SE-14(10):1462–1477
Breaugh JA (2003) Effect size estimation: factors to consider and mistakes to avoid. J Manag 29(1):79–97
Bruun A, Gull P, Hofmeister L, Stage J (2009) Let your users do the testing: a comparison of three remote asynchronous usability testing methods. ACM Conference on Human Factors in Computing Systems, 1619–1628
Buse RPL, Sadowski C, Weimer W (2011) Benefits and barriers of user evaluation in software engineering research. ACM Conference on Systems, Programming, Languages and Applications
Carver J, Jaccheri L, Morasca S, Shull F (2003) Issues in using students in empirical studies in software engineering education. Software Metrics Symposium, 239–249
Chuttur MY (2009) Overview of the technology acceptance model: Origins, developments and future directions. Indiana University, USA, Sprouts: Working Papers on Information Systems
Davis FD (1989) Perceived usefulness, perceived ease of use, and user acceptance of information technology. MIS Q 13(3):319
Dell N, Vaidyanathan V, Medhi I, Cutrell E, Thies W (2012) “Yours is better!” Participant response bias in HCI. ACM Conference on Human Factors in Computing Systems, 1321–1330
Dieste O, Grimán A, Juristo N, Saxena H (2011) Quantitative determination of the relationship between internal validity and bias in software engineering experiments: Consequences for systematic literature reviews. International Symposium on Empirical Software Engineering and Measurement, 285–294
Dig D, Manzoor K, Johnson R, Nguyen TN (2008) Effective software merging in the presence of object-oriented refactorings. IEEE Trans Softw Eng 34(3):321–335
Dybå T, Kampenes V, Sjøberg D (2006) A systematic review of statistical power in software engineering experiments. Inf Softw Technol 48(8):745–755
Dybå T, Prikladnicki R, Rönkkö K, Seaman C, Sillito J (2011) Qualitative research in software engineering. Empir Softw Eng 16(4):425–429
Easterbrook S, Singer J, Storey M, Damian D (2008) Selecting empirical methods for software engineering research. In Guide to Advanced Empirical Software Engineering, Springer, 285–311
Feigenspan J, Kastner C, Liebig J, Apel S, Hanenberg S (2012) Measuring programming experience. International Conference on Program Comprehension, 73–82
Fenton N (1993) How effective are software engineering methods? J Syst Softw 22(2):141–146
Flyvbjerg B (2006) Five misunderstandings about case study research. Qual Inq 12(2):219–245
Glass RL, Vessey I, Ramesh V (2002) Research in software engineering: an analysis of the literature. Inf Softw Technol 44(8):491–506
Golden E, John BE, Bass L (2005) The value of a usability-supporting architectural pattern in software architecture design: a controlled experiment. ACM/IEEE International Conference on Software Engineering
Greenberg S, Buxton B (2008) Usability evaluation considered harmful (some of the time). ACM Conference on Human Factors in Computing Systems, 111–120
Gwet KL (2010) Handbook of inter-rater reliability, 2nd edn. Advanced Analytics, Gaithersburg
Hanenberg S (2010) An experiment about static and dynamic type systems: doubts about the positive impact of static type systems on development time. ACM International Conference on Object-Oriented Programming Systems Languages and Applications (OOPSLA), 22–35
Hannay JE, Sjøberg DIK, Dyba T (2007) A systematic review of theory use in software engineering experiments. IEEE Trans Softw Eng 33(2):87–107
Holmes R, Walker RJ (2013) Systematizing pragmatic software reuse. ACM Trans Softw Eng Methodol 21(4), Article 20: 44 pages
John B, Packer H (1995) Learning and using the cognitive walkthrough method: a case study approach. ACM Conference on Human Factors in Computing Systems, 429–436
Juristo N, Moreno AM (2001) Basics of software engineering experimentation. Springer
Kampenes V, Dybå T, Hannay J, Sjøberg D (2007) A systematic review of effect size in software engineering experiments. Inf Softw Technol 49(11–12):1073–1086
Kaptein M, Robertson J (2012) Rethinking statistical analysis methods for CHI. ACM Conference on Human Factors in Computing Systems, 1105–1114
Kelleher C, Pausch R (2005) Stencils-based tutorials: design and evaluation. ACM Conference on Human Factors in Computing Systems, 541–550
Keppel G (1982) Design and analysis: a researcher’s handbook, 2nd edn. Prentice-Hall, Englewood Cliffs
Kersten M, Murphy G (2006) Using task context to improve programmer productivity. ACM Symposium on Foundations of Software Engineering, 1–11
Kitchenham BA, Pfleeger SL, Pickard LM, Jones PW, Hoaglin DC, El Emam K, Rosenberg J (2002) Preliminary guidelines for empirical research in software engineering. IEEE Trans Softw Eng 28(8):721–734
Kitchenham BA, Brereton P, Turner M, Niazi MK, Linkman S, Pretorius R, Budgen D (2010) Refining the systematic literature review process—two participant-observer case studies. Empir Softw Eng 15(6):618–653
Kittur A, Chi EH, Suh B (2008) Crowdsourcing user studies with Mechanical Turk. ACM Conference on Human Factors in Computing Systems, 453–456
Ko AJ, Myers BA (2009) Finding causes of program output with the Java Whyline. ACM Conference on Human Factors in Computing Systems, 1569–1578
Ko AJ, Wobbrock JO (2010) Cleanroom: edit-time error detection with the uniqueness heuristic. IEEE Symposium on Visual Languages and Human-Centric Computing, 7–14
Ko AJ, Burnett MM, Green TRG, Rothermel KJ, Cook CR (2002) Using the Cognitive Walkthrough to improve the design of a visual programming experiment. J Vis Lang Comput 13:517–544
Ko AJ, DeLine R, Venolia G (2007) Information needs in collocated software development teams. International Conference on Software Engineering (ICSE)
LaToza TD, Myers BA (2010) Developers ask reachability questions. International Conference on Software Engineering (ICSE), 185–194
LaToza TD, Myers BA (2011) Visualizing call graphs. IEEE Visual Languages and Human-Centric Computing (VL/HCC), Pittsburgh, PA
LaToza TD, Myers BA (2011) Designing useful tools for developers. ACM SIGPLAN Workshop on Evaluation and Usability of Programming Languages and Tools (PLATEAU), 45–50
Lazar J, Feng JH, Hochheiser H (2010) Research methods in human-computer interaction. Wiley
Lott C, Rombach D (1996) Repeatable software engineering experiments for comparing defect-detection techniques. Empir Softw Eng 1:241–277
Martin DW (1996) Doing psychology experiments, 4th edn. Brooks/Cole, Pacific Grove
McDowall D, McCleary R, Meidinger E, Hay RA (1980) Interrupted time series analysis, 1st edn. SAGE Publications
Murphy GC, Walker RJ, Baniassad ELA (1999) Evaluating emerging software development technologies: lessons learned from assessing aspect-oriented programming. IEEE Trans Softw Eng 25(4):438–455
Murphy-Hill E, Murphy GC, Griswold WG (2010) Understanding context: creating a lasting impact in experimental software engineering research. FSE/SDP Workshop on Future of Software Engineering Research
Newell A (1973) You can’t play 20 questions with nature and win: projective comments on the papers of this symposium. In: Chase WG (ed) Visual information processing. Academic, New York
Nickerson RS (1998) Confirmation bias: a ubiquitous phenomenon in many guises. Rev Gen Psychol 2(2):175–220
Nimmer JW, Ernst MD (2002) Invariant inference for static checking: an empirical evaluation. SIGSOFT Softw Eng Notes 27(6):11–20
Olsen DR (2007) Evaluating user interface systems research. ACM Symposium on User Interface Software and Technology, 251–258
Polson P, Lewis C, Rieman J, Wharton C (1992) Cognitive walkthroughs: a method for theory-based evaluation of user interfaces. Int J Human-Comput Interact 36:741–773
Ramesh V, Glass RL, Vessey I (2004) Research in computer science: an empirical study. J Syst Softw 70(1–2):165–176
Rombach HD, Basili VR, Selby RW (1992) Experimental software engineering issues: critical assessment and future directions. International Workshop, Dagstuhl Castle (Germany), Sept. 14–18
Rosenthal R (1966) Experimenter effects in behavioral research. Appleton, New York
Rosenthal R, Rosnow R (2007) Essentials of behavioral research: methods and data analysis, 3rd edn. McGraw-Hill
Rosenthal R, Rubin DB (1978) Interpersonal expectancy effects: the first 345 studies. Behav Brain Sci 1(3):377–386
Ross J, Irani L, Silberman MS, Zaldivar A, Tomlinson B (2010) Who are the crowdworkers? Shifting demographics in Mechanical Turk. ACM Conference on Human Factors in Computing Systems, 2863–2872
Rothermel KJ, Cook C, Burnett MM, Schonfeld J, Green TRG, Rothermel G (2000) WYSIWYT testing in the spreadsheet paradigm: an empirical evaluation. ACM International Conference on Software Engineering, 230–239
Rubin J, Chisnell D (2008) Handbook of usability testing: how to plan, design, and conduct effective tests. Wiley
Runeson P, Höst M (2009) Guidelines for conducting and reporting case study research in software engineering. Empir Softw Eng 14(2):131–164
Shull F, Singer J, Sjøberg DIK (2006) Guide to advanced empirical software engineering. Springer
Sillito J, Murphy G, De Volder K (2006) Questions programmers ask during software evolution tasks. ACM SIGSOFT/FSE, 23–34
Sjøberg DIK, Dybå T, Jørgensen M (2007) The future of empirical methods in software engineering research. In Future of Software Engineering (FOSE ’07), 358–378
Sjøberg D, Anda B, Arisholm E, Dyba T, Jorgensen M, Karahasanovic A, Koren E, Voka M (2003) Conducting realistic experiments in software engineering. Empirical Software Engineering and Measurement
Sjøberg DIK, Hannay JE, Hansen O, Kampenes VB, Karahasanović A, Liborg N-K, Rekdal AC (2005) A survey of controlled experiments in software engineering. IEEE Trans Softw Eng 31(9):733–753
Steele CM, Aronson J (1995) Stereotype threat and the intellectual test performance of African-Americans. J Personal Soc Psychol 69:797–811
Tichy WF (1998) Should computer scientists experiment more? 16 excuses to avoid experimentation. IEEE Comput 31(5):32–40
Tichy WF, Lukowicz P, Prechelt L, Heinz EA (1995) Experimental evaluation in computer science: a quantitative study. J Syst Softw 28(1):9–18
Walther JB (2002) Research ethics in internet-enabled research: human subjects issues and methodological myopia. Ethics Inf Technol 4(3):205–216
Wickelgren WA (1977) Speed-accuracy tradeoff and information processing dynamics. Acta Psychol 41(1):67–85
Wohlin C, Runeson P, Höst M, Ohlsson MC, Regnell B, Wesslén A (2000) Experimentation in software engineering: an introduction. Springer
Yin RK (2003) Case study research: design and methods. Sage Publications
Zannier C, Melnik G, Maurer F (2006) On the success of empirical studies in the international conference on software engineering. ACM/IEEE International Conference on Software Engineering, 341–350
Metadata
Title
A practical guide to controlled experiments of software engineering tools with human participants
Authors
Amy J. Ko
Thomas D. LaToza
Margaret M. Burnett
Publication date
01-02-2015
Publisher
Springer US
Published in
Empirical Software Engineering / Issue 1/2015
Print ISSN: 1382-3256
Electronic ISSN: 1573-7616
DOI
https://doi.org/10.1007/s10664-013-9279-3
