ABSTRACT
Most empirical disciplines promote the reuse and sharing of datasets, since doing so improves the prospects for replication. While this is increasingly the case in Empirical Software Engineering (ESE), some of the most popular bug-fix datasets are now known to be biased. This raises two significant concerns: first, that sample bias may lead to underperforming prediction models, and second, that the external validity of studies based on biased datasets may be suspect. The issue has caused considerable consternation in the ESE literature in recent years. However, one confounding factor in these datasets has not been examined carefully: size. Biased datasets sample only some of the data that could be sampled, and do so in a biased fashion; but biased samples can be smaller or larger. Smaller datasets generally provide a less reliable basis for estimating models, and thus could lead to inferior model performance. In this setting, we ask: what affects performance more, bias or size? We conduct a detailed, large-scale meta-analysis, using simulated datasets sampled with bias from a high-quality dataset that is relatively free of bias. Our results suggest that size always matters at least as much as bias direction, and in fact matters much more than bias direction when performance is judged by information-retrieval measures such as AUC-ROC and F-score. This indicates that, at least for prediction models, even when dealing with sampling bias, simply finding larger samples can sometimes be sufficient. Our analysis also exposes the complexity of the bias issue, and raises further questions to be explored in future work.
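To make the experimental setup concrete, the following is a minimal sketch, not the authors' actual pipeline, of the kind of simulation the abstract describes: draw training samples of varying size and bias from a dataset treated as clean, fit a simple defect predictor on each, and compare AUC-ROC and F-score on a fixed held-out set. The synthetic data, the logistic-regression model, and the `defect_weight` biasing scheme are illustrative assumptions; scikit-learn is assumed available.

```python
# Illustrative sketch (hypothetical, not the paper's actual pipeline):
# vary training-sample size and bias, hold the test set fixed, and
# compare AUC-ROC and F-score of the resulting defect predictors.
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import f1_score, roc_auc_score
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(42)

# Synthetic stand-in for a high-quality, relatively bias-free dataset:
# X holds per-entity metrics, y marks defective entities.
n = 5000
X = rng.normal(size=(n, 4))
y = (X @ np.array([1.0, 0.5, -0.5, 0.2]) + rng.normal(size=n) > 1.0).astype(int)

# Fixed held-out test set, so only the training sample varies.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, random_state=0, stratify=y)

def biased_sample(X, y, size, defect_weight):
    """Sample `size` rows; defect_weight > 1 over-represents defective
    entities, < 1 under-represents them, and 1.0 is an unbiased draw."""
    w = np.where(y == 1, defect_weight, 1.0)
    idx = rng.choice(len(y), size=size, replace=False, p=w / w.sum())
    return X[idx], y[idx]

for size in (100, 500, 2000):
    for weight in (0.25, 1.0, 4.0):
        Xs, ys = biased_sample(X_train, y_train, size, weight)
        if len(np.unique(ys)) < 2:
            continue  # degenerate sample (one class); skip it
        model = LogisticRegression(max_iter=1000).fit(Xs, ys)
        prob = model.predict_proba(X_test)[:, 1]
        print(f"size={size:4d}  weight={weight:4.2f}  "
              f"AUC-ROC={roc_auc_score(y_test, prob):.3f}  "
              f"F-score={f1_score(y_test, prob > 0.5):.3f}")
```

Holding the test set fixed while varying only the training sample is one way to isolate how sample size and bias direction affect the fitted model, mirroring the size-versus-bias comparison the abstract describes.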
Index Terms
- Sample size vs. bias in defect prediction