Published in: Empirical Software Engineering 1-2/2012

01.02.2012

On the dataset shift problem in software engineering prediction models

Written by: Burak Turhan

Abstract

A core assumption of any prediction model is that the test data distribution does not differ from the training data distribution. Prediction models used in software engineering are no exception. In reality, this assumption can be violated in many ways, resulting in inconsistent and non-transferable observations across different cases. The goal of this paper is to explain the phenomenon of conclusion instability through the dataset shift concept, from the perspective of software effort and fault prediction. Different types of dataset shift are explained with examples from software engineering, and techniques for addressing the associated problems are discussed. While dataset shifts in the form of sample selection bias and imbalanced data are well known in software engineering research, understanding the other types is relevant for interpreting non-transferable results across different sites and studies. The software engineering community should be aware of, and account for, dataset shift related issues when evaluating the validity of research outcomes.

Footnotes
1
For simplicity, it is assumed that such changes have no confounding effects other than on software size.
 
2
However, this would introduce an unfair bias, since it would mean using, during the model construction phase, information related to the attribute that is to be predicted (i.e. defect rate). The model is supposed to predict that attribute in the first place, and should be blind to such prior facts that exist in the test data.
 
3
Domain shift is not included in the discussion, since it is a measurement-related issue that should be handled separately by the researcher/practitioner.
 
4
In practice, this warning applies to simulation studies, since test responses are typically not known in real settings.
 
Metadata
Title
On the dataset shift problem in software engineering prediction models
Written by
Burak Turhan
Publication date
01.02.2012
Publisher
Springer US
Published in
Empirical Software Engineering / Issue 1-2/2012
Print ISSN: 1382-3256
Electronic ISSN: 1573-7616
DOI
https://doi.org/10.1007/s10664-011-9182-8
