Published in: Software Quality Journal 3/2007

1 September 2007

Software quality estimation with limited fault data: a semi-supervised learning perspective

Authors: Naeem Seliya, Taghi M. Khoshgoftaar

Abstract

We address the important problem of software quality analysis when there is limited software fault or fault-proneness data. A software quality model is typically trained using software measurement and fault data obtained from a previous release or similar project. Such an approach assumes that fault data is available for all the training modules. Various issues in software development may limit the availability of fault-proneness data for all the training modules. Consequently, the available labeled training dataset is such that the trained software quality model may not provide predictions. More specifically, the small set of modules with known fault-proneness labels is not sufficient for capturing the software quality trends of the project. We investigate semi-supervised learning with the Expectation Maximization (EM) algorithm for software quality estimation with limited fault-proneness data. The hypothesis is that knowledge stored in software attributes of the unlabeled program modules will aid in improving software quality estimation. Software data collected from a large NASA software project is used during the semi-supervised learning process. The software quality model is evaluated with multiple test datasets collected from other NASA software projects. Compared to software quality models trained only with the available set of labeled program modules, the EM-based semi-supervised learning scheme improves generalization performance of the software quality models.
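
The general idea of EM-based semi-supervised learning described in the abstract can be illustrated with a minimal sketch. The abstract does not state which base classifier or software metrics the authors used, so everything below is an assumption made for illustration: a Gaussian naive Bayes model stands in for the classifier, the function names (fit_gnb, predict_proba, em_semi_supervised) are hypothetical, and the data is synthetic. Class 1 plays the role of "fault-prone" and class 0 of "not fault-prone".

```python
import numpy as np

def fit_gnb(X, resp):
    """M-step: weighted Gaussian naive Bayes fit from soft labels resp (n x k).

    Assumes every class has non-zero total weight.
    """
    weight = resp.sum(axis=0)                        # effective module count per class
    prior = weight / weight.sum()
    mean = (resp.T @ X) / weight[:, None]            # per-class feature means
    sq_dev = (X[:, None, :] - mean[None, :, :]) ** 2
    var = np.einsum("nk,nkd->kd", resp, sq_dev) / weight[:, None] + 1e-6
    return prior, mean, var

def predict_proba(X, params):
    """E-step helper: posterior class probabilities under the current model."""
    prior, mean, var = params
    diff = X[:, None, :] - mean[None, :, :]
    log_lik = -0.5 * (np.log(2 * np.pi * var)[None] + diff ** 2 / var[None]).sum(axis=2)
    log_post = np.log(prior)[None] + log_lik
    log_post -= log_post.max(axis=1, keepdims=True)  # numerical stability
    post = np.exp(log_post)
    return post / post.sum(axis=1, keepdims=True)

def em_semi_supervised(X_lab, y_lab, X_unlab, n_classes=2, n_iter=20):
    """Start from the labeled modules only, then alternate E- and M-steps.

    Labeled modules keep their known labels; unlabeled modules contribute
    through their estimated class probabilities.
    """
    resp_lab = np.eye(n_classes)[y_lab]              # fixed one-hot labels
    params = fit_gnb(X_lab, resp_lab)                # supervised-only starting point
    X_all = np.vstack([X_lab, X_unlab])
    for _ in range(n_iter):
        resp_unlab = predict_proba(X_unlab, params)                  # E-step
        params = fit_gnb(X_all, np.vstack([resp_lab, resp_unlab]))   # M-step
    return params

# Synthetic stand-in for software metrics: two features, two classes.
rng = np.random.default_rng(0)
X0 = rng.normal(0.0, 1.0, size=(200, 2))   # hypothetical not-fault-prone modules
X1 = rng.normal(4.0, 1.0, size=(200, 2))   # hypothetical fault-prone modules
X_lab = np.vstack([X0[:5], X1[:5]])        # only 10 modules have known labels
y_lab = np.array([0] * 5 + [1] * 5)
X_unlab = np.vstack([X0[5:], X1[5:]])
params = em_semi_supervised(X_lab, y_lab, X_unlab)
pred = predict_proba(X_unlab, params).argmax(axis=1)
```

In this sketch the labeled modules act as fixed anchors, while the unlabeled modules shift the class means and variances toward the structure of the full dataset. Whether this helps on real project data depends on how well the model assumptions match the data, which is the question the paper examines empirically.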

Metadata
Title
Software quality estimation with limited fault data: a semi-supervised learning perspective
Authors
Naeem Seliya
Taghi M. Khoshgoftaar
Publication date
1 September 2007
Published in
Software Quality Journal, Issue 3/2007
Print ISSN: 0963-9314
Electronic ISSN: 1573-1367
DOI
https://doi.org/10.1007/s11219-007-9013-8
