Published in: Information Systems Frontiers

1 November 2014

A comparative study of iterative and non-iterative feature selection techniques for software defect prediction

Authors: Taghi M. Khoshgoftaar, Kehan Gao, Amri Napolitano, Randall Wald

Published in: Information Systems Frontiers | Issue 5/2014

Abstract

Two important problems which can affect the performance of classification models are high dimensionality (an overabundance of independent features in the dataset) and imbalanced data (a skewed class distribution which creates at least one class with many fewer instances than other classes). To resolve these problems concurrently, we propose an iterative feature selection approach, which repeatedly applies data sampling (in order to address class imbalance) followed by feature selection (in order to address high dimensionality), and finally performs an aggregation step which combines the ranked feature lists from the separate iterations of sampling. This approach is designed to find a ranked feature list which is particularly effective on the more balanced dataset resulting from sampling while minimizing the risk of losing data through the sampling step and missing important features. To demonstrate this technique, we employ 18 different feature selection algorithms and Random Undersampling with two post-sampling class distributions. We also investigate the use of sampling and feature selection without the iterative step (i.e., using the ranked list from a single iteration, rather than combining the lists from multiple iterations), and compare these results with those of the version which uses iteration. Our study is carried out using three groups of datasets with different levels of class balance, all of which were collected from a real-world software system. All of our experiments use four different learners and one feature subset size. We find that our proposed iterative feature selection approach outperforms the non-iterative approach.
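The sample, rank, aggregate loop described in the abstract can be sketched roughly as follows. This is a minimal illustration and not the authors' implementation (the study itself was run in WEKA): the signal-to-noise ranker stands in for any one of the 18 feature selection algorithms, mean-rank aggregation is an assumed way of combining the lists, and all function names and parameters (`undersample`, `s2n_scores`, `minority_share`, `iterations`, `top_k`) are hypothetical. Labels are assumed to be 1 for the minority (defective) class and 0 for the majority class.

```python
import random
from statistics import mean, pstdev


def undersample(X, y, minority_share, rng):
    """Randomly drop majority rows until the minority class is about minority_share of the data."""
    minority = [i for i, label in enumerate(y) if label == 1]
    majority = [i for i, label in enumerate(y) if label == 0]
    # Majority rows to keep so that minority / (minority + keep) == minority_share.
    keep = round(len(minority) * (1 - minority_share) / minority_share)
    keep = min(keep, len(majority))
    chosen = minority + rng.sample(majority, keep)
    return [X[i] for i in chosen], [y[i] for i in chosen]


def s2n_scores(X, y):
    """Stand-in ranker (assumption): |mean_1 - mean_0| / (std_1 + std_0) per feature."""
    scores = []
    for j in range(len(X[0])):
        pos = [row[j] for row, label in zip(X, y) if label == 1]
        neg = [row[j] for row, label in zip(X, y) if label == 0]
        spread = pstdev(pos) + pstdev(neg)
        scores.append(abs(mean(pos) - mean(neg)) / spread if spread else 0.0)
    return scores


def ranks_from_scores(scores):
    """Convert scores to ranks, where rank 1 is the highest-scoring feature."""
    order = sorted(range(len(scores)), key=lambda j: -scores[j])
    ranks = [0] * len(scores)
    for position, j in enumerate(order, start=1):
        ranks[j] = position
    return ranks


def iterative_feature_selection(X, y, iterations=10, minority_share=0.5,
                                top_k=2, seed=0):
    """Repeat sampling + ranking, then aggregate the ranked lists by mean rank."""
    rng = random.Random(seed)
    all_ranks = []
    for _ in range(iterations):
        Xs, ys = undersample(X, y, minority_share, rng)
        all_ranks.append(ranks_from_scores(s2n_scores(Xs, ys)))
    # Aggregation step (assumed here to be the mean rank across iterations).
    mean_rank = [mean(r[j] for r in all_ranks) for j in range(len(X[0]))]
    return sorted(range(len(mean_rank)), key=lambda j: mean_rank[j])[:top_k]
```

Calling `iterative_feature_selection(X, y, iterations=1)` gives the non-iterative variant, in which the ranked list from a single round of sampling is used directly. Because each round discards a different random subset of majority instances, averaging ranks over several rounds reduces the chance that one unlucky sample hides an important feature.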

Waikato Environment for Knowledge Analysis (WEKA) is a popular suite of machine learning software written in Java and developed at the University of Waikato. WEKA is free software available under the GNU General Public License. In this study, all experiments and algorithms were implemented in WEKA.

Title: A comparative study of iterative and non-iterative feature selection techniques for software defect prediction
Authors: Taghi M. Khoshgoftaar
Kehan Gao
Amri Napolitano
Randall Wald
Publication date: 1 November 2014
Publisher: Springer US
Published in: Information Systems Frontiers / Issue 5/2014
Print ISSN: 1387-3326
Electronic ISSN: 1572-9419
DOI: https://doi.org/10.1007/s10796-013-9430-0
