Published in: Empirical Software Engineering 2/2020

12 February 2020

Predicting software defect type using concept-based classification

Authors: Sangameshwar Patil, B. Ravindran



Abstract

Automatically predicting the type of a software defect from its description can significantly speed up and improve the software defect management process. A major challenge for current supervised-learning approaches to this task is the need for labeled training data. Creating such data is expensive and effort-intensive, and it requires domain-specific expertise. In this paper, we propose to circumvent this problem by carrying out concept-based classification (CBC) of software defect reports with the help of the Explicit Semantic Analysis (ESA) framework. We first create concept-based representations of a software defect report and of the defect types in the software defect classification scheme by projecting their textual descriptions into a concept space spanned by Wikipedia articles. Then, we compute the “semantic” similarity between these concept-based representations and assign the defect type that has the highest similarity with the defect report. The proposed approach achieves accuracy comparable to the state-of-the-art semi-supervised and active learning approach for this task without requiring labeled training data. Additional advantages of the CBC approach are: (i) unlike the state of the art, it does not need the source code used to fix a software defect, and (ii) it does not suffer from the class-imbalance problem faced by the supervised learning paradigm.
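The pipeline described in the abstract (project texts into a concept space, compare them, pick the most similar defect type) can be illustrated with a minimal sketch. It is not the authors' implementation: the three "concepts" are tiny hypothetical stand-ins for Wikipedia articles, the defect type labels and descriptions are invented for illustration, and the TF-IDF weighting and cosine similarity are assumed choices that may differ from the weighting used in the paper.

```python
import math
import re
from collections import Counter

# Hypothetical stand-ins for Wikipedia articles ("concepts").
# A real ESA index would be built from a full Wikipedia dump.
CONCEPTS = {
    "Memory management": "memory allocation heap pointer buffer leak free allocate",
    "User interface": "button window screen display layout click menu dialog",
    "Conditional (programming)": "condition check branch validation value comparison if else",
}

# Hypothetical defect types with short textual descriptions (not from the paper).
DEFECT_TYPES = {
    "checking": "missing or incorrect validation of a value in a condition check",
    "memory": "memory leak or buffer allocation problem",
    "interface": "problem in a window, dialog or menu display",
}


def tokenize(text):
    """Lowercase the text and split it into alphabetic tokens."""
    return re.findall(r"[a-z]+", text.lower())


def build_index(concepts):
    """Build an inverted index: term -> {concept name: TF-IDF weight}."""
    n = len(concepts)
    term_counts = {name: Counter(tokenize(body)) for name, body in concepts.items()}
    doc_freq = Counter(term for counts in term_counts.values() for term in counts)
    index = {}
    for name, counts in term_counts.items():
        for term, tf in counts.items():
            idf = math.log(n / doc_freq[term]) + 1.0  # assumed smoothing
            index.setdefault(term, {})[name] = tf * idf
    return index


def to_concept_vector(text, index):
    """Project a text into concept space by summing its terms' concept weights."""
    vec = Counter()
    for term in tokenize(text):
        for concept, weight in index.get(term, {}).items():
            vec[concept] += weight
    return vec


def cosine(u, v):
    """Cosine similarity between two sparse concept vectors."""
    dot = sum(u[k] * v[k] for k in u if k in v)
    norm = math.sqrt(sum(x * x for x in u.values())) * math.sqrt(
        sum(x * x for x in v.values())
    )
    return dot / norm if norm else 0.0


def classify(report, type_descriptions, index):
    """Return the defect type most similar to the report, plus all scores."""
    report_vec = to_concept_vector(report, index)
    scores = {
        label: cosine(report_vec, to_concept_vector(desc, index))
        for label, desc in type_descriptions.items()
    }
    return max(scores, key=scores.get), scores


index = build_index(CONCEPTS)
report = "Application crashes because the buffer is never freed, causing a leak in heap memory"
label, scores = classify(report, DEFECT_TYPES, index)
print(label, scores)
```

No labeled defect reports are used anywhere in this sketch: the only inputs are the concept texts and the textual descriptions of the defect types, which mirrors the property the abstract emphasizes.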

Footnotes
1
Note that Table 1 and Table 8 contain only the introductory definition snippets from the classification schemes. Their detailed descriptions along with contextual information and examples are available in IBM (2013a, b) and IEEE (2009).
 
2
The expert needs to refer to IBM (2013a, b) for the detailed descriptions in order to understand the defect type classification scheme.
 
4
Following the ESA terminology, we use “a concept” and “a Wikipedia article” interchangeably.
 
9
Mahout, the machine learning library, https://mahout.apache.org
 
10
Lucene, the search engine library, https://lucene.apache.org/core
 
11
OpenNLP, the natural language processing library, https://opennlp.apache.org
 
References
Alenezi M, Magel K, Banitaan S (2013) Efficient bug triaging using text mining. Journal of Software 8(9):2185–2190
Bridge N, Miller C (1998) Orthogonal defect classification using defect data to improve software development. Software Quality 3(1):1–8
Butcher M, Munro H, Kratschmer T (2002) Improving software testing via ODC: Three case studies. IBM Syst J 41(1):31–44
Carrozza G, Pietrantuono R, Russo S (2015) Defect analysis in mission-critical software systems: a detailed investigation. Journal of Software: Evolution and Process 27(1):22–49
Chawla NV, Japkowicz N, Kotcz A (2004) Edit: Special issue on learning from imbalanced data sets. SIGKDD Explorations Newsletter 6(1):1–6. https://doi.org/10.1145/1007730.1007733
Chillarege R (1996) Orthogonal defect classification. Handbook of Software Reliability Engineering, pp 359–399
Cohen J (1960) A coefficient of agreement for nominal scales. Educ Psychol Meas 20(1):37–46
Čubranić D (2004) Automatic bug triage using text categorization. In: Proceedings of 16th international conference on software engineering & knowledge engineering (SEKE)
Egozi O, Markovitch S, Gabrilovich E (2011) Concept-based information retrieval using explicit semantic analysis. ACM Trans Inf Syst 29(2):8
Ferschke O, Zesch T, Gurevych I (2011) Wikipedia revision toolkit: Efficiently accessing Wikipedia’s edit history. In: Proceedings of the ACL-HLT 2011 system demonstrations, Association for Computational Linguistics, pp 97–102
Gabrilovich E, Markovitch S (2007) Computing semantic relatedness using Wikipedia-based explicit semantic analysis. In: Proceedings of the 20th intl. joint conf. on artificial intelligence (IJCAI), vol 7, pp 1606–1611
Gabrilovich E, Markovitch S (2009) Wikipedia-based semantic interpretation for natural language processing. J Artif Intell Res 34:443–498
Han J, Pei J, Yin Y (2000) Mining frequent patterns without candidate generation. In: ACM SIGMOD record, vol 29. ACM, pp 1–12
Han J, Pei J, Kamber M (2011) Data mining: concepts and techniques. Elsevier, Amsterdam
He H, Garcia EA (2009) Learning from imbalanced data. IEEE Trans on Knowledge and Data Engineering
Herzig K, Just S, Zeller A (2013) It’s not a bug, it’s a feature: how misclassification impacts bug prediction. In: Proceedings of 35th international conference on software engineering, pp 392–401
Huang L, Ng V, Persing I, Geng R, Bai X, Tian J (2011) AutoODC: Automated generation of orthogonal defect classifications. In: Proceedings of 26th IEEE/ACM international conference on automated software engineering (ASE)
Huang L, Ng V, Persing I, Chen M, Li Z, Geng R, Tian J (2015) AutoODC: Automated generation of orthogonal defect classifications. Automated Software Engineering Journal 22(1):3–46
IEEE (2009) IEEE standard 1044-2009 classification for software anomalies
Japkowicz N, Stephen S (2002) The class imbalance problem: a systematic study. Intelligent Data Analysis 6(5):429–449
Jurafsky D, Martin JH (2000) Speech and language processing: an introduction to natural language processing, computational linguistics, and speech recognition, 1st edn. Prentice Hall PTR, Upper Saddle River
Manning CD, Raghavan P, Schütze H (2008) Introduction to information retrieval. Cambridge University Press, New York
Mellegård N, Staron M, Törner F (2012) A light-weight defect classification scheme for embedded automotive software and its initial evaluation. In: Proceedings of IEEE 23rd international symp. on software reliability engineering (ISSRE), pp 261–270
Menzies T, Marcus A (2008) Automated severity assessment of software defect reports. In: IEEE international conference on software maintenance (ICSM), pp 346–355
Panichella A, Dit B, Oliveto R, Di Penta M, Poshyvanyk D, De Lucia A (2013) How to effectively use topic models for software engineering tasks? An approach based on genetic algorithms. In: Proceedings of the 2013 international conference on software engineering, ICSE ’13, pp 522–531
Patil S (2017) Concept based classification of software defect reports. In: Proceedings of 14th international conference on mining software repositories (MSR), IEEE/ACM
Pedregosa F, Varoquaux G, Gramfort A, Michel V, Thirion B, Grisel O, Blondel M, Prettenhofer P, Weiss R, Dubourg V, Vanderplas J, Passos A, Cournapeau D, Brucher M, Perrot M, Duchesnay E (2011) Scikit-learn: Machine learning in Python. J Mach Learn Res 12:2825–2830
Robertson S, Zaragoza H, et al (2009) The probabilistic relevance framework: BM25 and beyond. Foundations and Trends® in Information Retrieval 3(4):333–389
Robertson SE, Walker S, Jones S, Hancock-Beaulieu MM, Gatford M, et al (1995) Okapi at TREC-3. NIST Special Publication Sp 109:109
Runeson P, Alexandersson M, Nyholm O (2007) Detection of duplicate defect reports using natural language processing. In: Proceedings of 29th international conference on software engineering. IEEE Computer Society, pp 499–510
Salton G, McGill MJ (1986) Introduction to modern information retrieval. McGraw-Hill Inc, New York
Silva N, Vieira M (2014) Experience report: orthogonal classification of safety critical issues. In: 2014 IEEE 25th international symposium on software reliability engineering. IEEE, pp 156–166
Thung F, Lo D, Jiang L (2012) Automatic defect categorization. In: Proceedings of 19th working conference on reverse engineering (WCRE). IEEE, pp 205–214
Thung F, Le X-BD, Lo D (2015) Active semi-supervised defect categorization. In: Proceedings of IEEE 23rd international conference on program comprehension (ICPC), pp 60–70
Vallespir D, Grazioli F, Herbert J (2009) A framework to evaluate defect taxonomies. In: Proceedings of XV Congreso Argentino de Ciencias de La Computación
Wagner S (2008) Defect classification and defect types revisited. In: Proceedings of workshop on defects in large software systems. ACM, pp 39–40
Wang X, Zhang L, Xie T, Anvik J, Sun J (2008) An approach to detecting duplicate bug reports using natural language and execution information. In: Proceedings of the 30th international conference on software engineering, pp 461–470
Xia X, Lo D, Wang X, Zhou B (2014) Automatic defect categorization based on fault triggering conditions. In: Proceedings of 19th international conference on engineering of complex computer systems (ICECCS). IEEE, pp 39–48
Zaki MJ, Meira W Jr (2014) Data mining and analysis: fundamental concepts and algorithms. Cambridge University Press, Cambridge
Zesch T, Müller C, Gurevych I (2008) Extracting lexical semantic knowledge from Wikipedia and Wiktionary. In: Proceedings of 6th international conference on language resources and evaluation (LREC), vol 8, pp 1646–1652
Zhou Y, Tong Y, Gu R, Gall H (2016) Combining text mining and data mining for bug report classification. Journal of Software: Evolution and Process 28(3)
Metadata
Title
Predicting software defect type using concept-based classification
Authors
Sangameshwar Patil
B. Ravindran
Publication date
12 February 2020
Publisher
Springer US
Published in
Empirical Software Engineering, Issue 2/2020
Print ISSN: 1382-3256
Electronic ISSN: 1573-7616
DOI
https://doi.org/10.1007/s10664-019-09779-6
