nach oben

Progress in Artificial Intelligence

Erschienen in:

28.03.2019 | Regular Paper

Label prediction on issue tracking systems using text mining

verfasst von: Jesús M. Alonso-Abad, Carlos López-Nozal, Jesús M. Maudes-Raedo, Raúl Marticorena-Sánchez

Erschienen in: Progress in Artificial Intelligence | Ausgabe 3/2019

Einloggen

Aktivieren Sie unsere intelligente Suche, um passende Fachinhalte oder Patente zu finden.

search-config

KI-gestützte Suche

Aus

Abstract

Issue tracking systems are overall change-management tools in software development. The issue-solving life cycle is a complex socio-technical activity that requires team discussion and knowledge sharing between members. In that process, issue classification facilitates an understanding of issues and their analysis. Issue tracking systems permit the tagging of issues with default labels (e.g., bug, enhancement) or with customized team labels (e.g., test failures, performance). However, a current problem is that many issues in open-source projects remain unlabeled. The aim of this paper is to improve maintenance tasks in development teams, evaluating models that can suggest a label for an issue using its text comments. We analyze data on issues from several GitHub trending projects, first by extracting issue information and then by applying text mining classifiers (i.e., support vector machine and naive Bayes multinomial). The results suggest that very suitable classifiers may be obtained to label the issues or, at least, to suggest the most suitable candidate labels.

Vorheriger Artikel Advanced visualization of Twitter data for its analysis as a communication channel in traditional companies

Nächster Artikel A chaotic salp swarm algorithm based on quadratic integrate and fire neural model for function optimization

Sie haben noch keine Lizenz? Dann Informieren Sie sich jetzt über unsere Produkte:

Springer Professional "Wirtschaft"

Online-Abonnement

Mit Springer Professional "Wirtschaft" erhalten Sie Zugriff auf:

über 67.000 Bücher
über 340 Zeitschriften

aus folgenden Fachgebieten:

Bauwesen + Immobilien
Business IT + Informatik
Finance + Banking
Management + Führung
Marketing + Vertrieb
Versicherung + Risiko

Jetzt Wissensvorsprung sichern!

Jetzt informieren

Springer Professional "Wirtschaft+Technik"

Online-Abonnement

Mit Springer Professional "Wirtschaft+Technik" erhalten Sie Zugriff auf:

über 102.000 Bücher
über 537 Zeitschriften

aus folgenden Fachgebieten:

Automobil + Motoren
Bauwesen + Immobilien
Business IT + Informatik
Elektrotechnik + Elektronik
Energie + Nachhaltigkeit
Finance + Banking
Management + Führung
Marketing + Vertrieb
Maschinenbau + Werkstoffe
Versicherung + Risiko

Jetzt Wissensvorsprung sichern!

Jetzt informieren

Springer Professional "Technik"

Online-Abonnement

Mit Springer Professional "Technik" erhalten Sie Zugriff auf:

über 67.000 Bücher
über 390 Zeitschriften

aus folgenden Fachgebieten:

Automobil + Motoren
Bauwesen + Immobilien
Business IT + Informatik
Elektrotechnik + Elektronik
Energie + Nachhaltigkeit
Maschinenbau + Werkstoffe

Jetzt Wissensvorsprung sichern!

Jetzt informieren

Nur mit Berechtigung zugänglich

https://github.com/.

Anil Kumar, R., Ravi, V.: Predicting credit card customer churn in banks using data mining. Int. J. Data Anal. Tech. Strateg. 1(1), 4–28 (2008)CrossRef

Anjali, M., Jivani, G.: A comparative study of stemming algorithms. Int. J. Comput. Tech. Appl. 2(6), 1930–1938 (2011)

Barandela, R., Valdovinos, R.M., Sánchez, J.S.: New applications of ensembles of classifiers. Pattern Anal. Appl. 6(3), 245–256 (2003)MathSciNetCrossRef

Basili, V., Caldiera, G., Rombach, D.H.: The goal question metric approach. In: Marciniak, J. (ed.) Encyclopedia of Software Engineering. Wiley, New York (1994). https://docweb.lrz-muenchen.de/cgi-bin/doc/nph-webdoc.cgi/000110A/http/scholar.google.de/scholar=3fhl=3dde&lr=3d&cluster=3d4068380033007143449

Batista, G.E., Prati, R.C., Monard, M.C.: A study of the behavior of several methods for balancing machine learning training data. ACM SIGKDD Explor. Newsl. 6(1), 20–29 (2004)CrossRef

Batuwita, R., Palade, V.: microPred: effective classification of pre-mirnas for human mirna gene prediction. Bioinformatics 25(8), 989–995 (2009)CrossRef

Berczuk, S., Appleton, B.: Software Configuration Management Patterns: Effective Teamwork, Practical Integration, 01st edn. Addison Wesley Longman Inc Div Pearson Suite 300, Boston (2002)

Bunkhumpornpat, C., Sinapiromsaran, K., Lursinsap, C. Safe-level-smote: safe-level-synthetic minority over-sampling technique for handling the class imbalanced problem. In: Pacific-Asia Conference on Knowledge Discovery and Data Mining(PAKDD09). Lecture Notes on Computer Science, vol. 5476, pp. 475–482. Springer, New York (2009)

Cabot, J., Izquierdo, J.L.C., Cosentino, V., Rolandi, B.: Exploring the use of labels to categorize issues in Open-Source Software projects. In: 2015 IEEE 22nd International Conference on Software Analysis, Evolution, and Reengineering (SANER), pp. 550–554 (2015). https://doi.org/10.1109/SANER.2015.7081875

10.

Chawla, N.V., Bowyer, K.W., Hall, L.O., Kegelmeyer, W.P.: SMOTE: synthetic minority over-sampling technique. J. Artif. Intell. Res. 16, 321–357 (2002)MATHCrossRef

11.

Chawla, N.V., Japkowicz, N., Kotcz, A.: Editorial: special issue on learning from imbalanced data sets. ACM SIGKDD Explor. Newsl. 6(1), 1–6 (2004)CrossRef

12.

Chawla, N.V., Lazarevic, A., Hall, L.O., Bowyer, K.W.: Smoteboost: improving prediction of the minority class in boosting. In: 7th European Conference on Principles and Practice of Knowledge Discovery in Databases (PKDD 2003), pp. 107–119 (2003)

13.

Cieslak, D.A., Chawla, N.V.: Learning decision trees for unbalanced data. In: Proceedings of the 2008 European Conference on Machine Learning and Knowledge Discovery in Databases—Part I, ECML PKDD’08, pp. 241–256, Springer, Berlin (2008)

14.

Cortes, C., Vapnik, V.: Support-vector networks. Mach. Learn. 20(3), 273–297 (1995). https://doi.org/10.1023/A:1022627411411 MATHCrossRef

15.

Díez-Pastor, J.F., Rodríguez, J.J., García-Osorio, C., Kuncheva, L.I.: Random balance: ensembles of variable priors classifiers for imbalanced data. Knowl. Based Syst. 85, 96–111 (2015)CrossRef

16.

Drown, D.J., Khoshgoftaar, T.M., Seliya, N.: Evolutionary sampling and software quality modeling of high-assurance systems. IEEE Trans. Syst. Man Cybern. Part A: Syst. Hum. 39(5), 1097–1107 (2009)CrossRef

17.

Eskildsen, S.F., Coupé, P., Fonov, V., Collins, D.L.: Detecting Alzheimer’s disease by morphological MRI using hippocampal grading and cortical thickness. In: Esther, B., Marion, S., van John, S., Wiro, N., Stefan, K., (eds.) Challenge on Computer-Aided Diagnosis of Dementia Based on Structural MRI Data, pp. 38–47 (2014)

18.

Fan, R.E., Chang, K.W., Hsieh, C.J., Wang, X.R., Lin, C.J.: LIBLINEAR: a library for large linear classification. J. Mach. Learn. Res. 9, 1871–1874 (2008)MATH

19.

Fan, W., Salvatore, J.S., Junxin, Z., Philip, K.C.: Adacost: misclassification cost-sensitive boosting. In: Proceedings of the Sixteenth International Conference on Machine Learning, ICML’99, pp. 97–105, San Francisco, CA, (1999). Morgan Kaufmann Publishers Inc

20.

Fawcett, T.: An introduction to ROC analysis. Pattern Recogn. Lett. 27(8), 861–874 (2006). https://doi.org/10.1016/j.patrec.2005.10.010 MathSciNetCrossRef

21.

Galar, M., Fernandez, A., Barrenechea, E., Bustince, H., Herrera, F.: A review on ensembles for the class imbalance problem: bagging-, boosting-, and hybrid-based approaches. IEEE Trans. Syst. Man Cybern. Part C Appl. Rev. 42(4), 463–484 (2012)CrossRef

22.

García-Pedrajas, N., Pérez-Rodríguez, J., García-Pedrajas, M.D., Ortiz-Boyer, D., Fyfe, C.: Class imbalance methods for translation initiation site recognition in DNA sequences. Knowl. Based Syst. 25(1), 22–34 (2012)CrossRef

23.

Gousios, G.: The GHTorrent dataset and tool suite. In: Proceedings of the 10th Working Conference on Mining Software Repositories, MSR’13, pp. 233–236. IEEE Press, Piscataway, NJ (2013). http://dl.acm.org/citation.cfm?id=2487085.2487132

24.

Güemes-Peña, D., López-Nozal, C., Marticorena-Sánchez, R., Maudes-Raedo, J.: Emerging topics in mining software repositories. Progr. Artif. Intell. 7(3), 237–247 (2018). https://doi.org/10.1007/s13748-018-0147-7 CrossRef

25.

Hall, M., Frank, E., Holmes, G., Pfahringer, B., Reutemann, P., Witten, I.H.: The WEKA data mining software: an update. SIGKDD Explor. Newsl. 11(1), 10–18 (2009). https://doi.org/10.1145/1656274.1656278 CrossRef

26.

Han, H., Wang, W.Y., Mao, B.H.: Borderline-smote: a new over-sampling method in imbalanced data sets learning. In: 2005 International Conference on Intelligent Computing (ICIC05). Lecture Notes on Computer Science, vol. 3644, pp. 878–887. Springer, New York (2005)

27.

Irfan, R., King, C., Grages, D., Ewen, S., Khan, S., Madani, S., Kolodziej, J., Wang, L., Chen, D., Rayes, A., Tziritas, N., Xu, C.-Z., Zomaya, A., Alzahrani, A., Li, H.: A survey on text mining in social networks. Knowl. Eng. Rev. 30(2), 157–170 (2015)CrossRef

28.

Izquierdo, J.L.C., Cosentino, V., Rolandi, B., Bergel, A., Cabot, J.: GiLA: GitHub label analyzer. In: 2015 IEEE 22nd International Conference on Software Analysis, Evolution, and Reengineering (SANER), pp. 479–483 (2015). https://doi.org/10.1109/SANER.2015.7081860

29.

Joshi, M.V., Kumar, V., Agarwal, R.C.: Evaluating boosting algorithms to classify rare classes: comparison and improvements. In: Proceedings IEEE International Conference on Data Mining (ICDM 2001), pp. 257–264 (2001)

30.

Khan, A., Baharudin, B., Lee, L.H., Khan, K., Tronoh, U.T.P.: A review of machine learning algorithms for text-documents classification. J. Adv. Inf. Technol. 1(1), 4–18 (2010)

31.

Kohavi, R.: A study of cross-validation and bootstrap for accuracy estimation and model selection. In: Proceedings of the 14th International Joint Conference on Artificial Intelligence (IJCAI’95) vol. 2, pp. 1137–1143. Morgan Kaufmann Publishers Inc., San Francisco (1995). http://dl.acm.org/citation.cfm?id=1643031.1643047

32.

Kotsiantis, S.B., Pintelas, P.E.: Mixture of expert agents for handling imbalanced data sets. Ann. Math. Comput. Teleinform. 1(1), 46–55 (2003)

33.

Krawczyk, B., Galar, M., Jeleń, Ł., Herrera, F.: Evolutionary undersampling boosting for imbalanced classification of breast cancer malignancy. Appl. Soft Comput. 38, 714–726 (2016)CrossRef

34.

Kukar, M., Kononenko, I.: Cost-sensitive learning with neural networks. In: Proceedings of the 13th European Conference on Artificial Intelligence (ECAI-98), pp. 445–449. Citeseer (1998)

35.

Lachiche, N., Flach, P.A.: Improving accuracy and cost of two-class and multi-class probabilistic classifiers using roc curves. In: ICML (2003)

36.

Liao, T.W.: Classification of weld flaws with imbalanced class data. Expert Syst. Appl. 35(3), 1041–1052 (2008)CrossRef

37.

Ling, C.X., Sheng, V.S., Yang, Q.: Test strategies for cost-sensitive decision trees. IEEE Trans. Knowl. Data Eng. 18(8), 1055–1067 (2006)CrossRef

38.

Liu, W., Chawla, S., Cieslak, D.A., Chawla, N.V.: A robust decision tree algorithm for imbalanced data sets. Proceedings of the SIAM International Conference on Data Mining, SDM, pp. 766–777 (2010)

39.

Lovins, J.B.: Development of a stemming algorithm. Mechan. Transl. Comput. Linguist. 11, 22–31 (1968)

40.

McCallum, A.K.: Bow: a toolkit for statistical language modeling, text retrieval, classification and clustering (1996). http://www.cs.cmu.edu/~mccallum/bow

41.

McCallum, A., Nigam, K.: A comparison of event models for naive Bayes text classification. In: AAAI-98 Workshop on Learning for Text Categorization, vol. 752, pp. 41–48 (1998)

42.

Phua, C., Alahakoon, D., Lee, V.: Minority report in fraud detection: classification of skewed data. ACM SIGKDD Explor. Newsl. 6(1), 50–59 (2004)CrossRef

43.

Porter, M.: An algorithm for suffix stripping. Program 14(3), 130–137 (1980)CrossRef

44.

Sebastiani, F.: Machine learning in automated text categorization. ACM Comput. Surv. 34(1), 1–47 (2002). https://doi.org/10.1145/505282.505283 MathSciNetCrossRef

45.

Seiffert, C., Khoshgoftaar, T.M., Van Hulse, J.: Improving software-quality predictions with data sampling and boosting. IEEE Trans. Syst. Man Cybern. Part A Syst. Hum. 39(6), 1283–1294 (2009)CrossRef

46.

Seiffert, C., Khoshgoftaar, T.M., Van Hulse, J., Napolitano, A.: Rusboost: a hybrid approach to alleviating class imbalance. IEEE Trans. Syst. Man Cybern. Part A: Syst. Hum. 40(1), 185–197 (2010)CrossRef

47.

Sohrawardi, S.J., Azam, I., Hosain, S.: A comparative study of text classification algorithms on user submitted bug reports. In: 2014 Ninth International Conference on Digital Information Management (ICDIM), pp. 242–247 (2014)

48.

Sokolova, M., Lapalme, G.: A systematic analysis of performance measures for classification tasks. Inf. Process. Manag. 45(4), 427–437 (2009). https://doi.org/10.1016/j.ipm.2009.03.002 CrossRef

49.

Sun, C., Lo, D., Wang, X., Jiang, J., Khoo, S.C.: A discriminative model approach for accurate duplicate bug report retrieval. In: Proceedings of the 32Nd ACM/IEEE International Conference on Software Engineering—vol. 1, ICSE’10, pp. 45–54. ACM, New York (2010). https://doi.org/10.1145/1806799.1806811.

50.

Sun, Y., Kamel, M., Wong, A., Wang, Y.: Cost-sensitive boosting for classification of imbalanced data. Pattern Recognit. 40, 3358–3378 (2007)MATHCrossRef

51.

Sun, Z., Song, Q., Zhu, X.: Using coding-based ensemble learning to improve software defect prediction. IEEE Trans. Syst. Man Cybern. Part C (Appl. Rev.) 42(6), 1806–1817 (2012)CrossRef

52.

Treude, C., Storey, M.A.: Work item tagging: communicating concerns in collaborative software development. IEEE Trans. Softw. Eng. 38(1), 19–34 (2012). https://doi.org/10.1109/TSE.2010.91 CrossRef

53.

Valdivia Garcia, H., Shihab, E.: Characterizing and predicting blocking bugs in open source projects. In: Proceedings of the 11th Working Conference on Mining Software Repositories, MSR 2014, pp. 72–81. ACM, New York (2014). https://doi.org/10.1145/2597073.2597099

54.

Vapnik, V.N.: The Nature of Statistical Learning Theory (Information Science and Statistics). Springer, New York (1999)

55.

Veropoulos, K., Campbel, C., Cristianini, N.: Controlling the sensitivity of support vector machines. In: Proceedings of the International Joint Conference on AI, pp. 55–60 (1999)

56.

Visa, S., Ralescu, A.: Issues in mining imbalanced data sets—a review paper. In: Proceedings of the Sixteen Midwest Artificial Intelligence and Cognitive Science Conference, pp. 67–73 (2005)

57.

Wang, S., Yao, X.: Diversity analysis on imbalanced data sets by using ensemble models. In: IEEE Symposium Series on Computational Intelligence and Data Mining (IEEE CIDM 2009), pp. 324–331 (2009)

58.

Wen, W., Yu, T., Hayes, J.H.: Colua: automatically predicting configuration bug reports and extracting configuration options. In: 2016 IEEE 27th International Symposium on Software Reliability Engineering (ISSRE), pp. 150–161 (2016). https://doi.org/10.1109/ISSRE.2016.29

59.

Witten, I.H., Frank, E., Hall, M.A.: Data Mining: Practical Machine Learning Tools and Techniques, 3rd edn. Morgan Kaufmann, San Francisco (2011)

60.

Wohlin, C., Runeson, P., Höst, M., Ohlsson, M.C., Regnell, B., Wesslén, A.: Experimentation in Software Engineering. Springer, New York (2012)MATHCrossRef

61.

Xia, X., Feng, Y., Lo, D., Chen, Z., Wang, X.: Towards more accurate multi-label software behavior learning. In: 2014 Software Evolution Week—IEEE Conference on Software Maintenance, Reengineering and Reverse Engineering (CSMR-WCRE), pp. 134–143 (2014). https://doi.org/10.1109/CSMR-WCRE.2014.6747163

62.

Xia, X., Lo, D., Wang, X., Zhou, B.: Accurate developer recommendation for bug resolution. In: Proceedings of the 20th Working Conference Reverse Engineering (2013)

63.

Xia, X., Lo, D., Wang, X., Zhou, B.: Tag recommendation in software information sites. In: Proceedings of the 10th Working Conference on Mining Software Repositories, MSR’13, pp. 287–296. IEEE Press, Piscataway, NJ (2013). http://dl.acm.org/citation.cfm?id=2487085.2487140

64.

Zhang, M., Zhou, Z.: A review on multi-label learning algorithms. IEEE Trans. Knowl. Data Eng. 26(8), 1819–1837 (2014). https://doi.org/10.1109/TKDE.2013.39 CrossRef

65.

Zou, Q., Xie, S., Lin, Z., Wu, M., Ju, Y. (2016) Finding the best classification threshold in imbalanced classification. Big Data Res. 5, 2–8. Big data analytics and applications

Titel: Label prediction on issue tracking systems using text mining
verfasst von: Jesús M. Alonso-Abad
Carlos López-Nozal
Jesús M. Maudes-Raedo
Raúl Marticorena-Sánchez
Publikationsdatum: 28.03.2019
Verlag: Springer Berlin Heidelberg
Erschienen in: Progress in Artificial Intelligence / Ausgabe 3/2019
Print ISSN: 2192-6352
Elektronische ISSN: 2192-6360
DOI: https://doi.org/10.1007/s13748-019-00182-2

Springer Professional

Abstract

Bitte loggen Sie sich ein, um Zugang zu Ihrer Lizenz zu erhalten.

Sie haben noch keine Lizenz? Dann Informieren Sie sich jetzt über unsere Produkte:

Springer Professional "Wirtschaft"

Springer Professional "Wirtschaft+Technik"

Springer Professional "Technik"

Weitere Artikel der Ausgabe 3/2019

Advanced visualization of Twitter data for its analysis as a communication channel in traditional companies

Scaling up the learning-from-crowds GLAD algorithm using instance-difficulty clustering

Early anticipation of driver’s maneuver in semiautonomous vehicles using deep learning

WordificationMI: multi-relational data mining through multiple-instance propositionalization

OCAPIS: R package for Ordinal Classification and Preprocessing in Scala

A framework for evaluation in learning from label proportions