Published in: Empirical Software Engineering 2/2023

01-03-2023

Evaluating software user feedback classifier performance on unseen apps, datasets, and metadata

Authors: Peter Devine, Yun Sing Koh, Kelly Blincoe

Abstract

Understanding users’ needs is crucial to building and maintaining high-quality software. Online software user feedback has been shown to contain large amounts of information useful to requirements engineering (RE). Previous studies have created machine learning classifiers for parsing this feedback for development insight. While these classifiers report generally good performance when evaluated on a test set, questions remain as to how well they extend to unseen data in various forms. This study evaluates machine learning classifiers’ performance on feedback for two common classification tasks (classifying bug reports and feature requests). Using seven datasets from prior research studies, we investigate the performance of classifiers when evaluated on feedback from different apps than those contained in the training set and when evaluated on completely different datasets (coming from different feedback channels and/or labelled by different researchers). We also measure the difference in performance of using channel-specific metadata as a feature in classification. We find that using metadata as features in classifying bug reports and feature requests does not lead to a statistically significant improvement in the majority of datasets tested. We also demonstrate that classification performance is similar on feedback from unseen apps compared to seen apps in the majority of cases tested. However, the classifiers evaluated do not perform well on unseen datasets. We show that multi-dataset training or zero-shot classification approaches can somewhat mitigate this performance decrease. We discuss the implications of these results on developing user feedback classification models to analyse and extract software requirements.
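The comparison described in the abstract (supervised bug-report/feature-request classifiers trained on feedback text, with and without channel-specific metadata as an additional feature) can be illustrated with a minimal sketch. This is not the authors' actual pipeline, datasets, or feature set; the example reviews, the `rating` metadata column, and the model choice (TF-IDF plus logistic regression) are assumptions made purely for illustration.

```python
# Minimal sketch, NOT the paper's pipeline: a supervised bug-report classifier
# trained on feedback text, optionally augmented with channel-specific metadata
# (here a hypothetical star rating). Toy data for illustration only.
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import Pipeline

# Hypothetical labelled feedback: text, app-store metadata, and a binary label.
df = pd.DataFrame({
    "text": [
        "App crashes when I open the camera",
        "Please add a dark mode option",
        "Love this app, works great",
        "It freezes on startup after the last update",
    ],
    "rating": [1, 4, 5, 2],          # channel-specific metadata feature
    "is_bug_report": [1, 0, 0, 1],   # label for the bug-report task
})

def build_classifier(use_metadata: bool) -> Pipeline:
    """Text-only classifier, or text plus metadata, mirroring the
    with/without-metadata comparison described in the abstract."""
    transformers = [("tfidf", TfidfVectorizer(ngram_range=(1, 2)), "text")]
    if use_metadata:
        transformers.append(("meta", "passthrough", ["rating"]))
    return Pipeline([
        ("features", ColumnTransformer(transformers)),
        ("model", LogisticRegression(max_iter=1000)),
    ])

for use_metadata in (False, True):
    clf = build_classifier(use_metadata)
    clf.fit(df[["text", "rating"]], df["is_bug_report"])
    new = pd.DataFrame({"text": ["The app crashes whenever I upload a photo"],
                        "rating": [1]})
    print(f"metadata={use_metadata}: predicted bug report ->", clf.predict(new)[0])
```

The abstract also mentions zero-shot classification as one way to mitigate the performance drop on unseen datasets. One common realisation of this idea (an assumption here; the paper's exact model and label phrasing may differ) is an NLI-based zero-shot pipeline, which needs no task-specific training data and therefore does not depend on which apps or channels the labelled datasets came from:

```python
# Zero-shot sketch (an assumption, not necessarily the paper's exact setup):
# an off-the-shelf NLI model scores candidate labels with no task-specific training.
from transformers import pipeline

zero_shot = pipeline("zero-shot-classification", model="facebook/bart-large-mnli")
result = zero_shot(
    "The app crashes whenever I upload a photo",
    candidate_labels=["bug report", "feature request", "other"],
)
print(result["labels"][0], round(result["scores"][0], 3))
```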


Footnotes
3
This dataset contains feedback that is labelled as “Error”. While classification of this class of feedback is not reported in the paper, we use this class as our bug report class.
 
4
The replication package contains two datasets, corresponding to research questions 1 and 3 of this study. The latter is a pre-filtered set of feedback (filtered to contain only requirements-relevant feedback) used to measure clustering performance rather than classification. Therefore, although no classification metrics are reported for this RQ3 dataset (Dataset E), we still use it for training and testing models.
 
Metadata
Title
Evaluating software user feedback classifier performance on unseen apps, datasets, and metadata
Authors
Peter Devine
Yun Sing Koh
Kelly Blincoe
Publication date
01-03-2023
Publisher
Springer US
Published in
Empirical Software Engineering / Issue 2/2023
Print ISSN: 1382-3256
Electronic ISSN: 1573-7616
DOI
https://doi.org/10.1007/s10664-022-10254-y
