Published in: Empirical Software Engineering 2/2018

21-07-2017

EnTagRec ++: An enhanced tag recommendation system for software information sites

Authors: Shaowei Wang, David Lo, Bogdan Vasilescu, Alexander Serebrenik


Abstract

Software engineers share experiences with modern technologies using software information sites, such as Stack Overflow. These sites allow developers to label posted content, referred to as software objects, with short descriptions, known as tags. Tags help to improve the organization of questions and simplify browsing for users. However, tags assigned to objects tend to be noisy, and some objects are not well tagged. For instance, 14.7% of the questions posted on Stack Overflow in 2015 needed tag re-editing after the initial assignment. To improve the quality of tags in software information sites, we propose EnTagRec ++, an advanced version of our prior work EnTagRec. Unlike EnTagRec, EnTagRec ++ not only integrates historical tag assignments to software objects, but also leverages information about users and an initial set of tags that a user may provide for tag recommendation. We evaluate its performance on five software information sites: Stack Overflow, Ask Ubuntu, Ask Different, Super User, and Freecode. Even without considering an initial set of tags that a user provides, it achieves Recall@5 scores of 0.821, 0.822, 0.891, 0.818 and 0.651, and Recall@10 scores of 0.873, 0.886, 0.956, 0.887 and 0.761, on Stack Overflow, Ask Ubuntu, Ask Different, Super User, and Freecode, respectively. In terms of Recall@5 and Recall@10, averaged across the five datasets, it improves upon TagCombine, the prior state-of-the-art approach, by 29.3% and 14.5%, respectively. Moreover, the performance of our approach is further boosted if users provide some initial tags that our approach can leverage to infer additional tags: when an initial set of tags is given, Recall@5 improves by 10%.
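
For readers unfamiliar with the metric, the following minimal sketch shows how Recall@k is commonly computed for tag recommendation: for each software object, the fraction of its true tags that appear among the top k recommended tags, averaged over all objects. It assumes this standard definition; the abstract does not spell out the exact formula, and the function names and sample tags below are hypothetical illustrations rather than part of EnTagRec ++.

```python
from typing import Sequence, Set


def recall_at_k(ranked_tags: Sequence[str], true_tags: Set[str], k: int) -> float:
    """Fraction of an object's true tags found in the top-k recommendations."""
    if not true_tags:
        raise ValueError("true_tags must not be empty")
    top_k = set(ranked_tags[:k])
    return len(top_k & true_tags) / len(true_tags)


def mean_recall_at_k(
    recommendations: Sequence[Sequence[str]],
    ground_truth: Sequence[Set[str]],
    k: int,
) -> float:
    """Average Recall@k over a collection of software objects."""
    if len(recommendations) != len(ground_truth):
        raise ValueError("recommendations and ground_truth must align")
    scores = [recall_at_k(r, t, k) for r, t in zip(recommendations, ground_truth)]
    return sum(scores) / len(scores)


# Hypothetical example: two questions with ranked recommendations and true tags.
recommended = [
    ["java", "android", "xml", "eclipse", "sqlite"],
    ["python", "django", "flask", "html", "css"],
]
actual = [
    {"java", "android", "gradle"},  # 2 of 3 true tags recovered
    {"python", "django"},           # 2 of 2 true tags recovered
]
score = mean_recall_at_k(recommended, actual, k=5)
```

Under this definition, a Recall@5 of 0.821 would mean that, on average, about 82% of each object's true tags appear among the first five recommendations.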


Footnotes
3
Since the implementation of Stack Overflow’s proprietary system is, to the best of our knowledge, not documented publicly, a meaningful comparison was not possible.
 
13
Our experiments show that the effectiveness of UIC substantially degrades if it takes into consideration all tags.
 
14
By construction, γ is an extra weight given to some of the tags in \(T_{\mathit{BIC} \cup \mathit{FIC}}\).
 
15
Since \(\text{EnTagRec++}_{o}(t)\) is itself a probability score, it could also be expressed as a function of only three coefficients α′, β′, and γ′, with the fourth being automatically 1 − α′ − β′ − γ′. We chose the four-coefficient expression to better reflect the four components of EnTagRec ++.
 
Metadata
Title
EnTagRec ++: An enhanced tag recommendation system for software information sites
Authors
Shaowei Wang
David Lo
Bogdan Vasilescu
Alexander Serebrenik
Publication date
21-07-2017
Publisher
Springer US
Published in
Empirical Software Engineering / Issue 2/2018
Print ISSN: 1382-3256
Electronic ISSN: 1573-7616
DOI
https://doi.org/10.1007/s10664-017-9533-1
