Published in: Empirical Software Engineering 3/2024

01.05.2024

CoRT: Transformer-based code representations with self-supervision by predicting reserved words for code smell detection

Authors: Amal Alazba, Hamoud Aljamaan, Mohammad Alshayeb


Abstract

Context

Code smell detection is the process of identifying poorly designed and implemented pieces of code. Machine learning-based approaches require enormous amounts of manually labeled data, which are costly to produce and difficult to scale. Unsupervised semantic feature learning, i.e., learning without manual annotation, is therefore vital for effectively harvesting the enormous amount of available data.

Objective

The objective of this study is to propose a new code smell detection approach that uses self-supervised learning to learn intermediate representations without the need for labels, and then fine-tunes these representations on multiple downstream tasks.

Method

We propose Code Representation with Transformers (CoRT), which learns the semantic and structural features of source code by training a Transformer to recognize reserved words that have been masked in the input code. We empirically demonstrate that this proxy task provides a powerful means of learning semantic and structural features. We exhaustively evaluate our approach on four downstream tasks: detecting the Data Class, God Class, Feature Envy, and Long Method code smells. Moreover, we compare our results with those of two paradigms: supervised learning and a feature-based approach. Finally, we conduct a cross-project experiment to evaluate how well our method generalizes to unseen labeled data.
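To make the proxy task concrete, here is a minimal sketch of masking Java reserved words in a tokenized code snippet. This is our own illustration, not the authors' implementation: the keyword subset, mask token, and masking probability are all assumptions.

```python
import random

# Hypothetical subset of Java reserved words; the actual approach would
# draw on the full Java keyword list.
JAVA_RESERVED = {
    "public", "private", "static", "void", "int", "return",
    "if", "else", "for", "while", "class", "new", "final",
}

MASK = "<mask>"  # assumed mask token

def mask_reserved_words(tokens, mask_prob=0.5, seed=None):
    """Replace a random subset of reserved-word tokens with MASK.

    Returns the masked token sequence and (position, original word)
    pairs that serve as prediction targets for the proxy task.
    """
    rng = random.Random(seed)
    masked, targets = [], []
    for i, tok in enumerate(tokens):
        if tok in JAVA_RESERVED and rng.random() < mask_prob:
            masked.append(MASK)
            targets.append((i, tok))
        else:
            masked.append(tok)
    return masked, targets

code = "public static int add ( int a , int b ) { return a + b ; }".split()
masked, targets = mask_reserved_words(code, seed=0)
print(" ".join(masked))  # masked sequence fed to the Transformer
print(targets)           # gold reserved words the model must recover
```

Because the targets are derived from the code itself, no manual annotation is needed; the Transformer must attend to the surrounding syntactic and semantic context to recover each masked keyword.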

Results

The results indicate that the proposed method achieves high detection performance for code smells. For instance, on Data Class, CoRT achieved an F1 score between 88.08 and 99.4, an Area Under Curve (AUC) between 89.62 and 99.88, and a Matthews Correlation Coefficient (MCC) between 75.28 and 98.8, while on God Class it achieved an F1 between 86.32 and 99.03, an AUC between 92.1 and 99.85, and an MCC between 76.15 and 98.09. Compared with the baseline model and the feature-based approach, CoRT achieved better detection performance and showed a high capability to detect code smells in unseen datasets.
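For reference, the three reported metrics can be computed as follows. This is a minimal sketch using scikit-learn on toy data, scaled to the 0-100 range of the figures above; it is not the authors' evaluation script.

```python
from sklearn.metrics import f1_score, matthews_corrcoef, roc_auc_score

# Toy binary smell-detection outputs: 1 = smelly, 0 = clean.
y_true = [1, 0, 1, 1, 0, 0, 1, 0]
y_prob = [0.9, 0.2, 0.8, 0.6, 0.3, 0.4, 0.7, 0.1]  # model scores
y_pred = [int(p >= 0.5) for p in y_prob]            # thresholded labels

print(f"F1:  {100 * f1_score(y_true, y_pred):.2f}")
print(f"AUC: {100 * roc_auc_score(y_true, y_prob):.2f}")
print(f"MCC: {100 * matthews_corrcoef(y_true, y_pred):.2f}")
```

Note that AUC is computed from the continuous scores, while F1 and MCC are computed from the thresholded predictions; MCC is the strictest of the three, since it accounts for all four cells of the confusion matrix.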

Conclusions

The proposed method has been shown to be effective in detecting both class-level and method-level code smells.


Metadata
Title
CoRT: Transformer-based code representations with self-supervision by predicting reserved words for code smell detection
Authors
Amal Alazba
Hamoud Aljamaan
Mohammad Alshayeb
Publication date
01.05.2024
Publisher
Springer US
Published in
Empirical Software Engineering / Issue 3/2024
Print ISSN: 1382-3256
Electronic ISSN: 1573-7616
DOI
https://doi.org/10.1007/s10664-024-10445-9
