Published in: Empirical Software Engineering 3/2024

01-05-2024

CoRT: Transformer-based code representations with self-supervision by predicting reserved words for code smell detection

Authors: Amal Alazba, Hamoud Aljamaan, Mohammad Alshayeb



Abstract

Context

Code smell detection is the process of identifying poorly designed and implemented code fragments. Machine learning-based approaches require large amounts of manually labeled data, which is costly and difficult to scale. Unsupervised semantic feature learning, that is, learning without manual annotation, is vital for effectively harvesting the enormous amount of available data.

Objective

The objective of this study is to propose a new code smell detection approach that utilizes self-supervised learning to learn intermediate representations without the need for labels and then fine-tune these representations on multiple tasks.

Method

We propose Code Representation with Transformers (CoRT), which learns the semantic and structural features of source code by training transformers to recognize masked reserved words in the input code. We empirically demonstrate that this proxy task provides a powerful method for learning semantic and structural features. We exhaustively evaluated our approach on four downstream tasks: detecting the Data Class, God Class, Feature Envy, and Long Method code smells. Moreover, we compared our results with those of two paradigms: supervised learning and a feature-based approach. Finally, we conducted a cross-project experiment to evaluate the generalizability of our method to unseen labeled data.
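The masking proxy task can be sketched as follows. This is a minimal Python illustration, not the paper's implementation: the reserved-word set is a small subset of Java's keywords, and the `mask_reserved_words` helper, the crude regex tokenizer, and the `[MASK]` token are assumptions made for demonstration.

```python
import re

# Small subset of Java reserved words; the full keyword set is assumed larger.
RESERVED_WORDS = {"public", "class", "return", "if", "else",
                  "for", "while", "int", "void", "static"}

def mask_reserved_words(code, mask_token="[MASK]"):
    """Replace each reserved word with a mask token and record the labels.

    Returns the masked code and the list of masked-out words, which serve
    as prediction targets for the self-supervised proxy task.
    """
    tokens = re.findall(r"\w+|\S", code)  # crude tokenizer: words or single symbols
    masked, labels = [], []
    for tok in tokens:
        if tok in RESERVED_WORDS:
            labels.append(tok)
            masked.append(mask_token)
        else:
            masked.append(tok)
    return " ".join(masked), labels

snippet = "public static int add(int a, int b) { return a + b; }"
masked_code, targets = mask_reserved_words(snippet)
```

A model trained to recover `targets` from `masked_code` must learn how syntax and structure constrain which reserved word fits each position, which is the intuition behind using this as a pretext task.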

Results

The results indicate that the proposed method detects code smells with high performance. For instance, on Data Class, CoRT achieved an F1 score between 88.08 and 99.4, an Area Under the Curve (AUC) between 89.62 and 99.88, and a Matthews Correlation Coefficient (MCC) between 75.28 and 98.8, while on God Class it achieved an F1 between 86.32 and 99.03, an AUC between 92.1 and 99.85, and an MCC between 76.15 and 98.09. Compared with the baseline model and the feature-based approach, CoRT achieved better detection performance and showed a strong ability to detect code smells in unseen datasets.
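The reported metrics can be computed from a binary confusion matrix. The sketch below is a stdlib-only illustration with toy labels, not the paper's evaluation code; AUC is omitted since it requires ranking scores rather than hard predictions.

```python
import math

def confusion(y_true, y_pred):
    """Count true/false positives and negatives for binary labels."""
    tp = sum(t == 1 and p == 1 for t, p in zip(y_true, y_pred))
    tn = sum(t == 0 and p == 0 for t, p in zip(y_true, y_pred))
    fp = sum(t == 0 and p == 1 for t, p in zip(y_true, y_pred))
    fn = sum(t == 1 and p == 0 for t, p in zip(y_true, y_pred))
    return tp, tn, fp, fn

def f1(y_true, y_pred):
    """F1: harmonic mean of precision and recall."""
    tp, tn, fp, fn = confusion(y_true, y_pred)
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    return 2 * precision * recall / (precision + recall)

def mcc(y_true, y_pred):
    """Matthews Correlation Coefficient over the full confusion matrix."""
    tp, tn, fp, fn = confusion(y_true, y_pred)
    denom = math.sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))
    return (tp * tn - fp * fn) / denom

# Toy example: 1 = smelly, 0 = clean (illustrative labels only)
y_true = [1, 1, 1, 0, 0, 0, 1, 0]
y_pred = [1, 1, 0, 0, 0, 1, 1, 0]
```

Unlike F1, MCC uses all four confusion-matrix cells, which makes it a more balanced summary when smelly and clean classes are imbalanced.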

Conclusions

The proposed method has been shown to be effective in detecting both class-level and method-level code smells.


Metadata
Title
CoRT: Transformer-based code representations with self-supervision by predicting reserved words for code smell detection
Authors
Amal Alazba
Hamoud Aljamaan
Mohammad Alshayeb
Publication date
01-05-2024
Publisher
Springer US
Published in
Empirical Software Engineering / Issue 3/2024
Print ISSN: 1382-3256
Electronic ISSN: 1573-7616
DOI
https://doi.org/10.1007/s10664-024-10445-9
