
2024 | OriginalPaper | Chapter

Source Code Clone Detection Using Unsupervised Similarity Measures

Author: Jorge Martinez-Gil

Published in: Software Quality as a Foundation for Security

Publisher: Springer Nature Switzerland

Abstract

Assessing similarity in source code has gained significant attention in recent years because of its importance in software engineering tasks such as clone detection, code search, and code recommendation. This work presents a comparative analysis of unsupervised similarity measures for source code clone detection. The goal is to give an overview of the current state-of-the-art techniques, together with their strengths and weaknesses. To that end, we compile the existing unsupervised strategies and evaluate their performance on a benchmark dataset, so that software engineers can select appropriate methods for their specific use cases. The source code of this study is available at https://github.com/jorge-martinez-gil/codesim.
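To make the idea of an unsupervised similarity measure concrete, here is a minimal sketch of one common family: Jaccard similarity over token n-grams. It is an illustration only and is not taken from the chapter or its repository. The function names (tokenize, ngrams, jaccard_similarity, is_clone), the simple regex tokenizer, and the 0.5 threshold are all assumptions chosen for the example.

```python
import re


def tokenize(code: str) -> list[str]:
    # Hypothetical, language-agnostic tokenizer: identifiers/keywords,
    # integer literals, or any single non-whitespace symbol.
    return re.findall(r"[A-Za-z_]\w*|\d+|\S", code)


def ngrams(tokens: list[str], n: int = 2) -> set[tuple[str, ...]]:
    # Set of contiguous token n-grams (empty if there are fewer than n tokens).
    return {tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)}


def jaccard_similarity(code_a: str, code_b: str, n: int = 2) -> float:
    # Jaccard index |A ∩ B| / |A ∪ B| over the two n-gram sets.
    grams_a = ngrams(tokenize(code_a), n)
    grams_b = ngrams(tokenize(code_b), n)
    if not grams_a and not grams_b:
        return 1.0  # assumption: two empty fragments count as identical
    return len(grams_a & grams_b) / len(grams_a | grams_b)


def is_clone(code_a: str, code_b: str, threshold: float = 0.5) -> bool:
    # The threshold is an illustrative assumption; in practice it would be
    # tuned against a benchmark dataset.
    return jaccard_similarity(code_a, code_b) >= threshold


if __name__ == "__main__":
    a = "int sum(int a, int b) { return a + b; }"
    b = "int add(int x, int y) { return x + y; }"
    print(jaccard_similarity(a, b))
```

No labeled training data is needed here: the score is computed directly from the two fragments, and only the decision threshold has to be chosen. That property is what makes such a measure unsupervised.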

Metadata
Title
Source Code Clone Detection Using Unsupervised Similarity Measures
Author
Jorge Martinez-Gil
Copyright Year
2024
DOI
https://doi.org/10.1007/978-3-031-56281-5_2
