Published in: Knowledge and Information Systems 2/2015

1 May 2015 | Regular Paper

Efficient clustering-based source code plagiarism detection using PIY

Authors: Tony Ohmann, Imad Rahal

Abstract

Vast amounts of information available online make plagiarism increasingly easy to commit, and this is particularly true of source code. The traditional approach to detecting copied work in a course setting is manual inspection. This is not only tedious but also typically misses code plagiarized from outside sources or even from an earlier offering of the course. Systems to automatically detect source code plagiarism exist but tend to focus on small submission sets. One such system that has become the standard in automated source code plagiarism detection is Measure of Software Similarity (MOSS) (Schleimer et al., in: Proceedings of the 2003 ACM SIGMOD international conference on management of data, ACM, San Diego, 2003). In this work, we present an approach called Program It Yourself (PIY), which is empirically shown to outperform MOSS in detection accuracy. By utilizing parallel processing and data clustering, PIY is also capable of maintaining detection accuracy and reasonable runtimes even when using extremely large data repositories.

References
1. Bhattacharjee A, Jamil HM (2013) CodeBlast: a two-stage algorithm for improved program similarity matching in large software repositories. In: Proceedings of the 28th annual ACM symposium on applied computing. ACM, Coimbra, pp 846–852
2. Chen X, Francia B, Li M et al (2004) Shared information and program plagiarism detection. IEEE Trans Inf Theory 50(7):1545–1551
3. Choi Y, Park Y, Choi J et al (2013) RAMC: runtime abstract memory context based plagiarism detection in binary code. In: Proceedings of the 7th international conference on ubiquitous information management and communication. Kota Kinabalu, Malaysia, pp 67–73
4. Chuda D, Navrat P, Kovacova B et al (2012) The issue of (software) plagiarism: a student view. IEEE Trans Educ 55(1):22–28
5. Cosma G, Joy M (2008) Towards a definition of source-code plagiarism. IEEE Trans Educ 51(2):195–200
6. El Bachir Menai M, Al-Hassoun NS (2010) Similarity detection in Java programming assignments. In: Proceedings of the 5th international conference on computer science and education (ICCSE). IEEE, Hefei, pp 356–361
7. Ester M, Kriegel HP, Sander J et al (1996) A density-based algorithm for discovering clusters in large spatial databases with noise. In: Proceedings of the 2nd international conference on knowledge discovery and data mining. AAAI Press, Portland, pp 226–231
8. Faidhi JA, Robinson S (1987) An empirical approach for detecting program similarity within a university programming environment. Comput Educ 11(1):11–19
9. Flores E, Barrón-Cedeño A, Rosso P et al (2011) Towards the detection of cross-language source code reuse. In: Proceedings of the 16th international conference on applications of natural language to information systems. Springer, Salford, pp 250–253
11. Klieman AB, Kowaltowski T (2009) Qualitative analysis and comparison of plagiarism-detection systems in student programs. Instituto de Computacao, Universidade Estadual de Campinas (UNICAMP), Sao Paulo, Brazil
12. Kaufman L, Rousseeuw PJ (1990) Finding groups in data: an introduction to cluster analysis. Wiley, New York
13. Kuo JY, Huang FC, Hung C et al (2012) The study of plagiarism detection for object-oriented programming. In: Proceedings of the 6th international conference on genetic and evolutionary computing (ICGEC). IEEE, Kitakyushu, pp 188–191
14. Lea D (2000) A Java fork/join framework. In: Proceedings of the ACM 2000 conference on Java Grande. ACM, San Francisco, pp 36–43
15. Lin C, Snyder L (2008) Principles of parallel programming. Addison-Wesley, Boston
16. Liu C, Chen C, Han J et al (2006) GPLAG: detection of software plagiarism by program dependence graph analysis. In: Proceedings of the ACM SIGKDD international conference on knowledge discovery and data mining. ACM, Philadelphia, pp 872–881
17. Marinescu D, Baicoianu A, Dimitriu S (2013) A plagiarism detection system in computer source code. Int J Comput Sci Res Appl 3(1):22–30
18. Ng RT, Han J (1994) Efficient and effective clustering methods for spatial data mining. In: Proceedings of the 20th international conference on very large data bases. Morgan Kaufmann Publishers, San Francisco, pp 144–155
19. Pang-Ning T, Steinbach M, Kumar V (2006) Introduction to data mining. Addison-Wesley, Boston
20. Prechelt L, Malpohl G, Philippsen M (2002) Finding plagiarisms among a set of programs with JPLAG. J Univers Comput Sci 8:1016–1038
21. Rahal I, Wang B, Schnepf J (2009) A primer on text-data analysis. In: (2nd ed) Encyclopedia of information science and technology. IGI Publishing, Hershey, pp 3111–3118
22. Rosales F, García A, Rodríguez S et al (2008) Detection of plagiarism in programming assignments. IEEE Trans Educ 51(2):174–183
23. Rousseeuw PJ (1987) Silhouettes: a graphical aid to the interpretation and validation of cluster analysis. J Comput Appl Math 20:53–65
24. Schleimer S, Wilkerson DS, Aiken A (2003) Winnowing: local algorithms for document fingerprinting. In: Proceedings of the 2003 ACM SIGMOD international conference on management of data. ACM, San Diego, pp 76–85
25. Van Rijsbergen CJ (1977) A theoretical basis for the use of co-occurrence data in information retrieval. J Doc 33(2):106–119
26. Whale G (1990) Identification of program similarity in large populations. Comput J 33(2):140–146
27. Wise MJ (1995) A system for comparing biological sequences using the running Karp-Rabin Greedy String-Tiling algorithm. In: Proceedings of the third international conference on intelligent systems for molecular biology. AAAI Press, Cambridge, pp 393–401
28. Wise MJ (1996) YAP3: improved detection of similarities in computer program and other texts. In: Proceedings of the ACM SIGCSE technical symposium on computer science education. ACM, Philadelphia, pp 130–134
Metadata
Title: Efficient clustering-based source code plagiarism detection using PIY
Authors: Tony Ohmann, Imad Rahal
Publication date: 1 May 2015
Publisher: Springer London
Published in: Knowledge and Information Systems, Issue 2/2015
Print ISSN: 0219-1377
Electronic ISSN: 0219-3116
DOI: https://doi.org/10.1007/s10115-014-0742-2
