2017 | OriginalPaper | Chapter

Intrinsic Plagiarism Detection with Feature-Rich Imbalanced Dataset Learning

Authors: Andrianna Polydouri, Georgios Siolas, Andreas Stafylopatis

Published in: Engineering Applications of Neural Networks

Publisher: Springer International Publishing

Abstract

In intrinsic plagiarism detection, the goal is to discover plagiarised passages in a text based on stylistic changes and inconsistencies within the document itself. The main idea is to profile the style of the original author and to mark as outliers the passages that differ significantly from that profile. Besides some novel stylistic and semantic features, the present work proposes a new approach to the problem in which machine learning plays a significant role. Notably, we also consider, for the first time, the imbalance of the training dataset as a major parameter of the intrinsic plagiarism detection problem. Our detection system is tested on the corpora of the PAN Webis intrinsic plagiarism detection shared tasks of 2009 and 2011 and is compared with the results of the highest-scoring participants.
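
The profile-and-outlier idea described in the abstract can be illustrated with a minimal sketch. This is not the authors' system: the two features (mean word length and mean sentence length), the z-score rule, the threshold of 1.5, and the names `style_features` and `flag_outliers` are all assumptions chosen for illustration. The chapter itself relies on a richer feature set and on machine learning with an imbalanced training set, neither of which is reproduced here.

```python
import re
import statistics


def style_features(passage):
    """Two toy stylistic features (hypothetical, for illustration only):
    mean word length and mean sentence length in words."""
    words = re.findall(r"[A-Za-z']+", passage)
    sentences = [s for s in re.split(r"[.!?]+", passage) if s.strip()]
    if not words or not sentences:
        return (0.0, 0.0)
    mean_word_len = sum(len(w) for w in words) / len(words)
    mean_sent_len = len(words) / len(sentences)
    return (mean_word_len, mean_sent_len)


def flag_outliers(passages, threshold=1.5):
    """Return sorted indices of passages whose features deviate from the
    document-wide profile by more than `threshold` standard deviations."""
    feats = [style_features(p) for p in passages]
    flagged = set()
    for dim in range(2):
        column = [f[dim] for f in feats]
        mean = statistics.mean(column)
        std = statistics.pstdev(column)
        if std == 0:
            continue  # every passage is identical on this feature
        for i, value in enumerate(column):
            if abs(value - mean) / std > threshold:
                flagged.add(i)
    return sorted(flagged)


# Toy document: four passages in a plain style, one in a dense academic style.
passages = [
    "The cat sat on the mat. The dog ran to the park.",
    "The sun was hot. We sat by the lake all day.",
    "She had a red hat. He had a blue cap.",
    "Notwithstanding considerable methodological heterogeneity, longitudinal "
    "investigations consistently demonstrate statistically significant correlations.",
    "The boy ate his food. Then he went to bed.",
]
suspicious = flag_outliers(passages)
print(suspicious)
```

On this toy input only the fourth passage (index 3) is flagged. A fixed threshold like this is fragile on real documents, where plagiarised passages are typically a small minority of the text; that imbalance is the issue the abstract singles out.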

Metadata
Title
Intrinsic Plagiarism Detection with Feature-Rich Imbalanced Dataset Learning
Authors
Andrianna Polydouri
Georgios Siolas
Andreas Stafylopatis
Copyright Year
2017
DOI
https://doi.org/10.1007/978-3-319-65172-9_9
