Published in: International Journal on Digital Libraries 2-3/2018

21 March 2017

Reuse and plagiarism in Speech and Natural Language Processing publications

Authors: Joseph Mariani, Gil Francopoulo, Patrick Paroubek


Abstract

This experiment presents a simple way to compare text fragments in order to detect (supposed) copy-and-paste operations between articles in the domain of Natural Language Processing (NLP), including Speech Processing. The search space is a corpus called NLP4NLP, which covers 34 conferences and journals and gathers a large part of NLP activity over the past 50 years. The study measures the similarity between the papers of each individual event and the complete set of papers in the corpus, according to four types of relationship (self-reuse, self-plagiarism, reuse and plagiarism) and in both directions: a paper borrowing a fragment of text from another paper of the corpus (which we call the source paper), or, in the reverse direction, fragments of text from the source paper being borrowed and inserted in another paper of the corpus. The results show that self-reuse is fairly common practice, that plagiarism appears to be very unusual, and that both stay within legal and ethical limits.
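
As a rough illustration of the kind of fragment comparison the abstract describes, the following Python sketch measures how many word windows of one text also occur in another, and then labels the pair with one of the four relationship types. It is not the authors' implementation. The function names, the whitespace tokenisation, the window size and the reading of the four labels (shared author or not, source cited or not) are all assumptions made for this example; the abstract names the categories without defining them.

```python
from typing import Set, Tuple


def word_windows(text: str, n: int = 7) -> Set[Tuple[str, ...]]:
    """Return the set of sliding windows of n consecutive lowercased tokens.

    Whitespace tokenisation and the default n are assumptions for this sketch.
    """
    tokens = text.lower().split()
    return {tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)}


def overlap_ratio(borrowing: str, source: str, n: int = 7) -> float:
    """Share of the borrowing text's windows that also occur in the source."""
    borrowed = word_windows(borrowing, n)
    original = word_windows(source, n)
    if not borrowed:
        return 0.0
    return len(borrowed & original) / len(borrowed)


def relationship(shared_author: bool, cites_source: bool) -> str:
    """Label a borrowing/source pair.

    Assumed reading of the four categories: 'self-' means the two papers
    share at least one author; 'reuse' means the source paper is cited,
    'plagiarism' means it is not.
    """
    if shared_author:
        return "self-reuse" if cites_source else "self-plagiarism"
    return "reuse" if cites_source else "plagiarism"


if __name__ == "__main__":
    source = "the quick brown fox jumps over the lazy dog near the river bank"
    borrowing = "we saw the quick brown fox jumps over a fence"
    print(overlap_ratio(borrowing, source, n=3))
    print(relationship(shared_author=True, cites_source=True))
```

In a real setting, a pair would presumably be reported only when the ratio exceeds some threshold, and the tokens would be normalised (for example lemmatised) before windowing; both choices are left open here.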


Footnotes
3
The total number of papers is 67,937, but papers from joint conferences are counted twice; the number falls to 65,003 when duplicate papers are counted only once. Similarly, there are 577 venues when all venues are counted, and 558 when the 19 joint conferences are counted only once.
 
10
TagParser is a tool created and distributed by Tagmatica (see www.tagmatica.com).
 
13
Also called “n-grams” in some NLP papers.
 
14
On this specific problem, for instance, PACLIC and COLING, which use a single-column format, give much better extraction quality than LREC and ACL, which use a two-column format.
 
15
It takes 69 h instead of 44 h on a mid-range mono-processor Xeon E3-1270 V2 with 32 GB of RAM.
 
16
Space limitations do not allow us to present these results in detail. Furthermore, we do not want to display personal results.
 
19
In this regard, the reader will find a certain degree of overlap between this paper and the one we published at LREC 2016 on reuse and plagiarism limited to the LREC papers, regarding the description of the NLP4NLP corpus and of the similarity measure algorithm.
 
References
1.
Barron-Cedeno, A., Potthast, M., Rosso, P., Stein, B., Eiselt, A.: Corpus and evaluation measures for automatic plagiarism detection. In: Proceedings of LREC 2010, pp. 771–774. Valletta (2010)
2.
Barron-Cedeno, A., Vila, M., Marti, M.A., Rosso, P.: Plagiarism meets paraphrasing: insights for the next generation in automatic plagiarism detection. Comput. Linguist. 39(4), 917–947 (2013)
3.
Bensalem, I., Rosso, P., Chikhi, S.: Intrinsic plagiarism detection using n-gram classes. In: Proceedings of the Conference on Empirical Methods in Natural Language Processing 2014, pp. 1459–1464. Doha (2014)
5.
Bird, S., Dale, R., Dorr, B.J., Gibson, B., Joseph, M.T., Kan, M.-Y., Dongwon, L., Powley, B., Radev, D.R., Tan, Y.F.: The ACL anthology reference corpus: a reference dataset for bibliographic research in computational linguistics. In: Proceedings of LREC 2008, pp. 1755–1759. Marrakesh (2008)
6.
Calzolari, N., Del Gratta, R., Francopoulo, G., Mariani, J., Rubino, F., Russo, I., Soria, C.: The LRE map. Harmonising community descriptions of resources. In: Proceedings of LREC 2012, pp. 1084–1089. Istanbul (2012)
7.
Ceska, Z., Fox, C.: The influence of text pre-processing on plagiarism detection. In: Proceedings of the Recent Advances in Natural Language Processing Conference 2009, pp. 55–59. Borovets (2009)
8.
Chong, M., Specia, L.: Lexical generalisation for word-level matching in plagiarism detection. In: Proceedings of the Recent Advances in Natural Language Processing Conference 2011, pp. 704–709. Hissar (2011)
10.
Clough, P., Gaizauskas, R., Piao, S.S.L., Wilks, Y.: Measuring text reuse. In: Proceedings of ACL’2002, pp. 152–159. Philadelphia (2002)
11.
Clough, P., Gaizauskas, R., Piao, S.S.L.: Building and annotating a corpus for the study of journalistic text reuse. In: Proceedings of LREC 2002, pp. 1678–1691. Las Palmas (2002)
12.
Clough, P., Stevenson, M.: Developing a corpus of plagiarised short answers. Lang. Resour. Eval. 45(1), 5–24 (2011)
13.
Councill, I.G., Giles, C.L., Kan, M.-Y.: ParsCit: an open-source CRF reference string parsing package. In: Proceedings of LREC 2008, pp. 661–667. Marrakesh (2008)
14.
Francopoulo, G.: TagParser: well on the way to ISO-TC37 conformance. In: Proceedings of ICGL (International Conference on Global Interoperability for Language Resources) 2008. Hong Kong (2008)
15.
Francopoulo, G., Marcoul, F., Causse, D., Piparo, G.: Global atlas: proper nouns, from Wikipedia to LMF. In: Francopoulo, G. (ed.) LMF Lexical Markup Framework. ISTE Wiley (2013)
17.
Francopoulo, G., Mariani, J., Paroubek, P.: A study of reuse and plagiarism in LREC papers. In: Proceedings of LREC 2016, pp. 72–83. Portorož (2016)
19.
Gaizauskas, R., Foster, J., Wilks, Y., Arundel, J., Clough, P., Piao, S.S.L.: The METER corpus: a corpus for analysing journalistic text reuse. In: Proceedings of the Corpus Linguistics Conference 2001, pp. 214–223. Lancaster (2001)
21.
Guo, Y., Che, W., Liu, T., Li, S.: A graph-based method for entity linking. In: Proceedings of the International Joint Conference on NLP 2011, pp. 1010–1018. Chiang Mai (2011)
22.
Gupta, P., Rosso, P.: Text reuse with ACL: (upward) trends. In: Proceedings ACL’2012 Special Workshop on Rediscovering 50 Years of Discoveries, pp. 76–82. Jeju (2012)
23.
Hoad, T.C., Zobel, J.: Methods for identifying versioned and plagiarised documents. J. Am. Soc. Inf. Sci. Technol. 54(3), 203–215 (2003)
24.
HaCohen-Kerner, Y., Tayeb, A., Ben-Dror, N.: Detection of simple plagiarism in computer science papers. In: Proceedings of the 23rd International Conference on Computational Linguistics (COLING), pp. 421–429. Beijing (2010)
25.
Kasprzak, J., Brandejs, M.: Improving the reliability of the plagiarism detection system lab. In: Proceedings of the Uncovering Plagiarism, Authorship and Social Software Misuse (PAN) at CLEF’2010. Padua (2010)
26.
Lyon, C., Malcolm, J., Dickerson, B.: Detecting short passages of similar text in large document collections. In: Proceedings of the Empirical Methods in Natural Language Processing Conference 2001, pp. 118–125. Pittsburgh (2001)
27.
Mariani, J., Paroubek, P., Francopoulo, G., Delaborde, M.: Rediscovering 25 years of discoveries in spoken language processing: a preliminary ISCA archive analysis. In: Proceedings of Interspeech 2013, pp. 4632–4669. Lyon (2013)
28.
Moro, A., Raganato, A., Navigli, R.: Entity linking meets word sense disambiguation: a unified approach. Trans. Assoc. Comput. Linguist. 2, 231–244 (2014)
29.
Nawab, R.M.A., Stevenson, M., Clough, P.: Detecting text reuse with modified and weighted n-grams. In: Proceedings of the First Joint Conference on Lexical and Computational Semantics, pp. 54–58. Montréal (2012)
30.
Potthast, M., Stein, B., Barron-Cedeno, A., Rosso, P.: An evaluation framework for plagiarism detection. In: Proceedings of the 23rd International Conference on Computational Linguistics (COLING), pp. 997–1005. Beijing (2010)
31.
Radev, D.R., Muthukrishnan, P., Qazvinian, V., Abu-Jbara, A.: The ACL anthology network corpus. Lang. Resour. Eval. 47(4), 919–944 (2013)
32.
Samuelson, P.: Self-plagiarism or fair use? Commun. ACM 37(8), 21–25 (1994)
33.
Stamatatos, E., Koppel, M.: Plagiarism and authorship analysis: introduction to the special issue. Lang. Resour. Eval. 45(1), 1–5 (2011)
34.
Stamatatos, E.: Plagiarism detection using stopword n-grams. J. Am. Soc. Inf. Sci. Technol. 62(12), 2512–2527 (2011)
35.
Stein, B., Lipka, N., Prettenhofer, P.: Intrinsic plagiarism analysis. Lang. Resour. Eval. 45(1), 63–82 (2011)
36.
Vilnat, A., Paroubek, P., de la Clergerie, E.V., Francopoulo, G., Guénot, M.-L.: PASSAGE syntactic representation: a minimal common ground for evaluation. In: Proceedings of LREC 2010, pp. 2478–2485. Valletta (2010)
Metadata
Title
Reuse and plagiarism in Speech and Natural Language Processing publications
Authors
Joseph Mariani
Gil Francopoulo
Patrick Paroubek
Publication date
21 March 2017
Publisher
Springer Berlin Heidelberg
Published in
International Journal on Digital Libraries / Issue 2-3/2018
Print ISSN: 1432-5012
Electronic ISSN: 1432-1300
DOI
https://doi.org/10.1007/s00799-017-0211-0
