2014 | Original Paper | Book chapter

Measuring Global Similarity Between Texts

Authors: Uli Fahrenberg, Fabrizio Biondi, Kevin Corre, Cyrille Jegourel, Simon Kongshøj, Axel Legay

Published in: Statistical Language and Speech Processing

Publisher: Springer International Publishing

Abstract

We propose a new similarity measure between texts which, unlike current state-of-the-art approaches, takes a global view of the texts to be compared. We have implemented a tool to compute our textual distance and conducted experiments on several text corpora. The experiments show that our methods can reliably identify different global types of texts.

Metadata
Title
Measuring Global Similarity Between Texts
Authors
Uli Fahrenberg
Fabrizio Biondi
Kevin Corre
Cyrille Jegourel
Simon Kongshøj
Axel Legay
Copyright year
2014
DOI
https://doi.org/10.1007/978-3-319-11397-5_17