
2020 | OriginalPaper | Chapter

Key Passages: From Statistics to Deep Learning

Authors: Laurent Vanni, Marco Corneli, Dominique Longrée, Damon Mayaffre, Frédéric Precioso

Published in: Text Analytics

Publisher: Springer International Publishing


Abstract

This contribution compares statistical analysis and deep learning approaches to textual data. Key passages are extracted with both statistics and deep learning, as implemented in the Hyperbase software, and the underlying calculations are evaluated on examples from two languages, French and Latin. Our hypothesis is that deep learning is sensitive not only to word frequency but also to more complex phenomena involving linguistic features that pose problems for statistical approaches. These linguistic patterns, also known as motives (Mellet and Longrée, Belg J Linguist 23:161–173, 2009 [9]), are essential for highlighting key passages. If confirmed, this hypothesis would give us a better understanding of the deep learning black box and would open new ways of understanding and interpreting texts. The paper therefore introduces a novel approach for exploring the hidden layers of a convolutional neural network, in order to explain which linguistic features the network relies on to perform the classification task. This explanation attempt is the major contribution of this work. Finally, to show the potential of our deep learning approach, we test it on the two corpora (French and Latin) and compare the linguistic features obtained with those highlighted by a standard text mining technique (z-score computation).
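As a rough illustration of the statistical baseline mentioned in the abstract, the sketch below scores how over-represented a word is in one part of a corpus. It assumes the usual hypergeometric model of lexical specificity (see footnote 2) together with its normal-approximation z-score. It is not the Hyperbase implementation: the function names, parameter names, and toy counts are hypothetical and chosen only for this example.

```python
from math import comb, sqrt


def hypergeom_pmf(k: int, corpus_size: int, word_total: int, part_size: int) -> float:
    """P(X = k): probability that a word occurring `word_total` times in a
    corpus of `corpus_size` tokens appears exactly k times in a part of
    `part_size` tokens drawn without replacement."""
    return (
        comb(word_total, k)
        * comb(corpus_size - word_total, part_size - k)
        / comb(corpus_size, part_size)
    )


def over_representation_pvalue(
    observed: int, corpus_size: int, word_total: int, part_size: int
) -> float:
    """P(X >= observed): the smaller this is, the more the word is
    over-represented in the part."""
    upper = min(word_total, part_size)
    return sum(
        hypergeom_pmf(k, corpus_size, word_total, part_size)
        for k in range(observed, upper + 1)
    )


def z_score(observed: int, corpus_size: int, word_total: int, part_size: int) -> float:
    """Standardized deviation of the observed count from its expectation
    under the hypergeometric model."""
    p = word_total / corpus_size
    mean = part_size * p
    # Hypergeometric variance, including the finite-population correction.
    var = part_size * p * (1 - p) * (corpus_size - part_size) / (corpus_size - 1)
    return (observed - mean) / sqrt(var)


if __name__ == "__main__":
    # Toy counts (invented): a word seen 20 times in a 1000-token corpus,
    # 6 of them inside a 100-token passage.
    print(over_representation_pvalue(6, 1000, 20, 100))
    print(z_score(6, 1000, 20, 100))
```

A passage-level score could then be obtained by aggregating such word scores over the tokens of each passage. That aggregation step is an assumption of this sketch, not something the abstract specifies.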


Footnotes
1. This contribution has been funded by the French government through the Agence Nationale de la Recherche, project Investissement d’Avenir UCA\(^{\text{JEDI}}\), n\(^{\circ}\) ANR-15-IDEX-01.
2. Also known as the z-score. Specificity in textual data analysis has been based on the hypergeometric distribution since [5].
3. Also known as motives: complex linguistic objects with variable and discontinuous spans.
4. The software is used by the UMR Bases, Corpus, Language and was developed in collaboration with the LASLA. A first local version was coded by Etienne Brunet. A recent web version was developed by Laurent Vanni: http://hyperbase.unice.fr/.
5. In general, \(x \in \mathbb{R}^N\) can be read as “a vector of N real components.”
6. The number of tokens for each text is one of several hyper-parameters of the network.
 
Literature
1. Bojanowski P, Grave E, Joulin A, Mikolov T (2017) Enriching word vectors with subword information. Trans Assoc Comput Linguist 5:135–146
2. Ducoffe M, Precioso F, Arthur A, Mayaffre D, Lavigne F, Vanni L (2016) Machine learning under the light of phraseology expertise: use case of presidential speeches, De Gaulle - Hollande (1958–2016). Actes de JADT 2016:155–168
3. Joulin A, Grave E, Bojanowski P, Mikolov T (2017) Bag of tricks for efficient text classification. EACL, 427
4. Krizhevsky A, Sutskever I, Hinton GE (2012) ImageNet classification with deep convolutional neural networks. Adv Neural Inf Process Syst 1097–1105
5. Lafon P (1984) Dépouillements et statistiques en lexicométrie. Slatkine-Champion, Genève-Paris
6. Lebart L (1997) Réseaux de neurones et analyse des correspondances. In: Modulad (INRIA Paris), vol 18, pp 21–37
7. Longrée D, Mellet S (2013) Le motif : une unité phraséologique englobante ? Étendre le champ de la phraséologie de la langue au discours. Langages 189:65–79
9. Mellet S, Longrée D (2009) Syntactical motifs and textual structures. Belg J Linguist 23:161–173
10. Mellet S, Longrée D (2012) Légitimité d’une unité textométrique : le motif. Actes de JADT 2012:715–728
11. Mikolov T, Sutskever I, Chen K, Corrado GS, Dean J (2013) Distributed representations of words and phrases and their compositionality. Adv Neural Inf Process Syst 3111–3119
12. Pennington J, Socher R, Manning C (2014) GloVe: global vectors for word representation. In: Proceedings of the 2014 conference on empirical methods in natural language processing (EMNLP), pp 1532–1543
13. Quiniou S, Cellier P, Charnois T, Legallois D (2012) Fouille de données pour la stylistique : cas des motifs séquentiels émergents. In: Actes de JADT 2012
14. Rastier F (2007) Passages. Corpus 6:25–54
15. Rastier F (2011) La mesure et le grain: sémantique de corpus. Champion; diff, Slatkine
16. Salem A (1987) Pratique des segments répétés. Essai de statistique textuelle. Klincksieck, Paris
17. Vanni L, Ducoffe M, Precioso F, Mayaffre D, Longrée D et al (2018) Text deconvolution saliency (TDS): a deep tool box for linguistic analysis. In: Proceedings of the 56th annual meeting of the association for computational linguistics (volume 1: long papers). Melbourne, Australia. Association for Computational Linguistics, pp 548–557
18. Vanni L, Mittmann A (2016) Cooccurrences spécifiques et représentations graphiques, le nouveau thème d’Hyperbase. Actes de JADT 2016:295–305
Metadata
Title: Key Passages: From Statistics to Deep Learning
Authors: Laurent Vanni, Marco Corneli, Dominique Longrée, Damon Mayaffre, Frédéric Precioso
Copyright Year: 2020
DOI: https://doi.org/10.1007/978-3-030-52680-1_4
