2020 | OriginalPaper | Chapter

Key Passages : From Statistics to Deep Learning

Authors : Laurent Vanni, Marco Corneli, Dominique Longrée, Damon Mayaffre, Frédéric Precioso

Published in: Text Analytics

Publisher: Springer International Publishing

Abstract

This contribution compares statistical and deep learning approaches to the analysis of textual data. Key passages are extracted with both approaches using the Hyperbase software, and the underlying calculations are evaluated on examples from two languages, French and Latin. Our hypothesis is that deep learning is sensitive not only to word frequency but also to more complex phenomena involving linguistic features that pose problems for statistical approaches. These linguistic patterns, also known as motives (Mellet and Longrée, Belg J Linguist 23:161–173, 2009 [9]), are essential for highlighting key passages. If confirmed, this hypothesis would give us a better understanding of the deep learning black box and would open new ways of understanding and interpreting texts. This paper therefore introduces a novel approach for exploring the hidden layers of a convolutional neural network in order to explain which linguistic features the network relies on to perform the classification task. This attempt at explanation is the major contribution of this work. Finally, to show the potential of our deep learning approach, we test it on the two corpora (French and Latin) and compare the linguistic features it yields with those highlighted by a standard text mining technique (z-score computation).

This contribution has been funded by the French government, Agence Nationale de la Recherche, project Investissement d’Avenir UCA\(^{\text {JEDI}}\) n\(^{\circ }\) ANR-15-IDEX-01.

Specificity, also known as the z-score in textual data analysis since [5], is based on the hypergeometric distribution.
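
As a rough illustration of a hypergeometric specificity score, the sketch below computes the probability of seeing a word at least f times in a sub-corpus, together with a normal-approximation z-score. It is a minimal sketch under stated assumptions, not the calculation implemented in Hyperbase: the function names and the symbols T (corpus size), F (total frequency of the word), t (sub-corpus size) and f (observed frequency in the sub-corpus) are chosen here for illustration only.

```python
from math import comb, sqrt


def hypergeom_pmf(k, T, F, t):
    """P(X = k): probability that a word occurring F times in a corpus
    of T tokens occurs exactly k times in a sub-corpus of t tokens."""
    return comb(F, k) * comb(T - F, t - k) / comb(T, t)


def specificity_tail(f, T, F, t):
    """Upper-tail probability P(X >= f); a small value suggests the word
    is over-represented in the sub-corpus."""
    upper = min(F, t)  # the word cannot occur more than F or t times
    return sum(hypergeom_pmf(k, T, F, t) for k in range(f, upper + 1))


def z_score(f, T, F, t):
    """Normal approximation (assumed here): observed minus expected
    frequency, divided by the hypergeometric standard deviation."""
    p = F / T
    expected = t * p
    variance = t * p * (1 - p) * (T - t) / (T - 1)
    return (f - expected) / sqrt(variance)


# Toy numbers (hypothetical): a word seen 4 times in a 10-token corpus,
# 3 of them inside a 5-token sub-corpus.
tail = specificity_tail(3, T=10, F=4, t=5)
z = z_score(3, T=10, F=4, t=5)
```

A positive z-score indicates a word that is more frequent in the sub-corpus than expected, a negative one a word that is less frequent.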

Also known as motives: complex linguistic objects with variable and discontinuous spans.

The software is used by the UMR Bases, Corpus, Language and was developed in collaboration with the LASLA. A first local version was coded by Etienne Brunet. A recent web version was developed by Laurent Vanni (http://hyperbase.unice.fr/).

In general, \(x \in \mathbb {R}^N\) can be read as “a vector of N real components.”

The number of tokens for each text is one of several hyper-parameters of the network.
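
To make the role of this hyper-parameter concrete, the sketch below forces a token sequence to a fixed length by truncating or padding it. It assumes the network takes fixed-size inputs; the function name, the padding token and the example length are hypothetical and are not taken from the chapter.

```python
def fit_to_length(tokens, n_tokens, pad_token="<pad>"):
    """Return exactly n_tokens tokens: truncate long sequences and
    right-pad short ones with pad_token (an assumed convention)."""
    clipped = tokens[:n_tokens]
    return clipped + [pad_token] * (n_tokens - len(clipped))


# Hypothetical example with n_tokens = 5.
short = fit_to_length(["arma", "virumque", "cano"], 5)
long = fit_to_length(["a", "b", "c", "d", "e", "f", "g"], 5)
```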

1. Bojanowski P, Grave E, Joulin A, Mikolov T (2017) Enriching word vectors with subword information. Trans Assoc Comput Linguist 5:135–146

2. Ducoffe M, Precioso F, Arthur A, Mayaffre D, Lavigne F, Vanni L (2016) Machine learning under the light of phraseology expertise: use case of presidential speeches, de Gaulle - Hollande (1958–2016). Actes de JADT 2016:155–168

3. Joulin A, Grave E, Bojanowski P, Mikolov T (2017) Bag of tricks for efficient text classification. EACL, 427

4. Krizhevsky A, Sutskever I, Hinton GE (2012) ImageNet classification with deep convolutional neural networks. Adv Neural Inf Process Syst 1097–1105

5. Lafon P (1984) Dépouillements et statistiques en lexicométrie. In Genève-Paris, Slatkine-Champion

6. Lebart L (1997) Réseaux de neurones et analyse des correspondances. In: Modulad, (INRIA Paris), vol 18, pp 21–37

7. Longrée D, Mellet S (2013) Le motif : une unité phraséologique englobante ? étendre le champ de la phraséologie de la langue au discours. Langages 189:65–79

8. Longrée D, Mellet S, Lavigne F (2019) Construction cognitive d’un motif : cooccurrences textuelles et associations mémorielles. In: CogniTextes. http://journals.openedition.org/cognitextes/1202

9. Mellet S, Longrée D (2009) Syntactical motifs and textual structures. Belg J Linguist 23:161–173

10. Mellet S, Longrée D (2012) Légitimité d’une unité textométrique : le motif. Actes de JADT 2012:715–728

11. Mikolov T, Sutskever I, Chen K, Corrado GS, Dean J (2013) Distributed representations of words and phrases and their compositionality. Adv Neural Inf Process Syst 3111–3119

12. Pennington J, Socher R, Manning C (2014) GloVe: global vectors for word representation. In: Proceedings of the 2014 conference on empirical methods in natural language processing (EMNLP), pp 1532–1543

13. Quiniou S, Cellier P, Charnois T, Legallois D (2012) Fouille de données pour la stylistique : cas des motifs séquentiels émergents. In: Actes de JADT 2012

14. Rastier F (2007) Passages. Corpus 6:25–54

15. Rastier F (2011) La mesure et le grain: sémantique de corpus. Champion; diff, Slatkine

16. Salem A (1987) Pratique des segments répétés. Essai de statistique textuelle. Klincksieck, Paris

17. Vanni L, Ducoffe M, Precioso F, Mayaffre D, Longrée D et al (2018) Text deconvolution saliency (TDS): a deep tool box for linguistic analysis. In: Proceedings of the 56th annual meeting of the association for computational linguistics (volume 1: Long Papers). Melbourne, Australia. Association for Computational Linguistics, pp 548–557

18. Vanni L, Mittmann A (2016) Cooccurrences spécifiques et représentations graphiques, le nouveau thème d’hyperbase. Actes de JADT 2016:295–305

Title: Key Passages : From Statistics to Deep Learning
Authors: Laurent Vanni, Marco Corneli, Dominique Longrée, Damon Mayaffre, Frédéric Precioso
Publisher: Springer International Publishing
Book: Text Analytics
Print ISBN: 978-3-030-52679-5

Electronic ISBN: 978-3-030-52680-1

Copyright Year: 2020
DOI: https://doi.org/10.1007/978-3-030-52680-1_4
