
About This Book

This book constitutes the proceedings of the 6th International Conference on Statistical Language and Speech Processing, SLSP 2018, held in Mons, Belgium, in October 2018.

The 15 full papers presented in this volume were carefully reviewed and selected from 40 submissions. They were organized in topical sections named: speech synthesis and spoken language generation; speech recognition and post-processing; natural language processing and understanding; and text processing and analysis.



Invited Paper


Analysing Speech for Clinical Applications

The boost in speech technologies that we have witnessed over the last decade has allowed us to go from a state of the art in which correctly recognizing strings of words was a major target, to a state in which we aim well beyond words. We aim at extracting meaning, but we also aim at extracting all possible cues that are conveyed by the speech signal. In fact, we can estimate bio-relevant traits such as height, weight, gender, age, and physical and mental health. We can also estimate language, accent, emotional and personality traits, and even environmental cues. This wealth of information, which one can now extract thanks to recent advances in machine learning, has motivated an exponentially growing number of speech-based applications that go far beyond the transcription of what a speaker says. In particular, it has motivated many health-related applications, namely those aiming at the non-invasive diagnosis and monitoring of diseases that affect speech.
Most of the recent work on speech-based diagnosis tools addresses the extraction of features, and/or the development of sophisticated machine learning classifiers [5, 7, 12–14, 17]. The results have shown remarkable progress, boosted by several joint paralinguistic challenges, but most results are obtained from limited training data acquired in controlled conditions.
This talk covers two emerging concerns related to this growing trend. One is the collection of large in-the-wild datasets and the effects of this extended uncontrolled collection on the results [4]. Another concern is how the diagnosis may be done without compromising patient privacy [18].
As a proof-of-concept, we will discuss these two aspects and show our results for two target diseases, Depression and Cold, a selection motivated by the availability of corresponding lab datasets distributed in paralinguistic challenges. The availability of these lab datasets allowed us to build a baseline system for each disease, using a simple neural network trained with common features that have not been optimized for either disease. Given the modular architecture adopted, each component of the system can be individually improved at a later stage, although the limited amount of data does not motivate us to exploit deeper networks.
Our mining effort has been focused on video blogs (vlogs) that feature a single speaker who, at some point, admits that he or she is currently affected by a given disease. Retrieving vlogs for the target disease involves not only a simple query (e.g. "depression vlog"), but also a post-filtering stage to exclude videos that do not match our target of first-person, present experiences (lectures, in particular, are relatively frequent). This filtering stage combines multimodal features automatically extracted from the video and its metadata, using mostly off-the-shelf tools.
We collected a large dataset for each target disease from YouTube, and manually labelled a small subset which we named the in-the-Wild Speech Medical (WSM) corpus. Although our mining efforts made use of relatively simple techniques using mostly existing toolkits, they proved effective. The best performing models achieved a precision of 88% and 93%, and a recall of 97% and 72%, for the Cold and Depression datasets, respectively, in the task of filtering videos containing these speech-affecting diseases.
We compared the performance of our baseline neural network classifiers trained with data collected in controlled conditions in tests with corresponding in-the-wild data. For the Cold datasets, the baseline neural network achieved an Unweighted Average Recall (UAR) of 66.9% for the controlled dataset, and 53.1% for the manually labelled subset of the WSM corpus. For the Depression datasets, the corresponding values were 60.6% and 54.8%, respectively (at interview level, the UAR increased to 61.9% for the vlog corpus). The performance degradation that we had anticipated when using in-the-wild data may be due to a greater variability in recording conditions (e.g. microphone, noise) and in the effects of speech-altering diseases on the subjects’ speech. Our current work with vlog datasets attempts to estimate the quality of the predicted labels of a very large set in an unsupervised way, using noisy models.
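For readers unfamiliar with the metric, UAR is simply the mean of the per-class recalls, so each class counts equally regardless of its frequency; on imbalanced medical data it is not flattered by a majority-class predictor. A minimal sketch:

```python
def unweighted_average_recall(y_true, y_pred):
    """UAR: average of per-class recalls, each class weighted equally."""
    classes = sorted(set(y_true))
    recalls = []
    for c in classes:
        # predictions for the samples whose true label is c
        relevant = [p for t, p in zip(y_true, y_pred) if t == c]
        recalls.append(sum(1 for p in relevant if p == c) / len(relevant))
    return sum(recalls) / len(recalls)

# Imbalanced toy labels: a classifier that always predicts the majority
# class gets 80% accuracy here, but only 50% UAR.
y_true = [0, 0, 0, 0, 1]
y_pred = [0, 0, 0, 0, 0]
print(unweighted_average_recall(y_true, y_pred))  # 0.5
```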
The second aspect we addressed was patient privacy. Privacy is an emerging concern among users of voice-activated digital assistants, sparked by the awareness of devices that must always be in listening mode. Despite this growing concern, the potential misuse of health-related speech-based cues has not yet been fully realized. This is the motivation for adopting secure computation frameworks, in which cryptographic techniques are combined with state-of-the-art machine learning algorithms. Privacy in speech processing is an interdisciplinary topic, which was first applied to speaker verification, using Secure Multi-Party Computation and Secure Modular Hashing techniques [1, 15], and later to speech emotion recognition, also using hashing techniques [6]. The most recent efforts on privacy-preserving speech processing have followed the progress in secure machine learning, combining neural networks and Full Homomorphic Encryption (FHE) [3, 8, 9].
In this work, we applied an encrypted neural network, following the FHE paradigm, to the problem of secure detection of pathological speech. This was done by developing an encrypted version of a neural network, trained with unencrypted data, in order to produce encrypted predictions of health-related labels. As proof-of-concept, we used the same two above mentioned target diseases, and compared the performance of the simple neural network classifiers with their encrypted counterparts on datasets collected in controlled conditions. For the Cold dataset, the baseline neural network achieved a UAR of 66.9%, whereas the encrypted network achieved 66.7%. For the Depression dataset, the baseline value was 60.6%, whereas the encrypted network achieved 60.2% (67.9% at interview level). The slight difference in results showed the validity of our secure approach.
This approach relies on the computation of features on the client side before encryption, with only the inference stage being computed in an encrypted setting. Ideally, an end-to-end approach would overcome this limitation, but combining convolutional neural networks with FHE imposes severe limitations to their size. Likewise, the use of recurrent layers such as LSTMs (Long Short Term Memory) also requires a number of operations too large for current FHE frameworks, making them computationally unfeasible as well.
FHE schemes, by construction, only work with integers, whilst neural networks work with real numbers. By using encoding methods to convert real weights to integers, we give up the ability to use an FHE batching technique that would allow us to compute several predictions at the same time using the same encrypted value. Recent advances in machine learning have pushed towards the “quantization” and “discretization” of neural networks, so that models occupy less space and operations consume less power. Some works have already implemented these techniques using homomorphic encryption, such as Binarized Neural Networks [10, 11, 16] and Discretized Neural Networks [2]. The talk will also cover our recent efforts in applying this type of approach to the detection of health-related cues in speech signals, while discretizing the network and maximizing the throughput of its encrypted counterpart.
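The integer encoding mentioned above is, in its simplest form, fixed-point scaling. The sketch below illustrates the idea only; the scale factor is an arbitrary choice for illustration, not the encoding used in the work described:

```python
SCALE = 2 ** 8  # illustrative fixed-point scale; larger = more precision

def quantize_weights(weights, scale=SCALE):
    """Encode real-valued weights as integers, as FHE schemes require."""
    return [round(w * scale) for w in weights]

def dequantize(quantized, scale=SCALE):
    """Recover approximate real values; error is bounded by 1/(2*scale)."""
    return [q / scale for q in quantized]

w = [0.123, -0.456, 0.789]
q = quantize_weights(w)
print(q)                                  # integer representation
print(max(abs(a - b) for a, b in zip(dequantize(q), w)))  # small error
```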
More than presenting our recent work in these two aspects of speech analysis for medical applications, this talk intends to point to different directions for future work in these two relatively unexplored topics that were by no means exhausted in this summary.
Isabel Trancoso, Joana Correia, Francisco Teixeira, Bhiksha Raj, Alberto Abad

Speech Synthesis and Spoken Language Generation


DNN-Based Speech Synthesis for Arabic: Modelling and Evaluation

This paper investigates the use of deep neural networks (DNN) for Arabic speech synthesis. In parametric speech synthesis, whether HMM-based or DNN-based, each speech segment is described with a set of contextual features. These contextual features correspond to linguistic, phonetic and prosodic information that may affect the pronunciation of the segments. Gemination and vowel quantity (short vowel vs. long vowel) are two particular and important phenomena in the Arabic language. Hence, it is worth investigating whether those phenomena must be handled by using specific speech units, or whether their specification in the contextual features is enough. Consequently, four modelling approaches are evaluated by considering geminated consonants (respectively long vowels) either as fully-fledged phoneme units or as the same phoneme as their simple (respectively short) counterparts. Although no significant difference was observed in previous studies relying on HMM-based modelling, this paper examines these modelling variants in the framework of DNN-based speech synthesis. Listening tests are conducted to evaluate the four modelling approaches, and to assess the performance of DNN-based Arabic speech synthesis with respect to the previous HMM-based approach.
Amal Houidhek, Vincent Colotte, Zied Mnasri, Denis Jouvet

Phone-Level Embeddings for Unit Selection Speech Synthesis

Deep neural networks have become the state of the art in speech synthesis. They have been used to directly predict signal parameters or to provide unsupervised speech segment descriptions through embeddings. In this paper, we present four models, two of which enable us to extract phone-level embeddings for unit selection speech synthesis. Three of the models rely on a feed-forward DNN, the last one on an LSTM. The resulting embeddings enable replacing the usual expert-based target costs with a Euclidean distance in the embedding space. This work is conducted on a French corpus of an 11-hour audiobook. Perceptual tests show the produced speech is preferred over a unit selection method where the target cost is defined by an expert. They also show that the embeddings are general enough to be used for different speech styles without quality loss. Furthermore, objective measures and a perceptual test on statistical parametric speech synthesis show that our models perform comparably to state-of-the-art models for parametric signal generation, in spite of necessary simplifications, namely late time integration and information compression.
Antoine Perquin, Gwénolé Lecorvé, Damien Lolive, Laurent Amsaleg

Disfluency Insertion for Spontaneous TTS: Formalization and Proof of Concept

This paper presents an exploratory work to automatically insert disfluencies in text-to-speech (TTS) systems. The objective is to make TTS more spontaneous and expressive. To achieve this, we propose to focus on the linguistic level of speech through the insertion of pauses, repetitions and revisions. We formalize the problem as a theoretical process, where transformations are iteratively composed. This is a novel contribution since most of the previous work either focuses on the detection or cleaning of linguistic disfluencies in speech transcripts, or concentrates solely on acoustic phenomena in TTS, especially pauses. We present a first implementation of the proposed process using conditional random fields and language models. The objective and perceptual evaluation conducted on an English corpus of spontaneous speech show that our proposition is effective in generating disfluencies, and highlights perspectives for future improvements.
Raheel Qader, Gwénolé Lecorvé, Damien Lolive, Pascale Sébillot
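The iterative composition of transformations formalized in the paper above can be illustrated with a toy sketch. The two transformation rules below (a filled pause, a repetition) are simplified stand-ins chosen for illustration, not the CRF-driven insertions the authors actually learn:

```python
def insert_pause(tokens):
    # insert a filled pause after the first token (toy rule)
    return tokens[:1] + ["uh"] + tokens[1:]

def insert_repetition(tokens):
    # repeat the first token (toy rule)
    return tokens[:1] + tokens

def compose(*transforms):
    """Iteratively compose disfluency transformations, left to right."""
    def apply(tokens):
        for t in transforms:
            tokens = t(tokens)
        return tokens
    return apply

pipeline = compose(insert_repetition, insert_pause)
print(pipeline(["I", "want", "coffee"]))  # ['I', 'uh', 'I', 'want', 'coffee']
```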

Speech Recognition and Post-Processing


Forced Alignment of the Phonologie du Français Contemporain Corpus

The Phonologie du Français Contemporain project is an international, collaborative research effort to create resources for the study of contemporary French phonology. It has produced a large, partially transcribed and annotated corpus of spoken French, consisting of approximately 300 h of recordings, and covering 48 geographical regions (including Metropolitan France, Belgium, Switzerland, Canada, and French-speaking countries of Africa). Following a detailed protocol, speakers read aloud a word list and a short text and engage in guided and spontaneous conversation with an interviewer. The corpus presents several challenges: significant regional accent variation; variable recording quality and different types of environment noise; variation in speaker characteristics (age, sex); and interspersed segments of overlapping speech. In this article, we describe the procedure followed to address these challenges and produce an automatic forced alignment of the corpus at the phone, syllable and token level, starting from the initial transcriptions.
George Christodoulides

A Syllable Structure Approach to Spoken Language Recognition

Spoken language recognition is the task of automatically determining the identity of the language spoken in a speech clip. Prior approaches to spoken language recognition have been able to accurately determine the language within an audio clip. However, they usually require long training time and large datasets since most of the existing approaches rely heavily on phonotactic, acoustic-phonetic and prosodic information. Moreover, the features extracted may not be linguistic features, but speaker features instead. This paper presents a novel approach based on a linguistics perspective, particularly that of syllable structure. Based on human listening experiments, there is strong evidence that syllable structure is a significant knowledge source in human spoken language recognition. The approach includes a block for labelling common syllable structures (CV, CVC, VC, etc.). Then, a long short-term memory (LSTM) network is used to transform the Mel-frequency cepstral coefficients (MFCC) of an audio clip to its syllable structure, thereby diminishing the influence of speakers on extracted features and reducing the number of dimensions for the final language predictor. The array of syllables is then passed through a second LSTM network to predict the language. The proposed method creates a generalized and scalable framework with acceptable accuracy for spoken language recognition. Our experiments with 10 different languages demonstrate the feasibility of the proposed approach, which achieves a comparable accuracy of 70.40% with a computing time of 37 ms for every second of speech, outperforming most of the existing methods based on acoustic-phonetic and phonotactic features in efficiency.
Ruei-Hung Alex Lee, Jyh-Shing Roger Jang
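The syllable-structure labels mentioned in the abstract above (CV, CVC, VC, etc.) are consonant/vowel skeletons of syllables. A minimal sketch of that labelling step, assuming a simplified phone inventory rather than the model used in the paper:

```python
VOWELS = {"a", "e", "i", "o", "u"}  # toy phone inventory (assumption)

def syllable_structure(phones):
    """Map a syllable's phone sequence to its C/V skeleton."""
    return "".join("V" if p in VOWELS else "C" for p in phones)

def label_syllables(syllables):
    """Label a sequence of syllables, each given as a list of phones."""
    return [syllable_structure(s) for s in syllables]

print(syllable_structure(["k", "a", "t"]))          # 'CVC'
print(label_syllables([["s", "i"], ["l", "a", "b"]]))  # ['CV', 'CVC']
```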

Investigating a Hybrid Learning Approach for Robust Automatic Speech Recognition

In order to properly train an automatic speech recognition system, speech with its annotated transcriptions is required. The amount of real annotated data recorded in noisy and reverberant conditions is extremely limited, especially compared to the amount of data that can be simulated by adding noise to clean annotated speech. Thus, using both real and simulated data is important in order to improve robust speech recognition. Another promising method applied to speech recognition in noisy and reverberant conditions is multi-task learning. A successful auxiliary task consists of generating clean speech features using a regression loss (as a denoising auto-encoder). But this auxiliary task uses clean speech as targets, which implies that real data cannot be used. In order to tackle this problem, a Hybrid-Task Learning system is proposed. This system switches frequently between multi- and single-task learning depending on whether the input is real or simulated data, respectively. We show that the relative improvement brought by the proposed hybrid-task learning architecture can reach up to 4.4% compared to the traditional single-task learning approach on the CHiME4 database.
Gueorgui Pironkov, Sean U. N. Wood, Stéphane Dupont, Thierry Dutoit

A Comparison of Adaptation Techniques and Recurrent Neural Network Architectures

Recently, recurrent neural networks have become state-of-the-art in acoustic modeling for automatic speech recognition. The long short-term memory (LSTM) units are the most popular ones. However, alternative units like the gated recurrent unit (GRU) and its modifications have outperformed LSTM in some publications. In this paper, we compared five neural network (NN) architectures with various adaptation and feature normalization techniques. We have evaluated feature-space maximum likelihood linear regression, five variants of i-vector adaptation and two variants of cepstral mean normalization. Most adaptation and normalization techniques were developed for feed-forward NNs and, according to the results in this paper, not all of them also work with RNNs. For the experiments, we have chosen the well-known and freely available TIMIT phone recognition task. Phone recognition is much more sensitive to the quality of the acoustic model than a large-vocabulary task with a complex language model. We have also published open-source scripts to easily replicate the results and to help continue the development.
Jan Vaněk, Josef Michálek, Jan Zelinka, Josef Psutka

Restoring Punctuation and Capitalization Using Transformer Models

Restoring punctuation and capitalization in the output of an automatic speech recognition (ASR) system greatly improves readability and extends the number of downstream applications. We present a Transformer-based method for restoring punctuation and capitalization for Latvian and English, following the established approach of using neural machine translation (NMT) models. NMT methods here pose a challenge as the length of the predicted sequence does not always match the length of the input sequence. We offer two solutions to this problem: a simple target sequence cutting or padding by force, and a more sophisticated attention alignment-based method. Our approach reaches new state-of-the-art results for Latvian and competitive results for English.
Andris Vāravs, Askars Salimbajevs
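The simpler of the two length-matching strategies mentioned above, cutting or padding the predicted sequence by force, can be sketched as follows. The padding token name is an illustrative assumption, not the one used in the paper:

```python
def force_fit(target_tokens, source_len, pad="<no-punct>"):
    """Force the predicted sequence to the input length: cut the surplus,
    or pad the deficit with a neutral token (hypothetical token name)."""
    if len(target_tokens) >= source_len:
        return target_tokens[:source_len]
    return target_tokens + [pad] * (source_len - len(target_tokens))

print(force_fit(["Hello", ",", "world"], 2))  # cut: ['Hello', ',']
print(force_fit(["Hello"], 3))                # padded to length 3
```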

Natural Language Processing and Understanding


Arabic Name Entity Recognition Using Deep Learning

Many applications that we use on a daily basis incorporate Natural Language Processing (NLP), from simple tasks such as automatic text correction to speech recognition. A lot of research has been done on NLP for the English language but not much attention was given to the NLP of the Arabic language. The purpose of this work is to implement a tagging model for Arabic Name Entity Recognition which is an important information extraction task in NLP. It serves as a building block for more advanced tasks. We developed a deep learning model that consists of Bidirectional Long Short Term Memory and Conditional Random Field with the addition of different network layers such as Word Embedding, Convolutional Neural Network, and Character Embedding. Hyperparameters have been tuned to maximize the F1-score.
David Awad, Caroline Sabty, Mohamed Elmahdy, Slim Abdennadher

Movie Genre Detection Using Topological Data Analysis

We show that by applying discourse features derived through topological data analysis (TDA), namely homological persistence, we can improve classification results on the task of movie genre detection, including identification of overlapping movie genres. On the IMDB dataset we improve prior art results, namely we increase the Jaccard score by 4.7% over recent results by Hoang. We also significantly improve the F-score (by over 15%) and slightly improve the hit rate (by 0.5%, ibid.). We see our contribution as threefold: (a) for the general audience of computational linguists, we want to increase their awareness of topology as a possible source of semantic features; (b) for researchers using machine learning for NLP tasks, we want to propose the use of topological features when the number of training examples is small; and (c) for those already aware of the existence of computational topology, we see this work as contributing to the discussion about the value of topology for NLP, in view of mixed results reported by others.
Pratik Doshi, Wlodek Zadrozny
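The Jaccard score used for overlapping genres above is, per sample, the size of the intersection of the predicted and true genre sets over the size of their union, averaged over samples. A minimal sketch:

```python
def jaccard_score(true_sets, pred_sets):
    """Mean Jaccard index over samples: |A ∩ B| / |A ∪ B| per sample."""
    scores = []
    for t, p in zip(true_sets, pred_sets):
        union = t | p
        # both sets empty counts as a perfect match
        scores.append(len(t & p) / len(union) if union else 1.0)
    return sum(scores) / len(scores)

truth = [{"drama", "romance"}, {"horror"}]
preds = [{"drama"}, {"horror", "thriller"}]
print(jaccard_score(truth, preds))  # (1/2 + 1/2) / 2 = 0.5
```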

Low-Resource Text Classification Using Domain-Adversarial Learning

Deep learning techniques have recently been shown to be successful in many natural language processing tasks, forming state-of-the-art systems. They require, however, a large amount of annotated data, which is often missing. This paper explores the use of domain-adversarial learning as a regularizer to avoid overfitting when training domain-invariant features for deep, complex neural networks in low-resource and zero-resource settings in new target domains or languages. In the case of new languages, we show that monolingual word-vectors can be directly used for training without pre-alignment. Their projection into a common space can be learnt ad hoc at training time, reaching the final performance of pretrained multilingual word-vectors.
Daniel Grießhaber, Ngoc Thang Vu, Johannes Maucher

Handling Ellipsis in a Spoken Medical Phraselator

We consider methods for handling incomplete (elliptical) utterances in spoken phraselators, and describe how they have been implemented inside BabelDr, a substantial spoken medical phraselator. The challenge is to extend the phrase matching process so that it is sensitive to preceding dialogue context. We contrast two methods, one using limited-vocabulary strict grammar-based speech and language processing and one using large-vocabulary speech recognition with fuzzy grammar-based processing, and present an initial evaluation on a spoken corpus of 821 context-sentence/elliptical-phrase pairs. The large-vocabulary/fuzzy method strongly outperforms the limited-vocabulary/strict method over the whole corpus, though it is slightly inferior for the subset that is within grammar coverage. We investigate possibilities for combining the two processing paths, using several machine learning frameworks, and demonstrate that hybrid methods strongly outperform the large-vocabulary/fuzzy method.
Manny Rayner, Johanna Gerlach, Pierrette Bouillon, Nikos Tsourakis, Hervé Spechbach

Text Processing and Analysis


Knowledge Transfer for Active Learning in Textual Anonymisation

Data privacy compliance has gained a lot of attention over the last years. The automation of the de-identification process is a challenging task that often requires annotating in-domain data from scratch, as there is usually a lack of annotated resources for such scenarios. In this work, knowledge from a classifier learnt from a source annotated dataset is transferred to speed up the process of training a binary personal data identification classifier in a pool-based Active Learning context, for a new initially unlabelled target dataset which differs in language and domain. To this end, knowledge from the source classifier is used for seed selection and uncertainty based query selection strategies. Through the experimentation phase, multiple entropy-based criteria and input diversity measures are combined. Results show a significant improvement of the anonymisation label from the first batch, speeding up the classifier’s learning curve in the target domain and reaching top performance with less than 10% of the total training data, thus demonstrating the usefulness of the proposed approach even when the anonymisation domains diverge significantly.
Laura García-Sardiña, Manex Serras, Arantza del Pozo
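Entropy-based uncertainty sampling, one of the query selection criteria combined in the work above, ranks unlabelled pool items by the entropy of their predicted class distribution and queries the most uncertain ones. A minimal sketch (the paper combines several such criteria with diversity measures; this shows only the basic ranking):

```python
import math

def entropy(probs):
    """Shannon entropy of a predicted class distribution (in nats)."""
    return -sum(p * math.log(p) for p in probs if p > 0)

def select_queries(pool_probs, k):
    """Pick the k pool items with the highest predictive entropy,
    i.e. those the classifier is least certain about."""
    ranked = sorted(range(len(pool_probs)),
                    key=lambda i: entropy(pool_probs[i]),
                    reverse=True)
    return ranked[:k]

pool = [[0.99, 0.01], [0.5, 0.5], [0.8, 0.2]]
print(select_queries(pool, 1))  # [1]: the uniform distribution is most uncertain
```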

Studying the Effects of Text Preprocessing and Ensemble Methods on Sentiment Analysis of Brazilian Portuguese Tweets

The analysis of social media posts can provide useful feedback regarding user experience for people and organizations. This task requires the use of computational tools due to the massive amount of content and the speed at which it is generated. In this article we study the effects of text preprocessing heuristics and ensembles of machine learning algorithms on the accuracy and polarity bias of classifiers when performing sentiment analysis on short text messages. The results of an experimental evaluation performed on a Brazilian Portuguese tweets dataset have shown that these strategies have significant impact on increasing classification accuracy, particularly when the ensembles include a deep neural net, but not always on reducing polarity bias.
Fernando Barbosa Gomes, Juan Manuel Adán-Coello, Fernando Ernesto Kintschner

Text Documents Encoding Through Images for Authorship Attribution

In order to use a machine learning methodology for classifying text documents, relevant features first have to be extracted from them. The current approach uses the chaos game representation to produce an image out of a text document, flattens the images into vectors, and further reduces their dimensionality via singular value decomposition. Finally, a neural network learns the features relevant for each author, and the built model is used to classify new samples. The results obtained on some well-known benchmark data sets approach or exceed those in the prior literature, and encourage further research within this unexplored area.
Daniel Lichtblau, Catalin Stoean
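The chaos game representation mentioned above maps a symbol sequence to a trajectory inside the unit square, moving halfway toward a corner for each symbol and accumulating visit counts into an image. The corner assignment below (hashing characters into four quadrants) is an illustrative assumption, not necessarily the alphabet-to-corner mapping used in the paper:

```python
def cgr_image(text, size=64):
    """Toy chaos game representation of a text as a size x size count image."""
    corners = [(0.0, 0.0), (0.0, 1.0), (1.0, 0.0), (1.0, 1.0)]
    x, y = 0.5, 0.5  # start at the centre of the unit square
    img = [[0] * size for _ in range(size)]
    for ch in text:
        cx, cy = corners[ord(ch) % 4]  # assumed character-to-corner hash
        x, y = (x + cx) / 2.0, (y + cy) / 2.0  # move halfway to that corner
        img[int(x * (size - 1))][int(y * (size - 1))] += 1
    return img

img = cgr_image("the quick brown fox jumps over the lazy dog")
print(sum(map(sum, img)))  # one count per character
```

The resulting count matrix is what the paper then flattens and compresses before classification.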

