
2017 | Book

Speech and Computer

19th International Conference, SPECOM 2017, Hatfield, UK, September 12-16, 2017, Proceedings


About this book

This book constitutes the proceedings of the 19th International Conference on Speech and Computer, SPECOM 2017, held in Hatfield, UK, in September 2017.
The 80 papers presented in this volume were carefully reviewed and selected from 150 submissions. The papers present current research in the area of computer speech processing (recognition, synthesis, understanding etc.) and related domains (including signal processing, language and text processing, computational paralinguistics, multi-modal speech processing, human-computer interaction).

Table of Contents

Frontmatter

Invited Talks

Frontmatter
Low-Resource Speech Recognition and Keyword-Spotting

The IARPA Babel program ran from March 2012 to November 2016. The aim of the program was to develop agile and robust speech technology that can be rapidly applied to any human language in order to provide effective search capability on large quantities of real world data. This paper will describe some of the developments in speech recognition and keyword-spotting during the lifetime of the project. Two technical areas will be briefly discussed with a focus on techniques developed at Cambridge University: the application of deep learning for low-resource speech recognition; and efficient approaches for keyword spotting. Finally a brief analysis of the Babel speech language characteristics and language performance will be presented.

Mark J. F. Gales, Kate M. Knill, Anton Ragni
Big Data, Deep Learning – At the Edge of X-Ray Speaker Analysis

By the age of two, one has heard roughly a thousand hours of speech – by ten, around ten thousand. Similarly, an automatic speech recogniser’s data hunger these days is often fed in these dimensions. In stark contrast, however, only a few databases for training a speaker analysis system contain more than ten hours of speech. Yet, these systems are ideally expected to recognise the states and traits of speakers independently of the person, spoken content, language, cultural background, and acoustic disturbances, at human parity or even super-human levels. While this has not yet been reached for many tasks such as speaker emotion recognition, deep learning – often described as leading to ‘dramatic improvements’ – in combination with sufficient learning data satisfying the ‘deep data cravings’ holds the promise to get us there. Luckily, every second, more than five hours of video are uploaded to the web, and several hundreds of hours of audio and video communication in most languages of the world take place. If only a fraction of these data were shared and labelled reliably, ‘x-ray’-like automatic speaker analysis could be around the corner for next-gen human-computer interaction, mobile health applications, and many further benefits to society. In this light, first, a solution towards the most efficient exploitation of the ‘big’ (unlabelled) data available is presented. Small-world modelling in combination with unsupervised learning helps to rapidly identify potential target data of interest. Then, gamified dynamic cooperative crowdsourcing turns its labelling into an entertaining experience, while reducing the amount of required labels to a minimum by learning the labellers’ behaviour and reliability alongside the target task. Further, increasingly autonomous deep holistic end-to-end learning solutions are presented for the task at hand. Benchmarks are given from the nine research challenges co-organised by the author at the annual Interspeech conference since 2009. The concluding discussion contains some crystal-ball gazing alongside practical hints, not missing out on ethical aspects.

Björn W. Schuller

Conference Papers

Frontmatter
A Comparison of Covariance Matrix and i-vector Based Speaker Recognition

The paper presents the results of an evaluation of covariance matrix and i-vector based speaker identification methods on the Serbian S70W100s120 database. An open-set speaker identification evaluation scheme was adopted. The numbers of target speakers and impostors were 20 and 60, respectively. Additional utterances from 41 speakers were used for training. The amount of data for modeling a target speaker was limited to about 4 s of speech. In this study, the i-vector based approach showed significantly better performance (equal error rate, EER, of about 5%) than the covariance matrix based approach (EER of about 16%). This small EER for the i-vector based approach was obtained after a substantial reduction of the number of parameters in the universal background model, the i-vector transformation matrix and the Gaussian probabilistic linear discriminant analysis, relative to what is typically reported in the literature. Additionally, these experiments showed that cepstral mean and variance normalization can deteriorate the EER in the single-channel case.

Nikša Jakovljević, Ivan Jokić, Slobodan Jošić, Vlado Delić
A Trainable Method for the Phonetic Similarity Search in German Proper Names

Efficient methods for similarity search in word databases play a significant role in various applications such as the robust search or indexing of names and addresses, spell-checking algorithms or the monitoring of trademark rights. The underlying distance measures are associated with the users’ similarity criteria, and phonetic-based search algorithms have been well established for decades. Nonetheless, rule-based phonetic algorithms exhibit some weak points, e.g. their strong language dependency, the search overhead caused by tolerance or, conversely, the risk of missing valid matches, which causes a pseudo-phonetic functionality in some cases. In contrast, we suggest a novel, adaptive method for similarity search in words, which is based on a trainable grapheme-to-phoneme (G2P) converter that generates most likely and widely correct pronunciations. Only as a second step, the similarity search in the phonemic reference data is performed by involving a conventional string metric such as the Levenshtein distance (LD). The G2P algorithm achieves a string accuracy of up to 99.5% on a German pronunciation lexicon and can be trained for different languages or specific domains such as proper names. The similarity tolerance can be easily adjusted by parameters such as the admissible number or likelihood of pronunciation variants as well as by the phonemic or graphemic LD. As a proof of concept, we compare the G2P-based search method on a German surname database and a telephone book including first name, surname and street name to similarity matches produced by the conventional Cologne phonetics (Kölner Phonetik, KP) algorithm.

Oliver Jokisch, Horst-Udo Hain
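
As a rough illustration of the second step described above – ranking lexicon entries by string distance over G2P-generated pronunciations – the following Python sketch assumes the phoneme sequences are already available; the lexicon, transcriptions and distance threshold are hypothetical, not taken from the paper:

```python
def levenshtein(a, b):
    """Classic dynamic-programming edit distance between two symbol sequences."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        curr = [i]
        for j, cb in enumerate(b, 1):
            curr.append(min(prev[j] + 1,                 # deletion
                            curr[j - 1] + 1,             # insertion
                            prev[j - 1] + (ca != cb)))   # substitution
        prev = curr
    return prev[-1]

def phonetic_search(query_phonemes, lexicon, max_distance=2):
    """Return lexicon entries whose phoneme string is within max_distance edits.

    `lexicon` maps a written name to its (G2P-generated) phoneme sequence.
    """
    hits = []
    for name, phonemes in lexicon.items():
        d = levenshtein(query_phonemes, phonemes)
        if d <= max_distance:
            hits.append((d, name))
    return sorted(hits)

# Hypothetical toy lexicon with hand-written SAMPA-like transcriptions.
lexicon = {"Maier": ["m", "aI", "6"], "Meyer": ["m", "aI", "6"], "Mahler": ["m", "a:", "l", "6"]}
print(phonetic_search(["m", "aI", "6"], lexicon, max_distance=1))  # both "Maier" and "Meyer" match
```
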
Acoustic and Perceptual Correlates of Vowel Articulation in Parkinson’s Disease With and Without Mild Cognitive Impairment: A Pilot Study

This pilot study investigates the added acoustic and perceptual effect of cognitive impairment on vowel articulation precision in individuals with Parkinson’s Disease (PD). We compared PD patients with and without Mild Cognitive Impairments (MCI) to elderly healthy controls on various acoustic measurements of the first and second formants of the vowels /i, u, a:, , a/, extracted from spontaneous speech recordings. In addition, 15 naïve listeners performed intelligibility ratings on segments of the spontaneous speech. Results show a centralization of vowel formant frequencies, an increased formant frequency variability and reduced intelligibility in individuals with PD compared to controls. Acoustic and perceptual effects of cognitive impairments on vowel articulation precision were only found for the male speakers.

Michaela Strinzel, Vasilisa Verkhodanova, Fedor Jalvingh, Roel Jonkers, Matt Coler
Acoustic Cues for the Perceptual Assessment of Surround Sound

Speech and audio codecs are implemented in a variety of multimedia applications, and multichannel sound is now offered by the first streaming or cloud-based services. Besides the objective of perceptual quality, coding-related research is focused on low bitrate and minimal latency. The IETF-standardized Opus codec provides high perceptual quality, low latency and the capability of coding multiple channels at various audio bandwidths up to fullband (20 kHz). In a previous perceptual study on Opus-processed 5.1 surround sound, uncompressed and degraded stimuli were rated on a five-point degradation category scale (DMOS) for six channels at total bitrates between 96 and 192 kbit/s. That study revealed that the perceived quality depends on the music characteristics. In the current study we analyze spectral and music-feature differences between those five music stimuli at three coding bitrates and uncompressed sound to identify objective causes for the perceptual differences. The results show that samples with annoying audible degradations involve higher spectral differences within the LFE channel as well as highly uncorrelated LSPs.

Ingo Siegert, Oliver Jokisch, Alicia Flores Lotz, Franziska Trojahn, Martin Meszaros, Michael Maruschke
Acoustic Modeling in the STC Keyword Search System for OpenKWS 2016 Evaluation

This paper describes in detail the acoustic modeling part of the keyword search system developed at the Speech Technology Center (STC) for the OpenKWS 2016 evaluation. The key idea was to utilize the diversity of both sound representations and acoustic model architectures in the system. For the former, we extended the speaker-dependent bottleneck (SDBN) approach to the multilingual case, which is the main contribution of the paper. Two types of multilingual SDBN features were applied in addition to conventional spectral and cepstral features. The acoustic model architectures employed in the final system are based on deep feedforward and recurrent neural networks. We also applied speaker adaptation of acoustic models using multilingual i-vectors, speed-perturbation-based data augmentation and semi-supervised training. The final STC system comprised 9 acoustic models, which allowed it to achieve strong performance and to be among the top three systems in the evaluation.

Ivan Medennikov, Aleksei Romanenko, Alexey Prudnikov, Valentin Mendelev, Yuri Khokhlov, Maxim Korenevsky, Natalia Tomashenko, Alexander Zatvornitskiy
Adaptation Approaches for Pronunciation Scoring with Sparse Training Data

In Computer Assisted Language Learning systems, pronunciation scoring consists in providing a score grading the overall pronunciation quality of the speech uttered by a student. In this work, a log-likelihood ratio obtained with respect to two automatic speech recognition (ASR) models was used as the score. One model represents native pronunciation while the other captures non-native pronunciation. Different approaches to obtaining each model and different amounts of training data were analyzed. The best results were obtained by training an ASR system on a separate large corpus without pronunciation quality annotations and then adapting it to the native and non-native data, sequentially. Nevertheless, when models are trained directly on the native and non-native data, pronunciation scoring performance is similar. This is a surprising result considering that word error rates for these models are significantly worse, indicating that ASR performance is not a good predictor of pronunciation scoring performance on this system.

Federico Landini, Luciana Ferrer, Horacio Franco
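
The scoring idea above can be illustrated with a minimal sketch: assuming per-frame log-likelihoods under a native and a non-native acoustic model are already available (however those models were trained), the pronunciation score is simply their averaged log-likelihood ratio. The function name and toy values below are illustrative, not taken from the paper:

```python
import numpy as np

def pronunciation_score(frame_loglikes_native, frame_loglikes_nonnative):
    """Log-likelihood ratio pronunciation score, averaged per frame.

    Both arguments are per-frame log-likelihoods of the same utterance under
    a native-trained and a non-native-trained acoustic model (however obtained).
    Higher values suggest more native-like pronunciation.
    """
    native = np.asarray(frame_loglikes_native, dtype=float)
    nonnative = np.asarray(frame_loglikes_nonnative, dtype=float)
    return float(np.mean(native - nonnative))

# Toy example with made-up frame log-likelihoods.
print(pronunciation_score([-4.1, -3.9, -4.3], [-4.8, -4.6, -5.0]))  # positive -> native-like
```
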
An Algorithm for Detection of Breath Sounds in Spontaneous Speech with Application to Speaker Recognition

Automatic detection and demarcation of non-speech sounds in speech is critical for developing sophisticated human-machine interaction systems. The main objective of this study is to develop acoustic features capturing the production differences between speech and breath sounds in terms of both excitation source and vocal tract system characteristics. Using these features, a rule-based algorithm is proposed for automatic detection of breath sounds in spontaneous speech. The proposed algorithm outperforms previous methods for detection of breath sounds in spontaneous speech. Further, the importance of breath detection for speaker recognition is analyzed by considering an i-vector-based speaker recognition system. Experimental results show that the detection of breath sounds prior to i-vector extraction is essential to nullify the effect of breath sounds occurring in test samples on speaker recognition, which would otherwise degrade the performance of i-vector-based speaker recognition systems.

Sri Harsha Dumpala, K. N. R. K. Raju Alluri
An Alternative Approach to Exploring a Video

Exploring the content of a video is typically inefficient due to the linear streamed nature of its media. A video may be seen as a combination of a set of features: the visual track, the audio track, the transcription of the spoken words, etc. These features may be viewed as a set of temporally bounded parallel modalities. It is our contention that together these modalities and derived features have the potential to be presented individually or in discrete combination, to allow deeper and more effective content exploration within different parts of a video. This paper presents a novel system for video exploration and reports a recent user study conducted to learn usage patterns by offering video content in an alternative representation. The learned usage patterns may be utilized to build a template-driven representation engine that uses the features to offer a multimodal synopsis of a video, which may lead to more efficient exploration of video content.

Fahim A. Salim, Fasih Haider, Owen Conlan, Saturnino Luz
An Analysis of the RNN-Based Spoken Term Detection Training

This paper studies the training process of the recurrent neural networks used in the spoken term detection (STD) task. The method used in the paper employs two jointly trained Siamese networks using unsupervised data. The grapheme representation of a searched term and the phoneme realization of a putative hit are projected into the pronunciation embedding space using these networks. The score is estimated as the relative distance of these embeddings. The paper studies the influence of different loss functions, the amount of unsupervised data and the meta-parameters on the performance of the STD system.

Jan Švec, Luboš Šmídl, Josef V. Psutka
Analysis of Interaction Parameter Levels in Interaction Quality Modelling for Human-Human Conversation

Estimation of dialogue quality, especially the quality of interaction, is an essential part of improving the quality of spoken dialogue systems (SDSs) or call centres. The Interaction Quality (IQ) metric is one such approach. Originally, it was designed for SDSs to estimate an ongoing human-computer spoken interaction (HCSI). Due to the similarity between task-oriented human-human conversation (HHC) and HCSI, this approach was adapted to HHC. As for HCSI, the IQ model for HHC is based on features from three interaction parameter levels: an exchange, a window, and a dialogue level. We determine the significance of the different levels for IQ modelling for HHC. Moreover, for the window level we try to find an optimal window size. Our study aimed to simplify the IQ model for HHC, as well as to find differences and similarities between IQ models for HHC and HCSI.

Anastasiia Spirina, Olesia Vaskovskaia, Tatiana Karaseva, Alina Skorokhod, Iana Polonskaia, Maxim Sidorov
Annotation Error Detection: Anomaly Detection vs. Classification

We compare two approaches to automatic detection of annotation errors in single-speaker read-speech corpora used for speech synthesis: anomaly- and classification-based detection. Both approaches principally differ in that the classification-based approach needs to use both correctly annotated and misannotated words for training. On the other hand, the anomaly-based detection approach needs only the correctly annotated words for training (plus a few misannotated words for validation). We show that both approaches lead to statistically comparable results when all available misannotated words are utilized during detector/classifier development. However, when a smaller number of misannotated words are used, the anomaly detection framework clearly outperforms the classification-based approach. A final listening test showed the effectiveness of the annotation error detection for improving the quality of synthetic speech.

Jindřich Matoušek, Daniel Tihelka
Are You Addressing Me? Multimodal Addressee Detection in Human-Human-Computer Conversations

The goal of addressee detection is to answer the question ‘Are you addressing me?’ In order to participate in multiparty conversations, a spoken dialogue system is supposed to determine whether a user is addressing the system or another human. The present paper describes three levels of speech and text analysis (acoustical, lexical, and syntactical) for multimodal addressee detection and reveals the connection between them and the classification performance for different categories of speech. We propose several classification models and compare their performance with the results of the original research performed by the authors of the Smart Video Corpus which we use in our computations. Our most effective meta-classifier working with acoustical, syntactical, and lexical features provides an unweighted average recall equal to 0.917, showing a nine percent advantage over the best baseline model, though the baseline classifier additionally uses head orientation data. We also propose an LSTM neural network for text classification which replaces the lexical and the syntactical classifier by a single model reaching the same performance as the most effective meta-classifier does, despite the fact that this meta-model additionally analyses acoustical data.

Oleg Akhtiamov, Dmitrii Ubskii, Evgeniia Feldina, Aleksei Pugachev, Alexey Karpov, Wolfgang Minker
Assessing Spoken Dialog Services from the End-User Perspective: Usability and Experience

Assessment of the usability and user experience of spoken dialog services is a rather complex task, which remains difficult to achieve with real end-users. In this work a three-fold evaluation approach is introduced, which supports reliable assessment of usability and user experience. The approach combines interaction log data based assessment (at dialog, task and node level) with an optimized questionnaire-based end-user evaluation and a controlled stress test performed by an IVR system. The three-fold evaluation approach was used for the assessment of usability and user experience of the pilot deployment of a voice banking system. The proposed assessment approach provides sufficient evidence for business-informed decision-making with respect to the perceived user quality of the interaction and offered services and allows for investigation of potential improvement areas.

Otilia Kocsis, Basilis Kladis, Anastasios Tsopanoglou, Nikos Fakotakis
Audio-Replay Attack Detection Countermeasures

This paper presents the Speech Technology Center (STC) replay attack detection systems proposed for the Automatic Speaker Verification Spoofing and Countermeasures Challenge 2017. In this study we focused on the comparison of different spoofing detection approaches: GMM based methods, high-level feature extraction with a simple classifier, and deep learning frameworks. Experiments performed on the development and evaluation parts of the challenge dataset demonstrated stable efficiency of the deep learning approaches under changing acoustic conditions. At the same time, an SVM classifier with high-level features provided a substantial contribution to the efficiency of the resulting STC systems, according to the fusion system results.

Galina Lavrentyeva, Sergey Novoselov, Egor Malykh, Alexander Kozlov, Oleg Kudashev, Vadim Shchemelinin
Automatic Estimation of Presentation Skills Using Speech, Slides and Gestures

This paper proposes an automatic system which uses multimodal techniques for estimating oral presentation skills. It is based on a set of features from three sources: audio, gesture and PowerPoint slides. Machine learning techniques are used to classify each presentation into two classes (high vs. low) and into three classes (low, average, and high-quality presentation). Around 448 multimodal recordings of the MLA’14 dataset were used for training and evaluating three different 2-class and 3-class classifiers. Classifiers were evaluated for each feature type independently and for all features combined together. The best accuracy of the 2-class systems is 90.1%, achieved by an SVM trained on audio features, and 75% for the 3-class systems, achieved by a random forest trained on slide features. Combining the three feature types into one vector improves the accuracy of all systems by around 5%.

Abualsoud Hanani, Mohammad Al-Amleh, Waseem Bazbus, Saleem Salameh
Automatic Phonetic Transcription for Russian: Speech Variability Modeling

More advanced approaches to phonetic transcription are now required for different speech technology tasks such as TTS or ASR. All subtle differences in the phonetic characteristics of sound sequences inside words and at word boundaries need more accurate and variable transcription rules. Moreover, it is not enough to take into account only the normal rules of phonetic transcription: it is important to include information about speech variability in regional and social dialects, popular speech and colloquial variants of high-frequency lexis. In this paper a reliable method for automatic phonetic transcription of Russian text is presented. The system produces not only an ideal transcription for the Russian text but also takes into account the complex processes of sound change and variation within the Russian standard pronunciation. Our transcription system is reliable and can be used not only for TTS systems but also in ASR tasks that require a more flexible approach to phonetic transcription of the text.

Vera Evdokimova, Pavel Skrelin, Tatiana Chukaeva
Automatic Smoker Detection from Telephone Speech Signals

This paper proposes automatic smoking habit detection from spontaneous telephone speech signals. In this method, each utterance is modeled using the i-vector and non-negative factor analysis (NFA) frameworks, which yield low-dimensional representations of utterances by applying factor analysis to Gaussian mixture model means and weights, respectively. Each framework is evaluated using different classification algorithms to detect the smoker speakers. Finally, score-level fusion of the i-vector-based and the NFA-based recognizers is considered to improve the classification accuracy. The proposed method is evaluated on telephone speech signals of speakers whose smoking habits are known, drawn from the National Institute of Standards and Technology (NIST) 2008 and 2010 Speaker Recognition Evaluation databases. Experimental results over 1194 utterances show the effectiveness of the proposed approach for the automatic smoking habit detection task.

Amir Hossein Poorjam, Soheila Hesaraki, Saeid Safavi, Hugo van Hamme, Mohamad Hasan Bahari
Bimodal Anti-Spoofing System for Mobile Security

Multi-modal biometric verification systems are in active development and show impressive performance nowadays. However, such systems need additional protection from spoofing attacks. In our paper we present the full pipeline of an anti-spoofing method (based on our previous work) for a bimodal audiovisual verification system. This method evaluates quality parameters of a sequence of face images during the verification process. Based on these parameters, it is decided whether the data are suitable for processing by the standard method (fiducial-points-based audiovisual liveness detection, FALD). If the quality of the data is not sufficient, then our system switches to a new algorithm (SVM-based audiovisual liveness detection, SALD), which provides less protection quality but is able to operate when FALD is unsuitable. To improve the quality of the FALD algorithm we have collected a special dataset. This dataset improves the reliability of the algorithm for locating fiducial points in the user’s face image. Tests show that the developed system can significantly improve the quality of anti-spoofing protection compared to our previous work.

Eugene Luckyanets, Aleksandr Melnikov, Oleg Kudashev, Sergey Novoselov, Galina Lavrentyeva
Canadian English Word Stress: A Corpora-Based Study of National Identity in a Multilingual Community

Canadian English (CE) word stress, apart from sharing stress patterns with either the American or the British norms, reveals nationally specific rhythm-based features. The evidence was collected by working through the English Pronouncing Dictionary (EPD) and the Canadian Oxford Dictionary (COD). The next step was comparing frequencies of words with varying stress patterns in three national written and spoken speech corpora: the British National Corpus (BNC), the Corpus of Contemporary American English (COCA) and the Corpus of Canadian English (CCE). The words under analysis displayed nearly identical frequencies in the three sources; 89 most frequent polysyllabic words were selected for online express-survey. Canadian subjects (30) representing the diversity of CE linguistic identities (anglophones, francophones, allophones) which affected their decisions on word stress locations demonstrated their preferences for either the Canadian, the British or the American stress patterns, accordingly. The viability of the Canadian stress patterns was supported by the data from two more Canadian natural speech corpora: International Dialects of English Archive (IDEA) and Voices of the International Corpus of English (VOICE). Acoustic and perceptual analyses based on production and perception processing performed by native (anglophone) CE speakers demonstrated the significance of secondary stress in CE stress patterns.

Tatiana Shevchenko, Daria Pozdeeva
Classification of Formal and Informal Dialogues Based on Turn-Taking and Intonation Using Deep Neural Networks

Here, we introduce a classification method for distinguishing between formal and informal dialogues using feature sets based on prosodic data. One such feature set is the raw fundamental frequency values paired with speaker information (i.e. turn-taking). The other feature set we examine is the prosodic labels extracted from the raw F0 values via the ProsoTool algorithm, which is also complemented by turn-taking. We evaluated the two feature sets by comparing the accuracy scores obtained by our classification method, which uses them to classify dialogue excerpts taken from the HuComTech corpus. With the ProsoTool features we achieved an average accuracy score of 85.2%, which meant a relative error rate reduction of 24% compared to the accuracy scores attained using F0 features. Regardless of the feature set applied, however, our method yields better accuracy scores than those achieved by human listeners, who only managed to distinguish between formal and informal dialogue at an accuracy level of 56.5%.

István Szekrényes, György Kovács
Clustering Target Speaker on a Set of Telephone Dialogs

The ability of the speaker’s voice model to reproduce a detailed parameterization of individual speech features is an important property for its use in solving different biometric problems. In the general case, one of the main reasons for performance degradation in voice biometric systems is the voice variability that occurs when the speaker’s state (emotional, physiological, etc.) or the channel conditions change. Therefore, accurate modeling of the intra-speaker voice variability leads to a more accurate voice model. This can be achieved by collecting multiple speech samples of the same speaker recorded in diverse conditions to create a so-called multi-session model. We consider the case when the speech data is represented by dialogues recorded in a single channel. This setup raises the problem of grouping the segments of a target speaker from the set of dialogues. We propose a clustering algorithm to solve this problem, which is based on probabilistic linear discriminant analysis (PLDA). Our experiments demonstrate the effectiveness of the proposed approach compared to solutions based on exhaustive search.

Andrey Shulipa, Aleksey Sholohov, Yuri Matveev
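
A minimal sketch of the kind of clustering described above, assuming the pairwise PLDA verification scores have already been computed; the greedy single-linkage merging and the threshold value are illustrative simplifications, not the authors' algorithm:

```python
import numpy as np

def cluster_segments(score_matrix, threshold):
    """Greedy single-linkage clustering over a symmetric PLDA score matrix.

    score_matrix[i, j] is the (assumed precomputed) PLDA verification score
    between segments i and j; pairs scoring above `threshold` are linked.
    Returns a cluster label per segment.
    """
    n = score_matrix.shape[0]
    labels = np.arange(n)                    # start with one cluster per segment
    for i in range(n):
        for j in range(i + 1, n):
            if score_matrix[i, j] > threshold:
                old, new = labels[j], labels[i]
                labels[labels == old] = new  # merge the two clusters
    return labels

# Toy 4-segment example: segments 0/1 and 2/3 belong to the same speakers.
scores = np.array([[0, 9, -3, -4],
                   [9, 0, -2, -5],
                   [-3, -2, 0, 8],
                   [-4, -5, 8, 0]], dtype=float)
print(cluster_segments(scores, threshold=5.0))  # e.g. [0 0 2 2]
```
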
Cognitive Entropy in the Perceptual-Auditory Evaluation of Emotional Modal States of Foreign Language Communication Partner

The paper deals with the phenomenon of perceptual-auditory divergence in the evaluation of a foreign language communication partner’s emotional-modal state. The problem of “human – human” interaction (and, vice versa, “man – machine” interaction) is characterized by a very complex phenomenon associated with “native language – foreign language” communication, taking into account the idiosyncrasy of speech production and speech perception. This idiosyncrasy can be characterized in terms of the individual mixing of various types of information specified below (e.g., biological, psychological, social, cognitive, etc.). All these factors affect the process of recognizing the communication partner’s emotional-modal state. Therefore, one can assume that the idiosyncratic features of the perceiver (in this case, the listener) affect, in turn, the evaluation of the emotional-modal state, primarily that of a foreign language communication partner. In our pilot study special emphasis is laid on this problem, considered on the basis of Russian-German and German-Russian matches. The obtained data suggest a new model of perceptual-auditory processing of verbal stimuli, including such components as perceptual-auditory idiosyncrasy and auditory-perceptual cognitive entropy.

Rodmonga Potapova, Vsevolod Potapov
Correlation Normalization of Syllables and Comparative Evaluation of Pronunciation Quality in Speech Rehabilitation

The paper considers the problem of aligning syllables in time. This kind of normalization makes it possible to compare different realizations of the same syllable, and thus to make a comparative evaluation of syllable pronunciation quality when one of the syllables is a reference realization. If a patient’s recording made before operative treatment of oral cancer is used as such a reference, a comparative assessment of syllable pronunciation quality during speech rehabilitation can be made. In the normalization process, an approach aimed at maximizing the correlation between individual fragments of the syllable is applied. Then, the correlation coefficient is used as a measure of similarity between the reference and the estimated syllable. The work demonstrates the validity of this decision based on the processing of recordings from healthy people and from patients before and after surgical treatment. The results of this work allow us to approach the implementation of an automated software system for assessing syllable pronunciation quality and to proceed to implementing its working prototype.

Evgeny Kostyuchenko, Roman Meshcheryakov, Dariya Ignatieva, Alexander Pyatkov, Evgeny Choynzonov, Lidiya Balatskaya
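
A minimal sketch of the similarity measure described above, assuming the syllables are available as sampled signals; the linear time-normalisation used here is a crude stand-in for the correlation-maximising fragment alignment in the paper:

```python
import numpy as np

def syllable_similarity(reference, estimate):
    """Pearson correlation between a reference syllable and an estimated one.

    The estimate is first stretched/compressed to the reference length by
    linear interpolation, a crude stand-in for the fragment-wise alignment
    described in the paper.
    """
    reference = np.asarray(reference, dtype=float)
    estimate = np.asarray(estimate, dtype=float)
    x_old = np.linspace(0.0, 1.0, num=len(estimate))
    x_new = np.linspace(0.0, 1.0, num=len(reference))
    aligned = np.interp(x_new, x_old, estimate)        # time-normalised estimate
    return float(np.corrcoef(reference, aligned)[0, 1])

# Toy example: the same waveform shape at a different duration scores close to 1.
t_ref = np.linspace(0, 1, 200)
t_est = np.linspace(0, 1, 160)
print(syllable_similarity(np.sin(2 * np.pi * 3 * t_ref), np.sin(2 * np.pi * 3 * t_est)))
```
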
CRF-Based Phrase Boundary Detection Trained on Large-Scale TTS Speech Corpora

The paper compares different approaches to the phrase boundary detection issue, based on data gained from speech corpora recorded for the purposes of a text-to-speech (TTS) system. It is shown that a conditional random fields model can outperform basic deterministic and classification-based algorithms in both speaker-dependent and speaker-independent phrasing. Results on manually annotated sentences with phrase breaks are presented as well.

Markéta Jůzová
Deep Recurrent Neural Networks in Speech Synthesis Using a Continuous Vocoder

In our earlier work in statistical parametric speech synthesis, we proposed a vocoder using continuous F0 in combination with Maximum Voiced Frequency (MVF), which was successfully used with a feed-forward deep neural network (DNN). The advantage of a continuous vocoder in this scenario is that its parameters are simpler to model than those of traditional vocoders with discontinuous F0. However, DNNs lack sequence modeling, which might degrade the quality of synthesized speech. In order to avoid this problem, we propose the use of sequence-to-sequence modeling with recurrent neural networks (RNNs). In this paper, four neural network architectures (long short-term memory (LSTM), bidirectional LSTM (BLSTM), gated recurrent network (GRU), and standard RNN) are investigated and applied with this continuous vocoder to model F0, MVF, and Mel-Generalized Cepstrum (MGC) for more natural sounding speech synthesis. Experimental results from objective and subjective evaluations have shown that the proposed framework converges faster and gives state-of-the-art speech synthesis performance while outperforming the conventional feed-forward DNN.

Mohammed Salah Al-Radhi, Tamás Gábor Csapó, Géza Németh
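
For readers unfamiliar with sequence models for parametric synthesis, a minimal PyTorch sketch of an LSTM mapping per-frame linguistic features to vocoder parameters is shown below; the layer sizes and feature dimensions are illustrative and do not reproduce the authors' architecture:

```python
import torch
import torch.nn as nn

class VocoderParamLSTM(nn.Module):
    """Minimal sequence model mapping linguistic features to vocoder parameters.

    Dimensions are illustrative: `in_dim` linguistic features per frame,
    `out_dim` vocoder parameters per frame (e.g. continuous F0, MVF and MGC).
    """
    def __init__(self, in_dim=300, hidden=256, out_dim=62):
        super().__init__()
        self.lstm = nn.LSTM(in_dim, hidden, num_layers=2,
                            batch_first=True, bidirectional=True)
        self.proj = nn.Linear(2 * hidden, out_dim)

    def forward(self, x):                  # x: (batch, frames, in_dim)
        h, _ = self.lstm(x)
        return self.proj(h)                # (batch, frames, out_dim)

model = VocoderParamLSTM()
frames = torch.randn(4, 100, 300)          # a dummy batch of 100-frame utterances
params = model(frames)
loss = nn.MSELoss()(params, torch.randn_like(params))  # regression training target
loss.backward()
print(params.shape)                        # torch.Size([4, 100, 62])
```
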
Design of Online Echo Canceller in Duplex Mode

A new online echo canceller system was developed that works fully in the duplex mode and shows good performance in the conventional half-duplex mode. The near-end speech signal is not corrupted in the duplex mode by the linear compensation procedures, but it is degraded significantly by nonlinear suppression. The conventional linear compensation system is based on LMS adaptation or its modifications, which do not provide the necessary high accuracy for a large number of impulse response coefficients. We have implemented the LS method for online estimation of the full impulse response using the superfast Schur algorithm for Toeplitz matrix inversion. Implementation details are important; they include numerical accuracy, initialization, update criteria and packet control. Good echo compensation results are shown on data from the Matlab Audio System Toolbox.

Andrey Barabanov, Evgenij Vikulov
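
For context, the conventional adaptive-filter baseline that the paper contrasts with can be sketched as a normalised LMS echo canceller; the tap count, step size and toy "room" response below are illustrative:

```python
import numpy as np

def nlms_echo_canceller(far_end, microphone, taps=128, mu=0.5, eps=1e-6):
    """Conventional normalised-LMS echo canceller (the baseline the paper improves on).

    far_end:    loudspeaker (reference) signal
    microphone: near-end signal containing the echo
    Returns the error signal, i.e. the microphone signal with the estimated echo removed.
    """
    w = np.zeros(taps)                          # adaptive impulse response estimate
    out = np.zeros(len(microphone))
    for n in range(taps, len(microphone)):
        x = far_end[n - taps:n][::-1]           # most recent reference samples
        echo_hat = np.dot(w, x)
        e = microphone[n] - echo_hat
        w += mu * e * x / (np.dot(x, x) + eps)  # normalised LMS update
        out[n] = e
    return out

# Toy test: the echo is the far-end signal filtered by a short random "room" response.
rng = np.random.default_rng(0)
far = rng.standard_normal(16000)
room = rng.standard_normal(64) * np.exp(-np.arange(64) / 8.0)
mic = np.convolve(far, room)[:16000]
residual = nlms_echo_canceller(far, mic)
print(np.var(mic[8000:]), np.var(residual[8000:]))  # residual variance should be much smaller
```
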
Detection of Stance and Sentiment Modifiers in Political Blogs

The automatic detection of seven types of modifiers was studied: Certainty, Uncertainty, Hypotheticality, Prediction, Recommendation, Concession/Contrast and Source. A classifier aimed at detecting local cue words that signal the categories was the most successful method for five of the categories. For Prediction and Hypotheticality, however, better results were obtained with a classifier trained on tokens and bigrams present in the entire sentence. Unsupervised cluster features were shown useful for the categories Source and Uncertainty, when a subset of the training data available was used. However, when all of the 2,095 sentences that had been actively selected and manually annotated were used as training data, the cluster features had a very limited effect. Some of the classification errors made by the models would be possible to avoid by extending the training data set, while other features and feature representations, as well as the incorporation of pragmatic knowledge, would be required for other error types.

Maria Skeppstedt, Vasiliki Simaki, Carita Paradis, Andreas Kerren
Digits to Words Converter for Slavic Languages in Systems of Automatic Speech Recognition

In this paper, a system for digits-to-words conversion for almost all Slavic languages is proposed. This system was developed to improve the text corpora which we use for building a lexicon or for training language models and acoustic models in the task of Large Vocabulary Continuous Speech Recognition (LVCSR). Strings of digits, some other special characters (%, €, $, ...) and abbreviations of physical units (km, m, cm, kg, l, °C, etc.) occur very often in our text corpora – in about 5% of cases. The strings of digits or special characters are usually omitted when a lexicon is being built or a language model is being trained. The task of digits-to-words conversion in non-inflected languages (e.g. English) is solved by a relatively simple conversion or lookup table. The problem is more complex in inflected Slavic languages: a string of digits can be converted into several different word combinations, depending on the context, and the resulting words are inflected by gender or case. The main goal of this research was to find the rules (patterns) for the conversion of strings of digits into words for Slavic languages. The second goal was to unify these patterns across Slavic languages and to integrate them into a universal system for digits-to-words conversion.

Josef Chaloupka
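
A toy sketch of the context dependence discussed above, using Russian gender agreement for the numeral "1"; the lexicon and rules are deliberately minimal and a real system must also handle case, plural forms and larger numbers:

```python
# Toy illustration of context-dependent conversion: the word form of "1" must
# agree with the grammatical gender of the following noun. The noun-gender
# dictionary is a hypothetical stand-in for a full morphological analyser.
NOUN_GENDER = {"дом": "m", "книга": "f", "окно": "n"}

NUMERAL_FORMS = {
    "1": {"m": "один", "f": "одна", "n": "одно"},
}

def expand_digits(tokens):
    out = []
    for i, tok in enumerate(tokens):
        nxt = tokens[i + 1] if i + 1 < len(tokens) else None
        if tok in NUMERAL_FORMS and nxt in NOUN_GENDER:
            out.append(NUMERAL_FORMS[tok][NOUN_GENDER[nxt]])  # gender-agreeing word form
        else:
            out.append(tok)
    return out

print(" ".join(expand_digits("1 книга".split())))  # -> "одна книга"
print(" ".join(expand_digits("1 дом".split())))    # -> "один дом"
```
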
Discriminating Speakers by Their Voices — A Fusion Based Approach

The task of Speaker Discrimination (SD) consists in checking whether two speech segments belong to the same speaker or not. In this research field, it is often difficult to decide which classifier is best in terms of accuracy and robustness. For that purpose, we have implemented 9 classifiers: Support Vector Machines, Linear Discriminant Analysis, Multi-Layer Perceptron, Generalized Linear Model, Self Organizing Map, Adaboost, Second Order Statistical Measures, Linear Regression and Gaussian Mixture Models. Furthermore, a new fusion approach is proposed and tested in speaker discrimination. Several speaker discrimination experiments were conducted on Hub4 Broadcast-News with relatively short segments. The obtained results show that the best classifier is the SVM and that the proposed fusion approach is quite interesting, since it provided the best overall performance.

Halim Sayoud, Siham Ouamour, Zohra Hamadache
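
Score-level fusion of several classifiers can be sketched very simply; the weights and toy scores below are illustrative and not the fusion scheme proposed in the paper:

```python
import numpy as np

def fuse_scores(score_lists, weights=None):
    """Weighted-sum fusion of per-trial scores from several classifiers.

    score_lists: list of arrays, one per classifier, each holding one
    same-speaker score per trial (already calibrated to a comparable range).
    """
    scores = np.vstack([np.asarray(s, dtype=float) for s in score_lists])
    if weights is None:
        weights = np.ones(len(score_lists)) / len(score_lists)
    return np.average(scores, axis=0, weights=weights)

# Toy example: three classifiers, four trials, SVM-like scores weighted highest.
svm = [1.2, -0.4, 0.9, -1.1]
mlp = [0.8, -0.2, 0.4, -0.7]
gmm = [0.5,  0.1, 0.6, -0.9]
fused = fuse_scores([svm, mlp, gmm], weights=[0.5, 0.3, 0.2])
decisions = fused > 0.0            # same speaker if the fused score exceeds a threshold
print(fused, decisions)
```
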
Emotional Poetry Generation

In this article we describe a new system for the automatic creation of poetry in Basque that not only generates novel poems, but also creates them conveying a certain attitude or state of mind. A poem is a text structured according to predefined formal rules, whose parts are semantically related and with an intended message, aiming to elicit an emotional response. The proposed system receives as an input the topic of the poem and the affective state (positive, neutral or negative) and tries to give as output a novel poem that: (1) satisfies formal constraints of rhyme and metric, (2) shows coherent content related to the given topic, and (3) expresses them through the predetermined mood. Although the presented system creates poems in Basque, it is highly modular and easily extendable to new languages.

Aitzol Astigarraga, José María Martínez-Otzeta, Igor Rodriguez, Basilio Sierra, Elena Lazkano
End-to-End Large Vocabulary Speech Recognition for the Serbian Language

This paper presents the results of large vocabulary speech recognition for the Serbian language, developed using the Eesen end-to-end framework. Eesen involves training a single deep recurrent neural network, containing a number of bidirectional long short-term memory layers, modeling the connection between the speech and a set of context-independent lexicon units. This approach reduces the amount of expert knowledge needed to develop competitive speech recognition systems. The training is based on connectionist temporal classification, while decoding allows the usage of weighted finite-state transducers. This provides much faster and more efficient decoding in comparison to other similar systems. A corpus of approximately 215 h of audio data (about 171 h of speech and 44 h of silence, from 243 male and 239 female speakers) was employed for training (about 90%) and testing (about 10%). On a set of more than 120000 words, a word error rate of 14.68% and a character error rate of 3.68% are achieved.

Branislav Popović, Edvin Pakoci, Darko Pekar
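
A minimal sketch of CTC-based training of a bidirectional LSTM acoustic model, in the spirit of the Eesen-style setup described above; PyTorch is used here purely for illustration, and the dimensions, label set and batch contents are made up:

```python
import torch
import torch.nn as nn

# Illustrative dimensions only: 40-dim filterbank frames, 35 grapheme labels + blank.
num_labels, blank = 36, 0
rnn = nn.LSTM(input_size=40, hidden_size=320, num_layers=3,
              bidirectional=True, batch_first=False)
proj = nn.Linear(2 * 320, num_labels)
ctc = nn.CTCLoss(blank=blank, zero_infinity=True)

feats = torch.randn(200, 8, 40)                   # (frames, batch, features)
feat_lens = torch.full((8,), 200, dtype=torch.long)
targets = torch.randint(1, num_labels, (8, 30))   # grapheme indices, 0 reserved for blank
target_lens = torch.full((8,), 30, dtype=torch.long)

hidden, _ = rnn(feats)
log_probs = proj(hidden).log_softmax(dim=-1)      # (frames, batch, labels)
loss = ctc(log_probs, targets, feat_lens, target_lens)
loss.backward()
print(float(loss))
```
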
Examining the Impact of Feature Selection on Sentiment Analysis for the Greek Language

Sentiment analysis identifies the attitude that a person has towards a service, a topic or an event, and it is very useful for companies which receive many written opinions. Research studies have shown that sentiment in written text can be accurately determined through text and part-of-speech features. In this paper, we present an approach to recognize opinions in the Greek language and we examine the impact of feature selection on the analysis of opinions and the performance of the classifiers. We analyze a large number of feedback comments from teachers about e-learning, life-long courses that they have attended, with the aim of determining their opinions. A number of text-based and part-of-speech-based features are extracted from the textual data, and a generic approach to analyze text and determine opinion is presented. Evaluation results indicate that the illustrated approach is accurate in identifying opinions in Greek text and also shed light on the effect that various features have on the classification performance.

Nikolaos Spatiotis, Michael Paraskevas, Isidoros Perikos, Iosif Mporas
Experimenting with Hybrid TDNN/HMM Acoustic Models for Russian Speech Recognition

In this paper, we study an application of time delay neural networks (TDNNs) in acoustic modeling for large vocabulary continuous Russian speech recognition. We created TDNNs with various numbers of hidden layers and units in the hidden layers with p-norm nonlinearity. Training of acoustic models was carried out on our own Russian speech corpus containing phonetically balanced phrases. Duration of the speech corpus is more than 30 h. Testing of TDNN-based acoustic models was performed in the very large vocabulary continuous Russian speech recognition task. Conducted experiments showed that TDNN models outperformed baseline deep neural network models in terms of the word error rate.

Irina Kipyatkova
Exploring Multiparty Casual Talk for Social Human-Machine Dialogue

Much talk between humans is casual and multiparty. It facilitates social bonding and mutual co-presence rather than strictly being used to exchange information in order to complete well-defined practical tasks. Artificial partners that are capable of participating as a speaker or listener in such talk would be useful in companionship, educational, and social contexts. However, such applications require dialogue structure beyond simple question/answer routines. While there is a body of theory on multiparty casual talk, there is a lack of work quantifying such phenomena. This is critical if we are to manage and generate human-machine multiparty casual talk. We outline the current knowledge on the structure of casual talk, describe our investigations in this domain, summarise our findings on timing, laughter, and disfluency in this domain, and discuss how they can inform the design and implementation of truly social machine dialogue partners.

Emer Gilmartin, Benjamin R. Cowan, Carl Vogel, Nick Campbell
First Experiments to Detect Anomaly Using Personality Traits vs. Prosodic Features

This paper presents the design of an anomaly detector based on three different sets of features, one corresponding to some prosodic descriptors and two extracted from Big Five traits. Big Five traits correspond to a simple but efficient representation of a human personality. They are extracted from a manual annotation while prosodic features are extracted directly from the speech signal. We evaluate two different anomaly detection methods: One-Class SVM (OC-SVM) and iForest, each one combined with a threshold classification to decide the “normality” of a sample. The different combinations of models and feature sets are evaluated on the SSPNET-Personality corpus which has already been used in several experiments, including a previous work on separating two types of personality profiles in a supervised way. In this work, we propose the above mentioned unsupervised methods, and discuss their performance, to detect particular audio-clips produced by a speaker with an abnormal personality. Results show that using automatically extracted prosodic features competes with the Big Five traits. In our case, OC-SVM seems to get better results than iForest.

Cedric Fayet, Arnaud Delhay, Damien Lolive, Pierre-François Marteau
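
A minimal sketch of the two anomaly detectors mentioned above, using scikit-learn's OneClassSVM and IsolationForest on made-up feature vectors; the features, parameters and threshold at zero are illustrative, not the paper's setup:

```python
import numpy as np
from sklearn.ensemble import IsolationForest
from sklearn.svm import OneClassSVM

rng = np.random.default_rng(0)
# Hypothetical prosodic feature vectors: "normal" clips for training,
# a mixed set of normal and abnormal clips for testing.
normal_train = rng.normal(0.0, 1.0, size=(300, 6))
test = np.vstack([rng.normal(0.0, 1.0, size=(20, 6)),
                  rng.normal(4.0, 1.0, size=(5, 6))])  # last 5 rows are "abnormal"

ocsvm = OneClassSVM(kernel="rbf", gamma="scale", nu=0.1).fit(normal_train)
iforest = IsolationForest(random_state=0).fit(normal_train)

# decision_function gives a "normality" score; thresholding it at 0 (or at a
# tuned value, as in the paper) yields the normal/abnormal decision.
print((ocsvm.decision_function(test) < 0).astype(int))
print((iforest.decision_function(test) < 0).astype(int))
```
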
Fusion of a Novel Volterra-Wiener Filter Based Nonlinear Residual Phase and MFCC for Speaker Verification

This paper investigates the complementary nature of the speaker-specific information present in the Volterra-Wiener filter residual (VWFR) phase of the speech signal in comparison with the information present in conventional Mel Frequency Cepstral Coefficients (MFCC) and the Teager Energy Operator (TEO) phase. The feature set is derived from the residual phase extracted from the output of a nonlinear filter designed using the Volterra-Wiener series, exploiting higher-order linear as well as nonlinear relationships hidden in the sequence of speech signal samples. The proposed feature set is used to conduct Speaker Verification (SV) experiments on the NIST SRE 2002 database using a state-of-the-art GMM-UBM system. The score-level fusion of the proposed feature set with MFCC gives an EER of 6.05%, compared to an EER of 8.9% with MFCC alone. An EER of 8.83% is obtained for the TEO phase in fusion with MFCC, indicating that the residual phase from the proposed nonlinear filtering approach contains complementary speaker-specific information.

Purvi Agrawal, Hemant A. Patil
Hesitations in Spontaneous Speech: Acoustic Analysis and Detection

Spontaneous speech differs from any other type of speech in many ways, with speech disfluencies being the prominent feature. These phenomena play an important role in communication, but also cause problems for automatic speech processing. In this study we present the results of an acoustic analysis of the most frequent disfluencies – voiced hesitations (filled pauses and lengthenings) – across different speaking styles in spontaneous Russian speech, as well as the results of experiments on their detection using an SVM classifier on a joint Russian and English spontaneous speech corpus. The acoustic analysis showed significant differences in the fundamental frequency and energy distribution ratios of hesitations and their contexts across speaking styles in Russian: compared to dialogues, in monologues speakers exhibit more prosodic cues for the adjacent context and hesitations. Experiments on the detection of voiced hesitations on a mixed language and style corpus with the SVM resulted in an F1-score of 0.48 (with an F1-score of 0.55 for the Russian data alone).

Vasilisa Verkhodanova, Vladimir Shapranov, Irina Kipyatkova
Human as Acmeologic Entity in Social Network Discourse (Multidimensional Approach)

The main aim of the project is to develop an acmeological approach to modeling the characteristics of the speech activity of communicants in social network discourse (SND), taking into account the degree of “maturation” of the destructive features of the personality. The work builds on earlier material that includes features of the personality structure of the “subjects”, based on samples of their written and spoken speech in the electronic media environment, for the purpose of constructing an acmeological matrix that will enable experimental “measurement” of the acmeological difference in the potentials of an individual. The scientific novelty and significance of this research are determined by the modern social network situation with regard to the large number of negative information communication utterances.

Rodmonga Potapova, Vsevolod Potapov
Improved Speaker Adaptation by Combining I-vector and fMLLR with Deep Bottleneck Networks

This paper investigates how deep bottleneck neural networks can be used to combine the benefits of both i-vectors and speaker-adaptive feature transformations. We show how a GMM-based speech recognizer can be greatly improved by applying a feature-space maximum likelihood linear regression (fMLLR) transformation to the outputs of a deep bottleneck neural network trained on a concatenation of regular Mel filterbank features and speaker i-vectors. The addition of the i-vectors reduces the word error rate of the GMM system by 3–7% compared to an identical system without i-vectors. We also examine Deep Neural Network (DNN) systems trained on various combinations of i-vectors, fMLLR-transformed bottleneck features and other feature space transformations. The best approach results in speaker-adapted DNNs which showed a 15–19% relative improvement over a strong speaker-independent DNN baseline.

Thai Son Nguyen, Kevin Kilgour, Matthias Sperber, Alex Waibel
Improving LVCSR for Casual Czech Using Publicly Available Language Resources

The paper presents the design of a Czech casual speech recognition system, which is part of wider research focused on understanding very informal speaking styles. The study was carried out using the NCCCz corpus, and the contributions of optimized acoustic and language models as well as pronunciation lexicon optimization were analyzed. Special attention was paid to the impact of publicly available corpora suitable for language model (LM) creation. Our final DNN-HMM system achieved a WER of 30–60% in the casual speech recognition task, depending on the LM used. Recognition results for other speaking styles are presented as well for comparison purposes. The system was built using the KALDI toolkit, and the created recipes are available to the research community.

Petr Mizera, Petr Pollak
Improving Performance of Speaker Identification Systems Using Score Level Fusion of Two Modes of Operation

In this paper we present a score level fusion methodology for improving the performance of closed-set speaker identification. The fusion is performed on scores which are extracted from GMM-UBM text-dependent and text-independent speaker identification engines. The experimental results indicated that the score level fusion improves the speaker identification performance compared with the best performing single operation mode of speaker identification.

Saeid Safavi, Iosif Mporas
Improving Speech-Based Emotion Recognition by Using Psychoacoustic Modeling and Analysis-by-Synthesis

Most technical communication systems use speech compression codecs to save transmission bandwidth. Much development effort has gone into guaranteeing high speech intelligibility, resulting in different compression techniques: Analysis-by-Synthesis, psychoacoustic modeling and a hybrid mode of both. Our first assumption is that the hybrid mode improves speech intelligibility. But enabling a natural spoken conversation also requires that affective, namely emotional, information contained in spoken language be intelligibly transmitted. Usually, compression methods are avoided for emotion recognition problems, as it is feared that compression degrades the acoustic characteristics needed for accurate recognition [1]. By contrast, in our second assumption we state that the combination of psychoacoustic modeling and Analysis-by-Synthesis codecs could actually improve speech-based emotion recognition by removing certain parts of the acoustic signal that are considered “unnecessary”, while still containing the full emotional information. To test both assumptions, we conducted an ITU-recommended POLQA measurement as well as several emotion recognition experiments employing two different datasets to verify the generality of this assumption. We compared our results on the hybrid mode with Analysis-by-Synthesis-only and psychoacoustic-modeling-only codecs. The hybrid mode does not show remarkable differences regarding speech intelligibility, but it outperforms all other compression settings in the multi-class emotion recognition experiments and even achieves a ~3.3% absolute higher performance than the uncompressed samples.

Ingo Siegert, Alicia Flores Lotz, Olga Egorow, Andreas Wendemuth
In Search of Sentence Boundaries in Spontaneous Speech

Oral text is certainly discrete. It is built of “small bricks”, units not only of the lexical but also of the higher syntactical level. Common syntagmatic pauses and hesitation pauses – physical (unfilled ones, including breaks of clauses), sound pauses (e-e, m-m), and verbal ones (vot, kak eto, nu, znachit etc.) – are markers of this discreteness. However, that reveals neither the syntagma nor the sentence as a unit for describing the syntactic structure of an oral text. Any type of pause may occur in any place of an audio sequence. Thus, the search for sentences in spontaneous speech is quite complicated. In order to obtain such units, a method of forced punctuation can be offered; it was used for marking up the spontaneous monologues from the collection of oral texts named «Balanced Annotated Textotec». The testees (philology experts) were asked to mark the ends of sentences by putting a period in the transcripts, where neither pauses nor punctuation had been marked. The testees could only rely on the syntactic structure of the text and the connection between words and predicate centers. Involving more than twenty experts in the experiment provides more statistically accurate results. In this work we describe the results of our experiment and discuss how those results can be used for the automatic search of sentence boundaries in spontaneous speech.

Natalia Bogdanova-Beglarian
Investigating Acoustic Correlates of Broad and Narrow Focus Perception by Japanese Learners of English

This work is an addition to the relatively short line of research concerning second language prosody perception. Using a prominence marking experiment, the study demonstrates that Japanese learners of English can perceptually discriminate between different focus scopes. Perceptual score profiles imply that narrowly focused words are identified and discriminated relatively easily, while differentiation of different scopes of broad focus presents a greater challenge. An analysis of a range of acoustic cues indicates that perceptual scores correlate most strongly with F0-based features. While this result is in contradiction with previous research results, it is shown that the divergence is attributable to the particular acoustic characteristics of the stimulus.

Gábor Pintér, Oliver Jokisch, Shinobu Mizuguchi
Language Adaptive Multilingual CTC Speech Recognition

Recently, it has been demonstrated that speech recognition systems are able to achieve human parity. While much research is done for resource-rich languages like English, there exists a long tail of languages for which no speech recognition systems yet exist. The major obstacle to building systems for new languages is the lack of available resources. In the past, several methods have been proposed to build systems in low-resource conditions by using data from additional source languages during training. While it has been shown that DNN/HMM hybrid setups trained in low-resource conditions benefit from additional data, we propose a similar technique using sequence-based neural network acoustic models with the Connectionist Temporal Classification (CTC) loss function. We demonstrate that setups with multilingual phone sets benefit from the addition of Language Feature Vectors (LFVs).

Markus Müller, Sebastian Stüker, Alex Waibel
Language Model Optimization for a Deep Neural Network Based Speech Recognition System for Serbian

This paper presents the results obtained using several variants of trigram language models in a large vocabulary continuous speech recognition (LVCSR) system for the Serbian language, based on the deep neural network (DNN) framework implemented within the Kaldi speech recognition toolkit. This training approach allows parallelization using several threads on either multiple GPUs or multiple CPUs, and provides a natural-gradient modification of the stochastic gradient descent (SGD) optimization method. Acoustic models are trained over a fixed number of training epochs with parameter averaging at the end. This paper discusses recognition using different language models trained with Kneser-Ney or Good-Turing smoothing methods, as well as several pruning parameter values. The results on a test set containing more than 120000 words and different utterance types are explored and compared to the reference results with GMM-HMM speaker-adapted models for the same speech database. Online and offline recognition results are compared to each other as well. Finally, the effect of additional discriminative training using a language model prior to the DNN stage is explored.

Edvin Pakoci, Branislav Popović, Darko Pekar
Lexico-Semantical Indices of “Deprivation – Aggression” Modality Correlation in Social Network Discourse

The article analyzes social network discourse (SND) containing elements of speech aggression actualized by communicants whose emotional state is caused by various deprivation factors. The analysis of 398 statements from men and women revealed the frequent use of statements whose stylistic modality relates to the aggressive type, while actualizing topics in speech communication associated with facts of social-cognitive deprivation. The dominant type of speech activity in the analyzed SNDs is an aggressive speech response of the SND-communicant (SND-addressee) to the speech provocation of another communicant (SND-sender). It was previously revealed that, as a rule, a major role is played by the gender factor: compared to women, men feel more at ease using statements of the aggressive type, both for speech provocation and for aggressive speech response.

Rodmonga Potapova, Liliya Komalova
Linguistic Features and Sociolinguistic Variability in Everyday Spoken Russian

The paper reviews the results of a project aimed at describing the everyday Russian language and analyzing the special characteristics of its usage by different social groups. The presented study was made on the material of a 125,000-word annotated subcorpus of the ORD corpus, which contains speech fragments of 256 people representing different gender, age, professional and status groups. Linguistic features from different linguistic levels, which could be considered diagnostic for different social groups, have been analyzed. It turned out that in terms of sociolinguistic variability all features under investigation may be divided into three categories: (1) the diagnostic features, which display statistically significant differences between certain social groups; (2) the linguistic features, which could be considered common for all sociolects and referring to some permanent, universal properties of everyday language; and (3) the potentially diagnostic features, which have shown some quantitative differences between the considered groups, but the extent of these differences does not allow us to regard them as statistically significant at the moment. The last group of features is the most extensive and requires additional studies on a larger amount of speech data.

Natalia Bogdanova-Beglarian, Tatiana Sherstinova, Olga Blinova, Gregory Martynenko
Medical Speech Recognition: Reaching Parity with Humans

We present a speech recognition system for the medical domain whose architecture is based on a state-of-the-art stack trained on over 270 h of medical speech data and 30 million tokens of text from clinical episodes. Despite the acoustic challenges and linguistic complexity of the domain, we were able to reduce the system’s word error rate to below 16% in a realistic clinical use case. To further benchmark our system, we determined the human word error rate on a corpus covering a wide variety of speakers, working with multiple medical transcriptionists, and found that our speech recognition system performs on a par with humans.

Erik Edwards, Wael Salloum, Greg P. Finley, James Fone, Greg Cardiff, Mark Miller, David Suendermann-Oeft
Microphone Array Post-filter in Frequency Domain for Speech Recognition Using Short-Time Log-Spectral Amplitude Estimator and Spectral Harmonic/Noise Classifier

We propose a novel, computationally efficient, real-time microphone array speech enhancement post-filter with a small delay that takes into account features of the speech signal and of recognition algorithms. The algorithm is efficient for small microphone arrays. The filter is based on applying a binary classification model to the Log Short-Term Spectral Amplitude (Log-STSA). The proposed algorithm allows substantial improvement of recognition accuracy with a minor increase in complexity compared to the Wiener post-filter, and lower complexity compared to existing voice-model-based approaches. Objective tests using a dual microphone array, the ETSI binaural noise database, the TIDIGITS database, and the CMU Sphinx 4 speech recognizer demonstrate an overall 41% error rate reduction for SNRs from 15 dB to 0 dB. Subjective evaluation also demonstrates substantial noise reduction and intelligibility improvement without the musical noise artifacts common for Wiener and Spectral Subtraction based methods. Testing with the SiSEC10 four-microphone linear equispaced array database shows that recognition accuracy improves with an increased array baseline and/or number of microphones.

Sergey Salishev, Ilya Klotchkov, Andrey Barabanov
Multimodal Keyword Search for Multilingual and Mixlingual Speech Corpus

A novel framework for searching keywords in multilingual and mixlingual speech corpora is proposed. This framework is capable of searching spoken as well as text queries. The capability of spoken search enables it to search out-of-vocabulary (OOV) words. The capability of searching text queries enables it to perform semantic search. An advanced application of searching keyword translations in a mixlingual speech corpus is also possible within the posteriorgram framework with this system. It is shown that the performance of text queries is comparable to or better than the performance of spoken queries if the language of the keyword is included in the training languages. Also, a technique for combining information from text and spoken queries is proposed which further enhances the search performance. The system is based on multiple posteriorgrams over articulatory classes trained with multiple languages.

Abhimanyu Popli, Arun Kumar
Neural Network Doc2vec in Automated Sentiment Analysis for Short Informal Texts

The article covers approaches to the task of automated sentiment analysis. Within the supervised learning paradigm, a new program was created with the help of Doc2vec, a module of Gensim, one of Python's libraries. The program specializes in short informal texts from the ecology domain which form parts of macropolylogues in social network discourse.

Natalia Maslova, Vsevolod Potapov
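
Since the abstract names Gensim's Doc2vec explicitly, the following sketch shows one plausible way to combine it with a simple classifier for short posts; the toy texts, labels and parameters are illustrative assumptions, not the authors' setup.

```python
# Minimal sketch: Doc2vec paragraph vectors fed to a logistic regression classifier.
from gensim.models.doc2vec import Doc2Vec, TaggedDocument
from sklearn.linear_model import LogisticRegression

posts = [("recycling in our city finally works", "pos"),
         ("the river is full of plastic again", "neg"),
         ("new park trees planted this spring", "pos"),
         ("air quality today is awful", "neg")]

docs = [TaggedDocument(words=text.split(), tags=[i])
        for i, (text, _) in enumerate(posts)]

model = Doc2Vec(vector_size=50, min_count=1, epochs=40)
model.build_vocab(docs)
model.train(docs, total_examples=model.corpus_count, epochs=model.epochs)

X = [model.infer_vector(text.split()) for text, _ in posts]
y = [label for _, label in posts]

clf = LogisticRegression().fit(X, y)
print(clf.predict([model.infer_vector("plastic in the park again".split())]))
```
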
Neural Network Speaker Descriptor in Speaker Diarization of Telephone Speech

In this paper, we investigate an approach to speaker representation for a diarization system that clusters short telephone conversation segments (produced by the same speaker). The proposed approach applies a neural-network-based descriptor that replaces the usual i-vector descriptor in state-of-the-art diarization systems. The comparison of these two techniques was done on the English part of the CallHome corpus. The final results indicate the superiority of the i-vector approach, although our proposed descriptor contributes additional information. Thus, the combined descriptor represents a speaker in a segment for diarization purposes with a lower diarization error (almost 20% relative improvement compared with the i-vector alone).

Zbyněk Zajíc, Jan Zelinka, Luděk Müller
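
As a hedged illustration of the clustering stage such a diarization system relies on (the descriptor extraction itself is not reproduced), the snippet below groups per-segment embeddings by agglomerative clustering on cosine distance; the random vectors and the 0.7 threshold are placeholders.

```python
# Minimal sketch: agglomerative clustering of combined segment descriptors.
import numpy as np
from scipy.spatial.distance import pdist
from scipy.cluster.hierarchy import linkage, fcluster

rng = np.random.default_rng(0)
ivectors = rng.normal(size=(20, 100))       # one 100-dim vector per segment
nn_descriptors = rng.normal(size=(20, 64))  # hypothetical NN descriptor

# Combined descriptor: simple concatenation of the two representations.
segments = np.hstack([ivectors, nn_descriptors])

dist = pdist(segments, metric="cosine")               # pairwise cosine distances
tree = linkage(dist, method="average")                # average-linkage clustering
labels = fcluster(tree, t=0.7, criterion="distance")  # cut at a distance threshold
print("speaker label per segment:", labels)
```
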
Novel Linear Prediction Temporal Phase Based Features for Speaker Recognition

This paper proposes novel features based on linear prediction of temporal phase (LPTP) for the speaker recognition task. The proposed LPTP feature vector consists of Discrete Cosine Transform (DCT) coefficients (for energy compaction and decorrelation) of the LP spectrum derived from the temporal phase of the speech signal. Results are reported on the standard NIST 2002 SRE using the GMM-UBM (Gaussian Mixture Model-Universal Background Model) approach. A recently proposed supervised score-level fusion method is used for combining evidence from Mel Frequency Cepstral Coefficients (MFCC) and the proposed feature set. The performance of the proposed feature set is compared with state-of-the-art MFCC features. The results show that the proposed features give a 4% improvement in identification rate and a 2% reduction in EER compared with standard MFCC alone. In addition, when the supervised score-level fusion is used, the identification rate improves by 8% and the EER decreases by 2%, indicating that the proposed features capture information complementary to MFCC.

Ami Gandhi, Hemant A. Patil
Novel Phase Encoded Mel Cepstral Features for Speaker Verification

In this paper, we propose novel phase encoded Mel cepstral coefficients (PEMCC) features for the Automatic Speaker Verification (ASV) task. This is motivated by a recently proposed phase encoding scheme that uses the causal delta dominance (CDD) condition. In particular, using the CDD scheme we obtained on average an 80% reduction in log-spectral distortion (LSD) of the reconstruction error compared to the magnitude spectrum counterpart. This result indicates that the phase-encoded magnitude spectrum has better reconstruction capability. The experiments with the proposed PEMCC features are carried out on the standard, statistically meaningful NIST 2002 SRE database, and the performance is compared with baseline MFCC features. Furthermore, score-level fusion of MFCC+PEMCC features gave better results for the GMM-UBM-based system, the i-vector probabilistic linear discriminant analysis (PLDA)-based system and the i-vector Cosine Distance Scoring (CDS)-based system than MFCC or PEMCC features alone. This illustrates that the proposed PEMCC features capture complementary speaker-specific information.

Apeksha J. Naik, Rishabh Tak, Hemant A. Patil
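
Score-level fusion as mentioned in the last two abstracts can be sketched as a weighted sum of normalised per-trial scores; the snippet below is a generic illustration with toy numbers, not the specific supervised fusion method used by the authors.

```python
# Minimal sketch: weighted score-level fusion of two verification systems.
import numpy as np

def fuse_scores(scores_a, scores_b, alpha=0.5):
    """Weighted sum fusion after z-normalising each score stream."""
    a = (scores_a - scores_a.mean()) / scores_a.std()
    b = (scores_b - scores_b.mean()) / scores_b.std()
    return alpha * a + (1.0 - alpha) * b

mfcc_scores = np.array([2.1, -0.3, 1.7, -1.2])   # toy per-trial scores
pemcc_scores = np.array([1.8, 0.1, 2.0, -0.9])
print(fuse_scores(mfcc_scores, pemcc_scores, alpha=0.6))
```

The weight alpha would normally be tuned on a development set rather than fixed by hand.
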
On a Way to the Computer Aided Speech Intonation Training

Presented in the paper is a software system designed to train learners in producing a variety of recurring intonation patterns of speech. The system is based on comparing the melodic (tonal) portraits of a reference phrase and a phrase spoken by the learner and involves active learner-system interaction. Since parametric representation of intonation features of the speech signal faces fundamental difficulties, the paper intends to show how these difficulties can be overcome. The main algorithms used in the training system proposed for analyzing and comparing intonation features are considered. A set of reference sentences is given which represents the basic intonation patterns of English speech and their main varieties. The system’s interface is presented and the results of the system operation are illustrated.

Boris Lobanov, Yelena Karnevskaya, Vladimir Zhitko
On Residual CNN in Text-Dependent Speaker Verification Task

Deep learning approaches are still not very common in the speaker verification field. We investigate the possibility of using a deep residual convolutional neural network with spectrograms as input features in the text-dependent speaker verification task. Despite the fact that we were not able to surpass the baseline system in quality, we achieved quite good results for such a new approach, obtaining a 5.23% EER on the RSR2015 evaluation part. Fusion of the baseline and proposed systems outperformed the best individual system by 18% relative.

Egor Malykh, Sergey Novoselov, Oleg Kudashev
Perception and Acoustic Features of Speech of Children with Autism Spectrum Disorders

The goal of our study is to reveal verbal and non-verbal information in the speech features of children with autism spectrum disorders (ASD). 30 children with ASD aged 5–14 years and 160 typically developing (TD) coevals were participants in the study. ASD participants were divided into groups according to the presence of development reversals (ASD-1) and developmental risk diagnosed at birth (ASD-2). The listeners (n = 220 adults) recognized the meaning of words, the correspondence of repeated words' meaning and intonation contour to the sample, and the age and gender of ASD children from their speech with lower probability than for TD children. The perception data are confirmed by acoustic features. We found significant differences in pitch values, vowel formant frequencies and energy between the ASD groups, and between ASD and TD children, in spontaneous speech and repeated words. Pitch values of stressed vowels were significantly higher in spontaneous speech vs. repeated words for ASD-1, ASD-2, and TD children aged 7–12 years. Pitch values in the spontaneous speech of the ASD-1 children were higher than in the ASD-2 children. The coarticulation effect was shown for ASD and TD repeated words. The age dynamics of the ASD children's acoustic features indicated the mastering of clear articulation.

Elena Lyakso, Olga Frolova, Aleksey Grigorev
Phase Analysis and Labeling Strategies in a CNN-Based Speaker Change Detection System

In this paper we analyze different labeling strategies and their impact on speaker change detection rates. We explore binary, linear fuzzy, quadratic and Gaussian labeling functions. We come to the conclusion that the labeling function is very important and that the linear variant outperforms the rest. We also add phase information from the spectrum to the input of our convolutional neural network. Experiments show that even though the phase is informative, its benefit is negligible and it may be omitted. In the experiments we use a coverage-purity measure which is independent of tolerance parameters.

Marek Hrúz, Petr Salajka
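
The four labeling strategies named in the abstract can be written down directly as target functions around a known change point; the widths and sigma below are illustrative values only, not the authors' settings.

```python
# Minimal sketch: labeling functions that assign a target value to each frame
# as a function of its distance from a known speaker change point.
import numpy as np

def binary_label(t, t_change, width):
    return (np.abs(t - t_change) <= width).astype(float)

def linear_fuzzy_label(t, t_change, width):
    return np.clip(1.0 - np.abs(t - t_change) / width, 0.0, 1.0)

def quadratic_label(t, t_change, width):
    return np.clip(1.0 - ((t - t_change) / width) ** 2, 0.0, 1.0)

def gaussian_label(t, t_change, sigma):
    return np.exp(-0.5 * ((t - t_change) / sigma) ** 2)

frames = np.arange(0, 100)            # frame indices of an utterance
print(linear_fuzzy_label(frames, t_change=50, width=10)[45:56])
```
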
Preparing Audio Recordings of Everyday Speech for Prosody Research: The Case of the ORD Corpus

Studying prosody is important for understanding many linguistic, pragmatic, and discourse phenomena, as well as for solving many applied tasks (in particular, in speech technologies). The prosody of everyday speech is extremely diverse, demonstrating high interpersonal and intrapersonal variation. Furthermore, natural everyday speech produces a multitude of effects which are hardly possible to obtain in speech laboratories. Because of this, it is very important to create resources containing representative collections of everyday speech data. The ORD corpus is a large resource aimed at studying everyday Russian speech. The paper describes the main stages of speech processing in the ORD corpus, from the segmentation of original files into macroepisodes up to the compilation of prosody information into the database. This prosody database will be further used for building empirical prosody models.

Tatiana Sherstinova
Recognizing Emotionally Coloured Dialogue Speech Using Speaker-Adapted DNN-CNN Bottleneck Features

Emotionally coloured speech recognition is a key technology toward achieving human-like spoken dialog systems. However, despite rapid progress in automatic speech recognition (ASR) and emotion research, much less work has examined ASR systems that recognize the verbal content of emotionally coloured speech. Approaches that exist in emotional speech recognition mostly involve adapting standard ASR models to include information about prosody and emotion. In this study, instead of adapting a model to handle emotional speech, we focus on feature transformation methods to solve the mismatch and improve the ASR performance. In this way, we can train the model with emotionally coloured speech without any explicit emotional annotation. We investigate the use of two different deep bottleneck network structures: deep neural networks (DNNs) and convolutional neural networks (CNNs). We hypothesize that the trained bottleneck features may be able to extract essential information that represents the verbal content while abstracting away from superficial differences caused by emotional variance. We also try various combinations of these two bottleneck features with feature-space speaker adaptation. Experiments using Japanese and English emotional speech data reveal that both varieties of bottleneck features and feature-space speaker adaptation successfully improve the emotional speech recognition performance.

Kohei Mukaihara, Sakriani Sakti, Satoshi Nakamura
Relationship Between Perception of Cuteness in Female Voices and Their Durations

The cuteness of female voices is especially important in Japanese pop culture. We investigate the relationship between the perception of cuteness in female voices and the voices' duration. Our hypothesis is that the perception of cuteness becomes more ambiguous and unstable as the duration becomes shorter. To confirm this hypothesis, we conducted listening tests in which participants listened to female voices of various durations and rated their cuteness on a scale of 1 to 5. The results show: (1) the instability of cuteness perception increases as the duration becomes shorter, (2) for voices rated "4" or higher when presented in full, the ambiguity of cuteness perception increases (i.e., the ratings become close to "3 (indeterminable)") when presented in shortened form, and (3) for voices rated "2" or lower when presented in full, the ambiguity of cuteness perception does not increase even when the durations are short.

Ryohei Ohno, Masanori Morise, Tetsuro Kitahara
Retaining Expression on De-identified Faces

The extensive use of video surveillance along with advances in face recognition has ignited concerns about the privacy of people identifiable in the recorded documents. A face de-identification algorithm, named k-Same, has been proposed by prior research and guarantees to thwart face recognition software. However, like many previous attempts at face de-identification, k-Same fails to preserve data utility, such as the gender and expression of the original faces. To overcome this, a new algorithm is proposed here to preserve data utility as well as protect privacy. In terms of utility preservation, this new algorithm is capable of preserving not only the category of the facial expression (e.g., happy or sad) but also the intensity of the expression. This new algorithm for face de-identification possesses great potential, especially with real-world images and videos, as each facial expression in real life is a continuous motion consisting of images of the same expression at various degrees of intensity.

Li Meng, Aruna Shenoy
Semi-automatic Facial Key-Point Dataset Creation

This paper presents a semi-automatic method for creating a large scale facial key-point dataset from a small number of annotated images. The method consists of annotating the facial images by hand, training Active Appearance Model (AAM) from the annotated images and then using the AAM to annotate a large number of additional images for the purpose of training a neural network. The images from the AAM are then re-annotated by the neural network and used to validate the precision of the proposed neural network detections. The neural network architecture is presented including the training parameters.

Miroslav Hlaváč, Ivan Gruber, Miloš Železný, Alexey Karpov
Song Emotion Recognition Using Music Genre Information

Music Emotion Recognition (MER) is an important topic in music understanding, recommendation and retrieval that has gained great attention in recent years due to the constantly increasing number of people accessing digital musical content. In this paper we propose a new song emotion recognition system that takes into consideration the song's genre, and we investigate the effect that genre has on the recognition of four basic music emotions of the valence-arousal (VA) plane: happy, angry, sad and peaceful. Experiments on a database consisting of 1100 songs from four different music genres (blues, country, pop and rock), using timbral, spectral, dynamical and chroma descriptors of the music, have shown that successful recognition of the song's genre as a pre-processing step can improve the recognition of its emotion by 10–15%.

Athanasios Koutras
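
A hedged sketch of the genre-as-pre-processing idea follows: a first classifier predicts the genre and a per-genre classifier then predicts the emotion. The SVM choice, toy descriptors and class names are assumptions for illustration, not the authors' exact models.

```python
# Minimal sketch: two-stage genre-then-emotion classification.
import numpy as np
from sklearn.svm import SVC

class GenreAwareEmotionRecognizer:
    def __init__(self, genres):
        self.genre_clf = SVC()
        self.emotion_clfs = {g: SVC() for g in genres}

    def fit(self, X, genres, emotions):
        self.genre_clf.fit(X, genres)
        for g, clf in self.emotion_clfs.items():
            mask = np.asarray(genres) == g
            clf.fit(X[mask], np.asarray(emotions)[mask])
        return self

    def predict(self, X):
        pred_genres = self.genre_clf.predict(X)
        return [self.emotion_clfs[g].predict(x.reshape(1, -1))[0]
                for g, x in zip(pred_genres, X)]

# Toy usage with random "descriptors"; real inputs would be timbral,
# spectral, dynamical and chroma features per song.
rng = np.random.default_rng(0)
X = rng.normal(size=(40, 12))
genres = np.array(["blues", "country", "pop", "rock"] * 10)
emotions = np.array(["happy", "angry", "sad", "peaceful", "happy"] * 8)
model = GenreAwareEmotionRecognizer(["blues", "country", "pop", "rock"])
model.fit(X, genres, emotions)
print(model.predict(X[:3]))
```
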
Spanish Corpus for Sentiment Analysis Towards Brands

Posts published in social media are a good source of feedback to assess the impact of advertising campaigns. Whereas most published corpora of messages in the Sentiment Analysis domain tag posts with polarity labels, this paper presents a corpus in the Spanish language where tagging has been performed using 8 predefined emotions: love-hate, happiness-sadness, trust-fear, satisfaction-dissatisfaction. In every post, extracted from Twitter, sentiments have been annotated towards each specific brand under study. The corpus is published as a collection of RDF resources with links to external entities. A vocabulary describing this emotion classification, along with other relevant aspects of customers' opinions, is also provided.

María Navas-Loro, Víctor Rodríguez-Doncel, Idafen Santana-Perez, Alberto Sánchez
Speech Enhancement for Speaker Recognition Using Deep Recurrent Neural Networks

This paper describes a speech denoising system based on long short-term memory (LSTM) neural networks. The architecture of the presented network is designed to perform speech enhancement in the spectrogram magnitude domain. The audio resynthesis is performed via the inverse short-time Fourier transform while maintaining the original phase. Objective quality is assessed by the root mean square error between clean and denoised audio signals on the CHiME corpus, and by the speaker verification rate on the RSR2015 corpus. The proposed system demonstrates improved results on both metrics.

Maxim Tkachenko, Alexander Yamshinin, Nikolay Lyubimov, Mikhail Kotov, Marina Nastasenko
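
The resynthesis step described above (enhancement in the magnitude domain, inverse STFT with the original phase) can be sketched as follows; the identity "mask" stands in for the LSTM output, and the chirp signal is only a placeholder for noisy speech.

```python
# Minimal sketch: magnitude-domain processing and resynthesis with original phase.
import numpy as np
import librosa

sr = 16000
y = librosa.chirp(fmin=110, fmax=880, sr=sr, duration=1.0)  # stand-in for noisy speech

stft = librosa.stft(y, n_fft=512, hop_length=128)
magnitude, phase = np.abs(stft), np.angle(stft)

# Placeholder for the network: a real system would predict an enhanced
# magnitude (or a mask) from the noisy magnitude here.
enhanced_magnitude = magnitude

y_hat = librosa.istft(enhanced_magnitude * np.exp(1j * phase),
                      hop_length=128, length=len(y))
print(y_hat.shape)
```
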
Stance Classification in Texts from Blogs on the 2016 British Referendum

The problem of identifying and correctly attributing speaker stance in human communication is addressed in this paper. The data set consists of political blogs dealing with the 2016 British referendum. A cognitive-functional framework is adopted, with data annotated for six notional stance categories: contrariety, hypotheticality, necessity, prediction, source of knowledge, and uncertainty. We show that these categories can be implemented in a text classification task and automatically detected. To this end, we propose a large set of lexical and syntactic linguistic features. These features were tested and classification experiments were implemented using different algorithms. We achieved an accuracy of up to 30% for the six-class experiments, which is not fully satisfactory. As a second step, we calculated the pair-wise combinations of the stance categories. The contrariety and necessity binary classification achieved the best results, with up to 71% accuracy.

Vasiliki Simaki, Carita Paradis, Andreas Kerren
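
A minimal sketch of the pair-wise binary classification step is given below, with TF-IDF word n-grams standing in for the richer lexical and syntactic feature set; the texts, labels and the LinearSVC choice are illustrative assumptions.

```python
# Minimal sketch: pair-wise binary stance classification over toy sentences.
from itertools import combinations
from sklearn.pipeline import make_pipeline
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.svm import LinearSVC
from sklearn.model_selection import cross_val_score

texts = ["if we leave, trade will surely suffer",
         "we must take back control of our laws",
         "remain is clearly the safer option",
         "nobody knows what will actually happen"] * 5
labels = ["hypotheticality", "necessity", "prediction", "uncertainty"] * 5

for cat_a, cat_b in combinations(sorted(set(labels)), 2):
    X = [t for t, l in zip(texts, labels) if l in (cat_a, cat_b)]
    y = [l for l in labels if l in (cat_a, cat_b)]
    clf = make_pipeline(TfidfVectorizer(ngram_range=(1, 2)), LinearSVC())
    score = cross_val_score(clf, X, y, cv=2).mean()
    print(f"{cat_a} vs {cat_b}: accuracy {score:.2f}")
```
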
The “Retrospective Commenting” Method for Longitudinal Recordings of Everyday Speech

The paper describes a pilot experiment aimed at revealing occurrences of miscommunication between interlocutors in everyday speech recordings. Here, miscommunication is understood as situations in which the recipient perceives the meaning of the message in a different way from what was intended by the speaker. The experiment was based on the methodology of longitudinal recordings taken during one day, following the approach used for gathering audio data for the ORD speech corpus. In addition, it was enhanced by having the respondent listen to the whole recording afterwards and simultaneously comment on aspects of the communicative settings that are not observable from the recording itself. The task of the respondent was to note all occurrences of miscommunication, to explain to the researcher all unclear moments of interaction, to help in interpreting the emotional state of the interlocutors, and to give some hints on pragmatic purposes, revealing those aspects of spoken interaction that are usually hidden behind the evident facts. The results of the experiment showed that miscommunication is indeed a rather frequent phenomenon in everyday face-to-face interaction. Moreover, the retrospective commenting method could significantly broaden the opportunities for discourse and pragmatic research based on long-term recordings.

Arto Mustajoki, Tatiana Sherstinova
The 2016 RWTH Keyword Search System for Low-Resource Languages

In this paper we describe the RWTH Aachen keyword search (KWS) system developed in the course of the IARPA Babel program. We put the focus on acoustic modeling with neural networks and evaluate the full pipeline with respect to KWS performance. At the core of this study lie multilingual bottleneck features extracted from a deep neural network trained on all 28 languages available to the project participants. We show that in a low-resource scenario, the multilingual features are crucial for achieving state-of-the-art performance. Further highlights of this work include comparisons of tandem and hybrid acoustic models based on feed-forward and recurrent neural networks, keyword search pipelines based on lattice and time-marked word list representations, and measuring the effect of adding large amounts of text data scraped from the web. The evaluation is performed on multiple languages of the last two project periods.

Pavel Golik, Zoltán Tüske, Kazuki Irie, Eugen Beck, Ralf Schlüter, Hermann Ney
The Effect of Morphological Factors on Sentence Boundaries in Russian Spontaneous Speech

The paper evaluates the contribution of morphological factors to the probability of sentence boundaries in Russian unscripted monologue. The analysis is based on multiple expert manual annotations of unscripted speech, which allow us to obtain fine-grained estimates of the probability of a sentence boundary at each word junction. We used linear regression analysis to explore whether there is a relationship between sentence boundaries marked by the annotators and the grammatical features of the text. We focused on morphological factors related to the presence or absence of sentence boundaries.

Anton Stepikhov, Anastassia Loukina
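
One plausible way to set up such a regression is sketched below: categorical morphological features of the words around each junction are one-hot encoded and regressed against the share of annotators who marked a boundary there. The feature names and values are hypothetical, not taken from the paper.

```python
# Minimal sketch: regressing annotator boundary probability on morphological features.
import pandas as pd
from sklearn.compose import make_column_transformer
from sklearn.preprocessing import OneHotEncoder
from sklearn.linear_model import LinearRegression
from sklearn.pipeline import make_pipeline

junctions = pd.DataFrame({
    "pos_before": ["VERB", "NOUN", "VERB", "CONJ", "NOUN", "VERB"],
    "pos_after":  ["CONJ", "VERB", "NOUN", "NOUN", "CONJ", "NOUN"],
    "boundary_prob": [0.9, 0.1, 0.6, 0.2, 0.7, 0.5],   # share of annotators
})

features = junctions[["pos_before", "pos_after"]]
model = make_pipeline(
    make_column_transformer((OneHotEncoder(handle_unknown="ignore"),
                             ["pos_before", "pos_after"])),
    LinearRegression(),
)
model.fit(features, junctions["boundary_prob"])
print("R^2 on training data:", model.score(features, junctions["boundary_prob"]))
```
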
The Pausing Method Based on Brown Clustering and Word Embedding

One of the most important parts of synthesizing natural speech is correct pause placement. Properly placed pauses in speech affect the perception of information. In this article, we consider a method for predicting pause positions for speech synthesis. For this purpose, two speech corpora in the Kazakh language were prepared. The input parameters were vector representations of words obtained from the Brown cluster model and from the canonical correlation analysis algorithm. A support vector machine was used to predict pauses within sentences. Our results show an F1 score of 0.781 for pause prediction.

Arman Kaliyev, Sergey V. Rybin, Yuri Matveev
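
A hedged sketch of the prediction stage follows: the vector representations of the words around each candidate pause slot are concatenated and fed to an SVM. The toy tokens, the random embedding table and the RBF kernel are assumptions, not the authors' trained Kazakh vectors.

```python
# Minimal sketch: SVM pause prediction from concatenated word vectors.
import numpy as np
from sklearn.svm import SVC

def junction_features(emb, left_word, right_word, dim=100):
    """Concatenate the embeddings of the words around a candidate pause slot."""
    def vec(w):
        return emb[w] if w in emb else np.zeros(dim)
    return np.concatenate([vec(left_word), vec(right_word)])

# Toy embedding table and training data (label 1 = pause after the left word).
rng = np.random.default_rng(0)
emb = {w: rng.normal(size=100) for w in ["bala", "mektepke", "keldi", "zhane"]}
pairs = [("keldi", "zhane"), ("bala", "mektepke"),
         ("mektepke", "keldi"), ("zhane", "bala")]
labels = [1, 0, 0, 1]

X = np.array([junction_features(emb, a, b) for a, b in pairs])
clf = SVC(kernel="rbf").fit(X, labels)
print(clf.predict([junction_features(emb, "keldi", "zhane")]))
```
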
Unsupervised Document Classification and Topic Detection

This article presents a method for pre-processing the feature vectors representing text documents that are subsequently classified using unsupervised methods. The main goal is to show that state-of-the-art classification methods can be improved by a certain data preparation process. The first method is standard K-means clustering and the second is the Latent Dirichlet Allocation (LDA) method. Both are widely used in text processing. The mentioned algorithms are applied to two data sets in two different languages. The first of them, 20NewsGroups, is a widely used benchmark for the classification of English documents. The second set was selected from a large body of Czech news articles and was used mainly to compare the performance of the tested methods for a less frequently studied language. Furthermore, the unsupervised methods are also compared with supervised ones in order to (in some sense) ascertain the upper bound of the task.

Jaromír Novotný, Pavel Ircing
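
For orientation, the two baseline methods named above can be run on a toy corpus as below; the documents, vectorizer settings and the choice of two clusters/topics are illustrative only and say nothing about the authors' pre-processing.

```python
# Minimal sketch: K-means on TF-IDF vectors and LDA on term counts.
from sklearn.feature_extraction.text import TfidfVectorizer, CountVectorizer
from sklearn.cluster import KMeans
from sklearn.decomposition import LatentDirichletAllocation

docs = ["the match ended with a late goal",
        "the striker scored twice in the final",
        "parliament passed the new budget law",
        "the minister defended the tax reform"]

# K-means clustering of TF-IDF document vectors.
tfidf = TfidfVectorizer().fit_transform(docs)
km_labels = KMeans(n_clusters=2, n_init=10, random_state=0).fit_predict(tfidf)

# LDA topic assignment from raw term counts.
counts = CountVectorizer().fit_transform(docs)
lda = LatentDirichletAllocation(n_components=2, random_state=0).fit(counts)
lda_labels = lda.transform(counts).argmax(axis=1)

print("k-means clusters:", km_labels)
print("LDA topics:      ", lda_labels)
```
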
Using a High-Speed Video Camera for Robust Audio-Visual Speech Recognition in Acoustically Noisy Conditions

The purpose of this study is to develop a robust audio-visual speech recognition system and to investigate the influence of high-speed video data on the recognition accuracy of continuous Russian speech under different noisy conditions. The developed experimental setup and the collected multimodal database allow us to explore the impact of high-speed video recordings with various frame rates, starting from the standard 25 frames per second (fps) up to a high-speed 200 fps. At the moment there is no research objectively reflecting the dependence of speech recognition accuracy on the video frame rate. There are also no relevant audio-visual databases for model training. In this paper, we try to fill this gap for continuous Russian speech. Our evaluation experiments show an increase in absolute recognition accuracy of up to 3% and prove that using the high-speed JAI Pulnix camera at 200 fps allows better recognition results to be achieved under different acoustically noisy conditions.

Denis Ivanko, Alexey Karpov, Dmitry Ryumin, Irina Kipyatkova, Anton Saveliev, Victor Budkov, Dmitriy Ivanko, Miloš Železný
Utilizing Lipreading in Large Vocabulary Continuous Speech Recognition

The vast majority of current research in the area of audiovisual speech recognition via lipreading from frontal face videos focuses on simple cases such as isolated phrase recognition or structured speech, where the vocabulary is limited to several tens of units. In this paper, we diverge from these traditional applications and investigate the effect of incorporating visual information in the task of continuous speech recognition with vocabulary sizes ranging from several hundred to half a million words. To this end, we evaluate various visual speech parametrizations, both existing and novel, that are designed to capture different kinds of information in the video signal. The experiments are conducted on a moderately sized dataset of 54 speakers, each uttering 100 sentences in the Czech language. We show that even for large vocabularies the visual signal contains enough information to improve the word accuracy by up to 15% relative to acoustic-only recognition.

Karel Paleček
Vocal Emotion Conversion Using WSOLA and Linear Prediction

The paper deals with speech emotion conversion using Waveform Similarity Overlap-Add (WSOLA) and subsequent linear prediction analysis for spectral transformation. Duration modification is done by taking the ratio between segment durations of the neutral and target speech. After modification using WSOLA, the duration-modified source speech is time-aligned with the target and further subjected to linear prediction analysis to yield the LP coefficients. The target emotion is re-synthesised by using the prosody-manipulated residual and the LP coefficients from the source. The waveform similarity property of WSOLA is exploited to give output with minimal distortion. The proposed algorithm is evaluated subjectively and objectively alongside the popular TD-PSOLA algorithm. The correlation between the synthesised and real target shows an average improvement of 60% across all emotions with the proposed technique.

Susmitha Vekkot, Shikha Tripathi
Voice Conversion for TTS Systems with Tuning on the Target Speaker Based on GMM

The paper is devoted to improving voice conversion (VC) methods for developing text-to-speech synthesis systems with the capability of tuning to a target speaker. Such a system, with a VC module in the acoustic processor and a parametric representation of the speech database for concatenative synthesis based on an instantaneous harmonic representation, is presented in the paper. Voice conversion is based on a multiple-regression mapping function and a Gaussian mixture model (GMM); the text-independent learning method is based on hidden Markov models and a modified Viterbi algorithm. An experimental evaluation of the proposed solutions in terms of naturalness and similarity is presented as well.

Vadim Zahariev, Elias Azarov, Alexander Petrovsky
VoiScan: Telephone Voice Analysis for Health and Biometric Applications

The telephone, whether mobile, landline, or VoIP, is probably the most widely used form of long-distance communication. The most common use of voice biometrics is in telephone-based speaker verification, so the ability to operate effectively over the telephone is crucial. Similarly, access to vocal health monitoring, and other voice analysis technology, would benefit enormously if it were available over the telephone, via an automatic system. This paper describes a set of voice analysis algorithms, designed to be robust against the kinds of distortion and signal degradation encountered in modern telephone communication. The basis of the algorithms in traditional analysis is discussed, as are the design choices made in order to ensure robustness. The utility of these algorithms is demonstrated in a number of target domains.

Ladan Baghai-Ravary, Steve W. Beet
Web Queries Classification Based on the Syntactical Patterns of Search Types

Nowadays, people make frequent use of search engines in order to find the information they need on the web. The abundance of available data has made the process of obtaining relevant information challenging in terms of processing and analysis. A broad range of web query classification techniques have been proposed with the aim of helping to understand the actual intent behind a web search. In this research, we have categorized search queries by introducing Search Type Syntactical Patterns for automatically identifying and classifying search engine user queries. Experiments show that our approach has a good level of accuracy in identifying different search types.

Alaa Mohasseb, Mohamed Bader-El-Den, Andreas Kanavos, Mihaela Cocea
What Speech Recognition Accuracy is Needed for Video Transcripts to be a Useful Search Interface?

Informative videos (e.g. recorded lectures) are increasingly being made available online, but they are difficult to use, browse and search. Nowadays, popular platforms let users search and navigate videos via a transcript, which, in order to guarantee a satisfactory level of word accuracy, has typically been generated using some manual input. The goal of our work is to take a step closer to the fully automatic generation of informative video transcripts based on current automatic speech recognition technology. We present a user study designed to better understand viewers' use of video transcripts for searching video content, with the aim of estimating what minimum word recognition accuracy is needed for video captions to be a useful search interface. We found that transcripts with 70% word recognition accuracy are as effective as 100% accuracy transcripts in supporting video search when using single-word search. We also found that there are large variations in the time it takes to search a video, independently of the quality of the transcript. With adequate and adapted search strategies, even low-accuracy transcripts can support quick video search.

Yang Chao, Marie-Luce Bourguet
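
The word recognition accuracy thresholds discussed above are defined through the standard edit-distance word error rate; a self-contained sketch of that computation is given below, with made-up reference and hypothesis strings.

```python
# Minimal sketch: word error rate via dynamic-programming edit distance,
# with word accuracy reported as 1 - WER.
def word_error_rate(reference, hypothesis):
    ref, hyp = reference.split(), hypothesis.split()
    d = [[0] * (len(hyp) + 1) for _ in range(len(ref) + 1)]
    for i in range(len(ref) + 1):
        d[i][0] = i
    for j in range(len(hyp) + 1):
        d[0][j] = j
    for i in range(1, len(ref) + 1):
        for j in range(1, len(hyp) + 1):
            cost = 0 if ref[i - 1] == hyp[j - 1] else 1
            d[i][j] = min(d[i - 1][j] + 1,        # deletion
                          d[i][j - 1] + 1,        # insertion
                          d[i - 1][j - 1] + cost) # substitution
    return d[len(ref)][len(hyp)] / len(ref)

ref = "the lecture covers hidden markov models in detail"
hyp = "the lecture covers in markov models in detail"
print("word accuracy: {:.0%}".format(1 - word_error_rate(ref, hyp)))
```
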
Backmatter
Metadata
Title: Speech and Computer
Editors: Alexey Karpov, Rodmonga Potapova, Iosif Mporas
Copyright Year: 2017
Electronic ISBN: 978-3-319-66429-3
Print ISBN: 978-3-319-66428-6
DOI: https://doi.org/10.1007/978-3-319-66429-3
