2016 | Book

Speech and Computer

18th International Conference, SPECOM 2016, Budapest, Hungary, August 23-27, 2016, Proceedings

About this book

This book constitutes the proceedings of the 18th International Conference on Speech and Computer, SPECOM 2016, held in Budapest, Hungary, in August 2016. The 85 papers presented in this volume were carefully reviewed and selected from 154 submissions.

Table of Contents

Frontmatter

Invited Talks

Frontmatter
Automatic Speech Recognition Based on Neural Networks

In automatic speech recognition, as in many areas of machine learning, stochastic modeling relies on neural networks more and more. Both in acoustic and language modeling, neural networks today mark the state of the art for large vocabulary continuous speech recognition, providing huge improvements over former approaches that were solely based on Gaussian mixture hidden Markov models and count-based language models. We give an overview of current activities in neural network based modeling for automatic speech recognition. This includes discussions of network topologies and cell types, training and optimization, choice of input features, adaptation and normalization, multitask training, as well as neural network based language modeling. Despite the clear progress obtained with neural network modeling in speech recognition, much remains to be done to arrive at a consistent and self-contained neural network based modeling approach that ties in with the former state of the art. We conclude with a discussion of open problems as well as potential future directions with respect to the integration of neural networks into automatic speech recognition systems.

Ralf Schlüter, Patrick Doetsch, Pavel Golik, Markus Kitza, Tobias Menne, Kazuki Irie, Zoltán Tüske, Albert Zeyer
Machine Processing of Dialogue States; Speculations on Conversational Entropy

This keynote talk presents some ideas about ‘conversational’ speaking machines, illustrated with examples from the Herme dialogues. Herme was a small device that initiated conversations with passers-by in the Science Gallery at Trinity College Dublin and managed to engage the majority in short conversations lasting approximately three minutes. No speech recognition was employed. Experience from that data collection and analyses of human-human conversational interactions have led us to consider a theory of Conversational Entropy wherein tight couplings become looser through time as topics decay and are refreshed by speaker changes and conversational restarts. Laughter is a particular cue to this decay mechanism and might prove to be sufficient information for machines to intrude into human conversations without causing offence.

Nick Campbell
Speech Recognition Challenges in the Car Navigation Industry

Until a few decades ago, machines talking and understanding human speech were only the subject of science fiction. Nowadays, Text to Speech (TTS) and Automatic Speech Recognition (ASR) have become reality, but they are still considered a novelty. Automotive infotainment is a selling point for car manufacturers, a symbol of being hi-tech, and car commercials often feature the display of the head unit for a few seconds. As avoiding driver distraction has grown into a major design aspect, ASR is becoming trendy and almost compulsory. But let us see how far we have gotten. In the first part, this talk will summarize the most popular speech features in today’s car navigation systems, and will look into the underlying technology, solutions and limitations widely applied in the industry. We will mention typical context designs, dialogue systems and address search, and we will show how the common technology leads to typical HMI solutions. We will point out the possibilities and limitations of on-board and server-based recognition, and consider why we need to resort to exclusively offline solutions for a while in this industry. At this point we will have an overview of the ingredients, so the talk will focus on problematic and sub-optimal ASR features requested by automotive manufacturers, explaining why they negatively affect recognition accuracy. A workaround often leads to troublesome and seemingly unnecessary questions for the user, so compromises are not easy. In the last part, we will examine an address search scenario which is trivial for users and feasible with server-based ASR, but remains an open question when done offline.

Attila Vékony

Conference Papers

Frontmatter
A Comparison of Acoustic Features of Speech of Typically Developing Children and Children with Autism Spectrum Disorders

The goal of this study is to identify the acoustic features specific to the vocalizations and speech of children with ASD. Three types of experiments were conducted: emotional speech, spontaneous speech, and the repetition of words. Participants in the study were children with ASD (F84.0 according to ICD-10), aged 5–14 years (n = 25), and typically developing (TD) children aged 5–14 years (n = 60). We compare acoustic features that are widely used in speech recognition and speech perception: pitch values, maximum and minimum pitch values, pitch range, formant frequencies, energy and duration. Formant triangles were plotted for vowels, with apexes corresponding to the vowels [a], [u], and [i] in F1, F2 coordinates, and their areas were compared. For all children with ASD, voice and speech are characterized by high pitch values, an abnormal spectrum, and well-marked high-frequency content. Stressed vowels from the words of children (TD & ASD) spoken in discomfort have higher values of pitch and of the third (‘emotional’) formant than those spoken in a comfortable condition. Children with ASD showed higher pitch values in spontaneous speech than in repeated speech. The current results are a first step toward developing speech-based biomarkers for the early diagnosis of ASD.

Elena Lyakso, Olga Frolova, Aleksey Grigorev
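
The formant triangle comparison described in this abstract reduces to computing the area spanned by the three corner vowels in the (F1, F2) plane. A minimal sketch using the shoelace formula; the vowel coordinates below are made-up placeholders, not values from the study:

```python
# Area of the [a]-[u]-[i] vowel triangle in the (F1, F2) plane,
# computed with the shoelace formula.

def triangle_area(p1, p2, p3):
    """Area spanned by three (F1, F2) points, in Hz^2."""
    (x1, y1), (x2, y2), (x3, y3) = p1, p2, p3
    return abs(x1 * (y2 - y3) + x2 * (y3 - y1) + x3 * (y1 - y2)) / 2.0

# Hypothetical (F1, F2) values in Hz for one speaker's corner vowels.
a, u, i = (850, 1610), (350, 900), (300, 2300)
print(f"Vowel triangle area: {triangle_area(a, u, i):.0f} Hz^2")
```
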
A Deep Neural Networks (DNN) Based Models for a Computer Aided Pronunciation Learning System

Gaussian Mixture Models (GMMs) have been the most commonly used models in pronunciation verification systems. The recently introduced Deep Neural Networks (DNNs) have proved to provide significantly better discriminative models of the acoustic space. In this paper, we introduce our efforts to upgrade the models of a Computer Aided Pronunciation Learning (CAPL) system that is used to teach Arabic pronunciation for Quran recitation rules. Four major enhancements were introduced. First, we used speaker adaptive training (SAT) to reduce inter-speaker variability. Second, we integrated hybrid DNN-HMM models to enhance the acoustic model and decrease the phone error rate. Third, we integrated Minimum Phone Error (MPE) training with the hybrid DNN. Finally, in the testing phase, we used a grammar-based decoding graph to limit the search space to the frequent error types. A comparison between the performance of the conventional GMM-HMM and the hybrid DNN-HMM was performed, with results showing significant performance improvements.

Mohamed S. Elaraby, Mustafa Abdallah, Sherif Abdou, Mohsen Rashwan
A Linguistic Interpretation of the Atom Decomposition of Fundamental Frequency Contour for American English

One of the most recently proposed techniques for modeling the prosody of an utterance is the decomposition of its pitch, duration and/or energy contour into physiologically motivated units called atoms, based on matching pursuit. Since this model is based on the physiology of the production of sentence intonation, it is essentially language independent. However, the intonation of an utterance in a particular language is obviously under the influence of factors of a predominantly linguistic nature. In this research, restricted to the case of American English with prosody annotated using standard ToBI conventions, we have shown that, under certain mild constraints, the positive and negative atoms identified in the pitch contour coincide very well with high and low pitch accents and phrase accents of ToBI. By giving a linguistic interpretation of the atom decomposition model, this research enables its practical use in domains such as speech synthesis or cross-lingual prosody transfer.

Tijana Delić, Branislav Gerazov, Branislav Popović, Milan Sečujski
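
The matching pursuit principle behind atom decomposition can be illustrated in a few lines: greedily pick the dictionary atom most correlated with the current residual and subtract it. This is a generic sketch, not the authors' exact algorithm; the gamma-shaped atom dictionary is assumed to be given as an array of unit-norm rows:

```python
import numpy as np

def matching_pursuit(f0_contour, dictionary, n_atoms=5):
    """Greedy matching pursuit decomposition of an F0 contour.

    dictionary: (n_candidates, n_samples) array of unit-norm atoms.
    Returns a list of (atom_index, amplitude) pairs and the residual.
    """
    residual = f0_contour.astype(float).copy()
    decomposition = []
    for _ in range(n_atoms):
        correlations = dictionary @ residual       # inner product with each atom
        k = int(np.argmax(np.abs(correlations)))   # best match, positive or negative
        amplitude = float(correlations[k])
        residual -= amplitude * dictionary[k]      # peel off the selected atom
        decomposition.append((k, amplitude))
    return decomposition, residual
```
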
A Phonetic Segmentation Procedure Based on Hidden Markov Models

In this paper, a novel variant of an automatic phonetic segmentation procedure is presented, especially useful if data is scarce. The procedure uses the Kaldi speech recognition toolkit as its basis, and combines and modifies several existing methods and Kaldi recipes. Both the specifics of model training and test data alignment are explained in detail. Effectiveness of artificial extension of the starting amount of manually labeled material during training is examined as well. Experimental results show the admirable overall correctness of the proposed procedure in the given test environment. Several variants of the procedure are compared, and the usage of speaker-adapted context-dependent triphone models trained without the expanded manually checked data is proven to produce the best results. A few ways to improve the procedure even more, as well as future work, are also discussed.

Edvin Pakoci, Branislav Popović, Nikša Jakovljević, Darko Pekar, Fathy Yassa
A Preliminary Exploration of Group Social Engagement Level Recognition in Multiparty Casual Conversation

Sensing human social engagement in dyadic or multiparty conversation is key to the design of decision strategies for conversational dialogue agents in various human-machine interaction scenarios. In this paper we report on studies we have carried out on the novel research topic of social group engagement in non-task-oriented (casual) multiparty conversations. Fusion of hand-crafted acoustic and visual cues was used to predict social group engagement levels and achieved better results than audio or visual cues used separately.

Yuyun Huang, Emer Gilmartin, Benjamin R. Cowan, Nick Campbell
An Agonist-Antagonist Pitch Production Model

Prosody is a phenomenon that is crucial for numerous fields of speech research, underscoring the importance of having a robust prosody model. The class of intonation models based on the physiology of pitch production is especially attractive for its inherent multilingual support. These models rely on an accurate model of muscle activation. Traditionally they have used the 2nd order spring-damper-mass (SDM) muscle model. However, recent research has shown that the SDM model is not sufficient for adequate modelling of muscle dynamics. The 3rd order Hill type model offers a more accurate representation of muscle dynamics, but it has been shown to be underdamped when using physiologically plausible muscle parameters. In this paper we propose an agonist-antagonist pitch production (A2P2) model that both validates and gives insight into the improved results of using higher-order critically damped system models in intonation modelling.

Branislav Gerazov, Philip N. Garner
An Algorithm for Phase Manipulation in a Speech Signal

While the human auditory system is predominantly sensitive to the amplitude spectrum of an incoming sound, a number of sound perception studies have shown that the phase spectrum is also perceptually relevant. In the case of speech, its relevance can be established through experiments with speech vocoding or parametric speech synthesis, where particular ways of manipulating the phase of voiced excitation (i.e. setting it to zero or random values) can be shown to affect voice quality. In such experiments the phase should be manipulated with as little distortion of the amplitude spectrum as possible, lest the degradation in voice quality perceived through listening tests, caused by the distortion of the amplitude spectrum, be incorrectly attributed to the influence of phase. The paper presents an algorithm for phase manipulation of a speech signal, based on inverse filtering, which introduces negligible distortion into the amplitude spectrum, and demonstrates its accuracy on a number of examples.

Darko Pekar, Siniša Suzić, Robert Mak, Meir Friedlander, Milan Sečujski
An Exploratory Study on Sociolinguistic Variation of Russian Everyday Speech

The research presented in this paper has been conducted in the framework of a large sociolinguistic project aimed at describing everyday spoken Russian and analyzing the special characteristics of its usage by different social groups of speakers. The research is based on the material of the ORD corpus containing long-term audio recordings of everyday communication. The aim of this exploratory study is to reveal the linguistic parameters in terms of which the difference in speech between social groups is most evident. An exploratory subcorpus, consisting of audio fragments of spoken communication of 12 respondents (6 men and 6 women, 4 representatives of each age group, and representatives of different professional and status groups) with a total duration of 106 min and similar communication settings, was created and fully annotated. A quantitative description of a number of linguistic parameters on the phonetic, lexical, morphological, and syntactic levels was made for each social group. The biggest differences between social groups were observed in speech rate, phonetic reduction, lexical preferences, and syntactic irregularities. The study has shown that the differences between age groups are more significant than those between gender groups, and that the speech of young people differs most strongly from the others.

Natalia Bogdanova-Beglarian, Tatiana Sherstinova, Olga Blinova, Gregory Martynenko
Adaptation of DNN Acoustic Models Using KL-divergence Regularization and Multi-task Training

The adaptation of context-dependent deep neural network acoustic models is particularly challenging, because most of the context-dependent targets will have no occurrences in a small adaptation data set. Recently, a multi-task training technique has been proposed that trains the network with context-dependent and context-independent targets in parallel. This network structure offers a straightforward way for network adaptation by training only the context-independent part during the adaptation process. Here, we combine this simple adaptation technique with the KL-divergence regularization method also proposed recently. Employing multi-task training we attain a relative word error rate reduction of about 3 % on a broadcast news recognition task. Then, by using the combined adaptation technique we report a further error rate reduction of 2 % to 5 %, depending on the duration of the adaptation data, which ranged from 20 to 100 s.

László Tóth, Gábor Gosztolya
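
The KL-divergence regularization mentioned above is usually implemented by interpolating the adaptation targets with the posteriors of the unadapted speaker-independent (SI) network, so that cross-entropy training against the interpolated targets is equivalent to adding a KL penalty towards the SI model. A minimal sketch; the weight rho is a free parameter:

```python
import numpy as np

def kl_regularized_targets(hard_targets, si_posteriors, rho=0.5):
    """Interpolated targets for KL-regularized DNN adaptation.

    hard_targets:  (frames, classes) one-hot targets from the adaptation data.
    si_posteriors: (frames, classes) posteriors of the speaker-independent DNN.
    rho:           regularization weight; rho = 0 recovers plain retraining,
                   larger rho keeps the adapted model closer to the SI model.
    """
    return (1.0 - rho) * hard_targets + rho * si_posteriors
```
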
Advances in STC Russian Spontaneous Speech Recognition System

In this paper we present the latest improvements to the Russian spontaneous speech recognition system developed at the Speech Technology Center (STC). A significant word error rate (WER) reduction was obtained by applying hypothesis rescoring with sophisticated language models: a Recurrent Neural Network Language Model and a regularized Long Short-Term Memory Language Model. For acoustic modeling we used a deep neural network (DNN) trained with speaker-dependent bottleneck features, similar to our previous system. This DNN was combined with a deep Bidirectional Long Short-Term Memory acoustic model by means of score fusion. The resulting system achieves a WER of 16.4 %, an absolute reduction of 8.7 % and a relative reduction of 34.7 % compared to our previous system's result on this test set.

Ivan Medennikov, Alexey Prudnikov
Approaches for Out-of-Domain Adaptation to Improve Speaker Recognition Performance

In recent years, satisfactory performance of speaker recognition (SR) systems has been achieved in the evaluations provided by NIST. This was possible due to the use of large datasets for training system parameters and accurate speaker variability modeling. In such cases test and training conditions are similar, which ensures good performance in the evaluations. However, in practical applications, when training and testing conditions differ, the optimal SR system parameters become mismatched. This is the main problem in the deployment of real application systems, as it reduces SR system effectiveness. This paper investigates discriminative and generative approaches for the adaptation of speaker recognition system parameters and proposes effective solutions to improve their performance.

Andrey Shulipa, Sergey Novoselov, Aleksandr Melnikov
Assessment of the Relation Between Low-Frequency Features and Velum Opening by Using Real Articulatory Data

This work aims to assess the relation between low-frequency speech features and velum opening by using data coming from an electromagnetic articulograph (EMA) system. In previous works, features related to the frequency content below the first formant have been proposed in order to detect nasalized sounds and hypernasality; however, those low-frequency features have not yet been assessed on real articulatory data regarding the dynamic behavior of velum opening. In order to evaluate the relationship between low-frequency features and velum opening, the statistical association between acoustic information and velum movement is measured. In addition, the parameters are evaluated in an acoustic-to-articulatory system based on radial basis neural networks. Results suggest the existence of low-frequency features related to velum position. Therefore, this kind of parameter could be useful in acoustic-to-articulatory mapping systems.

Alexander Sepulveda-Sepulveda, German Castellanos-Dominguez
Automatic Summarization of Highly Spontaneous Speech

This paper addresses speech summarization of highly spontaneous speech. Speech is converted into text using an ASR system, then segmented into tokens. Human-made and automatic, prosody-based tokenization are compared. The obtained sentence-like units are analysed by a syntactic parser to help automatic sentence selection for the summary. The preprocessed sentences are ranked based on thematic terms and sentence position. The thematic term is expressed in two ways: TF-IDF and Latent Semantic Indexing. The sentence score is calculated as a linear combination of the thematic term score and a sentence position score. To generate the summary, the top 10 candidates for the most informative/best summarizing sentences are selected. The system showed comparable results (recall: 0.62, precision: 0.79, F-measure: 0.68) with the prosody-based tokenization approach. A subjective test on a Likert scale was also carried out.

András Beke, György Szaszák
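
The ranking step described above is a weighted sum of a thematic score and a position score. A hedged sketch with TF-IDF as the thematic component; the weight alpha and the linear position heuristic are illustrative assumptions, not the paper's tuned values:

```python
import numpy as np
from sklearn.feature_extraction.text import TfidfVectorizer

def rank_sentences(sentences, alpha=0.7, top_k=10):
    """Score each sentence as alpha * thematic + (1 - alpha) * position."""
    tfidf = TfidfVectorizer().fit_transform(sentences)   # (n_sentences, n_terms)
    thematic = np.asarray(tfidf.sum(axis=1)).ravel()
    thematic = thematic / (thematic.max() + 1e-12)       # normalize to [0, 1]
    n = len(sentences)
    position = 1.0 - np.arange(n) / max(n - 1, 1)        # earlier sentences score higher
    scores = alpha * thematic + (1 - alpha) * position
    return [sentences[i] for i in np.argsort(scores)[::-1][:top_k]]
```
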
Backchanneling via Twitter Data for Conversational Dialogue Systems

Backchanneling plays a crucial role in human-to-human communication. In this study, we propose a method for generating a rich variety of backchanneling, which is not just limited to simple “hm” or “sure” responses, to realize smooth communication in conversational dialogue systems. We formulate the problem of what the backchanneling generation function should return for given user inputs as a multi-class classification problem and determine a suitable class using a recurrent neural network with a long short-term memory. Training data for our model comprised pairs of tweets and replies acquired from Twitter. Experimental results demonstrated that our method can appropriately select backchannels to given inputs and significantly outperform baseline methods.

Michimasa Inaba, Kenichi Takahashi
Bio-Inspired Sparse Representation of Speech and Audio Using Psychoacoustic Adaptive Matching Pursuit

This paper is devoted to sparse audio and speech signal modelling via the matching pursuit (MP) algorithm. A redundant dictionary of time-frequency functions is constructed through a frame-based, psychoacoustically optimized wavelet packet (WP) transform. Anthropomorphic adaptation of the time-frequency plane allows minimizing the perceptual redundancy of the signal model. Psychoacoustic information is used at the MP stage to select the best atom from the dictionary, which improves algorithm performance with regard to the human hearing system and computational complexity. The described signal model can be applied in many audio and speech processing tasks, such as source separation, watermarking and classification. The presented research focuses on signal encoding: a universal audio/speech coding algorithm suitable for input signals with different sound content is proposed.

Alexey Petrovsky, Vadzim Herasimovich, Alexander Petrovsky
Combining Atom Decomposition of the F0 Track and HMM-based Phonological Phrase Modelling for Robust Stress Detection in Speech

The Weighted Correlation based Atom Decomposition (WCAD) algorithm is a technique for intonation modelling that uses a matching pursuit framework to decompose the F0 contour into a set of basic components, called atoms. The atoms attempt to model the physiological activation of the laryngeal muscles responsible for changes in F0. Recently, WCAD has been upgraded to use the orthogonal matching pursuit (OMP) algorithm, which gives qualitative improvements in the modelling of intonation. A possible application of the OMP based WCAD is the automatic detection of stress in speech, which we undertake for the Hungarian language. Correlation is demonstrated between stress and atomic peaks, as well as between stress and atomic valleys on the previous syllable. The stress detection technique based on WCAD is compared to a baseline system using HMM/GMM stress/phrase models. A 7 % improvement in F-measure over the baseline is observed when evaluating on a hand-made reference. Finally, we propose a hybrid approach which outperforms both individual systems (by 11 % compared to the baseline).

György Szaszák, Máté Ákos Tündik, Branislav Gerazov, Aleksandar Gjoreski
Comparative Analysis of Classifiers for Automatic Language Recognition in Spontaneous Speech

In this paper we consider a language identification system based on the state-of-the-art i-vector method. The paper presents a comparative analysis of different methods for classification in the i-vector space to determine the most efficient one for this task. Experimental results show the reliability of the method based on linear discriminant analysis and a naive Bayes classifier, which is sufficient for use in practical applications.

Konstantin Simonchik, Sergey Novoselov, Galina Lavrentyeva
Comparison of Retrieval Approaches and Blind Relevance Feedback Methods Within the Czech Speech Information Retrieval

This article has several objectives. The first is to compare the most widely used information retrieval methods on a single speech retrieval collection. The collection, used in the CLEF 2007 Czech task, contains automatically transcribed spontaneous interviews of Holocaust survivors and is, to our knowledge, the only Czech collection of spontaneous speech intended for speech information retrieval. Apart from the first experiments presented in the CLEF competition, no comprehensive experiments have been published on this collection to compare the different information retrieval methods. The second objective of this paper is to compare the results of using blind relevance feedback methods with the individual retrieval methods, and to introduce the possibility of using score normalization methods for the selection of documents for blind relevance feedback. The third objective is to compare the different normalization methods among themselves. Exhaustive experiments were performed for each method and its settings. For all information retrieval methods used, the experimental results showed that the use of score normalization methods significantly improves the achieved retrieval score.

Lucie Skorkovská
Convolutional Neural Network in the Task of Speaker Change Detection

This paper presents an approach to detecting speaker changes in telephone conversations. The speaker change problem is cast as a classification problem. We use a Convolutional Neural Network to analyze short audio segments. The network plays the role of a regressor: it outputs higher values for segments that are more likely to contain a speaker change, and the decision about a segment is made by thresholding the regressed value. The experiment shows that the Convolutional Neural Network outperforms a baseline system based on the Bayesian Information Criterion and behaves very well on previously unseen data produced by previously unheard speakers.

Marek Hrúz, Marie Kunešová
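
The decision stage sketched in the abstract, thresholding a regressed change score, can be written compactly; the CNN itself is abstracted away here as the source of the per-segment scores, and the threshold and gap values are illustrative:

```python
def detect_speaker_changes(scores, threshold=0.5, min_gap=10):
    """Turn per-segment change scores (e.g. from a CNN regressor) into change points.

    scores:    sequence of regressed change scores, one per analysis segment.
    threshold: scores above this value are candidate speaker changes.
    min_gap:   minimum number of segments between two accepted changes,
               suppressing duplicate detections of the same change.
    """
    changes, last = [], -min_gap
    for t, s in enumerate(scores):
        if s > threshold and t - last >= min_gap:
            changes.append(t)
            last = t
    return changes
```
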
Design of a Speech Corpus for Research on Cross-Lingual Prosody Transfer

Since the prosody of a spoken utterance carries information about its discourse function, salience, and speaker attitude, prosody models and prosody generation modules have played a crucial part in text-to-speech (TTS) synthesis systems from the beginning, especially those set not only on sounding natural, but also on showing emotion or particular speaker intention. Prosody transfer within speech-to-speech translation is a recent research area with increasing importance, with one of its most important research topics being the detection and treatment of salient events, i.e. instances of prominence or focus which do not result from syntactic constraints, but are rather products of semantic or pragmatic level effects. This paper presents the design and the guidelines for the creation of a multilingual speech corpus containing prosodically rich sentences, ultimately aimed at training statistical prosody models for multilingual prosody transfer in the context of expressive speech synthesis.

Milan Sečujski, Branislav Gerazov, Tamás Gábor Csapó, Vlado Delić, Philip N. Garner, Aleksandar Gjoreski, David Guennec, Zoran Ivanovski, Aleksandar Melov, Géza Németh, Ana Stojković, György Szaszák
Designing High-Coverage Multi-level Text Corpus for Non-professional-voice Conservation

The paper focuses on building a text corpus suitable for the conservation of the voices of non-professional speakers who are losing their voices due to serious health problems. Since we do not know in advance how many sentences a speaker will be able to record, we propose a multi-level greedy algorithm which can ensure the coverage of the selected texts by various phonetic and prosodic units. Such coverage is compared for various corpus sizes, as well as against the generic TTS corpus recorded by a healthy professional speaker.

Markéta Jůzová, Daniel Tihelka, Jindřich Matoušek
Designing Syllable Models for an HMM Based Speech Recognition System

In this paper we present novel ways of incorporating syllable information into an HMM based speech recognition system. Syllable based acoustic modelling is appealing, as syllables have certain acoustic-phonetic dependencies that cannot be modeled in a pure phone based system. On the other hand, syllable based systems suffer from sparsity issues. In this paper we investigate the potential of different acoustic units, such as phones, phone clusters, phones-in-syllables, demi-syllables and syllables, in combination with a variety of back-off schemes. Experimental results are presented on the Wall Street Journal database. When working with traditional frame based features only, results show only minor improvements. However, we expect that the developed system will show its full potential when incorporating additional segmental features at the syllable level.

Kseniya Proença, Kris Demuynck, Dirk Van Compernolle
Detecting Filled Pauses and Lengthenings in Russian Spontaneous Speech Using SVM

Spontaneous speech differs from any other type of speech in many ways, and the presence of disfluencies is one of its most prominent characteristics. These phenomena are an important feature of human-human communication and at the same time a challenging obstacle for speech processing tasks. This paper reports experimental results on the automatic detection of filled pauses and sound lengthenings based on automatically extracted acoustic features. We performed machine learning experiments using a support vector machine (SVM) classifier on a mixed and quality-diverse corpus of Russian spontaneous speech. We applied Gaussian filtering and morphological opening to post-process the probability estimates from the SVM classifier. As a result, we achieved an F1-score of 0.54, with precision and recall being 0.55 and 0.53 respectively.

Vasilisa Verkhodanova, Vladimir Shapranov
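
The post-processing chain named above, Gaussian filtering followed by morphological opening, is available directly in SciPy. A minimal sketch applied to frame-level probability estimates; the sigma, structuring element size and threshold are illustrative, not the paper's settings:

```python
from scipy.ndimage import gaussian_filter1d, grey_opening

def postprocess_probabilities(probs, sigma=2.0, opening_size=5, threshold=0.5):
    """Smooth frame-level SVM probabilities, then remove short spurious peaks."""
    smoothed = gaussian_filter1d(probs, sigma=sigma)    # Gaussian filtering
    opened = grey_opening(smoothed, size=opening_size)  # morphological opening
    return opened > threshold                           # boolean detection mask
```
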
Detecting Laughter and Filler Events by Time Series Smoothing with Genetic Algorithms

Social signal detection, where the aim is to identify vocalizations like laughter and filler events (sounds like “eh”, “er”, etc.), is a popular task in the area of computational paralinguistics, a subfield of speech technology. Recent studies have shown that besides applying state-of-the-art machine learning methods, it is worth making use of the contextual information and adjusting the frame-level scores based on the local neighbourhood. In this study we apply a weighted average time series smoothing filter for laughter and filler event identification, and set the weights using genetic algorithms. Our results indicate that this is a viable way of improving the Area Under the Curve (AUC) scores: the resulting scores are much better than those of the raw likelihoods produced by both AdaBoost.MH and DNNs, and we also significantly outperform standard time series filters.

Gábor Gosztolya
Detecting State of Aggression in Sentences Using CNN

In this article we study the verbal expression of aggression and its detection using machine learning and neural network methods. We test our results on our corpora of messages from anonymous imageboards. We also compare a random forest classifier with a convolutional neural network on the “Movie reviews with one sentence per review” corpus.

Denis Gordeev
DNN-Based Acoustic Modeling for Russian Speech Recognition Using Kaldi

In the paper, we describe research on DNN-based acoustic modeling for Russian speech recognition. Training and testing of the system were performed using the open-source Kaldi toolkit. We created tanh and p-norm DNNs with different numbers of hidden layers and different numbers of hidden units in the tanh DNNs. Testing of the models was carried out on a very large vocabulary continuous Russian speech recognition task. We obtained a relative WER reduction of 20 % compared to the baseline GMM-HMM system.

Irina Kipyatkova, Alexey Karpov
DNN-Based Duration Modeling for Synthesizing Short Sentences

Statistical parametric speech synthesis conventionally utilizes decision tree clustered context-dependent hidden Markov models (HMMs) to model speech parameters. But decision trees are unable to capture complex context dependencies and fail to model the interaction between linguistic features. Recently, deep neural networks (DNNs) have been applied in speech synthesis, and they can address some of these limitations. This paper focuses on the prediction of phone durations in Text-to-Speech (TTS) systems using feedforward DNNs in the case of short sentences (sentences containing only one, two or three syllables). To achieve better prediction accuracy, hyperparameter optimization was carried out with manual grid search. Recordings from a male and a female speaker were used to train the systems, and the outputs of various configurations were compared against conventional HMM-based solutions and natural speech. Experimental results of objective evaluations show that DNNs can outperform previous state-of-the-art solutions in duration modeling.

Péter Nagy, Géza Németh
Emotional Speech of 3-Years Old Children: Norm-Risk-Deprivation

The goal of the study is to compare the emotional speech and vocalizations of 3-year-old healthy children (control) and children with neurological disorders (risk) brought up in families, and children from an orphanage (deprivation). Audio and video recordings of the children’s speech and behavior were made in model situations designed to evoke emotional expressions during interaction with their mothers and the experimenter. Perceptual analysis was conducted to estimate the possibility of recognizing a child’s emotional state when listening to the child’s speech and vocalizations, by groups of native speakers: parents, experts, and adults who do not have children of their own. Native speakers attributed the children’s utterances to states of comfort, discomfort, or neutral, and further specified the emotional state as anger, fear, sadness, happiness, surprise, or calm. The acoustic characteristics of the children’s speech and vocalizations were measured: pitch values, the range of pitch values, duration of utterances, duration of vocalizations and stressed vowels, and formant frequencies. Dialogues of children with their mothers and the experimenter were described to evaluate the level of the child’s speech mastery. Phonetic analysis of the children’s emotional utterances was made. Differences in the recognition of emotional state between the groups of children were revealed: native speakers identified the emotional state in the voices of healthy children raised in families better than in the orphans’ voices, whereas experts recognized the emotional state better than parents and adults without experience of interacting with their own children. Communication between children of the risk and deprivation groups and adults is obstructed due to the features of the acoustic characteristics of their emotional speech.

Olga Frolova, Elena Lyakso
Ensemble Deep Neural Network Based Waveform-Driven Stress Model for Speech Synthesis

Stress annotations in the training corpus of speech synthesis systems are usually obtained by applying language rules to the transcripts. However, the actual stress patterns seen in the waveform are not guaranteed to be canonical; they can deviate from the locations defined by language rules, driven mostly by speaker dependent factors. Therefore, stress models based on these corpora can be far from perfect. This paper proposes a waveform based stress annotation technique. According to the stress classes, four feedforward deep neural networks (DNNs) were trained to model the fundamental frequency (F0) of speech. During synthesis, stress labels are generated from the textual input and an ensemble of the four DNNs predicts the F0 trajectories. Objective and subjective evaluations were carried out. The results show that the proposed method surpasses the quality of vanilla DNN-based F0 models.

Bálint Pál Tóth, Kornél István Kis, György Szaszák, Géza Németh
Evaluation of Response Times on a Touch Screen Using Stereo Panned Speech Command Auditory Feedback

User interfaces for mobile and handheld devices usually incorporate touch screens. Fast user responses are in general not critical; however, some applications require fast and accurate reactions from users. Errors and response times depend on many factors, such as the user’s abilities, feedback types and latencies from the device, the sizes of the buttons to press, etc. We conducted an experiment with 17 subjects to test response time and accuracy for different kinds of speech-based auditory stimuli over headphones. Speech signals were spatialized based on stereo amplitude panning. Results show significantly better response times for 3 directions than for 5, as well as for the native language compared to English, and more accurate judgements based on the meaning of the speech sounds rather than their direction.

Hunor Nagy, György Wersényi
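
Stereo amplitude panning of a mono speech prompt, as used for the stimuli in this study, is commonly done with the constant-power sine/cosine panning law; a small sketch under that assumption (the authors' exact panning law is not specified here):

```python
import numpy as np

def pan_stereo(mono, pan):
    """Constant-power stereo panning of a mono signal.

    mono: 1-D float array containing the speech signal.
    pan:  -1.0 (full left) through 0.0 (center) to 1.0 (full right).
    """
    theta = (pan + 1.0) * np.pi / 4.0        # map [-1, 1] onto [0, pi/2]
    left = np.cos(theta) * mono
    right = np.sin(theta) * mono
    return np.stack([left, right], axis=-1)  # (samples, 2) stereo array
```
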
Evaluation of the Speech Quality During Rehabilitation After Surgical Treatment of the Cancer of Oral Cavity and Oropharynx Based on a Comparison of the Fourier Spectra

In this paper, we propose the selection of parameters for a criterion evaluating the quality of pronunciation of certain phonemes. A comparison of the different options and criteria for selecting the parameter of the underlying metric, the Minkowski metric, is presented. This approach is used for the comparative assessment of the quality of utterances in the process of voice rehabilitation of patients after surgical treatment of cancer of the oral cavity and oropharynx. Pronunciation before surgery, taken as a reference, is compared with pronunciation after the operation in the course of sessions with a speech therapist. The proposed criterion is calculated by comparing the Fourier spectra of these signals and detecting differences on the basis of the Minkowski distance. The signals are first subjected to a normalization procedure to make the spectra comparable. Based on the experiment, a value of the Minkowski distance parameter ensuring the greatest legibility of signals when comparing pronunciation quality is suggested. Various approaches to the formation of quality evaluation criteria for pronouncing phonemes are presented. The applicability of the proposed approach for an objective comparative evaluation of the quality of pronouncing the phonemes [k] and [t] in patients before and after surgery is confirmed.

Evgeny Kostyuchenko, Mescheryakov Roman, Dariya Ignatieva, Alexander Pyatkov, Evgeny Choynzonov, Lidiya Balatskaya
Experiments with One-Class Classifier as a Predictor of Spectral Discontinuities in Unit Concatenation

We present a sequence of experiments with one-class classification, aimed at examining the ability of such a classifier to detect spectral smoothness of units, as an alternative to heuristics-based measures used within unit selection speech synthesizers. A set of spectral feature distances was computed between neighbouring frames in natural speech recordings, i.e. those representing natural joins, from which the per-vowel classifier was trained. In total, three types of classifiers were examined for distances computed from several different signal parametrizations. For the evaluation, the trained classifiers were tested against smooth or discontinuous joins as they were perceived by human listeners in an ad-hoc listening test designed for this purpose.

Daniel Tihelka, Martin Grůber, Markéta Jůzová
Exploring GMM-derived Features for Unsupervised Adaptation of Deep Neural Network Acoustic Models

In this paper we investigate GMM-derived features recently introduced for the adaptation of context-dependent deep neural network HMM (CD-DNN-HMM) acoustic models. We present an initial attempt at improving the previously proposed adaptation algorithm by applying lattice scores and by using confidence measures in the traditional maximum a posteriori (MAP) adaptation algorithm. Modified MAP adaptation is performed for the auxiliary GMM model used in a speaker adaptation procedure for a DNN. In addition we introduce two approaches, data augmentation and data selection, for improving the regularization in MAP adaptation for DNN. Experimental results on the Wall Street Journal (WSJ0) corpus show that the proposed adaptation technique can provide, on average, up to 9.9 % relative word error rate (WER) reduction under an unsupervised adaptation setup, compared to speaker independent DNN-HMM systems built on conventional features.

Natalia Tomashenko, Yuri Khokhlov, Anthony Larcher, Yannick Estève
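
For reference, the classical relevance-MAP update of a GMM mean, which the modified adaptation described above builds on, interpolates between the prior (speaker-independent) mean and the adaptation-data statistics; here gamma_k(t) is the posterior occupation of component k at frame t and tau is the relevance factor:

```latex
\hat{\mu}_k = \frac{\tau\,\mu_k^{\mathrm{SI}} + \sum_t \gamma_k(t)\,x_t}{\tau + \sum_t \gamma_k(t)}
```
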
Feature Space VTS with Phase Term Modeling

A new variant of a Vector Taylor Series based feature compensation algorithm is proposed. The phase-sensitive speech distortion model is used, and the phase term is modeled as a multivariate Gaussian with unknown mean vector and covariance matrix. These parameters are estimated based on the Maximum Likelihood principle, using the EM algorithm. EM parameter update formulas are derived, as well as the MMSE estimate of the clean speech features. Experiments on the Aurora2 database show that taking the phase term into account and data-driven estimation of its parameters result in a relative WER reduction of about 20 % compared to the phase-insensitive VTS version. The proposed method is also compared to VTS with a constant phase vector, and this approximation is shown to be very efficient.

Maxim Korenevsky, Aleksei Romanenko
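
The phase-sensitive distortion model that this family of VTS methods starts from relates clean speech X, noise N and the noisy observation Y in the power-spectral domain as follows, where alpha is the cosine of the phase difference between speech and noise, the term that phase-insensitive VTS drops:

```latex
|Y|^2 = |X|^2 + |N|^2 + 2\,\alpha\,|X|\,|N|, \qquad \alpha = \cos\theta_{XN}
```
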
Finding Speaker Position Under Difficult Acoustic Conditions

This paper presents different approaches to speaker position identification that use a microphone array and known voice models. Comparison of speaker positioning approaches is performed using acoustic maps based on FBF and PHAT. The goal of the experiments is to find the best algorithm parameters and to validate them for different types of noise. The proposed approaches allow for better results in automatic positioning under noisy conditions, making it possible to identify a target speaker whose speech duration exceeds 10 s.

Evgeniy Shuranov, Aleksandr Lavrentyev, Alexey Kozlyaev, Galina Lavrentyeva, Valeriya Volkovaya
Fusing Various Audio Feature Sets for Detection of Parkinson’s Disease from Sustained Voice and Speech Recordings

The aim of this study is the analysis of voice and speech recordings for the task of Parkinson’s disease detection. The voice modality corresponds to sustained phonation of /a/ and the speech modality to a short sentence in the Lithuanian language. Diverse information is extracted from the recordings by 22 well-known audio feature sets. Random forest is used as the learner, both for individual feature sets and for decision-level fusion. Essentia descriptors were found to be the best individual feature set, achieving an equal error rate of 16.3 % for voice and 13.3 % for speech. Fusion of feature sets and modalities improved detection, achieving an equal error rate of 10.8 %. Variable importance in the fusion revealed the speech modality as more important than voice.

Evaldas Vaiciukynas, Antanas Verikas, Adas Gelzinis, Marija Bacauskiene, Kestutis Vaskevicius, Virgilijus Uloza, Evaldas Padervinskis, Jolita Ciceliene
HAVRUS Corpus: High-Speed Recordings of Audio-Visual Russian Speech

In this paper we present a software-hardware complex for the collection of audio-visual speech databases with a high-speed camera and a dynamic microphone. We describe the architecture of the developed software as well as some details of the collected database of Russian audio-visual speech, HAVRUS. The developed software provides synchronization and fusion of the audio and video channels and handles a natural property of human speech: the asynchrony of the audio and visual speech modalities. The collected corpus comprises recordings of 20 native speakers of Russian and is meant for further research and experiments on audio-visual Russian speech recognition.

Vasilisa Verkhodanova, Alexander Ronzhin, Irina Kipyatkova, Denis Ivanko, Alexey Karpov, Miloš Železný
Human-Smartphone Interaction for Dangerous Situation Detection and Recommendation Generation While Driving

The paper presents a human-smartphone interaction system aimed at detecting dangerous situations in a vehicle while driving. The system implements driver head position and face tracking to detect whether the driver is fine, drowsy, or distracted. For image recognition, the OpenCV computer vision library is used, which makes it possible to determine the main head and face parameters that are analyzed to detect dangerous situations. Taking into account the detected dangerous situation and the current situation on the road (e.g., city or countryside driving; hotels, gas stations, cafes, restaurants around; Internet availability), the system generates recommendations for the driver to prevent accidents caused by dangerous driver behavior.

Alexander Smirnov, Alexey Kashevnik, Igor Lashkov
Improving Automatic Speech Recognition Containing Additive Noise Using Deep Denoising Autoencoders of LSTM Networks

Automatic speech recognition (ASR) systems suffer from performance degradation under noisy conditions. Recent work using deep neural networks to denoise spectral input features for robust ASR has proved to be successful. In particular, Long Short-Term Memory (LSTM) autoencoders have outperformed other state-of-the-art denoising systems when applied to the MFCCs of a speech signal. In this paper we also consider denoising LSTM autoencoders (DLSTMAs), but instead use three different DLSTMAs and apply them to the MFCC, fundamental frequency, and energy features, respectively. Results are given for several kinds of additive noise at different intensity levels, and show how this collection of DLSTMAs improves the performance of the ASR system in comparison with a single LSTM autoencoder.

Marvin Coto-Jiménez, John Goddard-Close, Fabiola Martínez-Licona
Improving the Quality of Automatic Speech Recognition in Trucks

In this paper we consider the problem of training DNN-HMM acoustic models for automatic speech recognition systems for the Russian language in modern commercial trucks. The speech database for training and testing the ASR system was recorded in various models of trucks operating under different conditions. The experiments on the test part of the speech database show that acoustic models trained on a specifically modeled training speech database improve the recognition quality in a moving truck from 35 % to 88 % compared to acoustic models trained on clean speech. A new topology of the neural network is also proposed, which reduces computational costs significantly without loss of recognition accuracy.

Maxim Korenevsky, Ivan Medennikov, Vadim Shchemelinin
Improving Recognition of Dysarthric Speech Using Severity Based Tempo Adaptation

Dysarthria is a motor speech disorder characterized by slurred or slow speech resulting in low intelligibility. Automatic recognition of dysarthric speech is beneficial in enabling people with dysarthria to use speech as a mode of interaction with electronic devices. In this paper we propose a mechanism to adapt the tempo of the sonorant part of dysarthric speech to match that of normal speech, based on the severity of dysarthria. We show a significant improvement in the recognition of tempo-adapted dysarthric speech, using a Gaussian Mixture Model (GMM)-Hidden Markov Model (HMM) recognition system as well as a Deep Neural Network (DNN)-HMM based system. All evaluations were done on the Universal Access Speech Corpus.

Chitralekha Bhat, Bhavik Vachhani, Sunil Kopparapu
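
Tempo adaptation of this kind can be prototyped with an off-the-shelf time-scale modification routine. A hedged sketch using librosa's phase-vocoder time stretch; in the paper the rate would be derived from dysarthria severity and applied to sonorant regions only, whereas this simplified version stretches the whole utterance:

```python
import librosa

def adapt_tempo(wav_path, rate=0.8):
    """Time-stretch an utterance towards normal-speech tempo.

    rate < 1 slows the speech down, rate > 1 speeds it up; the value here
    is an illustrative placeholder for a severity-dependent rate.
    """
    y, sr = librosa.load(wav_path, sr=None)
    y_adapted = librosa.effects.time_stretch(y, rate=rate)
    return y_adapted, sr
```
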
Improving Robustness of Speaker Verification by Fusion of Prompted Text-Dependent and Text-Independent Operation Modalities

In this paper we present a fusion methodology for combining prompted text-dependent and text-independent speaker verification operation modalities. The fusion is performed at the score level, using scores extracted from GMM-UBM single-mode speaker verification engines and several machine learning algorithms for classification. In order to improve the performance we apply clustering of the score-based data before the classification stage. The experimental results indicate that the fusion of the two operation modes improves speaker verification performance in terms of both sensitivity and specificity, by approximately 2 % and 1.5 % respectively.

Iosif Mporas, Saeid Safavi, Reza Sotudeh
Improvements to Prosodic Variation in Long Short-Term Memory Based Intonation Models Using Random Forest

Statistical parametric speech synthesis has surpassed unit selection methods in many aspects, including flexibility and variability. However, the intonation of these systems is quite monotonous, especially in the case of longer sentences, because the statistical methods decrease the variation of fundamental frequency (F0) trajectories. In this research a random forest (RF) based classifier was trained on radio conversations, based on the variation perceived by a human annotator. This classifier was used to extend the labels of a phonetically balanced, studio-quality speech corpus. With the extended labels, a Long Short-Term Memory (LSTM) network was trained to model fundamental frequency (F0). Objective and subjective evaluations were carried out. The results show that the variation of the generated F0 trajectories can be fine-tuned with an additional input to the LSTM network.

Bálint Pál Tóth, Balázs Szórádi, Géza Németh
In-Document Adaptation for a Human Guided Automatic Transcription Service

In this work, the task is to assist human transcribers to produce, for example, interview or parliament speech transcriptions. The system will perform in-document adaptation based on a small amount of manually corrected automatic speech recognition results. The corrected segments of the spoken document are used to adapt the speech recognizer’s acoustic and language model. The updated models are used in second-pass recognition to produce a more accurate automatic transcription for the remaining uncorrected parts of the spoken document. In this work we evaluate two common adaptation methods for speech data in settings that represent typical transcription tasks. For adapting the acoustic model we use the Maximum A Posteriori adaptation method. For adapting the language model we use linear interpolation. We compare results of supervised adaptation to unsupervised adaptation, and evaluate the total benefit of using human corrected segments for in-document adaptation for typical transcription tasks.

André Mansikkaniemi, Mikko Kurimo, Krister Lindén
Interaction Quality as a Human-Human Task-Oriented Conversation Performance

Spoken dialogue systems (SDSs), which are designed to replace employees in different services, need indicators which show what has happened in the ongoing dialogue and what the next step in the system’s behaviour should be. Some of these indicators for SDSs come from the field of call centre quality evaluation. In turn, some metrics like Interaction Quality (IQ), which was designed for human-computer spoken interaction, can be applied to human-human conversations. Such experience might be used for service quality improvement in both call centres and SDSs. This paper provides the results of IQ modelling for human-human task-oriented conversation with several classification algorithms.

Anastasiia Spirina, Olesia Vaskovskaia, Maxim Sidorov, Alexander Schmitt
Investigation of Segmentation in i-Vector Based Speaker Diarization of Telephone Speech

The goal of this paper is to evaluate the contribution of speaker change detection (SCD) to the performance of a speaker diarization system in the telephone domain. We compare the overall performance of an i-vector based system using both SCD-based segmentation and a naive constant length segmentation with overlapping segments. The diarization system performs K-means clustering of i-vectors which represent the individual segments, followed by a resegmentation step. Experiments were done on the English part of the CallHome corpus. The final results indicate that the use of speaker change detection is beneficial, but the differences between the two segmentation approaches are diminished by the use of resegmentation.

Zbyněk Zajíc, Marie Kunešová, Vlasta Radová
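
The naive constant-length segmentation baseline from the comparison above is straightforward to reproduce: cut the conversation into overlapping fixed-length segments and cluster their vector representations with K-means. A sketch in which the i-vector extractor is abstracted as a given callable:

```python
import numpy as np
from sklearn.cluster import KMeans

def uniform_diarization(features, extract_ivector, seg_len=150, shift=75, n_speakers=2):
    """Constant-length segmentation with overlap, then K-means in i-vector space.

    features:        (frames, dims) acoustic features of the whole conversation.
    extract_ivector: callable mapping a feature segment to a fixed-size vector;
                     stands in for a trained i-vector extractor.
    """
    starts = list(range(0, max(len(features) - seg_len, 1), shift))
    ivectors = np.stack([extract_ivector(features[s:s + seg_len]) for s in starts])
    labels = KMeans(n_clusters=n_speakers, n_init=10).fit_predict(ivectors)
    return list(zip(starts, labels))  # (segment start frame, speaker label)
```
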
Investigation of Speech Signal Parameters Reflecting the Truth of Transmitted Information

A review of existing methods for diagnosing the truth of transmitted information is presented. A conclusion is drawn concerning the purposefulness of realizing this function in polymodal infocommunication systems. Parameters of the speech signal that reflect the truth of transmitted information are considered. The results of testing the developed software are presented. Based on the undertaken study, a conclusion is drawn concerning the possibility of assessing the truth of transmitted information in the course of interpersonal communication between subscribers, and a decision rule is formulated.

Victor Budkov, Irina Vatamaniuk, Vladimir Basov, Daniyar Volf
Investigating Signal Correlation as Continuity Metric in a Syllable Based Unit Selection Synthesis System

In recent years, text-to-speech (TTS) systems have shown considerable improvement as far as the quality of the synthetic speech is concerned. Data-driven synthesis methods using the syllable as the basic unit for concatenation have proved to generate high-quality speech for Indian languages because of the advantage of prosodic matching. However, there is still no acceptable solution to the optimal selection of speech segments in terms of audible discontinuities and human perception. This problem is aggravated in cases where there is not enough data for building the voice due to missing units. In this paper, we continue our efforts to address this by investigating a new continuity measure based on maximum signal correlation for the optimal selection of units in a concatenative text-to-speech (TTS) synthesis framework. We explore two formulations for calculating the signal correlation: cross-correlation (CC) based and average magnitude difference function (AMDF) based. We first perform an initial experiment to understand the significance of the approach and then build 5 experimental systems. Evaluations on 30 sentences for each language by native users show that the proposed continuity measure results in more natural-sounding synthesis.

Sai Sirisha Rallabandi, Sai Krishna Rallabandi, Naina Teertha, Kumaraswamy R., Suryakanth V. Gangashetty
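
The two continuity formulations compared in the paper can be stated compactly for a pair of boundary frames taken from the two units to be joined; this sketch simplifies away the frame extraction and any lag search:

```python
import numpy as np

def cross_correlation(x, y):
    """Normalized cross-correlation at zero lag; higher suggests a smoother join."""
    x, y = x - x.mean(), y - y.mean()
    return float(np.dot(x, y) / (np.linalg.norm(x) * np.linalg.norm(y) + 1e-12))

def amdf(x, y):
    """Average magnitude difference between frames; lower suggests a smoother join."""
    return float(np.mean(np.abs(x - y)))
```
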
Knowledge Transfer for Utterance Classification in Low-Resource Languages

The paper deals with the problem of short text classification in Kazakh. Traditional text classification approaches require labeled data to build accurate classifiers. However, the amount of available labeled data is usually very limited due to the high cost of labeling or data accessibility issues. We describe a method of constructing a classifier without labeled data in the target language. A convolutional neural network (CNN) is trained on Russian labeled texts, and a language vector space transform is used to transfer knowledge from Russian into Kazakh. Classification accuracy is evaluated on a dataset of customer support requests. The presented method demonstrates competitive results compared with an approach that employed a sophisticated automatic translation system.

Andrei Smirnov, Valentin Mendelev
Language Identification Using Time Delay Neural Network D-Vector on Short Utterances

This paper describes a d-vector language identification (LID) system for short utterances, using a time delay neural network (TDNN) acoustic model built for the speech recognition task. The acoustic TDNN model was developed for the ASR system of the ICQ messenger and is applied here to the LID task. We compared the LID TDNN d-vector results to an i-vector baseline. It was found that the TDNN system's performance is stable across utterance durations, while the i-vector system shows good results only on long utterances. An open-set test was conducted. A relative improvement of 5.5 % over the i-vector system is shown.

Maxim Tkachenko, Alexander Yamshinin, Nikolay Lyubimov, Mikhail Kotov, Marina Nastasenko
Lexical Stress in Punjabi and Its Representation in PLS

Punjabi is a tonal language and belongs to the Indo-Aryan family of languages. The Punjabi literature reveals that suprasegmental phonemes such as tone, nasalization and stress are realized at the syllable level. There is an abundance of geminated words, in which stress co-occurs on the geminated consonant. Disyllabic words have the highest frequency of occurrence, while there are very few quadrisyllabic/polysyllabic words, excluding borrowed words. There is limited work available on Punjabi generative phonology: initial efforts were made, but no conclusive work on linguistic rules for stress is available. A pronunciation lexicon is a very useful resource for machine learning and is critical for speech technology research. The Pronunciation Lexicon Specification (PLS) of the W3C enables the development of such data in a standard XML format. This PLS data ought to be enriched with stress information encoded in IPA so that Punjabi Text-to-Speech systems can use it to deliver near-natural voice. In this paper, an attempt has been made to study non-tonal disyllabic words to identify stress patterns. The data was further analyzed to define the linguistic contexts in which stress occurs in Punjabi disyllabic words.

Swaran Lata, Swati Arora, Simerjeet Kaur
Low Inter-Annotator Agreement in Sentence Boundary Detection and Annotator Personality

The paper investigates how annotators’ personalities affect the results of their segmentation of unscripted speech into sentences. This task is inherently ambiguous, and disagreement between annotators may result from a variety of factors, from speech disfluencies and linguistic properties of the text to social characteristics and the individuality of a speaker. While some boundaries are marked by the majority of annotators, there is also a substantial number of boundaries marked by only one or a few experts. In this paper we focus on sentence boundaries that are marked by only a small number of annotators. We test the hypothesis that such “uncommon” boundaries are more likely to be identified by experts with particular personality traits. We found a significant relationship between uncommon boundaries and two psychological traits of annotators measured by the Big Five personality inventory: emotionality and extraversion.

Anton Stepikhov, Anastassia Loukina
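
One plausible form of the reported analysis is a correlation between each annotator's trait score and their rate of uncommon boundaries. A sketch with invented per-annotator values:

import numpy as np
from scipy.stats import pearsonr

# Hypothetical per-annotator data: Big Five extraversion score and the
# fraction of that annotator's boundaries marked by few other experts.
extraversion  = np.array([2.1, 3.4, 4.0, 2.8, 3.9, 4.4, 1.9, 3.1])
uncommon_rate = np.array([0.05, 0.11, 0.16, 0.07, 0.13, 0.18, 0.04, 0.09])

r, p = pearsonr(extraversion, uncommon_rate)
print(f"r = {r:.2f}, p = {p:.3f}")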
LSTM-Based Language Models for Spontaneous Speech Recognition

The language models (LMs) used in speech recognition to predict the next word given the context often rely on too short a context, which leads to recognition errors. In theory, recurrent neural networks (RNNs) should solve this problem, but in practice RNNs do not fully exploit the potential of long contexts. RNN-based language models with long short-term memory (LSTM) units take better advantage of long contexts and demonstrate good perplexity results on many datasets. We used LSTM LMs trained with regularization to rescore recognition word lattices and obtained a much lower WER than n-gram and conventional RNN-based LMs for both Russian and English.

Ivan Medennikov, Anna Bulusheva
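
A minimal sketch of LSTM-LM rescoring, simplified to n-best lists rather than full lattices; the model size and interpolation weight are invented:

import torch
import torch.nn as nn

class LstmLm(nn.Module):
    """Minimal word-level LSTM language model."""
    def __init__(self, vocab_size, dim=256):
        super().__init__()
        self.emb = nn.Embedding(vocab_size, dim)
        self.lstm = nn.LSTM(dim, dim, batch_first=True)
        self.out = nn.Linear(dim, vocab_size)

    def log_prob(self, ids):
        """Total log-probability of a (1, T) tensor of word ids."""
        x, y = ids[:, :-1], ids[:, 1:]
        h, _ = self.lstm(self.emb(x))
        logp = torch.log_softmax(self.out(h), dim=-1)
        return logp.gather(-1, y.unsqueeze(-1)).sum()

def rescore(nbest, lm, weight=0.5):
    """Re-rank (acoustic_score, word_id_tensor) hypotheses with the LSTM LM."""
    return max(nbest, key=lambda h: h[0] + weight * float(lm.log_prob(h[1])))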
Measuring Prosodic Entrainment in Italian Collaborative Game-Based Dialogues

A large number of studies have observed that conversational partners tend to adapt to each other's speech over the course of an interaction. This phenomenon, variously termed entrainment, coordination, alignment or adaptation, is widely believed to be crucial to mutual understanding and successful communication in human interaction. Modelling human adaptation in speech behaviour is also important for improving the naturalness of voice-based human-machine interaction systems. Recently, a body of research has been devoted to finding evidence of prosodic entrainment by measuring a number of acoustic-prosodic parameters in several languages, but not yet in Italian. Our study contributes to this line of research. We analysed game-based collaborative dialogues between Italian speakers, measuring their articulation rate, pitch range, pitch level and loudness. Results show some evidence of overall speech coordination (convergence and synchrony) between conversational partners, where the combination of the prosodic parameters involved may vary across dialogues. Our results are in line with those obtained in previous studies on other languages, thus contributing a useful basis for modelling prosodic adaptation in multilingual spoken dialogue systems.

Michelina Savino, Loredana Lapertosa, Alessandro Caffò, Mario Refice
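
Synchrony and convergence are often quantified with simple correlations over turn-aligned prosodic series; whether the authors used exactly these measures is not stated. A sketch with invented per-turn pitch values:

import numpy as np
from scipy.stats import pearsonr

# Hypothetical per-turn mean pitch (Hz) for two partners, aligned by turn index.
a = np.array([212.0, 205.0, 198.0, 201.0, 195.0, 190.0])
b = np.array([140.0, 146.0, 150.0, 152.0, 158.0, 161.0])

synchrony, _ = pearsonr(a, b)          # do the partners move together?
gap = np.abs(a - b)
turn = np.arange(len(gap))
convergence, _ = pearsonr(turn, gap)   # negative value: the gap narrows over time
print(synchrony, convergence)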
Microphone Array Directivity Improvement in Low-Frequency Band for Speech Processing

This paper presents a new method of improving microphone array directivity in the low-frequency band. The method is based on a sub-band processing technique. We also evaluate the parameters and characteristics of the method and consider some of its practical implementations.

Mikhail Stolbov, Sergei Aleinik
Modeling Imperative Utterances in Russian Spoken Dialogue: Verb-Central Quantitative Approach

The study aims at detecting stable wording patterns of utterances with a directive function in Russian, based on a speech corpus containing long-term audio recordings of everyday spoken communication. The lemmatized and morphologically annotated mini-corpus in question includes 2030 utterances with 2nd person singular and plural imperative verb forms and consists of 11075 word forms. The research draws on the frequencies of (co-)occurrence of word forms, lemmas and parts of speech within the mini-corpus.

Olga Blinova
Multimodal Perception of Aggressive Behavior

The paper presents the results of comparative auditory-perceptual and visual-perceptual analyses of Russian, English, Spanish and Tatar experimental samples representing the emotional-modal complex of aggression. It describes statistically valid differences between auditory and visual perception of aggressive (physical and verbal) behavior, influenced by such factors as the emotional-modal state of the recipient and the language of communication.

Rodmonga Potapova, Liliya Komalova
On Individual Polyinformativity of Speech and Voice Regarding Speakers Auditive Attribution (Forensic Phonetic Aspect)

This paper considers the role of auditive recognition of speakers with regard to the attribution of individual speech and voice features. Our study investigates how well listeners can attribute a set of individual speaker features: verbal, paraverbal, extraverbal, physiological, anthropometric, physical, emotional, social, etc. The main task of this investigation was to indicate which attributes of a speaker can be recognized auditively: universal, group-level or idiosyncratic ones. Special questionnaires were used for the auditive analysis. Two types of speech and voice variability were analysed: interindividual and intraindividual.

Rodmonga Potapova, Vsevolod Potapov
Online Biometric Identification with Face Analysis in Web Applications

Internet security is an important issue that concerns everyone who uses the Internet, without exception. Over the past few years there has been significant improvement in Internet security, but little attention has been paid to protecting careless users. This paper introduces a user-based security application that could replace the classic login frame on websites, offering an extra security level: biometric identification of the user that prevents unauthorized login to their personal page.

Gerasimos Arvanitis, Konstantinos Moustakas, Nikos Fakotakis
Optimization of Zelinski Post-filtering Calculation

This paper describes a new optimized method for calculating the Zelinski post-filter transfer function for a microphone array. The optimized algorithm requires less memory and fewer multiplications. We demonstrate that the computational complexity of the known algorithm increases quadratically with the number of microphones, whereas the complexity of the proposed algorithm increases only linearly. This provides a considerable acceleration in the calculation of the post-filter transfer function.

Sergei Aleinik
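
The paper's exact optimization is not given in the abstract, but linear complexity in the number of microphones can be reached with the identity sum_{i<j} Re(X_i conj(X_j)) = (|sum_i X_i|^2 - sum_i |X_i|^2) / 2, which eliminates the pairwise loop over microphone pairs. A sketch (the spectral smoothing over time used in practice is omitted):

import numpy as np

def zelinski_gain(X, eps=1e-12):
    """Zelinski post-filter gain per frequency bin.

    X: complex STFT snapshot of shape (M, F) - M microphones, F bins.
    Uses sum_{i<j} Re(Xi * conj(Xj)) = (|sum_i Xi|^2 - sum_i |Xi|^2) / 2,
    so the cost grows linearly in M instead of quadratically.
    """
    M = X.shape[0]
    total = X.sum(axis=0)                       # sum over microphones, per bin
    power = (np.abs(X) ** 2).sum(axis=0)        # sum of auto-spectra
    cross = 0.5 * (np.abs(total) ** 2 - power)  # sum of real cross-spectra
    num = (2.0 / (M * (M - 1))) * cross         # averaged cross-spectrum
    den = power / M                             # averaged auto-spectrum
    return np.clip(num / (den + eps), 0.0, 1.0)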
Phonetic Aspects of High Level of Naturalness in Speech Synthesis

The paper is concerned with the phonetic aspects of speech synthesis of Russian vowels using a voice source signal. An original method of recording the glottal wave synchronously with the output speech signal was employed to obtain the experimental material. Several types of perceptual experiments were carried out. Comparing the recorded signals allowed us to analyze the structure of the speech signal at different stages of its generation. The source-filter interaction is analyzed by filtering the speech signal, and the articulatory transfer functions for the Russian vowels were obtained. The transfer functions and voice source signals of different vowels were then combined to generate new signals, which were analyzed in turn. We examined how the fundamental frequency, voice quality and phoneme type influence the source-filter interaction. The paper presents the perceptual experiments, the acoustic analysis and the signal generation results.

Vera Evdokimova, Pavel Skrelin, Andrey Barabanov, Karina Evgrafova
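
For readers unfamiliar with the source-filter view used here, the sketch below excites an all-pole "vocal tract" with a crude glottal pulse train; the formant frequencies and bandwidths are invented, not taken from the paper:

import numpy as np
from scipy.signal import lfilter

fs = 16000
f0 = 120.0                               # fundamental frequency, Hz
n = int(0.5 * fs)
source = np.zeros(n)
source[::int(fs / f0)] = 1.0             # crude glottal pulse train

# All-pole "vocal tract": two resonances at invented formant frequencies.
a = np.array([1.0])
for freq, bw in [(700.0, 80.0), (1100.0, 90.0)]:   # illustrative F1, F2
    r = np.exp(-np.pi * bw / fs)
    pole = np.array([1.0, -2 * r * np.cos(2 * np.pi * freq / fs), r * r])
    a = np.convolve(a, pole)

vowel = lfilter([1.0], a, source)        # filter the source through the tract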
Polybasic Attribution of Social Network Discourse

A number of recent studies have demonstrated great interest in discourse differences across monologue, dialogue and polylogue communication on the Internet. This paper describes the results of our investigation into the relations between certain types of deprivation, on the one hand, and their verbal, paraverbal and non-verbal determinants from an emotional and emotional-modal point of view, on the other, based on spoken Russian communication via the video hosting services YouTube.com, Skype and ok.ru. The research aims at developing a knowledge database for a decision-making system and the computer-aided analysis of Russian spoken and written discourse in social network communication on the Internet.

Rodmonga Potapova, Vsevolod Potapov
Precise Estimation of Harmonic Parameter Trend and Modification of a Speech Signal

The high-frequency part of the voiced speech signal, beyond 4 kHz, is very difficult to study and to decompose into harmonics; in the harmonic plus noise model (HNM) this part of the spectrum is assumed to be noise. In this paper it is shown that the main problem is numerical: faster harmonics have faster trends, so a precise estimation technique is necessary to estimate a high-frequency complex amplitude on a short time interval. An illustrative example is supplied. In the second part of the paper, a new modification technique is proposed for interpolating the complex amplitudes in the case of intonation modification. Reliable estimates of the harmonic complex amplitudes are required as inputs. A nonlinear rule is then formulated that incorporates specific features of the formants and their slopes.

Andrey Barabanov, Valentin Magerkin, Evgenij Vikulov
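
In the simplest stationary case, the complex amplitude of one harmonic on a short frame can be estimated by projection onto a complex exponential, as below; the paper's point is precisely that high harmonics violate this constant-amplitude assumption and need trend-aware estimators:

import numpy as np

def harmonic_amplitude(x, freq, fs):
    """Least-squares complex amplitude of one harmonic on a short frame.

    Assumes the amplitude is constant over the frame, i.e. x ~ Re(A e^{j2*pi*f*t});
    the cross term averages out when the frame spans several cycles.
    """
    t = np.arange(len(x)) / fs
    e = np.exp(-2j * np.pi * freq * t)
    return 2.0 * (x * e).mean()          # estimate of the complex amplitude A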
Profiling a Set of Personality Traits of a Text’s Author: A Corpus-Based Approach

Authorship profiling, i.e. revealing information about an unknown author by analyzing their text, is a task of growing importance. Researchers currently attempt to identify particular psychological characteristics of a text's author (extraversion, openness, etc.). However, it is well known that many psychological traits are mutually correlated, making up what is known as a personality profile. The aim of this study is to assess the probability of self-destructive behaviour of an individual, understood as a set of particular traits, via formal parameters of their texts. We used the corpus RusPersonality, which consists of Russian-language texts labeled with information about their authors. A set of correlations was calculated between text variables and scores on those Freiburg Personality Inventory scales known to be indicative of self-destructive behaviour. A mathematical model which predicts the probability of self-destructive behaviour has been obtained.

Tatiana Litvinova, Olga Zagorovskaya, Olga Litvinova, Pavel Seredin
Prosody Analysis of Malay Language Storytelling Corpus

In this paper, the prosody of a storytelling speech corpus is analyzed. The main objective of the analysis is to develop prosody rules for converting neutral speech to storytelling speech. The speech corpus (neutral and storytelling speech) contains 464 sentences, 4,656 words, and 10,928 syllables. It was recorded by three female storytellers, one male professional speaker, two female speakers and two male speakers. The prosodic features considered in the analysis are tempo, pause (sentence- and phrase-level), duration, intensity, and pitch. A further analysis of the word categories found in storytelling speech (verbs, adverbs, adjectives, nouns, conjunctions and amplifiers) was also conducted. The global prosody analysis showed that the mean prosodic values of storytelling speech are higher than those of neutral speech, especially for intensity and pitch. The investigation of word categories showed that adverbs, adjectives, amplifiers and conjunctions have a significant number of prominent syllables, whereas nouns and verbs show no significant difference between neutral and storytelling speech. The position of a word (initial, middle, final) in a phrase also proved to yield different increase factors in duration, pitch and intensity for different word categories.

Izzad Ramli, Noraini Seman, Norizah Ardi, Nursuriati Jamil
Quality Assessment of Two Fullband Audio Codecs Supporting Real-Time Communication

Recent audio codecs enable high-quality signals up to fullband (20 kHz), which is usually associated with the maximal audible bandwidth. Following previous studies on speech coding assessment, in this study we survey the music coding ability of two real-time codecs with fullband capability: the IETF-standardized Opus codec and the 3GPP-specified EVS codec. We tested both codecs with vocal, instrumental and mixed music signals. For evaluation, we predicted human assessments using the instrumental POLQA method, which was primarily designed for speech assessment. Additionally, we performed two listening tests as a reference, with a total of 21 young adults. Opus and EVS show similar music coding performance. The quality assessment mainly depends on the specific music characteristics and on the tested bitrates, from 16.4 to 64 kbit/s. The POLQA measure and the listening results correlate, although the absolute ratings of the young listeners yield much lower MOS values.

M. Maruschke, O. Jokisch, M. Meszaros, F. Trojahn, M. Hoffmann
Robust Speech Analysis Based on Source-Filter Model Using Multivariate Empirical Mode Decomposition in Noisy Environments

This paper proposes a robust speech analysis method based on the source-filter model that uses multivariate empirical mode decomposition (MEMD) under noisy conditions. The proposed method has two stages. At the first stage, the magnitude spectrum of the noisy speech signal is decomposed by MEMD into intrinsic mode functions (IMFs), and the IMFs corresponding to the noise are removed. At the second stage, the log-magnitude spectrum of the noise-reduced signal is decomposed into IMFs, which are divided into two groups: the first, characterized by the spectral fine structure, is used for fundamental frequency estimation, and the second, characterized by the frequency response of the vocal-tract filter, is used for formant frequency estimation. As opposed to conventional linear prediction (LP) and cepstrum methods, the proposed method decomposes the noise automatically in the magnitude spectral domain and makes the noise mixture sparse in the log-magnitude spectral domain. The results show that the proposed method outperforms the LP and cepstrum methods under noisy conditions.

Surasak Boonkla, Masashi Unoki, Stanislav S. Makhanov
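
As a rough stand-in for MEMD (for which the abstract gives no implementation details), the sketch below applies the standard one-dimensional EMD from the PyEMD package to a log-magnitude spectrum and splits fast from slow IMFs; the random frame and the two-IMF split point are placeholders:

import numpy as np
from PyEMD import EMD                  # pip install EMD-signal

rng = np.random.default_rng(0)
speech_frame = rng.normal(size=512)    # placeholder for a windowed speech frame

log_mag = np.log(np.abs(np.fft.rfft(speech_frame)) + 1e-12)
imfs = EMD().emd(log_mag)              # IMFs, ordered from fast to slow

fine = imfs[:2].sum(axis=0)            # fast IMFs: spectral fine structure (F0 cues)
envelope = imfs[2:].sum(axis=0)        # slow IMFs: spectral envelope (formant cues)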
Scenarios of Multimodal Information Navigation Services for Users in Cyberphysical Environment

Cyberphysical systems (CPS) provide a broad range of possibilities in many fields of human activity, such as multimodal human-computer interaction (HCI). The paper discusses the architecture of the multimodal information navigation system under development at SPIIRAS, considering an approach to building a corporate information subsystem for tracking events, scheduling, and displaying information on a distributed set of stationary monitors. The subsystem architecture is described in detail. The suggested schedule generation algorithm showed high performance. The subsystem uses standard network technologies, is not tied to any software or hardware platform, meets extensibility and portability criteria, and may be used as a component of the cyberphysical environment in various organizations. Scenarios of user handling depending on user status are presented.

Irina Vatamaniuk, Dmitriy Levonevskiy, Anton Saveliev, Alexander Denisov
Scores Calibration in Speaker Recognition Systems

It is well known that variability in speech signal quality affects the performance of speaker recognition systems. A difference in speech quality between enrollment and test utterances leads to score shifts and performance degradation, so score calibration is required to keep speaker recognition effective in these circumstances. The speech signal parameters with the strongest impact on speaker recognition performance are total speech duration, signal-to-noise ratio and reverberation time; their variability leads to score shifts and unreliable accept/reject decisions. In this paper we investigate the effect of speech duration variability on calibration when the enrollment and test utterances originate from the same channel. An effective score stabilization method is also presented.

Andrey Shulipa, Sergey Novoselov, Yuri Matveev
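
Score calibration of this kind is commonly implemented as logistic regression mapping raw scores (optionally with quality factors such as duration) to well-calibrated log-odds. A sketch with invented development trials:

import numpy as np
from sklearn.linear_model import LogisticRegression

# Hypothetical development trials: raw score, test-utterance duration (s),
# and whether the trial is a same-speaker (1) or different-speaker (0) pair.
scores  = np.array([2.1, 0.3, 3.2, -0.5, 1.7, -1.2, 2.8, 0.1])
dur     = np.array([30.0, 5.0, 60.0, 4.0, 12.0, 6.0, 45.0, 8.0])
target  = np.array([1, 0, 1, 0, 1, 0, 1, 0])

X = np.column_stack([scores, np.log(dur)])
cal = LogisticRegression().fit(X, target)

def calibrated_log_odds(score, duration):
    """Duration-aware calibrated log-odds of a same-speaker trial."""
    return cal.intercept_[0] + cal.coef_[0] @ np.array([score, np.log(duration)])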
Selecting Keypoint Detector and Descriptor Combination for Augmented Reality Application

In this paper, we compare the performance of image keypoint detectors and descriptors on the well-known Oxford dataset, using the evaluation criteria presented by Mikolajczyk et al. [12, 13]. We created most of the possible combinations of keypoint detector and descriptor, but present only selected pairs here. The best-performing detector and descriptor pair is selected for future research, mainly focused on augmented reality.

Lukáš Bureš, Luděk Müller
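
A typical detector/descriptor pairing and matching step looks as follows in OpenCV; ORB and the file names stand in for the many combinations the paper actually evaluates:

import cv2

img1 = cv2.imread("graf1.png", cv2.IMREAD_GRAYSCALE)   # placeholder image pair
img2 = cv2.imread("graf2.png", cv2.IMREAD_GRAYSCALE)

orb = cv2.ORB_create(nfeatures=1000)          # detector + binary descriptor
k1, d1 = orb.detectAndCompute(img1, None)
k2, d2 = orb.detectAndCompute(img2, None)

matcher = cv2.BFMatcher(cv2.NORM_HAMMING, crossCheck=True)
matches = sorted(matcher.match(d1, d2), key=lambda m: m.distance)
print(len(matches), "cross-checked matches")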
Semi-automatic Speaker Verification System Based on Analysis of Formant, Durational and Pitch Characteristics

Modern speaker verification systems take advantage of a number of complementary base classifiers by fusing them to obtain reliable verification decisions. The paper presents a semi-automatic speaker verification system based on the fusion of formant frequencies, phone durations and pitch characteristics. Experimental results demonstrate that combining these characteristics improves speaker verification performance. To further improve the performance and cost-effectiveness of the pitch subsystem, we selected the most informative pitch characteristics.

Elena Bulgakova, Aleksey Sholohov
Speaker-Dependent Bottleneck Features for Egyptian Arabic Speech Recognition

In this paper, several ways to improve a speech recognition system for the Egyptian dialect of Arabic are presented. The research is based on the CALLHOME Egyptian Arabic corpus. We demonstrate the contribution of speaker-dependent bottleneck features trained on other languages and verify that a small Modern Standard Arabic (MSA) corpus can be applied to derive phonetic transcriptions. The resulting systems demonstrate good results compared to those published previously.

Aleksei Romanenko, Valentin Mendelev
Speech Acts Annotation of Everyday Conversations in the ORD Corpus of Spoken Russian

The paper describes the annotation principles developed for tagging speech acts in the "One Day of Speech" (ORD) corpus of Russian everyday speech, with special attention paid to the categories and subcategories of speech acts distinguished in the ORD. Speech act annotation is part of the pragmatic annotation of the corpus, which also includes the tagging of macro- and microepisodes of verbal communication. Speech acts are annotated on four levels: (1) the orthographic transcript with information on syntagmatic and phrasal boundaries, (2) the speaker's code, (3) the main category of a speech act, and (4) its subcategory. Practical approbation of the proposed annotation scheme was performed on the material of 6 macroepisodes of everyday communication, in which 2250 speech acts were discerned. The pragmatic annotation of the ORD corpus provides an opportunity to study everyday discourse in terms of speech acts and to study the linguistic properties and patterns of speech acts of different types.

Tatiana Sherstinova
Speech Enhancement with Microphone Array Using a Multi Beam Adaptive Noise Suppressor

This paper presents a new speech enhancement method with a microphone array for the joint suppression of coherent and diffuse noise. The proposed method combines two techniques: beamforming steered at the target and at the noise, and adaptive noise suppression. The microphone array forms two beams, steered in the directions of the target speaker and of the noise source; the reference beam signal is used to suppress the noise in the primary channel. The proposed Adaptive Noise Suppressor (ANS) algorithm transforms the signal spectrum of the reference channel into the noise spectrum of the main channel using a noise equalizer and a dual-channel spectral subtraction algorithm. The effectiveness of the proposed technique is confirmed under varying real-life coherent and diffuse noise conditions. The experimental results show that the proposed method is an efficient procedure for speech quality improvement in real-life noisy and reverberant conditions, with SNRs down to −5 dB and reverberation times up to 0.88 s.

Mikhail Stolbov, Alexander Lavrentyev
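
A minimal sketch of the dual-channel spectral subtraction step for one STFT frame, assuming the noise equalizer has already been estimated on noise-only segments; the spectral floor value is invented:

import numpy as np

def dual_channel_subtract(main_spec, ref_spec, eq, floor=0.1):
    """One STFT frame of dual-channel spectral subtraction.

    main_spec, ref_spec: complex spectra of the target and reference beams.
    eq: per-bin equalizer mapping the reference noise level to the main
        channel (estimated during noise-only segments in a real system).
    """
    noise_mag = eq * np.abs(ref_spec)
    clean_mag = np.maximum(np.abs(main_spec) - noise_mag,
                           floor * np.abs(main_spec))    # spectral floor
    return clean_mag * np.exp(1j * np.angle(main_spec))  # keep main-channel phase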
Speech Features Evaluation for Small Set Automatic Speaker Verification Using GMM-UBM System

This paper overviews the application sphere of speaker verification systems and illustrates the use of the Gaussian mixture model with a universal background model (GMM-UBM) in an automatic text-independent speaker verification task. An experimental evaluation of the GMM-UBM system with different speech features was conducted on a 50-speaker set. Using a 256-component Gaussian mixture model and a feature vector containing 14 mel-frequency cepstral coefficients (MFCC) plus the voicing probability, the equal error rate (EER) is 0.76 %, a 23.7 % relative EER improvement over the standard 14-MFCC vector.

Ivan Rakhmanenko, Roman Meshcheryakov
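
A simplified GMM-UBM scoring sketch with scikit-learn; note that a production system MAP-adapts the UBM means to the enrollee rather than refitting, and the random features below merely stand in for MFCC(+voicing) vectors:

import numpy as np
from sklearn.mixture import GaussianMixture

rng = np.random.default_rng(0)
ubm_feats = rng.normal(size=(5000, 15))            # stand-in for pooled features
spk_feats = rng.normal(0.3, 1.0, size=(600, 15))   # stand-in for one enrollee
test      = rng.normal(0.3, 1.0, size=(300, 15))   # stand-in for a test utterance

ubm = GaussianMixture(n_components=256, covariance_type="diag").fit(ubm_feats)

# Simplification: refit on speaker data; real GMM-UBM MAP-adapts the UBM means.
spk = GaussianMixture(n_components=256, covariance_type="diag",
                      means_init=ubm.means_).fit(spk_feats)

llr = spk.score(test) - ubm.score(test)            # average log-likelihood ratio
print("LLR:", llr)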
Speech Recognition Combining MFCCs and Image Features

Automatic speech recognition (ASR) is a well-known task spanning fields such as Natural Language Processing (NLP), Digital Signal Processing (DSP) and Machine Learning (ML). In this work, a robust supervised classification model (MFCCs + autocor + SVM) is presented for solo speech signals. Mel-frequency cepstral coefficients (MFCCs) are combined with Content-Based Image Retrieval (CBIR) features extracted from the spectrogram produced by each frame of the speech signal. The improvement in classification accuracy from these extended feature vectors over MFCCs alone is examined with several classifiers in three scenarios with different numbers of speakers.

Stamatis Karlos, Nikos Fazakis, Katerina Karanikola, Sotiris Kotsiantis, Kyriakos Sgarbas
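
A reduced sketch of the feature pipeline: MFCC statistics concatenated with crude spectrogram statistics, fed to an SVM. The paper's CBIR descriptors are considerably richer than the two moments used here, and the file paths and labels are placeholders:

import numpy as np
import librosa
from sklearn.svm import SVC

def features(path):
    y, sr = librosa.load(path, sr=16000)
    mfcc = librosa.feature.mfcc(y=y, sr=sr, n_mfcc=13)
    spec = np.abs(librosa.stft(y))
    # MFCC statistics plus two crude spectrogram "image" statistics.
    return np.hstack([mfcc.mean(axis=1), mfcc.std(axis=1),
                      [spec.mean(), spec.std()]])

paths = ["a1.wav", "b1.wav", "a2.wav", "b2.wav"]   # placeholder corpus
labels = np.array([0, 1, 0, 1])
X = np.array([features(p) for p in paths])
clf = SVC(kernel="rbf").fit(X, labels)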
Sociolinguistic Extension of the ORD Corpus of Russian Everyday Speech

The ORD corpus is one of the largest resources of contemporary spoken Russian. By 2014, its collection numbered about 400 h of recordings made by a group of 40 respondents (20 men and 20 women of different ages and professions) who volunteered to spend a whole day with a switched-on voice recorder, recording all their verbal communication. The corpus presents unique linguistic material recorded in natural communicative situations, allowing spoken Russian and everyday discourse to be studied in many aspects. However, the original sample of respondents was not sufficient for studying sociolinguistic variation in speech, so a large project aiming at the sociolinguistic extension of the ORD was launched with the support of the Russian Science Foundation. The paper describes the general principles of this sociolinguistic extension: it defines the social groups that should be represented in the corpus in adequate numbers, sets criteria for selecting participants, describes the "recorder's kit" given to respondents, and presents the principles for adapting the ORD annotation scheme and structure. The ORD collection now exceeds 1200 h of recordings, presenting the speech of 127 respondents and hundreds of their interlocutors; 2450 macroepisodes of everyday spoken communication have already been annotated, and the speech transcripts add up to 1 million words.

Natalia Bogdanova-Beglarian, Tatiana Sherstinova, Olga Blinova, Olga Ermolova, Ekaterina Baeva, Gregory Martynenko, Anastasia Ryko
Statistical Analysis of Acoustical Parameters in the Voice of Children with Juvenile Dysphonia

The goal of this research is to answer the following question: is it necessary to build a completely different system to automatically recognize functional dysphonia (FD) in children, or is it possible to train the system with healthy and pathological voices of adults? To this end, preliminary statistical analyses were carried out between healthy and functional dysphonia voices of children, and between healthy children's and healthy adults' voices. The analyses lead to the conclusion that variations in jitter and shimmer values, together with the harmonics-to-noise ratio (HNR) and the first mel-frequency cepstral coefficient (MFCC1), are good indicators for separating healthy and FD voices in children as well. Comparing healthy child and healthy adult samples gave the clear conclusion that differences in the examined acoustical parameters exist even between these healthy groups, so investigations must be carried out separately on children's voices; adult voices cannot be used to draw conclusions about children's voices. Lastly, the differences between adult female and male samples were examined. The results confirmed our assumption that an automatic decision-making system for recognizing FD should be built separately for adult males, adult females and children.

Miklós Gábriel Tulics, Ferenc Kazinczi, Klára Vicsi
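
A typical group comparison behind such conclusions is a rank test on one acoustic parameter between the two populations; the jitter values below are invented:

import numpy as np
from scipy.stats import mannwhitneyu

# Hypothetical jitter (%) values for healthy vs. functional-dysphonia children.
healthy = np.array([0.45, 0.52, 0.38, 0.61, 0.47, 0.50])
fd      = np.array([0.88, 1.10, 0.95, 1.32, 0.79, 1.05])

u, p = mannwhitneyu(healthy, fd, alternative="two-sided")
print(f"U = {u}, p = {p:.4f}")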
Stress, Arousal, and Stress Detector Trained on Acted Speech Database

This paper reports on initial experiments with the creation of a suitable database for training and testing stress detection systems for speech, and presents first experimental results. Based on the psychological understanding of the concepts of stress and emotion, we operationalized stress as a level of arousal, which can be detected in speech. We describe a speech database with three levels of "acted stress" and three levels of soothing. In the first experiment performed on the database, we detect the different levels of stress using Gaussian mixture models. The accuracy of detecting three levels of stress was 89 % for speakers included in the training database and 73 % for speakers whose recordings were not used during the adaptation of the GMM models.

Róbert Sabo, Milan Rusko, Andrej Ridzik, Jakub Rajčáni
Study on the Improvement of Intelligibility for Elderly Speech Using Formant Frequency Shift Method

Populations in developed countries are aging, and elderly people have difficulty controlling their articulation accurately due to aging, so the quality of elderly speech needs to be improved for smooth communication. In this paper, we analyzed the first formant frequency (F1) and second formant frequency (F2) of more intelligible and less intelligible speech of Japanese elderly people. In addition, we improved the intelligibility of less intelligible elderly speech using a formant frequency shift method: formant frequencies estimated by LPC analysis are corrected by a shift value, a magnification factor that expands the F1-F2 space of less intelligible speech.

Yuto Tanaka, Mitsunori Mizumachi, Yoshihisa Nakatoh
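
The shift rule itself cannot be reproduced from the abstract, but the LPC step it relies on, estimating F1 and F2 from the root angles of the LPC polynomial, can be sketched; the file name and model order are placeholders:

import numpy as np
import librosa

y, sr = librosa.load("elderly_vowel.wav", sr=10000)   # placeholder recording
a = librosa.lpc(y, order=10)                          # LPC coefficients

roots = np.roots(a)
roots = roots[np.imag(roots) > 0]                     # keep one root per pair
freqs = np.sort(np.angle(roots) * sr / (2 * np.pi))   # candidate formants, Hz
f1, f2 = freqs[0], freqs[1]
print("F1 ~ %.0f Hz, F2 ~ %.0f Hz" % (f1, f2))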
Text Classification in the Domain of Applied Linguistics as Part of a Pre-editing Module for Machine Translation Systems

This article describes a method of document classification for Russian, based on a vector space model, for the domain of Applied Linguistics. The method classifies input texts into two categories: applied linguistics texts (AL) and non-applied linguistics texts (nonAL). It is implemented using the TF-IDF statistical measure and cosine similarity as the evaluation measure. The study gives promising results and opens up prospects for applying this approach to text classification in other languages.

Ksenia Oskina
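
A minimal sketch of the described pipeline: TF-IDF vectors, a centroid of the AL training texts, and a cosine-similarity threshold; the toy English snippets and the threshold value are placeholders for the Russian data:

import numpy as np
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity

train_al    = ["corpus annotation scheme", "part of speech tagging model"]
train_nonal = ["stock market daily report", "football match summary"]

vec = TfidfVectorizer()
X = vec.fit_transform(train_al + train_nonal)
centroid_al = np.asarray(X[: len(train_al)].mean(axis=0))   # mean AL vector

def is_applied_linguistics(text, threshold=0.2):
    """Classify by cosine similarity to the AL centroid (threshold invented)."""
    sim = cosine_similarity(vec.transform([text]), centroid_al)[0, 0]
    return sim >= threshold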
Tonal Specification of Perceptually Prominent Non-nuclear Pitch Accents in Russian

The paper deals with the tonal characteristics of perceptually prominent prosodic words in the pre-nuclear part of the intonational phrase. The research is based on a 20-hour part of the annotated Russian speech corpus CORPRES. Non-nuclear prominent words are grouped according to the direction of the pitch movement on the stressed syllable. It is shown that the pitch accent shape on these words is highly correlated with the type of pitch movement on the nucleus: most often, falling pre-nuclear accents occur with a rising nucleus, and rising accents with a falling nucleus. Emotional or highly individual speech may contain more complex pitch movements in the pre-nuclear part (e.g. fall-rise), or a sequence of prominent words with the same pattern.

Nina Volskaya, Tatiana Kachkovskaia
Toward Sign Language Motion Capture Dataset Building

The article deals with a recording procedure for building a motion dataset, mainly for sign language synthesis systems. Data gloves and two types of optical motion capture techniques are considered as sources of sign language data for training more natural and acceptable body movements of signing avatars. A summary of the state-of-the-art technologies provides an overview of the possibilities, and also of the limiting factors, of sign language recording. Combining the motion capture technologies overcomes the existing difficulties of the complex task of recording both the manual and the non-manual components of sign language. The result is a recording procedure for simultaneous motion capture of a signing subject, supporting further research into the as yet unexplored phenomenon of sign language production by humans.

Zdeněk Krňoul, Pavel Jedlička, Jakub Kanis, Miloš Železný
Trade-Off Between Speed and Accuracy for Noise Variance Minimization (NVM) Pitch Estimation Algorithm

A new version of the NVM algorithm [3] is proposed for precise estimation of the fundamental frequency on a short time interval under a stationary voice model. Its computational complexity is proportional to that of an FFT on the same interval. A precise trade-off between approximation error and numerical speed is established.

Andrey Barabanov, Aleksandr Melnikov
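
The NVM criterion itself is not given in the abstract; as an illustration of a pitch estimator whose cost is dominated by one FFT pair, here is a plain autocorrelation-based estimate:

import numpy as np

def f0_via_fft(x, fs, fmin=60.0, fmax=400.0):
    """Coarse F0 via the autocorrelation computed with one FFT pair.

    Illustrates an estimator with FFT-dominated cost; the NVM algorithm
    itself uses a different (noise-variance) criterion.
    """
    n = 2 * len(x)                                   # zero-pad to avoid wrap-around
    spec = np.fft.rfft(x, n)
    ac = np.fft.irfft(np.abs(spec) ** 2)[: len(x)]   # autocorrelation
    lo, hi = int(fs / fmax), int(fs / fmin)          # admissible lag range
    lag = lo + np.argmax(ac[lo:hi])
    return fs / lag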
Unsupervised Trained Functional Discourse Parser for e-Learning Materials Scaffolding

The article describes a method for automatically segmenting natural language text into fragments with different functional semantics. The proposed solution is based on an analysis of how the various parts of speech are distributed through the text: the number and variety of nouns, verbs and adjectives are calculated over a set of fixed-length sliding windows, and the text is divided into fragments by clustering the set of windows. We considered two clustering methods: ISODATA and a method based on the minimum spanning tree. The results of comparing the methods with each other and with manual text markup are presented.

Varvara Krayvanova, Svetlana Duka
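
A sketch of the windowing-and-clustering idea, with k-means standing in for ISODATA (which additionally adjusts the number of clusters itself); the randomly tagged tokens are placeholders for real tagger output:

import numpy as np
from sklearn.cluster import KMeans

def window_features(pos_tags, width=50, step=10):
    """Count and variety of N/V/A tags in each fixed-length sliding window."""
    feats = []
    for start in range(0, len(pos_tags) - width + 1, step):
        win = pos_tags[start:start + width]
        row = []
        for pos in ("NOUN", "VERB", "ADJ"):
            tokens = [t for t, p in win if p == pos]
            row += [len(tokens), len(set(tokens))]   # amount and variety
        feats.append(row)
    return np.array(feats)

rng = np.random.default_rng(0)
POS = ["NOUN", "VERB", "ADJ", "PRON", "ADV"]
pos_tags = [(f"w{i}", rng.choice(POS)) for i in range(400)]   # dummy tagged text

labels = KMeans(n_clusters=3, n_init=10).fit_predict(window_features(pos_tags))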
Backmatter
Metadata
Title
Speech and Computer
Editors
Andrey Ronzhin
Rodmonga Potapova
Géza Németh
Copyright Year
2016
Electronic ISBN
978-3-319-43958-7
Print ISBN
978-3-319-43957-0
DOI
https://doi.org/10.1007/978-3-319-43958-7
