
2015 | Book

Speech and Computer

17th International Conference, SPECOM 2015, Athens, Greece, September 20-24, 2015, Proceedings


About this book

This book constitutes the refereed proceedings of the 17th International Conference on Speech and Computer, SPECOM 2015, held in Athens, Greece, in September 2015. The 59 revised full papers presented together with 2 invited talks were carefully reviewed and selected from 104 initial submissions. The papers cover a wide range of topics in the area of computer speech processing, such as recognition, synthesis, and understanding, and related domains including signal processing, language and text processing, multi-modal speech processing, and human-computer interaction.

Table of Contents

Frontmatter

Invited Talks

Frontmatter
Multimodal Human-Robot Interaction from the Perspective of a Speech Scientist

Human-Robot Interaction (HRI) is a research area that has developed steadily in recent years. While robots in the last decades of the 20th century were mostly constructed to work autonomously, the rise of service robots during the last 20 years has driven the development of effective communication methods between human users and robots. This development has accelerated further with the advancement of humanoid robots, where the demand for effective human-robot interaction is even more obvious. It is also remarkable that, inspired by the success of HRI in service and humanoid robotics, human-robot interfaces are nowadays becoming attractive even in areas where HRI has never played a major role before, especially for industrial robots and robots in outdoor environments. Compared to classical human-computer interaction (HCI), the basic interaction algorithms are not very different in HRI; e.g., a speech or gesture recognizer would not work much differently in the two domains. The major differences between HCI and HRI lie rather in the different utilization of modalities, which also depends strongly on the type of robot employed. Therefore, the primary goal of this paper is to describe the major differences between HCI and HRI, to present the most important modalities used in HRI, and to show how they affect the interaction depending on the various types of available robot platforms.

Gerhard Rigoll
A Decade of Discriminative Language Modeling for Automatic Speech Recognition

This paper summarizes the research on discriminative language modeling focusing on its application to automatic speech recognition (ASR). A discriminative language model (DLM) is typically a linear or log-linear model consisting of a weight vector associated with a feature vector representation of a sentence. This flexible representation can include linguistically and statistically motivated features that incorporate morphological and syntactic information. At test time, DLMs are used to rerank the output of an ASR system, represented as an N-best list or lattice. During training, both negative and positive examples are used with the aim of directly optimizing the error rate. Various machine learning methods, including the structured perceptron, large margin methods and maximum regularized conditional log-likelihood, have been used for estimating the parameters of DLMs. Typically positive examples for DLM training come from the manual transcriptions of acoustic data while the negative examples are obtained by processing the same acoustic data with an ASR system. Recent research generalizes DLM training by either using automatic transcriptions for the positive examples or simulating the negative examples.
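
A minimal sketch of this reranking step, assuming a toy unigram feature map and hypothetical weights (the actual DLM features include the morphological and syntactic information described above):

# Sketch of DLM-style N-best reranking: score(h) = asr_score(h) + w . f(h).
# The feature map and weights below are illustrative placeholders.
from collections import Counter

def features(hypothesis):
    # toy feature map: unigram counts
    return Counter(hypothesis.split())

def dlm_score(hypothesis, asr_score, weights):
    feats = features(hypothesis)
    return asr_score + sum(weights.get(f, 0.0) * v for f, v in feats.items())

def rerank(nbest, weights):
    # nbest: list of (hypothesis_text, asr_score) pairs
    return max(nbest, key=lambda h: dlm_score(h[0], h[1], weights))

# weights would be learned offline, e.g. by a structured perceptron
weights = {"recognize": 0.4, "wreck": -0.6}
nbest = [("wreck a nice beach", -10.2), ("recognize speech", -10.5)]
print(rerank(nbest, weights))  # -> ('recognize speech', -10.5)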

Murat Saraclar, Erinc Dikici, Ebru Arisoy

Conference Papers

Frontmatter
A Bilingual Kazakh-Russian System for Automatic Speech Recognition and Synthesis

The paper presents a system for speech recognition and synthesis for the Kazakh and Russian languages. It is designed for use by speakers of Kazakh; due to the prevalence of bilingualism among Kazakh speakers, it was considered essential to design a bilingual Kazakh-Russian system. Developing our system involved building a text processing and transcription system that deals with both Kazakh and Russian text, and is used in both speech synthesis and recognition applications. We created a Kazakh TTS voice and an additional Russian voice using the recordings of the same bilingual voice artist. A Kazakh speech database was collected and used to train deep neural network acoustic models for the speech recognition system. The resulting models demonstrated sufficient performance for practical applications in interactive voice response and keyword spotting scenarios.

Olga Khomitsevich, Valentin Mendelev, Natalia Tomashenko, Sergey Rybin, Ivan Medennikov, Saule Kudubayeva
A Comparative Study of Speech Processing in Microphone Arrays with Multichannel Alignment and Zelinski Post-Filtering

In this paper we present the results of a comparative study of algorithms for speech signal processing in a microphone array. We compared the multichannel alignment method and two modifications of the well-known Zelinski post-filtering. Comparisons were performed using artificial and real signals recorded in real noisy environments. The experiments helped us to devise recommendations for choosing a suitable method of signal processing for different noise conditions.

Sergei Aleinik, Mikhail Stolbov
A Comparison of RNN LM and FLM for Russian Speech Recognition

In this paper, we describe research on recurrent neural network (RNN) language models (LMs) for N-best list rescoring in automatic continuous Russian speech recognition and compare them with a factored language model (FLM). We experimented with RNNs with different numbers of units in the hidden layer. For FLM creation, we used five linguistic factors: word, lemma, stem, part-of-speech, and morphological tag. All models were trained on a text corpus of 350M words. We also linearly interpolated the RNN LM and the FLM with the baseline 3-gram LM. We achieved a relative WER reduction of 8 % using the FLM and 14 % using the RNN LM with respect to the baseline model.
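
A minimal sketch of the linear interpolation used in such rescoring, assuming per-word probabilities from the two models and a hypothetical interpolation weight:

import math

def interpolated_logprob(p_rnn, p_ngram, lam=0.5):
    # linear interpolation of two LM probabilities for one word;
    # lam is tuned on held-out data (the value here is an assumption)
    return math.log(lam * p_rnn + (1.0 - lam) * p_ngram)

def hypothesis_score(word_probs_rnn, word_probs_ngram, lam=0.5):
    # sum of interpolated log-probabilities, used to rescore one
    # entry of an N-best list
    return sum(interpolated_logprob(pr, pn, lam)
               for pr, pn in zip(word_probs_rnn, word_probs_ngram))

# hypothetical per-word probabilities for a two-word hypothesis
print(hypothesis_score([0.10, 0.20], [0.05, 0.30]))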

Irina Kipyatkova, Alexey Karpov
A Frequency Domain Adaptive Decorrelating Algorithm for Speech Enhancement

In this paper, we propose a new frequency-domain symmetric adaptive decorrelating (FD-SAD) algorithm to cancel punctual noise components from noisy observations. The proposed FD-SAD algorithm is combined with the forward blind source separation (FBSS) structure to improve the performance of the time-domain symmetric adaptive decorrelating (TD-SAD) algorithm. The new FD-SAD algorithm shows fast convergence and good tracking behaviour even in very noisy conditions.

Mohamed Djendi, Feriel Khemies, Amina Morsli
Acoustic Markers of Emotional State “Aggression”

The paper presents the results of a comparison between auditory-perceptual and acoustic data obtained from the analysis of Russian, English, Spanish and Tatar speech produced in the emotional state of aggression [5]. Issues concerning the selection of relevant features in studies of this type are described. It is pointed out that the emotional state of aggression, as a special case of the emotional state of physiological excitation, is characterized by acoustic markers that do not always conform to the common-sense rule that the values being measured must correspond. The concluding part of the paper presents preliminary approximate qualitative estimates of the auditory-perceptual and acoustic characteristics describing the pronunciational manifestation of the emotional state of aggression in the languages mentioned.

Rodmonga Potapova, Liliya Komalova, Nikolay Bobrov
Algorithms for Low Bit-Rate Coding with Adaptation to Statistical Characteristics of Speech Signal

The article establishes the general trends of speech coding algorithms based on linear prediction. The task of adapting a speech codec to the statistical characteristics of the coding parameters is set and accomplished, and the main procedures for forming these parameters are examined. The results of experimental studies of the developed adaptive low bit-rate coding algorithms are presented. The gains in the quality of reconstructed speech in comparison with algorithms based on the FS1015, FS1017 and FS1016 standards and Full-rate GSM are demonstrated.

Anton Saveliev, Oleg Basov, Andrey Ronzhin, Alexander Ronzhin
Analysing Human-Human Negotiations with the Aim to Develop a Dialogue System

We are studying human-human spoken dialogues in the Estonian dialogue corpus with the aim of designing a dialogue system which carries out negotiation with a user in natural language. Three sub-corpora have been analyzed: (1) telemarketing calls where a sales clerk of an educational company tries to persuade a customer to take a training course; (2) conversations between a travel agent and a customer who is planning a trip; and (3) everyday conversations where one participant argues for the partner to perform an action. A special case of negotiation – a debate where the participants have contradicting communicative goals – has been implemented as an experimental dialogue system.

Mare Koit
Analysis of Facial Motion Capture Data for Visual Speech Synthesis

The paper deals with the interpretation of facial motion capture data for visual speech synthesis. For the purpose of analysis, visual speech composed of 170 artificially created words was recorded from one speaker using a state-of-the-art face motion capture method. A new nonlinear method is proposed to approximate the motion capture data using a purposely defined set of articulatory parameters. The comparison shows that the proposed method outperforms the baseline method with the same number of parameters. The precision of the approximation is evaluated on parameter values extracted from an unseen dataset and also verified with a 3D animated model of a human head reproducing the visual speech in an artificial manner.

Miloš Železný, Zdeněk Krňoul, Pavel Jedlička
Auditory-Perceptual Recognition of the Emotional State of Aggression

The authors propose several stages for researching the verbal-cognitive mechanisms behind the formation and development of the verbal realization of the emotional state of aggression. This paper describes an experimental study investigating the auditory-perceptual analysis of male scenic speech in the emotional state of aggression (for the Russian, English, Spanish and Tatar languages). The results statistically confirm the detected auditory-perceptual “passports” of the emotional state of aggression and demonstrate differences in the auditory perception of verbal aggressive behavior between groups of male and female listeners.

Rodmonga Potapova, Liliya Komalova
Automatic Classification and Prediction of Attitudes: Audio - Visual Analysis of Video Blogs

This paper reports a study of automatic attitude recognition from a collection of over 500 segments of our video blog data. We annotated and analysed 3 different attitudinal states of the speakers. Following that, we extracted and analysed prosodic and visual features relevant to the classification task. We use machine learning methods and techniques to attain a better understanding of the feature sets and their contribution to the prediction model.

Noor Alhusna Madzlan, Yuyun Huang, Nick Campbell
Automatic Close Captioning for Live Hungarian Television Broadcast Speech: A Fast and Resource-Efficient Approach

In this paper, the application of LVCSR (Large Vocabulary Continuous Speech Recognition) technology is investigated for real-time, resource-limited broadcast close captioning. The work focuses on transcribing live broadcast conversation speech to make such programs accessible to deaf viewers. Due to computational limitations, real time factor (RTF) and memory requirements are kept low during decoding with various models tailored for Hungarian broadcast speech recognition. Two decoders are compared on the direct transcription task of broadcast conversation recordings, and setups employing re-speakers are also tested. Moreover, the models are evaluated on a broadcast news transcription task as well, and different language models (LMs) are tested in order to demonstrate the performance of our systems in settings when low memory consumption is a less crucial factor.

Ádám Varga, Balázs Tarján, Zoltán Tobler, György Szaszák, Tibor Fegyó, Csaba Bordás, Péter Mihajlik
Automatic Estimation of Web Bloggers’ Age Using Regression Models

In this article, we address the problem of automatic age estimation of web users based on their posts. Most studies on age identification treat the issue as a classification problem. Instead of following an age-category classification approach, we investigate the appropriateness of several regression algorithms for the task of age estimation, evaluating a number of well-known and widely used machine learning algorithms for numerical estimation using a set of 42 text features. The experimental results showed that the Bagging algorithm with a REPTree base learner offered the best performance, estimating web users' age with a mean absolute error of 5.44 and a root mean squared error of approximately 7.14.
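
A minimal sketch of such a bagged regression-tree setup; scikit-learn's DecisionTreeRegressor stands in for Weka's REPTree, and random placeholder data stands in for the 42 text features:

import numpy as np
from sklearn.ensemble import BaggingRegressor
from sklearn.tree import DecisionTreeRegressor
from sklearn.model_selection import train_test_split
from sklearn.metrics import mean_absolute_error, mean_squared_error

rng = np.random.default_rng(0)
X = rng.normal(size=(500, 42))        # placeholder for 42 text features
y = rng.uniform(15, 60, size=500)     # placeholder author ages

X_tr, X_te, y_tr, y_te = train_test_split(X, y, random_state=0)
model = BaggingRegressor(             # 'base_estimator' in older scikit-learn
    estimator=DecisionTreeRegressor(min_samples_leaf=5),
    n_estimators=10, random_state=0).fit(X_tr, y_tr)
pred = model.predict(X_te)
print("MAE:", mean_absolute_error(y_te, pred))
print("RMSE:", mean_squared_error(y_te, pred) ** 0.5)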

Vasiliki Simaki, Christina Aravantinou, Iosif Mporas, Vasileios Megalooikonomou
Automatic Preprocessing Technique for Detection of Corrupted Speech Signal Fragments for the Purpose of Speaker Recognition

In this paper we propose a preprocessing technique which detects clicks, tones, overloads, clipping, etc., as well as discovering the parts of the signal containing good-quality speech. As a result, the performance of the speaker recognition system increases significantly. In describing the noise detectors, we aim only to provide a full list of the algorithms we used and the parameters obtained in our experiments. The main goal of the paper is to demonstrate that a set of simple detectors is very effective in detecting speech for the speaker recognition task under real noise conditions.

Konstantin Simonchik, Sergei Aleinik, Dmitry Ivanko, Galina Lavrentyeva
Automatic Sound Recognition of Urban Environment Events

Audio analysis of a speaker's surroundings is a first step for several processing systems that support the speaker's mobility through daily life. These algorithms usually operate in short-time analysis, decomposing the incoming events in the time and frequency domains. In this paper, an automatic sound recognizer is studied which detects audio events of interest in an urban environment. Our experiments were conducted using a closed set of audio events from which well-known and commonly used audio descriptors were extracted; models were trained using powerful machine learning algorithms. The best urban sound recognition performance was achieved by SVMs, with an accuracy of approximately 93 %.
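
A minimal sketch of a descriptor-plus-SVM pipeline of this kind, assuming MFCC means as the audio descriptors (a stand-in for the paper's full descriptor set) and placeholder file names:

import numpy as np
import librosa
from sklearn.svm import SVC

def describe(wav_path):
    # one fixed-length descriptor vector per audio event
    y, sr = librosa.load(wav_path, sr=None)
    mfcc = librosa.feature.mfcc(y=y, sr=sr, n_mfcc=13)
    return mfcc.mean(axis=1)

files = ["siren.wav", "drill.wav"]    # placeholder event recordings
labels = [0, 1]                       # placeholder event classes
X = np.stack([describe(f) for f in files])
clf = SVC(kernel="rbf").fit(X, labels)
print(clf.predict(X))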

Theodoros Theodorou, Iosif Mporas, Nikos Fakotakis
Automatically Trained TTS for Effective Attacks to Anti-spoofing System

This article continues a priority research direction concerning the spoofing of voice biometric systems. We continue exploring speech synthesis spoofing attacks based on creating a text-to-speech voice. In this work we focus on a completely automatic way to create new voices for a text-to-speech system and investigate the vulnerability of a state-of-the-art spoofing detection system to such attacks. The results of our experiments demonstrate that 10 seconds of speech material is enough to increase the EER up to 19.67 %. Considering that an automatic method for training synthesis voices allows perpetrators to scale up spoofing attacks on biometric systems, we raise the issue of the relevance of this new type of spoofing attack and of developing effective methods to detect it.

Galina Lavrentyeva, Alexandr Kozlov, Sergey Novoselov, Konstantin Simonchik, Vadim Shchemelinin
EmoChildRu: Emotional Child Russian Speech Corpus

We present the first child emotional speech corpus in Russian, called “EmoChildRu”, which contains audio materials of 3–7 year old children. The database includes over 20 K recordings (approx. 30 h), collected from 100 children. Recordings were carried out in three controlled settings, creating different emotional states in the children: playing with a standard set of toys; repeating words from a toy parrot in a game-store setting; and watching a cartoon and retelling the story. This corpus is designed to study how emotional states are reflected in the characteristics of voice and speech, and to study the formation of emotional states in ontogenesis. A portion of the corpus is annotated for three emotional states (discomfort, neutral, comfort). Additional data include brain activity measurements (original EEG, evoked potential records), the results of adult listeners' analysis of child speech, questionnaires, and descriptions of dialogues. The paper reports two child emotional speech analysis experiments on the corpus: by adult listeners (humans) and by an automatic classifier (machine). The automatic classification results are very similar to human perception, although the accuracy is below 55 % for both, showing the difficulty of recognizing child emotion from speech under naturalistic conditions.

Elena Lyakso, Olga Frolova, Evgeniya Dmitrieva, Aleksey Grigorev, Heysem Kaya, Albert Ali Salah, Alexey Karpov
Cognitive Mechanism of Semantic Content Decoding of Spoken Discourse in Noise

This paper discusses the results of experimental research in the field of auditory recognition of the semantic content of Russian spoken discourse in noise. Spoken discourses in noise were presented to listeners for auditory perception and subsequent recognition of the main topics and subtopics of the stimuli. The research addressed the following questions: is it possible to identify topics and subtopics of spoken discourse in noise; does the quality of speech recognition and understanding depend on the subject of the spoken discourse; and how do these factors correlate? Statistical analysis revealed a main effect of semantic content on listeners' recognition of spoken text/discourse in noise (at several signal-to-noise ratios).

Rodmonga Potapova, Vsevolod Potapov
Combining Prosodic and Lexical Classifiers for Two-Pass Punctuation Detection in a Russian ASR System

We propose a system for automatic punctuation prediction in recognized speech using prosodic, word and grammatical features. An SVM classifier is trained using prosody, and a CRF classifier is trained on a large text dataset using word-based features. The probabilities are then fused to produce a joint decision on comma and period placement, with a second classification pass for question mark detection. Training two classifiers separately enables us to avoid data sparseness for the lexical classifier, and to increase the overall robustness of the system. This works well for Russian and could be applied to other inflected languages. The system was tested on different speech styles. On manual transcripts, we achieved an F-score of 50–71 % for periods, 46–66 % for commas, 19–47 % for question marks, and 77–87 % for “mark/no mark” classification. The results for recognizer output are 46–66 % for periods, 43–60 % for commas, 10–38 % for questions, and 64–80 % for “mark/no mark”.
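
A minimal sketch of the decision-level fusion, with an illustrative fusion weight and toy posteriors for one inter-word boundary:

LABELS = ["no_mark", "comma", "period"]

def fuse(p_prosodic, p_lexical, alpha=0.5):
    # p_*: dicts mapping label -> posterior probability from each classifier
    fused = {lab: alpha * p_prosodic[lab] + (1 - alpha) * p_lexical[lab]
             for lab in LABELS}
    return max(fused, key=fused.get)

print(fuse({"no_mark": 0.1, "comma": 0.6, "period": 0.3},
           {"no_mark": 0.6, "comma": 0.3, "period": 0.1}))  # -> 'comma'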

Olga Khomitsevich, Pavel Chistikov, Tatiana Krivosheeva, Natalia Epimakhova, Irina Chernykh
Construction of a Modern Greek Grammar Checker Through Mnemosyne Formalism

The aim of this paper is to present a useful and friendly electronic tool (a grammar checker) which carries out the morphological and syntactic analysis of sentences, phrases and words in order to correct syntactic, grammatical and stylistic errors. We also present the formalism used (Mnemosyne's Kanon) and the particularities of the Greek language that hinder its computational processing. Given that the major problem of Modern Greek is lexical ambiguity, we designed the Greek tagger on linguistic criteria for those cases where lexical ambiguity impedes the detection of errors in the Greek language. The texts given to the grammar checker for correction were also corrected by a person; in a very large percentage of cases, the grammar checker approaches the accuracy of the human corrector.

Panagiotis Gakis, Christos Panagiotakopoulos, Kyriakos Sgarbas, Christos Tsalidis, Verykios Vasilios
Contribution to the Design of an Expressive Speech Synthesis System for the Arabic Language

In this paper we present a contribution to the design of an expressive speech synthesis system for the Arabic language. The system uses diphone concatenation as the synthesis method to generate 10 phonetically balanced sentences in Arabic. Rules for orthographic-to-phonetic transcription are detailed, as well as the methodology employed for recording the diphone database. The sentences were synthesized with both “neutral” and “sadness” expressions and rated by 10 listeners, and the results of the test are provided.

Lyes Demri, Leila Falek, Hocine Teffahi
Deep Neural Network Based Continuous Speech Recognition for Serbian Using the Kaldi Toolkit

This paper presents a deep neural network (DNN) based large vocabulary continuous speech recognition (LVCSR) system for Serbian, developed using the open-source Kaldi speech recognition toolkit. The DNNs are initialized using stacked restricted Boltzmann machines (RBMs) and trained using cross-entropy as the objective function and the standard error backpropagation procedure in order to provide posterior probability estimates for the hidden Markov model (HMM) states. Emission densities of HMM states are represented as Gaussian mixture models (GMMs). The recipes were modified based on the particularities of the Serbian language in order to achieve the optimal results. A corpus of approximately 90 hours of speech (21000 utterances) is used for the training. The performances are compared for two different sets of utterances between the baseline GMM-HMM algorithm and various DNN settings.

Branislav Popović, Stevan Ostrogonac, Edvin Pakoci, Nikša Jakovljević, Vlado Delić
DNN-Based Speech Synthesis: Importance of Input Features and Training Data

Deep neural networks (DNNs) have been recently introduced in speech synthesis. In this paper, an investigation on the importance of input features and training data on speaker dependent (SD) DNN-based speech synthesis is presented. Various aspects of the training procedure of DNNs are investigated in this work. Additionally, several training sets of different size (i.e., 13.5, 3.6 and 1.5 h of speech) are evaluated.

Alexandros Lazaridis, Blaise Potard, Philip N. Garner
Emotion State Manifestation in Voice Features: Chimpanzees, Human Infants, Children, Adults

The goal of the study is to investigate how the emotional states of humans and chimpanzees are manifested in voice features. The participants were 5 infants aged 3 and 12 months, 30 children from 3 to 7 years old, 10 adult actors, 5 chimpanzees aged 3–17 years, and 360 adult listeners of the vocalizations and speech. Perceptual and spectrographic analysis methods were used. Listeners recognized the reflection of the discomfort state in infant vocalizations, in the speech of 3–4 year old children and in actors' meaningless speech, and the states of anger, fear and sadness in chimpanzee vocalizations and actors' speech in different languages, better than the reflection of the comfort and joy states. The pitch values and their variability, the values of the third “emotional” formant, and duration are the important acoustic features for recognizing the participants' state of discomfort in the voice.

Elena Lyakso, Olga Frolova
Estimation of Vowel Spectra Near Vocal Chords with Restoration of a Clipped Speech Signal

Speech signals with Russian vowels were recorded simultaneously by two microphones. The first microphone was located in the larynx near the vocal chords and the second one outside the mouth near the lips. The signal at the inner microphone is formed by the vocal chords and contains a weak reverberation echo. It is clipped to half of its energy because of the sensitivity restrictions of the microphone. A new mathematical algorithm is proposed for restoring the clipped part of the signal, and the restored signal sounds better. The restored vowel spectra contain the first formant, which cannot be explained by backward reverberation. A transfer function of the vocal tract is estimated by comparing the spectra of the signals from the input and output microphones.

Andrey Barabanov, Vera Evdokimova, Pavel Skrelin
Fast Algorithm for Precise Estimation of Fundamental Frequency on Short Time Intervals

Fast algorithms are proposed for precise estimation of the fundamental frequency on a short time interval. The approach is a generalization of the unbiased frequency estimator, and its computational complexity is proportional to that of an FFT on the same time interval. A trade-off between approximation error and numerical speed is established. The result is generalized to the linear trend model. A lower bound is obtained for the length of the time interval that yields a nonsingular information matrix in the estimation problem. The frequency estimation algorithm is not sensitive to large random noise.
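
For contrast, a conventional FFT-based autocorrelation pitch estimator, whose cost is likewise dominated by the FFT; this is a baseline sketch, not the unbiased estimator proposed in the paper:

import numpy as np

def f0_autocorr(frame, sr, fmin=60.0, fmax=400.0):
    n = len(frame)
    spec = np.fft.rfft(frame, 2 * n)            # zero-pad to avoid wrap-around
    acf = np.fft.irfft(np.abs(spec) ** 2)[:n]   # autocorrelation via FFT
    lo, hi = int(sr / fmax), int(sr / fmin)
    lag = lo + np.argmax(acf[lo:hi])            # strongest lag in the F0 range
    return sr / lag

sr = 16000
t = np.arange(1024) / sr
frame = np.sin(2 * np.pi * 120.0 * t)           # synthetic 120 Hz tone
print(f0_autocorr(frame, sr))                   # close to 120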

Andrey Barabanov, Alexandr Melnikov, Valentin Magerkin, Evgenij Vikulov
Gender Classification of Web Authors Using Feature Selection and Language Models

In the present article, we address the problem of automatic gender classification of web blog authors. More specifically, we employ eight widely used machine learning algorithms, in order to study the effectiveness of feature selection on improving the accuracy of gender classification. The feature ranking is performed over a set of statistical, part-of-speech tagging and language model features. In the experiments, we employed classification models based on decision trees, support vector machines and lazy-learning algorithms. The experimental evaluation performed on blog author gender classification data demonstrated the importance of language model features for this task and that feature selection significantly improves the accuracy of gender classification, regardless of the type of the machine learning algorithm used.
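
A minimal sketch of feature selection ahead of classification, assuming a univariate ranking criterion and placeholder features and labels:

import numpy as np
from sklearn.feature_selection import SelectKBest, f_classif
from sklearn.pipeline import make_pipeline
from sklearn.svm import SVC

rng = np.random.default_rng(0)
X = rng.normal(size=(300, 50))     # placeholder statistical/POS/LM features
y = rng.integers(0, 2, size=300)   # placeholder gender labels

# rank features, keep the top 20, then classify
clf = make_pipeline(SelectKBest(f_classif, k=20), SVC())
clf.fit(X, y)
print(clf.score(X, y))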

Christina Aravantinou, Vasiliki Simaki, Iosif Mporas, Vasileios Megalooikonomou
Improving Acoustic Models for Russian Spontaneous Speech Recognition

The aim of the paper is to investigate the ways to improve acoustic models for Russian spontaneous speech recognition. We applied the main steps of the Kaldi Switchboard recipe to a Russian dataset but obtained low accuracy with respect to the results for English spontaneous telephone speech. We found two methods to be especially useful for Russian spontaneous speech: the i-vector based deep neural network adaptation and speaker-dependent bottleneck features which provide 8.6 % and 11.9 % relative word error rate reduction over the baseline system respectively.

Alexey Prudnikov, Ivan Medennikov, Valentin Mendelev, Maxim Korenevsky, Yuri Khokhlov
Information Sources of Word Semantics Methods

This paper studies quality and orthogonality of information sources used in methods for computing word semantics. The quality of the methods is measured on several hand-crafted comparison datasets. The orthogonality is estimated by measuring the performance increase when two information sources are linearly interpolated using optimal interpolation parameters. The experiment conclusions reveal both expected and contradictory results and offer a deeper insight into the information sources of particular methods.

Miloslav Konopík, Ondřej Pražák
Invariant Components of Speech Signals: Analysis and Visualization

In real-world acoustic environments the speech signal is characterized by high variability, and information can be transmitted only through invariant structures of the speech signal. Some of these invariant structures are formed directly in the speech production apparatus, while others are generated by the human auditory perception system. It is shown experimentally that the latter is most sensitive to changes in the invariant components of the speech signal. Analysis methods for speech signal harmonic components (invariant components of a speech signal) are proposed.

Valeriy Zhenilo, Vsevolod Potapov
Language Model Speaker Adaptation for Transcription of Slovak Parliament Proceedings

Language model and acoustic model adaptation play an important role in enhancing the performance and robustness of automatic speech recognition, especially when developing domain-specific, gender-dependent, or user-adapted systems. This paper addresses language model speaker adaptation for the transcription of parliament proceedings in Slovak for individual speakers. Based on current research, we have developed a framework combining multiple speech recognition outputs with acoustic and language model adaptation at different stages. The preliminary results show significant relative decreases in model perplexity of 45 % to 74 % and in speech recognition word error rate of 29 % to 43 %, for male and female speakers respectively.

Ján Staš, Daniel Hládek, Jozef Juhár
Macro Episodes of Russian Everyday Oral Communication: Towards Pragmatic Annotation of the ORD Speech Corpus

The ORD corpus is a representative resource of everyday spoken Russian that contains about 1000 h of long-term audio recordings of daily communication, made in real settings by research volunteers. ORD macro episodes are large communication episodes united by the setting/scene of communication, the social roles of the participants and their general activity. The paper describes the annotation principles used for tagging macro episodes, provides current statistics on the communication situations present in the corpus and reveals their most common types. The annotation of communication situations allows these codes to be used as filters for selecting audio data, making it possible to study Russian everyday speech in different communication situations and to determine and describe various registers of spoken Russian. As an example, several high-frequency word lists referring to different communication situations are compared. The annotation of macro episodes made for the ORD corpus is a prerequisite for its further pragmatic annotation.

Tatiana Sherstinova
Missing Feature Kernel and Nonparametric Window Subband Power Distribution for Robust Sound Event Classification

Sound Event Classification (SEC) aims to understand real-life events using sound information. A major problem of SEC is that it has to deal with uncontrolled environmental conditions, leading to extremely high levels of noise, reverberation, overlapping, attenuation and distortion. As a result, some parts of the captured signals may be masked out or completely missing. In this paper, we propose a novel missing-feature classification method utilizing a missing feature kernel in the classification optimization machine. The proposed method first transforms audio segments into the Subband Power Distribution (SPD), a novel image representation in which the pure signal's area is separable. A novel masking approach is then proposed to separate the SPD into reliable and non-reliable parts. Next, a missing feature kernel (MFK), in the form of probabilistic distances on the intersection of the reliable areas of SPD images, is developed and integrated into an SVM optimization framework. Experimental results show the superiority of the proposed method on challenging SEC tasks where the signals are affected by severe noise and distortion.

Tran Huy Dat, Jonathan William Dennis, Ng Wen Zheng Terence
Multi-factor Method for Detection of Filled Pauses and Lengthenings in Russian Spontaneous Speech

Spontaneous speech contains high rates of speech disfluencies, the most common being filled pauses and lengthenings (FPs). Human language technologies are often developed for types of speech other than spontaneous, and the occurrence of disfluencies is the reason for many mistakes in automatic speech recognition systems. In this paper we present a method for the automatic detection of FPs using a linear combination of statistical characteristics of acoustic parameter variance, based on a preliminary study of FP parameters across a mixed and quality-diverse corpus of Russian spontaneous speech. Experiments were carried out on a corpus consisting of the task-based dialogue corpus of Russian spontaneous speech collected at SPIIRAS and Russian casual conversations from the Open Source Multi-Language Audio Database collected at Binghamton University.

Vasilisa Verkhodanova, Vladimir Shapranov
Multimodal Presentation of Bulgarian Child Language

The holistic tradition in modern linguistics is characterized by integrated and corpus approaches to the phenomena explored. This establishes the specific circumstances under which speech can be thoroughly examined, in norm and in pathology, by exploiting contemporary multimedia equipment and software products. The present work focuses on some of the possibilities of the most frequently used interactive platforms, TalkBank and CHILDES, whose diverse corpora have an extremely broad spectrum of applications in different spheres of science and social life, which makes them socially valid and crucial. Additionally, the paper dwells on a Bulgarian child language corpus created within the parameters of that multimodal presentation paradigm.

Dimitar Popov, Velka Popova
On Deep and Shallow Neural Networks in Speech Recognition from Speech Spectrum

This paper demonstrates how usual feature extraction methods such as PLP can be successfully replaced by a neural network, and how signal processing methods such as mean normalization, variance normalization and delta coefficients can be successfully utilized when NN-based feature extraction and an NN-based acoustic model are used simultaneously. The importance of deep NNs is also investigated. System performance was evaluated on the British English speech corpus WSJCAM0.

Jan Zelinka, Petr Salajka, Luděk Müller
Opinion Recognition on Movie Reviews by Combining Classifiers

In this paper we present a combined opinion recognition scheme based on discriminative algorithms, decision trees and probabilistic algorithms. The proposed scheme takes advantage of the information provided by each of the recognition models at the decision level, in order to provide refined and more accurate opinion recognition results. The experimental results showed that the proposed combined scheme achieved an overall recognition performance of 87.90 %, increasing the accuracy of our best-performing individual opinion recognition model by 3.5 %.

Athanasia Koumpouri, Iosif Mporas, Vasileios Megalooikonomou
Optimization of Pitch Tracking and Quantization

The article presents the results of research focused on procedures for tracking and quantizing pitch values. The corresponding optimization tasks are set and accomplished. The results of an experimental study of the developed algorithm for determining the pitch lag and its optimal quantizer are presented. The gains in noise immunity and signal-to-noise ratio compared to known solutions are shown.

Oleg Basov, Andrey Ronzhin, Victor Budkov
PLDA Speaker Verification with Limited Speech Data

In some speaker verification applications the amount of data available for enrolment and verification can be limited. One aim of this paper is to study the impact of the volume of enrolment and verification data on system performance. The second aim is to improve speaker verification using PLDA. PLDA is generally used to model speaker and channel variability in the i-vector space using data from several recording sessions. In our experiment, only data from a single session per speaker was available. We therefore divided the development recordings into shorter segments and treated these segments as if they had been recorded in different sessions. This approach models neither the inter-session speaker variability nor the channel variability; however, we assumed that statistical modelling of the intra-session speaker variability could improve the verification results. Different granularities of segmentation were studied with various amounts of enrolment and verification data.
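
A minimal sketch of the segmentation idea, assuming fixed-length chunks of frame-level features (the chunk length is a placeholder for the granularities studied):

import numpy as np

def pseudo_sessions(features, chunk_len=1000):
    # features: (n_frames, dim) array from one single-session recording;
    # each chunk is later treated as its own "session" for PLDA training
    n = features.shape[0] // chunk_len
    return [features[i * chunk_len:(i + 1) * chunk_len] for i in range(n)]

session = np.random.randn(5500, 20)   # placeholder single-session features
print(len(pseudo_sessions(session)))  # -> 5 pseudo-sessions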

Andrej Ridzik, Milan Rusko
Real-Time Context Aware Audio Augmented Reality

The purpose of this paper is to present a method for real-time augmented reality sound production from virtual sources located in a real environment. In the experiments performed, we initially emphasize augmenting audio information beyond the existing environmental sounds using headphones. The main goal of the approach is to produce a virtual sound that sounds natural, so that the user becomes immersed and perceives a context-aware synthetic sound. The necessary data, such as the spatial coordinates of source and listener, the relative distance and relative velocity between them, room dimensions and potential obstacles between the virtual source and listener, are given as input to the proposed framework. Real-time techniques, fast and effective enough to meet high performance requirements, are used for data processing. The resulting sound gives the listener the impression that the virtual source is part of the real environment. Any dynamic change of the parameters results in a simultaneous real-time change of the produced sound.

Gerasimos Arvanitis, Konstantinos Moustakas, Nikos Fakotakis
Recurrent Neural Networks for Hypotheses Re-Scoring

We present our first results in applying recurrent neural networks to Russian. The problem of re-scoring equiprobable hypotheses is solved. We train several recurrent neural networks on a lemmatized news corpus to mitigate the problem of data sparseness, and we also make use of morphological information to make the predictions more accurate. Finally, we train a Ranking SVM model and show that the combination of recurrent neural networks and morphological information gives better results than a 5-gram model with Kneser-Ney discounting.

Mikhail Kudinov
Review of the Opus Codec in a WebRTC Scenario for Audio and Speech Communication

The Internet Engineering Task Force (IETF) – the open Internet standards-development body – considers the Opus codec a highly versatile audio codec for interactive voice and music transmission. In this review we survey the dynamic functioning of the Opus codec within a Web Real-Time Communication (WebRTC) framework based on the Google Chrome browser. The codec's behavior and the features effectively utilized during the active communication process are tested and analyzed under various conditions. In the experiments, we verify Opus's performance and interactivity; the relevant codec parameters can easily be adapted in application development. In addition, speech coded within the WebRTC framework achieves a MOS assessment similar to stand-alone Opus coding.

Michael Maruschke, Oliver Jokisch, Martin Meszaros, Viktor Iaroshenko
Semantic Multilingual Differences of Terminological Definitions Regarding the Concept “Artificial Intelligence”

The current use of information technology in terminography gives rise to a fundamentally new lexicographical paradigm as compared to classical concepts of ordering the semantic constituents of natural language units. This article presents the concept of formalizing the semantic representation of lexis, using the example of the term “Artificial Intelligence”. An attempt is also made to develop an optimal strategy for the construction of a context-oriented terminological electronic translation dictionary.

Rodmonga Potapova, Ksenia Oskina
SNR Estimation Based on Adaptive Signal Decomposition for Quality Evaluation of Speech Enhancement Algorithms

This paper presents a new method for estimating signal-to-noise ratio based on adaptive signal decomposition. Statistical simulation shows that the proposed method has lower variance and bias than the known signal-to-noise ratio measures. We discuss the parameters and characteristics of the proposed method and its practical implementation.

Sergei Aleinik, Mikhail Stolbov
Sociolinguistic Factors in Text-Based Sentence Boundary Detection

The paper explores the correlation between perception of spontaneous speech based on textual information and original speech in sound recording. We investigate factors which may affect the extent of a reader’s ‘guesstimate’ of prosodic characteristics of the original speech. To explore a reader’s prosodic competence, we focused on pause as the most prominent cue of prosodic boundaries and performed statistical analysis to find out, on the one hand, whether there is a correlation between an annotator’s estimation of a sentence end and a real pause in these positions and, on the other hand, if the type of text and sociolinguistic characteristics of a speaker influence this estimation.

Anton Stepikhov
Sparsity Analysis and Compensation for i-Vector Based Speaker Verification

In recent years, the i-vector based framework has been proven to provide state-of-the-art performance in speaker verification. Most research focuses on compensating the channel variability of the i-vector. In this paper we show that when the duration of the enrollment or test utterance is limited, an i-vector based system may suffer from a biased estimation problem. To solve this problem, we propose an improved i-vector extraction algorithm which we term Adapted First-order Baum-Welch Statistics Analysis (AFSA). This new algorithm suppresses and compensates the deviation of the first-order Baum-Welch statistics caused by phonetic sparsity and phonetic imbalance. Experiments were performed on the NIST 2008 SRE data sets; the results show that a 10-15 % relative improvement is achieved compared to the baseline traditional i-vector based system.

Wei Li, Tian Fan Fu, Jie Zhu, Ning Chen
Speaker Identification Using Semi-supervised Learning

Semi-supervised classification methods use available unlabeled data, along with a small set of labeled examples, to increase the classification accuracy in comparison with training a supervised method using only the labeled data. In this work, a new semi-supervised method for speaker identification is presented. We present a comparison with other well-known semi-supervised and supervised classification methods on benchmark datasets and verify that the presented technique exhibits better accuracy in most cases.

Nikos Fazakis, Stamatis Karlos, Sotiris Kotsiantis, Kyriakos Sgarbas
Speaker Verification Using Spectral and Durational Segmental Characteristics

In the present paper we report on some of the results obtained by fusion of human assisted speaker verification methods based on formant features and statistics of phone durations. Our experiments on the database of spontaneous speech demonstrate that using segmental durational characteristics leads to better performance, which shows the applicability of these features for the speaker verification task.

Elena Bulgakova, Aleksei Sholohov, Natalia Tomashenko, Yuri Matveev
Speech Enhancement in Quasi-Periodic Noises Using Improved Spectral Subtraction Based on Adaptive Sampling

The paper presents a speech processing method based on spectral subtraction that is effective for the reduction of specific rate-dependent noises. Such noises are produced by a variety of rotating sources such as turbines and car engines. The applicability of conventional spectral subtraction to such noises is limited, since their power spectral density (PSD) is tied to the rotation rate and therefore constantly changing. The paper shows that in some cases the variation of the PSD can be compensated by an adaptive sampling rate: the signal can be processed in a warped time domain, which makes the noise parameters more stable and easier to estimate. Stabilization of the PSD leads to more accurate estimation of the noise parameters and significantly improves the result of noise reduction. To determine the current rotation rate, the proposed method can either use an external reference signal or the noisy signal itself, applying a pitch detector to it. Considering that the noise typically consists of deterministic and stochastic components, the narrow-band and wide-band components of the noise are removed separately. The method is compared to the recently proposed maximum a posteriori (MAP) method.
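
For reference, a textbook magnitude-domain spectral subtraction of the kind the method builds on, applied to a single frame; the adaptive-sampling (time-warping) step itself is not shown:

import numpy as np

def spectral_subtraction(frame, noise_psd, floor=0.01):
    spec = np.fft.rfft(frame)
    power = np.abs(spec) ** 2
    clean = np.maximum(power - noise_psd, floor * power)  # spectral floor
    gain = np.sqrt(clean / power)
    return np.fft.irfft(gain * spec, n=len(frame))        # keep noisy phase

frame = np.random.randn(512)          # placeholder noisy frame
noise_psd = np.full(257, 0.5)         # noise PSD estimate per rfft bin
print(spectral_subtraction(frame, noise_psd).shape)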

Elias Azarov, Maxim Vashkevich, Alexander Petrovsky
Sub-word Language Modeling for Russian LVCSR

Russian is a highly inflected language with rich morphology, characterized by low lexical coverage, a high out-of-vocabulary (OOV) rate and high perplexity. Therefore, large vocabulary continuous speech recognition (LVCSR) of Russian and languages with similar morphology remains a challenging task. Augmenting the full-word language model with word fragments is a well-known approach to this challenge which also allows recognition of words missing from the lexicon (open vocabulary recognition). In this paper we suggest a novel “double-sided” approach to marking word fragments, which reduces the WER by up to 3.7 % absolute (20.8 % relative) compared to the full-word baseline and by up to 1.1 % absolute (7.2 % relative) compared to the corresponding sub-word baseline on the evaluation set. Moreover, the type of word decomposition (syllables or morpheme-like units), the smallest fragment size and the optimal number of non-fragmented words are also investigated for Russian LVCSR.
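
A minimal sketch of fragment marking and restoration; the "+" markers and the toy splitter are assumptions for illustration, not the paper's double-sided scheme itself:

def syllables(word):
    # toy fixed-width splitter standing in for real syllabification
    return [word[i:i + 3] for i in range(0, len(word), 3)]

def mark_fragments(word):
    parts = syllables(word)
    if len(parts) == 1:
        return parts
    # mark word-internal boundaries on both sides
    return [("+" if i > 0 else "") + p + ("+" if i < len(parts) - 1 else "")
            for i, p in enumerate(parts)]

def restore(tokens):
    # glue tokens whose markers indicate word-internal boundaries
    return " ".join(tokens).replace("+ +", "")

print(mark_fragments("parovoz"))            # ['par+', '+ovo+', '+z']
print(restore(mark_fragments("parovoz")))   # 'parovoz'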

Sergey Zablotskiy, Wolfgang Minker
Temporal Organization of Phrase-final Words as a Function of Pitch Movement Type

It is well known that the type of pitch movement influences segment duration (at least that of the stressed vowel), as does the position of the word within the phrase. The Corpus of Professionally Read Speech was used here to study the interaction between these two factors. Statistical analysis has allowed us to obtain a description of 14 frequent pitch movement types (following the classification used in the Corpus) in terms of their temporal characteristics, namely the duration of the stressed vowel and the durations of the post-stressed vowel and the final consonant (if any), for words ending in -cV, -cVc, -cVcv, -cVcvc, -cVccv, and -cVccvc.

Tatiana Kachkovskaia
The “One Day of Speech” Corpus: Phonetic and Syntactic Studies of Everyday Spoken Russian

The studies described in the paper are based on the ORD (“One Day of Speech”) corpus of Russian everyday speech, which contains long-term audio recordings of daily communication. The ORD corpus provides rich authentic material for research in the phonetics and syntax of spoken Russian, and may be used for the adjustment and improvement of speech synthesis and recognition systems. Current phonetic investigations of the ORD corpus relate to temporal studies, the study of speech reduction, the phonetic realization of words and affixes, the investigation of phonetic errors and mondegreens, and studies of rhythm structures and hesitation phenomena. Syntactic studies primarily deal with the linear word order of syntactic groups, the syntactic complexity of spoken utterances, and specific syntactic phenomena of spontaneous speech. In this paper, we summarize the main achievements in phonetic and syntactic studies based on the ORD corpus and outline some directions for further investigation.

Natalia Bogdanova-Beglarian, Gregory Martynenko, Tatiana Sherstinova
The Multi-level Approach to Speech Corpora Annotation for Automatic Speech Recognition

In this paper the multi-level approach to audio file annotation is briefly summarized. The emphasis is mainly placed on the development of annotation rules. Firstly, some general requirements are outlined and more specific markers are listed, which may or may not be included in a particular rule set depending on the given practical task. Then the software tools used for creating annotations and spell-checking them are described, and an example of a database created on the basis of the multi-level approach to annotation is given. Lastly, the application of tag sorting in ASR training and testing is discussed.

Igor Glavatskih, Tatyana Platonova, Valeria Rogozhina, Anna Shirokova, Anna Smolina, Mikhail Kotov, Anna Ovsyannikova, Sergey Repalov, Mikhail Zulkarneev
The Role of Prosody in the Perception of Synthesized and Natural Speech

This paper presents the results of research on the perception of synthesized and natural speech, and investigates the role of the prosodic characteristics of pauses in speech comprehension. The research involved a series of perception tasks, including quality assessment, an intelligibility task and comprehension tests of ten shorter texts and one longer text in Serbian produced by the AlfaNum speech synthesizer and by a professional actor, and a follow-up comprehension task on synthesized speech with modified pauses. The results of the intelligibility task show similar performance by both groups of subjects, while the comprehension tasks indicate better performance for natural than for synthesized speech. The results of the follow-up task show that the modified prosody contributed to better subject performance. The quality assessment task revealed the subjects' preference for natural speech, mainly on the basis of the prosodic characteristics of pauses.

Maja Marković, Bojana Jakovljević, Tanja Milićev, Nataša Milićević
The Singular Estimation Pitch Tracker

A model of the singular estimation process for the speech fundamental pitch frequency is reviewed. Existing solutions to known classes of mathematical problems (singular spectrum analysis, the fast Fourier transform, and convolution) are used to develop a numerical implementation of the model. The resulting fundamental pitch frequency estimates are evaluated against existing algorithms.

Daniyar Volf, Roman Meshcheryakov, Sergey Kharchenko
Voice Conversion Between Synthesized Bilingual Voices Using Line Spectral Frequencies

Voice conversion is a technique that transforms the source speaker's individuality to that of the target speaker. We propose a simple and intuitive voice conversion algorithm that does not require training data between different languages and uses text-to-speech generated speech rather than recorded real voices. The suggested method reconstructs the voice after transforming line spectral frequencies (LSF) by formant space warping functions. The formant space consists of four representative monophthongs for each language, and the warping functions are represented by piecewise linear equations using pairs of the four formants at matched monophthongs. We apply LSF to voice conversion because LSF are not overly sensitive to quantization noise and can be interpolated. The experimental results show that LSF-based voice conversion achieves better ABX and MOS test results than direct frequency warping approaches.
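
A minimal sketch of a piecewise linear formant-space warping function built from matched monophthong formant pairs; the formant values below are hypothetical:

import numpy as np

src_formants = np.array([300.0, 800.0, 1500.0, 2500.0])  # source-language space
tgt_formants = np.array([350.0, 750.0, 1600.0, 2400.0])  # target-language space

def warp(freq_hz):
    # piecewise linear mapping through the matched pairs;
    # np.interp clamps to the endpoints outside the range
    return np.interp(freq_hz, src_formants, tgt_formants)

print(warp(1000.0))  # linear between the (800, 750) and (1500, 1600) pairs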

Young-Sun Yun, Jinman Jung, Seongbae Eun
Voicing-Based Classified Split Vector Quantizer for Efficient Coding of AMR-WB ISF Parameters

Modern speech coders require efficient coding of the linear predictive coding (LPC) coefficients. Line Spectral Frequencies (LSF) and Immittance Spectral Frequencies (ISF) are currently the most efficient choices of transmission parameters for the LPC coefficients. In this paper, we present a voicing-based classified split vector quantization scheme developed for efficient coding of wideband AMR-WB G.722.2 ISF parameters under noiseless channel conditions. It is based on the classified vector quantization (CVQ) structure combined with split vector quantization (SVQ). Simulation results show that the new ISF coding scheme, called the ISF-CSVQ coder, performs better than the conventional non-classified ISF-SVQ, while saving several bits per frame.

Merouane Bouzid, Salah-Eddine Cheraitia
Vulnerability of Voice Verification System with STC Anti-spoofing Detector to Different Methods of Spoofing Attacks

This paper explores the robustness of a text-independent voice verification system against different methods of spoofing attacks based on speech synthesis and voice conversion techniques. Our experiments show that spoofing attacks based on speech synthesis are the most dangerous, but the use of a standard TV-JFA approach based spoofing detection module can reduce the False Acceptance error rate of the whole speaker recognition system from 80 % to 1 %.

Vadim Shchemelinin, Alexandr Kozlov, Galina Lavrentyeva, Sergey Novoselov, Konstantin Simonchik
WebTransc — A WWW Interface for Speech Corpora Production and Processing

This paper describes a web application designed for preparing and processing speech corpora, key data sources for automatic speech recognition (ASR), natural language processing (NLP), speech synthesis (TTS) and many other tasks. The application allows users to process corpora with no equipment other than a web browser with an internet connection. The application has been used, upgraded and improved for several years, and its history is also described here. During that time, many valuable experiences with speech corpora processing have been gained; they are mentioned here as good practices.

Tomáš Valenta, Luboš Šmídl
Word-External Reduction in Spontaneous Russian

Among the many types of phonetic modification in casual speech, those that occur across word boundaries can affect two words simultaneously. If the inter-word penetration is considerable and accompanied by a lack of other sources of information (semantic, syntactic), the two words can be confused. The present study, based on data from spontaneous Russian, investigates word-external quantitative reductions. A classification of all the examples is given in order to reveal which patterns of modified realization can occur in sloppy Russian and what their probability is.

Yulia Nigmatulina
Backmatter
Metadata
Title
Speech and Computer
Editors
Andrey Ronzhin
Rodmonga Potapova
Nikos Fakotakis
Copyright Year
2015
Electronic ISBN
978-3-319-23132-7
Print ISBN
978-3-319-23131-0
DOI
https://doi.org/10.1007/978-3-319-23132-7
