
2013 | Book

Speech and Computer

15th International Conference, SPECOM 2013, Pilsen, Czech Republic, September 1-5, 2013. Proceedings

Edited by: Miloš Železný, Ivan Habernal, Andrey Ronzhin

Publisher: Springer International Publishing

Book series: Lecture Notes in Computer Science


About this book

This book constitutes the refereed proceedings of the 15th International Conference on Speech and Computer, SPECOM 2013, held in Pilsen, Czech Republic. The 48 revised full papers presented were carefully reviewed and selected from 90 initial submissions. The papers are organized in topical sections on speech recognition and understanding, spoken language processing, spoken dialogue systems, speaker identification and diarization, speech forensics and security, language identification, text-to-speech systems, speech perception and speech disorders, multimodal analysis and synthesis, understanding of speech and text, and audio-visual speech processing.

Table of contents

Frontmatter

Conference Papers

Automatic Detection of the Prosodic Structures of Speech Utterances

This paper presents an automatic approach for detecting the prosodic structures of speech utterances. The algorithm relies on a hierarchical representation of the prosodic organization of the utterances. The approach is applied to a corpus of French radio broadcast news and also to radio and TV shows, which are more spontaneous speech data. The algorithm detects prosodic boundaries whether or not they are followed by a pause. The detection of the prosodic boundaries and of the prosodic structures is based on an approach that integrates little linguistic knowledge and mainly uses the amplitude of the F0 slopes and the inversion of the slopes as described in [1], as well as phone durations. The automatic prosodic segmentation results are then compared to a manual prosodic segmentation made by an expert phonetician. Finally, the results obtained by this automatic approach provide an insight into the most frequently used prosodic structures in the broadcasting speech style as well as in a more spontaneous speech style.

Katarina Bartkova, Denis Jouvet
A Method for Auditory Evaluation of Synthesized Speech Intonation

The paper proposes an approach to diagnostic testing of the suprasegmental quality of Russian TTS-generated speech. We describe a two-step evaluation strategy to measure the perception of prosodic features on the basis of an arbitrary selection of synthesized sentences from an existing representative inventory. As a result of a series of auditory tests, an integral intonation intelligibility coefficient is calculated, enabling us to compare intelligibility and expressiveness, i.e. the functional aspect of the TTS prosodic component. The method is demonstrated on several Russian TTS voices. A psycholinguistic test requiring a linguistic decision on the type of perceived utterance (statement/question/exclamation/non-terminal phrase) was offered to native Russian speakers, along with a quality test of the naturalness of the phrase melody (0-2 scale). Subsequently, several observations are proposed for further enhancement of the formal aspect of prosody, to facilitate finding solutions to the problems detected in the course of the evaluation procedure.

Anna Solomennik, Anna Cherentsova
Acoustic Modeling with Deep Belief Networks for Russian Speech Recognition

This paper presents continuous Russian speech recognition with deep belief networks used in conjunction with HMMs. Recognition is performed in two stages. In the first stage, deep belief networks are used to calculate the phoneme state probabilities for the feature vectors describing the speech. In the second stage, these probabilities are used by a Viterbi decoder to generate the resulting sequence of words. A two-stage training procedure of the deep belief networks, based on restricted Boltzmann machines, is used. In the first stage the neural network is represented as a stack of restricted Boltzmann machines and sequential training is performed, in which the output of the previous machine is the input to the next. After this rough adjustment of the weights, the second stage is performed using a back-propagation training procedure. The advantage of this method is that it allows the use of unlabeled data for training, which makes the training more robust and effective.
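A minimal sketch of the pre-training stage described above: a stack of restricted Boltzmann machines is trained greedily with contrastive divergence (CD-1), and the resulting weights would then initialize a network for back-propagation fine-tuning (omitted here). This is a simplified illustration with binary units and illustrative layer sizes, not the authors' exact setup.

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def train_rbm(data, n_hidden, lr=0.01, epochs=10, rng=np.random):
    """Train one RBM with CD-1; return weights and hidden biases."""
    n_visible = data.shape[1]
    W = 0.01 * rng.randn(n_visible, n_hidden)
    b_vis, b_hid = np.zeros(n_visible), np.zeros(n_hidden)
    for _ in range(epochs):
        # positive phase
        h_prob = sigmoid(data @ W + b_hid)
        h_sample = (rng.rand(*h_prob.shape) < h_prob).astype(float)
        # negative phase (one Gibbs step)
        v_recon = sigmoid(h_sample @ W.T + b_vis)
        h_recon = sigmoid(v_recon @ W + b_hid)
        # CD-1 parameter updates
        W += lr * (data.T @ h_prob - v_recon.T @ h_recon) / len(data)
        b_vis += lr * (data - v_recon).mean(axis=0)
        b_hid += lr * (h_prob - h_recon).mean(axis=0)
    return W, b_hid

def pretrain_stack(features, layer_sizes):
    """Greedy pre-training: each RBM's hidden activations feed the next."""
    layers, x = [], features
    for n_hidden in layer_sizes:
        W, b = train_rbm(x, n_hidden)
        layers.append((W, b))
        x = sigmoid(x @ W + b)  # input to the next RBM
    return layers               # initial weights for back-propagation

# usage (hypothetical data): layers = pretrain_stack(feature_frames, [512, 512, 512])
```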

Mikhail Zulkarneev, Ruben Grigoryan, Nikolay Shamraev
An Analysis of Speech Signals of the Choapam Variant Zapotec Language

The Zapotec language, as well as many other prehispanic languages in Mexico, is endangered for many reasons, including a lack of use by the younger population, who prefer to speak Spanish instead, and the dying out of older native speakers. In this paper an analysis of the Choapam variant of Zapotec is presented; a list of words in this Zapotec was recorded, and a time and formant analysis was carried out in order to obtain basic information used to describe the language. The formant analysis focused on the vowels of the language, due to the complications which arise with them, and gives a first classification of them. Some of the difficulties experienced in the study, which are similar to those encountered with many other endangered languages, are detailed. Although this is a first approach to the analysis of the language, it is hoped that this information will contribute to further efforts aimed at helping preserve it.

Gabriela Oliva-Juarez, Fabiola Martínez-Licona, Alma Martínez-Licona, John Goddard-Close
Analysis of Expert Manual Annotation of the Russian Spontaneous Monologue: Evidence from Sentence Boundary Detection

The paper describes a corpus of Russian spontaneous monologues and the results of its expert manual annotation. The corpus is balanced with respect to speakers' social characteristics and text genre. The analysis of the manual labelling of the transcriptions reveals experts' disagreement in sentence boundary detection. The paper demonstrates that labelled boundaries may have different status. We also show that speakers' social characteristics (gender and speech usage) and text genre influence inter-labeller agreement.

Anton Stepikhov
Application of l1 Estimation of Gaussian Mixture Model Parameters for Language Identification

In this paper we explore the use of l1 optimization for parameter estimation of Gaussian mixture models (GMM) applied to language identification. To train the Universal Background Model (UBM), at each step of the Expectation-Maximization (EM) algorithm the problem of estimating the GMM means is stated as an l1 optimization problem. The approach used is Iteratively Reweighted Least Squares (IRLS). The corresponding solution for Maximum A Posteriori (MAP) adaptation is also presented. The results of the above UBM-MAP system combined with a Support Vector Machine (SVM) are reported on the LDC and GlobalPhone datasets.
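As a rough sketch (not the authors' exact formulation), the l1 mean estimate for one GMM component can be obtained by IRLS: each frame is reweighted by the inverse of its current residual, so the weighted least-squares update converges toward the l1 minimizer. `frames` and `gamma` (EM posteriors) are hypothetical names.

```python
import numpy as np

def l1_mean_irls(frames, gamma, n_iter=20, eps=1e-6):
    """IRLS estimate of mu minimizing sum_t gamma_t * ||x_t - mu||_1."""
    mu = np.average(frames, axis=0, weights=gamma)   # l2 start
    for _ in range(n_iter):
        resid = np.abs(frames - mu) + eps            # (T, D), guard against /0
        w = gamma[:, None] / resid                   # inverse-residual weights
        mu = (w * frames).sum(axis=0) / w.sum(axis=0)
    return mu

# usage inside an EM step (hypothetical names):
# for k in range(n_components):
#     means[k] = l1_mean_irls(features, responsibilities[:, k])
```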

Danila Doroshin, Maxim Tkachenko, Nikolay Lubimov, Mikhail Kotov
Application of Automatic Fragmentation for the Semantic Comparison of Texts

The article considers an algorithm for extracting elementary semantic information from texts. The processed texts are articles from the natural sciences, of large size and fairly arbitrary structure. It is shown that the condensed syntactic graph carries enough semantic information for a primary cluster analysis. A two-level model of semantic cluster analysis is proposed. Its first phase is based on the syntactic graph of the text fragments. We can identify not only keywords, but also non-uniform structural relationships between objects, thanks to the qualitative and quantitative analysis of the lexical items and the syntactic relations between them. A simple and efficient method for the preprocessing comparison of scientific publications is presented. This method is based on the calculation of clusters of text fragments.

Varvara Krayvanova, Elena Kryuchkova
Auditory and Spectrographic Analysis of the Words of 2-8 Years-Old Russian Children

The purpose of this investigation was to examine how well native speakers recognize the lexical meaning of words produced by Russian children aged 2-8 years. A significant improvement in adult native speakers' recognition of word meaning for 8-year-old children, in comparison with 2-7-year-old children, is established. It is shown that by the age of 7 the stressed vowels in words are significantly longer than the unstressed vowels, while pitch values do not differ between stressed and unstressed vowels, which is relevant for Russian. By the age of 7 accurate articulation of vowels is formed, and by the age of 8 correct articulation of the majority of consonants. The relation between the child's age, articulation skills and the recognition of word meaning by native speakers is discussed.

Elena Lyakso, Evgenia Bednaya, Aleksei Grigorev
Auditory and Visual Recognition of Emotional Behaviour of Foreign Language Subjects (by Native and Non-native Speakers)

The “human – human” interaction problem as a basis for “human – machine” investigations is especially complex in the presence of such factors as “native language – foreign language” communication and affiliation to different ethnic cultures. In this connection we set the task of searching for key (basic) features of two-channel decoding (perceptual-auditory and perceptual-visual) of foreign-language and other-cultural communication, with reference to modeling the verbal – non-verbal communication process, characterized in particular by the final effect of interaction successfulness (consensus).

Rodmonga Potapova, Vsevolod Potapov
Automatic Detection of Speech Disfluencies in the Spontaneous Russian Speech

Spontaneous speech is rarely fluent due to human nature. Among other characteristics of spontaneous speech there are speech variation and the presence of speech disfluencies such as hesitations, fillers and artefacts. Such elements are an obstacle for automatic speech processing as well as for the processing of its transcriptions. For the automatic detection of these elements a corpus of spontaneous Russian speech was collected based on a task methodology. The corpus was annotated taking into account such types of disfluencies as hesitations, repairs and sound lengthening, as well as artefacts. For hesitation and artefact detection, parameters such as duration, energy, fundamental frequency, and other spectral characteristics were used.

Vasilisa Verkhodanova, Vladimir Shapranov
Automatic Morphological Annotation in a Text-to-Speech System for Hebrew

The paper presents a module for automatic morphological annotation within a text-to-speech synthesizer for Hebrew, based on an efficient combination of two approaches. The first approach includes the selection of lexemes from appropriate lexica, while the other involves automatic morphological analysis of the text input using a complex expert algorithm relying on a set of transformational rules and 6 types of scoring procedures. The module operates on a set of 30 part-of-speech tags with more than 3000 corresponding morphological categories. The paper discusses the advantages of the proposed method in the context of an extremely morphologically complex language such as Hebrew, with particular emphasis on the relative importance of the individual scoring procedures. When all 6 scoring procedures are applied, an accuracy of 99.6% is achieved on a corpus of 3093 sentences (55046 words).

Branislav Popović, Milan Sečujski, Vlado Delić, Marko Janev, Igor Stanković
Comparative Study of English, Dutch and German Prosodic Features (Fundamental Frequency and Intensity) as Means of Speech Influence

This paper reports on a comparative analysis of specific prosodic features (fundamental frequency and intensity) used as a means of speech influence in Germanic languages (English, Dutch, German). The aim of the study is to compare the prosodic patterns of the languages mentioned above in order to find out whether these patterns are similar or whether there are differences that are used only in a particular language. The experiment included two types of analysis: acoustic and perceptual. The results obtained during the acoustic analysis, supervised by R.K. Potapova, testify to the fact that the languages use different prosodic features as a means of speech influence.

Anna Moskvina
Covariance Matrix Enhancement Approach to Train Robust Gaussian Mixture Models of Speech Data

The estimation of the parameters of a multivariate Gaussian Mixture Model is usually based on a criterion (e.g. Maximum Likelihood) that is focused mostly on the training data. Therefore, testing data which were not seen during the training procedure may cause problems. Moreover, numerical instabilities can occur (e.g. for low-occupied Gaussians, especially when working with full-covariance matrices in high-dimensional spaces). Another question concerns the number of Gaussians to be trained for a specific data set. The approach proposed in this paper can handle all these issues. It is based on the assumption that the training and testing data were generated from the same source distribution. The key part of the approach is to use a criterion based on the source distribution rather than on the training data itself. It is shown how to modify the estimation procedure in order to fit the source distribution better (despite the fact that it is unknown), and subsequently a new estimation algorithm for diagonal- as well as full-covariance matrices is derived and tested.

Jan Vaněk, Lukáš Machlica, Josef V. Psutka, Josef Psutka
Dealing with Diverse Data Variances in Factor Analysis Based Methods

Probabilistic Linear Discriminant Analysis (PLDA) and the concept of i-vectors are state-of-the-art methods used in speaker recognition. They are based on Factor Analysis, in which a data covariance matrix is decomposed in order to find a low-dimensional representation of the given feature vectors. More precisely, the Factor Analysis based methods seek directions/subspaces in which the projected (overall/between/within) variance is highest. In order to train the models related to the individual methods, development speech corpora comprising various acoustic conditions are utilized. The higher the variations in some of these acoustic conditions, the more the model will tend to reflect them. Strong data variations in some of the development corpora may suppress conditions present in other corpora. This can lead to poor recognition when the acoustic variations in test conditions differ significantly. In this paper techniques alleviating such effects are investigated. The idea is to use several background and i-vector models related to different parts of the development data, so that several i-vectors are extracted, processed and handed over to PLDA modelling. The PLDA model is then used to utilize all the extracted information and provide the verification result.

Lukáš Machlica
Detection of the Frequency Characteristics of the Articulation System with the Use of Voice Source Signal Recording Method

The given research is aimed at registering and analysing the speech signal obtained through a microphone placed in the proximity of the vocal folds and comparing it with the output speech signal. The external microphone was located near the lips of the subject. The internal one was placed in the proximity of the subject's vocal folds by a phoniatrician using special medical equipment. The speech signal containing isolated vowels and connected speech was registered synchronously through both microphones. The main interest of the paper is the acoustic characteristics of these signals, mainly the vowel formant structure. Besides, the non-linearity of the vocal tract system was considered. A new method of obtaining the frequency characteristics of the articulation system is used. The co-processing of several acoustic realizations helps to elaborate methods for discriminating and modeling the transfer functions of the voice source and filter components of the vocal tract. Speech signals that are influenced to different degrees by the two parts of the vocal tract are processed. This allows constructing the vowel formant structure of frequency constituents and their variations.

Vera Evdokimova, Karina Evgrafova, Pavel Skrelin, Tatiana Chukaeva, Nikolay Shvalev
Encoding of Spatial Perspectives in Human-Machine Interaction

A spatial context is often present in speech-based human-machine interaction, and its role is especially significant in interaction with robotic systems. Studies in the cognitive sciences show that the frames of reference used in language and in non-linguistic cognition are correlated. In general, humans may use multiple frames of reference. But since the visual sensory modality operates mainly in a relative frame, most users normally and preferably use a relative reference frame in spatial language. Therefore, there is a need to enable dialogue systems to process dialogue acts that instantiate user-centered frames of reference. This paper introduces a cognitively inspired computational modeling method that addresses this research question, and illustrates it for a three-party human-machine interaction scenario. The paper also reports on an implementation of the proposed model within a prototype system, and briefly discusses some aspects of the model's generalizability and scalability.

Milan Gnjatović, Vlado Delić
Evaluation of Advanced Language Modeling Techniques for Russian LVCSR

The Russian language is characterized by very flexible word order, which limits the ability of standard n-grams to capture important regularities in the data. Moreover, it is a highly inflectional language with rich morphology, which leads to high out-of-vocabulary (OOV) word rates. In this paper, we present a comparison of two advanced language modeling techniques, the factored language model (FLM) and the recurrent neural network (RNN) language model, applied to Russian large vocabulary speech recognition. Evaluation experiments showed that the FLM built using a training corpus of 10M words was better and reduced the perplexity and word error rate (WER) by 20% and 4.0%, respectively. A further WER reduction of 7.4% was achieved when the training data were increased to 40M words and the 3-gram, FLM and RNN language models were combined by linear interpolation.
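An illustrative sketch of the linear interpolation of several language models mentioned above (3-gram, FLM, RNN): each model is assumed to expose a per-word probability for the same hypothesis, and the interpolation weights are hypothetical values that would normally be tuned on held-out data.

```python
import math

def interpolate_logprob(word_probs, weights):
    """word_probs: per-model P(w | history) for one word; weights sum to 1."""
    p = sum(lam * p_m for lam, p_m in zip(weights, word_probs))
    return math.log(p)

def hypothesis_score(per_word_probs, weights=(0.5, 0.3, 0.2)):
    """Sum of interpolated log-probabilities over a whole hypothesis."""
    return sum(interpolate_logprob(probs, weights) for probs in per_word_probs)

# usage: per_word_probs = [(p_3gram, p_flm, p_rnn), ...], one tuple per word
```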

Daria Vazhenina, Konstantin Markov
Examining Vulnerability of Voice Verification Systems to Spoofing Attacks by Means of a TTS System

This paper examines a method of spoofing text-dependent voice verification systems based on the most popular TTS approaches: Unit Selection and HMM. Research on this method shows the possibility of achieving a false acceptance error of 98%-100% if the duration of the TTS database is sufficiently large. A distinctive feature of the method is that it can be fully automatic if used in conjunction with a speech recognition system.

Vadim Shchemelinin, Konstantin Simonchik
Exploiting Multiple ASR Outputs for a Spoken Language Understanding Task

In this paper, we present an approach to Spoken Language Understanding where the input to the semantic decoding process is a composition of multiple hypotheses provided by the Automatic Speech Recognition module. This way, the semantic constraints can be applied not only to a single hypothesis, but also to other hypotheses that could represent a better recognition of the utterance. To do this, we have developed an algorithm to combine multiple sentences into a weighted graph of words, which is the input to the semantic decoding process. It has also been necessary to develop a specific algorithm to process these graphs of words according to the statistical models that represent the semantics of the task. This approach has been evaluated on an SLU task in Spanish. Results, considering different configurations of ASR outputs, show the better behavior of the system when a combination of hypotheses is considered.

Marcos Calvo, Fernando García, Lluís-F. Hurtado, Santiago Jiménez, Emilio Sanchis
Fast Algorithm for Automatic Alignment of Speech and Imperfect Text Data

A solution to the problem of fast single-pass alignment of speech with imperfect transcripts is introduced. The proposed technique is based on constructing a special word network for segmentation. We examine robustness and segmentation quality for different types of errors and different levels of noise in the text, depending on the parameters of network tuning. Experiments showed that with properly selected parameters the algorithm is robust to noise of any type in transcripts. The proposed approach has been successfully applied to the task of creating movie subtitles.

Natalia A. Tomashenko, Yuri Y. Khokhlov
GMM Based Language Identification System Using Robust Features

In this work, we propose new features for a GMM based spoken language identification system. A two-stage approach is followed for the extraction of the proposed new features. MFCCs and formants are extracted from a large corpus covering all languages under consideration. In the first stage, MFCCs and formants are concatenated to form the feature vector; K clusters are formed from these feature vectors and one Gaussian is fitted to each cluster. In the second stage, these feature vectors are evaluated against each of the K Gaussians and the returned K probabilities are taken as the elements of the proposed new feature vector, thus forming a K-element feature vector. This method for deriving the new feature vector is common to both the training and testing phases. In the training phase, K-element feature vectors are generated from the language-specific speech corpus and language-specific GMMs are trained. In the testing phase, a similar procedure is followed to extract the K-element feature vector from an unknown speech utterance, which is then evaluated against the language-specific GMMs. Additionally, language-specific a priori knowledge is used for further improvement of the recognition performance. The experiments are carried out on the OGI database and the LID performance is nearly 100%.
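A minimal sketch of the two-stage feature mapping described above: concatenated MFCC+formant vectors are clustered into K groups, one Gaussian is fit per cluster, and each vector is re-expressed as its K likelihoods under those Gaussians. The value of K and the variable names are illustrative, not the authors' settings.

```python
import numpy as np
from sklearn.cluster import KMeans
from scipy.stats import multivariate_normal

def fit_cluster_gaussians(features, k=32, seed=0):
    """Cluster feature vectors and fit one full-covariance Gaussian per cluster."""
    labels = KMeans(n_clusters=k, random_state=seed, n_init=10).fit_predict(features)
    gaussians = []
    for c in range(k):
        cluster = features[labels == c]
        mean = cluster.mean(axis=0)
        cov = np.cov(cluster, rowvar=False) + 1e-6 * np.eye(features.shape[1])
        gaussians.append(multivariate_normal(mean=mean, cov=cov))
    return gaussians

def k_element_features(features, gaussians):
    """Map each MFCC+formant vector to its K Gaussian likelihoods."""
    return np.column_stack([g.pdf(features) for g in gaussians])

# usage (hypothetical arrays):
# gaussians = fit_cluster_gaussians(np.hstack([mfcc, formants]))
# new_feats = k_element_features(np.hstack([mfcc, formants]), gaussians)
```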

Sadanandam Manchala, V. Kamakshi Prasad
Hierarchical Clustering and Classification of Emotions in Human Speech Using Confusion Matrices

Although most natural emotions expressed in speech can be clearly identified by humans, automatic classification systems still display significant limitations on this task. Recently, hierarchical strategies have been proposed that use different heuristics for choosing the appropriate levels in the hierarchy. In this paper, we propose a method for choosing these levels by hierarchically clustering a confusion matrix. To this end, a Mexican Spanish emotional speech database was created and employed to classify the 'big six' emotions (anger, disgust, fear, joy, sadness, surprise) together with a neutral state. A set of 14 features was extracted from the speech signal of each utterance, and a hierarchical classifier was defined from the dendrogram obtained by applying Ward's clustering method to a certain confusion matrix. The classification rate of this hierarchical classifier showed a slight improvement compared to those of various classifiers trained directly with all 7 classes.
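A rough sketch of deriving a classification hierarchy from a confusion matrix with Ward's clustering, in the spirit of the approach above: each emotion class is represented by its row of confusion proportions, and the resulting dendrogram suggests which classes to separate at each level of the hierarchical classifier.

```python
import numpy as np
from scipy.cluster.hierarchy import linkage, dendrogram

labels = ["anger", "disgust", "fear", "joy", "sadness", "surprise", "neutral"]

def hierarchy_from_confusion(confusion):
    """confusion: (n_classes, n_classes) count matrix from a flat classifier."""
    profiles = confusion / confusion.sum(axis=1, keepdims=True)  # row-normalise
    return linkage(profiles, method="ward")                      # Ward linkage

# usage (hypothetical confusion matrix conf_mat):
# Z = hierarchy_from_confusion(conf_mat)
# dendrogram(Z, labels=labels)  # inspect the splits defining the hierarchy levels
```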

Manuel Reyes-Vargas, Máximo Sánchez-Gutiérrez, Leonardo Rufiner, Marcelo Albornoz, Leandro Vignolo, Fabiola Martínez-Licona, John Goddard-Close
Improvements in Czech Expressive Speech Synthesis in Limited Domain

In our recent work, a method on how to enumerate differences between various expressive categories (communicative functions) has been proposed. To improve the overall impact of this approach to both the quality of synthetic expressive speech and expressivity perception by listeners, a few modifications are suggested in this paper. The main ones consist in a different way of expressive data processing and penalty matrix calculation. A complex evaluation using listening tests and some auxiliary measures was performed.

Martin Grůber, Jindřich Matoušek
Improving Prosodic Break Detection in a Russian TTS System

We propose using statistical methods for predicting positions and durations of prosodic breaks in a Russian TTS system, in order to improve on a baseline rule-based system. The paper reports experiments with CART and Random Forests (RF) classifiers. We used CART to predict break durations inside and between sentences, and compared the results of CART and RF for predicting break positions inside sentences. We find that both classifiers show an improvement over the baseline system in predicting break positions, with RF showing the best results. We also observe good results in experiments with predicting break durations. To increase the naturalness of synthesized speech, we included probability-based break durations into a working Russian TTS system. We also built an experimental system with probability-based break placement in sentence parts without punctuation marks, which was evaluated higher than the baseline system in a pilot listening experiment.
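A sketch, not the authors' configuration: predicting prosodic break positions at word junctures with a Random Forest classifier, from simple textual features (e.g. part of speech, distance to the sentence boundary). The feature matrix and names are hypothetical placeholders.

```python
from sklearn.ensemble import RandomForestClassifier

def train_break_model(feature_rows, has_break):
    """feature_rows: (n_junctures, n_features) array; has_break: 0/1 labels."""
    model = RandomForestClassifier(n_estimators=200, random_state=0)
    model.fit(feature_rows, has_break)
    return model

# usage (hypothetical data):
# break_probs = train_break_model(X_train, y_train).predict_proba(X_test)[:, 1]
```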

Pavel Chistikov, Olga Khomitsevich
Investigation of Forensically Significant Changes of Acoustic Features with Regard to Code-Switching (on the Basis of Russian and German)

The investigation of the phenomenon of code-switching is of great scientific interest nowadays. The explosive development of info-communication technologies and the growth of international crime have made it crucial to create multilingual systems capable of speaker identification in the presence of code-switching. That task is obviously impossible without a thorough study of speaker identification and code-switching on the basis of different languages. The paper focuses on the investigation of changes in the speaker's specific speech features under the conditions of switching from the native language to a foreign one. The conducted experiment was based on material in the Russian and German languages and included two types of analysis: acoustic and perceptual. The results of the experiment testify to the fact that the situation of code-switching has an impact on certain speech features. For instance, the mean fundamental frequency tends to decrease, the speech melody becomes uneven, the speech tempo slows down, the duration of pauses increases, and the articulation becomes tenser and more distinct. At the same time it was found that some characteristics of the speaker's voice were not subject to any changes.

Tatiana Platonova, Anna Smolina
LIMA: A Spoken Language Identification Framework

This paper presents LIMA, the Language Identification for Multilingual ASR, which is a web-based parameterisable spoken language identification framework. LIMA is a novel system which facilitates a personalised experience for the user, who can tailor the system to evaluate different LID techniques with varied parameterisations across a range of languages. A number of standard LID techniques have been implemented in the system, together with a novel technique based on unique n-phones. By way of illustration of the system, evaluation results for one particular parameterisation of the system are presented.

Amalia Zahra, Julie Carson-Berndsen
Language Identification System for the Tatar Language

This paper describes a language identification system for the Tatar, English and Russian languages. It also presents a newly created Tatar speech corpus, which is used for building a language model. The main idea is to investigate the potential of basic phonotactic approaches (i.e. the PRLM approach) when working with the Tatar language. The results indicate that the proposed system can be successfully employed for identifying the Tatar, English and Russian languages.

Aidar Khusainov, Dzhavdet Suleymanov
Language Model Comparison for Ukrainian Real-Time Speech Recognition System

This paper describes a real-time speech recognition system for Ukrainian designed primarily for text dictation, targeting moderate computation requirements. The research is focused on language model parameter estimation. As a Slavonic language, Ukrainian is highly inflected and tolerates relatively free word order. These features motivate the transition from word- to class-based statistical language models. According to our experimental research, class-based LMs occupy less space and potentially outperform a 3-gram word-based model. We also describe several tools developed to visualize HMMs, to predict word stress, and to manage cluster-based language modeling.

Mykola Sazhok, Valentyna Robeiko
Lexicon Size and Language Model Order Optimization for Russian LVCSR

In this paper, a comparison of 2-, 3- and 4-gram language models with various lexicon sizes is presented. The text data forming the training corpus was collected from recent Internet news sites; the total size of the corpus is about 350 million words (2.4 GB of data). The language models were built using recognition lexicons of 110K, 150K, 219K, and 303K words. For the evaluation of these models, characteristics such as perplexity, OOV word rate and n-gram hit rate were computed. Experimental results on continuous Russian speech recognition are also given in the paper.
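An illustrative sketch of two of the evaluation quantities mentioned above: the OOV rate against a fixed recognition lexicon, and perplexity computed from per-word log probabilities. `logprob2` is a hypothetical callable standing in for whatever language model is being evaluated.

```python
def oov_rate(test_words, lexicon):
    """Fraction of test words missing from the recognition lexicon (a set)."""
    oov = sum(1 for w in test_words if w not in lexicon)
    return oov / len(test_words)

def perplexity(test_words, logprob2, order=3):
    """Perplexity = 2 ** (-average log2 P(w | history)); history limited to n-1 words."""
    total = 0.0
    for i, w in enumerate(test_words):
        history = tuple(test_words[max(0, i - order + 1):i])
        total += logprob2(w, history)
    return 2 ** (-total / len(test_words))
```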

Irina Kipyatkova, Alexey Karpov
Lingua-cognitive Survey of the Semantic Field “Aggression” in Multicultural Communication: Typed Text

The article describes the first phase of a complex study of the lingua-cognitive mechanisms of the formation and development of aggression represented in language and speech. It presents the results of a content analysis of texts in the Russian, English, Spanish and Tatar languages aimed at designing the semantic field “aggression”. It also describes a classification of aggression types and lists linguistic markers of verbal aggression.

Rodmonga Potapova, Liliya Komalova
Method for Pornography Filtering in the WEB Based on Automatic Classification and Natural Language Processing

The paper presents a method for pornography detection in web pages based on natural language processing. The described classification method uses a feature set of single words and groups of words. Syntactic analysis is performed to extract collocations. A modification of TF-IDF is used to weight terms. An evaluation and comparison of the quality and performance of the classification are given.
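A minimal sketch of plain TF-IDF term weighting over single words and extracted collocations, in the spirit of the classifier above; the paper's specific TF-IDF modification is not reproduced here, and the token lists are hypothetical.

```python
import math
from collections import Counter

def tf_idf(documents):
    """documents: list of token lists (single words and multi-word collocations)."""
    df = Counter()
    for doc in documents:
        df.update(set(doc))            # document frequency per term
    n_docs = len(documents)
    weighted = []
    for doc in documents:
        tf = Counter(doc)
        weighted.append({
            term: (count / len(doc)) * math.log(n_docs / df[term])
            for term, count in tf.items()
        })
    return weighted

# usage: vectors = tf_idf([["free", "video", "adult_content"], ...])
```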

Roman Suvorov, Ilya Sochenkov, Ilya Tikhomirov
Noise and Channel Normalized Cepstral Features for Far-speech Recognition

The paper analyses suitable features for distorted speech recognition. The aim is to explore the application of a command ASR system when the speech is recorded with far-distance microphones and possibly strong additive and convolutional noise. The paper analyses the possible contribution of basic spectral subtraction coupled with cepstral mean normalization to minimizing the influence of the distortion present in such a far-talk channel. The results are compared with a reference close-talk speech recognition system. The results show an improvement in WER for channels with low or medium SNR. Using the combination of these basic techniques, a WERR of 55.6% was obtained for the medium-distance channel and a WERR of 22.5% for the far-distance channel.
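A rough sketch of the two basic techniques combined above: magnitude spectral subtraction using a noise estimate taken from frames assumed to contain no speech, and cepstral mean normalization (CMN) of the resulting cepstral features. The parameter values and the assumption that the first frames are noise-only are illustrative.

```python
import numpy as np

def spectral_subtraction(spectrogram, noise_frames=20, floor=0.01):
    """spectrogram: (frames, bins) magnitude spectra; leading frames assumed noise."""
    noise = spectrogram[:noise_frames].mean(axis=0)
    cleaned = spectrogram - noise
    return np.maximum(cleaned, floor * spectrogram)   # spectral floor avoids negatives

def cepstral_mean_normalization(cepstra):
    """Subtract the per-utterance mean to reduce convolutional channel effects."""
    return cepstra - cepstra.mean(axis=0)

# usage (hypothetical mfcc() front end):
# feats = cepstral_mean_normalization(mfcc(spectral_subtraction(spec)))
```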

Michal Borsky, Petr Mizera, Petr Pollak
Parametric Speech Synthesis and User Interface for Speech Modification

A new parametric allophone-to-speech synthesis system for the Russian language is described. It is assumed that a sequence of allophones with prosodic information is given and it is required to compute the corresponding speech signal. Allophones are stored in the database by their model parameter sets only. The model of a voiced signal is purely polyharmonic. Modification of pitch, energy and duration can be made effectively to a large extent without loss of quality. The system is based on precise estimation of harmonic parameters at the stage of database parameterization, on the cluster description of allophone merging in the Russian language, and on an effective synthesis algorithm with the ability to arbitrarily change any prosodic parameter through the graphical user interface.

Alexander Shipilo, Andrey Barabanov, Mikhail Lipkovich
Phrase-Final Segment Lengthening in Russian: Preliminary Results of a Corpus-Based Study

The paper presents preliminary results of a corpus-based study of phrase-final segment lengthening in Russian. The Corpus of Russian Professionally Read Speech (CORPRES) was used to investigate the degree of lengthening for segments immediately preceding phrase boundaries as a function of segment class and boundary type. According to our data, there is a general tendency for shorter segments to show more lengthening than longer segments (in pairs like /f/–/s/, /t/–/tʲ/ etc.). However, this seems to work the opposite way in pairs of fricatives vs. stops. We have also found that boundary depth (sentence-final vs. non-sentence-final) and the presence or absence of a pause have an effect on phrase-final segment lengthening.

Tatiana Kachkovskaia, Nina Volskaya
Pseudo Real-Time Spoken Term Detection Using Pre-retrieval Results

Spoken term detection (STD) is one of the key technologies for spoken document processing. This paper describes a method to realize pseudo real-time spoken term detection using pre-retrieval results. Pre-retrieval results for all combinations of syllable bigrams are prepared beforehand. The retrieval time depends on the number of candidate sections in the pre-retrieval results, so the paper proposes a method to control the retrieval time through this number. A few top candidates are obtained in almost real time by limiting the number of candidate sections to a small value. While the user is confirming these candidate sections, the system can conduct the rest of the retrieval by gradually increasing the number of candidate sections. The paper demonstrates that the proposed method enables pseudo real-time spoken term detection through evaluation experiments using an actual presentation speech corpus, the Corpus of Spontaneous Japanese (CSJ).

Yoshiaki Itoh, Hiroyuki Saito, Kazuyo Tanaka, Shi-wook Lee
Results for Variable Speaker and Recording Conditions on Spoken IR in Finnish

The performance of current spoken information retrieval (IR) systems depends on the success of automatic speech recognition (ASR) in providing transcripts of the material for indexing. In addition to the ASR system design, ASR performance is strongly affected by the recording conditions, speakers, speaking style and speech content. However, the average word error rate in ASR is not a relevant measure for spoken IR, where only the extracted index terms or keywords matter. In this paper, we measure spoken IR performance on variable material ranging from controlled single-speaker news reading to real-world broadcasts with variable conditions, speakers, and background noise. The effect of using multicondition acoustic models and online adaptation is also studied, as well as the controlled addition of background babble noise. The experiments are performed in Finnish, which is an agglutinative and highly inflected language, using morph-based language modelling.

Ville T. Turunen, Mikko Kurimo, Sami Keronen
SVID Speaker Recognition System for NIST SRE 2012

A description of the SVID speaker recognition system is presented. This system was developed for submission to the NIST SRE 2012.

Alexander Kozlov, Oleg Kudashev, Yuri Matveev, Timur Pekhovsky, Konstantine Simonchik, Andrei Shulipa
Segmentation of Telephone Speech Based on Speech and Non-speech Models

In this paper we investigate the automatic segmentation of recorded telephone conversations based on models for speech and non-speech to find sentence-like chunks for use in speech recognition systems. Presented are two different approaches, based on Gaussian Mixture Models (GMMs) and Support Vector Machines (SVMs), respectively. The proposed methods provide segmentations that allow for competitive speech recognition performance in terms of word error rate (WER) compared to manual segmentation.

Michael Heck, Christian Mohr, Sebastian Stüker, Markus Müller, Kevin Kilgour, Jonas Gehring, Quoc Bao Nguyen, Van Huy Nguyen, Alex Waibel
Software for Assessing Voice Quality in Rehabilitation of Patients after Surgical Treatment of Cancer of Oral Cavity, Oropharynx and Upper Jaw

Restoration of speech function after operations on the organs of speech production requires the development of procedures and supporting software for rehabilitation. The available means are intended to restore the patient's original voice function and to support speech therapy. A feature of the software tool is the use of a combination of speech sounds that are the most common in speech and that affect the naturalness and intelligibility of speech. The effectiveness of the procedures and programs is shown.

Lidiya N. Balatskaya, Evgeny L. Choinzonov, Svetlana Yu. Chizevskaya, Eugeny U. Kostyuchenko, Roman V. Meshcheryakov
Speaker Turn Detection Based on Multimodal Situation Analysis

The main stage of speaker diarization is the detection of the time labels where speakers change. Most approaches to the speaker turn detection problem focus on processing an audio signal captured in a single channel and are applied to archive recordings. Recently, the problem of speaker diarization has come to be considered from a multimodal point of view. In this paper we outline modern methods of audio and video signal processing and personification data analysis for multimodal speaker diarization. The proposed PARAD-R software for Russian speech analysis has been implemented for audio speaker diarization and will be enhanced based on advances in multimodal situation analysis in a meeting room.

Andrey Ronzhin, Victor Budkov
Speech and Crosstalk Detection for Robust Speech Recognition Using a Dual Microphone System

This paper proposes a practical speech detection technique for robust automatic speech recognition, suitable for use under various interference conditions. This technique consists of a dual microphone system and an algorithm for processing their signals. The microphone module is placed in the workplace of the target speaker. The module consists of two symmetrical supercardioid microphones directed in opposite directions. The algorithm of target speaker detection is proposed for this scheme. This algorithm makes it possible to implement spatial filtering of speakers. Experiments with real recordings demonstrate a significant reduction of speech recognition errors for the target speaker due to suppression of acoustic crosstalk. The main advantage of the proposed technique is simplicity of its use in a wide range of practical situations.

Mikhail Stolbov, Marina Tatarnikova
Speech and Language Resources within Speech Recognition and Synthesis Systems for Serbian and Kindred South Slavic Languages

Unlike other new technologies, most speech technologies are heavily language dependent and have to be developed separately for each language. The paper gives a detailed description of speech and language resources for Serbian and kindred South Slavic languages developed during the last decade within joint projects of the Faculty of Technical Sciences, Novi Sad, Serbia and the company “AlfaNum”. It points out the advantages of simultaneous development of speech synthesis and recognition as complementary speech technologies, and discusses the possibility of reuse of speech and language resources across kindred languages.

Vlado Delić, Milan Sečujski, Nikša Jakovljević, Darko Pekar, Dragiša Mišković, Branislav Popović, Stevan Ostrogonac, Milana Bojanić, Dragan Knežević
Statistical Language Aspects of Intonation and Gender Features Based on the Lithuanian Language

The article deals with one of the modern trends in speech technology: defining language aspects applicable in various tasks. A method is proposed that is based on typical statistical characteristics of the pitch pattern of a chosen language. The obtained typical ranges of parameter variation are compared for languages of different language families: Lithuanian on one side and Uzbek and Azerbaijani on the other. A gender analysis of intonation patterns for Lithuanian speech is also presented.

Michael Khitrov, Ludmila Beldiman, Andrey Vasiliev
Text Understanding as Interpretation of Predicative Structure Strings of Main Text’s Sentences as Result of Pragmatic Analysis (Combination of Linguistic and Statistic Approaches)

This paper reports on an approach to presenting a text in a minimized form in a metalanguage that allows restoring a text similar to the original. Here such a text representation is a string of extended predicative structures of the text's sentences, obtained by ranking and subsequently removing sentences that are insignificant according to the semantic net of the text. The extended predicative structures are the result of a comprehensive linguistic analysis of the text sentences. The analysis of the semantics of the whole text is carried out by statistical methods.

Alexander A. Kharlamov, Tatyana V. Yermolenko, Andrey A. Zhonin
The Diarization System for an Unknown Number of Speakers

This paper presents a system for speaker diarization that can be used when the number of speakers is unknown. The proposed system is based on the agglomerative clustering approach in conjunction with factor analysis, the Total Variability approach and linear discriminant analysis. We present the results of the proposed diarization system. The results demonstrate that our system can be used both if an answering machine or handset transfer is present in telephone recordings and in the case of a summed channel in telephone or meeting recordings.

Oleg Kudashev, Alexander Kozlov
The Problem of Voice Template Aging in Speaker Recognition Systems

It is well known that device, language and environmental mismatch adversely affect speaker recognition performance. Much less attention is paid to the effect of age-related voice changes on speaker recognition performance. In this paper we attempt to answer whether speaker recognition algorithms are resistant to age-related changes, and how often we have to update the voice biometric templates. We have investigated these effects on a speech database collected during the period 2006-2010 and have found a clear trend of degradation in the performance of automatic speaker recognition systems over a time interval of up to 4 years.

Yuri Matveev
The Use of Several Language Models and Its Impact on Word Insertion Penalty in LVCSR

This paper investigates the influence of hypothesis length in N-best list rescoring. It is theoretically explained why language models prefer shorter hypotheses. This bias affects the word insertion penalty used in continuous speech recognition. The theoretical findings are confirmed by experiments. Parameter optimization performed on the Slovene Broadcast News database showed why optimal word insertion penalties tend to be greater when two language models are used in speech recognition. This paper also presents a two-pass speech recognition algorithm. Two types of language models were used, a standard trigram word-based language model and a trigram model of morpho-syntactic description tags. A relative decrease of 2.02% in word error rate was achieved after parameter optimization. Statistical tests were performed to confirm the significance of the word error rate decrease.
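An illustrative sketch of N-best rescoring with a word insertion penalty: longer hypotheses accumulate more negative language-model log-probability, so a per-word term compensates for the length bias discussed above. The score structure, weights and field names are assumptions, not the paper's optimized values.

```python
def rescore(nbest, lm1_weight=10.0, lm2_weight=5.0, word_penalty=0.0):
    """nbest: list of dicts with 'words', 'acoustic', 'lm1', 'lm2' log-scores.
    word_penalty is added once per word; a positive value favours longer hypotheses."""
    def score(h):
        return (h["acoustic"]
                + lm1_weight * h["lm1"]
                + lm2_weight * h["lm2"]
                + word_penalty * len(h["words"]))
    return max(nbest, key=score)

# usage: best = rescore(hypotheses)  # 'lm2' could be, e.g., a tag-based trigram model
```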

Gregor Donaj, Zdravko Kačič
The Use of d-gram Language Models for Speech Recognition in Russian

This article describes a method for accounting for syntactic links in the language model for hypotheses obtained after the first pass of speech decoding. The processing includes several stages: POS tagging, dependency parsing, and the use of factored language models for hypothesis rescoring. The use of fast parsing algorithms such as the ‘shift-reduce’ algorithm, and the rejection of constituency grammar in favor of dependency grammar, allow overcoming the main drawback of previous approaches: the exponential growth (in the number of lattice nodes) of the amount of computation with increasing word lattice size.

Mikhail Zulkarneev, Pavel Satunovsky, Nikolay Shamraev
Backmatter
Metadata
Title
Speech and Computer
Edited by
Miloš Železný
Ivan Habernal
Andrey Ronzhin
Copyright year
2013
Publisher
Springer International Publishing
Electronic ISBN
978-3-319-01931-4
Print ISBN
978-3-319-01930-7
DOI
https://doi.org/10.1007/978-3-319-01931-4