
2009 | Book

Cross-Modal Analysis of Speech, Gestures, Gaze and Facial Expressions

COST Action 2102 International Conference Prague, Czech Republic, October 15-18, 2008 Revised Selected and Invited Papers

Edited by: Anna Esposito, Robert Vích

Publisher: Springer Berlin Heidelberg

Book series: Lecture Notes in Computer Science

About this book

This volume brings together the peer-reviewed contributions of the participants at the COST 2102 International Conference on “Cross-Modal Analysis of Speech, Gestures, Gaze and Facial Expressions” held in Prague, Czech Republic, October 15–18, 2008. The conference was sponsored by COST (European Cooperation in the Field of Scientific and Technical Research, www.cost.esf.org/domains_actions/ict) in the domain of Information and Communication Technologies (ICT) for disseminating the research advances developed within COST Action 2102: “Cross-Modal Analysis of Verbal and Nonverbal Communication”, http://cost2102.cs.stir.ac.uk. COST 2102 research networking has contributed to modifying the conventional theoretical approach to the cross-modal analysis of verbal and nonverbal communication, replacing the concept of face-to-face communication with that of body-to-body communication and developing the idea of embodied information. Information is no longer the result of a difference in perception and is no longer measured in terms of quantity of stimuli, since the research developed in COST 2102 has shown that human information processing is a nonlinear process that cannot be seen as the sum of the numerous pieces of information available. Considering simply the pieces of information available results in a model of the receiver as a mere decoder, and produces a huge simplification of the communication process.

Table of Contents

Frontmatter

Emotions and ICT

Cross-Fertilization between Studies on ICT Practices of Use and Cross-Modal Analysis of Verbal and Nonverbal Communication

The following are comments and considerations on how Information and Communication Technology (ICT) will exploit research on the cross-modal analysis of verbal and nonverbal communication.

Leopoldina Fortunati, Anna Esposito, Jane Vincent
Theories without Heart

In general, sociological theories are, or at least seem to be, without heart. However, many fundamental sociological notions such as solidarity, social cohesion and identity are highly emotional. In the field of information and communication technology studies there is a specific theory, that of domestication (Silverstone and Hirsch [58], Silverstone and Haddon [59], Haddon and Silverstone [31]), inside which several research studies on the emotional relationship with ICTs have flourished (Fortunati [19], [20]; Vincent [63]). In this paper I focus on this theory, which is one of the frameworks most commonly applied to understand the integration of ICTs into everyday life. I argue that emotion empowers sociological theories when its analysis is integrated into them. To conclude, I discuss the seminal idea proposed by Star and Bowker [62] of considering emotion as an infrastructure of body-to-body communication.

Leopoldina Fortunati
Prosodic Characteristics and Emotional Meanings of Slovak Hot-Spot Words

In this paper, we investigate emotionally charged hot-spot jVj-words from a corpus that is based on recordings of puppet plays in Slovak. The potential of these hot-spot words for detecting emotion in larger utterances was tested. More specifically, we tested the effect of prosodic and voice quality characteristics and the presence or absence of lexical context on the perception of emotions that the jVj-words convey. We found that the lexical cues present in the context are better predictors for the perception of emotions than the prosodic and voice quality features in the jVj-words themselves. Nevertheless, prosodic as well as voice quality features are useful and complementary in detecting the emotion of individual words from the speech signal as well as of larger utterances. Finally, we argue that a corpus based on recordings of puppet plays presents a novel and advantageous approach to the collection of data for emotional speech research.

Štefan Beňuš, Milan Rusko
Affiliations, Emotion and the Mobile Phone

Over half of the world’s population is expected to be using mobile phones by 2009, and many have become attached to, and even dependent on, this small electronic communications device. Drawing on seven years of research into the affective aspects of mobile phone usage, this chapter examines the ways in which people in the UK have incorporated mobile phones into their lives. It explores what they use them for, their affiliations and emotional attachment to the mobile phone, and how it appears to have gained such an extraordinary role in maintaining close relationships with family and friends.

Jane Vincent
Polish Emotional Speech Database – Recording and Preliminary Validation

The paper presents a state-of-the-art review of emotional speech databases and the design of a Polish database. The principles set for naturalness, choice of emotions, speaker selection, text material and the validation procedure are presented. Six simulated emotional states (anger, sadness, happiness, fear, disgust and surprise) plus a neutral state were chosen for recording by speakers from three groups: professional actors, amateur actors and amateurs. The linguistic material consists of ten short everyday-life sentences. The amateur database recordings are complete, and the next step will be recordings of professional actors. A database of over two thousand utterances by amateur speakers was recorded and validated in a forced-choice listening test. The results of emotion recognition and the influence of musical education, gender and nationality for a group of over two hundred listeners are presented.

Piotr Staroniewicz, Wojciech Majewski
Towards a Framework of Critical Multimodal Analysis: Emotion in a Film Trailer

This paper presents a pilot study in the analysis of emotion integrating approaches from traditionally separate theoretical backgrounds: socio-semiotic studies and cognitive studies. The general aim is to identify a flexible and comprehensive framework for critical multimodal analysis of visual, verbal (oral, written and a blend of the two), kinetic, sound, music and graphic aspects. The case study, exemplifying this kind of integrated approach in its initial stages, identifies the voices of emotion as expressed in the trailer of the film An Inconvenient Truth (2006) and discusses their intertextual and interdiscoursal characteristics. Aspects of the on-going research projects are discussed.

Maria Bortoluzzi
Biosignal Based Emotion Analysis of Human-Agent Interactions

A two-phase procedure, based on biosignal recordings, is applied in an attempt to classify the emotional valence content of human-agent interactions. In the first phase, participants are exposed to a sample of pictures with known valence values (taken from the IAPS dataset) and classifiers are trained on selected features of the recorded biosignals. During the second phase, biosignals are recorded for each participant while they watch video clips of interactions with a female and a male ECA. The classifiers trained in the first phase are applied, and a comparison between the two interfaces is carried out based on the classifications of the emotional responses to the video clips. The results obtained are promising and are discussed in the paper together with the problems encountered and suggestions for possible future improvements.
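
A minimal sketch of such a two-phase procedure, assuming generic biosignal features and an SVM classifier (neither the feature set nor the classifier choice is specified by this summary):

```python
# Sketch of the two-phase valence-classification idea; stand-in data,
# feature names and classifier are our assumptions, not the authors' code.
import numpy as np
from sklearn.svm import SVC

# Phase 1: train on biosignal features recorded while participants view
# IAPS pictures with known valence labels (e.g. 0 = negative, 1 = positive).
X_iaps = np.random.randn(100, 8)        # stand-in biosignal feature vectors
y_iaps = np.random.randint(0, 2, 100)   # known valence labels
clf = SVC().fit(X_iaps, y_iaps)

# Phase 2: apply the trained classifier to biosignals recorded while the
# same participants watch clips of interactions with the two ECAs, then
# compare the predicted valence distributions of the two interfaces.
X_female_eca = np.random.randn(20, 8)
X_male_eca = np.random.randn(20, 8)
print("female ECA mean predicted valence:", clf.predict(X_female_eca).mean())
print("male ECA mean predicted valence:  ", clf.predict(X_male_eca).mean())
```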

Evgenia Hristova, Maurice Grinberg, Emilian Lalev
Emotional Aspects in User Experience with Interactive Digital Television: A Case Study on Dyslexia Rehabilitation

This work explores the emotional response of users to Information and Communication Technologies and the aspects triggering emotional reactions. One of the opportunities provided by interactive services delivered through Digital Television (DTV) is to promote the use of socially relevant ICT-based services by large groups of people who have neither Internet access nor the skills needed to use the Internet, but are familiar with television and its remote control. The pilot research programme on DTV in Italy has been developed through a number of initiatives, some of them designed to explore the potential impact on the population of the new digital services associated with broadcast TV channels. Fondazione Ugo Bordoni (FUB) co-funded six T-government projects with the main objective of experimenting with highly interactive services involving real users. The T-islessia project concentrated on the rehabilitation of children at risk of dyslexia using interactive exercises delivered through the TV channel. Structured listening sessions were used to investigate the young users’ attitudes and their emotional reactions towards the technological tool. During these sessions, drawings and interviews were used in a complementary way. The field investigation yielded positive results with regard to the effectiveness of rehabilitation through the DTV platform: the greater the number of interactive sessions, the higher the level of acquired phonetic skills.

Filomena Papa, Bartolomeo Sapio
Investigation of Normalised Time of Increasing Vocal Fold Contact as a Discriminator of Emotional Voice Type

To date, descriptions of the categorisation of emotional voice type have mostly been provided in terms of fundamental frequency (f0), amplitude and duration. It is of interest to seek additional cues that may help to improve the recognition of emotional colouring in speech, and expressiveness in speech synthesis. The present contribution examines a specific laryngeal measure - the normalised time of increasing contact of the vocal folds (NTIC), i.e. increasing contact time divided by cycle duration - as estimated from the electroglottogram signal. This preliminary study, using a single female speaker, analyses the sustained vowel [a:], produced when simulating the emotional states anger, joy, neutral, sad and tender. The results suggest that NTIC may not be ideally suited for emotional voice discrimination. Additional measures are suggested to further characterise the emotional portrayals.
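
As a worked illustration of the measure (notation ours): with $t_{ic}$ the duration of the contact-increasing phase of a glottal cycle and $T$ the cycle duration,

$$\mathrm{NTIC} = \frac{t_{ic}}{T},$$

so a cycle of $T = 8$ ms whose contacting phase lasts $t_{ic} = 2$ ms gives $\mathrm{NTIC} = 0.25$.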

Peter J. Murphy, Anne-Maria Laukkanen
Evaluation of Speech Emotion Classification Based on GMM and Data Fusion

This paper describes a continuation of our research on automatic emotion recognition from speech based on Gaussian Mixture Models (GMM). We use a technique similar to that used for speaker recognition. Previous research suggests that it is better to use fewer GMM components than for speaker recognition, and that better results are achieved with a greater number of speech parameters used for GMM modeling. In previous experiments we used suprasegmental and segmental parameters separately and also together, which can be described as fusion at the feature level. The experiment described in this paper evaluates score-level fusion of two GMM classifiers used separately for segmental and suprasegmental parameters. We evaluate two techniques of score-level fusion: the dot product of the scores from both classifiers, and maximum confidence selection.
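
A sketch of the two fusion rules as we read them; the illustrative score vectors and the class-wise (element-wise) reading of the "dot product" rule are our assumptions, not the authors' code:

```python
# Score-level fusion of two per-emotion classifiers (illustrative sketch).
import numpy as np

emotions = ["neutral", "joy", "sadness", "anger"]
# Per-emotion scores from two GMM classifiers: one trained on segmental,
# one on suprasegmental parameters (made-up values).
scores_seg = np.array([0.10, 0.55, 0.20, 0.15])
scores_supra = np.array([0.05, 0.35, 0.40, 0.20])

# Rule 1: multiply the two score vectors class by class, pick the maximum.
fused = scores_seg * scores_supra
print("product fusion: ", emotions[int(np.argmax(fused))])

# Rule 2: maximum-confidence selection -- trust whichever classifier has
# the higher peak score and take its decision.
best = scores_seg if scores_seg.max() >= scores_supra.max() else scores_supra
print("max confidence: ", emotions[int(np.argmax(best))])
```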

Martin Vondra, Robert Vích
Spectral Flatness Analysis for Emotional Speech Synthesis and Transformation

According to psychological research on emotional speech, different emotions are accompanied by different amounts of spectral noise. We control this amount by spectral flatness, according to which high-frequency noise is mixed into voiced frames during cepstral speech synthesis. Our experiments are aimed at a statistical analysis of spectral flatness in three emotions (joy, sadness, anger), with a neutral state for comparison. Calculated histograms of the spectral flatness distribution are visually compared and modelled by a Gamma probability distribution. The obtained statistical parameters and the emotional-to-neutral ratios of their mean values show good correlation for both male and female voices and all three emotions.
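
For reference, the standard definition of spectral flatness is the ratio of the geometric to the arithmetic mean of the power spectrum $S(k)$ over $N$ bins (the paper may use a variant):

$$\mathrm{SFM} = \frac{\left(\prod_{k=1}^{N} S(k)\right)^{1/N}}{\frac{1}{N}\sum_{k=1}^{N} S(k)},$$

which approaches 1 for noise-like frames and 0 for purely tonal ones.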

Jiří Přibil, Anna Přibilová

Verbal and Nonverbal Features of Computational Phonetics

Voice Pleasantness of Female Voices and the Assessment of Physical Characteristics

It has been demonstrated that there seem to be non-linguistic patterns from which listeners infer the appearance of a speaker. This study complements the voice parameters collected so far as indicators of physical attributes and as factors in voice attractiveness. Since researchers have tended to prefer male voices for such analyses, this study is based on female voices and judges of both genders. Twenty female voices were played to 102 listeners, 52 male and 50 female. Because female and male listeners use different rating strategies, the group-specific rating strategies were compared; a bimodal rating dispersion was assumed and largely confirmed.

Vivien Zuta
Technical and Phonetic Aspects of Speech Quality Assessment: The Case of Prosody Synthesis

The present paper proposes a discussion of methods used for the subjective assessment of speech quality in the technical sciences and in linguistics. Stressing the fact that a purely mathematical evaluation of synthetic speech is not sufficient, we try to show that the perspectives and approaches used in the two scientific domains are not necessarily the same. We then proceed to a pilot experiment consisting of the assessment of synthetic sentences generated by five different prosodic models, by means of the MOS formalism (ITU-T P.800). The two groups of listeners involved (students of engineering vs. linguistics) provide rather similar results.

Jana Tučková, Jan Holub, Tomáš Duběda
Syntactic Doubling: Some Data on Tuscan Italian

The data of an experimental study on Syntactic Doubling (Raddoppiamento Sintattico, RS) in Tuscan Italian are reported and discussed. Three conditions are compared: (a) a consonant not subject to RS next to a consonant possibly subject to RS in adjacent stressed syllables; (b) RS with a phonological boundary intervening between trigger and target; and (c) RS with no phonological phrase boundary intervening. We consistently find lengthening in (b) (short RS), but more lengthening in (c) (long RS). Our data indicate that (short) RS applies across phonological phrases. The publication of these data was motivated by the recent interest raised by the ESF-funded project Edisyn [1] on dialect syntax, which aims, among other goals, to investigate doubling phenomena in European languages and dialects for cross-linguistic comparison.

Anna Esposito
Perception of Czech in Noise: Stability of Vowels

The paper is based on the results of perceptual tests focused on the recognition of Czech words in noise. Identification of vowels in syllabic nuclei proves most consistent. The resistance of individual phonemes and regularities in their substitution are examined, with primary focus on position in the word. It turns out that in Czech the position in the stressed syllable is not decisive for correct identification of vowels by listeners. Relations to duration and formant characteristics (F2/F1) likewise yield negative results. Identification rates vary distinctly even among particular occurrences of individual phonemes, and these differences show no verifiable relation to any particular acoustic or structural feature. This vindicates the assumption that the recognition of the acoustic word in adverse conditions relies on a complex of features which can act separately, in cooperation, or in competition.

Jitka Veroňková, Zdena Palková
Challenges in Segmenting the Czech Lateral Liquid

The study is part of a larger project focused on defining criteria for manual segmentation. Liquids are typically the most problematic sounds in this respect. We examine various acoustic correlates of the Czech alveolar lateral liquid /l/ and their role in isolating this sound in the acoustic stream. Relative formant intensity turned out to be the most reliable criterion. The results indicate a surprisingly high contribution of F3, whose intensity is often stronger in /l/ than in neighbouring vowels. This appears to be a new finding.

Radek Skarnitzl
Implications of Acoustic Variation for the Segmentation of the Czech Trill /r/

The Czech alveolar sonorant trill /r/, like liquids generally, constitutes a challenge from the point of view of locating its boundaries in the acoustic stream. As it is desirable to label and segment a phonetic corpus uniformly, and to facilitate a high degree of inter-labeller agreement, rules for specifying speech-sound boundaries should be unambiguous and as straightforward as possible. In this study, we examined various acoustic forms of Czech /r/ – from the trill and the flap to strongly reduced instances – and their implications for segmentation. The above-mentioned requirements made it necessary to treat the segmentation of intervocalic tokens of /r/ differently from tokens occurring in consonant clusters.

Pavel Machač
Voicing in Labial Plosives in Czech

Traditional phonetic descriptions of Czech consonants assume that voiced segments are always realized with complete vocal fold vibration. The present research addresses the importance of voicing in Czech labial plosives using a subset of the Prague Phonetic Corpus. Corpus studies provide a powerful tool for studying variation in actual realizations of speech and offer material for investigating its impact on perception. In the first study, the extent of vocal fold vibration in the voiced plosive /b/ in intervocalic position was examined. The results show that a considerable number of these plosives were realized with devoiced portions. The second study tested whether phonetically experienced Czech listeners were able to recognize differences in voicing. A strong relationship was found between the amount of vocal fold vibration in the targets and the ratings by the participants, indicating that voicing is a good predictor of the phonological voiced/voiceless distinction in Czech.

Annett B. Jorschick
Normalization of the Vocalic Space

Phonetic units constituting natural continuous speech display immense variation due to a substantial number of factors. Consequently, one of the key questions for speech scientists concerns the translation of individual bundles of acoustic features into conventional linguistic meanings or types. Although the problem of normalization of acoustic data is common to many areas of speech science, its solutions depend on the particular application objectives. An overview of developments in the field of normalization is presented from the perspective of the phonetic understanding of speech communication. The explanatory value of individual methodological outcomes is discussed. Both indexical (related to speaker identity) and contextual (related to linguistic form) factors are considered, and several normalization algorithms are compared with each other. Recent findings indicate that human listeners exploit not only visual cues but also their accumulated social experience when processing speech sounds.
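
One classical speaker normalization of this kind (a standard textbook example; this summary does not say it is among those compared) is Lobanov's z-score transform, which expresses each formant value relative to the speaker's own distribution:

$$F_n' = \frac{F_n - \mu_n}{\sigma_n},$$

where $\mu_n$ and $\sigma_n$ are the mean and standard deviation of the $n$-th formant computed over all vowels of one speaker.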

Jan Volín

Algorithmic and Theoretical Analysis of Multimodal Interfaces

Gaze Behaviors for Virtual Crowd Characters

Nowadays, crowds of virtual characters are used in many domains such as neuroscience, psychology, and computer science. Since, as human beings, we are natural experts in human appearance and movement, it is that much harder to correctly model and animate virtual characters. This becomes even more challenging when considering crowds of virtual characters: in addition to representation and animation, there is a mandatory trade-off between rich, realistic behaviors and computational cost. In this paper, we present a crowd engine to which we add an extra layer that allows its characters to produce gaze behaviors. We thus enhance crowd realism by allowing the characters composing the crowd to be aware of their environment, of other characters, and/or of a user.

Helena Grillon, Barbara Yersin, Jonathan Maïm, Daniel Thalmann
Gestural Abstraction and Restatement: From Iconicity to Metaphor

The question of abstraction and metaphor in gesture is particularly controversial. Some scholars, such as David McNeill, who first introduced this concept for gestures in a systematic way, think that gesture can convey abstract meaning and metaphoric thought, while others believe that gestures can only be considered iconic representations. This question is addressed here by means of an analysis of cases of “on-line” abstraction in the gestural production concurrent with restatements of path descriptions.

Nicla Rossini
Preliminary Prosodic and Gestural Characteristics of Instructing Acts in Polish Task-Oriented Dialogues

In the present study, selected properties of multimodal instructing acts are discussed. Realisations of instructing acts extracted from a corpus of task-oriented dialogues are analysed in terms of their syntactic structure, prosodic properties and accompanying gestures. The syntactic structures found in the material are similar to those found in earlier studies on map-task dialogues. Deictic vocabulary is more frequent in gesture-supported instructions. The mean relative pitch range is similar to the values obtained for instructions in earlier studies and different from the values for syntactically similar questions. As opposed to verbally ill-formed instructions, well-formed ones tend to contain at least one gestural stroke. It is shown that the relative pitch range is higher in gesture-accompanied instructing acts. It is also noted that prosody and gesture may play similar roles in utterances.

Maciej Karpiński
Polish Children’s Gesticulation in Narrating (Re-telling) a Cartoon

The study was aimed at a preliminary analysis of the nonverbal component of utterances produced by nine-year-old Polish children in the task of re-telling a cartoon. Most research on children’s gestures so far has been confined to younger subjects. The analyses presented in the article concern the relations between semantic contributions from the visual and auditory modalities (gestures and speech) as well as the viewpoint of gestures and the use of gesture space. Gestural phrases were tagged for their internal structure (phases) and gestures were categorized into basic types.

Ewa Jarmołowicz-Nowikow
Prediction of Learning Abilities Based on a Cross-Modal Evaluation of Non-verbal Mental Attributes Using Video-Game-Like Interfaces

The authors propose the thesis that today’s children, immersed in cyberspace, need to rely on different skills and mental attributes in order to interact successfully with knowledge. It is argued that learning pedagogies, as well as the corresponding assessment tools, must comply with the multi-modality principle. The paper describes a multimodal evaluation of the learning potential and reading/learning abilities of young children. The method is based on the assessment of non-verbal abilities using video-game-like interfaces. The results show that the ability to orientate and navigate, to sequence or categorize objects or events, and to discriminate visual and auditory stimuli, as well as short-term visual and auditory memory, can predict reading and learning abilities. Moreover, the combined assessment of several independent modalities significantly increases the predictive power.

Yiannis Laouris, Elena Aristodemou, Pantelis Makris
Automatic Sentence Modality Recognition in Children’s Speech, and Its Usage Potential in the Speech Therapy

In the Laboratory of Speech Acoustics, prosody recognition experiments have been carried out in which, among other things, we explored the possibilities of recognizing sentence modalities. Encouraged by promising results in sentence modality recognition, we adapted the method to children’s speech and examined how it could be used to provide automatic feedback in an audio-visual pronunciation teaching and training system. Our goal was to develop a sentence intonation teaching and training system for speech-handicapped children, helping them to learn the correct prosodic pronunciation of sentences. In the experiment, basic sentence modality models were developed and used. To train these models, we recorded a speech prosody database of correctly speaking children, processed and segmented according to modality type; 59 children read a text of one-word sentences and simple and complex sentences. HMM models of the modality types were built by training the recognizer on this database. The results of children’s sentence modality recognition were not adequate for automatic feedback in pronunciation training, so another classification was prepared: the recordings were rigorously re-sorted by the type of sentence intonation curve, which in many cases differed from the sentence modality class. Further tests were carried out with the new classes. The trained HMM models were then used not to recognize the modality of sentences but to check the correctness of the intonation of sentences pronounced by speech-handicapped children. For this purpose, an initial database consisting of recordings of the voices of two speech-handicapped children was prepared, similar to the database of healthy children.

Dávid Sztahó, Katalin Nagy, Klára Vicsi
Supporting Engagement and Floor Control in Hybrid Meetings

Remote participants in hybrid meetings often have problems following what is going on in the (physical) meeting room they are connected with. This paper describes a videoconferencing system for participation in hybrid meetings. The system has been developed as a research vehicle to see how technology based on automatic real-time recognition of conversational behavior in meetings can be used to improve engagement and floor control by remote participants. The system uses modules for online speech recognition and real-time visual focus of attention, as well as a module that signals who is being addressed by the speaker. A built-in keyword spotter allows an automatic meeting assistant to draw the remote participant’s attention when a topic of interest is raised, pointing at the transcription of the fragment to help them catch up.

Rieks op den Akker, Dennis Hofs, Hendri Hondorp, Harm op den Akker, Job Zwiers, Anton Nijholt
Behavioral Consistency Extraction for Face Verification

In this paper we investigate how computational statistical models derived from moving images can take part in the face recognition process. As a counterpart to psychological experiments showing a significant beneficial effect of non-rigid facial movement, two features obtained from face sequences, the central tendency and the type of movement variation, are combined to improve face verification compared with single static images. Using the General Group-wise Registration algorithm, correspondences across the sequences are captured to build a combined shape and appearance model parameterizing the face sequences. The parameters are projected into an identity-only space to find the central tendency of each subject. In addition, facial movement consistencies across different behaviors exhibited by the same subjects are recorded. These two features are fused by a confidence-based decision system for authentication applications. Using the BANCA video database, the results show that the extra information extracted from moving images significantly and efficiently improves performance.

Hui Fang, Nicholas Costen
Protecting Face Biometric DCT Templates by Means of Pseudo-random Permutations

Biometric template security and privacy are a great concern in biometric systems because, unlike passwords and tokens, compromised biometric templates cannot be revoked and reissued. In this paper we present a protection scheme based on a user-dependent pseudo-random ordering of the DCT template coefficients. In addition to enhancing privacy, this scheme also increases biometric recognition performance, because an attacker can hardly match a fake biometric sample without knowing the pseudo-random ordering.
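
A minimal sketch of the idea of key-seeded coefficient permutation (function and variable names are ours; this is not the authors' implementation):

```python
# User-keyed template protection by pseudo-random coefficient permutation.
import numpy as np

def protect_template(dct_coeffs: np.ndarray, user_key: int) -> np.ndarray:
    """Reorder DCT template coefficients with a key-seeded permutation."""
    rng = np.random.default_rng(user_key)    # user-dependent seed
    perm = rng.permutation(dct_coeffs.size)  # pseudo-random ordering
    return dct_coeffs[perm]

# Matching is done in the permuted domain; a compromised template can be
# "revoked" by issuing a new key, which yields a new permutation.
template = np.random.randn(64)               # stand-in DCT feature vector
enrolled = protect_template(template, user_key=123456)
probe = protect_template(template + 0.01 * np.random.randn(64), user_key=123456)
print("match distance:", np.linalg.norm(enrolled - probe))
```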

Marcos Faundez-Zanuy
Facial Expressions Recognition from Image Sequences

Human-machine interaction is one of the emerging fields of the coming years. Interaction with others in our daily life is face to face. Faces are the natural way of interaction between humans and hence are also useful in human-machine interaction.

This paper describes a novel technique for recognizing human facial expressions and exploiting this task for human-machine interaction. We use a 2D model-based approach for human facial expression recognition. An active shape model (ASM) is fitted to the face image and texture information is extracted. This shape and texture information is combined with optical-flow-based temporal information from the image sequences to form a feature vector for the image. We experimented on image sequences of 97 different persons from the Cohn-Kanade Facial Expression Database. A classification rate of 92.4% is obtained using a binary decision tree classifier, whereas a classification rate of 96.4% is obtained using a pairwise classifier based on support vector machines. The system is capable of working in real time.

Zahid Riaz, Christoph Mayer, Michael Beetz, Bernd Radig
Czech Artificial Computerized Talking Head George

This contribution is about a computer-implemented Czech-speaking talking head called “George”. This talking head is based on a fully parametric, photo-realistic 3D model of a human head. The creation and development of the talking head are described here in detail. The talking head George produces realistic animation of face, lip and jaw movements synchronized with synthetic speech from a Czech text-to-speech synthesis system. Potential applications of this artificial talking head are described in the last part of the paper.

Josef Chaloupka, Zdenek Chaloupka
An Investigation into Audiovisual Speech Correlation in Reverberant Noisy Environments

As evidence of a link between the various human communication production domains has become more prominent in the last decade, the field of multimodal speech processing has undergone significant expansion. Many different specialised processing methods have been developed to analyze and utilize the complex relationship between multimodal data streams. This work uses information extracted from an audiovisual corpus to investigate and assess the correlation between audio and visual features in speech. A number of different feature extraction techniques are assessed, with the intention of identifying the visual technique that maximizes the audiovisual correlation. Additionally, this paper aims to demonstrate that a noisy and reverberant audio environment reduces the degree of audiovisual correlation, and that the application of a beamformer remedies this. Experimental results, obtained in a synthetic scenario, confirm the positive impact of beamforming, not only for improving the audio-visual correlation but also within a complete audio-visual speech enhancement scheme. This work thus highlights an important aspect of the development of future bimodal speech enhancement systems.

Simone Cifani, Andrew Abel, Amir Hussain, Stefano Squartini, Francesco Piazza
Articulatory Speech Re-synthesis: Profiting from Natural Acoustic Speech Data

The quality of static phones (e.g. vowels, fricatives, nasals, laterals) generated by articulatory speech synthesizers has reached a high level in recent years. Our goal is to extend this high quality to dynamic speech, i.e. whole syllables, words, and utterances, by re-synthesizing natural acoustic speech data. Re-synthesis means that vocal tract action units or articulatory gestures, describing the succession of speech movements, are adapted spatio-temporally with respect to a natural speech signal produced by a natural “model speaker” of Standard German. This adaptation is performed using the software tool SAGA (Sound and Articulatory Gesture Alignment), currently under development in our lab. The resulting action unit scores are stored in a database and serve as input for our articulatory speech synthesizer. This technique is designed to be the basis for unit-selection articulatory speech synthesis in the future.

Dominik Bauer, Jim Kannampuzha, Bernd J. Kröger
A Blind Source Separation Based Approach for Speech Enhancement in Noisy and Reverberant Environment

The international scientific community has put considerable effort into the Speech Enhancement (SE) research field, especially because of its many applications (such as human-machine dialogue systems and speaker identification/verification). An innovative SE scheme is presented in this work: it integrates a spatial method (Blind Source Separation, BSS) with a temporal method (Adaptive Noise Canceller, ANC) and a final stage composed of Multichannel Signal Detection and a Post Filter (MSD+PF) to enhance vocal signals in noisy and reverberant environments. We used a broadband BSS algorithm to separate target and interference signals in real reverberant scenarios, with the two post-processing stages, ANC and MSD+PF, in cascade with the first to improve the separation yielded by the BSS. In particular, the ANC further reduces the residual interference still present in the desired target signal after separation, using the other output of the BSS stage as a reference. Real-time computer simulations show progressive improvements across the different processing stages in terms of the chosen quality parameter, i.e. the coherence between the two output channels.
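
For reference, the magnitude-squared coherence commonly used as such a quality measure between two channels $x$ and $y$ (the exact variant used in the paper may differ) is

$$C_{xy}(f) = \frac{|S_{xy}(f)|^2}{S_{xx}(f)\,S_{yy}(f)},$$

where $S_{xy}$ is the cross power spectral density and $S_{xx}$, $S_{yy}$ are the auto power spectral densities.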

Alessio Pignotti, Daniele Marcozzi, Simone Cifani, Stefano Squartini, Francesco Piazza
Quantitative Analysis of the Relative Local Speech Rate

This paper deals with the analysis of the instantaneous relative speech rate. It presents the design of an algorithm for its determination based on the dynamic time warping (DTW) method. It also shows a practical application of relative speech rate determination to pathological discourse. From the point of view of speech rate, we examine the discourse of patients with Parkinson’s disease, persons who stammer, and patients after cochlear implantation.
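
A minimal sketch of the general idea, with the local slope of the DTW warping path standing in for the relative speech rate (our illustration under those assumptions, not the author's actual algorithm):

```python
# DTW-based local speech-rate estimation (illustrative sketch).
import numpy as np

def dtw_path(ref, test):
    """Optimal alignment path between two feature sequences."""
    n, m = len(ref), len(test)
    D = np.full((n + 1, m + 1), np.inf)
    D[0, 0] = 0.0
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            cost = np.linalg.norm(np.asarray(ref[i - 1]) - np.asarray(test[j - 1]))
            D[i, j] = cost + min(D[i - 1, j - 1], D[i - 1, j], D[i, j - 1])
    path, i, j = [], n, m
    while i > 0 and j > 0:  # backtrack from the end of both sequences
        path.append((i - 1, j - 1))
        k = np.argmin([D[i - 1, j - 1], D[i - 1, j], D[i, j - 1]])
        i, j = (i - 1, j - 1) if k == 0 else ((i - 1, j) if k == 1 else (i, j - 1))
    return path[::-1]

def local_relative_rate(path, win=20):
    """Slope of the warping path over a sliding window: reference frames
    covered per test frame; values > 1 mean the test speaker is locally
    faster than the reference."""
    p = np.array(path)
    return np.array([(p[k + win, 0] - p[k, 0]) / max(p[k + win, 1] - p[k, 1], 1)
                     for k in range(len(p) - win)])
```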

Jan Janda
Czech Spontaneous Speech Collection and Annotation: The Database of Technical Lectures

As speech recognition is applied in real working systems, spontaneous speech recognition gains increasing importance. For the development of such applications, a spontaneous speech database is clearly needed, both for general design and for training and testing. This paper describes the collection of Czech spontaneous data recorded during technical lectures. It is intended as material for the analysis of particular phenomena which appear in spontaneous speech, and also as extension material for the training of spontaneous speech recognizers. The presence of spontaneous speech phenomena, such as a higher rate of non-speech events, changes in pronunciation, or sentence irregularities, should be the most important contribution of the collected database for training purposes, in comparison to using available read speech databases only. Speech signals are captured in two different channels of slightly different quality, and about 14 hours of speech from 15 different speakers have so far been collected and annotated. First analyses of spontaneous-speech-related effects in the collected data have been performed, and a comparison with read speech databases is presented.

Josef Rajnoha, Petr Pollák
BSSGUI – A Package for Interactive Control of Blind Source Separation Algorithms in MATLAB

This paper introduces a Matlab graphical user interface (GUI) that provides easy operation of several Blind Source Separation (BSS) algorithms together with adjustment of their parameters. BSSGUI enables working with input and output data, multiple signal plots, and saving of output variables to the base Matlab workspace or to a file. A Monte Carlo analysis allows validation of particular features of the BSS algorithms integrated into the package. The BSSGUI package is available for free at http://bssgui.wz.cz.

Jakub Petkov, Zbyněk Koldovský
Accuracy Analysis of Generalized Pronunciation Variant Selection in ASR Systems

Automated speech recognition systems typically work with a pronunciation dictionary for generating the expected phonetic content of particular words in a recognized utterance. But pronunciation can vary in many situations. Besides cases with several possible pronunciation variants specified manually in the dictionary, there are typically many other possible changes in pronunciation depending on word context or speaking style, which is very typical for Czech. In this paper we study the accuracy of the proper selection of automatically predicted pronunciation variants in Czech HMM-based ASR systems. We analyzed the correctness of pronunciation variant selection in the forced alignment of known utterances used as ASR training data. Using the proper pronunciation variant, more exact transcriptions of utterances were created for further purposes, mainly for more accurate training of acoustic HMM models. Finally, as the target and most important application is LVCSR, the accuracy of LVCSR results using different levels of automated pronunciation generation was tested.

Václav Hanžl, Petr Pollák
Analysis of the Possibilities to Adapt the Foreign Language Speech Recognition Engines for the Lithuanian Spoken Commands Recognition

This paper presents our efforts to adapt foreign-language speech recognition engines to the recognition of Lithuanian speech commands in speaker-independent applications. Speakers of less widespread languages (such as Lithuanian) have several choices: to develop their own speech recognition engines, or to try to adapt speech recognition models developed and trained for foreign languages to the recognition of their native spoken language. The first approach is expensive in terms of time, money and human resources. The second approach can lead to faster implementation of Lithuanian speech recognition modules in practical tasks, but the proper adaptation and optimization procedures must be found and investigated. The experimental investigation shows that promising results can be achieved with relatively modest investment.

Rytis Maskeliunas, Algimantas Rudzionis, Vytautas Rudzionis
MLLR Transforms Based Speaker Recognition in Broadcast Streams

This paper deals with the utilization of maximum likelihood linear regression (MLLR) adaptation transforms for speaker recognition in broadcast news streams. This task is characterized particularly by widely varying acoustic conditions, microphones, transmission channels and background noise, and by the short duration of recordings (usually in the range from 5 to 15 seconds). Features based on MLLR transforms are modeled using support vector machines (SVM). The results obtained are compared with a GMM-based system using traditional MFCC features. The paper also deals with inter-session variability compensation techniques suitable for both systems and emphasizes the importance of feature vector scaling for the SVM-based system.
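
For context, MLLR adapts the Gaussian means $\mu$ of a speaker-independent model with an affine transform estimated from the speaker's data,

$$\hat{\mu} = A\mu + b,$$

and it is the estimated transform parameters $(A, b)$, typically stacked into a vector, that serve as speaker features for the SVM; the exact feature construction in the paper may differ.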

Jan Silovsky, Petr Cerva, Jindrich Zdansky
Backmatter
Metadata
Title
Cross-Modal Analysis of Speech, Gestures, Gaze and Facial Expressions
Edited by
Anna Esposito
Robert Vích
Copyright year
2009
Publisher
Springer Berlin Heidelberg
Electronic ISBN
978-3-642-03320-9
Print ISBN
978-3-642-03319-3
DOI
https://doi.org/10.1007/978-3-642-03320-9
