
2010 | Book

Development of Multimodal Interfaces: Active Listening and Synchrony

Second COST 2102 International Training School, Dublin, Ireland, March 23-27, 2009, Revised Selected Papers

Edited by: Anna Esposito, Nick Campbell, Carl Vogel, Amir Hussain, Anton Nijholt

Publisher: Springer Berlin Heidelberg

Book series: Lecture Notes in Computer Science


About this book

This volume brings together, through a peer-review process, the advanced research results obtained by the European COST Action 2102: Cross-Modal Analysis of Verbal and Nonverbal Communication, primarily discussed for the first time at the Second COST 2102 International Training School on “Development of Multimodal Interfaces: Active Listening and Synchrony”, held in Dublin, Ireland, March 23–27, 2009. The school was sponsored by COST (European Cooperation in the Field of Scientific and Technical Research, www.cost.esf.org) in the domain of Information and Communication Technologies (ICT) to disseminate the advances of the research activities developed within COST Action 2102: “Cross-Modal Analysis of Verbal and Nonverbal Communication” (cost2102.cs.stir.ac.uk). In its third year, COST Action 2102 brought together about 60 European and 6 overseas scientific laboratories whose aim is to develop interactive dialogue systems and intelligent virtual avatars, graphically embodied in a 2D and/or 3D interactive virtual world, capable of interacting intelligently with the environment, other avatars, and particularly with human users.

Table of Contents

Frontmatter
Spacing and Orientation in Co-present Interaction

An introduction to the way in which people arrange themselves spatially in various kinds of focused interaction, especially conversation. It is shown how participants may jointly establish and maintain a spatial-orientational system, referred to as an F-formation, which functions as part of the way in which participants in conversation preserve the integrity of their occasion of interaction and jointly manage their attention.

Adam Kendon
Group Cohesion, Cooperation and Synchrony in a Social Model of Language Evolution

Experiments conducted in a simulation environment demonstrated that both implicit coordination and explicit cooperation among agents lead to the rapid emergence of systems with key properties of natural languages, even under very pessimistic assumptions about shared information states. In this setting, cooperation is shown to elicit more rapid convergence on greater levels of understanding in populations that do not expand, but which grow more intimate, than in groups that may expand and contract. There is a smaller but significant effect of synchronized segmentation of utterances. The models show the distortions in synonymy and homonymy rates that are exhibited by natural languages, but relative conformity with what one would rationally build into an artificial language to achieve successful communication: understanding correlates with synonymy rather than homonymy.

Carl Vogel
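
The emergence dynamics summarized above can be illustrated with a minimal naming-game simulation: agents invent and align words for meanings through pairwise interactions, and synonymy falls as the population converges. This is a generic toy stand-in under my own assumptions, not Vogel's model or its coordination/cooperation conditions.

```python
import random

def naming_game(n_agents=10, n_meanings=5, rounds=3000, seed=1):
    """Toy naming game: track how many words per meaning survive
    after repeated speaker-hearer interactions (illustrative only)."""
    random.seed(seed)
    # Each agent keeps, per meaning, a set of candidate words.
    lexicon = [{m: set() for m in range(n_meanings)} for _ in range(n_agents)]
    for _ in range(rounds):
        speaker, hearer = random.sample(range(n_agents), 2)
        meaning = random.randrange(n_meanings)
        if not lexicon[speaker][meaning]:
            lexicon[speaker][meaning].add(f"w{random.randrange(10**6)}")
        word = random.choice(sorted(lexicon[speaker][meaning]))
        if word in lexicon[hearer][meaning]:
            # Success: both agents collapse to the shared word (alignment).
            lexicon[speaker][meaning] = {word}
            lexicon[hearer][meaning] = {word}
        else:
            lexicon[hearer][meaning].add(word)
    # Synonymy: average number of words an agent keeps per meaning.
    return sum(len(ws) for lex in lexicon for ws in lex.values()) / (
        n_agents * n_meanings)

print(naming_game())   # tends toward 1.0 as the population converges
```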
Pointing Gestures and Synchronous Communication Management

The focus of this paper is on pointing gestures that do not function as deictic pointing to a concrete referent but rather serve to structure the flow of information. Examples are given of their use in giving feedback and creating common ground in natural conversations, and their meaning is described with the help of semantic themes of the Index Finger Extended gesture family. A communication model is also sketched, together with an exploration of the simultaneous occurrence of gestures and speech signals, using a two-way approach that combines top-down linguistic-pragmatic and bottom-up signal analysis.

Kristiina Jokinen
How an Agent Can Detect and Use Synchrony Parameter of Its Own Interaction with a Human?

Psychology identifies synchrony as a crucial parameter of any social interaction: to give a human a feeling of natural interaction and of agency [17], an agent must be able to synchronise with that human at the appropriate time [29][11][15][16][27]. In the following experiment, we show that synchrony can be more than a state to reach during interaction; it can also serve as a usable cue to the human’s satisfaction and level of engagement in the ongoing interaction: the better the interaction, the more synchronous with the agent the human becomes. We built an architecture that can measure a human partner’s level of synchrony and use this parameter to adapt the agent’s behavior. This architecture detects the temporal relation [1] existing between the actions of the agent and the actions of the human. We used this detected level of synchrony as a reinforcement signal for learning [6]: the more constant the temporal relation between agent and human remains, the more positive the reinforcement; conversely, if the temporal relation varies above a threshold, the reinforcement is negative. In a teaching task, this architecture enables naive humans to make the agent learn left-right associations simply by means of intuitive interactions. The convergence of this learning, reinforced by synchrony, shows that synchrony conveys current information about human satisfaction and that we are able to extract and reuse this information to adapt the agent’s behavior appropriately.

Ken Prepin, Philippe Gaussier
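
The core idea described above, reading the stability of the agent-human temporal relation as a reinforcement signal, can be sketched as follows. The window size and threshold are arbitrary illustrative values; the paper's actual architecture is not reproduced here.

```python
import numpy as np

def synchrony_reinforcement(agent_times, human_times, window=10, threshold=0.15):
    """Return +1 if the agent-human response delays stay nearly constant
    over the last `window` action pairs, -1 if their spread exceeds
    `threshold` seconds (illustrative values, not the paper's)."""
    # Delay between each agent action and the corresponding human action.
    delays = np.asarray(human_times) - np.asarray(agent_times)
    recent = delays[-window:]
    # A stable temporal relation (low spread) is read as a good interaction.
    return 1.0 if np.std(recent) < threshold else -1.0

# Example: the human's responses drift apart -> negative reinforcement.
agent = np.arange(0.0, 10.0, 1.0)
human = agent + 0.4 + np.linspace(0.0, 0.8, 10)   # steadily growing delay
print(synchrony_reinforcement(agent, human))       # -1.0
```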
Accessible Speech-Based and Multimodal Media Center Interface for Users with Physical Disabilities

We present a multimodal media center user interface with a hands-free speech recognition input method for users with physical disabilities. In addition to speech input, the application features a zoomable context + focus graphical user interface and several other modalities, including speech output, haptic feedback, and gesture input. These features have been developed in co-operation with representatives from the target user groups. In this article, we focus on the speech input interface and its evaluations. We discuss the user interface design and results from a long-term pilot study taking place in the homes of physically disabled users, and compare the results to a public pilot study and laboratory studies carried out with non-disabled users.

Markku Turunen, Jaakko Hakulinen, Aleksi Melto, Juho Hella, Tuuli Laivo, Juha-Pekka Rajaniemi, Erno Mäkinen, Hannu Soronen, Mervi Hansen, Santtu Pakarinen, Tomi Heimonen, Jussi Rantala, Pellervo Valkama, Toni Miettinen, Roope Raisamo
A Controller-Based Animation System for Synchronizing and Realizing Human-Like Conversational Behaviors

Embodied Conversational Agents (ECAs) are an application of virtual characters that is the subject of considerable ongoing research. An essential prerequisite for creating believable ECAs is the ability to describe and visually realize multimodal conversational behaviors. The recently developed Behavior Markup Language (BML) seeks to address this requirement by providing a means to specify physical realizations of multimodal behaviors through human-readable scripts. In this paper we present an approach to implementing a behavior realizer compatible with the BML language. The system’s architecture is based on hierarchical controllers which apply preprocessed behaviors to body modalities. The animation database is easily extensible and contains behavior examples constructed from existing lexicons and gesture theory. Furthermore, we describe a novel solution to the issue of synchronizing gestures with synthesized speech using neural networks, and propose improvements to the BML specification.

Aleksandra Čereković, Tomislav Pejša, Igor S. Pandžić
Generating Simple Conversations

This paper describes the Conversation Simulator, a software program designed to generate simulated conversations. The simulator consists of scriptable agents that can exchange speech acts and conversational signals. The paper illustrates how such a tool can be used to study some effects of seemingly straightforward choices in conversations.

Mark ter Maat, Dirk Heylen
Media Differences in Communication

With the ever-growing ubiquity of computer-mediated communication, the application of language research to computer-mediated environments becomes increasingly relevant. How do overhearer effects, discourse markers, differences for monologues and dialogues, and other verbal findings transmute in the transition from face-to-face to computer-mediated communication (CMC)? Which of these factors have an impact on CMC? Furthermore, how can computer interfaces alleviate these potential shortcomings? When is CMC the preferred communicative medium? These questions are explored in this paper.

Roxanne B. Raine
Towards Influencing of the Conversational Agent Mental State in the Task of Active Listening

The proposed paper describes an approach used to influence the conversational agent Greta’s mental state. The beginning of this paper introduces the problem of conversational agents, especially in the listener role. The listener’s backchannels also influence its mental state. A simple agent state manager was developed to affect Greta’s internal state. After describing this manager, we present an overview of evaluation experiments carried out to obtain information about the agent state manager’s functionality, as well as the impact of the mental state changes on the overall interaction.

Stanislav Ondáš, Elisabetta Bevacqua, Jozef Juhár, Peter Demeter
Integrating Emotions in the TRIPLE ECA Model

This paper presents the introduction of emotion-based mechanisms into the TRIPLE ECA model. TRIPLE is a hybrid cognitive model consisting of three interacting modules, the reasoning, connectionist, and emotion engines, running in parallel. The interplay between these three modules is discussed in the paper, with a focus on the role and implementation of the emotion engine, which is based on the FAtiMA agent architecture. The influence of emotions in TRIPLE is related to the volume of the working memory, the speed of the inference mechanisms, the interaction between the reasoning and the connectionist engine, and the connectionist engine itself. Emotions are expected to enhance the most important cognitive aspects of the model, such as context sensitivity, rich experiential episodic knowledge and anticipatory mechanisms.

Kiril Kiryazov, Maurice Grinberg
Manipulating Stress and Cognitive Load in Conversational Interactions with a Multimodal System for Crisis Management Support

The quality assessment of multimodal conversational interfaces is influenced by many factors, of which stress and cognitive load are two of the most important. In the literature, these two factors are considered to be related and are accordingly summarized under the single concept of ‘cognitive demand’. However, our assumption is that even if they are related, these two factors can still occur independently. Therefore, it is essential to control their levels during the interaction in order to determine the impact that each factor has on the perceived conversational quality. In this paper we present preliminary experiments in which we tried to achieve a factor separation by inducing alternating low/high levels of both stress and cognitive load. The stress/cognitive load levels were manipulated by varying task difficulty, information presentation and time pressure. Physiological measurements, performance metrics, as well as subjective reports were deployed to validate the induced stress and cognitive load levels. Results showed that our manipulations were successful for cognitive load and partly successful for stress. The levels of both factors were better indicated by subjective reports and performance metrics than by physiological measurements.

Andreea Niculescu, Yujia Cao, Anton Nijholt
Sentic Computing: Exploitation of Common Sense for the Development of Emotion-Sensitive Systems

Emotions are a fundamental component of human experience, cognition, perception, learning and communication. In this paper we explore how the use of Common Sense Computing can significantly enhance computers’ emotional intelligence, i.e., their capability of perceiving and expressing emotions, in order to allow machines to make more human-like decisions and improve human-computer interaction.

Erik Cambria, Amir Hussain, Catherine Havasi, Chris Eckl
Face-to-Face Interaction and the KTH Cooking Show

We share our experiences with integrating motion capture recordings into speech and dialogue research by describing (1) Spontal, a large project collecting 60 hours of video, audio and motion-capture spontaneous dialogues, with special attention to motion capture and its pitfalls; (2) a tutorial where we use motion capture, speech synthesis and an animated talking head to allow students to create an active listener; and (3) brief preliminary results in the form of visualizations of motion capture data over time in a Spontal dialogue. Given the lack of writings on the use of motion capture for speech research, we hope these accounts will prove inspirational and informative.

Jonas Beskow, Jens Edlund, Björn Granström, Joakim Gustafson, David House
Affect Listeners: Acquisition of Affective States by Means of Conversational Systems

We present the concept and motivations for the development of Affect Listeners, conversational systems aiming to detect and adapt to affective states of users, and meaningfully respond to users’ utterances both at the content- and affect-related level. In this paper, we describe the system architecture and the initial set of core components and mechanisms applied, and discuss the application and evaluation scenarios of Affect Listener systems.

Marcin Skowron
Nonverbal Synchrony or Random Coincidence? How to Tell the Difference

Nonverbal synchrony in face-to-face interaction has been studied in numerous empirical investigations focusing on various communication channels. Furthermore, the pervasiveness of synchrony in physics, chemistry and biology adds to its face validity. This paper focuses on establishing criteria for a statistical evaluation of synchrony in human interaction. When assessing synchrony in any communication context, it is necessary to distinguish genuine synchrony from pseudosynchrony, which may arise through random coincidence. By using a bootstrap approach, we demonstrate a way to quantify the amount of synchrony that goes beyond random coincidence, thus establishing an objective measure for the phenomenon. Applying this technique to psychotherapy data, we develop a hypothesis-driven empirical evaluation of nonverbal synchrony. The method of surrogate testing to control for chance is applicable to any corpus of empirical data and lends itself to better empirically informed inference.

Fabian Ramseyer, Wolfgang Tschacher
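
The surrogate ("pseudosynchrony") idea described above can be sketched as follows: score the genuine pairing of two movement time series, then compare it against the distribution of scores from many shuffled pairings. The maximum lagged cross-correlation and the coarse segment shuffling below are my own assumptions standing in for the paper's exact bootstrap procedure.

```python
import numpy as np

def max_lagged_xcorr(a, b, max_lag=50):
    """Peak absolute Pearson correlation over a range of lags."""
    a = (a - a.mean()) / a.std()
    b = (b - b.mean()) / b.std()
    best = 0.0
    for lag in range(-max_lag, max_lag + 1):
        if lag >= 0:
            x, y = a[lag:], b[:len(b) - lag]
        else:
            x, y = a[:lag], b[-lag:]
        best = max(best, abs(np.corrcoef(x, y)[0, 1]))
    return best

def surrogate_test(a, b, n_surrogates=200, seed=0):
    """Fraction of segment-shuffled surrogates scoring >= the genuine pair."""
    rng = np.random.default_rng(seed)
    observed = max_lagged_xcorr(a, b)
    segments = np.array_split(b, 20)          # shuffle b in coarse segments
    count = 0
    for _ in range(n_surrogates):
        order = rng.permutation(len(segments))
        b_surr = np.concatenate([segments[i] for i in order])
        if max_lagged_xcorr(a, b_surr) >= observed:
            count += 1
    return observed, count / n_surrogates     # score and pseudo p-value
```

A low surrogate fraction suggests the observed coordination exceeds what random coincidence would produce.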
Biometric Database Acquisition Close to “Real World” Conditions

In this paper we present an autonomous biometric device developed in the framework of a national project. This system is able to capture speech, hand geometry, online signature and face, and can open a door when the user is positively verified. Nevertheless, the main purpose is to acquire a database without supervision (databases are normally collected in the presence of a supervisor who tells the user what to do in front of the device, which is an unrealistic situation). This system will allow us to explain the main differences between what we call "real conditions" as opposed to "laboratory conditions".

Marcos Faundez-Zanuy, Joan Fàbregas, Miguel Ángel Ferrer-Ballester, Aythami Morales, Javier Ortega-Garcia, Guillermo Gonzalez de Rivera, Javier Garrido
Optimizing Phonetic Encoding for Viennese Unit Selection Speech Synthesis

While developing lexical resources for a particular language variety (Viennese), we experimented with a set of 5 different phonetic encodings, termed phone sets, used for unit selection speech synthesis. We started with a very rich phone set based on phonological considerations and covering as much phonetic variability as possible, which was then reduced to smaller sets by applying transformation rules that map or merge phone symbols. The optimal trade-off was found by measuring the phone error rates of automatically learnt grapheme-to-phone rules and by a perceptual evaluation of 27 representative synthesized sentences. Further, we describe a method to semi-automatically enlarge the lexical resources for the target language variety using a lexicon base for Standard Austrian German.

Michael Pucher, Friedrich Neubarth, Volker Strom
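
The phone-set reduction described above amounts to applying symbol-mapping rules to a rich transcription. A toy sketch follows; the symbols and merge rules are invented for illustration and are not the Viennese phone sets from the paper.

```python
# Hypothetical merge rules: each rule maps a rich phone symbol to a
# coarser one; symbols not listed are kept unchanged.
MERGE_RULES = {
    "e_nasal": "e",     # drop a nasalization distinction
    "a_long": "a:",     # collapse length variants
    "r_uvular": "r",    # merge rhotic variants
}

def reduce_phone_set(transcription):
    """Map a list of rich phone symbols onto a smaller phone set."""
    return [MERGE_RULES.get(p, p) for p in transcription]

print(reduce_phone_set(["h", "a_long", "r_uvular", "e_nasal"]))
# ['h', 'a:', 'r', 'e']
```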
Advances on the Use of the Foreign Language Recognizer

This paper presents our most recent activities in adapting a foreign-language-based speech recognition engine for the recognition of Lithuanian speech commands. As presented in our earlier papers, speakers of less widespread languages (such as Lithuanian) have several choices: to develop their own speech recognition engines, or to try adapting speech recognition models developed and trained for foreign languages to the task of recognizing their native spoken language. The second approach can lead to faster implementation of Lithuanian speech recognition modules in some practical tasks, but proper adaptation and optimization procedures need to be found and investigated. This paper presents our activities to improve the recognition of Lithuanian voice commands using multiple transcriptions per command and an English recognizer.

Rytis Maskeliunas, Algimantas Rudzionis, Vytautas Rudzionis
Challenges in Speech Processing of Slavic Languages (Case Studies in Speech Recognition of Czech and Slovak)

Slavic languages pose a big challenge for researchers dealing with speech technology. They exhibit a large degree of inflection, namely declension of nouns, pronouns and adjectives, and conjugation of verbs. This has a large impact on the size of lexical inventories in these languages and significantly complicates the design of text-to-speech and, in particular, speech-to-text systems. In the paper, we demonstrate some of the typical features of the Slavic languages and show how they can be handled in the development of practical speech processing systems. We present the solutions we applied in the design of voice dictation and broadcast speech transcription systems developed for Czech. Furthermore, we demonstrate how these systems can be converted to another similar Slavic language, in our case Slovak. All the presented systems operate in real time with very large vocabularies (350K words in Czech, 170K words in Slovak) and some of them have already been deployed in practice.

Jan Nouza, Jindrich Zdansky, Petr Cerva, Jan Silovsky
Multiple Feature Extraction and Hierarchical Classifiers for Emotions Recognition

The recognition of the emotional state of a speaker is a multi-disciplinary research area that has received great interest in recent years. One of the most important goals is to improve voice-based human-machine interaction. Recent work in this domain uses the prosodic features and the spectral characteristics of the speech signal with standard classifier methods. Furthermore, for traditional methods the improvement in performance has also reached a limit. In this paper, the spectral characteristics of emotional signals are used in order to group emotions. Standard classifiers based on Gaussian Mixture Models, Hidden Markov Models and Multilayer Perceptrons are tested. These classifiers have been evaluated in different configurations with different features, in order to design a new hierarchical method for emotion classification. The proposed multiple-feature hierarchical method improves performance by 6.35% over the standard classifiers.

Enrique M. Albornoz, Diego H. Milone, Hugo L. Rufiner
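
A hedged sketch of the two-stage idea described above: first assign an utterance to a spectrally defined emotion group, then discriminate within the group. The grouping, feature handling and scikit-learn models below are placeholders, not the configuration tuned in the paper.

```python
import numpy as np
from sklearn.neural_network import MLPClassifier
from sklearn.mixture import GaussianMixture

class HierarchicalEmotionClassifier:
    """Stage 1: an MLP picks a spectral group; stage 2: per-group GMMs
    (one per emotion) pick the emotion by maximum log-likelihood."""

    def __init__(self, groups):
        self.groups = groups                      # {group: [emotions]}
        self.stage1 = MLPClassifier(hidden_layer_sizes=(32,), max_iter=500)
        self.stage2 = {}                          # {(group, emotion): GMM}

    def fit(self, X, emotions):
        emotion_to_group = {e: g for g, es in self.groups.items() for e in es}
        y_group = np.array([emotion_to_group[e] for e in emotions])
        self.stage1.fit(X, y_group)
        for g, es in self.groups.items():
            for e in es:
                gmm = GaussianMixture(n_components=4)
                gmm.fit(X[np.array(emotions) == e])
                self.stage2[(g, e)] = gmm
        return self

    def predict(self, X):
        groups = self.stage1.predict(X)
        preds = []
        for x, g in zip(X, groups):
            scores = {e: self.stage2[(g, e)].score(x.reshape(1, -1))
                      for e in self.groups[g]}
            preds.append(max(scores, key=scores.get))
        return preds
```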
Emotional Vocal Expressions Recognition Using the COST 2102 Italian Database of Emotional Speech

The present paper proposes a new speaker-independent approach to the classification of emotional vocal expressions by using the COST 2102 Italian database of emotional speech. The audio records extracted from video clips of Italian movies possess a certain degree of spontaneity and are either noisy or slightly degraded by an interruption, making the collected stimuli more realistic in comparison with available emotional databases containing utterances recorded under studio conditions. The audio stimuli represent 6 basic emotional states: happiness, sarcasm/irony, fear, anger, surprise, and sadness. For these more realistic conditions, and using a speaker-independent approach, the proposed system is able to classify the emotions under examination with 60.7% accuracy by using a hierarchical structure consisting of a Perceptron and fifteen Gaussian Mixture Models (GMM) trained to distinguish within each pair of emotions under examination. The features with the highest discriminative power were selected by using the Sequential Floating Forward Selection (SFFS) algorithm from a large number of spectral, prosodic and voice quality features. The results were compared with the subjective evaluation of the stimuli provided by human subjects.

Hicham Atassi, Maria Teresa Riviello, Zdeněk Smékal, Amir Hussain, Anna Esposito
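
The SFFS feature selection step mentioned above can be approximated with off-the-shelf tooling. The sketch below assumes the mlxtend package and a generic nearest-neighbour classifier on random placeholder data; it is not the paper's feature set or pairwise GMM/Perceptron setup.

```python
import numpy as np
from mlxtend.feature_selection import SequentialFeatureSelector as SFS
from sklearn.neighbors import KNeighborsClassifier

# X: utterance-level spectral/prosodic/voice-quality features,
# y: emotion labels (placeholder random data for the sketch).
rng = np.random.default_rng(0)
X = rng.normal(size=(120, 20))
y = rng.integers(0, 6, size=120)

# Sequential Floating Forward Selection: forward=True, floating=True.
sffs = SFS(KNeighborsClassifier(n_neighbors=3),
           k_features=5, forward=True, floating=True,
           scoring="accuracy", cv=3)
sffs = sffs.fit(X, y)
print("selected feature indices:", sffs.k_feature_idx_)
```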
Microintonation Analysis of Emotional Speech

The paper addresses the reflection of microintonation in male and female acted emotional speech. The microintonation component of speech melody is analyzed with respect to its spectral and statistical parameters. The statistical results of the microintonation analysis show good correlation between male and female voices for four emotional states (joy, sadness, anger, and a neutral state) portrayed by several professional actors.

Jiří Přibil, Anna Přibilová
Speech Emotion Modification Using a Cepstral Vocoder

This paper deals with speech modification using a cepstral vocoder with the intent to change the emotional content of speech. The cepstral vocoder contains analysis and synthesis stages. The analysis stage performs the estimation of speech parameters: vocal tract properties, fundamental frequency, intensity, etc. In this parametric domain, segmental and suprasegmental speech modifications may be performed, and then the speech can be reconstructed using the parametric source-filter cepstral model. We use the described cepstral vocoder and speech parameter modifications as a tool for research in emotional speech modeling and synthesis. The paper focuses on the description of this system and its possibilities rather than on precise settings of parameter modifications for speech generation with given emotions. The system is still under development. Plans for future research are briefly summarized.

Martin Vondra, Robert Vích
Analysis of Emotional Voice Using Electroglottogram-Based Temporal Measures of Vocal Fold Opening

Descriptions of emotional voice type have typically been provided in terms of fundamental frequency (f0), intensity and duration. Further features, such as measures of laryngeal characteristics, may help to improve the recognition of emotional colouring in speech and expressiveness in speech synthesis. The present contribution examines three temporal measures of vocal fold opening, as indicated by the time of decreasing contact of the vocal folds estimated from the electroglottogram signal. This initial investigation, using a single female speaker, analyses the sustained vowel [a:] produced when simulating the emotional states anger, joy, neutral, sad and tender. The results indicate discrimination of emotional voice type using two of the measures of vocal fold opening duration.

Peter J. Murphy, Anne-Maria Laukkanen
Effects of Smiling on Articulation: Lips, Larynx and Acoustics

The present paper reports on the results of a study investigating changes in lip features, larynx position and acoustics caused by smiling while speaking. 20 triplets of words containing one of the vowels /a:/, /i:/, /u:/ were spoken and audiovisually recorded. Lip features were extracted manually as well as using a 3D motion capture technique, formants were measured in the acoustic signal, and the vertical larynx position was determined where visible. Results show that during production of /u:/, F1 and F2 are not significantly affected despite changes in lip features, while F3 is increased. For /a:/, F1 and F3 are unchanged, whereas for /i:/ only F3 is not affected. Furthermore, while the effect of smiling on the outer lip features is comparable between vowels, the inner lip features are affected differently for different vowels. These differences in the impact on /a:/, /i:/ and /u:/ suggest that the effect of smiling on vowel production is vowel dependent.

Sascha Fagel
Neural Basis of Emotion Regulation

From the neurobiological point of view, emotions can be defined as complex responses to personally relevant events; such responses are characterized by peculiar subjective feelings and by vegetative and motor reactions. In humans, the complex neural network subserving emotions has traditionally been studied via clinical observations of brain-damaged patients, but in recent years the development of modern neuroimaging techniques such as Positron Emission Tomography (PET) or functional Magnetic Resonance Imaging (fMRI) has made it possible to investigate the neural basis of emotions and emotion regulation in normal subjects. The present chapter offers a brief overview of the main neural structures involved in emotion processing, summarizes the role recently ascribed to mirror neurons in empathy, and describes the possible neural basis of the integrated emotion regulation typical of our species.

Luigi Trojano
Automatic Meeting Participant Role Detection by Dialogue Patterns

We introduce the new concept of a ‘Vocalization Horizon’ for automatic speaker role detection in general meeting recordings. We demonstrate that classification accuracy reaches 38.5% when the Vocalization Horizon and other features (i.e., vocalization duration and start time) are available. With another type of horizon, the Pause-Overlap Horizon, the classification accuracy reaches 39.5%. Pauses and overlaps are thus also useful vocalization features for meeting structure analysis. In our experiments, the Bayesian Network classifier outperforms other classifiers and is proposed for similar applications.

Jing Su, Bridget Kane, Saturnino Luz
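
As a rough illustration of turning a meeting's vocalization stream into features like those described above (duration, start time, and the identities of neighbouring vocalizations as a "horizon"), one might collect, for each vocalization, its own statistics plus the speakers of the surrounding events. The representation below is my assumption about what such a feature could look like, not the paper's definition, and the classification stage is omitted.

```python
from dataclasses import dataclass

@dataclass
class Vocalization:
    speaker: str
    start: float      # seconds from meeting start
    duration: float   # seconds

def horizon_features(events, horizon=2):
    """For each vocalization, collect duration, start time and the
    speakers of the `horizon` preceding and following vocalizations."""
    events = sorted(events, key=lambda e: e.start)
    rows = []
    for i, e in enumerate(events):
        before = [v.speaker for v in events[max(0, i - horizon):i]]
        after = [v.speaker for v in events[i + 1:i + 1 + horizon]]
        rows.append({
            "speaker": e.speaker,
            "start": e.start,
            "duration": e.duration,
            "preceding": before,
            "following": after,
        })
    return rows

demo = [Vocalization("A", 0.0, 2.1), Vocalization("B", 2.3, 0.9),
        Vocalization("A", 3.4, 1.5), Vocalization("C", 5.2, 0.7)]
print(horizon_features(demo)[1])
```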
Linguistic and Non-verbal Cues for the Induction of Silent Feedback

The aim of this study is to analyze certain linguistic (dialogue acts, morphosyntactic units, semantics) and non-verbal cues (face, hand and body gestures) that may induce the silent feedback of a participant in face-to-face discussions. We analyze the typology and functions of the feedback expressions as attested in a corpus of TV interviews and then we move on to the investigation of the immediately preceding context to find systematic evidence related to the production of feedback. Our motivation is to look into the case of active listening by processing data from real dialogues based on the discourse and lexical content that induces the listener’s reactions.

Maria Koutsombogera, Harris Papageorgiou
Audiovisual Tools for Phonetic and Articulatory Visualization in Computer-Aided Pronunciation Training

This paper reviews interactive methods for improving the phonetic competence of subjects in the case of second language learning as well as in the case of speech therapy for subjects suffering from hearing impairments or articulation disorders. As an example, our audiovisual feedback software “SpeechTrainer”, which improves the pronunciation quality of Standard German by visually highlighting acoustics-related and articulation-related sound features, is introduced here. Results from the literature on training methods, as well as the results concerning our own software, indicate that audiovisual tools for phonetic and articulatory visualization are beneficial for computer-aided pronunciation training environments.

Bernd J. Kröger, Peter Birkholz, Rüdiger Hoffmann, Helen Meng
Gesture Duration and Articulator Velocity in Plosive-Vowel-Transitions

In this study, the gesture duration and articulator velocity in consonant-vowel transitions have been analysed using electromagnetic articulography (EMA). The receiver coils were placed on the tongue, lips and teeth. We found onset and offset durations that are statistically significant for a specific articulator. The duration of the offset is affected by the degree of opening of the following vowel. The acquired data is intended to tune the control model of an articulatory speech synthesizer to improve the acoustic quality of plosive-vowel transitions.

Dominik Bauer, Jim Kannampuzha, Phil Hoole, Bernd J. Kröger
Stereo Presentation and Binaural Localization in a Memory Game for the Visually Impaired

The socialization of the visually impaired represents a challenge for society and science today. The aim of this research is to investigate the possibility of using binaural perception of sound for two-dimensional source localization in interactive games for the visually impaired. Such an additional source of information would contribute to more equal participation of the visually impaired in online gaming. The paper presents a concept for an online memory game accessible both to the sighted and the blind, with a multimodal user interface enabled by speech technologies. The final chapter discusses the effects of its initial application and testing, as well as perspectives for further development.

Vlado Delić, Nataša Vujnović Sedlar
Pathological Voice Analysis and Classification Based on Empirical Mode Decomposition

Empirical mode decomposition (EMD) is an algorithm for signal analysis recently introduced by Huang. It is a completely data-driven, non-linear method for the decomposition of a signal into AM-FM components. In this paper two new EMD-based methods for the analysis and classification of pathological voices are presented. They are applied to speech signals corresponding to real and simulated sustained vowels. We first introduce a method that allows the robust extraction of the fundamental frequency of sustained vowels, whose determination is crucial for pathological voice analysis and diagnosis. This new method is based on the ensemble empirical mode decomposition (EEMD) algorithm and its performance is compared with other state-of-the-art methods. As a second EMD-based tool, we explore spectral properties of the intrinsic mode functions and apply them to the classification of normal and pathological sustained vowels. We show that, using just a basic pattern classification algorithm, the selected spectral features of only three modes are enough to discriminate between normal and pathological voices.

Gastón Schlotthauer, María E. Torres, Hugo L. Rufiner
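
A hedged sketch of an EEMD-based f0 idea for a sustained vowel, assuming the PyEMD package: decompose the signal into intrinsic mode functions and read the fundamental off the first mode whose zero-crossing rate falls inside a plausible f0 range. This is a simplification for illustration, not the method evaluated in the paper.

```python
import numpy as np
from PyEMD import EEMD   # assumes the PyEMD package is installed

def f0_from_eemd(signal, fs, fmin=70.0, fmax=400.0):
    """Estimate f0 of a sustained vowel from the IMF whose mean
    zero-crossing frequency lies inside [fmin, fmax] (a simplification)."""
    imfs = EEMD().eemd(signal)
    for imf in imfs:                 # IMFs are ordered high to low frequency
        crossings = np.sum(np.abs(np.diff(np.sign(imf))) > 0)
        freq = crossings * fs / (2.0 * len(imf))   # ~ dominant frequency
        if fmin <= freq <= fmax:
            return freq
    return None

# Synthetic sustained phonation: a 150 Hz tone plus a little noise.
fs = 16000
t = np.arange(0, 0.5, 1 / fs)
x = np.sin(2 * np.pi * 150 * t) + 0.05 * np.random.randn(len(t))
print(f0_from_eemd(x, fs))   # should be close to 150 Hz
```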
Disfluencies and the Perspective of Prosodic Fluency

This work explores prosodic cues of disfluent phenomena. We have conducted a perceptual experiment to test whether listeners would rate all disfluencies as disfluent events or whether some of them would be rated as fluent devices in specific prosodic contexts. Results pointed out significant differences (p < 0.05) between judgments of fluency vs. disfluency. Distinct prosodic properties of these events were also significant (p < 0.05) in their characterization as fluent devices. In an attempt to discriminate which linguistic features are more salient in the classification of disfluencies, we have also used CART techniques on a corpus of 3.5 hours of spontaneous and prepared non-scripted speech. CART results pointed to 2 splits: break indices and contour shape. The first split indicates that disfluent events uttered at breaks 3 and 4 are considered felicitous. The second indicates that these events must have plateau or ascending contours to be considered as such; otherwise they are strongly penalized. The results obtained show that there are regular trends in the production of disfluencies, namely in prosodic phrasing and contour shape.

Helena Moniz, Isabel Trancoso, Ana Isabel Mata
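
The CART analysis described above can be sketched with a standard decision tree over the two features the authors report as the main splits, break index and contour shape. The toy data below is fabricated purely to show the shape of such a model and mimics the reported tendency (disfluencies at breaks 3 and 4 rated as felicitous); it is not the corpus data.

```python
from sklearn.tree import DecisionTreeClassifier, export_text

# Toy feature rows: [break_index, contour] with contour coded
# 0 = descending, 1 = plateau, 2 = ascending (an illustrative coding).
X = [[1, 0], [2, 0], [3, 1], [3, 2], [4, 1], [4, 2], [2, 1], [1, 2]]
y = ["disfluent", "disfluent", "fluent", "fluent",
     "fluent", "fluent", "disfluent", "disfluent"]

tree = DecisionTreeClassifier(max_depth=2).fit(X, y)
print(export_text(tree, feature_names=["break_index", "contour"]))
```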
Subjective Tests and Automatic Sentence Modality Recognition with Recordings of Speech Impaired Children

Prosody recognition experiments have been carried out in the Laboratory of Speech Acoustics in which, among other things, we investigated the possibilities of recognizing sentence modalities. Due to our promising results in sentence modality recognition, we adopted the method for children's modality recognition and investigated how it can be used as automatic feedback in an audio-visual pronunciation teaching and training system. Our goal was to develop a sentence intonation teaching and training system for speech-handicapped children, helping them to learn the correct prosodic pronunciation of sentences. HMM models of modality types were built by training the recognizer on a database of correctly speaking children. During the present work, a large database was collected from speech-impaired children. Subjective tests were carried out with this database in order to examine how well human listeners are able to categorize the recorded sentence modalities. Then automatic sentence modality recognition experiments were performed with the previously trained HMM models. Based on the results of the subjective tests, the acceptance probability of the sentence modality recognizer can be adjusted. Comparing the results of the subjective tests with the results of the automatic sentence modality recognition tests on the database of speech-impaired children shows that the automatic recognizer classified the recordings more strictly, but not worse. The introduced method could be implemented as part of a speech teaching system.

David Sztaho, Katalin Nagy, Klara Vicsi
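
Sentence modality recognition with one HMM per modality type can be sketched as follows, assuming the hmmlearn package and placeholder prosodic feature sequences; the actual features, topology and training data of the system described above are not reproduced.

```python
import numpy as np
from hmmlearn import hmm   # assumes the hmmlearn package is installed

def train_modality_models(data, n_states=3):
    """Train one Gaussian HMM per modality from lists of feature
    sequences; `data` maps modality name -> list of 2-D arrays."""
    models = {}
    for modality, seqs in data.items():
        X = np.vstack(seqs)
        lengths = [len(s) for s in seqs]
        m = hmm.GaussianHMM(n_components=n_states, covariance_type="diag")
        m.fit(X, lengths)
        models[modality] = m
    return models

def recognize(models, seq):
    """Pick the modality whose HMM gives the highest log-likelihood."""
    return max(models, key=lambda k: models[k].score(seq))

# Placeholder data: random sequences standing in for prosodic features.
rng = np.random.default_rng(0)
fake = {m: [rng.normal(loc=i, size=(40, 2)) for _ in range(5)]
        for i, m in enumerate(["statement", "question", "exclamation"])}
models = train_modality_models(fake)
print(recognize(models, rng.normal(loc=1, size=(40, 2))))   # likely "question"
```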
The New Italian Audio and Video Emotional Database

This paper describes the general specifications and characteristics of the New Italian Audio and Video Emotional Database, collected to improve the COST 2102 Italian Audio and Video Emotional Database [28] and to support the research effort of the COST Action 2102: “Cross Modal Analysis of Verbal and Nonverbal Communication” (http://cost2102.cs.stir.ac.uk/). The database should allow the cross-modal analysis of audio and video recordings for defining distinctive multimodal emotional features and identifying emotional states from multimodal signals. Emphasis is placed on stimuli selection procedures, theoretical and practical aspects of stimuli identification, characteristics of the selected stimuli, and progress in their assessment and validation.

Anna Esposito, Maria Teresa Riviello
Spoken Dialogue in Virtual Worlds

Human-computer conversations have attracted a great deal of interest, especially in virtual worlds. Indeed, research has given rise to spoken dialogue systems by taking advantage of advances in speech recognition, language understanding and speech synthesis. This work surveys the state of the art of spoken dialogue systems. Current dialogue system technologies and approaches are first introduced, emphasizing the differences between them; then speech recognition, speech synthesis and language understanding are introduced as complementary and necessary modules. On the other hand, as the development of spoken dialogue systems becomes more complex, it is necessary to define processes to evaluate their performance. Wizard-of-Oz techniques play an important role in achieving this task: they yield a suitable dialogue corpus, which is necessary to achieve good performance. A description of this technique is given in this work, together with perspectives on multimodal dialogue systems in virtual worlds.

Gérard Chollet, Asmaa Amehraye, Joseph Razik, Leila Zouari, Houssemeddine Khemiri, Chafic Mokbel
Backmatter
Metadata
Title
Development of Multimodal Interfaces: Active Listening and Synchrony
Edited by
Anna Esposito
Nick Campbell
Carl Vogel
Amir Hussain
Anton Nijholt
Copyright year
2010
Publisher
Springer Berlin Heidelberg
Electronic ISBN
978-3-642-12397-9
Print ISBN
978-3-642-12396-2
DOI
https://doi.org/10.1007/978-3-642-12397-9