
2007 | Book

Verbal and Nonverbal Communication Behaviours

COST Action 2102 International Workshop, Vietri sul Mare, Italy, March 29-31, 2007, Revised Selected and Invited Papers

Edited by: Anna Esposito, Marcos Faundez-Zanuy, Eric Keller, Maria Marinaro

Publisher: Springer Berlin Heidelberg

Book Series: Lecture Notes in Computer Science


About this Book

This volume brings together the invited papers and selected participants’ contributions presented at the COST 2102 International Workshop on “Verbal and Nonverbal Communication Behaviours”, held in Vietri sul Mare, Italy, March 29–31, 2007. The workshop was jointly organized by the Faculty of Science and the Faculty of Psychology of the Second University of Naples, Caserta, Italy, and the International Institute for Advanced Scientific Studies “Eduardo R. Caianiello” (IIASS), Vietri sul Mare, Italy. The workshop was a COST 2102 event, and it was mainly sponsored by the COST (European Cooperation in the Field of Scientific and Technical Research) Action 2102 in the domain of Information and Communication Technologies (ICT), as well as by the above-mentioned organizing Institutions. The main theme of the workshop was to discuss the fundamentals of verbal and nonverbal communication features and their relationships with the identification of a person, his/her socio-cultural background and personal traits. In the past decade, a number of different research communities within the psychological and computational sciences have tried to characterize human behaviour in face-to-face communication by several features that describe relationships between facial, prosodic/voice quality, formal and informal communication modes, cultural differences, individual and socio-cultural variations, stable personality traits and degrees of expressiveness and emphasis, as well as the individuation of the interlocutor’s emotional and psychological states.

Table of Contents

Frontmatter

Introduction

COST 2102: Cross-Modal Analysis of Verbal and Nonverbal Communication (CAVeNC)
Abstract
The following describes the fundamental features and major objectives of COST 2102: Cross-Modal Analysis of Verbal and Nonverbal Communication (CAVeNC) as they have been expressed in its Memorandum of Understanding. COST (European Cooperation in the Field of Scientific and Technical Research) is “one of the longest-running instruments supporting co-operation among scientists and researchers across Europe” (www.cost.esf.org). In this framework, COST 2102 is an initiative founded in the Domain of Information and Communication Technologies that became operative in December 2006. Details on the ongoing activities, as well as the structure and organization of COST 2102, can be found at www.cost2102.eu. I want to express my gratitude to all the researchers who have joined COST 2102 for making real the dream of sharing knowledge and research work with leading experts in the field of multimodal communication.
Anna Esposito

I – Verbal and Nonverbal Coding Schema

Annotation Schemes for Verbal and Non-Verbal Communication: Some General Issues
Abstract
During the past 5-10 years, increasing effort has been put into the annotation of verbal and non-verbal human-human and human-machine communication in order to better understand the complexities of multimodal communication and model them in computers. This has helped highlight the huge challenges which still confront annotators in this field, from conceptual confusion through lacking or immature coding schemes to inadequate coding tools. We discuss what an annotation scheme is, briefly review previous work on annotation schemes and tools, describe current trends, and discuss challenges ahead.
Niels Ole Bernsen, Laila Dybkjær
Presenting in Style by Virtual Humans
Abstract
The paper addresses the issue of making Virtual Humans unique and typical of some (social or ethnic) group by endowing them with style. First, a conceptual framework for defining style is discussed, identifying how style is manifested in speech and nonverbal communication. Then the GESTYLE language is introduced, which makes it possible to define the style of a VH in terms of Style Dictionaries, assigning non-deterministic choices for expressing certain meanings by nonverbal signals and speech. It is possible to define multiple sources of style and to handle conflicts and dynamic changes. GESTYLE is a text markup language that makes it possible to generate speech and the accompanying facial expressions and hand gestures automatically, by declaring the style of the VH and using meaning tags in the text. GESTYLE can be coupled with different low-level TTS and animation engines.
Zsófia Ruttkay
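
The core mechanism the abstract describes, a Style Dictionary mapping communicative meanings to weighted, non-deterministically chosen realizations, can be illustrated with a minimal Python sketch. All dictionary entries and signal names below are invented for illustration and do not reproduce GESTYLE's actual tag set or syntax.

```python
import random

# Hypothetical style dictionaries: each communicative meaning maps to
# weighted alternative nonverbal realizations, in the spirit of the
# abstract. Names are illustrative, not GESTYLE's actual vocabulary.
STYLE_DICTIONARIES = {
    "formal_lecturer": {
        "emphasis": [("raise_eyebrows", 0.7), ("beat_gesture", 0.3)],
        "greeting": [("nod", 0.8), ("small_wave", 0.2)],
    },
    "casual_youngster": {
        "emphasis": [("beat_gesture", 0.6), ("head_tilt", 0.4)],
        "greeting": [("big_wave", 0.9), ("nod", 0.1)],
    },
}

def realize(style: str, meaning: str) -> str:
    """Pick one nonverbal realization for a meaning, non-deterministically."""
    alternatives = STYLE_DICTIONARIES[style][meaning]
    signals, weights = zip(*alternatives)
    return random.choices(signals, weights=weights, k=1)[0]

# Usage: the same meaning tag yields different signals across runs.
for _ in range(3):
    print(realize("formal_lecturer", "emphasis"))
```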
Analysis of Nonverbal Involvement in Dyadic Interactions
Abstract
In the following, we comment on the assignment of the dynamic variable, its meaning, its indicators and, furthermore, its dimensions. We examine some interaction models which incorporate nonverbal involvement as a dynamic variable. Then we give a short overview of two recently undertaken studies of dyadic interactions focusing on nonverbal involvement measured in a multivariate manner. The first study concentrates on conflict regulation between interacting children who are friends. The second study examines intrapersonal conflict and its social expression through “emotional overinvolvement” (EOI) of patients in psychotherapy. We also mention a pilot study in which the proxemic behaviour between two children in a conflict episode is analysed, focusing on the violation of personal space and its restoration through synchronisation. We end with some comments on multiple dimensions and scaling with respect to involvement, including thoughts about multidimensional interaction data (MID).
Uwe Altmann, Rico Hermkes, Lutz-Michael Alisch

II – Emotional Expressions

Children’s Perception of Musical Emotional Expressions
Abstract
This study investigates children’s ability to interpret emotions in instrumentally presented melodies. Forty children (20 males and 20 females), all aged six years, were tested for the comprehension of emotional concepts through the correct matching of emotional pictures to pieces of music. Results indicate that 6-year-olds’ emotional responses to orchestral extracts considered full of affective cues are similar to those demonstrated by adults, and that there is no gender effect.
Anna Esposito, Manuela Serio
Emotional Style Conversion in the TTS System with Cepstral Description
Abstract
This contribution describes experiments with emotional style conversion performed on utterances produced by the Czech and Slovak text-to-speech (TTS) system with cepstral description and basic prosody generated by rules. Emotional style conversion was realized both as post-processing of the TTS output speech signal and as a real-time implementation within the system. Emotional style prototypes representing three emotional states (sad, angry, and joyous) were obtained from sentences with the same information content. The mismatch in frame length between the prototype and the target utterance was resolved by linear time scale mapping (LTSM). The results were evaluated by a listening test on the resynthesized utterances.
Jiří Přibil, Anna Přibilová
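
A minimal sketch of the LTSM step, read here as linear resampling of a per-frame parameter track onto the target utterance's frame count; the paper's actual algorithm may differ in detail.

```python
import numpy as np

def linear_time_scale_map(prototype: np.ndarray, n_target: int) -> np.ndarray:
    """Map a per-frame parameter track (shape [n_frames, n_params], e.g.
    cepstral or prosodic parameters) onto n_target frames by linear
    index interpolation -- one plausible reading of LTSM."""
    n_proto = prototype.shape[0]
    src = np.linspace(0.0, n_proto - 1, num=n_target)
    lo = np.floor(src).astype(int)
    hi = np.minimum(lo + 1, n_proto - 1)
    frac = (src - lo)[:, None]
    return (1.0 - frac) * prototype[lo] + frac * prototype[hi]

# Usage: stretch a 120-frame emotional prototype track to 150 frames.
proto = np.random.randn(120, 13)          # stand-in for cepstral frames
mapped = linear_time_scale_map(proto, 150)
print(mapped.shape)                        # (150, 13)
```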
Meaningful Parameters in Emotion Characterisation
Abstract
In expressive speech synthesis, some method of mimicking the way one specific speaker expresses emotions is needed. In this work we have studied the suitability of long-term prosodic parameters and short-term spectral parameters for reflecting emotions in speech, by analysing the results of two automatic emotion classification systems. These systems were trained with different emotional single-speaker databases recorded in standard Basque that include six emotions. Both are able to differentiate among emotions for a specific speaker with very high identification rates (above 75%), but the models are not applicable to other speakers (identification rates drop to 20%). Therefore, in the synthesis process, control of both spectral and prosodic features is essential to obtain expressive speech, and when a change of speaker is desired, the parameter values should be re-estimated.
Eva Navas, Inmaculada Hernáez, Iker Luengo, Iñaki Sainz, Ibon Saratxaga, Jon Sanchez
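
To make the setup concrete, here is a minimal sketch of an utterance-level emotion classifier over long-term prosodic statistics. The feature set, the SVM choice, and the random stand-in data are illustrative assumptions; the paper's actual features and classifiers are not reproduced.

```python
import numpy as np
from sklearn.svm import SVC
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

def long_term_prosodic_features(f0: np.ndarray, energy: np.ndarray) -> np.ndarray:
    """Utterance-level F0 and energy statistics -- one plausible set of
    'long-term prosodic parameters' (not the paper's exact set)."""
    voiced = f0[f0 > 0]
    return np.array([
        voiced.mean(), voiced.std(), voiced.max() - voiced.min(),
        energy.mean(), energy.std(),
    ])

# Hypothetical training data: one feature vector per utterance of a
# single speaker, labelled with one of six emotions.
X = np.vstack([long_term_prosodic_features(np.abs(np.random.randn(200)) * 50 + 120,
                                           np.abs(np.random.randn(200)))
               for _ in range(60)])
y = np.repeat(np.arange(6), 10)  # six emotion classes

clf = make_pipeline(StandardScaler(), SVC())
clf.fit(X, y)
print(clf.score(X, y))  # training-set accuracy of this toy speaker-dependent model
```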

III – Gestural Expressions

Prosodic and Gestural Expression of Interactional Agreement
Abstract
Conversational interactions are cooperatively constructed activities in which participants negotiate their entrances, turns and alignments with other speakers, oftentimes with an underlying long-term objective of obtaining some agreement. Obtaining a final and morally binding accord in a conversational interaction is of importance in a great variety of contexts, particularly in psychotherapeutic interactions, in contractual negotiations or in educational contexts. Various prosodic and gestural elements in a conversational interaction can be interpreted as signals of a speaker’s agreement and they are probably of importance in the emergence of an accord in a conversational exchange. In this paper, we survey the social and psychological context of agreement seeking, as well as the existing literature on the visual and prosodic measurement of agreement in conversational settings.
Eric Keller, Wolfgang Tschacher
Gesture, Prosody and Lexicon in Task-Oriented Dialogues: Multimedia Corpus Recording and Labelling
Abstract
The aim of the DiaGest Project is to study interdependencies between gesture, lexicon, and prosody in Polish dialogues. The material under study comprises three tasks realised by twenty pairs of subjects. Two tasks involve instructional, task-oriented dialogues, while the third is based on a question answering procedure. A system for corpus labelling is currently being designed on the basis of current standards. The corpus will be annotated for gestures, lexical content of utterances, intonation and rhythm. In order to relate various phenomena to the contextualized meaning of dialogue utterances, the material will also be tagged in terms of dialogue acts. Synchronised tags will be placed in respective annotation tiers in ELAN. A number of detailed studies related to the problems of gesture-prosody, gesture-lexicon and prosody-lexicon interactions will be carried out on the basis of the tagged material.
Ewa Jarmolowicz, Maciej Karpinski, Zofia Malisz, Michal Szczyszek
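
The labelling workflow the abstract outlines, synchronised tags placed in separate annotation tiers in ELAN, can be sketched programmatically. The example below assumes the third-party pympi library for writing ELAN (.eaf) files; the tier names and tag values are invented, and nothing here reproduces the DiaGest project's actual toolchain.

```python
# A minimal sketch of writing synchronised annotation tiers to an ELAN
# (.eaf) file, assuming the pympi package (pip install pympi-ling).
import pympi

eaf = pympi.Elan.Eaf()
for tier in ("gesture", "intonation", "dialogue_act"):
    eaf.add_tier(tier)

# Times are in milliseconds; tags on different tiers share a timeline.
eaf.add_annotation("gesture", 1200, 1850, "beat")
eaf.add_annotation("intonation", 1200, 1850, "rising")
eaf.add_annotation("dialogue_act", 1000, 2400, "instruct")

eaf.to_file("dialogue_sample.eaf")
```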
Egyptian Grunts and Transportation Gestures
Abstract
The paper has two main subjects related to Egyptian culture. The first is a collection of Egyptian grunts used by almost all Egyptians in everyday life and recognized by almost everybody. The second is a collection of gestures used by passengers of a special kind of public transportation, called microbus, in greater Cairo and beyond. Such gestures vary with the geographic location of the microbus route and are used to communicate with the bus driver and his helper. The material of the two collections was provided by students in communication skills classes offered by the author through undergraduate and graduate curricula.
Aly N. El-Bahrawy

IV – Analysis and Algorithms for Verbal and Nonverbal Speech

On the Use of NonVerbal Speech Sounds in Human Communication
Abstract
Recent work investigating the interaction of the speech signal with the meaning of the verbal content has revealed interactions not yet modelled in either speech recognition technology or contemporary linguistic science. In this paper we describe paralinguistic speech features that co-exist alongside linguistic content, propose a model of their function and usage, and discuss methods for incorporating them into real-world applications and devices.
Nick Campbell
Speech Spectrum Envelope Modeling
Abstract
A new method for speech analysis is described. It is based on finding extrema in the magnitude spectrum of a speech frame, followed by interpolation. The interpolated spectrum envelope can be used for speech synthesis and also for the estimation of the excitation and background noise. In this contribution the proposed method is illustrated using a noisy speech frame and compared with the LPC spectrum and with spectra obtained by classical and hidden cepstral smoothing.
Robert Vích, Martin Vondra
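
As a rough illustration of the peak-picking-plus-interpolation idea (not the authors' exact algorithm), the following Python sketch locates maxima of a frame's magnitude spectrum and interpolates an envelope through them.

```python
import numpy as np
from scipy.signal import find_peaks
from scipy.interpolate import interp1d

def envelope_from_extrema(frame: np.ndarray, n_fft: int = 512) -> np.ndarray:
    """Estimate a spectral envelope by locating maxima of the magnitude
    spectrum and interpolating between them -- a sketch of the idea in
    the abstract, with placeholder analysis settings."""
    mag = np.abs(np.fft.rfft(frame * np.hanning(len(frame)), n_fft))
    peaks, _ = find_peaks(mag)
    # Anchor the interpolation at both spectrum edges.
    idx = np.concatenate(([0], peaks, [len(mag) - 1]))
    return interp1d(idx, mag[idx], kind="cubic")(np.arange(len(mag)))

# Usage on a synthetic noisy voiced frame (150 Hz tone at 8 kHz).
t = np.arange(400) / 8000.0
frame = np.sin(2 * np.pi * 150 * t) + 0.1 * np.random.randn(400)
print(envelope_from_extrema(frame).shape)  # (257,)
```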
Using Prosody in Fixed Stress Languages for Improvement of Speech Recognition
Abstract
In this chapter we examine the use of prosodic features in speech recognition, with special attention paid to agglutinating, fixed-stress languages. Current knowledge on exploiting speech prosody is addressed in the introduction. The prosodic features used, the acoustic-prosodic pre-processing, and the segmentation into prosodic units are presented in detail. We use the expression “prosodic unit” in order to differentiate these units from prosodic phrases, which are usually longer. We trained an HMM-based prosodic segmenter relying on the fundamental frequency and intensity of speech. The output of this prosodic segmenter is used for N-best lattice rescoring, in parallel with a simplified bigram language model, in a continuous speech recognizer in order to improve speech recognition performance. Experiments on Hungarian show a WER reduction of about 4% using a simple lattice rescoring. The performance of the prosodic segmenter is also investigated in comparison with our earlier experiments.
György Szaszák, Klára Vicsi
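
The segmenter's two input streams, fundamental frequency and intensity, can be extracted per frame along the following lines. This is a bare-bones sketch (autocorrelation F0 with a crude voicing decision), not the paper's actual front end; frame sizes and thresholds are placeholders.

```python
import numpy as np

def f0_and_intensity(x: np.ndarray, fs: int, frame: int = 400, hop: int = 160):
    """Frame-level F0 (autocorrelation method) and intensity in dB --
    the two feature streams the abstract says the HMM-based prosodic
    segmenter relies on."""
    f0s, intensities = [], []
    for start in range(0, len(x) - frame, hop):
        w = x[start:start + frame] * np.hamming(frame)
        intensities.append(10 * np.log10(np.mean(w ** 2) + 1e-12))
        ac = np.correlate(w, w, mode="full")[frame - 1:]
        lo, hi = int(fs / 400), int(fs / 60)       # search 60-400 Hz
        lag = lo + np.argmax(ac[lo:hi])
        f0s.append(fs / lag if ac[lag] > 0.3 * ac[0] else 0.0)  # crude voicing
    return np.array(f0s), np.array(intensities)

# Usage: a 1-second synthetic 120 Hz vowel at 16 kHz.
fs = 16000
t = np.arange(fs) / fs
f0, inten = f0_and_intensity(np.sin(2 * np.pi * 120 * t), fs)
print(np.median(f0[f0 > 0]))  # ~120 Hz
```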
Single-Channel Noise Suppression by Wavelets in Spectral Domain
Abstract
The paper describes the design of a new single-channel method for speech enhancement that employs the wavelet transform. Signal decomposition is usually performed in the time domain, with noise removed at the individual decomposition levels using thresholding techniques. Here, the wavelet transform is instead applied in the spectral domain. The method of spectral subtraction is used as the basis, being suitable for real-time implementation because of its simplicity. The greatest problem in spectral subtraction is obtaining a trustworthy noise estimate, in particular when non-stationary noise is concerned. Using the wavelet transform, we can achieve a more accurate power spectral density estimate, even for noise that is non-stationary. Listening tests and SNR measurements yield satisfactory results in comparison with earlier reported experience.
Zdeněk Smékal, Petr Sysel
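
A rough sketch of the two ingredients, assuming the PyWavelets (pywt) package: a power spectrum smoothed by wavelet thresholding along the frequency axis serves as the noise estimate for basic spectral subtraction. Wavelet choice, thresholds, and the flooring constant are placeholders, not the paper's tuned values.

```python
import numpy as np
import pywt

def wavelet_smoothed_psd(mag2: np.ndarray, wavelet: str = "db4", level: int = 4):
    """Smooth a power spectrum by wavelet thresholding along the
    frequency axis -- the 'wavelet transform in the spectral domain'
    idea, with placeholder thresholds."""
    coeffs = pywt.wavedec(mag2, wavelet, level=level)
    coeffs = [coeffs[0]] + [pywt.threshold(c, np.std(c), mode="soft")
                            for c in coeffs[1:]]
    return np.abs(pywt.waverec(coeffs, wavelet)[: len(mag2)])

def spectral_subtract(frame: np.ndarray, noise_psd: np.ndarray, n_fft: int = 512):
    """Basic magnitude-domain spectral subtraction with flooring."""
    spec = np.fft.rfft(frame, n_fft)
    clean_mag2 = np.maximum(np.abs(spec) ** 2 - noise_psd, 0.05 * noise_psd)
    return np.fft.irfft(np.sqrt(clean_mag2) * np.exp(1j * np.angle(spec)), n_fft)

# Usage: estimate the noise PSD from a noise-only frame, then enhance.
noise = 0.1 * np.random.randn(512)
noise_psd = wavelet_smoothed_psd(np.abs(np.fft.rfft(noise, 512)) ** 2)
noisy = np.sin(2 * np.pi * 200 * np.arange(512) / 8000) + 0.1 * np.random.randn(512)
print(spectral_subtract(noisy, noise_psd).shape)  # (512,)
```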
Voice Source Change During Fundamental Frequency Variation
Abstract
Prosody refers to certain properties of the speech signal, including audible changes in pitch, loudness, and syllable length. The acoustic manifestation of prosody is typically measured in terms of fundamental frequency (f0), amplitude and duration. These three cues have formed the basis for extensive studies of prosody in natural speech. The present work seeks to go beyond this level of representation and to examine additional factors that arise as a result of the underlying production mechanism. For example, intonation is studied with reference to the f0 contour. However, changing f0 requires changes in the laryngeal configuration that result in glottal flow parameter changes. These glottal changes may serve as important psychoacoustic markers in addition to (or in conjunction with) the f0 targets. The present work examines changes in open quotient with f0 in connected speech, using the electroglottogram and the volume velocity signal at the lips. This preliminary study suggests that individual differences may exist in the glottal changes accompanying a particular f0 variation.
Peter J. Murphy
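
As an illustration of the measurement involved, the sketch below estimates the open quotient of a single EGG cycle with a simple threshold-crossing heuristic. The 35% level criterion is one common convention, assumed here for illustration rather than taken from the paper.

```python
import numpy as np

def open_quotient(egg_cycle: np.ndarray, level: float = 0.35) -> float:
    """Estimate the open quotient of one EGG cycle: samples below
    `level` of the peak-to-peak amplitude (above the minimum) count as
    the open phase. A common heuristic, not necessarily the paper's."""
    lo, hi = egg_cycle.min(), egg_cycle.max()
    threshold = lo + level * (hi - lo)
    return float(np.mean(egg_cycle < threshold))

# Usage on a synthetic, roughly EGG-shaped cycle (high = vocal fold contact).
t = np.linspace(0, 1, 200, endpoint=False)
cycle = np.clip(np.sin(2 * np.pi * t ** 1.5), 0, None)  # skewed contact pulse
print(round(open_quotient(cycle), 2))
```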
A Gesture-Based Concept for Speech Movement Control in Articulatory Speech Synthesis
Abstract
An articulatory speech synthesizer comprising a three-dimensional vocal tract model and a gesture-based concept for the control of articulatory movements is introduced and discussed in this paper. A modular learning concept based on speech perception is outlined for the creation of gestural control rules. The learning concept relies on sensory feedback information for articulatory states produced by the model itself, and on auditory and visual information from speech items produced by external speakers. The complete model (control module and synthesizer) is capable of producing high-quality synthetic speech signals and introduces a scheme for the natural speech production and speech perception processes.
Bernd J. Kröger, Peter Birkholz
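
One way to picture the gesture-based control idea is as a "gestural score" that a control module converts into articulator trajectories. In the sketch below, the articulator names, the hold-and-release interpolation, and the assumption of non-overlapping gestures are all illustrative; they are not the authors' control rules.

```python
import numpy as np

def trajectory(score, duration_s, rate=200):
    """score: list of (start_s, end_s, target) gestures for one
    articulator, assumed non-overlapping and time-ordered; returns a
    piecewise-linear track sampled at `rate` Hz."""
    t = np.arange(int(duration_s * rate)) / rate
    anchors_t, anchors_v = [0.0], [0.0]          # neutral position at t=0
    for start, end, target in score:
        anchors_t += [start, end]
        anchors_v += [target, target]            # hold target over the gesture
    anchors_t.append(duration_s)
    anchors_v.append(0.0)                        # relax back to neutral
    return np.interp(t, anchors_t, anchors_v)

# Usage: a lip-closure gesture for /b/ between 0.10 s and 0.18 s.
lip_aperture = trajectory([(0.10, 0.18, 1.0)], duration_s=0.4)
print(lip_aperture.shape)  # (80,)
```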
A Novel Psychoacoustically Motivated Multichannel Speech Enhancement System
Abstract
The ubiquitous noise reduction / speech enhancement problem has gained increasing interest in recent years. This is due both to progress made by microphone-array systems and to the successful introduction of perceptual models. In the last decade, several methods incorporating psychoacoustic criteria in single-channel speech enhancement systems have been proposed; however, very few works exploit these features in the multichannel case. In this paper we present a novel psychoacoustically motivated multichannel speech enhancement system that exploits spatial information and psychoacoustic concepts. The proposed framework offers enhanced flexibility, allowing for a multitude of perceptually based post-filtering solutions. Moreover, the system has been devised on a frame-by-frame basis to facilitate real-time implementation. Objective performance measures and informal subjective listening tests for the case of speech signals corrupted with real car and F-16 cockpit noise demonstrate enhanced performance of the proposed speech enhancement system in terms of musical residual noise reduction compared to conventional multichannel techniques.
Amir Hussain, Simone Cifani, Stefano Squartini, Francesco Piazza, Tariq Durrani
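
The general multichannel-plus-post-filter structure can be illustrated as follows: delay-and-sum combination of already time-aligned microphone frames, followed by a Zelinski-style post-filter gain built from channel cross-spectra. This is a generic stand-in without the paper's psychoacoustic weighting; all parameters are illustrative.

```python
import numpy as np

def das_with_postfilter(frames: np.ndarray, n_fft: int = 512) -> np.ndarray:
    """Delay-and-sum of time-aligned microphone frames, followed by a
    Zelinski-style post-filter (cross-PSD over auto-PSD). `frames` has
    shape [n_channels, frame_len], assumed pre-steered to the source."""
    specs = np.fft.rfft(frames, n_fft, axis=1)
    n_ch = specs.shape[0]
    cross, pairs = np.zeros(specs.shape[1]), 0
    for i in range(n_ch):                        # average pairwise cross-PSDs
        for j in range(i + 1, n_ch):
            cross += np.real(specs[i] * np.conj(specs[j]))
            pairs += 1
    auto = np.mean(np.abs(specs) ** 2, axis=0)   # average auto-PSD
    gain = np.clip((cross / pairs) / (auto + 1e-12), 0.0, 1.0)
    return np.fft.irfft(gain * np.mean(specs, axis=0), n_fft)

# Usage: three noisy copies of the same frame.
sig = np.sin(2 * np.pi * 300 * np.arange(512) / 8000)
frames = np.stack([sig + 0.3 * np.random.randn(512) for _ in range(3)])
print(das_with_postfilter(frames).shape)  # (512,)
```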
Analysis of Verbal and Nonverbal Acoustic Signals with the Dresden UASR System
Abstract
During the last few years, a framework for the development of algorithms for speech analysis and synthesis was implemented. The algorithms are connected to common databases at the different levels of a hierarchical structure. This framework, which is called UASR (Unified Approach for Speech Synthesis and Recognition), and some related experiments and applications are described. Special focus is directed at the suitability of the system for processing nonverbal signals; this part relates to the analysis methods now addressed in the COST 2102 initiative. A potential application field in interaction research is discussed.
Rüdiger Hoffmann, Matthias Eichner, Matthias Wolff

V – Machine Multimodal Interaction

VideoTRAN: A Translation Framework for Audiovisual Face-to-Face Conversations
Abstract
Face-to-face communication remains the most powerful form of human interaction. Electronic devices can never fully replace the intimacy and immediacy of people conversing in the same room, or at least via a videophone. There are many subtle cues provided by facial expressions and vocal intonation that let us know how what we are saying is affecting the other person. Transmission of these nonverbal cues is very important when translating conversations from a source language into a target language. This chapter describes VideoTRAN, a conceptual framework for translating audiovisual face-to-face conversations. A simple method for audiovisual alignment in the target language is proposed, and the process of audiovisual speech synthesis is described. The VideoTRAN framework has been tested in a translating videophone: an H.323 software-client translating videophone allows for the transmission and translation of a set of multimodal verbal and nonverbal cues in a multilingual face-to-face communication setting.
Jerneja Žganec Gros
Spoken and Multimodal Communication Systems in Mobile Settings
Abstract
Mobile devices, such as smartphones, have become powerful enough to implement efficient speech-based and multimodal interfaces, and there is an increasing need for such systems. This chapter gives an overview of the design and development issues involved in implementing mobile speech-based and multimodal systems. The chapter reviews infrastructure design solutions that make it possible to distribute the user interface between servers and mobile devices, and to support user interface migration from server-based to distributed services. An example is given of how an existing server-based spoken timetable application is turned into a multimodal distributed mobile application.
Markku Turunen, Jaakko Hakulinen
Multilingual Augmentative Alternative Communication System
Abstract
People with severe motor control problems who also lack verbal communication use alternative, nonverbal communication techniques and aids, which usually combine symbols, icons, drawings, sounds and text. The present paper describes completely configurable multilingual software that can address this group’s needs, as it provides access to a personalized computerized system offering options for nonverbal written communication. The system incorporates enhanced or new features such as acronymic writing, single-switch access, word and phrase prediction, keyboard layout configuration, and scanning of word and phrase lists, and it makes communication through the internet (email and chat) possible. Moreover, the system records all keystrokes and all words and acronyms used, providing valuable data for research on the best possible configuration of the system. What makes the system particularly innovative is that it enables users to send emails and to network with others through internet chatting.
Pantelis Makris
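
The word-prediction feature mentioned above can be sketched as a frequency-ranked prefix lookup. The class and its API are invented for illustration and say nothing about the paper's actual implementation, which also covers phrases and acronyms.

```python
from collections import Counter

class WordPredictor:
    """Toy word predictor: rank completions by corpus frequency."""

    def __init__(self, corpus: str):
        self.freq = Counter(corpus.lower().split())

    def predict(self, prefix: str, k: int = 3) -> list[str]:
        """Return the k most frequent known words starting with prefix."""
        prefix = prefix.lower()
        candidates = {w: n for w, n in self.freq.items() if w.startswith(prefix)}
        return sorted(candidates, key=candidates.get, reverse=True)[:k]

# Usage with a tiny stand-in corpus.
predictor = WordPredictor("the cat saw the dog and the cat ran home")
print(predictor.predict("ca"))  # ['cat']
print(predictor.predict("th"))  # ['the']
```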
Analysis and Synthesis of Multimodal Verbal and Non-verbal Interaction for Animated Interface Agents
Abstract
The use of animated talking agents is a novel feature of many multimodal spoken dialogue systems. The addition and integration of a virtual talking head has direct implications for the way in which users approach and interact with such systems. However, understanding the interactions between visual expressions, dialogue functions and the acoustics of the corresponding speech presents a substantial challenge. Some of the visual articulation is closely related to the speech acoustics, while there are other articulatory movements affecting speech acoustics that are not visible on the outside of the face. Many facial gestures used for communicative purposes do not affect the acoustics directly, but might nevertheless be connected on a higher communicative level in which the timing of the gestures could play an important role. This chapter looks into the communicative function of the animated talking agent, and its effect on intelligibility and the flow of the dialogue.
Jonas Beskow, Björn Granström, David House
Generating Nonverbal Signals for a Sensitive Artificial Listener
Abstract
In the Sensitive Artificial Listener project, research is performed with the aim of designing an embodied agent that not only generates the appropriate nonverbal behaviors to accompany its own speech, but also displays verbal and nonverbal behaviors while its conversational partner is speaking. Apart from the many applications for embodied agents where natural interaction between agent and human partner requires this behavior, the results of this project are also meant to play a role in research on emotional behavior during conversations. In this paper, our research and implementation efforts in this project are discussed and illustrated with examples of experiments, research approaches and interfaces in development.
Dirk Heylen, Anton Nijholt, Mannes Poel
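
A deliberately toy version of the listener-feedback idea: monitor the partner's frame energy and trigger a backchannel signal after a short lull. The thresholds and the signal inventory are invented for illustration; the project's actual behavior models are far richer.

```python
import numpy as np

def backchannel_events(energy_db, floor_db=-35.0, min_pause_frames=25, hop_s=0.01):
    """Emit a backchannel event when the partner's per-frame energy
    stays below floor_db for min_pause_frames consecutive frames."""
    events, run = [], 0
    for i, e in enumerate(energy_db):
        run = run + 1 if e < floor_db else 0
        if run == min_pause_frames:              # ~250 ms lull detected
            events.append((i * hop_s, "nod_and_mhm"))
    return events

# Usage: alternating speech (-20 dB) and pause (-50 dB) stretches.
energy = np.where(np.arange(300) % 120 < 90, -20.0, -50.0)
print(backchannel_events(energy))
```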
Low-Complexity Algorithms for Biometric Recognition
Abstract
In this paper we emphasize the relevance of low-complexity algorithms for biometric recognition and we present two examples, with special emphasis on face recognition. Our face recognition application has been implemented on a low-cost fixed-point processor, and we found that with 169 integer coefficients per face we achieve better identification results (92%) than the classical eigenfaces approach (86.5%), and results close to the DCT (92.5%), at a reduced computational cost.
Marcos Faundez-Zanuy, Virginia Espinosa-Duró, Juan Antonio Ortega
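
A plausible reading of such a low-complexity pipeline (not the authors' exact system, whose fixed-point quantisation and classifier details are omitted here): keep the 13 x 13 low-frequency corner of each face image's 2D DCT, giving the 169 coefficients mentioned in the abstract, and identify by nearest neighbour.

```python
import numpy as np
from scipy.fft import dctn

def dct_features(face: np.ndarray, block: int = 13) -> np.ndarray:
    """Keep the block x block low-frequency corner of the 2D DCT
    (13 x 13 = 169 coefficients, matching the count in the abstract)."""
    return dctn(face, norm="ortho")[:block, :block].ravel()

def identify(probe: np.ndarray, gallery: dict) -> str:
    """Nearest-neighbour identification in DCT feature space."""
    f = dct_features(probe)
    return min(gallery, key=lambda pid: np.linalg.norm(f - gallery[pid]))

# Usage with random stand-in 64 x 64 grayscale images.
rng = np.random.default_rng(0)
gallery = {f"person_{i}": dct_features(rng.random((64, 64))) for i in range(5)}
print(identify(rng.random((64, 64)), gallery))
```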
Towards to Mobile Multimodal Telecommunications Systems and Services
Abstract
Communication itself is considered a multimodal interactive process binding speech with other modalities. In this contribution, some results of the MobilTel project (Mobile Multimodal Telecommunications System) are presented. The project has provided a research framework resulting in the development of a mobile-terminal (PDA) based multimodal interface, enabling the user to obtain information from the internet in a multimodal way over a wireless telecommunication network. The MobilTel communicator is a speech-centric multimodal system with speech interaction capabilities in the Slovak language, supplemented with graphical modalities. The graphical modalities are pen (touch screen) interaction, the keyboard, and the display, on which the information is presented in a more user-friendly way, with hyperlinks and scrolling menus available. The architecture of the MobilTel communicator and the methods of interaction between the PDA and the MobilTel communicator are described. Graphical examples of services that enable users to obtain information about the weather or about train connections are also presented.
Matúš Pleva, Ján Papaj, Anton Čižmár, L’ubomír Doboš, Jozef Juhár, Stanislav Ondáš, Michal Mirilovič
Embodied Conversational Agents in Wizard-of-Oz and Multimodal Interaction Applications
Abstract
Embodied conversational agents employed in multimodal interaction applications have the potential to achieve properties similar to those of humans in face-to-face conversation. They enable the inclusion of both verbal and nonverbal communication. Thus, the degree of personalization of the user interface is much higher than in other human-computer interfaces. This, of course, greatly contributes to the naturalness and user friendliness of the interface, opening up a wide area of possible applications. Two implementations of embodied conversational agents in human-computer interaction are presented in this paper: the first in a Wizard-of-Oz application and the second in a dialogue system. In the Wizard-of-Oz application, the embodied conversational agent is applied so that it conveys the spoken information of the operator to the user with whom the operator communicates. Depending on the scenario of the application, the user may or may not be aware of the operator’s involvement. The operator can communicate with the user via audio/visual, or audio-only, communication. This paper describes an application setup which enables distant communication with the user, where the user is unaware of the operator’s involvement. A real-time viseme recognizer is needed to ensure a proper response from the agent. In addition, the implementation of the embodied conversational agent Lili, hosting an entertainment show broadcast by RTV Slovenia, is described in more detail. The employment of the embodied conversational agent as a virtual major-domo named Maja within an intelligent ambience, using a speech recognition system and the PLATTOS TTS system, is also described.
Matej Rojc, Tomaž Rotovnik, Mišo Brus, Dušan Jan, Zdravko Kačič
Telling Stories with a Synthetic Character: Understanding Inter-modalities Relations
Abstract
Can we create a virtual storyteller that is expressive enough to convey a story to an audience in a natural way? What are the most important features for creating such a character? This paper presents a study in which the influence of different modalities on the perception of a story told by both a synthetic storyteller and a real one is analyzed. Three modes of communication were taken into account: voice, facial expression and gestures. One hundred and eight computer science students watched a video in which a storyteller narrated the traditional Portuguese story entitled “O Coelhinho Branco” (The Little White Rabbit). The students were divided into four groups, each of which saw one video in which the storyteller was portrayed either by a synthetic character or by a human. The storyteller’s voice, regardless of the nature of the character, could also be real or synthetic. After the video, the participants filled in a questionnaire rating the storyteller’s performance. Although the synthetic versions used in the experiment obtained lower ratings than their natural counterparts, the data suggest that the gap between synthetic and real gestures is the smallest, while the synthetic voice is furthest from its natural version. Furthermore, when the synthetic voice was used, the facial expressions of both characters (the virtual and the real) were rated worse than with the real voice. This effect was not significant for the gestures, suggesting that building synthetic voices that are as natural as possible is extremely important, as voice quality impacts the perception of other modes of communication (such as facial expression).
Guilherme Raimundo, João Cabral, Celso Melo, Luís C. Oliveira, Ana Paiva, Isabel Trancoso
Backmatter
Metadata
Title
Verbal and Nonverbal Communication Behaviours
Edited by
Anna Esposito
Marcos Faundez-Zanuy
Eric Keller
Maria Marinaro
Copyright Year
2007
Publisher
Springer Berlin Heidelberg
Electronic ISBN
978-3-540-76442-7
Print ISBN
978-3-540-76441-0
DOI
https://doi.org/10.1007/978-3-540-76442-7