
About This Book

This book is dedicated to the dreamers, their dreams, and their perseverance in research work.

This volume brings together the selected and peer-reviewed contributions of the participants at the COST 2102 International Conference on Verbal and Nonverbal Features of Human–Human and Human–Machine Interaction, held in Patras, Greece, October 29–31, 2007, hosted by the 19th IEEE International Conference on Tools with Artificial Intelligence (ICTAI 2007). The conference was sponsored by COST (European Cooperation in the Field of Scientific and Technical Research) in the domain of Information and Communication Technologies (ICT) for disseminating the advances of the research activity developed within COST Action 2102: “Cross-Modal Analysis of Verbal and Nonverbal Communication”. COST Action 2102 is a network of about 60 European and 6 overseas laboratories whose aim is to develop “an advanced acoustical, perceptual and psychological analysis of verbal and non-verbal communication signals originating in spontaneous face-to-face interaction, in order to identify algorithms and automatic procedures capable of identifying the human emotional states. Particular care is devoted to the recognition of emotional states, gestures, speech and facial expressions, in anticipation of the implementation of intelligent avatars and interactive dialogue systems that could be exploited to improve user access to future telecommunication services” (see the COST 2102 Memorandum of Understanding (MoU)).



- Static and Dynamic Processing of Faces, Facial Expressions, and Gaze

Data Mining Spontaneous Facial Behavior with Automatic Expression Coding

The computer vision field has advanced to the point that we are now able to begin to apply automatic facial expression recognition systems to important research questions in behavioral science. The machine perception lab at UC San Diego has developed a system based on machine learning for fully automated detection of 30 actions from the facial action coding system (FACS). The system, called the Computer Expression Recognition Toolbox (CERT), operates in real time and is robust to the video conditions of real applications. This paper describes two experiments which are the first applications of this system to analyzing spontaneous human behavior: automated discrimination of posed from genuine expressions of pain, and automated detection of driver drowsiness. The analysis revealed information about facial behavior during these conditions that was previously unknown, including the coupling of movements. Automated classifiers were able to differentiate real from fake pain significantly better than naïve human subjects, and to detect critical drowsiness with above 98% accuracy. Issues in applying machine learning systems to facial expression analysis are discussed.
Marian Bartlett, Gwen Littlewort, Esra Vural, Kang Lee, Mujdat Cetin, Aytul Ercil, Javier Movellan

Ekfrasis: A Formal Language for Representing and Generating Sequences of Facial Patterns for Studying Emotional Behavior

Emotion is a topic that has received much attention during the last few years, in the contexts of speech synthesis and image understanding as well as automatic speech recognition, interactive dialogue systems, and wearable computing. This paper presents a formal model of a language (called Ekfrasis) as a software methodology that automatically synthesizes (or generates) various facial expressions by appropriately combining facial features. The main objective here is to use this methodology to generate various combinations of facial expressions and study whether these combinations efficiently represent emotional behavioral patterns.
Nikolaos Bourbakis, Anna Esposito, Despina Kavraki

On the Relevance of Facial Expressions for Biometric Recognition

Biometric face recognition presents a wide range of variability sources, such as make-up, illumination, pose, facial expression, etc. In this paper we use the Japanese Female Facial Expression Database (JAFFE) in order to evaluate the influence of facial expression on biometric recognition rates. In our experiments we used a nearest neighbor classifier with different numbers of training samples, different error criteria, and several feature extraction methods. Our experimental results reveal that some facial expressions produce a drop in recognition rate, but the optimal length of the extracted feature vectors is the same in the presence of facial expressions as with neutral faces.
Marcos Faundez-Zanuy, Joan Fabregas
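The nearest-neighbor matcher the abstract describes can be sketched in a few lines. This is a minimal illustration on toy 2-D feature vectors, not the authors' actual experiment; the subject labels and data are invented for the example:

```python
import numpy as np

def nearest_neighbor_classify(train_X, train_y, test_X):
    """Assign each test vector the label of its closest training vector
    (Euclidean distance), as in a basic 1-NN biometric matcher."""
    preds = []
    for x in test_X:
        dists = np.linalg.norm(train_X - x, axis=1)
        preds.append(train_y[int(np.argmin(dists))])
    return preds

# Toy data: two "subjects", 2-D feature vectors (stand-ins for the
# truncated feature vectors whose optimal length the paper studies).
train_X = np.array([[0.0, 0.0], [0.1, 0.2], [5.0, 5.0], [5.2, 4.9]])
train_y = ["subject_A", "subject_A", "subject_B", "subject_B"]
test_X = np.array([[0.05, 0.1], [5.1, 5.1]])
print(nearest_neighbor_classify(train_X, train_y, test_X))
# → ['subject_A', 'subject_B']
```

Varying the number of columns kept in `train_X`/`test_X` is the analogue of varying the feature-vector length in the paper's experiments.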

Biometric Face Recognition with Different Training and Testing Databases

Biometric face recognition presents a wide range of variability sources, such as make-up, illumination, pose, facial expression, etc. Although some publicly available databases include these phenomena, they represent laboratory conditions far removed from real biometric system scenarios. In this paper we perform a set of experiments training and testing with different face databases in order to reduce the wide range of problems present in face images from different users (make-up, facial expression, rotations, etc.). We use a novel dispersion matcher which, in contrast to classical biometric systems, does not need to be trained with the whole set of users. It can recognize whether two photos are of the same person, even if photos of that person were not used in training the classifier.
Joan Fabregas, Marcos Faundez-Zanuy

Combining Features for Recognizing Emotional Facial Expressions in Static Images

This work approaches the problem of recognizing emotional facial expressions in static images, focusing on three preprocessing techniques for feature extraction: Principal Component Analysis (PCA), Linear Discriminant Analysis (LDA), and Gabor filters. These methods are commonly used for face recognition, and the novelty consists in combining the features they provide in order to improve the performance of an automatic procedure for recognizing emotional facial expressions. Recognition accuracy was evaluated on the Japanese Female Facial Expression (JAFFE) database using a Multi-Layer Perceptron (MLP) neural network as classifier. The best classification accuracy on variations of facial expressions included in the training set was obtained by combining PCA and LDA features (93% correct recognition rate), whereas combining PCA, LDA, and Gabor filter features gave 94% correct classification on facial expressions of subjects not included in the training set.
Jiří Přinosil, Zdeněk Smékal, Anna Esposito
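The feature-combination idea above amounts to concatenating the vectors each extractor produces before classification. A minimal sketch under simplifying assumptions: PCA is computed via SVD, and random vectors stand in for Gabor-filter responses (the real system would, of course, compute them from the images):

```python
import numpy as np

def pca_features(X, k):
    """Project the rows of X onto the top-k principal components,
    computed via SVD of the mean-centered data matrix."""
    Xc = X - X.mean(axis=0)
    U, S, Vt = np.linalg.svd(Xc, full_matrices=False)
    return Xc @ Vt[:k].T

rng = np.random.default_rng(0)
images = rng.normal(size=(10, 64))   # 10 flattened 8x8 face images (toy data)
gabor = rng.normal(size=(10, 5))     # stand-in for Gabor-filter responses
# Combined representation: simple concatenation of the two feature sets.
combined = np.hstack([pca_features(images, 4), gabor])
print(combined.shape)  # (10, 9): 4 PCA + 5 Gabor features per image
```

The concatenated matrix would then be fed to a classifier such as the MLP the paper uses; adding an LDA projection would extend `combined` with a third block of columns in the same way.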

Mutually Coordinated Anticipatory Multimodal Interaction

We introduce our research on anticipatory and coordinated interaction between a virtual human and a human partner. Rather than adhering to the turn-taking paradigm, we choose to investigate interaction in which there is simultaneous expressive behavior by the human interlocutor and a humanoid. Various applications in which we can study and specify such behavior, in particular behavior that requires synchronization based on predictions from performance and perception, are presented. Some observations concerning the role of predictions in conversations are presented, and architectural consequences for the design of virtual humans are drawn.
Anton Nijholt, Dennis Reidsma, Herwin van Welbergen, Rieks op den Akker, Zsofia Ruttkay

Affordances and Cognitive Walkthrough for Analyzing Human-Virtual Human Interaction

This study investigates how the psychological notion of affordance, known from human-computer interface design, can be adopted for the analysis and design of communication between a user and a Virtual Human (VH), as a novel interface. We take as starting point the original notion of affordance, used to describe the function of objects for humans. Then we dwell on the human-computer interaction case, where the object used by the human is (a piece of software in) the computer. In the next step, we look at human-human communication and identify actual and perceived affordances of the human body and mind. Then, using the generic framework of affordances, we explain certain essential phenomena of human-human multimodal communication. Finally, we show how they carry over to the case of communicating with a ’designed human’, that is, a VH, whose human-like communication means may be augmented with ones reminiscent of the computer and fictive worlds. In the closing section we discuss and reformulate the method of cognitive walkthrough to make it applicable for evaluating the design of verbal and non-verbal interactive behaviour of VHs.
Zsófia Ruttkay, Rieks op den Akker

- Emotional Speech Synthesis and Recognition: Applications to Telecommunication Systems

Individual Traits of Speaking Style and Speech Rhythm in a Spoken Discourse

This paper describes an analysis of the verbal and nonverbal speaking characteristics of six speakers of Japanese when talking over the telephone to partners whose degrees of familiarity change over time. The speech data from 100 30-minute conversations between them were transcribed, and the acoustic characteristics of each utterance were subjected to an analysis of timing characteristics to determine individuality with respect to duration of utterance, degree of overlap, pauses, and other aspects of speech and discourse rhythm. The speakers showed many common traits, but noticeable differences were found to correlate well with degree of familiarity with the interlocutor. Several different styles of interaction can be automatically distinguished in the conversational speech data from their timing patterns.
Nick Campbell

The Organization of a Neurocomputational Control Model for Articulatory Speech Synthesis

The organization of a computational control model of articulatory speech synthesis is outlined in this paper. The model is based on general principles of neurophysiology and cognitive psychology. Thus it is based on such neural control circuits, neural maps and mappings as are hypothesized to exist in the human brain, and the model is based on learning or training mechanisms similar to those occurring during the human process of speech acquisition. The task of the control module is to generate articulatory data for controlling an articulatory-acoustic speech synthesizer. Thus a complete “BIONIC” (i.e. BIOlogically motivated and techNICally realized) speech synthesizer is described, capable of generating linguistic, sensory, and motor neural representations of sounds, syllables, and words, capable of generating articulatory speech movements from neuromuscular activation, and subsequently capable of generating acoustic speech signals by controlling an articulatory-acoustic vocal tract model. The module developed thus far is capable of producing single sounds (vowels and consonants), simple CV- and VC-syllables, and first sample words. In addition, processes of human-human interaction occurring during speech acquisition (mother-child or carer-child interactions) are briefly discussed in this paper.
Bernd J. Kröger, Anja Lowit, Ralph Schnitker

Automatic Speech Recognition Used for Intelligibility Assessment of Text-to-Speech Systems

Speech intelligibility is the most important parameter in the evaluation of speech quality. In this contribution, a new objective intelligibility assessment of general speech processing algorithms is proposed. It is based on automatic recognition methods developed for discrete and fluent speech processing. The idea is illustrated with two case studies: a) comparison of listening evaluation of Czech rhyme tests with automatic discrete speech recognition, and b) automatic continuous speech recognition of general-topic Czech texts read by professional and nonprofessional speakers vs. the same texts generated by several Czech Text-to-Speech systems. The aim of the proposed approach is fast and objective intelligibility assessment of Czech Text-to-Speech systems, which include male and female voices and a voice conversion module.
Robert Vích, Jan Nouza, Martin Vondra
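The core of an ASR-based intelligibility score of this kind is a comparison of the recognizer's output against the reference text, typically via word-level edit distance. A minimal sketch (the sentences are invented examples, and this is a generic word-accuracy computation, not the authors' specific scoring pipeline):

```python
def word_accuracy(reference, hypothesis):
    """Word accuracy = 1 - (Levenshtein distance between the reference
    word sequence and the ASR hypothesis) / (reference length)."""
    ref, hyp = reference.split(), hypothesis.split()
    # d[i][j]: edit distance between ref[:i] and hyp[:j]
    d = [[0] * (len(hyp) + 1) for _ in range(len(ref) + 1)]
    for i in range(len(ref) + 1):
        d[i][0] = i
    for j in range(len(hyp) + 1):
        d[0][j] = j
    for i in range(1, len(ref) + 1):
        for j in range(1, len(hyp) + 1):
            cost = 0 if ref[i - 1] == hyp[j - 1] else 1
            d[i][j] = min(d[i - 1][j] + 1,      # deletion
                          d[i][j - 1] + 1,      # insertion
                          d[i - 1][j - 1] + cost)  # substitution
    return 1.0 - d[len(ref)][len(hyp)] / len(ref)

print(word_accuracy("the cat sat on the mat", "the cat sat on mat"))
```

Higher accuracy of the recognizer on synthetic speech than on degraded natural speech would then count as evidence of higher intelligibility of the TTS output.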

ECESS Platform for Web Based TTS Modules and Systems Evaluation

The paper presents a platform for web-based evaluation of TTS modules and systems named RES (Remote Evaluation System). It is being developed within the European Centre of Excellence for Speech Synthesis (ECESS). The presented platform will be used for web-based online evaluation of various text-to-speech (TTS) modules, and even complete TTS systems, presently running at different institutes and universities worldwide. Each ECESS partner has to install the RES module server locally and connect it to its TTS modules. Using the RES client, partners will be able to perform different evaluation tasks for their modules using any necessary additional modules and/or language resources of other partners, installed locally at those partners’ sites. Additionally, they will be able to integrate their own modules into the complete web-based TTS system in conjunction with the necessary modules of other partners. By using the RES client they could also build up a complete TTS system via the web, without using any of their own modules. Several partners can contribute their modules, even with the same functionality, and it is easy to add a new module to the whole web-based distributed system. The user decides which partner’s module to use in his own configuration or for a particular evaluation task. Evaluation can be done by any institution able to access the modules under evaluation, without the need to install them locally. The platform will be used within the evaluation campaigns for different TTS modules and complete TTS systems carried out by the ECESS consortium. The first remote evaluation campaign of text processing modules using the developed platform is foreseen for January 2008.
Matej Rojc, Harald Höge, Zdravko Kačič

Towards Slovak Broadcast News Automatic Recording and Transcribing Service

Information is one of the most valuable commodities nowadays. Information retrieval from broadcast news recordings is thus becoming one of the services most requested by end-users. The planned Slovak automatic broadcast news (BN) processing service provides automatic transcription and metadata extraction, enabling users to obtain information from the processed recordings using a web interface and a search engine. The resulting information is then provided through a multimodal interface, which allows users not only to see the recorded audio-visual material, but also all automatically extracted metadata (verbal and nonverbal), and to flag incorrectly identified data. The architecture of the present system is linear, meaning that each module starts after the previous one has finished processing the data.
Matúš Pleva, Anton Čižmár, Jozef Juhár, Stanislav Ondáš, Michal Mirilovič

Computational Stylometry: Who’s in a Play?

Automatic text classification techniques are applied to the problem of quantifying strength of characterization within plays, using a case study of the works of four sample playwrights that are freely available in machine-readable form. Strong characters are those whose speeches constitute homogeneous categories in comparison with other characters—their speeches are more attributable to themselves than to their play or their author.
Carl Vogel, Gerard Lynch
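The attribution idea above — a speech is "more attributable" to a character when it is more similar to that character's other speeches than to the rest of the play — can be illustrated with a simple bag-of-words similarity. This is only a toy sketch of the general principle, not the authors' classifier; the probe and reference texts are invented examples:

```python
from collections import Counter
import math

def cosine_sim(a, b):
    """Cosine similarity between the bag-of-words vectors of two texts."""
    ca, cb = Counter(a.lower().split()), Counter(b.lower().split())
    dot = sum(ca[w] * cb[w] for w in ca)
    na = math.sqrt(sum(v * v for v in ca.values()))
    nb = math.sqrt(sum(v * v for v in cb.values()))
    return dot / (na * nb)

hamlet = "to be or not to be that is the question"
other = "now is the winter of our discontent"
probe = "to die to sleep no more"
# A "strong" character attracts its own speeches: the probe resembles
# the same character's reference speeches more than the other text.
print(cosine_sim(probe, hamlet) > cosine_sim(probe, other))  # → True
```

Real computational stylometry would use much richer features (function-word frequencies, character n-grams) and held-out evaluation, but the homogeneity test has this comparative shape.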

The Acoustic Front-End in Scenarios of Interaction Research

The paper concerns some problems posed by the growing interest in social interaction research, insofar as they can be solved by engineers in acoustics and speech technology. First, the importance of nonverbal and paraverbal modalities is discussed for two prototypical scenarios: face-to-face interactions in psychotherapeutic consulting, and side-by-side interactions of children cooperating in a computer game. Some challenges in signal processing are stated with respect to both scenarios. The following technologies of acoustic signal processing are discussed: (a) analysis of the influence of the room impulse response on the recognition rate, (b) adaptive two-channel microphones, (c) localization and separation of sound sources in rooms, and (d) single-channel noise suppression.
Rüdiger Hoffmann, Lutz-Michael Alisch, Uwe Altmann, Thomas Fehér, Rico Petrick, Sören Wittenberg, Rico Hermkes

Application of Expressive Speech in TTS System with Cepstral Description

Expressive speech synthesis representing different human emotions has interested researchers for a long time. Recently, some experiments with a storytelling speaking style have been performed. This particular speaking style is suitable for applications aimed at children as well as special applications aimed at blind people. Analyzing human storytellers’ speech, we designed a set of prosodic parameter prototypes for converting speech produced by a text-to-speech (TTS) system into storytelling speech. In addition to the suprasegmental characteristics (pitch, intensity, and duration) included in these speech prototypes, information about significant frequencies of the spectral envelope and the spectral flatness determining the degree of voicing was also used.
Jiří Přibil, Anna Přibilová

Speech Emotion Perception by Human and Machine

Human speech contains and reflects information about the emotional state of the speaker. The importance of emotion research is increasing in telematics, information technologies, and even health services. Research into the mean acoustic parameters of emotions is a very complicated task. Emotions are mainly characterized by suprasegmental parameters, but other segmental factors can contribute to the perception of emotions as well. These parameters vary within one language, across speakers, etc. In the first part of our research work, human emotion perception was examined. Steps in creating an emotional speech database are presented. The database contains recordings of 3 Hungarian sentences with 8 basic emotions pronounced by nonprofessional speakers. Comparison of perception test results obtained with the database recorded by nonprofessional speakers showed recognition results similar to an earlier perception test with professional actors/actresses. It was also made clear that a neutral sentence heard before the expression of the emotion, pronounced by the same speakers, cannot help the perception of the emotion to a great extent. In the second part of our research work, an automatic emotion recognition system was developed. Statistical methods (HMM) were used to train different emotional models. The recognition was optimized by changing the acoustic preprocessing parameters and the number of states of the Markov models.
Szabolcs Levente Tóth, David Sztahó, Klára Vicsi
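In the recognition scheme described above, one statistical model is trained per emotion and an utterance is assigned the emotion whose model scores it highest. The sketch below illustrates that decision rule with a single Gaussian per emotion — a drastic simplification of the paper's HMMs — and with invented per-emotion parameters for one prosodic feature:

```python
import math

def gaussian_loglik(xs, mean, var):
    """Log-likelihood of a feature sequence under a single Gaussian --
    a simplified stand-in for scoring an utterance with a per-emotion HMM."""
    return sum(-0.5 * (math.log(2 * math.pi * var) + (x - mean) ** 2 / var)
               for x in xs)

# Hypothetical per-emotion models of one feature (e.g. mean F0 in Hz):
# (mean, variance) pairs, invented for illustration only.
models = {"neutral": (120.0, 80.0), "anger": (190.0, 400.0)}
utterance = [185.0, 200.0, 178.0]  # observed per-frame F0 values (toy data)
# Decision rule: pick the emotion whose model maximizes the likelihood.
best = max(models, key=lambda e: gaussian_loglik(utterance, *models[e]))
print(best)  # → anger
```

An HMM-based system replaces the single Gaussian with a state sequence over frame-level acoustic features, but the argmax-over-models decision is the same.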

Expressive Speech Synthesis Using Emotion-Specific Speech Inventories

In this paper we explore the use of emotion-specific speech inventories for expressive speech synthesis. We recorded a semantically neutral sentence and 26 logatoms containing all the diphones and CVC triphones necessary to synthesize the same sentence. The speech material was produced by a professional actress expressing all logatoms and the sentence with the six basic emotions and in neutral tone. 7 emotion-dependent inventories were constructed from the logatoms. The 7 inventories paired with the prosody extracted from the 7 natural sentences were used to synthesize 49 sentences. 194 listeners evaluated the emotions expressed in the logatoms and in the natural and synthetic sentences. The intended emotion was recognized above chance level for 99% of the logatoms and for all natural sentences. Recognition rates significantly above chance level were obtained for each emotion. The recognition rate for some synthetic sentences exceeded that of natural ones.
Csaba Zainkó, Márk Fék, Géza Németh

Study on Speaker-Independent Emotion Recognition from Speech on Real-World Data

In the present work we report results from ongoing research activity in the area of speaker-independent emotion recognition. Experiments are performed to examine the behavior of a detector of negative emotional states over non-acted/acted speech. Furthermore, a score-level fusion of two classifiers at the utterance level is applied, in an attempt to improve the performance of the emotion recognizer. Experimental results demonstrate significant differences in recognizing emotions in acted versus real-world speech.
Theodoros Kostoulas, Todor Ganchev, Nikos Fakotakis

Exploiting a Vowel Based Approach for Acted Emotion Recognition

This paper is dedicated to the description and study of a new feature extraction approach for emotion recognition. Our contribution is based on the extraction and characterization of phonemic units such as vowels and consonants, which are provided by a pseudo-phonetic speech segmentation phase combined with a vowel detector. The segmentation algorithm is evaluated on both emotional (Berlin) and non-emotional (TIMIT, NTIMIT) databases. For the emotion recognition task, we propose to extract MFCC acoustic features from these pseudo-phonetic segments (vowels, consonants), and we compare this approach with traditional voiced and unvoiced segments. The classification is achieved by the well-known k-NN classifier (k-nearest neighbors) on the Berlin corpus.
Fabien Ringeval, Mohamed Chetouani

Towards Annotation of Nonverbal Vocal Gestures in Slovak

The paper presents some of the problems of classification and annotation of speech sounds that have their own phonetic content, phonological function, and prosody, but they do not have an adequate linguistic (or text) representation. One of the most important facts about these "nonverbal vocal gestures" is that they often have a rich semantic content and they play an important role in expressive speech. The techniques that have been used in an effort to find an adequate classification system and annotation scheme for these gestures include prosody modeling and approaches comparing the nonverbal vocal gestures with their verbal (lexical) and body counterparts.
Milan Rusko, Jozef Juhár

The CineLingua Approach: Verbal and Non-verbal Features in Second Language Acquisition. Film Narrative to Anchor Comprehension and Production

This study has as its central focus a foreign language classroom environment that makes daily use of multimedia film lessons in the target language. More specifically, this study explores whether first-semester, first-year college students of Italian will understand film narrative, and how the comprehension of the film narrative will affect their written production. Eighty-one first-year college students participated in this study. The results suggest that the Anchored Learning Group performed better than the Basic-Skills Learning Group in the comprehension tasks of the two film segments. The second experimental probe, consisting of a production task, shows that compared to the Basic-Skills Learning Group, the written productions of the Anchored Learning Group reflect far better the structure of the narrative discourse participants were exposed to. For this production task, students were required to write their essay using at least ten Italian verbs.
Rosa Volpe

