2014 | Book

Natural Interaction with Robots, Knowbots and Smartphones

Putting Spoken Dialog Systems into Practice

Edited by: Joseph Mariani, Sophie Rosset, Martine Garnier-Rizet, Laurence Devillers

Publisher: Springer New York

About this book

These proceedings present the state of the art in spoken dialog systems with applications in robotics, knowledge access and communication. They address specifically: 1. Dialog for interacting with smartphones; 2. Dialog for open-domain knowledge access; 3. Dialog for robot interaction; 4. Mediated dialog (including crosslingual dialog involving speech translation); and 5. Dialog quality evaluation. These articles were presented at the IWSDS 2012 workshop.

Table of Contents

Frontmatter

Spoken Dialog Systems in Everyday Applications

Frontmatter
Chapter 1. Spoken Language Understanding for Natural Interaction: The Siri Experience

Recent advances in software integration and efforts toward more personalization and context awareness have brought closer the long-standing vision of the ubiquitous intelligent personal assistant. This has become particularly salient in the context of smartphones and electronic tablets, where natural language interaction has the potential to considerably enhance mobile experience. Far beyond merely offering more options in terms of user interface, this trend may well usher in a genuine paradigm shift in man-machine communication. This contribution reviews the two major semantic interpretation frameworks underpinning natural language interaction, along with their respective advantages and drawbacks. It then discusses the choices made in Siri, Apple’s personal assistant on the iOS platform, and speculates on how the current implementation might evolve in the near future to best mitigate any downside.

Jerome R. Bellegarda
Chapter 2. Development of Speech-Based In-Car HMI Concepts for Information Exchange Internet Apps

The permanent use of smartphones impacts the automotive environment. People tend to use their smartphone's Internet capabilities manually while driving, which endangers the driver's safety. Therefore, an intuitive in-car speech interface to the Internet is crucial in order to reduce driver distraction. Before developing an in-car speech dialog system for a new domain, it is necessary to examine which speech-based human-machine interface concept is the most intuitive. This work-in-progress report describes the design of various human-machine interface concepts which include speech as the main input and output modality. These concepts are based on two different dialog strategies: a command-based and a conversational speech dialog. Different graphical user interfaces, one including an avatar, have been designed in order to best support the speech dialog strategies and to raise the level of naturalness in the interaction. For each human-machine interface concept, a prototype which allows for online hotel booking has been developed. These prototypes will be evaluated in driving simulator experiments with respect to usability and driving performance.

Hansjörg Hofmann, Anna Silberstein, Ute Ehrlich, André Berton, Christian Müller, Angela Mahr
Chapter 3. Real Users and Real Dialog Systems: The Hard Challenge for SDS

Much of the research done in our community is based on developing spoken dialog systems and testing various techniques within those dialog systems. Because it makes it easier to control our experimental conditions, many of our tests and studies involve controlled (paid or volunteer) users. However, we have seen in a number of studies that these controlled users do not use the system in the same way as those for whom the system was actually designed. Sometimes the real user, who wants the information the spoken dialog system provides or who wants to give information to it, and the controlled user, who is acting under some direction, do not behave all that differently. Certainly in some circumstances it is necessary to use the latter. But, since state-of-the-art systems have become increasingly reliant on large amounts of user data to train their models of behavior, it is critical that the user behavior we train on is real user behavior. This paper describes the issues that arise when building a spoken dialog system for real users. The goal is to provide both a service to the user and a realistic spoken dialog system (SDS) research platform.

Alan W. Black, Maxine Eskenazi
Chapter 4. A Multimodal Multi-device Discourse and Dialogue Infrastructure for Collaborative Decision-Making in Medicine

The dialogue components we developed provide the infrastructure of the disseminated industrial prototype RadSpeech, a semantic speech dialogue system for radiologists. The major contribution of this paper is the description of a new speech-based interaction scenario for RadSpeech in which two radiologists use two independent but related mobile speech devices (iPad and iPhone) and collaborate via a connected large-screen installation using related speech commands. With traditional user interfaces, users may browse or explore patient data, but little to no help is given when it comes to structuring collaborative user input and annotating radiology images in real time with ontology-based medical annotations. A distinctive feature is that the interaction design uses the touch screens of the mobile devices for more complex tasks rather than for simpler ones such as merely remote-controlling the image display on the large screen.

Daniel Sonntag, Christian Schulz

Spoken Dialog Prototypes and Products

Frontmatter
Chapter 5. Yochina: Mobile Multimedia and Multimodal Crosslingual Dialogue System

Yochina is a mobile application for crosslingual and cross-cultural understanding. The core of the demonstrated app supports dialogues between English and Chinese and between German and Chinese. The dialogue facility is connected with interactive language guides, culture guides and country guides. The app is based on a generic framework enabling such novel combinations of interactive assistance and reference for any language pair, travel region and culture. The framework integrates template-based translation, speech synthesis, finite-state models of crosslingual dialogues and multimedia sentence generation. Furthermore, it allows the interlinking of crosslingual communication and tourism-relevant content. A semantic search provides easy access to words, phrases, translations and information.

Feiyu Xu, Sven Schmeier, Renlong Ai, Hans Uszkoreit
Chapter 6. Walk This Way: Spatial Grounding for City Exploration

Recently there has been an interest in spatially aware systems for pedestrian routing and city exploration, due to the proliferation of smartphones with GPS receivers among the general public. Since GPS readings are noisy, giving good and well-timed route instructions to pedestrians is a challenging problem. This paper describes a spoken-dialogue prototype for pedestrian navigation in Stockholm that addresses this problem by using various grounding strategies.

Johan Boye, Morgan Fredriksson, Jana Götze, Joakim Gustafson, Jürgen Königsmann
Chapter 7. Multimodal Dialogue System for Interaction in AmI Environment by Means of File-Based Services

This paper presents our ongoing work on the development of a multimodal dialogue system to enable user control of home appliances in an ambient intelligence environment. The physical interaction with the appliances is carried out by means of Octopus, a system developed in a previous study to ease communication with hardware devices by abstracting them as network files. To operate the appliances and get information about their state, the dialogue system writes and reads files using WebDAV. This architecture presents an important advantage, since the appliances are treated as abstract objects, which notably simplifies the dialogue system's interaction with them.

Nieves Ábalos, Gonzalo Espejo, Ramón López-Cózar, Francisco J. Ballesteros, Enrique Soriano, Gorka Guardiola
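
As a rough illustration of the file-based control idea described above, the sketch below writes and reads an appliance's state through plain HTTP PUT/GET requests (the verbs WebDAV builds on). The server URL, file layout and the requests-based client are assumptions for illustration, not the actual Octopus interface.

```python
# Minimal sketch of file-based appliance control over WebDAV (hypothetical
# server URL and file layout; not the actual Octopus interface).
import requests

BASE = "http://example.local/webdav/appliances"  # assumed WebDAV mount

def set_state(appliance: str, state: str) -> None:
    """Write the desired state (e.g. 'on'/'off') to the appliance's control file."""
    r = requests.put(f"{BASE}/{appliance}/ctl", data=state.encode("utf-8"))
    r.raise_for_status()

def get_state(appliance: str) -> str:
    """Read the appliance's current state back from its status file."""
    r = requests.get(f"{BASE}/{appliance}/status")
    r.raise_for_status()
    return r.text.strip()

if __name__ == "__main__":
    set_state("living_room_lamp", "on")
    print(get_state("living_room_lamp"))
```
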
Chapter 8. Development of a Toolkit Handling Multiple Speech-Oriented Guidance Agents for Mobile Applications

In this study, we propose a novel toolkit to handle multiple speech-oriented guidance agents for mobile applications. The toolkit follows a server-client architecture: we assume the servers are located in a cloud-computing environment and the clients are mobile phones, such as the iPhone. A large number of servers exist in the cloud-computing environment, and each server can communicate with the others. It is difficult to develop an omnipotent spoken dialog system, but it is easy to develop a spoken dialog agent that has limited but deep knowledge. If such limited agents could communicate with each other, a spoken dialog system with wide-ranging knowledge could be created. In this paper, we implement speech-oriented guidance servers specialized in providing guide information for confined locations and a mobile application that can retrieve information from those servers.

Sunao Hara, Hiromichi Kawanami, Hiroshi Saruwatari, Kiyohiro Shikano
Chapter 9. Providing Interactive and User-Adapted E-City Services by Means of Voice Portals

Digital cities offer new ways to provide information and services of a town or a region in an integrated form, favoring citizen participation and the use of services in previously unavailable ways. In addition, speech technologies and language processing have made possible the development of a number of new applications based on spoken dialog systems. One of them is voice portals, which facilitate spoken interaction with the Internet to provide their users with specific information or web services. In this chapter, we describe a voice portal developed to provide municipal information. The functionalities provided by the system include querying information about the City Council, accessing city information, carrying out several steps and procedures, completing surveys, accessing the citizen's mailbox to leave suggestions and complaints, and being transferred to the City Council to be attended by a teleoperator. In this way, the voice portal improves the support of public services by increasing their availability, flexibility, and control while reducing costs and missed calls.

David Griol, María García-Jiménez, Zoraida Callejas, Ramón López-Cózar

Multi-domain, Crosslingual Spoken Dialog Systems

Frontmatter
Chapter 10. Efficient Language Model Construction for Spoken Dialog Systems by Inducting Language Resources of Different Languages

Since the quality of the language model directly affects the performance of the spoken dialog system (SDS), we should use a statistical language model (LM) trained with a large amount of data that is matched to the task domain. When porting a SDS to another language, however, it is costly to re-collect a large amount of user utterances in the target language. We thus use the language resources in a source language by utilizing statistical machine translation. The main challenge in this work is to induct automatic speech recognition results collected using a speech-input system that differs from the target SDS both in the task and the target language. To select appropriate sentences to be included in the training data for the LM, we induct a spoken language understanding module of the dialog system in the source language. Experimental construction using over three million user utterances showed that it is vital to conduct a selection from the translation results.

Teruhisa Misu, Shigeki Matsuda, Etsuo Mizukami, Hideki Kashioka, Haizhou Li
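
A minimal sketch of the data-selection idea described above: keep only machine-translated sentences whose source-language side is accepted by an SLU module with sufficient confidence, then count n-grams for LM training. The slu_confidence callable and the bigram counting below are illustrative placeholders, not the authors' pipeline.

```python
# Sketch: keep only MT outputs whose source-language SLU analysis is confident
# enough, then build bigram counts for LM training. `slu_confidence` is a
# hypothetical stand-in for the spoken language understanding module.
from collections import Counter
from typing import Callable, Iterable

def select_for_lm(pairs: Iterable[tuple[str, str]],
                  slu_confidence: Callable[[str], float],
                  threshold: float = 0.7) -> list[str]:
    """pairs = (source_sentence, translated_sentence); filter on the source side."""
    return [tgt for src, tgt in pairs if slu_confidence(src) >= threshold]

def bigram_counts(sentences: Iterable[str]) -> Counter:
    """Count bigrams with sentence-boundary markers for a simple n-gram LM."""
    counts = Counter()
    for s in sentences:
        tokens = ["<s>"] + s.split() + ["</s>"]
        counts.update(zip(tokens, tokens[1:]))
    return counts
```
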
Chapter 11. Towards Online Planning for Dialogue Management with Rich Domain Knowledge

Most approaches to dialogue management have so far concentrated on offline optimisation techniques, where a dialogue policy is precomputed for all possible situations and then plugged into the dialogue system. This development strategy however has some limitations in terms of domain scalability and adaptivity, since these policies are essentially static and cannot readily accommodate runtime changes in the environment or task dynamics. In this paper, we follow an alternative approach based on online planning. To ensure that the planning algorithm remains tractable over longer horizons, the presented method relies on probabilistic models expressed via probabilistic rules that capture the internal structure of the domain using high-level representations. We describe in this paper the generic planning algorithm, ongoing implementation efforts and directions for future work.

Pierre Lison
Chapter 12. A Two-Step Approach for Efficient Domain Selection in Multi-Domain Dialog Systems

This paper discusses a domain selection method for multi-domain dialog systems to generate the most appropriate system utterance in response to a user utterance. We present a two-step approach for efficient domain selection. In our proposed approach, the domain candidates are listed in descending order of scores and then each domain is verified by content-based filtering. When we applied our method, the accuracy increased and the time cost decreased compared to baseline methods.

Injae Lee, Seokhwan Kim, Kyungduk Kim, Donghyeon Lee, Junhwi Choi, Seonghan Ryu, Gary Geunbae Lee
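
The two-step selection described above can be sketched as ranking the candidate domains by score and accepting the first one that passes a content-based verification check. The score and verify functions below are hypothetical stand-ins.

```python
# Sketch of a two-step domain selection: rank candidate domains by score, then
# accept the first one that passes a content-based verification step.
from typing import Callable, Optional

def select_domain(utterance: str,
                  domains: list[str],
                  score: Callable[[str, str], float],
                  verify: Callable[[str, str], bool]) -> Optional[str]:
    """Return the best verified domain for the utterance, or None."""
    ranked = sorted(domains, key=lambda d: score(utterance, d), reverse=True)
    for domain in ranked:
        if verify(utterance, domain):   # content-based filtering
            return domain
    return None  # fall back, e.g. to a chit-chat or rejection domain
```
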

Human-Robot Interaction

Frontmatter
Chapter 13. From Informative Cooperative Dialogues to Long-Term Social Relation with a Robot

A lot of progress has been made in the domain of human-machine dialogue, but it is still a real challenge and, most often, only informative, cooperative kinds of dialogues are explored. This paper explores the ability of a robot to create and maintain a long-term social relationship through more advanced dialogue techniques. We expose the social (Goffman), psychological (Scherer) and neural (Mountcastle) theories used to accomplish such complex social interactions. From these theories, we build a consistent, computationally efficient model to create a robot that can understand the concept of lying and show compassion: a robotic social companion.

Axel Buendia, Laurence Devillers
Chapter 14. Integration of Multiple Sound Source Localization Results for Speaker Identification in Multiparty Dialogue System

Humanoid robots need to turn toward human participants when answering their questions in multiparty dialogues. Some participants' positions are difficult for a robot to localize in multiparty situations, especially when the robot can only use its own sensors. We present a method for identifying the speaker more accurately by integrating the multiple sound source localization results obtained from two robots: one talking mainly with the participants and the other joining the conversation when necessary. We place them so that they can compensate for each other's localization capabilities and then integrate their two results. Our experimental evaluation revealed that using two robots improved speaker identification compared with using only one robot. We furthermore implemented our method on humanoid robots and constructed a demo system.

Taichi Nakashima, Kazunori Komatani, Satoshi Sato
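
One simple way to combine two robots' sound source localization results, sketched below, is to intersect their direction-of-arrival bearings to triangulate the speaker position. This is only a geometric illustration under assumed known robot positions, not the integration method evaluated in the chapter.

```python
# Rough sketch: fuse the direction-of-arrival estimates of two robots by
# intersecting the two bearing rays (a simple triangulation stand-in;
# robot positions and DOA angles are assumed to be known).
import numpy as np

def triangulate(p1, theta1, p2, theta2):
    """p1, p2: robot positions (x, y); theta1, theta2: DOA angles in radians."""
    p1, p2 = np.asarray(p1, float), np.asarray(p2, float)
    d1 = np.array([np.cos(theta1), np.sin(theta1)])
    d2 = np.array([np.cos(theta2), np.sin(theta2)])
    # Solve p1 + t1*d1 = p2 + t2*d2 for t1, t2.
    A = np.column_stack([d1, -d2])
    if abs(np.linalg.det(A)) < 1e-9:
        return None  # rays are (nearly) parallel, no reliable intersection
    t = np.linalg.solve(A, p2 - p1)
    return p1 + t[0] * d1

print(triangulate((0, 0), np.pi / 4, (2, 0), 3 * np.pi / 4))  # ~ (1.0, 1.0)
```
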
Chapter 15. Investigating the Social Facilitation Effect in Human–Robot Interaction

The social facilitation effect is a well-known social-psychological phenomenon. It describes how performance changes depending on the presence or absence of others. The current study investigates if the social facilitation effect can also be observed while interacting with anthropomorphic robots.

Ina Wechsung, Patrick Ehrenbrink, Robert Schleicher, Sebastian Möller
Chapter 16. More Than Just Words: Building a Chatty Robot

This paper presents the motivation for, design of and current implementation of a robot spoken dialogue platform, created to aid exploration of multimodal human-machine dialogue. It also describes the design of a dialogue used to collect a database of interactive speech while the robot was exhibited over a three-month period at the Science Gallery in Dublin. The system was wizard-controlled and collected samples of informal, chatty dialogue, which is normally difficult to capture under laboratory conditions for human-human dialogue and particularly so for human-machine interaction. The system is being developed further to facilitate additional data collection and experimentation.

Emer Gilmartin, Nick Campbell
Chapter 17. Predicting When People Will Speak to a Humanoid Robot

We tackle the novel problem of predicting when a user is likely to begin speaking to a humanoid robot. Human speakers usually take the state of their addressee into consideration and choose when to begin speaking to the addressee, and our idea is to exploit this convention in a system that interprets audio input. The proposed method predicts when a user is likely to begin speaking to a humanoid robot by machine learning that uses the robot's behaviors, such as its posture, motion, and utterance, as input features. We create a data set manually annotated by three human participants indicating in real time whether or not they would be likely to begin speaking to the robot. We collect the parts to which all three give the same labels and use these parts as the training and evaluation data for machine learning. Results of an experimental evaluation showed that our model correctly predicted 88.5% of the common parts in an open test. This result is similar to the results of a cross-validation, demonstrating that our model is not dependent on a specific training data set. A possible application of the model is the elimination of environmental noise that occurs at times when a cooperative user is not likely to be speaking to the robot.

Takaaki Sugiyama, Kazunori Komatani, Satoshi Sato
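
The learning setup described above can be sketched as training a classifier on features describing the robot's current behavior to predict whether a user would start speaking now. The feature columns and toy data below are invented for illustration.

```python
# Sketch: predict from the robot's current behaviour whether a user is likely
# to start speaking. Feature names and data are illustrative only.
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score

# Each row: [is_speaking, is_moving, gaze_at_user, seconds_since_last_utterance]
X = np.array([[1, 0, 0, 0.5],
              [0, 0, 1, 3.0],
              [0, 1, 0, 1.0],
              [0, 0, 1, 5.0],
              [1, 1, 0, 0.2],
              [0, 0, 0, 4.0]])
y = np.array([0, 1, 0, 1, 0, 1])  # 1 = a user would likely start speaking now

clf = LogisticRegression().fit(X, y)
print(cross_val_score(clf, X, y, cv=3).mean())
```
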
Chapter 18. Designing an Emotion Detection System for a Socially Intelligent Human-Robot Interaction

The long-term goal of this work is to build an assistive robot for elderly and disabled people. It is part of the French ANR ARMEN project. The subjects will interact with a mobile robot controlled by a virtual character. In order to build this system, we collected interactions between patients from different medical centers and a Wizard-of-Oz operated virtual character, within scenarios written with physicians and functional therapists. The human-robot spoken interaction consisted mainly of small talk with patients, with no real task to perform. For precise tasks such as "Finding a remote control," keyword recognition is performed. The main focus of the article is the construction of an emotion detection system that will be used to control the dialog and the answer strategy of the virtual character. This article presents the Wizard-of-Oz system for the audio corpus collection, which is used for training the emotion detection module. We analyze the audio data at the segmental level on annotated measures of acoustically perceived emotion, but also at the interaction level with global objective measures such as the amount of speech and emotion. We also report on the results of a questionnaire qualifying the interaction and the agent, and compare objective and subjective measures.

Clément Chastagnol, Céline Clavel, Matthieu Courgeon, Laurence Devillers
Chapter 19. Multimodal Open-Domain Conversations with the Nao Robot

In this paper we discuss the design of human-robot interaction focussing especially on social robot communication and multimodal information presentation. As a starting point we use the WikiTalk application, an open-domain conversational system which has been previously developed using a robotics simulator. We describe how it can be implemented on the Nao robot platform, enabling Nao to make informative spoken contributions on a wide range of topics during conversation. Spoken interaction is further combined with gesturing in order to support Nao’s presentation by natural multimodal capabilities, and to enhance and explore natural communication between human users and robots.

Kristiina Jokinen, Graham Wilcock
Chapter 20. Component Pluggable Dialogue Framework and Its Application to Social Robots

This paper is concerned with the design and development of a component-pluggable, event-driven dialogue framework for service robots. We abstract standard dialogue functions and encapsulate them into different types of components or plug-ins. A component can be a hardware device, a software module, an algorithm or a database connection. The framework is empowered by a multipurpose XML-based dialogue engine, which is capable of pipeline information flow construction, event mediation, multi-topic dialogue modeling and different types of knowledge representation. The framework is domain-independent, cross-platform, and multilingual. Experiments on various service robots in our social robotics laboratory showed that the same framework works for all the robots that need a speech interface. The development cycle for a new dialogue system is greatly shortened, while system robustness, reliability, and maintainability are significantly improved.

Ridong Jiang, Yeow Kee Tan, Dilip Kumar Limbu, Tran Anh Dung, Haizhou Li

Spoken Dialog Systems Components

Frontmatter
Chapter 21. Visual Contribution to Word Prominence Detection in a Playful Interaction Setting

This paper investigates how prominent words can be distinguished from non-prominent ones in a setting where a user interacted with a computer in a small game designed as a Wizard-of-Oz experiment. Misunderstandings by the system were triggered and the user was asked to correct them naturally, i.e., using prosodic cues. Consequently, the corrected word is expected to be highly prominent. Audio-visual recordings were made with a distant microphone and without visual markers. As acoustic features, relative energy, duration and fundamental frequency were calculated. From the visual channel, rigid head movements and image-transformation-based features from the mouth region were extracted. Different feature combinations are evaluated regarding their power to discriminate the prominent from the non-prominent words using an SVM. Depending on the features, accuracies of approximately 70-80% are achieved. The visual features are particularly beneficial when the acoustic features are weaker.

Martin Heckmann
Chapter 22. Label Noise Robustness and Learning Speed in a Self-Learning Vocal User Interface

A self-learning vocal user interface learns to map user-defined spoken commands to intended actions. The voice user interface is trained by mining the speech input and the action it provokes on a device. Although this generic procedure allows a great deal of flexibility, it comes at a cost. Two requirements are important to create a user-friendly learning environment. First, the self-learning interface should be robust against the typical errors that occur in the interaction between a non-expert user and the system. For instance, the user gives a wrong learning example to the system by commanding "Turn on the television!" while pushing a power button on the wrong remote control. The spoken command is then supervised by a wrong action, and we refer to these errors as label noise. Second, the mapping between voice commands and intended actions should happen fast, i.e., require few examples. To meet these requirements, we implemented learning through supervised non-negative matrix factorization (NMF). We tested keyword recognition accuracy for different levels of label noise and different sizes of training sets. Our learning approach is robust against label noise, but some improvement regarding fast mapping is desirable.

Bart Ons, Jort F. Gemmeke, Hugo Van hamme
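
Supervised NMF in this setting can be sketched, very roughly, by stacking the acoustic representation of each utterance with a one-hot encoding of the provoked action, factorising the joint matrix, and at test time estimating the activations from the acoustic part alone. The toy data and this particular formulation are illustrative assumptions, not the authors' exact model.

```python
# Minimal sketch of label-supervised NMF: stack acoustic features with a
# one-hot action label, factorise jointly, and at test time estimate the
# activations from the acoustic part alone to predict the action.
import numpy as np
from scipy.optimize import nnls
from sklearn.decomposition import NMF

rng = np.random.default_rng(0)
n_train, n_feats, n_actions, k = 40, 20, 3, 5

X_acoustic = rng.random((n_train, n_feats))          # non-negative acoustic features
labels = rng.integers(0, n_actions, n_train)
Y = np.eye(n_actions)[labels]                         # one-hot action labels

model = NMF(n_components=k, max_iter=500, random_state=0)
W = model.fit_transform(np.hstack([X_acoustic, Y]))   # joint factorisation
H_acoustic = model.components_[:, :n_feats]           # acoustic dictionary block
H_label = model.components_[:, n_feats:]              # label dictionary block

def predict_action(x_acoustic: np.ndarray) -> int:
    w, _ = nnls(H_acoustic.T, x_acoustic)             # activations from audio only
    return int(np.argmax(w @ H_label))                # reconstruct the label block

# With random toy data, agreement with the true label is not guaranteed.
print(predict_action(X_acoustic[0]), int(labels[0]))
```
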
Chapter 23. Topic Classification of Spoken Inquiries Using Transductive Support Vector Machine

In this work, we address the topic classification of spoken inquiries in Japanese received by a guidance system operating in a real environment, using a semi-supervised learning approach based on a transductive support vector machine (TSVM). Manual data labeling, which is required for supervised learning, is a costly process, whereas unlabeled data are usually abundant and cheap to obtain. A TSVM makes it possible to learn from partially labeled data, including both labeled and unlabeled samples in the training set. We are interested in evaluating the influence of including unlabeled samples in the training of the topic classification models, as well as how many of them are necessary to improve performance. Experimental results show that this approach can be useful for taking advantage of unlabeled samples, especially when using larger unlabeled datasets. In particular, we found gains in classification performance for specific topics, such as city information, with a 6.30% F-measure improvement in the case of children's inquiries and 7.63% for access information in the case of adults' inquiries.

Rafael Torres, Hiromichi Kawanami, Tomoko Matsui, Hiroshi Saruwatari, Kiyohiro Shikano
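
scikit-learn does not ship a transductive SVM, so the sketch below substitutes self-training around a probabilistic SVM to show how unlabeled inquiries (marked with -1) can enter training; it illustrates the semi-supervised setup rather than the TSVM used in the chapter, and the toy sentences are invented.

```python
# Semi-supervised topic classification sketch: self-training with an SVM as a
# rough stand-in for a transductive SVM. Unlabeled samples carry the label -1.
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.pipeline import make_pipeline
from sklearn.semi_supervised import SelfTrainingClassifier
from sklearn.svm import SVC

texts = [
    "where is the nearest station", "how do I get to the airport",
    "which bus goes downtown", "is there a taxi stand nearby",
    "directions to the main hall",                          # access information
    "tell me about the castle", "opening hours of the museum",
    "what festivals are held here", "history of the old town",
    "famous food in this city",                              # city information
    "bus to the city center", "events happening this weekend",  # unlabeled
]
labels = [0] * 5 + [1] * 5 + [-1] * 2

clf = make_pipeline(TfidfVectorizer(),
                    SelfTrainingClassifier(SVC(probability=True)))
clf.fit(texts, labels)
print(clf.predict(["which train goes to the station"]))
```
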
Chapter 24. Frame-Level Selective Decoding Using Native and Non-native Acoustic Models for Robust Speech Recognition to Native and Non-native Speech

This paper proposes a frame-level selective-decoding method that uses both native acoustic models (AMs) and non-native AMs in order to construct a speech recognition system that is robust to non-native as well as native speech. To this end, we use two kinds of well-trained AMs: (a) AMs trained with a large amount of native speech (native AMs) and (b) AMs trained with a large amount of non-native speech (non-native AMs). First, each speech feature vector is decoded using the native AMs and the non-native AMs in parallel, and the appropriate AMs are selected by comparing the likelihoods of the two. Then, the next M frames of speech feature vectors are decoded using the selected AMs, where M is a pre-defined parameter. The selection and decoding procedures are repeated until the end of the utterance is reached. Automatic speech recognition (ASR) experiments on English spoken by Korean speakers show that an ASR system employing the proposed method reduces the average word error rate (WER) by 16.6% and 41.3% for English spoken by Koreans and native English, respectively, compared to an ASR system employing an utterance-level selective-decoding method.

Yoo Rhee Oh, Hoon Chung, Jeom-ja Kang, Yun Keun Lee
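
The frame-level selection loop described above can be sketched as scoring a block of frames with both acoustic models, committing to the more likely one for the next M frames, and repeating until the utterance ends. The scoring callables below are hypothetical stand-ins for the real decoders.

```python
# Sketch of frame-level selective decoding: compare block likelihoods of the
# native and non-native acoustic models and decode each block with the winner.
from typing import Callable, Sequence

def selective_decode(frames: Sequence,
                     score_native: Callable[[object], float],
                     score_nonnative: Callable[[object], float],
                     M: int = 10) -> list[str]:
    """Return, per frame, which acoustic model was selected for decoding."""
    choices, i = [], 0
    while i < len(frames):
        block = frames[i:i + M]
        ll_native = sum(score_native(f) for f in block)
        ll_nonnative = sum(score_nonnative(f) for f in block)
        chosen = "native" if ll_native >= ll_nonnative else "non-native"
        choices.extend([chosen] * len(block))   # decode this block with the winner
        i += M
    return choices
```
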
Chapter 25. Analysis of Speech Under Stress and Cognitive Load in USAR Operations

This paper presents ongoing work on the analysis of speech under stress and cognitive load in speech recordings of Urban Search and Rescue (USAR) training operations. During the training operations, several team members communicate with other members in the field and with command control using only one radio channel. The types of stress encountered in the USAR domain, more specifically in human team communication, include both physical or psychological stress and cognitive task load: physical stress due to the real situation, and cognitive task load due to the tele-operation of robots and equipment. We were able to annotate and identify the acoustic correlates of these two types of stress in the recordings. Traditional prosody features and acoustic features extracted at the sub-band level proved robust for discriminating among the different types of stress and neutral data.

Marcela Charfuelan, Geert-Jan Kruijff

Dialog Management

Frontmatter
Chapter 26. Does Personality Matter? Expressive Generation for Dialogue Interaction

This paper summarizes our recent work on developing the technical capabilities needed to automatically generate dialogue utterances that express either a personality or the persona of a dramatic character. In previous work, we developed a personality-based generation engine, PERSONAGE, that produces dialogic restaurant recommendations that vary according to the speaker's personality. More recently we have been exploring three issues: (1) how to coordinate the verbal expression of personality or character with nonverbal expression through facial or body animation parameters; (2) whether we can express character models learned from film dialogue with the existing parameters of the PERSONAGE engine; and (3) whether we can show experimentally that expressive generation is useful in a range of tasks. Our long-term goal is to create off-the-shelf tools to support the creation of spoken dialogue agents with their own persona and personality, for a broad range of dialogue agents in task-oriented applications or in interactive stories and games.

Marilyn A. Walker, Jennifer Sawyer, Grace Lin, Sam Wing
Chapter 27. Application and Evaluation of a Conditioned Hidden Markov Model for Estimating Interaction Quality of Spoken Dialogue Systems

The interaction quality (IQ) metric has recently been introduced for measuring the quality of spoken dialogue systems (SDSs) at the exchange level. While previous work relied on support vector machines (SVMs), we evaluate a conditioned hidden Markov model (CHMM), which accounts for the sequential character of the data and, in contrast to a regular hidden Markov model (HMM), provides class probabilities. While the CHMM achieves an unweighted average recall (UAR) of 0.39, it is outperformed by a regular HMM with a UAR of 0.44 and an SVM with a UAR of 0.49, both trained and evaluated under the same conditions.

Stefan Ultes, Robert ElChab, Wolfgang Minker
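
The UAR figures quoted above are unweighted average recall, i.e. recall averaged over the classes without weighting by class frequency; in scikit-learn terms this is macro-averaged recall, as the short example below shows with made-up labels.

```python
# Unweighted average recall (UAR) = recall averaged over classes,
# i.e. macro-averaged recall in scikit-learn. Labels here are invented.
from sklearn.metrics import recall_score

y_true = [1, 1, 2, 2, 2, 3, 3, 4, 5, 5]   # e.g. interaction quality ratings 1-5
y_pred = [1, 2, 2, 2, 3, 3, 3, 4, 5, 4]
print(recall_score(y_true, y_pred, average="macro"))  # UAR over the 5 classes
```
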
Chapter 28. FLoReS: A Forward Looking, Reward Seeking, Dialogue Manager

We present FLoReS, a new information-state-based dialogue manager, making use of forward inference, local dialogue structure, and plan operators representing subdialogue structure. The aim is to support both advanced, flexible, mixed initiative interaction and efficient policy creation by domain experts. The dialogue manager has been used for two characters in the SimCoach project and is currently being used in several related projects. We present the design of the dialogue manager and preliminary comparative evaluation with a previous system that uses a more conventional state chart dialogue manager.

Fabrizio Morbini, David DeVault, Kenji Sagae, Jillian Gerten, Angela Nazarian, David Traum
Chapter 29. A Clustering Approach to Assess Real User Profiles in Spoken Dialogue Systems

Evaluation methodologies for spoken dialogue systems try to provide an efficient means of assessing the quality of the system and/or predicting user satisfaction. In order to do so, they must be carried out over a corpus of dialogues which contains as many prospective or real user types as possible. In this paper we present a clustering approach that provides insight into whether user profiles can be automatically detected from interaction parameters and overall quality predictions, offering a way of corroborating the most representative features for defining user profiles. We carried out different experiments over a corpus of 62 dialogues with the INSPIRE dialogue system, for which the clustering approach provided an efficient way of obtaining information about the suitability of distinguishing between different user groups in order to carry out a more meaningful evaluation of the system.

Zoraida Callejas, David Griol, Klaus-Peter Engelbrecht, Ramón López-Cózar
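
The clustering idea described above can be sketched as standardizing the per-dialogue interaction parameters and grouping them with k-means, then inspecting whether the clusters correspond to user groups. The feature columns and values below are invented.

```python
# Sketch: cluster dialogues by interaction parameters and inspect whether the
# clusters line up with user groups. Data is illustrative only.
import numpy as np
from sklearn.cluster import KMeans
from sklearn.preprocessing import StandardScaler

# Rows = dialogues; columns = [num_turns, task_success, mean_ASR_confidence]
X = np.array([[12, 1, 0.85], [30, 0, 0.55], [10, 1, 0.90],
              [28, 0, 0.50], [11, 1, 0.88], [33, 0, 0.48]])

labels = KMeans(n_clusters=2, n_init=10, random_state=0).fit_predict(
    StandardScaler().fit_transform(X))
print(labels)  # e.g. experienced vs. struggling users
```
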
Chapter 30. What Are They Achieving Through the Conversation? Modeling Guide–Tourist Dialogues by Extended Grounding Networks

In goal-oriented or task-oriented conversations, the participants have to share many things to achieve collaboration. The ideas of common ground, shared knowledge, and similar concepts are important for understanding this process of achievement. In this study, to model guide–tourist dialogues with such a grounding process in mind, we proposed the idea of extended grounding networks by introducing the concept of contribution topics, and applied it to data collected from dialogues between a human guide and tourists.

Etsuo Mizukami, Hideki Kashioka
Chapter 31. Co-adaptation in Spoken Dialogue Systems

Spoken dialogue systems are man-machine interfaces which use speech as the medium of interaction. In recent years, dialogue optimization using reinforcement learning has evolved into a state-of-the-art technique. The primary focus of research in the dialogue domain is to learn an optimal policy with regard to the task description (reward function) and the user simulation being employed. However, in human-human interaction, the parties involved in the conversation mutually evolve over the course of the interaction. This very ability of humans to co-adapt contributes largely to the naturalness of the dialogue. This paper outlines a novel framework for co-adaptation in spoken dialogue systems, where the dialogue manager and the user simulation evolve over a period of time and incrementally and mutually optimize their respective behaviors.

Senthilkumar Chandramohan, Matthieu Geist, Fabrice Lefèvre, Olivier Pietquin
Chapter 32. Developing Non-goal Dialog System Based on Examples of Drama Television

This paper presents the design of, and experiments with, a non-goal dialog system that utilizes human-to-human conversation examples from television dramas. The aim is to build a conversational agent that can interact with users in as natural a fashion as possible, while reducing the time required for database design and collection. A number of the challenging design issues we faced are described, including (1) filtering and constructing a dialog example database from the drama conversations and (2) retrieving a proper system response by finding the best dialog example for the current user query. A subjective evaluation from a small user study is also discussed.

Lasguido Nio, Sakriani Sakti, Graham Neubig, Tomoki Toda, Mirna Adriani, Satoshi Nakamura
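
Example-based response retrieval of the kind described above can be sketched with a TF-IDF index over the query side of the (query, response) pairs and a cosine-similarity lookup; the example pairs below are made up.

```python
# Sketch of example-based response retrieval: index the query side of the
# (query, response) pairs with TF-IDF and return the response paired with the
# nearest example. The example pairs shown are invented.
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity

examples = [("how was your day", "pretty long, but I'm glad it's over"),
            ("do you want some coffee", "yes please, no sugar"),
            ("where are you going", "just out for a walk")]

queries = [q for q, _ in examples]
vec = TfidfVectorizer().fit(queries)
index = vec.transform(queries)

def respond(user_utterance: str) -> str:
    """Return the response of the most similar stored example."""
    sims = cosine_similarity(vec.transform([user_utterance]), index)[0]
    return examples[sims.argmax()][1]

print(respond("how was your day today"))
```
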
Chapter 33. A User Model for Dialog System Evaluation Based on Activation of Subgoals

User models have become increasingly popular for conducting simulation-based testing of spoken dialog systems. These models usually describe users' overt behavior, as opposed to the underlying reasons for the observed actions. While such models are useful for generating test data, a causal model might be more generally applicable to different systems and, in addition, allows useful information to be derived for data analysis and the prediction of user judgments. Thus, a modeling approach that tries to explain user behavior is proposed in this paper, based on Dörner's PSI theory. The evaluation shows that the utterances generated by this model are similar to those of real users.

Klaus-Peter Engelbrecht
Chapter 34. Real-Time Feedback System for Monitoring and Facilitating Discussions

In this chapter, we present a system that provides real-time feedback about an ongoing discussion. Various speech statistics, such as speaking length, speaker turns and speaking turn duration, are computed and displayed in real time. In social monitoring, such statistics have been used to interpret and deduce people's talking mannerisms and to gain insight into human social characteristics and behaviour. However, such analysis is usually conducted offline, after the discussion has ended. In contrast, our system analyses the speakers and provides feedback to them in real time during the discussion, which is a novel approach with plenty of potential applications. The proposed system consists of portable, easy-to-use equipment for recording the conversations. A user-friendly graphical user interface displays statistics about the ongoing discussion, and customized individual feedback can be provided to participants during the conversation. Such a closed-loop design may help individuals contribute effectively to the group discussion, potentially leading to more productive and perhaps shorter meetings. Here we present preliminary results on two-person face-to-face discussions. In the longer term, our system may prove useful, e.g., for coaching purposes and for facilitating business meetings.

Sanat Sarda, Martin Constable, Justin Dauwels, Shoko Dauwels (Okutsu), Mohamed Elgendi, Zhou Mengyu, Umer Rasheed, Yasir Tahir, Daniel Thalmann, Nadia Magnenat-Thalmann
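
The statistics the feedback display is built on (speaking length, number of turns, mean turn duration) can be computed from a running list of diarised segments, as in the sketch below; the segment data is invented.

```python
# Sketch of the per-speaker statistics a real-time feedback display could show,
# computed from (speaker, start, end) segments. Segment data is illustrative.
from collections import defaultdict

segments = [("A", 0.0, 4.2), ("B", 4.5, 6.1), ("A", 6.3, 12.0), ("B", 12.2, 13.0)]

totals, turns = defaultdict(float), defaultdict(int)
for speaker, start, end in segments:
    totals[speaker] += end - start   # total speaking time
    turns[speaker] += 1              # number of speaking turns

for speaker in totals:
    print(f"{speaker}: {totals[speaker]:.1f}s over {turns[speaker]} turns, "
          f"mean turn {totals[speaker] / turns[speaker]:.1f}s")
```
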
Chapter 35. Evaluation of Invalid Input Discrimination Using Bag-of-Words for Speech-Oriented Guidance System

We investigate a method for discriminating between invalid and valid inputs received by a speech-oriented guidance system operating in a real environment. Invalid inputs include background voices, which are not directly uttered to the system, and nonsense utterances; such inputs should be rejected beforehand. We have previously reported methods that use not only the likelihood values of Gaussian mixture models (GMMs) but also other information in the inputs, such as bag-of-words features, utterance duration, and signal-to-noise ratio, to discriminate invalid inputs from valid ones. To handle these multiple sources of information, we used a support vector machine (SVM) with a radial basis function kernel and a maximum entropy (ME) method, and compared their performance. In this paper, we compare the performance while varying the amount of training data. In the experiments, we achieve an F-measure of 87.01% for the SVM and 83.73% for ME using 3,000 training samples, while the F-measure of the GMM-based baseline method is 81.73%.

Haruka Majima, Rafael Torres, Hiromichi Kawanami, Sunao Hara, Tomoko Matsui, Hiroshi Saruwatari, Kiyohiro Shikano
Metadata
Title
Natural Interaction with Robots, Knowbots and Smartphones
Edited by
Joseph Mariani
Sophie Rosset
Martine Garnier-Rizet
Laurence Devillers
Copyright year
2014
Publisher
Springer New York
Electronic ISBN
978-1-4614-8280-2
Print ISBN
978-1-4614-8279-6
DOI
https://doi.org/10.1007/978-1-4614-8280-2
