Long Papers

Impact of a Newly Developed Modern Standard Arabic Speech Corpus on Implementing and Evaluating Automatic Continuous Speech Recognition Systems

Being the current formal linguistic standard and the only form of the Arabic language accepted by all native speakers, Modern Standard Arabic (MSA) still lacks sufficient spoken corpora compared to other forms such as Dialectal Arabic. This paper describes our work towards developing a new speech corpus for MSA, which can be used for implementing and evaluating any Arabic automatic continuous speech recognition system. The speech corpus contains 415 sentences (367 training and 48 testing) recorded by 42 Arabic native speakers (21 male and 21 female) from 11 countries representing three major regions (Levant, Gulf, and Africa). The impact of using this speech corpus on the overall performance of Arabic automatic continuous speech recognition systems was examined. Two development phases were conducted, varying the size of the training data, the number of Gaussian mixture distributions, and the number of tied states (senones). Overall results indicate that a larger training data size results in higher word recognition rates and lower Word Error Rates (WER).
Mohammad A. M. Abushariah, Raja N. Ainon, Roziati Zainuddin, Bassam A. Al-Qatab, Assal A. M. Alqudah
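The Word Error Rate reported above is the standard edit-distance-based metric; a minimal self-contained sketch (not the authors' evaluation code) of how it is computed:

```python
# Word Error Rate: Levenshtein distance between reference and hypothesis
# word sequences, normalized by the reference length.
def wer(reference, hypothesis):
    r, h = reference.split(), hypothesis.split()
    # d[i][j] = edit distance between r[:i] and h[:j]
    d = [[0] * (len(h) + 1) for _ in range(len(r) + 1)]
    for i in range(len(r) + 1):
        d[i][0] = i
    for j in range(len(h) + 1):
        d[0][j] = j
    for i in range(1, len(r) + 1):
        for j in range(1, len(h) + 1):
            sub = d[i - 1][j - 1] + (r[i - 1] != h[j - 1])
            d[i][j] = min(sub, d[i - 1][j] + 1, d[i][j - 1] + 1)
    return d[len(r)][len(h)] / len(r)

print(wer("the cat sat", "the cat sat"))  # 0.0
print(wer("the cat sat", "a cat"))        # one substitution + one deletion: 2/3
```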

User and Noise Adaptive Dialogue Management Using Hybrid System Actions

In recent years reinforcement-learning-based approaches have been widely used for policy optimization in spoken dialogue systems (SDS). A dialogue management policy is a mapping from dialogue states to system actions, i.e., given the state of the dialogue, the dialogue policy determines the next action to be performed by the dialogue manager. So far, policy optimization has primarily focused on mapping the dialogue state to simple system actions (such as confirming or asking for one piece of information), and the possibility of using complex system actions (such as confirming or asking for several slots at the same time) has not been well investigated. In this paper we explore the possibilities of using complex (or hybrid) system actions for dialogue management and then discuss the impact of user experience and channel noise on complex action selection. Our experimental results obtained using simulated users reveal that user and noise adaptive hybrid action selection can perform better than dialogue policies which can only perform simple actions.
Senthilkumar Chandramohan, Olivier Pietquin

Detection of Unknown Speakers in an Unsupervised Speech Controlled System

In this paper we investigate the capability of our self-learning speech controlled system, comprising speech recognition, speaker identification, and speaker adaptation, to detect unknown users. Our goal is to enhance automated speech controlled systems by an unsupervised personalization of the human-computer interface. New users should be allowed to use a speech controlled device without the need to identify themselves or to undergo a time-consuming enrollment. Instead, the system should detect new users during the operation of the device. New speaker profiles should be initialized and incrementally adjusted without any additional intervention of the user. Such a personalization of human-computer interfaces represents an important research issue. As an example, in-car applications such as speech controlled navigation, hands-free telephony, and infotainment systems are investigated. Results for detecting unknown speakers are presented for a subset of the SPEECON database.
Tobias Herbig, Franz Gerl, Wolfgang Minker

Evaluation of Two Approaches for Speaker Specific Speech Recognition

In this paper we examine two approaches for the automatic personalization of speech controlled systems. Speech recognition may be significantly improved by continuous speaker adaptation if the speaker can be reliably tracked. We evaluate two approaches for speaker identification suitable to identify 5-10 recurring users even in adverse environments. Only a very limited amount of speaker specific data can be used for training. A standard speaker identification approach is extended by speaker specific speech recognition. Multiple recognitions of speaker identity and spoken text are avoided to reduce latencies and computational complexity. In comparison, the speech recognizer itself is used to decode spoken phrases and to identify the current speaker in a single step. The latter approach is advantageous for applications which have to be performed on embedded devices, e.g. speech controlled navigation in automobiles. Both approaches were evaluated on a subset of the SPEECON database which represents realistic command and control scenarios for in-car applications.
Tobias Herbig, Franz Gerl, Wolfgang Minker

Issues in Predicting User Satisfaction Transitions in Dialogues: Individual Differences, Evaluation Criteria, and Prediction Models

This paper addresses three important issues in automatic prediction of user satisfaction transitions in dialogues. The first issue concerns the individual differences in user satisfaction ratings and how they affect the possibility of creating a user-independent prediction model. The second issue concerns how to determine appropriate evaluation criteria for predicting user satisfaction transitions. The third issue concerns how to train suitable prediction models. We present our findings for these issues on the basis of the experimental results using dialogue data in two domains.
Ryuichiro Higashinaka, Yasuhiro Minami, Kohji Dohsaka, Toyomi Meguro

Expansion of WFST-Based Dialog Management for Handling Multiple ASR Hypotheses

We previously proposed a weighted finite-state transducer-based dialog manager (WFSTDM), a platform for expandable and adaptable dialog systems. In this platform, all rules and/or models for dialog management (DM) are expressed in WFST form, and the WFSTs are used to accomplish various tasks via multiple modalities. With this framework, we constructed a statistical dialog system using user concept and system action tags, acquired from an annotated corpus of human-to-human spoken dialogs, as input and output labels of the WFST. We introduced a spoken language understanding (SLU) WFST for converting user utterances to user concept tags, a dialog scenario WFST for converting user concept tags to system action tags, and a sentence generation (SG) WFST for converting system action tags to system utterances. The tag sequence probabilities of the dialog scenario WFST were estimated using a spoken dialog corpus for hotel reservations. The SLU, scenario, and SG WFSTs were then composed into a dialog management WFST which determines the next system action in response to the user input. In our previous research, we evaluated the dialog strategy with reference to manual transcriptions. In this paper, we present the performance of the WFSTDM when speech recognition hypotheses are input. To alleviate the degradation of DM performance caused by speech recognition errors, we expand the WFSTDM to handle multiple speech recognition hypotheses and confidence scores, which indicate the acoustic and linguistic reliability of the speech recognition. We also evaluated the accuracy of the SLU results and the correctness of the system actions selected by the dialog management WFST. We confirmed that the performance of dialog management was enhanced by choosing the optimal action among all WFST paths for the multiple hypotheses (N-best) of speech recognition in consideration of the confidence scores.
Naoto Kimura, Chiori Hori, Teruhisa Misu, Kiyonori Ohtake, Hisashi Kawai, Satoshi Nakamura
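The idea of choosing an action over N-best hypotheses weighted by confidence can be illustrated with a minimal sketch. This is not the authors' WFST implementation: the tag names, the SLU and scenario tables, and all scores below are invented for illustration, and the transducers are reduced to plain lookup tables.

```python
# SLU stand-in: map a recognized utterance to a user concept tag with a score.
SLU = {
    "i want a single room": ("req_room_single", 0.9),
    "i want a simple room": ("req_room_simple", 0.4),
}

# Dialog scenario stand-in: map a user concept tag to a system action tag
# with a transition probability (hand-set here; the paper estimates these
# from a hotel reservation corpus).
SCENARIO = {
    "req_room_single": ("confirm_single_room", 0.8),
    "req_room_simple": ("ask_room_type", 0.5),
}

def best_action(nbest):
    """nbest: list of (hypothesis, asr_confidence).  Return the action whose
    combined ASR * SLU * scenario score is highest over all hypotheses."""
    scored = []
    for hyp, conf in nbest:
        if hyp not in SLU:
            continue
        concept, slu_score = SLU[hyp]
        action, trans_prob = SCENARIO[concept]
        scored.append((conf * slu_score * trans_prob, action))
    return max(scored)[1] if scored else "ask_repeat"

# The top ASR hypothesis is a misrecognition; the combined score lets the
# second-best hypothesis win.
nbest = [("i want a simple room", 0.6), ("i want a single room", 0.5)]
print(best_action(nbest))  # -> "confirm_single_room"
```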

Evaluation of Facial Direction Estimation from Cameras for Multi-modal Spoken Dialog System

This paper presents the results of an evaluation of image-processing techniques for estimating facial direction from a camera for a multi-modal spoken dialog system on a large display panel. The system is called the “proactive dialog system” and aims to present acceptable information in an acceptable time. It can detect non-verbal information, such as changes in gaze and facial direction as well as head gestures of the user during dialog, and recommend suitable information. We implemented a dialog scenario to present sightseeing information on the system. Experiments consisting of 100 sessions with 80 subjects were conducted to evaluate the system’s efficiency. The system’s benefit becomes particularly clear when the dialog contains recommendations.
Akihiro Kobayashi, Kentaro Kayama, Etsuo Mizukami, Teruhisa Misu, Hideki Kashioka, Hisashi Kawai, Satoshi Nakamura

D3 Toolkit: A Development Toolkit for Daydreaming Spoken Dialog Systems

Recently, various data-driven spoken language technologies have been applied to spoken dialog system development. However, the high cost of maintaining spoken dialog systems is one of the biggest challenges. In addition, a fixed corpus collected by humans is never enough to cover the diversity of real users’ utterances. The concept of a daydreaming dialog system can solve this problem by making the system learn from previous human-machine dialogs. This paper introduces the D3 (Daydreaming Dialog system Development) toolkit, a back-end support toolkit for the development of daydreaming spoken dialog systems. To reduce human effort, the D3 toolkit generates new utterances with semantic annotation and new knowledge by analyzing the usage log file. The newly added corpus is determined by verifying proper candidates using semi-automatic methods. The augmented corpus is used for building improved models, and self-evolution of the dialog system is possible by replacing the old models. We implemented the D3 toolkit using web-based technologies to provide a familiar environment to non-expert end-users.
Donghyeon Lee, Kyungduk Kim, Cheongjae Lee, Junhwi Choi, Gary Geunbae Lee

New Technique to Enhance the Performance of Spoken Dialogue Systems by Means of Implicit Recovery of ASR Errors

This paper proposes a new technique to implicitly correct some ASR errors made by spoken dialogue systems, which is implemented at two levels: statistical and linguistic. The goal of the former level is to correct errors using knowledge extracted from the analysis of a training corpus comprised of utterances and their corresponding ASR results. The outcome of the analysis is a set of syntactic-semantic models and a set of lexical models, which are optimally selected during the correction. The goal of the correction at the linguistic level is to repair errors not detected at the statistical level which affect the semantics of the sentences. Experiments carried out with a previously-developed spoken dialogue system for the fast food domain indicate that the technique enhances word accuracy, spoken language understanding, and task completion by 8.5%, 16.54%, and 44.17% absolute, respectively.
Ramón López-Cózar, David Griol, José F. Quesada
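The lexical-model part of such corpus-based correction can be illustrated with a minimal sketch. This is not the authors' statistical or linguistic models: the word pairs below are invented, standing in for confusions that would be learned from aligned (transcript, ASR output) pairs in a training corpus.

```python
# Corpus-derived lexical correction model: words the ASR is frequently
# observed to confuse in the fast food domain, mapped to their corrections.
# Hand-set here; in practice learned from utterance/ASR-result alignments.
LEXICAL_MODEL = {"prize": "fries", "coat": "coke"}

def correct(asr_words):
    """Replace each word the model flags as a likely misrecognition."""
    return [LEXICAL_MODEL.get(w, w) for w in asr_words]

print(correct("large prize and a coat".split()))
# -> ['large', 'fries', 'and', 'a', 'coke']
```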

Simulation of the Grounding Process in Spoken Dialog Systems with Bayesian Networks

User simulation has become an important trend of research in the field of spoken dialog systems, because collecting and annotating real man-machine interactions with users is often expensive and time-consuming. Yet such data are generally required for designing and assessing efficient dialog systems. The general problem of user simulation is thus to produce as many natural, varied, and consistent interactions as necessary from as little data as possible. In this paper, a user simulation method based on Bayesian Networks (BN) is proposed that is able to produce interactions that are consistent in terms of user goal and dialog history, and also to simulate the grounding process that often appears in human-human interactions. The BN is trained on a database of 1234 human-machine dialogs in the TownInfo domain (a tourist information application). Experiments with a state-of-the-art dialog system (REALL-DUDE/DIPPER/OAA) have been carried out and promising results are presented.
Stéphane Rossignol, Olivier Pietquin, Michel Ianotto
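The shape of such a simulator can be illustrated with a minimal sketch: sample a user act conditioned on the system act, with grounding moves (confirmations, repairs) among the outcomes. This is not the paper's trained model: the act names and conditional probability tables below are hand-set assumptions, whereas the paper learns them from the 1234 TownInfo dialogs.

```python
import random

# P(user_act | system_act): the simulated user may answer, or produce a
# grounding move (confirm/repair) as observed in human-human interaction.
CPT_USER_ACT = {
    "ask_area":     {"inform_area": 0.7, "ground_confirm": 0.2, "silent": 0.1},
    "confirm_area": {"ground_confirm": 0.6, "ground_repair": 0.3, "silent": 0.1},
}

def sample_user_act(system_act, rng):
    """Draw one user act from the conditional distribution by inverse CDF."""
    dist = CPT_USER_ACT[system_act]
    r, acc = rng.random(), 0.0
    for act, p in dist.items():
        acc += p
        if r < acc:
            return act
    return act  # fall through on floating-point rounding

rng = random.Random(0)  # seeded for reproducibility
dialog = [sample_user_act("ask_area", rng), sample_user_act("confirm_area", rng)]
print(dialog)
```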

Facing Reality: Simulating Deployment of Anger Recognition in IVR Systems

With the availability of real-life corpora, studies dealing with speech-based emotion recognition have turned towards the recognition of angry users at the turn level. Based on acoustic, linguistic, and sometimes contextual features, classifiers yield performance of 0.7-0.8 F-score when classifying angry vs. non-angry user turns. The effect of deploying anger classifiers in real systems remains an open question and has not been examined so far. Is the current performance of anger detection already adequate for a change in dialogue strategy, or even an escalation to an operator? In this study we explore the impact, on specific dialogues, of an anger classifier published in a previous study. We introduce a cost-sensitive classifier that significantly reduces the number of misclassified non-angry user turns.
Alexander Schmitt, Tim Polzehl, Wolfgang Minker
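One common way to make a classifier cost-sensitive is to shift the decision threshold so that the expensive error (falsely escalating a non-angry caller) is penalized more heavily. This is a generic sketch of that idea, not the authors' classifier; the cost values and posteriors below are illustrative assumptions.

```python
# Misclassification costs (hand-set for illustration): a false alarm
# (non-angry turn flagged as angry) is taken to be 5x worse than a miss.
COST_FALSE_ALARM = 5.0
COST_MISS = 1.0

def decide(p_angry):
    """Flag 'angry' only when its expected cost is lower, i.e. when
    (1 - p) * C_fa < p * C_miss, equivalently p > C_fa / (C_fa + C_miss)."""
    threshold = COST_FALSE_ALARM / (COST_FALSE_ALARM + COST_MISS)
    return "angry" if p_angry > threshold else "non-angry"

print(decide(0.75))  # below threshold 5/6 -> "non-angry"
print(decide(0.90))  # above threshold    -> "angry"
```

With equal costs the threshold would be 0.5; raising the false-alarm cost pushes it up, trading recall on angry turns for fewer wrongly escalated callers.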

A Discourse and Dialogue Infrastructure for Industrial Dissemination

We argue that modern speech dialogue systems need a prior usability analysis to identify the requirements for industrial applications. In addition, work from the area of the Semantic Web should be integrated. These requirements can then be met by multimodal semantic processing, semantic navigation, interactive semantic mediation, user adaptation/personalisation, interactive service composition, and semantic output representation, which we explain in this paper. We also describe the discourse and dialogue infrastructure these components form, and provide two examples of disseminated industrial prototypes.
Daniel Sonntag, Norbert Reithinger, Gerd Herzog, Tilman Becker

Short Papers

Impact of Semantic Web on the Development of Spoken Dialogue Systems

We examined several possible uses of semantic Web technologies in developing spoken dialogue systems. We report that three advanced features of the semantic Web, namely, collective intelligence, smooth integration of knowledge, and automatic inference, have a large impact on various stages in the development of spoken dialogue systems, such as language modeling, semantic analysis, dialogue management, sentence generation, and user modeling. As an example, we implemented a query generation method for semantic search based on semantic Web technology.
Masahiro Araki, Yu Funakura

A User Model to Predict User Satisfaction with Spoken Dialog Systems

In order to predict the interactions of users with spoken dialog systems and their ratings of the interaction, we propose to model basic needs of the user which impact her emotional state. In defining the model we follow the PSI theory by Dörner [1] and identify Competence and Certainty as relevant needs in this context. By analyzing questionnaires we show that such needs impact the user's overall opinion of the system. Furthermore, relations to interaction parameters are analyzed.
Klaus-Peter Engelbrecht, Sebastian Möller

Sequence-Based Pronunciation Modeling Using a Noisy-Channel Approach

Previous approaches to spontaneous speech recognition address the multiple-pronunciation problem by modeling the alteration of the pronunciation at the phoneme-to-phoneme level. However, the phonetic transformation effects induced by the pronunciation of the whole sentence have not yet been considered. In this paper we attempt to model sequence-based pronunciation variation using a noisy-channel approach, where the spontaneous phoneme sequence is considered a “noisy” string and the goal is to recover the “clean” string of the word sequence. In this way, the whole word sequence and its effect on the alteration of the phonemes are taken into consideration. Moreover, the system learns not only the phoneme transformation but also the mapping from phonemes to words directly. In this preliminary study, the phonemes are first recognized with the present recognition system, and afterwards the pronunciation variation model based on the noisy-channel approach maps from the phoneme to the word level. Our experiments use Switchboard as the spontaneous speech corpus. The results show that the proposed method consistently improves word accuracy over the conventional recognition system. The best system achieves up to 38.9% relative improvement over the baseline speech recognition.
Hansjörg Hofmann, Sakriani Sakti, Ryosuke Isotani, Hisashi Kawai, Satoshi Nakamura, Wolfgang Minker
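The noisy-channel decoding step, recovering the most likely word sequence W from an observed phoneme string P via argmax over P(P | W) * P(W), can be illustrated with a minimal sketch. This is not the paper's trained model: the tiny lexicon, channel probabilities, and language-model priors below are invented for illustration.

```python
# Channel model P(observed_phonemes | words): how canonical pronunciations
# get altered in spontaneous speech (e.g. "going to" realized as "gonna").
CHANNEL = {
    "going to": {"g ow ih ng t uw": 0.3, "g ah n ax": 0.5},
    "gun a":    {"g ah n ax": 0.2},
}
# Language-model prior P(words).
PRIOR = {"going to": 0.8, "gun a": 0.2}

def decode(phonemes):
    """Return argmax over word sequences of P(phonemes | words) * P(words)."""
    best, best_score = None, 0.0
    for words, emissions in CHANNEL.items():
        score = emissions.get(phonemes, 0.0) * PRIOR[words]
        if score > best_score:
            best, best_score = words, score
    return best

# The reduced phoneme string is mapped back to the intended word sequence
# because the channel and prior both favor "going to".
print(decode("g ah n ax"))  # -> "going to"
```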

Rational Communication and Affordable Natural Language Interaction for Ambient Environments

This paper discusses rational interaction as a methodology for designing and implementing dialogue management in ambient environments. It is assumed that natural (multimodal) language communication is the most intuitive way of interaction, and most suitable when the interlocutors are involved in open-ended activities that concern negotiations and planning. The paper discusses aspects that support this hypothesis by focussing especially on how interlocutors build shared context through natural language, and create social bonds through affective communication. Following the design guidelines for interactive artefacts, it is proposed that natural language provides human-computer systems with an interface which is affordable: it readily suggests the appropriate ways to use the interface.
Kristiina Jokinen

Construction and Experiment of a Spoken Consulting Dialogue System

This paper addresses a spoken dialogue framework that helps users make decisions. Various decision criteria are involved when we select an alternative from a given set of alternatives. When adopting a spoken dialogue interface, users have little idea of the kinds of criteria that the system can handle. We thus consider a recommendation function that proactively presents information that the user would be interested in. We implemented a sightseeing guidance system with a recommendation function and conducted a user experiment. We provide an initial analysis of the framework in terms of the system prompt and users’ behavior, as well as in terms of the user’s behavior and his/her knowledge.
Teruhisa Misu, Chiori Hori, Kiyonori Ohtake, Hideki Kashioka, Hisashi Kawai, Satoshi Nakamura

A Study Toward an Evaluation Method for Spoken Dialogue Systems Considering User Criteria

In the development cycle of a spoken dialogue system (SDS), it is important to know how users actually behave and talk and what they expect of the SDS. We are developing SDSs which realize natural communication between users and systems. To collect users’ real data, a wide-scale experiment was carried out with a smart-phone prototype SDS. In this brief paper, we report on the experiment’s results and make a tentative analysis of cases in which there were gaps between system performance and user judgment. This requires both an adequate experimental design and an evaluation methodology that considers users’ judgement criteria.
Etsuo Mizukami, Hideki Kashioka, Hisashi Kawai, Satoshi Nakamura

A Classifier-Based Approach to Supporting the Augmentation of the Question-Answer Database for Spoken Dialogue Systems

Dealing with a variety of user questions in question-answer spoken dialogue systems requires preparing as many question-answer patterns as possible. This paper proposes a method for supporting the augmentation of the question-answer database. It uses user questions collected with an initial question-answer system, and detects questions that need to be added to the database. It uses two language models: one built from the database, and the other a large-vocabulary domain-independent model. Experimental results suggest the proposed method is effective in reducing the amount of effort for augmenting the database when compared to a baseline method that uses only the initial database.
Hiromi Narimatsu, Mikio Nakano, Kotaro Funakoshi
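The two-language-model idea can be illustrated with a minimal sketch: a collected question that scores much better under the large general model than under the database model is poorly covered by the database and is a candidate for addition. This is not the authors' method in detail — the paper's models are not specified here, so unigram models, the smoothing floor, and the margin below are all illustrative assumptions.

```python
import math

# Toy unigram LMs.  DB_LM is built from the question-answer database;
# GENERAL_LM stands in for the large-vocabulary domain-independent model.
DB_LM      = {"where": 0.2, "is": 0.2, "the": 0.2, "library": 0.2}
GENERAL_LM = {"where": 0.1, "is": 0.1, "the": 0.1, "library": 0.05,
              "can": 0.1, "i": 0.1, "park": 0.05}
FLOOR = 1e-4  # probability assigned to unseen words

def logprob(words, lm):
    return sum(math.log(lm.get(w, FLOOR)) for w in words)

def needs_adding(question, margin=1.0):
    """Flag the question when the general LM explains it much better than
    the database LM, i.e. the database does not cover it."""
    words = question.split()
    return logprob(words, GENERAL_LM) - logprob(words, DB_LM) > margin

print(needs_adding("where is the library"))  # covered by the database -> False
print(needs_adding("where can i park"))      # out-of-database -> True
```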

The Influence of the Usage Mode on Subjectively Perceived Quality

The current paper presents an evaluation study of a multimodal mobile entertainment system. The aim of the study was to investigate the effect of the usage mode (explorative vs. task-oriented) on perceived quality. In one condition the participants were asked to perform specific tasks (task-oriented mode), and in the other to do “whatever they want to do with the device”. It was shown that the explorative test setting results in better ratings than the task-oriented one.
Ina Wechsung, Anja Naumann, Sebastian Möller

Demo Papers

Sightseeing Guidance Systems Based on WFST-Based Dialogue Manager

We are developing a spoken dialogue system that helps users through spontaneous interactions in the sightseeing guidance domain. The systems are constructed on our framework of a weighted finite-state transducer (WFST) based dialogue manager. The demos are our prototype spoken dialogue systems for Kyoto tourist information assistance.
Teruhisa Misu, Chiori Hori, Kiyonori Ohtake, Etsuo Mizukami, Akihiro Kobayashi, Kentaro Kayama, Tetsuya Fujii, Hideki Kashioka, Hisashi Kawai, Satoshi Nakamura

Spoken Dialogue System Based on Information Extraction from Web Text

We present a novel spoken dialogue system which uses the up-to-date information on the web. It is based on information extraction which is defined by the predicate-argument (P-A) structure and realized by shallow parsing. Based on the information structure, the dialogue system can perform question answering and also proactive information presentation using the dialogue context and a topic model.
Koichiro Yoshino, Tatsuya Kawahara
