This book compiles and presents a synopsis of current global research efforts to push forward the state of the art in dialogue technologies, including advances in language and context understanding and dialogue management, as well as human–robot interaction, conversational agents, question answering and lifelong learning for dialogue systems.

### Skip Act Vectors: Integrating Dialogue Context into Sentence Embeddings

This paper compares several approaches for computing dialogue turn embeddings and evaluates their representation capacities in two dialogue act related tasks within a hierarchical Recurrent Neural Network architecture. These turn embeddings can be produced explicitly or implicitly by extracting the hidden layer of a model trained for a given task. We introduce skip-act, a new dialogue turn embedding approach in which the embeddings are extracted as the common representation layer of a multi-task model that predicts both the previous and the next dialogue act. The models used to learn turn embeddings are trained on a large dialogue corpus with light supervision, while the models used to predict dialogue acts from turn embeddings are trained on a sub-corpus with gold dialogue act annotations. We compare their performance in predicting the current dialogue act as well as their ability to predict the next dialogue act, a more challenging task with several potential applications. With a better context representation, the skip-act turn embeddings are shown to outperform previous approaches both in terms of overall F-measure and in terms of macro-F1, showing regular improvements on the various dialogue acts.

Jeremy Auguste, Frédéric Béchet, Géraldine Damnati, Delphine Charlet
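
The abstract above reports both an overall F-measure and a macro-F1. As a minimal sketch of how macro-F1 is commonly computed (an unweighted mean of per-class F1 scores, so rare dialogue acts count as much as frequent ones), the following uses hypothetical function names and toy labels that are not taken from the paper:

```python
def f1_per_class(gold, pred):
    """Return a dict mapping each label to its F1 score."""
    labels = sorted(set(gold) | set(pred))
    scores = {}
    for label in labels:
        tp = sum(1 for g, p in zip(gold, pred) if g == label and p == label)
        fp = sum(1 for g, p in zip(gold, pred) if g != label and p == label)
        fn = sum(1 for g, p in zip(gold, pred) if g == label and p != label)
        # F1 = 2*TP / (2*TP + FP + FN); defined as 0 when the class never occurs.
        denom = 2 * tp + fp + fn
        scores[label] = 2 * tp / denom if denom else 0.0
    return scores


def macro_f1(gold, pred):
    """Unweighted average of the per-class F1 scores."""
    scores = f1_per_class(gold, pred)
    return sum(scores.values()) / len(scores)


# Toy dialogue act labels (hypothetical): question, statement, backchannel.
gold = ["q", "q", "s", "s", "s", "b"]
pred = ["q", "s", "s", "s", "s", "s"]
print(f1_per_class(gold, pred))
print(macro_f1(gold, pred))
```

Because every class has equal weight, a model that ignores a rare act (here `"b"`) is penalized more heavily under macro-F1 than under a frequency-weighted score.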

### End-to-end Modeling for Selection of Utterance Constructional Units via System Internal States

In order to make conversational agents or robots exhibit human-like behaviors, it is important to design a model of the system's internal states. In this paper, we address a model of the system's favorable impression of its dialogue partner. The favorable impression is modeled to change according to the user’s dialogue behaviors and also to affect the subsequent dialogue behaviors of the system, specifically the selection of utterance constructional units. For this modeling, we propose a hierarchical structure of logistic regression models. First, from the user’s dialogue behaviors, the model estimates the level of the user’s favorable impression of the system and also the level of the user’s interest in the current topic. Then, based on these results, the model predicts the system’s favorable impression of the user. Finally, the model determines the selection of utterance constructional units in the next system turn. We train each of the logistic regression models individually with a small amount of annotated data of favorable impression. Afterward, the entire multi-layer network is fine-tuned with a larger amount of dialogue behavior data. An experimental result shows that the proposed method achieves higher accuracy on the selection of the utterance constructional units, compared with methods that do not take into account the system internal states.

Koki Tanaka, Koji Inoue, Shizuka Nakamura, Katsuya Takanashi, Tatsuya Kawahara

### Context Aware Dialog Management with Unsupervised Ranking

We propose MoveRank, a novel hybrid approach to dialog management that uses a knowledge graph domain structure designed by a domain-expert. The domain encoder converts a symbolic output of the NLU into a vector representation. MoveRank uses an unsupervised similarity measure to obtain the optimal dialog state modifications in a given context. Using a 1K utterance dataset automatically constructed with template expansion from a small set of annotated human-human dialogs, we show that the proposed unsupervised ranking approach produces the correct result on the gold labeled input without spelling variations. Using an encoding method designed to handle spelling variations, MoveRank is correct with $\mathrm{F}_1 = 0.86$ with the complete set of labels (including intent, entity, and item) and $\mathrm{F}_1 = 0.82$ with only the intent labels.

### Predicting Laughter Relevance Spaces in Dialogue

In this paper we address the task of predicting spaces in interaction where laughter can occur. We introduce the new task of predicting actual laughs in dialogue and address it with various deep learning models, namely recurrent neural networks (RNNs), convolutional neural networks (CNNs) and combinations of these. We also attempt to evaluate human performance for this task via an Amazon Mechanical Turk (AMT) experiment. The main finding of the present work is that deep learning models outperform untrained humans in this task.

Vladislav Maraev, Christine Howes, Jean-Philippe Bernardy

### Transfer Learning for Unseen Slots in End-to-End Dialogue State Tracking

This paper proposes a transfer learning algorithm for end-to-end dialogue state tracking (DST) to handle new slots with a small set of training data, which has not yet been discussed in the literature on conventional approaches. The goal of transfer learning is to improve DST performance for new slots by leveraging slot-independent parameters extracted from DST models for existing slots. An end-to-end DST model is composed of a spoken language understanding module and an update module. We assume that parameters of the update module can be slot-independent. To make the parameters slot-independent, a DST model for each existing slot is trained by sharing the parameters of the update module across all existing slots. The slot-independent parameters are transferred to a DST model for the new slot. Experimental results show that the proposed algorithm achieves 82.5% accuracy on the DSTC2 dataset, outperforming a baseline algorithm by 1.8% when applied to a small set of training data. We also show its potential robustness for the network architecture of update modules.

Kenji Iwata, Takami Yoshida, Hiroshi Fujimura, Masami Akamine

### Managing Multi-task Dialogs by Means of a Statistical Dialog Management Technique

One of the most demanding tasks when developing a dialog system consists of deciding the next system response considering the user’s actions and the dialog history, which is the fundamental responsibility of dialog management. A statistical dialog management technique is proposed in this work to reduce the effort and time required to design the dialog manager. This technique not only allows easy adaptation to new domains, but also makes it possible to deal with the different subtasks for which the dialog system has been designed. The practical application of the proposed technique to develop a dialog system for a travel-planning domain shows that the use of task-specific dialog models increases the quality and number of successful interactions with the system in comparison with developing a single dialog model for the complete domain.

David Griol, Zoraida Callejas, Jose F. Quesada

### Generating Supportive Utterances for Open-Domain Argumentative Dialogue Systems

Towards creating an open-domain argumentative dialogue system, preparing a database of structured argumentative knowledge for the system as reported in previous work is difficult because diverse propositions exist in the open-domain setting. In this paper, instead of structured knowledge, we use a simple seq2seq-based model to generate supportive utterances to user utterances in an open-domain discussion. We manually collected 45,000 utterance pairs consisting of a user utterance and supportive utterance and proposed a method to augment the manually collected pairs to cover various discussion topics. The generated supportive utterances were then manually evaluated and the results showed that the proposed model could generate supportive utterances with an accuracy of 0.70, significantly outperforming baselines.

Koh Mitsuda, Ryuichiro Higashinaka, Taichi Katayama, Junji Tomita

### VOnDA: A Framework for Ontology-Based Dialogue Management

We present VOnDA, a framework to implement the dialogue management functionality in dialogue systems. Although domain-independent, VOnDA is tailored towards dialogue systems with a focus on social communication, which implies the need for a long-term memory and high user adaptivity. For these systems, which are used in health environments or elderly care, the margin of error is very low and control over the dialogue process is of utmost importance. The same holds for commercial applications, where customer trust is at risk. VOnDA’s specification and memory layer relies upon an extended version of RDF/OWL, which provides a universal and uniform representation, and facilitates interoperability with external data sources, e.g., from physical sensors.

Bernd Kiefer, Anna Welker, Christophe Biwer

### Towards Increasing Naturalness and Flexibility in Human-Robot Dialogue Systems

The chapter discusses some approaches to increasing the naturalness and flexibility of human-robot interaction, with examples from the WikiTalk dialogue system. WikiTalk enables robots to talk fluently about thousands of topics using Wikipedia-based talking. However, there are three challenging areas that need to be addressed to make the system more natural: speech interaction, face recognition, and interaction history. We address these challenges and describe more context-aware approaches that take the individual partner into account when generating responses. Finally, we discuss the need for a Wikipedia-based listening capability to enable robots to follow the changing topics in human conversation. This would allow robots to join in the conversation using Wikipedia-based talking to make new topically relevant dialogue contributions.

Graham Wilcock, Kristiina Jokinen

### A Classification-Based Approach to Automating Human-Robot Dialogue

We present a dialogue system based on statistical classification which was used to automate human-robot dialogue in a collaborative navigation domain. The classifier was trained on a small corpus of multi-floor Wizard-of-Oz dialogue including two wizards: one standing in for dialogue capabilities and another for navigation. Below, we describe the implementation details of the classifier and show how it was used to automate the dialogue wizard. We evaluate our system on several sets of source data from the corpus and find that response accuracy is generally high, even with very limited training data. Another contribution of this work is the novel demonstration of a dialogue manager that uses the classifier to engage in multi-floor dialogue with two different human roles. Overall, this approach is useful for enabling spoken dialogue systems to produce robust and accurate responses to natural language input, and for robots that need to interact with humans in a team setting.

Felix Gervits, Anton Leuski, Claire Bonial, Carla Gordon, David Traum

### Engagement-Based Adaptive Behaviors for Laboratory Guide in Human-Robot Dialogue

We address an application of engagement recognition in human-robot dialogue. Engagement is defined as how much a user is interested in the current dialogue, and keeping users engaged is important for spoken dialogue systems. In this study, we apply a real-time engagement recognition model to a laboratory guide task performed by the autonomous android ERICA, which plays the role of the guide. According to the engagement score of a user, ERICA generates adaptive behaviors consisting of feedback utterances and additional explanations. A subject experiment showed that the adaptive behaviors increased both the engagement score and related subjective scores such as interest and empathy.

Koji Inoue, Divesh Lala, Kenta Yamamoto, Katsuya Takanashi, Tatsuya Kawahara

### Spoken Dialogue Robot for Watching Daily Life of Elderly People

The number of aged people is increasing. The influence of solitude on both the physical and mental health of those seniors is a social problem that needs an urgent solution in advanced societies. We propose a spoken dialogue robot that watches over elderly people through conversations by using functions of life-support via information navigation, attentive listening, and anomaly detection. In this paper, we describe a demonstration system implemented in the conversational robot.

Koichiro Yoshino, Yukitoshi Murase, Nurul Lubis, Kyoshiro Sugiyama, Hiroki Tanaka, Sakti Sakriani, Shinnosuke Takamichi, Satoshi Nakamura

### How to Address Humans: System Barge-In in Multi-user HRI

This work investigates different barge-in strategies in the context of multi-user Spoken Dialogue Systems. We conduct a pilot user study in which different approaches are compared in view of how they are perceived by the user. For this setting, the Nao robot serves as a demonstrator for the underlying multi-user Dialogue System which interrupts an ongoing conversation between two humans in order to introduce additional information. The results show that the use of supplemental sounds at this task does not necessarily lead to a positive assessment and may influence the user’s perception of the reliability, competence and understandability of the system.

Nicolas Wagner, Matthias Kraus, Niklas Rach, Wolfgang Minker

### Bone-Conducted Speech Enhancement Using Hierarchical Extreme Learning Machine

Deep learning-based approaches have demonstrated promising performance for speech enhancement (SE) tasks. However, these approaches generally require large quantities of training data and computational resources for model training. An alternate hierarchical extreme learning machine (HELM) model has been previously reported to perform SE and has demonstrated satisfactory results with a limited amount of training data. In this study, we investigate application of the HELM model to improve the quality and intelligibility of bone-conducted speech. Our experimental results show that the proposed HELM-based bone-conducted SE framework can effectively enhance the original bone-conducted speech and outperform a deep denoising autoencoder-based bone-conducted SE system in terms of speech quality and intelligibility with improved recognition accuracy when a limited quantity of training data is available.

Tassadaq Hussain, Yu Tsao, Sabato Marco Siniscalchi, Jia-Ching Wang, Hsin-Min Wang, Wen-Hung Liao

### Benchmarking Natural Language Understanding Services for Building Conversational Agents

We have recently seen the emergence of several publicly available Natural Language Understanding (NLU) toolkits, which map user utterances to structured, but more abstract, Dialogue Act (DA) or Intent specifications, while making this process accessible to the lay developer. In this paper, we present the first wide coverage evaluation and comparison of some of the most popular NLU services, on a large, multi-domain (21 domains) dataset of 25K user utterances that we have collected and annotated with Intent and Entity Type specifications and which will be released as part of this submission ( https://github.com/xliuhw/NLU-Evaluation-Data ). The results show that on Intent classification Watson significantly outperforms the other platforms, namely, Dialogflow, LUIS and Rasa; though these also perform well. Interestingly, on Entity Type recognition, Watson performs significantly worse due to its low Precision (At the time of producing the camera-ready version of this paper, we noticed the seemingly recent addition of a ‘Contextual Entity’ annotation tool to Watson, much like e.g. in Rasa. We’d therefore like to stress that this paper does not include an evaluation of this feature in Watson NLU.). Again, Dialogflow, LUIS and Rasa perform well on this task.

Xingkun Liu, Arash Eshghi, Pawel Swietojanski, Verena Rieser

### Dialogue System Live Competition: Identifying Problems with Dialogue Systems Through Live Event

We organized a competition entitled “the dialogue system live competition” in which the audience, consisting mainly of researchers in the dialogue community, watched and evaluated a live dialogue conducted between users and dialogue systems. The motivation behind the event was to cultivate state-of-the-art techniques in dialogue systems and enable the dialogue community to share the problems with current dialogue systems. There are two parts to the competition: preliminary selection and live event. In the preliminary selection, eleven systems were evaluated by crowd-sourcing. Three systems proceeded to the live event to perform dialogues with designated speakers and to be evaluated by the audience. This paper describes the design and procedure of the competition, the results of the preliminary selection and live event of the competition, and the problems we identified from the event.

Ryuichiro Higashinaka, Kotaro Funakoshi, Michimasa Inaba, Yuiko Tsunomori, Tetsuro Takahashi, Reina Akama

### Multimodal Dialogue Data Collection and Analysis of Annotation Disagreement

We have been collecting multimodal dialogue data [1] to contribute to the development of multimodal dialogue systems that can take a user’s non-verbal behaviors into consideration. We recruited 30 participants from the general public whose ages ranged from 20 to 50 and genders were almost balanced. The consent form to be filled in by the participants was updated to enable data distribution to researchers as long as it is used for research purposes. After the data collection, eight annotators were divided into three groups and assigned labels representing how much a participant looks interested in the current topic to every exchange. The labels given among the annotators do not always agree as they depend on subjective impressions. We also analyzed the disagreement among annotators and temporal changes of impressions of the same annotators.

Kazunori Komatani, Shogo Okada, Haruto Nishimoto, Masahiro Araki, Mikio Nakano

### Analyzing How a Talk Show Host Performs Follow-Up Questions for Developing an Interview Agent

This paper aims to reveal how human experts delve into the topics of a conversation and to apply such strategies in developing an interview agent. The purpose of recent studies on chat-oriented dialogue systems has centered around how to respond in conversation to a wide variety of topics. However, in interviews, it is also important to delve into particular topics to deepen the conversation. Since interviewers in talk shows, such as talk show hosts, are considered to be proficient at such skills, in this paper we analyze how a talk show host interviews her guests in a TV program and thus reveal the strategies used to delve into specific topics. In particular, we focus on follow-up questions and analyze the kinds of follow-up questions used. We also develop a prediction model that judges whether the next utterance should be a follow-up question and then evaluate the model’s performance.

Hiromi Narimatsu, Ryuichiro Higashinaka, Hiroaki Sugiyama, Masahiro Mizukami, Tsunehiro Arimoto

### Chat-Oriented Dialogue System That Uses User Information Acquired Through Dialogue and Its Long-Term Evaluation

A chat-oriented dialogue system can become more likeable if it can remember information about users and use that information during a dialogue. We propose a chat-oriented dialogue system that can use user information acquired during a dialogue and discuss its effectiveness on long-term interaction. In our subjective evaluation over five consecutive days, we compared three systems: a system that can remember and use user information over multiple days (proposed system), one that can only remember user information within a single dialogue session, and another that does not remember any user information. We found that users were significantly more satisfied with our proposed system than with the other two. This paper is the first to verify the effectiveness of remembering on long-term interaction with a fully automated chat-oriented dialogue system.

Yuiko Tsunomori, Ryuichiro Higashinaka, Takeshi Yoshimura, Yoshinori Isoda

### Reranking of Responses Using Transfer Learning for a Retrieval-Based Chatbot

This paper presents how to improve retrieval-based open-domain dialogue systems by re-ranking retrieved responses. As a case study, the paper uses a previously implemented retrieval-based open-domain dialogue system, namely the Iris chatbot. We investigate two approaches to re-rank the retrieved responses. The first approach trains a re-ranker using machine-generated responses that were annotated by human participants through WOCHAT (Workshops and Session Series on Chatbots and Conversational Agents) and its shared tasks [5, 6]. The second approach uses transfer learning by training the re-ranker on a large dataset from a different domain. We chose the Ubuntu dialogue dataset as the domain. The human evaluation test asked subjects to rank and review three different dialogue systems: the baseline Iris system, the Iris system enhanced with a re-ranker trained on WOCHAT data, and the Iris system enhanced with a re-ranker trained on the Ubuntu data. The Iris system enhanced with a re-ranker trained on WOCHAT data received the highest ratings from the human subjects.

Ibrahim Taha Aksu, Nancy F. Chen, Luis Fernando D’Haro, Rafael E. Banchs

### Online FAQ Chatbot for Customer Support

Chatbots and conversational systems are becoming a prominent research area, and many businesses are starting to leverage their capability to handle basic communication tasks. With a vast variety of available frameworks for chatbot development from tech giants, business organizations can build their own systems quickly and conveniently. However, these frameworks often lack a proper set of holistic tools to build a chatbot that is manageable, adaptable to learn, and scalable. Hence, frequently, additional machine learning mechanisms are needed to improve performance. In this paper, we demonstrate a chatbot system that uses machine learning to answer Frequently Asked Questions (FAQs) from our school website. The system includes different types of user query and a vector similarity analysis component to handle long and complex user queries. In addition, Google’s Dialogflow framework is used for intention detection.

Thi Ly Vu, Kyaw Zin Tun, Chng Eng-Siong, Rafael E. Banchs
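
The abstract above mentions a vector similarity analysis component for matching user queries to FAQs but does not say how the vectors are built. The sketch below assumes the simplest possible choice, bag-of-words counts compared by cosine similarity. The function names and FAQ strings are hypothetical and not the authors' implementation:

```python
import math
from collections import Counter


def bow(text):
    """Bag-of-words vector: lowercase whitespace tokens mapped to counts."""
    return Counter(text.lower().split())


def cosine(a, b):
    """Cosine similarity between two sparse count vectors."""
    dot = sum(a[t] * b[t] for t in a if t in b)
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def best_faq(query, faqs):
    """Return the stored FAQ question most similar to the user query."""
    q = bow(query)
    return max(faqs, key=lambda f: cosine(q, bow(f)))


# Hypothetical FAQ questions, for illustration only.
faqs = [
    "how do i apply for admission",
    "where is the library located",
    "what are the tuition fees",
]
print(best_faq("i would like to know where the library is", faqs))
```

A production system would more likely use TF-IDF weighting or learned sentence embeddings in place of raw counts; the ranking step stays the same.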

### What’s Chat and Where to Find It

Chat or ‘non-goal directed’ dialogue has become a popular domain for spoken dialog system research, while the exponential increase in the use of commercial chatbots is creating interest in how to add friendly talk to make task-oriented systems more personable, or indeed to create systems which can create and maintain friendly relations with a user through the use of social talk. However, such talk is not very well defined, relevant data sources are few, and how to create artificial social talk is still an inexact science. This non-technical position paper briefly overviews these areas, exploring data used in chat systems and the limitations and challenges involved, and how these impact on the implementation of realistic social talk in spoken dialog systems.

Emer Gilmartin

### Generation of Objections Using Topic and Claim Information in Debate Dialogue System

In recent years, systems with a dialogue interface have been attracting wide attention [1, 2]. We propose a dialogue system that can debate with users about news broadcasts on TV or radio and help users to understand the meaning deeply. We previously reported a debate system that collected opinions from the Web [4], vectorized them, and finally selected the most appropriate supporting/opposing opinion among them for debating. In this paper, we propose a Neural Network Language Model that can generate objections instead of selecting one opinion for debating. The model generates sentences by putting claim information (supporting/opposing) in the input layer of a Long Short-Term Memory (LSTM) network [3]. We evaluated the model using BLEU scores and human evaluation, and both showed the effectiveness of our method.

Kazuaki Furumai, Tetsuya Takiguchi, Yasuo Ariki

### A Differentiable Generative Adversarial Network for Open Domain Dialogue

This work presents a novel methodology to train open domain neural dialogue systems within the framework of Generative Adversarial Networks with gradient-based optimization methods. We avoid the non-differentiability related to text-generating networks by approximating the word vector corresponding to each generated token via a top-k softmax. We show that a weighted average of the word vectors of the most probable tokens, computed from the probabilities resulting from the top-k softmax, leads to a good approximation of the word vector of the generated token. Finally, we demonstrate through a human evaluation process that training a neural dialogue system via adversarial learning with this method successfully discourages it from producing generic responses. Instead, it tends to produce more informative and varied ones.

Asier López Zorrilla, Mikel deVelasco Vázquez, M. Inés Torres
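
The abstract above describes approximating a generated token's word vector by a probability-weighted average over the most probable tokens. Here is a minimal sketch of that arithmetic in plain Python. It assumes the softmax is renormalized over only the top k logits, which the abstract does not specify, and `top_k_soft_embedding` is a hypothetical name:

```python
import math


def top_k_soft_embedding(logits, embeddings, k):
    """Weighted average of the embeddings of the k highest-scoring tokens.

    logits:     one score per vocabulary entry
    embeddings: one vector (list of floats) per vocabulary entry
    Assumption: the softmax is taken over the top-k logits only.
    """
    # Indices of the k largest logits.
    top = sorted(range(len(logits)), key=lambda i: logits[i], reverse=True)[:k]
    # Numerically stable softmax restricted to those k entries.
    max_logit = max(logits[i] for i in top)
    exps = [math.exp(logits[i] - max_logit) for i in top]
    total = sum(exps)
    probs = [e / total for e in exps]
    # Probability-weighted sum of the selected embeddings, per dimension.
    dim = len(embeddings[0])
    return [
        sum(p * embeddings[i][d] for p, i in zip(probs, top))
        for d in range(dim)
    ]


# Toy vocabulary of three tokens with 2-dimensional embeddings.
logits = [2.0, 2.0, -5.0]
embeddings = [[1.0, 0.0], [0.0, 1.0], [9.0, 9.0]]
print(top_k_soft_embedding(logits, embeddings, k=2))
```

With `k=1` the function returns the embedding of the argmax token. With larger `k` the output is a smooth function of the selected logits, so in an automatic differentiation framework gradients from a discriminator could flow back to the generator through the same computation.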

### A Job Interview Dialogue System with Autonomous Android ERICA

We demonstrate a job interview dialogue with the autonomous android ERICA, which plays the role of an interviewer. Conventional job interview dialogue systems ask only pre-defined questions. The job interview system of ERICA generates follow-up questions on the fly based on the interviewee’s response. The follow-up questions are generated by two kinds of approaches: selection-based and keyword-based. The first type of question is based on selection from a pre-defined question set, which can be used in many cases. The second type of question is based on a keyword extracted from the interviewee’s response, which digs into the interviewee’s response dynamically. These follow-up questions contribute to realizing natural and trained dialogue.

Koji Inoue, Kohei Hara, Divesh Lala, Shizuka Nakamura, Katsuya Takanashi, Tatsuya Kawahara

### Automatic Head-Nod Generation Using Utterance Text Considering Personality Traits

We propose a model for generating head nods from an utterance text considering personality traits. We have been investigating the automatic generation of body motion, such as nodding, from utterance text in dialog agent systems. Human body motion varies greatly depending on personality. Therefore, it is important to appropriately generate body motion according to the personality of the dialog agent. To construct our model, we first compiled a Japanese corpus of 24 dialogues including utterance, nod information, and personality traits (Big Five) of participants. Our nod-generation model also estimates the presence, frequency, and depth of nods during each phrase by using various types of language information extracted from utterance text and personality traits. We evaluated how well the model can generate and estimate nods based on individual personality traits. The results indicate that our model using language information and personality traits outperformed a model using only language information.

Ryo Ishii, Taichi Katayama, Ryuichiro Higashinaka, Junji Tomita

### Opinion Building Based on the Argumentative Dialogue System BEA

In this work, we introduce BEA, an argumentative Dialogue System that assists the user in his or her opinion forming regarding a certain controversial topic. To this end, we establish an opinion model based on weighted bipolar argumentation graphs that allows the system to infer the influence of preferences expressed by the user on all related aspects and is updated by the system in real time during the interaction. The system and the model are tested and discussed by use of an argument structure consisting of 72 components in a proof-of-principle scenario, showing a high sensitivity of the employed model regarding the expressed preferences.

Annalena Aicher, Niklas Rach, Wolfgang Minker, Stefan Ultes

### Learning Between the Lines: Interactive Learning Modules Within Corpus Design

The present paper reports on the advantages of learning inferences and understanding strategies from the interactive structure of a corpus. First of all, we introduce the SUGAR corpus for the cooking domain, describing its peculiar collection and annotation procedures. After this first overview, we show how information included within the corpus can be used to enhance the action interpretation in dialogue systems. This can be the case of linguistic elements or related lexical units which can be acquired from a linked database or from rephrasing strategies within the corpus itself. In all the AI-based approaches depending on a training process using large and representative corpora, the probability to correctly predict the creativity a speaker can perform in using language is lower than expected. Trying to capture most of the possible words and expressions a speaker could use is extremely necessary, but even an empirical, finite collection of cases could not be enough. For this reason, the use of our corpus, possibly in combination with online training, appears as an appealing solution.

Maria Di Maro, Antonio Origlia, Francesco Cutugno

### Framing Lifelong Learning as Autonomous Deployment: Tune Once Live Forever

Lifelong Learning in the context of Artificial Intelligence is a new paradigm that is still in its infancy. It refers to agents that are able to learn continuously, accumulating the knowledge learned in previous tasks and using it to help future learning. In this position paper we depart from the focus on learning new tasks and instead take a stance from the perspective of the life-cycle of intelligent software. We propose to focus lifelong learning research on autonomous intelligent systems that sustain their performance after deployment in production across time without the need of machine learning experts. This perspective is being applied to three European projects funded under the CHIST-ERA framework on several domains of application.

Eneko Agirre, Anders Jonsson, Anthony Larcher

### Continuous Learning for Question Answering

We consider the problem of answering natural language questions over a Knowledge Graph, in the case of systems that must evolve over time in a production environment. One of the key issues is that we can expect that the system will receive questions that cannot be answered with the current state of the Knowledge Graph. We discuss here the challenges we need to address in this scenario and the expected behavior of this kind of Lifelong learning system. We also suggest a first task to address this problem and a possible procedure to build a benchmark.

Anselmo Peñas, Mathilde Veron, Camille Pradel, Arantxa Otegi, Guillermo Echegoyen, Alvaro Rodrigo

Steven Spielberg’s “A.I.” tells the story of two artificial agents: David and Teddy. While David resembles a human child, his companion Teddy is much simpler. Its behavior, however, still suggests a crucial mix of capabilities that stretch the state of the art in AI today. We argue that, unlike most contemporary AI, Teddy qualifies as a bona fide agent, and that implementing such a system would represent a valuable advance in our understanding of agency. We then describe a project to integrate our existing work to create a simple agent with Teddy-like capabilities.

Don Perlis, Clifford Bakalian, Justin Brody, Timothy Clausner, Matthew D. Goldberg, Adam Hamlin, Vincent Hsiao, Darsana Josyula, Chris Maxey, Seth Rabin, David Sekora, Jared Shamwell, Jesse Silverberg

### Lifelong Learning and Task-Oriented Dialogue System: What Does It Mean?

The main objective of this paper is to propose a functional definition of lifelong learning systems adapted to the framework of task-oriented dialogue systems. We mainly identified two aspects where a lifelong learning technology could be applied in such systems: to improve the natural language understanding module and to enrich the database used by the system. Given our definition, we present an example of how it could be implemented in an existing task-oriented dialogue system that is developed in the LIHLITH project.

Mathilde Veron, Sahar Ghannay, Anne-Laure Ligozat, Sophie Rosset

### Towards Understanding Lifelong Learning for Dialogue Systems

Lifelong learning is the ability of a software system to adapt to new situations during its lifetime. We explore how this paradigm can be applied to dialogue systems, how it might be implemented, and how we can evaluate the lifelong learning progress.

Mark Cieliebak, Olivier Galibert, Jan Deriu

### Incremental Improvement of a Question Answering System by Re-ranking Answer Candidates Using Machine Learning

We implement a method for re-ranking the top-10 results of a state-of-the-art question answering (QA) system. The goal of our re-ranking approach is to improve the answer selection given the user question and the top-10 candidates. We focus on improving deployed QA systems that do not allow re-training or for which re-training comes at a high cost. Our re-ranking approach learns a similarity function from n-gram based features, using the query, the answer and the initial system confidence as input. Our contributions are: (1) we generate a QA training corpus starting from 877 answers from the customer care domain of T-Mobile Austria, (2) we implement a state-of-the-art QA pipeline using neural sentence embeddings that encode queries in the same space as the answer index, and (3) we evaluate the QA pipeline and our re-ranking approach using a separately provided test set. The test set can be considered to be available after deployment of the system, e.g., based on feedback of users. Our results show that the system performance, in terms of top-n accuracy and the mean reciprocal rank, benefits from re-ranking using gradient boosted regression trees. On average, the mean reciprocal rank improves by 9.15%.

Michael Barz, Daniel Sonntag
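
The two evaluation metrics named in the abstract above, top-n accuracy and mean reciprocal rank, can be illustrated with a minimal sketch. The function names, the answer identifiers and the toy data are assumptions made for illustration only; they are not taken from the chapter, and the sketch does not reproduce the authors' pipeline or their results.

```python
from typing import Sequence


def reciprocal_rank(ranked_ids: Sequence[str], correct_id: str) -> float:
    """Return 1/rank of the correct answer, or 0.0 if it is not in the list."""
    for position, candidate in enumerate(ranked_ids, start=1):
        if candidate == correct_id:
            return 1.0 / position
    return 0.0


def mean_reciprocal_rank(rankings: Sequence[Sequence[str]],
                         gold: Sequence[str]) -> float:
    """Average the reciprocal rank over all queries."""
    if not rankings:
        return 0.0
    total = sum(reciprocal_rank(r, g) for r, g in zip(rankings, gold))
    return total / len(rankings)


def top_n_accuracy(rankings: Sequence[Sequence[str]],
                   gold: Sequence[str], n: int) -> float:
    """Fraction of queries whose correct answer is among the first n candidates."""
    if not rankings:
        return 0.0
    hits = sum(1 for r, g in zip(rankings, gold) if g in r[:n])
    return hits / len(rankings)


# Hypothetical candidate lists for three queries; "a" is always the correct answer.
rankings = [["a", "b", "c"], ["b", "a", "c"], ["c", "b", "a"]]
gold = ["a", "a", "a"]

mrr = mean_reciprocal_rank(rankings, gold)   # (1 + 1/2 + 1/3) / 3
top1 = top_n_accuracy(rankings, gold, n=1)   # only the first query is a hit
top2 = top_n_accuracy(rankings, gold, n=2)   # the first two queries are hits
```

A re-ranker that moves the correct answer closer to the top of each list raises both metrics, which is the kind of effect the abstract reports.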

### Measuring Catastrophic Forgetting in Visual Question Answering

Catastrophic forgetting is a ubiquitous problem for the current generation of Artificial Neural Networks: when a network is asked to learn multiple tasks in a sequence, it fails dramatically as it tends to forget past knowledge. Little is known about the extent to which multimodal conversational agents suffer from this phenomenon. In this paper, we study the problem of catastrophic forgetting in Visual Question Answering (VQA) and propose experiments in which we analyze pairs of tasks based on CLEVR, a dataset requiring different skills which involve visual or linguistic knowledge. Our results show that dramatic forgetting takes place in VQA, calling for studies on how multimodal models can be enhanced with continual learning methods.

Claudio Greco, Barbara Plank, Raquel Fernández, Raffaella Bernardi

### Position Paper: Brain Signal-Based Dialogue Systems

This position paper focuses on the problem of building dialogue systems for people who have lost the ability to communicate via speech, e.g., patients with locked-in syndrome or severely disabled people. In order for such people to communicate with other people and computers, dialogue systems that are based on brain responses to (imagined) speech are needed. A speech-based dialogue system typically consists of an automatic speech recognition module and a speech synthesis module. In order to build a dialogue system that is able to work on the basis of brain signals, a system needs to be developed that is able to recognize speech imagined by a person and can synthesize speech from imagined speech. This paper proposes combining new and emerging technology on neural speech recognition and auditory stimulus construction from brain signals to build brain signal-based dialogue systems. Such systems have a potentially large impact on society.

Odette Scharenborg, Mark Hasegawa-Johnson

### First Leap Towards Development of Dialogue System for Autonomous Bus

This paper describes a dialogue system for an autonomous bus. Without a driver onboard in an autonomous bus, a passenger needs someone to talk to when in need. In that scenario, the dialogue system in this work helps a passenger to manage their travel plan. The system is designed to work in both chat-oriented and goal-oriented conversations. The internal design is rule-based utterance matching. We also describe the database, which can be easily expanded by the administrator for future development. The deployment of the dialogue system on an Android smartphone interface is demonstrated in this paper.

Maulik C. Madhavi, Tong Zhan, Haizhou Li, Min Yuan

### Overview of the Dialogue Breakdown Detection Challenge 4

To promote the research and development of dialogue breakdown detection for dialogue systems, we have been organizing a series of dialogue breakdown detection challenges to detect a system’s inappropriate utterances that lead to dialogue breakdowns in chat-oriented dialogue. In this paper, we give an overview of Dialogue Breakdown Detection Challenge 4 (DBDC4). As in the previous challenges, we used datasets in English and Japanese. Four teams participated in the challenge; all four worked on English, and two of them worked on Japanese as well. This paper describes the task setting, evaluation metrics, and datasets for the challenge and the results of the submitted runs of the participants.

Ryuichiro Higashinaka, Luis F. D’Haro, Bayan Abu Shawar, Rafael E. Banchs, Kotaro Funakoshi, Michimasa Inaba, Yuiko Tsunomori, Tetsuro Takahashi, João Sedoc

### Dialogue Breakdown Detection Using BERT with Traditional Dialogue Features

Despite the significant improvements in Natural Language Processing with neural networks, for example in machine reading comprehension, chat-oriented dialogue systems sometimes generate inappropriate response utterances that cause dialogue breakdown because of the difficulty of generating utterances. If we can detect such inappropriate utterances and suppress them, dialogue systems can continue the dialogue easily.

Hiroaki Sugiyama

### RSL19BD at DBDC4: Ensemble of Decision Tree-Based and LSTM-Based Models

RSL19BD (Waseda University Sakai Laboratory) participated in the Fourth Dialogue Breakdown Detection Challenge (DBDC4) and submitted five runs to both English and Japanese subtasks. In these runs, we utilise the Decision Tree-based model and the Long Short-Term Memory-based (LSTM-based) model following the approaches of RSL17BD and KTH in the Third Dialogue Breakdown Detection Challenge (DBDC3) respectively. The Decision Tree-based model follows the approach of RSL17BD but utilises RandomForestRegressor instead of ExtraTreesRegressor. In addition, instead of predicting the mean and the variance of the probability distribution of the three breakdown labels, it predicts the probability of each label directly. The LSTM-based model follows the approach of KTH with some changes in the architecture and utilises Convolutional Neural Network (CNN) to perform text feature extraction. In addition, instead of targeting the single breakdown label and minimising the categorical cross entropy loss, it targets the probability distribution of the three breakdown labels and minimises the mean squared error. Run 1 utilises a Decision Tree-based model; Run 2 utilises an LSTM-based model; Run 3 performs an ensemble of 5 LSTM-based models; Run 4 performs an ensemble of Run 1 and Run 2; Run 5 performs an ensemble of Run 1 and Run 3. Run 5 statistically significantly outperformed all other runs in terms of MSE (NB, PB, B) for the English data and all other runs except Run 4 in terms of MSE (NB, PB, B) for the Japanese data (alpha level = 0.05).

Chih-Hao Wang, Sosuke Kato, Tetsuya Sakai
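
As a rough illustration of the ideas in the abstract above, the sketch below combines several predicted probability distributions over the three breakdown labels (NB, PB, B) and scores the result with mean squared error against a target distribution. The abstract does not state how the ensembles were formed, so the plain unweighted average used here is an assumption, as are the function names and the toy numbers.

```python
from typing import List, Sequence

# The three breakdown labels named in the abstract.
LABELS = ("NB", "PB", "B")


def average_distributions(predictions: Sequence[Sequence[float]]) -> List[float]:
    """Unweighted average of per-model label distributions (assumed ensemble rule)."""
    if not predictions:
        raise ValueError("at least one prediction is required")
    count = len(predictions)
    return [sum(p[i] for p in predictions) / count for i in range(len(LABELS))]


def mean_squared_error(predicted: Sequence[float], target: Sequence[float]) -> float:
    """Mean squared difference between two distributions over the labels."""
    if len(predicted) != len(target):
        raise ValueError("distributions must have the same length")
    return sum((p - t) ** 2 for p, t in zip(predicted, target)) / len(target)


# Hypothetical outputs of two models for one system utterance.
model_a = [0.6, 0.3, 0.1]
model_b = [0.2, 0.5, 0.3]
# Hypothetical target distribution derived from annotator labels.
target = [0.5, 0.3, 0.2]

ensemble = average_distributions([model_a, model_b])
error = mean_squared_error(ensemble, target)
```

Training against the full label distribution, as the abstract describes, means the target is a vector such as `target` above, not a single label.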

### LSTM for Dialogue Breakdown Detection: Exploration of Different Model Types and Word Embeddings

One of the principal problems of human-computer interaction is miscommunication. Occurring mainly on the part of the dialogue system, miscommunication can lead to dialogue breakdown, i.e., a point at which the dialogue cannot be continued. Detecting breakdown can facilitate its prevention or recovery after a breakdown has occurred. In this paper, we propose a multinomial sequence classifier for dialogue breakdown detection. We explore several LSTM models that differ in model type and in the word embedding models they use. We select our best performing model and compare it with the best model and the majority baseline from the previous challenge. We conclude that our detector outperforms the baselines in offline testing.

Mariya Hendriksen, Artuur Leeuwenberg, Marie-Francine Moens