1 Introduction
-
Virtual Assistants aid users in everyday tasks, such as scheduling appointments. They usually operate on predefined actions which can be triggered by voice command.
-
Information-seeking systems provide users with information about a question (e.g. the most suitable hotel in town). These range from factual questions to more complex questions.
-
E-learning dialogue systems train students for various situations. For instance, they train students in interacting with medical patients or train military personnel in questioning a witness.
2 A general overview
2.1 Dialogue systems
-
Task-oriented systems are developed to help the user solve a specific task as efficiently as possible. The dialogues are characterized by a clearly defined structure that is derived from the domain. The dialogues follow mixed initiative: both the user and the system can take the lead. Usually, the systems found in the literature are built for speech input and output. However, task-oriented systems in the domain of assisting users are built on multi-modal input and output.
-
Conversational agents display a more unstructured conversation, as their purpose is to have open-domain dialogues with no specific task to solve. Most of these systems are built to emulate social interactions, and thus longer dialogues are desired.
-
Question Answering (QA) systems are built for the specific task of answering questions. The dialogues are not defined by a structure as in task-oriented systems; however, they mostly follow a question-and-answer pattern. QA systems may be built for a specific domain, but may also be tilted towards more open-domain questions. Usually, the domain is dictated by the underlying data, e.g. knowledge bases or text snippets from forums. Traditional QA systems work on single-turn interactions; however, there are systems that allow multiple turns to cover follow-up questions. The initiative mostly lies with the user, who asks the questions.
Criterion | Task-oriented DS | Conversational agents | Interactive QA |
---|---|---|---|
Task | Yes—clearly defined | No | Yes—answer questions |
Dial. Structure | Highly structured | Not structured | No |
Domain | Restricted | Mostly open domain | Mixed |
Turns | Multi | Multi | Single/Multi |
Length | Short | Long | – |
Initiative | Mixed/system init | Mixed/user init | User init |
Interface | Multi-modal | Multi-modal | Mostly text |
2.2 Evaluation
-
Automatic: In order to reduce the dependency on human labour, which is time- and cost-intensive as well as not necessarily repeatable, the evaluation method should be automated, or at least partially automated.
-
Repeatable: The evaluation method should yield the same result if applied multiple times to the same dialogue system under the same circumstances.
-
Correlated to human judgments: The procedure should yield ratings that correlate with human judgments.
-
Differentiate between different dialogue systems: The evaluation procedure should be able to differentiate between different strategies. For instance, if one wants to test the effect of a barge-in feature (i.e. allowing the user to interrupt the dialogue system), the evaluation procedure should be able to highlight its effects.
-
Explainable: The method should give insights into which features of the dialogue system impact the quality of the dialogue and in which manner they do so. For instance, the method should reveal that the automatic speech recognition system’s word error rate has a high influence on the quality of the natural language understanding component, which in turn impacts the intent classification.
-
Lab experiments: Before crowdsourcing became popular, dialogue systems were evaluated in a lab environment. Users were invited to the lab, where they interacted with a dialogue system and subsequently filled out a questionnaire. For instance, Young et al. (2010) recruited 36 subjects, who were given instructions and presented with various scenarios. The subjects were asked to solve a task using a spoken dialogue system, and a supervisor was present to guide them. The lab environment is very controlled, which is not necessarily comparable to the real world (Black et al. 2011; Schmitt and Ultes 2015).
-
In-field experiments: Here, the evaluation is performed by collecting feedback from real users of the dialogue systems (Lamel et al. 2000). For instance, for the Spoken Dialogue Challenge (Black et al. 2011), the systems were developed to provide bus schedule information in Pittsburgh. The evaluation was performed by redirecting evening calls to the dialogue systems and collecting user feedback at the end of the conversation. The Alexa Prize also followed the same strategy, i.e. it let real users interact with operational systems and gathered user feedback over a span of several months.
-
Crowdsourcing: Recently, human evaluation has shifted from the lab environment to crowdsourcing platforms such as Amazon Mechanical Turk (AMT). These platforms provide access to a large pool of recruited users. Jurcícek et al. (2011) evaluate the validity of using crowdsourcing for evaluating dialogue systems, and their experiments suggest that, with enough crowdsourced users, the quality of the evaluation is comparable to lab conditions. Current research relies on crowdsourcing for human evaluation (Serban et al. 2017a; Wen et al. 2017). Especially conversational dialogue systems are evaluated via crowdsourcing, where there are two main evaluation procedures: crowdworkers either talk to the system and rate the interaction, or they are presented with a context from the test set and a response by the system, which they need to rate. In both settings, the crowdworkers are asked to rate the system with respect to quality, fluency or appropriateness. Recently, Adiwardana et al. (2020) introduced the Sensibleness and Specificity Average (SSA), where humans rate the sensibleness and specificity of a response. These capture two aspects of human behaviour: making sense and being specific. A dialogue system can be sensible by responding with vague answers (e.g. “I don’t know”), whereas it is only specific if it takes the context into account.
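Computing SSA from the collected binary labels is straightforward; the following is a minimal sketch with made-up labels (in practice each response is rated by several judges and the labels are aggregated):

```python
# Minimal sketch of the Sensibleness and Specificity Average (SSA),
# following Adiwardana et al. (2020): judges give binary labels per response;
# SSA is the mean of the two per-label averages. Labels below are made up.
sensible = [1, 1, 0, 1, 1]   # 1 = the response makes sense in context
specific = [1, 0, 0, 1, 0]   # 1 = the response is specific to the context

sensibleness = sum(sensible) / len(sensible)
specificity = sum(specific) / len(specific)
ssa = (sensibleness + specificity) / 2
print(f"Sensibleness={sensibleness:.2f}, Specificity={specificity:.2f}, SSA={ssa:.2f}")
```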
2.3 Modular structure of this article
3 Task-oriented dialogue systems
3.1 Characteristics
3.2 Dialogue structure
Slot | Type | Example values |
---|---|---|
name | Verbatim string | Alimentum,... |
eatType | Dictionary | restaurant, pub, coffee shop |
familyFriendly | Boolean | yes, no |
food | Dictionary | Italian, French, English,... |
near | Verbatim string | Burger King |
area | Dictionary | riverside, city center |
customerRating | Dictionary | 1 of 5, 3 of 5, 5 of 5, low, average, high |
priceRange | Dictionary | \(< \pounds\) 20, \(\pounds\) 20–25, \(> \pounds\) 30, cheap, moderate, high |
Dialogue act | Description |
---|---|
hello\((a=x, b=y,\ldots )\) | Open a dialogue and give info \(a=x, b=y,\ldots\) |
inform\((a=x, b=y,\ldots )\) | Give information \(a=x, b=y,\ldots\) |
request\((a, b=x,\ldots )\) | Request value for a given \(b=x,\ldots\) |
reqalts\((a=x,\ldots )\) | Request alternative with \(a=x,\ldots\) |
confirm\((a=x, b=y,\ldots )\) | Explicitly confirm \(a=x, b=y,\ldots\) |
confreq\((a=x,\ldots , d)\) | Implicitly confirm \(a=x,\ldots\) and request value of d |
select\((a=x, a=y)\) | Select either \(a=x\) or \(a=y\) |
affirm\((a=x, b=y)\) | Affirm and give further info \(a=x, b=y,\ldots\) |
negate\((a=x)\) | Negate and give corrected value \(a=x\) |
deny\((a=x)\) | Deny that \(a=x\) |
bye() | Close a dialogue |
Utterance | Dialogue Act |
---|---|
U: Hi, I am looking for somewhere to eat | hello(task = find,type=restaurant) |
S: You are looking for a restaurant. What type of food? | confreq(type = restaurant,food) |
U: I’d like an Italian somewhere near the museum. | inform(food = Italian,near=museum) |
S: Roma is a nice Italian restaurant near the museum. | inform(name = “Roma”, type = restaurant, food = Italian, near = museum) |
U: Is it reasonably priced? | confirm(pricerange = moderate) |
S: Yes, Roma is in the moderate price range. | affirm(name = “Roma”, pricerange = moderate) |
U: What is the phone number? | request(phone) |
S: The number of Roma is 385456. | inform(name = “Roma”, phone = “385456”) |
U: Ok, thank you goodbye. | bye() |
3.3 Technologies
3.3.1 Pipelined systems
3.3.2 End-to-end trainable systems
3.4 Evaluation
-
User satisfaction modelling: Here, the assumption is that the usability of the system can be approximated by the satisfaction of its users, which can be measured by questionnaires. These approaches aim to model the human judgements, i.e. to create models which give the same ratings as the human judges. First, a human evaluation is performed where subjects interact with the dialogue system. Afterwards, the dialogue system is rated via questionnaires. Finally, the ratings are used as target labels to fit a model based on objectively measurable features (e.g. task success rate, word error rate of the ASR system).
-
User simulation: Here, the idea is to simulate the behaviour of the users. There are two applications of user simulation: firstly, to evaluate a functioning system with the goal of finding weaknesses; secondly, to serve as an environment in which to train a reinforcement-learning based system. The evaluation in the latter case is based on the reward achieved by the dialogue manager under the user simulation.
-
Set of Constraints, which define the target information to be retrieved. For instance, the specifications of the venue (e.g. a bar in the central area, which serves beer) or the travel route (e.g. ticket from Torino to Milano at 8pm).
-
Set of Requests, which define what information the user wants, for instance the name, address and phone number of the venue. A possible representation of such a goal is sketched after this list.
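The following is a minimal sketch of how such a user goal could be represented as a simple data structure; the slot names and the success check are illustrative assumptions, not part of any specific simulator:

```python
# A simulated user's goal: constraints the venue must satisfy and requests
# (slots whose values the user wants to know). Slot names are assumptions.
user_goal = {
    "constraints": {"type": "bar", "area": "central", "drinks": "beer"},
    "requests": ["name", "address", "phone"],
}

def goal_completed(goal, offered_venue, informed_slots):
    """A simulated dialogue counts as successful if the offered venue matches
    all constraints and every requested slot was informed by the system."""
    constraints_met = all(offered_venue.get(slot) == value
                          for slot, value in goal["constraints"].items())
    requests_answered = all(slot in informed_slots for slot in goal["requests"])
    return constraints_met and requests_answered

# Example: the system offered a matching bar but informed only name and address.
venue = {"type": "bar", "area": "central", "drinks": "beer", "name": "Crown"}
print(goal_completed(user_goal, venue, informed_slots={"name", "address"}))  # False
```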
DATA | DEPART-CITY | ARRIVAL-CITY | DEPART-RANGE | DEPART-TIME | ||||||||||
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
v1 | v2 | v3 | v4 | v5 | v6 | v7 | v8 | v9 | v10 | v11 | v12 | v13 | v14 | |
KEY | ||||||||||||||
v1 | 22 | 1 | 3 | |||||||||||
v2 | 29 | |||||||||||||
v3 | 4 | 16 | 4 | 1 | ||||||||||
v4 | 1 | 1 | 5 | 11 | 1 | |||||||||
v5 | 20 | |||||||||||||
v6 | 22 | |||||||||||||
v7 | 1 | 1 | 20 | 5 | ||||||||||
v8 | 1 | 2 | 8 | 15 | ||||||||||
v9 | 45 | 10 | ||||||||||||
v10 | 5 | 40 | ||||||||||||
v11 | 20 | 2 | ||||||||||||
v12 | 1 | 19 | 2 | 4 | ||||||||||
v13 | 2 | 18 | ||||||||||||
v14 | 2 | 6 | 3 | 21 | ||||||||||
sum | 30 | 30 | 25 | 15 | 25 | 25 | 30 | 20 | 50 | 50 | 25 | 25 | 25 | 25 |
3.4.1 User satisfaction modelling
-
Reliability: Evanini et al. (2008) state as a main argument that users tend to interpret the questions on the questionnaires differently, thus making the evaluation unreliable. Gašić et al. (2011) noted that, even in the lab setting where users are given a predefined goal, users tend to forget the task requirements and thus incorrectly assess task success. Furthermore, in the in-field setting, where the feedback is given optionally, the judgements are likely to be skewed towards the positive interactions.
-
Cognitive demand: Schmitt and Ultes (2015) note that rating the dialogue puts additional cognitive demand on users. This is especially true if the evaluation has to be done at the exchange level. This can distort the judgments about the interaction.
-
Impracticability: Ultes et al. (2013) note the impracticability of having users rate the live dialogue, as they would have to press a button on the phone or have a special installation to give feedback.
Training set | \(R^2\) training (SE) | Test set | \(R^2\) test (SE) |
---|---|---|---|
ALL 90% | 0.47 (0.004) | ALL 10% | 0.50 (0.035) |
ELVIS 90% | 0.42 | TOOT | 0.55 |
ELVIS 90% | 0.42 | ANNIE | 0.36 |
NOVICES | 0.47 | ANNIE EXPERTS | 0.04 |
-
A linear regression model is fitted on \(90\%\) of the data and evaluated on the remaining \(10\%\). The results show that the model is able to explain \(R^2 = 50\%\) of the variance, which the authors consider a good predictor (a minimal sketch of such a model is given after this list).
-
Training the regression model on the data of one system and evaluating it on the data of another dialogue system (e.g. training on the ELVIS data and evaluating on the TOOT data) shows high variability as well. The evaluation on the TOOT system data yields much higher scores than evaluating on the ANNIE data. These results show that the model is able to generalize to data of other dialogue systems to a certain degree.
-
The evaluation of the generalizability of the model across different populations of users yields a negative result. When trained on dialogue data from conversations by novice users (NOVICES), the linear model is not capable of predicting the scores of experienced users (ANNIE EXPERTS) of the dialogue system.
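To make the user satisfaction modelling idea concrete, the following is a minimal sketch that fits a linear regression from objectively measurable dialogue features to satisfaction ratings and reports \(R^2\) on a held-out 10% split; the feature names and the synthetic data are illustrative assumptions, not data from the cited studies:

```python
# A minimal sketch, not the original PARADISE setup: fit a linear regression
# from objective dialogue features to user satisfaction ratings and report
# R^2 on a held-out 10% split. Features and data are illustrative assumptions.
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(42)
n = 200
# Hypothetical per-dialogue features: task success (0/1), ASR word error rate, number of turns
X = np.column_stack([
    rng.integers(0, 2, n),
    rng.uniform(0.0, 0.5, n),
    rng.integers(4, 25, n),
])
# Synthetic satisfaction ratings loosely driven by the features (1-5 scale)
y = 2.0 + 2.0 * X[:, 0] - 3.0 * X[:, 1] - 0.02 * X[:, 2] + rng.normal(0, 0.3, n)

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.1, random_state=0)
model = LinearRegression().fit(X_train, y_train)
print("Feature weights:", model.coef_)
print("R^2 on held-out dialogues:", model.score(X_test, y_test))
```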
Feature set | \(IQ_{field}\) | \(IQ_{lab}\) | \(US_{lab}\) |
---|---|---|---|
ASR | 0.753 | 0.811 | 0.625 |
AUTO | 0.776 | 0.856 | 0.668 |
AUTO + EMO | 0.785 | 0.856 | 0.669 |
AUTO + EMO + USER | – | 0.888 | 0.741 |
Feature set | Test | Train | \(\rho\) |
---|---|---|---|
Auto | \(US_{lab}\) | \(IQ_{lab}\) | 0.667 |
Auto | \(IQ_{lab}\) | \(IQ_{field}\) | 0.647 |
Auto | \(IQ_{field}\) | \(IQ_{lab}\) | 0.696 |
3.4.2 User simulation
-
Interaction level: Does the interaction take place at the semantic level (i.e. on the level of dialogue acts) or at the surface level (i.e. using natural language understanding and generation)?
-
User goal: Does the simulation update the goal during the conversation or not? The dialogues in the second Dialogue State Tracking Challenge (DSTC2) data contain a large number of examples where the user changes their goal during the interaction (Henderson et al. 2014). Thus, it is more realistic to model these changes as well.
-
Error model: Whether and how to realistically model the errors made by the components of the dialogue system.
-
Evaluation of the user simulation: For a discussion on this topic, refer to Pietquin and Hastie (2013). There are two main evaluation strategies: direct and indirect evaluation. The direct evaluation of the simulation is based on metrics (e.g. precision and recall on dialogue acts, perplexity). The indirect evaluation measures the utility of the user simulation (e.g. by evaluating the trained dialogue manager).
-
State errors arise when the user input cannot be interpreted in the current state, but might be interpretable in a different state.
-
Capability errors arise when the system cannot execute the user’s commands due to missing capability.
-
Modelling errors arise due to discrepancies in how the user and the system model the world. For instance, the system presents a list of options and allows the user to address the elements in the list by their position, but the user addresses them by their name.
-
High-level features, such as concept error rates or the average number of semantic concepts per user turn (#AVP). Here, the results show that, although the simulation was not always able to recreate the absolute values, it was able to replicate the relative results. This is helpful, as it would lead to the same conclusions for the same questions.
-
User judgment prediction, which is based on a predictive model trained using the PARADISE framework. Here, the authors compared the real user judgments to the predicted judgments (where the linear model predicted the judgments of the simulated dialogues). Again, the results show that the user model would yield the same conclusions as the user study, namely that young users rated the system higher than older users and that older users judged the dynamic help system worse than the other one.
-
Precision and Recall of predicted actions: Here, the simulation is used to predict the next user action for a given context from a dialogue corpus. The predicted user action is compared to the real user action, and based on this, precision and recall are computed (see the sketch below). The results show that precision and recall are relatively low.
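A minimal sketch of this computation, with made-up dialogue acts, could look as follows; per-turn scores would then be averaged over the corpus:

```python
# Direct evaluation of a user simulation: compare the dialogue acts predicted
# by the simulator with the acts observed in the corpus for the same context,
# and compute precision and recall per turn. The acts below are made up.
def precision_recall(predicted, reference):
    predicted, reference = set(predicted), set(reference)
    true_positives = len(predicted & reference)
    precision = true_positives / len(predicted) if predicted else 0.0
    recall = true_positives / len(reference) if reference else 0.0
    return precision, recall

turns = [
    (["inform(food=italian)", "inform(area=centre)"], ["inform(food=italian)"]),
    (["request(phone)"], ["request(phone)", "bye()"]),
]
for predicted_acts, reference_acts in turns:
    p, r = precision_recall(predicted_acts, reference_acts)
    print(f"precision={p:.2f} recall={r:.2f}")
```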
3.4.3 Subsystems evaluation
4 Conversational dialogue systems
4.1 Characteristics
4.2 Modelling conversational dialogue systems
-
Utterance selection: Here, the dialogue is modelled as an information retrieval task. A set of candidate utterances is ranked by relevance. The dialogue structure is thus defined by the utterances in a dialogue database (Lee et al. 2009). The idea is to retrieve the most relevant answer to a given utterance, thus learning to map multiple semantically equivalent user utterances to an appropriate answer.
-
Generative models: Here, the dialogue systems are based on deep neural networks, which are trained to generate the most likely response to a given conversation history. Usually, the dialogue structure is learned from a large corpus of dialogues; thus, the corpus defines the dialogue behaviour of the conversational agent.
4.2.1 Neural generative models
-
Adapt the loss functions. The main idea is to adapt the loss function in order to penalize generic responses and promote more diverse responses. Li et al. (2016a) propose two loss functions based on maximum mutual information: one is based on an anti-language model, which penalizes high-frequency words; the other is based on the probability of the source given the target (a minimal reranking sketch based on this idea is given after this list). Li et al. (2016b) propose to train the neural conversational agent using the reinforcement-learning framework. This makes it possible to learn a policy that can plan ahead and generate more meaningful responses. The major focus is the reward function, which encapsulates various aspects: ease of answering (reduce the likelihood of producing a dull response), information flow (penalize answers that are semantically similar to a previous answer given), and semantic coherence (based on mutual information).
-
Condition the decoder. The seq2seq models perform a shallow generation process, meaning that each sampled word is only conditioned on the previously sampled words. There are two methods for conditioning the generation process: conditioning on stochastic latent variables or on topics. Serban et al. (2017c) enhance the HRED model with stochastic latent variables at the utterance level and at the word level. At the decoding stage, first the latent variable is sampled from a multivariate normal distribution and then the output sequence is generated. Xing et al. (2017) add a topic-attention mechanism to their generation architecture, which takes as input topic words extracted using the Twitter LDA model (Zhao et al. 2011). The work by Ghazvininejad et al. (2018) extends the seq2seq model with a Facts Encoder. The “facts” are represented as a large collection of raw texts (Wikipedia, Amazon reviews, etc.), which are indexed by named entities.
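As an illustration of the loss-adaptation idea, the following sketch reranks candidate responses with the anti-language-model variant of the MMI objective; `log_p_target_given_source` and `log_p_target` stand in for calls to a trained seq2seq model and a language model, and are assumptions of this sketch rather than part of the original implementation.

```python
# Minimal sketch of MMI-antiLM reranking in the spirit of Li et al. (2016a):
# score a candidate response T for a source S as log p(T|S) - lambda * log p(T),
# which penalizes responses that are likely regardless of the source
# (i.e. generic, high-frequency responses).
def mmi_antilm_rerank(source, candidates, log_p_target_given_source, log_p_target, lam=0.5):
    scored = [
        (log_p_target_given_source(source, t) - lam * log_p_target(t), t)
        for t in candidates
    ]
    # Return candidates sorted from best to worst MMI score.
    return [t for _, t in sorted(scored, reverse=True)]
```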
4.2.2 Utterance selection methods
-
Surface form similarity. This measures the similarity at the token level. This includes measures such as: Levenshtein distance, METEOR (Lavie and Denkowski 2009), or TF-IDF retrieval models (Charras et al. 2016; Dubuisson Duplessis et al. 2016). For instance, Dubuisson Duplessis et al. (2017) propose an approach that exploits recurrent surface text patterns to represent dialogue utterances.
-
Multi-class classification task. These methods model the selection task as a multi-class classification problem, where each candidate response is a single class. For instance, Gandhe and Traum (2013) model each utterance as a separate class, and the training data consists of utterance-context pairs on which features are extracted. Then a perceptron model is trained to select the most appropriate response utterance. This approach is suitable for applications with a small number (\(\sim 100\)) of candidate answers.
-
Neural network based approaches. Neural network architectures were introduced to leverage large amounts of training data. Usually, they are based on a siamese architecture, where both the current utterance and a candidate response are encoded. Based on this representation, a binary classifier is trained to distinguish between relevant and irrelevant responses. One well-known example is the dual encoder architecture proposed by Lowe et al. (2017b). Dual encoders transform the user input and a candidate response into a distributed representation. Based on the two representations, a logistic regression layer is trained to classify the pair of utterance and candidate response as either relevant or not. The softmax score of the relevant class is used to sort the candidate responses. The authors experimented with different neural network architectures for modelling the encoder, such as recurrent neural networks or long short-term memory networks (LSTM) (Hochreiter and Schmidhuber 1997). A minimal sketch of this architecture is given below.
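The following PyTorch sketch shows the general shape of such a dual encoder; the layer sizes and the bilinear scoring term are illustrative assumptions rather than the exact configuration used by Lowe et al. (2017b).

```python
import torch
import torch.nn as nn

class DualEncoder(nn.Module):
    """Minimal dual-encoder sketch: encode context and candidate response with
    a shared LSTM and score the pair with a bilinear term (sizes are assumptions)."""
    def __init__(self, vocab_size, emb_dim=128, hidden=256):
        super().__init__()
        self.embed = nn.Embedding(vocab_size, emb_dim)
        self.encoder = nn.LSTM(emb_dim, hidden, batch_first=True)
        self.M = nn.Parameter(torch.randn(hidden, hidden) * 0.01)

    def forward(self, context_ids, response_ids):
        _, (h_context, _) = self.encoder(self.embed(context_ids))   # final hidden state of the context
        _, (h_response, _) = self.encoder(self.embed(response_ids)) # final hidden state of the candidate
        h_context, h_response = h_context.squeeze(0), h_response.squeeze(0)
        # Probability that the candidate is a relevant next response
        logits = torch.sum((h_context @ self.M) * h_response, dim=1)
        return torch.sigmoid(logits)

# Usage: rank candidate responses by their predicted relevance score.
model = DualEncoder(vocab_size=10000)
context = torch.randint(0, 10000, (1, 20))     # token ids of the dialogue context
candidates = torch.randint(0, 10000, (5, 20))  # 5 candidate responses
scores = model(context.expand(5, -1), candidates)
print(scores)
```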
4.3 Evaluation methods
4.3.1 General metrics for conversational dialogue systems
-
Word-overlap metrics: These metrics were originally proposed by the machine translation and summarization communities. They were initially a popular choice for evaluating dialogue systems since they are easily applicable. Popular metrics such as the BLEU score (Papineni et al. 2002) and ROUGE (Lin 2004) were used as an approximation of the appropriateness of an utterance. However, Liu et al. (2016) showed that neither of the word-overlap based scores has any correlation to human judgments.

Based on the criticism of the word-overlap metrics, several new metrics have been proposed. Galley et al. (2015) propose to include human judgments in the BLEU score, which they call \({\varDelta }\)BLEU. The human judges rated the reference responses of the test set according to their relevance to the context. The ratings are used to weight the BLEU score so that high-rated responses are rewarded and low-rated responses are penalized. The correlation to human judgments was measured by means of Spearman’s \(\rho\): \({\varDelta }\)BLEU has a correlation of \(\rho = 0.484\), which is significantly higher than the correlation of the BLEU score, which lies at \(\rho = 0.318\). Although this increases the correlation of the metric to human judgments, the procedure requires human judgments to label the reference responses.
-
Trained metrics: Lowe et al. (2017a) present an automatic dialogue evaluation model (ADEM), a recurrent neural network trained to predict appropriateness ratings by human judges. The human ratings were collected via Amazon Mechanical Turk, where the judges were presented with a dialogue context and a candidate response, which they rated for appropriateness on a scale from 1 to 5. Based on the ratings, a recurrent neural network was trained to score the model response, given the context and the reference response. The Pearson’s correlation between ADEM and the human judgments is computed at two levels: the utterance level and the system level, where the system-level rating is computed as the average utterance-level score achieved by the system.

The Pearson’s correlation for ADEM lies at 0.41 on the utterance level and at 0.954 on the system level. For comparison, the correlation to human judgments for the ROUGE score lies at only 0.062 on the utterance level and at 0.268 on the system level.

While ADEM relies on human-labelled data, Tao et al. (2018) present a method which has no need for human judges. Their model is based on two observations: firstly, a response that is close to the ground truth is likely to be good; secondly, a response that is related to the last utterance or the context of the conversation is good. They propose two submodels to capture these insights. The first model computes a representation of both the ground truth and the generated response based on min- and max-pooling of word embeddings; then the cosine similarity is computed to measure the relatedness of the ground truth and the generated response (see the sketch below). The second model rates the relatedness between the conversational context and the generated response. In order to train this model, the authors create a training set of positive examples (i.e. pairs of contexts and responses that are relevant) and negative examples (i.e. pairs of irrelevant contexts and responses). The positive examples are taken from the dialogues in the training material, whereas the negative examples are constructed by randomly sampling utterances from the corpus for a given context. Then a siamese neural network is trained on this data to predict whether a pair of context and response is relevant. The scores of both submodels are then normalized and averaged. The Pearson’s correlation for their model lies at 0.4594, which is comparable to ADEM.

Although trained metrics have a significantly higher correlation to human judgements, they are shown not to be robust (Sai et al. 2019). In fact, simple manipulations of the response under consideration can lead to significant changes in the score of ADEM. For instance, in \(48.66\%\) of cases the predicted score increased when the generated response was reversed, and in \(86.93\%\) of cases the predicted score increased when the generated response was replaced with a dull dummy response. Thus, creating reliable trained metrics is still an open problem.
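To make the first submodel of Tao et al. (2018) concrete, the sketch below builds max- and min-pooled word-embedding representations and compares the ground truth with the generated response via cosine similarity; the toy random embedding table is an assumption of this sketch, as pretrained embeddings would be used in practice.

```python
# Minimal sketch of a referenced sub-metric in the spirit of Tao et al. (2018):
# represent each response by concatenating the max- and min-pooled word
# embeddings of its tokens and compare with cosine similarity.
import numpy as np

rng = np.random.default_rng(0)
# Toy embedding table (assumption); real systems use pretrained embeddings.
vocab = {w: rng.normal(size=50) for w in "the roma is a nice italian restaurant".split()}

def pooled_representation(tokens):
    vectors = np.stack([vocab[t] for t in tokens if t in vocab])
    return np.concatenate([vectors.max(axis=0), vectors.min(axis=0)])

def referenced_score(ground_truth, generated):
    a = pooled_representation(ground_truth.split())
    b = pooled_representation(generated.split())
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))

print(referenced_score("roma is a nice italian restaurant", "the restaurant roma is nice"))
```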
4.3.2 Utterance selection metrics
5 Question answering dialogue systems
-
Task-oriented systems are developed for a multitude of tasks (e.g. restaurant reservation, travel information system, virtual assistant, etc.), whereas QA systems are developed to find answers to specific questions.
-
Task-oriented systems are usually domain-specific, i.e. the domain is defined in advance through an ontology and remains fixed. In contrast, QA systems usually work on broader domains (e.g. factoid QA can be done over different domains at once), although there are also some QA systems focused only on a specific domain (Sarrouti and Ouatik El Alaoui 2017; Do et al. 2017).
-
The dialogue aspect of QA systems is not tailored to sound human-like; rather, the focus is on completing the task, that is, providing a correct answer to the input question.
5.1 Characteristics
-
Open QA, where systems collect evidence and answers across several sources, such as Web pages and knowledge bases (Fader et al. 2013)
-
Reading Comprehension (RC), where the answer is gathered from a single document. This is the most common approach.
-
Extractive RC, where systems extract spans of text containing the answer. This approach has received a lot of attention, fostered by the availability of popular benchmarks such as SQuAD (Rajpurkar et al. 2018), NewsQA (Trischler et al. 2017) or TriviaQA (Joshi et al. 2017). Each of these datasets contains thousands of examples, which makes it possible to train deep learning systems and obtain good results.
-
Multiple-choice RC, where systems must select an answer from a set of candidates. Multiple-Choice (MC) is a common way to measure reading comprehension in humans, which is why some researchers have pointed to MC as a better format to test the language understanding of automatic systems (Rogers et al. 2020a). There exist several MC collections, mostly in English. In some cases, their creation involves paying crowd-workers to gather documents and/or pose questions about those documents. MCTest (Richardson et al. 2013), for example, asked workers to invent short, child-friendly, fictional stories and four questions with four answers each, including deliberately wrong answers. As a way to encourage a deeper understanding of texts, the QuAIL dataset includes unanswerable questions (Rogers et al. 2020b). Other datasets were created from real-world exams, as is the case of the well-known MC dataset RACE (Lai et al. 2017) or the multilingual Entrance Exams (Rodrigo et al. 2018).
-
Generative QA, where systems create a text that answers the question. The exact text is not necessarily contained in any document, which makes this a challenging task. This kind of system has received less attention, given that it is difficult to perform an exact evaluation and there are few datasets available (Kočiský et al. 2018).
5.2 Technologies
5.3 Evaluation of QA dialogue systems
-
Exact matching, which measures the percentage of candidate answers that match any one of the ground truth answers exactly.
-
Approximate matching based on F1, which measures the macro-averaged overlap between the bags of words of the candidate and ground-truth answers. A sketch of both metrics is given below.
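Both measures can be computed in a few lines; the sketch below follows the style of the SQuAD evaluation script but uses a simplified normalization (lower-casing and whitespace tokenization only).

```python
# Minimal sketch of the two answer-matching metrics: exact match and
# bag-of-words F1, each taken as the maximum over all ground-truth answers.
from collections import Counter

def normalize(text):
    # Simplified: real evaluation scripts also strip punctuation and articles.
    return text.lower().strip().split()

def exact_match(prediction, ground_truths):
    return max(int(normalize(prediction) == normalize(gt)) for gt in ground_truths)

def f1(prediction, ground_truths):
    def f1_single(pred, gt):
        overlap = sum((Counter(pred) & Counter(gt)).values())
        if overlap == 0:
            return 0.0
        precision, recall = overlap / len(pred), overlap / len(gt)
        return 2 * precision * recall / (precision + recall)
    return max(f1_single(normalize(prediction), normalize(gt)) for gt in ground_truths)

print(exact_match("the Eiffel Tower", ["Eiffel Tower", "the Eiffel Tower"]))  # 1
print(f1("the Eiffel Tower", ["Eiffel Tower"]))                               # 0.8
```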
-
NASA TLX (cognitive workload questionnaire): Used to measure the cognitive workload as subjects completed different scenarios.
-
Task questionnaire: Filled out after each task; it focuses on the experience of using a system for a specific task.
-
System questionnaire: Compiled after using a system for multiple tasks; it measures the overall experience of the subjects.
6 Evaluation datasets and challenges
6.1 Datasets for task-oriented dialogue systems
Name | Topics | # dialogues | Reference |
---|---|---|---|
DSTC1 | Bus schedules | 15,000 | (Williams et al. 2013) |
DSTC2 | Restaurants | 3000 | (Henderson et al. 2014) |
DSTC3 | Tourist information | 2265 | (Henderson et al. 2013a) |
DSTC4 & DSTC5 | Tourist information | 35 | (Kim et al. 2016) |
DSTC6 | Restaurant reservation | – | (Perez et al. 2017) |
DSTC7 (Flex Data) | Student guiding | 500 | (Gunasekara et al. 2019) |
DSTC8 (MetaLWOz) | 47 domains | 37,884 | (Lee et al. 2019) |
DSTC8 (Schema-Guided) | 20 domains | 22,825 | (Rastogi et al. 2019) |
MultiWOZ | Tourist information | 10,438 | (Budzianowski et al. 2018) |
Taskmaster-1 | 6 domains | 13,215 | (Byrne et al. 2019) |
MultiDoGo | 6 domains | 86,698 | (Peskov et al. 2019) |
6.2 Datasets for conversational dialogue systems
Name | Topics | # dialogues | References |
---|---|---|---|
Switchboard | Casual topics | 2400 | Godfrey et al. (1992) |
British national corpus | Casual topics | 854 | Leech (1993) |
SubTle corpus | Movie subtitles | 3.35M | Ameixa and Coheur (2013) |
Reddit domestic abuse corpus | Abuse help | 21,133 | Schrading (2015) |
Twitter corpus | Unrestricted | 1.3M | Ritter et al. (2010) |
Twitter triple corpus | Unrestricted | 4322 | Sordoni et al. (2015) |
Ubuntu dialogue corpus | Ubuntu problems | 930K | Lowe et al. (2015) |
bAbI | Restaurant reservation | 3000 | Bordes et al. (2017) |
OpenSubtitles | Movie subtitles | 36M | Tiedemann (2012) |
CornellMovie | Movie dialogues | 220K | Danescu and Lee (2011) |
6.3 Datasets for question answering dialogue systems
Name | Topics | # dialogues | References |
---|---|---|---|
Ubuntu dialogue corpus | Ubuntu problems | 930K | Lowe et al. (2015) |
MSDialog | Microsoft products | 35K | Qu et al. (2018) |
ibAbI | Restaurant reservation | – | Li et al. (2017a) |
CoQA | 7 domains | 8399 | Reddy et al. (2018) |
QuAC | People | 13,594 | Choi et al. (2018) |
DoQA | Cooking | 1637 | Campos et al. (2019) |