To enable a CA for e-Learning to adapt to information sources about the learner and to recover learner engagement, we first explore how to evaluate whether the learner is engaged in the conversation when interacting with a CA. An easily accessed information source is the conversational record, referred to as the “chat log”, of the learner’s interaction with the CA.
Classifying learner conversational behaviours
Two approaches to evaluating the conversation were investigated. The first involved creating a machine learning classifier trained on past conversational logs with the ITS, annotated with ratings for conversational quality and appropriateness of the user responses. The conversational logs from a previous study (Heller and Procter 2009) provided data from 10-min conversations by 90 participants chatting with the historical figure CA, Freudbot. For Procter, Lin, and Heller (2016), we developed a coding scheme to classify the learner input and the CA responses in the conversational logs, and identified the following two key features of the learner input that are associated with the level of learner engagement:
(1) Response Appropriateness: answering questions, responding to requests, addressing the topic under discussion, or changing to another domain-related topic.
(2) Conversational Quality: playing the role of conversant: using full sentences or phrases, not lone keywords, gibberish, or random characters, and non-repetitive utterances.
Each student response was manually coded for conversational quality on a scale from 1 to 3: a rating of 1 represents what one would expect during a conversation, while 3 indicates an utterance that would be considered strange and inconsistent in a conversation; 2 was assigned when the coder was unsure. Each input was also rated for appropriateness on a scale from 1 to 4, based on how the student response compared to the preceding ITS response. While initial performance figures are encouraging, this approach is still under development; in particular, more training data is required for examples with poorer ratings.
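Features of the kind described above (full sentences versus lone keywords or gibberish) could be extracted automatically as classifier input. The following is a minimal sketch only; the feature names and the crude gibberish test are our own illustrative assumptions, not the study’s actual feature set:

```python
import re

def turn_features(student_input: str) -> dict:
    """Extract simple surface features from one student turn.
    Illustrative only; the paper's actual feature set is not specified."""
    tokens = student_input.split()
    return {
        # length of the turn in whitespace-separated tokens
        "n_tokens": len(tokens),
        # crude proxy for "full sentence": four or more words, capitalized start
        "is_full_sentence": len(tokens) >= 4 and student_input[:1].isupper(),
        # crude gibberish test: a single run of four or more consonants
        "is_gibberish": bool(re.fullmatch(r"[^aeiouAEIOU\s]{4,}", student_input.strip())),
    }

print(turn_features("Tell me about the ego and the id."))
# {'n_tokens': 8, 'is_full_sentence': True, 'is_gibberish': False}
```

Such per-turn feature vectors, paired with the manual quality and appropriateness ratings, would form the training data for the classifier.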
Three patterns of learner behaviour
For the second approach, we manually generated algorithms to categorize student input by identifying conversational behavioural patterns. This method has proven reasonably robust. Identifying problematic conversational behaviours allows for a targeted form of intervention that attempts to repair or improve the conversation. This paper focuses on the second approach. We examined the logs of past studies using Freudbot (Heller, Procter, & Rose, 2016; Heller and Procter 2009) and identified three recurring patterns of learner behaviour:
- Tryer: The learner attempts to ask questions exactly as one would hope, using full (or nearly full) sentences on topics related to Freud. They continue to do this despite little or no success in getting Freud-related information from the CA. This trying behaviour is characterized by relatively long sentences, a high number of no-match cases per input, and possibly input words with high abstractness value, a measure of cognitive engagement (Wen et al. 2014).
- Keyworder: The learner answers questions or responds to bot output with single words or phrases associated with Freud or psychoanalysis, e.g. “ego”, “psychoanalysis”, “anxiety”, typically jumping from one topic to the next. This keywording behaviour can be detected by short inputs, non-repetition, a low number of no-match cases per input, and possibly a low abstractness value of input words.
- Morer: The learner discovers a word that leads to advancement through the narrative and repeats that word; for example, the learner just keeps saying “ok”. Moreing behaviour can be detected by recognizing backchannel-type words and phrases (“more”, “ok”, “I see”) and frequent consecutive repetition of those words.
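The detection signals listed for the three patterns can be combined into a single heuristic. The sketch below is illustrative only; the thresholds (average length, no-match rate, repetition count) are assumed values, not the tuned parameters of the implemented system:

```python
BACKCHANNELS = {"more", "ok", "okay", "i see", "go on"}

def classify_pattern(inputs, no_match_flags):
    """Heuristic sketch of the three behaviour patterns.
    Thresholds are illustrative assumptions, not the study's tuned values."""
    avg_len = sum(len(s.split()) for s in inputs) / len(inputs)
    no_match_rate = sum(no_match_flags) / len(no_match_flags)
    lowered = [s.lower().strip() for s in inputs]
    consecutive_repeats = sum(a == b for a, b in zip(lowered, lowered[1:]))
    if all(s in BACKCHANNELS for s in lowered) and consecutive_repeats >= 2:
        return "morer"       # same backchannel word repeated turn after turn
    if avg_len >= 5 and no_match_rate > 0.5:
        return "tryer"       # long inputs, mostly unmatched by the CA
    if avg_len <= 2 and no_match_rate < 0.2:
        return "keyworder"   # short inputs the CA usually matches
    return None

print(classify_pattern(["ok", "ok", "ok"], [0, 0, 0]))  # morer
```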
Behaviour detection
Learners may exhibit more than one of these behaviours. They may start off trying, eventually give up, and start moreing; or they might stick with one strategy, such as keywording, and never experience a proper conversation. Often these behaviours arise from poor performance on the part of the CA, with the learner attempting to find a strategy that results in useful information being returned. Special functions have been programmed to identify certain learner dialogue acts, such as backchannel comments, which are used in conversation to indicate that one is following along and to encourage the other conversational partner to continue (e.g. “Okay”, “I see”, “uh huh”). Freudbot is programmed to recognize these phrases and continue the narrative associated with the current topic. The agent keeps a history of the use of these words and determines whether the same term has been used repeatedly in consecutive turns. In a similar way, Freudbot checks whether the learner is a tryer, indicated by longer sentences, suggesting complex questions or comments, followed by repair statements from the CA indicating that it does not understand the learner input. The poor performance of the CA is an important aspect, because an intervention is not required if the CA is successfully responding to the learner input with appropriate educational content. If occurrences of this situation exceed a threshold, the associated data is published by a data source agent and received by a model agent. Another algorithm is used to detect potential keyworder behaviour.
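The history of backchannel use and the check for consecutive repetition might look like the following sketch (the class name and the threshold of 3 are assumptions for illustration, not the implemented agent’s values):

```python
BACKCHANNEL = {"ok", "okay", "more", "i see", "uh huh", "go on"}

class MorerDetector:
    """Tracks consecutive repeated backchannel inputs.
    The threshold of 3 is an assumed value, not the study's tuned setting."""
    def __init__(self, threshold=3):
        self.threshold = threshold
        self.last = None     # last backchannel word seen, if any
        self.streak = 0      # length of the current consecutive run

    def observe(self, user_input: str) -> bool:
        token = user_input.lower().strip()
        if token in BACKCHANNEL and token == self.last:
            self.streak += 1
        else:
            self.streak = 1 if token in BACKCHANNEL else 0
        self.last = token if token in BACKCHANNEL else None
        return self.streak >= self.threshold

d = MorerDetector()
print([d.observe(x) for x in ["ok", "ok", "ok"]])  # [False, False, True]
```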
In each case, if the behaviour is detected enough times to exceed a predetermined threshold, the appropriate learner label (tryer, morer, or keyworder) is applied and published to the information stream. The learner model agent collects this determination, possibly integrates it with other data such as the conversation quality rating, and decides whether it should be passed on to the agent responsible for initiating interventions.
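The threshold-and-publish step can be sketched as follows. The function name, the per-behaviour counters, and the publish-once policy are our illustrative assumptions; the actual agent messaging API is not reproduced here:

```python
def make_labeller(thresholds, publish):
    """Publish a behaviour label once its detection count reaches the
    threshold (counts, thresholds, and publish-once policy are illustrative)."""
    counts = {b: 0 for b in thresholds}
    published = set()

    def record(behaviour):
        counts[behaviour] += 1
        if counts[behaviour] >= thresholds[behaviour] and behaviour not in published:
            published.add(behaviour)
            publish(behaviour)  # e.g. put the label on the information stream
    return record

events = []
record = make_labeller({"tryer": 2, "morer": 3, "keyworder": 2}, events.append)
for b in ["tryer", "morer", "tryer", "tryer"]:
    record(b)
print(events)  # ['tryer']
```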
To tune and select the best parameters for the conversational behaviour detection algorithms, we manually rated 26 conversations (613 turn pairs) from the chat logs of a previous experiment (Heller and Procter 2009). Each conversation was assigned a rating for each of the three types of behaviour: trying, keywording, and moreing. False positives were judged to have a negative effect, since they are likely to trigger inappropriate interventions; this can confuse the learner and undermine the perception of intelligence that plays a large part in engaging the learner. Results from comparing the manual and automated ratings were used to find the best balance between catching the behaviour and not accidentally triggering a false intervention.
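A tuning loop of this kind, biased against false positives, could score candidate parameter settings by F0.5 against the manual ratings. This is a sketch under our own assumptions (a single scalar detection score per conversation and a grid of candidate thresholds); the actual parameters tuned are not listed in the text:

```python
def f_beta(tp, fp, fn, beta=0.5):
    """F-beta from confusion counts; beta < 1 favours precision."""
    p = tp / (tp + fp) if tp + fp else 0.0
    r = tp / (tp + fn) if tp + fn else 0.0
    if p == 0 and r == 0:
        return 0.0
    b2 = beta * beta
    return (1 + b2) * p * r / (b2 * p + r)

def tune_threshold(scores, labels, candidates):
    """Pick the detection threshold maximizing F0.5 against manual labels."""
    best_t, best_f = None, -1.0
    for t in candidates:
        preds = [s >= t for s in scores]
        tp = sum(p and l for p, l in zip(preds, labels))
        fp = sum(p and not l for p, l in zip(preds, labels))
        fn = sum((not p) and l for p, l in zip(preds, labels))
        f = f_beta(tp, fp, fn)
        if f > best_f:
            best_t, best_f = t, f
    return best_t, best_f
```

Maximizing F0.5 rather than F1 encodes the stated preference for avoiding false positives over catching every instance.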
To measure the accuracy of the algorithms, chat logs from the current study (see Section 4) were manually coded to identify the three behaviors. The human coder read the entire log for a participant and assigned any behaviors observed, and a confidence rating from 1 (low) to 3 (high) for each behavior.
The agent’s behaviour assignments were compared against those of the human coder. Observations with low confidence ratings were ignored. As anticipated, the algorithms minimized false positives at the expense of false negatives, resulting in relatively high values for precision and relatively low values for recall (Table 1). Accuracy ratings are included, but because there was a significant class imbalance for each of the behaviours, accuracy is potentially misleading as a performance measure. (Of 56 participants, manual coding found 48 tryers, 8 keyworders, and 21 morers.) F-scores, the harmonic mean of precision and recall, indicate whether the balance of the two is reasonable. The F0.5 score is considered more appropriate because it weights recall lower than precision (attenuating the influence of false negatives), which is consistent with the design objective of avoiding false positives ahead of reducing false negatives.
Table 1
Algorithm performance

| Behaviour | Accuracy | Precision | Recall | F1 | F0.5 |
| Tryer | 0.702 | 0.919 | 0.708 | 0.800 | 0.867 |
| Keyworder | 0.912 | 0.714 | 0.625 | 0.667 | 0.694 |
| Morer | 0.807 | 1.000 | 0.421 | 0.593 | 0.784 |
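The F-score values in Table 1 follow directly from the precision and recall columns; for instance, the Morer row can be reproduced from its precision and recall:

```python
def f_score(precision, recall, beta=1.0):
    """F-beta score; beta < 1 weights precision more heavily than recall."""
    b2 = beta ** 2
    return (1 + b2) * precision * recall / (b2 * precision + recall)

# Morer row of Table 1: precision 1.000, recall 0.421
print(round(f_score(1.0, 0.421), 3))       # 0.593 (F1)
print(round(f_score(1.0, 0.421, 0.5), 3))  # 0.784 (F0.5)
```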
Interventions
We do not measure the level of engagement directly, but instead detect behaviours associated with poor (e.g. keywording, moreing) and good (e.g. trying) conversational engagement. In Procter, Lin, and Heller (2016) we describe how the behaviour detection algorithms were implemented as software agents that parse and analyze the conversation in real time to evaluate the learner’s conversational behaviour. The detection of any of the three conversational behaviours triggers an appropriate conversational intervention. The CA Representation (CA-REP) agent is responsible for monitoring events from the detection agents and can direct the CA to inject an intervention into the conversation; we refer to this agent as the Intervention agent in this paper. The interventions support the pedagogical design described in Section 1 by encouraging the learner to make full use of the interactive narrative and conversational interface when it is determined that the student is not conversing or not exploring the narrative. The three behaviours and associated interventions are described briefly in Table 2.
Table 2
Behaviour types and interventions

| Behaviour | Description | Intervention no. | Intervention |
| Tryer | Attempts to use proper conversation but CA does not match most input | 1 | CA apologizes and suggests topics based on learner’s area of interest (Freud’s life or theories) |
| Keyworder | Does not attempt to converse. Enters single words or short phrases | 2 | Suggests conversational phrases to advance further into topics (“Tell me more about…”) |
| Morer | Advances through topics by repeating the same “more” type word (“ok”, “more”, “go on”) | 3 | Reminds learner they can branch to other topics (“Tell me about”) or come back to a topic (“Tell me more about…”) |
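The Intervention agent’s role of monitoring detection events and directing the CA to intervene could be sketched as below. The class name, the callback, and the issue-once-per-behaviour policy are hypothetical assumptions for illustration; the actual agent API is not specified in the text:

```python
class InterventionAgent:
    """Monitors detection events and directs the CA to intervene.
    Sketch assumes each behaviour triggers its intervention at most once."""
    def __init__(self, ca_say):
        self.ca_say = ca_say      # callback that injects text into the conversation
        self.handled = set()

    def on_event(self, behaviour, intervention_text):
        if behaviour in self.handled:
            return False          # already intervened for this behaviour
        self.handled.add(behaviour)
        self.ca_say(intervention_text)
        return True

out = []
agent = InterventionAgent(out.append)
agent.on_event("morer", "Remember you can branch to other topics.")
agent.on_event("morer", "Remember you can branch to other topics.")
print(out)  # a single intervention despite two events
```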
The problem, simply put, is that the learner either is not managing to get to the CA content, as in the case of the tryer, or is not doing so through a conversational approach (morer and keyworder). The first type of problem is serious; the second is suboptimal because the learner does not make use of the conversational capabilities of the CA. Although morer behaviour does expose significant Freud content, it is not much different from reading a book, and keywording is like using a search engine. Both cases leave little motivation for the learner to interact again, and both would likely result in a poor rating of the CA.
If the learner can obtain content through a conversational approach, then there is no need to change anything: the learner is left in control of the conversation. If the system can recognize that the learner is having trouble obtaining content through a conversational approach, i.e. a tryer, the Intervention agent can address this by directing the CA to take some control of the conversation to introduce relevant topics. While this takes some control away from the learner, it is preferable to the learner having to resort to other behaviours to obtain useful information, such as just saying ‘yes’ (moreing), or using non-conversational input such as keywords.
In the case of trying behaviour, rather than stating “I don’t understand” (or a similar ‘default’ response), the CA ‘recognizes’ the problem and asks a question: “I don’t seem to be doing very well in trying to understand your comments and questions. If I can ask, are you more interested in my theories, or in my life?”. The CA uses the learner’s response to suggest an appropriate topic (theories, life/people, or both, depending on the stated preference). Additionally, future “no-match” responses will favour repair strategies that suggest topics related to the learner’s interest, or ask leading questions related to it. These repair strategies take away some of the learner’s control of the conversation, but are more likely to result in information being delivered.
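The adjusted no-match handling could look like the following sketch. The function name, topic lists, and reply phrasing are illustrative assumptions, not Freudbot’s actual repair templates:

```python
def no_match_reply(interest, topics):
    """Sketch of a no-match repair that favours the learner's stated interest.
    Falls back to any known topic if the interest is unrecognized."""
    pool = topics.get(interest) or [t for v in topics.values() for t in v]
    topic = pool[0]  # deterministic choice for illustration
    return f"I'm not sure I understood that. Would you like to hear about {topic}?"

topics = {"theories": ["dream interpretation", "the ego"], "life": ["my years in Vienna"]}
print(no_match_reply("theories", topics))
```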
In the case of keywording behaviour, the normal response from the CA will have the intervention appended to it (“I can’t help noticing you have a somewhat abrupt conversational style. In any case, you can ask me to tell you more about a topic if you’d like to go into more depth.”). The intention is to at least encourage the learner to use conversational directives to experience the narrative structure and appreciate the depth of the content, rather than just seeing the first section of each topic.
In the case of moreing behaviour, the process of triggering an intervention is the same as for keywording, i.e. the intervention is appended to a normal response. It informs the learner “You seem to be advancing the conversation by repeating the same word. This does allow you to cover a topic thoroughly, but remember that you can branch off to other topics (‘Tell me about...’) and come back to a topic (‘Tell me more about…’).” Again, the intention is to provide the learner with other ways to interact and encourage them to do so in a conversational way.
A secondary potential benefit of the interventions is to suggest that the CA has some level of awareness (of the learner’s behaviour) and therefore promote a sense of social presence.