
2023 | Book

Artificial Intelligence in Education

24th International Conference, AIED 2023, Tokyo, Japan, July 3–7, 2023, Proceedings

Editors: Ning Wang, Genaro Rebolledo-Mendez, Noboru Matsuda, Olga C. Santos, Vania Dimitrova

Publisher: Springer Nature Switzerland

Book Series: Lecture Notes in Computer Science


About this book

This book constitutes the refereed proceedings of the 24th International Conference on Artificial Intelligence in Education, AIED 2023, held in Tokyo, Japan, during July 3–7, 2023. This event took place in hybrid mode.

The 53 full papers and 26 short papers presented in this book were carefully reviewed and selected from 311 submissions.

The papers present results of high-quality research on intelligent systems and the cognitive sciences for the improvement and advancement of education. The conference was hosted by the prestigious International Artificial Intelligence in Education Society, a global association of researchers and academics specializing in the many fields that comprise AIED, including, but not limited to, computer science, the learning sciences, and education.

Table of Contents

Frontmatter

Full Papers

Frontmatter
Machine-Generated Questions Attract Instructors When Acquainted with Learning Objectives

Answering questions is an essential learning activity on online courseware. It has been shown that merely answering questions facilitates learning. However, generating pedagogically effective questions is challenging. Although there have been studies on automated question generation, the primary research concern thus far has been whether and how question generation techniques can produce answerable questions, and how effective those questions are anticipated to be. We propose Quadl, a pragmatic method for generating questions that are aligned with specific learning objectives. We applied Quadl to an existing online course and conducted an evaluation study with in-service instructors. The results showed that questions generated by Quadl were rated as on par with human-generated questions in terms of their relevance to the learning objectives. The instructors also expressed that they would be equally likely to adapt Quadl-generated questions to their course as they would human-generated questions. The results further showed that Quadl-generated questions were better than those generated by a state-of-the-art question generation model that does not take learning objectives into account.

Machi Shimmei, Norman Bier, Noboru Matsuda
SmartPhone: Exploring Keyword Mnemonic with Auto-generated Verbal and Visual Cues

In second language vocabulary learning, existing works have primarily focused either on the learning interface or on scheduling personalized retrieval practice to maximize memory retention. However, the learning content, i.e., the information presented on flashcards, has mostly remained constant. Keyword mnemonic is a notable learning strategy that relates new vocabulary to existing knowledge by building an acoustic and imagery link using a keyword that sounds alike. However, producing the verbal and visual cues associated with the keyword to facilitate building these links has required a manual process and is not scalable. In this paper, we explore the opportunity to use large language models to automatically generate verbal and visual cues for keyword mnemonics. Our approach, an end-to-end pipeline for auto-generating verbal and visual cues, can automatically generate highly memorable cues. We investigate the effectiveness of our approach via a human participant experiment, comparing it with manually generated cues.

Jaewook Lee, Andrew Lan
Implementing and Evaluating ASSISTments Online Math Homework Support at Large Scale over Two Years: Findings and Lessons Learned

Math performance continues to be an important focus for improvement. The most recent National Report Card in the U.S. suggested that student math scores declined in the past two years, possibly due to the COVID-19 pandemic and related school closures. We report on the implementation of a math homework program that leverages AI-based one-to-one technology in 32 schools over two years, as part of a randomized controlled trial in diverse settings across the state of North Carolina in the US. The program, called “ASSISTments,” provides feedback to students as they solve homework problems and automatically prepares reports for teachers about student performance on daily assignments. The paper describes the sample, the study design, and the implementation of the intervention, including the recruitment effort, the training and support provided to teachers, and the approaches taken to assess teachers’ progress and improve implementation fidelity. Analysis of data collected during the study suggests that (a) treatment teachers changed their homework review practices as they used ASSISTments, and (b) usage of ASSISTments was positively correlated with student learning outcomes.

Mingyu Feng, Neil Heffernan, Kelly Collins, Cristina Heffernan, Robert F. Murphy
The Development of Multivariable Causality Strategy: Instruction or Simulation First?

Understanding phenomena by exploring complex interactions between variables is a challenging task for students of all ages. While the use of simulations to support exploratory learning of complex phenomena is common, students still struggle to make sense of interactive relationships between factors. Here we study the applicability of the Problem Solving before Instruction (PS-I) approach in this context. In PS-I, learners are given complex tasks that help them make sense of the domain before receiving instruction on the target concepts. While PS-I has been shown to be effective for teaching complex topics, it has yet to show benefits for learning general inquiry skills. We therefore tested the effect of exploring with simulations before instruction (as opposed to afterward) on the development of a multivariable causality strategy (MVC-strategy). Undergraduate students (N = 71) completed two exploration tasks using a simulation of virus transmission. Students completed Task1 either before (Exploration-first condition) or after (Instruction-first condition) instruction on multivariable causality, and completed Task2 at the end of the intervention. They then completed transfer Task3 with a simulation on the topic of predator-prey relationships. Results showed that Instruction-first improved students’ efficiency of MVC-strategy on Task1. However, these gaps had disappeared by Task2. Interestingly, Exploration-first yielded higher efficiency of MVC-strategy on transfer Task3. These results show that while Exploration-first did not promote performance on the learning activity, it in fact improved learning on the transfer task, consistent with the PS-I literature. This is the first time PS-I has been found effective in teaching students better exploration strategies.

Janan Saba, Manu Kapur, Ido Roll
Content Matters: A Computational Investigation into the Effectiveness of Retrieval Practice and Worked Examples

In this paper we argue that artificial intelligence models of learning can contribute precise theory to explain surprising student learning phenomena. In some past studies of student learning, practice produces better learning than studying examples, whereas other studies show the opposite result. We reconcile and explain this apparent contradiction by suggesting that retrieval practice and example study involve different learning cognitive processes, memorization and induction, respectively, and that each process is optimal for learning different types of knowledge. We implement and test this theoretical explanation by extending an AI model of human cognition — the Apprentice Learner Architecture (AL) — to include both memory and induction processes and comparing the behavior of the simulated learners with and without a forgetting mechanism to the behavior of human participants in a laboratory study. We show that, compared to simulated learners without forgetting, the behavior of simulated learners with forgetting matches that of human participants better. Simulated learners with forgetting learn best using retrieval practice in situations that emphasize memorization (such as learning facts or simple associations), whereas studying examples improves learning in situations where there are multiple pieces of information available and induction and generalization are necessary (such as when learning skills or procedures).

Napol Rachatasumrit, Paulo F. Carvalho, Sophie Li, Kenneth R. Koedinger
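
To make the role of the forgetting mechanism concrete, here is a minimal, self-contained sketch of an ACT-R-style base-level activation model, the classic formulation of memory strength with decay; the decay rate, threshold, and noise values are illustrative assumptions, not the Apprentice Learner Architecture's actual implementation.

```python
import math

def base_level_activation(practice_times, now, decay=0.5):
    """ACT-R-style base-level activation: each past practice at time t
    contributes (now - t)^(-decay); activation is the log of the sum,
    so strength grows with practice and decays with elapsed time."""
    return math.log(sum((now - t) ** -decay for t in practice_times))

def recall_probability(activation, threshold=0.0, noise=0.25):
    """Logistic mapping from activation to probability of successful
    retrieval, as in ACT-R's retrieval equation."""
    return 1.0 / (1.0 + math.exp((threshold - activation) / noise))

# A fact practiced at t = 0, 10, and 60 seconds, probed at t = 300:
act = base_level_activation([0, 10, 60], now=300)
print(f"activation={act:.2f}, p(recall)={recall_probability(act):.2f}")
```

Under a scheme like this, repeated retrievals keep an item above the retrieval threshold, which is consistent with retrieval practice outperforming example study for memorization-heavy knowledge.
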
Investigating the Utility of Self-explanation Through Translation Activities with a Code-Tracing Tutor

Code tracing is a foundational programming skill that involves simulating a program’s execution line by line, tracking how variables change at each step. To code trace, students need to understand what a given program line means, which can be accomplished by translating it into plain English. Translation can be characterized as a form of self-explanation, a general learning mechanism that involves making inferences beyond the instructional materials. Our work investigates whether this form of self-explanation improves learning from a code-tracing tutor we created using the CTAT framework. We created two versions of the tutor. In the experimental version, students were asked to translate lines of code while solving code-tracing problems. In the control condition, students were only asked to code trace without translating. The two tutor versions were compared in a between-subjects study (N = 44). The experimental group performed significantly better on translation and code-generation questions, but the control group performed significantly better on code-tracing questions. We discuss the implications of this finding for the design of tutors providing code-tracing support.

Maia Caughey, Kasia Muldner
Reducing the Cost: Cross-Prompt Pre-finetuning for Short Answer Scoring

Automated Short Answer Scoring (SAS) is the task of automatically scoring a given input to a prompt based on rubrics and reference answers. Although SAS is useful in real-world applications, both rubrics and reference answers differ between prompts, so new data must be acquired and a model trained for each new prompt. Such requirements are costly, especially for schools and online courses where resources are limited and only a few prompts are used. In this work, we attempt to reduce this cost through a two-phase approach: train a model on existing rubrics and answers with gold score signals, then finetune it on a new prompt. Specifically, given that scoring rubrics and reference answers differ for each prompt, we utilize key phrases, i.e., representative expressions that an answer should contain to receive a higher score, and train a SAS model to learn the relationship between key phrases and answers using already-annotated prompts (i.e., cross-prompt data). Our experimental results show that finetuning on existing cross-prompt data with key phrases significantly improves scoring accuracy, especially when the training data is limited. Finally, our extensive analysis shows that it is crucial to design the model so that it can learn the general properties of the task. We publicly release our code and all of the experimental settings for reproducing our results ( https://github.com/hiro819/Reducing-the-cost-cross-prompt-prefinetuning-for-SAS ).

Hiroaki Funayama, Yuya Asazuma, Yuichiroh Matsubayashi, Tomoya Mizumoto, Kentaro Inui
Go with the Flow: Personalized Task Sequencing Improves Online Language Learning

Machine learning (ML) based adaptive learning promises great improvements in personalized learning for various learning contexts. However, it is necessary to examine the effectiveness of different interventions in specific learning areas. We conducted an online controlled experiment comparing an online learning environment for spelling with an ML-based implementation of the same learning platform, which is used in schools of all types across Germany. Our study focuses on different ML-based adaptive task sequencing interventions, which are compared against a control group. We evaluated nearly 500,000 tasks using different metrics. In total, almost 6,000 students from class levels 5 to 13 (ages 11–19) participated in the experiment. Our results show that the relative number of incorrect answers significantly decreased in both intervention groups. Other factors, such as dropouts or competencies, show mixed results. Our experiment demonstrates that personalized task sequencing can be implemented as an ML-based intervention and improves error rates and dropout rates in language learning for students. However, the impact depends on the specific type of task sequencing.

Nathalie Rzepka, Katharina Simbeck, Hans-Georg Müller, Niels Pinkwart
Automated Hand-Raising Detection in Classroom Videos: A View-Invariant and Occlusion-Robust Machine Learning Approach

Hand-raising signals students’ willingness to participate actively in classroom discourse. It has been linked to students’ academic achievement and cognitive engagement and constitutes an observable indicator of behavioral engagement. However, due to the large amount of effort involved in manual hand-raising annotation by human observers, research on this phenomenon, which would enable teachers to understand and foster active classroom participation, is still scarce. Automated detection of hand-raising events in classroom videos can offer a time- and cost-effective substitute for manual coding. From a technical perspective, the main challenges for automated detection in the classroom setting are diverse camera angles and student occlusions. In this work, we propose utilizing and further extending a novel view-invariant, occlusion-robust machine learning approach with long short-term memory networks for hand-raising detection in classroom videos based on body pose estimation. We employed a dataset stemming from 36 real-world classroom videos, capturing 127 students from grades 5 to 12 and 2442 manually annotated authentic hand-raising events. Our temporal model trained on body pose embeddings achieved an F1 score of 0.76. When employing this approach for the automated annotation of hand-raising instances, a mean absolute error of 3.76 for the number of detected hand-raisings per student per lesson was achieved. We demonstrate its application by investigating the relationship between hand-raising events and self-reported cognitive engagement, situational interest, and involvement, using manually annotated and automatically detected hand-raising instances. Furthermore, we discuss the potential of our approach to enable future large-scale research on student participation, as well as privacy-preserving data collection in the classroom context.

Babette Bühler, Ruikun Hou, Efe Bozkir, Patricia Goldberg, Peter Gerjets, Ulrich Trautwein, Enkelejda Kasneci
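
The temporal model described above can be pictured as a sequence classifier over per-frame pose embeddings. Below is a minimal PyTorch sketch of that idea; the embedding size, window length, and layer widths are placeholder assumptions, not the paper's configuration.

```python
import torch
import torch.nn as nn

class HandRaiseLSTM(nn.Module):
    """Binary hand-raising classifier over a window of per-frame
    body-pose embeddings (dimensions are illustrative)."""
    def __init__(self, embed_dim=64, hidden_dim=128):
        super().__init__()
        self.lstm = nn.LSTM(embed_dim, hidden_dim, batch_first=True)
        self.head = nn.Linear(hidden_dim, 1)

    def forward(self, x):             # x: (batch, frames, embed_dim)
        _, (h_n, _) = self.lstm(x)    # final hidden state per clip
        return self.head(h_n[-1])     # logits: (batch, 1)

model = HandRaiseLSTM()
window = torch.randn(8, 30, 64)       # 8 clips of 30 frames each
probs = torch.sigmoid(model(window))  # per-clip hand-raising probability
```
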
Robust Educational Dialogue Act Classifiers with Low-Resource and Imbalanced Datasets

Dialogue acts (DAs) can represent the conversational actions of tutors and students during tutoring dialogues. Automating the identification of DAs in tutoring dialogues is significant for the design of dialogue-based intelligent tutoring systems. Many prior studies employ machine learning models to classify DAs in tutoring dialogues and invest much effort in optimizing classification accuracy using limited amounts of training data (i.e., a low-resource data scenario). Beyond classification accuracy, however, the robustness of the classifier is also important; it reflects the classifier's ability to learn patterns from different class distributions. We note that many prior studies on classifying educational DAs employ cross-entropy (CE) loss to optimize DA classifiers on low-resource data with imbalanced DA distributions. The DA classifiers in these studies tend to prioritize accuracy on the majority class at the expense of the minority class, and thus may not be robust to data with different imbalance ratios across DA classes. To optimize the robustness of classifiers under imbalanced class distributions, we propose optimizing the performance of the DA classifier by maximizing the area under the ROC curve (AUC) score (i.e., AUC maximization). Through extensive experiments, our study provides evidence that (i) by maximizing AUC during training, the DA classifier achieves significant performance improvements over the CE approach under low-resource data, and (ii) AUC maximization approaches can improve the robustness of the DA classifier under different class imbalance ratios.

Jionghao Lin, Wei Tan, Ngoc Dang Nguyen, David Lang, Lan Du, Wray Buntine, Richard Beare, Guanliang Chen, Dragan Gašević
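
Maximizing AUC amounts to ranking positive-class instances above negative-class ones, which suggests a pairwise surrogate loss. The sketch below is one simple PyTorch realization of that idea, with an assumed squared-hinge surrogate and margin; the paper's exact AUC maximization objective may differ.

```python
import torch

def pairwise_auc_loss(logits, labels, margin=1.0):
    """Squared-hinge AUC surrogate: penalize every positive/negative
    pair whose score gap is below `margin`. Driving this loss to zero
    ranks all positives above all negatives, i.e., AUC = 1."""
    pos = logits[labels == 1]
    neg = logits[labels == 0]
    diff = pos.unsqueeze(1) - neg.unsqueeze(0)     # all pos/neg pairs
    return torch.clamp(margin - diff, min=0).pow(2).mean()

logits = torch.tensor([2.1, -0.3, 0.4, 1.2])
labels = torch.tensor([1, 0, 0, 1])
print(pairwise_auc_loss(logits, labels))           # tensor(0.0100)
```

Because every minority-class instance participates in many pairs, a loss of this form does not let the classifier ignore rare DA classes the way plain cross entropy can.
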
What and How You Explain Matters: Inquisitive Teachable Agent Scaffolds Knowledge-Building for Tutor Learning

Students learn by teaching a teachable agent, a phenomenon called tutor learning. The literature suggests that tutor learning happens when students (who tutor the teachable agent) actively reflect on their knowledge when responding to the teachable agent's inquiries (knowledge-building). However, most students lean towards delivering what they already know instead of reflecting on their knowledge (knowledge-telling). Knowledge-telling behavior weakens the effect of tutor learning. We hypothesize that the teachable agent can help students commit to knowledge-building by being inquisitive and asking follow-up inquiries when students engage in knowledge-telling. Despite the known benefits of knowledge-building, no prior work has operationalized the identification of knowledge-building and knowledge-telling features in students' responses to a teachable agent's inquiries and steered students toward knowledge-building. We propose Constructive Tutee Inquiry, which provides follow-up inquiries to guide students toward knowledge-building when they give a knowledge-telling response. Results from an evaluation study show that students who were treated with Constructive Tutee Inquiry not only outperformed those who were not but also, over time, learned to engage in knowledge-building without the aid of follow-up inquiries.

Tasmia Shahriar, Noboru Matsuda
Help Seekers vs. Help Accepters: Understanding Student Engagement with a Mentor Agent

Help from virtual pedagogical agents has the potential to improve student learning. Yet students often do not seek help when they need it, do not use help effectively, or ignore the agent’s help altogether. This paper seeks to better understand students’ patterns of accepting and seeking help in a computer-based science program called Betty’s Brain. Focusing on student interactions with the mentor agent, Mr. Davis, we examine the factors associated with patterns of help acceptance and help seeking; the relationship between help acceptance and help seeking; and how each behavior is related to learning outcomes. First, we examine whether students accepted help from Mr. Davis, operationalized as whether they followed his suggestions to read specific textbook pages. We find a significant positive relationship between help acceptance and student post-test scores. Despite this, help accepters made fewer positive statements about Mr. Davis in the interviews. Second, we identify how many times students proactively sought help from Mr. Davis. Students who most frequently sought help demonstrated more confusion while learning (measured using an interaction-based machine learning detector), tended to have higher science anxiety, and made more negative statements about Mr. Davis, compared to those who made few or no requests. However, help seeking was not significantly related to post-test scores. Finally, we draw on the qualitative interviews to consider how students understand and articulate their experiences with help from Mr. Davis.

Elena G. van Stee, Taylor Heath, Ryan S. Baker, J. M. Alexandra L. Andres, Jaclyn Ocumpaugh
Adoption of Artificial Intelligence in Schools: Unveiling Factors Influencing Teachers’ Engagement

Despite existing evidence about the impact of AI-based adaptive learning platforms, their scaled adoption in schools is slow at best. In addition, the AI tools adopted in schools are not always the considered and studied products of the research community. There have therefore been increasing concerns about identifying the factors that influence adoption and studying the extent to which these factors can predict teachers' engagement with adaptive learning platforms. To address this, we developed a reliable instrument to measure a more holistic set of factors influencing teachers' adoption of adaptive learning platforms in schools. We also present the results of its implementation with school teachers (n = 792) sampled from a large country-level population, and use this data to predict teachers' real-world engagement with an adaptive learning platform in schools. Our results show that although teachers' knowledge, confidence, and product quality are all important factors, they are not the only, and may not even be the most important, factors influencing teachers' engagement with AI platforms in schools. Avoiding additional workload, increasing teacher ownership and trust, providing support mechanisms for help, and assuring that ethical issues are minimised are also essential for the adoption of AI in schools and may predict teachers' engagement with the platform better. We conclude the paper with a discussion of the value of the identified factors for increasing the real-world adoption and effectiveness of adaptive learning platforms, by increasing the dimensions of variability in prediction models and decreasing implementation variability in practice.

Mutlu Cukurova, Xin Miao, Richard Brooker
The Road Not Taken: Preempting Dropout in MOOCs

Massive Open Online Courses (MOOCs) are often plagued by low levels of student engagement and retention, with many students dropping out before completing the course. In an effort to improve student retention, educational researchers are increasingly turning to the latest Machine Learning (ML) models to predict student learning outcomes, based on which instructors can provide timely support to at-risk students as a course progresses. Though they achieve high prediction accuracy, these models are often “black boxes,” making it difficult to derive instructional insights from their results; accordingly, designing meaningful and actionable interventions remains challenging in the context of MOOCs. To tackle this problem, we present an innovative approach based on the Hidden Markov Model (HMM). We model students’ temporal interaction patterns in MOOCs in a transparent and interpretable manner, with the aim of empowering instructors to gain insights about actionable interventions in students’ next-step learning activities. Through extensive evaluation on two large-scale MOOC datasets, we demonstrate that, by gaining a temporally grounded understanding of students’ learning processes using an HMM, both a student’s current engagement state and potential future state transitions can be learned; based on these, an actionable next-step intervention tailored to the student’s current engagement state can be formulated and recommended. These findings have strong implications for the real-world adoption of HMMs for promoting student engagement and preempting dropout.

Lele Sha, Ed Fincham, Lixiang Yan, Tongguang Li, Dragan Gašević, Kobi Gal, Guanliang Chen
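
As an illustration of the modeling idea, the sketch below fits an HMM over a coded clickstream, where each hidden state can be read as an engagement state and the transition matrix shows likely next-step activities. The event coding is hypothetical, and CategoricalHMM is assumed to be available in the installed hmmlearn release.

```python
import numpy as np
from hmmlearn import hmm  # assumes a recent hmmlearn with CategoricalHMM

# Toy clickstreams: each integer codes one interaction type
# (0 = watch video, 1 = attempt quiz, 2 = post in forum, 3 = idle).
sequences = [np.array([0, 0, 1, 2, 1]), np.array([0, 3, 3, 3])]
X = np.concatenate(sequences).reshape(-1, 1)
lengths = [len(s) for s in sequences]

# Three hidden "engagement" states, learned from the observed events.
model = hmm.CategoricalHMM(n_components=3, random_state=0)
model.fit(X, lengths)

print(model.transmat_)       # interpretable state-transition structure
print(model.predict(X))      # decoded engagement state per event
```

Unlike a black-box predictor, both the learned transition matrix and the decoded state sequence can be inspected directly, which is what makes the intervention design transparent.
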
Does Informativeness Matter? Active Learning for Educational Dialogue Act Classification

Dialogue Acts (DAs) can be used to explain what expert tutors do and what students know during the tutoring process. Most empirical studies adopt random sampling to obtain sentence samples for manual annotation of DAs, which are then used to train DA classifiers. However, these studies have paid little attention to sample informativeness, which reflects the information quantity of the selected samples and the extent to which a classifier can learn patterns from them. Notably, informativeness may vary among samples, and a classifier might need only a small number of low-informativeness samples to learn their patterns. Random sampling may overlook sample informativeness, consuming human labelling effort on samples that contribute little to training the classifiers. As an alternative, researchers suggest employing the statistical sampling methods of Active Learning (AL) to identify informative samples for training classifiers. However, the use of AL methods in educational DA classification tasks is under-explored. In this paper, we examine the informativeness of annotated sentence samples and investigate how AL methods can select informative samples to support DA classifiers during the AL sampling process. The results reveal that most annotated sentences in the training dataset present low informativeness, and that the patterns of these sentences can be easily captured by the DA classifier. We also demonstrate how AL methods can reduce the cost of manual annotation during the AL sampling process.

Wei Tan, Jionghao Lin, David Lang, Guanliang Chen, Dragan Gašević, Lan Du, Wray Buntine
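
A common statistical AL method of the kind the paper examines is uncertainty sampling: annotate the pool sentences the current classifier is least sure about. The sketch below uses a least-confidence criterion with scikit-learn; the features and batch size are placeholders, and the paper compares several AL methods rather than this one alone.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

def uncertainty_sampling(model, X_pool, batch_size=10):
    """Return indices of the pool samples with the least-confident
    predicted class distribution (a standard AL criterion)."""
    probs = model.predict_proba(X_pool)
    uncertainty = 1.0 - probs.max(axis=1)
    return np.argsort(-uncertainty)[:batch_size]

rng = np.random.default_rng(0)
X_labeled = rng.normal(size=(20, 5))        # small seed of annotated DAs
y_labeled = rng.integers(0, 3, 20)          # three DA classes (toy)
X_pool = rng.normal(size=(200, 5))          # unlabeled candidate sentences

clf = LogisticRegression(max_iter=1000).fit(X_labeled, y_labeled)
print(uncertainty_sampling(clf, X_pool))    # send these to annotators next
```
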
Can Virtual Agents Scale Up Mentoring?: Insights from College Students’ Experiences Using the CareerFair.ai Platform at an American Hispanic-Serving Institution

Mentoring promotes underserved students’ persistence in STEM but is difficult to scale up. Conversational virtual agents can help address this problem by conveying a mentor’s experiences to larger audiences. The present study examined college students’ (N = 138) utilization of CareerFair.ai, an online platform featuring virtual agent-mentors that were self-recorded by sixteen real-life mentors and built using principles from the earlier MentorPal framework. Participants completed a single-session study which included 30 min of active interaction with CareerFair.ai, sandwiched between pre-test and post-test surveys. Students’ user experience and learning gains were examined, both for the overall sample and with a lens of diversity and equity across different, potentially underserved demographic groups. Findings included positive pre/post changes in intent to pursue STEM coursework and high user acceptance ratings (e.g., expected benefit, ease of use), with under-represented minority (URM) students giving significantly higher ratings on average than non-URM students. Self-reported learning gains of interest, actual content viewed on the CareerFair.ai platform, and actual learning gains were associated with one another, suggesting that the platform may be a useful resource in meeting a wide range of career exploration needs. Overall, the CareerFair.ai platform shows promise in scaling up aspects of mentoring to serve the needs of diverse groups of college students.

Yuko Okado, Benjamin D. Nye, Angelica Aguirre, William Swartout
Real-Time AI-Driven Assessment and Scaffolding that Improves Students’ Mathematical Modeling during Science Investigations

Developing models and using mathematics are two key practices in internationally recognized science education standards, such as the Next Generation Science Standards (NGSS) [1]. However, students often struggle at the intersection of these practices, i.e., developing mathematical models about scientific phenomena. In this paper, we present the design and initial classroom test of AI-scaffolded virtual labs that help students practice these competencies. The labs automatically assess fine-grained sub-components of students’ mathematical modeling competencies based on the actions they take to build their mathematical models within the labs. We describe how we leveraged underlying machine-learned and knowledge-engineered algorithms to trigger scaffolds, delivered proactively by a pedagogical agent, that address students’ individual difficulties as they work. Results show that students who received automated scaffolds for a given practice on their first virtual lab improved on that practice for the next virtual lab on the same science topic in a different scenario (a near-transfer task). These findings suggest that real-time automated scaffolds based on fine-grained assessment data can help students improve on mathematical modeling.

Amy Adair, Michael Sao Pedro, Janice Gobert, Ellie Segan
Improving Automated Evaluation of Student Text Responses Using GPT-3.5 for Text Data Augmentation

In education, intelligent learning environments allow students to choose how to tackle open-ended tasks while monitoring performance and behavior, allowing for the creation of adaptive support to help students overcome challenges. Timely feedback is critical to aid students’ progression toward learning and improved problem-solving. Feedback on text-based student responses can be delayed when teachers are overloaded with work. Automated evaluation can provide quick student feedback while easing the manual evaluation burden for teachers in settings with a high student-to-teacher ratio. Current methods of evaluating student essay responses to questions have included transformer-based natural language processing models, with varying degrees of success. One main challenge in training these models is the scarcity of student-generated data: larger volumes of training data are needed to create models that perform at a sufficient level of accuracy, yet large quantities are difficult to obtain when educational studies involve student-generated text. To overcome this data scarcity issue, text augmentation techniques have been employed to balance and expand the dataset so that models can be trained with higher accuracy, leading to more reliable evaluation and categorization of student answers to aid teachers in students’ learning progression. This paper examines the text-generating AI model GPT-3.5 to determine whether prompt-based text-generation methods are viable for generating additional text to supplement small sets of student responses for machine learning model training. We augmented student responses across two domains using GPT-3.5 completions and used that data to train a multilingual BERT model. Our results show that text generation can improve model performance on small datasets over simple self-augmentation.

Keith Cochran, Clayton Cohn, Jean Francois Rouet, Peter Hastings
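
A rough sketch of the augmentation step, assuming the openai>=1.0 Python SDK; the prompt wording, label scheme, and generation settings are hypothetical stand-ins for whatever the study used.

```python
from openai import OpenAI  # assumes the openai>=1.0 Python SDK

client = OpenAI()  # reads OPENAI_API_KEY from the environment

def augment_response(seed_answer: str, label: str, n: int = 3) -> list[str]:
    """Ask GPT-3.5 to paraphrase a labeled student answer, yielding
    synthetic examples to balance scarce classes before training."""
    prompt = (
        f"Rewrite the following student answer {n} times, preserving its "
        f"meaning and its quality level ('{label}'). "
        f"Return one rewrite per line.\n\nAnswer: {seed_answer}"
    )
    resp = client.chat.completions.create(
        model="gpt-3.5-turbo",
        messages=[{"role": "user", "content": prompt}],
        temperature=0.9,  # higher temperature -> more varied paraphrases
    )
    return resp.choices[0].message.content.strip().splitlines()

synthetic = augment_response("Plants use sunlight to make food.", "partial")
```

The generated lines would then be appended to the under-represented classes of the training set before fine-tuning the multilingual BERT classifier.
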
The Automated Model of Comprehension Version 3.0: Paying Attention to Context

Reading comprehension is essential for both knowledge acquisition and memory reinforcement. Automated modeling of the comprehension process provides insights into the efficacy of specific texts as learning tools. This paper introduces an improved version of the Automated Model of Comprehension, version 3.0 (AMoC v3.0). AMoC v3.0 is based on two theoretical models of the comprehension process, namely the Construction-Integration and the Landscape models. In addition to the lessons learned from the previous versions, AMoC v3.0 uses Transformer-based contextualized embeddings to build and update the concept graph as a simulation of reading. Besides taking into account generative language models and presenting a visual walkthrough of how the model works, AMoC v3.0 surpasses the previous version in terms of the Spearman correlations between our activation scores and the values reported in the original Landscape Model for the presented use case. Moreover, features derived from AMoC significantly differentiate between high-low cohesion texts, thus arguing for the model’s capabilities to simulate different reading conditions.

Dragos Corlatescu, Micah Watanabe, Stefan Ruseti, Mihai Dascalu, Danielle S. McNamara
Analysing Verbal Communication in Embodied Team Learning Using Multimodal Data and Ordered Network Analysis

In embodied team learning activities, students are expected to learn to collaborate with others while freely moving in a physical learning space to complete a shared goal. Students can thus interact in various team configurations, resulting in increased complexity in their communication dynamics since unrelated dialogue segments can concurrently happen at different locations of the learning space. This can make it difficult to analyse students’ team dialogue solely using audio data. To address this problem, we present a study in a highly dynamic healthcare simulation setting to illustrate how spatial data can be combined with audio data to model embodied team communication. We used ordered network analysis (ONA) to model the co-occurrence and the order of coded co-located dialogue instances and identify key differences in the communication dynamics of high and low performing teams.

Linxuan Zhao, Yuanru Tan, Dragan Gašević, David Williamson Shaffer, Lixiang Yan, Riordan Alfredo, Xinyu Li, Roberto Martinez-Maldonado
Improving Adaptive Learning Models Using Prosodic Speech Features

Cognitive models of memory retrieval aim to describe human learning and forgetting over time. Such models have been successfully applied in digital systems that aid in memorizing information by adapting to the needs of individual learners. The memory models used in these systems typically measure the accuracy and latency of typed retrieval attempts. However, recent advances in speech technology have led to the development of learning systems that allow for spoken inputs. Here, we explore the possibility of improving a cognitive model of memory retrieval by using information present in speech signals during spoken retrieval attempts. We asked 44 participants to study vocabulary items by spoken rehearsal, and automatically extracted high-level prosodic speech features—patterns of stress and intonation—such as pitch dynamics, speaking speed, and intensity from over 7,000 utterances. We demonstrate that some prosodic speech features are associated with accuracy and response latency for retrieval attempts, and that speech-feature-informed memory models make better predictions of future performance relative to models that only use accuracy and response latency. Our results have theoretical relevance, as they show how memory strength is reflected in a specific speech signature. They also have important practical implications, as they contribute to the development of memory models for spoken retrieval that have numerous real-world applications.

Thomas Wilschut, Florian Sense, Odette Scharenborg, Hedderik van Rijn
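
A sketch of how such prosodic descriptors might be pulled from a recorded retrieval attempt with librosa; the feature set here (pitch statistics, RMS intensity, duration as a speed proxy) is an assumption about, not a copy of, the paper's pipeline.

```python
import librosa
import numpy as np

def prosodic_features(path: str) -> dict:
    """Extract coarse pitch, intensity, and duration descriptors
    from one spoken retrieval attempt."""
    y, sr = librosa.load(path, sr=16000)
    f0, voiced_flag, voiced_prob = librosa.pyin(
        y, fmin=librosa.note_to_hz("C2"),
        fmax=librosa.note_to_hz("C6"), sr=sr)
    rms = librosa.feature.rms(y=y)[0]                # frame-level intensity
    return {
        "pitch_mean": float(np.nanmean(f0)),         # central pitch
        "pitch_range": float(np.nanmax(f0) - np.nanmin(f0)),
        "intensity_mean": float(rms.mean()),
        "duration_s": len(y) / sr,                   # speaking-speed proxy
    }
```

Features like these would then enter the memory model alongside accuracy and response latency as additional predictors of future recall.
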
Neural Automated Essay Scoring Considering Logical Structure

Automated essay scoring (AES) models based on deep neural networks (DNN) have recently achieved high accuracy. However, conventional neural AES models cannot explicitly consider the logical structure of each essay. Explicitly considering the logical structure in neural AES models is expected to improve scoring accuracy because logical structure is an important factor affecting essay quality. Accordingly, this study proposes a neural AES method that incorporates information about logical structure. First, the proposed method estimates the logical structure of each essay using the argument mining method, which is a machine learning method for extracting the logical structure from texts. Then, the logical structure is processed using a newly developed neural architecture, which we formulate as a transformer-based DNN model with modified self-attention, and a distributed representation of the logical structure is output. Finally, the proposed method integrates that distributed representation into conventional neural AES models to predict the essay score. We demonstrate the effectiveness of the proposed method through experiments using benchmark data for AES.

Misato Yamaura, Itsuki Fukuda, Masaki Uto
“Why My Essay Received a 4?”: A Natural Language Processing Based Argumentative Essay Structure Analysis

Writing argumentative essays is a critical component of students’ learning. Previous work on automatic assessment of essay writing has often focused on providing a holistic score for the input essay, which only summarizes the essay’s overall quality. However, to provide more pedagogical value and equitable educational opportunities for all students, an automated system needs to provide detailed feedback on students’ essays. To address this issue, we developed an essay argumentative structure feedback system to support educators and students. We employed natural language processing (NLP) and data mining techniques to explore the association between argumentative structure and essay scores. First, we proposed a cross-prompt, sentence-level ensemble model to classify argumentative elements and extract argumentative structures from essays. The model works across multiple datasets and achieves high performance. Second, after applying the classification model to the ACT writing tests, we performed a sequential mining process to extract representative argumentative structures. Our findings highlight the role of organizational argumentative structure in essay scoring. Furthermore, we found a common argumentative structure used by high-scoring essays. Finally, with the knowledge of argumentative elements and structures used in previous essays, we proposed a feedback tool design to complement current AES systems and help students improve their argument writing skills.

Bokai Yang, Sungjin Nam, Yuchi Huang
Leveraging Deep Reinforcement Learning for Metacognitive Interventions Across Intelligent Tutoring Systems

This work compares two approaches to provide metacognitive interventions and their impact on preparing students for future learning across Intelligent Tutoring Systems (ITSs). In two consecutive semesters, we conducted two classroom experiments: Exp. 1 used a classic artificial intelligence approach to classify students into different metacognitive groups and provide static interventions based on their classified groups. In Exp. 2, we leveraged Deep Reinforcement Learning (DRL) to provide adaptive interventions that consider the dynamic changes in the student’s metacognitive levels. In both experiments, students received these interventions that taught how and when to use a backward-chaining (BC) strategy on a logic tutor that supports a default forward-chaining strategy. Six weeks later, we trained students on a probability tutor that only supports BC without interventions. Our results show that adaptive DRL-based interventions closed the metacognitive skills gap between students. In contrast, static classifier-based interventions only benefited a subset of students who knew how to use BC in advance. Additionally, our DRL agent prepared the experimental students for future learning by significantly surpassing their control peers on both ITSs.

Mark Abdelshiheed, John Wesley Hostetter, Tiffany Barnes, Min Chi
Enhancing Stealth Assessment in Collaborative Game-Based Learning with Multi-task Learning

Collaborative game-based learning environments offer the promise of combining the strengths of computer-supported collaborative learning and game-based learning to enable students to work collectively towards achieving problem-solving goals in engaging storyworlds. Group chat plays an important role in such environments, enabling students to communicate with team members while exploring the learning environment and collaborating on problem solving. However, students may engage in chat behavior that negatively affects learning. To help address this problem, we introduce a multidimensional stealth assessment model that jointly predicts students’ out-of-domain contributions to group chat and their learning outcomes using multi-task learning. Results from evaluating the model indicate that multi-task learning, which performs the multidimensional stealth assessment simultaneously using predictive features extracted from in-game actions and group chat data, outperforms single-task variants. They suggest that multi-task learning can effectively support stealth assessment in collaborative game-based learning environments.

Anisha Gupta, Dan Carpenter, Wookhee Min, Bradford Mott, Krista Glazewski, Cindy E. Hmelo-Silver, James Lester
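
Architecturally, multi-task stealth assessment can be as simple as a shared encoder with one head per assessment dimension, trained on the sum of the task losses. The PyTorch sketch below shows that pattern; the input features, layer sizes, and loss weighting are illustrative assumptions, not the authors' model.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class StealthAssessmentMTL(nn.Module):
    """Shared encoder over game-action + chat features, with one head
    for out-of-domain chat detection and one for learning outcomes."""
    def __init__(self, in_dim=256, hidden=128):
        super().__init__()
        self.encoder = nn.Sequential(nn.Linear(in_dim, hidden), nn.ReLU())
        self.chat_head = nn.Linear(hidden, 1)     # off-task chat (binary)
        self.outcome_head = nn.Linear(hidden, 1)  # post-test score

    def forward(self, x):
        z = self.encoder(x)
        return self.chat_head(z), self.outcome_head(z)

model = StealthAssessmentMTL()
x = torch.randn(16, 256)                          # a batch of students
chat_logit, outcome = model(x)
loss = (F.binary_cross_entropy_with_logits(
            chat_logit, torch.randint(0, 2, (16, 1)).float())
        + F.mse_loss(outcome, torch.randn(16, 1)))
loss.backward()   # gradients from both tasks shape the shared encoder
```
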
How Peers Communicate Without Words – An Exploratory Study of Hand Movements in Collaborative Learning Using Computer-Vision-Based Body Recognition Techniques

Accumulating research in embodied cognition highlights the essential role of human bodies in knowledge learning and development. Hand movement is one of the most frequently applied body motions in collaborative ideation tasks, in which students co-construct knowledge with and without words. However, there is limited understanding of how students in a group use their hand movements to coordinate understanding and reach consensus. This study explored students’ hand movement patterns during different types of knowledge co-construction discourse: quick consensus-building, integration-oriented consensus-building, and conflict-oriented consensus-building. Students’ verbal discussion transcripts were qualitatively analyzed to identify the type of knowledge co-construction discourse. Students’ hand motion was video-recorded, and their hand landmarks were detected using the machine learning tool MediaPipe. One-way ANOVA was conducted to compare students’ hand motions across the discourse types. The results revealed different hand motion patterns in different types of collaboration discourse: students tended to employ more hand motion during conflict-oriented consensus-building than during quick consensus-building and integration-oriented consensus-building. At the group level, collaborating students presented less equal hand movement during quick consensus-building than during integration-oriented and conflict-oriented consensus-building. The findings expand the existing understanding of embodied collaborative learning, providing insights for optimizing collaborative learning activities that incorporate both verbal and non-verbal language.

Qianru Lyu, Wenli Chen, Junzhu Su, Kok Hui John Gerard Heng, Shuai Liu
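
For readers unfamiliar with the tooling, the sketch below shows how MediaPipe's hand solution yields per-frame landmarks from which a simple motion quantity can be computed; the video file name and the mean-displacement metric are assumptions for illustration, not the study's exact measures.

```python
import cv2
import mediapipe as mp
import numpy as np

hands = mp.solutions.hands.Hands(max_num_hands=2)
cap = cv2.VideoCapture("group_discussion.mp4")   # hypothetical recording

motion, prev = [], None
while cap.isOpened():
    ok, frame = cap.read()
    if not ok:
        break
    result = hands.process(cv2.cvtColor(frame, cv2.COLOR_BGR2RGB))
    if result.multi_hand_landmarks:
        # 21 normalized (x, y) landmarks of the first detected hand.
        pts = np.array([(lm.x, lm.y)
                        for lm in result.multi_hand_landmarks[0].landmark])
        if prev is not None:
            # Mean landmark displacement between frames ~ hand motion.
            motion.append(np.linalg.norm(pts - prev, axis=1).mean())
        prev = pts
cap.release()
print(f"mean hand motion per frame: {np.mean(motion):.4f}")
```
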
Scalable Educational Question Generation with Pre-trained Language Models

The automatic generation of educational questions will play a key role in scaling online education, enabling self-assessment at scale when a global population is manoeuvring their personalised learning journeys. We develop EduQG, a novel educational question generation model built by adapting a large language model. Our extensive experiments demonstrate that EduQG can produce superior educational questions by further pre-training and fine-tuning a pre-trained language model on the scientific text and science question data.

Sahan Bulathwela, Hamze Muse, Emine Yilmaz
Involving Teachers in the Data-Driven Improvement of Intelligent Tutors: A Prototyping Study

Several studies show that log data analysis can lead to effective redesign of intelligent tutoring systems (ITSs). However, teachers are seldom included in the data-driven redesign of ITSs, despite their pedagogical content knowledge, so examining their possible contributions is valuable. To investigate what contributions teachers might make and whether (and how) data would be useful, we first built an interactive prototype tool for visualizing student log data, SolutionVis, based on needs identified in interviews with tutor authors. SolutionVis presents students’ problem-solving processes with an intelligent tutor, including meta-cognitive aspects (e.g., hint requests). We then conducted a within-subjects user study with eight teachers to compare teachers’ redesign suggestions obtained in three conditions: a baseline “no data” condition (where teachers examined just the tutor itself) and two “with data” conditions, in which teachers worked with SolutionVis and with a list representation of student solutions, respectively. The results showed that teachers generated useful redesign ideas in all three conditions, and that they viewed the availability of data (in both formats) as helpful and as enabling them to generate a wider range of redesign suggestions, specifically with respect to hint design and to feedback on gaming-the-system behaviors and struggle. The current work suggests potential benefits of, and ways of, involving teachers in the data-driven improvement of ITSs.

Meng Xia, Xinyi Zhao, Dong Sun, Yun Huang, Jonathan Sewall, Vincent Aleven
Reflexive Expressions: Towards the Analysis of Reflexive Capability from Reflective Text

As a first step towards computational analysis of reflexive capability, this research established a process for identifying and classifying short groups of words prominent in reflective writing, resulting in a corpus of reflexive expressions. The expressions were classified into theoretical categories of reflexivity, which were evaluated through a series of experiments that crowdsourced human judgements of expression-category suitability. The purpose of the work was to ascertain: (a) the feasibility of computationally identifying expressions of reflexivity in reflective text, and (b) the extent to which computational classification of the expressions accords with human judgements. Success could advance the computational analysis of reflective text, aiding the identification of reflexive capability at scale. The study involved (1) socio-technical derivation of theoretically informed categories, (2) computational generation of a corpus, and (3) crowdsourced human judgements. We found that computational generation of English reflexive expressions was feasible, and that some categories accord well with human judgements drawn from a fluent English population. The work is expected to provide a foundation for ongoing inquiry and a basis for more general use in identifying evidence of learner reflexivity.

Andrew Gibson, Lance De Vine, Miguel Canizares, Jill Willis
Algebra Error Classification with Large Language Models

Automated feedback as students answer open-ended math questions has significant potential in improving learning outcomes at large scale. A key part of automated feedback systems is an error classification component, which identifies student errors and enables appropriate, predefined feedback to be deployed. Most existing approaches to error classification use a rule-based method, which has limited capacity to generalize. Existing data-driven methods avoid these limitations but specifically require mathematical expressions in student responses to be parsed into syntax trees. This requirement is itself a limitation, since student responses are not always syntactically valid and cannot be converted into trees. In this work, we introduce a flexible method for error classification using pre-trained large language models. We demonstrate that our method can outperform existing methods in algebra error classification, and is able to classify a larger set of student responses. Additionally, we analyze common classification errors made by our method and discuss limitations of automated error classification.

Hunter McNichols, Mengxue Zhang, Andrew Lan
Exploration of Annotation Strategies for Automatic Short Answer Grading

Automatic Short Answer Grading aims to automatically grade short answers authored by students. Recent work has shown that this task can be effectively reformulated as a Natural Language Inference problem. The state of the art uses large pretrained language models fine-tuned on the domain dataset. However, how to quantify the effectiveness of these models in small data regimes remains an open issue. In this work we present a set of experiments analysing the impact of different annotation strategies when not enough training examples for fine-tuning the model are available. We find that when annotating few examples, it is preferable to have more question variability than more answers per question. With this annotation strategy, our model outperforms state-of-the-art systems using only 10% of the full training set. Finally, experiments show that using out-of-domain annotated question-answer examples can be harmful when fine-tuning the models.

Aner Egaña, Itziar Aldabe, Oier Lopez de Lacalle
Impact of Learning a Subgoal-Directed Problem-Solving Strategy Within an Intelligent Logic Tutor

Humans adopt various problem-solving strategies depending on their mastery level, problem type, and complexity. Many of these problem-solving strategies have been integrated within intelligent problem-solvers to solve structured and complex problems efficiently. One such strategy is the means-ends analysis which involves comparing the goal and the givens of a problem and iteratively setting up subgoal(s) at each step until the subgoal(s) are straightforward to derive from the givens. However, little is known about the impact of explicitly teaching novices such a strategy for structured problem-solving with tutors. In this study, we teach novices a subgoal-directed problem-solving strategy inspired by means-ends analysis using a problem-based training intervention within an intelligent logic-proof tutor. As we analyzed students’ performance and problem-solving approaches after training, we observed that the students who learned the strategy used it more when solving new problems, constructed optimal logic proofs, and outperformed those who did not learn the strategy.

Preya Shabrina, Behrooz Mostafavi, Min Chi, Tiffany Barnes
Matching Exemplar as Next Sentence Prediction (MeNSP): Zero-Shot Prompt Learning for Automatic Scoring in Science Education

Developing natural language processing (NLP) models to automatically score students’ written responses to science problems is critical for science education. However, collecting sufficient student responses and labeling them for training or fine-tuning NLP models is time-consuming and costly. Recent studies suggest that large-scale pre-trained language models (PLMs) can be adapted to downstream tasks without fine-tuning by using prompts. However, no research has employed such a prompt approach in science education. Because students’ written responses are expressed in natural language, framing the scoring procedure as a next-sentence-prediction task using prompts can skip the costly fine-tuning stage. In this study, we developed a zero-shot approach to automatically score student responses via Matching Exemplars as Next Sentence Prediction (MeNSP), which requires no training samples. We first applied MeNSP to scoring three assessment tasks of scientific argumentation and found machine-human scoring agreements with Cohen’s kappa ranging from 0.30 to 0.57 and F1 scores ranging from 0.54 to 0.81. To improve scoring performance, we extended our research to the few-shot setting, either randomly selecting labeled student responses at each grading level or manually constructing responses to fine-tune the models. We find that one task’s performance improves with more samples (Cohen’s kappa from 0.30 to 0.38, F1 score from 0.54 to 0.59); for the other two tasks, scoring performance does not improve. We also find that randomly selected few-shot examples perform better than the human expert-crafted approach. This study suggests that MeNSP can yield referable automatic scoring for student-written responses while significantly reducing the cost of model training. The method can benefit low-stakes classroom assessment practices in science education. Future research should further explore the applicability of MeNSP to different types of assessment tasks in science education and further improve model performance. Our code is available at https://github.com/JacksonWuxs/MeNSP .

Xuansheng Wu, Xinyu He, Tianming Liu, Ninghao Liu, Xiaoming Zhai
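
The core MeNSP move can be sketched with Hugging Face's BertForNextSentencePrediction: score how plausibly a student response "follows" a graded exemplar, then assign the grade of the best-matching exemplar. The exemplar texts below are made up for illustration.

```python
import torch
from transformers import BertTokenizer, BertForNextSentencePrediction

tok = BertTokenizer.from_pretrained("bert-base-uncased")
model = BertForNextSentencePrediction.from_pretrained("bert-base-uncased")

def nsp_score(exemplar: str, response: str) -> float:
    """Probability that `response` is a natural continuation of
    `exemplar` (logit index 0 is the 'is next sentence' class)."""
    enc = tok(exemplar, response, return_tensors="pt", truncation=True)
    with torch.no_grad():
        logits = model(**enc).logits
    return torch.softmax(logits, dim=1)[0, 0].item()

exemplars = {2: "The data support the claim because the trend holds.",
             0: "I am not sure what the data mean."}   # grade -> exemplar
response = "The evidence backs the claim since the pattern is consistent."
grade = max(exemplars, key=lambda g: nsp_score(exemplars[g], response))
print(grade)
```

Because the PLM's next-sentence head is used as-is, no gradient updates, and hence no labeled training set, are needed for the zero-shot variant.
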
Learning When to Defer to Humans for Short Answer Grading

To assess student knowledge, educators face a tradeoff between open-ended and fixed-response questions. Open-ended questions are easier to formulate and provide greater insight into student learning, but they are burdensome to assess. Machine learning methods that could reduce the assessment burden also have a cost, given that large datasets of reliably assessed examples (labeled data) are required for training and testing. We address the human costs of assessment and data labeling using selective prediction, where the output of a machine-learned model is used when the model makes a confident decision, but otherwise the model defers to a human decision-maker. The goal is to defer less often while maintaining human assessment quality on the total output. We refer to the deferral criteria as a deferral policy, and we show it is possible to learn when to defer. We first trained an autograder on a combination of historical data and a small amount of newly labeled data, achieving moderate performance. We then used the autograder output as input to a logistic regression that learns when to defer; the learned logistic regression equation constitutes the deferral policy. Tests of the selective prediction method on a held-out test set showed that human-level assessment quality can be achieved with a major reduction in human effort.

Zhaohui Li, Chengning Zhang, Yumi Jin, Xuesong Cang, Sadhana Puntambekar, Rebecca J. Passonneau
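
A minimal sketch of learning such a deferral policy with scikit-learn: fit a logistic regression that predicts, from the autograder's confidence signals, whether the autograder's score would match a human's, and defer whenever that predicted agreement is too low. The features, toy data, and threshold are assumptions.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

rng = np.random.default_rng(0)
conf = rng.random(500)                    # autograder max softmax prob
margin = rng.random(500)                  # gap between top two classes
agrees = (conf + 0.1 * margin > 0.6).astype(int)  # toy agreement labels

X = np.column_stack([conf, margin])
policy = LogisticRegression().fit(X, agrees)      # the deferral policy

def defer_to_human(c, m, threshold=0.9):
    """Defer when the policy is not sufficiently sure the autograder
    would match a human grader."""
    p_agree = policy.predict_proba([[c, m]])[0, 1]
    return p_agree < threshold

print(defer_to_human(0.55, 0.10))   # low confidence -> likely defer
```

Raising the threshold trades more human effort for higher assessment quality, which is exactly the knob the selective prediction framework exposes.
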
Contrastive Learning for Reading Behavior Embedding in E-book System

When students use e-learning systems such as learning management systems and e-book systems, their operation logs are stored and analyzed to understand student learning behaviors. For applications such as dashboard systems and at-risk student detection, the operation logs are typically transformed into features designed by researchers. Such hand-crafted features, like counts of operations, are easily interpretable, but their power may be limited for recent large-scale educational datasets. In machine learning research, data-driven features have been demonstrated to be a better representation than hand-crafted features, yet they have received little discussion for educational data because they require many operation logs. In this study, we collect reading logs from an e-book system and propose a representation learning method for these logs based on contrastive learning. Our proposed method transforms time-series reading logs directly into reading behavior feature vectors, without hand-crafted features. In our experiments, we demonstrate that our feature representation outperforms a traditional count-based hand-crafted feature representation in the at-risk student detection task. In addition, we investigate the characteristics of the feature space learned by our proposed method.

Tsubasa Minematsu, Yuta Taniguchi, Atsushi Shimada
Generalizable Automatic Short Answer Scoring via Prototypical Neural Network

We investigated the challenging task of generalizable automatic short answer scoring (ASAS), where a scoring model is tasked with generalizing to target domains (provided only with limited labeled data) that have no overlap with the auxiliary domains on which the model is trained. To address this, we introduced a framework based on the Prototypical Neural Network (PNN). Specifically, for a target short answer instance whose score needs to be determined, the framework first calculates the distance between this target instance and each cluster of support instances (support instances are a set of labeled short answer instances grouped into clusters according to their labels, i.e., the ground-truth scores). It then rates the target instance using the ground-truth score of the cluster closest to the target instance. Through extensive empirical studies on an open-source ASAS dataset consisting of 10 different question prompts, we observed that the proposed approach consistently outperformed other baselines across settings with different numbers of support instances. We further observed that the proposed approach performed better when trained on wider data sources than on restricted ones, suggesting that including more data sources for training may add to the generalizability of the proposed framework.

Zijie Zeng, Lin Li, Quanlong Guan, Dragan Gašević, Guanliang Chen
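
The nearest-prototype decision rule at the heart of the framework is compact enough to sketch directly in PyTorch; the random embeddings below stand in for the output of a trained encoder.

```python
import torch

def prototype_scores(support_emb, support_labels, query_emb):
    """Assign each query answer the ground-truth score of its nearest
    class prototype (mean embedding of support answers with that score)."""
    scores = sorted(set(support_labels.tolist()))
    protos = torch.stack([support_emb[support_labels == s].mean(dim=0)
                          for s in scores])          # (n_scores, dim)
    dists = torch.cdist(query_emb, protos)           # Euclidean distances
    return torch.tensor(scores)[dists.argmin(dim=1)]

support = torch.randn(30, 64)             # embedded support answers
labels = torch.randint(0, 3, (30,))       # their ground-truth scores
queries = torch.randn(5, 64)              # embedded target answers
print(prototype_scores(support, labels, queries))
```

Because only the prototypes depend on the limited target-domain labels, the encoder itself can be trained entirely on auxiliary domains, which is what enables the cross-domain generalization the paper studies.
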
A Spatiotemporal Analysis of Teacher Practices in Supporting Student Learning and Engagement in an AI-Enabled Classroom

Research indicates that teachers play an active and important role in classrooms with AI tutors. Yet our scientific understanding of the way teacher practices around AI tutors mediate student learning is far from complete. In this paper, we investigate spatiotemporal factors of student-teacher interactions by analyzing student engagement and learning with an AI tutor ahead of teacher visits (defined as episodes of a teacher being in close physical proximity to a student) and immediately following teacher visits. To conduct such integrated, temporal analysis around the moments when teachers visit students, we collect fine-grained, time-synchronized data on teacher positions in the physical classroom and student interactions with the AI tutor. Our case study in a K12 math classroom with a veteran math teacher provides some indications of the factors that might affect a teacher’s decision to allocate their limited classroom time to students, and of the effects these interactions have on students. For instance, teacher visits were associated more with students’ in-the-moment behavioral indicators (e.g., idleness) than with broader, static measures of student need such as low prior knowledge. While teacher visits were often associated with positive changes in student behavior afterward (e.g., decreased idleness), there could be a potential mismatch between the students the teacher visited and those who may have needed it more at that time (e.g., students who had been disengaged for much longer). Overall, our findings indicate that teacher visits may yield immediate benefits for students, but also that it is challenging for teachers to meet all needs, suggesting the need for better tool support.

Shamya Karumbaiah, Conrad Borchers, Tianze Shou, Ann-Christin Falhs, Pinyang Liu, Tomohiro Nagashima, Nikol Rummel, Vincent Aleven
Trustworthy Academic Risk Prediction with Explainable Boosting Machines

The use of predictive models in education promises individual support and personalization for students. To develop trustworthy models, we need to understand what factors and causes contribute to a prediction. Thus, it is necessary to develop models that are not only accurate but also explainable. Moreover, we need to conduct holistic model evaluations that quantify explainability and other metrics alongside established performance metrics. This paper explores the use of Explainable Boosting Machines (EBMs) for the task of academic risk prediction. EBMs are an extension of Generalized Additive Models and promise state-of-the-art performance on tabular datasets while being inherently interpretable. We demonstrate the benefits of using EBMs in the context of academic risk prediction trained on online learning behavior data and show the explainability of the model. Our study shows that EBMs are as accurate as other state-of-the-art approaches while being competitive on metrics relevant for trustworthy academic risk prediction, such as earliness, stability, fairness, and faithfulness of explanations. The results encourage the broader use of EBMs for other Artificial Intelligence in Education tasks.

Vegenshanti Dsilva, Johannes Schleiss, Sebastian Stober
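EBMs are available off the shelf in the interpret library; below is a minimal sketch with synthetic data standing in for the paper's online learning behavior features:

```python
from interpret.glassbox import ExplainableBoostingClassifier
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.metrics import roc_auc_score

# synthetic stand-in for tabular behavior features and an at-risk label
X, y = make_classification(n_samples=500, n_features=10, random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, stratify=y, random_state=0)

ebm = ExplainableBoostingClassifier()   # a GAM variant: one shape function
ebm.fit(X_tr, y_tr)                     # per feature, plus pair interactions
print("AUC:", roc_auc_score(y_te, ebm.predict_proba(X_te)[:, 1]))

# explanations come with the model rather than being bolted on post hoc:
global_exp = ebm.explain_global()                   # per-feature shapes
local_exp = ebm.explain_local(X_te[:5], y_te[:5])   # per-student contributions
```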
Automatic Educational Question Generation with Difficulty Level Controls

We consider the task of automatically generating math word problems (MWPs) of various difficulties that meet the needs of teachers in teaching and testing students at the corresponding educational stages. Existing methods fail to produce high-quality problems while giving teachers control over the difficulty level. In this work, we introduce a controllable MWP generation pipeline that samples from an energy language model with various expert components for realizing the target attributes. We control the difficulty of the resulting MWPs along mathematical and linguistic dimensions by imposing constraints on equations, vocabulary, and topics. We also use other control attributes, including fluency and distance to the conditioning sequence, to manage language quality and creativity. Experiments and evaluation results demonstrate that our approach improves upon the baselines in generating solvable, well-formed, and diverse MWPs of controlled difficulty levels. Lastly, we solicit feedback from various math educators, who confirm the effectiveness of our system for their MWP design processes. They suggest that our outputs align with the expectations of problem designers, indicating the possibility of using such problem generators in real-life educational scenarios. Our code and data are available on request.

Ying Jiao, Kumar Shridhar, Peng Cui, Wangchunshu Zhou, Mrinmaya Sachan
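The abstract leaves the sampler unspecified; as a rough illustration of combining expert components under an energy model, here is a sample-and-rerank sketch in which every function is a hypothetical stand-in (a real energy-based sampler would be considerably more sophisticated):

```python
import math
import random

def energy(text, experts, weights):
    """Energy = weighted sum of expert penalties (equation complexity,
    vocabulary level, topic match, fluency, ...); lower is better."""
    return sum(w * e(text) for e, w in zip(experts, weights))

def sample_mwp(draw_candidate, experts, weights, n=50, temp=1.0):
    """Draw candidates from a base LM, then sample one from the Boltzmann
    distribution induced by the energy."""
    cands = [draw_candidate() for _ in range(n)]
    logits = [-energy(c, experts, weights) / temp for c in cands]
    m = max(logits)                         # subtract max for stability
    probs = [math.exp(l - m) for l in logits]
    return random.choices(cands, weights=probs, k=1)[0]
```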
Getting the Wiggles Out: Movement Between Tasks Predicts Future Mind Wandering During Learning Activities

Mind wandering (“zoning out”) is a frequent occurrence and is negatively related to learning outcomes, which suggests it would be beneficial to measure and mitigate it. To this end, we investigated whether movement from a wrist-worn accelerometer between tasks could predict mind wandering as 125 learners read long, connected, informative texts. We examined random forest models using both basic statistical and more novel nonlinear dynamics movement features, finding that the former were more predictive of future (i.e., about 5 min later) reports of mind wandering. Models generalized across students with AUROCs up to 0.62. Importantly, vertical movement as measured by the Z-axis accelerometer channel (e.g., flexion or extension of the elbow in stretching) was the most predictive signal, whereas horizontal arm movements (measured by the X- and Y-axis channels) and rotational movement were not predictive. We discuss implications for theories of mind wandering and applications for intelligent learning interfaces that can prospectively detect mind wandering.

Rosy Southwell, Candace E. Peacock, Sidney K. D’Mello
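A sketch of the basic-statistics pipeline the abstract describes, with synthetic accelerometer windows in place of the real wrist-worn recordings (the feature choices are guesses), and grouped cross-validation so models are evaluated across students:

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import GroupKFold, cross_val_score

def basic_stats(window):
    """Simple per-axis statistics for one between-task movement window;
    window shape: (samples, 3) for the X/Y/Z channels."""
    feats = []
    for axis in range(3):
        a = window[:, axis]
        feats += [a.mean(), a.std(), a.min(), a.max(),
                  np.abs(np.diff(a)).mean()]
    return feats

rng = np.random.default_rng(0)
windows = [rng.normal(size=(100, 3)) for _ in range(200)]  # fake windows
y = rng.integers(0, 2, 200)             # mind-wandering report ~5 min later
groups = np.repeat(np.arange(20), 10)   # student IDs

X = np.array([basic_stats(w) for w in windows])
clf = RandomForestClassifier(n_estimators=300, random_state=0)
aucs = cross_val_score(clf, X, y, groups=groups,
                       cv=GroupKFold(n_splits=5), scoring="roc_auc")
print("AUROC:", aucs.mean())   # near 0.5 here, since the data are random
```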
Efficient Feedback and Partial Credit Grading for Proof Blocks Problems

Proof Blocks is a software tool that allows students to practice writing mathematical proofs by dragging and dropping lines instead of writing proofs from scratch. Proof Blocks offers the capability of assigning partial credit and providing solution quality feedback to students. This is done by computing the edit distance from a student’s submission to some predefined set of solutions. In this work, we propose an algorithm for the edit distance problem that significantly outperforms the baseline procedure of exhaustively enumerating over the entire search space. Our algorithm relies on a reduction to the minimum vertex cover problem. We benchmark our algorithm on thousands of student submissions from multiple courses, showing that the baseline algorithm is intractable, and that our proposed algorithm is critical to enable classroom deployment. Our new algorithm has also been used for problems in many other domains where the solution space can be modeled as a DAG, including but not limited to Parsons Problems for writing code, helping students understand packet ordering in networking protocols, and helping students sketch solution steps for physics problems. Integrated into multiple learning management systems, the algorithm serves thousands of students each year.

Seth Poulsen, Shubhang Kulkarni, Geoffrey Herman, Matthew West
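To make the baseline concrete: grading reduces to the minimum edit distance from the student's line ordering to any correct ordering, and exhaustively enumerating those orderings blows up when the solution space is a DAG with many topological orders. A minimal sketch of that baseline (the paper's vertex-cover algorithm itself is beyond a few lines):

```python
def edit_distance(a, b):
    """Levenshtein distance between two sequences of proof lines."""
    m, n = len(a), len(b)
    d = [[0] * (n + 1) for _ in range(m + 1)]
    for i in range(m + 1):
        d[i][0] = i
    for j in range(n + 1):
        d[0][j] = j
    for i in range(1, m + 1):
        for j in range(1, n + 1):
            cost = 0 if a[i - 1] == b[j - 1] else 1
            d[i][j] = min(d[i - 1][j] + 1,         # delete
                          d[i][j - 1] + 1,         # insert
                          d[i - 1][j - 1] + cost)  # substitute
    return d[m][n]

def baseline_grade(submission, solution_orderings):
    """Exhaustive baseline: distance to the nearest correct ordering.
    Intractable when the DAG admits exponentially many orderings."""
    return min(edit_distance(submission, s) for s in solution_orderings)
```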
Dropout Prediction in a Web Environment Based on Universal Design for Learning

Dropout prediction is an essential task in educational Web platforms to identify at-risk learners, enable individualized support, and eventually prevent students from quitting a course. Most existing studies on dropout prediction focus on improving machine learning methods based on a limited set of features to model students. In this paper, we contribute to the field by evaluating and optimizing dropout prediction using features based on personal information and interaction data. Multiple granularities of interaction and additional unique features, such as data on reading ability and learners’ cognitive abilities, are tested. Built on Universal Design for Learning (UDL), our Web-based learning platform I3Learn aims to advance inclusive science learning by supporting all learners. A total of 580 learners from different school types have used the learning platform. We predict dropout at different points in the learning process and compare how well various types of features perform. The effectiveness of predictions benefits from the higher granularity of interaction data that describe intermediate steps in learning activities. The cold start problem can be addressed using assessment data, such as a cognitive abilities assessment from the pre-test of the learning platform. We discuss the experimental results and conclude that the suggested feature sets may be able to reduce dropout in remote learning (e.g., during a pandemic) or blended learning settings in school.

Marvin Roski, Ratan Sebastian, Ralph Ewerth, Anett Hoppe, Andreas Nehring
Designing for Student Understanding of Learning Analytics Algorithms

Students use learning analytics systems to make day-to-day learning decisions, but may not understand their potential flaws. This work delves into student understanding of an example learning analytics algorithm, Bayesian Knowledge Tracing (BKT), using Cognitive Task Analysis (CTA) to identify the knowledge components (KCs) comprising expert student understanding. We built an interactive explanation to target these KCs and performed a controlled experiment examining how varying the transparency of BKT's limitations impacts understanding and trust. Our results show that, counterintuitively, providing some information on the algorithm’s limitations is not always better than providing no information. The success of the methods from our BKT study suggests avenues for the use of CTA in systematically building evidence-based explanations to increase end user understanding of other complex AI algorithms in learning analytics as well as other domains.

Catherine Yeh, Noah Cowit, Iris Howley
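For context, the algorithm students were asked to understand is compact; one BKT step conditions the mastery estimate on the observed response and then applies the learning transition (the parameter values below are purely illustrative):

```python
def bkt_update(p_know, correct, guess=0.2, slip=0.1, learn=0.15):
    """One Bayesian Knowledge Tracing update for a single skill."""
    if correct:
        evidence = p_know * (1 - slip)
        posterior = evidence / (evidence + (1 - p_know) * guess)
    else:
        evidence = p_know * slip
        posterior = evidence / (evidence + (1 - p_know) * (1 - guess))
    return posterior + (1 - posterior) * learn   # chance of learning now

p = 0.3                       # prior probability the skill is known
for obs in [True, False, True]:
    p = bkt_update(p, obs)
    print(round(p, 3))
```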
Feedback and Open Learner Models in Popular Commercial VR Games: A Systematic Review

Virtual reality (VR) educational games are engaging because VR can enable high levels of immersion and presence. However, the educational literature highlights that such games are more effective for learning when they provide effective feedback. We aimed to understand the nature of feedback provided in popular commercial VR educational games. To discover this, we systematically reviewed 260 commercially available educational games from VIVEPORT, Oculus and Steam, the key platforms for VR games. We assessed whether they offered the key forms of feedback we identified from the literature: score, levels, competition, self, self-comparison, until-correct, accuracy, process, and Open Learner Models (OLMs). We found that just 67 games (26%) had any of these forms of feedback and just four had OLMs. Our key contributions are: (1) the first systematic review of feedback in commercial VR games; (2) a literature-informed definition of key forms of feedback for VR games; (3) understanding of OLMs in commercial VR games.

YingAn Chen, Judy Kay, Soojeong Yoo
Gender Differences in Learning Game Preferences: Results Using a Multi-dimensional Gender Framework

Prompted by findings of gender differences in learning game preferences and outcomes, education researchers have proposed adapting games by gender to foster learning and engagement. However, such recommendations typically rely on intuition, rather than empirical data, and are rooted in a binary representation of gender. On the other hand, recent evidence from several disciplines indicates that gender is best understood through multiple dimensions, including gender-typed occupational interests, activities, and traits. Our research seeks to provide learning game designers with empirical guidance incorporating this framework in developing digital learning games that are inclusive, equitable, and effective for all students. To this end, we conducted a survey study among 333 5th and 6th grade students in five urban and suburban schools in a mid-sized U.S. city, with the goal of investigating how game preferences differ by gender identity or gender-typed measures. Our findings uncovered consistent differences in game preferences from both a binary and multi-dimensional gender perspective, with gender-typed measures being more predictive of game preferences than binary gender identity. We also report on preference trends for different game genres and discuss their implications on learning game design. Ultimately, this work supports using multiple dimensions of gender to inform the customization of learning games that meet individual students’ interests and preferences, instead of relying on vague gender stereotypes.

Huy A. Nguyen, Nicole Else-Quest, J. Elizabeth Richey, Jessica Hammer, Sarah Di, Bruce M. McLaren
Can You Solve This on the First Try? – Understanding Exercise Field Performance in an Intelligent Tutoring System

The analysis of fine-grained data on students’ learning behavior in intelligent tutoring systems using machine learning is a starting point for personalized learning. However, findings from such analyses are commonly viewed as uninterpretable and, hence, not helpful for understanding learning processes. The explainable AI method SHAP, which generates detailed explanations, is a promising approach here. It can help to better understand how different learning behaviors influence students’ learning outcomes and potentially produce new insights for adaptable intelligent tutoring systems in the future. Based on K-12 data (N = 472 students), we trained several ML models on student characteristics and behavioral trace data to predict whether a student will answer an exercise field correctly or not—a low-level approximation of academic performance. The optimized machine learning models (lasso regression, random forest, XGBoost) performed well (AUC = 0.68; F1 = 0.63, 0.66, and 0.69, respectively), outperforming logistic regression (AUC = 0.57; F1 = 0.52). SHAP importance values for the best model (XGBoost) indicated that, besides prior language achievement, several behavioral variables (e.g., performance on the previous attempts) were informative predictors. Thus, specific learning behaviors can help explain exercise field performance, independent of the student’s basic abilities and demographics—providing insights into areas of potential intervention. Moreover, the SHAP values highlight heterogeneity in the effects, supporting the notion of personalized learning.

Hannah Deininger, Rosa Lavelle-Hill, Cora Parrisius, Ines Pieronczyk, Leona Colling, Detmar Meurers, Ulrich Trautwein, Benjamin Nagengast, Gjergji Kasneci
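A minimal sketch of the XGBoost-plus-SHAP recipe the abstract reports, with synthetic features standing in for the student characteristics and behavioral traces:

```python
import numpy as np
import shap
import xgboost as xgb
from sklearn.datasets import make_classification

X, y = make_classification(n_samples=1000, n_features=15, random_state=0)
model = xgb.XGBClassifier(n_estimators=200, max_depth=4).fit(X, y)

explainer = shap.TreeExplainer(model)   # exact and fast for tree ensembles
shap_values = explainer.shap_values(X)  # one row of attributions per student

# global importance = mean |SHAP| per feature; row-level variation exposes
# the heterogeneous effects that motivate personalization
print(np.abs(shap_values).mean(axis=0).round(3))
```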
A Multi-theoretic Analysis of Collaborative Discourse: A Step Towards AI-Facilitated Student Collaborations

Collaboration analytics are a necessary step toward implementing intelligent systems that can provide feedback for teaching and supporting collaborative skills. However, the wide variety of theoretical perspectives on collaboration emphasize assessment of different behaviors toward different goals. Our work demonstrates rigorous measurement of collaboration in small group discourse that combines coding schemes from three different theoretical backgrounds: Collaborative Problem Solving, Academically Productive Talk, and Team Cognition. Each scheme measured occurrence of unique collaborative behaviors. Correlations between schemes were low to moderate, indicating both some convergence and unique information surfaced by each approach. Factor analysis drives discussion of the dimensions of collaboration informed by all three. The two factors that explain the most variance point to how participants stay on task and ask for relevant information to find common ground. These results demonstrate that combining analytical tools from different perspectives offers researchers and intelligent systems a more complete understanding of the collaborative skills assessable in verbal communication.

Jason G. Reitman, Charis Clevenger, Quinton Beck-White, Amanda Howard, Sierra Rose, Jacob Elick, Julianna Harris, Peter Foltz, Sidney K. D’Mello
Development and Experiment of Classroom Engagement Evaluation Mechanism During Real-Time Online Courses

Student engagement is an essential indicator of the quality of teaching. However, during real-time online classes, teachers are required to deliver course content and observe students’ reactions simultaneously. Here, we aim to develop a web-based classroom evaluation system for teachers that evaluates student participation automatically. In this study, we present a novel mechanism that evaluates student participation based on multiple reactions. The system estimates students’ head poses and facial expressions through the camera and uses the estimation results as criteria for assessing student participation. Additionally, we conducted two evaluation experiments to demonstrate the system’s effectiveness in automatically evaluating student participation in a real-time online classroom environment with nine students. The instruction experiment was divided into head pose estimation, facial expression recognition, and multi-reaction estimation; the accuracy rates were 70.0%, 60.0%, and 80.0%, respectively. Although participants did not show many head poses in the simulated classroom experiment, the system evaluated the classroom through facial expression recognition, and 70% of the questions showed the same results as a questionnaire.

Yanyi Peng, Masato Kikuchi, Tadachika Ozono
Physiological Synchrony and Arousal as Indicators of Stress and Learning Performance in Embodied Collaborative Learning

Advancements in sensing technologies, artificial intelligence (AI) and multimodal learning analytics (MMLA) are making it possible to model learners’ affective and physiological states. Physiological synchrony and arousal have been increasingly used to unpack students’ affective and cognitive states (e.g., stress), which can ultimately affect their learning performance and satisfaction in collaborative learning settings. Yet, whether these physiological features can be meaningful indicators of students’ stress and learning performance during highly dynamic, embodied collaborative learning (ECL) remains unclear. This paper explores the role of physiological synchrony and arousal as indicators of stress and learning performance in ECL. We developed two linear mixed models with the heart rate and survey data of 172 students in high-fidelity clinical simulations. The findings suggest that physiological synchrony measures are significant indicators of students’ perceived stress and collaboration performance, and physiological arousal measures are significant indicators of task performance, even after accounting for the variance explained by individual and group differences. These findings could contribute empirical evidence to support the development of analytic tools for supporting collaborative learning using AI and MMLA.

Lixiang Yan, Roberto Martinez-Maldonado, Linxuan Zhao, Xinyu Li, Dragan Gašević
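A sketch of the kind of linear mixed model described above, using statsmodels with a random intercept per group; the variable names and synthetic data are stand-ins for the heart-rate-derived measures and survey outcomes:

```python
import numpy as np
import pandas as pd
import statsmodels.formula.api as smf

rng = np.random.default_rng(0)
n_groups, n_per = 40, 4   # e.g., 4 students per simulation team
df = pd.DataFrame({
    "group": np.repeat(np.arange(n_groups), n_per),
    "sync": rng.normal(size=n_groups * n_per),     # physiological synchrony
    "arousal": rng.normal(size=n_groups * n_per),  # physiological arousal
})
df["performance"] = (0.4 * df["sync"] + 0.2 * df["arousal"]
                     + np.repeat(rng.normal(scale=0.5, size=n_groups), n_per)
                     + rng.normal(size=len(df)))

# a random intercept per group absorbs group-level variance, so the fixed
# effects for sync/arousal are tested over and above group differences
model = smf.mixedlm("performance ~ sync + arousal", df, groups=df["group"])
print(model.fit().summary())
```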
Confusion, Conflict, Consensus: Modeling Dialogue Processes During Collaborative Learning with Hidden Markov Models

There is growing recognition that AI technologies can, and should, support collaborative learning. To provide this support, we need models of collaborative talk that reflect the ways in which learners interact. Great progress has been made in modeling dialogue for high school and college-age learners, but the dialogue processes that characterize collaborative talk between elementary learner dyads are not currently well understood. This paper reports on a study with elementary school learners (4th and 5th grade, ages 9–11) who coded collaboratively in dyads. We recorded dialogue from 22 elementary school learner dyads, covering 7594 total utterances. We labeled this corpus manually with dialogue acts and then induced a hidden Markov model to identify the underlying dialogue states and the transitions between them. The model identified six distinct hidden states, which we interpret as Social Dialogue, Confusion, Frustrated Coordination, Exploratory Talk, Directive & Disagreement, and Disagreement & Self-Explanation. The HMM revealed that when students entered a productive Exploratory Talk state, the primary way they transitioned out of it was by becoming confused or reaching an impasse. When this occurred, the learners moved into states of disputation and conflict before re-entering the Exploratory Talk state. These findings can inform the design of AI agents that support young learners’ collaborative talk and help agents determine when students are conflicting rather than collaborating.

Toni V. Earle-Randell, Joseph B. Wiggins, Julianna Martinez Ruiz, Mehmet Celepkolu, Kristy Elizabeth Boyer, Collin F. Lynch, Maya Israel, Eric Wiebe
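A sketch of inducing such a model with hmmlearn, using synthetic dialogue-act sequences in place of the 22 hand-coded dyad transcripts (the number of act codes is a guess; six hidden states matches the paper):

```python
import numpy as np
from hmmlearn import hmm   # CategoricalHMM; MultinomialHMM in old releases

rng = np.random.default_rng(0)
# each session = one dyad's sequence of dialogue-act codes (integers 0..9)
sessions = [rng.integers(0, 10, size=int(rng.integers(100, 400)))
            for _ in range(22)]

X = np.concatenate(sessions).reshape(-1, 1)
lengths = [len(s) for s in sessions]

model = hmm.CategoricalHMM(n_components=6, n_iter=100, random_state=0)
model.fit(X, lengths)

print(model.transmat_.round(2))        # transitions between hidden states
states = model.predict(X, lengths)     # most likely state per utterance
```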
Unsupervised Concept Tagging of Mathematical Questions from Student Explanations

Assigning concept tags to questions enables intelligent tutoring systems (ITSs) to efficiently organize resources, help identify students’ strengths and weaknesses, and recommend suitable learning materials accordingly. Manual tagging is time-consuming and inefficient for large question banks, and can lead to consistency issues due to differences in the perspectives of individual taggers. Automatic tagging techniques can efficiently generate consistent tags at lower cost. Generating automatic tags for mathematical questions is challenging because the question text is usually short and concise, and both the question and the answer text contain mathematical symbols and formulas. However, prior work has not studied this problem extensively. In this context, we conducted a study in a graduate-level linear algebra course to understand whether student explanations of how they solved mathematical problems can be employed to generate concept tags associated with those questions. In this paper, we propose a method called Unsupervised Skill Tagging (UST) to extract the concept tags associated with a given question from explanation text. Using UST on the explanations generated, we show that the explanations indeed contain the expert-specified concept tags.

K. M. Shabana, Chandrashekar Lakshminarayanan
Robust Team Communication Analytics with Transformer-Based Dialogue Modeling

Adaptive training environments that can provide reliable insight into team communication offer great potential for team training and assessment. However, traditional techniques that enable meaningful analysis of team communication such as human transcription and speech classification are especially resource-intensive without machine assistance. Additionally, developing computational models that can perform robust team communication analytics based on small datasets poses significant challenges. We present a transformer-based team communication analysis framework that classifies each team member utterance according to dialogue act and the type of information flow exhibited. The framework utilizes domain-specific transfer learning of transformer-based language models pre-trained with large-scale external data and a prompt engineering method that represents both speaker utterances and speaker roles. Results from our evaluation of team communication data collected from live team training exercises suggest the transformer-based framework fine-tuned with team communication data significantly outperforms state-of-the-art models on both dialogue act recognition and information flow classification and additionally demonstrates improved domain-transfer capabilities.

Jay Pande, Wookhee Min, Randall D. Spain, Jason D. Saville, James Lester
Teacher Talk Moves in K12 Mathematics Lessons: Automatic Identification, Prediction Explanation, and Characteristic Exploration

Talk moves have been shown to facilitate enriched discussion and conversation in classrooms, leading to improved student learning outcomes. To support teachers in enhancing their discursive techniques and to provide timely feedback on classroom discourse, this paper proposes two BERT-based deep learning models for automatically identifying teacher talk moves in K12 mathematics lessons. However, such discourse models have complex structures and cannot, by themselves, offer clear explanations of their predictions for teacher utterances, potentially leading teachers to distrust the model. To address this issue, this paper employs three model-agnostic interpretation methods from explainable artificial intelligence and transparently explains to teachers how the model predictions were made by computing and displaying word relevance. The analysis results confirm the validity of these explanations. Further, the paper investigates the interpretation results to uncover key characteristics of each type of teacher talk move. The findings indicate that simple words and phrases can serve as representative indicators of talk moves (e.g., phrases centering on “agree” and “disagree” in the case of the talk move of getting students to relate to another’s idea), which shows the potential to assist teachers in mastering discursive techniques. We believe this paper takes a solid step towards building an automated classroom discourse analysis system and fully addressing the interpretability concerns of deep learning-based discourse models.

Deliang Wang, Dapeng Shan, Yaqian Zheng, Gaowei Chen

Short Papers

Frontmatter
Ethical and Pedagogical Impacts of AI in Education

Artificial Intelligence is becoming pervasive in higher education. While these tools can provide customized intervention and feedback, they may pose ethical risks and sociotechnical implications. Current ethical discussions often focus on established technical issues and overlook further implications from students’ perspectives, which may increase their vulnerability. Taking a student-centered view, we apply the story completion method to understand students’ concerns about the future adoption of various analytics-based AI tools. A total of 71 students elaborated on the provided story stems. A qualitative analysis of their stories reveals students’ perceptions that AIEd may disrupt aspects of the pedagogical landscape such as learner autonomy, learning environments and approaches, interactions and relationships, and pedagogical roles. This study provides an initial insight into student concerns about AIEd and a foundation for future research.

Bingyi Han, Sadia Nawaz, George Buchanan, Dana McKay
Multi-dimensional Learner Profiling by Modeling Irregular Multivariate Time Series with Self-supervised Deep Learning

Personalised or intelligent tutoring systems are being rapidly adopted because they enable tailored learner choices in, for example, exercise materials, study time, and intensity (i.e., the number of chosen exercises) over extended periods of time. This, however, poses significant challenges for profiling the characteristics of learner behaviors, mostly due to the great diversity in each individual’s learning path, the timing of exercise accomplishments, and varying degrees of engagement over time. To address this problem, this paper proposes an innovative approach that uses self-supervised deep learning to consolidate learner behaviors and performance into compact representations via irregular multivariate time series modeling. These representations can be used to highlight learners’ multi-dimensional behavioral characteristics on a massive scale for self-directed learners who can freely pick exercises and study at their own pace. With experiments on a large-scale real-world dataset, we empirically show that our approach can effectively reveal learner individuality as well as commonality in characteristics.

Qian Xiao, Breanne Pitt, Keith Johnston, Vincent Wade
Examining the Learning Benefits of Different Types of Prompted Self-explanation in a Decimal Learning Game

While self-explanation prompts have been shown to promote robust learning in several knowledge domains, there is less research on how different self-explanation formats benefit each skill set in a given domain. To address this gap, our work investigates 214 students’ problem-solving performance in a learning game for decimal numbers as they perform self-explanation in one of three formats: multiple-choice (N = 52), drag-and-drop (N = 72) and open-ended (N = 67). We found that self-explanation performance in all three formats was positively associated with problem-solving performance. At the same time, we observed better outcomes with the drag-and-drop format than the open-ended format for solving decimal addition problems that do not remind students about carrying, but worse outcomes than the multiple-choice and open-ended format for other problem types. These results point to the nuanced interactions between the problem type and self-explanation format that should be further investigated to identify when and how self-explanation is most helpful for learning.

Huy A. Nguyen, Xinying Hou, Hayden Stec, Sarah Di, John Stamper, Bruce M. McLaren
Plug-and-Play EEG-Based Student Confusion Classification in Massive Online Open Courses

The use of Electroencephalography (EEG)-based monitoring devices in classrooms has seen greater uptake with increased interest in the Internet of Things (IoT) and human-computer interaction (HCI). The ability to interact directly with digital interfaces using brain signals offers a significant advantage for seamless, natural communication: users simply think of the desired outcome. We propose a new leave-one-subject-and-video-out paradigm alongside a plug-and-play lightweight EEG-based classification framework to accurately analyse the efficacy of EEG signals in determining students’ confusion levels. The proposed methodology achieves state-of-the-art performance, reaching 95.75% classification accuracy.

Han Wei Ng
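The abstract names the paradigm but not its mechanics; one plausible reading is a cross-validation scheme where each test fold holds out every trial from one subject and one video, so neither the person nor the stimulus is seen in training. A sketch under that assumption (all names are hypothetical):

```python
import itertools
import numpy as np

def loso_lovo_splits(subjects, videos):
    """Leave-one-subject-and-video-out: test on trials of one held-out
    subject watching one held-out video; drop both from training."""
    subjects, videos = np.asarray(subjects), np.asarray(videos)
    for s, v in itertools.product(np.unique(subjects), np.unique(videos)):
        test = (subjects == s) & (videos == v)
        if not test.any():
            continue
        train = (subjects != s) & (videos != v)
        yield np.where(train)[0], np.where(test)[0]

# for tr, te in loso_lovo_splits(subj_ids, video_ids):
#     clf.fit(X[tr], y[tr]); evaluate(clf, X[te], y[te])
```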
CPSCoach: The Design and Implementation of Intelligent Collaborative Problem Solving Feedback

We present the design of CPSCoach, a fully-automated system that assesses and provides feedback on collaborative problem solving (CPS) competencies during remote collaborations. We leveraged existing data to develop deep NLP models that automatically assess the CPS competencies from speech, achieving moderate to high accuracies (average area under the receiver operating characteristic curve of .78). We engaged 43 participants in an iterative process to design the feedback mechanism, resulting in the first prototype of CPSCoach. We conducted a user study with 20 dyads who engaged with CPSCoach over multiple rounds. Participants thought the system was usable, but they were mixed about the accuracy of the feedback. We discuss design considerations for feedback systems aimed at improving CPS competencies.

Angela E. B. Stewart, Arjun Rao, Amanda Michaels, Chen Sun, Nicholas D. Duran, Valerie J. Shute, Sidney K. D’Mello
Development of Virtual Reality SBIRT Skill Training with Conversational AI in Nursing Education

This paper presents the design and development of a conversational AI patient for SBIRT skills training. SBIRT (Screening, Brief Intervention, and Referral to Treatment) is a comprehensive public behavioral health approach commonly used by nurses and social workers to detect potential substance abuse in their patients. In the VR exam room, a nursing student practices SBIRT skills with a virtual patient powered by a conversational AI system. Development of the VR patient system began with collecting sample dialogs from a standardized patient and a nurse practitioner. In addition, extended conversations were collected through prototypes with different interaction modes. With the intelligent virtual patient, the “SBIRT VR Training Program” provides the user with a diverse selection of simulated environments and personalized training. Our research focuses on the efficacy of VR-based conversational AI training for students’ acquisition and retention of the SBIRT training material.

Jinsil Hwaryoung Seo, Rohan Chaudhury, Ja-Hun Oh, Caleb Kicklighter, Tomas Arguello, Elizabeth Wells-Beede, Cynthia Weston
Balancing Test Accuracy and Security in Computerized Adaptive Testing

Computerized adaptive testing (CAT) is a form of personalized testing that accurately measures students’ knowledge levels while reducing test length. Bilevel optimization-based CAT (BOBCAT) is a recent framework that learns a data-driven question selection algorithm to effectively reduce test length and improve test accuracy. However, it suffers from high question exposure and test overlap rates, which potentially affect test security. This paper introduces a constrained version of BOBCAT (C-BOBCAT) that addresses these problems by changing the optimization setup, enabling us to trade off test accuracy against question exposure and test overlap rates. We show that C-BOBCAT is effective through extensive experiments on two real-world adult testing datasets.

Wanyong Feng, Aritra Ghosh, Stephen Sireci, Andrew S. Lan
A Personalized Learning Path Recommendation Method for Learning Objects with Diverse Coverage Levels

E-learning has resulted in the proliferation of educational resources, but challenges remain in providing personalized learning materials to learners amidst this abundance. Previous personalized learning path recommendation (LPR) methods often oversimplified the competency features of learning objects (LOs), rendering them inadequate for LOs with diverse coverage levels. To address this limitation, an improved learning path recommendation framework is proposed that uses a novel graph-based genetic algorithm (GBGA) to optimize the alignment of features between learners and LOs. To evaluate the performance of the method, a series of computational experiments are conducted on six simulation datasets with different levels of complexity. The results indicate that the proposed method is effective and stable for solving the LPR problem using LOs with diverse coverage levels.

Tengju Li, Xu Wang, Shugang Zhang, Fei Yang, Weigang Lu
Prompt-Independent Automated Scoring of L2 Oral Fluency by Capturing Prompt Effects

We propose a prompt-independent automated scoring method for second language (L2) oral fluency that is robust to the different cognitive demands of speaking prompts. When human examiners assess L2 learners’ oral fluency, they can consider the effects of different task prompts on speaking performance, systematically adjusting their evaluation criteria across prompts. However, conventional automated scoring methods tend to ignore such variability in speaking performance caused by prompt design and use prompt-specific features of speech. Their robustness is thus arguably limited to the specific prompt used in model training. To address this challenge, we operationalize prompt effects in terms of conceptual, linguistic and phonological features of speech and embed them, together with a set of temporal features of speech, into a scoring model. We examined the agreement between true and predicted fluency scores on four different L2 English monologue prompts. The proposed method outperformed a conventional method that used only temporal features (κ = 0.863 vs. 0.797). Detailed analysis showed that the conceptual and phonological features improved the performance of automated scoring. Meanwhile, the effectiveness of the linguistic features was not confirmed, possibly because they largely reflect information that is redundant for capturing prompt demands. These results suggest that robust automated fluency scoring requires careful consideration of which characteristics of L2 speech reflect prompt effects.

Ryuki Matsuura, Shungo Suzuki
Navigating Wanderland: Highlighting Off-Task Discussions in Classrooms

Off-task discussions during collaborative learning offer benefits such as alleviating boredom and strengthening social relationships, and are therefore of interest to learning scientists. However, identifying moments of off-task speech requires researchers to navigate massive amounts of conversational data, which can be laborious. We lay the groundwork for automatically identifying off-task segments in a conversation, which can then be qualitatively analyzed and coded. We focus on in-person, real-time dialog and introduce an annotation scheme that examines two facets of dialog typical to in-person classrooms: whether utterances are pertinent to the lesson, and whether utterances are pertinent to the classroom, more broadly. We then present two computational models for identifying off-task utterances.

Ananya Ganesh, Michael Alan Chang, Rachel Dickler, Michael Regan, Jon Cai, Kristin Wright-Bettner, James Pustejovsky, James Martin, Jeff Flanigan, Martha Palmer, Katharina Kann
CTutor: Helping People Learn to Avoid Present Bias During Decision Making

Procrastination can harm many aspects of life, including physical, mental, and financial well-being. It is often a consequence of people’s tendency to prefer immediate benefits over long-term rewards (i.e., present bias). Due to its prevalence, we created C²Tutor, an intelligent tutoring system (ITS) that can potentially reduce procrastination habits by teaching planning strategies. C²Tutor teaches people how to make decisions aligned with long-term benefits, discouraging present-bias behavior while allowing for differences in users’ cognitive abilities. Our study found that C²Tutor encourages far-sighted behavior while reducing the use of maladaptive planning strategies.

Calarina Muslimani, Saba Gul, Matthew E. Taylor, Carrie Demmans Epp, Christabel Wayllace
A Machine-Learning Approach to Recognizing Teaching Beliefs in Narrative Stories of Outstanding Professors

Coding text to recognize the teaching beliefs of outstanding professors is crucial research for enhancing university teaching performance. Most previous studies adopted manual coding, so the text analyzed was limited to brief descriptive statements or questionnaires rather than full narrative stories of outstanding professors, owing to the time-consuming nature of manual coding. However, outstanding professors’ narrative stories, which contain more detailed information about their thinking and behaviors, are valuable material for recognizing their types of teaching beliefs. Therefore, to overcome the time cost of manual coding, this study proposes a machine-learning-based approach, which exploits BERT with a convolutional LSTM, to code narrative stories of outstanding teachers and identify the types of teaching beliefs they reveal. The text used for coding was a series of fourteen books published across fourteen years, The Stories of Outstanding Professors in National Taiwan University (NTU), containing one million words describing the stories of three hundred NTU outstanding professors. In identifying six categories and thirty subcategories of teaching beliefs revealed in the narrative stories, our approach outperforms comparative methods by 1% to 86% in F1 score. Comprehensive evaluations validate the effectiveness of our approach in recognizing teaching beliefs from the stories of outstanding professors and in assisting the coding of narrative text.

Fandel Lin, Ding-Ying Guo, Jer-Yann Lin
BETTER: An Automatic feedBack systEm for supporTing emoTional spEech tRaining

Feedback is a crucial process in education because it helps learners identify their weaknesses whilst motivating them to continue to learn. Existing systems often only provide a score or rating with basic explanations. Although some systems provide detailed feedback, they require manual input from teachers. This paper proposes a real-time feedback visualisation system (called BETTER) for supporting emotional speech training, which uses a visual dashboard to provide the learner with immediate written, audio, and visual feedback. The AI-based feedback system utilises pitch tracking, transcriptions, and audio modifications in addition to one-dimensional convolutional neural networks (CNNs) to categorise speech into emotional states. A preliminary experiment was conducted involving a speech expert and 8 non-native speakers to assess their cognitive load, technology acceptance, and satisfaction while using the system.

Adam Wynn, Jingyun Wang
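The abstract's 1D-CNN classifier might look roughly like the following PyTorch sketch; the input features (e.g., MFCC frames), layer sizes, and number of emotion classes are all assumptions:

```python
import torch
import torch.nn as nn

class EmotionCNN1D(nn.Module):
    """1D CNN over per-frame acoustic features:
    (batch, n_features, n_frames) -> emotion-class logits."""
    def __init__(self, n_features=40, n_classes=5):
        super().__init__()
        self.net = nn.Sequential(
            nn.Conv1d(n_features, 64, kernel_size=5, padding=2), nn.ReLU(),
            nn.MaxPool1d(2),
            nn.Conv1d(64, 128, kernel_size=5, padding=2), nn.ReLU(),
            nn.AdaptiveAvgPool1d(1),   # pools variable-length utterances
        )
        self.head = nn.Linear(128, n_classes)

    def forward(self, x):
        return self.head(self.net(x).squeeze(-1))

logits = EmotionCNN1D()(torch.randn(8, 40, 300))  # 8 utterances, 300 frames
print(logits.shape)   # torch.Size([8, 5])
```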
Eliciting Proactive and Reactive Control During Use of an Interactive Learning Environment

The dual mechanisms of control framework describes two modes of goal-directed behavior: proactive control (goal maintenance) and reactive control (goal activation on task demands). Although these mechanisms are relevant to learner behaviors during interaction with intelligent tutoring systems (ITSs), their relation to ITSs is under-researched. We propose a manipulation to induce proactive or reactive control during interaction with an online tutoring system. We present two experiments in which students solved problems using either proactive or reactive control. Study 1 validates the manipulation by investigating behavioral measures that reflect usage of the intended strategy and assesses whether either mode impacted learning. Study 2 investigates whether alternating between control modes during problem solving affects student performance.

Deniz Sonmez Unal, Catherine M. Arrington, Erin Solovey, Erin Walker
How to Repeat Hints: Improving AI-Driven Help in Open-Ended Learning Environments

AI-driven personalized support can help students learn from Open-Ended Learning Environments (OELEs). In this paper, we focus on how to effectively provide repeated hints in OELEs, when students repeat a sub-optimal behavior after receiving a hint on how to recover from the first occurrence of the behavior. We formally compare two repeated hint designs in UnityCT, an OELE that fosters Computational Thinking (CT) via free-form game design in K-6 education, with the long-term goal of providing AI-driven personalized hints.

Sébastien Lallé, Özge Nilay Yalçın, Cristina Conati
Automatic Detection of Collaborative States in Small Groups Using Multimodal Features

Cultivating collaborative problem solving (CPS) skills in educational settings is critical in preparing students for the workforce. Monitoring and providing feedback to all groups is intractable for teachers in traditional classrooms but is potentially scalable with an AI agent who can observe and interact with groups. For this to be feasible, CPS moves need to first be detected, a difficult task even in constrained environments. In this paper, we detect CPS facets in relatively unconstrained contexts: an in-person group task where students freely move, interact, and manipulate physical objects. This is the first work to classify CPS in an unconstrained shared physical environment using multimodal features. Further, this lays the groundwork for employing such a solution in a classroom context, and establishes a foundation for integrating classroom agents into classrooms to assist with group work.

Mariah Bradford, Ibrahim Khebour, Nathaniel Blanchard, Nikhil Krishnaswamy
Affective Dynamic Based Technique for Facial Emotion Recognition (FER) to Support Intelligent Tutors in Education

Facial expressions of learners are relevant to their learning outcomes, and recognizing learners’ emotional states influences the benefits of instruction or feedback provided by an intelligent tutor. However, learners’ emotions expressed during interactions with an intelligent tutor are mostly detected through learners’ self-reports or by judges who observe them manually. Automated Facial Emotion Recognition (FER) has been a challenging problem for intelligent tutors: state-of-the-art automated FER methods target six basic emotions rather than learning-related emotions (e.g., neutral, confused, frustrated, and bored). Our research therefore contributes a machine learning (ML) model trained to recognise learning-related emotions for intelligent tutors automatically, based on an Affective Dynamics (AD) model. We implement the AD model in our loss function (AD-loss) to fine-tune the ML model. In the test scenario, the AD-loss method improves the performance of state-of-the-art FER algorithms.

Xingran Ruan, Charaka Palansuriya, Aurora Constantin
Real-Time Hybrid Language Model for Virtual Patient Conversations

Advancements in deep learning have enabled the development of online learning tools for medical training, which is important for remote learning. However, face-to-face interaction is essential for practicing human-centric skills such as clinical skills. Such interactions can now be mimicked using deep learning methodologies, but existing models have limitations: lightweight models are unable to generalize beyond their training scope, while large language models tend to produce unexpected responses. To overcome this, we propose a hybrid lightweight and large language model for creating virtual patients, which can be used for real-time autonomous training of trainee doctors in clinical settings on online platforms. This ensures high-quality and standardized learning for all individuals regardless of location and background.

Han Wei Ng, Aiden Koh, Anthea Foong, Jeremy Ong
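The abstract does not state how the two models are combined; a common hybrid design is confidence-threshold routing, sketched below with stand-in callables (everything here is hypothetical, not the authors' architecture):

```python
def hybrid_reply(utterance, classify, scripted_replies, llm_generate,
                 threshold=0.7):
    """Answer from the fast, in-scope lightweight model when it is
    confident; otherwise fall back to the large language model."""
    intent, confidence = classify(utterance)
    if confidence >= threshold and intent in scripted_replies:
        return scripted_replies[intent]   # standardized, vetted answer
    return llm_generate(utterance)        # open-ended fallback

reply = hybrid_reply(
    "Do you have any chest pain?",
    classify=lambda u: ("symptom_pain", 0.92),
    scripted_replies={"symptom_pain": "It hurts a little when I breathe in."},
    llm_generate=lambda u: "(LLM-generated patient response)",
)
print(reply)
```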
Towards Enriched Controllability for Educational Question Generation

Question Generation (QG) is a task within Natural Language Processing (NLP) that involves automatically generating questions given an input, typically composed of a text and a target answer. Recent work on QG aims to control the type of generated questions so that they meet educational needs. A remarkable example of controllability in educational QG is the generation of questions grounded in certain narrative elements, e.g., causal relationship, outcome resolution, or prediction. This study aims to enrich controllability in QG by introducing a new guidance attribute: question explicitness. We propose to control the generation of explicit and implicit (wh)-questions from child-friendly stories. We show preliminary evidence of controlling QG via question explicitness alone and simultaneously with another target attribute: the question’s narrative element. The code is publicly available at https://github.com/bernardoleite/question-generation-control .

Bernardo Leite, Henrique Lopes Cardoso
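Attribute-controlled QG is often implemented by prepending control tokens to the encoder input of a seq2seq model; the sketch below illustrates that pattern with a vanilla T5 checkpoint and a made-up prefix format (the authors' fine-tuned model and exact encoding live in their repository, and an off-the-shelf t5-small will not produce meaningful questions without fine-tuning):

```python
from transformers import T5ForConditionalGeneration, T5Tokenizer

tok = T5Tokenizer.from_pretrained("t5-small")
model = T5ForConditionalGeneration.from_pretrained("t5-small")

story = "Lena hid the letter before her brother came home..."
# after fine-tuning on (attributes, story, question) triples, changing
# these prefixes is all it takes to steer generation
prompt = (f"explicitness: implicit | narrative: causal relationship | "
          f"text: {story}")

ids = tok(prompt, return_tensors="pt").input_ids
out = model.generate(ids, max_new_tokens=40)
print(tok.decode(out[0], skip_special_tokens=True))
```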
A Computational Model for the ICAP Framework: Exploring Agent-Based Modeling as an AIED Methodology

Recently, researchers have advocated for using complex systems methodologies including agent-based modeling in education. This study proposes using agent-based models to simulate teaching and learning environments. Specifically, we present ABICAP, an agent-based model that simulates learning in accordance with the ICAP framework, which defines four levels of cognitive engagement: Interactive, Constructive, Active, and Passive. The ICAP hypothesis suggests a higher level of engagement results in improved learning outcomes. To show how ABICAP can support running hypothetical studies in a risk-free and inexpensive environment, we present two simulations examining different pedagogical scenarios. We show how our model can surface counterintuitive results which may lead to a more nuanced understanding of ICAP. More generally, this paper provides a concrete example of how agent-based modeling can be used as a methodology for advancing education research.

Sina Rismanchian, Shayan Doroudi
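A toy agent-based sketch of the ICAP hypothesis (not the actual ABICAP model, whose assumptions are richer): agents accumulate knowledge at a mode-dependent rate with diminishing returns, and the simulation compares cohorts by engagement mode:

```python
import random

# illustrative ordering only: Interactive > Constructive > Active > Passive
GAIN = {"interactive": 0.08, "constructive": 0.06,
        "active": 0.04, "passive": 0.02}

class Learner:
    def __init__(self, mode, rng):
        self.mode = mode
        self.knowledge = rng.uniform(0.1, 0.3)

    def step(self):
        # diminishing returns as knowledge approaches the ceiling of 1.0
        self.knowledge += GAIN[self.mode] * (1 - self.knowledge)

def simulate(mode, n_agents=100, n_steps=30, seed=0):
    rng = random.Random(seed)
    agents = [Learner(mode, rng) for _ in range(n_agents)]
    for _ in range(n_steps):
        for a in agents:
            a.step()
    return sum(a.knowledge for a in agents) / n_agents

for mode in GAIN:
    print(mode, round(simulate(mode), 3))
```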
Automated Program Repair Using Generative Models for Code Infilling

In educational settings, automated program repair techniques serve as a feedback mechanism to guide students working on their programming assignments. Recent work has investigated using large language models (LLMs) for program repair. In this area, most attention has been focused on proprietary systems accessible through APIs. However, the limited access to and control over these systems remain a barrier to their adoption and use in education. The present work studies the repair capabilities of open large language models. In particular, we focus on a recent family of generative models which, on top of standard left-to-right program synthesis, can also predict missing spans of code at any position in a program. We experiment with one of these models on four programming datasets and show that we can obtain good repair performance even without additional training.

Charles Koutcheme, Sami Sarsa, Juho Leinonen, Arto Hellas, Paul Denny
Measuring the Quality of Domain Models Extracted from Textbooks with Learning Curves Analysis

This paper evaluates an automatically extracted domain model from textbooks and applies learning curve analysis to assess its ability to represent students’ knowledge and learning. Results show that extracted concepts are meaningful knowledge components with varying granularity, depending on textbook authors’ perspectives. The evaluation demonstrates the acceptable quality of the extracted domain model in knowledge modeling.

Isaac Alpizar-Chacon, Sergey Sosnovsky, Peter Brusilovsky
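Learning curve analysis typically fits a decaying power law to error rate as a function of practice opportunities per knowledge component; a smooth fit is evidence that an extracted concept behaves like a coherent knowledge component. A sketch with illustrative numbers (not the paper's data):

```python
import numpy as np
from scipy.optimize import curve_fit

def power_law(opportunity, a, b):
    """Error rate decays as a power law in practice opportunities."""
    return a * opportunity ** (-b)

opportunities = np.arange(1, 11)
# illustrative mean error rates on the t-th opportunity with one concept
error_rate = np.array([.55, .44, .38, .33, .31, .28, .27, .25, .24, .23])

(a, b), _ = curve_fit(power_law, opportunities, error_rate, p0=(0.5, 0.3))
print(f"fit: error ~= {a:.2f} * opportunity^(-{b:.2f})")
# a flat or noisy curve would instead suggest the extracted concept does
# not correspond to a single knowledge component
```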
Predicting Progress in a Large-Scale Online Programming Course

With vast amounts of educational data being generated in schools, educators are increasingly embracing data mining techniques to track student progress, especially in programming courses, a growing area of computer science education research. However, there are few accurate and interpretable methods to track student progress in programming courses. To bridge this gap, we propose a decision tree approach to predict student progress in a large-scale online programming course. We demonstrate that this approach is highly interpretable and accurate, with an overall average accuracy of 88% and an average dropout accuracy of 82%. Additionally, we identify important slides, such as problem slides, that significantly impact student outcomes.

Vincent Zhang, Bryn Jeffries, Irena Koprinska
Examining the Impact of Flipped Learning for Developing Young Job Seekers’ AI Literacy

While AI literacy is regarded as an essential competency for citizens in a rapidly changing society, it is challenging for people without computer science (CS) backgrounds to develop a sufficient level of AI competency. The main goal of this research is to examine the impact of the flipped learning approach in equipping non-CS major students who intend to pursue careers in AI-related fields with basic AI literacy. Among various learner-centered methods, flipped learning was chosen as the main pedagogical frame for designing an AI literacy curriculum. The participants were 80 adult learners enrolled in an AI education program in Korea. The control group (N = 40) was taught with traditional instructor-centered methods, whereas the experimental group (N = 40) was taught with a flipped learning method. Our results indicate that AI literacy education with flipped learning improves the learning achievements of both CS majors and non-majors, and is especially effective for higher-order problem-solving skills.

Hyo-Jin Kim, Hyo-Jeong So, Young-Joo Suh
Automatic Analysis of Student Drawings in Chemistry Classes

Automatic analyses of student drawings in chemistry education have the potential to support classroom teaching. To date, related work on handwritten chemical structures or formulas is limited to well-defined presentation formats, e.g., Lewis structures. However, the large variety of possible illustrations in student drawings in chemical education has not been addressed yet. In this paper, we present a novel approach to identify visual primitives in student drawings from chemistry classes. Since the field lacks suitable datasets for the given task, we introduce a method to synthetically create a dataset for visual primitives. We demonstrate how detected visual primitives can be used to automatically classify drawings according to a taxonomy of drawing characteristics in chemistry and physics. Our experiments show that (1) the detection of visual primitives in student drawings, and (2) the subsequent classification of chemistry- and physics-specific drawing characteristics is possible.

Markos Stamatakis, Wolfgang Gritz, Jos Oldag, Anett Hoppe, Sascha Schanze, Ralph Ewerth
Training Language Models for Programming Feedback Using Automated Repair Tools

In introductory programming courses, automated repair tools (ARTs) are used to provide feedback to students struggling with debugging. Most successful ARTs take advantage of context-specific educational data to construct repairs to students’ buggy code. Recent work in student program repair using large language models (LLMs) has also started to utilize such data. An underexplored area in this field is the use of ARTs in combination with LLMs. In this paper, we propose to transfer the repair capabilities of existing ARTs to open large language models by finetuning LLMs on ART corrections of buggy code. We experiment with this approach using three large datasets of Python programs written by novices. Our results suggest that a finetuned LLM provides more reliable and higher-quality repairs than the repair tool used for finetuning the model. This opens avenues for further deploying and using educational LLM-based repair techniques.

Charles Koutcheme
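The core of the approach is using the ART as a silver-label teacher: pair each buggy novice program with the tool's correction and fine-tune an open LLM on the pairs. A sketch of the data-construction step (the function names are placeholders; any standard seq2seq fine-tuning script can consume the output):

```python
import json

def build_finetuning_corpus(submissions, art_repair, out_path):
    """Pair buggy novice programs with ART corrections as silver labels
    for fine-tuning an open LLM on program repair."""
    with open(out_path, "w") as f:
        for code in submissions:
            fixed = art_repair(code)     # ART output; hypothetical callable
            if fixed is None:            # skip programs the tool cannot fix
                continue
            example = {"input": f"# Fix the bug:\n{code}", "output": fixed}
            f.write(json.dumps(example) + "\n")
```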
Backmatter
Metadata
Title
Artificial Intelligence in Education
Editors
Ning Wang
Genaro Rebolledo-Mendez
Noboru Matsuda
Olga C. Santos
Vania Dimitrova
Copyright Year
2023
Electronic ISBN
978-3-031-36272-9
Print ISBN
978-3-031-36271-2
DOI
https://doi.org/10.1007/978-3-031-36272-9
