Abstract
Inclusion of people with disabilities in the open labor market using robotic assistance is a promising new and important field of research, albeit a challenging one. People with disabilities are severely underrepresented in the open labor market, although inclusion adds significant value on both financial and social levels. Here, collaborative industrial robots offer great potential for support. This work conducted a month-long, in-field user study in a workshop for people with disabilities to improve learning progress through collaboration with an innovative intelligent robotic tutoring system. Seven workers with a wide variety of disabilities solved assembly tasks while being supervised by the system. In case of errors or hesitations, different modes of assistance were automatically offered, including robotic pointing gestures, speech prompts, and calling a supervisor. Which assistance to offer each participant during the study was personalized via a shared policy learned by reinforcement learning. For this purpose, new, non-stationary Contextual Multi-Armed Bandit algorithms were developed during the prior simulation-based study planning to include the workers' contextual information. Pioneering results were obtained in three main areas. The participants significantly improved their skills in terms of time required per task. The algorithm learned within only one session per participant which modes of assistance were preferred. Finally, a comparison between the simulation and a re-simulation including the study results revealed that the underlying basic assumptions were correct, but that individual variation led to strong performance differences in the real-world setting. Looking ahead, the innovative system developed could pave the way for many people with disabilities to enter the open labor market.
1 Introduction
According to the World Health Organization, 15% of the world’s population lives with some form of disability. But despite the United Nations Convention on the Rights of Persons with Disabilities advocating for the right of people with disabilities to work on an equal basis with others, only 19.6% of women and 52.8% of men with disabilities have employment [1]. Necessary steps, such as reasonable accommodations, must be taken to create access to employment [2]. Pursuing a job enriches individual well-being tremendously since it provides financial security and makes people feel part of society. Yet more than 420 million people with disabilities of working age are unemployed [1].
This study explored the use of an innovative Intelligent Robotic Tutoring System (IRTS) to support the inclusion of people with disabilities. Collaborative robots (cobots) are already widely researched in industry for collaborative tasks [3‐6]. Moreover, numerous studies have explored the therapeutic benefits of social robots for people with disabilities; learning successes of children with autism [7‐10] and of students with intellectual disabilities [11] are considered especially often. But few, thus far, have focused on the potential of cobots to assist people with disabilities in learning and performing work tasks [12‐14]. This, in turn, is an essential factor in pursuing a job in the long term. While theoretical considerations of such collaborations [15], acceptance tests [16], or experiments under controlled laboratory conditions [17‐19] exist, few studies have been conducted under real working conditions as in [12]. None were performed over a long period of time. Even in the well-researched, related topic of social robots for children with autism, hardly any long-term studies exist, exceptions being [20] and [21]. In order to test the real-world validity of cobot-assisted training in the largely unexplored field of supporting workers with disabilities, the present study applied it under real working conditions over a longer observation period. Thus, it represents a groundbreaking addition to this field of research.
The overall goal of this work is to teach work tasks using an adaptive robotic tutoring system that is individually responsive to people with highly individual and diverse disabilities. To achieve this goal, this work includes the following three key contributions, each of which is explained in more detail below.
1. Developing an Intelligent Robotic Tutoring System based on new reinforcement learning algorithms
2. Evaluating the Intelligent Robotic Tutoring System in a month-long field study in a workshop for people with disabilities
3. Using and evaluating a simulation-based study design
First contribution: reinforcement learning algorithm
Nowadays, it is widely known that the best learning effects are achieved when tutoring strategies are customized to the users' needs [22, 23]. While in the past the "one size fits all" method was practiced, in which every student received the same instructions, today there is a whole research area, that of Intelligent Tutoring Systems (ITS), which aims to make tutoring as individual and, consequently, as effective as possible [24‐26]. Therefore, the IRTS developed in this work is also designed to offer individual assistance to the users in order to provide the best possible support.
Optimal learning occurs when the task is neither too difficult nor too easy, as stated by the challenge point theory [27]. For this reason, the underlying algorithm is designed to pursue two competing objectives, namely to improve the users' task completion rate while at the same time providing as little assistance as possible. Therefore, the teaching algorithm is not based on predefined rules but adapts individually to the users during the runtime of the study. Here, different from e.g. [21], it transfers the learning effects of individual participants to improve its policy for all participants simultaneously, which minimizes the algorithm's learning time. This is new since similar research on people with disabilities often focuses on the assistive system. The decision units are often delegated to a Wizard of Oz technique [28, 29], in which participants believe they are working with an autonomous system, when in fact the researcher is controlling system responses. Content customization [30] or easy-to-follow tutoring strategies [31] are also commonly used in these domains. In general ITS research, the focus is often put on symbolic interaction, such as logical proofs or language training with students without disabilities. These areas of application have already been investigated with great success [32, 33]. This work combines both fields.
ITS include a variety of methods, such as dynamic Bayesian networks, fuzzy decision trees, and (Partially Observable) Markov Decision Problems ((PO)MDPs), to model student knowledge and learning. The latter approach is used in this work. Briefly, an agent, here the robot, exerts an action, here a robotic assistance, on the environment, here the human. By doing so, it changes the state [34], here the knowledge state of how well the human has learned the current task [35]. Depending on whether the action was successful or not, the agent receives a reward and thus learns which actions are successful and which are not. The complexity of the problem at hand becomes clear from the fact that certain observations can be made, such as the correct execution of a task, but the exact state of the person's knowledge cannot be directly observed. This is why usually POMDPs are used, in which the states are mapped to probability distributions of observations [33, 36]. While various methods to solve POMDPs exist, such as model-based [37] or model-free [38] ones, they all rely on some form of assumptions about cognitive learner models and need a large amount of data. Newer research, such as on deep reinforcement learning, needs even more data [39]. Collecting such a large amount of data is not feasible for the target group considered in this work. In addition, the abilities and limitations of the target group are so heterogeneous that no reliable assumptions can be made about learning models, which are usually expressed as mathematical formulas. Wrong assumptions would worsen the performance [40]. In a comparison between POMDPs and Multi-Armed Bandits (MABs) [41], which can be seen as one-state MDPs, it was found that the latter are better suited for heterogeneous target groups. The reason is that they are widely independent of pre-defined cognitive and student models. Instead, they adapt to each student's characteristics online, based on real-world feedback, and are more computationally efficient [42].
These aspects are very important when working with people with disabilities, who vary significantly in their cognitive and motoric abilities and therefore represent a very heterogeneous group for which no reliable, individual user models are available. MABs have also been successfully used in related assistance tasks, such as personalizing robot movements in handover activities [43] or helping make decisions in human-robot teams [44]. To adapt the algorithm even more individually to the different users, Contextual Multi-Armed Bandits (CMABs) can be used [45]. CMABs extend MABs with additional information, such as the human's knowledge state or features of the assistance options. Thus, CMABs form a promising intermediate stage between POMDPs and MABs.
Following this approach, this work compared and further developed CMABs to support the considered target group in the best possible way. The main focus here is on the first-time use of CMAB algorithms to help a robotic tutor assist people with disabilities in learning tasks. Since the level of knowledge of people is constantly changing during a learning process, the algorithm must adapt to a non-stationary environment. In general, non-stationarity in MABs has been widely researched. Piecewise stationary environments [45, 46] are considered, as well as continuously changing environments [47]. Solution approaches are diverse, such as the \(\gamma \)-restart algorithm [48] or sliding windows [49]. In general, non-stationary approaches often slowly forget old conditions over time. In the present study, however, the algorithms are designed to remember the cumulative history of past learners so that they know which assistance to offer to new, unskilled learners. Thus, the users do not have to learn new tasks simultaneously; one learner can learn after the other. As explained in more detail in Materials and Methods, several new non-stationary CMAB algorithms were developed for this purpose. The availability of multiple, comparably good algorithms allows switching between algorithms if acceptance problems arise in the real-world evaluation, something that always has to be kept in mind for the target group at hand. The main features of the developed algorithms are: they do not offer more help than necessary; they constantly adapt to the ever-changing state of human knowledge; and they apply their knowledge of individual participants to others during the study in order to adapt quickly.
Second contribution: month-long in-field user study
After development, the IRTS was evaluated in a one-month user study in a sheltered workshop (SW) for people with disabilities during normal working days. While such a study design is very challenging and difficult to organize, it is important for several reasons. Real, familiar, and accustomed working conditions can only be created in a real work environment during normal working hours and cannot be replicated in controlled laboratory conditions. This gives the data much greater validity, especially given the considered target group's susceptibility to change and need for steady daily routines [50]. Simply participating in a study can cause changes in participants' performance that are unrelated to the researched topic yet influence the outcome [16]. To minimize the influence of these effects and to obtain meaningful results, it is critical to run the study under real conditions and over multiple days. In addition, only a study run over a longer period allows the observation of sustainable learning effects. It is also the only possibility for the algorithm to learn its behavior under real conditions and thus adapt individually to the participants.
Due to the multitude of challenges of such a study design, even simple, non-adaptive robotic assistance systems for people with disabilities have so far only been tested under laboratory conditions [17, 18], or on a single day [12]. Never before have the learning effects of an intelligent robotic tutor been analyzed in the field, let alone over a longer period of time. Thus, in this work it was possible for the first time to determine long-term learning effects for people with disabilities through an IRTS.
Third contribution: simulation versus reality
The third key contribution is the simulation-based study design and the final comparison between simulation and reality. Several methods exist to evaluate algorithms without live data. Some, like Inverse Propensity Scoring or Doubly Robust Policy Evaluation, are not suited for non-stationary policies [51]; others, such as rejection sampling [52, 53], rely on logged data, which is not available for the application under consideration. In continuously changing environments, the direct method is often used [46]. Here, first a reward estimator \({\hat{r}}({{\varvec{x}}},a)\) of choosing arm a with context x is built, based on logged or randomly generated data. Then the policy is evaluated against the estimator [51], which is the approach chosen in this work.
However, since many assumptions enter such simulations, such as the state of knowledge, it is not always clear what validity their results have. In [54] it is already indicated that results in reality and theory can differ. In [12] it has been shown that even group leaders cannot always correctly assess the skills of their employees, which makes correct specification of learner models nearly impossible. Since most ITS research and the algorithms developed in this context are evaluated based on pure simulations or logged data [33, 51], an appropriate comparison is important to evaluate the validity of simulation results for the real world. This work uses a simulation, as is state of the art, to plan the in-field study and evaluates the limitations of simulation results using the insights of the field study.
Fig. 1
Setup of the intelligent robotic tutoring system in a workshop for people with disabilities. During the study, participants solved the task of assembling given Lego models, while a collaborative, industrial robot manipulator offered different pointing gestures or voice prompts in case of errors. The current working step was detected by a depth camera [12]
The following sections describe the implementation of the first IRTS and its evaluation in a month-long study in a SW, as Fig. 1 shows. During the study, participants performed an abstracted assembly task, namely building a given Lego model. Meanwhile, a collaborative industrial robotic manipulator followed the individual work steps using a depth camera and offered one of six different modes of assistance in the event of errors or concentration problems. Which assistance was offered, when, and to whom was learned individually over the study period via reinforcement learning.
2 Materials and Methods
The objectives of this work were to investigate whether an intelligent robotic tutoring system could assist people with disabilities in learning new skills on the job, as well as to examine the reinforcement learning algorithm, designed in simulation, on real-world data. To do so, the algorithm was first developed and tested for suitability for the application area under consideration by means of a simulation. Then, the IRTS was tested for one month in a SW.
2.1 Simulation
The Linear Upper Confidence Bound (LinUCB) algorithm [52] is the most widely used CMAB model [55] and therefore serves as the baseline for the developed algorithms. Furthermore, the presented IRTS is supposed to offer as much help as needed but as little help as possible. To achieve this, costs of different magnitudes were assigned to the different modes of assistance, with the most invasive ones having the highest costs. In the simulation, these are fictitious modes of assistance with randomly generated costs. Each assistance used receives a binary reward of 0 on failure or 1 on success. In case of success, the cost of the assistance is subtracted from the generated reward. Thus, the more knowledge a participant has, the less invasive the chosen modes of assistance should become. As a consequence, the algorithm has to adapt to a changing environment, i.e. the learning progress. In this work, the environmental changes are therefore not only continuous but also directed: just because some modes of assistance become better, others do not necessarily become worse; they merely yield less reward. To achieve this goal, as well as the goal mentioned in the introduction of transferring algorithm knowledge between the participants, the following three algorithms were derived:
1. Random pre-training with subsequent experience replay ("LinUCB-E")
2. With probability p = 0.1, modes of assistance with lower costs are chosen ("LinUCB-P")
3. With probability p = l, modes of assistance with lower costs are chosen ("LinUCB-LP"), where l is the experimentally measured learning state of the user.
The three developed algorithms, as well as an omniscient policy, which has the optimal parameters for every assistance, and a normal LinUCB baseline policy were then tested under the constraints expected in the in-field study. According to the expected scope and possibilities of the study, the simulation ran with seven participants and six modes of assistance for a total of 1000 trials. Here, the participants' knowledge was simulated following the Item Response Theory (IRT) [56]. All algorithms were evaluated under two scenarios. In both cases, the algorithm is updated after each arm used. In the first scenario, the users learn simultaneously, i.e. after each arm used, a different user can perform the next task. In the second scenario, the users learn one after the other, i.e. first only the first user performs a series of tasks, then only the second user, and so on.
In the following, first the implementation of the simulation is explained and then details about the algorithms are provided. To use the common notation of CMABs, the following refers to “arms”, rather than “modes of assistance”, but the words may be interchanged as desired.
2.1.1 Omniscient Policy and Simulated User Feedback
Following the direct method, in the simulation all algorithms are compared against a d-dimensional ground-truth bandit parameter \(\varvec{\Theta }{^*}\in {{\mathbb {R}}}^d\) with \(\Vert {\varvec{{\Theta }}{^*}}\Vert _2\le 1\), which was randomly generated to avoid any bias. This omniscient policy \(\varvec{{\Theta }}{^*}\) represents the optimal parameters for every arm; it knows which arm to take at which trial and does not have to learn anything. By normalizing the entries to 1, the expected reward, given by \({\mathbb {E}}[r_{t,a} \mid {\varvec{x}}_{t,a} ]= {\varvec{x}}_{t,a}^{\text {T}} {\varvec{\Theta }}{_{a}^*}\) with the optimal parameters \({\varvec{\Theta }}{_{a}^*}\), arm a, and context \({\varvec{x}}_{t,a}\) at trial t, becomes the probability that an arm succeeds. The context vectors, too, were randomly generated. The user's context information consists of age, hearing impairment, cognitive impairment, and motoric impairment, all represented as a Performance Index (PI). The PI represents the fraction to which a person with a disability can perform a task, compared to a person without a disability [57]. The arm contexts include task difficulty, amount of help, motoric help, and acoustic help.
Additionally, the environment, and thus the user's knowledge, is assumed to change continuously while the user learns. Following the IRT, the probability of solving a task S(t) at any trial t is modeled using a sigmoidal learning curve described in Equation 1. Here, every simulated user has its own randomly generated constant \(\delta \) and \(\beta \) values, with \(\delta \) being the user-dependent difficulty of the task and \(\beta \) being the discrimination coefficient. The context and the learning behavior were not modeled as dependent on each other, since previous studies show that the context cannot always be reliably determined and, consequently, a correlation cannot always be observed in reality [12].
In the scenario considered in this work, just because some arms become better, others do not become worse. At some point, it is expected that all arms lead to a successful execution of the task. Therefore, this work presents the new approach of not adjusting the underlying ground-truth bandit parameter \({\varvec{\Theta }}{_{a}^*}\) as in [46, 47], but adding a second probability distribution S(t) on top of \({\mathbb {E}}[r_{t,a}\mid {\varvec{x}}_{t,a}]\).
When abbreviating \({\mathbb {E}}[r_{t,a}\mid {\varvec{x}}_{t,a}]\) with E(t, a), the final success probability P(t, a) of a given arm results from combining E(t, a) and S(t).
Finally, noise is added to the final user feedback by sampling from a Bernoulli distribution based on P(t, a).
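To make the simulated feedback loop concrete, the following minimal Python sketch illustrates the direct method described above. The context dimension, the sigmoid parameterization of S(t), and especially the way E(t, a) and S(t) are combined (here P = E + (1 − E)·S, chosen only so that every arm eventually succeeds) are illustrative assumptions and not the paper's exact equations.

```python
# Illustrative sketch of the simulated user feedback (Sect. 2.1.1).
# Dimensions, the sigmoid form of S(t), and the combination of E(t,a)
# and S(t) are assumptions made for this sketch only.
import numpy as np

rng = np.random.default_rng(0)
d, n_arms, horizon = 8, 6, 1000          # context size, arms, trials

# Ground-truth bandit parameters of the omniscient policy, normalized so
# that the expected reward x^T Theta*_a can be read as a probability.
theta_star = rng.random((n_arms, d))
theta_star /= np.linalg.norm(theta_star, axis=1, keepdims=True)

def expected_reward(x, a):
    """Direct method: E[r_{t,a} | x_{t,a}] = x^T Theta*_a."""
    return float(np.clip(x @ theta_star[a], 0.0, 1.0))

def learning_curve(t, delta, beta):
    """Sigmoidal IRT-style curve S(t); one common parameterization,
    assumed here because Equation 1 is not reproduced in this text."""
    return 1.0 / (1.0 + np.exp(-beta * (10.0 * t / horizon - delta)))

def sample_feedback(x, a, t, delta, beta):
    """Combine arm success probability and learning state (assumption:
    P = E + (1 - E) * S, so every arm eventually succeeds), then add
    Bernoulli noise as the final user feedback."""
    e, s = expected_reward(x, a), learning_curve(t, delta, beta)
    return int(rng.binomial(1, e + (1.0 - e) * s))

# Example call with one random, normalized context vector:
x = rng.random(d); x /= np.linalg.norm(x)
print(sample_feedback(x, a=2, t=500, delta=4.0, beta=6.0))
```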
2.1.2 Simulated Study
The developed algorithms, as well as the omniscient policy and the baseline policy, were tested under the constraints expected in the in-field study. The algorithm that comes closest to the optimal results of the omniscient policy in the simulation will be used for the field study. Based on [12], the IRTS will include six different modes of assistance, and seven participants will take part. Assuming that each participant participates in the study for a total of about 3 h, distributed over one month, this gives a total of 21 h. At the request of the supervisors of the sheltered workshop, the participants always have 30 s to solve the task independently before the IRTS intervenes. The modes of assistance take on average 30 s. Roughly speaking, 1260 modes of assistance can therefore be performed over the entire study period. Since it cannot be assumed that an assistance is performed at every step, the number of actual modes of assistance is probably even lower. Therefore, only 1000 trials were performed in the simulation to investigate whether the desired effects can be seen within the scope of the study. Bricks without robotic assistance can be placed much faster because waiting times are eliminated; assuming a placing time of 10 s, an additional 1560 bricks could be laid without assistance. Furthermore, the participants' learning curves were calibrated to allow full learning of the task in simulation to see the desired adaptation of the different algorithms. For this purpose, the randomly chosen values for \(\delta \) and \(\beta \) were restricted to the ranges [3, 5] and [4, 9], respectively. Due to Covid-19-related time constraints during study planning and execution, each algorithm was run six times during planning, and a mean value and standard deviation were calculated. The choice of algorithms during the study was based on these results. However, more detailed analyses were performed in the follow-up to further verify the interchangeability of the algorithms, as described in the Discussion.
2.1.3 Linear Upper Confidence Bound Algorithm
The developed algorithms are based on the widely cited LinUCB algorithm [55] and extend it with directional non-stationarity. This means that in the use case considered in this work, not only is the environment non-stationary, but it changes in a specific direction, namely always from the unlearned to the learned state of the user. The LinUCB selects the arm to be executed based on the Upper Confidence Bound (UCB) [58, 59]. UCB algorithms estimate the mean payoff \({\hat{\mu }}_{t,a}\) at trial t of each arm a, and the corresponding confidence interval \(c_{t,a}\) so that \(\vert {\hat{\mu }}_{t,a}-\mu _{a}\vert <c_{t,a}\) holds with high probability, where \(\mu _{a}\) is the true payoff. Then the arm is selected that achieves the highest upper confidence bound according to \(a_t=arg\,max_a ({\hat{\mu }}_{t,a}+c_{t,a})\). Moreover, the LinUCB assumes that the expected payoff \({\mathbb {E}}[r_{t,a}\mid {\varvec{x}}_{t,a}]\) of an arm a is linear in its feature vector \({\varvec{x}}_{t,a}\) with some unknown coefficient vector \({\varvec{\Theta }}{_{a}^*}\). The assumption of linearity makes the algorithm computationally efficient. The expected mean payoff of a disjoint model is thus described as
\({\mathbb {E}}[r_{t,a}\mid {\varvec{x}}_{t,a}] = {\varvec{x}}_{t,a}^{\text {T}}\, {\varvec{\Theta }}{_{a}^*}\)
for every arm, with r being the reward. Following Ridge regression and introducing a hyper-parameter \(\alpha \) that is used to scale the upper deviation (the exploration), the total expected payoff for arm a at trial t becomes
\(p_{t,a} = {\varvec{x}}_{t,a}^{\text {T}}\, \hat{\varvec{\Theta }}_{a} + \alpha \sqrt{{\varvec{x}}_{t,a}^{\text {T}} {\varvec{A}}_{a}^{-1} {\varvec{x}}_{t,a}}\)
and the arm with \(arg\,max_{a\in A_t}(p_{t,a})\) will be selected [52]. Here, \({\varvec{A}}_a\) is the covariance matrix with respect to the context data \({\varvec{x}}_{t,a}\) for each arm at trial t.
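For reference, the following is a minimal sketch of the disjoint LinUCB selection and update rule described above (following [52]); the class interface and the default \(\alpha \) value are illustrative choices, not the study's implementation.

```python
# Minimal disjoint LinUCB sketch (cf. [52]); A_a is the per-arm covariance
# matrix, alpha scales the exploration term.
import numpy as np

class LinUCB:
    def __init__(self, n_arms, d, alpha=1.0):
        self.alpha = alpha
        self.A = [np.eye(d) for _ in range(n_arms)]    # A_a, initialized to identity
        self.b = [np.zeros(d) for _ in range(n_arms)]  # accumulated reward-weighted contexts

    def select(self, contexts):
        """contexts[a] is x_{t,a}; returns the arm maximizing the UCB p_{t,a}."""
        p = []
        for a, x in enumerate(contexts):
            A_inv = np.linalg.inv(self.A[a])
            theta_hat = A_inv @ self.b[a]              # ridge-regression estimate
            p.append(x @ theta_hat + self.alpha * np.sqrt(x @ A_inv @ x))
        return int(np.argmax(p))

    def update(self, a, x, reward):
        self.A[a] += np.outer(x, x)
        self.b[a] += reward * x
```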
2.1.4 Costs and Recurrent Component
The algorithm must operate under two competing objectives. On the one hand, it should lead to an improvement of the task completion rate. On the other hand, it should do so while providing minimal assistance to encourage users as much as possible. In other words, the algorithm should provide as much help as necessary, but as little help as possible. In addition, it must constantly adapt to the directed changing environment, i.e., to the user's increase in knowledge. At the same time, old states should not be forgotten, to account for new, unskilled users. To allow the CMAB algorithm to register, learn, and adapt to the change in the user's knowledge state, costs of different magnitudes were assigned to the different modes of assistance, i.e., the arms of the bandit. To offer as little help as necessary, the more elaborate arms were assigned higher costs than the less elaborate ones. These costs are then subtracted from each reward received. Thus, the more invasive arms receive a lower overall reward, so that for similar probabilities of success, the arms with lower costs are selected. In the approach chosen here, there is no budget within which the costs have to be managed, as for example in constrained bandit problems [60]. Instead, the costs serve to minimize the assistance chosen, which makes it possible to achieve the two competing objectives. Figure 2 shows the effect of cost on the choice of arm by showing the cumulative number of times each arm was chosen as a function of the number of trials. To avoid bias in the data, the costs were randomly generated and appropriately assigned to the arms' probabilities of success, which were also randomly generated.
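As a small illustration of this cost mechanism, the sketch below applies the rule stated above (binary success reward, cost subtracted on success). The function name is a placeholder; the cost values shown are those of the later study (Table 1), whereas the simulation used randomly generated costs.

```python
# Cost-adjusted reward: binary success reward minus the cost of the chosen
# assistance on success. Costs here are the study values from Table 1;
# the simulation used randomly generated costs.
costs = [0.20, 0.42, 0.46, 0.50, 0.68, 0.90]

def cost_adjusted_reward(arm, success):
    return 1.0 - costs[arm] if success else 0.0
```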
Fig. 2
Simulation of the omniscient policy showing the ideal algorithm behavior. At the beginning, Arm 5 is used the most, which has the highest cost and at the same time the highest chance of success. The more the simulated users learn, the lower arms are chosen until Arm 6, which has the lowest cost, is used almost exclusively
First, Arm 5, which has the highest cost, is chosen because it has the greatest chance of success with unskilled users. With increasing learning progress, the other arms also promise success and Arm 6, with the lowest cost, is increasingly selected. This can be seen in its larger gradient from about trial 800 compared to the other arms. However, all arms are still used, which is due to the fact that not all users have completely learned all skills within the 1000 trials; the arm with the lowest cost is therefore used predominantly, but not yet exclusively.
Furthermore, a recurrent component was added to the context. While the CMAB learns per arm and over all users, the individual user performance is additionally tracked. Using a moving window of the last ten executions of an arm per user, the latest success probability of that arm was determined. Based on the assumption that with increasing learning progress the lower arms also lead to success, the variance of all arms from 1 is calculated on every trial (Equation 5), with n being the total number of arms of the current task and \(s_{a,t,u}\) being the success probability of arm a at trial t for user u.
The learning state \(l_{t,u}^T\) thus provides information about how well user u masters task T at trial t. Every user context is extended by \(l_{t,u}^T\), which is updated at each trial.
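The following sketch shows one plausible reading of this recurrent component. Since Equation 5 is not reproduced in this text, the exact form of the learning state (here taken as one minus the mean squared deviation of the windowed success probabilities from 1) and all names are assumptions for illustration only.

```python
# One plausible reading of the recurrent learning-state context l_{t,u}^T:
# per user, track each arm's success over a moving window of its last ten
# uses and measure how far the arms still are from always succeeding.
# The exact formula (Equation 5) is not reproduced here; this is a sketch.
from collections import defaultdict, deque
import numpy as np

window = defaultdict(lambda: deque(maxlen=10))     # (user, arm) -> last outcomes

def record(user, arm, success):
    window[(user, arm)].append(int(success))

def learning_state(user, n_arms):
    """Close to 1 when all arms tend to succeed (task learned),
    close to 0 when arms still fail often (task not yet learned)."""
    s = [np.mean(window[(user, a)]) if window[(user, a)] else 0.0
         for a in range(n_arms)]
    return 1.0 - float(np.mean([(1.0 - sa) ** 2 for sa in s]))
```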
In order to meet all the above requirements for the algorithm, three different algorithms were developed and compared in a simulation. All include the \(l_{t,u}^T\) context and were specifically conceptualized to operate in a setting where the notion of a cost of an arm exists. The normal LinUCB is simulated without and with the recurrent component (LinUCB-R). The different algorithms were developed because, following the state of the art, their effectiveness is assessed based on simulation results and cannot be tested in real-world settings in advance; the use in the real world, however, is the main goal of this work. The existence of different algorithms of comparable quality, all following the same principles, allows the algorithm to be replaced in the real-world setting in case insufficient assistance is provided and user acceptance problems are triggered.
Fig. 3
Performance of six different Contextual Multi-Armed Bandit algorithms in the simulated study with the users learning simultaneously (top) and one after the other (bottom). The LinUCB-E algorithm generates the most reward in both cases. The results for trials 990–1000 are shown enlarged. The plots show the average and standard deviation of six rounds per algorithm to account for the simulation's random effects
The LinUCB-E algorithm, where the E stands for experience replay, is pre-trained with randomly generated \(l_{t,u}^T\) values. Additionally, during the simulation or study, experience of previous trials is replayed. This is done for the following reasons. The normal LinUCB algorithm cannot detect the changing environment. At the beginning, when a lot of help is needed, it learns that invasive arms lead to success, and later, as soon as less invasive arms would also be successful, they are not tested anymore. Pre-training prevents this, because the algorithm learns the meaning of the \(l_{t,u}^T\) context and can react to a change. While in many non-stationary bandit problems the past is not relevant and can be forgotten [47, 49], this is different in this work's use case. On the contrary, in order for untrained users to be able to use the same policy, and for knowledge to be transferred efficiently between users, the initial states of the untrained users must be remembered. For this reason, experience replay was implemented. Experience replay is already often used in deep Q-learning and is especially useful when little data or few trials are available [61, 62].
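A minimal sketch of this experience-replay idea is given below; buffer handling and the number of replayed samples per step are illustrative assumptions (the actual LinUCB-E is additionally pre-trained on randomly generated \(l_{t,u}^T\) values, which is not shown here).

```python
# Sketch of the experience-replay component of LinUCB-E: past
# (context, arm, reward) tuples are kept and occasionally re-applied to
# the model so that early, "unskilled" states are not forgotten.
import random

replay_buffer = []                            # (context, arm, reward) tuples

def update_with_replay(model, x, arm, reward, n_replays=5):
    model.update(arm, x, reward)              # regular online update (LinUCB above)
    replay_buffer.append((x, arm, reward))
    for x_old, a_old, r_old in random.sample(
            replay_buffer, k=min(n_replays, len(replay_buffer))):
        model.update(a_old, x_old, r_old)     # replay a random past experience
```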
The LinUCB-P algorithm, where the P stands for probability-based, chooses the arm with the next lower cost with a fixed probability of p = 0.1. This forces the algorithm to explore arms that might seem less promising from experience but will end up giving more reward once the user has higher success probabilities.
The LinUCB-LP algorithm, where the LP stands for \(l_{t,u}^T\)-probability-based, is the same as the LinUCB-P algorithm but uses the \(l_{t,u}^T\) value instead of a fixed p-value to select lower modes of assistance.
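The exploration tweak shared by LinUCB-P and LinUCB-LP can be sketched as follows; only the source of the probability differs (a fixed p = 0.1 versus the current learning state l). The helper name and the exact way of picking the "next lower cost" arm are illustrative assumptions.

```python
# Sketch of the probability-based exploration of LinUCB-P / LinUCB-LP:
# with probability p, the arm with the next lower cost replaces the
# UCB-maximizing arm. LinUCB-P uses a fixed p = 0.1; LinUCB-LP uses the
# current learning state l of the user instead.
import random

def select_with_lower_cost(model, contexts, costs, p):
    arm = model.select(contexts)                        # regular LinUCB choice
    if random.random() < p:
        cheaper = [a for a in range(len(costs)) if costs[a] < costs[arm]]
        if cheaper:
            arm = max(cheaper, key=lambda a: costs[a])  # next lower cost
    return arm
```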
Figure 3 shows the different cumulative rewards of all algorithms under two scenarios: all users learn simultaneously, and the users learn one after another.
2.2 In-Field Study
An earlier article provides further details about the behavior-tree-based control structure of the developed robotic system, but without the tutoring function implemented in this work. The article also provides more details about the real-world setup, how the single components are implemented, and the advantages of a robotic system compared to working without robotic assistance [12]. A brief description, as well as extensions of the system and the timeline of the study are presented below.
2.2.1 Intelligent Robotic Tutoring System
The developed IRTS is a further development of the assistive system presented in [12]. In [12], design choices for an assistive system for people with disabilities at work were derived on the basis of an extensive literature review. In a one-day field study, the system was evaluated to determine whether it can support the immediate solution of new tasks. For this purpose, all participants were offered the same predefined assistance sequence in case of errors or hesitation. The system did not learn and had no speech assistance. Nevertheless, the results of the study were very promising, as almost three times as many parts could be placed when the assistive system helped with the execution. Based on these promising results, the assistive system was further developed into an intelligent tutoring system that learns along with the users, as presented in this work. The IRTS is structured as follows.
Table 1
Different modes of assistance of the Intelligent Robotic Tutoring System

Assistance | Kind | Cost (0–1)
1 | Waving | 0.20
2 | Pointing exactly to the current workpiece on the instruction screen | 0.42
3 | Speech prompts: a. Slightly to the (left/right); b. Slightly (downwards/upwards); c. Grab the (yellow/blue/black) brick | 0.46
4 | Pointing to the right box / pointing to the right spot at the workplace from close | 0.50
5 | Pointing to the right box from very close / pointing exactly to the right spot with the gripper direction and opening showing the exact brick position | 0.68
6 | Handing over the right brick + Assistance 3 / pointing exactly to the right spot with the gripper direction and opening showing the exact brick position + Assistance 3 | 0.90
7 | Calling a supervisor | –
Experimentally determined costs are assigned to the modes of assistance to offer as much help as needed but as little as possible. Assistance 7, which was executed when four modes of assistance in a row did not lead to a change of the brick placement, was not included in the learning algorithm
A KUKA LBR iiwa 7 R800 robot was set up on a 79 cm high table opposite the participant. Above the work area is an Azure Kinect depth camera that monitors the current work steps, and to the right of the participant is a 28” (71.12 cm) digital screen that displays the current work steps as pictorial prompts. Directly in front of the participant is a 15 cm \(\times \) 15 cm \(\times \) 1 cm Lego plate on which the Lego models are to be assembled, and behind it are three boxes of 15 cm \(\times \) 20 cm \(\times \) 15 cm, from each of which a Lego brick of one color can be grabbed. Two Logitech Z320 loudspeakers were set up to the left and right of the participants; these are a new feature compared to [12]. All components communicate via the software ROS [63], while following the control architecture of Behavior Trees [64, 65]. See [12] for more details on the single components and how they are linked. The two modes of assistance that were least helpful in [12] were replaced with voice prompts. The complete list of modes of assistance used can be seen in Table 1. The modes of assistance were ordered in such a way that the least elaborate one, and thus the one with the lowest cost, was Assistance 1. In ascending order, the amount of assistance and the associated costs increased. Moreover, each arm was assigned a context vector. The context consists of the same variables as the one used in the simulation. Corresponding values were initially assigned by the leading researcher. In a test run with five different non-impaired participants, the assignment was refined. Finally, the values were discussed and agreed upon with the group leaders of the sheltered workshop, who have special educational training. Another extension to the system compared to [12] is the reinforcement learning approach described earlier, which extends the existing behavior-tree-based AI of the system by another AI component. This extension is decisive for turning the assistive system from [12] into an intelligent tutoring system. Instead of merely improving the immediate execution of new tasks, the new approach enables the IRTS to learn along with the participants and to adapt its assistance according to their current learning progress. The system has thus changed from the basic "one-size-fits-all" approach to the much preferable "one-on-one teaching" approach. Figure 4 shows the system diagram.
Fig. 4
System Diagram of the Intelligent Robotic Tutoring System. The IRTS combines various components that track work steps, control the system’s behavior, and determine appropriate modes of assistance. Various hardware components, such as a depth camera, industrial robot, and loudspeakers, as well as AI components such as behavior trees and reinforcement learning interact with each other. The running variables i, j, n and m indicate that the linked variables can have any number of units
2.2.2 Participants
The recruitment of the participants was performed by the heads of the SW, since they have the best knowledge about their employees. The only requirements were the presence of a severe disability and working in the SW, so that participants would already be familiar with the surroundings and people. Ten participants took part in the familiarization phase. One of those participants could not solve the exercise sufficiently well due to severe motoric limitations, one had to leave the workshop for health reasons, and one mastered the task without needing help from the IRTS. Therefore, these three participants did not participate in the long-term study and the number of participants was reduced to seven. The seven participants, 4 females and 3 males, had ages ranging from 22 to 59 years (mean = 38.7, standard deviation = 13.46). Due to the very diverse and varied nature of disabilities [1], it is not possible to make a definitive statement about the individual types and degrees of severity of the disabilities. Instead, the group leaders assigned an individual PI to each participant for their motoric and mental abilities prior to the study. The PIs, split into motoric and mental abilities, can be seen in Table 2. As in the simulation, the PI values and age were used as the user context. Hearing impairment was not included, since no participant had one. One of the participants had been in contact with a collaborative robot before [16], but no one was familiar with the investigated system.
Table 2
Participant information

Participant | Age | Sex | Motoric PI | Mental PI
1 | 40 | Male | 0.80 | 0.65
2 | 22 | Male | 0.80 | 0.70
3 | 33 | Female | 0.90 | 0.75
4 | 59 | Female | 0.75 | 0.45
5 | 24 | Female | 0.55 | 0.80
6 | 50 | Female | 0.50 | 0.60
7 | 43 | Male | 0.40 | 0.65
The Performance Index (PI) represents the fraction to which a person with a disability can perform a task, compared to a person without a disability [57]. PI values for motoric and mental abilities were assigned by the group leaders
This study was carried out in accordance with the recommendations of the regulations governing the principles for safeguarding good academic practice at the Carl von Ossietzky University Oldenburg, Germany, by the Commission for Research Impact Assessment and Ethics. The protocol was approved by the Commission for Research Impact Assessment and Ethics (Drs.EK/2019/038-1). All participants, or, if required, their legal guardian, gave written informed consent in accordance with the Declaration of Helsinki.
2.2.3 Study Timeline
In the run-up to the study, the participants and, where applicable, their legal guardians were provided with the participant information and declaration of consent. Both, as well as the study situation and task, were again verbally explained on the first study day. To prevent misunderstandings, the researcher was supported in doing so by the group leaders of the SW. The study took place over one month. As recommended in [16], during the first day all participants had time to become familiar with the study situation. For 20 min, they conducted their usual task while sitting at the study workplace with the moving robot in front of them. During the second day, the participants assembled Lego models, alternating 10 min with robot assistance and 10 min without, for a total of 40 min. This served as further familiarization with the task of the study, as well as with the use of the IRTS, which could otherwise influence the results of the long-term study. Subsequently, participants had the opportunity to participate in the study for up to 20 min each day. The group leaders of the SW specified 20 min as the upper limit for reasons of concentration. All participants were encouraged to collaborate with the IRTS daily, but it was not an obligation. Due to vacation, illness, quarantine caused by Covid-19, and personal preferences, an average of 2 h 46 min of data was recorded across multiple sessions per participant, with individual times ranging between 1 h 57 min and 3 h 36 min. The first 60 min for each participant were used for familiarization with the system and are not included in the following analysis, to separate learning effects in handling the system from learning the task [16]. In total, 12 h 22 min of data were collected for the long-term evaluation and 22 h 22 min including the prior familiarization.
During the complete study, the current Lego model, each consisting of 8 bricks, was displayed brick by brick on the monitor and had to be assembled on the workspace in front of the participants. In addition, the partial step currently being carried out (grasping the brick, placing the brick, clearing the work area, and checking by the camera) was displayed so that the participants could more easily draw conclusions about the current error. The visual detection of interaction states based on the camera could be cancelled manually by the researcher, so that erroneous detections did not affect the course of the study (see [12] for more details). The Lego model was changed if it was completed two times in a row without errors, or at the personal request of a participant.
The robot's modes of assistance were adapted after the familiarization phase in consultation with the group leaders and participants, by adding speech prompts (see Table 1). Moreover, the reinforcement learning algorithm that selected the modes of assistance started its learning process only after the participants' familiarization period. Here, the data generated on the second study day was used to randomly train the LinUCB-E algorithm before the long-term study started. This algorithm was chosen because it achieved the greatest cumulative reward in the simulation. Nevertheless, given the overlapping standard deviations, no algorithm seemed to be far superior to the others. They seemed to achieve comparably good results and were thus considered interchangeable. For this reason, the algorithm was changed at three points in time for better acceptance by the participants. The changes only affected the future selection of the assistance; behavior already learned by the algorithm was not forgotten. The first change was made for the very first participant during the first round, whose values are not included in the subsequent analysis. The LinUCB-E algorithm used overestimated skills and thus too often underestimated the level of assistance needed, offering the same unhelpful assistance repeatedly instead of switching to a better one, which caused frustration for the participant. This was expressed in statements like "why are you so lazy today?" and "we're about to stop being friends" towards the robot. Therefore, the LinUCB-P algorithm was used instead, which is not pre-trained. The second change to the algorithm was made after three days. From then on, a success counted only after the brick was placed correctly. Previously, an assistance counted as successful as soon as a correction was made in any way, and thus a reaction to the robot's instructions took place. But since there were almost always reactions, the conditions for a success were made more specific and precise. The last change was made after the fourth day, when, according to the subjective observation of the researcher, too much assistance was offered and the principle of offering as little assistance as possible [27] was not sufficiently satisfied. Therefore, the LinUCB-P algorithm was changed to the LinUCB-LP algorithm, which selects lower modes of assistance even more often.
2.3 Evaluation Method
During the month-long study, the following data was collected: Task Completion Times (TCT) per brick, the number and kind of assistance per brick, which brick led to success, and the expected rewards of the different modes of assistance of the CMAB algorithm. All data were automatically logged by the system. Additionally, the coordinate frames of the robot's joints, as well as video and depth data of the two cameras, were recorded to subsequently allow the researcher to verify the logged data manually. No additional questionnaires were used, as recommended by the workshop supervisors. Most people with disabilities in the workshop cannot read; some cannot speak to communicate, either. While there are adapted questionnaires based on easy language or pictures [62], the statements gained are still not always reliable. In some cases, the findings from questionnaires and observation differ widely [66]. According to the workshop supervisors, a large number of their employees make learned statements in order to trigger a desired reaction in the other person, rather than reflecting in statements and emotions how they actually feel. For this reason, this work focuses on the use of objective measures. However, even though no structural evaluation of statements took place, participant observations were made by the researcher. That is, statements that were made spontaneously and voluntarily, without being an answer to a specific question, are mentioned in the evaluation.
For the evaluation of the participant performance, the TCT, the elapsed times, and the number of performed modes of assistance per correctly placed brick were evaluated as metrics for the person's skills. The evaluation was conducted with the software R [67] and the lme4 [68] and MuMIn [69] packages. A linear mixed effects regression model is used for the statistical evaluation of the relationship between the TCT per correctly placed brick and the total number of previously laid bricks, and between the number of modes of assistance per brick and the total number of previously laid bricks, respectively. Linear mixed effects regressions are a commonly used evaluation method for longitudinal data because of its complex nature based on irregularly spaced or even missing data points [70]. The two main factors that enter the model are fixed and random effects. Fixed effects are variables that are expected to have an effect on the dependent variable. Random effects are grouping factors that should be controlled for. In this work, for the special target group, independence cannot be assumed between repeated measures of the same person, which adds the need to consider the mentioned random effects. The total number of previously placed bricks enters the model as a fixed effect for both evaluations, respectively. As random effects, intercepts for participants were added to account for the individual degrees of disability and the varying amount of data points. In a second step, intercepts for different skills were additionally added. Brick placement varied in difficulty depending on the target position of the brick. Placing a brick without adjacent bricks is more difficult than placing a brick directly next to another brick, since counting is required. For this reason, the tasks were divided into a total of six skills representing the different brick placement positions. A detailed list can be seen in Table 3. The current skill was added to the model as a second random variable to further investigate the variances. Here, the individual disabilities can have a strong influence on the possible execution of a skill. While one participant cannot count well, another finds it more difficult to transfer elevated positions from the screen to the workplace. For this reason, the skills must be considered in dependence on the respective participant and are added to the model as so-called nested effects within participants. Random slopes were not considered for the final evaluations as they led to singularities by overfitting the models. P-values were obtained by likelihood ratio tests of the full model with the effect in question against an intercept-only baseline model without the effect in question [71].
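The study's analysis was run in R with lme4/MuMIn; the sketch below only reproduces the basic model structure (fixed effect: total bricks placed so far; random intercept per participant; likelihood-ratio test against an intercept-only null model) in Python with statsmodels on synthetic placeholder data. Column names and the data are assumptions, and the nested skill effect is omitted.

```python
# Python/statsmodels sketch of the evaluation model (the study itself used
# R with lme4): TCT per brick ~ total bricks placed, with a random
# intercept per participant, compared to an intercept-only null model via
# a likelihood-ratio test. Data and column names are placeholders.
import numpy as np
import pandas as pd
import statsmodels.formula.api as smf
from scipy.stats import chi2

rng = np.random.default_rng(1)
df = pd.DataFrame({
    "participant": np.repeat(np.arange(7), 100),
    "bricks_total": np.tile(np.arange(100), 7),
})
df["tct"] = 60.0 - 0.1 * df["bricks_total"] + rng.normal(0, 10, len(df))

full = smf.mixedlm("tct ~ bricks_total", df, groups=df["participant"]).fit(reml=False)
null = smf.mixedlm("tct ~ 1", df, groups=df["participant"]).fit(reml=False)

lr = 2.0 * (full.llf - null.llf)              # likelihood-ratio statistic
print("slope:", full.params["bricks_total"], "p:", chi2.sf(lr, df=1))
```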
Table 3
Different skills to be mastered during the study

Skill | Description
1 | Grab the correct brick
2 | Place the brick freely on the workplace, without adjacent bricks
3 | Place the brick next to another brick with aligned edges
4 | Place the brick onto another brick with aligned edges
5 | Place the brick next to another brick but slightly shifted
6 | Place the brick onto another brick but slightly shifted
The tasks were divided into skills, to be able to measure the improvement in these skills over multiple Lego models once one model was mastered. The skills were repeated in the different models and varied in difficulty
For the subsequent comparison between participants' theoretical and experimental learning curves, the TCT per correct brick is used to model the experimental learning progress. Unlike in the statistical analysis, participants are considered separately, with individual intercepts and slopes. This means that each participant has their own straight line fitted through their data, which does not follow the curves generated by the linear mixed effects model. This was done because the focus here is not on statistical significance, but on representing the learning curves as accurately as possible. Moreover, the regression lines were normalized to values between 0 and 1 to use them in a re-simulation. Here, 1 represents the fastest observed brick placement of approximately 20 s, and for 0 the third quartile of all starting times of the individual participants was used, which corresponds to 80 s. The theoretically generated learning curves used in the study planning have a sigmoid shape and follow the IRT [56]. To eliminate random effects, this re-simulation performed 50 rounds per algorithm, and the 95% confidence intervals were plotted to validate that the algorithms did not show significant differences compared to the normal LinUCB algorithm and were interchangeable.
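The normalization used to feed the experimental learning curves into the re-simulation can be written as a one-liner. The 20 s and 80 s anchor values are the ones reported above; clipping to [0, 1] and the function name are assumptions of this sketch.

```python
# Normalization of TCT values for the re-simulation: 1 = fastest observed
# placement (~20 s), 0 = third quartile of the starting times (~80 s).
def normalize_tct(tct_seconds, fastest=20.0, slowest=80.0):
    return min(max((slowest - tct_seconds) / (slowest - fastest), 0.0), 1.0)
```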
3 Results
This paper focuses on demonstrating the effectiveness of the robotic tutor, so the following analysis concentrates on the learning progress of the participants and the associated algorithmic adaptations. In addition, the previously performed simulation was compared with the actual study data. The results of all three areas are shown below.
3.1 Participants’ Learning
The familiarization phase was not included in the following analysis; thus, the results are based on 742 min of data from repeated measurements on different days for seven participants. Throughout the study and across all participants, a total of 743 Lego bricks were placed correctly, with 134 bricks requiring one or more of the IRTS's modes of assistance. A total of 800 modes of assistance were provided throughout the study.
As can be seen in Fig. 5, the TCT and the number of modes of assistance per brick behaved similarly for all participants and show comparable trends. Both variables visibly decreased over the course of the study, and the scattering within the variables also decreased during the study period. Here, the study period is characterized by the number of bricks already laid.
A linear mixed effects analysis of the relationship between the TCT per correctly placed brick and the total number of previously placed bricks was performed. The full model performed significantly better than the null model (\(\chi ^2(1)\): 4.36, p = .037) and showed that the TCT decreases significantly but only marginally with the time spent with the IRTS (Estimate: \(-\)0.11, CI: \(-\)0.21 to \(-\)0.007, p < .05, marginal R2 = 0.005, conditional R2 = 0.326), with Estimate being the slope of the model and CI being the 95% Confidence Interval. The same evaluation was performed for the second dependent variable studied, the number of modes of assistance required per correctly placed brick, showing no significant effect (\(\chi ^2(1)\): 2.44, p = .12, Estimate: \(-\)0.002, CI: \(-\)0.004 to \(-\)0.0005, p > .05, marginal R2 = 0.002, conditional R2 = 0.41). The individual fits of both investigated dependent variables and the variations of the single points are shown in Fig. 5.
Fig. 5
Dependence of the Task Completion Time per correctly placed brick (left) or the number of required modes of assistance per brick (right) on the total number of bricks placed so far. Each of the seven participants is represented by their own color and shape and has their own intercept of the fitted regression line. The negative slope indicates a positive learning effect of the participants. The analysis was conducted via a linear mixed effects analysis, where a significant relationship was found for the TCT with p < .05
Additionally, the influence of the current task, and thus the required skills, on the TCT and the number of modes of assistance was investigated. A likelihood ratio test with an intercept-only baseline model again shows a significant improvement in the results for the TCT (\(\chi ^2(1)\): 4.14, p = .04, Estimate: \(-\)0.10, CI: \(-\)0.20 to \(-\)0.004, p < .05, marginal R2 = 0.004, conditional R2 = 0.380) and no significance for the number of modes of assistance (\(\chi ^2(1)\): 2.14, p = .14, Estimate: \(-\)0.001, CI: \(-\)0.004 to \(-\)0.0006, p > .05, marginal R2 = 0.002, conditional R2 = 0.45). Figure 6 shows the intercepts for all skills per person.
Fig. 6
Nested dependency of the Task Completion Time per correctly placed brick on the total number of bricks placed so far for the different skills. Each of the seven participants is represented by their own plot. Each skill is shown for each participant by its own color and shape and has its own intercept of the fitted regression line. The negative slope indicates a positive learning effect for the skills. The analysis was conducted via a nested linear mixed effects analysis, where a significant relationship was found with p < .05. Skill 1 represents grabbing the correct brick and is therefore part of the other skills and not listed separately. Details about the individual skills are listed in Table 3. The highest outliers are not visible in the images for better visibility of the regression lines
3.2 Algorithm Learning
The top graph of Fig. 7 shows that the algorithm learned and generated rewards during the study. Furthermore, the expected rewards of the different modes of assistance were logged throughout the duration of the study. This shows their change within participants and between participants over time, as can be seen in the bottom graph of Fig. 7. The three points in time of the previously described changes to the algorithm for better acceptance by the participants are also included in Fig. 7. Note that Assistance 7 was implemented as a fallback in case four modes of assistance in a row did not trigger a reaction in the participants, but it was not considered for the learning algorithm. In total, Assistance 7 was executed 26 times.
Fig. 7
Top: Cumulative reward over all skills and arms of the algorithm during the field study. Middle: Expected rewards of all modes of assistance for Skill 2 over time. The peaks represent participant changes, as new context has entered the algorithm here. The expected rewards of the different modes of assistance change over time, which illustrates the learning of the algorithm. The algorithm learns across all participants at the same time. Bottom: Zoomed section of trials 20–80. Here, Participant 4 is viewed two times in succession, with three other participants in between
Figure 7 shows that the general trend of the expected rewards is initially negative before the arms settle at values of about 0.3–0.6. This is to be expected, since the costs, which are unknown at the starting point, lower the estimates of the rewards. Over the study period, Assistance 3, a speech prompt, proved to be the most promising for most participants, as evidenced by the highest expected reward. The second most promising was Assistance 2, a pointing gesture toward the instruction screen. Assistance 3 promised the highest rewards while incurring a lower cost than, for example, Assistance 6, a combination of pointing gestures and voice assistance that therefore costs more. Moreover, the algorithm was able to learn for the individual participants as well as to transfer its learned knowledge to other participants, as can be seen in the zoomed section at the bottom of Fig. 7. Here, Participant 4 collaborated first with the IRTS, followed by Participants 6, 5, and 2, and then again Participant 4. The expected rewards of the different modes of assistance change within a participant, but also when a new participant uses the system. The change of participants has a relatively strong effect and is represented by the peaks in the curves, as each new participant introduced new context to the algorithm. When Participant 4 collaborated with the IRTS again, the expected rewards differed both from those of the preceding participant and from the final values of Participant 4's previous run.
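To illustrate how a shared, cost-aware contextual bandit can rank the modes of assistance, the sketch below implements a standard disjoint LinUCB with a per-arm cost subtracted from the observed reward. It is not the authors' LinUCB-E variant; the context dimension, cost values, and exploration parameter alpha are hypothetical placeholders.

```python
# Illustrative disjoint LinUCB with per-arm costs (not the authors' LinUCB-E).
import numpy as np

class LinUCB:
    def __init__(self, n_arms, dim, alpha=1.0, costs=None):
        self.alpha = alpha
        self.costs = np.zeros(n_arms) if costs is None else np.asarray(costs, dtype=float)
        self.A = [np.eye(dim) for _ in range(n_arms)]    # per-arm design matrices
        self.b = [np.zeros(dim) for _ in range(n_arms)]  # per-arm reward vectors

    def select(self, x):
        """Choose the assistance with the highest upper confidence bound."""
        scores = []
        for A, b in zip(self.A, self.b):
            A_inv = np.linalg.inv(A)
            theta = A_inv @ b
            scores.append(theta @ x + self.alpha * np.sqrt(x @ A_inv @ x))
        return int(np.argmax(scores))

    def update(self, arm, x, reward):
        """Net reward: observed success minus the cost of the chosen assistance."""
        net = reward - self.costs[arm]
        self.A[arm] += np.outer(x, x)
        self.b[arm] += net * x

# Usage: one shared policy over six assistance modes; the context vector would
# encode participant and task features (values here are random placeholders).
policy = LinUCB(n_arms=6, dim=8, alpha=0.5, costs=[0.1, 0.2, 0.2, 0.3, 0.3, 0.5])
x = np.random.rand(8)
arm = policy.select(x)             # index of the assistance to offer
policy.update(arm, x, reward=1.0)  # e.g. 1.0 if the brick was subsequently corrected
```

Subtracting a fixed cost per arm reproduces the behavior described above: an expensive combination such as Assistance 6 needs a correspondingly higher success rate before it can overtake a cheaper speech prompt.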
In general, already on the third day, after a total of seven sessions in which nearly every participant had taken part at least once, Assistance 3 was almost always considered the most promising. It remained the favored assistance for the rest of the study, and four participants stated unprompted that it was their favorite.
3.3 Simulation versus Reality
The TCT per correct brick decreases significantly with the experience gained from previously placed bricks. Therefore, this metric is used to model the participants' learning progress. Figure 8 compares the learning curves assumed in the simulation-based study planning, which follow the Item Response Theory (IRT), with the experimentally generated learning curves determined as described in Materials and Methods.
The curves differ greatly in shape. In the simulation, the learning curves were created using a common mathematical model that produces sigmoid-shaped learning curves. This sigmoid curve assumes that all knowledge stages, from completely unlearned to completely learned, are represented. In the time-limited user study, however, the users were likely neither completely unlearned at the beginning nor had they fully learned the task at the end. Instead, their learning progress lies somewhere in the middle of the sigmoid curve. This range can be linearly approximated, which was done in this work in order to use the measured TCT values as a basis. Moreover, the experimentally generated learning curves vary more within participants than the theoretical ones. However, these differences between theoretical and experimental curves do not lead to strong performance differences between the algorithms in the simulated study. The LinUCB-E algorithm obtains the most reward in both cases, as shown in Fig. 9. In both cases, the LinUCB-E algorithm achieves significantly better results, while no significant differences are seen between the other algorithms; the remaining algorithms are thus interchangeable with one another.
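The shape argument can be illustrated with a small sketch contrasting a logistic (IRT-style) learning curve with its linear approximation over the observed window; all parameters below are hypothetical and serve only to show the difference in shape.

```python
# Sketch: sigmoid learning curve vs. linear approximation over a limited window.
import numpy as np

def sigmoid_learning(n, k=0.05, n0=60.0):
    """Assumed learning state in [0, 1] after n placed bricks (logistic shape)."""
    return 1.0 / (1.0 + np.exp(-k * (n - n0)))

def linear_approximation(n, n_start=30, n_end=90, k=0.05, n0=60.0):
    """Linearize the sigmoid over the observed window [n_start, n_end],
    reflecting that participants were neither fully unlearned nor fully learned."""
    slope = (sigmoid_learning(n_end, k, n0) - sigmoid_learning(n_start, k, n0)) / (n_end - n_start)
    return sigmoid_learning(n_start, k, n0) + slope * (n - n_start)

n = np.linspace(30, 90, 7)
print(np.round(sigmoid_learning(n), 3))       # S-shaped over the full range
print(np.round(linear_approximation(n), 3))   # nearly identical within the window
```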
4 Discussion
The study presented here extends state-of-the-art knowledge by means of three main contributions. The assistance system developed under the first contribution was evaluated on the basis of a one-month field study in a workshop for people with disabilities, which represents the second contribution. This allowed new insights to be gained in two main areas: significant learning effects of the participants and learning efficiency of the developed algorithm. In addition, the implications of simulations for real-world studies were evaluated as part of the third contribution. The following discussion focuses on the new findings, their implications for the real world of work, and highlights limitations that require further research.
4.1 Participants’ Learning
This work successfully demonstrated long-term learning effects in people with disabilities through collaboration with the IRTS, which is the ultimate goal of such collaboration. Over the one-month study period, the TCT per correctly placed brick decreased significantly (p < .05) with the time the participants spent with the IRTS. A significant improvement in the TCT was also found across all considered skills of the different difficulty levels (p < .05). The reduced TCT per correct brick can be interpreted as learning progress, showing that the IRTS is able to teach humans work tasks and enhance their abilities. The improvement within each skill also indicates that the IRTS could assist with a variety of different tasks. However, only a small effect size was demonstrated in both analyses. Possible reasons are that an even longer total study period, or longer daily participation, would be required to measure larger effects.
Fig. 8
Comparison of learning curves used in the simulation-based study planning and those determined from the study data. The theoretically generated learning curves have a sigmoid shape and follow the Item Response Theory [56]. The experimentally generated learning curves do not represent all available knowledge states (from unlearned to learned) but show only a small section of the curve and were thus linearly approximated
Fig. 9
Performance of six different Contextual Multi-Armed Bandit algorithms in the simulated study with the theoretical learning curves (top, same as in Fig. 3) and the experimental learning curves (bottom). The LinUCB-E algorithm generates the most reward in both cases. The results for trials 990–1000 are shown enlarged. The plots show the average and standard deviation of six rounds per algorithm to account for the simulation's random effects
As this is the first study of its kind to investigate long-term learning effects of robotic tutoring in the workplace, no comparison can be made as to how much larger the learning effects achieved with the developed Intelligent Robotic Tutoring System are compared to non-learning approaches. However, the literature states that individual tutoring is always preferable to non-individual tutoring [22]. Moreover, a comparison with [12], in which an externally identical but non-adaptive robotic assistance system was tested on the same Lego task over the period of one day, shows that Assistance 7, calling a supervisor, was used much less frequently here. Assistance 7 was performed when four successively performed modes of assistance did not result in a change of the brick placement by the participant. With the non-adaptive system from [12], 34% of the bricks were placed with Assistance 7, and thus by the researcher, because the predefined assistance sequence did not yield success. With the adaptive IRTS, only 3.49% of the bricks were placed with Assistance 7, since the modes of assistance could be selected individually according to the needs and abilities of the participants. Thus, it rarely happened that the system selected four consecutive unhelpful modes of assistance.
When looking at the individual learning curves (Fig. 8), no improvement was observed for two of the participants. These are the participants with the shortest (Participant 6) and the longest (Participant 4) TCTs. Both also participated in the study for the least amount of time (71 and 57 min, respectively). For Participant 6, it stands to reason that attention decreased over time since the task did not challenge her. For Participant 4, it would be interesting to see whether a longer run time would have eventually led to improvement. This again suggests that the small effect found is due to the limited study duration and highlights the importance of conducting a study under real-world conditions and over a longer duration. Nevertheless, the shortest interaction that led to a visible learning success took place over a period of just over 1.5 h (Participant 5), and this for tasks of varying difficulty levels. Another noteworthy irregularity concerns the data points of Participant 2 from brick 80 onward (Fig. 5). Here, in one instance, the strategy of how quickly the researcher helped to loosen the bricks was slightly changed, and in the other instance, the study took place earlier in the day. Both led to visible deviations of the data points from the regression line.
These results have multiple theoretical and practical implications. Especially in the open labor market and with respect to Industry 4.0, where ever smaller batch sizes must be produced [72], the ability to learn and execute new skills flexibly and quickly is essential. Not only did this work demonstrate that the IRTS can lead to learning progress in general; the results from Participant 5 indicate that such progress can be achieved within a short period of time. Moreover, the expected variations within participants, especially Participant 2, show how important it is that the IRTS continues to learn, and they point to a possible further area of application, namely use as a monitoring system for current well-being, adding clinical value to the system.
4.2 Developed Algorithm
In this work, we have shown that the IRTS can independently learn and decide when to perform which assistance without the need for a supervisor or group leader to manually track and review the learning progress. Here, the knowledge gained about individual participants was successfully used to improve the algorithm for all participants, which enabled the algorithm to learn quickly and efficiently. Already after approximately one session per participant, Assistance 3 promised the greatest rewards and maintained this position over the course of the study for almost all participants (compare Fig. 7). Some participants even noticed that the IRTS had figured out their favorite assistance, as various statements indicated. While learning across all participants is the norm for ITSs studied in simulation [55], it is a rather new approach in comparable long-term studies with children with autism, where prior works rely on simpler models [20] or learn per participant [73]. Accordingly, our approach extends the current state of the art.
Autonomous assistance is an important feature since individualized support is essential for learning new skills. In the open labor market, there are no group leaders supervising every step of the work, so the IRTS would offer a form of support that would not otherwise be available. Even in SWs, one group leader often supervises 20 or more people with disabilities. Thus, continuous personal supervision is not guaranteed here either, and the individual abilities of the workers are not always correctly assessed. For example, a comparison of Table 2 with Fig. 5 shows that Participant 6 was assigned low PIs while having the lowest TCTs. Here, the IRTS offers a great opportunity for support as an autonomous tutor. This points to the possibility of using robots on a larger scale to help people with disabilities learn, and thus perform, work activities without the need for additional human personnel. In addition, other studies support the positive effect of such collaboration, as it is easier to admit mistakes to a robot than to a human supervisor [11]. In this work, too, the participants perceived the collaboration with the robot very positively. The robot was described as a tutor and friend and may even have conveyed a sense of security [12].
4.3 Simulation versus Reality
This challenging study was thoroughly planned beforehand using a simulation, which is the best practice to date. This provided an initial assessment of whether the study objectives could be achieved and whether the selected algorithm could follow the learning behavior of the participants. Although the simulated study results were informative and helpful, differences to reality became apparent during the study. The simulation rests on assumptions from the literature about people's learning behavior, which presuppose ideal conditions; the learning curves and the associated effects were fitted to the study as a whole. Reality, however, showed that these assumptions could not be tailored to a group of people as heterogeneous as the one considered in this study: the participants learned at a slower pace and had motor limitations that further constrained their mental learning curves, effects that were not considered in the simulation. Also, personal acceptance and emotions, such as frustration triggered by the first algorithm because it performed unhelpful assistance too often, could not be accounted for in the simulation. Although no major differences were found between the algorithms in the re-simulation, the algorithm that was the only one to produce significantly better results in the simulations proved unsuitable in the study and had to be replaced on the very first study day. On closer inspection, this is not surprising. LinUCB-E, the best algorithm in simulation, is pre-learned. In simulations, this gives it a clear advantage simply because it gets to see more data; the algorithm thus knows right at the starting point of the simulated study how it should act. Due to this advantage, other works have recommended pre-training algorithms when not enough training data can be collected during the study itself [39]. However, the problem lies in this pre-training. One possibility is to pre-train the algorithm using pure assumptions and models about the learning behavior. However, it is impossible to make reliable assumptions, especially for the target group under consideration. Alternatively, the algorithm can be trained on a small preliminary study, as was the case in this work. Here, no arbitrary assumptions are made, but the underlying data set is usually still thin. In this case, learning was based on the data from [12], where participants placed a median of only 6 bricks and the most frequently used assistance was performed an average of 4.8 times per participant. Pre-learning an algorithm on such sparse data very quickly leads to misjudging the participants in reality. Therefore, the advantage that pre-learning has in simulation becomes a disadvantage in reality, as the algorithm adapts its strategy less flexibly than a cold-start algorithm. Other works, such as [54], have already recognized that simulation data can be transferred to real-world scenarios only to a limited extent. Simulations represent a useful and very valuable tool for testing the theoretical behavior of new algorithms. However, especially when evaluating ITSs, which are often assessed purely in simulation [33], it is important to consider that the results apply to reality only to a limited extent. While real-world applications benefit from simulation to test initial assumptions, algorithms based on pure simulations benefit from real-world evaluations to prove their validity outside of simulations.
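The pre-training effect described above can be made concrete with a small sketch: warm-starting a per-arm linear model from a handful of prior observations shrinks its confidence width compared with a cold start, even though the underlying estimate rests on very little data, so the policy explores less and adapts more slowly. The data and dimensions below are synthetic placeholders, not the study data.

```python
# Sketch: cold-start vs. warm-start of a linear bandit arm (synthetic data).
import numpy as np

dim = 4
# Cold start: identity design matrix, i.e. no prior belief and wide confidence bounds
A_cold, b_cold = np.eye(dim), np.zeros(dim)

# Warm start: fold a few prior (context, reward) pairs into A and b,
# e.g. the handful of observations available from a short pilot study
prior_contexts = np.random.rand(6, dim)
prior_rewards = np.random.rand(6)
A_warm = np.eye(dim) + prior_contexts.T @ prior_contexts
b_warm = prior_contexts.T @ prior_rewards

# The confidence width x^T A^{-1} x is smaller after warm-starting, so the
# policy explores less although its estimate is based on very sparse data.
x = np.random.rand(dim)
width_cold = x @ np.linalg.solve(A_cold, x)
width_warm = x @ np.linalg.solve(A_warm, x)
print(f"confidence width cold: {width_cold:.3f}, warm: {width_warm:.3f}")
```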
4.4 Limitations and Future Work
One of the main challenges remains the very heterogeneous nature of the participants. Disabilities are diverse and varied, and thus observed effects and learning methods that work for one person are not right for another. Consequently, a definitive statement about the benefit of the IRTS for every individual person is hard to make. An even larger study in which motor and mental impairments can be considered in a more differentiated way would be desirable. In this context, a study considering solely physical rehabilitation advantages would also be interesting. Moreover, the variations in the data, such as the deterioration of Participant 6 or the variance of the data points of Participant 2, raise the question of whether the tutor, through its continuous monitoring of the participants, could additionally be used as a diagnostic tool for current health status. In other words, while the pioneering results of this study are very promising, there are still other benefits of working with a robotic tutor yet to be explored. In addition, the IRTS could be tested on other tasks and skills, and different algorithms could be compared. For example, in the presented in-field study, not a single fixed algorithm was tested; instead, the algorithm was adapted at three points in time. This makes it difficult to make definitive statements about the specific algorithm. Strictly speaking, only the described overall concept of using a shared policy via a LinUCB-based algorithm was clearly shown to be promising. In further studies, a clearer algorithm evaluation could be achieved by using a different algorithm per skill. Poorly performing algorithms would then only lead to frustration in individual subtasks and not in the entire study, and the researcher's interventions to exchange algorithms could possibly be avoided this way. Another exciting approach would be to automatically detect frustration caused by unhelpful algorithms, so that the IRTS decides independently and user-specifically when the concept of the algorithm needs to be changed. Furthermore, during the simulation, the individual parameters fed into the algorithms, such as the moving window or the fixed exploration probability p, could have been further refined using a larger-scale, systematic search. Similarly, the still unreliable context could possibly be modeled explicitly with stochasticity. However, it is unclear whether the refined parameters could be transferred one to one to reality, or whether the simulation still includes too many assumptions. Furthermore, the IRTS is still susceptible to external influences that limit the automatic camera recognition and the precision of pointing gestures when performing millimeter-precise assembly tasks [12]. Here, further research toward a more robust design is needed before routine use of the system.
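Such a larger-scale, systematic search could, for instance, take the form of a simple grid over the moving-window length and the exploration probability, evaluated over several simulation seeds. The skeleton below uses a hypothetical simulate_study() placeholder standing in for the simulation of the planned study.

```python
# Skeleton of a systematic parameter search (grid over window length and
# exploration probability); simulate_study() is a hypothetical placeholder.
from itertools import product
import numpy as np

def simulate_study(window, p_explore, seed):
    """Placeholder: run one simulated study and return its cumulative reward."""
    rng = np.random.default_rng(seed)
    return rng.normal()  # replace with the actual study simulation

grid = product([20, 50, 100], [0.05, 0.1, 0.2])  # window lengths x exploration probabilities
results = {
    (w, p): np.mean([simulate_study(w, p, seed) for seed in range(6)])
    for w, p in grid
}
best = max(results, key=results.get)
print("best (window, p_explore):", best)
```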
Combining the study results with earlier findings leads to a promising future scenario, which emphasizes the importance of the results achieved. Previous studies show that, after a short period of familiarization, many people with disabilities are happy to work with the IRTS, do not feel any fears or concerns, and even consider it a friend [16]. In addition, the IRTS can instantly and significantly improve their execution of new tasks and even help them solve tasks that were unsolvable without IRTS assistance [12]. At the same time, long-term collaboration leads to a significant learning effect for participants, and new skills can be acquired. This combination of the joy of collaboration, the immediate improvement in performance, the autonomous nature of the system, and the long-term learning effects makes the IRTS a promising and innovative way to support people both in SWs and in the open labor market. One IRTS could be permanently set up next to the work areas, consistently supporting multiple workers simultaneously [13] without the need for additional human personnel. The immediate increase in performance reduces lost productivity, allowing people to take on tasks they would not otherwise be assigned while keeping risks for employers low. At the same time, the IRTS constantly supports people in learning these tasks, thus accelerating the learning process and allowing continuous development of personal skills. Successful use of the IRTS could therefore help workers attain the abilities required in the open labor market and eventually help more people with disabilities find work. Successful integration into the open labor market would, in turn, lead to multiple personal benefits, such as financial security. It can also foster a feeling of integration into society and improve overall quality of life [1].
5 Conclusion
To support people with disabilities in learning new tasks, and thus help them integrate into the open labor market in the long term, this paper describes the development and evaluation of an Intelligent Robotic Tutoring System. The robotic tutor recognizes the human's current work steps using a depth camera and offers one of six different assistance options. The modes of assistance range from robotic pointing gestures, to speech prompts, to calling a supervisor. Which assistance to offer the different people is personalized using AI in the form of reinforcement learning. In this context, new non-stationary Contextual Multi-Armed Bandit algorithms have been developed that learn, based on the context information of the current task and the human, which assistance will lead to success, while maintaining the principle of offering as much assistance as necessary, but as little as possible. Furthermore, the shared policy remembers the cumulative history of past learners so that the algorithm knows which assistance to offer to new, unskilled learners. The algorithms were first evaluated in a simulation of the planned in-field user study. The one-month in-field user study was then conducted in a workshop for people with disabilities during normal working hours, where the Intelligent Robotic Tutoring System assisted seven people with a wide range of disabilities in learning abstracted assembly tasks.
Important results were obtained in three main aspects. First, during the month-long in-field user study, a significant learning effect was demonstrated in the participants: the time required per assembly part decreased significantly over the month. The number of modes of assistance needed also decreased, although not significantly. Secondly, the new reinforcement learning algorithm learned very quickly which modes of assistance were preferred by which participant. In doing so, it successfully transferred its knowledge of individual participants to other participants. As few as seven sessions, or approximately one session per participant, were enough to select the preferred assistance for the individual users. Third, a comparison of the simulation with reality revealed limits to transferability. Here, the previously conducted study simulation was compared with a subsequent re-simulation in which the original assumptions about people's learning behavior were replaced with the actually measured learning curves. The comparison revealed the underlying basic assumptions to be correct, while many factors, such as personal preferences or motor limitations that further constrained the participants' mental learning curves, cannot be adequately modeled, leading to strong performance differences of the algorithm in the real-world setting.
Overall, the results show that the innovative Intelligent Robotic Tutoring System is a promising approach for inclusion of people with disabilities in the open labor market. The collaboration can avoid errors in execution and associated productivity losses while helping people learn new tasks and improving personal skills. Additionally, participants repeatedly emphasized how much they enjoyed the collaboration and how grateful they were for the opportunity.
Acknowledgements
The authors thank the heads and group leaders of the Sheltered Workshop in Oldenburg for their support. Without their engagement in recruiting the participants, their communication with the legal guardians, and their expert knowledge and advice, this work would not have been possible. The entire research team thanks the study participants for being part of this research.
Declarations
Conflict of interest
The authors have no relevant financial or non-financial interests to disclose.
Ethics Approval
This study was carried out in line with the principles of the Declaration of Helsinki and in accordance with the recommendations of the regulations governing the principles for safeguarding good academic practice at the Carl von Ossietzky University Oldenburg, Germany, by the Commission for Research Impact Assessment and Ethics. The protocol was approved by the Commission for Research Impact Assessment and Ethics (Drs.EK/2019/038-1).
Consent to Participate
All participants, or, if required, their legal guardian, gave written informed consent.
Supplementary information
The supplementary materials contain Movies S1 to S4, which are exemplary scenes from the study in which the IRTS provides modes of assistance to the participants.
Open Access This article is licensed under a Creative Commons Attribution 4.0 International License, which permits use, sharing, adaptation, distribution and reproduction in any medium or format, as long as you give appropriate credit to the original author(s) and the source, provide a link to the Creative Commons licence, and indicate if changes were made. The images or other third party material in this article are included in the article's Creative Commons licence, unless indicated otherwise in a credit line to the material. If material is not included in the article's Creative Commons licence and your intended use is not permitted by statutory regulation or exceeds the permitted use, you will need to obtain permission directly from the copyright holder. To view a copy of this licence, visit http://creativecommons.org/licenses/by/4.0/.
Publisher's Note
Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.
Written by:
Sandra Drolshagen
is a post-doc at the Department of Health Services Research, Carl von Ossietzky University of Oldenburg, Germany. Her research includes topics on human-robot collaboration with a focus on robotic tutors. She works at the OFFIS Institute for Information Technology, Oldenburg, Germany, in the Smart Human Robot Collaboration Group.
Max Pfingsthorn
is principal scientist and Group Manager of the Smart Human Robot Collaboration Group at the OFFIS Institute for Information Technology, Oldenburg, Germany. He is particularly interested in questions of perception in the context of interaction with (semi-)autonomous systems. The central theme of his research group is the mutual understanding between humans and robots.
Andreas Hein
is a full professor of automation and measurement technologies at the Carl von Ossietzky University of Oldenburg and a Member of the Board of OFFIS - Institute for Information Technology. His research interests include activity/mobility analysis, the development of assistance systems for older people and caregivers, as well as human-robot interaction.