
Open Access 09.11.2022 | Original Article

Fully adaptive recommendation paradigm: top-enhanced recommender distillation for intelligent education systems

Authors: Yimeng Ren, Kun Liang, Yuhu Shang, Xiankun Zhang

Published in: Complex & Intelligent Systems | Issue 2/2023


Abstract

Top-N recommendation has received great attention for assisting students by providing personalized learning guidance on a required subject or domain. Existing approaches mainly aim to maximize the overall accuracy of the recommendation list while ignoring the accuracy of the highly ranked recommended exercises, which seriously affects students' learning enthusiasm. Motivated by the Knowledge Distillation (KD) technique, we design a fully adaptive recommendation paradigm named the Top-enhanced Recommender Distillation framework (TERD) to improve the recommendation effect at the top positions. Specifically, the proposed TERD transfers the knowledge of an arbitrary recommender (the teacher network) and injects it into a well-designed student network. The prior knowledge provided by the teacher network, including student-exercise embeddings and a candidate exercise subset, is further utilized to define the state and action space of the student network (i.e., a DDQN). In addition, the student network introduces a well-designed state representation scheme and an effective individual ability tracing model to enhance the recommendation accuracy at the top positions. The developed TERD follows a flexible, model-agnostic paradigm that not only simplifies the action space of the student network but also promotes the recommendation accuracy at the top positions, thus enhancing students' motivation and engagement in e-learning environments. We implement our proposed approach on three well-established, publicly available datasets and evaluate its Top-enhanced performance; the experimental evaluation shows that the proposed TERD scheme effectively resolves the Top-enhanced recommendation issue.

Introduction

With growing developments in intelligent tutoring systems, advances in artificial intelligence and other emerging technologies have far-ranging consequences for online personalized learning. Technology-supported online learning, such as the Recommendation System (RS), has been widely used to bring about a new teaching model based on the ethics of participation, openness, and collaboration [1].
As an advanced technology, Top-N recommendation algorithms [2, 3] can provide learners with personalized learning experiences. Recently, the Deep Reinforcement Learning (DRL) [4] technique has emerged as an effective way to deal with the Top-N recommendation issue. As a distinguished direction, many efforts have been dedicated to introducing more advanced RL techniques to maximize the expected sum of long-term rewards [5-7].
In this paper, we revisit existing RL-based Top-N recommendation models and find that they still have some inherent deficiencies. First, existing RL recommendation approaches are committed to maximizing long-term returns while ignoring the significance of the top-position recommendations, i.e., they seldom care about the Top-enhanced issue. In practice, if the recommendation system does not provide appropriate exercises at the top positions to fulfill learners' needs, they may not have enough patience to keep learning. As illustrated in Fig. 1a, both Bob and Merry were discouraged and demotivated because they found that the recommended exercises were not suitable for them at the beginning of their study, which may result in dropout. Second, RL models usually have an extremely large candidate action space when performing recommendation tasks, which results in high computational and representational complexity. Moreover, in the early stage of learning, the selected actions are usually random. This problem may have a severe adverse impact on the task of exercise recommendation. Third, none of the existing recommenders can be directly applied to the Top-enhanced issue. We could design specific mechanisms to solve this problem, but since the recommenders are based on different neural network architectures, we would have to redesign a different exercise selection strategy for each model, i.e., such a scheme still suffers from poor scalability. As a result, a general framework to accommodate the transition from Top-N to Top-enhanced is required. Lastly, some data mining researchers [8, 9] have focused on recommending non-mastered exercises to each student by manually assigning difficulty labels. However, in practical scenarios, this solution inevitably leads to label bias. As suggested by [10], students' knowledge construction process is not static but evolves over time. Along this line, an advanced deep learning model is needed that can track students' evolving mastery of specific knowledge concepts.
We illustrate these complex interactions more clearly by a toy example in Fig. 1.
To overcome these obstacles, we develop a Top-enhanced Recommender Distillation framework (TERD) to handle exercise recommendation tasks flexibly. Specifically, we investigate: (1) how to design a general solution that effectively improves the recommendation performance at the top positions; (2) how to construct a good state representation and accurately measure student proficiency to help the recommender agent perform effective reinforcement learning.
The main novelties and contributions of the proposed TERD framework can be summarized as follows:
(1)
To the best of our knowledge, this is the first work to apply the distillation technique to Top-enhanced exercise recommendation. It extracts the knowledge of an arbitrary recommender and injects it into a student network for more effective Top-enhanced recommendations.
 
(2)
Different from the action selection strategy of the traditional DDQN-based methods, our proposed framework absorbs the essence of well-trained recommenders, thus largely decreasing the action selection space in DRL.
 
(3)
We design an effective student network (i.e., a DDQN) by introducing a new state representation scheme and a flexible individual ability tracing method. Considering the long-term dynamic nature of student learning, we incorporate a stacked GRU network into the state representation scheme. Different from previous works that manually assign difficulty labels to exercises, our proposed TERD is able to take the perspective of each student and gain insight into their mastery of specific knowledge concepts.
 
(4)
We perform comprehensive experiments on three large-scale benchmark datasets to demonstrate the effectiveness of the TERD model. The experimental results also show that the student network outperforms the teacher recommender in Top-enhanced recommendation tasks.
 
The remainder of this paper is structured as follows. In Related work, we briefly review recent related research. Section Preliminaries introduces the basic notions underlying the TERD model. The technical details of the proposed TERD framework are described step by step in Proposed methodologies. Then, we present the experimental results in Experiment. Finally, Conclusion summarizes this paper, analyzes this study's limitations, and provides future directions.

Related work

Recommender system

Among various recommendation models, collaborative filtering (CF) algorithms [11, 12] have attracted increasing attention from researchers, who have proposed many classical CF algorithms. Motivated by the excellent performance of Deep Learning (DL) techniques, researchers have developed various Top-N recommendation methods using deep learning. In the latest relevant research, Wu et al. [13] developed a two-stage neural recommendation model named KCP-ER, which consists of a knowledge concept prediction module and an exercise set filtering module. As a promising direction, many variants have been developed by introducing more advanced DL techniques, constructing a many-objective optimization model [14], and capturing user-item latent structures [15]. Recently, Deep Reinforcement Learning (DRL) techniques have been widely applied to various scientific problems and, in several tasks, outperform humans. Lin et al. [6] presented a hierarchical reinforcement learning method with a dynamic recurrent mechanism for course recommender systems. The authors of [7] also designed a DRL method based on the Actor-Critic (AC) framework for knowledge recommendation. In contrast, we do not focus on the Top-N recommendation issue; instead, we apply model distillation in our proposed TERD to reinforce the recommendation effect at the top positions.

Knowledge distillation

Knowledge Distillation (KD) [16], a highly promising technique for transferring knowledge from a large well-trained model (a.k.a. a teacher network) to a relatively lightweight model (a.k.a. a student network), has exhibited state-of-the-art performance in various fields. Surveying the recent literature, many KD methods have achieved significant improvements in computer vision [17-19], natural language processing [20, 21], and graph data mining [22].
Recently, some researchers have introduced the KD technique to generate small but powerful recommenders. Wang et al. [23] proposed a novel knowledge distillation model with probabilistic rank-aware sampling, termed collaborative distillation, which adopts an improved student network training strategy to promote Top-N recommendation performance. Moreover, to alleviate the high dimensionality and sparsity of tag information in real scenarios, the authors of [24] developed two novel heterogeneous knowledge distillation methods, at the feature level and the label level, to build relations between a user-oriented autoencoder and an item-oriented autoencoder. In the recent literature, several works proposing self-distillation [25, 26] have also emerged. The work in [27] introduced a weighting mechanism to dynamically put less weight on uncertain samples and showed promising results. Huang et al. [28] introduced the self-distillation concept into GCN-based recommendation and proposed a two-phase knowledge distillation model to improve recommendation performance.
However, little effort has been made to tackle the Top-enhanced recommendation challenge. Through exhaustive analysis, the recommendation algorithm closest to our idea is dedicated to maximizing the effect of ranking distillation in RS. The work in [29] developed a novel rank distillation model named Ranking Distillation (RD), in which the teacher network captures ranking patterns to guide the student network in ranking unlabeled documents, revealing that a knowledge distillation model can help extract more useful features via a large teacher network. Motivated by this approach, Kang et al. [30] extended KD with a regularization mechanism to enhance the student network's performance.

Reinforcement learning in education

RL is a well-known paradigm for autonomous learning. In educational psychology scenarios, there has been a series of successful applications of deep RL methods. Online course recommendation has attracted widespread attention in the area of intelligent education. Along this line, Lin et al. [6] presented a hierarchical reinforcement learning method with a dynamic recurrent mechanism for course recommender systems, which designs a profile constructor to efficiently trace the learner's preferences for personalized course recommendation. By treating the learning path recommendation task as a Markov Decision Process, Liu et al. [7] developed a cognitive structure enhanced framework using the actor-critic algorithm that can generate a suitable learning path for different learners. The work in [31] proposed a personalized recommendation method based on standard Q-learning, in which a Q-table is constructed to store the Q-values of all state-action pairs. To extend this idea, the authors of [32] used several fully connected layers in place of the exercise Q-table to approximate the Q-values.
Despite these productive works, our method differs from these efforts in that we focus more on the recommendation accuracy of the top positions, rather than the most common top-N recommendation tasks.

Preliminaries

Problem statement

In this article's recommendation scenario, four components are typically contained in learning systems: students, exercises, knowledge concepts, and response scores. A folksonomy is a tuple:
$$\begin{aligned} {\mathcal {F}} := ({\mathcal {U}},{\mathcal {E}},{\mathcal {K}},{\mathcal {Y}},{\mathcal {G}},{\mathcal {Z}}) \end{aligned}$$
where \({\mathcal {U}},{\mathcal {E}},{\mathcal {K}},{\mathcal {Y}}\) indicate students set, exercises set, knowledge concepts set, and response scores set, respectively. Also, we denote the association between exercises and knowledge concepts as a binary relation \({\mathcal {G}} \subseteq {\mathcal {E}} \times {\mathcal {K}}\), where \((e_i, k_j) \in {\mathcal {G}}\) if \(e_i\) is related to \(k_j\).
\({\mathcal {Z}}\) is the internal relation among them. For the convenience of representation, the historical learning record of a certain student can be formulated as a 3-tuple:
$$\begin{aligned} {\mathcal {Z}} = \textrm{z}_{(\textrm{u}, e, \textrm{y})} \end{aligned}$$
where \(\textrm{z}_{(u_m, e_i, \textrm{y}_i)} \) denotes that exercise \(e_i\) was practiced by student \(u_m\) at her exercising step t, and \(\textrm{y}_{i} \in \{ 0,1 \}\) denotes the corresponding score: \(\textrm{y}_i=1\) when the student answers the exercise correctly, and \(\textrm{y}_i=0\) otherwise.
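For concreteness, the components above can be held in a few plain data structures. The following Python sketch is illustrative only (the container names and values are ours, not the paper's code); it encodes the student, exercise, and concept sets, the exercise-concept relation \({\mathcal {G}}\), and the response logs \({\mathcal {Z}}\).

from dataclasses import dataclass
from typing import Dict, List, Set

# Illustrative containers for the folksonomy F = (U, E, K, Y, G, Z).
students: Set[int] = {0, 1}                      # U: student ids
exercises: Set[int] = {10, 11, 12}               # E: exercise ids
concepts: Set[int] = {100, 101}                  # K: knowledge-concept ids

# G (a subset of E x K): the knowledge concepts each exercise involves.
exercise_concepts: Dict[int, Set[int]] = {10: {100}, 11: {100, 101}, 12: {101}}

@dataclass
class Interaction:
    """One record z = (u_m, e_i, y_i): student u_m answered exercise e_i with score y_i in {0, 1}."""
    student: int
    exercise: int
    score: int

# Z: chronological response logs, grouped by student.
logs: Dict[int, List[Interaction]] = {0: [Interaction(0, 10, 1), Interaction(0, 11, 0)]}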

General RL structure

In essence, the recommendation task can be formalized as a Markov Decision Process (MDP). The overall RL framework for recommendation is presented in Fig. 2. More formally, the elements of an MDP \(({\mathcal {S}}, {\mathcal {A}},{\mathcal {R}},{\mathcal {P}})\) can be characterized as follows (a minimal code sketch of this interface is given after the list):
  • State \({\mathcal {S}}\): At each time step t, the current state \(s_t \in {\mathcal {S}}\) encodes the student's preceding exercising history, in which each exercise record \((e, k, \textrm{y})\) is considered.
  • Action \({\mathcal {A}}\): An action a is a vector. Based on state \(s_t\), taking action \(a_t \in {\mathcal {A}}\) is defined as selecting an exercise \(e_t\) for a certain student, after which the agent enters a new state \(s_{t+1}\).
  • Reward \({\mathcal {R}}\): An immediate reward \(r_t\) is a scalar value which is obtained from the environment. When an exercise e is selected, we get the reward \(r(s_t, a_t)\) according to the feedback of various objectives.
  • Transitions \({\mathcal {P}}\): Once the test’s feedback is collected, the agent will enter the next state based on the transition probability \(p(s_{t+1} \mid s_t, a_t)\).
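Putting the four components together, the recommendation MDP can be expressed as a small environment interface. The sketch below is a minimal, hypothetical Gym-style wrapper (our own names, not the authors' code): the state is the student's exercising history, an action selects one exercise, the reward is derived from the simulated response, and the transition removes the chosen exercise from the action space.

import random
from typing import List, Tuple

class ExerciseRecEnv:
    """Minimal MDP wrapper for exercise recommendation (illustrative only)."""

    def __init__(self, candidate_exercises: List[int]):
        self.candidates = list(candidate_exercises)   # action space A: remaining exercises
        self.history: List[Tuple[int, int]] = []      # (exercise, response) pairs form the state

    def reset(self) -> List[Tuple[int, int]]:
        self.history = []
        return self.history

    def step(self, exercise: int) -> Tuple[List[Tuple[int, int]], float, bool]:
        # In the paper the response comes from a knowledge-tracing simulator;
        # here it is stubbed with a random correct/incorrect outcome.
        response = random.randint(0, 1)
        reward = 1.0 - response                       # reward hitting a non-mastered exercise
        self.history.append((exercise, response))
        self.candidates.remove(exercise)              # transition: the exercise leaves A
        done = len(self.candidates) == 0
        return self.history, reward, done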

Remarks

To simulate the interaction between the recommender agent and a student in a given environment, the agent should follow four hypotheses.
(1)
Each student \(u_i\) has a total of N rounds of interactions with his/her personal recommender agent.
 
(2)
\(u_i\) can respond to at most one exercise in \(L^*\) at each time step t.
 
(3)
In the t-th time step, the agent first selects the exercise \(e_t\) with the highest Q-value, then deletes it from the exercise subset L and adds it to the Top-enhanced list \(L^*\) for the particular student.
 
(4)
Given a student's learning record \(\textrm{z}_{(u_m,e_i,\textrm{y}_i)}^t\), if \(y_i=1\), there is no hit exercise, and the recommender agent receives negative feedback.
 

Proposed methodologies

Model description

The architecture of the proposed TERD model is depicted in Fig. 3. The TERD model takes students' response logs as inputs and adaptively recommends suitable exercises to students. Specifically, two sub-modules inside this model are used to achieve this task. One is the teacher network, a well-trained network used to learn student-exercise embeddings from historical interactions and generate the candidate exercise subset. The final output of the teacher network is distilled and transferred to the student network to further guide the agent in performing effective Top-enhanced recommendation. The other is the student network, which absorbs the essence of the well-trained teacher. Benefiting from the teacher network, distilled knowledge (i.e., student-exercise embeddings and the candidate exercise subset) that helps promote the recommendation accuracy at the top positions is transferred to the student network. In addition, two mechanisms are introduced in the student network: (i) a well-designed state representation scheme to capture the long-term dynamic nature of student learning; (ii) an efficient individual ability tracing model that is used to estimate the mastery probability of a student on each concept.

The teacher network

The main purpose of the teacher network is to transfer powerful distilled knowledge that guides the student network's recommendation process. Since knowledge is transferred from a heavyweight and powerful teacher network, the performance of the final (student) model relies heavily on the strength of the teacher network; in other words, a stronger teacher yields a stronger student. In this work, six advanced exercise recommendation methods are selected as teacher networks:
  • ER-LOAF [33]: It designs a hybrid many-objective framework to recommend suitable exercises that accord with learners’ mastery level and knowledge concept coverage.
  • HB-DeepCF [34]: It embeds students and exercises into a low-dimensional continuous vector space via auto-encoder techniques, and then integrates the recommender component and the auto-encoder component into a hybrid recommendation model for adaptively recommending exercises to each student.
  • DKVMN-RL [35]: It first acquires students’ mastery level of skills using the improved Dynamic Key-Value Memory Network (DKVMN), and a Q-learning algorithm is then used to learn an exercise recommendation policy.
  • LSTMCQP [36]: It uses a personalized LSTM approach to trace and model students’ knowledge mastery states and further designs a “recommend non-mastered exercises” recommendation strategy.
  • KCP-ER [13]: It develops a knowledge concept prediction-based Top-N recommendation model for finding a set of recommendation lists that trade off accuracy, coverage, novelty, and diversity.
  • TP-GNN [37]: It applies a graph neural network to the Top-N recommendation task, in which aggregation functions and an attention mechanism are employed together to generate a high-quality ranking list.
Technically, the proposed TERD is flexible since any existing recommendation technique can be used as a teacher network without considering the detailed mechanisms behind it. Benefiting from the model-agnostic strategy of knowledge distillation, we do not need to redesign the recommendation strategy of the student network if we modify the teacher network.
The embedding vectors of students and exercises from the teacher network are transferred into the state representation component of the student network, which helps capture the long-term dynamics of students' past learning trajectories when building the state representation. Moreover, the candidate exercise subset output by the teacher network is given as input to the student network, which effectively narrows the action space in DDQN. In this way, the distilled knowledge enables the student network to achieve higher accuracy and faster convergence in top-position recommendation.
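Concretely, the distilled knowledge amounts to two artifacts: the teacher's embedding tables and its per-student top-K ranking, which becomes the candidate subset \({\mathcal {L}}\). The sketch below is a hedged illustration under the assumption that the teacher exposes student/exercise embedding matrices and that a dot-product scorer approximates its ranking; all names are ours, not the paper's.

import torch

def distill_from_teacher(student_emb: torch.Tensor,   # [num_students, d], from the teacher
                         exercise_emb: torch.Tensor,  # [num_exercises, d], from the teacher
                         student_id: int,
                         answered: set,
                         k: int = 50):
    """Return the candidate exercise subset L (top-k by teacher score) together with the
    embeddings that seed the student network's state representation. Illustrative only."""
    scores = exercise_emb @ student_emb[student_id]   # generic dot-product relevance score
    if answered:
        scores[list(answered)] = float("-inf")        # do not re-recommend answered exercises
    candidate_subset = torch.topk(scores, k).indices  # L: the narrowed DDQN action space
    return candidate_subset, student_emb[student_id], exercise_emb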

The student network

Network structure of the proposed DDQN agents

The goal of the student network is to adjust and optimize the candidate exercise subset L, thus achieving a performance improvement at the top positions. Specifically, we employ a deep reinforcement learning algorithm and design novel task-specific reward functions for adaptively generating recommendation lists for students during the learning process. Under this architecture, the Double Deep Q-Network (DDQN) [38] with an experience replay mechanism is adopted as the student network. We would like to emphasize that the major focus of this study is the Top-enhanced issue, rather than exploring the best DRL approach in the context of exercise recommendation. The interaction between the agent and the environment in the DDQN-based TERD is depicted in Fig. 4.
Two Q-networks with the same structure but different parameters are introduced in standard DDQN, namely, the online network Q parameterized by \(\varphi \) and the target network \({\hat{Q}}\) parameterized by \(\varphi ^-\). The target Q-network estimates the target Q-value of the action taken for the next state, and its parameters are synchronized with the online Q-network after a certain number of iterations. Remarkably, DDQN decouples the selection of actions from their evaluation, greatly reducing the overestimation of Q-values. The online Q-network loss function is calculated through the temporal difference (TD) error [39], as shown in Eq. (1).
$$\begin{aligned} L(\varphi ) = \mathbb {E}_{(s,a,r,s^{'}) \sim U(D)} {\left[ \left( y_t^{DDQN}-Q(s_t,a_t;\varphi ) \right) ^2 \right] } \end{aligned}$$
(1)
where \(y_t^{DDQN} = r_t + \gamma {\hat{Q}}(s_{t+1}, \mathrm{argmax}_{a^{'}}Q (s_{t+1}, a^{'}; \varphi ); \varphi ^{-})\).
The experiences generated in the interaction with the environment are stored in the experience replay buffer D in the form of \(\left\langle s,a,r,s^{'} \right\rangle \), and the training samples are randomly selected from D.
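For clarity, the TD target and the loss of Eq. (1) can be computed as in the following PyTorch sketch. It assumes two modules, online_q and target_q, that map a batch of states to Q-values over the candidate subset; the names and shapes are ours, not the paper's.

import torch
import torch.nn.functional as F

def ddqn_loss(online_q, target_q, batch, gamma=0.9):
    """Squared TD error of Eq. (1). batch = (s, a, r, s_next) sampled uniformly from D."""
    s, a, r, s_next = batch
    q_sa = online_q(s).gather(1, a.unsqueeze(1)).squeeze(1)                 # Q(s_t, a_t; phi)
    with torch.no_grad():
        a_star = online_q(s_next).argmax(dim=1, keepdim=True)               # action picked by the online net
        target = r + gamma * target_q(s_next).gather(1, a_star).squeeze(1)  # evaluated by the target net
    return F.mse_loss(q_sa, target)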

Definition of reinforcement learning components

The state, action and reward for the student network are defined as follows.
State. The student and exercise embeddings learnt by the teacher network are employed to build the state of the student network. In detail, at time step t, the embedded knowledge output by the teacher network can be defined as \({\tilde{x}} = [u_m;e_{t-n},\dots ,e_{t-1}]\), where \(u_m, e_{t-i} \in \mathbb {R}^d\) represent the student and exercise embeddings obtained from the teacher network, and \([e_{t-n}, \dots , e_{t-1}]\) indicates the embeddings of the n most recent exercises up to step \(t-1\). \({\tilde{x}} \in \mathbb {R}^{d\times (n+1)}\) represents the concatenation of \(\{ u_m;e_{t-n},\dots ,e_{t-1} \}\). Therefore, with the embedded vector sequence \({\tilde{x}}\) described above as input, we incorporate a stacked GRU network (SGRU), which is more capable of modeling the student's whole exercising trajectory. Specifically, the first component of the stacked GRU layer applies the GRU network to generate the hidden representation as below:
$$\begin{aligned} h_i^{(1)} = SGRU^{(1)}({\tilde{x}}_i, h_{i-1}^{(1)};\theta ^{(1)}) \end{aligned}$$
(2)
where \(h_i^{(1)}\) denotes the hidden states at time step i. The 2nd component of the stacked SGRU layer has a similar structure to the 1st component, denoted as,
$$\begin{aligned} h_i^{(2)} = SGRU^{(2)}({\tilde{x}}_i, h_{i}^{(1)};\theta ^{(2)}) \end{aligned}$$
(3)
Similarly, \(h_i^{(2)}\) denotes the hidden states in the second layer. Stacking multiple layers of neurons may also bring some potential problems, e.g., overfitting and difficulty in training. Inspired by previous works [40], a residual connection is introduced between the two layers to alleviate the above limitations.
Then, we obtain state representation \(s_i\) via an activation function as follows.
$$\begin{aligned} s_i = \sigma \left( W_s \left( h_i^{(2)}\oplus {\tilde{x}}_i \right) + b_s \right) \end{aligned}$$
(4)
Specifically, in the t-th exercising step, if the recommender agent correctly selects one exercise \(e_t\) for the student, \(s_t\) will be updated as \(s_{t+1}\); otherwise \(s_{t+1}=s_t\).
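A compact realization of Eqs. (2)-(4) is sketched below: two stacked GRUs, a residual concatenation of the embedded input with the second layer's last hidden state, and a linear-plus-sigmoid projection to the state \(s_t\). We read \(\oplus \) as concatenation and feed the second GRU with the first layer's outputs, which is one common reading; layer sizes are illustrative, not the paper's.

import torch
import torch.nn as nn

class StateEncoder(nn.Module):
    """Stacked-GRU state representation of Eqs. (2)-(4), one possible reading."""

    def __init__(self, emb_dim=32, hidden_dim=64, state_dim=64):
        super().__init__()
        self.gru1 = nn.GRU(emb_dim, hidden_dim, batch_first=True)      # SGRU^(1)
        self.gru2 = nn.GRU(hidden_dim, hidden_dim, batch_first=True)   # SGRU^(2)
        self.proj = nn.Linear(hidden_dim + emb_dim, state_dim)         # W_s over [h^(2) ; x~]

    def forward(self, x_tilde: torch.Tensor) -> torch.Tensor:
        # x_tilde: [batch, n+1, emb_dim] = student embedding followed by the n latest exercise embeddings
        h1, _ = self.gru1(x_tilde)                                     # Eq. (2)
        h2, _ = self.gru2(h1)                                          # Eq. (3)
        last = torch.cat([h2[:, -1], x_tilde[:, -1]], dim=-1)          # residual concatenation
        return torch.sigmoid(self.proj(last))                          # state s_t, Eq. (4)

# Example: a batch of two students, each with 1 student embedding + 5 recent exercise embeddings:
# s = StateEncoder()(torch.randn(2, 6, 32))   # s.shape == (2, 64)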
Action. Taking action \(a_t\) based on state \(s_t\) refers to selecting the recommended exercise \(e_t \in {\mathcal {Z}}_t\) for the student. Specifically, we select an exercise by sampling from the distribution \(\pi (a\mid s_t;\varphi )\), where \(\varphi \) is the set of model parameters. Meanwhile, it should be noted that the action space \(A_t\) is defined on the candidate exercise subset \({\mathcal {L}}\).
Reward. After the agent selects an action (i.e., an exercise) from the exercise subset, a reward signal r is received. Subsequently, this exercise is added to the Top-enhanced recommendation list \({\mathcal {L}}^*\). Considering that the design of the reward function directly affects the agent's action optimization strategy, we carefully design a reward function with ranking-quality characteristics. As mentioned before, our goal is to rank the exercises that students answer incorrectly at the top of the recommendation list. Therefore, the reward function follows the strategy that the higher the ranking of a correctly recommended exercise, the stronger the stimulation will be. As a result, we redefine the standard NDCG metric as the reward function. At each exercising step t, the reward \(R(s_t,a_t)\) for the state–action pair \((s_t,a_t)\) is defined as follows:
$$\begin{aligned} R(s_t,a_t) = \frac{2^f-1}{log_2(t+1)} \end{aligned}$$
(5)
where \(f\in [0,1]\) represents the probability that student \(u_m\) responds to the exercise \(e_i\) incorrectly, which is a flexible feedback factor. Formally, we design the feedback factor f as follows:
$$\begin{aligned} f = 1-\mid g-\textrm{y}_{u_m, e_i}^t \mid \end{aligned}$$
(6)
where \(\textrm{y}_{u_m, e_i}^t\) represents the performance score of the exercise \(e_i\) practiced by student \(u_m\) at her exercising step t. Specifically, we set \(g=0\) in Eq. (6), so that TERD follows the common strategy of recommending non-mastered exercises, as many existing works do. Note that g is a flexible factor that could be adjusted if we want the agent to focus more on recommending exercises of a desired difficulty.
To this end, we implement a system simulator based on a knowledge tracing technique to simulate the performance score \(\textrm{y}_{u_m, e_i}^t\) according to the feedback of the corresponding exercising step. Specifically, Deep Knowledge Tracing (DKT) [41] is applied to acquire the implicit knowledge mastery level \(\textrm{y}_{u_m, e_i}^t\). The input of the DKT model \(\mathbb {Y}\) is the historical learning sequence \({\mathcal {Z}}_{u_m}\) of the student \(u_m\), while the output \(\textrm{y}_{u_m, e_i}^t\) is a prediction that represents the probability of the student answering the exercise correctly. The probability of student \(u_m\) answering exercise \(e_i\) correctly can be obtained from Eq. (7).
$$\begin{aligned} \textrm{y}_{u_m, e_i}^t = \mathbb {Y}^t {\left( \theta _{u_m},e_i, {\mathcal {Z}} _{u_m} \right) } \end{aligned}$$
(7)
where \(\theta _{u_m}\) denotes the trainable parameters of the DKT model.
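The reward of Eqs. (5)-(7) then reduces to a few lines: the knowledge-tracing simulator supplies the probability y that the student answers \(e_i\) correctly, the feedback factor is \(f = 1-\mid g-\textrm{y}\mid \) with \(g=0\), and the \(\log _2(t+1)\) denominator gives the NDCG-style position discount. In the sketch below, the DKT output is a stand-in value supplied by the caller, not the actual model.

import math

def reward(t: int, y_correct_prob: float, g: float = 0.0) -> float:
    """NDCG-style reward of Eq. (5) for the exercise recommended at step t (t >= 1).

    y_correct_prob: probability that the student answers correctly, assumed to come
    from a knowledge-tracing model such as DKT (stubbed by the caller).
    """
    f = 1.0 - abs(g - y_correct_prob)            # Eq. (6): with g = 0, f = 1 - y
    return (2.0 ** f - 1.0) / math.log2(t + 1)   # Eq. (5)

# Example: an exercise the student would likely get wrong (y = 0.1) earns about 0.866
# when ranked first (t = 1), but only about 0.433 when ranked third (t = 3).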

Training mechanism

Training Procedure. To make the framework easier to understand, Algorithm 1 summarizes the major steps of the training process. The training procedure contains two closely related phases, i.e., acquiring prior knowledge (corresponding to lines 1-2 in Algorithm 1) and student network training (corresponding to lines 3-16 in Algorithm 1). In the first phase, we distill the exercise subset from the well-trained teacher network as a type of prior knowledge to be incorporated into the student network (corresponding to line 2 in Algorithm 1).
In the second phase, the agent starts its learning process from an initial state \(s_0\) (corresponding to line 4 in Algorithm 1). At each exercising step t, the agent acquires the student's state \(s_t\), estimates the Q-values of the exercises in the candidate exercise subset \({\mathcal {L}}\), and takes action \(a_t\). Subsequently, the agent receives reward \(r_{t+1}\) based on the student's response score \(y_i\) (corresponding to lines 5-10 in Algorithm 1). On lines 11-16, the agent transitions to a new state \(s_{t+1}\) and the action space is adjusted accordingly. Here, the experience replay buffer \(D_{buffer}\) is used to store the transition tuples \(\left\langle s,a,r,s^{'} \right\rangle \). Note that we also employ the widely used target network [18] with a soft-replace technique to smooth learning and avoid oscillations or divergence of the parameters (corresponding to lines 17-20 in Algorithm 1).
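Phase two of Algorithm 1 thus boils down to a standard DDQN loop over the distilled candidate subset. The sketch below reuses the ddqn_loss helper above, assumes an environment whose actions are indices into \({\mathcal {L}}\) and which exposes num_actions, and uses epsilon-greedy exploration plus the soft target-network replacement described in the text; all names are illustrative, not the authors' code.

import random
import torch
from collections import deque

def soft_update(target_q, online_q, tau=0.01):
    """Soft replace: theta_target <- tau * theta_online + (1 - tau) * theta_target."""
    for tp, op in zip(target_q.parameters(), online_q.parameters()):
        tp.data.mul_(1.0 - tau).add_(tau * op.data)

def train_episode(online_q, target_q, optimizer, env, encode_state, buffer: deque,
                  batch_size=32, epsilon=0.2, gamma=0.9, tau=0.01):
    """One training episode of the student network (illustrative)."""
    state = encode_state(env.reset())
    done = False
    while not done:
        if random.random() < epsilon:                        # epsilon-greedy exploration
            action = random.randrange(env.num_actions)
        else:
            with torch.no_grad():
                action = int(online_q(state.unsqueeze(0)).argmax(dim=1))
        history, r, done = env.step(action)
        next_state = encode_state(history)
        buffer.append((state, action, r, next_state))        # experience replay buffer D
        state = next_state

        if len(buffer) >= batch_size:                        # sample a mini-batch and update
            s, a, rew, s2 = zip(*random.sample(buffer, batch_size))
            batch = (torch.stack(s), torch.tensor(a), torch.tensor(rew), torch.stack(s2))
            loss = ddqn_loss(online_q, target_q, batch, gamma)   # TD error of Eq. (1)
            optimizer.zero_grad()
            loss.backward()
            optimizer.step()
            soft_update(target_q, online_q, tau)             # smooth target-network update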
Testing Procedure. Algorithm 2 describes TERD's testing procedure. Utilizing the candidate exercise subset \({\mathcal {L}}\) from the teacher as distilled knowledge, the student network generates, for each student \(u_i\) in the testing data, a high-quality Top-enhanced list \({\mathcal {L}}^*\) after T time steps.
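At test time no exploration is needed: as in Algorithm 2, the student network simply ranks the distilled candidates greedily. A short sketch, again with assumed names (candidates is the list \({\mathcal {L}}\) of exercise ids scored in order by online_q, and simulate_response stands in for the knowledge-tracing simulator):

import torch

def generate_top_enhanced_list(online_q, encode_state, candidates, simulate_response, T: int):
    """Greedy roll-out: pick the highest-Q exercise T times to build L* (illustrative)."""
    top_list, history, recommended = [], [], set()
    for _ in range(min(T, len(candidates))):
        state = encode_state(history)
        with torch.no_grad():
            q = online_q(state.unsqueeze(0)).squeeze(0)
        if recommended:
            q[list(recommended)] = float("-inf")   # never repeat an already recommended exercise
        idx = int(q.argmax())
        recommended.add(idx)
        top_list.append(candidates[idx])           # L*: exercise ids in ranked order
        history.append((candidates[idx], simulate_response(candidates[idx])))
    return top_list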

Theoretical analysis of TERD

In this section, we provide a theoretical analysis of the TERD algorithm with reference to [42-44].
Theorem 1
For a given reward function in Eq. (5), if the TD error in Eq. (1) is minimized, the proposed TERD algorithm outperforms the corresponding teacher recommender in maximizing the ranking accuracy of the recommendation list.
Proof of Theorem 1
The goal is to learn a policy \(\pi \) that maps each state to an action, so that the value function of any state \(s_t\) is the maximized expected return received from time step t onward. The state value function and the state–action value function for a policy \(\pi \) can be defined as follows.
$$\begin{aligned} \begin{aligned} V(s_t;\varphi )&= \mathbb {E}_{\pi _\varphi }\left[ \sum _{k=0}^\infty \gamma ^k r(s_{t+k})\right] = \mathbb {E}_{\pi _\varphi }[r_t + \gamma V(s_{t+1};\varphi )] \\ Q(s_t, a_t;\varphi )&= \mathbb {E}_{\pi _\varphi }[r_t + \gamma V(s_{t+1};\varphi ) \mid a_t], \quad a_t = \pi (s_t; \varphi ) \\ \end{aligned} \end{aligned}$$
(8)
Besides, the DDQN network is updated accordingly by the temporal difference learning approach. Then, the updated policy is denoted as \(\pi _{\varphi ^{'}}\). According to the policy improvement theorem [44],
$$\begin{aligned} \begin{aligned} \forall s_t,\ \text {if } Q(s_t, \pi (s_t;\varphi ^{'}); \varphi ) \geqslant V(s_t;\varphi ),\ \text {then } \pi _{\varphi ^{'}} \geqslant \pi _{\varphi } \end{aligned} \end{aligned}$$
(9)
By applying the above theorem repeatedly, we have
$$\begin{aligned} \begin{aligned} V(s_t;\varphi )&\leqslant Q(s_t, \pi (s_t;\varphi ^{'}); \varphi ) \\&= \mathbb {E}_{\pi _{\varphi ^{'}}}[r_t + \gamma V(s_{t+1};\varphi )] \\&\leqslant \mathbb {E}_{\pi _{\varphi ^{'}}}[r_t + \gamma Q(s_{t+1}, \pi (s_{t+1};\varphi ^{'}); \varphi )] \\&= \mathbb {E}_{\pi _{\varphi ^{'}}} \left[ r_t + \gamma \mathbb {E}_{\pi _{\varphi ^{'}}}[r_{t+1} + \gamma V(s_{t+2};\varphi )] \right] \\&= \mathbb {E}_{\pi _{\varphi ^{'}}}[r_{t} + \gamma r_{t+1} +\gamma ^2 V(s_{t+2};\varphi )] \\&\leqslant \mathbb {E}_{\pi _{\varphi ^{'}}}[r_{t} + \gamma r_{t+1} + \gamma ^2 r_{t+2} +\gamma ^3 V(s_{t+3};\varphi )] \\&\;\;\vdots \\&= \mathbb {E}_{\pi _{\varphi ^{'}}}[r_{t} + \gamma r_{t+1} + \gamma ^2 r_{t+2} +\gamma ^3 r_{t+3} + \dots ] \\&= V(s_t;\varphi ^{'})\\ \end{aligned} \end{aligned}$$
(10)
The above equation demonstrates that it would be beneficial to use the updated policy \(\pi _{\varphi ^{'}}\) to generate a high-quality recommendation list. \(\square \)

Experiment

In this section, we successively report the dataset descriptions, the parameter settings of the teacher and student networks, and the evaluation protocols. Finally, we conduct extensive experiments on three datasets to evaluate the Top-enhanced recommendation performance, aiming to answer the following research questions:
  • RQ 1: Does the student network play a critical role in advancing the performance of Top-enhanced recommendation?
  • RQ 2: Compared with existing well-known learning-to-rank models and RL-based exercise recommendation techniques, how does our proposed TERD perform when K takes different values?
  • RQ 3: How does TERD perform in terms of the model efficiency compared to other state-of-the-art methods?
  • RQ 4: How do the key hyper-parameter settings affect TERD?
  • RQ 5: Can TERD provide intuitive interpretations in the Top-enhanced recommendation scenario?
We ask RQ 1 to evaluate whether applying the student network to the six advanced teacher networks is effective. We ask RQ 2 to evaluate the performance of the proposed TERD framework by comparing its results with two advanced learning-to-rank algorithms, i.e., DeepRank [15] and SQL-Rank [45], and four RL-based exercise recommendation methods, i.e., DQN [46], MOOCERS [47], DDQN [38], and DDPG [48]. For RQ 3, we compare the efficiency of all methods on the three datasets. Then, we conduct a series of parameter experiments to test the influence of the action space size, hidden dimensionality, and batch size. Finally, we visualize the exercising processes of two students to evaluate the ability of the proposed TERD models to solve the Top-enhanced task.

Experimental settings

Dataset descriptions

The experiments are carried out on three real-world datasets: ASSISTments0910 [49], Algebra2005 [50], and Statics2011 [51]. The basic statistical information for all the datasets is shown in Table 1. The detailed descriptions are as follows:
ASSISTments0910. This dataset was provided by the online intelligent tutoring platform ASSISTments. Notably, it was gathered from the "skill-Builder" question sets and includes two heterogeneous features of online learning, hint counts and attempt counts. During preprocessing, we removed students with no skills or fewer than three records from the "skill-Builder" dataset.
Algebra2005 (denoted Algebra0506 in the tables). This dataset is part of the KDD Cup 2010 EDM Challenge dataset. Again, we removed students with fewer than three transactions. After preprocessing, there are 574 students, 436 knowledge concepts, 1084 exercises, and 607,025 interaction records.
Statics2011. This dataset was collected from an OLI Engineering Statics Course in Fall 2011, which contains 45,002 interactions on 87 concepts by 335 students. Note that this dataset was the densest of all the three datasets.
Table 1
Summary of experiment datasets (after preprocessing)

Statistics          ASSISTments0910   Algebra0506   Statics2011
No. of concepts     110               436           87
No. of students     4151              574           335
No. of exercises    16,891            1084          300
No. of records      325,637           607,025       45,002
All the experiments are conducted on a server with an NVIDIA RTX 3080 GPU with 10 GB of video memory. We implement our model using PyTorch, a popular deep learning framework. To set up the experiments, we divided each dataset into 70%/10%/20% partitions, using 70% as the training set, 10% as the validation set, and 20% as the testing set.

Teacher and student settings

Teacher: The parameters of the training algorithms (learning rate \(\eta _{1}\), layers l, discount factor \(\beta _{1}\), the number of neighbors n, the initial temperature parameter J, reduction factor c, learning goal g, experience replay memory D, the depth of propagation p, and difficulty range dr) are elaborately set by preliminary experiments, as shown in Table 2.
Student: the learning rate \(\eta _{2}=0.01\); greedy policy p=0.2; discount factor \(\beta _{2}=0.9\); size of the experience replay memory \(D_{buffer}\)=100; and the parameters of the target network are only updated every 100 steps from the online network.
Table 2
Parameter settings for teacher networks

Model       Parameter       ASSISTments0910   Algebra0506   Statics2011
ER-LOAF     \(\eta _{1}\)   0.001             0.001         0.01
            g               0                 0             0
HB-DeepCF   \(\eta _{1}\)   0.01              0.01          0.01
            \(\lambda _{l}\) 4                4             4
DKVMN-RL    \(\eta _{1}\)   0.001             0.001         0.001
            D               200               100           50
            \(\beta _{1}\)  0.9               0.9           0.9
LSTMCQP     \(\eta _{1}\)   0.001             0.001         0.001
            dr              0.4               0.5           0.4
KCP-ER      \(\eta _{1}\)   0.001             0.001         0.01
            g               0                 0             0
            J               100               100           100
            c               0.095             0.095         0.195
TP-GNN      \(\eta _{1}\)   0.0005            0.0001        0.001
            n               50                30            10
            p               2                 2             2

Evaluation protocols

We select three widely used metrics, namely Precision@N, MAP@N, and NDCG@N, to evaluate the Top-enhanced recommendation performance of TERD; a short code sketch of these metrics is given after their definitions.
(1) Precision@N measures the accuracy of the recommendation results. It is the proportion of non-mastered exercises (i.e., recommended exercises that the student answers incorrectly, r=0) over the total number of exercises in the recommendation list. Precision@N can be computed as:
$$\begin{aligned} \begin{aligned} Precision@N = \frac{N(r=0)}{N(r=0)+N(r=1)} \end{aligned} \end{aligned}$$
(11)
(2) MAP@N is computed by considering the precision at all positions of the recommended list. First, AP@N is defined as:
$$\begin{aligned} \begin{aligned} AP@N = \frac{\sum _{k=1}^N r_k/k}{N} \end{aligned} \end{aligned}$$
(12)
where \(r_{k}=1\) if the k-th recommended exercise is non-mastered and \(r_{k}=0\) otherwise. MAP@N is then computed as the mean value of AP@N over all students.
(3) NDCG@N assigns higher scores to correct recommendations at higher ranks in the final recommendation list.
$$\begin{aligned} \begin{aligned} NDCG@N = Z_n \sum _{k=1}^N \frac{2^{r_k}-1}{log_2(k+1)} \end{aligned} \end{aligned}$$
(13)
where \(Z_{n}\) denotes the normalization term computed from the ideal DCG (IDCG). We perform experiments with the settings N={2, 5, 10}.
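Under the convention that \(r_k=1\) marks a recommended non-mastered exercise, the three metrics can be computed directly from the per-position indicators, as in the short sketch below (our own implementation, following the formulas literally).

import math
from typing import List

def precision_at_n(hits: List[int], n: int) -> float:
    """Eq. (11): fraction of the top-n recommendations that are non-mastered exercises."""
    return sum(hits[:n]) / n

def ap_at_n(hits: List[int], n: int) -> float:
    """Eq. (12): AP@N; MAP@N is this value averaged over all students."""
    return sum(hits[k - 1] / k for k in range(1, n + 1)) / n

def ndcg_at_n(hits: List[int], n: int) -> float:
    """Eq. (13): DCG of the recommended order divided by the ideal DCG."""
    dcg = sum((2 ** hits[k - 1] - 1) / math.log2(k + 1) for k in range(1, n + 1))
    ideal = sorted(hits[:n], reverse=True)
    idcg = sum((2 ** ideal[k - 1] - 1) / math.log2(k + 1) for k in range(1, n + 1))
    return dcg / idcg if idcg > 0 else 0.0

# Example: hits = [1, 0, 1, 1, 0] gives Precision@5 = 0.6, AP@5 ≈ 0.32, NDCG@5 ≈ 0.91.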

Competitive models

As alluded to previously, we select six well-trained recommenders as teacher networks: ER-LOAF [33], HB-DeepCF [34], DKVMN-RL [35], LSTMCQP [36], KCP-ER [13], and TP-GNN [37]. To further justify the effectiveness of TERD, we also include two learning-to-rank models (i.e., SQL-Rank and DeepRank) and four RL-based recommendation techniques (i.e., DQN [46], MOOCERS [47], DDQN [38], and DDPG [48]) as comparison methods.
SQL-Rank [45]: This is a list-wise model, which maximizes the likelihood of a permutation model to build learner-specific rankings.
DeepRank [15]: This is a neural network-based rank approach, which combines Matrix Factorization algorithms and deep neural network for solving ranking learning.
DQN [46]: This is a DQN-based recommendation method, in which a deep Q-network is used to select the optimal exercise at each step.
MOOCERS [47]: This is the first attempt to use the actor-critic framework of reinforcement learning to support exercise recommendation services. Besides, it designs a flexible reward function that takes into account three objectives: Review, Difficulty, and Learn.
DDQN [38]: It extends DQN and proposes a new way to calculate the training target.
DDPG [48]: It uses the deep deterministic policy gradient (DDPG) algorithm to select the highest-ranking score for recommendation.
The implementations of SQL-Rank, DeepRank, DQN, MOOCERS, DDQN, and DDPG are based on the original papers, with some fine-tuning to fit our task. For the reward functions of the RL-based recommendation techniques, this study adopts exactly the same design as TERD. For a fair comparison, the six competing methods adopt the same set of important parameters (i.e., hidden dimensionality, training batch size, and learning rate) as TERD.

TERD evaluation results and analysis (RQ1)

In this section, we compare TERD with the six well-trained teacher networks. Tables 3, 4, and 5 report the results of the comparison methods on the three datasets, respectively. We observe several interesting findings:
  • Comparing the results before and after removing the student network, we find that the proposed TERD achieves comparable or considerably improved performance at the top positions. It should be emphasized that TERD without the student network degenerates into a general model focused on Top-N recommendation. In contrast, the student network utilizes the knowledge transferred from the teacher network to effectively reduce the action space in DRL, and finally promotes the recommendation performance. This demonstrates the positive effect of absorbing the essence of well-trained recommenders for Top-enhanced recommendation.
  • We notice that, compared to the metric results on the other two datasets, the improvements of the three metrics on Statics2011 are more significant. This indicates that our proposed TERD can achieve better performance on dense datasets. On the sparsest dataset, ASSISTments0910, the TERD framework also achieves a significantly large performance improvement.
  • The performance of the pure KCP-ER method is the closest to ours among all the benchmark models, as it carefully designed four flexible optimization goals. Especially, the difficulty goal emphasized by the KCP-ER method shows advantages in promoting the performance of the algorithm.
  • The results also reveal that Top-N recommendation models based on cognitive diagnosis, such as KCP-ER and LSTMCQP, achieve superior performance to the representative recommendation methods, such as ER-LOAF and HB-DeepCF. This is because the representative recommendation methods focus only on explicit student-exercise interaction information, while the recommendation methods of the cognitive diagnosis paradigm (i.e., KCP-ER and LSTMCQP) aim to provide exercises that cohere with each student's proficiency level.
  • All the above evidence indicates that TERD can generate excellent Top-enhanced recommendations while remaining flexible enough to replace the teacher network without redesigning the strategy. This is the strongest validation of the advantage of being fully adaptive.
Table 3
Results of comparison among the six teacher recommenders with and without the student network on the ASSISTments0910 dataset

Method        Precision                 MAP                       NDCG                      Avg
              @2      @5      @10       @2      @5      @10       @2      @5      @10
ER-LOAF       0.3731  0.3654  0.3645    0.3310  0.2691  0.2342    0.3745  0.5345  0.6049
TERD          0.5084  0.4491  0.4352    0.4556  0.3515  0.3016    0.5083  0.6897  0.7730
Improve(%)    36.26   22.91   19.40     37.64   30.62   28.78     35.73   29.04   27.79     29.7967
HB-DeepCF     0.3359  0.3400  0.3478    0.2901  0.2382  0.2118    0.3365  0.4895  0.5596
TERD          0.4864  0.4374  0.4211    0.4254  0.3314  0.2836    0.4851  0.6659  0.7430
Improve(%)    44.81   28.65   21.08     46.64   39.13   33.90     44.16   36.04   32.77     36.3533
DKVMN-RL      0.3588  0.3726  0.3763    0.3214  0.2769  0.2452    0.3608  0.5319  0.6052
TERD          0.4921  0.4565  0.4402    0.4440  0.3568  0.3048    0.4964  0.6900  0.7709
Improve(%)    37.15   22.52   16.98     38.15   28.86   24.31     37.58   29.72   27.38     29.1833
LSTMCQP       0.3626  0.3841  0.3964    0.3180  0.2836  0.2611    0.3608  0.5389  0.6204
TERD          0.4859  0.4573  0.4537    0.4284  0.3522  0.3134    0.4828  0.6782  0.7679
Improve(%)    34.00   19.06   14.46     34.72   24.19   20.03     33.81   25.85   23.77     25.5433
KCP-ER        0.4288  0.4136  0.4053    0.3867  0.3231  0.2800    0.4296  0.6104  0.6863
TERD          0.5321  0.4827  0.4575    0.4779  0.3864  0.3283    0.5311  0.7327  0.8138
Improve(%)    24.09   16.71   12.88     23.58   19.59   17.25     23.63   20.04   18.58     19.5944
TP-GNN        0.3307  0.3531  0.3699    0.2912  0.2550  0.2324    0.3286  0.4932  0.5707
TERD          0.4638  0.4507  0.4428    0.4128  0.3413  0.2990    0.4633  0.6605  0.7439
Improve(%)    40.25   27.64   19.71     41.76   33.84   28.66     40.99   33.92   30.35     33.0133
Table 4
Results of comparison among the six teacher recommenders with and without the student network on the Algebra0506 dataset

Method        Precision                 MAP                       NDCG                      Avg
              @2      @5      @10       @2      @5      @10       @2      @5      @10
ER-LOAF       0.2018  0.2128  0.2145    0.1686  0.1319  0.1143    0.2018  0.2958  0.3315
TERD          0.3349  0.2839  0.2696    0.2815  0.1954  0.1594    0.3395  0.4473  0.4909
Improve(%)    65.96   33.41   25.69     66.96   48.14   39.46     68.24   51.22   48.08     49.8775
HB-DeepCF     0.2099  0.2154  0.2264    0.1760  0.1354  0.1194    0.2096  0.3031  0.3465
TERD          0.3383  0.3012  0.2874    0.2821  0.2028  0.1683    0.3344  0.4533  0.4996
Improve(%)    61.17   39.83   26.94     60.28   49.78   40.95     59.54   49.55   44.18     48.0244
DKVMN-RL      0.2523  0.2644  0.2675    0.2076  0.1646  0.1430    0.2507  0.3710  0.4185
TERD          0.3601  0.3154  0.3010    0.2947  0.2119  0.1743    0.3533  0.4795  0.5301
Improve(%)    42.73   19.29   12.52     41.96   28.74   21.89     40.93   29.25   26.67     29.3311
LSTMCQP       0.2764  0.2631  0.2692    0.2311  0.1709  0.1466    0.2746  0.3867  0.4390
TERD          0.3704  0.3057  0.3050    0.3154  0.2147  0.1771    0.3686  0.4845  0.5450
Improve(%)    34.01   16.19   13.30     36.48   25.63   20.80     34.23   25.29   24.15     25.5644
KCP-ER        0.2798  0.2768  0.2715    0.2317  0.1795  0.1497    0.2793  0.4006  0.4470
TERD          0.3727  0.3314  0.3057    0.3096  0.2254  0.1814    0.3709  0.5056  0.5520
Improve(%)    33.20   19.73   12.60     33.62   25.57   21.18     32.80   26.21   23.49     25.3778
TP-GNN        0.2241  0.2160  0.2323    0.1950  0.1401  0.1224    0.2269  0.3206  0.3715
TERD          0.3402  0.2724  0.2809    0.2956  0.1901  0.1583    0.3421  0.4429  0.5054
Improve(%)    51.81   26.11   20.92     51.81   35.69   29.33     50.77   38.15   36.04     37.8478
Table 5
Results of comparison among the six teacher recommenders with and without the student network on the Statics2011 dataset

Method        Precision                 MAP                       NDCG                      Avg
              @2      @5      @10       @2      @5      @10       @2      @5      @10
ER-LOAF       0.2169  0.2024  0.2316    0.1884  0.1358  0.1191    0.2269  0.3080  0.3620
TERD          0.4504  0.3752  0.3224    0.3934  0.2795  0.2104    0.4483  0.5888  0.6257
Improve(%)    107.65  85.38   39.21     108.81  105.82  76.66     97.58   91.17   72.85     87.2367
HB-DeepCF     0.2140  0.2020  0.2437    0.1863  0.1360  0.1271    0.2240  0.2989  0.3578
TERD          0.4594  0.3968  0.3403    0.4050  0.3039  0.2321    0.4607  0.6089  0.6416
Improve(%)    114.67  96.44   39.64     117.39  123.46  82.61     105.67  103.71  79.32     95.8788
DKVMN-RL      0.1937  0.2082  0.3055    0.1522  0.1345  0.1565    0.1875  0.2628  0.3516
TERD          0.4483  0.3890  0.3421    0.3921  0.3016  0.2399    0.4496  0.5857  0.6135
Improve(%)    131.44  86.84   11.98     157.62  124.24  53.29     139.79  122.87  74.49     100.2844
LSTMCQP       0.2224  0.2980  0.3371    0.1811  0.1782  0.1813    0.2203  0.3701  0.4418
TERD          0.4522  0.3980  0.3625    0.3925  0.2977  0.2268    0.4505  0.6076  0.6412
Improve(%)    103.33  33.56   7.50      116.73  67.06   25.10     104.49  64.17   45.13     63.0078
KCP-ER        0.3474  0.3347  0.3161    0.3042  0.2361  0.1926    0.3537  0.4961  0.5423
TERD          0.4743  0.3950  0.3525    0.4191  0.2978  0.2340    0.4768  0.6245  0.6727
Improve(%)    36.53   18.02   11.52     37.77   26.13   21.50     34.80   25.88   24.05     26.2445
TP-GNN        0.2165  0.2369  0.2925    0.1849  0.1565  0.1632    0.2238  0.3156  0.3795
TERD          0.4425  0.3610  0.3208    0.3975  0.2850  0.2322    0.4542  0.5737  0.6002
Improve(%)    104.39  52.38   9.68      114.98  82.11   42.28     102.95  81.78   58.16     72.0789

Comparative results (RQ2)

In this subsection, we compare TERD with two well-known learning-to-rank algorithms and four RL-based exercise recommendation techniques. According to Sect. 5.2 and Table 6, we can draw conclusions from two aspects. On the one hand, TERD consistently outperforms the state-of-the-art models by a considerable margin, which is the strongest evidence of the effectiveness of the proposed TERD model. On the other hand, the RL-based recommendation techniques without explicit ranking mechanisms are inferior to the learning-to-rank models. The reason is that the RL-based exercise recommendation techniques contain a large number of candidate exercises in the action space, which makes it difficult to perform the recommendation task in such a complex space. Besides, another merit of TERD is that it is able to take the perspective of each student and gain insight into their mastery of specific knowledge concepts. Overall, the results of the numerical experiments confirm the effectiveness of introducing the distillation technique.
Table 6
Performance of comparison methods on different datasets

Dataset           Method     Precision                 MAP                       NDCG
                             @2      @5      @10       @2      @5      @10       @2      @5      @10
ASSISTments0910   SQL-Rank   0.3896  0.3856  0.3816    0.3514  0.2909  0.2502    0.3929  0.5640  0.6365
                  DeepRank   0.4506  0.4185  0.4087    0.4082  0.3300  0.2844    0.4517  0.6290  0.7068
                  DQN        0.3562  0.3538  0.3530    0.3042  0.2482  0.2179    0.3543  0.5110  0.5796
                  MOOCERS    0.3632  0.3592  0.3517    0.3046  0.2511  0.2229    0.3497  0.5119  0.5834
                  DDQN       0.3877  0.3751  0.3680    0.3427  0.2771  0.2369    0.3894  0.5532  0.6223
                  DDPG       0.3800  0.3717  0.3636    0.3232  0.2615  0.2365    0.3806  0.5494  0.6066
Algebra0506       SQL-Rank   0.2741  0.2709  0.2587    0.2322  0.1786  0.1455    0.2769  0.3957  0.4356
                  DeepRank   0.3211  0.2943  0.2885    0.2706  0.2012  0.1678    0.3201  0.4427  0.4947
                  DQN        0.2477  0.2489  0.2433    0.2093  0.1610  0.1312    0.2482  0.3582  0.3991
                  MOOCERS    0.2489  0.2484  0.2472    0.2099  0.1589  0.1318    0.2470  0.3567  0.3997
                  DDQN       0.3234  0.2787  0.2655    0.2781  0.1904  0.1515    0.3265  0.4360  0.4818
                  DDPG       0.2706  0.2557  0.2575    0.2242  0.1624  0.1381    0.2686  0.3764  0.4251
Statics2011       SQL-Rank   0.3235  0.3094  0.2878    0.2858  0.2196  0.1751    0.3277  0.4565  0.4939
                  DeepRank   0.4191  0.3796  0.3470    0.3722  0.2835  0.2271    0.4216  0.5759  0.6248
                  DQN        0.3456  0.3186  0.3029    0.2969  0.2221  0.1815    0.3448  0.4738  0.5215
                  MOOCERS    0.3180  0.3112  0.2672    0.2270  0.1943  0.1513    0.2868  0.4220  0.4458
                  DDQN       0.4283  0.3737  0.3172    0.3888  0.2820  0.2082    0.4362  0.5829  0.6148
                  DDPG       0.3291  0.3103  0.2917    0.2719  0.2183  0.1993    0.3367  0.4631  0.5179

Comparisons of efficiency (RQ3)

In real-world large-scale educational scenarios, reducing deployment costs and improving model efficiency is a fundamental but meaningful task. Correspondingly, we test and compare the training cost (running time per epoch), the number of parameters, and the testing cost as criteria for judging efficiency, to explore whether TERD outperforms the baseline models. Table 7 presents the experimental results of TERD and the baselines on the three datasets. Due to space limitations, we only keep the results of TERD based on the ER-LOAF model; the others show the same trend. From Table 7, we observe that the time efficiency of TERD outperforms the baselines significantly because TERD is equipped with a more efficient distillation technique that greatly reduces the search space of the student network. We also find that all methods on the ASSISTments0910 dataset are extremely time-consuming, as the number of exercises is significantly larger than in the other datasets; as a result, the time costs of all methods on ASSISTments0910 are larger than on the other datasets. Besides, all methods have much faster computation and fewer parameters when processing the smaller datasets (i.e., Algebra0506 and Statics2011). Remarkably, the RL-based exercise recommendation methods require at least twice as many parameters as the proposed model. In sum, our proposed TERD saves execution time while also achieving exceptional outcomes. Overall, this observation strongly confirms the advantage of the model in balancing effectiveness and efficiency.
Table 7
Comparisons of efficiency on three datasets

Model      Phase   ASSISTments0910          Algebra0506              Statics2011
                   Time(s)  #Parameters     Time(s)  #Parameters     Time(s)  #Parameters
SQL-Rank   Train   552      0.2195M         65       0.0329M         35       0.0214M
           Test    31                       4.3                      2.9
DeepRank   Train   698      0.2679M         81       0.0542M         50       0.0496M
           Test    36                       5.9                      4.6
DQN        Train   493      1.3134M         67       0.1228M         41       0.0308M
           Test    25                       5.3                      3.6
MOOCERS    Train   764      1.3265M         83       0.1358M         43       0.0438M
           Test    40                       5.6                      4.1
DDQN       Train   513      2.6269M         89       0.2455M         43       0.0616M
           Test    29                       5.6                      3.8
DDPG       Train   493      1.3134M         67       0.1228M         41       0.0308M
           Test    25                       5.3                      3.6
TERD       Train   468      0.1324M         49       0.0471M         31       0.0186M
           Test    13                       3.2                      2.3

Parameter sensitivity analysis (RQ4)

To further verify the potential impact of hyper-parameters on performance, we explored the performance on the three datasets for three metrics: Precision, MAP, and NDCG. Due to space limitations, for the following studies we only show the results of Top-5.

Parameter analysis on the action space

The size of the action space is an important parameter of TERD, which has a direct and crucial influence on the recommendation performance. The results are shown in Fig. 5, from which we can see: (i) In most cases, as the size of the action space increases, the performance of all models slowly increases up to a point and then slightly degenerates. (ii) We also note that the traditional recommendation models do not perform well on the three metrics, with the only exception of the NDCG@5 indicator of the HB-DeepCF method on the Statics2011 dataset, where HB-DeepCF achieves remarkable distillation results.

Parameter analysis on the hidden dimensionality

The hidden dimensionality is another important hyper-parameter of our model. Figure 6 shows the results of the six baseline methods w.r.t. different settings of the hidden dimensionality. Empirically, we tune the hidden dimensionality \(h \in \{ 10,20,30,40,50,60,70,80,90,100 \}\). On all datasets, as the hidden dimensionality increases, the performance of TERD on Precision@5, MAP@5, and NDCG@5 improves. It is worth noting that the performance of TERD degrades significantly after reaching its peak. From these results, we also find that the traditional recommendation models do not perform well on the three metrics. Therefore, we conclude that a sensible hidden dimensionality is indeed helpful for improving the model.

Parameter analysis on the batch size

Here we explore the impact of the batch size N on TERD by tuning N in the range \(\{ 8,16,32,64 \}\). Figure 7 reports the results of all models trained on all datasets using the Precision@5, MAP@5, and NDCG@5 metrics. In our experiments on the ASSISTments0910, Algebra0506, and Statics2011 datasets, we achieved the best performance at \(N=64\), \(N=16\), and \(N=32\), respectively. In summary, we conclude that a properly small batch size promotes the training and convergence of the TERD algorithm, whereas too large a batch size occupies too much memory and leads to a performance decline.

Case study (RQ5)

Besides improving performance, another important ability of TERD is to generate intuitive and easily understandable recommendation explanations. To analyze this claim in depth, we randomly selected two student samples from ASSISTments0910 and Statics2011 (user_id: 79063 and Stu_72da98f3bbf369da59be0b3451a45051), and visualized the change of their performance scores generated by the teacher and student networks. Figure 8 shows the comparison results of the six baselines with and without the student network. From the figure, we can see that all TERD models perform better at the top positions of the exercising process. In addition, we find that the two learning-to-rank algorithms and the four RL-based recommendation methods are significantly less effective, indicating that the introduction of the distillation technique leads to better ranking performance. With the visualization results, instructors can know how well students have mastered certain exercises and then assign targeted exercises. In summary, the heatmap-based explanations of TERD are intuitive, persuasive, and satisfactory.

Conclusion

Reinforcement learning and knowledge distillation are both widely used in various recommendation scenarios. Theoretically, the two techniques are usually considered to be mutually exclusive, so most existing recommendation algorithms make use of only a single technique. This work proposes an advanced distillation recommender, TERD, which brings a synergy effect for boosting accuracy in Top-enhanced recommendation. Specifically, this work reinforces the Top-enhanced recommendation performance of the DRL-based student network by absorbing valuable prior knowledge from the well-trained teacher network. Benefiting from the above innovations, prior knowledge that helps narrow the DRL action selection space is distilled into the student network, so that both the recommendation performance and the efficiency of the student network improve. The experimental evaluation on three datasets shows that our TERD framework indeed resolves the Top-enhanced issue.
However, TERD also has some limitations. (1) The proposed method estimates students' learning states only according to the exercise representations and students' responses, while ignoring other educational characteristics (e.g., slipping, guessing, exercise texts). We plan to exploit the slipping and guessing factors from the semantic representation of the exercise texts; intuitively, we could use two single-layer neural networks to model the slipping and guessing factors, respectively. (2) It is difficult for the proposed recommender to make decent recommendations for new students and new exercises that appear after the recommendation model is trained. To address this problem, we could extend TERD into a cross-domain recommender system that leverages data from external domains as prior knowledge to support the learning of the target recommendation model.
TERD, a promising tool for practical recommendation tasks, provides a new perspective for KD models. There are three potential improvements for future work. First, we will introduce state-of-the-art DRL techniques to further promote the agent's recommendation accuracy at the top positions. Second, knowledge incorrectly predicted by the teacher network can hardly assist the student network in generating excellent recommendations; we would therefore like to employ an additional professor model to assist in training a more expressive teacher. Moreover, to achieve multiple objectives (such as novelty and diversity) of Top-enhanced recommendation, we intend to apply multi-objective optimization algorithms to redesign the reward functions.

Declarations

Conflict of interest

The authors declare that they have no known competing financial interests or personal relationships that could have appeared to influence the work reported in this paper.
Open Access This article is licensed under a Creative Commons Attribution 4.0 International License, which permits use, sharing, adaptation, distribution and reproduction in any medium or format, as long as you give appropriate credit to the original author(s) and the source, provide a link to the Creative Commons licence, and indicate if changes were made. The images or other third party material in this article are included in the article's Creative Commons licence, unless indicated otherwise in a credit line to the material. If material is not included in the article's Creative Commons licence and your intended use is not permitted by statutory regulation or exceeds the permitted use, you will need to obtain permission directly from the copyright holder. To view a copy of this licence, visit http://creativecommons.org/licenses/by/4.0/.
