
Complex & Intelligent Systems

Open Access | 4 March 2024 | Original Article

STO-CVAE: state transition-oriented conditional variational autoencoder for data augmentation in disability classification

Authors: Seong Jin Bang, Min Jung Kang, Min-Goo Lee, Sang Min Lee

Published in: Complex & Intelligent Systems


Abstract

The class imbalance problem occurs when there is an unequal distribution of classes in a dataset and is a significant issue in various artificial intelligence applications. This study focuses on the severe multiclass imbalance problem of human activity recognition in rehabilitation exercises for people with disabilities. To overcome this problem, we present a novel human action-centric augmentation method for human skeleton-based pose estimation. This study proposes the state transition-oriented conditional variational autoencoder (STO-CVAE) to capture action patterns in repeated exercises. The proposed approach generates action samples by capturing temporal information of human skeletons to improve the identification of minority disability classes. We conducted experimental studies with a real-world dataset gathered from rehabilitation exercises and confirmed the superiority and effectiveness of the proposed method. Specifically, all investigated classifiers (i.e., random forest, support vector machine, extreme gradient boosting, light gradient boosting machine, and TabNet) trained with the proposed augmentation method outperformed the models trained without augmentation in terms of the F1-score and accuracy, with F1-score showing the most improvement. Overall, the prediction accuracy of most classes was improved; in particular, the prediction accuracy of the minority classes was greatly improved. Hence, the proposed STO-CVAE can be used to improve the accuracy of disability classification in the field of physical medicine and rehabilitation and to provide suitable personal training and rehabilitation exercise programs.

Publisher's Note

Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.

Introduction

Human activity recognition (HAR) aims to identify significant patterns from gestures and activities of the human body, which is extremely valuable in many applications. Recent advances in various technologies have made HAR research more dynamic, thus increasing demand in various industries [1]. Generally, HAR often utilizes multiple sensors, such as depth sensors, cameras, and wearable devices. However, data collection for HAR is a cost-intensive and challenging task because of privacy issues and data diversity problems [2]. In particular, it is an exceedingly difficult task to collect action data for people with disabilities. Nevertheless, collection of disability data for HAR research should be performed because the number of people with disabilities continues to increase, and they are performing various activities together in society with non-disabled people [3]. Owing to the difficulty of collecting data on general actions in daily life, HAR research for people with disabilities needs to be conducted using data on exercises. Additionally, in HAR, people with disabilities should be considered differently from non-disabled people because there are restrictions on various actions depending on the type of disability.

When collecting data for people with disabilities, the accuracy of the HAR task is particularly vulnerable to the class imbalance problem. The class imbalance problem occurs when the distribution of classes within a dataset is extremely skewed, that is, there are many more examples in one class (majority class) than in other classes (minority classes) [4]. This is an inherent problem that exists in a wide range of fields, such as facial and image recognition [5], medical diagnosis [6], and fraud detection [7]. Most studies assume that the data are balanced among different classes. However, in the case of data on disabled people, it is difficult to collect a perfectly balanced dataset. Therefore, identifying the minority disability classes is critical for assessing risks that may occur during exercises by the disabled. As such, improving the accuracy of the classifier is not sufficient because in supervised learning, the classifier may overfit to the majority classes [8].

Many studies have presented solutions to address the class imbalance problem. The most conventional approaches are the undersampling and oversampling methods. These methods balance the class ratio of imbalanced data by sampling from the original data. Another approach is the cost-sensitive learning method, in which machine learning algorithms, such as neural networks, converge using a cost function [9]. Ghorbani et al. [9] proposed modifying the cost function of neural networks with respect to relative weights based on class imbalance ratios. Recently, several augmentation studies have focused on capturing spatial and temporal information for different tasks, such as action recognition [10, 11]. However, data augmentation for HAR is complex because recognizing human action is a high-dimensional and complex task that requires the extraction of spatio-temporal information from human skeleton sequences [12, 13]. Few research articles have explored data augmentation techniques for the HAR task. Moreover, progress in related research has slowed because of the lack of data on people with disabilities.

Thus, we focus on alleviating the class imbalance problem by proposing a data augmentation method to improve the accuracy of disability classification. This study makes the following major contributions:

We propose the state transition-oriented conditional variational autoencoder (STO-CVAE), a data augmentation method specialized for human action recognition per type of disability, to resolve the class imbalance problem for the HAR task.

We provide a data-transformation method for capturing action from human pose estimation (HPE) to obtain action-centric data based on state transitions.

We examine the effectiveness and superiority of the proposed approach by conducting various experimental studies on real-world datasets.

The remainder of this paper is structured as follows. “Related works” reviews the related literature. The next section describes the proposed method. In “Experiments and results”, we present the experimental results for a real-world dataset. The final section summarizes the conclusions of the study and explores directions for future research.

Related works

This section presents an overview of the existing literature on four related topics: the data imbalance problem, HAR, HPE, and classification of tabular datasets. This study focuses on the data imbalance problem in classifying the disability type using video datasets. Additionally, HAR and HPE techniques are reviewed because they are used to recognize the actions of disabled people from videos. Finally, we examine the classification algorithms, especially for action recognition.

Data imbalance

Many studies have been conducted in various fields to solve the data imbalance problem. Traditionally, there are two main approaches to dealing with the imbalance problem: data-based and algorithm-based methods [14, 15].

The data-based methods adjust the class ratio of the training dataset through sampling, which can be divided into undersampling and oversampling methods. Undersampling techniques remove examples from the majority class to balance the class ratio of the dataset [16]; examples include random undersampling, the condensed nearest neighbor rule, and Tomek links [17, 18]. These methods reduce training time, but informative samples from the majority class may be lost and unpredictable bias may be introduced, potentially degrading classification performance [19].

On the other hand, oversampling techniques add examples to the minority class until it matches the majority class ratio. The oversampling approach includes random and synthetic sampling methods. Random oversampling (ROS) randomly replicates samples until the minority class has the same ratio as the majority class. However, it may cause overfitting because ROS simply duplicates samples from the original dataset, which is not conducive to the generalization performance of the classifier [20]. Meanwhile, as a representative synthetic sampling method, the synthetic minority oversampling technique (SMOTE) generates data for the minority classes using the k-nearest neighbor (k-NN) algorithm [21]. However, generating synthetic data based on the k-NN algorithm is inadequate when the data are high-dimensional, nonlinear, or complex. Moreover, research on the data imbalance problem in small datasets has become more active than that in big datasets. For small datasets, data augmentation methods have been developed that increase prediction accuracy by adjusting the ratio of synthetic data [22].
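To make the overfitting concern concrete, the following minimal sketch implements random oversampling with the Python standard library. The function name `random_oversample` and the toy data are illustrative assumptions, not part of the original study.

```python
import random
from collections import Counter


def random_oversample(samples, labels, seed=0):
    """Replicate minority-class samples at random until every class
    is as large as the largest class (hypothetical helper)."""
    rng = random.Random(seed)
    counts = Counter(labels)
    target = max(counts.values())
    out_samples, out_labels = list(samples), list(labels)
    for cls, n in counts.items():
        pool = [s for s, y in zip(samples, labels) if y == cls]
        for _ in range(target - n):
            # Each added sample is an exact copy, so no new information is created.
            out_samples.append(rng.choice(pool))
            out_labels.append(cls)
    return out_samples, out_labels


# Toy data: four majority samples and one minority sample.
X = [[0.1], [0.2], [0.3], [0.4], [0.9]]
y = ["ND", "ND", "ND", "ND", "ASD"]
X_bal, y_bal = random_oversample(X, y)
balanced_counts = Counter(y_bal)
```

After balancing, every "ASD" row is a copy of the single original minority sample, which is why a classifier trained on such data can memorize it instead of generalizing.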

Generative models

The most representative models for generating synthetic data fall into three families: generative adversarial networks (GANs), diffusion models, and variational autoencoders (VAEs). GANs are a type of generative model in which two neural networks, a generator and a discriminator, compete against each other. Because GANs generate samples quickly and accurately, they have shown remarkable success in applications such as images, videos, and text [23]. Although GANs have attracted the attention of researchers owing to their performance and wide applicability, this success is largely limited to image data. For tabular data, which accounts for the largest proportion of data in the world, GANs are vulnerable to multi-modal distribution problems [25, 26]. To overcome this, TGAN and CTGAN, which are specialized for tabular data, have been developed [27]. CTGAN addresses continuous variables that follow non-Gaussian, multi-modal distributions and imbalanced categorical variables by proposing mode-specific normalization and training-by-sampling methods. In the training phase, the generator and discriminator are trained with conditional vectors and the WGAN-GP loss function to generate tabular data similar to real data [28]. However, the purpose of GAN-based models is not to learn the distribution of the data, but to generate synthetic data that are as similar to real data as possible.

Diffusion models are a class of generative models that generate samples starting from Gaussian noise, leveraging the concept of diffusion processes. Diffusion models have shown promising results in generating realistic high-dimensional data, but their suitability for tabular data is still an open area of research [24]. This study does not deal with unstructured high-dimensional data, so diffusion models are out of scope.

VAEs are another type of generative model consisting of encoder and decoder networks and a latent space. VAEs are well known for their ability to generate diverse synthetic data in terms of distribution, so they are regarded as a suitable technique for generating tabular datasets [24]. Because a VAE generates synthetic data from the distribution of real data learned through variational inference, it can build higher-quality datasets. The VAE generates synthetic data from random noise following a Gaussian distribution. Its variants include the $\beta$-VAE and the conditional VAE (CVAE) [29, 30]. $\beta$-VAE introduces a relative weight $\beta$ on the Kullback–Leibler (KL) divergence loss against the reconstruction loss in the VAE loss function [31]. Meanwhile, the CVAE adds conditional information, such as label values, to improve the embedding space learning. Wang et al. [32] demonstrated that CVAE outperforms $\beta$-VAE in terms of the reconstruction loss for the MNIST and Fashion MNIST datasets. As such, the algorithm-based methods adjust the loss function to concentrate on improving performance on the minority class.

Human action recognition (HAR)

Action recognition is an important task that has received considerable attention for decades due to its role in monitoring systems in health care and real-world applications [33, 34]. Existing studies are largely divided into RGB frame-based and human skeleton-based studies. Recently, skeleton-based HAR, which focuses on the positions of human joints instead of the RGB representation, has received considerable attention [34]. This is because joint positions provide action recognition that is robust to environmental noise (e.g., placement constraints, a variety of costumes) [35, 36]. However, since skeleton-based HAR methods extract skeleton information per frame, they have difficulty accounting for temporal information. Moreover, the same action can span different time intervals. In our study, rehabilitation exercise videos are used to extract action information according to the disability type.

In general, the existing literature offers three main approaches to skeleton-based action recognition: handcrafted feature-based [37], deep learning [38, 39], and pose estimation methods [40]. Handcrafted feature-based methods have limitations in obtaining joint relationships with spatial features. Meanwhile, spatial and temporal information can be obtained with graph neural network-based methods [41, 42]. However, these methods have limitations in recognizing repetitive actions over irregular time intervals, and they do not consider the differences in accuracy among the x-, y-, and z-axis values in three-dimensional (3D) space. These limitations show that accurately estimating human poses is difficult because the human posture has many degrees of freedom and is subject to occlusion [43]. Various HPE methods have been proposed to overcome these problems [44, 45]. HPE can be more accurate by predicting only the positions of the joints in each frame. Some estimators have also been developed to predict joint positions more accurately in fields where joints must be handled with care, such as physical medicine and rehabilitation [46, 47]. In our study, we use HPE with the proposed algorithm for rehabilitation exercise recognition.

Human pose estimation (HPE)

Recently, HPE, which estimates specific joint positions in the human body for HAR, has become one of the most important tasks in computer vision [48]. HPE is divided into two approaches: 2D and 3D pose estimation methods [49]. With recent advances in deep learning-based methods, 2D pose estimation techniques have shown high performance. For example, PoseNet and OpenPose are representative 2D HPE models [50, 51]. However, many real-world actions, such as fitness and yoga exercises, can be recognized more accurately by capturing them in 3D space than in 2D space. 3D pose estimation addresses the occlusion problem caused by the direction of the camera or environmental constraints. Recent studies on deep learning architectures have led to significant progress in 3D pose estimation, especially in lifting 3D poses from a single camera [52]. These methods require large-scale datasets to achieve the generalization capability needed to accurately estimate 3D poses [53]. Further, recent work prefers a single image as input for 3D HPE because it is a more accessible and convenient solution for real-world applications. 3D HPE contains two major approaches: the direct 3D estimation approach and the 2D-to-3D lifting approach. The former obtains 3D poses from 2D images or videos in an end-to-end manner, while the latter extracts 2D keypoints and lifts them to 3D [67‐69]. Several studies focus on input image normalization and camera calibration with semi-supervised learning to make the transformation more accurate [70, 71].

Meanwhile, three key requirements for HPE research are highlighted: (1) accurate pose estimation, (2) real-time processing, and (3) lightweight model architecture [72]. First, various studies focus on precisely predicting the positions and angles of keypoints. Second, HPE must be efficient even in real-time environments; with an optimized model structure and an efficient processing method, it can quickly estimate human poses from live video streams. Third, a lightweight model architecture requires few resources, making it suitable for deployment, portability, and usability in real-world scenarios. Therefore, regression-based models such as BlazePose are more suitable for on-device environments than heatmap-based models [54].

BlazePose is a 3D pose estimation model developed by Google that uses a detector-tracker machine learning pipeline. The detector determines the region of interest within the frame, that is, the person. To do so, it first detects the person's face, so the face detector serves as a proxy for a human detector [55]. It also predicts three additional alignment features: the midpoint of a person’s hip, the radius of the circle surrounding the entire person, and the inclination angle of the line connecting the shoulder and midpoint of the hip.

The BlazePose tracker predicts the presence of a person and the $(x, y, z)$ coordinates of 33 points on the human body using a new topology, shown in Fig. 1, which is a superset of the Common Objects in Context (COCO), BlazeFace, and BlazePalm topologies [56]. Unlike conventional approaches that rely on compute-intensive heatmap prediction, the pose estimation tracker of BlazePose uses a regression approach that is supervised by combined heatmap and offset predictions for all keypoints. During training, BlazePose uses the heatmap and offset losses to train a heatmap-based network, as shown in Fig. 2. After training the left and center towers, BlazePose removes the heatmap output and trains the regression-based network. Consequently, the model remains lightweight while still benefiting from the heatmap results.

Classification for tabular dataset

In this subsection, we review classification algorithms for identifying disability types from keypoint-based actions, which are typically represented as tabular datasets. Existing research points out that machine learning algorithms, such as ensemble methods, have outperformed deep learning models on tabular datasets [57]. Thus, we describe four competitive machine learning algorithms along with TabNet, a deep learning model specialized for tabular datasets [58].

The extreme gradient boosting (XGBoost) and light gradient boosting machine (LightGBM) are representative algorithms for tabular datasets. XGBoost is a model that is sequentially learned and ensembled by weighting the errors of weak decision trees. It performs robustly on classification and regression tasks [59]. Meanwhile, LightGBM copes with high-dimensional data through efficient learning by adopting the gradient-based one-side sampling and the exclusive feature bundling algorithms [60]. The random forest algorithm is a bootstrap aggregating (bagging)-based model that is used in tasks dealing with multivariate tabular datasets [61]. It deals with missing values and avoids overfitting by combining multiple decision trees with the random subspace method. Further, it finally obtains the predictions through a voting mechanism [62]. The support vector machine (SVM) algorithm is a popular discriminative model that utilizes decision boundaries using support vectors in the data space. SVM is effective in dealing with high-dimensional structured data because it uses the kernel trick [63].

TabNet, a tabular data-oriented deep learning algorithm, was developed by Google [58]. It has an autoencoder structure that combines an encoder and a decoder to enable supervised and unsupervised learning. As shown in Fig. 3, TabNet consists of three components in the encoder section, namely the feature transformer, attentive transformer, and mask, in one step, which offers the advantage of decision tree-based gradient boosting. The feature transformer, which performs the encoding for each step, consists of four layers, and each layer consists of three blocks in sequence: a fully connected (FC) layer, batch normalization, and a gated linear unit (GLU). Using these sequential attentions, TabNet improves accuracy.

Proposed method

This section presents an overview of the proposed method, which handles the data imbalance problem for rehabilitation exercise datasets (Fig. 4). First, we extract 33 keypoints from videos of a rowing rehabilitation exercise using HPE. Second, we perform data augmentation using the proposed model, STO-CVAE: rather than simply learning raw keypoints, we transform them into samples that reflect state transition-based human action and then generate action-based data from these samples.

Data collection using HPE

Human exercise posture depends on the flow of combinations of keypoints over time. Therefore, it is vital to evaluate a person’s athletic ability using a combination of keypoints.

Until recently, public datasets for keypoints did not exist in the field of exercises for disabled people. Therefore, we had to collect data directly on-device for this research. This study can serve as a reference for research on action recognition by disability type. We collected videos of rehabilitation exercises from the National Rehabilitation Center, which is the primary division supporting and providing welfare programs for disabled people in South Korea. This study focused on rowing exercises among the ten rehabilitation exercises performed at the center. In the rowing exercise, the patients extend their arms to the right and left sequentially, at a constant speed. We constructed a keypoint-based tabular dataset from the video data. Each frame in the dataset has 99 feature values, namely the 3D coordinates of 33 keypoints. Figure 5 shows the process of converting a video into a tabular dataset using HPE, which estimates 33 human keypoints at 30 frames/s.
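As an illustration of the tabular layout, the sketch below flattens one frame of 33 keypoints into a 99-value row. The interleaved $(x, y, z)$ column order and the helper name `flatten_frame` are assumptions made for this example; the source does not specify the column order.

```python
NUM_KEYPOINTS = 33  # keypoints estimated per frame


def flatten_frame(keypoints):
    """Turn 33 (x, y, z) tuples into one row of 99 feature values.
    Column order x0, y0, z0, x1, y1, z1, ... is an assumption."""
    if len(keypoints) != NUM_KEYPOINTS:
        raise ValueError("expected 33 keypoints per frame")
    row = []
    for x, y, z in keypoints:
        row.extend([x, y, z])
    return row


# Synthetic frame used only to show the resulting shape.
frame = [(0.01 * i, 0.02 * i, -0.03 * i) for i in range(NUM_KEYPOINTS)]
row = flatten_frame(frame)
```

At 30 frames/s, one second of video yields 30 such rows.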

The data for 19 people were collected and classified into seven classes based on the disability types: non-disabled (ND), cerebral palsy (CP), cerebral lesion (CL), spinal cord injury (SCI), muscular dystrophy (MD), intellectual disability (ID), and autism spectrum disorder (ASD). There was an extreme imbalance ratio according to class, as shown in Fig. 6. Specifically, the CL, ID, and ASD classes, with a ratio of less than 0.1, were considered minority classes, while the remaining classes were treated as majority classes.

Human action state-centric transformation

Figure 7 shows the process of generating action samples through the proposed state transition algorithm. The proposed algorithm can capture temporal patterns regardless of the number of frames in continuous multi-frames on video. Figure 4b shows the state transition steps from data to the state transition-oriented transformation. First, a frame (image) extracted from the video is used to estimate the topological skeleton of the human body via the HPE model. Then, the estimated skeleton is represented as several keypoints for major joints in the body, and each keypoint contains $x, y,$ and $z$ values. We recognize human action according to the estimated keypoints with sequential multiple frames because it is hard to recognize the action from a single frame.

Second, regarding robust action recognition, sequential frames should be divided, and a frame group should be combined according to the specific position state, which is an action element. For the rowing exercise, wrist keypoints in both hands were used to split the positional states. As the rowing exercise involves moving both hands up and down, we considered two states: 0 and 1. We identified each state by considering the keypoint of both wrists and the threshold $\gamma$, which is the average of keypoint y-values of the shoulder and hip. The formula for $\gamma$ is defined as follows:

$$\gamma =\frac{{y}_{{\text{left-shoulder}}}+{y}_{{\text{right-shoulder}}}+{y}_{{\text{left-hip}}}+{y}_{{\text{right-hip}}}}{4}.$$

(1)

In state 0, the average of the $y$-values of the two wrists is less than the threshold $\gamma$; otherwise, the frame is in state 1. Figure 8 illustrates frames of position states 0 and 1.
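The threshold in Eq. (1) and the resulting state assignment can be sketched as follows. The landmark indices follow the MediaPipe Pose numbering, which is assumed here to match the 33-keypoint topology used in the study. Assigning state 1 when the wrist average equals $\gamma$ is also a choice of this sketch.

```python
# MediaPipe Pose landmark indices (assumed to match the study's topology).
LEFT_SHOULDER, RIGHT_SHOULDER = 11, 12
LEFT_WRIST, RIGHT_WRIST = 15, 16
LEFT_HIP, RIGHT_HIP = 23, 24


def threshold_gamma(ys):
    """Eq. (1): mean y-value of both shoulders and both hips."""
    return (ys[LEFT_SHOULDER] + ys[RIGHT_SHOULDER]
            + ys[LEFT_HIP] + ys[RIGHT_HIP]) / 4.0


def position_state(ys):
    """Return 0 if the mean wrist y-value is below gamma, else 1."""
    wrist_mean = (ys[LEFT_WRIST] + ys[RIGHT_WRIST]) / 2.0
    return 0 if wrist_mean < threshold_gamma(ys) else 1


def make_frame(wrist_y):
    """Build the 33 y-values of a synthetic frame (illustrative data only)."""
    ys = [0.0] * 33
    ys[LEFT_SHOULDER] = ys[RIGHT_SHOULDER] = 0.25
    ys[LEFT_HIP] = ys[RIGHT_HIP] = 0.75
    ys[LEFT_WRIST] = ys[RIGHT_WRIST] = wrist_y
    return ys


gamma = threshold_gamma(make_frame(0.125))
state_a = position_state(make_frame(0.125))
state_b = position_state(make_frame(0.875))
```

Because the threshold is recomputed from each frame's own shoulders and hips, the state assignment does not depend on where the person sits in the image.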

Third, we recognize changes in position state in a way that is robust to the different motion transition times of each exerciser. Here, a state transition is proposed to transform keypoints into positional-state-based actions. These action-based states can reflect clearly different patterns according to the disability type. Figure 4b shows the process of extracting the representative frame for each frame group. If $k$ sequential frames form one group ${g}_{i}=\left\{{f}_{1}, \ldots , {f}_{k}\right\}$ and the following $l$ sequential frames form the next group ${g}_{i+1}=\left\{{f}_{k+1}, \ldots , {f}_{k+l}\right\}$, then ${r}_{{g}_{i}}= {f}_{\lfloor\frac{k}{2}\rfloor}$ and ${r}_{{g}_{i+1}}= {f}_{\lfloor k+\frac{l-1}{2}\rfloor}$ are the representative frames of ${g}_{i}$ and ${g}_{i+1}$, respectively, that is, a frame near the temporal middle of each group.
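The grouping step can be sketched as a run-length pass over the per-frame states, followed by picking a middle frame from each run. The rounding convention used here (0-based offset `(length - 1) // 2`) is an assumption of this sketch and may differ by one frame from the formula above.

```python
from itertools import groupby


def group_by_state(states):
    """Return (state, start_index, length) for each run of equal states."""
    groups, start = [], 0
    for state, run in groupby(states):
        length = len(list(run))
        groups.append((state, start, length))
        start += length
    return groups


def representative_index(start, length):
    """Index of a frame near the temporal middle of a group (assumed rounding)."""
    return start + (length - 1) // 2


# Per-frame position states for nine consecutive frames (toy data).
states = [0, 0, 0, 0, 1, 1, 1, 0, 0]
groups = group_by_state(states)
reps = [representative_index(start, length) for _, start, length in groups]
```

Groups may have any length, so exercisers who move slowly or quickly still produce one representative frame per state.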

Fourth, we calculate the difference between the representative frames of ${g}_{i+1}$ and ${g}_{i}$ to reflect the state transition-oriented topological information. Since the representative frames ${r}_{{g}_{i}}$ and ${r}_{{g}_{i+1}}$ of the frame groups are 99-dimensional vectors of keypoints, the new sample ${x}_{i}$ contains state transition-based keypoint values. The formula for ${x}_{i}$ is as follows:

$${x}_{i}= {r}_{{g}_{i+1}}-{r}_{{g}_{i}},$$

(2)

$${r}_{{g}_{i}}=\left({r}_{{g}_{i}, 1}, {r}_{{g}_{i}, 2}, \ldots ,{r}_{{g}_{i}, 99}\right)\in {\mathbb{R}}^{99},$$

(3)

where ${x}_{i}$ is the state transition between ${r}_{{g}_{i}}$ and ${r}_{{g}_{i+1}}$, which are the representative frame keypoint coordinates of the groups ${g}_{i}$ and ${g}_{i+1}$, respectively, and $i$ is the group index. In addition, a frame count variable is added to the sample ${x}_{i}$; it is defined as the sum of the number of frames in frame groups ${g}_{i}$ and ${g}_{i+1}$. Therefore, we can obtain a transformed sample ${x}_{i}$ that reflects the sequential information of exercises through state transition. The samples are divided into two actions: (i) arms from the head to the side of the hip and (ii) arms from the side of the hip to the head. The pseudocode of the state transition is presented in Algorithm 1.
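A minimal sketch of Eq. (2) with the extra frame-count variable is given below. It is not Algorithm 1 itself: the helper names, the placement of the frame count as the last element, and the middle-frame rounding are assumptions. Toy frames with 3 values stand in for the 99-value rows.

```python
def transition_sample(rep_a, rep_b, len_a, len_b):
    """Eq. (2): element-wise difference r_{g_{i+1}} - r_{g_i}, followed by
    the total number of frames in both groups (appended last by assumption)."""
    if len(rep_a) != len(rep_b):
        raise ValueError("representative frames must have equal length")
    diff = [b - a for a, b in zip(rep_a, rep_b)]
    return diff + [len_a + len_b]


def build_samples(frames, groups):
    """frames: per-frame feature rows; groups: (start, length) pairs in time order."""
    samples = []
    for (s0, n0), (s1, n1) in zip(groups, groups[1:]):
        r0 = frames[s0 + (n0 - 1) // 2]  # middle frame of group i
        r1 = frames[s1 + (n1 - 1) // 2]  # middle frame of group i + 1
        samples.append(transition_sample(r0, r1, n0, n1))
    return samples


# Nine toy frames whose values equal their frame index.
frames = [[float(t)] * 3 for t in range(9)]
groups = [(0, 4), (4, 3), (7, 2)]
samples = build_samples(frames, groups)
```

With real 99-value rows, each transformed sample would hold 100 values: 99 coordinate differences plus the frame count.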

State transition-oriented conditional variational autoencoder (STO-CVAE)

CVAE, which is an extension of VAE, utilizes class labels as a condition to more accurately reflect the class-dependent properties in generating synthetic data [21]. It differs from VAE in that it trains in consideration of certain conditions in encoding and decoding. CVAE aims to maximize the value of the marginal likelihood for the distribution model $p$ under the given conditions. The marginal likelihood is expressed as follows:

$${\text{log}}\,p(x)={D}_{KL}\left({q}_{\phi }\left(z|x, c\right)\,\Vert\, p\left(z|x, c\right)\right)+{\mathcal{L}}\left(\theta , \phi ; x, c\right),$$

(4)

where ${q}_{\phi }$ is the approximate posterior, $p(z|x, c)$ is the true posterior distribution of the latent variable $z$ given $x$ under condition $c$, and $c$ is the class label used as a condition of input data $x$. The second term ${\mathcal{L}}$ is the evidence lower bound (ELBO). Since the KL divergence is non-negative, the ELBO is a lower bound on the log marginal likelihood of $p$. Then, (4) changes as follows:

$${\text{log}}\,p(x)\ge {\mathcal{L}}\left(\theta , \phi ;x, c\right),$$

(5)

$$\begin{aligned} -{\mathcal{L}}\left(\theta , \phi ;x,c\right)&= -{\mathbb{E}}_{{q}_{\phi }\left(z|x,c\right)}\left[{\text{log}}\,\left({p}_{\theta }\left(x|z,c\right)\right)\right]\\ & \quad + {D}_{KL}\left({q}_{\phi }\left(z|x,c\right)\,\Vert\,{p}_{\theta }\left(z|c\right)\right).\end{aligned}$$

(6)

From (5), maximizing the marginal likelihood of $p$ is replaced by the problem of maximizing the ELBO, or equivalently minimizing the negative ELBO in Eq. (6). In Eq. (6), the gradient of the first term should be calculated using the Monte Carlo gradient method during backpropagation for CVAE training. However, the naive Monte Carlo gradient estimator is unsuitable because of its high variance. The CVAE overcomes this problem via the reparameterization trick, which draws a random variable $\varepsilon \sim {\mathcal{N}}(0,{\sigma }_{\varepsilon })$ from a Gaussian distribution and computes $z$ as a deterministic function of $\varepsilon$ instead of sampling $z\sim {q}_{\phi }\left(z|x, c\right)$ directly. As a result, the first term in (6) is the negative log-likelihood, which corresponds to the reconstruction error. The second term in (6) is the KL divergence regularization toward the prior distribution from which $z$ is sampled. Consequently, the loss can be rewritten as follows:

$${L}_{{\text{total}}}= {L}_{{\text{recon}}}+ {L}_{{\text{KL}}},$$

(7)

$${L}_{{\text{recon}}}=\sum_{j=1}^{n}{({x}_{i,j}-{\widehat{x}}_{i,j})}^{2},$$

(8)

$${L}_{{\text{KL}}}= \frac{1}{2}\sum_{j=1}^{l}\left({\mu }_{i,j}^{2}+ {\sigma }_{i,j}^{2}-{\text{ln}}\left({\sigma }_{i,j}^{2}\right)-1\right),$$

(9)

where $n$ is the dimension of the feature space of ${x}_{i}$, ${\widehat{x}}_{i}$ is the synthetic sample of ${x}_{i}$, ${\mu }_{i,j}$ and ${\sigma }_{i,j}$ are the $j$th elements of the mean and standard deviation of the latent distribution for sample $i$, and $l$ is the size of the latent vector $z$ sampled using $\varepsilon \sim {\mathcal{N}}(0,{\sigma }_{\varepsilon })$. Here, as shown in Eq. (8), we use the mean squared error instead of the binary cross entropy of the original CVAE for multiclass classification [64, 65].
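The loss in Eqs. (7)–(9) and the reparameterization step can be sketched in plain Python as follows. The reconstruction term is taken as a non-negative sum of squared errors, so minimizing it improves the reconstruction. Two choices are assumptions of this sketch, not details from the study: the encoder is parameterized by the log-variance, and the noise has unit variance.

```python
import math
import random


def reparameterize(mu, log_var, rng):
    """z = mu + sigma * eps with eps ~ N(0, 1) (unit variance assumed)."""
    return [m + math.exp(0.5 * lv) * rng.gauss(0.0, 1.0)
            for m, lv in zip(mu, log_var)]


def recon_loss(x, x_hat):
    """Eq. (8): sum of squared reconstruction errors over the features."""
    return sum((a - b) ** 2 for a, b in zip(x, x_hat))


def kl_loss(mu, log_var):
    """Eq. (9): closed-form KL divergence to a standard Gaussian,
    with sigma^2 = exp(log_var)."""
    return 0.5 * sum(m * m + math.exp(lv) - lv - 1.0
                     for m, lv in zip(mu, log_var))


def total_loss(x, x_hat, mu, log_var):
    """Eq. (7): reconstruction term plus KL term."""
    return recon_loss(x, x_hat) + kl_loss(mu, log_var)


rng = random.Random(0)
mu, log_var = [0.0, 0.0], [0.0, 0.0]
z = reparameterize(mu, log_var, rng)
loss = total_loss([1.0, 2.0], [1.0, 0.0], mu, log_var)
```

When the latent distribution already equals the standard Gaussian (zero mean, unit variance), the KL term vanishes and only the reconstruction error remains.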

Based on the state transition estimation, we converted the continuous multi-frame sequences estimated by BlazePose into action samples with temporal information. However, these estimates are uncertain owing to varying resolution, self-occlusion, and the complexity of the action. Figure 9 presents boxplots showing that the uncertainty of the z-axis difference values is significantly greater than that of the other axes. To mitigate this issue, we propose STO-CVAE, a data augmentation model that utilizes an action sample-based state transition algorithm. STO-CVAE incorporates relative weights based on the inter-axis uncertainty observed in 3D space. In STO-CVAE, the reconstruction loss term is reformulated for the state transition as follows. Here, we assign a smaller weight to the estimated $z$-axis values to reduce the side effect of the $z$-axis uncertainty of the keypoint estimations. The total loss of STO-CVAE is as follows:

$${L}_{{\text{total}}\_{\text{STO}}}= {L}_{{\text{recon}}\_{\text{STO}}}+ {L}_{{\text{KL}}},$$

(10)

$${L}_{{\text{recon}}\_{\text{STO}}}={w}_{z}\sum_{k=1}^{\left|{S}_{z}\right|}{({x}_{i,k} - {\widehat{x}}_{i,k})}^{2}+\left(1-{w}_{z}\right)\sum_{l=1}^{\left|{S}_{z-}\right|}{\left({x}_{i,l} - {\widehat{x}}_{i,l}\right)}^{2},$$

(11)

where ${x}_{i, k}=\Delta {r}_{i, k}$, ${S}_{z}=\left\{s \mid s \text{ is a keypoint value of the }z{\text{-axis}}\right\}$, and ${S}_{z-}=\left\{s \mid s\notin {S}_{z}\right\}$. The keypoint set $S$ is the union of ${S}_{z}$, the set of $z$-axis keypoint values, and ${S}_{z-}$, the set of $x$- and $y$-axis keypoint values ($S={S}_{z}\cup {S}_{z-}$, $|S| = |{S}_{z}| + |{S}_{z-}| = 33 + 66 =99$). ${w}_{z}$ is the weight for the differences in the $z$-axis keypoint values of ${x}_{i}$, and ${r}_{i, s}$ is the element $s$ ($s\in S$) of the representative frame of the group ${g}_{i}$. In Eq. (11), ${L}_{{\text{recon}}\_{\text{STO}}}$ is the loss term that assigns relative weights according to the uncertainty in estimating the $z$-axis values. We experimentally set ${w}_{z}$ to 0.3.
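The axis-weighted reconstruction loss of Eq. (11) can be sketched as below. The sketch assumes the 99 coordinate features are stored as interleaved $(x, y, z)$ triples, so every third value belongs to ${S}_{z}$. It covers only the coordinate values and leaves out the extra frame-count variable.

```python
def sto_recon_loss(x, x_hat, w_z=0.3):
    """Eq. (11) sketch: weight w_z on z-axis terms and (1 - w_z) on
    x- and y-axis terms. Assumes an interleaved (x, y, z) feature layout."""
    if len(x) != len(x_hat):
        raise ValueError("x and x_hat must have equal length")
    z_term, xy_term = 0.0, 0.0
    for idx, (a, b) in enumerate(zip(x, x_hat)):
        squared_error = (a - b) ** 2
        if idx % 3 == 2:      # z-axis value (member of S_z)
            z_term += squared_error
        else:                 # x- or y-axis value (member of S_z-)
            xy_term += squared_error
    return w_z * z_term + (1.0 - w_z) * xy_term


# Toy sample: every coordinate is off by exactly 1.
x = [0.0] * 99
x_hat = [1.0] * 99
loss_sto = sto_recon_loss(x, x_hat)             # 0.3 * 33 + 0.7 * 66
loss_equal = sto_recon_loss(x, x_hat, w_z=0.5)  # equal weights for comparison
```

With ${w}_{z}=0.3$, an error on a $z$ value costs less than half as much as the same error on an $x$ or $y$ value, so noisy depth estimates pull less on the decoder.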

Class labels were used as condition $c$, for the seven classes (ND, CP, CL, SCI, MD, ID, and ASD). We defined the condition by replacing the string with a numerical value as follows: $\text{``ND''}\to 1/7$, ${\text{``CP''}}\to 2/7$, ${\text{``CL''}}\to 3/7$, ${\text{``SCI''}}\to 4/7$, ${\text{``MD''}}\to 5/7$, ${\text{``ID''}}\to 6/7$, and ${\text{``ASD''}}\to 7/7$.
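The label-to-condition mapping can be written as a small lookup table. Concatenating the scalar condition to the feature vector, as `with_condition` does, is an assumption of this sketch about how $c$ might be fed to the network; the source does not state it.

```python
CLASSES = ["ND", "CP", "CL", "SCI", "MD", "ID", "ASD"]

# "ND" -> 1/7, "CP" -> 2/7, ..., "ASD" -> 7/7
CONDITION = {name: (rank + 1) / len(CLASSES)
             for rank, name in enumerate(CLASSES)}


def with_condition(sample, label):
    """Append the scalar condition c to a feature vector (assumed input format)."""
    return list(sample) + [CONDITION[label]]


conditioned = with_condition([0.5, -0.25], "SCI")
```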

Four layers were stacked in the encoder and decoder to build the STO-CVAE. Each layer comprised three components: a dense layer, a batch normalization layer, and a dropout layer. The leaky ReLU activation function was used in each layer, except for the last output layer, where we used a hyperbolic tangent because our dataset values were continuous between −1 and 1. The pseudocode of the STO-CVAE training is presented in Algorithm 2.

Experiments and results

In the experiments, we used only directly collected datasets because no public data are available for rehabilitation exercises according to the disability type. For data augmentation based on state transitions, we first divided the dataset into four subsets based on actions to improve the data generation performance. To evaluate the performance of STO-CVAE, a total of 100 test samples, representing all seven disability type classes, were generated using the state transition algorithm. The quality of the synthetic samples was assessed based on the improvement in classification performance. Additionally, a sensitivity analysis of STO-CVAE training was conducted to verify the consistent convergence of data augmentation for the four actions. We used two state-of-the-art upsampling-based augmentation methods to determine the number of augmented samples per class. As for the classifiers, we used five major machine learning algorithms: random forest, SVM, XGBoost, LightGBM, and TabNet [57]. We compared the accuracy and F1-score before and after data augmentation for all combinations.

Dividing the dataset based on state transitions

Each sample representing the action state has different modalities in the feature space. Principal component analysis (PCA) shows that the samples form four distinct clusters in two principal components, as shown in Fig. 10. These clusters correspond to four actions in the rowing exercise: (i) arms down left; (ii) arms up left to right middle; (iii) arms down right; and (iv) arms up right to left middle. Thus, to enhance the performance of data generation, we divided the dataset into four subsets according to the exercise direction and position state: position state 0, right (P0_R); position state 0, left (P0_L); position state 1, right (P1_R); and position state 1, left (P1_L).
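The four-way split can be illustrated with a simple grouping step. The record fields `position_state` and `direction` are hypothetical names used only for this sketch; the source does not specify how samples are stored.

```python
from collections import defaultdict

SUBSETS = ("P0_R", "P0_L", "P1_R", "P1_L")


def subset_key(position_state, direction):
    """Build a subset name such as 'P0_R' from a position state and a direction."""
    side = {"right": "R", "left": "L"}[direction]
    return f"P{position_state}_{side}"


def split_by_state(samples):
    """Group action samples by (position state, direction)."""
    groups = defaultdict(list)
    for sample in samples:
        key = subset_key(sample["position_state"], sample["direction"])
        groups[key].append(sample)
    return dict(groups)
```

Each resulting group would then be used to train its own generative model, matching the four models reported in the sensitivity analysis.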

Sensitivity analysis for STO-CVAE

We conducted a grid search to tune the hyperparameters and set a learning rate of 0.0001, a batch size of 256, and a dropout ratio of 0.5; we used the Adam optimizer with ${\beta }_{1}=0.9$ and ${\beta }_{2}=0.99$. We then conducted a sensitivity analysis to identify the optimal sampling hyperparameter $\varepsilon$ for the STO-CVAE. We set the sampling standard deviation $\varepsilon$ according to the classification accuracy for the minority classes with augmented data, because it is important both to lower the loss of the generative model and to generate useful synthetic data for the minority classes. As shown in Fig. 11 and Table 1, we built four STO-CVAE generative models, and for a given standard deviation, all trained STO-CVAEs showed similar convergence in terms of the reconstruction error and KL loss.
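One way to read the sampling standard deviation $\varepsilon$ is as the spread of the noise used in the reparameterization step. The sketch below follows that reading; it is an assumption, since the exact sampling rule is not restated in this section, and `sample_latent` is a hypothetical helper name.

```python
import math
import random


def sample_latent(mu, log_var, eps_std, rng):
    """Draw z = mu + sigma * eps with eps ~ N(0, eps_std^2) (assumed sampling rule)."""
    return [
        m + math.exp(0.5 * lv) * rng.gauss(0.0, eps_std)
        for m, lv in zip(mu, log_var)
    ]


# Candidate values taken from Table 1
CANDIDATE_EPS = (0.05, 0.1, 0.2, 0.3)
```

Under this reading, a smaller $\varepsilon$ keeps generated samples closer to the encoded mean, while a larger one adds more variation to the samples.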

Table 1

Sensitivity analysis results of STO-CVAE according to the sampling standard deviation $\varepsilon$

Sampling standard deviation	Dataset	Train epoch	Reconstruction loss	KL loss
0.05	P0_L	25704	0.0839	0.0318
0.05	P0_R	17191	0.0920	0.0385
0.05	P1_L	17904	0.1050	0.0347
0.05	P1_R	19085	0.0959	0.0350
0.1	P0_L	17393	0.1200	0.0592
0.1	P0_R	10996	0.1300	0.0571
0.1	P1_L	7059	0.2760	0.1560
0.1	P1_R	15995	0.1390	0.0575
0.2	P0_L	8693	0.1980	0.0937
0.2	P0_R	7569	0.2040	0.0861
0.2	P1_L	6237	0.2140	0.0957
0.2	P1_R	9030	0.1970	0.0946
0.3	P0_L	9661	0.36687	0.1222
0.3	P0_R	10399	0.35275	0.1154
0.3	P1_L	9150	0.38619	0.1241
0.3	P1_R	11421	0.37454	0.1424

Data augmentation with STO-CVAE

Two recent upsampling methods were then considered to compare data augmentation performance. The first and second settings were the multiclass ($M$) and balanced multiclass (${\text{BM}}$) methods, respectively [22]. $M$ generates samples for each class in proportion to that class's share of the original samples. In contrast, ${\text{BM}}$ generates more samples for classes with a smaller share, that is, for the minority classes. If the class of the samples to be generated is $i$ and the number of samples to be generated is ${L}_{i}$, then each method is defined as follows:

$${\text{Setting }} 1.\ {L}_{i}=m\cdot {N}_{t}\quad {\text{for }} M,$$

$${\text{Setting }} 2.\ {L}_{i}=\left[\frac{1-m}{n-1}\right]\cdot {N}_{t}\quad\text{for BM},$$

where $m$ is the proportion of class $i$ in the original data, $n$ is the number of classes, and ${N}_{t}$ is the total number of generated samples.
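The two allocation rules can be compared directly in code. This is a sketch under stated assumptions: the function name is hypothetical, and rounding to the nearest integer is an assumption, since the bracket in Setting 2 does not say which rounding is used.

```python
def allocate(counts, n_total, method):
    """Number of synthetic samples per class under the M or BM setting.

    counts  : mapping of class label -> number of original samples
    n_total : N_t, the total number of samples to generate
    method  : "M" (proportional) or "BM" (balanced multiclass)
    """
    if method not in ("M", "BM"):
        raise ValueError("method must be 'M' or 'BM'")
    total = sum(counts.values())
    n = len(counts)
    result = {}
    for label, count in counts.items():
        m = count / total  # proportion of this class in the original data
        share = m if method == "M" else (1.0 - m) / (n - 1)
        result[label] = round(share * n_total)  # rounding rule is an assumption
    return result
```

Because the BM shares $(1-m)/(n-1)$ sum to 1 over all classes, both settings generate roughly ${N}_{t}$ samples in total; they differ only in how that total is divided between majority and minority classes.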

The base sample of our dataset had a severe class imbalance problem. Table 2 lists the number of samples per class obtained by applying settings 1 and 2 of the data augmentation methods. In setting 1, the augmented samples of each class were generated in proportion to the class ratio of the base sample using the $M$ method. Consequently, a large number of majority class samples were generated, whereas a small number of minority class samples were generated, and there were even classes for which no samples were generated. In setting 2, the ${\text{BM}}$ method was used, so a relatively large share of the generated samples belonged to the minority classes and a small share belonged to the majority classes.

Table 2

Sample number results per class after applying data augmentation methods: settings 1 and 2

Class	Base sample (before augmentation)	Setting 1 ($M$) (after augmentation)	Setting 2 (${\text{BM}}$) (after augmentation)
ND	593	1182 (+ 589)	871 (+ 278)
CP	799	1043 (+ 244)	1592 (+ 793)
CL	18	33 (+ 15)	392 (+ 374)
SCI	354	710 (+ 356)	671 (+ 317)
MD	481	961 (+ 480)	796 (+ 315)
ID	11	21 (+ 10)	386 (+ 375)
ASD	16	31 (+ 15)	390 (+ 374)

Metrics

We used the following four evaluation indicators to evaluate the classification model’s performance:

accuracy, which is defined as the number of correctly classified data examples divided by the total number of data examples in the dataset;

precision, which is defined for a class as the number of true positives divided by the total number of model predictions that belong to the positive class;

recall, which is defined for a class as the number of true positives divided by the total number of elements labeled as belonging to the positive class;

F1-score, which is defined as the harmonic mean of the precision and recall metrics; the F1-score has a higher value when the precision and recall metrics are similar.
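The four indicators can be computed from label lists as sketched below. Averaging the per-class values with class-frequency weights is an assumption made for this illustration; the section does not state which multiclass averaging was used.

```python
from collections import Counter


def classification_report(y_true, y_pred):
    """Accuracy plus support-weighted precision, recall, and F1 for multiclass labels."""
    labels = sorted(set(y_true) | set(y_pred))
    support = Counter(y_true)      # true samples per class
    predicted = Counter(y_pred)    # predictions per class
    tp = Counter(t for t, p in zip(y_true, y_pred) if t == p)
    n = len(y_true)

    precision = recall = f1 = 0.0
    for c in labels:
        p_c = tp[c] / predicted[c] if predicted[c] else 0.0
        r_c = tp[c] / support[c] if support[c] else 0.0
        f_c = 2 * p_c * r_c / (p_c + r_c) if (p_c + r_c) else 0.0
        weight = support[c] / n    # weighted averaging (assumption)
        precision += weight * p_c
        recall += weight * r_c
        f1 += weight * f_c

    return {
        "accuracy": sum(tp.values()) / n,
        "precision": precision,
        "recall": recall,
        "f1": f1,
    }
```

With support-weighted averaging, the weighted recall always equals the accuracy, since both reduce to the number of correct predictions divided by the number of samples.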

Classifiers

Five classifiers were used to evaluate the improvement in classification performance before and after data augmentation using the proposed method. We tuned the hyperparameters separately before and after data augmentation because the training dataset was changed by data augmentation. For hyperparameter tuning, cross-validation-based Bayesian optimization was used for all classification models [66].

Hyperparameter tuning for each classifier is described below. The number and depth of trees were adjusted for the random forest algorithm. The floating point, kernel coefficient, and kernel type were adjusted for the SVM algorithm. For XGBoost, the maximum depth, the minimum sum of the instance weights required for the child, sub-sample ratio, and learning rate were adjusted. For LightGBM, the maximum number of leaves of the tree, depth of the tree, learning rate, and normalization conditions were adjusted. For TabNet, hyperparameters such as the dimension of the prediction layer, dimension of the attention layer, number of decision steps, number of shared GLU layers of the feature transformer, and number of unshared GLU layers were adjusted.

For model performance evaluation, we trained all classifiers with tenfold cross-validation (CV). In training, the performance was evaluated as the average of the ten accuracy values and ten F1-scores obtained from the ten validation sets. In addition, the generalization performance of the classifiers was estimated from the standard deviations of the F1-score and accuracy.
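The tenfold protocol can be sketched as an index splitter plus a summary step. The helper names are hypothetical, and the use of the sample standard deviation is an assumption; the section does not say which estimator was reported.

```python
import random
import statistics


def kfold_indices(n, k=10, seed=0):
    """Yield (train, validation) index lists for k-fold cross-validation."""
    idx = list(range(n))
    random.Random(seed).shuffle(idx)
    folds = [idx[i::k] for i in range(k)]  # k roughly equal folds
    for i in range(k):
        validation = folds[i]
        train = [j for fold in folds[:i] + folds[i + 1:] for j in fold]
        yield train, validation


def summarize(scores):
    """Mean and sample standard deviation of the per-fold scores."""
    return statistics.mean(scores), statistics.stdev(scores)
```

A classifier would be fitted on each `train` split and scored on the matching `validation` split; `summarize` then gives the pair of values reported per model, for example the mean F1-score with its deviation in parentheses.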

Comparison of data augmentation methods

We conducted comparative studies with other generative models to evaluate the data generation performance of STO-CVAE. The compared models included VAE, CVAE, and CTGAN. The number of augmented samples per class was determined using BM sampling for all generative models. Table 3 demonstrates significant performance improvement across five classifiers when data augmentation was performed using STO-CVAE.

Table 3

F1-score results of classification models before and after applying each data augmentation method

Model	Before data augmentation	VAE	CVAE	CTGAN	STO-CVAE
Random forest	0.441	0.274	0.482	0.468	0.606
SVM	0.250	0.510	0.550	0.329	0.571
XGBoost	0.416	0.168	0.570	0.189	0.591
LightGBM	0.554	0.394	0.600	0.514	0.620
TabNet	0.498	0.519	0.610	0.395	0.649

Bold indicates the best performance among the compared algorithms

We used VAE, CVAE, CTGAN, and STO-CVAE for data augmentation

Augmented action-centric sample using STO-CVAE

To confirm the quality of the synthetic samples in terms of statistical characteristics, we conducted a Wilcoxon test to evaluate whether the population mean ranks of the original and synthetic samples differed [73]. As shown in Table 4, the differences between the original and synthetic samples were not statistically significant. In the Wilcoxon tests, p values $\ge 0.01$ indicated that the synthetic samples generated by STO-CVAE are not significantly different from the original samples, including those in the minority classes (CL, ID, and ASD). Additionally, STO-CVAE, which is a generative model focused on $z$-axis uncertainty, is robust to $z$-axis keypoint distribution estimation.

Table 4

Results of the Wilcoxon rank-sum test (significance level $\alpha$ = 0.01) for verifying the significance of the synthetic samples generated by STO-CVAE

Class	Type	y value for right ear	x value for left elbow	z value for right shoulder	x value for left shoulder
ND	Statistic	83,535.0	81,463.0	87,339.0	83,823.0
ND	p value	0.278	0.114	0.863	0.310
CP	Statistic	155,871.0	156,787.0	146,655.0	157,914.0
CP	p value	0.547	0.645	0.044	0.773
CL	Statistic	31,231.0	34,228.0	33,776.0	31,810.0
CL	p value	0.067	0.690	0.539	0.120
SCI	Statistic	28,834.0	31,062.0	33,062.6	29,851.0
SCI	p value	0.180	0.854	0.681	0.416
MD	Statistic	53,831.0	54,280.0	56,851.0	53,782.0
MD	p value	0.176	0.228	0.716	0.171
ID	Statistic	34,281.0	31,746.0	31,308.0	31,261.0
ID	p value	0.645	0.095	0.061	0.058
ASD	Statistic	33,696.0	34,758.0	30,257.0	31,999.0
ASD	p value	0.214	0.884	0.022	0.143
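A rank-sum comparison of original and synthetic values for one keypoint feature can be sketched as follows. This is an illustrative pure-Python version using the normal approximation without a tie correction in the variance; it is not claimed to be the routine that produced Table 4, and the function names are hypothetical.

```python
import math


def average_ranks(values):
    """Return 1-based ranks, giving tied values the average of their ranks."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        avg = (i + j) / 2 + 1
        for k in range(i, j + 1):
            ranks[order[k]] = avg
        i = j + 1
    return ranks


def rank_sum_test(original, synthetic):
    """Wilcoxon rank-sum test: rank sum W, z score, and two-sided p value."""
    n1, n2 = len(original), len(synthetic)
    ranks = average_ranks(list(original) + list(synthetic))
    w = sum(ranks[:n1])                       # rank sum of the first sample
    mean = n1 * (n1 + n2 + 1) / 2
    sd = math.sqrt(n1 * n2 * (n1 + n2 + 1) / 12)
    z = (w - mean) / sd
    p = math.erfc(abs(z) / math.sqrt(2))      # two-sided normal approximation
    return w, z, p


ALPHA = 0.01  # significance level used in Table 4
```

A feature would be treated as not significantly different when the returned p value is at least `ALPHA`.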

The proposed STO-CVAE can generate action-centric samples. Figure 12 shows synthetic samples for the right wrist $x$-values and left index $y$-values among the 99 keypoint values. STO-CVAE generated samples according to the distribution of each state transition per class. For the left index $y$-value of class CL, there was only one original sample in position state 1 (left); thus, the synthetic samples were generated from that single sample using the ${\text{BM}}$ method. These varied samples improved classification performance.

Figure 13 shows the distribution of the average $y$-axis values on the left and right wrists for both real-world and augmented samples. The distribution is unimodal, with a high density on the left side next to the hip. Moreover, as shown in Fig. 13, both the actual and augmented samples demonstrate that the distribution of the same action differs according to the disability type. We verified the distribution similarity for all seven disability classes and all behaviors.

Classification results after data augmentation with STO-CVAE

Table 5 presents the training results for each classification model before and after data augmentation using STO-CVAE. In training, all classifiers trained with augmentation outperformed the models trained without augmentation in terms of the F1-score and accuracy. A comparison of the two sampling methods showed that the synthetic samples generated with setting 1 ($M$) and setting 2 (${\text{BM}}$) resulted in a similar overall performance in terms of the F1-score and accuracy.

Table 5

Training results for classification models before and after data augmentation with setting 1 ($M$) and setting 2 (${\text{BM}}$)

Model	Measure (%) (standard deviation over tenfold CV)	Base sample (before augmentation)	Setting 1 ($M$)	Setting 2 (${\text{BM}}$)
Random Forest	F1-score	93.769 (1.960)	97.304 (0.640)	97.075 (0.620)
Random Forest	Accuracy	94.497 (1.723)	97.263 (0.515)	97.196 (0.698)
SVM	F1-score	90.497 (2.111)	94.772 (0.866)	95.943 (0.990)
SVM	Accuracy	91.198 (1.918)	95.320 (0.776)	95.960 (1.001)
XGBoost	F1-score	93.955 (2.428)	97.864 (0.571)	97.841 (0.619)
XGBoost	Accuracy	94.541 (2.207)	97.903 (0.059)	97.836 (0.622)
LightGBM	F1-score	95.303 (1.899)	97.047 (0.600)	97.797 (0.754)
LightGBM	Accuracy	95.510 (1.772)	97.020 (0.534)	97.792 (0.758)
TabNet	F1-score	97.212 (1.644)	99.862 (0.144)	99.901 (0.085)
TabNet	Accuracy	99.733 (0.119)	99.831 (0.172)	99.917 (0.078)

Bold indicates the best performance among the compared algorithms

Deviations from tenfold cross validation are in parentheses

Table 6 demonstrates that the augmented samples improve the evaluation indicators, including the F1-score, for all sampling methods. The SVM shows the largest F1-score improvement, from 0.25 to 0.571. TabNet did not have the highest F1-score before data augmentation, but it had the highest F1-score among all classifiers after data augmentation (setting 2). Figures 14, 15, 16, 17 and 18 show the confusion matrices of the five classifiers (random forest, XGBoost, SVM, LightGBM, and TabNet) before and after data augmentation. Overall, the prediction accuracy of most classes was improved; in particular, the prediction accuracy of the minority classes was greatly improved.

Table 6

Results for classification models before and after employing the data augmentation sampling methods

Model	Measure	Base (before augmentation)	Setting 1 ($M$)	Setting 2 (${\text{BM}}$)
Random forest	Precision	0.539	0.566	0.730
Random forest	Recall	0.475	0.503	0.620
Random forest	F1-score	0.441	0.478	0.606
Random forest	Accuracy	0.475	0.503	0.620
SVM	Precision	0.310	0.556	0.670
SVM	Recall	0.300	0.459	0.566
SVM	F1-score	0.250	0.411	0.571
SVM	Accuracy	0.300	0.459	0.566
XGBoost	Precision	0.555	0.713	0.685
XGBoost	Recall	0.454	0.637	0.610
XGBoost	F1-score	0.416	0.620	0.591
XGBoost	Accuracy	0.454	0.637	0.610
LightGBM	Precision	0.626	0.625	0.698
LightGBM	Recall	0.563	0.574	0.627
LightGBM	F1-score	0.554	0.578	0.620
LightGBM	Accuracy	0.563	0.571	0.627
TabNet	Precision	0.616	0.758	0.730
TabNet	Recall	0.521	0.643	0.671
TabNet	F1-score	0.506	0.619	0.649
TabNet	Accuracy	0.521	0.643	0.671

Bold indicates the best performance among the compared algorithms

In Fig. 14, the accuracy of the CL and ID classes was 0 before data augmentation. After data augmentation, their accuracy increased to 0.61 and 0.23, respectively. The accuracy of the minority classes improved notably, particularly for ASD, which increased from 0.23 to 0.83. As can be seen in Fig. 15, XGBoost showed a slight overall improvement in accuracy for all classes compared to random forest. In addition, Figs. 16, 17 and 18 show that, for the SVM, LightGBM, and TabNet models, the accuracy improvement was greater for the minority classes than for the majority classes. Naturally, it is difficult to improve accuracy by relying solely on the addition of the synthetic samples proposed in this study. The main focus of this study is to generate robust synthetic samples from the small number of observed samples in the minority classes. Importantly, the proposed approach consistently demonstrated improved accuracy across all combinations of experiments.
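Per-class accuracy values of the kind quoted above can be read off a confusion matrix as the diagonal entry divided by the row total. The sketch assumes that rows index the true class and columns the predicted class, which is not stated in this section; the function name is hypothetical.

```python
def per_class_accuracy(confusion, labels):
    """Fraction of each true class that was predicted correctly.

    confusion[i][j] counts samples of true class i predicted as class j
    (row/column orientation is an assumption).
    """
    result = {}
    for i, label in enumerate(labels):
        row_total = sum(confusion[i])
        result[label] = confusion[i][i] / row_total if row_total else 0.0
    return result
```

A class with no correct predictions, as reported for CL and ID before augmentation, yields 0 under this definition.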

Conclusion

We proposed a data augmentation method called STO-CVAE for disability classification based on HPE keypoints. Our model uses state transitions to generate synthetic data to alleviate the class imbalance problem. In this method, we transform multiple frames that include human skeleton keypoints into an action according to the state transition. The sampling based on the proposed state transitions reduces the side effects of the uncertainty of the keypoint estimation in HPE and increases the learning efficiency of the generative model. Through this transformation, we avoided using a complex backbone for representation learning. Further, we examined several state-of-the-art data sampling approaches for the action-oriented samples by varying the ratios for each class and demonstrated the effectiveness of data augmentation using comparative experiments.

Regarding the implications of this research, the proposed STO-CVAE can be used to improve the accuracy of disability classification. An accurate disability classifier can help in providing suitable personal training and rehabilitation exercise programs. We expect that a fully customized AI trainer based on our approach can guide and recommend optimized exercises to individuals with disabilities. Furthermore, it can quickly recognize emergency stop situations according to the disability type during exercise. For example, a significant increase in heart rate may be dangerous for people with certain disability types.

In future work, we plan to develop a model that learns keypoints by reflecting the structure of the human body. This is necessary because when a person exercises, the direction of movement is different for each keypoint, and keypoints in the same area have a high probability of moving in the same direction. In addition, we will extend our method to all rehabilitation exercises rather than a single one. In this study, the heuristic data transformation was limited to a specific exercise. Therefore, a state transition based on the keypoints' distribution for repetitive actions can further increase the model scalability.

Open Access This article is licensed under a Creative Commons Attribution 4.0 International License, which permits use, sharing, adaptation, distribution and reproduction in any medium or format, as long as you give appropriate credit to the original author(s) and the source, provide a link to the Creative Commons licence, and indicate if changes were made. The images or other third party material in this article are included in the article's Creative Commons licence, unless indicated otherwise in a credit line to the material. If material is not included in the article's Creative Commons licence and your intended use is not permitted by statutory regulation or exceeds the permitted use, you will need to obtain permission directly from the copyright holder. To view a copy of this licence, visit http://creativecommons.org/licenses/by/4.0/.

Publisher's Note

Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.

Khowaja SA et al (2020) Context-aware personalized human activity recognition using associative learning in smart environments. Hum Centric Comput Inf Sci 10(1):1–35. https://doi.org/10.1186/s13673-020-00240-yCrossRef

Mantey EA et al (2022) Maintaining privacy for a recommender system diagnosis using blockchain and deep learning. Hum Centric Comput Inf Sci 13

Bennett CL, Keyes O (2020) What is the point of fairness? Disability, AI and the complexity of justice. In: ACM SIGACCESS accessibility and computing, vol 125, p 1. https://doi.org/10.11425/3386296.3386301

Guo Y et al (2021) Evolutionary dual-ensemble class imbalance learning for human activity recognition. IEEE Trans Emerg Top Comput Intell 6(4):728–739. https://doi.org/10.1109/TETCI.2021.3079966CrossRef

Huang C et al (2019) Deep imbalanced learning for face recognition and attribute prediction. IEEE Trans Pattern Anal Mach Intell 42(11):2781–2794. https://doi.org/10.1109/TPAMI.2019.2914680CrossRefPubMed

Lepcha DC et al (2022) Multimodal medical image fusion based on pixel significance using anisotropic diffusion and cross bilateral filter. Hum Centric Comput Inf Sci. https://doi.org/10.22967/HCIS.2022.12.015CrossRef

Kim J-W, Hong G-W, Chang H (2021) Voice recognition and document classification-based data analysis for voice phishing detection. Hum Centric Comput Inf Sci. https://doi.org/10.22967/HCIS.2021.11.002CrossRef

Buda M, Maki A, Mazurowski MA (2022) A systematic study of the class imbalance problem in convolutional neural networks. Neural Netw 106:249–259. https://doi.org/10.1016/j.neunet.2018.07.011CrossRef

Ghorbani M et al (2022) RA-GCN: graph convolutional network for disease prediction problems with imbalanced data. Med Image Anal 75:102272. https://doi.org/10.1016/j.media.2021.102272CrossRefPubMed

10.

Yao L, Yang W, Huang W (2020) A data augmentation method for human action recognition using dense joint motion images. Appl Soft Comput 97:106713. https://doi.org/10.1016/j.asoc.2020.106713CrossRef

11.

Hamad RA et al (2020) Joint learning of temporal models to handle imbalanced data for human activity recognition. Appl Sci 10(15):5293. https://doi.org/10.3390/app10155293CrossRef

12.

Mehmood F, Chen E, Akbar MA, Alsanad AA (2021) Human action recognition of spatiotemporal parameters for skeleton sequences using MTLN feature learning framework. Electronics 10(21):2708CrossRef

13.

Liu M, Liu H, Chen C (2017) Enhanced skeleton visualization for view invariant human action recognition. Pattern Recognit 68:346–362ADSCrossRef

14.

Tarawneh AS, Hassanat AB, Altarawneh GA, Almuhaimeed A (2022) Stop oversampling for class imbalance learning: a review. IEEE Access 10:47643–47660CrossRef

15.

Bach M, Werner A, Palt M (2019) The proposal of undersampling method for learning from imbalanced datasets. Procedia Comput Sci 159:125–134. https://doi.org/10.1016/j.procs.2019.09.167CrossRef

16.

Mohammed R, Rawashdeh J, Abdullah M (2020) Machine learning with oversampling and undersampling techniques: overview study and experimental results. In: 11th international conference on information and communication systems (ICICS). Jordan, IEEE, 2020, pp 243–248

17.

Elhassan T, Aljurf M (2016) Classification of imbalance data using Tomek link (T-link) combined with random under-sampling (RUS) as a data reduction method. Glob J Technol Optim S 1

18.

Hasib KMd et al (2020) A survey of methods for managing the classification and solution of data imbalance problem 16:1546–1557. https://doi.org/10.3844/jcssp.2020.1546.1557. arXiv preprint. arXiv:2012.11870

19.

Bao Y, Yang S (2023) Two novel SMOTE methods for solving imbalanced classification problems. IEEE Access 11:5816–5823CrossRef

20.

Sharma S, Gosain A, Jain S (2022) A review of the oversampling techniques in class imbalance problem. In: International conference on innovative computing and communications: proceedings of ICICC 2021, vol 1. Springer Singapore, Singapore, pp 459–472

21.

Wei G, Mu W, Song Y, Dou J (2022) An improved and random synthetic minority oversampling technique for imbalanced data. Knowl Based Syst 248:108839CrossRef

22.

Moreno-Barea FJ, Jerez JM, Franco L (2020) Improving classification accuracy using data augmentation on small data sets. Expert Syst Appl. https://doi.org/10.1016/j.eswa.2020.113696CrossRef

23.

Brophy E, Wang Z, She Q, Ward T (2023) Generative adversarial networks in time series: a systematic literature review. ACM Comput Surv 55(10):1–31CrossRef

24.

Croitoru FA, Hondru V, Ionescu RT, Shah M (2023) Diffusion models in vision: a survey. IEEE Trans Pattern Anal Mach Intell 45:10850–10869CrossRefPubMed

25.

Razghandi M, Zhou H, Erol-Kantarci M, Turgut D (2022) Variational autoencoder generative adversarial network for Synthetic Data Generation in smart home. In: ICC 2022-IEEE international conference on communications. IEEE, Korea, pp 4781–4786

26.

Ye H, Zhu Q, Yao Y, Jin Y, Zhang D (2022) Pairwise feature-based generative adversarial network for incomplete multi-modal Alzheimer’s disease diagnosis. Vis Comput 39(6):2235–2244CrossRef

27.

Gueye M, Attabi Y, Dumas M (2023) Row conditional-TGAN for generating synthetic relational databases. In: ICASSP 2023–2023 IEEE international conference on acoustics, speech and signal processing (ICASSP). IEEE, Greece, pp 1–5

28.

Habibi O, Chemmakha M, Lazaar M (2023) Imbalanced tabular data modelization using CTGAN and machine learning to improve IoT Botnet attacks detection. Eng Appl Artif Intell 118:105669CrossRef

29.

Liu C et al (2022) Intrusion detection system after data augmentation schemes based on the VAE and CVAE. IEEE Trans Reliab 71:1000–1010CrossRef

30.

Zhou L, Deng W, Wu X (2020) Unsupervised anomaly localization using VAE and beta-VAE. https://doi.org/10.48550/arXiv.2005.10686. arXiv preprint. arXiv:2005.10686

31.

Li J et al (2022) Training β-VAE by aggregating a learned Gaussian posterior with a decoupled decoder. https://doi.org/10.48550/arXiv.2209.14783. arXiv preprint. arXiv:2209.14783

32.

Wang A, Blair N, Belkhale S (2019) Encouraging categorical meaning in the latent space of a VAE. https://www.nathanblair.me/pdfs/Encouraging_categorical_meaning_in_thelatent_space_of_a_VAE.pdf

33.

Kong Y, Fu Y (2022) Human action recognition and prediction: a survey. Int J Comput Vis 130(5):1366–1401CrossRef

34.

Zhang P, Lan C, Zeng W, Xing J, Xue J, Zheng N (2020) Semantics-guided neural networks for efficient skeleton-based human action recognition. In: Proceedings of the IEEE/CVF conference on computer vision and pattern recognition (CVPR). USA, pp 1112–1121

35.

Nweke HF, Teh YW, Mujtaba G, Al-Garadi MA (2019) Data fusion and multiple classifier systems for human activity detection and health monitoring: review and open research directions. Inf Fusion 46:147–170CrossRef

36.

Jegham I, Khalifa AB, Alouani I, Mahjoub MA (2020) Vision-based human action recognition: an overview and real world challenges. Forensic Sci Int: Digit Investig 32:200901

37.

Li C, Xie C, Zhang B, Han J, Zhen X, Chen J (2021) Memory attention networks for skeleton-based action recognition. IEEE Trans Neural Netw Learn Syst 33(9):4800–4814CrossRef

38.

Yan S, Xiong Y, Lin D (2018) Spatial temporal graph convolutional networks for skeleton-based action recognition. In: Thirty-second AAAI conference on artificial intelligence. USA

39.

Si C, Chen W, Wang W, Wang L, Tan T (2019) An attention enhanced graph convolutional LSTM network for skeleton-based action recognition. In: Proceedings of the IEEE/CVF conference on computer vision and pattern recognition. USA, pp 1227–1236

40.

Chen Y, Tian Y, He M (2020) Monocular human pose estimation: a survey of deep learning-based methods. Comput Vis Image Underst 192:102897CrossRef

41.

Basly H, Ouarda W, Sayadi FE, Ouni B, Alimi AM (2022) DTR-HAR: deep temporal residual representation for human activity recognition. Vis Comput 38(3):993–1013CrossRef

42.

Senthilkumar N, Manimegalai M, Karpakam S, Ashokkumar SR, Premkumar M (2022) Human action recognition based on spatial–temporal relational model and LSTM-CNN framework. Mater Today: Proc 57:2087–2091

43.

Kostis I-A et al (2022) Human activity recognition under partial occlusion. In: International conference on engineering applications of neural networks, Chersonissos, Crete, Greece, pp 297–309

44.

Angelini F et al (2019) 2D pose-based real-time human action recognition with occlusion-handling. IEEE Trans Multimed 22:1433–1446CrossRef

45.

Sahoo SP, Modalavalasa S, Ari S (2022) DISNet: a sequential learning framework to handle occlusion in human action recognition with video acquisition sensors. Digit Signal Process. https://doi.org/10.1016/j.dsp.2022.103763CrossRef

46.

Zhao Z, Lan S, Zhang S (2020) Human pose estimation based speed detection system for running on treadmill. In: 2020 International conference on culture-oriented science and technology (ICCST). IEEE, China, pp 524–528

47.

Jalal A, Nadeem A, Bobasu S (2019) Human body parts estimation and detection for physical sports movements. In: 2019 2nd International conference on communication, computing and digital systems (C-CODE). IEEE, Pakistan, pp 104–109

48.

Boualia SN, Amara NEB (2019) Pose-based human activity recognition: a review. In: 15th International wireless communications and mobile computing conference (IWCMC). IEEE, Tangier, pp 1468–1475

49.

Gamra MB, Akhloufi MA (2021) A review of deep learning techniques for 2D and 3D human pose estimation. Image Vis Comput 114:104282CrossRef

50.

Kendall A, Grimes M, Cipolla R (2015) Posenet: a convolutional network for real-time 6-dof camera relocalization. In: Proceedings of the IEEE international conference on computer vision. Chile, pp 2938–2946

51.

Cao Z, Simon T, Wei SE, Sheikh Y (2017) Realtime multi-person 2d pose estimation using part affinity fields. In: Proceedings of the IEEE conference on computer vision and pattern recognition. USA, pp 7291–7299

52.

Chen CH, Tyagi A, Agrawal A, Drover D, Mv R, Stojanov S, Rehg JM (2019) Unsupervised 3d pose estimation with geometric self-supervision. In: Proceedings of the IEEE/CVF conference on computer vision and pattern recognition. USA, pp 5714–5724

53.

Cai Y, Ge L, Liu J, Cai J, Cham TJ, Yuan J, Thalmann NM (2019) Exploiting spatial-temporal relationships for 3d pose estimation via graph convolutional networks. In: Proceedings of the IEEE/CVF international conference on computer vision. Korea, pp 2272–2281

54.

Bazarevsky V et al (2020) Blazepose: on-device real-time body pose tracking. https://doi.org/10.48550/arXiv.2006.10204. arXiv preprint. arXiv:2006.10204

55.

Bazarevsky V et al (2019) Blazeface: sub-millisecond neural face detection on mobile gpus. https://doi.org/10.48550/arXiv.1907.05047. arXiv preprint. arXiv:1907.05047

56.

Bazarevsky V, Zhang F (2019) On-device, real-time hand tracking with mediapipe. Google AI Blog

57.

Feng J, Yu Y, Zhou ZH (2018) Multi-layered gradient boosting decision trees. Adv Neural Inf Process Syst 31

58.

Arik SÖ, Pfister T (2021) Tabnet: attentive interpretable tabular learning. In: Proceedings of the AAAI conference on artificial intelligence, 2021, vol 35, no 8, pp 6679–6687 [online]

59.

Chen T, Guestrin C (2016) Xgboost: a scalable tree boosting system. In: Proceedings of the 22nd ACM SIGKDD international conference on knowledge discovery and data mining. USA, pp 785–794

60.

Ke G, Meng Q, Finley T, Wang T, Chen W, Ma W et al (2017) Lightgbm: a highly efficient gradient boosting decision tree. In: Advances in neural information processing systems, vol 30. USA

61.

Breiman L (2021) Random forests. Mach Learn 45(1):5–32CrossRef

62.

Xia J, Zhang S, Cai G, Li L, Pan Q, Yan J, Ning G (2017) Adjusted weight voting algorithm for random forests in handling missing values. Pattern Recognit 69:52–60ADSCrossRef

63.

Nie F, Zhu W, Li X (2020) Decision tree SVM: an extension of linear SVM for non-linear classification. Neurocomputing 401:153–159CrossRef

64.

Alanazi Y, Schram M, Rajput K, Goldenberg S, Vidyaratne L, Pappas C et al (2023) Multi-module based CVAE to predict HVCM faults in the SNS accelerator. arXiv preprint. arXiv:2304.10639

65.

Debbagh M (2023) Learning structured output representations from attributes using deep conditional generative models. arXiv preprint. arXiv:2305.00980

66.

Wang Y, Wang H, Peng Z (2021) Rice diseases detection and classification using attention based neural network and Bayesian optimization. Expert Syst Appl 178:114770. https://doi.org/10.1016/j.eswa.2021.114770CrossRef

67.

Chen S, Xu Y, Zou B (2023) Prior-knowledge-based self-attention network for 3D human pose estimation. Expert Syst Appl 225:120213CrossRef

68.

Palermo M, Moccia S, Migliorelli L, Frontoni E, Santos CP (2021) Real-time human pose estimation on a smart walker using convolutional neural networks. Expert Syst Appl 184:115498CrossRef

69.

Zhou X, Huang Q, Sun X, Xue X, Wei Y (2017) Towards 3D human pose estimation in the wild: a weakly-supervised approach. In: Proceedings of the IEEE international conference on computer vision (ICCV), pp 398–407

70.

Chang JY, Moon G, Lee KM (2019) PoseLifter: absolute 3D human pose lifting network from a single noisy 2D human pose. arXiv:1910.12029

71.

Pavllo D, Feichtenhofer C, Grangier D, Auli M (2019) 3D human pose estimation in video with temporal convolutions and semi-supervised training. In: Proceedings of the IEEE/CVF conference on computer vision and pattern recognition (CVPR). USA, pp 7753–7762

72.

Tarawneh AS, Hassanat AB, Almohammadi K, Chetverikov D, Bellinger C (2020) SMOTEFUNA: synthetic minority over-sampling technique based on furthest neighbour algorithm. IEEE Access 8:59069–59082

73.

Shen F, Zhao X, Kou G, Alsaadi FE (2021) A new deep learning ensemble credit risk evaluation model with an improved synthetic minority oversampling technique. Appl Soft Comput 98:106852

Title: STO-CVAE: state transition-oriented conditional variational autoencoder for data augmentation in disability classification
Authors: Seong Jin Bang, Min Jung Kang, Min-Goo Lee, Sang Min Lee
Publication date: 4 March 2024
Publisher: Springer International Publishing
Published in: Complex & Intelligent Systems
Print ISSN: 2199-4536
Electronic ISSN: 2198-6053
DOI: https://doi.org/10.1007/s40747-024-01370-x

Class	Before augmentation (base samples)	After augmentation, Setting1 (\(M\))	After augmentation, Setting2 (\({\text{BM}}\))
ND	593	1182 (+ 589)	871 (+ 278)
CP	799	1043 (+ 244)	1592 (+ 793)
CL	18	33 (+ 15)	392 (+ 374)
SCI	354	710 (+ 356)	671 (+ 317)
MD	481	961 (+ 480)	796 (+ 315)
ID	11	21 (+ 10)	386 (+ 375)
ASD	16	31 (+ 15)	390 (+ 374)
