
Open Access 04-08-2022 | Original Article

Multi-view 3D human pose reconstruction based on spatial confidence point group for jump analysis in figure skating

Authors: Limao Tian, Xina Cheng, Masaaki Honda, Takeshi Ikenaga

Published in: Complex & Intelligent Systems | Issue 1/2023


Abstract

Competitive figure skaters perform successful jumps with critical parameters that are valuable for jump analysis in athlete training. Driven by recent computer vision applications, recovering the 3D pose of a figure skater to obtain these meaningful variables has become increasingly important. However, conventional works either derive 3D information directly from the corresponding 2D information or leave the specificity of the sport out of consideration; issues such as self-occlusion, abnormal poses and the limitations of the venue lead to poor results. Motivated by these problems, this paper proposes a multi-task architecture based on a calibrated multi-camera system to jointly estimate the 3D jump pose of a figure skater. The proposed method consists of three key components: likelihood distribution and temporal smoothness-based discrete probability point selection, which selects the most valuable 2D information; multi-perspective and combination unification-based large-scale venue 3D reconstruction, which handles the multi-camera setup; and multi-constraint-based human skeleton estimation, which decides the final 3D coordinate from the candidates. This work is shown to be applicable to 3D animated display and motion capture of figure skating competitions. The success rate of the independent joints is 93.38% within a 70 mm error range, 92.57% within 50 mm and 91.55% within 30 mm.

Introduction

Recovering human pose has become increasingly important in sports, animation, human–computer interaction, video surveillance, action recognition and other fields. The 2D pose is inherently ambiguous, as one 2D pose can be the projection of many different 3D poses, and the vast potential of estimating 3D human pose in sports has attracted academic interest. In figure skating, the jump is widely acknowledged as one of the most critical elements of a skater's program: an excellent jump is attractive for the audience but places strict technical requirements on the athlete. Analyzing 3D jumps in figure skating not only objectively evaluates the skater's performance but also enhances the audience's entertainment experience. Therefore, the analysis of the 3D jump pose should not be overlooked, as it plays a significant role in understanding the figure skater's behavior.
Why is multi-perspective 3D pose estimation interesting? First, an increasing number of efforts in the community, whether end-to-end methods [1–3] or two-step methods [4–6], focus on monocular 3D pose estimation [7, 8] building on the success of deep learning. Many of these methods settle for a relative depth computed with respect to a reference point or root joint and then use prior information to calculate the final 3D pose; the obtained 3D result is therefore not a real-world coordinate in space. The multi-perspective structure [9–12] satisfies the annotation requirements of monocular 3D pose estimation thanks to its high accuracy and efficient process. Second, multi-perspective 3D pose estimation has become widespread, motivated by practical purposes: in practical applications, multiple perspectives indisputably avoid the problem of dead angles, and in most cases a higher accuracy can be achieved compared to a single view.
The goal of 3D human pose estimation in figure skating is to localize the key points of the body in 3D space. However, most previous 3D human pose estimation methods lose sight of the limitations imposed by the large venue and the small target in figure skating. The diverse variations in background, costume, abnormal pose, self-occlusion, athletic fields and camera parameters make 3D jump pose estimation a challenging problem.
This work summarizes two main difficulties, as shown in Fig. 1. Figure skating combines athletic power with elegant artistry and contains abnormal poses that differ from daily motion. First, an abnormal pose involves several confusing limbs that are difficult to identify even for human eyes. Second, figure skating is always held in a large venue, so the athlete's moving range is large; in some situations the target is far from the camera and projects unclear image content that is difficult to detect.
In this paper, we develop a transformation system that generates the figure skater's 3D pose conditioned on the corresponding multi-view 2D Part Confidence Maps [13], which indicate the probability that a joint appears in a given map area, and we attempt to effectively handle 3D pose estimation in a large venue via a binocular stereo reconstruction architecture for jump analysis in figure skating.

Multi-view 3D pose estimation

Iskakov et al. [14] presented two novel solutions for multi-view 3D human pose estimation based on new learnable triangulation methods that combine 3D information from multiple 2D views. Pavlakos et al. [15] presented an automatic way to gather 3D annotations for human pose estimation tasks, using a generic ConvNet for 2D pose estimation and recordings from a multi-view setup. Remelli et al. [16] proposed a new multi-view fusion technique for 3D pose estimation that reasons effectively across multi-view geometry while introducing negligible computational overhead with respect to monocular methods. Tome et al. [17] proposed a CNN-based approach for multi-camera markerless motion capture of the human body that makes use of 3D reasoning throughout a multi-stage approach. Liang et al. [18] proposed a scalable neural network framework to reconstruct the 3D mesh of a human body from multi-view images, which focuses more on the human body shape than on the joints. Ohashi et al. [19] discussed 3D reconstruction of human motion from multi-camera images: after the Part Confidence Maps are computed from each camera image, a spatial–temporal filter is applied to deliver accurate and smooth human motion data for motion analysis. Luo et al. [20] proposed a multi-task and multi-level neural network structure with physical constraints to estimate 3D human poses from a single RGB image in an end-to-end manner. These works have limitations regarding abnormal poses and large venues, which are important features of figure skating. The conventional work [21] preceding this paper developed a system to obtain the 3D jump pose of a figure skater; at the core of that approach, inaccurate or even erroneous reconstruction results are corrected by combining spatial–temporal information and multiple perspectives during the 2D-to-3D pose transformation.

Sensor-based jump analysis in figure skating

Advancements in wearable technology have facilitated performance monitoring in a number of sports. Many sensor-based methods [22, 23] provide valuable analysis of figure skating.
An inertial sensor-based system (MISSIE) [24] sampled jumps that were simultaneously filmed with high-speed video. The time values of figure skating jump events (toe pick, release of the glide leg, take-off and landing) were analysed from the inertial data both manually and by a software algorithm, while the video data were analysed manually. MISSIE can be used for figure skating jump analysis and feedback and is superior to traditional video analysis. Another study [25] developed a prototype jump monitor for figure skating; its purpose was to evaluate the feasibility of using an inertial measurement unit (IMU) to monitor jumping performance. Accurate identification of multi-revolution jumps and quantification of rotation speeds can be accomplished with a single waist-mounted IMU.

Manual-observation-based jump analysis in figure skating

Many figure skating analysts judge athletes' performance based on images or videos of the competition [26, 27]. The successful execution of jumps can be determined by observing frame by frame.
Work [28] reviews the biomechanics of triple and quadruple figure skating jumps, focusing on information with implications for strength and conditioning programs. By fully observing the figure skater's motion, the authors conclude that to complete the required revolutions in a jump, a skater must balance the average angular velocity with the time in the air. Study [29] also tested an elite male junior skater who performed a series of jumps on the ice; substantial differences were observed in the movement technique and kinematic parameters of the pre-take-off phase across jump performances.
Compared to previous studies of jump analysis in figure skating, this work has two attractive characteristics. First, the method allows a hassle-free, unburdensome implementation: the figure skaters do not have to carry any measuring instruments. Second, this work uses computer vision techniques to extract and analyze the figure skater's 3D pose instead of manually observing the original figure skating video. All in all, the proposed method can not only be applied to real competitions, but also obtains the figure skater's pose automatically and efficiently.

Framework of the reconstruction system

The system defines a complete jump as four consecutive stages: glide leg (S1), take-off (S2), spin in the air (S3) and landing (S4), as shown in Fig. 2.
Figure 4 illustrates the overall pipeline of the proposed approach. First, as shown in Fig. 4, video sequences are captured from six perspectives. The part confidence map of each joint is obtained using OpenPose [13], a 2D pose recognition method; the map reflects the possibility of the 2D appearance area of each joint, as explained in the OpenPose [13] program. Figure 4 gives three views to explain how the discrete probability points are chosen. From left to right, the higher the intensity of the confidence map, the more accurate the 2D recognition of the joint is considered to be. Then, discrete probability points are selected according to the likelihood distribution and temporal smoothness. Here, the likelihood distribution means that the higher the temperature in the heatmap, the more points are selected: the heatmap of the first confidence map is the lowest, so one point (the yellow point) represents the 2D position of this joint, while the third confidence map has the highest temperature, so more points (the blue points) represent the 2D position of the joint. So far, the 2D position of a joint in each of the multiple perspectives has been represented by a set of discrete points.
Second, two views are arbitrarily selected from the multi-view setup. The joint is numbered according to the confidence from different viewpoints. In the reconstructed spatial confidence point group part, take the first orange picture as an example: one perspective uses three green dots to represent the 2D position of the joint, and the other perspective uses one yellow dot. Under binocular reconstruction, each point in the first view is paired with each point in the other view to calculate a 3D point. Therefore, the orange picture yields three 3D points (three by one), the green picture five 3D points (five by one), and the blue picture fifteen 3D points (five by three). The color setting is consistent with the 2D heatmap. Then, the 3D confidences of the reconstructed 3D points are determined according to the number of points: the orange picture has 3 points, so its confidence is lower than that of the 5-point spatial confidence point group in the green picture.
Third, the multiple spatial confidence point groups of a given joint are merged into one large spatial point group. Then, the figure skater's 3D pose is estimated by analyzing the spatial confidence point groups and collecting constraint statistics based on the prior conditions. In fact, since many 2D pose estimation approaches produce probability regions of body parts, this work can build on any 2D pose detector that outputs confidence maps of the joints as an intermediate result.

Likelihood distribution and temporal smoothness-based discrete probability points selection

In the example shown in Fig. 5, the purpose is to select a certain number of points in each view, based on the probability, to represent the figure skater's right elbow. In the top image, the confidence map of the right elbow has the strongest intensity. In the middle one, the confidence map is slightly weaker, and in the lower one the right elbow is misrecognized: two noisy part confidence maps representing the right elbow are generated. For each perspective, no matter how many part confidence maps represent the right elbow, some discrete points are extracted from them to represent it. All in all, the discrete probability points are selected based on the heat distribution of the confidence map, and point groups that do not conform to the spatio-temporal condition are filtered out. Finally, a certain number of discrete points is obtained in each view to represent the position of the right elbow.

Likelihood distribution

Figure 6 shows that the confidence degree gradually decreases from left to right (a color bias toward blue represents high confidence, toward red low confidence). Accordingly, the number of discrete probability points selected from a confidence map naturally decreases with its intensity. The reason for this selection is to increase the influence of high-intensity confidence maps while weakening the influence of low-intensity ones: these probability point groups exert their influence through their sheer number in the subsequent processing.
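To make the selection rule concrete, the following is a minimal Python sketch of likelihood-distribution-based point selection. It assumes the confidence map is a 2D NumPy array with values in [0, 1] (as OpenPose produces); the linear mapping from peak intensity to point budget and the 1–5 point range are illustrative assumptions, not the paper's exact rule.

```python
import numpy as np

def select_probability_points(conf_map, max_points=5, min_points=1):
    """Select discrete probability points from one part confidence map.

    More points are drawn from high-intensity maps and fewer from weak
    ones, so that point count later acts as a confidence weight.
    """
    peak = float(conf_map.max())
    # Map peak intensity to a point budget (illustrative linear rule).
    n_points = int(round(min_points + (max_points - min_points) * peak))
    # Take the n_points highest-scoring pixels as the discrete points.
    flat_idx = np.argsort(conf_map, axis=None)[-n_points:]
    vs, us = np.unravel_index(flat_idx, conf_map.shape)
    scores = conf_map[vs, us]
    # Each row: (u, v, s) -- pixel coordinates and confidence score.
    return np.stack([us, vs, scores], axis=1)
```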

Temporal smoothness

Due to the confusing limbs of the figure skater and the unclear target caused by the large venue, ambiguous identification results often occur. In some cases, a joint has more than one part confidence map from a certain perspective and therefore corresponds to two discrete point groups. As shown in Fig. 7, this situation generally occurs in body parts that have a left/right counterpart (e.g., the neck causes no confusion, but the hands do). To solve this problem, this work proposes a 2D temporal filter based on the smoothness of the body part trajectory.
As shown in Fig. 8, after the likelihood distribution step, the confidence maps of a series of consecutive frames taken from one perspective are first represented as discrete probability points. When ambiguous confidence maps occur, the data of the previous frame is taken into consideration. Temporal smoothing aims to ensure the uniqueness of the point group.
For each frame, the previous frame has already been processed and is therefore always unique; its highest-scoring point is taken as a reference:
$$\begin{aligned} \mathrm{joint}^{k_n}_{\mathrm{frame}_{t-1}}=\left( u_{\mathrm{frame}_{t-1}}^n,v_{\mathrm{frame}_{t-1}}^n,s_{\mathrm{frame}_{t-1}}^n\right) , \end{aligned}$$
(1)
where k is the joint number, following the numbering in Fig. 7, u and v are the pixel coordinates, s is the confidence score, and n indexes the nth point in the current point group. The highest-scoring points in the current frame's point groups are defined as:
$$\begin{aligned} \mathrm{joint}^{k_n}_{\mathrm{frame}_{t_0}}&=\left( u_{\mathrm{frame}_{t_0}}^n,v_{\mathrm{frame}_{t_0}}^n,s_{\mathrm{frame}_{t_0}}^n\right) \end{aligned}$$
(2)
$$\begin{aligned} \mathrm{joint}^{k_n}_{\mathrm{frame}_{t_1}}&=\left( u_{\mathrm{frame}_{t_1}}^n,v_{\mathrm{frame}_{t_1}}^n,s_{\mathrm{frame}_{t_1}}^n\right) . \end{aligned}$$
(3)
where \(\mathrm{joint}^{k_n}_{\mathrm{frame}_{t_1}}\) only exists when there are ambiguous results. If double point groups do not exist, the highest-scoring point in the current frame's point group is simply marked as \(\mathrm{joint}^{k_n}_{\mathrm{frame}_t}\).
There are three cases in which the previous frame governs the current frame. The first case retains the closer point group with its original intensity. Here the previous frame is a high-intensity point group, meaning its error probability is relatively low. The one of \(\mathrm{joint}^{k_n}_{\mathrm{frame}_{t_0}}\) and \(\mathrm{joint}^{k_n}_{\mathrm{frame}_{t_1}}\) that is more consistent with trajectory continuity relative to the reference point group is considered the valid result \(\mathrm{joint}^{k_n}_{\mathrm{frame}_t}\). Therefore, the final point group of this frame is
$$\begin{aligned} \mathrm{joint}^{k_n}_{\mathrm{frame}_t}=\left( u_{\mathrm{frame}_t}^n,v_{\mathrm{frame}_t}^n,s_{\mathrm{frame}_t}^n\right) . \end{aligned}$$
(4)
The second case reduces the intensity of the current frame's point group. Here the previous frame is a low-intensity point group and is not itself a double point group, and, more notably, the continuity between the two frames is out of bounds. When the previous frame cannot provide a reliable guarantee and the relationship between the two frames is not continuous, the point group of the current frame is downgraded as much as possible. Therefore, the final point group of this frame is
$$\begin{aligned} \mathrm{joint}^{k_n}_{\mathrm{frame}_t}=\left( u_{\mathrm{frame}_t}^n,v_{\mathrm{frame}_t}^n,0.100\right) . \end{aligned}$$
(5)
The third case retains the closer point group and reduces its intensity. It handles the situation in which the current frame has double point groups and the previous frame is of low intensity. When the continuity between frames is not strong, the point group among \(\mathrm{joint}^{k_n}_{\mathrm{frame}_{t_0}}\) and \(\mathrm{joint}^{k_n}_{\mathrm{frame}_{t_1}}\) closer to the previous frame is forcibly selected, and the selected group \(\mathrm{joint}^{k_n}_{\mathrm{frame}_t}\) is demoted to reduce its influence on the overall situation. So the final point group of this frame is
$$\begin{aligned} \mathrm{joint}^{k_n}_{\mathrm{frame}_t}=\left( u_{\mathrm{frame}_t}^n,v_{\mathrm{frame}_t}^n,0.100\right) . \end{aligned}$$
(6)
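The three cases above can be condensed into a short sketch. This is an illustrative Python rendering of the case logic, assuming each candidate group is represented by its highest-scoring (u, v, s) point; the intensity threshold and the pixel continuity bound are assumed parameters, since the paper does not state exact values beyond the demoted score of 0.100 in Eqs. (5) and (6).

```python
import numpy as np

DEMOTED_SCORE = 0.100   # score of a downgraded group (Eqs. 5-6)
HIGH_INTENSITY = 0.5    # assumed threshold for a "high-intensity" group
MAX_JUMP_PX = 30.0      # assumed inter-frame continuity bound in pixels

def temporal_filter(prev, cands):
    """Resolve ambiguous point groups for one joint in one view.

    prev:  (u, v, s) reference point of the (already unique) previous frame.
    cands: list of (u, v, s) highest-score points, one per candidate group
           in the current frame (a single entry when there is no ambiguity).
    Returns the (u, v, s) representative of the retained group.
    """
    dists = [np.hypot(c[0] - prev[0], c[1] - prev[1]) for c in cands]
    nearest = cands[int(np.argmin(dists))]
    if prev[2] >= HIGH_INTENSITY:
        # Case 1: reliable previous frame -> keep the closer group as is.
        return nearest
    if len(cands) == 1:
        # Case 2: weak previous frame, single group; demote if continuity breaks.
        if dists[0] > MAX_JUMP_PX:
            return (cands[0][0], cands[0][1], DEMOTED_SCORE)
        return cands[0]
    # Case 3: weak previous frame and double groups -> keep the closer, demoted.
    return (nearest[0], nearest[1], DEMOTED_SCORE)
```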

Multi-perspective and combination unification-based large-scale venue 3D reconstruction

Binocular stereo vision mimics the human eyes to obtain 3D information and consists of two cameras that form a triangular relationship with the measured object in space. As shown in Fig. 9, the spatial coordinate can be obtained from the calibration matrices and the pixel values in the two camera planes.
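A minimal sketch of the pairwise binocular reconstruction follows, using OpenCV's triangulatePoints (the paper's implementation used OpenCV 3.4.1). Each discrete point in one view is paired with each point in the other view, reproducing the counting described earlier (e.g., five points by three points yield fifteen 3D candidates), and each 3D candidate inherits the mean 2D confidence of its generating pair. The function name and data layout are assumptions for illustration.

```python
import cv2
import numpy as np

def reconstruct_point_group(P_L, P_R, pts_L, pts_R):
    """Triangulate every pairing of discrete points from two views.

    P_L, P_R: 3x4 projection matrices from camera calibration.
    pts_L, pts_R: arrays of (u, v, s) points for one joint in each view.
    Returns an (|pts_L| * |pts_R|, 4) array of (x, y, z, s) candidates.
    """
    out = []
    for uL, vL, sL in pts_L:
        for uR, vR, sR in pts_R:
            a = np.array([[uL], [vL]], dtype=np.float64)
            b = np.array([[uR], [vR]], dtype=np.float64)
            X = cv2.triangulatePoints(P_L, P_R, a, b)  # 4x1 homogeneous
            X = (X[:3] / X[3]).ravel()
            # Candidate confidence: mean 2D score of the generating pair.
            out.append([X[0], X[1], X[2], 0.5 * (sL + sR)])
    return np.asarray(out)
```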

Multi-perspective

This work uses six cameras to capture the 3D pose, since the dead angles of a single view created by the athlete's 360-degree rotation must be taken into account. The graduated color bar in Fig. 10 represents different confidence intensities, weakening from left to right.
Recall that the first proposal provides the discrete points of each 2D point group as \(\mathrm{joint}^{k_n}_{\mathrm{frame}_t}=(u_{\mathrm{frame}_t}^n,v_{\mathrm{frame}_t}^n,s_{\mathrm{frame}_t}^n)\). After reconstruction from multiple perspectives, the discrete points in each spatial confidence point group are
$$\begin{aligned} \begin{aligned} {\left\langle L,R \right\rangle } \mathrm{joint}^{k_m}_{\mathrm{frame}_t}&=\Bigg ({\left\langle L,R \right\rangle }x_{\mathrm{frame}_t}^{k_m},{\left\langle L,R \right\rangle }y_{\mathrm{frame}_t}^{k_m},\\&\qquad {\left\langle L,R \right\rangle }z_{\mathrm{frame}_t}^{k_m},{\left\langle L,R \right\rangle }s_{\mathrm{frame}_t}^{k_m}\Bigg ) \end{aligned} \end{aligned}$$
(7)
where L and R indicate the two selected viewpoints, ranging from camera 0 to camera 5, m indexes the mth 3D point in the spatial confidence point group, and \({\left\langle L,R \right\rangle }s_{\mathrm{frame}_t}^{k_m}\) is the average confidence value of the selected 2D discrete points from the two cameras.
Since the six cameras are set every 60 degrees and the 3D shape recovery is mainly based on binocular stereo reconstruction, the six-camera setting yields 15 combinations (\({\left\langle 0,1 \right\rangle }\), \({\left\langle 0,2 \right\rangle }\), \({\left\langle 0,3 \right\rangle }\), \({\left\langle 0,4 \right\rangle }\), \({\left\langle 0,5 \right\rangle }\), \({\left\langle 1,2 \right\rangle }\), \({\left\langle 1,3 \right\rangle }\), \({\left\langle 1,4 \right\rangle }\), \({\left\langle 1,5 \right\rangle }\), \({\left\langle 2,3 \right\rangle }\), \({\left\langle 2,4 \right\rangle }\), \({\left\langle 2,5 \right\rangle }\), \({\left\langle 3,4 \right\rangle }\), \({\left\langle 3,5 \right\rangle }\), \({\left\langle 4,5 \right\rangle }\)). Due to the large area of the figure skating site and the complexity of the camera reconstruction results, how to choose and evaluate an appropriate reconstruction combination is shown in Fig. 11. When the athlete moves to certain positions, the camera's lens plane and the target cannot form an obvious angle, which introduces deviation.
Immediately, combinations \({\left\langle 0,1 \right\rangle }\), \({\left\langle 2,5 \right\rangle }\) and \({\left\langle 3,4 \right\rangle }\) can be judged unusable, because their angle with the target is always close to 180 degrees, meaning the corresponding rays are nearly parallel. The remaining combinations are considered valuable, though subtle differences still exist, as shown in Fig. 12 (see the sketch below). The next step is to unify the trajectories based on the reconstructed spatial confidence point groups.
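As referenced above, this is one way the unusable pairs could be screened out programmatically. The sketch assumes known camera centers and a rough target position, and drops pairs whose viewing rays form a degenerate triangulation angle; the angular thresholds are illustrative assumptions, whereas the paper simply excludes the three opposite-camera pairs.

```python
import numpy as np
from itertools import combinations

def usable_pairs(cam_centers, target, min_deg=20.0, max_deg=160.0):
    """Keep camera pairs whose rays to the target triangulate well.

    cam_centers: dict camera_id -> 3D position (np.ndarray of shape (3,)).
    target:      rough 3D position of the skater.
    Pairs whose rays are nearly parallel or nearly opposite (e.g. <0,1>,
    <2,5>, <3,4> for opposite cameras) are dropped.
    """
    keep = []
    for i, j in combinations(sorted(cam_centers), 2):
        vi = target - cam_centers[i]
        vj = target - cam_centers[j]
        cos = vi @ vj / (np.linalg.norm(vi) * np.linalg.norm(vj))
        ang = np.degrees(np.arccos(np.clip(cos, -1.0, 1.0)))
        if min_deg <= ang <= max_deg:
            keep.append((i, j))
    return keep
```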

Camera combinations unification

For each joint, there are several spatial confidence point groups with different intensities. To unify these groups, a root joint is selected in advance and its position is calculated first; the spatial confidence point groups of the other joints are then translated relative to this root joint.
For the purpose of unifying the combination values, this work selects as root joint the neck joint of combination \({\left\langle 2,4 \right\rangle }\) or \({\left\langle 3,5 \right\rangle }\), for which there is no occlusion problem and the spatial confidence point group is always strong. Taking combination \({\left\langle 2,4 \right\rangle }\) as the example:
$$\begin{aligned} \begin{aligned} {\left\langle 2,4 \right\rangle }\mathrm{joint}^1_{\mathrm{frame}_t}&= \left( \dfrac{\sum ^{m}_{r=1}{\left\langle 2,4 \right\rangle }x_{\mathrm{frame}_t}^{1_r}}{m},\right. \\&\qquad \dfrac{\sum ^{m}_{r=1}{\left\langle 2,4 \right\rangle }y_{\mathrm{frame}_t}^{1_r}}{m},\\&\left. \qquad \dfrac{\sum ^{m}_{r=1}{\left\langle 2,4 \right\rangle }z_{\mathrm{frame}_t}^{1_r}}{m}\right) . \end{aligned} \end{aligned}$$
(8)
The three components in brackets are recorded as \(x_{\mathrm{root}_t}\), \(y_{\mathrm{root}_t}\), \(z_{\mathrm{root}_t}\). Then, every 3D point in each camera combination's spatial confidence point group belonging to a certain joint k is reset as \(S^{k_m}_{\mathrm{frame}_t}\) according to the following formula:
$$\begin{aligned} S^{k_m}_{\mathrm{frame}_t}= \left\{ \begin{array}{lr} {\left\langle L,R \right\rangle }x_{\mathrm{frame}_t}^{k_m}-{\left\langle L,R \right\rangle }x_{\mathrm{frame}_t}^{1_m}+x_{\mathrm{root}_t},\\ {\left\langle L,R \right\rangle }y_{\mathrm{frame}_t}^{k_m}-{\left\langle L,R \right\rangle }y_{\mathrm{frame}_t}^{1_m}+y_{\mathrm{root}_t},\\ {\left\langle L,R \right\rangle }z_{\mathrm{frame}_t}^{k_m}-{\left\langle L,R \right\rangle }z_{\mathrm{frame}_t}^{1_m}+z_{\mathrm{root}_t} \end{array} \right. \end{aligned}$$
(9)
So far, the results from all available camera combinations have been unified, as shown in Fig. 13.
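The unification of Eqs. (8) and (9) can be sketched as follows. It assumes the point groups are stored per camera combination and per joint, and that the neck is joint 1 (as the superscript in Eq. (8) suggests); taking the mean of each combination's root-joint group as that combination's root position is one reading of Eq. (9).

```python
import numpy as np

def unify_combinations(groups, root_id=1):
    """Translate each camera combination's point groups onto a shared root.

    groups: dict (L, R) -> dict joint_id -> (m, 4) array of (x, y, z, s).
    The global root is the mean of the root joint over the reference
    combinations <2,4> / <3,5> (Eq. 8); every point is then shifted by the
    offset between its combination's root mean and the global root (Eq. 9).
    """
    ref_keys = [k for k in ((2, 4), (3, 5)) if k in groups]
    root = np.mean(
        [groups[k][root_id][:, :3].mean(axis=0) for k in ref_keys], axis=0)
    unified = {}
    for key, joints in groups.items():
        offset = root - joints[root_id][:, :3].mean(axis=0)
        for jid, pts in joints.items():
            shifted = pts.copy()
            shifted[:, :3] += offset      # translate onto the shared root
            unified.setdefault(jid, []).append(shifted)
    # Merge all combinations into one large point group per joint.
    return {jid: np.vstack(chunks) for jid, chunks in unified.items()}
```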

Multi-constraint-based human skeleton estimation

After the first two proposed methods, one or more spatial confidence point groups are obtained for each joint. To determine one unique 3D position per joint, this work specifies a reasonable and valuable mechanism that applies a set of constraints to choose the final spatial 3D point.

Priors based on body structure and motion trend

Which priors to use and how to obtain them are prerequisites for this proposal. Here we mainly discuss the roles of bone length and motion trend angle in the constraint superposition process.
To ensure the verisimilitude of the human body, the length prior can be obtained by measuring the body directly. However, when there is no excessive deviation in the reconstruction accuracy, a prior that hardly differs from manual measurement can be obtained by selecting supervised joint values; this is more versatile because it places no limitation on the athlete. The athletic motion trends of figure skaters motivate the subsequent use of temporal information: the continuity of action tightly links consecutive frames, and the stability of the motion trend provides a shortcut for exploiting inter-frame connections. The method makes more detailed corrections according to the slight change of limb angle between frames. All in all, the length of the human skeleton and the continuity of human motion are used to constrain the calculation results.

Multi-constraint

After the reconstruction of spatial confidence point groups, each joint has multiple candidate groups. The system processes all point groups into specific spatial locations, as shown in Fig. 14.
As shown in Fig. 15, a length constraint \(l^{p\_{c_m}}_{\mathrm{frame}_t}\), an angle constraint \(\theta ^{p\_{c_m}}_{\mathrm{frame}_t}\) and a confidence constraint \(\mathrm{score}_{\mathrm{frame}_t}^{c_m}\) are applied to filter out errors and select the most accurate 3D position:
$$\begin{aligned} \left\{ \begin{array}{ll} l^{p\_{c_m}}_{\mathrm{frame}_t},\\ \theta ^{p\_{c_m}}_{\mathrm{frame}_t},\\ \mathrm{score}_{\mathrm{frame}_t}^{c_m}, \end{array} \right. \end{aligned}$$
(10)
First, it is decided whether a spatial confidence point group is eligible to participate in the weight assignment. There is no doubt that starting with a length filter on points that exceed the limit helps eliminate a wide range of errors. Let \(\mathrm{flag}_{p\_{c_m}}\) denote the prior bone length between the parent joint and the child joint, and let \(S^{c_m}_{\mathrm{frame}_t}\) denote the spatial coordinate and corresponding confidence score of a point in the spatial confidence point groups of joint c:
$$\begin{aligned} S^{c_m}_{\mathrm{frame}_t}=\left( x_{\mathrm{frame}_t}^{c_m},y_{\mathrm{frame}_t}^{c_m},z_{\mathrm{frame}_t}^{c_m},s_{\mathrm{frame}_t}^{c_m}\right) . \end{aligned}$$
(11)
The length distance of the parent and child joints is
$$\begin{aligned} \begin{aligned} (l^{p\_{c_m}}_{\mathrm{frame}_t})^{2}&= \left( x_{\mathrm{frame}_t}^{c_m}-x_{\mathrm{frame}_t}^{p_m}\right) ^{2}\\&\quad +\left( y_{\mathrm{frame}_t}^{c_m}-y_{\mathrm{frame}_t}^{p_m}\right) ^{2}\\&\quad +\left( z_{\mathrm{frame}_t}^{c_m}-z_{\mathrm{frame}_t}^{p_m}\right) ^{2}. \end{aligned} \end{aligned}$$
(12)
If \(l^{p\_{c_m}}_{\mathrm{frame}_t}\) lies in the range \(\mathrm{flag}_{p\_{c_m}}\pm 50.00\) centimeters, the point enters the screening of the next constraint; otherwise it is directly discarded.
Second, we consider the changes in limb angle and confidence score in parallel to locate the final position of the joint. The angle between the current frame’s limb and the previous frame’s limb is
$$\begin{aligned} \theta ^{p\_{c_m}}_{\mathrm{frame}_t}=\arccos \frac{\overrightarrow{v^{p\_{c_m}}_{\mathrm{frame}_t}}\cdot \overrightarrow{v^{p\_{c_m}}_{\mathrm{frame}_{t-1}}}}{\left|\overrightarrow{v^{p\_{c_m}}_{\mathrm{frame}_t}}\right|\left|\overrightarrow{v^{p\_{c_m}}_{\mathrm{frame}_{t-1}}}\right|}, \end{aligned}$$
(13)
where \(\overrightarrow{v^{p\_{c_m}}_{\mathrm{frame}_t}}\) is the vector from the parent joint to the child joint:
$$\begin{aligned} \overrightarrow{v^{p\_{c_m}}_{\mathrm{frame}_t}}= & {} \Bigg ( x^{{c_m}}_{{\mathrm{frame}_t}}-x^{p}_{{\mathrm{frame}_t}}, y^{{c_m}}_{{\mathrm{frame}_t}}-y^{p}_{{\mathrm{frame}_t}},\nonumber \\&z^{{c_m}}_{{\mathrm{frame}_t}}-z^{p}_{{\mathrm{frame}_t}} \Bigg ) \end{aligned}$$
(14)
Third, different weights are assigned to the different intensities in the spatial confidence point groups and marked as \(\mathrm{score}_{\mathrm{frame}_t}^{c_m}\). The angle constraint \(\theta ^{p\_{c_m}}_{\mathrm{frame}_t}\) and the confidence constraint \(\mathrm{score}_{\mathrm{frame}_t}^{c_m}\) take effect together: the tolerance of the angle gradually expands in a divergent way, searching for high-confidence points \(S^{c_n}_{\mathrm{frame}_t}\) within the divergence range to jointly calculate the final spatial position. Here, the number of points with the same score belonging to one joint is denoted as n; the n points are a subset of the m points. Similar to a weighted average, the final spatial coordinate \(C^c_{\mathrm{frame}_t}\) is
$$\begin{aligned} C^c_{\mathrm{frame}_t}= \left\{ \begin{array}{ll} \dfrac{\sum \Bigg ({\mathrm{score}_{\mathrm{frame}_t}^{c_n}\cdot x_{\mathrm{frame}_t}^{c_n}}\Bigg )}{\sum \Bigg ({\mathrm{score}_{\mathrm{frame}_t}^{c_n}\cdot n}\Bigg )},\\ \dfrac{\sum \Bigg ({\mathrm{score}_{\mathrm{frame}_t}^{c_n}\cdot y_{\mathrm{frame}_t}^{c_n}}\Bigg )}{\sum \Bigg ({\mathrm{score}_{\mathrm{frame}_t}^{c_n}\cdot n}\Bigg )},\\ \dfrac{\sum \Bigg ({\mathrm{score}_{\mathrm{frame}_t}^{c_n}\cdot z_{\mathrm{frame}_t}^{c_n}}\Bigg )}{\sum \Bigg ({\mathrm{score}_{\mathrm{frame}_t}^{c_n}\cdot n}\Bigg )},\\ \end{array} \right. \end{aligned}$$
(15)
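A sketch of the full constraint cascade of Eqs. (10)–(15) follows: the length gate of ±50 cm, a divergently widening angle tolerance, and a confidence-weighted average over the surviving points. The angle step is an assumed parameter, coordinates are assumed to be in centimeters, and the last step is written as a standard weighted mean, which is one reading of Eq. (15).

```python
import numpy as np

def final_joint_position(cands, parent_pos, prior_len, prev_vec,
                         len_tol=50.0, angle_step_deg=10.0):
    """Pick the final 3D position of a child joint from its candidates.

    cands:      (m, 4) array of unified (x, y, z, s) candidate points.
    parent_pos: 3D position of the already-fixed parent joint.
    prior_len:  prior bone length flag_{p_c} in the same units (cm assumed).
    prev_vec:   previous frame's parent->child vector (angle constraint).
    """
    # 1) Length constraint (Eq. 12): keep points within +/-50 cm of the prior.
    d = np.linalg.norm(cands[:, :3] - parent_pos, axis=1)
    cands = cands[np.abs(d - prior_len) <= len_tol]
    if len(cands) == 0:
        raise ValueError("no candidate satisfies the bone length prior")
    # 2) Angle constraint (Eq. 13): widen the tolerated deviation from the
    #    previous frame's limb direction until some candidates fall inside.
    v = cands[:, :3] - parent_pos
    cos = v @ prev_vec / (np.linalg.norm(v, axis=1) * np.linalg.norm(prev_vec))
    ang = np.degrees(np.arccos(np.clip(cos, -1.0, 1.0)))
    tol = angle_step_deg
    while not (ang <= tol).any():
        tol += angle_step_deg
    kept = cands[ang <= tol]
    # 3) Confidence constraint (Eq. 15): score-weighted average position.
    w = kept[:, 3]
    return (w[:, None] * kept[:, :3]).sum(axis=0) / w.sum()
```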

Experiment result

Data set and experimental environment

The source videos of the experiments record the figure skating scene with six cameras. In theory, the more cameras are used, the better the algorithm performs and the more computation time is required. To decide the camera number, we conducted a verification test with different camera numbers and concluded that at least 6 cameras are needed to ensure the accuracy of the algorithm. Details of the verification test are given in the following subsection.
For each camera, the resolution is \(1920 \times 1080\), the frame rate is 60 fps and the shutter speed is 0.001 s, so there is little motion blur in the images. The test sequences contain single, double and triple Flip jumps.
The experiment is executed with the following environment setting: the CPU is Intel Core i7-3770, the RAM is 8 GB, the compiler is Visual Studio 2017, and the external includes OpenCV-3.4.1.

Verification test of camera number

Due to equipment limitations, this article only discusses the conditions when the camera number is equal to or less than 6. To achieve a precise 3D human pose, at least two cameras are required.
In figure skating, the spin pose of the body is important for analyzing the jump action, so the experiment simulates the spin scene. As Fig. 16 shows, the spin angle is divided into 12 states.
Taking camera 1 as an example, the joint recognition of the athlete at each angle is observed in Fig. 17, where the joint numbers are marked in Fig. 7.
As Fig. 17 shows, when the athlete's spin angle is at position \(\textcircled {1}\), facing the camera, all joints are well seen and none are misidentified. When the athlete turns to position \(\textcircled {3}\), he basically faces the camera sideways, so his right arm is partially self-occluded and the right shoulder, right elbow and right wrist (joints 2, 3 and 4, respectively) are identified incorrectly. When the athlete turns to position \(\textcircled {6}\), his back is to the camera and both arms are self-occluded, so the left and right wrist joints (joints 4 and 7) are incorrectly identified. The remaining positions are similar. Considering all 6 cameras, the observations are summarized in Fig. 18. The six cameras are named C1 to C6, and each column represents the spin angle of the athlete. The element (C1, \(\textcircled {3}\)) indicates that joints 2, 3 and 4 cannot be correctly observed by camera C1 at position \(\textcircled {3}\). Under the first column, the value 4 (2) marked by an arrow means that joint 4 can only be correctly observed by 2 cameras at position \(\textcircled {1}\).
From Fig. 18 it can be concluded that even with 6 cameras, some joints at specific positions can only be observed by two cameras. Therefore, to ensure the accuracy of the algorithm, the camera number cannot be less than 6.

Evaluation method

This work adopts the per-joint position error in millimeters as the evaluation protocol. As shown in Fig. 19, human joint pixels are first manually labeled from the six perspectives; the labeled coordinates are then reconstructed and statistically processed to obtain the ground truth.
It can’t be ignored that the large field of figure skating and small targets cause slight fluctuations at the pixel level, which cause centimeter-level errors in the real world. Therefore, in terms of error tolerance, not only the error range, but also the groundtruth error which caused by manual annotation needs to be considered. Therefore, this paper evaluates the results by both qualitative and quantitative methods.
First is the qualitative evaluation method. The ground truth is taken as the center of a sphere whose radius is the error range; a result is defined as successful if the distance between the calculated coordinate and the sphere center does not exceed the error range. The success rate is defined as the number of successful joints divided by the total number of joints:
$$\begin{aligned} \mathrm{Success\ rate}= \frac{\mathrm{Successful\ joints}}{\mathrm{Total\ joints}}. \end{aligned}$$
(16)
As for the quantitative evaluation method, the mean per-joint position error (MPJPE) is calculated as
$$\begin{aligned} E_{\mathrm{MPJPE}} = \frac{1}{N_j} \sum ^{N_j}_{i=1}\left\| S_{\mathrm{GT}}^{i} - S_{\mathrm{result}}^{i} \right\| , \end{aligned}$$
(17)
where \(S_{\mathrm{GT}}^{i}\) and \(S_{\mathrm{result}}^{i}\) are the ground truth and the estimated result of the ith joint, and \(N_j\) is the total number of human body joints, whose value is 13.
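Both metrics reduce to a few lines of code. The sketch below assumes the ground truth and estimates are stored as (frames, 13, 3) arrays in millimeters; the array layout is an assumption for illustration.

```python
import numpy as np

def evaluate(gt, pred, error_range_mm=50.0):
    """Success rate (Eq. 16) and MPJPE (Eq. 17) over a sequence.

    gt, pred: (frames, 13, 3) joint coordinates in millimeters.
    A joint is successful when it lies inside the sphere of radius
    error_range_mm centered on its ground-truth position.
    """
    err = np.linalg.norm(gt - pred, axis=-1)        # per-joint distances
    success_rate = (err <= error_range_mm).mean()   # successful / total joints
    mpjpe = err.mean()                              # mean per-joint error (mm)
    return success_rate, mpjpe
```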

Experimental results

Comparison items are shown in Table 1. The proposed method is compared with the basic framework and the conventional work. Although all comparison items use the same modules (2D information extraction, reconstruction and 3D pose correction), each exploits a different strategy per module. The strategies used in the proposed method have already been fully explained; the methods employed in the basic framework and the conventional work can be described as follows. In the 2D information extraction module, the basic framework directly uses the recognized 2D joint pixel values from the existing 2D pose estimator, while the conventional work adds temporal smoothness on top to filter out wrong points. In the 3D reconstruction module, both the basic framework and the conventional work calculate the average of the reconstruction results from multiple cameras as the fusion result, ignoring the deviation between the 3D reconstruction points from multiple cameras. In the 3D correction module, the basic framework does not modify the 3D skeleton at all, and the conventional work only modifies some unreasonable bone lengths using physiological human joint lengths as a prior. Only end-to-end analysis results are presented in the experiments, because the three proposals are not independent of each other and no module can be replaced within the whole framework; it is therefore difficult to conduct ablation experiments demonstrating the improvement provided by each individual proposal.
The experimental results are shown in Table 2. Following Fig. 7, Upper denotes the joints of the upper body (joint numbers 0–7) and Lower denotes the joints of the lower body (joint numbers 8–13). The table lists the success rates of the upper-body, lower-body and whole-body joints within the specified error ranges. Compared with the conventional work, the accuracy rates of the proposed work are almost all above 90% and the MPJPE is significantly reduced from 74.12 to 23.57 mm. This is due to the use of spatial confidence point groups to determine the possible positions of joints, with the human skeleton then generated in combination with temporal constraints. At the 2D level, the conventional work takes the highest-confidence point of the part confidence map as the 2D position, while the proposed method uses discrete probability points instead of a single fixed highest-probability point, which helps avoid 2D misrecognition. In the reconstruction part, the conventional work averages the reconstructions of different camera combinations from six perspectives as the final 3D result, while the proposed method generates spatial confidence point groups and unifies all combinations based on the relative relationship with the root joint. In terms of 3D constraints, the proposed method adds motion trend and confidence constraints on top of the human bone length constraint used in the conventional work.
Table 1 Comparison items

Items                  | 2D                                         | Reconstruction                                                                          | 3D
Basic framework        | Joint pixel value                          | Reconstruct with stable camera combinations + calculate average value                  | Without any correction
Conventional work [21] | Joint pixel value + temporal smoothness    | Reconstruct with stable camera combinations + calculate average value                  | With simple prior-based correction
Proposed method        | Joint confidence map + temporal smoothness | Reconstruct with all camera combinations + calculate the unified value of combinations | With multi-constraint-based correction
Table 2 Experiment results

Items              |       | Basic framework | Conventional work [21] | Proposed method
Error range @30 mm | Upper | 40.97%          | 70.60%                 | 88.89%
                   | Lower | 38.73%          | 80.48%                 | 95.08%
                   | Total | 40.01%          | 74.83%                 | 91.55%
Error range @50 mm | Upper | 47.98%          | 72.97%                 | 90.26%
                   | Lower | 41.02%          | 84.23%                 | 95.65%
                   | Total | 44.99%          | 77.77%                 | 92.57%
Error range @70 mm | Upper | 51.75%          | 75.82%                 | 91.30%
                   | Lower | 43.04%          | 86.68%                 | 96.15%
                   | Total | 48.02%          | 80.48%                 | 93.38%
MPJPE (mm)         | Upper | 112.11          | 85.47                  | 28.74
                   | Lower | 124.59          | 60.75                  | 21.56
                   | Total | 121.45          | 74.12                  | 23.57

Consideration and analysis

The method proposed in this paper still leaves some problems unsolved. First, the experimental accuracy is not perfect. The large venue of figure skating is a very influential factor: although the data set preparation focused on the local field as much as possible, the field of view of more than 100 square meters and the fixed camera positions impose many limitations, harming both the accuracy of camera calibration and the accuracy of binocular reconstruction of a small target.
Besides, because this work is implemented in stages, the accuracy of the 2D estimator plays a crucial role. This work uses the OpenPose network to estimate the 2D pose and the part confidence maps. However, many 2D misrecognitions remain, since OpenPose is weak at estimating abnormal poses, which are typical in figure skating. These 2D misrecognitions directly affect the final reconstruction results and limit the upper bound of the proposals' accuracy. To achieve a more accurate 3D pose, a 2D pose estimation method developed specifically for figure skating is desirable.
Moreover, if the system is applied to a real competition, the processing speed cannot be ignored. At present, the initial focus of this work is to make it possible to extract the 3D pose of figure skaters; in the future, realizing real-time processing while ensuring accuracy is a necessary condition.

Conclusions

This work focuses on multi-view 3D human pose reconstruction based on synchronized sequences. The proposed methods formulate the problem as 2D pose correction followed by 3D pose reconstruction, taking advantage of figure skating's particularities. The proposal starts from the 2D confidence map and presents a multi-technology correction solution based on the likelihood distribution and temporal information. Then, 3D reconstruction is conducted with full consideration of the large-scale venue; using multiple cameras to capture the athlete's motion is more conducive to motion understanding. Finally, refining and narrowing down the spatial confidence point groups via multiple constraints allows a human pose prior to be modeled in advance. In comparison to the basic framework and the conventional work [21], the success rate of the independent joints is generally above 90%.
At present, OpenPose is used as the 2D human pose estimation network to obtain the confidence maps of the joints. Although OpenPose works well in general conditions, it still produces many 2D misrecognitions for figure skating; to achieve a more accurate 3D pose, a 2D pose estimation method specifically for figure skating is expected. In the future, possible ways to improve the 2D pose accuracy are pre-labeling a batch of skating data sets as training data and training a 2D network model to better identify skating positions. Adding key points such as hands and feet is also significant, and improving the algorithm to achieve real-time operation while ensuring accuracy is important for real applications. With these modifications, the system is expected to be applied to the objective performance evaluation of figure skaters and to real-time display in figure skating TV broadcasts.

Acknowledgements

This work was jointly supported by National Natural Science Foundation of China (62006178), KAKENHI (21K11816) and Waseda University Grant for Special Research Project (2022C-179).
Open Access This article is licensed under a Creative Commons Attribution 4.0 International License, which permits use, sharing, adaptation, distribution and reproduction in any medium or format, as long as you give appropriate credit to the original author(s) and the source, provide a link to the Creative Commons licence, and indicate if changes were made. The images or other third party material in this article are included in the article's Creative Commons licence, unless indicated otherwise in a credit line to the material. If material is not included in the article's Creative Commons licence and your intended use is not permitted by statutory regulation or exceeds the permitted use, you will need to obtain permission directly from the copyright holder. To view a copy of this licence, visit http://creativecommons.org/licenses/by/4.0/.

Publisher's Note

Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.
Literature
4. Pavllo D, Feichtenhofer C, Grangier D, Auli M (2019) 3D human pose estimation in video with temporal convolutions and semi-supervised training. In: IEEE/CVF conference on computer vision and pattern recognition (CVPR), Long Beach, June 16–20, pp 7745–7754
9. Li Y, Wang G, Ji X, Xiang Y, Fox D (2020) Multi-view matching network for 6D pose estimation. Int J Comput Vis 128:657–678
11. Tian L, Cheng X, Honda M, Ikenaga T (2020) 3D pose reconstruction with multi-perspective and spatial confidence point group for jump analysis in figure skating. In: The international workshop on pattern recognition (IWPR)
12. Joo H, Simon T, Li X, Liu H, Tan L, Gui L, Banerjee S, Godisart T, Nabbe B, Matthews I, Kanade T, Nobuhara S, Sheikh Y (2015) Panoptic studio: a massively multiview system for social interaction capture. In: IEEE international conference on computer vision (ICCV). https://doi.org/10.1109/ICCV.2015.381
16. Remelli E, Han S, Honari S, Fua P, Wang R (2020) Lightweight multi-view 3D pose estimation through camera-disentangled representation. In: IEEE conference on computer vision and pattern recognition (CVPR), pp 6040–6049
17. Tome D, Toso M, Agapito L, Russell C (2018) Rethinking pose in 3D: multi-stage refinement and recovery for markerless motion capture. In: International conference on 3D vision (3DV)
18. Liang J, Lin MC (2019) Shape-aware human pose and shape reconstruction using multi-view images. University of Maryland, College Park; UMD-CP & UNC-CH
19. Ohashi T, Ikegami Y, Yamamoto K, Takano W, Nakamura Y (2018) Video motion capture from the part confidence maps of multi-camera images by spatiotemporal filtering using the human skeletal model. In: IEEE/RSJ international conference on intelligent robots and systems (IROS)
20. Luo D, Du S, Ikenaga T (2021) Multi-task neural network with physical constraint for real-time multi-person 3D pose estimation from monocular camera. Multimed Tools Appl 18(80):27223–27244
21. Tian L, Cheng X, Honda M, Ikenaga T (2020) Multi-technology correction based 3D human pose estimation for jump analysis in figure skating. In: The conference of the international sports engineering association (ISEA)
22. Shi Y, Ozaki A, Honda M (2020) Kinematic analysis of figure skating jump by using wearable inertial measurement units. In: The conference of the international sports engineering association (ISEA)
23. Moyer B (2012) Skating, as sensors might see it
24. Schaefer K, Brown N, Alt W (2016) MISSIE—a new method to analyse performance parameters of figure skating jumps. In: International conference of biomechanics in sport
25. Bruening DA, Reynolds RE, Adair CW, Zapalo P, Ridge ST (2018) A sport-specific wearable jump monitor for figure skating. Open Soft Robot Res
27. Vinogradova VI (2013) Biomechanics of leg swing and its effect on multi-turn jump performance in figure skating. In: Theory and practice of physical culture
29. Anna M, Dagmara I, Czesław U (2018) Biomechanics of the Axel Paulsen figure skating jump. Pol J Sport Tour
Metadata
Title: Multi-view 3D human pose reconstruction based on spatial confidence point group for jump analysis in figure skating
Authors: Limao Tian, Xina Cheng, Masaaki Honda, Takeshi Ikenaga
Publication date: 04-08-2022
Publisher: Springer International Publishing
Published in: Complex & Intelligent Systems, Issue 1/2023
Print ISSN: 2199-4536 | Electronic ISSN: 2198-6053
DOI: https://doi.org/10.1007/s40747-022-00837-z
