1 Introduction
2 Related Work
2.1 Classification of Human Activities
2.2 Detection of Social Interactions
2.3 Activity Recognition Datasets
3 System Overview
- Temporal segmentation of interactions: This component is responsible for finding the temporal intervals in which the social activities occur. It uses features based on social science theories, measured on the upper bodies. In practice, this behaves like a switch, which decides when the following components need to be activated and when not.
- Classification of the social activities: This component performs the classification of the detected social activities. It consists of three classifiers, which use three different sets of features based on individual poses, movements, and spatial relations. The output likelihoods are then merged to obtain a final likelihood vector of the activities.
- Estimation of the proximity-based priors: This component is responsible for estimating the probability priors from learnt distributions of the proximity between two subjects. These priors are then merged with the likelihood from the classifiers to obtain the posterior probability of the activities.
4 Feature-Sets
- Classification features: consisting of individual and social features. The former serve the two individual mixtures (\(X_{Ind_1}\), \(X_{Ind_2}\)) of the classification model; they are based on single skeletons and used for individual activity classification, as suggested by [9, 10]. The latter serve the social mixture of the classification model (\(X_{Social}\)); they are based on both skeletons and are used for social activity classification, as proposed by [6].
4.1 Segmentation Features
- Upper joint distances: According to the proxemic theory of [13], humans create spatial sectors around them, the size of which depends on the personal intimacy and cultural background of the subjects. Extracting these sectors from the distance between two persons’ skeletal joints is relatively straightforward. As shown in Fig. 3a, the 2D distance \(d_{i,j}\), on the (x, z) plane of the camera’s frame, is computed between the upper body joints \(J_{i,1}\) and \(J_{j,2}\) of the two persons, where \(i,j \in \{H, L, R, T\}\) – i.e. head, left shoulder, right shoulder and torso, respectively – resulting in 16 different distances. For example, \(d_{H,R}\) is the distance between the head of subject 1 and the right shoulder of subject 2.
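As an illustration, a minimal sketch of this feature set, assuming hypothetical per-joint (x, z) coordinates stored in dictionaries (the function name `upper_joint_distances` and the toy coordinates are our own, not from the original system):

```python
import numpy as np

# Upper-body joints: head, left shoulder, right shoulder, torso.
JOINTS = ["H", "L", "R", "T"]

def upper_joint_distances(skel1, skel2):
    """Return the 16 pairwise 2D distances d_{i,j} between the upper-body
    joints of subject 1 and subject 2 (4 x 4 combinations) on the (x, z) plane."""
    return {
        (i, j): float(np.linalg.norm(np.asarray(skel1[i]) - np.asarray(skel2[j])))
        for i in JOINTS for j in JOINTS
    }

# Toy (x, z) coordinates for two subjects standing about 1 m apart.
skel1 = {"H": (0.0, 1.0), "L": (-0.2, 1.0), "R": (0.2, 1.0), "T": (0.0, 1.1)}
skel2 = {"H": (1.0, 1.0), "L": (0.8, 1.0), "R": (1.2, 1.0), "T": (1.0, 1.1)}
d = upper_joint_distances(skel1, skel2)
# d[("H", "R")] is the distance between subject 1's head and subject 2's right shoulder
```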
- Body orientation angle to the reference line: According to [31], being in each other’s field of view plays an important role in the social interaction between two persons. The relative body orientation between them is therefore an important clue to discriminate between interactions and non-interactions, where distance alone would not be sufficient. As shown in Fig. 3b, we consider the following two angles:
$$\begin{aligned} \alpha _{12}&=\angle (\mathbf {n_1}, \mathbf {m})&\alpha _{21}&=\angle (\mathbf {n_2}, -\mathbf {m}) \end{aligned}$$(1)
where \(\mathbf {n_1}\) and \(\mathbf {n_2}\) are the orientation vectors of the subjects (normal to the torso) and \(\mathbf {m}\) is the vector between their torsos.
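A minimal sketch of Eq. (1), assuming 2D torso positions and orientation normals (function names are illustrative):

```python
import numpy as np

def angle_between(u, v):
    """Unsigned angle (radians) between two 2D vectors."""
    u, v = np.asarray(u, float), np.asarray(v, float)
    cos = np.dot(u, v) / (np.linalg.norm(u) * np.linalg.norm(v))
    return float(np.arccos(np.clip(cos, -1.0, 1.0)))

def orientation_angles(n1, n2, torso1, torso2):
    """Eq. (1): alpha_12 = angle(n1, m), alpha_21 = angle(n2, -m),
    where m is the vector from subject 1's torso to subject 2's torso."""
    m = np.asarray(torso2, float) - np.asarray(torso1, float)
    return angle_between(n1, m), angle_between(n2, -m)

# Two subjects directly facing each other: both angles are zero.
a12, a21 = orientation_angles(n1=(1, 0), n2=(-1, 0), torso1=(0, 0), torso2=(2, 0))
```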
- Temporal similarity of the orientations: [15] demonstrated that speakers and listeners often synchronise their movements. Based on this, we compute the logarithm L of windowed moving covariance matrices (4 features) to estimate the temporal similarity between relative changes of the subject orientations during the time interval \([t-w,t]\):
$$L =\log (1+\mathrm {cov}(\alpha ^{t-w,\ldots ,t}_{12},\alpha ^{t-w,\ldots ,t}_{21}))$$(2)
where w is the window of reference (in our case \(w = 1\) s).
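Eq. (2) can be sketched as follows, taking the two angle sequences within the window \([t-w, t]\) as input (the sample-covariance convention of `numpy.cov` is our assumption):

```python
import numpy as np

def temporal_similarity(alpha12, alpha21):
    """Eq. (2): L = log(1 + cov(alpha12, alpha21)) over a window [t-w, t].
    Inputs are the two orientation-angle sequences sampled in that window."""
    cov = np.cov(alpha12, alpha21)[0, 1]  # off-diagonal element = covariance
    return float(np.log1p(cov))

# Perfectly synchronised angle changes give a positive similarity value.
a = [0.1, 0.2, 0.3, 0.4]
L = temporal_similarity(a, a)
```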
- O-space radius and oriented distance: According to the F-Formations theory by [16], social interactions occur when the transactional segments of the two subjects are overlapping. Interacting people stand on the border of a circular area (O-space), with their bodies oriented towards the centre. As shown in Fig. 3c, the O-space can be defined by (approximately) fitting a circle on the shoulders of the subjects and checking whether the normal vectors \(\mathbf {n_1}\) and \(\mathbf {n_2}\), from their torsos, lie inside or outside this space. The situation is fully captured by a set of features \([r, d_1^C, d_2^C]\), where r is the radius of the circle, and \(d_k^C\) (with \(k=1,2\)) is the distance between the extremity of the normal \(\mathbf {n_k}\) and the centre C. If \(d_k^C > r\), it means subject k is oriented towards the outside of the circle. Also, if \(r > r_{max}\), the two people are considered too far apart to be interacting. Note that, in this system, \(\mathbf {n_k}\) is a unit vector (1 m).
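A possible sketch of these features; the least-squares (Kasa) circle fit and the threshold value `r_max = 1.5` m are our assumptions, since the text only says the circle is fitted approximately on the shoulders:

```python
import numpy as np

def fit_circle(points):
    """Least-squares (Kasa) circle fit: returns centre C and radius r."""
    pts = np.asarray(points, float)
    A = np.column_stack([2 * pts[:, 0], 2 * pts[:, 1], np.ones(len(pts))])
    rhs = (pts ** 2).sum(axis=1)
    a, b, c = np.linalg.lstsq(A, rhs, rcond=None)[0]
    return np.array([a, b]), float(np.sqrt(c + a ** 2 + b ** 2))

def o_space_features(shoulders, torsos, normals, r_max=1.5):
    """Fit the O-space circle on the four shoulder points and compute
    [r, d1C, d2C]; d_kC > r means subject k faces away from the circle."""
    C, r = fit_circle(shoulders)
    d = [float(np.linalg.norm(np.asarray(t) + np.asarray(n) - C))
         for t, n in zip(torsos, normals)]
    interacting = r <= r_max and all(dk <= r for dk in d)
    return r, d, interacting

# Two subjects 1 m apart, facing each other with unit torso normals.
shoulders = [(0, 0.2), (0, -0.2), (1, 0.2), (1, -0.2)]
r, d, interacting = o_space_features(shoulders,
                                     torsos=[(0, 0), (1, 0)],
                                     normals=[(1, 0), (-1, 0)])
```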
- QTC\(_C\) relation: The Qualitative Trajectory Calculus (QTC) is a mathematical formalism introduced by [33] to describe spatial relations between two moving points. We use a particular version of the calculus, called QTC\(_C\), where the qualitative relations between two points \(P_k\) and \(P_l\) are expressed by the symbols \({q_i\in \{-,+,0\}}\) as follows:
  - \(q_1\): − if \(P_k\) is moving towards \(P_l\); 0 if \(P_k\) is stable with respect to \(P_l\); \(+\) if \(P_k\) is moving away from \(P_l\)
  - \(q_2\): same as \(q_1\), but swapping \(P_k\) and \(P_l\)
  - \(q_3\): − if \(P_k\) is moving to the left side of \(\overrightarrow{P_k P_l}\); 0 if \(P_k\) is moving along \(\overrightarrow{P_k P_l}\); \(+\) if \(P_k\) is moving to the right side of \(\overrightarrow{P_k P_l}\)
  - \(q_4\): same as \(q_3\), but swapping \(P_k\) and \(P_l\).

  A string of QTC symbols \(\{q_1, q_2, q_3, q_4\}\) is therefore a compact representation of the 2D relative motion between \(P_k\) and \(P_l\). For example, \(\{-, -, 0, 0\}\) means “\(P_k\) and \(P_l\) are moving straight towards each other”. Other examples can be observed in Fig. 4a. The 2D trajectories considered in our work are those of the people’s torsos.
- Temporal Histogram of QTC\(_C\) relations: QTC\(_C\) can be used to analyse sequences of torso trajectories using temporal histograms. In particular, we build two windowed moving histograms, with 9 bins each, splitting the QTC\(_C\) components in two sets: the first one considers the distance relations \((q_1,q_2)\), while the second captures the side relations \((q_3,q_4)\). This separation also has the advantage of reducing the total number of bins (\(2\cdot 3^2\) rather than \(3^4\)). An example of a QTC\(_C\) histogram is shown in Fig. 4b.
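A minimal sketch of the two 9-bin histograms over a window of QTC\(_C\) strings (the bin ordering is an arbitrary choice of ours):

```python
from collections import Counter
from itertools import product

SYMBOLS = "-0+"
BINS = list(product(SYMBOLS, SYMBOLS))  # 3^2 = 9 symbol pairs per histogram

def qtc_histograms(qtc_sequence):
    """Two 9-bin histograms over a window of QTC_C 4-tuples:
    one for the distance pair (q1, q2), one for the side pair (q3, q4)."""
    dist_counts = Counter((q[0], q[1]) for q in qtc_sequence)
    side_counts = Counter((q[2], q[3]) for q in qtc_sequence)
    h_dist = [dist_counts.get(b, 0) for b in BINS]
    h_side = [side_counts.get(b, 0) for b in BINS]
    return h_dist, h_side

# Toy window: two "approaching" steps, one sideways step.
window = [("-", "-", "0", "0"), ("-", "-", "0", "0"), ("0", "0", "+", "-")]
h_dist, h_side = qtc_histograms(window)
```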
4.2 Classification Features
- Covariance of inter-body joint distances: Similar to the upper joint distances of Sect. 4.1, but extended to 3D and computed on the full set of joints to deal with the more complex task of activity classification. All the 3D Euclidean distances between the 15 joints of an individual skeleton are used to fill a \(15\times 15\) matrix \(\mathbf {D}\). The 120 upper-triangular elements of its log-covariance matrix then constitute the actual features, which represent the relative variation in the position and body posture of the subjects. The matrix logarithm makes the covariance-based features more robust by mapping the covariance space into a Euclidean space [12].
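A possible sketch of this feature set; treating the rows of \(\mathbf{D}\) as variables for the covariance, the small diagonal regularisation, and the eigendecomposition-based matrix logarithm are all our assumptions:

```python
import numpy as np

def log_cov_features(joints):
    """15 x 3 joint positions -> 15 x 15 inter-joint distance matrix D,
    then the 120 upper-triangular elements of the matrix log of cov(D)."""
    J = np.asarray(joints, float)
    D = np.linalg.norm(J[:, None, :] - J[None, :, :], axis=-1)  # 15 x 15 distances
    C = np.cov(D) + 1e-6 * np.eye(len(J))   # regularise so the log is defined
    w, V = np.linalg.eigh(C)                # symmetric PD -> real matrix logarithm
    L = (V * np.log(w)) @ V.T
    iu = np.triu_indices(len(J))            # 15 * 16 / 2 = 120 elements
    return L[iu]

features = log_cov_features(np.random.default_rng(0).normal(size=(15, 3)))
```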
- Temporal covariance of inter-body joint distances: The temporal variation of the previous features is also considered by computing \(\mathbf {D}^{t}\) and \(\mathbf {D}^{t-n}\) at times t and \(t-n\), respectively, and their difference \(\mathbf {R}^t=\mathbf {D}^t-\mathbf {D}^{t-n}\). The upper-triangular elements of the log-covariance of \(\mathbf {R}^t\) are the final features in this case. Like the previous set, this one is also composed of 120 features.
- Minimum distance to torso: Two more social features are derived by calculating all the 3D distances between the joints of subject 1 and the torso of subject 2, then taking the minimum, and vice-versa (subject 2 to subject 1).
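A minimal sketch of one direction of this feature (the function name and toy data are illustrative; the same call with the roles swapped gives the second feature):

```python
import numpy as np

def min_dist_to_torso(joints_a, torso_b):
    """Minimum 3D distance between any joint of one subject and the
    other subject's torso position."""
    diffs = np.asarray(joints_a, float) - np.asarray(torso_b, float)
    return float(np.linalg.norm(diffs, axis=1).min())

# Toy data: two joints at 1 m and 2 m from the other subject's torso.
m = min_dist_to_torso([[0.0, 0.0, 0.0], [3.0, 0.0, 0.0]], [1.0, 0.0, 0.0])
```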
- Accumulated energy of the torsos: These features make it possible to discriminate the most active person (e.g. who is approaching the individual space of the other). They include the distance from torso to torso, plus the energy E depending on the distance variations of all the joints of a subject to the torso of the other:
$$E = \sum \nolimits _i v_i^2 \quad \mathrm {with} \quad v_i = d^t_{i,T} - d^{t-n}_{i,T}$$(3)
where \(d^t_{i,T}\) is the distance, at time t, of the ith joint of a subject to the torso T of the other, and \([t-n,t]\) is the considered time interval. Two energy features, one for each subject, are computed.
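Eq. (3) can be sketched as follows, given the joint positions of one subject at times \(t\) and \(t-n\) and the torso position of the other (assumed static here for simplicity):

```python
import numpy as np

def accumulated_energy(joints_t, joints_tn, torso_other):
    """Eq. (3): E = sum_i v_i^2 with v_i = d^t_{i,T} - d^{t-n}_{i,T},
    the change of each joint's distance to the other subject's torso."""
    T = np.asarray(torso_other, float)
    d_t = np.linalg.norm(np.asarray(joints_t, float) - T, axis=1)
    d_tn = np.linalg.norm(np.asarray(joints_tn, float) - T, axis=1)
    return float(((d_t - d_tn) ** 2).sum())

# A single joint that moved 0.1 m closer to the other torso: E = 0.01.
E = accumulated_energy([[0.9, 0, 0]], [[1.0, 0, 0]], [0, 0, 0])
```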
5 Interaction Segmentation
6 Social Activity Classification
6.1 Dynamic Bayesian Mixture Model
6.2 Multi-Merge DBMM
6.3 Proximity-Based Priors
6.4 Combined Model
7 Experiments
7.1 Social Activity Dataset
| | No Segm. [6] | Man. Segm. | Aut. Segm. |
|---|---|---|---|
| **MM-DBMM with no priors** | | | |
| % Accuracy | 92.13 | 97.65 | 97.02 |
| % Precision | 45.39 | 76.96 | 68.46 |
| % Recall | 83.79 | 76.08 | 68.20 |
| **MM-DBMM with multivariate priors** | | | |
| % Accuracy | 91.33 | 97.86 | 97.08 |
| % Precision | 59.72 | 79.42 | 70.62 |
| % Recall | 79.30 | 81.06 | 71.23 |
| **MM-DBMM with GMM priors** | | | |
| % Accuracy | 92.11 | 98.69 | 97.86 |
| % Precision | 67.65 | 88.01 | 78.52 |
| % Recall | 84.19 | 86.56 | 76.13 |
7.2 Overall System Performance
| | % Accuracy | % Precision | % Recall |
|---|---|---|---|
| Segmentation | 92.26 | 92.26 | 92.26 |
| HMM interval | 1 s | 2 s | 3 s | Full-Seq. |
|---|---|---|---|---|
| Segm. accuracy (%) | 92.15 | 92.31 | 92.20 | 92.26 |
| | Handshake | Hug | Help walk | Help stand | Fight | Push | Talk | Draw attention |
|---|---|---|---|---|---|---|---|---|
| % False negatives | 2.06 | 1.64 | 1.45 | 7.56 | 16.66 | 7.98 | 3.89 | 58.75 |
| % False positives | 1.25 | 1.00 | 0.52 | 18.04 | 14.45 | 4.85 | 24.41 | 35.48 |