5.1 Emotional convergence
It is widely acknowledged that emotions are closely related to facial expressions, speech, and physiology [7, 62, 63]. Hence, similarity of emotions also leads to similarity in emotional expression, and emotional convergence can therefore potentially be assessed by analyzing the similarity between the facial expressions, physiology, and speech parameters of two or more interacting individuals. From this perspective, measuring emotional convergence might seem simple. Nonetheless, a number of challenges still need to be resolved before emotional convergence can be measured automatically. In the following paragraphs, I present a step-by-step approach to measuring emotional convergence.
The first step to measuring emotional convergence is to track facial, speech, and/or physiological signals from two or more users. With the increasing availability of wireless, unobtrusive measurement platforms, this has become relatively simple. For instance, the sociometer badge from Pentland and colleagues [130] can unobtrusively track speech. Physiological signals can be measured through wireless, unobtrusive wearable sensors [179]. Depending on the application, developers might choose which modalities are most useful. When users are mobile, tracking facial expressions might be difficult, and speech and physiological signals could be more appropriate. On the other hand, in video conferencing applications, facial expressions can easily be tracked by the cameras used for the video recordings.
Second, individual differences in expressivity and reactivity should be taken into account. It is well known that there are strong inter-individual differences in emotional expressivity and baseline levels of physiological signals [102, 111]. Such differences might be corrected for by longer-term measurements that can be used as baselines [175]. This is easy in lab situations, where baseline measurements are often done, but can be difficult in practical applications. Therefore, another way of dealing with individual differences is to focus on changes instead of absolute values of the signals. This could, for instance, be done by employing first-order differences:
$$ \Delta x_i = x_i - x_{i-1}$$
(1)
where $x_i$ is one sample of the signal at time $i$ and $x_{i-1}$ is the sample at time $i-1$. The difference in timing depends on the sample frequency of the signal, which can be different for different modalities. For instance, skin conductance changes relatively slowly and can be measured at, for instance, 2 Hz [171], whereas video signals are often recorded at 25 Hz, which is necessary to deal with the relatively fast changes in facial expressions. Hence, the temporal characteristics of the empathy analysis depend on the type of modalities that are used. Finally, individual differences can be tackled by using a baseline matrix, as employed by Picard and colleagues [135] and Van den Broek and colleagues [172].
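To illustrate, a minimal sketch of this differencing step in Python; the function name and the use of NumPy are illustrative choices, not part of the pipeline described above:

```python
import numpy as np

def first_order_differences(signal):
    """Return the first-order differences of a 1-D signal (Eq. 1).

    Working with differences instead of absolute values removes
    person-specific baseline offsets from the signal.
    """
    x = np.asarray(signal, dtype=float)
    return np.diff(x)  # element i holds x[i+1] - x[i]
```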
Third, different low-level features can be extracted that capture relevant properties of the modalities. For facial expressions, features are often values of different points on the face, for instance, from the facial action coding system [126]. These can also be measured dynamically [167]. Speech features might be intensity or pitch, which are often employed in affective computing research [65, 103, 150, 151]. Physiological features that are likely to be coupled to emotional convergence are skin conductance level and heart rate variability, as these are strongly coupled to emotions [20, 27, 31].
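As an illustration of such low-level features, the following sketch computes a windowed skin conductance level and a simple time-domain heart rate variability index (RMSSD). The window length and the choice of these particular indices are my own illustrative assumptions:

```python
import numpy as np

def skin_conductance_level(sc, fs, window_s=10.0):
    """Mean skin conductance per non-overlapping window,
    a common skin conductance level (SCL) feature."""
    n = int(window_s * fs)
    trimmed = np.asarray(sc, dtype=float)[: len(sc) // n * n]
    return trimmed.reshape(-1, n).mean(axis=1)

def rmssd(ibi_ms):
    """Root mean square of successive interbeat-interval differences,
    a simple time-domain heart rate variability index."""
    d = np.diff(np.asarray(ibi_ms, dtype=float))
    return np.sqrt(np.mean(d ** 2))
```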
Fourth, the extracted features from the two individuals should be synchronized. Considering the temporal aspect of the signals is important, as similarity in the expressions not only entails similarity at one point in time, but also similarity in change of the signals. Therefore, it is necessary to take into account signals over time. Moreover, there might be a time lag between the sender of an expression and a receiver responding to that expression [110]. Testing for time lags can be done by comparing the signals at different lags (for instance, in a range of −5 to +5 seconds) and seeing if similarity increases or decreases [143]. When typical time lags are known, they can be applied by shifting the signal in time. Hence, synchronization (at a certain lag) of the signals is an important aspect of emotional convergence measurement. This might be easy to achieve in laboratory situations, but can be difficult in practical real-world applications, as synchronization requires timestamped signals from all users and a method to synchronize them (i.e., a handshake mechanism). Moreover, if different users are using different systems, the systems should use the same method for handshaking.
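The lag test can be sketched as follows, assuming two equal-length signals that have already been resampled to a common rate; the function name and parameters are illustrative:

```python
import numpy as np

def lagged_correlations(x, y, fs, max_lag_s=5.0):
    """Pearson correlation between two synchronized, equal-length
    signals over a range of lags (default -5 s to +5 s).
    Positive lags shift y later in time relative to x.
    Returns (lags in seconds, correlations)."""
    max_lag = int(max_lag_s * fs)
    lags = np.arange(-max_lag, max_lag + 1)
    corrs = []
    for lag in lags:
        if lag < 0:
            xs, ys = x[-lag:], y[: len(y) + lag]
        elif lag > 0:
            xs, ys = x[: len(x) - lag], y[lag:]
        else:
            xs, ys = x, y
        corrs.append(np.corrcoef(xs, ys)[0, 1])
    return lags / fs, np.array(corrs)
```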
Finally, when the relevant features are extracted and synchronized, different algorithms can be used to assess the similarity between the values of the same feature extracted from the different individuals. Table 1 presents four different classes of algorithms that can be used to do this.
Table 1 Different algorithms that can be used to calculate similarity or dissimilarity between two temporal signals

| Algorithm | Description |
| --- | --- |
| Correlation | Time-domain similarity measure giving a value in [−1, 1]. For continuous signals a Pearson correlation can be used, whereas Kendall and Spearman indices measure correlations between ranked or ordinal data. |
| Coherence | Frequency-domain similarity measure giving a value in [0, 1]. Sometimes, weighted coherence is used by correcting for the total power within the spectrum. |
| ARMA models | Model of individual time series using autoregressive and moving average components. Predictions can be made by regressing different people's ARMA models onto each other. |
| Divergence | Class of stochastic dissimilarity measures, including, for instance, Kullback-Leibler and Cauchy-Schwarz divergences. |
Correlation is, for instance, used in the synchrony detection algorithms of Nijholt and colleagues [123], Watanabe and colleagues [176], and Ramseyer and Tschacher [140]. Coherence has been used by Henning and colleagues [86]. In these cases, it is important that appropriate corrections for autocorrelations within a signal are made [29, 37]. A simple way to do this is to use first-order differences of the calculated signals (Eq. 1). A more sophisticated way is to construct autoregressive moving average (ARMA) models that explicitly model the autocorrelations [76]. Subsequently, it can be tested how well the ARMA models of different individuals predict each other. This is the approach that has been taken by, for instance, Levenson and Ruef [110]. A third way of correcting for autocorrelations was proposed by Ramseyer and Tschacher [141]: shuffling the signal from one individual to see if it still correlates with the other individual's signal. If the correlations are similar to those from the unshuffled data, they are not due to synchronization. Finally, divergence measures can be used to calculate (dis)similarity. These have, to my knowledge, not been applied in an empathy-related context. Examples include Kullback-Leibler and Cauchy-Schwarz divergences, among others (see [173] for a review).
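As an illustration of the shuffling idea, a simplified sample-level variant is sketched below; note that this is not the published procedure, which shuffles larger segments of the signal rather than individual samples:

```python
import numpy as np

def shuffle_test(x, y, n_shuffles=1000, seed=0):
    """Surrogate test in the spirit of the shuffling approach:
    compare the observed correlation between two signals with the
    distribution of correlations obtained after randomly shuffling
    one of them. Returns the observed correlation and the fraction
    of shuffled correlations at least as large (a pseudo p-value)."""
    rng = np.random.default_rng(seed)
    observed = np.corrcoef(x, y)[0, 1]
    shuffled = np.array([
        np.corrcoef(rng.permutation(x), y)[0, 1]
        for _ in range(n_shuffles)
    ])
    return observed, np.mean(np.abs(shuffled) >= np.abs(observed))
```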
Besides these relatively general classes of algorithms, the literature also contains some ad hoc similarity scores that have seemed to work. A simple algorithm is to look at similarity in the direction of change over time, with higher percentages of similar directions in the data indicating higher similarity. This can be expressed as
$$ \frac{1}{N} \sum_{i = 1}^N \delta(\Delta x_i \cdot\Delta y_i \geq0)$$
(2)
where $\delta(\cdot)$ returns 1 if its argument is true, and 0 otherwise. Another way is to calculate slopes over specific time windows and take the absolute value of the differences between the slopes of two different users [85, 86]. This can be expressed as
$$ \frac{1}{|S|} \sum_{i = 1}^{|S|} \left| S_{i} - T_{i} \right|$$
(3)
where $S$ and $T$ are two synchronized vectors of slopes of the signal and $|S|$ indicates the length of those vectors. Finally, calculating slopes over specific windows can also be used as input for correlation scores over the time domain. Hence, instead of calculating cross correlations over the signals themselves, they are calculated over sets of slopes. This has been called the concordance score by Marci and colleagues [113, 114].
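Minimal sketches of Eqs. 2 and 3, assuming equal-length synchronized signals; the window length and the least-squares slope estimate are my own illustrative choices:

```python
import numpy as np

def directional_agreement(x, y):
    """Eq. 2: fraction of time steps in which two signals change in
    the same direction (their first-order differences share a sign)."""
    dx, dy = np.diff(x), np.diff(y)
    return np.mean(dx * dy >= 0)

def slope_distance(x, y, window):
    """Eq. 3: mean absolute difference between per-window slopes of
    two synchronized signals (lower values mean higher similarity)."""
    def slopes(sig):
        n = len(sig) // window
        t = np.arange(window)
        return np.array([
            np.polyfit(t, sig[k * window:(k + 1) * window], 1)[0]
            for k in range(n)
        ])
    s, t = slopes(x), slopes(y)
    return np.mean(np.abs(s - t))
```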
Using multiple modalities can considerably improve the performance of similarity measurement. This is because many of the modalities are responsive not only to affective changes, but also to cognitive and physical changes [33]. For instance, it is well known that cognitive workload or physical exercise influences heart rate and skin conductance. Another example is that it can be problematic to track facial expressions while a person is eating, because facial muscles are then activated as well. Therefore, combining measurements from multiple modalities and checking whether they match up can give much more precise indications of synchronization. Furthermore, physiological measures and speech parameters tend to tap into the arousal component of emotions, whereas facial expressions mostly relate to valence [4]. Hence, different modalities carry different information, so combining modalities can also give a more complete picture of emotional convergence.
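One very simple way of combining modalities is a weighted mean of per-modality convergence scores, sketched below under the assumption that each modality already yields a comparable score; this is an illustrative choice, not a recommendation from the literature:

```python
import numpy as np

def fuse_convergence_scores(scores, weights=None):
    """Combine per-modality convergence scores (e.g., facial, vocal,
    and physiological similarity) into one estimate via a weighted
    mean. More elaborate fusion schemes are of course possible."""
    s = np.asarray(scores, dtype=float)
    w = np.ones_like(s) if weights is None else np.asarray(weights, dtype=float)
    return float(np.sum(w * s) / np.sum(w))
```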
In sum, I presented an empathy measurement pipeline based on the measurement of facial, speech, and physiological signals. First, signals have to be preprocessed and normalized. Then, relevant features have to be extracted and synchronized in time (with a possible lag). Finally, once these features are extracted, their similarity has to be established by a similarity algorithm.
5.2 Empathic responding
For the third component of empathy, empathic responding, it is most important to measure whether a response reflects mainly sympathy or mainly personal distress. Unfortunately, there has not been much research that explicitly examines the differences between such responses, so there is a clear need to identify specific behavioral and physiological responses accompanying either sympathy or personal distress. Nonetheless, three different strategies can potentially be used to track whether empathic responses are based mainly on sympathy or on personal distress.
The first strategy is to track specific nonverbal behavior that is related to sympathy or personal distress. Zhou and colleagues [190] present a review of facial and vocal indices related to empathic responding, based on studies of human-coded behavioral responses to empathy-invoking stimuli (e.g., videotapes of others in need or distress). They suggest that specific sympathy-related behaviors are found in signals of concerned attention. Typical examples of such behaviors are eyebrows pulled down and inward over the nose, forward head leans, a reassuring tone of voice, and sad looks. A study by Smith-Hanen [160] reported that an arms-crossed position was related to low sympathy. Behaviors related to personal distress are fearful or anxious expressions. Typical examples of such expressions are lip-biting [56], negative facial expressions, sobs, and cries. This is a very limited set of behaviors related to empathic responding, and I therefore agree with Zhou and colleagues [190], who state that “more information on empathy-related reactions in every-day life is needed” (p. 279).
Another way of approaching the measurement of empathic responding is to assess to what extent the individuals share the same emotional state. With personal distress, the similarity in emotional state is likely to increase (as both interactants are truly distressed), whereas sympathy is likely to lead to less distress. This may be captured by different levels of emotional convergence: with high emotional convergence, personal distress is more likely, whereas low emotional convergence is more indicative of sympathy. Hence, for automated measurements it may be sufficient to threshold emotional convergence to see whether a response reflects sympathy or personal distress. Nonetheless, not responding at all would also lead to low emotional convergence, without reflecting sympathy. Hence, this strategy cannot be used on its own, but might have value as an additional measurement of empathic responding.
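This thresholding idea can be sketched as a toy decision rule; the threshold value is purely illustrative and would need to be calibrated, and the caveat about non-responses is reflected in the first branch:

```python
def classify_response(convergence, responded, threshold=0.5):
    """Toy decision rule for the thresholding idea: high emotional
    convergence suggests personal distress, low convergence plus an
    observed response suggests sympathy; no response is left undecided."""
    if not responded:
        return "no response / undecided"
    return "personal distress" if convergence >= threshold else "sympathy"
```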
The third strategy for measuring empathic responses is related to the notion that effortful control is involved in regulating emotional convergence. On the one hand, when high levels of effortful control are applied, reactions reflect sympathy. On the other hand, when effortful control is lacking, emotional convergence processes lead to personal distress. Hence, tracking regulation processes could give an indication of empathic responding. A wide variety of studies has shown that respiratory sinus arrhythmia (RSA; sometimes referred to as heart rate variability) is an indicator of emotion regulation [48, 49, 136], especially during social interaction [31]. RSA is an index of periodic changes in heart rate related to breathing and provides an index of parasympathetic activity of the autonomic nervous system [20, 79]. Between-person differences in RSA have been related to individual differences in emotional flexibility (i.e., the ease with which one's emotions can change; [17]). Within-person changes in RSA have been related to the activation of emotion regulation processes [69, 148]. Hence, RSA could also be a useful index for tracking empathic responses.
RSA can be measured by transforming the interbeat intervals of an ECG signal to the frequency domain. Subsequently, the power in the high-frequency range (0.15–0.40 Hz; [137]) can be calculated as an index of RSA. Because this power can also be influenced by respiration rate and volume, it is often corrected for respiration parameters as well [79].
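A minimal sketch of this computation, assuming a series of interbeat intervals in milliseconds; the resampling rate and spectral estimation settings are illustrative, and no respiration correction is applied:

```python
import numpy as np
from scipy.signal import welch

def rsa_hf_power(ibi_ms, resample_fs=4.0, band=(0.15, 0.40)):
    """Estimate high-frequency heart rate variability power as an RSA
    index: resample the interbeat-interval series to an even time grid,
    estimate its power spectrum, and integrate the 0.15-0.40 Hz band."""
    ibi = np.asarray(ibi_ms, dtype=float)
    beat_times = np.cumsum(ibi) / 1000.0            # beat times in seconds
    grid = np.arange(beat_times[0], beat_times[-1], 1.0 / resample_fs)
    ibi_even = np.interp(grid, beat_times, ibi)     # evenly sampled IBI series
    freqs, psd = welch(ibi_even - ibi_even.mean(), fs=resample_fs,
                       nperseg=min(256, len(ibi_even)))
    mask = (freqs >= band[0]) & (freqs <= band[1])
    return np.trapz(psd[mask], freqs[mask])         # HF power in ms^2
```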
In sum, there are several possible approaches to the automated measurement of empathic responses. It needs to be stressed that there has been (almost) no research on using these approaches, and their feasibility and performance remain to be determined in future studies. Finally, the three strategies are not mutually exclusive, and combining them would likely provide the best solution for automated measurement of sympathy and personal distress.