Published in: Multimedia Systems 4/2023

Open Access 03-06-2023 | Regular Paper

Multi-modal humor segment prediction in video

Authors: Zekun Yang, Yuta Nakashima, Haruo Takemura


Abstract

Humor can be induced by various signals in the visual, linguistic, and vocal modalities emitted by humans. Finding humor in videos is an interesting but challenging task for an intelligent system. Previous methods predict humor in the sentence level given some text (e.g., speech transcript), sometimes together with other modalities, such as videos and speech. Such methods ignore humor caused by the visual modality in their design, since their prediction is made for a sentence. In this work, we first give new annotations to humor based on a sitcom by setting up temporal segments of ground truth humor derived from the laughter track. Then, we propose a method to find these temporal segments of humor. We adopt an approach based on sliding window, where the visual modality is described by pose and facial features along with the linguistic modality given as subtitles in each sliding window. We use long short-term memory networks to encode the temporal dependency in poses and facial features and pre-trained BERT to handle subtitles. Experimental results show that our method improves the performance of humor prediction.

1 Introduction

Humor provokes laughter and provides amusement. It is an important medium to demonstrate our emotions and has become an essential tool in our daily life [1]. Humor can be used to draw people’s attention and relieve stressful or embarrassing situations. By properly using humor, communication between people will become easier and smoother.
Understanding humor is also important for human–machine communications (e.g., robots [2, 3] and virtual agents [4, 5]). A machine may interact with us in a more comprehensive manner, ultimately taking our emotions into its decision-making to respond to our various needs. Meanwhile, understanding humor is a challenging task for a machine in both computer vision and natural language processing communities because it requires a deeper knowledge of signals from people in visual (e.g., poses, gestures, and appearances), vocal (e.g., tones), and linguistic (e.g., puns) modalities, as well as their combinations [6], which can induce humor.
In recent years, some methods have been proposed to predict humor using both single and multiple modalities, often accompanied by a dedicated dataset [7–12]. Single-modal humor prediction mainly uses the linguistic modality [13–15], while multi-modal humor prediction combines information from different modalities [6, 16–18]. The ground-truth labels of these methods are usually associated with blocks of text, like sentences and dialogues, while signals from other modalities are often treated as supplementary. In the real world, however, humor is not necessarily tied to text; it can be invoked even in silence with funny actions and facial expressions, which are often ignored in tasks driven by the linguistic modality. To cover broader variations of humor, we need another problem formulation of humor prediction.
In this work, we present a new humor prediction task. Unlike previous tasks that provide humor-related annotations based on a single sentence or a set of dialogues [16, 17], our proposed task provides temporal segments that are associated with humor as ground-truth labels, as shown in Fig. 1. We also propose a new method for humor prediction, which makes predictions with a sliding window. The method uses multimodal data within each window, i.e., video frames and subtitles. Our method aggregates subtitles as well as pose and facial features from video frames, which are then fed into our model. We convert these sliding-window predictions to temporal segments comparable with the ground-truth segments.
The main contributions in our work are three-fold.
1. We give a new definition to humor by setting up temporal segments that are associated with humor as ground-truth labels. Such a definition covers a wider variety of humorous moments, even those without associated text (or utterances).
2. We propose a method to find these temporal segments, which can handle humor invoked solely by the visual modality. Our method uses the visual modality through poses and facial features in video frames as well as the linguistic modality through subtitles as input. Prediction is done over a sliding window, which is comparable with our ground truth.
3. We compare different combinations of input features to show which combination is the best for our humor prediction task.
The rest of this work is arranged as follows: Sect. 2 reviews previous work related to humor prediction; Sect. 3 introduces our task and dataset; Sect. 4 presents our method to predict humor; Sect. 5 shows the experimental results; and Sect. 6 concludes this work.

2 Related work

Methods for humor prediction usually take features obtained from text, images, and audio as inputs and output a prediction of whether the input is associated with humor or not. Single-modal humor prediction methods mainly use the linguistic modality. For example, Weller et al. [13] proposes a task that takes text from Reddit pages as input and judges whether it is humorous or not based on the ratings. Fan et al. [14] uses an internal and external attention neural network for short-text humor detection. Czapla et al. [15] applies a pre-trained language model to predict humor in Spanish tweets. All these methods make their predictions based only on text input. However, in the real world, humor can be invoked by other modalities. A multi-modal approach is necessary to broaden the application of humor prediction.
Multi-modal humor prediction methods combine information from different modalities. For example, Hasan et al. [6] uses subtitle, visual, and audio features in TED talk videos. Patro et al. [17] builds a dataset based on the famous sitcom The Big Bang Theory and gives several baselines to predict humor based on both the visual and language modalities. Kayatani et al. [16] also uses the same TV drama series as their testbed and presents a model to predict whether an utterance of a character causes laughter based on subtitles as well as facial features and the identity of the character. Yang et al. [18] obtains humor labels in videos based on user comments together with visual and audio features. The ground-truth humor labels in these methods are mainly associated with texts, and a prediction is made for a sentence. Our ground-truth annotation, in contrast, is given as a segment specified by start and end time stamps, which allows covering humor invoked by various modalities.

3 Dataset and task

In this work, we give new annotations to humor labels by setting up temporal segments of humor based on the datasets of Patro et al. [17] and Kayatani et al. [16], which use the famous sitcom The Big Bang Theory. The videos in this sitcom TV drama series contain canned laughter (or laughter tracks). Though such canned laughter is not equivalent to humor in general, we believe that laughter is added if and only if humor is present in a sitcom. This means that, at least in such a designed circumstance, laughter can be a good proxy for the presence of humor and gives a relatively objective criterion to identify where humor happens.1 Hence, we use canned laughter to make ground-truth humor segments automatically (i.e., our ground-truth humor segment annotations are derived from the laughter track).
To do this, we follow Kayatani et al. [16] and subtract the left and right channels of the audio track to cancel the characters' speech. Then we apply low-pass filtering to the subtracted signal and take the Hilbert transform to obtain its wave envelope. This envelope basically gives larger values for canned laughter, jingles, music, etc. Unlike [16], which annotates a humor label for each sentence, we want to make temporal segments of humor as shown in Fig. 2. We thus set a threshold on the wave envelope and define the samples above the threshold as humor to form raw temporal segments of humor. We then review all the extracted segments manually to remove non-laughter segments and finalize the humor segments (i.e., fixed humor segments). Our dataset thus consists of video frames, subtitles, and humor segments with start and end time stamps.
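The following is a minimal sketch of this envelope-based extraction of raw laughter segments, assuming SciPy for the filtering and Hilbert transform; the filter cutoff, the relative threshold, and all function names are our own illustrative choices rather than the exact settings used in the paper.
```python
# A sketch of extracting raw laughter segments from a stereo audio track.
# Cutoff frequency and threshold are illustrative assumptions.
import numpy as np
from scipy.signal import butter, filtfilt, hilbert

def laughter_segments(left, right, sr, cutoff_hz=400.0, rel_threshold=0.1):
    """Return (start, end) times in seconds where the envelope of the
    speech-cancelled signal exceeds a threshold."""
    # Subtracting the stereo channels attenuates centre-panned speech,
    # leaving canned laughter, jingles, music, etc.
    residual = left.astype(np.float64) - right.astype(np.float64)

    # Low-pass filter the residual, then take its Hilbert envelope.
    b, a = butter(4, cutoff_hz / (sr / 2), btype="low")
    envelope = np.abs(hilbert(filtfilt(b, a, residual)))

    # Threshold the envelope and group consecutive samples into segments.
    above = envelope > rel_threshold * envelope.max()
    segments, start = [], None
    for i, flag in enumerate(above):
        if flag and start is None:
            start = i
        elif not flag and start is not None:
            segments.append((start / sr, i / sr))
            start = None
    if start is not None:
        segments.append((start / sr, len(above) / sr))
    return segments  # raw segments; manual review removes non-laughter spans
```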
Table 1  Statistics of the dataset

  Number of seasons     10
  Number of episodes    228
  Total duration        76:33:50
  Number of segments    63,814
  Number of subtitles   74,217

  Humor segments        Count: 31,851
                        Duration: min 0.042 s, avg 2.254 s, max 18.458 s, total 19:56:32
  Non-humor segments    Count: 31,963
                        Duration: min 0.042 s, avg 6.377 s, max 76.792 s, total 56:37:18
  Humor subtitles       Count: 33,408; average words: 7.38
  Non-humor subtitles   Count: 40,809; average words: 7.72

Total durations are shown in hh:mm:ss
The statistics of the dataset are shown in Table 1. The number of humor segments is quite large: a single episode has almost 140 humor segments on average, and more than one-fourth of the total duration contains laughter. As for the linguistic modality, we call subtitles that end within a humor segment humor subtitles. Subtitles that start within a humor segment but end outside any humor segment are not counted as humor subtitles and are instead referred to as non-humor subtitles. The table shows that more than 44% of the subtitles are associated with one of the humor segments. Figure 3 shows the distributions of the top-20 words (counted over humor sentences) for humor and non-humor sentences, where stop words and characters' names are removed. Considering the difference in the number of humor and non-humor subtitles, we would say that these two distributions do not differ much.
Different from previous work that merely judges whether a sentence or a set of dialogues is humorous or not, our task requires localizing humor segments based on video frames and subtitles. Note that there can be humor segments caused solely by acoustic signals (e.g., making a funny noise that cannot be transcribed); however, our task does not use the audio tracks since they have canned laughter, which is used to obtain the ground-truth humor segments.

4 Finding humor segments in video

Figure 4 is an overview of our method. We cast our humor segment prediction task as humor/non-humor prediction over sliding windows. To predict humor within each window, we represent the video frames by sequences of the characters' poses and faces, which are handled by the pose flow and the face flow, respectively. The subtitles within each sliding window go through the language flow. We use late fusion to summarize the prediction scores from the different flows and obtain per-window predictions. These per-window predictions are then converted to temporal segments.
For the i-th window \({w}_{i}\), we aggregate video frames \({V}_{i}= \{{v}_{ij} \mid {j}=1,\ldots ,{J}\}\) and subtitles \({S}_{i} = \{{s}_{ik} \mid {k}=1,\ldots ,{K}_{i}\}\) within it as input, where J and \({K}_{i}\) are the numbers of frames and subtitles in \({w}_{i}\) (\({K}_{i}\) can vary for different windows), respectively. Note that, as in Sect. 3, we include the subtitles that end inside the window, while we do not include those subtitles that start inside but end outside the window. We use a neural network-based model to make humor/non-humor prediction \({h}_{i}\) for \({w}_{i}\).
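The sketch below illustrates this per-window aggregation of frames and subtitles, assuming frame timestamps and subtitle records with start/end times; the data structures, function name, and the window length and shift values are illustrative (the 8 s / 2 s defaults follow Sect. 5).
```python
# A sketch of aggregating frames and subtitles into sliding windows.
from typing import Dict, List

def build_windows(frame_times: List[float],
                  subtitles: List[Dict],     # each: {"start": s, "end": e, "text": t}
                  video_len: float,
                  win_len: float = 8.0,      # window length in seconds
                  shift: float = 2.0):       # shift between consecutive windows
    windows = []
    start = 0.0
    while start + win_len <= video_len:
        end = start + win_len
        # V_i: frames whose timestamps fall inside the window.
        frames = [t for t in frame_times if start <= t < end]
        # S_i: only subtitles that *end* inside the window are included;
        # subtitles that start inside but end outside are excluded.
        subs = [s for s in subtitles if start <= s["end"] < end]
        windows.append({"start": start, "end": end,
                        "frames": frames, "subtitles": subs})
        start += shift
    return windows
```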
Humor is sometimes induced by funny poses and facial expressions. Previous work [19] found that non-verbal humor based on gestures, facial expressions, or whole-body movement makes the robot more human-like and more entertaining. Motivated by this finding, we use two flows in the visual modality to represent poses and facial expressions of the characters in \({w}_{i}\) respectively. For the linguistic modality, previous work [16, 17] used BERT, a famous language Transformer, to represent subtitles and achieved good performance. Thus, we follow them to model the dependency among all subtitles \({s}_{ik}\) in \({S}_{i}\) with BERT.

4.1 Pose flow

Some funny actions can make people laugh. Poses in the video frames can be seen as reflections of such actions, and their features can be crucial for humor prediction. We compare 2-D and 3-D pose features: (1) The first uses OpenPose [20] to detect joint positions in the video frames in \(V_i\). For each person in each video frame, we obtain a 3M-D vector containing the 2-D coordinates of each joint and the confidence score given by OpenPose, where \(M = 25\). (2) The second converts the 2-D joint coordinates to 3-D coordinates with a 3-D pose estimation baseline [21] pre-trained on the Human3.6M dataset [22], which maps the \(M = 25\) joints to \(M' = 17\) joints to fit the Human3.6M model. We obtain a 51-D vector containing all coordinates of the joints in 3-D space. For either kind of pose feature, the entries in the vector for undetected joints are set to 0.
Confidence score \(c^{{\text {P}}}_m\) for joint \(m = 1,\ldots , M\) is related to the visibility of the key point. We believe such a confidence score may somehow represent the importance of the corresponding person in the scene since the main characters in a scene tend to be placed around the center of the frame in bigger sizes. We thus calculate the average confidence score \(\bar{c}^{{\text {P}}}\) for each person by:
$$\begin{aligned} \bar{c}^{{\text {P}}}=\frac{1}{M} \sum _{m=1}^{M}{c}^{{\text {P}}}_{m}. \end{aligned}$$
(1)
Then, we rank the characters in the scene based on \(\bar{c}^{{\text {P}}}\) and select the top-3 characters for both 2-D and 3-D poses. Note that we still use the confidence scores obtained with OpenPose for 3-D poses (i.e., the same \(\bar{c}^{{\text {P}}}\) is used for both 2-D and 3-D poses) because the 3-D poses are derived from the 2-D poses.
Let \(x^{{\text {P}}}\) denote the vector of pose features (either 2-D or 3-D) for a single character. We feed \(x^{{\text {P}}}\) into FC layers and max-pool the outputs over the characters to obtain a 128-D pose vector \({p}_{ij}\) (\({j}=1,\ldots ,{J}\), where J denotes the number of video frames in the sliding window) for each frame. We concatenate the frames' pose vectors and feed them into a long short-term memory (LSTM) layer with hidden states \(d_{ij}^{P}\). The hidden state corresponding to the last frame (i.e., \(d_{iJ}^{P}\)) is then fed into an FC layer to obtain the score vector \(e^{{\textrm{P}}}_i \in [0, 1]^2\) for the pose flow.
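A PyTorch sketch of this flow is given below; the dimensions stated above (3M-D input per character, 128-D per-frame vector, 2-D score) are kept, but the number of FC layers and hidden sizes are our own assumptions. The face flow in Sect. 4.2 follows the same structure with facial features as input.
```python
# A sketch of the pose flow: per-character FC layers, max-pooling over the
# top-3 characters, an LSTM over frames, and an FC output layer.
import torch
import torch.nn as nn

class PoseFlow(nn.Module):
    def __init__(self, joint_dim=75):           # 3M-D with M = 25 (2-D poses)
        super().__init__()
        self.fc = nn.Sequential(nn.Linear(joint_dim, 128), nn.ReLU(),
                                nn.Linear(128, 128))
        self.lstm = nn.LSTM(input_size=128, hidden_size=128, batch_first=True)
        self.out = nn.Linear(128, 2)             # score vector e^P_i

    def forward(self, x):
        # x: (batch, J frames, 3 characters, joint_dim); the 3 characters are
        # those with the highest average OpenPose confidence (Eq. 1).
        per_char = self.fc(x)                    # (batch, J, 3, 128)
        p = per_char.max(dim=2).values           # max-pool over characters
        _, (h_n, _) = self.lstm(p)               # hidden state of the last frame
        return self.out(h_n[-1])                 # (batch, 2)
```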

4.2 Face flow

Exaggerated facial expressions can also cause laughter. We model such facial expressions in a similar way to the pose flow. We adopt two types of facial features, landmark positions and action units (AUs) [23]: For landmark positions, we use a variant of OpenPose to detect facial landmarks in the video frames in \(V_i\). For each person in \(V_i\), we obtain a 3N-D vector containing the 2-D coordinates of the facial landmarks and the confidence scores \({c}^{{\text {F}}}_n\) given by OpenPose, where \(N = 70\). For AUs, we use OpenFace [24] to extract an \(N'\)-D vector of AUs from each character in a video frame together with an average confidence score \(\bar{c}^{{\text {F}}}\), where \(N' = 35\). For landmark positions, we calculate the average confidence score \(\bar{c}^{{\text {F}}}\) of each person by:
$$\begin{aligned} \bar{c}^{F}=\frac{1}{N} \sum _{n=1}^{N}{c}^{{\text {F}}}_{n}. \end{aligned}$$
(2)
For both types of features, we select the three characters with the largest \(\bar{c}^{{\text {F}}}\) scores. We feed their landmarks or AUs, \(x^{{\text {F}}}\), into FC layers and max-pool the outputs to obtain a 128-D face vector \({f}_{ij}\) (\({j}=1,\ldots ,{J}\), where J denotes the number of video frames in the sliding window) for each frame. Then we concatenate the frames' face vectors and feed them into an LSTM layer. The hidden state corresponding to the last frame is fed into an FC layer to obtain the score vector \({e}^{{\textrm{F}}}_{i}\) for the face flow.

4.3 Language flow

The subtitles in the video contain the transcript of what the characters say, which is the primary source of laughter. To model the subtitles, we use BERT [25], which has been widely applied to similar tasks with outstanding results [16, 17, 26–28]. We concatenate all the subtitles in \({S}_{i}\), adding special tokens including [CLS] and [SEP], and feed the sequence to BERT. If there are no subtitles in the sliding window, only these special tokens are passed to BERT. The output corresponding to [CLS] is then fed into an FC layer to obtain the score vector \({e}^{{\textrm{S}}}_{i}\).
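A sketch of this language flow is shown below, assuming the Hugging Face Transformers implementation of bert-base-uncased (the paper does not specify the implementation library); the FC head and the way subtitles are joined with [SEP] are illustrative.
```python
# A sketch of the language flow: concatenate the window's subtitles,
# encode with BERT, and map the [CLS] output to a 2-D score vector.
import torch
import torch.nn as nn
from transformers import BertModel, BertTokenizer

class LanguageFlow(nn.Module):
    def __init__(self):
        super().__init__()
        self.tokenizer = BertTokenizer.from_pretrained("bert-base-uncased")
        self.bert = BertModel.from_pretrained("bert-base-uncased")
        self.out = nn.Linear(768, 2)             # score vector e^S_i

    def forward(self, subtitles):
        # Join the subtitles with [SEP]; the tokenizer adds [CLS]/[SEP] around
        # the sequence. With no subtitles, only the special tokens remain.
        sep = f" {self.tokenizer.sep_token} "
        text = sep.join(subtitles) if subtitles else ""
        enc = self.tokenizer(text, return_tensors="pt",
                             truncation=True, max_length=128)
        cls = self.bert(**enc).last_hidden_state[:, 0]   # [CLS] output
        return self.out(cls)                             # (1, 2)
```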

4.4 Prediction and training

Our model uses late fusion for the final prediction: the prediction scores from the three flows are summed to obtain the final score vector:
$$\begin{aligned} {e}_{i} = \text {softmax}\left( {e}^{{\textrm{P}}}_{i} + {e}^{{\textrm{F}}}_{i} + {e}^{{\textrm{S}}}_{i}\right) . \end{aligned}$$
(3)
This score vector contains scores for humor \(e^{{\text {h}}}_i\) and non-humor \(e^{{\text {n}}}_i\) per-window, i.e., \(e_i = (e^{{\text {h}}}_i, e^{{\text {n}}}_i)\). When \(e^{{\text {h}}}_i > e^{{\text {n}}}_i\), the final binary prediction \(h_i = 1\), otherwise, \(h_i = 0\).
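A minimal sketch of this late fusion and the binary decision, following Eq. (3); the function name is illustrative.
```python
# A sketch of Eq. (3): sum the 2-D score vectors from the three flows,
# apply softmax, and compare the humor/non-humor entries.
import torch
import torch.nn.functional as F

def fuse_and_predict(e_pose, e_face, e_sub):
    e = F.softmax(e_pose + e_face + e_sub, dim=-1)   # e_i = (e^h_i, e^n_i)
    e_h, e_n = e[..., 0], e[..., 1]
    h = (e_h > e_n).long()                           # 1 = humor, 0 = non-humor
    return h, e
```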
The per-window ground-truth label \(y_i\) for \(w_i\) can be derived from the ground-truth temporal segments in the dataset. Considering that laughter happens after a triggering part, we deem window \(w_i\) to be associated with humor (\(y_i = 1\)) if the end time of \(w_i\) falls within any ground-truth humor segment, and non-humor (\(y_i = 0\)) otherwise. Note that if \(w_i\) contains a triggering part but ends before the laughter segment, it is still considered non-humor.
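The labelling rule reduces to a one-line check, sketched below with an illustrative function name.
```python
# A sketch of the per-window labelling rule: y_i = 1 iff the end time of
# window w_i falls inside some ground-truth humor segment.
def window_label(window_end: float, humor_segments) -> int:
    """humor_segments: iterable of (start, end) times in seconds."""
    return int(any(s <= window_end <= e for s, e in humor_segments))
```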

4.5 Converting frame predictions to temporal segments

We call the prediction \(h_i\) of window \({w}_{i}\) a frame-level prediction and convert these frame-level predictions to humor segments. Let t be the shift amount between consecutive windows in seconds. If \({w}_{i}\) is predicted as humor, we deem the following t seconds a humor segment. Consecutive segments are merged to form a single humor segment.
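A sketch of this conversion is given below; anchoring the t-second span right after each positive window's end time is our own reading of the rule, and the t = 2 s default follows Sect. 5.
```python
# A sketch of converting per-window (frame-level) predictions to humor
# segments: each positive window contributes the t seconds that follow it,
# and touching or overlapping spans are merged.
def predictions_to_segments(windows, preds, t=2.0):
    """windows: list of dicts with an 'end' time; preds: 0/1 per window."""
    raw = [(w["end"], w["end"] + t) for w, h in zip(windows, preds) if h == 1]
    merged = []
    for start, end in sorted(raw):
        if merged and start <= merged[-1][1]:
            merged[-1] = (merged[-1][0], max(merged[-1][1], end))
        else:
            merged.append((start, end))
    return merged
```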
Table 2  The training parameters in the experiment

  Parameter              Setting
  Learning rate          \(2 \times {10}^{-5}\)
  Number of epochs       3
  Training batch size    32
  Inference batch size   16
  Max number of tokens   128
  Optimizer              Adam [29]
  Weight decay           \(1 \times {10}^{-5}\)

5 Experimental results

We resample the video frames to 2 fps and split the dataset into training (80%), validation (10%), and test (10%) sets. We implement our model with Python 3.7 and PyTorch. The hyper-parameters used in the experiments are shown in Table 2. For the subtitle input, we use the bert-base-uncased model, which has 12 layers, a hidden size of 768, 12 self-attention heads, and 110 million parameters, for feature representation. The model makes no distinction between upper-case and lower-case tokens in the input sequence. Cross-entropy loss is applied for training.
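The training setup can be sketched as follows with the hyper-parameters from Table 2; the model interface and the layout of the data loader batches are placeholders, not the exact implementation.
```python
# A sketch of the training loop under the Table 2 hyper-parameters
# (Adam, lr 2e-5, weight decay 1e-5, 3 epochs, cross-entropy loss).
import torch
import torch.nn as nn

def train(model, train_loader, epochs=3):
    optimizer = torch.optim.Adam(model.parameters(), lr=2e-5, weight_decay=1e-5)
    criterion = nn.CrossEntropyLoss()
    model.train()
    for epoch in range(epochs):
        for poses, faces, subs, labels in train_loader:   # batch size 32
            optimizer.zero_grad()
            logits = model(poses, faces, subs)   # fused (batch, 2) scores
            loss = criterion(logits, labels)     # humor vs. non-humor
            loss.backward()
            optimizer.step()
```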
We set the length of our sliding window to 8 s and the shift to 2 s. These values are based on the average durations of humor segments (2.254 s) and non-humor segments (6.377 s) shown in Table 1. Since a humor segment is always preceded by a non-humor segment, and the non-humor segment serves as the context (setup and punchline) of the humor segment (laughter), a window of 8 s, which roughly corresponds to the sum of the average humor and non-humor durations, can cover most of such consecutive non-humor and humor segments to model humor. The shift of 2 s is sufficiently small for this window.
Table 3  Frame-level results on the test set (in %)

  Pose   Face          Subtitle   Acc     Pre     Rec      F1
  All positive                    32.00   32.00   100.00   48.49
  All negative                    68.00    0.00     0.00    0.00
  2D     –             –          68.84   66.27     5.36    9.92
  3D     –             –          68.57   70.95     3.04    5.83
  –      Landmark      –          68.93   66.31     5.93   10.89
  –      Action unit   –          67.98   33.33     0.02    0.05
  –      –             BERT       70.23   55.93    32.97   41.48
  2D     Landmark      –          68.97   63.71     7.06   12.71
  2D     Action unit   –          68.62   59.62     6.00   10.91
  3D     Landmark      –          68.86   65.15     5.81   10.67
  3D     Action unit   –          68.56   70.33     3.06    5.87
  2D     –             BERT       70.75   57.23    34.09   42.73
  3D     –             BERT       70.94   57.53    35.10   43.60
  –      Landmark      BERT       71.01   57.38    36.56   44.66
  –      Action unit   BERT       70.16   59.76    20.65   34.69
  2D     Landmark      BERT       70.94   57.48    33.41   43.82
  2D     Action unit   BERT       70.51   59.56    24.45   34.67
  3D     Landmark      BERT       71.12   57.87    35.89   44.30
  3D     Action unit   BERT       70.33   56.08    33.66   42.07

The best values under different metrics using the same type of input are in bold
Table 4  Segment-level results on the test set (in %)

                                  IoU = 0.25              IoU = 0.50              IoU = 0.75
  Pose   Face          Subtitle   Pre     Rec     F1      Pre     Rec     F1      Pre     Rec     F1
  2D     –             –          63.00    6.88   12.40   42.49    4.64    8.36   15.75    1.72    3.10
  3D     –             –          64.25    4.60    8.58   39.66    2.84    5.30    8.94    0.64    1.19
  –      Landmark      –          64.00    7.04   12.68   41.45    4.56    8.21   15.27    1.68    3.02
  –      Action unit   –          33.33    0.04    0.08    0.00    0.00    0.00    0.00    0.00    0.00
  –      –             BERT       75.49   35.23   48.04   52.61   24.58   33.51   20.87    9.75   13.29
  2D     Landmark      –          60.28    8.56   14.99   43.10    6.12   10.71   19.44    2.76    4.83
  2D     Action unit   –          56.06    8.88   15.33   36.36    5.76    9.94   16.41    2.60    4.49
  3D     Landmark      –          63.33    6.84   12.34   40.37    4.36    7.86   15.93    1.72    3.10
  3D     Action unit   –          62.98    4.56    8.50   39.23    2.84    5.30    9.39    0.68    1.27
  2D     –             BERT       74.65   36.27   48.82   51.97   25.30   34.03   21.18   10.31   13.84
  3D     –             BERT       76.07   37.00   49.78   54.44   26.48   35.63   22.29   10.84   14.59
  –      Landmark      BERT       76.00   37.98   50.65   52.95   26.54   35.36   20.41   10.23   13.63
  –      Action unit   BERT       76.77   24.71   37.39   55.21   17.79   26.90   22.08    7.11   10.76
  2D     Landmark      BERT       76.48   38.34   51.08   52.07   26.14   34.81   21.02   10.55   14.05
  2D     Action unit   BERT       76.46   28.83   41.87   55.34   20.90   30.35   23.39    8.83   12.82
  3D     Landmark      BERT       75.35   38.38   50.86   52.85   27.02   35.76   21.74   11.11   14.71
  3D     Action unit   BERT       75.47   35.19   48.00   51.46   24.02   32.75   20.03    9.35   12.75

The best values under different metrics using the same type of input are in bold

5.1 Quantitative results

We show the performance of frame-level and segment-level predictions in Tables 3 and 4, respectively. We use accuracy (Acc), precision (Pre), recall (Rec), and F1 as metrics for frame-level predictions, and precision (Pre), recall (Rec), and F1 under different IoU thresholds for segment-level predictions. For comparison, we show the performance of two naive baselines (all-positive and all-negative labels) as well as the subtitle baseline, which uses only the language flow for prediction, i.e., the prediction is made based on \(e^{{\text {S}}}_i\).
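For the segment-level metrics, precision and recall at an IoU threshold could be computed as sketched below; the greedy one-to-one matching between predicted and ground-truth segments is our own assumption about the evaluation protocol.
```python
# A sketch of segment-level precision/recall at an IoU threshold.
def segment_iou(a, b):
    inter = max(0.0, min(a[1], b[1]) - max(a[0], b[0]))
    union = (a[1] - a[0]) + (b[1] - b[0]) - inter
    return inter / union if union > 0 else 0.0

def precision_recall(pred_segs, gt_segs, iou_thr=0.5):
    matched_gt, tp = set(), 0
    for p in pred_segs:
        # Greedily match each prediction to the best unmatched ground truth.
        best, best_iou = None, 0.0
        for j, g in enumerate(gt_segs):
            if j in matched_gt:
                continue
            iou = segment_iou(p, g)
            if iou > best_iou:
                best, best_iou = j, iou
        if best is not None and best_iou >= iou_thr:
            matched_gt.add(best)
            tp += 1
    precision = tp / len(pred_segs) if pred_segs else 0.0
    recall = tp / len(gt_segs) if gt_segs else 0.0
    return precision, recall
```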
The method of [16] can also serve as a baseline, although their task is to predict humor at the sentence level. To make their sentence-level predictions comparable to ours, we convert them into frame-level predictions by making use of the time stamp of each sentence: for a sentence predicted as humor, we label all frames within the range from the end of that sentence to 2 s after it as humor. We show the frame-level and segment-level performance in Tables 5 and 6, respectively, where “Char” denotes the character features in Kayatani et al.'s method.
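This conversion could look like the sketch below; the frame timestamps assume the 2 fps sampling of Sect. 5, and the function name is illustrative.
```python
# A sketch of converting sentence-level predictions (as in [16]) to
# frame-level ones: frames between a humorous sentence's end time and
# 2 s afterwards are marked as humor.
def sentence_to_frame_preds(sentences, frame_times, span=2.0):
    """sentences: list of (end_time, is_humor); returns a 0/1 label per frame."""
    preds = [0] * len(frame_times)
    for end, is_humor in sentences:
        if not is_humor:
            continue
        for i, t in enumerate(frame_times):
            if end <= t <= end + span:
                preds[i] = 1
    return preds
```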
Table 5  Frame-level results of Kayatani et al.'s method (the scores are from [16]) and our method (in %)

  Method                 Input                        Acc     Pre     Rec     F1
  Kayatani et al. [16]   BERT                         62.94   44.10   59.04   50.49
                         Action unit + BERT           65.20   46.14   52.13   48.95
                         Char + Action unit + BERT    66.71   47.40   39.88   43.40
  Ours                   BERT                         70.23   55.93   32.97   41.48
                         Landmark + BERT              71.01   57.38   36.56   44.66
                         3D + Landmark + BERT         71.12   57.87   35.89   44.30

The best values under different metrics using the same method (Kayatani et al. and ours) are in bold
Table 6  Segment-level results of Kayatani et al.'s method and our method (in %)

                                                      IoU = 0.25              IoU = 0.50              IoU = 0.75
  Method                 Input                        Pre     Rec     F1      Pre     Rec     F1      Pre     Rec     F1
  Kayatani et al. [16]   BERT                         66.18   42.02   51.41   36.34   23.07   28.22   10.14    6.44    7.87
                         Action unit + BERT           68.51   39.06   49.76   40.39   23.03   29.34   11.01    6.28    8.00
                         Char + Action unit + BERT    69.87   33.11   44.93   45.65   21.63   29.35   14.68    6.96    9.44
  Ours                   BERT                         75.49   35.23   48.04   52.61   24.58   33.51   20.87    9.75   13.29
                         Landmark + BERT              76.00   37.98   50.65   52.95   26.54   35.36   20.41   10.23   13.63
                         3D + Landmark + BERT         75.35   38.38   50.86   52.85   27.02   35.76   21.74   11.11   14.71

The best values under different metrics using the same method (Kayatani et al. and ours) are in bold
Table 3 shows our ablation study results. The accuracies of our method are above 60% for all input combinations. The recall and F1 scores are low when we use only pose and/or facial features; adding the language features improves accuracy, recall, and F1. The best-performing modality under our metrics is the linguistic one. We think this is because most humor in the dataset is triggered by what the actors are saying, while visually-induced humor segments are fewer than linguistically-induced ones. We also find that the visual modality does contribute to humor prediction, as the precision scores using only the visual modality can exceed 65%. The model that uses 3-D poses, facial landmarks, and subtitles as input has the best accuracy, while the model that uses facial landmarks and subtitles has better recall and F1 scores than the other input combinations.
We report the segment-level evaluation in Table 4 under different IoU thresholds between the predicted segments and the ground-truth segments. The language features contribute substantially to the predictions: when the input contains subtitles, the recall and precision scores are much better than those with only visual features. Among the results, the combination of 3-D poses, facial landmarks, and subtitles has the best recall at \({\text {IoU}}=0.25\), \({\text {IoU}}=0.50\), and \({\text {IoU}}=0.75\). It also has the best F1 scores at \({\text {IoU}}=0.50\) and \({\text {IoU}}=0.75\), making it the best-performing input in the segment-level evaluation.
We also compare our results with the method of [16]: our sliding-window-based method improves accuracy and precision by 4.41% and 10.47%, respectively, at the frame level compared with [16]. However, the F1 score is not as good as that of [16] using BERT only or the combination of action units and BERT. For the segment-level predictions at \({\text {IoU}} = 0.75\), the recall of our method using 3-D poses, facial landmarks, and subtitles is 4.15% higher than that of the sentence-level method. This implies that our predicted humor segments align with the ground truth better than those of the sentence-level method.
Table 7  Frame-level results on the test set under different lengths (scores are in %)

  Length (s)   Acc     Pre     Rec     F1
  4            68.82   52.65   24.16   33.73
  8            71.12   57.87   35.89   44.30
  12           69.71   56.52   23.78   33.47
  16           69.02   56.20   15.36   24.13

The best values under different metrics using the same type of input are in bold

5.2 Training with different lengths of sliding window

We evaluate the performance of different lengths of sliding windows to quantify their impact. We use 3-D poses and face landmarks as pose and face flow inputs along with the language flow input and show the frame-level and segment-level results for the lengths of sliding windows being 4 s, 8 s, 12 s, and 16 s in Tables 7 and 8, respectively. Note that we keep the shift of the sliding window to 2 s.
Table 8  Segment-level results on the test set under different lengths (scores are in %)

               IoU = 0.25              IoU = 0.50              IoU = 0.75
  Length (s)   Pre     Rec     F1      Pre     Rec     F1      Pre     Rec     F1
  4            62.50   30.14   40.67   45.70   22.06   29.76   15.95    7.70   10.38
  8            75.35   38.38   50.86   52.85   27.02   35.76   21.74   11.11   14.71
  12           71.53   23.57   35.45   46.48   15.36   23.09   16.14    5.33    8.02
  16           68.11   14.60   24.05   45.69    9.82   16.16   16.67    3.58    5.90

The best values under different metrics using the same type of input are in bold
From Tables 7 and 8, we can see that as the sliding window gets longer, the scores at both the frame and segment levels first increase and then decrease. When the length of the sliding window is 8 s, the results at both levels reach their highest scores. This means that a longer sliding window may lead to a performance drop. We think the reason is that humor is often triggered within a relatively short period; when the sliding window is long, information unrelated to the humor is fed into the network, degrading performance. Also, our dataset contains many language-based humor segments; when the sliding window is too short, the model only sees a single subtitle without any context that builds up the humor. Thus, the sliding window should have an appropriate length for better predictions.

5.3 Training time

Our experiments are performed on a computer with an Intel Core i7-8700K CPU, 32 GB of RAM, and an NVIDIA Titan RTX GPU. We show the training time for different inputs in Table 9.
Table 9  Training time on our dataset per epoch (in mm:ss)

  Input                    Training time
  Pose                     1:06
  Pose + face              2:01
  Pose + subtitle          20:08
  Face                     1:07
  Face + subtitle          20:46
  Subtitle                 19:45
  Pose + face + subtitle   23:07
From the table, when subtitles are not used, a single epoch takes no more than 3 min. However, when we take subtitles as input, an epoch takes more than 19 min. This is because BERT is a much larger network than the other parts of our method.

5.4 Qualitative results

We show some example predictions to qualitatively illustrate the behavior of our method. These examples, shown in Fig. 5, use our method with 3-D poses, facial landmarks, and subtitles, as well as [16] with character, face, and subtitle features. The green and orange bars in the timeline indicate the humor segments predicted by our method and by Kayatani et al.'s method, respectively, while the blue bars are the ground-truth humor segments. In (a), the humor is mainly triggered by funny actions (one of the actors is lifting his arms in front of his chest); our method captures these actions and finds the corresponding segments. Kayatani et al.'s method also predicts humor, but its IoU is lower than ours. In (b), the humor is mainly caused by the two persons lifting a wooden board together. Our predicted segment has some overlap with the ground truth, but it starts and ends earlier, while Kayatani et al.'s method fails to spot the humor segment. In (c), even though the actors are wearing special costumes, the humor segment only occupies a short time. Our method captures and predicts the humorous segment correctly, while Kayatani et al.'s method gives a longer segment. In (d), our method fails to capture the humor segment, while Kayatani et al.'s method catches it. This may be because this humor is triggered by a longer context, which our sliding-window-based model cannot capture, while Kayatani et al.'s method can because it takes five successive subtitles as input.

6 Conclusion and outlook

In this work, we proposed a multi-modal model to predict humor segments in videos. We first automatically annotated temporal segments of humor in a dataset and then presented a framework to predict humor in videos from multiple modalities. Our method used features from subtitles along with different kinds of pose and face features in the videos to make predictions. BERT was used to model the subtitles, and LSTM networks were set up to model the pose and facial expression features, respectively. Experimental results showed that our method outperformed the previous method by 4.41% in accuracy at the frame level and by 4.15% in recall at the segment level, giving a better understanding of humor.
However, this work still has some limitations. First, our model can hardly predict humor based on specific relationships between characters and other objects. Second, the data source is limited to sitcom videos, whose ground-truth laughter is easy to find. Third, our method only sums the results from multiple modalities for the final prediction, which may lose cross-modal information. In future work, we want to model the relationship between objects and people in the videos. We also plan to broaden the sources of the dataset, apply new fusion techniques, and migrate the method to other kinds of emotions to cover a wider range of applications.

Acknowledgements

This work was supported by JSPS KAKENHI No. 18H03264 and China Scholarship Council.

Declarations

Conflict of interest

The authors declare that they have no conflict of interest.
Open Access This article is licensed under a Creative Commons Attribution 4.0 International License, which permits use, sharing, adaptation, distribution and reproduction in any medium or format, as long as you give appropriate credit to the original author(s) and the source, provide a link to the Creative Commons licence, and indicate if changes were made. The images or other third party material in this article are included in the article's Creative Commons licence, unless indicated otherwise in a credit line to the material. If material is not included in the article's Creative Commons licence and your intended use is not permitted by statutory regulation or exceeds the permitted use, you will need to obtain permission directly from the copyright holder. To view a copy of this licence, visit http://creativecommons.org/licenses/by/4.0/.

Publisher's Note

Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.
Footnotes

1. This assumption does not count humor without accompanying laughter. We leave handling such humor as our future work.
Literature
1. Meyer, J.C.: Humor as a double-edged sword: four functions of humor in communication. Commun. Theory 10(3), 310–331 (2000)
2. Niculescu, A., van Dijk, B., Nijholt, A., Li, H., See, S.L.: Making social robots more attractive: the effects of voice pitch, humor and empathy. Int. J. Soc. Robot. 5, 171–191 (2013)
3. Mirnig, N., Stadler, S., Stollnberger, G., Giuliani, M., Tscheligi, M.: Robot humor: how self-irony and schadenfreude influence people's rating of robot likability. In: 2016 25th IEEE International Symposium on Robot and Human Interactive Communication (RO-MAN), pp. 166–171 (2016)
4. Gray, C., Webster, T., Ozarowicz, B., Chen, Y., Bui, T., Srivastava, A., Fitter, N.T.: "This bot knows what I'm talking about": human-inspired laughter classification methods for adaptive robotic comedians. In: 2022 31st IEEE International Conference on Robot and Human Interactive Communication (RO-MAN), pp. 1007–1014 (2022)
5. Kolb, W., Miller, T.: Human–computer interaction in pun translation. In: Using Technologies for Creative-Text Translation. Taylor & Francis, London (2022)
6. Hasan, M.K., Rahman, W., Zadeh, A., Zhong, J., Tanveer, M.I., Morency, L.-P., Hoque, M.E.: UR-FUNNY: a multimodal language dataset for understanding humor. In: EMNLP-IJCNLP, pp. 2046–2056 (2019)
7. Gultchin, L., Patterson, G., Baym, N., Swinger, N., Kalai, A.: Humor in word embeddings: cockamamie gobbledegook for nincompoops. In: ICML, pp. 2474–2483 (2019)
8. Chandrasekaran, A., Vijayakumar, A.K., Antol, S., Bansal, M., Batra, D., Zitnick, C.L., Parikh, D.: We are humor beings: understanding and predicting visual humor. In: CVPR, pp. 4603–4612 (2016)
9. Ortega-Bueno, R., Muniz-Cuza, C.E., Pagola, J.E.M., Rosso, P.: UO UPV: deep linguistic humor detection in Spanish social media. In: IberEval, pp. 204–213 (2018)
10. Sane, S.R., Tripathi, S., Sane, K.R., Mamidi, R.: Deep learning techniques for humor detection in Hindi–English code-mixed tweets. In: Workshop on Computational Approaches to Subjectivity, Sentiment and Social Media Analysis, pp. 57–61 (2019)
11. Kamal, A., Abulaish, M.: Self-deprecating humor detection: a machine learning approach. In: PACLING, pp. 483–494 (2019)
12. Choube, A., Soleymani, M.: Punchline detection using context-aware hierarchical multimodal fusion. In: ICMI, pp. 675–679 (2020)
13. Weller, O., Seppi, K.: Humor detection: a transformer gets the last laugh. In: EMNLP-IJCNLP, pp. 3612–3616 (2019)
14. Fan, X., Lin, H., Yang, L., Diao, Y., Shen, C., Chu, Y., Zou, Y.: Humor detection via an internal and external neural network. Neurocomputing, 105–111 (2020)
15. Czapla, B.F.P., Howard, J.: Applying a pre-trained language model to Spanish Twitter humor prediction. In: Iberian Languages Evaluation Forum (2019)
16. Kayatani, Y., Yang, Z., Otani, M., Garcia, N., Chu, C., Nakashima, Y., Takemura, H.: The laughing machine: predicting humor in video. In: WACV, pp. 2072–2081 (2021)
17. Patro, B.N., Lunayach, M., Srivastava, D., Singh, H., Namboodiri, V.P., et al.: Multimodal humor dataset: predicting laughter tracks for sitcoms. In: WACV, pp. 576–585 (2021)
18. Yang, Z., Ai, L., Hirschberg, J.: Multimodal indicators of humor in videos. In: MIPR, pp. 538–543 (2019)
19. Wendt, C.S., Berg, G.: Nonverbal humor as a new dimension of HRI. In: RO-MAN 2009 - The 18th IEEE International Symposium on Robot and Human Interactive Communication, pp. 183–188 (2009)
20. Cao, Z., Hidalgo Martinez, G., Simon, T., Wei, S., Sheikh, Y.A.: OpenPose: realtime multi-person 2D pose estimation using part affinity fields. IEEE TPAMI, 172–186 (2019)
21. Martinez, J., Hossain, R., Romero, J., Little, J.J.: A simple yet effective baseline for 3D human pose estimation. In: ICCV, pp. 2659–2668 (2017)
22. Ionescu, C., Papava, D., Olaru, V., Sminchisescu, C.: Human3.6M: large scale datasets and predictive methods for 3D human sensing in natural environments. IEEE TPAMI 36(7), 1325–1339 (2013)
23. Ekman, P., Rosenberg, E.L.: What the Face Reveals: Basic and Applied Studies of Spontaneous Expression Using the Facial Action Coding System (FACS). Oxford University Press, Oxford (1997)
24. Baltrusaitis, T., Zadeh, A., Lim, Y.C., Morency, L.-P.: OpenFace 2.0: facial behavior analysis toolkit. In: FG, pp. 59–66 (2018)
25. Devlin, J., Chang, M.-W., Lee, K., Toutanova, K.: BERT: pre-training of deep bidirectional transformers for language understanding. In: NAACL, pp. 4171–4186 (2019)
26. Ma, X., Xu, P., Wang, Z., Nallapati, R., Xiang, B.: Universal text representation from BERT: an empirical study. Preprint arXiv:1910.07973 (2019)
27. Zhang, Z., Wu, Y., Zhao, H., Li, Z., Zhang, S., Zhou, X., Zhou, X.: Semantics-aware BERT for language understanding. In: AAAI, pp. 9628–9635 (2020)
28. Yang, Z., Garcia, N., Chu, C., Otani, M., Nakashima, Y., Takemura, H.: BERT representations for video question answering. In: WACV, pp. 1556–1565 (2020)
29. Kingma, D.P., Ba, J.: Adam: a method for stochastic optimization. In: ICLR (2015)
Metadata
Title
Multi-modal humor segment prediction in video
Authors
Zekun Yang
Yuta Nakashima
Haruo Takemura
Publication date
03-06-2023
Publisher
Springer Berlin Heidelberg
Published in
Multimedia Systems / Issue 4/2023
Print ISSN: 0942-4962
Electronic ISSN: 1432-1882
DOI
https://doi.org/10.1007/s00530-023-01105-x
