Published in: Multimedia Systems 4/2023

Open Access 03-06-2023 | Regular Paper

Multi-modal humor segment prediction in video

Authors: Zekun Yang, Yuta Nakashima, Haruo Takemura


Abstract

Humor can be induced by various signals in the visual, linguistic, and vocal modalities emitted by humans. Finding humor in videos is an interesting but challenging task for an intelligent system. Previous methods predict humor in the sentence level given some text (e.g., speech transcript), sometimes together with other modalities, such as videos and speech. Such methods ignore humor caused by the visual modality in their design, since their prediction is made for a sentence. In this work, we first give new annotations to humor based on a sitcom by setting up temporal segments of ground truth humor derived from the laughter track. Then, we propose a method to find these temporal segments of humor. We adopt an approach based on sliding window, where the visual modality is described by pose and facial features along with the linguistic modality given as subtitles in each sliding window. We use long short-term memory networks to encode the temporal dependency in poses and facial features and pre-trained BERT to handle subtitles. Experimental results show that our method improves the performance of humor prediction.

1 Introduction

Humor provokes laughter and provides amusement. It is an important medium to demonstrate our emotions and has become an essential tool in our daily life [1]. Humor can be used to draw people’s attention and relieve stressful or embarrassing situations. By properly using humor, communication between people will become easier and smoother.
Understanding humor is also important for human–machine communications (e.g., robots [2, 3] and virtual agents [4, 5]). A machine may interact with us in a more comprehensive manner, ultimately taking our emotions into its decision-making to respond to our various needs. Meanwhile, understanding humor is a challenging task for a machine in both computer vision and natural language processing communities because it requires a deeper knowledge of signals from people in visual (e.g., poses, gestures, and appearances), vocal (e.g., tones), and linguistic (e.g., puns) modalities, as well as their combinations [6], which can induce humor.
In recent years, some methods have been proposed to predict humor using both single and multiple modalities, often accompanied by a dedicated dataset [7–12]. Single-modal humor prediction mainly uses the linguistic modality [13–15], while multi-modal humor prediction combines information from different modalities [6, 16–18]. The ground-truth labels of these methods are usually associated with blocks of text, like sentences and dialogues, while signals from other modalities are often treated as supplementary. In the real world, however, humor is not necessarily tied to text; it can be invoked even in silence with funny actions and facial expressions, which are often ignored in tasks driven by the linguistic modality. To cover broader variations of humor, we need another problem formulation of humor prediction.
In this work, we present a new humor prediction task. Unlike previous tasks that provide humor-related annotations based on a single sentence or a set of dialogues [16, 17], our proposed task provides temporal segments that are associated with humor as ground-truth labels, as shown in Fig. 1. We also propose a new method for humor prediction, which makes predictions with a sliding window. The method uses multimodal data within each window, i.e., video frames and subtitles. Our method aggregates subtitles as well as pose and facial features from video frames, which are then fed into our model. We convert these sliding-window predictions to temporal segments comparable with the ground-truth segments.
The main contributions in our work are three-fold.
1. We give a new definition to humor by setting up temporal segments that are associated with humor as ground-truth labels. Such a definition covers a wider variety of humorous moments, even those without associated text (or utterances).
2. We propose a method to find these temporal segments, which can handle humor invoked solely by the visual modality. Our method uses the visual modality through poses and facial features in video frames as well as the linguistic modality through subtitles as input. Prediction is done over a sliding window, which is comparable with our ground truth.
3. We compare different combinations of input features to show which combination is the best for our humor prediction task.
The rest of this work is arranged as follows: Sect. 2 reviews previous work related to humor prediction; Sect. 3 introduces our task and dataset; Sect. 4 presents our method to predict humor; Sect. 5 shows the experimental results; and Sect. 6 concludes this work.

2 Related work

Methods for humor prediction usually take features obtained from text, images, and audio as inputs and output a prediction of whether the input is associated with humor or not. Single-modal humor prediction methods mainly use the linguistic modality. For example, Weller et al. [13] proposes a task that takes text from Reddit pages as input and judges whether it is humorous or not based on the ratings. Fan et al. [14] uses an internal and external attention neural network for short-text humor detection. Czapla et al. [15] applies a pre-trained language model to predict humor in Spanish tweets. All these methods make their predictions based only on text input. However, in the real world, humor can be invoked by other modalities. A multi-modal approach is necessary to broaden the application of humor prediction.
Multi-modal humor prediction methods combine information from different modalities. For example, Hasan et al. [6] uses subtitle, visual, and audio features in TED talk videos. Patro et al. [17] builds a dataset based on the famous sitcom The Big Bang Theory and gives several baselines to predict humor based on both the visual and language modalities. Kayatani et al. [16] also uses the same TV drama series as their testbed and presents a model to predict whether an utterance of a character causes laughter based on subtitles as well as facial features and the identity of the character. Yang et al. [18] obtains humor labels in videos based on user comments together with visual and audio features. The ground-truth humor labels in these methods are mainly associated with texts, and a prediction is made for a sentence. Our ground-truth annotation, in contrast, is given as a segment specified by start and end time stamps, which allows covering humor invoked by various modalities.

3 Dataset and task

In this work, we give new annotations to humor labels by setting up temporal segments of humor based on the datasets of Patro et al. [17] and Kayatani et al. [16], which use the famous sitcom The Big Bang Theory. The videos in this sitcom TV drama series contain canned laughter (or laughter tracks). Though such canned laughter is not equivalent to humor in general, we believe that laughter is added if and only if humor is present in a sitcom. This means that, at least in such a designed circumstance, laughter can be a good proxy for the presence of humor and gives a relatively objective criterion to identify where humor happens.1 Hence, we use canned laughter to make ground-truth humor segments automatically (i.e., our ground-truth humor segment annotations are derived from the laughter track).
To do this, we follow Kayatani et al. [16] and subtract the left and right channels of the audio track to cancel the characters' speech. Then we apply low-pass filtering to the subtracted signal and take the Hilbert transform to obtain its wave envelope. This envelope basically gives larger values for canned laughter, jingles, music, etc. Unlike [16], which annotates a humor label for each sentence, we want to make temporal segments of humor as shown in Fig. 2. We thus set a threshold on the wave envelope and define the samples above the threshold as humor to form raw temporal segments of humor. We then review all the extracted segments manually to remove non-laughter segments and finalize the humor segments (i.e., fixed humor segments). Our dataset thus consists of video frames, subtitles, and humor segments with start and end time stamps.
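The following is a minimal sketch of this envelope-based extraction of raw laughter segments, assuming SciPy for the filtering and Hilbert transform; the filter cutoff, the relative threshold, and all function names are our own illustrative choices rather than the exact settings used in the paper.
```python
# A sketch of extracting raw laughter segments from a stereo audio track.
# Cutoff frequency and threshold are illustrative assumptions.
import numpy as np
from scipy.signal import butter, filtfilt, hilbert

def laughter_segments(left, right, sr, cutoff_hz=400.0, rel_threshold=0.1):
    """Return (start, end) times in seconds where the envelope of the
    speech-cancelled signal exceeds a threshold."""
    # Subtracting the stereo channels attenuates centre-panned speech,
    # leaving canned laughter, jingles, music, etc.
    residual = left.astype(np.float64) - right.astype(np.float64)

    # Low-pass filter the residual, then take its Hilbert envelope.
    b, a = butter(4, cutoff_hz / (sr / 2), btype="low")
    envelope = np.abs(hilbert(filtfilt(b, a, residual)))

    # Threshold the envelope and group consecutive samples into segments.
    above = envelope > rel_threshold * envelope.max()
    segments, start = [], None
    for i, flag in enumerate(above):
        if flag and start is None:
            start = i
        elif not flag and start is not None:
            segments.append((start / sr, i / sr))
            start = None
    if start is not None:
        segments.append((start / sr, len(above) / sr))
    return segments  # raw segments; manual review removes non-laughter spans
```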
Table 1  Statistics of the dataset

  Number of seasons     10
  Number of episodes    228
  Total duration        76:33:50
  Number of segments    63,814
  Number of subtitles   74,217

  Humor segments        Count: 31,851
                        Duration: min 0.042 s, avg 2.254 s, max 18.458 s, total 19:56:32
  Non-humor segments    Count: 31,963
                        Duration: min 0.042 s, avg 6.377 s, max 76.792 s, total 56:37:18
  Humor subtitles       Count: 33,408; average words: 7.38
  Non-humor subtitles   Count: 40,809; average words: 7.72

Total durations are shown in hh:mm:ss
The statistics of the dataset are shown in Table 1. The number of humor segments is quite large: a single episode has almost 140 humor segments on average, and more than one-fourth of the total duration contains laughter. As for the linguistic modality, we call subtitles that end within a humor segment humor subtitles. Subtitles that start within a humor segment but end outside any humor segment are not counted as humor subtitles and are instead referred to as non-humor subtitles. The table shows that more than 44% of the subtitles are associated with one of the humor segments. Figure 3 shows the distributions of the top-20 words (counted over humor sentences) for humor and non-humor sentences, where stop words and characters' names are removed. Considering the difference in the number of humor and non-humor subtitles, we would say that these two distributions do not differ much.
Different from previous work that merely judges whether a sentence or a set of dialogues is humorous or not, our task requires localizing humor segments based on video frames and subtitles. Note that there can be humor segments caused solely by acoustic signals (e.g., making a funny noise that cannot be transcribed); however, our task does not use the audio tracks since they have canned laughter, which is used to obtain the ground-truth humor segments.

4 Finding humor segments in video

Figure 4 is an overview of our method. We cast our humor segment prediction task as humor/non-humor prediction over sliding windows. To predict humor within each window, we represent the video frames by sequences of the characters' poses and faces, which are handled by the pose flow and the face flow, respectively. The subtitles within each sliding window go through the language flow. We use late fusion to summarize the prediction scores from the different flows and obtain per-window predictions. These per-window predictions are then converted to temporal segments.
For the i-th window \({w}_{i}\), we aggregate video frames \({V}_{i}= \{{v}_{ij} \mid {j}=1,\ldots ,{J}\}\) and subtitles \({S}_{i} = \{{s}_{ik} \mid {k}=1,\ldots ,{K}_{i}\}\) within it as input, where J and \({K}_{i}\) are the numbers of frames and subtitles in \({w}_{i}\) (\({K}_{i}\) can vary for different windows), respectively. Note that, as in Sect. 3, we include the subtitles that end inside the window, while we do not include those subtitles that start inside but end outside the window. We use a neural network-based model to make humor/non-humor prediction \({h}_{i}\) for \({w}_{i}\).
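The sketch below illustrates this per-window aggregation of frames and subtitles, assuming frame timestamps and subtitle records with start/end times; the data structures, function name, and the window length and shift values are illustrative (the 8 s / 2 s defaults follow Sect. 5).
```python
# A sketch of aggregating frames and subtitles into sliding windows.
from typing import Dict, List

def build_windows(frame_times: List[float],
                  subtitles: List[Dict],     # each: {"start": s, "end": e, "text": t}
                  video_len: float,
                  win_len: float = 8.0,      # window length in seconds
                  shift: float = 2.0):       # shift between consecutive windows
    windows = []
    start = 0.0
    while start + win_len <= video_len:
        end = start + win_len
        # V_i: frames whose timestamps fall inside the window.
        frames = [t for t in frame_times if start <= t < end]
        # S_i: only subtitles that *end* inside the window are included;
        # subtitles that start inside but end outside are excluded.
        subs = [s for s in subtitles if start <= s["end"] < end]
        windows.append({"start": start, "end": end,
                        "frames": frames, "subtitles": subs})
        start += shift
    return windows
```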
Humor is sometimes induced by funny poses and facial expressions. Previous work [19] found that non-verbal humor based on gestures, facial expressions, or whole-body movement makes the robot more human-like and more entertaining. Motivated by this finding, we use two flows in the visual modality to represent poses and facial expressions of the characters in \({w}_{i}\) respectively. For the linguistic modality, previous work [16, 17] used BERT, a famous language Transformer, to represent subtitles and achieved good performance. Thus, we follow them to model the dependency among all subtitles \({s}_{ik}\) in \({S}_{i}\) with BERT.

4.1 Pose flow

Some funny actions can make people laugh. Poses in the video frames can be seen as reflections of such actions, and their features can be crucial for humor prediction. We compare 2-D and 3-D pose features: (1) The first uses OpenPose [20] to detect joint positions in the video frames in \(V_i\). For each person in each video frame, we obtain a 3M-D vector containing the 2-D coordinates of each joint and the confidence score given by OpenPose, where \(M = 25\). (2) The second converts the 2-D joint coordinates to 3-D coordinates with a 3-D pose estimation baseline [21] pre-trained on the Human3.6M dataset [22], which maps the \(M = 25\) joints to \(M' = 17\) joints to fit the Human3.6M model. We obtain a 51-D vector containing all coordinates of the joints in 3-D space. For either kind of pose feature, the entries in the vector for undetected joints are set to 0.
Confidence score \(c^{{\text {P}}}_m\) for joint \(m = 1,\ldots , M\) is related to the visibility of the key point. We believe such a confidence score may somehow represent the importance of the corresponding person in the scene since the main characters in a scene tend to be placed around the center of the frame in bigger sizes. We thus calculate the average confidence score \(\bar{c}^{{\text {P}}}\) for each person by:
$$\begin{aligned} \bar{c}^{{\text {P}}}=\frac{1}{M} \sum _{m=1}^{M}{c}^{{\text {P}}}_{m}. \end{aligned}$$
(1)
Then, we rank the characters in the scene based on \(\bar{c}^{{\text {P}}}\) and select the top-3 characters for both 2-D and 3-D poses. Note that we still use the confidence scores obtained with OpenPose for 3-D poses (i.e., the same \(\bar{c}^{{\text {P}}}\) is used for both 2-D and 3-D poses) because the 3-D poses are derived from the 2-D poses.
Let \(x^{{\text {P}}}\) denote the vector of pose features (either 2-D or 3-D) for a single character. We feed \(x^{{\text {P}}}\) into FC layers and max-pool the outputs over the characters to obtain a 128-D pose vector \({p}_{ij}\) (\({j}=1,\ldots ,{J}\), where J denotes the number of video frames in the sliding window) for each frame. We concatenate the frames' pose vectors and feed them into a long short-term memory (LSTM) layer with hidden states \(d_{ij}^{P}\). The hidden state corresponding to the last frame (i.e., \(d_{iJ}^{P}\)) is then fed into an FC layer to obtain the score vector \(e^{{\textrm{P}}}_i \in [0, 1]^2\) for the pose flow.
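A PyTorch sketch of this flow is given below; the dimensions stated above (3M-D input per character, 128-D per-frame vector, 2-D score) are kept, but the number of FC layers and hidden sizes are our own assumptions. The face flow in Sect. 4.2 follows the same structure with facial features as input.
```python
# A sketch of the pose flow: per-character FC layers, max-pooling over the
# top-3 characters, an LSTM over frames, and an FC output layer.
import torch
import torch.nn as nn

class PoseFlow(nn.Module):
    def __init__(self, joint_dim=75):           # 3M-D with M = 25 (2-D poses)
        super().__init__()
        self.fc = nn.Sequential(nn.Linear(joint_dim, 128), nn.ReLU(),
                                nn.Linear(128, 128))
        self.lstm = nn.LSTM(input_size=128, hidden_size=128, batch_first=True)
        self.out = nn.Linear(128, 2)             # score vector e^P_i

    def forward(self, x):
        # x: (batch, J frames, 3 characters, joint_dim); the 3 characters are
        # those with the highest average OpenPose confidence (Eq. 1).
        per_char = self.fc(x)                    # (batch, J, 3, 128)
        p = per_char.max(dim=2).values           # max-pool over characters
        _, (h_n, _) = self.lstm(p)               # hidden state of the last frame
        return self.out(h_n[-1])                 # (batch, 2)
```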

4.2 Face flow

Exaggerated facial expressions can also cause laughter. We model such facial expressions in a similar way to the pose flow. We adopt two types of facial features, landmark positions and action units (AUs) [23]: For landmark positions, we use a variant of OpenPose to detect facial landmarks in the video frames in \(V_i\). For each person in \(V_i\), we obtain a 3N-D vector containing the 2-D coordinates of the facial landmarks and the confidence scores \({c}^{{\text {F}}}_n\) given by OpenPose, where \(N = 70\). For AUs, we use OpenFace [24] to extract an \(N'\)-D vector of AUs from each character in a video frame together with an average confidence score \(\bar{c}^{{\text {F}}}\), where \(N' = 35\). For landmark positions, we calculate the average confidence score \(\bar{c}^{{\text {F}}}\) of each person by:
$$\begin{aligned} \bar{c}^{F}=\frac{1}{N} \sum _{n=1}^{N}{c}^{{\text {F}}}_{n}. \end{aligned}$$
(2)
For both types of features, we select the three characters with the largest \(\bar{c}^{{\text {F}}}\) scores. We feed their landmarks or AUs, \(x^{{\text {F}}}\), into FC layers and max-pool the outputs to obtain a 128-D face vector \({f}_{ij}\) (\({j}=1,\ldots ,{J}\), where J denotes the number of video frames in the sliding window) for each frame. Then we concatenate the frames' face vectors and feed them into an LSTM layer. The hidden state corresponding to the last frame is fed into an FC layer to obtain the score vector \({e}^{{\textrm{F}}}_{i}\) for the face flow.

4.3 Language flow

The subtitles in the video contain the transcript of what the characters say, which is the primary source of laughter. To model the subtitles, we use BERT [25], which has been widely applied to similar tasks with outstanding results [16, 17, 26–28]. We concatenate all the subtitles in \({S}_{i}\), adding special tokens including [CLS] and [SEP], and feed the sequence to BERT. If there are no subtitles in the sliding window, only these special tokens are passed to BERT. The output corresponding to [CLS] is then fed into an FC layer to obtain the score vector \({e}^{{\textrm{S}}}_{i}\).
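A sketch of this language flow is shown below, assuming the Hugging Face Transformers implementation of bert-base-uncased (the paper does not specify the implementation library); the FC head and the way subtitles are joined with [SEP] are illustrative.
```python
# A sketch of the language flow: concatenate the window's subtitles,
# encode with BERT, and map the [CLS] output to a 2-D score vector.
import torch
import torch.nn as nn
from transformers import BertModel, BertTokenizer

class LanguageFlow(nn.Module):
    def __init__(self):
        super().__init__()
        self.tokenizer = BertTokenizer.from_pretrained("bert-base-uncased")
        self.bert = BertModel.from_pretrained("bert-base-uncased")
        self.out = nn.Linear(768, 2)             # score vector e^S_i

    def forward(self, subtitles):
        # Join the subtitles with [SEP]; the tokenizer adds [CLS]/[SEP] around
        # the sequence. With no subtitles, only the special tokens remain.
        sep = f" {self.tokenizer.sep_token} "
        text = sep.join(subtitles) if subtitles else ""
        enc = self.tokenizer(text, return_tensors="pt",
                             truncation=True, max_length=128)
        cls = self.bert(**enc).last_hidden_state[:, 0]   # [CLS] output
        return self.out(cls)                             # (1, 2)
```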

4.4 Prediction and training

Our model uses late fusion for the final prediction: the prediction scores from the three flows are summed to obtain the final score vector:
$$\begin{aligned} {e}_{i} = \text {softmax}\left( {e}^{{\textrm{P}}}_{i} + {e}^{{\textrm{F}}}_{i} + {e}^{{\textrm{S}}}_{i}\right) . \end{aligned}$$
(3)
This score vector contains scores for humor \(e^{{\text {h}}}_i\) and non-humor \(e^{{\text {n}}}_i\) per-window, i.e., \(e_i = (e^{{\text {h}}}_i, e^{{\text {n}}}_i)\). When \(e^{{\text {h}}}_i > e^{{\text {n}}}_i\), the final binary prediction \(h_i = 1\), otherwise, \(h_i = 0\).
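A minimal sketch of this late fusion and the binary decision, following Eq. (3); the function name is illustrative.
```python
# A sketch of Eq. (3): sum the 2-D score vectors from the three flows,
# apply softmax, and compare the humor/non-humor entries.
import torch
import torch.nn.functional as F

def fuse_and_predict(e_pose, e_face, e_sub):
    e = F.softmax(e_pose + e_face + e_sub, dim=-1)   # e_i = (e^h_i, e^n_i)
    e_h, e_n = e[..., 0], e[..., 1]
    h = (e_h > e_n).long()                           # 1 = humor, 0 = non-humor
    return h, e
```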
The per-window ground-truth label \(y_i\) for \(w_i\) can be derived from the ground-truth temporal segments in the dataset. Considering that laughter happens after a triggering part, we deem window \(w_i\) to be associated with humor (\(y_i = 1\)) if the end time of \(w_i\) falls within any ground-truth humor segment, and non-humor (\(y_i = 0\)) otherwise. Note that if \(w_i\) contains a triggering part but ends before the laughter segment, it is still considered non-humor.
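The labelling rule reduces to a one-line check, sketched below with an illustrative function name.
```python
# A sketch of the per-window labelling rule: y_i = 1 iff the end time of
# window w_i falls inside some ground-truth humor segment.
def window_label(window_end: float, humor_segments) -> int:
    """humor_segments: iterable of (start, end) times in seconds."""
    return int(any(s <= window_end <= e for s, e in humor_segments))
```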

4.5 Converting frame predictions to temporal segments

We call the prediction \(h_i\) of window \({w}_{i}\) a frame-level prediction and convert these frame-level predictions to humor segments. Let t be the shift amount between consecutive windows in seconds. If \({w}_{i}\) is predicted as humor, we deem the following t seconds a humor segment. Consecutive segments are merged to form a single humor segment.
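A sketch of this conversion is given below; anchoring the t-second span right after each positive window's end time is our own reading of the rule, and the t = 2 s default follows Sect. 5.
```python
# A sketch of converting per-window (frame-level) predictions to humor
# segments: each positive window contributes the t seconds that follow it,
# and touching or overlapping spans are merged.
def predictions_to_segments(windows, preds, t=2.0):
    """windows: list of dicts with an 'end' time; preds: 0/1 per window."""
    raw = [(w["end"], w["end"] + t) for w, h in zip(windows, preds) if h == 1]
    merged = []
    for start, end in sorted(raw):
        if merged and start <= merged[-1][1]:
            merged[-1] = (merged[-1][0], max(merged[-1][1], end))
        else:
            merged.append((start, end))
    return merged
```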
Table 2  The training parameters in the experiment

  Parameter              Setting
  Learning rate          \(2 \times {10}^{-5}\)
  Number of epochs       3
  Training batch size    32
  Inference batch size   16
  Max number of tokens   128
  Optimizer              Adam [29]
  Weight decay           \(1 \times {10}^{-5}\)

5 Experimental results

We resample the video frames to 2 fps and split the dataset into training (80%), validation (10%), and test (10%) sets. We implement our model with Python 3.7 and PyTorch. The hyper-parameters used in the experiments are shown in Table 2. For the subtitle input, we use the bert-base-uncased model, which has 12 layers, a hidden size of 768, 12 self-attention heads, and 110 million parameters, for feature representation. The model makes no distinction between upper-case and lower-case tokens in the input sequence. Cross-entropy loss is applied for training.
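The training setup can be sketched as follows with the hyper-parameters from Table 2; the model interface and the layout of the data loader batches are placeholders, not the exact implementation.
```python
# A sketch of the training loop under the Table 2 hyper-parameters
# (Adam, lr 2e-5, weight decay 1e-5, 3 epochs, cross-entropy loss).
import torch
import torch.nn as nn

def train(model, train_loader, epochs=3):
    optimizer = torch.optim.Adam(model.parameters(), lr=2e-5, weight_decay=1e-5)
    criterion = nn.CrossEntropyLoss()
    model.train()
    for epoch in range(epochs):
        for poses, faces, subs, labels in train_loader:   # batch size 32
            optimizer.zero_grad()
            logits = model(poses, faces, subs)   # fused (batch, 2) scores
            loss = criterion(logits, labels)     # humor vs. non-humor
            loss.backward()
            optimizer.step()
```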
We set the length of our sliding window to 8 s and the shift to 2 s. These values are based on the average durations of humor segments (2.254 s) and non-humor segments (6.377 s) shown in Table 1. Since a humor segment is always preceded by a non-humor segment, and the non-humor segment serves as the context (setup and punchline) of the humor segment (laughter), a window of 8 s, which roughly corresponds to the sum of the average humor and non-humor durations, can cover most of such consecutive non-humor and humor segments to model humor. The shift of 2 s is sufficiently small for this window.
Table 3  Frame-level results on the test set (in %)

  Pose   Face          Subtitle   Acc     Pre     Rec      F1
  All positive                    32.00   32.00   100.00   48.49
  All negative                    68.00    0.00     0.00    0.00
  2D     –             –          68.84   66.27     5.36    9.92
  3D     –             –          68.57   70.95     3.04    5.83
  –      Landmark      –          68.93   66.31     5.93   10.89
  –      Action unit   –          67.98   33.33     0.02    0.05
  –      –             BERT       70.23   55.93    32.97   41.48
  2D     Landmark      –          68.97   63.71     7.06   12.71
  2D     Action unit   –          68.62   59.62     6.00   10.91
  3D     Landmark      –          68.86   65.15     5.81   10.67
  3D     Action unit   –          68.56   70.33     3.06    5.87
  2D     –             BERT       70.75   57.23    34.09   42.73
  3D     –             BERT       70.94   57.53    35.10   43.60
  –      Landmark      BERT       71.01   57.38    36.56   44.66
  –      Action unit   BERT       70.16   59.76    20.65   34.69
  2D     Landmark      BERT       70.94   57.48    33.41   43.82
  2D     Action unit   BERT       70.51   59.56    24.45   34.67
  3D     Landmark      BERT       71.12   57.87    35.89   44.30
  3D     Action unit   BERT       70.33   56.08    33.66   42.07

The best values under different metrics using the same type of input are in bold
Table 4  Segment-level results on the test set (in %)

                                  IoU = 0.25              IoU = 0.50              IoU = 0.75
  Pose   Face          Subtitle   Pre     Rec     F1      Pre     Rec     F1      Pre     Rec     F1
  2D     –             –          63.00    6.88   12.40   42.49    4.64    8.36   15.75    1.72    3.10
  3D     –             –          64.25    4.60    8.58   39.66    2.84    5.30    8.94    0.64    1.19
  –      Landmark      –          64.00    7.04   12.68   41.45    4.56    8.21   15.27    1.68    3.02
  –      Action unit   –          33.33    0.04    0.08    0.00    0.00    0.00    0.00    0.00    0.00
  –      –             BERT       75.49   35.23   48.04   52.61   24.58   33.51   20.87    9.75   13.29
  2D     Landmark      –          60.28    8.56   14.99   43.10    6.12   10.71   19.44    2.76    4.83
  2D     Action unit   –          56.06    8.88   15.33   36.36    5.76    9.94   16.41    2.60    4.49
  3D     Landmark      –          63.33    6.84   12.34   40.37    4.36    7.86   15.93    1.72    3.10
  3D     Action unit   –          62.98    4.56    8.50   39.23    2.84    5.30    9.39    0.68    1.27
  2D     –             BERT       74.65   36.27   48.82   51.97   25.30   34.03   21.18   10.31   13.84
  3D     –             BERT       76.07   37.00   49.78   54.44   26.48   35.63   22.29   10.84   14.59
  –      Landmark      BERT       76.00   37.98   50.65   52.95   26.54   35.36   20.41   10.23   13.63
  –      Action unit   BERT       76.77   24.71   37.39   55.21   17.79   26.90   22.08    7.11   10.76
  2D     Landmark      BERT       76.48   38.34   51.08   52.07   26.14   34.81   21.02   10.55   14.05
  2D     Action unit   BERT       76.46   28.83   41.87   55.34   20.90   30.35   23.39    8.83   12.82
  3D     Landmark      BERT       75.35   38.38   50.86   52.85   27.02   35.76   21.74   11.11   14.71
  3D     Action unit   BERT       75.47   35.19   48.00   51.46   24.02   32.75   20.03    9.35   12.75

The best values under different metrics using the same type of input are in bold

5.1 Quantitative results

We show the performance of frame-level and segment-level predictions in Tables 3 and 4, respectively. We use accuracy (Acc), precision (Pre), recall (Rec), and F1 as metrics for frame-level predictions, and precision (Pre), recall (Rec), and F1 under different IoU thresholds for segment-level predictions. For comparison, we show the performance of two naive baselines (all-positive and all-negative labels) as well as the subtitle baseline, which uses only the language flow for prediction, i.e., the prediction is made based on \(e^{{\text {S}}}_i\).
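For the segment-level metrics, precision and recall at an IoU threshold could be computed as sketched below; the greedy one-to-one matching between predicted and ground-truth segments is our own assumption about the evaluation protocol.
```python
# A sketch of segment-level precision/recall at an IoU threshold.
def segment_iou(a, b):
    inter = max(0.0, min(a[1], b[1]) - max(a[0], b[0]))
    union = (a[1] - a[0]) + (b[1] - b[0]) - inter
    return inter / union if union > 0 else 0.0

def precision_recall(pred_segs, gt_segs, iou_thr=0.5):
    matched_gt, tp = set(), 0
    for p in pred_segs:
        # Greedily match each prediction to the best unmatched ground truth.
        best, best_iou = None, 0.0
        for j, g in enumerate(gt_segs):
            if j in matched_gt:
                continue
            iou = segment_iou(p, g)
            if iou > best_iou:
                best, best_iou = j, iou
        if best is not None and best_iou >= iou_thr:
            matched_gt.add(best)
            tp += 1
    precision = tp / len(pred_segs) if pred_segs else 0.0
    recall = tp / len(gt_segs) if gt_segs else 0.0
    return precision, recall
```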
The method of [16] can also serve as a baseline, although their task is to predict humor at the sentence level. To make their sentence-level predictions comparable to ours, we convert them into frame-level predictions by making use of the time stamp of each sentence: for a sentence predicted as humor, we label all frames within the range from the end of that sentence to 2 s after it as humor. We show the frame-level and segment-level performance in Tables 5 and 6, respectively, where “Char” denotes the character features in Kayatani et al.'s method.
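This conversion could look like the sketch below; the frame timestamps assume the 2 fps sampling of Sect. 5, and the function name is illustrative.
```python
# A sketch of converting sentence-level predictions (as in [16]) to
# frame-level ones: frames between a humorous sentence's end time and
# 2 s afterwards are marked as humor.
def sentence_to_frame_preds(sentences, frame_times, span=2.0):
    """sentences: list of (end_time, is_humor); returns a 0/1 label per frame."""
    preds = [0] * len(frame_times)
    for end, is_humor in sentences:
        if not is_humor:
            continue
        for i, t in enumerate(frame_times):
            if end <= t <= end + span:
                preds[i] = 1
    return preds
```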
Table 5  Frame-level results of Kayatani et al.'s method (the scores are from [16]) and our method (in %)

  Method                 Input                        Acc     Pre     Rec     F1
  Kayatani et al. [16]   BERT                         62.94   44.10   59.04   50.49
                         Action unit + BERT           65.20   46.14   52.13   48.95
                         Char + Action unit + BERT    66.71   47.40   39.88   43.40
  Ours                   BERT                         70.23   55.93   32.97   41.48
                         Landmark + BERT              71.01   57.38   36.56   44.66
                         3D + Landmark + BERT         71.12   57.87   35.89   44.30

The best values under different metrics using the same method (Kayatani et al. and ours) are in bold
Table 6  Segment-level results of Kayatani et al.'s method and our method (in %)

                                                      IoU = 0.25              IoU = 0.50              IoU = 0.75
  Method                 Input                        Pre     Rec     F1      Pre     Rec     F1      Pre     Rec     F1
  Kayatani et al. [16]   BERT                         66.18   42.02   51.41   36.34   23.07   28.22   10.14    6.44    7.87
                         Action unit + BERT           68.51   39.06   49.76   40.39   23.03   29.34   11.01    6.28    8.00
                         Char + Action unit + BERT    69.87   33.11   44.93   45.65   21.63   29.35   14.68    6.96    9.44
  Ours                   BERT                         75.49   35.23   48.04   52.61   24.58   33.51   20.87    9.75   13.29
                         Landmark + BERT              76.00   37.98   50.65   52.95   26.54   35.36   20.41   10.23   13.63
                         3D + Landmark + BERT         75.35   38.38   50.86   52.85   27.02   35.76   21.74   11.11   14.71

The best values under different metrics using the same method (Kayatani et al. and ours) are in bold
Table 3 shows our ablation study results. The accuracies of our method are above 60% for all input combinations. The recall and F1 scores are low when we use only pose and/or facial features; adding the language features improves accuracy, recall, and F1. The best-performing modality under our metrics is the linguistic one. We think this is because most humor in the dataset is triggered by what the actors are saying, while visually-induced humor segments are fewer than linguistically-induced ones. We also find that the visual modality does contribute to humor prediction, as the precision scores using only the visual modality can exceed 65%. The model that uses 3-D poses, facial landmarks, and subtitles as input has the best accuracy, while the model that uses facial landmarks and subtitles has better recall and F1 scores than the other input combinations.
We report the segment-level evaluation in Table 4 under different IoU thresholds between the predicted segments and the ground-truth segments. The language features contribute substantially to the predictions: when the input contains subtitles, the recall and precision scores are much better than those with only visual features. Among the results, the combination of 3-D poses, facial landmarks, and subtitles has the best recall at \({\text {IoU}}=0.25\), \({\text {IoU}}=0.50\), and \({\text {IoU}}=0.75\). It also has the best F1 scores at \({\text {IoU}}=0.50\) and \({\text {IoU}}=0.75\), making it the best-performing input in the segment-level evaluation.
We also compare our results with the method of [16]: our sliding-window-based method improves accuracy and precision by 4.41% and 10.47%, respectively, at the frame level compared with [16]. However, the F1 score is not as good as that of [16] using BERT only or the combination of action units and BERT. For the segment-level predictions at \({\text {IoU}} = 0.75\), the recall of our method using 3-D poses, facial landmarks, and subtitles is 4.15% higher than that of the sentence-level method. This implies that our predicted humor segments align with the ground truth better than those of the sentence-level method.
Table 7  Frame-level results on the test set under different lengths (scores are in %)

  Length (s)   Acc     Pre     Rec     F1
  4            68.82   52.65   24.16   33.73
  8            71.12   57.87   35.89   44.30
  12           69.71   56.52   23.78   33.47
  16           69.02   56.20   15.36   24.13

The best values under different metrics using the same type of input are in bold

5.2 Training with different lengths of sliding window

We evaluate the performance of different lengths of sliding windows to quantify their impact. We use 3-D poses and face landmarks as pose and face flow inputs along with the language flow input and show the frame-level and segment-level results for the lengths of sliding windows being 4 s, 8 s, 12 s, and 16 s in Tables 7 and 8, respectively. Note that we keep the shift of the sliding window to 2 s.
Table 8  Segment-level results on the test set under different lengths (scores are in %)

               IoU = 0.25              IoU = 0.50              IoU = 0.75
  Length (s)   Pre     Rec     F1      Pre     Rec     F1      Pre     Rec     F1
  4            62.50   30.14   40.67   45.70   22.06   29.76   15.95    7.70   10.38
  8            75.35   38.38   50.86   52.85   27.02   35.76   21.74   11.11   14.71
  12           71.53   23.57   35.45   46.48   15.36   23.09   16.14    5.33    8.02
  16           68.11   14.60   24.05   45.69    9.82   16.16   16.67    3.58    5.90

The best values under different metrics using the same type of input are in bold
From Tables 7 and 8, we can see that as the sliding window gets longer, the scores at both the frame and segment levels first increase and then decrease. When the length of the sliding window is 8 s, the results at both levels reach their highest scores. This means that a longer sliding window may lead to a performance drop. We think the reason is that humor is often triggered within a relatively short period; when the sliding window is long, information unrelated to the humor is fed into the network, degrading performance. Also, our dataset contains many language-based humor segments; when the sliding window is too short, the model only sees a single subtitle without any context that builds up the humor. Thus, the sliding window should have an appropriate length for better predictions.

5.3 Training time

Our experiments are performed on a computer with an Intel Core i7-8700K CPU, 32 GB of RAM, and an NVIDIA Titan RTX GPU. We show the training time for different inputs in Table 9.
Table 9  Training time on our dataset per epoch (in mm:ss)

  Input                    Training time
  Pose                     1:06
  Pose + face              2:01
  Pose + subtitle          20:08
  Face                     1:07
  Face + subtitle          20:46
  Subtitle                 19:45
  Pose + face + subtitle   23:07
From the table, when subtitles are not used, a single epoch takes no more than 3 min. However, when we take subtitles as input, an epoch takes more than 19 min. This is because BERT is a much larger network than the other parts of our method.

5.4 Qualitative results

We show some example predictions to qualitatively illustrate the behavior of our method. These examples, shown in Fig. 5, use our method with 3-D poses, facial landmarks, and subtitles, as well as [16] with character, face, and subtitle features. The green and orange bars in the timeline indicate the humor segments predicted by our method and by Kayatani et al.'s method, respectively, while the blue bars are the ground-truth humor segments. In (a), the humor is mainly triggered by funny actions (one of the actors is lifting his arms in front of his chest); our method captures these actions and finds the corresponding segments. Kayatani et al.'s method also predicts humor, but its IoU is lower than ours. In (b), the humor is mainly caused by the two persons lifting a wooden board together. Our predicted segment has some overlap with the ground truth, but it starts and ends earlier, while Kayatani et al.'s method fails to spot the humor segment. In (c), even though the actors are wearing special costumes, the humor segment only occupies a short time. Our method captures and predicts the humorous segment correctly, while Kayatani et al.'s method gives a longer segment. In (d), our method fails to capture the humor segment, while Kayatani et al.'s method catches it. This may be because this humor is triggered by a longer context, which our sliding-window-based model cannot capture, while Kayatani et al.'s method can because it takes five successive subtitles as input.

6 Conclusion and outlook

In this work, we proposed a multi-modal model to predict humor segments in videos. We first automatically annotated temporal segments of humor in a dataset and then presented a framework to predict humor in videos from multiple modalities. Our method used features from subtitles along with different kinds of pose and face features in the videos to make predictions. BERT was used to model the subtitles, and LSTM networks were set up to model the pose and facial expression features, respectively. Experimental results showed that our method outperformed the previous method by 4.41% in accuracy at the frame level and by 4.15% in recall at the segment level, giving a better understanding of humor.
However, this work still has some limitations. First, our model can hardly predict humor based on specific relationships between characters and other objects. Second, the data source is limited to sitcom videos, whose ground-truth laughter is easy to find. Third, our method only sums the results from multiple modalities for the final prediction, which may lose cross-modal information. In future work, we want to model the relationship between objects and people in the videos. We also plan to broaden the sources of the dataset, apply new fusion techniques, and migrate the method to other kinds of emotions to cover a wider range of applications.

Acknowledgements

This work was supported by JSPS KAKENHI No. 18H03264 and China Scholarship Council.

Declarations

Conflict of interest

The authors declare that they have no conflict of interest.
Open Access This article is licensed under a Creative Commons Attribution 4.0 International License, which permits use, sharing, adaptation, distribution and reproduction in any medium or format, as long as you give appropriate credit to the original author(s) and the source, provide a link to the Creative Commons licence, and indicate if changes were made. The images or other third party material in this article are included in the article's Creative Commons licence, unless indicated otherwise in a credit line to the material. If material is not included in the article's Creative Commons licence and your intended use is not permitted by statutory regulation or exceeds the permitted use, you will need to obtain permission directly from the copyright holder. To view a copy of this licence, visit http://creativecommons.org/licenses/by/4.0/.

Publisher's Note

Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.
Footnotes

1. This assumption does not count humor without accompanying laughter. We leave handling such humor as our future work.
Literature
1. Meyer, J.C.: Humor as a double-edged sword: four functions of humor in communication. Commun. Theory 10(3), 310–331 (2000)
2. Niculescu, A., van Dijk, B., Nijholt, A., Li, H., See, S.L.: Making social robots more attractive: the effects of voice pitch, humor and empathy. Int. J. Soc. Robot. 5, 171–191 (2013)
3. Mirnig, N., Stadler, S., Stollnberger, G., Giuliani, M., Tscheligi, M.: Robot humor: how self-irony and schadenfreude influence people's rating of robot likability. In: 2016 25th IEEE International Symposium on Robot and Human Interactive Communication (RO-MAN), pp. 166–171 (2016)
4. Gray, C., Webster, T., Ozarowicz, B., Chen, Y., Bui, T., Srivastava, A., Fitter, N.T.: "This bot knows what I'm talking about": human-inspired laughter classification methods for adaptive robotic comedians. In: 2022 31st IEEE International Conference on Robot and Human Interactive Communication (RO-MAN), pp. 1007–1014 (2022)
5. Kolb, W., Miller, T.: Human–computer interaction in pun translation. In: Using Technologies for Creative-Text Translation. Taylor & Francis, London (2022)
6. Hasan, M.K., Rahman, W., Zadeh, A., Zhong, J., Tanveer, M.I., Morency, L.-P., Hoque, M.E.: UR-FUNNY: a multimodal language dataset for understanding humor. In: EMNLP-IJCNLP, pp. 2046–2056 (2019)
7. Gultchin, L., Patterson, G., Baym, N., Swinger, N., Kalai, A.: Humor in word embeddings: cockamamie gobbledegook for nincompoops. In: ICML, pp. 2474–2483 (2019)
8. Chandrasekaran, A., Vijayakumar, A.K., Antol, S., Bansal, M., Batra, D., Zitnick, C.L., Parikh, D.: We are humor beings: understanding and predicting visual humor. In: CVPR, pp. 4603–4612 (2016)
9. Ortega-Bueno, R., Muniz-Cuza, C.E., Pagola, J.E.M., Rosso, P.: UO UPV: deep linguistic humor detection in Spanish social media. In: IberEval, pp. 204–213 (2018)
10. Sane, S.R., Tripathi, S., Sane, K.R., Mamidi, R.: Deep learning techniques for humor detection in Hindi–English code-mixed tweets. In: Workshop on Computational Approaches to Subjectivity, Sentiment and Social Media Analysis, pp. 57–61 (2019)
11. Kamal, A., Abulaish, M.: Self-deprecating humor detection: a machine learning approach. In: PACLING, pp. 483–494 (2019)
12. Choube, A., Soleymani, M.: Punchline detection using context-aware hierarchical multimodal fusion. In: ICMI, pp. 675–679 (2020)
13. Weller, O., Seppi, K.: Humor detection: a transformer gets the last laugh. In: EMNLP-IJCNLP, pp. 3612–3616 (2019)
14. Fan, X., Lin, H., Yang, L., Diao, Y., Shen, C., Chu, Y., Zou, Y.: Humor detection via an internal and external neural network. Neurocomputing, 105–111 (2020)
15. Czapla, B.F.P., Howard, J.: Applying a pre-trained language model to Spanish Twitter humor prediction. In: Iberian Languages Evaluation Forum (2019)
16. Kayatani, Y., Yang, Z., Otani, M., Garcia, N., Chu, C., Nakashima, Y., Takemura, H.: The laughing machine: predicting humor in video. In: WACV, pp. 2072–2081 (2021)
17. Patro, B.N., Lunayach, M., Srivastava, D., Singh, H., Namboodiri, V.P., et al.: Multimodal humor dataset: predicting laughter tracks for sitcoms. In: WACV, pp. 576–585 (2021)
18. Yang, Z., Ai, L., Hirschberg, J.: Multimodal indicators of humor in videos. In: MIPR, pp. 538–543 (2019)
19. Wendt, C.S., Berg, G.: Nonverbal humor as a new dimension of HRI. In: RO-MAN 2009 - The 18th IEEE International Symposium on Robot and Human Interactive Communication, pp. 183–188 (2009)
20. Cao, Z., Hidalgo Martinez, G., Simon, T., Wei, S., Sheikh, Y.A.: OpenPose: realtime multi-person 2D pose estimation using part affinity fields. IEEE TPAMI, 172–186 (2019)
21. Martinez, J., Hossain, R., Romero, J., Little, J.J.: A simple yet effective baseline for 3D human pose estimation. In: ICCV, pp. 2659–2668 (2017)
22. Ionescu, C., Papava, D., Olaru, V., Sminchisescu, C.: Human3.6M: large scale datasets and predictive methods for 3D human sensing in natural environments. IEEE TPAMI 36(7), 1325–1339 (2013)
23. Ekman, P., Rosenberg, E.L.: What the Face Reveals: Basic and Applied Studies of Spontaneous Expression Using the Facial Action Coding System (FACS). Oxford University Press, Oxford (1997)
24. Baltrusaitis, T., Zadeh, A., Lim, Y.C., Morency, L.-P.: OpenFace 2.0: facial behavior analysis toolkit. In: FG, pp. 59–66 (2018)
25. Devlin, J., Chang, M.-W., Lee, K., Toutanova, K.: BERT: pre-training of deep bidirectional transformers for language understanding. In: NAACL, pp. 4171–4186 (2019)
26. Ma, X., Xu, P., Wang, Z., Nallapati, R., Xiang, B.: Universal text representation from BERT: an empirical study. Preprint arXiv:1910.07973 (2019)
27. Zhang, Z., Wu, Y., Zhao, H., Li, Z., Zhang, S., Zhou, X., Zhou, X.: Semantics-aware BERT for language understanding. In: AAAI, pp. 9628–9635 (2020)
28. Yang, Z., Garcia, N., Chu, C., Otani, M., Nakashima, Y., Takemura, H.: BERT representations for video question answering. In: WACV, pp. 1556–1565 (2020)
29. Kingma, D.P., Ba, J.: Adam: a method for stochastic optimization. In: ICLR (2015)
Metadata
Title
Multi-modal humor segment prediction in video
Authors
Zekun Yang
Yuta Nakashima
Haruo Takemura
Publication date
03-06-2023
Publisher
Springer Berlin Heidelberg
Published in
Multimedia Systems / Issue 4/2023
Print ISSN: 0942-4962
Electronic ISSN: 1432-1882
DOI
https://doi.org/10.1007/s00530-023-01105-x
