1 Introduction
2 Related work
3 Dataset and task
Statistic | Value |
---|---|
Number of seasons | 10 |
Number of episodes | 228 |
Total duration (h:mm:ss) | 76:33:50 |
Number of segments | 63,814 |
Number of subtitles | 74,217 |
Humor segments: count | 31,851 |
Humor segments: min duration (s) | 0.042 |
Humor segments: avg duration (s) | 2.254 |
Humor segments: max duration (s) | 18.458 |
Humor segments: total duration (h:mm:ss) | 19:56:32 |
Non-humor segments: count | 31,963 |
Non-humor segments: min duration (s) | 0.042 |
Non-humor segments: avg duration (s) | 6.377 |
Non-humor segments: max duration (s) | 76.792 |
Non-humor segments: total duration (h:mm:ss) | 56:37:18 |
Humor subtitles: count | 33,408 |
Humor subtitles: average words | 7.38 |
Non-humor subtitles: count | 40,809 |
Non-humor subtitles: average words | 7.72 |
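The statistics above can be cross-checked with a few lines of arithmetic. The sketch below assumes that the durations are written as h:mm:ss; the helper name `hms_to_seconds` is introduced only for this illustration.

```python
def hms_to_seconds(text: str) -> int:
    """Convert an h:mm:ss string such as '19:56:32' to whole seconds."""
    hours, minutes, seconds = (int(part) for part in text.split(":"))
    return hours * 3600 + minutes * 60 + seconds


humor_total = hms_to_seconds("19:56:32")
non_humor_total = hms_to_seconds("56:37:18")
dataset_total = hms_to_seconds("76:33:50")

# Humor and non-humor segments partition the dataset,
# so their durations and counts should add up to the totals.
durations_consistent = humor_total + non_humor_total == dataset_total
counts_consistent = 31_851 + 31_963 == 63_814
subtitles_consistent = 33_408 + 40_809 == 74_217

# Average humor segment length recomputed from total duration and count.
humor_avg = humor_total / 31_851

# Share of the total running time that is labelled as humor.
humor_time_share = humor_total / dataset_total
```

Under this reading, humor segments make up roughly half of all segments but only about a quarter of the running time, because they are much shorter on average than non-humor segments.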
4 Finding humor segments in video
4.1 Pose flow
4.2 Face flow
4.3 Language flow
The subtitles that fall within the sliding window are wrapped in the special tokens [CLS] and [SEP] and fed to BERT. If there are no subtitles in the sliding window, only these special tokens are passed to BERT. The output corresponding to [CLS] is then fed into a fully connected (FC) layer to obtain the score vector \({e}^{{\textrm{S}}}_{i}\).
4.4 Prediction and training
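The input construction for the language flow (Section 4.3) can be illustrated with a small sketch. It is not the actual preprocessing code: whitespace splitting stands in for BERT's WordPiece tokenizer, the function name `build_language_input` is hypothetical, and truncating to the maximum token count by simply cutting off the tail is an assumption.

```python
from typing import List

CLS, SEP = "[CLS]", "[SEP]"


def build_language_input(subtitles: List[str], max_tokens: int = 128) -> List[str]:
    """Wrap the subtitles of one sliding window in [CLS] ... [SEP].

    Whitespace splitting is a stand-in for WordPiece tokenization (assumption).
    Lower-casing mirrors the behaviour of an uncased model.
    """
    words = " ".join(subtitles).lower().split()
    # Reserve two positions for the special tokens (assumed truncation rule).
    words = words[: max_tokens - 2]
    return [CLS, *words, SEP]


# A window without subtitles yields only the two special tokens.
empty_window = build_language_input([])
spoken_window = build_language_input(["Hello there", "How are you"])
```

In a full model, the hidden state at the [CLS] position of this sequence would be passed to the FC layer to produce the language score vector.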
4.5 Converting frame predictions to temporal segments
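As a rough illustration of this step, the sketch below merges runs of consecutive positive frame predictions into temporal segments. The frame rate of 24 fps, the 0.5 threshold, and the function name `frames_to_segments` are assumptions made for the example and are not details taken from the method.

```python
from typing import List, Tuple


def frames_to_segments(
    scores: List[float], fps: float = 24.0, threshold: float = 0.5
) -> List[Tuple[float, float]]:
    """Merge runs of consecutive positive frames into (start, end) times in seconds."""
    segments: List[Tuple[float, float]] = []
    run_start = None  # index of the first frame of the current positive run
    for index, score in enumerate(scores):
        positive = score >= threshold
        if positive and run_start is None:
            run_start = index
        elif not positive and run_start is not None:
            # The run ended on the previous frame; close the segment here.
            segments.append((run_start / fps, index / fps))
            run_start = None
    if run_start is not None:
        # A positive run that reaches the last frame ends with the video.
        segments.append((run_start / fps, len(scores) / fps))
    return segments


example_segments = frames_to_segments([0.1, 0.9, 0.8, 0.2, 0.7], fps=1.0)
```

With one frame per second, the example yields the segments (1.0, 3.0) and (4.0, 5.0).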
Parameters | Setting |
---|---|
Learning rate | \(2 \times {10}^{-5}\) |
Number of epochs | 3 |
Training batch size | 32 |
Inference batch size | 16 |
Max number of tokens | 128 |
Optimizer | Adam [29] |
Weight decay | \(1 \times {10}^{-5}\) |
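For reproducibility, the settings in the table above can be collected in a single configuration object. The following is a minimal sketch; the class name `TrainingConfig`, its field names, and the `steps_per_epoch` helper are hypothetical and only mirror the table.

```python
import math
from dataclasses import dataclass


@dataclass(frozen=True)
class TrainingConfig:
    """Hyperparameters copied from the settings table."""

    learning_rate: float = 2e-5
    num_epochs: int = 3
    train_batch_size: int = 32
    inference_batch_size: int = 16
    max_tokens: int = 128
    optimizer: str = "adam"
    weight_decay: float = 1e-5

    def steps_per_epoch(self, num_samples: int) -> int:
        """Number of optimizer steps needed to see every training sample once."""
        return math.ceil(num_samples / self.train_batch_size)


config = TrainingConfig()
```

Freezing the dataclass keeps the values from being changed accidentally between runs.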
5 Experimental results
The bert-base-uncased model, with 12 layers, a hidden size of 768, 12 self-attention heads, and 110 million parameters, is used for feature representation. The model makes no distinction between upper-case and lower-case tokens in the input sequence. Cross-entropy loss is applied for training.
Pose | Face | Subtitle | Acc | Pre | Rec | F1 |
---|---|---|---|---|---|---|
All positive | | | 32.00 | 32.00 | 100.00 | 48.49 |
All negative | | | 68.00 | 0.00 | 0.00 | 0.00 |
2D | – | – | 68.84 | 66.27 | 5.36 | 9.92 |
3D | – | – | 68.57 | 70.95 | 3.04 | 5.83 |
– | Landmark | – | 68.93 | 66.31 | 5.93 | 10.89 |
– | Action unit | – | 67.98 | 33.33 | 0.02 | 0.05 |
– | – | BERT | 70.23 | 55.93 | 32.97 | 41.48 |
2D | Landmark | – | 68.97 | 63.71 | 7.06 | 12.71 |
2D | Action unit | – | 68.62 | 59.62 | 6.00 | 10.91 |
3D | Landmark | – | 68.86 | 65.15 | 5.81 | 10.67 |
3D | Action unit | – | 68.56 | 70.33 | 3.06 | 5.87 |
2D | – | BERT | 70.75 | 57.23 | 34.09 | 42.73 |
3D | – | BERT | 70.94 | 57.53 | 35.10 | 43.60 |
– | Landmark | BERT | 71.01 | 57.38 | 36.56 | 44.66 |
– | Action unit | BERT | 70.16 | 59.76 | 20.65 | 34.69 |
2D | Landmark | BERT | 70.94 | 57.48 | 33.41 | 43.82 |
2D | Action unit | BERT | 70.51 | 59.56 | 24.45 | 34.67 |
3D | Landmark | BERT | 71.12 | 57.87 | 35.89 | 44.30 |
3D | Action unit | BERT | 70.33 | 56.08 | 33.66 | 42.07 |
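The F1 column in the frame-level table is the harmonic mean of precision and recall. The sketch below recomputes it for two rows; because the published precision and recall are rounded, the result can differ from the table in the last digit. The function name `f1_score` is chosen for this illustration.

```python
def f1_score(precision: float, recall: float) -> float:
    """Harmonic mean of precision and recall, both given in percent."""
    if precision + recall == 0:
        return 0.0  # convention used for the all-negative baseline
    return 2 * precision * recall / (precision + recall)


# "All positive" baseline: precision equals the positive rate, recall is 100%.
all_positive_f1 = f1_score(32.00, 100.00)

# Subtitle-only (BERT) row.
bert_only_f1 = f1_score(55.93, 32.97)
```

The all-positive baseline shows why accuracy alone is not informative here: it reaches a higher F1 than any learned model in the table, even though its accuracy is only 32%.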
Pose | Face | Subtitle | Pre (IoU = 0.25) | Rec (IoU = 0.25) | F1 (IoU = 0.25) | Pre (IoU = 0.50) | Rec (IoU = 0.50) | F1 (IoU = 0.50) | Pre (IoU = 0.75) | Rec (IoU = 0.75) | F1 (IoU = 0.75) |
---|---|---|---|---|---|---|---|---|---|---|---|
2D | – | – | 63.00 | 6.88 | 12.40 | 42.49 | 4.64 | 8.36 | 15.75 | 1.72 | 3.10 |
3D | – | – | 64.25 | 4.60 | 8.58 | 39.66 | 2.84 | 5.30 | 8.94 | 0.64 | 1.19 |
– | Landmark | – | 64.00 | 7.04 | 12.68 | 41.45 | 4.56 | 8.21 | 15.27 | 1.68 | 3.02 |
– | Action unit | – | 33.33 | 0.04 | 0.08 | 0.00 | 0.00 | 0.00 | 0.00 | 0.00 | 0.00 |
– | – | BERT | 75.49 | 35.23 | 48.04 | 52.61 | 24.58 | 33.51 | 20.87 | 9.75 | 13.29 |
2D | Landmark | – | 60.28 | 8.56 | 14.99 | 43.10 | 6.12 | 10.71 | 19.44 | 2.76 | 4.83 |
2D | Action unit | – | 56.06 | 8.88 | 15.33 | 36.36 | 5.76 | 9.94 | 16.41 | 2.60 | 4.49 |
3D | Landmark | – | 63.33 | 6.84 | 12.34 | 40.37 | 4.36 | 7.86 | 15.93 | 1.72 | 3.10 |
3D | Action unit | – | 62.98 | 4.56 | 8.50 | 39.23 | 2.84 | 5.30 | 9.39 | 0.68 | 1.27 |
2D | – | BERT | 74.65 | 36.27 | 48.82 | 51.97 | 25.30 | 34.03 | 21.18 | 10.31 | 13.84 |
3D | – | BERT | 76.07 | 37.00 | 49.78 | 54.44 | 26.48 | 35.63 | 22.29 | 10.84 | 14.59 |
– | Landmark | BERT | 76.00 | 37.98 | 50.65 | 52.95 | 26.54 | 35.36 | 20.41 | 10.23 | 13.63 |
– | Action unit | BERT | 76.77 | 24.71 | 37.39 | 55.21 | 17.79 | 26.90 | 22.08 | 7.11 | 10.76 |
2D | Landmark | BERT | 76.48 | 38.34 | 51.08 | 52.07 | 26.14 | 34.81 | 21.02 | 10.55 | 14.05 |
2D | Action unit | BERT | 76.46 | 28.83 | 41.87 | 55.34 | 20.90 | 30.35 | 23.39 | 8.83 | 12.82 |
3D | Landmark | BERT | 75.35 | 38.38 | 50.86 | 52.85 | 27.02 | 35.76 | 21.74 | 11.11 | 14.71 |
3D | Action unit | BERT | 75.47 | 35.19 | 48.00 | 51.46 | 24.02 | 32.75 | 20.03 | 9.35 | 12.75 |
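Segment-level scores at an IoU threshold can be computed along the following lines. This is a sketch under stated assumptions: the greedy one-to-one matching between predicted and ground-truth segments is one common protocol and is not confirmed to be the one used for the table, and both function names are hypothetical.

```python
from typing import List, Tuple

Segment = Tuple[float, float]  # (start, end) in seconds


def temporal_iou(a: Segment, b: Segment) -> float:
    """Intersection over union of two time intervals."""
    intersection = max(0.0, min(a[1], b[1]) - max(a[0], b[0]))
    union = (a[1] - a[0]) + (b[1] - b[0]) - intersection
    return intersection / union if union > 0 else 0.0


def precision_recall_at_iou(
    predicted: List[Segment], ground_truth: List[Segment], threshold: float
) -> Tuple[float, float]:
    """Greedy one-to-one matching (assumed protocol): each ground-truth
    segment can be claimed by at most one predicted segment."""
    unmatched = list(ground_truth)
    true_positives = 0
    for segment in predicted:
        best = max(unmatched, key=lambda gt: temporal_iou(segment, gt), default=None)
        if best is not None and temporal_iou(segment, best) >= threshold:
            unmatched.remove(best)
            true_positives += 1
    precision = true_positives / len(predicted) if predicted else 0.0
    recall = true_positives / len(ground_truth) if ground_truth else 0.0
    return precision, recall


predicted = [(0.0, 4.0), (10.0, 12.0)]
ground_truth = [(2.0, 6.0), (20.0, 22.0)]
loose = precision_recall_at_iou(predicted, ground_truth, 0.25)
strict = precision_recall_at_iou(predicted, ground_truth, 0.50)
```

In the example, the first prediction overlaps its ground-truth segment with an IoU of one third, so it counts as a hit at IoU = 0.25 but not at IoU = 0.50. This is the same effect that drives the drop in scores from left to right in the table.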
5.1 Quantitative results
Method | Input | Acc | Pre | Rec | F1 |
---|---|---|---|---|---|
Kayatani et al. [16] | BERT | 62.94 | 44.10 | 59.04 | 50.49 |
Kayatani et al. [16] | Action unit + BERT | 65.20 | 46.14 | 52.13 | 48.95 |
Kayatani et al. [16] | Char + Action unit + BERT | 66.71 | 47.40 | 39.88 | 43.40 |
Ours | BERT | 70.23 | 55.93 | 32.97 | 41.48 |
Ours | Landmark + BERT | 71.01 | 57.38 | 36.56 | 44.66 |
Ours | 3D + Landmark + BERT | 71.12 | 57.87 | 35.89 | 44.30 |
Method | Input | Pre (IoU = 0.25) | Rec (IoU = 0.25) | F1 (IoU = 0.25) | Pre (IoU = 0.50) | Rec (IoU = 0.50) | F1 (IoU = 0.50) | Pre (IoU = 0.75) | Rec (IoU = 0.75) | F1 (IoU = 0.75) |
---|---|---|---|---|---|---|---|---|---|---|
Kayatani et al. [16] | BERT | 66.18 | 42.02 | 51.41 | 36.34 | 23.07 | 28.22 | 10.14 | 6.44 | 7.87 |
Kayatani et al. [16] | Action unit + BERT | 68.51 | 39.06 | 49.76 | 40.39 | 23.03 | 29.34 | 11.01 | 6.28 | 8.00 |
Kayatani et al. [16] | Char + Action unit + BERT | 69.87 | 33.11 | 44.93 | 45.65 | 21.63 | 29.35 | 14.68 | 6.96 | 9.44 |
Ours | BERT | 75.49 | 35.23 | 48.04 | 52.61 | 24.58 | 33.51 | 20.87 | 9.75 | 13.29 |
Ours | Landmark + BERT | 76.00 | 37.98 | 50.65 | 52.95 | 26.54 | 35.36 | 20.41 | 10.23 | 13.63 |
Ours | 3D + Landmark + BERT | 75.35 | 38.38 | 50.86 | 52.85 | 27.02 | 35.76 | 21.74 | 11.11 | 14.71 |
Length (s) | Acc | Pre | Rec | F1 |
---|---|---|---|---|
4 | 68.82 | 52.65 | 24.16 | 33.73 |
8 | 71.12 | 57.87 | 35.89 | 44.30 |
12 | 69.71 | 56.52 | 23.78 | 33.47 |
16 | 69.02 | 56.20 | 15.36 | 24.13 |
5.2 Training with different lengths of sliding window
Length (s) | Pre (IoU = 0.25) | Rec (IoU = 0.25) | F1 (IoU = 0.25) | Pre (IoU = 0.50) | Rec (IoU = 0.50) | F1 (IoU = 0.50) | Pre (IoU = 0.75) | Rec (IoU = 0.75) | F1 (IoU = 0.75) |
---|---|---|---|---|---|---|---|---|---|
4 | 62.50 | 30.14 | 40.67 | 45.70 | 22.06 | 29.76 | 15.95 | 7.70 | 10.38 |
8 | 75.35 | 38.38 | 50.86 | 52.85 | 27.02 | 35.76 | 21.74 | 11.11 | 14.71 |
12 | 71.53 | 23.57 | 35.45 | 46.48 | 15.36 | 23.09 | 16.14 | 5.33 | 8.02 |
16 | 68.11 | 14.60 | 24.05 | 45.69 | 9.82 | 16.16 | 16.67 | 3.58 | 5.90 |
5.3 Training time
Input | Training time |
---|---|
Pose | 1:06 |
Pose + face | 2:01 |
Pose + subtitle | 20:08 |
Face | 1:07 |
Face + subtitle | 20:46 |
Subtitle | 19:45 |
Pose + face + subtitle | 23:07 |