1 Introduction
1.1 Drowsiness detection
1.1.1 Vehicle-based measurements
1.1.2 Physiological measurements
1.1.3 Computer vision techniques
Study | Methodology |
---|---|
Park et al. [30] | Three pre-trained deep neural networks (AlexNet, VGG-FaceNet and FlowImageNet) along with two ensemble strategies (independently averaged architecture and feature-fused architecture) classify each video frame as drowsy or not |
Jiménez et al. [20] | Haar classifiers detect head, eye and mouth segments in video frames; a neural network then quantifies the level of driver distraction in each frame |
Ying et al. [59] | Colour detection methods locate the face, mouth and eyes of the driver, followed by a three-layered neural network to assess states of the eyes (closed, open, narrow) and mouth (closed, open normally, open widely). The monitoring system can provide several types of warnings |
Huynh et al. [16] | A face tracking algorithm clips each video frame and feeds it to a 3D convolutional neural network. A boosting technique combined with semi-supervised learning using the validation set further enhances accuracy |
Ribarić et al. [34] | Evaluates head rotations and eye and mouth openness, while a knowledge-based decision model decides whether to issue a warning or alarm based on the detected drowsiness level |
Ji et al. [17] | Various algorithms track the driver’s head and eyelid movement, gaze and facial expression; a Bayesian network assesses fatigue using these features |
Jiangwei et al. [19] | Detects mouth movements as a single feature for a three-layered neural network, classifying each frame as dozing (mouth wide open), talking (moderately open) or silent (mouth closed) |
Rong-ben et al. [35] | Detects drowsiness by focussing purely on the driver’s eyes |
Flores et al. [12] | Machine learning techniques detect a driver’s face and eyes in a single frame; these features are tracked over time by a neural network. The frequency of eye blinks is used as a drowsiness indicator, while head position is used as an indicator for distraction |
Lenskiy and Lee [27] | A neural network and facial segmentation algorithm obtain the driver’s facial features, which are then used to track the iris and detect blinking. Eye closures longer than 220 ms are classified as drowsy |
Harada et al. [13] | Calculates pupil diameter from eye-tracking data, which is used in a recurrent neural network to predict a driver’s distraction level |
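Many of the pipelines summarised above reduce to thresholding a small set of hand-crafted signals. As a purely illustrative sketch (not code from any of the cited studies), the 220 ms eye-closure rule of Lenskiy and Lee [27] could be applied to a per-frame eye-state signal as follows; the per-frame eye-state classifier that produces this signal is assumed and not shown, and the frame rate is an assumption.

```python
# Illustrative only: a fixed eye-closure-duration rule in the spirit of
# Lenskiy and Lee [27]; the detector producing `eye_closed` is assumed.
FPS = 30                 # assumed camera frame rate
THRESHOLD_MS = 220       # closures longer than this are treated as drowsiness

def is_drowsy(eye_closed, fps=FPS, threshold_ms=THRESHOLD_MS):
    """eye_closed: per-frame booleans; True if any closure run exceeds the threshold."""
    frame_ms = 1000.0 / fps
    run = 0
    for closed in eye_closed:
        run = run + 1 if closed else 0
        if run * frame_ms > threshold_ms:
            return True
    return False

print(is_drowsy([True] * 6 + [False] * 24))   # ~200 ms blink -> False
print(is_drowsy([True] * 9 + [False] * 21))   # ~300 ms closure -> True
```

In contrast, the approach presented in this paper learns such distinctions directly from labelled video rather than from a pre-specified threshold.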
1.2 Action recognition on video benchmarks
2 Methods
2.1 Initial considerations for mobile deployment
2.2 Data
2.3 Pre-processing
2.4 Neural network architecture
Input dimension | Operator | t | c | n | s | Input dimension | Operator | t | c | n | s
---|---|---|---|---|---|---|---|---|---|---|---
MobileNetV2_1.4 | | | | | | Ours | | | | |
| | | | | | \(\mathbf{224^2 \times 10 \times 1}\) | conv3d | – | 48 | 1 | 2
\(224^2 \times 3\) | conv2d | – | 48 | 1 | 2 | \(\mathbf{112^2 \times 5 \times 48}\) | bottleneck3d | 1 | 24 | 1 | [1, 1, 5]
\(112^2 \times 48\) | bottleneck | 1 | 24 | 1 | 1 | \(\mathbf{112^2 \times 1 \times 24}\) | squeeze | – | – | 1 | –
\(112^2 \times 24\) | bottleneck | 6 | 32 | 2 | 2 | \(112^2 \times 24\) | bottleneck | 6 | 32 | 2 | 2
\(56^2 \times 32\) | bottleneck | 6 | 48 | 3 | 2 | \(56^2 \times 32\) | bottleneck | 6 | 48 | 3 | 2
\(28^2 \times 48\) | bottleneck | 6 | 88 | 4 | 2 | \(28^2 \times 48\) | bottleneck | 6 | 88 | 4 | 2
\(14^2 \times 88\) | bottleneck | 6 | 136 | 3 | 1 | \(14^2 \times 88\) | bottleneck | 6 | 136 | 3 | 1
\(14^2 \times 136\) | bottleneck | 6 | 224 | 3 | 2 | \(14^2 \times 136\) | bottleneck | 6 | 224 | 3 | 2
\(7^2 \times 224\) | bottleneck | 6 | 448 | 1 | 1 | \(7^2 \times 224\) | bottleneck | 6 | 448 | 1 | 1
\(7^2 \times 448\) | conv2d 1\(\times\)1 | – | 1792 | 1 | 1 | \(7^2 \times 448\) | conv2d 1\(\times\)1 | – | 1792 | 1 | 1
\(7^2 \times 1792\) | avgpool 7\(\times\)7 | – | – | 1 | – | \(7^2 \times 1792\) | avgpool 7\(\times\)7 | – | – | 1 | –
\(1 \times 1 \times 1792\) | conv2d 1\(\times\)1 | – | 2 | – | – | \(1 \times 1 \times 1792\) | conv2d 1\(\times\)1 | – | 2 | – | –
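To make the right-hand column of the table above concrete, the sketch below (PyTorch, not the authors' released code) shows the modified stem: a single-channel, 10-frame clip passes through a strided 3D convolution, after which a bottleneck3d with temporal stride 5 collapses the frame axis so that the remaining, unchanged 2D MobileNetV2 layers can be applied. The 3×3×3 kernel size and the batch-normalisation/ReLU6 pattern are assumptions in line with MobileNetV2; the bottleneck3d block itself is sketched after the next table.

```python
import torch
import torch.nn as nn

# Sketch of the first, 3D layer in the "Ours" column: 224^2 x 10 x 1 -> 112^2 x 5 x 48.
stem = nn.Sequential(
    nn.Conv3d(1, 48, kernel_size=3, stride=(2, 2, 2), padding=1, bias=False),
    nn.BatchNorm3d(48),
    nn.ReLU6(inplace=True),
)

clip = torch.randn(1, 1, 10, 224, 224)   # (batch, channels, frames, height, width)
x = stem(clip)                            # -> (1, 48, 5, 112, 112)

# A bottleneck3d with stride [1, 1, 5] (see the next table and the sketch after it)
# then reduces the frame axis to length 1, after which it is squeezed away:
# x = bottleneck3d(x)                     # -> (1, 24, 1, 112, 112)
# x = x.squeeze(2)                        # -> (1, 24, 112, 112), the usual 2D MobileNetV2 input
```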
Input dimension | Operator | Output dimension |
---|---|---|
\(h \times w \times f \times k\) | \(1 \times 1 \times 1\) conv3d, ReLU6 | \(h \times w \times f \times tk\) |
\(h \times w \times f \times tk\) | Depthwise with stride [\(s_1\), \(s_2\), \(s_3\)], ReLU6 | \(\frac{h}{s_1} \times \frac{w}{s_2} \times \frac{f}{s_3} \times tk\) |
\(\frac{h}{s_1} \times \frac{w}{s_2} \times \frac{f}{s_3} \times tk\) | Linear \(1 \times 1 \times 1\) conv3d | \(\frac{h}{s_1} \times \frac{w}{s_2} \times \frac{f}{s_3} \times c\) |
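A minimal PyTorch sketch of this 3D inverted residual block is given below; it is one interpretation of the table above, not the authors' implementation, and the class name `Bottleneck3D`, the 3×3×3 depthwise kernel, the batch normalisation, and the residual connection for stride-1 blocks with matching channel counts are assumptions that follow the standard MobileNetV2 design.

```python
import torch
import torch.nn as nn

class Bottleneck3D(nn.Module):
    """1x1x1 conv3d expansion (ReLU6) -> strided depthwise 3D conv (ReLU6) -> linear 1x1x1 conv3d."""

    def __init__(self, in_ch, out_ch, expansion, stride):
        super().__init__()
        hidden = in_ch * expansion                      # tk channels after expansion
        self.use_residual = stride == (1, 1, 1) and in_ch == out_ch
        layers = []
        if expansion != 1:                              # t = 1 blocks skip the expansion conv
            layers += [nn.Conv3d(in_ch, hidden, 1, bias=False),
                       nn.BatchNorm3d(hidden),
                       nn.ReLU6(inplace=True)]
        layers += [
            nn.Conv3d(hidden, hidden, 3, stride=stride, padding=1,
                      groups=hidden, bias=False),       # depthwise 3D conv with stride [s1, s2, s3]
            nn.BatchNorm3d(hidden),
            nn.ReLU6(inplace=True),
            nn.Conv3d(hidden, out_ch, 1, bias=False),   # linear projection to c channels
            nn.BatchNorm3d(out_ch),
        ]
        self.block = nn.Sequential(*layers)

    def forward(self, x):
        out = self.block(x)
        return x + out if self.use_residual else out

# The stem's bottleneck3d from the architecture table: 112^2 x 5 x 48 -> 112^2 x 1 x 24.
# PyTorch orders strides as (frames, height, width), i.e. [s3, s1, s2] in the table's notation.
block = Bottleneck3D(48, 24, expansion=1, stride=(5, 1, 1))
y = block(torch.randn(1, 48, 5, 112, 112))              # -> (1, 24, 1, 112, 112)
```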
2.5 Model calibration
2.6 Phone application
3 Results
Accuracy (%) by scenario:

Scenario | InceptionV1 (high compute) | I3D (high compute) | MobileNetV2_1.4 (mobile deployment) | Ours (mobile deployment) | Human accuracy
---|---|---|---|---|---
No glasses | 76.0 | 78.9 | 74.3 | 75.4 | 82.0 |
Glasses | 70.2 | 65.7 | 77.7 | 77.4 | 78.8 |
Sunglasses | 55.9 | 74.7 | 59.3 | 76.8 | 80.9 |
Night—no glasses | 73.0 | 79.7 | 73.8 | 76.1 | 82.5 |
Night—glasses | 68.8 | 76.9 | 71.9 | 63.6 | 79.9 |
All | 69.6 | 75.4 | 71.8 | 73.9 | 80.8 |
3.1 Sensitivity analysis
Assumption | Impact (reported as “accuracy in % (parameter value)”) | | | |
---|---|---|---|---|---|
Length of input sample (frames) | 65.9 (5) | 73.9 (10) | 77.6 (30) | 75.2 (60) | |
Pre-processing steps: random flip, brightness (light), translate/zoom/stretch (tzs); see the sketch after these tables | 64.0 (none) | 66.0 (flip) | 66.2 (light) | 68.7 (tzs) | 73.9 (all)
Fusion of spatial and temporal information | 73.9 (early) | 74.4 (late) | 71.0 (slow)\(^\text {a}\) | ||
Depth multiplier in MobileNetV2 | 68.9 (0.35) | 70.9 (0.75) | 73.9 (1.4) | ||
Pre-training on ImageNet (IN), Kinetics (K), IN&K, or no pre-training (none) | 68.8 (none) | 69.6 (IN) | 73.6 (K) | 73.9 (IN&K) | |
Fine-tuning: final layer only (final), all except early layers (most), or all layers | 50.5 (final) | 68.8 (most) | 73.9 (all) | ||
Initial learning rate for fine-tuning | 65.5 (0.001) | 73.9 (0.005) | 71.6 (0.01) | ||
Weight decay for fine-tuning | 70.9 (0) | 73.9 (1E−07) | 72.0 (4E−05) |
Assumption | Impact (reported as “inference time (parameter value)”) | |||
---|---|---|---|---|
Length of input sample (frames) | 1.0 (5) | 1.1 (10) | 1.9 (30) | — (60) |
Fusion of spatial and temporal information | 1.1 (early) | 5.6 (late) | 31.7 (slow) | |
Depth multiplier in MobileNetV2 | 0.5 (0.35) | 0.8 (0.75) | 1.1 (1.4) |
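The pre-processing row above lists random flipping, brightness perturbation and translate/zoom/stretch augmentations. A minimal clip-level sketch of such a pipeline is shown below (PyTorch); the parameter ranges and the crop-and-resize implementation of translate/zoom/stretch are illustrative assumptions, not the settings used in this study.

```python
import torch
import torch.nn.functional as F

def augment_clip(clip):
    """clip: (frames, 1, H, W) float tensor in [0, 1]; the same random transform is applied to every frame."""
    # Random horizontal flip
    if torch.rand(1).item() < 0.5:
        clip = torch.flip(clip, dims=[-1])
    # Random brightness change (illustrative range of +/- 20 %)
    factor = 0.8 + 0.4 * torch.rand(1).item()
    clip = torch.clamp(clip * factor, 0.0, 1.0)
    # Translate/zoom/stretch: random crop with independent height/width scales,
    # resized back to the original resolution
    _, _, h, w = clip.shape
    ch = int(h * (0.8 + 0.2 * torch.rand(1).item()))
    cw = int(w * (0.8 + 0.2 * torch.rand(1).item()))
    top = torch.randint(0, h - ch + 1, (1,)).item()
    left = torch.randint(0, w - cw + 1, (1,)).item()
    clip = clip[..., top:top + ch, left:left + cw]
    return F.interpolate(clip, size=(h, w), mode="bilinear", align_corners=False)

clip = torch.rand(10, 1, 224, 224)    # ten greyscale frames
augmented = augment_clip(clip)        # same shape, randomly perturbed
```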
3.1.1 Sample length
3.1.2 Pre-processing steps
3.1.3 Information fusion
3.1.4 Depth multiplier
3.1.5 Pre-training
3.1.6 Fine-tuning
4 Discussion
- Our method implicitly learns which features are important for drowsiness detection, rather than requiring the developer to pre-specify a limited feature set such as eyelid closure and mouth position, which risks missing informative cues such as outer brow raises, frowning, chin raises and nose wrinkles [49]. The presented methodology can capture these features, provided the data labels are of sufficient quality.
- A second advantage is that spatial and temporal information is fused directly. Three-dimensional convolution filters incorporate the time dimension, allowing the model to distinguish blinking from micro-sleep and talking from yawning, and to identify important facial movements.
- A third advantage (and future direction, see below) is that, given suitable data, the model can readily be trained for other tasks, such as detecting different levels of distraction. Distraction shows similar associations with crash risk (e.g. [24, 48]), as fatigue and distraction both increase crash risk by withdrawing the driver's attention from the driving task [3, 54]. Further, distraction-related crashes resulted in more than 3400 fatalities and 391,000 injuries in 2015 [29]. Distraction encompasses a broad range of behaviours, including interacting with passengers, glancing away from the road, adjusting or monitoring in-vehicle systems, and consuming food or beverages. Some existing drowsiness detection models use face tracking and clipping as a first step, discarding important cues such as hand movements, while others would need to extend their feature detection step with distraction-specific features. Because our method uses the complete video frame, distraction-related actions can be incorporated directly by training on labelled distraction-related videos.