Introduction
Motivation
Related works
Video surveillance
Background subtraction
Object detection
Pedestrian detection
Methods
Video surveillance data
Step 1: Extraction of moving objects with background subtraction
Annotation of ROIs extracted in Step 1
Image size (smaller than) | 32 | 64 | 96 | 128 | 160 | 192 | 224 | 256 | 320 | 800 |
Proportion (%) | 2 | 41 | 65 | 77 | 86 | 92 | 96 | 98 | 99 | 100 |
(a) Extracted and selected data from 5 cameras—dataset A | |||
---|---|---|---|
Person class | # of images | Non-person class | # of images |
Person-full | 484 | Car-part | 333 |
Person-bend | 60 | Car | 41 |
Person-upper | 211 | Animals | 29 |
Person-head | 60 | etc. | 772 |
Person-cluster | 356 | ||
Person-etc. | 84 | ||
Total | 1255 | Total | 1125 |
(b) Extracted and selected data from different cameras | |||||
---|---|---|---|---|---|
Roadset1 | | Roadset2 | | Nightset | |
Class | # of images | Class | # of images | Class | # of images |
Person | 100 | Person | 103 | Person | 153 |
Non-person | 141 | Non-person | 107 | Non-person | 305 |
Total | 241 | Total | 210 | Total | 458 |
Step 2: CNN object classification
Image size | 32 × 32 | 64 × 64 | 128 × 128 |
---|---|---|---|
# of feature maps in first convolutional layer | 32 | 64 | 128 |
# of feature maps in second convolutional layer | 32 | 64 | 128 |
# of units in first fully connected layer | 192 | 384 | 768 |
# of units in second fully connected layer | 96 | 192 | 384 |
Results
Empirical setting
Evaluation with images obtained from the same group of cameras
Input size | Precision (Person) | Precision (Non-person) | Recall (Person) | Recall (Non-person) | Accuracy |
---|---|---|---|---|---|
(a) Full-set | |||||
32 × 32 | 0.80 | 0.83 | 0.82 | 0.81 | 0.81 |
64 × 64 | 0.83 | 0.83 | 0.81 | 0.84 | 0.83 |
128 × 128 | 0.83 | 0.85 | 0.85 | 0.84 | 0.84 |
(b) Part-set | |||||
32 × 32 | 0.82 | 0.82 | 0.82 | 0.81 | 0.82 |
64 × 64 | 0.87 | 0.82 | 0.83 | 0.86 | 0.84 |
128 × 128 | 0.86 | 0.84 | 0.84 | 0.86 | 0.85 |
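The precision, recall, and accuracy columns above can be read with the standard per-class definitions in mind. The following is a minimal sketch of those definitions, not the authors' evaluation code; the function name `per_class_metrics` and the label strings `"person"` and `"non-person"` are assumptions introduced for illustration.

```python
def per_class_metrics(y_true, y_pred, classes=("person", "non-person")):
    """Return ({class: {"precision", "recall"}}, accuracy) for label lists.

    Hypothetical helper: names and label strings are illustrative only.
    """
    if len(y_true) != len(y_pred):
        raise ValueError("y_true and y_pred must have the same length")

    pairs = list(zip(y_true, y_pred))
    report = {}
    for c in classes:
        # True positives: samples of class c predicted as c.
        tp = sum(1 for t, p in pairs if t == c and p == c)
        # False positives: samples of another class predicted as c.
        fp = sum(1 for t, p in pairs if t != c and p == c)
        # False negatives: samples of class c predicted as something else.
        fn = sum(1 for t, p in pairs if t == c and p != c)
        report[c] = {
            "precision": tp / (tp + fp) if (tp + fp) else 0.0,
            "recall": tp / (tp + fn) if (tp + fn) else 0.0,
        }

    # Accuracy: fraction of all samples whose prediction matches the label.
    correct = sum(1 for t, p in pairs if t == p)
    accuracy = correct / len(pairs) if pairs else 0.0
    return report, accuracy


# Toy labels for illustration only; these are not data from the paper.
y_true = ["person", "person", "person", "non-person", "non-person"]
y_pred = ["person", "person", "non-person", "non-person", "person"]
report, accuracy = per_class_metrics(y_true, y_pred)
```

Because the task is binary, a person sample misclassified as non-person lowers person recall and non-person precision at the same time, which is why the two columns tend to move together in the table.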
 | Our CNN model | YOLO |
---|---|---|
Input size | 64 × 64 | 224 × 224 |
# of conv. layers | 2 | 24 |
# of parameters | 209 K | 66,000 K |
Memory usage per image | 1690 KB | 53,606 KB |
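To make the size gap in the comparison concrete, the short sketch below computes the ratios directly from the two table columns. It only restates the table's figures; the dictionary keys and variable names are illustrative assumptions.

```python
# Figures copied from the comparison table (parameters in thousands, memory in KB).
ours = {"params_k": 209, "memory_kb": 1690}
yolo = {"params_k": 66_000, "memory_kb": 53_606}

# How many times larger YOLO is than the small CNN on each axis.
param_ratio = yolo["params_k"] / ours["params_k"]
memory_ratio = yolo["memory_kb"] / ours["memory_kb"]

print(f"parameters: about {param_ratio:.0f}x")
print(f"memory per image: about {memory_ratio:.1f}x")
```

By these figures, YOLO has roughly 316 times as many parameters and uses roughly 32 times as much memory per image as the 64 × 64 model.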
Evaluation with images from different cameras
Test set | Input | Precision (Person) | Precision (Non-person) | Recall (Person) | Recall (Non-person) | Accuracy |
---|---|---|---|---|---|---|
(a) Full-set model | ||||||
Roadset1 | 32 × 32 | 0.78 | 0.65 | 0.73 | 0.71 | 0.72 |
Roadset1 | 64 × 64 | 0.79 | 0.64 | 0.70 | 0.73 | 0.71 |
Roadset1 | 128 × 128 | 0.79 | 0.63 | 0.69 | 0.75 | 0.71 |
Roadset2 | 32 × 32 | 0.84 | 0.80 | 0.80 | 0.84 | 0.82 |
Roadset2 | 64 × 64 | 0.82 | 0.81 | 0.81 | 0.81 | 0.81 |
Roadset2 | 128 × 128 | 0.83 | 0.79 | 0.78 | 0.84 | 0.80 |
Nightset | 32 × 32 | 0.76 | 0.50 | 0.72 | 0.55 | 0.66 |
Nightset | 64 × 64 | 0.77 | 0.50 | 0.72 | 0.56 | 0.66 |
Nightset | 128 × 128 | 0.76 | 0.48 | 0.68 | 0.58 | 0.65 |
(b) Part-set model | ||||||
Roadset1 | 32 × 32 | 0.79 | 0.65 | 0.73 | 0.72 | 0.72 |
Roadset1 | 64 × 64 | 0.81 | 0.66 | 0.73 | 0.75 | 0.74 |
Roadset1 | 128 × 128 | 0.80 | 0.64 | 0.71 | 0.73 | 0.72 |
Roadset2 | 32 × 32 | 0.82 | 0.81 | 0.81 | 0.82 | 0.81 |
Roadset2 | 64 × 64 | 0.83 | 0.81 | 0.82 | 0.82 | 0.82 |
Roadset2 | 128 × 128 | 0.82 | 0.80 | 0.80 | 0.82 | 0.81 |
Nightset | 32 × 32 | 0.76 | 0.51 | 0.74 | 0.54 | 0.67 |
Nightset | 64 × 64 | 0.78 | 0.52 | 0.73 | 0.58 | 0.68 |
Nightset | 128 × 128 | 0.76 | 0.49 | 0.71 | 0.55 | 0.65 |