1 Introduction
2 Data collection and preparation
2.1 Traffic data
2.2 Crash data
2.3 Weather data
2.4 Matched case–control method and data filtering

Symbol | Variable description | Formulation* |
---|---|---|
\(Q\) | The total flow during 5 min | \(\sum\nolimits_{t=1}^{15} \sum\nolimits_{n=1}^{n} Q_{nt}\) |
\(C_m\) | The mean value of occupancy of all lanes during 5 min | \(E\left(\sum\nolimits_{t=1}^{15} \sum\nolimits_{n=1}^{n} C_{nt}\right)\) |
\(V_m\) | The mean value of speed of all lanes during 5 min | \(\frac{\sum\nolimits_{t=1}^{15} \sum\nolimits_{n=1}^{n} (Q_{nt} \times V_{nt})}{\sum\nolimits_{t=1}^{15} \sum\nolimits_{n=1}^{n} Q_{nt}}\) |
\(Q_D\) | The accumulated standard deviation of flow within the lanes during 5 min | \(\sum\nolimits_{t=1}^{15} D_{t}(Q)\) |
\(C_D\) | The accumulated standard deviation of occupancy within the lanes during 5 min | \(\sum\nolimits_{t=1}^{15} D_{t}(C)\) |
\(V_D\) | The accumulated standard deviation of speed within the lanes during 5 min | \(\sum\nolimits_{t=1}^{15} D_{t}(V)\) |
\(Q_{DL}\) | Sum of the accumulated standard deviation of flow within 5 min for each lane | \(\sum\nolimits_{n=1}^{n} D_{n}(Q)\) |
\(C_{DL}\) | Sum of the accumulated standard deviation of occupancy within 5 min for each lane | \(\sum\nolimits_{n=1}^{n} D_{n}(C)\) |
\(V_{DL}\) | Sum of the accumulated standard deviation of speed within 5 min for each lane | \(\sum\nolimits_{n=1}^{n} D_{n}(V)\) |
\(Q_{MDL}\) | The maximum value of the accumulated standard deviation of flow within 5 min for each lane | \(\max(D_{n}(Q))\) |
\(O_{MDL}\) | The maximum value of the accumulated standard deviation of occupancy within 5 min for each lane | \(\max(D_{n}(C))\) |
\(V_{MDL}\) | The maximum value of the accumulated standard deviation of speed within 5 min for each lane | \(\max(D_{n}(V))\) |
\(L_{cd}\) | The distance from the crash to the detector | \(L_c - L_d\) |
\(W_{ea}\) | Weather condition code | Weather code |

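
The traffic-flow formulations in the table above can be sketched in code. The sketch below is illustrative only and rests on several assumptions that the table does not state: each 5-min window is taken to consist of 15 intervals (matching the upper limit of the sum over \(t\)), \(D_t\) is read as the standard deviation across lanes within interval \(t\), \(D_n\) is read as the standard deviation over the 15 intervals for lane \(n\), and the population standard deviation is used. The function name `five_minute_features` and the input layout are hypothetical. \(L_{cd}\) and \(W_{ea}\) are omitted because they do not derive from detector readings.

```python
from statistics import mean, pstdev


def five_minute_features(Q, C, V):
    """Aggregate one 5-min window of lane-level detector readings.

    Q, C, V are lists of rows (flow, occupancy, speed); each row is one
    interval t and each column is one lane n. This layout is an assumption.
    """
    n_intervals, n_lanes = len(Q), len(Q[0])
    cells = [(t, n) for t in range(n_intervals) for n in range(n_lanes)]

    total_flow = sum(Q[t][n] for t, n in cells)
    features = {
        "Q": total_flow,
        "C_m": mean(C[t][n] for t, n in cells),
        # Flow-weighted mean speed; undefined when no vehicle was counted.
        "V_m": (sum(Q[t][n] * V[t][n] for t, n in cells) / total_flow
                if total_flow else float("nan")),
    }

    # The table labels the occupancy maximum O_MDL, so that name is kept here.
    for name, mdl_name, X in (("Q", "Q", Q), ("C", "O", C), ("V", "V", V)):
        d_t = [pstdev(row) for row in X]        # assumed: spread across lanes per interval
        d_n = [pstdev(col) for col in zip(*X)]  # assumed: spread over time per lane
        features[f"{name}_D"] = sum(d_t)
        features[f"{name}_DL"] = sum(d_n)
        features[f"{mdl_name}_MDL"] = max(d_n)
    return features
```

Under a different reading of \(D_t\) and \(D_n\) (for example, a sample rather than a population standard deviation), only the two list comprehensions inside the loop would need to change.
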

Variables | Average | SD | First quartile | Third quartile |
---|---|---|---|---|
\(Q\) | 498.76 | 220.74 | 301.50 | 662.88 |
\(C_m\) | 8.89 | 7.64 | 5.35 | 10.41 |
\(V_m\) | 82.60 | 13.96 | 76.42 | 92.87 |
\(Q_D\) | 65.44 | 18.52 | 54.28 | 74.24 |
\(C_D\) | 36.84 | 23.86 | 28.78 | 37.18 |
\(V_D\) | 409.80 | 303.57 | 119.65 | 737.29 |
\(Q_{DL}\) | 51.07 | 21.38 | 34.87 | 64.95 |
\(C_{DL}\) | 71.47 | 159.79 | 30.61 | 61.19 |
\(V_{DL}\) | 1140.21 | 1139.31 | 271.71 | 1755.27 |
\(Q_{MDL}\) | 4.58 | 0.98 | 3.94 | 5.15 |
\(O_{MDL}\) | 4.65 | 2.77 | 3.53 | 4.88 |
\(V_{MDL}\) | 22.54 | 12.88 | 10.19 | 33.62 |
\(L_{cd}\)* | −3.00 | 2.42 | −5.03 | −1.00 |
\(W_{ea}\) | 2.87 | 1.73 | 2.00 | 4.00 |

3 Methodology and modeling technique
3.1 Over-sampling technique
3.2 Support vector machines
3.3 Random forest
4 Results and conclusions
4.1 Contributing factors by random forest
4.2 SVMs classifier performance

Dataset | TPR (%) | FPR (%) | Accuracy (%) | AUC |
---|---|---|---|---|
Training dataset | 91.95 (354/385) | 23.47 (88/375) | 84.34 (641/760) | 0.8037 |
Test dataset | 76.32 (116/152) | 33.91 (59/174) | 70.86 (231/326) | |
Overall dataset | 87.52 (470/537) | 26.78 (147/549) | 80.29 (872/1086) | |


Dataset | TPR (%) | FPR (%) | Accuracy (%) | AUC |
---|---|---|---|---|
Training dataset | 88.05 (339/385) | 23.47 (88/375) | 82.37 (626/760) | 0.7852 |
Test dataset | 75.00 (114/152) | 35.63 (62/174) | 69.33 (226/326) | |
Overall dataset | 84.36 (453/537) | 27.32 (150/549) | 78.45 (852/1086) | |

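
The percentages in the table directly above can be recomputed from the counts given in parentheses. The short sketch below does so, assuming that crash cases are the positive class, that the TPR count is true positives over all positives, and that the FPR count is false positives over all negatives. This reading is not stated explicitly here, but it reproduces the reported accuracy counts. The function name `classification_rates` is hypothetical.

```python
def classification_rates(tp, n_pos, fp, n_neg):
    """Return TPR, FPR and accuracy in percent from raw counts.

    tp / n_pos: correctly flagged positives over all positives (assumed: crash cases).
    fp / n_neg: wrongly flagged negatives over all negatives (assumed: non-crash cases).
    """
    tn = n_neg - fp  # negatives that were correctly left unflagged
    return {
        "TPR": 100 * tp / n_pos,
        "FPR": 100 * fp / n_neg,
        "Accuracy": 100 * (tp + tn) / (n_pos + n_neg),
    }


# Training row of the table directly above: 339/385 and 88/375.
train = classification_rates(tp=339, n_pos=385, fp=88, n_neg=375)
print({k: round(v, 2) for k, v in train.items()})
# prints {'TPR': 88.05, 'FPR': 23.47, 'Accuracy': 82.37}
```

The same call with the test-row counts (114/152 and 62/174) gives an accuracy of 226/326, matching the 69.33% reported.
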
4.3 Importance analysis for variable effects

Variables | \(Q\) | \(C_m\) | \(Q_D\) | \(V_D\) | \(Q_{DL}\) | \(W_{ea}\) |
---|---|---|---|---|---|---|
MIV | −0.050 | −0.015 | 0.045 | 0.035 | −0.011 | 0.006 |