
Hybrid Approach for Estimation of Traffic Hazards: Fusion of ML and Pure Statistical Model

  • Open Access
  • 2026
  • OriginalPaper
  • Book chapter

Abstract

This chapter examines a hybrid approach to estimating traffic hazards that combines machine learning (ML) with purely statistical models. The authors, from ASFINAG, discuss the limitations of relying solely on statistical models and introduce a context-aware approach that incorporates additional parameters such as traffic events, road conditions, and weather data. The chapter covers two main methods: the non-ML approach, which uses fuzzy logic to determine weights for the different data sources, and the ML approach, which uses an XGBoost regression model to predict future travel times. The hybrid approach dynamically selects the most reliable data sources and integrates ML predictions when needed, which markedly improves the accuracy of traffic jam and travel delay estimates. The evaluation section compares the performance of the non-ML and ML models and highlights the ML model's ability to predict traffic trends earlier. The chapter closes with a vision for deploying this approach across the entire ASFINAG road network and for exploring other ML and AI models to improve traffic management.

1 Overview

ASFINAG plans, funds, builds, maintains, and operates almost 2,249 km of Austrian motorways and expressways and collects tolls along them. In recent years, we have been extending our sensor infrastructure. Most of the sensors support current wireless technologies such as CEN Dedicated Short-Range Communication (toll stations) [2], Bluetooth (BT), WLAN, ITS-G5 (C-ITS) [3], and TLS (Technische Lieferbedingungen für Streckenstationen) detectors, which are referred to as speed detectors in this paper.
Our analysis of the traffic data shows that relying on a single statistical model to estimate traffic jams and travel delays across the entire road network is not the most effective approach. Apart from the traffic volume in urban and rural areas, traffic jams and travel delays are affected by traffic events and by road and weather conditions. With these additional parameters in mind, we developed a hybrid approach for traffic jam and travel time delay estimation. Here we combine the statistical model with the ML model to make our estimation more context aware. We focus primarily on the cases or regions where the pure statistical methods do not give satisfactory results. In this paper, we present this approach together with a comparison and evaluation of the non-ML (pure statistical) model and the ML model. This work is part of the project and system “ASFINAG Traveltimes & Trafficstate Management System” (ARMS) [4].

2 Non-ML Approach

In the non-ML (pure statistical) model, we employ fuzzy logic to determine weights for the individual data sources: Bluetooth and WLAN detectors, toll transactions from trucks, and speed detectors. The raw data from these sources are post-processed to calculate single-vehicle travel times for predefined sections. The only exception is the data from the speed detectors, which are resampled at one-minute intervals. The result is the average speed of the vehicles in a 60 s time window. To harmonize the traffic data from the different sensor types, we have divided our road network into 200-m segments, which are referred to as cells. The measurements obtained for a particular cell are interpolated to the nearest cells. Each of these cells is assigned a weight, which is determined by factors such as measurement age, variance, and proximity to the measurement location. We use linear interpolation for this purpose.
$$c_{k} = \beta_{k} \cdot d(mcell, icell) + \beta_{max}$$
(1)
$$\beta_{k} = (\beta_{max} - \beta_{min}) \cdot |\alpha|_{T} + \beta_{min}$$
(2)
$$|\alpha|_{T} = \frac{var_{T}}{var_{max}}$$
(3)
$$var_{max} = \sum_{t=0}^{N} \lambda(t)$$
(4)
$$var_{T} = \sum_{t=0}^{N} \lambda(t) \cdot |v_{t} - v_{T-t}|$$
(5)
Equation 1 calculates the confidence level (\({c}_{k}\)) of the received measurement from the slope coefficient \({\beta }_{k}\) and the distance between the measured cell (\(mcell\)) and the interpolated cell (\(icell\), the cell to which we interpolate the measurement). Equation 2 calculates the slope coefficient from the standardized slope, which Eq. 3 defines as the ratio of the weighted variance to the maximal weighted variance. The weights are given by a time function \(\lambda (t)\): the older the measurement, the smaller its weight. Equations 4 and 5 calculate the maximal and the weighted variance, respectively.
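As a minimal sketch of Eqs. 2 to 5, the following Python fragment computes the standardized slope and the slope coefficient. The exponential form of \(\lambda (t)\), the decay constant `tau`, the speed values, and the values of `beta_min` and `beta_max` are assumptions chosen for illustration; the source does not specify them. The speeds are assumed to be normalized to [0, 1] so that the standardized slope stays in that range.

```python
import math

def time_weight(t, tau=5.0):
    """Assumed time function lambda(t): older measurements get less weight."""
    return math.exp(-t / tau)

def max_weighted_variance(n, weight=time_weight):
    """Eq. 4: sum of the time weights for t = 0..N."""
    return sum(weight(t) for t in range(n + 1))

def weighted_variance(v, big_t, n, weight=time_weight):
    """Eq. 5: time-weighted sum of |v_t - v_(T-t)| for t = 0..N."""
    return sum(weight(t) * abs(v[t] - v[big_t - t]) for t in range(n + 1))

def slope_coefficient(alpha, beta_min, beta_max):
    """Eq. 2: map the standardized slope onto [beta_min, beta_max]."""
    return (beta_max - beta_min) * alpha + beta_min

# Hypothetical normalized speeds; the index T - t must stay inside the list.
v = [0.9, 0.8, 0.4, 0.3]
n, big_t = 3, 3

# Eq. 3: standardized slope as the ratio of weighted to maximal variance.
alpha = weighted_variance(v, big_t, n) / max_weighted_variance(n)
beta_k = slope_coefficient(alpha, beta_min=0.5, beta_max=2.0)
```

The resulting `beta_k` would then enter Eq. 1 together with the cell distance.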
Once the data from each data source have been assigned and interpolated to every cell in the network, our ARMS system calculates a weighted mean to determine the traffic state of each individual cell. The traffic state is quantified on a numeric scale ranging from 0 to 100. We refer to this value as “availability”. A value of 100 indicates free-flow conditions, whereas a value of 0 indicates stationary traffic. Furthermore, we employ an aggregation technique to assess the extent of a traffic jam: we aggregate neighboring cells spatially and temporally based on the similarity of their calculated traffic conditions.
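The weighted mean per cell can be sketched as follows. This is an illustration only: the function name, the pairing of availability values with confidence weights, and the sample numbers are assumptions, not the ARMS implementation.

```python
def cell_availability(estimates):
    """Weighted mean of per-source availability values (0..100) for one cell.

    `estimates` is a list of (availability, weight) pairs, where the weight
    is the confidence assigned to that source's interpolated measurement.
    """
    total_weight = sum(weight for _, weight in estimates)
    if total_weight == 0:
        return None  # no usable measurement for this cell
    return sum(value * weight for value, weight in estimates) / total_weight

# Hypothetical values for one 200-m cell: (availability, weight) per source.
cell = [(80.0, 0.5), (60.0, 0.3), (40.0, 0.2)]
availability = cell_availability(cell)
```

With these sample values, the source with the highest weight pulls the result towards 80, and the cell ends up with an availability of about 66.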

3 ML Approach

In this section, we explain our ML model, which predicts future travel times. Like the non-ML model, the ML model uses the single-vehicle travel times from the Bluetooth, WLAN, and toll detectors, and the average speeds sampled over one minute from the speed sensors. Likewise, the data are harmonized into a homogeneous data structure. In addition, the model incorporates contextual information such as weather data, planned and unplanned events, and calendar data. To train our model, we used the last 6 months of data from the above-mentioned data sources. Although 6 months of data appears to be relatively little, it has proven sufficient for our needs. The data are detailed and consistent, which enabled the model to reach our expected accuracy in travel time prediction.
The data from the Bluetooth, WLAN, and toll detectors are preprocessed with the Hampel outlier detection method [5]. Hampel's approach is designed to be robust against extreme values in the data set. This is especially relevant for travel time data, as some vehicles may stop in a parking area for a significant amount of time, resulting in extreme travel times. The data from the speed detectors used for our model have already been resampled at 1-min intervals. As a result, we treat these data as free from outliers. After the outlier removal, the travel time data are resampled in N-minute time windows (where N = 15, 20, 30), followed by feature engineering. In our model, this involves assigning time features, holidays, unusual travel time patterns, etc. A detailed explanation of the feature engineering is omitted, as the focus is on the hybrid model.
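A Hampel-style filter for travel times can be sketched in plain Python as follows. The window size, the threshold of three scaled median absolute deviations, and the sample travel times are assumptions for illustration; the source does not state the parameters used.

```python
from statistics import median

def hampel_filter(values, half_window=3, n_sigmas=3.0):
    """Drop values that deviate too far from the median of their window."""
    scale = 1.4826  # relates the MAD to the standard deviation (Gaussian data)
    kept = []
    for i, value in enumerate(values):
        lo = max(0, i - half_window)
        hi = min(len(values), i + half_window + 1)
        window = values[lo:hi]
        med = median(window)
        mad = median(abs(x - med) for x in window)
        # Keep the value only if it lies within the robust tolerance band.
        if abs(value - med) <= n_sigmas * scale * mad:
            kept.append(value)
    return kept

# Hypothetical single-vehicle travel times in seconds; one vehicle stopped
# at a parking area and produced an extreme value.
travel_times = [62, 60, 61, 63, 900, 62, 61, 60]
cleaned = hampel_filter(travel_times)
```

Because the filter relies on the median and the median absolute deviation, the single extreme value does not distort the tolerance band applied to its neighbors.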
The modeling is done with an XGBoost [6] regression model, which is further fine-tuned by a randomized hyperparameter grid search with 5-fold cross-validation. XGBoost is an advanced machine learning technique that constructs an ensemble of decision trees to predict continuous numeric values. Each decision tree learns from the errors of its predecessors, allowing XGBoost to refine its predictions progressively. During training, the algorithm assigns initial predictions and calculates the associated residuals. Subsequent trees are then built to reduce these residuals. To prevent overfitting, regularization terms are introduced, and trees are pruned based on their depth and leaf weight. The trees' predictions are summed to form the final result, with each tree's contribution scaled by a learning rate. XGBoost is based on gradient boosting: each new tree is fitted to the gradient of the loss function with respect to the current predictions, so that adding it reduces the loss. This iterative approach optimizes the model's fit to the data, resulting in a robust regression model capable of handling complex relationships and noisy datasets.
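The residual-fitting idea can be illustrated with a toy gradient boosting regressor built from one-split trees (stumps). This is not XGBoost: it omits regularization, pruning, and second-order gradient information, and the single feature, the sample data, and the hyperparameters are assumptions for illustration.

```python
def fit_stump(x, residuals):
    """Fit a one-split regression tree to the residuals (least squares)."""
    best = None
    # Exclude the largest value so that the right branch is never empty;
    # this assumes x contains at least two distinct values.
    for threshold in sorted(set(x))[:-1]:
        left = [r for xi, r in zip(x, residuals) if xi <= threshold]
        right = [r for xi, r in zip(x, residuals) if xi > threshold]
        left_mean = sum(left) / len(left)
        right_mean = sum(right) / len(right)
        error = (sum((r - left_mean) ** 2 for r in left)
                 + sum((r - right_mean) ** 2 for r in right))
        if best is None or error < best[0]:
            best = (error, threshold, left_mean, right_mean)
    _, threshold, left_mean, right_mean = best
    return lambda xi: left_mean if xi <= threshold else right_mean

def fit_boosted(x, y, n_trees=50, learning_rate=0.3):
    """Squared-error gradient boosting: each stump fits the current residuals."""
    base = sum(y) / len(y)  # initial prediction
    predictions = [base] * len(y)
    trees = []
    for _ in range(n_trees):
        residuals = [yi - pi for yi, pi in zip(y, predictions)]
        tree = fit_stump(x, residuals)
        trees.append(tree)
        predictions = [pi + learning_rate * tree(xi)
                       for pi, xi in zip(predictions, x)]
    return lambda xi: base + learning_rate * sum(tree(xi) for tree in trees)

def mse(actual, predicted):
    return sum((a - p) ** 2 for a, p in zip(actual, predicted)) / len(actual)

# Hypothetical data: hour of day against travel time in seconds.
hours = [6, 7, 8, 9, 10, 11]
times = [60.0, 65.0, 120.0, 130.0, 70.0, 62.0]

model = fit_boosted(hours, times)
baseline_error = mse(times, [sum(times) / len(times)] * len(times))
model_error = mse(times, [model(h) for h in hours])
```

For squared error, the residuals are the negative gradient of the loss, which is why fitting each stump to the residuals is a gradient step. The training error of the boosted model falls below that of the constant baseline.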
The assessment of model performance typically relies on metrics such as root mean square error (RMSE) and mean absolute error (MAE). These metrics were computed on a test set, resulting in an RMSE of 11.1 s and an MAE of 2.2 s. These values, while relatively small, align with expectations for scenarios without traffic congestion. Estimating travel time under normal, free-flowing conditions is generally straightforward.
During traffic jams, however, accurate travel time estimation is challenged by unpredictable events such as vehicle accidents. Although the XGBoost model successfully identifies all traffic jams, it struggles to provide precise travel time predictions in such scenarios. Consequently, the error metrics are notably higher, with an RMSE of 126.3 s and an MAE of 107.6 s. The RMSE exceeds the MAE because it is more sensitive to outliers.
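The different sensitivity of the two metrics can be shown with a small example. The travel times below are invented for illustration and are not taken from the evaluation data.

```python
import math

def rmse(actual, predicted):
    """Root mean square error: squaring emphasizes large errors."""
    squared = [(a - p) ** 2 for a, p in zip(actual, predicted)]
    return math.sqrt(sum(squared) / len(squared))

def mae(actual, predicted):
    """Mean absolute error: every error counts proportionally."""
    return sum(abs(a - p) for a, p in zip(actual, predicted)) / len(actual)

actual = [100.0, 100.0, 100.0, 100.0]
even = [110.0, 90.0, 110.0, 90.0]      # four errors of 10 s each
spiky = [100.0, 100.0, 100.0, 140.0]   # a single error of 40 s

even_scores = (rmse(actual, even), mae(actual, even))
spiky_scores = (rmse(actual, spiky), mae(actual, spiky))
```

Both predictions have an MAE of 10 s, but the RMSE is 10 s for the evenly distributed errors and 20 s for the single large error.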

4 Advanced Data Fusion – Hybrid Approach

As explained in Sect. 2, our non-ML approach relies on a pure statistical model to estimate the traffic state and the extent of a traffic jam. The travel time and travel delay calculations are averages of travel time samples collected over a specific time window. One notable drawback of these methods is their lack of predictive intelligence for anticipating traffic jams and delays. At times, they report travel time delays that deviate significantly from real-time conditions. As a result, traffic jams are identified late, and queue lengths are not measured with the expected accuracy. Figure 1 illustrates a situation in which our statistical method did not yield travel time delays that match the actual conditions, resulting in discrepancies of up to 30 min compared with the delays actually experienced. To narrow these discrepancies, we have developed an ensemble strategy that combines our pure statistical model with the outcomes of our ML model. We refer to this strategy as the “hybrid approach”.
Fig. 1.
Delay in the estimation of travel time delays
It is important to understand that this approach does not replace our conventional (non-ML) approach; instead, it serves as a complementary strategy. Our goals are as follows: (1) estimate travel time delays accurately, approaching real-time accuracy; (2) identify traffic jams at an earlier stage; and (3) improve the precision of the estimated start and end points of traffic congestion.
Reference [7] describes various statistical and ensemble methods for integrating the results from multiple models. In our hybrid approach, we have chosen Bayesian statistics [8] to calculate an individual weight for each data source and thereby include or exclude a specific data source for traffic jam prediction based on its weight. The data sources here also include the outcomes of the ML model.
From the historic traffic data, we have pre-selected the traffic hotspots (sections and time periods). For each of these hotspots, we have estimated prior probabilities for the data sources based on whether or not they identified the traffic jams within a threshold time. Although the data sources operate independently, their outputs are strongly correlated because the data they produce are similar. Therefore, we can calculate not only individual probabilities but also a joint probability of predicting a jam.

4.1 The Bayesian Model for Data Source Selection and Aggregation

In this model, we consider our main data sources: Bluetooth and WLAN data (B), toll data (T), and speed sensors (S). To determine the likelihood of each of these data sources detecting a jam, we use Bayes' rule to compute the conditional probabilities of a traffic jam (J) given the average speeds from the speed sensors, P(J|S), the toll detectors, P(J|T), and the Bluetooth detectors, P(J|B). Subsequently, we also calculate the joint conditional probability of a traffic jam J given the measurements from the data sources S, T, and B by using the following formula:
$$P\left(J|S,T,B\right)=\frac{P(S,T,B|J)P(J)}{P(S,T,B)}$$
(6)
where P(S,T,B) represents the joint probability of observing the measurements from the data sources S,T,B, regardless of the traffic jam conditions.
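Equation 6 can be evaluated from relative frequencies in historic hotspot data, as sketched here. The counts and variable names are hypothetical; the source does not describe how the individual terms are estimated.

```python
def posterior_jam(likelihood, prior, evidence):
    """Eq. 6: P(J|S,T,B) = P(S,T,B|J) * P(J) / P(S,T,B)."""
    if evidence <= 0:
        raise ValueError("P(S,T,B) must be positive")
    return likelihood * prior / evidence

# Hypothetical counts for one hotspot.
n_intervals = 1000      # all observed time intervals
n_jam = 200             # intervals with a confirmed jam
n_pattern = 180         # intervals in which S, T and B show the pattern
n_pattern_and_jam = 150 # intervals with both the pattern and a jam

prior = n_jam / n_intervals               # P(J)
likelihood = n_pattern_and_jam / n_jam    # P(S,T,B|J)
evidence = n_pattern / n_intervals        # P(S,T,B)

p_jam = posterior_jam(likelihood, prior, evidence)
```

With these counts the posterior equals 150/180, or about 0.83, which is the share of pattern intervals that actually contained a jam.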
Fig. 2.
Data flow visualization with a focus on the Bayesian model for data source selection and aggregation
We first check whether the joint probability is greater than 0.7, which indicates a strong likelihood of predicting the jam. In these cases, we assume that the data sources can estimate the jam with little discrepancy from reality without the help of the ML prediction. However, if the joint probability is less than 0.7, then:
1. We evaluate the conditional probability of each data source and remove those with a conditional probability of less than 0.7.
2. We calculate the conditional probability of the prediction from the ML model, P(J|ML).
3. Finally, we compute a weighted average of the remaining data sources and the ML prediction.
This way, we can dynamically assess and select the most reliable data sources for predicting traffic jams, taking the outcomes of the ML model into account when needed.
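The selection steps above can be sketched as follows. Several details are assumptions made for illustration: the conditional probabilities are used directly as weights in the weighted average, a joint probability of exactly 0.7 is treated like the lower case, and the travel time estimates and probabilities are invented.

```python
THRESHOLD = 0.7

def fuse_estimates(joint_prob, sources, ml_estimate, ml_prob,
                   threshold=THRESHOLD):
    """Select data sources and return (fused estimate, names of used sources).

    `sources` maps a source name to (travel_time_estimate, P(J|source)).
    """
    if joint_prob > threshold:
        # The data sources are considered reliable without the ML prediction.
        selected = dict(sources)
    else:
        # Step 1: drop sources whose conditional probability is below 0.7.
        selected = {name: (estimate, prob)
                    for name, (estimate, prob) in sources.items()
                    if prob >= threshold}
        # Step 2: add the ML prediction with its conditional probability.
        selected["ML"] = (ml_estimate, ml_prob)
    total = sum(prob for _, prob in selected.values())
    if total == 0:
        return None, []
    # Step 3: weighted average, here weighted by the conditional probabilities.
    fused = sum(estimate * prob for estimate, prob in selected.values()) / total
    return fused, sorted(selected)

# Hypothetical travel time estimates in seconds with P(J|source).
sources = {"B": (300.0, 0.8), "T": (420.0, 0.5), "S": (310.0, 0.75)}

low_joint, used_low = fuse_estimates(0.6, sources, ml_estimate=360.0, ml_prob=0.9)
high_joint, used_high = fuse_estimates(0.8, sources, ml_estimate=360.0, ml_prob=0.9)
```

In the first call the toll source is dropped and the ML prediction is added; in the second call all three data sources are used and the ML prediction is ignored.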

5 Evaluation of Non-ML and ML Model

This section presents an in-depth comparison between the outcomes of the non-ML and ML models, again using the RMSE and MAE metrics. Travel times obtained from the non-ML model are rescaled by a constant to compensate for the different section length between the start and end sensor positions used by the ML model. Data from Google Maps serve as an additional reference for a more detailed comparison of the non-ML and ML models.
Table 1 presents the overall metric results, and Fig. 3 illustrates the comparison between the ML and non-ML approaches. Both approaches predicted the traffic jams and the travel times in free-flow cases correctly. However, the ML model predicted them earlier. In Fig. 3, we can see that the ML model is the first source to predict the change in the traffic trend. The difference between the ML and non-ML results could be explained by the fact that, for the ML model, features derived from BT and speed detector data are more significant than those from toll transactions. The ML model therefore cannot predict travel times specifically for trucks and instead predicts a travel time for all vehicle classes, whereas the non-ML model calculates travel times separately for cars and lorries. The ML model predicted the change in the travel time pattern (free flow to queuing traffic) at least 5–10 min earlier than our non-ML model and the Google data. Likewise, it detected the second pattern change (queuing to free-flow traffic) 15–20 min earlier than the others (Fig. 3).
Table 1.
Comparison of travel time RMSE and MAE results: ML and non-ML models against Google data, and ML model against non-ML model.

| Comparison | RMSE [s] | MAE [s] |
| --- | --- | --- |
| ML to Google data | 27.2 | 9.6 |
| non-ML car to Google data | 52.9 | 17.5 |
| non-ML lorry to Google data | 52.9 | 18.7 |
| ML to non-ML car | 26.2 | 7.0 |
| ML to non-ML lorry | 24.7 | 8.7 |
Fig. 3.
Comparison of the results from ML and non-ML models together with Google data for road section on A23 during selected traffic jam.

6 Conclusion

The emergence of various detection technologies has increased the diversity of our data sources, which helps us improve the accuracy of traffic information. However, to fully leverage the strengths of these diverse data, we need an advanced approach that can effectively combine data from all available sources. This paper addressed that topic by presenting a hybrid approach that combines data from multiple data sources and additionally incorporates the outcomes of the ML model when needed. At present, we have applied this approach to a selected road section. As future work, we will implement the proposed approach on the entire ASFINAG road network. Furthermore, we will research and experiment with other ML and AI models to determine whether they are a better fit for our use cases. We will specifically focus on Graph Neural Networks to model the whole high-level road network and compare the results with our current approach.
Open Access This chapter is licensed under the terms of the Creative Commons Attribution 4.0 International License (http://creativecommons.org/licenses/by/4.0/), which permits use, sharing, adaptation, distribution and reproduction in any medium or format, as long as you give appropriate credit to the original author(s) and the source, provide a link to the Creative Commons license and indicate if changes were made.
The images or other third party material in this chapter are included in the chapter's Creative Commons license, unless indicated otherwise in a credit line to the material. If material is not included in the chapter's Creative Commons license and your intended use is not permitted by statutory regulation or exceeds the permitted use, you will need to obtain permission directly from the copyright holder.
Authors: Natasa Mojic, Vijay Mudunuri, Radim Slovák, Thomas Mariacher, Peter Hrassnig
Copyright year: 2026
DOI: https://doi.org/10.1007/978-3-032-06763-0_36

References

4. Mariacher, T., Bretis, K., Rainer, B., Hrassnig, P., Pletzer, F.: ARMS – Asfinag traveltime management system. In: Proceedings of 7th Transport Research Arena TRA 2018
5. Hampel, F.R., Ronchetti, E.M., Rousseeuw, P.J., Stahel, W.A.: Robust Statistics: The Approach Based on Influence Functions. Wiley, New Jersey (1986)
6. Chen, T., Guestrin, C.: XGBoost: a scalable tree boosting system. In: Proceedings of the 22nd ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, pp. 785–794. ACM, New York (2016)
7. Mienye, D., Sun, Y.: A survey of ensemble learning: concepts, algorithms, applications, and prospects. IEEE Access 10, 99129–99149 (2022)
8. Chipman, H.A., George, E.I., McCulloch, R.E.: Bayesian ensemble learning. In: 20th Annual Conference on Neural Information Processing Systems, NIPS, San Diego (2006)