Abstract
There is a growing emphasis in urban centres on promoting sustainable mobility modes, particularly public transit systems. This highlights the critical need for predictive modelling frameworks that capture the local spatiotemporal dynamics of public transit to inform policy and planning decisions. This study develops a horizon-agnostic modelling framework using automated passenger count (APC) data from a public bus transit system, integrating machine learning (ML) and deep learning (DL) algorithms to forecast stop-level passenger counts and operational factors. We assess APC data quality, implement a feature-space optimisation pipeline to enhance algorithm-data fit, and use SHAP values to analyse feature attributions for model interpretability. Our analyses reveal a weak but asymmetric relationship between boarding and alighting passenger counts. Tree-based ML algorithms outperform DL algorithms due to the high proportion of categorical features, with Extreme Gradient Boosting (XGBoost) achieving the best performance. Furthermore, incorporating non-mobility data (weather, terrain, demographics, land use) improved modelling of passenger dynamics. However, stop-level modelling lacks inductive biases on the spatial structure of transit networks. The proposed framework provides policymakers and planners with data-driven tools to understand the local spatiotemporal dynamics of public transit under external influences, supporting resource allocation for stop placement, line routing, and bus scheduling. By predicting outcomes based on input feature combinations rather than specific temporal horizons, the framework enables scenario analysis for planning applications and can be embedded in digital twins and mobility dashboards to support informed commuting decisions by urban residents.
Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.
Introduction
With urban sizes and population expected to increase at accelerated paces (Gulc and Budna 2024), the narrative surrounding urban mobility has gradually shifted from car-centric policies towards sustainable, people-focused alternatives (Millard-Ball and Schipper 2011; Jones 2016). This shift has established public transit as a cornerstone of sustainable urban mobility, offering significant advantages over private vehicles in terms of travel satisfaction (Mouratidis et al. 2023) and reduced societal and environmental externalities (Ritchie and Roser 2023; SLOCAT 2023). Consequently, enhancing public transit systems is crucial to ensuring they effectively meet the daily mobility needs of urban populations.
The advent of the Big Data paradigm provides an avenue for gaining an in-depth understanding of the local (lines and stops) and global (entire network) spatiotemporal dynamics of public transit systems at finer temporal granularities. In addition, the influence of urban characteristics and other meteorological and geographical factors on human mobility flows and passenger counts can be studied. To this end, various machine learning (ML) and deep learning (DL) algorithms can be applied for local and global modelling with the aim of forecasting passenger counts (such as boarding and alighting counts) and operational factors (such as scheduling deviations and delays). Furthermore, explainable ML methods can be applied to investigate the isolated and interaction effects of input features on the model outputs.
There have been several studies in the literature that have combined Big Data with ML/DL algorithms to model public transit dynamics. Li et al. (2022) proposed a Probabilistic Graph Convolution Model (PGCM) for origin–destination (OD) demand forecasting, incorporating confidence intervals to account for demand uncertainty. Pei et al. (2023) utilised a hybrid deep learning model that integrates wavelet packet decomposition, the attention mechanism, and bidirectional long short-term memory (LSTM) to predict passenger flows. Zhang et al. (2021) modelled transit networks as attributed graphs to detect areas with similar mobility patterns, embedding both mobility patterns and static urban features via a graph auto-encoder. Shrivastava et al. (2024) employed a Cluster-Based LSTM model to predict occupancy at transit stops while Verma et al. (2021) used a Gaussian mixture model to decompose public transit ridership data into temporal demand profiles. On the other hand, Egu and Bonnel (2021) integrated tree-based ML algorithms with trend forecasting to bridge the gap between short-term operational predictions and long-term strategic planning.
Other studies have taken advantage of the proven ability of transformers (Vaswani et al. 2017) to model long-range dependencies for applications in this domain. Xu et al. (2023) combined graph embeddings with a transformer network to effectively capture spatial and temporal dependencies in the forecasting of demand for scooter sharing. The spatial component incorporates four graph types—spatial adjacency, functional similarity, demographic similarity, and transportation supply similarity—to model complex relationships between urban zones. Hu et al. (2024) integrated complex network indicators (e.g., clustering coefficients and spatial correlations) within a transformer to dynamically capture spatial dependencies between subway stations, improving the prediction of short-term passenger flow in urban rail transit systems. Meanwhile, Kong et al. (2024) used a graph-based deep clustering method to extract bus stop mobility patterns based on their spatiotemporal attributes, using these representations in a transformer-based model for complex traffic prediction scenarios.
Several studies have also explored the influence of external factors such as weather, terrain, demographics, and land use characteristics on public transit, often outside the context of predictive modelling. These influences are typically region-specific and can vary within urban areas. Tian et al. (2024) analysed the impact of extreme weather disruptions on public transit reliability in terms of predicted delays, while Alam et al. (2021) applied a similar framework to predict irregularities in bus arrival times based on hourly weather conditions. Wei (2022) and Ngo and Bashar (2024) investigated how (extreme) weather conditions affect travel behaviour in different categories of passengers and sociodemographic groups. Regarding urban characteristics, Kim and Li (2021) examined the complementary relationship between residential densification and public transit accessibility. Verma et al. (2021) decomposed ridership data into temporal demand profiles to identify clusters of subway stations, while Yang et al. (2023), Liu et al. (2023), and Zhang et al. (2025) studied the non-linear effects of the built environment on travel modes, trip duration, and ridership patterns. Other external factors that have been investigated include socio-demographics (Ma et al. 2024) and COVID-19 interventions (Zeb et al. 2024).
However, most studies on public transit forecasting aggregate passenger flows spatially across OD pairs (Li et al. 2022; Hu et al. 2024), time intervals (Shrivastava et al. 2024; Kong et al. 2024), or regions (Xu et al. 2023; Li et al. 2022). Such aggregation overlooks stop-level dynamics, inter-stop dependencies, and operational factors, while also limiting the model’s predictive window and generalisation over time. Forecasts of public transit operations can provide direct benefits to passengers by providing insight into wait times, bus delays, and system congestion (Qu and Xu 2020; Shrivastava et al. 2024), thus enabling more informed mobility decisions. In addition, although the effects of external factors on human mobility dynamics have been established, few studies (Hu et al. 2024; Xu et al. 2023; Zhang et al. 2021) take this information into account within their ML/DL predictive frameworks, and those that do typically consider only individual factors. This highlights the need for a comprehensive investigation into the effects of said externalities on modelling stop-level dynamics of bus transit systems.
The contemporary Big Data paradigm has significantly enhanced ML/DL algorithms, enabling better modelling performance with larger training datasets. However, not all data are of equal quality, and an excessively large input feature space can lead to diminishing returns—such as reduced model performance, increased computational complexity, lower interpretability, and higher resource demands during training. Thus, it is crucial to assess the integrity of big mobility data sources to ensure that they accurately reflect real-world mobility dynamics. In addition, there is a need to optimise the input feature space to support downstream modelling tasks with potentially more complex models. Constraining the initially large input feature space to a relevant subset is essential to improve both predictive performance and generalisation (Jovanović et al. 2023; Martín-Baos et al. 2023).
Understanding the complex, non-linear interactions among input features, target features, and their interdependencies is crucial when deploying predictive frameworks in real-world applications. However, most studies prioritise predictive performance over interpretability, often relying on increasingly complex black-box DL models. This raises the need to explore whether traditional ML models can effectively capture local transit dynamics while maintaining lower computational complexity and greater interpretability. Without addressing these aspects, it remains difficult to fully assess the effectiveness of big data and ML/DL algorithms in modelling transit dynamics while accounting for external factors that shape mobility patterns.
Despite the increasing availability of mobility data, substantial challenges remain in acquiring and using high-quality transit datasets, particularly at the granularity required for stop-level modelling. Automated Passenger Counting (APC) data, although collected routinely by operators, is typically considered operationally sensitive and is not made publicly available. Ground truth (GT) data—manual or video-based passenger counts used to calibrate or validate APC systems—are even more scarce due to practical constraints including high passenger volumes, restricted access to onboard footage, and limited operational incentives for detailed validation. Even when such data are obtained at sufficient spatiotemporal scale and resolution, their analysis and public dissemination are often hindered by data protection regulations (e.g., General Data Protection Regulation, GDPR) and proprietary constraints embedded in data-sharing agreements. These data constraints significantly limit the development and validation of robust predictive models for public transit systems.
To address the aforementioned issues, this study aims to develop a horizon-agnostic predictive modelling framework primarily based on traditional ML algorithms to forecast stop-level passenger counts and operational factors in a bus transit system.
The term “horizon-agnostic” refers to the framework’s ability to predict passenger counts based on changes in input features within the domain of the training data, rather than predicting specific forecast horizons. This approach enables the models to be integrated into digital twins and simulation environments for scenario analysis and planning purposes.
Given the limited spatial and temporal coverage of available ground truth data, which makes systematic correction of APC data infeasible across entire networks, the focus shifts towards feature optimisation and predictive validation supported by engineered features from external, non-mobility domains, specifically weather, terrain, demographics, and land use. This approach enables indirect but robust assessment of APC reliability and predictive generalisability for stop-level modelling, while enhancing representational fidelity through exogenous feature integration.
The study will examine the predictive power of input features within a local modelling context and compare the performance of various ML/DL algorithms in capturing the spatiotemporal dynamics of the transit system. By exploring how complex non-linear interactions between input features influence model predictions in the presence of externalities, the study also aims to translate these insights into practical applications that assist urban residents and planners in making informed mobility decisions and policies.
In summary, previous studies overlook stop-level dynamics by aggregating mobility flows and neglect the impact of external factors, data quality, and feature optimisation on model training. Furthermore, most prioritise predictive accuracy by using complex black-box DL models, which limits the interpretability and generalisability of public transit forecasting for decision-support tools. This study leverages unique APC and limited GT data from Trondheim’s public bus system, where Norway’s sustainability focus, rugged terrain, and harsh climate create a distinctive setting for exploring urban mobility within national and European contexts. The methodology also reflects real-world data constraints in public transit systems, where comprehensive GT validation is often impractical. To this end, the primary contributions of this study are:
1. Proposal of a horizon-agnostic ML framework for modelling stop-level spatiotemporal dynamics in bus transit systems, predicting boarding and alighting counts, as well as deviations from scheduled arrival and dwell times. The framework is designed to predict outcomes based on input feature combinations rather than a specific temporal horizon.
2. Use of GT data to evaluate the integrity of the big mobility dataset used for analysing and modelling stop-level dynamics, establishing correlations between boarding and alighting counts across the transit system.
3. Integration of a task-agnostic, data-driven feature optimisation pipeline that removes irrelevant and redundant mobility features while incorporating engineered features from external, non-mobility domains to enhance predictive performance.
4. Application of explainable ML techniques, comparative evaluation of ML/DL models, and analysis of input feature attributions across various algorithms, tasks, and data subsets using unique real-world data from the public bus transit system of Trondheim, Norway.
The rest of this paper is organised as follows: Section 2 provides the theoretical foundations of the study, discussing the ML/DL algorithms, training and evaluation metrics, hyperparameter optimisation, statistical significance testing, and ensemble learning methods used in the modelling process. Section 3 outlines the methodological framework, detailing the primary sources of big mobility data (APC and GT), feature space optimisation, training and validation procedures, model training pipeline, and feature engineering techniques. Section 4 presents the results of the predictive modelling for the APC and GT data, including performance evaluations, target feature correlations, statistical significance testing, ensemble optimisation, residual analysis and interpretability analysis. Section 5 discusses the feature engineering results, highlighting the effects of different external factors on baseline model performance and transit dynamics. Section 6 summarises the key findings, discusses the implications of the results, tackles the limitations of the study, and outlines potential directions for future research.
Theoretical background
This section outlines the theoretical foundations of the study’s modelling approach, covering the machine learning (ML) and deep learning (DL) algorithms employed, performance evaluation metrics, and statistical methods for model comparison. The section establishes the mathematical framework for predictive modelling, hyperparameter optimisation, and ensemble learning techniques used throughout this work.
Machine learning algorithms
This section presents the machine learning algorithms employed to model the Automated Passenger Counting (APC) and ground truth (GT) target features. The modelling framework comprises one deep neural network architecture and four tree-based ensemble methods: Tabular Deep Neural Network (DNN), CatBoost, Random Forest, XGBoost, and LightGBM.
Tabular deep neural network
The Tabular DNN model is an extension of the basic DNN designed to handle structured data, efficiently combining categorical and continuous features (Gorishniy et al. 2021; Borisov et al. 2022). The tabular model uses embedding layers to transform categorical features into learned representations that are combined with the continuous features. The combined features are then passed through a basic DNN for classification or regression tasks. Figure 1 presents a schematic of the Tabular DNN used in this work, which can be formalised as follows:
Fig. 1
Architecture of the Tabular DNN showing embedding layers for categorical features, concatenation with continuous features, and multiple fully connected blocks leading to regression outputs
Let \(\textbf{X}_{\text {cat}} = \{X_{1}, X_{2}, \ldots , X_{m}\}\) denote the m categorical features, where \(X_{i}\) is the i-th categorical feature, and \(\textbf{X}_{\text {cont}} \in \mathbb {R}^{n}\) denote the n continuous features. Let \(\textbf{y} \in \mathbb {R}^k\) represent the k regression targets. Each categorical feature \(X_{i}\) is passed through a trainable embedding layer \(E_{i}: \textbf{V}_{i} \rightarrow \mathbb {R}^{d_i}\), where \(\textbf{V}_{i}\) is the vocabulary of \(X_i\) and \(d_i\) is the embedding dimension. The output is

\[ \textbf{e}_{i} = E_{i}(X_{i}), \quad i = 1, 2, \ldots , m. \]
The embeddings \(\textbf{e}_{1}, \textbf{e}_{2}, \ldots , \textbf{e}_{m}\) are concatenated with the continuous features \(\textbf{X}_{\text {cont}}\):

\[ \textbf{z} = \left[ \textbf{e}_{1}; \textbf{e}_{2}; \ldots ; \textbf{e}_{m}; \textbf{X}_{\text {cont}} \right]. \]
The concatenated vector \(\textbf{z}\) is passed through a deep neural network with l blocks, \(\mathcal {F}(\cdot ; \Theta )\), parameterised by \(\Theta \) to predict the targets: \(\mathbf {\hat{y}} = \mathcal {F}(\textbf{z}; \Theta )\) using the overall mean squared error as the loss function. Each DNN block consists of linear, non-linear, normalisation, and dropout layers with the outputs of the final block passed to a final output block. The output block consists of a linear layer to match the k targets and a non-linear layer to constrain the DNN outputs.
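As an illustration, the forward pass described above can be sketched with NumPy. The vocabulary sizes, embedding dimensions, layer widths, and random initialisation below are illustrative assumptions, not the configuration used in this study:

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical dimensions: m = 2 categorical features, n = 3 continuous
# features, k = 2 regression targets.
vocab_sizes = [5, 7]   # |V_i| for each categorical feature
embed_dims = [3, 4]    # embedding dimension d_i per categorical feature
n_cont, n_targets = 3, 2

# Trainable parameters (randomly initialised here): one embedding table per
# categorical feature, plus one fully connected block and an output layer.
embeddings = [rng.normal(size=(v, d)) for v, d in zip(vocab_sizes, embed_dims)]
z_dim = sum(embed_dims) + n_cont
W1, b1 = rng.normal(size=(z_dim, 16)), np.zeros(16)
W2, b2 = rng.normal(size=(16, n_targets)), np.zeros(n_targets)

def forward(x_cat, x_cont):
    """Embed categoricals, concatenate with continuous features, apply MLP."""
    e = [E[idx] for E, idx in zip(embeddings, x_cat)]  # lookup e_i = E_i(X_i)
    z = np.concatenate(e + [x_cont])                   # z = [e_1; ...; e_m; X_cont]
    h = np.maximum(W1.T @ z + b1, 0.0)                 # linear + ReLU block
    return W2.T @ h + b2                               # linear output for k targets

y_hat = forward(x_cat=[2, 5], x_cont=np.array([0.1, -0.4, 1.2]))
print(y_hat.shape)  # (2,) -- one prediction per regression target
```

In the full model, several such blocks (with normalisation and dropout) are stacked, and the parameters are trained by minimising the mean squared error over all targets.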
Tree-based algorithms
Tree-based algorithms are machine learning methods built on the foundation of decision trees. A decision tree recursively splits data into subsets based on feature values, using a tree-like structure of nodes (representing decision rules) and leaves (representing outcomes or predictions). These splits are determined to maximise the separation between classes or minimise error for regression tasks by minimising a loss function. Decision trees are intuitive, interpretable, and capable of capturing non-linear relationships, making them powerful standalone models. However, their predictive performance can suffer from overfitting or instability. To address these limitations, ensemble methods such as Random Forest, XGBoost, CatBoost, and LightGBM extend decision tree principles by combining multiple trees, resulting in more robust and accurate predictions.
Random Forest (RF): Random Forest (Breiman 2001) is a tree-based ensemble method that constructs multiple decision trees in parallel, each trained on different bootstrap samples of the data with random feature selection at each split. This approach reduces overfitting and increases robustness by averaging predictions across the ensemble, effectively minimising variance. Although individual decision trees in the forest are prone to high variance and overfitting, combining their outputs leads to more accurate and stable predictions. For classification, RF aggregates class probabilities from the trees to produce probability-like outputs. For regression, it averages the predicted values. This model is versatile, handling both classification and regression tasks efficiently, while also providing measures of feature importance.
Extreme Gradient Boosting (XGBoost): Extreme Gradient Boosting (Chen and Guestrin 2016) is an advanced implementation of gradient-boosted decision trees (GBDT), optimised for speed and accuracy. Unlike Random Forests, XGBoost builds trees sequentially, each tree learning to correct the residual errors of the ensemble using gradient-based optimisation. The algorithm incorporates regularisation (L1 and L2 penalties) to reduce overfitting, efficient parallelisation for fast training, and sparse data handling for memory efficiency. For classification, XGBoost uses regression trees to predict residuals and applies softmax or logistic transformations to generate class probabilities, offering well-calibrated outputs.
Categorical Boosting (CatBoost): Categorical Boosting (Prokhorenkova et al. 2019) is a gradient boosting method designed to handle categorical data natively, which makes it particularly useful for data sets with high-cardinality features. Unlike traditional gradient-boosting methods, CatBoost uses ordered boosting, a permutation-based approach that avoids overfitting during training. It also applies efficient techniques for categorical feature encoding, such as target-based encoding, without data leakage. CatBoost delivers robust performance across classification and regression tasks, with minimal need for extensive preprocessing or parameter tuning.
Light Gradient Boosting Machine (LightGBM): Light Gradient Boosting Machine (Ke et al. 2017) is a highly efficient gradient boosting framework designed for scalability and speed. It uses histogram-based algorithms to discretise continuous features into bins, reducing computational overhead. Additionally, exclusive feature bundling combines low-correlation features into a single feature to further enhance efficiency. LightGBM grows trees leaf-wise rather than level-wise, focussing on the leaves with the highest potential loss reduction, which often improves performance. It is particularly suited for large datasets, high-dimensional data, and imbalanced classification tasks, offering strong predictive accuracy with reduced training time.
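The residual-fitting principle shared by the gradient boosting variants above (XGBoost, CatBoost, LightGBM) can be illustrated with a minimal sketch in which depth-one regression trees (stumps) are fitted sequentially to the current residuals. The data, the stump learner, and the hyperparameters are synthetic assumptions for illustration only:

```python
import numpy as np

rng = np.random.default_rng(1)
X = rng.uniform(0, 10, size=200)
y = np.sin(X) + rng.normal(scale=0.1, size=200)

def fit_stump(X, y):
    """Fit a depth-1 regression tree (stump) minimising squared error."""
    best = None
    for t in np.quantile(X, np.linspace(0.05, 0.95, 19)):
        left, right = y[X <= t], y[X > t]
        if len(left) == 0 or len(right) == 0:
            continue
        sse = ((left - left.mean()) ** 2).sum() + ((right - right.mean()) ** 2).sum()
        if best is None or sse < best[0]:
            best = (sse, t, left.mean(), right.mean())
    _, t, lv, rv = best
    return lambda x, t=t, lv=lv, rv=rv: np.where(x <= t, lv, rv)

# Gradient boosting with squared error loss: each new stump is fitted to the
# residuals of the current ensemble (the negative gradient of the MSE).
pred = np.zeros_like(y)
lr = 0.5  # learning rate (shrinkage)
for _ in range(50):
    stump = fit_stump(X, y - pred)
    pred += lr * stump(X)

mse_boosted = ((y - pred) ** 2).mean()
mse_mean = ((y - y.mean()) ** 2).mean()
print(mse_boosted < mse_mean)  # the boosted ensemble beats a constant baseline
```

Each library refines this idea differently (regularised objectives in XGBoost, ordered boosting in CatBoost, histogram-based leaf-wise growth in LightGBM), but the sequential residual correction is common to all three.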
Performance metrics and model evaluation
Model performance assessment requires appropriate metrics for training, validation, and interpretation. This section presents the mathematical formulations of the performance metrics employed throughout this work, covering loss functions, accuracy measures, and interpretability methods.
Mean squared error
The mean squared error (MSE) is a metric used to evaluate the quality of an estimator by calculating the average of the squared differences between the estimated values and the true values. It serves as a risk function, representing the expected value of the squared error loss. Since it is based on the square of the Euclidean distance, the MSE is always a positive value and decreases as the errors approach zero. The MSE is adopted as the loss function for all algorithms used in this work. If \(\hat{y}_i\) represents the predicted value for the i-th sample and \(y_i\) denotes the corresponding true value, the MSE for a single-output regression task estimated over N samples is expressed as shown in Equation 1. For a multi-output regression task with k targets, where \(\hat{y}_{i,j}\) represents the predicted value for the j-th target of the i-th sample and \(y_{i,j}\) denotes the corresponding true value, the MSE is expressed as shown in Equation 2. Here, the MSE is averaged across both the N samples and the k output targets.
The root mean square error (RMSE) is calculated by taking the square root of the MSE, which results in a measure with the same units as the quantity being estimated. For an unbiased estimator, the RMSE corresponds to the square root of the variance. As it provides a measure in the same units as the target features, RMSE is used as the evaluation metric to assess model performance in this work.
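The single-output and multi-output MSE definitions, together with the derived RMSE, can be expressed compactly as follows; this is a minimal illustration, not the implementation used in this study:

```python
import numpy as np

def mse(y_true, y_pred):
    """Mean squared error, averaged over N samples (and k targets if 2-D)."""
    y_true, y_pred = np.asarray(y_true, float), np.asarray(y_pred, float)
    return ((y_true - y_pred) ** 2).mean()

def rmse(y_true, y_pred):
    """Root mean squared error: same units as the target feature."""
    return np.sqrt(mse(y_true, y_pred))

# Single-output regression task
y, y_hat = [3.0, 5.0, 2.0], [2.0, 5.0, 4.0]
print(mse(y, y_hat))   # (1 + 0 + 4) / 3 = 1.666...

# Multi-output task: averaging over both N samples and k targets
Y = np.array([[1.0, 2.0], [3.0, 4.0]])
Y_hat = np.array([[1.0, 0.0], [3.0, 6.0]])
print(mse(Y, Y_hat))   # (0 + 4 + 0 + 4) / 4 = 2.0
```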
Coefficient of determination
The coefficient of determination (\(R^2\)) is the proportion of the variation in the dependent (target) feature(s) that is predictable from the independent (input) feature(s). The \(R^2\) score provides a measure of how well unseen samples are likely to be predicted by the model, through the proportion of total variance explained by the model. The best possible score is 1.0 and it can be negative (because the model can be arbitrarily worse). A constant model that always predicts the expected (average) value of y, disregarding the input features, would get an \(R^2\) score of 0.0. If \(\hat{y}_i\) is the predicted value for the i-th sample, \(y_i\) the corresponding true value and \(\bar{y}\) the mean of y, the estimated \(R^2\) score over N samples is expressed as shown in Equation 3, where \(\bar{y} = \frac{1}{N} \sum _{i=1}^{N} y_i\):

\[ R^2 = 1 - \frac{\sum _{i=1}^{N} \left( y_i - \hat{y}_i \right) ^2}{\sum _{i=1}^{N} \left( y_i - \bar{y} \right) ^2} \quad \text{(3)} \]
There is an inverse mathematical relationship between \(R^2\) and RMSE, as expressed in Equation 4:

\[ R^2 = 1 - \frac{N \cdot \text{RMSE}^2}{\sum _{i=1}^{N} \left( y_i - \bar{y} \right) ^2} \quad \text{(4)} \]

As RMSE increases, indicating higher prediction errors, \(R^2\) decreases. However, this relationship is not strictly linear, as it depends on the total variance of y. RMSE is an absolute error metric expressed in the original units of measurement, while \(R^2\) is a relative measure of fit and may not be directly comparable across different datasets. Thus, models with different ranges of y can have similar \(R^2\) values but different RMSE values.
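A small numerical check of the definition of \(R^2\) and its relationship to RMSE; the data values are illustrative:

```python
import numpy as np

def r2(y_true, y_pred):
    """Coefficient of determination: 1 - SS_res / SS_tot."""
    y_true, y_pred = np.asarray(y_true, float), np.asarray(y_pred, float)
    ss_res = ((y_true - y_pred) ** 2).sum()
    ss_tot = ((y_true - y_true.mean()) ** 2).sum()
    return 1.0 - ss_res / ss_tot

y = np.array([1.0, 2.0, 3.0, 4.0])
y_hat = np.array([1.1, 1.9, 3.2, 3.8])

rmse = np.sqrt(((y - y_hat) ** 2).mean())
n = len(y)

# The inverse relationship: R^2 = 1 - N * RMSE^2 / sum((y_i - y_bar)^2)
r2_from_rmse = 1.0 - n * rmse ** 2 / ((y - y.mean()) ** 2).sum()
print(np.isclose(r2(y, y_hat), r2_from_rmse))  # True

# A constant model predicting the mean of y scores exactly 0
print(r2(y, np.full(n, y.mean())))  # 0.0
```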
SHapley Additive exPlanations

SHapley Additive exPlanations (SHAP) values (Lundberg and Lee 2017), derived from cooperative game theory, are a widely used method for interpreting machine learning models. They allocate credit for a model’s prediction among its input features by measuring each feature’s marginal contribution across all possible feature combinations.
For a given observation \(\textbf{X}\) with features \(\{X_1, X_2, \ldots , X_p\}\), the SHAP value \(\phi _j(\textbf{X})\) for feature j is defined as:

\[ \phi _j(\textbf{X}) = \sum _{S \subseteq \{X_1, \ldots , X_p\} \setminus \{X_j\}} \frac{|S|! \, (p - |S| - 1)!}{p!} \left[ f(S \cup \{X_j\}) - f(S) \right] \quad \text{(5)} \]
where S represents a subset of features excluding feature j, f(S) is the model’s expected prediction when only features in subset S are known, and p is the total number of features. SHAP values satisfy the efficiency property: \(\sum _{j=1}^{p} \phi _j(\textbf{X}) = f(\textbf{X}) - f(\emptyset )\), ensuring that the sum of all feature contributions equals the difference between the prediction and the baseline value.
SHAP values offer three key advantages (Martín-Baos et al. 2023): (1) They improve global interpretability by not only identifying important features but also indicating whether each feature positively or negatively impacts predictions; (2) They provide local interpretability, as SHAP values are computed for individual predictions, revealing how each feature contributes to specific outcomes; and (3) SHAP values are highly versatile and can be used to explain a wide range of models, including linear models, tree-based models, and neural networks. Although they can be computationally demanding in practice, this flexibility makes SHAP values a robust and powerful tool for model interpretability, offering insights into feature importance and model behaviour while providing a clear understanding of how individual features influence predictions.
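The Shapley definition and its efficiency property can be verified exactly on a toy linear model, where the prediction for a subset S of known features is obtained by substituting baseline values for the missing features. The model weights, observation, and baseline below are illustrative assumptions:

```python
import numpy as np
from itertools import combinations
from math import factorial

# Toy "model": f(x) = 3*x1 + 2*x2 - x3, explained for one observation.
weights = np.array([3.0, 2.0, -1.0])
x = np.array([1.0, 2.0, 3.0])           # observation to explain
background = np.array([0.5, 0.5, 0.5])  # baseline (expected) feature values

def f(subset):
    """Expected prediction with only the features in `subset` known."""
    z = background.copy()
    for j in subset:
        z[j] = x[j]
    return weights @ z

p = 3
phi = np.zeros(p)
for j in range(p):
    others = [i for i in range(p) if i != j]
    for size in range(p):
        for S in combinations(others, size):
            w = factorial(len(S)) * factorial(p - len(S) - 1) / factorial(p)
            phi[j] += w * (f(S + (j,)) - f(S))  # weighted marginal contribution

# Efficiency property: contributions sum to f(X) - f(empty set)
print(np.isclose(phi.sum(), f(tuple(range(p))) - f(())))  # True
print(phi)  # for a linear model, phi_j = w_j * (x_j - background_j)
```

The exact computation enumerates all \(2^{p-1}\) subsets per feature, which is why practical implementations (e.g., tree-based SHAP estimators) rely on model-specific approximations.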
Hyperparameter optimisation
Effective model performance depends not only on algorithm choice but also on proper hyperparameter configuration. Although ML algorithms and implementations come with default hyperparameter settings, these are rarely optimal for specific tasks and can significantly affect both predictive accuracy and model interpretability. Because the optimal hyperparameters are algorithm-specific, a Hyperparameter Optimisation (HPO) problem is defined to tune them. HPO tailors an algorithm’s hyperparameters to the modelling task for improved performance and supports reproducible studies for fair comparison and analysis of different algorithms, training pipelines, and data transformations.
According to Feurer and Hutter (2019), the HPO problem can be formalised as follows: let D be a dataset of the problem at hand. Consider an ML algorithm with hyperparameters denoted by \(\mathcal {A}_{\lambda }\). Note that the vector of hyperparameters \(\lambda \) belongs to the feasible region of hyperparameters, that is, \(\lambda \in \Lambda \). The optimal set of hyperparameters \(\lambda ^*\) is computed as in Equation 6, where \(\textbf{V} \left( \mathcal {A}_{\lambda }, D_{\text {train}}, D_{\text {valid}} \right) \) is a performance metric of \(\mathcal {A}_{\lambda }\) on training data \(D_{\text {train}}\) and assessed on validation data \(D_{\text {valid}}\):

\[ \lambda ^* = \underset{\lambda \in \Lambda }{\arg \min } \; \mathbb {E}_{(D_{\text {train}}, D_{\text {valid}}) \sim \mathcal {D}} \, \textbf{V} \left( \mathcal {A}_{\lambda }, D_{\text {train}}, D_{\text {valid}} \right) \quad \text{(6)} \]
In practice, the expectation \(\mathbb {E}\) must be approximated, and a holdout validation approach with random or temporal splits is used for this purpose, where the dataset D is partitioned into training and validation subsets. Finally, the validation performance is taken as an estimator of \(\mathbb {E}_{(D_{\text {train}}, D_{\text {valid}}) \sim \mathcal {D}} \textbf{V} \left( \mathcal {A}_{\lambda }, D_{\text {train}}, D_{\text {valid}} \right) \). In the experiments carried out in this work, the performance metric for \(\textbf{V}\) is the MSE, and a random search strategy was adopted in which a random, uninformed set of hyperparameter values is chosen in each iteration based on a distribution for each hyperparameter. This strategy effectively balances the problem of expensive function evaluations for large models against the high-dimensional search space when there are many hyperparameters. The details of the implementation of the HPO problem and the optimal values of the hyperparameters can be found in Appendix B.
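The random search strategy can be sketched as follows. The search space and the surrogate validation function are hypothetical stand-ins for the actual training and holdout-validation procedure:

```python
import random

random.seed(42)

def validate(params, d_train, d_valid):
    """Stand-in for V(A_lambda, D_train, D_valid): would train A with `params`
    on d_train and return the validation MSE. Here, a synthetic loss surface."""
    lr, depth = params["lr"], params["depth"]
    return (lr - 0.1) ** 2 + 0.01 * abs(depth - 6)

# Random search: sample each hyperparameter independently from its distribution.
search_space = {
    "lr": lambda: 10 ** random.uniform(-3, 0),  # log-uniform learning rate
    "depth": lambda: random.randint(2, 12),     # uniform integer tree depth
}

best_params, best_score = None, float("inf")
for _ in range(100):  # budget of 100 uninformed trials
    params = {name: sample() for name, sample in search_space.items()}
    score = validate(params, d_train=None, d_valid=None)
    if score < best_score:
        best_params, best_score = params, score

print(best_score < validate({"lr": 1.0, "depth": 2}, None, None))  # beats a poor default
```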
Statistical significance testing
Statistical tests are essential for comparing the performance of different models, especially when evaluating their predictive accuracy across multiple datasets or cross-validation folds. Following the recommendations of Demšar (2006), this section describes the statistical tests used in this work to assess the significance of differences in model performance.
Friedman test
The Friedman test (Friedman 1940) is a non-parametric statistical test used to detect differences between multiple algorithms across multiple datasets or folds. It serves as the non-parametric equivalent of repeated-measures ANOVA (Analysis of Variance) and evaluates the null hypothesis that all algorithms perform equally by considering the rankings of models rather than their raw scores. Let k be the number of algorithms and N the number of datasets (or folds). For each dataset, the algorithms are ranked according to their performance, and the test statistic is computed as:
\[\chi ^2_F = \frac{12}{N k (k+1)} \sum _{j=1}^{k} R_j^2 - 3N(k+1)\]
where \(R_j\) is the sum of ranks for algorithm j. A significant p-value (e.g., \(p < 0.05\)) indicates that at least one algorithm performs significantly differently from the others.
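As an illustration, the Friedman test is available in SciPy as friedmanchisquare; the per-fold scores below are synthetic.

```python
import numpy as np
from scipy.stats import friedmanchisquare

# Hypothetical error scores: one array per algorithm, one entry per fold (N = 10, k = 3).
rng = np.random.default_rng(42)
base = rng.random(10)
scores_a = base                                # algorithm A
scores_b = base + 0.05                         # consistently worse than A
scores_c = base + rng.normal(0.0, 0.01, 10)    # similar to A

# The null hypothesis is that all three algorithms perform equally;
# a small p-value indicates at least one ranks differently across folds.
stat, p = friedmanchisquare(scores_a, scores_b, scores_c)
```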
Post-hoc Nemenyi test
If the Friedman test detects significant differences between algorithms, post-hoc tests are applied to determine which specific pairs differ significantly. The Nemenyi test (Nemenyi 1963) is a common post-hoc procedure that compares the average ranks of all pairs of algorithms while controlling for multiple comparisons using a critical difference (CD) threshold based on the Studentised range distribution. For two algorithms i and j, the difference in average ranks \(|\bar{R}_i - \bar{R}_j|\) is considered statistically significant if it exceeds the critical difference:
\[\text {CD} = q_{\alpha } \sqrt{\frac{k(k+1)}{6N}}\]
where \(q_{\alpha }\) is the critical value from the Studentised range distribution for significance level \(\alpha \), k is the number of algorithms, and N is the number of datasets. The Nemenyi test provides pairwise comparisons while maintaining control over the family-wise error rate.
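A minimal sketch of the CD computation; the critical values \(q_{\alpha }\) below are taken from the table in Demšar (2006) for \(\alpha = 0.05\), and the function name is illustrative.

```python
import numpy as np

# Critical values q_alpha of the Studentised range statistic (divided by
# sqrt(2)) for alpha = 0.05, indexed by the number of algorithms k;
# values from Demšar (2006).
Q_05 = {2: 1.960, 3: 2.343, 4: 2.569, 5: 2.728, 6: 2.850}

def critical_difference(k, n_datasets, q_table=Q_05):
    """Nemenyi critical difference: CD = q_alpha * sqrt(k(k+1) / (6N))."""
    return q_table[k] * np.sqrt(k * (k + 1) / (6.0 * n_datasets))

# Two algorithms differ significantly if their average ranks
# differ by more than CD.
cd = critical_difference(k=5, n_datasets=10)
```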
Ensemble learning methods
Ensemble learning combines predictions from multiple models to achieve superior performance compared to individual models. This section presents advanced ensemble techniques that go beyond simple averaging by learning optimal weights for model combination, thereby adaptively leveraging the strengths of different algorithms on specific data subsets.
Least squares optimisation
Least squares optimisation is a linear regression-based technique that determines optimal weights to combine multiple model predictions by minimising the squared error between the ensemble output and the true values. Given predictions from k different models for each of the N samples, the objective is to find a weight vector \(\textbf{w} = [w_1, \ldots , w_k]^T\) such that the weighted average of the model outputs best approximates the ground truth in the least squares sense. This approach is typically constrained so that the weights sum to unity and remain non-negative, ensuring that the ensemble remains a convex combination of model predictions. The optimisation problem is formulated as:
\[\min _{\textbf{w}} \sum _{i=1}^{N} \left( \sum _{j=1}^{k} w_j \hat{y}_{i,j} - y_i \right) ^2 \quad \text {subject to} \quad \sum _{j=1}^{k} w_j = 1, \quad w_j \ge 0\]
where \(\hat{y}_{i,j}\) denotes the prediction of the j-th model for the i-th sample, and \(y_i\) is the corresponding ground truth. This method can be extended to multi-output regression by optimising weights separately for each target.
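The constrained problem can be solved numerically, for example with SciPy's SLSQP solver; the function name and toy predictions below are illustrative.

```python
import numpy as np
from scipy.optimize import minimize

def convex_ensemble_weights(P, y):
    """Weights minimising ||P w - y||^2 subject to w >= 0 and
    sum(w) == 1, i.e. a convex combination of model predictions."""
    k = P.shape[1]
    w0 = np.full(k, 1.0 / k)  # start from equal weights
    res = minimize(
        lambda w: np.sum((P @ w - y) ** 2),
        w0,
        method="SLSQP",
        bounds=[(0.0, 1.0)] * k,
        constraints=[{"type": "eq", "fun": lambda w: w.sum() - 1.0}],
    )
    return res.x

# Toy example: model 1 matches the truth exactly; model 2 is biased.
y = np.array([1.0, 2.0, 3.0, 4.0])
P = np.column_stack([y, y + 1.0])
w = convex_ensemble_weights(P, y)  # weight concentrates on model 1
```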
Ridge regression
Ridge regression (Hoerl and Kennard 1970) is a regularised version of linear regression that adds a \(L_2\) penalty to the weight coefficients, preventing overfitting by discouraging large weights. When used for ensembling, ridge regression learns an optimal linear combination of model predictions that best fits the target values while applying a penalty to the magnitude of the weights. Given the prediction matrix \(\textbf{P} \in \mathbb {R}^{N \times k}\) (where each column corresponds to predictions from a different model), and the ground truth vector \(\textbf{y} \in \mathbb {R}^N\), the ridge regression objective is:
\[\min _{\textbf{w}} \Vert \textbf{P}\textbf{w} - \textbf{y} \Vert _2^2 + \alpha \Vert \textbf{w} \Vert _2^2\]
where \(\textbf{w}\) is the vector of ensemble weights, and \(\alpha \) is the regularisation parameter controlling the trade-off between fit and weight shrinkage. Unlike constrained least squares, ridge regression allows weights to be negative or greater than unity, enabling more flexible and potentially more accurate ensembles.
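The ridge weights admit a closed-form solution, \(\textbf{w} = (\textbf{P}^T\textbf{P} + \alpha \textbf{I})^{-1}\textbf{P}^T\textbf{y}\), sketched below on the same toy predictions; the function name is illustrative.

```python
import numpy as np

def ridge_ensemble_weights(P, y, alpha=1.0):
    """Closed-form ridge solution: w = (P^T P + alpha I)^{-1} P^T y.
    Weights may be negative or exceed unity, unlike the constrained case."""
    k = P.shape[1]
    return np.linalg.solve(P.T @ P + alpha * np.eye(k), P.T @ y)

# Toy example: model 1 matches the truth exactly; model 2 is biased.
y = np.array([1.0, 2.0, 3.0, 4.0])
P = np.column_stack([y, y + 1.0])
w = ridge_ensemble_weights(P, y, alpha=1e-6)  # near-unregularised fit
```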
Methodology
This section outlines the methodological framework of the study, detailing the Automated Passenger Counting (APC) system as the main data source. It describes the analysis and modelling of ground truth (GT) for APC data validation, explains the predictive modelling approach, and discusses the feature engineering process designed to enhance model performance. Figure 2 provides an overview of the complete methodological pipeline, illustrating the flow from input data through preprocessing, exploratory analysis, predictive modelling, performance analysis, and ablation studies.
Fig. 2
Overview of the methodological pipeline showing the complete workflow from input data collection through preprocessing, exploratory analysis, predictive modelling, performance evaluation, and ablation studies with external factors
This study uses data collected from AtB, the public transport administrator for Trondheim, Norway, through APC systems installed on buses, which use DILAX optical sensors equipped with 3D stereoscopic vision technology to generate automated passenger counts. Although occasional technical issues can cause count discrepancies, as discussed in Section 3.2, the preprocessed APC data reliably represent passenger trends and patterns across the bus transit network.
Fig. 3
Satellite map showing the six bus lines included in the public transit dataset covering Trondheim Municipality
The study’s dataset covers the period from 1st May 2020 to 30th November 2023, comprising 1,179,770 unique trips over 1,295 days, across 6 lines and 204 stops in Trondheim Municipality. Each bus line has a set of routes (which vary based on seasons, construction works, etc.), and each route has a set of stops; the trip is defined as “the sequence of boarding and alighting counts at these stops for a given bus line and route.” As shown in Figure 3, the bus lines in the data cover a wide section of the city, most of them running through the city centre. The raw dataset features a comprehensive set of 63 attributes, providing insight into the temporal and spatial aspects of bus trips, passenger flow metrics, and various operational parameters.
Temporal Information: Includes date and time features, such as TripScheduledDeparture, TripDepartureHour, StopScheduledArrival and StopActualArrival. These capture the scheduled and actual timings for departures and arrivals for specific stops and entire trips.
Operational Flags: A series of conditional flags such as StopTime, FLAG_Trip_15minDeviation and FLAG_Stop_IsDelayed offer insight into operational issues.
Spatial Information: Encoded via StopName, Longitude, and Latitude, this provides a spatial context for each trip.
Passenger Flow Metrics: The Boarding, Alighting, and Onboard represent data from the APC system, which records passenger counts over the course of the trip.
Vehicle and Service Information: Details such as BusType and CarriageNumber identify vehicles in service, allowing analysis of usage and performance across the fleet.
Special Dates: The dataset is enriched with flags such as FLAG_Holiday_Restday and FLAG_SchoolVacation and descriptive fields such as DayComment and HolidayName to denote public holidays and school breaks, allowing nuanced analyses of service variations on special days.
Data preprocessing and quality control
The APC dataset was extensively preprocessed to ensure data quality and consistency for subsequent analysis. As previously mentioned, the dataset’s foundation is the APC data, enriched with pertinent operational and spatiotemporal details. Figure 4 and Figure 5 show the total daily passenger counts across the entire dataset and for each bus line, respectively, highlighting the temporal trends and seasonal variations in the volume of passengers.
Fig. 4
Total daily passenger counts across the entire dataset from May 2020 to November 2023, showing yearly trends and seasonal variations
The raw dataset was preprocessed to address missing values, correct erroneous negative values, and filter out trips with incomplete stop records, as outlined below.
1.
Unique Trip Identifiers: The Trip feature, a daily unique identifier, was combined with the Date feature to create dateTrip, a unique identifier across the entire dataset. This is useful for grouping the APC data from each stop under the correct trip.
2.
APC Features: The dataset primarily includes the Boarding and Alighting features of the APC system, representing passengers getting on and off the bus at each stop. These figures are aggregated from the counts at each door. Using these, new metrics are calculated, such as the passenger count on the bus between stops and the change in APC throughout the trip. Data cleaning involved: (1) Setting missing values in Boarding and Alighting to zero. (2) Discarding erroneous derivative APC values and engineering three new features, busVolume, tripSumVolume, and stopVolume, which denote the number of passengers onboard, the cumulative boardings since the start of the trip, and the APC at each stop, respectively.
3.
Temporal Information: For bus schedule features TripScheduledArrival, StopActualArrival, and StopActualDeparture, empty values were replaced with the corresponding scheduled values. Instances where StopActualDeparture was earlier than StopActualArrival were corrected and StopTime subsequently recalculated to reflect this correction.
4.
Missing Values: Missing values for conditional operational flags were determined using expert knowledge and set accordingly. The missing BusType, DayComment, and HolidayName data were filled with “No Information”, while modal values were used for missing Municipality and StopType data.
5.
Sanity and Completeness Filters: Taking a holistic view of the APC data at the trip level, we filtered out trips with bad APC data where: (1) There are negative Boarding or Alighting values. (2) FLAG_Trip_APCActive is False or FLAG_Trip_APCExtremeValue is True. (3) The APCs for some stops are missing over the course of the trip.
6.
Transfer Stops: As detailed in Section 3.4.1, we engineered two new features—StopType and TransferStop—representing the potential of a bus stop to act as a hub for connections across different bus lines and parts of the city.
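Steps 2 and 5 above can be sketched in pandas as follows; the grouping by dateTrip and the condensed filter logic are simplifying assumptions, not the full pipeline.

```python
import pandas as pd

def preprocess(df: pd.DataFrame) -> pd.DataFrame:
    """Derive trip-level APC features and drop trips with bad counts
    (a condensed, illustrative version of steps 2 and 5)."""
    df = df.copy()
    df[["Boarding", "Alighting"]] = df[["Boarding", "Alighting"]].fillna(0)
    df = df.sort_values(["dateTrip", "StopSequence"])

    # Step 2: engineered APC features.
    df["stopVolume"] = df["Boarding"] + df["Alighting"]                # APC at each stop
    df["tripSumVolume"] = df.groupby("dateTrip")["Boarding"].cumsum()  # cumulative boardings
    net = df["Boarding"] - df["Alighting"]
    df["busVolume"] = net.groupby(df["dateTrip"]).cumsum()             # passengers onboard

    # Step 5 (partial): discard entire trips containing negative counts.
    bad_ids = df.loc[(df["Boarding"] < 0) | (df["Alighting"] < 0), "dateTrip"].unique()
    return df[~df["dateTrip"].isin(bad_ids)]

trips = pd.DataFrame({
    "dateTrip": ["t1", "t1", "t2", "t2"],
    "StopSequence": [1, 2, 1, 2],
    "Boarding": [3, 1, 2, -1],
    "Alighting": [0, 2, 0, 1],
})
clean = preprocess(trips)  # trip t2 is dropped due to its negative count
```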
In the final preprocessing step, we removed features deemed irrelevant or outdated (as confirmed by expert consultation). We also removed features that became constant (zero-variance) as a result of the preprocessing pipeline. The final preprocessed dataset contains 47 features—detailed in Tables 9 and 10—across 951,681 unique trips and 29,069,459 data points (boarding and alighting counts at stops), representing 80.67% of the raw dataset. This reduction primarily addresses inaccuracies in the APC data from the early stages of system development, when initial bugs were being resolved. The preprocessed dataset can be further divided into subsets based on line, trip, or temporal aggregation levels for analysis and modelling.
Exploratory data analysis
This section presents statistical analyses of the preprocessed data to characterise passenger volumes and trip patterns across the public transit system.
Figure 6 shows that the “metro” Lines 1 and 3 have higher trip counts and passenger volumes than the other lines in the data. However, there is a contrast between trip counts and passenger volumes for Lines 10, 11, 12, and 14: despite similar trip counts, Line 11 handles more volume than Line 10, and the same can be observed for Lines 12 and 14. These disparities reflect both real-world trends and the data preprocessing pipeline, which discarded some trips through the sanity and completeness filters.
Figures 7 and 8 show the temporal variations in tripSumVolume passenger volumes in the public transit system. The monthly passenger volumes show notable declines during holiday periods such as the Easter, Summer, Christmas, and New Year holidays. The yearly volumes reflect the impact of the COVID-19 pandemic, with significantly reduced numbers in 2020, followed by a steady increase from 2021 to 2023. Note that the COVID-19 period in Norway, characterised by lockdowns and travel restrictions, spanned from 13th March 2020 to 28th February 2022.
The extent to which the increasing yearly volumes reflect recovery from COVID-19 effects rather than actual increases in public transit usage is difficult to determine because of a lack of pre-COVID-19 APC data for comparison. The hourly passenger volumes show a morning peak for trips between 07:00 and 08:00, with the afternoon peak dispersed between 14:00 and 16:00. Daily passenger volumes are relatively stable on weekdays and drop on weekends.
Fig. 6
Total trip counts and total passenger volumes across bus lines for the entire dataset period
The descriptive statistics for boarding and alighting counts, shown in Table 1, reveal highly skewed and peaked distributions, characteristic of sparse events with occasional extreme values. The target features exhibit low mean counts and comparable variability, indicating a generally narrow spread. The 25th, 50th, and 75th percentiles reveal that most observations involve very low counts (0-2 passengers), while the maximum values (117 and 152 passengers) indicate the presence of rare but significant outliers. The high skewness values (5.48 and 6.11) and exceptionally high kurtosis values (51.74 and 70.09) confirm the existence of long tails towards higher counts and sharply peaked distributions with infrequent extreme values.
Table 1
Descriptive statistics for the APC boarding and alighting counts
Target    | Mean | SD   | Minimum | 25% | 50% | 75% | Maximum | Skew | Kurtosis
Boarding  | 1.48 | 3.29 | 0       | 0   | 0   | 2   | 117     | 5.48 | 51.74
Alighting | 1.40 | 3.18 | 0       | 0   | 0   | 2   | 152     | 6.11 | 70.09
Most observations involve very low counts, with rare but significant outliers. Passenger counts are integers while skewness and kurtosis are dimensionless
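Statistics of this kind can be reproduced with pandas; since the APC dataset is not public, the sketch below uses synthetic counts with a similarly sparse, heavy-tailed shape.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)
# Synthetic sparse counts: mostly small Poisson values plus rare
# large events, mimicking the skewed, peaked APC distributions.
counts = pd.Series(rng.poisson(0.8, 10_000) + rng.binomial(1, 0.01, 10_000) * 40)

stats = {
    "Mean": counts.mean(),
    "SD": counts.std(),
    "25%": counts.quantile(0.25),
    "50%": counts.quantile(0.50),
    "75%": counts.quantile(0.75),
    "Maximum": counts.max(),
    "Skew": counts.skew(),          # long right tail -> large positive skew
    "Kurtosis": counts.kurtosis(),  # excess kurtosis: peaked, rare extremes
}
```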
Although Figure 6 shows that there are significant variations in passenger volumes on the different bus lines, the temporal variations shown in Figures 7 and 8—which present aggregated trends across all lines—also exhibit line-specific patterns when examined individually. Thus, a predictive model of the boarding and alighting counts across the public transit system has to disentangle the spatiotemporal dynamics of each bus line and stop. In addition, the characteristics of the boarding and alighting counts point to challenges with imbalanced data and outliers which the model has to be robust enough to handle effectively.
Ground truth validation data
To validate the boarding and alighting passenger counts generated by the Automated Passenger Counting (APC) system, ground truth (GT) data from video recordings were obtained and analysed for some trips covering the span of the APC dataset. The full GT data—after preprocessing to ensure accurate comparison with the APC data—consists of 41 unique trips over 24 days, across all 6 lines but limited to 169 stops, leading to a total of 1180 data points for comparing the boarding and alighting counts.
Figure 9 plots the residuals between the APC and GT data for the boarding (left) and alighting (right) counts. The black line shows the mean residual at each GT value, with shaded error bands indicating 95th-percentile confidence intervals.
At low passenger counts under \(\approx 15\), the APC system follows GT fairly closely, with residuals centred near zero and narrow spread. However, as counts increase, the residuals exhibit a marked downward trend, indicating systematic undercounting. This bias becomes particularly pronounced for boarding counts above \(\approx 20\) and alighting counts above \(\approx 30\), where the mean residual deviates dramatically and the variability increases. These results suggest that the APC error is not constant across count levels, but instead worsens with higher volumes of passengers, leading to significant undercounting in high-volume scenarios. Such high-volume scenarios are relatively rare, as data become increasingly sparse with higher passenger counts.
Fig. 9
Residuals between APC and GT counts for boarding (left) and alighting (right) plotted against GT values. The black line shows the mean residual at each GT value, with shaded error bands indicating 95th-percentile confidence intervals
Analysis of GT data and video recordings reveals several technical challenges inherent in automated passenger counting systems. The optical counting system faces challenges when passengers move through doors simultaneously or at varying speeds, particularly during busy periods with large groups. Detection difficulties arise with certain types of passengers, such as small children who may be partially obscured by accompanying adults. GPS-based stop registration presents technical limitations, with occasional positional drift that causes stops to be registered out of sequence or missed entirely. Other technical factors include video processing lag and varying lighting conditions.
Figure 10 shows the error distributions for the boarding and alighting counts, calculated as diff = APC - GT. Overall, there is strong agreement between the two sources, indicating that the APC system is generally accurate in its counting. However, there are notable outliers in which the APC counts deviate significantly from the GT counts. The alighting counts exhibit more pronounced errors than the boarding counts, with a slight bias towards underestimating actual passenger counts.
Fig. 10
Distribution of counting errors between the APC and GT sources for the boarding and alighting counts. The residuals are computed as the differences between the APC and GT counts, with positive values indicating overestimation by the APC system and vice-versa
Fig. 11
Spatial variation of counting error across the bus lines. The residuals for each Line and StopSequence are computed as the mean of the sum of the absolute differences between the APC and GT counts for the boarding and alighting counts
Figure 11 presents the spatial distribution of the error counts across the bus lines and stops, computed as the mean absolute differences between APC and GT counts aggregated by Line and StopSequence. The heatmap reveals a generally uniform distribution of errors, with a few outliers concentrated at high-volume stops along metro Lines 1 and 3. These patterns indicate that the errors are systemic to APC technology rather than arising from specific operational issues, becoming more pronounced during rare high-volume events.
The GT data provides validation of APC system performance, revealing a generally good performance but also highlighting the challenges of automated passenger counting in real-world conditions. The key question is whether there are sufficient patterns in the error distributions and whether a model trained on the available GT data (41 trips, 1180 data points) can be used to correct the entire APC dataset (951,681 trips, 29,069,459 data points) for more accurate modelling and analysis of public transit dynamics.
Machine learning model development
This section outlines the development of machine learning (ML) and deep learning (DL) models to capture the local and global spatiotemporal dynamics of passenger flows on selected bus lines in Trondheim’s public transit system. Local modelling and interpretation, treating individual bus lines and stops as independent, offer valuable insights that inform the more complex task of modelling dependencies across the entire public transit system as a unified network.
This section addresses the next stage: optimising the input feature space and predictive modelling performance for local line and stop dynamics using ML/DL algorithms such as Tabular Deep Neural Network (DNN), CatBoost, Random Forest, XGBoost, and LightGBM. SHapley Additive exPlanations (SHAP) values will be used to evaluate input feature importance and correlations, providing interpretability and explaining the models’ predictions. Furthermore, the APC data will be augmented with features such as elevation data for bus stops, historical weather data, and urban demographics and land use data. The importance of these engineered features for passenger count predictions and model interpretability will also be assessed.
Training and validation splits
The temporal splitting strategy used in this study was designed to ensure robust generalisation across temporal boundaries while addressing potential seasonality and yearly variations in the data. For the APC predictive modelling task, the data was temporally split 75:25 into training and validation subsets, with the training set containing data from 1 May 2020 to 22 December 2022 (2 years, 7 months, 21 days) and the validation set containing data from 22 December 2022 to 30 November 2023 (11 months, 8 days). This temporal split preserves the chronological structure of the data and ensures that models are evaluated on genuinely unseen future data.
The temporal split methodology serves multiple purposes: (1) it ensures that models learn within-year patterns and seasonal variations present in the training data, enabling robust generalisation to similar patterns in the validation period; (2) it addresses potential distribution shifts over time by training on a sufficiently long period that captures multiple seasonal cycles; and (3) it provides a realistic evaluation scenario where models must generalise from historical data to predict future passenger behaviour.
For model selection and statistical significance testing, two cross-validation approaches were implemented to comprehensively evaluate temporal generalisation:
Seasonal Cross-Validation: Employs fixed-seasonal windows to evaluate model stability across years within the same seasonal periods, testing whether models can generalise across similar seasonal contexts. The seasonal folds used: (1) training on May–December 2020, validation on May–December 2021; (2) training on 2021, validation on 2022; and (3) training on 2022, validation on January–November 2023.
Hybrid Cross-Validation: Combines both seasonal generalisation (fold 1) and forward extrapolation (folds 2–3), providing a comprehensive evaluation of both seasonal stability and temporal extrapolation capabilities. The hybrid folds maintained the first seasonal fold unchanged, but subsequently expanded training windows: (2) training on May 2020–December 2021, validation on 2022; and (3) training on May 2020–December 2022, validation on January–November 2023.
Together, the two schemes assess temporal generalisation from complementary angles: fold 1 tests seasonal generalisation by training on May–December 2020 and validating on May–December 2021, while the hybrid folds 2–3 test forward extrapolation with expanding training windows and chronologically subsequent validation periods.
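The seasonal and hybrid folds can be expressed as boolean date masks; a sketch assuming the trip dates are available as a pandas datetime Series.

```python
import pandas as pd

def seasonal_and_hybrid_folds(dates: pd.Series):
    """Return (train_mask, valid_mask) pairs for the seasonal and
    hybrid cross-validation schemes described in the text."""
    def between(a, b):
        return (dates >= a) & (dates <= b)

    seasonal = [
        (between("2020-05-01", "2020-12-31"), between("2021-05-01", "2021-12-31")),
        (between("2021-01-01", "2021-12-31"), between("2022-01-01", "2022-12-31")),
        (between("2022-01-01", "2022-12-31"), between("2023-01-01", "2023-11-30")),
    ]
    hybrid = [
        seasonal[0],  # fold 1 is shared between both schemes
        (between("2020-05-01", "2021-12-31"), between("2022-01-01", "2022-12-31")),
        (between("2020-05-01", "2022-12-31"), between("2023-01-01", "2023-11-30")),
    ]
    return seasonal, hybrid

dates = pd.Series(pd.date_range("2020-05-01", "2023-11-30", freq="D"))
seasonal, hybrid = seasonal_and_hybrid_folds(dates)
```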
For the GT data modelling task, a different approach was necessary due to limited data availability (41 trips, 1,180 data points). The GT data was randomly split 80:20 into training and validation subsets. Random splitting was used for GT data because: (1) the limited sample size precluded meaningful temporal splits, and (2) the primary objective was to develop a quality assessment tool to evaluate APC system performance and understand error patterns rather than to systematically correct the full APC dataset.
Input feature space optimisation
This subsection describes the systematic approach to identify the most informative features for predictive modelling through dimensionality reduction and relevance analysis.
Feature space optimisation is a form of dimensionality reduction that uses data and model analysis techniques to identify clearly irrelevant or redundant features prior to training on the complete data. The preprocessed APC data contained 47 features, of which 39 are input features and the remaining 8 are a combination of passenger counts and operational target features. The following steps were carried out to optimise the input feature space according to the properties of these features:
1.
High Cardinality: High-cardinality features provide less signal for distinguishing between samples and hinder a model’s ability to generalise to unseen data. In addition, the introduction of new categories during deployment can further degrade performance. Therefore, two high-cardinality features—Date and dateTrip—were eliminated, leaving 37 input features.
2.
Out-of-Domain Robustness: To predict future passenger counts, the model must generalise from older data to newer data. This requires that the input features—and their categories or distributions—remain stable over time. To identify features likely to cause generalisation issues, we trained an auxiliary diagnostic model to detect temporal distribution shift. Specifically, the input data was temporally split 75:25 into training and validation subsets as in Section 3.3.1. A dummy target variable was created to label samples as either training (0) or validation (1). The two subsets were then recombined and shuffled, and a Random Forest regression model (256 estimators, default settings otherwise) was trained to predict this label. The Gini (impurity-based) importance was calculated for the input features, representing the normalised reduction in mean squared error (MSE) due to each feature in this auxiliary prediction task. Under temporal stationarity, all features should have low and roughly equal importance, as the model should struggle to distinguish between older and newer samples. Features with unusually high importance in this context suggest underlying temporal shifts in their categories or distributions, making them less reliable for the primary prediction task. The Year and Vehicle features consistently showed high importance across iterative runs and were subsequently removed, yielding a final set of 35 input features.
3.
Predictive Importance: To remove features irrelevant to the prediction task, 10% of the preprocessed APC data (2,906,946 data points) was randomly sampled without replacement and temporally split 75:25 into training and validation subsets as in Section 3.3.1. This data was used to train a Random Forest regression model with 35 input features and 2 target features (Boarding and Alighting). After training, impurity-based feature importances were computed for the input features. To address the known bias of impurity-based measures towards high-cardinality features, an additional importance calculation was performed using Tree SHAP, which leverages the tree structure to provide more accurate attributions. Tree SHAP assigns feature importance based on the weighted sum of changes in the model’s output when features are added or removed, averaged across all possible feature combinations. The tree-based importances were derived by calculating the mean absolute SHAP values for each feature, then normalising across all features. For both impurity-based and tree-based methods, features with importance values exceeding 1.0% for both target variables were selected, and their union was taken as the final set of important features. The feature importance plots in Figure 12 demonstrate significant agreement between both methods in identifying features that are clearly irrelevant to the prediction task. This process eliminated 15 features, leaving 20 input features for the next and final step.
4.
Multicollinearity: The multicollinearity check identifies input features that are partially or fully redundant, meaning a model could substitute one feature for another without performance loss. Such redundancy increases model complexity and reduces interpretability. To detect redundancy, rank correlation was calculated by replacing feature values with their ranks within each column and then computing correlations. These correlations were converted into a distance matrix and used to perform hierarchical clustering of the features. The clustering revealed that TripScheduledDeparture, TripScheduledArrival, StopScheduledArrival, StopScheduledDeparture, and TripDepartureHour were highly correlated. Among these, only StopScheduledArrival was retained, while the other 4 were eliminated.
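Step 2 above, sometimes called adversarial validation, can be sketched as follows; the synthetic drifting feature and reduced tree count are illustrative.

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor

def temporal_shift_importance(X, split_idx, n_estimators=256, seed=0):
    """Train a model to predict whether a sample comes from the older (0)
    or newer (1) part of the data; features with unusually high importance
    are suspected of temporal distribution shift."""
    is_validation = np.zeros(len(X))
    is_validation[split_idx:] = 1.0          # dummy target: train=0, validation=1
    rng = np.random.default_rng(seed)
    order = rng.permutation(len(X))          # recombine and shuffle
    model = RandomForestRegressor(n_estimators=n_estimators, random_state=seed)
    model.fit(X[order], is_validation[order])
    return model.feature_importances_        # Gini (impurity-based) importance

# A drifting feature (column 1) should dominate the importances.
n = 1000
X = np.random.RandomState(0).rand(n, 3)
X[750:, 1] += 5.0                            # distribution shift after the split
imp = temporal_shift_importance(X, split_idx=750, n_estimators=64)
```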
Fig. 12
Comparison of impurity-based and Tree SHAP feature importance measures for identifying relevant predictive features
The feature space optimisation process resulted in a final set of 16 features for the predictive task, which can be broadly grouped into the following categories:
Trip-level features: Line, BusType, FLAG_TripDirection, and LastStopSequence.
Stop-level features: StopSequence, StopScheduledArrival, StopName, StopIdentifier, Longitude, and Latitude.
Transfer features: StopType and TransferStop.
Temporal features: WeekNumber, Month, DayType, and FLAG_Workday.
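The multicollinearity check in step 4 can be sketched with rank correlations and hierarchical clustering; the column names and clustering threshold below are illustrative assumptions.

```python
import numpy as np
import pandas as pd
from scipy.cluster import hierarchy
from scipy.spatial.distance import squareform

def redundancy_clusters(X: pd.DataFrame, threshold=0.2):
    """Cluster features by rank-correlation distance; features sharing
    a cluster are candidates for elimination as redundant."""
    corr = X.rank().corr().to_numpy()       # Spearman-style rank correlation
    dist = 1.0 - np.abs(corr)               # correlation -> distance
    np.fill_diagonal(dist, 0.0)
    linkage = hierarchy.linkage(squareform(dist, checks=False), method="average")
    labels = hierarchy.fcluster(linkage, t=threshold, criterion="distance")
    return dict(zip(X.columns, labels))

rng = np.random.default_rng(1)
a = rng.random(500)
X = pd.DataFrame({
    "dep_sched": a,            # two near-identical schedule features
    "arr_sched": a + 0.01,
    "weekday": rng.random(500) # an unrelated feature
})
clusters = redundancy_clusters(X)  # the two schedule features share a cluster
```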
Model training pipeline
The model training pipeline was implemented to address two distinct but related tasks: (1) developing models using ground truth data to assess APC system performance and understand error patterns, and (2) creating predictive models for passenger counts using the full APC dataset. Both tasks employed identical machine learning algorithms but with different data preparation and validation strategies appropriate to their respective objectives.
Ground Truth Model Training
The input features for ground truth modelling were constrained to those available across both GT and APC datasets: Line, StopSequence, boardingAPC, and alightingAPC, with the targets being the corresponding GT Boarding and Alighting counts. As discussed in Section 3.3.1, a random 80:20 split was employed, yielding 944 training and 236 validation samples.
The GT data was modelled using five algorithms: Tabular DNN, CatBoost, Random Forest, XGBoost, and LightGBM. For the DNN model, categorical features (Line and StopSequence) were ordinal-encoded while continuous features (boardingAPC and alightingAPC) were standardised to zero mean and unit variance.
Hyperparameter optimisation was conducted for each algorithm to identify optimal configurations, with detailed search spaces and optimal values provided in Table 11 of the appendix. The primary objective of these models is to evaluate APC system reliability and identify error patterns that inform system understanding rather than enable systematic data correction.
APC Predictive Model Training
The APC predictive modelling task uses the complete dataset of 29,069,459 data points with 16 input features (14 categorical and 2 continuous) identified through the feature space optimisation process described in Section 3.3.2. Of the 8 original target features (detailed in Table 9), two complementary pairs were modelled: Boarding and Alighting from the APC group, and StopActualArrival and StopTime from the operational group. These pairs were selected because the remaining target features can be derived from them, providing comprehensive coverage while avoiding redundancy.
The temporal splitting strategy described in Section 3.3.1 yielded 21,802,095 training samples and 7,267,364 validation samples. The same five algorithms were employed as with the GT modelling: Tabular DNN, CatBoost, Random Forest, XGBoost, and LightGBM. Note that while all algorithms were evaluated on the APC targets, only the top-performing algorithm was evaluated on the operational targets. In addition, subsequent analyses, interpretations and ablation studies focus on the APC targets, as they are the primary focus of this work. The operational targets are included for completeness and to provide additional context for the APC data.
Due to limited multi-output support in tree-based algorithms, each target pair required separate modelling with custom meta-models that stacked two base models for each target. This approach ensured consistent training and interpretability pipelines across all algorithms while avoiding implementation-specific limitations. For the DNN model, categorical features were ordinal-encoded before embedding while continuous features were standardised. All models were trained using MSE as the objective function, with root mean square error (RMSE) and \(R^2\) serving as evaluation metrics (detailed in Section 2.2).
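The per-target meta-model can be sketched as a thin wrapper that stacks one single-output regressor per target; this is a simplified sketch, not the authors' implementation, and the base estimator is illustrative.

```python
import numpy as np
from sklearn.base import BaseEstimator, RegressorMixin, clone
from sklearn.ensemble import GradientBoostingRegressor

class TwoTargetMetaModel(BaseEstimator, RegressorMixin):
    """Stacks one base regressor per target (e.g. Boarding and Alighting)
    to emulate multi-output regression with single-output learners."""
    def __init__(self, base_estimator):
        self.base_estimator = base_estimator

    def fit(self, X, Y):
        # One cloned base model per target column.
        self.models_ = [clone(self.base_estimator).fit(X, Y[:, j])
                        for j in range(Y.shape[1])]
        return self

    def predict(self, X):
        return np.column_stack([m.predict(X) for m in self.models_])

X = np.random.RandomState(0).rand(300, 4)
Y = np.column_stack([X[:, 0] * 3, X[:, 1] + X[:, 2]])
model = TwoTargetMetaModel(GradientBoostingRegressor(random_state=0)).fit(X, Y)
pred = model.predict(X)
```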
Prior to full-scale training, hyperparameter optimisation was conducted using a representative 0.1% random sample of the complete dataset, also split 75:25 for training and validation. This approach ensured computational efficiency while maintaining representativeness. Identical hyperparameter optimisation data was used across all algorithms to ensure fair comparison. Complete hyperparameter search spaces and optimal configurations are documented in Table 12 of the appendix.
Computational Infrastructure
All computational tasks were conducted on a Linux High Performance Computing (HPC) cluster running Rocky Linux 9.4 with 2.60GHz Intel Xeon Gold 6348 CPUs. Within a Python 3.10.12 environment, open-source ML packages were employed: FastAI (Howard and Gugger 2020) for Tabular DNN, Scikit-learn (Pedregosa et al. 2011) for Random Forest, XGBoost (Chen and Guestrin 2016), CatBoost (Dorogush et al. 2018), and LightGBM (Ke et al. 2017) for their respective algorithms. The Weights & Biases (Biewald 2020) platform facilitated hyperparameter optimisation, while Scikit-learn and SHAP (Lundberg and Lee 2017) packages supported model interpretation tasks.
External factor integration
Beyond the core features of the APC dataset, this section describes the integration of external environmental and contextual factors to improve the performance and interpretability of predictive modelling.
In addition to the default features present in the raw dataset, engineering additional non-mobility explanatory features into the preprocessed data can improve the performance of downstream analysis and modelling tasks. The efficacy and significance of the engineered features, particularly regarding modelling performance, can subsequently be assessed with a variety of model interpretation techniques from the ML domain. The remainder of this section details the feature engineering that was carried out:
Transfers
The number of bus lines that intersect at a particular stop can be useful in determining the importance of the stop and the volume of passengers it handles, as these locations act as hubs for transfers between lines that serve different parts of the city. Two new features were engineered as follows.
1. TransferStop, which represents the count of bus lines that overlap at each stop minus 1, indicating the number of alternative lines a passenger can access by alighting at that stop. In the dataset, the values range from 0 to a maximum of 4.
2. StopType, which classifies stops as “Ordinary” or “Transfer” depending on whether they provide access to alternative lines.
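The two transfer features above can be derived from a stop-to-line mapping along these lines; the column names are assumptions for illustration, not the paper's schema:

```python
# Illustrative derivation of TransferStop and StopType from stop/line records.
import pandas as pd

df = pd.DataFrame({
    "StopIdentifier": ["A", "A", "A", "B", "C", "C"],
    "Line": [1, 2, 3, 4, 5, 5],
})

# Count distinct lines serving each stop; TransferStop is the number of
# alternative lines reachable by alighting there (lines minus one).
lines_per_stop = df.groupby("StopIdentifier")["Line"].nunique()
df["TransferStop"] = df["StopIdentifier"].map(lines_per_stop) - 1

# StopType distinguishes transfer hubs from ordinary stops.
df["StopType"] = df["TransferStop"].apply(
    lambda n: "Transfer" if n > 0 else "Ordinary"
)
```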
Weather
Weather conditions can significantly influence individual travel mode choices, particularly the decision to use public transit versus private vehicles or active mobility. This is especially pertinent in the Norwegian context, with its extreme weather variations throughout the year. Weather conditions can also cause operational disruptions, leading to delays and cancellations (Tian et al. 2024; Alam et al. 2021; Ngo and Bashar 2024) for safety reasons, which, in turn, shape mobility decisions and mode choices. In extreme situations, people may opt for private vehicles or cancel trips unless necessary. Wei (2022) found that the impact of the weather on individual users of public transport varies by time period and passenger type. In general, weather has a stronger influence on passenger stickiness during midday off-peak hours compared to morning or evening peak periods.
Historical weather data obtained from Visual Crossing Corporation (2024) for each date in the dataset was incorporated to assess the impact of weather conditions on public transit usage. The data was acquired using the Visual Crossing web-based query builder with the following parameters: “Location” (Trondheim, Norway), “Date” range (2019-01-17 to 2024-02-18), “Unit group” (metric), “Output format” (Excel), and “Output section” (Daily). Due to the daily limit of 1000 records for free accounts, the 1859-day period was split into three separate queries and subsequently merged. With some minimal preprocessing and multicollinearity checks to eliminate irrelevant and redundant features, the preprocessed data contained 24 features, detailed in Table 8.
To determine the effect of weather on modelling the public transit dynamics, two approaches were combined:
1. Individual Inclusion, to directly isolate the impact of each weather feature. Each feature was iteratively added to the optimised input features and used to retrain an XGBoost model with the optimal hyperparameters. The effect of the feature on the model’s RMSE and \(R^2\) metrics was then evaluated.
2. Combined Training, to evaluate how the weather features work in the broader context of all the features. In this case, all weather features were combined with the optimised input features. In addition to performance metrics, SHAP values were calculated and contributions of individual weather features were isolated.
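The individual-inclusion loop can be sketched generically as below; `fit_predict` is a placeholder standing in for retraining the tuned XGBoost model on a given column subset and returning validation targets and predictions:

```python
# Sketch of the "individual inclusion" evaluation: add one weather feature at
# a time to the optimised inputs, retrain, and record RMSE and R².
import numpy as np

def rmse(y_true, y_pred):
    return float(np.sqrt(np.mean((y_true - y_pred) ** 2)))

def r2(y_true, y_pred):
    ss_res = np.sum((y_true - y_pred) ** 2)
    ss_tot = np.sum((y_true - np.mean(y_true)) ** 2)
    return float(1.0 - ss_res / ss_tot)

def individual_inclusion(base_cols, weather_cols, fit_predict):
    """fit_predict(cols) -> (y_val, y_hat): retrains the model on the given
    column subset and returns validation targets and predictions."""
    scores = {}
    for w in weather_cols:
        y_val, y_hat = fit_predict(base_cols + [w])
        scores[w] = {"rmse": rmse(y_val, y_hat), "r2": r2(y_val, y_hat)}
    return scores
```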
Terrain
In addition to weather, the local terrain can also impact an individual’s choice of mobility modes. Using the coordinates of the bus stop locations, a set of terrain-related features pertaining to the distance and elevation changes between consecutive stops on a route was calculated. Given two consecutive bus stops \(\text {stop}_i\) and \(\text {stop}_{i+1}\) with longitude, latitude, and altitude coordinates \((\lambda _i, \phi _i, h_i)\) and \((\lambda _{i+1}, \phi _{i+1}, h_{i+1})\), four terrain-based features were engineered as follows:
1. Inter-Stop Distance (diffDistance), which measures the straight-line (Euclidean) horizontal distance between consecutive stops in metres, where \((X_i, Y_i)\) and \((X_{i+1}, Y_{i+1})\) are the UTM-projected coordinates of the stops. This simplified method does not take the actual road network into account.
$$ \text {diffDistance}_i = \sqrt{(X_{i+1} - X_i)^2 + (Y_{i+1} - Y_i)^2} $$
2. Elevation Change (diffElevation), which measures the vertical change in altitude between consecutive stops in metres, with a positive value indicating an increase in altitude, and vice versa.
$$ \text {diffElevation}_i = h_{i+1} - h_i $$
3. Terrain Type (typeElevation), which classifies the terrain between stops as uphill, downhill, or flat based on a threshold of 3 metres.
4. Inter-Stop Slope (slopeElevation), which represents the absolute slope between consecutive stops, in degrees. Higher values indicate steeper inclines or declines.
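The four terrain features can be computed for a pair of consecutive stops as sketched below; the function name and the exact slope formula (angle of the elevation change over the straight-line distance) are this sketch's assumptions:

```python
# Sketch of the four terrain features for two consecutive stops, using
# UTM-projected coordinates (X, Y) and altitude h, all in metres.
import math

def terrain_features(x1, y1, h1, x2, y2, h2, flat_threshold=3.0):
    diff_distance = math.hypot(x2 - x1, y2 - y1)   # straight-line metres
    diff_elevation = h2 - h1                       # positive = gain in altitude
    if diff_elevation > flat_threshold:
        type_elevation = "uphill"
    elif diff_elevation < -flat_threshold:
        type_elevation = "downhill"
    else:
        type_elevation = "flat"
    # Absolute slope between the stops, in degrees.
    slope = (math.degrees(math.atan2(abs(diff_elevation), diff_distance))
             if diff_distance else 0.0)
    return diff_distance, diff_elevation, type_elevation, slope
```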
Fig. 13
Distribution of terrain-based features across the APC data. Note: Different x-axis scales are used due to varying feature ranges (Distance: 0–3697.86, Elevation: -83.51–89.11, Slope: 0–10.59). Cross-subplot comparisons should focus on distributional shapes rather than absolute values
Figure 13 presents an overview of the distribution of these features between trips and routes in the data, illustrating how the local terrain may interact with the design of the public transit system to influence travel patterns and mode choices.
Urban demographics and land use
Kim and Li (2021) found that the availability of high-quality transit—buses running every 15 minutes or less during peak hours—is strongly linked to urban densification, while the urban structure in turn shapes the transit dynamics. Similarly, Yang et al. (2023) found that a dense mixed-use environment with strong multimodal mobility coverage significantly encourages public transit use. These effects also vary between population groups according to age, health, income level, and household size (Zhang et al. 2025; Ma et al. 2024). In their analysis of subway stations in Greater London, Verma et al. (2021) identified different types of station that reflect different patterns of ridership for work, services, leisure, and mixed uses. They found a stronger correlation between station ridership and the local population for outer-residential, inner-residential, and polycentric clusters.
Using official statistics from Statistics Norway (Statbank Norway 2024a, b, c, 2025), five features capturing key aspects of the built environment and population distribution around bus stops in Trondheim were engineered. These include population density (Population), number of buildings (Building) and dwellings (Dwellings), as well as the number of establishments (Establishments) and their employees (Employees). These features provide insights into the spatial distribution of urban activity in relation to the public transit system.
The yearly demographic and land use statistics were mapped to the APC data by associating bus stop coordinates with the statistical grids covering Norway. Each grid cell was populated with its yearly statistical values for these five features, reflecting urban activity levels, and reduced to its centroid for spatial queries. Given a bus stop \(\text {stop}_i\) at \((X_i, Y_i)\), the goal was to identify all grid centroids \((X_j, Y_j)\) within a 250-metre radius:
$$ \sqrt{(X_i - X_j)^2 + (Y_i - Y_j)^2} \le 250 $$
where \((X, Y)\) are UTM-projected coordinates in metres. The 250-metre threshold was selected as half the average distance between stops in the dataset, \(\approx 224\) metres (see Figure 13), with some additional allowance for overlap.
Fig. 14
Distribution of demographics and land use characteristics within 250 metres of bus stops across the dataset. Note: Different x-axis scales are used due to varying feature ranges (Buildings: 0–146, Dwellings: 0–690, Establishments: 0–396, Employees: 0–2666, Population: 0–950). Cross-subplot comparisons should focus on distributional shapes rather than absolute values
For each unique stop in the APC dataset, the relevant statistical values for the corresponding year were extracted from all nearby grids within this radius and averaged across the selected grids. If no grid centroids were located within 250 metres, the features were set to zero. The newly computed land use and population features were then merged with the APC dataset, ensuring temporal alignment by matching each stop entry to the appropriate statistics for its corresponding year. Figure 14 provides an overview of the distribution of these features across bus stops in the data, showing how the local demographic and land use characteristics potentially influence the passenger counts at nearby stops.
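The centroid query and averaging step can be sketched as follows; the array layout and function name are illustrative simplifications of the described procedure:

```python
# Sketch of mapping grid-cell statistics to stops: average the values of all
# grid centroids within 250 m of a stop, falling back to zeros when none are
# near. Coordinates are UTM metres.
import numpy as np

def stop_statistics(stop_xy, centroids_xy, centroid_values, radius=250.0):
    """centroids_xy: (n, 2) array of grid centroids; centroid_values: (n, k)
    array of the k yearly statistics (here Population, Building, Dwellings,
    Establishments, Employees) for the matching year."""
    d = np.linalg.norm(centroids_xy - np.asarray(stop_xy), axis=1)
    near = d <= radius
    if not near.any():
        # No grid centroid within the radius: features are set to zero.
        return np.zeros(centroid_values.shape[1])
    return centroid_values[near].mean(axis=0)
```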
Results and discussion
This section describes the results of the predictive modelling and interpretation experiments carried out following the methodology described in Section 3. Note that in this section, the terms “algorithm” and “model” are used interchangeably to describe a trained model and its underlying machine learning (ML) algorithm.
Ground truth data validation
Performance evaluation
Table 2 shows the performance of the five algorithms in correcting errors in the Automated Passenger Counting (APC) data when provided with the ground truth (GT) data targets. In addition, an ensemble of all trained models was created, taking the average of the predictions from each. The Tabular DNN algorithm gave the best performance while Random Forest performed the worst. However, the ensemble of all models performed better than any individual model.
Interestingly, when an exhaustive combination of all trained models was performed, an ensemble of the Tabular DNN and CatBoost models performed best overall. The \(R^2\) and RMSE for each model help explain this: ensembles of models trained with different algorithms work well because each algorithm makes different kinds of errors. The Tabular DNN was better at modelling alighting counts while CatBoost was better at boarding counts. Since their errors are not correlated, an ensemble of both models performs better. However, adding Random Forest, XGBoost and LightGBM—tree-based algorithms like CatBoost but with worse performance—dilutes the information from the Tabular DNN/CatBoost ensemble and leads to worse performance.
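The exhaustive combinatorial averaging can be sketched as a search over all model subsets; the function below is illustrative and uses placeholder prediction arrays rather than the paper's models:

```python
# Sketch of the exhaustive averaging search that surfaced the
# Tabular DNN + CatBoost pair as the best ensemble.
import itertools
import numpy as np

def best_average_ensemble(predictions, y_true):
    """predictions: dict name -> validation predictions (same shape as y_true).
    Returns the subset of models whose plain average minimises RMSE."""
    best_subset, best_rmse = None, float("inf")
    names = list(predictions)
    for r in range(1, len(names) + 1):
        for subset in itertools.combinations(names, r):
            avg = np.mean([predictions[n] for n in subset], axis=0)
            rmse = float(np.sqrt(np.mean((y_true - avg) ** 2)))
            if rmse < best_rmse:
                best_subset, best_rmse = subset, rmse
    return best_subset, best_rmse
```

With five models this searches only \(2^5 - 1 = 31\) subsets, so exhaustive enumeration is cheap.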
Table 2
Machine learning algorithm performance on ground truth data for APC target correction
| Algorithm | \(R^2\) | RMSE | RMSE (Overall) |
|---|---|---|---|
| APC Data | [0.8309, 0.6548] | [2.4164, 3.5451] | 3.0337 |
| Random Forest | [0.8868, 0.7081] | [1.9772, 3.2599] | 2.6959 |
| LightGBM | [0.8788, 0.7233] | [2.0459, 3.1737] | 2.6700 |
| XGBoost | [0.8876, 0.7264] | [1.9699, 3.1559] | 2.6306 |
| CatBoost | [0.9034, 0.7253] | [1.8266, 3.1623] | 2.5823 |
| Tabular DNN | [0.8963, 0.7404] | [1.8921, 3.0739] | 2.5524 |
| Ensemble [All Algorithms] | [0.9042, 0.7471] | [1.8186, 3.0341] | 2.5013 |
| Ensemble [Tabular DNN & CatBoost] | **[0.9068, 0.7571]** | **[1.7941, 2.9739]** | **2.4559** |
Metrics are reported as [boarding, alighting] pairs, with the best-performing models highlighted in bold
From \(R^2\) and RMSE, all models struggle with Alighting compared to Boarding. This performance difference can be attributed to the relatively higher errors present in the APC data for alighting counts, shown in Figures 9 and 10. The higher errors can be linked to issues with bus door operations discussed earlier, which the video analysis showed to affect alighting counts more. This also explains the relatively low alighting \(R^2\) scores, as the models struggle to explain the variance in the GT data from the input features alone.
Model interpretation
Figure 15 illustrates the influence of input features on target predictions across all five algorithms using the model-agnostic PermutationExplainer method, which assesses the effect of randomly shuffling each feature on the model’s performance, revealing the extent to which the model depends on that feature for its predictions.
The results show that boardingAPC dominates the boarding predictions in all algorithms, with an average SHAP value of 2.69, confirming that the APC boarding counts correlate well with the GT boarding counts. For alighting predictions, alightingAPC shows even stronger dominance with SHAP values of 3.21 on average, suggesting a tighter (although noisier) mapping between APC and GT alighting data. Line shows minimal impact on all models and targets, while StopSequence shows moderate importance, particularly for alighting predictions, likely capturing stop-specific passenger behaviour patterns that influence counting accuracy.
Fig. 15
Feature importance for ground truth modelling derived from SHAP values across five machine learning algorithms. Mean absolute values show the impact magnitude of each feature on boarding (left) and alighting (right) predictions
Cross-target analysis in Figure 15 reveals that the original APC boarding and alighting counts are used by the algorithms to predict the correct GT counts, although to varying degrees. This shows that there is a weak but asymmetric correlation between the Boarding and Alighting targets such that the presence of the boardingAPC feature has more impact on predicting the Alighting targets than alightingAPC does on Boarding. Specifically, boardingAPC shows SHAP values around 0.12–0.49 for alighting predictions, while alightingAPC has minimal influence on boarding predictions with a maximum SHAP value of 0.24, compared to the dominant direct relationships.
These observations are confirmed by the partial dependence and individual conditional expectation plots in Figure 16 obtained from the Random Forest model. When predicting the Alighting counts for some samples, the conditional expectations for StopSequence and boardingAPC vary widely compared to the relatively stable patterns observed for Boarding predictions. In contrast, when predicting Boarding, the individual conditional expectations for all input features are relatively flat and follow the same trend as the average.
The flatness and variations of the individual conditional expectations for the Boarding and Alighting targets also explains the differences in their \(R^2\) and RMSE evaluation metrics, with the higher variability in conditional expectations for alighting predictions reflecting the greater complexity and uncertainty in predicting alighting patterns compared to boarding.
Fig. 16
Partial dependence and individual conditional expectation analysis for ground truth predictions. Random Forest model results show boarding (top) and alighting (bottom) patterns with varying prediction complexity
Data quality assessment and correction limitations
In conclusion, most of the errors in the APC data fall within the ±3 and ±6 range, up to the 95th percentile, for boarding and alighting counts, respectively. Outliers occur in situations with extremely high passenger counts, which are rare to begin with and therefore have minimal impact on subsequent analysis of the public transit dynamics.
The modelling results demonstrate that the (ensemble of) models were effective in reducing prediction errors and improving alignment between APC and GT data, especially for alighting counts. However, these models were not applied for systematic correction of the full APC dataset. The limited availability of GT data—comprising only 41 trips with 1180 data points relative to the 29M+ entries in the full APC dataset—makes network-wide correction impractical and statistically unreliable.
This limitation reflects the broader challenges in the validation of public transit data highlighted in the Introduction. Ground truth data collection is resource-intensive, requiring manual observation or video analysis under operational constraints, and operators typically lack incentives for comprehensive validation beyond sporadic quality checks. In addition, data protection regulations and proprietary constraints embedded in data-sharing agreements further restrict the acquisition and utilisation of validation datasets at the scale required for systematic correction.
Given these practical constraints, GT modelling serves primarily as a quality assessment tool rather than a correction mechanism. The analysis confirms that APC data errors are generally within acceptable bounds for aggregate analysis and predictive modelling, while identified error patterns provide insight into system reliability that inform feature engineering and model interpretation rather than direct data correction.
Predictive modelling
This subsection presents the predictive modelling results for APC target variables and operational parameters, followed by detailed analysis of model performance, statistical comparisons, ensemble strategies, and behavioural characteristics.
Performance evaluation
Table 3 shows the predictive performance of the ML algorithms on the Boarding and Alighting APC targets in ascending order. Although the performance metrics for all models are close, the tree-based models all outperform the Tabular Deep Neural Network (DNN) on this task, with XGBoost performing the best. In addition, all models have slightly better results with Boarding counts compared to Alighting. The target correlations revealed in the GT modelling suggest that multi-task learning could help address this. Hu et al. (2024) effectively used a multi-gate Mixture-of-Experts approach to capture the relationship between entry and exit subway passenger flows.
Unlike the tree-based models, the Tabular DNN model appears to struggle with the extreme bias in the input space towards categorical features (and less continuous ones) in its feature embeddings. As noted in Zhang et al. (2024), vanilla embedding layers cannot accurately capture the true spatiotemporal dependencies of mobility flows, and explicit, specialised embedding approaches are required to address this issue within deep learning (DL) architectures.
The inverse relationship between \(R^2\) and RMSE metrics is also observable here. The \(R^2\) scores of all models remain low despite the preliminary feature space and hyperparameter optimisations. With the input features explaining only about half of the variance in the target features, there is significant room for improvement. Further feature engineering and augmentation, as explored in Section 3.4, could enhance the explainability and predictive performance of APC target features.
Table 3
Machine learning algorithm performance for APC target prediction
| Algorithm | \(R^2\) | RMSE | RMSE (Overall) |
|---|---|---|---|
| Tabular DNN | [0.4748, 0.4195] | [2.9158, 3.0247] | 2.9888 |
| LightGBM | [0.4639, 0.4524] | [2.9459, 2.9377] | 2.9418 |
| Random Forest | [0.4907, 0.4473] | [2.8713, 2.9514] | 2.9116 |
| CatBoost | [0.4994, 0.4619] | [2.8466, 2.9121] | 2.8795 |
| XGBoost | **[0.5187, 0.4952]** | **[2.7913, 2.8207]** | **2.8247** |
Metrics are reported as [boarding, alighting] pairs, with the best-performing models highlighted in bold
Using the best-performing XGBoost algorithm with its optimal hyperparameters, another model was trained on the StopActualArrival and StopTime operational targets to evaluate the effectiveness of the feature space and hyperparameter optimisations for other predictive tasks. Note that StopActualArrival is measured in minutes since midnight, while StopTime is measured in seconds. Table 4 presents the predictive performance of the XGBoost model for this task, demonstrating that it captures nearly 100% of the variance in the target features despite the feature space and hyperparameter optimisations not being tailored to these specific operational targets. These results confirm the effectiveness of the optimisations and highlight the need for an expanded feature space to better account for the variance in APC target features.
Table 4
XGBoost algorithm performance for operational target prediction
| Target | \(R^2\) | RMSE |
|---|---|---|
| StopActualArrival (minutes) | 0.9995 | 11.3748 |
| StopTime (seconds) | 0.9963 | 26.3877 |
Results demonstrate effective capture of temporal patterns for arrival and dwell time prediction
Model selection and statistical significance
To assess the comparative performance of the ML algorithms, we used the two temporally structured cross-validation strategies described in Section 3.3.1: seasonal and hybrid cross-validation approaches. These regimes were selected to assess both year-over-year generalisation and seasonal pattern stability, while also allowing us to evaluate the robustness of the algorithms to distributional shifts in the data, such as those caused by the COVID-19 pandemic. In each setting, we trained all five algorithms and recorded RMSE and \(R^2\) in three temporally disjoint validation folds.
To statistically compare algorithm performance across folds, we used the non-parametric Friedman test, which evaluates differences in central tendency without assuming normality. For both RMSE and \(R^2\), the XGBoost algorithm consistently achieved the best average performance across folds, as detailed in Table 5. However, the Friedman test did not detect statistically significant differences between all algorithms at the \(p < 0.05\) level in either configuration (e.g. \(p = 0.0823\) for RMSE in the hybrid setting, \(p = 0.0716\) for \(R^2\) in the seasonal setting), indicating that while XGBoost performed best on average, the evidence for superiority was not conclusive given the number of folds. Since the Friedman test did not reach statistical significance, post hoc pairwise comparisons using the Nemenyi test were not performed.
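The fold-wise comparison can be sketched with SciPy's Friedman test; the per-fold RMSE scores below are made-up placeholders, not the paper's results:

```python
# Sketch of the non-parametric Friedman test over per-fold RMSE scores for
# the five algorithms (one list per algorithm, one entry per validation fold).
from scipy.stats import friedmanchisquare

rmse_per_fold = {
    "XGBoost":       [2.41, 2.39, 2.46],  # placeholder values
    "CatBoost":      [2.45, 2.44, 2.49],
    "LightGBM":      [2.47, 2.46, 2.50],
    "Random Forest": [2.46, 2.45, 2.50],
    "Tabular DNN":   [2.49, 2.48, 2.51],
}

stat, p_value = friedmanchisquare(*rmse_per_fold.values())
# With only three folds, statistical power is limited, so a non-significant
# result does not rule out genuine performance differences.
significant = p_value < 0.05
```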
Further inspection of the third fold in the hybrid setup revealed a marked degradation in algorithm performance. This fold spanned both the pandemic and post-pandemic periods, suggesting a structural regime shift in the underlying data distribution. To account for this, we introduced a binary isCOVID feature to distinguish between pre- and post-COVID samples. When re-evaluated with this regime-aware feature, XGBoost’s performance in the affected fold improved substantially—RMSE dropped from 2.85 to 2.59 and \(R^2\) increased from 0.50 to 0.58—while performance in the other folds remained stable. This indicates that the algorithm benefited from explicit disambiguation of temporal regimes, likely due to the presence of shifting behavioural patterns over time.
Together, these results show that while no algorithm was statistically superior across all validation folds, XGBoost performed consistently well under both evaluation regimes and was robust to distributional shifts when provided with regime-aware features. The effectiveness of the isCOVID feature highlights a limitation in the feature space optimisation: dropping the Year feature for temporal generalisation removed the ability to distinguish temporal regimes, suggesting future optimisations should balance generalisation with regime-shift disambiguation.
Table 5
Cross-validation performance comparison across machine learning algorithms for APC targets
| Algorithm | Hybrid RMSE | Hybrid \(R^2\) | Seasonal RMSE | Seasonal \(R^2\) |
|---|---|---|---|---|
| XGBoost | 2.4214 | 0.4935 | 2.3294 | 0.5245 |
| CatBoost | 2.4592 | 0.4780 | 2.3703 | 0.5083 |
| LightGBM | 2.4777 | 0.4717 | 2.3546 | 0.5148 |
| Random Forest | 2.4710 | 0.4734 | 2.3741 | 0.5070 |
| Tabular DNN | 2.4925 | 0.4659 | 2.4233 | 0.4905 |
| Friedman \(p\) | 0.0823 | 0.0823 | 0.0823 | 0.0716 |
Average RMSE and \(R^2\) scores from three validation folds under hybrid and seasonal regimes, with Friedman test p-values indicating statistical significance of performance differences
Ensemble strategies and optimisation
Despite the theoretical advantages of ensemble learning, our experiments showed that naive averaging methods did not outperform the best single model, XGBoost (Table 6). An exhaustive combinatorial averaging of all models identified the pair of CatBoost and XGBoost as the best combination, which achieved only marginal improvements over XGBoost alone.
To explore whether more sophisticated ensemble techniques could produce improvements, we implemented constrained least squares optimisation and ridge regression, as detailed in Section 2. The least squares optimisation aimed to learn optimal weights under non-negativity and sum-to-one constraints. However, this approach performed worse than simple averaging, with the optimisation heavily favouring XGBoost and CatBoost while excluding other models. The rigid constraints appeared to limit the generalisability.
In contrast, ridge regression as a stacking method achieved substantial improvements, reducing overall RMSE from 2.8247 (XGBoost alone) to 2.6379 and increasing \(R^2\) scores to 0.5753 and 0.5531 for boarding and alighting, respectively. Ridge regression’s flexibility—allowing negative weights and no sum-to-one constraints—enabled a more effective model combination. The optimised ensemble predictions can be expressed as linear combinations of the individual model outputs:
$$ \hat{y} = w_{\text {TAB}} \cdot \text {TAB} + w_{\text {CBM}} \cdot \text {CBM} + w_{\text {RF}} \cdot \text {RF} + w_{\text {XGB}} \cdot \text {XGB} + w_{\text {LGB}} \cdot \text {LGB} $$
where TAB = Tabular DNN, CBM = CatBoost, RF = Random Forest, XGB = XGBoost, and LGB = LightGBM, and the weights \(w_{(\cdot )}\) are learned per target without sign or sum-to-one constraints.
The limited performance gains from ensembling reflect limited model diversity: tree-based algorithms share similar inductive biases, resulting in correlated errors that reduce the benefits of the ensemble. Although Tabular DNN provided architectural diversity, its weak standalone performance diminished its contribution. This contrasts with the GT modelling task (Table 2), where a better individual model performance enabled effective ensembling. Ridge regression’s success underscores the importance of flexible weighting schemes in ensemble design, particularly when individual model errors exhibit partial redundancy.
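The ridge stacking step can be sketched in closed form; the function names are illustrative, and an intercept term and validation-based choice of the regularisation strength `alpha` are omitted for brevity:

```python
# Sketch of ridge-regression stacking over base-model validation predictions.
# Unlike the constrained least-squares variant, weights may be negative and
# need not sum to one.
import numpy as np

def ridge_stack_weights(P, y, alpha=1.0):
    """Closed-form ridge solution w = (PᵀP + αI)⁻¹ Pᵀy, where P has shape
    (n_samples, n_models) with one column of predictions per base model."""
    n_models = P.shape[1]
    A = P.T @ P + alpha * np.eye(n_models)
    return np.linalg.solve(A, P.T @ y)

def ridge_stack_predict(P, w):
    # Ensemble prediction as an unconstrained linear combination.
    return P @ w
```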
Table 6
Ensemble strategy performance comparison for APC target prediction. Ridge regression demonstrates superior performance through flexible model weighting without sum-to-one constraints
| Ensemble Strategy | \(R^2\) | RMSE | RMSE (Overall) |
|---|---|---|---|
| Average [All Algorithms] | [0.4989, 0.4652] | [2.8482, 2.9030] | 2.8757 |
| Average [CatBoost & XGBoost] | [0.5115, 0.4814] | [2.8121, 2.8588] | 2.8356 |
| Weighted [Least Squares] | [0.4618, 0.4367] | [2.9516, 2.9795] | 2.9656 |
| Weighted [Ridge Regression] | **[0.5753, 0.5531]** | **[2.6219, 2.6538]** | **2.6379** |
Note: Metrics are reported as [boarding, alighting] pairs, with the best-performing strategy highlighted in bold
Residual analysis and model behaviour
To better understand model behaviour and robustness, we conduct a detailed residual analysis comparing XGBoost (XGB), the best-performing tree-based algorithm, with the Tabular DNN (TAB), which represents the only deep learning algorithm in our evaluation. Figure 17 shows residuals (Observed − Predicted) plotted against predicted counts for both Boarding and Alighting APC targets.
While XGBoost achieves superior performance on standard metrics (RMSE, \(R^2\)), the residual plots reveal distinct behavioural characteristics between the two architectural approaches. The XGBoost model demonstrates increased variance and systematic underprediction at higher count levels, exhibiting more pronounced error fluctuations in extreme ranges. In contrast, Tabular DNN displays smoother and more stable residual patterns across the prediction range, particularly in the distribution tails.
These patterns reflect a classic bias–variance tradeoff between the two approaches. XGBoost optimises for pointwise accuracy but exhibits less predictable behaviour under high-load conditions, while Tabular DNN maintains more consistent error characteristics across the prediction range despite underperforming on aggregate metrics. Such behavioural differences have practical implications for operational deployment, where prediction reliability may be as important as accuracy. The stability of the Tabular DNN might make it a better choice, since the statistical significance tests only show marginal superiority of XGBoost over other algorithms.
To examine potential systematic biases across different operational contexts, Figure 18 visualises the spatial distribution of XGBoost residuals across stops for several spatiotemporal features. The residuals are calculated as the mean absolute residual magnitudes, summed for the Boarding and Alighting targets, and limited to extreme residual magnitudes (greater than 5) to highlight significant deviations.
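The residual aggregation behind that heatmap can be sketched as follows; the column names are assumptions for illustration, not the paper's schema:

```python
# Sketch of per-stop residual aggregation: mean absolute residuals per stop,
# summed over the Boarding and Alighting targets, then filtered to the
# extreme (> 5) magnitudes shown in the heatmap.
import pandas as pd

def extreme_residuals_by_stop(df, threshold=5.0):
    """df columns: StopIdentifier, residBoarding, residAlighting,
    where residuals are observed − predicted counts."""
    per_stop = df.groupby("StopIdentifier").agg(
        boarding=("residBoarding", lambda s: s.abs().mean()),
        alighting=("residAlighting", lambda s: s.abs().mean()),
    )
    per_stop["combined"] = per_stop["boarding"] + per_stop["alighting"]
    # Keep only stops with extreme combined residual magnitudes.
    return per_stop[per_stop["combined"] > threshold]
```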
The analysis reveals no strong or persistent spatial or temporal bias in the model predictions. Although some variation is expected, particularly at terminal stops or high-traffic transfer stops, the model maintains relatively uniform error distributions across most stops throughout the service calendar. The absence of systematic bias across line routes, stop types, and temporal dimensions suggests that the model generalises effectively across diverse operational contexts without exhibiting structural preferences for specific network segments or time periods.
Fig. 17
Model residual analysis comparing XGBoost and Tabular DNN behaviour for APC targets. Residuals (observed − predicted) versus predicted counts for boarding (top) and alighting (bottom) targets, with mean residual trends (black lines) and 95th-percentile confidence intervals (shaded areas)
Fig. 18
Spatial distribution of XGBoost prediction residuals across spatiotemporal features. Heatmap shows mean absolute residual magnitudes for boarding and alighting combined, filtered to extreme deviations (\(>5\)) to highlight any systematic prediction biases
Model interpretation
In addition to the predictive performance of the ML algorithms, it is also useful to examine their internals and how their predictions are generated using explainable machine learning methods, such as the SHAP values introduced in Section 2.2. This section analyses the SHAP values obtained for the predictive modelling task on the APC targets, with the goal of examining the correspondence in the importance placed on the input features across the five ML algorithms and two APC targets. As in Section 4.1, the model-agnostic PermutationExplainer method is used, which assesses the effect of randomly shuffling each feature on the model’s performance, revealing the extent to which the model depends on that feature for its predictions.
A total of 4096 data points were used to compute SHAP values for each algorithm/model, randomly sampled 75:25 from the training and validation subsets, respectively. Training samples were used as a masker to define a baseline distribution, ensuring that feature contributions are assessed relative to realistic input variations. SHAP values were then computed on the validation samples to explain the model’s behaviour under unseen data conditions. Feature importance was determined as the mean absolute SHAP value across the validation samples, providing a robust measure of each feature’s overall contribution to model predictions.
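The underlying idea of permutation-based importance, shuffling one feature at a time and measuring the resulting performance degradation, can be sketched directly; note this is a simplified stand-in for SHAP's PermutationExplainer, which additionally attributes per-sample contributions against a masker baseline:

```python
# NumPy sketch of permutation-based feature importance: break one feature's
# relationship to the target by shuffling it, then measure the error increase.
import numpy as np

def permutation_importance(predict, X_val, y_val, seed=0):
    rng = np.random.default_rng(seed)
    base = np.mean((y_val - predict(X_val)) ** 2)
    importances = np.zeros(X_val.shape[1])
    for j in range(X_val.shape[1]):
        X_perm = X_val.copy()
        rng.shuffle(X_perm[:, j])  # shuffle column j in place
        importances[j] = np.mean((y_val - predict(X_perm)) ** 2) - base
    return importances  # larger error increase = more important feature
```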
Figure 19 shows the mean absolute SHAP values (hereafter “SHAP values”) for each input feature across the algorithms and target features. There is large variation in the importance of input features between models, but relatively minor variation between target features within each model. Although the models agree on which input features are relatively unimportant (such as StopType, Month, and StopScheduledArrival), there is less agreement on which input features are important and to what degree.
Fig. 19
Feature importance comparison across machine learning algorithms and APC targets using SHAP values. Mean absolute values represent the average magnitude of each feature’s impact on model predictions, computed from 4096 validation samples per algorithm
It is also interesting to observe the differences between the Tabular DNN and tree-based models. Tree-based models generally show strong consensus on feature importance, with occasional deviations by both the Random Forest and LightGBM models. However, the Tabular DNN model not only prioritises a different set of input features, but it also appears to be more selective: making use of roughly six input features for its predictions and almost discarding the rest. This mix of consensus and divergence reinforces the earlier discussion that ensembles could outperform XGBoost if the Tabular DNN achieved relatively strong performance. In such a case, its errors would be less correlated with those of the tree-based models, enhancing overall predictive performance.
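The error-decorrelation argument can be made concrete with a toy simulation (illustrative numbers only, unrelated to the study’s models): averaging two equally accurate models whose errors are uncorrelated cuts RMSE by roughly a factor of \(\sqrt{2}\), whereas averaging two strongly correlated models, like the tree-based ensembles here, yields almost no gain.

```python
import numpy as np

rng = np.random.default_rng(42)
n = 100_000
y = rng.normal(size=n)  # ground truth

def rmse(pred):
    return float(np.sqrt(np.mean((pred - y) ** 2)))

e1 = rng.normal(size=n)  # model A errors, unit variance
e2 = rng.normal(size=n)  # model B errors, uncorrelated with A
# model C errors, correlation 0.95 with A (same unit variance)
e3 = 0.95 * e1 + np.sqrt(1 - 0.95 ** 2) * rng.normal(size=n)

solo = rmse(y + e1)                   # single model, RMSE about 1.0
ens_uncorr = rmse(y + (e1 + e2) / 2)  # uncorrelated pair: large gain
ens_corr = rmse(y + (e1 + e3) / 2)    # correlated pair: almost no gain
```

The uncorrelated ensemble lands near \(1/\sqrt{2} \approx 0.71\), while the correlated one stays close to the single-model error, which is why a well-performing Tabular DNN would be the most valuable ensemble partner.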
The top plot in Figure 20 succinctly summarises these differences in the importance rankings of the input features across the ML algorithms, using bands depicting the standard deviation around the mean SHAP value across target features. The difference between the tree-based and neural network-based models is clearly highlighted here. It is also evident that the Tabular DNN and CatBoost models rely on the same features to predict both targets, while the other models show more variation. In addition, certain input features show more variation in importance between targets, with Line and Latitude showing the most pronounced differences. In general, the top seven input features across all models are WeekNumber, Line, DayType, Latitude, StopIdentifier, StopSequence, and FLAG_TripDirection.
Fig. 20
Feature importance rankings across algorithms and data subsets using SHAP values. Standard deviation bands highlight consensus and variation in feature prioritisation across algorithms (top) and temporal/spatial subsets (bottom)
The bottom plot in Figure 20 illustrates variations in feature importance rankings for the same XGBoost model across different data subsets, split by: COVID-19 period (during, after and all), month (January to December), line (1, 3, 10, 11, 12, and 14), and year (2020 to 2023). Using the same optimal hyperparameters, 25 XGBoost models were trained for each subset category, and the SHAP values were calculated.
Since the same model is used, there is strong consensus on the average feature importance, following a similar trend to the parent XGBoost model in the top plot. However, the standard deviation bands provide more insight—the wider the band for a subset, the more the model shifts its reliance on different input features for accurate predictions, highlighting the differences in the spatiotemporal dynamics across subsets. Despite these variations, the top seven input features in all subsets are the same as those of the XGBoost model: WeekNumber, Line, Latitude, DayType, StopIdentifier, BusType, and FLAG_TripDirection.
These differences in spatiotemporal dynamics between data subsets are also reflected in the RMSE (and \(R^2\)) metrics shown in Figure 21. In general, training separate models for different subsets and categories improves performance. The Line and Month subsets, which showed the highest deviation in feature importances (Figure 20), also struggle to consistently outperform the baseline RMSE in all categories. Overall, the significant differences in feature importance rankings and performance metrics across the data subsets and their categories explain the low \(R^2\) scores seen in Table 3, as the models struggle to capture the complex spatiotemporal dynamics throughout the public transit system simultaneously.
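In practice, the per-subset comparison in Figure 21 reduces to grouping held-out residuals by a partition label and computing RMSE within each group; a minimal sketch (the data and labels below are hypothetical):

```python
import numpy as np

def rmse_by_subset(y_true, y_pred, labels):
    """RMSE of predictions within each subset (e.g. line, month, year)."""
    out = {}
    for lab in np.unique(labels):
        mask = labels == lab
        out[lab] = float(np.sqrt(np.mean((y_true[mask] - y_pred[mask]) ** 2)))
    return out

# Toy example: perfect predictions on line L1, errors of 1 and -2 on L3.
y_true = np.array([2.0, 3.0, 5.0, 1.0])
y_pred = np.array([2.0, 3.0, 4.0, 3.0])
lines = np.array(["L1", "L1", "L3", "L3"])
scores = rmse_by_subset(y_true, y_pred, lines)
```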
Fig. 21
XGBoost performance variation across APC data subsets. Distribution of RMSE values averaged over boarding and alighting targets shows model consistency across temporal and spatial partitions
This section presents the results of ablation studies conducted to assess the impact of external, non-mobility factors on the predictive performance of the models. The goal is to understand how these factors influence passenger counts and operational factors in public transit systems.
To evaluate each engineered feature category, two complementary approaches are employed as discussed in Section 3.4.2: (1) Individual Inclusion, which isolates the impact of each feature by adding it separately to the baseline model and measuring root mean square error (RMSE) changes, and (2) Combined Training, which includes all features of a category together and uses (mean absolute) SHapley Additive exPlanations (SHAP) values to assess their relative importance within the full feature set. This dual approach reveals both the standalone predictive value and the contextual importance of engineered features.
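The “individual inclusion” approach can be sketched as a simple ablation loop. The sketch below uses an ordinary least-squares stand-in rather than the study’s tuned XGBoost model, and the feature names are hypothetical; only the loop structure mirrors the described procedure.

```python
import numpy as np

def rmse(y, p):
    return float(np.sqrt(np.mean((y - p) ** 2)))

def fit_predict(X_tr, y_tr, X_va):
    """Stand-in regressor: ordinary least squares with an intercept."""
    A = np.column_stack([np.ones(len(X_tr)), X_tr])
    w, *_ = np.linalg.lstsq(A, y_tr, rcond=None)
    return np.column_stack([np.ones(len(X_va)), X_va]) @ w

def individual_inclusion(base, extras, y, split=0.75):
    """Delta-RMSE from adding each engineered feature to the baseline alone."""
    k = int(len(y) * split)
    base_rmse = rmse(y[k:], fit_predict(base[:k], y[:k], base[k:]))
    deltas = {}
    for name, col in extras.items():
        X = np.column_stack([base, col])
        deltas[name] = rmse(y[k:], fit_predict(X[:k], y[:k], X[k:])) - base_rmse
    return base_rmse, deltas

# Toy data: "temp" genuinely explains the target, "noise" does not.
rng = np.random.default_rng(0)
n = 400
base = rng.normal(size=(n, 2))
temp = rng.normal(size=n)
noise = rng.normal(size=n)
y = base @ np.array([1.0, -0.5]) + 2.0 * temp + 0.1 * rng.normal(size=n)
base_rmse, deltas = individual_inclusion(base, {"temp": temp, "noise": noise}, y)
```

A strongly negative delta (as for the hypothetical "temp") marks a feature with standalone predictive value; a delta near zero marks one that adds nothing on its own.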
Transfer stop features
Due to their similarity to the original (preprocessed) Automated Passenger Counting (APC) input features, these engineered features were incorporated into the predictive modelling pipeline. They successfully passed the feature space optimisation step in Section 3.3.2 and were used to train the machine learning (ML) algorithms, thus confirming their utility. In addition, the model interpretation results in Figures 19 and 20 indicate that TransferStop can be quite useful for certain models and data subsets in making correct predictions.
Weather features
Weather conditions directly influence passenger behaviour and transit demand patterns. Figure 22 shows the results of both evaluation approaches for weather features. The RMSE results show that the inclusion of weather features (weather.all) reduces the XGBoost RMSE from 2.8247 to 2.7376 and increases the [Boarding, Alighting] \(R^2\) scores from [0.5187, 0.4952] to [0.5506, 0.5238]. The \(R^2\) improvement indicates the moderate effect of weather on the dynamics of passenger counts. Furthermore, when individually included, 16 of the 23 weather features led to better performance, with sunrise and daylight having close to the same effect as including all weather features. Alam et al. (2021) suggest that weather features have an even greater impact on modelling performance when training data is limited.
The positive effects of weather features on modelling performance might be more pronounced at a higher level of granularity, such as using hourly or half-hourly weather data (Wei 2022), which would better capture the impact of rapid fluctuations in local weather conditions. Trondheim, due to its geographical position along the Trondheim Fjord, experiences both maritime and continental influences, which lead to frequent and abrupt weather changes, with residents often encountering multiple weather types in a single day. As Rühmann et al. (2024) highlight, mobility decisions and mode choices are influenced not just by current weather but also by short-term forecasts. Thus, aligning hourly weather data with this anticipatory behaviour by offsetting the timings could improve predictive performance.
Fig. 22
Weather feature impacts on XGBoost performance using individual inclusion and combined training approaches. Red bars show RMSE changes from individual inclusion (left axis), blue bars show mean absolute SHAP values from combined training (right axis). Dashed line indicates baseline RMSE
However, the SHAP values of the “combined training” approach clearly rank tempmax, tempmin, and snow as the most important weather features. At first glance, these rankings appear to contradict those of the “individual inclusion” approach. The weather features that showed the highest RMSE improvements when trained alone had lower SHAP values, while the features with high SHAP rankings had minimal standalone RMSE improvements. This discrepancy arises from the underlying differences in the SHAP and RMSE metrics. RMSE measures independent predictive power, while SHAP measures conditional importance within the full set of features. Thus, a feature with strong RMSE improvement is useful on its own, while another feature may receive a high SHAP ranking because it works well in combination with other features.
SHAP analysis reveals that higher-order feature interactions in the entire feature set consolidate weather effects into tempmax, tempmin, and snow. These same features have been identified as key drivers of activity in bike sharing systems (Rühmann et al. 2024) and irregularities in transit bus arrival times (Alam et al. 2021). However, since the goal is to enhance baseline features with a subset of weather features, prioritising those that significantly improve RMSE in isolation ensures that they retain predictive value even in a reduced feature set. Thus, sunrise, daylight, precipcover, and windgust are good choices for augmentation.
Terrain features
Terrain characteristics between stops may influence travel patterns and service accessibility. Following the same evaluation approaches, Table 7 presents the effects of the terrain-based features on the \(R^2\) and RMSE performance of the baseline XGBoost model. These effects are generally modest, with small improvements or degradations depending on the feature, with the most beneficial impact coming from typeElevation.
Table 7
Terrain feature impacts on XGBoost performance using individual inclusion and combined training approaches
Metric | Baseline | Terrain [ALL] | diffDistance | diffElevation | typeElevation | slopeElevation
\(R^2\) | 0.5066 | 0.5049 | 0.5061 | 0.5097 | 0.5101 | 0.5088
RMSE | 2.8256 | 2.8306 | 2.8273 | 2.8167 | 2.8155 | 2.8198
Performance metrics show \(R^2\) and RMSE values for the baseline, all terrain features combined, and individual terrain features; typeElevation performs best on both metrics
These modest effects can be attributed to a misalignment between the predictive modelling task and the formulation of terrain-based features. The modelling task focusses on local stop and line dynamics, whereas these features are derived from relationships between pairs of stops. Thus, the model struggles to integrate this information when making predictions. However, such features could have a greater impact in a global modelling context, where each stop is represented as a node in a line and different lines form a unified network, connected through transfer stops.
Demographic and land use features
Population density and land use patterns around stops influence passenger volumes and overall dynamics in the transit network. Figure 23 shows the results of both evaluation approaches for demographics and land use features. Although the inclusion of all features (landuse.all) has a negative effect on model performance, individual features such as Population and Buildings had positive effects. Population reduced the baseline RMSE from 2.8300 to 2.7935 and increased [Boarding, Alighting] \(R^2\) scores from [0.5191, 0.4909] to [0.5325, 0.5034].
Fig. 23
Demographic and land use feature impacts on XGBoost performance using individual inclusion and combined training approaches. Red bars show RMSE changes from individual inclusion (left axis), blue bars show mean absolute SHAP values from combined training (right axis). Dashed line indicates baseline RMSE
Although the improvements from these features are modest, particularly in comparison to the weather-related features, their limited impact can be partly attributed to the same local modelling issue observed with the terrain-based features in Section 3.4.3. Local demographics and land use characteristics around a bus stop influence not only boarding and alighting counts at that stop, but also downstream alighting and upstream boarding along the route, respectively.
However, in the current local modelling context, these inter-stop dependencies are not explicitly accounted for. Thus, the partial loss of information on urban activity along the route weakens the expected impact of these features on model performance. Furthermore, the finding by Wei (2022) that the influence of weather on the dynamics of public transit varies by passenger type—adult, senior, child (5 to 14 years of age), secondary student and tertiary student—suggests that increasing the granularity of demographic data during training could enhance model performance.
Notably, the SHAP values in Figure 23, obtained from the “combined training” approach, exhibit the same contradictory importance rankings relative to the RMSE metrics from the “individual inclusion” approach as observed with the weather-related features in Figure 22. SHAP analysis indicates that in the full feature set, higher-order interactions among demographics and land use characteristics are mainly aggregated into Employees, with the remaining influence distributed between Dwellings and Establishments. This suggests that the model implicitly disaggregates Buildings into its two functional categories—either residential (Dwellings) or commercial (Establishments)—and prioritises the movement of Employees between these locations (similar to Verma et al. (2021)) rather than relying on Population around stops for its predictions.
Conclusion and future work
Conclusion
This study examined the combined effectiveness of big data and machine learning/deep learning (ML/DL) algorithms in modelling the local stop-level spatiotemporal dynamics of a public bus transit system while accounting for external, non-mobility influences on human mobility. The analysis validated data integrity, identified correlations between boarding and alighting counts, and optimised the feature space to improve predictive performance. A horizon-agnostic modelling approach was employed, where models predict passenger counts based on input feature combinations rather than specific temporal horizons, enabling scenario analysis for planning and digital twin applications. Various ML/DL models were compared, and explainable ML methods were used to analyse feature attributions in different modelling tasks and data subsets.
Using unique real-world data from Trondheim, Norway’s public bus transit system, the main conclusions of the study can be summarised as follows:
ML vs. DL Performance: Tree-based ML models (LightGBM, Random Forest, CatBoost, and XGBoost) outperformed the Tabular DNN model in all horizon-agnostic predictive tasks. XGBoost achieved the best overall performance (\(\text {RMSE}=2.8247\)), but the structural similarities between the tree-based models limited their suitability for the ensemble methods. The poor performance of Tabular DNN (\(\text {RMSE}=2.9888\)) was attributed to ineffective feature embeddings that did not capture complex spatiotemporal dependencies.
Model Performance vs. Stability: While XGBoost achieved superior aggregate performance metrics, statistical significance tests revealed no conclusive algorithmic superiority (Friedman test \(p > 0.05\)). Residual analysis showed that XGBoost exhibited greater variance at higher passenger volumes, while Tabular DNN maintained more consistent error patterns despite a lower overall accuracy. Spatial analysis confirmed robust model generalisation across diverse operational contexts.
Boarding–Alighting Correlations: Ground truth validation confirmed that automated passenger counts accurately captured stop-level dynamics, with corrective models further improving the accuracy of the automated counts. The data also revealed system-wide correlations between passenger counts, showing a weak but asymmetric relationship where boarding counts exert greater influence on predicting alighting counts than vice versa.
Impact of External Factors: Models trained using only mobility-related features explained \(\approx 50\%\) of the variance in passenger counts (boarding and alighting) but \(\approx 90\%\) for operational factors (scheduling delays and dwell times). Incorporating features from external non-mobility domains—weather, terrain, demographics, and land use—improved the explained variance for passenger counts to varying degrees, with weather (\(+3.19\%\)) having the most impact and terrain the least (\(+0.35\%\)).
Spatial Structure and Inductive Biases: The local, stop-level modelling context, which treats each stop as an independent data point, reduced the impact of features encoding inter-stop attributes for consecutive stops along a route. Without inductive biases about the spatial structure of transit networks, these features—such as transfers, terrain, demographics, and land use—had limited predictive utility.
Using an iterative feature-space optimisation pipeline with hyperparameter tuning, an optimal subset of input features (\(\approx 41\%\) of the full set) was identified and validated across five ML/DL algorithms and two modelling tasks. Although there was a strong consensus on low-importance features, importance rankings varied for the remaining ones. However, these features alone were insufficient for robust prediction, requiring additional engineered features from external, non-mobility domains. To enhance interpretability, explainable ML techniques, particularly SHapley Additive exPlanations (SHAP) values, were used to analyse feature attributions, providing insight into how different input features influenced model predictions across algorithms, tasks, and data subsets.
In general, this study demonstrated the effectiveness of traditional ML models in capturing local, stop-level spatiotemporal dynamics of passenger counts and operational factors in public bus transit systems within a horizon-agnostic framework. Their lower computational complexity and greater interpretability make them suitable for real-time transit information systems, mobile applications, and digital twins. Furthermore, the feature-space optimisation, hyperparameter tuning, and algorithm comparison pipelines developed in this study can be integrated into automated machine learning (AutoML) frameworks for mobility-related modelling tasks.
This modelling framework helps policymakers, transit agencies, and urban planners make data-driven decisions by identifying mobility patterns, optimising predictive models, and incorporating external factors to improve transit reliability and efficiency. The horizon-agnostic approach enables scenario-based analysis, where models predict how changes in input features (e.g., demographic shifts, infrastructure developments, or policy interventions) affect passenger dynamics within the training data domain, making it particularly suitable for simulation and planning applications rather than temporal forecasting. These insights can inform policies that improve sustainable mobility, such as better scheduling, demand-responsive transit, and multimodal integration. The approach can also be applied to broader urban mobility challenges, such as assessing how parking policies affect public transit ridership, laying the groundwork for future research on land use and congestion management.
Future work
The modelling results in this study indicate that analysing line-stop dynamics in isolation—treating each stop as an independent data point—fails to capture the variance in passenger count dynamics or fully utilise available features. While this local modelling approach serves as an effective foundation for demonstrating traditional ML capabilities and interpretability, it represents only the first step towards comprehensive transit system modelling. Future work should pursue a global modelling approach in which stops are interconnected within lines and lines form a network, better reflecting real transit dynamics by accounting for passenger movement, transfer patterns, and operational dependencies throughout the system.
Traditional tree-based ML algorithms, while effective for tabular data, struggle with capturing the sequential and relational dependencies inherent in transit networks. In contrast, sequence-based DL algorithms (e.g., recurrent neural networks, transformers) are better at capturing temporal dependencies, while graph-based algorithms (e.g., graph neural networks) are better at representing spatial relationships. Combining these approaches can improve predictive performance by leveraging both sequential and network-based representations of passenger flow dynamics throughout the public transit system.
However, deep learning-based algorithms consistently struggle with categorical features, requiring more robust embedding methods such as autoencoders or contrastive learning. These techniques generate information-rich latent embeddings for line- and stop-specific features, enhancing the integration of categorical and numerical inputs in a global transit modelling context.
In addition, explanatory features from non-mobility sources can be further refined to enhance the granularity of transit demand analysis and modelling. This includes replacing daily weather data with hourly variations, disaggregating demographic data into age groups and occupational categories, and incorporating detailed land use classifications (e.g., shopping centres, recreational areas).
Acknowledgements
We are grateful to AtB AS Trondheim for providing the essential mobility data for our research and extend special thanks to Tsaqif Wismadi, whose ideas contributed to the development of the terrain-based features evaluated in this work.
Declarations
Conflict of interest
The authors declare that they have no competing interests.
Open Access This article is licensed under a Creative Commons Attribution 4.0 International License, which permits use, sharing, adaptation, distribution and reproduction in any medium or format, as long as you give appropriate credit to the original author(s) and the source, provide a link to the Creative Commons licence, and indicate if changes were made. The images or other third party material in this article are included in the article’s Creative Commons licence, unless indicated otherwise in a credit line to the material. If material is not included in the article’s Creative Commons licence and your intended use is not permitted by statutory regulation or exceeds the permitted use, you will need to obtain permission directly from the copyright holder. To view a copy of this licence, visit http://creativecommons.org/licenses/by/4.0/.
Publisher's Note
Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.
Oluwaleke Yusuf
received the B.Sc. degree in Mechanical Engineering from the University of Ibadan, Nigeria, and the M.Sc. degree in Robotics, Control, and Smart Systems from The American University in Cairo, Egypt. He is currently pursuing the Ph.D. degree in Engineering Cybernetics with the Norwegian University of Science and Technology (NTNU), Norway. Before joining NTNU, he worked in the telecommunications industry, where he gained experience in large-scale infrastructure monitoring and systems operations. His research interests include Artificial Intelligence, Big Data Analytics, Computer Vision, and Data-Driven Modelling, with applications in sustainable mobility, intelligent systems, and human–machine interaction.
Adil Rasheed
is a Professor at the Department of Engineering Cybernetics, Norwegian University of Science and Technology (NTNU), and a part-time Senior Scientist at SINTEF Digital. His research focuses on developing enabling technologies for digital twins, referred to as Hybrid Analysis and Modeling (HAM), which integrates knowledge- and data-driven modeling approaches for real-time automation and control. At NTNU, he currently leads the digital twin and asset management research within the FME NorthWind Centre and is affiliated with the NTNU Digital Twin Team, the NTNU Wind Team, and the OpenAI Lab. He previously led the Computational Sciences and Engineering Group at SINTEF Digital (2012–2019), where he worked on advancing digitalization initiatives within the renewable and aviation sectors. His research has applications across robotics, renewable energy, aviation, and the process industry, aiming to develop data-enhanced modelling frameworks that enable smarter and more sustainable engineering solutions.
Frank Lindseth
(b. 1969) is a Norwegian professor at the Department of Computer and Information Science (IDI) at the Norwegian University of Science and Technology (NTNU), where he leads and co-leads several research initiatives in digital mobility, visual computing, and AI. With a PhD in Computer and Information Science from NTNU (2003), his work bridges computer vision, digital twins, and autonomous systems with applications in mobility and transport infrastructure. Lindseth plays a central role in national and international projects shaping the future of smart mobility. He leads the “Digital Technologies and Mobility” component of the MoST – Mobilitets Lab Stor-Trondheim project and the “Mobility Pillar” of Green2050 – Centre for Green Shift in the Built Environment. He also contributes to projects like Smarter Maintenance of Roads, Machine Sensible Infrastructure under Nordic Conditions (MCSINC), and the Gemini Centre for Digitization and Automation of Future Road Transport. As a core member of the Norwegian Open AI Lab and leader of both the NTNU Autonomous Perception Lab (NAP-lab) and the Visual Computing Lab (VC-lab), Lindseth advances the use of AI, computer vision, and digital twin technologies for sustainable, intelligent mobility systems. He has co-authored over 150 scientific papers and supervised numerous master’s and PhD students working on mobility-related topics.
This section provides comprehensive specifications for all features employed in the predictive modelling framework. The input features (Table 10) encompass the temporal, operational and spatial characteristics of the bus transit system, while the target features (Table 9) represent passenger demand and operational outcomes. The weather features (Table 8) capture the environmental conditions that influence travel patterns.
Table 8
Weather variables used for feature engineering in predictive models
Categorical
Date: The date on which the weather data was recorded.
preciptype: Type(s) of precipitation (rain, snow, freezingrain, or ice).
conditions: Short textual summary of the weather conditions.
Continuous
tempmax: Maximum daily temperature (\(^\circ C\)).
tempmin: Minimum daily temperature (\(^\circ C\)).
temp: Mean daily temperature (\(^\circ C\)).
dew: Dew point temperature (\(^\circ C\)).
humidity: Relative humidity (%).
precip: Total precipitation (rain, snow, or ice) over the period (mm).
precipprob: Probability of measurable precipitation (0–100%).
precipcover: Proportion of hours with non-zero precipitation (0–100%).
snow: Total snowfall over the period (mm).
snowdepth: Snow depth on the ground (mm).
windgust: Maximum instantaneous wind speed (km/h).
windspeed: Maximum sustained wind speed (km/h).
winddir: Wind direction in degrees (0–360\(^\circ \)), where 0\(^\circ \) = North.
sealevelpressure: Sea-level atmospheric pressure (hPa).
cloudcover: Percentage of sky covered by clouds (0–100%).
visibility: Distance at which objects remain visible (km).
solarradiation: Instantaneous solar radiation power (\(W/m^2\)).
uvindex: Maximum daily ultraviolet (UV) exposure index (0–10).
sunrise: Local time of sunrise, encoded as minutes since midnight.
daylight: Difference between sunset and sunrise (minutes).
moonphase: Fractional progress through the lunar cycle (0 = new moon, 0.5 = full moon, 1 = new moon).
Note: The feature names from the data source (see Section 3.3.3) have been retained for easy reference
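As an illustration of the two derived time features above, sunrise can be encoded as minutes since midnight and daylight as the sunset-sunrise difference. The helper function and timestamps below are hypothetical; the long June daylight is typical for Trondheim's latitude.

```python
from datetime import datetime

def minutes_since_midnight(ts: datetime) -> int:
    """Encode a local time-of-day as minutes since midnight."""
    return ts.hour * 60 + ts.minute

# Illustrative early-summer day at high latitude
sunrise = datetime(2022, 6, 1, 3, 32)
sunset = datetime(2022, 6, 1, 23, 18)

feat_sunrise = minutes_since_midnight(sunrise)                 # 212
feat_daylight = minutes_since_midnight(sunset) - feat_sunrise  # 1186
```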
Table 9
Target variables for passenger demand and operational predictions
All target features are continuous.
Automated Passenger Counts
Boarding: The number of passengers boarding at the stop.
Alighting: The number of passengers alighting at the stop.
busVolume\(^{1}\): The total number of passengers on the bus when it leaves a stop.
tripSumVolume\(^{1}\): An incremental count of the passengers that have boarded the bus over the course of the trip; the final sum is reached at the last stop.
stopVolume\(^{1}\): The sum of boarding and alighting passengers at each stop; a proxy for how much passenger volume the stop handles.
Operational Parameters
StopActualArrival: The time the bus actually arrived at the stop.
StopTime: The actual time (seconds) spent at the stop.
StopActualDeparture\(^{1}\): The time the bus actually departed from the stop.
\(^{1}\) Can be derived from other target features in the same group
Table 10
Input variables for temporal, operational and spatial characteristics of the transit system
Input Feature
Description
Categorical
Date
The date on which the transportation data was recorded.
Line
The line or route number of the bus.
StopSequence
The sequence number of the stop within the trip.
LastStopSequence
The sequence number of the last stop for the trip. Also indicates the length of the line.
TripDepartureHour
The hour at which the departure is scheduled to occur. This value is the same for all stops on a trip.
StopArrivalStatus
A flag (1/2/3) indicating if the (scheduled-actual) arrival deviation is less/more than 5 minutes (300 seconds).
StopDepartureStatus
A flag (1/2/3) indicating if the (scheduled-actual) departure deviation is less/more than 5 minutes (300 seconds).
StopName
The name of the stop.
StopIdentifier
The unique identifier for the stop in NeTEx format.
StopType
The type or category of the stop: Transfer or Ordinary.
TransferStop
A flag indicating if the stop is a transfer stop and its degree of importance i.e. how many alternate lines can be accessed from the stop.
Vehicle: The identifier of the vehicle used for transportation.
BusType: The type of bus used for transportation.
Year: The year when the transportation data was recorded.
Month: The name of the month when the transportation data was recorded.
WeekNumber: The week number in the year when the transportation data was recorded.
Week_Weekend: The classification of the day as a weekday or weekend.
DayComment: A comment about the day that might affect passenger counts.
HolidayName: The name of the holiday falling on this date.
DayType: The name of the day type.
dateTrip: A unique identifier concatenating the date and trip number such that each trip is unique across the entire dataset.
Boolean Flags
TripDirection: A flag indicating the travel direction of the bus, from (1) or towards (0) its assigned 'home' stop.
Trip_15minDeviation: A flag indicating whether the trip as a whole deviates from schedule by at least 15 minutes; this flag is coarse and sensitive to outliers.
Stop_10minDeviation: A flag indicating if the stop has a 10-minute deviation relative to the scheduled time.
Stop_IsDelayed: A flag indicating if the stop is delayed relative to the scheduled time.
Stop_IsEarly: A flag indicating if the stop is early relative to the scheduled time.
Holiday_Restday: A flag indicating holiday or rest-day status.
SqueezedWorkingDay: A flag indicating a squeezed workday, i.e. a workday that falls between non-working days.
Vacation: A flag indicating vacation status.
Workday: A flag indicating if the day is a workday.
StudentVacation: A flag indicating student holidays.
SchoolVacation: A flag indicating school holidays.
OutboundDay: A flag indicating outbound days, the first days of long holidays when people typically travel.
Continuous
TripScheduledDeparture: The timestamp of the scheduled trip start; this value is the same for all stops on a trip.
TripScheduledArrival: The timestamp of the scheduled trip end; this value is the same for all stops on a trip.
StopScheduledArrival: The time the bus is scheduled to arrive at the stop.
StopScheduledDeparture: The time the bus is scheduled to depart from the stop.
Longitude: The longitude coordinate of the stop location.
Latitude: The latitude coordinate of the stop location.
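To illustrate how several of the boolean flags and identifiers above can be derived, the following sketch computes the stop-level deviation flags and the dateTrip key from scheduled and actual times. The actual-arrival input and the exact thresholds are assumptions for illustration only; the dictionary above lists scheduled times, and the true derivation logic is not specified in the text.

```python
from datetime import datetime, timedelta

def stop_flags(scheduled_arrival: datetime, actual_arrival: datetime,
               deviation_threshold: timedelta = timedelta(minutes=10)) -> dict:
    """Derive stop-level deviation flags from scheduled vs. actual arrival.

    `actual_arrival` is an assumed field; only scheduled times appear in
    the published data dictionary.
    """
    delta = actual_arrival - scheduled_arrival
    return {
        "Stop_IsDelayed": delta > timedelta(0),               # arrived late
        "Stop_IsEarly": delta < timedelta(0),                 # arrived early
        "Stop_10minDeviation": abs(delta) >= deviation_threshold,
    }

def date_trip_id(service_date: datetime, trip_number: int) -> str:
    """Concatenate date and trip number into a dataset-wide unique key,
    mirroring the `dateTrip` identifier (format is assumed)."""
    return f"{service_date:%Y-%m-%d}_{trip_number}"
```

For example, a bus arriving 12 minutes after its scheduled time would set both Stop_IsDelayed and Stop_10minDeviation.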
Appendix B Model hyperparameter configurations
This section outlines the hyperparameter optimisation process for the machine learning (ML) models used in the ground truth (GT) (Table 11) and automated passenger counting (APC) (Table 12) modelling tasks. The tables detail the search space for each model, specifying the hyperparameters explored and the distribution from which the optimal values were obtained. Note that the original names of the source ML packages (see Section 3.3.3) have been retained for easy reference.
Table 11
Hyperparameter search spaces and optimal configurations for GT prediction models
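The optimisation procedure described above, drawing candidate configurations from a search space and keeping the best-scoring one, can be sketched in a few lines of pure Python. The search space and evaluation function here are illustrative placeholders, not the configurations reported in Tables 11 and 12; in practice this loop is delegated to library tooling such as scikit-learn's search utilities.

```python
import random

# Illustrative search space only; the actual spaces appear in Tables 11 and 12.
SEARCH_SPACE = {
    "n_estimators": range(100, 1001, 100),
    "max_depth": range(3, 13),
    "learning_rate": [0.01, 0.05, 0.1, 0.2],
}

def sample_config(space: dict, rng: random.Random) -> dict:
    """Draw one configuration from the search space."""
    return {name: rng.choice(list(values)) for name, values in space.items()}

def random_search(space: dict, evaluate, n_trials: int = 50, seed: int = 0):
    """Return the configuration with the lowest validation error
    among n_trials random draws."""
    rng = random.Random(seed)
    best_config, best_score = None, float("inf")
    for _ in range(n_trials):
        config = sample_config(space, rng)
        score = evaluate(config)  # e.g. cross-validated error on a model
        if score < best_score:
            best_config, best_score = config, score
    return best_config, best_score
```

Increasing the trial budget with a fixed seed can only improve (or match) the best score found, since the earlier draws are a prefix of the longer run.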