1 Introduction
-
Traffic status prediction: Navigation systems built on electronic maps are widely used to avoid congested roads when planning a trip. The key ability behind this is predicting which roads will be congested in the near future; in other words, we need to predict the traffic status of each road. Traffic status is typically measured by average traffic speed or travel time: the slower the speed or the longer the travel time, the worse the traffic status. Therefore, traffic status prediction can be regarded as traffic speed or travel time prediction, both of which are regression problems. Alternatively, we can describe the traffic status with discrete levels (e.g., smooth, light congestion and heavy congestion) by splitting the traffic speed into consecutive intervals, in which case predicting the traffic status becomes a classification problem.
-
Traffic flow prediction: Recently, several stampede incidents have been caused by excessive crowding. The main reason is that the government could not monitor and guide the flow of people in time. Hence, it is important to predict future traffic flows. Traffic flow can be divided into two types: network-based and region-based. The first type refers to the number of vehicles counted by loop detector sensors, which are installed at both endpoints of a road. For the second type, we split the whole city into regions and regard the number of people leaving one region for another as the region-based traffic flow. The region-based traffic flow can thus be further divided into in-flow and out-flow. For example, if 100 people leave region A for region B, both A’s out-flow and B’s in-flow increase by 100.
-
Travel demand prediction: Transportation companies provide online taxi service for users. They need to predict people’s travel demands in order to better dispatch vehicles for different regions. For example, they should dispatch more vehicles to residential areas during the morning rush hour. In contrast, they should dispatch more vehicles to office zones during the evening rush hour. Generally, predicting travel demands is based on regions, so we also call it region-based travel demand prediction.
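The in-flow/out-flow bookkeeping described in the traffic flow item can be sketched as follows. This is only a minimal illustration: the input format (a list of `(origin, destination, count)` tuples) and the function name `region_flows` are our own assumptions, not part of any surveyed method.

```python
from collections import defaultdict


def region_flows(od_counts):
    """Aggregate origin-destination counts into per-region in-flow and out-flow.

    od_counts: iterable of (origin, destination, count) tuples.
    This input format is a hypothetical choice made for illustration.
    """
    in_flow = defaultdict(int)
    out_flow = defaultdict(int)
    for origin, destination, count in od_counts:
        if origin == destination:
            # Movement that stays inside one region crosses no boundary.
            continue
        out_flow[origin] += count
        in_flow[destination] += count
    return dict(in_flow), dict(out_flow)


# 100 people leave region A for region B, as in the example in the text.
in_flow, out_flow = region_flows([("A", "B", 100), ("B", "A", 30), ("A", "C", 5)])
```

With this input, region A's out-flow accumulates the counts of both trips leaving A, while region B's in-flow receives the 100 people arriving from A.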
Type | Spatial | Temporal | Components |
---|---|---|---|
\(\textit{SO}\) | Static | None | POI information, road network |
\(\textit{TO}\) | None | Static | Holiday, date, timestamp |
\(\textit{STS}\) | Static | Static | Event data |
\(\textit{SSTD}\) | Static | Dynamic | Flow, demand, travel time, velocity, meteorological data |
\(\textit{SDTD}\) | Dynamic | Dynamic | Trajectory |
-
Map matching: Map matching is an operation that aligns spatial data given as latitude/longitude coordinates with a road network. For example, we can use map matching techniques to convert a taxi’s trajectory (i.e., a GPS sequence) into a sequence of roads, from which we can further compute traffic flows on the corresponding roads. Hence, effective map matching methods are important for collecting traffic data.
-
Data cleaning: Errors are inevitable when collecting spatio-temporal data. For example, GPS points may deviate from their real positions. Data cleaning techniques let us correct historical GPS points before using them to predict future traffic.
-
Data storage: As the volume of collected spatio-temporal data grows, managing it efficiently becomes difficult. For example, some travel time prediction methods leverage the average travel time of similar historical trajectories, so finding similar trajectories efficiently is essential for these methods. Here, we aim to survey methods that focus on how to store and retrieve big spatio-temporal data.
-
Data compression: Big spatio-temporal data causes heavy overhead for communication, computing and storage. However, some traffic prediction problems do not need all of the data. For example, when computing region-based traffic flows, we only need to record the number of trajectories going from one region to another, so it is unnecessary to record the whole trajectory. To address this issue, one approach is to compress spatio-temporal data. Here, we aim to survey methods that focus on how to compress spatio-temporal data effectively and efficiently.
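To make the map matching step above concrete, the following is a minimal sketch of a purely geometric, point-to-segment matcher: each GPS point is assigned to the nearest road segment, and consecutive duplicates are collapsed into a road sequence. It assumes planar coordinates (real latitude/longitude data would first need a map projection) and straight-line roads; the names `project_to_segment`, `match_trajectory` and the toy road layout are hypothetical. It ignores topology and noise, which the more advanced techniques address.

```python
import math


def project_to_segment(p, a, b):
    """Return the point on segment a-b closest to p, and its distance to p."""
    px, py = p
    ax, ay = a
    bx, by = b
    dx, dy = bx - ax, by - ay
    seg_len_sq = dx * dx + dy * dy
    if seg_len_sq == 0.0:
        t = 0.0  # degenerate segment: a and b coincide
    else:
        # Parameter of the orthogonal projection, clamped to the segment.
        t = ((px - ax) * dx + (py - ay) * dy) / seg_len_sq
        t = max(0.0, min(1.0, t))
    qx, qy = ax + t * dx, ay + t * dy
    return (qx, qy), math.hypot(px - qx, py - qy)


def match_trajectory(points, roads):
    """Map each point to its nearest road, then collapse repeats.

    roads: dict mapping a road id to a (start, end) pair of planar points.
    """
    matched = [
        min(roads, key=lambda rid: project_to_segment(p, *roads[rid])[1])
        for p in points
    ]
    sequence = [rid for i, rid in enumerate(matched) if i == 0 or rid != matched[i - 1]]
    return matched, sequence


# Toy network: r1 runs east, r2 continues north from the end of r1.
roads = {"r1": ((0, 0), (10, 0)), "r2": ((10, 0), (10, 10))}
gps = [(1, 0.5), (5, -0.3), (9.8, 4), (10.2, 8)]
matched, road_sequence = match_trajectory(gps, roads)
```

Counting how many road sequences contain a given road then gives a simple estimate of the flow on that road.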
-
Traffic classification: The traffic classification problem focuses on how to design effective methods to classify given traffic data. For example, given a taxi’s ongoing trajectory, we can use a classification method to judge whether the trajectory is normal and, if not, remind the driver to correct the route in time. This is a typical binary classification task. There are also multi-class classification problems. For instance, different modes of transportation (e.g., walking, bus, subway and taxi) generate different kinds of trajectories, so it is also useful to assign a given trajectory to its transportation mode. To solve the classification problem, existing studies mainly rely on machine learning methods, which can be split into two kinds: traditional learning methods, such as HMM (hidden Markov model [3]), CRF (conditional random field [4]) and DT (decision tree [5]), and deep learning methods, such as CNN (convolutional neural network [6]) and RNN (recurrent neural network [7]).
-
Traffic Generation: The traffic generation problem is about generating traffic data. There are three reasons for studying it. Firstly, with the development of deep learning techniques, more and more deep learning models are designed to solve traffic prediction problems, and these models require large-scale training data to improve their accuracy. However, real-world traffic data is not easy to collect, so generating data is an effective way to address this issue. Secondly, some applications (e.g., ride-hailing and taxi dispatching) need to evaluate approaches in a transportation environment. However, using the real-world environment is unrealistic because not all kinds of real-world traffic data are available. Hence, it is useful to simulate the environment by generating the relevant traffic data. Thirdly, we need to consider privacy protection when using collected real-world data to train traffic prediction models. How to avoid disclosing users’ private information without reducing the effectiveness of the trained models is therefore an active research topic. In summary, these reasons split the generation problem into two parts: simulation and completion. For simulation, we use collected data to simulate the transportation environment, inferring the distribution of traffic data and generating unseen data from sparse observations. Hence, machine learning methods such as Bayes [8] are used to generate data or data distributions. For completion, we model and represent traffic data with hidden codes, from which we can replace unavailable or sensitive data with synthetic data. The methods used here include KNN (K-nearest neighbors) [9] and deep learning models such as GAN (generative adversarial networks) [10] and RNN.
-
Traffic Forecasting: The last major prediction task is to forecast the value of traffic data, such as traffic speed, traffic flows, travel demands and travel time. All of these problems fall into two categories, region-based and network-based, according to the format of the traffic data. Firstly, in region-based problems, we partition a city into disjoint regions and compute or estimate the related traffic data (e.g., regional flows and travel demands) for each region. For example, the government needs to monitor the crowd flows from one region to another to avoid public security problems caused by overcrowding. Secondly, in network-based problems, we consider the constraints of the road network; the traffic data (e.g., intersection flows, road speed and travel time) are tied to roads and intersections. For example, when we plan to go from one position to another, we prefer the route with the least travel time, and this travel time has to be estimated by an effective model.
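As a point of reference for the forecasting task, the sketch below shows a historical-average (HA) baseline for network-based traffic speed: the forecast for a road at a given time slot is the mean of past speeds observed on that road in the same slot. This is a deliberately simple stand-in rather than any specific surveyed model, and the observation format `(road_id, slot, speed)` and the function names are assumptions made for illustration.

```python
from collections import defaultdict


def fit_historical_average(observations):
    """Average past speeds per (road_id, slot) key.

    observations: iterable of (road_id, slot, speed) tuples, where slot is
    e.g. the hour of day. This format is a hypothetical choice.
    """
    sums = defaultdict(float)
    counts = defaultdict(int)
    for road_id, slot, speed in observations:
        sums[(road_id, slot)] += speed
        counts[(road_id, slot)] += 1
    return {key: sums[key] / counts[key] for key in sums}


def predict_speed(model, road_id, slot, default=None):
    """Forecast the speed for a road and slot; fall back to default if unseen."""
    return model.get((road_id, slot), default)


history = [
    ("r1", 8, 30.0),  # road r1, 8 am, 30 km/h
    ("r1", 8, 20.0),
    ("r1", 9, 40.0),
    ("r2", 8, 50.0),
]
model = fit_historical_average(history)
forecast = predict_speed(model, "r1", 8)
```

Such a baseline ignores the road network, external factors and nonlinear patterns, which is precisely what the learning-based models aim to capture.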
-
Order dispatching: Online taxi services provided by transportation companies such as Uber, Didi and Lyft are increasingly popular. One core problem is to assign a large number of taxi orders to drivers effectively and efficiently. Given such a large set of orders, we should design methods that solve the dispatching problem with a globally optimal solution.
-
Ride sharing: Ride sharing is becoming a popular mode of transportation with profound effects on the industry. Given a sharing request, we could estimate the travel time from each candidate car’s location to the pick-up point and then assign the request to the car with the least travel time. However, traversing all available candidates is time-consuming. Therefore, when handling a larger number of requests, we need more sophisticated methods that trade off effectiveness against efficiency.
-
Business location: With the development of smart cities, it is increasingly popular to use data to find the right location to set up a shop or restaurant. Here, one possible solution is based on the crowd flow prediction of regions. Intuitively, the larger the crowd flow of a region, the better the region is as a candidate. This can also benefit the selection of billboard locations.
-
Spatio-temporal anomaly detection: We can convert the anomaly detection problem into a binary classification problem and then apply traffic classification methods to solve it.
-
Route Planning: It is useful to recommend an optimal route for a given departure-destination pair. Similar to taxi dispatching, we can recommend the route with the least travel time, which requires predicting the travel time.
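The route selection described above can be sketched as a shortest-path search over a road graph whose edge weights are predicted travel times. The example below uses Dijkstra's algorithm as one standard choice; the graph format, the function name `fastest_route` and the toy travel times are assumptions for illustration, and the predicted times are taken as given non-negative inputs rather than produced by a prediction model.

```python
import heapq


def fastest_route(graph, source, target):
    """Dijkstra's algorithm on a graph with non-negative travel times.

    graph: dict mapping a node to a list of (neighbor, travel_time) pairs.
    Returns (total_time, path), or (inf, []) if target is unreachable.
    """
    best = {source: 0.0}
    parent = {}
    heap = [(0.0, source)]
    while heap:
        time, node = heapq.heappop(heap)
        if time > best.get(node, float("inf")):
            continue  # stale heap entry
        if node == target:
            break
        for neighbor, travel_time in graph.get(node, []):
            new_time = time + travel_time
            if new_time < best.get(neighbor, float("inf")):
                best[neighbor] = new_time
                parent[neighbor] = node
                heapq.heappush(heap, (new_time, neighbor))
    if target not in best:
        return float("inf"), []
    # Walk the parent pointers back from the target to recover the route.
    path = [target]
    while path[-1] != source:
        path.append(parent[path[-1]])
    path.reverse()
    return best[target], path


# Hypothetical predicted travel times (in minutes) between intersections.
graph = {
    "A": [("B", 4.0), ("C", 2.0)],
    "B": [("D", 3.0)],
    "C": [("B", 1.0), ("D", 7.0)],
    "D": [],
}
total_time, route = fastest_route(graph, "A", "D")
```

The same routine could also serve ride sharing or order dispatching by computing the travel time from each candidate car to a pick-up point, although running it for every candidate is exactly the cost that motivates more efficient methods.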
2 Spatio-Temporal Data
2.1 Data Example
2.2 Reviewing Related Work
3 Preprocessing
3.1 Map-matching
Technique | Geometric | Topological | Global | Examples |
---|---|---|---|---|
Point-Distance | Yes/no | Yes/no | No | |
Path-Distance | Yes/no | Yes | Yes | |
Probability-Based | No | Yes | Yes | |
Model-Based | Yes | Yes | Yes/no | |
Learning-Based | Yes | Yes | Yes/no | |
3.2 Data Cleaning
3.3 Data Storage
3.4 Data Compression
4 Traffic Prediction
4.1 Traffic Classification
4.2 Traffic Generation
Problem types | Techniques | Examples | Consider road network | Consider environmental data | Consider spatial property | Consider temporal property | Handle nonlinearity |
---|---|---|---|---|---|---|---|
OD-Travel-Time | kNN | [122] | \(\times\) | \(\times\) | \(\times\) | \(\times\) | \(\times\) |
OD-Travel-Time | MLP | [123] | \(\times\) | \(\times\) | \(\times\) | \(\times\) | \(\checkmark\) |
OD-Travel-Time | ResNet,LSTM,CNN | | \(\checkmark\) | \(\checkmark\) | \(\checkmark\) | \(\checkmark\) | \(\checkmark\) |
Path-Travel-Time | kNN,TD,Regression | | \(\checkmark\) | \(\times\) | \(\times\) | \(\times\) | \(\times\) |
Path-Travel-Time | DT,HMM | | \(\checkmark\) | \(\times\) | \(\times\) | \(\times\) | \(\checkmark\) |
Path-Travel-Time | CNN,LSTM,W-D | | \(\checkmark\) | \(\checkmark\) | \(\checkmark\) | \(\checkmark\) | \(\checkmark\) |
Path-Travel-Time | Generative | | \(\checkmark\) | \(\times\) | \(\times\) | \(\times\) | \(\checkmark\) |
Travel-Demand | HA,ARIMA,ensemble | | \(\times\) | \(\times\) | \(\times\) | \(\times\) | \(\times\) |
Travel-Demand | MLP | [140] | \(\times\) | \(\checkmark\) | \(\times\) | \(\checkmark\) | \(\times\) |
Travel-Demand | CNN+RNN | | \(\times\) | \(\checkmark\) | \(\checkmark\) | \(\checkmark\) | \(\checkmark\) |
Travel-Demand | GCN,GAT | | \(\checkmark\) | \(\checkmark\) | \(\checkmark\) | \(\checkmark\) | \(\checkmark\) |
Regional-Flow | HA,ARIMA,ensemble | | \(\times\) | \(\times\) | \(\times\) | \(\times\) | \(\times\) |
Regional-Flow | CNN | [151] | \(\times\) | \(\checkmark\) | \(\times\) | \(\times\) | \(\checkmark\) |
Regional-Flow | CNN+LSTM | | \(\times\) | \(\checkmark\) | \(\times\) | \(\times\) | \(\checkmark\) |
Network-Flow | HA,ARIMA,ensemble | | \(\times\) | \(\times\) | \(\times\) | \(\times\) | \(\times\) |
Network-Flow | Autoencoder | [158] | \(\times\) | \(\times\) | \(\times\) | \(\times\) | \(\checkmark\) |
Network-Flow | GCN,Attention | | \(\checkmark\) | \(\times\) | \(\times\) | \(\times\) | \(\checkmark\) |
Network-Flow | GCN+RNN | | \(\checkmark\) | \(\times\) | \(\times\) | \(\times\) | \(\checkmark\) |
Network-Flow | Meta-learning | [164] | \(\checkmark\) | \(\times\) | \(\checkmark\) | \(\times\) | \(\checkmark\) |
Traffic-Speed | HA,ARIMA | [138] | \(\times\) | \(\times\) | \(\times\) | \(\times\) | \(\times\) |
Traffic-Speed | CNN,LSTM,FNN | | \(\checkmark\) | \(\times\) | \(\times\) | \(\times\) | \(\checkmark\) |
Traffic-Speed | LSTM+GCN | [171] | \(\checkmark\) | \(\checkmark\) | \(\checkmark\) | \(\checkmark\) | \(\checkmark\) |
4.3 Traffic Forecasting
5 Traffic Application
5.1 Ride Sharing
5.2 Order Dispatching
5.3 Business Location
5.4 Spatio-Temporal Anomaly Detection
5.5 Route Planning
6 Emerging Challenges and Opportunities
6.1 Complex Characteristics of Spatio-Temporal Data
6.2 AI-enhanced Spatio-Temporal Data Preprocessing
6.3 Joint Traffic Prediction
6.4 Interpretable and Automatic Deep Traffic Prediction Models
6.5 Unified Intelligent Transportation System
6.6 Performance Benchmarks and Pre-train Models
7 Public Spatio-Temporal Datasets
-
GAIA Open Dataset: Didi provides the academic community with real-life, high-quality anonymized data. On the website, they provide not only raw order-related datasets (e.g., orders, trajectories and voice data) but also processed transportation index datasets (e.g., travel time index and transportation energy index). In addition, they build benchmark datasets for popular transportation data mining competitions, such as KDD CUP 2020 and CCF BDCI 2020.
-
OpenStreetMap (OSM): Road networks are broadly used in many traffic prediction problems. OSM provides access to road networks all over the world, and we can also extract the road network of any specific city.
-
Taxi Trajectories: Plenty of taxi trajectories have been released by research projects. For example, Yuan et al. [250] provide a dataset, sampled from the Microsoft Research T-Drive project, of trajectories generated by over 10,000 taxicabs during one week of 2008 in Beijing. In addition, the taxi service trajectory prediction challenge 2015 provides an accurate dataset describing a complete year (from 01/07/2013 to 30/06/2014) of the (busy) trajectories performed by all 442 taxis running in the city of Porto.