
Open Access 04.08.2022 | Original Article

Multi-granularity scenarios understanding network for trajectory prediction

Authors: Biao Yang, Jicheng Yang, Rongrong Ni, Changchun Yang, Xiaofeng Liu

Published in: Complex & Intelligent Systems | Issue 1/2023

Abstract

Understanding agents’ motion behaviors in complex scenes is crucial for intelligent autonomous moving systems (such as delivery robots and self-driving cars). The task is challenging due to the inherent uncertainty of future trajectories and the large variation in scene layouts, yet most recent approaches ignore or underutilize scene information. In this work, a Multi-Granularity Scenarios Understanding framework, MGSU, is proposed to explore the scene layout at different granularities. MGSU consists of three modules: (1) a coarse-grained fusion module that uses cross-attention to fuse the observed trajectory with the semantic information of the scene; (2) an inverse reinforcement learning module that generates an optimal path strategy through grid-based policy sampling and outputs multiple scene paths; (3) a fine-grained fusion module that integrates the observed trajectory with the scene paths to generate multiple future trajectories. To fully exploit the scene information and improve efficiency, we present a novel scene-fusion Transformer, whose encoder extracts scene features and whose decoder fuses scene and trajectory features to generate future trajectories. Compared with current state-of-the-art methods, our method decreases the ADE errors by 4.3% and 3.3% on SDD and NuScenes, respectively, by gradually integrating scene information at different granularities. The visualized trajectories demonstrate that our method can accurately predict future trajectories after fusing scene information.

Introduction

With the rapid development of artificial intelligence, intelligent autonomous moving systems have become a hot research topic, and the accompanying driving-safety issues have also attracted public attention. However, the uncertainty of future trajectories and the large variation in scene layouts pose great challenges to forecasting pedestrians’ trajectories. Therefore, studying pedestrians’ motion behaviors is of great significance for reducing collision accidents and ensuring their safety.
The core of trajectory prediction is to learn pedestrians’ motion behaviors [1, 2] from given observed trajectories and to predict all possible future trajectories. To accurately predict future trajectories, researchers mainly adopt model-driven or data-driven methods. Commonly used model-driven methods include the Markov model [3, 4] and the Kalman filter [5, 6]. For example, Schneider and Gavrila [7] combined a Kalman filter with a constant-velocity model to predict pedestrians’ future trajectories. Mathew et al. [8] proposed a hybrid prediction method based on the hidden Markov model, which clusters trajectories according to the observed trajectories. However, due to the complexity and non-linearity of future trajectories, model-based methods struggle to accurately capture the dynamic changes and long-term dependencies of trajectories, so their predictions are not accurate enough.
Data-driven methods [9, 10] are effective ways to deal with dynamic changes and long-term dependencies in trajectories. Previous methods are mainly based on Recurrent Neural Networks (RNNs) and Convolutional Neural Networks (CNNs) to model the dynamic features of the trajectory. Among the RNN-based methods, Lee et al. [11] proposed an RNN-based model that captures the dynamic changes in motion through adaptive learning of network parameters. Bartoli et al. [12] adopted the Long Short-Term Memory (LSTM) network to alleviate the long-term dependence issue of trajectories by using the gate mechanism and time-step parameter sharing. RNN-based methods can model trajectory features; however, they must process the data sequentially due to their recurrent structure, resulting in inefficient data processing and vanishing-gradient problems [13]. Among the CNN-based methods, Chen et al. [14] proposed a convolutional embedding model that models the relative order of positions through one-dimensional convolution and predicts the next position from trajectory data. Zamboni et al. [15] proposed a new convolutional model for pedestrian trajectory prediction that uses 2D convolutions. CNN-based methods can effectively model the trajectory sequence, but due to the limited receptive field of the CNN, they do not extract long-term dependencies well.
To solve the above problems, we adopt the Transformer framework, which was first proposed in 2017 [16] and soon became popular in Natural Language Processing (NLP) tasks such as machine translation [17], speech recognition [18], and question answering [19]. The Transformer has strong semantic and task-comprehensive feature extraction abilities and supports parallel computation, which overcomes the shortcomings of the sequential structure of RNNs and their variants. In terms of capturing long-term features, the Transformer also excels thanks to its multi-head attention module. For trajectory prediction, Giuliari et al. [20] used the vanilla Transformer to model pedestrians’ trajectories and achieved satisfactory prediction performance. Yu et al. [21] proposed a framework for spatio-temporal crowd trajectory prediction that models the interactions in space and time with attention mechanisms only. The above methods exploit the powerful feature extraction ability of the Transformer and achieve good results in trajectory prediction. However, they only use the Transformer to extract one kind of feature, ignoring the multi-feature fusion ability of the cross-attention module in the network. We also notice that researchers tend to focus more on modeling interactions with other agents, ignoring the static contextual information (road infrastructure) of the scene. However, as shown in Fig. 1, the contextual information of the scene is as important as (or even more important than) the dynamic information of other agents [22].
To address these limitations, we propose a Multi-Granularity Scenarios Understanding (MGSU) framework, which extracts trajectory features and fuses them with scene information at different granularities. The network can capture long-term dependencies because it is based on the attention mechanism. To make the fusion more efficient, we introduce a novel scene-fusion Transformer, whose encoder extracts scene features and whose decoder fuses them with trajectory features. The scene-fusion Transformer adopts a sparse attention mechanism, and its decoder produces generative output as in Informer [23], which effectively avoids the accumulation of errors. A lightweight semantic segmentation network, ESPNet [24], is introduced to extract the semantic features of the scene. To better utilize the scene information, an inverse reinforcement learning (IRL) approach [25] is introduced to generate the optimal path strategy based on the semantic features. Concretely, the main contributions of this paper can be summarized as follows:
(1)
We propose a Multi-Granularity Scenarios Understanding framework (MGSU), which can effectively model the interaction between pedestrians’ trajectories and the scene, and generate multiple feasible predictions for the future trajectories. MGSU gradually integrates scene information and trajectory information according to different granularity stages. We also introduce ESPNet and inverse reinforcement learning methods to achieve a more comprehensive exploration of the impact of the scene layout on future trajectories.
 
(2)
To better fuse pedestrians’ trajectories with the scene, a novel and efficient scene-fusion Transformer is presented. It adopts a sparse attention mechanism and lets the decoder output the predicted future trajectories generatively, which effectively avoids error accumulation and improves efficiency.
 
(3)
We evaluate MGSU on the SDD and NuScenes datasets, and the results show that our approach can understand the scene layout with high accuracy.
 
The rest of this paper is organized as follows. “Related work” summarizes the methods for trajectory prediction and the methodology related to our work. “Method” describes the proposed MGSU in detail. “Experiments” elaborates on experiments for MGSU and discusses results with trajectory visualization and quantitative evaluation. Our conclusions are presented in “Conclusion”.

Related work

Trajectory prediction Trajectory prediction methods can be mainly divided into model-driven [8, 26–28] and data-driven [9, 29–31] methods. In the former, there is an explicit model of the target’s motion over time. Keller and Gavrila [8] proposed a method based on a linear dynamic model to predict future trajectories over a short horizon. To overcome the limitations of the linear dynamic model, Karasev et al. [26] proposed another model-based approach that predicts future trajectories by modeling the behavior of the target as a Markov process. Malviya and Kala [28] presented a trajectory prediction method based on a particle filter to track humans using a limited-field-of-view monocular camera. In contrast, data-driven methods do not model target behavior explicitly; they rely mainly on trajectory datasets collected under multiple scenarios and attempt to learn the behavior of targets from the data. Alahi et al. [29] proposed Social LSTM, which predicts pedestrians’ trajectories by exploiting interactions between pedestrians on roadways. However, the model is computationally expensive because the social pooling operation must consider the interactions between all pedestrians in the scene. Gupta et al. [30] proposed Social-GAN to overcome the limitations of Social LSTM by introducing generative adversarial networks and a global pooling mechanism. Among the above methods, RNNs and their variants have become an important part of many recent trajectory prediction models [32, 33] due to their powerful processing capability for time-series data. However, RNNs and their variants cannot be computed in parallel due to their sequential structure and have a poor ability to extract long-term dependencies. Therefore, our method mainly uses the attention mechanism to overcome these shortcomings. In previous work, researchers have tended to focus more on modeling the interaction with other agents while ignoring the context information (road infrastructure) of the static scene. However, the context information of a scene is just as important as (or even more important than) the dynamic information of other agents. Therefore, we use ESPNet to extract semantic features of the scene and introduce IRL to generate the optimal path strategy from the semantic features and historical trajectories. The scene information and trajectory information are then closely combined to accurately predict the future trajectory.
Scene understanding The semantic segmentation network greatly helps trajectory prediction by providing feasible regions. For scene understanding, Visin et al. [34] and Bell et al. [35] used RNNs to pass information along each row or column of the scene, but this results in a single RNN layer in which each pixel position can only obtain information from the same row or column. Liang et al. [36] proposed a variant of LSTM to exploit the context in the scene, but it suffers from expensive computational costs. Currently, researchers rely on CNN-based methods to understand scenes [37–39]. Ronneberger et al. [37] proposed the U-Net semantic segmentation network, which relies on data augmentation to efficiently utilize the available annotated samples. DUC [40], DeepLabv3 [41], and PSPNet [42] use dilated convolutions to preserve the spatial size of feature maps. Orhan and Bastanlar [43] proposed a CNN-based semantic segmentation model that utilizes equirectangular convolutions to handle distortions in panoramic images. These methods can precisely describe the scene’s semantic information, but their heavy computational overheads result in a slow inference speed, which is not suitable for real-time trajectory prediction. Therefore, we introduce a lightweight semantic segmentation network, ESPNet, which has an extremely high inference speed and can segment scene images quickly and efficiently. Furthermore, the IRL method is introduced to generate the optimal path strategy by using the semantic information of the scene and pedestrians’ observed trajectories, helping the network to understand the scene deeply.
Transformers The Transformer is a deep learning architecture proposed by Google in 2017. It has achieved great success in the field of NLP [44–46]. Due to its unique attention mechanism and excellent performance in NLP, researchers have shown great interest in applying it to trajectory prediction. Giuliari et al. [20] used the vanilla Transformer without considering any complex interaction information and achieved satisfactory results. Yu et al. [21] proposed the STAR architecture to model interaction information in space and time. Achaji et al. [47] introduced PReTR, which utilizes a decomposed spatio-temporal attention module to extract features from multi-agent scenarios. Yao et al. [48] proposed an end-to-end Transformer network with a self-correcting scheme to enhance the model’s robustness. These methods make use of the powerful feature extraction ability of the Transformer and perform well in trajectory prediction. However, they only use the Transformer to extract one class of features and ignore the multi-feature fusion ability of the cross-attention module. In this work, we improve the vanilla Transformer and propose an efficient scene-fusion Transformer, which can simultaneously fuse trajectory features with scene information.

Methods

We focus on the fusion of scene information and trajectory information to improve the trajectory prediction performance. The framework of the proposed MGSU is illustrated in Fig. 2. MGSU fully utilizes the scene information by gradually integrating different granularity of scene information to model the interaction between trajectory and scene. Firstly, the semantic information of the scene image is extracted by the coarse-grained fusion module and fused with the trajectory information through the cross-attention module to output the motion representation at a coarse-grained level. Then, the motion representation of coarse-grained fusion is fed into the IRL module to generate the optimal path strategy through the grid-based policy sampling. Afterward, the IRL module outputs multiple scene paths which are accurate and can provide the scope of future paths at a fine-grained level. Finally, the scene paths and observed trajectories are fused in the fine-grained feature fusion module to generate multiple future trajectories. The difference between the coarse-grained fusion and the fine-grained fusion mainly lies in the precision of information presentation and the way of fusion. Coarse-grained fusion is the fusion of trajectory and semantic features which provide information about the category of the identified object and the scope of future trajectories at a coarse-grained level through cross-attention. Fine-grained fusion is the fusion of trajectory and path features that provide the scope of future paths at a fine-grained level through a scene-fusion Transformer. Table 1 presents all control parameters used in this paper. Details of different modules are described as follows.
Table 1
Description of control parameters

\(\text {bs}\): the batch size of the model
\(s_{i}\): the size of the scene image
\(s_{{\mathrm{fc}}}\): the size of the fully connected layer
\(d_{{\mathrm{grid}}}\): the dimension of the 2D grid
\(s_{{\mathrm{sf}}}\): the size of the scene feature
\(W_{\mathrm{e}}\): learnable parameters of ESPNet
\(W_{\mathrm{f}}\): learnable parameters of the fully connected layer
\(W_{\mathrm{l}}\): learnable parameters of the linear layer
\(e_{{\mathrm{model}}}\): the embedding size of the model
\(N_{\mathrm{l}}\): the number of layers of the scene-fusion Transformer
\(\hbox {heads}\): the number of heads of the multi-head attention
\(\text {lr}\): the learning rate of the model
N: the number of approximation iterations
\(W^p_{i}\): the weight matrix of a linear transformation for paths \((i=Q,K,V)\)
\(W^t_{i}\): the weight matrix of a linear transformation for the trajectory \((i=Q,K,V)\)

Coarse-grained fusion module

The scene semantic information is related to the object categories in the scene. Therefore, the model can learn about passable areas for pedestrians (such as roads and crosswalks) from the semantic information, reducing the uncertainty of objects in the scene. The semantic information provides a basis for predicting the future trajectory according to the category of the identified object. However, this basis can only provide the scope of future trajectories at a coarse-grained level. As illustrated in Fig. 2, the coarse-grained fusion module uses the cross-attention mechanism to fuse observed trajectories with corresponding scene images and outputs a coarse-grained motion mixture representation. This module includes a semantic segmentation network, a fully connected layer, and two cross-attention modules. ESPNet is used to extract semantic features from the input scene S. Meanwhile, a fully connected layer maps the observed trajectory \(T_{{\mathrm{obs}}}\) to a high-dimensional feature space to facilitate feature extraction. Afterward, the two cross-attention modules are used to calculate the attention of the scene to the trajectory \(A_{\mathrm{s}}\) and the attention of the trajectory to the scene \(A_{\mathrm{t}}\). The two attentions are concatenated to generate the coarse-grained motion representation \(C_{\mathrm{h}}\), as follows:
$$\begin{aligned}&S_{\mathrm{e}}=\hbox {ESPNet}\left( S,W_{\mathrm{e}}\right) , \end{aligned}$$
(1)
$$\begin{aligned}&T_{\mathrm{fo}}= FC\left( T_{{\mathrm{obs}}},W_{\mathrm{f}}\right) , \end{aligned}$$
(2)
$$\begin{aligned}&A_{\mathrm{t}}=\frac{\hbox {Softmax}\left( T_{\mathrm{fo}}\cdot S_{\mathrm{e}}^{\mathrm{T}}\right) }{\sqrt{d_{\mathrm{e}}}}S_{\mathrm{e}}, \end{aligned}$$
(3)
$$\begin{aligned}&A_{\mathrm{s}}=\frac{\hbox {Softmax}\left( S_{\mathrm{e}} \cdot T_{\mathrm{fo}}^{\mathrm{T}}\right) }{\sqrt{d_{\mathrm{t}}}}T_{\mathrm{fo}}, \end{aligned}$$
(4)
$$\begin{aligned}&C_{\mathrm{h}}=\hbox {Cat}\left( A_{\mathrm{t}},A_{\mathrm{s}}\right) , \end{aligned}$$
(5)
where \(W_{\mathrm{e}}\) and \(W_{\mathrm{f}}\) denote the parameters of ESPNet and the fully connected layer, respectively. \(S_{\mathrm{e}}\) denotes the scene semantic feature output by ESPNet, and \(T_{\mathrm{fo}}\) denotes the trajectory features output by the fully connected layer. The superscript \(\mathrm{T}\) denotes matrix transposition. \(d_{\mathrm{e}}\) and \(d_{\mathrm{t}}\) denote the dimensions of \(S_{\mathrm{e}}\) and \(T_{\mathrm{fo}}\), respectively. \(\hbox {Softmax}()\) denotes the softmax activation function, and \(\hbox {Cat}()\) denotes concatenation.
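The data flow of Eqs. (1)–(5) can be sketched in a few lines of PyTorch. This is only a minimal illustration: the tensor shapes, the embedding size of the fully connected layer, and the flattening of the ESPNet feature map are assumptions, and the \(\sqrt{d}\) scaling is applied exactly where the equations place it.

```python
import torch

def cross_attention(query, key_value, dim):
    # Softmax(Q . K^T) / sqrt(d) . V, following the written form of Eqs. (3)-(4)
    scores = torch.softmax(query @ key_value.transpose(-2, -1), dim=-1) / dim ** 0.5
    return scores @ key_value

def coarse_grained_fusion(traj_obs, scene_feat, fc):
    """traj_obs: (B, T_obs, 2); scene_feat: (B, N_s, D) semantic features from ESPNet."""
    t_fo = fc(traj_obs)                            # Eq. (2): embed the observed trajectory
    d_e, d_t = scene_feat.size(-1), t_fo.size(-1)
    a_t = cross_attention(t_fo, scene_feat, d_e)   # Eq. (3): trajectory attends to the scene
    a_s = cross_attention(scene_feat, t_fo, d_t)   # Eq. (4): scene attends to the trajectory
    return torch.cat([a_t, a_s], dim=1)            # Eq. (5): coarse-grained representation C_h

# toy usage with illustrative sizes (batch 4, 8 observed steps, 64-d features)
fc = torch.nn.Linear(2, 64)
scene = torch.randn(4, 100, 64)   # e.g. a flattened ESPNet feature map
C_h = coarse_grained_fusion(torch.randn(4, 8, 2), scene, fc)
print(C_h.shape)  # torch.Size([4, 108, 64])
```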

Inverse reinforcement learning module

The inverse reinforcement learning module is introduced to generate concrete scene representation based on the coarse-grained motion representation. The core of IRL is to reverse the reward function according to the expert example and generate the optimal strategy according to the reward function. This module takes the coarse-grained motion representation as input, and outputs multiple scene paths.
Firstly, the path reward map \(r_{\mathrm{path}}\) and goal reward map \(r_{\mathrm{goal}}\) are generated according to coarse-grained motion representation as follows:
$$\begin{aligned}&r_{\mathrm{path}}=\text {MLP}_{\mathrm{path}}(C_{\mathrm{h}}), \end{aligned}$$
(6)
$$\begin{aligned}&r_{\mathrm{goal}}=\text {MLP}_{\mathrm{goal}}(C_{\mathrm{h}}), \end{aligned}$$
(7)
where \(\text {MLP}_{\mathrm{path}}\) and \(\text {MLP}_{\mathrm{goal}}\) denote two multi-layer perceptrons with the same structure, which provide reward values for reinforcement learning. \(r_{\mathrm{path}}\) provides rewards for action choices, and \(r_{\mathrm{goal}}\) provides the reward for terminating a path.
Afterward, to obtain the maximum-entropy strategy \(\pi _{\theta }\left( a\mid s \right) \), which represents the probability of taking action a in state s, we use approximate iteration as shown in Algorithm 1, where V(s) denotes the state log-partition function, Q(s, a) denotes the state–action log-partition function, N is the total number of iterations, \(T\left( s,a\right) \) denotes the cross product of s and a, and \(S_{p}\) and \(S_{g}\) denote the states of \(r_{\mathrm{path}}\) and \(r_{\mathrm{goal}}\), respectively, which have the same dimensions as the 2D grid. Different probability values around the state of the reward map give the probabilities of different actions for the target’s action selection. The target then chooses the action with the highest moving probability.
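Algorithm 1 itself is not reproduced here, so the following is only a generic MaxEnt soft value-iteration sketch over a 2D grid, assuming a five-action move set and wrap-around neighbours; the function name, action set, and iteration count are illustrative rather than the paper’s exact procedure.

```python
import torch

def soft_value_iteration(r_path, r_goal, n_iters=30):
    """Generic MaxEnt soft value iteration on an (H, W) grid.
    Returns a policy of shape (H, W, 5) over the actions (stay, up, down, left, right)."""
    H, W = r_path.shape
    actions = [(0, 0), (-1, 0), (1, 0), (0, -1), (0, 1)]
    V = torch.full((H, W), -1e9)
    for _ in range(n_iters):
        Q = torch.empty(H, W, len(actions))
        for k, (dy, dx) in enumerate(actions):
            # value of the neighbouring cell reached by this action (edges wrap in this toy version)
            V_next = torch.roll(V, shifts=(-dy, -dx), dims=(0, 1))
            Q[..., k] = r_path + V_next
        V = torch.logsumexp(Q, dim=-1)      # soft backup over actions
        V = torch.maximum(V, r_goal)        # goal reward acts as a terminal value
    return torch.softmax(Q, dim=-1)         # pi(a|s) = exp(Q - logsumexp(Q))

# toy usage on the 25x25 grid of Table 1
policy = soft_value_iteration(torch.randn(25, 25), torch.randn(25, 25))
print(policy.shape)  # torch.Size([25, 25, 5])
```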
Finally, the Gumbel-Softmax trick is introduced to sample the scene paths: Gumbel noise is added to the log-probabilities and the argmax gives the selected actions, from which the ith scene path \(P_{(i)}\) is obtained, as follows:
$$\begin{aligned}&\hbox {noise}=\hbox {Gumbel}(\log (\Sigma _{a}\pi _{\theta }(a,s))), \end{aligned}$$
(8)
$$\begin{aligned}&a = \hbox {argmax}(\log (\Sigma _{a}\pi _{\theta }(a,s))+\hbox {noise}), \end{aligned}$$
(9)
where \(\hbox {noise}\) denotes Gumbel noise, and a denotes the final action choice.
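The Gumbel-max sampling of Eqs. (8)–(9) can likewise be sketched briefly; sampling several paths with independent noise is what yields multiple distinct scene paths. The rollout helper, the move set, and the path length are assumptions, and `policy` refers to the toy value-iteration sketch above.

```python
import torch

def sample_action(log_probs):
    """Gumbel-max trick: argmax(log pi + Gumbel noise) is a sample from pi (cf. Eqs. (8)-(9))."""
    gumbel = -torch.log(-torch.log(torch.rand_like(log_probs)))  # standard Gumbel noise
    return torch.argmax(log_probs + gumbel, dim=-1)

def rollout_path(policy, start, n_steps=12):
    """Roll out one scene path on the grid by repeatedly sampling actions from pi(a|s)."""
    moves = torch.tensor([(0, 0), (-1, 0), (1, 0), (0, -1), (0, 1)])
    pos, path = torch.tensor(start), [tuple(start)]
    for _ in range(n_steps):
        a = sample_action(torch.log(policy[pos[0], pos[1]] + 1e-9))
        pos = (pos + moves[a]).clamp(0, policy.shape[0] - 1)
        path.append((pos[0].item(), pos[1].item()))
    return path

# e.g. several independent rollouts from the initial state [12, 12] give multiple scene paths
# paths = [rollout_path(policy, [12, 12]) for _ in range(5)]
```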

Fine-grained fusion module

The path information obtained by the IRL module is produced through grid-based policy sampling of the optimal path policy; it explores various passable future paths of the scene and provides the scope of future paths at a fine-grained level. This module uses a scene-fusion Transformer to enhance the network’s understanding of the scene based on the scene paths and to output multiple feasible future trajectories. We present the scene-fusion Transformer to integrate the multiple scene paths generated by the inverse reinforcement learning module with the observed trajectories at a fine granularity. Below, we describe the fine-grained fusion module in detail in terms of the scene-fusion Transformer, feature extraction, feature fusion, and output.

Scene-fusion Transformer

Figure 3 shows the architecture of the vanilla Transformer. We improve the vanilla Transformer based on scene fusion to make it more suitable for trajectory prediction. Considering the importance of real-time performance for trajectory prediction, we use a sparse self-attention mechanism to extract features, reducing the computational complexity from \(O\left( L^{2}\right) \) to \(O\left( L\log L\right) \), where L denotes the length of the input sequence. The computational efficiency is improved while the performance remains the same as the traditional method. Besides, we adopt a parallel decoding strategy to predict future trajectories directly instead of auto-regressively, which improves the prediction and inference speed while reducing error accumulation. Since we adopt a non-autoregressive training strategy, we directly remove the mask in the decoder. The overall architecture of the scene-fusion Transformer is shown in Fig. 4.

Feature extraction

The scene path information \(P_{(i)}\) output by the IRL module is used for feature extraction in the scene-fusion Transformer encoder. Features of the observed trajectory \(T_{{\mathrm{obs}}}\) are extracted by the fully connected layer.
After the scene path information is input into the encoder, it is multiplied by linear transformations with weights \(W_{Q}^p\), \(W_{K}^p\), and \(W_{V}^p\), respectively, to output three matrices \(Q_{p}\), \(K_{p}\), and \(V_{p}\). The query sparsity measurement is adopted from [23], defining the i-th query’s attention on all keys as a probability \(p\left( k\mid q_{i}\right) \). If this probability is close to the uniform distribution \(q\left( k\mid q_{i}\right) \), the self-attention is redundant with respect to the input. Therefore, the similarity between distributions p and q can be used to distinguish which queries are “important.” The Kullback–Leibler divergence is used to measure this similarity, and the i-th query’s sparsity measurement is defined as \(M\left( q_{i}\mid K\right) \). The sparse matrix of \(Q_{p}\) is \(\bar{Q}\), which has the same size as \(Q_{p}\) and only contains the Top-u queries under the sparsity measurement \(M\left( Q_{p},K_{p}\right) \). The attention \(A_{p}\) of the scene path is calculated by the multi-head sparse attention module, as follows:
$$\begin{aligned}&Q_{p}=W_{Q}^{p}P_{(i)},K_{p}=W_{K}^{p}P_{(i)},V_{p}=W_{V}^{p}P_{(i)}, \end{aligned}$$
(10)
$$\begin{aligned}&A_{p}=\frac{\hbox {Softmax}(\bar{Q}\cdot K_{p}^{\mathrm{T}})}{\sqrt{d_{k}}}V_{p}, \end{aligned}$$
(11)
where \(d_{k}\) denotes the dimension of \(K_{p}\), and the superscript \(\mathrm{T}\) denotes matrix transposition.
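As a rough illustration of how the Top-u queries behind \(\bar{Q}\) can be selected (following the ProbSparse idea of Informer [23]), the sketch below scores each query with a max-minus-mean surrogate of the KL-based sparsity measurement. Informer additionally samples keys to keep the cost at \(O(L\log L)\); that step is omitted here, so this toy version is still quadratic.

```python
import torch

def prob_sparse_attention(Q, K, V, u):
    """Toy ProbSparse-style attention: only the top-u 'important' queries attend;
    the remaining queries fall back to the mean of V (a simplified sketch of Informer's scheme)."""
    d_k = K.size(-1)
    scores = Q @ K.transpose(-2, -1) / d_k ** 0.5           # (B, L_q, L_k), full scores for clarity
    M = scores.max(dim=-1).values - scores.mean(dim=-1)     # sparsity measurement M(q_i, K)
    top_idx = M.topk(u, dim=-1).indices                     # indices of the top-u queries
    out = V.mean(dim=-2, keepdim=True).expand(Q.size(0), Q.size(1), V.size(-1)).clone()
    attn = torch.softmax(scores.gather(
        1, top_idx.unsqueeze(-1).expand(-1, -1, K.size(-2))), dim=-1) @ V
    out.scatter_(1, top_idx.unsqueeze(-1).expand(-1, -1, V.size(-1)), attn)
    return out

# toy usage: 4 path sequences of length 12 with 64-d features, keeping the 4 sharpest queries
x = torch.randn(4, 12, 64)
print(prob_sparse_attention(x, x, x, u=4).shape)  # torch.Size([4, 12, 64])
```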
The output of the encoder \(E_{p}\) is obtained through a fully connected layer with residual connections, as follows:
$$\begin{aligned} E_{p}=\hbox {ResBlock}(\text {MLP}(A_{p})+P_{(i)}), \end{aligned}$$
(12)
where \(\hbox {ResBlock}()\) denotes the residual connection, \(\text {MLP}()\) denotes the fully connected layer.
The trajectory information \(T_{\mathrm{in}}=\{T_{{\mathrm{obs}}},T_{0}\}\) (where \(T_{0}\) denotes the placeholder for the future trajectory, filled with zeros) is embedded by a linear layer, as follows:
$$\begin{aligned} T_{\mathrm{li}}=\hbox {Linear}(T_{\mathrm{in}},W_{\mathrm{l}}), \end{aligned}$$
(13)
where \(W_{\mathrm{l}}\) denotes the parameters of the linear layer.
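To make the generative decoding concrete, here is a brief sketch of how the decoder input of Eq. (13) can be assembled: the observed steps are concatenated with a zero placeholder for the whole future horizon, so the decoder can emit all future positions in one forward pass instead of step by step. The horizon lengths and embedding size below are illustrative.

```python
import torch

obs_len, pred_len, d_model = 8, 12, 512
embed = torch.nn.Linear(2, d_model)

def build_decoder_input(traj_obs):
    """traj_obs: (B, obs_len, 2). Returns T_li = Linear([T_obs, T_0]) as in Eq. (13),
    where the future part T_0 is a zero placeholder (non-autoregressive, no causal mask)."""
    t_zero = torch.zeros(traj_obs.size(0), pred_len, 2)   # placeholder for the unknown future steps
    t_in = torch.cat([traj_obs, t_zero], dim=1)           # (B, obs_len + pred_len, 2)
    return embed(t_in)                                    # (B, obs_len + pred_len, d_model)

print(build_decoder_input(torch.randn(4, obs_len, 2)).shape)  # torch.Size([4, 20, 512])
```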

Feature fusion and output

In the decoder stage, the trajectory features \(T_{\mathrm{li}}\) are fed into the decoder. The attention over the observed trajectories is calculated as follows:
$$\begin{aligned}&Q_{\mathrm{t}}=W_{Q}^{t}T_{\mathrm{li}},K_{\mathrm{t}}=W_{K}^{t}T_{\mathrm{li}}, V_{\mathrm{t}}=W_{V}^{t}T_{\mathrm{li}}, \end{aligned}$$
(14)
$$\begin{aligned}&A_{\mathrm{t}}=\frac{\hbox {Softmax}(\bar{Q}\cdot K_{\mathrm{t}}^{\mathrm{T}})}{\sqrt{d_{k}}}V_{\mathrm{t}}, \end{aligned}$$
(15)
where \(W_{Q}^{t}\), \(W_{K}^{t}\), and \(W_{V}^{t}\) denote different weight matrices, \(\bar{Q}\) denotes the sparse matrix of \(Q_{\mathrm{t}}\), the superscript \(\mathrm{T}\) denotes matrix transposition, and \(d_{k}\) denotes the dimension of \(K_{\mathrm{t}}\).
Through the above calculations, the feature output of the scene path \(E_{p}\) and the feature attention of the observed trajectory \(A_{\mathrm{t}}\) are obtained. Afterward, the cross-attention mechanism is used to integrate them and the predicted trajectory \(T_{p}\) is obtained through a fully connected layer, as follows:
$$\begin{aligned}&A_{\mathrm{cross}}=\frac{\hbox {Softmax}(A_{\mathrm{t}} \cdot E_{p}^{\mathrm{T}})}{\sqrt{d_{\mathrm{e}}}}E_{p}, \end{aligned}$$
(16)
$$\begin{aligned}&T_{p}=\text {MLP}(A_{\mathrm{cross}},W_{m}), \end{aligned}$$
(17)
where \(A_{\mathrm{cross}}\) denotes the cross-attention of \(A_{\mathrm{t}}\) and \(E_{p}\), the superscript \(\mathrm{T}\) denotes matrix transposition, \(d_{\mathrm{e}}\) denotes the dimension of \(E_{p}\), \(\text {MLP}()\) is the multi-layer perceptron, and \(W_{m}\) denotes the weight matrix of \(\text {MLP}()\).
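The decoder-side fusion of Eqs. (14)–(17) can be condensed into a small module. Standard dense `nn.MultiheadAttention` is used here in place of the sparse variant, and the two-layer MLP head and all sizes are simplifying assumptions rather than the paper’s exact implementation.

```python
import torch
import torch.nn as nn

class FusionDecoderBlock(nn.Module):
    """Sketch of the decoder fusion: self-attention over the trajectory embedding,
    cross-attention onto the encoder's path features E_p, then an MLP head (Eqs. (14)-(17))."""
    def __init__(self, d_model=512, heads=8, pred_len=12):
        super().__init__()
        self.self_attn = nn.MultiheadAttention(d_model, heads, batch_first=True)
        self.cross_attn = nn.MultiheadAttention(d_model, heads, batch_first=True)
        self.head = nn.Sequential(nn.Linear(d_model, 128), nn.ReLU(), nn.Linear(128, 2))
        self.pred_len = pred_len

    def forward(self, t_li, e_p):
        a_t, _ = self.self_attn(t_li, t_li, t_li)       # attention over observed + placeholder steps
        a_cross, _ = self.cross_attn(a_t, e_p, e_p)     # fuse trajectory attention with path features
        return self.head(a_cross)[:, -self.pred_len:]   # keep the generative future part as T_p

# toy usage: t_li from the decoder-input sketch above, e_p from the encoder
dec = FusionDecoderBlock()
print(dec(torch.randn(4, 20, 512), torch.randn(4, 12, 512)).shape)  # torch.Size([4, 12, 2])
```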
Table 2
Ablation study of the MGSU on the SDD dataset

Scene-fusion Transformer | ESPNet | Cross-attention | IRL | ADE | FDE
\(\checkmark \) | | | | 20.19 | 32.93
\(\checkmark \) | \(\checkmark \) | | | 15.89 | 29.18
\(\checkmark \) | \(\checkmark \) | \(\checkmark \) | | 13.41 | 24.32
\(\checkmark \) | \(\checkmark \) | \(\checkmark \) | \(\checkmark \) | 9.24 | 15.87

Experiments

Datasets

Stanford drone dataset The Stanford drone dataset (SDD) consists of the tracks of pedestrians, bicyclists, skateboarders, and vehicles captured by drones in 60 different scenes at Stanford University. It provides a bird’s-eye view of each scene and the locations of the tracked agents in the pixel coordinates of the scene. SDD contains multiple scene elements, such as roads, sidewalks, buildings, parking lots, terrain, and foliage. Roads and sidewalks come in different configurations, including roundabouts and intersections. We use the evaluation setting defined in the TrajNet benchmark, which splits the dataset by scene, so the training, validation, and test sets contain different scenes out of the 60 in total. This allows us to evaluate our model on unseen scenes for which no previous trajectory data are available.
NuScenes NuScenes is a large-scale autonomous driving dataset created by the autonomous driving company nuTonomy. It covers 1000 different scenes, each recorded for 20 s and containing different road layouts. All data were captured using on-board cameras and LiDAR sensors. The official split is used to generate the training and test sets for evaluation.

Evaluation indicators

Two error metrics are used to evaluate the trajectory prediction performance, and a small computation sketch of both follows the definitions below:
(1)
Average Displacement Error (ADE): the average Euclidean distance between the predicted and ground-truth positions over all prediction time steps, as follows:
$$\begin{aligned} \hbox {ADE}=\frac{1}{N}\Sigma _{t=1}^{N}\left\| Y_{\mathrm{t}}^{\mathrm{GT}}-Y_{\mathrm{t}}^{\mathrm{pred}} \right\| _{2}, \end{aligned}$$
(18)
where \(Y_{\mathrm{t}}^{\mathrm{GT}}\) and \(Y_{\mathrm{t}}^{\mathrm{pred}}\) represent the ground-truth and predicted positions at time step t, respectively, and N represents the total number of prediction time steps.
 
(2)
Final Displacement Error (FDE): the Euclidean distance between the ground-truth and predicted positions at the last time step n, as follows:
$$\begin{aligned} \hbox {FDE}=\left\| Y_{n}^{\mathrm{GT}}-Y_{n}^{\mathrm{pred}} \right\| _{2}. \end{aligned}$$
(19)
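As referenced above, here is a small sketch of how ADE/FDE and their best-of-K variants (e.g. ADE_min20/FDE_min20 on SDD) can be computed; the tensor layout is an assumption.

```python
import torch

def ade_fde(pred, gt):
    """pred, gt: (K, N, 2) = K sampled trajectories over N future steps.
    Returns the best-of-K (min over samples) ADE and FDE."""
    dist = torch.linalg.norm(pred - gt, dim=-1)   # (K, N) Euclidean error per time step
    ade = dist.mean(dim=-1)                       # Eq. (18): average over the N time steps
    fde = dist[:, -1]                             # Eq. (19): error at the last time step
    return ade.min().item(), fde.min().item()

# toy usage: 20 samples of a 12-step prediction, as in ADE_min20 / FDE_min20
pred = torch.randn(20, 12, 2)
gt = torch.randn(1, 12, 2).expand(20, 12, 2)
print(ade_fde(pred, gt))
```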
 

Experimental details

Samples of SDD and NuScenes are generated following [49]. For SDD, 3.2-s observed trajectories and 4.8-s ground-truth trajectories are used. For NuScenes, 2-s observed trajectories and 6-s ground-truth trajectories are used. The input scene image is centered at the position of the last observation, and the image size \((s_{i})\) is set to \(200\times 200\) pixels. In the coarse-grained fusion module, we only use the encoder of ESPNet, which is pre-trained on ADE20K. The size of the fully connected layer \((s_{{\mathrm{fc}}})\) is set to 128. In the IRL module, the dimension of the 2D grid \((d_{{\mathrm{grid}}})\) is [25, 25], the initial state is set to [12, 12], and the size of the scene feature \((s_{{\mathrm{sf}}})\) is set to 64. In the scene-fusion Transformer of the fine-grained fusion module, the embedding size of the model \((e_{{\mathrm{model}}})\) is set to 512, the number of layers \((N_{\mathrm{l}})\) is set to 6, the number of heads of the multi-head attention \((\hbox {heads})\) is set to 8, and the dropout of the network is set to 0.01. ADE_min20 and FDE_min20 are used as evaluation metrics on SDD, and ADE_min10 and FDE_min10 on NuScenes. All experiments are implemented on Ubuntu with the PyTorch framework and run on an Nvidia 2080 GPU. The number of training epochs is 300, the Adam optimizer is used, the batch size \((\text {bs})\) is set to 32, and the learning rate \((\text {lr})\) is set to 0.0001.
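For convenience, the control parameters reported in this section can be collected into a single configuration dictionary; this is only a restatement of the values above, and the key names are arbitrary.

```python
# A convenience restatement of the hyperparameters listed above; key names are arbitrary.
config = {
    "obs_len_sdd_s": 3.2, "pred_len_sdd_s": 4.8,      # SDD horizons (seconds)
    "obs_len_nusc_s": 2.0, "pred_len_nusc_s": 6.0,    # NuScenes horizons (seconds)
    "scene_image_size": (200, 200),                   # s_i, in pixels
    "fc_size": 128,                                   # s_fc
    "grid_dim": (25, 25), "initial_state": (12, 12),  # IRL grid
    "scene_feature_size": 64,                         # s_sf
    "e_model": 512, "n_layers": 6, "heads": 8,        # scene-fusion Transformer
    "dropout": 0.01,
    "epochs": 300, "optimizer": "Adam",
    "batch_size": 32, "learning_rate": 1e-4,
}
```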

Ablation experiments

The ablation study is performed on SDD to analyze the impact of each module. The scene-fusion Transformer that deals only with trajectory data is denoted as MGSU-A. Based on MGSU-A, MGSU-B introduces the ESPNet and cross-attention modules. Finally, MGSU-C adds the IRL module and performs fine-grained fusion on top of MGSU-B. Table 2 reports the results of the ablation study. Detailed discussions are presented as follows:
MGSU-A: To analyze the influence of scene information on trajectory prediction, the scene-fusion Transformer is used to process the observed trajectory without considering any scene information. The results show that the prediction performance is unsatisfactory when the model considers only the observed trajectory and ignores the scene information (ADE/FDE of 20.19/32.93). Such high ADE/FDE values indicate that predicting agents’ future trajectories from their observed trajectories alone is insufficient, because it ignores other factors that affect future movement trends, such as agents’ interactions and the influence of the scene.
MGSU-B: This variant considers the context information of the scene. The ESPNet and cross-attention modules are added to perform coarse-grained fusion of the scene context with the observed trajectories. ESPNet is utilized to explore the semantic meaning of objects such as roads, trees, and buildings contained in the scene, so that the predicted trajectories fall in the feasible regions of the scene. After concatenating the outputs of the scene-fusion Transformer and ESPNet, the ADE/FDE values decrease by 4.3 and 3.75, respectively. Afterward, the cross-attention module is used to fuse the trajectory and scene information. Specifically, the attention mechanism makes the trajectory pay more attention to important areas of the scene, such as sidewalks and roads, so that the trajectory information integrates better with the scene information. Therefore, the ADE/FDE values further decrease by 2.48 and 4.86, respectively.
MGSU-C: This variant introduces the IRL module to generate scene paths and uses the fine-grained fusion module to fuse the trajectory and path information. Concretely, the output of the coarse-grained fusion module is fed into the IRL module to concretize the fused information. After several feasible scene paths are output, fine-grained fusion with the observed trajectories generates the final prediction results. Compared with MGSU-B, the ADE/FDE values of MGSU-C decrease by 4.17 and 8.45, respectively. Such an improvement verifies the effect of the fine-grained fusion module.

Evaluation of the scene-fusion Transformer

In this work, a novel efficient scene-fusion Transformer is proposed to improve the fusion efficiency. Specifically, the encoder processes the scene features, and the decoder extracts the trajectory features and fuses them with the scene features to generate the prediction results. Since the traditional attention module has a high computational complexity, the sparse attention mechanism is introduced to reduce the computational complexity from \(O(L^2)\) to \(O(L\log L)\), where L represents the length of the trajectory. Meanwhile, the decoder uses generative decoding, so the prediction results are obtained in one step when predicting the future trajectory instead of being generated step by step, reducing the time complexity of prediction from O(N) to O(1). Figures 5 and 6 illustrate the comparison with the vanilla Transformer in terms of training speed and prediction accuracy using the same model parameters.
Training speed Figure 5 compares the training speed of the two models when the number of training epochs is set to 10, 30, and 80. The training speed of the scene-fusion Transformer is greatly improved compared to the vanilla Transformer, by 73.3\(\%\), 75.2\(\%\), and 75.4\(\%\), respectively. Such an improvement reflects the advantages of the proposed architecture in terms of efficiency.
Prediction accuracy Figure 6 compares the ADE of the two models when the number of training epochs is set to 90. The two methods have roughly the same accuracy at the beginning. Then, around 30–70 epochs, the ADE of the vanilla Transformer declines slowly and plateaus at around 14. In contrast, the ADE of the scene-fusion Transformer continues to decline, and the gap with the vanilla Transformer gradually widens. Finally, during 70–90 epochs, the scene-fusion Transformer stabilizes at a final ADE of 10.03, improving the prediction accuracy by 27.5\(\%\).

Evaluation of the fusion methods in the coarse-grained feature fusion

This section discusses the different fusion methods used in the coarse-grained fusion stage. The observed trajectory \(T_{{\mathrm{obs}}}\) and the corresponding scene image S are fed into the network through a fully connected layer and ESPNet, respectively. Afterward, we compare the performance of different fusion methods on the SDD dataset, including concatenation, addition, and cross-attention. The fusion results are fed directly into the fine-grained fusion module, bypassing the IRL module, to better isolate the performance of the fusion methods. As presented in Table 3, the concatenation-based fusion slightly outperforms simple addition, while the cross-attention fusion method yields the best prediction accuracy.
Table 3
Fusion evaluation experiment on different fusion methods

Metric | Addition | Concatenation | Cross-attention
ADE | 17.86 | 15.89 | 12.13
FDE | 32.30 | 29.23 | 21.45

The bold in the table represents the optimal value of the fusion experiment
Table 4
Comparison with the baseline models on the SDD dataset

Metric | SGAN | SoPhie | P2TIRL | SimAug | PECNet | IRLSOT | MGSU-A | MGSU-B | MGSU-C
ADE | 27.23 | 16.27 | 12.58 | 10.27 | 9.96 | 9.66 | 20.19 | 13.41 | 9.24
FDE | 41.44 | 29.38 | 22.07 | 19.71 | 15.88 | 13.05 | 32.93 | 24.32 | 15.87

The bold in the table represents the optimal value among all comparison methods and the optimal value of our model

Quantitative analysis

In this section, the quantitative analysis is performed by comparing MGSU with state-of-the-art methods on the SDD and NuScenes. Methods used for comparisons are briefly introduced as follows:
SGAN SGAN [30] uses an encoder–decoder network to learn pedestrian movement patterns in an adversarial way. The social pooling operation is used to capture pedestrians’ social interactions.
SoPhie SoPhie [50] combines a social attention mechanism with physical attention to help the model learn its position in a large scene and extract the most salient parts of the path-related image. It also uses a GAN to generate more realistic samples and capture the uncertainty of future paths by modeling their distribution.
P2TIRL P2TIRL [49] proposes an attention-based trajectory generator that generates future trajectories from a sequence of states sampled from the MaxEnt policy. It reformulates MaxEnt IRL to let the policy jointly infer plausible agent goals and paths to those goals on a coarse 2-D grid defined over the scene.
SimAug SimAug [51] learns robust representations by augmenting simulated training data, allowing the representations to generalize better to unseen real-world test data. Its key idea is to combine features from the hardest camera view with adversarial features from the original view.
PECNet PECNet [52] presents a pedestrian endpoint conditioned trajectory prediction network that can predict rich and diverse multi-modal socially compliant trajectories across a variety of scenes.
IRLSOT IRLSOT [25] proposes inverse reinforcement learning for scene-oriented trajectory prediction to better forecast pedestrians’ future trajectories under rare or complex environments.
Physics oracle Physics oracle [53] is a simple and explainable extension of classical physics models. The current velocity, acceleration, and yaw rate of the trajectory are used for prediction.
CoverNet CoverNet [53] is a method for multimodal probabilistic trajectory prediction for urban driving. It frames the trajectory prediction problem as the classification of a set of distinct trajectories.
SGDNet-ED SGDNet-ED [54] proposes SGNet, a recurrent trajectory prediction network that estimates and uses goals at multiple time scales.
MTP MTP [55] proposes a multi-modal modeling method for vehicle motion prediction. It encodes the context of each vehicle using rasterized grid images and uses a CNN model to generate several possible trajectories with their corresponding probabilities.
Trajectron++ Trajectron++ [2] presents a generative multi-agent trajectory forecasting approach that addresses the desiderata for an open, generally applicable and extensible framework.
Multipath Multipath [56] utilizes a fixed set of future-state-sequence anchors that correspond to modes of the trajectory distribution. It predicts a discrete distribution over anchors and, for each anchor, regresses offsets from the anchor states together with uncertainties, yielding a Gaussian mixture at each time step.
Table 5
Comparison with the baseline models on the NuScenes dataset

Metric | Physics oracle | CoverNet | SGDNet-ED | MTP | Trajectron++ | Multipath | MGSU
ADE | 3.70 | 1.92 | 1.67 | 1.57 | 1.51 | 1.50 | 1.45
FDE | 5.84 | 4.12 | 3.53 | 3.04 | 2.99 | 2.97 | 2.87

The bold in the table represents the optimal value among all comparison methods
Table 4 reports the comparison between our method and SGAN, SoPhie, P2TIRL, SimAug, PECNet, and IRLSOT on SDD. The comparisons verify the effectiveness of fusing scene information for pedestrian trajectory prediction. Among these methods, SGAN only uses trajectory information, so its prediction performance is the lowest. SoPhie and CF-VAE take the scene information into account and use CNNs to extract scene features, so their performance is superior to SGAN’s. In contrast, P2TIRL employs the VGG network and reinforcement learning to further process the scene information, improving the prediction performance. SimAug uses multi-view simulation data to enhance the representation of the prediction model; by adding simulated training data to learn robust representations, it generalizes better to unseen test data. PECNet assists long-range multi-modal trajectory prediction by inferring distant trajectory endpoints, and a novel social pooling layer enables it to consider social interactions, which improves its trajectory prediction performance. IRLSOT exploits an IRL framework to explore complex scenes and utilizes a novel scene-based attention block to fuse scene and trajectory information, achieving sub-optimal performance. Our method combines the semantic segmentation network, cross-attention fusion, IRL, and the scene-fusion Transformer to fully fuse the scene and trajectory at different granularities, realizing an understanding of the scene context and achieving the best performance in terms of average displacement error.
Table 5 compares our method with Physics oracle, CoverNet, MTP, SGDNet-ED, Multipath, and Trajectron++ on NuScenes. Since Physics oracle is a simple model based on classical physics, it struggles to accurately capture the dynamic changes of agent trajectories. CoverNet, MTP, and SGDNet-ED take the scene information into account, resulting in a certain improvement in prediction accuracy. Trajectron++ efficiently incorporates high-dimensional data by encoding semantic maps and proposes a general approach to incorporate dynamic constraints into learning-based multi-agent trajectory prediction, so its prediction accuracy is further improved. Multipath predicts the parameter distribution of agent trajectories in the real world by considering the scene information, which further improves the prediction performance. Our method fuses information at different granularities to fully understand the scene and also achieves the best performance on this dataset.

Qualitative analysis

Qualitative analysis is conducted on SDD and NuScenes to evaluate the trajectory prediction performance after fusing the scene information. Figure 7 shows the predicted trajectories on SDD. The first row shows the observed trajectories (denoted by white lines) and the corresponding scenes. The second row shows the scene paths generated by IRL. The third row illustrates the predicted trajectories (the red and black lines denote predicted and ground-truth trajectories, respectively). Pedestrians’ movements are often affected by the scene environment. MGSU can accurately infer pedestrians’ moving directions and potential destinations after integrating scene information, thus generating feasible future trajectories consistent with the path constraints. As shown in Fig. 7a, MGSU precisely predicts the future trajectory of the target in the case of a straight path. In the case of a curved road, as shown in Fig. 7b, MGSU generates scene paths with a degree of curvature similar to the road, thus constraining the predicted future trajectory. In Fig. 7c, MGSU successfully recognizes the stationary pedestrian. Figure 7d, e show the cases of forked roads, where MGSU generates multiple feasible paths according to the observed trajectories and corresponding scene images. Specifically, as shown in the middle sub-graph of Fig. 7e, our model generates two feasible paths (leading to the upper left and upper right) at the intersection, resulting in a multi-modal distribution of the predicted future trajectory.
Figure 8 demonstrates the trajectory prediction of vehicles on NuScenes. The first row shows the input, including the vehicle trajectories and scene layouts. The second row shows the scene paths generated by IRL. The third row illustrates the predicted trajectories (the white, red, and black lines denote observed, predicted, and ground-truth trajectories, respectively). After fusing the observed trajectories with the corresponding scene layouts, MGSU can generate reasonable paths and precisely infer the moving directions of the lanes. Therefore, the model forecasts future trajectories that are consistent with the scene layout constraints after integrating the scene information. Figure 8a, b show the trajectory prediction performance on straight roads, which are common on highways; MGSU achieves accurate predictions in these cases. In Fig. 8c, MGSU precisely identifies the stationary vehicle. For the highway curves shown in Fig. 8d, e, MGSU predicts scene paths that share similar motion trends with the ground-truth trajectories and then accurately predicts vehicles’ future trajectories in these challenging scenarios, benefiting from its understanding of the road environment. Moreover, Fig. 8e shows that multiple feasible moving trends can be inferred from the scene layout, which makes the generated results more realistic.

Conclusion

A multi-granularity fusion architecture named MGSU is proposed to perform trajectory prediction based on an understanding of the scene. It consists of three modules: the coarse-grained feature fusion module, the inverse reinforcement learning module, and the fine-grained feature fusion module. With these modules, the scene and trajectory information are gradually fused from coarse granularity to fine granularity. A novel scene-fusion Transformer is presented to better integrate scene information and improve efficiency. Its encoder explores the scene context, while the trajectory information is encoded by a linear layer. A sparse cross-attention mechanism fuses the scene and trajectory information with high efficiency, and the decoder predicts future trajectories in a generative manner to avoid error accumulation. Quantitative and qualitative evaluations of MGSU are conducted on the public SDD and NuScenes datasets. The results show that the trajectory prediction performance of MGSU improves after fusing the scene information, and that it can better adapt to various complex environments.
We believe that MGSU can be used in real-world applications such as service robots or self-driving cars. For example, it can be used to forecast pedestrians’ crossing intentions [57] and provide decision information for intelligent cars. Our future work focuses on integrating pedestrians’ trajectories with their actions to perform long-term action prediction.

Acknowledgements

This work is supported by Postdoctoral Foundation of Jiangsu Province no. 2021K187B; National Postdoctoral General Fund no. 2021M701042; Changzhou Science and Technology Program with Grant no. CJ20210052; General Project of Jiangsu Provincial Department of Science and Technology no. BK20221380.

Declarations

Conflict of interest

The authors declare no competing interests.
Open Access This article is licensed under a Creative Commons Attribution 4.0 International License, which permits use, sharing, adaptation, distribution and reproduction in any medium or format, as long as you give appropriate credit to the original author(s) and the source, provide a link to the Creative Commons licence, and indicate if changes were made. The images or other third party material in this article are included in the article’s Creative Commons licence, unless indicated otherwise in a credit line to the material. If material is not included in the article’s Creative Commons licence and your intended use is not permitted by statutory regulation or exceeds the permitted use, you will need to obtain permission directly from the copyright holder. To view a copy of this licence, visit http://creativecommons.org/licenses/by/4.0/.

Publisher's Note

Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.
References
2. Salzmann T, Ivanovic B, Chakravarty P, Pavone M (2020) Trajectron++: dynamically-feasible trajectory forecasting with heterogeneous data. In: Vedaldi A, Bischof H, Brox T, Frahm JM (eds) Computer vision—ECCV 2020. Lecture notes in computer science, vol 12363. Springer, Cham. https://doi.org/10.1007/978-3-030-58523-5_40
7. Schneider N, Gavrila DM (2013) Pedestrian path prediction with recursive Bayesian filters: a comparative study. In: Weickert J, Hein M, Schiele B (eds) Pattern recognition. GCPR 2013. Lecture notes in computer science, vol 8142. Springer, Berlin, Heidelberg, pp 174–183. https://doi.org/10.1007/978-3-642-40602-7_18
11. Lee N, Choi W, Vernaza P, Choy CB, Torr PHS, Chandraker M (2017) DESIRE: distant future prediction in dynamic scenes with interacting agents. In: 2017 IEEE conference on computer vision and pattern recognition (CVPR), pp 2165–2174. https://doi.org/10.1109/CVPR.2017.233
23. Zhou H, Zhang S, Peng J, Zhang S, Li J, Xiong H, Zhang W (2021) Informer: beyond efficient transformer for long sequence time-series forecasting. In: Proceedings of the AAAI conference on artificial intelligence, vol 35, pp 11106–11115
29. Alahi A, Goel K, Ramanathan V, Robicquet A, Fei-Fei L, Savarese S (2016) Social LSTM: human trajectory prediction in crowded spaces. In: Proceedings of the IEEE conference on computer vision and pattern recognition, pp 961–971. https://doi.org/10.1109/CVPR.2016.110
30. Gupta A, Johnson J, Fei-Fei L, Savarese S, Alahi A (2018) Social GAN: socially acceptable trajectories with generative adversarial networks. In: Proceedings of the IEEE conference on computer vision and pattern recognition, pp 2255–2264. https://doi.org/10.1109/CVPR.2018.00240
35. Bell S, Zitnick CL, Bala K, Girshick R (2016) Inside-outside net: detecting objects in context with skip pooling and recurrent neural networks. In: Proceedings of the IEEE conference on computer vision and pattern recognition, pp 2874–2883. https://doi.org/10.1109/CVPR.2016.314
50. Sadeghian A, Kosaraju V, Sadeghian A, Hirose N, Rezatofighi H, Savarese S (2019) SoPhie: an attentive GAN for predicting paths compliant to social and physical constraints. In: Proceedings of the IEEE/CVF conference on computer vision and pattern recognition, pp 1349–1358. https://doi.org/10.48550/arXiv.1806.01482
55. Cui H, Radosavljevic V, Chou F-C, Lin T-H, Nguyen T, Huang T-K, Schneider J, Djuric N (2019) Multimodal trajectory predictions for autonomous driving using deep convolutional networks. In: 2019 international conference on robotics and automation (ICRA), pp 2090–2096. https://doi.org/10.1109/ICRA.2019.8793868
Metadata
Title
Multi-granularity scenarios understanding network for trajectory prediction
Authors
Biao Yang
Jicheng Yang
Rongrong Ni
Changchun Yang
Xiaofeng Liu
Publication date
04.08.2022
Publisher
Springer International Publishing
Published in
Complex & Intelligent Systems / Issue 1/2023
Print ISSN: 2199-4536
Electronic ISSN: 2198-6053
DOI
https://doi.org/10.1007/s40747-022-00834-2
