Zum Inhalt

EKF-based self-attitude estimation with DNN learning landscape information

  • Open Access
  • 01.12.2021
  • Research Article
Erschienen in:

Aktivieren Sie unsere intelligente Suche, um passende Fachinhalte oder Patente zu finden.

search-config
loading …

Abstract

Der Artikel diskutiert das klassische Problem der Einstellungseinschätzung für mobile Roboter und Drohnen und betont die Notwendigkeit von Echtzeitlösungen. Herkömmliche Methoden, die Trägheitssensoren wie Beschleunigungssensoren und Gyroskope verwenden, stehen aufgrund von Selbstbeschleunigung, Bodenimpulsen und Rotorschwingungen vor Herausforderungen. Um diesen zu begegnen, schlagen die Autoren eine EKF-basierte Methode zur Selbsteinschätzung vor, die ein DNN nutzt, um Gravitationsvektoren aus Einzelbildern vorherzusagen. Die DNN-Ergebnisse geben Mittelwert und Varianz aus, was die Ablehnung unsicherer Vorhersagen und die Anpassung des Prozessrauschens in der EKF ermöglicht. Dieser innovative Ansatz kombiniert die Stärken von Deep Learning und klassischen Filtertechniken und bietet eine robuste Lösung für die Kontrolle der Haltung in Echtzeit. Die Methode wurde sowohl mit synthetischen als auch mit realen Daten validiert und zeigte eine verbesserte Genauigkeit und geringere Drift im Vergleich zu herkömmlichen Methoden. Der Artikel beleuchtet auch zukünftige Arbeiten, darunter die Notwendigkeit vielfältigerer Datensätze und die Integration zusätzlicher Sensoren und Methoden.

Publisher's Note

Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.

Introduction

Estimating attitude of a robot or a UAV is one of the classic problems of mobile robotics. Especially, real-time estimation is required for real-time attitude control. The attitude is generally estimated with inertial sensors such as accelerometers and gyroscopes. However, mobile robots have their own acceleration. Moreover, on-road robots also receive pulses from the ground, and UAVs suffer from vibration of their rotors. These need to be filtered out from the accelerometer. On the other hand, integration of gyroscopic angular rate has problems of drift and bias. These disturbances worsen the accuracy of the estimation. To complement each other, these inertial data are fused, generally [1]. Nevertheless, dealing the disturbances with only inertial sensors is quite difficult.
To reduce the influence of these disturbances, many kinds of SLAM (Simultaneous Localization And Mapping) [2] have been proposed. SLAM with LiDAR registers point clouds by ICP [3], NDT [4], and so on. Many visual SLAM with cameras have also been proposed [5, 6]. SLAM often contains accumulative error since relative pose changes with error are summed up. Some methods use prior information such as 3D maps for estimating self-pose [7]. These methods correct the error by matching the prior information against data from the sensor. However, they work only in environments where maps are available. Moreover, creating a map is time-consuming and also requires update. Some methods [8, 9] estimate attitude under Manhattan world assumption. They assume that planes and edges in the environment are orthogonal to each other. It helps achieving drift-free estimation. However, it is difficult for this kind of methods to avoid being affected by objects which do not satisfy the assumption.
Deep learning has been used for attitude estimation in recent years. In [10], IMU-based odometry by end-to-end learning has been proposed. In [11], a deep neural network identifies the measurement noise characteristics of IMU. In [12], a neural network estimates angular rates from sequential images. It was trained with synthetic and real images. The large synthetic dataset was collected in AirSim [13] which offers visually realistic graphics. In [14], a gravity vector is directly estimated from a single shot image. This is based on expectation that the network can learn edge, context, and landscape information; for example, most of all artificial buildings should be built vertically, the sky should be seen when the camera orients upper, and so on. The method does not depend on time sequence since only a single shot image is used to estimate the attitude. It helps suppressing drift, noise, and accumulative error. This method is the most similar to our proposed method. However, this method contains some problems. It cannot express uncertainty of the prediction; for instance, the network outputs estimation even when the camera is all covered by obstacles, when less features are captured, and so on. These outputs with large error worsen estimation when it is used in filter functions such as Kalman filter [15]. Therefore, they should be detected and be rejected before the integration.
To address these issues above, this paper presents self-attitude estimation with DNN which predicts a gravity vector from a single shot image, where the outputs are mean and variance. Only outputs with small variance are absorbed in the EKF (extended Kalman filter) to suppress accumulative error. Moreover, the output covariance matrices are used to adjust the process uncertainty in the EKF. The differences from the related work [14] are noted below:
  • A larger dataset is collected in our work by using a simulator.
  • Our network is trained with reliable ground truth while the related work collects a dataset with a hand carried phone.
  • L2 normalization is applied to the output gravity vector while ReLU is applied in the related work.
  • Our network outputs mean and a covariance matrix while the related work directly outputs the gravity vector.
  • Our method filters out DNN inferences with large variance before processing the integration in the EKF.
  • Our method uses the covariance matrix output from the DNN as update process noise in the EKF.
The dataset and the source code used in this paper for deep learning, and for the EKF have been released in open repositories.

DNN estimating gravity vector

The proposed method makes the DNN learn landscape information for estimating a gravity vector in a camera frame. The gravity is expressed as mean and covariance to consider uncertainty of the prediction.

Coordinate definition

A camera frame is defined as a right-handed coordinate system which is fixed on the camera pose. It is shown in Fig. 1.
Fig. 1
Screenshot of AirSim with coordinate description. An IMU and a camera are equipped to the drone in the simulator. The purpose of the proposed neural network is estimating a gravity vector in the camera frame from a front camera image
Bild vergrößern

Dataset collection

A dataset is collected in AirSim [13]. AirSim is a simulator for drones, cars and more, built on Unreal Engine, which provides visually realistic graphics. The dataset consists of images and corresponded gravity vectors \({\varvec{g}}\) in the camera frame. The camera pose and weather parameters are randomized, and a image and a gravity vector are recorded at each pose. The range of random \(Z\) is limited as [2 m, 3 m] in this work. The ranges of random roll \(\phi\) and pitch \(\theta\) are limited as [-30 deg, 30 deg], respectively. Fig. 2 shows examples of the dataset.
Real data is also collected with the sensors in Fig. 3 for test. The stick is hand-carried, and images and linear acceleration vectors measured with the IMU (Xsens MTi-30) are recorded. They are saved only when the stick is shaking less than 0.001 m, 0.1 deg, and when it is at least 5 deg away from the last saved pose. The IMU is regarded as ground truth because it has enough accuracy (within 0.2 deg) in static according to the specification.
Fig. 2
Examples of datasets. The dataset consists of images and correspond gravity vectors \(\varvec{g}\;[{\text{m/s}}^2]\) in the camera frame. These examples were collected in “Neighborhood” of AirSim. The camera pose and weather parameters are randomized for creating a dataset. An example of the weather parameter is that the road in the lower right image is covered in snow
Bild vergrößern
Fig. 3
Sensor stick. Images and acceleration are recorded with this stick when it is still. The judge whether it is still is processed by programing. Note that depth information is not used in this study although RealSense D435 is a RGB-D camera
Bild vergrößern

Data transformation and augmentation

Input data and label data are transformed, and are augmented in each epoch of training.

Image (input)

The image is randomly rotated for augmenting the roll data. The rotation angle \(\Delta \phi\) is limited as [– 10 deg, 10 deg]. The image is resized to \(224 \times 224\). RGB values are normalized. In this work, this normalization is done following \({\varvec{mean}} = (0.5, 0.5, 0.5)\) and \({\varvec{std}} = (0.5, 0.5, 0.5)\). Fig. 4 shows an example of the data augmentation.
Fig. 4
Example of a transformed image. An image is randomly rotated according to \(\Delta \phi\). This example shows an image when \(\Delta \phi = 10{\mkern 1mu}\; {\text{deg}}\). It is also resized, and is normalized
Bild vergrößern

Gravity vector (label)

A gravity vector in the camera frame is rotated according to \(\Delta \phi\). Since the network does not need to learn norm of the gravity, L2 normalization is also applied to the vector in order to make training efficient.
$$\begin{gathered} \varvec{g}_{{{\mathbf{trans}}}} \; = \varvec{Rot}_{{(\Delta \phi )}}^{{{\text{xyz}}}} \frac{\varvec{g}}{{|\varvec{g}|}} \hfill \\ \varvec{Rot}_{{(\Delta \phi )}}^{{{\text{xyz}}}} \; = \left( {\begin{array}{*{20}c} 1 & 0 & 0 \\ 0 & {\cos ( - \Delta \phi )} & { - \sin ( - \Delta \phi )} \\ 0 & {\sin ( - \Delta \phi )} & {\cos ( - \Delta \phi )} \\ \end{array} } \right) \hfill \\ \end{gathered}$$
(1)

Network

The proposed DNN is shown in Fig. 5. It consists of CNN layers and FC layers. Its input is a resized image, and its outputs are a mean vector of a gravity vector and a covariance matrix. Technically, the output of the FC layers is \((\mu _{g_x}, \mu _{g_y}, \mu _{g_z}, L_0, \cdots , L_5)\), and the mean vector \({\varvec{\mu }}\) and the covariance matrix \({\varvec{\Sigma }}\) are computed as Eqs. 2, 3, respectively. Since the lower-triangular matrix \({\varvec{L}}\) is required to have positive-valued diagonal entries, an exponential function is applied to the diagonal elements.
$$\varvec{\mu} =\,\frac{{(\mu _{{g_{x} }} ,\mu _{{g_{y} }} ,\mu _{{g_{z} }} )^{{\text{T}}} }}{{|(\mu _{{g_{x} }} ,\mu _{{g_{y} }} ,\mu _{{g_{z} }} )^{{\text{T}}} |}}$$
(2)
$$\varvec{\Sigma} = \varvec{LL}^{{\text{T}}} ,\quad \varvec{L} = \left( {\begin{array}{*{20}c} {\exp (L_{0} )} & 0 & 0 \\ {L_{1} } & {\exp (L_{2} )} & 0 \\ {L_{3} } & {L_{4} } & {\exp (L_{5} )} \\ \end{array} } \right)$$
(3)
It is expected that the CNN layers lean extracting features such as edges, and the FC layers learn landscape information. Feature module of VGG16 [16] pre-trained on ImageNet [17] is adopted as the CNN layers of the proposed method. All layers, except the final output layer, use the ReLU function [18] as activation function. All FC layers, except the final output layer, use the 10% Dropout [19].
Fig. 5
Proposed network architecture. It consists of CNN layers and FC layers. The input data is a resized image, and the output data are a mean vector and a covariance matrix. They are computed with an output from the final layer as Eqs. 2, 3, respectively. Log-probability of multivariate normal distribution is used as a loss function of this model
Bild vergrößern

Loss function

By learning the distribution of the dataset and updating the weights to maximize the probability density for the outputs, the DNN can output mean and variance. Assuming that the estimation follows a multivariate normal distribution, the probability density is computed as Eq. 4.
$$P(\varvec{x}|\varvec{\mu} ,\varvec{\Sigma} ) = \frac{{\exp ( - \frac{1}{2}(\varvec{x} - \varvec{\mu} )^{{\text{T}}} \varvec{\Sigma} ^{{ - 1}} (\varvec{x} - \varvec{\mu} ))}}{{\sqrt {(2\pi )^{k} |\varvec{\Sigma} |} }},\quad k = {\text{rank}}(\varvec{\Sigma} )$$
(4)
where \(k\) denotes the dimensions of the variables, i.e. \(k = 3\) in the proposed method. To make the network learn the probabilistic model, it is trained maximizing \(P({\varvec{x}}_\mathbf {label} | {\varvec{\mu }}, {\varvec{\Sigma }})\), where \({\varvec{x}}_\mathbf {label}\) (= \({\varvec{g}}_\mathbf {trans}\)) is a label of the dataset. The total probability density \(P_{{{\text{total}}}}\) of a dataset \(D_{{{\text{label}}}}\) is computed by multiplying as Eq. 5.
$$\begin{gathered} P_{{{\text{total}}}} = \prod\limits_{{i = 0}}^{n} P (\varvec{x}_{{{\mathbf{label}}_{i} }} |\varvec{\mu} _{i} ,\varvec{\Sigma} _{i} ) \hfill \\ D_{{{\text{label}}}} = [\varvec{x}_{{{\mathbf{label}}_{0} }} , \cdots ,\varvec{x}_{{{\mathbf{label}}_{i} }} , \cdots ,\varvec{x}_{{{\mathbf{label}}_{n} }} ] \hfill \\ \end{gathered}$$
(5)
Since the natural logarithm is a monotonically increasing function, maximizing \(P_{{{\text{total}}}}\) can be simplified by taking the natural logarithm. This avoids the value becoming too small by multiplying, and decreases computational cost. Here, \(P_{{{\text{log}}_{{{\text{total}}}} }}\) denotes a total value of the log-probability.
$$P_{{{\text{log}}_{{{\text{total}}}} }} = \sum\limits_{{i = 0}}^{n} {\ln } P(\varvec{x}_{{{\mathbf{label}}_{i} }} |\varvec{\mu} _{i} ,\varvec{\Sigma} _{i} ){\text{ }}$$
(6)
Moreover, maximizing \(P_{{{\text{log}}_{{{\text{total}}}} }}\) is equivalent to minimizing \(P_{{{\text{log}}_{{{\text{total}}}} }}\). Finally, the loss function of the proposed method is Eq. 7.
$$f_{{(\varvec{w})}} = - P_{{{\text{log}}_{{{\text{total}}}} }}$$
(7)
where \({\varvec{w}}\) denotes parameters of the network. The network minimizes the loss by updating \({\varvec{w}}\).

Optimization

Adaptative Moment Estimation (Adam) [20] is used to optimize the parameters. The learning rates are set as \(lr_{{{\text{CNN}}}} = 0.00001\), \(lr_{{{\text{FC}}}} = 0.0001\), where \(lr_\text {CNN}\) is a value for the CNN layers, \(lr_\text {FC}\) is a value for the FC layers.

Validation of DNN

The proposed DNN was validated by comparing to other methods. The datasets used in this validation are listed in Table 1.
Table 1
Dataset list
id#
Environment
# samples
Usage
1
AirSim’s “Neighborhood”
10000
Training
2
---"---
1000
Test
3
AirSim’s “SoccerField”
1000
Test
4
Real world
1108
Test

Compared methods

Definitions of methods which were used in this validation are summarized here.

MLE (ours, all)

“MLE (ours, all)” denotes the proposed method described in "DNN estimating gravity vector" section . MLE is short for Maximum Likelihood Estimation.

MLE (ours, selected)

“MLE (ours, selected)” denotes the method uses the exactly same network and the same parameters as “MLE (ours, all)” does, but only samples which output small variance are used for validation of attitude estimation. It means samples with large variance are filtered out as outliers. Assuming \(\beta\) in Eq. 8 expresses uncertainty of the prediction, samples with small variance are selected with a threshold \({\text{TH}}_{\beta }\). In this validation, the threshold is set as \({\text{TH}}_{\beta } = \frac{1}{n}\sum\limits_{{i = 0}}^{n} {\beta _{i} }\), where \(n\) is the number of samples in the testing dataset.
$$\begin{aligned} \beta = \sqrt{\Sigma _{0, 0}} \times \sqrt{\Sigma _{1, 1}} \times \sqrt{\Sigma _{2, 2}} ,\quad {\varvec{\Sigma }} = \begin{pmatrix} \Sigma _{0, 0} &{} \Sigma _{0, 1} &{} \Sigma _{0, 2} \\ \Sigma _{1, 0} &{} \Sigma _{1, 1} &{} \Sigma _{1, 2} \\ \Sigma _{2, 0} &{} \Sigma _{2, 1} &{} \Sigma _{2, 2} \end{pmatrix} \end{aligned}$$
(8)

Regression w/ L2 normalization

“Regression w/ L2 normalization” denotes the network which the final FC layer is difference from “MLE (ours)”. It outputs a 3d vector without covariance. It is a similar architecture as [14]. L2 normalization is applied to the final layer while ReLU is applied in [14]. Mean square error (MSE) between labels and outputs is used as a loss function. Fig. 6 shows the network architecture.
Fig. 6
Comparative network architecture. This network is almost same as [14]. Whereas the proposed method outputs the mean and variance, this one outputs only the mean
Bild vergrößern

Regression w/o L2 normalization

“Regression w/o L2 normalization” denotes the regression network without L2 normalization. It is require to learn not only the direction, but also norm of the gravity vector, i.e. approximately 9.8 \({\text{m/s}}^{2}\), in order to minimize the loss. This information is not required to estimate attitude \(\phi\), \(\theta\).

Training and validation

The network was trained with 10000 samples (#1) with a batch size of 200 samples for 200 epochs. Another 1000 samples (#2) were used for test. They were collected in “Neighborhood” of AirSim. The training dataset and the test dataset were not mixed.
Loss values during training are plotted in Fig. 7. The regression models converged much faster than the MLE model did. The regression model with L2 normalization converged a bit faster than the regression model without L2 normalization did. Tables 2, 3 and 4 respectively show the loss values after 200 epoch training. It is noted that gradient was not computed with the test datasets. It is also noted a loss function of the MLE model and one of the regression models are difference.
Another 1000 samples (#3) were also collected in “SoccerField” of AirSim as data in an unknown environment. 1108 real data samples (#4) were also collected with the sensors in Fig. 3. Note that these samples are just for test, thus the DNN was not trained with them. The loss values of these samples are larger than ones of the known environment in which the network was trained.
Fig. 7
Loss plotting. It is noted a loss functions of the MLE model and one of the regression models are difference. Therefore, their values can not be simply compared
Bild vergrößern
Table 2
Loss of MLE (ours) after 200 epochs
 
Dataset#
#1
#2
#3
#4
Loss [\({\text{m/s}}^{2}\)]
– 6.4991
– 5.7436
– 2.9386
– 2.7129
Table 3
Loss of Regression w/ L2 normalization after 200 epochs
 
Dataset#
#1
#2
#3
#4
Loss [\({\text{m}}^2 {\text{/s}}^4\)]
0.0011
0.0031
0.0084
0.0055
Table 4
Loss of Regression w/o L2 normalization after 200 epochs
 
Dataset#
#1
#2
#3
#4
Loss [\({\text{m}}^2 {\text{/s}}^4\)]
0.6588
0.4916
1.0951
0.8284

Attitude estimation

Roll \(\phi\) and pitch \(\theta\) of the camera pose in the gravitational coordinate are estimated by using \({\varvec{\mu }}\).
$$\begin{aligned} \phi = \tan ^{-1} \frac{ \mu _{g_y} }{ \mu _{g_z} } ,\quad \theta = \tan ^{-1} \frac{ -\mu _{g_x} }{ \sqrt{\mu _{g_y}^2 + \mu _{g_z}^2} } \end{aligned}$$
(9)
Mean absolute error (MAE) of the estimation of the test dataset is shown in Table 5. Variance of the estimated attitude error is shown in Table 6. Both of the error and the variance of “MLE (ours, selected)” are smaller than the others. 715 samples which has \(\beta < {\text{ TH}}_{\beta } = 0.00008814{\mkern 1mu} \; {\text{m}}^3 {\text{/s}}^6\) were selected from 1000 test samples (#2) with “MLE (ours, selected)”.
Comparing “MLE (ours, all)” and “MLE (ours, selected)”, filtering by \({\text{TH}}_{\beta }\) is found valid, which means the network expresses the uncertainty by outputting the covariance matrix. In order to see this, the samples are sorted in Fig. 8. In Fig. 8a, top 50 samples with largest estimation error with “MLE (ours)” are shown in order. In Fig. 8b, top 50 samples with the largest \(\beta\) with “MLE (ours)” are shown in order. 21 samples of the top 50 samples are mutual of both groups. In Fig. 8a, most of the sample images with large error are covered by obstacles, and they are dark with much less landscape information. There is no way to detect them by the regression model. Correlation between error and \(\beta\) are not complete, but many samples with large error were detected by sorting samples with \(\beta\). A good example with large \(\beta\) and one with small \(\beta\) are shown in Fig. 9, respectively. Obviously, the sample in Fig. 9a does not have enough landscape information to estimate the gravity direction, and the proposed network expresses the uncertainty with large \(\beta\).
Comparing “Regression w/ L2 normalization” and “Regression w/o L2 normalization”, L2 normalization does not contribute to the accuracy, although the one with L2 normalization converged a bit faster than the one without L2 normalization did.
A dataset of the unknown environment “SoccerField” (#3) and one of real world (#4) were also used for validation. Note that the neural networks were not trained in these environments. MAE and variance of the error are shown in Tables 7, 8, 9 and 10, respectively. Filtering by the threshold \({\text{TH}}_{\beta }\) made error smaller even in the unknown environments. However, the errors and the variances of errors are larger than the known environment (Tables 5, 6). To reduce difference of results between in known environments and in unknown ones, wider variety of datasets are needed for training.
Table 5
MAE of estimated attitude in known environment (#2)
 
Roll [deg]
Pitch [deg]
MLE (ours, all)
2.620
2.277
MLE (ours, selected)
1.836
1.467
Regression w/ L2 normalization
2.727
2.525
Regression w/o L2 normalization
2.766
2.366
Table 6
Variance of estimated attitude error in known environment (#2)
 
Roll [\({\text{deg}}^2\)]
Pitch [\({\text{deg}}^2\)]
MLE (ours, all)
23.139
18.348
MLE (ours, selected)
9.657
4.553
Regression w/ L2 normalization
25.161
21.996
Regression w/o L2 normalization
21.668
18.213
Fig. 8
Sorted samples. A number above each image is a index of a sample. In (a), top 50 samples are sorted in descending order of the error in “MLE (ours)”. Error of sample#608 in the regression model is \({ \phi _{{\rm error}}} = - 35.34\;{{\rm deg}}, \; {\theta _{{\rm error}}} = - 25.74\;{{\rm deg}}\). Error of sample#352 is \(\phi _{{{\text{error}}}} \; = \;2.80\;{\text{deg}}, \; \theta _{{{\text{error}}}} \; = \; - \;12.93\;{\text{deg}}\). In (b), top 50 samples are sorted in descending order of \(\beta\) in “MLE (ours)”. Mutual samples of the both groups are marked with red rectangles
Bild vergrößern
Fig. 9
Examples with large \(\beta\) and small \(\beta\). Obviously, it is hard to estimate the gravity direction from the sample (a). The proposed network expresses uncertainty of the prediction of the sample (a) by outputting large covariance
Bild vergrößern
Table 7
MAE of estimated attitude in unknown environment (#3)
 
Roll [deg]
Pitch [deg]
MLE (ours, all)
3.120
4.062
MLE (ours, selected)
2.069
2.549
Regression w/ L2 normalization
3.796
4.386
Regression w/o L2 normalization
3.744
4.695
Table 8
Variance of estimated attitude error in unknown environment (#3)
 
Roll [\({\text{deg}}^2\)]
Pitch [\({\text{deg}}^2\)]
MLE (ours, all)
31.170
36.967
MLE (ours, selected)
9.134
12.932
Regression w/ L2 normalization
35.358
45.428
Regression w/o L2 normalization
36.460
55.347
Table 9
MAE of estimated attitude in real world (#4)
 
Roll [deg]
Pitch [deg]
MLE (ours, all)
2.648
4.361
MLE (ours, selected)
2.199
3.755
Regression w/ L2 normalization
2.812
4.665
Regression w/o L2 normalization
3.092
5.371
Table 10
Variance of estimated attitude error in real world (#4)
 
Roll [\({\text{deg}}^2\)]
Pitch [\({\text{deg}}^2\)]
MLE (ours, all)
20.548
20.585
MLE (ours, selected)
9.356
13.971
Regression w/ L2 normalization
20.976
20.982
Regression w/o L2 normalization
25.996
24.160

EKF-based self-attitude estimstion with DNN

The proposed method integrates angular velocities from a gyroscope and the DNN outputs in the EKF. The proposed EKF architecture is shown in Fig. 10. The state vector \({\varvec{x}}\) of the proposed Kalman filter consists of roll and pitch of the robot. Both of the vector \({\varvec{x}}\) and the covariance matrix \({\varvec{P}}\) are computed in a prediction process and in an update process. The prediction process is computed by integrating angular velocity from a gyroscope. The update process is computed by observing the output of the DNN.
$$\varvec{x} = \left( {\begin{array}{*{20}c} \phi & \theta \\ \end{array} } \right)^{{\text{T}}}$$
(10)
Here, \(k\) denotes time step, \(S_{\phi }\), \(C_{\phi }\), \(T_{\phi }\) are short for \(\sin {\phi }\), \(\cos {\phi }\), \(\tan {\phi }\), respectively in the following sections.
Fig. 10
Proposed EKF architecture. Gyroscopic angular rates are integrated in the prediction process in the EKF. The DNN outputs are integrated in the update process in the EKF
Bild vergrößern

Prediction process

The state vector \({\varvec{x}}\) and the covariance matrix \({\varvec{P}}\) are respectively computed as following.
$$\begin{gathered} \overline{{\varvec{x}_{k} }} = f_{{(\varvec{x}_{{k - 1}} ,\varvec{u}_{{k - 1}} )}} = \varvec{x}_{{k - 1}} + \varvec{Rot}_{{(\varvec{x}_{{k - 1}} )}}^{{{\text{rpy}}}} \varvec{u}_{{k - 1}} \hfill \\ \varvec{u}_{{k - 1}} = \varvec{\omega} _{{k - 1}} \Delta t = \left( {\begin{array}{*{20}c} {\omega _{{{\text{x}}_{{k - 1}} }} \Delta t} \\ {\omega _{{{\text{y}}_{{k - 1}} }} \Delta t} \\ {\omega _{{{\text{z}}_{{k - 1}} }} \Delta t} \\ \end{array} } \right) \hfill \\ \varvec{Rot}_{{(\varvec{x}_{{k - 1}} )}}^{{{\text{rpy}}}} \; = \;\left( {\begin{array}{*{20}c} 1 \\ 0 \\ \end{array} \begin{array}{*{20}c} {S_{{\phi _{{k - 1}} }} T_{{\theta _{{k - 1}} }} } \\ {C_{{\phi _{{k - 1}} }} } \\ \end{array} \begin{array}{*{20}c} {C_{{\phi _{{k - 1}} }} T_{{\theta _{{k - 1}} }} } \\ { - S_{{\phi _{{k - 1}} }} } \\ \end{array} } \right) \hfill \\ \end{gathered}$$
(11)
where \(f\) is a state transition model, \({\varvec{u}}\) denotes a control vector, and \(\varvec{Rot}^{{{\text{rpy}}}}\) denotes a rotation matrix for angular velocities.
$$\overline{{\varvec{P}_{k} }} \; = \;\varvec{J_f}_{{k - 1}} \varvec{P}_{{k - 1}} \varvec{J_f}^{T} _{{k - 1}} \; + \;\varvec{Q}_{{k - 1}} ,\varvec{J_f}_{{k - 1}} \;\left. {\frac{{\partial f}}{{\partial \varvec{x}}}} \right|_{{\varvec{x}_{{k - 1}} ,\varvec{u}_{{k - 1}} }}$$
(12)
where \({\varvec{J_f}}\) denotes \(f\) Jacobean, and \({\varvec{Q}}\) denotes a covariance matrix of the process noise.

Update process

Outputs of the DNN with a large variance value \(\beta\) are rejected. A threshold \({\text{TH}}_{\beta }\) is set for judging \(\beta\), and only outputs with \(\beta \;{\text{ < }}\;{\text{TH}}_{\beta }\) are observed in the EKF. The observation vector is \({\varvec{z}}\).
$$\begin{aligned} {\varvec{z}} = {\varvec{\mu }} \end{aligned}$$
(13)
where \({\varvec{\mu }}\) denotes a mean vector of gravity which is output from the DNN. The observation model is \(h\).
$$h_{{(\varvec{x}_{k} )}} = \varvec{Rot}_{{(\varvec{x}_{k} )}}^{{{\text{xyz}}}} \varvec{g}_{{{\mathbf{world}}}} ,\quad \varvec{g}_{{{\mathbf{world}}}} = \left( {\begin{array}{*{20}c} 0 \\ 0 \\ {g_{{{\text{world}}}} } \\ \end{array} } \right){\text{ }}$$
(14)
$$\varvec{Rot}_{{(\varvec{x}_{k} )}}^{{{\text{xyz}}}} \; = \;\left( {\begin{array}{*{20}c} {C_{{\theta _{k} }} C_{{\psi _{k} }} C_{{\theta _{k} }} S_{{\psi _{k} }} - S_{{\theta _{k} }} } \\ {S_{{\phi _{k} }} S_{{\theta _{k} }} C_{{\psi _{k} }} - C_{{\phi _{k} }} S_{{\psi _{k} }} S_{{\phi _{k} }} S_{{\theta _{k} }} S_{{\psi _{k} }} + C_{{\phi _{k} }} C_{{\psi _{k} }} S_{{\phi _{k} }} C_{{\theta _{k} }} } \\ {C_{{\phi _{k} }} S_{{\theta _{k} }} C_{{\psi _{k} }} + S_{{\phi _{k} }} S_{{\psi _{k} }} C_{{\phi _{k} }} S_{{\theta _{k} }} S_{{\psi _{k} }} - S_{{\phi _{k} }} C_{{\psi _{k} }} C_{{\phi _{k} }} C_{{\theta _{k} }} } \\ \end{array} } \right){\text{ }}$$
(15)
where \({\varvec{g}}_\mathbf {world}\) denotes a gravity vector in the world frame i.e. \(g_{{{\text{world}}}} \; \fallingdotseq\;9.8{\mkern 1mu}\; {\text{m/s}}^{{\text{2}}}\), \(\varvec{Rot}^{{{\text{xyz}}}}\) denotes a rotation matrix for a vector. The covariance matrix of the process noise is \({\varvec{R}}\).
$$\begin{aligned} {\varvec{R}} = \begin{pmatrix} \gamma \Sigma _{0, 0} &{} \Sigma _{0, 1} &{} \Sigma _{0, 2} \\ \Sigma _{1, 0} &{} \gamma \Sigma _{1, 1} &{} \Sigma _{1, 2} \\ \Sigma _{2, 0} &{} \Sigma _{2, 1} &{} \gamma \Sigma _{2, 2} \end{pmatrix} \end{aligned}$$
(16)
where \(\gamma\) denotes a hyperparameter for adjusting variance. The state vector \({\varvec{x}}\) and the covariance matrix \({\varvec{P}}\) are respectively computed as following.
$$\begin{gathered} \hat{\varvec{x}}_{k} \; = \;\varvec{x}_{k} + \varvec{K}_{k} (\varvec{z}_{k} - h_{{(\varvec{x}_{k} )}} ),\;\hat{\varvec{P}}_{k} \; = \;(\varvec{I} - \varvec{K}_{k} \varvec{J_h}_{{{k} }} )\varvec{P}_{k} \hfill \\ \varvec{J_h}_{{{k} }} \; = \;\left. {\frac{{\partial h}}{{\partial \varvec{x}}}} \right|_{{x_{k} }} ,\;\varvec{K}_{k} \; = \;\varvec{P}_{k} \varvec{J_h}^{T} _{{{k} }} (\varvec{J_h}_{{{k} }} \varvec{P}_{k} \varvec{J_h}^{T} _{{{k} }} + \varvec{R}_{k} )^{{ - 1}} \hfill \\ \end{gathered}$$
(17)
where \({\varvec{J_h}}\) denotes \(h\) Jacobean, \({\varvec{K}}\) denotes a gain matrix, and \({\varvec{I}}\) denotes an identity matrix.

Validation of EKF

The proposed system was validated in real-time. Since ground truth is available in the simulator, we used synthetic flight data of a drone.

Compared methods

Definitions of methods which were used in this validation are summarized here.

Gyro

“Gyro” denotes an estimation method integrating angular velocity from a gyroscope.

Gyro+Acc

“Gyro+Acc” denotes an EKF-based estimation method integrating angular velocity and linear acceleration from IMU.

Gyro+NDT

“Gyro+NDT” denotes NDT SLAM [4] using 32 layers of LiDAR. Angular velocity from a gyroscope, linear velocity of ground truth and NDT output are integrated in the EKF. Note that linear velocity of ground truth is available because the environment is a simulator.

Gyro+Regression

“Gyro+Regression” denotes an EKF-based estimation method integrating angular velocity from a gyroscope and gravity vectors from the regression network (Fig. 6). All outputs from the network are integrated in the EKF.

Gyro+MLE (ours)

“Gyro+MLE (ours)” denotes the proposed method described in "EKF-based self-attitude estimstion with DNN" section. The hyperparameters were set as \({\text{TH}}_{\beta } = 8 \times 10^{{ - 5}}\) and \(\gamma = 1 \times 10^4\). They were empirically determined based on the result in "Validation of DNN" section.

Experimental conditions

Flight data of a drone ware recorded in “Neighborhood” and “SoccerField” of AirSim. “Neighborhood” is the same environment where the DNN was trained. An IMU and a camera are installed on a drone in the simulator. A LiDAR with 32 layers is also used for “Gyro+NDT”. The sampling frequency of the IMU, the camera and LiDAR are approximately 100 Hz, 12 Hz and 20 Hz, respectively. The camera’s horizontal FOV is 70 deg. Virtual noise was add to the IMU’s 6-axis data. It was randomly added following a normal distribution with a mean of 0 \({\text{rad/s}}\), 0 \({\text{m/s}}^2\) and a standard deviation of 0.1 \({\text{rad/s}}\), 0.1 \({\text{m/s}}^2\), respectively. The flight courses of “Neighborhood” and “SoccerField” are shown in Fig. 11. A computer which has i7-6700 CPU and GTX1080 GPU with 16 GB memory was used for the estimation. The DNN inference computation takes around 0.01–0.02 seconds with the computer. The proposed method is not only for UAVs, but it needs to be noted that this computer may be over-performance for real UAVs. Therefore, testing with a general UAV’s computer specification should be necessary when the method is used on real UAVs, and it is not the focus of this study.
Fig. 11
Flight courses. In (a), a course in “Neighborhood” which is a known environment for the DNN is shown. The drone flew for 5 rounds. It took about 22 minutes. In (b), a course in “SoccerField” which is an unknown environment for the DNN is shown. The drone flew for 3 rounds. It took about 6 minutes
Bild vergrößern

Experimental results

The estimated attitudes are plotted in Fig. 12a. Table 11 shows MAE of the estimated attitudes in “Neighborhood”. MAE of “Gyro+MLE (ours)” is smaller than ones of the other methods. “Gyro” had large accumulative error. That is natural because noise was added and the method does not have any other observation. “Gyro+Acc” does not have accumulative error. However it has error constantly, since the acceleration values of the sensor contain own acceleration of the robot. “Gyro+NDT” accumulated error slower than “Gyro” did by using LiDAR, but it could not remove the accumulation. “Gyro+Regression” and “Gyro+MLE (ours)” corrected the accumulative error by observing the estimated gravity. Comparing “Gyro+Regression” and “Gyro+MLE (ours)”, filtering out the DNN outputs with large \(\beta\) is found valid. With “Gyro+MLE (ours)”, 31 % of the outputs is rejected by the threshold in the flight.
MAE of the estimation in “SoccerField” is shown in Table 12. “Gyro+Regression” and “Gyro+MLE (ours)” suppressed accumulative error even in the environment where the DNN was not trained. MAE of “Gyro+MLE (ours)” is smaller than ones of the other methods. With “Gyro+MLE (ours)”, more outputs (58 %) from the DNN are filtered out by the threshold \({\text{TH}}_{\beta }\) in the unknown environment (“SoccerField”) than in the known environment (“Neighborhood”). This can avoid observing outputs with high uncertainty. However, it leads chances of correcting error less. In other words, the proposed method can not correct error when the large variance is continuing. To compensate this decrease of the chances, the proposed method actually can integrates other observations such as IMU’s acceleration, SLAM, and so on in a future work. Whereas, in this experiment, the proposed method integrates only angular rate and the DNN outputs in order to make the validation simple.
Fig. 12
Real-time attitude plotting. In (a), the result of the flight in “Neighborhood” is shown. In (b), the result of the flight in “SoccerField” is shown.
Bild vergrößern
Table 11
MAE of estimated attitude during flight in known environment
 
Roll [deg]
Pitch [deg]
Gyro
14.524
12.562
Gyro+Acc
2.920
3.380
Gyro+NDT
7.456
6.516
Gyro+Regression
3.187
1.876
Gyro+MLE (ours)
2.869
1.821
Table 12
MAE of estimated attitude during flight in unknown environment
 
Roll [deg]
Pitch [deg]
Gyro
6.424
4.679
Gyro+Acc
3.155
3.091
Gyro+NDT
4.067
2.646
Gyro+Regression
3.248
1.984
Gyro+MLE (ours)
2.869
1.785

Conclusions and future work

EKF-based self-attitude estimation with DNN learning landscape information was proposed. The DNN estimates direction of gravity from a single shot image. The network outputs not only the gravity vector, but also a covariance matrix. Training was done with synthetic data of AirSim, and validation was done with both of the synthetic data and real sensors’ data. In the validation, many samples with large error are filtered out by judging variance values. It means the proposed network expresses uncertainty of the prediction by outputting covariance matrices. The outputs from the DNN and angular velocity from a gyroscope are integrated in the EKF. The covariance matrix is used adjusting process noise. Moreover, outputs with too large variance are filtered out by a threshold. This EKF-based proposed method was validated with flight data of a drone in AirSim environments. In the experiments, the proposed method suppressed accumulative error by using the DNN.
In a future work, wider variety of datasets including real data are needed for the DNN to close the gap between the results in known environments and in unknown environments. Simulator data can be used for pre-training of the DNN before fine tuning with the real data. Only the roll augmentation is applied in this paper because it is easy and simple, but pitch augmentation by the homography transformation should also be valid for fine-tuning with the less real samples. Using other sensors or combining other methods is another future work. The experiments in this paper were done only with the daytime condition, but real robots should work in day and night. Therefor the future work needs to compensate the camera’s weak point for night operations.

Acknowledgements

The authors are thankful for the generous support from the New Energy and Industrial Technology Development Organization (NEDO) for this study. This study was conducted under “Autonomous Robot Research Cluster” at Meiji University.

Competing interests

The authors declare that they have no competing interests.
Open AccessThis article is licensed under a Creative Commons Attribution 4.0 International License, which permits use, sharing, adaptation, distribution and reproduction in any medium or format, as long as you give appropriate credit to the original author(s) and the source, provide a link to the Creative Commons licence, and indicate if changes were made. The images or other third party material in this article are included in the article's Creative Commons licence, unless indicated otherwise in a credit line to the material. If material is not included in the article's Creative Commons licence and your intended use is not permitted by statutory regulation or exceeds the permitted use, you will need to obtain permission directly from the copyright holder. To view a copy of this licence, visit http://creativecommons.org/licenses/by/4.0/.

Publisher's Note

Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.
Download
Titel
EKF-based self-attitude estimation with DNN learning landscape information
Verfasst von
Ryota Ozaki
Yoji Kuroda
Publikationsdatum
01.12.2021
Verlag
Springer International Publishing
Erschienen in
ROBOMECH Journal / Ausgabe 1/2021
Elektronische ISSN: 2197-4225
DOI
https://doi.org/10.1186/s40648-021-00196-3
1.
Zurück zum Zitat Vaganay J, Aldon MJ, Fournier A (1993) Mobile robot attitude estimation by fusion of inertial data. In: Proceedings of 1993 IEEE International Conference on Robotics and Automation (ICRA), pp. 277–282
2.
Zurück zum Zitat Thrun S, Burgard W, Fox D (2005) In: Probabilistic Robotics, pp. 309–336. The MIT Press
3.
Zurück zum Zitat Rusinkiewicz S, Levoy M (2001) Efficient variants of the icp algorithm. In: Proceedings of Third International Conference on 3-D Digital Imaging and Modeling, pp. 145–152
4.
Zurück zum Zitat Biber P, er, WS (2003) The normal distributions transform: a new approach to laser scan matching. In: Proceedings of 2003 IEEE/RSJ International Conference on Intelligent Robots and Systems (IROS)
5.
Zurück zum Zitat Engel J, Stueckler J, Cremers D (2014) Lsd-slam: Large-scale direct monocular slam. In: Proceedings of European Conference on Computer Vision (ECCV), pp. 834–849
6.
Zurück zum Zitat Mur-Artal R, Montiel JMM, Tardós JD (2015) Orb-slam: a versatile and accurate monocular slam system. IEEE Transact Robotic 31(5):1147–1163CrossRef
7.
Zurück zum Zitat Quddus MA, Ochieng WY, Noland RB (2007) Current map-matching algorithms for transport applications: State-of-the art and future research directions. Transportat Res Part C Emerg Technol 15(5):312–328CrossRef
8.
Zurück zum Zitat Kim P, Coltin B, Kim HJ (2018) Linear rgb-d slam for planar environments. In: Proceedings of European Conference on Computer Vision (ECCV), pp. 333–348
9.
Zurück zum Zitat Hwangbo M, Kanade T (2011) Visual-inertial uav attitude estimation using urban scene regularities. In: Proceedings of 2011 IEEE International Conference on Robotics and Automation (ICRA), pp. 2451–2458
10.
Zurück zum Zitat do Monte Lima JPS, Uchiyama H, Taniguchi RI (2019) End-to-end learning framework for imu-based 6-dof odometry. Sensors 19(17):3777CrossRef
11.
Zurück zum Zitat Al-Sharman MK, Zweiri Y, Jaradat MAK, Al-Husari R, Gan D, Seneviratne LD (2020) Deep-learning-based neural network training for state estimation enhancement: application to attitude estimation. IEEE Transact Instrument Measure 69(1):24–34CrossRef
12.
Zurück zum Zitat Mérida-Floriano M, Caballero F, Acedo D, García-Morales D, Casares F, Merino L (2019) Bioinspired direct visual estimation of attitude rates with very low resolution images using deep networks. In: Proceedings of 2019 IEEE International Conference on Robotics and Automation (ICRA), pp. 5672–5678
13.
Zurück zum Zitat Shah S, Dey D, Lovett C, Kapoor A (2017) Airsim: High-fidelity visual and physical simulation for autonomous vehicles. Field Serv Robotics 5:621–635CrossRef
14.
Zurück zum Zitat Ellingson G, Wingate D, McLain T (2017) Deep visual gravity vector detection for unmanned aircraft attitude estimation. In: Proceedings of 2017 IEEE/RSJ International Conference on Intelligent Robots and Systems (IROS)
15.
Zurück zum Zitat Kalman RE (1960) A new approach to linear fi ltering and prediction problems. J Basic Eng 82:35–45CrossRef
16.
Zurück zum Zitat Simonyan K, Zisserman A (2014) Very deep convolutional networks for large-scale image recognition. In: arXiv Preprint, pp. 1409–1556
17.
Zurück zum Zitat Deng J, Dong W, Socher R, L. Li KL, Fei-Fei L (2009) Imagenet: a large-scale hierarchical image database. In: Proceedings of 2009 IEEE Conference on Computer Vision and Pattern Recognition (CVPR), pp. 248–255
18.
Zurück zum Zitat Nair V, Hinton GE (2010) Rectified linear units improve restricted boltzmann machines. Proceed ICML 2010:807–814
19.
Zurück zum Zitat Srivastava N, Hinton G, Krizhevsky A, Sutskever I, Salakhutdinov R (2014) Dropout: a simple way to prevent neural networks from overfitting. J Mach Learn Res 15(1):1929–1958MathSciNetMATH
20.
Zurück zum Zitat Kingma DP, Ba J (2015) Adam: A method for stochastic optimization. In: Proceedings of the 3rd International Conference for Learning Representations (ICLR)

    Marktübersichten

    Die im Laufe eines Jahres in der „adhäsion“ veröffentlichten Marktübersichten helfen Anwendern verschiedenster Branchen, sich einen gezielten Überblick über Lieferantenangebote zu verschaffen. 

    Bildnachweise
    MKVS GbR/© MKVS GbR, Nordson/© Nordson, ViscoTec/© ViscoTec, BCD Chemie GmbH, Merz+Benteli/© Merz+Benteli, Robatech/© Robatech, Ruderer Klebetechnik GmbH, Xometry Europe GmbH/© Xometry Europe GmbH, Atlas Copco/© Atlas Copco, Sika/© Sika, Medmix/© Medmix, Kisling AG/© Kisling AG, Dosmatix GmbH/© Dosmatix GmbH, Innotech GmbH/© Innotech GmbH, Hilger u. Kern GmbH, VDI Logo/© VDI Wissensforum GmbH, Dr. Fritz Faulhaber GmbH & Co. KG/© Dr. Fritz Faulhaber GmbH & Co. KG, ECHTERHAGE HOLDING GMBH&CO.KG - VSE, mta robotics AG/© mta robotics AG, Bühnen, The MathWorks Deutschland GmbH/© The MathWorks Deutschland GmbH, Spie Rodia/© Spie Rodia, Schenker Hydraulik AG/© Schenker Hydraulik AG