
Open Access 04-08-2023 | Original Article

LVIF: a lightweight tightly coupled stereo-inertial SLAM with fisheye camera

Authors: Hongwei Zhu, Guobao Zhang, Zhiqi Ye, Hongyi Zhou

Published in: Complex & Intelligent Systems | Issue 1/2024

Abstract

To enhance the real-time performance and reduce CPU usage in feature-based visual SLAM, this paper introduces a lightweight tightly coupled stereo-inertial SLAM with fisheye cameras, incorporating several key innovations. First, the stereo-fisheye camera is treated as two independent monocular cameras, and the SE(3) transformation is computed between them to minimize the CPU burden during stereo feature matching and eliminate the need for camera rectification. Another important innovation is the application of maximum-a-posteriori (MAP) estimation for the inertial measurement unit (IMU), which effectively reduces inertial bias and noise in a short time frame. By optimizing the system’s parameters, the constant-velocity model is replaced from the beginning, resulting in improved tracking efficiency. Furthermore, the system incorporates the inertial data in the loop closure thread. The IMU data are employed to determine the translation direction relative to world coordinates and utilized as a necessary condition for loop detection. Experimental results demonstrate that the proposed system achieves superior real-time performance and lower CPU usage compared to the majority of other SLAM systems.

Introduction

Simultaneous localization and mapping (SLAM) is a sophisticated system that combines computer vision algorithms, Kalman filtering, or nonlinear optimization techniques to achieve location tracking and mapping in an unknown environment through multiple sensors. Due to its multidisciplinary nature, SLAM is a fundamental method for a range of applications, such as autonomous driving, robot navigation, augmented reality, and virtual reality. It also represents one of the most complex areas in the field of artificial intelligence.
The three most critical indicators of SLAM are robustness, real-time performance, and localization accuracy. However, vision-based SLAM systems may be affected by environmental factors, such as poor texture or blurry images, which can result in feature point matching failures and tracking errors.
To enhance the robustness of vision-based SLAM, various geometric entities, such as lines [1] or planes [2], have been integrated with feature points. Reference [3] uses the fisheye camera to solve the field of view (FOV) limitation of ordinary pinhole cameras and improve the perception range of the environment, but it needs to calibrate and correct the parameters of the fisheye camera in advance. In structured environments, pose-graph optimization can be utilized to leverage structural constraints, such as wall parallelism or orthogonality, based on the Manhattan World assumption [4]. Nevertheless, these methods mostly rely on RGB-D cameras and accurate estimation of the ground plane and Manhattan axes. Another way to increase the robustness of vision-based SLAM is to fuse inertial sensors. However, long-term operation of the IMU may lead to accumulated drift, necessitating the initialization of all IMU parameters and real-time optimization in later stages. Vins-Mono [5] employs covariance propagation and continuous-time linearization for error dynamics. BASALT [6] adopts a two-level SLAM system that optimizes the noise and bias of the IMU in both stages, while GVINS [7] employs a coarse-to-fine approach to initialize GNSS visual-inertial states and fuse their raw measurements in a probabilistic framework.
The real-time performance of SLAM, particularly on low-performance processors, is another significant challenge. This is particularly important for battery-powered robots, where computational efficiency is crucial for extending the robot’s endurance. Reference [8] uses a direct method to initialize the system and tracks non-keyframes for state estimation at the front end. At the back end, sliding window and marginalization are adopted to limit the number of keyframes and perform nonlinear optimization. FastORB-SLAM [9] tracks keypoints between incoming frames without computing descriptors, taking advantage of motion smoothness and constraints on epipolar geometry to refine correspondence. ORB-SLAM2S [10] employs a lightweight front-end that uses sparse optical flow for non-key frames and descriptors, resulting in faster performance compared to ORB-SLAM. Reference [11] has made significant advancements in DSO [12], including the incorporation of the fisheye camera model and joint optimization of camera’s internal and external parameters during back-end processing. Additionally, a real-time, direct monocular SLAM method for fisheye cameras has been proposed in [13], which derives an efficient and accurate approach to perform stereo directly on omnidirectional images. While these algorithms exhibit impressive real-time performance, it is important to note that the system’s localization accuracy may be compromised due to the absence of ORB feature extraction.
Our laboratory is currently developing a battery-powered inspection robot that utilizes the robot operating system and object detection algorithms to enable autonomous exploration, mapping, and monitoring. Given the constraints of battery-powered robots in terms of endurance and cost, it is common for many robots to utilize low-end NUCs. However, the limited computational power of these low-end NUCs severely impacts the real-time performance of algorithms. Recognizing the importance of algorithm robustness and real-time performance, we propose a lightweight visual-inertial SLAM approach called LVIF, which is based on nonlinear optimization and features extracted from a fisheye camera. Our method incorporates an IMU sensor and leverages inertial data throughout the entire LVIF pipeline. By leveraging IMU data for consecutive frame pose estimation, feature matching, and loop detection, our system achieves real-time performance, enhanced localization accuracy, and lower CPU usage compared to other systems. The system framework is illustrated in Fig. 1. The main contributions are as follows:
  • Calculation of the SE(3) transformation between the two fisheye cameras to speed up feature point matching.
  • A faster IMU optimization approach: MAP estimation is used for IMU initialization, and the optimized parameters can replace the constant-velocity model very early to improve tracking efficiency.
  • The fusion of inertial data and loop closure to reduce CPU usage.
The paper is organized as follows. In the section “Related work”, we provide an introduction to state-of-the-art relevant systems. The section “System overview” presents the main contribution and framework of our system. In the section “Experiment”, we detail the experimental setup and compare our system with others. Finally, we conclude our work in the section “Conclusion”.

Related work

Pure visual SLAM

In the early days of SLAM, most systems used filtering as the back-end optimization. MonoSLAM [14] is the first real-time monocular SLAM system; it utilizes Shi-Tomasi points to track very sparse feature points between frames and treats the current camera state and the landmarks as state variables, whose mean and covariance are optimized by an extended Kalman filter (EKF) back end. Although this approach had several drawbacks, it was a significant milestone at the time, as no visual SLAM system could run in real time before MonoSLAM. Beyond the EKF, when there are outliers in the measurements, [15] proposes a novel Masreliez–Martin filter to address robust parameter estimation in nonlinear systems with non-Gaussian noise.
Numerous studies [16, 17] have demonstrated that nonlinear optimization methods achieve higher accuracy than filtering methods, albeit at a slightly higher computational cost. Therefore, modern systems rely on MAP estimation, either through geometric MAP that minimizes feature reprojection error in feature-based methods, or through photometric MAP that minimizes the photometric error of a selected set of pixels in direct methods.
In feature-based methods, PTAM [18] is one of the most notable representatives. It selects some representative frames (keyframes) as input for bundle adjustment (BA) to reduce the computational complexity during nonlinear optimization. Additionally, PTAM divides the entire system into two parallel threads, one for camera pose tracking and the other for mapping. This approach enables PTAM to achieve short-term and medium-term data association. ORB-SLAM [19, 20] is one of the most prominent successors of PTAM. It supports monocular, stereo, and RGB-D sensor setups, and employs ORB features throughout the entire pipeline. The system is divided into three parallel threads for tracking, local mapping, and loop closure, representing short-term, medium-term, and long-term data association, respectively. In the short-term, ORB features are used to calculate the pose between frames. In the medium-term data association stage, keyframes and map points are utilized to minimize reprojection error with BA, refining the map. Finally, the loop closure thread employs the bag-of-words library DBoW2 [21] to achieve long-term data association. These methods result in ORB-SLAM achieving excellent accuracy.
In some large-scale scenarios, where there are a significant number of keyframes, the computational complexity of BA optimization can become substantial. Therefore, a common approach is to limit the number of keyframes used in the back-end optimization. Reference [22] proposes a new method called sliding window BA, which fixes the number of keyframes optimized for each back-end optimization. Reference [23] presents a double-window optimization with a co-visibility graph. Another approach presented in [24] is an RGB-D SLAM system designed for large-scale scenes. This system utilizes GPU acceleration to extract image features and track them at a very fast frame rate. Loop detection is implemented using pose-graph optimization with similarity constraints. These methods collectively contribute to more efficient and real-time performance of the SLAM system in large-scale scenes.
While feature-based methods can achieve good accuracy, the sparse distribution of feature points in each frame often leads to a sparse map representation. To address this limitation, direct method SLAM has been proposed as an alternative, aiming to construct a dense or semi-dense map. This approach leverages the intensity or photometric information from a set of selected pixels to build a more detailed and rich map representation. LSD-SLAM [8] is the most representative direct method, which constructs a large-scale semi-dense map using the optical flow method. This method initializes the depth of pixels with random numbers and normalizes the depth after triangulation. Because no feature points are extracted, only the pose graph can be optimized, which may result in lower accuracy compared to feature-based methods. SVO [25] is short for semi-direct visual odometry. It is a hybrid system that uses the FAST algorithm to extract feature points, tracks them by direct method, and optimizes camera pose and 3D map points by minimizing the reprojection error. The advantage of SVO is its high speed, capable of reaching 100 frames per second even on low-end computing platforms. Another fast semi-direct SLAM algorithm is [26], which uses a feature enhancement module based on subgraphs to extract image feature points more stably. Then, an apparent shape-weighted fusion method is proposed for camera pose estimation. Finally, an incremental dynamic covariance scaling algorithm is used to optimize the error of camera pose estimation. Compared with [8], the system has better accuracy.

Visual-inertial SLAM

Visual-inertial SLAM is an extension of pure visual SLAM that integrates inertial sensors to enhance robustness. By incorporating inertial information, it addresses challenges like fast rotations or poor texture that cannot be reliably tracked by pure visual SLAM. Moreover, this integration is achieved at a lower computational cost. However, inertial sensors also have their limitations. For instance, the measured angular velocity and acceleration may experience drift, which can result in unreliable calculations after multiple integration operations. Therefore, it is crucial to optimize the parameters of the IMU to ensure its reliable use. Some studies, such as [27], assume that the camera and IMU measurements are noise-free and that the extrinsic parameters between the IMU and camera are known. However, this assumption ignores the impact of bias on the solution process, which can have an unknown effect on the system’s performance. Reference [28] establishes a linear equation in the camera–IMU extrinsic rotation by relating the rotations of the IMU and camera coordinate systems over the same time period. However, if the IMU bias is neglected and the initialization scale is inaccurate, the system may rely too heavily on the accuracy of the initial pose estimate from monocular SLAM.
The fusion of visual and inertial information can be classified into two main approaches: loosely coupled methods [29, 30] and tightly coupled methods. In loosely coupled methods, the IMU is treated as an independent module that refines the visual-only pose estimates obtained from structure from motion. However, this approach does not fully maximize the potential of the IMU. As a result, most current visual–inertial fusion SLAM systems employ tightly coupled methods, such as EKF or graph optimization. These methods jointly optimize camera and IMU measurements, leading to significantly improved accuracy and robustness compared to the loosely coupled approach.
The earliest SLAM system based on the EKF is MSCKF [31]. This system relies on the feature method and uses EKF filtering for optimization. It adds camera poses at different times to the state vector, and the feature points can be observed by multiple cameras, forming geometric constraints between multiple camera states to solve the problem of dimension explosion in optimization. This approach enables MSCKF to achieve accurate and robust performance in real-world scenarios.
Graph optimization or BA techniques can maintain and optimize all measurements to obtain the optimal state estimates. The most representative system is OKVIS [32], which supports both monocular and stereo vision. It is the first system to optimize the inertial measurements based on keyframes, nonlinear optimization, and error propagation models. However, OKVIS does not support re-localization. VINS-Mono [5] is an excellent open-source SLAM system that reduces the complexity of back-end computation using a sliding window, and solves place recognition using the DBoW2 library. The system improves stability using a 4 DoF pose graph with BA in loop closure. Its robust initialization can start the system from an unknown initial state. VINS-Fusion [33] extends VINS-Mono to stereo and stereo-inertial. ORB-SLAM3 [34] is a tightly coupled visual-inertial system that builds upon the success of [19, 20] by incorporating IMU information. It utilizes MAP estimation for IMU initialization after 2 and 15 s of system operation. For medium-term data association, ORB-SLAM3 combines inertial and visual residual terms to perform joint visual–inertial optimization with BA. It also utilizes the DBoW2 library for place recognition, which helps reduce the accumulated drift during long-term data association. Experimental results demonstrate that ORB-SLAM3 achieves higher accuracy.
The direct method has also found good application in visual–inertial SLAM. ROVIO [35] is an SLAM system that uses the direct method and EKF. It has a low computational cost, but its accuracy heavily relies on its parameters, which need to be adjusted for different scenarios. Additionally, it lacks loop closure, which results in significant accumulative drift. VI-DSO [36] extends the DSO [12] system by incorporating inertial information to extend the traditional photometric error optimization. Since scales are not immediately observable, VI-DSO is initialized at any scale, rather than lazily initialized until all variables are optimized. Moreover, it proposes a dynamic marginalization method to improve computational efficiency. These improvements enable VI-DSO to achieve high accuracy and robustness in real-world scenarios.

Kannala–Brandt model

Consider a point \(P(X_w,Y_w,Z_w)\) in the world reference. Through the SE(3) transformation shown in Fig. 2, \(P_c(X_c,Y_c,Z_c)\) gives the coordinates of P in the camera reference
$$\begin{aligned} \left[ \begin{array}{c} X_c \\ Y_c \\ Z_c \end{array} \right] = \left[ \begin{array}{cc} R&t \end{array} \right] \left[ \begin{array}{c} X_w \\ Y_w \\ Z_w \\ 1 \end{array} \right] . \end{aligned}$$
(1)
Projecting \(P_c\) onto the normalized image plane gives
$$\begin{aligned} \left\{ \begin{array}{l} a = X_c / Z_c \\ b = Y_c / Z_c \\ r^2 = a^2 + b^2 \\ \theta = \tan ^{-1}{\left( r\right) }. \end{array} \right. \end{aligned}$$
(2)
As with the pinhole model, the fisheye lens introduces distortion, which is expressed as a function of the incidence angle \(\theta \). The Kannala–Brandt model uses a polynomial containing only odd-order terms to describe the distortion process, compressing the radial distance r of the image point from the image center to \(r_d\)
$$\begin{aligned} r_d=\theta +k_1\theta ^3+k_2\theta ^5+k_3\theta ^7+k_4\theta ^9, \end{aligned}$$
(3)
where \(r_d\) is the distorted (mapped) radial distance on the normalized image plane.
By the similarity of triangles, we obtain the distorted coordinates
$$\begin{aligned} \frac{r_d}{r} = \frac{x'}{a} = \frac{y'}{b}, \end{aligned}$$
(4)
where \(x'\), \(y'\) are the distorted coordinates on the normalized image plane.
Finally, the pixel coordinate value of P is obtained through the camera internal parameters
$$\begin{aligned} \left\{ \begin{array}{l} u =\ f_x\frac{r_d}{r}\ a\ +\ c_x \\ v =\ f_y\frac{r_d}{r}\ b\ +\ c_y, \end{array} \right. \end{aligned}$$
(5)
where \(f_x\), \(f_y\), \(c_x\), and \(c_y\) are the camera internal parameters, and u, v are the pixel coordinates. Figure 3 illustrates the process of ORB feature extraction and matching for fisheye images.
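To make the projection pipeline above concrete, the following minimal Python sketch chains Eqs. (1)–(5). The intrinsic and distortion values are illustrative placeholders, not the calibration used by LVIF.

```python
import numpy as np

def kb_project(P_w, R, t, fx, fy, cx, cy, k):
    """Project a 3D world point with the Kannala-Brandt fisheye model (Eqs. 1-5).

    P_w : (3,) world point; R (3,3), t (3,) camera pose; k = [k1, k2, k3, k4].
    """
    X, Y, Z = R @ P_w + t                      # Eq. (1): world -> camera frame
    a, b = X / Z, Y / Z                        # Eq. (2): normalized image plane
    r = np.sqrt(a**2 + b**2)
    theta = np.arctan(r)                       # incidence angle
    # Eq. (3): odd-order polynomial mapping theta to the distorted radius r_d
    r_d = theta + k[0]*theta**3 + k[1]*theta**5 + k[2]*theta**7 + k[3]*theta**9
    scale = r_d / r if r > 1e-12 else 1.0      # Eq. (4): radial rescaling
    u = fx * scale * a + cx                    # Eq. (5): pixel coordinates
    v = fy * scale * b + cy
    return u, v

# Example with placeholder calibration values
u, v = kb_project(np.array([1.0, 0.5, 3.0]), np.eye(3), np.zeros(3),
                  380.0, 380.0, 512.0, 512.0, [0.01, -0.002, 0.001, -0.0005])
print(u, v)
```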

System overview

Our system proposed in this paper is built on [34]. We further improve it and apply it to robots with low-end NUCs that demand real-time performance and low energy consumption. Since the direct method cannot meet our localization-accuracy requirements, we propose a tightly coupled stereo-inertial SLAM system based on nonlinear optimization and on features extracted with a fisheye camera model. We choose a stereo setup because it observes scale directly, without requiring additional computation and optimization of inertial information. The system is divided into three threads, namely: tracking, local mapping, and loop closure. Note that, for the completeness of the article, some of the equations presented are not contributions of this work and are mainly taken from [38].

Non-rectified stereo-fisheye SLAM

Most SLAM systems use pinhole models for their camera components. In the case of stereo pinhole cameras, the two input images are rectified and their ORB features are extracted. Matching between the rectified left and right images can be done efficiently by searching along the same row, as epipolar lines are horizontal. However, this approach is not suitable for fisheye lenses that have an FOV equal to or greater than 180\(^\circ \), as shown in Fig. 4. Rectifying images requires an FOV less than 180\(^\circ \), and many computer vision algorithms assume that the reprojection error is uniform across the image. If the fisheye image is rectified, the resulting images need to be cropped, as shown in Fig. 4b. This causes the system to lose the advantages of a large FOV, such as the ability to extract more feature points and improved occlusion robustness.
To overcome these difficulties, we adopt a novel method that treats the stereo-fisheye camera as two monocular cameras, extracting features separately. The cameras only need to be calibrated before the algorithm starts. Since it is not suitable to rectify the fisheye images so that epipolar lines are horizontal, the time complexity of feature point matching may be high. Therefore, we exploit the SE(3) transformation between the two monocular cameras, which significantly reduces the time spent matching feature points. Furthermore, since the fisheye camera yields a large number of feature points, we divide them into close and far points, as described in [20]. The close points can be reliably estimated using disparity in the stereo camera if their depth is less than 40 times the stereo baseline. In this way, LVIF filters out outliers that might negatively affect the localization accuracy. If the two cameras share a common field of view, keypoints in the overlapping area can be triangulated directly when they are first observed. The remaining landmarks outside the overlapping area are used as monocular information and triangulated from multiple views.
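One plausible way to exploit the known stereo extrinsic for faster matching (not necessarily LVIF's exact scheme) is to predict, from a depth hypothesis, where a left-camera feature should reappear in the right image and restrict the descriptor search to a small window around that prediction. The sketch below uses a pinhole projection as a stand-in for the Kannala–Brandt model, and all numerical values are placeholders.

```python
import numpy as np

def predict_right_search_center(bearing_l, depth, R_rl, t_rl, K_r):
    """Predict the right-image location of a left-camera feature using the
    fixed SE(3) extrinsic between the two fisheye cameras, so descriptor
    matching can be restricted to a small window instead of a full search.

    bearing_l : (3,) unit bearing vector of the feature in the left camera
    depth     : hypothesised depth along that bearing
    R_rl,t_rl : right-from-left SE(3) extrinsic obtained by calibration
    K_r       : (3,3) right-camera intrinsic matrix (pinhole stand-in here)
    """
    P_l = bearing_l * depth            # 3D point in the left-camera frame
    P_r = R_rl @ P_l + t_rl            # transform with the stereo extrinsic
    uv = K_r @ (P_r / P_r[2])          # project into the right image
    return uv[:2]

# Illustrative numbers only (not LVIF's calibration)
center = predict_right_search_center(
    np.array([0.1, 0.0, 0.995]), depth=3.0,
    R_rl=np.eye(3), t_rl=np.array([-0.11, 0.0, 0.0]),
    K_r=np.array([[380.0, 0.0, 512.0], [0.0, 380.0, 512.0], [0.0, 0.0, 1.0]]))
print(center)
```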
The FAST feature points are extracted from an 8-scale pyramid model with a scale factor of 1.2. Furthermore, LVIF uses the quadtree algorithm to recursively partition groups of points and selects the keypoint with the highest FAST corner response in each local neighborhood, achieving non-maximum suppression and fast screening. The descriptor is computed with the BRIEF algorithm to complete the matching of feature points.
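The pyramid and descriptor settings described above can be approximated with OpenCV's ORB detector, as in the hedged sketch below; this is only an illustrative configuration, since LVIF's own extractor additionally enforces the quadtree-based spatial distribution.

```python
import cv2
import numpy as np

# Illustrative OpenCV configuration mirroring the extractor settings in the
# text: an 8-level image pyramid with scale factor 1.2, FAST corner scoring,
# a 600-feature budget (the limit used in the experiments) and BRIEF-style
# binary descriptors.  Plain OpenCV ORB does not apply LVIF's quadtree
# distribution step.
orb = cv2.ORB_create(nfeatures=600, scaleFactor=1.2, nlevels=8,
                     scoreType=cv2.ORB_FAST_SCORE)

# Stand-in image; a real 1024x1024 fisheye frame would be loaded instead.
img = (np.random.rand(1024, 1024) * 255).astype(np.uint8)
keypoints, descriptors = orb.detectAndCompute(img, None)
print(len(keypoints), "keypoints extracted")
```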

Re-localization and MLPnP

A robust SLAM system should be able to re-localize itself when the camera fails to track. The approach in [20] uses the EPnP method to solve re-localization, which matches the 3D map points and image keypoints using a calibrated pinhole camera. However, the EPnP algorithm may not be fully adapted to fisheye cameras. Therefore, we adopt the maximum-likelihood perspective-n-point algorithm [39] for re-localization, which exploits a variance propagation method from image observations to bearing vectors and can handle arbitrary central camera models, including fisheye cameras.

IMU pre-integration

From the inertial measurements between frames \(b_k\) and \(b_{k+1}\), the position, velocity, and rotation of the body are expressed as
$$\begin{aligned} p_{b_{k+1}}^w = p_{b_k}^w + v_{b_k}^w t + \iint _{k}^{k+1}{\left( q_{wb_t}a^{b_t} - g^w\right) \delta t^2} \end{aligned}$$
(6)
$$\begin{aligned} v_{b_{k+1}}^w = v_{b_k}^w + \int _{k}^{k+1}\left( q_{wb_t}a^{b_t} - g^w\right) \delta t \end{aligned}$$
(7)
$$\begin{aligned} q_{wb_{k+1}} = \int _{k}^{k+1}q_{wb_t}\otimes \left[ {\begin{array}{c} 0\\ \frac{1}{2}\omega ^{b_t} \end{array}}\right] \delta t \end{aligned}$$
(8)
$$\begin{aligned} q_{wb_{k}} = q_{wb_i}\otimes q_{b_ib_k}, \end{aligned}$$
(9)
where \(p_{b_k}^w\), \(v_{b_k}^w\) are, respectively, the position and velocity of the \(b_k\) frame in the world reference, \(q_{wb_{k+1}}\) is a quaternion representing the transformation from the \(b_{k+1}\) frame to the world reference, \(\otimes \) denotes quaternion multiplication, \(\omega ^{b_t}\), \(a^{b_t}\) represent the angular velocity and acceleration measured in the body frame, \(g^w\) is the gravity vector in the world reference, and t is the time interval between frames \(b_k\) and \(b_{k+1}\).
Due to the high sampling frequency of the IMU, pre-integration combines the integration of many measurements into a single relative term, which improves computational efficiency. Rearranging Eqs. (6)–(8) using Eq. (9) gives
$$\begin{aligned} p_{b_{k+1}}^w = p_{b_k}^w + v_{b_k}^w t - \frac{1}{2}g^wt^2 + q_{wb_k}\iint _{k}^{k+1}{\left( q_{b_kb_t}a^{b_t}\right) \delta t^2} \end{aligned}$$
(10)
$$\begin{aligned} v_{b_{k+1}}^w = v_{b_k}^w - g^wt + q_{wb_k}\int _{k}^{k+1}{\left( q_{b_kb_t}a^{b_t}\right) \delta t} \end{aligned}$$
(11)
$$\begin{aligned} q_{wb_{k+1}} = q_{wb_k}\int _{k}^{k+1}q_{b_kb_t}\otimes \left[ {\begin{array}{c} 0\\ \frac{1}{2}\omega ^{b_t} \end{array}}\right] \delta t. \end{aligned}$$
(12)
From Eqs. (10)–(12), the IMU pre-integration between frames \(b_k\) and \(b_{k+1}\) can be expressed as
$$\begin{aligned} \Delta p_{b_kb_{k+1}} = \iint _{k}^{k+1}{\left( q_{b_kb_t}a^{b_t}\right) \delta t^2} \end{aligned}$$
(13)
$$\begin{aligned} \Delta v_{b_kb_{k+1}} = \int _{k}^{k+1}{\left( q_{b_kb_t}a^{b_t}\right) \delta t} \end{aligned}$$
(14)
$$\begin{aligned} \Delta q_{b_kb_{k+1}} = \int _{k}^{k+1}q_{b_kb_t}\otimes \left[ {\begin{array}{c} 0\\ \frac{1}{2}\omega ^{b_t} \end{array}}\right] \delta t, \end{aligned}$$
(15)
where \(\Delta p_{b_kb_{k+1}},\Delta v_{b_kb_{k+1}},\Delta q_{b_kb_{k+1}}\) are the position, velocity, and rotation variations, respectively.
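The sketch below gives a minimal discrete-time version of Eqs. (13)–(15), assuming the bias has already been removed and using rotation matrices in place of quaternions; it is illustrative only and ignores the noise propagation handled in the full system.

```python
import numpy as np
from scipy.spatial.transform import Rotation

def preintegrate(acc, gyro, dt):
    """Discrete preintegration of Eqs. (13)-(15) between frames b_k and b_{k+1}.

    acc, gyro : (N,3) IMU samples in the body frame (bias already removed)
    dt        : IMU sampling period in seconds
    Returns (dp, dv, dR): position/velocity deltas and relative rotation,
    expressed in the b_k frame; rotation matrices stand in for quaternions.
    """
    dp = np.zeros(3)
    dv = np.zeros(3)
    dR = np.eye(3)
    for a_m, w_m in zip(acc, gyro):
        a_k = dR @ a_m                          # rotate acceleration into b_k
        dp += dv * dt + 0.5 * a_k * dt**2       # Eq. (13)
        dv += a_k * dt                          # Eq. (14)
        dR = dR @ Rotation.from_rotvec(w_m * dt).as_matrix()  # Eq. (15)
    return dp, dv, dR

# 200 Hz IMU (as in the TUM VI benchmark), illustrative constant motion
acc = np.tile([0.0, 0.0, 9.81], (200, 1))
gyro = np.tile([0.0, 0.0, 0.1], (200, 1))
print(preintegrate(acc, gyro, dt=1.0 / 200.0))
```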

Graph optimization model

IMU residual model

Since both the accelerometer and the gyroscope of the IMU are affected by noise and bias, the measurements are corrupted. We group \(\Delta p_{b_kb_{k+1}}\), \(\Delta v_{b_kb_{k+1}}\), and \(\Delta q_{b_kb_{k+1}}\) into a pre-integration term \(\tau _{k,k+1}\) with an associated covariance matrix. The following residual model is used for IMU initialization and visual–inertial BA optimization:
$$\begin{aligned} r_{\Delta q_{b_kb_{k+1}}} = \Delta q_{b_kb_{k+1}}^{-1}\otimes \left( q_{b_kw}\otimes q_{wb_{k+1}}\right) \end{aligned}$$
(16)
$$\begin{aligned} r_{\Delta p_{b_kb_{k+1}}} = q_{b_kw}\left( p_{b_{k+1}}^w - p_{b_k}^w - v_{b_k}^w t + \frac{1}{2}g^wt^2\right) - \Delta p_{b_kb_{k+1}} \end{aligned}$$
(17)
$$\begin{aligned} r_{\Delta v_{b_kb_{k+1}}} = q_{b_kw}\left( v_{b_{k+1}}^w - v_{b_k}^w + g^wt\right) - \Delta v_{b_kb_{k+1}} \end{aligned}$$
(18)
$$\begin{aligned} r_{\tau _{b_kb_{k+1}}} = \left[ r_{\Delta q_{b_kb_{k+1}}},r_{\Delta v_{b_kb_{k+1}}},r_{\Delta p_{b_kb_{k+1}}}\right] . \end{aligned}$$
(19)
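A minimal sketch of Eqs. (16)–(19), written with rotation matrices instead of quaternions and with the paper's gravity convention, is given below; the states and preintegrated terms passed in are placeholders, not values produced by LVIF.

```python
import numpy as np
from scipy.spatial.transform import Rotation

# g^w in the paper's convention (Eqs. 6-18), with the world z-axis aligned
# to the gravity direction after Eq. (21).
GRAVITY_W = np.array([0.0, 0.0, 9.81])

def imu_residual(R_k, p_k, v_k, R_k1, p_k1, v_k1, dR, dp, dv, t):
    """Inertial residuals of Eqs. (16)-(19), rotation-matrix form.

    (R_k, p_k, v_k) and (R_k1, p_k1, v_k1) are world-frame states of frames
    b_k and b_{k+1}; (dR, dp, dv) are the preintegrated terms and t is the
    time interval between the two frames.
    """
    r_q = Rotation.from_matrix(dR.T @ R_k.T @ R_k1).as_rotvec()            # Eq. (16)
    r_p = R_k.T @ (p_k1 - p_k - v_k * t + 0.5 * GRAVITY_W * t**2) - dp     # Eq. (17)
    r_v = R_k.T @ (v_k1 - v_k + GRAVITY_W * t) - dv                        # Eq. (18)
    return np.concatenate([r_q, r_v, r_p])                                 # Eq. (19)

# Illustrative call with identity states and zero preintegrated terms
r = imu_residual(np.eye(3), np.zeros(3), np.zeros(3),
                 np.eye(3), np.zeros(3), np.zeros(3),
                 np.eye(3), np.zeros(3), np.zeros(3), t=0.05)
print(r.shape)  # (9,)
```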

Inertial-only MAP estimation

In the local mapping thread, we adopt MAP estimation to quickly estimate the IMU variables as a coarse optimization. We assume that all variables are independent and that their noise follows a zero-mean Gaussian distribution. The scale of the stereo camera is known and does not need to be optimized. The estimated parameters are as follows:
$$\begin{aligned} y_k= \{R_{wg},\ b^a,\ b^g,\ v_{0:k}\}, \end{aligned}$$
(20)
where \(R_{wg}\) represents the rotation matrix from the ENU reference to the world reference, with the direction of gravity in ENU being (0, 0, g). \(b^a\), \(b^g\) are the biases of the accelerometer and gyroscope, respectively, and \(v_{0:k}\) denotes the velocities of all frames from the first to the current one. The parameters in \(y_k\) are optimized by an inertial-only MAP estimation, which is cast as a graph optimization problem with the parameters in \(y_k\) (gyroscope bias, accelerometer bias, velocities, and the gravity direction) as vertices and the inertial residuals as edges. The optimization drives the IMU biases toward zero through the inertial residual model, as shown in Fig. 5a. This IMU initialization is a coarse optimization; compared with [27, 28], it is faster, and the optimized parameters can be used almost immediately in the tracking and local mapping threads to speed up tracking and improve the accuracy of the system.
When the system starts, the pure vision module is used to initialize and generate map point clouds, but the world reference may not be aligned with the ENU navigation frame. After IMU initialization is completed, the optimized \(R_{wg}\) is used to align the z-axis of the world reference with the direction of gravity. Finally, the system retrieves all keyframes in the map and adjusts their poses according to Eq. (21) to complete the transformation to the new world reference
$$\begin{aligned} R_{wb_k} = R_{wg}^T R_{wb_k}. \end{aligned}$$
(21)
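The sketch below applies Eq. (21) to a set of keyframes after inertial-only MAP estimation; the construction of \(R_{wg}\) and the handling of translations are assumptions of this illustration (only the rotation update is written out explicitly in the text).

```python
import numpy as np
from scipy.spatial.transform import Rotation

def align_world_to_gravity(R_wg, keyframe_R, keyframe_p):
    """Apply Eq. (21): rotate the map so the world z-axis matches gravity.

    R_wg       : optimized rotation from the ENU (gravity-aligned) frame to
                 the initial visual world frame
    keyframe_R : list of (3,3) keyframe rotations R_wb_k
    keyframe_p : list of (3,)  keyframe positions p_wb_k
    Positions are rotated with the same R_wg^T (an assumption of this sketch).
    """
    new_R = [R_wg.T @ R for R in keyframe_R]   # Eq. (21)
    new_p = [R_wg.T @ p for p in keyframe_p]
    return new_R, new_p

# Illustrative: gravity estimated 5 degrees off the world z-axis
R_wg = Rotation.from_euler("x", 5, degrees=True).as_matrix()
Rs, ps = align_world_to_gravity(R_wg, [np.eye(3)], [np.array([1.0, 0.0, 0.0])])
print(Rs[0], ps[0])
```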

Vision-only MAP estimation

When IMU initialization has not yet started, pure visual MAP estimation is adopted and converted to the graph optimization of Fig. 5b. The system first runs for 2 s in pure visual mode, generating keyframes at about 8 Hz together with 3D map points. We define a local window containing the co-visible keyframes and any other keyframes that observe the 3D map points of the current keyframe. The reprojection error is used within the local window to optimize the poses and 3D map points
$$\begin{aligned} r_{b_i,j} = u_{b_i,j} - \Pi \left( T_{b_iw}\otimes x_j\right) , \end{aligned}$$
(22)
where \(u_{b_i,j}\) is the observed pixel of map point j in the \(b_i\) frame, \(\Pi \) is the camera projection function of the Kannala–Brandt model, \(T_{b_iw}\in \) SE(3) represents the transformation from the world reference to the \(b_i\) frame, \(\otimes \) is the action of the SE(3) group on \({R}^3\) elements, and \(x_j\) stands for the 3D map point j.
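A minimal sketch of the visual residual in Eq. (22) follows; the projection \(\Pi \) is passed in as a generic callable, and the pinhole stand-in used in the example is an assumption, not LVIF's fisheye projection.

```python
import numpy as np

def reprojection_residual(u_obs, T_bw, x_w, project):
    """Visual residual of Eq. (22): observed pixel minus the projection of the
    map point transformed into the keyframe.

    u_obs   : (2,) observed pixel of map point j in keyframe b_i
    T_bw    : (4,4) homogeneous SE(3) transform from world to b_i
    x_w     : (3,) map point in world coordinates
    project : camera projection function Pi (e.g. the Kannala-Brandt sketch)
    """
    x_b = T_bw[:3, :3] @ x_w + T_bw[:3, 3]   # SE(3) action on the map point
    return u_obs - project(x_b)

# Illustrative pinhole projection standing in for the fisheye model
pinhole = lambda P: np.array([380.0 * P[0] / P[2] + 512.0,
                              380.0 * P[1] / P[2] + 512.0])
r = reprojection_residual(np.array([520.0, 510.0]), np.eye(4),
                          np.array([0.1, 0.0, 3.0]), pinhole)
print(r)
```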

Visual–inertial MAP estimation

A joint visual–inertial MAP estimation, given in Eq. (23), combines the inertial and visual residuals to further refine the solution (Fig. 5c). This is a finer optimization of the inertial variables and camera poses
$$\begin{aligned} \min \left( \sum _{k}\Vert r_{\tau _{k,k+1}}\Vert _{\Sigma _{\tau _{k,k+1}}^{-1}}^2+\sum _{i=1}^{k}\sum _{j\in b_i}{\rho _\textrm{Hub}\left( \Vert r_{b_i,j}\Vert _{\Sigma _{b_i,j}^{-1}}^2\right) }\right) , \end{aligned}$$
(23)
where \(b_i\) denotes the current keyframe and its local keyframes, \(r_{\tau _{k,k+1}}\) is the inertial residual with covariance \(\Sigma _{\tau _{k,k+1}}\), \(r_{b_i,j}\) is the visual residual with covariance \(\Sigma _{b_i,j}\), and \(\rho _\textrm{Hub}\) is the robust Huber kernel used to reduce the influence of incorrect matches.
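The following minimal sketch assembles the joint cost of Eq. (23), with Mahalanobis weighting for the inertial terms and a Huber kernel on the visual terms; the residuals and information matrices fed in are placeholders.

```python
import numpy as np

def huber(e2, delta=1.345):
    """Huber robust kernel rho_Hub applied to a squared error."""
    e = np.sqrt(e2)
    return e2 if e <= delta else 2.0 * delta * e - delta**2

def joint_cost(inertial_res, inertial_infos, visual_res, visual_infos):
    """Joint visual-inertial cost of Eq. (23): Mahalanobis-weighted inertial
    residuals plus Huber-robustified visual residuals.

    *_res   : lists of residual vectors
    *_infos : matching lists of information matrices (inverse covariances)
    """
    cost = sum(r @ info @ r for r, info in zip(inertial_res, inertial_infos))
    cost += sum(huber(r @ info @ r)
                for r, info in zip(visual_res, visual_infos))
    return cost

# Illustrative call with one inertial and one visual residual
print(joint_cost([np.ones(9)], [np.eye(9)],
                 [np.array([1.5, -0.5])], [np.eye(2)]))
```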

Inertial and loop closing

In the loop closing thread, the DBoW2 bag-of-words library is utilized to solve the place recognition problem: the features of each keyframe are converted into bag-of-words vectors, and the DBoW2 database is queried to retrieve the most similar keyframes.
When a loop closure occurs, part of the sensor’s trajectory must form an almost closed, ring-like shape. This means that if the sensor revisits a previous location, it must be moving in the direction opposite to some earlier portion of its trajectory, as shown in Fig. 6. Taking advantage of this property, we propose a coarse loop-detection method that fuses IMU information. Figure 6a shows a 2D environment, which can be understood as a car driving on the ground, and Fig. 6b shows a three-dimensional space, representing the flight of a drone. We maintain a unit bearing vector, [x, y] in 2D or [x, y, z] in 3D, which is configured in the code. By analyzing the acceleration and angular velocity of the IMU for each frame, we determine the direction of the robot’s motion: we assign a value of 1 when the robot’s orientation aligns with the positive orientation of the IMU reference frame, and 0 when it is opposite. By comparing against the unit bearing vector maintained in the system, all keyframes that move in the direction opposite to the current keyframe can be identified. To expedite this process, a search threshold radius is set. In [40], a search radius of 10 m was used for Lidar SLAM, but that method is sensitive to point cloud density and depth accuracy. We set the threshold radius r to 35 times the stereo baseline, which prevents the system from missing loops while reducing unnecessary keyframe matching. This constitutes the coarse detection step based on IMU data. In the 3D case of Fig. 6b, the drone-flight scenario, the principle is the same as in 2D: the circle of radius r is simply expanded into a sphere, and possible matching keyframes are searched for inside it. If other keyframes lie within the threshold, loop detection starts: the features are converted into bag-of-words vectors and the DBoW2 database is queried. This method significantly reduces unnecessary computation and CPU usage as the number of keyframes grows, and the longer the system runs, the more evident this advantage becomes.
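A minimal sketch of this coarse candidate filter is given below, using the thresholds stated in the text (radius of 35 times the stereo baseline, opposite travel direction); the data structures are illustrative, not LVIF's internal representation.

```python
import numpy as np

def coarse_loop_candidates(current_pos, current_dir, keyframes, baseline,
                           radius_factor=35.0):
    """Coarse loop-candidate selection using IMU-derived motion direction.

    current_pos : (3,) camera centre of the current keyframe
    current_dir : unit bearing vector of the current motion (from IMU data)
    keyframes   : list of (position, direction) tuples for past keyframes
    Only keyframes inside a sphere of radius 35 x baseline that move roughly
    opposite to the current direction are passed on to the DBoW2 query.
    """
    radius = radius_factor * baseline
    candidates = []
    for idx, (pos, direction) in enumerate(keyframes):
        if np.linalg.norm(pos - current_pos) > radius:
            continue                       # outside the search sphere
        if np.dot(direction, current_dir) < 0.0:
            candidates.append(idx)         # opposite travel direction -> candidate
    return candidates

# Illustrative: two nearby keyframes, one moving the opposite way
kfs = [(np.array([0.2, 0.0, 0.0]), np.array([-1.0, 0.0, 0.0])),
       (np.array([0.3, 0.0, 0.0]), np.array([1.0, 0.0, 0.0]))]
print(coarse_loop_candidates(np.zeros(3), np.array([1.0, 0.0, 0.0]),
                             kfs, baseline=0.11))
```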
Table 1
Comparison of various state-of-the-art systems on the TUM dataset (RMS ATE in m, stereo-inertial configuration)

| Sequence | Length (m) | Loop closing | OKVIS [32] | ROVIO [35] | BASALT [6] | ORB-SLAM3 [34] | LVIF |
|---|---|---|---|---|---|---|---|
| corridor1 | 305 | YES | 0.33 | 0.47 | 0.34 | 0.03 | 0.04 |
| corridor2 | 322 | YES | 0.47 | 0.75 | 0.42 | 0.02 | 0.02 |
| corridor3 | 300 | YES | 0.57 | 0.85 | 0.35 | 0.02 | 0.01 |
| corridor4 | 114 | NO | 0.26 | 0.13 | 0.21 | 0.21 | 0.15 |
| corridor5 | 270 | YES | 0.39 | 2.09 | 0.37 | 0.01 | 0.02 |
| magistrale1 | 918 | YES | 3.49 | 4.52 | 1.2 | 0.24 | 0.21 |
| magistrale2 | 561 | YES | 2.73 | 13.43 | 1.11 | 0.52 | 0.73 |
| magistrale3 | 566 | NO | 1.22 | 14.8 | 0.74 | 1.86 | 0.68 |
| magistrale4 | 688 | YES | 0.77 | 39.73 | 1.58 | 0.16 | 0.29 |
| magistrale5 | 458 | YES | 1.62 | 3.47 | 0.6 | 1.13 | 0.78 |
| magistrale6 | 771 | NO | 3.91 | Non | 3.23 | 0.97 | 0.82 |
| outdoors1 | 2656 | NO | Non | 101.95 | 255.04 | 32.23 | 41.36 |
| outdoors2 | 1601 | NO | 73.86 | 21.67 | 64.61 | 10.42 | 6.55 |
| outdoors3 | 1531 | NO | 32.38 | 26.1 | 38.26 | 54.77 | 45.12 |
| outdoors4 | 928 | NO | 19.51 | Non | 17.53 | 11.61 | 10.37 |
| outdoors5 | 1168 | YES | 13.12 | 54.32 | 7.89 | 8.95 | 10.26 |
| outdoors6 | 2045 | NO | 96.51 | 149.14 | 65.5 | 10.7 | 12.49 |
| outdoors7 | 1748 | YES | 13.61 | 49.01 | 4.07 | 4.58 | 5.56 |
| outdoors8 | 986 | NO | 16.31 | 36.03 | 13.53 | 11.02 | 10.11 |
| room1 | 146 | YES | 0.06 | 0.16 | 0.09 | 0.01 | 0.01 |
| room2 | 142 | YES | 0.11 | 0.33 | 0.07 | 0.01 | 0.01 |
| room3 | 135 | YES | 0.07 | 0.15 | 0.13 | 0.01 | 0.01 |
| room4 | 68 | YES | 0.03 | 0.09 | 0.05 | 0.01 | 0.01 |
| room5 | 131 | YES | 0.07 | 0.12 | 0.13 | 0.01 | 0.01 |
| room6 | 67 | YES | 0.04 | 0.05 | 0.02 | 0.01 | 0.01 |
| slides1 | 289 | NO | 0.86 | 13.73 | 0.32 | 0.41 | 0.29 |
| slides2 | 299 | NO | 2.15 | 0.81 | 0.32 | 0.49 | 0.36 |
| slides3 | 383 | NO | 2.58 | 4.68 | 0.89 | 0.47 | 0.38 |

Bold numbers in the original table indicate the best result for each sequence
The data for the other systems in this table are sourced from their respective literature

Experiment

The TUM VI Benchmark [37] consists of 28 sequences captured in 6 different environments. The benchmark provides high dynamic range photometric calibration images with a resolution of 1024 \(\times \) 1024 and a frame rate of 20 Hz. The IMU measures the acceleration and angular velocity of the three axes at 200 Hz, and the camera and IMU are hardware-synchronized. However, ground-truth data are not available for all sequences, and some sequences only have ground-truth at the beginning and end. To evaluate the degree of accumulated drift, we align the estimated trajectory with the available ground-truth using an SE(3) transformation in stereo-inertial sensor configurations. All experiments are conducted on a machine with Ubuntu 18, Intel Core i5-7700 CPU at 3.6 GHz, and 16 GB memory. Only the CPU is utilized for the experiments.
To ensure real-time performance of the system, we limit the number of feature points extracted per image to 600 in the stereo-inertial sensor configurations due to the larger FOV of the fisheye camera. We perform histogram equalization on each input image, which is especially useful for handling under- or over-exposure in outdoor environments. In addition, in outdoor environments, feature points that are too far away, such as the sky or clouds, can cause unpredictable drift when calculating pose transformations. Therefore, we classify ORB points into close and far categories; the close points can be reliably estimated using disparity in the stereo camera, provided that their depth is less than 40 times the stereo baseline. Localization accuracy is evaluated with the root-mean-square absolute trajectory error (ATE)
$$\begin{aligned} \textrm{ATE}=\ \sqrt{\frac{1}{N}\sum _{i\ =1}^{N}{\Vert \textrm{trans}(T_{gt,i}^{-1}T_{\textrm{esti},i}) \Vert }^2}, \end{aligned}$$
(24)
where N represents the number of keyframes, \(\textrm{trans}(\cdot )\) extracts the translational part of the transformation, \(T_{\textrm{esti},i}\) is the estimated pose, and \(T_{gt,i}\) is the corresponding ground-truth pose.
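A minimal sketch of Eq. (24) is given below; it assumes the estimated and ground-truth trajectories have already been aligned by a common SE(3) transformation, as described above.

```python
import numpy as np

def ate_rmse(T_gt, T_est):
    """RMS absolute trajectory error of Eq. (24).

    T_gt, T_est : arrays of shape (N, 4, 4) with ground-truth and estimated
    keyframe poses, already aligned by a common SE(3) transformation.
    """
    errors = []
    for Tg, Te in zip(T_gt, T_est):
        delta = np.linalg.inv(Tg) @ Te               # relative pose error
        errors.append(np.linalg.norm(delta[:3, 3]))  # translational part
    return np.sqrt(np.mean(np.square(errors)))

# Illustrative: identical trajectories give zero error
T = np.tile(np.eye(4), (10, 1, 1))
print(ate_rmse(T, T))
```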

TUM VI sequences

We select the most representative SLAM systems for each method. ROVIO utilizes the direct method and EKF, while BASALT relies on optical flow and BA optimization. OKVIS is considered a pioneering work in the field of visual–inertial SLAM, based on nonlinear optimization. ORB-SLAM3 is considered the most advanced system currently available. The absolute trajectory error (ATE) is a metric employed to assess the accuracy of trajectory estimation algorithms. It measures the absolute positional discrepancy between corresponding positions in the estimated trajectory and the ground-truth trajectory, typically expressed in meters or another suitable distance unit. In our evaluation, we benchmark our system against state-of-the-art SLAM systems using the ATE metric, and the results are showcased in Table 1. Overall, ROVIO and BASALT rely on direct methods and optical flow without feature point extraction, resulting in limited advantages in localization accuracy. Our aim is to propose an SLAM system that is real time and computationally efficient. The ATE of LVIF is significantly lower than that of the other methods, and comparable to that of ORB-SLAM3. This demonstrates the advantages of LVIF’s medium- and long-term data association method, which enables our system to maintain good localization accuracy under lightweight conditions. In the following sections, we will analyze the performance of LVIF in different sequences and various environments.
The room environment is the simplest of all environments and provides ground-truth data for the entire trajectory. In this sequence, the slow motion of the sensor leads to no tracking failures or re-localization. Moreover, it contains many identical scenes, allowing for multiple loop closings. All systems successfully complete all sequences in this environment. However, compared to other systems, our system and ORB-SLAM3 achieve extremely low RMS ATE (m) errors, demonstrating the system’s robust place recognition and loop closure capabilities. Figure 7 shows that our system almost perfectly coincides with the ground-truth trajectories and achieves very accurate tracking in all three axes of x, y, and z.
In medium indoor environments, the corridor sequences contain a corridor and several offices, where ground truth is available for the start and end segments. The fisheye camera used in LVIF can extract more feature points, resulting in a semi-dense map, as shown in Fig. 8. The map clearly outlines the entire corridor and the offices. LVIF’s accuracy is comparable to that of ORB-SLAM3, and it still outperforms several other systems. Multiple loops can be detected in these sequences, and each time a loop is detected, a global BA is performed to correct the poses of all keyframes in the map and reduce the accumulated drift. However, in the corridor4 sequence, since the sensor does not revisit the mapped area and cannot take advantage of long-term data association, LVIF is slightly worse than ROVIO.
The magistrale sequences are recorded in a large hall, with the longest segment being over 900 m. All sequences are indoors and contain many close points, most of which can be used for pose estimation. However, these sequences are very long and would accumulate more drift if left uncorrected. The pre-integration of the IMU reduces the amount of computation required and ensures real-time performance, and the joint optimization of visual and inertial terms further improves the accuracy of the system. Thanks to these contributions, LVIF achieves very good results in most scenarios, with an error of around 1 m. These sequences only have ground truth available at the beginning and end, so part of the plot shows no fluctuations, as seen in Fig. 9b. Compared with ORB-SLAM3, the error fluctuation of our system is significantly smaller, and the mean, median, and minimum errors are relatively low (Fig. 9a).
In contrast, in the long outdoor sequences, there are many sky and cloud features which can cause severe accumulated drift. We use the strategy of separating the feature points into close and far categories based on their distance, as shown in Fig. 11. Although all systems presented in Table 1 suffer from severe accumulated drift, our system still achieves better results compared to other systems. This is due to our adoption of the MAP estimates of the inertia to rapidly optimize the parameters of the IMU, aligning the world reference frame with gravity. Furthermore, in the tracking thread, the rapidly optimized parameters replace the constant-velocity model, which increases the robustness when encountering poor textures.
The slides sequences in the TUM dataset are the most challenging environment, as the data collector traversed a black tubular slide that was almost completely devoid of visual features, as shown in Fig. 10. This proves fatal for purely visual SLAM systems, but visual–inertial SLAM systems can overcome this challenge. All systems listed in Table 1 complete the entire sequences, although some systems have larger errors. The inertial module can replace the camera for tracking, provided that the inertial variables are well optimized; otherwise, serious drift will occur, and it is difficult to correct because of the lack of feature points. When the sensor slips into the dark slide, where no feature points can be extracted, the optimized inertial sensor temporarily replaces the camera to calculate the pose transformation between frames. These sequences only have ground truth available for the start and end segments, and thus one part of the plot does not show any fluctuations, as depicted in Fig. 12. Compared with ORB-SLAM3, the error fluctuation of our system is significantly smaller, and the mean, median, and minimum errors are relatively low, as shown in Fig. 12b.

IMU and loop closure

As shown in the partial map in Fig. 13, a nearly closed trajectory shape is formed each time a loop closure occurs. We use IMU data to determine the direction of sensor displacement and employ the current keyframe’s camera center (represented in green) as the tangent point to draw a circle. This circle has a threshold radius of 35 times the stereo baseline and is used to segregate potentially matching keyframes.
The radius r is an effective filter that reduces unnecessary computational cost by eliminating many keyframes. To ensure that this approach does not miss any loops, we also conduct related experiments. Figure 14 shows the impact of various radii on loop closure. We use the mode as the evaluation criterion, since it captures the most frequent outcome. Our experiments indicate that when the radius is set to 40 times the stereo baseline, the results are identical to those achieved with a radius of 35 times, demonstrating that 40 times is unnecessarily large for the TUM VI Benchmark dataset and that a smaller radius can be chosen to further reduce CPU utilization. However, using a radius of 30 times results in the system missing several loops, indicating that such a radius negatively affects localization accuracy. Therefore, a radius of 35 times the stereo baseline is appropriate, since it preserves accuracy while reducing CPU usage. From a macro perspective, a larger radius r approaches the original DBoW2 computation, whereas a radius that is too small has a significant negative impact on loop detection.
LVIF has a significant advantage over ORB-SLAM3 in terms of CPU usage. In ORB-SLAM3, loop detection is initiated every time a keyframe enters the loop closure thread, and the DBoW2 database is searched to determine whether the robot revisits a previous location. As shown in Fig. 15, the CPU usage of ORB-SLAM3 fluctuates regularly. When the matching between the current frame and the database is successful, the CPU must process the transformations between them and use graph optimization to approximate the optimal solution, which leads to the highest CPU usage. Although LVIF has a similar trend to ORB-SLAM3, the coarse judgment of IMU data means that it is unnecessary to search the DBoW2 database every time, resulting in lower CPU usage. In Fig. 15d, the simple environment of the room2 sequence leads to low CPU usage. As the environment becomes more complex, the CPU usage increases. For instance, Magistrale6 contains indoor corridors and large halls, which are more complex than corridor3 and corridor4.

Computing time

Table 2 shows the running time of the main operations in the tracking and local mapping thread between the stereo-inertial ORB-SLAM3 and LVIF in different environments.
Compared to ORB-SLAM3, LVIF shows clear advantages in real-time operation, processing around 50 frames and approximately 8 keyframes per second. The time differences of operations such as feature extraction and stereo matching are within the allowable fluctuation range. However, the inertial-only MAP estimation renders the optimization more efficient. In the tracking thread, during feature point tracking, the optimized IMU variables provide an initial guess of the camera pose from a very early stage, reducing the time spent on BA optimization, as shown in the Track column of Table 2. Additionally, this approach reduces the burden on the CPU, as shown in Fig. 16. In all sequences, LVIF’s CPU usage is lower, which is particularly beneficial for NUC mini-computers.

Conclusion

This paper introduces a novel lightweight stereo-inertial SLAM system that leverages fisheye cameras to enhance real-time performance and reduce CPU load. By fully utilizing the large FOV of the fisheye camera, the system treats it as two monocular cameras without requiring rectification, resulting in accelerated feature point matching. Additionally, MAP estimation is employed to rapidly reduce bias and noise in the inertial data. The optimized parameters serve as the initial camera pose estimate, replacing the constant-velocity model and significantly improving tracking efficiency. In scenarios involving fast rotation, occlusion, or poor texture, the system leverages optimized inertial data as a temporary replacement for image feature point tracking, thereby enhancing system robustness. Another notable innovation is the integration of inertial data in the loop closure thread. The IMU data serve as a vital condition for loop detection, reducing unnecessary feature retrieval. The experimental results clearly demonstrate that the integration of these three techniques within the proposed LVIF system yields notable improvements in real-time performance and robustness, particularly in terms of reducing CPU consumption. This advantage is of great significance, especially for mobile robots that are equipped with low-end NUCs.
Table 2
Time comparison with ORB-SLAM3 on the TUM dataset

| Sequence | System | Tracking: ORB extract | Tracking: Track | Local mapping: IMU initialization | Local mapping: Visual–inertial BA |
|---|---|---|---|---|---|
| Room2 | ORB-SLAM3 | 12.66 | 12.11 | 0.9 | 6.49 |
| Room2 | LVIF | 14.24 | 5.94 | 1.1 | 5.36 |
| corridor3 | ORB-SLAM3 | 16.37 | 13.53 | 1.2 | 27.43 |
| corridor3 | LVIF | 15.98 | 7.65 | 0.9 | 28.67 |
| magistrale3 | ORB-SLAM3 | 17.23 | 17.65 | 1.3 | 73.34 |
| magistrale3 | LVIF | 16.54 | 10.19 | 1.7 | 78.09 |
| outdoors3 | ORB-SLAM3 | 23.41 | 12.43 | 1.5 | 28.21 |
| outdoors3 | LVIF | 22.87 | 9.11 | 1.42 | 25.44 |
| slides3 | ORB-SLAM3 | 11.65 | 13.54 | 1.13 | 37.41 |
| slides3 | LVIF | 13.34 | 9.39 | 1.25 | 35.28 |

Bold numbers in the original table indicate the best result for each operation
In the section “Inertial and loop closing”, inertial data are utilized as a coarse check for loop closure detection. Nevertheless, the optimal solution for the filter radius r is subject to variation due to different camera parameters. Therefore, our next objective is to develop a method for adaptively adjusting the filter radius r based on the characteristics of different cameras.

Declarations

Conflict of interest

We declare that we do not have any commercial or associative interest that represents a conflict of interest in connection with the work submitted.
Open Access This article is licensed under a Creative Commons Attribution 4.0 International License, which permits use, sharing, adaptation, distribution and reproduction in any medium or format, as long as you give appropriate credit to the original author(s) and the source, provide a link to the Creative Commons licence, and indicate if changes were made. The images or other third party material in this article are included in the article’s Creative Commons licence, unless indicated otherwise in a credit line to the material. If material is not included in the article’s Creative Commons licence and your intended use is not permitted by statutory regulation or exceeds the permitted use, you will need to obtain permission directly from the copyright holder. To view a copy of this licence, visit http://creativecommons.org/licenses/by/4.0/.

Publisher's Note

Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.
Literature
1. Company-Corcoles JP, Garcia-Fidalgo E, Ortiz A (2020) Lipo-lcd: combining lines and points for appearance-based loop closure detection. In: BMVC
3. Ji S, Qin Z, Shan J, Lu M (2020) Panoramic slam from a multiple fisheye camera rig. ISPRS J Photogramm Remote Sens 159:169–183
4. Coughlan JM, Yuille AL (1999) Manhattan world: compass direction from a single image by Bayesian inference. In: Proceedings of the seventh IEEE international conference on computer vision, vol 2. IEEE, pp 941–947
5. Qin T, Li P, Shen S (2018) Vins-mono: a robust and versatile monocular visual-inertial state estimator. IEEE Trans Robot 34(4):1004–1020
6. Usenko V, Demmel N, Schubert D, Stückler J, Cremers D (2019) Visual-inertial mapping with non-linear factor recovery. IEEE Robot Autom Lett 5(2):422–429
7. Cao S, Lu X, Shen S (2022) Gvins: tightly coupled gnss-visual-inertial fusion for smooth and consistent state estimation. IEEE Trans Robot 38(4):2004–21
8. Engel J, Schöps T, Cremers D (2014) Lsd-slam: large-scale direct monocular slam. In: European conference on computer vision. Springer, pp 834–849
9. Fu Q, Yu H, Wang X, Yang Z, He Y, Zhang H, Mian A (2021) Fast orb-slam without keypoint descriptors. IEEE Trans Image Process 31:1433–1446
10. Diao Y, Cen R, Xue F, Su X (2021) Orb-slam2s: a fast orb-slam2 system with sparse optical flow tracking. In: 2021 13th International Conference on Advanced Computational Intelligence (ICACI). IEEE, pp 160–165
11. Matsuki H, Von Stumberg L, Usenko V, Stückler J, Cremers D (2018) Omnidirectional dso: direct sparse odometry with fisheye cameras. IEEE Robot Autom Lett 3(4):3693–3700
12. Engel J, Koltun V, Cremers D (2017) Direct sparse odometry. IEEE Trans Pattern Anal Mach Intell 40(3):611–625
13. Caruso D, Engel J, Cremers D (2015) Large-scale direct slam for omnidirectional cameras. In: 2015 IEEE/RSJ international conference on intelligent robots and systems (IROS). IEEE, pp 141–148
14. Davison AJ (2003) Real-time simultaneous localisation and mapping with a single camera. In: IEEE international conference on computer vision, vol 3. IEEE Computer Society, p 1403
15. Stojanovic V, He S, Zhang B (2020) State and parameter joint estimation of linear stochastic systems in presence of faults and non-gaussian noises. Int J Robust Nonlinear Control 30(16):6683–6700
16.
17. Gao X, Zhang T, Liu Y, Yan Q (2017) 14 lectures on visual slam: from theory to practice. Publishing House of Electronics Industry, Beijing
18. Klein G, Murray D (2007) Parallel tracking and mapping for small ar workspaces. In: 2007 6th IEEE and ACM international symposium on mixed and augmented reality. IEEE, pp 225–234
19. Mur-Artal R, Montiel JMM, Tardos JD (2015) Orb-slam: a versatile and accurate monocular slam system. IEEE Trans Robot 31(5):1147–1163
20. Mur-Artal R, Tardós JD (2017) Orb-slam2: an open-source slam system for monocular, stereo, and rgb-d cameras. IEEE Trans Robot 33(5):1255–1262
21. Gálvez-López D, Tardos JD (2012) Bags of binary words for fast place recognition in image sequences. IEEE Trans Robot 28(5):1188–1197
22. Strasdat H, Montiel J, Davison AJ (2010) Scale drift-aware large scale monocular slam. Robot Sci Syst VI 2(3):7
23. Strasdat H, Davison AJ, Montiel JM, Konolige K (2011) Double window optimisation for constant time visual slam. In: 2011 international conference on computer vision. IEEE, pp 2352–2359
24. Lee D, Kim H, Myung H (2013) Image feature-based real-time rgb-d 3d slam with gpu acceleration. J Inst Control Robot Syst 19(5):457–461
25. Forster C, Pizzoli M, Scaramuzza D (2014) Svo: fast semi-direct monocular visual odometry. In: 2014 IEEE international conference on robotics and automation (ICRA). IEEE, pp 15–22
26. Dong X, Cheng L, Peng H, Li T (2022) Fsd-slam: a fast semi-direct slam algorithm. Complex Intell Syst 8(3):1823–1834
27. Kaiser J, Martinelli A, Fontana F, Scaramuzza D (2016) Simultaneous state initialization and gyroscope bias calibration in visual inertial aided navigation. IEEE Robot Autom Lett 2(1):18–25
28. Huang W, Liu H (2018) Online initialization and automatic camera-imu extrinsic calibration for monocular visual-inertial slam. In: 2018 IEEE international conference on robotics and automation (ICRA). IEEE, pp 5182–5189
29. Weiss S, Achtelik MW, Lynen S, Chli M, Siegwart R (2012) Real-time onboard visual-inertial state estimation and self-calibration of mavs in unknown environments. In: 2012 IEEE international conference on robotics and automation. IEEE, pp 957–964
30. Lynen S, Achtelik MW, Weiss S, Chli M, Siegwart R (2013) A robust and modular multi-sensor fusion approach applied to mav navigation. In: 2013 IEEE/RSJ international conference on intelligent robots and systems. IEEE, pp 3923–3929
31. Mourikis AI, Roumeliotis SI et al (2007) A multi-state constraint kalman filter for vision-aided inertial navigation. In: ICRA, vol 2, p 6
32. Leutenegger S, Furgale P, Rabaud V, Chli M, Konolige K, Siegwart R (2013) Keyframe-based visual-inertial slam using nonlinear optimization. In: Proceedings of robotics science and systems (RSS) 2013
33. Qin T, Pan J, Cao S, Shen S (2019) A general optimization-based framework for local odometry estimation with multiple sensors. arXiv:1901.03638
34. Campos C, Elvira R, Rodríguez JJG, Montiel JM, Tardós JD (2021) Orb-slam3: an accurate open-source library for visual, visual-inertial, and multimap slam. IEEE Trans Robot 37(6):1874–1890
35. Bloesch M, Omari S, Hutter M, Siegwart R (2015) Robust visual inertial odometry using a direct ekf-based approach. In: 2015 IEEE/RSJ international conference on intelligent robots and systems (IROS). IEEE, pp 298–304
36. Wang R, Schworer M, Cremers D (2017) Stereo dso: large-scale direct sparse visual odometry with stereo cameras. In: Proceedings of the IEEE international conference on computer vision, pp 3903–3911
37. Schubert D, Goll T, Demmel N, Usenko V, Stückler J, Cremers D (2018) The tum vi benchmark for evaluating visual-inertial odometry. In: 2018 IEEE/RSJ international conference on intelligent robots and systems (IROS). IEEE, pp 1680–1687
38. Forster C, Carlone L, Dellaert F, Scaramuzza D (2016) On-manifold preintegration for real-time visual-inertial odometry. IEEE Trans Robot 33(1):1–21
39.
40. Ebadi K, Chang Y, Palieri M, Stephens A, Hatteland A, Heiden E, Thakur A, Funabiki N, Morrell B, Wood S et al (2020) Lamp: large-scale autonomous mapping and positioning for exploration of perceptually-degraded subterranean environments. In: 2020 IEEE international conference on robotics and automation (ICRA). IEEE, pp 80–86
Metadata
Title
LVIF: a lightweight tightly coupled stereo-inertial SLAM with fisheye camera
Authors
Hongwei Zhu
Guobao Zhang
Zhiqi Ye
Hongyi Zhou
Publication date
04-08-2023
Publisher
Springer International Publishing
Published in
Complex & Intelligent Systems / Issue 1/2024
Print ISSN: 2199-4536
Electronic ISSN: 2198-6053
DOI
https://doi.org/10.1007/s40747-023-01190-5
