Open Access 24.04.2024 | Original Article

Bi-directional information guidance network for UAV vehicle detection

Authors: Jianxiu Yang, Xuemei Xie, Zhenyuan Wang, Peng Zhang, Wei Zhong

Published in: Complex & Intelligent Systems


Abstract

UAV vehicle detection based on convolutional neural networks suffers from a key problem: the information imbalance across different feature layers. Shallow features carry spatial information that benefits localization but lack semantic information; deep features, on the contrary, carry semantic information that benefits classification but lack spatial information. Accurate classification and localization of UAV vehicles, however, require both shallow spatial information and high-level semantic information. In this work, a bi-directional information guidance network (BDIG-Net) for UAV vehicle detection is proposed, which ensures that each feature prediction layer has abundant mid-/low-level spatial information and high-level semantic information. The BDIG-Net consists of two main parts: a shallow-level spatial information guidance part and a deep-level semantic information guidance part. In the shallow-level guidance part, we design a feature transform module (FTM) to supply mid-/low-level feature information, which guides the BDIG-Net to enhance detailed and spatial features for the deep layers. Furthermore, we adopt a light-weight attention module (LAM) to reduce unnecessary shallow background information, making the network focus more on small-sized vehicles. In the deep-level guidance part, we use the classical feature pyramid network to supply high-level semantic information, which guides the BDIG-Net to enhance contextual information for shallow features. Meanwhile, we design a feature enhancement module (FEM) to suppress redundant features and improve the discriminability of vehicles. The proposed BDIG-Net thus reduces the information imbalance. Experimental results show that the BDIG-Net achieves accurate classification and localization of UAV vehicles and meets real-time application requirements.

Introduction

Vehicle detection from unmanned aerial vehicle (UAV) images is a key technology in many fields, such as search and rescue [1], surveillance [2], military applications [3], and transportation [4–6], and has practical research significance and wide application value. However, accurate and fast vehicle detection remains challenging due to many issues, including, but not limited to, small-sized vehicles, low-resolution vehicles, partially occluded vehicles, vehicle scale diversity, limited datasets, and the information imbalance across different feature scales.
Owing to the powerful representation ability of convolutional neural networks (CNNs), object detection [7–11] has made significant breakthroughs on ground-level images, and vehicle detection in UAV images has also improved continuously. As with object detection in ground-level images, vehicle detectors for UAV images can be divided into two categories: two-stage and single-stage vehicle detectors. Built on two-stage detection networks such as Fast R-CNN [12] and Faster R-CNN [8], the two-stage vehicle detectors [13–15] introduce high-level contextual semantic information to enhance the feature representation of vehicles. These detectors ensure high accuracy, but are not suitable for real-time applications. Built on single-stage detection networks such as SSD [9] and the YOLO series [16–21], the single-stage vehicle detectors use a top-down architecture [22, 23] to introduce contextual information, which enhances the feature representation of vehicles. These detectors can deliver both high accuracy and real-time performance.
The detectors above only consider introducing high-level semantic information into shallow features. Information is still lost in the deeper features, however, resulting in an information imbalance that is particularly unfavorable for small-vehicle detection. Zhu et al. [24] and Wang et al. [25] showed that shallow-level detailed and spatial information is crucial for accurate target localization, especially for small targets.
To reduce the imbalance caused by the lack of spatial information in deep features, we propose a shallow-level feature information guidance part. This part passes mid-/low-level information to supply detailed and spatial information to the deeper features. An image pyramid is introduced to supplement spatial information for each feature prediction layer of the backbone network. We therefore design a feature transform module (FTM) to transform the image pyramid into mid-/low-level feature information, which preserves more detailed and spatial features for the deep layers. The FTM can be understood as a shallow light-weight network trained from scratch, which reduces the gap between classification and localization. At the same time, it contains only simple convolution and batch normalization layers and does not consume much training time.
Meanwhile, in the shallow-level feature information guidance part, we use a residual product fusion method to implement feature fusion, which guides more mid-/low-level spatial information to be embedded into the backbone network and enhances the features of UAV vehicles. Furthermore, to effectively reduce unnecessary shallow background information in the fused features, we design a light-weight attention module (LAM) that makes the network focus more on small-sized vehicles. The LAM can be understood as a spatial attention mechanism, which enhances the discriminability and robustness of features by filtering the important information on the feature maps.
In the shallow-level feature information guidance part, the FTM thus provides more detailed and spatial features, which are added to the deep prediction features through the residual product fusion module and the LAM. This reduces the imbalance caused by the lack of spatial information in deep features and enables better detection features to be learned.
Apart from this, we use the top-down architecture of the standard RefineDet [26] to introduce contextual semantic information for shallow features; this is called the deep-level semantic information guidance part. This part guides the backbone network to enhance contextual information for small-sized vehicles and reduces the imbalance caused by the lack of high-level semantic information in the shallow layers. Meanwhile, a feature enhancement module (FEM) is proposed to suppress redundant features and improve the discriminability of small-sized vehicles.
The whole structure combining the shallow-level feature information guidance part and the deep-level semantic information guidance part is called the bi-directional information guidance network (BDIG-Net). The BDIG-Net not only integrates high-level semantic information that is conducive to classification into the shallow features, but also, more importantly, integrates mid-/low-level information that is conducive to localization into the deep features. The proposed BDIG-Net therefore ensures that both mid-/low-level spatial information and high-level semantic information are abundant in each feature prediction layer, reducing the information imbalance.
In summary, we make the following main contributions:
(1)
A bi-directional information guidance network (BDIG-Net) for UAV vehicle detection is proposed, which can ensure that each prediction layer has rich mid-/low-level spatial information and high-level semantic information, and reduce the problem of information imbalance.
 
(2)
In the shallow-level guidance part, a feature transform module (FTM) is proposed to obtain abundant mid-/low-level feature information, which guides the BDIG-Net to enhance detailed and spatial features for small-sized vehicles. Besides, a light-weight attention module (LAM) is used to reduce unnecessary shallow background information, making the network focus more on small-sized vehicles. This part reduces the imbalance caused by the lack of spatial information in deep features.
 
(3)
In the deep-level guidance part, a feature enhancement module (FEM) is designed to suppress redundant features and improve the discriminability of small-sized vehicles. This part reduces the imbalance caused by the lack of high-level semantic information in shallow layers.
 
(4)
Our method achieves state-of-the-art performance on both datasets: 92.9% mean average precision (mAP) on the XDUAV dataset and 91.1% mAP on the Stanford Drone dataset. The proposed method processes 50 frames per second on a single NVIDIA 1080Ti GPU. Code is available at https://github.com/03100076/BDIG.
 
Related work

Some classical classification algorithms have been successfully applied in various fields [44, 45]. In recent years, convolutional neural networks (CNNs) have made breakthroughs in classification tasks, especially in image processing. Among them, CNN-based UAV vehicle detectors [66, 67] have attracted considerable attention from researchers. These detectors can be grouped into two-stage and single-stage vehicle detectors.

Two-stage UAV vehicle detector

The two-stage vehicle detectors [14, 15, 27–31] based on Fast R-CNN [12] and Faster R-CNN [8] enhance feature representation by introducing contextual information [32–36], which improves detection performance. Xu et al. [13] use Faster R-CNN to improve vehicle detection accuracy in low-altitude UAV imagery, but, owing to the difficulty of feature extraction, the approach does not extend to higher altitudes or multiple vehicle categories. Sommer et al. [27] extend the UAV vehicle detection task to multiple categories. Zhang et al. [37] realize dense and small vehicle detection with Cascade R-CNN [38] in UAV vision. Huang et al. [39] utilize an improved Cascade R-CNN that adds superclass detection on top of the original one, and then fuse the regression confidence and modify the loss function to enhance the detection capability. However, the two-stage UAV vehicle detection methods suffer from enormous model complexity and speed limitations.

Single-stage UAV vehicle detector

To satisfy the real-time detection requirement, single-stage vehicle detectors have been proposed and achieve performance comparable to two-stage detectors. Tang et al. [40], Radovic et al. [41], and Ringwald et al. [43] use improved versions of SSD [9], YOLOv1 [16], and YOLOv2 [17], respectively, to achieve real-time vehicle detection and tracking in traffic monitoring images. To further enhance the features of weak and small-sized vehicles, Zhang et al. [46] construct an improved 16-layer YOLOv3 network for efficient and accurate vehicle detection. Following the continuous updates of the YOLO series, Tan et al. [47] propose accurate and lightweight UAV detectors based on YOLOv4 [19], using dilated convolution and an ultra-lightweight subspace attention mechanism to enhance multi-scale feature representation and improve detection performance. Based on YOLOv5, Deng et al. [48] and Zhan et al. [49] employ different feature enhancement methods to achieve accurate and fast UAV vehicle detection. ShuffleDet [50] uses deformable and inception modules for real-time UAV vehicle detection. These real-time single-stage detectors only consider passing high-level semantic information down the convolutional network to provide contextual information for shallow features, and do not consider passing shallow-level information to the deeper features. However, shallow-level detailed and spatial information is also crucial for accurate object localization, especially for small objects.

Information imbalance for UAV vehicle detector

There is an information imbalance across different feature scales in a CNN. Shallow features with weak semantics contain spatial information that is conducive to precise localization; on the contrary, deep features with strong semantics are beneficial for classification but lack detailed information. Several studies address this imbalance at the feature level. The classical feature pyramid network (FPN) transmits high-level semantic information to shallow features, reducing the imbalance to some degree. PA-Net [51] shortens the information propagation path between low-level and high-level features by adding a bottom-up path. Libra R-CNN [52] employs a balanced feature pyramid to reduce feature-level imbalance. IPG-Net [53] introduces an image pyramid to address the problem. Regarding the information imbalance in UAV vehicle detectors, both single-stage and two-stage detectors only consider passing high-level semantic information to shallow features, and do not consider passing shallow-level information to the deeper features. Small-vehicle detection needs not only high-level semantic information that can distinguish categories, but also mid-/low-level information that can accurately describe vehicles. Motivated by the above discussion, we propose a bi-directional information guidance network to reduce the imbalance, ensuring that each prediction layer has abundant mid-/low-level feature information and high-level semantic information to improve detection performance.

Proposed method

Baseline and motivation

In this paper, we use the RefineDet framework [26] as our baseline, because it combines the real-time advantage of single-stage detectors (e.g., SSD) with the high-accuracy advantage of two-stage detectors (e.g., Faster R-CNN). The standard RefineDet adopts VGG-16 [54] as the backbone network and converts fc6 and fc7 of VGG-16 into convolution layers conv_fc6 and conv_fc7 by subsampling their parameters. It then adds two extra convolution layers, conv6_1 and conv6_2, to the end of the truncated VGG-16. Meanwhile, the standard RefineDet adopts the top-down architecture of the classical feature pyramid network (FPN) to achieve feature fusion, providing contextual information for shallow features. It uses the prediction layers conv4_3, conv5_3, conv_fc7, and conv6_2 to complete multi-scale object classification and localization.
For the UAV vehicle detection task, most targets are small and weak. On top of the standard RefineDet, we add a prediction layer conv3_3 to predict the relatively smaller vehicles. Meanwhile, because the deeper feature layers introduce a large receptive field that hurts performance on small vehicles, we remove the deeper layers conv6_1 and conv6_2. We therefore ultimately use conv3_3, conv4_3, conv5_3, and conv_fc7 as the multi-scale feature prediction layers; the subsequent experimental results confirm the effectiveness of this selection. Furthermore, since the size and aspect-ratio distributions of targets differ across UAV vehicle datasets, suitable anchors must be set. We reset the anchors according to the target distribution and the effective receptive field of vehicles in the different convolutional layers to improve the recall rate.

Overall architecture

The proposed overall architecture for UAV vehicle detection is shown in Fig. 1. There are two main parts in the BDIG-Net: the shallow-level feature information guidance part and the deep-level semantic information guidance part.
The shallow-level feature information guidance part passes mid-/low-level information to supply detailed and spatial information for vehicles. It is mainly composed of a feature transform module (FTM), a feature fusion module (FFM), and a light-weight attention module (LAM). In this part, we first use down-sampling to obtain a series of images of different resolutions that form an image pyramid, and then design the feature transform module (FTM) to extract features from these images. The extracted features preserve more mid-/low-level feature information, which guides the backbone network to enhance the detailed and spatial features of the deep layers. Meanwhile, the feature fusion module (FFM) integrates the mid-/low-level feature information provided by the FTM into the backbone network. Finally, the light-weight attention module (LAM) reduces unnecessary shallow background information in the fused features, making the network focus more on small-sized vehicles.
The deep-level semantic information guidance part passes high-level semantic information to supply contextual information for vehicles. It uses the top-down architecture of the standard RefineDet to fuse deeper semantic information, which guides the backbone network to enhance contextual information for shallow features. In this part, we design a feature enhancement module (FEM) to suppress redundant features and improve the discriminability of small-sized vehicles.
The shallow-level and deep-level information guidance parts together form the bi-directional information guidance network, providing both mid-/low-level detail information and high-level semantic information and enhancing the discriminative features of small-sized vehicles.

Feature transform module

The feature transform module (FTM), shown in Fig. 2, performs a feature transform on the input images of the image pyramid in order to obtain mid-/low-level spatial and detailed information for UAV vehicles. The FTM is a shallow light-weight network trained from scratch, which does not consume much training time. The module contains four components: a 3×3 convolutional layer with BN [55], a 1×1 convolutional layer with BN, a multi-channel dilated convolutional layer [56], and a concatenation-and-integration layer. Compared with a general convolution, a dilated convolution adds a parameter, the dilation rate r, to enlarge the receptive field while maintaining the image resolution. Different dilation rates r introduce different receptive fields and hence different features. In this work, the multi-channel parallel dilated convolutional layer uses different receptive fields to provide richer mid-/low-level feature information for UAV vehicles.
The input image from the image pyramid first passes through a 3×3 convolutional layer and a 1×1 convolutional layer to obtain the feature \(s_n\). The feature \(s_n\) is then fed into multiple parallel branches, each performing a dilated convolution with a different dilation rate r; the three-channel case is shown in Fig. 2. Three 3×3 convolutional features with dilation rates of 1, 2, and 3 are concatenated, and the channel dimension is then adjusted by a 1×1 convolutional layer. In this way features with different receptive fields are concatenated and fused, and the mid-/low-level feature information \(F_n\) for UAV vehicles is obtained. This process can be written as:
$$\begin{aligned} \begin{aligned} F = Cat(D_{3,1}(s), D_{3,2}(s), D_{3,3}(s)) \end{aligned} \end{aligned}$$
(1)
where Cat is the concatenation operation and \(D_{k,r}(s)\) is a dilated convolution with kernel size k (set to 3 in this paper) and dilation rate r. s is the input feature of the multi-channel dilated convolution, and F is its output, i.e., the extracted mid-/low-level feature information.
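To make the transform concrete, a minimal PyTorch sketch of the FTM under the three-branch setting of Eq. (1) is given below; the authors' implementation is in Caffe, and the channel widths (in_ch, mid_ch, out_ch) here are illustrative assumptions rather than reported values.

```python
import torch
import torch.nn as nn

class FeatureTransformModule(nn.Module):
    """Sketch of the FTM: 3x3 conv + BN, 1x1 conv + BN, parallel dilated 3x3
    convolutions (Eq. (1)), and a 1x1 convolution that integrates the
    concatenated branches."""

    def __init__(self, in_ch=3, mid_ch=64, out_ch=256, rates=(1, 2, 3)):
        super().__init__()
        self.stem = nn.Sequential(
            nn.Conv2d(in_ch, mid_ch, 3, padding=1), nn.BatchNorm2d(mid_ch), nn.ReLU(inplace=True),
            nn.Conv2d(mid_ch, mid_ch, 1), nn.BatchNorm2d(mid_ch), nn.ReLU(inplace=True),
        )
        # one 3x3 dilated convolution per branch; padding = rate keeps the resolution
        self.branches = nn.ModuleList(
            [nn.Conv2d(mid_ch, mid_ch, 3, padding=r, dilation=r) for r in rates]
        )
        self.integrate = nn.Conv2d(mid_ch * len(rates), out_ch, 1)

    def forward(self, image):            # `image` is one level of the image pyramid
        s = self.stem(image)             # shared feature s_n
        f = torch.cat([b(s) for b in self.branches], dim=1)   # Cat(D_{3,r}(s), ...)
        return self.integrate(f)         # mid-/low-level feature F_n
```

The ablation in Table 4 later favors the two-branch variant, i.e., rates=(1, 2).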

Feature fusion module

The idea of the FFM is to first transform the two types of features and then fuse them to achieve an augmentation effect for small-sized vehicle detection. We formulate the FFM as follows:
$$\begin{aligned} \begin{aligned} Y_n = \alpha ({G_n(k_0)}, {F_n(h_n)}), n\in [1, N] \end{aligned} \end{aligned}$$
(2)
where \(Y_n\) is the output feature of the FFM at level n. \(G_n(\cdot )\) and \(F_n(\cdot )\) correspond to the output of the backbone network and the guidance feature (the FTM output in the shallow-level part), respectively. \(\alpha (\cdot )\) is the fusion function of the FFM, which has different variants in the shallow-level and deep-level information guidance parts.
In the shallow-level feature information guidance part, \(F_n(\cdot )\) and \(h_n\) are the output and the input of the FTM at level n, respectively, while \(G_n(\cdot )\) and \(k_0\) are the level-n output and the input of the backbone network, respectively. Here \(\alpha (\cdot )\) is the residual product fusion method shown in Fig. 3a: an element-wise product is computed between \(F_n(\cdot )\) and \(G_n(\cdot )\), and the result is added to \(G_n(\cdot )\) in a residual form. The corresponding formula is as follows:
$$\begin{aligned} \begin{aligned} Y_n = W \cdot (((W_k \cdot {CT(F_n)}) \otimes (W_p \cdot G_n)) + G_n) \end{aligned} \end{aligned}$$
(3)
where \(W_p\) and \(W_k\) are 1×1 convolutional layers, and W is a 3×3 convolutional layer followed by a BN layer. \(CT(\cdot )\) is a channel-dimension transform that aligns the channel dimensions of the fused features. The term \((W_k \cdot {CT(F_n)}) \otimes (W_p \cdot G_n)\) can be regarded as the information missing from the backbone feature \(G_n(\cdot )\); adding it back to the backbone enhances the features of UAV vehicles.
In the deep-level semantic information guidance part, the fusion of \(F_n(\cdot )\) and \(G_n(\cdot )\) is the same as in the standard feature pyramid network (FPN). Here \(\alpha (\cdot )\) is the element-wise sum fusion method shown in Fig. 3b, which performs an element-wise sum of \(F_n(\cdot )\) and \(G_n(\cdot )\). In this part, \(F_n(\cdot )\) is a high-level semantic feature, which differs from the output of the FTM; its transmission provides contextual information that enhances the discriminative features of vehicles. This fusion method requires the two features to have the same dimensions, and the corresponding formula is as follows:
$$\begin{aligned} Y_n = W \cdot ((W_k \cdot {CT(F_n)}) + G_n) \end{aligned}$$
(4)
where all parameters have the same meaning as in the formulas above.
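A minimal PyTorch sketch of the two FFM variants is given below; it assumes the guidance feature has already been resized to the spatial resolution of the backbone feature, and realizes \(CT(\cdot )\) as a 1×1 convolution. This is an illustrative re-implementation, not the authors' Caffe code.

```python
import torch
import torch.nn as nn

class FeatureFusionModule(nn.Module):
    """Sketch of the FFM.  mode="product" follows Eq. (3) (residual product
    fusion, shallow-level guidance); mode="sum" follows Eq. (4) (element-wise
    sum fusion, deep-level guidance)."""

    def __init__(self, f_ch, g_ch, mode="product"):
        super().__init__()
        assert mode in ("product", "sum")
        self.mode = mode
        self.ct = nn.Conv2d(f_ch, g_ch, 1)      # CT(.): channel-dimension transform
        self.w_k = nn.Conv2d(g_ch, g_ch, 1)     # W_k
        self.w_p = nn.Conv2d(g_ch, g_ch, 1)     # W_p (used only by the product variant)
        self.w = nn.Sequential(                 # W: 3x3 conv + BN
            nn.Conv2d(g_ch, g_ch, 3, padding=1), nn.BatchNorm2d(g_ch)
        )

    def forward(self, f_n, g_n):                # f_n: guidance feature, g_n: backbone feature
        f = self.w_k(self.ct(f_n))
        if self.mode == "product":
            y = f * self.w_p(g_n) + g_n         # Eq. (3): residual product fusion
        else:
            y = f + g_n                         # Eq. (4): element-wise sum fusion
        return self.w(y)
```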

Light-weight attention module

During the passing of mid-/low-level feature information, the fused features are disturbed by irrelevant background information. Following the residual attention module [58] for image classification, attention mechanisms can suppress irrelevant background during the forward propagation of mid-/low-level information. However, the large number of parameters in such attention mechanisms increases model complexity. We therefore design a light-weight attention module (LAM), which is in essence a spatial attention mechanism, as shown in Fig. 4.
The proposed LAM consists of a mask branch (top) and a trunk branch (bottom). In the mask branch, we use a light-weight hourglass structure that performs down-sampling and up-sampling to obtain attention feature maps: a max-pooling layer and three convolutional layers perform the down-sampling phase, and a convolutional layer, a bilinear interpolation operation, and a sigmoid function perform the up-sampling phase. In the trunk branch, a single convolutional layer produces the output. Finally, the trunk output and the attention feature maps are fused by element-wise product to obtain enhanced features. The LAM is an attention module with few parameters; it reduces unnecessary background information introduced by the shallow-level guidance, focuses attention on small-sized vehicles, and enhances the discriminative features of vehicles in the shallow layers.
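A minimal sketch of the LAM as described above is shown below (PyTorch); the exact layer widths and the placement of activations are assumptions rather than reported values.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class LightWeightAttentionModule(nn.Module):
    """Sketch of the LAM: a mask branch (max-pool + three convs down, one conv +
    bilinear up-sampling + sigmoid up) gates a single-conv trunk branch by
    element-wise product."""

    def __init__(self, ch):
        super().__init__()
        self.down = nn.Sequential(
            nn.MaxPool2d(2),
            nn.Conv2d(ch, ch, 3, padding=1), nn.ReLU(inplace=True),
            nn.Conv2d(ch, ch, 3, padding=1), nn.ReLU(inplace=True),
            nn.Conv2d(ch, ch, 3, padding=1), nn.ReLU(inplace=True),
        )
        self.up = nn.Conv2d(ch, ch, 3, padding=1)
        self.trunk = nn.Conv2d(ch, ch, 3, padding=1)

    def forward(self, x):
        m = self.up(self.down(x))
        m = F.interpolate(m, size=x.shape[-2:], mode="bilinear", align_corners=False)
        mask = torch.sigmoid(m)            # spatial attention map in [0, 1]
        return self.trunk(x) * mask        # element-wise product fusion
```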

Feature enhancement module

Almost all UAV vehicle detection networks are built on general object detection networks. When these general networks are applied to small datasets, they generate a large number of redundant features for UAV vehicles, which reduces the discriminability of the vehicle features. For the UAV vehicle detection task in this work, we design a feature enhancement module (FEM) that suppresses the redundant features and improves the discriminability of small-sized vehicles. The FEM is a channel attention mechanism that quantifies the importance of each convolutional kernel in a feature layer, yielding a one-dimensional vector [57] of importances for all kernels of that layer. This vector is then used to rescale the feature map of each channel. The FEM increases the difference between vehicle features and redundant features, making the vehicle features more discriminative. The module is illustrated in Fig. 5.
The input of the FEM is the feature map of a prediction layer. We use global average pooling to obtain the response value of each channel of the feature map:
$$\begin{aligned} \begin{aligned} z_i = F_{global}(X) = {\frac{1}{H \times W}} {\sum \limits _{m = 1}^H}{\sum \limits _{n = 1}^W}{x_i(m,n)} \end{aligned} \end{aligned}$$
(5)
where \(F_{global}\) denotes global average pooling, H and W are the height and width of the feature map X, i is the channel index, and \(x_i(m,n)\) is the value at position (m, n) of channel i. All pixel values of channel i are summed and averaged to obtain the response of that channel. The output is a vector of dimension C (C = 256). To avoid an overly large vector magnitude, we normalize it with the \(L_2\) norm:
$$\begin{aligned} \begin{aligned} s_i = F_{L_2}(Z) = \frac{z_i}{\Vert z \Vert _2} = \frac{z_i}{\sqrt{{\sum \limits _{i = 1}^C}{z_i^2}}} \end{aligned} \end{aligned}$$
(6)
Finally, the normalized vector is used to scale the overall amplitude of the feature maps channel by channel: the enhanced feature map \({\tilde{x}}_i\) is obtained by multiplying the original feature map \(x_i\) by the weight \(s_i\), which makes vehicle features and redundant features more distinguishable:
$$\begin{aligned} \begin{aligned} {\tilde{x}}_i = F_{scale}(s_i, x_i) = s_i \cdot x_i \end{aligned} \end{aligned}$$
(7)
where \(\cdot \) refers to multiplying weight vectors with feature maps to scale the overall amplitude of the feature maps channel by channel.
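Because Eqs. (5)–(7) contain no learnable parameters, the FEM can be sketched as a single function (PyTorch); the small eps term is added only for numerical safety and is not part of the original formulation.

```python
import torch

def feature_enhancement(x: torch.Tensor, eps: float = 1e-12) -> torch.Tensor:
    """Sketch of the FEM on a feature map x of shape (N, C, H, W)."""
    z = x.mean(dim=(2, 3))                              # Eq. (5): global average pooling
    s = z / (z.norm(p=2, dim=1, keepdim=True) + eps)    # Eq. (6): L2 normalisation
    return x * s[:, :, None, None]                      # Eq. (7): channel-wise rescaling
```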

Experimental results and analysis

Datasets and implementation details

XDUAV Dataset. The dataset [42] contains a large number of truncated, occluded, and multi-angle small vehicles. The vehicle categories are car, bus, truck, tanker, motor, and bicycle. The whole dataset contains 4344 images, with 3475 images for training and 869 images for testing.
The Stanford Drone Dataset. The dataset [60] contains annotated videos of pedestrians, bikers, skateboarders, cars, buses, and golf carts. In this work, we choose the 3 vehicle categories (i.e., car, bus, and golf cart) as experimental data. The whole dataset contains 4331 images, with 3500 images for training and 831 images for testing.
Loss Function. During training, a multi-task (classification and regression) loss function is minimized, where the classification loss is the cross-entropy function and the regression loss is the SmoothL1 function. The total loss of the network is defined as:
$$\begin{aligned} \begin{aligned} L(p_i,t_i)= {\frac{1}{N_{pos}}}{\sum _i L_{cls}(p_i,c_i^*)}+{\frac{1}{N_{pos}}}{\sum _i{c_i^*}{L_{reg}(t_i,t_i^*)}}. \end{aligned} \end{aligned}$$
(8)
where i is the index of an anchor, \({{N}_{pos}}\) is the number of positive samples, and the label \(c_{i}^{*}\) is 1 if anchor i is positive and 0 otherwise. \(p_i\) and \(t_i\) are the predicted category and location of anchor i, respectively, and \(t_i^*\) is the ground-truth location and size of i. \(L_{cls}\) is the classification loss and \(L_{reg}\) is the regression loss.
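A simplified PyTorch sketch of Eq. (8) is given below; it omits the anchor matching, hard negative mining, and refinement steps of the full RefineDet pipeline.

```python
import torch
import torch.nn.functional as F

def detection_loss(cls_logits, box_preds, labels, box_targets):
    """Multi-task loss of Eq. (8).
    cls_logits:  (A, K+1) class scores per anchor (K vehicle classes + background)
    box_preds:   (A, 4) predicted box offsets
    labels:      (A,) integer labels, 0 = background, >0 = vehicle category
    box_targets: (A, 4) ground-truth box offsets"""
    pos = labels > 0
    n_pos = pos.sum().clamp(min=1).float()      # avoid division by zero
    l_cls = F.cross_entropy(cls_logits, labels, reduction="sum") / n_pos
    l_reg = F.smooth_l1_loss(box_preds[pos], box_targets[pos], reduction="sum") / n_pos
    return l_cls + l_reg
```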
Performance Metric. To evaluate the proposed detector on the two datasets, we use the typical PASCAL VOC metrics: average precision (AP) for a single category, mean average precision (mAP) over all categories, and detection speed in frames per second (FPS). The corresponding formulas are:
$$\begin{aligned} P=&\frac{TP}{TP+FP}, \quad R=\frac{TP}{TP+FN}\\ AP=&\int _0^1 P(R)\,\mathrm{d}R, \quad mAP={\frac{1}{n}}{\sum _{i=1}^{n} AP_i} \end{aligned}$$
(9)
where TP (true positives) is the number of correctly detected positive samples, FP (false positives) is the number of negative samples incorrectly reported as positive, and FN (false negatives) is the number of positive samples that are missed. P and R are precision and recall, AP is the area under the precision-recall curve of one category, n is the number of categories, and \(AP_i\) is the average precision of category i. The mAP measures the overall ability of the proposed detector to correctly classify and localize all UAV vehicles; the higher the mAP, the better the network performance.
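For reference, a compact NumPy sketch of the VOC-style AP computation for a single category is shown below; the mAP is the mean of the per-category AP values.

```python
import numpy as np

def average_precision(scores, is_tp, n_gt):
    """scores: detection confidences; is_tp: 1 if the detection matches an
    unmatched ground truth (IoU above threshold) else 0; n_gt: number of
    ground-truth boxes of this category."""
    order = np.argsort(-np.asarray(scores, dtype=float))
    tp = np.asarray(is_tp, dtype=float)[order]
    tp_cum = np.cumsum(tp)
    fp_cum = np.cumsum(1.0 - tp)
    recall = tp_cum / max(n_gt, 1)
    precision = tp_cum / np.maximum(tp_cum + fp_cum, 1e-12)
    # interpolated precision: maximum precision at any recall >= r
    precision = np.maximum.accumulate(precision[::-1])[::-1]
    # area under the precision-recall curve (all-point interpolation)
    return float(np.sum(np.diff(np.concatenate(([0.0], recall))) * precision))
```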
Training Implementation Details. All experiments are implemented with the open-source deep learning framework Caffe [61]. We use the VGG ILSVRC model [62] for parameter initialization. The network is optimized by stochastic gradient descent (SGD) with back-propagation; weight decay is set to 0.0005 and momentum to 0.9. The "step" strategy is adopted to adjust the learning rate: the maximum number of iterations is 120k with an initial learning rate of 0.001, reduced by a factor of 10 at iterations 80k and 100k. We train the model with batch size 16 and test with batch size 1 on an NVIDIA GTX 1080Ti GPU with CUDA 8.0 and cuDNN 7.0.
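These solver settings translate directly into, for example, the following PyTorch configuration (a sketch only; net is a placeholder module, not the actual detector, whose original implementation is in Caffe).

```python
import torch
import torch.nn as nn

net = nn.Conv2d(3, 16, 3)   # placeholder module standing in for the BDIG-Net
optimizer = torch.optim.SGD(net.parameters(), lr=1e-3, momentum=0.9, weight_decay=5e-4)
# "step" policy: 120k iterations in total, lr divided by 10 at iterations 80k and 100k
scheduler = torch.optim.lr_scheduler.MultiStepLR(optimizer, milestones=[80_000, 100_000], gamma=0.1)
```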

Ablation study

In this work, we perform ablative analysis with the XDUAV dataset to verify the effectiveness of the proposed bi-directional information guidance network.

Baseline analysis

Considering that most vehicles seen from a UAV are small and weak, we make some adjustments to the standard RefineDet baseline. Because the deeper feature layers introduce a large receptive field that hurts performance on small vehicles, we remove the deeper layers conv6_1 and conv6_2. Meanwhile, we add a prediction layer conv3_3 to predict the relatively smaller vehicles. The modified baseline is named Modifying-R.
The standard RefineDet is trained and tested mainly on the PASCAL VOC2007 dataset, whose distribution is completely different from that of the vehicle datasets; the anchor settings therefore need to be changed accordingly. Based on the overall distributions of the XDUAV and Stanford Drone datasets, the anchor settings are given in Table 1 (a configuration sketch follows the table). For the Conv3_3 and Conv_fc7 layers we set only one anchor each, which avoids the convergence problems caused by parameter redundancy during training and improves detection performance. This model is referred to as Setting-R.
Table 1
The setting of aspect ratio and scale for anchors in different prediction layers

| Layer | Aspect ratio | Scale |
|---|---|---|
| Conv3_3 | 1 | 16×16 |
| Conv4_3 | 1, 1/2, 2 | 32×32, 23×45, 45×23 |
| Conv5_3 | 1, 1/2, 2, 4 | 64×64, 24×91, 91×45, 32×128 |
| Conv_fc7 | 2 | 91×181 |
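Expressed as a configuration sketch in Python, the anchor settings of Table 1 read as follows (scales are given as width × height in pixels; the dictionary layout itself is only a notational choice).

```python
# Aspect ratios and scales per prediction layer, taken from Table 1
ANCHORS = {
    "conv3_3":  {"aspect_ratios": (1,),           "scales": [(16, 16)]},
    "conv4_3":  {"aspect_ratios": (1, 1/2, 2),    "scales": [(32, 32), (23, 45), (45, 23)]},
    "conv5_3":  {"aspect_ratios": (1, 1/2, 2, 4), "scales": [(64, 64), (24, 91), (91, 45), (32, 128)]},
    "conv_fc7": {"aspect_ratios": (2,),           "scales": [(91, 181)]},
}
```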
We conduct comparative experiments with these different settings; the results in Table 2 demonstrate the effectiveness of the baseline settings. Subsequent experiments are analyzed in detail based on the Modifying-R and Setting-R model.
Table 2
Performance comparison of different baseline settings

| Standard RefineDet | Modifying-R | Setting-R | Car | Bus | Truck | Motor | Bicycle | Tanker | mAP (%) |
|---|---|---|---|---|---|---|---|---|---|
| ✓ | | | 90.7 | 90.4 | 90.6 | 79.4 | 69.3 | 89.6 | 85.0 |
| | ✓ | | 90.8 | 90.1 | 90.7 | 88.1 | 76.7 | 89.9 | 87.7 |
| | ✓ | ✓ | 90.9 | 90.5 | 90.8 | 90.2 | 87.6 | 90.9 | **90.1** |

The per-category columns report AP (%); bold is to emphasize that the proposed setting has the highest detection accuracy

The shallow-level feature information guidance part analysis

  • The importance of mid-/low-level feature information guidance
Small-sized vehicle detection needs not only high-level semantic information that can distinguish vehicles from other categories, but also mid-/low-level information that can accurately describe them. In this section, we demonstrate the importance of the mid-/low-level information guidance part for vehicle detection. We embed mid-/low-level feature information into the prediction layers conv3_3, conv4_3, conv5_3, and conv_fc7 of the backbone network and perform comparative experiments, as shown in Table 3.
Table 3
The ablation study on the embedding mid-/low-level information into different prediction layers

| Conv3_3 | Conv4_3 | Conv5_3 | Conv_fc7 | Car | Bus | Truck | Motor | Bicycle | Tanker | mAP (%) |
|---|---|---|---|---|---|---|---|---|---|---|
| ✓ | ✓ | ✓ | ✓ | 90.8 | 90.5 | 90.7 | 90.3 | 86.7 | 97.3 | 91.1 |
| | ✓ | ✓ | ✓ | 90.9 | 90.6 | 90.9 | 90.2 | 87.9 | 97.3 | 91.3 |
| | | ✓ | ✓ | 90.8 | 90.5 | 90.8 | 90.1 | 87.2 | 99.5 | 91.5 |
| | | | ✓ | 90.8 | 90.3 | 90.8 | 90.1 | 87.3 | 90.9 | 90.1 |

The per-category columns report AP (%)
The experimental results indicate that embedding mid-/low-level information into the conv3_3 and conv4_3 layers has little effect, because these two layers are themselves shallow features of the backbone; introducing additional parameters there brings feature redundancy, burdens the network, and decreases detection performance. Adding appropriate mid-/low-level information to the conv5_3 and conv_fc7 layers, in contrast, makes up for the missing vehicle information, which is conducive to the precise localization of small-sized vehicles.
Therefore, the shallow-level feature information guidance part is important: it ensures that each prediction layer has abundant mid-/low-level feature information, improving detection performance.
Note that the detection results in Table 3 are obtained using the basic convolution operation in the feature transform module (FTM), i.e., the input is passed sequentially through a 3×3 convolutional layer, a 1×1 convolutional layer, and a 3×3 convolutional layer to obtain the mid-/low-level features, without multi-channel dilated convolutions.
  • The effectiveness of feature transform module
The shallow-level feature information guides the backbone network to enhance detailed and spatial features for small-sized vehicles. We use the FTM to transform the image pyramid into mid-/low-level feature information. To demonstrate the effectiveness of the FTM, we compare a single convolution layer with multi-channel (two- and three-channel) dilated convolution layers, as shown in Table 4.
Table 4
The ablation study on the multi-rate dilated convolution in the feature transform module

| r = 1 | r = 2 | r = 3 | Car | Bus | Truck | Motor | Bicycle | Tanker | mAP (%) | Speed (ms) |
|---|---|---|---|---|---|---|---|---|---|---|
| ✓ | | | 90.8 | 90.5 | 90.8 | 90.1 | 87.2 | 99.5 | 91.5 | 17 |
| ✓ | ✓ | | 90.8 | 90.6 | 90.8 | 90.5 | 88.7 | 99.7 | 91.9 | 18 |
| ✓ | | ✓ | 90.8 | 90.6 | 90.8 | 90.4 | 88.2 | 99.2 | 91.7 | 20 |
| | ✓ | ✓ | 90.8 | 90.5 | 90.8 | 90.3 | 87.7 | 99.2 | 91.6 | 21 |
| ✓ | ✓ | ✓ | 90.8 | 90.6 | 90.9 | 90.4 | 88.5 | 99.7 | 91.8 | 22 |

The per-category columns report AP (%)
The mAP of the two-channel dilated convolutions with r = 1, 3 and r = 2, 3 is 0.2% and 0.1% higher, respectively, than that of a single convolution layer. The reason is that a convolution with a larger dilation rate has a larger receptive field, so the contour information of vehicles can be extracted; the concatenated features with different dilation rates are therefore richer and their responses stronger. Compared with the two-channel variants with r = 1, 3 and r = 2, 3, the three-channel dilated convolution with r = 1, 2, 3 brings more diverse receptive fields, so more information from different ranges is captured and the mAP increases. However, since most vehicles seen from a UAV are weak and small, a large receptive field may also bring more background interference. As a result, the three-channel dilated convolution does not perform as well as the two-channel dilated convolution with r = 1, 2, whose features are more delicate and retain richer details, which is more conducive to small-vehicle detection.
Therefore, for the FTM we adopt the two-channel dilated convolution with r = 1, 2 to obtain information at different receptive fields and enrich the mid-/low-level features of vehicles. Dilated convolutional layers require padding of the inputs, which increases the computational complexity of the network; the per-image testing times of the different dilated convolutions, shown in the last column of Table 4, indicate that the FTM does not affect the real-time performance.
  • The significance of light-weight attention module
In the mid-/low-level information guidance part, unnecessary background information is brought into the backbone network during the passing of mid-/low-level information, which affects vehicle detection performance. Inspired by the residual attention module (RAM) [59], we design a light-weight attention module (LAM) to suppress the irrelevant background information. The experiment in the first row of Table 5 uses only the residual product fusion method to fuse the output of the FTM with the corresponding backbone features and embeds the fused information directly into the backbone without any attention module; "Without" in this row means that no spatial attention module is used. Due to the limited features of small vehicles and the influence of irrelevant background, the network's attention to small vehicles is then insufficient. As shown in rows 2 and 3 of Table 5, introducing an attention module greatly improves vehicle detection performance. Although the RAM improves detection performance, it also introduces a large number of parameters, increasing the complexity of the network and slowing down detection. The comparison of the results indicates the effectiveness of the LAM: for smaller vehicles such as bicycles in particular, the AP increases by 1.7%. The LAM captures the areas where smaller vehicles are located and invests more attention in those areas to obtain more detailed information while ignoring irrelevant information; it quickly filters out high-value information from limited attention resources without affecting real-time performance.
Table 5
The ablation study on the light-weight attention module (LAM)

| Spatial attention | Car | Bus | Truck | Motor | Bicycle | Tanker | mAP (%) | Speed (ms) |
|---|---|---|---|---|---|---|---|---|
| Without | 90.8 | 90.7 | 90.9 | 90.6 | 89.2 | 99.5 | 92.0 | 18 |
| RAM | 91.1 | 90.8 | 91.6 | 91.2 | 90.7 | 99.6 | 92.5 | 29 |
| LAM | 91.3 | 90.9 | 92.3 | 91.0 | 90.9 | 99.7 | 92.7 | 20 |

The per-category columns report AP (%)
Table 6
The effect of the feature enhancement module (FEM) in the deep-level semantic information guidance part

| Channel attention | Car | Bus | Truck | Motor | Bicycle | Tanker | mAP (%) | Speed (ms) |
|---|---|---|---|---|---|---|---|---|
| Without | 91.3 | 90.9 | 92.3 | 91.0 | 90.9 | 99.7 | 92.7 | 20 |
| FEM | 91.3 | 91.0 | 93.0 | 91.3 | 91.2 | 99.7 | 92.9 | 20 |

The per-category columns report AP (%)

The deep-level semantic information guidance part analysis

In the deep-level semantic information guidance part, the element-wise sum fusion method is used to implement the feature fusion module (FFM), which guides high-level contextual information to enhance features for small-sized vehicles. Meanwhile, a feature enhancement module (FEM) is proposed to suppress the redundant features and improve the discriminability of small-sized vehicles.
  • The necessity of feature enhancement module
As shown in Table 6, the comparison of experimental results indicates the necessity of the FEM; "Without" in the first row of Table 6 means that no channel attention module is used. The light-weight attention module (LAM) is essentially a spatial attention mechanism, while the FEM is a channel attention mechanism. We first employ the LAM to filter out irrelevant background information, and then use the FEM to improve the discriminability of small-sized vehicles. The two modules enhance the feature extraction ability of the BDIG-Net from the spatial and channel perspectives, respectively, playing complementary roles. This complementarity not only benefits classification performance but also greatly improves the localization of vehicles.

Overall performance

XDUAV dataset analysis

Owing to the feature transform module (FTM) and the light-weight attention module (LAM) in the shallow-level feature information guidance part, and the feature enhancement module (FEM) in the deep-level semantic information guidance part, the proposed BDIG-Net achieves 92.9% mAP at 50 FPS on the XDUAV dataset. In this section, we compare single-stage real-time methods and two-stage high-accuracy methods in terms of mAP and FPS, as shown in Table 7. The proposed method achieves the best performance while keeping real-time detection; all methods are trained under the same conditions, so the comparison is fair. Figure 6 shows comparison results on the XDUAV dataset. Compared with the single-directional (deep-level semantic) information guidance network (Fig. 6b), the bi-directional information guidance network (BDIG-Net) (Fig. 6a) noticeably reduces missed detections and redundant bounding boxes. Especially for vehicles with scale diversity and occlusion, the proposed detector is robust and localizes vehicles precisely. These results demonstrate the effectiveness of the BDIG-Net on the XDUAV dataset.
Table 7
Detection results (%) of different methods for the XDUAV dataset

| Method | Backbone | Input size | mAP (%) | Car | Bus | Truck | Motor | Bicycle | Tanker | FPS |
|---|---|---|---|---|---|---|---|---|---|---|
| Two-stage: | | | | | | | | | | |
| Faster R-CNN [8] | VGG-16 | ~1000×600 | 74.3 | 88.1 | 90.4 | 89.5 | 48.8 | 35.3 | 94.2 | 7 |
| Faster R-CNN [8] | ResNet-101 | ~1000×600 | 75.8 | 89.1 | 90.3 | 90.5 | 50.3 | 44.5 | 90.4 | 2.4 |
| R-FCN [10] | ResNet-101 | ~1000×600 | 85.7 | 90.8 | 90.4 | 90.8 | 71.3 | 76.9 | 94.0 | 9 |
| FPN [23] | ResNet-101 | ~1000×600 | 87.5 | 90.8 | 90.4 | 90.5 | 81.1 | 81.4 | 90.9 | 6 |
| Single-stage: | | | | | | | | | | |
| SSD [9] | VGG-16 | 300×300 | 83.3 | 90.7 | 90.0 | 90.3 | 78.3 | 60.7 | 89.6 | 59 |
| SSD [9] | ResNet-101 | 321×321 | 79.5 | 90.8 | 90.4 | 90.2 | 62.2 | 56.9 | 86.5 | 11.2 |
| RON [34] | VGG-16 | 320×320 | 75.0 | 79.2 | 90.1 | 88.0 | 50.7 | 53.1 | 89.1 | 15 |
| DSSD [63] | ResNet-101 | 321×321 | 81.6 | 90.8 | 90.2 | 90.8 | 62.6 | 66.7 | 88.6 | 9.5 |
| YOLOv2 [17] | DarkNet-19 | 416×416 | 53.7 | 66.4 | 73.1 | 77.0 | 29.3 | 12.2 | 64.1 | 64 |
| YOLOv3 [18] | DarkNet-53 | 416×416 | 83.8 | 90.0 | 85.8 | 95.5 | 77.2 | 57.7 | 96.3 | 76 |
| YOLOv5 | Focus + CSP + SPP | 640×640 | 81.8 | 88.2 | 89.6 | 89.4 | 69.9 | 61.3 | 92.2 | 172 |
| YOLOv7 [20] | CBS | 640×640 | 70.2 | 80.6 | 82.9 | 82.0 | 54.3 | 37.9 | 83.6 | 125 |
| RefineDet [26] | ShufflenetV1 [64] | 320×320 | 78.4 | 90.7 | 90.2 | 89.3 | 70.3 | 41.4 | 88.3 | 79 |
| RefineDet [26] | ShufflenetV2 [65] | 320×320 | 75.7 | 90.6 | 90.2 | 85.8 | 67.6 | 37.1 | 83.1 | 58 |
| RefineDet [26] | VGG-16 | 320×320 | 87.1 | 90.8 | 90.3 | 90.6 | 80.2 | 80.5 | 90.6 | 40 |
| Our method | VGG-16 | 320×320 | **92.9** | 91.3 | 91.0 | 93.0 | 91.3 | 91.2 | 99.7 | 50 |

The per-category columns report AP (%); bold is to emphasize that the proposed method has the highest detection accuracy

The Stanford Drone dataset analysis

To further demonstrate the practicality and robustness of the BDIG-Net detector for UAV vehicles, we evaluate its detection performance on the Stanford Drone dataset. We use the same ablation experiments as for the XDUAV dataset to analyze the shallow-level feature information guidance part and the deep-level semantic information guidance part; the results on the Stanford Drone dataset also demonstrate the effectiveness of the proposed BDIG-Net. The overall results of the different detection methods are shown in Table 8. The proposed method achieves 91.1% mAP, which is superior to the other real-time vehicle detection methods. Figure 7 shows detection results of the proposed BDIG-Net detector on the Stanford Drone dataset in different scenarios.
Table 8
Detection results (%) of different methods for the Stanford Drone dataset

| Method | Backbone | Input size | mAP (%) | Car | Bus | Golf cart | FPS |
|---|---|---|---|---|---|---|---|
| Two-stage: | | | | | | | |
| Faster R-CNN [8] | VGG-16 | ~1000×600 | 75.3 | 61.6 | 87.1 | 77.1 | 7 |
| Faster R-CNN [8] | ResNet-101 | ~1000×600 | 75.9 | 60.4 | 89.5 | 75.9 | 2.4 |
| R-FCN [10] | ResNet-101 | ~1000×600 | 86.6 | 83.6 | 90.5 | 85.8 | 9 |
| FPN [23] | ResNet-101 | ~1000×600 | 88.8 | 88.3 | 90.3 | 87.7 | 6 |
| Single-stage: | | | | | | | |
| SSD [9] | VGG-16 | 300×300 | 78.2 | 89.3 | 82.9 | 62.4 | 59 |
| SSD [9] | ResNet-101 | 321×321 | 71.9 | 45.9 | 88.4 | 81.4 | 11.2 |
| RON [34] | VGG-16 | 320×320 | 69.2 | 46.8 | 88.9 | 71.9 | 15 |
| DSSD [63] | ResNet-101 | 321×321 | 76.6 | 67.5 | 87.8 | 74.5 | 9.5 |
| YOLOv2 [17] | DarkNet-19 | 416×416 | 55.8 | 54.9 | 86.1 | 26.3 | 64 |
| YOLOv3 [18] | DarkNet-53 | 416×416 | 80.0 | 72.5 | 89.7 | 77.5 | 76 |
| YOLOv5 | Focus + CSP + SPP | 640×640 | 80.1 | 63.3 | 90.0 | 87.1 | 172 |
| YOLOv7 [20] | CBS | 640×640 | 57.1 | 17.3 | 93.5 | 60.5 | 125 |
| RefineDet [26] | ShufflenetV1 [64] | 320×320 | 71.7 | 90.6 | 85.2 | 39.3 | 79 |
| RefineDet [26] | ShufflenetV2 [65] | 320×320 | 59.3 | 90.0 | 66.0 | 22.0 | 58 |
| RefineDet [26] | VGG-16 | 320×320 | 86.1 | 90.7 | 86.2 | 81.5 | 40 |
| Our method | VGG-16 | 320×320 | **91.1** | 91.3 | 91.2 | 90.9 | 50 |

The per-category columns report AP (%); bold is to emphasize that the proposed method has the highest detection accuracy

Conclusion

This paper introduces a bi-directional information guidance network (BDIG-Net) for multi-category vehicle detection, which achieves accurate and real-time detection in UAV images. First, the BDIG-Net is divided into two parts: a shallow-level feature information guidance part and a deep-level semantic information guidance part. Second, in the shallow-level guidance part, we use the FTM to transform the image pyramid into mid-/low-level feature information, and a residual product fusion method embeds this information into the BDIG-Net; to reduce unnecessary shallow background information in the fused features, the LAM is designed to make the network focus more on small-sized vehicles. Third, in the deep-level guidance part, we use the top-down architecture of the standard RefineDet to fuse deeper semantic information, and the element-wise sum fusion method guides high-level contextual information to enhance the features of small-sized vehicles; a feature enhancement module (FEM) is further proposed to suppress redundant features and improve the discriminability of small-sized vehicles. The BDIG-Net not only integrates high-level semantic information that is conducive to classification into the shallow features, but also, more importantly, integrates mid-/low-level information that is conducive to localization into the deep features, thereby reducing the information imbalance across feature layers to some extent. The experimental results on two datasets demonstrate that the proposed method detects small-sized vehicles more accurately and achieves real-time detection. In future work, we plan to integrate inter-frame correlation and scene understanding into the UAV vehicle detection network to infer and detect even smaller and weaker vehicle targets.

Declarations

Conflict of interest

The authors declare no conflict of interest.
Open Access This article is licensed under a Creative Commons Attribution 4.0 International License, which permits use, sharing, adaptation, distribution and reproduction in any medium or format, as long as you give appropriate credit to the original author(s) and the source, provide a link to the Creative Commons licence, and indicate if changes were made. The images or other third party material in this article are included in the article's Creative Commons licence, unless indicated otherwise in a credit line to the material. If material is not included in the article's Creative Commons licence and your intended use is not permitted by statutory regulation or exceeds the permitted use, you will need to obtain permission directly from the copyright holder. To view a copy of this licence, visit http://creativecommons.org/licenses/by/4.0/.

Publisher's Note

Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.
Literatur
1.
Zurück zum Zitat Mishra B, Garg D, Narang P, Mishra V (2020) Drone-surveillance for search and rescue in natural disaster. Comput Commun 156:1–10CrossRef Mishra B, Garg D, Narang P, Mishra V (2020) Drone-surveillance for search and rescue in natural disaster. Comput Commun 156:1–10CrossRef
2.
Zurück zum Zitat Srivastava S, Narayan S, Mittal S (2021) A survey of deep learning techniques for vehicle detection from UAV images. J Syst Architect 117:102152CrossRef Srivastava S, Narayan S, Mittal S (2021) A survey of deep learning techniques for vehicle detection from UAV images. J Syst Architect 117:102152CrossRef
3.
Zurück zum Zitat Priyanka G, Bhavya P, Gaurav S, Vijay RD (2022) Edge device based military vehicle detection and classification from UAV. Multimed Tools Appl 81:19813–19834CrossRef Priyanka G, Bhavya P, Gaurav S, Vijay RD (2022) Edge device based military vehicle detection and classification from UAV. Multimed Tools Appl 81:19813–19834CrossRef
4.
Zurück zum Zitat Ke R, Li Z, Tang J, Pan Z, Wang Y (2019) Real-time traffic flow parameter estimation from UAV video based on ensemble classifier and optical flow. IEEE Trans Intell Transp Syst 20:54–64CrossRef Ke R, Li Z, Tang J, Pan Z, Wang Y (2019) Real-time traffic flow parameter estimation from UAV video based on ensemble classifier and optical flow. IEEE Trans Intell Transp Syst 20:54–64CrossRef
5.
Zurück zum Zitat Zhou H, Kong H, Wei L, Creighton D, Nahavandi S (2017) On detecting road regions in a single UAV image. IEEE Trans Intell Transp Syst 18:1713–1722CrossRef Zhou H, Kong H, Wei L, Creighton D, Nahavandi S (2017) On detecting road regions in a single UAV image. IEEE Trans Intell Transp Syst 18:1713–1722CrossRef
6.
Zurück zum Zitat Li X, Li X, Li Z, Xiong X, Khyam MO, Sun C (2021) Robust Vehicle Detection in High-Resolution Aerial Images With Imbalanced Data. IEEE Transactions on Artificial Intelligence 2:238–250CrossRef Li X, Li X, Li Z, Xiong X, Khyam MO, Sun C (2021) Robust Vehicle Detection in High-Resolution Aerial Images With Imbalanced Data. IEEE Transactions on Artificial Intelligence 2:238–250CrossRef
7.
Zurück zum Zitat He K, Zhang X, Ren S, Sun J (2015) Spatial pyramid pooling in deep convolutional networks for visual recognition. IEEE Trans Pattern Anal Mach Intell 37:1904–1916CrossRef He K, Zhang X, Ren S, Sun J (2015) Spatial pyramid pooling in deep convolutional networks for visual recognition. IEEE Trans Pattern Anal Mach Intell 37:1904–1916CrossRef
8.
Zurück zum Zitat Ren S, He K, Girshick R, Sun J (2015) Faster r-cnn: Towards real-time object detection with region proposal networks. Advances in neural information processing systems, pp. 91–99 Ren S, He K, Girshick R, Sun J (2015) Faster r-cnn: Towards real-time object detection with region proposal networks. Advances in neural information processing systems, pp. 91–99
9.
Zurück zum Zitat Liu W, Anguelov D, Erhan D, Szegedy C, Reed S, Fu CY, Berg AC (2016) Ssd: Single shot multibox detector. European conference on computer vision. Springer, pp. 21–37 Liu W, Anguelov D, Erhan D, Szegedy C, Reed S, Fu CY, Berg AC (2016) Ssd: Single shot multibox detector. European conference on computer vision. Springer, pp. 21–37
10.
Zurück zum Zitat Dai J, Li Y, He K, Sun J (2016) R-fcn: Object detection via region-based fully convolutional networks. Advances in neural information processing systems, pp. 379–387 Dai J, Li Y, He K, Sun J (2016) R-fcn: Object detection via region-based fully convolutional networks. Advances in neural information processing systems, pp. 379–387
11.
Zurück zum Zitat He K, Gkioxari G, Dollár P, Girshick R (2017) Mask r-cnn. Proceedings of the IEEE international conference on computer vision, pp. 2961–2969 He K, Gkioxari G, Dollár P, Girshick R (2017) Mask r-cnn. Proceedings of the IEEE international conference on computer vision, pp. 2961–2969
12.
Girshick R (2015) Fast R-CNN. Proceedings of the IEEE International Conference on Computer Vision, pp. 1440–1448
13.
Xu Y, Yu G, Wang Y, Wu X, Ma Y (2017) Car detection from low-altitude UAV imagery with the Faster R-CNN. Journal of Advanced Transportation 2017
14.
Sommer LW, Schuchert T, Beyerer J (2017) Fast deep vehicle detection in aerial images. 2017 IEEE Winter Conference on Applications of Computer Vision (WACV), IEEE, pp. 311–319
15.
Tang T, Zhou S, Deng Z, Zou H, Lei L (2017) Vehicle detection in aerial images based on region convolutional neural networks and hard negative example mining. Sensors 17:336
16.
Redmon J, Divvala S, Girshick R, Farhadi A (2016) You only look once: Unified, real-time object detection. Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 779–788
17.
Redmon J, Farhadi A (2017) YOLO9000: Better, faster, stronger. 2017 IEEE Conference on Computer Vision and Pattern Recognition (CVPR), IEEE, pp. 6517–6525
20.
Wang CY, Bochkovskiy A, Liao HYM (2022) YOLOv7: Trainable bag-of-freebies sets new state-of-the-art for real-time object detectors. arXiv preprint arXiv:2207.02696
22.
Shrivastava A, Sukthankar R, Malik J, Gupta A (2016) Beyond skip connections: Top-down modulation for object detection. arXiv preprint arXiv:1612.06851
23.
Lin T, Dollár P, Girshick R, He K, Hariharan B, Belongie S (2017) Feature pyramid networks for object detection. 2017 IEEE Conference on Computer Vision and Pattern Recognition (CVPR), pp. 936–944
24.
Zhu R, Zhang S, Wang X, Wen L, Shi H, Bo L, Mei T (2019) ScratchDet: Training single-shot object detectors from scratch. Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR)
25.
Wang T, Anwer RM, Cholakkal H, Khan FS, Pang Y, Shao L (2019) Learning rich features at high-speed for single-shot object detection. Proceedings of the IEEE/CVF International Conference on Computer Vision (ICCV)
26.
Zhang S, Wen L, Bian X, Lei Z, Li SZ (2018) Single-shot refinement neural network for object detection. 2018 IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 4203–4212
27.
Sommer LW, Schuchert T, Beyerer J (2017) Deep learning based multi-category object detection in aerial images. Automatic Target Recognition XXVII (Sadjadi FA, Mahalanobis A, eds), International Society for Optics and Photonics, SPIE, Vol. 10202, p. 1020209
28.
Sommer L, Schmidt N, Schumann A, Beyerer J (2018) Search area reduction Fast-RCNN for fast vehicle detection in large aerial imagery. 2018 25th IEEE International Conference on Image Processing (ICIP), pp. 3054–3058
29.
Deng Z, Sun H, Zhou S, Zhao J, Zou H (2017) Toward fast and accurate vehicle detection in aerial images using coupled region-based convolutional neural networks. IEEE Journal of Selected Topics in Applied Earth Observations and Remote Sensing 10:3652–3664
30.
Mittal P, Singh R, Sharma A (2020) Deep learning-based object detection in low-altitude UAV datasets: A survey. Image and Vision Computing 104:104046
31.
Bayhan E, Ozkan Z, Namdar M, Basgumus A (2021) Deep learning based object detection and recognition of unmanned aerial vehicles. 2021 3rd International Congress on Human-Computer Interaction, Optimization and Robotic Applications (HORA), pp. 1–5
32.
Hu P, Ramanan D (2017) Finding tiny faces. 2017 IEEE Conference on Computer Vision and Pattern Recognition (CVPR), IEEE, pp. 1522–1530
33.
Woo S, Hwang S, Kweon IS (2018) StairNet: Top-down semantic aggregation for accurate one shot detection. 2018 IEEE Winter Conference on Applications of Computer Vision (WACV), IEEE, pp. 1093–1102
34.
Kong T, Sun F, Yao A, Liu H, Lu M, Chen Y (2017) RON: Reverse connection with objectness prior networks for object detection. Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 5936–5944
35.
Kong T, Sun F, Tan C, Liu H, Huang W (2018) Deep feature pyramid reconfiguration for object detection. Proceedings of the European Conference on Computer Vision (ECCV), pp. 169–185
36.
Kong T, Yao A, Chen Y, Sun F (2016) HyperNet: Towards accurate region proposal generation and joint object detection. Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 845–853
37.
Zhang X, Izquierdo E, Chandramouli K (2019) Dense and small object detection in UAV vision based on cascade network. The IEEE International Conference on Computer Vision (ICCV) Workshops
38.
Cai Z, Vasconcelos N (2018) Cascade R-CNN: Delving into high quality object detection. The IEEE Conference on Computer Vision and Pattern Recognition (CVPR), pp. 6154–6162
39.
Huang H, Li L, Ma H (2022) An improved Cascade R-CNN-based target detection algorithm for UAV aerial images. 2022 7th International Conference on Image, Vision and Computing (ICIVC), pp. 232–237
40.
Tang T, Deng Z, Zhou S, Lei L, Zou H (2017) Fast vehicle detection in UAV images. 2017 International Workshop on Remote Sensing with Intelligent Processing (RSIP), IEEE, pp. 1–5
41.
Radovic M, Adarkwa O, Wang Q (2017) Object recognition in aerial images using convolutional neural networks. Journal of Imaging 3:21
43.
Ringwald T, Sommer L, Schumann A, Beyerer J, Stiefelhagen R (2019) UAV-Net: A fast aerial vehicle detector for mobile platforms. The IEEE Conference on Computer Vision and Pattern Recognition (CVPR) Workshops
44.
Borlea ID, Precup RE, Borlea AB (2022) Improvement of K-means cluster quality by post processing resulted clusters. Procedia Computer Science 199:63–70
45.
Protic D, Stankovic M (2023) XOR-based detector of different decisions on anomalies in the computer network traffic. Science and Technology 26:323–338
46.
Zhang X, Zhu X (2019) Vehicle detection in the aerial infrared images via an improved YOLOv3 network. 2019 IEEE 4th International Conference on Signal and Image Processing (ICSIP), pp. 372–376
47.
Tan L, Lv X, Lian X, Wang G (2021) YOLOv4_Drone: UAV image target detection based on an improved YOLOv4 algorithm. Computers & Electrical Engineering 93:107261
48.
Deng L, Liu Z, Wang J, Yang B (2023) ATT-YOLOv5-Ghost: Water surface object detection in complex scenes. Journal of Real-Time Image Processing 20:97
49.
Zhan W, Sun C, Wang M, She J, Zhang Y, Zhang Z, Sun Y (2022) An improved YOLOv5 real-time detection method for small objects captured by UAV. Soft Computing 26:362–373
50.
Majid Azimi S (2018) ShuffleDet: Real-time vehicle detection network in on-board embedded UAV imagery. The European Conference on Computer Vision (ECCV) Workshops
51.
Liu S, Qi L, Qin H, Shi J, Jia J (2018) Path aggregation network for instance segmentation. Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR), pp. 8759–8768
52.
Pang J, Chen K, Shi J, Feng H, Ouyang W, Lin D (2019) Libra R-CNN: Towards balanced learning for object detection. Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), pp. 821–830
53.
Liu Z, Gao G, Sun L, Fang L (2020) IPG-Net: Image pyramid guidance network for small object detection. Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR) Workshops, pp. 1026–1027
54.
55.
Ioffe S, Szegedy C (2015) Batch normalization: Accelerating deep network training by reducing internal covariate shift. arXiv preprint arXiv:1502.03167
56.
Liu S, Huang D, Wang Y (2018) Receptive Field Block Net for accurate and fast object detection. Proceedings of the European Conference on Computer Vision (ECCV), pp. 385–400
57.
Hu J, Shen L, Sun G (2018) Squeeze-and-excitation networks. Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR), pp. 7132–7141
58.
Wang F, Jiang M, Qian C, Yang S, Li C, Zhang H, Wang X, Tang X (2017) Residual attention network for image classification. Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR), pp. 3156–3164
59.
Lim JS, Astrid M, Yoon HJ, Lee SI (2021) Small object detection using context and attention. 2021 International Conference on Artificial Intelligence in Information and Communication (ICAIIC), pp. 181–186
60.
Robicquet A, Sadeghian A, Alahi A, Savarese S (2016) Learning social etiquette: Human trajectory understanding in crowded scenes. Computer Vision – ECCV 2016 (Leibe B, Matas J, Sebe N, Welling M, eds), Springer International Publishing, Cham, pp. 549–565
61.
Jia Y, Shelhamer E, Donahue J, Karayev S, Long J, Girshick R, Guadarrama S, Darrell T (2014) Caffe: Convolutional architecture for fast feature embedding. Proceedings of the 22nd ACM International Conference on Multimedia, ACM, New York, NY, USA, pp. 675–678
62.
Russakovsky O, Deng J, Su H, Krause J, Satheesh S, Ma S, Huang Z, Karpathy A, Khosla A, Bernstein M, Berg AC, Fei-Fei L (2015) ImageNet large scale visual recognition challenge. International Journal of Computer Vision 115:211–252
64.
Zhang X, Zhou X, Lin M, Sun J (2018) ShuffleNet: An extremely efficient convolutional neural network for mobile devices. Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR), pp. 6848–6856
65.
Ma N, Zhang X, Zheng HT, Sun J (2018) ShuffleNet V2: Practical guidelines for efficient CNN architecture design. Proceedings of the European Conference on Computer Vision (ECCV), pp. 116–131
66.
Bouguettaya A, Zarzour H, Kechida A, Taberkit AM (2022) Vehicle detection from UAV imagery with deep learning: A review. IEEE Transactions on Neural Networks and Learning Systems 33:6047–6067
67.
Ye T, Qin W, Li Y, Wang S, Zhang J, Zhao Z (2022) Dense and small object detection in UAV-vision based on a global-local feature enhanced network. IEEE Transactions on Instrumentation and Measurement 71:1–13
Metadata
Title: Bi-directional information guidance network for UAV vehicle detection
Authors: Jianxiu Yang, Xuemei Xie, Zhenyuan Wang, Peng Zhang, Wei Zhong
Publication date: 24.04.2024
Publisher: Springer International Publishing
Published in: Complex & Intelligent Systems
Print ISSN: 2199-4536
Electronic ISSN: 2198-6053
DOI: https://doi.org/10.1007/s40747-024-01429-9
