Open Access 24.04.2024 | Original Article

Bi-directional information guidance network for UAV vehicle detection

Authors: Jianxiu Yang, Xuemei Xie, Zhenyuan Wang, Peng Zhang, Wei Zhong

Published in: Complex & Intelligent Systems


Abstract

UAV vehicle detection based on convolutional neural networks suffers from a key problem: the information imbalance across different feature layers. Shallow features carry spatial information that benefits localization but lack semantic information; deep features, on the contrary, carry semantic information that benefits classification but lack spatial information. Accurate classification and localization of UAV vehicles, however, require both shallow spatial information and high-level semantic information. In this work, a bi-directional information guidance network (BDIG-Net) for UAV vehicle detection is proposed, which ensures that each feature prediction layer has abundant mid-/low-level spatial information and high-level semantic information. The BDIG-Net consists of two main parts: a shallow-level spatial information guidance part and a deep-level semantic information guidance part. In the shallow-level guidance part, we design a feature transform module (FTM) to supply mid-/low-level feature information, which guides the BDIG-Net to enhance detailed and spatial features for the deep layers. Furthermore, we adopt a light-weight attention module (LAM) to reduce unnecessary shallow background information, making the network focus more on small-sized vehicles. In the deep-level guidance part, we use the classical feature pyramid network to supply high-level semantic information, which guides the BDIG-Net to enhance contextual information for shallow features. Meanwhile, we design a feature enhancement module (FEM) to suppress redundant features and improve the discriminability of vehicles. The proposed BDIG-Net thus reduces the information imbalance. Experimental results show that the BDIG-Net achieves accurate classification and localization of UAV vehicles and meets real-time application requirements.

Introduction

Vehicle detection from unmanned aerial vehicle (UAV) images is a key technology in many fields, such as search and rescue [1], surveillance [2], military applications [3], and transportation [4–6], and has practical research significance and wide application value. However, accurate and fast vehicle detection remains challenging due to many issues, including, but not limited to, small-sized vehicles, low-resolution vehicles, partially occluded vehicles, vehicle scale diversity, limited datasets, and the information imbalance across different feature scales.
Owing to the powerful representation ability of convolutional neural networks (CNNs), object detection [7–11] has made significant breakthroughs on ground-level images, and vehicle detection in UAV images has also improved continuously. As with object detection in ground-level images, vehicle detectors for UAV images can be divided into two categories: two-stage and single-stage vehicle detectors. Built on two-stage detection networks such as Fast R-CNN [12] and Faster R-CNN [8], the two-stage vehicle detectors [13–15] introduce high-level contextual semantic information to enhance the feature representation of vehicles. These detectors ensure high accuracy, but are not suitable for real-time applications. Built on single-stage detection networks such as SSD [9] and the YOLO series [16–21], the single-stage vehicle detectors use a top-down architecture [22, 23] to introduce contextual information, which enhances the feature representation of vehicles. These detectors can deliver both high accuracy and real-time performance.
The detectors above only consider introducing high-level semantic information into shallow features. Information is still lost in the deeper features, however, resulting in an information imbalance that is particularly unfavorable for small-vehicle detection. Zhu et al. [24] and Wang et al. [25] showed that shallow-level detailed and spatial information is crucial for accurate target localization, especially for small targets.
To reduce the imbalance caused by the lack of spatial information in deep features, we propose a shallow-level feature information guidance part. This part passes mid-/low-level information to supply detailed and spatial information to the deeper features. An image pyramid is introduced to supplement spatial information for each feature prediction layer of the backbone network. We therefore design a feature transform module (FTM) to transform the image pyramid into mid-/low-level feature information, which preserves more detailed and spatial features for the deep layers. The FTM can be understood as a shallow light-weight network trained from scratch, which reduces the gap between classification and localization. At the same time, it contains only simple convolution and batch normalization layers and does not consume much training time.
Meanwhile, in the shallow-level feature information guidance part, we use a residual product fusion method to implement feature fusion, which guides more mid-/low-level spatial information to be embedded into the backbone network and enhances the features of UAV vehicles. Furthermore, to effectively reduce unnecessary shallow background information in the fused features, we design a light-weight attention module (LAM) that makes the network focus more on small-sized vehicles. The LAM can be understood as a spatial attention mechanism, which enhances the discriminability and robustness of features by filtering the important information on the feature maps.
In the shallow-level feature information guidance part, the FTM thus provides more detailed and spatial features, which are added to the deep prediction features through the residual product fusion module and the LAM. This reduces the imbalance caused by the lack of spatial information in deep features and enables better detection features to be learned.
Apart from this, we use the top-down architecture of the standard RefineDet [26] to introduce contextual semantic information for shallow features; this is called the deep-level semantic information guidance part. This part guides the backbone network to enhance contextual information for small-sized vehicles and reduces the imbalance caused by the lack of high-level semantic information in the shallow layers. Meanwhile, a feature enhancement module (FEM) is proposed to suppress redundant features and improve the discriminability of small-sized vehicles.
The whole structure combining the shallow-level feature information guidance part and the deep-level semantic information guidance part is called the bi-directional information guidance network (BDIG-Net). The BDIG-Net not only integrates high-level semantic information that is conducive to classification into the shallow features, but also, more importantly, integrates mid-/low-level information that is conducive to localization into the deep features. The proposed BDIG-Net therefore ensures that both mid-/low-level spatial information and high-level semantic information are abundant in each feature prediction layer, reducing the information imbalance.
In summary, we make the following main contributions:
(1)
A bi-directional information guidance network (BDIG-Net) for UAV vehicle detection is proposed, which can ensure that each prediction layer has rich mid-/low-level spatial information and high-level semantic information, and reduce the problem of information imbalance.
 
(2)
In the shallow-level guidance part, a feature transform module (FTM) is proposed to obtain abundant mid-/low-level feature information, which guides the BDIG-Net to enhance detailed and spatial features for small-sized vehicles. Besides, a light-weight attention module (LAM) is used to reduce unnecessary shallow background information, making the network focus more on small-sized vehicles. This part reduces the imbalance caused by the lack of spatial information in deep features.
 
(3)
In the deep-level guidance part, a feature enhancement module (FEM) is designed to suppress redundant features and improve the discriminability of small-sized vehicles. This part reduces the imbalance caused by the lack of high-level semantic information in shallow layers.
 
(4)
Our method achieves state-of-the-art performance on both datasets: 92.9% mean average precision (mAP) on the XDUAV dataset and 91.1% mAP on the Stanford Drone dataset. The proposed method processes 50 frames per second on a single NVIDIA 1080Ti GPU. Code is available at https://github.com/03100076/BDIG.
 
Related work

Some classical classification algorithms have been successfully applied in various fields [44, 45]. In recent years, convolutional neural networks (CNNs) have made breakthroughs in classification tasks, especially in image processing. Among them, CNN-based UAV vehicle detectors [66, 67] have attracted considerable attention from researchers. These detectors can be grouped into two-stage and single-stage vehicle detectors.

Two-stage UAV vehicle detector

The two-stage vehicle detectors [14, 15, 27–31] based on Fast R-CNN [12] and Faster R-CNN [8] enhance feature representation by introducing contextual information [32–36], which improves detection performance. Xu et al. [13] use Faster R-CNN to improve vehicle detection accuracy in low-altitude UAV imagery, but, owing to the difficulty of feature extraction, the approach does not extend to higher altitudes or multiple vehicle categories. Sommer et al. [27] extend the UAV vehicle detection task to multiple categories. Zhang et al. [37] realize dense and small vehicle detection with Cascade R-CNN [38] in UAV vision. Huang et al. [39] utilize an improved Cascade R-CNN that adds superclass detection on top of the original one, and then fuse the regression confidence and modify the loss function to enhance the detection capability. However, the two-stage UAV vehicle detection methods suffer from enormous model complexity and speed limitations.

Single-stage UAV vehicle detector

To satisfy the real-time detection requirement, single-stage vehicle detectors have been proposed and achieve performance comparable to two-stage detectors. Tang et al. [40], Radovic et al. [41], and Ringwald et al. [43] use improved versions of SSD [9], YOLOv1 [16], and YOLOv2 [17], respectively, to achieve real-time vehicle detection and tracking in traffic monitoring images. To further enhance the features of weak and small-sized vehicles, Zhang et al. [46] construct an improved 16-layer YOLOv3 network for efficient and accurate vehicle detection. Following the continuous updates of the YOLO series, Tan et al. [47] propose accurate and lightweight UAV detectors based on YOLOv4 [19], using dilated convolution and an ultra-lightweight subspace attention mechanism to enhance multi-scale feature representation and improve detection performance. Based on YOLOv5, Deng et al. [48] and Zhan et al. [49] employ different feature enhancement methods to achieve accurate and fast UAV vehicle detection. ShuffleDet [50] uses deformable and inception modules for real-time UAV vehicle detection. These real-time single-stage detectors only consider passing high-level semantic information down the convolutional network to provide contextual information for shallow features, and do not consider passing shallow-level information to the deeper features. However, shallow-level detailed and spatial information is also crucial for accurate object localization, especially for small objects.

Information imbalance for UAV vehicle detector

There is an information imbalance across different feature scales in a CNN. Shallow features with weak semantics contain spatial information that is conducive to precise localization; on the contrary, deep features with strong semantics are beneficial for classification but lack detailed information. Several studies address this imbalance at the feature level. The classical feature pyramid network (FPN) transmits high-level semantic information to shallow features, reducing the imbalance to some degree. PA-Net [51] shortens the information propagation path between low-level and high-level features by adding a bottom-up path. Libra R-CNN [52] employs a balanced feature pyramid to reduce feature-level imbalance. IPG-Net [53] introduces an image pyramid to address the problem. Regarding the information imbalance in UAV vehicle detectors, both single-stage and two-stage detectors only consider passing high-level semantic information to shallow features, and do not consider passing shallow-level information to the deeper features. Small-vehicle detection needs not only high-level semantic information that can distinguish categories, but also mid-/low-level information that can accurately describe vehicles. Motivated by the above discussion, we propose a bi-directional information guidance network to reduce the imbalance, ensuring that each prediction layer has abundant mid-/low-level feature information and high-level semantic information to improve detection performance.

Proposed method

Baseline and motivation

In this paper, we use the RefineDet framework [26] as our baseline, because it combines the real-time advantage of single-stage detectors (e.g., SSD) with the high-accuracy advantage of two-stage detectors (e.g., Faster R-CNN). The standard RefineDet adopts VGG-16 [54] as the backbone network and converts fc6 and fc7 of VGG-16 into convolution layers conv_fc6 and conv_fc7 by subsampling their parameters. It then adds two extra convolution layers, conv6_1 and conv6_2, to the end of the truncated VGG-16. Meanwhile, the standard RefineDet adopts the top-down architecture of the classical feature pyramid network (FPN) to achieve feature fusion, providing contextual information for shallow features. It uses the prediction layers conv4_3, conv5_3, conv_fc7, and conv6_2 to complete multi-scale object classification and localization.
For the UAV vehicle detection task, most targets are small and weak. On top of the standard RefineDet, we add a prediction layer conv3_3 to predict the relatively smaller vehicles. Meanwhile, because the deeper feature layers introduce a large receptive field that hurts performance on small vehicles, we remove the deeper layers conv6_1 and conv6_2. We therefore ultimately use conv3_3, conv4_3, conv5_3, and conv_fc7 as the multi-scale feature prediction layers; the subsequent experimental results confirm the effectiveness of this selection. Furthermore, since the size and aspect-ratio distributions of targets differ across UAV vehicle datasets, suitable anchors must be set. We reset the anchors according to the target distribution and the effective receptive field of vehicles in the different convolutional layers to improve the recall rate.

Overall architecture

The proposed overall architecture for UAV vehicle detection is shown in Fig. 1. There are two main parts in the BDIG-Net: the shallow-level feature information guidance part and the deep-level semantic information guidance part.
The shallow-level feature information guidance part passes mid-/low-level information to supply detailed and spatial information for vehicles. It is mainly composed of a feature transform module (FTM), a feature fusion module (FFM), and a light-weight attention module (LAM). In this part, we first use down-sampling to obtain a series of images of different resolutions that form an image pyramid, and then design the feature transform module (FTM) to extract features from these images. The extracted features preserve more mid-/low-level feature information, which guides the backbone network to enhance the detailed and spatial features of the deep layers. Meanwhile, the feature fusion module (FFM) integrates the mid-/low-level feature information provided by the FTM into the backbone network. Finally, the light-weight attention module (LAM) reduces unnecessary shallow background information in the fused features, making the network focus more on small-sized vehicles.
The deep-level semantic information guidance part passes high-level semantic information to supply contextual information for vehicles. It uses the top-down architecture of the standard RefineDet to fuse deeper semantic information, which guides the backbone network to enhance contextual information for shallow features. In this part, we design a feature enhancement module (FEM) to suppress redundant features and improve the discriminability of small-sized vehicles.
The shallow-level and deep-level information guidance parts together form the bi-directional information guidance network, providing both mid-/low-level detail information and high-level semantic information and enhancing the discriminative features of small-sized vehicles.

Feature transform module

The feature transform module (FTM), shown in Fig. 2, performs a feature transform on the input images of the image pyramid in order to obtain mid-/low-level spatial and detailed information for UAV vehicles. The FTM is a shallow light-weight network trained from scratch, which does not consume much training time. The module contains four components: a 3×3 convolutional layer with BN [55], a 1×1 convolutional layer with BN, a multi-channel dilated convolutional layer [56], and a concatenation-and-integration layer. Compared with a general convolution, a dilated convolution adds a parameter, the dilation rate r, to enlarge the receptive field while maintaining the image resolution. Different dilation rates r introduce different receptive fields and hence different features. In this work, the multi-channel parallel dilated convolutional layer uses different receptive fields to provide richer mid-/low-level feature information for UAV vehicles.
The input image from the image pyramid first passes through a 3×3 convolutional layer and a 1×1 convolutional layer to obtain the feature \(s_n\). The feature \(s_n\) is then fed into multiple parallel branches, each performing a dilated convolution with a different dilation rate r; the three-channel case is shown in Fig. 2. Three 3×3 convolutional features with dilation rates of 1, 2, and 3 are concatenated, and the channel dimension is then adjusted by a 1×1 convolutional layer. In this way features with different receptive fields are concatenated and fused, and the mid-/low-level feature information \(F_n\) for UAV vehicles is obtained. This process can be written as:
$$\begin{aligned} \begin{aligned} F = Cat(D_{3,1}(s), D_{3,2}(s), D_{3,3}(s)) \end{aligned} \end{aligned}$$
(1)
where Cat is the concatenation operation and \(D_{k,r}(s)\) is a dilated convolution with kernel size k (set to 3 in this paper) and dilation rate r. s is the input feature of the multi-channel dilated convolution, and F is its output, i.e., the extracted mid-/low-level feature information.
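To make the transform concrete, a minimal PyTorch sketch of the FTM under the three-branch setting of Eq. (1) is given below; the authors' implementation is in Caffe, and the channel widths (in_ch, mid_ch, out_ch) here are illustrative assumptions rather than reported values.

```python
import torch
import torch.nn as nn

class FeatureTransformModule(nn.Module):
    """Sketch of the FTM: 3x3 conv + BN, 1x1 conv + BN, parallel dilated 3x3
    convolutions (Eq. (1)), and a 1x1 convolution that integrates the
    concatenated branches."""

    def __init__(self, in_ch=3, mid_ch=64, out_ch=256, rates=(1, 2, 3)):
        super().__init__()
        self.stem = nn.Sequential(
            nn.Conv2d(in_ch, mid_ch, 3, padding=1), nn.BatchNorm2d(mid_ch), nn.ReLU(inplace=True),
            nn.Conv2d(mid_ch, mid_ch, 1), nn.BatchNorm2d(mid_ch), nn.ReLU(inplace=True),
        )
        # one 3x3 dilated convolution per branch; padding = rate keeps the resolution
        self.branches = nn.ModuleList(
            [nn.Conv2d(mid_ch, mid_ch, 3, padding=r, dilation=r) for r in rates]
        )
        self.integrate = nn.Conv2d(mid_ch * len(rates), out_ch, 1)

    def forward(self, image):            # `image` is one level of the image pyramid
        s = self.stem(image)             # shared feature s_n
        f = torch.cat([b(s) for b in self.branches], dim=1)   # Cat(D_{3,r}(s), ...)
        return self.integrate(f)         # mid-/low-level feature F_n
```

The ablation in Table 4 later favors the two-branch variant, i.e., rates=(1, 2).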

Feature fusion module

The idea of the FFM is to first transform the two types of features and then fuse them to achieve an augmentation effect for small-sized vehicle detection. We formulate the FFM as follows:
$$\begin{aligned} \begin{aligned} Y_n = \alpha ({G_n(k_0)}, {F_n(h_n)}), n\in [1, N] \end{aligned} \end{aligned}$$
(2)
where \(Y_n\) is the output feature of the FFM at level n. \(G_n(\cdot )\) and \(F_n(\cdot )\) correspond to the output of the backbone network and the guidance feature (the FTM output in the shallow-level part), respectively. \(\alpha (\cdot )\) is the fusion function of the FFM, which has different variants in the shallow-level and deep-level information guidance parts.
In the shallow-level feature information guidance part, \(F_n(\cdot )\) and \(h_n\) are the output and the input of the FTM at level n, respectively, while \(G_n(\cdot )\) and \(k_0\) are the level-n output and the input of the backbone network, respectively. Here \(\alpha (\cdot )\) is the residual product fusion method shown in Fig. 3a: an element-wise product is computed between \(F_n(\cdot )\) and \(G_n(\cdot )\), and the result is added to \(G_n(\cdot )\) in a residual form. The corresponding formula is as follows:
$$\begin{aligned} \begin{aligned} Y_n = W \cdot (((W_k \cdot {CT(F_n)}) \otimes (W_p \cdot G_n)) + G_n) \end{aligned} \end{aligned}$$
(3)
where \(W_p\) and \(W_k\) are 1×1 convolutional layers, and W is a 3×3 convolutional layer followed by a BN layer. \(CT(\cdot )\) is a channel-dimension transform that aligns the channel dimensions of the fused features. The term \((W_k \cdot {CT(F_n)}) \otimes (W_p \cdot G_n)\) can be regarded as the information missing from the backbone feature \(G_n(\cdot )\); adding it back to the backbone enhances the features of UAV vehicles.
In the deep-level semantic information guidance part, the fusion of \(F_n(\cdot )\) and \(G_n(\cdot )\) is the same as in the standard feature pyramid network (FPN). Here \(\alpha (\cdot )\) is the element-wise sum fusion method shown in Fig. 3b, which performs an element-wise sum of \(F_n(\cdot )\) and \(G_n(\cdot )\). In this part, \(F_n(\cdot )\) is a high-level semantic feature, which differs from the output of the FTM; its transmission provides contextual information that enhances the discriminative features of vehicles. This fusion method requires the two features to have the same dimensions, and the corresponding formula is as follows:
$$\begin{aligned} Y_n = W \cdot ((W_k \cdot {CT(F_n)}) + G_n) \end{aligned}$$
(4)
where all parameters have the same meaning as in the formulas above.
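A minimal PyTorch sketch of the two FFM variants is given below; it assumes the guidance feature has already been resized to the spatial resolution of the backbone feature, and realizes \(CT(\cdot )\) as a 1×1 convolution. This is an illustrative re-implementation, not the authors' Caffe code.

```python
import torch
import torch.nn as nn

class FeatureFusionModule(nn.Module):
    """Sketch of the FFM.  mode="product" follows Eq. (3) (residual product
    fusion, shallow-level guidance); mode="sum" follows Eq. (4) (element-wise
    sum fusion, deep-level guidance)."""

    def __init__(self, f_ch, g_ch, mode="product"):
        super().__init__()
        assert mode in ("product", "sum")
        self.mode = mode
        self.ct = nn.Conv2d(f_ch, g_ch, 1)      # CT(.): channel-dimension transform
        self.w_k = nn.Conv2d(g_ch, g_ch, 1)     # W_k
        self.w_p = nn.Conv2d(g_ch, g_ch, 1)     # W_p (used only by the product variant)
        self.w = nn.Sequential(                 # W: 3x3 conv + BN
            nn.Conv2d(g_ch, g_ch, 3, padding=1), nn.BatchNorm2d(g_ch)
        )

    def forward(self, f_n, g_n):                # f_n: guidance feature, g_n: backbone feature
        f = self.w_k(self.ct(f_n))
        if self.mode == "product":
            y = f * self.w_p(g_n) + g_n         # Eq. (3): residual product fusion
        else:
            y = f + g_n                         # Eq. (4): element-wise sum fusion
        return self.w(y)
```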

Light-weight attention module

During the passing of mid-/low-level feature information, the fused features are disturbed by irrelevant background information. Following the residual attention module [58] for image classification, attention mechanisms can suppress irrelevant background during the forward propagation of mid-/low-level information. However, the large number of parameters in such attention mechanisms increases model complexity. We therefore design a light-weight attention module (LAM), which is in essence a spatial attention mechanism, as shown in Fig. 4.
The proposed LAM consists of a mask branch (top) and a trunk branch (bottom). In the mask branch, we use a light-weight hourglass structure that performs down-sampling and up-sampling to obtain attention feature maps: a max-pooling layer and three convolutional layers perform the down-sampling phase, and a convolutional layer, a bilinear interpolation operation, and a sigmoid function perform the up-sampling phase. In the trunk branch, a single convolutional layer produces the output. Finally, the trunk output and the attention feature maps are fused by element-wise product to obtain enhanced features. The LAM is an attention module with few parameters; it reduces unnecessary background information introduced by the shallow-level guidance, focuses attention on small-sized vehicles, and enhances the discriminative features of vehicles in the shallow layers.
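A minimal sketch of the LAM as described above is shown below (PyTorch); the exact layer widths and the placement of activations are assumptions rather than reported values.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class LightWeightAttentionModule(nn.Module):
    """Sketch of the LAM: a mask branch (max-pool + three convs down, one conv +
    bilinear up-sampling + sigmoid up) gates a single-conv trunk branch by
    element-wise product."""

    def __init__(self, ch):
        super().__init__()
        self.down = nn.Sequential(
            nn.MaxPool2d(2),
            nn.Conv2d(ch, ch, 3, padding=1), nn.ReLU(inplace=True),
            nn.Conv2d(ch, ch, 3, padding=1), nn.ReLU(inplace=True),
            nn.Conv2d(ch, ch, 3, padding=1), nn.ReLU(inplace=True),
        )
        self.up = nn.Conv2d(ch, ch, 3, padding=1)
        self.trunk = nn.Conv2d(ch, ch, 3, padding=1)

    def forward(self, x):
        m = self.up(self.down(x))
        m = F.interpolate(m, size=x.shape[-2:], mode="bilinear", align_corners=False)
        mask = torch.sigmoid(m)            # spatial attention map in [0, 1]
        return self.trunk(x) * mask        # element-wise product fusion
```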

Feature enhancement module

Almost all UAV vehicle detection networks are built on general object detection networks. When these general networks are applied to small datasets, they generate a large number of redundant features for UAV vehicles, which reduces the discriminability of the vehicle features. For the UAV vehicle detection task in this work, we design a feature enhancement module (FEM) that suppresses the redundant features and improves the discriminability of small-sized vehicles. The FEM is a channel attention mechanism that quantifies the importance of each convolutional kernel in a feature layer, yielding a one-dimensional vector [57] of importances for all kernels of that layer. This vector is then used to rescale the feature map of each channel. The FEM increases the difference between vehicle features and redundant features, making the vehicle features more discriminative. The module is illustrated in Fig. 5.
The input of the FEM is the feature map of a prediction layer. We use global average pooling to obtain the response value of each channel of the feature map:
$$\begin{aligned} \begin{aligned} z_i = F_{global}(X) = {\frac{1}{H \times W}} {\sum \limits _{m = 1}^H}{\sum \limits _{n = 1}^W}{x_i(m,n)} \end{aligned} \end{aligned}$$
(5)
where \(F_{global}\) denotes global average pooling, H and W are the height and width of the feature map X, i is the channel index, and \(x_i(m,n)\) is the value at position (m, n) of channel i. All pixel values of channel i are summed and averaged to obtain the response of that channel. The output is a vector of dimension C (C = 256). To avoid an overly large vector magnitude, we normalize it with the \(L_2\) norm:
$$\begin{aligned} \begin{aligned} s_i = F_{L_2}(Z) = \frac{z_i}{\Vert z \Vert _2} = \frac{z_i}{\sqrt{{\sum \limits _{i = 1}^C}{z_i^2}}} \end{aligned} \end{aligned}$$
(6)
Finally, the normalized vector is used to scale the overall amplitude of the feature maps channel by channel: the enhanced feature map \({\tilde{x}}_i\) is obtained by multiplying the original feature map \(x_i\) by the weight \(s_i\), which makes vehicle features and redundant features more distinguishable:
$$\begin{aligned} \begin{aligned} {\tilde{x}}_i = F_{scale}(s_i, x_i) = s_i \cdot x_i \end{aligned} \end{aligned}$$
(7)
where \(\cdot \) refers to multiplying weight vectors with feature maps to scale the overall amplitude of the feature maps channel by channel.
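Because Eqs. (5)–(7) contain no learnable parameters, the FEM can be sketched as a single function (PyTorch); the small eps term is added only for numerical safety and is not part of the original formulation.

```python
import torch

def feature_enhancement(x: torch.Tensor, eps: float = 1e-12) -> torch.Tensor:
    """Sketch of the FEM on a feature map x of shape (N, C, H, W)."""
    z = x.mean(dim=(2, 3))                              # Eq. (5): global average pooling
    s = z / (z.norm(p=2, dim=1, keepdim=True) + eps)    # Eq. (6): L2 normalisation
    return x * s[:, :, None, None]                      # Eq. (7): channel-wise rescaling
```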

Experimental results and analysis

Datasets and implementation details

XDUAV Dataset. The dataset [42] contains a large number of truncated, occluded, and multi-angle small vehicles. The vehicle categories are car, bus, truck, tanker, motor, and bicycle. The whole dataset contains 4344 images, with 3475 images for training and 869 images for testing.
The Stanford Drone Dataset. The dataset [60] contains annotated videos of pedestrians, bikers, skateboarders, cars, buses, and golf carts. In this work, we choose the 3 vehicle categories (i.e., car, bus, and golf cart) as experimental data. The whole dataset contains 4331 images, with 3500 images for training and 831 images for testing.
Loss Function. During training, a multi-task (classification and regression) loss function is minimized, where the classification loss is the cross-entropy function and the regression loss is the SmoothL1 function. The total loss of the network is defined as:
$$\begin{aligned} \begin{aligned} L(p_i,t_i)= {\frac{1}{N_{pos}}}{\sum _i L_{cls}(p_i,c_i^*)}+{\frac{1}{N_{pos}}}{\sum _i{c_i^*}{L_{reg}(t_i,t_i^*)}}. \end{aligned} \end{aligned}$$
(8)
where i is the index of an anchor, \({{N}_{pos}}\) is the number of positive samples, and the label \(c_{i}^{*}\) is 1 if anchor i is positive and 0 otherwise. \(p_i\) and \(t_i\) are the predicted category and location of anchor i, respectively, and \(t_i^*\) is the ground-truth location and size of i. \(L_{cls}\) is the classification loss and \(L_{reg}\) is the regression loss.
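A simplified PyTorch sketch of Eq. (8) is given below; it omits the anchor matching, hard negative mining, and refinement steps of the full RefineDet pipeline.

```python
import torch
import torch.nn.functional as F

def detection_loss(cls_logits, box_preds, labels, box_targets):
    """Multi-task loss of Eq. (8).
    cls_logits:  (A, K+1) class scores per anchor (K vehicle classes + background)
    box_preds:   (A, 4) predicted box offsets
    labels:      (A,) integer labels, 0 = background, >0 = vehicle category
    box_targets: (A, 4) ground-truth box offsets"""
    pos = labels > 0
    n_pos = pos.sum().clamp(min=1).float()      # avoid division by zero
    l_cls = F.cross_entropy(cls_logits, labels, reduction="sum") / n_pos
    l_reg = F.smooth_l1_loss(box_preds[pos], box_targets[pos], reduction="sum") / n_pos
    return l_cls + l_reg
```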
Performance Metric. To evaluate the proposed detector on the two datasets, we use the typical PASCAL VOC metrics: average precision (AP) for a single category, mean average precision (mAP) over all categories, and detection speed in frames per second (FPS). The corresponding formulas are:
$$\begin{aligned} P=&\frac{TP}{TP+FP}, \quad R=\frac{TP}{TP+FN}\\ AP=&\int _0^1 P(R)\,\mathrm{d}R, \quad mAP={\frac{1}{n}}{\sum _{i=1}^{n} AP_i} \end{aligned}$$
(9)
where TP (true positives) is the number of correctly detected positive samples, FP (false positives) is the number of negative samples incorrectly reported as positive, and FN (false negatives) is the number of positive samples that are missed. P and R are precision and recall, AP is the area under the precision-recall curve of one category, n is the number of categories, and \(AP_i\) is the average precision of category i. The mAP measures the overall ability of the proposed detector to correctly classify and localize all UAV vehicles; the higher the mAP, the better the network performance.
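For reference, a compact NumPy sketch of the VOC-style AP computation for a single category is shown below; the mAP is the mean of the per-category AP values.

```python
import numpy as np

def average_precision(scores, is_tp, n_gt):
    """scores: detection confidences; is_tp: 1 if the detection matches an
    unmatched ground truth (IoU above threshold) else 0; n_gt: number of
    ground-truth boxes of this category."""
    order = np.argsort(-np.asarray(scores, dtype=float))
    tp = np.asarray(is_tp, dtype=float)[order]
    tp_cum = np.cumsum(tp)
    fp_cum = np.cumsum(1.0 - tp)
    recall = tp_cum / max(n_gt, 1)
    precision = tp_cum / np.maximum(tp_cum + fp_cum, 1e-12)
    # interpolated precision: maximum precision at any recall >= r
    precision = np.maximum.accumulate(precision[::-1])[::-1]
    # area under the precision-recall curve (all-point interpolation)
    return float(np.sum(np.diff(np.concatenate(([0.0], recall))) * precision))
```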
Training Implementation Details. All experiments are implemented with the open-source deep learning framework Caffe [61]. We use the VGG ILSVRC model [62] for parameter initialization. The network is optimized by stochastic gradient descent (SGD) with back-propagation; weight decay is set to 0.0005 and momentum to 0.9. The "step" strategy is adopted to adjust the learning rate: the maximum number of iterations is 120k with an initial learning rate of 0.001, reduced by a factor of 10 at iterations 80k and 100k. We train the model with batch size 16 and test with batch size 1 on an NVIDIA GTX 1080Ti GPU with CUDA 8.0 and cuDNN 7.0.
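These solver settings translate directly into, for example, the following PyTorch configuration (a sketch only; net is a placeholder module, not the actual detector, whose original implementation is in Caffe).

```python
import torch
import torch.nn as nn

net = nn.Conv2d(3, 16, 3)   # placeholder module standing in for the BDIG-Net
optimizer = torch.optim.SGD(net.parameters(), lr=1e-3, momentum=0.9, weight_decay=5e-4)
# "step" policy: 120k iterations in total, lr divided by 10 at iterations 80k and 100k
scheduler = torch.optim.lr_scheduler.MultiStepLR(optimizer, milestones=[80_000, 100_000], gamma=0.1)
```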

Ablation study

In this work, we perform ablative analysis with the XDUAV dataset to verify the effectiveness of the proposed bi-directional information guidance network.

Baseline analysis

Considering that most vehicles seen from a UAV are small and weak, we make some adjustments to the standard RefineDet baseline. Because the deeper feature layers introduce a large receptive field that hurts performance on small vehicles, we remove the deeper layers conv6_1 and conv6_2. Meanwhile, we add a prediction layer conv3_3 to predict the relatively smaller vehicles. The modified baseline is named Modifying-R.
The standard RefineDet is trained and tested mainly on the PASCAL VOC2007 dataset, whose distribution is completely different from that of the vehicle datasets; the anchor settings therefore need to be changed accordingly. Based on the overall distributions of the XDUAV and Stanford Drone datasets, the anchor settings are given in Table 1 (a configuration sketch follows the table). For the Conv3_3 and Conv_fc7 layers we set only one anchor each, which avoids the convergence problems caused by parameter redundancy during training and improves detection performance. This model is referred to as Setting-R.
Table 1
The setting of aspect ratio and scale for anchors in different prediction layers

| Layer | Aspect ratio | Scale |
|---|---|---|
| Conv3_3 | 1 | 16×16 |
| Conv4_3 | 1, 1/2, 2 | 32×32, 23×45, 45×23 |
| Conv5_3 | 1, 1/2, 2, 4 | 64×64, 24×91, 91×45, 32×128 |
| Conv_fc7 | 2 | 91×181 |
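Expressed as a configuration sketch in Python, the anchor settings of Table 1 read as follows (scales are given as width × height in pixels; the dictionary layout itself is only a notational choice).

```python
# Aspect ratios and scales per prediction layer, taken from Table 1
ANCHORS = {
    "conv3_3":  {"aspect_ratios": (1,),           "scales": [(16, 16)]},
    "conv4_3":  {"aspect_ratios": (1, 1/2, 2),    "scales": [(32, 32), (23, 45), (45, 23)]},
    "conv5_3":  {"aspect_ratios": (1, 1/2, 2, 4), "scales": [(64, 64), (24, 91), (91, 45), (32, 128)]},
    "conv_fc7": {"aspect_ratios": (2,),           "scales": [(91, 181)]},
}
```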
We conduct comparative experiments with these different settings; the results in Table 2 demonstrate the effectiveness of the baseline settings. Subsequent experiments are analyzed in detail based on the Modifying-R and Setting-R model.
Table 2
Performance comparison of different baseline settings

| Standard RefineDet | Modifying-R | Setting-R | Car | Bus | Truck | Motor | Bicycle | Tanker | mAP (%) |
|---|---|---|---|---|---|---|---|---|---|
| ✓ | | | 90.7 | 90.4 | 90.6 | 79.4 | 69.3 | 89.6 | 85.0 |
| | ✓ | | 90.8 | 90.1 | 90.7 | 88.1 | 76.7 | 89.9 | 87.7 |
| | ✓ | ✓ | 90.9 | 90.5 | 90.8 | 90.2 | 87.6 | 90.9 | **90.1** |

The per-category columns report AP (%); bold is to emphasize that the proposed setting has the highest detection accuracy

The shallow-level feature information guidance part analysis

  • The importance of mid-/low-level feature information guidance
Small-sized vehicle detection needs not only high-level semantic information that can distinguish vehicles from other categories, but also mid-/low-level information that can accurately describe them. In this section, we demonstrate the importance of the mid-/low-level information guidance part for vehicle detection. We embed mid-/low-level feature information into the prediction layers conv3_3, conv4_3, conv5_3, and conv_fc7 of the backbone network and perform comparative experiments, as shown in Table 3.
Table 3
The ablation study on the embedding mid-/low-level information into different prediction layers

| Conv3_3 | Conv4_3 | Conv5_3 | Conv_fc7 | Car | Bus | Truck | Motor | Bicycle | Tanker | mAP (%) |
|---|---|---|---|---|---|---|---|---|---|---|
| ✓ | ✓ | ✓ | ✓ | 90.8 | 90.5 | 90.7 | 90.3 | 86.7 | 97.3 | 91.1 |
| | ✓ | ✓ | ✓ | 90.9 | 90.6 | 90.9 | 90.2 | 87.9 | 97.3 | 91.3 |
| | | ✓ | ✓ | 90.8 | 90.5 | 90.8 | 90.1 | 87.2 | 99.5 | 91.5 |
| | | | ✓ | 90.8 | 90.3 | 90.8 | 90.1 | 87.3 | 90.9 | 90.1 |

The per-category columns report AP (%)
The experimental results indicate that embedding mid-/low-level information into the conv3_3 and conv4_3 layers has little effect, because these two layers are themselves shallow features of the backbone; introducing additional parameters there brings feature redundancy, burdens the network, and decreases detection performance. Adding appropriate mid-/low-level information to the conv5_3 and conv_fc7 layers, in contrast, makes up for the missing vehicle information, which is conducive to the precise localization of small-sized vehicles.
Therefore, the shallow-level feature information guidance part is important: it ensures that each prediction layer has abundant mid-/low-level feature information, improving detection performance.
Note that the detection results in Table 3 are obtained using the basic convolution operation in the feature transform module (FTM), i.e., the input is passed sequentially through a 3×3 convolutional layer, a 1×1 convolutional layer, and a 3×3 convolutional layer to obtain the mid-/low-level features, without multi-channel dilated convolutions.
  • The effectiveness of feature transform module
The shallow-level feature information guides the backbone network to enhance detailed and spatial features for small-sized vehicles. We use the FTM to transform the image pyramid into mid-/low-level feature information. To demonstrate the effectiveness of the FTM, we compare a single convolution layer with multi-channel (two- and three-channel) dilated convolution layers, as shown in Table 4.
Table 4
The ablation study on the multi-rate dilated convolution in the feature transform module

| r = 1 | r = 2 | r = 3 | Car | Bus | Truck | Motor | Bicycle | Tanker | mAP (%) | Speed (ms) |
|---|---|---|---|---|---|---|---|---|---|---|
| ✓ | | | 90.8 | 90.5 | 90.8 | 90.1 | 87.2 | 99.5 | 91.5 | 17 |
| ✓ | ✓ | | 90.8 | 90.6 | 90.8 | 90.5 | 88.7 | 99.7 | 91.9 | 18 |
| ✓ | | ✓ | 90.8 | 90.6 | 90.8 | 90.4 | 88.2 | 99.2 | 91.7 | 20 |
| | ✓ | ✓ | 90.8 | 90.5 | 90.8 | 90.3 | 87.7 | 99.2 | 91.6 | 21 |
| ✓ | ✓ | ✓ | 90.8 | 90.6 | 90.9 | 90.4 | 88.5 | 99.7 | 91.8 | 22 |

The per-category columns report AP (%)
The mAP of the two-channel dilated convolutions with r = 1, 3 and r = 2, 3 is 0.2% and 0.1% higher, respectively, than that of a single convolution layer. The reason is that a convolution with a larger dilation rate has a larger receptive field, so the contour information of vehicles can be extracted; the concatenated features with different dilation rates are therefore richer and their responses stronger. Compared with the two-channel variants with r = 1, 3 and r = 2, 3, the three-channel dilated convolution with r = 1, 2, 3 brings more diverse receptive fields, so more information from different ranges is captured and the mAP increases. However, since most vehicles seen from a UAV are weak and small, a large receptive field may also bring more background interference. As a result, the three-channel dilated convolution does not perform as well as the two-channel dilated convolution with r = 1, 2, whose features are more delicate and retain richer details, which is more conducive to small-vehicle detection.
Therefore, for the FTM we adopt the two-channel dilated convolution with r = 1, 2 to obtain information at different receptive fields and enrich the mid-/low-level features of vehicles. Dilated convolutional layers require padding of the inputs, which increases the computational complexity of the network; the per-image testing times of the different dilated convolutions, shown in the last column of Table 4, indicate that the FTM does not affect the real-time performance.
  • The significance of light-weight attention module
In the mid-/low-level information guidance part, unnecessary background information is brought into the backbone network during the passing of mid-/low-level information, which affects vehicle detection performance. Inspired by the residual attention module (RAM) [59], we design a light-weight attention module (LAM) to suppress the irrelevant background information. The experiment in the first row of Table 5 uses only the residual product fusion method to fuse the output of the FTM with the corresponding backbone features and embeds the fused information directly into the backbone without any attention module; "Without" in this row means that no spatial attention module is used. Due to the limited features of small vehicles and the influence of irrelevant background, the network's attention to small vehicles is then insufficient. As shown in rows 2 and 3 of Table 5, introducing an attention module greatly improves vehicle detection performance. Although the RAM improves detection performance, it also introduces a large number of parameters, increasing the complexity of the network and slowing down detection. The comparison of the results indicates the effectiveness of the LAM: for smaller vehicles such as bicycles in particular, the AP increases by 1.7%. The LAM captures the areas where smaller vehicles are located and invests more attention in those areas to obtain more detailed information while ignoring irrelevant information; it quickly filters out high-value information from limited attention resources without affecting real-time performance.
Table 5
The ablation study on the light-weight attention module (LAM)

| Spatial attention | Car | Bus | Truck | Motor | Bicycle | Tanker | mAP (%) | Speed (ms) |
|---|---|---|---|---|---|---|---|---|
| Without | 90.8 | 90.7 | 90.9 | 90.6 | 89.2 | 99.5 | 92.0 | 18 |
| RAM | 91.1 | 90.8 | 91.6 | 91.2 | 90.7 | 99.6 | 92.5 | 29 |
| LAM | 91.3 | 90.9 | 92.3 | 91.0 | 90.9 | 99.7 | 92.7 | 20 |

The per-category columns report AP (%)
Table 6
The effect of the feature enhancement module (FEM) in the deep-level semantic information guidance part

| Channel attention | Car | Bus | Truck | Motor | Bicycle | Tanker | mAP (%) | Speed (ms) |
|---|---|---|---|---|---|---|---|---|
| Without | 91.3 | 90.9 | 92.3 | 91.0 | 90.9 | 99.7 | 92.7 | 20 |
| FEM | 91.3 | 91.0 | 93.0 | 91.3 | 91.2 | 99.7 | 92.9 | 20 |

The per-category columns report AP (%)

The deep-level semantic information guidance part analysis

In the deep-level semantic information guidance part, the element-wise sum fusion method is used to implement the feature fusion module (FFM), which guides high-level contextual information to enhance features for small-sized vehicles. Meanwhile, a feature enhancement module (FEM) is proposed to suppress the redundant features and improve the discriminability of small-sized vehicles.
  • The necessity of feature enhancement module
As shown in Table 6, the comparison of experimental results indicates the necessity of the FEM; "Without" in the first row of Table 6 means that no channel attention module is used. The light-weight attention module (LAM) is essentially a spatial attention mechanism, while the FEM is a channel attention mechanism. We first employ the LAM to filter out irrelevant background information, and then use the FEM to improve the discriminability of small-sized vehicles. The two modules enhance the feature extraction ability of the BDIG-Net from the spatial and channel perspectives, respectively, playing complementary roles. This complementarity not only benefits classification performance but also greatly improves the localization of vehicles.

Overall performance

XDUAV dataset analysis

Owing to the feature transform module (FTM) and the light-weight attention module (LAM) in the shallow-level feature information guidance part, and the feature enhancement module (FEM) in the deep-level semantic information guidance part, the proposed BDIG-Net achieves 92.9% mAP at 50 FPS on the XDUAV dataset. In this section, we compare single-stage real-time methods and two-stage high-accuracy methods in terms of mAP and FPS, as shown in Table 7. The proposed method achieves the best performance while keeping real-time detection; all methods are trained under the same conditions, so the comparison is fair. Figure 6 shows comparison results on the XDUAV dataset. Compared with the single-directional (deep-level semantic) information guidance network (Fig. 6b), the bi-directional information guidance network (BDIG-Net) (Fig. 6a) noticeably reduces missed detections and redundant bounding boxes. Especially for vehicles with scale diversity and occlusion, the proposed detector is robust and localizes vehicles precisely. These results demonstrate the effectiveness of the BDIG-Net on the XDUAV dataset.
Table 7
Detection results (%) of different methods for the XDUAV dataset

| Method | Backbone | Input size | mAP (%) | Car | Bus | Truck | Motor | Bicycle | Tanker | FPS |
|---|---|---|---|---|---|---|---|---|---|---|
| Two-stage: | | | | | | | | | | |
| Faster R-CNN [8] | VGG-16 | ~1000×600 | 74.3 | 88.1 | 90.4 | 89.5 | 48.8 | 35.3 | 94.2 | 7 |
| Faster R-CNN [8] | ResNet-101 | ~1000×600 | 75.8 | 89.1 | 90.3 | 90.5 | 50.3 | 44.5 | 90.4 | 2.4 |
| R-FCN [10] | ResNet-101 | ~1000×600 | 85.7 | 90.8 | 90.4 | 90.8 | 71.3 | 76.9 | 94.0 | 9 |
| FPN [23] | ResNet-101 | ~1000×600 | 87.5 | 90.8 | 90.4 | 90.5 | 81.1 | 81.4 | 90.9 | 6 |
| Single-stage: | | | | | | | | | | |
| SSD [9] | VGG-16 | 300×300 | 83.3 | 90.7 | 90.0 | 90.3 | 78.3 | 60.7 | 89.6 | 59 |
| SSD [9] | ResNet-101 | 321×321 | 79.5 | 90.8 | 90.4 | 90.2 | 62.2 | 56.9 | 86.5 | 11.2 |
| RON [34] | VGG-16 | 320×320 | 75.0 | 79.2 | 90.1 | 88.0 | 50.7 | 53.1 | 89.1 | 15 |
| DSSD [63] | ResNet-101 | 321×321 | 81.6 | 90.8 | 90.2 | 90.8 | 62.6 | 66.7 | 88.6 | 9.5 |
| YOLOv2 [17] | DarkNet-19 | 416×416 | 53.7 | 66.4 | 73.1 | 77.0 | 29.3 | 12.2 | 64.1 | 64 |
| YOLOv3 [18] | DarkNet-53 | 416×416 | 83.8 | 90.0 | 85.8 | 95.5 | 77.2 | 57.7 | 96.3 | 76 |
| YOLOv5 | Focus + CSP + SPP | 640×640 | 81.8 | 88.2 | 89.6 | 89.4 | 69.9 | 61.3 | 92.2 | 172 |
| YOLOv7 [20] | CBS | 640×640 | 70.2 | 80.6 | 82.9 | 82.0 | 54.3 | 37.9 | 83.6 | 125 |
| RefineDet [26] | ShufflenetV1 [64] | 320×320 | 78.4 | 90.7 | 90.2 | 89.3 | 70.3 | 41.4 | 88.3 | 79 |
| RefineDet [26] | ShufflenetV2 [65] | 320×320 | 75.7 | 90.6 | 90.2 | 85.8 | 67.6 | 37.1 | 83.1 | 58 |
| RefineDet [26] | VGG-16 | 320×320 | 87.1 | 90.8 | 90.3 | 90.6 | 80.2 | 80.5 | 90.6 | 40 |
| Our method | VGG-16 | 320×320 | **92.9** | 91.3 | 91.0 | 93.0 | 91.3 | 91.2 | 99.7 | 50 |

The per-category columns report AP (%); bold is to emphasize that the proposed method has the highest detection accuracy

The Stanford Drone dataset analysis

To further demonstrate the practicality and robustness of the BDIG-Net detector for UAV vehicles, we evaluate its detection performance on the Stanford Drone dataset. We use the same ablation experiments as for the XDUAV dataset to analyze the shallow-level feature information guidance part and the deep-level semantic information guidance part; the results on the Stanford Drone dataset also demonstrate the effectiveness of the proposed BDIG-Net. The overall results of the different detection methods are shown in Table 8. The proposed method achieves 91.1% mAP, which is superior to the other real-time vehicle detection methods. Figure 7 shows detection results of the proposed BDIG-Net detector on the Stanford Drone dataset in different scenarios.
Table 8
Detection results (%) of different methods for the Stanford Drone dataset

| Method | Backbone | Input size | mAP (%) | Car | Bus | Golf cart | FPS |
|---|---|---|---|---|---|---|---|
| Two-stage: | | | | | | | |
| Faster R-CNN [8] | VGG-16 | ~1000×600 | 75.3 | 61.6 | 87.1 | 77.1 | 7 |
| Faster R-CNN [8] | ResNet-101 | ~1000×600 | 75.9 | 60.4 | 89.5 | 75.9 | 2.4 |
| R-FCN [10] | ResNet-101 | ~1000×600 | 86.6 | 83.6 | 90.5 | 85.8 | 9 |
| FPN [23] | ResNet-101 | ~1000×600 | 88.8 | 88.3 | 90.3 | 87.7 | 6 |
| Single-stage: | | | | | | | |
| SSD [9] | VGG-16 | 300×300 | 78.2 | 89.3 | 82.9 | 62.4 | 59 |
| SSD [9] | ResNet-101 | 321×321 | 71.9 | 45.9 | 88.4 | 81.4 | 11.2 |
| RON [34] | VGG-16 | 320×320 | 69.2 | 46.8 | 88.9 | 71.9 | 15 |
| DSSD [63] | ResNet-101 | 321×321 | 76.6 | 67.5 | 87.8 | 74.5 | 9.5 |
| YOLOv2 [17] | DarkNet-19 | 416×416 | 55.8 | 54.9 | 86.1 | 26.3 | 64 |
| YOLOv3 [18] | DarkNet-53 | 416×416 | 80.0 | 72.5 | 89.7 | 77.5 | 76 |
| YOLOv5 | Focus + CSP + SPP | 640×640 | 80.1 | 63.3 | 90.0 | 87.1 | 172 |
| YOLOv7 [20] | CBS | 640×640 | 57.1 | 17.3 | 93.5 | 60.5 | 125 |
| RefineDet [26] | ShufflenetV1 [64] | 320×320 | 71.7 | 90.6 | 85.2 | 39.3 | 79 |
| RefineDet [26] | ShufflenetV2 [65] | 320×320 | 59.3 | 90.0 | 66.0 | 22.0 | 58 |
| RefineDet [26] | VGG-16 | 320×320 | 86.1 | 90.7 | 86.2 | 81.5 | 40 |
| Our method | VGG-16 | 320×320 | **91.1** | 91.3 | 91.2 | 90.9 | 50 |

The per-category columns report AP (%); bold is to emphasize that the proposed method has the highest detection accuracy

Conclusion

This paper introduces a bi-directional information guidance network (BDIG-Net) for multi-category vehicle detection, which achieves accurate and real-time detection in UAV images. First, the BDIG-Net is divided into two parts: a shallow-level feature information guidance part and a deep-level semantic information guidance part. Second, in the shallow-level guidance part, we use the FTM to transform the image pyramid into mid-/low-level feature information, and a residual product fusion method embeds this information into the BDIG-Net; to reduce unnecessary shallow background information in the fused features, the LAM is designed to make the network focus more on small-sized vehicles. Third, in the deep-level guidance part, we use the top-down architecture of the standard RefineDet to fuse deeper semantic information, and the element-wise sum fusion method guides high-level contextual information to enhance the features of small-sized vehicles; a feature enhancement module (FEM) is further proposed to suppress redundant features and improve the discriminability of small-sized vehicles. The BDIG-Net not only integrates high-level semantic information that is conducive to classification into the shallow features, but also, more importantly, integrates mid-/low-level information that is conducive to localization into the deep features, thereby reducing the information imbalance across feature layers to some extent. The experimental results on two datasets demonstrate that the proposed method detects small-sized vehicles more accurately and achieves real-time detection. In future work, we plan to integrate inter-frame correlation and scene understanding into the UAV vehicle detection network to infer and detect even smaller and weaker vehicle targets.

Declarations

Conflict of interest

The authors declare no conflict of interest.
Open Access This article is licensed under a Creative Commons Attribution 4.0 International License, which permits use, sharing, adaptation, distribution and reproduction in any medium or format, as long as you give appropriate credit to the original author(s) and the source, provide a link to the Creative Commons licence, and indicate if changes were made. The images or other third party material in this article are included in the article's Creative Commons licence, unless indicated otherwise in a credit line to the material. If material is not included in the article's Creative Commons licence and your intended use is not permitted by statutory regulation or exceeds the permitted use, you will need to obtain permission directly from the copyright holder. To view a copy of this licence, visit http://creativecommons.org/licenses/by/4.0/.

Publisher's Note

Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.
Literatur
1.
Zurück zum Zitat Mishra B, Garg D, Narang P, Mishra V (2020) Drone-surveillance for search and rescue in natural disaster. Comput Commun 156:1–10CrossRef Mishra B, Garg D, Narang P, Mishra V (2020) Drone-surveillance for search and rescue in natural disaster. Comput Commun 156:1–10CrossRef
2.
Zurück zum Zitat Srivastava S, Narayan S, Mittal S (2021) A survey of deep learning techniques for vehicle detection from UAV images. J Syst Architect 117:102152CrossRef Srivastava S, Narayan S, Mittal S (2021) A survey of deep learning techniques for vehicle detection from UAV images. J Syst Architect 117:102152CrossRef
3.
Zurück zum Zitat Priyanka G, Bhavya P, Gaurav S, Vijay RD (2022) Edge device based military vehicle detection and classification from UAV. Multimed Tools Appl 81:19813–19834CrossRef Priyanka G, Bhavya P, Gaurav S, Vijay RD (2022) Edge device based military vehicle detection and classification from UAV. Multimed Tools Appl 81:19813–19834CrossRef
4.
Zurück zum Zitat Ke R, Li Z, Tang J, Pan Z, Wang Y (2019) Real-time traffic flow parameter estimation from UAV video based on ensemble classifier and optical flow. IEEE Trans Intell Transp Syst 20:54–64CrossRef Ke R, Li Z, Tang J, Pan Z, Wang Y (2019) Real-time traffic flow parameter estimation from UAV video based on ensemble classifier and optical flow. IEEE Trans Intell Transp Syst 20:54–64CrossRef
5.
Zurück zum Zitat Zhou H, Kong H, Wei L, Creighton D, Nahavandi S (2017) On detecting road regions in a single UAV image. IEEE Trans Intell Transp Syst 18:1713–1722CrossRef Zhou H, Kong H, Wei L, Creighton D, Nahavandi S (2017) On detecting road regions in a single UAV image. IEEE Trans Intell Transp Syst 18:1713–1722CrossRef
6.
Zurück zum Zitat Li X, Li X, Li Z, Xiong X, Khyam MO, Sun C (2021) Robust Vehicle Detection in High-Resolution Aerial Images With Imbalanced Data. IEEE Transactions on Artificial Intelligence 2:238–250CrossRef Li X, Li X, Li Z, Xiong X, Khyam MO, Sun C (2021) Robust Vehicle Detection in High-Resolution Aerial Images With Imbalanced Data. IEEE Transactions on Artificial Intelligence 2:238–250CrossRef
7.
Zurück zum Zitat He K, Zhang X, Ren S, Sun J (2015) Spatial pyramid pooling in deep convolutional networks for visual recognition. IEEE Trans Pattern Anal Mach Intell 37:1904–1916CrossRef He K, Zhang X, Ren S, Sun J (2015) Spatial pyramid pooling in deep convolutional networks for visual recognition. IEEE Trans Pattern Anal Mach Intell 37:1904–1916CrossRef
8.
Zurück zum Zitat Ren S, He K, Girshick R, Sun J (2015) Faster r-cnn: Towards real-time object detection with region proposal networks. Advances in neural information processing systems, pp. 91–99 Ren S, He K, Girshick R, Sun J (2015) Faster r-cnn: Towards real-time object detection with region proposal networks. Advances in neural information processing systems, pp. 91–99
9.
Zurück zum Zitat Liu W, Anguelov D, Erhan D, Szegedy C, Reed S, Fu CY, Berg AC (2016) Ssd: Single shot multibox detector. European conference on computer vision. Springer, pp. 21–37 Liu W, Anguelov D, Erhan D, Szegedy C, Reed S, Fu CY, Berg AC (2016) Ssd: Single shot multibox detector. European conference on computer vision. Springer, pp. 21–37
10.
Zurück zum Zitat Dai J, Li Y, He K, Sun J (2016) R-fcn: Object detection via region-based fully convolutional networks. Advances in neural information processing systems, pp. 379–387 Dai J, Li Y, He K, Sun J (2016) R-fcn: Object detection via region-based fully convolutional networks. Advances in neural information processing systems, pp. 379–387
11.
Zurück zum Zitat He K, Gkioxari G, Dollár P, Girshick R (2017) Mask r-cnn. Proceedings of the IEEE international conference on computer vision, pp. 2961–2969 He K, Gkioxari G, Dollár P, Girshick R (2017) Mask r-cnn. Proceedings of the IEEE international conference on computer vision, pp. 2961–2969
12.
Girshick R (2015) Fast R-CNN. Proceedings of the IEEE International Conference on Computer Vision, pp. 1440–1448
13.
Xu Y, Yu G, Wang Y, Wu X, Ma Y (2017) Car detection from low-altitude UAV imagery with the Faster R-CNN. Journal of Advanced Transportation 2017
14.
Sommer LW, Schuchert T, Beyerer J (2017) Fast deep vehicle detection in aerial images. 2017 IEEE Winter Conference on Applications of Computer Vision (WACV), IEEE, pp. 311–319
15.
Tang T, Zhou S, Deng Z, Zou H, Lei L (2017) Vehicle detection in aerial images based on region convolutional neural networks and hard negative example mining. Sensors 17:336
16.
Redmon J, Divvala S, Girshick R, Farhadi A (2016) You only look once: Unified, real-time object detection. Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 779–788
17.
Redmon J, Farhadi A (2017) YOLO9000: Better, faster, stronger. 2017 IEEE Conference on Computer Vision and Pattern Recognition (CVPR), IEEE, pp. 6517–6525
20.
Wang CY, Bochkovskiy A, Liao HYM (2022) YOLOv7: Trainable bag-of-freebies sets new state-of-the-art for real-time object detectors. arXiv preprint arXiv:2207.02696
22.
Shrivastava A, Sukthankar R, Malik J, Gupta A (2016) Beyond skip connections: Top-down modulation for object detection. arXiv preprint arXiv:1612.06851
23.
Lin T, Dollár P, Girshick R, He K, Hariharan B, Belongie S (2017) Feature pyramid networks for object detection. 2017 IEEE Conference on Computer Vision and Pattern Recognition (CVPR), pp. 936–944
24.
Zhu R, Zhang S, Wang X, Wen L, Shi H, Bo L, Mei T (2019) ScratchDet: Training single-shot object detectors from scratch. Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR)
25.
Wang T, Anwer RM, Cholakkal H, Khan FS, Pang Y, Shao L (2019) Learning rich features at high-speed for single-shot object detection. Proceedings of the IEEE/CVF International Conference on Computer Vision (ICCV)
26.
Zhang S, Wen L, Bian X, Lei Z, Li SZ (2018) Single-shot refinement neural network for object detection. 2018 IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 4203–4212
27.
Sommer LW, Schuchert T, Beyerer J (2017) Deep learning based multi-category object detection in aerial images. Automatic Target Recognition XXVII (Sadjadi FA, Mahalanobis A, eds), International Society for Optics and Photonics, SPIE, Vol. 10202, p. 1020209
28.
Sommer L, Schmidt N, Schumann A, Beyerer J (2018) Search area reduction Fast-RCNN for fast vehicle detection in large aerial imagery. 2018 25th IEEE International Conference on Image Processing (ICIP), pp. 3054–3058
29.
Deng Z, Sun H, Zhou S, Zhao J, Zou H (2017) Toward fast and accurate vehicle detection in aerial images using coupled region-based convolutional neural networks. IEEE Journal of Selected Topics in Applied Earth Observations and Remote Sensing 10:3652–3664
30.
Mittal P, Singh R, Sharma A (2020) Deep learning-based object detection in low-altitude UAV datasets: A survey. Image and Vision Computing 104:104046
31.
Bayhan E, Ozkan Z, Namdar M, Basgumus A (2021) Deep learning based object detection and recognition of unmanned aerial vehicles. 2021 3rd International Congress on Human-Computer Interaction, Optimization and Robotic Applications (HORA), pp. 1–5
32.
Hu P, Ramanan D (2017) Finding tiny faces. 2017 IEEE Conference on Computer Vision and Pattern Recognition (CVPR), IEEE, pp. 1522–1530
33.
Woo S, Hwang S, Kweon IS (2018) StairNet: Top-down semantic aggregation for accurate one shot detection. 2018 IEEE Winter Conference on Applications of Computer Vision (WACV), IEEE, pp. 1093–1102
34.
Kong T, Sun F, Yao A, Liu H, Lu M, Chen Y (2017) RON: Reverse connection with objectness prior networks for object detection. Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 5936–5944
35.
Kong T, Sun F, Tan C, Liu H, Huang W (2018) Deep feature pyramid reconfiguration for object detection. Proceedings of the European Conference on Computer Vision (ECCV), pp. 169–185
36.
Kong T, Yao A, Chen Y, Sun F (2016) HyperNet: Towards accurate region proposal generation and joint object detection. Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 845–853
37.
Zhang X, Izquierdo E, Chandramouli K (2019) Dense and small object detection in UAV vision based on cascade network. The IEEE International Conference on Computer Vision (ICCV) Workshops
38.
Cai Z, Vasconcelos N (2018) Cascade R-CNN: Delving into high quality object detection. The IEEE Conference on Computer Vision and Pattern Recognition (CVPR), pp. 6154–6162
39.
Huang H, Li L, Ma H (2022) An improved Cascade R-CNN-based target detection algorithm for UAV aerial images. 2022 7th International Conference on Image, Vision and Computing (ICIVC), pp. 232–237
40.
Tang T, Deng Z, Zhou S, Lei L, Zou H (2017) Fast vehicle detection in UAV images. 2017 International Workshop on Remote Sensing with Intelligent Processing (RSIP), IEEE, pp. 1–5
41.
Radovic M, Adarkwa O, Wang Q (2017) Object recognition in aerial images using convolutional neural networks. Journal of Imaging 3:21
43.
Ringwald T, Sommer L, Schumann A, Beyerer J, Stiefelhagen R (2019) UAV-Net: A fast aerial vehicle detector for mobile platforms. The IEEE Conference on Computer Vision and Pattern Recognition (CVPR) Workshops
44.
Borlea ID, Precup RE, Borlea AB (2022) Improvement of K-means cluster quality by post processing resulted clusters. Procedia Computer Science 199:63–70
45.
Protic D, Stankovic M (2023) XOR-based detector of different decisions on anomalies in the computer network traffic. Science and Technology 26:323–338
46.
Zhang X, Zhu X (2019) Vehicle detection in the aerial infrared images via an improved YOLOv3 network. 2019 IEEE 4th International Conference on Signal and Image Processing (ICSIP), pp. 372–376
47.
Tan L, Lv X, Lian X, Wang G (2021) YOLOv4_Drone: UAV image target detection based on an improved YOLOv4 algorithm. Computers & Electrical Engineering 93:107261
48.
Deng L, Liu Z, Wang J, Yang B (2023) ATT-YOLOv5-Ghost: Water surface object detection in complex scenes. Journal of Real-Time Image Processing 20:97
49.
Zhan W, Sun C, Wang M, She J, Zhang Y, Zhang Z, Sun Y (2022) An improved YOLOv5 real-time detection method for small objects captured by UAV. Soft Computing 26:362–373
50.
Majid Azimi S (2018) ShuffleDet: Real-time vehicle detection network in on-board embedded UAV imagery. The European Conference on Computer Vision (ECCV) Workshops
51.
Liu S, Qi L, Qin H, Shi J, Jia J (2018) Path aggregation network for instance segmentation. Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR), pp. 8759–8768
52.
Pang J, Chen K, Shi J, Feng H, Ouyang W, Lin D (2019) Libra R-CNN: Towards balanced learning for object detection. Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), pp. 821–830
53.
Liu Z, Gao G, Sun L, Fang L (2020) IPG-Net: Image pyramid guidance network for small object detection. Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR) Workshops, pp. 1026–1027
54.
55.
Ioffe S, Szegedy C (2015) Batch normalization: Accelerating deep network training by reducing internal covariate shift. arXiv preprint arXiv:1502.03167
56.
Liu S, Huang D, Wang Y (2018) Receptive Field Block Net for accurate and fast object detection. Proceedings of the European Conference on Computer Vision (ECCV), pp. 385–400
57.
Hu J, Shen L, Sun G (2018) Squeeze-and-excitation networks. Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR), pp. 7132–7141
58.
Wang F, Jiang M, Qian C, Yang S, Li C, Zhang H, Wang X, Tang X (2017) Residual attention network for image classification. Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR), pp. 3156–3164
59.
Lim JS, Astrid M, Yoon HJ, Lee SI (2021) Small object detection using context and attention. 2021 International Conference on Artificial Intelligence in Information and Communication (ICAIIC), pp. 181–186
60.
Robicquet A, Sadeghian A, Alahi A, Savarese S (2016) Learning social etiquette: Human trajectory understanding in crowded scenes. Computer Vision – ECCV 2016 (Leibe B, Matas J, Sebe N, Welling M, eds), Springer International Publishing, Cham, pp. 549–565
61.
Jia Y, Shelhamer E, Donahue J, Karayev S, Long J, Girshick R, Guadarrama S, Darrell T (2014) Caffe: Convolutional architecture for fast feature embedding. Proceedings of the 22nd ACM International Conference on Multimedia, ACM, New York, NY, USA, pp. 675–678
62.
Russakovsky O, Deng J, Su H, Krause J, Satheesh S, Ma S, Huang Z, Karpathy A, Khosla A, Bernstein M, Berg AC, Fei-Fei L (2015) ImageNet large scale visual recognition challenge. International Journal of Computer Vision 115:211–252
64.
Zhang X, Zhou X, Lin M, Sun J (2018) ShuffleNet: An extremely efficient convolutional neural network for mobile devices. Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR), pp. 6848–6856
65.
Ma N, Zhang X, Zheng HT, Sun J (2018) ShuffleNet V2: Practical guidelines for efficient CNN architecture design. Proceedings of the European Conference on Computer Vision (ECCV), pp. 116–131
66.
Bouguettaya A, Zarzour H, Kechida A, Taberkit AM (2022) Vehicle detection from UAV imagery with deep learning: A review. IEEE Transactions on Neural Networks and Learning Systems 33:6047–6067
67.
Ye T, Qin W, Li Y, Wang S, Zhang J, Zhao Z (2022) Dense and small object detection in UAV-vision based on a global-local feature enhanced network. IEEE Transactions on Instrumentation and Measurement 71:1–13
Metadata
Title: Bi-directional information guidance network for UAV vehicle detection
Authors: Jianxiu Yang, Xuemei Xie, Zhenyuan Wang, Peng Zhang, Wei Zhong
Publication date: 24.04.2024
Publisher: Springer International Publishing
Published in: Complex & Intelligent Systems
Print ISSN: 2199-4536
Electronic ISSN: 2198-6053
DOI: https://doi.org/10.1007/s40747-024-01429-9
