
Open Access 06.09.2022 | Original Article

Multiple spatial residual network for object detection

Authors: Yongsheng Dong, Zhiqiang Jiang, Fazhan Tao, Zhumu Fu

Published in: Complex & Intelligent Systems | Issue 2/2023


Abstract

Many residual network-based methods have been proposed for object detection. However, most of them are prone to overfitting or perform poorly in small object detection. To alleviate these problems, we propose a multiple spatial residual network (MSRNet) for object detection. In particular, our method is based on a center-point detection algorithm. Our proposed MSRNet employs a residual network as the backbone. The resulting features are processed by our proposed residual channel pooling module. We then construct a multi-scale feature transposed residual fusion structure consisting of three overlapping stacked residual convolution modules and a transpose convolution function. Finally, we use the Center structure to process the high-resolution feature map and obtain the final detection result. Experimental results on the PASCAL VOC and COCO datasets confirm that MSRNet achieves competitive accuracy compared with several other classical object detection algorithms, while providing a unified framework for training and inference. MSRNet runs on a GeForce RTX 2080 Ti.
Notes

Publisher's Note

Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.

Introduction

Object detection is an important computer vision task [1–3] that deals with detecting instances [4, 5] of visual objects of a certain class (such as humans [6], animals or cars [7]) in digital images [8, 9]. In the past two decades, a variety of deep learning-based object detection algorithms have been proposed [10–12]. Anchor-based object detection methods use anchors as regression references and classification candidates to obtain predicted proposals or final bounding boxes [13], and they have made great progress [14, 15]. However, anchor-based algorithms also face some challenges. It is difficult to design exactly the right anchors to detect different objects, because the shapes and sizes of objects vary across images [16, 17]. They also suffer from a large number of hyperparameters and from the imbalance between positive and negative samples.
Keypoint-based methods, which represent each object by key points [11, 18, 19], were proposed to alleviate the above problems [18]. CornerNet [18] obtained good detection results by detecting and matching key points, instead of relying on anchors and region proposal boxes. Although CornerNet [18] can be used for object detection, it requires two corner points to be matched, and false detections are prone to occur during the matching process. To make better use of key points for object detection, Keypoint Triplets for Object Detection [19] turned object detection into a key point estimation problem. It achieves better detection performance by predicting the location of the center point of an object together with the corresponding height and width.
Although keypoint-based methods have made great progress, they do not perform well in detecting small objects [20, 21]. The main reasons are object occlusion [22] and the low quality of the extracted features. Overfitting is also a common phenomenon. To alleviate these issues, in this paper we propose a novel multiple spatial residual network for object detection, named MSRNet. MSRNet detects objects based on key points. The main contributions of this paper are as follows:
1.
We propose a novel multiple spatial residual network (MSRNet) for object detection. Our proposed MSRNet makes use of a global residual scheme to effectively improve object detection performance by pooling and convolving features of different scales and different spaces.
 
2.
We propose a multi-spatial residual channel pooling (MSRCP) structure. Our proposed MSRCP structure contains four branches, which pool features in different spaces and stack features using the residual idea. Through these pooling and residual operations, the MSRCP structure makes MSRNet pay more attention to global features rather than local features, which plays a positive role in alleviating overfitting.
 
3.
In order to improve the detection performance of small objects, we propose a multi-scale feature transposed residual fusion (MFTRF) structure. Our proposed MFTRF structure contains four branches. It performs transpose convolution on the obtained high-quality features and stacks features using the residual idea.
 
The rest of this paper is organized as follows. The Related work section reviews related work. The Our proposed method section introduces our proposed method and the training details of our network. Experimental results are presented in the Experiments section. Finally, a brief conclusion is given in the Conclusion section.
Related work

Object detection algorithms need to locate and classify objects. Deep learning-based object detection algorithms can be divided into two-stage methods and one-stage methods.
Two-stage algorithms divide the object detection [23] process into two steps: the first step extracts regions of interest (ROIs), and the second step performs classification and regression prediction on the ROIs. The R-CNN [14] algorithm was published at CVPR 2014. It applies convolutional neural networks to feature extraction, and thanks to their strong feature extraction capability [24], it achieves good detection performance on the PASCAL VOC dataset [25]. R-CNN [14] still follows some traditional object detection ideas: it treats object detection as a classification problem, that is, candidate regions are first extracted and then classified. The concrete implementation of R-CNN [14] is divided into four steps: first, generate candidate regions; second, use a convolutional neural network for feature extraction; third, classify the features with a classifier; fourth, perform bounding-box regression on the features to obtain a more accurate object region.
Fast R-CNN [26], which is based on the VGG-16 network, is faster and more powerful than R-CNN [14]. The Faster R-CNN [27] algorithm was proposed in 2015. Compared with previous object detection algorithms, its biggest highlight is the proposed region proposal network (RPN). The anchor mechanism connects the convolutional network with region generation, which greatly improves detection speed. Anchors can be understood as boxes of fixed size, width and height placed over the image; these fixed boxes are matched with ground-truth labels, and regression and classification prediction are then carried out. Building on these methods, several approaches that play an important role in object detection have been proposed, for example SPP-net [28], FPN [29], Mask R-CNN [30] and DetectoRS [31]. SPP-net [28] introduces a spatial pyramid pooling network and does not require a fixed-size input image. FPN [29] proposes a method for detecting objects at different scales. Mask R-CNN [30] proposes an object instance segmentation framework that can easily be extended to other computer vision tasks. DetectoRS [31] proposes a recursive feature pyramid and switchable atrous convolution, which significantly improve object detection performance.
One-stage object detection algorithms detect objects directly and are relatively fast, but their accuracy is noticeably lower. Two-stage algorithms improve accuracy but limit detection speed. The SSD [32] algorithm draws on the ideas of Faster R-CNN [27] and YOLO [33]. Within the one-stage framework, SSD [32] generates regions using borders with fixed size ratios and exploits feature information from different depths.
In the SSD [32] object detection algorithm, the input image first passes through the VGGNet [34] base network. SSD [32] adds several convolution layers to this base network, and then uses convolution kernels to make predictions on six feature layers of different depths and sizes to obtain classification and regression values. Finally, SSD [32] either directly predicts the result or computes the network loss. YOLO [33, 35, 36] uses a one-stage network to perform classification and object localization directly, which is relatively fast. Different versions of YOLO have been proposed one after another, for example YOLOv1 [33], YOLO9000 [35], YOLOv3 [36] and YOLOX [37]. The YOLO [33, 35, 36] family has accelerated the industrial application of object detection.
With the further development of object detection, other valuable methods have been proposed. EfficientDet [38] balances the relationship between detection speed and detection accuracy. RetinaNet [39] was proposed to balance foreground and background. Modeling an object as the center point of its bounding box is proposed in CenterNet [40]. A new transformer-based general backbone network is proposed in Swin [10]. A direct set prediction approach that bypasses surrogate tasks is proposed in DETR [11]. The authors of [1] design a novel light architecture cooperating with a sliding window procedure, and the resulting model is kept maximally simple to support mobile devices. A correlation learning mechanism (CLM) for deep neural network architectures that combines convolutional neural networks is proposed in [3]. DSwarm-Net [2] is a framework that employs deep learning and a swarm-intelligence-based metaheuristic for human action recognition (HAR) on 3D skeleton data.
In order to ensure the recall rate of object detection, when executing the SSD [32] or Faster R-CNN [27] algorithms, a real object corresponds to more than one candidate box. In this case, candidate boxes whose scores are not the maximum need to be suppressed, which is non-maximum suppression (NMS) [41]. NMS [41] removes duplicated and redundant predictions. Although NMS [41] is simple and effective, it also has some problems under high detection requirements. If objects are dense, a box belonging to one of two overlapping objects may be suppressed, reducing the recall rate. The NMS [41] method simply takes the score as the evaluation criterion, but a prediction box with a high score may not be accurate in some situations. Therefore, non-maximum suppression [41] is not used in this paper; our proposed MSRNet is based on key points [18, 19, 40].
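For reference, the following is a minimal sketch of the greedy NMS procedure described above; MSRNet itself does not use it. The IoU threshold of 0.5 is a typical choice, not a value taken from this paper.

import numpy as np

def nms(boxes, scores, iou_threshold=0.5):
    # boxes: (N, 4) array of (x1, y1, x2, y2); scores: (N,) confidences
    order = np.argsort(-np.asarray(scores))          # indices by descending score
    boxes = np.asarray(boxes, dtype=float)
    keep = []
    while order.size > 0:
        i = order[0]                                 # highest-scoring remaining box
        keep.append(int(i))
        # intersection of box i with the remaining boxes
        xx1 = np.maximum(boxes[i, 0], boxes[order[1:], 0])
        yy1 = np.maximum(boxes[i, 1], boxes[order[1:], 1])
        xx2 = np.minimum(boxes[i, 2], boxes[order[1:], 2])
        yy2 = np.minimum(boxes[i, 3], boxes[order[1:], 3])
        inter = np.clip(xx2 - xx1, 0, None) * np.clip(yy2 - yy1, 0, None)
        area_i = (boxes[i, 2] - boxes[i, 0]) * (boxes[i, 3] - boxes[i, 1])
        areas = (boxes[order[1:], 2] - boxes[order[1:], 0]) * (boxes[order[1:], 3] - boxes[order[1:], 1])
        iou = inter / (area_i + areas - inter)
        # discard boxes that overlap the chosen box by more than the threshold
        order = order[1:][iou < iou_threshold]
    return keep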
Keypoint-based object detection algorithms do not use anchors as prior boxes; instead, they predict the location of the central point of the object [19, 40]. Figure 1 shows keypoint-based detection. It can be seen from Fig. 1 that the algorithm deduces the type of object from its center point. Therefore, keypoint-based algorithms require neither the matching operation between labels and prior boxes nor the screening of positive and negative samples. For each object label, only one central point is selected as a positive sample, namely the local peak of the heat map. Consequently, the NMS [41] procedure is not needed.
Although keypoint-based methods have been proposed, the object detection task still faces some problems. Different computer vision tasks involve different processing scenarios, and objects of different scales need to be detected in different scenes. In most detection scenarios, small objects account for the majority of detections. There are many reasons why small objects are difficult to detect [42]. First, small objects occupy a small area of the digital image, and their feature information is insufficient [43]. Second, small objects coexist with large and medium-sized objects in the same scene, so the size span between targets is large. Finally, there is a large gap between the amount of small-object data available for training and the range of small-object categories [42]. Many methods have been proposed to address these problems [42]. Common solutions include changing the training strategy and carrying out multi-scale training to handle the large scale span, and enhancing object feature information to deal with low resolution. FPN [29], FS-SSD [44], IPAN [45] and other methods have been proposed to enhance the feature information of small objects. However, there is still room for improvement in this respect.

Our proposed method

In this section, we present our proposed multiple spatial residual network (MSRNet) for object detection. In the following subsections, we first describe our proposed object detection system architecture, and then the training method, followed by its prediction.

Multiple spatial residual network (MSRNet)

In our proposed MSRNet, we first extract the backbone features from the input images, and then build a multi-spatial residual channel pooling (MSRCP) structure to alleviate overfitting and a multi-scale feature transposed residual fusion (MFTRF) structure to improve the detection accuracy of small objects. Finally, we use a Center structure [18, 19, 40] to detect objects.

Multi-spatial residual channel pooling (MSRCP)

In order to alleviate overfitting, a multi-spatial residual channel pooling (MSRCP) structure is proposed. In deep learning object detection algorithms, when the number of parameters is large and the number of training samples is relatively small, the model is prone to overfitting. Overfitting is a common problem in deep learning and in machine learning in general; it manifests as high prediction accuracy on the training set but a large drop on the test set. Pooling is used in the MSRCP structure. Pooling is a strong prior: it makes the model focus on global features rather than local features, and its dimensionality reduction retains important feature information while improving fault tolerance. The MSRCP structure is proposed to effectively alleviate overfitting.
The MSRCP structure draws on the idea of residual networks. It has four branches, and each branch contains a residual channel pooling (RCP) module. Figure 4 shows the residual channel pooling (RCP) module. It can be seen from Fig. 4 that the RCP module processes input features through depthwise separable convolution, pooling and stacking operations. The four branches of the MSRCP structure process the four feature layers produced by the main feature extraction network, respectively. After pooling in each RCP module, the shape of the feature layer does not change.
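As an illustration, the following PyTorch sketch shows one possible reading of an RCP branch under the description above: a depthwise separable convolution, a shape-preserving pooling step, and a residual stacking of input and pooled features. The kernel sizes, the pooling type and the use of addition for the stacking operation are assumptions, not details taken from the paper.

import torch
import torch.nn as nn

class RCP(nn.Module):
    # One residual channel pooling (RCP) branch sketch: depthwise separable convolution,
    # shape-preserving pooling, and residual stacking of input and pooled features.
    def __init__(self, channels):
        super().__init__()
        self.depthwise = nn.Conv2d(channels, channels, kernel_size=3, padding=1, groups=channels)
        self.pointwise = nn.Conv2d(channels, channels, kernel_size=1)
        # stride-1 pooling with padding keeps the spatial shape unchanged,
        # consistent with the statement that the feature-layer shape does not change
        self.pool = nn.MaxPool2d(kernel_size=3, stride=1, padding=1)

    def forward(self, x):
        y = self.pointwise(self.depthwise(x))
        y = self.pool(y)
        return x + y  # residual stacking of the input and the pooled features

# In the MSRCP structure, one RCP branch would be applied to each of the
# four feature maps produced by the backbone.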

Multiscale feature transposed residual fusion (MFTRF)

As a convolutional neural network deepens, the role of the feature maps changes. The resolution of shallow feature maps is high, which is beneficial for object localization. We randomly select an image from the PASCAL VOC [25] dataset and pass it through MSRNet for forward propagation, obtaining a shallow and a deep feature map. The shallow feature map is shown in Fig. 2 and the deep feature map in Fig. 3. In Fig. 2, we can clearly see the outlines of a horse, a person and a brand. The shallow feature map clearly shows the outlines of objects, which is convenient for localization. In Fig. 3, we can see dense dots in some areas; combined with the original input image, we know that these dotted areas are exactly where the objects are. The deep feature map reflects strong semantic features, which is helpful for classification and recognition.
The HyperNet [46] method was published at CVPR 2016. Its authors argued that a single feature map layer cannot adequately represent all the features of an object, so HyperNet integrates features of three different depths: shallow, medium and deep [28, 47]. In order to better represent object features, our proposed MSRNet integrates features of four different depths: shallow, medium, deep and deeper [48]. The MFTRF structure is proposed by combining these four different depth features [49–51].
In order to improve the detection performance of small objects, we propose a multi-scale feature transposed residual fusion (MFTRF) structure. The MFTRF structure consists of three overlapping stacked residual convolution (SRC) modules and a transpose convolution function. The SRC module contains the transpose convolution function and the stacking feature operation. The original input image is processed by the backbone feature extraction network of MSRNet to obtain a 16*16*2048 feature map. After this feature map passes through the MSRCP structure, four effective feature layers with different shapes and depths are obtained. These four feature layers are then processed by the MFTRF structure to obtain a 128*128*128 feature map with high semantics.
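A possible PyTorch reading of this structure is sketched below: each SRC module upsamples the deeper feature with a transpose convolution and stacks it with the next shallower feature, and three such modules carry the 16*16 backbone output up to the 128*128 map. The channel widths, the use of addition for the stacking operation, and the placement of the extra transpose convolution as a final refinement layer are assumptions; the exact wiring is not fully specified here.

import torch
import torch.nn as nn

class SRC(nn.Module):
    # Stacked residual convolution (SRC) sketch: a stride-2 transpose convolution
    # upsamples the deeper feature, which is then stacked (added) with a 1x1-projected
    # shallower feature. Kernel sizes and the additive fusion are assumptions.
    def __init__(self, deep_channels, shallow_channels, out_channels):
        super().__init__()
        self.up = nn.ConvTranspose2d(deep_channels, out_channels, kernel_size=4, stride=2, padding=1)
        self.lateral = nn.Conv2d(shallow_channels, out_channels, kernel_size=1)

    def forward(self, deep, shallow):
        return self.up(deep) + self.lateral(shallow)

class MFTRF(nn.Module):
    # Three SRC modules fuse the four MSRCP outputs from deepest (16x16x2048) to
    # shallowest (128x128x256 for a ResNet-50 backbone at 512*512 input), followed
    # by a shape-preserving transpose convolution as a refinement layer (an assumption).
    def __init__(self, channels=(256, 512, 1024, 2048), out_channels=128):
        super().__init__()
        c2, c3, c4, c5 = channels
        self.src1 = SRC(c5, c4, out_channels)            # 16x16  -> 32x32
        self.src2 = SRC(out_channels, c3, out_channels)  # 32x32  -> 64x64
        self.src3 = SRC(out_channels, c2, out_channels)  # 64x64  -> 128x128
        self.refine = nn.ConvTranspose2d(out_channels, out_channels, kernel_size=3, stride=1, padding=1)

    def forward(self, c2, c3, c4, c5):
        x = self.src1(c5, c4)
        x = self.src2(x, c3)
        x = self.src3(x, c2)
        return self.refine(x)                            # 128x128 high-semantic feature map

With a 512*512*3 input and ResNet-50 features at strides 4, 8, 16 and 32, a sketch of this kind produces the 128*128 high-semantic map referred to above.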

The overall architecture of MSRNet

This subsection presents the MSRNet architecture as a whole. Figure 4 shows the overall framework of MSRNet. It can be seen from Fig. 4 that the input image is processed by the backbone, MSRCP, MFTRF and Center components to predict the type and location of objects. In MSRNet, the input image is first processed by the backbone, which in this paper is ResNet-50. The backbone outputs are processed by the MSRCP structure, yielding four feature maps. The detailed operation of the RCP module in the MSRCP structure is shown in Fig. 4. The four outputs of the MSRCP structure are processed by the MFTRF structure, whose SRC modules process the outputs of the RCP modules. The RCP module includes a ConvTranspose2d operation, depthwise separable convolution and ordinary convolution, and it uses depthwise separable convolution to reduce the number of parameters. Finally, the output of the MFTRF structure and the output of the backbone are stacked together.
Note that the Center structure [18, 19, 40] (shown in Fig. 4) processes the high-semantic feature layer. The Center structure is essentially a series of convolution [52] operations. The high-semantic feature map entering the Center structure has size 128*128*64. A feature map of this size is equivalent to dividing the input image into 128*128 regions, each containing one feature point. If an object falls into a certain region, it is predicted by the feature point within that region. The Center structure consists of three steps. The first step is heat map prediction. Figure 5 shows the heat map obtained through the Center structure; the brighter spots visible in Fig. 5 are graphical displays of the object categories. The function of the heat map is to determine whether there is a corresponding object at each feature point. The second step is center point deviation prediction, whose result adjusts the X-axis and Y-axis coordinates of the current feature point to obtain the coordinates of the object center. The third step is width and height prediction, whose result is regressed directly to obtain the width and height of the object corresponding to the feature point.
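The three prediction steps can be sketched as three small convolutional heads, as in the PyTorch snippet below. The intermediate channel width, kernel sizes and activation are assumptions; only the output layout (per-class heat map, two offset channels, two size channels) follows the description above.

import torch
import torch.nn as nn

class CenterHead(nn.Module):
    # Sketch of the Center structure: three convolutional heads applied to the
    # 128x128x64 high-semantic feature map. Head width and kernel sizes are assumptions.
    def __init__(self, in_channels=64, num_classes=20, head_channels=64):
        super().__init__()
        def head(out_channels):
            return nn.Sequential(
                nn.Conv2d(in_channels, head_channels, kernel_size=3, padding=1),
                nn.ReLU(inplace=True),
                nn.Conv2d(head_channels, out_channels, kernel_size=1),
            )
        self.heatmap = head(num_classes)  # step 1: per-class centre-point heat map
        self.offset = head(2)             # step 2: x/y deviation of the centre point
        self.size = head(2)               # step 3: object width and height

    def forward(self, x):
        return torch.sigmoid(self.heatmap(x)), self.offset(x), self.size(x)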

Training

We use the PyTorch [53] framework to implement our approach. In network training, we use residual blocks to accelerate the convergence speed and improve the performance of the algorithm.
The overall loss function of MSRNet, which was used in [18, 19, 40], is shown as follows:
$$\begin{aligned} L_\mathrm{{overall}}&= L_k + \lambda _\mathrm{{size}}L_\mathrm{{size}} + \lambda _\mathrm{{off}}L_\mathrm{{off}} \end{aligned}$$
(1)
In this formula, \(L_k\) is the keypoint loss, which takes the form of a focal loss [39]. \(L_\mathrm{{size}}\) is the width and height prediction loss, and \(L_\mathrm{{off}}\) is the offset prediction loss. \(\lambda _\mathrm{{size}}\) and \(\lambda _\mathrm{{off}}\) are weights introduced to balance the losses of each part.
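The following sketch illustrates how Eq. (1) can be assembled in PyTorch. The penalty-reduced focal-loss form, the L1 losses for size and offset, and the default weight values follow common keypoint-detection practice [18, 19, 40]; they are assumptions rather than values reported in this paper.

import torch
import torch.nn.functional as F

def overall_loss(pred_hm, pred_size, pred_off, gt_hm, gt_size, gt_off, mask,
                 lambda_size=0.1, lambda_off=1.0, alpha=2.0, beta=4.0):
    # L_k: penalty-reduced focal loss on the predicted heat map (assumed form)
    pos = gt_hm.eq(1).float()
    neg = 1.0 - pos
    pred_hm = pred_hm.clamp(1e-6, 1 - 1e-6)
    pos_loss = -((1 - pred_hm) ** alpha) * torch.log(pred_hm) * pos
    neg_loss = -((1 - gt_hm) ** beta) * (pred_hm ** alpha) * torch.log(1 - pred_hm) * neg
    num_pos = pos.sum().clamp(min=1.0)
    l_k = (pos_loss.sum() + neg_loss.sum()) / num_pos
    # L_size and L_off: L1 losses evaluated only at ground-truth centre locations (mask)
    l_size = (F.l1_loss(pred_size, gt_size, reduction="none") * mask).sum() / num_pos
    l_off = (F.l1_loss(pred_off, gt_off, reduction="none") * mask).sum() / num_pos
    # L_overall = L_k + lambda_size * L_size + lambda_off * L_off, as in Eq. (1)
    return l_k + lambda_size * l_size + lambda_off * l_off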
Our object detection algorithm is implemented on a GeForce RTX 2080 Ti. Training on the PASCAL VOC dataset is as follows. We train with an input resolution of 512*512*3 and use Adam [54] to optimize the entire object detection algorithm. The algorithm is trained for 150 epochs with a ResNet-50 backbone and a batch size of 4. For epochs 1–50, the learning rate is 1e\(-\)3 and the weight decay is 5e\(-\)4. For epochs 51–100, the learning rate is 1e\(-\)4 and the weight decay is 5e\(-\)4. For epochs 101–150, the learning rate is 1e\(-\)5 and the weight decay is 5e\(-\)4.
Training on the COCO2014 dataset is as follows. We train with a 512*512*3 input resolution for 900,000 iterations. The learning rate is 1.25e\(-\)4, the weight decay is 1e\(-\)4, the optimizer is SGD and the backbone is ResNet-50. The batch size and gamma are set to 4 and 0.9, respectively.
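As a small illustration of the PASCAL VOC schedule described above, the learning-rate steps at epochs 50 and 100 can be expressed with a standard PyTorch scheduler. The model below is a stand-in module, not the actual MSRNet implementation.

import torch
import torch.nn as nn

model = nn.Conv2d(3, 64, kernel_size=3, padding=1)   # placeholder for MSRNet
optimizer = torch.optim.Adam(model.parameters(), lr=1e-3, weight_decay=5e-4)
# gamma=0.1 reproduces the 1e-3 -> 1e-4 -> 1e-5 steps at epochs 50 and 100
scheduler = torch.optim.lr_scheduler.MultiStepLR(optimizer, milestones=[50, 100], gamma=0.1)

for epoch in range(150):
    # ... one training epoch at batch size 4 and 512*512*3 input ...
    scheduler.step()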

Prediction

When making a prediction, the input image is first propagated forward. After forward propagation, the predictions of the heat map, the center point deviation, and the width and height are obtained. Our object detection algorithm is a keypoint-based approach: each object is extracted as a local peak on the keypoint heat map, so the NMS [41] operation is not required. The keypoint heat map here is similar to those of CornerNet [18] and CenterNet [19, 40], except that only the location of one central point is predicted.
Our method makes an offset prediction for each center point, and all object categories share one offset prediction value [18, 19, 40]. In general, for a point on the feature map, MSRNet predicts C+4 values: the center point scores of the C categories, the predicted deviation of the center point (x, y), and the width and height of the object (w, h). After all of this, we obtain the prediction of the whole network and finally visualize the prediction results.
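The decoding step can be sketched as follows: local peaks of the heat map are kept by comparing it with a 3x3 max-pooled copy (a common stand-in for NMS in keypoint detectors), and the offset and width/height predictions are read at the top-k peak locations. The value of k and the pooling window are assumptions.

import torch
import torch.nn.functional as F

def decode_centers(heatmap, offset, size, k=100):
    # heatmap: (B, C, H, W); offset, size: (B, 2, H, W)
    b, c, h, w = heatmap.shape
    pooled = F.max_pool2d(heatmap, kernel_size=3, stride=1, padding=1)
    peaks = heatmap * (pooled == heatmap).float()          # keep only local peaks
    scores, idx = torch.topk(peaks.view(b, -1), k)         # top-k over classes and positions
    classes = torch.div(idx, h * w, rounding_mode="floor") # recover the class index
    pos = idx % (h * w)                                    # flattened spatial index
    ys = torch.div(pos, w, rounding_mode="floor").float()
    xs = (pos % w).float()
    gather_idx = pos.unsqueeze(1).expand(-1, 2, -1)
    off = offset.view(b, 2, -1).gather(2, gather_idx)      # offsets at peak locations
    wh = size.view(b, 2, -1).gather(2, gather_idx)         # widths/heights at peak locations
    cx, cy = xs + off[:, 0], ys + off[:, 1]
    boxes = torch.stack([cx - wh[:, 0] / 2, cy - wh[:, 1] / 2,
                         cx + wh[:, 0] / 2, cy + wh[:, 1] / 2], dim=-1)
    return boxes, scores, classes                          # boxes in feature-map coordinates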

Experiments

In this section, we test our proposed MSRNet on COCO dataset and PASCAL VOC dataset for demonstrating its effectiveness by comparing it with eight representative object detection methods. In the following subsections, we first introduce the dataset required for the experiment. Then the evaluation indicators are introduced. Finally, the results of ablation experiments and common object detection methods are presented.

Datasets

The PASCAL VOC and COCO datasets are standard datasets for image classification and object detection. The commonly used PASCAL VOC datasets are PASCAL VOC 2007 and PASCAL VOC 2012. The PASCAL VOC 2007 dataset contains 9963 labeled images and 24,640 object annotations. The PASCAL VOC 2012 dataset consists of 11,530 images covering 20 object categories, such as people, cows, sofas and televisions. COCO2014 contains 83K training images and 41K validation images, while COCO2017 contains 118K training images and 5K validation images. Both PASCAL VOC and COCO have good overall image quality and relatively complete annotations, which makes them suitable for evaluating object detection algorithms, and most object detection algorithms provide data interfaces for them.

Evaluation indicators

Certain rules are needed to evaluate a detector. For the classification task, since the output is simply an object category, performance can be measured by counting correct classifications. In the object detection task, there are two kinds of labels, object and background, and the prediction box can also differ from the ground truth. Four kinds of samples are therefore generated when evaluating object detection: True Positive (TP), False Positive (FP), False Negative (FN) and True Negative (TN). A true positive is a prediction box that correctly matches a label box. A false positive predicts the background as an object. A false negative is an object that should have been detected but was missed by the algorithm. A true negative is background that the algorithm correctly does not predict as an object.
For object detection algorithms, the mAP (mean average precision) index is usually used for evaluation. Here, AP (average precision) refers to the detection accuracy of a single object category, and mAP is the average of AP over all categories. To evaluate an object detection algorithm, the predicted values and label values of each image are required. The predicted values include the category of the object, the four predicted values for the position of the object's bounding box, and the score of the object. The label values include the object category and the four ground-truth values of the object bounding box.
After traversing the prediction boxes of the object detection algorithm, we obtain the attribute of each prediction box, that is, TP or FP. During the traversal, we can compute the recall rate and precision rate from the accumulated TP and FP counts using the following formulas:
$$\begin{aligned} R_\mathrm{(recall)}&= \frac{\mathrm{{TP}}}{\mathrm{{TP}}+\mathrm{{FN}}} \end{aligned}$$
(2)
$$\begin{aligned} P_\mathrm{(precision)}&= \frac{\mathrm{{TP}}}{\mathrm{{TP}}+\mathrm{{FP}}} \end{aligned}$$
(3)
As each prediction box is traversed, a corresponding P and R are generated, forming a point (R, P). Plotting all the points as a curve yields the P–R curve.
Evaluating a model directly from the P–R curve is not very intuitive, because precision tends to be low when recall is high and recall tends to be low when precision is high. We therefore use the following formula to calculate AP:
$$\begin{aligned} \mathrm{{AP}}&= \int _0^1{P\mathrm{{d}}R} \end{aligned}$$
(4)
Therefore, we can see from the AP formula that AP is the area under the P–R curve. Each object category is treated independently, and the average of the APs over all object categories is the mAP. Our object detection algorithm uses mAP as the evaluation index to judge the quality of the algorithm.
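A compact numerical version of Eqs. (2)–(4) is sketched below: detections of one class are swept in descending score order, precision and recall are accumulated, and AP is obtained by integrating precision over recall. The simple all-point integration used here is an assumption; benchmarks such as PASCAL VOC and COCO apply their own interpolation rules.

import numpy as np

def average_precision(scores, is_tp, num_gt):
    # scores: detection confidences for one class; is_tp: 1 if the detection is a TP, else 0
    order = np.argsort(-np.asarray(scores, dtype=float))
    flags = np.asarray(is_tp, dtype=float)[order]
    tp = np.cumsum(flags)
    fp = np.cumsum(1.0 - flags)
    recall = tp / max(num_gt, 1)                       # Eq. (2): R = TP / (TP + FN)
    precision = tp / np.maximum(tp + fp, 1e-12)        # Eq. (3): P = TP / (TP + FP)
    recall = np.concatenate(([0.0], recall))
    precision = np.concatenate(([1.0], precision))
    # Eq. (4): integrate P dR along the swept P-R curve
    return float(np.sum((recall[1:] - recall[:-1]) * precision[1:]))

# mAP is then the mean of average_precision over all object categories.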

Ablation analysis

We perform ablation experiments on the MSRCP structure and the MFTRF structure to demonstrate the effectiveness of our method. To this end, we remove the MSRCP and MFTRF structures from our proposed MSRNet, and the resulting network is used as a basic model (named BaseNet) for comparison.
Our proposed MSRCP structure can alleviate the overfitting problem. To verify this, we conduct an ablation experiment. Overfitting typically shows up as a loss that is low on the training set but high on the validation set; another sign is that the validation loss gradually increases while the training loss decreases. Experiments are carried out on the MSRCP+BaseNet, BaseNet and MSRNet algorithms to verify that the MSRCP structure can alleviate overfitting. Figure 6 shows how the onset of overfitting is delayed. It can be seen from Fig. 6 that the MSRCP structure can indeed delay the occurrence of overfitting, which has certain reference value.
Table 1
The result of the ablation experiments on COCO2014 validation datasets when they are trained on COCO2014 train datasets
Method | Backbone | AP | AP\(_{50}\) | AP\(_{75}\) | AP\(_{S}\) | AP\(_{M}\) | AP\(_{L}\) | AR\(_{1}\) | AR\(_{10}\) | AR\(_{100}\) | AR\(_{S}\) | AR\(_{M}\) | AR\(_{L}\)
BaseNet | ResNet-50 | 36.1 | 56.3 | 38.7 | 14.4 | 41.9 | 52.3 | 30.9 | 48.8 | 51.0 | 23.6 | 57.1 | 74.8
MSRCP+BaseNet | ResNet-50 | 17.0 | 43.3 | 10.6 | 6.5 | 18.1 | 26.9 | 18.3 | 29.6 | 31.3 | 12.1 | 30.1 | 53.3
MFTRF+BaseNet | ResNet-50 | 38.4 | 59.4 | 41.3 | 15.1 | 44.5 | 56.4 | 32.2 | 50.2 | 52.3 | 24.9 | 58.6 | 76.6
MSRNet | ResNet-50 | 38.4 | 59.4 | 41.2 | 15.4 | 44.2 | 56.8 | 32.1 | 50.2 | 52.3 | 25.0 | 58.4 | 76.8
Table 2
The result of the ablation experiments on PASCAL VOC 2007 test datasets when they are trained on PASCAL VOC 2007 train datasets and PASCAL VOC 2012 train datasets
Method | mAP | Areo | Bike | Bird | Boat | Bottle | Bus | Car | Cat | Chair | Cow | Table | Dog | Horse | Mbike | Person | Plant | Sheep | Sofa | Train | TV
BaseNet | 69.59 | 72.92 | 79.73 | 70.95 | 55.19 | 46.18 | 75.68 | 82.38 | 82.24 | 50.42 | 71.35 | 59.77 | 74.69 | 83.57 | 78.05 | 78.05 | 40.12 | 69.03 | 69.60 | 79.04 | 72.93
MSRCP+BaseNet | 43.76 | 57.38 | 35.84 | 47.89 | 25.34 | 5.97 | 64.97 | 50.94 | 81.10 | 18.18 | 34.20 | 37.78 | 53.99 | 30.01 | 51.23 | 39.60 | 14.74 | 48.03 | 62.46 | 65.68 | 49.77
MFTRF+BaseNet | 65.93 | 73.13 | 68.18 | 69.25 | 55.45 | 47.40 | 77.42 | 79.10 | 82.08 | 46.84 | 69.21 | 53.61 | 73.83 | 56.25 | 70.53 | 75.99 | 35.36 | 70.31 | 67.54 | 79.04 | 68.09
MSRNet | 70.53 | 73.97 | 80.71 | 68.89 | 56.69 | 49.03 | 78.12 | 83.47 | 83.47 | 52.28 | 71.31 | 58.15 | 76.87 | 84.20 | 81.08 | 77.68 | 38.17 | 71.41 | 70.63 | 80.65 | 73.85
We further propose the MFTRF structure to improve the accuracy of the object detection algorithm for small objects. In order to verify that the MFTRF structure can improve small object detection accuracy, we conduct ablation experiments, whose results are shown in Tables 1 and 2. In Table 2, we can see that the detection accuracy of the object detection algorithm with the MFTRF structure is 65.93%, the accuracy of the algorithm without the MFTRF structure (namely MSRCP+BaseNet) is 43.76%, and the accuracy of MSRNet is 70.53%. It can be seen from Table 1 that the MFTRF+BaseNet algorithm (with the MFTRF structure) obtains 38.4% AP, and the MSRNet algorithm also obtains 38.4% AP. In Table 1, MSRCP+BaseNet without the MFTRF structure obtains 6.5% AP\(_{S}\), BaseNet obtains 14.4% AP\(_{S}\), MFTRF+BaseNet with the MFTRF structure obtains 15.1% AP\(_{S}\), and MSRNet obtains 15.4% AP\(_{S}\). Similarly, MSRCP+BaseNet obtains 12.1% AR\(_{S}\), BaseNet obtains 23.6% AR\(_{S}\), MFTRF+BaseNet obtains 24.9% AR\(_{S}\), and MSRNet obtains 25.0% AR\(_{S}\). AP\(_{S}\) and AR\(_{S}\) are the evaluation indicators for small objects; higher values indicate better small object detection performance. Comparing the detection results of the four methods in Table 1 on AP\(_{S}\) and AR\(_{S}\), we can see that MSRNet achieves the highest detection accuracy. By comprehensive comparison of Tables 1 and 2, we can conclude that the MFTRF structure does play a role.
Figure 8 shows the qualitative results and compares the MSRNet approach with the other algorithms with different structures. It can be seen from Fig. 8 that different algorithms have different advantages in detecting objects. MSRNet clearly achieves higher detection accuracy than BaseNet across different objects, including small objects, and its detection results are satisfactory. Therefore, the MFTRF structure can improve the detection accuracy of small objects.

Quantitative comparisons

We demonstrate the effectiveness of our proposed MSRNet by comparing it with other representative object detection algorithms.
Table 3
The mAP of the six methods on PASCAL VOC 2007 test datasets when they are trained on PASCAL VOC 2007 train datasets and PASCAL VOC 2012 train datasets
Method | mAP | Areo | Bike | Bird | Boat | Bottle | Bus | Car | Cat | Chair | Cow | Table | Dog | Horse | Mbike | Person | Plant | Sheep | Sofa | Train | TV
SSD [32] | 67.81 | 77.48 | 71.01 | 70.94 | 52.87 | 42.75 | 71.42 | 84.12 | 84.75 | 58.25 | 73.19 | 42.93 | 79.15 | 77.85 | 77.44 | 76.99 | 40.29 | 72.71 | 60.58 | 69.90 | 71.49
Faster R-CNN [27] | 66.97 | 68.98 | 77.82 | 71.82 | 51.20 | 46.25 | 72.51 | 80.36 | 84.17 | 47.28 | 62.19 | 56.41 | 77.15 | 79.28 | 72.74 | 77.52 | 37.93 | 69.18 | 65.30 | 76.71 | 64.56
DSSD [55] | 68.14 | 74.52 | 79.52 | 67.05 | 60.28 | 34.27 | 79.00 | 79.84 | 82.94 | 47.64 | 66.56 | 60.38 | 78.78 | 81.49 | 76.78 | 71.98 | 32.99 | 67.81 | 73.84 | 80.43 | 66.73
FPN [29] | 68.10 | 75.97 | 79.53 | 53.13 | 52.02 | 48.78 | 79.77 | 85.01 | 82.67 | 46.22 | 64.99 | 61.48 | 77.98 | 81.49 | 77.36 | 76.42 | 36.23 | 65.41 | 70.74 | 79.05 | 67.83
CenterNet [40] | 69.59 | 72.92 | 79.73 | 70.95 | 55.19 | 46.18 | 75.68 | 82.38 | 82.24 | 50.42 | 71.35 | 59.77 | 74.69 | 83.57 | 78.05 | 78.05 | 40.12 | 69.03 | 69.60 | 79.04 | 72.93
Ours | 70.53 | 73.97 | 80.71 | 68.89 | 56.69 | 49.03 | 78.12 | 83.47 | 83.47 | 52.28 | 71.31 | 58.15 | 76.87 | 84.20 | 81.08 | 77.68 | 38.17 | 71.41 | 70.63 | 80.65 | 73.85
Bold indicates the maximum value in the column
Table 4
The mAP of the seven methods on the PASCAL VOC 2007 test set when trained on the PASCAL VOC 2007 train set. The Fast R-CNN results are quoted directly from the original paper [26]
Method | mAP | Aero | Bike | Bird | Boat | Bottle | Bus | Car | Cat | Chair | Cow | Table | Dog | Horse | Mbike | Person | Plant | Sheep | Sofa | Train | TV
SSD [32] | 67.77 | 76.46 | 71.65 | 71.00 | 56.76 | 41.29 | 68.25 | 84.87 | 82.31 | 57.75 | 74.10 | 45.52 | 79.91 | 77.55 | 72.66 | 77.37 | 42.52 | 72.59 | 59.64 | 72.78 | 70.49
Faster R-CNN [27] | 66.79 | 70.31 | 78.46 | 69.46 | 50.66 | 45.54 | 72.76 | 80.88 | 82.61 | 46.53 | 65.40 | 53.89 | 77.46 | 80.35 | 74.19 | 78.96 | 40.09 | 61.56 | 65.74 | 74.58 | 66.34
DSSD [55] | 68.04 | 78.08 | 79.48 | 65.94 | 57.01 | 35.48 | 75.65 | 81.61 | 80.64 | 47.92 | 68.23 | 61.63 | 79.60 | 83.02 | 77.53 | 73.17 | 33.92 | 65.32 | 70.40 | 80.41 | 65.78
FPN [29] | 69.17 | 74.70 | 82.20 | 63.13 | 54.34 | 47.78 | 78.28 | 84.46 | 82.94 | 50.46 | 66.77 | 66.65 | 78.68 | 79.23 | 77.41 | 76.73 | 40.12 | 65.92 | 63.83 | 80.18 | 69.51
Fast R-CNN [26] | 58.50 | 68.10 | 72.80 | 56.80 | 43.00 | 36.80 | 66.30 | 74.20 | 67.60 | 34.40 | 63.50 | 54.50 | 61.20 | 69.10 | 68.60 | 58.70 | 33.40 | 62.90 | 51.10 | 62.50 | 64.80
CenterNet [40] | 70.53 | 74.66 | 82.26 | 70.45 | 54.75 | 46.19 | 78.09 | 83.96 | 81.82 | 53.10 | 74.01 | 60.59 | 75.10 | 84.35 | 79.97 | 78.72 | 41.05 | 70.32 | 66.78 | 81.22 | 73.22
Ours | 71.14 | 75.23 | 82.93 | 71.65 | 53.95 | 49.48 | 79.04 | 83.84 | 82.17 | 54.30 | 72.42 | 60.33 | 76.99 | 82.60 | 80.78 | 78.00 | 41.17 | 71.72 | 68.79 | 82.55 | 74.86
Bold indicates the maximum value in the column
Table 5
The performance of the six methods on COCO2014 validation datasets when they are trained on COCO2014 train datasets
Method | Backbone | FPS | AP | AP\(_{50}\) | AP\(_{75}\) | AP\(_{S}\) | AP\(_{M}\) | AP\(_{L}\) | AR\(_{1}\) | AR\(_{10}\) | AR\(_{100}\) | AR\(_{S}\) | AR\(_{M}\) | AR\(_{L}\)
SSD [32] | VGG-16 | 38 | 19.8 | 35.0 | 20.0 | 4.3 | 20.5 | 33.5 | 20.1 | 29.1 | 30.6 | 7.4 | 32.5 | 49.4
DSSD [55] | ResNet-101 | 15 | 28.8 | 47.0 | 30.0 | 8.4 | 30.7 | 46.9 | 25.5 | 37.5 | 39.5 | 13.9 | 43.3 | 61.2
RetinaNet [39] | ResNet-50 | 21 | 28.0 | 43.9 | 29.8 | 13.2 | 30.5 | 37.0 | 28.1 | 47.1 | 52.3 | 30.9 | 57.6 | 68.2
CenterNet [40] | ResNet-50 | 52 | 36.1 | 56.3 | 38.7 | 14.4 | 41.9 | 52.3 | 30.9 | 48.8 | 51.0 | 23.6 | 57.1 | 74.8
EfficientDet-D0 [38] | EfficientNet-B0 | 39 | 33.3 | 51.0 | 35.9 | 11.5 | 39.6 | 51.2 | 28.7 | 42.8 | 45.7 | 15.2 | 55.4 | 68.1
EfficientDet-D1 [38] | EfficientNet-B1 | 32 | 35.7 | 55.7 | 39.0 | 15.2 | 42.2 | 53.2 | 30.2 | 45.9 | 49.3 | 22.9 | 58.5 | 69.3
EfficientDet-D2 [38] | EfficientNet-B2 | 30 | 38.2 | 58.3 | 41.4 | 18.9 | 44.1 | 54.1 | 32.2 | 49.3 | 52.9 | 28.6 | 60.8 | 70.9
Ours | ResNet-50 | 32 | 38.4 | 59.4 | 41.2 | 15.4 | 44.2 | 56.8 | 32.1 | 50.2 | 52.3 | 25.0 | 58.4 | 76.8
Bold indicates the maximum value in the column
Table 6
The performance of the six methods on COCO2017 validation datasets (COCO minival set) when they are trained on COCO2017 train datasets
Method | Backbone | FPS | AP | AP\(_{50}\) | AP\(_{75}\) | AP\(_{S}\) | AP\(_{M}\) | AP\(_{L}\) | AR\(_{1}\) | AR\(_{10}\) | AR\(_{100}\) | AR\(_{S}\) | AR\(_{M}\) | AR\(_{L}\)
SSD [32] | VGG-16 | 38 | 19.4 | 34.3 | 19.8 | 4.1 | 20.2 | 32.5 | 20.0 | 28.8 | 30.4 | 7.2 | 32.3 | 49.5
DSSD [55] | ResNet-101 | 15 | 20.5 | 36.1 | 21.1 | 5.2 | 21.8 | 33.6 | 21.0 | 30.5 | 32.0 | 9.5 | 34.0 | 50.7
RetinaNet [39] | ResNet-50 | 21 | 30.5 | 48.3 | 32.1 | 14.0 | 34.0 | 43.3 | 28.0 | 45.0 | 49.7 | 30.3 | 54.9 | 65.9
CenterNet [40] | ResNet-50 | 52 | 32.5 | 51.0 | 34.8 | 13.8 | 38.6 | 46.8 | 28.7 | 45.9 | 48.0 | 23.7 | 53.5 | 69.7
EfficientDet-D0 [38] | EfficientNet-B0 | 39 | 28.0 | 44.5 | 29.2 | 10.3 | 32.7 | 44.9 | 25.6 | 38.7 | 41.7 | 14.6 | 50.0 | 62.3
EfficientDet-D1 [38] | EfficientNet-B1 | 32 | 31.4 | 50.1 | 33.2 | 13.4 | 37.0 | 45.6 | 27.3 | 41.9 | 45.0 | 21.2 | 52.5 | 63.2
EfficientDet-D2 [38] | EfficientNet-B2 | 30 | 32.6 | 51.9 | 34.6 | 17.0 | 38.0 | 45.9 | 28.3 | 44.0 | 47.3 | 26.0 | 54.0 | 63.1
Ours | ResNet-50 | 32 | 33.4 | 51.9 | 35.6 | 14.7 | 38.8 | 50.0 | 29.4 | 46.2 | 48.3 | 23.2 | 53.7 | 70.9
Bold indicates the maximum value in the column
Table 7
The performance of different backbone networks on COCO2017 validation datasets (COCO minival set) when they are trained on COCO2017 train datasets
Method | Backbone | AP | AP\(_{50}\) | AP\(_{75}\) | AP\(_{S}\) | AP\(_{M}\) | AP\(_{L}\) | AR\(_{1}\) | AR\(_{10}\) | AR\(_{100}\) | AR\(_{S}\) | AR\(_{M}\) | AR\(_{L}\)
RetinaNet [39] | ResNet-18 | 25.6 | 42.1 | 26.6 | 11.0 | 28.0 | 37.3 | 25.1 | 41.8 | 47.0 | 28.5 | 52.1 | 62.7
RetinaNet [39] | ResNet-34 | 29.2 | 46.3 | 30.7 | 13.1 | 32.4 | 41.7 | 27.8 | 44.8 | 49.8 | 30.5 | 55.5 | 66.0
RetinaNet [39] | ResNet-101 | 32.0 | 49.2 | 34.0 | 15.7 | 36.2 | 44.4 | 29.0 | 46.8 | 51.5 | 33.1 | 58.1 | 67.1
Ours | ResNet-50 | 33.4 | 51.9 | 35.6 | 14.7 | 38.8 | 50.0 | 29.4 | 46.2 | 48.3 | 23.2 | 53.7 | 70.9
Bold indicates the maximum value in the column
To demonstrate the effectiveness of MSRNet, we compare it with other classical object detection algorithms on the PASCAL VOC dataset. Table 3 shows the test results after training on the PASCAL VOC 2007 and PASCAL VOC 2012 datasets. It can be seen from Table 3 that our approach achieves good detection precision in several categories (bike, boat, bus, horse, motorbike, sofa, train and TV monitor). Our proposed method is 0.94% higher in mAP than the CenterNet [40] object detection algorithm and 2.72% higher than the SSD [32] algorithm. Table 4 shows the mAP of the seven methods. It can be seen from Table 4 that our approach achieves good detection precision in several categories (bike, bird, motorbike, train and TV monitor), and our proposed method is 0.61% higher in mAP than CenterNet [40]. In conclusion, our proposed MSRNet is more accurate than the other representative object detection algorithms. Figure 7 shows the graphical display of detection results on the PASCAL VOC dataset, from which it can be seen intuitively that our approach is competitive.
Table 5 shows the performance of the six methods on the COCO2014 dataset. In Table 5, we can see that the prediction results on COCO2014 have some advantages over the existing methods, with good accuracy in multi-scale object detection. Our proposed MFTRF structure helps with the difficult problem of small object detection, and by utilizing feature information in multiple spaces and combining the residual idea it also helps with multi-scale detection to a certain extent. It can be seen from Table 5 that our proposed method achieves 15.4% AP\(_{S}\) and 25.0% AR\(_{S}\) on small objects, which is 1% and 1.4% higher than the existing CenterNet [40] method's 14.4% AP\(_{S}\) and 23.6% AR\(_{S}\), respectively. At the same FPS, our method obtains 38.4% AP while EfficientDet-D1 [38] obtains 35.7% AP, i.e., 2.7% better than EfficientDet-D1 [38]. Our method is also competitive in detection speed. It can be seen from Table 5 that our method is competitive on large, medium and small objects. It achieves 56.8% AP\(_{L}\) and 76.8% AR\(_{L}\) on large objects, which is 4.5% and 2% higher than the 52.3% AP\(_{L}\) and 74.8% AR\(_{L}\) of the existing CenterNet [40]. EfficientDet-D2 [38] achieves 54.1% AP\(_{L}\) and 70.9% AR\(_{L}\), which is 2.7% and 5.9% lower than our method, respectively. Our method obtains 44.2% AP\(_{M}\) and 58.4% AR\(_{M}\), which is 2.3% and 1.3% higher than the existing CenterNet [40] method's 41.9% AP\(_{M}\) and 57.1% AR\(_{M}\), respectively. These results show that our proposed structure can not only alleviate the disappearance of small-object details, but also extract rich semantic information for large objects. Table 6 shows the performance of the six methods on the COCO2017 dataset. It can be seen from Table 6 that our method achieves 14.7% AP\(_{S}\) on small objects, 0.9% higher than the existing CenterNet [40] method's 13.8% AP\(_{S}\). At the same FPS, our method obtains 33.4% AP while EfficientDet-D1 [38] obtains 31.4% AP, i.e., 2% better. In Table 6, we can see that the prediction results on COCO2017 also have some advantages over the existing methods. Table 7 shows the performance of different backbone networks on the COCO2017 dataset, and it can be seen from Table 7 that our method also achieves better results when compared against different backbone networks. Therefore, our proposed MSRNet has certain scientific research value.

Conclusion

In this paper, we propose a novel multiple spatial residual network (MSRNet) for object detection by performing heat map prediction, center point deviation prediction, and width and height prediction. It consists of the multi-spatial residual channel pooling (MSRCP) structure and multi-scale feature transposed residual fusion (MFTRF) structure. Given appropriate training strategies, our proposed MSRCP structure and MFTRF structure can improve detection performance in some sense. Experimental results reveal that under the unified training and testing environment, the detection accuracy of our proposed MSRNet is higher than those of several other classical object detection algorithms.
The limitations of our proposed MSRNet include a large number of parameters and a slow detection speed. Noise also influences the final detection results of our model to some extent, and the resolution of the input images affects its effectiveness: with higher-resolution images, our model obtains more useful feature information. The background also has some influence on our model and may interfere with detection. When training the object categories, we use labels and the loss function to bring the model predictions closer to the labels; if background feature information interferes with object feature information, prediction errors may occur. To alleviate these problems, in the future we will construct a novel network architecture for object detection by effectively capturing object texture information [56, 57].

Acknowledgements

This work was supported by the National Natural Science Foundation of China under Grant 62071171.

Declarations

Conflict of interest

The authors declare that they have no conflict of interest.
Open Access This article is licensed under a Creative Commons Attribution 4.0 International License, which permits use, sharing, adaptation, distribution and reproduction in any medium or format, as long as you give appropriate credit to the original author(s) and the source, provide a link to the Creative Commons licence, and indicate if changes were made. The images or other third party material in this article are included in the article's Creative Commons licence, unless indicated otherwise in a credit line to the material. If material is not included in the article's Creative Commons licence and your intended use is not permitted by statutory regulation or exceeds the permitted use, you will need to obtain permission directly from the copyright holder. To view a copy of this licence, visit http://creativecommons.org/licenses/by/4.0/.

References
2. Basak H, Kundu R, Singh PK, Ijaz MF, Woźniak M, Sarkar R (2022) A union of deep learning and swarm-based optimization for 3D human action recognition. Sci Rep 12(1):1–17
6. Hu H-N, Cai Q-Z, Wang D, Lin J, Sun M, Kraehenbuehl P, Darrell T, Yu F (2019) Joint monocular 3D vehicle detection and tracking. In: Proceedings of the 2019 IEEE international conference on computer vision (ICCV), Seoul, pp 5389–5398. https://doi.org/10.1109/ICCV.2019.00549
9. Li X, Song D, Dong Y (2020) Hierarchical feature fusion network for salient object detection. IEEE Trans Image Process 29:9165–9175
11. Carion N, Massa F, Synnaeve G, Usunier N, Kirillov A, Zagoruyko S (2020) End-to-end object detection with transformers. In: Vedaldi A, Bischof H, Brox T, Frahm J-M (eds) European conference on computer vision (ECCV). Springer, Cham, pp 213–229
14. Girshick R, Donahue J, Darrell T, Malik J (2014) Rich feature hierarchies for accurate object detection and semantic segmentation. In: Proceedings of the 2014 IEEE conference on computer vision and pattern recognition (CVPR), Columbus, pp 580–587. https://doi.org/10.1109/CVPR.2014.81
17. Dong Y, Tan W, Tao D, Zheng L, Li X (2022) CartoonLossGAN: learning surface and coloring of images for cartoonization. IEEE Trans Image Process 31:485–498
18. Law H, Deng J (2018) CornerNet: detecting objects as paired keypoints. In: European conference on computer vision (ECCV)
25. Everingham M, Van Gool L, Williams CKI, Winn J, Zisserman A (2010) The PASCAL visual object classes (VOC) challenge. Int J Comput Vis 88(2):303–338
26. Girshick R (2015) Fast R-CNN. In: Proceedings of the 2015 IEEE international conference on computer vision (ICCV), Santiago, pp 1440–1448
29. Lin T-Y, Dollár P, Girshick R, He K, Hariharan B, Belongie S (2017) Feature pyramid networks for object detection. In: Proceedings of the 2017 IEEE conference on computer vision and pattern recognition (CVPR), Hawaii, pp 936–944. https://doi.org/10.1109/CVPR.2017.106
30. He K, Gkioxari G, Dollár P, Girshick R (2017) Mask R-CNN. In: IEEE international conference on computer vision (ICCV)
31. Qiao S, Chen L-C, Yuille A (2021) DetectoRS: detecting objects with recursive feature pyramid and switchable atrous convolution. In: Proceedings of the 2021 IEEE conference on computer vision and pattern recognition (CVPR), pp 10213–10224
32. Liu W, Anguelov D, Erhan D, Szegedy C, Reed S, Fu C-Y, Berg AC (2016) SSD: single shot multibox detector. In: Proceedings of the 2016 European conference on computer vision (ECCV), Amsterdam, pp 21–37
34. Simonyan K, Zisserman A (2015) Very deep convolutional networks for large-scale image recognition. In: Proceedings of the 3rd international conference on learning representations, San Diego, pp 1–14
44. Liang X, Zhang J, Zhuo L, Li Y, Tian Q (2020) Small object detection in unmanned aerial vehicle images using feature fusion and scaling-based single shot detector with spatial context analysis. IEEE Trans Circuits Syst Video Technol 30(6):1758–1770. https://doi.org/10.1109/TCSVT.2019.2905881
45. Yang S, Tian L, Zhou B, Chen D, Zhang D, Xu Z, Guo W, Liu J (2020) Inception parallel attention network for small object detection in remote sensing images. In: Chinese conference on pattern recognition and computer vision (PRCV), pp 469–480
48.
50. Costilla-Reyes O, Vera-Rodriguez R, Scully P, Ozanyan KB (2019) Analysis of spatio-temporal representations for robust footstep recognition with deep residual neural networks. IEEE Trans Pattern Anal Mach Intell 41(2):285–296
53. Paszke A, Gross S, Massa F, Lerer A, Bradbury J, Chanan G, Killeen T, Lin Z, Gimelshein N, Antiga L et al (2019) PyTorch: an imperative style, high-performance deep learning library. Adv Neural Inf Process Syst 32:8024–8035
55.
56. Dong Y, Wu H, Li X, Zhou C, Wu Q (2019) Multiscale symmetric dense micro-block difference for texture classification. IEEE Trans Circuits Syst Video Technol 29(12):3583–3594
57. Dong Y, Jin M, Li X, Ma J, Liu Z, Wang L, Zheng L (2021) Compact interchannel sampling difference descriptor for color texture classification. IEEE Trans Circuits Syst Video Technol 31(5):1684–1696
Metadata
Title: Multiple spatial residual network for object detection
Authors: Yongsheng Dong, Zhiqiang Jiang, Fazhan Tao, Zhumu Fu
Publication date: 06.09.2022
Publisher: Springer International Publishing
Published in: Complex & Intelligent Systems / Issue 2/2023
Print ISSN: 2199-4536
Electronic ISSN: 2198-6053
DOI: https://doi.org/10.1007/s40747-022-00859-7
