Published in: International Journal of Computer Vision 5/2020

03-12-2019

The Unmanned Aerial Vehicle Benchmark: Object Detection, Tracking and Baseline

Authors: Hongyang Yu, Guorong Li, Weigang Zhang, Qingming Huang, Dawei Du, Qi Tian, Nicu Sebe

Abstract

With the increasing popularity of Unmanned Aerial Vehicles (UAVs) in computer vision applications, intelligent UAV video analysis has recently attracted growing research attention. To facilitate research in the UAV field, this paper presents a UAV dataset with 100 videos featuring approximately 2700 vehicles recorded under unconstrained conditions and 840k manually annotated bounding boxes. These UAV videos were recorded in complex real-world scenarios and pose significant new challenges to existing object detection and tracking methods, such as complex scenes, high object density, small objects, and large camera motion. These challenges motivated us to define a benchmark for three fundamental computer vision tasks on our UAV dataset: object detection, single object tracking (SOT), and multiple object tracking (MOT). Specifically, our UAV benchmark enables evaluation and detailed analysis of state-of-the-art detection and tracking methods on the proposed dataset. Furthermore, we propose a novel approach based on a Context-aware Multi-task Siamese Network (CMSN) model that exploits new cues in UAV videos by judging the degree of consistency between objects and their contexts, and that can be used for both SOT and MOT. Experimental results demonstrate that our model makes tracking more robust in both SOT and MOT, and show that current tracking and detection methods have limitations in dealing with the proposed UAV benchmark, so further research is indeed needed.
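The abstract's core idea of judging the degree of consistency between an object and its surrounding context can be illustrated with a toy matching score. The sketch below is not the paper's CMSN: the `embed` stand-in, the `alpha` weight, and all names are illustrative assumptions, and a real tracker would replace `embed` with a shared, learned Siamese CNN branch.

```python
import numpy as np


def embed(patch: np.ndarray) -> np.ndarray:
    """Stand-in for a learned Siamese branch: flatten and L2-normalize.

    Normalizing makes the dot product below a cosine similarity.
    """
    v = patch.astype(np.float64).ravel()
    n = np.linalg.norm(v)
    return v / n if n > 0 else v


def match_score(target, candidate, target_ctx, candidate_ctx, alpha=0.5):
    """Blend appearance similarity with a context-consistency term.

    alpha weighs object appearance against agreement of the two
    surrounding-context patches (a hypothetical weighting, not CMSN's).
    """
    appearance = float(embed(target) @ embed(candidate))
    context = float(embed(target_ctx) @ embed(candidate_ctx))
    return alpha * appearance + (1.0 - alpha) * context


rng = np.random.default_rng(0)
obj = rng.random((8, 8))       # target object patch
ctx = rng.random((16, 16))     # patch of the target's surroundings

# An identical candidate in an identical context scores the maximum 1.0;
# an unrelated random patch in an unrelated context scores lower.
same = match_score(obj, obj, ctx, ctx)
diff = match_score(obj, rng.random((8, 8)), ctx, rng.random((16, 16)))
print(same, diff)
```

The context term is what distinguishes this from a plain appearance-only Siamese score: a candidate that looks like the target but sits in an inconsistent context is penalized.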


Footnotes
1
We use a DJI Inspire 2 to collect the videos. More information about the UAV platform can be found at http://www.dji.com/inspire-2.
 
2
Our dataset is available for download at https://sites.google.com/site/daviddo0323/.
 
Metadata
Title
The Unmanned Aerial Vehicle Benchmark: Object Detection, Tracking and Baseline
Authors
Hongyang Yu
Guorong Li
Weigang Zhang
Qingming Huang
Dawei Du
Qi Tian
Nicu Sebe
Publication date
03-12-2019
Publisher
Springer US
Published in
International Journal of Computer Vision / Issue 5/2020
Print ISSN: 0920-5691
Electronic ISSN: 1573-1405
DOI
https://doi.org/10.1007/s11263-019-01266-1
