Published in: International Journal of Computer Vision | Issue 4/2024

07.11.2023

Universal Object Detection with Large Vision Model

Authors: Feng Lin, Wenze Hu, Yaowei Wang, Yonghong Tian, Guangming Lu, Fanglin Chen, Yong Xu, Xiaoyu Wang

Abstract

Over the past few years, there has been growing interest in developing a broad, universal, and general-purpose computer vision system. Such systems have the potential to address a wide range of vision tasks simultaneously, without being limited to specific problems or data domains. This universality is crucial for practical, real-world computer vision applications. In this study, our focus is on a specific challenge: the large-scale, multi-domain universal object detection problem, which contributes to the broader goal of achieving a universal vision system. This problem presents several intricate challenges, including cross-dataset category label duplication, label conflicts, and the necessity to handle hierarchical taxonomies. To address these challenges, we introduce our approach to label handling, hierarchy-aware loss design, and resource-efficient model training utilizing a pre-trained large vision model. Our method has demonstrated remarkable performance, securing a prestigious second-place ranking in the object detection track of the Robust Vision Challenge 2022 (RVC 2022) on a million-scale cross-dataset object detection benchmark. We believe that our comprehensive study will serve as a valuable reference and offer an alternative approach for addressing similar challenges within the computer vision community. The source code for our work is openly available at https://github.com/linfeng93/Large-UniDet.
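For illustration only: the abstract names a hierarchy-aware loss design but does not spell it out here. Below is a minimal sketch of one common way to make a sigmoid classification loss aware of a class hierarchy such as the Open Images taxonomy, where ancestors of an annotated class are treated as positives and its descendants are ignored (an annotation may sit at any level of the taxonomy). The function name, the dict-based taxonomy encoding, and the toy classes are our own assumptions, not necessarily the loss used in the paper.

```python
import torch
import torch.nn.functional as F

def hierarchy_aware_bce(logits, gt_class, ancestors, descendants):
    """Hierarchy-aware binary cross-entropy for one proposal (sketch).

    logits: (num_classes,) raw classification scores.
    gt_class: annotated class index.
    ancestors / descendants: dicts mapping class id -> list of class ids.
    """
    num_classes = logits.shape[0]
    targets = torch.zeros(num_classes)
    weights = torch.ones(num_classes)        # 1 = class contributes to the loss
    targets[gt_class] = 1.0
    targets[ancestors[gt_class]] = 1.0       # ancestors also count as positives
    weights[descendants[gt_class]] = 0.0     # descendants are unknown: ignore them
    return F.binary_cross_entropy_with_logits(
        logits, targets, weight=weights, reduction="sum")

# Toy taxonomy: 0 = "vehicle" (root); 1 = "car" and 2 = "truck" are its children.
ancestors = {0: [], 1: [0], 2: [0]}
descendants = {0: [1, 2], 1: [], 2: []}
logits = torch.tensor([2.0, -1.0, 0.5])

# An object annotated only as "vehicle": the loss does not penalize the
# "car"/"truck" scores, since the annotation does not say which subtype it is.
loss = hierarchy_aware_bce(logits, gt_class=0,
                           ancestors=ancestors, descendants=descendants)
```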


Footnotes

3. The object detectors used are built upon Cascade R-CNN, enhanced with NAS-FPN (×7) and Cascade RPN, with SEER-RegNet32gf as the backbone. As detailed in Sect. 4.2, the universal object detector is trained for 1.15M iterations with a batch size of 16. Since the re-sampled training set contains approximately 2.3M images, this corresponds to 8 training epochs. Accordingly, the training schedule for Individual OD on OID is set to 8 epochs. On a comparable scale, the object detector for Individual OD on COCO is trained for 12 epochs, following the default configuration of the mmdetection codebase. Lastly, the object detector for Individual OD on MVD is also trained for 12 epochs, with its network parameters initialized from the COCO detector just mentioned.
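As a quick check of the schedule arithmetic in footnote 3, the epoch count follows directly from the iteration budget; variable names below are illustrative, not from the paper.

```python
# 1.15M iterations at batch size 16 over a re-sampled set of ~2.3M images.
iterations = 1_150_000
batch_size = 16
train_set_size = 2_300_000

images_processed = iterations * batch_size      # 18.4M samples in total
epochs = images_processed / train_set_size      # passes over the training set
print(epochs)                                   # -> 8.0
```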
 
References
Azizi, S., Mustafa, B., Ryan, F., Beaver, Z., Freyberg, J., Deaton, J., et al. (2021). Big self-supervised models advance medical image classification. In Proceedings of the IEEE/CVF International Conference on Computer Vision (pp. 3478–3488).
Bello, I., Fedus, W., Du, X., Cubuk, E. D., Srinivas, A., Lin, T.-Y., & Zoph, B. (2021). Revisiting ResNets: Improved training and scaling strategies. Advances in Neural Information Processing Systems, 34, 22614–22627.
Bevandić, P., & Šegvić, S. (2022). Automatic universal taxonomies for multi-domain semantic segmentation. arXiv preprint arXiv:2207.08445.
Bodla, N., Singh, B., Chellappa, R., & Davis, L. S. (2017). Soft-NMS: Improving object detection with one line of code. In Proceedings of the IEEE International Conference on Computer Vision (pp. 5561–5569).
Bu, X., Peng, J., Yan, J., Tan, T., & Zhang, Z. (2021). GAIA: A transfer learning system of object detection that fits your needs. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (pp. 274–283).
Cai, L., Zhang, Z., Zhu, Y., Zhang, L., Li, M., & Xue, X. (2022). BigDetection: A large-scale benchmark for improved object detector pre-training. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (pp. 4777–4787).
Cai, Z., & Vasconcelos, N. (2018). Cascade R-CNN: Delving into high quality object detection. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (pp. 6154–6162).
Carion, N., Massa, F., Synnaeve, G., Usunier, N., Kirillov, A., & Zagoruyko, S. (2020). End-to-end object detection with transformers. In European Conference on Computer Vision (pp. 213–229).
Caron, M., Misra, I., Mairal, J., Goyal, P., Bojanowski, P., & Joulin, A. (2020). Unsupervised learning of visual features by contrasting cluster assignments. Advances in Neural Information Processing Systems, 33, 9912–9924.
Chen, K., Wang, J., Pang, J., Cao, Y., Xiong, Y., Li, X., et al. (2019). MMDetection: Open MMLab detection toolbox and benchmark. arXiv preprint arXiv:1906.07155.
Dai, Z., Liu, H., Le, Q. V., & Tan, M. (2021). CoAtNet: Marrying convolution and attention for all data sizes. Advances in Neural Information Processing Systems, 34, 3965–3977.
Deng, J., Dong, W., Socher, R., Li, L.-J., Li, K., & Fei-Fei, L. (2009). ImageNet: A large-scale hierarchical image database. In 2009 IEEE Conference on Computer Vision and Pattern Recognition (pp. 248–255).
Devlin, J., Chang, M.-W., Lee, K., & Toutanova, K. (2018). BERT: Pre-training of deep bidirectional transformers for language understanding. arXiv preprint arXiv:1810.04805.
Dollar, P., Wojek, C., Schiele, B., & Perona, P. (2011). Pedestrian detection: An evaluation of the state of the art. IEEE Transactions on Pattern Analysis and Machine Intelligence, 34(4), 743–761.
Gao, M., Yu, R., Li, A., Morariu, V. I., & Davis, L. S. (2018). Dynamic zoom-in network for fast object detection in large images. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (pp. 6926–6935).
Ghiasi, G., Lin, T.-Y., & Le, Q. V. (2019). NAS-FPN: Learning scalable feature pyramid architecture for object detection. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (pp. 7036–7045).
Girshick, R. (2015). Fast R-CNN. In Proceedings of the IEEE International Conference on Computer Vision (pp. 1440–1448).
Gong, R., Dai, D., Chen, Y., Li, W., & Van Gool, L. (2021). mDALU: Multi-source domain adaptation and label unification with partial datasets. In Proceedings of the IEEE/CVF International Conference on Computer Vision (pp. 8876–8885).
Goyal, P., Duval, Q., Seessel, I., Caron, M., Singh, M., Misra, I., & Bojanowski, P. (2022). Vision models are more robust and fair when pretrained on uncurated images without supervision. arXiv preprint arXiv:2202.08360.
Gupta, A., Dollar, P., & Girshick, R. (2019). LVIS: A dataset for large vocabulary instance segmentation. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (pp. 5356–5364).
Gupta, T., Kamath, A., Kembhavi, A., & Hoiem, D. (2022). Towards general purpose vision systems: An end-to-end task-agnostic vision-language architecture. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (pp. 16399–16409).
Hasan, I., Liao, S., Li, J., Akram, S. U., & Shao, L. (2021). Generalizable pedestrian detection: The elephant in the room. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (pp. 11328–11337).
He, K., Chen, X., Xie, S., Li, Y., Dollár, P., & Girshick, R. (2022). Masked autoencoders are scalable vision learners. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (pp. 16000–16009).
He, K., Fan, H., Wu, Y., Xie, S., & Girshick, R. (2020). Momentum contrast for unsupervised visual representation learning. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (pp. 9729–9738).
He, Y., Huang, G., Chen, S., Teng, J., Wang, K., Yin, Z., & Shao, J. (2022). X-Learner: Learning cross sources and tasks for universal visual representation. In European Conference on Computer Vision (pp. 509–528).
Huang, J., Rathod, V., Sun, C., Zhu, M., Korattikara, A., Fathi, A., et al. (2017). Speed/accuracy trade-offs for modern convolutional object detectors. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (pp. 7310–7311).
Jia, C., Yang, Y., Xia, Y., Chen, Y.-T., Parekh, Z., Pham, H., & Duerig, T. (2021). Scaling up visual and vision-language representation learning with noisy text supervision. In International Conference on Machine Learning (pp. 4904–4916).
Joulin, A., van der Maaten, L., Jabri, A., & Vasilache, N. (2016). Learning visual features from large weakly supervised data. In European Conference on Computer Vision (pp. 67–84).
Kamath, A., Clark, C., Gupta, T., Kolve, E., Hoiem, D., & Kembhavi, A. (2022). Webly supervised concept expansion for general purpose vision models. In European Conference on Computer Vision (pp. 662–681).
Kolesnikov, A., Zhai, X., & Beyer, L. (2019). Revisiting self-supervised visual representation learning. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (pp. 1920–1929).
Kornblith, S., Shlens, J., & Le, Q. V. (2019). Do better ImageNet models transfer better? In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (pp. 2661–2671).
Krishna, R., Zhu, Y., Groth, O., Johnson, J., Hata, K., Kravitz, J., et al. (2017). Visual Genome: Connecting language and vision using crowdsourced dense image annotations. International Journal of Computer Vision, 123(1), 32–73.
Kuznetsova, A., Rom, H., Alldrin, N., Uijlings, J., Krasin, I., Pont-Tuset, J., et al. (2020). The Open Images Dataset V4. International Journal of Computer Vision, 128(7), 1956–1981.
Lambert, J., Liu, Z., Sener, O., Hays, J., & Koltun, V. (2020). MSeg: A composite dataset for multi-domain semantic segmentation. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (pp. 2879–2888).
Lin, F., Xu, H., Li, H., Xiong, H., & Qi, G.-J. (2021). Auto-encoding transformations in reparameterized Lie groups for unsupervised learning. In Proceedings of the AAAI Conference on Artificial Intelligence (Vol. 35, pp. 8610–8617).
Lin, T.-Y., Dollár, P., Girshick, R., He, K., Hariharan, B., & Belongie, S. (2017). Feature pyramid networks for object detection. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (pp. 2117–2125).
Lin, T.-Y., Maire, M., Belongie, S., Hays, J., Perona, P., Ramanan, D., & Zitnick, C. L. (2014). Microsoft COCO: Common objects in context. In European Conference on Computer Vision (pp. 740–755).
Liu, S., Qi, L., Qin, H., Shi, J., & Jia, J. (2018). Path aggregation network for instance segmentation. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (pp. 8759–8768).
Liu, Y., Wang, Y., Wang, S., Liang, T., Zhao, Q., Tang, Z., & Ling, H. (2020). CBNet: A novel composite backbone network architecture for object detection. In Proceedings of the AAAI Conference on Artificial Intelligence (Vol. 34, pp. 11653–11660).
Liu, Z., Hu, H., Lin, Y., Yao, Z., Xie, Z., Wei, Y., et al. (2022). Swin Transformer V2: Scaling up capacity and resolution. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (pp. 12009–12019).
Liu, Z., Lin, Y., Cao, Y., Hu, H., Wei, Y., Zhang, Z., & Guo, B. (2021). Swin Transformer: Hierarchical vision transformer using shifted windows. In Proceedings of the IEEE/CVF International Conference on Computer Vision (pp. 10012–10022).
Lu, J., Clark, C., Zellers, R., Mottaghi, R., & Kembhavi, A. (2022). Unified-IO: A unified model for vision, language, and multi-modal tasks. arXiv preprint arXiv:2206.08916.
Meng, L., Dai, X., Chen, Y., Zhang, P., Chen, D., Liu, M., & Jiang, Y.-G. (2022). Detection Hub: Unifying object detection datasets via query adaptation on language embedding. arXiv preprint arXiv:2206.03484.
Micikevicius, P., Narang, S., Alben, J., Diamos, G., Elsen, E., Garcia, D., et al. (2018). Mixed precision training. In International Conference on Learning Representations.
Neuhold, G., Ollmann, T., Rota Bulo, S., & Kontschieder, P. (2017). The Mapillary Vistas dataset for semantic understanding of street scenes. In Proceedings of the IEEE International Conference on Computer Vision (pp. 4990–4999).
Pang, J., Chen, K., Shi, J., Feng, H., Ouyang, W., & Lin, D. (2019). Libra R-CNN: Towards balanced learning for object detection. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (pp. 821–830).
Radford, A., Kim, J. W., Hallacy, C., Ramesh, A., Goh, G., Agarwal, S., et al. (2021). Learning transferable visual models from natural language supervision. In International Conference on Machine Learning (pp. 8748–8763).
Radford, A., Narasimhan, K., Salimans, T., Sutskever, I., et al. (2018). Improving language understanding by generative pre-training.
Radosavovic, I., Kosaraju, R. P., Girshick, R., He, K., & Dollár, P. (2020). Designing network design spaces. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (pp. 10428–10436).
Raffel, C., Shazeer, N., Roberts, A., Lee, K., Narang, S., Matena, M., & Liu, P. J. (2020). Exploring the limits of transfer learning with a unified text-to-text transformer. The Journal of Machine Learning Research, 21(1), 5485–5551.
Ranftl, R., Lasinger, K., Hafner, D., Schindler, K., & Koltun, V. (2020). Towards robust monocular depth estimation: Mixing datasets for zero-shot cross-dataset transfer. IEEE Transactions on Pattern Analysis and Machine Intelligence, 44(3), 1623–1637.
Ren, S., He, K., Girshick, R., & Sun, J. (2015). Faster R-CNN: Towards real-time object detection with region proposal networks. Advances in Neural Information Processing Systems, 28.
Shao, J., Chen, S., Li, Y., Wang, K., Yin, Z., He, Y., et al. (2021). INTERN: A new learning paradigm towards general vision. arXiv preprint arXiv:2111.08687.
Shao, S., Li, Z., Zhang, T., Peng, C., Yu, G., Zhang, X., & Sun, J. (2019). Objects365: A large-scale, high-quality dataset for object detection. In Proceedings of the IEEE/CVF International Conference on Computer Vision (pp. 8430–8439).
Singh, B., & Davis, L. S. (2018). An analysis of scale invariance in object detection - SNIP. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (pp. 3578–3587).
Sun, C., Shrivastava, A., Singh, S., & Gupta, A. (2017). Revisiting unreasonable effectiveness of data in deep learning era. In Proceedings of the IEEE International Conference on Computer Vision (pp. 843–852).
Tan, M., Pang, R., & Le, Q. V. (2020). EfficientDet: Scalable and efficient object detection. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (pp. 10781–10790).
Vasconcelos, C., Birodkar, V., & Dumoulin, V. (2022). Proper reuse of image classification features improves object detection. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (pp. 13628–13637).
Vu, T., Jang, H., Pham, T. X., & Yoo, C. (2019). Cascade RPN: Delving into high-quality region proposal network with adaptive convolution. Advances in Neural Information Processing Systems, 32.
Wang, X., Cai, Z., Gao, D., & Vasconcelos, N. (2019). Towards universal object detection by domain attention. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (pp. 7289–7298).
Xu, H., Fang, L., Liang, X., Kang, W., & Li, Z. (2020). Universal-RCNN: Universal object detector via transferable graph R-CNN. In Proceedings of the AAAI Conference on Artificial Intelligence (Vol. 34, pp. 12492–12499).
Xu, H., Zhang, X., Li, H., Xie, L., Dai, W., Xiong, H., & Tian, Q. (2022). Seed the views: Hierarchical semantic alignment for contrastive representation learning. IEEE Transactions on Pattern Analysis and Machine Intelligence, 45(3), 3753–3767.
Yu, J., Jiang, Y., Wang, Z., Cao, Z., & Huang, T. (2016). UnitBox: An advanced object detection network. In Proceedings of the 24th ACM International Conference on Multimedia (pp. 516–520).
Yuan, L., Chen, D., Chen, Y.-L., Codella, N., Dai, X., Gao, J., et al. (2021). Florence: A new foundation model for computer vision. arXiv preprint arXiv:2111.11432.
Zhang, S., Benenson, R., & Schiele, B. (2017). CityPersons: A diverse dataset for pedestrian detection. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (pp. 3213–3221).
Zhao, X., Schulter, S., Sharma, G., Tsai, Y.-H., Chandraker, M., & Wu, Y. (2020). Object detection with a unified label space from multiple datasets. In European Conference on Computer Vision (pp. 178–193).
Zhou, B., Zhao, H., Puig, X., Fidler, S., Barriuso, A., & Torralba, A. (2017). Scene parsing through ADE20K dataset. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (pp. 633–641).
Zhou, X., Koltun, V., & Krähenbühl, P. (2022). Simple multi-dataset detection. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (pp. 7571–7580).
Metadata
Title
Universal Object Detection with Large Vision Model
Authors
Feng Lin
Wenze Hu
Yaowei Wang
Yonghong Tian
Guangming Lu
Fanglin Chen
Yong Xu
Xiaoyu Wang
Publication date
07.11.2023
Publisher
Springer US
Published in
International Journal of Computer Vision / Issue 4/2024
Print ISSN: 0920-5691
Electronic ISSN: 1573-1405
DOI
https://doi.org/10.1007/s11263-023-01929-0
