Published in: International Journal of Computer Vision | Issue 4/2024

07.11.2023

Universal Object Detection with Large Vision Model

Authors: Feng Lin, Wenze Hu, Yaowei Wang, Yonghong Tian, Guangming Lu, Fanglin Chen, Yong Xu, Xiaoyu Wang

Abstract

Over the past few years, there has been growing interest in developing a broad, universal, and general-purpose computer vision system. Such systems have the potential to address a wide range of vision tasks simultaneously, without being limited to specific problems or data domains. This universality is crucial for practical, real-world computer vision applications. In this study, our focus is on a specific challenge: the large-scale, multi-domain universal object detection problem, which contributes to the broader goal of achieving a universal vision system. This problem presents several intricate challenges, including cross-dataset category label duplication, label conflicts, and the necessity to handle hierarchical taxonomies. To address these challenges, we introduce our approach to label handling, hierarchy-aware loss design, and resource-efficient model training utilizing a pre-trained large vision model. Our method has demonstrated remarkable performance, securing a prestigious second-place ranking in the object detection track of the Robust Vision Challenge 2022 (RVC 2022) on a million-scale cross-dataset object detection benchmark. We believe that our comprehensive study will serve as a valuable reference and offer an alternative approach for addressing similar challenges within the computer vision community. The source code for our work is openly available at https://github.com/linfeng93/Large-UniDet.
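For illustration only: the abstract names a hierarchy-aware loss design but does not spell it out here. Below is a minimal sketch of one common way to make a sigmoid classification loss aware of a class hierarchy such as the Open Images taxonomy, where ancestors of an annotated class are treated as positives and its descendants are ignored (an annotation may sit at any level of the taxonomy). The function name, the dict-based taxonomy encoding, and the toy classes are our own assumptions, not necessarily the loss used in the paper.

```python
import torch
import torch.nn.functional as F

def hierarchy_aware_bce(logits, gt_class, ancestors, descendants):
    """Hierarchy-aware binary cross-entropy for one proposal (sketch).

    logits: (num_classes,) raw classification scores.
    gt_class: annotated class index.
    ancestors / descendants: dicts mapping class id -> list of class ids.
    """
    num_classes = logits.shape[0]
    targets = torch.zeros(num_classes)
    weights = torch.ones(num_classes)        # 1 = class contributes to the loss
    targets[gt_class] = 1.0
    targets[ancestors[gt_class]] = 1.0       # ancestors also count as positives
    weights[descendants[gt_class]] = 0.0     # descendants are unknown: ignore them
    return F.binary_cross_entropy_with_logits(
        logits, targets, weight=weights, reduction="sum")

# Toy taxonomy: 0 = "vehicle" (root); 1 = "car" and 2 = "truck" are its children.
ancestors = {0: [], 1: [0], 2: [0]}
descendants = {0: [1, 2], 1: [], 2: []}
logits = torch.tensor([2.0, -1.0, 0.5])

# An object annotated only as "vehicle": the loss does not penalize the
# "car"/"truck" scores, since the annotation does not say which subtype it is.
loss = hierarchy_aware_bce(logits, gt_class=0,
                           ancestors=ancestors, descendants=descendants)
```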


Footnotes

3. The object detectors used are built upon Cascade R-CNN, enhanced with NAS-FPN (×7) and Cascade RPN, with SEER-RegNet32gf as the backbone. As detailed in Sect. 4.2, the universal object detector is trained for 1.15M iterations with a batch size of 16. Since the re-sampled training set contains approximately 2.3M images, this corresponds to 8 training epochs. Accordingly, the training schedule for Individual OD on OID is set to 8 epochs. On a comparable scale, the object detector for Individual OD on COCO is trained for 12 epochs, following the default configuration of the mmdetection codebase. Lastly, the object detector for Individual OD on MVD is also trained for 12 epochs, with its network parameters initialized from the COCO detector just mentioned.
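As a quick check of the schedule arithmetic in footnote 3, the epoch count follows directly from the iteration budget; variable names below are illustrative, not from the paper.

```python
# 1.15M iterations at batch size 16 over a re-sampled set of ~2.3M images.
iterations = 1_150_000
batch_size = 16
train_set_size = 2_300_000

images_processed = iterations * batch_size      # 18.4M samples in total
epochs = images_processed / train_set_size      # passes over the training set
print(epochs)                                   # -> 8.0
```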
 
References
Azizi, S., Mustafa, B., Ryan, F., Beaver, Z., Freyberg, J., Deaton, J., et al. (2021). Big self-supervised models advance medical image classification. In Proceedings of the IEEE/CVF International Conference on Computer Vision (pp. 3478–3488).
Bello, I., Fedus, W., Du, X., Cubuk, E. D., Srinivas, A., Lin, T.-Y., & Zoph, B. (2021). Revisiting ResNets: Improved training and scaling strategies. Advances in Neural Information Processing Systems, 34, 22614–22627.
Bevandić, P., & Šegvić, S. (2022). Automatic universal taxonomies for multi-domain semantic segmentation. arXiv preprint arXiv:2207.08445.
Bodla, N., Singh, B., Chellappa, R., & Davis, L. S. (2017). Soft-NMS: Improving object detection with one line of code. In Proceedings of the IEEE International Conference on Computer Vision (pp. 5561–5569).
Bu, X., Peng, J., Yan, J., Tan, T., & Zhang, Z. (2021). GAIA: A transfer learning system of object detection that fits your needs. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (pp. 274–283).
Cai, L., Zhang, Z., Zhu, Y., Zhang, L., Li, M., & Xue, X. (2022). BigDetection: A large-scale benchmark for improved object detector pre-training. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (pp. 4777–4787).
Cai, Z., & Vasconcelos, N. (2018). Cascade R-CNN: Delving into high quality object detection. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (pp. 6154–6162).
Carion, N., Massa, F., Synnaeve, G., Usunier, N., Kirillov, A., & Zagoruyko, S. (2020). End-to-end object detection with transformers. In European Conference on Computer Vision (pp. 213–229).
Caron, M., Misra, I., Mairal, J., Goyal, P., Bojanowski, P., & Joulin, A. (2020). Unsupervised learning of visual features by contrasting cluster assignments. Advances in Neural Information Processing Systems, 33, 9912–9924.
Chen, K., Wang, J., Pang, J., Cao, Y., Xiong, Y., Li, X., et al. (2019). MMDetection: Open MMLab detection toolbox and benchmark. arXiv preprint arXiv:1906.07155.
Dai, Z., Liu, H., Le, Q. V., & Tan, M. (2021). CoAtNet: Marrying convolution and attention for all data sizes. Advances in Neural Information Processing Systems, 34, 3965–3977.
Deng, J., Dong, W., Socher, R., Li, L.-J., Li, K., & Fei-Fei, L. (2009). ImageNet: A large-scale hierarchical image database. In 2009 IEEE Conference on Computer Vision and Pattern Recognition (pp. 248–255).
Devlin, J., Chang, M.-W., Lee, K., & Toutanova, K. (2018). BERT: Pre-training of deep bidirectional transformers for language understanding. arXiv preprint arXiv:1810.04805.
Dollar, P., Wojek, C., Schiele, B., & Perona, P. (2011). Pedestrian detection: An evaluation of the state of the art. IEEE Transactions on Pattern Analysis and Machine Intelligence, 34(4), 743–761.
Gao, M., Yu, R., Li, A., Morariu, V. I., & Davis, L. S. (2018). Dynamic zoom-in network for fast object detection in large images. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (pp. 6926–6935).
Ghiasi, G., Lin, T.-Y., & Le, Q. V. (2019). NAS-FPN: Learning scalable feature pyramid architecture for object detection. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (pp. 7036–7045).
Girshick, R. (2015). Fast R-CNN. In Proceedings of the IEEE International Conference on Computer Vision (pp. 1440–1448).
Gong, R., Dai, D., Chen, Y., Li, W., & Van Gool, L. (2021). mDALU: Multi-source domain adaptation and label unification with partial datasets. In Proceedings of the IEEE/CVF International Conference on Computer Vision (pp. 8876–8885).
Goyal, P., Duval, Q., Seessel, I., Caron, M., Singh, M., Misra, I., & Bojanowski, P. (2022). Vision models are more robust and fair when pretrained on uncurated images without supervision. arXiv preprint arXiv:2202.08360.
Gupta, A., Dollar, P., & Girshick, R. (2019). LVIS: A dataset for large vocabulary instance segmentation. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (pp. 5356–5364).
Gupta, T., Kamath, A., Kembhavi, A., & Hoiem, D. (2022). Towards general purpose vision systems: An end-to-end task-agnostic vision-language architecture. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (pp. 16399–16409).
Hasan, I., Liao, S., Li, J., Akram, S. U., & Shao, L. (2021). Generalizable pedestrian detection: The elephant in the room. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (pp. 11328–11337).
He, K., Chen, X., Xie, S., Li, Y., Dollár, P., & Girshick, R. (2022). Masked autoencoders are scalable vision learners. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (pp. 16000–16009).
He, K., Fan, H., Wu, Y., Xie, S., & Girshick, R. (2020). Momentum contrast for unsupervised visual representation learning. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (pp. 9729–9738).
He, Y., Huang, G., Chen, S., Teng, J., Wang, K., Yin, Z., & Shao, J. (2022). X-Learner: Learning cross sources and tasks for universal visual representation. In European Conference on Computer Vision (pp. 509–528).
Huang, J., Rathod, V., Sun, C., Zhu, M., Korattikara, A., Fathi, A., et al. (2017). Speed/accuracy trade-offs for modern convolutional object detectors. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (pp. 7310–7311).
Jia, C., Yang, Y., Xia, Y., Chen, Y.-T., Parekh, Z., Pham, H., & Duerig, T. (2021). Scaling up visual and vision-language representation learning with noisy text supervision. In International Conference on Machine Learning (pp. 4904–4916).
Joulin, A., van der Maaten, L., Jabri, A., & Vasilache, N. (2016). Learning visual features from large weakly supervised data. In European Conference on Computer Vision (pp. 67–84).
Kamath, A., Clark, C., Gupta, T., Kolve, E., Hoiem, D., & Kembhavi, A. (2022). Webly supervised concept expansion for general purpose vision models. In European Conference on Computer Vision (pp. 662–681).
Kolesnikov, A., Zhai, X., & Beyer, L. (2019). Revisiting self-supervised visual representation learning. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (pp. 1920–1929).
Kornblith, S., Shlens, J., & Le, Q. V. (2019). Do better ImageNet models transfer better? In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (pp. 2661–2671).
Krishna, R., Zhu, Y., Groth, O., Johnson, J., Hata, K., Kravitz, J., et al. (2017). Visual Genome: Connecting language and vision using crowdsourced dense image annotations. International Journal of Computer Vision, 123(1), 32–73.
Kuznetsova, A., Rom, H., Alldrin, N., Uijlings, J., Krasin, I., Pont-Tuset, J., et al. (2020). The Open Images Dataset V4. International Journal of Computer Vision, 128(7), 1956–1981.
Lambert, J., Liu, Z., Sener, O., Hays, J., & Koltun, V. (2020). MSeg: A composite dataset for multi-domain semantic segmentation. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (pp. 2879–2888).
Lin, F., Xu, H., Li, H., Xiong, H., & Qi, G.-J. (2021). Auto-encoding transformations in reparameterized Lie groups for unsupervised learning. In Proceedings of the AAAI Conference on Artificial Intelligence (Vol. 35, pp. 8610–8617).
Lin, T.-Y., Dollár, P., Girshick, R., He, K., Hariharan, B., & Belongie, S. (2017). Feature pyramid networks for object detection. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (pp. 2117–2125).
Lin, T.-Y., Maire, M., Belongie, S., Hays, J., Perona, P., Ramanan, D., & Zitnick, C. L. (2014). Microsoft COCO: Common objects in context. In European Conference on Computer Vision (pp. 740–755).
Liu, S., Qi, L., Qin, H., Shi, J., & Jia, J. (2018). Path aggregation network for instance segmentation. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (pp. 8759–8768).
Liu, Y., Wang, Y., Wang, S., Liang, T., Zhao, Q., Tang, Z., & Ling, H. (2020). CBNet: A novel composite backbone network architecture for object detection. In Proceedings of the AAAI Conference on Artificial Intelligence (Vol. 34, pp. 11653–11660).
Liu, Z., Hu, H., Lin, Y., Yao, Z., Xie, Z., Wei, Y., et al. (2022). Swin Transformer V2: Scaling up capacity and resolution. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (pp. 12009–12019).
Liu, Z., Lin, Y., Cao, Y., Hu, H., Wei, Y., Zhang, Z., & Guo, B. (2021). Swin Transformer: Hierarchical vision transformer using shifted windows. In Proceedings of the IEEE/CVF International Conference on Computer Vision (pp. 10012–10022).
Lu, J., Clark, C., Zellers, R., Mottaghi, R., & Kembhavi, A. (2022). Unified-IO: A unified model for vision, language, and multi-modal tasks. arXiv preprint arXiv:2206.08916.
Meng, L., Dai, X., Chen, Y., Zhang, P., Chen, D., Liu, M., & Jiang, Y.-G. (2022). Detection Hub: Unifying object detection datasets via query adaptation on language embedding. arXiv preprint arXiv:2206.03484.
Micikevicius, P., Narang, S., Alben, J., Diamos, G., Elsen, E., Garcia, D., et al. (2018). Mixed precision training. In International Conference on Learning Representations.
Neuhold, G., Ollmann, T., Rota Bulo, S., & Kontschieder, P. (2017). The Mapillary Vistas dataset for semantic understanding of street scenes. In Proceedings of the IEEE International Conference on Computer Vision (pp. 4990–4999).
Pang, J., Chen, K., Shi, J., Feng, H., Ouyang, W., & Lin, D. (2019). Libra R-CNN: Towards balanced learning for object detection. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (pp. 821–830).
Radford, A., Kim, J. W., Hallacy, C., Ramesh, A., Goh, G., Agarwal, S., et al. (2021). Learning transferable visual models from natural language supervision. In International Conference on Machine Learning (pp. 8748–8763).
Radford, A., Narasimhan, K., Salimans, T., Sutskever, I., et al. (2018). Improving language understanding by generative pre-training.
Radosavovic, I., Kosaraju, R. P., Girshick, R., He, K., & Dollár, P. (2020). Designing network design spaces. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (pp. 10428–10436).
Raffel, C., Shazeer, N., Roberts, A., Lee, K., Narang, S., Matena, M., & Liu, P. J. (2020). Exploring the limits of transfer learning with a unified text-to-text transformer. The Journal of Machine Learning Research, 21(1), 5485–5551.
Ranftl, R., Lasinger, K., Hafner, D., Schindler, K., & Koltun, V. (2020). Towards robust monocular depth estimation: Mixing datasets for zero-shot cross-dataset transfer. IEEE Transactions on Pattern Analysis and Machine Intelligence, 44(3), 1623–1637.
Ren, S., He, K., Girshick, R., & Sun, J. (2015). Faster R-CNN: Towards real-time object detection with region proposal networks. Advances in Neural Information Processing Systems, 28.
Shao, J., Chen, S., Li, Y., Wang, K., Yin, Z., He, Y., et al. (2021). INTERN: A new learning paradigm towards general vision. arXiv preprint arXiv:2111.08687.
Shao, S., Li, Z., Zhang, T., Peng, C., Yu, G., Zhang, X., & Sun, J. (2019). Objects365: A large-scale, high-quality dataset for object detection. In Proceedings of the IEEE/CVF International Conference on Computer Vision (pp. 8430–8439).
Singh, B., & Davis, L. S. (2018). An analysis of scale invariance in object detection - SNIP. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (pp. 3578–3587).
Sun, C., Shrivastava, A., Singh, S., & Gupta, A. (2017). Revisiting unreasonable effectiveness of data in deep learning era. In Proceedings of the IEEE International Conference on Computer Vision (pp. 843–852).
Tan, M., Pang, R., & Le, Q. V. (2020). EfficientDet: Scalable and efficient object detection. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (pp. 10781–10790).
Vasconcelos, C., Birodkar, V., & Dumoulin, V. (2022). Proper reuse of image classification features improves object detection. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (pp. 13628–13637).
Vu, T., Jang, H., Pham, T. X., & Yoo, C. (2019). Cascade RPN: Delving into high-quality region proposal network with adaptive convolution. Advances in Neural Information Processing Systems, 32.
Wang, X., Cai, Z., Gao, D., & Vasconcelos, N. (2019). Towards universal object detection by domain attention. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (pp. 7289–7298).
Xu, H., Fang, L., Liang, X., Kang, W., & Li, Z. (2020). Universal-RCNN: Universal object detector via transferable graph R-CNN. In Proceedings of the AAAI Conference on Artificial Intelligence (Vol. 34, pp. 12492–12499).
Xu, H., Zhang, X., Li, H., Xie, L., Dai, W., Xiong, H., & Tian, Q. (2022). Seed the views: Hierarchical semantic alignment for contrastive representation learning. IEEE Transactions on Pattern Analysis and Machine Intelligence, 45(3), 3753–3767.
Yu, J., Jiang, Y., Wang, Z., Cao, Z., & Huang, T. (2016). UnitBox: An advanced object detection network. In Proceedings of the 24th ACM International Conference on Multimedia (pp. 516–520).
Yuan, L., Chen, D., Chen, Y.-L., Codella, N., Dai, X., Gao, J., et al. (2021). Florence: A new foundation model for computer vision. arXiv preprint arXiv:2111.11432.
Zhang, S., Benenson, R., & Schiele, B. (2017). CityPersons: A diverse dataset for pedestrian detection. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (pp. 3213–3221).
Zhao, X., Schulter, S., Sharma, G., Tsai, Y.-H., Chandraker, M., & Wu, Y. (2020). Object detection with a unified label space from multiple datasets. In European Conference on Computer Vision (pp. 178–193).
Zhou, B., Zhao, H., Puig, X., Fidler, S., Barriuso, A., & Torralba, A. (2017). Scene parsing through ADE20K dataset. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (pp. 633–641).
Zhou, X., Koltun, V., & Krähenbühl, P. (2022). Simple multi-dataset detection. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (pp. 7571–7580).
Metadata
Title
Universal Object Detection with Large Vision Model
Authors
Feng Lin
Wenze Hu
Yaowei Wang
Yonghong Tian
Guangming Lu
Fanglin Chen
Yong Xu
Xiaoyu Wang
Publication date
07.11.2023
Publisher
Springer US
Published in
International Journal of Computer Vision / Issue 4/2024
Print ISSN: 0920-5691
Electronic ISSN: 1573-1405
DOI
https://doi.org/10.1007/s11263-023-01929-0
