Published in: International Journal of Computer Vision, Issue 7/2020

13.03.2020

The Open Images Dataset V4

Unified Image Classification, Object Detection, and Visual Relationship Detection at Scale

Authors: Alina Kuznetsova, Hassan Rom, Neil Alldrin, Jasper Uijlings, Ivan Krasin, Jordi Pont-Tuset, Shahab Kamali, Stefan Popov, Matteo Malloci, Alexander Kolesnikov, Tom Duerig, Vittorio Ferrari


Abstract

We present Open Images V4, a dataset of 9.2M images with unified annotations for image classification, object detection, and visual relationship detection. The images have a Creative Commons Attribution license that allows sharing and adapting the material, and they were collected from Flickr without a predefined list of class names or tags, leading to natural class statistics and avoiding an initial design bias. Open Images V4 offers large scale across several dimensions: 30.1M image-level labels for 19.8k concepts, 15.4M bounding boxes for 600 object classes, and 375k visual relationship annotations involving 57 classes. For object detection in particular, we provide \(15\times\) more bounding boxes than the next largest dataset (15.4M boxes on 1.9M images). The images often show complex scenes with several objects (8 annotated objects per image on average). We annotated visual relationships between them, which support visual relationship detection, an emerging task that requires structured reasoning. We provide comprehensive statistics about the dataset, validate the quality of the annotations, study how the performance of several modern models evolves with increasing amounts of training data, and demonstrate two applications made possible by having unified annotations of multiple types coexisting in the same images. We hope that the scale, quality, and variety of Open Images V4 will foster further research and innovation even beyond the areas of image classification, object detection, and visual relationship detection.
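The annotations are distributed as plain CSV files on the dataset website. As a minimal sketch of how one might inspect the box annotations with pandas (assuming the file name `train-annotations-bbox.csv` and the column layout of the public release, with box coordinates normalized to \([0,1]\)):

```python
import pandas as pd

# Box annotations from the public release (file name and columns assumed;
# coordinates XMin/XMax/YMin/YMax are normalized to [0, 1]).
boxes = pd.read_csv("train-annotations-bbox.csv")

# Average number of annotated objects per image (the paper reports ~8).
print("mean boxes/image:", boxes.groupby("ImageID").size().mean())

# Number of distinct boxable classes (the paper reports 600).
print("classes:", boxes["LabelName"].nunique())
```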


Footnotes
1. Image hosting service (www.flickr.com)
 
2. In Flickr terms, images are served at different sizes (Thumbnail, Large, Medium, etc.). The Original size is a pristine copy of the image that was uploaded by the author.
 
4. Image IDs are generated from hashes of the image data, so sampling within a stratum is effectively pseudo-random and deterministic.
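As an illustration of this property (our sketch, not the authors' pipeline), hashing an ID yields a stable pseudo-random key, so thresholding on the hash selects the same subset on every run:

```python
import hashlib

def stable_sample(ids, fraction):
    """Deterministically keep roughly `fraction` of ids.

    The key depends only on the id itself, so the selection is
    pseudo-random yet reproducible, with no RNG state involved.
    """
    threshold = int(fraction * 2**32)
    return [i for i in ids
            if int.from_bytes(hashlib.sha256(i.encode()).digest()[:4], "big") < threshold]

# Illustrative ids in the 16-hex-character Open Images format.
print(stable_sample(["000026e7ee790996", "000062a39995e348"], 0.5))
```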
 
5. Note that while logit scores are in theory unbounded, we rarely observe values outside \([-8,8]\), so the number of strata is bounded in practice.
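To make concrete how clipping bounds the strata (a toy sketch, not the paper's exact binning), clip logits to \([-8,8]\) and bucket them at a fixed resolution:

```python
def stratum(logit, lo=-8.0, hi=8.0, n_strata=16):
    """Map a logit to a stratum index in [0, n_strata).

    Clipping to [lo, hi] caps the index range, so the number of
    strata is fixed even though logits are unbounded in theory.
    """
    clipped = max(lo, min(hi, logit))
    idx = int((clipped - lo) / (hi - lo) * n_strata)
    return min(idx, n_strata - 1)  # fold the top edge into the last bin

print(stratum(-12.3), stratum(0.4), stratum(42.0))  # 0 8 15
```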
 
6. These are truly unique objects: each object is annotated only with its leafmost label; e.g., a man has a single box and is not additionally annotated as a person.
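To make "leafmost" concrete (a toy sketch over a hypothetical two-class fragment of the hierarchy, not the released class tree):

```python
# Hypothetical fragment: Man is a child of Person in the class hierarchy.
parents = {"Man": "Person"}

def leafmost(labels):
    """Drop any label that is an ancestor of another label in the set."""
    ancestors = set()
    for label in labels:
        p = parents.get(label)
        while p is not None:
            ancestors.add(p)
            p = parents.get(p)
    return [l for l in labels if l not in ancestors]

print(leafmost(["Person", "Man"]))  # ['Man'] -- the object is boxed once
```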
 
7. We thank Ross Girshick for suggesting this type of visualization.
 
8. To find the triplets common to the two datasets, we matched class names by lexicographic comparison and aggregated VG annotations by relationship; since VG contains somewhat inconsistent relationship names, we used loose string matching to match them.
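As a rough illustration of such loose matching (our own sketch, not the authors' code), normalizing case, punctuation, and whitespace and then comparing with a similarity ratio merges near-duplicate relationship names:

```python
import re
from difflib import SequenceMatcher

def normalize(name):
    """Lowercase and strip everything but letters and digits."""
    return re.sub(r"[^a-z0-9]", "", name.lower())

def loose_match(a, b, threshold=0.9):
    """True if the normalized names are highly similar; handles
    spacing/case/punctuation variants, not morphological ones."""
    return SequenceMatcher(None, normalize(a), normalize(b)).ratio() >= threshold

print(loose_match("sits-on", "sits on"))  # True
print(loose_match("holds", "wears"))      # False
```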
 
Metadata
Title
The Open Images Dataset V4
Unified Image Classification, Object Detection, and Visual Relationship Detection at Scale
Authors
Alina Kuznetsova
Hassan Rom
Neil Alldrin
Jasper Uijlings
Ivan Krasin
Jordi Pont-Tuset
Shahab Kamali
Stefan Popov
Matteo Malloci
Alexander Kolesnikov
Tom Duerig
Vittorio Ferrari
Publication date
13.03.2020
Publisher
Springer US
Published in
International Journal of Computer Vision / Issue 7/2020
Print ISSN: 0920-5691
Electronic ISSN: 1573-1405
DOI
https://doi.org/10.1007/s11263-020-01316-z
