Published in: International Journal of Computer Vision 2/2021

30.09.2020

Fine-Grained Instance-Level Sketch-Based Image Retrieval

Authors: Qian Yu, Jifei Song, Yi-Zhe Song, Tao Xiang, Timothy M. Hospedales

Abstract

The problem of fine-grained sketch-based image retrieval (FG-SBIR) is defined and investigated in this paper. In FG-SBIR, free-hand human sketches are used as queries to retrieve photos containing the same object instances. It is thus a cross-domain (sketch-to-photo) instance-level retrieval task. The problem is extremely challenging because (i) visual comparison and matching must be performed across a large domain gap, i.e., from black-and-white line-drawing sketches to colour photos; (ii) the fine-grained (dis)similarities between sketches and photos must be captured even though free-hand sketches drawn by different people exhibit different degrees of deformation and expressive interpretation; and (iii) annotated cross-domain fine-grained SBIR datasets are scarce, which challenges many state-of-the-art machine learning techniques, particularly those based on deep learning. In this paper, for the first time, we address all of these challenges, providing a step towards the capabilities that would underpin a commercial sketch-based object instance retrieval application. Specifically, a new large-scale FG-SBIR database is introduced, carefully designed to reflect real-world application scenarios. A deep cross-domain matching model is then formulated to cope with the intrinsic drawing style variability and the large domain gap, and to capture instance-level discriminative features. It is distinguished by a carefully designed attention module. Extensive experiments on the new dataset demonstrate the effectiveness of the proposed model and validate the need for a rigorous definition of the FG-SBIR problem and for collecting suitable datasets.
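To make the retrieval setup concrete, the following is a minimal sketch of how a cross-domain instance-level matching model of this kind is typically trained. It is an illustration under stated assumptions, not the authors' architecture: the toy backbone, the single-channel input (assuming photos are pre-converted to edge maps, a common choice in SBIR pipelines), the 1x1-convolution attention, and the names AttentionBranch and triplet_step are all hypothetical.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class AttentionBranch(nn.Module):
    """CNN branch with soft spatial attention; one shared branch embeds both domains (assumption)."""
    def __init__(self, feat_dim=256):
        super().__init__()
        # Toy backbone standing in for a real CNN; inputs are assumed single-channel
        # (sketches, and photos pre-converted to edge maps).
        self.backbone = nn.Sequential(
            nn.Conv2d(1, 64, 3, stride=2, padding=1), nn.ReLU(),
            nn.Conv2d(64, feat_dim, 3, stride=2, padding=1), nn.ReLU(),
        )
        self.attn = nn.Conv2d(feat_dim, 1, kernel_size=1)  # 1x1 conv -> per-location attention logits

    def forward(self, x):
        fmap = self.backbone(x)                                  # B x C x H x W
        weights = torch.softmax(self.attn(fmap).flatten(2), -1)  # B x 1 x (H*W), sums to 1
        pooled = (fmap.flatten(2) * weights).sum(-1)             # attention-weighted pooling -> B x C
        return F.normalize(pooled, dim=-1)                       # L2-normalised embedding

def triplet_step(branch, sketch, pos_photo, neg_photo, margin=0.3):
    """Training objective: the true photo must sit closer to the sketch than a distractor photo."""
    s, p, n = branch(sketch), branch(pos_photo), branch(neg_photo)
    return F.triplet_margin_loss(s, p, n, margin=margin)
```

At retrieval time the same branch embeds the query sketch and every gallery photo, and photos are ranked by their distance to the sketch embedding.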


Footnotes
1
Free-hand sketch in this work refers to a sketch drawn by an amateur from mental recollection. Specifically, we assume that before drawing, the person has seen a reference object instance but does not have the object or a photo of it at hand while drawing.
 
2
Here ‘CFF’ refers to the operation of combining the feature map extracted from an earlier layer with the final-layer output. This differs from the meaning in the preliminary version (Song et al. 2017), where it denoted both feature fusion and the residual attention module.
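As a rough illustration of this coarse-fine fusion idea (the pooling and concatenation choices below are assumptions, not taken from the paper), an earlier-layer feature map can be globally pooled and joined with the final-layer feature vector:

```python
import torch
import torch.nn.functional as F

def coarse_fine_fusion(early_fmap: torch.Tensor, final_feat: torch.Tensor) -> torch.Tensor:
    """early_fmap: B x C1 x H x W from an intermediate layer; final_feat: B x C2 final-layer vector."""
    pooled = F.adaptive_avg_pool2d(early_fmap, 1).flatten(1)  # global average pooling -> B x C1
    # Concatenate the (normalised) coarse and fine descriptors into one representation.
    return torch.cat([F.normalize(pooled, dim=1), F.normalize(final_feat, dim=1)], dim=1)
```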
 
References
Bui, T., Ribeiro, L., Ponti, M., & Collomosse, J. (2016). Generalisation and sharing in triplet convnets for sketch based visual search. arXiv preprint arXiv:1611.05301.
Bui, T., Ribeiro, L., Ponti, M., & Collomosse, J. (2018). Sketching out the details: sketch-based image retrieval using convolutional neural networks with multi-stage regression. Computers & Graphics, 71, 77–87.
Cao, Y., Wang, H., Wang, C., Li, Z., Zhang, L., & Zhang, L. (2010). Mindfinder: interactive sketch-based image search on millions of images. In International conference on multimedia.
Cao, Y., Wang, C., Zhang, L., & Zhang, L. (2011). Edgel index for large-scale sketch-based image search. In CVPR.
Chen, T., Cheng, M. M., Tan, P., Shamir, A., & Hu, S. M. (2009). Sketch2photo: internet image montage. ACM Transactions on Graphics (TOG), 28, 1–10.
Chopra, S., Hadsell, R., & LeCun, Y. (2005). Learning a similarity metric discriminatively, with application to face verification. In IEEE computer society conference on computer vision and pattern recognition.
Collomosse, J., Bui, T., Wilber, M. J., Fang, C., & Jin, H. (2017). Sketching with style: visual search with sketches and aesthetic context. In Proceedings of the IEEE international conference on computer vision.
Deng, J., Dong, W., Socher, R., Li, L. J., Li, K., & Fei-Fei, L. (2009). ImageNet: a large-scale hierarchical image database. In CVPR.
Eitz, M., Hildebrand, K., Boubekeur, T., & Alexa, M. (2010). An evaluation of descriptors for large-scale image retrieval from sketched feature lines. Computers & Graphics, 34(5), 482–498.
Eitz, M., Hildebrand, K., Boubekeur, T., & Alexa, M. (2011). Sketch-based image retrieval: benchmark and bag-of-features descriptors. IEEE Transactions on Visualization and Computer Graphics, 17(11), 1624–1636.
Eitz, M., Hays, J., & Alexa, M. (2012). How do humans sketch objects? ACM Transactions on Graphics (TOG), 31, 1–10.
Everingham, M., Van Gool, L., Williams, C. K., Winn, J., & Zisserman, A. (2010). The PASCAL visual object classes (VOC) challenge. International Journal of Computer Vision, 88(2), 303–338.
Fukui, A., Park, D. H., Yang, D., Rohrbach, A., Darrell, T., & Rohrbach, M. (2016). Multimodal compact bilinear pooling for visual question answering and visual grounding. In EMNLP.
Gatys, L. A., Ecker, A. S., & Bethge, M. (2015). Texture synthesis and the controlled generation of natural stimuli using convolutional neural networks. arXiv preprint arXiv:1505.07376.
Gong, Y., Wang, L., Guo, R., & Lazebnik, S. (2014). Multi-scale orderless pooling of deep convolutional activation features. In European conference on computer vision.
Gordo, A., Almazan, J., Revaud, J., & Larlus, D. (2017). End-to-end learning of deep visual representations for image retrieval. International Journal of Computer Vision, 124(2), 237–254.
Gygli, M., Grabner, H., Riemenschneider, H., Nater, F., & Van Gool, L. (2013). The interestingness of images. In IEEE international conference on computer vision.
He, K., Zhang, X., Ren, S., & Sun, J. (2016). Deep residual learning for image recognition. In CVPR.
Hu, R., & Collomosse, J. (2013). A performance evaluation of gradient field HOG descriptor for sketch based image retrieval. Computer Vision and Image Understanding, 117(7), 790–806.
Hu, R., Barnard, M., & Collomosse, J. (2010). Gradient field descriptor for sketch based retrieval and localization. In IEEE international conference on image processing.
Hu, R., Wang, T., & Collomosse, J. (2011). A bag-of-regions approach to sketch based image retrieval. In IEEE international conference on image processing.
Jaderberg, M., Simonyan, K., Zisserman, A., et al. (2015). Spatial transformer networks. In Advances in neural information processing systems.
James, S., Fonseca, M., & Collomosse, J. (2014). ReEnact: sketch based choreographic design from archival dance footage. In Proceedings of international conference on multimedia retrieval.
Jiang, Y. G., Wang, Y., Feng, R., Xue, X., Zheng, Y., & Yang, H. (2013). Understanding and predicting interestingness of videos. In AAAI.
Johnson, J., Krishna, R., Stark, M., Li, L. J., Shamma, D., Bernstein, M., & Fei-Fei, L. (2015). Image retrieval using scene graphs. In CVPR.
Krizhevsky, A., & Hinton, G. E. (2011). Using very deep autoencoders for content-based image retrieval. In European symposium on artificial neural networks.
Krizhevsky, A., Sutskever, I., & Hinton, G. E. (2012). ImageNet classification with deep convolutional neural networks. In Advances in neural information processing systems.
Landay, J. A., & Myers, B. A. (2001). Sketching interfaces: toward more human interface design. IEEE Computer, 34(3), 56–64.
Li, Y., Hospedales, T., Song, Y. Z., & Gong, S. (2014). Fine-grained sketch-based image retrieval by matching deformable part models. In BMVC.
Li, Y., Hospedales, T. M., Song, Y. Z., & Gong, S. (2015). Free-hand sketch recognition by multi-kernel feature learning. Computer Vision and Image Understanding, 137, 1–11.
Li, K., Pang, K., Song, Y. Z., Hospedales, T. M., Xiang, T., & Zhang, H. (2017). Synergistic instance-level subspace alignment for fine-grained sketch-based image retrieval. IEEE Transactions on Image Processing, 26(12), 5908–5921.
Lin, Y., Huang, C., Wan, C., & Hsu, W. (2013). 3D sub-query expansion for improving sketch-based multi-view image retrieval. In Proceedings of the IEEE international conference on computer vision.
Lin, T. Y., RoyChowdhury, A., & Maji, S. (2015). Bilinear CNN models for fine-grained visual recognition. In Proceedings of the IEEE international conference on computer vision (pp. 1449–1457).
Liu, L., Shen, F., Shen, Y., Liu, X., & Shao, L. (2017a). Deep sketch hashing: fast free-hand sketch-based image retrieval. arXiv preprint arXiv:1703.05605.
Liu, Y., Guo, Y., & Lew, M. S. (2017b). On the exploration of convolutional fusion networks for visual recognition. In International conference on multimedia modeling.
Lu, J., Xiong, C., Parikh, D., & Socher, R. (2016). Knowing when to look: adaptive attention via a visual sentinel for image captioning. arXiv preprint arXiv:1612.01887.
Mahendran, A., & Vedaldi, A. (2015). Understanding deep image representations by inverting them. In IEEE conference on computer vision and pattern recognition.
Marr, D. (1982). Vision. New York: W. H. Freeman and Company.
Mnih, V., Heess, N., Graves, A., et al. (2014). Recurrent models of visual attention. In Advances in neural information processing systems.
Moulin, C., Largeron, C., Ducottet, C., Géry, M., & Barat, C. (2014). Fisher linear discriminant analysis for text-image combination in multimedia information retrieval. Pattern Recognition, 47(1), 260–269.
Nam, H., Ha, J. W., & Kim, J. (2016). Dual attention networks for multimodal reasoning and matching. arXiv preprint arXiv:1611.00471.
Newell, A., Yang, K., & Deng, J. (2016). Stacked hourglass networks for human pose estimation. In European conference on computer vision.
Noh, H., Araujo, A., Sim, J., Weyand, T., & Han, B. (2017). Large-scale image retrieval with attentive deep local features. In IEEE international conference on computer vision.
Philbin, J., Chum, O., Isard, M., Sivic, J., & Zisserman, A. (2007). Object retrieval with large vocabularies and fast spatial matching. In IEEE conference on computer vision and pattern recognition.
Philbin, J., Chum, O., Isard, M., Sivic, J., & Zisserman, A. (2008). Lost in quantization: improving particular object retrieval in large scale image databases. In IEEE conference on computer vision and pattern recognition.
Prosser, B. J., Zheng, W. S., Gong, S., Xiang, T., & Mary, Q. (2010). Person re-identification by support vector ranking. In British machine vision conference.
Radenovic, F., Tolias, G., & Chum, O. (2018). Deep shape matching. In Proceedings of the European conference on computer vision.
Radenović, F., Tolias, G., & Chum, O. (2018). Fine-tuning CNN image retrieval with no human annotation. IEEE Transactions on Pattern Analysis and Machine Intelligence, 41(7), 1655–1668.
Ren, X. (2008). Multi-scale improves boundary detection in natural images. In Proceedings of the European conference on computer vision.
Sangkloy, P., Burnell, N., Ham, C., & Hays, J. (2016). The sketchy database: learning to retrieve badly drawn bunnies. ACM Transactions on Graphics (TOG), 35, 1–12.
Song, J., Yu, Q., Song, Y. Z., Xiang, T., & Hospedales, T. M. (2017). Deep spatial-semantic attention for fine-grained sketch-based image retrieval. In Proceedings of the IEEE international conference on computer vision.
Szegedy, C., Liu, W., Jia, Y., Sermanet, P., Reed, S., Anguelov, D., et al. (2015). Going deeper with convolutions. In IEEE conference on computer vision and pattern recognition. arXiv:1409.4842.
Szegedy, C., Vanhoucke, V., Ioffe, S., Shlens, J., & Wojna, Z. (2016a). Rethinking the inception architecture for computer vision. In IEEE conference on computer vision and pattern recognition.
Szegedy, C., Vanhoucke, V., Ioffe, S., Shlens, J., & Wojna, Z. (2016b). Rethinking the inception architecture for computer vision. In IEEE conference on computer vision and pattern recognition.
Vaswani, A., Shazeer, N., Parmar, N., Uszkoreit, J., Jones, L., Gomez, A. N., Kaiser, Ł., & Polosukhin, I. (2017). Attention is all you need. In NeurIPS.
Wang, X., & Tang, X. (2009). Face photo-sketch synthesis and recognition. IEEE Transactions on Pattern Analysis and Machine Intelligence, 31(11), 1955–1967.
Wang, C., Li, Z., & Zhang, L. (2010). Mindfinder: image search by interactive sketching and tagging. In Proceedings of the 19th international conference on world wide web.
Wang, F., Kang, L., & Li, Y. (2015). Sketch-based 3D shape retrieval using convolutional neural networks. In IEEE conference on computer vision and pattern recognition.
Xiao, T., Xu, Y., Yang, K., Zhang, J., Peng, Y., & Zhang, Z. (2015). The application of two-level attention models in deep convolutional neural network for fine-grained image classification. In IEEE conference on computer vision and pattern recognition.
Xie, S., & Tu, Z. (2015). Holistically-nested edge detection. In IEEE international conference on computer vision.
Xu, K., Ba, J., Kiros, R., Cho, K., Courville, A. C., Salakhutdinov, R., Zemel, R. S., & Bengio, Y. (2015). Show, attend and tell: neural image caption generation with visual attention. In International conference on machine learning.
Yang, S., & Ramanan, D. (2015). Multi-scale recognition with DAG-CNNs. In IEEE international conference on computer vision.
Yu, A., & Grauman, K. (2014). Fine-grained visual comparisons with local learning. In IEEE conference on computer vision and pattern recognition.
Yu, Q., Yang, Y., Song, Y., Xiang, T., & Hospedales, T. (2015). Sketch-a-net that beats humans. In BMVC.
Yu, Q., Liu, F., Song, Y. Z., Xiang, T., Hospedales, T. M., & Loy, C. C. (2016). Sketch me that shoe. In IEEE conference on computer vision and pattern recognition.
Yu, Q., Yang, Y., Liu, F., Song, Y. Z., Xiang, T., & Hospedales, T. M. (2017). Sketch-a-net: a deep neural network that beats humans. International Journal of Computer Vision, 122(3), 411–425.
Zhang, J., Shen, F., Liu, L., Zhu, F., Yu, M., Shao, L., Tao Shen, H., & Van Gool, L. (2018). Generative domain-migration hashing for sketch-to-image retrieval. In Proceedings of the European conference on computer vision (ECCV).
Zhu, J. Y., Lee, Y. J., & Efros, A. A. (2014). AverageExplorer: interactive exploration and alignment of visual data collections. ACM Transactions on Graphics (TOG), 33, 1–11.
Zitnick, C. L., & Dollár, P. (2014). Edge boxes: locating object proposals from edges. In Proceedings of the European conference on computer vision.
Metadata
Title
Fine-Grained Instance-Level Sketch-Based Image Retrieval
Authors
Qian Yu
Jifei Song
Yi-Zhe Song
Tao Xiang
Timothy M. Hospedales
Publication date
30.09.2020
Publisher
Springer US
Published in
International Journal of Computer Vision / Issue 2/2021
Print ISSN: 0920-5691
Electronic ISSN: 1573-1405
DOI
https://doi.org/10.1007/s11263-020-01382-3
