ABSTRACT
With the proliferation of e-commerce websites and the ubiquity of smartphones, cross-domain image retrieval, in which images taken with smartphones are used as queries to search for products on e-commerce websites, is emerging as a popular application. One challenge of this task is to locate the attention of both the query and database images. In particular, database images on e-commerce websites, e.g. of fashion products, are typically displayed with other accessories, and the images taken by users contain noisy backgrounds and large variations in orientation and lighting. Consequently, their attention is difficult to locate. In this paper, we exploit the rich tag information available on e-commerce websites to locate the attention of database images. For query images, we use each candidate image in the database as the context to locate the query attention. Novel deep convolutional neural network architectures, namely TagYNet and CtxYNet, are proposed to learn the attention weights and then extract effective representations of the images. Experimental results on public datasets confirm that our approaches significantly improve on existing methods in terms of retrieval accuracy and efficiency.
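As a rough illustration of what "using a context to locate attention" can mean, the sketch below applies generic soft attention to a convolutional feature map. It is not the TagYNet or CtxYNet architecture, whose details are not given in the abstract. The function name `attend`, the dot-product scoring, and the tensor shapes are all assumptions made for the example.

```python
import numpy as np

def softmax(x):
    # Subtract the maximum for numerical stability before exponentiating.
    x = x - np.max(x)
    e = np.exp(x)
    return e / e.sum()

def attend(feature_map, context):
    """Hypothetical helper: context-driven soft attention pooling.

    feature_map: array of shape (H, W, C), e.g. a query image's conv features.
    context:     array of shape (C,), e.g. a tag embedding or a candidate
                 image's feature vector (an assumption for this sketch).
    Returns the pooled (C,) representation and the (H, W) attention weights.
    """
    h, w, c = feature_map.shape
    locations = feature_map.reshape(h * w, c)  # one feature vector per location
    scores = locations @ context               # dot-product relevance (assumed scoring)
    weights = softmax(scores)                  # non-negative weights that sum to 1
    pooled = weights @ locations               # attention-weighted average
    return pooled, weights.reshape(h, w)

# Example with random data standing in for real features.
rng = np.random.default_rng(0)
query_map = rng.normal(size=(7, 7, 16))
candidate_vec = rng.normal(size=16)
pooled, weights = attend(query_map, candidate_vec)
```

With an all-zero context every location receives the same weight, so the pooling reduces to a plain spatial average. A non-zero context shifts weight towards the locations most similar to it.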
Index Terms
- Cross-Domain Image Retrieval with Attention Modeling