Cross-Domain Image Retrieval with Attention Modeling

Authors:
Xin Ji

National University of Singapore, Singapore, Singapore

National University of Singapore, Singapore, Singapore
View Profile

,
Wei Wang

National University of Singapore, Singapore, Singapore

National University of Singapore, Singapore, Singapore
View Profile

,
Meihui Zhang

Singapore University of Technology and Design, Singapore, Singapore

Singapore University of Technology and Design, Singapore, Singapore
View Profile

,
Yang Yang

University of Electronic Science and Technology of China, Chengdu, China

University of Electronic Science and Technology of China, Chengdu, China
View Profile

MM '17: Proceedings of the 25th ACM international conference on MultimediaOctober 2017Pages 1654–1662https://doi.org/10.1145/3123266.3123429

Published:23 October 2017Publication History

MM '17: Proceedings of the 25th ACM international conference on Multimedia

Pages 1654–1662

ABSTRACT

With the proliferation of e-commerce websites and the ubiquitousness of smart phones, cross-domain image retrieval using images taken by smart phones as queries to search products on e-commerce websites is emerging as a popular application. One challenge of this task is to locate the attention of both the query and database images. In particular, database images, e.g. of fashion products, on e-commerce websites are typically displayed with other accessories, and the images taken by users contain noisy background and large variations in orientation and lighting. Consequently, their attention is difficult to locate. In this paper, we exploit the rich tag information available on the e-commerce websites to locate the attention of database images. For query images, we use each candidate image in the database as the context to locate the query attention. Novel deep convolutional neural network architectures, namely TagYNet and CtxYNet, are proposed to learn the attention weights and then extract effective representations of the images. Experimental results on public datasets confirm that our approaches have significant improvement over the existing methods in terms of the retrieval accuracy and efficiency.

References

Artem Babenko and Victor Lempitsky. 2015. Aggregating local deep features for image retrieval 2015 IEEE International Conference on Computer Vision (ICCV). 1269--1277. Google ScholarDigital Library
Dzmitry Bahdanau, Kyunghyun Cho, and Yoshua Bengio. 2014. Neural machine translation by jointly learning to align and translate. arXiv preprint arXiv:1409.0473 (2014).Google Scholar
Aurélien Bellet, Amaury Habrard, and Marc Sebban. 2013. A survey on metric learning for feature vectors and structured data. arXiv preprint arXiv:1306.6709 (2013).Google Scholar
Jiewei Cao, Lingqiao Liu, Peng Wang, Zi Huang, Chunhua Shen, and Heng Tao Shen. 2016. Where to Focus: Query Adaptive Matching for Instance Retrieval Using Convolutional Feature Maps. arXiv preprint arXiv:1606.06811 (2016).Google Scholar
Ken Chatfield, Karen Simonyan, Andrea Vedaldi, and Andrew Zisserman. 2014. Return of the devil in the details: Delving deep into convolutional nets. arXiv preprint arXiv:1405.3531 (2014).Google Scholar
Sumit Chopra, Raia Hadsell, and Yann LeCun. 2005. Learning a similarity metric discriminatively, with application to face verification 2005 IEEE Computer Society Conference on Computer Vision and Pattern Recognition (CVPR), Vol. Vol. 1. 539--546. Google ScholarDigital Library
Ritendra Datta, Dhiraj Joshi, Jia Li, and James Z Wang. 2008. Image retrieval: Ideas, influences, and trends of the new age. ACM Computing Surveys (Csur) Vol. 40, 2 (2008), 5. Google ScholarDigital Library
Jia Deng, Wei Dong, Richard Socher, Li-Jia Li, Kai Li, and Li Fei-Fei. 2009. Imagenet: A large-scale hierarchical image database 2009 IEEE Conference on Computer Vision and Pattern Recognition (CVPR). 248--255.Google Scholar
Akira Fukui, Dong Huk Park, Daylen Yang, Anna Rohrbach, Trevor Darrell, and Marcus Rohrbach. 2016. Multimodal compact bilinear pooling for visual question answering and visual grounding. arXiv preprint arXiv:1606.01847 (2016).Google Scholar
Kaiming He, Xiangyu Zhang, Shaoqing Ren, and Jian Sun. 2016. Deep residual learning for image recognition. In 2016 IEEE Conference on Computer Vision and Pattern Recognition (CVPR). 770--778.Google ScholarCross Ref
Junshi Huang, Rogerio S Feris, Qiang Chen, and Shuicheng Yan. 2015. Cross-domain image retrieval with a dual attribute-aware ranking network 2015 IEEE International Conference on Computer Vision (ICCV). 1062--1070. Google ScholarDigital Library
Hervé Jégou, Matthijs Douze, Cordelia Schmid, and Patrick Pérez. 2010. Aggregating local descriptors into a compact image representation 2010 IEEE Computer Society Conference on Computer Vision and Pattern Recognition (CVPR). 3304--3311.Google Scholar
Alex Krizhevsky, Ilya Sutskever, and Geoffrey E Hinton. 2012. Imagenet classification with deep convolutional neural networks Advances in neural information processing systems. 1097--1105. Google ScholarDigital Library
Min Lin, Qiang Chen, and Shuicheng Yan. 2013. Network in network. arXiv preprint arXiv:1312.4400 (2013).Google Scholar
Hongye Liu, Yonghong Tian, Yaowei Yang, Lu Pang, and Tiejun Huang. 2016 b. Deep relative distance learning: Tell the difference between similar vehicles 2016 IEEE Conference on Computer Vision and Pattern Recognition (CVPR). 2167--2175.Google Scholar
Ziwei Liu, Ping Luo, Shi Qiu, Xiaogang Wang, and Xiaoou Tang. 2016 a. Deepfashion: Powering robust clothes recognition and retrieval with rich annotations 2016 IEEE Conference on Computer Vision and Pattern Recognition (CVPR). 1096--1104.Google Scholar
David G Lowe. 1999. Object recognition from local scale-invariant features Proceedings of the Seventh IEEE International Conference on Computer Vision (ICCV), Vol. Vol. 2. 1150--1157. Google ScholarDigital Library
Hyun Oh Song, Yu Xiang, Stefanie Jegelka, and Silvio Savarese. 2016. Deep metric learning via lifted structured feature embedding 2016 IEEE Conference on Computer Vision and Pattern Recognition (CVPR). 4004--4012.Google Scholar
Beng Chin Ooi, Kian-Lee Tan, Sheng Wang, Wei Wang, Qingchao Cai, Gang Chen, Jinyang Gao, Zhaojing Luo, Anthony KH Tung, Yuan Wang, et almbox.. 2015. SINGA: A distributed deep learning platform. In Proceedings of the 23rd ACM international conference on Multimedia. ACM, 685--688. Google ScholarDigital Library
Florian Schroff, Dmitry Kalenichenko, and James Philbin. 2015. Facenet: A unified embedding for face recognition and clustering 2015 IEEE Conference on Computer Vision and Pattern Recognition (CVPR). 815--823.Google Scholar
Karen Simonyan, Andrea Vedaldi, and Andrew Zisserman. 2013. Deep inside convolutional networks: Visualising image classification models and saliency maps. arXiv preprint arXiv:1312.6034 (2013).Google Scholar
Karen Simonyan and Andrew Zisserman. 2014. Very deep convolutional networks for large-scale image recognition. arXiv preprint arXiv:1409.1556 (2014).Google Scholar
Sainbayar Sukhbaatar, Jason Weston, Rob Fergus, et almbox.. 2015. End-to-end memory networks. In Advances in neural information processing systems. 2440--2448. Google ScholarDigital Library
Jinhui Tang, Xiangbo Shu, Zechao Li, Guo-Jun Qi, and Jingdong Wang. 2016. Generalized deep transfer networks for knowledge propagation in heterogeneous domains. ACM Transactions on Multimedia Computing, Communications, and Applications (TOMM), Vol. 12, 4s (2016), 68. Google ScholarDigital Library
Daksh Varshneya and G Srinivasaraghavan. 2017. Human Trajectory Prediction using Spatially aware Deep Attention Models. arXiv preprint arXiv:1705.09436 (2017).Google Scholar
Ji Wan, Dayong Wang, Steven Chu Hong Hoi, Pengcheng Wu, Jianke Zhu, Yongdong Zhang, and Jintao Li. 2014. Deep learning for content-based image retrieval: A comprehensive study Proceedings of the 22nd ACM international conference on Multimedia. ACM, 157--166. Google ScholarDigital Library
Wei Wang, Gang Chen, Haibo Chen, Tien Tuan Anh Dinh, Jinyang Gao, Beng Chin Ooi, Kian-Lee Tan, Sheng Wang, and Meihui Zhang. 2016 a. Deep learning at scale and at ease. ACM Transactions on Multimedia Computing, Communications, and Applications (TOMM), Vol. 12, 4s (2016), 69. Google ScholarDigital Library
Wei Wang, Beng Chin Ooi, Xiaoyan Yang, Dongxiang Zhang, and Yueting Zhuang. 2014. Effective multi-modal retrieval based on stacked auto-encoders. Proceedings of the VLDB Endowment Vol. 7, 8 (2014), 649--660. Google ScholarDigital Library
Wei Wang, Xiaoyan Yang, Beng Chin Ooi, Dongxiang Zhang, and Yueting Zhuang. 2016 c. Effective deep learning-based multi-modal retrieval. The VLDB Journal, Vol. 25, 1 (2016), 79--101. Google ScholarDigital Library
Wei Wang, Meihui Zhang, Gang Chen, HV Jagadish, Beng Chin Ooi, and Kian-Lee Tan. 2016 d. Database Meets Deep Learning: Challenges and Opportunities. ACM SIGMOD Record, Vol. 45, 2 (2016), 17--22. Google ScholarDigital Library
Xi Wang, Zhenfeng Sun, Wenqiang Zhang, Yu Zhou, and Yu-Gang Jiang. 2016 b. Matching user photos to online products with robust deep features Proceedings of the 2016 ACM on International Conference on Multimedia Retrieval. ACM, 7--14. Google ScholarDigital Library
Kelvin Xu, Jimmy Ba, Ryan Kiros, Kyunghyun Cho, Aaron Courville, Ruslan Salakhudinov, Rich Zemel, and Yoshua Bengio. 2015. Show, attend and tell: Neural image caption generation with visual attention International Conference on Machine Learning. 2048--2057. Google ScholarDigital Library
Yang Yang, Yadan Luo, Weilun Chen, Fumin Shen, Jie Shao, and Heng Tao Shen. 2016. Zero-shot hashing via transferring supervised knowledge Proceedings of the 2016 ACM on Multimedia Conference. ACM, 1286--1295. Google ScholarDigital Library
Yang Yang, Zheng-Jun Zha, Yue Gao, Xiaofeng Zhu, and Tat-Seng Chua. 2014. Exploiting web images for semantic video indexing via robust sample-specific loss. IEEE Transactions on Multimedia Vol. 16, 6 (2014), 1677--1689.Google ScholarCross Ref
Jason Yosinski, Jeff Clune, Yoshua Bengio, and Hod Lipson. 2014. How transferable are features in deep neural networks? Advances in neural information processing systems. 3320--3328. Google ScholarDigital Library
Yuhui Yuan, Kuiyuan Yang, and Chao Zhang. 2016. Hard-Aware Deeply Cascaded Embedding. arXiv preprint arXiv:1611.05720 (2016).Google Scholar

Index Terms

Cross-Domain Image Retrieval with Attention Modeling
1. Computing methodologies
  1. Artificial intelligence
    1. Computer vision
      1. Computer vision representations
        Image representations
2. Information systems
  1. Information retrieval
    1. Specialized information retrieval
      1. Multimedia and multimodal retrieval
        Image search

Recommendations

Regional Maximum Activations of Convolutions with Attention for Cross-domain Beauty and Personal Care Product Retrieval
MM '18: Proceedings of the 26th ACM international conference on Multimedia

Cross-domain beauty and personal care product image retrieval is a challenging problem due to data variations (e.g., brightness, viewpoint, and scale), and the rich types of items. In this paper, we present a regional maximum activations of convolutions ...
Read More
Image retrieval using visual attention
Read More
Dual-domain strip attention for image restoration
Abstract
Image restoration aims to reconstruct a latent high-quality image from a degraded observation. Recently, the usage of Transformer has significantly advanced the state-of-the-art performance of various image restoration tasks due to its powerful ...
Read More

Comments

Login options

Check if you have access through your login credentials or your institution to get full access on this article.

Full Access

Get this Publication

Published in
MM '17: Proceedings of the 25th ACM international conference on Multimedia
October 2017
2028 pages
ISBN:9781450349062
DOI:10.1145/3123266
General Chairs:
Qiong Liu
FXPAL, USA
,
Rainer Lienhart
Universität Augsburg, Germany
,
Haohong Wang
TCL America, USA
,
Program Chairs:
Sheng-Wei "Kuan-Ta" Chen
Academia Sinica, Taiwan
,
Susanne Boll
University of Oldenburg, Germany
,
Phoebe Chen
La Trobe University, Australia
,
Gerald Friedland
Lawrence Livermore National Lab, USA
,
Jia Li
Google, USA
,
Shuicheng Yan
Qihoo 360, China
Copyright © 2017 ACM
Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. Copyrights for components of this work owned by others than ACM must be honored. Abstracting with credit is permitted. To copy otherwise, or republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee. Request permissions from [email protected]
Sponsors
In-Cooperation
Publisher
Association for Computing Machinery
New York, NY, United States
Publication History
- Published: 23 October 2017
Permissions
Request permissions about this article.
Request Permissions

Check for updates
Author Tags
attention modeling
cross-domain image retrieval
deep learning
fashion product
Qualifiers
- research-article
Conference

Acceptance Rates
MM '17 Paper Acceptance Rate189of684submissions,28%Overall Acceptance Rate995of4,171submissions,24%
More
Upcoming Conference
MM '24

Sponsor:

sigmm

MM '24: The 32nd ACM International Conference on Multimedia

October 28 - November 1, 2024

Melbourne , VIC , Australia
Funding Sources
Other Metrics
View Article Metrics

Article Metrics
- 55
  Total Citations
  View Citations
- 1,018
  Total Downloads
- Downloads (Last 12 months)126
- Downloads (Last 6 weeks)17
Other Metrics
View Author Metrics
Cited By
View all

PDF Format

View or Download as a PDF file.

PDF

eReader

View online with eReader.

eReader

Cross-Domain Image Retrieval with Attention Modeling

MM '17: Proceedings of the 25th ACM international conference on Multimedia

ABSTRACT

References

Cited By

Index Terms

Recommendations

Regional Maximum Activations of Convolutions with Attention for Cross-domain Beauty and Personal Care Product Retrieval

Image retrieval using visual attention

Dual-domain strip attention for image restoration