ABSTRACT
In this paper, we propose two techniques for near-duplicate image detection at high confidence and large scale. First, we show that entropy-based filtering eliminates the ambiguous SIFT features that cause most false positives, and makes it possible to claim near-duplicity from a single match between the retained high-quality features. Second, we show that graph cut can be used for query expansion, with a duplicity graph computed offline, to substantially improve search quality. Evaluation with web images shows that, when combined with sketch embedding [6], our methods achieve a false positive rate orders of magnitude lower than that of the standard visual word approach. We demonstrate the proposed techniques with a large-scale image search engine which, using index data structures computed offline on a Hadoop cluster, is capable of serving more than 50 million web images with a single commodity server.
- R. Andersen, F. Chung, and K. Lang. Local graph partitioning using PageRank vectors. In FOCS, 2006.
- O. Boiman, E. Shechtman, and M. Irani. In defense of nearest-neighbor based image classification. In CVPR, 2008.
- O. Chum, J. Philbin, M. Isard, and A. Zisserman. Scalable near identical image and shot detection. In CIVR, 2007.
- O. Chum, J. Philbin, J. Sivic, M. Isard, and A. Zisserman. Total recall: Automatic query expansion with a generative feature model for object retrieval. In ICCV, 2007.
- W. Dong, M. Charikar, and K. Li. Efficiently matching sets of features with random histograms. In MM'08: Proceedings of the 16th ACM International Conference on Multimedia, Vancouver, Canada, 2008.
- W. Dong, M. Charikar, and K. Li. High dimensional similarity search with sketches. In SIGIR, 2008.
- M. Douze, H. Jégou, H. Sandhawalia, L. Amsaleg, and C. Schmid. Evaluation of GIST descriptors for web-scale image search. In Proceeding of the ACM International Conference on Image and Video Retrieval, CIVR '09, pages 19:1--19:8. ACM, 2009.
- P. Indyk and R. Motwani. Approximate nearest neighbors: towards removing the curse of dimensionality. In STOC, 1998.
- H. Jégou, M. Douze, C. Schmid, and P. Pérez. Aggregating local descriptors into a compact image representation. In IEEE Conference on Computer Vision & Pattern Recognition, pages 3304--3311, June 2010.
- Y. Ke, R. Sukthankar, and L. Huston. An efficient parts-based near-duplicate and sub-image retrieval system. In ACM MM, 2004.
- D. G. Lowe. Distinctive image features from scale-invariant keypoints. Int. J. Comput. Vision, 60(2):91--110, 2004.
- Q. Lv, W. Josephson, Z. Wang, M. Charikar, and K. Li. Multi-probe LSH: efficient indexing for high-dimensional similarity search. In VLDB, 2007.
- G. S. Manku, A. Jain, and A. D. Sarma. Detecting near-duplicates for web crawling. In WWW, 2007.
- D. Nister and H. Stewenius. Scalable recognition with a vocabulary tree. In CVPR, 2006.
- J. Philbin, O. Chum, M. Isard, J. Sivic, and A. Zisserman. Object retrieval with large vocabularies and fast spatial matching. In CVPR, 2007.
- J. Philbin, O. Chum, M. Isard, J. Sivic, and A. Zisserman. Lost in quantization: Improving particular object retrieval in large scale image databases. In CVPR, 2008.
- D. A. Spielman and S.-H. Teng. Nearly-linear time algorithms for graph partitioning, graph sparsification, and solving linear systems. In STOC, 2004.
- P. Turcot and D. Lowe. Better matching with fewer features: The selection of useful features in large database recognition problems. In ICCV Workshop on Emergent Issues in Large Amounts of Visual Data, 2009.
- A. Vedaldi and B. Fulkerson. VLFeat -- an open and portable library of computer vision algorithms. In Proceedings of the 18th annual ACM international conference on Multimedia, 2010.
- Y. Weiss, A. Torralba, and R. Fergus. Spectral hashing. In NIPS, 2009.
- Z. Wu, Q. Ke, M. Isard, and J. Sun. Bundling features for large scale partial-duplicate web image search. In CVPR, 2009.
- D. Xu, T.-J. Cham, S. Yan, and S.-F. Chang. Near duplicate image identification with spatially aligned pyramid matching. In CVPR, 2008.
- S. Zhang, Q. Tian, G. Hua, Q. Huang, and S. Li. Descriptive visual words and visual phrases for image applications. In ACM MM, 2009.
Index Terms
- High-confidence near-duplicate image detection