Neurocomputing

Volume 171, 1 January 2016, Pages 283-292

Linear unsupervised hashing for ANN search in Euclidean space

https://doi.org/10.1016/j.neucom.2015.06.036

Abstract

Approximate nearest neighbor (ANN) search for large-scale data has attracted considerable attention because large amounts of data are now easily available. Recently, hashing has been widely adopted for similarity search owing to its low storage cost and fast query speed. When semantic similarity information is available, supervised hashing methods show better performance than unsupervised ones. However, supervised hashing methods need explicit similarity information, which is not available in some scenarios. In addition, they suffer from difficult optimization and time-consuming training, which makes them impractical for large-scale data. In this paper, we propose an unsupervised hashing method, Unsupervised Euclidean Hashing (USEH), which learns and generates hash codes that preserve the Euclidean distance relationships among the data. Specifically, USEH first utilizes Locality-Sensitive Hashing (LSH) to generate pseudo labels; it then adopts a sequential learning strategy to learn the hash functions, one bit at a time, which yields very discriminative codes. Moreover, USEH avoids explicitly computing the similarity matrix by decomposing it into the product of a label matrix and its transpose, which makes the training complexity of USEH linear in the number of training samples whenever that number is much greater than the feature dimension. Thus, it can work efficiently on large-scale data. We test USEH on two large-scale datasets, SIFT1M and GIST1M. Experimental results show that USEH is comparable to state-of-the-art unsupervised hashing methods.

Introduction

In recent years, there has been a massive explosion of data on the web, e.g., images and videos. For such data, we usually need to find the nearest neighbors of a query sample in a given large-scale database. However, such databases may contain billions of samples, which requires highly efficient retrieval methods. In addition, besides the infeasibility of exhaustive search, storing the original data is also a problem at this scale. To tackle these problems, approximate nearest neighbor (ANN) search techniques have been proposed; in particular, hashing-based ANN methods have attracted more and more attention. The main idea of hashing for ANN is to learn hash functions that convert high-dimensional features into compact binary codes while preserving similarity in the original space. In this way, hashing enables large gains in data storage and computation speed for similarity search. Thus, it has become a popular method in many communities dealing with large-scale data. In particular, for document/multimedia/cross-modal/multimodal retrieval, hashing has shown its superiority for fast semantic-similarity approximate nearest neighbor search [1], [2], [3], [4], [5], [6].
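To make this template concrete, the following minimal Python sketch (illustrative only; the projection matrix W here is random, not the hash functions learned by USEH) shows how a linear hash maps features to binary codes and how Hamming distances to a query are then computed:

```python
import numpy as np

def linear_hash(X, W):
    """Map real-valued features to binary codes via a linear projection.

    X: (n, d) feature matrix; W: (d, k) projection matrix.
    Returns (n, k) codes in {0, 1}. This is the generic sign-of-projection
    template shared by many linear hashing methods, not USEH specifically.
    """
    return (X @ W > 0).astype(np.uint8)

rng = np.random.default_rng(0)
codes = linear_hash(rng.standard_normal((1000, 128)),
                    rng.standard_normal((128, 64)))

# Hamming distances between a query code and all database codes reduce to
# elementwise comparisons (XOR + popcount on packed codes in practice),
# which is the source of hashing's query-speed advantage.
query = codes[0]
hamming = (codes != query).sum(axis=1)
```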

Generally, existing hashing methods can be divided into three main categories: unsupervised hashing, semi-supervised hashing and supervised hashing. Unsupervised hashing only uses unlabeled data to generate binary codes. In this paradigm, Locality-Sensitive Hashing (LSH) [7] is a well-known method, especially in computer vision. The basic idea of LSH is to compute randomized hash functions that guarantee a high probability of collision for similar samples. It offers probabilistic guarantees of retrieving items within (1+ε) times the optimal similarity, with sub-linear query time with respect to the number of samples in the database [8], [9].
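As a rough illustration of this idea (a generic random-hyperplane scheme, not necessarily the exact LSH variant adopted later in this paper), the sketch below buckets samples by their sign-of-projection bits, so that a query only scans its own bucket rather than the whole database:

```python
import numpy as np
from collections import defaultdict

rng = np.random.default_rng(0)
d, k = 128, 16                       # feature dimension, bits per bucket key
W = rng.standard_normal((d, k))      # k random hyperplanes define one hash table

def bucket_key(x):
    """Concatenate the k sign-of-projection bits into one bucket key."""
    return tuple((x @ W >= 0).astype(np.uint8))

X = rng.standard_normal((10_000, d))
table = defaultdict(list)
for i, x in enumerate(X):
    table[bucket_key(x)].append(i)

q = X[42] + 0.01 * rng.standard_normal(d)   # a slightly perturbed query
candidates = table[bucket_key(q)]           # only this bucket is scanned;
# in practice several independent tables are used to boost recall.
```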

Another famous unsupervised hashing method is Spectral Hashing (SH), proposed by Weiss et al. [10]. SH uses a separable Laplacian eigenfunction formulation that ends up assigning more bits to directions along which the data has a wider range. Some methods, such as IsoHash [11], Bagging PCA [12] and ITQ [13], are based on PCA. Iterative Quantization (ITQ) [13] tries to find a rotation of zero-centered data that minimizes the quantization error of mapping the data to the vertices of a zero-centered binary hypercube. It has shown good performance; however, the code length of ITQ is limited by the dimension of the original data. Similar to ITQ, ok-means and ck-means [14] generalize the ITQ algorithm and Product Quantization (PQ) [15], respectively, to minimize the quantization error. He et al. [16] proposed K-means hashing, a non-orthogonal quantization method that simultaneously performs k-means clustering and learns the binary indices of the clusters. In [17], Heo et al. proposed spherical hashing, a hypersphere-based binary coding scheme that achieves balanced partitioning of the data points for each hash function and independence between hash functions. There are also some methods that utilize anchor points to approximate the data neighborhood structure, e.g., Anchor Graph Hashing [18] and Compressed Hashing [19].
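For reference, here is a minimal sketch of the ITQ alternation described above; it assumes the input V has already been zero-centered and PCA-projected, and it solves the rotation update as an orthogonal Procrustes problem:

```python
import numpy as np

def itq(V, n_iter=50, seed=0):
    """Minimal sketch of Iterative Quantization (ITQ) [13].

    V: (n, k) zero-centered, PCA-projected data.
    Alternates between the binary codes B and an orthogonal rotation R
    to reduce the quantization error ||B - V R||_F.
    """
    rng = np.random.default_rng(seed)
    # Random orthogonal initialization for the rotation R.
    R, _ = np.linalg.qr(rng.standard_normal((V.shape[1], V.shape[1])))
    for _ in range(n_iter):
        B = np.sign(V @ R)                  # fix R, update the codes
        U, _, Wt = np.linalg.svd(V.T @ B)   # fix B, Procrustes solution for R
        R = U @ Wt
    return np.sign(V @ R), R
```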

Unsupervised hashing is promising for retrieving metric-distance neighbors when no labeled information is available. However, in some circumstances we have pairwise labeled data, e.g., semantic similarity labels. In order to make use of such labeled data, many semi-supervised or supervised hashing methods have been proposed to generate codes that respect this similarity. For example, Semi-Supervised Hashing (SSH) [20] can leverage semantic similarity from labeled data while remaining robust to overfitting. In addition, it relaxes the orthogonality constraints to allow successive projections to capture more of the data variance. Considering that sequential learning shows superiority over one-shot projection learning, Wang et al. [21] proposed a sequential projection learning version of SSH. Semi-Supervised Discriminant Hashing (SSDH), developed by Kim et al. [22], is another semi-supervised method, based on linear discriminant analysis.

There are also some sophisticated supervised hashing methods, such as RBM [23], BRE [24], MLSH [25], MFH [26], IMH [27] and MLH [9], which have shown higher search accuracy than unsupervised hashing approaches. However, they all require difficult optimization and are time-consuming in the training phase. Another supervised hashing method, KSH [28], exploits the equivalence between optimizing code inner products and optimizing Hamming distances (for K-bit codes b1, b2 ∈ {−1, 1}^K, the Hamming distance equals (K − b1ᵀb2)/2, so the two criteria are related by a linear transform), and also achieves good performance.

From the above, we can see that unsupervised hashing cannot make use of labeled information, while supervised hashing suffers from difficult optimization and time-consuming training. In addition, we can usually extract some useful information from unlabeled data. Thus, a model that can make use of such information while remaining easy to optimize and efficient to train can be expected to perform well and to scale to large problems. Motivated by this, in this paper we propose an unsupervised hashing technique in Euclidean space, namely Unsupervised Euclidean Hashing (USEH), which works on unlabeled data but exploits useful information obtained from it. USEH consists of two phases. In the first phase, USEH uses LSH to generate pseudo labels for the training data; in the second phase, it adopts a sequential learning strategy to learn the hash functions, one bit at a time. Moreover, we show that the similarity matrix in USEH is never needed explicitly and can be replaced by an equivalent factored form. Because of this, when the number of training samples is much greater than the feature dimension, the time complexity of USEH drops dramatically from O(n²) to O(n) (n is the number of training samples), which makes it practical and efficient on large-scale data.
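The following sketch illustrates only the complexity argument, under the assumption that the similarity matrix factors as S = YYᵀ for an n × c pseudo-label matrix Y (the exact objective of USEH is given in Section 3): quadratic forms such as XᵀSX can then be computed as (XᵀY)(YᵀX) without ever materializing the n × n matrix S.

```python
import numpy as np

n, d, c = 100_000, 128, 32          # samples, feature dim, pseudo-label bits
rng = np.random.default_rng(0)
X = rng.standard_normal((n, d))
Y = rng.integers(0, 2, (n, c)) * 2.0 - 1.0   # pseudo labels in {-1, +1}, e.g. from LSH

# Naive route: form S = Y @ Y.T explicitly, then M = X.T @ S @ X.
# That is O(n^2) in time and memory and infeasible at n = 10^6.

# Factored route: never materialize S; the cost is O(n d c) = O(n)
# for fixed d and c.
XtY = X.T @ Y                        # (d, c)
M = XtY @ XtY.T                      # equals X^T (Y Y^T) X exactly
```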

The remainder of this paper is organized as follows: Section 2 gives a brief introduction to Locality-Sensitive Hashing. Section 3 presents the detailed formulation of USEH and the solutions to its optimization problem. Experimental results and analysis on two large-scale datasets are presented in Section 4. Finally, Section 5 concludes this paper.

Section snippets

Locality sensitive hashing

Locality-Sensitive Hashing (LSH) [7], [29] is a well-known hashing algorithm. It hashes the data so that the probability of collision is much higher for objects that are close to each other than for those that are far apart. For our purposes, we adopt the following LSH scheme.

Given a feature space M, let H be an LSH family; for arbitrary x, y ∈ M, we have

Pr_{h ∈ H}[h(x) = h(y)] = s_H(x, y),

where s_H is a similarity measure on M. Given a locality-sensitive hash function h ∈ H, we can define the corresponding
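One concrete family satisfying a relation of this form is sign random projections, for which Pr[h(x) = h(y)] = 1 − θ(x, y)/π, with θ the angle between x and y; this is an illustrative aside, not necessarily the scheme used by USEH. The sketch below checks the collision probability empirically:

```python
import numpy as np

rng = np.random.default_rng(1)
x, y = rng.standard_normal(64), rng.standard_normal(64)

# Theoretical collision probability s_H(x, y) for sign random projections.
theta = np.arccos(x @ y / (np.linalg.norm(x) * np.linalg.norm(y)))
predicted = 1 - theta / np.pi

# Empirical collision rate over 100,000 hash functions drawn from the family.
W = rng.standard_normal((64, 100_000))
empirical = np.mean((x @ W >= 0) == (y @ W >= 0))
print(predicted, empirical)   # the two agree up to Monte Carlo noise
```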

Approach

In this section, we describe the proposed USEH approach in detail, including its formulation and how to learn the projection matrix, covering both the orthogonal and the non-orthogonal cases. In addition, we analyze its computational complexity.

Datasets and evaluation metric

We evaluate USEH on two public datasets, SIFT1M [15] and GIST1M. SIFT1M contains one million 128-dimensional SIFT [37] features as the search database and 10,000 SIFT features as queries. GIST1M contains one million 384-dimensional GIST [38] features, plus 1000 GIST features as queries that are randomly sampled from the 80M dataset [39]. For each query, the first K Euclidean nearest neighbors are used as ground truth, and
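A minimal sketch of this protocol (hypothetical helper names; at the full one-million scale the distance computation would be processed in chunks) computes exact Euclidean ground truth by brute force and a recall measure over retrieved lists:

```python
import numpy as np

def euclidean_ground_truth(base, queries, K):
    """Exact K Euclidean nearest neighbors of each query, used as ground truth.

    Uses ||q - b||^2 = ||q||^2 - 2 q.b + ||b||^2 to avoid an explicit
    (nq, nb, d) broadcast; brute force is acceptable offline.
    """
    d2 = ((queries ** 2).sum(1, keepdims=True)
          - 2 * queries @ base.T
          + (base ** 2).sum(1))
    return np.argsort(d2, axis=1)[:, :K]

def recall(retrieved, ground_truth):
    """Fraction of true neighbors recovered in the retrieved lists."""
    hits = sum(len(set(r) & set(g)) for r, g in zip(retrieved, ground_truth))
    return hits / ground_truth.size
```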

Conclusions

In this paper, we propose an unsupervised hashing method, USEH, for fast approximate nearest neighbor search, which embeds the original high-dimensional data into Hamming space while preserving Euclidean neighborhoods. We use LSH to generate the similarity matrix. In particular, USEH avoids explicitly computing the similarity matrix by decomposing it into the product of a pseudo-label matrix and its transpose, which reduces the training time to O(n). Unlike most supervised hashing methods, USEH learns

Acknowledgments

This work is partially supported by the National Natural Science Foundation of China (61173068 and 61103151), the Program for New Century Excellent Talents in University of the Ministry of Education, the Key Science Technology Project of Shandong Province (2014GGD01063), the Independent Innovation Foundation of Shandong Province (2014CGZH1106) and the Shandong Provincial Natural Science Foundation (ZR2014FM020 and ZR2014FM031).

References (42)

  • R. Salakhutdinov et al., Semantic hashing, Int. J. Approx. Reason. (2009).
  • J. Song et al., Effective multiple feature hashing for large-scale near-duplicate video retrieval, IEEE Trans. Multimed. (2013).
  • M.M. Bronstein, A.M. Bronstein, F. Michel, N. Paragios, Data fusion through cross-modality metric learning using...
  • S. Kumar, R. Udupa, Learning hash functions for cross-view similarity search, in: Proceedings of the International...
  • D. Zhang, F. Wang, L. Si, Composite hashing with multiple information sources, in: Proceedings of the ACM International...
  • Y. Zhen, D.-Y. Yeung, A probabilistic model for multimodal hash function learning, in: Proceedings of the ACM...
  • X. Zhu, Z. Huang, H. T. Shen, X. Zhao, Linear cross-modal hashing for efficient multimedia search, in: Proceedings of...
  • J. Wang et al., Optimized cartesian k-means, IEEE Trans. Knowl. Data Eng. (2015).
  • A. Gionis, P. Indyk, R. Motwani, Similarity search in high dimensions via hashing, in: Proceedings of the International...
  • B. Kulis, K. Grauman, Kernelized locality-sensitive hashing for scalable image search, in: Proceedings of the...
  • M. Norouzi, D. Fleet, Minimal loss hashing for compact binary codes, in: Proceedings of the International Conference on...
  • Y. Weiss et al., Spectral hashing, Adv. Neural Inf. Process. Syst. (2008).
  • W. Kong, W.-J. Li, Isotropic hashing, in: Advances in Neural Information Processing Systems, 2012, pp....
  • C. Leng, J. Cheng, T. Yuan, X. Bai, H. Lu, Learning binary codes with bagging PCA, in: Proceedings of European...
  • Y. Gong, S. Lazebnik, Iterative quantization: a procrustean approach to learning binary codes, in: Proceedings of the...
  • M. Norouzi, D. Fleet, Cartesian k-means, in: Proceedings of the IEEE Conference on Computer Vision and Pattern...
  • H. Jegou et al., Product quantization for nearest neighbor search, IEEE Trans. Pattern Anal. Mach. Intell. (2011).
  • K. He, F. Wen, J. Sun, K-means hashing: an affinity-preserving quantization method for learning binary compact codes,...
  • J.-P. Heo, J. He, S.-F. Chang, S.-E. Yoon, Spherical hashing, in: Proceedings of the IEEE Conference on Computer Vision...
  • W. Liu, J. Wang, S. Kumar, S.-F. Chang, Hashing with graphs, in: Proceedings of the International Conference on Machine...
  • Y. Lin, R. Jin, D. Cai, S. Yan, X. Li, Compressed hashing, in: Proceedings of the IEEE Conference on Computer Vision...

    Jian Wang received his B.E. degree in computer science from Shandong University, China, in 2013. He is currently pursuing the M.S. degree at the School of Computer Science and Technology, Shandong University. His research interests include machine learning, hashing and cross-modal multimedia retrieval.

    Xin-Shun Xu received his M.S. and Ph.D. degrees in computer science from Shandong University, China, in 2002, and Toyama University, Japan, in 2005, respectively. He joined the School of Computer Science and Technology at Shandong University as an associate professor in 2005, and joined the LAMDA group of the National Key Laboratory for Novel Software Technology, Nanjing University, China, as a postdoctoral fellow in 2009. Since 2010, he has been a professor at School of Computer Science and Technology at Shandong University, and the leader of MIMA (Machine Intelligence and Media Analysis) group of Shandong University. His research interests include machine learning, information retrieval, data mining and image/video analysis.

    Shanqing Guo is an associate professor of School of Computer Science and Technology at Shandong University. He obtained his Ph.D. degree in computer science from Department of Computer Science, Nanjing University, China. His research interests are data mining, data analysis, software and network security, with an emphasis on the development of program analysis and reverse engineering techniques at network level, or their applications to binary level vulnerability discovery, malicious code analysis and computer forensics.

    Lizhen Cui received his Ph.D. degree in computer science from Shandong University in 2005. He is a professor at the School of Computer Science and Technology at the Shandong University. His research interests include data mining and data management for cloud computing.

    Xiao-Lin Wang received her B.S., M.S. and Ph.D. degrees in computer science from Shandong University. Now, she is an associate professor at the School of Computer Science and Technology, Shandong University, and a member of MIMA (Machine Intelligence and Media Analysis) group of Shandong University. Her research interests include machine learning, data mining, information retrieval, intelligent computing and mobile agent.
