Neurocomputing

Volume 171, 1 January 2016, Pages 283-292

Linear unsupervised hashing for ANN search in Euclidean space

https://doi.org/10.1016/j.neucom.2015.06.036

Abstract

Approximate nearest neighbor (ANN) search for large-scale data has attracted considerable attention because large amounts of data are now easily available. Recently, hashing has been widely adopted for similarity search owing to its low storage cost and fast query speed. When semantic similarity information is available, supervised hashing methods show better performance than unsupervised ones. However, supervised hashing methods need explicit similarity information, which is not available in some scenarios. In addition, they suffer from difficult optimization and time-consuming training, which makes them impractical for large-scale data. In this paper, we propose an unsupervised hashing method, Unsupervised Euclidean Hashing (USEH), which learns and generates hash codes that preserve the Euclidean distance relationships among the data. Specifically, USEH first utilizes Locality-Sensitive Hashing (LSH) to generate pseudo labels; it then adopts a sequential learning strategy to learn the hash functions, one bit at a time, which yields very discriminative codes. Moreover, USEH avoids explicitly computing the similarity matrix by decomposing it into the product of a label matrix and its transpose, which makes the training complexity of USEH linear in the number of training samples whenever that number is much greater than the feature dimension. Thus, it can work efficiently on large-scale data. We test USEH on two large-scale datasets, SIFT1M and GIST1M. Experimental results show that USEH is comparable to state-of-the-art unsupervised hashing methods.

Introduction

In recent years, there has been a massive explosion of data on the web, e.g., images and videos. For such data, we usually need to find the nearest neighbors of a query sample in a given large-scale database. However, such databases may contain billions of samples, which requires highly efficient retrieval methods. In addition, besides the infeasibility of exhaustive search, storing the original data is also a problem at this scale. To tackle these problems, approximate nearest neighbor (ANN) search techniques have been proposed; in particular, hashing-based ANN methods have attracted more and more attention. The main idea of hashing for ANN is to learn hash functions that convert high-dimensional features into compact binary codes while preserving similarity in the original space. In this way, hashing enables large gains in data storage and computation speed for similarity search. Thus, it has become a popular method in many communities dealing with large-scale data. In particular, for document/multimedia/cross-modal/multimodal retrieval, hashing has shown its superiority for fast semantic-similarity approximate nearest neighbor search [1], [2], [3], [4], [5], [6].
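To make this template concrete, the following minimal Python sketch (illustrative only; the projection matrix W here is random, not the hash functions learned by USEH) shows how a linear hash maps features to binary codes and how Hamming distances to a query are then computed:

```python
import numpy as np

def linear_hash(X, W):
    """Map real-valued features to binary codes via a linear projection.

    X: (n, d) feature matrix; W: (d, k) projection matrix.
    Returns (n, k) codes in {0, 1}. This is the generic sign-of-projection
    template shared by many linear hashing methods, not USEH specifically.
    """
    return (X @ W > 0).astype(np.uint8)

rng = np.random.default_rng(0)
codes = linear_hash(rng.standard_normal((1000, 128)),
                    rng.standard_normal((128, 64)))

# Hamming distances between a query code and all database codes reduce to
# elementwise comparisons (XOR + popcount on packed codes in practice),
# which is the source of hashing's query-speed advantage.
query = codes[0]
hamming = (codes != query).sum(axis=1)
```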

Generally, existing hashing methods can be divided into three main categories: unsupervised hashing, semi-supervised hashing and supervised hashing. Unsupervised hashing only uses unlabeled data to generate binary codes. In this paradigm, Locality-Sensitive Hashing (LSH) [7] is a well-known method, especially in computer vision. The basic idea of LSH is to compute randomized hash functions that guarantee a high probability of collision for similar samples. It offers probabilistic guarantees of retrieving items within (1+ε) times the optimal similarity, with sub-linear query time with respect to the number of samples in the database [8], [9].
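As a rough illustration of this idea (a generic random-hyperplane scheme, not necessarily the exact LSH variant adopted later in this paper), the sketch below buckets samples by their sign-of-projection bits, so that a query only scans its own bucket rather than the whole database:

```python
import numpy as np
from collections import defaultdict

rng = np.random.default_rng(0)
d, k = 128, 16                       # feature dimension, bits per bucket key
W = rng.standard_normal((d, k))      # k random hyperplanes define one hash table

def bucket_key(x):
    """Concatenate the k sign-of-projection bits into one bucket key."""
    return tuple((x @ W >= 0).astype(np.uint8))

X = rng.standard_normal((10_000, d))
table = defaultdict(list)
for i, x in enumerate(X):
    table[bucket_key(x)].append(i)

q = X[42] + 0.01 * rng.standard_normal(d)   # a slightly perturbed query
candidates = table[bucket_key(q)]           # only this bucket is scanned;
# in practice several independent tables are used to boost recall.
```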

Another famous unsupervised hashing method is Spectral Hashing (SH), proposed by Weiss et al. [10]. SH uses a separable Laplacian eigenfunction formulation that ends up assigning more bits to directions along which the data has a wider range. Some methods, such as IsoHash [11], Bagging PCA [12] and ITQ [13], are based on PCA. Iterative Quantization (ITQ) [13] tries to find a rotation of zero-centered data that minimizes the quantization error of mapping the data to the vertices of a zero-centered binary hypercube. It has shown good performance; however, the code length of ITQ is limited by the dimension of the original data. Similar to ITQ, ok-means and ck-means [14] generalize the ITQ algorithm and Product Quantization (PQ) [15], respectively, to minimize the quantization error. He et al. [16] proposed K-means hashing, a non-orthogonal quantization method that simultaneously performs k-means clustering and learns the binary indices of the clusters. In [17], Heo et al. proposed spherical hashing, a hypersphere-based binary coding scheme that achieves balanced partitioning of the data points for each hash function and independence between hash functions. There are also some methods that utilize anchor points to approximate the data neighborhood structure, e.g., Anchor Graph Hashing [18] and Compressed Hashing [19].
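For reference, here is a minimal sketch of the ITQ alternation described above; it assumes the input V has already been zero-centered and PCA-projected, and it solves the rotation update as an orthogonal Procrustes problem:

```python
import numpy as np

def itq(V, n_iter=50, seed=0):
    """Minimal sketch of Iterative Quantization (ITQ) [13].

    V: (n, k) zero-centered, PCA-projected data.
    Alternates between the binary codes B and an orthogonal rotation R
    to reduce the quantization error ||B - V R||_F.
    """
    rng = np.random.default_rng(seed)
    # Random orthogonal initialization for the rotation R.
    R, _ = np.linalg.qr(rng.standard_normal((V.shape[1], V.shape[1])))
    for _ in range(n_iter):
        B = np.sign(V @ R)                  # fix R, update the codes
        U, _, Wt = np.linalg.svd(V.T @ B)   # fix B, Procrustes solution for R
        R = U @ Wt
    return np.sign(V @ R), R
```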

Unsupervised hashing is promising for retrieving metric-distance neighbors when no labeled information is available. However, in some circumstances we have pairwise labeled data, e.g., semantic similarity labels. In order to make use of such labeled data, many semi-supervised or supervised hashing methods have been proposed to generate codes that respect this similarity. For example, Semi-Supervised Hashing (SSH) [20] can leverage semantic similarity from labeled data while remaining robust to overfitting. In addition, it relaxes the orthogonality constraints to allow successive projections to capture more of the data variance. Considering that sequential learning shows superiority over one-shot projection learning, Wang et al. [21] proposed a sequential projection learning version of SSH. Semi-Supervised Discriminant Hashing (SSDH), developed by Kim et al. [22], is another semi-supervised method, based on linear discriminant analysis.

There are also some sophisticated supervised hashing methods, such as RBM [23], BRE [24], MLSH [25], MFH [26], IMH [27] and MLH [9], which have shown higher search accuracy than unsupervised hashing approaches. However, they all require difficult optimization and are time-consuming in the training phase. Another supervised hashing method, KSH [28], exploits the equivalence between optimizing code inner products and optimizing Hamming distances (for K-bit codes b1, b2 ∈ {−1, 1}^K, the Hamming distance equals (K − b1ᵀb2)/2, so the two criteria are related by a linear transform), and also achieves good performance.

From the above, we can see that unsupervised hashing cannot make use of labeled information, while supervised hashing suffers from difficult optimization and time-consuming training. In addition, we can usually extract some useful information from unlabeled data. Thus, a model that can make use of such information while remaining easy to optimize and efficient to train can be expected to perform well and to scale to large problems. Motivated by this, in this paper we propose an unsupervised hashing technique in Euclidean space, namely Unsupervised Euclidean Hashing (USEH), which works on unlabeled data but exploits useful information obtained from it. USEH consists of two phases. In the first phase, USEH uses LSH to generate pseudo labels for the training data; in the second phase, it adopts a sequential learning strategy to learn the hash functions, one bit at a time. Moreover, we show that the similarity matrix in USEH is never needed explicitly and can be replaced by an equivalent factored form. Because of this, when the number of training samples is much greater than the feature dimension, the time complexity of USEH drops dramatically from O(n²) to O(n) (n is the number of training samples), which makes it practical and efficient on large-scale data.
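The following sketch illustrates only the complexity argument, under the assumption that the similarity matrix factors as S = YYᵀ for an n × c pseudo-label matrix Y (the exact objective of USEH is given in Section 3): quadratic forms such as XᵀSX can then be computed as (XᵀY)(YᵀX) without ever materializing the n × n matrix S.

```python
import numpy as np

n, d, c = 100_000, 128, 32          # samples, feature dim, pseudo-label bits
rng = np.random.default_rng(0)
X = rng.standard_normal((n, d))
Y = rng.integers(0, 2, (n, c)) * 2.0 - 1.0   # pseudo labels in {-1, +1}, e.g. from LSH

# Naive route: form S = Y @ Y.T explicitly, then M = X.T @ S @ X.
# That is O(n^2) in time and memory and infeasible at n = 10^6.

# Factored route: never materialize S; the cost is O(n d c) = O(n)
# for fixed d and c.
XtY = X.T @ Y                        # (d, c)
M = XtY @ XtY.T                      # equals X^T (Y Y^T) X exactly
```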

The remainder of this paper is organized as follows: Section 2 gives a brief introduction to Locality-Sensitive Hashing. Section 3 presents the detailed formulation of USEH and the solutions to its optimization problem. Experimental results and analysis on two large-scale datasets are presented in Section 4. Finally, Section 5 concludes this paper.

Section snippets

Locality sensitive hashing

Locality-Sensitive Hashing (LSH) [7], [29] is a well-known hashing algorithm. It hashes the data so that the probability of collision is much higher for objects that are close to each other than for those that are far apart. For our purposes, we adopt the following LSH scheme.

Given a feature space M, let H be an LSH family; for arbitrary x, y ∈ M, we have

Pr_{h ∈ H}[h(x) = h(y)] = s_H(x, y),

where s_H is a similarity measure on M. Given a locality-sensitive hash function h ∈ H, we can define the corresponding
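One concrete family satisfying a relation of this form is sign random projections, for which Pr[h(x) = h(y)] = 1 − θ(x, y)/π, with θ the angle between x and y; this is an illustrative aside, not necessarily the scheme used by USEH. The sketch below checks the collision probability empirically:

```python
import numpy as np

rng = np.random.default_rng(1)
x, y = rng.standard_normal(64), rng.standard_normal(64)

# Theoretical collision probability s_H(x, y) for sign random projections.
theta = np.arccos(x @ y / (np.linalg.norm(x) * np.linalg.norm(y)))
predicted = 1 - theta / np.pi

# Empirical collision rate over 100,000 hash functions drawn from the family.
W = rng.standard_normal((64, 100_000))
empirical = np.mean((x @ W >= 0) == (y @ W >= 0))
print(predicted, empirical)   # the two agree up to Monte Carlo noise
```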

Approach

In this section, we describe the proposed USEH approach in detail, including its formulation and how to learn the projection matrix, covering both the orthogonal and the non-orthogonal cases. In addition, we analyze its computational complexity.

Datasets and evaluation metric

We evaluate USEH on two public datasets, SIFT1M [15] and GIST1M. SIFT1M contains one million 128-dimensional SIFT [37] features as the search database and 10,000 SIFT features as queries. GIST1M contains one million 384-dimensional GIST [38] features, plus 1000 GIST features as queries that are randomly sampled from the 80M dataset [39]. For each query, the first K Euclidean nearest neighbors are used as ground truth, and
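A minimal sketch of this protocol (hypothetical helper names; at the full one-million scale the distance computation would be processed in chunks) computes exact Euclidean ground truth by brute force and a recall measure over retrieved lists:

```python
import numpy as np

def euclidean_ground_truth(base, queries, K):
    """Exact K Euclidean nearest neighbors of each query, used as ground truth.

    Uses ||q - b||^2 = ||q||^2 - 2 q.b + ||b||^2 to avoid an explicit
    (nq, nb, d) broadcast; brute force is acceptable offline.
    """
    d2 = ((queries ** 2).sum(1, keepdims=True)
          - 2 * queries @ base.T
          + (base ** 2).sum(1))
    return np.argsort(d2, axis=1)[:, :K]

def recall(retrieved, ground_truth):
    """Fraction of true neighbors recovered in the retrieved lists."""
    hits = sum(len(set(r) & set(g)) for r, g in zip(retrieved, ground_truth))
    return hits / ground_truth.size
```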

Conclusions

In this paper, we propose an unsupervised hashing method, USEH, for fast approximate nearest neighbor search, which embeds the original high-dimensional data into Hamming space while preserving Euclidean neighborhoods. We use LSH to generate the similarity matrix. In particular, USEH avoids explicitly computing the similarity matrix by decomposing it into the product of a pseudo-label matrix and its transpose, which reduces the training time to O(n). Unlike most supervised hashing methods, USEH learns

Acknowledgments

This work is partially supported by the National Natural Science Foundation of China (61173068 and 61103151), the Program for New Century Excellent Talents in University of the Ministry of Education, the Key Science Technology Project of Shandong Province (2014GGD01063), the Independent Innovation Foundation of Shandong Province (2014CGZH1106) and the Shandong Provincial Natural Science Foundation (ZR2014FM020 and ZR2014FM031).

References (42)

  • R. Salakhutdinov et al., Semantic hashing, Int. J. Approx. Reason. (2009).
  • J. Song et al., Effective multiple feature hashing for large-scale near-duplicate video retrieval, IEEE Trans. Multimed. (2013).
  • M.M. Bronstein, A.M. Bronstein, F. Michel, N. Paragios, Data fusion through cross-modality metric learning using...
  • S. Kumar, R. Udupa, Learning hash functions for cross-view similarity search, in: Proceedings of the International...
  • D. Zhang, F. Wang, L. Si, Composite hashing with multiple information sources, in: Proceedings of the ACM International...
  • Y. Zhen, D.-Y. Yeung, A probabilistic model for multimodal hash function learning, in: Proceedings of the ACM...
  • X. Zhu, Z. Huang, H. T. Shen, X. Zhao, Linear cross-modal hashing for efficient multimedia search, in: Proceedings of...
  • J. Wang et al., Optimized cartesian k-means, IEEE Trans. Knowl. Data Eng. (2015).
  • A. Gionis, P. Indyk, R. Motwani, Similarity search in high dimensions via hashing, in: Proceedings of the International...
  • B. Kulis, K. Grauman, Kernelized locality-sensitive hashing for scalable image search, in: Proceedings of the...
  • M. Norouzi, D. Fleet, Minimal loss hashing for compact binary codes, in: Proceedings of the International Conference on...
  • Y. Weiss et al., Spectral hashing, Adv. Neural Inf. Process. Syst. (2008).
  • W. Kong, W.-J. Li, Isotropic hashing, in: Advances in Neural Information Processing Systems, 2012, pp....
  • C. Leng, J. Cheng, T. Yuan, X. Bai, H. Lu, Learning binary codes with bagging PCA, in: Proceedings of European...
  • Y. Gong, S. Lazebnik, Iterative quantization: a procrustean approach to learning binary codes, in: Proceedings of the...
  • M. Norouzi, D. Fleet, Cartesian k-means, in: Proceedings of the IEEE Conference on Computer Vision and Pattern...
  • H. Jegou et al., Product quantization for nearest neighbor search, IEEE Trans. Pattern Anal. Mach. Intell. (2011).
  • K. He, F. Wen, J. Sun, K-means hashing: an affinity-preserving quantization method for learning binary compact codes,...
  • J.-P. Heo, J. He, S.-F. Chang, S.-E. Yoon, Spherical hashing, in: Proceedings of the IEEE Conference on Computer Vision...
  • W. Liu, J. Wang, S. Kumar, S.-F. Chang, Hashing with graphs, in: Proceedings of the International Conference on Machine...
  • Y. Lin, R. Jin, D. Cai, S. Yan, X. Li, Compressed hashing, in: Proceedings of the IEEE Conference on Computer Vision...

    Jian Wang received his B.E. degree in computer science from Shandong University, China, in 2013. He is currently pursuing the M.S. degree at the School of Computer Science and Technology, Shandong University. His research interests include machine learning, hashing and cross-modal multimedia retrieval.

    Xin-Shun Xu received his M.S. and Ph.D. degrees in computer science from Shandong University, China, in 2002, and Toyama University, Japan, in 2005, respectively. He joined the School of Computer Science and Technology at Shandong University as an associate professor in 2005, and joined the LAMDA group of the National Key Laboratory for Novel Software Technology, Nanjing University, China, as a postdoctoral fellow in 2009. Since 2010, he has been a professor at School of Computer Science and Technology at Shandong University, and the leader of MIMA (Machine Intelligence and Media Analysis) group of Shandong University. His research interests include machine learning, information retrieval, data mining and image/video analysis.

    Shanqing Guo is an associate professor of School of Computer Science and Technology at Shandong University. He obtained his Ph.D. degree in computer science from Department of Computer Science, Nanjing University, China. His research interests are data mining, data analysis, software and network security, with an emphasis on the development of program analysis and reverse engineering techniques at network level, or their applications to binary level vulnerability discovery, malicious code analysis and computer forensics.

    Lizhen Cui received his Ph.D. degree in computer science from Shandong University in 2005. He is a professor at the School of Computer Science and Technology at the Shandong University. His research interests include data mining and data management for cloud computing.

    Xiao-Lin Wang received her B.S., M.S. and Ph.D. degrees in computer science from Shandong University. Now, she is an associate professor at the School of Computer Science and Technology, Shandong University, and a member of MIMA (Machine Intelligence and Media Analysis) group of Shandong University. Her research interests include machine learning, data mining, information retrieval, intelligent computing and mobile agent.
