To improve retrieval efficiency and reduce the cost of searching key features, the image descriptors must be encoded into compact codes, that is, the descriptors must be quantized for indexing [20]. In this paper, an approximate nearest neighbor search algorithm is used to encode and index the extracted key image features so as to achieve fast search. The search stage consists of two steps: coarse quantization and distance estimation.
2.2.1 Coarse quantization of the image to be queried
Before coarse quantization of the image to be queried, the image must be preprocessed and its key information quantized and encoded [21]. A quantizer divides the key feature space of the input image into smaller codebook spaces. For example, suppose the key feature set of the image is \(X\subset {\mathbb{R}}^{D}\) and the input is a vector \(x\in X\). The vector quantizer can be understood as a mapping function \(q\) that quantizes the vector \(x\) to \(q\left(x\right)\in C=\left\{{c}_{0},{c}_{1},\cdots ,{c}_{k-1}\right\}\), where \(C\), the set of values produced by the quantizer over the key feature set \(X\), is called the codebook; its size is \(k\), and \({c}_{i}\), \(i=0,1,\cdots ,k-1\), is a centroid of the codebook. The set of vectors quantized to the same centroid is called a cluster, and the cluster \({V}_{i}\) corresponding to centroid \({c}_{i}\) is defined as follows:
$${V}_{i}\overset{\mathrm{def}}{=}\left\{x\in X\mid q\left(x\right)={c}_{i}\right\}$$
(5)
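As a concrete illustration of the quantizer of Eq. (5), the assignment step can be sketched in a few lines. This is a hypothetical NumPy sketch with a toy random codebook; in practice the codebook would be trained, e.g. with k-means:

```python
import numpy as np

def quantize(X, codebook):
    """Map each feature vector to the index of its nearest centroid c_i,
    i.e. the quantizer q of Eq. (5)."""
    # Squared Euclidean distance from every vector to every centroid
    d2 = ((X[:, None, :] - codebook[None, :, :]) ** 2).sum(axis=2)
    return d2.argmin(axis=1)

rng = np.random.default_rng(0)
X = rng.standard_normal((100, 8))        # toy key feature set, D = 8
codebook = rng.standard_normal((4, 8))   # toy codebook, k = 4 centroids
labels = quantize(X, codebook)
V_0 = X[labels == 0]                     # cluster V_0: vectors with q(x) = c_0
```

Each row of `X` is assigned to exactly one cluster, so the \(k\) clusters partition the feature set.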
The \(k\) clusters generated by vector quantization form a partition of the image key feature space, and vectors assigned to the same cluster are all represented by the same centroid \({c}_{i}\). In general, for a \(D\)-dimensional image key feature vector, obtaining a 64-bit code (0.5 bits per dimension) requires a quantizer with \(k={2}^{64}\) cluster centers, or visual words [22]. However, because the number of image key features is large and the complexity of training a quantizer is several times \(k\), training a quantizer with this many centroids is impractical. To address this problem, several dimensions of the feature vector can be grouped and quantized jointly.
The specific procedure is as follows: let the input image key feature vector be \(x\in {\mathbb{R}}^{D}\), and divide it into \(m\) low-dimensional sub-vectors \({u}_{j}\left(x\right)\), \(j=0,1,\cdots ,m-1\), with \({u}_{j}\left(x\right)\in {\mathbb{R}}^{{D}^{\prime}}\), where \({D}^{\prime}=D/m\). Assuming \(D\) is divisible by \(m\), the vector \(x\) is split into \(m\) sub-vectors:
$$\underset{u_0(x)}{\underbrace{x_1,\cdots, {x}_{D^{\prime }}}},\cdots, \underset{u_{m-1}(x)}{\underbrace{x_{D-{D}^{\prime }+1},\cdots, {x}_D}}$$
(6)
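The split in Eq. (6) amounts to a reshape of the feature vector; a minimal sketch with toy sizes chosen for illustration:

```python
import numpy as np

def split_subvectors(x, m):
    """Split a D-dimensional vector into m sub-vectors u_0(x), ..., u_{m-1}(x),
    each of dimension D' = D / m, as in Eq. (6)."""
    D = x.shape[0]
    assert D % m == 0, "D must be divisible by m"
    return x.reshape(m, D // m)

x = np.arange(8.0)            # toy vector, D = 8
u = split_subvectors(x, 4)    # m = 4, so D' = 2
# u[0] is u_0(x) = (x_1, x_2); u[3] is u_3(x) = (x_7, x_8)
```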
Similarly, the original image set \(X\) is split into \(m\) subspaces denoted by \({U}_{j}\subset {\mathbb{R}}^{{D}^{\prime}}\), \(j=0,\cdots ,m-1\). Next, the feature vectors in each of the \(m\) subspaces are clustered by the vector quantization procedure described above. Let \({u}_{j}\left(x\right)\in {U}_{j}\) be the \(j\)-th sub-vector of the input vector, \({q}_{j}\) the vector quantizer on subspace \({U}_{j}\), and \({C}_{j}\) the vector quantization codebook of that subspace; then the value of vector \(x\) quantized by the \(m\) independent sub-quantizers is as follows:
$$q\left(x\right)=\left({q}_{0}\left({u}_{0}\left(x\right)\right),\cdots ,{q}_{m-1}\left({u}_{m-1}\left(x\right)\right)\right)$$
(7)
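Eq. (7) can be sketched as follows: each sub-vector is assigned independently to its nearest codeword in its own sub-codebook, and the result is the tuple of \(m\) indices. The random sub-codebooks here are placeholders for codebooks that would normally be trained with k-means:

```python
import numpy as np

def pq_encode(x, codebooks):
    """Product-quantize x as in Eq. (7). codebooks is a list of m arrays,
    each of shape (k_s, D'). Returns the m-tuple of sub-quantizer indices."""
    m = len(codebooks)
    D_sub = codebooks[0].shape[1]
    subs = x.reshape(m, D_sub)                   # the sub-vectors u_j(x)
    codes = []
    for j, C_j in enumerate(codebooks):
        d2 = ((C_j - subs[j]) ** 2).sum(axis=1)  # distance to every codeword in C_j
        codes.append(int(d2.argmin()))           # index of q_j(u_j(x))
    return tuple(codes)

rng = np.random.default_rng(1)
m, D_sub, k_s = 4, 2, 16
codebooks = [rng.standard_normal((k_s, D_sub)) for _ in range(m)]
code = pq_encode(rng.standard_normal(m * D_sub), codebooks)
```

The tuple `code` is the compact representation stored in place of the original vector.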
where \({q}_{j}\) is the low-complexity sub-quantizer corresponding to the \(j\)-th sub-vector. By quantizing the sub-vectors separately, the product quantizer maps the input image key feature vector to a tuple of indices; this mapping result is the product quantization value of vector \(x\), in which each subspace centroid satisfies \({q}_{j}\left({u}_{j}\left(x\right)\right)\in {C}_{j}\). The codebook \({C}_{p}\) obtained by product quantization of the original image set \(X\) is the Cartesian product of the \(m\) independent vector quantization codebooks:
$${C}_{p}={C}_{0}\times \cdots \times {C}_{m-1}$$
(8)
The cluster centers of this set are concatenations of the \(m\) sub-quantizer codewords. Assuming every sub-quantizer has \({k}_{s}\) codewords, \({k}_{s}\) is usually set to a small value to limit the complexity of assigning image key features. Even so, the codebook generated by the product quantizer from concatenated sub-quantizer codewords remains very large. In the extreme case \(m=D\), every dimension of the input image key feature vector \(x\) is quantized separately, yet only the \(m\times {k}_{s}\) codewords of all sub-quantizers need to be stored, i.e., \(m\times {D}^{\prime}\times {k}_{s}={k}_{s}\times D\) floating-point values.
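This storage economy is easy to check numerically. With the illustrative sizes \(m=8\), \({k}_{s}=256\), and \({D}^{\prime}=16\) (so \(D=128\)), the product codebook implicitly contains \({2}^{64}\) centroids while only \({k}_{s}\times D\) floating-point values are stored:

```python
m, k_s, D_sub = 8, 256, 16           # illustrative sizes; D = m * D_sub = 128
implicit_centroids = k_s ** m        # |C_p| = (k_s)^m = 256^8 = 2^64 centroids
stored_floats = m * D_sub * k_s      # = k_s * D floating-point values stored
```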
Let the size of each independent image key feature vector quantization codebook \({C}_{j}\) be \({k}^{\prime}\) (to keep the quantization balanced, all subspace codebooks have the same size); the size of the product codebook \({C}_{p}\) is then \({\left({k}^{\prime}\right)}^{m}\) [23], where \(m\) denotes the number of groups in the product quantization. When \(m=1\), product quantization degenerates to ordinary vector quantization. The product quantization error \(MSE\left({q}_{p}\right)\) is calculated as follows:
$$MSE\left({q}_{p}\right)={\sum }_{j}MSE\left({q}_{j}\right)$$
(10)
In Eq. (10), \(MSE\left({q}_{j}\right)\) is the quantization error of the subspace vector quantization, and \(MSE\left({q}_{p}\right)\) varies with the number of product quantization groups \(m\) and the subspace dimension \({D}^{\prime}\). After the key information of the image has been quantized, the \(N\) database items \(X=\left\{{x}_{1},{x}_{2},\cdots ,{x}_{N}\right\}\) are divided into \(J\) mutually exclusive groups, each with a representative vector \({\mu }_{j}\in {\mathbb{R}}^{D}\). For each group, the residual \(x-{\mu }_{j}\) of every data item \(x\in {X}_{j}\) with respect to the group's representative vector is computed, encoded into a product quantization code, and stored in an inverted list. Given the key feature vector \(y\in {\mathbb{R}}^{D}\) of an image to be queried, the closest group \({X}_{j}\) is selected, and the residual \(y-{\mu }_{j}\) between the query feature vector and the group's representative vector is calculated. Because coarse quantization restricts the search space of the mobile network image set, only the data items in the selected group need to be examined, which completes the coarse quantization of the image to be queried.
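The coarse quantization step described above can be sketched as follows (the group representatives \({\mu }_{j}\) are assumed to be given, e.g. learned by k-means over the database):

```python
import numpy as np

def coarse_quantize(y, mu):
    """Pick the group X_j whose representative mu_j is closest to y,
    and return (j, residual y - mu_j)."""
    d2 = ((mu - y) ** 2).sum(axis=1)
    j = int(d2.argmin())
    return j, y - mu[j]

rng = np.random.default_rng(2)
mu = rng.standard_normal((16, 8))            # J = 16 group representatives, D = 8
y = mu[5] + 0.01 * rng.standard_normal(8)    # toy query lying near group 5
j, residual = coarse_quantize(y, mu)
```

Only the inverted list of group `j` (or of a few nearest groups) then needs to be scanned.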
2.2.2 Distance estimation of key features of images to be retrieved
After the key features of the image have been quantized, the feature distances are estimated to realize fast search of key images [24, 25]. Given the key feature vector of an image to be queried and a product quantization code, the approximate distance between the query vector and the original vector represented by the code can be computed efficiently by the asymmetric distance computation method. First, define a query vector \(y\in {\mathbb{R}}^{D}\) and a product quantization code \({i}_{x}={\left[{i}^{1},{i}^{2},\cdots ,{i}^{m}\right]}^{T}\in {\left\{1,2,\cdots ,K\right\}}^{m}\). To estimate the distance \(d{\left(y,x\right)}^{2}\) between the query vector and the original image key feature vector represented by the code, an asymmetric distance \(\widetilde{d}{\left(\cdot ,\cdot \right)}^{2}\) is defined as the distance between the query vector and the decoded reconstruction \(\widetilde{x}\), namely:
$$d{\left(y,x\right)}^{2}\approx \widetilde{d}{\left(y,x\right)}^{2}=d{\left(y,\widetilde{x}\right)}^{2}$$
(11)
This quantity can be obtained by explicitly computing the decoded vector in Eq. (11), but doing so costs almost as much as a linear scan of the original vector library and is therefore too expensive.
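In practice the asymmetric distance of Eq. (11) is evaluated with per-subspace lookup tables rather than by decoding: the distances from each query sub-vector to all codewords are precomputed once, so each stored code costs only \(m\) table lookups and additions. A sketch under the same toy assumptions as before:

```python
import numpy as np

def adc_tables(y, codebooks):
    """Precompute squared distances from each query sub-vector u_j(y)
    to every codeword in C_j."""
    m = len(codebooks)
    subs = y.reshape(m, -1)
    return [((C_j - subs[j]) ** 2).sum(axis=1) for j, C_j in enumerate(codebooks)]

def adc_distance(tables, code):
    """Approximate d(y, x)^2 for one PQ code by summing table entries."""
    return sum(tables[j][c] for j, c in enumerate(code))

rng = np.random.default_rng(3)
m, D_sub, k_s = 4, 2, 8
codebooks = [rng.standard_normal((k_s, D_sub)) for _ in range(m)]
y = rng.standard_normal(m * D_sub)
tables = adc_tables(y, codebooks)
code = (0, 3, 1, 7)                      # an example stored PQ code
approx_d2 = adc_distance(tables, code)

# Sanity check: the same value computed from the decoded vector x~
x_tilde = np.concatenate([codebooks[j][c] for j, c in enumerate(code)])
exact = ((y - x_tilde) ** 2).sum()
```

The table-based sum equals the distance to the decoded vector, but amortizes the codeword distances across all codes in the list.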
After the approximate representation of the data set vectors is complete, the distance between query vectors and candidate vectors can be estimated based on this quantization. Let the query vector be \(x\) and the candidate vector set be \({L}_{c}\); the distance between \(x\) and a candidate vector \(y\in {\mathbb{R}}^{D}\) is obtained from Eq. (12):
$$d{\left(x,y\right)}^{2}={\sum }_{p=1}^{P}d{\left({u}_{p}\left(x\right),{u}_{p}\left(y\right)\right)}^{2}$$
(12)
That is, the distance between the image key feature vectors \(x\) and \(y\) in the mobile network can be decomposed into the sum of the distances between their sub-vectors in the second-layer product quantization space. In the \(p\)-th subspace of the second-layer product quantization, the distance between the sub-vector \({u}_{p}\left(x\right)\) of the query vector \(x\) and the sub-vector \({u}_{p}\left(y\right)\) of the candidate set vector \(y\) is defined as:
$${d}_{p}\left(x,y\right)\overset{\mathrm{def}}{=}d\left({u}_{p}\left(x\right),{u}_{p}\left(y\right)\right)$$
(13)
When the point \({u}_{p}\left(y\right)\) is approximated by its projection point \(s\), the distance \({d}_{p}\left(x,y\right)\) can be approximated by the distance \({h}_{p}\left(x,y\right)\overset{\mathrm{def}}{=}d\left({u}_{p}\left(x\right),s\right)\) between \({u}_{p}\left(x\right)\) and the projection point \(s\). After trigonometric manipulation:
$${h}_{p}\left(x,y\right)={b}_{p}^{{\prime}2}+{\lambda }_{p}^{{\prime}2}\cdot {c}_{p}^{{\prime}2}+{\lambda }_{p}^{\prime}{a}_{p}^{{\prime}2}-{\lambda }_{p}^{\prime}{b}_{p}^{{\prime}2}-{\lambda }_{p}^{\prime}{c}_{p}^{{\prime}2}$$
(14)
where
$${b}_{p}^{{\prime}2}={a}_{p}^{2}+{\lambda }_{p}^{2}\cdot {c}_{p}^{2}+{\lambda }_{p}{b}_{p}^{2}-{\lambda }_{p}{a}_{p}^{2}-{\lambda }_{p}{c}_{p}^{2}$$
(15)
$${c}_{p}^{{\prime}2}={a}_{1p}^{2}+{\lambda }_{p}^{2}\cdot {c}_{p}^{2}+{\lambda }_{p}{b}_{1p}^{2}-{\lambda }_{p}{a}_{1p}^{2}-{\lambda }_{p}{c}_{p}^{2}$$
(16)
$${\lambda }_{p}={\widetilde{\lambda }}_{p}/\left(1-{\overline{\lambda }}_{p}\right)+{\widetilde{\lambda }}_{p}/{\overline{\lambda }}_{p}{\widetilde{\lambda }}_{p}$$
(17)
The final obtained image key feature distance calculation expression is as follows:
$$d{\left(x,y\right)}^{2}={\sum }_{p=1}^{P}\left({b}_{p}^{{\prime}2}+{\lambda }_{p}^{{\prime}2}\cdot {c}_{p}^{{\prime}2}+{\lambda }_{p}^{\prime}{a}_{p}^{{\prime}2}-{\lambda }_{p}^{\prime}{b}_{p}^{{\prime}2}-{\lambda }_{p}^{\prime}{c}_{p}^{{\prime}2}\right)$$
(18)
After the key feature distances of the image are obtained, the product quantization codes of the selected group are retrieved by traversing the inverted list, and fast image search is completed by computing the asymmetric distances between the residual vector \(y-{\mu }_{j}\) and these product quantization codes.
2.2.3 A fast image retrieval method based on non-exhaustive search
Linear search based on asymmetric distance computation is much faster than direct linear search, but when \(N\) is large the speed is still insufficient. To handle searches over millions or even billions of items, a search method combined with an inverted index is adopted.
The indexing process of the key feature vector
\(y\) in the mobile network image library is as follows:
(1)
The coarse quantizer \({q}_{c}\) is used to quantize the key feature vector \(y\) to \({q}_{c}\left(y\right)\).
(2)
The residual vector \(r\left(y\right)=y-{q}_{c}\left(y\right)\) of the image key feature vector \(y\) in the mobile network is calculated.
(3)
The product quantizer \({q}_{p}\) quantizes the residual vector \(r\left(y\right)\) to \({q}_{p}\left(r\left(y\right)\right)\); that is, each sub-vector \({u}_{j}\left(y\right)\) is quantized to \({q}_{j}\left({u}_{j}\left(y\right)\right)\).
(4)
In the inverted list associated with \({q}_{c}\left(y\right)\), an entry is added containing the ID of the image key feature vector in the mobile network and its binary code, that is, the index produced by the product quantizer.
Querying the key feature vector \(x\) of an image and searching for its nearest neighbor vectors involves four steps:
(1)
The key feature vector \(x\) of the query image is quantized to its \(w\) nearest cluster centers by the coarse quantizer.
(2)
The squared distance \(d\left( {u_{j} \left( {r\left( x \right)} \right),c_{j,i} } \right)^{2}\) between each sub-vector of the residual and each cluster center \(c_{j,i}\) of its corresponding sub-quantizer is calculated and stored in a lookup table.
(3)
The squared distance between the residual vector \(r\left(x\right)\) and each indexed vector on the inverted list is calculated: the sub-vector-to-centroid distances obtained in the previous step are summed to give the squared distance between the two vectors.
(4)
The \(K\) image key feature vectors closest to the query feature vector \(x\) are selected.
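The four query steps above can be sketched end-to-end as follows (toy data; the coarse centroids, sub-codebooks, and inverted lists are assumed to be already built):

```python
import numpy as np

def ivfadc_query(x, mu, codebooks, inverted_lists, w=2, K=5):
    """Non-exhaustive search sketch: coarse-quantize x to its w nearest lists,
    then rank entries by asymmetric distance on the residual.
    inverted_lists[j] holds (vector_id, pq_code) pairs for group j."""
    m = len(codebooks)
    d2_coarse = ((mu - x) ** 2).sum(axis=1)
    nearest_groups = np.argsort(d2_coarse)[:w]              # step (1)
    results = []
    for j in nearest_groups:
        r = x - mu[j]                                       # residual r(x)
        subs = r.reshape(m, -1)
        tables = [((C - subs[i]) ** 2).sum(axis=1)          # step (2)
                  for i, C in enumerate(codebooks)]
        for vec_id, code in inverted_lists[j]:              # step (3)
            d2 = sum(tables[i][c] for i, c in enumerate(code))
            results.append((d2, vec_id))
    results.sort()
    return [vec_id for _, vec_id in results[:K]]            # step (4)

rng = np.random.default_rng(4)
mu = rng.standard_normal((4, 8))                            # 4 coarse centroids
codebooks = [rng.standard_normal((16, 2)) for _ in range(4)]
inverted_lists = {j: [(10 * j + i, tuple(rng.integers(0, 16, 4)))
                      for i in range(3)] for j in range(4)}
top = ivfadc_query(rng.standard_normal(8), mu, codebooks, inverted_lists, w=2, K=3)
```

Only the entries of the \(w\) visited lists are scored, which is the source of the speedup over exhaustive asymmetric search.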
In the process of querying the key feature vector \(x\) and searching for its nearest neighbors, only step (3) depends on the size of the mobile network image library. Compared with the plain asymmetric distance method, this method adds the step of quantizing the query feature \(x\) to \({q}_{c}\left(x\right)\), which requires \({k}^{\prime}\) distance computations between \(D\)-dimensional feature vectors. Assuming the inverted lists are balanced, approximately \(N\times w/{k}^{\prime}\) entries are scanned per query. Compared with the general asymmetric distance method, the non-exhaustive search method therefore significantly improves the efficiency of fast image retrieval.