Let us now denote the video sequences collected in the first (diagnosis) and second (surveillance) examinations as
\(\mathcal {O}_1\) and
\(\mathcal {O}_2\), respectively. During the surveillance examination, retargeting of a query image (in
\(\mathcal {O}_2\)) is required to be performed in real time so that the regular clinical procedure is not interfered with. To enable real-time retargeting, we adopt hashing, which has proved to be efficient for large-scale image retrieval [21–24]. We follow the two-step hashing approach in [24] to compress the image descriptors into compact binary codes and learn the mapping function via a novel random forest-based hash. This allows for fast matching between descriptors based on Hamming distance computation. Furthermore, a quadratic loss function is used for learning the hashing function that maps the original descriptors to a new space, where images from the same scene have a smaller distance.
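Since matching between compact binary codes reduces to a Hamming distance computation, it can be sketched in a few lines. The code below is purely illustrative (the bit length and variable names are not part of the described system):

```python
import numpy as np

def hamming_distance(a, b):
    """Number of bit positions where the two binary codes disagree."""
    return int(np.count_nonzero(a != b))

# Two hypothetical 8-bit codes for a query image and a database image.
code_query = np.array([1, 0, 1, 1, 0, 0, 1, 0])
code_db    = np.array([1, 1, 1, 0, 0, 0, 1, 1])
d = hamming_distance(code_query, code_db)  # codes differ at positions 1, 3 and 7
```

In practice the codes are packed into machine words so that the distance is computed with XOR and popcount instructions, which is what makes retrieval fast at scale.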
In this work, we adopt supervised hashing, which requires a scene label for each image in the training set. We define a scene as a cluster of adjacent images that represent the same topological location. To obtain the scene labels, we perform image clustering on the diagnosis video collected in the first examination, similar to [8]. Specifically, we use a semi-automatic approach that performs intensity-based K-means clustering, followed by manually merging similar clusters. This results in an affinity matrix
\(\mathcal {A}\) where
\(a_{ij} = 1\) if
\(x_i\) and
\(x_j\) have the same scene label, and
\(a_{ij} = 0\) if not.
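The clustering and affinity-matrix construction can be sketched as follows. This is a minimal illustration with a hand-rolled K-means on synthetic descriptors (the data, descriptor dimension, and cluster count are arbitrary choices; the manual cluster-merging step is omitted):

```python
import numpy as np

def kmeans(X, k, iters=20, seed=0):
    """Minimal K-means: returns one cluster label per row of X."""
    rng = np.random.default_rng(seed)
    centres = X[rng.choice(len(X), size=k, replace=False)]
    for _ in range(iters):
        # Assign each descriptor to its nearest centre.
        dists = np.linalg.norm(X[:, None, :] - centres[None, :, :], axis=2)
        labels = dists.argmin(axis=1)
        # Recompute centres as cluster means.
        for c in range(k):
            if np.any(labels == c):
                centres[c] = X[labels == c].mean(axis=0)
    return labels

rng = np.random.default_rng(1)
# Three well-separated blobs standing in for intensity descriptors of frames.
X = np.vstack([rng.normal(m, 0.1, size=(4, 8)) for m in (0.0, 5.0, 10.0)])
labels = kmeans(X, k=3)

# Affinity matrix: a_ij = 1 iff frames i and j carry the same scene label.
A = (labels[:, None] == labels[None, :]).astype(int)
```

The resulting matrix is symmetric with a unit diagonal, matching the definition of \(\mathcal {A}\) above.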
Given a set of image descriptors extracted from the diagnosis video, which are denoted as
\(\left\{ \mathbf {x}_{i} \right\} _{i=1}^{n}\), our aim is to infer their corresponding
m-bit binary codes
\(\left\{ \mathbf {b}_{i} \right\} _{i=1}^{n}\). This inference is performed by encouraging the Hamming distance between images of the same scene to be small, whilst keeping it large for images of different scenes. We obtain each bit in the binary code sequentially by optimising for the
r-th bit with the objective function:
$$\begin{aligned} \begin{aligned} \min _{\mathbf {b}_{\left( r\right) }}&\sum _{i=1}^{n}\sum _{j=1}^{n}l_r\left( b_{r,i}, b_{r,j};a_{ij} \right) , \\&\text {s.t.}\, \mathbf {b}_{\left( r\right) }\in \left\{ -1,1 \right\} ^{n} \end{aligned} \end{aligned}$$
(1)
where
\(b_{r,i}\) and
\(b_{r,j}\) are the
r-th bits for images
i and
j, respectively. Here,
\(\mathbf {b}_{\left( r\right) }\) represents a vector that concatenates the
r-th bits of
n images. Therefore, this optimisation sequentially seeks the values of
\(\mathbf {b}_{\left( r\right) }\) for each bit.
Following [
24], we consider a hash loss function
\(l\left( b_1, b_2 \right) \) that takes binary variables
\(b_1, b_2 \in \lbrace -1,1 \rbrace \) as input and satisfies
\(l\left( 1,1\right) =l\left( -1,-1\right) \) and
\(l\left( -1,1\right) =l\left( 1,-1\right) \). This loss can be replaced with an equivalent quadratic function defined as:
$$\begin{aligned} \begin{aligned} h\left( b_1, b_2 \right)&= \dfrac{1}{2}\left[ b_1 b_2 \left( l^{11}-l^{-11}\right) + l^{11} + l^{-11} \right] \\&= l\left( b_1, b_2 \right) , \end{aligned} \end{aligned}$$
(2)
Here,
\(l^{11}\) and
\(l^{-11}\) are the constants that represent
\(l\left( 1,1\right) \) and
\(l\left( -1,1\right) \), respectively. Note that Eq.
2 can be proved by checking all possible binary inputs. For example, when
\(b_1=b_2=1\), we have
$$\begin{aligned} h\left( 1,1 \right) = \dfrac{1}{2}\left[ l^{11}-l^{-11} + l^{11}+l^{-11}\right] = l^{11} = l\left( 1,1\right) , \end{aligned}$$
(3)
and when
\(b_1=-1\) and
\(b_2=1\), we can obtain
$$\begin{aligned} h\left( -1,1 \right)= & {} \dfrac{1}{2}\left[ -1\cdot 1\cdot \left( l^{11}-l^{-11} \right) + l^{11}+l^{-11}\right] \nonumber \\= & {} l\left( -1,1\right) . \end{aligned}$$
(4)
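All four sign combinations can also be checked numerically. In the sketch below the constants are chosen arbitrarily, since the identity must hold for any values of \(l\left( 1,1\right) \) and \(l\left( -1,1\right) \) (named `l11` and `lm11` for illustration):

```python
# Arbitrary constants standing in for l(1,1) and l(-1,1).
l11, lm11 = 0.7, 2.3

def l(b1, b2):
    # Any loss satisfying l(1,1) = l(-1,-1) and l(-1,1) = l(1,-1).
    return l11 if b1 == b2 else lm11

def h(b1, b2):
    # Quadratic surrogate of Eq. 2.
    return 0.5 * (b1 * b2 * (l11 - lm11) + l11 + lm11)

# Evaluate both functions on all four binary input pairs.
checks = [(h(b1, b2), l(b1, b2)) for b1 in (-1, 1) for b2 in (-1, 1)]
```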
Similar equations can also be derived for
\(h\left( -1,-1 \right) \) and
\(h\left( 1,-1 \right) \). Given that
\(l^{11}+l^{-11}\) is a constant that does not affect the minimiser, we now use Eq.
2 to reformulate Eq.
1 as
$$\begin{aligned} \begin{aligned} \min _{\mathbf {b}_{\left( r\right) }}&\sum _{i=1}^{n}\sum _{j=1}^{n} b_{r,i}b_{r,j} \left( l^{11}_{r,i,j}-l^{-11}_{r,i,j}\right) , \\&\text {s.t.}\, \mathbf {b}_{\left( r\right) }\in \left\{ -1,1 \right\} ^{n}. \end{aligned} \end{aligned}$$
(5)
When considering the affinity label between images
i and
j, we have
\(l^{11}_{r,i,j}=l_r\left( 1,1;a_{ij}\right) \) and
\(l^{-11}_{r,i,j}=l_r\left( -1,1;a_{ij}\right) \). Let us denote
\(c_{r,i,j} = l^{11}_{r,i,j}-l^{-11}_{r,i,j}\), and define matrix
\(\mathcal {C}\) that contains all the
\(c_{r,i,j}\) elements. The objective is finally turned into a matrix representation:
$$\begin{aligned} \begin{aligned}&\min _{\mathbf {b}_{\left( r\right) }} \mathbf {b}_{\left( r\right) }^{T} \mathcal {C} \mathbf {b}_{\left( r\right) }, \\&\text {s.t.}\, \mathbf {b}_{\left( r\right) }\in \left\{ -1,1 \right\} ^{n}. \end{aligned} \end{aligned}$$
(6)
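For intuition, the quadratic objective of Eq. 6 can be minimised by exhaustive search when n is tiny. The coefficient matrix below is a toy example, not derived from our data: its negative off-diagonal entries favour equal bits and its positive entries favour unequal bits, mimicking a same-scene pair (images 1, 2 and images 3, 4) versus different-scene pairs:

```python
import itertools
import numpy as np

# Toy coefficient matrix C for n = 4 images (illustrative values only).
C = np.array([[ 0., -1.,  1.,  1.],
              [-1.,  0.,  1.,  1.],
              [ 1.,  1.,  0., -1.],
              [ 1.,  1., -1.,  0.]])

# Brute-force minimisation of b^T C b over b in {-1, 1}^4.
best_b, best_val = None, np.inf
for bits in itertools.product((-1.0, 1.0), repeat=4):
    b = np.array(bits)
    val = float(b @ C @ b)
    if val < best_val:
        best_b, best_val = b, val
```

The minimiser assigns equal bits within each same-scene pair and opposite bits across them, exactly the behaviour the objective is designed to induce; for realistic n the search space is exponential, which is why a graph-cut based solver is used instead.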
Note that, to solve this binary quadratic problem, we perform a series of local optimisations via graph-cut [
24]. Furthermore, in this work, we employ a hinge loss function, defined as
$$\begin{aligned}&l_r\left( b_{r,i}, b_{r,j};a_{ij}\right) \nonumber \\&\quad = {\left\{ \begin{array}{ll} \left[ 0-\mathcal {D}\left( \mathbf {b}_{i}^r,\mathbf {b}_{j}^r \right) \right] ^{2},&{} \quad \text {if } a_{ij} = 1\\ \left[ \max \left( 0.5m -\mathcal {D}\left( \mathbf {b}_{i}^r, \mathbf {b}_{j}^r \right) ,0\right) \right] ^{2},&{} \quad \text {if } a_{ij} = 0 \end{array}\right. } \end{aligned}$$
(7)
where
\(\mathbf {b}^r_i\) and
\(\mathbf {b}^r_j\) denote the first
r bits for
\(\mathbf {b}_i\) and
\(\mathbf {b}_j\), respectively.
\(\mathcal {D}\left( \cdot ,\cdot \right) \) denotes the Hamming distance. Equation
7 encourages images of the same scene to be close and pushes images of different scenes to have distances larger than half the maximum distance (
\(0.5m\)). It is worth noting that during this sequential optimisation, the derivation of the current (
r-th) bit uses the results of the previously obtained bits
\(1, \ldots , r-1\).
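The hinge loss of Eq. 7 can be sketched directly. The total bit length and the example codes below are arbitrary illustrations:

```python
import numpy as np

def hinge_loss(bi_r, bj_r, a_ij, m=8):
    """Hinge loss of Eq. 7 on the first r bits of two +/-1 codes.

    m is the total number of bits in the final code (8 is an arbitrary
    choice for illustration).
    """
    d = int(np.count_nonzero(bi_r != bj_r))  # Hamming distance D(b_i^r, b_j^r)
    if a_ij == 1:
        # Same scene: any nonzero distance is penalised quadratically.
        return float(d ** 2)
    # Different scenes: penalise only distances below the 0.5m margin.
    return float(max(0.5 * m - d, 0.0) ** 2)

bi = np.array([1, -1, 1])   # first r = 3 bits of a hypothetical code
bj = np.array([1, 1, -1])
same_scene = hinge_loss(bi, bj, a_ij=1)   # d = 2, so the loss is d^2
diff_scene = hinge_loss(bi, bj, a_ij=0)   # max(0.5*8 - 2, 0)^2
```

Note that a different-scene pair incurs no penalty once its distance reaches the margin, so the optimisation does not waste effort pushing already well-separated codes further apart.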