Private identification schemes
The principle of a private identification scheme is to manage nearest neighbor search in the encrypted domain. The two main sub-problems are the Approximate Nearest Neighbor (ANN) problem and Searchable Encryption
The Approximate Nearest Neighbor (ANN) problem is defined as follows: Let
be a set of points in a metric space (E,d
E). For an input
x ∈
E and ϵ ≥ 0, find a point p
x ∈
such that
This is an approximation of the Nearest Neighbor problem as the exact case is hard to solve in large dimension spaces. Several algorithms for the ANN problem have been proposed [
19] and the basic principle is to rely on sketching methods which output shorter vectors with increased stability and which enable to simplify the search:
is preprocessed with such sketching to end-up with a look-up table of short vectors on which the search can be realized quickly through counting the number of the exact or almost exact matches. Sketching needs there to guarantee that two close inputs would give with a good probability the same short vector. Examples of sketching methods are numerous for vector space (with Hamming distance or Euclidean distance) [
20‐
23]; for instance random projections on small subspace. In the private identification schemes [
7‐
9], the authors suggest to use a construction exploited in [
24] for iris biometry. This is adapted to binary vectors with Hamming distance comparison. The sketching functions are restriction of
n bits vectors over r ≪ n of their coordinates to obtain
r bits vectors:
Definition 2. Let be a family of function from {0,1}n to {0,1}r such that for x ∈ {0,1}n, we have for all i ∈ {1,...,μ}, . We say that is a sketching family for the Hamming distance from dimension n to dimension r.
With a sketching family where all functions are independent and if we assume that the inputs are uniformly distributed, the probability to obtain the same output with two distinct inputs can be estimated as follows.
In our construction, we rely on this idea for Hamming distance approximation combined with the embedding method from [
10,
11] of edit distance into the Hamming space.
As far privacy and security are concerned, private identification schemes are based on searchable encryption principle. The main goal of searchable encryption [
2,
25] is to store messages into an encrypted database while still enabling to search the messages related to some keywords. For instance this could correspond to a remote mailing service where the user wants to retrieve his messages which contain a given keyword, without letting the server learn information on the content of his mails. [
3] also uses such technique but only in a symmetric context. Following [
7]'s idea, we adapt an asymmetric searchable encryption scheme for our construction (cf. Section
Our construction).
A general solution to design a searchable encryption scheme is to associate a message to a set of keywords and to consider each keyword as a virtual address where the receiver can recover a link toward the associated messages. To manage all these relations in an efficient way, we follow [
1,
26,
27] by using Bloom filters. Bloom filter [
28] is a notion used in membership checking applications to reduce the memory cost of the data storage. We use an extension of this notion called
Bloom filters with storage. It enables to store identifiers of elements in each array.
Definition 3.
Bloom Filter with Storage, [
1] Let
be a finite subset of a space
E and a set of identifiers associated to
. For a family of v (independent and random) hash functions
, with each h
i:E→{1,...,k}, a (v,k)
-Bloom Filter with Storage for indexation of
is
, together with the array (t
1,...,t
k), defined recursively as:
2.
where Id(x) is the identifier of x.
In other words, the array is empty at the beginning and for each element, we add the identifier Id(x) of x at the cells indexed by h1(x),...,hv(x). To recover the identifiers associated to an element y, we compute . The following lemma describes the accuracy of this storage method.
Lemma 1. [
28] Let
be a (v,k)-Bloom filter with storage indexing
. For
, the following properties hold:
-
, i.e. the identifier of is always retrieved,
-
the probability Pr[t∈T(y) and t≠Id(y)] to obtain a false positive is
Edit distance approximation
Our construction is based on the embedding of edit distance into Hamming distance designed in [
10]. To solve problems such as those described in Section
Private identification schemes, data are embedded into Hamming space and then we can apply techniques dedicated to Hamming distance.
Definition 4. Let
and
be two metric spaces. An embedding
has a distortion
c if for all (x,y) ∈ E
1,
[
10] proves that {0,1}
N with edit distance can be embedded into ℓ
1 with small distortion
and then shows from a previous work [
20] how to end upefficiently into the Hamming space. More precisely:
Lemma 2. [
10] There exists a probabilistic polynomial time algorithm π and constants c
1,c
2 > 0 that, for every N ∈ ℕ, for every 4
-N ≫ δ > 0, and for all × ∈ {0,1}
N, computes
and such that for all (x,y) ∈ {0,1}
N, with probability at least 1 - δ,
where L1 denotes the distance L1.
The principle of the algorithm is to partition a string x into about
substrings. From each substring x
i, sets of all substrings (shingles) when taking a window of a fixed size
t are considered (i.e. all possible substrings of x
i formed by
t subsequent coordinates). By considering the metric defined by the minimum cost perfect matching algorithm between sets, [
10] then explains how such sets are embedded into ℓ
1. Note that this technique introduces a lot of redundancy in the substrings which are embedded and this increases the dimension by a factor at least N
2, but this is interesting for our construction as the distortion is very low and the algorithm remains polynomial in
N.
Based on [
20], the authors then show that there exist 0 < α < β < c
2 and an embedding Ψ from {0,1}
N with edit distance
ed to
with Hamming distance
HD that computes Ψ(x) = (x;t) for every t ∈ ℕ and such that with probability at least 1 - \delta:
-
If ed(x,y) ≤ t, then HD(ψ(x), ψ(y)) ≤ α log2(1/δ).
-
If then HD(ψ(x), ψ(y)) ≥ β log2(1/δ).