In many applications, for example, in data integration scenarios, strings must be matched if they are similar. String similarity joins, which match all pairs of similar strings from two datasets, are of particular interest and have recently received much attention in the database research community. Most approaches, however, assume a global similarity threshold; all string pairs that exceed the threshold form a match in the join result. The global threshold approach has two major problems: (a) the threshold depends on the (mostly unknown) data distribution, (b) often there is no single threshold that is good for all string pairs.
In this paper we propose the
algorithm, a novel string similarity join that requires no configuration and uses an
. PG-Join computes a so-called proximity graph to derive an individual threshold for each string. Computing the proximity graph efficiently is essential for the scalability of PG-Join. To this end we develop a new and fast algorithm,
, that computes the proximity graph in two steps: First an efficient approximation is computed, then the approximation error is fixed incrementally until the adaptive threshold is stable. Our extensive experiments on real-world and synthetic data show that PG-I is up to five times faster than the state-of-the-art algorithm and suggest that PG-Join is a useful and effective join paradigm.