2011 | OriginalPaper | Chapter
PG-Join: Proximity Graph Based String Similarity Joins
Authors : Michail Kazimianec, Nikolaus Augsten
Published in: Scientific and Statistical Database Management
Publisher: Springer Berlin Heidelberg
Activate our intelligent search to find suitable subject content or patents.
Select sections of text to find matching patents with Artificial Intelligence. powered by
Select sections of text to find additional relevant content using AI-assisted search. powered by
In many applications, for example, in data integration scenarios, strings must be matched if they are similar. String similarity joins, which match all pairs of similar strings from two datasets, are of particular interest and have recently received much attention in the database research community. Most approaches, however, assume a global similarity threshold; all string pairs that exceed the threshold form a match in the join result. The global threshold approach has two major problems: (a) the threshold depends on the (mostly unknown) data distribution, (b) often there is no single threshold that is good for all string pairs.
In this paper we propose the
PG-Join
algorithm, a novel string similarity join that requires no configuration and uses an
adaptive threshold
. PG-Join computes a so-called proximity graph to derive an individual threshold for each string. Computing the proximity graph efficiently is essential for the scalability of PG-Join. To this end we develop a new and fast algorithm,
PG-I
, that computes the proximity graph in two steps: First an efficient approximation is computed, then the approximation error is fixed incrementally until the adaptive threshold is stable. Our extensive experiments on real-world and synthetic data show that PG-I is up to five times faster than the state-of-the-art algorithm and suggest that PG-Join is a useful and effective join paradigm.