A string similarity join finds all pairs of similar strings across two sets of strings. It arises in many applications, such as duplicate detection, data integration, and data cleaning, and various algorithms have been proposed to improve its efficiency. Partition-based filtering methods, such as Pass-JOIN, are promising: they quickly identify candidate pairs by searching for the partitioned segments of one string within another string, in order of increasing length, and then verify each candidate using edit distance. We observe that filtering in different directions produces different candidate sets, which motivates a bi-directional filtering mechanism. This paper proposes a novel bi-directional filtering mechanism that strengthens filtering by pipelining the results of forward filtering into backward filtering. The substring selection method of Pass-JOIN is adapted for the backward filtering. Experimental results show that the proposed bi-directional filtering algorithm outperforms the original algorithm on real-world datasets.
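The filter-then-verify idea can be illustrated with a small sketch. It is not the algorithm from the paper or from Pass-JOIN. It assumes an edit-distance threshold `tau`, an even partition into `tau + 1` segments, and a naive substring test in place of Pass-JOIN's substring selection and inverted indexes. The function names (`partition`, `passes_filter`, `similarity_join`) are hypothetical. The filter relies on the pigeonhole principle: if two strings are within `tau` edits, at least one of the `tau + 1` segments of one string must occur unchanged in the other. Applying the test in both directions can only shrink the candidate set.

```python
def edit_distance(a, b):
    """Classic dynamic-programming Levenshtein distance."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        cur = [i]
        for j, cb in enumerate(b, 1):
            cur.append(min(prev[j] + 1,                # delete ca
                           cur[j - 1] + 1,             # insert cb
                           prev[j - 1] + (ca != cb)))  # substitute or match
        prev = cur
    return prev[-1]


def partition(s, tau):
    """Split s into tau + 1 contiguous segments of near-equal length
    (assumed scheme: any longer segments are placed last)."""
    k = tau + 1
    base, extra = divmod(len(s), k)
    parts, pos = [], 0
    for i in range(k):
        size = base + (1 if i >= k - extra else 0)
        parts.append(s[pos:pos + size])
        pos += size
    return parts


def passes_filter(r, s, tau):
    """True if some segment of r occurs anywhere in s (naive search)."""
    return any(seg in s for seg in partition(r, tau))


def similarity_join(R, S, tau):
    """Nested-loop join: length filter, forward filter, backward filter,
    then edit-distance verification of the surviving candidates."""
    results = []
    for r in R:
        for s in S:
            if abs(len(r) - len(s)) > tau:
                continue  # length filter
            if not passes_filter(r, s, tau):
                continue  # forward direction: segments of r searched in s
            if not passes_filter(s, r, tau):
                continue  # backward direction: segments of s searched in r
            if edit_distance(r, s) <= tau:
                results.append((r, s))
    return results
```

With `tau = 1`, the pair `("abcd", "xabyz")` passes the forward test (the segment `"ab"` occurs in `"xabyz"`) but fails the backward test (neither `"xa"` nor `"byz"` occurs in `"abcd"`), which shows how the two directions can yield different candidate sets.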
- A Partition-Based Bi-directional Filtering Method for String Similarity JOINs