With the ever-increasing growth of the internet, more and more data sets are being made available. Most of this data has its origin in the real world, often describing the same objects or events from different viewpoints. One can thus consider data sets obtained from different sources as different (and possibly inconsistent) views of our world, and it makes sense to try to integrate them in some form, e.g. to answer questions which involve data from multiple sources. While data integration is an old and well-investigated subject, the nature of the data sets to be integrated is changing. They increase in volume as well as complexity, are often undocumented, relationships between data sets are more fuzzy, and representations of the same real-word object differ. To address these challenges, new methods for rapid, semi-automatic, loose and virtual integration, exploration and querying of large families of data sets must be developed.
In an ongoing project we are investigating a framework for sampling and matching data sets in an efficient manner. In particular, we consider the problem of creating and analyzing samples of relational databases to find relationships between string-valued attributes . Our focus is on identifying attribute pairs whose value sets overlap, a pre-condition for typical joins over such attributes. We deal with the issue of different representation of objects, i.e., ‘dirty’ data, by employing new similarity measures between sets of strings, which not only consider set based similarity, but also similarity between string instances. To make the measures effective, especially in light of data sets being large and distributed, we developed efficient algorithms for distributed sample creation and similarity computation. Central to this is that sampling is
. For clean data this means that the same values are sampled for each set, if present [2,3]. For dirty data one must ensure that similar values are sampled for each set, if present, and we manage to do so in a probabilistic manner.
The next step of our research is to extend such a sampling and matching approach to multiple attributes and semi-structured data, and to construct search and query systems which make direct use of the matches discovered.