
About this Book

This, the 25th issue of Transactions on Large-Scale Data- and Knowledge-Centered Systems, contains five fully revised selected papers focusing on data and knowledge management systems. Topics covered include:

- a framework consisting of two heuristics with slightly different characteristics to compute the action rating of data stores;
- a theoretical and experimental study of filter-based equijoins in a MapReduce environment;
- a constraint programming approach based on constraint reasoning to study the view selection and data placement problem given a limited amount of resources;
- a formalization and an approximate algorithm to tackle the problem of source selection and query decomposition in federations of SPARQL endpoints;
- a matcher factory enabling the generation of a dedicated schema matcher for a given schema matching scenario.



On Expedited Rating of Data Stores

To rate a data store is to compute a value that describes the performance of the data store with a database and a workload. A common performance metric of interest is the highest throughput provided by the data store given a pre-specified service level agreement, such as 95% of requests observing a response time faster than 100 ms. This is termed the action rating of the data store. This paper presents a framework consisting of two search techniques with slightly different characteristics to compute the action rating. With both, the framework expedites the rating process by employing agile data loading techniques and strategies that reduce the duration of the conducted experiments. We show that these techniques expedite the rating of a data store by one to two orders of magnitude. The rating framework and its optimization techniques are implemented using a social networking benchmark named BG.
Sumita Barahmand, Shahram Ghandeharizadeh
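The core search can be illustrated with a minimal sketch. This is not the paper's framework or either of its two techniques; `simulated_latency_p95` is a hypothetical stand-in for running a real BG experiment, and a plain binary search over the offered load is used as the search strategy:

```python
# Minimal sketch: find the highest load whose 95th-percentile latency
# meets an SLA threshold, via binary search over the offered load.
# All numbers and the latency model are hypothetical.

def simulated_latency_p95(load):
    """Hypothetical stand-in for a benchmark run: p95 latency (ms)
    grows as the offered load approaches saturation."""
    return 10 + load ** 2 / 100.0

def action_rating(sla_ms, lo=1, hi=1024):
    """Binary-search the largest load level whose p95 latency meets the SLA."""
    best = 0
    while lo <= hi:
        mid = (lo + hi) // 2
        if simulated_latency_p95(mid) <= sla_ms:
            best = mid          # SLA met: try a higher load
            lo = mid + 1
        else:
            hi = mid - 1        # SLA violated: back off
    return best

print(action_rating(100))  # prints 94 under this latency model
```

Each probe of `simulated_latency_p95` corresponds to one (expensive) benchmark experiment, which is why the paper's techniques for shortening experiments matter: the search cost is the number of probes times the per-experiment duration.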

A Theoretical and Experimental Comparison of Filter-Based Equijoins in MapReduce

MapReduce has become an increasingly popular framework for large-scale data processing. However, complex operations such as joins are quite expensive and require sophisticated techniques. In this paper, we review state-of-the-art strategies for joining several relations in a MapReduce environment and study their extension with filter-based approaches. The general objective of filters is to eliminate non-matching data as early as possible in order to reduce the I/O, communication and CPU costs. We examine the impact of systematically adding filters as early as possible in MapReduce join algorithms, both analytically with cost models and practically with evaluations. The study covers binary joins, multi-way joins and recursive joins, and addresses the case of large inputs that gives rise to the most intricate challenges.
Thuong-Cang Phan, Laurent d’Orazio, Philippe Rigaux
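The filtering idea can be sketched outside MapReduce. The following is an illustrative toy, not the paper's algorithms: a small Bloom filter is built on the join keys of one relation and used to discard non-matching tuples of the other relation before the join step, mirroring the goal of eliminating non-matching data as early as possible:

```python
# Illustrative sketch: a tiny Bloom filter used to pre-filter one side
# of an equijoin. False positives are possible; false negatives are not.

import hashlib

class BloomFilter:
    def __init__(self, size=1024, hashes=3):
        self.size, self.hashes = size, hashes
        self.bits = bytearray(size)

    def _positions(self, key):
        for i in range(self.hashes):
            digest = hashlib.md5(f"{i}:{key}".encode()).hexdigest()
            yield int(digest, 16) % self.size

    def add(self, key):
        for p in self._positions(key):
            self.bits[p] = 1

    def might_contain(self, key):
        return all(self.bits[p] for p in self._positions(key))

def filtered_equijoin(r, s):
    """Join r and s (lists of (key, value) pairs) on key, pre-filtering
    s with a Bloom filter built on r's keys."""
    bf, index = BloomFilter(), {}
    for key, val in r:
        bf.add(key)
        index.setdefault(key, []).append(val)
    out = []
    for key, val in s:
        if not bf.might_contain(key):   # eliminated before the join step
            continue
        for rval in index.get(key, []):
            out.append((key, rval, val))
    return out

print(filtered_equijoin([("a", 1), ("b", 2)], [("a", "x"), ("c", "y")]))
# prints [('a', 1, 'x')]
```

In a MapReduce setting the payoff comes from applying such a filter before shuffling: tuples that cannot join are dropped on the map side, reducing I/O and communication, which is exactly the cost the paper models analytically.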

A Constraint Optimization Method for Large-Scale Distributed View Selection

View materialization is a commonly used technique in many data-intensive systems to improve query performance. The increasing need for large-scale data processing has led to investigating the view selection problem in complex distributed scenarios, where a set of cooperating computer nodes may share data and issue numerous queries. In our work, the view selection and data placement problem is studied given a limited amount of resources, e.g., storage capacity per computer node and a maximum view maintenance cost. We also consider the I/O and CPU costs of each computer node as well as the network bandwidth. To address this problem, we propose a constraint programming approach, which uses constraint reasoning to solve problems defined by a set of constraints. We then design a set of efficient heuristics that drastically reduce the solution space, so that the problem becomes solvable for complex scenarios with realistically large numbers of sites, queries and views. Our experimental study shows that our approach consistently outperforms a practical approach designed for large-scale distributed environments that uses a genetic algorithm to decide which view to materialize at which computer node.
Imene Mami, Zohra Bellahsene, Remi Coletta
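The shape of the problem can be shown with a toy instance. This is not the paper's constraint programming model (a real CP solver would propagate constraints rather than enumerate); it is an exhaustive search over the same kind of constraints, with entirely hypothetical sizes, costs and benefits, for a single node:

```python
# Toy view-selection instance for one node: pick the subset of views
# that maximizes query benefit subject to a storage capacity and a
# maintenance-cost budget. All numbers are hypothetical.

from itertools import combinations

VIEWS = {            # view -> (size, maintenance cost, query benefit)
    "v1": (4, 2, 10),
    "v2": (3, 1, 6),
    "v3": (5, 3, 8),
    "v4": (2, 2, 5),
}

def select_views(storage_cap, maint_budget):
    best, best_benefit = set(), -1
    names = list(VIEWS)
    for k in range(len(names) + 1):
        for subset in combinations(names, k):
            size  = sum(VIEWS[v][0] for v in subset)
            maint = sum(VIEWS[v][1] for v in subset)
            if size <= storage_cap and maint <= maint_budget:  # constraints
                benefit = sum(VIEWS[v][2] for v in subset)
                if benefit > best_benefit:
                    best, best_benefit = set(subset), benefit
    return best, best_benefit

print(select_views(9, 5))  # best subset {'v1', 'v2', 'v4'} with benefit 21
```

Enumeration is exponential in the number of views, and the full problem adds per-node placement and network costs; this is why the paper's heuristics for pruning the solution space are needed for realistically large scenarios.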

On the Selection of SPARQL Endpoints to Efficiently Execute Federated SPARQL Queries

We consider the problem of source selection and query decomposition in federations of SPARQL endpoints, where query decompositions of a SPARQL query should reduce execution time and maximize answer completeness. This problem is in general intractable, and both the performance and the answer completeness of SPARQL queries can be considerably affected as the number of SPARQL endpoints in a federation increases. We devise a formalization of this problem as the Vertex Coloring Problem and propose an approximate algorithm named Fed-DSATUR. We rely on existing results from graph theory to characterize the family of SPARQL queries for which Fed-DSATUR can produce optimal decompositions in time polynomial in the size of the query, i.e., in the number of SPARQL triple patterns in the query. Fed-DSATUR scales much better to SPARQL queries with a large number of triple patterns, and may exhibit significant improvements in performance while answer completeness remains close to 100%. More importantly, we put our results in perspective, and provide evidence of SPARQL queries that are hard to decompose and constitute new challenges for data management.
Maria-Esther Vidal, Simón Castillo, Maribel Acosta, Gabriela Montoya, Guillermo Palma
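Fed-DSATUR builds on the classic DSATUR greedy coloring heuristic, which can be sketched in a few lines. This is plain DSATUR on an abstract graph, not the federated variant from the paper: at each step it colors the uncolored vertex with the highest saturation degree, i.e., the largest number of distinct colors already used among its neighbors:

```python
# Classic DSATUR greedy graph coloring (illustrative; not Fed-DSATUR).
# graph: dict mapping each vertex to the set of its neighbors.

def dsatur(graph):
    """Return a dict vertex -> color (colors are 0, 1, 2, ...)."""
    colors = {}
    while len(colors) < len(graph):
        def saturation(v):
            # number of distinct colors among v's already-colored neighbors
            return len({colors[n] for n in graph[v] if n in colors})
        # most saturated uncolored vertex; ties broken by degree
        v = max((u for u in graph if u not in colors),
                key=lambda u: (saturation(u), len(graph[u])))
        used = {colors[n] for n in graph[v] if n in colors}
        colors[v] = next(c for c in range(len(graph)) if c not in used)
    return colors

triangle = {"a": {"b", "c"}, "b": {"a", "c"}, "c": {"a", "b"}}
print(len(set(dsatur(triangle).values())))  # prints 3
```

In the paper's formalization, vertices correspond to SPARQL triple patterns and colors to subqueries, so a coloring induces a query decomposition; DSATUR's greedy order is what makes the decomposition computable in polynomial time.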

YAM: A Step Forward for Generating a Dedicated Schema Matcher

Discovering correspondences between schema elements is a crucial task for data integration. Most schema matching tools are semi-automatic, e.g., an expert must tune certain parameters (thresholds, weights, etc.). They mainly use aggregation methods to combine similarity measures. The tuning of a matcher, especially of its aggregation function, has a strong impact on the matching quality of the resulting correspondences, and makes it difficult to integrate a new similarity measure or to match schemas from a specific domain. In this paper, we present YAM (Yet Another Matcher), a matcher factory which enables the generation of a dedicated schema matcher for a given schema matching scenario. For this purpose we formulate the schema matching task as a classification problem. Based on this machine learning framework, YAM automatically selects and tunes the best method to combine similarity measures (e.g., a decision tree, an aggregation function). In addition, we describe how user inputs, such as a preference for recall or precision, can be closely integrated during the generation of the dedicated matcher. Extensive experiments comparing matchers generated by YAM with traditional matching tools confirm the benefits of a matcher factory and the significant impact of user preferences.
Fabien Duchateau, Zohra Bellahsene
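The classification view of matching can be sketched as follows. This is not a matcher YAM would generate; the similarity measures and the hand-written decision rule below are illustrative stand-ins for the learned combination (YAM's point is precisely that this rule should be learned, not hand-tuned):

```python
# Illustrative sketch: each pair of schema element names becomes a
# feature vector of similarity measures, and a hand-written
# decision-tree-style rule classifies the pair as match / no match.

from difflib import SequenceMatcher

def features(a, b):
    a, b = a.lower(), b.lower()
    ta, tb = set(a.split("_")), set(b.split("_"))
    return {
        "edit":   SequenceMatcher(None, a, b).ratio(),   # string similarity
        "prefix": float(a[:3] == b[:3]),                 # shared 3-char prefix
        "tokens": len(ta & tb) / max(len(ta | tb), 1),   # token overlap
    }

def is_match(a, b):
    f = features(a, b)
    if f["tokens"] > 0.5:               # strong token overlap dominates
        return True
    if f["edit"] > 0.8:                 # near-identical spellings
        return True
    return f["prefix"] == 1.0 and f["edit"] > 0.6

print(is_match("user_name", "username"))  # prints True
print(is_match("id", "address"))          # prints False
```

Replacing the hand-written `is_match` with a classifier trained on labeled correspondences, chosen and tuned per scenario, is the role of the matcher factory; a recall/precision preference then shifts the classifier's decision threshold.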
