2011 | Book

Data Management and Query Processing in Semantic Web Databases

Written by: Sven Groppe

Publisher: Springer Berlin Heidelberg

Included in: Springer Professional "Wirtschaft+Technik", Springer Professional "Technik", Springer Professional "Wirtschaft"

About this book

The Semantic Web, which is intended to establish a machine-understandable Web, is currently changing from being an emerging trend to a technology used in complex real-world applications. A number of standards and techniques have been developed by the World Wide Web Consortium (W3C), e.g., the Resource Description Framework (RDF), which provides a general method for conceptual descriptions for Web resources, and SPARQL, an RDF querying language. Recent examples of large RDF data with billions of facts include the UniProt comprehensive catalog of protein sequence, function and annotation data, the RDF data extracted from Wikipedia, and Princeton University’s WordNet. Clearly, querying performance has become a key issue for Semantic Web applications.

In his book, Groppe details various aspects of high-performance Semantic Web data management and query processing. His presentation fills the gap between Semantic Web and database books, which either fail to take into account the performance issues of large-scale data management or fail to exploit the special properties of Semantic Web data models and queries. After a general introduction to the relevant Semantic Web standards, he presents specialized indexing and sorting algorithms, adapted approaches for logical and physical query optimization, optimization possibilities when using the parallel database technologies of today’s multicore processors, and visual and embedded query languages.

Groppe primarily targets researchers, students, and developers of large-scale Semantic Web applications. On the companion webpage for the book, readers will find additional material, such as an online demonstration of a query engine and exercises with solutions that test their comprehension of the topics presented.

Table of Contents

Frontmatter

Chapter 1. Introduction

Abstract

The current World Wide Web (Web for short) enables easy, instant access to a vast amount of online information. However, content on the Web is typically intended for human consumption and is not tailored to machine processing.

Sven Groppe

Chapter 2. Semantic Web

Abstract

The Semantic Web provides languages to define data, queries, ontologies, and rules. This chapter introduces them after a short motivation and overview of the Semantic Web.

Sven Groppe

Chapter 3. External Sorting and B+-Trees

Abstract

Today’s Semantic Web datasets are becoming increasingly large, with some containing over one billion triples. The performance of index construction is a crucial factor for the success of large Semantic Web databases. Large-scale indices are typically constructed from externally sorted data. In this chapter, in addition to reviewing the B⁺-tree data structure and traditional external sort algorithms, we propose two new external sort approaches: external chunks-merge sort and distribution sort for RDF. The former stores and retrieves chunks from a special chunks heap in order to speed up replacement selection. The latter leverages RDF-specific properties to construct RDF indices and significantly improves the performance of index construction. Our experimental results show that our approaches significantly speed up RDF index construction and are important techniques for large Semantic Web databases.
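
As background for the traditional external sort algorithms mentioned in this abstract, the following is a minimal sketch of classic replacement selection for run generation, followed by a merge of the runs. It is not the chunks-merge sort or the distribution sort proposed in the chapter. The function name `replacement_selection` and the parameter `memory_size` are assumptions made for illustration, and the runs are kept in memory instead of being written to disk.

```python
import heapq

_END = object()  # sentinel marking the end of the input


def replacement_selection(items, memory_size):
    """Split `items` into sorted runs using a heap of at most `memory_size` entries.

    Heap entries are (run_number, value). A value that is smaller than the
    last value written to the current run is tagged for the next run, so it
    stays in the heap until the current run is finished.
    """
    source = iter(items)
    heap = []
    for value in source:  # fill the available "memory" (memory_size >= 1)
        heap.append((0, value))
        if len(heap) == memory_size:
            break
    heapq.heapify(heap)

    runs, current, current_run = [], [], 0
    while heap:
        run, value = heapq.heappop(heap)
        if run != current_run:  # the current run is exhausted
            runs.append(current)
            current, current_run = [], run
        current.append(value)
        incoming = next(source, _END)
        if incoming is not _END:
            # The incoming value joins the current run only if it keeps it sorted.
            target_run = run if incoming >= value else run + 1
            heapq.heappush(heap, (target_run, incoming))
    if current:
        runs.append(current)
    return runs


data = [5, 1, 9, 2, 8, 3, 7]
runs = replacement_selection(data, memory_size=3)
merged = list(heapq.merge(*runs))  # merge phase of the external sort
```

With a heap of three entries, the seven input values end up in two runs, the first of which is longer than the heap itself. This is the property that makes replacement selection attractive for run generation.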

Sven Groppe

Chapter 4. Query Processing Overview

Abstract

We first present our LUPOSDATE system, including its indexing methods for data management and its query engines for query evaluation. Afterward, we describe at a high level the different phases of query processing performed by these query engines. In this chapter, we describe in detail the phase that eliminates redundant language constructs of SPARQL queries. The other (more complex) phases are described in detail in their own chapters.

Sven Groppe

Chapter 5. Logical Optimization

Abstract

In this chapter, we first introduce an algebra for SPARQL queries and define the semantics of query evaluation. We then present equivalency rules for optimizing query processing, together with a heuristic approach to query optimization based on these rules. Next, we deal with query optimizers that enumerate all possible query plans and choose the one with the best estimated cost. Finally, we describe how to employ histograms to estimate the cardinality of operator results as a basis for cost estimation.
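
To illustrate the general idea of histogram-based cardinality estimation, here is a minimal sketch using an equi-width histogram over numeric values. The abstract does not state which kind of histogram the book uses, so the bucket layout, the uniformity assumption inside each bucket, and the names `build_histogram` and `estimate_range` are all assumptions made for illustration.

```python
def build_histogram(values, num_buckets):
    """Build an equi-width histogram as (lowest value, bucket width, counts)."""
    lo, hi = min(values), max(values)
    width = (hi - lo) / num_buckets or 1.0  # avoid a zero width if all values are equal
    counts = [0] * num_buckets
    for v in values:
        # The maximum value is clamped into the last bucket.
        idx = min(int((v - lo) / width), num_buckets - 1)
        counts[idx] += 1
    return lo, width, counts


def estimate_range(histogram, a, b):
    """Estimate how many values fall into [a, b).

    Assumes values are uniformly distributed inside each bucket, so a bucket
    contributes its count scaled by the fraction of the bucket that overlaps.
    """
    lo, width, counts = histogram
    estimate = 0.0
    for i, count in enumerate(counts):
        bucket_lo = lo + i * width
        bucket_hi = bucket_lo + width
        overlap = max(0.0, min(b, bucket_hi) - max(a, bucket_lo))
        estimate += count * overlap / width
    return estimate


values = list(range(100))
histogram = build_histogram(values, num_buckets=10)
half = estimate_range(histogram, 0, 49.5)  # covers the first five buckets
```

A cost-based optimizer could compare such estimates for alternative plans without touching the data itself. The estimate is only as good as the uniformity assumption.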

Sven Groppe

Chapter 6. Physical Optimization

Abstract

Different algorithms exist to compute the result of a logical operator like AND, OPT, or SORT. A physical operator implements one of these algorithms. Physical operators sometimes place different constraints on their input data, for example, that the input must be sorted, or are faster than others for special types of input data, for example, when the input data fit into main memory. The context of an operator can be described by estimates of the properties of its input data. For each (logical) operator in the operator graph, physical optimization aims to choose the physical operator with the best estimated execution time in the operator’s context.

In addition to describing the physical operators, in this chapter we present our new approaches to efficient RDF data management and join optimization, both for small datasets and for large-scale datasets with over one billion triples.

For small datasets, where the data can be indexed in main memory, in-memory indices can significantly speed up query processing because (after loading the data) no disk accesses are needed for query processing. B⁺-trees are well suited to disk-based indices of large-scale datasets because they are optimized for blockwise sequential disk access. For main-memory indices, hash indices are preferable because an index access takes constant time: only a hash function must be applied to the key to retrieve the (main memory) address of the indexed element. Therefore, we use hash indices to manage small RDF datasets. Based on the triple nature of RDF data, we create seven hash indices in order to retrieve in-memory RDF data quickly. On the basis of the SPARQL-specific properties and the seven indices, we develop a new, efficient approach to computing joins by dynamically restricting triple patterns. A performance evaluation demonstrates that the new approach outperforms other state-of-the-art in-memory databases.
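
The following is a minimal sketch of how seven in-memory hash indices over triples might look. The abstract does not list the seven key combinations. The sketch assumes one index per non-empty subset of the positions subject, predicate, and object (S, P, O, SP, SO, PO, SPO). The class `TripleStore` and its methods are hypothetical names, not the LUPOSDATE API.

```python
from collections import defaultdict
from itertools import combinations

POSITIONS = (0, 1, 2)  # subject, predicate, object


class TripleStore:
    """In-memory triple store with one hash index per non-empty key subset."""

    def __init__(self):
        self.triples = []
        # Assumption: 3 single-position + 3 two-position + 1 full index = 7.
        self.indices = {
            combo: defaultdict(list)
            for size in (1, 2, 3)
            for combo in combinations(POSITIONS, size)
        }

    def add(self, triple):
        self.triples.append(triple)
        for combo, index in self.indices.items():
            key = tuple(triple[i] for i in combo)
            index[key].append(triple)

    def match(self, pattern):
        """Return triples matching `pattern`, where None stands for a variable."""
        bound = tuple(i for i in POSITIONS if pattern[i] is not None)
        if not bound:  # no bound position: every triple matches
            return list(self.triples)
        key = tuple(pattern[i] for i in bound)
        # One hash lookup in the index that fits the bound positions.
        return list(self.indices[bound].get(key, []))


store = TripleStore()
store.add(("Nils", "childOf", "Sven"))
store.add(("Sven", "childOf", "Josef"))
store.add(("Sven", "livesIn", "Luebeck"))
children_of_sven = store.match((None, "childOf", "Sven"))
```

Whichever positions of a triple pattern are bound, exactly one index answers it with a single hash lookup. This comes at the cost of storing each triple reference seven times.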

Since Semantic Web datasets are becoming increasingly large, developing efficient techniques to speed up querying large-scale Semantic Web data is a key issue for Semantic Web applications. From relational database research, merge joins are known to be the fastest join algorithms on large-scale data when the data are already sorted. Therefore, recent approaches focus on presorting Semantic Web data during index construction, so that the fast merge join can be used without a sorting phase at runtime for some joins. When the data for succeeding joins become unsorted, the hash join is typically used. In this chapter, we propose a sorting numbering scheme for large RDF datasets, based on which we can quickly sort any intermediate and final query results. Applying our sorting numbering scheme, all joins can be computed using the merge join with a fast sorting phase. Besides being a significant benefit to merge joins, our fast sorting technique can also remarkably speed up the elimination of duplicates. Our experiments show that a merge join using our fast sorting technique greatly outperforms the hash join and that our sorting numbering scheme, integrated into any index approach, significantly speeds up querying large-scale Semantic Web data.
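
For readers unfamiliar with the merge join, here is a minimal textbook sketch on two inputs that are already sorted by the join key. It does not implement the sorting numbering scheme proposed in the chapter. The integer keys merely stand in for values whose order is cheap to compare, and the name `merge_join` and its parameters are assumptions.

```python
def merge_join(left, right, key_left, key_right):
    """Join two lists that are both sorted by their join key.

    Each input is scanned once; groups of equal keys are combined pairwise.
    """
    result = []
    i = j = 0
    while i < len(left) and j < len(right):
        kl, kr = key_left(left[i]), key_right(right[j])
        if kl < kr:
            i += 1
        elif kl > kr:
            j += 1
        else:
            # Find the end of the group of equal keys on the right side.
            j_end = j
            while j_end < len(right) and key_right(right[j_end]) == kl:
                j_end += 1
            # Pair every left row of this key with the whole right group.
            while i < len(left) and key_left(left[i]) == kl:
                for row in right[j:j_end]:
                    result.append((left[i], row))
                i += 1
            j = j_end
    return result


left = [(1, "a"), (2, "b"), (2, "c"), (4, "d")]
right = [(2, "x"), (2, "y"), (3, "z"), (4, "w")]
joined = merge_join(left, right, key_left=lambda r: r[0], key_right=lambda r: r[0])
```

Because both inputs are consumed in key order, the output is also sorted by the join key and can feed a further merge join on the same key without re-sorting.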

Sven Groppe

Chapter 7. Streams

Abstract

Data streams are becoming an important concept and are used in more and more applications. Processing data streams requires a streaming engine, which can start query processing as soon as initial data are available. This capability is especially important for real-time computation and for long-relay transmission of data streams. In this chapter, we introduce stream processing by demonstrating a monitoring system for eBay auctions, which is based on our RDF stream engine and can analyze eBay auctions in a flexible way. Using our monitoring system, users can easily monitor the eBay auction information of interest, analyze the behavior of buyers and sellers, predict the tendency of auctions, and make more favorable decisions.
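
The point that a streaming engine can start processing once initial data are available can be sketched with lazy generators. This is not the RDF stream engine described in the chapter. The event data, the property names `hasSeller` and `hasBid`, and the functions `match_stream` and `auction_events` are invented for illustration.

```python
def match_stream(triples, pattern):
    """Lazily yield triples matching `pattern`; None stands for a variable."""
    for triple in triples:
        if all(p is None or p == v for p, v in zip(pattern, triple)):
            yield triple


def auction_events(log):
    """Hypothetical event source; `log` records which events were consumed."""
    events = [
        ("auction1", "hasSeller", "alice"),
        ("auction1", "hasBid", 10),
        ("auction1", "hasBid", 12),
    ]
    for event in events:
        log.append(event)
        yield event


consumed = []
bids = match_stream(auction_events(consumed), ("auction1", "hasBid", None))
first_bid = next(bids)  # available before the rest of the stream is read
consumed_at_first_result = len(consumed)
```

The first result is produced after only two of the three events have been read, which is the behavior a blocking, load-everything-first engine cannot offer.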

Sven Groppe

Chapter 8. Parallel Databases

Abstract

While a number of optimization techniques have been developed to efficiently process increasingly large Semantic Web databases, these approaches have not fully leveraged the powerful computation capability of modern computers. Today’s multicore computers promise an enormous performance boost by providing a parallel computing platform. Although parallel relational database systems are well established, parallel query computation in Semantic Web databases has not been studied extensively. In this work, we develop parallel algorithms for join computations of SPARQL queries. Our performance study shows that the parallel computation of SPARQL queries significantly speeds up querying large Semantic Web databases.
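
One common way to parallelize a join, shown here only as a structural sketch, is to partition both inputs by a hash of the join key and join the partitions independently. The abstract does not say which parallel join algorithms the chapter develops, so this partitioned hash join is an assumption, as are the names `hash_join` and `parallel_join`. Python threads illustrate the structure but will not give a real speedup for this CPU-bound work.

```python
from concurrent.futures import ThreadPoolExecutor


def hash_join(left, right):
    """Join two lists of (key, value) pairs on the key with a hash table."""
    table = {}
    for key, value in left:
        table.setdefault(key, []).append(value)
    return [(key, lv, rv) for key, rv in right for lv in table.get(key, [])]


def parallel_join(left, right, workers=4):
    """Partition both inputs by hash(key) and join the partitions in parallel.

    Rows with equal keys always land in the same partition, so the union of
    the partition results equals the result of the sequential join.
    """
    left_parts = [[] for _ in range(workers)]
    right_parts = [[] for _ in range(workers)]
    for key, value in left:
        left_parts[hash(key) % workers].append((key, value))
    for key, value in right:
        right_parts[hash(key) % workers].append((key, value))
    with ThreadPoolExecutor(max_workers=workers) as pool:
        partial_results = pool.map(hash_join, left_parts, right_parts)
        return [row for part in partial_results for row in part]


left = [(k % 7, f"l{k}") for k in range(20)]
right = [(k % 5, f"r{k}") for k in range(20)]
parallel_result = sorted(parallel_join(left, right))
sequential_result = sorted(hash_join(left, right))
```

The partitions share no data, so no locking is needed during the join phase. The only coordination is the final concatenation of the partial results.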

Sven Groppe

Chapter 9. Inference

Abstract

Data contain given facts, which are explicitly expressed. If we have the facts that Nils is a child of Sven and Sven is a child of Josef, then we as humans know that Josef is the grandparent of Nils; this is called implicit knowledge. However, machines cannot process implicit knowledge as humans can. Machines must be told how to transform implicit knowledge into explicit knowledge, that is, into facts, so that they can process it. The transformation from implicit knowledge to explicit knowledge is often expressed by rules. The application of rules to determine new facts is called inference. Inference is a costly operation, often leading to higher costs than query processing. In this chapter, we propose different materialization strategies for inferred facts to optimize query processing on inferred facts and examine their performance gains.
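
The grandparent example from this abstract can be written as a single rule applied to a set of triples. The sketch below is a minimal illustration of rule-based inference with full materialization. The property names `childOf` and `grandparentOf` and the function names are assumptions, and the chapter's own materialization strategies are not reproduced here.

```python
def infer_grandparents(facts):
    """Apply the rule: X childOf Y and Y childOf Z  =>  Z grandparentOf X."""
    child_of = [(s, o) for s, p, o in facts if p == "childOf"]
    return {
        (grandparent, "grandparentOf", child)
        for child, parent in child_of
        for middle, grandparent in child_of
        if parent == middle
    }


def materialize(facts):
    """Store the inferred facts next to the given ones (full materialization)."""
    return set(facts) | infer_grandparents(facts)


facts = {
    ("Nils", "childOf", "Sven"),
    ("Sven", "childOf", "Josef"),
}
inferred = infer_grandparents(facts)
all_facts = materialize(facts)
```

Once the inferred triple is materialized, a query for grandparents becomes an ordinary lookup. The price is extra storage and the need to update the inferred facts when the data change.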

Sven Groppe

Chapter 10. Visual Query Languages

Abstract

The social web is becoming increasingly popular and important because it creates collective intelligence, which can produce more value than the sum of individual contributions. The social web uses the Semantic Web technology RDF to describe social data in a machine-readable way. RDF query languages certainly play an important role in social data analysis for extracting this collective intelligence. However, constructing such queries is not trivial, because social data are often quite large and assembled from a large number of different sources, and because structural information such as ontologies is lacking. To address these challenges, we develop a Visual Query System (VQS) that helps analysts of social data and other semantic data to formulate such queries easily and exactly. In this VQS, we suggest a condensed data view, a browser-like query creation system for absolute beginners, and a Visual Query Language (VQL) for beginners and experienced users. Using the browser-like query creation or the VQL, analysts of social data and other semantic data can construct queries with little or no knowledge of the syntax; using the condensed view, they can easily determine which queries should be used. Furthermore, our system supports a number of other functionalities, for example, precise suggestions to extend and refine existing queries. An online demonstration of our VQS is publicly available at http://www.ifis.uni-luebeck.de/index.php?id=luposdate-demo.

Sven Groppe

Chapter 11. Embedded Languages

Abstract

The state of the art in programming Semantic Web applications is to use the complex application programming interfaces of Semantic Web frameworks. Extensive tests are necessary to detect errors, although many types of errors could already be detected at compile time. In this chapter, we propose an embedding of Semantic Web languages into the Java programming language, such that Semantic Web data and queries are easily integrated into the program code, type safety is guaranteed, and, already at compile time, syntax errors in Semantic Web data and queries are reported and unsatisfiable queries are detected.

Sven Groppe

Chapter 12. Comparison of the XML and Semantic Web Worlds

Abstract

XML and the Semantic Web cover many specifications of languages for the web, which can be used for similar applications. We compare both worlds, the Semantic Web one and the XML one, and show how to transform queries and data from one to the other. We also provide a comprehensive performance analysis for translated queries.

Sven Groppe

Chapter 13. Summary, Conclusions, and Future Work

Abstract

This book covers a wide range of topics in the area of the Semantic Web related to query processing.

Sven Groppe

Backmatter

Title: Data Management and Query Processing in Semantic Web Databases
Written by: Sven Groppe
Publisher: Springer Berlin Heidelberg
Electronic ISBN: 978-3-642-19357-6
Print ISBN: 978-3-642-19356-9
DOI: https://doi.org/10.1007/978-3-642-19357-6