
2013 | Book

String Processing and Information Retrieval

20th International Symposium, SPIRE 2013, Jerusalem, Israel, October 7-9, 2013, Proceedings

Edited by: Oren Kurland, Moshe Lewenstein, Ely Porat

Publisher: Springer International Publishing

Book series: Lecture Notes in Computer Science


About this book

This book constitutes the refereed proceedings of the 20th International Symposium on String Processing and Information Retrieval, SPIRE 2013, held in Jerusalem, Israel, in October 2013. The 18 full papers and 10 short papers were carefully reviewed and selected from 60 submissions. The program also featured 4 keynote speeches. The following topics are covered: fundamental algorithms in string processing and information retrieval; SP and IR techniques as applied to areas such as computational biology, DNA sequencing, and Web mining.

Table of Contents

Frontmatter
Consolidating and Exploring Information via Textual Inference

Effectively consuming information from large amounts of texts, which are often largely redundant in their content, is an old but increasingly pressing challenge. It is well illustrated by the perpetual attempts to move away from the flat result lists of search engines towards more structured fact-based presentations. Some recent attempts at this challenge are based on presenting structured information that was formulated according to pre-defined knowledge schemes, such as Freebase and Google’s knowledge graph. We propose an alternative, as well as complementary, approach that attempts to consolidate and structure all textual statements in a document collection based on the inference relations between them. Generic textual inference techniques, formulated under the Textual Entailment paradigm, are used to consolidate redundant information into unique “core” statements, and then present them in an intuitive general-to-specific hierarchy. The talk will review some of the underlying concepts and algorithms behind our approach and present an initial demo.

Ido Dagan
Pattern Discovery and Listing in Graphs

Graphs are gaining increasing popularity in many application domains as they have the potential of modeling binary relations among entities. Along with textual and multimedia data, they are the main sources for producing large data sets. It is natural to ask how easy it is to extend the notion of patterns typically found in string matching and sequence analysis to graphs and real-life networks. Unfortunately, even the basic problem of finding a longest simple path in a graph is NP-hard, since it can establish whether the graph is Hamiltonian. Also, the number of patterns can be exponentially large in the size of the graph, thus listing them is a challenge. We will discuss some output-sensitive and parameterized algorithms for listing patterns that are paths, cycles and trees, and provide a notion of "certificate" to attain this goal. This is joint work with Rui Ferreira.

Roberto Grossi
Efficient Approximation of Edit Distance

The similarity between two strings is often measured by some variant of edit distance, depending on the intended application. But even the basic version of this distance, which is simply the minimum number of character insertions, deletions and substitutions needed to transform one string to the other, presents remarkable algorithmic challenges.
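As a baseline for what is being approximated, the exact distance is given by the classical dynamic program; a minimal sketch (not the approximation algorithms discussed in the talk):

```python
def edit_distance(a, b):
    """Classical O(|a|*|b|) dynamic program: minimum number of character
    insertions, deletions and substitutions transforming a into b."""
    prev = list(range(len(b) + 1))  # distances from the empty prefix of a
    for i, ca in enumerate(a, 1):
        cur = [i]
        for j, cb in enumerate(b, 1):
            cur.append(min(prev[j] + 1,                 # delete ca
                           cur[j - 1] + 1,              # insert cb
                           prev[j - 1] + (ca != cb)))   # substitute or match
        prev = cur
    return prev[-1]
```

The quadratic cost of this exact computation is precisely what motivates the approximation algorithms surveyed in the talk.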

This talk will examine the task of approximating the basic edit distance between two strings, starting with the classical RAM model and moving on to computational models which impose further constraints, such as the query complexity model and the sketching model. Beyond their concrete applications, these investigations provide a wealth of information about the problem, teaching us state of the art techniques and uncovering the limitations of certain methodologies. We will then come full circle with improved algorithms for the classical RAM model. During this journey, we may encounter special cases like permutation strings, whose patterns are all distinct, and smoothed instances, which are a mixture of worst-case and average-case inputs.

Finally, we shall discuss known gaps and open problems in the area, including variants of the basic edit distance, such as allowing block moves, or edit distance between trees, and hopefully touch upon related computational problems like nearest-neighbor search.

Robert Krauthgamer
Nowcasting with Google Trends

Since launching Google Trends we have seen extensive interest in what can be learned from search trends. A plethora of studies have shown how to use search trends data for effective nowcasting in diverse areas such as health, finance, economics, politics and more.

We give an overview of Google Trends and Nowcasting, highlighting some exciting Big Data challenges, including large scale engineering, effective data analysis, and domain specific considerations.

An extended summary will be available at

http://goo.gl/FbYh9

Yossi Matias
Space-Efficient Construction of the Burrows-Wheeler Transform

The Burrows-Wheeler transform (BWT), originally invented for data compression, is nowadays also the core of many self-indexes, which can be used to solve many problems in bioinformatics. However, the memory requirement during the construction of the BWT is often the bottleneck in applications in the bioinformatics domain.

In this paper, we present a linear-time semi-external algorithm whose memory requirement is only about one byte per input symbol. Our experiments show that this algorithm provides a new time-memory trade-off between external and in-memory construction algorithms.
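For intuition about the transform itself (not the paper's space-efficient construction), the BWT of a short string can be obtained by sorting all rotations; a quadratic sketch:

```python
def bwt(text, sentinel="$"):
    """Burrows-Wheeler transform via sorting all rotations.
    O(n^2 log n) and memory-hungry -- for illustration only; practical
    constructions (like the paper's semi-external algorithm) use
    roughly one byte per input symbol."""
    s = text + sentinel  # unique, lexicographically smallest end marker
    rotations = sorted(s[i:] + s[:i] for i in range(len(s)))
    return "".join(rot[-1] for rot in rotations)  # last column of the matrix
```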

Timo Beller, Maike Zwerger, Simon Gog, Enno Ohlebusch
Using Mutual Influence to Improve Recommendations

In this work we show how items in recommender systems mutually influence each other's utility, and how this influence can be exploited to improve recommendations. The way we model mutual influence is cheap and can be computed without requiring any source of content information about either items or users. We propose an algorithm that considers mutual influence to generate recommendations and analyse it over different recommendation datasets. We compare our algorithm with the Top-N selection algorithm and obtain gains of up to 17% in the utility of recommendations without affecting their diversity. We also analyse the scalability of our algorithm and show that it is as applicable to real-world recommender systems as Top-N.

Aline Bessa, Adriano Veloso, Nivio Ziviani
Position-Restricted Substring Searching over Small Alphabets

We consider the problem of indexing a given text T[0...n − 1] of n characters over an alphabet set Σ of size σ, in order to answer position-restricted substring searching queries. The query input consists of a pattern P (of length p) and two indices ℓ and r, and the output is the set of all occ_{ℓ,r} occurrences of P in T[ℓ...r]. In this paper, we propose an O(n log σ)-word space index with O(p + occ_{ℓ,r} log log n) query time. Our solution is interesting when the alphabet size is small. For example, when the alphabet set is of constant size, we achieve an exponential time improvement over the previously best-known linear space index by Navarro and Nekrich [SWAT 2012] with O(p + occ_{ℓ,r} log^ε n) query time, where ε > 0 is any positive constant. We also study the property matching problem and provide an improved index for handling semi-dynamic (insertions only) properties, where we use position-restricted substring queries as the main technique.
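The query semantics can be stated as a naive scan (0-based, inclusive bounds); a baseline sketch, not the paper's index:

```python
def position_restricted_occurrences(T, P, l, r):
    """All starting positions of P inside T[l..r] (0-based, inclusive).
    O((r - l) * |P|) scan -- the brute force the proposed
    O(n log sigma)-word index avoids."""
    out = []
    # An occurrence starting at i must end at i + len(P) - 1 <= r.
    for i in range(l, r - len(P) + 2):
        if T[i:i + len(P)] == P:
            out.append(i)
    return out
```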

Sudip Biswas, Tsung-Han Ku, Rahul Shah, Sharma V. Thankachan
Simulation Study of Multi-threading in Web Search Engine Processors

Modern cluster processors have been steadily increasing the number of cores able to execute concurrent threads. Web search engines critically rely on multithreading to efficiently process user queries and document insertions to support real-time search. This requires synchronization of readers and writers which, for a large number of threads, poses the question of which concurrency control strategies are capable of scaling to hundreds of cores and more. This paper presents a comparative study of a number of such strategies. To this end, we focus on the development of suitable simulation models for performance evaluation of search algorithms on dedicated single-purpose multi-threaded processors. We validate our model against actual implementations of the multi-threading strategies and then go further, studying performance on very large processors. We conclude that intra-query parallelism scales up more efficiently than inter-query parallelism.

Carolina Bonacic, Mauricio Marin
Query Processing in Highly-Loaded Search Engines

While Web search engines are built to cope with a large number of queries, query traffic can exceed the maximum query rate supported by the underlying computing infrastructure. We study how response times and results vary when, in the presence of high loads, some queries are either interrupted after a fixed time threshold elapses or dropped completely. Moreover, we introduce a novel dropping strategy, based on machine-learned performance predictors, to select the queries to drop in order to sustain the largest possible query rate with a limited relative degradation in effectiveness.

Daniele Broccolo, Craig Macdonald, Salvatore Orlando, Iadh Ounis, Raffaele Perego, Fabrizio Silvestri, Nicola Tonellotto
Indexes for Jumbled Pattern Matching in Strings, Trees and Graphs

We consider how to index strings, trees and graphs for jumbled pattern matching when we are asked to return a match if one exists. For example, we show how, given a tree containing two colours, we can build a quadratic-space index with which we can find a match in time proportional to the size of the match. We also show how we need only linear space if we are content with approximate matches.
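For strings, the jumbled-matching semantics (a substring whose multiset of characters equals the pattern's) can be checked with a sliding window over character counts; a naive sketch, whereas the paper's contribution is indexes that avoid such scans:

```python
from collections import Counter

def jumbled_match(text, pattern):
    """Return the start of some substring of text that is a permutation
    of pattern, or None. Sliding window with O(1) counter updates per step."""
    m = len(pattern)
    if m == 0 or m > len(text):
        return None
    need = Counter(pattern)
    window = Counter(text[:m])
    for i in range(len(text) - m + 1):
        if i:  # slide: drop text[i-1], add text[i+m-1]
            window[text[i - 1]] -= 1
            if window[text[i - 1]] == 0:
                del window[text[i - 1]]
            window[text[i + m - 1]] += 1
        if window == need:
            return i
    return None
```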

Ferdinando Cicalese, Travis Gagie, Emanuele Giaquinta, Eduardo Sany Laber, Zsuzsanna Lipták, Romeo Rizzi, Alexandru I. Tomescu
Adaptive Data Structures for Permutations and Binary Relations

We present new data structures for representing binary relations in an adaptive way, that is, for certain classes of inputs we achieve space below the general information theoretic lower bound, while achieving reasonable space complexities in the worst case. Our approach is derived from a geometric data structure [Arroyuelo et al., TCS 2011]. When used for representing permutations, it converges to a previously known adaptive representation [Barbay and Navarro, STACS 2009]. However, this new way of approaching the problem shows that we can support range searching in the adaptive representation. We extend this approach to representing binary relations, where no other adaptive representations using this chain decomposition have been proposed.

Francisco Claude, J. Ian Munro
Document Listing on Versioned Documents

Representing versioned documents, such as Wikipedia history, web archives, genome databases, and backups, is challenging when we want to support searching for an exact substring and retrieve the documents that contain the substring. This problem is called document listing.
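A literal statement of the query, as a naive scan over an uncompressed collection; a hypothetical baseline, not the grammar-compressed index proposed here:

```python
def document_listing(docs, pattern):
    """Identifiers of the documents containing pattern as a substring.
    Linear scan of the whole collection -- compressed indexes answer
    this without storing (or traversing) the collection verbatim."""
    return [i for i, d in enumerate(docs) if pattern in d]
```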

We present an index for the document listing problem on versioned documents. Our index is the first one based on grammar-compression. This allows for good results on repetitive collections, whereas standard techniques cannot achieve competitive space for solving the same problem.

Our index can also be adapted to work in a more standard way, allowing users to search for word-based phrase queries and conjunctive queries at the same time.

Finally, we discuss extensions that may be possible in the future, for example, supporting ranking capabilities within the index itself.

Francisco Claude, J. Ian Munro
Order-Preserving Incomplete Suffix Trees and Order-Preserving Indexes

Recently Kubica et al. (Inf. Process. Lett., 2013) and Kim et al. (submitted to Theor. Comput. Sci.) introduced order-preserving pattern matching: for a given text, the goal is to find its factors having the same 'shape' as a given pattern. Known results include a linear-time algorithm for this problem (in the case of a polynomially-bounded alphabet) and a generalization to multiple patterns. We give an O(n log log n) time construction of an index that enables order-preserving pattern matching queries in time proportional to the pattern length. The main component is a data structure that is an incomplete suffix tree in the order-preserving setting. The tree can miss single letters related to branching at internal nodes. Such incompleteness results from the weakness of our so-called weak character oracle. However, due to its weakness, this oracle can answer queries on-line in O(log log n) time using a sliding-window approach. For most applications such incomplete suffix trees provide the same functional power as complete ones. We also give an $O(\frac{n\log{n}}{\log\log{n}})$ time algorithm for constructing complete order-preserving suffix trees.
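The 'shape' (order-isomorphism) relation can be checked naively by rank-encoding each window; a sketch for sequences of distinct values, independent of the index construction in the paper:

```python
def shape(seq):
    """Encode relative order: the rank of each element within the sequence.
    Two sequences of distinct values are order-isomorphic (same 'shape')
    iff their encodings coincide."""
    order = sorted(seq)
    return tuple(order.index(x) for x in seq)

def op_occurrences(text, pattern):
    """Positions where a factor of text has the pattern's shape.
    Naive O(n * m log m) scan -- the paper builds an index answering
    such queries in time proportional to the pattern length."""
    m, ps = len(pattern), shape(pattern)
    return [i for i in range(len(text) - m + 1)
            if shape(text[i:i + m]) == ps]
```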

Maxime Crochemore, Costas S. Iliopoulos, Tomasz Kociumaka, Marcin Kubica, Alessio Langiu, Solon P. Pissis, Jakub Radoszewski, Wojciech Rytter, Tomasz Waleń
Compact Querieable Representations of Raster Data

In Geographic Information Systems (GIS) the attributes of the space (altitude, temperature, etc.) are usually represented using a raster model. There are no compact representations of raster data that provide efficient query capabilities. In this paper we propose compact representations to efficiently store and query raster datasets in main memory. We experimentally compare our proposals with traditional storage mechanisms for raster data, showing that our structures obtain competitive space performance while efficiently answering range queries involving the values stored in the raster.

Guillermo de Bernardo, Sandra Álvarez-García, Nieves R. Brisaboa, Gonzalo Navarro, Oscar Pedreira
Top-k Color Queries on Tree Paths

We present a data structure for the following problem: given a tree $\mathcal{T}$, with each of its nodes assigned a color from a totally ordered set, preprocess $\mathcal{T}$ to efficiently answer queries for the top k distinct colors on the path between two nodes, reporting the colors sorted in descending order. Our data structure requires linear space of O(n) words and answers queries in O(k) time.

Stephane Durocher, Rahul Shah, Matthew Skala, Sharma V. Thankachan
A Lempel-Ziv Compressed Structure for Document Listing

Document listing is the problem of preprocessing a set of sequences, called documents, so that later, given a short string called the pattern, we retrieve the documents where the pattern appears. While optimal-time and linear-space solutions exist, the current emphasis is on reducing the space requirements. Current document listing solutions build on compressed suffix arrays. This paper is the first attempt to solve the problem using a Lempel-Ziv compressed index of the text collections. We show that the resulting solution is very fast at outputting most of the resulting documents, taking more time for the final ones. This makes the index particularly useful for interactive scenarios or when listing some documents is sufficient. Yet, it also offers a competitive space/time tradeoff when returning the full answers.

Héctor Ferrada, Gonzalo Navarro
Minimal Discriminating Words Problem Revisited

We revisit two variants of the problem of computing minimal discriminating words studied in [5]. Given a pattern P and a threshold d, we want to report (i) all shortest extensions of P which occur in fewer than d documents, and (ii) all shortest extensions of P which occur only in d selected documents. For the first problem, we give an optimal solution with constant time per output word. For the second problem, we propose an algorithm with running time O(|P| + d·(1 + output)), improving on the solution of [5].

Paweł Gawrychowski, Gregory Kucherov, Yakov Nekrich, Tatiana Starikovskaya
Adding Compression and Blended Search to a Compact Two-Level Suffix Array

The suffix array is an efficient in-memory data structure for pattern search; and two-level variants also exist that are suited to external searching and can handle strings larger than the available memory. Assuming the latter situation, we introduce a factor-based mechanism for compressing the text string that integrates seamlessly with the in-memory index search structure, rather than requiring a separate dictionary. We also introduce a mixture of indexed and sequential pattern search in a trade-off that allows further space savings. Experiments on a 4 GB computer with 62.5 GB of English text show that a two-level arrangement is possible in which around 2.5% of the text size is required as an index and for which the disk-resident components, including the text itself, occupy less than twice the space of the original text; and with which count queries can be carried out using two disk accesses and less than two milliseconds of CPU time.

Simon Gog, Alistair Moffat
You Are What You Eat: Learning User Tastes for Rating Prediction

Poor nutrition is one of the major causes of ill-health and death in the western world, and is driven by a variety of factors including lack of nutritional understanding and a propensity towards eating convenience foods. We wish to build systems which can recommend nutritious meal plans to users; however, a crucial prerequisite is being able to recommend recipes that people will like. In this work we investigate key factors contributing to how recipes are rated by analysing the results of a longitudinal study (n=124) in order to understand how best to approach the recommendation problem. We identify a number of important contextual factors which can influence the choice of rating. Based on this analysis, we construct several recipe recommendation models that are able to leverage understanding of users' likes and dislikes in terms of ingredients and combinations of ingredients, and in terms of nutritional content. Via experiments over our dataset we show that these models can significantly outperform a number of competitive baselines.

Morgan Harvey, Bernd Ludwig, David Elsweiler
Discovering Dense Subgraphs in Parallel for Compressing Web and Social Networks

Mining and analyzing graphs are challenging tasks, especially with today's fast-growing graphs such as Web and social networks. In the case of Web and social networks, an effective approach has been to use compressed representations that enable basic navigation over the compressed structure. In this paper, we first present a parallel algorithm for reducing the number of edges of Web graphs by adding virtual nodes, over a cluster using the BSP (Bulk Synchronous Parallel) model. Applying another compression technique on edge-reduced Web graphs, we achieve the best state-of-the-art space/time tradeoff for accessing out/in-neighbors. Second, we present a scalable parallel algorithm over BSP for extracting dense subgraphs and representing them with compact data structures. Our algorithm uses summarized information to implement dynamic load balancing, avoiding idle time on processors. We show that our algorithms are scalable and preserve compression efficiency.

Cecilia Hernández, Mauricio Marín
Faster Lyndon Factorization Algorithms for SLP and LZ78 Compressed Text

We present two efficient algorithms which, given a compressed representation of a string w of length N, compute the Lyndon factorization of w. Given a straight line program (SLP) $\mathcal{S}$ of size n and height h that describes w, the first algorithm runs in O(nh(n + log N log n)) time and O(n²) space. Given the Lempel-Ziv 78 encoding of size s for w, the second algorithm runs in O(s log s) time and space.

Tomohiro I, Yuto Nakashima, Shunsuke Inenaga, Hideo Bannai, Masayuki Takeda
Lossless Compression of Rotated Maskless Lithography Images

A new lossless image compression algorithm is presented, aimed at maskless lithography systems with mostly right-angled regular structures. Since these images often appear in slightly rotated form, an algorithm dealing with this special case is suggested, which improves performance relative to state-of-the-art alternatives.

Shmuel Tomi Klein, Dana Shapira, Gal Shelef
Learning URL Normalization Rules Using Multiple Alignment of Sequences

In this work, we present DUSTER, a new approach to detect and eliminate redundant content when crawling the web. DUSTER takes advantage of a multi-sequence alignment strategy to learn rewriting rules able to transform URLs into other URLs likely to have similar content, when that is the case. We show that the alignment strategy can lead to a reduction in the number of duplicate URLs 54% larger than the one achieved by our best baseline.

Kaio Wagner Lima Rodrigues, Marco Cristo, Edleno Silva de Moura, Altigran Soares da Silva
On Two-Dimensional Lyndon Words

A Lyndon word is a primitive string which is lexicographically smallest among cyclic permutations of its characters. Lyndon words are used for constructing bases in free algebras, constructing de Bruijn sequences, finding the lexicographically smallest or largest substring in a string, and succinct suffix-prefix matching of highly periodic strings. In this paper, we extend the concept of the Lyndon word to two dimensions. We introduce the 2D Lyndon word and use it to capture 2D horizontal periodicity of a matrix in which each row is highly periodic, and to efficiently solve 2D horizontal suffix-prefix matching among a set of patterns. This yields a succinct and efficient algorithm for 2D dictionary matching. We present several algorithms that compute the 2D Lyndon word that represents a matrix. The final algorithm achieves linear time complexity even when the least common multiple of the periods of the rows is exponential in the matrix width.
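The 1D definition above can be tested directly from first principles; a tiny sketch of the base property that the paper lifts to two dimensions:

```python
def is_lyndon(w):
    """True iff w is a Lyndon word: non-empty and strictly smaller than
    every proper rotation of itself (this also implies primitivity)."""
    n = len(w)
    return n > 0 and all(w < w[i:] + w[:i] for i in range(1, n))
```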

Shoshana Marcus, Dina Sokol
Fully-Online Grammar Compression

We present a fully-online algorithm for constructing straight-line programs (SLPs). A naive array representation of an SLP with n variables on an alphabet of size σ requires $2n\lg(n+\sigma)$ bits. As already shown in [Tabei et al., CPM'13], in the offline setting this size can be reduced to $n\lg(n+\sigma) + 2n + o(n)$ bits, which is asymptotically equal to the information-theoretic lower bound. Our algorithm achieves the same size in the online setting, i.e., characters of an input string are given one by one to update the current SLP. With an auxiliary position array of $n\lg(N/n) + 3n + o(n)$ bits, our representation supports substring extraction in O((m + h)t) time, where N is the length of the input string, m is the length of the extracted substring, $h = O(\lg N)$ is the height of the SLP, t = O(1) in the offline case, and $t=O(\lg n/\lg\lg n)$ in the online case. The working space is bounded by $(1+\alpha)n\lg(n+\sigma)+n(3+\lg(\alpha n))$ bits, depending on a constant α ∈ (0,1], which is the load factor of the hash tables. We compared our algorithm to LZend in experiments using real-world repetitive texts.

Shirou Maruyama, Yasuo Tabei, Hiroshi Sakamoto, Kunihiko Sadakane
Solving Graph Isomorphism Using Parameterized Matching

We propose a new approach to solve graph isomorphism using parameterized matching. To find an isomorphism between two graphs, one graph is linearized, i.e., represented as a graph walk that covers all nodes and edges such that each element is represented by a parameter. Next, we match the graph linearization against the second graph, searching for a bijective function that maps each element of the first graph to an element of the second graph. We develop an efficient linearization algorithm that generates a short linearization with an approximation guarantee, and a graph matching algorithm. We evaluate our approach experimentally on graphs of different types and sizes, and compare it to the performance of VF2, a prominent algorithm for graph isomorphism. Our empirical measurements show that graph linearization finds a matching graph faster than VF2 in many cases because of better pruning of the search space.

Juan Mendivelso, Sunghwan Kim, Sameh Elnikety, Yuxiong He, Seung-won Hwang, Yoan Pinzón
Suffix Array of Alignment: A Practical Index for Similar Data

The suffix tree of alignment is an index data structure for similar strings. Given an alignment of similar strings, it stores all suffixes of the alignment, called alignment-suffixes. An alignment-suffix represents one suffix of a string or suffixes of multiple strings starting at the same position in the alignment. The suffix tree of alignment makes good use of similarity in strings theoretically. However, suffix trees are not widely used in biological applications because of their huge space requirements; instead, suffix arrays are used in practice.

In this paper we propose a space-economical version of the suffix tree of alignment, named the suffix array of alignment (SAA). Given an alignment ρ of similar strings, the SAA for ρ is a lexicographically sorted list of all the alignment-suffixes of ρ. The SAA supports pattern search as efficiently as the generalized suffix array. Our experiments show that our index uses only 14% of the space used by the generalized suffix array to index 11 human genome sequences. The space efficiency of our index increases as the number of genome sequences increases. We also present an efficient algorithm for constructing the SAA.

Joong Chae Na, Heejin Park, Sunho Lee, Minsung Hong, Thierry Lecroq, Laurent Mouchard, Kunsoo Park
Faster Top-k Document Retrieval in Optimal Space

We consider the problem of retrieving the k documents, from a collection of strings, where a given pattern P appears most often. We show that, by representing the collection using a Compressed Suffix Array (CSA), a data structure using the asymptotically optimal |CSA| + o(n) bits can answer queries in the time needed by the CSA to find the suffix array interval of the pattern plus $O(k\lg^2 k \lg^\epsilon n)$ accesses to suffix array cells, for any constant ε > 0. This is $\lg n / \lg k$ times faster than the only previous solution using optimal space, $\lg k$ times slower than the fastest structure that uses twice the space, and $\lg^2 k \lg^\epsilon n$ times the lower-bound cost of obtaining k document identifiers from the CSA. To obtain this result we introduce a tool called the sampled document array, which may be of independent interest.

Gonzalo Navarro, Sharma V. Thankachan
Faster Range LCP Queries

Range LCP (longest common prefix) is an extension of the classical LCP problem and is defined as follows: preprocess a string S[1...n] so that $\max_{a,b \in \{i \ldots j\}} \mathrm{LCP}(S_a, S_b)$ can be computed efficiently for any input i, j ∈ [1, n], where LCP(S_a, S_b) is the length of the longest common prefix of the suffixes of S starting at locations a and b. In this paper, we describe a linear space data structure with O((j − i)^{1/2} log^ε (j − i)) query time, where ε > 0 is any constant. This improves the linear space and O((j − i) log log n) query time solution by Amir et al. [ISAAC, 2011].
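Stated directly (0-based here, whereas the abstract is 1-based), a quadratic-time baseline for the query, far from the paper's data structure:

```python
def range_lcp(S, i, j):
    """max over distinct a, b in {i..j} of LCP(S[a:], S[b:]).
    Naive pairwise comparison -- the paper answers this in roughly
    sqrt(j - i) time after linear-space preprocessing."""
    def lcp(a, b):
        k = 0
        while a + k < len(S) and b + k < len(S) and S[a + k] == S[b + k]:
            k += 1
        return k
    return max((lcp(a, b)
                for a in range(i, j + 1)
                for b in range(a + 1, j + 1)), default=0)
```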

Manish Patil, Rahul Shah, Sharma V. Thankachan
Learning to Schedule Webpage Updates Using Genetic Programming

A key challenge in designing a scheduling policy regarding freshness is to estimate the likelihood that a previously crawled webpage has been modified on the web. This estimate is used to define the order in which those pages should be visited, and can be exploited to reduce the cost of monitoring crawled webpages to keep their versions updated. We here present a novel approach to generate score functions that produce accurate rankings of pages regarding their probability of having been modified relative to their previously crawled versions. We propose a flexible framework that uses genetic programming to evolve score functions to estimate the likelihood that a webpage has been modified. We present a thorough experimental evaluation of the benefits of our framework over five state-of-the-art baselines.

Aécio S. R. Santos, Nivio Ziviani, Jussara Almeida, Cristiano R. Carvalho, Edleno Silva de Moura, Altigran Soares da Silva
Accurate Profiling of Microbial Communities from Massively Parallel Sequencing Using Convex Optimization

We describe the Microbial Community Reconstruction (MCR) Problem, which is fundamental for microbiome analysis. In this problem, the goal is to reconstruct the identity and frequency of species comprising a microbial community, using short sequence reads from Massively Parallel Sequencing (MPS) data obtained for specified genomic regions. We formulate the problem mathematically as a convex optimization problem and provide sufficient conditions for identifiability, namely the ability to reconstruct species identity and frequency correctly when the data size (number of reads) grows to infinity. We discuss different metrics for assessing the quality of the reconstructed solution, including a novel phylogenetically-aware metric based on the Mahalanobis distance, and give upper bounds on the reconstruction error for a finite number of reads under different metrics. We propose a scalable divide-and-conquer algorithm for the problem using convex optimization, which enables us to handle large problems (with $\sim\!10^6$ species). We show using numerical simulations that for realistic scenarios, where the microbial communities are sparse, our algorithm gives solutions with high accuracy, both in terms of obtaining accurate frequencies and in terms of species phylogenetic resolution.

Or Zuk, Amnon Amir, Amit Zeisel, Ohad Shamir, Noam Shental
Distributed Query Processing on Compressed Graphs Using K2-Trees

Compact representation of Web and social graphs can be achieved efficiently with the K2-tree, as it attains compression ratios of about 5 bits per link for web graphs and about 20 bits per link for social graphs. The K2-tree also enables fast processing of relevant queries, such as direct and reverse neighbours, on the compressed graph. These two properties make the K2-tree suitable for inclusion in Web search engines, where it is necessary to maintain very large graphs and to process on-line queries on them. Typically these search engines are deployed on dedicated clusters of distributed-memory processors, wherein the data set is partitioned and replicated to enable low query response time and high query throughput. In this context a practical strategy is simply to distribute the data across the processors and build local data structures for efficient retrieval in each processor. However, the way the data set is distributed across the processors can have a significant impact on performance. In this paper, we evaluate a number of data distribution strategies which are suitable for the K2-tree and identify the alternative with the best overall performance. In our study we consider different data sets and focus on metrics such as overall compression ratio and parallel response time for retrieving direct and reverse neighbours.

Sandra Álvarez-García, Nieves R. Brisaboa, Carlos Gómez-Pantoja, Mauricio Marin
Erratum to: String Processing and Information Retrieval

Erratum to: O. Kurland, M. Lewenstein, and E. Porat (Eds.), String Processing and Information Retrieval, DOI: 10.1007/978-3-319-02432-5. The book was inadvertently published with an incorrect name of the copyright holder. The name of the copyright holder for this book is: © Springer-Verlag Berlin Heidelberg. The book has been updated with these changes.

Oren Kurland, Moshe Lewenstein, Ely Porat
Backmatter
Metadata
Titel
String Processing and Information Retrieval
Edited by
Oren Kurland
Moshe Lewenstein
Ely Porat
Copyright year
2013
Publisher
Springer International Publishing
Electronic ISBN
978-3-319-02432-5
Print ISBN
978-3-319-02431-8
DOI
https://doi.org/10.1007/978-3-319-02432-5
