About this Book

This book constitutes the refereed proceedings of the 22nd International Symposium on String Processing and Information Retrieval, SPIRE 2015, held in London, UK, in September 2015. The 28 full and 6 short papers included in this volume were carefully reviewed and selected from 90 submissions. The papers cover research in all aspects of string processing, information retrieval, computational biology, pattern matching, semi-structured data, and related applications.

Table of Contents

Frontmatter

Faster Exact Search Using Document Clustering

We show how full-text search based on inverted indices can be accelerated by clustering the documents without losing results (SeCluD – Search with Clustered Documents). We develop a fast multilevel clustering algorithm that uses the query cost of conjunctive queries as an objective function. Depending on the input, search becomes up to four times faster than non-clustered search. The resulting clusters are also useful for data compression and for distributing the work over many machines.

Jonathan Dimond, Peter Sanders
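
For context, here is a minimal sketch (ours, not the SeCluD implementation) of how an inverted index answers the conjunctive queries whose cost drives the clustering objective; the toy documents are hypothetical:

    from collections import defaultdict

    def build_inverted_index(docs):
        """Map each term to the sorted list of ids of documents containing it."""
        index = defaultdict(set)
        for doc_id, text in docs.items():
            for term in text.split():
                index[term].add(doc_id)
        return {term: sorted(ids) for term, ids in index.items()}

    def conjunctive_query(index, terms):
        """Intersect posting lists, shortest first; the lengths of the
        intersected lists are what drives the query cost."""
        postings = sorted((index.get(t, []) for t in terms), key=len)
        result = set(postings[0]) if postings else set()
        for plist in postings[1:]:
            result &= set(plist)
        return sorted(result)

    docs = {1: "fast string search", 2: "string search with clusters", 3: "document clustering"}
    index = build_inverted_index(docs)
    print(conjunctive_query(index, ["string", "search"]))  # -> [1, 2]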

Fast Online Lempel-Ziv Factorization in Compressed Space

Let T be a text of length n on an alphabet $$\Sigma $$ of size $$\sigma $$, and let $$H_0$$ be the zero-order empirical entropy of T. We show that the LZ77 factorization of T can be computed in $$nH_0+o(n\log \sigma ) + \mathcal {O}(\sigma \log n)$$ bits of working space with an online algorithm running in $$\mathcal {O}(n\log n)$$ time. Previous space-efficient online solutions either work in compact space and $$\mathcal {O}(n\log n)$$ time, or in succinct space and $$\mathcal {O}(n\log ^3 n)$$ time.

Alberto Policriti, Nicola Prezza
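
To make the object being computed concrete, here is a naive quadratic LZ77 factorization sketch of our own; the paper's contribution is computing this factorization online in compressed space, which this toy code does not attempt:

    def lz77_factorize(t):
        """Naive LZ77: each phrase is the longest prefix of the remaining
        suffix that occurs earlier (sources may overlap the phrase), or a
        single literal character if no earlier match exists."""
        phrases, i, n = [], 0, len(t)
        while i < n:
            best_len, best_pos = 0, -1
            for j in range(i):                    # candidate earlier start
                l = 0
                while i + l < n and t[j + l] == t[i + l]:
                    l += 1
                if l > best_len:
                    best_len, best_pos = l, j
            if best_len == 0:
                phrases.append((None, 0, t[i]))   # literal character
                i += 1
            else:
                phrases.append((best_pos, best_len, None))
                i += best_len
        return phrases

    # -> [(None, 0, 'a'), (None, 0, 'b'), (0, 4, None), (1, 1, None)]
    print(lz77_factorize("abababb"))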

Adaptive Computation of the Swap-Insert Correction Distance

The Swap-Insert Correction distance from a string S of length n to another string L of length $$m\ge n$$ on the alphabet [1..d] is the minimum number of insertions, and swaps of pairs of adjacent symbols, converting S into L. Contrary to other correction distances, computing it is NP-hard in the size d of the alphabet. We describe an algorithm computing this distance in time within $$O(d^2 nm g^{d-1})$$, where there are $$n_\alpha $$ occurrences of $$\alpha $$ in S, $$m_\alpha $$ occurrences of $$\alpha $$ in L, and where $$g=\max _{\alpha \in [1..d]} \min \{n_\alpha ,m_\alpha -n_\alpha \}$$ measures the difficulty of the instance. The difficulty g is bounded from above by various terms, such as the length of the shorter string S, and the maximum number of occurrences of a single character in S. The latter bound yields a running time within $$O(d(n+m)+(d/(d-1)^{d-2})\cdot n^{d}(m-n))$$ in the worst case over instances of fixed lengths n and m for S and L, which further simplifies to within $$O(n^d(m-n)+m)$$ when d is fixed, the state of the art for this problem. This illustrates how, in many cases, the correction distance between two strings can be easier to compute than in the worst-case scenario.

Jérémy Barbay, Pablo Pérez-Lantero
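
As a concrete (if exponential) illustration of the distance itself, a breadth-first search over intermediate strings works for tiny instances; this is our brute-force sketch, not the paper's adaptive algorithm:

    from collections import deque

    def swap_insert_distance(s, l):
        """Minimum number of insertions and adjacent swaps turning s into l.
        Brute-force BFS over intermediate strings; exponential, tiny inputs only."""
        alphabet = set(l)             # inserted symbols must end up in l
        seen, queue = {s}, deque([(s, 0)])
        while queue:
            cur, d = queue.popleft()
            if cur == l:
                return d
            moves = [cur[:i] + cur[i+1] + cur[i] + cur[i+2:]   # adjacent swaps
                     for i in range(len(cur) - 1)]
            if len(cur) < len(l):                              # insertions
                moves += [cur[:i] + c + cur[i:]
                          for i in range(len(cur) + 1) for c in alphabet]
            for nxt in moves:
                if nxt not in seen:
                    seen.add(nxt)
                    queue.append((nxt, d + 1))
        return None   # l is unreachable from s with insertions and swaps only

    print(swap_insert_distance("ba", "abc"))  # -> 2: swap to "ab", insert 'c'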

Transforming XML Streams with References

Many useful XML transformations can be formulated through deterministic top-down tree transducers. If transducers process parts of the input repeatedly or in non-document order, then they cannot be realized over the XML stream with constant or even depth-bounded memory. We show that by enriching streams with forward references, both in the input and in the output, every such transformation can be compiled into a stream processor whose space consumption depends only on the transducer and the depth of the XML document. References allow producing DAG-compressed output that is guaranteed to be linear in the size of the input (up to the space required for labels). Our model is designed so that, without decompression, the output may again serve as the input of a subsequent transducer.

Sebastian Maneth, Alberto Ordóñez, Helmut Seidl

Efficient Term Set Prediction Using the Bell-Wigner Inequality

The task of measuring the dependence between terms is computationally expensive for IR systems which have to deal with large and sparse datasets. The current approaches to mining frequent term sets are based on the enumeration of the term sets found in a set of documents and on monotonicity, the latter being the property that a term set is frequent only if all its subsets are frequent, as implemented by Apriori. However, the computational time can be very large. An alternative approach is to store the dataset in a frequent pattern tree (FP-tree) and to visit and prune the tree recursively, as implemented by FPGrowth. However, the storage space can still be very large. We introduce the Bell-Wigner Inequality (BWI) as a conceptual enhancement of monotonicity to predict with certainty when an itemset is frequent and when it is infrequent. We describe the empirical validation showing that the BWI can significantly reduce both the computational time of Apriori and the storage space of pattern tree-based algorithms such as FPGrowth. The empirical validation has been performed using runs produced by IR systems on the TIPSTER test collection.

Massimo Melucci
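
The monotonicity property that Apriori exploits is easy to see in code; below is a minimal, generic Apriori sketch of ours (the BWI-based pruning the paper adds is not reproduced here), with hypothetical toy transactions:

    from itertools import combinations

    def apriori(transactions, min_support):
        """Classic Apriori: a term set can be frequent only if all of its
        subsets are frequent (monotonicity), which prunes candidates."""
        def support(itemset):
            return sum(itemset <= t for t in transactions)

        items = {frozenset([x]) for t in transactions for x in t}
        level = {s for s in items if support(s) >= min_support}
        frequent = set(level)
        k = 2
        while level:
            # candidate generation: unions of pairs from the previous level
            candidates = {a | b for a in level for b in level if len(a | b) == k}
            # monotonicity pruning: every (k-1)-subset must already be frequent
            candidates = {c for c in candidates
                          if all(frozenset(s) in level for s in combinations(c, k - 1))}
            level = {c for c in candidates if support(c) >= min_support}
            frequent |= level
            k += 1
        return frequent

    docs = [{"ranking", "terms"}, {"ranking", "terms", "query"}, {"query"}]
    # -> the three frequent singletons plus {'ranking', 'terms'}
    print(apriori(docs, min_support=2))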

On Prefix/Suffix-Square Free Words

We present a series of algorithms that efficiently identify the factors of a word that neither start nor end with squares (called, accordingly, prefix-suffix-square free factors). A series of closely related algorithmic problems are discussed.

Marius Dumitran, Florin Manea, Dirk Nowotka
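
A brute-force check of the defining property (ours, far from the efficient algorithms of the paper) makes the notion concrete:

    def starts_with_square(s):
        """True if s has a prefix of the form xx with x non-empty."""
        return any(s[:l] == s[l:2*l] for l in range(1, len(s)//2 + 1))

    def prefix_suffix_square_free_factors(w):
        """All factors of w that neither start nor end with a square (naive)."""
        factors = set()
        for i in range(len(w)):
            for j in range(i + 1, len(w) + 1):
                f = w[i:j]
                # f ends with a square iff its reverse starts with one
                if not starts_with_square(f) and not starts_with_square(f[::-1]):
                    factors.add(f)
        return factors

    print(sorted(prefix_suffix_square_free_factors("aab")))  # -> ['a', 'ab', 'b']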

Temporal Analysis of CHAVE Collection

The importance of temporal information (TI) is increasing in several Information Retrieval (IR) tasks. CHAVE, available from Linguateca's site, is the only ad hoc IR test collection with Portuguese texts. The research question of this work is therefore whether this collection is sufficiently rich to be used in Temporal IR evaluation. The answer we obtained was yes. By analyzing the CHAVE collection, we verified that 22% of the topics and 86% of the documents have at least one chronon, and that 49% of the topics are time-sensitive. Analyzing the relation of topics with documents, relevant documents of time-sensitive topics converge to specific dates, while the non-relevant ones are dispersed along the timeline. Finally, we used a peak-dates strategy as a time-aware query expansion (QE) process. Experiments showed effectiveness improvements for time-sensitive queries.

Olga Craveiro, Joaquim Macedo, Henrique Madeira

DeShaTo: Describing the Shape of Cumulative Topic Distributions to Rank Retrieval Systems Without Relevance Judgments

This paper investigates an approach for estimating the effectiveness of any IR system. The approach is based on the idea that a set of documents retrieved for a specific query is highly relevant if there are only a small number of predominant topics in the retrieved documents. The proposed approach is to determine the topic probability distribution of each document offline, using Latent Dirichlet Allocation. Then, for a retrieved set of documents, a set of probability distribution shape descriptors, namely the skewness and the kurtosis, are used to compute a score based on the shape of the cumulative topic distribution of the respective set of documents. The proposed model is termed DeShaTo, which is short for Describing the Shape of cumulative Topic distributions. In this work, DeShaTo is used to rank retrieval systems without relevance judgments. In most cases, the empirical results are better than the state-of-the-art approach. Compared to other approaches, DeShaTo works independently for each system; it therefore remains reliable even when there are fewer systems to be ranked by relevance.

Radu Tudor Ionescu, Adrian-Gabriel Chifu, Josiane Mothe

Induced Sorting Suffixes in External Memory with Better Design and Less Space

Recently, several attempts have been made to extend the internal memory suffix array (SA) construction algorithm SA-IS to the external memory model, e.g., eSAIS, EM-SA-DS and DSA-IS. While the programs developed for these algorithms achieve remarkable performance in terms of I/O complexity and speed, their designs are quite complex and their disk requirements remain rather heavy. Currently, the core algorithmic part of each of these programs consists of thousands of lines of C++, and the average peak disk requirement is over 20n bytes for an input string of size $$n<2^{40}$$. We re-investigate the problem of induced sorting suffixes in external memory and propose a new algorithm SAIS-PQ (SAIS with Priority Queue) and its enhanced alternative SAIS-PQ+. Using the library STXXL, the core algorithmic parts of SAIS-PQ and SAIS-PQ+ are coded in around 800 and 1600 lines of C++, respectively. The time and space performance of these two programs is evaluated in comparison with eSAIS, which is also implemented using STXXL. In our experiments, eSAIS runs the fastest for input strings not larger than 16 GiB, but it is slower than SAIS-PQ+ for the two largest input strings, of 32 and 48.44 GiB. The average peak disk requirements of eSAIS and SAIS-PQ+ are around 23n and 15n bytes, respectively.

Wei Jun Liu, Ge Nong, Wai Hong Chan, Yi Wu

Efficient Algorithms for Longest Closed Factor Array

We consider a family of strings called closed strings and a related array of Longest Closed Factors (LCF). We show that the reconstruction of a string from its LCF array is easier than the construction and verification of this array; moreover, the reconstructed string is unique. We also improve the time of construction/verification, reducing it from $$\mathcal {O}(n \log n/\log \log n)$$ (the best previously known) to $$\mathcal {O}(n\sqrt{\log n})$$. We use connections between the LCF array and the longest previous/next factor arrays.

Hideo Bannai, Shunsuke Inenaga, Tomasz Kociumaka, Arnaud Lefebvre, Jakub Radoszewski, Wojciech Rytter, Shiho Sugimoto, Tomasz Waleń

A Compact RDF Store Using Suffix Arrays

RDF has become a standard format to describe resources in the Semantic Web and other scenarios. RDF data is composed of triples (subject, predicate, object), referring respectively to a resource, a property of that resource, and the value of that property. Compact storage schemes allow fitting larger datasets in main memory for faster processing. On the other hand, supporting efficient SPARQL queries on RDF datasets requires index data structures to accompany the data, which hampers compactness. As done for text collections, we introduce a self-index for RDF data, which combines the data and its index in a single representation that takes less space than the raw triples and efficiently supports basic SPARQL queries. Our storage format, RDFCSA, builds on compressed suffix arrays. Although there exist more compact representations of RDF data, RDFCSA uses about half the space of the raw data (and replaces it) and displays much more robust and predictable query times, around 1–2 microseconds per retrieved triple. RDFCSA is 3 orders of magnitude faster than representations like MonetDB or RDF-3X, while using the same space as the former and 6 times less space than the latter. It is also faster than the more compact representations on most queries, in some cases by 2 orders of magnitude.

Nieves R. Brisaboa, Ana Cerdeira-Pena, Antonio Fariña, Gonzalo Navarro

Chaining Fragments in Sequences: to Sweep or Not (Extended Abstract)

Computing an optimal chain of fragments is a classical problem in string algorithms, with important applications in computational biology. There exist two efficient dynamic programming algorithms solving this problem, based on different principles. In the present note, we show how the principles of these two algorithms can be combined to design a hybrid dynamic programming algorithm that retains the advantages of both.

Julien Allali, Cedric Chauve, Laetitia Bourgeade

A Faster Algorithm for Computing Maximal $$\alpha $$-gapped Repeats in a String

A string $$x = uvu$$ with both u, v being non-empty is called a gapped repeat with period $$p = |uv|$$, and is denoted by the pair (x, p). If $$p \le \alpha (|x|-p)$$ with $$\alpha > 1$$, then (x, p) is called an $$\alpha $$-gapped repeat. An occurrence $$[i, i+|x|-1]$$ of an $$\alpha $$-gapped repeat (x, p) in a string w is called a maximal $$\alpha $$-gapped repeat of w if it cannot be extended either to the left or to the right in w with the same period p. Kolpakov et al. (CPM 2014) showed that, given a string of length n over a constant alphabet, all the occurrences of maximal $$\alpha $$-gapped repeats in the string can be computed in $$O(\alpha ^2 n + occ )$$ time, where $$ occ $$ is the number of occurrences. In this paper, we propose a faster $$O(\alpha n + occ )$$-time algorithm for this problem, improving the result of Kolpakov et al. by a factor of $$\alpha $$.

Yuka Tanimura, Yuta Fujishige, Tomohiro I, Shunsuke Inenaga, Hideo Bannai, Masayuki Takeda
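
A direct brute-force enumeration from the definitions above (ours; the paper's point is precisely to avoid this kind of scan) can serve as a reference implementation, with i, j, p and alpha as in the abstract:

    def maximal_alpha_gapped_repeats(w, alpha):
        """All (i, j, p): w[i..j] = uvu with p = |uv|, u and v non-empty,
        p <= alpha * |u|, and not extendable with the same period p."""
        n, found = len(w), []
        for i in range(n):
            for j in range(i + 1, n):
                length = j - i + 1
                for p in range(1, length):
                    u = length - p                 # |u| = |x| - p
                    if 2 * p - length < 1:         # |v| = p - |u| must be >= 1
                        continue
                    if p > alpha * u:              # alpha-gapped condition
                        continue
                    if w[i:i+u] != w[i+p:j+1]:     # x = uvu: prefix u == suffix u
                        continue
                    if i > 0 and w[i-1] == w[i-1+p]:       # extendable left
                        continue
                    if j + 1 < n and w[j+1] == w[j+1-p]:   # extendable right
                        continue
                    found.append((i, j, p))
        return found

    print(maximal_alpha_gapped_repeats("abaab", 2))  # -> [(0, 2, 2), (0, 4, 3)]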

Selective Labeling and Incomplete Label Mitigation for Low-Cost Evaluation

Information retrieval evaluation heavily relies on human effort to assess the relevance of result documents. Recent years have seen efforts and good progress to reduce the human effort and thus lower the cost of evaluation. Selective labeling strategies carefully choose a subset of result documents to label, for instance, based on their aggregate rank in results; strategies to mitigate incomplete labels seek to make up for missing labels, for instance, predicting them using machine learning methods. How different strategies interact, though, is unknown.

In this work, we study the interaction of several state-of-the-art strategies for selective labeling and incomplete label mitigation on four years of TREC Web Track data (2011–2014). Moreover, we propose and evaluate MaxRep as a novel selective labeling strategy, which has been designed so as to select effective training data for missing label prediction.

Kai Hui, Klaus Berberich

Relative Select

Motivated by the problem of storing coloured de Bruijn graphs, we show how, if we can already support fast select queries on one string, then we can store a little extra information and support fairly fast select queries on a similar string.

Christina Boucher, Alexander Bowe, Travis Gagie, Giovanni Manzini, Jouni Sirén

Temporal Query Classification at Different Granularities

In this work, we consider the problem of classifying time-sensitive queries at different temporal granularities (day, month, and year). Our approach involves performing Bayesian analysis on time intervals of interest obtained from pseudo-relevant documents. Based on the Bayesian analysis we derive several effective features which are used to train a supervised machine learning algorithm for classification. We evaluate our method on a large temporal query workload to show that we can determine the temporal class of a query with high precision.

Dhruv Gupta, Klaus Berberich

Prefix and Suffix Reversals on Strings

The Sorting by Prefix Reversals problem consists in sorting the elements of a given permutation $$\pi $$ with a minimum number of prefix reversals, i.e. reversals that always involve the leftmost element of $$\pi $$. A natural extension of this problem is to consider strings (in which any letter may appear several times) rather than permutations. In strings, three different types of problems arise: grouping (starting from a string S, transform it so that all identical letters are consecutive), sorting (a constrained version of grouping, in which the target string must be lexicographically ordered) and rearranging (given two strings S and T, transform S into T). In this paper, we study these three problems from an algorithmic viewpoint, in the setting where two operations (rather than one) are allowed: namely, prefix and suffix reversals, where a suffix reversal must always involve the rightmost element of the string. We first give elements of comparison between the "prefix reversals only" case and our case. The algorithmic results we obtain on these three problems depend on the size k of the alphabet on which the strings are built. In particular, we show that the grouping problem is in P for $$k\in [2;4]$$ and when $$n-k=O(1)$$, where n is the length of the string. We also show that the grouping problem admits a PTAS for any constant k, and is 2-approximable for any k. Concerning sorting, it is in P for $$k\in [2;3]$$, admits a PTAS for constant k, and is NP-hard for $$k=n$$. Finally, concerning the rearranging problem, we show that it is NP-hard both for $$k=O(1)$$ and for $$k=n$$. We also show that the three problems are FPT when the parameter is the maximum number of blocks over the source and target strings.

Guillaume Fertin, Loïc Jankowiak, Géraldine Jean

Filtration Algorithms for Approximate Order-Preserving Matching

The exact order-preserving matching problem is to find all the substrings of a text T which have the same length and relative order as a pattern P. Like string matching, order-preserving matching can be generalized by allowing the match to be approximate. In approximate order-preserving matching, two strings match if they have the same relative order after removing up to k elements at the same positions in both strings. In this paper we present practical solutions for this problem. The methods are based on filtration, and one of them is the first sublinear solution on average. We show by practical experiments that the new solutions are fast and efficient.

Tamanna Chhabra, Emanuele Giaquinta, Jorma Tarhio
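
A naive reference implementation of exact order-preserving matching (ours, not the filtration methods of the paper) shows what "same relative order" means:

    def same_relative_order(a, b):
        """True if a and b compare the same way at every pair of positions."""
        n = len(a)
        return all((a[i] < a[j]) == (b[i] < b[j]) and
                   (a[i] == a[j]) == (b[i] == b[j])
                   for i in range(n) for j in range(i + 1, n))

    def op_match(text, pattern):
        """All positions where a text window order-preserving-matches pattern."""
        m = len(pattern)
        return [i for i in range(len(text) - m + 1)
                if same_relative_order(text[i:i+m], pattern)]

    # pattern shape: low, high, middle; both windows [1,5,3] and [3,9,4] match
    print(op_match([1, 5, 3, 9, 4, 8], [2, 6, 3]))  # -> [0, 2]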

Fishing in Read Collections: Memory Efficient Indexing for Sequence Assembly

In this paper, we present a memory-efficient index for storing a large set of DNA sequencing reads. The index allows us to quickly retrieve the set of reads containing a certain query k-mer. Instead of the usual approach of treating each read as a separate string, we take advantage of the significant overlap between reads and compress the data by aligning the reads to an approximate superstring constructed specifically for this purpose, in combination with several succinct data structures.

Vladimír Boža, Jakub Jursa, Broňa Brejová, Tomáš Vinař

How Big is that Genome? Estimating Genome Size and Coverage from k-mer Abundance Spectra

Many practical algorithms for sequence alignment, genome assembly and other tasks represent a sequence as a set of k-mers. Here, we address the problems of estimating genome size and sequencing coverage from sequencing reads, without the need for sequence assembly. Our estimates are based on a histogram of k-mer abundance in the input set of sequencing reads and on probabilistic modeling of the distribution of k-mer abundance, with parameters related to the coverage, error rate and repeat structure of the genome. Our method provides reliable estimates even at coverage as low as 0.5 or at error rates as high as 10%.

Michal Hozza, Tomáš Vinař, Broňa Brejová
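
The k-mer abundance histogram itself is easy to compute, and a crude genome-size estimate falls out of it. The sketch below uses the common total-k-mers-divided-by-peak-coverage heuristic rather than the paper's probabilistic model, and the toy reads are hypothetical:

    from collections import Counter

    def kmer_spectrum(reads, k):
        """Histogram: abundance value -> number of distinct k-mers with it."""
        counts = Counter(r[i:i+k] for r in reads for i in range(len(r) - k + 1))
        return Counter(counts.values())

    def estimate_genome_size(reads, k):
        """Heuristic: genome size ~ total k-mers / peak k-mer coverage.
        (The paper fits a full probabilistic model instead.)"""
        spectrum = kmer_spectrum(reads, k)
        # ignore abundance 1, typically dominated by sequencing errors
        peak = max((a for a in spectrum if a > 1),
                   key=lambda a: spectrum[a], default=1)
        total = sum(a * count for a, count in spectrum.items())
        return total / peak

    reads = ["ACGTACGT", "CGTACGTA", "GTACGTAC"]   # toy reads from a repeat
    print(kmer_spectrum(reads, 4))                 # -> Counter({4: 3, 3: 1})
    print(estimate_genome_size(reads, 4))          # -> 3.75 (true period is 4)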

Assessing the Efficiency of Suffix Stripping Approaches for Portuguese Stemming

Stemming is the process of reducing inflected words to their root form, the stem. Search engines use stemming algorithms to conflate words with the same stem, reducing index size and improving recall. Suffix stripping is a strategy used by stemming algorithms to reduce words to stems by processing suffix rules suited to the constraints of each language. For Portuguese stemming, RSLP was the first suffix stripping algorithm proposed in the literature, and it is still widely used in commercial and open source search engines. Typically, the RSLP algorithm uses a list-based approach to process rules for suffix stripping. In this article, we introduce two suffix stripping approaches for Portuguese stemming. In particular, we propose a hash-based and an automata-based approach, and we assess their efficiency by contrasting them with the state-of-the-art list-based approach. Complexity analysis shows that the automata-based approach is more efficient in time. In addition, experiments on two datasets attest to the efficiency of our approaches: the hash-based and the automata-based approaches outperform the list-based approach, with reductions of up to 65.28% and 86.48% in stemming time, respectively.

Wadson Gomes Ferreira, Willian Antônio dos Santos, Breno Macena Pereira de Souza, Tiago Matta Machado Zaidan, Wladmir Cardoso Brandão

Space-Efficient Detection of Unusual Words

Detecting all the strings that occur in a text more frequently or less frequently than expected according to an IID or a Markov model is a basic problem in string mining, yet current algorithms are based on data structures that are either space-inefficient or incur large slowdowns, and current implementations cannot scale to genomes or metagenomes in practice. In this paper we engineer an algorithm based on the suffix tree of a string to use just a small data structure built on the Burrows-Wheeler transform, and a stack of $$O(\sigma ^2\log ^2 n)$$ bits, where n is the length of the string and $$\sigma $$ is the size of the alphabet. The size of the stack is o(n) except for very large values of $$\sigma $$. We further improve the algorithm by removing its time dependency on $$\sigma $$, by reporting only a subset of the maximal repeats and of the minimal rare words of the string, and by detecting and scoring candidate under-represented strings that do not occur in the string. Our algorithms are practical and work directly on the BWT, thus they can be immediately applied to a number of existing datasets that are available in this form, returning this string mining problem to a manageable scale.

Djamal Belazzougui, Fabio Cunial
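
To make "more or less frequent than expected" concrete, here is a naive scoring sketch of ours under an IID character model, using a binomial z-score as the surprise measure (an assumption; the paper's scoring functions and data structures are different):

    from collections import Counter
    from math import sqrt

    def unusual_words(text, m, threshold=2.0):
        """Score every length-m substring by (observed - expected) / sd,
        where expected follows an IID character model fit on the text.
        Overlapping occurrences make the binomial sd only approximate."""
        n = len(text)
        char_p = {c: k / n for c, k in Counter(text).items()}
        obs = Counter(text[i:i+m] for i in range(n - m + 1))
        scored = []
        for w, o in obs.items():
            p = 1.0
            for c in w:
                p *= char_p[c]
            positions = n - m + 1
            mean = positions * p
            sd = sqrt(positions * p * (1 - p))
            z = (o - mean) / sd if sd > 0 else 0.0
            if abs(z) >= threshold:
                scored.append((w, o, round(mean, 2), round(z, 2)))
        return scored

    # lists over-represented 4-mers such as 'abra'
    print(unusual_words("abracadabraabracadabra", 4))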

Parallel Construction of Succinct Representations of Suffix Tree Topologies

A compressed suffix tree usually consists of three components: a compressed suffix array, a compressed $$\mathsf {LCP}$$-array, and a succinct representation of the suffix tree topology. There are parallel algorithms that construct the suffix array and the $$\mathsf {LCP}$$-array, but none for the third component. In this paper, we present parallel algorithms on shared memory architectures that construct the enhanced balanced parentheses representation ($$\mathsf {BPR}$$). The enhanced $$\mathsf {BPR}$$ is an implicit succinct representation of the suffix tree topology, which supports all navigational operations on the suffix tree. It can also be used to efficiently construct the $$\mathsf {BPS}$$, an explicit succinct representation of the suffix tree topology.

Uwe Baier, Timo Beller, Enno Ohlebusch

Computing the Longest Unbordered Substring

A substring of a string is unbordered if its only border is the empty string. The study of unbordered substrings goes back to the paper of Ehrenfeucht and Silberger [Discr. Math 26 (1979)]. The main focus of their and subsequent papers was to elucidate the relationship between the longest unbordered substring and the minimal period of strings. In this paper, we consider the algorithmic problem of computing the longest unbordered substring of a string. The problem was introduced recently by G. Kucherov et al. [CPM (2015)], where the authors showed that the average-case running time of the simple, border-array based algorithm can be bounded by $$\mathcal {O}(\max \{n, n^2/\sigma ^4\})$$ for $$\sigma $$ being the size of the alphabet. (The worst-case running time remained $$\mathcal {O}(n^2)$$.) Here we propose two algorithms, both presenting substantial theoretical improvements to the result of [11]. The first algorithm has $$\mathcal {O}(n \log n)$$ average-case running time and $$\mathcal {O}(n^2)$$ worst-case running time, and the second algorithm has $$\mathcal {O}(n^{1.5})$$ worst-case running time.

Paweł Gawrychowski, Gregory Kucherov, Benjamin Sach, Tatiana Starikovskaya
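
The simple border-array based algorithm mentioned above is short enough to sketch directly; this quadratic version of ours uses the KMP failure function, whose last entry is 0 exactly for unbordered strings:

    def failure_function(s):
        """f[i] = length of the longest proper border of s[:i+1] (as in KMP)."""
        f, k = [0] * len(s), 0
        for i in range(1, len(s)):
            while k > 0 and s[i] != s[k]:
                k = f[k - 1]
            if s[i] == s[k]:
                k += 1
            f[i] = k
        return f

    def longest_unbordered_substring(w):
        """Quadratic scan: w[i:i+l] is unbordered iff the failure function
        of the suffix w[i:] is 0 at position l-1."""
        best = w[:1]
        for i in range(len(w)):
            f = failure_function(w[i:])
            for l in range(len(f), 0, -1):   # longest unbordered start at i
                if f[l - 1] == 0:
                    if l > len(best):
                        best = w[i:i + l]
                    break
        return best

    print(longest_unbordered_substring("abaababa"))  # -> "aabab"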

Online Self-Indexed Grammar Compression

Although several grammar-based self-indexes have been proposed thus far, their applicability is limited to offline settings where the whole input text is available in advance, so the index structure must be rebuilt whenever additional input arrives, which is often the case in the big data era. In this paper, we present the first online self-indexed grammar compression, named OESP-index, which gradually builds the index structure by reading input characters one by one. This property has the further advantage of saving working space during construction, because input texts do not need to be stored in memory. We experimentally test OESP-index on its ability to build index structures and search query texts, and we show its efficiency, especially its space efficiency in building index structures.

Yoshimasa Takabatake, Yasuo Tabei, Hiroshi Sakamoto

Tight Bound for the Number of Distinct Palindromes in a Tree

For an undirected tree with n edges labelled by single letters, we consider its substrings, which are labels of the simple paths between pairs of nodes. We prove that there are $$\mathcal {O}(n^{1.5})$$ different palindromic substrings. This solves an open problem of Brlek, Lafrenière and Provençal (DLT 2015), who gave a matching lower-bound construction. Hence, we settle the tight bound of $$\Theta (n^{1.5})$$ for the maximum palindromic complexity of trees. For standard strings, i.e., for paths, the palindromic complexity is $$n+1$$.

Paweł Gawrychowski, Tomasz Kociumaka, Wojciech Rytter, Tomasz Waleń

Beyond the Runs Theorem

In [3], a short and elegant proof was presented showing that a word of length n contains at most $$n-3$$ runs. Here we show, using the same technique and a computer search, that the number of runs in a binary word of length n is at most $$\frac{22}{23}n<0.957n$$.

Johannes Fischer, Štěpán Holub, Tomohiro I, Moshe Lewenstein
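
For readers who want to experiment, here is a cubic brute-force run counter of ours following the standard definition of a run as a maximal fragment whose smallest period p satisfies 2p <= length:

    def count_runs(w):
        """Count runs: maximal fragments w[i..j] whose smallest period p
        satisfies 2p <= j - i + 1 (naive cubic check)."""
        n, runs = len(w), 0
        for i in range(n):
            for j in range(i + 1, n):
                length = j - i + 1
                # smallest period of w[i..j] (q = length always qualifies)
                p = next(q for q in range(1, length + 1)
                         if all(w[k] == w[k + q] for k in range(i, j + 1 - q)))
                if 2 * p > length:
                    continue
                # maximality: extending by one position must break period p
                if i > 0 and w[i - 1] == w[i - 1 + p]:
                    continue
                if j + 1 < n and w[j + 1] == w[j + 1 - p]:
                    continue
                runs += 1
        return runs

    print(count_runs("mississippi"))  # -> 4: "ss", "ss", "pp", "ississi"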

Sampling the Suffix Array with Minimizers

Sampling (evenly) the suffixes from the suffix array is an old idea trading pattern search time for reduced index space. A few years ago, Claude et al. presented an alphabet sampling scheme allowing for more efficient pattern searches than the sparse suffix array, for long enough patterns. A drawback of their approach is the requirement that sought patterns contain at least one character from the chosen subalphabet. In this work we propose an alternative suffix sampling approach with only a minimum pattern length as a requirement, which seems more convenient in practice. Experiments show that our algorithm achieves competitive time-space tradeoffs on most standard benchmark data. As a side result, we show that $$n'$$ arbitrarily selected suffixes of a text of length n, where $$n' < n$$, over an integer alphabet, can be sorted in O(n) time using $$O(n')$$ words of space.

Szymon Grabowski, Marcin Raniszewski

Longest Common Prefix with Mismatches

The Longest Common Prefix (LCP) array is a data structure commonly used in combination with the Suffix Array. However, in some settings we are interested in the LCP values per se, since they provide useful information on the repetitiveness of the underlying sequence.

Since sequences can contain alterations, which can be either malicious (plagiarism attempts) or pseudo-random (as in sequencing experiments), when using LCP values to measure repetitiveness it makes sense to allow for a small number of errors. In this paper we formalize this notion by considering the longest common prefix in the presence of mismatches. In particular, we propose an algorithm that computes, for each text suffix, the length of its longest prefix that occurs elsewhere in the text with at most one mismatch. For a sequence of length n, our algorithm uses $$\Theta (n\log n)$$ bits and runs in $$\mathcal {O}(n \text{L}_{ave}\log n/\log \log n)$$ time, where $$\text{L}_{ave}$$ is the average LCP of the input sequence. Although $$\text{L}_{ave}$$ is $$\Theta (n)$$ in the worst case, recent analyses of real-world data show that it usually grows logarithmically with the input size. We then describe and analyse a second algorithm that uses a greedy strategy to reduce the amount of computation, and that can be turned into an even faster algorithm if we allow an additive one-sided error.

Finally, we consider the related problem of computing the 1-mappability of a sequence. In this problem we are asked to compute, for each length-m substring of the input sequence, the number of other substrings which are at Hamming distance one. For this problem we propose an algorithm that takes $$\mathcal {O}(m n \log n/\log \log n)$$ time using $$\Theta (n \log n)$$ bits of space.

Giovanni Manzini
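
A quadratic brute-force version of the first quantity above (ours; the paper's algorithms are far more time- and space-efficient) can be written directly from the definition:

    def lcp_k(a, b, k):
        """Length of the longest common prefix of a and b with <= k mismatches."""
        mism, l = 0, 0
        for x, y in zip(a, b):
            if x != y:
                mism += 1
                if mism > k:
                    break
            l += 1
        return l

    def lcp_one_mismatch(t):
        """For each suffix, the length of its longest prefix occurring
        elsewhere in t with at most one mismatch (naive quadratic)."""
        n = len(t)
        return [max(lcp_k(t[i:], t[j:], 1) for j in range(n) if j != i)
                for i in range(n)]

    print(lcp_one_mismatch("banana"))  # -> [4, 3, 4, 3, 2, 1]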

Evaluating Geographical Knowledge Re-Ranking, Linguistic Processing and Query Expansion Techniques for Geographical Information Retrieval

This paper describes and evaluates the use of Geographical Knowledge Re-Ranking, Linguistic Processing, and Query Expansion techniques to improve Geographical Information Retrieval effectiveness. Geographical Knowledge Re-Ranking is performed with geographical gazetteers and conservative toponym disambiguation techniques that boost the ranking of the geographically relevant documents retrieved by standard state-of-the-art Information Retrieval algorithms. Linguistic Processing is performed in two ways: 1) Part-of-Speech tagging and Named Entity Recognition and Classification are applied to analyze the text collections and topics to detect toponyms; 2) Stemming (Porter's algorithm) and Lemmatization are applied in combination with default stopword filtering. The Query Expansion methods tested are the Bose-Einstein (Bo1) and Kullback-Leibler term weighting models. The experiments have been performed on the English monolingual test collections of the GeoCLEF evaluations (from the years 2005, 2006, 2007, and 2008), using the TF-IDF, BM25, and InL2 Information Retrieval algorithms over unprocessed texts as baselines, both with each GeoCLEF test collection separately (25 topics per evaluation) and with the fusion of all these collections (100 topics). The results of separately evaluating Geographical Knowledge Re-Ranking, Linguistic Processing (lemmatization, stemming, and the combination of both), and Query Expansion with the fusion of all the topics show that all these processes improve the Mean Average Precision (MAP) and R-Precision effectiveness measures in all the experiments, with statistical significance over the baselines in most of them. The best results in MAP and R-Precision are obtained with the InL2 algorithm using the following techniques: Geographical Knowledge Re-Ranking, Lemmatization with Stemming, and Kullback-Leibler Query Expansion. Some configurations with Geographical Knowledge Re-Ranking, Linguistic Processing, and Query Expansion improved on the MAP of the best official results at the GeoCLEF evaluations of 2005, 2006, and 2007.

Daniel Ferrés, Horacio Rodríguez

Improved Practical Compact Dynamic Tries

We consider the problem of implementing a dynamic trie with an emphasis on good practical performance. For a trie with n nodes with an alphabet of size $$\sigma $$, the information-theoretic lower bound is $$n \log \sigma + O(n)$$ bits. The Bonsai data structure [1] supports trie operations in O(1) expected time (based on assumptions about the behaviour of hash functions). While its practical speed performance is excellent, its space usage of $$(1+\epsilon ) n (\log \sigma + O(\log \log n))$$ bits, where $$\epsilon $$ is any constant $$> 0$$, is not asymptotically optimal. We propose an alternative, m-Bonsai, that uses $$(1 + \epsilon ) n (\log \sigma + O(1))$$ bits in expectation, and supports operations in O(1) expected time (again based on assumptions about the behaviour of hash functions). We give a heuristic implementation of m-Bonsai which uses considerably less memory and is slightly faster than the original Bonsai.

Andreas Poyias, Rajeev Raman
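
The hashing idea behind Bonsai-style tries can be sketched with an ordinary dictionary keyed by (parent id, symbol); the real Bonsai and m-Bonsai structures replace this with compact hash tables storing small per-slot values, which this toy of ours does not attempt:

    class HashTrie:
        """Dynamic trie stored as one dictionary: (parent_id, symbol) -> node_id.
        Illustrates the hashing idea behind Bonsai; a compact structure would
        replace the dict with a compact hash table."""

        def __init__(self):
            self.table = {}
            self.next_id = 1          # node 0 is the root

        def add_child(self, node, symbol):
            key = (node, symbol)
            if key not in self.table:
                self.table[key] = self.next_id
                self.next_id += 1
            return self.table[key]

        def insert(self, word):
            node = 0
            for c in word:
                node = self.add_child(node, c)
            return node

        def contains_prefix(self, word):
            node = 0
            for c in word:
                node = self.table.get((node, c))
                if node is None:
                    return False
            return True

    trie = HashTrie()
    for w in ["tree", "trie", "trust"]:
        trie.insert(w)
    print(trie.contains_prefix("tri"), trie.contains_prefix("try"))  # True False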

ShRkC: Shard Rank Cutoff Prediction for Selective Search

In search environments where large document collections are partitioned into smaller subsets (shards), processing the query against only the relevant shards improves search efficiency. The problem of ranking the shards based on their estimated relevance to the query has been studied extensively. However, a related important task, identifying how many of the top-ranked relevant shards should be searched for the query so as to balance the competing objectives of effectiveness and efficiency, has not received much attention. This task of shard rank cutoff estimation is the focus of the presented work. The central premise of the proposed solution is that the number of top shards searched should depend on (1) the query, (2) the given ranking of shards, and (3) the type of search need being served (precision-oriented versus recall-oriented task). An array of features that capture these three factors is defined, and a regression model is induced on these features to learn a query-specific shard rank cutoff estimator. An empirical evaluation using two large datasets demonstrates that the learned shard rank cutoff estimator provides substantial improvements in search efficiency as compared to strong baselines, without degrading search effectiveness.

Anagha Kulkarni

Range LCP Queries Revisited

The Range LCP problem is to preprocess a string $$S[1\dots n]$$ to enable efficient solutions of the following query: given a range [l, r] as the input, report $$\max _{i, j \in \{l,\ldots ,r\}} |\mathsf {LCP}(S_{i}, S_j)|$$. Here $$\mathsf {LCP}(S_i, S_j)$$ is the longest common prefix of the suffixes of S starting at locations i and j, and $$|\mathsf {LCP}(S_i,S_j)|$$ is its length. We study a natural extension of this problem, where the query consists of two ranges. Additionally, we allow a bounded number (say $$k\ge 0$$) of mismatches in the $$\mathsf {LCP}$$ computation. Specifically, our task is to report the following when two ranges $$[\ell _1, r_1]$$ and $$[\ell _2,r_2]$$ come as input: $$\max _{\{\ell _1\le i\le r_1, \ell _2\le j\le r_2\}}|\mathsf {LCP}_k(S_i,S_j)|$$. Here $$\mathsf {LCP}_k(S_i,S_j)$$ is the longest prefix of $$S_i$$ and $$S_j$$ with at most k mismatches allowed. We show that the queries can be answered in O(k) time using an $$O(n^2/w)$$ space data structure, where w is the word size. We also present space-efficient data structures for $$k=0$$ and $$k=1$$. For $$k=0$$, we obtain a linear space data structure with query time $$O(\sqrt{n/w}\log ^{\epsilon } n)$$, where w is the word size and $$\epsilon >0$$ is an arbitrarily small constant. For the case $$k=1$$, we obtain an $$O(n\log n)$$ space data structure with query time $$O(\sqrt{n}\log n)$$.

Finally, we give a reduction from Set Intersection to Range LCP queries, suggesting that it will be very difficult to improve our upper bound by more than a factor of $$O(\log ^{\epsilon }n)$$.

Amihood Amir, Moshe Lewenstein, Sharma V. Thankachan
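
A brute-force baseline for the two-range query with mismatches (ours, using none of the data structures above) is immediate from the definition:

    def lcp_k(s, i, j, k):
        """|LCP_k(S_i, S_j)|: longest common prefix of the suffixes of s
        starting at i and j, with at most k mismatches."""
        n, mism, l = len(s), 0, 0
        while i + l < n and j + l < n:
            if s[i + l] != s[j + l]:
                mism += 1
                if mism > k:
                    break
            l += 1
        return l

    def range_lcp(s, r1, r2, k=0):
        """Naive two-range query: max |LCP_k(S_i, S_j)| over i in r1, j in r2."""
        (l1, h1), (l2, h2) = r1, r2
        return max(lcp_k(s, i, j, k)
                   for i in range(l1, h1 + 1)
                   for j in range(l2, h2 + 1)
                   if i != j)

    s = "abracadabra"
    print(range_lcp(s, (0, 3), (7, 10), k=0))  # -> 4 ("abracadabra" vs "abra")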

Feasibility of Word Difficulty Prediction

We present a machine learning algorithm to predict how difficult a word is for a person with dyslexia. To train the algorithm we used a dataset of words labeled as easy or difficult. The algorithm correctly predicts slightly above 72% of our instances, showing the feasibility of building such a predictive solution for this problem. The main purpose of our work is to be able to weight words in order to perform lexical simplification in texts read by people with dyslexia. Since the main feature used by the classifier, and the only one not computed in constant time, is the number of similar words in a dictionary, we studied the different methods that exist to compute this feature efficiently. This algorithmic comparison is interesting in its own right, and shows that two algorithms can solve the problem in less than a second.

Ricardo Baeza-Yates, Martí Mayo-Casademont, Luz Rello

Backmatter
