
2015 | Book

Similarity Search and Applications

8th International Conference, SISAP 2015, Glasgow, UK, October 12-14, 2015, Proceedings

About this book

This book constitutes the proceedings of the 8th International Conference on Similarity Search and Applications, SISAP 2015, held in Glasgow, UK, in October 2015.

The 19 full papers, 12 short papers, and 9 demo and poster papers presented in this volume were carefully reviewed and selected from 68 submissions. They are organized in topical sections named: improving similarity search methods and techniques; metrics and evaluation; applications and specific domains; implementation and engineering solutions; posters; demo papers.

Table of Contents

Frontmatter

Improving Similarity Search Methods and Techniques

Frontmatter
Approximate Furthest Neighbor in High Dimensions

Much recent work has been devoted to approximate nearest neighbor queries. Motivated by applications in recommender systems, we consider approximate furthest neighbor (AFN) queries. We present a simple, fast, and highly practical data structure for answering AFN queries in high-dimensional Euclidean space. We build on the technique of Indyk (SODA 2003), storing random projections to provide sublinear query time for AFN. However, we introduce a different query algorithm, improving on Indyk's approximation factor and reducing the running time by a logarithmic factor. We also present a variation based on a query-independent ordering of the database points; while this does not have the provable approximation factor of the query-dependent data structure, it offers significant improvement in time and space complexity. We give a theoretical analysis and experimental results.
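
The query-independent variant can be sketched as follows (an illustrative Python sketch under the assumption that points far from the data centroid make good furthest-neighbor candidates for any query; the names and this particular heuristic are ours, not the authors' exact construction):

```python
import math

def furthest_candidates(points, m):
    """Query-independent preprocessing (sketch): rank database points by
    distance from the centroid and keep only the m outermost ones."""
    dim = len(points[0])
    centroid = [sum(p[i] for p in points) / len(points) for i in range(dim)]
    return sorted(points, reverse=True,
                  key=lambda p: math.dist(p, centroid))[:m]

def approx_furthest(candidates, q):
    """At query time, scan only the small precomputed candidate list."""
    return max(candidates, key=lambda p: math.dist(p, q))
```

With m equal to the database size the query becomes exact brute force; smaller m trades accuracy for query time.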

Rasmus Pagh, Francesco Silvestri, Johan Sivertsen, Matthew Skala
Flexible Aggregate Similarity Search in High-Dimensional Data Sets

Numerous applications in different fields, such as spatial databases, multimedia databases, data mining and recommender systems, may benefit from efficient and effective aggregate similarity search, also known as aggregate nearest neighbor (AggNN) search. Given a group of query objects Q, the goal of AggNN is to retrieve the k most similar objects from the database, where the underlying similarity measure is defined as an aggregation (usually sum, avg or max) of the distances between the retrieved objects and every query object in Q. Recently, the problem was generalized so as to retrieve the k objects which are most similar to a fixed proportion of the elements of Q. This variant of aggregate similarity search is referred to as 'flexible AggNN', or FANN. In this work, we propose two approximation algorithms, one for the sum and avg variants of FANN, and the other for the max variant. Extensive experiments are provided showing that, relative to state-of-the-art approaches (both exact and approximate), our algorithms produce query results with good accuracy, while at the same time being very efficient, even for real datasets of very high dimension.
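
The flexible aggregate distance underlying FANN can be written down directly (a hedged sketch; the function names, the brute-force search, and the default phi = 0.5 are illustrative assumptions, not the paper's approximation algorithms):

```python
import math

def fann_distance(x, Q, phi=0.5, agg=sum):
    """Flexible aggregate distance (sketch): aggregate the distances from x
    to only the ceil(phi * |Q|) closest query objects, instead of all of Q."""
    dists = sorted(math.dist(x, q) for q in Q)
    return agg(dists[:math.ceil(phi * len(Q))])

def fann_search(db, Q, k=1, phi=0.5, agg=sum):
    """Exact brute-force FANN: the k database objects with the smallest
    flexible aggregate distance. The paper's contribution is avoiding
    this linear scan approximately."""
    return sorted(db, key=lambda x: fann_distance(x, Q, phi, agg))[:k]
```

With phi = 1 this reduces to the original AggNN definition for the chosen aggregation function.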

Michael E. Houle, Xiguo Ma, Vincent Oria
Similarity Joins and Beyond: An Extended Set of Binary Operators with Order

Similarity joins are troublesome database operators that often produce results much larger than the user really needs or expects. In order to return the similar elements, similarity joins also require sorting during the retrieval process, although order is a concept not supported in the relational model. This paper addresses those two issues by extending the similarity join to a broader set of binary operators that retrieve the most similar pairs and embed the sorting operation only as an internal processing step, so as to comply with relational theory. Additionally, our extension allows exploring another useful condition not previously considered in similarity retrieval: the negation of predicates. Experiments performed on real and synthetic data show that our operators are fast enough to be used in real applications and scale well both for multidimensional and non-dimensional metric data.

Luiz Olmes Carvalho, Lucio F. D. Santos, Willian D. Oliveira, Agma Juci Machado Traina, Caetano Traina Jr.
Diversity in Similarity Joins

With the increasing ability of current applications to produce and consume more complex data, such as images and geographic information, the similarity join has attracted considerable attention. However, this operator does not consider the relationships among the elements in the answer, generating results with many pairs that are similar among themselves, which adds no value to the final answer. Result diversification methods are intended to retrieve elements similar enough to satisfy the similarity conditions while also considering the diversity among the elements in the answer, producing a more heterogeneous result with smaller cardinality, which improves the meaning of the answer. Still, diversity has been studied only when applied to unary operations. In this paper, we introduce the concept of diverse similarity joins: a similarity join operator that ensures smaller, more diversified and useful answers. The experiments performed on real and synthetic datasets show that our proposal allows exploiting diversity in similarity joins without diminishing their performance, while providing elements that cover the same data space distribution as the non-diverse answers.

Lucio F. D. Santos, Luiz Olmes Carvalho, Willian D. Oliveira, Agma J. M. Traina, Caetano Traina Jr.
CDA: Succinct Spaghetti

A pivot table is a popular mechanism for building indexes for similarity queries. Precomputed distances to a set of references are used to filter non-relevant candidates. Every pivot serves as a reference for all, or a proper subset of, the objects in the database. Each pivot filters its share of the database, and the candidate list for a query is the intersection of all the partial lists. The spaghetti data structure is a mechanism to compute the above intersection without performing a sequential scan over the database, and consists of a collection of circular linked lists.

In this paper, we present a succinct version of the spaghetti. The proposed data structure uses less memory and, unlike the original spaghetti, it can compute the intersection using an arbitrary order of the component sets. This latter property enables more sophisticated evaluation heuristics, leading to faster intersection computation.

We present an analysis of its performance, as well as a comprehensive set of experiments in which the new approach proves faster in practice.
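
The pivot filter whose intersection the spaghetti structure accelerates can be sketched as follows (an illustrative sketch; the function names are ours, and a real index would not materialize the partial sets like this):

```python
def build_pivot_table(db, pivots, dist):
    """Precompute d(x, p) for every database object x and every pivot p."""
    return {x: [dist(x, p) for p in pivots] for x in db}

def range_candidates(table, pivots, q, r, dist):
    """For a range query (q, r), the triangle inequality gives the filter
    |d(q,p) - d(x,p)| <= r. Each pivot yields one partial candidate list,
    and the final candidate list is their intersection, which is exactly
    the operation the spaghetti computes without a sequential scan."""
    qd = [dist(q, p) for p in pivots]
    partial = [{x for x, row in table.items() if abs(qd[i] - row[i]) <= r}
               for i in range(len(pivots))]
    return set.intersection(*partial)
```

Candidates surviving the filter must still be verified with real distance computations.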

Edgar Chavez, Ubaldo Ruiz, Eric Tellez
Improving Metric Access Methods with Bucket Files

Modern applications deal with complex data, and retrieval by similarity plays an important role in most of them. Complex data whose primary comparison mechanisms are similarity predicates are usually immersed in metric spaces. Metric Access Methods (MAMs) exploit the metric space properties to divide the space into regions and efficiently process similarity queries, such as range and k-nearest neighbor queries.

Existing MAMs use homogeneous data structures to improve query execution, following the same techniques employed by traditional methods developed to retrieve scalar and multidimensional data. In this paper, we combine hashing and hierarchical ball partitioning to obtain a hybrid index tuned to improve similarity queries over complex data sets, with search algorithms that reduce total execution time by aggressively reducing the number of distance calculations. We applied our technique to the Slim-tree and performed experiments over real data sets showing that the proposed technique reduces the execution time of both range and k-nearest neighbor queries to at least half that of the Slim-tree. Moreover, the technique is general enough to be applied to many existing MAMs.

Ives R. V. Pola, Agma J. M. Traina, Caetano Traina Jr., Daniel S. Kaster
Faster Dual-Tree Traversal for Nearest Neighbor Search

Nearest neighbor search is a nearly ubiquitous problem in computer science. When nearest neighbors are desired for a query set instead of a single query point, dual-tree algorithms often provide the fastest solution, especially in low-to-medium dimensions (i.e. up to a hundred or so), and can give exact results or absolute approximation guarantees, unlike hashing techniques. Using a recent decomposition of dual-tree algorithms into modular pieces, we propose a new piece: an improved traversal strategy; it is applicable to any dual-tree algorithm. Applied to nearest neighbor search using both kd-trees and ball trees, the new strategy demonstrably outperforms the previous fastest approaches. Other problems the traversal may easily be applied to include kernel density estimation and max-kernel search.

Ryan R. Curtin
Optimizing the Distance Computation Order of Multi-Feature Similarity Search Indexing

Multi-feature search is an effective approach to similarity search. Unfortunately, the search efficiency decreases with the number of features. Several indexing approaches aim to achieve efficiency by incrementally reducing the approximation error of aggregated distance bounds. They apply heuristics to determine the distance computation order and update the object's aggregated bounds after each computation. However, the existing indexing approaches suffer from several drawbacks. They use the same computation order for all objects, do not support important types of aggregation functions, and do not take the varying CPU and I/O costs of different distance computations into account. To resolve these problems, we introduce a new heuristic to determine an efficient distance computation order for each individual object. Our heuristic supports various important aggregation functions and calculates cost-benefit ratios to incorporate the varying computation costs of different distance functions. The experimental evaluation reveals that our heuristic outperforms state-of-the-art approaches in terms of the number of distance computations as well as search time.

Marcel Zierenberg, Ingo Schmitt
Dynamic Permutation Based Index for Proximity Searching

Proximity searching consists in retrieving objects from a dataset that are similar to a given query. This kind of search is an elementary task in different areas, for instance pattern recognition or artificial intelligence. To solve this problem, it is usual to use a metric index. The permutation based index (PBI) is an unbeatable metric technique which needs just a few bits for each object in the index. In this paper, we present a dynamic version of the PBI, which supports insertions, deletions and updates, and keeps the effectiveness of the original technique.
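
The permutation signature at the core of a PBI can be sketched as follows (a minimal illustration of the general PBI idea, not this paper's dynamic variant; the Spearman footrule is one common way to compare signatures):

```python
def permutation(x, pivots, dist):
    """The PBI signature of x: pivot indices sorted by increasing distance."""
    return sorted(range(len(pivots)), key=lambda i: dist(x, pivots[i]))

def footrule(pa, pb):
    """Spearman footrule between two permutations: total displacement of
    each pivot's rank. Close objects tend to see the pivots in a similar
    order, so a small footrule is used as a proxy for proximity."""
    pos = {p: i for i, p in enumerate(pb)}
    return sum(abs(i - pos[p]) for i, p in enumerate(pa))
```

Storing only ranks (a few bits per pivot) rather than distances is what keeps the index so compact.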

Karina Figueroa, Rodrigo Paredes
Finding Near Neighbors Through Local Search

Proximity searching can be formulated as an optimization problem whose goal is to find the object minimizing the distance to a given query by traversing a graph with a greedy algorithm. This formulation can be traced back to early approaches defined for vector spaces, and to more recent ones defined for the more general setup of metric spaces.

In this paper we introduce three searching algorithms that generalize the greedy strategy to other forms of local search, and experimentally show that our approach significantly improves on the state of the art. In particular, our contributions offer excellent trade-offs among speed, recall and memory usage, making our algorithms suitable for real-world applications. As a byproduct, we present an open-source implementation of most of the near neighbor search algorithms in the literature.
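
The baseline greedy formulation reads roughly as follows (a sketch assuming every vertex has at least one neighbor; the paper's contribution replaces this step with richer local-search strategies to escape local minima):

```python
def greedy_walk(graph, dist, q, start):
    """Greedy graph traversal for proximity search: repeatedly move to the
    neighbor closest to the query q, and stop at a local minimum."""
    current = start
    while True:
        nxt = min(graph[current], key=lambda v: dist(v, q))
        if dist(nxt, q) >= dist(current, q):
            return current  # no neighbor improves: local minimum reached
        current = nxt
```

On a well-connected graph the local minimum returned is often the true nearest neighbor, which is what makes this formulation competitive.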

Guillermo Ruiz, Edgar Chávez, Mario Graff, Eric S. Téllez

Metrics and Evaluation

Frontmatter
When Similarity Measures Lie

Do similarity or distance measures ever go wrong? The inherent subjectivity in similarity discernment has long supported the view that all judgements of similarity are equally valid, and that any selected similarity measure may only be considered more effective in some chosen domain. This paper presents evidence that such a view is incorrect for structural similarity comparisons. Similarity and distance measures occasionally do go wrong, and produce judgements that can be considered as errors in judgement. This claim is supported by a novel method for assessing the quality of similarity and distance functions, which is based on relative scale of similarity with respect to chosen reference objects. The method may be applied in any domain, and is demonstrated for common measures of structural similarity in graphs. Finally, the paper identifies three distinct kinds of relative similarity judgement errors, and shows how the distribution of these errors is related to graph properties under common similarity measures.

Kevin A. Naudé, Jean H. Greyling, Dieter Vogts
An Empirical Evaluation of Intrinsic Dimension Estimators

We study the practical behavior of different algorithms that aim to estimate the intrinsic dimension (ID) in metric spaces. Some of these algorithms were specifically developed to evaluate the complexity of searching in metric spaces, based on different theories related to the distribution of distances between objects in such spaces. Others were originally designed for vector spaces only, and have been extended to general metric spaces. To empirically evaluate how well the various ID estimates reflect the actual difficulty of searching in metric spaces, we compare one representative of each of the broadest families of metric indices: those based on pivots and those based on compact partitions. Our preliminary conclusion is that Fastmap and the measure called Intrinsic Dimensionality best fit their purpose.

Cristian Bustos, Gonzalo Navarro, Nora Reyes, Rodrigo Paredes
A Belief Framework for Similarity Evaluation of Textual or Structured Data

This paper identifies major shortcomings of the Levenshtein distance, the longest common subsequence (LCS) method, and other general approaches to finding common parts, including the unjustified fragmentation of selected parts, the lack of sensitivity to transposition of large blocks, and the absence of mechanisms to prevent accidental matches. Belief function theory leads to a flexible framework for similarity evaluation. The framework is aimed at new similarity models which are free of the described shortcomings and can be computed effectively. A sketch of a better sequence alignment algorithm illustrates the framework's utility.
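
For reference, the Levenshtein distance under discussion is the classic dynamic program:

```python
def levenshtein(a, b):
    """Edit distance with unit-cost insertions, deletions and substitutions,
    one of the measures whose shortcomings the paper analyses.
    O(len(a) * len(b)) time, O(len(b)) space."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        cur = [i]
        for j, cb in enumerate(b, 1):
            cur.append(min(prev[j] + 1,               # delete ca
                           cur[j - 1] + 1,            # insert cb
                           prev[j - 1] + (ca != cb))) # match / substitute
        prev = cur
    return prev[-1]
```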

Sergej Znamenskij
Similarity of Attributed Generalized Tree Structures: A Comparative Study

In our earlier attributed generalized tree (AGT) structures, vertex labels (as types) and edge labels (as attributes) embody semantic information, while edge weights express assessments regarding the (percentage-)relative importance of the attributes, a kind of pragmatic information. Our AGT similarity algorithm has been applied to e-Health, e-Business, and insurance underwriting. In this paper, we compare similarity computed by our AGT algorithm with the similarities obtained using: (a) a weighted tree similarity algorithm (WT), (b) graph edit distance (GED) based similarity measure, (c) maximum common subgraph (MCS) algorithm, and (d) a generalized tree similarity algorithm (GT). It is shown that small changes in tree structures may lead to undesirably large similarity changes using WT. Further, GT is found to be not applicable to AGT structures containing semantic as well as pragmatic information. GED and MCS cannot differentiate AGT structures with edges having different directions, lengths, or weights, all taken into account by our AGT algorithm.

Mahsa Kiani, Virendrakumar C. Bhavsar, Harold Boley
Evaluating Multilayer Multimedia Exploration

Multimedia exploration is an entertaining approach to multimedia retrieval, enabling users to interactively browse and navigate through multimedia collections in a content-based way. The multimedia exploration approach extends the traditional query-by-example retrieval scenario into a more intuitive way of obtaining a global overview of an explored collection. However, novel exploration scenarios require many user studies demonstrating their benefits. In this paper, we present the results of an extensive user study comparing a 3-layer Multilayer Exploration Structure (MLES) with standard flat k-NN browsing. The results of the user study show that the principles of the MLES lead to better effectiveness of the exploration process, especially when searching for a first object of the searched concept in an unknown collection.

Juraj Moško, Jakub Lokoč, Tomáš Grošup, Přemysl Čech, Tomáš Skopal, Jan Lánský
Semantic Similarity Between Images: A Novel Approach Based on a Complex Network of Free Word Associations

Several measures exist to describe similarities between digital contents, especially images. Nevertheless, distances based on low-level visual features embedded in a multidimensional linear space are hardly suitable for capturing semantic similarities, and recently novel techniques have been introduced making use of hierarchical knowledge bases. While successfully exploited in specific contexts, the human perception of similarity cannot be easily encoded in such rigid structures. In this paper we propose to represent a knowledge base of semantic concepts as a complex network whose topology arises from free conceptual associations and is markedly different from a hierarchical structure. Images are anchored to relevant semantic concepts through an annotation process, and similarity is computed following the related paths in the complex network. We finally show how this definition of semantic similarity is not necessarily restricted to images, but can be extended to compute distances between different types of sensorial information, such as pictures and sounds, modeling the human ability to experience synaesthesia.

Enrico Palumbo, Walter Allasia

Applications and Specific Domains

Frontmatter
Vector-Based Similarity Measurements for Historical Figures

Historical interpretation benefits from identifying analogies among famous people: Who are the Lincolns, Einsteins, Hitlers, and Mozarts? We investigate several approaches to convert approximately 600,000 historical figures into vector representations to quantify similarity according to their Wikipedia pages. We adopt an effective reference standard based on the number of human-annotated Wikipedia categories being shared and use this to demonstrate the performance of our similarity detection algorithms. In particular, we investigate four different unsupervised approaches to representing the semantic associations of individuals: (1) TF-IDF, (2) weighted averages of distributed word embeddings, (3) LDA topic analysis and (4) DeepWalk embeddings from page links. All proved effective, but DeepWalk embeddings yielded an overall accuracy of 91.33% in our evaluation to uncover historical analogies. Combining LDA and DeepWalk yielded even higher performance.

Yanqing Chen, Bryan Perozzi, Steven Skiena
Efficient Approximate 3-Dimensional Point Set Matching Using Root-Mean-Square Deviation Score

In this paper, we study the approximate point subset match (APSM) problem with minimum RMSD score under translation, rotation, and one-to-one correspondence in d dimensions. Since this problem seems computationally much harder than the previously studied APSM problems with translation only or distance evaluation only, we focus on speeding up exhaustive search algorithms that can find all approximate matches. First, we present an efficient branch-and-bound algorithm using a novel lower bound function on the minimum RMSD score. Next, we present another algorithm that runs fast with high probability when a set of parameters is fixed. Experimental results on real 3-D molecular data sets show that our branch-and-bound algorithm achieves significant speed-up over the naive algorithm while still retaining the advantage of generating all answers.
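
The minimum-RMSD score being optimized can be illustrated in 2-D, where the optimal rotation angle has a closed form (the paper's 3-D setting uses the analogous SVD-based Kabsch solution; the one-to-one correspondence, which the search must otherwise discover, is assumed given here):

```python
import math

def min_rmsd_2d(P, Q):
    """Minimum RMSD between paired 2-D point sets under translation and
    rotation. Centering removes translation; the optimal angle is
    atan2 of the cross- and dot-product sums of the centered pairs."""
    n = len(P)
    cp = (sum(x for x, _ in P) / n, sum(y for _, y in P) / n)
    cq = (sum(x for x, _ in Q) / n, sum(y for _, y in Q) / n)
    P = [(x - cp[0], y - cp[1]) for x, y in P]
    Q = [(x - cq[0], y - cq[1]) for x, y in Q]
    s = sum(px * qy - py * qx for (px, py), (qx, qy) in zip(P, Q))
    c = sum(px * qx + py * qy for (px, py), (qx, qy) in zip(P, Q))
    t = math.atan2(s, c)  # optimal rotation angle
    err = sum((math.cos(t) * px - math.sin(t) * py - qx) ** 2 +
              (math.sin(t) * px + math.cos(t) * py - qy) ** 2
              for (px, py), (qx, qy) in zip(P, Q))
    return math.sqrt(err / n)
```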

Yoichi Sasaki, Tetsuo Shibuya, Kimihito Ito, Hiroki Arimura
Face Image Retrieval Revisited

The objective of face retrieval is to efficiently search an image database with detected faces and identify such faces that belong to the same person as a query face. Unlike most related papers, we concentrate on both retrieval effectiveness and efficiency. High retrieval effectiveness is achieved by proposing a new fusion approach which integrates existing state-of-the-art detection as well as matching methods. We further significantly improve retrieval quality by employing the concept of multi-face queries along with optional relevance feedback. To be able to efficiently process queries on databases with millions of faces, we apply a specialized indexing algorithm. The proposed solutions are compared against four existing open-source and commercial technologies and experimentally evaluated on the standardized FERET dataset and on a real-life dataset of more than one million face images. The retrieval results demonstrate a significant gain in effectiveness and two orders of magnitude more efficient query processing, with respect to a single technology executed sequentially.

Jan Sedmidubsky, Vladimir Mic, Pavel Zezula
Semiautomatic Learning of 3D Objects from Video Streams

Object detection and recognition are classical problems in computer vision, but are still challenging without a priori knowledge of objects and with limited user interaction. In this work, a semiautomatic system for visual object learning from a video stream is presented. The system detects movable foreground objects relying on FAST interest points. Once a view of an object has been segmented, the system relies on ORB features to create its descriptor, store it, and compare it with descriptors of previously seen views. To this end, a visual similarity function based on the geometric consistency of the local features is used. The system groups similar views of the same object into clusters, relying on the transitivity of similarity among them. Each cluster identifies a 3D object, and the system learns to autonomously recognize a particular view by assessing its cluster membership. When ambiguities arise, the user is asked to validate the membership assignments. Experiments have demonstrated the ability of the system to group together unlabeled views, reducing the labeling work of the user.

Fabio Carrara, Fabrizio Falchi, Claudio Gennaro
Banknote Recognition as a CBIR Problem

Automatic banknote recognition is an important aid for visually impaired users, providing complementary evidence to tactile perception. In this paper we propose a framework for banknote recognition based on a traditional content-based image retrieval pipeline: given a test image, we first extract SURF features, then adopt a Bag of Features representation, and finally associate the image with the banknote amount that ranked best according to a similarity measure of choice. Compared with previous works in the literature, our method is simple, computationally efficient, and does not require a banknote detection stage. In order to validate the effectiveness and robustness of the proposed approach, we have collected several datasets of Euro banknotes under a variety of conditions, including partial occlusion, cluttered background, and rotation, viewpoint, and illumination changes. We report a comparative analysis of different image descriptors and similarity measures and show that the proposed scheme achieves high recognition rates even in rather challenging circumstances. In particular, Bag of Features combined with the L2 distance appears to be the best combination for the problem at hand, and performance does not degrade if a dimensionality reduction step is applied.

Joan Sosa-García, Francesca Odone
Efficient Image Search with Neural Net Features

We present an efficiency evaluation of similarity search techniques applied to visual features from deep neural networks. Our test collection consists of 20 million 4096-dimensional descriptors (320 GB of data). We test approximate k-NN search using several techniques, specifically the FLANN library (a popular in-memory implementation of a k-d tree forest), the M-Index (which uses recursive Voronoi partitioning of a metric space), and PPP-Codes, which work with memory codes of metric objects and use disk storage for candidate refinement. Our evaluation shows that as long as the data fit in main memory, FLANN and the M-Index have practically the same ratio between precision and response time. The PPP-Codes identify candidate sets ten times smaller than the other techniques, and the response times are around 500 ms for the whole 20M dataset stored on disk. The visual search with this index is available as an online demo application. The collection of 20M descriptors is provided as a public dataset to the academic community.

David Novak, Jan Cech, Pavel Zezula
Textual Similarity for Word Sequences

In this work, we introduce new kinds of sentence similarity, called Euclid similarity and Levenshtein similarity, to capture both word sequences and semantic aspects. This is especially useful for Semantic Textual Similarity (STS), allowing us to retrieve SNS texts, short sentences, or text that includes collocations. We show the usefulness of our approach through experimental results.

Fumito Konaka, Takao Miura
Motion Images: An Effective Representation of Motion Capture Data for Similarity Search

The rapid development of motion capture technologies has led to massive usage of human motion data in a variety of fields, such as computer animation, the gaming industry, medicine, sports and security. These technologies produce large volumes of complex spatio-temporal data which need to be effectively compared on the basis of similarity. In contrast to the traditional way of extracting numerical features, we propose a new idea: transform complex motion data into RGB images and compare them by content-based image retrieval methods. We see these images not only as human-understandable visualizations of motion characteristics (e.g., speed, duration and movement repetitions), but also as descriptive features, for their ability to preserve key aspects of the performed motions. To demonstrate the usability of this idea, we report a preliminary experiment that classifies 1,034 motions into 14 categories with 87.4% precision.

Petr Elias, Jan Sedmidubsky, Pavel Zezula

Implementation and Engineering Solutions

Frontmatter
Brute-Force k-Nearest Neighbors Search on the GPU

We present a brute-force approach for finding k-nearest neighbors on the GPU for many queries in parallel. Our program takes advantage of recent advances in fundamental GPU computing primitives. We modify a matrix multiplication subroutine in the MAGMA library [6] to calculate the squared Euclidean distances between queries and references. The nearest neighbors selection is accomplished by a truncated merge sort built on top of sorting and merging functions in the Modern GPU library [3]. Compared to state-of-the-art approaches, our program is faster and handles larger inputs. For instance, we can find the 1000 nearest neighbors among 1 million 64-dimensional reference points at a rate of about 435 queries per second.
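
The distance-via-matrix-multiplication identity the approach builds on can be sketched in NumPy (an illustration only, not the MAGMA/Modern GPU implementation; argpartition stands in for the truncated merge sort):

```python
import numpy as np

def pairwise_sq_dists(Q, R):
    """All query-reference squared Euclidean distances from one matrix
    multiplication, using ||q - r||^2 = ||q||^2 + ||r||^2 - 2 (q . r)."""
    qq = (Q * Q).sum(axis=1)[:, None]   # m x 1 query norms
    rr = (R * R).sum(axis=1)[None, :]   # 1 x n reference norms
    return qq + rr - 2.0 * (Q @ R.T)    # m x n distance matrix

def knn(Q, R, k):
    """Brute-force k-NN: full distance matrix, then a partial selection of
    the k smallest entries per row, sorted by distance."""
    D = pairwise_sq_dists(Q, R)
    idx = np.argpartition(D, k - 1, axis=1)[:, :k]
    order = np.argsort(np.take_along_axis(D, idx, axis=1), axis=1)
    return np.take_along_axis(idx, order, axis=1)
```

Casting the dominant cost as a matrix multiplication is what lets highly tuned GEMM kernels do most of the work.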

Shengren Li, Nina Amenta
Regrouping Metric-Space Search Index for Search Engine Size Adaptation

This work contributes to the development of search engines that self-adapt their size in response to fluctuations in workload. Deploying a search engine in an Infrastructure as a Service (IaaS) cloud facilitates allocating or deallocating computational resources to or from the engine. In this paper, we focus on the problem of regrouping the metric-space search index when the number of virtual machines used to run the search engine is modified to reflect changes in workload. We propose an algorithm for incrementally adjusting the index to fit the varying number of virtual machines. We tested its performance using a custom-built prototype search engine deployed in the Amazon EC2 cloud, while calibrating the results to compensate for the performance fluctuations of the platform. Our experiments show that, when compared with computing the index from scratch, the incremental algorithm speeds up the index computation 2–10 times while maintaining a similar search performance.

Khalil Al Ruqeishi, Michal Konečný
Improving Parallel Processing of Matrix-Based Similarity Measures on Modern GPUs

Dynamic programming techniques are well established and employed by various practical algorithms used as similarity measures, for instance the edit distance algorithm or the dynamic time warping algorithm. These algorithms usually operate in an iteration-based fashion where new values are computed from the values of the previous iteration, so they cannot be processed by simple data-parallel approaches. In this paper, we propose a way to utilize the computational power of massively parallel GPUs to compute dynamic programming algorithms effectively and efficiently. We address both the problem of computing one distance on large inputs concurrently and the problem of computing a large number of distances simultaneously (e.g., when a similarity query is being resolved).

Martin Kruliš, David Bednárek, Michal Brabec
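
The iteration dependency described above is visible in the textbook DTW recurrence (a sequential reference sketch, not the parallel GPU kernel):

```python
def dtw(a, b):
    """Dynamic time warping with absolute-difference local cost. Each cell
    depends on the previous row and on the previous cell of the current
    row, which is why simple data-parallel evaluation fails; parallel
    implementations typically process anti-diagonals instead."""
    inf = float("inf")
    prev = [0.0] + [inf] * len(b)
    for x in a:
        cur = [inf]
        for j, y in enumerate(b, 1):
            cur.append(abs(x - y) + min(prev[j], prev[j - 1], cur[j - 1]))
        prev = cur
    return prev[-1]
```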
Time Series Subsequence Similarity Search Under Dynamic Time Warping Distance on the Intel Many-core Accelerators

Subsequence similarity search is one of the most important problems in time series data mining. Nowadays there is empirical evidence that Dynamic Time Warping (DTW) is the best distance measure for many applications. However, in spite of sophisticated software speedup techniques, DTW is still computationally expensive. There are studies devoted to accelerating the DTW computation by means of parallel hardware (e.g. computer clusters, multi-core CPUs, FPGAs and GPUs). In this paper we present an approach to accelerating subsequence similarity search based on the DTW distance using the Intel Many Integrated Core architecture. Experimental evaluation on synthetic and real data sets confirms the efficiency of the approach.

Aleksandr Movchan, Mikhail Zymbler
Subspace Nearest Neighbor Search - Problem Statement, Approaches, and Discussion
Position Paper

Computing the similarity between objects is a central task for many applications in the fields of information retrieval and data mining. For finding k-nearest neighbors, a ranking is typically computed based on a predetermined set of data dimensions and a distance function, constant over all possible queries. However, many high-dimensional feature spaces contain a large number of dimensions, many of which may contain noisy, irrelevant, redundant, or contradicting information. More specifically, the relevance of dimensions may depend on the query object itself, and in general, different dimension sets (subspaces) may be appropriate for different queries. Approaches for feature selection or weighting typically provide a global subspace selection, which may not be suitable for all possible queries. In this position paper, we frame a new research problem, called subspace nearest neighbor search, aiming at multiple query-dependent subspaces for nearest neighbor search. We describe relevant problem characteristics, relate our problem to existing approaches, and outline potential research directions.

Michael Hund, Michael Behrisch, Ines Färber, Michael Sedlmair, Tobias Schreck, Thomas Seidl, Daniel Keim
Query-Based Improvement Procedure and Self-Adaptive Graph Construction Algorithm for Approximate Nearest Neighbor Search

The nearest neighbor search problem has been well known since the 1960s, and many approaches have been proposed. One is to build a graph over the set of objects from a given database and use a greedy walk as the basis for a search algorithm. If the greedy walk is able to find the nearest neighbor in the graph starting from any vertex in a small number of steps, such a graph is called a navigable small world. In this paper we propose a new algorithm for building graphs with navigable small world properties. The main advantage of the proposed algorithm is that it is free of input parameters and is able to adapt on the fly to any changes in the data distribution. The algorithm is based on the idea of removing local minima by adding new edges. We realize this idea to improve the search properties of the structure by using the set of queries issued during the execution stage. An empirical study of the proposed algorithm and a comparison with previous works are reported in the paper.

Alexander Ponomarenko
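The greedy walk that the abstract above builds on can be sketched in a few lines. This is the generic greedy-descent primitive, not Ponomarenko's construction algorithm itself; the graph and distance setup are illustrative assumptions:

```python
import math

def greedy_walk(graph, coords, query, start):
    """Greedy search: from `start`, repeatedly move to the neighbor closest
    to `query`; stop when no neighbor improves (a local minimum).
    `graph` maps vertex -> list of neighbors; `coords` maps vertex -> point.
    The paper's idea is to detect when such a local minimum is not the true
    nearest neighbor and repair the graph by adding an edge."""
    def dist(v):
        return math.dist(coords[v], query)
    current = start
    while True:
        best = min(graph[current], key=dist, default=current)
        if dist(best) >= dist(current):
            return current  # local minimum reached
        current = best

# Toy path graph: the walk proceeds 0 -> 1 -> 2 -> 3 toward the query.
coords = {0: (0, 0), 1: (1, 0), 2: (2, 0), 3: (3, 0)}
graph = {0: [1], 1: [0, 2], 2: [1, 3], 3: [2]}
print(greedy_walk(graph, coords, (3.1, 0), 0))  # 3
```

In a navigable small world, long-range edges keep the number of such greedy steps small regardless of the start vertex.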

Posters

Frontmatter
Is There a Free Lunch for Image Feature Extraction in Web Applications?

Feature extraction is one of the essential parts of multimedia indexing in similarity search and content-based retrieval methods. Most applications that employ these methods also implement their client-side interface using web technologies. The World Wide Web has become a well-established platform for distributed software, and virtually all personal computers, tablets, and smartphones are equipped with a web browser. In the past, most applications employed a strict client-server approach, where the client part (running in the browser) handles only the user interface and the server side handles data storage and business logic. However, client-side technologies leaped forward with the new HTML5 standard, and the web browser has become capable of handling much more complex tasks. In this paper, we propose a model in which multimedia indexing is handled at the client side, reducing the computational power the server needs to run a web application that manages a large multimedia database. We have implemented an in-browser image feature extractor and compared its performance with a server-side implementation.

Martin Kruliš
On the Use of Similarity Search to Detect Fake Scientific Papers

Fake scientific papers have recently become of interest within the academic community as a result of the identification of fake papers in the digital libraries of major academic publishers [8]. Detecting and removing these papers is important for many reasons. We describe an investigation into the use of similarity search for detecting fake scientific papers by comparing several methods for signature construction and similarity scoring, and describe a pseudo-relevance feedback technique that can be used to improve the effectiveness of these methods. Experiments on a dataset of 40,000 computer science papers show that precision, recall and MAP scores of 0.96, 0.99 and 0.99, respectively, can be achieved, demonstrating the usefulness of similarity search in detecting fake scientific papers and ranking them highly.

Kyle Williams, C. Lee Giles
Reducing Hubness for Kernel Regression

In this paper, we point out that hubness (some samples in a high-dimensional dataset emerge as hubs that are similar to many other samples) influences the performance of kernel regression. Because the dimension of feature spaces induced by kernels is usually very high, hubness occurs, giving rise to the problem of multicollinearity, which is known as a cause of instability of regression results. We propose hubness-reduced kernels for kernel regression as an extension of a previous approach for kNN classification that reduces spatial centrality to eliminate hubness.

Kazuo Hara, Ikumi Suzuki, Kei Kobayashi, Kenji Fukumizu, Miloš Radovanović
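One standard way to reduce spatial centrality in a kernel-induced feature space is to center the Gram matrix, which moves the data mean to the origin. This is only a sketch of the general idea the abstract invokes; the paper's hubness-reduced kernels are a specific construction that may differ from plain centering:

```python
def center_kernel(K):
    """Center a kernel (Gram) matrix K in feature space:
    K'[i][j] = K[i][j] - row_mean[i] - col_mean[j] + grand_mean.
    Centering reduces the spatial centrality that lets points near the
    data mean become hubs."""
    n = len(K)
    row = [sum(K[i]) / n for i in range(n)]
    col = [sum(K[i][j] for i in range(n)) / n for j in range(n)]
    grand = sum(row) / n
    return [[K[i][j] - row[i] - col[j] + grand for j in range(n)]
            for i in range(n)]

# Example: after centering, every row (and column) of the matrix sums to 0.
Kc = center_kernel([[1.0, 0.5], [0.5, 1.0]])
print(Kc)  # [[0.25, -0.25], [-0.25, 0.25]]
```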

Demo Papers

Frontmatter
FELICITY: A Flexible Video Similarity Search Framework Using the Earth Mover’s Distance

In this paper, we demonstrate our novel system, called FELICITY, which efficiently processes user-adaptive content-based k-nearest-neighbor queries by utilizing video signatures and the Earth Mover's Distance (EMD). To this end, we implement an optimal multi-step query processing algorithm that approximates the EMD between any two video signatures with lower-bounding filter techniques. The system enables users to adapt query parameters and interact with the system, helping them find the best parameter combination for the desired similarity tasks. Moreover, our system incorporates a user-friendly visualization interface for the EMD flow between video signatures, providing an intuitive understanding of the EMD and video similarity.

Merih Seran Uysal, Christian Beecks, Daniel Sabinasz, Thomas Seidl
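The optimal multi-step query processing the abstract above mentions follows the classic filter-and-refine pattern: scan candidates in ascending lower-bound order and evaluate the expensive exact distance only while the bound can still beat the current k-th best. A minimal sketch with scalar stand-ins for video signatures; the real system uses EMD lower bounds, and `filter_and_refine` is a hypothetical name:

```python
def filter_and_refine(query, candidates, lower_bound, exact_dist, k=1):
    """Optimal multi-step kNN: visit candidates by ascending lower bound;
    stop once the next lower bound exceeds the k-th best exact distance,
    since all remaining candidates can only be farther."""
    order = sorted(candidates, key=lambda c: lower_bound(query, c))
    results = []  # (exact_distance, candidate), sorted, at most k entries
    for c in order:
        if len(results) == k and lower_bound(query, c) > results[-1][0]:
            break  # safe early termination
        results.append((exact_dist(query, c), c))
        results.sort()
        results = results[:k]
    return results

# Stand-in example: exact distance |q - c|, with the cheap (and valid)
# lower bound |q - c| / 2. Only two exact distances get computed.
res = filter_and_refine(0, [1, 5, 9, 2],
                        lambda q, c: abs(q - c) / 2,
                        lambda q, c: abs(q - c), k=1)
print(res)  # [(1, 1)]
```

The tighter the lower bound, the earlier the scan terminates; that trade-off is what the system lets users explore interactively.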
Searching the EAGLE Epigraphic Material Through Image Recognition via a Mobile Device

This demonstration paper describes the mobile application developed by the EAGLE project to increase the use and visibility of its epigraphic material. The EAGLE project (European network of Ancient Greek and Latin Epigraphy) is gathering a comprehensive collection of inscriptions (about 80% of the surviving material) and making it accessible through a user-friendly portal, which supports searching and browsing of the epigraphic material. To increase the usefulness and visibility of its content, EAGLE has also developed a mobile application that enables tourists and scholars to obtain detailed information about the inscriptions they are looking at by taking pictures with their smartphones and sending them to the EAGLE portal for recognition. We describe the EAGLE mobile application and give an outline of its features and its architecture.

Paolo Bolettieri, Vittore Casarosa, Fabrizio Falchi, Lucia Vadicamo, Philippe Martineau, Silvia Orlandi, Raffaella Santucci
Backmatter
Metadata
Title
Similarity Search and Applications
Edited by
Giuseppe Amato
Richard Connor
Fabrizio Falchi
Claudio Gennaro
Copyright Year
2015
Electronic ISBN
978-3-319-25087-8
Print ISBN
978-3-319-25086-1
DOI
https://doi.org/10.1007/978-3-319-25087-8