2014 | Book

Similarity Search and Applications

7th International Conference, SISAP 2014, Los Cabos, Mexico, October 29-31, 2014. Proceedings

Edited by: Agma Juci Machado Traina, Caetano Traina Jr., Robson Leonardo Ferreira Cordeiro

Publisher: Springer International Publishing

Book Series: Lecture Notes in Computer Science


About this book

This book constitutes the refereed proceedings of the 7th International Conference on Similarity Search and Applications, SISAP 2014, held in Los Cabos, Mexico, in October 2014. The 21 full papers and 6 short papers presented were carefully reviewed and selected from 45 submissions. The papers are organized in topical sections on Improving Similarity Search Methods and Techniques; Indexing and Applications; Metrics and Evaluation; New Scenarios and Approaches; and Applications and Specific Domains.

Table of Contents

Frontmatter

Improving Similarity Search Methods and Techniques

Efficient Algorithms for Similarity Search in Axis-Aligned Subspaces
Abstract
Many applications — such as content-based image retrieval, subspace clustering, and feature selection — may benefit from efficient subspace similarity search. Given a query object, the goal of subspace similarity search is to retrieve the most similar objects from the database, where the similarity distance is defined over an arbitrary subset of dimensions (or features) — that is, an arbitrary axis-aligned projective subspace. Though much effort has been spent on similarity search in fixed subspaces, relatively little attention has been given to the problem of similarity search when the dimensions are specified at query time. In this paper, we propose several new methods for the subspace similarity search problem. Extensive experiments are provided showing very competitive performance relative to state-of-the-art solutions.
Michael E. Houle, Xiguo Ma, Vincent Oria, Jichao Sun
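To make the query semantics concrete, here is a minimal brute-force sketch of subspace similarity search: the distance is ordinary Euclidean, but restricted to dimensions chosen at query time. This is only the problem definition as a baseline, not any of the paper's indexing methods; all names are illustrative.

```python
import numpy as np

def subspace_knn(db, query, dims, k):
    """Brute-force k-NN where the Euclidean distance is restricted to an
    axis-aligned subspace (`dims`) chosen at query time."""
    proj = db[:, dims]                        # project the database onto the query dims
    d = np.linalg.norm(proj - query[dims], axis=1)
    return np.argsort(d)[:k]                  # indices of the k most similar objects

# e.g. subspace_knn(X, q, dims=[0, 3, 7], k=10) searches only features 0, 3, and 7
```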
Partial Refinement for Similarity Search with Multiple Features
Abstract
Filter refinement is an efficient and flexible indexing approach to similarity search with multiple features. However, the conventional refinement phase has one major drawback: when an object is refined, the partial distances to the query object are computed for all features. This frequently leads to more distance computations being executed than necessary to exclude an object. To address this problem, we introduce partial refinement, a simple, yet efficient improvement of the filter refinement approach. It incrementally replaces partial distance bounds with exact partial distances and updates the aggregated bounds accordingly each time. This enables us to exclude many objects before all of their partial distances have been computed exactly. Our experimental evaluation illustrates that partial refinement significantly reduces the number of required distance computations and the overall search time in comparison to conventional refinement and other state-of-the-art techniques.
Marcel Zierenberg
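The abstract's key idea (replace one feature's distance bound by its exact partial distance at a time, and stop as soon as the aggregated bound already excludes the object) can be sketched as follows. This is a hedged reading of the abstract for a range query with a plain sum as the aggregation; `lower_bound` and `exact_partial` are assumed callbacks, not the paper's API.

```python
def refine_range(candidates, lower_bound, exact_partial, n_features, radius):
    """Range-query refinement over multiple features: exact partial distances
    replace their bounds one feature at a time, and an object is excluded as
    soon as the aggregated bound exceeds the radius."""
    results = []
    for obj in candidates:
        agg = sum(lower_bound(obj, f) for f in range(n_features))
        pruned = False
        for f in range(n_features):
            if agg > radius:
                pruned = True              # excluded before all exact distances are known
                break
            # swap feature f's bound for its exact partial distance
            agg += exact_partial(obj, f) - lower_bound(obj, f)
        if not pruned and agg <= radius:
            results.append((obj, agg))     # agg is now the exact aggregated distance
    return results
```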
Video Retrieval with Feature Signature Sketches
Abstract
In this paper, we present an effective yet efficient approach for known-item search in video data. The approach employs feature signatures based on color distribution to represent video key-frames. At the same time, the feature signatures enable users to intuitively draw simple colored sketches of the desired scene. We describe in detail the video retrieval model and also discuss and carefully optimize its parameters. Furthermore, several indexing techniques suitable for the model are presented and their performance is empirically evaluated in the experiments. Apart from that, we also investigate a bounding-sphere pruning technique suitable for similarity search in vector spaces.
Adam Blažek, Jakub Lokoč, Tomáš Skopal
Some Theoretical and Experimental Observations on Permutation Spaces and Similarity Search
Abstract
Permutation-based approaches represent data objects as ordered lists of predefined reference objects. Similarity queries are executed by searching for data objects whose permutation representation is similar to that of the query. Various permutation-based indexes have been proposed recently. They typically achieve high efficiency with acceptable effectiveness. Moreover, various parameters can be set in order to find an optimal trade-off between quality of results and costs.
In this paper we study the permutation space itself, without referring to any particular index structure, focusing on both theoretical and experimental aspects. We use both synthetic and real-world datasets for our experiments. The results of this work are relevant to both developing and setting the parameters of permutation-based similarity search approaches.
Giuseppe Amato, Fabrizio Falchi, Fausto Rabitti, Lucia Vadicamo
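For readers unfamiliar with permutation-based representations, the following sketch shows the two basic ingredients the paper studies: encoding an object as the ordering of reference objects (pivots) by distance, and comparing two such permutations, here with the Spearman footrule, one common choice. The names are illustrative.

```python
import numpy as np

def permutation(obj, pivots, dist):
    """The object's permutation: pivot ids ordered by distance to obj."""
    return np.argsort([dist(obj, p) for p in pivots])

def spearman_footrule(p1, p2):
    """Compare permutations by how far each pivot's rank moved between them."""
    inv1, inv2 = np.argsort(p1), np.argsort(p2)   # rank of each pivot id
    return int(np.abs(inv1 - inv2).sum())
```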
Metric Space Searching Based on Random Bisectors and Binary Fingerprints
Abstract
We present a novel index for approximate searching in metric spaces based on random bisectors and binary fingerprints. The aim is to deal with scenarios where the main memory available is small. The method was tested on synthetic and real-world metric spaces. Our results show that our scheme outperforms the standard permutant-based index in scenarios where memory is scarce.
José María Andrade, César A. Astudillo, Rodrigo Paredes

Indexing and Applications

Faster Proximity Searching with the Distal SAT
Abstract
In this paper we present the Distal Spatial Approximation Tree (DiSAT), an algorithmic improvement of the SAT. Our improvement increases the discarding power of the SAT by selecting distal nodes instead of the proximal nodes proposed in the original paper. Our approach is parameter-free and was the most competitive in extensive benchmarking: from two to forty times faster than the SAT, and faster than the List of Clusters (LC), which is considered the state of the art for main-memory, linear-sized indexes in the distance-computation model.
In summary, we obtained an index that is more resistant to the curse of dimensionality, establishes a new benchmark in performance, is faster to build than the LC, and has a small memory footprint. Our strategies can be used in any version of the SAT, in either main or secondary memory.
Edgar Chávez, Verónica Ludueña, Nora Reyes, Patricia Roggero
A Dynamic Pivoting Algorithm Based on Spatial Approximation Indexes
Abstract
Metric indexes aim at reducing the number of distance evaluations carried out when searching a metric space. Spatial approximation trees (sa-trees, for short), in particular, are efficient data structures that have been shown to be competitive in metric spaces of medium to high difficulty, or for queries with low selectivity. Sa-trees can also be made dynamic, and can use the available space to improve query performance by adding pivot information. In this paper we extend previous work on dynamic sa-trees with pivots, and show how the pivot information can be used to its full extent to improve search performance. The result is a technique that allows one to traverse a dynamic sa-tree without necessarily comparing all traversed nodes against the query object. As a result, the novel algorithm makes much better use of the available space, yielding a saving of distance computations of about 10% to 70% compared with previous sa-tree schemes that use pivot information.
Diego Arroyuelo
Large-Scale Distributed Locality-Sensitive Hashing for General Metric Data
Abstract
Locality-Sensitive Hashing (LSH) is extremely competitive for similarity search, but it works under the assumption of uniform access cost to the data, and only for a handful of dissimilarities for which locality-sensitive families are available. In this work we propose Parallel Voronoi LSH, an approach that addresses those two limitations of LSH: it makes LSH efficient for distributed-memory architectures, and it works for very general dissimilarities (in particular, for all metric dissimilarities). Each hash table of Voronoi LSH works by selecting a sample of the dataset to be used as the seeds of a Voronoi diagram. The Voronoi cells are then used to hash the data. Because Voronoi diagrams depend only on the distance, the technique is very general. Implementing LSH on distributed-memory systems is challenging because it lacks referential locality in its access to the data: if care is not taken, excessive message passing ruins the index performance. Therefore, another important contribution of this work is the parallel design needed to allow the scalability of the index, which we evaluate on a dataset of a billion multimedia features.
Eliezer Silva, Thiago Teixeira, George Teodoro, Eduardo Valle
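A minimal sketch of the hashing scheme described above: each table samples seeds from the dataset and hashes every object to the id of its nearest seed (its Voronoi cell). Only the distance function is used, which is why the scheme works for arbitrary metrics. This single-machine sketch omits the paper's distributed-memory design.

```python
import random

def build_voronoi_lsh(dataset, n_tables, n_seeds, dist):
    """Build n_tables hash tables; in each, a random sample of the dataset
    serves as Voronoi seeds and an object hashes to its nearest seed's id."""
    tables = []
    for _ in range(n_tables):
        seeds = random.sample(dataset, n_seeds)
        h = lambda x, s=seeds: min(range(len(s)), key=lambda i: dist(x, s[i]))
        buckets = {}
        for obj in dataset:
            buckets.setdefault(h(obj), []).append(obj)
        tables.append((h, buckets))
    return tables

def voronoi_lsh_query(tables, q, dist, k):
    """Candidates are the union of q's buckets across tables, re-ranked by
    their true distance to q."""
    cand = {id(o): o for h, buckets in tables for o in buckets.get(h(q), [])}
    return sorted(cand.values(), key=lambda o: dist(q, o))[:k]
```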
Dynamic List of Clusters in Secondary Memory
Abstract
We introduce a dynamic and secondary-memory-based variant of the List of Clusters, which is shown to be competitive with the literature, especially on higher-dimensional spaces, where it outperforms the M-tree in searches and I/Os used for insertions. The basic principles of our design are applicable to other secondary-memory structures.
Gonzalo Navarro, Nora Reyes
Index-Based R-S Similarity Joins
Abstract
Similarity Joins are among the most useful and powerful data processing operations. They retrieve all pairs of data points, one from each data set, that are considered similar within a certain threshold. This operation is useful in many situations, such as record linkage and data cleaning, among many other applications. An important way to implement efficient Similarity Joins is the use of indexing structures. Previous work, however, only supports self-joins or requires the joint indexing of every pair of relations that participate in a Similarity Join. We present an algorithm that extends a previously proposed index-based algorithm (eD-Index) to support Similarity Joins over two relations. Our approach operates over individual indices. We evaluate the performance of this algorithm, contrast it with an alternative approach, and investigate the configuration of parameters that maximizes performance. Our results show that our algorithm significantly outperforms the alternative in terms of distance computations, and reveal interesting properties when comparing execution time.
Spencer S. Pearson, Yasin N. Silva
A Compressed Index for Hamming Distances
Abstract
Some instances of multimedia data can be represented as high-dimensional binary vectors under the Hamming distance. The standard index used to handle queries is Locality-Sensitive Hashing (LSH), which reduces approximate queries to a set of exact searches. When the queries are not selective and multiple families of hashing functions are employed, or when the collection is large, LSH indexes must be stored in secondary memory, slowing down query time.
In this paper we present a compressed LSH index that can be queried without decompression and with negligible impact on query speed. This compressed representation enables larger collections to be handled in main memory, with the corresponding speedup with respect to fetching data from secondary memory.
We tested the index with a real world example, indexing songs to detect near duplicates. Songs are represented using an entropy based audio-fingerprint (AFP), of independent interest.
The combination of compressed LSH and the AFP enables the retrieval of lossy compressed audio with near perfect recall at bit-rates as low as 32 kbps, packing the representation of 30+ million music tracks of standard length (about the total number of unique tracks of music available worldwide) in half a gigabyte of space. A sequential search for matches would take about 15 minutes; using our compressed index, of size roughly one gigabyte, searching for a song takes a fraction of a second.
Francisco Santoyo, Edgar Chávez, Eric S. Téllez
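The core primitive underlying such an index is the Hamming distance between packed binary fingerprints, which reduces to an XOR followed by a popcount; a minimal Python version follows (the compression scheme itself is not reproduced here).

```python
def hamming(a: int, b: int) -> int:
    """Hamming distance between fingerprints packed into ints: XOR marks the
    differing bits, popcount counts them."""
    return (a ^ b).bit_count()   # Python >= 3.10; use bin(a ^ b).count("1") otherwise

assert hamming(0b10110, 0b00111) == 2   # bits 0 and 4 differ
```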
Perils of Combining Parallel Distance Computations with Metric and Ptolemaic Indexing in kNN Queries
Abstract
Similarity search methods face serious performance issues, since similarity functions are expensive to compute. Many optimization techniques have been designed to reduce the number of similarity computations when a query is being resolved. Indexing methods based on the metric properties of the feature space, such as pivot table prefiltering, are among the most popular; they can speed up query evaluation by orders of magnitude. Another approach is to employ highly parallel architectures such as GPUs, accelerating evaluation by unleashing their raw computational power. Unfortunately, resolving k-nearest-neighbor (kNN) queries optimized with metric indexing is a problem that is serial in nature. In this paper, we explore the perils of kNN parallelization and propose a new algorithm that converts kNN queries into range queries, which are perfectly parallelizable. We have experimentally evaluated all approaches in a highly parallel environment comprising multiple GPUs. The new algorithm demonstrates a speedup of more than 2x over the naive parallel implementation of kNN queries.
Martin Kruliš, Steffen Kirchhoff, Jakub Yaghob
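The conversion of a kNN query into range queries can be sketched as follows: guess a radius, run the (perfectly parallel) range filter, and enlarge the radius until at least k objects survive. This is a hedged illustration of the general idea; the distance vector is assumed precomputed (on a GPU it would come from a parallel kernel), and the growth factor is an arbitrary choice.

```python
import numpy as np

def knn_by_range(distances, k, r0, grow=2.0):
    """Resolve a kNN query through range queries: filter by radius r (a
    perfectly parallel operation) and enlarge r until at least k objects
    survive. Assumes k <= len(distances)."""
    r = r0
    while True:
        hits = np.flatnonzero(distances <= r)     # the parallelizable range query
        if len(hits) >= k:
            return hits[np.argsort(distances[hits])[:k]]
        r *= grow                                 # radius too small: enlarge and retry
```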

Metrics and Evaluation

Transition-Sensitive Distances
Abstract
In information retrieval and classification, the relevance of the obtained result and the efficiency of the computational process are strongly influenced by the distance measure used for data comparison. Conventional distance measures, including the Hamming distance (HD) and the Levenshtein distance (LD), count merely the number of mismatches (or modifications). Given a query, samples mapped at the same distance have the same number of mismatches, but the distribution of the mismatches might differ, being either dispersed or blocked, so that other measures must be cascaded to further differentiate the samples. Here we present a new type of distance, called transition-sensitive distances, which count, in addition to the number of mismatches, the cost of transitions between positionally adjacent match-mismatch pairs as part of the distance. The cost of transitions, which reflects the dispersion of mismatches, can be integrated into conventional distance measures. We introduce transition-sensitive variants of LD and HD, referred to as TLD and THD. We show that, while TLD and THD satisfy the metric properties just as LD and HD do, they function as stricter distance measures in similarity search applications than LD and HD, respectively.
Kaoru Yoshida
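A hedged reading of the transition-sensitive Hamming distance (THD): the usual mismatch count plus a cost for each boundary between a match and a mismatch at adjacent positions, so dispersed mismatches cost more than one contiguous block. The transition cost c and the exact boundary handling are assumptions, not the paper's definition.

```python
def thd(x, y, c=0.5):
    """Mismatch count plus c per match/mismatch boundary between adjacent
    positions, so dispersed errors score higher than blocked ones."""
    assert len(x) == len(y)
    mis = [int(a != b) for a, b in zip(x, y)]
    transitions = sum(m1 != m2 for m1, m2 in zip(mis, mis[1:]))
    return sum(mis) + c * transitions

# "aXbXc" vs "aYbYc": 2 dispersed mismatches -> 2 + 0.5*4 = 4.0
# "aXXbc" vs "aYYbc": 2 blocked mismatches   -> 2 + 0.5*2 = 3.0
```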
Retrieval of Binary Features in Image Databases: A Study
Abstract
Many state-of-the-art object recognition systems rely on local image features, sometimes hundreds per image, that describe the surroundings of detected interest points by a high-dimensional feature vector. To recognize objects, these systems have to match features detected in a query image against the features stored in a database containing millions or even billions of feature vectors. Hence, efficient matching is crucial for real applications. In the past, feature vectors were often real-valued, and therefore research focused on such feature representations. Present techniques, however, involve binary features to reduce memory consumption and to speed up the feature extraction stage. Matching such binary features has received relatively little attention in the computer vision community. Often, either Locality-Sensitive Hashing (LSH) or quantization-based techniques known from real-valued features are used. However, an in-depth evaluation of the involved parameters in binary space has, to the best of our knowledge, not yet been performed. In this paper, we aim to close this research gap, providing valuable insights for application-oriented research.
Johannes Niedermayer, Peer Kröger

New Scenarios and Approaches

The Similarity-Aware Relational Intersect Database Operator
Abstract
Identifying similarities in large datasets is an essential operation in many applications such as bioinformatics, pattern recognition, and data integration. To make the underlying database system similarity-aware, the core relational operators have to be extended. Several similarity-aware relational operators have been proposed that introduce similarity processing at the database engine level, e.g., similarity joins and similarity group-by. This paper extends the semantics of the set intersection operator to operate over similar values. The paper describes the semantics of the similarity-based set intersection operator and develops an efficient query processing algorithm for evaluating it. The proposed operator is implemented inside an open-source database system, namely PostgreSQL. Several queries from the TPC-H benchmark are extended to include similarity-based set intersection predicates. Performance results demonstrate up to three orders of magnitude speedup over equivalent queries that employ only regular operators.
Wadha J. Al Marri, Qutaibah Malluhi, Mourad Ouzzani, Mingjie Tang, Walid G. Aref
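The operator's semantics can be illustrated with a small standalone sketch: a value of relation r belongs to the similarity-based intersection if some value of relation s lies within a threshold eps of it. The sorted-merge evaluation below is one natural strategy, not PostgreSQL's internal implementation.

```python
def similarity_intersect(r, s, eps):
    """Similarity-based set intersection over numeric values: a value of r is
    kept if some value of s is within eps of it (merge over sorted inputs)."""
    r, s = sorted(r), sorted(s)
    out, j = [], 0
    for v in r:
        while j < len(s) and s[j] < v - eps:
            j += 1                       # s[j] can never match later r values either
        if j < len(s) and abs(s[j] - v) <= eps:
            out.append(v)
    return out

print(similarity_intersect([1.0, 5.0, 9.1], [0.8, 9.0], 0.2))  # [1.0, 9.1]
```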
High Dimensional Search Using Polyhedral Query
Abstract
It is well known that, as the dimensionality of a metric space increases, metric search techniques become less effective and the cost of indexing mechanisms becomes greater than the saving they give. This is due to the so-called curse of dimensionality.
One effect of increasing dimensionality is that the ratio of the volume of the unit hypersphere to that of the unit hypercube decreases rapidly, making the solution to a similarity query (the query ball, or hypersphere) ever more difficult to identify using metric invariants such as the triangle inequality.
In this paper we take a different approach, by identifying points within a query polyhedron rather than a ball. We show how this can be achieved by constructing a surrogate metric space, such that a query ball in the surrogate space corresponds to a polyhedron in the original space. If the polyhedron contains the ball, the overall cost of the query is likely to be increased in high dimensions; however, we show that shrinking the polyhedron can capture a surprisingly high proportion of the points within the ball, whilst at the same time giving a more efficient, and more scalable, search.
We show results which confirm our underlying hypothesis. In some cases we can retrieve significant volumes of query results from spaces which are otherwise intractable.
Richard Connor, Stewart MacKenzie-Leigh, Robert Moss
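One simple way to build such a surrogate space, sketched below under assumptions that may differ from the paper's construction, is to map each point to its vector of distances to a few reference points. By the triangle inequality, an L-infinity ball of radius t in that space contains every point within distance t of the query in the original space; shrinking t below the target radius is the recall-for-efficiency trade the abstract describes.

```python
import numpy as np

def surrogate(points, refs):
    """Map each point to its vector of distances to the reference points."""
    return np.array([[np.linalg.norm(p - r) for r in refs] for p in points])

def polyhedral_query(surr_db, surr_q, t):
    """L-infinity ball of radius t in the surrogate space. By the triangle
    inequality it contains every point within distance t of the query in the
    original space; shrinking t trades some recall for a smaller candidate
    set."""
    return np.flatnonzero(np.abs(surr_db - surr_q).max(axis=1) <= t)
```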
Generating Synthetic Data to Allow Learning from a Single Exemplar per Class
Abstract
Recent years have seen an explosion in the volume of historical documents placed online. The individuality of fonts, combined with the degradation suffered by century-old manuscripts, means that Optical Character Recognition systems do not work well here. As human transcription is prohibitively expensive, recent efforts have focused on human/computer cooperative transcription: a human annotates a small fraction of a text to provide labeled data for recognition algorithms. Such a system naturally raises the question: how much data must the human label? In this work we show that we can do well even if the human labels only a single instance of each class. We achieve this result using two novel observations: we can leverage a recently introduced parameter-free distance measure, improving it by taking into account the "complexity" of the glyphs being compared; and we can estimate this complexity using synthetic but plausible instances made from the single training instance. We demonstrate the utility of our observations on diverse historical manuscripts.
Liudmila Ulanova, Yuan Hao, Eamonn Keogh
Similarity for Natural Semantic Networks
Abstract
A natural semantic network (NSN) represents the knowledge of a group of persons with respect to a particular topic. Comparing NSNs makes it possible to discover how close one group is to another in terms of expertise in the topic: for example, how close apprentices are to experts, or students to teachers. We propose to model natural semantic networks as weighted bipartite graphs and to extract feature vectors from these graphs for calculating the similarity between pairs of networks. By comparing a set of networks from different topics, we show that the approach is feasible.
Francisco Torres, Sara E. Garza
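As a rough illustration of the pipeline (weighted bipartite graph, then feature vectors, then similarity), the sketch below treats an NSN as a dict from (concept, definer-word) edges to weights and compares two networks by cosine similarity over the union of edges. The feature extraction actually used in the paper may differ.

```python
import numpy as np

def nsn_similarity(nsn_a, nsn_b):
    """Cosine similarity between two NSNs, each given as a dict that maps a
    (concept, definer-word) edge of the weighted bipartite graph to its
    weight; both networks are embedded over the union of observed edges."""
    keys = sorted(set(nsn_a) | set(nsn_b))
    va = np.array([nsn_a.get(k, 0.0) for k in keys])
    vb = np.array([nsn_b.get(k, 0.0) for k in keys])
    return float(va @ vb / (np.linalg.norm(va) * np.linalg.norm(vb)))
```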

Applications and Specific Domains

Anomaly Detection in Streaming Time Series Based on Bounding Boxes
Abstract
Anomaly detection in time series has been studied extensively by the scientific community in a wide range of applications. One technique that obtains very good results is HOT SAX, because it requires only one parameter, the length of the subsequence, and does not need a training model to detect anomalies. Its disadvantage, however, is that it requires a normalized Euclidean distance, which in turn requires setting a parameter ε to avoid detecting meaningless patterns (noise in the signal). Setting an appropriate ε requires an analysis of the domain of the values of the time series, which implies normalizing all subsequences before performing the detection. We propose an approach to anomaly detection based on bounding boxes, which does not require normalizing the subsequences and therefore does not need ε. The proposed technique can thus be used directly for online detection, without any a priori knowledge and using the non-normalized Euclidean distance. Moreover, we show that our algorithm uses less CPU time to find the anomaly than HOT SAX in normalized scenarios.
Heider Sanchez, Benjamin Bustos
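A typical primitive in bounding-box-based detection, shown here as a hedged sketch rather than the paper's algorithm, is a lower bound on the non-normalized Euclidean distance between a subsequence and the bounding box of a group of subsequences: per position, only the gap to the box contributes, so whole groups can be discarded without exact distance computations.

```python
import numpy as np

def box_lower_bound(subseq, box_min, box_max):
    """Lower bound on the Euclidean distance between a subsequence and the
    per-position (min, max) bounding box of a group of subsequences: inside
    the box a position contributes nothing, outside it contributes the gap
    to the nearest face."""
    gap = np.maximum(box_min - subseq, 0.0) + np.maximum(subseq - box_max, 0.0)
    return float(np.sqrt((gap ** 2).sum()))
```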
SVG-to-RDF Image Semantization
Abstract
The goal of this work is to provide an original (semi-automatic) annotation framework, titled SVG-to-RDF, which converts a collection of raw Scalable Vector Graphics (SVG) images into a searchable, semantic-based RDF graph structure that encodes relevant features and contents. Using a dedicated knowledge base, SVG-to-RDF offers the user possible semantic annotations for each geometric object in the image, based on a combination of shape, color, and position similarity measures. Our method presents several advantages, namely: (i) achieving complete semantization of image content; (ii) allowing semantic-based data search and processing using standard RDF technologies; (iii) remaining compliant with Web standards (i.e., SVG and RDF), so that images and annotation results display in any standard Web browser; and (iv) coping with different application domains. Our solution is of linear complexity in the size of the image and knowledge base structures used. Using our prototype SVG2RDF, several experiments have been conducted on a set of panoramic dental x-ray images to underline our approach's effectiveness and its applicability to different application domains.
Khouloud Salameh, Joe Tekli, Richard Chbeir
Employing Similarity Methods for Stellar Spectra Classification in Astroinformatics
Abstract
In the past few years, we have observed a trend of increasing cooperation between computer science and other empirical sciences such as physics, biology, and the medical fields. This e-science synergy opens new challenges for computer science and triggers important advances in other areas of research. In our particular case, we face the astroinformatics challenge of analysing stellar spectra in order to establish automated classification methods for recognizing different types of Be stars. We have chosen similarity search methods, which are effectively utilized in other domains, such as multimedia content-based retrieval. This paper presents our analysis of the problem and proposes a solution based on the Signature Quadratic Form Distance and feature signatures. We have also conducted an intensive empirical evaluation, which allowed us to determine an appropriate configuration for our similarity model.
Martin Kruliš, David Bednárek, Jakub Yaghob, Filip Zavoral
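For reference, the textbook form of the Signature Quadratic Form Distance between two feature signatures (centroid/weight pairs) is sketched below with a Gaussian similarity kernel; the kernel and its alpha parameter are assumed choices, not necessarily the configuration the paper determined.

```python
import numpy as np

def sqfd(sig1, sig2, alpha=1.0):
    """SQFD between signatures (centroids, weights): concatenate centroids,
    negate the second signature's weights, and evaluate sqrt(w A w^T), where
    A holds pairwise centroid similarities (Gaussian kernel here)."""
    (c1, w1), (c2, w2) = sig1, sig2
    c = np.vstack([c1, c2])
    w = np.concatenate([w1, -np.asarray(w2)])
    d2 = ((c[:, None, :] - c[None, :, :]) ** 2).sum(-1)   # pairwise squared distances
    A = np.exp(-alpha * d2)                               # similarity matrix
    return float(np.sqrt(max(w @ A @ w, 0.0)))            # clamp tiny negative rounding
```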
A Similarity-Based Method for Visual Search in Time Series Using Coulomb’s Law
Abstract
We present a method for visual search in multidimensional time series based on Coulomb's law. The proposed method integrates: a descriptor based on Coulomb's law for dimensionality reduction in time series; a system to perform similarity searching in time series; and a module for the visualization of results. Experiments were performed using real data, indicating that the proposed method improves the quality of results obtained through similarity queries in time series.
Claudinei Garcia de Andrade, Marcela Xavier Ribeiro
Classification of Epileptoid Oscillations in EEG Using Shannon’s Entropy Amplitude Probability Distribution
Abstract
This paper presents an additional tool the authors have developed to continue merging the field of computational neuroscience with medically based neurodiagnostic clinical research, particularly work associated with machine learning on Big Electroencephalogram (EEG) Data. The authors introduce a means to identify various types of epileptic pathologic oscillations using a parameter based on the Shannon entropy of the probability distribution of the amplitudes within EEG signals. Multiple entropy and entropy-like measures have been explored to aid in epileptic seizure classification, including Kolmogorov-Sinai entropy, spectral entropy, Rényi entropy, approximate entropy, and equal-frequency discretization. Here we propose a more computationally efficient measure, which calculates a discrete probability distribution directly from the recorded amplitudes of an EEG recording over a specified window and uses an entropy-like calculation to reduce dimensionality.
Ronald Broberg, Rory Lewis
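The parameter described above, the Shannon entropy of the amplitude probability distribution over a window, can be computed directly; a minimal sketch, with the bin count as an assumed choice:

```python
import numpy as np

def amplitude_entropy(window, n_bins=64):
    """Shannon entropy of the amplitude distribution of one EEG window:
    histogram the raw amplitudes, normalize to probabilities, and compute
    H = -sum(p * log2(p)) over the non-empty bins."""
    counts, _ = np.histogram(window, bins=n_bins)
    p = counts[counts > 0] / counts.sum()
    return float(-(p * np.log2(p)).sum())
```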
Entity Recognition for Duplicate Filtering
Abstract
We propose a system for automatic detection of duplicate entries in a repository of semi-structured text documents. The proposed system employs text-entity recognition to extract information regarding time, location, names of persons and organizations, as well as events described within the document content. With structured representations of the content, called “metamodels”, we group the entries into clusters based on the similarity of the contents. Then we apply machine-learning algorithms to the clusters to carry out duplicate detection. We present results regarding precision, recall, and F-value of the proposed system.
J. A. Cordero Cruz, Sara E. Garza, S. E. Schaeffer
A Bayesian Ensemble Classifier for Source Code Authorship Attribution
Abstract
Authorship attribution of source code is the task of deciding who wrote software, given its source code, when the author is not explicitly known. There are numerous scenarios in which it is necessary to identify the author of a piece of software, including software forensics investigations, plagiarism detection, and questions of software ownership. A number of methods for authorship attribution of source code have been presented in the past, including two state-of-the-art methods: SCAP and Burrows. Each of these two methods was individually improved, and, as presented in this paper, an ensemble method was developed from them based on the Bayes optimal classifier. An empirical study was performed using a data set consisting of 7,231 open-source and textbook programs written in C++ and Java by thirty unique authors. The ensemble method successfully attributed 98.2% of all documents in the data set, compared to 88.9% by the Burrows baseline method and 91.0% by the SCAP baseline method.
Matthew F. Tennyson, Francisco J. Mitropoulos
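A generic sketch of a Bayes-optimal-style combination (not the paper's exact estimator): each base classifier's posterior over authors is weighted by the probability that that classifier is the correct model, and the highest combined score wins. The weights would in practice be estimated from validation data.

```python
def bayes_ensemble_predict(posteriors, model_probs):
    """Combine base classifiers: sum over models of
    P(author | document, model) * P(model), then pick the argmax author.
    `posteriors` is a list of dicts {author: probability}, one per classifier."""
    scores = {}
    for post, w in zip(posteriors, model_probs):
        for author, p in post.items():
            scores[author] = scores.get(author, 0.0) + w * p
    return max(scores, key=scores.get)

# e.g. two classifiers over three candidate authors:
# bayes_ensemble_predict([{"a": .6, "b": .3, "c": .1},
#                         {"a": .2, "b": .7, "c": .1}], [0.5, 0.5]) -> "b"
```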
Multi-Core (CPU and GPU) for Permutation-Based Indexing
Abstract
Permutation-based indexing is a technique to approximate k-nearest-neighbor computation in high-dimensional spaces. The technique predicts the proximity between elements by encoding their location with respect to their surroundings. The strategy is fast and effective at answering user queries. The main constraint of this technique is the indexing time. Opening GPUs to general-purpose computation makes it possible to perform parallel computation on a powerful platform. In this paper, we propose efficient indexing algorithms for permutation-based indexing on multi-core CPU and GPU architectures. We study the performance and efficiency of our algorithms on large-scale datasets of millions of documents. Experimental results show a decrease in indexing time.
Hisham Mohamed, Hasmik Osipyan, Stéphane Marchand-Maillet
An Efficient DTW-Based Approach for Melodic Similarity in Flamenco Singing
Abstract
We study melodic similarity in flamenco singing using the Dynamic Time Warping (DTW) distance. Given two melodic contours, the score of the alignment of the two melodies is taken as a similarity measure. Concretely, we consider a particularly representative flamenco repertoire, the tonás: a cappella flamenco songs with free rhythm and a high degree of complex ornamentation. We show that the DTW distance correctly discriminates variations between the styles. To speed up the quadratic time and space complexity of standard DTW, our strategy is to perform an efficient segmentation of the pitch contour before applying dynamic programming. We show that our method achieves better results (both in efficiency and accuracy) than other existing DTW-based similarity measures.
J. M. Díaz-Báñez, J. C. Rizo
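For completeness, here is the standard quadratic-time DTW that the paper accelerates via pitch-contour segmentation; this is the textbook dynamic program, not the authors' optimized variant.

```python
import numpy as np

def dtw(a, b):
    """Standard O(len(a) * len(b)) DTW between two pitch contours: D[i][j] is
    the cheapest alignment cost of the first i and j samples."""
    n, m = len(a), len(b)
    D = np.full((n + 1, m + 1), np.inf)
    D[0, 0] = 0.0
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            cost = abs(a[i - 1] - b[j - 1])
            D[i, j] = cost + min(D[i - 1, j], D[i, j - 1], D[i - 1, j - 1])
    return D[n, m]
```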
Backmatter
Metadata
Titel
Similarity Search and Applications
Edited by
Agma Juci Machado Traina
Caetano Traina Jr.
Robson Leonardo Ferreira Cordeiro
Copyright Year
2014
Publisher
Springer International Publishing
Electronic ISBN
978-3-319-11988-5
Print ISBN
978-3-319-11987-8
DOI
https://doi.org/10.1007/978-3-319-11988-5
