
2016 | Book

Similarity Search and Applications

9th International Conference, SISAP 2016, Tokyo, Japan, October 24-26, 2016, Proceedings


About this book

This book constitutes the proceedings of the 9th International Conference on Similarity Search and Applications, SISAP 2016, held in Tokyo, Japan, in October 2016.

The 18 full papers and 7 short papers presented in this volume were carefully reviewed and selected from 47 submissions. The program of the conference was grouped into 8 categories as follows: graphs and networks; metric and permutation-based indexing; multimedia; text and document similarity; comparisons and benchmarks; hashing techniques; time-evolving data; and scalable similarity search.

Table of Contents

Frontmatter
Erratum to: Pruned Bi-directed K-nearest Neighbor Graph for Proximity Search
Masajiro Iwasaki

Graphs and Networks

Frontmatter
BFST_ED: A Novel Upper Bound Computation Framework for the Graph Edit Distance
Abstract
Graph similarity is an important operation with many applications. In this paper we are interested in graph edit similarity computation. Because the problem is hard, exact comparison of large graphs is infeasible, and fast approximation approaches of high quality become very attractive. We introduce a novel upper bound computation framework for the graph edit distance. The basic idea is to represent the compared graphs as hierarchical structures, a view that facilitates comparison and the construction of graph mappings. Specifically, a hierarchical view based on a breadth-first search tree together with its backward edges is used. A novel tree traversal and matching method is developed to build a graph mapping. The idea of spare trees is introduced to minimize the number of insertions and/or deletions incurred by the method, and a lookahead strategy is used to enhance the vertex matching process. An interesting feature of the method is that it combines vertex map construction with edit counting in a straightforward manner. The framework also allows graphs to be compared from different hierarchical views to improve the upper bound. Experiments show that the new framework consistently delivers tighter upper bounds with very good response times.
Karam Gouda, Mona Arafa, Toon Calders
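
The hierarchical view the abstract describes can be illustrated with a minimal sketch (illustrative only, not the authors' BFST_ED implementation): a breadth-first search tree is built over an adjacency list, and every non-tree edge is recorded as a backward edge.

```python
from collections import deque

def bfs_tree_with_backward_edges(adj, root):
    """Build a BFS spanning tree of an undirected graph given as an
    adjacency list {vertex: [neighbors]}; non-tree edges become
    'backward' edges, giving a hierarchical view of the graph."""
    parent = {root: None}
    tree_edges, backward_edges = [], []
    queue = deque([root])
    while queue:
        u = queue.popleft()
        for v in adj[u]:
            if v not in parent:
                parent[v] = u
                tree_edges.append((u, v))
                queue.append(v)
            elif parent[u] != v and (v, u) not in backward_edges:
                backward_edges.append((u, v))
    return tree_edges, backward_edges

# Example: a 4-cycle; one edge of the cycle becomes a backward edge.
adj = {0: [1, 3], 1: [0, 2], 2: [1, 3], 3: [2, 0]}
print(bfs_tree_with_backward_edges(adj, 0))
# -> ([(0, 1), (0, 3), (1, 2)], [(3, 2)])
```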
Pruned Bi-directed K-nearest Neighbor Graph for Proximity Search
Abstract
In this paper, we address the problem of fast proximity search for high-dimensional data by using a graph as an index. Graph-based methods that use the k-nearest neighbor graph (KNNG) as an index outperform tree-based and hash-based methods in terms of search precision and query time. To further improve the performance of the KNNG, the number of edges should be increased; however, more edges take up more memory, while the rate of performance improvement gradually falls off. Here, we propose a pruned bi-directed KNNG (PBKNNG) in order to improve performance without increasing the number of edges. Different directed edges for existing edges between a pair of nodes are added to the KNNG, and excess edges are selectively pruned from each node. We show that the PBKNNG outperforms the KNNG for SIFT and GIST image descriptors. However, the drawback of the KNNG is that its construction cost is prohibitively expensive. As an alternative, we show that a graph can be derived from an approximate neighborhood graph, which costs much less to construct than a KNNG, in the same way as the PBKNNG, and that it also outperforms a KNNG.
Masajiro Iwasaki
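
Queries on neighbor-graph indexes of this kind are typically answered with a greedy best-first walk; a minimal sketch of that generic search loop (the data, graph layout, and parameter names are illustrative, not the paper's PBKNNG construction):

```python
import heapq, math

def greedy_graph_search(graph, vectors, query, start, ef=10):
    """Best-first search over a k-NN-style graph: repeatedly expand the
    closest unvisited node, keeping the ef best results seen so far."""
    dist = lambda a, b: math.dist(a, b)
    visited = {start}
    candidates = [(dist(vectors[start], query), start)]   # min-heap
    results = [(-candidates[0][0], start)]                # max-heap of best ef
    while candidates:
        d, u = heapq.heappop(candidates)
        if d > -results[0][0] and len(results) >= ef:
            break  # nothing closer can be reached from here
        for v in graph[u]:
            if v not in visited:
                visited.add(v)
                dv = dist(vectors[v], query)
                heapq.heappush(candidates, (dv, v))
                heapq.heappush(results, (-dv, v))
                if len(results) > ef:
                    heapq.heappop(results)
    return sorted((-d, v) for d, v in results)

vectors = {0: (0, 0), 1: (1, 0), 2: (1, 1), 3: (5, 5)}
graph = {0: [1, 3], 1: [0, 2], 2: [1, 3], 3: [0, 2]}
print(greedy_graph_search(graph, vectors, (1.1, 0.9), start=0, ef=2))
# -> [(0.141..., 2), (0.906..., 1)]
```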
A Free Energy Foundation of Semantic Similarity in Automata and Languages
Abstract
This paper develops a free energy theory from physics, including the variational principles, for automata and languages, and provides algorithms to compute the free energy as well as efficient algorithms for estimating the nondeterminism in a nondeterministic finite automaton. This theory is then used as a foundation to define a semantic similarity metric for automata and languages. Since automata are a fundamental model for all modern programs, while languages are a fundamental model for the programs' behaviors, we believe that the theory and the metric developed in this paper can be further used for real-world programs as well.
Cewei Cui, Zhe Dang

Metric and Permutation-Based Indexing

Frontmatter
Supermetric Search with the Four-Point Property
Abstract
Metric indexing research is concerned with the efficient evaluation of queries in metric spaces. In general, a large space of objects is arranged in such a way that, when a further object is presented as a query, those objects most similar to the query can be efficiently found. Most such mechanisms rely upon the triangle inequality property of the metric governing the space. The triangle inequality property is equivalent to a finite embedding property, which states that any three points of the space can be isometrically embedded in two-dimensional Euclidean space. In this paper, we examine a class of semimetric spaces which are finitely 4-embeddable in three-dimensional Euclidean space. In mathematics this property has been extensively studied and is generally known as the four-point property. All spaces with the four-point property are metric spaces, but they also have some stronger geometric guarantees. We coin the term supermetric space for these, as, in terms of metric search, they are significantly more tractable. We show some stronger geometric guarantees deriving from the four-point property which can be used in indexing to great effect, and show results for two of the SISAP benchmark searches that are substantially better than any previously published.
Richard Connor, Lucia Vadicamo, Franco Alberto Cardillo, Fausto Rabitti
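
The two embedding properties mentioned in the abstract can be restated compactly (standard formulations, not quoted from the paper):

```latex
% Triangle inequality as finite 3-embeddability in (R^2, l_2):
\forall x_1,x_2,x_3 \in U \;\; \exists p_1,p_2,p_3 \in \mathbb{R}^2 :
  \|p_i - p_j\|_2 = d(x_i, x_j) \quad (1 \le i < j \le 3)

% Four-point property: finite 4-embeddability in (R^3, l_2):
\forall x_1,\dots,x_4 \in U \;\; \exists p_1,\dots,p_4 \in \mathbb{R}^3 :
  \|p_i - p_j\|_2 = d(x_i, x_j) \quad (1 \le i < j \le 4)
```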
Reference Point Hyperplane Trees
Abstract
Our context of interest is tree-structured exact search in metric spaces. We make the simple observation that the deeper a data item is within the tree, the higher the probability of that item being excluded from a search. Assuming a fixed and independent probability p of any subtree being excluded at query time, the probability of an individual data item being accessed is \( (1-p)^d\) for a node at depth d. In a balanced binary tree half of the data will be at the maximum depth of the tree, so this effect should be significant and observable. We test this hypothesis with two experiments on partition trees. First, we force a balance by adjusting the partition/exclusion criteria, and compare this with unbalanced trees where the mean data depth is greater. Second, we compare a generic hyperplane tree with a monotone hyperplane tree, where the mean depth is also greater. In both cases the tree with the greater mean data depth performs better in high-dimensional spaces. We then experiment with increasing the mean depth of nodes by using a small, fixed set of reference points to make exclusion decisions over the whole tree, so that almost all of the data resides at the maximum depth. Again this can be seen to reduce the overall cost of indexing. Furthermore, we observe that, having already calculated reference point distances for all data, a final filtering can be applied if the distance table is retained. This reduces further the number of distance calculations required, whilst retaining scalability. The final structure can in fact be viewed as a hybrid between a generic hyperplane tree and a LAESA search structure.
Richard Connor
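
The access-probability argument is easy to make concrete; a worked example of the \((1-p)^d\) formula above, with an exclusion probability p = 0.5 assumed purely for illustration:

```python
# Probability of accessing an item at depth d, assuming each subtree
# is excluded independently with probability p (p = 0.5 is assumed
# here only to make the numbers concrete).
p = 0.5
for d in (5, 10, 20):
    print(d, (1 - p) ** d)
# depth  5 -> 0.03125
# depth 10 -> ~0.00098
# depth 20 -> ~9.5e-07
# In a balanced binary tree half the data sits at maximum depth,
# so deeper placement sharply reduces the expected number of accesses.
```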
Quantifying the Invariance and Robustness of Permutation-Based Indexing Schemes
Abstract
Providing fast and accurate (exact or approximate) access to large-scale multidimensional data is a ubiquitous problem that dates back to the early days of large-scale information systems. Similarity search, which requires resolving nearest neighbor (NN) searches, is a fundamental tool for structuring the information space. Permutation-based Indexing (PBI) is a reference-based indexing scheme that accelerates NN search by combining the use of landmark points and ranking in place of distance calculation.

In this paper, we are interested in understanding the approximation made by the PBI scheme. The aim is to understand the robustness of the scheme by modeling and quantifying its invariance properties. After discussing the geometry of PBI in relation to ranking, we draw on empirical evidence to make proposals that cater for the inconsistencies of this structure.
Stéphane Marchand-Maillet, Edgar Roman-Rangel, Hisham Mohamed, Frank Nielsen
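
The core PBI representation is easy to sketch: each object is encoded by the ranking of its distances to a fixed set of reference points, and permutations are compared with, for example, the Spearman footrule (an illustrative sketch, not the authors' analysis code):

```python
import math

def permutation(obj, refs):
    """Rank reference points by distance to obj: position i of the
    result holds the index of the (i+1)-th closest reference point."""
    return sorted(range(len(refs)), key=lambda i: math.dist(obj, refs[i]))

def spearman_footrule(pa, pb):
    """L1 distance between the rank positions of the reference points."""
    pos_a = {r: i for i, r in enumerate(pa)}
    pos_b = {r: i for i, r in enumerate(pb)}
    return sum(abs(pos_a[r] - pos_b[r]) for r in pos_a)

refs = [(0.0, 0.0), (1.0, 0.0), (0.0, 1.0), (1.0, 1.0)]
print(permutation((0.2, 0.1), refs))                       # [0, 1, 2, 3]
print(spearman_footrule(permutation((0.2, 0.1), refs),
                        permutation((0.9, 0.8), refs)))    # 6
```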
Deep Permutations: Deep Convolutional Neural Networks and Permutation-Based Indexing
Abstract
The activations of the hidden layers of Deep Convolutional Neural Networks can be successfully used as features, often referred to as Deep Features, in generic visual similarity search tasks.
Recently, researchers have shown that permutation-based methods offer very good performance in indexing and supporting approximate similarity search on large databases of objects. Permutation-based approaches represent metric objects as sequences (permutations) of reference objects, chosen from a predefined set of data. However, associating objects with permutations might have a high cost due to the distance calculations between the data objects and the reference objects.
In this work, we propose a new approach to generate permutations at a very low computational cost, when objects to be indexed are Deep Features. We show that the permutations generated using the proposed method are more effective than those obtained using pivot selection criteria specifically developed for permutation-based methods.
Giuseppe Amato, Fabrizio Falchi, Claudio Gennaro, Lucia Vadicamo
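
One low-cost way to obtain a permutation directly from a deep feature, in the spirit described above, is to rank the feature's own components rather than computing distances to pivots; a sketch of that general idea (the paper's exact procedure may differ in detail):

```python
import numpy as np

def deep_permutation(feature, top_k=None):
    """Permutation induced by sorting the components of a deep feature
    vector in decreasing order of activation; no pivot distances needed."""
    order = np.argsort(-feature)
    return order[:top_k] if top_k else order

f = np.array([0.1, 2.3, 0.0, 1.7, 0.4])
print(deep_permutation(f))            # [1 3 4 0 2]
print(deep_permutation(f, top_k=3))   # [1 3 4]
```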

Multimedia

Frontmatter
Patch Matching with Polynomial Exponential Families and Projective Divergences
Abstract
Given a query patch image, patch matching consists of finding similar patches in a target image. In pattern recognition, patch matching is a fundamental task that is time consuming, especially when zoom factors and symmetries are handled. The matching results depend heavily on the underlying notion of distance, or similarity, between patches. We present a method that models patches by flexible statistical parametric distributions called polynomial exponential families (PEFs). PEFs can universally model arbitrary smooth distributions, and yield a compact patch representation whose complexity is independent of the patch size. Since PEFs have computationally intractable normalization terms, we estimate PEFs with score matching, and consider a projective distance: the symmetrized \(\gamma \)-divergence. We demonstrate experimentally the performance of our patch matching system.
Frank Nielsen, Richard Nock
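
The patch model referenced above has a simple closed form; a polynomial exponential family of degree \(D\) can be written as follows (a standard formulation, with \(Z(\theta)\) being the computationally intractable normalizer the abstract mentions):

```latex
p(x;\theta) = \frac{1}{Z(\theta)} \exp\Big( \sum_{i=1}^{D} \theta_i x^i \Big),
\qquad
Z(\theta) = \int \exp\Big( \sum_{i=1}^{D} \theta_i x^i \Big)\, dx .
```

Projective divergences compare densities up to scale, so \(Z(\theta)\) cancels and never needs to be computed.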
Known-Item Search in Video Databases with Textual Queries
Abstract
In this paper, we present two approaches for known-item search in video databases with textual queries. In the first approach, we require the database objects to be labeled with an arbitrary ImageNet classification model. During the search, the set of query words is expanded with synonyms and hypernyms until we encounter words present in the database which are consequently searched for. In the second approach, we delegate the query to an independent database such as Google Images and let the user pick a suitable result for query-by-example search. Furthermore, the effectiveness of the proposed approaches is evaluated in a user study.
Adam Blažek, David Kuboň, Jakub Lokoč
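
The query-expansion step of the first approach can be sketched with WordNet via NLTK (the expansion loop, vocabulary set, and round limit are illustrative, not the paper's code):

```python
from nltk.corpus import wordnet as wn  # requires: nltk.download('wordnet')

def expand_query(words, vocabulary, max_rounds=3):
    """Grow the query word set with synonyms and hypernyms until some
    expanded word appears in the database vocabulary."""
    frontier = set(words)
    for _ in range(max_rounds):
        hits = frontier & vocabulary
        if hits:
            return hits
        nxt = set()
        for w in frontier:
            for syn in wn.synsets(w):
                nxt.update(l.name() for l in syn.lemmas())   # synonyms
                for hyper in syn.hypernyms():                # hypernyms
                    nxt.update(l.name() for l in hyper.lemmas())
        frontier |= nxt
    return frontier & vocabulary

print(expand_query({"puppy"}, {"dog"}))  # 'dog' reached via a hypernym
```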
Combustion Quality Estimation in Carbonization Furnace Using Flame Similarity Measure
Abstract
Similarity distance measures are used to study the similarity between patterns. We propose the use of similarity measures between images to estimate the quality of combustion in a furnace designed for carbonization processes in the production of activated carbon. Broadly speaking, the production of activated carbon requires two thermal processes: carbonization and activation. One of the most sensitive variables in both processes is the level of oxygen. Carbonization involves the thermal decomposition of vegetal material in the absence of air, whereas activation requires the gasification of the material at high temperature, with oxygen as one of the oxidizing agents used. Given the difficulty of measuring the oxygen level, owing to the functional characteristics of the furnaces, we propose a strategy for estimating the quality of combustion, which is directly related to the oxygen level, based on similarity measures between reference photographs and the observed flame states. This strategy automates the methods used by operators in the manual control of the furnaces. Our algorithm is tested with reference photos taken at the production plant, and the experimental results demonstrate the efficiency of the proposed technique.
Fredy Martínez, Angelica Rendón, Pedro Guevara

Text and Document Similarity

Frontmatter
Bit-Vector Search Filtering with Application to a Kanji Dictionary
Abstract
Database query problems can be categorized by the expressiveness of their query languages, and data structure bounds are better for less expressive languages. Highly expressive languages, such as those permitting Boolean operations, lead to difficult query problems with poor bounds, and high dimensionality in geometric problems also causes their query languages to become expressive and inefficient. The IDSgrep kanji dictionary software approaches a highly expressive tree-matching query problem with a filtering technique set in 128-bit Hamming space. It can be a model for other highly expressive query languages. We suggest improvements to bit vector filtering of general applicability, and evaluate them in the context of IDSgrep.
Matthew Skala
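
Filtering in a 128-bit Hamming space as described above is cheap with machine words; a minimal sketch of generic Hamming-distance filtering (not IDSgrep's exact filter; signatures here are plain Python integers and the signature-extraction step is elided):

```python
def hamming(a, b):
    """Hamming distance between two bit-vector signatures stored as ints."""
    return (a ^ b).bit_count()   # popcount; Python 3.10+

def filter_candidates(query_sig, sigs, max_dist):
    """Keep only entries whose signature is within max_dist bits of the
    query signature; survivors still need exact matching afterwards."""
    return [i for i, s in enumerate(sigs)
            if hamming(query_sig, s) <= max_dist]

sigs = [0b1011_0001, 0b1011_0011, 0b0100_1110]
print(filter_candidates(0b1011_0000, sigs, max_dist=2))  # [0, 1]
```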
Domain Graph for Sentence Similarity
Abstract
In this work we propose a new method for word similarity. Assuming that each word corresponds to a unit of semantics, called a synset, with categorical features, called a domain, we construct the domain graph of a synset, consisting of all the hypernyms that belong to the synset's domain. We take advantage of domain graphs to reflect the semantic aspects of words. In experiments we show how well the domain graph approach works for word similarity. We then extend the domain graph to sentence similarity, independent of bag-of-words (BOW) representations. In addition, we assess the execution time of the task and show significant improvements.
Fumito Konaka, Takao Miura
Context Semantic Analysis: A Knowledge-Based Technique for Computing Inter-document Similarity
Abstract
We propose a novel knowledge-based technique for inter-document similarity, called Context Semantic Analysis (CSA). Several specialized approaches built on top of a specific knowledge base (e.g. Wikipedia) exist in the literature, but CSA differs from them because it is designed to be portable to any RDF knowledge base. Our technique relies on a generic RDF knowledge base (e.g. DBpedia or Wikidata) to extract a vector able to represent the context of a document. We show how such a Semantic Context Vector can be effectively exploited to compute inter-document similarity. Experimental results show that our general technique outperforms baselines built on top of traditional methods, and achieves performance similar to that of specialized methods.
Fabio Benedetti, Domenico Beneventano, Sonia Bergamaschi

Comparisons and Benchmarks

Frontmatter
An Experimental Survey of MapReduce-Based Similarity Joins
Abstract
In recent years, Big Data systems and their main data processing framework, MapReduce, have been introduced to efficiently process and analyze massive amounts of data. One of the key data processing and analysis operations is the Similarity Join (SJ), which finds similar pairs of objects between two datasets. The study of SJ techniques for Big Data systems has emerged as a key topic in the database community, and several research teams have published techniques to solve the SJ problem on Big Data systems. However, many of these techniques were not experimentally compared against alternative approaches, in part because some of them were developed in parallel while others were never implemented, even as part of their original publications. Consequently, there is no clear understanding of how these techniques compare to each other and of which technique to use in specific scenarios. This paper addresses this problem by focusing on the study, classification and comparison of previously proposed MapReduce-based SJ algorithms. The contributions of this paper include the classification of SJs based on the supported data types and distance functions, and an extensive set of experimental results. Furthermore, the authors have made available their open-source implementation of many SJ algorithms to enable other researchers and practitioners to apply and extend these algorithms.
Yasin N. Silva, Jason Reed, Kyle Brown, Adelbert Wadsworth, Chuitian Rong
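
A minimal MapReduce-style similarity join, for 1-D numeric data and a distance threshold eps, can be sketched as follows (a generic grid-partitioning scheme for illustration; the surveyed algorithms are considerably more refined):

```python
from collections import defaultdict
import itertools

def sj_map(record, eps):
    """Emit each record to its grid cell and the next cell, so any
    pair within eps shares at least one cell (cell width = eps)."""
    rid, value = record
    cell = int(value // eps)
    yield cell, record
    yield cell + 1, record

def sj_reduce(cell, records, eps):
    """Compare all pairs in one cell; keep pairs within eps, reporting
    each pair only in one designated cell to avoid duplicates."""
    for (ra, va), (rb, vb) in itertools.combinations(records, 2):
        if abs(va - vb) <= eps and int(min(va, vb) // eps) + 1 == cell:
            yield (ra, rb)

# Tiny driver simulating the shuffle phase.
data, eps = [("a", 1.0), ("b", 1.2), ("c", 5.0)], 0.5
groups = defaultdict(list)
for rec in data:
    for cell, r in sj_map(rec, eps):
        groups[cell].append(r)
pairs = [p for cell, recs in groups.items()
         for p in sj_reduce(cell, recs, eps)]
print(pairs)  # [('a', 'b')]
```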
YFCC100M-HNfc6: A Large-Scale Deep Features Benchmark for Similarity Search
Abstract
In this paper, we present YFCC100M-HNfc6, a benchmark consisting of 97M deep features extracted from the Yahoo Creative Commons 100M (YFCC100M) dataset. Three types of features were extracted using a state-of-the-art Convolutional Neural Network trained on the ImageNet and Places datasets. Together with the features, we have made publicly available a set of 1,000 queries and the k-NN results obtained by sequential scan. We first report detailed statistical information on both the features and the search results. Then, we show an example of performance evaluation, performed using this benchmark, on the MI-File approximate similarity access method.
Giuseppe Amato, Fabrizio Falchi, Claudio Gennaro, Fausto Rabitti
A Tale of Four Metrics
Abstract
There are many contexts where the definition of similarity in multivariate space needs to be based on the correlation, rather than the absolute values, of the variables. Examples include classic IR measures such as TF/IDF and BM25, client similarity measures based on collaborative filtering, feature analysis of chemical molecules, and biodiversity contexts.
In such cases, it is almost standard for Cosine similarity to be used. More recently, Jensen-Shannon divergence has appeared in a proper metric form, and a related metric, Structural Entropic Distance (SED), has been investigated. A fourth metric, based on a little-known divergence function named Triangular Divergence, is also assessed here.
For these metrics, we study their properties in the context of similarity and metric search. We compare and contrast their semantics and performance. Our conclusion is that, despite Cosine Distance being an almost automatic choice in this context, Triangular Distance is most likely to be the best choice in terms of a compromise between semantics and performance.
Richard Connor
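
For reference, commonly used forms of three of the distances discussed, for non-negative vectors \(x, y\) normalized so that \(\sum_i x_i = \sum_i y_i = 1\) (standard definitions, which may differ in detail from the paper's formulations; SED is defined via a normalized complexity measure in the cited prior work and is omitted here):

```latex
\text{Cosine distance:}\quad
  d_{\cos}(x,y) = 1 - \frac{\langle x, y\rangle}{\|x\|_2 \, \|y\|_2}

\text{Jensen-Shannon distance:}\quad
  d_{JS}(x,y) = \sqrt{\tfrac{1}{2} KL(x \,\|\, m) + \tfrac{1}{2} KL(y \,\|\, m)},
  \qquad m = \tfrac{x+y}{2}

\text{Triangular distance (from triangular divergence):}\quad
  d_{\triangle}(x,y) = \sqrt{\sum_i \frac{(x_i - y_i)^2}{x_i + y_i}}
```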

Hashing Techniques

Frontmatter
Fast Approximate Furthest Neighbors with Data-Dependent Candidate Selection
Abstract
We present a novel strategy for approximate furthest neighbor search that selects a candidate set using the data distribution. This strategy leads to an algorithm, which we call DrusillaSelect, that is able to outperform existing approximate furthest neighbor strategies. Our strategy is motivated by an empirical study of the behavior of the furthest neighbor search problem, which lends intuition for where our algorithm is most useful. We also present a variant of the algorithm that gives an absolute approximation guarantee; under some assumptions, the guaranteed approximation can be achieved in provably less time than brute-force search. Performance studies indicate that DrusillaSelect can achieve comparable levels of approximation to other algorithms while giving up to an order of magnitude speedup. An implementation is available in the mlpack machine learning library (found at http://www.mlpack.org).
Ryan R. Curtin, Andrew B. Gardner
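
A rough sketch of data-dependent candidate selection for furthest-neighbor search, in the spirit of the abstract (a simplified, assumption-laden variant, not the published DrusillaSelect algorithm; see the mlpack implementation for that):

```python
import numpy as np

def build_candidates(X, l=3, m=5):
    """Pick l directions from high-norm points; for each, keep the m
    points with the largest absolute projection as candidates."""
    X = np.asarray(X, dtype=float)
    remaining = X - X.mean(axis=0)       # work relative to the centroid
    candidates = set()
    for _ in range(l):
        norms = np.linalg.norm(remaining, axis=1)
        d = remaining[norms.argmax()] / norms.max()   # direction
        proj = remaining @ d
        for i in np.abs(proj).argsort()[-m:]:
            candidates.add(int(i))
        remaining = remaining - np.outer(proj, d)     # deflate
    return candidates

def approx_furthest(X, q, candidates):
    """Exact distances only to the precomputed candidate set."""
    X = np.asarray(X, dtype=float)
    cand = list(candidates)
    dists = np.linalg.norm(X[cand] - q, axis=1)
    return cand[int(dists.argmax())]

X = np.random.default_rng(0).standard_normal((200, 10))
print(approx_furthest(X, X[0], build_candidates(X)))
```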
NearBucket-LSH: Efficient Similarity Search in P2P Networks
Abstract
We present NearBucket-LSH, an effective algorithm for similarity search in large-scale distributed online social networks organized as peer-to-peer overlays. As communication is a dominant consideration in distributed systems, we focus on minimizing the network cost while guaranteeing good search quality. Our algorithm is based on Locality Sensitive Hashing (LSH), which limits the search to collections of objects, called buckets, that have a high probability of being similar to the query. More specifically, NearBucket-LSH employs an LSH extension that searches in near buckets, which improves search quality but also significantly increases the network cost. We decrease this cost by considering the internals of both LSH and the P2P overlay, and harnessing their properties to our needs. We show that NearBucket-LSH increases search quality for a given network cost compared to prior art. In many cases, the search quality increases by more than \(50\,\%\).
Naama Kraus, David Carmel, Idit Keidar, Meni Orenbach
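
Near-bucket probing on top of sign-random-projection LSH can be sketched as follows (a centralized toy version with assumed parameters; the paper's contribution is doing this cheaply over a P2P overlay):

```python
import numpy as np

rng = np.random.default_rng(0)

def lsh_key(x, planes):
    """b-bit bucket key: sign pattern of random-hyperplane projections."""
    bits = (planes @ x) > 0
    return int("".join("1" if b else "0" for b in bits), 2)

def near_buckets(key, b):
    """The query's own bucket plus all buckets at Hamming distance 1."""
    return [key] + [key ^ (1 << i) for i in range(b)]

# Build a small index.
b, dim = 8, 16
planes = rng.standard_normal((b, dim))
data = rng.standard_normal((100, dim))
index = {}
for i, x in enumerate(data):
    index.setdefault(lsh_key(x, planes), []).append(i)

# Query: gather candidates from the near buckets, then rank exactly.
q = data[0] + 0.05 * rng.standard_normal(dim)
cands = [i for k in near_buckets(lsh_key(q, planes), b)
         for i in index.get(k, [])]
best = min(cands, key=lambda i: np.linalg.norm(data[i] - q))
print(best)  # most likely 0
```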
Speeding up Similarity Search by Sketches
Abstract
Efficient object retrieval based on a generic similarity is one of the fundamental tasks in the area of information retrieval. We propose an enhancement for techniques that use the distance-based model of similarity. This enhancement is based on sketches: compact bit strings, compared by the Hamming distance, which represent data objects from the original space. The sketches form an additional filter that reduces the number of accessed data objects while practically preserving the search quality. For a certain class of state-of-the-art techniques, we can create the sketches using information that is already known, so the time overhead is negligible and the memory overhead is small. According to the presented experiments, sketch filtering can reduce the number of accessed data objects by 60–80 % in the case of the M-Index, and by 30 % in the case of the PPP-Codes index, while hurting the recall by less than 0.4 % on 10-NN search.
Vladimir Mic, David Novak, Pavel Zezula
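
The filter-then-refine pipeline described above, reduced to its essentials (random-hyperplane sketches stand in here for the paper's technique-specific sketches, which reuse information the index already has):

```python
import numpy as np

rng = np.random.default_rng(1)

def make_sketches(X, planes):
    """One bit per hyperplane: which side of the hyperplane each object
    falls on; packed as Python ints for cheap Hamming comparison."""
    bits = (X @ planes.T) > 0
    return [int("".join("1" if b else "0" for b in row), 2) for row in bits]

def search(q, X, sketches, planes, radius, k=10):
    """Filter by sketch Hamming distance, refine by exact distance."""
    q_sketch = make_sketches(q[None, :], planes)[0]
    survivors = [i for i, s in enumerate(sketches)
                 if (s ^ q_sketch).bit_count() <= radius]
    survivors.sort(key=lambda i: np.linalg.norm(X[i] - q))
    return survivors[:k], len(survivors)   # results + objects accessed

X = rng.standard_normal((1000, 32))
planes = rng.standard_normal((64, 32))
sketches = make_sketches(X, planes)
hits, accessed = search(X[0], X, sketches, planes, radius=20)
print(hits[0], accessed)   # hits[0] == 0; accessed is far below 1000
```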
Fast Hilbert Sort Algorithm Without Using Hilbert Indices
Abstract
Hilbert sort arranges given points of a high-dimensional space with integer coordinates along a Hilbert curve. A naïve method first draws a Hilbert curve of sufficient resolution to separate all the points, associates with each point an integer, called its Hilbert index, representing the order along the curve, and then sorts the pairs of points and indices. Such a method requires a cost exponential in both the dimensionality n of the space and the order m of the Hilbert curve merely to obtain the Hilbert indices. A known improved method computes the Hilbert index for each point in O(mn) time. In this paper, we propose an algorithm which directly sorts N points along a Hilbert curve in O(mnN) time without using Hilbert indices. This algorithm has three advantages: (1) it requires no extra space for Hilbert indices, (2) it handles multiple points simultaneously, and (3) it simulates the Hilbert curve at heterogeneous resolutions, that is, at lower orders for sparse regions of the space and higher orders for dense regions. It therefore runs much faster on random data, in O(N log N) time. Furthermore, it can be expected to run very fast on practical data, such as high-dimensional features of multimedia data.
Yasunobu Imamura, Takeshi Shinohara, Kouichi Hirata, Tetsuji Kuboyama
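
The naïve baseline the abstract improves on is easy to state for n = 2: compute each point's Hilbert index with the classic bit-manipulation conversion, then sort by index (the standard textbook routine, not the paper's index-free algorithm):

```python
def hilbert_index_2d(m, x, y):
    """Hilbert index of (x, y) on a 2^m x 2^m grid: the classic xy-to-d
    conversion with quadrant reflections and swaps."""
    n = 1 << m
    d = 0
    s = n >> 1
    while s > 0:
        rx = 1 if (x & s) else 0
        ry = 1 if (y & s) else 0
        d += s * s * ((3 * rx) ^ ry)
        if ry == 0:                      # rotate/reflect the quadrant
            if rx == 1:
                x, y = n - 1 - x, n - 1 - y
            x, y = y, x
        s >>= 1
    return d

# Naïve Hilbert sort: index every point, then sort by index.
points = [(0, 0), (1, 1), (3, 0), (2, 3)]
print(sorted(points, key=lambda p: hilbert_index_2d(2, *p)))
# -> [(0, 0), (1, 1), (2, 3), (3, 0)]
```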

Time-Evolving Data

Frontmatter
Similarity Searching in Long Sequences of Motion Capture Data
Abstract
Motion capture data digitally represent human movements by sequences of body configurations in time. Searching in such spatio-temporal data is difficult as query-relevant motions can vary in length and occur arbitrarily within a very long data sequence. There is also a strong requirement for effective similarity comparison, as a specific motion can be performed by various actors in different ways, at different speeds, or from different starting positions. To deal with these problems, we propose a new subsequence matching algorithm which uses a synergy of an elastic similarity measure and multi-level segmentation. The idea is to generate a minimum number of overlapping data segments so that there is at least one segment matching an arbitrary subsequence. A non-partitioned query is then efficiently evaluated by searching for the most similar segments in a single level only, while guaranteeing a precise answer with respect to the similarity measure. The retrieval process is efficient and scalable, as confirmed by experiments executed on a real-life dataset.
Jan Sedmidubsky, Petr Elias, Pavel Zezula
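
The segment-generation idea can be illustrated with a simple multi-level scheme in which each level doubles the segment length and consecutive segments overlap by half (an illustrative scheme only; the paper derives its levels so that every possible subsequence is matched by some segment):

```python
def multi_level_segments(n, base=16, levels=4):
    """Generate (level, start, end) index ranges over a sequence of
    length n: level i uses segments of length base * 2**i, with
    consecutive segments overlapping by 50%."""
    out = []
    for i in range(levels):
        length = base * 2 ** i
        step = length // 2
        for start in range(0, max(n - length, 0) + 1, step):
            out.append((i, start, min(start + length, n)))
    return out

for level, s, e in multi_level_segments(100, base=16, levels=3):
    print(level, s, e)
```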
Music Outlier Detection Using Multiple Sequence Alignment and Independent Ensembles
Abstract
The automated retrieval of related music documents, such as cover songs or folk melodies belonging to the same tune, has been an important task in the field of Music Information Retrieval (MIR). Yet outlier detection, the process of identifying those documents that deviate significantly from the norm, has remained a rather unexplored topic. Pairwise comparison of music sequences (e.g. chord transcriptions, melodies), from which outlier detection can potentially emerge, has always been at the center of MIR research, but this connection has remained uninvestigated. In this paper we first argue that, for the analysis of musical collections of sequential data, outlier detection can benefit immensely from the advantages of Multiple Sequence Alignment (MSA). We show that certain MSA-based similarity methods can better separate inliers and outliers than the typical similarity based on pairwise comparisons. Secondly, aiming towards an unsupervised outlier detection method that is data-driven and robust enough to generalize across different music datasets, we show that ensemble approaches using an entropy-based diversity measure can outperform supervised alternatives.
Dimitrios Bountouridis, Hendrik Vincent Koops, Frans Wiering, Remco C. Veltkamp
Scalable Similarity Search in Seismology: A New Approach to Large-Scale Earthquake Detection
Abstract
Extracting earthquake signals from continuous waveform data recorded by networks of seismic sensors is a critical and challenging task in seismology. Earthquakes occur infrequently in long-duration data and may produce weak signals, which are challenging to detect while limiting the number of false discoveries. Earthquake detection based on waveform similarity has demonstrated success in detecting weak signals from small events, but existing techniques either require prior knowledge of the event waveform or have poor scaling properties that limit use to small data sets. In this paper, we describe ongoing research into the use of similarity search for large-scale earthquake detection. We describe Fingerprint and Similarity Thresholding (FAST), a new earthquake detection method that leverages locality-sensitive hashing to enable waveform-similarity-based earthquake detection in long-duration continuous seismic data. We demonstrate the detection capability of FAST and compare different fingerprinting schemes by performing numerical experiments on test data, with an emphasis on false alarm reduction.
Karianne Bergen, Clara Yoon, Gregory C. Beroza

Scalable Similarity Search

Frontmatter
Feature Extraction and Malware Detection on Large HTTPS Data Using MapReduce
Abstract
Secure HTTP network traffic represents a challenging, immense data source for machine learning tasks. The tasks usually try to learn to identify infected network nodes, given only the limited traffic features available for secure HTTP data. In this paper, we investigate the performance of grid histograms that aggregate the traffic features of network nodes over 5-minute snapshot batches. We compare the representations using linear and k-NN classifiers. We also demonstrate that all presented feature extraction and classification tasks can be implemented in a scalable way using the MapReduce approach.
Přemysl Čech, Jan Kohout, Jakub Lokoč, Tomáš Komárek, Jakub Maroušek, Tomáš Pevný
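
Aggregating one node's batch of traffic records into a grid histogram can be sketched as follows (the 2-D feature pair and grid size are assumptions for illustration, not the paper's exact feature set):

```python
import numpy as np

def grid_histogram(points, bins=8, lo=0.0, hi=1.0):
    """Aggregate a node's traffic records (2-D feature pairs from one
    5-minute batch) into a flattened, normalized bins x bins histogram."""
    h, _, _ = np.histogram2d(
        points[:, 0], points[:, 1],
        bins=bins, range=[[lo, hi], [lo, hi]])
    total = h.sum()
    return (h / total).ravel() if total else h.ravel()

# One node's batch of (e.g.) normalized duration/size feature pairs.
batch = np.random.default_rng(2).random((500, 2))
print(grid_histogram(batch).shape)   # (64,) -> input to the classifiers
```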
Similarity Search of Sparse Histograms on GPU Architecture
Abstract
Searching for similar objects within a large-scale database is a hard problem due to the exponential increase of multimedia data. The time required to find the objects nearest to a specific query in a high-dimensional space has become a serious constraint on searching algorithms. One possible solution to this problem is the utilization of massively parallel platforms such as GPU architectures. This solution is very sensitive for applications working with sparse datasets: the performance of an algorithm can change completely depending on the sparsity settings of the input data. In this paper, we study four different approaches on the GPU architecture for finding the histograms most similar to given queries. The performance and efficiency of the observed methods were studied on a sparse dataset of half a million histograms. We summarize our empirical results and point out the optimal GPU strategy for sparse histograms under different sparsity settings.
Hasmik Osipyan, Jakub Lokoč, Stéphane Marchand-Maillet
Backmatter
Metadata
Title
Similarity Search and Applications
Edited by
Laurent Amsaleg
Michael E. Houle
Erich Schubert
Copyright Year
2016
Electronic ISBN
978-3-319-46759-7
Print ISBN
978-3-319-46758-0
DOI
https://doi.org/10.1007/978-3-319-46759-7
