
2020 | Book

Similarity Search and Applications

13th International Conference, SISAP 2020, Copenhagen, Denmark, September 30 – October 2, 2020, Proceedings

Edited by: Prof. Shin'ichi Satoh, Lucia Vadicamo, Dr. Arthur Zimek, Fabio Carrara, Prof. Dr. Ilaria Bartolini, Martin Aumüller, Prof. Björn Þór Jónsson, Rasmus Pagh

Publisher: Springer International Publishing

Book Series: Lecture Notes in Computer Science


About this book

This book constitutes the refereed proceedings of the 13th International Conference on Similarity Search and Applications, SISAP 2020, held in Copenhagen, Denmark, in September/October 2020. The conference was held virtually due to the COVID-19 pandemic.

The 19 full papers presented together with 12 short and 2 doctoral symposium papers were carefully reviewed and selected from 50 submissions. The papers are organized in topical sections named: scalable similarity search; similarity measures, search, and indexing; high-dimensional data and intrinsic dimensionality; clustering; artificial intelligence and similarity; demo and position papers; and doctoral symposium.

Table of Contents

Frontmatter

Scalable Similarity Search

Frontmatter
Accelerating Metric Filtering by Improving Bounds on Estimated Distances

Filtering is a fundamental strategy of metric similarity indexes to minimise the number of computed distances. Given a triple of objects for which the distances of two pairs are known, the lower and upper bounds on the third distance can be set as the difference and the sum of these two already known distances, due to the triangle inequality of the metric space. For efficiency reasons, the tightness of the bounds is crucial, but as angles within triangles of distances can be arbitrary, the worst case with zero and straight angles must also be considered for correctness. However, in data of real-life applications, the distribution of possible angles is skewed and the extremes are very unlikely to occur. In this paper, we enhance the existing definition of bounds on the unknown distance with information about the possible angles within triangles. We show that two lower bounds and one upper bound on each distance exist in the case of limited angles. We analyse their filtering power and confirm large efficiency improvements through experiments on several real-life datasets.
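
As context for the bounds being tightened, here is a minimal Python sketch (assuming a single pivot p with known distances to the query q and a candidate o) of the classic triangle-inequality bounds and the resulting range-query filter; the angle-aware bounds proposed in the paper are not reproduced here.

```python
def triangle_bounds(d_qp, d_po):
    """Classic metric-filtering bounds on the unknown distance d(q, o),
    given the known distances d(q, p) and d(p, o) to a shared pivot p.
    Triangle inequality: |d(q,p) - d(p,o)| <= d(q,o) <= d(q,p) + d(p,o)."""
    return abs(d_qp - d_po), d_qp + d_po


def filter_candidate(d_qp, d_po, radius):
    """Range-query filtering: 'discard' if the lower bound already exceeds the
    radius, 'accept' if the upper bound is within it; otherwise the real
    distance d(q, o) must be computed."""
    lower, upper = triangle_bounds(d_qp, d_po)
    if lower > radius:
        return "discard"
    if upper <= radius:
        return "accept"
    return "compute"
```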

Vladimir Mic, Pavel Zezula
Differentially Private Sketches for Jaccard Similarity Estimation

This paper describes two locally differentially private algorithms for releasing user vectors such that the Jaccard similarity between these vectors can be efficiently estimated. The basic building block is the well-known MinHash method. To achieve a privacy-utility trade-off, MinHash is extended in two ways using variants of Generalized Randomized Response and the Laplace Mechanism. A theoretical analysis provides bounds on the absolute error, and experiments show the utility-privacy trade-off on synthetic and real-world data. A full version of this paper is available at http://arxiv.org/abs/2008.08134 .
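
For reference, a small sketch of the non-private MinHash building block the paper starts from (my own illustration with hypothetical helper names; the private variants, which perturb the signature with generalized randomized response or Laplace noise, are not shown):

```python
import random

def minhash_signature(items, k, seed=0):
    # Plain MinHash: for each of k independent hash functions (salted hashes
    # here), keep the minimum hash value over the item set.
    rng = random.Random(seed)
    salts = [rng.getrandbits(64) for _ in range(k)]
    return [min(hash((salt, x)) for x in items) for salt in salts]

def estimate_jaccard(sig_a, sig_b):
    # The fraction of agreeing positions estimates the Jaccard similarity
    # of the two underlying sets.
    return sum(a == b for a, b in zip(sig_a, sig_b)) / len(sig_a)

# Example (both signatures computed within the same process run):
sig1 = minhash_signature({"a", "b", "c", "d"}, k=256)
sig2 = minhash_signature({"b", "c", "d", "e"}, k=256)
print(estimate_jaccard(sig1, sig2))  # roughly 3/5
```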

Martin Aumüller, Anders Bourgeat, Jana Schmurr
Pivot Selection for Narrow Sketches by Optimization Algorithms

Sketches are compact bit strings that can be viewed as the product of a locality-sensitive hashing scheme for high-dimensional data. We use them in filtering to narrow down the solution candidates in similarity search. We propose a pivot selection method for narrow sketches, such as 16-bit ones, based on optimization algorithms that use the accuracy of the filtering itself as the objective function.
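
To make the notion of a pivot-based sketch concrete, here is a hedged example of deriving a 16-bit sketch from pivots by ball partitioning (one common construction, assumed here for illustration; the paper's contribution, selecting the pivots by optimizing filtering accuracy, is not reproduced):

```python
def ball_partitioning_sketch(x, pivots, radii, dist):
    """Bit i of the sketch is set iff x lies inside the ball of radius radii[i]
    around pivots[i]; with 16 pivots this yields a 16-bit sketch whose Hamming
    distance can be used to filter candidates before exact distance computations."""
    bits = 0
    for i, (p, r) in enumerate(zip(pivots, radii)):
        if dist(x, p) <= r:
            bits |= 1 << i
    return bits
```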

Naoya Higuchi, Yasunobu Imamura, Vladimir Mic, Takeshi Shinohara, Kouichi Hirata, Tetsuji Kuboyama
mmLSH: A Practical and Efficient Technique for Processing Approximate Nearest Neighbor Queries on Multimedia Data

Many large multimedia applications require efficient processing of nearest neighbor queries. Often, multimedia data are represented as a collection of important high-dimensional feature vectors. Existing Locality Sensitive Hashing (LSH) techniques require users to find top-k similar feature vectors for each of the feature vectors that represent the query object. This leads to wasted and redundant work due to two main reasons: 1) not all feature vectors may contribute equally in finding the top-k similar multimedia objects, and 2) feature vectors are treated independently during query processing. Additionally, there is no theoretical guarantee on the returned multimedia results. In this work, we propose a practical and efficient indexing approach for finding top-k approximate nearest neighbors for multimedia data using LSH called mmLSH, which can provide theoretical guarantees on the returned multimedia results. Additionally, we present a buffer-conscious strategy to speed up the query processing. Experimental evaluation shows significant gains in performance time and accuracy for different real multimedia datasets when compared against state-of-the-art LSH techniques.

Omid Jafari, Parth Nagarkar, Jonathan Montaño
Parallelizing Filter-Verification Based Exact Set Similarity Joins on Multicores

Set similarity join (SSJ) is a well-studied problem with many algorithms proposed to speed up its performance. However, its scalability and performance are rarely discussed in modern multicore environments. Existing algorithms assume a single-threaded execution that wastes the abundant parallelism provided by modern machines, or use distributed setups that may not yield efficient runtimes and speedups proportional to the amount of hardware resources (e.g., CPU cores). In this paper, we focus on a widely-used family of SSJ algorithms based on the filter-verification paradigm, and study the potential of speeding them up in the context of multicore machines. We adapt state-of-the-art SSJ algorithms including PPJoin and AllPairs. Our experiments using 12 real-world data sets highlight important findings: (1) using the exact number of hardware-provided hyperthreads leads to optimal runtimes for most experiments, (2) hand-crafted data structures do not always lead to better performance, and (3) PPJoin’s position filter is more effective in the multithreaded case compared to the single-threaded execution.

Fabian Fier, Tianzheng Wang, Erkang Zhu, Johann-Christoph Freytag
Similarity Search with Tensor Core Units

Tensor Core Units (TCUs) are hardware accelerators developed for deep neural networks, which efficiently support the multiplication of two dense $$\sqrt{m}\times \sqrt{m}$$ matrices, where m is a given hardware parameter. In this paper, we show that TCUs can speed up similarity search problems as well. We propose algorithms for Johnson-Lindenstrauss dimensionality reduction and for similarity join that, by leveraging TCUs, achieve an $$\varOmega (\sqrt{m})$$ speedup with respect to traditional approaches.
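
As a plain (non-TCU) illustration of the kind of operation involved, the Johnson-Lindenstrauss reduction is essentially a dense matrix product, which is exactly the operation tensor core units accelerate; a minimal NumPy sketch:

```python
import numpy as np

def jl_project(X, target_dim, seed=0):
    """Johnson-Lindenstrauss projection: multiply the n x d data matrix X by a
    random d x target_dim Gaussian matrix. Pairwise distances are approximately
    preserved; the matrix product itself is the TCU-friendly operation."""
    rng = np.random.default_rng(seed)
    R = rng.standard_normal((X.shape[1], target_dim)) / np.sqrt(target_dim)
    return X @ R
```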

Thomas D. Ahle, Francesco Silvestri
On the Problem of $$p_1^{-1}$$ in Locality-Sensitive Hashing

A Locality-Sensitive Hash (LSH) function is called $$(r, cr, p_1, p_2)$$-sensitive if two data points with a distance less than r collide with probability at least $$p_1$$, while data points with a distance greater than cr collide with probability at most $$p_2$$. These functions form the basis of the successful Indyk-Motwani algorithm (STOC 1998) for nearest neighbour problems. In particular, one may build a c-approximate nearest neighbour data structure with query time $$\tilde{O}(n^\rho /p_1)$$ where $$\rho =\frac{\log 1/p_1}{\log 1/p_2}\in (0,1)$$. This is sub-linear as long as $$p_1$$ is not too small. Such an algorithm is significant, since most high-dimensional nearest neighbour problems suffer from the curse of dimensionality and cannot be solved exactly faster than a brute-force linear-time scan of the database. Unfortunately, many of the best LSH functions tend to have very low collision probabilities, including the best functions for Cosine and Jaccard similarity. This means that the $$n^\rho /p_1$$ query time of LSH is often not sub-linear after all, even for approximate nearest neighbours! In this paper, we improve the general Indyk-Motwani algorithm to reduce the query time of LSH to $$\tilde{O}(n^\rho /p_1^{1-\rho })$$ (and the space usage correspondingly). Since $$n^\rho /p_1^{1-\rho } < n \Leftrightarrow p_1 > n^{-1}$$, our algorithm always obtains sublinear query time, for all collision probabilities at least 1/n. For $$p_1$$ and $$p_2$$ small enough, our improvement over all previous methods can be up to a factor of n in both query time and space. The improvement comes from a simple change to the Indyk-Motwani algorithm, which we call “LSH with High-Low Tables”. This technique can easily be implemented in existing software packages.
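
A worked example, with hypothetical numbers, of the query-time expressions in the abstract:

```python
import math

# Hypothetical parameters: p1 = 0.01, p2 = 0.0001, n = 10**6.
p1, p2, n = 0.01, 0.0001, 10**6
rho = math.log(1 / p1) / math.log(1 / p2)   # = 0.5
classic = n**rho / p1                       # n^rho / p1         = 100,000
high_low = n**rho / p1**(1 - rho)           # n^rho / p1^(1-rho) =  10,000
print(rho, classic, high_low)               # a factor-10 gain in this example
```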

Thomas Dybdahl Ahle

Similarity Measures, Search, and Indexing

Frontmatter
Confirmation Sampling for Exact Nearest Neighbor Search

Locality-sensitive hashing (LSH), introduced by Indyk and Motwani in STOC ’98, has been an extremely influential framework for nearest neighbor search in high-dimensional data sets. While theoretical work has focused on the approximate nearest neighbor problem, in practice LSH data structures with suitably chosen parameters are used to solve the exact nearest neighbor problem (with some error probability). Sublinear query time is often possible in practice even for exact nearest neighbor search, intuitively because the nearest neighbor tends to be significantly closer than other data points. However, theory offers little advice on how to choose LSH parameters outside of pre-specified worst-case settings. We introduce the technique of confirmation sampling for solving the exact nearest neighbor problem using LSH. First, we give a general reduction that transforms a sequence of data structures that each find the nearest neighbor with a small, unknown probability, into a data structure that returns the nearest neighbor with probability $$1-\delta$$, using as few queries as possible. Second, we present a new query algorithm for the LSH Forest data structure with L trees that is able to return the exact nearest neighbor of a query point within the same time bound as an LSH Forest of $$\varOmega (L)$$ trees with internal parameters specifically tuned to the query and data.
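
A rough schematic of the confirmation-sampling idea (my own sketch under stated assumptions, not the paper's exact reduction: `structures` is an assumed sequence of data structures each exposing a `query` method, and `dist` is the metric):

```python
def confirmation_sampling(structures, q, dist, confirmations_needed):
    # Query the structures one by one; whenever the current best candidate is
    # returned again, count one confirmation, and stop once it has been
    # confirmed often enough. A strictly closer candidate resets the count.
    best, confirmations = None, 0
    for s in structures:
        candidate = s.query(q)
        if candidate is None:
            continue
        if best is None or dist(candidate, q) < dist(best, q):
            best, confirmations = candidate, 0
        elif candidate == best:
            confirmations += 1
            if confirmations >= confirmations_needed:
                break
    return best
```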

Tobias Christiani, Rasmus Pagh, Mikkel Thorup
Optimal Metric Search Is Equivalent to the Minimum Dominating Set Problem

In metric search, worst-case analysis is of little value, as the search invariably degenerates to a linear scan for ill-behaved data. Consequently, much effort has been expended on more nuanced descriptions of what performance might in fact be attainable, including heuristic baselines like the AESA family, as well as statistical proxies such as intrinsic dimensionality. This paper gets to the heart of the matter with an exact characterization of the best performance actually achievable for any given data set and query. Specifically, linear-time objective-preserving reductions are established in both directions between optimal metric search and the minimum dominating set problem, whose greedy approximation becomes the equivalent of an oracle-based AESA, repeatedly selecting the pivot that eliminates the most of the remaining points. As an illustration, the AESA heuristic is adapted to downplay the role of previously eliminated points, yielding some modest performance improvements over the original, as well as its younger relative iAESA2.

Magnus Lie Hetland
Metrics and Ambits and Sprawls, Oh My
Another Tutorial on Metric Indexing

A follow-up to my previous tutorial on metric indexing, this paper walks through the classic structures, placing them all in the context of the recently proposed sprawl of ambits framework. The indexes are presented as configurations of a single, more general structure, all queried using the same search procedure.

Magnus Lie Hetland
Some Branches May Bear Rotten Fruits: Diversity Browsing VP-Trees

Diversified similarity searching embeds result diversification straight into the query procedure, which boosts the computational performance by orders of magnitude. While metric indexes have a hidden potential for perfecting such procedures, the construction of a suitable, fast, and incremental solution for diversified similarity searching is still an open issue. This study presents a novel index-and-search algorithm, coined diversity browsing, that combines an optimized implementation of the vantage-point tree (VP-Tree) index with the distance browsing search strategy and coverage-based query criteria. Our proposal maps data elements into VP-Tree nodes, which are incrementally evaluated for solving diversified neighborhood searches. Such an evaluation is based not only on the distance between the query and candidate objects but also on distances from the candidate to data elements (called influencers) in the partial search result. Accordingly, we take advantage of those distance-based relationships for pruning VP-Tree branches that are themselves influenced by elements in the result set. As a result, diversity browsing benefits from data indexing for (i) eliminating nodes without valid candidate elements, and (ii) examining the minimum number of partitions regarding the query element. Experiments with real-world datasets show our approach outperformed the competitors GMC and GNE by at least 4.91 orders of magnitude, as well as the baseline algorithm BRID$$_k$$ by at least $$87.51\%$$ in elapsed query time.

Daniel Jasbick, Lucio Santos, Daniel de Oliveira, Marcos Bedo
Continuous Similarity Search for Evolving Database

Similarity search for data streams has attracted much attention recently in the area of information recommendation. This paper studies continuous set similarity search, which regards the latest W items in a data stream as an evolving set. So far, a top-k similarity search problem called CEQ (Continuous similarity search for Evolving Query) has been researched in the literature, where the query evolves dynamically and the database consists of multiple static sets. By contrast, this paper examines a new top-k similarity search problem, where the query is a static set and the database consists of multiple dynamic sets extracted from multiple data streams. We name this new problem CED (Continuous similarity search for Evolving Database). Our main contribution is to develop a pruning-based exact algorithm for CED. Though our algorithm is created by extending the previous pruning-based exact algorithm for CEQ, it runs substantially faster than simply adapting the exact algorithm for CEQ to CED. Our algorithm achieves this speed by devising two novel techniques to refine the similarity upper bounds used for pruning.

Hisashi Koga, Daiki Noguchi
Taking Advantage of Highly-Correlated Attributes in Similarity Queries with Missing Values

Incompleteness harms the quality of content-based retrieval and analysis in similarity queries. Missing data are usually handled by exclusion or by imputation methods that infer possible values to fill the gaps. However, such approaches can introduce bias into the data and lose useful information. Similarity queries cannot be performed over incomplete complex tuples, since distance functions are undefined over missing values. We propose the SOLID approach to allow similarity queries in complex databases without the need for either data imputation or deletion. First, SOLID finds highly-correlated metric spaces. Then, SOLID uses a weighted distance function to search by similarity over tuples of complex objects using compatibility factors among metric spaces. Experimental results show that SOLID outperforms imputation methods at different missing rates. SOLID was up to $$7.3\%$$ better than the competitors in quality when querying over incomplete tuples, reduced the error of similarity searches over incomplete data by $$16.42\%$$, and was up to 30.8 times faster than the closest competitor.
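
To make the idea of a weighted distance over incomplete tuples concrete, here is an illustrative sketch (not SOLID itself): attributes missing in either tuple are skipped and the remaining weights renormalised; the compatibility factors SOLID derives from highly-correlated metric spaces would play the role of the assumed `weights`.

```python
def incomplete_tuple_distance(t1, t2, dists, weights):
    """Aggregate distance over tuples of complex attributes with missing values
    (None). dists[i] is the metric of the i-th attribute; weights[i] its weight."""
    total, weight_sum = 0.0, 0.0
    for a1, a2, d, w in zip(t1, t2, dists, weights):
        if a1 is None or a2 is None:   # distance undefined: skip this attribute
            continue
        total += w * d(a1, a2)
        weight_sum += w
    return total / weight_sum if weight_sum > 0 else float("inf")
```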

Lucas Santiago Rodrigues, Mirela Teixeira Cazzolato, Agma Juci Machado Traina, Caetano Traina Jr.
Similarity Between Points in Metric Measure Spaces

This paper is about similarity between objects that can be represented as points in metric measure spaces. A metric measure space is a metric space that is also equipped with a measure. For example, a network with distances between its nodes and weights assigned to its nodes is a metric measure space. Given points x and y in different metric measure spaces or in the same space, how similar are they? A well known approach is to consider x and y similar if their neighborhoods are similar. For metric measure spaces, similarity between neighborhoods is well captured by the Gromov-Hausdorff-Prokhorov distance, but it is $$\mathsf {NP}$$ -hard to compute this distance even in quite simple cases. We propose a tractable alternative: the radial distribution distance between the neighborhoods of x and y. The similarity measure based on the radial distribution distance is coarser than the similarity based on the Gromov-Hausdorff-Prokhorov distance but much easier to compute.

Evgeny Dantsin, Alexander Wolpert

High-Dimensional Data and Intrinsic Dimensionality

Frontmatter
GTT: Guiding the Tensor Train Decomposition

The demand for searching and querying multimedia data such as images, video, and audio is omnipresent, and how to effectively access such data for various applications is a critical task. Nevertheless, these data are usually encoded as multi-dimensional arrays, or tensors, and traditional data mining techniques might be limited due to the curse of dimensionality. Tensor decomposition has been proposed to alleviate this issue; commonly used tensor decomposition algorithms include CP-decomposition (which seeks a diagonal core) and Tucker-decomposition (which seeks a dense core). Naturally, Tucker maintains more information, but due to the denseness of the core, it is also subject to exponential memory growth with the number of tensor modes. Tensor train (TT) decomposition addresses this problem by seeking a sequence of three-mode cores; unfortunately, there are currently no guidelines for selecting the decomposition sequence. In this paper, we propose the GTT method for guiding the tensor train decomposition in selecting the decomposition sequence. GTT leverages the data characteristics (including the number of modes, the length of the individual modes, density, the distribution of mutual information, and the distribution of entropy) as well as the target decomposition rank to pick a decomposition order that will preserve information. Experiments with various data sets demonstrate that GTT effectively guides the TT-decomposition process towards decomposition sequences that better preserve accuracy.

Mao-Lin Li, K. Selçuk Candan, Maria Luisa Sapino
Noise Adaptive Tensor Train Decomposition for Low-Rank Embedding of Noisy Data

Tensor decomposition is a multi-modal dimensionality reduction technique to support similarity search and retrieval. Yet, the decomposition process itself is expensive and subject to the dimensionality curse. Tensor train decomposition is designed to avoid the explosion of intermediary data that plagues other tensor decomposition techniques. However, many tensor decomposition schemes, including tensor train decomposition, are sensitive to noise in the input data streams. While recent research has shown that it is possible to improve the resilience of the tensor decomposition process to noise and other forms of imperfections in the data by relying on probabilistic techniques, these techniques have a major deficiency: they treat the entire tensor uniformly, ignoring potential non-uniformities in the noise distribution. In this paper, we note that noise is rarely uniformly distributed in the data and propose a Noise-Profile Adaptive Tensor Train Decomposition ($$\mathtt{NTTD}$$) method to tackle this challenge. $$\mathtt{NTTD}$$ leverages a model-based noise-adaptive tensor train decomposition strategy: even rough a priori knowledge about the noise profiles of the tensor enables us to develop a sample assignment strategy that best suits the noise distribution of the given tensor.

Xinsheng Li, K. Selçuk Candan, Maria Luisa Sapino
ABID: Angle Based Intrinsic Dimensionality

The intrinsic dimensionality refers to the “true” dimensionality of the data, as opposed to the dimensionality of the data representation. For example, when attributes are highly correlated, the intrinsic dimensionality can be much lower than the number of variables. Local intrinsic dimensionality refers to the observation that this property can vary for different parts of the data set; and intrinsic dimensionality can serve as a proxy for the local difficulty of the data set. Most popular methods for estimating the local intrinsic dimensionality are based on distances, and the rate at which the distances to the nearest neighbors increase, a concept known as “expansion dimension”. In this paper we introduce an orthogonal concept, which does not use any distances: we use the distribution of angles between neighbor points. We derive the theoretical distribution of angles and use this to construct an estimator for intrinsic dimensionality. Experimentally, we verify that this measure behaves similarly, but complementarily, to existing measures of intrinsic dimensionality. By introducing a new idea of intrinsic dimensionality to the research community, we hope to contribute to a better understanding of intrinsic dimensionality and to spur new research in this direction.
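
The raw quantity the estimator is built on can be computed directly; a small NumPy sketch (the estimator itself, derived from the theoretical angle distribution, is not reproduced here):

```python
import numpy as np

def neighbor_angles(x, neighbors):
    """Pairwise angles (radians) between the difference vectors from point x to
    each of its k nearest neighbors."""
    V = np.asarray(neighbors, dtype=float) - np.asarray(x, dtype=float)
    V /= np.linalg.norm(V, axis=1, keepdims=True)   # normalise direction vectors
    cosines = np.clip(V @ V.T, -1.0, 1.0)
    i, j = np.triu_indices(len(V), k=1)             # each unordered pair once
    return np.arccos(cosines[i, j])
```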

Erik Thordsen, Erich Schubert
Sampled Angles in High-Dimensional Spaces

Similarity search using metric indexing techniques is largely a solved problem in low-dimensional spaces. However, techniques based only on the triangle inequality property start to fail as dimensionality increases. Since proper metric spaces allow a finite projection of any three objects into a 2D Euclidean space, the notion of angle can be validly applied among any three (but no more) objects. High dimensionality is known to have interesting effects on angles in vector spaces, but to our knowledge this has not been studied in more general metric spaces. Here, we consider the use of angles among objects in combination with distances. We show that, as dimensionality becomes higher, the variance in sampled angles decreases. Furthermore, sampled angles also become correlated with inter-object distances, giving different distributions between query solutions and non-solutions. We show the theoretical underpinnings of this observation in unbounded high-dimensional Euclidean spaces, and then examine how this property is reflected in some real-world high-dimensional spaces. Our experiments on both generated and real-world datasets demonstrate that these observations can have an important impact on the tractability of search as dimensionality increases.

Richard Connor, Alan Dearle
Local Intrinsic Dimensionality III: Density and Similarity

In artificial intelligence, machine learning, and other areas in which statistical estimation and modeling is common, distributions are typically assumed to admit a representation in terms of a probability density function (pdf). However, in many situations, such as mixture modeling and subspace methods, the distributions in question are not always describable in terms of a single pdf. In this paper, we present a theoretical foundation for the modeling of density ratios in terms of the local intrinsic dimensionality (LID) model, in a way that avoids the use of traditional probability density functions. These formulations provide greater flexibility when modeling data under the assumption of local variation in intrinsic dimensionality, in that no explicit dependence on a fixed-dimensional data representation is required.

Michael E. Houle
Analysing Indexability of Intrinsically High-Dimensional Data Using TriGen

The TriGen algorithm is a general approach to transform distance spaces in order to provide both exact and approximate similarity search in metric and non-metric spaces. This paper focuses on the reduction of intrinsic dimensionality using TriGen. Besides the well-known intrinsic dimensionality based on distance distribution, we inspect properties of triangles used in metric indexing (the triangularity) as well as properties of quadrilaterals used in ptolemaic indexing (the ptolemaicity). We also show how LAESA with triangle and ptolemaic filtering behaves on several datasets with respect to the proposed indicators.
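
For reference, the well-known distance-distribution estimate of intrinsic dimensionality mentioned above is commonly computed as $$\mu^2 / (2\sigma^2)$$ over a sample of pairwise distances; a minimal sketch (the triangularity and ptolemaicity indicators proposed in the paper are separate measures and are not shown):

```python
import numpy as np

def distance_distribution_idim(sample_distances):
    """Intrinsic dimensionality estimate from the distance distribution:
    mean(d)^2 / (2 * var(d)) over a sample of pairwise distances."""
    d = np.asarray(sample_distances, dtype=float)
    return d.mean() ** 2 / (2.0 * d.var())
```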

David Bernhauer, Tomáš Skopal
Reverse k-Nearest Neighbors Centrality Measures and Local Intrinsic Dimension

The estimation of local intrinsic dimensionality (LID) has applications ranging from adversarial attack disclosure to clustering and outlier detection, indexing, and data fingerprinting. In this paper, we analyze measures of network centrality in the kNN graph and their relation to LID measures. Our method ranks the dataset by its centrality, measured as the number of reverse or mutual kNN of each object. The computation of these measures involves only kNN queries, allowing a speedup in its computation using probabilistic indexing. A property of independent interest is that the rank is independent of k for a wide range of k values, leading to parameter-free density estimation and applications.
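
A minimal sketch of the reverse-kNN count used as the centrality score (assuming the kNN lists have already been computed, e.g. via a probabilistic index):

```python
from collections import Counter

def reverse_knn_counts(knn_lists):
    """knn_lists maps each object id to the ids of its k nearest neighbors;
    the returned counter gives, for each object, how many objects list it as a
    neighbor, i.e. its number of reverse kNN."""
    counts = Counter()
    for neighbors in knn_lists.values():
        counts.update(neighbors)
    return counts
```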

Oscar Pedreira, Stephane Marchand-Maillet, Edgar Chávez

Clustering

Frontmatter
BETULA: Numerically Stable CF-Trees for BIRCH Clustering

BIRCH is a widely known clustering approach that has influenced much subsequent research and many commercial products. The key contribution of BIRCH is the Clustering Feature tree (CF-Tree), which is a compressed representation of the input data. As new data arrives, the tree is eventually rebuilt to increase the compression. Afterward, the leaves of the tree are used for clustering. Because of the data compression, this method is very scalable. The idea has been adopted, for example, for k-means, data stream, and density-based clustering. The clustering features used by BIRCH are simple summary statistics that can easily be updated with new data: the number of points, the linear sum, and the sum of squared values. Unfortunately, the way the sum of squares is then used in BIRCH is prone to catastrophic cancellation. We introduce a replacement cluster feature that does not have this numeric problem, that is not much more expensive to maintain, and that makes many computations simpler and hence more efficient. These cluster features can also easily be used in other work derived from BIRCH, such as algorithms for streaming data. In the experiments, we demonstrate the numerical problem and compare the performance of the original algorithm to that of the improved cluster features.
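
The numerical problem can be reproduced in a few lines (an illustration of the cancellation using the original BIRCH summaries, i.e. count, linear sum, and sum of squares; BETULA's replacement feature is not shown):

```python
import numpy as np

# Points far from the origin but close to each other.
x = 1e8 + np.random.default_rng(0).standard_normal(1000)

N, LS, SS = len(x), x.sum(), (x * x).sum()
var_from_summaries = SS / N - (LS / N) ** 2     # BIRCH-style formula: catastrophic
                                                # cancellation, may even turn negative
var_two_pass = ((x - x.mean()) ** 2).mean()     # numerically stable reference (~1.0)
print(var_from_summaries, var_two_pass)
```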

Andreas Lang, Erich Schubert
Using a Set of Triangle Inequalities to Accelerate K-means Clustering

k-means clustering is a well-known problem in data mining and machine learning. However, the de facto standard algorithm, Lloyd’s k-means, spends a large amount of time on distance calculations. Elkan’s k-means algorithm, one prominent approach, exploits the triangle inequality to greatly reduce such distance calculations between points and centers, while achieving exactly the same clustering results with a significant speed improvement, especially on high-dimensional datasets. In this paper, we propose a set of triangle inequalities to enhance the filtering step of Elkan’s k-means algorithm. With our new filtering bounds, a filtering-based Elkan (FB-Elkan) is proposed, which preserves the same results as Lloyd’s k-means algorithm and additionally prunes unnecessary distance calculations. In addition, a memory-optimized Elkan (MO-Elkan) is provided, where the space complexity is greatly reduced by trading off the maintenance of lower bounds against run-time efficiency. Through evaluations with real-world datasets, FB-Elkan in general accelerates the original Elkan’s k-means algorithm for high-dimensional datasets (up to 1.69x), whereas MO-Elkan outperforms the others for low-dimensional datasets (up to 2.48x). Specifically, when the datasets have a large number of points, i.e., $$n\ge 5$$ M, MO-Elkan can still derive the exact clustering results, while the original Elkan’s k-means algorithm is not applicable due to memory limitations.
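
For context, the classic Elkan filter that the new bounds strengthen rests on a single triangle-inequality argument; a one-function sketch:

```python
def elkan_can_skip(upper_bound_x, d_center_center):
    """Classic Elkan filter: let u(x) be an upper bound on d(x, c(x)) and let
    d_center_center = d(c(x), c') for another center c'. If u(x) <= d(c(x), c')/2,
    then d(x, c') >= d(c(x), c') - d(x, c(x)) >= d(x, c(x)), so c' cannot be
    closer and the distance d(x, c') need not be computed."""
    return upper_bound_x <= 0.5 * d_center_center
```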

Qiao Yu, Kuan-Hsun Chen, Jian-Jia Chen
Angle-Based Clustering

The amount of data increases steadily, and yet most clustering algorithms perform complex computations for every single data point. Furthermore, the Euclidean distance used by most clustering algorithms is often not the best choice for datasets with arbitrarily shaped clusters or with high dimensionality. Based on ABOD, we introduce ABC, the first angle-based clustering method. The algorithm first identifies a small part of the data as border points of clusters, based on the angle between their neighbors. Those few border points can, with some adjustments, be clustered with well-known clustering algorithms like hierarchical clustering with single linkage or DBSCAN. Residual points can quickly and easily be assigned to the cluster of their nearest border point, so the overall runtime is heavily reduced while the results improve or remain similar.

Anna Beer, Dominik Seeholzer, Nadine-Sarah Schüler, Thomas Seidl

Artificial Intelligence and Similarity

Frontmatter
Improving Locality Sensitive Hashing by Efficiently Finding Projected Nearest Neighbors

Similarity search in high-dimensional spaces is an important task for many multimedia applications. Due to the notorious curse of dimensionality, approximate nearest neighbor techniques are preferred over exact search techniques since they can return good enough results at a much better speed. Locality Sensitive Hashing (LSH) is a very popular random hashing technique for finding approximate nearest neighbors. Existing state-of-the-art LSH techniques that aim to improve the overall process mainly focus on minimizing the total number of I/Os while sacrificing the overall processing time. The main time-consuming step in LSH techniques is finding neighboring points in projected spaces. We present a novel index structure called radius-optimized Locality Sensitive Hashing (roLSH). With the help of sampling techniques and neural networks, we present two techniques to find neighboring points in projected spaces efficiently, without sacrificing the accuracy of the results. Our extensive experimental analysis on real datasets shows the performance benefit of roLSH over existing state-of-the-art LSH techniques.

Omid Jafari, Parth Nagarkar, Jonathan Montaño
SIR: Similar Image Retrieval for Product Search in E-Commerce

We present a similar image retrieval (SIR) platform that is used to quickly discover visually similar products in a catalog of millions. Given the size, diversity, and dynamism of our catalog, product search poses many challenges. It can be addressed by building supervised models to tag product images with labels representing themes and later retrieving them by those labels. This approach suffices for common and perennial themes like “white shirt” or “lifestyle image of TV”. It does not work for new themes such as “e-cigarettes”, hard-to-define ones such as “image with a promotional badge”, or ones with a short relevance span such as “Halloween costumes”. SIR is ideal for such cases because it allows us to search by an example, not a pre-defined theme. We describe the steps - embedding computation, encoding, and indexing - that power the approximate nearest neighbor search back-end. We also highlight two applications of SIR. The first is related to the detection of products with various types of potentially objectionable themes. This application is run with a sense of urgency, so the typical time frame to train and bootstrap a model is not permitted. Also, these themes are often short-lived based on current trends, so spending resources to build a lasting model is not justified. The second application is a variant item detection system in which SIR helps discover visual variants that are hard to find through text search. We analyze the performance of SIR in the context of these applications.

Theban Stanley, Nihar Vanjara, Yanxin Pan, Ekaterina Pirogova, Swagata Chakraborty, Abon Chaudhuri
Cross-Resolution Deep Features Based Image Search

Deep Learning models proved to be able to generate highly discriminative image descriptors, named deep features, suitable for similarity search tasks such as Person Re-Identification and Image Retrieval. Typically, these models are trained by employing high-resolution datasets, therefore reducing the reliability of the produced representations when low-resolution images are involved. The similarity search task becomes even more challenging in cross-resolution scenarios, i.e., when a low-resolution query image has to be matched against a database containing descriptors generated from images at different, and usually higher, resolutions. To solve this issue, we proposed a deep learning-based approach by which we empowered a ResNet-like architecture to generate resolution-robust deep features. Once trained, our models were able to generate image descriptors less brittle to resolution variations, thus being useful for fulfilling a similarity search task in cross-resolution scenarios. To assess their performance, we used synthetic as well as natural low-resolution images. An immediate advantage of our approach is that there is no need for Super-Resolution techniques, thus avoiding the need to synthesize queries at higher resolutions.

Fabio Valerio Massoli, Fabrizio Falchi, Claudio Gennaro, Giuseppe Amato
Learning Distance Estimators from Pivoted Embeddings of Metric Objects

Efficient indexing and retrieval in generic metric spaces often translate into the search for approximate methods that can retrieve samples relevant to a query while performing the fewest possible distance computations. To this end, when indexing and fulfilling queries, distances are computed and stored only against a small set of reference points (also referred to as pivots) and then used in geometrical rules to estimate real distances and include or exclude elements from the result set. In this paper, we propose to learn a regression model that estimates the distance between a pair of metric objects starting from their distances to a set of reference objects. We explore architectural hyper-parameters and compare with the state-of-the-art geometrical method based on the n-simplex projection. Preliminary results show that our model provides comparable or slightly degraded performance while being more efficient and applicable to generic metric spaces.

Fabio Carrara, Claudio Gennaro, Fabrizio Falchi, Giuseppe Amato

Demo and Position Papers

Frontmatter
Visualizer of Dataset Similarity Using Knowledge Graph

Many institutions choose to make their datasets available as Open Data. Open Data datasets are described by publisher-provided metadata and are registered in catalogs such as the European Data Portal. In spite of that, findability still remains a major issue. One of the main reasons is that metadata is captured in different contexts and with different background knowledge, so that the keyword-based search provided by the catalogs is insufficient. A solution is to use enriched querying that employs a dataset similarity model built on a shared context represented by a knowledge graph. However, a “black-box” dataset similarity may not fit the user's needs well. If an explainable similarity model is used, the issue can be tackled by providing users with a visualisation of the dataset similarity. This paper introduces a web-based tool for dataset similarity visualisation called ODIN (Open Dataset INspector). ODIN visualises knowledge graph-based dataset similarity, thus offering an explanation to the user. By understanding the similarity, users can discover additional datasets that match their needs or reformulate the query to better reflect the knowledge graph. Last but not least, the user can analyze and/or design the similarity model itself.

Petr Škoda, Jakub Matějík, Tomáš Skopal
Vitrivr-Explore: Guided Multimedia Collection Exploration for Ad-hoc Video Search

vitrivr is an open-source system for indexing and retrieving multimedia data based on its content, and it has been a fixture at the Video Browser Showdown (VBS) in past years. While vitrivr has proven to be competitive in content-based retrieval due to the many different query modes it supports, its functionality is rather limited when it comes to exploring a collection or searching result sets based on content. In this paper, we present vitrivr-explore, an extension to the vitrivr stack that allows users to explore multimedia collections using relevance feedback. Our implementation integrates into the existing features of vitrivr and exploits self-organizing maps. Users initialize the exploration by either starting with a query or just picking examples from a collection while browsing. Exploration can be based on a mixture of semantic and visual features. We describe our architecture and implementation and present first results on the effectiveness of vitrivr-explore in a VBS-like evaluation. These results show that vitrivr-explore is competitive for Ad-hoc Video Search (AVS) tasks, even without user initialization.

Silvan Heller, Mahnaz Parian, Maurizio Pasquinelli, Heiko Schuldt
Running Experiments with Confidence and Sanity

Analyzing data from large experimental suites is a daily task for anyone doing experimental algorithmics. In this paper we report on several approaches we tried for this seemingly mundane task in a similarity search setting, reflecting on the challenges it poses. We conclude by proposing a workflow, which can be implemented using several tools, that allows analyzing experimental data with confidence. The extended version of this paper and the support code are provided at https://github.com/Cecca/running-experiments .

Martin Aumüller, Matteo Ceccarello

Doctoral Symposium

Frontmatter
Temporal Similarity of Trajectories in Graphs

The analysis of similar trajectories in a network provides useful information for different applications. In this study, we are interested in algorithms to efficiently retrieve similar trajectories. Many studies have focused on retrieving similar trajectories by extracting the geometrical and geographical information of trajectories. We provide a similarity function that makes use of both the temporal aspect of trajectories and the structure of the underlying network. We propose an approximation technique offering the top-k similar trajectories with respect to a query in a specified time interval in an efficient way. We also investigate how our idea can be applied to the similar behavior of tourists, so as to offer high-quality predictions of their next movements.

Shima Moghtasedi
Relational Visual-Textual Information Retrieval

With the advent of deep learning, multimedia information processing gained a huge boost, and astonishing results have been observed on a multitude of interesting visual-textual tasks. Relation networks paved the way towards an attentive processing methodology that considers images and texts as sets of basic interconnected elements (regions and words). These winning ideas recently helped to reach the state-of-the-art on the image-text matching task. Cross-media information retrieval has been proposed as a benchmark to test the capabilities of the proposed networks to match complex multi-modal concepts in the same common space. Modern deep-learning powered networks are complex and almost all of them cannot provide concise multi-modal descriptions that can be used in fast multi-modal search engines. In fact, the latest image-sentence matching networks use cross-attention and early-fusion approaches, which force all the elements of the database to be considered at query time. In this work, I will try to lay down some ideas to bridge the gap between the effectiveness of modern deep-learning multi-modal matching architectures and their efficiency, as far as fast and scalable visual-textual information retrieval is concerned.

Nicola Messina
Backmatter
Metadata
Title
Similarity Search and Applications
Edited by
Prof. Shin'ichi Satoh
Lucia Vadicamo
Dr. Arthur Zimek
Fabio Carrara
Prof. Dr. Ilaria Bartolini
Martin Aumüller
Prof. Björn Þór Jónsson
Rasmus Pagh
Copyright Year
2020
Electronic ISBN
978-3-030-60936-8
Print ISBN
978-3-030-60935-1
DOI
https://doi.org/10.1007/978-3-030-60936-8
