
2020 | Book

Unsupervised Learning in Space and Time

A Modern Approach for Computer Vision using Graph-based Techniques and Deep Neural Networks


About this book

This book addresses one of the most important unsolved problems in artificial intelligence: the task of learning, in an unsupervised manner, from massive quantities of spatiotemporal visual data that are available at low cost. The book covers important scientific discoveries and findings, with a focus on the latest advances in the field.

Presenting a coherent structure, the book logically connects novel mathematical formulations and efficient computational solutions for a range of unsupervised learning tasks, including visual feature matching, learning and classification, object discovery, and semantic segmentation in video. The final part of the book proposes a general strategy for visual learning over several generations of student-teacher neural networks, along with a unique view on the future of unsupervised learning in real-world contexts.

Offering a fresh approach to this difficult problem, the book reviews several efficient, state-of-the-art unsupervised learning algorithms in detail, complete with an analysis of their performance on various tasks, datasets, and experimental setups. By highlighting the interconnections between these methods, it brings many seemingly diverse problems together in an elegant, unified way.

Serving as an invaluable guide to the computational tools and algorithms required to tackle the exciting challenges in the field, this book is a must-read for graduate students seeking a greater understanding of unsupervised learning, as well as researchers in computer vision, machine learning, robotics, and related disciplines.

Table of Contents

Frontmatter
Chapter 1. Unsupervised Visual Learning: From Pixels to Seeing
Abstract
This book is about unsupervised learning. That is one of the most challenging puzzles that we must solve and put together, piece by piece, in order to decode the secrets of intelligence. Here, we move closer to that goal by connecting classical computational models to newer deep learning ones, and then building on fundamental and intuitive unsupervised learning principles. We want to reduce the unsupervised learning problem to a set of essential ideas and then develop the computational tools needed to implement them in the real world. Eventually, we aim to imagine a universal unsupervised learning machine, the Visual Story Network. The book is written for young students as well as experienced researchers, engineers, and professors. It presents computational models and optimization algorithms in sufficient technical detail, while also creating and maintaining a big intuitive picture of the main subject. Different tasks, such as graph matching and clustering, feature selection, classifier learning, unsupervised object discovery and segmentation in video, teacher-student learning over multiple generations, and recurrent graph neural networks are brought together, chapter by chapter, under the same umbrella of unsupervised learning. In the current chapter, we introduce the reader to the overall story of the book, offering a unified picture of the different topics that will be covered in detail in the chapters to follow. Besides sharing the main goal of learning without human supervision, the problems and tasks presented in the book also share common computational graph models and optimization methods, such as spectral graph matching, spectral clustering, and the integer projected fixed point method. By bringing together similar mathematical formulations across different tasks, all guided by common intuitive principles towards a universal unsupervised learning system, the book invites the reader to absorb and then participate in the creation of the next generation of artificial intelligence.
Marius Leordeanu
Chapter 2. Unsupervised Learning of Graph and Hypergraph Matching
Abstract
Graph and hypergraph matching are fundamental problems in computer vision, which have received steadily growing attention over the last two decades, driven by modern algorithms and models that are both fast and accurate. Graph and hypergraph matching are successful in tackling many real-world tasks that require 2D and 3D feature matching and alignment. Matching is a general problem in vision, and graphs have always been quintessential for modeling information that has both local and longer range connections. Therefore, graph matching will continue to influence computer vision and robotics research, crossing over many different mathematical and computational models, including Bayesian networks, conditional and Markov random fields, spectral graph models, and, more recently, deep neural networks. While graph matching is limited to using pairwise relationships, hypergraph matching permits the use of relationships between sets of features of any order. Consequently, it carries the promise of making matching more robust to changes in scale, deformations, and outliers. In this chapter, we present several key ideas, computational models, and optimization methods that introduced to the literature the possibility of learning graph and hypergraph matching in a semi-supervised or even completely unsupervised fashion. We start by describing two methods for solving graph and hypergraph matching based on spectral matching and integer projected fixed point methods, introduced in Chap. 1. Then, based on the same intuition (also discussed in Chap. 1) that correct assignments form a strong cluster of agreements, we introduce the main approach to learning as it applies to graph and hypergraph matching, as well as to the more general Maximum A Posteriori (MAP) inference problem in probabilistic graphical models (MRFs and CRFs). In the last part of the chapter, we demonstrate experimentally the effectiveness of the algorithms presented and, whenever possible, provide extensive comparisons to other top published methods. Moreover, we show that our learning approach is general and can significantly boost the performance of different state-of-the-art matching algorithms in the literature.
Marius Leordeanu
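
As an illustration of the spectral matching step mentioned in the abstract above, here is a minimal Python sketch (not the book's implementation; the function name, iteration count, and the simple greedy discretization are illustrative assumptions): the principal eigenvector of the assignment affinity matrix is obtained by power iteration and then discretized greedily under one-to-one constraints.

import numpy as np

def spectral_matching(M, assignments, n_iter=100):
    """M: symmetric, non-negative affinity matrix over candidate assignments.
    assignments: list of (i, a) pairs, one per row/column of M.
    Returns a one-to-one set of assignments picked greedily from the principal eigenvector."""
    n = M.shape[0]
    v = np.ones(n) / np.sqrt(n)
    for _ in range(n_iter):                  # power iteration -> principal eigenvector
        v = M @ v
        v /= np.linalg.norm(v) + 1e-12
    chosen, used_i, used_a = [], set(), set()
    for idx in np.argsort(-v):               # greedy discretization under one-to-one constraints
        i, a = assignments[idx]
        if i not in used_i and a not in used_a:
            chosen.append((i, a))
            used_i.add(i)
            used_a.add(a)
    return chosen
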
Chapter 3. Unsupervised Learning of Graph and Hypergraph Clustering
Abstract
Data clustering is an essential unsupervised learning problem in data mining, machine learning, and computer vision. In this chapter, we present in more depth our work on clustering, introduced in the first chapter, in which second- or higher order affinities between sets of data points are considered. We introduce our novel, efficient algorithm for graph-based clustering based on a variant of the Integer Projected Fixed Point (IPFP) method, adapted for the case of hypergraph clustering. This method has important theoretical properties, such as convergence and satisfaction of first-order necessary optimality conditions. As also discussed in Chap. 1, the main difference between IPFP applied to graph matching and IPFP applied to clustering comes from the different mapping constraints on the solution vector. Once the constraints are changed, the clustering algorithm follows directly from IPFP: at each iteration, the objective score is approximated by its first-order Taylor polynomial. Then, a discrete solution is found as the optimum of the resulting linear optimization problem. As in the matching case, the optimum of that linear approximation, over the relaxed real-valued domain of the clustering problem, is discrete. That is a nice property, since the original non-relaxed problem requires a discrete solution, not a real-valued one. Then, the optimum of the original score along the line between the current solution and the discrete one is found in closed form, for second-order (graph) or third-order (hypergraph) clustering tasks. The procedure is repeated until convergence, which usually happens very quickly, after 5–10 iterations, significantly fewer than competing methods require. We also present an algorithm for learning the parameters of the higher order affinities in the hypergraph clustering model, which is directly adapted from the approach used for graph and hypergraph matching. Thus, we learn the affinity functions using gradient ascent, by maximizing the dot product between the ground truth clustering (or the discrete clustering solution in the unsupervised case) and the soft clustering solution found by power iteration applied to the hypergraph tensor. In experiments on different model fitting tasks, formulated as third-order clustering, we outperform other top hypergraph clustering methods in the literature, especially in terms of computational speed, but also in terms of accuracy.
Marius Leordeanu
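
The iteration described in the abstract above (linearize, find a discrete optimum, closed-form line search, repeat) can be sketched for the second-order case as follows. This is a minimal illustration under the assumption that the cluster is sought on the probability simplex; the function name and stopping rule are assumptions, and the book's method also covers third-order hypergraph affinities.

import numpy as np

def ipfp_clustering(M, max_iter=30, tol=1e-9):
    """M: symmetric, non-negative pairwise affinity matrix between data points.
    Maximizes x^T M x over the simplex {x >= 0, sum(x) = 1}; the support of x marks the cluster."""
    n = M.shape[0]
    x = np.ones(n) / n                        # uniform starting point
    for _ in range(max_iter):
        grad = M @ x                          # gradient of x^T M x is 2 M x (first-order Taylor step)
        b = np.zeros(n)
        b[np.argmax(grad)] = 1.0              # discrete optimum of the linear problem: a simplex vertex
        d = b - x
        # closed-form line search of the quadratic f(x + t d) for t in [0, 1]
        a2 = d @ M @ d
        a1 = 2.0 * (x @ M @ d)
        if a2 >= 0:
            t = 1.0 if (a1 + a2) > 0 else 0.0
        else:
            t = np.clip(-a1 / (2.0 * a2), 0.0, 1.0)
        if t * np.linalg.norm(d) < tol:
            break
        x = x + t * d
    return x                                  # soft cluster indicator; threshold it for a discrete cluster
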
Chapter 4. Feature Selection Meets Unsupervised Learning
Abstract
Feature selection is a fundamental problem in learning. We are immersed in a huge quantity of spatial and temporal data, and one of the crucial questions, if we want to learn efficiently, is how to find the key cues that are correlated with our specific learning task. Often the task itself is not supervised, that is, we do not know exactly what we are looking for. In that case, we again turn our attention towards the natural clustering and correlations that take place in the spatiotemporal world. In this chapter, we present an efficient method that performs joint classifier learning and feature selection. The approach is able to discover sparse, compact representations of input features from a vast sea of candidates, with an almost unsupervised formulation. For the main algorithm to work, we require only the following knowledge: for each candidate feature, whether or not it has, on average, stronger values over positive samples than over negative ones. We call this bit of knowledge the feature sign. What is interesting is that the mathematical formulation of the problem follows directly from the clustering approach of Chap. 3, which is in turn related to the initial graph matching formulation of Chap. 2. The main feature selection idea boils down to discovering the cluster of features that fire together and are strongly correlated. The spikes in strongly intercorrelated group firings are often robust indicators of the presence of the positive class. In experiments, we show that discovering the correct set of relevant features can be done using as few as a single labeled training sample per class, which is used to estimate the feature signs. Then, using these feature signs, we extend an initial supervised learning problem into an unsupervised learning formulation that can incorporate new data without requiring ground truth labels. Thus, our method works simultaneously as a feature selection mechanism and as a fully competitive classifier. Our original algorithm has certain theoretical guarantees, low computational cost, and excellent accuracy, especially in difficult cases of very limited training data. Its practical value is demonstrated in large-scale recognition experiments on video: it outperforms, in speed and accuracy, established feature selection approaches such as AdaBoost, Lasso, and greedy forward-backward selection, as well as powerful classifiers such as SVM, especially in the case of limited supervised training data.
Marius Leordeanu
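
To give a feel for the feature-sign idea in the abstract above, here is a minimal Python sketch, intended only as an illustration under simplifying assumptions (it is not the chapter's algorithm): signs are estimated from a tiny labeled set, features are flipped so that positive evidence is high, and a sparse weight vector is found as a strong cluster of co-firing features by maximizing w^T C w on the simplex with a replicator-style update. All function names and the standardization step are assumptions.

import numpy as np

def estimate_feature_signs(X_pos, X_neg):
    """+1 if a feature is on average stronger over positives than negatives, else -1."""
    return np.where(X_pos.mean(axis=0) >= X_neg.mean(axis=0), 1.0, -1.0)

def select_features(X_unlabeled, signs, n_iter=30):
    """X_unlabeled: (n_samples, n_features) unlabeled data; signs: per-feature +/-1.
    Returns non-negative weights over features; large weights mark the selected group."""
    F = X_unlabeled * signs                   # flip features so 'positive evidence' is high
    F = (F - F.mean(axis=0)) / (F.std(axis=0) + 1e-12)
    C = np.clip(F.T @ F / len(F), 0.0, None)  # keep only positive co-firing as affinities
    w = np.ones(C.shape[1]) / C.shape[1]
    for _ in range(n_iter):                   # replicator-style update on the simplex
        w = w * (C @ w)
        w /= w.sum() + 1e-12
    return w

def classify(X, signs, w):
    """Score new samples as a weighted average of sign-flipped features."""
    return (X * signs) @ w
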
Chapter 5. Unsupervised Learning of Object Segmentation in Video with Highly Probable Positive Features
Abstract
Many times when learning without human supervision, it is possible to tell whether a certain cue or data sample is likely to belong to the positive class of interest. In this chapter, we study this case and show that such highly probable positive (HPP) features can be reliably used for learning in the real, natural world, without human supervision. We choose foreground object segmentation as our use case, since it is one of the fundamental problems in vision. The main task, in this case, is to automatically separate the main object of interest present in a video sequence from its surrounding background. An efficient solution to this task would have immense practical value: it would enable large-scale video interpretation at a high semantic level in the absence of costly manual labeling. In this chapter, we present several unsupervised algorithms for generating foreground object soft masks based on automatic selection of, and learning from, highly probable positive features. We start with a very simple and fast, yet surprisingly effective method that is able to produce robust object segmentations by using only simple colors as features. While being very simple to implement and understand, the algorithm constitutes the basis for a more general principle of learning from highly probable positive features, which we study theoretically and develop further within a more complex method for unsupervised video object segmentation. One important module in this algorithm connects to the feature selection by clustering method presented in Chap. 4; that approach is used here to learn an effective and robust patch-based descriptor based on color co-occurrences. We also introduce a novel and fast algorithm for background subtraction, called VideoPCA, which models the background scene with a linear subspace and regards the main foreground objects as regions that do not belong to that subspace. All algorithms and ideas presented are, at the core, connected by a single fundamental idea: learning from highly probable positive features, which are easy to detect in an unsupervised way with high precision and which, together, are effective in learning powerful classifiers. The idea naturally starts and evolves from the insights and conclusions of the previous chapters of the book. In this chapter, we show that such HPP features can be selected efficiently by taking into consideration the spatiotemporal appearance and motion consistency of the object in the video sequence. We also emphasize the role of the contrasting properties between the foreground object and its background. Our final foreground segmentation model is created over several stages: we start from pixel-level analysis and move to descriptors that consider information over groups of pixels, combined with efficient motion analysis. We also prove theoretical properties of our unsupervised learning method, which, under some mild constraints, is guaranteed to learn the correct classifier even in the unsupervised case. We achieve competitive and even state-of-the-art results on the challenging YouTube-Objects and SegTrack datasets, while being at least one order of magnitude faster than the competition. The strong performance of our method, along with its theoretical properties, constitutes another step towards solving unsupervised discovery in video.
Marius Leordeanu
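
The background-subspace idea behind VideoPCA, as described in the abstract above, can be sketched in a few lines of Python. This is a minimal illustration, not the published VideoPCA pipeline: frames are treated as vectors, a low-dimensional PCA subspace models the mostly static background, and pixels with high reconstruction error are treated as likely foreground. The function name and the number of components are assumptions.

import numpy as np

def videopca_soft_masks(frames, n_components=8):
    """frames: (n_frames, height, width) grayscale video with values in [0, 1].
    Returns per-frame soft foreground masks in [0, 1]."""
    n, h, w = frames.shape
    X = frames.reshape(n, h * w)
    mean = X.mean(axis=0)
    Xc = X - mean
    # principal background subspace from the top right singular vectors
    U, S, Vt = np.linalg.svd(Xc, full_matrices=False)
    V = Vt[:n_components]                     # basis of the background subspace
    recon = (Xc @ V.T) @ V + mean             # project each frame onto the background subspace
    error = np.abs(X - recon)                 # large residual -> pixel not explained by background
    masks = error / (error.max() + 1e-12)
    return masks.reshape(n, h, w)
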
Chapter 6. Coupling Appearance and Motion: Unsupervised Clustering for Object Segmentation Through Space and Time
Abstract
In this chapter, we approach the unsupervised learning problem in both space and time head-on, by considering both spatial and temporal dimensions from the very beginning. We couple, from the start, the appearance of objects, which defines their spatial properties, with their motion, which defines their existence in time, and provide a unique graph clustering formulation in both space and time for the problem of unsupervised object discovery in video. Again, we resort to a graph formulation, with one-to-one correspondence between graph nodes and video pixels. Graph nodes that belong to the main object of interest should form a strong cluster: they are linked through trajectories defined by pixels that belong to the same physical object point, and they should also have similar appearance features along those trajectories. The clustering problem aims to maximize both their agreements in motion and their similarities in appearance. Therefore, we regard this object discovery problem as a graph clustering problem, for which we provide an efficient spectral approach: the object segmentation in the space-time volume is captured by the principal eigenvector of a Feature-Motion matrix, which we introduce for the first time. Even though the matrix is huge, we never build it explicitly; nevertheless, our segmentation algorithm is guaranteed to converge to its principal eigenvector. The eigenvector provides the optimal solution to an eigenvalue problem that does not depend on initialization and, for this reason, is very robust in practice, able to find the foreground object in the video as the strongest cluster in the space-time graph. The approach in this chapter naturally relates to the principles of unsupervised learning presented in the introduction and also provides a first, lower-level approach to coupling space and time processing, which is further explored at a higher semantic level in the final chapter, where we present our recurrent space-time graph neural network model (Nicolicioiu et al., Neural Information Processing Systems, [1]). In practice, our method, termed GO-VOS, achieves state-of-the-art results on three difficult datasets in the literature: DAVIS, SegTrack, and YouTube-Objects.
Marius Leordeanu
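
The key computational idea above, finding the principal eigenvector of a matrix that is never stored explicitly, can be illustrated with a matrix-free power iteration. This is only a sketch under stated assumptions: the actual GO-VOS Feature-Motion operator couples optical-flow trajectories with appearance features, whereas here a tiny explicit matrix merely stands in for the implicit operator.

import numpy as np

def principal_eigenvector(matvec, n, n_iter=50):
    """matvec: function v -> M @ v for an implicit, non-negative symmetric matrix M of size n x n.
    Returns an approximation of the principal eigenvector (the soft space-time segmentation)."""
    x = np.ones(n) / np.sqrt(n)
    for _ in range(n_iter):
        x = matvec(x)                         # only matrix-vector products are ever computed
        x /= np.linalg.norm(x) + 1e-12
    return x

# Tiny stand-in for the implicit Feature-Motion operator:
M = np.array([[2.0, 1.0, 0.0],
              [1.0, 2.0, 0.5],
              [0.0, 0.5, 1.0]])
seg = principal_eigenvector(lambda v: M @ v, n=3)
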
Chapter 7. Unsupervised Learning in Space and Time over Several Generations of Teacher and Student Networks
Abstract
Unsupervised learning represents one of the most interesting challenges in computer vision today, as the material presented so far in the book has so often shown. The problem is essential for visual learning, coming in many different forms and tasks. Consequently, it has immense practical value, with many applications in artificial intelligence and emerging technologies, as large quantities of unlabeled images and videos can be collected at low cost. In this chapter, we address the unsupervised learning problem in the context of segmenting the main foreground objects in single images. We propose an unsupervised learning system with two pathways, the teacher and the student, designed to learn over several generations of teachers and students. At every generation, the teacher performs unsupervised object discovery in videos or collections of images, and an automatic selection module picks up good frame segmentations and passes them to the student pathway for training. At every generation, multiple students are trained, with different deep network architectures to ensure better diversity. The students at one generation help in training a better selection module, together forming a more powerful teacher pathway at the next generation. The approach presented in this chapter gathers together many ideas studied throughout the book. On one hand, the pathway used as teacher at the first generation is an unsupervised video object discoverer that immediately relates to the VideoPCA method introduced in Chap. 5 and the space-time clustering GO-VOS method introduced in Chap. 6. On the other hand, the idea of using agreements between multiple student nets at the current generation as teacher at the next generation relates directly to the principles for unsupervised learning proposed in Chap. 1. Such agreements constitute a reliable HPP signal (Chap. 5), which becomes a robust teacher for training the next-generation students. Their agreements are also used to train, in an unsupervised fashion, a deep network that selects only high-quality HPP object masks as supervisory signal. The principles for unsupervised learning are exploited fully in the teacher-student system presented here. The material in this chapter makes several original contributions, discussed next, which lay the foundation of our general concept of a universal unsupervised learning machine, the Visual Story Network, introduced in the final chapter of the book. In the extensive experimental validation of our proposed system, we show that the improvement in selection power, the training of multiple students, and the increase in unlabeled data significantly improve segmentation accuracy from one generation to the next. Our method achieves top results on three challenging datasets for object discovery in video, unsupervised image segmentation, and saliency detection. At test time, our system is fast, being one to two orders of magnitude faster than published unsupervised methods. We also test the strength of our unsupervised features within a well-known transfer learning setup and achieve competitive performance, proving that our unsupervised approach can be reliably used in a variety of computer vision tasks.
Marius Leordeanu
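
The agreement-as-teacher idea described in the abstract above can be illustrated with a small sketch. This is not the book's system, which uses deep student networks and a learned selection module; here, under simplifying assumptions, the soft masks of several students are averaged, and only frames where the students strongly agree are kept as pseudo-labels for training the next generation. The function name and threshold are illustrative.

import numpy as np

def teacher_from_students(student_masks, agreement_threshold=0.1):
    """student_masks: (n_students, n_frames, h, w) soft masks in [0, 1].
    Returns (pseudo_labels, keep), where keep flags frames with high inter-student agreement."""
    mean_mask = student_masks.mean(axis=0)                # ensemble (HPP-style) soft mask
    disagreement = student_masks.std(axis=0).mean(axis=(1, 2))
    keep = disagreement < agreement_threshold             # select only confidently agreed frames
    pseudo_labels = (mean_mask > 0.5).astype(np.float32)  # binarize for the next-generation students
    return pseudo_labels, keep
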
Chapter 8. Unsupervised Learning Towards the Future
Abstract
So far, we have created a basis for unsupervised learning in a more general sense. We proposed a set of principles to consider when designing a system that learns in the space-time domain with minimal supervision. We have also demonstrated in practice the usefulness and validity of our key ideas and showed how to create models that learn by themselves for a wide variety of vision tasks. We started with the case of unsupervised learning for feature matching in the first two chapters. Then we extended the ideas to the case of feature selection and classifier learning. The theoretical methods were developed, and the initial principles were then applied to the case of object discovery and segmentation in video. In the previous chapter, we presented a more general approach and showed how we can learn, in an unsupervised way, over several student-teacher generations, through an evolution process that takes advantage of all the key building blocks developed so far. In this final chapter, we take one step further and show how a more general self-supervised vision system, the Visual Story Graph Neural Network, could be created using the tools presented. We start by showing how the two main computational models discussed in the book, namely, classical graphs and modern neural networks, can be put together into a single unified recurrent graph neural network that processes data in space and time. We also demonstrate through extensive experiments that our recurrent graph neural net achieves state-of-the-art performance on several higher-level problems such as activity recognition, with minimal or no low-level human supervision. Finally, we introduce the more general concept of a Visual Story Network (VSN). The VSN concept is based on the ideas and principles presented in this book and could develop into a universal unsupervised learning machine, able to attain a general and coherent knowledge of its surrounding world in space and time.
Marius Leordeanu
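
As a very rough intuition for how a recurrent graph network can process data in space and time, here is a heavily simplified Python sketch; it is only an assumption-laden illustration, not the Recurrent Space-Time Graph Neural Network of the final chapter, which uses learned attention and gating. Node features are updated by averaging messages from spatial neighbors in the same frame and temporal neighbors in adjacent frames, followed by a simple recurrent-style blending step.

import numpy as np

def space_time_step(H, A_space, A_time, alpha=0.5):
    """H: (n_nodes, d) node features; A_space, A_time: (n_nodes, n_nodes) row-normalized
    adjacency matrices for same-frame and across-frame connections; alpha blends old and new state."""
    spatial_msg = A_space @ H                 # aggregate appearance context within each frame
    temporal_msg = A_time @ H                 # aggregate motion context across frames
    update = 0.5 * (spatial_msg + temporal_msg)
    return (1 - alpha) * H + alpha * update   # recurrent blending of previous state and messages
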
Backmatter
Metadata
Title
Unsupervised Learning in Space and Time
Written by
Assoc. Prof. Marius Leordeanu
Copyright Year
2020
Electronic ISBN
978-3-030-42128-1
Print ISBN
978-3-030-42127-4
DOI
https://doi.org/10.1007/978-3-030-42128-1