## About this book

This book presents cutting-edge research on ways to bridge the semantic gap in image and video analysis. The chapters address successive stages of image processing: first feature extraction, then segmentation, then object recognition, and finally the semantic interpretation of the image.

The semantic gap is a challenging area of research. The term describes the difference between the low-level features extracted from an image and the high-level semantic meaning that people derive from it. The result of semantic interpretation depends greatly on lower-level vision techniques such as feature selection, segmentation, and object recognition. Deep models have freed humans from manually selecting and extracting features: deep learning does this automatically, developing more abstract features at successive levels.

The book offers a valuable resource for researchers, practitioners, students and professors in Computer Engineering, Computer Science and related fields whose work involves images, video analysis, image interpretation and so on.

## Table of contents

### Chapter 1. Semantic Gap in Image and Video Analysis: An Introduction

Abstract
The chapter gives a brief introduction to the problem of the semantic gap in content-based image retrieval systems. It describes the complex process of image processing, which leads from raw images through successive stages to the semantic interpretation of the image. The content of each chapter in the book is then briefly summarized.
Halina Kwaśnicka, Lakhmi C. Jain

### Chapter 2. Low-Level Feature Detectors and Descriptors for Smart Image and Video Analysis: A Comparative Study

Abstract
Local feature detectors and descriptors (hereinafter extractors) play a key role in modern computer vision. Their purpose is to extract, from any image, a set of discriminative patterns (hereinafter keypoints) present on parts of the background and/or foreground elements of the image. A wide range of practical applications (e.g., vehicle tracking, person re-identification) requires algorithms that can detect, recognize, and track the same keypoints within a video sequence. Smart cameras can acquire images and videos of a scene of interest under different intrinsic (e.g., focus, iris) and extrinsic (e.g., pan, tilt, zoom) parameters. These parameters can make it hard to recognize the same keypoint in consecutive images when critical factors such as scale, rotation, and translation are present. The aim of this chapter is to provide a comparative study of the most popular low-level local feature extractors: SIFT, SURF, ORB, PHOG, WGCH, Haralick, and A-KAZE. The chapter starts with an overview of the extractors, each referenced in a concrete case study to show its potential and usage. The extractors are then compared on the Freiburg-Berkeley Motion Segmentation (FBMS-59) dataset, a well-known video collection widely used by the computer vision community. Starting from a default setting of the local feature extractors, the comparison discusses their behavior and robustness in terms of invariance to the most important critical factors. The chapter also reports comparative considerations about one of the basic steps that builds on the feature extractors: the matching process. Finally, the chapter points out key considerations about the use of the discussed extractors in real application domains.
D. Avola, L. Cinque, G. L. Foresti, N. Martinel, D. Pannone, C. Piciarelli
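
The matching step mentioned in the abstract can be illustrated with a minimal sketch. It assumes binary descriptors of the kind ORB produces, compares them by Hamming distance, and keeps a match only if it passes a nearest-neighbour ratio test. The function names `hamming` and `match_descriptors`, the toy 8-bit descriptors, and the 0.75 threshold are illustrative assumptions, not taken from the chapter.

```python
from typing import List, Tuple


def hamming(a: int, b: int) -> int:
    """Number of differing bits between two binary descriptors stored as ints."""
    return bin(a ^ b).count("1")


def match_descriptors(
    query: List[int], train: List[int], ratio: float = 0.75
) -> List[Tuple[int, int, int]]:
    """Brute-force nearest-neighbour matching with a ratio test.

    Returns (query_index, train_index, distance) for each accepted match.
    """
    matches = []
    for qi, q in enumerate(query):
        # Distances from this query descriptor to every train descriptor, ascending.
        ranked = sorted((hamming(q, t), ti) for ti, t in enumerate(train))
        if len(ranked) < 2:
            continue  # the ratio test needs at least two candidates
        (best_d, best_i), (second_d, _) = ranked[0], ranked[1]
        # Accept only if the best candidate is clearly better than the runner-up.
        if best_d < ratio * second_d:
            matches.append((qi, best_i, best_d))
    return matches


# Toy 8-bit descriptors (hypothetical; real ORB descriptors are 256 bits long).
frame_a = [0b10110010, 0b01001101, 0b00111100]
frame_b = [0b10110011, 0b00001111, 0b01001101]
matches = match_descriptors(frame_a, frame_b)
# matches == [(0, 0, 1), (1, 2, 0)]; the third descriptor is rejected as ambiguous
```

The ratio test discards the third descriptor because its two closest candidates are equally distant, which is the kind of ambiguous match that tends to produce false correspondences between frames.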

### Chapter 3. Scale-Insensitive MSER Features: A Promising Tool for Meaningful Segmentation of Images

Abstract
Automatic annotation of image contents can be performed more efficiently if it is supported by reliable segmentation algorithms that extract, as accurately as possible, areas with a certain level of semantic uniformity on top of the default pictorial uniformity of the regions that segmentation methods produce. The results should also be insensitive to noise, textures, and other effects that typically distort such uniformity. This chapter discusses a segmentation technique based on SIMSER (scale-insensitive maximally stable extremal regions) features, which generalize the popular MSER features. The segmentation results are shown to conform promisingly (at least in selected applications) with semantic image interpretation. Additionally, the approach has a relatively low computational complexity ($$O(n \log n)$$ or $$O(n \log n \log\log n)$$, where $$n$$ is the image resolution), which makes it potentially useful in real-time applications and/or on low-cost mobile devices. First, the chapter presents the fundamentals of the SIMSER detector (and the original MSER detector) in gray-level images. Then, relations between semantics-based image annotation and SIMSER features are investigated and illustrated by extensive experiments (including color images, which are the main area of interest).
Andrzej Śluzek

### Chapter 4. Active Partitions in Localization of Semantically Important Image Structures

Abstract
This chapter presents active partitions, a generalization of the active contours concept to image representations other than pixel-based ones. Active contours are methods that search an image for contours that are optimal with respect to a given objective function. Their main advantage is that they can use additional expert knowledge while analyzing the image. This is especially important when the image itself does not contain enough visual information for a proper interpretation of its content. That knowledge can be incorporated into the search process through the choice of contour model, through soft constraints in the energy function, or through hard constraints in the optimization procedure. All these advantages are preserved in active partitions, where the image content is described not with pixels but with another set of semantically more informative elements. Consequently, active partitions seek not an optimal contour but an optimal partition of the given set of elements. The change of image content description is advantageous as well: it reduces the size of the search space and allows humans to express their knowledge in a more intuitive way.
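
For orientation, the energy function of a classical parametric active contour (a "snake") is commonly written as below. This is the standard textbook formulation, given here only as an assumed illustration; the chapter's own objective function for active partitions may differ. The symbols are assumptions: $$v(s)$$ is the contour parameterized by $$s \in [0, 1]$$, $$\alpha$$ and $$\beta$$ weight the soft smoothness constraints, and $$E_{\text{ext}}$$ is an external term derived from the image or from expert knowledge.

$$E(v) = \int_0^1 \left[ \tfrac{1}{2}\left( \alpha\,\lvert v'(s) \rvert^2 + \beta\,\lvert v''(s) \rvert^2 \right) + E_{\text{ext}}\bigl(v(s)\bigr) \right] ds$$

Minimizing $$E(v)$$ trades contour smoothness against agreement with the image, which is where soft constraints enter; hard constraints would instead restrict the set of contours the optimizer is allowed to consider.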

### Chapter 5. Model-Based 3D Object Recognition in RGB-D Images

Abstract
A computational framework for 3D object recognition in RGB-D images is presented. The focus is on computer vision applications in indoor autonomous robotics, where objects need to be recognized so that the robot can grasp and manipulate them, or where the entire scene must be recognized to allow high-level cognitive tasks to be performed. The framework integrates solutions for generic (i.e., type-based) object representation (e.g., semantic networks), trainable transformations between abstraction levels (e.g., by neural networks), reasoning under uncertain and partial data (e.g., Dynamic Bayesian Networks, Fuzzy Logic), optimized model-to-data matching (e.g., constraint optimization problems), and efficient search strategies (switching between data-driven and model-driven inference steps). The computational implementation of the object model and the object recognition strategy is presented in more detail. Testing scenarios deal with the recognition of cups and bottles or household furniture. The experiments and the chosen applications confirmed that this approach is valid and may easily be adapted to multiple scenarios.
Maciej Stefańczyk, Włodzimierz Kasprzak

### Chapter 6. Ontology-Based Structured Video Annotation for Content-Based Video Retrieval via Spatiotemporal Reasoning

Abstract
The constantly increasing popularity and ubiquity of videos calls for efficient automated mechanisms for processing video contents. This is a big challenge because of the huge gap between what software agents can obtain from signal processing and what humans can comprehend based on cognition, knowledge, and experience. Automatically extracted low-level video features typically do not correspond to the concepts, persons, and events depicted in videos. To narrow the semantic gap, the depicted concepts and their spatial relations can be described in a machine-interpretable form using formal definitions from structured data resources. Rule-based mechanisms are efficient in describing the temporal information of actions and video events.
Leslie F. Sikos

### Chapter 7. Deep Learning—A New Era in Bridging the Semantic Gap

Abstract
The chapter deals with the semantic gap, a well-known phenomenon in the area of vision systems. Despite significant research efforts, overcoming the semantic gap remains a challenge. One of the most popular research areas in which this problem arises and makes good results hard to obtain is image retrieval, and this chapter focuses on it. As deep learning models gain popularity among researchers and produce increasingly spectacular results, the use of deep learning models to bridge the semantic gap in Content-Based Image Retrieval (CBIR) is the central issue of this chapter. The chapter briefly presents the traditional approaches to CBIR, then briefly introduces deep learning methods and models, and shows the application of deep learning at particular levels of CBIR: the feature level, the common-sense knowledge level, and the level of inference about a scene.
Urszula Markowska-Kaczmar, Halina Kwaśnicka

### Backmatter
