Hybrid soft computing approaches to content based video retrieval: A brief review
Introduction
With the immense popularity of video recorders and the falling cost of digital storage devices, voluminous amounts of video are uploaded at an incredible rate for online browsing and retrieval. As video encompasses the other three media types, i.e. text, image and audio, combining them into a single data stream for transmission and retrieval has drawn considerable attention from researchers and application developers. Image/video retrieval has been an active research domain for the last four decades.
In comparison to images, videos possess characteristic features. Although the organization of a video is not well defined, its content is richer than that of individual images. Archiving and indexing of images and videos based on semantic content is a challenging task, and subsequent manual retrieval and browsing of video data is laborious and time-intensive from the users' perspective. During the retrieval process, the primary objective is to present the videos of interest to the user. This broad domain of research is referred to as content-based video retrieval (CBVR). Since a video is a conglomeration of time-sequenced images, research in CBVR has been supplemented to a great extent by advances in content-based image retrieval (CBIR). Some CBVR applications have also been augmented with audio cues; approaches to audio indexing, retrieval and classification [1] are therefore important for advances in CBVR.
The term “content-based” refers to features such as color, texture, intensity, object trajectories or statistical characteristics at the low level. Mid-level features refer to feature points taken on an image or a series of images, as in a video. These feature points are detected by feature point detection and description algorithms such as SIFT [2], SURF [3], BRISK [4], DAISY [5], GIST [6], ORB [7], etc. The detected feature points in images are matched in order to compute the degree of similarity. Mid-level features are incapable of evaluating the semantic content in a video. This drawback is alleviated using high-level features such as edges, shapes, motion vectors, optical flow, event modeling (in videos), timbre and rhythm (in audio), involving different levels of semantics. High-level features are capable of handling semantic queries like “retrieve videos where blue sky and snowy mountains are present”, which require matching the semantic content in the video database. The major hurdle in processing high-level queries is the semantic gap that exists between high-level and low-level features.
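As an illustration of the matching step described above, the following sketch matches binary descriptors (of the kind produced by detectors such as ORB or BRISK) by nearest-neighbour search under the Hamming distance. The descriptors here are synthetic stand-ins generated at random, not the output of an actual detector, and the distance threshold is an illustrative assumption.

```python
import numpy as np

def match_descriptors(desc_a, desc_b, max_dist=10):
    """Greedy nearest-neighbour matching of binary descriptors
    (uint8 rows, e.g. 32 bytes per ORB descriptor) under the
    Hamming distance: the number of differing bits."""
    matches = []
    for i in range(len(desc_a)):
        # XOR, then count set bits against every candidate descriptor.
        dists = np.unpackbits(desc_a[i] ^ desc_b, axis=1).sum(axis=1)
        j = int(np.argmin(dists))
        if dists[j] <= max_dist:
            matches.append((i, j, int(dists[j])))
    return matches

# Toy 32-byte descriptors: set b is a copy of set a, so each
# descriptor should match its counterpart with distance 0.
rng = np.random.default_rng(0)
a = rng.integers(0, 256, size=(4, 32), dtype=np.uint8)
b = a.copy()
print(match_descriptors(a, b))
```

In a real pipeline the descriptor sets would come from two frames (or a query image and a database frame), and the number of good matches would contribute to the similarity score.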
The process of “retrieval” involves matching features extracted from the frames of the video against the query given by the user. The features are used to create a feature vector for each video, so every video in the database is represented as a point in an n-dimensional space, where n is the number of features under consideration. It is pertinent to mention here that taking too many features raises the computational cost, since the dimensionality of the feature vector increases. Over time, researchers have adopted various means of reducing dimensionality. Principal component analysis (PCA) [8] is a widely used technique that reduces the number of features by mapping the set of original features (P) to another set (Q) of derived features, where the cardinality of Q is smaller than that of P. Researchers have also relied on rough sets to identify features with high discriminative power and thereby reduce dimensionality by eliminating less important features. The features taken into consideration form the feature vector for each object (video) in the database, and the distance between feature vectors acts as a measure of similarity between videos. Commonly used distance measures include the Euclidean, Mahalanobis, Hausdorff and Chi-square distances. The end product of the retrieval process is a set of videos ranked according to relevance. Some systems use relevance feedback from the user to tune the retrieval and produce more meaningful results, which also helps to model human perception better.
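The P-to-Q mapping described above can be sketched with a minimal PCA implemented via the singular value decomposition, followed by a Euclidean distance between two reduced vectors. The feature matrix below is a synthetic placeholder for per-video feature vectors, assumed only for illustration.

```python
import numpy as np

def pca_reduce(X, k):
    """Project n-dimensional feature vectors (rows of X) onto their
    top-k principal components, yielding a derived feature set of
    lower cardinality than the original one."""
    Xc = X - X.mean(axis=0)                    # centre each feature
    # Rows of Vt are the principal directions (right singular vectors).
    _, _, Vt = np.linalg.svd(Xc, full_matrices=False)
    return Xc @ Vt[:k].T                       # shape (num_videos, k)

# Toy feature vectors for 5 "videos": 4 features, pairwise correlated,
# so 2 principal components retain almost all of the variance.
rng = np.random.default_rng(1)
base = rng.normal(size=(5, 2))
X = np.hstack([base, base + 0.01 * rng.normal(size=(5, 2))])
Q = pca_reduce(X, 2)

# Euclidean distance between reduced vectors serves as the
# similarity measure during retrieval.
d = float(np.linalg.norm(Q[0] - Q[1]))
print(Q.shape, d >= 0.0)
```

Ranking the database by this distance to the query's reduced vector yields the relevance-ordered result set mentioned in the text.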
For better understanding, an analogy is drawn between a document and a video sequence (refer Fig. 1). A document consists of paragraphs, much like a video is composed of scenes. A paragraph in turn represents a group of inter-related sentences, similar to the inter-related shots forming a scene in a video. Further down the hierarchy, sentences are composed of words, just as shots consist of frames. A frame in a video denotes a single image, whereas a shot is a consecutive sequence of frames recorded by a single camera. A scene represents a collection of semantically related and temporally adjacent shots imparting a high-level story, and a group of scenes comprises a sequence/story. Frames and shots are considered low-level temporal features of a video, whereas scenes and the sequence/story form the high-level features, which are close to human perception. In other words, shots act as the basic indexing elements for a video, since there is no perceivable change in scene content within a shot. A given video may be decomposed into its constituent units either in the temporal domain (scenes, shots or frames) or in the spatial domain (objects, color segments or pixels). Decomposing a video in the temporal domain involves a hierarchical disintegration of the video into scenes, shots or frames according to need, whereas analyzing the semantic content requires a spatial analysis of the frame contents. The initial step for any video analysis application is to decompose the video into its constituent shots; in the literature, this is referred to as shot boundary analysis or temporal video segmentation. As the visual content of the frames in a shot is similar, spatial segmentation is carried out in the next step to derive the semantic content in the form of the objects composing the shot and the inter-relationships among them.
Video segmentation methods thus provide a means to harness the structural primitives of videos for various applications related to content-based video retrieval.
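As a concrete illustration of temporal video segmentation, the sketch below declares a hard cut wherever the grey-level histogram difference between consecutive frames exceeds a threshold, one of the simplest classical shot boundary detection schemes. The synthetic frames, the bin count and the threshold value are all illustrative assumptions.

```python
import numpy as np

def detect_cuts(frames, bins=16, threshold=0.5):
    """Hard-cut detection: a shot boundary is declared at frame t when
    the normalised histogram difference to frame t-1 exceeds the
    threshold. The difference is scaled to lie in [0, 1]."""
    hists = []
    for f in frames:
        h, _ = np.histogram(f, bins=bins, range=(0, 256))
        hists.append(h / h.sum())              # normalise per frame
    cuts = []
    for t in range(1, len(hists)):
        diff = 0.5 * np.abs(hists[t] - hists[t - 1]).sum()
        if diff > threshold:
            cuts.append(t)                     # new shot starts at frame t
    return cuts

# Synthetic video: 6 uniformly dark frames followed by 6 bright ones,
# so the only boundary lies between frames 5 and 6.
dark = [np.full((8, 8), 20, dtype=np.uint8) for _ in range(6)]
bright = [np.full((8, 8), 200, dtype=np.uint8) for _ in range(6)]
print(detect_cuts(dark + bright))
```

Real detectors must additionally handle gradual transitions (fades, dissolves) and illumination changes, which is where the soft computing approaches reviewed later come in.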
The framework of a content-based video indexing, retrieval and summarization system is presented in Fig. 2. The input video may be segmented either at the semantic level (leading to scene detection) or with respect to the timeline (leading to shot boundary detection). High-level feature extraction leads to video mining, video classification and related tasks. High-level features are also necessary for semantic indexing in databases, which is required for content-based video retrieval. The query video given by the user goes through pre-processing steps such as segmentation and feature extraction before its similarity [9] to the videos in the database is measured. The indexing, browsing and retrieval tasks were quite manageable previously, owing to the smaller number of videos, and keywords were mainly used as meta-data for retrieval. However, with the exponential rise in videos available online, tools and applications for automatic indexing and retrieval of videos based on semantic content have become a necessity rather than a luxury.
The task of video summarization may be categorized into static and dynamic summarization. In a static summary [10], key-frames are chosen from each shot, and redundancy reduction is performed on the representative set to eliminate visually duplicate frames with similar content; low-level or mid-level features are used for this purpose. The use of key-frames has proven helpful not only for video indexing and summarization applications but also for maintaining video logs. In video summarization, the extraction of key-frames [11], [12], [13] from the original video sequence not only provides a visual abstract to the user but also acts as a means for faster content browsing. For producing a dynamic summary [14], features extracted from the shots are used for clustering [15]. Representative shots from each cluster, when coalesced together, form the dynamic summary. Alternatively, the shots may be weighted according to the content embedded within them, and the most significant shots chosen for generating the summary. The user may or may not specify the required duration of the summary. Needless to say, generating a dynamic summary is much more challenging, since there must be coherence among the shots composing the video skim.
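The clustering step behind a dynamic summary can be sketched as follows: a small k-means groups shot-level feature vectors, and the shot nearest each centroid is taken as the cluster representative. The toy feature vectors, the deterministic initialisation and the nearest-to-centroid selection rule are simplifying assumptions for illustration, not a prescription from the literature reviewed here.

```python
import numpy as np

def summarize(shot_features, k, iters=20):
    """Cluster shot feature vectors with a tiny k-means and return the
    index of the shot nearest each centroid; concatenating those shots
    in temporal order would form the video skim."""
    X = np.asarray(shot_features, dtype=float)
    centroids = X[:k].copy()                   # deterministic init: first k shots
    for _ in range(iters):
        # assign every shot to its nearest centroid
        labels = np.argmin(
            np.linalg.norm(X[:, None] - centroids[None], axis=2), axis=1)
        for c in range(k):
            if np.any(labels == c):            # skip empty clusters
                centroids[c] = X[labels == c].mean(axis=0)
    reps = []
    for c in range(k):
        members = np.where(labels == c)[0]
        dists = np.linalg.norm(X[members] - centroids[c], axis=1)
        reps.append(int(members[np.argmin(dists)]))
    return sorted(reps)                        # representatives in temporal order

# Two well-separated groups of shot features: one representative each.
shots = [[0, 0], [1, 0], [0, 1], [10, 10], [11, 10]]
print(summarize(shots, k=2))
```

A weighted variant would replace the nearest-to-centroid rule with content-based shot weights, picking the most significant shot per cluster, as described above.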
In light of the foregoing background, a proper framework needs to be defined which caters to the latest developments in the field of CBVR. A focused approach is also essential, one which encompasses and enumerates the research carried out in CBVR, especially with respect to techniques using soft computing or hybrid soft computing. This forms the major motivation behind this work. The contributions of this article are threefold. Firstly, it enumerates the various techniques developed in the field of video segmentation (both temporal and spatial) and categorizes them into classical, soft computing and hybrid soft computing based approaches; as video segmentation is a prerequisite for any form of video analysis, this sub-domain of CBVR is focused upon. Secondly, the advances in techniques of CBVR (including CBIR) are classified into conventional, soft computing and hybrid soft computing based approaches in order to help researchers understand the approaches developed in each category. Finally, a snapshot of the hybrid approaches is presented in the form of a table highlighting the merits and demerits of each method.
The remainder of the paper is organized as follows. Section 2 deals with the various soft computing paradigms used for content-based video retrieval. The major approaches to video segmentation are detailed in Section 3, which reviews the prominent methods for temporal and spatial video segmentation; for easy understanding, the methods are further categorized according to the approach used, i.e. classical, soft computing or hybrid soft computing. The same categorization is adopted in Section 4, where approaches to content-based video retrieval are enumerated and the major methods reviewed. Finally, future directions and concluding remarks are presented in Section 5.
Components of soft computing
Soft computing is best suited to real-life problems where the computation time required to ascertain an exact solution is very high. Approaches using soft computing can also tolerate imprecise values, uncertainty, partial truth and approximation, resulting in robust, tractable and low-cost solutions. Thus, to obtain a higher level of accuracy and minimize the number of computations involved, researchers have shifted from standard classical algorithms to soft computing based approaches.
Video segmentation
The task of video segmentation involves the decomposition of a video in the temporal, spatial or spatio-temporal domain. Tasks such as relevant shot retrieval, object recognition, event detection and video indexing depend on the efficacy with which such decomposition is performed; video segmentation thus forms the very basic step for efficient video analysis. The following subsections elaborate on methods of video decomposition developed using conventional, soft computing and hybrid soft computing approaches.
Content based video retrieval
Advances in data storage and image acquisition technologies have enabled the creation of large image datasets. In this scenario, it is necessary to develop appropriate information systems to efficiently manage these collections. The most common approaches use so-called content-based image retrieval (CBIR) systems, which try to retrieve images similar to a user-defined specification or pattern. Their goal is to support image retrieval based on content properties (e.g. shape, color and texture).
Future directions and conclusion
The hybrid approaches used for content-based video retrieval have attracted considerable attention among academicians, researchers and real-life users. Hybrid algorithms for CBVR have helped to reduce user interaction and manual annotation to a great extent, and the semantic gap between high-level and low-level features has been bridged substantially. The hybrid algorithms have also attempted to capture user perception. As a result, such systems have increased in robustness.
References (201)
- et al., Building the gist of a scene: the role of global image features in recognition, Prog. Brain Res. (2006)
- et al., Adaptive key frame extraction for video summarization using an aggregation mechanism, J. Visual Commun. Image Represent. (2012)
- Fuzzy sets, Inform. Control (1965)
- et al., Multilevel image segmentation with adaptive image context based thresholding, Appl. Soft Comput. (2011)
- Toward a perception-based theory of probabilistic reasoning with imprecise probabilities, J. Stat. Plann. Infer. (2002)
- et al., Video shot boundary detection: seven years of TRECVid activity, Comput. Vision Image Understand. (2010)
- et al., An adaptive video shot segmentation scheme based on dual-detection model, Neurocomputing (2013)
- et al., A fuzzy logic approach for detection of video shot boundaries, Pattern Recogn. (2006)
- et al., A fuzzy theoretic approach for video segmentation using syntactic features, Pattern Recogn. Lett. (2001)
- AVCD-FRA: a novel solution to automatic video cut detection using fuzzy-rule-based approach, Comput. Vision Image Understand. (2013)