Elsevier

Applied Soft Computing

Volume 46, September 2016, Pages 1008-1029
Review Article
Hybrid soft computing approaches to content based video retrieval: A brief review

https://doi.org/10.1016/j.asoc.2016.03.022

Abstract

There has been an unrestrained growth of videos on the Internet due to the proliferation of multimedia devices. These videos are mostly stored in unstructured repositories, which pose enormous challenges for both image and video retrieval. Users aim to retrieve videos whose content is relevant to their needs. Traditionally, low-level visual features have been used for content based video retrieval (CBVR). Consequently, a gap existed between these low-level features and the high-level semantic content. This semantic gap was partially bridged by the proliferation of research on interest point detectors and descriptors, which represent mid-level features of the content. The computational time and human interaction involved in the classical approaches to CBVR are quite cumbersome. In order to increase the accuracy, efficiency and effectiveness of the retrieval process, researchers resorted to soft computing paradigms. The retrieval task was automated to a great extent using individual soft computing components. Owing to the voluminous growth in the size of multimedia databases, augmented by an exponential rise in the number of users, the integration of two or more soft computing techniques became desirable for enhanced efficiency and accuracy of the retrieval process. Hybrid approaches serve to enhance the overall performance and robustness of the system with reduced human interference. This article focuses on the relevant hybrid soft computing techniques in practice for content-based image and video retrieval.

Introduction

With the immense popularity of video recorders and the falling cost of digital storage devices, voluminous amounts of video are uploaded at an incredible rate for online browsing and retrieval. Since video encompasses the other three media types, i.e. text, image and audio, combining them into a single data stream for transmission and retrieval has drawn considerable attention from researchers and application developers. Image/video retrieval has been an active research domain for the last four decades.

In comparison to images, videos possess characteristic features. Although the organization of a video is not well defined, its content is richer than that of individual images. Archiving and indexing images and videos based on semantic content is a challenging task, and subsequent manual retrieval and browsing of video data is laborious and time intensive from the user's perspective. During the retrieval process, the primary objective is to present the videos of interest to the user. This broad domain of research is referred to as content-based video retrieval (CBVR). Since a video is a conglomeration of time-sequenced images, research in the field of CBVR has been supplemented to a great extent by advances in content-based image retrieval (CBIR). In some approaches, CBVR applications have been augmented with audio cues. As such, approaches to audio indexing, retrieval and classification [1] are important for advances in CBVR.

The term “content-based” refers to features like color, texture, intensity, trajectory of objects or statistical characteristics at the low-level. Mid-level features refer to feature points taken on an image or a series of images as in a video. These feature points are detected by algorithms for feature point detection and description such as SIFT [2], SURF [3], BRISK [4], DAISY [5], GIST [6], ORB [7], etc. The detected feature points in images are matched in order to compute the amount of similarity. Mid-level features are incapable of evaluating the semantic content in a video. This drawback is alleviated using high-level features such as edges, shapes, motion vector, optical flow, event modeling (in videos), timbre, rhythm (in audio), etc. involving different levels of semantics. High-level features are capable of handling semantic queries like “retrieve videos where blue sky and snowy mountains are present”. These queries require matching the semantic content in the video database. The major hurdle behind processing high-level queries is the semantic gap that exists between the high-level features and low-level ones.
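The matching of detected feature points can be illustrated with a nearest-neighbour search plus a ratio test, a common way to discard ambiguous correspondences. The minimal numpy sketch below uses synthetic 32-dimensional descriptors as stand-ins for real SIFT/SURF output; the function name and the 0.75 ratio are illustrative assumptions, not taken from any cited method.

```python
import numpy as np

def match_descriptors(desc_a, desc_b, ratio=0.75):
    """Match each descriptor in desc_a to its nearest neighbour in desc_b,
    keeping only matches that pass the ratio test (nearest distance must be
    clearly smaller than the second-nearest)."""
    matches = []
    for i, d in enumerate(desc_a):
        dists = np.linalg.norm(desc_b - d, axis=1)   # Euclidean distance to every candidate
        nearest, second = np.argsort(dists)[:2]
        if dists[nearest] < ratio * dists[second]:   # keep only unambiguous matches
            matches.append((i, int(nearest)))
    return matches

# Synthetic 32-d descriptors: desc_a holds noisy copies of the first
# ten rows of desc_b, so ten confident matches are expected.
rng = np.random.default_rng(0)
desc_b = rng.normal(size=(50, 32))
desc_a = desc_b[:10] + rng.normal(scale=0.01, size=(10, 32))
print(len(match_descriptors(desc_a, desc_b)))  # -> 10
```

The number of surviving matches then serves as a simple similarity score between the two images.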

The process of “retrieval” involves matching features extracted from the frames of the video with the query given by the user. The features are used to create a feature vector for each video, so that each video in the database is represented as a point in an n-dimensional space, where n is the number of features under consideration. It is pertinent to mention here that taking too many features increases the computational cost, since the dimensionality of the feature vector grows. Over time, researchers have adopted various means of reducing dimensionality. Principal component analysis (PCA) [8] is a widely used technique for reducing the number of features by mapping the set of original features (P) to another set (Q). The feature set Q contains features derived from the set P; importantly, the cardinality of Q is smaller than that of P. Researchers have also relied on rough sets for identifying features with high discriminative power, thereby reducing dimensionality by eliminating less important features. The features taken into consideration form the feature vector for each object (video) in the database. The distance between feature vectors acts as a measure of similarity between videos; commonly used distance measures are Euclidean, Mahalanobis, Hausdorff, Chi-square, etc. The end product of the retrieval process is a set of videos ranked according to relevance. In some systems, user relevance feedback is used to tune the system to produce more meaningful results, which also helps to model human perception better.
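The PCA mapping from P to Q and the subsequent distance-based ranking can be sketched as follows. This is a minimal numpy illustration on synthetic feature vectors; the function names are hypothetical and Euclidean distance stands in for any of the measures listed above.

```python
import numpy as np

def pca_reduce(features, k):
    """Map n-dimensional feature vectors (set P) to a k-dimensional
    derived set (Q) via principal component analysis."""
    centred = features - features.mean(axis=0)
    _, _, vt = np.linalg.svd(centred, full_matrices=False)  # rows of vt = principal axes
    return centred @ vt[:k].T

def rank_by_distance(db, query):
    """Rank database items by Euclidean distance to the query vector
    (smallest distance first, i.e. most similar first)."""
    return np.argsort(np.linalg.norm(db - query, axis=1))

rng = np.random.default_rng(1)
feats = rng.normal(size=(100, 20))      # 100 videos, 20 raw features each
reduced = pca_reduce(feats, 5)          # |Q| = 5 < |P| = 20
ranking = rank_by_distance(reduced, reduced[0])
print(ranking[0])                       # the query is its own best match -> 0
```

Swapping the distance function (e.g. for Mahalanobis or Chi-square) changes the similarity model without touching the rest of the pipeline.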

For better understanding, an analogy may be drawn between a document and a video sequence (refer Fig. 1). A document consists of paragraphs, much like a video is composed of scenes. A paragraph in turn represents a group of inter-related sentences, similar to the inter-related shots forming a scene in a video. Further down the hierarchy, sentences are composed of words, just as shots consist of frames. A frame in a video denotes a single image, whereas a shot is a consecutive sequence of frames recorded by a single camera. A scene represents a collection of semantically related and temporally adjacent shots, portraying and imparting a high-level story. A group of scenes comprises a sequence/story. Frames and shots are considered low-level temporal units of a video, whereas scenes and the sequence/story form the high-level units which are close to human perception. Shots act as the basic indexing elements for a video, since there is no perceivable change in content within a shot. A given video may be decomposed into its constituent units either in the temporal domain (scenes, shots or frames) or the spatial domain (objects, color segments or pixels). Decomposing a video in the temporal domain involves a hierarchical disintegration of the video into scenes, shots or frames according to need. In order to analyze the semantic content, a spatial analysis of the frame contents is required. The initial step for any video analysis application is to decompose the video into its constituent shots; in the literature, this has been referred to as shot boundary analysis or temporal video segmentation. As the visual content of the frames in a shot is similar, spatial segmentation is carried out in the next step to derive the semantic content in the form of the objects composing the shot and the inter-relationships among them. Video segmentation methods thus provide a means to harness the structural primitives of videos for various applications related to content-based video retrieval.
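A minimal sketch of temporal segmentation along these lines compares grey-level histograms of consecutive frames and declares a cut where they differ sharply. This is one of many possible cut detectors, assumed here purely for illustration (the bin count and threshold are arbitrary choices, not values from a cited method).

```python
import numpy as np

def detect_cuts(frames, bins=16, threshold=0.5):
    """Declare a shot boundary wherever the L1 distance between the
    normalised grey-level histograms of consecutive frames exceeds
    the threshold. Returns the indices of frames that start a new shot."""
    hists = [np.histogram(f, bins=bins, range=(0, 256))[0] / f.size
             for f in frames]
    return [i for i in range(1, len(hists))
            if np.abs(hists[i] - hists[i - 1]).sum() > threshold]

# Two synthetic shots: five uniformly dark frames followed by five bright ones.
shot1 = [np.full((48, 64), 40, dtype=np.uint8) for _ in range(5)]
shot2 = [np.full((48, 64), 200, dtype=np.uint8) for _ in range(5)]
print(detect_cuts(shot1 + shot2))  # frame 5 starts a new shot -> [5]
```

Histogram comparison of this kind handles abrupt cuts well; gradual transitions such as fades and dissolves are precisely where the soft computing approaches reviewed later become attractive.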

The framework of a content-based video indexing, retrieval and summarization system is presented in Fig. 2. The input video may be segmented either at the semantic level (leading to scene detection) or with respect to the timeline (leading to shot boundary detection). High-level feature extraction leads to video mining, video classification and other related tasks. High-level features are also necessary for the semantic indexing of databases required for content-based video retrieval. The query video given by the user goes through pre-processing steps such as segmentation and feature extraction before its similarity [9] to the videos in the database is measured. Indexing, browsing and retrieval were quite manageable previously, when the number of videos was small and keywords were mainly used as meta-data for retrieval. However, with the exponential rise in videos available online, tools and applications for automatic indexing and retrieval of videos based on semantic content have become a necessity.

The task of video summarization may be categorized into static and dynamic summarization. In a static summary [10], key-frames are chosen from each shot. Redundancy reduction is performed on the representative set to eliminate visually duplicate frames having similar content; low-level or mid-level features are used for this purpose. The use of key-frames has not only proven helpful for video indexing and summarization applications, but is also a useful means of maintaining video logs. In video summarization, the extraction of key-frames [11], [12], [13] from the original video sequence not only provides a visual abstract to the user, but also acts as a means for faster content browsing. For producing a dynamic summary [14], features extracted from the shots are used for clustering [15]; representative shots from each cluster, coalesced together, form the dynamic summary. Alternatively, the shots may be given weights according to the content embedded within them, and the most significant shots chosen for generating the summary. The user may or may not specify the duration of the summary required. Needless to say, generating a dynamic summary is much more challenging, since there must be coherence among the shots composing the video skim.
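The static-summary pipeline described above (one key-frame per shot, followed by redundancy reduction) can be sketched on synthetic shot features. The centroid-based key-frame choice and the duplicate-distance threshold are illustrative assumptions, not a method from a specific paper.

```python
import numpy as np

def key_frame(shot_features):
    """Index of the frame whose feature vector is closest to the shot centroid."""
    centroid = shot_features.mean(axis=0)
    return int(np.argmin(np.linalg.norm(shot_features - centroid, axis=1)))

def static_summary(shots, dup_threshold=1.0):
    """One representative frame per shot, skipping key-frames that are
    visual duplicates of an already-kept representative."""
    summary = []
    for feats in shots:
        rep = feats[key_frame(feats)]
        if all(np.linalg.norm(rep - kept) > dup_threshold for kept in summary):
            summary.append(rep)
    return summary

# Three synthetic shots (6 frames of 8-d features each); the second shot
# nearly duplicates the first, so only two key-frames should survive.
rng = np.random.default_rng(3)
base = rng.normal(size=8)
make_shot = lambda offset: np.stack(
    [base + offset + rng.normal(scale=0.05, size=8) for _ in range(6)])
shots = [make_shot(0.0), make_shot(0.0), make_shot(5.0)]
print(len(static_summary(shots)))  # duplicate shot dropped -> 2
```

A dynamic summary would instead cluster the shot features and concatenate weighted representative shots, with the added constraint of temporal coherence.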

In light of the foregoing background, a proper framework needs to be defined to cater to the latest developments in the field of CBVR. A focused approach is also essential to encompass and enumerate the research carried out in CBVR, especially with respect to advances in techniques using soft computing or hybrid soft computing. This forms the major motivation behind this work. The contributions of this article are manifold. Firstly, it enumerates the various techniques developed in the field of video segmentation (both temporal and spatial) and categorizes them into classical, soft computing and hybrid soft computing based approaches. As video segmentation is a prerequisite for any form of video analysis, this sub-domain of CBVR is focused upon. Secondly, the advances in techniques of CBVR (including CBIR) are classified into conventional, soft computing and hybrid soft computing based approaches, in order to help researchers understand the approaches developed in each category. Finally, a snapshot of the hybrid approaches is presented in the form of a table highlighting the merits and demerits of each method.

The remainder of the paper is organized as follows. Section 2 deals with the various soft computing paradigms used for content-based video retrieval. The major approaches to video segmentation are detailed in Section 3, which reviews the prominent methods related to temporal and spatial video segmentation; for easy understanding, the methods are further categorized according to the approach used, i.e. classical, soft computing or hybrid soft computing. The same categorization is adopted in Section 4, where approaches related to content-based video retrieval are enumerated and the major methods reviewed. Finally, future directions and concluding remarks are presented in Section 5.

Section snippets

Components of soft computing

Soft computing is best known for solving real-life problems where the computation time required for ascertaining the exact solution is very high. Approaches using soft computing can also tolerate imprecise values, uncertainty, partial truth and approximation, resulting in robust, tractable and low-cost solutions. Thus, to obtain a higher level of accuracy and minimize the number of computations involved, researchers have shifted from standard classical algorithms to the soft computing paradigm.

Video segmentation

The task of video segmentation involves the decomposition of a video in the temporal, spatial or spatio-temporal domain. Tasks such as relevant shot retrieval, object recognition, event detection and video indexing depend on the efficacy with which such decomposition is performed. As such, video segmentation forms the very basic step for efficient video analysis. The following subsections elaborate on methods of video decomposition developed using conventional, soft computing and hybrid soft computing approaches.

Content based video retrieval

Advances in data storage and image acquisition technologies have enabled the creation of large image datasets. In this scenario, it is necessary to develop appropriate information systems to efficiently manage these collections. The most common approaches use so-called content-based image retrieval (CBIR) systems, which try to retrieve images similar to a user-defined specification or pattern. Their goal is to support image retrieval based on content properties (e.g. shape, color, texture).

Future directions and conclusion

Hybrid approaches to content-based video retrieval have attracted considerable attention among academicians, researchers and real-life users. Hybrid algorithms for CBVR have helped to reduce user interaction and manual annotation to a great extent, while substantially bridging the semantic gap between high-level and low-level features. These algorithms have also attempted to capture user perception. The resulting systems have increased in robustness and overall performance, with reduced human interference.

References (201)

  • S. Bhattacharyya et al., High-speed target tracking by fuzzy hostility-induced segmentation of optical flow field, Appl. Soft Comput. (2009)
  • V. Chasanis et al., Simultaneous detection of abrupt cuts and dissolves in videos using support vector machines, Pattern Recogn. Lett. (2009)
  • J. Cao et al., A robust shot transition detection method based on support vector machine in compressed domain, Pattern Recogn. Lett. (2007)
  • M.-H. Lee et al., Video scene change detection using neural network: improved ART2, Expert Syst. Appl. (2006)
  • Y. Cheng, The incremental method for fast computing the rough fuzzy approximations, Data Knowl. Eng. (2011)
  • M.S. Allili et al., Object tracking in videos using adaptive mixture models and active contours, Neurocomputing (2008)
  • Q. Zhou et al., Object tracking in an outdoor environment using fusion of features and cameras, Image Vision Comput. (2006)
  • E. Wold et al., Content-based classification, search, and retrieval of audio, IEEE Trans. Multimedia (1996)
  • D.G. Lowe, Object recognition from local scale-invariant features
  • H. Bay et al., SURF: speeded-up robust features
  • S. Leutenegger et al., BRISK: binary robust invariant scalable keypoints
  • E. Tola et al., DAISY: an efficient dense descriptor applied to wide-baseline stereo, IEEE Trans. Pattern Anal. Mach. Intell. (2010)
  • E. Rublee et al., ORB: an efficient alternative to SIFT or SURF
  • I. Jolliffe, Principal Component Analysis (2002)
  • D. Van der Weken et al., An overview of similarity measures for images
  • B.T. Truong et al., Video abstraction: a systematic review and classification, ACM Trans. Multimedia Comput. Commun. Appl. (2007)
  • H. Bhaumik et al., Enhancement of perceptual quality in static video summarization using minimal spanning tree approach
  • P. Kelm et al., Feature-based video key frame extraction for low quality video sequences
  • J. Nam et al., Dynamic video summarization and visualization
  • W. Tavanapong et al., Shot clustering techniques for story browsing, IEEE Trans. Multimedia (2004)
  • S.K. Das et al., On soft computing techniques in various areas, Comput. Sci. Inform. Technol. (2013)
  • T.J. Ross et al., Fuzzy Logic with Engineering Applications (1995)
  • L.A. Zadeh, Fuzzy logic, neural networks, and soft computing, Commun. ACM (1994)
  • D.E. Goldberg (1989)
  • L. Davis, Handbook of Genetic Algorithms (1991)
  • S. De et al., Efficient gray level image segmentation using an optimized MUSIG (OptiMUSIG) activation function, Int. J. Parallel Emerg. Distrib. Syst. (2010)
  • C. Cortes et al., Support-vector networks, Mach. Learn. (1995)
  • M. Dorigo, Optimization, learning and natural algorithms (Ph.D. Thesis) (1992)
  • N.M. AL-Salami, System evolving using ant colony optimization algorithm, J. Comput. Sci. (2009)
  • M. Dorigo et al., Ant colony system: a cooperative learning approach to the traveling salesman problem, IEEE Trans. Evol. Comput. (1997)
  • M. Younes et al., Economic power dispatch using an ant colony optimization method
  • J. Kennedy et al., Particle swarm optimization
  • S. Bhattacharyya et al., Soft Computing for Image and Multimedia Data Processing (2013)
  • Z. Pawlak, Rough sets, Int. J. Comput. Inform. Sci. (1982)
  • C. Yeo et al., A framework for sub-window shot detection
  • G. Camara-Chavez et al., Shot boundary detection by a hierarchical supervised approach
  • H. Lu et al., Shot boundary detection using unsupervised clustering and hypothesis testing
  • M. Cooper et al., Video segmentation via temporal pattern classification, IEEE Trans. Multimedia (2007)
  • A.M. Ferman et al., Robust color histogram descriptors for video segment retrieval and identification, IEEE Trans. Image Process. (2002)
  • S.C. Hoi et al., Chinese University of Hong Kong at TRECVid 2006: shot boundary detection and video search