## About this book

This book constitutes the refereed proceedings of the 6th National Conference on Computer Vision, Pattern Recognition, Image Processing, and Graphics, NCVPRIPG 2017, held in Mandi, India, in December 2017.

The 48 revised full papers presented in this volume were carefully reviewed and selected from 147 submissions. The papers are organized in topical sections on video processing; image and signal processing; segmentation, retrieval, captioning; pattern recognition applications.

## Table of Contents

### Visual Odometry Based Omni-directional Hyperlapse

The prohibitive amount of time required to review the large volumes of data captured by surveillance and other cameras has brought into question the very utility of large-scale video logging. Yet one recognizes that such logging and analysis are indispensable to security applications. The only way out of this paradox is to devise expedited browsing through the creation of hyperlapse. We address the hyperlapse problem for the very challenging category of intensive egomotion, which makes the hyperlapse highly jerky. We propose an economical approach for trajectory estimation based on Visual Odometry and implement cost functions to penalize pose and path deviations. This is implemented on data taken by an omni-directional camera, so that the viewer can opt to observe any direction while browsing. This requires many innovations, including handling the massive radial distortions and implementing scene stabilization that must operate on the least distorted region of the omni view.

Prachi Rani, Arpit Jangid, Vinay P. Namboodiri, K. S. Venkatesh

### Classification of Human Actions Using 3-D Convolutional Neural Networks: A Hierarchical Approach

In this paper, we present a hierarchical approach for human action classification using 3-D convolutional neural networks (3-D CNNs). In general, human actions refer to the positioning and movement of hands and legs and hence can be classified according to whether they are performed by hands, by legs or, in some cases, by both. This is the intuition behind our work on hierarchical classification. In this work, we consider the actions as tasks performed by hand or leg movements. Therefore, instead of using a single 3-D CNN to classify the given actions, we use multiple networks to perform the classification hierarchically: we first perform binary classification to separate the hand and leg actions, and then use two separate networks, one for hand actions and one for leg actions, to classify among the target action categories. For example, for the KTH dataset, we train three networks to classify six different actions, comprising three actions each for hands and legs. The novelty of our approach lies in separating hand and leg actions first, so that the subsequent classifiers accept only the features corresponding to either hands or legs. This leads to better classification accuracy. Also, the use of 3-D CNNs enables automatic extraction of features in the spatial as well as the temporal domain, avoiding the need for hand-crafted features. This makes it one of the better approaches for video classification. We use the KTH, Weizmann and UCF-sports datasets to evaluate our method, and comparison with state-of-the-art methods shows that our approach outperforms most of them.

Shaival Thakkar, M. V. Joshi
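The two-stage routing described in this abstract can be sketched as follows. This is a minimal illustration under stated assumptions, not the authors' implementation: the three classifiers are stand-in callables (in the paper they are 3-D CNNs), and the names `classify_hierarchically`, `limb_stub`, and `group_stub` are hypothetical. The six KTH actions are grouped into hand and leg actions as the abstract describes.

```python
from typing import Callable, Sequence

# KTH actions grouped as in the abstract: three hand actions, three leg actions.
HAND_ACTIONS = ("boxing", "handclapping", "handwaving")
LEG_ACTIONS = ("jogging", "running", "walking")

Classifier = Callable[[Sequence[float]], int]


def classify_hierarchically(clip: Sequence[float], limb_net: Classifier,
                            hand_net: Classifier, leg_net: Classifier) -> str:
    """Route a clip through a binary classifier, then a limb-specific one."""
    # Stage 1: binary decision, 0 = hand action, 1 = leg action.
    if limb_net(clip) == 0:
        # Stage 2a: choose among the hand actions only.
        return HAND_ACTIONS[hand_net(clip)]
    # Stage 2b: choose among the leg actions only.
    return LEG_ACTIONS[leg_net(clip)]


# Toy stand-ins for trained networks (assumptions for illustration only).
# Each "clip" is a made-up feature vector [leg_energy, index_within_group].
def limb_stub(clip: Sequence[float]) -> int:
    return 1 if clip[0] > 0.5 else 0


def group_stub(clip: Sequence[float]) -> int:
    return int(clip[1])


print(classify_hierarchically([0.9, 2], limb_stub, group_stub, group_stub))
print(classify_hierarchically([0.1, 0], limb_stub, group_stub, group_stub))
```

Because each second-stage network only ever sees clips from one limb group, it has three classes to separate rather than six, which is the effect the abstract credits for the accuracy gain.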

### SmartTennisTV: Automatic Indexing of Tennis Videos

In this paper, we demonstrate a score-based indexing approach for tennis videos. Given a broadcast tennis video (btv), we index all the video segments with their scores to create a navigable and searchable match. Our approach temporally segments the rallies in the video and then recognizes the scores from each of the segments, before refining the scores using knowledge of the tennis scoring system. Finally, we build an interface to effortlessly retrieve and view the relevant video segments by automatically tagging the segmented rallies with human-accessible tags such as ‘fault’ and ‘deuce’. The efficiency of our approach is demonstrated on btvs from two major tennis tournaments.

Anurag Ghosh, C. V. Jawahar

### Flow-Free Video Object Segmentation

Segmenting a foreground object from a video is a challenging task because of large deformations of objects, occlusions, and background clutter. In this paper, we propose a frame-by-frame but computationally efficient approach for video object segmentation by clustering visually similar generic object segments throughout the video. Our algorithm segments object instances appearing in the video and then performs clustering in order to group visually similar segments into one cluster. Since the object that needs to be segmented appears in most of the video, we can retrieve the foreground segments from the cluster with the largest number of segments. We then apply a track-and-fill approach to localize the object in the frames where the object segmentation framework fails to segment any object. Our algorithm performs comparably to recent automatic methods for video object segmentation when benchmarked on the DAVIS dataset, while being computationally much faster.

### SSIM-Based Joint Bit-Allocation Using Frame Model Parameters for 3D Video Coding

Optimum bit-allocation between texture video and depth map in 3D video results in better virtual view quality. To achieve this, rate distortion optimization (RDO) is used. RDO in 3D video implies minimization of the synthesis distortion at the available rate. Several bit-allocation methods proposed in the literature have not considered perceptual quality improvement. In this paper, we propose a bit-allocation criterion that results in better visual quality of the synthesized view. To achieve this, visual quality metrics must be incorporated, and the structural similarity (SSIM) index is one of the metrics that measure perceived quality. As SSIM gives a similarity measure, we use dSSIM as the distortion metric in mode decision and motion estimation instead of traditional metrics such as mean square error (MSE) or sum of squared error (SSE). Synthesis distortion is modeled using dSSIM, and joint bit-allocation is formulated as an optimization problem that is solved using the Lagrange multiplier method. Model parameters are determined at frame level for more accurate calculation of quantization parameters. BD-Rate evaluation shows a reduction in bit rate with improved SSIM.

Y. Harshalatha, Prabir Kumar Biswas
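The constrained problem this abstract describes has the generic shape below. This is only a sketch of the standard Lagrangian formulation, not the authors' model: the symbols $R_t$ (texture rate), $R_d$ (depth rate), $R_c$ (total rate budget), $D_v$ (synthesized-view distortion measured with dSSIM) and $\lambda$ are notation assumed here, and the abstract does not give the exact definition of dSSIM or the form of the distortion model.

```latex
\min_{R_t,\,R_d} \; D_v(R_t, R_d)
\quad \text{subject to} \quad R_t + R_d \le R_c

J(R_t, R_d, \lambda) = D_v(R_t, R_d) + \lambda \,\bigl(R_t + R_d - R_c\bigr)

\frac{\partial J}{\partial R_t} = 0, \qquad
\frac{\partial J}{\partial R_d} = 0
\;\;\Longrightarrow\;\;
\frac{\partial D_v}{\partial R_t} = \frac{\partial D_v}{\partial R_d} = -\lambda
```

At the optimum the marginal reduction in synthesis distortion per bit is the same for texture and depth, which is what fixes the split of the budget between the two streams.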

### Trajectory Based Integrated Features for Action Classification from Depth Data

We present an approach for Human Action Recognition based on the amalgamation of features from depth maps and body-joint data. This integrated feature set consists of depth features based on gradient orientation and motion energy, in addition to features from 3D skeleton data capturing its statistical details. Feature selection is carried out to extract a relevant set of features for action recognition. The resultant set of features is evaluated using an SVM classifier. We validate our proposed method on benchmark datasets for Action Recognition such as the MSR-Daily Activity and UT-Kinect datasets.

Parul Shukla, Noopur Arora, Kanad K. Biswas

### Anomaly from Motion: Unsupervised Extraction of Visual Irregularity via Motion Prediction

The automatic extraction of anomalous events from a given video has been researched since the early days of computer vision. It has still not been fully solved, which shows that it is far from trivial. The challenges involved include the lack of a proper definition and varying scene structure and objects of interest in the scene, to name just a few. In this paper we propose a novel method to extract outliers from motion alone. We employ a stacked LSTM encoder-decoder structure to model the regular motion patterns of the given video sequence. The discrepancy between the motion predicted by the model and the motion actually observed in the scene is analyzed to detect anomalous activities. We perform extensive experimentation on the benchmark datasets of crowd anomaly analysis. We report state-of-the-art results across all the datasets.

Avishek Majumder, R. Venkatesh Babu, Anirban Chakraborty
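The core idea of flagging frames where predicted and observed motion disagree can be illustrated in a few lines. This sketch assumes the predictions are already available (in the paper they come from a stacked LSTM encoder-decoder, which is not reproduced here). The Euclidean error, the fixed `threshold`, and the function names are assumptions made for illustration.

```python
import math


def anomaly_scores(predicted, observed):
    """Per-frame Euclidean distance between predicted and observed motion."""
    return [math.dist(p, o) for p, o in zip(predicted, observed)]


def flag_anomalies(scores, threshold):
    """Indices of frames whose prediction error exceeds the threshold."""
    return [i for i, s in enumerate(scores) if s > threshold]


# Hypothetical 2-D motion vectors for four frames.
predicted = [(1.0, 0.0), (1.0, 0.1), (1.0, 0.0), (0.9, 0.0)]
observed = [(1.0, 0.0), (1.1, 0.1), (4.0, 4.0), (0.9, 0.1)]

scores = anomaly_scores(predicted, observed)
print(flag_anomalies(scores, threshold=1.0))
```

Only the third frame, where the observed motion departs sharply from the prediction, exceeds the threshold. Since the model is trained on regular motion only, no labels for anomalies are needed.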

### Recognizing Human Activities in Videos Using Improved Dense Trajectories over LSTM

We propose a deep-learning-based technique to classify actions using Long Short Term Memory (LSTM) networks. The proposed scheme first learns spatio-temporal features from the video, using an extension of Convolutional Neural Networks (CNNs) to 3D. A Recurrent Neural Network (RNN) is then trained to classify each sequence, considering the temporal evolution of the learned features at each time step. Experimental results on the CMU MoCap, UCF 101 and Hollywood 2 datasets show the efficacy of the proposed approach. We extend the proposed framework with an efficient motion feature to handle significant camera motion. The proposed approach outperforms the existing deep models on each dataset.

Krit Karan Singh, Snehasis Mukherjee

### Saliency Driven Video Motion Magnification

The main goal of the proposed work is to detect certain spatial and temporal changes in videos that are not visible to the human eye and magnify them to make them perceptible, while making sure that the background noise is not amplified. We apply Eulerian motion magnification to only the salient area of each frame of the video. The salient object is processed independently of the rest of the image using alpha matting aided by scribbles. We demonstrate the need to isolate the salient object from background motions and propose a simple and efficient way to do so. The proposed algorithm is tested on videos with imperceptible motion along with background motion to illustrate the significance of the proposed method. We compare the proposed method with linear and phase-based Eulerian motion magnification techniques.

Manisha Verma, Ramyani Ghosh, Shanmuganathan Raman

### Detecting Missed and Anomalous Action Segments Using Approximate String Matching Algorithm

As amateur performers, we forget action steps and perform unwanted movements during our daily exercise routines, dance performances, etc. To improve our proficiency, it is important that we get feedback on our performances in terms of where we went wrong. In this paper, we propose a framework for analyzing and issuing reports of action segments that were missed or anomalously performed. This involves comparing the performed sequence with the standard action sequence and notifying when misalignments occur. We propose an exemplar-based Approximate String Matching (ASM) technique for detecting such anomalous and missing segments in action sequences. We compare the results with those obtained from the conventional Dynamic Time Warping (DTW) algorithm for sequence alignment. Alignment of the action sequences under conventional DTW is seen to fail in the presence of missed and anomalous action segments because of its boundary condition constraints. The performance of the two techniques has been tested on a complex aperiodic human action dataset of warm-up exercise sequences that we developed from correct and incorrect executions by multiple people. The proposed ASM technique shows promising alignment and missed/anomalous notification results on this dataset.

Hiteshi Jain, Gaurav Harit
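To make the idea of reporting missed and anomalous segments concrete, the sketch below aligns a performed sequence against a reference with plain Levenshtein edit distance and reads the report off the backtrace. This is a swapped-in textbook technique, not the authors' exemplar-based ASM: it assumes the video has already been converted to a sequence of discrete action labels, and the function name and example labels are hypothetical.

```python
def align_actions(reference, performed):
    """Levenshtein alignment; returns (missed, anomalous) segment lists."""
    n, m = len(reference), len(performed)
    # cost[i][j] = edit distance between reference[:i] and performed[:j].
    cost = [[0] * (m + 1) for _ in range(n + 1)]
    for i in range(1, n + 1):
        cost[i][0] = i
    for j in range(1, m + 1):
        cost[0][j] = j
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            sub = 0 if reference[i - 1] == performed[j - 1] else 1
            cost[i][j] = min(cost[i - 1][j - 1] + sub,  # match or substitute
                             cost[i - 1][j] + 1,        # reference step missed
                             cost[i][j - 1] + 1)        # extra performed step

    missed, anomalous = [], []
    i, j = n, m
    # Trace back through the table to recover which steps were skipped or extra.
    while i > 0 or j > 0:
        if (i > 0 and j > 0 and reference[i - 1] == performed[j - 1]
                and cost[i][j] == cost[i - 1][j - 1]):
            i, j = i - 1, j - 1
        elif i > 0 and cost[i][j] == cost[i - 1][j] + 1:
            missed.append(reference[i - 1])
            i -= 1
        elif j > 0 and cost[i][j] == cost[i][j - 1] + 1:
            anomalous.append(performed[j - 1])
            j -= 1
        else:
            # Substitution: a reference step was replaced by another movement.
            missed.append(reference[i - 1])
            anomalous.append(performed[j - 1])
            i, j = i - 1, j - 1
    return missed[::-1], anomalous[::-1]


reference = ["stretch", "squat", "lunge", "jump"]
performed = ["stretch", "lunge", "jump", "wave"]
print(align_actions(reference, performed))
```

Unlike DTW, which must map every element of one sequence onto the other and pins the first and last elements together, an edit-distance alignment can leave a reference step unmatched or a performed step unexplained. That is why it can report the skipped `squat` and the extra `wave` here.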

### Parametric Reshaping of Humans in Videos Incorporating Motion Retargeting

We propose a system capable of changing the shape of humans in monocular video sequences. Initially, a 3D model is fit over each frame of the video sequence in a spatio-temporally coherent manner, using the feature points provided by the user in a semi-automatic interface and the silhouette correspondences obtained from background subtraction. A 3D morphable model learned from laser scans of different human subjects is used to generate a model with the shape parameters specified by the user, such as height, weight and leg length. The deformed model is then retargeted to transfer the semantics of the motion, such as the step size of the person. This retargeted model is used to perform a body-aware warping of the foreground of each frame. Finally, the warped foreground is composited over the inpainted background. Spatio-temporal consistency is achieved through the combination of automatic pose fitting and body-aware frame warping. Motion retargeting makes the system produce visually pleasing and natural results; for example, the motion of a taller human is larger than that of the human before warping. We demonstrate the results of shape changes on different subjects with a variety of actions.

Suresh Prakash, Prem Kalra

### Enhanced Aggregated Channel Features Detector for Pedestrian Detection Using Parameter Optimisation and Deep Features

Aggregated Channel Features (ACF), proposed by Dollar et al., provide a strong framework for pedestrian detection. Many variants of the ACF detector have achieved state-of-the-art results using deep features along with aggregated channel features. In this paper we propose a hybrid method for pedestrian detection that uses a parameter-optimized variant of the ACF detector with decorrelated channels as region proposer, followed by a deep CNN for feature extraction. Our proposed method effectively handles the issues of false positives and detection of small instances of pedestrians. On the Caltech dataset, the proposed detector gives the best result among the different variants of the ACF detector, with the best localization, and is second only to the best-performing detector available to date.

Blossom Treesa Bastian, C. V. Jiji

### Unsupervised Segmentation of Speech Signals Using Kernel-Gram Matrices

The objective of this paper is to develop an unsupervised method for segmentation of speech signals into phoneme-like units. The proposed algorithm is based on the observation that feature vectors from the same segment exhibit a higher degree of similarity than feature vectors across segments. The kernel-Gram matrix of an utterance is formed by computing the similarity between every pair of feature vectors in the Gaussian kernel space. The kernel-Gram matrix consists of square patches along the principal diagonal, corresponding to different phoneme-like segments in the speech signal. The method detects the number of segments, as well as their boundaries, automatically. The proposed approach does not assume any information about input utterances, such as the exact distribution of segment lengths or the correct number of segments in an utterance. The proposed method outperforms state-of-the-art blind segmentation algorithms on the Zero Resource 2015 databases and the TIMIT database.

Saurabhchand Bhati, Shekhar Nayak, K. Sri Rama Murty
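The block structure described in this abstract is easy to see on toy data. The sketch below builds a Gaussian kernel-Gram matrix for a handful of made-up 2-D feature vectors. The bandwidth `sigma`, the function name, and the data are assumptions for illustration, and the boundary-detection step of the paper is not reproduced.

```python
import math


def gaussian_gram(features, sigma=1.0):
    """Kernel-Gram matrix K[i][j] = exp(-||x_i - x_j||^2 / (2 * sigma^2))."""
    def kernel(x, y):
        squared_distance = sum((a - b) ** 2 for a, b in zip(x, y))
        return math.exp(-squared_distance / (2.0 * sigma ** 2))
    return [[kernel(x, y) for y in features] for x in features]


# Hypothetical feature vectors: three frames from one segment,
# followed by three frames from a different segment.
frames = [(0.0, 0.0), (0.1, 0.0), (0.0, 0.1),
          (5.0, 5.0), (5.1, 5.0), (5.0, 5.1)]

K = gaussian_gram(frames)
for row in K:
    print(" ".join(f"{value:.2f}" for value in row))
```

The printed matrix shows two 3×3 patches of values near 1 along the principal diagonal and values near 0 elsewhere. The point where one patch ends and the next begins marks the segment boundary.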

### Design of Biorthogonal Wavelet Filters of DTCWT Using Factorization of Halfband Polynomials

In this paper, we propose a new approach for designing the biorthogonal wavelet filters (BWFs) of the Dual-Tree Complex Wavelet Transform (DTCWT). The proposed approach provides an effective way to control the frequency response characteristics of these filters. This is done by optimizing the free variables obtained from factorization of a generalized halfband polynomial (GHBP). The filters designed using the proposed approach have better frequency response characteristics than those obtained with the binomial spectral factorization approach. Their associated wavelets also show improved analyticity in terms of qualitative and quantitative measures. Transform-based image denoising using the proposed filters shows better visual as well as quantitative performance.

Shrishail S. Gajbhar, Manjunath V. Joshi

### Single Noisy Image Super Resolution by Minimizing Nuclear Norm in Virtual Sparse Domain

Super-resolving a noisy image is a challenging problem and needs special care compared to conventional super-resolution approaches when the power of the noise is unknown. For this scenario, we propose an approach to super-resolve a single noisy image by minimizing the nuclear norm in a virtual sparse domain that tunes with the power of the noise via parameter learning. The approach minimizes the nuclear norm to explore the inherent low-rank structure of visual data, and is further augmented with coarse-to-fine information by adaptively re-aligning the data along the principal components of a dictionary in the virtual sparse domain. The experimental results demonstrate the robustness of our approach across different powers of noise.

Srimanta Mandal, A. N. Rajagopalan

### Near Real-Time Correction of Specular Reflections in Flash Images Using No-Flash Image Prior

In insufficient indoor light conditions, when images of paintings, documents and objects with glossy surfaces are captured using flash, bright, annoying specularities appear in the image, which not only degrade the aesthetic quality but also lead to loss of useful information. In this paper, we address the problem of specular reflections in images of such scenes captured using flash. We propose a novel specular reflection detection algorithm which uses a flash/no-flash image pair to accurately detect specular reflections in the flash image while ignoring the inherently bright regions. The detected specular reflections are seamlessly recovered using the Poisson image editing technique. Quantitative as well as qualitative comparison of the proposed detection method on our flash/no-flash image dataset shows that it significantly outperforms other prominent methods in the literature. We also implement our solution on an Android smartphone to demonstrate its effectiveness in real-life scenarios.

Saikat Kumar Das, Kunal Swami, Gaurav Khandelwal, Prashanth Rao Thakkalapally

### A Method for Detecting JPEG Anti-forensics

In this paper, a new approach is proposed for the detection of JPEG anti-forensic operations. It is based on the fact that when a JPEG anti-forensic operation is applied, the values of the DCT coefficients are changed. This change decreases, especially in high-frequency subbands, if the anti-forensic operation is applied again. Hence, we propose to calculate a normalized difference between the absolute values of the DCT coefficients in 28 high-frequency AC subbands of the test image and its anti-forensically modified version. Based on this normalized feature, it is possible to differentiate between uncompressed and anti-forensically modified images. Experimental results show the effectiveness of the proposed method.

Dinesh Bhardwaj, Chothmal Kumawat, Vinod Pankajakshan

### An End-to-End Deep Learning Framework for Super-Resolution Based Inpainting

Image inpainting is an extremely challenging and open problem for the computer vision community. Motivated by recent advances in deep learning algorithms for computer vision applications, we propose a new end-to-end deep-learning-based framework for image inpainting. First, the images are down-sampled, which reduces the target area of inpainting and therefore enables better filling of the target region. A down-sampled image is inpainted using a trained deep convolutional auto-encoder (CAE). A coupled deep convolutional auto-encoder (CDCA) is also trained for natural image super-resolution. The pre-trained weights from both of these networks serve as initial weights for an end-to-end framework during the fine-tuning phase. Hence, the network is jointly optimized for both tasks while maintaining the local structure/information. We tested the proposed framework on various existing image inpainting datasets, and it outperforms existing natural image blind inpainting algorithms. Our proposed framework also works well for noise-resilient super-resolution after fine-tuning on a noise-free super-resolution dataset. It provides more visually plausible and better resulting images in comparison with other conventional and state-of-the-art noise-resilient super-resolution algorithms.

Manoj Sharma, Rudrabha Mukhopadhyay, Santanu Chaudhury, Brejesh Lall

### Saliency Map Improvement Using Edge-Aware Filtering

Content-aware applications in computational photography define the relative importance of objects or actions present in an image using a saliency map. Most saliency detection algorithms learn from the human visual system and try to find relatively important content as salient region(s). This paper attempts to improve the saliency map defined by these algorithms using an iterative process. The saliency map of an image generated by an existing saliency detection algorithm is modified by filtering the image after segmenting it into foreground and background. To enhance the saliency map values in the salient region, the background is filtered using an edge-aware guided filter and the foreground is enhanced using a local Laplacian filter. The number of iterations required varies according to the image content. We show that the proposed framework enhances the saliency maps generated by state-of-the-art saliency detection algorithms both qualitatively and quantitatively.

Diptiben Patel, Shanmuganathan Raman

### A Generative Adversarial Network for Tone Mapping HDR Images

A tone mapping operator converts High Dynamic Range (HDR) images to Low Dynamic Range (LDR) images, which can be seen on LDR displays. A lot of research has been done towards an optimal Tone Mapping Operator that maximizes the Tone Mapping Quality Index (TMQI). However, since all the methods approximate the Human Vision System in one way or another, none of them works for every type of image. We propose a novel generative adversarial network to learn a combination of these tone mapping operators. To get pixel-level accuracy, we use residual connections between same-sized network layers. We compare this method with some of the existing tone mapping operators and observe that our method generates images with comparably high TMQI and indeed works on many different types of images. Because of the residual connections, the network can be scaled to very high-dimensional images.

Vaibhav Amit Patel, Purvik Shah, Shanmuganathan Raman

### Efficient Clustering-Based Noise Covariance Estimation for Maximum Noise Fraction

Most hyperspectral images (HSI) have important spectral features in specific combinations of wave numbers or channels. Noise in these specific channels or bands can easily overwhelm these relevant spectral features. Maximum Noise Fraction (MNF) by Green et al. [1] has been extensively studied for noise removal in HSI data. The MNF transform maximizes the Signal to Noise Ratio (SNR) in feature space, thereby explicitly requiring an estimate of the HSI noise. We present two simple and efficient Noise Covariance Matrix (NCM) estimation methods as required for the MNF transform. Our NCM estimates improve the performance of HSI classification, even when ground objects are mixed. Both techniques rely on a superpixel-based clustering of HSI data in the spatial domain. The novelty of our NCMs comes from their reduced sensitivity to HSI noise distributions and interference patterns. Experiments with both simulated and real HSI data show that our methods significantly outperform the NCM estimation in the classical MNF transform, as well as more recent state-of-the-art NCM estimation methods. We quantify this improvement in terms of HSI classification accuracy and superior recovery of spectral features.

Soumyajit Gupta, Chandrajit Bajaj

### GMM Based Single Depth Image Super-Resolution

Super-resolution (SR) is a technique to improve the resolution of an image from a sequence of input images or from a single image. As SR is an ill-posed inverse problem, it leads to many suboptimal solutions. Since modern depth cameras suffer from low spatial resolution and are noisy, we present a Gaussian mixture model (GMM) based method for depth image super-resolution. We train a GMM on a set of high-resolution and low-resolution (HR-LR) synthetic training depth images to learn the relation between the HR and the LR patches in the form of covariance matrices. We use the expectation-maximization (EM) algorithm to converge to an optimal solution. We show promising qualitative and quantitative results in comparison to other depth image SR methods.

Chandra Shaker Balure, M. Ramesh Kini, Arnav Bhavsar

### Patch Similarity in Transform Domain for Intensity/Range Image Denoising with Edge Preservation

For the image denoising task, the prior information obtained from grouping similar non-local patches has been shown to serve as an effective regularizer. However, noise may create ambiguity in grouping similar patches and may therefore degrade the results, and most non-local similarity based approaches do not address this issue of noisy grouping. We therefore propose to denoise an image by mitigating the issue of grouping non-local similar patches in the presence of noise, working in the transform domain with sparsity and edge-preserving constraints. The effectiveness of transform-domain grouping of patches is used for learning dictionaries, and is further extended to obtain an initial approximation of the sparse coefficient vector for the clean image patches. We demonstrate the results of effective grouping of similar patches in denoising intensity as well as range images.

Seema Kumari, Srimanta Mandal, Arnav Bhavsar

### Multi-modal Image Analysis for Plant Stress Phenotyping

Drought stress detection involves multi-modal image analysis with high spatio-temporal resolution. Identification of digital traits that characterize drought stress response (DSR) is challenging due to the high volume of image-based features. Also, the labelled data that categorize DSR are either unavailable or subjectively developed, which is a low-throughput and error-prone task. Therefore, we propose a novel framework that provides automated scoring of DSR based on multi-trait fusion. k-means clustering was used to extract latent drought clusters, and the relevant traits were identified using Support Vector Machine-Recursive Feature Elimination (SVM-RFE). Using these traits, an SVM-based DSR classification model was constructed. The framework has been validated on visible and thermal shoot images of rice plants, yielding 95% accuracy. Various imaging modalities can be integrated with the proposed framework, making it scalable, as no prior information about the DSR was assumed.

Swati Bhugra, Anupama Anupama, Santanu Chaudhury, Brejesh Lall, Archana Chugh

### Source Classification Using Document Images from Smartphones and Flatbed Scanners

With technological advancements, digital scans of printed documents are increasingly used in many systems in place of the original hard-copy documents. The convenience of using digital scans comes at an increased risk of fraudulent and criminal activities due to their easy manipulation. To curb such activities, identifying the source of a scanned document can provide important clues to investigating agencies and also help build a secure communication system. This work uses local tetra patterns to capture unique device-specific signatures from images of printed documents. In this first-of-its-kind work for scanner identification, the method uses all characters to train a single classifier, thereby reducing the amount of training data required. The proposed method exhibits font size independence when tested on an existing scanner dataset, and takes a novel step towards font shape independence when tested on a smartphone dataset of comparable size (Supplementary material and code are available at https://sites.google.com/view/manaslab ).

Sharad Joshi, Gaurav Gupta, Nitin Khanna

### Homomorphic Incremental Directional Averaging for Noise Suppression in SAR Images

Synthetic Aperture Radar (SAR) images have recently been found to be a very useful modality for observing and understanding the surface of the Earth. Images formed under the SAR modality usually suffer from multiplicative noise, particularly in single-look-complex (SLC) mode. There is extensive work in the literature on denoising SAR data, usually applied to amplitude data or to coherency/covariance data. In this paper, we propose a two-channel filtering technique for noise suppression in complex SAR data. The rectangular format of complex SAR data is represented in phasor form to execute noise filtering over amplitude and phase independently, and then converted back to the rectangular format for subsequent applications. With this approach, the surface texture information is observed to be visibly retained while the noise is suppressed considerably well, in comparison to a reference multi-look image. As an application, we show the advantage of the proposed noise suppression technique in the classification of SAR images.

Shashaank M. Aswatha, Jayanta Mukhopadhyay, Prabir K. Biswas, Subhas Aikat
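The rectangular-to-phasor round trip described in this abstract can be sketched on a 1-D signal. This is only an illustration under stated assumptions, not the authors' incremental directional filter: the smoothing here is a plain moving average, the log-domain averaging of amplitudes is one common reading of "homomorphic", the circular mean for phase is a choice made here, and the function name and `radius` parameter are hypothetical.

```python
import cmath
import math


def smooth_complex(samples, radius=1):
    """Filter amplitude and phase separately, then rebuild complex samples.

    Assumes every sample has non-zero amplitude (log of zero is undefined).
    """
    # Rectangular form (re + j*im) to phasor form (amplitude, phase).
    polar = [cmath.polar(z) for z in samples]
    filtered = []
    for i in range(len(polar)):
        window = polar[max(0, i - radius): i + radius + 1]
        # Homomorphic step (assumption): average log-amplitudes so that
        # multiplicative noise is treated as additive.
        log_amplitude = sum(math.log(a) for a, _ in window) / len(window)
        # Circular mean of the phases avoids wrap-around problems at +/- pi.
        phase = math.atan2(sum(math.sin(p) for _, p in window),
                           sum(math.cos(p) for _, p in window))
        # Phasor form back to rectangular form.
        filtered.append(cmath.rect(math.exp(log_amplitude), phase))
    return filtered


# Hypothetical samples: constant phase, one amplitude spike.
samples = [cmath.rect(a, 0.5) for a in (1.0, 4.0, 1.0)]
for z in smooth_complex(samples):
    print(f"amplitude={abs(z):.3f} phase={cmath.phase(z):.3f}")
```

Keeping the two channels separate means the amplitude spike is damped while the phase, which carried no noise in this toy input, passes through unchanged.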

### An EEG-Based Image Annotation System

The success of deep learning in computer vision has greatly increased the need for annotated image datasets. We propose an EEG (Electroencephalogram)-based image annotation system. While humans can recognize objects in 20–200 ms, the need to manually label images results in a low annotation throughput. Our system employs brain signals captured via a consumer EEG device to achieve an annotation rate of up to 10 images per second. We exploit the P300 event-related potential (ERP) signature to identify target images during a rapid serial visual presentation (RSVP) task. We further perform unsupervised outlier removal to achieve an F1-score of 0.88 on the test set. The proposed system does not depend on category-specific EEG signatures, enabling the annotation of any new image category without any model pre-training.

Viral Parekh, Ramanathan Subramanian, Dipanjan Roy, C. V. Jawahar

### Multimodal Registration of Retinal Images

Registration of multimodal retinal images such as fundus and Optical Coherence Tomography (OCT) images is important, as the two structural imaging modalities provide complementary views of the retina. This enables a more accurate assessment of the health of the retina. However, registration is a challenging task because the fundus image (2D) is obtained via optical projection, whereas the OCT image (3D) is derived via optical coherence and is very noisy. Furthermore, the field of view possible in the two modalities is very different, resulting in low overlap (5–20%) between the obtained images. Existing methods for this task rely on either key-point (junction/corner) detection or accurate segmentation of vessels, which is difficult due to noise. We propose a registration algorithm for finding efficient landmarks under noisy conditions. The method requires neither accurate structure segmentation nor key-point detection. Modality Independent Neighborhood Descriptor (MIND) features are used to represent landmarks to achieve insensitivity to noise and contrast. A similarity transformation is used to register the images. Evaluation of the proposed method on 142 fundus-OCT pairs results in an RMSE of 2.61 pixels. The proposed method outperforms the existing algorithm in terms of robustness, accuracy, and computational efficiency.

Gamalapati S. Jahnavi, Jayanthi Sivaswamy
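The two quantities named in this abstract, a similarity transformation and a landmark RMSE in pixels, can be written out directly. The sketch assumes 2-D landmarks and known transform parameters (estimating them from MIND features is the paper's contribution and is not shown). The function names and sample points are hypothetical.

```python
import math


def similarity_transform(points, scale, angle, tx, ty):
    """Apply uniform scale, rotation (radians) and translation to 2-D points."""
    c, s = math.cos(angle), math.sin(angle)
    return [(scale * (c * x - s * y) + tx,
             scale * (s * x + c * y) + ty) for x, y in points]


def rmse(points_a, points_b):
    """Root-mean-square Euclidean distance between paired landmarks."""
    squared = [math.dist(p, q) ** 2 for p, q in zip(points_a, points_b)]
    return math.sqrt(sum(squared) / len(squared))


# Hypothetical landmarks in one image and their annotated positions in the other.
source = [(0.0, 0.0), (1.0, 0.0), (0.0, 1.0)]
moved = similarity_transform(source, scale=2.0, angle=math.pi / 2, tx=1.0, ty=1.0)
# Pretend the ground-truth landmarks are each offset by (0.3, 0.4) pixels.
target = [(x + 0.3, y + 0.4) for x, y in moved]

print(round(rmse(moved, target), 3))
```

A similarity transformation has only four degrees of freedom (scale, rotation, two translations), so a small number of reliable landmark pairs is enough to determine it. That suits image pairs with the low overlap the abstract mentions.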

### Dynamic Class Learning Approach for Smart CBIR

Smart Content Based Image Retrieval (CBIR) helps to simultaneously localize and recognize all objects present in a scene for the image retrieval task. The major drawbacks of such systems are: (a) the overhead of adding a new class is high, since it requires manual annotation of a large number of samples and retraining of the entire object model; and (b) the use of handcrafted features for the recognition and localization tasks limits performance. In this era of data proliferation, it is easy to discover new object categories but hard to label all of them, so few labeled samples are available for training, which gives rise to the above-mentioned drawbacks. In this work, we propose an approach which cuts down the overhead of labelling the data and re-training an entire module to learn new classes. The major components of the proposed framework are: (a) selection of an appropriate pre-trained deep model for learning a new class; and (b) learning the new class with the selected deep model under less supervision (i.e. with the least amount of labeled data) using the concept of triplet learning. To show the effectiveness of the proposed technique for new class learning, we have performed an evaluation on the CIFAR-10, PASCAL VOC2007 and ImageNet datasets.

Girraj Pahariya, Balaraman Ravindran, Sukhendu Das
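
The abstract above mentions triplet learning without giving details. As a hedged illustration only, the following sketch shows the commonly used hinge-style triplet loss on plain feature vectors; the function names, the Euclidean distance, and the margin value are assumptions and are not taken from the paper (some formulations use squared distances instead).

```python
import math


def euclidean(u, v):
    """Euclidean distance between two equal-length feature vectors."""
    if len(u) != len(v):
        raise ValueError("vectors must have the same length")
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(u, v)))


def triplet_loss(anchor, positive, negative, margin=1.0):
    """Hinge-style triplet loss: zero once the negative is farther from the anchor than the positive by at least the margin."""
    return max(0.0, euclidean(anchor, positive) - euclidean(anchor, negative) + margin)


if __name__ == "__main__":
    anchor = [0.0, 0.0]
    same_class = [0.0, 1.0]   # hypothetical embedding of an image from the same class
    other_class = [3.0, 4.0]  # hypothetical embedding of an image from another class
    print(triplet_loss(anchor, same_class, other_class))  # well separated, so the loss is zero
    print(triplet_loss(anchor, other_class, same_class))  # wrong ordering is penalized
```

Because the loss only compares relative distances within a triplet, it can be trained from comparatively few labeled samples per class, which fits the low-supervision setting the abstract describes.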

### Exploring Memory and Time Efficient Neural Networks for Image Captioning

Automatically describing the contents of an image is one of the fundamental problems in artificial intelligence. Recent research has primarily focussed on improving the quality of the generated descriptions. It is possible to construct multiple architectures that achieve equivalent performance for the same task. Among these, smaller architectures are desirable, as they require less communication across servers during distributed training and less bandwidth to export a new model from one place to another through a network. Generally, a deep learning architecture for image captioning consists of a Convolutional Neural Network (CNN) and a Recurrent Neural Network (RNN) combined within an encoder-decoder framework. We propose to combine a significantly smaller CNN architecture termed SqueezeNet and a memory and computation efficient LightRNN within a visual attention framework. Experimental evaluation of the proposed architecture on the Flickr8k, Flickr30k and MS-COCO datasets reveals superior results when compared to the state of the art.

Sandeep Narayan Parameswaran

### Dataset Augmentation with Synthetic Images Improves Semantic Segmentation

Although Deep Convolutional Neural Networks trained with strong pixel-level annotations have significantly pushed the performance in semantic segmentation, the annotation effort required for the creation of training data remains a roadblock for further improvements. We show that augmentation of the weakly annotated training dataset with synthetic images minimizes both the annotation effort and the cost of capturing images with sufficient variety. Evaluation on the PASCAL 2012 validation dataset shows an increase in mean IOU from 52.80% to 55.47% by adding just 100 synthetic images per object class. Our approach is thus a promising solution to the problems of annotation and dataset collection.

Manik Goyal, Param Rajpura, Hristo Bojinov, Ravi Hegde
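
The mean IOU figure quoted in the abstract above is a per-class intersection-over-union averaged over classes. The sketch below computes it for a single pair of label maps given as nested lists; the function name and the choice to skip classes absent from both maps are assumptions, and benchmark evaluations typically accumulate counts over the whole dataset rather than per image.

```python
def mean_iou(pred, truth, num_classes):
    """Mean intersection-over-union across classes for two equally sized label maps."""
    intersection = [0] * num_classes
    union = [0] * num_classes
    for pred_row, truth_row in zip(pred, truth):
        for p, t in zip(pred_row, truth_row):
            if p == t:
                # Pixel counts once towards both intersection and union of its class.
                intersection[p] += 1
                union[p] += 1
            else:
                # Mismatch: the pixel enlarges the union of both classes involved.
                union[p] += 1
                union[t] += 1
    # Classes that appear in neither map are skipped (an assumption of this sketch).
    ious = [i / u for i, u in zip(intersection, union) if u > 0]
    if not ious:
        raise ValueError("no class is present in either label map")
    return sum(ious) / len(ious)


if __name__ == "__main__":
    predicted = [[0, 0], [1, 1]]
    ground_truth = [[0, 1], [1, 1]]
    # Class 0: IoU 1/2, class 1: IoU 2/3, so the mean is 7/12.
    print(mean_iou(predicted, ground_truth, num_classes=2))
```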

### Deep Neural Network for Foreground Object Segmentation: An Unsupervised Approach

Saliency plays a key role in various computer vision tasks. Extracting salient regions from images and videos has been a well established problem in computer vision. While segmenting salient objects from images depends only on static information, temporal information in a video can make non-salient objects salient due to movement. Besides the temporal information, there are other challenges involved in video segmentation, such as 3D parallax, camera shake, motion blur, etc. In this work, we propose a novel unsupervised, end-to-end trainable, fully convolutional deep neural network for object segmentation. Our model is robust and scalable across scenes, as it is tested in an unsupervised manner and can easily infer which objects constitute the foreground of the image. We run various tests on two well established benchmarks for video object segmentation, the DAVIS and FBMS-59 datasets. We report our results and compare them against state-of-the-art methods.

Avishek Majumder, R. Venkatesh Babu

### Document Image Segmentation Using Deep Features

This paper explores the effectiveness of deep features for document image segmentation. The document image segmentation problem is modelled as a pixel labeling task where each pixel in the document image is classified into one of the predefined labels such as text, comments, decorations and background. Our method first extracts deep features from superpixels of the document image. We then learn an SVM classifier using these features and segment the document image. Fisher vector encoded convolutional layer features (fv-cnn) and fully connected layer features (fc-cnn) are used in our study. Experiments validate that our method is effective and yields better results for segmenting document images than popular approaches on benchmark handwritten datasets.

K. V. Jobin, C. V. Jawahar

### MKL Based Local Label Diffusion for Automatic Image Annotation

The task of automatic image annotation attempts to predict a set of semantic labels for an image. The majority of existing methods discover a common latent space that combines content and semantic image similarity using a global learning framework such as metric learning. This limits their applicability to large datasets. On the other hand, there are a few methods that focus entirely on learning a local latent space for every test image. However, they completely ignore the global structure of the data. In this work, we propose a novel image annotation method which attempts to combine the best of both local and global learning methods. We introduce the notion of neighborhood-types based on the hypothesis that images that are similar in content/feature space should also have overlapping neighborhoods. We also use graph diffusion as a mechanism for label transfer. Experiments on publicly available datasets show promising performance.

Abhijeet Kumar, Anjali Anil Shenoy, Avinash Sharma

### Semantic Multinomial Representation for Scene Images Using CNN-Based Pseudo-concepts and Concept Neural Network

For challenging visual recognition tasks such as scene classification and object detection, there is a need to bridge the semantic gap between low-level features and semantic concept descriptors. This requires mapping a scene image onto a semantic representation. The semantic multinomial (SMN) representation is a semantic representation of an image that corresponds to a vector of posterior probabilities of concepts. In this work we propose to build a concept neural network (CoNN) to obtain the SMN representation for a scene image. An important issue in building a CoNN is that it requires the availability of ground truth concept labels. We therefore propose to use pseudo-concepts obtained from feature maps of the higher level layers of a convolutional neural network. The effectiveness of the proposed approaches is studied using standard datasets.

Deepak Kumar Pradhan, Shikha Gupta, Veena Thenkanidiyoor, Dileep Aroor Dinesh

### Automatic Synthesis of Boolean Expression and Error Detection from Logic Circuit Sketches

Automatic techniques to recognize and evaluate digital logic circuits are more efficient and require less human intervention than traditional pen and paper methods. In this paper, we propose LEONARDO (Logic Expression fOrmatioN And eRror Detection framewOrk), a hierarchical approach to recognize the boolean expression from a hand drawn digital logic gate diagram. The key contributions of the proposed approach are: (i) a novel hierarchical framework to synthesize the boolean expression from a hand drawn logic circuit diagram; and (ii) identification of anomalies in the drawing. Extensive experimentation was performed through qualitative and quantitative analysis. Results were also compared with existing techniques proposed for similar problems. Upon experimentation and analysis, our system proved to be more robust to user variability in design and yielded an accuracy of 95.2%, which is a 4% gain over others.

Sahil Dhiman, Pushpinder Garg, Divya Sharma, Chiranjoy Chattopadhyay

### Comparison of Edge Detection Algorithms in the Framework of Despeckling Carotid Ultrasound Images Based on Bayesian Estimation Approach

Common carotid artery (CCA) ultrasound with estimation of Intima Media Thickness (IMT) is a safe and non-invasive technique for predicting cardiovascular risk. The precise quantification of IMT is useful for evaluating the risk of cardiovascular disease. The presence of speckle noise in a carotid ultrasound image reduces the quality of the image and hinders its interpretation, whether automatic or by a human observer. Carotid ultrasound images have multiplicative speckle noise, which is more difficult to remove than additive noise. Speckle removal filters are limited in their ability to preserve edges and other image characteristics. In this paper, we propose an extension of our earlier work with fully automated Region of Interest (ROI) extraction and speckle denoising using an optimized Bayesian least square estimation (BLSE) approach, followed by edge detection. The objective of the paper is to reduce the speckle noise in the extracted ROI of carotid ultrasound images using state-of-the-art denoising techniques, then apply edge detection techniques, and compare the detected edges with those extracted by the same edge operators from the ground truth image. The proposed algorithm is evaluated on 50 B-mode carotid ultrasound images. Experimental analysis shows that the proposed method achieves better results than other edge detection methods in terms of the Structural Similarity Index Map (SSIM), correlation of coefficient (CoC), peak signal to noise ratio (PSNR) and mean square error (MSE) measures. Based on the results, the proposed work is more effective in terms of visual inspection and detail preservation in carotid ultrasound images.

### A Two Stage Contour Evolution Approach for the Measurement of Choroid Thickness in EDI-OCT Images

High resolution images of the choroid can be obtained using Enhanced Depth Imaging Optical Coherence Tomography (EDI-OCT). The thickness of the choroid can be measured from these images and is widely used in clinical applications for diagnosing various eye related diseases. However, analysis of the choroidal thickness is presently done manually, which varies with the observer and is time consuming. In this paper we propose a two stage contour evolution approach using the Chan-Vese method for the segmentation of choroidal layers in EDI-OCT images. First, the EDI-OCT image is prefiltered using the Rotating Kernel Transformation (RKT) to reduce the effect of speckle noise. This is followed by the first stage of contour evolution, which effectively identifies the upper boundary, Bruch's Membrane (BM). The second stage of segmentation delineates the lower boundary of the choroid, the Choroid Sclera Interface (CSI). The choroid thickness, measured as the distance between BM and CSI, is compared with the results of manual segmentation by an ophthalmologist. The results of the proposed method show good consistency with the manual measurements.

George Neetha, C. V. Jiji

### Improved Low Resolution Heterogeneous Face Recognition Using Re-ranking

Recently, near-infrared to visible light facial image matching has been gaining popularity, especially for low-light and night-time surveillance scenarios. Unlike most of the work in the literature, we assume that the near-infrared probe images have low resolution in addition to uncontrolled pose and expression, which is due to the large distance of the person from the camera. To address this very challenging problem, we propose a re-ranking strategy which takes into account the relation of both the probe and the gallery with a set of reference images. This can be used as an add-on to any existing algorithm. We apply it with one recent dictionary learning algorithm which uses alignment of orthogonal dictionaries. We also create a benchmark for this task by evaluating some of the recent algorithms under this experimental protocol. Extensive experiments are conducted on a modified version of the CASIA NIR VIS 2.0 database to show the effectiveness of the proposed re-ranking approach.

Sivaram Prasad Mudunuri, Shashanka Venkataramanan, Soma Biswas

### Description Based Person Identification: Use of Clothes Color and Type

Surveillance videos can be searched for person identification using soft-biometrics. This paper uses clothes color and type for person identification in a video. A height model and the ISCC-NBS color descriptors are used for human localization and color classification. Experimental results are demonstrated on a custom video database and compared with a Gaussian mixture model based search model. It is shown that the proposed approach identifies a person correctly with high accuracy and outperforms the Gaussian mixture model based search model. The paper also develops a new vocabulary to describe the clothing type of a person.

Priyansh Shah, Mehul S. Raval, Shvetal Pandya, Sanjay Chaudhary, Anand Laddha, Hiren Galiyawala

### Towards Accurate Handwritten Word Recognition for Hindi and Bangla

Building accurate lexicon-free handwritten text recognizers for Indic languages is a challenging task, mostly due to the inherent complexities of Indic scripts in addition to the cursive nature of handwriting. In this work, we demonstrate an end-to-end trainable CNN-RNN hybrid architecture which takes inspiration from recent advances in using residual blocks for training convolutional layers, along with the inclusion of a spatial transformer layer to learn a model invariant to the geometric distortions present in handwriting. We focus on building state-of-the-art handwritten word recognizers for two popular Indic scripts: Devanagari and Bangla. To address the need for large scale training data for such low-resource languages, we utilize synthetically rendered data for pre-training the network and later fine tune it on the real data. We outperform the previous lexicon based, state-of-the-art methods on the test set of the Devanagari and Bangla tracks of RoyDB by a significant margin.

Kartik Dutta, Praveen Krishnan, Minesh Mathew, C. V. Jawahar

### NrityaGuru: A Dance Tutoring System for Bharatanatyam Using Kinect

Indian Classical Dance (ICD) is a living heritage of India. Traditionally Gurus (teachers) are the custodians of this heritage. They practice and pass on the legacy through their Shishyas (disciples), often in undocumented forms. The preservation of the heritage, thus, remains limited in time and scope. Emergence of digital multimedia technology has created the opportunity to preserve heritage by ensuring that it can be accessible over a long period of time. However, there have been only limited attempts to use effective technologies either in the pedagogy of learning dance or in the preservation of heritage of ICD. In this context, the paper presents NrityaGuru – a tutoring system for Bharatanatyam – a form of ICD. Using Kinect Xbox to capture dance videos in multi-modal form, we design a system that can help a learner dancer identify deviations in her dance postures and movements against the prerecorded benchmark performances of the tutor (Guru).

Achyuta Aich, Tanwi Mallick, Himadri B. G. S. Bhuyan, Partha Pratim Das, Arun Kumar Majumdar

### Automated Translation of Human Postures from Kinect Data to Labanotation

We present a non-intrusive automated system to translate human postures into Labanotation, a graphical notation for human postures and movements. The system uses Kinect to capture the human postures, identifies the positions and formations of the four major limbs: two hands and two legs, converts to the vocabulary of Labanotation and finally translates to a parseable LabanXML representation. We use the skeleton stream to classify the formations of the limbs using multi-class support vector machines. Encoding to XML is performed based on Labanotation specification. A data set of postures is created and annotated for training the classifier and to test its performance. We achieve 80% to 90% accuracy for the 4 limbs. The system can be used as an effective front-end for posture analysis applications in various areas like dance and sports where predefined postures form the basis for analysis and interpretation. The parseability of XML makes it easy for integration in a platform independent manner.

Anindhya Sankhla, Vinanti Kalangutkar, Himadri B. G. S. Bhuyan, Tanwi Mallick, Vivek Nautiyal, Partha Pratim Das, Arun Kumar Majumdar

### Emotion Based Categorization of Music Using Low Level Features and Agglomerative Clustering

Music emotion recognition (MER) has become a prominent field of interest in the music information retrieval (MIR) community, with the objective of providing more flexibility in content based music retrieval. It is important to categorize music according to its emotional characteristics, as this enables users to retrieve music according to their cognitive state. In this work, we consider low level time-domain and spectral features extracted from the music signal. Instead of considering a wide range of features, we select them judiciously based on our perception of the particular emotion. For classification, unsupervised approaches based on K-means and agglomerative clustering are considered. The experiment is carried out on a benchmark dataset. Performance comparison with existing work reflects the superiority of our proposed work.

Rajib Sarkar, Saikat Dutta, Aneek Roy, Sanjoy Kumar Saha

### Transfer Learning by Finetuning Pretrained CNNs Entirely with Synthetic Images

We show that finetuning pretrained CNNs entirely on synthetic images is an effective strategy to achieve transfer learning. We apply this strategy for detecting packaged food products clustered in refrigerator scenes. A CNN pretrained on the COCO dataset and fine-tuned with our 4000 synthetic images achieves mean average precision (mAP @ 0.5-IOU) of 52.59 on a test set of real images (150 distinct products as objects of interest and 25 distractor objects) in comparison to a value of 24.15 achieved without such finetuning. The synthetic images were rendered with freely available 3D models with variations in parameters like color, texture and viewpoint without a high emphasis on photorealism. We analyze factors like training data set size, cue variances, 3D model dictionary size and network architecture for their influence on the transfer learning performance. Additionally, training strategies like fine-tuning with selected layers and early stopping which affect transfer learning from synthetic scenes to real scenes were explored. This approach is promising in scenarios where limited training data is available.

Param Rajpura, Alakh Aggarwal, Manik Goyal, Sanchit Gupta, Jonti Talukdar, Hristo Bojinov, Ravi Hegde

### Detection of Coal Seam Fires in Summer Seasons from Landsat 8 OLI/TIRS in Dhanbad

Surface and sub-surface coal seam fires are detected by estimating Land Surface Temperature (LST). The LST of an area depends on several factors such as seasonal variation, nature of the soil, urban settlements, etc. Temperatures of several areas of the Dhanbad region of Eastern India are affected by the presence of surface and sub-surface coal seam fires. Coal seam fire detection has several challenges. Especially in the summer season, thermal anomalies lead to false classifications of such fires. It has been observed that during the summer season, water bodies have high temperatures, which affects the performance of fire detection. This paper proposes a novel method to detect surface and sub-surface fires in summer from satellite data by removing the high temperature water bodies.

Jit Mukherjee, Jayanta Mukherjee, Debashish Chakravarty

### Classification of Indian Monuments into Architectural Styles

We propose two novel approaches to classify Indian monuments according to their distinct architectural styles. While the historical significance of most Indian monuments is well documented, the details of their architectural styles are not as well recorded. Different Indian architectural styles often show certain similar features which makes classification a difficult task. Previous work has focused on European architecture and standard datasets are available for the same, but no standard dataset exists for Indian architecture. Therefore, we have curated a dataset of Indian monuments. In this paper, we propose two approaches to classify monuments according to their styles: Radon Barcodes and Convolutional Neural Networks. The first approach is fast and consumes less memory, but the second approach gives an accuracy of 82%, which is better than the 76% accuracy of the first method.

Saurabh Sharma, Priyal Aggarwal, Akanksha N. Bhattacharyya, S. Indu

### Predicting Word from Brain Activity Using Joint Sparse Embedding with Domain Adaptation

In the proposed work, a machine learning algorithm is applied to Functional Magnetic Resonance Imaging (fMRI) data to analyze human brain activity and then predict the word that the subject was thinking of. We develop an algorithm that can learn to identify and track the cognitive processes and predict the word from observed fMRI data. The major problem here is that we have limited data in a very high dimensional feature space, which makes the model susceptible to overfitting. Also, the data is highly noisy in most of the dimensions, leaving only a few features that are discriminative. Due to the high noise, the domain shift problem is very likely to occur. Most previous approaches focused only on feature selection and learning the embedding space. Here, our main objective is to learn a robust embedding space and handle the domain shift problem [11] in an efficient way. Unlike previous approaches, instead of learning a dictionary that projects the visual space to the word embedding space, we use a joint dictionary learning approach based on matrix factorization. Our experiments show that the proposed approach, based on joint dictionary learning and domain adaptation, has a significant advantage over previous approaches.

Akansha Mishra

### Backmatter
