2015 | Book

Mining Intelligence and Knowledge Exploration

Third International Conference, MIKE 2015, Hyderabad, India, December 9-11, 2015, Proceedings

About this book

This book constitutes the refereed proceedings of the Third International Conference on Mining Intelligence and Knowledge Exploration, MIKE 2015, held in Hyderabad, India, in December 2015.

The 48 full papers and 8 short papers presented together with 4 doctoral consortium papers were carefully reviewed and selected from 185 submissions. The papers cover a wide range of topics including information retrieval, machine learning, pattern recognition, knowledge discovery, classification, clustering, image processing, network security, speech processing, natural language processing, language, cognition and computation, fuzzy sets, and business intelligence.

Table of contents

Frontmatter
Spreading Activation Way of Knowledge Integration

Search and recommender systems benefit from effective integration of two different kinds of knowledge. The first is introspective knowledge, typically available in feature-theoretic representations of objects. The second is external knowledge, which could be obtained from how users rate (or annotate) items, or collaborate over a social network. This paper presents a spreading activation model that is aimed at a principled integration of these two sources of knowledge. In order to empirically evaluate our approach, we restrict the scope to text classification tasks, where we use the category knowledge of the labeled set of examples as an external knowledge source. Our experiments show a significantly improved classification effectiveness on hard datasets, where feature value representations, on their own, are inadequate in discriminating between classes.

Shubhranshu Shekhar, Sutanu Chakraborti, Deepak Khemani
Class Specific Feature Selection Using Simulated Annealing

This paper proposes a method of identifying the features that are important for each class. This entails selecting features specifically for each class, which is carried out using simulated annealing. The algorithm is run separately for each class, resulting in a feature subset for that class. A test pattern is classified by running a classifier for each class and combining the results. The 1NN classifier is the classification algorithm used. Results are reported on eight benchmark datasets from the UCI repository. The selected features, besides giving good classification accuracy, give an idea of the important features for each class.

V. Susheela Devi
A Redundancy Study for Feature Selection in Biological Data

The curse of dimensionality is a well-known issue in biological databases. A possible solution is to use a feature selection approach. Filter methods are well-known feature selection methods that select the most significant features and discard the rest according to their significance level. In general, the set of eliminated features may hide useful information that could be valuable in further studies. Hence, this paper presents a new approach to filter feature selection that uses redundant features to create new instances and avoid the curse of dimensionality.

Emna Mouelhi, Waad Bouaguel, Ghazi Bel Mufti
New Feature Detection Mechanism for Extended Kalman Filter Based Monocular SLAM with 1-Point RANSAC

We present a different approach to feature point detection for improving the accuracy of SLAM using a single monocular camera. Traditionally, Harris corner detection, SURF or FAST corner detectors are used for finding feature points of interest in the image. We replace this with another approach, which involves building a non-linear scale space representation of images using the Perona and Malik diffusion equation and computing the scale normalized Hessian at multiple scale levels (KAZE feature). The feature points so detected are used to estimate the state and pose of a mono camera using an extended Kalman filter. By using accelerated KAZE features and a more rigorous feature rejection routine combined with 1-point RANSAC for outlier rejection, short-baseline matching of features is significantly improved, even with fewer feature points, especially in the presence of motion blur. We present a comparative study of our proposal with FAST and show improved localization accuracy in terms of absolute trajectory error.

Agniva Sengupta, Shafeeq Elanattil
Sequential Instance Based Feature Subset Selection for High Dimensional Data

Feature subset selection is a key problem in the data-mining classification task that helps to obtain more compact and understandable models without degrading their performance. This paper deals with the problem of supervised wrapper based feature subset selection in data sets with a very large number of attributes and a low sample size. In this case, standard wrapper algorithms cannot be applied because of their complexity. In this work we propose a new hybrid -filter wrapper- approach based on instance learning with the main goal of accelerating the feature subset selection process by reducing the number of wrapper evaluations. In our hybrid feature selection method, named Hybrid Instance Based Sequential Backward Search (HIB-SBS), instance learning is used to weight features and generate candidate feature subsets, then SBS and K-nearest neighbours (KNN) compose an evaluation system of wrappers. Our method is experimentally tested and compared with state-of-the-art algorithms over four high-dimensional low sample size datasets. The results show an impressive reduction in the execution time compared to the wrapper approach and that our proposal outperforms other methods in terms of accuracy and cardinality of the selected subset.

Afef Ben Brahim, Mohamed Limam
Facial Expression Recognition Using Entire Gabor Filter Matching Score Level Fusion Approach Based on Subspace Methods

In this study, appearance-based facial expression recognition is presented by extracting Gabor magnitude feature vectors (GMFV) and Gabor phase congruency vectors (GPCV). The dimensions of these two feature vector spaces are reduced and redundant information is removed using subspace methods. Both the GMFV and GPCV spaces are projected with Eigen score, and the projected matching scores are normalized and fused. The final matching scores of each subspace method are normalized using Z-score normalization and fused together using the maximum rule. The entire Gabor feature vector space consumes a large amount of memory and high processing time, and contains much redundant data. To overcome this problem, this paper introduces an entire Gabor matching score level fusion (EGMSLF) approach based on subspace methods. The JAFFE database is used for the experiments. A support vector machine is used as the classifier. Performance is evaluated by comparing the proposed approach with state-of-the-art approaches. The proposed EGMSLF approach improves on the performance of earlier methods.

Ganapatikrishna Hegde, M. Seetha, Nagaratna Hegde
Cluster Dependent Classifiers for Online Signature Verification

In this paper, the applicability of the notion of a cluster-dependent classifier for online signature verification is investigated. For every writer, using a number of training samples, a representative is selected based on a minimum average distance criterion (centroid) across all the samples of that writer. The k-means clustering algorithm is then employed to cluster the writers based on the chosen representatives. To select a suitable classifier for a writer, the equal error rate (EER) is estimated using each of the classifiers for every writer in a cluster. The classifier that gives the lowest EER for a writer is selected as the suitable classifier for that writer. Once the classifier for each writer in a cluster is decided, the classifier that has been selected for the maximum number of writers in that cluster is chosen as the classifier for all writers of that cluster. During verification, the authenticity of the query signature is decided using the classifier selected for the cluster to which the claimed writer belongs. In contrast to existing works on online signature verification, which use a common classifier for all writers during verification, our work uses a classifier that is cluster dependent. Our intuition is to recommend the same classifier for all and only those writers who share some common characteristics, and different classifiers for writers with different characteristics. To demonstrate the efficacy of our model, extensive experiments are carried out on the MCYT online signature dataset (DB1), consisting of signatures of 100 individuals. The experimental results indicate improved performance with the adoption of a cluster-dependent classifier, which opens up a new avenue for further investigation on a reasonably large dataset.

S. Manjunath, K.S. Manjunatha, D.S. Guru, M.T. Somashekara
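
The selection rule described in the abstract above (best classifier per writer by EER, then a majority vote within each cluster) can be sketched as follows. This is only an illustrative sketch, not the authors' implementation: the function name, the dictionary layout, the classifier names and the EER values are all hypothetical, and tie-breaking is left to Python's default ordering.

```python
from collections import Counter


def select_cluster_classifiers(eer, cluster_of):
    """Pick one classifier per cluster.

    eer:        {writer: {classifier_name: equal_error_rate}}  (hypothetical layout)
    cluster_of: {writer: cluster_id}
    Returns {cluster_id: classifier_name}.
    """
    votes = {}
    for writer, scores in eer.items():
        # Step 1: the classifier with the lowest EER is the best one for this writer.
        best = min(scores, key=scores.get)
        votes.setdefault(cluster_of[writer], Counter())[best] += 1
    # Step 2: within each cluster, keep the classifier chosen by the most writers.
    return {cluster: counter.most_common(1)[0][0] for cluster, counter in votes.items()}


# Made-up EER values for five writers and two candidate classifiers.
example_eer = {
    "w1": {"svm": 0.05, "knn": 0.08},
    "w2": {"svm": 0.04, "knn": 0.06},
    "w3": {"svm": 0.09, "knn": 0.03},
    "w4": {"svm": 0.10, "knn": 0.07},
    "w5": {"svm": 0.06, "knn": 0.05},
}
example_clusters = {"w1": 0, "w2": 0, "w3": 0, "w4": 1, "w5": 1}
print(select_cluster_classifiers(example_eer, example_clusters))
```

At verification time, a query signature would then be checked with the classifier stored for the claimed writer's cluster.
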
Classification Using Rough Random Forest

The Rough random forest is a classification model based on rough set theory. The Rough random forest uses the concept of random forest and rough set theory in a single model. It combines a collection of decision trees for classification instead of depending on a single decision tree. It uses the concept of bagging and random subspace method to improve the performance of the classification model. In the rough random forest the reducts of each decision tree are chosen on the basis of boundary region condition. Each decision tree uses a different subset of patterns and features. The class label of patterns is obtained by combining the decisions of all the decision trees by majority voting. Results are reported on a number of benchmark datasets and compared with other techniques. Rough random forest is found to give better performance.

Rajhans Gondane, V. Susheela Devi
Extending and Tuning Heuristics for a Partial Order Causal Link Planner

Recent literature reveals that different heuristic functions perform well in different domains due to the varying nature of planning problems. This nature is characterized by the degree of interaction between subgoals and actions. We take the approach of learning the characteristics of different domains in a supervised manner. In this paper, we employ a machine learning approach to combine different, possibly inadmissible, heuristic functions in a domain dependent manner. With the renewed interest in Partial Order Causal Link (POCL) planning we also extend the heuristic functions derived from state space approaches to POCL planning. We use Artificial Neural Network (ANN) for combining these heuristics. The goal is to allow a planner to learn the parameters to combine heuristic functions in a given domain over time in a supervised manner. Our experiments demonstrate that one can discover combinations that yield better heuristic functions in different planning domains.

Shashank Shekhar, Deepak Khemani
Symbolic Representation of Text Documents Using Multiple Kernel FCM

In this paper, we propose a novel method of representing text documents based on clustering of term frequency vectors. To cluster the term frequency vectors, we make use of Multiple Kernel Fuzzy C-Means (MKFCM). After clustering, the term frequency vectors of each cluster are used to form an interval-valued representation (symbolic representation) using the mean and standard deviation. The interval-valued features are then stored in a knowledge base as a representative of the cluster. To corroborate the efficacy of the proposed model, we conducted extensive experiments on standard datasets such as Reuters-21578 and 20 Newsgroup. We compared the classification accuracy achieved by the symbolic classifier with that of the existing Naive Bayes, KNN and SVM classifiers. The experimental results reveal that the classification accuracy achieved by the symbolic classifier is better than that of the other three classifiers.

B. S. Harish, M. B. Revanasiddappa, S. V. Aruna Kumar
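
As a rough illustration of the interval-valued representation mentioned in the abstract above, the sketch below builds one interval per feature from the mean and standard deviation of a cluster's term frequency vectors. The interval width (one population standard deviation either side of the mean) and the containment count used for matching are assumptions made for this example; the abstract does not specify either, and the function names are hypothetical.

```python
from statistics import mean, pstdev


def interval_representation(vectors):
    """Per-feature [mean - std, mean + std] intervals for one cluster.

    vectors: list of equal-length term frequency vectors belonging to the cluster.
    Assumption: population standard deviation, interval width of one std.
    """
    columns = list(zip(*vectors))  # transpose: one tuple per feature
    return [(mean(col) - pstdev(col), mean(col) + pstdev(col)) for col in columns]


def acceptance_count(doc, intervals):
    """Number of features of `doc` that fall inside the cluster's intervals.

    Hypothetical matching rule, shown only to make the representation usable.
    """
    return sum(lo <= value <= hi for value, (lo, hi) in zip(doc, intervals))


cluster = [[1.0, 2.0], [3.0, 2.0]]
intervals = interval_representation(cluster)
print(intervals)
print(acceptance_count([2.5, 2.0], intervals))
```

A test document could then be assigned to the cluster whose intervals contain the most of its feature values, although other similarity measures are equally plausible.
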
GIST Descriptors for Sign Language Recognition: An Approach Based on Symbolic Representation

This paper presents an approach for recognizing signs made by hearing impaired people at sentence level. The signs are captured in the form of video and each frame is processed to efficiently extract sign information to model the sign and recognize instances of new test signs. Low-dimensional global “gist” descriptors are used to capture sign information from every frame of a sign video. K-means clustering is used to choose a fixed number of frames that are discriminative enough to distinguish between signs. Selecting a fixed number of frames also helps to deal with the unequal number of frames among instances of the same sign due to different signers, and reduces the complexity of subsequent processing. Further, we exploit the concept of symbolic data analysis to effectively represent a sign. A fuzzy trapezoidal membership function is used to establish the similarity between a test sign and a reference sign, and a nearest neighbour classification technique is used to recognize the given test sign. A considerably large database of signs (UoM-ISL) is created and extensive experimentation is conducted on this database to study the efficacy of the proposed methodology. The experimental results are found to be encouraging.

H.S. Nagendraswamy, B.M. Chethana Kumara, R. Lekha Chinmayi
A Graph Processing Based Approach for Automatic Detection of Semantic Inconsistency Between BPMN Process Model and SBVR Rules

Business Process Modeling Notation (BPMN) is a technique for graphically drawing and illustrating business processes in diagrammatic form. Semantics of Business Vocabulary and Business Rules (SBVR) is a declarative language used to define business vocabulary, rules and policy. Inconsistencies often occur between BPMN and SBVR because they are independently maintained. Our aim is to investigate techniques for automatically detecting inconsistencies between business processes and rules. We present a method for inconsistency detection (between BPMN and SBVR) based on converting SBVR rules to a graphical representation and applying sub-graph isomorphism to detect instances of inconsistency between BPMN and SBVR models. We propose a multi-step process framework for identifying instances of inconsistency between the two models. We first generate an XML representation of the BPMN diagram and apply parsing and tag extraction. We then apply the Stanford NLP Parser to generate parse trees of the rules. The detailed information about the parse tree is stored in the form of Typed Dependencies, which represent grammatical relations between the words of a sentence. We use the grammatical relations to extract the triplet (actor-action-object) of a sentence. We find node-induced sub-graphs of all possible node counts of a graph and apply the VF2 algorithm to detect instances of inconsistency between sub-graphs. Finally, we evaluate the proposed framework by conducting experiments on a synthetic dataset to validate the accuracy and effectiveness of our approach.

Akanksha Mishra, Ashish Sureka
An Improved Intrusion Detection System Based on a Two Stage Alarm Correlation to Identify Outliers and False Alerts

To ensure the protection of computer networks from attacks, an intrusion detection system (IDS) should be included in the security architecture. Although detecting intrusions is the ultimate goal, IDSs generate a huge number of false alerts that cannot be properly managed by the administrator, along with some noisy alerts or outliers. Many research works have been conducted to improve IDS accuracy by reducing the rate of false alerts and eliminating outliers. In this paper, we propose a two-stage process to detect false alerts and outliers. In the first stage, we remove outliers from the set of meta-alerts using the best outlier detection method, chosen after evaluating the most cited ones in the literature. In the second stage, we propose a binary classification algorithm to classify meta-alerts as either false alerts or real attacks. Experimental results show that our proposed process outperforms competing methods by considerably reducing the rate of false alerts and outliers.

Fatma Hachmi, Mohamed Limam
A Geometric Viewpoint of the Selection of the Regularization Parameter in Some Support Vector Machines

The regularization parameter of support vector machines is intended to improve their generalization performance. Since the feasible region of binary class support vector machines with finite dimensional feature space is a polytope, we note that classifiers at vertices of this unbounded polytope correspond to certain ranges of the regularization parameter. This reduces the search for a suitable regularization parameter to a search of (finite number of) vertices of this polytope. We propose an algorithm that identifies neighbouring vertices of a given vertex and thereby identifies the classifiers corresponding to the set of vertices of this polytope. A classifier can then be chosen from them based on a suitable test error criterion. We illustrate our results with an example which demonstrates that this path can be complicated. A portion of the path is sandwiched between two finite intervals of path, each generated by separate sets of vertices and edges.

Nandyala Hemachandra, Puja Sahu
Discovering Communities in Heterogeneous Social Networks Based on Non-negative Tensor Factorization and Cluster Ensemble Approach

Identification of the appropriate community structure in social networks is an arduous task. The intricacy of the problem increases with the heterogeneity of the multiple types of objects and relationships involved in the analysis of the network. Traditional approaches to community detection focus on networks comprising content features and linkage information for a single type of entity. However, rich social media networks are usually heterogeneous in nature, with multiple types of relationships existing between different types of entities. Cognizant of these requirements, we develop a model for community detection in Heterogeneous Social Networks (HSNs) employing a non-negative tensor factorization method and a cluster ensemble approach. Extensive experiments are performed on the 20Newsgroup dataset, which establish the effectiveness and efficiency of our scheme.

Ankita Verma, K. K. Bharadwaj
On the Impact of Post-clustering Phase in Multi-way Spectral Partitioning

Spectral clustering is one of the most popular modern graph clustering techniques in machine learning. Using eigenvalue analysis, spectral methods partition the given set of points into a number of disjoint groups. Spectral methods are very useful in determining non-convex shaped clusters; identifying such clusters is not trivial for many traditional clustering methods, including hierarchical and partitional methods. Spectral clustering may be carried out either as recursive bi-partitioning using the Fiedler vector (second eigenvector) or as multi-way partitioning using the first k eigenvectors, where k is the number of clusters. Although spectral methods are widely discussed, little attention has been paid to which post-clustering algorithm (e.g. K-means) should be used in multi-way spectral partitioning. This motivated us to carry out an experimental study on the influence of the post-clustering phase in spectral methods. We consider three clustering algorithms, namely K-means, average linkage and FCM. Our study shows that the results of multi-way spectral partitioning strongly depend on the post-clustering algorithm.

R. Jothi, Sraban Kumar Mohanty, Aparajita Ojha
BSO-CLARA: Bees Swarm Optimization for Clustering LARge Applications

Clustering is an essential data mining tool for analyzing big data. In this article, an overview of methods from the literature is undertaken. Following this study, a new algorithm called BSO-CLARA is proposed for clustering large data sets. It is based on bee behavior and k-medoids partitioning. Criteria such as effectiveness, efficiency, scalability and control of noise and outliers are discussed for the new method and compared to those of the previous techniques. Experimental results show that BSO-CLARA is more effective and more efficient than the well-known partitioning algorithms PAM, CLARA and CLARANS, as well as CLAM, a recent algorithm found in the literature.

Yasmin Aboubi, Habiba Drias, Nadjet Kamel
ECHSA: An Energy-Efficient Cluster-Head Selection Algorithm in Wireless Sensor Networks

In Wireless Sensor Networks (WSNs), a key issue is the limited battery power of sensor nodes. Increasing the network lifetime is a great challenge when different nodes have different energy labels. To address this challenge, we propose an Energy-Efficient Cluster-Head Selection Algorithm (ECHSA) based on the Nash Equilibrium (NE) decision of game theory, where each cluster in the network acts as a player and each player chooses his best strategy followed by other players. We have also compared ECHSA with existing protocols. The simulation results show an increase in the performance of our proposed approach compared to the existing approaches.

Bibudhendu Pati, Joy Lal Sarkar, Chhabi Rani Panigrahi, Mayank Tiwary
Optimal Core Point Detection Using Multi-scale Principal Component Analysis

Core point plays a vital role in fingerprint matching and classification. The fingerprint images may be of poor quality because of sensor type and user’s body condition. To detect the core point in noisy and poor quality fingerprint images, we have estimated the dominant orientation field based on principal component analysis and multi-scale pyramid decomposition to produce correct orientation field. The proposed work detects the optimal upper and lower core points using shape analysis of orientation field and binary candidate region images in fingerprints. Experiments are carried out on FVC databases and it is found that the proposed algorithm has high accuracy in locating exact core points.

T. Kathirvalavakumar, K. S. Jeyalakshmi
Recognition of Semigraph Representation of Alphabets Using Edge Based Hybrid Neural Network

Graph structured data are classified by connectionist models such as Graph neural network (GNN), recursive neural network. These models are based on the label of the nodes of the graph. An attempt has been made to consider the network based on edges. If a graph structured data is represented as semigraph, the number of edges will be reduced leading to a reduction in the number of networks in GNN. In this paper uppercase English alphabets represented as graphs are recognized using edge based hybrid neural network by viewing the graphs as semigraph. Experimental results show that the edge based hybrid neural network is able to identify all the graphs of alphabets correctly and outperforms edge based GNN.

R. B. Gnana Jothi, S. M. Meena Rani
Small Eigenvalue Based Skew Estimation of Handwritten Devanagari Words

In this work, a novel technique for estimating the skew angle of handwritten Devanagari words is proposed. The orientation of the Shirorekha, a horizontal line present at the top and touching all the characters of a word, is used as a clue to identify its skew. The method exploits the eigenvalue analysis of the covariance matrix formed by the edge pixels of the word image over a small connected region of support for the purpose of extracting straight line segments. The line segments thus obtained are grouped according to their orientations. The orientation of a line segment is computed as a function of the angles of its associated edge pixels. The angle of an edge pixel is identified using the eigenvector corresponding to the small eigenvalue associated with it. The line segments of each group are processed to locate the longest connected line, which is taken to be the Shirorekha. The method is very fast compared to the Hough transform based line detection approach, in addition to being robust to noise. The performance of the method is studied on a dataset of 400 word images extracted from handwritten Devanagari documents, mainly in Hindi, with arbitrary orientations ranging from −45° to 45° under different scaling.

D.S. Guru, Mahamad Suhil, M. Ravikumar, S. Manjunath
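
The eigenvalue analysis referred to in the abstract above can be illustrated with a small sketch. For a set of edge pixel coordinates in a region of support, the 2×2 covariance matrix has a closed-form eigen-decomposition: a small eigenvalue close to zero indicates that the pixels lie nearly on a straight line, and the line direction is perpendicular to the eigenvector of that small eigenvalue. The function name and the choice of region are assumptions made for this example; the paper's actual grouping and Shirorekha selection steps are not reproduced here.

```python
import math


def small_eigen_orientation(points):
    """Return (small_eigenvalue, line_angle_degrees) for a set of (x, y) pixels.

    The covariance matrix [[sxx, sxy], [sxy, syy]] is symmetric, so its
    eigenvalues are (trace +/- root) / 2 with root as computed below.
    """
    n = len(points)
    mx = sum(x for x, _ in points) / n
    my = sum(y for _, y in points) / n
    sxx = sum((x - mx) ** 2 for x, _ in points) / n
    syy = sum((y - my) ** 2 for _, y in points) / n
    sxy = sum((x - mx) * (y - my) for x, y in points) / n

    root = math.hypot(sxx - syy, 2.0 * sxy)
    small = (sxx + syy - root) / 2.0  # near zero when the pixels are collinear

    # Direction of the large-eigenvalue axis, i.e. the line itself; the
    # eigenvector of the small eigenvalue is perpendicular to this direction.
    line_angle = 0.5 * math.degrees(math.atan2(2.0 * sxy, sxx - syy))
    return small, line_angle


# Pixels on the diagonal y = x: small eigenvalue 0, orientation 45 degrees.
print(small_eigen_orientation([(0, 0), (1, 1), (2, 2), (3, 3)]))
```

In image coordinates the y axis usually points downward, so the sign of the angle may need to be flipped depending on the convention in use.
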
Recognizing Handwritten Arabic Numerals Using Partitioning Approach and KNN Algorithm

A method is proposed to classify handwritten Arabic numerals in their compressed form using a partitioning approach and the K-Nearest Neighbour (KNN) algorithm. Handwritten numerals are represented in matrix form. Compressing the matrix representation by merging adjacent pairs of rows, OR-ing the bits in corresponding positions, reduces its size by half. Considering each row as a partitioned portion, clusters are formed for each partition of a digit separately. Leaders of clusters of partitions are used to recognize the patterns by a divide and conquer approach and the KNN algorithm. Experimental results show that the proposed method recognizes the patterns accurately.

T. Kathirvalavakumar, R. Palaniappan
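
The row-merging compression described in the abstract above is simple enough to sketch directly: each adjacent pair of rows of the binary numeral matrix is combined with a bitwise OR, halving the number of rows. The function name is hypothetical, and rejecting matrices with an odd number of rows is an assumption of this sketch, since the abstract does not say how such cases are handled.

```python
def compress_rows(bitmap):
    """Merge each adjacent pair of rows of a 0/1 matrix with a bitwise OR.

    bitmap: list of equal-length rows containing 0 or 1.
    Returns a matrix with half as many rows.
    """
    if len(bitmap) % 2:
        # Assumption: odd row counts are not handled in this sketch.
        raise ValueError("expected an even number of rows")
    return [
        [a | b for a, b in zip(top, bottom)]
        for top, bottom in zip(bitmap[0::2], bitmap[1::2])
    ]


digit = [
    [0, 1, 0],
    [1, 0, 0],
    [0, 0, 0],
    [0, 0, 1],
]
print(compress_rows(digit))
```

Each row of the compressed matrix would then be treated as one partition for the clustering and KNN steps.
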
Fuzzy Based Support System for Melanoma Diagnosis

Early detection of melanoma (skin cancer) and its classes (Malignant, Atypical, Common Nevus) is always beneficial for patients. Until now, researchers have designed many Computer Aided Detection (CAD) systems that focus on providing binary results (i.e. either presence or absence of any class of melanoma). As these systems do not provide the relative extent to which a lesion belongs to each class, they usually lack decision support for dermatologists (in case of suspiciousness of a lesion) and complete reliability for routine clinical use. To overcome these problems, a two-stage framework is proposed incorporating a new fuzzy membership function based on the Lagrange Interpolation Curve Fitting method. This framework returns analogue values for a lesion that represent its relative extent in a particular class (helpful in recreating suspiciousness), hence having a greater degree of acceptability among dermatologists. The two-stage CAD framework proposed here uses the PH² dermoscopic image dataset as input. In the first stage, pre-processing, segmentation and feature extraction are performed, while in the next stage fuzzy membership values for the three classes are calculated using the Gaussian, Bell and proposed membership functions. A comparative study is done on the basis of sensitivity and specificity for the three membership functions.

Anand Gupta, Devendra Tiwari, Siddharth Agarwal, Monal Jain
KD-Tree Approach in Sketch Based Image Retrieval

In this work, we develop a model for representation and indexing of objects for a given input query sketch. In some applications, where the database is very large, the retrieval process typically has an unacceptably long response time. A solution to speed up the retrieval process is to design an indexing model prior to retrieval. In this work, we study the suitability of the Kd-tree indexing mechanism for a sketch based retrieval system based on shape descriptors such as Scale invariant feature transform (SIFT), Histogram of Gradients (HOG), Edge orientation histograms (EOH) and Shape context (SC). To corroborate the efficacy of the proposed method, an experiment was conducted on the Caltech-101 dataset. We also collected about 200 sketches from 20 users. Experimental results reveal that indexing prior to identification is faster than the conventional identification method.

Y. H. Sharath Kumar, N. Pavithra
Benchmarking Gradient Magnitude Techniques for Image Segmentation Using CBIR

Although image segmentation has become a prerequisite in many image processing and computer vision applications, little effort towards evaluating segmentation techniques is found in the literature. In this paper, we carry out a comprehensive evaluation of five different gradient magnitude (GM) based image segmentation techniques using CBIR (Content Based Image Retrieval). Firstly, boundary probabilities are detected using gradient magnitude based techniques, namely Canny edge detection (pbCanny), second moment matrix (pb2MM), multi-scale second moment matrix (pb2MM2), gradient magnitude (pbGM) and multi-scale gradient magnitude (pbGM2). Further, Ridgelets are applied to these boundaries to extract radial energy information exhibiting linear properties, and PCA is applied to reduce the dimensionality of these features. Finally, probabilistic neural network (PNN) classifiers are used to classify and observe the performance of the gradient magnitude techniques in the classification process. We observed the performance of these algorithms on the challenging and popular image datasets Corel-1K, Caltech-101, and Caltech-256.

K. Mahantesh, V. N. Manjunath Aradhya, B. V. Sandesh Kumar
Automated Nuclear Pleomorphism Scoring in Breast Cancer Histopathology Images Using Deep Neural Networks

Scoring the size/shape variations of cancer nuclei (nuclear pleomorphism) in breast cancer histopathology images is a critical prognostic marker in breast cancer grading and has been subject to a considerable amount of observer variability and subjectivity issues. In spite of a decade long histopathology image analysis research, automated assessment of nuclear pleomorphism remains challenging due to the complex visual appearance and huge variability of cancer nuclei. This study proposes a practical application of the deep belief based deep neural network (DBN-DNN) model to determine the nuclear pleomorphism score of breast cancer tissue. The DBN-DNN network is trained to classify a breast cancer histology image into one of the three groups: score 1, score 2 and score 3 nuclear pleomorphism by learning the mean and standard deviation of morphological and texture features of the entire nuclei population contained in a breast histology image. The model was trained for features from automatically-segmented nuclei from 80 breast cancer histopathology images selected from publicly available MITOS-ATYPIA dataset. The classification accuracy of the model on the training and testing datasets was found to be 96 % and 90 % respectively.

P. Maqlin, Robinson Thamburaj, Joy John Mammen, Marie Theresa Manipadam
Hybrid Source Modeling Method Utilizing Optimal Residual Frames for HMM-based Speech Synthesis

This paper proposes a new hybrid source modeling method for improving the quality of HMM-based speech synthesis. The proposed method is an extension of a recently proposed source model based on the optimal residual frame [1]. The source or excitation signal is first decomposed into a number of pitch-synchronous residual frames. Unique variations are observed in the pitch-synchronous residual frames present at the beginning, middle and end regions of the excitation signal of a phone. Based on this observation, one optimal residual frame is extracted from each of the beginning, middle and end regions of the excitation signal of a phone. The optimal residual frames extracted from each region of the excitation signal are separately grouped in the form of a decision tree. During synthesis, for every phone, three optimal residual frames are selected from the three decision trees based on target and concatenation costs. Using the three optimal residual frames, the excitation signal of a phone is constructed. The proposed hybrid source model is used for synthesizing speech under the HTS framework. Subjective evaluation results indicate that the proposed source model is better than the two existing source modeling methods.

N. P. Narendra, K. Sreenivasa Rao
Significance of Emotionally Significant Regions of Speech for Emotive to Neutral Conversion

Most speech processing applications suffer from a degradation in performance when operated in emotional environments. The degradation in performance is mostly due to a mismatch between developing and operating environments. Model adaptation and feature adaptation schemes have been employed to adapt speech systems developed in neutral environments to emotional environments. In this study, we consider only the anger emotion, and study the signal level conversion from anger emotion to neutral emotion. Emotion in human speech is concentrated over a small region of the entire utterance. The regions of speech that are highly influenced by the emotive state of the speaker are considered the emotionally significant regions of an utterance. Physiological constraints of the human speech production mechanism are explored to detect the emotionally significant regions of an utterance. The variation of various prosody parameters (pitch, duration and energy) based on their position in the sentence is analyzed to obtain the modification factors. The speech signal in the emotionally significant regions is modified using the corresponding modification factor to generate the neutral version of the anger speech. Speech samples from the Indian Institute of Technology Kharagpur Simulated Emotion Speech Corpus (IITKGP-SESC) are used in this study. A subjective listening test is performed to evaluate the effectiveness of the proposed conversion.

Hari Krishna Vydana, V. V. Vidyadhara Raju, Suryakanth V. Gangashetty, Anil Kumar Vuppala
Spoken Document Retrieval: Sub-sequence DTW Framework and Variants

We address the problem of spoken document retrieval (alternately termed content-based audio-search and retrieval), which involves searching a large spoken document or database for a specific spoken query. We formulate the search within the sub-sequence DTW (SS-DTW) framework proposed earlier in literature, adapted here to work on acoustic feature representation of the database and spoken query term. Further, we propose several variants within this framework, such as (i) path-length based score normalization, (ii) clustered quantization of acoustic feature vectors for fast search and retrieval with invariant performances and, (iii) phonetic representation of the database and spoken query term, derived from ground-truth annotation as well as HMM based continuous phoneme recognition. We characterize the performance of the proposed framework, algorithms and variants in terms of ROC curves, EER and time-complexity and present results using the TIMIT database with annotated spoken sentences from 400 speakers.

Akshay Khatwani, Komala Pawar, Sushma Hegde, Sudha Rao, Adithya Seshasayee, V. Ramasubramanian
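
The sub-sequence DTW search and the path-length score normalization named in the abstract above can be illustrated with a minimal sketch. It is not the authors' implementation: the function name `subsequence_dtw`, the Euclidean frame distance, and the three-way step pattern are assumptions chosen for illustration, and normalizing the accumulated cost only at the end is a common heuristic rather than a confirmed detail of the paper.

```python
import math


def subsequence_dtw(query, document):
    """Find the span of `document` that best matches `query`.

    Both arguments are sequences of equal-dimension feature vectors (tuples).
    Returns (normalized_cost, start, end), with start/end inclusive indices
    into `document`. Hypothetical helper, for illustration only.
    """
    n, m = len(query), len(document)
    inf = math.inf
    # cost[i][j]: accumulated cost of aligning query[:i] with a span ending at document[j-1]
    cost = [[inf] * (m + 1) for _ in range(n + 1)]
    length = [[0] * (m + 1) for _ in range(n + 1)]   # number of steps on the best path
    start = [[0] * (m + 1) for _ in range(n + 1)]    # document index where that path begins
    for j in range(m + 1):
        cost[0][j] = 0.0      # free start: the match may begin anywhere in the document
        start[0][j] = j
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            d = math.dist(query[i - 1], document[j - 1])
            # Diagonal is listed first so that ties prefer it.
            pi, pj = min(((i - 1, j - 1), (i - 1, j), (i, j - 1)),
                         key=lambda p: cost[p[0]][p[1]])
            cost[i][j] = d + cost[pi][pj]
            length[i][j] = length[pi][pj] + 1
            start[i][j] = start[pi][pj]
    # Free end: pick the end column with the lowest path-length-normalized cost.
    best_j = min(range(1, m + 1), key=lambda j: cost[n][j] / length[n][j])
    return cost[n][best_j] / length[n][best_j], start[n][best_j], best_j - 1


if __name__ == "__main__":
    query = [(1.0,), (2.0,), (3.0,)]
    document = [(9.0,), (9.0,), (1.0,), (2.0,), (3.0,), (9.0,)]
    print(subsequence_dtw(query, document))
```

Dividing by the path length keeps long and short alignments comparable, which matters when a detection threshold is swept to draw a ROC curve.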
Improved Language Identification in Presence of Speech Coding

Automatically identifying the language being spoken from speech plays a vital role in operating multilingual speech processing applications. The rapid growth in the use of mobile communication devices has created the need to operate all speech processing applications in mobile environments. Degradation in the performance of speech processing applications is mainly due to varying background environments, speech coding and transmission errors. In this work, we focus on developing a language identification system robust to degradations in coding environments in the Indian scenario. Sonorant regions of speech are the regions that are perceptually loud and carry a clear pitch. The quality of coded speech in high sonority regions is high compared to less sonorant regions. Spectral features (MFCC) extracted from high sonority regions of speech are used for language identification. A GMM-UBM based modelling technique is employed to develop the language identification (LID) system. The present study is carried out on the IITKGP-MLILSC speech database.

Ravi Kumar Vuddagiri, Hari Krishna Vydana, Jiteesh Varma Bhupathiraju, Suryakanth V. Gangashetty, Anil Kumar Vuppala
SHIM: A Novel Influence Maximization Algorithm for Targeted Marketing

Influence maximization is the problem of finding a set of k users in a social network, such that by targeting these k users one can maximize the spread of influence in the network. Recently a new type of social network has come into existence on platforms like Zomato and Yelp, where people can publish reviews of local businesses like restaurants, hotels, salons etc. Such social networks can help owners of local businesses make intelligent business decisions through the use of Targeted Marketing. In this paper we present the Spread Heuristic based Influence Maximization (SHIM) algorithm, a novel algorithm that uses a heuristic approach to maximize the influence spread every time a node is added to the set of influential nodes. We also introduce a new method to find the information-propagation probability based on attributes of the user. We test the proposed algorithm on the academic dataset of Yelp, and a comprehensive performance study shows that the SHIM algorithm achieves greater influence spread than several other algorithms.

Abhishek Gupta, Tushar Gupta
An Optimal Path Planning for Multiple Mobile Robots Using AIS and GA: A Hybrid Approach

Designing proficient control algorithms for mobile robot navigation in an unknown and changing environment with obstacles and walls is a complicated task. The objective of building the intelligent planner is to plan actions for multiple mobile robots so that they coordinate with one another and achieve the global goal while avoiding static and dynamic obstacles. This paper demonstrates a hybrid method combining two optimization techniques, Artificial Immune System (AIS) and Genetic Algorithm (GA). The capability of overcoming the shortcomings of the individual algorithms without losing their advantages makes hybrid techniques superior to stand-alone ones. The main objective is to improve on the path planning results obtained with AIS and GA separately. The hybridization has two phases: first, AIS enhances the local searching ability; second, to add stochasticity, the last generation of AIS, instead of a random population, is accepted as input to the GA stage of the hybrid AIS-GA. From the results and observations, it can be inferred that the proposed algorithm is able to efficiently explore the unknown environment by learning from past behavior towards reaching the target. The results obtained from the hybrid algorithm are compared with those of AIS and GA and found to be more efficient in terms of convergence speed and the time taken to reach the target, making it a promising approach for solving the mobile robot path planning problem.

Mohit Ranjan Panda, Rojalina Priyadarshini, Saroj Pradhan
Metaheuristic Optimization Using Sentence Level Semantics for Extractive Document Summarization

Multi-document summarization is the process of automatic creation of a summary of one or more text documents. We developed a multi-document summarization system which generates an extractive generic summary with maximum relevance and minimum redundancy. To achieve this, four features associated with sentences that can influence the summarization process are extracted. It is difficult to find appropriate weights for the features that lead to good results. We propose a metaheuristic optimization based on a solution population with multiple objective functions. The objective functions used take care of both the statistical and semantic aspects of the documents. Our population-based optimization converges rapidly to produce candidate sentences for the summary. Evaluation of the proposed system is performed on the DUC 2002 dataset using the ROUGE toolkit. Experimental results show that our system outperforms state-of-the-art works in terms of Recall and Precision.

P. S. Premjith, Ansamma John, M. Wilscy
Circulant Singular Value Decomposition Combined with a Conventional Neural Network to Improve the Hake Catches Prediction

This paper presents one-step-ahead forecasting of time series based on Singular Value Decomposition of a circulant trajectory matrix combined with a conventional nonlinear prediction method. The catches of a fishery resource were used to evaluate the proposal, because of the great importance of this resource in the economy of a country and because its high variability presents difficulties in forecasting; the catches of hake from January 1963 to December 2008 along the Chilean coast ($$30^{\circ }\mathrm{S}$$–$$40^{\circ }\mathrm{S}$$) are the application data. The forecasting strategy is presented in two stages: preprocessing and prediction. In the first stage, the Singular Value Decomposition of a circulant matrix (CSVD) resulting from mapping the time series is applied to extract the components; after decomposition and grouping, the interannual and annual components are obtained. In the second stage, a conventional Artificial Neural Network (ANN) is implemented to predict the extracted components. The evaluation of the results shows a high prediction accuracy for the strategy based on the CSVD-ANN combination. In addition, the results were compared with conventional nonlinear prediction based on an Autoregressive Neural Network. The improvement in prediction accuracy from using the proposed decomposition strategy was demonstrated.

Lida Barba, Nibaldo Rodríguez, Diego Barba
To Optimize Graph Based Power Iteration for Big Data Based on MapReduce Paradigm

The next big thing in the IT world is Big Data. The values generated from storing and processing Big Data cannot be analyzed using traditional computing techniques. The main aim of this paper is to design a scalable machine learning algorithm that scales up and speeds up a clustering algorithm without losing its accuracy. Clustering using power iteration is fast and scalable. However, it requires matrix computation, which makes the algorithm infeasible for Big Data. Moreover, the power method converges slowly, depending on the eigenvector. Hence, in this paper the convergence factor is investigated by applying a modified constraint that minimizes the computational cost by making the algorithm converge quickly. The proposed algorithm is verified in a MapReduce parallel environment for Big Data using datasets of different sizes and different numbers of nodes in the cluster, with speedup, scalability, and efficiency as the indicators. The performance of the proposed algorithm is shown with respect to the execution time and the number of nodes. The results show that the proposed method is feasible and valid. It improves the overall performance and efficiency of the algorithm and can meet the needs of large scale processing.

Dhanapal Jayalatchumy, Perumal Thambidurai
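
For readers unfamiliar with the power method mentioned in the abstract above, the following is a minimal single-machine sketch of plain power iteration with a convergence threshold. It is not the authors' modified constraint and not a MapReduce implementation; the function name `power_iteration`, the L1 normalization, and the default tolerance are assumptions made for illustration, and the matrix is assumed to be non-negative so that the normalizer is non-zero.

```python
def power_iteration(matrix, tol=1e-10, max_iter=1000):
    """Approximate the dominant eigenvector of a square non-negative matrix.

    Returns (vector, iterations_used). The vector is L1-normalized.
    Hypothetical helper, for illustration only.
    """
    n = len(matrix)
    v = [1.0 / n] * n
    for iteration in range(1, max_iter + 1):
        # One matrix-vector product per iteration: this is the expensive step.
        w = [sum(row[j] * v[j] for j in range(n)) for row in matrix]
        norm = sum(abs(x) for x in w)
        w = [x / norm for x in w]
        # Stop when successive iterates barely change.
        if max(abs(a - b) for a, b in zip(w, v)) < tol:
            return w, iteration
        v = w
    return v, max_iter


if __name__ == "__main__":
    vector, iterations = power_iteration([[2.0, 0.0], [0.0, 1.0]])
    print(vector, iterations)
```

The number of iterations needed depends on the ratio between the two largest eigenvalue magnitudes: the closer that ratio is to 1, the slower the convergence, which is why stopping criteria attract attention in large-scale settings.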
Complex Transforms

In this research paper, motivated by the concept of the complex hypercube, a novel class of complex Hadamard matrices is proposed. Based on this class of matrices, a novel transform, called the complex Hadamard transform, is discussed. In the same spirit, other complex transforms such as the complex Haar transform are proposed. It is expected that these novel complex transforms will find many applications. The associated complex-valued orthogonal functions are also of theoretical interest.

Garimella Rama Murthy, Tapio Saramaki
A New Multivariate Time Series Transformation Technique Using Closed Interesting Subspaces

Subspace clustering detects the clusters that are existing in the subspaces of the feature space. Density based subspace clustering defines clusters as regions of high density existing in subspaces of multidimensional datasets. This paper discusses the concept of closed interesting subspaces under density divergence context for multivariate datasets and proposes an algorithm to transform the multivariate time series to a symbol sequence using the closed interesting subspaces. The proposed transformation allows the applicability of any of the symbolic sequential mining algorithms to efficiently extract sequential patterns which capture the interdependencies and co-variations among groups of time series variables. The multivariate time series transformation technique is explained using a sample dataset. It is evaluated using a real world weather dataset obtained from Cambridge University. The representation power of the closed interesting subspaces and maximal interesting subspaces in transforming multivariate time series is compared.

Sirisha G.N.V.G., Shashi M.
S2S: A Novel Approach for Source to Sink Node Communication in Wireless Sensor Networks

In Wireless Sensor Networks (WSNs), sensor nodes are deployed in various geographical areas. The main problem is collecting data from source nodes to the sink node in an energy-efficient way. In the data collection scenario, it is also challenging to reduce the inter-cluster communication cost, which can balance the network traffic. To address these challenges, we propose a source to sink node (S2S) communication algorithm. To reduce the communication overhead and minimize the delay, it uses a Cluster-Head (CH) to CH communication method in which one CH forms a coalition with another CH based on the distance of the sink node from each CH. The simulation results, validated through MATLAB, indicate better performance of our approach compared to existing approaches.

Chhabi Rani Panigrahi, Joy Lal Sarkar, Bibudhendu Pati, Himansu Das
Establishing Equivalence of Expressions: An Automated Evaluator Designer’s Perspective

Automated assessment of students’ programs has become essential in the institutions where the intake of students is large to ensure fast and consistent evaluation. An automated evaluator compares a program written by a student with a model program supplied by the instructor and tries to evaluate the student’s performance. In course of checking similarity between the two programs, the evaluator may sometimes have to determine whether some expression written in the student program assumes the same value as that of an equivalent expression in the model. Thus, determining equivalence between pairs of expressions is at the core of designing automated evaluators. This paper discusses different methods for determining equivalence between expressions involving various datatypes. Specifically, it proposes a novel technique to determine equivalence between expressions involving floating point and transcendental numbers, which have not been addressed in earlier literature to the best of our knowledge.

K. K. Sharma, Kunal Banerjee, Chittaranjan Mandal
Data Driven Modelling for the Estimation of Probability of Loss of Control of Typical Fighter Aircraft

Loss of control of aircraft is one of the catastrophic safety-critical events in the aerospace domain, and usually results in risks of loss of human lives and/or environmental hazards. This undesired event could be triggered at any level of hardware and/or software in a digital fly-by-wire fighter aircraft. The contributing factors for this undesirable event and the interrelationships among the basic events must be carefully accounted for when estimating the loss of control of the fighter aircraft probabilistically. Components which have the potential to cause failures must be treated carefully, by properly considering the failure modes of those components. This paper presents a data driven methodology to estimate the probability of loss of control (PLOC) of the aircraft, considering all the interdependent components along with the associated failure modes that have the potential to trigger the undesired event, ‘Loss of Control of Aircraft’. The approach presented here would serve as a guideline for estimating the PLOC of any type of aircraft.

Antony Gratas Varuvel
Ranking Business Scorecard Factor Using Intuitionistic Fuzzy Analytical Hierarchy Process with Fuzzy Delphi Method in Automobile Sector

The business scorecard is an integral part of human resource management in an industry or an organization and is used to strengthen the functionality of the organization. It plays a vital role in promoting the business. Exploring the uncertainty creeping into various factors in the business scorecard is an interesting challenge. In this work, we applied the Intuitionistic Fuzzy Analytical Hierarchy Process (IFAHP) with the Fuzzy Delphi method to analyse the uncertainty factors in the business scorecard. We also explore the importance of various factors by ranking them using IFAHP with the Fuzzy Delphi method. The ranking scores are further used to strengthen the business scorecard.

S. Rajaprakash, R. Ponnusamy
Text and Citations Based Cluster Analysis of Legal Judgments

Developing efficient approaches to extract relevant information from a collection of legal judgments is a research issue. Legal judgments contain citations in addition to text. It can be noted that the link information has been exploited to build efficient search systems in web domain. Similarly, the citation information in legal judgments could be utilized for efficient search. In this paper, we have proposed an approach to find similar judgments by exploiting citations in legal judgments through cluster analysis. As several judgments have few citations, a notion of paragraph link is employed to increase the number of citations in the judgment. User evaluation study on the judgment dataset of Supreme Court of India shows that the proposed clustering approach is able to find similar judgments by exploiting citations and paragraph links. Overall, the results show that citation information in judgments can be exploited to establish similarity between judgments.

K. Raghav, Pailla Balakrishna Reddy, V. Balakista Reddy, Polepalli Krishna Reddy
Vision-Based Human Action Recognition in Surveillance Videos Using Motion Projection Profile Features

Human Action Recognition (HAR) is a dynamic research area in pattern recognition and artificial intelligence. The area of human action recognition consistently focuses on changes in the scene of a subject with reference to time, since motion information can prudently depict the action. This paper presents a novel framework for action recognition based on Motion Projection Profile (MPP) features of the difference image, representing various levels of a person’s posture. The motion projection profile features consist of the count of moving pixels in each row, column and diagonal (left and right) of the difference image, and give adequate motion information to recognize the instantaneous posture of the person. The experiments are carried out using the WEIZMANN and AUCSE datasets, and the extracted features are modeled by a GMM classifier for recognizing human actions. In the experimental results, the GMM exhibits the effectiveness of the proposed method with an overall accuracy rate of 94.30 % for the WEIZMANN dataset and 92.49 % for the AUCSE dataset.

J. Arunnehru, M. Kalaiselvi Geetha
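
The motion projection profile described in the abstract above (counts of moving pixels per row, column and both diagonals of a difference image) can be sketched roughly as follows. This is an illustration only: the function name `motion_projection_profile`, the grayscale list-of-lists frame format, and the fixed `threshold` used to decide that a pixel moved are assumptions, not details taken from the paper.

```python
def motion_projection_profile(prev_frame, frame, threshold=25):
    """Count moving pixels along rows, columns and both diagonal directions.

    Frames are equally sized 2-D lists of grayscale intensities.
    Returns (rows, cols, diag, anti_diag). Hypothetical helper, for illustration only.
    """
    height, width = len(frame), len(frame[0])
    rows = [0] * height
    cols = [0] * width
    diag = [0] * (height + width - 1)        # indexed by i - j, shifted to be non-negative
    anti_diag = [0] * (height + width - 1)   # indexed by i + j
    for i in range(height):
        for j in range(width):
            # A pixel counts as moving when the frame difference exceeds the threshold.
            if abs(frame[i][j] - prev_frame[i][j]) > threshold:
                rows[i] += 1
                cols[j] += 1
                diag[i - j + width - 1] += 1
                anti_diag[i + j] += 1
    return rows, cols, diag, anti_diag


if __name__ == "__main__":
    previous = [[0, 0, 0], [0, 0, 0], [0, 0, 0]]
    current = [[255, 0, 0], [0, 255, 0], [0, 0, 0]]
    print(motion_projection_profile(previous, current))
```

Concatenating the four count vectors gives one fixed-length feature vector per frame pair, which is the kind of input a classifier such as a GMM can be trained on.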
A Web-Based Intelligent Spybot

Robots have been making inroads into human life in almost all spheres. Spybots can be immensely useful for unmanned surveillance and covert spying operations. If online streaming of the spied data can be made feasible, that would be an added advantage. In this paper, we propose an unmanned Spybot that can be controlled remotely by web-page based commands over a WiFi network. It can also stream back the spied data, which could be video or images, over the WiFi network. A prototype Spybot is developed. Users can give the control instructions from a web page. The WiFi module on board the Spybot receives these commands from the web page and passes them to the microcontroller. The microcontroller interprets the control commands and generates control signals to operate the Spybot as per the user’s commands. The captured images and video data are sent back to a smartphone over the WiFi network. Performance evaluation is carried out to measure various limits of operation of the Spybot. The results are encouraging. The proposed Spybot has potential applications for military, security forces and surveillance purposes.

Pruthvi Raj, N. Rajasree, T. Jayasri, Yash Mittal, V. K. Mittal
Evidential Link Prediction Based on Group Information

Link prediction has become a common way to infer new associations among actors in social networks. Most existing methods focus on local and global information, neglecting the involvement of the actors in social groups. Further, the prediction process is characterized by high complexity and uncertainty. In order to address these problems, we first introduce a new evidential weighted version of the social network graph-based model that encapsulates the uncertainty at the edge level using the belief function framework. Second, we use this graph-based model to provide a novel approach for link prediction that takes into consideration both group information and uncertainty in social networks. The performance of the method is evaluated on a real world social network with group information and shows interesting results.

Sabrine Mallek, Imen Boukhris, Zied Elouedi, Eric Lefevre
Survey of Social Commerce Research

Social commerce is a field that is growing rapidly with the rise of Web 2.0 technologies. This paper presents a review of existing research on this topic to ensure a comprehensive understanding of social commerce. First, we explore the evolution of social commerce from its marketing origins. Next, we examine various definitions of social commerce and the motivations behind it. We also investigate its advantages and disadvantages for both businesses and customers. Then, we explore two major tools important for social commerce: Sentiment Analysis and Social Network Analysis. By delving into well-known research papers in Information Retrieval and Complex Networks, we seek to present a survey of current research in multifarious aspects of social commerce to the scientific research community.

Anuhya Vajapeyajula, Priya Radhakrishnan, Vasudeva Varma
Refine Social Relations and Differentiate the Same Friends’ Influence in Recommender System

Social relations have been widely used in recommender systems to improve the accuracy of recommendations. Most works consider the influence of all friends simultaneously when recommending, and the same friend always has equal influence on every item. However, existing models fail to be consistent with real-life recommendations, because in real life only some of our friends can affect our decisions, and we are not influenced by the same friends on everything. So in this paper, we use a machine learning approach to infer the truly influential friends in a mixed friends circle. For different items, we use relevance to differentiate the same friend’s influence. A model, Topic-based Friends Refining Probabilistic Matrix Factorization (TFR-PMF), is proposed to check the performance of our theory. Through experiments on a public data set, we demonstrate that our method can increase the accuracy of recommendation by 6.5 % compared with models that do not filter unrelated friends’ influence.

Haitao Zhai, Jing Li
User Similarity Adjustment for Improved Recommendations

Recommender systems are becoming more and more attractive in both research and commercial communities due to the information overload problem and the popularity of Internet applications. Collaborative Filtering, a popular branch of recommendation approaches, makes predictions based on historical data available in the system. In particular, user-based Collaborative Filtering largely depends on how users rate various items of the database, and the success of such a system largely relies on pairwise similarity between users. However, popular items may have a negative effect on choosing users similar to the target user. The proposed work, namely User Similarity Adjustment based on Item Diversity (USA_ID), is designed to achieve personalized recommendations by modifying user similarity scores, for the purpose of reducing the negative effects of popular items in the user-based Collaborative Filtering framework. A recommender system typically focuses exclusively on achieving accurate recommendations, i.e., providing the most relevant items for the needs of a user. From the users’ perspective, however, monotonous recommendations are not interesting even if they are accurate. Whilst much research effort is spent on improving the accuracy of recommendations, less effort is spent on analyzing the usefulness of recommendations. Novelty and Diversity have been identified as key dimensions of recommendation utility. It has been made clear that greater accuracy leads to lower diversity, which results in an accuracy-diversity trade-off in personalized recommender systems. The proposed work provides an approach to increase the utility of a recommender system by improving accuracy as well as diversity. Experiments are conducted on the benchmark data set MovieLens, and the results show the efficiency of the proposed approach in improving the quality of predictions.

R. Latha, R. Nadarajan
Enhancing Recommendation Quality of a Multi Criterion Recommender System Using Genetic Algorithm

The recommender system (RS), the most successful application of Web personalization, helps in alleviating the information overload present in large information spaces. It attempts to identify the most relevant items for users based on their preferences. Generally, users are allowed to provide overall ratings on experienced items, but many online systems allow users to provide ratings on different criteria. Several attempts have been made in the past to design an RS focusing on the ratings of a single criterion. However, investigation of the utility of multi-criterion recommender systems in online environments is still in its infancy. We propose a multi-criterion RS that leverages information derived from multi-criterion ratings through a genetic algorithm. Experimental results are presented to demonstrate the effectiveness of the proposed recommendation strategy using the well-known Yahoo! Movies dataset.

Rubina Parveen, Vibhor Kant, Pragya Dwivedi, Anant K. Jaiswal
Adapting PageRank to Position Events in Time

In this paper, we order events in time by using evidence present in their partial orders. We propose an algorithm named TimeRank, a variant of PageRank, for this task. PageRank operates on the hyperlink graph and orders the web pages according to their importance. We identify limitations of PageRank in the context of temporally ordering the nodes. We draw an analogy between the notion of importance in PageRank to the notion of recency in TimeRank. We evaluate TimeRank using the Citation Graph of scientific publications of physics and propose a baseline method to compare TimeRank and PageRank. The baseline method ranks the nodes according to their number of immediate predecessors without considering the higher order transitive relations among the events. Evaluation results suggest that TimeRank outperforms both the baseline method and PageRank in this task.

Abhijit Sahoo, Swapnil Hingmire, Sutanu Chakraborti
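
The abstract above compares against standard PageRank and a predecessor-count baseline. A minimal sketch of those two reference methods on a toy citation graph is given below; it does not implement TimeRank, whose update rule is not described here. The function names, the damping factor of 0.85, the fixed iteration count, and the reading of "number of immediate predecessors" as in-degree are all assumptions made for illustration.

```python
def pagerank(graph, damping=0.85, iterations=100):
    """Standard PageRank by power iteration.

    `graph` maps every node to the list of nodes it links to (cites).
    Returns a dict of scores that sums to 1. Hypothetical helper, for illustration only.
    """
    nodes = list(graph)
    n = len(nodes)
    rank = {v: 1.0 / n for v in nodes}
    for _ in range(iterations):
        # Mass held by nodes without out-links is spread uniformly.
        dangling = sum(rank[v] for v in nodes if not graph[v])
        new_rank = {v: (1.0 - damping) / n + damping * dangling / n for v in nodes}
        for v in nodes:
            for w in graph[v]:
                new_rank[w] += damping * rank[v] / len(graph[v])
        rank = new_rank
    return rank


def predecessor_count(graph):
    """Baseline sketch: count the immediate in-links of each node (assumed reading)."""
    counts = {v: 0 for v in graph}
    for v in graph:
        for w in graph[v]:
            counts[w] += 1
    return counts


if __name__ == "__main__":
    # Toy citation graph: paper "c" cites "a" and "b"; "b" cites "a".
    citations = {"c": ["a", "b"], "b": ["a"], "a": []}
    print(pagerank(citations))
    print(predecessor_count(citations))
```

On this toy graph both methods order the papers the same way; they differ on larger graphs because PageRank also propagates score through chains of citations, while the count looks only one step back.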
After You, Who? Data Mining for Predicting Replacements

This paper proposes a new class of data mining problems in which agents replace their current object (predecessor) by another object (replacement or successor); the problem is to discover the knowledge used by the agents in identifying suitable successors. While such replacement data is available in many practical applications, in this paper we explore a problem in HR analytics, viz., replacing a person in a key position in a project by the most suitable person among the other employees. We propose unsupervised (distance-based) algorithms for finding suitable replacements. We also apply several standard classification techniques. This paper is the first to apply metric learning algorithms to a problem in HR analytics. We compare the approaches using a real-life replacement dataset from a multinational IT company. Results show that metric learning is a promising approach that captures the implicit knowledge for replacement identification.

Girish Keshav Palshikar, Kuleshwar Sahu, Rajiv Srivastava
Tri-Axial Vibration Analysis Using Data Mining for Multi Class Fault Diagnosis in Induction Motor

Induction motor frame vibration is believed to contain crucial information which not only helps in detecting faults but is also capable of diagnosing the different types of faults that occur. The vibration data can be in the radial, axial and tangential directions. The frequency contents of the three directions are compared and analyzed using data mining techniques to find the most informative vibration data and to extract the vital information that can be effectively used to diagnose multiple induction motor faults. The vibration data is decomposed using powerful signal processing tools like the Continuous Wavelet Transform (CWT) and the Hilbert Transform (HT). Statistical features are extracted from the decomposition coefficients obtained. Finally, data mining is applied to extract knowledge. Three types of data mining tools are deployed: sequential greedy search (GS), heuristic genetic algorithm (GA) and deterministic rough set theory (RST). The classification accuracy is judged by five types of classifiers: k-Nearest Neighbors (k-NN), Multilayer Perceptron (MLP), Radial Basis Function (RBF), Support Vector Machine (SVM), and Simple Logistic. The benefits of using all the tri-axial data combined for vibration monitoring and diagnostics are also explored. The results indicate that the combined tri-axial vibration provides the most informative knowledge for multi-class fault diagnosis in induction motors. However, it was also found that multi-class fault diagnosis can be done quite effectively using only the tangential vibration signal with the help of data mining knowledge discovery.

Pratyay Konar, Parth Sarathi Panigrahy, Paramita Chattopadhyay
An Efficient Text Compression Algorithm - Data Mining Perspective

The paper explores a novel compression perspective of Data Mining. Frequent Pattern Mining, an important phase of Association Rule Mining, is employed in the process of Huffman Encoding for Lossless Text Compression. The conventional Apriori algorithm has been refined to employ efficient pruning strategies to optimize the number of patterns employed in encoding. Detailed simulations of the proposed algorithms in relation to Conventional Huffman Encoding have been done over benchmark datasets, and the results indicate significant gains in compression ratio.

C. Oswald, Anirban I. Ghosh, B. Sivaselvan
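
As background for the abstract above, the conventional character-level Huffman encoding that serves as its point of comparison can be sketched as below. This is the textbook algorithm, not the authors' pattern-based variant; the function name `huffman_code` and the choice to return a symbol-to-bitstring dictionary are assumptions made for illustration.

```python
import heapq
from collections import Counter


def huffman_code(text):
    """Build a prefix-free binary code for the symbols of `text`.

    Returns a dict mapping each symbol to its bit string.
    Hypothetical helper, for illustration only.
    """
    if not text:
        return {}
    freq = Counter(text)
    if len(freq) == 1:
        # A single distinct symbol still needs one bit per occurrence.
        return {next(iter(freq)): "0"}
    # Heap entries: (weight, unique tie-breaker, partial code table of the subtree).
    heap = [(count, idx, {sym: ""}) for idx, (sym, count) in enumerate(freq.items())]
    heapq.heapify(heap)
    next_id = len(heap)
    while len(heap) > 1:
        w1, _, codes1 = heapq.heappop(heap)   # two lightest subtrees
        w2, _, codes2 = heapq.heappop(heap)
        merged = {s: "0" + c for s, c in codes1.items()}
        merged.update({s: "1" + c for s, c in codes2.items()})
        heapq.heappush(heap, (w1 + w2, next_id, merged))
        next_id += 1
    return heap[0][2]


if __name__ == "__main__":
    sample = "aaaabbc"
    table = huffman_code(sample)
    encoded = "".join(table[ch] for ch in sample)
    print(table, len(encoded), "bits")
```

In a pattern-based scheme the symbols passed to such a routine would be frequent multi-character patterns rather than single characters, so fewer and more skewed symbols can shorten the encoded text.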
Identifying Semantic Events in Unstructured Text

Semantics has always been considered the hidden treasure of texts, accessible only to humans. Artificial intelligence struggles to enrich machines with human features, therefore accessing this treasure and sharing it with computers is one of the main challenges that the natural language domain faces nowadays. This paper represents a further step in this direction, by proposing an automatic approach to extract information about events from unstructured texts by using semantic role labeling.

Diana Trandabăț
Predicting Treatment Relations with Semantic Patterns over Biomedical Knowledge Graphs

Identifying new potential treatment options (say, medications and procedures) for known medical conditions that cause human disease burden is a central task of biomedical research. Since all candidate drugs cannot be tested with animal and clinical trials, in vitro approaches are first attempted to identify promising candidates. Even before this step, due to recent advances, in silico or computational approaches are also being employed to identify viable treatment options. Generally, natural language processing (NLP) and machine learning are used to predict specific relations between any given pair of entities using the distant supervision approach. In this paper, we report preliminary results on predicting treatment relations between biomedical entities purely based on semantic patterns over biomedical knowledge graphs. As such, we refrain from explicitly using NLP, although the knowledge graphs themselves may be built from NLP extractions. Our intuition is fairly straightforward – entities that participate in a treatment relation may be connected using similar path patterns in biomedical knowledge graphs extracted from scientific literature. Using a dataset of treatment relation instances derived from the well known Unified Medical Language System (UMLS), we verify our intuition by employing graph path patterns from a well known knowledge graph as features in machine learned models. We achieve a high recall (92 %) but precision, however, decreases from 95 % to an acceptable 71 % as we go from uniform class distribution to a ten fold increase in negative instances. We also demonstrate models trained with patterns of length $$\le 3$$ result in statistically significant gains in F-score over those trained with patterns of length $$\le 2$$. Our results show the potential of exploiting knowledge graphs for relation extraction and we believe this is the first effort to employ graph patterns as features for identifying biomedical relations.

Gokhan Bakal, Ramakanth Kavuluru
A Supervised Framework for Classifying Dependency Relations from Bengali Shallow Parsed Sentences

Natural Language Processing, one of the contemporary research areas, has adopted parsing technologies for various languages across the world for different objectives. In the present task, a new approach has been introduced for classifying the dependency parsed relations for a morphologically rich and free-phrase-ordered Indian language like Bengali. The pairs of dependency parsed relations (also referred to as kaarakas, ‘cases’) are classified based on different features such as vibhaktis (inflections), Part-of-Speech (POS), punctuation, gender, number and post-position. It is observed that the consecutive and non-consecutive occurrences of such relations play a vital role in the classification. We employed three different machine-learning classifiers, namely NaiveBayes, Sequential Minimal Optimization (SMO) and Conditional Random Field (CRF), which obtained average F-Scores of 0.895, 0.869 and 0.697, respectively, for classifying relation pairs of three primary kaarakas and one primary vibhakti relation. We have also conducted an error analysis for these primary relations using confusion matrices.

Anupam Mondal, Dipankar Das
Learning Clusters of Bilingual Suffixes Using Bilingual Translation Lexicon

By learning bilingual suffixation operations from translations, using an existing bilingual lexicon with near translation forms, we can improve the lexicon's coverage and hence deal with out-of-vocabulary (OOV) entries. From this perspective, we identify bilingual stems, their bilingual morphological extensions (bilingual suffixes) and subsequently clusters of bilingual suffixes using known translation forms seen in an existing bilingual translation lexicon. We rely on clustering to enable safer translation generalisations. The degree of co-occurrence between two bilingual morphological extensions with reference to common bilingual stems determines whether they should fall in the same cluster. Results are discussed for the language pairs English-Portuguese (EN-PT) and English-Hindi (EN-HI).
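
To make the notions of bilingual stem and bilingual suffix concrete, here is a minimal sketch that splits pairs of near translation forms at their longest common prefixes and records which stems each bilingual suffix attaches to. The EN-PT example entries, the `min_stem` threshold, and the function names are assumptions for illustration only; the abstract does not describe the authors' actual segmentation or clustering procedure.

```python
from collections import defaultdict


def common_prefix(a, b):
    """Longest common prefix of two strings."""
    n = 0
    while n < min(len(a), len(b)) and a[n] == b[n]:
        n += 1
    return a[:n]


def bilingual_split(pair_a, pair_b):
    """Split two translation pairs into one shared bilingual stem
    and the two bilingual suffixes left over on each pair."""
    src_stem = common_prefix(pair_a[0], pair_b[0])
    tgt_stem = common_prefix(pair_a[1], pair_b[1])
    stem = (src_stem, tgt_stem)
    suffix_a = (pair_a[0][len(src_stem):], pair_a[1][len(tgt_stem):])
    suffix_b = (pair_b[0][len(src_stem):], pair_b[1][len(tgt_stem):])
    return stem, suffix_a, suffix_b


def stems_by_suffix(lexicon, min_stem=3):
    """For every bilingual suffix, collect the bilingual stems it occurs with.
    Pairs whose shared stem is shorter than min_stem on either side are skipped."""
    table = defaultdict(set)
    for i, pair_a in enumerate(lexicon):
        for pair_b in lexicon[i + 1:]:
            stem, suf_a, suf_b = bilingual_split(pair_a, pair_b)
            if min(len(stem[0]), len(stem[1])) >= min_stem:
                table[suf_a].add(stem)
                table[suf_b].add(stem)
    return table


# Hypothetical EN-PT lexicon entries.
lexicon = [
    ("ensure", "assegurar"),
    ("ensuring", "assegurando"),
    ("declare", "declarar"),
    ("declaring", "declarando"),
]
table = stems_by_suffix(lexicon)
shared = table[("e", "r")] & table[("ing", "ndo")]
print(len(shared))
```

The size of `shared` is one simple way to read "degree of co-occurrence with reference to common bilingual stems": two bilingual suffixes that attach to many of the same stems are candidates for the same cluster.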

K. M. Kavitha, Luís Gomes, José Gabriel P. Lopes
Automatic Construction of Tamil UNL Dictionary

In this paper, we propose an automatic tool for creating dictionary entries of Tamil words for the Universal Networking Language (UNL). A dictionary plays a crucial role in many NLP applications, especially in machine translation (MT) systems. However, creating dictionary entries manually is a time-consuming process. Moreover, the UNL dictionary includes additional features such as semantic constraints and attributes. To address this complex task, we propose a domain-specific approach where the dictionary entries are created automatically using other word-based resources such as WordNet, bilingual dictionaries, and the UNL ontology. As the source of domain-specific words, we use domain-specific documents from the web. The resources used for extracting meaningful words from the documents are: a morphological analyzer, to extract the grammatical information of a given word; WordNet, to identify the semantics of the given word; and the UNL KB (Knowledge Base), to obtain the semantic constraints of a given word. Semantic constraints help to determine the tense, mood and aspect of the given word. Sometimes these semantic constraints may not be determined correctly by the automatic process. In such cases, a semantic-similarity-based filtering method built on the UNL ontology is used to remove the incorrect dictionary entries. Thus, this automatic dictionary tool handles words semantically and also improves the correctness of the dictionary.

Ganesh J, Ranjani Parthasarathi, Geetha T. V
A New Approach to Syllabification of Words in Gujarati

This paper presents a statistical approach for automatic syllabification of words in Gujarati. Gujarati is a resource-poor language, and hardly any work on its syllabification has been reported, to the best of our knowledge. Specifically, the lack of sufficient training data makes this task difficult to perform. A training corpus of 14 thousand Gujarati words is built, and a new approach to syllabification in Gujarati is tested on it. The maximum word-level and syllable-level accuracies achieved are 91.89 % and 98.02 %, respectively.

Harsh Trivedi, Aanal Patel, Prasenjit Majumder
A Support Vector Machine Based System for Technical Question Classification

This paper presents our attempt at developing a question classification system for the technical domain. A question classification system classifies a question into the type of answer it requires and therefore plays an important role in question answering. Although the task is quite popular in the general domain, we were unable to find any question classification system that classifies the questions of a technical subject. We defined a technical domain question taxonomy containing six classes. We manually created a dataset containing 1086 questions. Then we identified a set of features suitable for the technical domain. We observed that parse structure similarity plays an important role in this classification. To capture the parse tree similarity, we employed the tree kernel and proposed a level-wise matching approach. We used these features and the dataset in a support vector machine classifier to achieve 93.22 % accuracy.

Shlok Kumar Mishra, Pranav Kumar, Sujan Kumar Saha
Shared Task on Sentiment Analysis in Indian Languages (SAIL) Tweets - An Overview

Sentiment analysis on Twitter has been considered a vital task for a decade from various academic and commercial perspectives. Several works have addressed Twitter sentiment analysis or opinion mining for English, in contrast to the Indian languages. Here, we summarize the objectives and evaluation of the sentiment analysis task in tweets for three Indian languages, namely Bengali, Hindi and Tamil. This is the first attempt at a sentiment analysis task in the context of Indian language tweets. The main objective of this task was to classify the tweets into positive, negative, and neutral polarity. For training and testing purposes, tweets from each language were provided. Each of the participating teams was asked to submit two systems, constrained and unconstrained, for each of the languages. We ranked the systems based on their accuracy. A total of six teams submitted results, and the maximum accuracies achieved for Bengali, Hindi, and Tamil are 43.2 %, 55.67 %, and 39.28 %, respectively.

Braja Gopal Patra, Dipankar Das, Amitava Das, Rajendra Prasath
Sentiment Classification: An Approach for Indian Language Tweets Using Decision Tree

This paper describes the system we used for the Shared Task on Sentiment Analysis in Indian Languages (SAIL) Tweets at MIKE-2015. Twitter is one of the most popular platforms that allow users to share their opinions in the form of tweets. Since it restricts users to 140 characters, tweets are very short texts from which to analyze opinions and sentiments. We use a Twitter training dataset in an Indian language (Hindi) and apply data mining approaches for analyzing the sentiments. We used the state-of-the-art data mining tool Weka to automatically classify the sentiment of Hindi tweets into positive, negative or neutral.

Sudha Shanker Prasad, Jitendra Kumar, Dinesh Kumar Prabhakar, Sukomal Pal
Sentiment Classification for Hindi Tweets in a Constrained Environment Augmented Using Tweet Specific Features

India, a diverse country rich in spoken languages with around 23 official languages, has always offered a wide arena for NLP researchers. The increase in the availability of voluminous data in Indian languages in recent years has prompted researchers to explore the challenges in the Indian language domain. The proposed work explores sentiment analysis on Hindi tweets in a constrained environment and proposes a model for dealing with the challenges in extracting sentiment from Hindi tweets. The model has exhibited an average performance, with a cross-validation accuracy on the training data of around 56 % and a test accuracy of 43 %.

Manju Venugopalan, Deepa Gupta
AMRITA_CEN-NLP@SAIL2015: Sentiment Analysis in Indian Language Using Regularized Least Square Approach with Randomized Feature Learning

The present work was done as part of the shared task on Sentiment Analysis in Indian Languages (SAIL 2015), under the constrained category. The task is to classify Twitter data into three polarity categories: positive, negative and neutral. For training, Twitter datasets in three languages were provided: Hindi, Bengali and Tamil. In this shared task, ours is the only team that participated in all three languages. Each dataset contained three separate categories of Twitter data, namely positive, negative and neutral. The proposed method used binary features, statistical features generated from SentiWordNet, and word presence (a binary feature). Due to the sparse nature of the generated features, the input features were mapped to a random Fourier feature space to obtain a separation, and linear classification was performed using the regularized least squares method. The proposed method identified more negative tweets in the test data provided for Hindi and Bengali. In the test tweets for Tamil, positive tweets were identified more often than the other two polarity categories. Due to the lack of language-specific and sentiment-oriented features, neutral tweets were identified less often, which also caused misclassifications in all three polarity categories. This motivates us to take our research in this area forward with the proposed method.
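
The combination named in this abstract, a random Fourier feature map followed by a linear regularized least squares classifier, can be sketched in plain Python as follows. This is an illustrative simplification under stated assumptions: it uses a Gaussian kernel approximation with paired cosine and sine features, a binary task with labels +1 and -1, and toy XOR-style points. The kernel, the parameters `gamma`, `n_freq` and `lam`, and all function names are assumptions and are not taken from the paper, which addresses three polarity classes.

```python
import math
import random


def make_rff(dim, n_freq, gamma, seed=0):
    """Return a map x -> [cos(w.x), sin(w.x)] / sqrt(n_freq) whose inner
    products approximate the Gaussian kernel exp(-gamma * ||x - y||^2)."""
    rng = random.Random(seed)
    std = math.sqrt(2.0 * gamma)  # frequencies drawn from N(0, 2 * gamma * I)
    freqs = [[rng.gauss(0.0, std) for _ in range(dim)] for _ in range(n_freq)]
    scale = 1.0 / math.sqrt(n_freq)

    def transform(x):
        proj = [sum(w_i * x_i for w_i, x_i in zip(w, x)) for w in freqs]
        return [scale * math.cos(p) for p in proj] + [scale * math.sin(p) for p in proj]

    return transform


def solve(a, b):
    """Solve a @ x = b by Gaussian elimination with partial pivoting."""
    n = len(b)
    m = [row[:] + [b_i] for row, b_i in zip(a, b)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(m[r][col]))
        m[col], m[pivot] = m[pivot], m[col]
        for r in range(col + 1, n):
            factor = m[r][col] / m[col][col]
            for c in range(col, n + 1):
                m[r][c] -= factor * m[col][c]
    x = [0.0] * n
    for r in range(n - 1, -1, -1):
        x[r] = (m[r][n] - sum(m[r][c] * x[c] for c in range(r + 1, n))) / m[r][r]
    return x


def fit_rls(features, labels, lam):
    """Minimise ||Z w - y||^2 + lam * ||w||^2 via (Z^T Z + lam I) w = Z^T y."""
    d = len(features[0])
    gram = [[sum(z[i] * z[j] for z in features) + (lam if i == j else 0.0)
             for j in range(d)] for i in range(d)]
    rhs = [sum(z[i] * y for z, y in zip(features, labels)) for i in range(d)]
    return solve(gram, rhs)


def predict(weights, z):
    """Real-valued score; its sign is the predicted class."""
    return sum(w * v for w, v in zip(weights, z))


# Toy XOR-style data: not linearly separable in the original 2-D space.
points = [(0.0, 0.0), (0.0, 1.0), (1.0, 0.0), (1.0, 1.0)]
labels = [1.0, -1.0, -1.0, 1.0]
transform = make_rff(dim=2, n_freq=64, gamma=2.0)
features = [transform(p) for p in points]
weights = fit_rls(features, labels, lam=0.01)
print([predict(weights, z) > 0 for z in features])
```

The point of the feature map is that a purely linear least squares fit in the randomized space can separate data that no linear classifier separates in the original space; a multi-class version would typically fit one such weight vector per class.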

S. Sachin Kumar, B. Premjith, M. Anand Kumar, K. P. Soman
IIT-TUDA: System for Sentiment Analysis in Indian Languages Using Lexical Acquisition

Social networking platforms such as Facebook and Twitter have become very popular communication tools among online users for sharing and expressing opinions and sentiment about the surrounding world. The availability of such opinionated text content has drawn much attention in the field of Natural Language Processing. Compared to other languages, such as English, little work has been done for Indian languages in this domain. In this paper, we present our contribution to classifying sentiment polarity for Indian tweets as part of the shared task on Sentiment Analysis in Indian Languages (SAIL 2015). With the support of distributional thesauri (DTs) and sentence-level co-occurrences, we expand existing Indian sentiment lexicons to reach a higher coverage of sentiment words. Our system achieves an accuracy of 43.20 % and 49.68 % for the constrained submission, and an accuracy of 42.0 % and 46.25 % for the unconstrained setup, for Bengali and Hindi, respectively. This puts our system in first position for Bengali and in third position for Hindi among six participating teams.

Ayush Kumar, Sarah Kohail, Asif Ekbal, Chris Biemann
A Sentiment Analysis System for Indian Language Tweets

This paper reports on our work in the MIKE 2015 Shared Task on Sentiment Analysis in Indian Languages (SAIL) Tweets. We submitted runs for Hindi and Bengali. A multinomial Naïve Bayes based model has been used to implement our system. The system has been trained and tested on the dataset released for the SAIL TWEET CONTEST 2015. Our system obtains accuracies of 50.75 %, 48.82 %, 41.20 %, and 40.20 % for the Hindi constrained, Hindi unconstrained, Bengali constrained and Bengali unconstrained runs, respectively.
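
For readers unfamiliar with the model family named here, the following is a minimal multinomial Naive Bayes sketch with add-one (Laplace) smoothing over bag-of-words tokens. The toy English tokens, the label names, and the choice to ignore unseen words at prediction time are assumptions for illustration; the abstract does not describe the authors' features, smoothing, or preprocessing.

```python
import math
from collections import Counter, defaultdict


def train_multinomial_nb(docs, labels, alpha=1.0):
    """Estimate log priors and smoothed log likelihoods per class.
    docs is a list of token lists; alpha is the additive smoothing constant."""
    class_counts = Counter(labels)
    word_counts = defaultdict(Counter)
    vocab = set()
    for tokens, label in zip(docs, labels):
        word_counts[label].update(tokens)
        vocab.update(tokens)
    model = {}
    for label, n_docs in class_counts.items():
        total = sum(word_counts[label].values()) + alpha * len(vocab)
        log_prior = math.log(n_docs / len(labels))
        log_lik = {w: math.log((word_counts[label][w] + alpha) / total) for w in vocab}
        model[label] = (log_prior, log_lik)
    return model


def classify(model, tokens):
    """Return the class with the highest posterior log score.
    Tokens outside the training vocabulary are ignored."""
    def score(label):
        log_prior, log_lik = model[label]
        return log_prior + sum(log_lik[t] for t in tokens if t in log_lik)

    return max(model, key=score)


# Hypothetical toy training data standing in for labelled tweets.
docs = [
    ["good", "movie"],
    ["great", "good", "song"],
    ["bad", "movie"],
    ["terrible", "bad", "plot"],
    ["movie", "released", "today"],
]
labels = ["positive", "positive", "negative", "negative", "neutral"]
model = train_multinomial_nb(docs, labels)
print(classify(model, ["good", "song"]))
```

Working in log space avoids numerical underflow when many word probabilities are multiplied, and the smoothing constant keeps a single unseen word from forcing a class probability to zero.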

Kamal Sarkar, Saikat Chakraborty
AMRITA-CEN@SAIL2015: Sentiment Analysis in Indian Languages

The present work was done as part of the shared task on Sentiment Analysis in Indian Languages (SAIL 2015), in the constrained category. Social media allows people to create, share and exchange opinions from many perspectives, such as product reviews and movie reviews, and also to share their thoughts through personal blogs and many other platforms. The data available on the internet is huge and is increasing exponentially. Due to social media, the importance of categorizing this data has also increased, and it is very difficult to categorize such huge volumes of data manually. Hence, an improved machine learning algorithm is necessary for extracting the information. This paper deals with finding the sentiment of tweets in Indian languages. These sentiments are classified using various features extracted from words, binary features, and so on. In this paper, a supervised algorithm, the Naive Bayes classifier, is used to classify the tweets into positive, negative and neutral labels.

Shriya Se, R. Vinayakumar, M. Anand Kumar, K. P. Soman
Backmatter
Metadata
Title
Mining Intelligence and Knowledge Exploration
Edited by
Rajendra Prasath
Anil Kumar Vuppala
T. Kathirvalavakumar
Copyright year
2015
Electronic ISBN
978-3-319-26832-3
Print ISBN
978-3-319-26831-6
DOI
https://doi.org/10.1007/978-3-319-26832-3