
About this Book

This book constitutes the refereed proceedings of the 6th International Conference on Big Data Analytics, BDA 2018, held in Warangal, India, in December 2018. The 29 papers presented in this volume were carefully reviewed and selected from 93 submissions. The papers are organized in topical sections named: big data analytics: vision and perspectives; financial data analytics and data streams; web and social media data; big data systems and frameworks; predictive analytics in healthcare and agricultural domains; and machine learning and pattern mining.



Big Data Analytics: Vision and Perspectives


Fault Tolerant Data Stream Processing in Cooperation with OLTP Engine

In recent years, with the growth of big data, the spread of IoT technology, and the continual evolution of hardware, the demand for data stream processing has further increased. Meanwhile, in the field of database systems, a new demand is emerging for HTAP (hybrid transactional and analytical processing), which integrates the functions of on-line transaction processing (OLTP) and on-line analytical processing (OLAP). Against this background, our group started a new project to develop data stream processing technologies for the HTAP environment in cooperation with other research groups in Japan. Our main focus is to develop new data stream processing methodologies, such as fault tolerance, in cooperation with the OLTP engine. In this paper, we describe the background, objectives, and issues of the research.
Yoshiharu Ishikawa, Kento Sugiura, Daiki Takao

Blockchain-Powered Big Data Analytics Platform

As cryptocurrencies and other business blockchain applications become mainstream, the amount of transactional data, business contracts, and documents captured within various ledgers is getting bigger and bigger. Blockchains provide enterprises and consumers with greater confidence in the integrity of the captured data. This gives rise to a new level of analytics that marries the advantages of both blockchain and big data technologies to provide trusted analysis on validated, quality big data. Blockchain-based big data is a perfect source for subsequent analytics because the big data maintained on the blockchain is both secure (i.e., tamper-proof and cannot be forged) and valuable (i.e., validated and abundant). Further, data integration and advanced analysis across on-chain and off-chain data present enterprises with even more complete business insights. In this paper, we first discuss a blockchain-based business application for micro-insurance and AI marketplaces, which gives rise to blockchain-generated big data scenarios and the opportunity to develop trusted and federated AI insights across insurers. We then describe the design of a blockchain-powered big data analytics platform as well as the initial steps we have taken in developing this platform.
Hoang Tam Vo, Mukesh Mohania, Dinesh Verma, Lenin Mehedy

Humble Data Management to Big Data Analytics/Science: A Retrospective Stroll

We are on the cusp of analyzing a variety of data being collected in every walk of life in diverse ways and holistically as well as developing a science (Big Data Science) to benefit humanity at large in the best possible way. This warrants developing and using new approaches – technological, scientific, and systems – in addition to building upon and integrating with the ones that have been developed so far. With this ambitious goal, there is also the accompanying risk of these advancements being misused or abused as we have seen so many times with respect to new technologies.
In this paper, we provide a retrospective bird's-eye view of the approaches that have come about for managing and analyzing data over the last 40+ years. Since the advent of Database Management Systems (DBMSs), and especially Relational DBMSs (RDBMSs), data management and analysis have made several significant strides. Today, data has become an important tool (or even a weapon) in society, and its role and importance are unprecedented.
The goal of this paper is to give the reader an understanding of data management and analysis approaches: where we have come from, the motivations for developing them, and what this journey has been about in a short span of 40+ years. We sincerely hope this presentation provides a historical as well as pedagogical perspective for those who are new to the field, and a perspective that those who have been working in and contributing to the field can relate to and appreciate.
Sharma Chakravarthy, Abhishek Santra, Kanthi Sannappa Komar

Fusion of Game Theory and Big Data for AI Applications

With the increasing reach of the Internet, more and more people and their devices are coming online, with the result that a significant amount of our time and a significant number of our tasks are performed online. As the world moves faster towards automation and as concepts such as IoT catch on, many more (data-generating) devices are being added online without needing the involvement of human agents. The result is that lots (and lots) of information will be generated in a variety of contexts, in a variety of formats, at a variety of rates. Big data analytics therefore becomes (and already is) a vital topic for gaining insights into and understanding the trends encoded in large datasets. For example, worldwide Big Data market revenues for software and services are projected to increase from 42 Billion USD in 2018 to 103 Billion USD in 2027. However, in the real world it may not be enough to just perform analysis; many times there may be a need to operationalize the insights to obtain strategic advantages. Game theory is a mathematical tool for analyzing strategic interactions between rational decision-makers, and in this paper we study its use to obtain strategic advantages in different settings involving large amounts of data. The goal is to provide an overview of the use of game theory in different applications that rely extensively on big data. In particular, we present case studies of four Artificial Intelligence (AI) applications, namely Information Markets, Security systems, Trading agents, and Internet Advertising, and detail how game theory helps to tackle them. Each of these applications has been studied in detail in the game theory literature, and different algorithms and techniques have been developed to address the challenges they pose.
Praveen Paruchuri, Sujit Gujar

Financial Data Analytics and Data Streams


Distributed Financial Calculation Framework on Cloud Computing Environment

Even though recent technological innovations in cloud computing and distributed database architectures have matured to the point where they can address Big Data processing, most companies still struggle to implement their solutions in a cloud computing environment. This is mainly due to the lack of proper case studies and application frameworks available in the public research domain. Most implementations are vendor-provided, high-cost, consulting-type, long-drawn projects which consume valuable business and IT resources of an organization. In this paper we propose a simple-to-implement (parallel data load and aggregation calculation), easy-to-maintain, and flexible architecture framework which can be adopted as a tool by small to mid-size investment organizations in implementing a distributed cloud computing architecture. Our framework is an extension of traditional distributed database design, with horizontal partitioning of the relations to parallelize the computation on an Azure SQL instance and materialization of the aggregated results with SQL Views for users, Business Intelligence (BI) reporting, Data Mining, and Knowledge Discovery applications. The solution is implemented on the Azure SQL cloud computing platform to build the financial calculation framework.
Rao Casturi, Rajshekhar Sunderraman

Testing Concept Drift Detection Technique on Data Stream

Data changes dynamically, and these changes are so diverse that they affect the quality and reliability of a model. Concept drift is the problem of such dynamic changes in the data stream, which lead to a change in the behaviour of the model. Concept drift degrades the prediction quality of the software and thus reduces its accuracy. Most drift detection methods assume that labels are given for the new data samples, which is not practically possible. In this paper, the performance and accuracy of the proposed concept drift detection technique for the classification of streaming data with undefined labels is tested. Testing proceeds by creating a centroid classification model using training examples with defined labels, testing its accuracy on the test set, and then comparing the accuracy of the prediction model with and without the proposed concept drift detection technique.
Narinder Singh Punn, Sonali Agarwal
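The centroid-based drift check the abstract describes can be sketched roughly as follows. This is an illustrative stdlib-Python sketch, not the authors' implementation; the majority-vote drift threshold and the flat feature vectors are assumptions:

```python
import math

def centroid(points):
    """Mean vector of a list of feature vectors."""
    dims = len(points[0])
    return [sum(p[d] for p in points) / len(points) for d in range(dims)]

def distance(a, b):
    """Euclidean distance between two feature vectors."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

def classify(sample, centroids):
    """Assign a sample to the class whose centroid is nearest."""
    return min(centroids, key=lambda label: distance(sample, centroids[label]))

def drift_detected(window, centroids, threshold):
    """Flag drift when most unlabeled stream samples in the window fall
    farther than `threshold` from every known class centroid."""
    far = sum(1 for s in window
              if min(distance(s, c) for c in centroids.values()) > threshold)
    return far / len(window) > 0.5
```

Centroids would be trained from the labeled examples; a sliding window over the unlabeled stream then triggers the drift check.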

Homogenous Ensemble of Time-Series Models for Indian Stock Market

In the present era, the stock market has become the storyteller of all the financial activity of a country. It is a place of high risk, yet it attracts the masses because of its high returns. The stock market reflects the economy of a country and has become one of the biggest investment avenues for the general public. In this manuscript, we present various forecasting approaches and a linear regression algorithm to successfully predict the Bombay Stock Exchange (BSE) SENSEX value with high accuracy. Based on the analysis performed, it can be said that Linear Regression in combination with different mathematical functions prepares the best model. This model gives the best output with BSE SENSEX values and Gross Domestic Product (GDP) values, as it shows the least p-value of 5.382e−10 when compared with the p-values of other models.
Sourabh Yadav, Nonita Sharma
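As an illustration of the kind of model the abstract describes, an ordinary-least-squares fit of an index value against a single predictor such as GDP can be written as follows. This is a minimal generic sketch, not the paper's actual model or data:

```python
def linear_fit(xs, ys):
    """Ordinary least squares fit of y = a + b*x over paired samples."""
    n = len(xs)
    mx = sum(xs) / n
    my = sum(ys) / n
    b = (sum((x - mx) * (y - my) for x, y in zip(xs, ys))
         / sum((x - mx) ** 2 for x in xs))
    a = my - b * mx
    return a, b

def predict(a, b, x):
    """Predict y for a new x using the fitted coefficients."""
    return a + b * x
```

In practice `xs` would hold GDP values and `ys` the SENSEX values, and the fit quality would be judged via the p-value as in the paper.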

Improving Time Series Forecasting Using Mathematical and Deep Learning Models

With the increase in the number of Internet users, there is a deluge of traffic over the web, and handling Internet traffic with a more optimized and efficient approach is the need of the hour. In this work, we forecast Internet traffic on a TCP/IP network using web traffic data of Wikipedia articles, provided by Kaggle (https://www.kaggle.com/). We examine the stationarity of the time series and use the mathematical concepts of log transformation, differencing, and decomposition to make the time series stationary. Our research presents an approach for forecasting web traffic for these articles using different statistical time series models, namely the Auto-Regressive (AR), Moving Average (MA), and Auto-Regressive Integrated Moving Average (ARIMA) models, and a deep learning model, Long Short-Term Memory (LSTM). This work opens the possibility of efficient traffic handling, leading to improved performance for an organization as well as a better experience for users on the Internet.
Mohit Gupta, Ayushi Asthana, Nishant Joshi, Pulkit Mehndiratta
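The stationarity transforms mentioned above (log transformation and differencing) can be sketched as follows. This is a generic illustration, not the paper's code, and the inversion step assumes first-order differencing:

```python
import math

def log_transform(series):
    """Log-transform to stabilize the variance of a positive series."""
    return [math.log(v) for v in series]

def difference(series, lag=1):
    """Differencing at the given lag to remove trend (and seasonality)."""
    return [series[i] - series[i - lag] for i in range(lag, len(series))]

def invert_difference(last_value, diffs):
    """Rebuild the level series from the last observed value and a list of
    forecast first differences."""
    out = []
    prev = last_value
    for d in diffs:
        prev = prev + d
        out.append(prev)
    return out
```

A model such as AR, MA, ARIMA, or an LSTM would then be fitted on the stationary series, with forecasts mapped back through `invert_difference` and exponentiation.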

Emerging Technologies and Opportunities for Innovation in Financial Data Analytics: A Perspective

Several key transformations in the macro-environment coupled with recent advances in technology have opened up tremendous opportunities for innovation in the financial services industry. We discuss the implications and ramifications of these macro-environmental trends for data science research. Moreover, we describe novel and innovative IT-enabled applications, use-cases and techniques in retail financial services as well as in financial investment services. Furthermore, this paper identifies the research challenges that need to be addressed for realizing the full potential of innovation in financial services. Examples of such research challenges include context-aware analytics over uncertain and imprecise data, data reasoning and semantics, cognitive and behavioural analytics, design of user-friendly interfaces for improved expressiveness in querying financial service providers, personalization based on fine-grained user preferences and financial Big Data processing on Cloud-based infrastructure. Additionally, we discuss new and exciting opportunities for innovation in financial services by leveraging the new and emerging financial technologies as well as Big Data technologies.
Anirban Mondal, Atul Singh

Web and Social Media Data


Design of the Cogno Web Observatory for Characterizing Online Social Cognition

It is important to occasionally remember that the World Wide Web (WWW) is the largest information network the world has ever seen. Just about every sphere of human activity has been altered in some way due to the web. Our understanding of the web has been evolving over the past few decades, ever since it was born. In its early days, the web was seen just as an unstructured hypertext document collection. However, over time, we have come to model the web as a global, participatory, socio-cognitive space. One of the consequences of modeling the web as a space rather than as a tool is the emergence of the concept of Web observatories. These are application programs that are meant to observe and curate data about online phenomena. This paper details the design of a Web observatory called Cogno that is meant to observe online social cognition. Social cognition refers to the way social discourses lead to the formation of collective worldviews. As part of the design of Cogno, we also propose a computational model for characterizing social cognition. Social media is modeled as a "marketplace of opinions" where different opinions come together to form "narratives" that not only drive the discourse, but may also bring some form of returns to the opinion holders. The problem of characterizing social cognition is defined as breaking down a social discourse into its constituent narratives and, for each narrative, its key opinions and the key people driving it.
Srinath Srinivasa, Raksha Pavagada Subbanarasimha

Automated Credibility Assessment of Web Page Based on Genre

With more than a billion web sites, the volume and variety of content available for consumption is huge. However, credibility, an important quality characteristic of web pages, is questionable in many cases and tends to be non-uniform. Credibility can increase or reduce the importance of a web page, leading to potential gain or loss of user base. Credibility assessment without factoring in the genre of the content (for example, Help, Article, Discussion, etc.) can lead to incorrect assessment. Depending on the genre, the importance of features such as the date and time the page was modified, grammar, image-to-text ratio, in- and out-links, and other web page features differs. We propose a genre-based credibility assessment built on web page surface features and their importance within a genre. Further, we built the WEBCred framework to assess a Genre based Credibility Score (GCS), with the flexibility to add or modify genres, their features, and their importance. We validated our approach on 10,429 'Information Security' related web pages; the assessed score correlated 35% with the crowd-sourced Web Of Trust (WOT) score and 39% with Alexa ranking.
Shriyansh Agrawal, S. Lalit Mohan, Y. Raghu Reddy
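A genre-weighted score of the kind GCS represents might look like the following sketch. The feature names and per-genre weights here are purely hypothetical, not WEBCred's actual features or weightings:

```python
def genre_credibility_score(features, weights):
    """Weighted sum of normalized surface features; the weights differ per
    genre. If a genre's weights sum to 1 and features lie in [0, 1], the
    score also stays in [0, 1]."""
    return sum(weights.get(name, 0.0) * value for name, value in features.items())

# Hypothetical per-genre weightings, for illustration only: a discussion
# page weights freshness more heavily than a long-lived article does.
GENRE_WEIGHTS = {
    "article": {"freshness": 0.2, "grammar": 0.5, "outlinks": 0.3},
    "discussion": {"freshness": 0.5, "grammar": 0.2, "outlinks": 0.3},
}
```

The same feature vector thus yields different credibility scores depending on the detected genre, which is the core of genre-aware assessment.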

CbI: Improving Credibility of User-Generated Content on Facebook

Online Social Networks (OSNs) have become a popular platform for sharing information. Fake news often spreads rapidly in OSNs, especially during news-making events, e.g. the earthquake in Chile (2010) and Hurricane Sandy in the USA (2012). A potential solution is to use machine learning techniques to assess the credibility of a post automatically, i.e. whether a person would consider the post believable or trustworthy. In this paper, we provide a fine-grained definition of credibility: we call a post credible if it is accurate, clear, and timely. Hence, we propose a system which calculates the Accuracy, Clarity, and Timeliness (A-C-T) of a Facebook post, which in turn are used to rank the post for its credibility. We experiment with 1,056 posts created by 107 pages that claim to belong to the news category. We use a set of 152 features to train classification models for each of A-C-T using supervised algorithms. We use the best performing features and models to develop a RESTful API and a Chrome browser extension that rank posts for their credibility in real time. The random forest algorithm performed the best and achieved ROC AUC of 0.916, 0.875, and 0.851 for A-C-T respectively.
Sonu Gupta, Shelly Sachdeva, Prateek Dewan, Ponnurangam Kumaraguru

A Parallel Approach to Detect Communities in Evolving Networks

To understand the dynamics and the functional and topological aspects of real-world networks, it is necessary to segregate the network into sub-networks, where each member of a sub-network possesses analogous characteristics. Numerous community detection approaches have been proposed in the last few decades to overcome the issues associated with community detection. However, most of the conventional approaches rely on the premise that networks are static in nature and will not change over time. Moreover, all these approaches are single-machine approaches and hence exhibit poor scalability.
In this work, we propose a new incremental parallel community detection method, PcDEN (Parallel Community Detection approach in Evolving Networks). Our proposed method can detect communities in dynamic distributed networks. We define a new affinity score based on intra-community strength between nodes and their neighbors. We also derive a new model to perform community merging, based on common high-degree nodes present in both communities. We tested our algorithm on various real-world networks in our experiments. Results show that PcDEN produces satisfactory output with respect to various assessment indices.
Keshab Nath, Swarup Roy

Modeling Sparse and Evolving Data

Existing relational database management systems (RDBMSs) excel at providing transactional support. However, RDBMS performance declines when sparse and evolving data needs to be stored. Modeling highly evolving and sparse data is a major issue that needs attention in order to provide faster and more competent technology solutions. This research work is focused on providing a solution that handles sparseness and frequent evolution of data while adhering to transactional support. Recently, the authors proposed an extension of the binary table approach to overcome these shortcomings, termed the Multi Table Entity Attribute Value (MTEAV) model. To make users completely unaware of the underlying modeling approach, MTEAV is augmented with a translation layer that translates a conventional SQL query (as per the relational model) into a new SQL query (as per the MTEAV structure) to provide a user-friendly environment. In this research, the authors extend the functionality of the translation layer to provide support for data definition (creating, reading, updating and deleting schema). The authors have experimented with MTEAV to analyze the effect of sparseness on its performance. The results achieved clearly indicate that MTEAV performance increases with increasing sparseness.
Shivani Batra, Shelly Sachdeva, Aayushi Bansal, Suyash Bansal
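To illustrate the idea of a translation layer over per-attribute tables, the following sketch rewrites a conventional SELECT into joins over narrow entity-attribute-value tables. The one-table-per-attribute naming scheme (`<table>_<attr>(entity_id, value)`) is an assumption made for illustration, not MTEAV's actual layout:

```python
def translate_select(entity_table, attributes):
    """Rewrite SELECT a1, a2, ... FROM t into joins over hypothetical
    per-attribute tables named <t>_<attr>(entity_id, value)."""
    base = f"{entity_table}_{attributes[0]}"
    select = ", ".join(f"{entity_table}_{a}.value AS {a}" for a in attributes)
    joins = "".join(
        f" LEFT JOIN {entity_table}_{a}"
        f" ON {base}.entity_id = {entity_table}_{a}.entity_id"
        for a in attributes[1:]
    )
    return f"SELECT {select} FROM {base}{joins}"
```

The benefit of such a layout is that rows exist only for attributes an entity actually has, so sparseness costs no storage; the translation layer hides the join plumbing from the user.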

Big Data Systems and Frameworks


Polystore Data Management Systems for Managing Scientific Data-sets in Big Data Archives

Large scale scientific data sets are often analyzed for the purpose of supporting workflows and querying. Users need to query over different data sources, and these systems manage intermediate results. Most prototypes are complex and have an ad hoc design, requiring extensive modifications in case of growth of data or change of scale, in terms of data volume or number of users. New data sources may arise to further complicate the ad hoc design. The polystore data management approach provides 'data independence' for changes in data profile, including the addition of cloud data resources. Users are often provided a quasi-relational query language. In many cases, polystore systems support distinct tasks that are user-defined workflow activities, in addition to providing a common view of data resources.
Rashmi Girirajkumar Patidar, Shashank Shrestha, Subhash Bhalla

MPP SQL Query Optimization with RTCG

The analytics database dbX is a cloud-agnostic MPP SQL product with both DSM and NSM stores. One technique for better micro-optimization of SQL query processing is runtime code generation (RTCG) with JIT compilation. We propose an RTCG model that is both query aware and hardware conscious, extending analytics SQL query processing to a high degree of intra-query parallelism. Our approach to RTCG targets, at the system level, maximizing the benefits of modern hardware and, at the user level, typical industry-type SQL, somewhat different from standard benchmarks. We describe the model, highlighting its novel aspects, the techniques implemented, and product engineering decisions in dbX. To evaluate the efficacy of the RTCG model, we perform experiments on desktop and cloud clusters, with standard and synthetic benchmarks, on data that is more commensurate in size with industry applications.
K. T. Sridhar, M. A. Sakkeer, Shiju Andrews, Jimson Johnson

Big Data Analytics Framework for Spatial Data

In the world of mobile and Internet, large volumes of data are generated with spatial components. Modern users demand fast, scalable and cost-effective solutions to perform relevant analytics on massively distributed data, including spatial data. Traditional spatial data management systems are becoming less able to meet current user demands due to poor scalability and limited computational power and storage. A promising approach is to develop data-intensive spatial applications on parallel distributed architectures deployed on commodity clusters. The paper presents an open-source big data analytics framework to load, store, process and perform ad-hoc query processing on spatial and non-spatial data at scale. The system is built on top of the Spark framework with a NoSQL database, Cassandra, as a new input data source. It is implemented by performing analytics operations such as filtration, aggregation, exact match, proximity and k-nearest-neighbor search. It also provides an application architecture to accelerate ad-hoc query processing by diverting user queries to the suitable framework, either Cassandra or Spark, via a common web-based REST interface. The framework is evaluated by analyzing the performance of the system in terms of latency against variable sizes of data.
Purnima Shah, Sanjay Chaudhary
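The proximity search operation mentioned above boils down to a great-circle distance filter, sketched here in plain Python. This is illustrative only; the actual framework executes such predicates at scale on Spark and Cassandra:

```python
import math

def haversine_km(lat1, lon1, lat2, lon2):
    """Great-circle distance in kilometres between two (lat, lon) points,
    using the haversine formula with a mean Earth radius of 6371 km."""
    r = 6371.0
    p1, p2 = math.radians(lat1), math.radians(lat2)
    dp = math.radians(lat2 - lat1)
    dl = math.radians(lon2 - lon1)
    a = math.sin(dp / 2) ** 2 + math.cos(p1) * math.cos(p2) * math.sin(dl / 2) ** 2
    return 2 * r * math.asin(math.sqrt(a))

def proximity_filter(points, center, radius_km):
    """Keep (lat, lon) points within radius_km of center -- the core
    predicate of a proximity query."""
    return [p for p in points
            if haversine_km(p[0], p[1], center[0], center[1]) <= radius_km]
```

In a distributed setting the same predicate is pushed down per partition, so each worker filters only its local slice of the spatial data.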

An Ingestion Based Analytics Framework for Complex Event Processing Engine in Internet of Things

The Internet of Things (IoT) is the new paradigm that connects the physical world with the virtual world. The interconnection is generated by the optimal deployment of sensors which continuously generate data and stream it to a data store. Concept drift and data drift are integral characteristics of IoT data. Due to this nature, there is a need to process data from various sources and decipher patterns in them. This process of detecting complex patterns in data is called Complex Event Processing (CEP), which provides near real-time analytics for various IoT applications. Current CEP deployments have an inherent capability to react to events instantaneously. This leaves room to develop CEP engines that are proactive in nature, taking the help of various machine learning (ML) models that work together with the CEP engine. In this paper, the usage of a CEP engine is exhibited that allows the inference of new scenarios from incoming traffic data. This conversion of historical data into actionable knowledge is undertaken by a Long Short Term Memory (LSTM) model so as to detect the occurrence of an event well before time. The experimental results suggest the rich abilities of Deep Learning to predict events proactively with minimal error. This allows uncertainties to be dealt with, and steps for significant improvement can be taken in advance.
Sanket Mishra, Mohit Jain, B. Siva Naga Sasank, Chittaranjan Hota
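A minimal flavor of a CEP rule over streaming traffic data, firing when a sliding-window aggregate crosses a threshold, can be sketched as follows. This is a toy illustration: the event key, limit, and window size are assumptions, and the paper's proactive variant would feed an LSTM's predicted values into such rules rather than only observed ones:

```python
from collections import deque

def make_threshold_rule(key, limit, window_size):
    """Return a stateful rule that fires when the mean of `key` over the
    last `window_size` events exceeds `limit` -- a toy stand-in for a
    CEP pattern over an event stream."""
    window = deque(maxlen=window_size)

    def on_event(event):
        window.append(event[key])
        return len(window) == window_size and sum(window) / window_size > limit

    return on_event

# Hypothetical rule: average speed over the last 3 readings above 80.
rule = make_threshold_rule("speed", limit=80.0, window_size=3)
```

Each incoming event is pushed through `on_event`; a `True` result corresponds to a detected complex event that downstream logic can react to.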

An Energy-Efficient Greedy MapReduce Scheduler for Heterogeneous Hadoop YARN Cluster

The energy efficiency of a MapReduce system has become an essential part of infrastructure management in the field of big data analytics. Here, the Hadoop scheduler plays a vital role in ensuring the energy efficiency of the system. A handful of MapReduce scheduling algorithms have been proposed in the literature for the slot-based Hadoop system (i.e., Hadoop 0.x and 1.x) to minimize overall energy consumption. However, YARN-based Hadoop schedulers have not been discussed much in the literature. In this paper, we design a scheduling model for the Hadoop YARN architecture and formulate the energy-efficient scheduling problem as an Integer Program. To solve the problem, we propose a Greedy scheduler which selects the best job with minimum energy consumption in each iteration. We evaluate the performance of the proposed algorithm against the FAIR and Capacity schedulers and find that our greedy scheduler shows better results for both CPU- and I/O-intensive workloads.
Vaibhav Pandey, Poonam Saini
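The greedy selection step described above, picking the pending job with minimum energy consumption that still fits the remaining capacity, can be sketched as follows. This is an illustrative simplification; the paper's Integer Program also models per-node heterogeneity of the YARN cluster, which this sketch ignores:

```python
def greedy_schedule(jobs, capacity):
    """Greedily admit jobs in order of increasing energy cost, as long as
    the job's container demand fits the remaining cluster capacity.

    `jobs` is a list of dicts with hypothetical keys "name", "energy",
    and "containers"; returns the names of admitted jobs in order."""
    order = []
    remaining = capacity
    for job in sorted(jobs, key=lambda j: j["energy"]):
        if job["containers"] <= remaining:
            order.append(job["name"])
            remaining -= job["containers"]
    return order
```

Each scheduling round thus commits the cheapest feasible job first, which is the greedy heuristic's answer to the (NP-hard) integer program.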

Predictive Analytics in Healthcare and Agricultural Domains


Analysis of Narcolepsy Based on Single-Channel EEG Signals

A normal person spends about a third of their life in sleep, and healthy sleep is vital to people's normal lives. Sleep analysis can be used to diagnose certain physiological and neurological diseases such as insomnia and narcolepsy. This paper introduces the sleep stages and the corresponding electroencephalogram (EEG) characteristics at each stage. We use a deep convolutional neural network (CNN) to classify original EEG data with narcolepsy, and use frequency-based perturbations to generate adversarial examples in order to analyze the characteristics of narcolepsy in different sleep stages. We find that perturbations at specific frequencies affect the classification results of deep learning.
Jialin Wang, Yanchun Zhang, Qinying Ma

Formal Methods, Artificial Intelligence, Big-Data Analytics, and Knowledge Engineering in Medical Care to Reduce Disease Burden and Health Disparities

Medical errors and overtreatment, combined with a growing non-communicable disease population, are responsible for the increase in the burden of disease and health disparity. To control this burden and disparity, automation with zero defects must be introduced into evidence based medicine. In safety-critical systems, zero defects are achieved through formal methods: a formal model is tested (proved) and the target system is generated through automation, removing the error-prone programming or construction phase. Inspired by similar ideas, we created DocDx, a novel formal-method-driven medical care framework without any programming phase involved. We convert clinical pathways into a multipartite directed weighted graph (MDWG) that embeds the medical intelligence. Autonomous interpreters in the server present natural language generator (NLG) pathophysiology questions a doctor would normally ask a patient to understand the signs and symptoms of a disease. The biological terms and human-understandable unstructured text entered in the DocDx client are made machine understandable through an AI NLP engine and translated into biomedical ontology concepts. A new medical condition or presentation of a disease in DocDx needs only a new clinical pathway translated into an MDWG, without any programming or application development process at either the client or the server end.
Sakthi Ganesh, Asoke K. Talukder

Adaboost.RT Based Soil N-P-K Prediction Model for Soil and Crop Specific Data: A Predictive Modelling Approach

In evaluating the soil fertility status of a region, soil characteristics are an important aspect of agricultural production. Nitrogen, phosphorus, potassium, and sulfur are important elements of soil that regulate its fertility and the yield of crops. Due to the low efficiency of other inputs, or due to the use of unbalanced and inadequate fertilizer, the production efficiency of chemical fertilizer nutrients has reduced considerably in recent years under intensive agriculture. Stability in crop productivity cannot be sustained without the judicious use of macro- and micro-nutrients to overcome existing deficiencies. Information on the availability of macro-nutrients in the study area is scarce; therefore, the current study was undertaken to determine the condition of soil nutrients. Advanced agricultural technology can help predict soil nutrient content and can help farmers decide the amount of fertilizer to use on a particular plot of land. The proposed study focuses on the accurate prediction of the N-P-K content of a given plot using the Adaboost.RT prediction method. A comparison is also made between the nutrients utilized using traditional methods and the proposed method. Experimental results show that the proposed scheme outperforms other existing methodologies.
Rashmi Priya, Dharavath Ramesh
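A compact sketch of the AdaBoost.RT idea, where samples whose absolute relative error exceeds a threshold phi count as misclassifications and drive the sample reweighting, is given below with a 1-D regression stump as the weak learner. This is illustrative only; the weak learner, phi, and the power parameter are assumptions, not the paper's configuration:

```python
import math

def fit_stump(xs, ys, w):
    """Weighted 1-D regression stump: pick the threshold with the lowest
    weighted squared error, predicting the weighted mean on each side."""
    best = None
    for t in sorted(set(xs)):
        left = [i for i, x in enumerate(xs) if x <= t]
        right = [i for i, x in enumerate(xs) if x > t]
        if not left or not right:
            continue
        ml = sum(w[i] * ys[i] for i in left) / sum(w[i] for i in left)
        mr = sum(w[i] * ys[i] for i in right) / sum(w[i] for i in right)
        err = sum(w[i] * (ys[i] - (ml if i in left else mr)) ** 2
                  for i in range(len(xs)))
        if best is None or err < best[0]:
            best = (err, t, ml, mr)
    _, t, ml, mr = best
    return lambda x: ml if x <= t else mr

def adaboost_rt(xs, ys, rounds=5, phi=0.1, power=2):
    """AdaBoost.RT for regression: the error rate is the weight of samples
    whose absolute relative error exceeds phi; beta = error_rate**power,
    and learners are combined with log(1/beta) weights."""
    m = len(xs)
    w = [1.0 / m] * m
    learners, alphas = [], []
    for _ in range(rounds):
        h = fit_stump(xs, ys, w)
        are = [abs(h(x) - y) / abs(y) for x, y in zip(xs, ys)]
        eps = sum(w[i] for i in range(m) if are[i] > phi)
        if eps == 0 or eps >= 0.5:
            learners.append(h)
            alphas.append(1.0)
            break
        beta = eps ** power
        w = [w[i] * (beta if are[i] <= phi else 1.0) for i in range(m)]
        s = sum(w)
        w = [wi / s for wi in w]
        learners.append(h)
        alphas.append(math.log(1.0 / beta))

    def predict(x):
        return sum(a * h(x) for a, h in zip(alphas, learners)) / sum(alphas)
    return predict
```

In the N-P-K setting, `xs` would be (multi-dimensional) soil and crop features and `ys` the measured nutrient values; the ensemble's weighted combination smooths out individual stump errors.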

Machine Learning and Pattern Mining


Deep Neural Network Based Image Captioning

Generating a concise natural language description of an image enables a number of applications including fast keyword based search of large image collections. Primarily inspired by deep learning, recent times have witnessed a substantially increased focus on machine based image caption generation. In this paper, we provide a brief review of deep learning based image caption generation along with a brief overview of the datasets and metrics used to evaluate the captioning algorithms. We conclude the paper with some discussion on promising directions for future research.
Anurag Tripathi, Siddharth Srivastava, Ravi Kothari

Oversample Based Large Scale Support Vector Machine for Online Class Imbalance Problem

Dealing with online class imbalance in an evolving stream is a more critical issue than the conventional class imbalance problem. Usually, the class imbalance problem occurs when one class of data severely outnumbers the other classes, leading to skewed class boundaries. In the online class imbalance problem, the degree of class imbalance changes over time and the present state of imbalance is not known a priori to the learner. To address this problem, we present an Oversampling based Online Large Scale Support Vector Machine (OOLASVM) algorithm, which is a hybrid of active sample selection and oversampling of Support Vectors, so that both oversampling and undersampling coexist while learning the new boundary. Further, OOLASVM maintains a balanced boundary throughout the learning process. Results on simulated and real-world datasets demonstrate that the proposed OOLASVM yields better performance than existing approaches such as Generalized Oversampling based Online Imbalanced Learners and Over Online Bagging.
D. Himaja, T. Maruthi Padmaja, P. Radha Krishna
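The oversampling half of such a hybrid can be illustrated with a SMOTE-style interpolation between minority samples. This is a generic sketch of minority-class oversampling, not the OOLASVM algorithm itself, which oversamples Support Vectors specifically:

```python
import random

def oversample_minority(minority, target_size, rng=None):
    """Grow the minority class to `target_size` by interpolating between
    random pairs of minority samples (SMOTE-style synthesis), so the
    learner sees a more balanced class distribution."""
    rng = rng or random.Random(0)
    synthetic = list(minority)
    while len(synthetic) < target_size:
        a, b = rng.sample(minority, 2)
        lam = rng.random()  # interpolation factor in [0, 1)
        synthetic.append([ai + lam * (bi - ai) for ai, bi in zip(a, b)])
    return synthetic
```

Because each synthetic point lies on a segment between two real minority samples, the oversampled set stays inside the minority region rather than injecting noise.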

Using Crowd Sourced Data for Music Mood Classification

Music has been part of human lives since ancient times, and we have hundreds of millions of songs representing different cultures, moods and genres. These songs are readily accessible through the Internet and streaming services. However, discovering the right piece of music to listen to is hard, and automated assistance in finding the right song among the millions is always desired. There have been several attempts to classify music on the basis of genre, but these efforts have not been very fruitful because of the lack of good, large datasets. Moreover, identifying a set of features to represent music in a summarized way is also a challenging task. In this work, we present an automated music mood classification approach that uses crowd-sourced platforms to label the songs, eliminating the subjectivity of any one person's perception of the mood of a song. We have confined our work to two mood classes: happy and sad. The proposed approach is tested with three machine learning models: artificial neural networks (ANN), Decision Trees (DT) and Support Vector Machines (SVM). The experimental results show that the ANN performs better than the DT and SVM.
Ashish Kumar Patel, Satyendra Singh Chouhan, Rajdeep Niyogi
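
The three-model comparison can be prototyped with scikit-learn. The features below are synthetic stand-ins generated with `make_classification`; the paper's actual summarized audio features and crowd-sourced happy/sad labels are not reproduced here:

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.neural_network import MLPClassifier
from sklearn.tree import DecisionTreeClassifier
from sklearn.svm import SVC

# synthetic stand-in: 600 songs, 20 summarized features, binary mood label
X, y = make_classification(n_samples=600, n_features=20,
                           n_informative=8, random_state=0)
Xtr, Xte, ytr, yte = train_test_split(X, y, random_state=0)

models = {
    "ANN": MLPClassifier(hidden_layer_sizes=(32,), max_iter=500, random_state=0),
    "DT": DecisionTreeClassifier(random_state=0),
    "SVM": SVC(kernel="rbf", random_state=0),
}
# held-out accuracy per model, mirroring the paper's evaluation protocol
scores = {name: m.fit(Xtr, ytr).score(Xte, yte) for name, m in models.items()}
```

On real mood-labelled features the relative ranking (ANN ahead of DT and SVM, per the abstract) would of course depend on the dataset; this sketch only shows the comparison harness.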

Applying Big Data Intelligence for Real Time Machine Fault Prediction

Continuous use of mechanical systems requires precise maintenance. Automatic monitoring of such systems generates a large amount of data, which requires intelligent mining methods for processing and information extraction. The problem is to predict faults arising in ball bearings, which severely degrade the operating condition of machinery. We develop a distributed fault prediction model based on big data intelligence that extracts nine essential features from a ball bearing dataset through a distributed random forest. We also perform a rigorous simulation analysis of the proposed approach, and the results confirm the accuracy and correctness of the method. Different types of fault classes are considered for prediction, and classification is done in a supervised distributed environment.
Amrit Pal, Manish Kumar
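
A single-machine analogue of the pipeline above can be sketched with scikit-learn, where `n_jobs=-1` parallelizes tree construction (a stand-in for the paper's distributed setting). The nine features and the fault label below are synthetic and purely illustrative, not the paper's bearing data:

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import cross_val_score

rng = np.random.default_rng(1)
# synthetic stand-in: nine statistical features (e.g. RMS, kurtosis, ...)
# per vibration window; the label is a hypothetical fault indicator
X = rng.normal(size=(300, 9))
y = (X[:, 0] + 0.5 * X[:, 3] > 0).astype(int)

# trees are grown in parallel across all cores
clf = RandomForestClassifier(n_estimators=100, n_jobs=-1, random_state=0)
acc = cross_val_score(clf, X, y, cv=5).mean()  # 5-fold cross-validated accuracy
```

In a genuinely distributed deployment the same supervised setup would typically be run on an engine such as Spark MLlib's random forest rather than on one node.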

PRISMO: Priority Based Spam Detection Using Multi Optimization

The rapid growth of social networking sites such as Twitter, Facebook, Google+, MySpace, Snapchat, and Instagram, along with their local variants such as Weibo and Hyves, has left them infiltrated with a large amount of spamming activity. Based on its features, an account or a piece of content can be classified as spam or benign. The presence of irrelevant features decreases the performance of the classifier and the understandability of the dataset, while increasing the time required for training and classification. Feature subset selection is therefore an essential phase in the machine learning process. Its objective is to choose a subset of size 's' (s < n) from the total set of 'n' features that yields the least classification error; the feature subset selection problem can thus be cast as an optimization problem whose goal is to find a near-optimal subset of features. The literature suggests that a classifier offers its best performance when high-dimensional data is reduced to include only appropriate features with little redundancy. The contribution of this paper is to optimize the feature subset and its cost simultaneously. The fundamental idea of PRISMO is to generate primary feature subsets through various optimization algorithms in an initialization stage; the final subset is then derived from these initial feature sets based on their priority, using basic rules of conjunction and disjunction. To evaluate the overall efficiency of PRISMO, experiments were carried out on different datasets. The results show that the proposed model effectively reduces the cardinality of the feature set without bias toward a specific dataset and without degrading classifier accuracy.
Mohit Agrawal, R. Leela Velusamy
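
The conjunction/disjunction combination of candidate subsets can be illustrated concretely. This sketch is not PRISMO's actual priority scheme; it simply lets two off-the-shelf rankings (ANOVA F-score and mutual information) play the role of the initial optimizer outputs and combines them set-wise:

```python
from sklearn.datasets import make_classification
from sklearn.feature_selection import SelectKBest, f_classif, mutual_info_classif

X, y = make_classification(n_samples=400, n_features=30,
                           n_informative=6, random_state=0)
k = 10

# two independent feature rankings stand in for the initialization stage
s1 = set(SelectKBest(f_classif, k=k).fit(X, y).get_support(indices=True))
s2 = set(
    SelectKBest(lambda X, y: mutual_info_classif(X, y, random_state=0), k=k)
    .fit(X, y)
    .get_support(indices=True)
)

conj = s1 & s2  # conjunction: features both selectors agree on (higher priority)
disj = s1 | s2  # disjunction: features chosen by either selector
```

A priority-aware scheme could then keep `conj` unconditionally and admit members of `disj - conj` only while the classification error keeps improving.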

Malware Detection Using Machine Learning and Deep Learning

Research shows that over the last decade, malware has been growing exponentially, causing substantial financial losses to various organizations. Anti-malware companies have been proposing solutions to defend against attacks by these malware. The velocity, volume, and complexity of malware pose new challenges to the anti-malware community. Current state-of-the-art research shows that researchers and anti-virus organizations have recently started applying machine learning and deep learning methods for malware analysis and detection. We have used opcode frequency as a feature vector and applied unsupervised learning in addition to supervised learning for malware classification. The focus of this tutorial is to present our work on detecting malware with (1) various machine learning algorithms and (2) deep learning models. Our results show that Random Forest outperforms a Deep Neural Network with opcode frequency as a feature. Also, for feature reduction, Deep Auto-Encoders are overkill for the dataset, and an elementary method like Variance Threshold performs better than the others. In addition to the proposed methodologies, we also discuss further issues in the domain: its unique challenges, open research problems, limitations, and future directions.
Hemant Rathore, Swati Agarwal, Sanjay K. Sahay, Mohit Sewak
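
The opcode-frequency feature vector and Variance Threshold reduction mentioned above can be sketched as follows. The opcode vocabulary, the two toy traces, and their labels are invented for illustration; only the general technique (frequency features, then `VarianceThreshold`, then Random Forest) follows the abstract:

```python
import numpy as np
from collections import Counter
from sklearn.feature_selection import VarianceThreshold
from sklearn.ensemble import RandomForestClassifier

VOCAB = ["mov", "push", "pop", "call", "jmp", "add", "xor", "ret"]  # illustrative opcode set

def opcode_frequency(opcodes):
    """Normalized opcode-frequency vector for one binary's disassembled trace."""
    counts = Counter(opcodes)
    total = max(len(opcodes), 1)
    return np.array([counts[op] / total for op in VOCAB])

# two toy samples: a 'benign' trace and a 'malicious' trace (labels are invented)
X = np.vstack([
    opcode_frequency(["mov", "push", "call", "mov", "ret"]),
    opcode_frequency(["xor", "xor", "jmp", "call", "xor"]),
])
y = np.array([0, 1])  # 0 = benign, 1 = malware

# drop opcode columns that are constant across the dataset
X_red = VarianceThreshold(threshold=0.0).fit_transform(X)
clf = RandomForestClassifier(random_state=0).fit(X_red, y)
```

On a real corpus the vocabulary would be the full observed opcode set (often hundreds of entries), which is exactly where variance-based pruning pays off.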

Spatial Co-location Pattern Mining

Given a spatial dataset containing instances of a set of spatial Boolean feature-types, the problem of spatial co-location pattern mining aims to determine a subset of feature-types which are frequently co-located in space. Spatial co-location patterns have a wide range of applications in domains such as ecology, public health, and public safety. For instance, in an ecological dataset containing event instances corresponding to different bird species and vegetation types, spatial co-location patterns may reveal that a particular species of bird prefers a particular kind of tree for its nests. Similarly, in a crime dataset, spatial co-location may reveal a pattern that drunk-driving cases are co-located with bar locations. This article presents a gentle introduction to spatial co-location pattern mining. It introduces a well-studied interest measure called the participation index for co-location mining, and then discusses an algorithm to determine patterns having a high participation index in a spatial dataset.
Venkata M. V. Gunturi
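
The participation index for a size-2 pattern can be computed directly from its definition: the participation ratio of a feature is the fraction of its instances that appear in some neighboring pair, and the index is the minimum ratio over the pattern's features. A small sketch on invented point data (the coordinates and the distance threshold are illustrative only):

```python
from itertools import product

# toy spatial instances: feature-type -> list of (instance id, x, y)
instances = {
    "A": [(1, 0.0, 0.0), (2, 5.0, 5.0), (3, 9.0, 9.0)],
    "B": [(1, 0.2, 0.1), (2, 5.1, 4.9)],
}

def neighbors(p, q, d=0.5):
    """Euclidean neighborhood predicate with distance threshold d."""
    return (p[1] - q[1]) ** 2 + (p[2] - q[2]) ** 2 <= d ** 2

def participation_index(f1, f2):
    # row instances of the pattern {f1, f2}: all neighboring cross pairs
    rows = [(a, b) for a, b in product(instances[f1], instances[f2])
            if neighbors(a, b)]
    # participation ratio: fraction of a feature's instances occurring in some row
    pr1 = len({a[0] for a, _ in rows}) / len(instances[f1])
    pr2 = len({b[0] for _, b in rows}) / len(instances[f2])
    return min(pr1, pr2)  # participation index = minimum participation ratio
```

Here two of the three A instances and both B instances participate, so the index is min(2/3, 1) = 2/3; mining algorithms prune patterns whose index falls below a user-given threshold, exploiting the fact that the index is monotonically non-increasing as the pattern grows.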

