2018 | Book

Data Science and Analytics

4th International Conference on Recent Developments in Science, Engineering and Technology, REDSET 2017, Gurgaon, India, October 13-14, 2017, Revised Selected Papers

Editors: Brajendra Panda, Sudeep Sharma, Nihar Ranjan Roy

Publisher: Springer Singapore

Book Series: Communications in Computer and Information Science

About this book

This book constitutes the refereed proceedings of the 4th International Conference on Recent Developments in Science, Engineering and Technology, REDSET 2017, held in Gurgaon, India, in October 2017.

The 66 revised full papers presented were carefully reviewed and selected from 329 submissions. The papers are organized in topical sections on big data analytics, data centric programming, next generation computing, social and web analytics, and security in data science analytics.

Table of Contents

Frontmatter

Big Data Analytics

Frontmatter
System Behavior Analysis in the Urea Fertilizer Industry

This paper models and analyses a urea plant whose subsystems have different operational natures, evaluating system parameters using the Regenerative Point Graphical Technique (RPGT). A common cause failure is also considered in the modeling. The problem is formulated and solved for constant failure/repair rates for each subsystem. A state diagram of the system depicting the transition rates is drawn, and expressions for path probabilities and mean sojourn times are derived. An analytical discussion is carried out using tables and graphs. Behavioral inferences have been drawn which may be useful to the industrial personnel concerned.

Arun Kumar, Pardeep Goel, Deepika Garg, Atma Sahu
Performance Analysis of Machine Learning Techniques on Big Data Using Apache Spark

Applying intelligence to machines is a need in today’s world, and this need has led to the evolution of machine learning. The analysis of data using machine learning algorithms is a trending research area, and this analysis runs into problems when the data turns out to be big data. This paper compares various classification-based machine learning algorithms, namely Decision Tree Learning, Naïve Bayes, Random Forest and Support Vector Machines, on big data using Apache Spark. The accuracy is evaluated to find out which classification-based algorithm gives faster and better results.

Garima Mogha, Khyati Ahlawat, Amit Prakash Singh
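
The paper benchmarks Spark MLlib models; the evaluation loop it describes can be mirrored in miniature, pure-Python form. The toy 1-NN and majority-class classifiers and the tiny dataset below are illustrative assumptions standing in for the MLlib models and the big-data workload:

```python
# Illustrative accuracy-comparison harness; the toy classifiers and data
# stand in for the Spark MLlib models the paper actually benchmarks.
from collections import Counter

def majority_class(train, _x):
    # Baseline: always predict the most common training label.
    return Counter(lbl for _, lbl in train).most_common(1)[0][0]

def one_nn(train, x):
    # 1-nearest-neighbour by squared Euclidean distance.
    return min(train, key=lambda p: sum((a - b) ** 2 for a, b in zip(p[0], x)))[1]

def accuracy(classifier, train, test):
    correct = sum(classifier(train, x) == y for x, y in test)
    return correct / len(test)

train = [((0, 0), "A"), ((0, 1), "A"), ((5, 5), "B"), ((6, 5), "B")]
test  = [((1, 0), "A"), ((5, 6), "B"), ((0, 2), "A")]

for name, clf in [("majority", majority_class), ("1-NN", one_nn)]:
    print(name, accuracy(clf, train, test))
```

At Spark scale the same loop would be expressed over DataFrames with MLlib estimators, but the accuracy metric being compared is identical.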
A Comparative Study of Consumption Behavior of Pharmaceutical Drugs

The research is done to identify the significance of gender, profession, income and age for the consumption behavior of various classes of drugs. In this research paper, a comparative study is carried out to understand the consumption behavior of pharmaceutical drugs using primary data collected through a questionnaire in the NCR area, India, and analyzed using R programming. On analysis we found that the antibiotic and analgesic classes of drugs are most frequently used by consumers, and their consumption significantly depends on age group. It was also found that there is no significant difference in consumption behavior due to gender or profession. However, the income of the respondent affects the consumption of newly launched drugs of the same salt. These results will be helpful for the pharmaceutical market to understand the consumption behavior of common people.

Keerti Jain, Priyanka Sharma, Medathati Jayalakshmi
Trend Analysis of Machine Learning Research Using Topic Network Analysis

In this paper, a topic network analysis approach is proposed which integrates topic modeling and social network analysis. We collected 16,855 scientific papers from six top journals in the field of machine learning published from 1997 to 2016 and analyzed them with the topic network. The dataset is broken down into four intervals to identify topic trends, and a time-series analysis of the topic network is performed. Our experimental results show that centralization of the topic network has the highest score from 2002 to 2006, decreases for the next five years, and then increases again. For the last five years, centralization of the degree centrality and closeness centrality increases, while centralization of the betweenness centrality decreases. Also, data analytics and computer vision are identified as the most interrelated topics. Topics with the highest degree centrality evolve over time from component analysis to text mining, biometrics and computer vision. Our approach extracts the interrelationships of topics, which cannot be detected with conventional topic modeling approaches, and provides topical trends of machine learning research.

Deepak Sharma, Bijendra Kumar, Satish Chand
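
The centrality and centralization measures the paper tracks can be sketched on a toy topic co-occurrence graph. The four-topic graph below is an illustrative assumption, not the paper's dataset:

```python
# Normalised degree centrality and Freeman centralization for a small
# topic co-occurrence network (undirected, stored as an adjacency dict).
graph = {
    "computer vision":    {"data analytics", "component analysis", "text mining"},
    "data analytics":     {"computer vision", "text mining"},
    "component analysis": {"computer vision"},
    "text mining":        {"computer vision", "data analytics"},
}

n = len(graph)
# Normalised degree centrality: neighbours / (n - 1).
centrality = {v: len(nbrs) / (n - 1) for v, nbrs in graph.items()}

# Freeman centralization: how far the network is from a perfect star,
# where a star graph scores 1 and a fully even graph scores 0.
c_max = max(centrality.values())
centralization = sum(c_max - c for c in centrality.values()) / (n - 2)

print(centrality["computer vision"], round(centralization, 3))
```

Tracking this centralization score over the paper's four time intervals is what yields the rise-fall-rise trend reported in the abstract.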
Impact of Ontology on Databases

There is a boom in research devising new methodologies for indexing and accessing hidden web data available in databases. To exploit this hidden web data, users need to fill the various search forms available on the World Wide Web with appropriate values. For a common user with a non-technical background, it is quite difficult to find suitable values. Ontology provides a way to find these values: it is useful in constructing a semantic database that provides values for the various fields of search form interfaces. Ontology-based data extraction is efficient in dealing with the large amount of information available on the World Wide Web. This information may be available in different formats such as Hypertext Markup Language (HTML), Web Ontology Language (OWL), Resource Description Framework (RDF) and Extensible Markup Language (XML) files. This paper creates an ontology database which provides relevant information about the book domain. It presents relevant data to the user rather than the vague and irrelevant results that traditional databases generate.

Mohini Goyal, Geeta Rani
Design and Implementation of Virtual Hadoop Cluster on Private Cloud

Virtualization has made it feasible to deploy Hadoop in a cloud environment. Virtualized Hadoop offers unique benefits like setting up a cluster in a short time, flexibility to use a variety of hardware (SAN, NAS, DAS), high availability and many more. With companies like Google, Microsoft, Rackspace and IBM providing their own infrastructure for cloud services, more and more businesses are expected to move to the cloud in the near future. Apart from the public cloud, businesses can make use of the private, community or hybrid cloud deployment models. In this paper, the focus is on private cloud deployment, which offers its own benefits like security, reduced cost, more control over resources, etc. The design and implementation of a private cloud using the Xen 6.5 bare metal hypervisor is discussed in this paper. It further discusses deploying Hadoop as a service on the cloud with the help of a shell script. For experimental purposes, 8 physical hosts are connected to a 60 TB SAN with a QLogic 20-Port 8 Gb SAN switch module which provides fiber connectivity to the storage. Finally, the performance of Hadoop on the cloud is evaluated.

Garima Singh, Anil Kumar Singh
Healthcare Waste Management and Application Through Big Data Analytics

Healthcare is one of the most rapidly growing and largest sectors in India. It comprises hospitals, medical devices, clinical trials, telemedicine, health insurance and medical equipment. Such a large industry generates a huge amount of waste such as syringes, needles, disposable scalpels and blades, used cotton and bandages, plastic and glass bottles, chemicals, pharmaceuticals and other infectious waste. This waste is not managed properly due to lack of awareness about the health hazards related to health-care waste, inadequate training in proper waste management, absence of waste management and disposal systems, insufficient financial and human resources, and the low priority given to tasks connected with health-care waste. This paper highlights the use of Big Data Analytics to find applications of healthcare waste in the agriculture and disaster management sectors. The predictions show that with adequate planning, the waste of one sector can be fully utilized in another sector at low cost.

Poorti Sahni, Ginni Arora, Ashwani Kumar Dubey
Detecting Internet Addiction Disorder Using Bayesian Networks

The 21st century, being the digital age, has produced a social, economic and communication revolution. The proliferation of the Internet has made a globally connected world. The Internet per se is a harmless technology with significant benefits; on the contrary, its excessive usage and dependence lead to a high risk of addiction. There is a pressing societal need for continual research to develop efficient tools to identify and predict potential Internet Addiction Disorder among internet users. Our aim is to automate the task of predicting the prevalence of Internet Addiction Disorder using a Bayesian Network and to propose a machine learning graphical framework, namely, the Internet Addiction Disorder Bayesian Network. In this work, we exploit the unique features of Bayesian Networks to explore the influence of causal symptoms on the probability of occurrence of Internet Addiction Disorder (IAD). The model is constructed with the Internet Addiction Test as its platform, and the absence or presence of Internet Addiction Disorder is measured through six parameters of the Internet Addiction Test (Salience, Excessive Use, Neglect Work, Anticipation, Lack of Control, Neglect Social Life). The six attributes are classified into four groups (normal, mild, moderate and severe) on the basis of the item scores obtained by the individuals in the parameters, which are summed to obtain the total score. The total score is used to classify the samples into two groups: IAD Present and IAD Absent. To provide a graphical user interface for the model, the high-performance Netica software is used for the study. The results obtained are promising and reveal that the model can predict IAD presence and absence with 100% accuracy. The model also shows that, of the six parameters, excessive use of the internet plays the most significant role in increasing the risk of IAD, followed by Salience.

Anju Singh, Sakshi Babbar
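
The score-based labelling the paper builds its Bayesian network on can be sketched directly. The band thresholds and the total-score cut-off below are illustrative assumptions, not the Internet Addiction Test's exact values:

```python
# Sketch of the two-stage labelling: per-parameter severity bands, then
# a total-score decision. Thresholds here are illustrative assumptions.
PARAMETERS = ["salience", "excessive_use", "neglect_work",
              "anticipation", "lack_of_control", "neglect_social_life"]

def band(score):
    # Map one parameter's item score to a severity group.
    if score < 5:
        return "normal"
    if score < 10:
        return "mild"
    if score < 15:
        return "moderate"
    return "severe"

def classify(scores):
    bands = {p: band(scores[p]) for p in PARAMETERS}
    total = sum(scores[p] for p in PARAMETERS)
    # The total-score threshold for IAD presence is likewise an assumption.
    return bands, total, "IAD Present" if total >= 50 else "IAD Absent"

bands, total, label = classify({
    "salience": 14, "excessive_use": 18, "neglect_work": 9,
    "anticipation": 7, "lack_of_control": 12, "neglect_social_life": 4,
})
print(total, label)
```

In the paper these bands become the states of the network's symptom nodes, with Netica inferring the probability of the IAD node rather than applying a hard threshold.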
A Semantic Web-Based Framework for Information Retrieval in E-Learning Systems

The advent of the internet and the evolution of the World Wide Web (WWW), coupled with the e-learning paradigm, have resulted in the availability of a plethora of learning resources on the Web. However, these resources are not being utilized to their greatest potential. Learners, educators and researchers seeking educational content usually spend a great deal of time sorting through resources on the web without satisfactory results. Most times, this is not because the information is not available, but because the techniques applied by major search engines do not handle the semantics and personalization required in this context. In a bid to proffer a solution to the problem of discovering relevant resources online by different categories of users, this work presents an integrated framework for personalized information retrieval of educational content. The framework exploits semantic web technologies. Further work will include the implementation and testing of the framework.

Olaperi Yeside Sowunmi, Sanjay Misra, Nicholas Omoregbe, Robertas Damasevicius, Rytis Maskeliūnas
A Database for Handwritten Yoruba Characters

This paper describes a novel publicly available dataset for research on offline Yoruba handwritten character recognition. It contains a total of 6954 characters across several categories from 183 writers, making it the largest available dataset for Yoruba handwriting research. It can be used for designing and evaluating handwritten character recognition systems for the Yoruba language, as well as providing valuable insights through writer identification. The dataset has been partitioned into training and test sets in a 70%/30% split.

Samuel Ojumah, Sanjay Misra, Adewole Adewumi
Facial Expression Recognition for Motor Impaired Users

In today’s world, touch screen devices are trending as people depend on their smartphones and tablets for much of their work, making it simple and convenient to store and access data anytime and anywhere. In such a bustling framework, some people are not able to access touch screen devices. These users suffer from motor impairment, because of which they find it difficult or nearly impossible to access touch screen devices, resulting in a digital divide. This research work revolves around a technology that can be used to address the problems faced by motor impaired users. It provides an alternative solution by using an algorithm that detects emotions and performs actions on touch screen devices. Facial expression recognition can support access to touch screen devices with minimal physical interaction. In this proposed work, the facial expressions of a user are detected.

Krishna Sehgal, Sanchit Goel, Rachna Jain
Document Oriented NoSQL Databases: An Empirical Study

In today’s era, organizations are developing applications which continuously and rapidly generate large amounts of data. In this world of big data, NoSQL databases are rapidly becoming popular among organizations for storing information. Therefore, it is essential for an organization to choose a database which is compatible and efficient for its applications. To choose the correct database, it is essential to examine the performance of various databases under diverse workload conditions. In this paper, we examine different document oriented databases on various workloads, so that we can categorize them according to application need. The evaluation has been performed on four NoSQL document oriented databases: MongoDB, ArangoDB, Elasticsearch and OrientDB, with the help of the Yahoo! Cloud Serving Benchmark (YCSB), a popular benchmark tool. The comparison is done in two parts. In the first part, all the databases are compared for a single thread on the basis of throughput and runtime; here, MongoDB shows the best results, with the highest throughput and lowest runtime among all the databases. In the second part, a thorough analysis is done in which MongoDB and ArangoDB are compared across multiple threads on different workloads; here too, MongoDB outperforms ArangoDB by a high margin.

Omji Mishra, Pooja Lodhi, Shikha Mehta
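
A YCSB run of the kind described is driven by a workload properties file. A minimal example follows; the record and operation counts and proportions are illustrative, not the paper's exact settings:

```properties
# Core workload definition (read-heavy, similar in spirit to YCSB's Workload B).
workload=com.yahoo.ycsb.workloads.CoreWorkload
recordcount=100000
operationcount=100000
readproportion=0.95
updateproportion=0.05
requestdistribution=zipfian
```

The same file is then loaded and run against each store in turn, e.g. `bin/ycsb load mongodb -s -P myworkload` followed by `bin/ycsb run mongodb -s -P myworkload`, with throughput (ops/sec) and runtime read from YCSB's summary output, which is how the per-database comparison in the paper is obtained.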

Data Centric Programming

Frontmatter
A Survey of Techniques Used in Processing and Mining of Medical Images

Medical image processing is the method of enhancing and deriving meaningful information from digital medical images. Large collections of medical images have led to the rise of medical information retrieval systems whose aims are storing images, retrieval of images, pattern recognition, etc. All of these are done so that some useful knowledge and information might be derived from them. If proper information can be retrieved from the images, it will help in diagnosis, research and education. This paper studies the various image processing and image mining techniques applied to medical images and their utility. It helps to understand the different techniques used in the different phases of medical image processing and mining, such as pre-processing, feature extraction, segmentation, classification, indexing, storing and retrieval. The paper concludes by providing possible directions for future work.

Sudhriti Sengupta, Neetu Mittal, Megha Modi
Weighted Fuzzy KNN Optimized by Simulated Annealing for Classification of Large Data: A New Approach to Skin Detection

Machine learning is being used in every field. In almost all technical and financial domains, machine learning is used extensively, from predicting new outcomes to classifying given data into multiple sets. In this research work, we build and expand upon previously built binary classifiers to develop a unique classifier for skin detection that separates the given input data into two sets: a skin segment and a non-skin segment. Skin detection essentially means detecting the pixels or regions of an image or video which are of skin color. The input data given to the classifier has three attributes: the values of the red, green and blue channels. The combination of these three values is the color of the object seen. The classifier classifies the input data into the above two classes on the basis of these attributes. In general, this classifier can be extended to any binary class data.

Swati Aggarwal, Lehar Bhandari, Karan Kapoor, Jaswin Kaur
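
The core of such a classifier can be sketched as a distance-weighted KNN over (R, G, B) triples. The fuzzy memberships and simulated-annealing weight tuning from the paper are omitted here, and the training pixels are illustrative assumptions:

```python
# Minimal distance-weighted KNN skin/non-skin classifier over RGB pixels.
def weighted_knn(train, pixel, k=3):
    # Rank training pixels by squared distance in RGB space.
    ranked = sorted(train,
                    key=lambda p: sum((a - b) ** 2 for a, b in zip(p[0], pixel)))
    votes = {"skin": 0.0, "non-skin": 0.0}
    for rgb, label in ranked[:k]:
        d2 = sum((a - b) ** 2 for a, b in zip(rgb, pixel))
        votes[label] += 1.0 / (1.0 + d2)   # closer neighbours vote harder
    return max(votes, key=votes.get)

# Illustrative labelled pixels (skin tones vs. background colors).
train = [((224, 172, 105), "skin"), ((198, 134, 66), "skin"),
         ((250, 220, 196), "skin"), ((34, 177, 76), "non-skin"),
         ((0, 0, 255), "non-skin"), ((40, 40, 40), "non-skin")]

print(weighted_knn(train, (210, 160, 120)))
```

The paper's contribution is in replacing the uniform vote weights with fuzzy memberships and tuning them by simulated annealing, which this sketch deliberately leaves out.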
Comparative Analysis of Edge Detection Techniques for Medical Images of Different Body Parts

Medical images are arduous to process since they possess distinct modalities. Therefore, medical practitioners cannot competently detect and diagnose diseases in conventional ways. There should be a system which helps physicians understand medical images very easily. Image segmentation using edge detection is commonly used for image analysis and better visualization of medical images. Various methods have been used for image segmentation, such as threshold detection, region detection, edge detection and clustering techniques. Edge detection is one of the most prominently used methods for segmentation. This technique focuses on identifying and analyzing the entire image based upon the detected edges. In this paper, MRI images of human body parts such as the abdomen, ankle, elbow, hand, knee, leg, liver and brain are considered for edge detection. Further, filtering has been performed on the segmented images to remove unwanted noise, which makes the image clearer for further reference. The effectiveness of the proposed technique has been evaluated quantitatively using performance measures like entropy and standard deviation. The proposed technique may be highly beneficial for medical practitioners carrying out diagnosis for effective treatment.

Bhawna Dhruv, Neetu Mittal, Megha Modi
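
The two quantitative measures the paper uses, entropy and standard deviation, are straightforward to compute from a grayscale image. The 4x4 "image" below is an illustrative assumption:

```python
# Shannon entropy and standard deviation of a grayscale image, the two
# performance measures used to evaluate the segmented output.
import math

image = [[ 12,  12, 200, 200],
         [ 12,  50, 200, 200],
         [ 50,  50,  90,  90],
         [ 90,  90,  12, 200]]

pixels = [p for row in image for p in row]
n = len(pixels)

# Shannon entropy over the grey-level histogram (bits per pixel):
# higher entropy means more information retained in the image.
hist = {}
for p in pixels:
    hist[p] = hist.get(p, 0) + 1
entropy = -sum((c / n) * math.log2(c / n) for c in hist.values())

# Standard deviation: a proxy for overall contrast.
mean = sum(pixels) / n
std = math.sqrt(sum((p - mean) ** 2 for p in pixels) / n)

print(round(entropy, 3), round(std, 3))
```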
Classifier Dependent Dimensionality Reduction for Resource Restricted Environments

High dimensionality problems have become prevalent in present day machine learning applications. The voluminous datasets acquired from sources like cameras, spectroscopes, and other sensors need to be analysed and modelled in a way that uses the available computational resources most efficiently. The paper proposes a genetic algorithm optimised neural network model that takes care of the issue mentioned above. A comparison is also drawn between the results produced by the proposed model and those produced by other contemporary dimensionality reduction algorithms.

Divyanshu Kalra, Chaitanya Dwivedi, Swati Aggarwal
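
The genetic-algorithm side of the proposed model can be sketched as bitmask-based feature selection. The fitness function below (correlation of the selected-feature sum with the target, minus a size penalty) is an illustrative stand-in for the paper's neural-network accuracy, and the dataset is a toy assumption:

```python
# Toy genetic algorithm for feature-subset selection: individuals are
# bitmasks over features; fitness rewards predictiveness, penalises size.
import random

random.seed(0)

X = [[1, 9, 0], [2, 8, 1], [3, 1, 0], [4, 2, 1], [5, 3, 0], [6, 1, 1]]
y = [1, 2, 3, 4, 5, 6]          # target tracks feature 0 exactly

def fitness(mask):
    if not any(mask):
        return -1.0
    s = [sum(v for v, m in zip(row, mask) if m) for row in X]
    n = len(y)
    ms, my = sum(s) / n, sum(y) / n
    # Pearson correlation between selected-feature sum and target.
    cov = sum((a - ms) * (b - my) for a, b in zip(s, y))
    sd_s = sum((a - ms) ** 2 for a in s) ** 0.5
    sd_y = sum((b - my) ** 2 for b in y) ** 0.5
    # Penalise larger subsets so the GA prefers fewer features.
    return cov / (sd_s * sd_y) - 0.05 * sum(mask)

def evolve(pop_size=8, generations=30, n_feats=3):
    pop = [[random.randint(0, 1) for _ in range(n_feats)] for _ in range(pop_size)]
    for _ in range(generations):
        pop.sort(key=fitness, reverse=True)
        parents = pop[: pop_size // 2]          # elitist selection
        children = []
        while len(children) < pop_size - len(parents):
            a, b = random.sample(parents, 2)
            cut = random.randrange(1, n_feats)   # one-point crossover
            child = a[:cut] + b[cut:]
            if random.random() < 0.2:            # bit-flip mutation
                i = random.randrange(n_feats)
                child[i] ^= 1
            children.append(child)
        pop = parents + children
    return max(pop, key=fitness)

best = evolve()
print(best)
```

In the paper the evaluated fitness is the accuracy of a neural network trained on the masked features, which makes each fitness call far more expensive but leaves the evolutionary loop unchanged.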
Improving Road Safety in India Using Data Mining Techniques

Road accidents are very common in India. The World Health Organization (WHO) has revealed that India has the worst road traffic accident rate worldwide. According to the report, poor driving patterns, drunk driving, and badly maintained roads and vehicles are the main triggering factors of road casualties. Statistics show that one serious road accident occurs in the country every minute. The national capital, Delhi, is among the deadliest. To achieve the aim of reducing road accidents, novel and robust prevention strategies for improved road safety have to be developed. In this work, we propose the use of a data mining framework to analyze traffic on the National Highways of India. Using a real data set from the National Highways of India, we mine important patterns from accident data and identify key causes of road casualties. The discovered knowledge can be used by the Ministry of Road Transport & Highways of India to take effective decisions to reduce road accident severity.

Gaurav, Zunaid Alam
An Analytical Study to Find the Major Factors Behind the Great Smog of Delhi, 2016: Using Fundamental Data Sciences

Concerns over the alarming situation of smog pollution have come under the broad and current interest of the masses over the past few decades. Exposure to augmented levels of the pollutants forming photochemical smog poses a threat to human life, plants, animals and property as well. The effects of smog on health can be felt instantaneously, ranging from minor pains to deadly pulmonary diseases such as lung cancer. The Great Smog of Delhi, 2016 was a manifestation of such a situation: the air quality dipped to hazardous levels, posing a health emergency. With an aim to know the intricacies of the problem, an analytical study of the factors that contributed to smog pollution in Delhi was carried out using fundamental data science in the R programming language. The study covers the major pollutant analysis and meteorological factors from the data of two pollution monitoring stations, viz. R.K. Puram and Mandir Marg. Statistical analysis and simple linear regression models were used for the correlation study between pollutants’ concentration levels and meteorological factors. A comparative analysis was also executed over the conditions of the monitoring stations under study. It was found that particulate matter (PM10) turned out to be the major pollutant. Concentrations of pollutants are also affected by meteorological factors. Less greenery and exposure to more vehicular pollution resulted in R.K. Puram being more polluted than Mandir Marg.

Deepak Kumar Sharma, Arushi Bhatt, Aditi Kumar
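
The correlation study described uses simple linear regression between a pollutant's concentration and a meteorological factor; the least-squares fit can be sketched directly. The PM10 and wind-speed readings below are illustrative assumptions, not station data (the paper itself works in R):

```python
# Least-squares simple linear regression of pollutant level on a
# meteorological factor, plus the Pearson correlation coefficient.
wind = [0.5, 1.0, 1.5, 2.0, 2.5, 3.0]     # m/s (assumed factor values)
pm10 = [620, 540, 470, 410, 330, 260]     # ug/m3 (assumed readings)

n = len(wind)
mx, my = sum(wind) / n, sum(pm10) / n
sxy = sum((x - mx) * (y - my) for x, y in zip(wind, pm10))
sxx = sum((x - mx) ** 2 for x in wind)

slope = sxy / sxx                # change in PM10 per m/s of wind
intercept = my - slope * mx
r = sxy / (sxx ** 0.5 * sum((y - my) ** 2 for y in pm10) ** 0.5)

print(round(slope, 1), round(intercept, 1), round(r, 3))
```

A strongly negative r, as in this illustrative series, would indicate that higher wind speeds disperse the pollutant, which is the kind of pollutant-meteorology relationship the study quantifies.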
Effects of Partial Dependency of Features and Feature Selection Procedure Over the Plant Leaf Image Classification

The process of taxonomic classification of plant species has been carried out by botanists for centuries, by observing their roots, shoots and flowers. In this age of modernization, roads, buildings and bridges are fast replacing vegetation, even before a botanist might personally get a chance to look at it. Therefore, the role of computer vision is justified for the fast classification of plant species before they become extinct. The sole purpose of this research work is to increase the predictive classification accuracy for plant species by using shape and texture features obtained from digital images of the dorsal sides of leaves. Since the geometrical shape features of the leaves alone are not able to provide better predictive classification accuracy, texture features have been clubbed together with them to achieve a higher order of accuracy. This leads to an increase in the data size. Therefore, in order to reduce the feature dataset, a random feature selection procedure has been adopted, which selects features on the basis of attribute weights. The justifiability of the features selected has been examined using feature importance plots, and strengthened further by partial dependency plots drawn to decide their inclusion in the final feature selection dataset. The results exhibited by the shape, shape subset, texture, and texture subset feature datasets, as well as the combined dataset, are quite exemplary and worth showcasing. In spite of the fact that the geometrical shapes of many of the leaves may be the same or almost the same, the combined shape and texture features can be a suitable alternative for improving predictive accuracy.

Arun Kumar, Poonam Saini
Performance Measure Based Segmentation Techniques for Skin Cancer Detection

Skin cancer is a very common form of cancer whose handling initially starts with investigation and analysis, going through biopsy and examination. The analysis is the most challenging task, as it depends on the appearance of the skin lesion. Computer Aided Diagnostic (CAD) systems have been developed for skin cancer detection which go through various phases, starting from pre-processing, segmentation, feature extraction and selection, to classification of the cancer type. Segmentation is an important as well as difficult phase which extracts the lesion from the non-lesion area based on variations in color, texture, size and shape. In this paper, different segmentation techniques are discussed: Otsu thresholding as pixel-based segmentation, Canny edge detection as edge-based segmentation, watershed as region-based segmentation, and K-means as clustering-based segmentation. The performance of the techniques has been measured by Peak Signal to Noise Ratio (PSNR), Mean Square Error (MSE) and Structural Similarity Index Measure (SSIM) using MATLAB.

Ginni Arora, Ashwani Kumar Dubey, Zainul Abdin Jaffery
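
Two of the three measures used, MSE and PSNR, follow directly from their definitions (SSIM needs local windowed statistics and is omitted here). The tiny "original" and "segmented" images below are illustrative assumptions, and the paper computes these in MATLAB rather than Python:

```python
# MSE and PSNR between an original and a segmented 8-bit grayscale image.
import math

original  = [[200, 200,  10], [200, 200,  10], [10, 10, 10]]
segmented = [[255, 255,   0], [255, 255,   0], [ 0,  0,  0]]

flat_o = [p for row in original for p in row]
flat_s = [p for row in segmented for p in row]

# Mean Square Error over all pixels.
mse = sum((a - b) ** 2 for a, b in zip(flat_o, flat_s)) / len(flat_o)

# PSNR in dB for 8-bit images (MAX = 255); identical images give infinity.
psnr = 10 * math.log10(255 ** 2 / mse) if mse else float("inf")

print(round(mse, 1), round(psnr, 2))
```

A lower MSE (and hence higher PSNR) means the segmented output stays closer to the source image, which is how the four segmentation techniques are ranked.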
Effectiveness of Region Growing Based Segmentation Technique for Various Medical Images - A Study

Due to rapid and continuous progress along with higher fidelity rates, medical imaging is becoming one of the most crucial fields in scientific imaging. Both microscopic and macroscopic modalities are probed, and the resulting images are analyzed and interpreted in medical imaging for the early detection, diagnosis, and treatment of various ailments like tumors, cancers, gallstones, etc. Although the field of medical image processing is growing significantly and persistently, there still exist a number of challenges in this field. Among these challenges, a frequently occurring and critically significant one is image segmentation. The work presented in this paper covers the challenges involved in, and a comparative analysis of, segmentation using region growing techniques frequently utilized on various biomedical images like retinal vessel images, mammograms, magnetic resonance images, PET-CT images, coronary artery images, microscopy images, ultrasound images, etc. It discusses the effectiveness of the region growing technique applied to various medical images.

Manju Dabass, Sharda Vashisth, Rekha Vig
Biomedical Image Enhancement Using Different Techniques - A Comparative Study

In medical applications, the processing of various medical images like chest X-rays, projection images of trans-axial tomography, cineangiograms and other medical images that occur in radiology, ultrasonic scanning and nuclear magnetic resonance (NMR) is required. These images may be used for patient screening and monitoring for the detection of diseases. Image enhancement algorithms are employed to emphasize, smoothen or sharpen image features for display and analysis. In the biomedical field, the greatest difficulty image enhancement faces is in quantifying the criterion for enhancement; enhancement methods are therefore application specific and often developed empirically. The work presented in this paper is a detailed analysis of the enhancement of medical images using contrast manipulation, noise reduction, edge sharpening, gray level slicing, edge crispening, magnification, interpolation, and pseudo-coloring. A comparison of these techniques is necessary for deciding on an apt algorithm applicable for the enhancement of all medical images and further processing. This paper reviews the background of enhancement techniques in three domains, i.e. the spatial, frequency and fuzzy domains. The comparative analysis of the different techniques is shown using results obtained by applying these techniques to medical images.

Jyoti Dabass, Rekha Vig
Using Variant Directional Dis(similarity) Measures for the Task of Textual Entailment

Textual entailment (TE) is a task used to determine the degree of semantic inference between a pair of text fragments in many natural language processing applications. In the literature, a single document summarization framework has exploited TE to establish the degree of connectedness between pairs of sentences in a text summarization method. Despite the noteworthy performance of the method, the extensive resource requirements and slow speed of the TE tool render it impractical for generating summaries in real-time scenarios. This has stimulated the authors to propose the use of available directional dis(similarity) (distance and similarity) measures in place of the TE system. The present paper aims to find a suitable directional measure which can successfully replace the TE system and decrease the overall runtime of the summarization method. Therefore, state-of-the-art directional dis(similarity) measures are implemented in the same summarization framework to present a comparative analysis of the performance of all the measures. The experiments are conducted on the DUC 2002 dataset and the results are evaluated using the ROUGE tool to find the most suitable directional measure of textual entailment.

Anand Gupta, Manpreet Kaur, Disha Garg, Karuna Saini
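
What makes a measure "directional" in this setting is that sim(T, H) need not equal sim(H, T), mirroring the asymmetry of entailment. A minimal example of such a measure is word-coverage of the hypothesis by the text; the naive whitespace tokenisation below is an illustrative simplification, not one of the paper's evaluated measures:

```python
# A directional (asymmetric) similarity between two sentences:
# the fraction of the hypothesis's words that appear in the text.
def directional_sim(text, hypothesis):
    t = set(text.lower().split())
    h = set(hypothesis.lower().split())
    return len(t & h) / len(h)

t = "the cabinet approved the new road safety policy on monday"
h = "the cabinet approved the policy"

# Every word of h occurs in t, but not vice versa, so the two
# directions give different scores.
print(directional_sim(t, h), directional_sim(h, t))
```

Here sim(T, H) = 1.0 while sim(H, T) is well below 1, which is exactly the entailment-like asymmetry the summarization framework relies on.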
Image Enhancement of Lemon Grasses Using Image Processing Techniques (Histogram Equalization)

Lemon grass, a type of medicinal plant, has been part of human existence and has been applied in many ways: for healing, for drugs and for protection. In this paper, conventional histogram equalization has been used to improve images of lemon grass. MATLAB software was used to display the histogram as well as the histogram-equalized version of the lemon grass image. Histogram equalization is considered since it is one of the standard techniques for the enhancement of images, and as such is applied here to this medicinal herb. The technique used may be seen as conventional, but the results obtained demonstrate its capability to improve the appearance of images by bringing out hidden details. The performance of the technique also shows that it is a better method in comparison to other types of histogram equalization methods.

Ofeoritse S. Temiatse, Sanjay Misra, Chitra Dhawale, Ravin Ahuja, Victor Matthews
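
Conventional histogram equalization maps each grey level through the image's cumulative distribution function. A self-contained sketch on an assumed 4x4 low-contrast image (the paper performs this in MATLAB):

```python
# Histogram equalization of an 8-bit grayscale image via the CDF.
image = [[ 52,  55,  61,  59],
         [ 79,  61,  76,  61],
         [110,  61,  69,  68],
         [119,  61,  68,  55]]

pixels = [p for row in image for p in row]
n = len(pixels)

# Histogram and cumulative distribution over grey levels 0..255.
hist = [0] * 256
for p in pixels:
    hist[p] += 1
cdf = []
total = 0
for c in hist:
    total += c
    cdf.append(total)

cdf_min = min(v for v in cdf if v > 0)

def equalize(p):
    # Standard mapping: spread the CDF over the full 0..255 range.
    return round((cdf[p] - cdf_min) / (n - cdf_min) * 255)

equalized = [[equalize(p) for p in row] for row in image]
print(equalized[0])
```

The input values cluster in a narrow band (52 to 119), and the mapping stretches them across 0 to 255, which is the contrast gain that brings out the hidden detail the abstract describes.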
Unit Testing in Global Software Development Environment

Global software development has many challenges. Among them, maintaining quality in the code developed at distributed sites is a challenge. One effective way to control the quality of the code is unit testing. It removes defects at the early stages of development and further reduces the testing and maintenance effort in the later phases of the development lifecycle. In this paper, a class complexity metric is proposed for testing the class, which is normally treated as a unit in object-oriented programming. The applicability of class complexity metrics for unit testing is demonstrated through a project in Java.

Sanjay Misra, Adewole Adewumi, Rytis Maskeliūnas, Robertas Damaševičius, Ferid Cafer

Next Generation Computing

Frontmatter
Comparative Analysis of Different Load Balancing Algorithm Using Cloud Analyst

Cloud computing is the emerging trend for the effective allocation of hardware, platforms and software over the internet, but there are both advantages and disadvantages to using cloud computing in today’s world. The main features of cloud computing are reduced cost, uninterrupted pervasive accessibility, backup retrieval, throughput and efficiency, and improved storage competence. Data security, privacy, cloud infrastructure, monitoring and load balancing are also major concerns. This paper focuses on load balancing for improving the performance of the available resources and distributing the load uniformly across the servers.

Meeta Singh, Poonam Nandal, Deepa Bura
Performance Analysis of QoS Parameters During Vertical Handover Process Between Wi-Fi and WiMAX Networks

To make a seamless handover among heterogeneous networks, the optimal selection of a network is a challenging task. These networks vary in terms of QoS provisioning. The challenge is to select a network which has better QoS. The paper showcases the effects of Hard Handover and Predictive Handover on the QoS parameters with different data rates of application running on the mobile node at different mobile node speeds between Wi-Fi and WiMAX networks.

Bhawna Narwal, Amar K. Mohapatra
Improving the Performance of Monostatic Pulse RADAR by Changing Waveform Design Using MATLAB

This paper describes the design of a reliable monostatic pulse radar system with an improved signal to noise ratio. The waveform transmitted by the radar is varied and the effect of the waveform variation is noted on important radar parameters like peak transmit power and radar range. The validity of the radar range equation for various waveforms has also been verified. All the simulations have been done in MATLAB, and useful conclusions are drawn about the effect of pulse integration on the reliability of a monostatic pulse radar system and the effect of waveform variation on peak transmit power and radar range.

Diksha Prakash, Shreya Prakash
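The radar range equation the abstract refers to is standard; as an editorial illustration (in Python rather than the paper's MATLAB), it shows why large increases in peak transmit power buy only modest range gains:

```python
import math

def max_range(pt, gain, wavelength, rcs, smin):
    """Classical radar range equation: maximum detection range in metres.

    pt: peak transmit power (W), gain: antenna gain (linear),
    wavelength: m, rcs: target radar cross-section (m^2),
    smin: minimum detectable signal power (W).
    """
    num = pt * gain**2 * wavelength**2 * rcs
    den = (4 * math.pi)**3 * smin
    return (num / den) ** 0.25

# Doubling peak power improves range only by a factor of 2**0.25.
r1 = max_range(5e3, 100.0, 0.03, 1.0, 1e-13)
r2 = max_range(1e4, 100.0, 0.03, 1.0, 1e-13)
print(round(r2 / r1, 3))
```

The fourth-root dependence is what makes pulse integration (lowering the effective `smin`) attractive relative to simply raising transmit power.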
Analysis and Detection of Ransomware Through Its Delivery Methods

Cyber criminals are using diverse approaches to extract money from internet users and organizations. Recently, a class of malware called ransomware has become a popular tool for this job due to its ease of availability and its distribution methods. Security experts work to counter ransomware attacks by fixing vulnerabilities of the operating system. In this research work, we propose a method to prevent ransomware attacks at an early stage through their delivery channels, such as exploit kits. We analyzed the crawling patterns (file-path listings, dropped files, network activity, ransom notes, etc.) on victims' computers, and used these patterns to extract features for classifying malicious samples with supervised machine learning algorithms. Experimental results show that an accuracy of 94% is achieved in tightly bound mode, and 91% in moderate bound mode, using the random forest classification algorithm.

Keertika Gangwar, Subhranshu Mohanty, A. K. Mohapatra
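The pipeline the abstract describes — turning sandbox crawl patterns into a feature vector, then classifying — can be sketched as follows. This is an editorial illustration: the feature names and thresholds are invented stand-ins, and a tiny hand-set ensemble of decision stumps substitutes for the trained random forest of the paper:

```python
def extract_features(report):
    """Turn a sandbox crawl report into a numeric feature vector.

    Feature names are illustrative, not the paper's actual feature set.
    """
    return [
        len(report.get("dropped_files", [])),
        len(report.get("network_hosts", [])),
        int(any("ransom" in f.lower() for f in report.get("file_paths", []))),
        int(report.get("registry_writes", 0) > 50),
    ]

def stump_vote(x):
    """Majority vote of hand-set decision stumps, a stand-in for a
    trained random forest: 1 = flagged as ransomware."""
    votes = [x[0] > 10, x[1] > 3, x[2] == 1, x[3] == 1]
    return int(sum(votes) >= 2)

sample = {"dropped_files": ["a.exe"] * 12,
          "network_hosts": ["1.2.3.4"],
          "file_paths": ["C:/Users/victim/RANSOM_NOTE.txt"],
          "registry_writes": 3}
print(stump_vote(extract_features(sample)))  # flagged: 1
```

A real random forest learns the stumps and thresholds from labeled samples instead of hard-coding them.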
A Comparative Analysis of Various Regularization Techniques to Solve Overfitting Problem in Artificial Neural Network

Neural networks with a large number of parameters are very effective machine learning tools. But as the number of parameters grows, the network becomes slow to use and the problem of overfitting arises. Various ways to prevent overfitting are discussed here and compared, and the effects of various regularization methods on the performance of neural network models are observed.

Shrikant Gupta, Rajat Gupta, Muneendra Ojha, K. P. Singh
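The most common regularizer in such comparisons is an L2 penalty (weight decay). A minimal sketch, not taken from the paper, of how the penalty modifies a gradient-descent step:

```python
def gd_step(w, grad, lr=0.1, l2=0.01):
    """One gradient-descent step with L2 weight decay: the penalty
    lambda * ||w||^2 adds 2*lambda*w to each weight's gradient,
    shrinking large weights and so discouraging overfitting."""
    return [wi - lr * (gi + 2 * l2 * wi) for wi, gi in zip(w, grad)]

# With a zero data gradient, weights decay toward 0 instead of staying fixed.
w = [1.0, -2.0]
for _ in range(100):
    w = gd_step(w, [0.0, 0.0])
print([round(x, 4) for x in w])  # both weights have shrunk toward zero
```

Dropout and early stopping, the other regularizers typically compared, act on activations and training time rather than on the loss itself.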
Analysis of COCOMO and UCP

After an early requirement design, project managers mostly use the requirement specifications to estimate the functional size of software, which helps in estimating the effort required and the tentative cost of the software. An accurate estimate is necessary to be able to negotiate the price of a software project and to plan and schedule project activities. The Function Point (FP) sizing method is used frequently to estimate the functional size of software. Another popular method of functional sizing is Use Case Points (UCP). Although the UCP method is used less often than FP-based estimation, it is simpler: FP sizing often requires COCOMO-II or another effort-estimation model to convert a size estimate in FPs into an effort estimate, whereas a direct conversion formula turns size in UCPs into effort. This paper compares the results of both approaches for two mid-size business applications and examines the correlation between them.

Bhawna Sharma, Rajendra Purohit
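The "direct conversion" the abstract mentions is Karner's Use Case Points formula: unadjusted points are scaled by technical and environmental factors, then multiplied by a productivity factor. A sketch of the arithmetic (input values are illustrative, not from the paper):

```python
def use_case_points(uucp, tcf, ecf, hours_per_ucp=20):
    """Karner's UCP method: size = UUCP * TCF * ECF, and
    effort = size * productivity factor (20 person-hours per UCP
    is Karner's original default)."""
    ucp = uucp * tcf * ecf
    return ucp, ucp * hours_per_ucp

ucp, effort = use_case_points(uucp=120, tcf=1.0, ecf=0.85)
print(round(ucp, 1), "UCP ->", round(effort), "person-hours")
```

An FP-based estimate, by contrast, would feed the size into a separate effort model such as COCOMO-II, which is the extra step the paper's comparison highlights.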
Review of Current Software Estimation Techniques

Software effort estimation is an onerous but inevitable task project managers have to perform. Project managers often face the dilemma of selecting an estimation approach whenever a new project opportunity arises. Estimation is required not only for setting a price in bidding rounds but also for planning, budgeting, staffing and scheduling of project-related tasks. This paper reviews the major cost-estimation techniques relevant in the current scenario. The primary conclusion is that all estimation approaches have advantages and disadvantages and are often complementary in their characteristics. Observing and evaluating several approaches can be insightful and can help in selecting the estimation technique, or combination of techniques, best suited to a particular project.

Bhawna Sharma, Rajendra Purohit
Improved Bee Swarm Optimization Algorithm for Load Scheduling in Cloud Computing Environment

The cloud is a model containing an aggregation of resources and data to be shared among users, and scheduling the load is a major challenge in fulfilling the requests of many users. Several algorithms have been proposed for load scheduling in the cloud; the latest are based on swarm-intelligence techniques. However, one such technique, Bee Swarm Optimization (BSO), has not yet been exploited for this purpose. In this paper, an improved version of BSO, Improved Bee Swarm Optimization in Cloud (IBSO-C), is proposed with the objective of efficient and cost-effective scheduling in the cloud. It treats the swarm of particles as bees for scheduling and uses an updated total-cost evaluation function. The proposed algorithm is validated and tested over a large number of iterations, and comparison with existing techniques shows IBSO-C to be a more cost-effective algorithm.

Divya Chaudhary, Bijendra Kumar, Sakshi Sakshi, Rahul Khanna
Simulation and Application Performance Evaluation Using GPU Through CUDA C & Deep Learning in TensorFlow

GPUs have recently attracted the attention of many application developers as commodity data-parallel coprocessors. The newest generations of GPU architecture provide easier programmability and increased generality while maintaining the huge memory bandwidth and computational power of traditional GPUs. This opportunity should redirect efforts in GPU research towards establishing principles and strategies that allow efficient mapping of computation to the graphics hardware. This work presents the organization, features and generalized optimization strategies of the GeForce GTX 560 Ti processor. Performance on this platform is achieved by using massive multithreading across a large number of cores to hide global memory latency. To achieve this, developers face the challenge of striking the right balance between each thread's resource usage and the number of simultaneously active threads. The resources to manage include the number of registers and the amount of on-chip memory used per thread, the number of threads per multiprocessor, and global memory bandwidth. Increased performance is also obtained by reordering accesses to off-chip memory so that requests for the same or adjacent memory locations are combined, and by applying classical optimizations that reduce the number of executed instructions. These strategies were applied across a variety of applications and domains, achieving application speedups between 10.5X and 14X. A similar result was achieved on a single-core GPU using a deep learning technique in the TensorFlow framework.

Ajeet Kumar, Abhishek Khanna
Significance of DITMC Technique for Capacity Enhancement in GSM and CDMA Networks

By the end of 2020, heavy data traffic will dominate speech traffic, and the probability of bandwidth and capacity saturation in GSM and CDMA networks seems high. During the last decade, various traffic models and techniques have been developed to optimize the bandwidth and capacity of mobile networks. Analysis shows that the most economical and efficient way to optimize the limited channel capacity and bandwidth in GSM and CDMA networks is to exploit the silence periods during a speech conversation, which can be utilized to transmit data or for other channel allocation. Moreover, simulation studies show this concept to be an optimized, efficient and economical way to accommodate more users in GSM and CDMA networks. This paper proposes a mathematical model incorporating the DITMC technique for capacity enhancement in GSM and CDMA networks.

Hemant Purohit, Parneet Kaur, Kanika Joshi
Dual Band Micro Strip Patch Antenna for UWB Application

In the modern age, human life is closely associated with different kinds of technology that cater to daily needs such as education, health, food production, transportation and communication. Technology should ease the daily life of the common man, and problems must be addressed in real time so that relevant remedies can be obtained in time. In today's digital world, wireless technology with high-data-rate services is a necessity for real-time systems. The ever-increasing demand for higher data rates requires radiating systems with large bandwidth and stable gain, and microstrip antennas with unidirectional radiation patterns and stable gain are well suited for this purpose. In the present work, an UWB (Ultra Wide Band) antenna is implemented using a microstrip patch antenna for applications in next-generation wireless communication and the Internet of Things. A defect in the patch antenna's ground plane is used to obtain multiband operation. Gain, directivity and bandwidth are enhanced while the geometry, shape and size of the UWB antenna are reduced. The results reveal efficient wideband performance.

Arun Kumar, Manish Kumar Singh
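The starting point for any such patch design is the standard rectangular-patch design equations found in antenna textbooks. A sketch of that arithmetic (the 2.4 GHz/FR-4 example is illustrative, not the paper's design):

```python
import math

C = 3e8  # speed of light, m/s

def patch_dimensions(f, er, h):
    """Standard design equations for a rectangular microstrip patch:
    width, then effective permittivity and the fringing-corrected
    resonant length. f in Hz, er = substrate permittivity, h in metres."""
    w = C / (2 * f) * math.sqrt(2 / (er + 1))
    e_eff = (er + 1) / 2 + (er - 1) / 2 * (1 + 12 * h / w) ** -0.5
    dl = 0.412 * h * ((e_eff + 0.3) * (w / h + 0.264)) / \
         ((e_eff - 0.258) * (w / h + 0.8))
    l = C / (2 * f * math.sqrt(e_eff)) - 2 * dl
    return w, l

w, l = patch_dimensions(2.4e9, 4.4, 1.6e-3)  # 2.4 GHz patch on FR-4
print(round(w * 1000, 2), "mm wide,", round(l * 1000, 2), "mm long")
```

Ground-plane defects and slots, as used in the paper, then perturb this baseline geometry to create additional resonances for multiband/UWB behavior.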
Smartphone with Solar Charging Mechanism to Issue Alert During Rainfall Disaster

Rainfall-induced disasters are long-duration disasters in which power failure is likely, making the disaster worse through loss of communication; flash floods, long-duration floods and landslides are a few examples. In such cases, communication between victims is a challenging job. This paper presents an idea to overcome these situations. Different tools have been developed to issue disaster alerts, including flood alerts, but during a flood, power failure makes them impossible to operate for the long duration a flood can last. This paper introduces an alert network built from several technologies, including WSN, MANET and IoT. The architecture can communicate in the absence of any fixed infrastructure, as well as through social media when available. To support the architecture, the use of a solar cell on the screen, together with an OLED display, to extend the battery life of a smartphone is discussed.

Neeraj Kumar, Alka Agrawal, Raees Ahmad Khan
Performance Analysis of OSPF Routing Protocol Under Single and Multiple Link Failure

This study aims at highlighting the OSPF (Open Shortest Path First) routing technique and its performance under different network failures. A single failure in high-speed networks, even for a fraction of a second, can disrupt millions of users, so it is imperative to analyze the performance of a routing protocol under network failure. In this paper, the OSPF routing technique is examined under single and multiple link failures. Investigations are performed on the basis of average E2E (end-to-end) delay, throughput and jitter, and the convergence time for link failures is calculated. The network has the least average E2E delay when the working path does not experience any failure. Under a single link failure, average E2E delay is least when the failed link is nearest to the source node; under a double link failure, it is maximum when the failed links are farthest apart. Convergence time for multiple link failures exceeds that for a single link failure, and convergence is fastest when the failed link is immediately connected to the source node. As observed from the plots, the network achieves high throughput when the working path does not experience any failure, and throughput reduces under multiple link failures. Average jitter is maximum for the network without failure, and its value reduces under multiple link failures. The investigations in this paper provide insight into the performance of the OSPF routing technique; the analysis and the performance-metric values obtained for various link failures identify critical links and backup routes with respect to application-specific QoS (Quality of Service) requirements.

Himanshi Saini, Amit Kumar Garg
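OSPF's recovery from a link failure is an SPF (Dijkstra) recomputation over the surviving topology. A minimal editorial sketch (the toy topology and costs are invented, not from the paper's testbed):

```python
import heapq

def dijkstra(graph, src, dst, failed=frozenset()):
    """OSPF-style shortest path; `failed` holds directed links (u, v)
    removed from the topology before the SPF recomputation."""
    dist = {src: 0}
    pq = [(0, src)]
    while pq:
        d, u = heapq.heappop(pq)
        if u == dst:
            return d
        if d > dist.get(u, float("inf")):
            continue
        for v, w in graph.get(u, []):
            if (u, v) in failed:
                continue
            nd = d + w
            if nd < dist.get(v, float("inf")):
                dist[v] = nd
                heapq.heappush(pq, (nd, v))
    return float("inf")

net = {"A": [("B", 1), ("C", 4)], "B": [("C", 1), ("D", 5)],
       "C": [("D", 1)], "D": []}
print(dijkstra(net, "A", "D"))                      # primary path cost
print(dijkstra(net, "A", "D", failed={("B", "C")})) # cost after failure
```

The gap between the two costs is the kind of degradation the paper measures as increased E2E delay; convergence time is how long the real protocol takes to detect the failure and install the backup route.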
Application of ICT by Small and Medium Enterprises in Ogun State Nigeria

Small and medium enterprises (SMEs) have emerged as promising opportunities to eliminate and reduce unemployment globally. Increasing levels of technological advancement have revolutionized the dynamics of the business terrain. However, SMEs in developing countries are yet to fully explore the benefits of Information and Communications Technology (ICT). Survey data was collected from 75 SME ICT users in Abeokuta and Otta through a structured questionnaire using a stratified random sampling technique. Results of regression analysis revealed that the demographic variable Staff Strength significantly influences ICT application among SMEs, while SME service delivery had no influence. Analysis of variance also showed that the category of SME was not a determinant of ICT use. The outcome of this study has implications for SME owners, stakeholders, government and academic researchers in developing countries, as it can reveal patterns that help bridge the existing digital divide, especially among Nigerian SMEs.

Oyediran Oyebiyi, Sanjay Misra, Rytis Maskeliūnas, Robertas Damaševičius

Social and Web Analytics

Frontmatter
Leveraging Movie Recommendation Using Fuzzy Emotion Features

User-generated data like reviews encapsulate bursts of emotion produced after reading a book or watching a movie. Emotions garnered from reviews can be used to recommend items from the entertainment domain with similar emotions. Until now, most work has used emotions as discrete features in recommender systems. In this paper, we delve deeper to use fuzziness in the emotion categories. Emotional features such as love, joy, surprise, anger, sadness and fear have been shown to be effective in identifying items with similar features for recommendation. However, there is a certain degree of vagueness, with blurred boundaries between the lexicons of these categorical emotion features, that has hitherto been largely ignored. We tackle this inherent vagueness by proposing a framework for movie recommendation using fuzzy emotion features, taking each emotion category as a linguistic variable. We develop a Mamdani model to extract fuzzy classification rules for recommending movies from the emotions extracted from their reviews. Results show that a Gaussian fuzzy model with 5 linguistic variables yields 68.43% F-measure, a 10.5% improvement over the SVM-based crisp model for recommending movies.

Mala Saraswat, Shampa Chakraverty
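The fuzzification step in such a Mamdani system maps a crisp emotion score onto overlapping linguistic terms via membership functions. A minimal sketch, assuming Gaussian membership functions as in the paper but with illustrative (not fitted) centres and widths:

```python
import math

def gaussian_mf(x, mean, sigma):
    """Gaussian membership function used to fuzzify an emotion score."""
    return math.exp(-((x - mean) ** 2) / (2 * sigma ** 2))

def fuzzify(score):
    """Map a normalized emotion intensity in [0, 1] onto overlapping
    linguistic terms; centres and widths here are illustrative, not
    the paper's fitted values."""
    terms = {"low": (0.0, 0.15), "medium": (0.5, 0.15), "high": (1.0, 0.15)}
    return {t: round(gaussian_mf(score, m, s), 3) for t, (m, s) in terms.items()}

print(fuzzify(0.62))  # a "joy" score of 0.62 is partly medium, partly high
```

The overlapping memberships are exactly what lets a Mamdani rule base reason about an emotion that is "somewhat high" rather than forcing a crisp category boundary.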
Analyzation and Detection of Cyberbullying: A Twitter Based Indian Case Study

Social networking sites like Facebook and Twitter connect large populations worldwide. Though these social networks aim to bring people together, they have their own cons, and with their growth there has been an exponential increase in cybercrime on these sites. Cyberbullying, or trolling, is one such crime in which a victim is bullied with abuse, personal remarks, false claims and sarcasm on social networking sites, and is sometimes traumatized to a great extent. Many cyberbullying detection methods and systems have already been developed, but a major concern is that nearly 80%–90% of users on such sites are Indians, from one of the most populous countries in the world, who mostly communicate in Hinglish (Hindi written in English script), particularly on Facebook and Twitter. Our research aims at analyzing cyberbullying content in Hinglish tweets on one such network, Twitter. We analyzed tweets through textual analysis and also performed classification. We conclude with our findings and the future scope of work for detecting cyberbullying in more complex data.

Aastha Sahni, Naveen Raja
Characterizing and Detecting Social Outrage on Twitter: Patel Reservation in Gujarat

Social media is a platform to share ideas, opinions and discussions, and it provides scope to study social behavior and analyze the events discussed on it. The idea behind this study is to analyze social characteristics during unrest in society; such analysis can identify trends in social behavior and be utilized for decision making and anticipatory governance. For this paper, a recent social outrage in the Indian context related to caste-based reservation has been studied using the social media platform Twitter. A number of analytical methodologies have been used to understand the variations in opinion over social media during unrest. This paper investigates the level of tension during social outrage and the factors affecting it. Sentiment analysis and different machine learning methods were used to detect the level of tension, and the results were compared against manual annotation. To improve classification performance, a rule-based algorithm has been developed to detect tension during social outrage.

Sulbha Singh, Rajeev Pal
Optimizing Accuracy of Sentiment Analysis Using Deep Learning Based Classification Technique

The sentiment or opinion of a person expressed in words and phrases reflects thoughts with polarities ranging from positive through neutral to negative. The emotions hidden in expressions indicate positivity or negativity in the opinion; for example, the word DISTRESS conveys more negativity than the word SAD. Sentiment analysis is the study of the polarities hidden in natural language, using natural language processing to elucidate information hidden inside text. It is widely applied to social media reviews for a variety of applications covering the domains of business, health and government performance evaluation. This type of evaluation extracts the attitude of an author with respect to the context of the topic: to what extent the hidden information relates to joy/sadness, amazement/anger, and positive and negative emotions. This paper introduces a proposed technique using a Convolutional Neural Network for text classification. The performance of the proposed classifier is validated against Naïve Bayes, J48, BFTree, OneR, LDA and SVM. Efficacy is examined on three manually annotated datasets: one taken from the IMDB movie portal and two taken from Amazon product reviews. The accuracies of these seven machine learning techniques are compared, and the proposed technique proves more precise, achieving 85.2% precision, 82.9% F-measure and 85.46% correctly classified sentiment.

Jaspreet Singh, Gurvinder Singh, Rajinder Singh, Prithvipal Singh
Intuitionistic Fuzzy Shortest Path in a Multigraph

Multigraphs are a generalized model of graphs; a multigraph may have multiple edges between a pair of its vertices. The existing algorithms for finding fuzzy or intuitionistic fuzzy shortest paths in graphs are not applicable to multigraphs. Our work here is on the theory of multigraphs. In this paper we develop a method to search for an intuitionistic fuzzy shortest path in a directed multigraph and then derive, as a special case, a fuzzy shortest path in a multigraph. We reconstruct the classical Dijkstra algorithm, applicable to graphs with crisp weights, so that it extends to multigraphs with IFN weights. We claim the method can play a significant role in many application areas of technology, specifically in networks that cannot be modeled as graphs but can be modeled as multigraphs.

Siddhartha Sankar Biswas, Bashir Alam, M. N. Doja
A Rough Set Based Approach for Web User Profiling

E-governance plays a pivotal role in the domain of online services by ensuring round-the-clock accessibility of a wide spectrum of services. However, the huge amount of uploaded information and a vacillating user base make it rather difficult to access the desired information from the portal. This calls for a system that intelligently presents a personalized user interface, and a challenging requirement in designing such a system is classifying the diversified users on the basis of their web experience. Traditional web usage mining techniques cluster similar users primarily on the basis of their page access patterns. In this paper, we turn our attention to the level of user experience by introducing three parameters, namely page switching behavior, page probing behavior and session count, which predominantly decide the level of experience acquired by e-governance users. We make innovative use of Rough Set Theory to derive a rule-based classification system using three reduct optimization algorithms: the Johnson Algorithm, a Genetic Algorithm and the Basic Minimal classification method. To test our system, we classified the user base publicly available in the CTI dataset into two categories. The Basic Minimal method reports the highest accuracy, 74.90%, with five-fold cross-validation.

Geeta Rani, Shampa Chakraverty
Analysis and Detection of Fruit Defect Using Neural Network

Fruit quality detection is important to maintain the quality of fruits. Generally, fruit inspection is done manually, which is inefficient for farmers in the agriculture industry. This paper proposes an approach to identify fruit defects in the agricultural industry, reducing production cost and time. Since different fruit images may have similar or identical color and shape values, we use a method that increases the accuracy of fruit quality detection by combining color-, shape- and size-based features with an Artificial Neural Network (ANN). ANNs are used for estimation and in various artificial intelligence fields such as voice recognition, image recognition and robotics. The fruit quality inspection acquires images as external input, and the captured image is used for detecting defects by applying a segmentation algorithm. The output of the processed image is then used as input to the ANN. The network, trained with the backpropagation algorithm, was tested against the training dataset and predicted defects with good efficiency in much less time than human inspectors. Comparing the predicted and training datasets reveals efficient defect detection and classification for the agricultural industry.

Tanya Makkar, Seema Verma, Yogesh, Ashwani Kumar Dubey
Sentiment Analysis of Indians on GST

Twitter has become one of the most popular communication media among internet users. Millions of users share their opinions on it, making it a rich source of data for opinion mining and sentiment analysis. Our context is the Goods and Services Tax, one of the most debated topics in the media both inside and outside India. In this paper, we collected data from Twitter and tried to infer how Indians have understood the Goods and Services Tax. We discuss the methodology for preparing a corpus from Twitter for sentiment analysis and opinion mining. The collected corpus is analyzed using a lexicon-based approach, which determines the opinion of each document in three major categories: positive, negative and neutral.

Amogh Madan, Ridhima Arora, Nihar Ranjan Roy
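A lexicon-based approach scores each tweet by summing the polarities of its words against a sentiment dictionary. A toy sketch of the idea (the mini-lexicon and example tweets are invented for illustration; real work would use a resource such as an opinion lexicon):

```python
# Toy polarity lexicon; a real system would use a full sentiment resource.
LEXICON = {"good": 1, "great": 1, "benefit": 1, "simple": 1,
           "bad": -1, "confusing": -1, "burden": -1, "costly": -1}

def classify(tweet):
    """Lexicon-based polarity: sum word scores, then label by sign."""
    score = sum(LEXICON.get(w, 0) for w in tweet.lower().split())
    return "positive" if score > 0 else "negative" if score < 0 else "neutral"

print(classify("GST is a great and simple reform"))   # positive
print(classify("GST filing is confusing and costly")) # negative
print(classify("GST rollout was on July first"))      # neutral
```

The appeal of the lexicon approach is that it needs no labeled training data, at the cost of missing negation, sarcasm and domain-specific vocabulary.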

Security in Data Science Analytics

Frontmatter
Virtual Consciousness from 3D to 4D Password: A Next Generation Security System Inspiration

Many authentication schemes exist at present, but all have a few disadvantages, so the 3D password paradigm was recently introduced. The 3D password is a multi-factor authentication system, combining various authentication techniques such as graphical and textual passwords; its most important part is the inclusion of a 3D virtual environment. However, the 3D password is still in its initial stages. Designing various kinds of 3D virtual environments, interpreting user feedback, deciding on password spaces, and experience with such environments will upgrade and enhance the user experience of the 3D password. Moreover, gathering attackers from various backgrounds to break the system is future work that will lead to system improvement and demonstrate the complexity of breaking a 3D password. This paper presents a study of the 3D password and reinforces it by including a fourth dimension, dealing with time recording and gesture recognition, that would strengthen the authentication paradigm. We thereby propose the 4D password as a successor to the 3D password.

Saurabh Allawadhi, Nitin Kumar, Sanjib Kumar Sahu
Efficient and Secure Nearest Neighbor Search Over Encrypted Data in Cloud Environment

Nearest neighbor search is among the most common search queries on large data sets, location-based services, spatial databases, graph applications, etc. With the growth of cloud computing, the trend of outsourcing sensitive data demands faster and more secure nearest neighbor solutions than the existing ones. In this paper, we propose a new secure and efficient nearest neighbor search over encrypted data using mOPE (mutable order-preserving encryption). Our model uses the probabilistic data structure skip graph for efficient indexing, and the indexes are then encrypted using mOPE for efficient nearest neighbor search. A thorough analysis shows that our scheme achieves a favorable balance between security and nearest neighbor query efficiency compared to other schemes.

Kaur Upinder, R. Suri Pushpa
RREQ Flood Attack and Its Mitigation in Ad Hoc Network

The RREQ flood attack is one of the prominent attacks in wireless ad hoc networks. In a flood attack, a malicious node fills up all the routing tables with its own packets, paralyzing communication between source and destination. To secure against the flooding attack, a mitigation scheme is proposed which uses the Time to Live value to mark and remove the malicious node that floods the network. In the proposed scheme, an RREQ limit is set and checked dynamically after a certain period so that the flood attack does not occur. The proposed scheme is simulated in QualNet, and the results show that it prevents the flooding attack, reduces end-to-end delay and increases throughput.

Shweta Rani, Bhawna Narwal, Amar K. Mohapatra
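The core of any RREQ-limit defence is per-node rate accounting: count route requests per sender within a time window and blacklist senders that exceed the limit. A simplified editorial sketch of that mechanism (the limit, window and blacklisting policy are illustrative, not the paper's exact scheme):

```python
from collections import defaultdict

class RreqFilter:
    """Per-node RREQ rate limiter: a node exceeding the limit inside one
    time window is blacklisted, so a flooder cannot exhaust routing tables."""
    def __init__(self, limit=10, window=1.0):
        self.limit, self.window = limit, window
        self.counts = defaultdict(int)
        self.window_start = 0.0
        self.blacklist = set()

    def accept(self, node, now):
        if now - self.window_start >= self.window:   # start a fresh window
            self.counts.clear()
            self.window_start = now
        if node in self.blacklist:
            return False
        self.counts[node] += 1
        if self.counts[node] > self.limit:
            self.blacklist.add(node)
            return False
        return True

f = RreqFilter(limit=10)
results = [f.accept("attacker", now=0.5) for _ in range(12)]
print(results.count(True), "accepted;", results.count(False), "dropped")
```

The paper additionally uses the TTL field to mark suspected flooders; the sketch above shows only the counting-and-blacklist skeleton such schemes share.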
A Study on Integrating Crypto-Stego Techniques to Minimize the Distortion

Nowadays, online services, which communicate electronically, have become a big part of our lives. This e-communication requires confidentiality and information integrity to protect against unauthorized users. Security can be provided by two widespread techniques, cryptography and steganography; however, no single technique by itself meets the needs of a secure and robust system. Steganography hides data behind a cover image, but once the hiding is known, the secret message can be captured very easily. Moreover, embedding data into the image medium increases the risk of distortion; if a distorted image is sent over the communication channel, an intruder can easily guess that secret data has been sent, and that data can then be recovered. Cryptography, on the other hand, enciphers the text so that the secret data sent over the communication link cannot be understood. If both security mechanisms are integrated, we obtain a system that is more secure and robust. This paper presents a study of hybrid crypto-stego techniques that reduce distortion while maintaining imperceptibility and robustness, providing high security for e-communication between two authorized parties.

Neha Sharma, Usha Batra
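The crypto-then-stego pipeline the paper surveys can be sketched end to end: encrypt the message first, then embed the ciphertext in the least significant bits of the cover so each cover byte changes by at most one. This is an editorial illustration: the XOR stream cipher is a toy stand-in (a real system would use AES), and the byte array stands in for image pixels:

```python
def xor_encrypt(data: bytes, key: bytes) -> bytes:
    """Toy stream cipher (XOR with a repeating key); it is its own inverse."""
    return bytes(b ^ key[i % len(key)] for i, b in enumerate(data))

def lsb_embed(cover: bytes, secret: bytes) -> bytes:
    """Hide each secret bit in the least significant bit of one cover
    byte: at most a +/-1 change per byte, so distortion stays minimal."""
    bits = [(b >> (7 - i)) & 1 for b in secret for i in range(8)]
    out = bytearray(cover)
    for i, bit in enumerate(bits):
        out[i] = (out[i] & 0xFE) | bit
    return bytes(out)

def lsb_extract(stego: bytes, n: int) -> bytes:
    """Recover n bytes from the LSBs of the stego medium."""
    bits = [b & 1 for b in stego[:n * 8]]
    return bytes(sum(bit << (7 - i) for i, bit in enumerate(bits[k*8:(k+1)*8]))
                 for k in range(n))

key, msg = b"k3y", b"hi"
cover = bytes(range(64))                     # stand-in for image pixels
stego = lsb_embed(cover, xor_encrypt(msg, key))
recovered = xor_encrypt(lsb_extract(stego, len(msg)), key)
print(recovered)  # b'hi'
```

Even if the LSB embedding is discovered, the extracted payload is ciphertext, which is exactly the robustness argument for combining the two techniques.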
Performance Analysis of Cloud Data Verification Using MD5 and ECDSA Method

Cloud computing enables users to outsource and access data economically using storage as a service. In this storage model, the data owner does not have any control of the data once it is stored on the cloud server; therefore, privacy and security of the data are challenging issues in cloud computing. To verify the integrity of the outsourced data, we propose lightweight data-auditing techniques, the MD5 and ECDSA signature methods, using a third-party auditor. The result analysis shows that ECDSA provides better security than MD5, at higher computational cost, for larger data sizes. The choice of signature method depends on the priorities of data size and frequency of access.

G. L. Prakash, Manish Prateek, Inder Singh
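The MD5 half of the comparison is a plain digest check: the auditor stores a digest at upload time and later recomputes it over the cloud copy. A minimal sketch (the data and auditing workflow are illustrative; the paper's ECDSA variant additionally signs the digest, which MD5 alone cannot do):

```python
import hashlib

def md5_tag(data: bytes) -> str:
    """Digest stored with the third-party auditor at upload time."""
    return hashlib.md5(data).hexdigest()

def verify(data: bytes, stored_tag: str) -> bool:
    """Auditor recomputes the digest over the cloud copy and compares.
    Detects corruption, but not a malicious server that forges both
    the data and the tag; that is where a signature scheme is needed."""
    return md5_tag(data) == stored_tag

original = b"outsourced block #17"
tag = md5_tag(original)
print(verify(original, tag))          # True: copy intact
print(verify(original + b"x", tag))   # False: tampering detected
```

The trade-off the paper measures follows directly: the hash check is cheap but unauthenticated, while ECDSA signing costs more computation in exchange for non-repudiable integrity proofs.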
Research Trends in Malware Detection on Android Devices

Mobile phones have become a necessity of modern human life, storing our valuable information such as passwords, reminders, messages, photos, videos and social contacts. Advances in mobile technology have made human life easier and more efficient. At the same time, our heavy dependency on mobile devices has drawn the attention of malware authors and cyber criminals, leading to a large number of cyber-attacks. Among all platforms, the major security concern is Android smartphones, chiefly because the platform does not restrict users from downloading applications from unsafe sites. It is therefore important to develop robust and efficient Android malware detection systems to protect sensitive data from cyber-attacks on the Android platform. In this work, we discuss different types of Android malware and critically review the detection approaches that exist in the literature. We also highlight promising new directions of research in the domain of malware detection on Android devices.

Leesha Aneja, Sakshi Babbar
An Insightful View on Security and Performance of NoSQL Databases

The recent advancement of cloud computing and distributed web applications has created the need to store large amounts of data in distributed databases that provide high availability and scalability. In recent years this has driven the need for non-relational databases that can scale to the growing needs of industry while remaining highly efficient. This gave rise to NoSQL ('Not Only SQL') databases, which are highly scalable, can store large amounts of data, and support many SQL features in addition to several others. NoSQL is completely schema-less and can store any kind of data. Large amounts of data, including sensitive data, are stored in these databases every day, and the security of this sensitive data is an area of concern. This research looks into the security aspects of the top three open-source NoSQL databases: MongoDB, Cassandra and Redis. Surprisingly, data-file encryption is found to be missing in all three. Cassandra and MongoDB are safe from injection attacks, but there are many other ways to get into these databases and access the data at the back end. We provide useful methods to enhance the security of these databases and also evaluate their performance after the security improvements.

Upaang Saxena, Shelly Sachdeva
Backmatter
Metadata
Title
Data Science and Analytics
Editors
Dr. Brajendra Panda
Sudeep Sharma
Nihar Ranjan Roy
Copyright Year
2018
Publisher
Springer Singapore
Electronic ISBN
978-981-10-8527-7
Print ISBN
978-981-10-8526-0
DOI
https://doi.org/10.1007/978-981-10-8527-7
