
2017 | Book

Data Science and Big Data: An Environment of Computational Intelligence


About this book

This book presents a comprehensive and up-to-date treatise of a range of methodological and algorithmic issues. It also discusses implementations and case studies, identifies the best design practices, and assesses data analytics business models and practices in industry, health care, administration and business.

Data science and big data go hand in hand and constitute a rapidly growing area of research that has attracted the attention of industry and business alike. The area has opened up promising new directions of fundamental and applied research and has led to interesting applications, especially those addressing the immediate need to deal with large repositories of data and to build tangible, user-centric models of relationships in data. Data is the lifeblood of today’s knowledge-driven economy.

Numerous data science models are oriented towards end users, and along with the usual requirements for accuracy (present in any modeling) come requirements for the ability to process huge and varying data sets, as well as for robustness, interpretability, and simplicity (transparency). Computational intelligence, with its underlying methodologies and tools, helps address these data analytics needs.

The book is of interest to researchers and practitioners involved in data science, Internet engineering, computational intelligence, management, operations research, and knowledge-based systems.

Table of Contents

Frontmatter

Fundamentals

Frontmatter
Large-Scale Clustering Algorithms
Abstract
Computational tools in modern data analysis must be scalable to satisfy business and research time constraints. In this regard, two alternatives are possible: (i) adapt available algorithms or design new approaches that can run in a distributed computing environment, or (ii) develop model-based learning techniques that can be trained efficiently on a small subset of the data and make reliable predictions. In this chapter, two recent algorithms following these different directions are reviewed. In the first part, a scalable in-memory spectral clustering algorithm is described. This technique relies on a kernel-based formulation of the spectral clustering problem, also known as kernel spectral clustering. More precisely, a finite-dimensional approximation of the feature map via the Nyström method is used to solve the primal optimization problem, which decreases the computational time from cubic to linear. In the second part, a distributed clustering approach with a fixed computational budget is illustrated. This method extends the k-means algorithm by applying regularization at the level of prototype vectors. An optimal stochastic gradient descent scheme for learning with \(l_1\) and \(l_2\) norms is utilized, which makes the approach less sensitive to the influence of outliers while computing the prototype vectors.
Rocco Langone, Vilen Jumutc, Johan A. K. Suykens
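The cubic-to-linear reduction mentioned in the abstract comes from eigendecomposing only a small landmark block of the kernel matrix instead of the full n × n matrix. Below is a minimal NumPy/scikit-learn sketch of the Nyström idea; the RBF kernel, uniform landmark sampling, and the helper name `nystrom_spectral_clustering` are illustrative assumptions, not the chapter's kernel spectral clustering formulation.

```python
import numpy as np
from scipy.spatial.distance import cdist
from sklearn.cluster import KMeans

def nystrom_spectral_clustering(X, n_clusters=3, n_landmarks=100, gamma=1.0, seed=0):
    """Approximate spectral clustering via the Nystrom method.

    Only an n x m kernel block against m << n landmarks is formed, so the
    eigendecomposition cost drops from cubic in n to cubic in m (linear in n).
    """
    rng = np.random.default_rng(seed)
    n = X.shape[0]
    landmarks = X[rng.choice(n, size=n_landmarks, replace=False)]

    # RBF kernel blocks: C is n x m, W is m x m
    C = np.exp(-gamma * cdist(X, landmarks, "sqeuclidean"))
    W = np.exp(-gamma * cdist(landmarks, landmarks, "sqeuclidean"))

    # Eigendecompose only the small m x m block
    vals, vecs = np.linalg.eigh(W)
    vals = np.clip(vals, 1e-12, None)            # guard against tiny negative values
    top = np.argsort(vals)[::-1][:n_clusters]    # leading components

    # Nystrom extension of the eigenvectors to all n points: U ~ C V diag(1/vals)
    embedding = (C @ vecs[:, top]) / vals[top]

    # Cluster the points in the approximate spectral embedding
    return KMeans(n_clusters=n_clusters, n_init=10, random_state=seed).fit_predict(embedding)

# Example on synthetic blobs
from sklearn.datasets import make_blobs
X, _ = make_blobs(n_samples=2000, centers=3, random_state=0)
print(nystrom_spectral_clustering(X, n_clusters=3)[:10])
```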
On High Dimensional Searching Spaces and Learning Methods
Abstract
In data science, several important parameters affect the accuracy of the algorithms used, among them the type of data objects, the membership assignments, and the distance or similarity functions. In this chapter we describe different data types, membership functions, and similarity functions and discuss the pros and cons of each. Conventional similarity functions evaluate objects in the vector space. In contrast, Weighted Feature Distance (WFD) functions compare data objects in both feature and vector spaces, preventing the system from being dominated by a few strong features. Traditional membership functions assign membership values to data objects but impose some restrictions. The Bounded Fuzzy Possibilistic Method (BFPM) makes it possible for data objects to participate fully or partially in several clusters or even in all clusters. BFPM introduces intervals for the upper and lower membership boundaries of data objects with respect to each cluster. BFPM helps algorithms converge and also inherits the abilities of conventional fuzzy and possibilistic methods. In Big Data applications, knowing the exact type of data objects and selecting the most accurate similarity [1] and membership assignments is crucial for decreasing computing costs and obtaining the best performance. This chapter provides data type taxonomies to assist data miners in selecting the right learning method for each data set. Examples illustrate how to evaluate the accuracy and performance of the proposed algorithms, and experimental results show why these parameters are important.
Hossein Yazdani, Daniel Ortiz-Arroyo, Kazimierz Choroś, Halina Kwasnicka
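The precise WFD and BFPM definitions appear in the chapter; purely as a hypothetical illustration of why per-feature weighting keeps one dominant feature from drowning out the rest, consider the sketch below (the function name and weights are invented for the example).

```python
import numpy as np

def weighted_feature_distance(x, y, weights=None):
    """Distance that compares objects feature by feature.

    Per-feature weights bound the influence of dominant features;
    with uniform weights this reduces to the Euclidean distance.
    """
    x, y = np.asarray(x, dtype=float), np.asarray(y, dtype=float)
    if weights is None:
        weights = np.ones_like(x)
    return np.sqrt(np.sum(weights * (x - y) ** 2))

# Feature 0 (e.g. income in dollars) dominates the unweighted distance
a, b = [50000.0, 1.2, 3.0], [52000.0, 0.9, 2.0]
print(weighted_feature_distance(a, b))                    # ~2000, driven by feature 0
print(weighted_feature_distance(a, b, [1e-8, 1.0, 1.0]))  # all features now contribute
```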
Enhanced Over_Sampling Techniques for Imbalanced Big Data Set Classification
Abstract
Facing hundreds of gigabytes of data has triggered a need to reconsider data management options. There is a tremendous requirement to study data sets beyond the capability of commonly used software tools to capture, curate, and manage within a tolerable elapsed time, and beyond the processing feasibility of a single-machine architecture. In addition to traditional structured data, the new avenue of NoSQL Big Data has prompted a call for new techniques and technologies, which help to discover the large hidden value in huge datasets that are complex, diverse, and of massive scale. In many real-world applications, the classification of imbalanced datasets is a priority concern. Standard classifier learning algorithms assume a balanced class distribution and equal misclassification costs; as a result, most of them suffer a notable drop in performance on datasets with imbalanced class distributions. Moreover, most classification methods focus on the two-class imbalance problem rather than the multi-class imbalance problem that arises in real-world domains. A methodology is introduced for single-class/multi-class imbalanced data sets (Lowest vs. Highest—LVH) with enhanced over_sampling (O.S.) techniques (MEre Mean Minority Over_Sampling Technique—MEMMOT, Majority Minority Mix mean—MMMm, Nearest Farthest Neighbor_Mid—NFN-M, Clustering Minority Examples—CME, Majority Minority Cluster Based Under_Over Sampling Technique—MMCBUOS, Updated Class Purity Maximization—UCPM) to improve classification. The study takes broadly two views: comparing the enhanced non-cluster techniques to prior work, and taking a clustering-based approach for the advanced O.S. techniques. Finally, the balanced data is used to build a Random Forest (R.F.) classifier. The O.S. techniques are designed to be applied to imbalanced Big Data in a MapReduce environment, with experiments proposed on Apache Hadoop and Apache Spark using different datasets from the UCI/KEEL repositories. Geometric mean, F-measure, area under the curve (AUC), average accuracy, and Brier score are used to measure classification performance.
Sachin Subhash Patil, Shefali Pratap Sonavane
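The six O.S. techniques are specified in the chapter itself; as a toy sketch of the general idea behind mean-based minority over-sampling (not the MEMMOT algorithm), synthetic points can be drawn on the segment between minority examples and the minority class mean:

```python
import numpy as np

def mean_minority_oversample(X, y, minority_label, n_new, seed=0):
    """Toy mean-based minority over-sampling.

    Each synthetic point lies on the segment between a sampled minority
    example and the minority class mean, so new points stay inside the
    minority region instead of merely duplicating existing examples.
    """
    rng = np.random.default_rng(seed)
    X_min = X[y == minority_label]
    centroid = X_min.mean(axis=0)

    idx = rng.choice(len(X_min), size=n_new)
    alpha = rng.uniform(0.0, 1.0, size=(n_new, 1))   # position along the segment
    X_new = X_min[idx] + alpha * (centroid - X_min[idx])

    y_new = np.full(n_new, minority_label)
    return np.vstack([X, X_new]), np.concatenate([y, y_new])

# Rebalance a 95:5 class ratio before training, e.g., a Random Forest
X = np.vstack([np.random.randn(95, 2), np.random.randn(5, 2) + 3.0])
y = np.array([0] * 95 + [1] * 5)
X_bal, y_bal = mean_minority_oversample(X, y, minority_label=1, n_new=90)
```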
Online Anomaly Detection in Big Data: The First Line of Defense Against Intruders
Abstract
We live in a world with an abundance of information but lack the ability to fully benefit from it; as John Naisbitt succinctly put it in his 1982 book, “we are drowning in information, but starved for knowledge”. The information, collected by various sensors and humans, is corrupted by noise, ambiguity and distortions and suffers from the data deluge problem. Combining the noisy, ambiguous and distorted information that comes from a variety of sources scattered around the globe in order to synthesize accurate and actionable knowledge is a challenging problem. To make things even more complex, there are intentionally developed intrusive mechanisms that aim to disturb accurate information fusion and knowledge extraction; these mechanisms include cyber attacks, cyber espionage and cyber crime, to name a few. Intrusion detection has become a major research focus over the past two decades, and several intrusion detection approaches, such as rule-based, signature-based and computational intelligence based approaches, have been developed. Of these, computational intelligence based anomaly detection mechanisms show the ability to handle hitherto unknown intrusions and attacks. However, these approaches suffer from two issues: (i) they are not designed to detect similar attacks on a large number of devices, and (ii) they are not designed for quickest detection. In this chapter, we describe an approach that helps scale up existing computational intelligence approaches to implement quickest anomaly detection on millions of devices at the same time.
Balakumar Balasingam, Pujitha Mannaru, David Sidoti, Krishna Pattipati, Peter Willett
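The chapter's scalable detection mechanism is its own contribution; purely as background on what "quickest detection" means, the sketch below implements the classical CUSUM (Page's) test for the fastest detection of a mean shift in a single stream (the Gaussian models, threshold and synthetic data are illustrative assumptions):

```python
import numpy as np

def cusum_detector(stream, mu0, mu1, sigma, threshold):
    """Page's CUSUM test for quickest detection of a mean shift.

    Accumulates the log-likelihood ratio between the post-change (mu1)
    and pre-change (mu0) Gaussian models, clipped at zero; an alarm is
    raised as soon as the statistic crosses the threshold.
    """
    s = 0.0
    for t, x in enumerate(stream):
        llr = (mu1 - mu0) / sigma**2 * (x - (mu0 + mu1) / 2.0)
        s = max(0.0, s + llr)
        if s > threshold:
            return t          # alarm time (sample index)
    return None               # no alarm raised

rng = np.random.default_rng(1)
normal = rng.normal(0.0, 1.0, 500)      # pre-change behaviour
attack = rng.normal(1.5, 1.0, 100)      # mean shifts at sample 500
print(cusum_detector(np.concatenate([normal, attack]), 0.0, 1.5, 1.0, 10.0))
```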
Developing Modified Classifier for Big Data Paradigm: An Approach Through Bio-Inspired Soft Computing
Abstract
Emerging big data applications span many different blends of applications in which classification, accuracy and precision are major concerns. Contemporary issues include detecting multiple autonomous data sources and unstructured trends in data. It therefore becomes mandatory to follow a suitable classification scheme and, in addition, to label data appropriately so that relevant computational intelligence techniques can be applied. This is significant where the movement of data is random and the data are linked, e.g. social network and blog data, transportation data, and even data supporting low-carbon road transport policies. The research community has questioned whether supervised classification techniques alone can be useful for such diversified, imbalanced classification. Conventionally, majority and minority class detection has been based on supervised features following standard data mining principles, and the minority class is over-sampled against the majority (positive) class by taking each minority class sample; significant computationally intelligent methodologies have indeed been introduced for this purpose. However, following the philosophy of data science and big data, heterogeneous classes, over-sampling, and mis-labelled data features cannot be standardized with hard classification. Hence, a conventional algorithm can be modified to support ensemble data sets for precise classification under big and random data, which is achieved here through a proposed monkey-algorithm-based dynamic classification under imbalance. The proposed algorithm is not completely supervised; rather, it is blended with a certain number of pre-defined examples and iterations. The approach could be made more specific by hybridizing additional soft computing methods with bio-inspired algorithms.
Youakim Badr, Soumya Banerjee
Unified Framework for Control of Machine Learning Tasks Towards Effective and Efficient Processing of Big Data
Abstract
Big data can be generally characterised by the 5 Vs—Volume, Velocity, Variety, Veracity and Variability. Many studies have focused on using machine learning as a powerful tool for big data processing. In the machine learning context, learning algorithms are typically evaluated in terms of accuracy, efficiency, interpretability and stability. These four dimensions relate strongly to veracity, volume, variety and variability and are impacted by both the nature of learning algorithms and the characteristics of data. This chapter analyses in depth how the quality of computational models can be impacted by data characteristics as well as by the strategies involved in learning algorithms. It also introduces a unified framework for the control of machine learning tasks towards the appropriate employment of algorithms and the efficient processing of big data. In particular, this framework is designed to achieve effective selection of data pre-processing techniques: selecting relevant attributes, sampling representative training and test data, and appropriately handling missing values and noise. More importantly, the framework allows the employment of suitable machine learning algorithms, on the basis of the training data provided by the pre-processing stage, towards building accurate, efficient and interpretable computational models.
Han Liu, Alexander Gegov, Mihaela Cocea
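The framework itself is developed in the chapter; as a loose analogy under our own assumptions, the staged control of pre-processing and learning resembles a scikit-learn pipeline in which missing-value handling, attribute selection and model fitting are chained explicitly (the dataset and hyperparameters below are arbitrary):

```python
from sklearn.datasets import load_breast_cancer
from sklearn.model_selection import train_test_split
from sklearn.pipeline import Pipeline
from sklearn.impute import SimpleImputer
from sklearn.feature_selection import SelectKBest, f_classif
from sklearn.tree import DecisionTreeClassifier

# Representative, stratified sampling of training and test data
X, y = load_breast_cancer(return_X_y=True)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, stratify=y, random_state=0)

# Pre-processing stages feed the learner, mirroring the framework's
# "select pre-processing, then select the algorithm" flow
pipeline = Pipeline([
    ("impute", SimpleImputer(strategy="median")),    # deal with missing values
    ("select", SelectKBest(f_classif, k=10)),        # keep relevant attributes
    ("model", DecisionTreeClassifier(max_depth=4)),  # interpretable model
])
print(pipeline.fit(X_tr, y_tr).score(X_te, y_te))
```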
An Efficient Approach for Mining High Utility Itemsets Over Data Streams
Abstract
Mining frequent itemsets considers only the number of occurrences of the itemsets in the transaction database. Mining high utility itemsets considers the purchased quantities and the profits of the itemsets in the transactions, so that the profitable products can be found. In addition, transactions continuously accumulate over time, such that the size of the database becomes larger and larger; furthermore, older transactions, which no longer represent current user behavior, need to be removed. An environment in which transactions are continuously added and removed over time is called a data stream. When transactions are added or deleted, the set of high utility itemsets changes. Previously proposed algorithms for mining high utility itemsets over data streams need to rescan the original database and generate a large number of candidate high utility itemsets, without reusing the previously discovered high utility itemsets. Therefore, this chapter proposes an approach for efficiently mining high utility itemsets over data streams. When transactions are added to or removed from the transaction database, our algorithm needs neither to scan the original transaction database nor to search through a large number of candidate itemsets. Experimental results show that our algorithm outperforms the previous approaches.
Show-Jane Yen, Yue-Shi Lee
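To make the utility measure concrete: the utility of an itemset in a transaction is the purchased quantity times the unit profit of each item, summed over the itemset; its total utility sums this over all transactions containing the whole itemset. A brute-force sketch with invented items and profits follows (the chapter's stream algorithm avoids exactly this kind of full rescan):

```python
# Unit profit per item (invented values)
profit = {"bread": 1.0, "ham": 3.0, "wine": 8.0}

# Each transaction maps an item to its purchased quantity
transactions = [
    {"bread": 2, "ham": 1},
    {"bread": 1, "wine": 2},
    {"ham": 2, "wine": 1, "bread": 3},
]

def utility(itemset, db):
    """Total utility of an itemset: quantity * profit, summed over the
    transactions that contain every item of the itemset."""
    total = 0.0
    for t in db:
        if all(item in t for item in itemset):
            total += sum(t[item] * profit[item] for item in itemset)
    return total

print(utility({"bread", "ham"}, transactions))   # (2*1 + 1*3) + (3*1 + 2*3) = 14.0
```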
Event Detection in Location-Based Social Networks
Abstract
With the advent of social networks and the rise of mobile technologies, users have become ubiquitous sensors capable of monitoring various real-world events in a crowd-sourced manner. Location-based social networks have proven to be faster than traditional media channels in reporting and geo-locating breaking news; for example, Osama Bin Laden’s death was first confirmed on Twitter even before the announcement from the communication department at the White House. However, the deluge of user-generated data on these networks requires intelligent systems capable of identifying and characterizing such events in a comprehensive manner. The data mining community coined the term event detection to refer to the task of uncovering emerging patterns in data streams. Nonetheless, most data mining techniques do not reproduce the underlying data generation process, which hampers their ability to self-adapt in fast-changing scenarios. Because of this, we propose a probabilistic machine learning approach to event detection that explicitly models the data generation process and enables reasoning about the discovered events. To set forth the differences between the two approaches, we present two techniques for the problem of event detection in Twitter: a data mining technique called Tweet-SCAN and a machine learning technique called Warble. We assess and compare both techniques on a dataset of tweets geo-located in the city of Barcelona during its annual festivities. Last but not least, we present the algorithmic changes and data processing frameworks needed to scale the proposed techniques up to big data workloads.
Joan Capdevila, Jesús Cerquides, Jordi Torres
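Tweet-SCAN builds on density-based clustering; as a bare-bones, spatial-only illustration (the actual technique also models time, text and user, and the coordinates below are invented), densely packed geo-tagged posts can be grouped with DBSCAN while sparse ones are flagged as noise:

```python
import numpy as np
from sklearn.cluster import DBSCAN

# Hypothetical geo-located tweets: (latitude, longitude) around Barcelona
coords = np.array([
    [41.3851, 2.1734], [41.3853, 2.1730], [41.3849, 2.1737],  # dense burst
    [41.4036, 2.1744], [41.4034, 2.1748],                     # pair, too sparse
    [41.4500, 2.2100],                                        # isolated tweet
])

# eps ~ 0.001 degrees (~100 m); a point needs min_samples neighbours
# (itself included) to seed an "event" cluster, others become noise (-1)
labels = DBSCAN(eps=0.001, min_samples=3).fit_predict(coords)
print(labels)   # [0 0 0 -1 -1 -1]: only the first burst is dense enough
```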

Applications

Frontmatter
Using Computational Intelligence for the Safety Assessment of Oil and Gas Pipelines: A Survey
Abstract
The applicability of intelligent techniques to the safety assessment of oil and gas pipelines is investigated in this study. Crude oil and natural gas are usually transmitted through metallic pipelines. Working in unforgiving environments, these pipelines may extend for hundreds of kilometers, which makes them very susceptible to physical damage such as dents, cracks, and corrosion. These defects, if not managed properly, can lead to catastrophic consequences in terms of both financial losses and human life. Thus, effective and efficient systems for pipeline safety assessment, capable of detecting defects, estimating defect sizes, and classifying defects, are urgently needed. Such systems often require collecting diagnostic data gathered using different monitoring tools such as ultrasound, magnetic flux leakage, and Closed Circuit Television (CCTV) surveys. The volume of the data collected by these tools is staggering, and relying on traditional pipeline safety assessment techniques to analyze such huge data sets is neither efficient nor effective. Intelligent techniques such as data mining, neural networks, and hybrid neuro-fuzzy systems are promising alternatives. In this chapter, different intelligent techniques proposed in the literature are examined, and their merits and shortcomings are highlighted.
Abduljalil Mohamed, Mohamed Salah Hamdi, Sofiène Tahar
Big Data for Effective Management of Smart Grids
Abstract
The energy industry is facing a set of changes: old grids need to be replaced, the alternative energy market is growing, and consumers want more control over their consumption. At the same time, the ever-increasing pervasiveness of technology, together with the smart paradigm, is becoming the reference point for anyone involved in innovation and energy management. In this context, the information that can potentially be made available by technological innovation is substantial. Nevertheless, in order to turn it into better and more efficient decisions, three sets of issues must be kept in mind: those related to the management of the generated data streams, those related to the quality of the data, and those related to their usability for human decision-makers. In a smart grid, large amounts and various types of data, such as device status data, electricity consumption data, and user interaction data, are collected. As described in several scientific papers, many data analysis techniques, including optimization, forecasting and classification, can then be applied to these large amounts of smart grid big data. Several techniques based on Big Data analysis with computational intelligence are available to optimize power generation and operation in real time, to predict electricity demand and consumption, and to develop dynamic pricing mechanisms. The aim of this chapter is to critically analyze the way Big Data is utilized in the field of energy management in the smart grid, addressing open problems and discussing the important trends.
Alba Amato, Salvatore Venticinque
Distributed Machine Learning on Smart-Gateway Network Towards Real-Time Indoor Data Analytics
Abstract
Computational intelligence techniques are intelligent computational methodologies, such as neural networks, for solving complex real-world problems. One example is the design of a smart agent that makes decisions within an environment in response to the presence of human beings. A smart building/home is a typical computational intelligence based system, enriched with sensors to gather information and processors to analyze it. Indoor computational intelligence based agents can perform behavior or feature extraction from environmental data such as power, temperature, and lighting data, and hence help improve the comfort level of human occupants in a building. Current indoor systems cannot address dynamic ambient changes with a real-time response in emergencies, because a cloud processing backend introduces latency. Therefore, in this chapter we introduce distributed machine learning algorithms (SVM and neural networks) mapped onto smart-gateway networks. Scalability and robustness are considered so that real-time data analytics can be performed. Furthermore, as the success of the system depends on the trust of its users, network intrusion detection for the smart gateway has also been developed to provide system security. Experimental results show that, with distributed machine learning mapped onto smart-gateway networks, real-time data analytics can be performed to support sensitive, responsive, and adaptive intelligent systems.
Hantao Huang, Rai Suleman Khalid, Hao Yu
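A minimal sketch of the local-training idea, assuming each gateway keeps its own shard of sensor data and fits a small SVM on it (the feature names, toy label, and two-gateway split are illustrative assumptions, not the chapter's distributed algorithm):

```python
import numpy as np
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC

rng = np.random.default_rng(0)

# Hypothetical sensor features per reading: [power (kW), temperature (C), lux]
X = np.column_stack([
    rng.uniform(0.1, 2.0, 200),
    rng.uniform(18.0, 30.0, 200),
    rng.uniform(0.0, 800.0, 200),
])
y = (X[:, 1] > 26.0).astype(int)   # toy "too warm, act now" label

def train_local_model(X_local, y_local):
    """Each gateway fits a small scaled SVM on its own shard of data,
    avoiding the latency of a round trip to a cloud backend."""
    return make_pipeline(StandardScaler(), SVC(kernel="rbf")).fit(X_local, y_local)

# Shard the readings across gateways; the data never leaves its gateway
shards = np.array_split(np.arange(len(X)), 2)
models = {f"gateway-{i}": train_local_model(X[idx], y[idx])
          for i, idx in enumerate(shards)}

print(models["gateway-0"].predict([[1.0, 27.5, 300.0]]))   # e.g. array([1])
```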
Predicting Spatiotemporal Impacts of Weather on Power Systems Using Big Data Science
Abstract
Due to the increase in extreme weather conditions and the deterioration of aging infrastructure, the number and frequency of electricity network outages are dramatically escalating, mainly because of the high exposure of network components to the elements. Combined, 75% of power outages are either caused directly by weather-inflicted faults (e.g., lightning, wind impact) or indirectly by equipment failures due to wear and tear combined with weather exposure (e.g., prolonged overheating). In addition, the penetration of renewables in electric power systems is on the rise; the country’s solar capacity is estimated to double by the end of 2016. The significant dependence of renewables on weather conditions makes them highly variable and intermittent. In order to develop automated approaches for evaluating weather impacts on the electric power system, a comprehensive analysis of a large amount of data needs to be performed. The problem addressed in this chapter is how such Big Data can be integrated, spatio-temporally correlated, and analyzed in real time, in order to improve the capability of the modern electricity network to deal with weather-caused emergencies.
Mladen Kezunovic, Zoran Obradovic, Tatjana Dokic, Bei Zhang, Jelena Stojanovic, Payman Dehghanian, Po-Chen Chen
Backmatter
Metadata
Title
Data Science and Big Data: An Environment of Computational Intelligence
Editors
Witold Pedrycz
Shyi-Ming Chen
Copyright Year
2017
Electronic ISBN
978-3-319-53474-9
Print ISBN
978-3-319-53473-2
DOI
https://doi.org/10.1007/978-3-319-53474-9
