
2015 | Book

Big Data in Complex Systems

Challenges and Opportunities

Editors: Aboul Ella Hassanien, Ahmad Taher Azar, Vaclav Snasael, Janusz Kacprzyk, Jemal H. Abawajy

Publisher: Springer International Publishing

Book Series: Studies in Big Data


About this book

This volume provides updated, in-depth material on the application of Big Data to complex systems, presenting the challenges and opportunities involved in finding solutions to the problems facing big data applications. Much data today is not natively in structured format; for example, tweets and blogs are weakly structured pieces of text, while images and video are structured for storage and display, but not for semantic content and search. Transforming such content into a structured format for later analysis is therefore a major challenge. Data analysis, organization, retrieval, and modeling are other foundational challenges treated in this book. The material will be useful for researchers and practitioners in the field of big data as well as advanced undergraduate and graduate students. Each of the 17 chapters opens with an abstract and a list of key terms. The chapters are organized along the lines of problem description and related work, and analyses of results and comparisons are provided whenever feasible.

Table of Contents

Frontmatter
Cloud Computing Infrastructure for Massive Data: A Gigantic Task Ahead
Abstract
Today, in the era of computing, we collect and store data from innumerable sources, including Internet transactions, social media, mobile devices and automated sensors. From all of these sources massive or big data is generated and gathered in order to find useful patterns. The amount of data is growing at an enormous rate: analysts forecast that global big data storage will grow at a rate of 31.87% over the period 2012-2016, so storage must be highly scalable as well as flexible, so that the entire system does not need to be brought down to increase storage. Storing and accessing massive data requires appropriate storage hardware and network infrastructure.
Cloud computing can be viewed as one of the most viable technologies for handling big data and providing infrastructure as a service, and these services should be uninterrupted. It is also one of the most cost-effective techniques for the storage and analysis of big data.
Cloud computing and massive data are two rapidly evolving technologies in modern-day business applications. A lot of hope and optimism surround these technologies, because analysis of massive or big data provides better insight into the data, which may create competitive advantage and generate data-related innovations with tremendous potential to revive business bottom lines. Traditional ICT (information and communication technology) is inadequate and ill-equipped to handle terabytes or petabytes of data, whereas cloud computing promises unlimited, on-demand, elastic computing and data storage resources without the huge upfront investment otherwise required when setting up traditional data centers. These two technologies are on converging paths, and their combination is proving powerful when it comes to performing analytics. At the same time, cloud computing platforms provide massive scalability, 99.999% reliability, high performance, and specifiable configurability. These capabilities are provided at relatively low cost compared to dedicated infrastructures.
There is an element of over-enthusiasm and unrealistic expectation with regard to the use and future of these technologies. This chapter draws attention to the challenges and risks involved in the use and implementation of these nascent technologies. Downtime, data privacy and security, the scarcity of big data analysts, the validity and accuracy of emerging data patterns, and many more such issues need to be carefully examined before switching from legacy data storage infrastructure to cloud storage. The chapter elucidates the possible trade-offs between storing data on legacy infrastructure and in the cloud. It emphasizes that cautious and selective use of big data and cloud technologies is advisable until these technologies mature.
Renu Vashist
Big Data Movement: A Challenge in Data Processing
Abstract
This chapter discusses modern methods of data processing, especially data parallelization and data processing by bio-inspired methods. The synthesis of novel methods is performed by selected evolutionary algorithms and demonstrated on astrophysical data sets. Such an approach is now characteristic of so-called Big Data and Big Analytics. First, we describe some new database architectures that support Big Data storage and processing. We also discuss selected Big Data issues, specifically data sources, characteristics, processing, and analysis. Particular interest is devoted to parallelism in the service of data processing, and we discuss this topic in detail. We show how new technologies encourage programmers to consider parallel processing not only in a distributive way (horizontal scaling), but also within each server (vertical scaling). The chapter also discusses at length the interdisciplinary intersection of astrophysics and computer science that has been denoted astroinformatics, including a variety of data sources and examples. The last part of the chapter is devoted to selected bio-inspired methods and their application to simple model synthesis from astrophysical Big Data collections. We suggest how new algorithms can be synthesized by a bio-inspired approach and demonstrate its application on an astronomical Big Data collection. The usability of these algorithms, along with general remarks on the limits of computing, is discussed at the conclusion of the chapter.
Jaroslav Pokorný, Petr Škoda, Ivan Zelinka, David Bednárek, Filip Zavoral, Martin Kruliš, Petr Šaloun
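The contrast the chapter draws between horizontal and vertical scaling can be illustrated in a few lines of Python. The sketch below parallelizes a simple aggregation across the cores of a single server (vertical scaling) using only the standard library; the synthetic values are placeholders, not the astrophysical data sets discussed in the chapter.

```python
# A minimal sketch of vertical scaling: one aggregation spread over the
# cores of a single server with Python's standard multiprocessing module.
import multiprocessing as mp
import random

def chunk_stats(values):
    """Per-worker partial aggregate: (sum, count) for one chunk."""
    return sum(values), len(values)

if __name__ == "__main__":
    random.seed(0)
    data = [random.gauss(15.0, 2.0) for _ in range(1_000_000)]  # placeholder column
    n_workers = mp.cpu_count()
    chunks = [data[i::n_workers] for i in range(n_workers)]     # stride-based split

    with mp.Pool(n_workers) as pool:
        partials = pool.map(chunk_stats, chunks)

    total = sum(s for s, _ in partials)
    count = sum(c for _, c in partials)
    print("parallel mean:", total / count)
```

The same map-then-combine shape is what horizontal scaling applies across servers instead of cores.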
Towards Robust Performance Guarantees for Models Learned from High-Dimensional Data
Abstract
Models learned from high-dimensional spaces, where the number of features can exceed the number of observations, are susceptible to overfitting, since the selection of subspaces of interest for the learning task is prone to occur by chance. In these spaces, model performance is commonly highly variable and dependent on the target error estimators, data regularities and model properties. Highly variable performance is a common problem in the analysis of omics data, healthcare data, collaborative filtering data, and datasets composed of features extracted from unstructured data or mapped from multi-dimensional databases. In these contexts, assessing the statistical significance of the performance guarantees of models learned from high-dimensional spaces is critical to validate and weight the increasingly available scientific statements derived from the behavior of these models. This chapter therefore surveys the challenges and opportunities of evaluating models learned in big data settings from the less-studied angle of big dimensionality. In particular, we propose a methodology to bound and compare the performance of multiple models. First, a set of prominent challenges is synthesized. Second, a set of principles is proposed to answer the identified challenges. These principles provide a roadmap with decisions to: i) select adequate statistical tests, loss functions and sampling schemes, ii) infer performance guarantees from multiple settings, including varying data regularities and learning parameterizations, and iii) guarantee their applicability to different types of models, including classification and descriptive models. To our knowledge, this work is the first attempt to provide a robust and flexible assessment of distinct types of models sensitive to both the dimensionality and the size of data. Empirical evidence supports the relevance of these principles, as they offer a coherent setting to bound and compare the performance of models learned in high-dimensional spaces, and to study and refine the behavior of these models.
Rui Henriques, Sara C. Madeira
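As a rough illustration of the kind of assessment the chapter argues for (not the authors' methodology), the sketch below estimates an interval on the accuracy of a model trained on a synthetic dataset with far more features than observations, via repeated cross-validation; the data shape, model and interval choice are all illustrative assumptions.

```python
# Minimal sketch: bound expected performance in a p >> n setting with
# repeated cross-validation and a Student-t interval over fold scores.
import numpy as np
from scipy import stats
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import RepeatedStratifiedKFold, cross_val_score

# Synthetic high-dimensional data: 60 observations, 2000 features.
X, y = make_classification(n_samples=60, n_features=2000, n_informative=20,
                           random_state=0)

cv = RepeatedStratifiedKFold(n_splits=5, n_repeats=10, random_state=0)
scores = cross_val_score(LogisticRegression(max_iter=2000), X, y, cv=cv)

# Fold scores are not fully independent, so this interval is optimistic --
# exactly the kind of caveat a principled assessment has to address.
half_width = (stats.t.ppf(0.975, df=len(scores) - 1)
              * scores.std(ddof=1) / np.sqrt(len(scores)))
print(f"accuracy = {scores.mean():.3f} +/- {half_width:.3f}")
```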
Stream Clustering Algorithms: A Primer
Abstract
Stream data have become ubiquitous due to advances in acquisition technology and pervade numerous applications. These massive data, gathered as a continuous flow, are often accompanied by a dire need for real-time processing. One aspect of data streams deals with storage management and the processing of continuous queries for aggregation. Another significant aspect pertains to the discovery and understanding of hidden patterns to derive actionable knowledge using mining approaches. This chapter focuses on stream clustering and presents a primer of clustering algorithms in the data stream environment.
Clustering of data streams has gained importance because of its ability to capture natural structures from unlabeled, non-stationary data. A single scan of the data, bounded memory usage, and capturing data evolution are the key challenges when clustering streaming data. We elaborate and compare the algorithms on the basis of these constraints. We also propose a taxonomy of algorithms based on the fundamental approaches used for clustering. For each approach, a systematic description of contemporary, well-known algorithms is presented. We place special emphasis on the synopsis data structure used to consolidate the characteristics of streaming data and feature it as an important issue in the design of a stream clustering algorithm. We argue that a number of functional and operational characteristics of a clustering algorithm (e.g. quality of clustering, handling of outliers, number of parameters) are influenced by the choice of synopsis. A summary of the clustering features supported by different algorithms is given. Finally, research directions for improving the usability of stream clustering algorithms are suggested.
Sharanjit Kaur, Vasudha Bhatnagar, Sharma Chakravarthy
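To make the notion of a synopsis concrete, here is a minimal sketch of a CluStream-style micro-cluster that keeps only additive statistics (count, linear sum, squared sum), so each point is seen once and memory stays bounded. The distance threshold and toy points are illustrative, not taken from any specific algorithm surveyed in the chapter.

```python
# Single-pass clustering with an additive micro-cluster synopsis.
import math

class MicroCluster:
    """Synopsis of one cluster: count, per-dimension linear sum and squared sum."""
    def __init__(self, point):
        self.n = 1
        self.ls = list(point)
        self.ss = [x * x for x in point]

    def absorb(self, point):
        self.n += 1
        for i, x in enumerate(point):
            self.ls[i] += x
            self.ss[i] += x * x

    def centroid(self):
        return [s / self.n for s in self.ls]

def cluster_stream(stream, threshold=1.0):
    """Absorb each incoming point into the nearest micro-cluster, or open a new one."""
    clusters = []
    for point in stream:
        best, best_dist = None, float("inf")
        for mc in clusters:
            d = math.dist(point, mc.centroid())
            if d < best_dist:
                best, best_dist = mc, d
        if best is not None and best_dist <= threshold:
            best.absorb(point)
        else:
            clusters.append(MicroCluster(point))
    return clusters

clusters = cluster_stream([(0.1, 0.2), (0.15, 0.25), (5.0, 5.1), (5.2, 4.9)])
print([c.centroid() for c in clusters])   # two synopses, one per natural group
```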
Cross Language Duplicate Record Detection in Big Data
Abstract
The importance of data accuracy and quality has increased with the explosion of data sizes. This factor is crucial to ensure the success of any cross-enterprise integration application, business intelligence or data mining solution. Detecting duplicate data that represent the same real-world object more than once in a given dataset is the first step to ensuring data accuracy. This operation becomes more complicated when the same object name (person, city) is represented in multiple natural languages, due to several factors including spelling, typographical and pronunciation variation, dialects, special vowel and consonant distinctions, and other linguistic characteristics. It is therefore difficult to decide whether or not two syntactic values (names) are alternative designations of the same semantic entity. To the authors' knowledge, previously proposed duplicate record detection (DRD) algorithms and frameworks support only single-language duplicate record detection, or at most bilingual detection. In this chapter, two available duplicate record detection tools are compared. Then, a generic cross-language duplicate record detection solution architecture is proposed, designed and implemented to support the wide range of variations across several languages. The proposed system design uses a dictionary based on phonetic algorithms and supports different indexing/blocking techniques to allow fast processing. The framework uses several proximity matching algorithms, performance evaluation metrics and classifiers to suit the diversity of name matching across several languages. The framework is implemented and verified empirically in several case studies. Several experiments are executed to compare the advantages and disadvantages of the proposed system relative to other tools. Results show that the proposed system offers substantial improvements compared to the well-known tools.
Ahmed H. Yousef
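Two building blocks common to duplicate record detection, blocking on a phonetic key and scoring candidate pairs with a string similarity, can be sketched in a few lines. The key, threshold and sample names below are illustrative; the chapter's dictionary-based, cross-language framework is considerably richer.

```python
# Minimal sketch: phonetic blocking plus normalised string similarity.
from difflib import SequenceMatcher

def phonetic_key(name):
    """Soundex-like key: keep the first letter, drop vowels, collapse repeats."""
    letters = [ch for ch in name.upper() if ch.isalpha()]
    if not letters:
        return ""
    key = [letters[0]]
    for ch in letters[1:]:
        if ch not in "AEIOUYHW" and ch != key[-1]:
            key.append(ch)
    return "".join(key)

def similarity(a, b):
    return SequenceMatcher(None, a.lower(), b.lower()).ratio()

def candidate_duplicates(records, threshold=0.8):
    """Group records by phonetic block, then compare names only within blocks."""
    blocks = {}
    for rec in records:
        blocks.setdefault(phonetic_key(rec["name"]), []).append(rec)
    pairs = []
    for block in blocks.values():
        for i in range(len(block)):
            for j in range(i + 1, len(block)):
                score = similarity(block[i]["name"], block[j]["name"])
                if score >= threshold:
                    pairs.append((block[i]["id"], block[j]["id"], round(score, 2)))
    return pairs

records = [{"id": 1, "name": "Mohammed"}, {"id": 2, "name": "Mohamed"},
           {"id": 3, "name": "Cairo"}]
print(candidate_duplicates(records))   # [(1, 2, 0.93)]
```

Blocking keeps the pairwise comparison from scaling quadratically with the dataset, which is what makes this family of methods usable on big data.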
A Novel Hybridized Rough Set and Improved Harmony Search Based Feature Selection for Protein Sequence Classification
Abstract
Progress in the bioinformatics and biotechnology areas has generated a large amount of sequence data that requires detailed analysis. Recent advances in next-generation sequencing technologies have resulted in a tremendous rise in the rate at which protein sequence data are being obtained. Big Data analysis is a clear bottleneck in many applications, especially in the field of bioinformatics, because of the complexity of the data that need to be analyzed. Protein sequence analysis is a significant problem in functional genomics. Proteins play an essential role in organisms, as they perform many important tasks in their cells. In general, protein sequences are represented by feature vectors. A major problem with protein datasets is the complexity of their analysis due to the enormous number of features. Feature selection techniques are capable of dealing with this high-dimensional feature space. In this chapter, a new feature selection algorithm that combines the Improved Harmony Search algorithm with Rough Set theory is proposed for protein sequences, to successfully tackle big data problems. The Improved Harmony Search (IHS) algorithm is a comparatively new population-based meta-heuristic optimization algorithm. It imitates the music improvisation process, where each musician improvises their instrument's pitch by seeking a perfect state of harmony, and it overcomes the limitations of the traditional harmony search (HS) algorithm. The Improved Harmony Search is hybridized with Rough Set Quick Reduct for faster and better search capabilities. The feature vectors are extracted from a protein sequence database based on amino acid composition and k-mer patterns (k-tuples), and feature selection is then carried out on the extracted feature vectors. The proposed algorithm is compared with two prominent algorithms, Rough Set Quick Reduct and Rough Set based PSO Quick Reduct. The experiments are carried out on protein primary single-sequence data sets derived from the PDB on SCOP classification, based on structural class predictions such as all α, all β, all α + β and all α/β. The feature subsets of the protein sequences predicted by both the existing and the proposed algorithms are analyzed with decision tree classification algorithms.
M. Bagyamathi, H. Hannah Inbarani
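The improvisation loop at the heart of harmony search is short enough to sketch. The binary variant below selects feature subsets on a toy dataset with a stand-in fitness (a class-separation score with a size penalty) rather than the rough-set dependency computed by Quick Reduct, and it omits the dynamic parameter adaptation that distinguishes the improved algorithm; all names and constants are illustrative.

```python
# Minimal sketch of binary harmony search for feature selection.
import random

random.seed(0)
N_FEATURES = 12

def make_sample(label):
    """Toy record: the first three features carry the class signal, the rest are noise."""
    signal = [label + random.gauss(0, 0.3) for _ in range(3)]
    noise = [random.gauss(0, 1.0) for _ in range(N_FEATURES - 3)]
    return signal + noise

data = [(make_sample(c), c) for c in (0, 1) for _ in range(30)]

def fitness(mask):
    """Stand-in for rough-set dependency: mean class separation minus a size penalty."""
    chosen = [i for i, bit in enumerate(mask) if bit]
    if not chosen:
        return -1.0
    sep = 0.0
    for i in chosen:
        m0 = sum(x[i] for x, c in data if c == 0) / 30
        m1 = sum(x[i] for x, c in data if c == 1) / 30
        sep += abs(m1 - m0)
    return sep / len(chosen) - 0.05 * len(chosen)

def harmony_search(memory_size=10, iterations=300, hmcr=0.9, par=0.3):
    """Basic binary HS; the improved variant additionally adapts par over iterations."""
    memory = [[random.randint(0, 1) for _ in range(N_FEATURES)]
              for _ in range(memory_size)]
    for _ in range(iterations):
        new = []
        for d in range(N_FEATURES):
            if random.random() < hmcr:            # memory consideration
                bit = random.choice(memory)[d]
                if random.random() < par:         # pitch adjustment: flip the bit
                    bit = 1 - bit
            else:                                 # random consideration
                bit = random.randint(0, 1)
            new.append(bit)
        worst = min(memory, key=fitness)
        if fitness(new) > fitness(worst):
            memory[memory.index(worst)] = new
    return max(memory, key=fitness)

best = harmony_search()
print("selected features:", [i for i, bit in enumerate(best) if bit])
```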
Autonomic Discovery of News Evolvement in Twitter
Abstract
The continuous increase in data sizes has recently resulted in many data processing challenges. This increase has compelled data users to find automatic means of looking into databases to bring out vital information. Retrieving information from 'Big Data' (as it is often referred to) can be likened to finding a needle in a haystack. It is worth noting that while big data presents several computational challenges, it also serves as a gateway to technological preparedness in making the world a global village. Social media sites (of which Twitter is one) are known to be big data collectors as well as open sources for information retrieval. Easy access to social media sites and the advancement of technology tools such as computers and smart devices have made it convenient for different entities to store enormous data in real time. Twitter is known to be the most powerful and most popular microblogging tool in social media. It offers its users the opportunity to post and receive instantaneous information from the network. Traditional news media follow the activities on the Twitter network in order to retrieve interesting tweets that can be used to enhance their news reports and news updates. Twitter users include hashtag symbols (#) as prefixes to keywords used in tweets to describe their content and to enhance the readability of their tweets. This chapter uses the Apriori method for Association Rule Mining (ARM) and a novel methodology termed Rule Type Identification-Mapping (RTI-Mapping), which is inherited from Transaction-based Rule Change Mining (TRCM) (Adedoyin-Olowe et al., 2013) and Transaction-based Rule Change Mining-Rule Type Identification (TRCM-RTI) (Gomes et al., 2013), to map Association Rules (ARs) detected in tweet hashtags to the evolving news reports and news updates of traditional news agents in real life. TRCM uses Association Rule Mining (ARM) to analyse tweets on the same topic over consecutive periods t and t + 1, using Rule Matching (RM) to detect changes in ARs such as emerging, unexpected, new and dead rules. This is achieved by setting a user-defined Rule Matching Threshold (RMT) to match rules in tweets at time t with those in tweets at time t + 1 in order to ascertain which rules fall into the different patterns. TRCM-RTI is a methodology built on TRCM; it identifies the rule types of evolving ARs present in tweet hashtags at different time periods. This chapter adopts RTI-Mapping from the methodologies in (Adedoyin-Olowe et al., 2013) and (Gomes et al., 2013) to map ARs to the online evolving news of top traditional news agents in order to detect and track news and news updates of evolving events. This is an initial experiment in mapping ARs to evolving news. The mapping is done manually at this stage, and the methodology is validated using four events and news topics as case studies. The experiments show substantial results on the selected news topics.
Mariam Adedoyin-Olowe, Mohamed Medhat Gaber, Frederic Stahl, João Bártolo Gomes
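The rule-matching step that underlies TRCM can be sketched in a few lines. The toy below compares association rules mined from hashtags at times t and t + 1 against a user-defined threshold (RMT) and labels them as unchanged, emerging or dead; the real TRCM taxonomy (which also distinguishes new and unexpected rules) is finer-grained, and the hashtags are invented.

```python
# Minimal sketch of threshold-based rule matching between two time periods.
def rule_similarity(rule_a, rule_b):
    """rule = (antecedent_set, consequent_set); average Jaccard of both sides."""
    def jaccard(s, t):
        return len(s & t) / len(s | t) if s | t else 1.0
    return 0.5 * (jaccard(rule_a[0], rule_b[0]) + jaccard(rule_a[1], rule_b[1]))

def match_rules(rules_t, rules_t1, rmt=0.5):
    report = {"unchanged": [], "emerging": [], "dead": []}
    for r1 in rules_t1:
        if any(rule_similarity(r0, r1) >= rmt for r0 in rules_t):
            report["unchanged"].append(r1)
        else:
            report["emerging"].append(r1)       # present at t+1 only
    for r0 in rules_t:
        if not any(rule_similarity(r0, r1) >= rmt for r1 in rules_t1):
            report["dead"].append(r0)           # present at t only
    return report

rules_t  = [({"#election"}, {"#debate"}), ({"#storm"}, {"#weather"})]
rules_t1 = [({"#election"}, {"#results"}), ({"#goal"}, {"#worldcup"})]
print(match_rules(rules_t, rules_t1))
```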
Hybrid Tolerance Rough Set Based Intelligent Approaches for Social Tagging Systems
Abstract
A major challenge in Big Data analysis is the generation of huge amounts of data over a short period, as in social tagging systems. Social tagging systems such as BibSonomy and del.icio.us have become progressively popular with the widespread use of the Internet. The social tagging system is a popular way to annotate Web 2.0 resources: it allows users to annotate web resources with free-form tags, and tags are widely used to interpret and classify Web 2.0 resources. Tag clustering is the process of grouping similar tags into clusters. It is very useful for searching and organizing Web 2.0 resources and is also important for the success of social tagging systems. Clustering tag data is very tedious, since the tag space is very large in several social bookmarking websites. So, instead of clustering the entire tag space of Web 2.0 data, tags that are frequent enough in the tag space can be selected for clustering by applying feature selection techniques. The goal of feature selection is to determine a marginal bookmarked-URL subset from Web 2.0 data while retaining suitably high accuracy in representing the original bookmarks. In this chapter, the Unsupervised Quick Reduct feature selection algorithm is applied to find a set of the most commonly tagged bookmarks, and a Tolerance Rough Set (TRS) approach hybridized with meta-heuristic clustering algorithms is proposed. The proposed approaches are Hybrid TRS and K-Means clustering (TRS-K-Means), Hybrid TRS and Particle Swarm Optimization (PSO) K-Means clustering (TRS-PSO-KMeans), and Hybrid TRS-PSO-K-Means-Genetic Algorithm (TRS-PSO-GA). These intelligent approaches automatically determine the number of clusters. They are in turn compared with the K-Means benchmark algorithm for social tagging systems.
H. Hannah Inbarani, S. Selva Kumar
Exploitation of Healthcare Databases in Anesthesiology and Surgical Care for Comparing Comorbidity Indexes in Cholecystectomized Patients
Abstract
Objective: The Charlson comorbidity index (CCI) and the Elixhauser comorbidity index (ECI) have been used as prognostic tools in surgical and medical research. We compared their ability to predict in-hospital mortality among cholecystectomized patients using a large Spanish database of 87 hospitals covering the period 2008-2010.
Methods: The electronic healthcare database Minimal Basic Data Set (MBDS) contains information on the diseases, conditions, procedures and demographics of patients attended in the hospital setting. We used the available information to calculate the CCI and ECI and analyzed their relation to in-hospital mortality.
Results: Models including age, gender, tobacco use disorders, hospital size and CCI or ECI were predictive of in-hospital mortality among cholecystectomized patients, when measured in terms of adjusted odds ratios and 95% confidence limits. There was a dose-effect relationship between the score of the prognostic indexes and the risk of death. The areas under the ROC curves of the predictive models for in-hospital mortality were 0.8717 for the CCI and 0.8771 for the ECI, but the differences were not statistically significant (p > 10⁻⁶).
Conclusion: Both the CCI and the ECI were predictive of in-hospital mortality among cholecystectomized patients in a large sample of Spanish patients, after controlling for age, gender, hospital group and tobacco use disorders. The availability of more hospital databases through Big Data can strengthen the external validity of these results, provided that threats to internal validity, such as biases and missing values, are controlled.
Luís Béjar-Prado, Enrique Gili-Ortiz, Julio López-Méndez
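For readers unfamiliar with the index, here is a hedged sketch of how a Charlson-type score is assembled from the condition flags available in an administrative record such as the MBDS. Only a subset of conditions is listed, and the weights, although they follow the commonly cited Charlson weighting, should be checked against the coding algorithm the authors actually used.

```python
# Illustrative Charlson-style scoring from condition flags (partial weight table).
CHARLSON_WEIGHTS = {
    "myocardial_infarction": 1,
    "congestive_heart_failure": 1,
    "peripheral_vascular_disease": 1,
    "cerebrovascular_disease": 1,
    "chronic_pulmonary_disease": 1,
    "diabetes_with_complications": 2,
    "moderate_severe_renal_disease": 2,
    "any_malignancy": 2,
    "moderate_severe_liver_disease": 3,
    "metastatic_solid_tumor": 6,
}

def charlson_index(conditions):
    """conditions: set of condition names flagged for one hospital episode."""
    return sum(w for cond, w in CHARLSON_WEIGHTS.items() if cond in conditions)

patient = {"congestive_heart_failure", "diabetes_with_complications"}
print("CCI =", charlson_index(patient))   # 1 + 2 = 3
# In the chapter, scores like this enter a model for in-hospital mortality
# together with age, gender, hospital size and tobacco use disorders.
```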
Sickness Absence and Record Linkage Using Primary Healthcare, Hospital and Occupational Databases
Abstract
Objective: The Charlson comorbidity index (CCI) has been adapted to primary care (PC) patients to determine chronic illness costs. We retrospectively evaluated its ability to predict sickness absence, hospital admissions and in-hospital mortality among 1,826,190 workers followed over the period 2007-2009.
Methods: The electronic administrative databases DIRAYA© and MBDS contain information on the diseases and conditions of patients attended in the primary care and hospital settings, respectively. We retrospectively used the available information in the DIRAYA medical record database to calculate the CCI adapted to PC (CCIPC) and analyzed its relation to sickness absence, hospital admissions and in-hospital mortality.
Results: Models including age, gender, province of residence, hospital size and the CCIPC calculated in the PC setting were predictive of every outcome: sick leave (number and duration), hospital admissions (number and length of hospital stays) and in-hospital mortality, when measured in terms of adjusted odds ratios and 95% confidence limits. The area under the ROC curve of the predictive models was maximal for in-hospital mortality (0.9254).
Conclusion: The adapted CCIPC was predictive of all outcomes related to sick leave, hospital admissions and in-hospital mortality in a large sample of Spanish workers. If the goal is to compare outcomes across centers and regions for specific diseases and causes of sickness absence, the CCIPC is a promising option worthy of prospective testing. The future availability of information through Big Data can increase the external validity of these results, provided that biases threatening their internal validity are avoided.
Miguel Gili-Miner, Juan Luís Cabanillas-Moruno, Gloria Ramírez-Ramírez
Classification of ECG Cardiac Arrhythmias Using Bijective Soft Set
Abstract
This chapter presents a new automated classification method for electrocardiogram (ECG) arrhythmia. Electrocardiogram datasets are generally regarded as big data: enormous volumes of largely unstructured data, so large that they are difficult to collect, store, manage, analyze, predict, visualize, and model. Electrocardiography deals with the electrical activity of the heart, and the state of cardiac health is indicated by the ECG and heart rate. A study of the nonlinear dynamics of ECG signals for arrhythmia characterization is considered in this work. Cardiac problems are considered to be among the most deadly diseases in the medical world. Cardiac arrhythmia is an abnormality of heart rhythm; in fact, it refers to a disorder of the electrical conduction system of the heart. In this chapter, computerized ECG interpretations are used to identify arrhythmias. The process consists of ECG signal acquisition, eliminating noise (de-noising) from the ECG signal, detecting wave parameters (P, Q, R, S and T) and rhythm classification. Substantial progress has been made over the years in improving techniques for signal conditioning, extraction of relevant wave parameters and rhythm classification. However, many problems and issues, especially those related to the detection of multiple arrhythmic events using soft computing techniques, still need to be addressed in a broader manner to improve the prospect of commercial automated arrhythmia analysis in mass health care centres. The main objective of this chapter is to present a classifier system based on the bijective soft set, in order to classify ECG signal data into five classes (normal, left bundle branch block, right bundle branch block, premature ventricular contraction and paced rhythm). To meet this objective, an algorithm for the detection of P, QRS and T waves is applied, followed by the IBISOCLASS classifier. The experimental results are acquired by examining the proposed method on ECG data from the MIT-BIH arrhythmia database. The proposed algorithm is also compared with well-known standard classification algorithms, namely back propagation network (BPN), decision table, J48 and Naïve Bayes.
S. Udhaya Kumar, H. Hannah Inbarani
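Before any classifier sees a beat, the R peaks have to be located. The sketch below runs a peak detector over a crude synthetic ECG-like trace using SciPy; the sampling rate, thresholds and signal model are illustrative stand-ins for the de-noised MIT-BIH records used in the chapter.

```python
# Minimal sketch of R-peak detection on a synthetic ECG-like trace.
import numpy as np
from scipy.signal import find_peaks

np.random.seed(0)
fs = 250                                   # sampling frequency in Hz (assumed)
t = np.arange(0, 10, 1 / fs)
beats_per_second = 72 / 60
# Crude signal: one narrow spike per beat plus baseline noise.
phase = (t * beats_per_second) % 1.0
ecg = np.exp(-((phase - 0.5) ** 2) / 0.0005) + 0.05 * np.random.randn(t.size)

# R peaks: minimum height plus a ~0.4 s refractory period between peaks.
peaks, _ = find_peaks(ecg, height=0.5, distance=int(0.4 * fs))

rr_intervals = np.diff(peaks) / fs
print("beats detected:", len(peaks))
print("mean heart rate: %.1f bpm" % (60 / rr_intervals.mean()))
```

The RR intervals and wave morphology extracted from steps like this are the features the rhythm classifier then works on.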
Semantic Geographic Space: From Big Data to Ecosystems of Data
Abstract
Enhancing the physical view of a geographic space through the integration of semantic models enables a novel extended logical context for geographic data infrastructures, which are modelled as an ecosystem of data in which semantic properties and relations are defined along with the concepts composing the model. The significant capabilities of current semantic technology allow the implementation of rich data models according to an ontological approach, which assures competitive interoperable solutions both in general-purpose environments (e.g. the Semantic Web) and inside more specific systems (e.g. Geographic Information Systems). Extended capabilities in terms of expressivity also have strong implications for data/information processing, especially on a large scale (Big Data). Semantic spaces can play a critical role in those processes, in contrast to the mostly passive role of models that simply reflect a geographic perspective. This chapter proposes a short overview of a simple model for semantic geographic space and a number of its applications, mostly focusing on the added value provided by the use of semantic spaces in different use cases.
Salvatore F. Pileggi, Robert Amor
Big DNA Methylation Data Analysis and Visualizing in a Common Form of Breast Cancer
Abstract
DNA methylation is one of the epigenetic mechanisms that play a vital role in cancer research by controlling gene expression, especially in the study of abnormally hypermethylated tumor suppressor genes and hypomethylated oncogenes. DNA methylation analysis helps determine the significantly hypermethylated or hypomethylated genes that are candidate cancer biomarkers, and the visualization of DNA methylation status reveals important relationships between hypermethylated and hypomethylated genes using a mathematical modeling theory called formal concept analysis.
Islam Ibrahim Amin, Aboul Ella Hassanien, Samar K. Kassim, Hesham A. Hefny
Data Quality, Analytics, and Privacy in Big Data
Abstract
In today's world, companies compete not only on products or services but also on how well they can analyze and mine data in order to gain insights for competitive advantage and long-term growth. With the exponential growth of data, companies now face unprecedented challenges, but they are also presented with numerous opportunities for competitive growth. Advances in data-capturing devices and the existence of multi-generation systems in organizations have increased the number of data sources. Typically, data generated by different devices may not be compatible with each other, which calls for data integration. Although the ETL market offers a wide variety of tools for data integration, it is still common for companies to use SQL to manually produce in-house ETL tools. There are both technological and managerial challenges in dealing with data integration, and data quality must be embedded in the integration process.
Big data analytics delivers insights that can be used for effective business decisions. However, some of these insights may invade consumer privacy. With more and more data related to consumer behavior being collected, and with the advancement of big data analytics, privacy has become an increasing concern. It is therefore necessary to address issues related to privacy laws, consumer protections and best practices to safeguard privacy. In this chapter, we discuss topics related to big data integration, big data quality, big data privacy, and big data analytics.
Xiaoni Zhang, Shang Xiang
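As a toy illustration of embedding quality checks in an integration step (the column names, rules and values are invented and this is not a tool from the chapter), the sketch below merges two sources and reports duplicate keys and missing values before anything is loaded downstream.

```python
# Minimal sketch of a data-quality gate inside a small integration step.
import pandas as pd

crm = pd.DataFrame({"customer_id": [1, 2, 3],
                    "email": ["a@x.com", None, "c@x.com"]})
web = pd.DataFrame({"customer_id": [2, 3, 3],
                    "last_visit": ["2015-01-03", "2015-01-04", "2015-01-04"]})

merged = crm.merge(web, on="customer_id", how="left")

report = {
    "rows": len(merged),
    "duplicate_keys": int(merged["customer_id"].duplicated().sum()),
    "missing_email": int(merged["email"].isna().sum()),
}
print(report)

# Simple remediation: drop exact duplicates, flag incomplete records for
# stewardship instead of silently discarding them.
clean = merged.drop_duplicates()
flagged = clean[clean["email"].isna()]
print(len(clean), "clean rows,", len(flagged), "flagged for review")
```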
Search, Analysis and Visual Comparison of Massive and Heterogeneous Data
Application in the Medical Field
Abstract
The continuous evolution of hardware technology enables information systems to store very large amounts of data, which grow even more rapidly than computing power. This spectacular growth of data is at the origin of what is called Big Data. As several fields are affected by digitization, the medical field has experienced, in recent years, an important technological and digital revolution, which has contributed to a large informational explosion of digital medical data. In addition to their massive quantity, these data are characterized by their complexity, diversity and heterogeneity, and they are often contained in the so-called Electronic Health Record (EHR). However, without the right tools to explore the large amounts of data that have been collected for their potential usefulness, the data become useless and databases and their management systems offer no advantage. In this context, we propose in this chapter the Medical Multi-project ICOP system (M2ICOP), an interactive system dedicated particularly to clinicians and researchers in the medical field to help them explore, visualize and analyze very large and heterogeneous sets of medical data. Practically, our system allows these users to visualize and interact with a large number of electronic health records, to search for similar EHRs, and to compare them in order to take advantage of best practices and shared experiences to improve the quality of treatment.
Ahmed Dridi, Salma Sassi, Anis Tissaoui
Modified Soft Rough Set Based ECG Signal Classification for Cardiac Arrhythmias
Abstract
The objective of the present study is ECG signal classification for cardiac arrhythmias. Most pattern recognition techniques involve significantly large amounts of computation and processing time for feature extraction and classification. The electrocardiogram (ECG), consisting of the P, QRS and T waves, demonstrates the electrical activity of the heart. It is the most readily accessible bioelectric signal and provides doctors with reasonably accurate data regarding a patient's heart disorders. Many cardiac problems are visible as distortions in the ECG. Different heart diseases produce different ECG wave shapes, and there is a large number of heart illnesses, so it is hard to accurately extract cardiological features from diverse ECG waveforms. Big Data is now rapidly expanding in all science and engineering domains, including the physical, biomedical and social sciences, and it is used to build computational models directly from large ECG data sets. Rough set rule generation is specifically designed to extract human-understandable decision rules from nominal data, and soft rough set theory is a new mathematical tool for dealing with uncertainty. Five types of rhythm, Normal Sinus Rhythm (NSR), Premature Ventricular Contraction (PVC), Left Bundle Branch Block (LBBB), Right Bundle Branch Block (RBBB) and Paced Rhythm (PR), are obtained from the MIT-BIH arrhythmia database. Five morphological features are extracted from each beat after preprocessing of the selected records. In this chapter, the ECG signals are classified using a modified soft rough set (MSR) technique. The empirical analysis shows that the proposed method performs better than six established techniques: back propagation neural network, decision table, J48, JRip, multilayer perceptron and Naive Bayes. The chapter focuses on finding easy but reliable features and the best MSR structure to correctly classify five different cardiac conditions.
S. Senthil Kumar, H. Hannah Inbarani
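The rough-set machinery behind rule-based classifiers of this kind can be illustrated with the classical lower and upper approximations. The sketch below uses invented nominal beat descriptions and does not reproduce the chapter's modified soft rough set construction.

```python
# Minimal sketch of rough-set lower/upper approximations of a decision class.
from collections import defaultdict

# Toy beats described by two nominal features and a rhythm label.
objects = {
    "b1": (("qrs", "narrow"), ("rr", "regular"),   "NSR"),
    "b2": (("qrs", "narrow"), ("rr", "regular"),   "NSR"),
    "b3": (("qrs", "wide"),   ("rr", "irregular"), "PVC"),
    "b4": (("qrs", "wide"),   ("rr", "regular"),   "NSR"),
    "b5": (("qrs", "wide"),   ("rr", "regular"),   "PVC"),
}

def approximations(objects, target_label):
    # Indiscernibility classes: objects with identical condition attributes.
    blocks = defaultdict(set)
    for name, (*features, _) in objects.items():
        blocks[tuple(features)].add(name)
    target = {n for n, (*_, lab) in objects.items() if lab == target_label}
    lower, upper = set(), set()
    for block in blocks.values():
        if block <= target:      # entirely inside the class: certain members
            lower |= block
        if block & target:       # overlaps the class: possible members
            upper |= block
    return lower, upper

lower, upper = approximations(objects, "NSR")
print("lower:", sorted(lower))   # certainly NSR given these features
print("upper:", sorted(upper))   # possibly NSR (boundary region included)
```

Rules generated from the lower approximation are certain; those from the boundary region (upper minus lower) are only possible, which is where the soft/modified variants refine the basic model.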
Towards a New Architecture for the Description and Manipulation of Large Distributed Data
Abstract
The exponential growth of generated information, the loss of structural meaning due to the variety of data and sources, and highly demanding applications and end-users have led to the distribution of centralized databases. One of the common approaches to satisfying performance needs while preserving relational integrity is a correctly designed and implemented decentralized database. Migrating IT systems from a centralized to a distributed database may imply heavy costs, including a review of the core and interfaces of existing systems. Moreover, an incongruous design may be fatal in Big Data processing systems, for example data loss due to a violation of the completeness rule. A simple field replication may be acceptable in "normal-size" databases, but will result in a significant waste of storage space. Indeed, marketed DDBMSs are currently very far from providing automated support for large distributed data. This heavy task is still done without any GUI or friendly assistance that ensures the distribution rules (completeness, disjointness and reconstruction). Moreover, database transparency is still not automatically ensured, even with the reported distribution script. Stored procedures and functions for data treatment must take the distributed context into consideration, and this context switch may require rewriting the complete data-treatment algorithm. The aim of this chapter is to propose a new architecture for the description and manipulation of large distributed data. The result of this approach is a distribution-context-aware tool that respects database distribution rules and helps designers easily create reliable DDB scripts. To avoid reviewing the core application and interfaces, an automated translator from centralized-format queries to distribution-context-aware queries is provided. Even after the migration, end-users and applications will see the distributed database as it was before splitting; this level of transparency is guaranteed by the query translator.
Fadoua Hassen, Amel Grissa Touzi
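The three distribution rules the chapter insists on (completeness, disjointness, reconstruction) can be checked mechanically once fragments are materialized. Below is a minimal sketch on an invented relation; it is not part of the proposed tool.

```python
# Minimal sketch: verifying horizontal fragmentation correctness rules.
def check_horizontal_fragmentation(relation, fragments):
    """relation: set of tuples; fragments: list of sets of tuples."""
    union = set().union(*fragments) if fragments else set()
    completeness = relation <= union        # every tuple lands in some fragment
    reconstruction = union == relation      # the union rebuilds the relation exactly
    disjointness = sum(len(f) for f in fragments) == len(union)   # no tuple twice
    return {"completeness": completeness,
            "disjointness": disjointness,
            "reconstruction": reconstruction}

customers = {(1, "Tunis"), (2, "Sfax"), (3, "Tunis"), (4, "Sousse")}
frag_north = {t for t in customers if t[1] == "Tunis"}
frag_rest  = {t for t in customers if t[1] != "Tunis"}

print(check_horizontal_fragmentation(customers, [frag_north, frag_rest]))
# Forgetting a predicate (e.g. "Sousse") would break completeness, while
# overlapping predicates would break disjointness.
```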
Backmatter
Metadata
Title
Big Data in Complex Systems
Editors
Aboul Ella Hassanien
Ahmad Taher Azar
Vaclav Snasael
Janusz Kacprzyk
Jemal H. Abawajy
Copyright Year
2015
Electronic ISBN
978-3-319-11056-1
Print ISBN
978-3-319-11055-4
DOI
https://doi.org/10.1007/978-3-319-11056-1
