
## About this book

This book presents the current trends, technologies, and challenges of Big Data across diverse fields of engineering and science. It covers applications of Big Data ranging from conventional fields such as mechanical and civil engineering, through electronics, electrical engineering, and computer science, to the pharmaceutical and biological sciences. The book consists of contributions from authors across academia and industry, demonstrating the importance of Big Data for decision-making in sectors where the volume, variety, and velocity of information keep increasing. It is a useful reference for graduate students, researchers, and scientists interested in exploring the potential of Big Data in engineering applications.

## Table of contents

### Applying Big Data Concepts to Improve Flat Steel Production Processes

In this chapter we present some results of the first European research project dealing with the utilisation of Big Data ideas and concepts in the steel industry. In the first part, we motivate the definition of a multi-scale data representation over multiple production stages. This data model is capable of synchronizing high-resolution (HR) measuring data gathered along the whole flat steel production chain. In the second part, we describe a realization of this concept as a three-tier software architecture, including a web service for standardized data access, and give some implementation details. Finally, two industrial demonstration applications are presented in detail to explain the full potential of this concept and to prove that it is operationally applicable. In the first application, we realized an instant interactive data visualisation enabling the in-coil aggregation of millions of quality and process measures within seconds. In the second application, we used the simple and fast HR data access to realize a refined cause-and-effect analysis.
Jens Brandenburger, Valentina Colla, Silvia Cateni, Antonella Vignali, Floriano Ferro, Christoph Schirm, Josef Melcher

### Parallel Generation of Very High Resolution Digital Elevation Models: High-Performance Computing for Big Spatial Data Analysis

Very high resolution digital elevation models (DEMs) provide the opportunity to represent the micro-level detail of topographic surfaces, thus increasing the accuracy of applications that depend on topographic data. The analysis of micro-level topographic surfaces is particularly important for a series of geospatially related engineering applications. However, the generation of very high resolution DEMs using, for example, LiDAR data is often extremely computationally demanding because of the large volume of data involved. Thus, we use a high-performance parallel computing approach to resolve this big data-related computational challenge. This approach allows us to generate a fine-resolution DEM from LiDAR data efficiently. We applied it to derive the DEM of our study area, a bottomland hardwood wetland located in the USDA Forest Service Santee Experimental Forest. Our study demonstrated the feasibility and acceleration performance of the parallel interpolation approach for tackling the big data challenge associated with the generation of very high resolution DEMs.
Minrui Zheng, Wenwu Tang, Yu Lan, Xiang Zhao, Meijuan Jia, Craig Allan, Carl Trettin
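The abstract above describes interpolating a point cloud onto a fine grid in parallel. As a minimal sketch of that idea, the following splits the grid into row strips and interpolates each strip independently. Everything here is an assumption for illustration: the abstract does not name the interpolation method (inverse distance weighting is used as a stand-in), and the function names, the strip decomposition, and the thread pool are hypothetical rather than the authors' implementation.

```python
from concurrent.futures import ThreadPoolExecutor
from math import hypot


def idw(x, y, points, power=2.0):
    """Inverse-distance-weighted elevation at (x, y) from (px, py, z) samples."""
    num = den = 0.0
    for px, py, z in points:
        d = hypot(x - px, y - py)
        if d == 0.0:
            return z  # the cell centre coincides with a sample point
        w = 1.0 / d ** power
        num += w * z
        den += w
    return num / den


def interpolate_rows(rows, ncols, cell, points):
    """Interpolate one strip of the grid; each cell is evaluated at its centre."""
    return [
        [idw((c + 0.5) * cell, (r + 0.5) * cell, points) for c in range(ncols)]
        for r in rows
    ]


def parallel_dem(points, nrows, ncols, cell, workers=4):
    """Split the grid into contiguous row strips and interpolate them concurrently."""
    size = max(1, -(-nrows // workers))  # ceiling division, at least one row per strip
    strips = [range(s, min(s + size, nrows)) for s in range(0, nrows, size)]
    with ThreadPoolExecutor(max_workers=workers) as pool:
        # map() returns results in strip order, so rows stay in grid order
        parts = pool.map(lambda rows: interpolate_rows(rows, ncols, cell, points), strips)
        return [row for part in parts for row in part]
```

A thread pool keeps the sketch short and easy to run; for CPU-bound pure-Python interpolation, a process pool or a cluster scheduler would be the more realistic choice, since the strips share no state.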

### Big-Data Analysis of Process Performance: A Case Study of Smart Cities

This chapter presents a data-centric software architecture that provides timely data access to key performance indicators (KPIs) about process performance. This architecture comes in the form of an analytical framework that relies on big-data and cloud-computing technologies to cope with the demands of crowd-sourced data analysis in terms of latency and data volume. The framework is proposed for the Smart Cities and Internet of Things (IoT) arenas to monitor, analyse and improve the business processes and smart services of the city. Once the framework is presented from the technical standpoint, a case study is rolled out to leverage this process-centric framework and apply its fundamentals to the smart cities realm, with the aim of analysing live smart data and improving the efficiency of smart cities. More specifically, this case study focusses on improving the service delivery process of the Open311 smart services deployed in the city of Chicago. The outcomes of the test show the ability of the system to generate metrics in nearly real-time for high volumes of data.
Alejandro Vera-Baquero, Ricardo Colomo-Palacios

### Implementing Scalable Machine Learning Algorithms for Mining Big Data: A State-of-the-Art Survey

The growing trend of Big Data drives additional demand for novel solutions and specifically-designed algorithms that will perform efficient Big Data filtering and processing, recently even in a real-time fashion. Thus, the necessity to scale up Machine Learning algorithms to larger datasets and more complex methods should be addressed by distributed parallelism. This book chapter conducts a thorough literature review on distributed parallel data-intensive Machine Learning algorithms applied on Big Data so far. The selected algorithms fall into various Machine Learning categories, including (i) unsupervised learning, (ii) supervised learning, (iii) semi-supervised learning and (iv) deep learning. The most popular programming frameworks like MapReduce, PLANET, DryadLINQ, IBM Parallel Machine Learning Toolbox (PML), Compute Unified Device Architecture (CUDA) etc., well suited for parallelizing Machine Learning algorithms, will be cited throughout the review. However, this review is mainly focused on the performance and implementation traits of scalable Machine Learning algorithms, rather than on framework wide-ranging choices and their trade-offs.
Marjana Prifti Skënduli, Marenglen Biba, Michelangelo Ceci

### Concepts of HBase Archetypes in Big Data Engineering

All the technology used for big data handling is inspired by technology that was explained in the Google paper published in 2003. HBase is one of the most widely used and preferred open-source distributed systems developed by Apache, alongside Apache ZooKeeper and Apache Hadoop. HBase provides random access for storing and retrieving data. In HBase we can store any type of data in any format; data can be structured or semi-structured. Its data model is very flexible and dynamic. It is a NoSQL database, i.e. it does not allow inter-row transactions. Unlike traditional systems, HBase runs on a cluster of computers instead of a single one, and the number of computers in a cluster can be increased or decreased as required. This type of design provides a more powerful and scalable approach to data handling. This chapter explains how the HBase architecture, commands, and operations differ from those of traditional systems.
Ankur Saxena, Shivani Singh, Chetna Shakya

### Scalable Framework for Cyber Threat Situational Awareness Based on Domain Name Systems Data Analysis

There is a myriad of security solutions that have been developed to tackle cyber security attacks and malicious activities in the digital world, such as firewalls, intrusion detection and prevention systems, anti-virus systems, and honeypots. Despite these detection measures and protection mechanisms, the number of successful attacks and their level of sophistication keep increasing day by day. Also, with the advent of the Internet of Things, the number of devices connected to the Internet has risen dramatically. The inability to detect attacks on these devices is due to (1) the lack of computational power for detecting attacks, (2) the lack of interfaces that could potentially indicate a compromise on these devices and (3) the lack of the ability to interact with the system to execute diagnostic tools. This warrants newer approaches, such as a Tier-1 Internet Service Provider level view of attack patterns, to provide situational awareness of cyber security threats. We investigate and explore the event data generated by the Internet protocol Domain Name System (DNS) for the purpose of cyber threat situational awareness. Traditional methods such as static and binary analysis of malware are sometimes inadequate to address the proliferation of malware due to the time taken to obtain and process the individual binaries in order to generate signatures. By the time the anti-malware signature is available, a significant amount of damage might already have happened. Traditional anti-malware systems may not identify malicious activities. However, such activities may be detected faster through the DNS protocol by analyzing the generated event data in a timely manner. As DNS was not designed with security in mind (or suffers from vulnerabilities), we explore how the vast amount of event data generated by these systems can be leveraged to create cyber threat situational awareness. The main contributions of the book chapter are two-fold: (1) a scalable framework that can perform web-scale analysis in near real-time to provide situational awareness, and (2) the detection of early warning signals before large-scale attacks or malware propagation occur. We employ a deep learning approach to classify and correlate malicious events that are perceived from the protocol usage. To our knowledge, this is the first time that a framework that can analyze and correlate DNS usage information at continent scale or multiple Tier-1 Internet Service Provider scale has been studied and analyzed in real-time to provide situational awareness. Using merely a commodity hardware server, the developed framework is capable of analyzing more than 2 million events per second and can detect the malicious activities within them in near real-time. The developed framework can be scaled out to analyze even larger volumes of network event data by adding computing resources. The scalability and real-time detection of malicious activities from early warning signals make the developed framework stand out from any system of similar kind.
R. Vinayakumar, Prabaharan Poornachandran, K. P. Soman

### Big Data in HealthCare

This chapter presents an analysis of the infrastructure of big data, the elements that make it up, the types of data that define it, and the characteristics that distinguish it: volume, velocity, variety, veracity and volatility. Specifically, different applications based on this architecture are analyzed, including health and the Internet of Things, among others. The data used in health care, which can be managed effectively with a model based on big data, are described. Finally, a proposed health model for Mexico is presented, based on an infrastructure that allows the integration and sharing of information and the administration of medical histories, public health data and health research data, all of them as a basis for carrying out data analysis, supporting decision-making and creating institutional health programs. The chapter concludes with evidence of the significant contribution that a big data model can make to the health sector in Mexico.
Margarita Ramírez Ramírez, Hilda Beatriz Ramírez Moreno, Esperanza Manrique Rojas

### Facing Up to Nomophobia: A Systematic Review of Mobile Phone Apps that Reduce Smartphone Usage

Excessive smartphone use has been linked to adverse health outcomes including distracted driving, sleep disorders, and depression. Responding to this growing trend, apps have been developed to support users in overcoming their dependency on smartphones. In that vein, our investigation explored the “big data” available on these types of apps to gain insights about them. We narrowed our search of apps, then reviewed the content and functionality of 125 Android and iOS apps that purport to reduce device usage in the United States and elsewhere. This sample was curated based on popularity through the market research tool App Annie (which indicates revenue and downloads per category of app and by country). The apps fell into 13 broad categories, each of which contained several different features related to filters, usage controls, and monitoring programs. Findings suggest that social media technologies, including smartphone apps, are being tried as tools for health behavior change. We discuss methods of sorting through “big data” generated by apps that purport to curb smartphone addiction. Finally, we propose data-driven features, such as social facilitation and gamification, that developers might use to enhance the effectiveness of these apps.
David Bychkov, Sean D. Young

### A Fast DBSCAN Algorithm with Spark Implementation

DBSCAN is a well-known density-based clustering algorithm that is able to identify arbitrarily shaped clusters and eliminate noise data. Parallelizing DBSCAN is challenging because of its inherently sequential data access order; moreover, implementations based on MPI or OpenMP lack fault tolerance and offer no guarantee that the workload is balanced. Programming with MPI also requires data scientists to handle communication between nodes, which is a big challenge. We present a new parallel DBSCAN algorithm using Spark. A kd-tree is applied in our algorithm to reduce search time. More specifically, a novel merge approach is used so that no communication between executors is required while partial clusters are generated. Appropriate and efficient data structures are carefully used in our study: a Queue holds the neighbors of a data point, and a Hashtable is used when checking the status of and processing the data points. Other advanced data structures from Spark are also applied to make our implementation more effective. We implement the algorithm in Java and evaluate its scalability using different numbers of processing cores. Our experiments demonstrate that the algorithm we propose scales up very well. Using data sets containing up to 1 million high-dimensional points, we show that our proposed algorithm achieves speedups of up to 6 using 8 cores (10 k), 10 using 32 cores (100 k), and 137 using 512 cores (1 m). Another experiment using 10 k data points is conducted, and the result shows that the algorithm with MapReduce achieves speedups of 1.3 using 2 cores, 2.0 using 4 cores, and 3.2 using 8 cores.
Dianwei Han, Ankit Agrawal, Wei-keng Liao, Alok Choudhary
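The data structures mentioned in the abstract above (a queue of neighbors and a hash table of point statuses) can be illustrated with a small sequential DBSCAN. This is only a sketch under stated assumptions: it is single-process Python rather than the chapter's Java/Spark implementation, it uses a brute-force neighbor search where the authors use a kd-tree, and it omits their partitioning and merge step entirely. The names `region_query`, `dbscan`, and `NOISE` are hypothetical.

```python
from collections import deque
from math import dist

NOISE = -1


def region_query(points, i, eps):
    """Indices of all points within eps of point i (brute force, includes i itself)."""
    return [j for j, q in enumerate(points) if dist(points[i], q) <= eps]


def dbscan(points, eps, min_pts):
    """Return one label per point: a cluster id (0, 1, ...) or NOISE."""
    labels = {}  # hash table: point index -> cluster id or NOISE
    cluster = 0
    for i in range(len(points)):
        if i in labels:
            continue  # already assigned or marked as noise
        neighbors = region_query(points, i, eps)
        if len(neighbors) < min_pts:
            labels[i] = NOISE  # may later be relabelled as a border point
            continue
        labels[i] = cluster
        queue = deque(neighbors)  # points still to be examined for this cluster
        while queue:
            j = queue.popleft()
            if labels.get(j) == NOISE:
                labels[j] = cluster  # noise reachable from a core point is a border point
            if j in labels:
                continue
            labels[j] = cluster
            j_neighbors = region_query(points, j, eps)
            if len(j_neighbors) >= min_pts:
                queue.extend(j_neighbors)  # j is a core point, so keep expanding
        cluster += 1
    return [labels[i] for i in range(len(points))]
```

For example, `dbscan([(0, 0), (0, 1), (1, 0), (10, 10), (10, 11), (11, 10), (50, 50)], eps=1.5, min_pts=3)` yields two clusters of three points each and marks the isolated last point as noise.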

### Understanding How Big Data Leads to Social Networking Vulnerability

Although the term “Big Data” is often used to refer to large datasets generated by science and engineering or business analytics efforts, increasingly it is used to refer to social networking websites and the enormous quantities of personal information, posts, and networking activities contained therein. The quantity and sensitive nature of this information constitutes both a fascinating means of inferring sociological parameters and a grave risk for security of privacy. The present study aimed to find evidence in the literature that malware has already adapted, to a significant degree, to this specific form of Big Data. Evidence of the potential for abuse of personal information was found: predictive models for personal traits of Facebook users are alarmingly effective with only a minimal depth of information, “Likes”. It is likely that more complex forms of information (e.g. posts, photos, connections, statuses) could lead to an unprecedented level of intrusiveness and familiarity with sensitive personal information. Support for the view that this potential for abuse of private information is being exploited was found in research describing the rapid adaptation of malware to social networking sites, for the purposes of social engineering and involuntary surrendering of personal information.
Romany F. Mansour

### Big Data Applications in Health Care and Education

Technology plays a major role in all spheres of life, and higher education and health care are no exceptions. The use of big data in higher education and health care is relatively new. Higher education is passing through a phase of rapid change. The amount of data available in this field, combined with proper analytics, can yield benefits and point to techniques for handling the complex situations arising from pressure exerted by accrediting agencies, governments and other stakeholders. Higher education is becoming more and more complex, with several institutes entering the market with increasingly diversified approaches. This forces all institutes of higher education to revise their approaches frequently to cope with this pressure. Educational institutes have to ensure that the quality of their learning programmes is on par with that of their counterparts at the national and global level. A major concern is that the vast data sources generated in this connection are more often than not unavailable for analysis. The analysis of these volumes of data plays a major role in understanding and ensuring that institutions are aware of the changes occurring everywhere and are taking care of their social responsibilities. Due to the digitization of medical records over the past ten to fifteen years in an attempt to make them available for research and development, healthcare stakeholders collect a huge amount of data which, besides being voluminous, are complex, diverse and temporal. An analysis of these data could help the healthcare industry to identify problems related to variability in healthcare quality and escalating healthcare expenditure. In this chapter we make a critical analysis of these aspects of higher education and healthcare with respect to big data analysis and make some recommendations in this direction.
B. K. Tripathy

### BWT: An Index Structure to Speed-Up Both Exact and Inexact String Matching

The BWT transformation of a string was originally proposed for string compression, but can also be used to speed up string matching. In this chapter, we address two issues around this mechanism: (1) how to use BWT to improve the running time of a multiple pattern string matching process; and (2) how to integrate mismatching information into a search of BWT arrays to expedite string matching with k mismatches. For the first problem, we first construct the BWT array of a target string s, denoted as BWT(s), and then establish a trie structure over a set of pattern strings $R = \{r_1, \ldots, r_l\}$, denoted as T(R). By scanning BWT(s) against T(R), the time spent finding occurrences of the $r_i$’s can be significantly reduced. For the second problem, for a given pattern string r, we precompute its mismatching information (over some different substrings of it, denoted as M(r)) and construct a tree structure, called a mismatching tree, to record the mismatches between r and s during a search of BWT(s) against r. In this process, the mismatching tree can be effectively utilized to derive useful mismatching information based on M(r) and so avoid redundancy. Extensive experiments have been done to compare our methods with existing ones, which show that our methods are promising for both of the problems described above.
Yangjun Chen, Yujia Wu
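To make the role of BWT(s) in string matching concrete, here is a minimal sketch of the standard building blocks: constructing the BWT by sorting rotations and counting exact occurrences of a single pattern by backward search. It does not reproduce the chapter's trie T(R) over multiple patterns or its mismatching tree. The function names, the `$` terminator, and the naive quadratic construction are assumptions chosen for readability, and the pattern is assumed not to contain the terminator.

```python
def bwt(s, end="$"):
    """Burrows-Wheeler transform of s via sorted rotations (naive construction)."""
    s = s + end  # terminator, assumed to sort before every character of s
    rotations = sorted(s[i:] + s[:i] for i in range(len(s)))
    return "".join(rot[-1] for rot in rotations)


def count_occurrences(bwt_s, pattern):
    """Count exact occurrences of pattern in the original text by backward search."""
    counts = {}
    for ch in bwt_s:
        counts[ch] = counts.get(ch, 0) + 1
    # first[c]: number of characters in the text that are strictly smaller than c
    first, total = {}, 0
    for ch in sorted(counts):
        first[ch] = total
        total += counts[ch]
    # occ[c][i]: number of occurrences of c in bwt_s[:i]
    occ = {ch: [0] * (len(bwt_s) + 1) for ch in counts}
    for i, ch in enumerate(bwt_s):
        for c in occ:
            occ[c][i + 1] = occ[c][i] + (c == ch)
    lo, hi = 0, len(bwt_s)  # current range of matching sorted rotations
    for ch in reversed(pattern):
        if ch not in first:
            return 0
        lo = first[ch] + occ[ch][lo]
        hi = first[ch] + occ[ch][hi]
        if lo >= hi:
            return 0
    return hi - lo
```

For the text `banana`, `bwt("banana")` gives `annb$aa`, and backward search over that array finds two occurrences of `ana` without rescanning the text. Each pattern character costs a constant number of table lookups, which is the property the chapter builds on.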

### Traffic Condition Monitoring Using Social Media Analytics

Scientists and practitioners seek innovations that analyze traffic big data to reduce congestion. In this chapter, we propose a framework for traffic condition monitoring using social media data analytics. This involves sentiment analysis and cluster classification utilizing the big data volume readily available through the Twitter microblogging service. Firstly, we examine some key aspects of big data technology for traffic, transportation and information engineering systems. Secondly, we consider part-of-speech tagging utilizing the simplified Phrase-Search and Forward-Position-Intersect algorithms. Then, we use the k-nearest neighbor classifier to obtain the unigram and bigram, followed by application of the Naïve Bayes algorithm to perform the sentiment analysis. Finally, we use the Jaccard similarity and the Term Frequency-Inverse Document Frequency for cluster classification of traffic tweet data. The preliminary results show that the proposed methodology, comparatively tested for accuracy and precision against another approach employing Latent Dirichlet Allocation, is sufficient for predicting traffic flow in order to effectively improve road traffic conditions.
Taiwo Adetiloye, Anjali Awasthi
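Two of the measures named in the abstract above, Jaccard similarity and Term Frequency-Inverse Document Frequency, are simple enough to sketch on tokenized tweets. The weighting variant below (raw term frequency times the natural log of N/df) is an assumption, since the abstract does not say which TF-IDF formulation the authors use, and the function names and sample tokens are hypothetical.

```python
from math import log


def jaccard(a, b):
    """Jaccard similarity of two token collections: |A ∩ B| / |A ∪ B|."""
    a, b = set(a), set(b)
    union = a | b
    return len(a & b) / len(union) if union else 0.0


def tf_idf(docs):
    """Per-document TF-IDF weights for a list of tokenized documents."""
    n = len(docs)
    df = {}  # document frequency: number of documents containing each term
    for doc in docs:
        for term in set(doc):
            df[term] = df.get(term, 0) + 1
    return [
        {t: (doc.count(t) / len(doc)) * log(n / df[t]) for t in set(doc)}
        for doc in docs
    ]
```

With this formulation, a term that occurs in every tweet (for instance `traffic` in a corpus of traffic tweets) receives weight zero, so the terms that separate clusters are the ones that remain.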

### Modelling of Pile Drivability Using Soft Computing Methods

Driven piles are commonly used to transfer the loads from the superstructure through weak strata onto stiffer soils or rocks. For driven piles, the impact of the piling hammer induces compression and tension stresses in the piles. Hence, an important design consideration is to check that the strength of the pile is sufficient to resist the stresses caused by the impact of the pile hammer. Due to its complexity, pile drivability lacks a precise analytical theory or understanding of the phenomena involved. In situations where measured or numerical data are available, various soft computing methods have been shown to offer great promise for mapping the nonlinear interactions between the system’s inputs and outputs. In this study, two soft computing methods, the back-propagation neural network (BPNN) and multivariate adaptive regression splines (MARS) algorithms, were used to assess pile drivability in terms of the maximum compressive stresses, maximum tensile stresses, and blows per foot. A database of more than four thousand piles is used for model development and for comparing the predictive performance of BPNN and MARS.
Wengang Zhang, Anthony T. C. Goh

### Three Different Adaptive Neuro Fuzzy Computing Techniques for Forecasting Long-Period Daily Streamflows

A modeling study is presented here using three different adaptive neuro-fuzzy (ANFIS) algorithms, comprising grid partitioning (ANFIS-GP), subtractive clustering (ANFIS-SC) and fuzzy C-means clustering (ANFIS-FCM), for forecasting long-period daily streamflow magnitudes. Long-period data (between 1936 and 2016) from two hydrometric stations in the USA were used for training, evaluating and testing the approaches. Five different input combinations were applied based on the autoregressive analysis of the recorded streamflow data. A sensitivity analysis was also carried out to investigate the effect of different model architectures on the obtained outcomes. When using ANFIS-GP, the double-input model gives the best results for different model architectures, while the triple-input models produce the most accurate results using both ANFIS-SC and ANFIS-FCM models, which is due to the increase in model complexity for ANFIS-GP when more input parameters are used. Comparing all three algorithms, it was observed that ANFIS-FCM generally gave the most accurate results.
Ozgur Kisi, Jalal Shiri, Sepideh Karimi, Rana Muhammad Adnan

### Prediction of Compressive Strength of Geopolymers Using Multi-objective Feature Selection

To reduce carbon dioxide emissions to the environment, geopolymer is one of the effective binding materials that can act as a substitute for cement. The strength of the geopolymer depends upon different factors such as chemical constituents, curing temperature, curing time, and superplasticizer. In this paper, prediction models for the compressive strength of geopolymer are presented using recently developed artificial intelligence techniques: multi-objective feature selection (MOFS), functional network (FN), multivariate adaptive regression spline (MARS) and multi-gene genetic programming (MGGP). MOFS is also used to find the subset of influential parameters responsible for the compressive strength of geopolymers. MOFS has been applied with an artificial neural network (ANN) and the non-dominated sorting genetic algorithm (NSGA II). The parameters considered for the development of prediction models are curing time, NaOH concentration, Ca(OH)2 content, superplasticizer content, type of mold, type of geopolymer and H2O/Na2O molar ratio. The developed AI models were compared in terms of different statistical parameters such as average absolute error, root mean square error, correlation coefficient, and Nash-Sutcliffe coefficient of efficiency.
Lasyamayee Garanayak, Sarat Kumar Das, Ranajeet Mohanty
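The statistical parameters listed at the end of the abstract above have standard textbook definitions, sketched below for paired lists of observed and predicted values. The function names are hypothetical, the correlation coefficient is assumed to be Pearson's, and the abstract does not state whether the authors use exactly these forms.

```python
from math import sqrt


def average_absolute_error(obs, pred):
    """Mean of |observed - predicted|."""
    return sum(abs(o - p) for o, p in zip(obs, pred)) / len(obs)


def rmse(obs, pred):
    """Root mean square error."""
    return sqrt(sum((o - p) ** 2 for o, p in zip(obs, pred)) / len(obs))


def correlation(obs, pred):
    """Pearson correlation coefficient between observed and predicted values."""
    mo, mp = sum(obs) / len(obs), sum(pred) / len(pred)
    cov = sum((o - mo) * (p - mp) for o, p in zip(obs, pred))
    return cov / sqrt(sum((o - mo) ** 2 for o in obs) * sum((p - mp) ** 2 for p in pred))


def nash_sutcliffe(obs, pred):
    """Nash-Sutcliffe efficiency: 1 minus residual variance over observed variance."""
    mo = sum(obs) / len(obs)
    residual = sum((o - p) ** 2 for o, p in zip(obs, pred))
    return 1.0 - residual / sum((o - mo) ** 2 for o in obs)
```

The measures are complementary: a model whose predictions are all offset by a constant still has a correlation of 1, while its Nash-Sutcliffe efficiency drops below 1 because the offset counts as residual error.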

### Application of Big Data Analysis to Operation of Smart Power Systems

The volume of data produced in smart power systems is increasing with the growing number of smart meters. Such data is used for the control, operation and protection of power networks. Power companies can attain high levels of efficiency, reliability and sustainability of the smart grid by appropriate management of such data. Therefore, the smart grid can be regarded as a big data challenge, which needs advanced information techniques to handle massive amounts of data and their analytics. This chapter investigates the utilization in power system operation, control, and protection of huge data sets, which are difficult to process with traditional database tools and are often known as big data. In addition, this chapter covers two aspects of applying smart grid data sets: feature extraction and system integration for power system applications. The applications of big data methodology analyzed in this study can be classified into corrective, predictive, distributed, and adaptive approaches.

### A Structural Graph-Coupled Advanced Machine Learning Ensemble Model for Disease Risk Prediction in a Telehealthcare Environment

The use of intelligent and sophisticated technologies in evidence-based clinical decision-making support has been playing an important role in improving the quality of patients’ lives and helping to reduce the cost and workload involved in their daily healthcare. In this paper, an effective medical recommendation system that uses a structural graph approach with an advanced machine learning ensemble model is proposed for short-term disease risk prediction, to provide chronic heart disease patients with appropriate recommendations about whether or not to take a medical test on the coming day based on analysing their medical data. Time series telehealth data recorded from patients are used for experimentation, evaluation and validation. The Tunstall dataset was collected from May to October 2012 from the industry collaborator Tunstall. The time series data are segmented into sliding windows and then mapped into an undirected graph. The size of the sliding window was determined empirically. The structural properties of the graph serve as the feature set for the machine learning ensemble classifier to predict the patient’s condition one day in advance. A combination of three classifiers (Least Squares-Support Vector Machine, Artificial Neural Network, and Naive Bayes) is used to construct an ensemble framework to classify the graph features. To investigate the predictive ability of the graph with the ensemble classifier, the extracted statistical features were also forwarded to the individual classifiers for comparison. The findings of this study show that the recommendation system yields a satisfactory recommendation accuracy and offers an effective way of reducing the risk of incorrect recommendations as well as the workload for heart disease patients in conducting body tests every day. A 94% average prediction accuracy is achieved by using the proposed recommendation system. The results conclusively ascertain that the proposed system is a promising tool for analyzing time series medical data and providing accurate and reliable recommendations to patients suffering from chronic heart diseases.
Raid Lafta, Ji Zhang, Xiaohui Tao, Yan Li, Mohammed Diykh, Jerry Chun-Wei Lin