
2016 | Book

Big Data Analysis: New Algorithms for a New Society


About this Book

This edited volume is devoted to Big Data Analysis from a Machine Learning standpoint as presented by some of the most eminent researchers in this area.

It demonstrates that Big Data Analysis opens up new research problems which were either never considered before, or were only considered within a limited range. In addition to providing methodological discussions on the principles of mining Big Data and the difference between traditional statistical data analysis and newer computing frameworks, this book presents recently developed algorithms affecting such areas as business, financial forecasting, human mobility, the Internet of Things, information networks, bioinformatics, medical systems and life science. It explores, through a number of specific examples, how the study of Big Data Analysis has evolved and how it has started and will most likely continue to affect society. While the benefits brought about by Big Data Analysis are highlighted, the book also discusses some of the warnings that have been issued concerning the potential dangers of Big Data Analysis, along with its pitfalls and challenges.

Table of Contents

Frontmatter
A Machine Learning Perspective on Big Data Analysis
Abstract
This chapter surveys the field of Big Data analysis from a machine learning perspective. In particular, it contrasts Big Data analysis with data mining, which is based on machine learning, reviews its achievements and discusses its impact on science and society. The chapter concludes with a summary of the book’s contributing chapters divided into problem-centric and domain-centric essays.
Nathalie Japkowicz, Jerzy Stefanowski
An Insight on Big Data Analytics
Abstract
This paper discusses, from a statistical perspective, the opportunities that big data offers decision makers. It calls for a multidisciplinary approach in which computer scientists, statisticians and domain experts collaborate to provide useful big data solutions. Big data calls on us to think in new ways and to communicate effectively within such teams. We make a plea for linking data-driven and model-driven analytics, and stress the role of cause-effect models for knowledge enhancement in big data analytics. We recall Kant's dictum that theory without data is blind, but facts without theories are meaningless. A case is made for each discipline to define the contribution it offers to big data solutions so that effective teams can be formed to improve inductions. Although new approaches are needed, much of the past learning related to small data is valuable in providing big data solutions. Here we have in mind the long-term academic training and field experience of statisticians concerning reduction of dataset volumes, sampling in a more general setting, data depreciation and quality, model design and validation, visualisation, etc. We expect that combining the present approaches will provide incentives for increasing the chances of "real big solutions".
Ross Sparks, Adrien Ickowicz, Hans J. Lenz
Toward Problem Solving Support Based on Big Data and Domain Knowledge: Interactive Granular Computing and Adaptive Judgement
Abstract
Efficient methods for dealing with Big Data are urgently needed for many real-life applications. Big Data is often distributed over networks of agents involved in complex interactions. Decision support for users solving problems with Big Data requires the development of relevant computation models for the agents, as well as methods for incorporating changes in the reasoning performed by those models; these would enable agents to control computations so as to achieve the target goals. Note that users are also agents. Agents perform computations on complex objects of very different natures (e.g., (behavioral) patterns, classifiers, clusters, structural objects, sets of rules, aggregation operations, reasoning schemes, etc.). One of the challenges for systems based on Big Data is to provide users with high-level primitives for composing and building complex analytical pipelines over Big Data. Such primitives are very often expressed in natural language, and they should be approximated using low-level primitives accessible from raw data. In Granular Computing (GrC), all such constructed and/or induced objects are called granules. To model interactive computations performed by agents in complex systems based on Big Data, we extend the existing approach to GrC by introducing complex granules (c-granules or granules, for short). Many advanced tasks concerning complex systems based on Big Data may be classified as control tasks performed by agents aiming to achieve high-quality trajectories (defined by computations) relative to the considered target tasks and quality measures. Here, new challenges are to develop strategies to control, predict, and bound the behavior of systems based on Big Data at scale. We propose to investigate these challenges using the GrC framework. The reasoning that aims at controlling the computational schemes from time to time, in order to achieve the required target, is called adaptive judgement. This reasoning deals with granules and computations over them. Adaptive judgement is more than a mixture of reasoning based on deduction, induction and abduction. Due to uncertainty, agents generally cannot predict exactly the results of actions (or plans). Moreover, the approximations of the complex vague concepts initiating actions (or plans) drift with time. Hence, adaptive strategies for evolving approximations of concepts over time are needed. In particular, adaptive judgement is much needed in the efficiency management of granular computations carried out by agents for risk assessment, risk treatment, and cost/benefit analysis. The approach discussed in this paper is a step towards the realization of the Wisdom Technology (WisTech) program [2, 3], and has been developed over years of experience with different real-life projects.
Andrzej Skowron, Andrzej Jankowski, Soma Dutta
An Overview of Concept Drift Applications
Abstract
In the most challenging data analysis applications, data evolve and must be analyzed in near real time. Patterns and relations in such data often change over time; thus, models built for analyzing the data quickly become obsolete. In machine learning and data mining this phenomenon is referred to as concept drift. The objective is to deploy models that diagnose themselves and adapt to changing data over time. This chapter provides an application-oriented view of concept drift research, with a focus on supervised learning tasks. First we overview and categorize application tasks for which the problem of concept drift is particularly relevant. Then we construct a reference framework for positioning application tasks within a spectrum of problems related to concept drift. Finally, we discuss some promising research directions from the application perspective, and present recommendations for application-driven concept drift research and development.
Indrė Žliobaitė, Mykola Pechenizkiy, João Gama
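
To make the idea of drift adaptation concrete, here is a minimal, self-contained sketch (ours, not taken from the chapter): it monitors a trivial model's error rate over a sliding window and rebuilds the model when the rate crosses a threshold. The window size, threshold, and majority-class "model" are illustrative assumptions only.

```python
# A minimal sketch (not from the chapter) of the concept-drift idea:
# monitor a model's recent error rate and retrain when it degrades.
import random
from collections import deque

class DriftAwareClassifier:
    """Wraps a trivial majority-class model; retrains when the error
    rate over a sliding window exceeds a fixed threshold."""

    def __init__(self, window=100, threshold=0.35):
        self.window = deque(maxlen=window)   # recent (x, y) pairs
        self.errors = deque(maxlen=window)   # recent 0/1 error flags
        self.majority = 0                    # current model: predict majority class
        self.threshold = threshold

    def predict(self, x):
        return self.majority

    def update(self, x, y):
        self.errors.append(int(self.predict(x) != y))
        self.window.append((x, y))
        # Drift signal: recent error rate too high -> rebuild the model.
        if len(self.errors) == self.errors.maxlen:
            if sum(self.errors) / len(self.errors) > self.threshold:
                labels = [label for _, label in self.window]
                self.majority = max(set(labels), key=labels.count)
                self.errors.clear()

# Synthetic stream whose majority class flips halfway through (abrupt drift).
random.seed(0)
clf = DriftAwareClassifier()
for t in range(2000):
    y = (0 if t < 1000 else 1) if random.random() < 0.9 else (1 if t < 1000 else 0)
    clf.update(x=None, y=y)
print("model after drift predicts:", clf.predict(None))  # expect 1
```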
Analysis of Text-Enriched Heterogeneous Information Networks
Abstract
This chapter addresses the analysis of information networks, focusing on heterogeneous information networks with more than one type of node and arc. After an overview of tasks and approaches to mining heterogeneous information networks, the presentation focuses on text-enriched heterogeneous information networks, whose distinguishing property is that certain nodes are enriched with text information. A particular approach to mining text-enriched heterogeneous information networks is presented that combines text mining and network mining. The approach decomposes a heterogeneous network into separate homogeneous networks, then concatenates the structural context vectors calculated from the separate homogeneous networks with the bag-of-words vectors obtained from the textual information contained in certain network nodes. The approach is showcased on the analysis of two real-life text-enriched heterogeneous citation networks.
Jan Kralj, Anita Valmarska, Miha Grčar, Marko Robnik-Šikonja, Nada Lavrač
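
The following toy sketch illustrates only the feature-combination step described above; it is not the chapter's implementation. A row-normalized adjacency row stands in as a crude proxy for the structural context vectors (the chapter derives these with more refined methods), and the vocabulary, documents, and weighting parameter alpha are invented for illustration.

```python
# Toy sketch: concatenate per-node structural context vectors (here a
# crude proxy: normalized adjacency rows of one homogeneous network)
# with bag-of-words vectors from the text attached to each node.
import numpy as np

# Homogeneous paper-cites-paper network over 3 papers.
A = np.array([[0, 1, 1],
              [1, 0, 0],
              [1, 0, 0]], dtype=float)
structural = A / np.maximum(A.sum(axis=1, keepdims=True), 1)  # row-normalize

# Bag-of-words vectors over a tiny fixed vocabulary.
vocab = ["mining", "network", "text"]
docs = [["text", "mining"], ["network", "mining"], ["network"]]
bow = np.array([[doc.count(w) for w in vocab] for doc in docs], dtype=float)

# Weighted concatenation: alpha trades off structure against text.
alpha = 0.5
features = np.hstack([alpha * structural, (1 - alpha) * bow])
print(features.shape)  # (3, 6): one combined vector per node
```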
Implementing Big Data Analytics Projects in Business
Abstract
Big Data analytics presents both opportunities and challenges for companies. Before embarking on a Big Data project, it is important that companies understand the value offered by Big Data and the processes needed to extract it. This chapter discusses why companies should progressively increase their data volumes and the process to follow for implementing a Big Data project. We present a variety of architectures, from in-memory servers to Hadoop, for handling Big Data. We introduce the concept of the Data Lake and discuss its benefits for companies as well as the research still required to fully deploy it. We illustrate some of the points discussed in the chapter through various architectures available for running Big Data initiatives, and discuss the expected evolution of hardware and software tools in the near future.
Françoise Fogelman-Soulié, Wenhuan Lu
Data Mining in Finance: Current Advances and Future Challenges
Abstract
Data mining has been successfully applied in many businesses, aiding managers in making informed decisions based on facts rather than guesswork and incorrect extrapolations. Data mining algorithms equip institutions to predict the movements of financial indicators, enable companies to move towards more energy-efficient buildings, and allow businesses to conduct targeted marketing campaigns and forecast sales. Specific data mining success stories include customer loyalty prediction, economic forecasting, and fraud detection. The strength of data mining lies in the fact that it allows not only for predicting trends and behaviors, but also for discovering previously unknown patterns. However, a number of challenges remain, especially in this era of big data. These challenges arise from the sheer Volume of today's databases, as well as the Velocity (in terms of speed of arrival) and the Variety of the types of data collected. This chapter focuses on techniques that address these issues. Specifically, we turn our attention to the financial sector, which has become paramount to business. Our discussion centers on issues such as handling data distributions with high fluctuations, incorporating late-arriving data, and handling the unknown. We review the current state of the art, focusing mainly on model-based approaches. We conclude the chapter by providing our perspective on what the future holds for building accurate models against today's business, and specifically financial, data.
Eric Paquet, Herna Viktor, Hongyu Guo
Industrial-Scale Ad Hoc Risk Analytics Using MapReduce
Abstract
Modern reinsurance companies hold portfolios consisting of thousands of reinsurance contracts covering millions of individually insured locations. To ensure capital adequacy and for fine-grained financial planning, these companies carry out large-scale Monte Carlo simulations to estimate the probabilities that the losses incurred due to catastrophic events, such as hurricanes and earthquakes, exceed certain critical values. This is a computationally intensive process that requires parallelism to answer risk queries over a portfolio in a timely manner. We present a system that uses the MapReduce framework to efficiently evaluate risk analysis queries on industrial-scale portfolios. In contrast to existing production systems, which often support only a limited set of risk metrics, our system is designed to support arbitrary ad hoc queries an analyst may pose, while achieving performance very close to that of highly optimized production systems. For example, a full portfolio risk analysis run consisting of a 1,000,000-trial simulation, with 1,000 events per trial and 3,200 risk transfer contracts, can be completed on a 16-node Hadoop cluster in just over 20 minutes. MapReduce is an easy-to-use parallel programming framework that offers the flexibility required to develop this type of system. The key to nearly matching the performance of highly optimized production systems was to judiciously choose which parts of the system should depart from the classical MapReduce model, combining advanced features of Apache Hadoop with carefully engineered data structure implementations to eliminate performance bottlenecks without sacrificing flexibility.
Andrew Rau-Chaplin, Zhimin Yao, Norbert Zeh
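
The sketch below shows the basic map/reduce aggregation pattern underlying such an analysis, in plain Python rather than Hadoop: map each Monte Carlo trial to its portfolio loss, then reduce to an exceedance probability P(loss > threshold). The loss distribution, deductible, limit, and threshold are hypothetical stand-ins for the chapter's industrial inputs.

```python
# Minimal, non-Hadoop sketch of the aggregation pattern: map each trial
# to its total loss, then reduce to an exceedance probability.
import random
from functools import reduce

N_TRIALS, EVENTS_PER_TRIAL, THRESHOLD = 2_000, 1_000, 425_000.0

def trial_loss(trial_id):
    """Mapper: total loss of one Monte Carlo trial after a simple
    per-event deductible and limit (a toy 'risk transfer contract')."""
    rng = random.Random(trial_id)
    deductible, limit = 100.0, 5_000.0
    total = 0.0
    for _ in range(EVENTS_PER_TRIAL):
        gross = rng.expovariate(1 / 500.0)  # toy event-loss distribution
        total += min(max(gross - deductible, 0.0), limit)
    return total

# "Map" phase: one loss per trial (Hadoop would shard this across nodes).
losses = map(trial_loss, range(N_TRIALS))

# "Reduce" phase: count exceedances of the critical value.
exceedances = reduce(lambda acc, x: acc + (x > THRESHOLD), losses, 0)
print("P(loss > threshold) ~", exceedances / N_TRIALS)
```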
Big Data and the Internet of Things
Abstract
Advances in sensing and computing are making it possible to embed increasing computing power in small devices. This enables sensing devices not only to capture data passively at very high resolution but also to take sophisticated actions in response. Combined with advances in communication, the result is an ecosystem of highly interconnected devices referred to as the Internet of Things (IoT). In conjunction, advances in machine learning have allowed models to be built on this ever-increasing amount of data. Consequently, devices all the way from heavy assets, such as aircraft engines, to wearables, such as health monitors, can now not only generate massive amounts of data but also draw on aggregate analytics to "improve" their performance over time. Big data analytics has been identified as a key enabler for the IoT. In this chapter, we discuss various avenues of the IoT where big data analytics either is already making a significant impact or is on the cusp of doing so. We also discuss social implications and areas of concern.
Mohak Shah
Social Network Analysis in Streaming Call Graphs
Abstract
Mobile phones are powerful tools for connecting people. The streams of Call Detail Records (CDRs) generated by these devices provide a powerful abstraction of social interactions between individuals, representing social structures. Call graphs can be deduced from these CDRs, where nodes represent subscribers and edges represent the phone calls made. These graphs may easily reach millions of nodes and billions of edges. Besides being large-scale and generated in real time, the underlying social networks are inherently complex and thus difficult to analyze. Conventional data analysis performed by telecom operators is slow, done on request, and implies heavy costs in data warehouses. In the face of these challenges, real-time streaming analysis becomes an ever-increasing need for mobile operators, since it enables them to quickly detect important network events and optimize business operations. Sampling, together with visualization techniques, is required for online exploratory data analysis and event detection in such networks. In this chapter, we survey the burgeoning body of research in network sampling, visualization of streaming social networks, and stream analysis, along with the solutions proposed so far.
Rui Sarmento, Márcia Oliveira, Mário Cordeiro, Shazia Tabassum, João Gama
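
As one concrete instance of the sampling techniques this literature covers, the sketch below implements a standard reservoir sampler (not code from the chapter): it keeps a uniform random sample of k edges from an unbounded stream of (caller, callee) pairs using O(k) memory.

```python
# Reservoir sampling over a stream of call-graph edges.
import random

def reservoir_edge_sample(call_stream, k, seed=0):
    """Return k edges sampled uniformly from a stream of (caller, callee)."""
    rng = random.Random(seed)
    reservoir = []
    for i, edge in enumerate(call_stream):
        if i < k:
            reservoir.append(edge)       # fill the reservoir first
        else:
            j = rng.randint(0, i)        # keep edge with probability k/(i+1)
            if j < k:
                reservoir[j] = edge
    return reservoir

# Toy CDR stream: 100,000 random caller->callee pairs.
rng = random.Random(1)
stream = ((rng.randrange(500), rng.randrange(500)) for _ in range(100_000))
sample = reservoir_edge_sample(stream, k=1000)
print(len(sample), sample[:3])
```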
Scalable Cloud-Based Data Analysis Software Systems for Big Data from Next Generation Sequencing
Abstract
Next generation sequencing (NGS) technology has posed a serious computational challenge since its commercial introduction in 2008. Currently, thousands of machines worldwide produce billions of sequenced nucleotide base pairs daily. Owing to the continuous development of faster and more economical sequencing technologies, processing the large amounts of data produced by high-throughput sequencing has become the main challenge in bioinformatics. It can be addressed by a new generation of software tools based on the paradigms and principles developed within the Hadoop ecosystem. This chapter presents an overall perspective on data analysis software for genomics and the prospects for emerging applications. To show genomic big data analysis in practice, a case study of the SparkSeq system, which delivers tools for biological sequence analysis, is presented.
Monika Szczerba, Marek S. Wiewiórka, Michał J. Okoniewski, Henryk Rybiński
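
For flavor, here is a generic Apache Spark pattern of the kind such tools build on; it is emphatically not the SparkSeq API. The input path and its tab-separated layout ("read_id, chromosome, position, ...") are hypothetical.

```python
# Generic Spark sketch (not SparkSeq): count aligned reads per
# chromosome from a hypothetical tab-separated alignment file.
from pyspark import SparkContext

sc = SparkContext("local[*]", "reads-per-chromosome")

counts = (
    sc.textFile("alignments.tsv")          # hypothetical input path
      .map(lambda line: line.split("\t"))
      .filter(lambda f: len(f) >= 3)       # skip malformed lines
      .map(lambda f: (f[1], 1))            # key by chromosome column
      .reduceByKey(lambda a, b: a + b)     # sum read counts per chromosome
)

for chrom, n in counts.collect():
    print(chrom, n)

sc.stop()
```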
Discovering Networks of Interdependent Features in High-Dimensional Problems
Abstract
The availability of very large data sets in the Life Sciences, provided earlier by technological breakthroughs such as microarrays and more recently by various forms of sequencing, has created both challenges in analyzing these data and new opportunities. A promising, yet underdeveloped, approach to Big Data, not limited to the Life Sciences, is the use of feature selection and classification to discover interdependent features. Traditionally, classifiers have been developed for the best quality of supervised classification. In our experience, more often than not, rather than obtaining the best possible supervised classifier, the Life Scientist needs to know which features contribute best to classifying observations (objects, samples) into distinct classes and what the interdependencies between the features describing the observations are. Our underlying hypothesis is that interdependent features and rule networks not only reflect some syntactic properties of the data and classifiers but may also convey meaningful clues about true interactions in the modeled biological system. In this chapter we develop further our method of Monte Carlo Feature Selection and Interdependency Discovery (MCFS and MCFS-ID, respectively), which is particularly well suited for high-dimensional problems, i.e., those where each observation is described by very many features, often many more features than the number of observations. Such problems are abundant in Life Science applications. Specifically, we define Inter-Dependency Graphs (termed, somewhat confusingly, ID Graphs), which are directed graphs of interactions between features, extracted by aggregating information from the classification trees constructed by the MCFS algorithm. We then proceed to model interactions at a finer level with rule networks. We discuss some of the properties of ID Graphs and make a first attempt at validating our hypothesis on a large gene expression data set for CD4+ T-cells. MCFS-ID and ROSETTA, including the Ciruvis approach, offer a new methodology for analyzing Big Data: from feature selection, through identification of feature interdependencies, to classification with rules according to decision classes, to the construction of rule networks. Our preliminary results confirm that MCFS-ID is applicable to the identification of interacting features that are functionally relevant, while rule networks offer a complementary picture with finer resolution of the interdependencies at the level of feature-value pairs.
Michał Dramiński, Michał J. Da̧browski, Klev Diamanti, Jacek Koronacki, Jan Komorowski
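
The following is a deliberately simplified sketch of the Monte Carlo feature selection idea, not the full MCFS-ID algorithm: train many decision trees on random feature subsets and bootstrap samples, then aggregate per-feature importance to rank features when p >> n. All data, subset sizes, and repetition counts are illustrative.

```python
# Simplified Monte Carlo feature selection (not the full MCFS-ID):
# rank features by importance aggregated over many random subsets.
import numpy as np
from sklearn.tree import DecisionTreeClassifier

rng = np.random.default_rng(0)

# Synthetic p >> n data: 60 samples, 500 features, 2 informative ones.
n, p = 60, 500
X = rng.normal(size=(n, p))
y = (X[:, 3] + X[:, 7] > 0).astype(int)

scores = np.zeros(p)
n_subsets, m = 2000, 25                  # Monte Carlo repetitions, subset size
for _ in range(n_subsets):
    feats = rng.choice(p, size=m, replace=False)        # random feature subset
    rows = rng.choice(n, size=n, replace=True)          # bootstrap sample
    tree = DecisionTreeClassifier(max_depth=3, random_state=0)
    tree.fit(X[np.ix_(rows, feats)], y[rows])
    scores[feats] += tree.feature_importances_          # accumulate importance

print("top-ranked features:", np.argsort(scores)[::-1][:5])  # expect 3 and 7 near top
```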
Final Remarks on Big Data Analysis and Its Impact on Society and Science
Abstract
In this chapter, we summarize the lessons learned from the contributions to this book, add some of the important points regarding the current state of the art in Big Data Analysis that have not been discussed at length in the contributions per se, but are worth being aware of, and conclude with a discussion of the influence that Stan Matwin has had throughout the years on the successive related fields of Machine Learning, Data Mining and Big Data Analysis.
Jerzy Stefanowski, Nathalie Japkowicz
Metadata
Title
Big Data Analysis: New Algorithms for a New Society
Edited by
Nathalie Japkowicz
Jerzy Stefanowski
Copyright Year
2016
Electronic ISBN
978-3-319-26989-4
Print ISBN
978-3-319-26987-0
DOI
https://doi.org/10.1007/978-3-319-26989-4