
2016 | Book

Big Data Optimization: Recent Developments and Challenges


About this book

The main objective of this book is to provide the necessary background for working with big data. It introduces novel optimization algorithms and codes capable of operating in the big data setting, as well as applications of big data optimization, for both academics and practitioners, with the aim of benefiting society, industry, academia, and government. Presenting applications in a variety of industries, the book will be useful for researchers aiming to analyze large-scale data. Several optimization approaches for big data are explored, including convergent parallel algorithms, the limited memory bundle algorithm, the diagonal bundle method, network analytics, and many more.

Table of Contents

Frontmatter
Big Data: Who, What and Where? Social, Cognitive and Journals Map of Big Data Publications with Focus on Optimization
Abstract
Contemporary research in various disciplines from social science to computer science, mathematics and physics, is characterized by the availability of large amounts of data. These large amounts of data present various challenges, one of the most intriguing of which deals with knowledge discovery and large-scale data-mining. This chapter investigates the research areas that are the most influenced by big data availability, and on which aspects of large data handling different scientific communities are working. We employ scientometric mapping techniques to identify who works on what in the area of big data and large scale optimization problems.
Ali Emrouznejad, Marianna Marra
Setting Up a Big Data Project: Challenges, Opportunities, Technologies and Optimization
Abstract
In the first part of this chapter we illustrate how a big data project can be set up and optimized. We explain the general value of big data analytics for the enterprise and how value can be derived by analyzing big data. We go on to introduce the characteristics of big data projects and how such projects can be set up, optimized and managed. Two exemplary real-world use cases of big data projects are described at the end of the first part. To be able to choose the optimal big data tools for given requirements, the relevant technologies for handling big data are outlined in the second part of this chapter. This part covers technologies such as NoSQL and NewSQL systems, in-memory databases, analytical platforms and Hadoop-based solutions. Finally, the chapter concludes with an overview of big data benchmarks that allow for performance optimization and evaluation of big data technologies. New big data applications in particular impose requirements that make the platforms more complex and more heterogeneous. The relevant benchmarks designed for big data technologies are categorized in the last part.
Roberto V. Zicari, Marten Rosselli, Todor Ivanov, Nikolaos Korfiatis, Karsten Tolle, Raik Niemann, Christoph Reichenbach
Optimizing Intelligent Reduction Techniques for Big Data
Abstract
Working with big volumes of data collected through many applications and stored in multiple locations is both challenging and rewarding. Extracting valuable information from data means combining qualitative and quantitative analysis techniques. One of the main promises of analytics is data reduction, with the primary function of supporting decision-making. The motivation of this chapter comes from the new age of applications (social media, smart cities, cyber-infrastructures, environment monitoring and control, healthcare, etc.), which produce big data and many new mechanisms for data creation rather than new mechanisms for data storage. The goal of this chapter is to analyze existing techniques for data reduction at scale, in order to facilitate Big Data processing optimization and understanding. The chapter covers the following subjects: data manipulation, analytics and Big Data reduction techniques, considering descriptive, predictive and prescriptive analytics. The CyberWater case study is presented with reference to the optimization process and the monitoring, analysis and control of natural resources, especially water resources, in order to preserve water quality.
Florin Pop, Catalin Negru, Sorin N. Ciolofan, Mariana Mocanu, Valentin Cristea
Performance Tools for Big Data Optimization
Abstract
Many big data optimizations have critical performance requirements (e.g., real-time big data analytics), as indicated by the Velocity dimension of the 4Vs of big data. To accelerate big data optimization, users typically rely on detailed performance analysis to identify potential performance bottlenecks. However, due to the large scale and high abstraction of existing big data optimization frameworks (e.g., Apache Hadoop MapReduce), it remains a major challenge to tune these massively distributed systems at a fine granularity. To alleviate the challenges of performance analysis, various performance tools have been proposed to understand the runtime behavior of big data optimization for performance tuning. In this chapter, we introduce several performance tools for big data optimization from various angles, including the requirements of ideal performance tools, the challenges such tools face, and state-of-the-art examples.
Yan Li, Qi Guo, Guancheng Chen
Optimising Big Images
Abstract
We take a look at big data challenges in image processing. Real-life photographs and other images, such as those from medical imaging modalities, consist of tens of millions of data points. Mathematically based models for their improvement (needed due to noise, camera shake, physical and technical limitations, etc.) are moreover often highly non-smooth and increasingly often non-convex. This creates significant optimisation challenges for the application of the models in quasi-real-time software packages, as opposed to more ad hoc approaches whose reliability is not as easily proven as that of mathematically based variational models. After introducing a general framework for mathematical image processing, we take a look at the current state of the art in optimisation methods for solving such problems, and discuss future possibilities and challenges.
Tuomo Valkonen
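
A standard textbook example of such a variational model (given here purely for orientation; it is not necessarily the specific formulation treated in the chapter) is total-variation (ROF) denoising of a noisy image f over a domain \Omega:

    \min_{u} \; \frac{1}{2}\int_\Omega \big(u(x) - f(x)\big)^2 \,\mathrm{d}x \;+\; \lambda \int_\Omega |\nabla u(x)| \,\mathrm{d}x

The first term keeps the reconstruction u close to the data, while the second, non-smooth total-variation term preserves edges; its non-differentiability is exactly the kind of optimisation difficulty that must be handled efficiently at the scale of tens of millions of pixels.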
Interlinking Big Data to Web of Data
Abstract
The big data problem can be seen as a massive number of data islands, ranging from personal, shared and social to business data. The data in these islands is becoming large-scale, never-ending and ever-changing, arriving in batches at irregular time intervals; social and business data are examples of this. Linking and analyzing this potentially connected data is of high and valuable interest. In this context, it is important to investigate how the Linked Data approach can enable Big Data optimization. In particular, the Linked Data approach has recently facilitated the accessibility, sharing, and enrichment of data on the Web. Scientists believe that Linked Data reduces Big Data variability along some of the scientifically less interesting dimensions. In particular, by applying Linked Data techniques for exposing structured data and eventually interlinking it to useful knowledge on the Web, many syntactic issues vanish. Generally speaking, this approach improves data optimization by providing solutions for intelligent and automatic linking among datasets. In this chapter, we discuss the advantages of applying the Linked Data approach towards the optimization of Big Data in the Linked Open Data (LOD) cloud by: (i) describing the impact of linking Big Data to the LOD cloud; (ii) presenting various interlinking tools for linking Big Data; and (iii) providing a practical case study: linking a very large dataset to DBpedia.
Enayat Rajabi, Seyed-Mehdi-Reza Beheshti
Topology, Big Data and Optimization
Abstract
The idea of using geometry in learning and inference has a long history, going back to canonical ideas such as Fisher information, discriminant analysis, and principal component analysis. The related area of Topological Data Analysis (TDA) has been developing over the last decade. The idea is to extract robust topological features from data and use these summaries for modeling the data. A topological summary generates a coordinate-free, deformation-invariant and highly compressed description of the geometry of an arbitrary data set. Topological techniques are well suited to extend our understanding of Big Data. These tools do not supplant existing techniques, but rather provide a complementary viewpoint. The qualitative nature of topological features does not give particular importance to individual samples, and the coordinate-free nature of topology leads to algorithms and viewpoints well suited to highly complex datasets. With the introduction of persistence and other geometric-topological ideas we can find and quantify local-to-global properties as well as quantify qualitative changes in data.
Mikael Vejdemo-Johansson, Primoz Skraba
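
As a minimal, self-contained illustration of the persistence idea (restricted to dimension zero, i.e. connected components, on a made-up point cloud; this is only a sketch, not the full machinery discussed in the chapter), the following computes a 0-dimensional persistence diagram with a union-find pass over an increasing distance filtration:

    import numpy as np

    def zeroth_persistence(points):
        # 0-dimensional persistent homology of the Vietoris-Rips filtration:
        # every point is born as its own component at scale 0, and a component
        # dies at the edge length at which it merges into another one.
        n = len(points)
        edges = sorted((np.linalg.norm(points[i] - points[j]), i, j)
                       for i in range(n) for j in range(i + 1, n))
        parent = list(range(n))

        def find(a):
            while parent[a] != a:
                parent[a] = parent[parent[a]]
                a = parent[a]
            return a

        deaths = []
        for dist, i, j in edges:
            ri, rj = find(i), find(j)
            if ri != rj:                      # this edge merges two components
                parent[ri] = rj
                deaths.append(dist)
        return [(0.0, d) for d in deaths] + [(0.0, np.inf)]  # one bar never dies

    # Two well-separated synthetic clusters: expect one long-lived finite bar.
    rng = np.random.default_rng(0)
    pts = np.vstack([rng.normal(0, 0.1, (20, 2)), rng.normal(3, 0.1, (20, 2))])
    bars = zeroth_persistence(pts)
    print(sorted(bars, key=lambda b: b[1] - b[0], reverse=True)[:3])

Long bars (large death minus birth) correspond to robust features, here clusters, while short bars can be read as noise; this is the kind of coordinate-free, compressed summary the abstract refers to.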
Applications of Big Data Analytics Tools for Data Management
Abstract
Data, at a very large scale, has been accumulating in all aspects of our lives for a long time. Advances in sensor technology, the Internet, social networks, wireless communication, and inexpensive memory have all contributed to an explosion of "Big Data". Our interconnected world of today and the advent of cyber-physical systems or systems of systems (SoS) are also a key source of data accumulation, be it numerical, image, text or texture data. An SoS is basically defined as an integration of independently operating, non-homogeneous systems for a certain duration to achieve a goal higher than the sum of the parts. Recent efforts have developed a promising approach, called "Data Analytics", which uses statistical and computational intelligence (CI) tools such as principal component analysis (PCA), clustering, fuzzy logic, neuro-computing, evolutionary computation, Bayesian networks, data mining, pattern recognition, deep learning, etc. to reduce the size of "Big Data" to a manageable size and apply these tools to (a) extract information, (b) build a knowledge base using the derived data, (c) optimize validation of clustered knowledge through evolutionary computing and eventually develop a non-parametric model for the "Big Data", and (d) test and verify the model. This chapter attempts to construct a bridge between SoS and Data Analytics to develop reliable models for such systems. Four applications of big data analytics are presented, namely to solar, wind, financial and biological data.
Mo Jamshidi, Barney Tannahill, Maryam Ezell, Yunus Yetis, Halid Kaplan
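
A minimal sketch of one of the reduction tools listed above, PCA via the singular value decomposition (the data, dimensions and variance pattern below are synthetic choices for illustration only):

    import numpy as np

    def pca_reduce(X, k):
        # Project the rows of X onto the k principal directions of largest variance.
        Xc = X - X.mean(axis=0)               # center each feature
        U, S, Vt = np.linalg.svd(Xc, full_matrices=False)
        explained = (S[:k] ** 2) / (S ** 2).sum()
        return Xc @ Vt[:k].T, explained       # scores and explained-variance ratios

    rng = np.random.default_rng(1)
    X = rng.normal(size=(1000, 50))           # 1000 samples, 50 raw features
    X[:, :3] *= 10                            # give three features dominant variance
    Z, ratio = pca_reduce(X, k=3)
    print(Z.shape, ratio.round(3))            # (1000, 3), capturing most variance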
Optimizing Access Policies for Big Data Repositories: Latency Variables and the Genome Commons
Abstract
The design of access policies for large aggregations of scientific data has become increasingly important in today’s data-rich research environment. Planners routinely consider and weigh different policy variables when deciding how and when to release data to the public. This chapter proposes a methodology in which the timing of data release can be used to balance policy variables and thereby optimize data release policies. The global aggregation of publicly-available genomic data, or the “genome commons” is used as an illustration of this methodology.
Jorge L. Contreras
Big Data Optimization via Next Generation Data Center Architecture
Abstract
The use of Big Data underpins critical activities in all sectors of our society. Achieving the full transformative potential of Big Data in this increasingly digital and interconnected world requires both new data analysis algorithms and a new class of systems to handle the dramatic data growth, the demand to integrate structured and unstructured data analytics, and the increasing computing needs of massive-scale analytics. As a result, massive-scale data analytics of all forms have started to operate in data centers (DC) across the world. At the same time, data center technology has evolved from DC 1.0 (tightly-coupled silos) to DC 2.0 (computer virtualization) in order to enhance data processing capability. In the era of big data, highly diversified analytics applications continue to stress data center capacity. The mounting requirements on throughput, resource utilization, manageability, and energy efficiency demand seamless integration of heterogeneous system resources to adapt to varied big data applications. Unfortunately, DC 2.0 does not suffice in this context. By rethinking the challenges of big data applications, researchers and engineers at Huawei propose the High Throughput Computing Data Center architecture (HTC-DC) toward the design of DC 3.0. HTC-DC features resource disaggregation via unified interconnection. It offers petabyte (PB)-level data processing capability, intelligent manageability, high scalability and high energy efficiency, making it a promising candidate for DC 3.0. This chapter discusses the hardware and software features of HTC-DC for Big Data optimization.
Jian Li
Big Data Optimization Within Real World Monitoring Constraints
Abstract
Large-scale monitoring systems can provide information to decision makers. As the available measurement data grows, the need for available and reliable interpretation also grows. In addition, since decision makers require the timely arrival of information, the need for high-performance interpretation of measurement data grows as well. Big Data optimization techniques can enable designers and engineers to realize large-scale monitoring systems in real life, by allowing these systems to comply with real-world constraints in the areas of performance and reliability. Using several examples of real-world monitoring systems, this chapter discusses different approaches to optimization: data-oriented, analysis-oriented, system-architecture-oriented and goal-oriented optimization.
Kristian Helmholt, Bram van der Waaij
Smart Sampling and Optimal Dimensionality Reduction of Big Data Using Compressed Sensing
Abstract
Handling big data poses a huge challenge for the computer science community. Some of the most appealing research domains, such as machine learning, computational biology and social networks, are now overwhelmed with large-scale databases that require computationally demanding manipulation. Several techniques have been proposed for dealing with big data processing challenges, including computationally efficient implementations such as parallel and distributed architectures, but most approaches benefit from a dimensionality reduction and smart sampling step. In this context, through a series of groundbreaking works, Compressed Sensing (CS) has emerged as a powerful mathematical framework providing a suite of conditions and methods that allow for an almost lossless and efficient data compression. The most surprising outcome of CS is the proof that random projections qualify as a close to optimal choice for transforming high-dimensional data into a low-dimensional space in a way that allows for their almost perfect reconstruction. The compression power, along with its simplicity of use, renders CS an appealing method for optimal dimensionality reduction of big data. Although CS is renowned for its capability of providing succinct representations of the data, in this chapter we investigate its potential as a dimensionality reduction technique in the domain of image annotation. More specifically, our aim is to first present the challenges stemming from the nature of big data problems, explain the basic principles, advantages and disadvantages of CS, and identify potential ways of exploiting this theory in the domain of large-scale image annotation. Towards this end, a novel Hierarchical Compressed Sensing (HCS) method is proposed. The new method dramatically decreases the computational complexity while displaying robustness equal to that of the typical CS method. In addition, the connection between the sparsity level of the original dataset and the effectiveness of HCS is established through a series of artificial experiments. Finally, the proposed method is compared with the state-of-the-art dimensionality reduction technique of Principal Component Analysis. The performance results are encouraging, indicating the promising potential of the new method in large-scale image annotation.
Anastasios Maronidis, Elisavet Chatzilari, Spiros Nikolopoulos, Ioannis Kompatsiaris
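
The core CS claim above, that a Gaussian random projection compresses a sparse signal almost losslessly, can be illustrated with a small sketch (a generic one-dimensional sparse signal recovered with Orthogonal Matching Pursuit; this is not the HCS method proposed in the chapter, and all dimensions below are arbitrary illustrative choices):

    import numpy as np

    def omp(A, y, k):
        # Orthogonal Matching Pursuit: greedily recover a k-sparse x from y = A x.
        residual, support = y.copy(), []
        for _ in range(k):
            if np.linalg.norm(residual) < 1e-12:
                break
            j = int(np.argmax(np.abs(A.T @ residual)))  # atom most correlated with residual
            support.append(j)
            coef, *_ = np.linalg.lstsq(A[:, support], y, rcond=None)
            residual = y - A[:, support] @ coef
        x_hat = np.zeros(A.shape[1])
        x_hat[support] = coef
        return x_hat

    rng = np.random.default_rng(0)
    n, m, k = 1000, 120, 8                     # ambient dim, measurements, sparsity
    x = np.zeros(n)
    x[rng.choice(n, size=k, replace=False)] = rng.normal(size=k)
    A = rng.normal(size=(m, n)) / np.sqrt(m)   # Gaussian random projection matrix
    y = A @ x                                  # the 120-dimensional compressed view
    x_hat = omp(A, y, k)
    print("relative recovery error:", np.linalg.norm(x - x_hat) / np.linalg.norm(x))

The 1000-dimensional sparse signal is stored as only 120 random measurements, yet can be reconstructed almost exactly, which is the dimensionality-reduction property the abstract builds on.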
Optimized Management of BIG Data Produced in Brain Disorder Rehabilitation
Abstract
Brain disorders resulting from injury, disease, or other health conditions can influence the function of most parts of the human body. The necessary medical care and rehabilitation are often impossible without the close cooperation of several diverse medical specialists, who must work jointly to choose methods that improve and support healing processes as well as to discover underlying principles. The key to their decisions is the data resulting from careful observation or examination of the patient. We introduce the concept of a scientific dataspace that involves and stores numerous and often complex types of data, e.g., the primary data captured from the application, data derived by curation and analytic processes, background data including ontology and workflow specifications, semantic relationships between dataspace items based on ontologies, and available published data. Our contribution applies big data and cloud technologies to ensure efficient exploitation of this dataspace, namely novel software architectures, algorithms and a methodology for its optimized management and utilization. We present its service-oriented architecture using a running case study, together with data processing results that involve mining and visualization of selected patterns, optimized towards the big and complex data we are dealing with.
Peter Brezany, Olga Štěpánková, Markéta Janatová, Miroslav Uller, Marek Lenart
Big Data Optimization in Maritime Logistics
Abstract
Seaborne trade constitutes nearly 80% of world trade by volume and is linked into almost every international supply chain. Efficient and competitive logistics solutions obtained through advanced planning will not only benefit the shipping companies, but will trickle down the supply chain to producers and consumers alike. Large-scale maritime problems are found particularly within liner shipping due to the vast size of the networks that global carriers operate. This chapter introduces a selection of large-scale planning problems within the liner shipping industry. We focus on the solution techniques applied and show how strategic, tactical and operational problems can be addressed. We discuss how large-scale optimization methods can exploit special problem structures such as separable/independent subproblems, and give examples of advanced heuristics using divide-and-conquer paradigms, decomposition and mathematical programming within a large-scale search framework. We conclude the chapter by discussing future challenges of large-scale optimization within maritime shipping and the integration of predictive big data analysis with prescriptive optimization techniques.
Berit Dangaard Brouer, Christian Vad Karsten, David Pisinger
Big Network Analytics Based on Nonconvex Optimization
Abstract
The scientific problems that Big Data faces may well be network-scientific problems, and network analytics contributes a great deal to networked Big Data processing. Many network issues can be modeled as nonconvex optimization problems and can consequently be addressed by optimization techniques. Among nonconvex optimization techniques, evolutionary computation offers a way to handle these problems efficiently. Because network community discovery is a critical research topic in network analytics, in this chapter we focus on evolutionary-computation-based nonconvex optimization for network community discovery. Single- and multiple-objective optimization models for the community discovery problem are thoroughly investigated. Several experimental studies are presented to demonstrate the effectiveness of the optimization-based approach for big network community analytics.
Maoguo Gong, Qing Cai, Lijia Ma, Licheng Jiao
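
As a hedged illustration of casting community discovery as a (single-objective, nonconvex) optimization problem, the sketch below maximizes Newman's modularity on a made-up toy graph with plain hill-climbing; the chapter itself studies far more capable evolutionary and multi-objective approaches:

    import numpy as np

    def modularity(A, labels):
        # Newman's modularity Q of a node partition of an undirected graph.
        m2 = A.sum()                          # twice the number of edges
        k = A.sum(axis=1)                     # node degrees
        same = labels[:, None] == labels[None, :]
        return ((A - np.outer(k, k) / m2) * same).sum() / m2

    def greedy_communities(A, n_comm=2, iters=500, seed=0):
        # Hill-climb over single-node label moves, a simple stand-in for the
        # evolutionary search used in the chapter.
        rng = np.random.default_rng(seed)
        labels = rng.integers(n_comm, size=A.shape[0])
        best = modularity(A, labels)
        for _ in range(iters):
            i, new = rng.integers(A.shape[0]), rng.integers(n_comm)
            old = labels[i]
            labels[i] = new
            q = modularity(A, labels)
            if q >= best:
                best = q
            else:
                labels[i] = old               # reject worsening moves
        return labels, best

    # Two 4-node cliques joined by a single bridge edge (toy example).
    A = np.zeros((8, 8))
    for block in (range(0, 4), range(4, 8)):
        for i in block:
            for j in block:
                if i != j:
                    A[i, j] = 1
    A[3, 4] = A[4, 3] = 1
    labels, q = greedy_communities(A)
    print(labels, round(q, 3))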
Large-Scale and Big Optimization Based on Hadoop
Abstract
Integer Linear Programming (ILP) is among the most popular optimization techniques found in practical applications; however, it often faces computational issues when modeling real-world problems. Computation can easily outgrow the computing power of standalone computers as the size of the problem increases. Modern distributed computing relaxes these computing power constraints by providing scalable computing resources to match application needs, which boosts large-scale optimization. This chapter presents a paradigm that leverages Hadoop, an open-source distributed computing framework, to solve a large-scale ILP problem abstracted from real-world air traffic flow management. The ILP involves millions of decision variables, which is intractable even with existing state-of-the-art optimization software packages. A dual decomposition method is used to separate the variables into a set of dual subproblems, smaller ILPs of lower dimension, so that the computational complexity is reduced. As a result, the subproblems become solvable with optimization tools. It is shown that the iterative update of the Lagrangian multipliers in the dual decomposition method fits into Hadoop's MapReduce programming model, which allocates computations to the cluster for parallel processing and collects results from each node to report aggregate results. Thanks to the scalability of distributed computing, parallelism can be improved by assigning more worker nodes to the Hadoop cluster. As a result, the computational efficiency of solving the whole ILP problem is not impacted by the input size.
Yi Cao, Dengfeng Sun
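
The map/reduce shape of the dual decomposition iteration can be illustrated on a deliberately tiny separable problem (one coupling budget constraint; the values, capacity and step sizes are made up, and no actual Hadoop cluster is involved, only the iteration structure that would map onto one):

    import numpy as np

    # Toy separable problem: each of N agents picks an integer x_i in {0,...,5}
    # to maximize v_i * x_i, subject to the shared budget sum_i x_i <= B.
    rng = np.random.default_rng(0)
    N, B = 20, 30
    v = rng.uniform(1, 10, size=N)

    def map_phase(lmbda):
        # "Map": each Lagrangian subproblem min_x (lmbda - v_i) * x over {0,...,5}
        # is solved independently, so it parallelizes across agents.
        return np.where(v > lmbda, 5, 0)

    def reduce_phase(x, lmbda, step):
        # "Reduce": aggregate the subproblem outputs and take a projected
        # subgradient step on the multiplier of the relaxed coupling constraint.
        g = x.sum() - B                       # subgradient of the dual function
        return max(0.0, lmbda + step * g)

    lmbda = 0.0
    for t in range(1, 201):
        x = map_phase(lmbda)                  # embarrassingly parallel step
        lmbda = reduce_phase(x, lmbda, step=1.0 / t)
    print("multiplier:", round(lmbda, 3), "total usage:", int(map_phase(lmbda).sum()))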
Computational Approaches in Large-Scale Unconstrained Optimization
Abstract
As a topic of great significance in nonlinear analysis and mathematical programming, unconstrained optimization is widely and increasingly used in engineering, economics, management, industry and other areas. Unconstrained optimization also arises in reformulations of constrained optimization problems in which the constraints are replaced by penalty terms in the objective function. In many big data applications, solving an unconstrained optimization problem with thousands or millions of variables is indispensable. In such situations, methods with low memory requirements are helpful tools. Here, we study two families of methods for solving large-scale unconstrained optimization problems: conjugate gradient methods and limited-memory quasi-Newton methods, both of which are built around a line search. Convergence properties and the numerical behavior of the methods are discussed, and recent advances are reviewed. Thus, helpful new computational tools are supplied for engineers and mathematicians engaged in solving large-scale unconstrained optimization problems.
Saman Babaie-Kafaki
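
A minimal sketch of one of the two families discussed, a nonlinear conjugate gradient method (Polak-Ribiere+ with Armijo backtracking; the test function and parameters are illustrative choices, not the chapter's experiments). Only a handful of vectors are stored, which is the low-memory feature the abstract emphasizes:

    import numpy as np

    def cg_minimize(f, grad, x0, iters=10000, tol=1e-8):
        # Nonlinear conjugate gradient with Polak-Ribiere+ updates and a simple
        # Armijo backtracking line search.
        x, g = x0.astype(float), grad(x0)
        d = -g
        for _ in range(iters):
            if np.linalg.norm(g) < tol:
                break
            t, fx, slope = 1.0, f(x), g @ d
            if slope >= 0:                    # safeguard: fall back to steepest descent
                d, slope = -g, -(g @ g)
            while f(x + t * d) > fx + 1e-4 * t * slope:
                t *= 0.5                      # backtrack until Armijo holds
            x_new = x + t * d
            g_new = grad(x_new)
            beta = max(0.0, g_new @ (g_new - g) / (g @ g))   # PR+ formula
            d = -g_new + beta * d
            x, g = x_new, g_new
        return x

    # Example: the Rosenbrock function in 1000 variables.
    def rosen(x):
        return np.sum(100 * (x[1:] - x[:-1] ** 2) ** 2 + (1 - x[:-1]) ** 2)

    def rosen_grad(x):
        g = np.zeros_like(x)
        g[:-1] = -400 * x[:-1] * (x[1:] - x[:-1] ** 2) - 2 * (1 - x[:-1])
        g[1:] += 200 * (x[1:] - x[:-1] ** 2)
        return g

    x_star = cg_minimize(rosen, rosen_grad, np.zeros(1000))
    print(round(rosen(x_star), 6))            # objective value reached (0 at the minimum)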
Numerical Methods for Large-Scale Nonsmooth Optimization
Abstract
Nonsmooth optimization (NSO) refers to the general problem of minimizing (or maximizing) functions that are typically not differentiable at their minimizers (maximizers). NSO problems are encountered in many application areas: for instance, in economics, mechanics, engineering, control theory, optimal shape design, machine learning, and data mining including cluster analysis and classification. Most of these problems are large-scale. In addition, constantly increasing database sizes, for example in clustering and classification problems, add even more challenges to solving these problems. NSO problems are in general difficult to solve even when the size of the problem is small and the problem is convex. In this chapter we recall two numerical methods for solving large-scale nonconvex NSO problems, namely the limited memory bundle algorithm (LMBM) and the diagonal bundle method (D-Bundle), together with their convergence properties. The numerical experiments use problems with up to a million variables, which indicates the usability of the methods in real-world applications with big datasets.
Napsu Karmitsa
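
LMBM and D-Bundle are involved algorithms; as a simple point of reference for the problem class only (this is the plain subgradient method on a small made-up nonsmooth convex test function, not the bundle machinery of the chapter):

    import numpy as np

    # Baseline for comparison only: the subgradient method on the nonsmooth
    # convex function f(x) = ||A x - b||_1. Bundle methods accumulate several
    # past subgradients into a model of f instead of using just the latest one,
    # which is what makes them far more efficient on such problems.
    rng = np.random.default_rng(0)
    m, n = 200, 50
    A = rng.normal(size=(m, n))
    b = A @ rng.normal(size=n)

    def f(x):
        return np.abs(A @ x - b).sum()

    def subgrad(x):
        return A.T @ np.sign(A @ x - b)       # a valid subgradient of f at x

    x = np.zeros(n)
    best = f(x)
    for t in range(1, 5001):
        x = x - (1.0 / t) * subgrad(x)        # diminishing step size
        best = min(best, f(x))
    print(round(best, 4))                     # best objective value found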
Metaheuristics for Continuous Optimization of High-Dimensional Problems: State of the Art and Perspectives
Abstract
The age of big data brings new opportunities in many relevant fields, as well as new research challenges. Among the latter is the need for more effective and efficient optimization techniques, able to address problems with hundreds, thousands, and even millions of continuous variables. Over the last decade, researchers have developed various improvements of existing metaheuristics for tackling high-dimensional optimization problems, such as hybridization, local search and parameter adaptation. Another effective strategy is the cooperative coevolutionary approach, which decomposes the search space in order to obtain sub-problems of smaller size. Moreover, in some cases such powerful search algorithms have been combined with high-performance computing to address, within reasonable run times, very high-dimensional optimization problems. Nevertheless, despite the significant amount of research already carried out, there are still many open research issues and room for significant improvements. In order to provide a picture of the state of the art in the field of high-dimensional continuous optimization, this chapter describes the most successful algorithms presented in the recent literature, also outlining relevant trends and identifying possible future research directions.
Giuseppe A. Trunfio
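
A minimal sketch of the cooperative coevolutionary idea mentioned above: the variables are split into fixed blocks and each block is optimized by a basic (1+1) evolution strategy against a shared context vector (the separable test function, block size and schedule are illustrative assumptions, far simpler than the decompositions studied in the chapter):

    import numpy as np

    def sphere(x):                            # separable benchmark objective
        return float(np.sum(x ** 2))

    def cooperative_coevolution(f, dim=1000, block=50, cycles=20, steps=200, seed=0):
        # Cooperative coevolution: optimize one block of variables at a time
        # with a (1+1)-ES while the rest of the solution (the "context vector")
        # is held fixed, then move on to the next block.
        rng = np.random.default_rng(seed)
        context = rng.uniform(-5, 5, size=dim)
        blocks = [np.arange(i, min(i + block, dim)) for i in range(0, dim, block)]
        for _ in range(cycles):
            for idx in blocks:
                sigma = 0.5
                for _ in range(steps):
                    trial = context.copy()
                    trial[idx] += rng.normal(0, sigma, size=idx.size)
                    if f(trial) < f(context):
                        context = trial
                        sigma *= 1.1          # successful step: explore more
                    else:
                        sigma *= 0.95         # failed step: shrink the mutation
        return context

    best = cooperative_coevolution(sphere)
    print(round(sphere(best), 4))             # objective value reached (0 at the optimum)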
Convergent Parallel Algorithms for Big Data Optimization Problems
Abstract
When dealing with big data problems it is crucial to design methods able to decompose the original problem into smaller and more manageable pieces. Parallel methods lead to a solution by concurrently working on different pieces that are distributed among the available agents, thus exploiting the computational power of multi-core processors and efficiently solving the problem. Beyond gradient-type methods, which can of course be easily parallelized but suffer from practical drawbacks, a convergent decomposition framework for the parallel optimization of (possibly non-convex) big data problems was recently proposed. This framework is very flexible and includes both fully parallel and fully sequential schemes, as well as virtually all possibilities in between. We illustrate the versatility of this parallel decomposition framework by specializing it to different well-studied big data problems such as LASSO, logistic regression and support vector machine training. We give implementation guidelines and numerical results showing that the proposed parallel algorithms work very well in practice.
Simone Sagratella
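
As a point of reference for the LASSO specialization mentioned above, the sketch below shows the sequential coordinate-wise soft-thresholding update that such a decomposition framework distributes across blocks and cores; the data and regularization level are synthetic illustrative choices, and this is not the chapter's parallel implementation:

    import numpy as np

    def soft_threshold(z, gamma):
        return np.sign(z) * np.maximum(np.abs(z) - gamma, 0.0)

    def lasso_cd(X, y, lam, iters=100):
        # Cyclic coordinate descent for min_w 0.5*||y - X w||^2 + lam*||w||_1.
        # A parallel scheme would update whole blocks of coordinates concurrently;
        # here the blocks are single coordinates updated one after another.
        n, p = X.shape
        w = np.zeros(p)
        col_sq = (X ** 2).sum(axis=0)
        r = y - X @ w                          # running residual
        for _ in range(iters):
            for j in range(p):
                r += X[:, j] * w[j]            # remove coordinate j from the fit
                rho = X[:, j] @ r
                w[j] = soft_threshold(rho, lam) / col_sq[j]
                r -= X[:, j] * w[j]            # add the updated coordinate back
        return w

    rng = np.random.default_rng(0)
    n, p, k = 200, 500, 10
    X = rng.normal(size=(n, p))
    w_true = np.zeros(p)
    w_true[:k] = rng.normal(size=k)
    y = X @ w_true + 0.01 * rng.normal(size=n)
    w_hat = lasso_cd(X, y, lam=0.1 * n ** 0.5)
    print(int((np.abs(w_hat) > 1e-6).sum()), "nonzero coefficients in the solution")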
Backmatter
Metadata
Title
Big Data Optimization: Recent Developments and Challenges
Editor
Ali Emrouznejad
Copyright Year
2016
Electronic ISBN
978-3-319-30265-2
Print ISBN
978-3-319-30263-8
DOI
https://doi.org/10.1007/978-3-319-30265-2
