
About this Book

This handbook brings together a variety of approaches to the uses of big data in multiple fields, primarily science, medicine, and business. It features contributions from researchers around the world, who share their findings and experience, and is intended to help spur further innovation in big data. The research is presented in a way that allows readers, regardless of their field of study, to learn how applications have proven successful and how similar applications could be used in their own fields. Contributors come from fields such as physics, biology, energy, healthcare, and business, and also address important topics such as fraud detection, privacy implications, legal perspectives, and the ethical handling of big data.





Chapter 1. Strategic Applications of Big Data

Although big data is often associated with retail marketing analytics, it is broadly relevant to today’s corporate strategies. Big data can be exploited in support of four “digital disciplines”: information excellence, i.e., better processes and asset utilization through data-based information technologies; solution leadership, i.e., better products and services through data enrichment of formerly standalone elements into cloud-connected, smart, digital things; collective intimacy, i.e., sophisticated algorithms processing billions or trillions of data points on attitudes, behaviors, contexts, and external data, in addition to demographics, to develop personalized, contextualized recommendations and services; and accelerated innovation, which enables ad hoc solvers across the world—and eventually machines—to process data sets and develop new insights. These digital disciplines can create unparalleled customer value, galvanize corporate strategy, and link information technologies to the ultimate success of any business in any vertical.
Joe Weinman

Chapter 2. Start with Privacy by Design in All Big Data Applications

The term “Big Data” is used to describe a universe of very large datasets that hold a variety of data types. This has spawned a new generation of information architectures and applications that use distributed platforms to deliver the fast processing speeds and the visualization needed to analyze and extract value from these extremely large sets of data. While not all data in Big Data applications will be personally identifiable, when this is the case, privacy interests arise. To be clear, privacy requirements are not obstacles to innovation or to realizing societal benefits from Big Data analytics—in fact, they can actually foster innovation and doubly enabling, win–win outcomes. This is achieved by taking a Privacy by Design approach to Big Data applications. This chapter begins by defining information privacy and then provides an overview of the privacy risks associated with Big Data applications. Finally, the authors discuss Privacy by Design as an international framework for privacy and provide guidance on using the Privacy by Design Framework and its 7 Foundational Principles to achieve both innovation and privacy—not one at the expense of the other.
Ann Cavoukian, Michelle Chibba

Chapter 3. Privacy Preserving Federated Big Data Analysis

Biomedical data are often collected and hosted by different agents (e.g., hospitals, insurance companies, sequencing centers). For example, regional hospitals hold the data of local patients, while data from the same patient can also be spread across multiple hospitals and institutions when she makes multiple visits. There are many benefits to using distributed data together in research studies, but it is challenging to pool the raw data due to efficiency and privacy concerns.
We review solutions for privacy-preserving federated data analysis. To better explain these solutions, we begin with the architectures and optimization methods. We present the server/client and decentralized architectures for performing privacy-preserving federated data analysis. Under these architectures, the Newton-Raphson method and the alternating direction method of multipliers (ADMM) framework are introduced. Specifically, consensus and sharing problems are formulated under the ADMM framework for horizontally and vertically partitioned data, respectively. We further introduce secure multiparty computation (SMC) protocols to protect the intermediary results in communication, as well as asynchronous parallel optimization, which covers some applications. Finally, we review the state of the art in applications such as regression, classification, and evaluation for privacy-preserving federated data analysis.
Privacy-preserving federated data analysis can enhance security in real-world biomedical applications. Recent progress sheds light on the development of secure and efficient federated data analysis techniques that use advanced cryptographic techniques and parallelization methods to reduce computational complexity and communication costs while respecting privacy. Federated models bring the computation to the data rather than bringing the data to the computation. The appropriate adaptation of these models will address the privacy concerns of patients and institutions while preserving the utility of the data in analyses.
Wenrui Dai, Shuang Wang, Hongkai Xiong, Xiaoqian Jiang
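As a toy illustration of the consensus formulation sketched in this chapter's abstract, the code below runs consensus ADMM for a least-squares problem split across three hypothetical sites that exchange only parameter vectors, never raw data. All data, sizes, and the penalty rho are invented, and a real federated deployment would add the secure-communication layer the chapter describes:

```python
import numpy as np

rng = np.random.default_rng(0)
n_sites, n_feat = 3, 4
A = [rng.normal(size=(20, n_feat)) for _ in range(n_sites)]  # local design matrices
b = [rng.normal(size=20) for _ in range(n_sites)]            # local outcomes
rho = 1.0                                                    # ADMM penalty parameter

z = np.zeros(n_feat)                          # global consensus estimate
u = [np.zeros(n_feat) for _ in range(n_sites)]  # per-site dual variables

for _ in range(200):
    # Local step: each site solves its regularized subproblem privately.
    x = [np.linalg.solve(Ai.T @ Ai + rho * np.eye(n_feat),
                         Ai.T @ bi + rho * (z - ui))
         for Ai, bi, ui in zip(A, b, u)]
    # Server step: average the parameter-only messages from all sites.
    z = np.mean([xi + ui for xi, ui in zip(x, u)], axis=0)
    # Dual update at each site.
    u = [ui + xi - z for ui, xi in zip(u, x)]

# z approximates the solution of the pooled least-squares problem,
# computed here only to verify the consensus result.
pooled = np.linalg.lstsq(np.vstack(A), np.concatenate(b), rcond=None)[0]
```

Note that only `x_i + u_i` leaves each site; the chapter's SMC protocols would additionally protect these intermediary vectors in transit.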

Chapter 4. Word Embedding for Understanding Natural Language: A Survey

Word embedding, in which semantic and syntactic features are captured from unlabeled text data, is a basic procedure in Natural Language Processing (NLP). The extracted features can thus be organized in a low-dimensional space. Some representative word embedding approaches include the Probability Language Model, the Neural Network Language Model, and Sparse Coding. State-of-the-art methods such as skip-gram with negative sampling, noise-contrastive estimation, matrix factorization, and hierarchical structure regularizers are applied correspondingly to resolve these models. Most of this literature works on observed counts and co-occurrence statistics to learn the word embeddings. The increasing scale of data, the sparsity of data representations, word position, and training speed are the main challenges in designing word embedding algorithms. In this survey, we first introduce the motivation and background of word embedding. Next, we introduce methods of text representation as preliminaries, as well as some existing word embedding approaches such as the Neural Network Language Model and the Sparse Coding approach, along with their evaluation metrics. In the end, we summarize the applications of word embedding and discuss its future directions.
Yang Li, Tao Yang
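As a hedged illustration of the count-based, matrix-factorization family of methods this survey covers, here is a minimal PPMI-plus-truncated-SVD embedding on a toy corpus; the corpus, window size, and dimensionality are all arbitrary assumptions, not the survey's own experiment:

```python
import numpy as np

corpus = "the cat sat on the mat the dog sat on the rug".split()
vocab = sorted(set(corpus))
idx = {w: i for i, w in enumerate(vocab)}
V = len(vocab)

# Co-occurrence counts within a symmetric window of 1.
C = np.zeros((V, V))
for i, w in enumerate(corpus):
    for j in (i - 1, i + 1):
        if 0 <= j < len(corpus):
            C[idx[w], idx[corpus[j]]] += 1

# Positive pointwise mutual information (PPMI), a standard reweighting
# of raw co-occurrence counts before factorization.
total = C.sum()
pw = C.sum(axis=1, keepdims=True) / total   # word marginals
pc = C.sum(axis=0, keepdims=True) / total   # context marginals
with np.errstate(divide="ignore", invalid="ignore"):
    pmi = np.log((C / total) / (pw * pc))
ppmi = np.where(np.isfinite(pmi) & (pmi > 0), pmi, 0.0)

# Low-dimensional embeddings from a truncated SVD of the PPMI matrix.
U, S, _ = np.linalg.svd(ppmi)
dim = 2
emb = U[:, :dim] * S[:dim]   # one 2-d vector per vocabulary word
```

Neural approaches such as skip-gram with negative sampling learn comparable representations implicitly; this count-based route simply makes the co-occurrence statistics explicit.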

Applications in Science


Chapter 5. Big Data Solutions to Interpreting Complex Systems in the Environment

The amount of relevant published data available in the environmental sciences has increased rapidly in recent years. For example, the National Oceanic and Atmospheric Administration (NOAA) has published vast data resources comprising tremendous volumes of high-quality environmental data. Analyzing those data sets poses unprecedented challenges and opportunities for environmental scientists. The goal of this chapter is to present a practical investigation of big data tools that can be used to analyze environmental data sets and provide environmental information to decision makers in the political and non-profit spheres. Throughout this chapter, we provide examples of the use of big data analysis in assessing environmental impact and change in real time, in hopes of initiating discussion towards benchmarking the key features and considerations of big data techniques.
Hongmei Chi, Sharmini Pitter, Nan Li, Haiyan Tian

Chapter 6. High Performance Computing and Big Data

High Performance Computing (HPC) has traditionally been characterized by low latency, high throughput, massive parallelism, and massively distributed systems. Big Data and analytics platforms share some of the same characteristics but, as of today, are somewhat limited in their guarantees on latency and throughput. Big Data platforms have been applied to problems where the data being operated upon is in motion, while HPC has traditionally been applied to scientific computations where data is at rest. The programming paradigms used in Big Data platforms, for example MapReduce (Google Research Publication: MapReduce. Retrieved November 29, 2016, from http://research.google.com/archive/mapreduce.html) and Spark Streaming (Spark Streaming/Apache Spark. Retrieved November 29, 2016, from https://spark.apache.org/streaming/), have their genesis in HPC, but they need to address some of the distinct characteristics of Big Data platforms. Bringing High Performance to Big Data platforms thus means addressing the following:
Ingesting data at high volume with low latency
Processing streaming data at high volume with low latency
Storing data in a distributed data store
Indexing and searching the stored data for real-time processing
To achieve the four goals above, the right hardware and software components need to be chosen. With the plethora of software stacks and the different kinds of hardware infrastructure (including public/private cloud, on-premise, and co-located hardware), there are many criteria, characteristics, and metrics to evaluate in order to make the right choices. We show that it is of the utmost importance to have the right tools to make this kind of evaluation as accurate as possible, and then to have the appropriate software to maintain the performance of such systems as they scale. We then identify the different types of hardware infrastructure in the cloud, including Amazon Web Services (AWS) (Amazon Web Services. What is AWS? Retrieved November 29, 2016, from https://aws.amazon.com/what-is-aws), and different types of on-premise hardware infrastructure, including converged hyperscale infrastructure from vendors such as Nutanix (Nutanix-The Enterprise Cloud Company. Retrieved November 29, 2016, from http://www.nutanix.com/) and traditional vendors such as Dell and HP. We also explore high-performance offerings from emerging open network switch device makers such as Cumulus (Better, Faster, Easier Networks. Retrieved November 29, 2016, from https://cumulusnetworks.com/) and from traditional vendors such as Cisco (Cisco. Retrieved November 29, 2016, from http://www.cisco.com/), as well as various storage architectures and their relative merits in the context of Big Data.
Rishi Divate, Sankalp Sah, Manish Singh

Chapter 7. Managing Uncertainty in Large-Scale Inversions for the Oil and Gas Industry with Big Data

Inverse problems arise in almost all fields of science when the parameters of a postulated model have to be determined from a set of observed data. Due to the increasing volume of data collected by the oil and gas industry, there is an urgent need to address large-scale inverse problems. In this article, after examining both deterministic and statistical methods that are scalable for managing large volumes of data, we present the MapReduce paradigm as a potential speed-up technique for future implementations.
Jiefu Chen, Yueqin Huang, Tommy L. Binford, Xuqing Wu
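As a sketch of how the MapReduce paradigm could be applied to such an inversion, the toy code below evaluates a data-misfit objective over partitioned observations in map/reduce style; the linear forward model `g`, the shards, and the parameters are all invented stand-ins for the physics:

```python
from functools import reduce

def g(m, x):
    # Toy linear forward model standing in for the physics: d = m0 + m1 * x.
    return m[0] + m[1] * x

def map_partition(m, partition):
    # Map step: each worker reduces its shard to one partial misfit value.
    return sum((d - g(m, x)) ** 2 for x, d in partition)

def reduce_misfits(partials):
    # Reduce step: combine the per-shard partial sums into the total misfit.
    return reduce(lambda a, b: a + b, partials, 0.0)

# Observations (x, d) partitioned into shards, as a cluster would hold them.
shards = [[(0.0, 1.0), (1.0, 3.0)], [(2.0, 5.0)], [(3.0, 7.0)]]
m = (1.0, 2.0)  # candidate model parameters

misfit = reduce_misfits(map_partition(m, s) for s in shards)
```

An outer deterministic or statistical inversion loop would call this evaluation repeatedly with different candidate models; MapReduce keeps each evaluation's cost proportional to the largest shard rather than the whole dataset.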

Chapter 8. Big Data in Oil & Gas and Petrophysics

Petrophysics is the science (and art) of rock exploration. Even though its development has been driven primarily by oil exploration, with its motto of “if it ain’t broke, don’t fix it,” and despite (or maybe due to) the current low oil prices, the oil and gas industry is on the brink of a digital revolution. Organizations and measurements scattered across the globe are in dire need of cloud technology to bring enhanced calculation capabilities, communication, and collaboration within and between companies.
In this chapter, we provide guidance for Big Data petrophysical implementations. We approach this from two angles: first, why the oilfield needs this technology and what is required to benefit the most from it; second, what lessons petrophysics scientists and software developers can learn from the Big Data implementation best practices of other industries.
These recommendations are based on an actual implementation and our Big Data teaching and implementation experience.
Mark Kerzner, Pierre Jean Daniel

Chapter 9. Friendship Paradoxes on Quora

The “friendship paradox” is the statistical pattern that, in many social networks, most individuals’ friends have more friends on average than they do. This phenomenon has been observed in various real-world social networks, dating back to a seminal 1991 paper by S.L. Feld. In recent years, the availability of large volumes of data on online social networks has allowed researchers to also study generalizations of the core friendship-paradox idea to quantities other than connectivity. This has led to the finding that, in social networks such as Twitter, a typical person whom a randomly selected individual follows is usually a more active user and contributor than that individual. Here, we study friendship paradoxes on Quora, an online knowledge-sharing platform that is structured in a question-and-answer format. In addition to the “traditional” friendship paradoxes in the network of people following one another on Quora, we also study variants of the phenomenon that arise through the platform’s core interactions. We specifically focus on “upvoting” and “downvoting,” actions that people take to give positive and negative feedback on Quora answers. We observe that, for most answer authors who have few followers, for most of their answers, most of the upvoters of those answers have more followers than they do. This has potentially advantageous consequences for the distribution of content produced by these writers. Meanwhile, we also observe a paradox in downvoting: for most sufficiently active answer writers who got downvoted by other sufficiently active answer writers during a four-week period, most of their sufficiently active downvoters got downvoted more than they did. We explain how these paradoxes arise and place them in the context of recent research on friendship-paradox phenomena.
Shankar Iyer
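The core paradox computation described above is easy to state concretely. The sketch below, on an invented toy "follows" network, checks for each person whether the people they follow have more followers on average than they do:

```python
from statistics import mean

# Fabricated directed "follows" network, purely for illustration.
follows = {
    "a": ["b", "c"],
    "b": ["c"],
    "c": [],
    "d": ["c", "b"],
}

# Follower count of each user: how many people follow them.
followers = {u: sum(u in vs for vs in follows.values()) for u in follows}

# For each person who follows someone, compare their own follower count
# with the mean follower count of the people they follow.
paradox = {
    u: mean(followers[v] for v in follows[u]) > followers[u]
    for u in follows if follows[u]
}
```

On this toy graph the paradox holds for every user who follows anyone, because following concentrates on the already-popular nodes; the chapter's generalized variants replace "follower count" with activity measures such as upvotes received or downvotes received.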

Chapter 10. Deduplication Practices for Multimedia Data in the Cloud

Data deduplication is a promising and effective technology that provides storage savings in data centers and clouds. It is the process of identifying redundancy in data, removing all but one copy (or n copies) of duplicate data, and making all references point to that copy. Existing techniques often require access to the content of the data in order to establish redundancy. As more sensitive information is stored on clouds, encryption is commonly used to protect it, which obscures the content of the data from Cloud Service Providers (CSPs). This chapter deals with how to provide deduplication services while protecting data privacy. Most of the data stored today is multimedia content, with one estimate putting this type of content at about 80% of all corporate and public unstructured big data (Venter and Stein, Images & videos: really big data. Accessed 24 August 2016, 2012). In this chapter, we give an overview of some of the best practices for achieving efficient deduplication of multimedia content.
Fatema Rashid, Ali Miri
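A minimal content-addressed store conveys the basic (non-encrypted) deduplication mechanics the chapter builds on; the chunk size and interfaces here are invented, and real systems typically use content-defined chunking rather than fixed offsets:

```python
import hashlib

CHUNK = 4  # absurdly small fixed chunk size, for illustration only

store = {}  # digest -> chunk bytes: one physical copy per unique digest

def put(blob: bytes):
    """Split a blob into chunks, store unique chunks, return a recipe of digests."""
    refs = []
    for i in range(0, len(blob), CHUNK):
        chunk = blob[i:i + CHUNK]
        digest = hashlib.sha256(chunk).hexdigest()
        store.setdefault(digest, chunk)  # duplicate chunks map to the same key
        refs.append(digest)
    return refs

def get(refs):
    """Rebuild a blob from its recipe of chunk digests."""
    return b"".join(store[d] for d in refs)

r1 = put(b"abcdabcdabcd")  # three identical chunks -> one stored copy
r2 = put(b"abcdXYZW")      # reuses the "abcd" chunk already in the store
```

With encrypted data, the CSP cannot compute such content digests directly, which is exactly the tension the chapter's privacy-preserving schemes (e.g., convergent-style encryption) are designed to resolve.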

Chapter 11. Privacy-Aware Search and Computation Over Encrypted Data Stores

Systems today often handle massive amounts of data with little regard to the privacy or security issues that may arise. As corporations and governments increasingly monitor many aspects of our lives, the security and privacy concerns surrounding big data have also become apparent. While anonymization has been suggested for protecting user privacy, it has been shown to be unreliable. In contrast, cryptographic techniques are well studied and have provable and quantifiable security. There have been many works on enabling search over encrypted data. In this chapter, we look at some of the most important results in the area of searchable encryption and encrypted data processing, including encrypted indexes, Bloom filters, and Boneh’s IBE-based searchable encryption scheme. We also discuss some of the most promising developments of recent years: performing range queries through order-preserving encryption and computing over ciphertext using homomorphic encryption. To better illustrate the techniques, the schemes are described in various sample applications involving text and media search.
Hoi Ting Poon, Ali Miri
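One building block the chapter surveys, the Bloom-filter index, can be sketched as follows: keywords are keyed through HMAC so the stored index reveals no plaintext terms to the server. The filter size, hash count, and key below are arbitrary illustrative choices, not parameters from any specific scheme:

```python
import hashlib
import hmac

M, K = 256, 3                   # filter size in bits, hashes per keyword
KEY = b"client-secret-key"      # hypothetical key held only by the client

def positions(word: str):
    # Derive K bit positions from an HMAC tag of the keyword, so the
    # server never learns the plaintext term behind a set bit.
    tag = hmac.new(KEY, word.encode(), hashlib.sha256).digest()
    return [int.from_bytes(tag[2 * i:2 * i + 2], "big") % M for i in range(K)]

def build_index(words):
    bits = [0] * M
    for w in words:
        for p in positions(w):
            bits[p] = 1
    return bits

def maybe_contains(index, word):
    # Bloom filters admit false positives but never false negatives.
    return all(index[p] for p in positions(word))

index = build_index(["tumor", "genome", "privacy"])
```

A client queries by sending only the HMAC-derived positions for a search term; the server checks the bits without learning the term, at the cost of an occasional false-positive match.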

Chapter 12. Civil Infrastructure Serviceability Evaluation Based on Big Data

Failure of civil infrastructure, such as bridges and pipelines, can have serious public-safety and economic consequences. Structural health monitoring (SHM) plays a significant role in preventing and mitigating the course of structural damage. In this work, a multi-scale SHM framework based on the Hadoop Ecosystem (MS-SHM-Hadoop) is proposed to monitor and evaluate the serviceability of civil infrastructure. By utilizing the fault-tolerant Hadoop Distributed File System (HDFS) and the high-performance parallel data-processing engine of the MapReduce programming paradigm, MS-SHM-Hadoop achieves high scalability and robustness in data ingestion, fusion, processing, retrieval, and analytics. MS-SHM-Hadoop is a multi-scale reliability-analysis framework encompassing a nationwide civil infrastructure survey, global structural integrity analysis, and structural components’ reliability analysis. The nationwide civil infrastructure survey uses deep-learning techniques to evaluate serviceability according to real-time sensory data or archived civil-infrastructure-related data such as traffic status, weather conditions, and a structure’s configuration. The global structural integrity analysis of a targeted structure is made by processing and analyzing the measured vibration signals incurred by external loads such as wind and traffic flow. Component-wise reliability analysis is also enabled by deep-learning techniques, where the input data are derived from the measured structural load effects, hyperspectral and 3D point-cloud images, and moisture measurements of structural components. As one of its major contributions, this work employs a Bayesian network to formulate the integral serviceability of a civil infrastructure according to the components’ serviceability and inter-component correlations, where the inter-component correlations are jointly specified using statistics-oriented machine learning methods (e.g., association rule learning) or structural mechanics modeling and simulation.
Yu Liang, Dalei Wu, Dryver Huston, Guirong Liu, Yaohang Li, Cuilan Gao, Zhongguo John Ma

Applications in Medicine


Chapter 13. Nonlinear Dynamical Systems with Chaos and Big Data: A Case Study of Epileptic Seizure Prediction and Control

The modeling of the dynamic behavior of systems is a ubiquitous problem in all facets of human endeavor. Dynamical systems have been studied and modeled since the nineteenth century and are currently applied in almost all branches of science and engineering, including the social sciences. The development of computers and scientific/numerical methods has accelerated the pace of new developments in modeling both linear and nonlinear dynamical systems. However, modeling complex physical system behaviors as nonlinear dynamical systems is still difficult and challenging. General approaches to solving such systems typically fail and require problem-dependent techniques to satisfy the constraints imposed by the initial conditions when predicting state-space trajectories. In addition, they require the enormous computational power available on supercomputers. Numerical tools such as HPCmatlab enable rapid prototyping of algorithms for large-scale computations and data analysis. Big Data applications are computationally intensive and I/O bound. A state-of-the-art case study involving the big data of epileptic seizure prediction and control is presented. The nonlinear dynamical model is based on the biology of the brain and its neurons, chaotic systems, nonlinear signal processing, and feedback and adaptive systems. The goal is to develop new feedback controllers for the suppression of epileptic seizures, based on electroencephalographic (EEG) data, by altering the brain dynamics through the use of electrical stimulation. The research is expected to contribute to new modes of treatment for epilepsy and other dynamical brain disorders.
Ashfaque Shafique, Mohamed Sayeed, Konstantinos Tsakalis

Chapter 14. Big Data to Big Knowledge for Next Generation Medicine: A Data Science Roadmap

Big Data is proving to be the launching pad of next-generation medicine. While there is unprecedented optimism about its potential to improve healthcare, it has also sparked the need to bridge medical expertise with Big Data. The scientific temperament and skills for the end-to-end translation of Big Data into Big Knowledge have been encapsulated in the new specialty of Data Science. This chapter discusses the key Data Science technologies that are enabling the generation of new theory and new practice of medicine in clinics and the community. In addition to computational and analytical frameworks, this chapter lays special emphasis on handling Veracity and generating Value from biomedical Big Data. Capturing Reliably, Approaching Systemically, Phenotyping Deeply, and Enabling Decisions are presented as the key underlying themes for generating value in medicine. Techniques such as open-source reliable pipelines, networks and graphs for approaching systemically, mining millions of patient records for deep phenotyping, and multivariate models that enable policy and decisions are elaborated. These are discussed in the context of case studies ranging from critical care to community health. Multi-omic initiatives, ranging from Genomes to Exposomes, are geared towards such Data-Science-driven integration and are briefly discussed. This leads to the biggest scientific opportunity: using Big Data for Precision Medicine, which aims to precisely define and treat illnesses within this integrated framework. Finally, the chapter discusses the open technological and social challenges in the pervasive adoption of the Big-Data-to-Bedside cycle and proposes a combination of traditional and Big Data methods as a way forward in transforming medicine.
Tavpritesh Sethi

Chapter 15. Time-Based Comorbidity in Patients Diagnosed with Tobacco Use Disorder

Healthcare is one of the promising fields where Big Data tools and techniques can have the highest impact. One of the key problems in the healthcare sector is analyzing the impact of comorbidities. Comorbidity is a medical condition in which a patient develops multiple diseases simultaneously. Research on finding comorbidities over time is rare. In this paper, our focus is to find time-based comorbidities in patients diagnosed with Tobacco Use Disorder (TUD). First, we explain a generalized process for finding chronological comorbidities. Then, we analyze the electronic medical records of patients diagnosed with Tobacco Use Disorder from hospitals in the West South-Central region of the United States (1999–2013). Specifically, we discover comorbidities in TUD patients across three hospital visits. We also compare the results with patients who never developed TUD.
We found interesting results indicating that some comorbidities differ over time between TUD and non-TUD patients, while others do not. Knowledge about time-based comorbidities can help physicians take preemptive actions to prevent future diseases.
Pankush Kalgotra, Ramesh Sharda, Bhargav Molaka, Samsheel Kathuri
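A chronological comorbidity count of the kind described can be sketched on fabricated visit records: for each patient, count ordered disease pairs where the first disease appears at an earlier visit than the second. The records, disease names, and counting scheme below are illustrative assumptions, not the chapter's actual method or data:

```python
from collections import Counter
from itertools import combinations

# Fabricated records: per patient, a chronological list of visits,
# each visit a list of diagnosed conditions.
visits = {
    "p1": [["TUD"], ["COPD"], ["COPD", "anxiety"]],
    "p2": [["TUD", "hypertension"], ["COPD"]],
}

# Count ordered pairs (d1 at an earlier visit, d2 at a later visit).
pairs = Counter()
for record in visits.values():
    for i, j in combinations(range(len(record)), 2):
        for d1 in record[i]:
            for d2 in record[j]:
                if d1 != d2:
                    pairs[(d1, d2)] += 1
```

Comparing such pair counts between a TUD cohort and a matched non-TUD cohort is the essence of the time-based comparison the abstract describes.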

Chapter 16. The Impact of Big Data on the Physician

Over the past few years, technology has evolved to integrate itself into a myriad of aspects of everyday life, including healthcare access and delivery. One way in which technology has begun to truly transform healthcare is with big data and big data analytics. Using sophisticated tools to capture, aggregate, and translate data across multiple sources, ranging from traditional electronic health records to non-traditional consumer devices, big data has the potential to transform the practice of medicine, both at the level of the patient-physician relationship and in clinical decision-making and treatment. Already apparent are intriguing examples of increased patient engagement, improved evidence-based physician decision-making, and multi-faceted tools for more individualized patient care. As big data becomes more and more incorporated into the practice of medicine, issues regarding systems interoperability, privacy and security, and legal and regulatory boundaries will need to be resolved. This chapter is intended as an overview of big data applications and their potential challenges as they relate to the patient and the physician.
Elizabeth Le, Sowmya Iyer, Teja Patil, Ron Li, Jonathan H. Chen, Michael Wang, Erica Sobel

Applications in Business


Chapter 17. The Potential of Big Data in Banking

The emergence of the notion of Big Data has created substantial value expectations for organizations with huge collections of data. Banks, like all other organizations, recognize the value potential contained in big data. For financial services companies, with mostly information circulating in their value chain, data—the source of that information—is arguably one of their most important assets. The current issue is to what extent these data assets may be leveraged to produce value and gain competitive advantage. The goal of this chapter is to discuss the issues, challenges, and important dimensions in the activities of banks and other financial institutions that need to be understood in order to produce a framework for efficiently utilizing the potential of Big Data approaches. The authors argue that, although banks have much in common with other businesses regarding customer and operations management, a priority feature of analytics in banking is risk management. No less important than the technology issues are the managerial implications, centering on issues such as the acceptance of Big Data approaches and intelligence culture.
Rimvydas Skyrius, Gintarė Giriūnienė, Igor Katin, Michail Kazimianec, Raimundas Žilinskas

Chapter 18. Marketing Applications Using Big Data

Big Data applications abound in all disciplines. In this chapter we consider the practical applications of Big Data analytics in marketing. Businesses spend significant human and financial resources marketing their products and services to potential customers. We look at how businesses gather data from multiple sources and use it to promote their products and services to the customers who are most likely to benefit from them. Typically, marketing involves communicating with the potential customer through multiple advertising and promotion media, with much of the information sent in print or electronic form. Businesses that use appropriate target marketing expect that a person within that market will eventually become a customer. With this in mind, the approaches businesses take are geared towards sending the right information to the right person. To gain this type of knowledge, businesses use extensive data from multiple sources. With advancements in computing power, affordable resources, and social media, businesses are in a better position to target their materials at the potential customer. Even though the cost of information dissemination is very small, if information is sent to the wrong person, that person is not only going to discard it but may resent being bombarded with unwanted information. We show the various techniques real businesses use to target the right customer and to send information that will actually be used; Big Data techniques are helpful in this effort. We point out how the data were used in marketing and with what success.
S. Srinivasan

Chapter 19. Does Yelp Matter? Analyzing (And Guide to Using) Ratings for a Quick Serve Restaurant Chain

In this paper, we analyze reviews of a national quick serve (fast food) restaurant chain. Results show that company-owned restaurants consistently perform better than franchised restaurants in numeric rating (1–5 stars) in states that contain both types of operations. Using sales data, correlations are used to evaluate the relationship between the number of guests or the sales of a restaurant and its rating. We found positive correlations between the number of customers at a location and the numeric rating, but no correlation between the average ticket size and the numeric rating. The study also found that 5-star rated restaurants draw frequent comments about the cleanliness and friendliness of the staff, whereas 1-star rated restaurants draw comments more closely related to speed of service and temperature of food. Overall, the study found that, in contrast to previous research, rating sites like Yelp are relevant in the quick serve restaurant sector, and reviews can be used to inform operational decisions, leading to improved performance. Detailed explanations of the process of extracting these data, along with relevant code, are provided for future researchers interested in analyzing Yelp reviews.
Bogdan Gadidov, Jennifer Lewis Priestley
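The correlation check described in the abstract amounts to a per-location Pearson correlation between traffic and rating. The sketch below uses fabricated guest counts and star ratings, since the chapter's actual data are proprietary:

```python
import numpy as np

# Fabricated per-location data: guest counts and average Yelp star ratings.
guests = np.array([120, 340, 560, 410, 700, 150])
stars = np.array([2.5, 3.5, 4.5, 4.0, 5.0, 3.0])

# Pearson correlation coefficient between traffic and rating.
r = np.corrcoef(guests, stars)[0, 1]
```

A strongly positive `r` on data like this is what the chapter reports for customer counts versus ratings; the same computation on average ticket size versus rating yielded no correlation in the study.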

