About this Book

Over the past decade, Big Data have become ubiquitous in all economic sectors, scientific disciplines, and human activities. They have led to striking technological advances, affecting all human experiences. Our ability to manage, understand, interrogate, and interpret such extremely large, multisource, heterogeneous, incomplete, multiscale, and incongruent data has not kept pace with the rapid increase in the volume, complexity, and proliferation of digital information. There are three reasons for this shortfall. First, the volume of data is increasing much faster than the corresponding rise of our computational processing power (Kryder’s law > Moore’s law). Second, traditional discipline bounds inhibit expeditious progress. Third, our education and training activities have fallen behind the accelerated trend of scientific, information, and communication advances. There are very few rigorous instructional resources, interactive learning materials, and dynamic training environments that support active data science learning. The textbook balances the mathematical foundations with dexterous demonstrations and examples of data, tools, modules and workflows that serve as pillars for the urgently needed bridge to close the gap between the supply of and demand for predictive analytics skills.

Exposing the enormous opportunities presented by the tsunami of Big Data, this textbook aims to identify specific knowledge gaps, educational barriers, and workforce readiness deficiencies. Specifically, it focuses on the development of a transdisciplinary curriculum integrating modern computational methods, advanced data science techniques, innovative biomedical applications, and impactful health analytics.

The content of this graduate-level textbook fills a substantial gap in integrating modern engineering concepts, computational algorithms, mathematical optimization, statistical computing and biomedical inference. Big data analytic techniques and predictive scientific methods demand broad transdisciplinary knowledge, appeal to an extremely wide spectrum of readers/learners, and provide incredible opportunities for engagement throughout the academy, industry, regulatory and funding agencies.

The two examples below demonstrate the powerful need for scientific knowledge, computational abilities, interdisciplinary expertise, and modern technologies necessary to achieve desired outcomes (improving human health and optimizing future return on investment). This can only be achieved by appropriately trained teams of researchers who can develop robust decision support systems using modern techniques and effective end-to-end protocols, like the ones described in this textbook.

• A geriatric neurologist is examining a patient complaining of gait imbalance and posture instability. To determine if the patient may suffer from Parkinson’s disease, the physician acquires clinical, cognitive, phenotypic, imaging, and genetics data (Big Data). Most clinics and healthcare centers are not equipped with skilled data analytic teams that can wrangle, harmonize and interpret such complex datasets. A learner who completes a course of study using this textbook will have the competency and ability to manage the data, generate a protocol for deriving biomarkers, and provide an actionable decision support system. The results of this protocol will help the physician understand the entire patient dataset and assist in making a holistic evidence-based, data-driven, clinical diagnosis.

• To improve the return on investment for their shareholders, a healthcare manufacturer needs to forecast the demand for their product subject to environmental, demographic, economic, and bio-social sentiment data (Big Data). The organization’s data-analytics team is tasked with developing a protocol that identifies, aggregates, harmonizes, models and analyzes these heterogeneous data elements to generate a trend forecast. This system needs to provide an automated, adaptive, scalable, and reliable prediction of the optimal investment, e.g., R&D allocation, that maximizes the company’s bottom line. A reader who completes a course of study using this textbook will be able to ingest the observed structured and unstructured data, mathematically represent the data as a computable object, and apply appropriate model-based and model-free prediction techniques. The results of these techniques may be used to forecast the expected relation between the company’s investment, product supply, and general demand for healthcare (providers and patients), and to estimate the return on initial investments.

Table of Contents

Frontmatter

Chapter 1. Motivation

Abstract
This textbook is based on the Data Science and Predictive Analytics (DSPA) course taught by the author at the University of Michigan. These materials collectively aim to provide learners with a solid foundation of the challenges, opportunities, and strategies for designing, collecting, managing, processing, interrogating, analyzing, and interpreting complex health and biomedical datasets. Readers that finish this textbook and successfully complete the examples and assignments will gain unique skills and acquire a tool-chest of methods, software tools, and protocols that can be applied to a broad spectrum of Big Data problems.
Ivo D. Dinov

Chapter 2. Foundations of R

Abstract
This Chapter introduces the foundations of R programming for visualization, statistical computing and scientific inference. Specifically, in this Chapter we will (1) discuss the rationale for selecting R as a computational platform for all DSPA demonstrations; (2) present the basics of installing the shell-based R environment and the RStudio user interface; (3) show some simple R commands and scripts (e.g., translate long-to-wide data format, data simulation, data stratification and subsetting); (4) introduce variable types and their manipulation; (5) demonstrate simple mathematical functions, statistics, and matrix operators; (6) explore simple data visualization; and (7) introduce optimization and model fitting. The chapter appendix includes references to R introductory and advanced resources, as well as a primer on debugging.
Ivo D. Dinov
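
As a taste of the basic R operations covered in this chapter, here is a minimal base-R sketch; the data frame, variable names, and values are hypothetical illustrations, not taken from the book:

  # Simulate a small long-format dataset: 5 subjects measured at 3 visits
  set.seed(1234)
  long_df <- data.frame(
    id    = rep(1:5, each = 3),
    visit = rep(c("v1", "v2", "v3"), times = 5),
    score = rnorm(15, mean = 100, sd = 15)
  )

  # Translate long-to-wide format with base-R reshape()
  wide_df <- reshape(long_df, idvar = "id", timevar = "visit", direction = "wide")

  # Simple stratification/subsetting: subjects whose first-visit score exceeds 100
  subset(wide_df, score.v1 > 100)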

Chapter 3. Managing Data in R

Abstract
In this Chapter, we will discuss strategies to import data and export results. Also, we are going to learn the basic tricks we need to know about processing different types of data. Specifically, we will illustrate common R data structures and strategies for loading (ingesting) and saving (regurgitating) data. In addition, we will (1) present some basic statistics, e.g., for measuring central tendency (mean, median, mode) or dispersion (variance, quartiles, range); (2) explore simple plots; (3) demonstrate the uniform and normal distributions; (4) contrast numerical and categorical types of variables; (5) present strategies for handling incomplete (missing) data; and (6) show the need for cohort-rebalancing when comparing imbalanced groups of subjects, cases or units.
Ivo D. Dinov
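
The following short base-R sketch, using a hypothetical measurement vector, illustrates the kind of summary statistics and naive missing-data handling that this chapter develops in much greater depth:

  # A hypothetical measurement vector with missing (NA) entries
  x <- c(2.1, 3.4, NA, 5.0, 4.2, NA, 3.9)

  mean(x, na.rm = TRUE)      # central tendency: mean
  median(x, na.rm = TRUE)    # central tendency: median
  var(x, na.rm = TRUE)       # dispersion: variance
  quantile(x, na.rm = TRUE)  # dispersion: quartiles
  range(x, na.rm = TRUE)     # dispersion: range

  # A naive imputation: replace missing values with the observed mean
  x_imputed <- ifelse(is.na(x), mean(x, na.rm = TRUE), x)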

Chapter 4. Data Visualization

Abstract
In this chapter, we use a broad range of simulations and hands-on activities to highlight some of the basic data visualization techniques using R. A brief discussion of alternative visualization methods is followed by demonstrations of histograms, density, pie, jitter, bar, line and scatter plots, as well as strategies for displaying trees, more general graphs, and 3D surface plots. Many of these are also used throughout the textbook in the context of addressing the graphical needs of specific case-studies.
Ivo D. Dinov
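
As a minimal sketch of the base-R graphics this chapter builds on (simulated data, not one of the book's case-studies):

  set.seed(42)
  y <- rnorm(1000)                                   # simulated data for illustration

  hist(y, breaks = 30, main = "Histogram")           # histogram
  plot(density(y), main = "Kernel density")          # density plot
  plot(y[1:100], type = "l", main = "Line plot")     # line plot of the first 100 values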

Chapter 5. Linear Algebra & Matrix Computing

Abstract
Linear algebra is a branch of mathematics that studies linear associations using vectors, vector-spaces, linear equations, linear transformations, and matrices. It is generally challenging to visualize complex data, e.g., large vectors, tensors, and tables in n-dimensional Euclidean spaces (n ≥ 3). Linear algebra allows us to mathematically represent, computationally model, statistically analyze, synthetically simulate, and visually summarize such complex data.
Ivo D. Dinov
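
A small base-R sketch of the kind of matrix computing the chapter formalizes (the linear system below is a made-up example):

  # A 3x3 linear system A x = b
  A <- matrix(c(2, 1, 0,
                1, 3, 1,
                0, 1, 2), nrow = 3, byrow = TRUE)
  b <- c(1, 2, 3)

  x <- solve(A, b)     # solve the linear system
  A %*% x              # matrix-vector product; reproduces b
  t(A)                 # transpose
  eigen(A)$values      # eigenvalues (spectrum) of A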

Chapter 6. Dimensionality Reduction

Abstract
Now that we have most of the fundamentals covered in the previous chapters, we can delve into the first data analytic method, dimension reduction, which reduces the number of features when modeling a very large number of variables. Dimension reduction can help us extract a set of “uncorrelated” principal variables and reduce the complexity of the data. We are not simply picking some of the original variables. Rather, we are constructing new “uncorrelated” variables as functions of the old features.
Ivo D. Dinov
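
For instance, principal component analysis constructs such uncorrelated variables as linear combinations of the original features; a minimal base-R sketch on the built-in iris measurements (an illustration, not one of the book's case-studies) might look like:

  pca <- prcomp(iris[, 1:4], center = TRUE, scale. = TRUE)
  summary(pca)          # proportion of variance explained by each component
  head(pca$x[, 1:2])    # scores on the first two principal components
  round(cor(pca$x), 3)  # the derived components are (numerically) uncorrelated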

Chapter 7. Lazy Learning: Classification Using Nearest Neighbors

Abstract
In the next several Chapters, we will concentrate on various progressively advanced machine learning, classification and clustering techniques. There are two categories of learning techniques we will explore: supervised (human-guided) classification and unsupervised (fully-automated) clustering. In general, supervised classification aims to identify or predict predefined classes and label new objects as members of specific classes. In contrast, unsupervised clustering attempts to group objects into sets without a priori labels and to determine relationships between objects.
Ivo D. Dinov
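
A minimal sketch of supervised nearest-neighbor classification, assuming the class package and the built-in iris data as an illustrative stand-in for the chapter's case-studies:

  library(class)   # provides knn()

  set.seed(11)
  train_idx <- sample(seq_len(nrow(iris)), 100)
  train_x <- iris[train_idx, 1:4];  test_x <- iris[-train_idx, 1:4]
  train_y <- iris$Species[train_idx]

  # Label new flowers by majority vote among their k = 5 nearest neighbors
  pred <- knn(train = train_x, test = test_x, cl = train_y, k = 5)
  table(predicted = pred, actual = iris$Species[-train_idx])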

Chapter 8. Probabilistic Learning: Classification Using Naive Bayes

Abstract
The introduction to Chap. 7 presented the types of machine learning methods and described lazy classification for numerical data. What about nominal features or textual data? In this Chapter, we will begin to explore some classification techniques for categorical data. Specifically, we will (1) present the Naive Bayes algorithm; (2) review its assumptions; (3) discuss Laplace estimation; and (4) illustrate the Naive Bayesian classifier on a Head and Neck Cancer Medication case-study.
Ivo D. Dinov
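
A minimal Naive Bayes sketch with Laplace smoothing, assuming the e1071 package and the built-in HairEyeColor data (not the book's Head and Neck Cancer Medication case-study):

  library(e1071)   # provides naiveBayes()

  # Expand the HairEyeColor contingency table into one row per individual
  hec <- as.data.frame(HairEyeColor)
  hec <- hec[rep(seq_len(nrow(hec)), hec$Freq), c("Hair", "Eye", "Sex")]

  # laplace = 1 applies add-one (Laplace) smoothing to avoid zero probabilities
  fit <- naiveBayes(Sex ~ Hair + Eye, data = hec, laplace = 1)
  predict(fit, head(hec))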

Chapter 9. Decision Tree Divide and Conquer Classification

Abstract
When the classification process needs to be transparent, the kNN and naive Bayes methods we presented earlier may not be useful, as they do not generate explicit classification rules. In some cases, we need well-stated rules for our decisions, much like a scoring criterion for driving ability or a credit score for loan underwriting. Many situations require a clear and easily understandable decision tree that lets us follow the classification process from start to finish.
Ivo D. Dinov
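
A minimal decision-tree sketch, assuming the rpart package and the built-in iris data, showing how the fitted tree exposes explicit, human-readable rules:

  library(rpart)

  fit <- rpart(Species ~ ., data = iris, method = "class")
  print(fit)                                 # the explicit rule set, printed as a tree
  predict(fit, head(iris), type = "class")   # apply the rules to new cases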

Chapter 10. Forecasting Numeric Data Using Regression Models

Abstract
In the previous Chaps. 7, 8, and 9, we covered classification methods that use mathematical formalism to address everyday life prediction problems. In this Chapter, we will focus on specific model-based statistical methods providing forecasting and classification functionality. Specifically, we will (1) demonstrate the predictive power of multiple linear regression; (2) show the foundation of regression trees and model trees; and (3) examine two complementary case-studies (Baseball Players and Heart Attack).
Ivo D. Dinov
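
A minimal multiple linear regression sketch using base R and the built-in mtcars data (the chapter's own case-studies use baseball and heart-attack data instead):

  fit <- lm(mpg ~ wt + hp + disp, data = mtcars)
  summary(fit)    # coefficients, R-squared, p-values

  # Forecast fuel efficiency for a new (hypothetical) car
  newcar <- data.frame(wt = 3.0, hp = 150, disp = 200)
  predict(fit, newdata = newcar, interval = "prediction")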

Chapter 11. Black Box Machine-Learning Methods: Neural Networks and Support Vector Machines

Abstract
In this Chapter, we are going to cover two very powerful machine-learning algorithms. These techniques have complex mathematical formulations; however, efficient algorithms and reliable software packages have been developed to utilize them for various practical applications. We will (1) describe Neural Networks as analogues of biological neurons; (2) develop hands-on a neural net that can be trained to compute the square-root function; (3) describe support vector machine (SVM) classification; and (4) complete several case-studies, including optical character recognition (OCR), the Iris flowers, Google Trends and the Stock Market, and Quality of Life in chronic disease.
Ivo D. Dinov
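
A minimal sketch of the square-root network idea, assuming the neuralnet package; the hyperparameters below are illustrative choices, not necessarily the book's:

  library(neuralnet)

  set.seed(2024)
  x <- runif(100, min = 0, max = 100)
  train <- data.frame(x = x, y = sqrt(x))

  # Train a small feed-forward network to approximate the square-root function
  net <- neuralnet(y ~ x, data = train, hidden = 10, threshold = 0.01)

  # Compare network predictions with the true square roots
  test <- data.frame(x = c(4, 25, 81))
  cbind(test, predicted = compute(net, test)$net.result, truth = sqrt(test$x))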

Chapter 12. Apriori Association Rules Learning

Abstract
HTTP cookies are used to monitor web-traffic and track users surfing the Internet. We often notice that promotions (ads) on websites tend to match our needs, reveal our prior browsing history, or reflect our interests. That is not an accident. Nowadays, recommendation systems rely heavily on machine learning methods that can learn the behavior, e.g., purchasing patterns, of individual consumers. In this chapter, we will uncover some of the mystery behind recommendation systems based on transactional records. Specifically, we will (1) discuss association rules and their support and confidence; (2) present the Apriori algorithm for association rule learning; and (3) cover step-by-step a set of case-studies, including a toy example, Head and Neck Cancer Medications, and Grocery purchases.
Ivo D. Dinov
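
A minimal association-rule sketch, assuming the arules package and its bundled Groceries transactions (similar in spirit to the chapter's grocery case-study); the support and confidence thresholds are illustrative:

  library(arules)   # provides apriori() and the Groceries example transactions

  data("Groceries")
  # Mine rules that meet minimum support and confidence thresholds
  rules <- apriori(Groceries,
                   parameter = list(support = 0.01, confidence = 0.5))
  inspect(head(sort(rules, by = "lift"), 3))   # top three rules by lift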

Chapter 13. k-Means Clustering

Abstract
As we learned in Chaps. 7, 8, and 9, classification could help us make predictions on new observations. However, classification requires (human supervised) predefined label classes. What if we are in the early phases of a study and/or don’t have the required resources to manually define, derive, or generate these class labels? Clustering can help us explore the dataset and separate cases into groups representing similar traits or characteristics. Each group could be a potential candidate for a class. Clustering is used for exploratory data analytics, i.e., as unsupervised learning, rather than for confirmatory analytics, or for predicting specific outcomes.
Ivo D. Dinov
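
A minimal k-means sketch on the (unlabeled) iris measurements, an illustrative stand-in for the chapter's case-studies:

  set.seed(7)
  km <- kmeans(scale(iris[, 1:4]), centers = 3, nstart = 25)

  km$centers                                            # cluster centroids
  table(cluster = km$cluster, species = iris$Species)   # compare to the withheld labels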

Chapter 14. Model Performance Assessment

Abstract
In previous chapters, we used prediction accuracy to evaluate classification models. However, having accurate predictions in one dataset does not necessarily imply that the model is perfect or that it will reproduce when tested on external data. We need additional metrics to evaluate the model performance and to make sure it is robust, reproducible, reliable, and unbiased.
Ivo D. Dinov
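
For example, a confusion matrix summarizes much more than raw accuracy; a minimal sketch assuming the caret package and hypothetical predicted/true labels:

  library(caret)   # provides confusionMatrix()

  truth <- factor(c("case", "case", "control", "control", "case", "control"))
  pred  <- factor(c("case", "control", "control", "control", "case", "case"))

  # Accuracy plus sensitivity, specificity, kappa, and related metrics
  confusionMatrix(data = pred, reference = truth, positive = "case")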

Chapter 15. Improving Model Performance

Abstract
We already explored several alternative machine learning (ML) methods for prediction, classification, clustering, and outcome forecasting. In many situations, we derive models by estimating model coefficients or parameters. The main question now is: How can we adopt the advantages of crowdsourcing and biosocial networking to aggregate different predictive analytics strategies? Are there reasons to believe that such ensembles of forecasting methods may actually improve the performance (e.g., increase prediction accuracy) of the resulting consensus meta-algorithm? In this chapter, we are going to introduce ways that we can search for optimal parameters for a single ML method, as well as aggregate different methods into ensembles to enhance their collective performance relative to any of the individual methods that form the meta-aggregate.
Ivo D. Dinov
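
A minimal parameter-tuning sketch, assuming the caret package and the iris data; the tuning grid is an illustrative choice:

  library(caret)

  set.seed(123)
  ctrl <- trainControl(method = "cv", number = 5)

  # Search over the kNN tuning parameter k with 5-fold cross-validation
  fit <- train(Species ~ ., data = iris, method = "knn",
               trControl = ctrl, tuneGrid = data.frame(k = c(3, 5, 7, 9)))
  fit$bestTune   # the k with the best cross-validated accuracy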

Chapter 16. Specialized Machine Learning Topics

Abstract
This chapter presents some technical details about data formats, streaming, optimization of computation, and distributed deployment of optimized learning algorithms. Chapter 22 provides additional optimization details. We show format conversion and working with XML, SQL, JSON, CSV, SAS and other data objects. In addition, we illustrate SQL server queries, describe protocols for managing, classifying and predicting outcomes from data streams, and demonstrate strategies for optimization, improvement of computational performance, parallel (MPI) and graphics (GPU) computing.
Ivo D. Dinov
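
As one small example of format conversion, a JSON round-trip assuming the jsonlite package (one of several R options for JSON input/output):

  library(jsonlite)

  json_str <- toJSON(head(iris, 2), pretty = TRUE)   # data frame -> JSON text
  cat(json_str)
  df_back <- fromJSON(json_str)                      # JSON text -> data frame
  str(df_back)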

Chapter 17. Variable/Feature Selection

Abstract
As we mentioned in Chap. 16, variable selection is very important when dealing with bioinformatics, healthcare, and biomedical data, where we may have more features than observations. Variable selection, or feature selection, can help us focus only on the core important information contained in the observations, instead of every piece of information. Due to the presence of intrinsic and extrinsic noise, the volume and complexity of big health data, and different methodological and technological challenges, this process of identifying the salient features may resemble finding a needle in a haystack. Here, we will illustrate alternative strategies for feature selection using filtering (e.g., correlation-based feature selection), wrapping (e.g., recursive feature elimination), and embedding (e.g., variable importance via random forest classification) techniques.
Ivo D. Dinov
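
A minimal sketch of the filtering and embedding ideas, assuming the randomForest package and the built-in iris data:

  library(randomForest)

  # Filtering: inspect pairwise correlations and drop one of any highly correlated pair
  round(cor(iris[, 1:4]), 2)

  # Embedding: variable importance from a random forest classifier
  set.seed(5)
  rf <- randomForest(Species ~ ., data = iris, importance = TRUE)
  importance(rf)    # per-feature importance scores
  varImpPlot(rf)    # visual ranking of the features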

Chapter 18. Regularized Linear Modeling and Controlled Variable Selection

Abstract
Many biomedical and biosocial studies involve large amounts of complex data, including cases where the number of features (k) is large and may exceed the number of cases (n). In such situations, parameter estimates are difficult to compute or may be unreliable as the system is underdetermined. Regularization provides one approach to improve model reliability, prediction accuracy, and result interpretability. It is based on augmenting the primary fidelity term of the objective function used in the model-fitting process with a dual regularization term that provides restrictions on the parameter space.
Ivo D. Dinov
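
A minimal LASSO sketch in a simulated underdetermined setting (k > n), assuming the glmnet package; here the least-squares fidelity term is augmented with an L1 penalty that shrinks most coefficients to zero:

  library(glmnet)

  set.seed(99)
  n <- 50; k <- 200                              # more features than cases
  X <- matrix(rnorm(n * k), nrow = n)
  beta_true <- c(3, -2, 1.5, rep(0, k - 3))      # only three informative features
  y <- as.vector(X %*% beta_true) + rnorm(n)

  cvfit <- cv.glmnet(X, y, alpha = 1)            # alpha = 1 selects the LASSO penalty
  coef(cvfit, s = "lambda.min")[1:6, ]           # shrunken/selected coefficients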

Chapter 19. Big Longitudinal Data Analysis

Abstract
The time-varying (longitudinal) characteristics of large information flows represent a special case of the complexity and the dynamic multi-scale nature of big biomedical data that we discussed in the DSPA Motivation section. Previously, in Chap. 4, we saw space-time (4D) functional magnetic resonance imaging (fMRI) data, and in Chap. 16 we discussed streaming data, which also has a natural temporal dimension. Now we will go deeper into managing, modeling and analyzing big longitudinal data.
Ivo D. Dinov

Chapter 20. Natural Language Processing/Text Mining

Abstract
As we have seen in the previous chapters, traditional statistical analyses and classical data modeling are applied to relational data where the observed information is represented by tables, vectors, arrays, tensors, or data-frames containing binary, categorical, ordinal, or numerical values. Such representations provide incredible advantages (e.g., quick reference and de-reference of elements, search, discovery, and navigation), but also limit the scope of applications. Relational data objects are quite effective for managing information that is based only on existing attributes. However, when data science inference needs to utilize attributes that are not included in the relational model, alternative non-relational representations are necessary. For instance, imagine that our data object includes a free-text feature (e.g., physician/nurse clinical notes, biospecimen samples) that contains information about medical condition, treatment or outcome. It is very difficult, or sometimes even impossible, to include the raw text in automated data analytics using the classical procedures and statistical models available for relational datasets.
Ivo D. Dinov
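
A minimal text-to-matrix sketch, assuming the tm package and two hypothetical (made-up) clinical notes:

  library(tm)

  notes <- c("Patient reports gait imbalance and mild tremor.",
             "No tremor observed; balance within normal limits.")

  corpus <- VCorpus(VectorSource(notes))
  corpus <- tm_map(corpus, content_transformer(tolower))
  corpus <- tm_map(corpus, removePunctuation)
  corpus <- tm_map(corpus, removeWords, stopwords("english"))

  # Represent the unstructured text as a computable document-term matrix
  dtm <- DocumentTermMatrix(corpus)
  inspect(dtm)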

Chapter 21. Prediction and Internal Statistical Cross Validation

Abstract
Cross-validation is a statistical approach for validating predictive methods, classification models, and clustering techniques. It assesses the reliability and stability of the results of the corresponding statistical analyses (e.g., predictions, classifications, forecasts) based on independent datasets. For prediction of trend, association, clustering, and classification, a model is usually trained on one dataset (training data) and subsequently tested on new data (testing or validation data). Statistical internal cross-validation uses iterative bootstrapping to define test datasets, evaluates the model predictive performance, and assesses its power to avoid overfitting. Overfitting is the process of computing a predictive or classification model that describes random error, i.e., fits to the noise components of the observations, instead of the actual underlying relationships and salient features in the data.
Ivo D. Dinov
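
The mechanics of k-fold cross-validation can be sketched in a few lines of base R, here with the built-in mtcars data and a simple linear model (an illustration only):

  set.seed(321)
  k <- 5
  folds <- sample(rep(1:k, length.out = nrow(mtcars)))
  cv_err <- numeric(k)

  for (i in 1:k) {
    train <- mtcars[folds != i, ]
    test  <- mtcars[folds == i, ]
    fit   <- lm(mpg ~ wt + hp, data = train)     # train on k - 1 folds
    pred  <- predict(fit, newdata = test)        # predict on the held-out fold
    cv_err[i] <- mean((test$mpg - pred)^2)       # out-of-sample mean squared error
  }
  mean(cv_err)   # cross-validated estimate of prediction error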

Chapter 22. Function Optimization

Abstract
Most data-driven scientific inference, qualitative, quantitative, and visual analytics involve formulating, understanding the behavior of, and optimizing objective (cost) functions. Presenting the mathematical foundations of representation and interrogation of diverse spectra of objective functions provides mechanisms for obtaining effective solutions to complex big data problems. (Multivariate) function optimization (minimization or maximization) is the process of searching for variables x1, x2, x3, …, xn that either minimize or maximize the multivariate cost function f(x1, x2, x3, …, xn). In this chapter, we will specifically discuss (1) constrained and unconstrained optimization; (2) Lagrange multipliers; (3) linear, quadratic and (general) non-linear programming; and (4) data denoising.
Ivo D. Dinov
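
A minimal sketch of unconstrained and box-constrained minimization with base-R optim(); the cost function below is a made-up example:

  # A simple bivariate cost function f(x1, x2)
  f <- function(x) (x[1] - 1)^2 + 3 * (x[2] + 2)^2

  opt <- optim(par = c(0, 0), fn = f, method = "BFGS")
  opt$par     # unconstrained minimizer, close to (1, -2)

  # The same problem under the box constraints x1 >= 0, x2 >= 0
  copt <- optim(par = c(0.5, 0.5), fn = f, method = "L-BFGS-B", lower = c(0, 0))
  copt$par    # constrained solution, approximately (1, 0)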

Chapter 23. Deep Learning, Neural Networks

Abstract
Deep learning is a special branch of machine learning using a collage of algorithms to model high-level data motifs. Deep learning resembles the biological communications of systems of brain neurons in the central nervous system (CNS), where synthetic graphs represent the CNS network as nodes/states and connections/edges between them. For instance, in a simple synthetic network consisting of a pair of connected nodes, an output sent by one node is received by the other as an input signal. When more nodes are present in the network, they may be arranged in multiple levels (like a multiscale object) where the ith layer output serves as the input of the next (i + 1)st layer. The signal is manipulated at each layer, sent as a layer output downstream, interpreted as an input to the next, (i + 1)st layer, and so forth. Deep learning relies on multiple layers of nodes and many edges linking the nodes, forming input/output (I/O) layered grids that represent a multiscale processing network. At each layer, linear and non-linear transformations convert inputs into outputs.
Ivo D. Dinov
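
As a toy illustration of stacking hidden layers, a two-hidden-layer network learning XOR, assuming the neuralnet package (real deep learning applications use much larger networks and specialized frameworks):

  library(neuralnet)

  set.seed(7)
  xor_df <- data.frame(a = c(0, 0, 1, 1), b = c(0, 1, 0, 1))
  xor_df$y <- c(0, 1, 1, 0)                     # the XOR target

  # Two hidden layers (8 and 4 nodes): each layer's output feeds the next layer
  net <- neuralnet(y ~ a + b, data = xor_df, hidden = c(8, 4),
                   linear.output = FALSE, stepmax = 1e6)
  round(compute(net, xor_df[, c("a", "b")])$net.result, 3)   # approximates XOR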

Backmatter
