
## About This Book

The unique features of this book include the following.

1. This book is the third volume of a three-volume series of cookbooks entitled "Machine Learning in Medicine - Cookbooks One, Two, and Three". No other self-assessment works covering the field of machine learning have been published for the medical and health care community to date.

2. Each chapter of the book can be studied without the need to consult other chapters and can, for the readership's convenience, be downloaded from the internet. Self-assessment examples are available at extras.springer.com.

3. An adequate command of machine learning methodologies is a requirement for physicians and other health workers, particularly now that the amount of medical computer data doubles every 20 months; soon it will be impossible for them to make proper data-based health decisions without the help of machine learning.

4. Given the importance of machine learning knowledge in the medical and health care community, and the current lack of it, the intended readership comprises physicians and health workers of all kinds.

5. The book is written in simple language in order to enhance readability, not only for advanced readers but also for novices.

6. The book is multipurpose: an introduction for the uninitiated, a primer for the inexperienced, and a self-assessment handbook for the advanced.

7. The book was written particularly for busy physicians and other health care professionals who lack the time to read the entire series of three textbooks.

8. Like the other two cookbooks, it contains technical descriptions and self-assessment examples of 20 important computer methodologies for medical data analysis, and it largely skips the theoretical and mathematical background.

9. Information on the theoretical and mathematical background of the methods described is given in a "notes" section at the end of each chapter.

10. Unlike traditional statistical methods, machine learning methodologies are able to analyze big data, including thousands of cases and hundreds of variables.

11. The medical and health care community is little aware of the multidimensional nature of current medical data files. Experimental clinical studies are not helpful in this respect either, because such studies usually assume that subgroup characteristics are unimportant as long as the study is randomized. This is, of course, untrue, because any subgroup characteristic may be vital to an individual at risk.

12. To date, except for a three-volume introductory series on the subject entitled "Machine Learning in Medicine Part One, Two, and Three, 2013, Springer Heidelberg Germany" from the same authors, and the current cookbook series, no books on machine learning in medicine have been published.

13. Another unique feature of the cookbooks is that they were jointly written by two authors from different disciplines: a clinician/clinical pharmacologist and a mathematician/biostatistician.

14. The authors have also jointly been teaching at universities and institutions throughout Europe and the USA for the past 20 years.

15. The authors have managed to cover the field of medical data analysis in a nonmathematical way for the benefit of medical and health workers.

16. The authors have already successfully published many statistics textbooks and self-assessment books, e.g., the 67-chapter textbook entitled "Statistics Applied to Clinical Studies 5th Edition, 2012, Springer Heidelberg Germany", which has been downloaded 62,826 times.

17. In addition to SPSS statistical software, the current cookbook makes use of various free calculators from the internet, as well as the Konstanz Information Miner (Knime), a widely used free machine learning package, and the free Weka data mining package from New Zealand.

18. These software packages, with hundreds of nodes (the basic processing units, covering virtually all statistical and data mining methods), can be used not only for data analysis but also for appropriate data storage.

19. The current cookbook shows, particularly for those with little affinity for tables of values, that data mining in the form of a visualization process is entirely feasible, and often more revealing than traditional statistics.

20. The Knime and Weka data miners use widely available Excel data files.

21. In current clinical research, prospective cohort studies are increasingly replacing costly controlled clinical trials, and modern machine learning methodologies such as probit and tobit regression, as well as neural networks, Bayesian networks, and support vector machines, prove to fit their analysis better than traditional statistical methods do.

22. The current cookbook includes concise descriptions not only of standard machine learning methods, but also of more recent methods, such as linear machine learning models using ordinal and loglinear regression.

23. Machine learning increasingly uses evolutionary operation methodologies; this subject is also covered.

24. All of the methods described have been applied in the authors' own research prior to this publication.

## Table of Contents

### Chapter 1. Data Mining for Visualization of Health Processes (150 Patients with Pneumonia)

Abstract
Computer files of clinical data are often complex and multidimensional, and they are frequently hard to test statistically. Instead, visualization processes can be used successfully as an alternative to traditional statistical data analysis.
Ton J. Cleophas, Aeilko H. Zwinderman

### Chapter 2. Training Decision Trees for a More Meaningful Accuracy (150 Patients with Pneumonia)

Abstract
Traditionally, decision trees are used for finding the best predictors of health risks and improvements (Chap. 16 in Machine Learning in Medicine Cookbook One, pp. 97–104, Decision trees for decision analysis, Springer Heidelberg Germany, 2014, from the same authors). However, this method is not entirely appropriate, because a decision tree is built from a data file, and, subsequently, the same data file is applied once more for computing the health risk probabilities from the built tree. Obviously, the accuracy must be close to 100 %, because the test sample is 100 % identical to the sample used for building the tree, and this accuracy therefore does not mean much. With neural networks, this problem of duplicate usage of the same data is solved by randomly splitting the data into two samples, a training sample and a test sample (Chap. 12 in Machine Learning in Medicine Part One, pp. 145–156, Artificial intelligence, multilayer perceptron modeling, Springer Heidelberg Germany, 2013, from the same authors). The current chapter assesses whether this splitting methodology, otherwise called partitioning, is also feasible for decision trees, and what level of accuracy it yields.
Ton J. Cleophas, Aeilko H. Zwinderman
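As a minimal sketch of the partitioning idea described in the abstract, assuming Python with scikit-learn and synthetic data in place of the SPSS/Knime workflow used in the book:

```python
# Hedged sketch: build the decision tree on a training sample and score it on a
# held-out test sample, so accuracy reflects unseen data. The 150 "patients"
# below are synthetic, not the chapter's pneumonia data.
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier
from sklearn.metrics import accuracy_score

rng = np.random.default_rng(1)
X = rng.normal(size=(150, 3))  # 150 hypothetical patients, 3 predictors
y = (X[:, 0] + rng.normal(scale=0.5, size=150) > 0).astype(int)  # noisy binary outcome

# Random split into a training sample (70 %) and a test sample (30 %)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, random_state=1)

tree = DecisionTreeClassifier(max_depth=3, random_state=1)
tree.fit(X_train, y_train)

# Accuracy on the held-out sample is the more meaningful figure
train_acc = accuracy_score(y_train, tree.predict(X_train))
test_acc = accuracy_score(y_test, tree.predict(X_test))
```

The test accuracy is typically somewhat lower than the training accuracy, which is exactly the point: it estimates how the tree would perform on patients it has not seen.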

### Chapter 3. Variance Components for Assessing the Magnitude of Random Effects (40 Patients with Paroxysmal Tachycardias)

Abstract
If we have reason to believe that, in a study, certain patients will respond differently from others due to co-morbidity, co-medication, and other factors, then the spread in the data is caused not only by residual error but also by some subgroup property, otherwise called a random effect. Variance components analysis is able to assess the magnitude of random effects as compared to that of the residual error of a study.
Ton J. Cleophas, Aeilko H. Zwinderman

### Chapter 4. Ordinal Scaling for Clinical Scores with Inconsistent Intervals (900 Patients with Different Levels of Health)

Abstract
Clinical studies often have categories as outcome, like various levels of health or disease. Multinomial regression is suitable for their analysis (see Chap. 4, Machine Learning in Medicine Cookbook Two, Polynomial regression for outcome categories, pp. 23–25, Springer Heidelberg Germany, 2014, from the same authors). However, if one or two outcome categories in a study are severely underrepresented, polynomial regression is flawed, and ordinal regression including specific link functions may provide a better fit for the data.
Ton J. Cleophas, Aeilko H. Zwinderman

### Chapter 5. Loglinear Models for Assessing Incident Rates with Varying Incident Risks (12 Populations with Different Drink Consumption Patterns)

Abstract
Data files that assess the effect of various predictors on frequency counts of morbidities/mortalities can be classified into multiple cells with varying incident risks (like, e.g., the incident risk of infarction). The table below gives an example.
Ton J. Cleophas, Aeilko H. Zwinderman

### Chapter 6. Loglinear Modeling for Outcome Categories (Quality of Life Scores in 445 Patients with Different Personal Characteristics)

Abstract
Multinomial regression is adequate for identifying the main predictors of certain outcome categories, like different levels of injury or quality of life (QOL) (see Machine Learning in Medicine Cookbook Two, Chap. 4, pp. 23–25, Polynomial regression for outcome categories, 2014, Springer Heidelberg Germany, from the same authors). An alternative approach is logit loglinear modeling. The latter method does not use continuous predictors on a case by case basis, but rather the weighted means of these predictors. This approach may allow for relevant additional conclusions from your data.
Ton J. Cleophas, Aeilko H. Zwinderman

### Chapter 7. Heterogeneity in Clinical Research: Mechanisms Responsible

Abstract
In clinical research, similar studies often have different results. This may be due to differences in patient characteristics and trial quality characteristics, such as the use of blinding, randomization, and placebo controls. This chapter assesses whether 3-dimensional scatter plots and regression analyses, with the treatment results as outcome and the predictors of heterogeneity as exposure, are able to identify the mechanisms responsible.
Ton J. Cleophas, Aeilko H. Zwinderman

### Chapter 8. Performance Evaluation of Novel Diagnostic Tests (650 and 588 Patients with Peripheral Vascular Disease)

Abstract
Both logistic regression and c-statistics can be used to evaluate the performance of novel diagnostic tests. This chapter assesses whether one method can outperform the other.
Ton J. Cleophas, Aeilko H. Zwinderman
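A sketch of the two performance measures compared in this chapter, assuming Python with scikit-learn and hypothetical data rather than the chapter's own vascular disease files (the c-statistic is the area under the ROC curve):

```python
# Hedged sketch: evaluate a hypothetical diagnostic test both by logistic
# regression (log odds ratio of disease per unit of test result) and by the
# c-statistic (probability that a diseased patient scores higher than a
# healthy one).
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import roc_auc_score

rng = np.random.default_rng(2)
n = 200
disease = rng.integers(0, 2, size=n)  # true disease status (0/1)
test_result = disease * 1.0 + rng.normal(scale=1.0, size=n)  # diagnostic test value

# c-statistic: area under the ROC curve of the test result
c_stat = roc_auc_score(disease, test_result)

# Logistic regression of disease status on the test result
model = LogisticRegression().fit(test_result.reshape(-1, 1), disease)
log_odds_ratio = model.coef_[0][0]  # positive if the test discriminates
```

A useless test gives a c-statistic near 0.5 and a log odds ratio near zero; the better the test, the closer the c-statistic gets to 1.0.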

### Chapter 9. Spectral Plots for High Sensitivity Assessment of Periodicity (6 Years’ Monthly C Reactive Protein Levels)

Abstract
In clinical research, time series often show many peaks and irregular spaces.
Ton J. Cleophas, Aeilko H. Zwinderman

### Chapter 10. Runs Test for Identifying Best Regression Models (21 Estimates of Quantity and Quality of Patient Care)

Abstract
R-square values are often used to test the appropriateness of diagnostic models.
Ton J. Cleophas, Aeilko H. Zwinderman

### Chapter 11. Evolutionary Operations for Process Improvement (8 Operation Room Air Condition Settings)

Abstract
Evolutionary operations (evops) try to find improved processes by exploring the effect of small changes in an experimental setting. The approach stems from evolutionary algorithms (see Machine Learning in Medicine Part Three, Chap. 2, Evolutionary Operations, pp. 11–18, Springer Heidelberg Germany, 2013, from the same authors), which use rules based on biological evolution mechanisms, where each next generation is slightly different and generally somewhat improved as compared to its ancestors. Evops are widely used not only in genetic research, but also in chemical and technical processes, so much so that the internet nowadays offers free evop calculators suitable not only for the optimization of the above processes, but also for the optimization of your pet's food, your car costs, and many other standard daily life issues. This chapter assesses how evops can help optimize the air quality of operation rooms.
Ton J. Cleophas, Aeilko H. Zwinderman

### Chapter 12. Bayesian Networks for Cause Effect Modeling (600 Patients Assessed for Longevity Factors)

Abstract
Bayesian networks are probabilistic graphical models using nodes and arrows, representing, respectively, variables and probabilistic dependencies between two variables. Computations in a Bayesian network are performed using weighted likelihood methodology and marginalization, meaning that irrelevant variables are integrated or summed out. Additional theoretical information is given in Machine Learning in Medicine Part Two, Chap. 16, Bayesian networks, pp. 163–170, Springer Heidelberg Germany, 2013, from the same authors. This chapter assesses whether Bayesian networks are able to determine direct and indirect predictors of binary outcomes, like morbidity/mortality outcomes.
Ton J. Cleophas, Aeilko H. Zwinderman

### Chapter 13. Support Vector Machines for Imperfect Nonlinear Data (200 Patients with Sepsis)

Abstract
The basic aim of support vector machines is to construct the best-fit separation line (or, with three-dimensional data, separation plane), separating cases and controls as well as possible. Discriminant analysis, classification trees, and neural networks (see Machine Learning in Medicine Part One, Chap. 17, Discriminant analysis for supervised data, pp. 215–224, and Artificial Intelligence, Chaps. 12 and 13, pp. 145–165, 2013, and Machine Learning in Medicine Part Three, Chap. 14, Decision Trees, pp. 137–150, 2013, Springer Heidelberg Germany, by the same authors as the current chapter) are alternative methods for the purpose, but support vector machines are generally more stable and sensitive, although heuristic studies indicating when they perform better are missing. Support vector machines are also often used in automatic modeling, which computes the ensembled results of several best-fit models (see Machine Learning in Medicine Cookbook Two, Chaps. 18 and 19, Automatic modeling of drug efficacy prediction, and Automatic modeling for clinical event prediction, pp. 99–111, 2014, Springer Heidelberg Germany, from the same authors). This chapter uses the Konstanz Information Miner, a free data mining software package developed at the University of Konstanz, also used in Chaps. 1 and 2.
Ton J. Cleophas, Aeilko H. Zwinderman
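To illustrate the idea of a best-fit separation boundary for imperfect (overlapping, nonlinear) data, a minimal sketch assuming Python with scikit-learn and synthetic data, in place of the Knime workflow used in the chapter:

```python
# Hedged sketch: fit a support vector machine to data whose two classes are
# separated by a nonlinear (circular) boundary. The data are synthetic, not
# the chapter's sepsis data.
import numpy as np
from sklearn.svm import SVC
from sklearn.metrics import accuracy_score

rng = np.random.default_rng(4)
n = 200
X = rng.normal(size=(n, 2))  # 200 hypothetical patients, 2 predictors
y = (X[:, 0] ** 2 + X[:, 1] ** 2 > 1.5).astype(int)  # circular class boundary

# The RBF kernel handles the nonlinear separation; C trades off the width of
# the margin against misclassified training points
svm = SVC(kernel="rbf", C=1.0)
svm.fit(X, y)
acc = accuracy_score(y, svm.predict(X))
```

A linear kernel would fail on this boundary; the kernel choice is what lets the machine fit "imperfect nonlinear data".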

### Chapter 14. Multiple Response Sets for Visualizing Clinical Data Trends (811 Patient Visits to General Practitioners)

Abstract
Multiple response methodology answers multiple qualitative questions about a single group of patients, using summary tables for the purpose. The method visualizes trends and similarities in the data, but it provides no statistical test.
Ton J. Cleophas, Aeilko H. Zwinderman

### Chapter 15. Protein and DNA Sequence Mining

Abstract
Sequence similarity searching is a method that can be applied by almost anybody for finding similarities between his/her query sequences of amino acids or DNA and the sequences known to be associated with different clinical effects. The latter have been included in database systems like the Basic Local Alignment Search Tool (BLAST) database system from the US National Center for Biotechnology Information (NCBI) and the MOTIF database system, a joint website of different European and American institutions, and they are available through the internet for the benefit of individual researchers trying to find a match for novel sequences from their own research. This chapter demonstrates how the method is applied in practice.
Ton J. Cleophas, Aeilko H. Zwinderman

### Chapter 16. Iteration Methods for Crossvalidations (150 Patients with Pneumonia)

Abstract
In Chap. 2 of this volume, validation of a decision tree model is performed by splitting a data file into a training and a testing sample. This method performed pretty well, with a sensitivity of 90–100 % and an overall accuracy of 94 %. However, measures of error of predictive models like the above are based on residual methods, assuming a priori defined data distributions, particularly normal distributions. Machine learning data files may not meet such assumptions, and distribution-free methods of validation, like crossvalidations, may be safer.
Ton J. Cleophas, Aeilko H. Zwinderman
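The iteration idea of crossvalidation can be sketched as follows, assuming Python with scikit-learn and synthetic data rather than the chapter's own pneumonia file: the data are repeatedly re-split so that every case serves exactly once as test material.

```python
# Hedged sketch: 5-fold crossvalidation of a decision tree. Each of the 5
# iterations trains on 4 folds and tests on the remaining fold, yielding 5
# independent accuracy estimates.
import numpy as np
from sklearn.model_selection import KFold, cross_val_score
from sklearn.tree import DecisionTreeClassifier

rng = np.random.default_rng(3)
X = rng.normal(size=(150, 3))  # 150 hypothetical patients, 3 predictors
y = (X[:, 0] > 0).astype(int)  # binary outcome driven by the first predictor

tree = DecisionTreeClassifier(max_depth=3, random_state=3)
folds = KFold(n_splits=5, shuffle=True, random_state=3)

scores = cross_val_score(tree, X, y, cv=folds)  # one accuracy per fold
mean_acc = scores.mean()
```

The spread of the five fold accuracies also gives a rough impression of how stable the model is, something a single training/test split cannot show.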

### Chapter 17. Testing Parallel-Groups with Different Sample Sizes and Variances (5 Parallel-Group Studies)

Abstract
Unpaired t-tests are traditionally used for testing the significance of difference between parallel-groups according to
$$\text{t-value} = \left( \text{mean}_{1} - \text{mean}_{2} \right) / \surd \left( \text{SD}_{1}^{2}/\text{N}_{1} + \text{SD}_{2}^{2}/\text{N}_{2} \right)$$
where mean, SD, and N are, respectively, the mean, the standard deviation, and the sample size of each parallel group.
Ton J. Cleophas, Aeilko H. Zwinderman
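The formula above, with the standard deviations squared inside the square root, can be computed directly; the two group summaries below are hypothetical numbers, not data from the chapter:

```python
# Hedged sketch: the unequal-variance t-value from the formula above,
# for two parallel groups with different sample sizes and variances.
import math

def t_value(mean1, sd1, n1, mean2, sd2, n2):
    """t = (mean1 - mean2) / sqrt(SD1^2/N1 + SD2^2/N2)."""
    return (mean1 - mean2) / math.sqrt(sd1 ** 2 / n1 + sd2 ** 2 / n2)

# Hypothetical groups: one small with little spread, one large with more spread
t = t_value(mean1=10.0, sd1=2.0, n1=20, mean2=8.5, sd2=3.0, n2=35)
```

Because the two variances enter separately, the formula remains valid when the groups have unequal sample sizes and unequal spread, which is exactly the situation this chapter addresses.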

### Chapter 18. Association Rules Between Exposure and Outcome (50 and 60 Patients with Coronary Risk Factors)

Abstract
Traditional analysis of exposure-outcome relationships is only sensitive in the case of strong relationships. This chapter assesses whether association rules, based on conditional probabilities, may be more sensitive in the case of weak relationships.
Ton J. Cleophas, Aeilko H. Zwinderman
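The conditional probabilities behind an association rule "exposure → outcome" can be sketched in a few lines; the 2×2 counts below are hypothetical, not the chapter's coronary risk data:

```python
# Hedged sketch: support, confidence, and lift of the rule "exposed -> event",
# computed from a hypothetical data set of 100 patients.
records = ([("exposed", "event")] * 30 + [("exposed", "no event")] * 20
           + [("unexposed", "event")] * 10 + [("unexposed", "no event")] * 40)

n = len(records)
support = sum(r == ("exposed", "event") for r in records) / n     # P(exposed and event)
p_exposed = sum(r[0] == "exposed" for r in records) / n           # P(exposed)
p_event = sum(r[1] == "event" for r in records) / n               # P(event)

confidence = support / p_exposed   # conditional probability P(event | exposed)
lift = confidence / p_event        # > 1 means exposure raises the event risk
```

A lift above 1 flags an association even when the absolute risk difference is modest, which is why the rule-based measures can pick up weak relationships.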

### Chapter 19. Confidence Intervals for Proportions and Differences in Proportions

Abstract
Proportions, fractions, percentages, risks, and hazards are all synonymous terms indicating what part of a population experienced events such as death, illness, or complications. Instead of p-values, confidence intervals are often calculated. If you obtained many samples from the same population, 95 % of them would have their mean results within the 95 % confidence interval. Likewise, samples from the same population with proportions outside the 95 % confidence interval are significantly different from the population, with a probability of 5 % (p < 0.05). This chapter assesses how such confidence intervals can be computed.
Ton J. Cleophas, Aeilko H. Zwinderman
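A minimal sketch of one such computation, using the normal approximation for a single proportion (the event counts are hypothetical, and the chapter may use other interval methods as well):

```python
# Hedged sketch: 95 % confidence interval for a proportion via the normal
# approximation, p +/- z * sqrt(p(1-p)/n) with z = 1.96.
import math

def proportion_ci(events, n, z=1.96):
    """Return (lower, upper) bounds of the 95 % CI for events/n."""
    p = events / n
    se = math.sqrt(p * (1 - p) / n)  # standard error of the proportion
    return p - z * se, p + z * se

# Hypothetical example: 30 events among 100 patients
lower, upper = proportion_ci(events=30, n=100)
```

For 30 events in 100 patients this gives an interval of roughly 0.21 to 0.39 around the observed proportion of 0.30; a larger sample would narrow it.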

### Chapter 20. Ratio Statistics for Efficacy Analysis of New Drugs (50 Patients Treated for Hypercholesterolemia)

Abstract
Treatment efficacies are often assessed as differences from baseline. However, better treatment efficacies may be observed in patients with high baseline-values than in those with low ones. This was, e.g., the case in the Progress study, a parallel-group study of pravastatin versus placebo (see Statistics Applied to Clinical Studies Fifth Edition, Chap. 17, Logistic and Cox regression, Markov models, and Laplace transformations, pp. 199–218, Springer Heidelberg Germany, 2012, from the same authors). This chapter assesses the performance of ratio statistics for that purpose.
Ton J. Cleophas, Aeilko H. Zwinderman

### Backmatter

Further Information