2023 | Book

Predictive Analytics with KNIME

Analytics for Citizen Data Scientists

About this Book

This book is about data analytics, including problem definition, data preparation, and data analysis. A variety of techniques (regression, logistic regression, cluster analysis, neural networks, and decision trees, among others) are covered, with conceptual background as well as demonstrations of each tool in KNIME.

The book uses KNIME, which is a comprehensive, open-source software tool for analytics that does not require coding but instead uses an intuitive drag-and-drop workflow to create a network of connected nodes on an interactive canvas. KNIME workflows provide graphic representations of each step taken in analyses, making the analyses self-documenting. The graphical documentation makes it easy to reproduce analyses, as well as to communicate methods and results to others. Integration with R is also available in KNIME, and several examples using R nodes in a KNIME workflow are demonstrated for special functions and tools not explicitly included in KNIME.

Table of Contents

Frontmatter
Chapter 1. Introduction to Analytics
Abstract
This chapter introduces analytics and its growing importance. Analytics involves applying data-based models to enhance results, reduce costs, and reduce risk in both profit-making and non-profit organizations. Various surveys have indicated a significant increase in the use of analytics in organizations across multiple industries, such as sports, lodging, e-commerce, and health insurance, and sectors, such as IT, supply chain and manufacturing, healthcare, and human resources.
Three developments have spurred the use of analytics: the explosion in data volume, variety, and velocity; advancements in hardware and software; and a growing demand for data-supported decision-making. Analytics can be classified into three types based on their end objectives: descriptive analytics (examining and interpreting data to understand “what happened”), predictive analytics (predicting outcomes and behaviors), and prescriptive analytics (guiding “what should be done”).
Developing predictive analytics models follows a deliberate sequence of steps, including problem definition, data preparation, modeling and evaluation, and deployment. The revised CRISP model is used to organize the chapters covered in the book. The chapter sets the stage for the subsequent chapters that delve deeper into the analytics process and its applications in different scenarios.
This book features the open-source software KNIME, which provides state-of-the-art tools to develop models using a no-code, drag-and-drop interface. A typical reader might be in the class of users often described as “citizen data scientists,” referring to individuals outside the field of statistics and information technology who use self-service analytics tools to perform predictive or prescriptive analytics.
Frank Acito
Chapter 2. Problem Definition
Abstract
This chapter emphasizes the critical role of problem definition in the analytics process. Before collecting data or selecting analytical techniques, it is crucial to understand the business problem at hand thoroughly. Failure to do so often leads to the downfall of analytics projects. This chapter highlights the benefits of meticulously defining the problem, including improved efficiency, focused data collection, and the generation of valuable insights. Moreover, a well-defined problem sets the foundation for adding real value to an organization through analytics.
Expert perspectives on problem definition are presented, underscoring the importance of asking the right questions and framing the problem appropriately. Various quotes from industry professionals emphasize the significance of problem definition in differentiating between successful and mediocre data science endeavors.
The chapter provides a telecom-related case study on predicting customer churn to illustrate the benefits of careful problem definition. Through detailed questioning and understanding of customer behavior, the root cause of churn is identified, leading to a more effective solution that addresses the underlying issue.
Defining the analytics problem is systematically broken down into tasks, such as determining business objectives, translating them into measurable metrics, identifying stakeholders, developing a comprehensive project plan, and carefully framing the problem. The chapter advocates for structured problem definition and introduces techniques like right-to-left thinking, reversing the problem, asking “whys,” and challenging assumptions to arrive at a clear, well-rounded problem statement.
The chapter concludes by emphasizing the significance of investing time and effort in problem definition to increase the likelihood of successful analytics projects with meaningful results. It highlights the importance of various tools and strategies that aid in problem framing and exploration, ensuring that analytics efforts align with the organization’s goals and drive impactful outcomes.
Frank Acito
Chapter 3. Introduction to KNIME
Abstract
This chapter introduces the KNIME analytics and data mining tool, a comprehensive platform that offers an intuitive drag-and-drop workflow canvas for data analysis. KNIME serves both professional data analysts and beginners with its user-friendly interface, making it an excellent choice for low- or no-code predictive analytics and data mining tasks. The chapter covers various aspects of KNIME, starting with its features, which include a vast array of nodes for data connections, transformations, machine learning, and visualization. KNIME is extensible and can run R or Python scripts to enhance its capabilities, and it also integrates features from other analytic platforms like H2O and WEKA.
The chapter explains the KNIME Workbench, which is the main interface for creating workflows. It includes components like KNIME Explorer, Workflow Coach, Node Repository, Workflow Editor, Outline, and Console. The Workbench allows users to construct and visualize their analyses step-by-step.
The chapter provides information about various learning resources, including courses, documentation, and videos that help users learn KNIME. Users can access free self-paced courses covering different levels of expertise, enabling them to become proficient in using KNIME for various data analysis tasks.
Additionally, the chapter demonstrates how to use flow variables to pass information between nodes and how to use loops to iterate over values in a workflow. The chapter introduces the concepts of Metanodes and Components to organize and simplify complex workflows, making them more manageable and self-contained.
Overall, the chapter serves as an informative and practical introduction to KNIME, highlighting its key features, resources for learning, and essential tools for workflow organization and analysis. Readers are encouraged to install KNIME and explore its capabilities through hands-on practice to gain proficiency in this powerful data analytics tool.
Frank Acito
Chapter 4. Data Preparation
Abstract
This chapter focuses on data preparation, a crucial step in the analytics process to ensure that the data used for modeling is of the highest quality. The chapter covers various aspects of data preparation, including obtaining the needed data, data cleaning, handling missing values, detecting and dealing with outliers, and feature engineering.
The chapter starts by highlighting that a company’s data warehouse may not always provide the required data in the correct form, and data must be assembled, cleaned, and tailored to the analytics problem. It emphasizes the importance of having the right type of data, such as predictors and outcome variables for predictive modeling, and the need for external data integration from various sources.
Data cleaning is discussed in detail, acknowledging that datasets may contain inaccurate, incomplete, or inconsistent values, among other issues. The chapter lists several activities for data cleaning, such as removing duplicate records, dealing with missing values, identifying and handling outliers, and ensuring consistency in formats and units of measurement.
The presence of missing values is recognized as one of the most challenging problems in analytics. The chapter distinguishes the types of missing values (missing completely at random, MCAR; missing at random, MAR; and missing not at random, MNAR) and outlines various techniques for handling missing data, such as listwise deletion, imputation methods, and indicator variables for missingness.
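The book carries out these steps with KNIME nodes; purely as a language-based illustration, a minimal pandas sketch of two of the techniques just mentioned, mean imputation combined with an explicit missingness indicator (the column names are hypothetical), might look like this:
    # Minimal sketch, not from the book: flag missingness, then impute the mean.
    import pandas as pd

    df = pd.DataFrame({"income": [52000, None, 61000, 48500, None],
                       "age": [34, 41, 29, 55, 38]})

    # Record which rows were originally missing before filling anything in.
    df["income_missing"] = df["income"].isna().astype(int)

    # Mean imputation; median or model-based imputation are common alternatives.
    df["income"] = df["income"].fillna(df["income"].mean())
    print(df)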
Outliers are discussed in the context of both univariate and multivariate methods for detection. The chapter emphasizes that outliers should not be routinely removed without careful consideration and domain knowledge. Instead, different techniques for handling outliers, such as replacing them with missing value indicators or Winsorizing, are provided.
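As an illustration of Winsorizing (again outside KNIME, with made-up data and thresholds), extreme values can be clipped back to chosen percentiles rather than deleted:
    # Winsorizing sketch with illustrative 1st/99th percentile limits.
    import numpy as np

    rng = np.random.default_rng(0)
    x = np.append(rng.normal(100, 10, 500), [900, -400])   # add two extreme outliers

    lo, hi = np.percentile(x, [1, 99])
    x_winsorized = np.clip(x, lo, hi)                       # pull outliers in to the limits
    print(x.min(), x.max(), "->", x_winsorized.min(), x_winsorized.max())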
Feature engineering is introduced to improve predictive model performance by creating new variables or transforming existing ones. The chapter presents examples of feature engineering, including creating new variables from existing ones, adding polynomial terms, and using transformations to reduce skewness.
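A short sketch of two of the transformations named above, a log transform to reduce right skew and a simple polynomial term (the column names and values are hypothetical, not the book's examples):
    # Feature engineering sketch: log transform and a squared (polynomial) term.
    import numpy as np
    import pandas as pd

    df = pd.DataFrame({"sale_price": [120000, 250000, 1200000],
                       "lot_area": [6000, 8500, 21000]})

    df["log_sale_price"] = np.log(df["sale_price"])   # compresses the long right tail
    df["lot_area_sq"] = df["lot_area"] ** 2           # simple polynomial feature
    print(df)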
Throughout the chapter, the KNIME data analytics platform is demonstrated to carry out various data preparation tasks, such as missing value handling, outlier detection, and transformations.
In conclusion, this chapter emphasizes the significance of data preparation in the analytics process and provides practical guidance and examples for data cleaning, missing value handling, outlier detection, and feature engineering using KNIME.
Frank Acito
Chapter 5. Dimensionality Reduction
Abstract
In business analytics and predictive modeling, data sets often contain hundreds or even thousands of predictor variables, which can create challenges in terms of both efficiency and effectiveness. This chapter explores the problems associated with large numbers of variables and delves into various approaches for dimension reduction to address these issues.
The “curse of dimensionality” refers to the exponential increase in the number of observations needed to maintain predictive model accuracy as the number of predictors increases. Moreover, including irrelevant or redundant variables can reduce the performance of predictive models. Surprisingly, even too many relevant variables can diminish overall accuracy. Having an excessive number of variables also introduces various undesirable effects. Computer processing time increases, and predictive models become more complex and challenging to maintain. Redundant variables can cause instability in the model, and variables unrelated to the target, such as customer ID numbers, or variables that raise regulatory concerns, should be removed.
To mitigate these challenges, three general approaches to dimension reduction are discussed: manually removing variables based on specific criteria, using algorithms to select the most predictive variables, and employing principal component analysis (PCA) to create linear combinations of original variables.
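The book demonstrates PCA with KNIME nodes; a minimal scikit-learn sketch of the same idea, creating a few linear combinations of standardized predictors from simulated data, might look like this:
    # PCA sketch: standardize, then keep a few linear combinations of the predictors.
    import numpy as np
    from sklearn.decomposition import PCA
    from sklearn.preprocessing import StandardScaler

    rng = np.random.default_rng(1)
    X = rng.normal(size=(200, 10))                        # 200 rows, 10 predictors
    X[:, 1] = X[:, 0] + rng.normal(scale=0.1, size=200)   # make two columns redundant

    Z = StandardScaler().fit_transform(X)                 # PCA is scale-sensitive
    pca = PCA(n_components=3).fit(Z)
    scores = pca.transform(Z)                             # the reduced predictor set
    print(pca.explained_variance_ratio_)                  # variance captured per component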
The chapter emphasizes the importance of carefully considering which variables to retain and which to exclude to balance predictive power and model complexity. It concludes by acknowledging the trade-offs involved in dimension reduction and the need for thoughtful analysis when dealing with large numbers of predictor variables in applied situations.
Frank Acito
Chapter 6. Ordinary Least Squares Regression
Abstract
This chapter discusses least squares regression, one of the most widely used analytics tools for building predictive models. The chapter begins by highlighting the reasons for the popularity of regression, including its logical, linear nature and its ease of programming. It emphasizes the flexibility of regression, as it can be applied to various types of problems, even those that may not initially seem suitable for linear regression.
The section on multiple regression explores how to deal with various types of variables encountered in regression applications, such as nominal, ordinal, and continuous variables. It explains how to code nominal and ordinal variables using indicator variables for meaningful inclusion in regression.
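For example, indicator (dummy) coding of a nominal predictor can be sketched as follows (illustrative data, not the book's; the book builds the equivalent step with KNIME nodes):
    # Dummy coding a nominal variable; drop_first avoids redundancy with the intercept.
    import pandas as pd

    df = pd.DataFrame({"heating": ["gas", "electric", "gas", "solar"]})
    dummies = pd.get_dummies(df["heating"], prefix="heating", drop_first=True)
    print(dummies)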
The chapter then delves into handling nonlinearity in regression, discussing how to detect and address it using polynomial models or variable transformations.
Next, the chapter focuses on evaluating the predictive accuracy of regression models using metrics such as Mean Absolute Error (MAE), Root Mean Square Error (RMSE), and Mean Absolute Percentage Error (MAPE).
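These three metrics follow directly from their definitions; a short numpy sketch with made-up values:
    # MAE, RMSE, and MAPE computed from their definitions (illustrative numbers).
    import numpy as np

    y_true = np.array([200.0, 150.0, 320.0, 275.0])
    y_pred = np.array([190.0, 165.0, 300.0, 260.0])

    mae = np.mean(np.abs(y_true - y_pred))
    rmse = np.sqrt(np.mean((y_true - y_pred) ** 2))
    mape = np.mean(np.abs((y_true - y_pred) / y_true)) * 100   # expressed as a percentage
    print(mae, rmse, mape)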
To illustrate the concepts discussed, KNIME is used with a subset of the Ames, Iowa, housing price data. It demonstrates the use of ordinary least squares regression, stepwise regression, and LASSO (L1 regularization) for predictive modeling and compares their prediction accuracy on a test set.
Frank Acito
Chapter 7. Logistic Regression
Abstract
This chapter covers logistic regression, which is a widely used method in analytics projects for predicting binary outcomes. The chapter begins by explaining the difference between ordinary linear and logistic regression when dealing with binary outcomes. Logistic regression is preferred for binary targets as it provides predictions ranging from 0.0 to 1.0, representing the probability of the target variable taking on the value of 1.
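The logistic function that produces these bounded predictions can be sketched in a few lines (the intercept and slope here are illustrative, not estimates from the book):
    # Logistic function: p = 1 / (1 + exp(-(b0 + b1 * x))); coefficients are made up.
    import numpy as np

    def logistic(z):
        return 1.0 / (1.0 + np.exp(-z))

    b0, b1 = -4.0, 0.08                  # illustrative intercept and slope
    x = np.array([20.0, 50.0, 80.0])     # values of a single predictor
    p = logistic(b0 + b1 * x)            # predicted probabilities, always in (0, 1)
    print(p)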
The chapter introduces the logistic function and discusses analyses with binary outcomes. It also explores the metrics used to assess predictive models with binary or multi-level categorical targets, relevant for later chapters covering other prediction models.
The logistic model is demonstrated with examples using simulated data and real-world data related to employee turnover and heart disease prediction. The importance of interpreting coefficients in logistic regression is discussed, and various approaches to interpreting predictors and assessing model performance are explored, including confusion matrices and ROC curves.
The chapter also covers applying regularization techniques (L1 and L2 regularization) to logistic regression models to improve generalizability and mitigate overfitting. The concept of asymmetric costs and benefits in predictive models is introduced, particularly in the context of medical applications.
Finally, the chapter introduces multinomial logistic regression for cases where the target variable has more than two categorical levels. An example using the Iris data set is provided to demonstrate the multinomial logistic regression approach.
Overall, this chapter provides a comprehensive overview of logistic regression, its interpretation, performance evaluation, regularization, and its extension to multinomial cases. It offers valuable insights for data analysts and researchers working with binary and multi-level categorical outcomes in their predictive modeling tasks.
Frank Acito
Chapter 8. Classification and Regression Trees
Abstract
This chapter discusses Classification and Regression Trees, widely used in data mining for predictive analytics. The chapter starts by explaining the two principal types of decision trees: classification trees and regression trees. In a classification tree, the dependent variable is categorical, while in a regression tree, it is continuous.
The first section discusses classification trees, using an example of customer targeting in a marketing campaign. The chapter emphasizes that classification trees are “automatic” models, as they select independent variables by searching for optimal splits based on measures of node impurity, such as the Gini index or entropy.
The second section covers regression trees, illustrating their application in predicting continuous target variables using an example of head acceleration measurements from simulated motorcycle accidents.
The chapter explores the development of classification trees, explaining how splitting nodes are continued until they are pure or no further splits are possible. It emphasizes the importance of pruning to avoid overfitting, which can lead to poor generalization with unseen data.
The author discusses different pruning techniques, including pre-pruning and post-pruning. Pre-pruning involves setting stopping rules during tree growth, while post-pruning involves trimming the tree after it is fully grown.
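The book applies these ideas through KNIME's decision tree nodes; as a rough sketch of pre-pruning via stopping rules, scikit-learn exposes them as hyperparameters:
    # Pre-pruning sketch: stopping rules limit tree growth up front.
    from sklearn.datasets import load_iris
    from sklearn.tree import DecisionTreeClassifier

    X, y = load_iris(return_X_y=True)

    tree = DecisionTreeClassifier(max_depth=3,           # cap the depth
                                  min_samples_leaf=10,   # require enough cases per leaf
                                  random_state=0).fit(X, y)
    print(tree.get_depth(), tree.get_n_leaves())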
The strengths and weaknesses of decision trees are highlighted. The interpretability and intuitiveness of decision trees are listed as strengths, while the risk of overfitting and sensitivity to minor data changes are cited as weaknesses.
Overall, this chapter provides a comprehensive overview of decision trees, their applications, and essential considerations for creating accurate and robust models using this popular data mining technique.
Frank Acito
Chapter 9. Naïve Bayes
Abstract
This chapter introduces the Naïve Bayes algorithm, a predictive model based on Bayesian analysis. The chapter starts with a thought problem involving a breathalyzer used by a police department. It demonstrates how Bayes’ Theorem can be used to estimate the probability that a driver is over the legal alcohol limit based on breathalyzer results.
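A worked Bayes' Theorem calculation in the spirit of that thought problem, using hypothetical rates rather than the book's figures:
    # Bayes' Theorem with hypothetical numbers (not the book's):
    # P(over | positive) = P(positive | over) * P(over) / P(positive)
    p_over = 0.02                           # prior: share of drivers over the limit
    sensitivity = 0.95                      # P(positive | over the limit)
    false_positive = 0.05                   # P(positive | under the limit)

    p_positive = sensitivity * p_over + false_positive * (1 - p_over)
    p_over_given_positive = sensitivity * p_over / p_positive
    print(round(p_over_given_positive, 3))  # about 0.28, despite the seemingly accurate test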
The concept of Naïve Bayes is illustrated with a toy data set of 15 observations, demonstrating how probabilities are calculated using counting. The assumption of conditional independence is explained, and the algorithm’s applicability to categorical and continuous predictors is discussed.
The problem of zero probabilities and the need for Laplace smoothing to avoid issues with small sample sizes are explored. Naïve Bayes in KNIME is applied to two real-world examples, detecting heart disease and identifying spam emails, achieving 85% and 99% accuracy, respectively, on test data.
Despite its strong assumption of independence among predictors, Naïve Bayes proves to be a practical and efficient algorithm for classification tasks, especially when dealing with a large number of predictors. While its probability estimates may not always be precise, the classification results are often reliable.
Frank Acito
Chapter 10. k Nearest Neighbors
Abstract
K Nearest Neighbors (kNN) is a powerful and intuitive data mining model for classification and regression tasks. As an instance-based or memory-based learning algorithm, kNN classifies new objects based on their similarity to known objects in the training data. Unlike parametric models, kNN is non-parametric and does not rely on assumptions about data distributions.
The main advantages of kNN are its simplicity and the fact that it requires no separate training step. However, one of its drawbacks is that it requires scanning all the training data each time a new observation needs to be classified, which can be time-consuming for large datasets.
The kNN algorithm calculates the distances between the new observation and all existing data points. The k nearest neighbors are selected based on the smallest distances, and their majority class or average value is used for classification or regression.
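That procedure is short enough to sketch directly in a few lines of numpy (the training points and labels below are made up):
    # kNN by hand: compute distances, take the k nearest, and vote.
    import numpy as np
    from collections import Counter

    X_train = np.array([[1.0, 2.0], [1.5, 1.8], [5.0, 8.0], [6.0, 9.0], [1.2, 0.5]])
    y_train = np.array(["A", "A", "B", "B", "A"])
    x_new = np.array([1.4, 1.6])
    k = 3

    dists = np.linalg.norm(X_train - x_new, axis=1)               # Euclidean distances
    nearest = np.argsort(dists)[:k]                               # indices of the k closest
    prediction = Counter(y_train[nearest]).most_common(1)[0][0]   # majority class
    print(prediction)                                             # -> "A"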
For classification tasks, kNN is considered a “lazy” algorithm because it does not create an explicit model during training. Instead, it stores the entire dataset and performs its computations only when a new observation must be classified. In contrast, “eager” algorithms, like logistic regression, build a model during training that is then used for predictions.
In addition to classification, kNN can also be used for regression tasks. It can capture non-linear relationships between predictors and continuous target variables without requiring a predefined model.
While kNN is flexible and robust to different target variables and distributions, it requires standardizing predictors to avoid bias from variables with large values. It also suffers from the “curse of dimensionality,” where the performance degrades in high-dimensional spaces due to increased sparsity.
Despite its limitations, kNN remains a valuable tool in data mining, especially when dealing with non-linear relationships and a lack of strict assumptions about the data. Careful data preprocessing and optimization of the value of k can help improve its performance in various applications.
Frank Acito
Chapter 11. Neural Networks
Abstract
This chapter explores neural networks, focusing on their applications and underlying principles. Neural networks have gained immense popularity due to their flexibility and accuracy in supervised data mining tasks. They can effectively handle problems with categorical and continuous target variables, making them a versatile tool in predictive modeling. The chapter introduces the concept of artificial neural networks, which mimic the structure and function of human brain neurons.
The mathematical model of a neuron, first proposed by McCulloch and Pitts, serves as the foundation for neural networks. However, early attempts to implement neural networks faced challenges, leading to a period of reduced interest. The breakthrough came in the 1980s with the development of algorithms like backpropagation, which enabled the estimation of weights in multilayer networks.
The chapter discusses the learning process for neural networks, which involves adjusting the model weights iteratively to minimize an error function. Different activation functions are explored, each influencing the output of the neurons. Notably, the ReLU activation function enabled the development of deep learning models with three or more hidden layers.
An example of a single-layer artificial neuron demonstrates the calculations with various activation functions. This is followed by an example of a multilayer perceptron, showcasing the real power of neural networks with multiple layers and nodes. Neural network applications using KNIME are illustrated in the context of credit screening and predicting used car prices.
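In the spirit of that single-neuron example (the inputs and weights below are made up, not the book's), the calculation reduces to a weighted sum passed through an activation function:
    # One artificial neuron: weighted sum plus bias, then an activation function.
    import numpy as np

    x = np.array([0.5, -1.2, 2.0])       # inputs
    w = np.array([0.4, 0.3, -0.6])       # illustrative weights
    b = 0.1                              # bias

    z = np.dot(w, x) + b                 # pre-activation value
    sigmoid = 1.0 / (1.0 + np.exp(-z))   # squashes the output into (0, 1)
    relu = max(0.0, z)                   # zero for negative z, identity otherwise
    print(z, sigmoid, relu)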
The chapter also emphasizes the importance of proper data preparation, including normalization and dealing with oversampling in the context of classification problems. Overfitting, a common challenge in neural networks, is discussed, and techniques to mitigate it are presented.
The chapter provides a comprehensive overview of neural networks, highlighting their strengths and challenges. Neural networks offer great potential for complex and non-linear problems but require careful considerations in data preparation, model complexity, and validation to ensure reliable and accurate predictions.
Frank Acito
Chapter 12. Ensemble Models
Abstract
Ensemble models in machine learning involve combining predictions from multiple diverse models to achieve improved accuracy and stability. This chapter explores various ensemble techniques and their benefits.
The search for the best machine learning algorithm for a particular problem is an ongoing challenge. Studies have shown that no single algorithm performs best across all datasets. This has led to the concept of ensemble learning, where the predictions of multiple models are aggregated to produce a final estimate.
The effectiveness of combining diverse independent estimates was first highlighted in “The Wisdom of Crowds.” A classic example by Sir Francis Galton demonstrated the power of combining individual estimates, leading to a more accurate prediction.
Ensemble models are created using different approaches, such as employing multiple algorithms, varying model parameters, sampling different subsets of predictor variables, or sampling observations. The benefits of ensemble models lie in reduced variation and improved accuracy.
Reduced variation ensures reliability in predictions with different data samples, allowing for a better understanding of the model’s performance with unseen data. Improved accuracy is achieved by combining independent predictions, which helps cancel out errors, resulting in better overall predictions.
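A tiny simulation (illustrative, not from the book) shows why averaging independent predictions helps: errors partially cancel.
    # Averaging ten independent noisy predictions lowers RMSE relative to any single one.
    import numpy as np

    rng = np.random.default_rng(42)
    truth = 100.0
    n_models, n_trials = 10, 5000

    preds = truth + rng.normal(0, 10, size=(n_trials, n_models))   # each "model" adds noise

    single_rmse = np.sqrt(np.mean((preds[:, 0] - truth) ** 2))
    ensemble_rmse = np.sqrt(np.mean((preds.mean(axis=1) - truth) ** 2))
    print(single_rmse, ensemble_rmse)    # the ensemble average has the smaller error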
Bagging, Random Forests, AdaBoost, Gradient Tree Boosting, and XGBoost are discussed. These models are popular for their ability to handle different types of data and achieve state-of-the-art performance in various contexts.
The chapter includes practical examples of ensemble modeling with continuous and binary targets. One example uses a KNIME workflow to predict used car prices using ordinary least squares (OLS) regression and Gradient Boosted Trees. Another example involves predicting credit status using XGBoost.
Frank Acito
Chapter 13. Cluster Analysis
Abstract
This chapter covers cluster analysis, a set of methods used for identifying groups of similar observations based on proximity measures. The chapter focuses on the specific methods available in KNIME, a data analytics platform. Cluster analysis aims to find groups of objects that are similar within each group and distinct from objects in other groups. The number and composition of clusters can be challenging to determine, making cluster analysis an unsupervised and descriptive technique.
Hierarchical clustering is a flexible method that can work with different distance measures and linkage types. It forms a tree-like structure of clusters, starting from individual observations and gradually merging them. K-means clustering, on the other hand, requires specifying the number of clusters beforehand and aims to minimize within-cluster variance by iteratively updating cluster centroids. Density-based clustering, like DBSCAN, can discover arbitrarily shaped clusters and is robust to outliers. Fuzzy clustering assigns probabilistic membership to clusters, allowing observations to belong partially to multiple clusters.
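The book runs these algorithms through KNIME nodes; a minimal scikit-learn sketch of k-means on simulated two-dimensional data:
    # k-means sketch: specify k, then the algorithm iterates centroid updates.
    import numpy as np
    from sklearn.cluster import KMeans

    rng = np.random.default_rng(7)
    X = np.vstack([rng.normal(0, 0.5, (50, 2)),    # two well-separated blobs
                   rng.normal(5, 0.5, (50, 2))])

    km = KMeans(n_clusters=2, n_init=10, random_state=0).fit(X)
    print(km.cluster_centers_)                     # learned centroids
    print(km.inertia_)                             # within-cluster sum of squares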
Cluster validation is the process of evaluating the quality of clustering results. Internal validation uses metrics derived from the data, such as the Silhouette coefficient or within-cluster sum of squares. External validation compares the clustering results with external criteria or expected patterns.
Overall, cluster analysis is a powerful technique for grouping similar observations, but it requires careful consideration of algorithm choice, parameter settings, and validation to produce meaningful and useful clusters.
Frank Acito
Chapter 14. Communication and Deployment
Abstract
This chapter emphasizes that creating a predictive model is not the final step; effective communication and deployment are required to realize its value. Three essential elements of the “endgame” are identified: a final written report, a presentation based on the report, and model deployment. Deployment of models can be challenging, and many models are not successfully deployed due to technical, political, or regulatory issues.
The level of detail in the remaining steps depends on the model’s intended use. The communication process may be straightforward if the model is intended only for internal use by the developers or a small team. However, communication and deployment become more complex and significant if the model is to be integrated into production processes or made available to external stakeholders.
The chapter highlights the importance of written reports and presentations in conveying insights, conclusions, and plans for deploying the model. Clear and effective communication can help decision-makers understand the benefits and potential impact of the model on existing operations. The report and presentation should include a statement of the business problem, the analysis process, a summary of models and findings, deployment plans, and recommendations for further work.
The chapter also covers the complexities of deploying predictive models, ranging from individual use within the organization to real-time processing for external users. The deployment scope affects factors like data privacy, robustness, usability, and maintenance.
In conclusion, the chapter stresses the importance of effective communication, data visualization, and successful deployment to ensure that predictive models deliver value to the organization. It underscores the need to tailor the communication approach to the audience’s needs and to manage model integration complexity for successful deployment.
Frank Acito
Backmatter
Metadata
Title
Predictive Analytics with KNIME
Author
Frank Acito
Copyright Year
2023
Electronic ISBN
978-3-031-45630-5
Print ISBN
978-3-031-45629-9
DOI
https://doi.org/10.1007/978-3-031-45630-5