
2024 | Book

R Programming

Statistical Data Analysis in Research

Authors: Kingsley Okoye, Samira Hosseini

Publisher: Springer Nature Singapore


About this book

This book is intended for statisticians, data analysts, programmers, researchers, professionals, and general users who work with the object-oriented R programming language and the RStudio integrated development environment (IDE) to conduct various types of statistical data analysis for research purposes. R is open-source software with a development environment (RStudio) for computing statistics and producing graphical displays through data manipulation, modeling, and calculation. R packages and supporting libraries provide a wide range of functions for programming and analyzing data. Unlike many other statistical programs, R has the added advantage of letting users write more efficient code using command-line scripting and vectors. It has several built-in functions and libraries that are extensible and allow users to define their own (customized) functions according to how they expect the program to behave when handling the data, which can also be stored in R's simple object system. The book therefore serves as both a textbook and a manual for R statistics, particularly in academic research, data analysis, and computer programming, aiming to inform and guide users in their work. It provides information on the various types of statistical data analysis and methods, and the best scenarios for applying each of them in R. It gives a hands-on, step-by-step guide to identifying and performing the various parametric and non-parametric procedures, including a description of the different conditions or assumptions required for conducting the various statistical methods or tests, and how to interpret their results.
The book also covers the different data formats and sources and how to check the reliability and validity of the available datasets. Various research experiments, case scenarios, and examples are explained throughout. The book offers a comprehensive description and step-by-step guide, with examples, to practically conducting the various types of statistical analysis in R, particularly for research purposes: from importing and storing datasets in R as objects, through coding and calling the methods or functions for manipulating datasets or objects, factoring, and vectorizing, to reasoning about, interpreting, and storing the results for future use, as well as producing graphical visualizations and representations, consistent with statistics and computer programming in research.

Table of Contents

Frontmatter

Fundamental Concepts of R Programming and Statistical Data Analysis in Research

Chapter 1. Introduction to R Programming and RStudio Integrated Development Environment (IDE)
Abstract
This chapter presents an introduction to the R programming language and the RStudio software used in conducting statistical data analysis, graphical displays, modeling, and calculations, particularly for research purposes. It covers the basic concepts of R programming and how readers can install R and run their first project. R is built on the well-developed, simple, and effective programming language called "S and S-Plus", which supports users in defining recursive functions and conditional loops, with a wide range of coherent, integrated tools (packages) and suites of operators for statistical data analysis and for calculating with arrays and matrices. The capacity to write more efficient code using parallel methods or vectorization is one of the main features of R illustrated in this chapter, enabled by its programmable integrated development environment (such as RStudio), which uses command-line scripting. The chapter shows how to define and customize R functions or code (e.g., how the user or analyst wants or expects the resultant models to behave) when handling data, a topic expanded in detail in Chap. 2. R has several built-in functions and libraries that are extendable (extensible) and allow users to define their own (customized) functions or methods, which can be stored in its simple object system. Thus, R is regarded as an object-oriented programming language.
Kingsley Okoye, Samira Hosseini
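A minimal sketch of the vectorization and user-defined functions the abstract mentions (the data and function names below are invented for illustration, not taken from the book):

```r
# Vectorized arithmetic: R applies operations element-wise,
# so no explicit loop is needed.
x <- c(2, 4, 6, 8)
y <- x * 2 + 1        # -> 5 9 13 17

# A user-defined (customized) function, stored like any other R object.
double_plus_one <- function(v) v * 2 + 1
double_plus_one(x)
```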
Chapter 2. Working with Data in R: Objects, Vectors, Factors, Packages and Libraries, and Data Visualization
Abstract
This chapter explains and illustrates the basic principles and concepts of data management and manipulation in R by discussing what R objects, vectors, packages, and libraries are, including graphs and data-visualization methods using RStudio. It introduces users to the different functions and methods for working with data in R. This includes a description of some of the different methods and ways users can get data into R for further analysis, manipulation, or visualization. Thus, the chapter covers how to work with data in R, including the different functions users can apply to create or import data in R for analysis. In addition, it covers how to install R packages and libraries for data analysis and the different ways users can plot or visualize data (graphs) in R. R has many built-in functions and packages that allow the user to perform different types of data analysis, and it is also extensible, since it allows users to define their own additional functions or methods.
Kingsley Okoye, Samira Hosseini
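The workflow the abstract describes can be sketched in base R (the data frame and column names here are invented for illustration):

```r
# A small data frame with a categorical (factor) column.
scores <- data.frame(
  group = c("A", "A", "B", "B"),
  value = c(3.1, 4.2, 5.0, 4.8)
)
scores$group <- factor(scores$group)   # encode the column as a factor
summary(scores)                        # quick overview of the data
boxplot(value ~ group, data = scores)  # base-R visualization
```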
Chapter 3. Test of Normality and Reliability of Data in R
Abstract
This chapter introduces the readers to the main tests of data normality and reliability in scientific research using R. The most commonly used types and frequently applied methods for research purposes and investigation are explained and illustrated in detail in this chapter. The data normality tests are performed to assess whether a data sample is well modeled by a normal distribution, while the test for reliability of the datasets or research instruments is done to determine the extent (consistency of measures) to which the scales or variables (items) in the available data are capable of producing a reliable/coherent result. The Kolmogorov–Smirnov (K–S) and Shapiro–Wilk (S–W) tests are used to demonstrate the normality tests in R in this chapter, and Cronbach's alpha is used to demonstrate the reliability test.
Kingsley Okoye, Samira Hosseini
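The tests named in the abstract map onto standard R calls; a minimal sketch on simulated data (illustrative only, not the book's examples):

```r
# Normality tests on a simulated sample.
set.seed(42)
x <- rnorm(100)

sw <- shapiro.test(x)                      # Shapiro-Wilk test
ks <- ks.test(x, "pnorm", mean(x), sd(x))  # one-sample Kolmogorov-Smirnov
# Note: estimating mean/sd from the same sample makes the K-S p-value
# only approximate (the Lilliefors variant addresses this).

# Cronbach's alpha is not in base R; the 'psych' package provides it, e.g.
# psych::alpha(items)   # 'items' would be a data frame of scale items
```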
Chapter 4. Choosing Between Parametric and Non-parametric Tests in Statistical Data Analysis
Abstract
In this chapter, the authors describe what parametric and non-parametric tests are in statistical data analysis and the best scenarios for the use of each. It provides a guide for readers on how to choose which of the tests is most suitable for their specific research, including a description of the differences, advantages, and disadvantages of the two types of tests. Before using the different statistical methods in R (see Chap. 6), users need to understand the differences and conditions under which the various tests or methods are applied. The term "parametric" refers to parameters of the dataset (distribution), on the assumption that the sample statistics (mean, standard deviation, etc.) come from a normally distributed population. The "non-parametric" tests (usually measured on the median) are referred to as "distribution-free" tests, given that the supporting methods do not assume that the analyzed datasets follow any specified distribution. Thus, the choice between the statistical procedures or supporting methods (parametric versus non-parametric) is made based on the type of the available dataset (nominal, ordinal, continuous, discrete) and/or the number of independent versus dependent groups or categories of the variables, which are described in Chap. 5.
Kingsley Okoye, Samira Hosseini
Chapter 5. Understanding Dependent and Independent Variables in Research Experiments and Hypothesis Testing
Abstract
This chapter describes what the dependent and independent variables are for conducting research experiments. It introduces the readers to the different conditions for the use of the two types of variables (dependent and independent) in scientific research and hypothesis testing. The differences between the two variables and examples of each use-case scenario are provided in this chapter. The relationship between the independent (IV) and dependent (DV) variables is the key foundation of most statistical data analyses or scientific tests. The authors note that an easy way to identify the independent or dependent variable in an experiment is: independent variables (IV) are what the researchers change or what changes on its own, whereas dependent variables (DV) are what changes as a result of the change in the independent variable (IV). Thus, the independent variable (IV), otherwise known as the "predictor variable", is the cause, while the dependent variable (DV), or the "response variable", is the effect.
Kingsley Okoye, Samira Hosseini
Chapter 6. Understanding the Different Types of Statistical Data Analysis and Methods
Abstract
This chapter provides the readers with information about the various types of statistical data analysis methods in research, and examples of the best scenarios for the use of each method. The authors provide this chapter as a guideline for researchers in selecting the most appropriate or suitable statistical analysis/method for their research based on the type of data (e.g., independent versus dependent variable) or research design. We note that the type of research methodology or design one chooses to carry out the research investigation determines the type of data that is required for the research purpose, and vice versa. This outcome then determines the means or procedures that will consequently be applied for collecting the samples (data collection), as well as the type of analysis (statistical data analysis) that will be performed, as discussed in this chapter.
Kingsley Okoye, Samira Hosseini

Application and Implementation of Advanced Methods for Statistical Data Analysis in Research Using R

Frontmatter
Chapter 7. Regression Analysis in R: Linear Regression and Logistic Regression
Abstract
This first chapter in the series of statistical data analyses using R, which the authors provide in this second part (Part II) of the book, introduces and practically illustrates to the users how to run a linear and a logistic regression analysis in R. This statistical technique (regression) helps to estimate the association or dependency relationship between two variables. Technically, there are two main points to consider when conducting regression analysis. The first is to check whether a predictor variable (often called the independent variable; see Chap. 5) is good enough (measured through significance levels, e.g., p-values ≤ 0.05) at predicting the effect (outcome) or response of the targeted variable (the dependent variable). Second, in the case of multiple independent variables, regression analysis can be used to determine which variable(s), in particular, are the significant predictors of the outcome (dependent variable). Linear regression, as the name implies, assumes that the relationships between the independent and dependent variables are linear; thus, a constant unit of change in one of the variables implies a constant unit of change in the other. On the other hand, unlike linear regression, which uses continuous variables in its tests, logistic regression (an alternative to linear regression) is used when the dependent variable is categorical or dichotomous (binary), i.e., fits into one of two clear-cut categories.
Kingsley Okoye, Samira Hosseini
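Both models the abstract names are available in base R; a minimal sketch using the built-in mtcars data (an illustrative dataset choice, not the book's example):

```r
# Linear regression: continuous outcome (mpg) on a continuous predictor (wt).
fit <- lm(mpg ~ wt, data = mtcars)
summary(fit)                 # slope, p-values, R-squared

# Logistic regression: binary outcome (am: 0 = automatic, 1 = manual).
logit <- glm(am ~ wt, data = mtcars, family = binomial)
summary(logit)
```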
Chapter 8. T-test Statistics in R: Independent Samples, Paired Sample, and One Sample T-tests
Abstract
This chapter provides the readers with the steps and guidelines on how to run the t-test in R. This test is used for evaluating the means of one or two groups of variables in research experiments or statistical tests. By definition, the t-test is one of the inferential (parametric) statistics used for hypothesis testing and for determining the differences (where applicable) between the means of two independent groups of data or of a single variable in a data sample. The most common types of the test, namely the Independent samples, Paired sample, and One sample t-tests, are explained and practically illustrated in this chapter. The Independent samples t-test (also referred to as the Unpaired or Two-sample t-test) is used by researchers to compare the means of two independently sampled groups, where the two groups under consideration are independent of each other. On the other hand, the Paired sample t-test (also known as the Dependent sample t-test) is applied to compare the means of a sample collected from the same group or population but at different times or intervals (e.g., pre- and post-test, before and after). The One sample t-test is used when the researcher wants to compare the mean of a single group of variables or data against a known mean, i.e., to test whether the given sample mean is equal to the hypothesized value, otherwise known as the test mean.
Kingsley Okoye, Samira Hosseini
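All three variants described above are served by R's `t.test`; a minimal sketch on simulated data (illustrative only):

```r
# Simulated samples for the three t-test variants.
set.seed(1)
a <- rnorm(30, mean = 5)
b <- rnorm(30, mean = 6)

t.test(a, b)                 # independent (two-sample) t-test
t.test(a, b, paired = TRUE)  # paired sample t-test
t.test(a, mu = 5)            # one sample t-test against test mean 5
```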
Chapter 9. Analysis of Variance (ANOVA) in R: One-Way and Two-Way ANOVA
Abstract
This chapter provides the users with information on how to conduct the Analysis of Variance (ANOVA) test in R. This test helps to determine the mean differences that may exist in data samples. The most common types of the test, namely One-way and Two-way ANOVA, are explained and practically illustrated in R by the authors in this chapter. The One-way ANOVA is used to compare the differences in mean between one independent (categorical or ordinal) variable and one dependent (continuous) variable, where the independent variable must have at least three levels or categories, i.e., a minimum of three different groups of a specified variable. The Two-way ANOVA, on the other hand, is used to compare the differences in mean between two independent (categorical or ordinal) variables (with three or more levels) and one dependent (continuous) variable. In summary, the ANOVA test is used for examining the effects that one, two, or even multiple factors (independent variables) have on the population of the study (usually a continuous dependent variable) simultaneously, i.e., all at the same time.
Kingsley Okoye, Samira Hosseini
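A minimal sketch of both designs using R's `aov` and built-in datasets (illustrative dataset choices, not the book's examples):

```r
# One-way ANOVA: PlantGrowth has one factor (group) with three levels.
one_way <- aov(weight ~ group, data = PlantGrowth)
summary(one_way)

# Two-way ANOVA: ToothGrowth has two factors (supp, dose) and a
# continuous outcome (len); dose is converted to a factor first.
ToothGrowth$dose <- factor(ToothGrowth$dose)
two_way <- aov(len ~ supp + dose, data = ToothGrowth)
summary(two_way)
```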
Chapter 10. Chi-Squared (χ2) Statistical Test in R
Abstract
This chapter explains and practically illustrates to the readers how to apply a Chi-squared (χ2) analysis in R. The Chi-squared test is used to compare how expectations are linked or related to the actual observed (frequency, fact, behaviour, relationship, fitting, distribution) datasets or experimental data. Two main types of tests or analyses are usually applied by researchers using the Chi-squared analysis: (i) the Independence test, defined as a test of "relationship" that allows the researcher or data analyst to compare two (categorical) variables to determine whether they are related or not, and (ii) the Goodness-of-fit test, which allows the user to determine whether a proportion of a data sample matches the larger population. Thus, if the analyzed data does not match or fit the assumed (expected) characteristics of the intended population, usually determined through the p-value (p ≤ 0.05), then the user or researcher may consequently not want to use the drawn data or sample to make any conclusion about the studied (larger) population in question. The content of this chapter also covers how to graphically visualize the data relationships and how to interpret the result of the Chi-squared (χ2) analysis in R.
Kingsley Okoye, Samira Hosseini
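Both variants named above map onto R's `chisq.test`; a minimal sketch (the mtcars table and the counts below are invented for illustration):

```r
# (i) Test of independence on a 2x2 contingency table built from mtcars.
tbl <- table(mtcars$am, mtcars$vs)
chisq.test(tbl)

# (ii) Goodness-of-fit: do observed counts match expected proportions?
observed <- c(50, 30, 20)
chisq.test(observed, p = c(0.5, 0.3, 0.2))
```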
Chapter 11. Mann–Whitney U Test and Kruskal–Wallis H Test Statistics in R
Abstract
This chapter of the book on statistical analysis covers how to analyze the effect of one variable on another by using the "Mann–Whitney U" and "Kruskal–Wallis H" tests in R. It explains and practically illustrates the differences and similarities between the Mann–Whitney U and Kruskal–Wallis H tests, including the various conditions that are required or necessary for performing the two different tests. The Mann–Whitney and Kruskal–Wallis tests are the non-parametric equivalents of (and alternatives to) the Independent t-test and the Analysis of Variance (ANOVA), respectively, used to analyze nominal/ordinal datasets, non-normally distributed datasets that violate the assumptions or conditions for performing the parametric tests, or data samples that are too small. By definition, the Mann–Whitney test, also known as the U test, is used to determine the differences between two groups of an independent variable with no specific distribution on a single ranked scale, and the data must be of an ordinal type. On the other hand, the Kruskal–Wallis test, also referred to as the H test, is described as an extension of the two-group Mann–Whitney U test, used when the researcher or analyst is comparing more than two groups (i.e., three or more levels or categories) of independent samples, and it also uses ranked (ordinal) datasets. Whereas the Mann–Whitney U test is considered a powerful alternative (i.e., the non-parametric version) to the Independent t-test, the Kruskal–Wallis test is considered the alternative (non-parametric version) to the One-way ANOVA test. Both the Mann–Whitney U and Kruskal–Wallis H tests are measured by considering the "median".
Kingsley Okoye, Samira Hosseini
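In base R, the Mann–Whitney U test is run via `wilcox.test` and the Kruskal–Wallis H test via `kruskal.test`; a minimal sketch on built-in datasets (illustrative choices, not the book's examples):

```r
# Mann-Whitney U test: two independent groups (supp has two levels).
wilcox.test(len ~ supp, data = ToothGrowth)

# Kruskal-Wallis H test: three or more independent groups.
kruskal.test(weight ~ group, data = PlantGrowth)
```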
Chapter 12. Correlation Tests in R: Pearson Cor, Kendall’s Tau, and Spearman’s Rho
Abstract
This chapter explains and practically illustrates to the readers how to conduct the three main types of correlational analysis in R, namely the Pearson, Kendall's tau, and Spearman's rho correlation tests. These are the primary inferential (statistical) procedures or methods used by researchers or data analysts to evaluate the strength or degree (direction) of the relationship between two variables (continuous or categorical). The Pearson correlation (also known as the Pearson product–moment correlation coefficient) measures the strength of the linear association that exists between two continuous variables by drawing a "line of best fit" through the two datasets and establishing how far away the data points are from the drawn line (model) of best fit. On the other hand, the Kendall's tau and Spearman's rho correlation tests are considered alternatives (non-parametric equivalents) to the Pearson test, mainly used by researchers to measure the strength and degree of dependence between two categorical or ordinal variables. The differences and similarities between the Kendall's tau and Spearman's rho correlation tests are also discussed in this chapter. The interpretations of the two methods (Kendall's tau and Spearman's rho) are very similar and thus invariably lead to the same inferences or statistical results. The main difference between Spearman's rho and Kendall's tau is that the Spearman's rho (ρ) statistic is calculated through the "ordinary least squares", while the Kendall's tau (τ) statistic is calculated through the "pairwise comparison" of all the data points. Thus, the Kendall's tau (τ) statistic is based on "concordant and discordant pairs", while the Spearman's rho (ρ) statistic is based on "deviations".
Kingsley Okoye, Samira Hosseini
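All three coefficients are available through a single base-R function, `cor.test`, selected by its `method` argument; a minimal sketch on mtcars (an illustrative dataset choice):

```r
# Same variable pair, three correlation methods.
cor.test(mtcars$mpg, mtcars$wt, method = "pearson")   # linear association
cor.test(mtcars$mpg, mtcars$wt, method = "kendall")   # concordant/discordant pairs
cor.test(mtcars$mpg, mtcars$wt, method = "spearman")  # rank-based
```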
Chapter 13. Wilcoxon Statistics in R: Signed-Rank Test and Rank-Sum Test
Abstract
This chapter provides the readers with guidelines on how to run the Wilcoxon statistical test in R. The Wilcoxon test is a non-parametric "alternative to the t-test" used for comparing the medians of two data samples or variables. This inferential statistical test is particularly recommended in situations where the dataset the researchers or data analysts want to analyze is not normally distributed or is on a ranked or ordinal scale. By definition, the Wilcoxon test is one of the inferential (non-parametric) statistics used for hypothesis testing and for determining the significant differences (where applicable) between the "medians" of two independent groups of data or paired variables. The common types of the test in the current literature, namely (i) the Wilcoxon Signed-Rank test and (ii) the Wilcoxon Rank-Sum test, are explained and practically illustrated in this chapter. The Wilcoxon Signed-Rank test (also referred to as the alternative to the Paired Sample t-test) is calculated based on the differences in the samples' scores, taking into account both the signs and the magnitudes of the observed differences. The Wilcoxon Rank-Sum test (also referred to as the alternative to the Independent Sample t-test) is used to compare the medians of two independently sampled datasets where the conditions for conducting the "Independent Sample t-test" are not met or the dataset in question contains outliers (i.e., is distribution-free or not normally distributed).
Kingsley Okoye, Samira Hosseini
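Both variants are served by R's `wilcox.test`, toggled by the `paired` argument; a minimal sketch on simulated before/after measurements (illustrative data only):

```r
# Simulated paired measurements.
set.seed(7)
before <- rnorm(20, mean = 50, sd = 5)
after  <- before + rnorm(20, mean = 2)

wilcox.test(before, after, paired = TRUE)  # Wilcoxon signed-rank test
wilcox.test(before, after)                 # Wilcoxon rank-sum (Mann-Whitney) test
```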
Backmatter
Metadata
Title
R Programming
Authors
Kingsley Okoye
Samira Hosseini
Copyright Year
2024
Publisher
Springer Nature Singapore
Electronic ISBN
978-981-9733-85-9
Print ISBN
978-981-9733-84-2
DOI
https://doi.org/10.1007/978-981-97-3385-9