
2012 | Book

Beginning R

An Introduction to Statistical Programming


About this Book

Beginning R: An Introduction to Statistical Programming is a hands-on book showing how to use the R language, write and save R scripts, build and import data files, and write your own custom statistical functions. R is a powerful open-source implementation of the statistical language S, which was developed at AT&T's Bell Labs. R has eclipsed S and the commercially available S-Plus language, and has become the de facto standard for doing, teaching, and learning computational statistics.

R is both an object-oriented language and a functional language that is easy to learn, easy to use, and completely free. A large community of dedicated R users and programmers provides an excellent source of R code, functions, and data sets. R is also being adopted into commercial tools such as Oracle Database. Your investment in learning R is sure to pay off in the long term as R continues to grow into the go-to language for statistical exploration and research.

- Covers the freely available R language for statistics
- Shows the use of R in specific use cases such as simulations, discrete probability solutions, one-way ANOVA analysis, and more
- Takes a hands-on, example-based approach, incorporating best practices with clear explanations of the statistics being done

Table of Contents

Frontmatter
Chapter 1. Getting R and Getting Started
Abstract
R is a flexible and powerful open-source implementation of the language S (for statistics) developed by John Chambers and others at Bell Labs. R has eclipsed S and the commercially available S-Plus program for many reasons. R is free, and has a variety (nearly 4,000 at last count) of contributed packages, most of which are also free. R works on Macs, PCs, and Linux systems. In this book, you will see screens of R 2.15.1 running in a Windows 7 environment, but you will be able to use everything you learn with other systems, too. Although R is initially harder to learn and use than a spreadsheet or a dedicated statistics package, you will find R is a very effective statistics tool in its own right, and is well worth the effort to learn.
Larry Pace
Chapter 2. Programming in R
Abstract
R allows you to save and reuse code, as we discussed in Chapter 1. When explicit looping is necessary, it is possible, as you will learn in this chapter. We will discuss the basics of programming in general, and the specifics of programming in R, including program flow, looping, and the use of logic. In Chapter 3, you will learn how to create your own useful R functions to keep from typing the same code repetitively. When you create functions, they are saved in your workspace image, and are available to you whenever you need them. As mentioned earlier, R programming is functional in the sense that each function call should perform a well-defined computation relying on the arguments passed to the function (or to default values for arguments). Everything in R is an object, including a function, and in this sense, R is an object-oriented programming language.
Larry Pace
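A minimal sketch of the flow-control constructs this chapter covers (the loop and data here are illustrative, not taken from the book):

```r
# A for loop with branching logic: compute squares and flag large ones
squares <- numeric(5)          # pre-allocate the result vector
for (i in 1:5) {
  squares[i] <- i^2
  if (squares[i] > 10) {
    cat(i, "squared exceeds 10\n")
  }
}
squares                        # 1 4 9 16 25
```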
Chapter 3. Writing Reusable Functions
Abstract
As you learned in the previous chapters, everything in R is an object. This makes R an object-oriented programming language. There are many built-in functions and thousands more contributed by the R community. Before you consider writing your own function for a particular problem, check the R documentation and discussions to see if someone else has already done the work for you. One of the best things about the R community is that new packages appear regularly, and it is quite likely someone else has already faced and solved the same programming problem. If they haven’t, it is easy enough to create your own functions.
Larry Pace
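A minimal sketch of a reusable custom function of the kind this chapter teaches (the function and its name are illustrative):

```r
# A small reusable function with a default argument:
# the coefficient of variation, optionally as a percentage
coef_var <- function(x, as_percent = TRUE) {
  cv <- sd(x) / mean(x)
  if (as_percent) cv * 100 else cv
}

coef_var(c(10, 12, 14, 16, 18))   # sd/mean expressed as a percentage
```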
Chapter 4. Summary Statistics
Abstract
You have already seen many of the statistical functions in R, but in this chapter, you get a more complete list. You also will learn how R calculates and, more important, reports the statistical results. We cover the standard descriptive statistics from a business statistics class. If you have not taken statistics or are a little rusty, I recommend David Moore’s business statistics books. Although it is a toss-up whether to cover graphs or numerical summaries first, we will start with numerical summaries. We will cover measures of central tendency and spread, and the shape of a distribution. For this chapter, and for others, we will use common examples, and the same data will be used several different times.
Larry Pace
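A sketch of the built-in numerical summary functions this chapter surveys (the data are illustrative):

```r
scores <- c(88, 92, 75, 81, 95, 84)

mean(scores)      # central tendency
median(scores)
sd(scores)        # spread
summary(scores)   # five-number summary plus the mean
```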
Chapter 5. Creating Tables and Graphs
Abstract
In the last chapter, you learned how to use R for the most common numerical descriptive statistics. In this chapter, you learn how to use R for summary tables and for various graphical representations of data. As with many of the other functions of R, you will learn that the methods R provides often produce quite different results depending on the input.
Larry Pace
Chapter 6. Discrete Probability Distributions
Abstract
In Chapter 6, you learn how to use R for the binomial distribution and the Poisson distribution. We quickly review the characteristics of a discrete probability distribution, and then you learn the various functions in R that make it easy to work with both discrete and continuous distributions.
Larry Pace
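The distribution functions mentioned above follow R's d/p/q/r naming convention; a small sketch with illustrative parameters:

```r
# P(exactly 3 heads in 10 fair coin flips), binomial density
dbinom(3, size = 10, prob = 0.5)   # about 0.117

# P(at most 2 events) for a Poisson process with mean 4
ppois(2, lambda = 4)               # about 0.238
```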
Chapter 7. Computing Normal Probabilities
Abstract
The normal distribution is the backbone of traditional statistics. We learn very early in our statistics training that the distribution of sample means, regardless of the shape of the parent distribution, approaches a normal distribution as the sample size increases. This fact permits us to use the normal distribution and distributions theoretically related to it, such as t, F, and χ², for testing hypotheses about means and functions of means (such as the differences between two or more means, or variances, which are derived from means).
Larry Pace
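A sketch of the normal-probability functions involved, plus a quick illustration of the sampling-distribution idea (the parameters are illustrative):

```r
# Standard normal: P(Z < 1.96)
pnorm(1.96)                          # about 0.975

# P(a score under 600) when the mean is 500 and sd is 100
pnorm(600, mean = 500, sd = 100)     # about 0.841

# The central-limit idea: means of samples from a non-normal parent
# (uniform here) pile up in a roughly normal shape
means <- replicate(1000, mean(runif(30)))
```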
Chapter 8. Creating Confidence Intervals
Abstract
We have already discussed using the standard normal and t distributions for confidence intervals for means. We can construct confidence intervals for other statistics as well. Confidence intervals avoid some of the logical problems of null hypothesis testing and are therefore recommended as an alternative in many cases.
Larry Pace
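A sketch of a t-based confidence interval for a mean, built by hand and then checked against t.test (the data are illustrative):

```r
x <- c(4.2, 5.1, 6.3, 4.8, 5.9, 5.5)
n <- length(x)

# 95% confidence interval for the mean: mean +/- t * standard error
mean(x) + qt(c(0.025, 0.975), df = n - 1) * sd(x) / sqrt(n)

# The same interval as reported by t.test
t.test(x)$conf.int
```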
Chapter 9. Performing t Tests
Abstract
We use t tests to compare means. You have already seen that the t.test function can be used to display confidence intervals for means and differences between means. The t.test function in R is used for all three kinds of t tests: one-sample, paired-samples, and two-sample t tests.
Larry Pace
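The three kinds of t test named above map onto three calls to the same function; a sketch with illustrative data:

```r
before <- c(200, 195, 210, 190, 205)
after  <- c(190, 188, 200, 185, 199)

t.test(before, mu = 195)                 # one-sample test against mu = 195
t.test(before, after, paired = TRUE)     # paired-samples test
t.test(before, after, var.equal = TRUE)  # two-sample test, pooled variance
```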
Chapter 10. One-Way Analysis of Variance
Abstract
The analysis of variance (ANOVA) compares three or more means simultaneously. We determine whether the means are significantly different in the population by partitioning the variation in the dependent variable into separate sources. ANOVA takes advantage of the additivity property of variance, and we partition the variation into treatment effect (real differences) and error (differences due to sampling error or individual differences). The ratio of two variances follows the F (named after R. A. Fisher) distribution. Some readers may have difficulty understanding why the analysis of variance components can be used to test hypotheses about means, but on reflection, one should realize that the variances themselves are based on squared deviations from means.
Larry Pace
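A sketch of the one-way ANOVA described above, using R's aov function on illustrative data:

```r
scores <- c(5, 7, 6, 9, 10, 8, 12, 14, 13)
group  <- factor(rep(c("A", "B", "C"), each = 3))

fit <- aov(scores ~ group)
summary(fit)   # partitions variation into treatment and error; reports F
```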
Chapter 11. Advanced Analysis of Variance
Abstract
In Chapter 11 we dig deeper into ANOVA procedures, including two-way ANOVA, repeated-measures ANOVA, and mixed-factorial ANOVA. In general, these forms of ANOVA allow us to evaluate the combined effect of two or more factors or to examine multiple measurements on the dependent variable. In Chapter 13, where we discuss multiple regression, we will illustrate how the same linear model underlies both regression and ANOVA models and how we can use regression models, if preferred, to conduct ANOVAs as well.
Larry Pace
Chapter 12. Correlation and Regression
Abstract
In Chapter 12 you learn simple (bivariate) correlation and regression. You discover how to use the R functions for correlation and regression and how to calculate and interpret correlation coefficients and regression equations. You also learn about fitting a curvilinear model and about confidence and prediction intervals for regression models. In Chapter 13, we will build on what you learn here, with multiple correlation and regression. In Chapter 13, you will also learn that ANOVAs and t tests are special cases of regression. All these techniques are based on the same underlying general linear model.
Larry Pace
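A sketch of the bivariate correlation and regression workflow described above, including a prediction interval (the data are illustrative):

```r
x <- c(1, 2, 3, 4, 5)
y <- c(2.1, 3.9, 6.2, 7.8, 10.1)

cor(x, y)          # Pearson correlation coefficient
fit <- lm(y ~ x)   # simple linear regression
coef(fit)          # intercept and slope

# Prediction interval for a new observation at x = 6
predict(fit, newdata = data.frame(x = 6), interval = "prediction")
```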
Chapter 13. Multiple Regression
Abstract
In Chapter 13, you learn about multiple regression, which is the linear combination of two or more predictor variables to optimize the relationship between the observed and predicted variables. You also learn, as promised earlier, that the general linear model underlying multiple regression also underlies ANOVA and t tests, and you can use regression instead of those procedures to obtain the same (or more informative) results. You discover R’s functions for dealing with both continuous and dichotomous predictor variables, and how to dummy code your data to achieve the most useful information. As a bonus, I show you how to use matrix algebra to solve a regression model. This will enhance both your R skills and your statistical understanding, and will help you with more advanced topics such as multivariate analyses (which are beyond the scope of this beginning text).
Larry Pace
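The matrix-algebra solution mentioned above can be sketched as follows, checking the normal-equations result against lm (the data are illustrative):

```r
x1 <- c(1, 2, 3, 4, 5)
x2 <- c(2, 1, 4, 3, 5)
y  <- c(3, 4, 8, 9, 12)

X <- cbind(1, x1, x2)                    # design matrix with intercept column
b <- solve(t(X) %*% X) %*% t(X) %*% y    # b = (X'X)^-1 X'y
b

coef(lm(y ~ x1 + x2))                    # same coefficients from lm
```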
Chapter 14. Logistic Regression
Abstract
In bivariate and multiple regression (Chapters 12 and 13), we used a continuous dependent variable. There are occasions in which the outcome is an either-or (0, 1) binary outcome. Fisher developed a technique known as discriminant analysis for predicting group membership (by maximizing the combination of predictor scores to separate the two groups). This is a fine technique, and one that has been around for a long time. There is, however, one major problem with discriminant analysis, namely that it can take advantage only of continuous predictors. A more modern approach to predicting (and, where desired, classifying) group membership in two groups represented by 0s and 1s is logistic regression. Logistic regression allows the use of both continuous and binary predictors.
Larry Pace
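In R, logistic regression is fit with glm and a binomial family; a sketch with illustrative data:

```r
hours  <- c(1, 2, 3, 4, 5, 6, 7, 8)
passed <- c(0, 0, 0, 1, 0, 1, 1, 1)   # binary (0, 1) outcome

fit <- glm(passed ~ hours, family = binomial)

# Predicted probability of passing after 4.5 hours of study
predict(fit, newdata = data.frame(hours = 4.5), type = "response")
```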
Chapter 15. Chi-Square Tests
Abstract
You learned about the chi-square distribution when we discussed confidence intervals for the variance and standard deviation in Chapter 8. Another contribution of Karl Pearson, the chi-square distribution has proven to be quite versatile. We use chi-square tests for determining goodness of fit and for determining the association or lack thereof for cross-tabulated categorical variables.
Larry Pace
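Both uses of the chi-square test described above go through chisq.test; a sketch with illustrative counts:

```r
# Goodness of fit: are the six die faces equally likely?
chisq.test(c(18, 22, 16, 25, 19, 20))

# Test of association on a cross-tabulated 2 x 2 table
tab <- matrix(c(30, 10, 20, 40), nrow = 2)
chisq.test(tab)
</imports>
```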
Chapter 16. Nonparametric Tests
Abstract
As you learned in Chapter 15, when we use chi-square tests for frequency data and we are not estimating population parameters, we are conducting nonparametric tests. A whole class of nonparametric procedures is available for data that are ordinal (ranks) in nature. Some data are ordinal by their very definition, such as employee rankings, while in other cases we convert interval or ratio data to ranks because the data violate distributional assumptions such as linearity, normality, or equality of variance. As a rule of thumb, nonparametric tests are less powerful than parametric tests, but that is not always the case.
Larry Pace
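A sketch of one rank-based procedure from this class, the Wilcoxon rank-sum test, with illustrative data:

```r
a <- c(12, 15, 9, 20, 17)
b <- c(8, 11, 13, 7, 10)

# Rank-based alternative to the two-sample t test
wilcox.test(a, b)
```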
Chapter 17. Using R for Simulation
Abstract
R shines as a tool for simulations of all kinds. Simulations have been around for a very long time. I remember doing Monte Carlo simulations as a graduate student with FORTRAN-IV programs I wrote. That was fun, but R is far more versatile, and as a result, more fun.
Larry Pace
Chapter 18. The “New” Statistics: Resampling and Bootstrapping
Abstract
In Chapter 17 you learned how you can do simulations of various statistical processes in R by generating random data. In Chapter 18, we turn the tables and work only with sample data. We will still be simulating, however, because we will use resampling and bootstrapping to generate multiple samples with replacement from our original data. In essence these are nonparametric techniques, because we are not making assumptions about populations when we analyze the sample data.
Larry Pace
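The resampling-with-replacement idea described above can be sketched in a few lines of base R (the data are illustrative):

```r
set.seed(1)
x <- c(5.1, 6.4, 4.8, 7.2, 5.9, 6.1, 5.5, 6.8)

# Bootstrap: resample the data with replacement, recompute the mean each time
boot_means <- replicate(2000, mean(sample(x, replace = TRUE)))

# Percentile confidence interval for the mean from the bootstrap distribution
quantile(boot_means, c(0.025, 0.975))
```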
Chapter 19. Making an R Package
Abstract
One of the strengths of R is the ability to share software as packages. Packages give users a reliable, convenient, and standardized way to access R functions, data, and documentation. Package authors have a means of communicating with users and a way to organize software for the purpose of sharing it with others and reusing it themselves.
Larry Pace
Chapter 20. The R Commander Package
Abstract
If you are interested in developing R software, you probably found Chapter 19 to be interesting and useful. However, if you plan to use R simply for statistical analyses or if you need to teach (or take) a statistics course, you may want to consider the R Commander package written by John Fox. R Commander is a graphical user interface you invoke from the R command line. It is an R package itself, and must be downloaded and installed like any other R package. Once you have initiated R Commander, however, you are no longer working with the traditional R GUI, but with a different one, as you will learn here.
Larry Pace
Backmatter
Metadata
Title
Beginning R
Author
Larry Pace
Copyright Year
2012
Publisher
Apress
Electronic ISBN
978-1-4302-4555-1
Print ISBN
978-1-4302-4554-4
DOI
https://doi.org/10.1007/978-1-4302-4555-1