
2021 | Book

CRAN Recipes

DPLYR, Stringr, Lubridate, and RegEx in R


About This Book

Want to use the power of R sooner rather than later? Don’t have time to plow through wordy texts and online manuals? Use this book for quick, simple code to get your projects up and running. It includes code and examples applicable to many disciplines. Written in everyday language with a minimum of complexity, each chapter provides the building blocks you need to fit R’s astounding capabilities to your analytics, reporting, and visualization needs.

CRAN Recipes recognizes how needless jargon and complexity get in your way. Busy professionals need simple examples and intuitive descriptions; side trips and meandering philosophical discussions are left for other books.

Here R scripts are condensed, to the extent possible, to copy-paste-run format. Chapters and examples are structured to purpose rather than particular functions (e.g., “dirty data cleanup” rather than the R package name “janitor”). Everyday language eliminates the need to know functions/packages in advance.

What You Will Learn

Carry out input/output; visualizations; data munging; manipulations at the group level; and quick data exploration
Handle forecasting (multivariate, time series, logistic regression, Facebook’s Prophet, and others)
Use text analytics; sampling; financial analysis; and advanced pattern matching (regex)
Manipulate data using DPLYR: filter, sort, summarize, add new fields to datasets, and apply powerful IF functions
Create combinations or subsets of files using joins
Write efficient code using pipes to eliminate intermediate steps (MAGRITTR)
Work with string/character manipulation of all types (STRINGR)
Discover counts, patterns, and how to locate whole words
Do wild-card matching, extraction, and invert-match
Work with dates using LUBRIDATE
Fix dirty data; attractive formatting; bad habits to avoid

Who This Book Is For

Programmers/data scientists with at least some prior exposure to R.

Table of Contents

Frontmatter
Chapter 1. DPLYR
Abstract
Dplyr is one of my favorite R packages. Its logical and consistent rules replace the older, motley collection of syntactically inconsistent packages and functions. It’s like a Swiss Army knife in the woods—don’t leave home without it.
William Yarberry
Chapter 2. Stringr
Abstract
The next two packages, Lubridate and Stringr, omit many exceptions and tricky, oddball situations that standard manuals include by necessity. From a technical perspective, there is nothing new in this book. Indeed, much of the code is copied, with slight modifications, from the excellent, free online manuals. The narrow purpose of this book is to give you enough knowledge to use the packages as quickly as possible. If you have programming experience, the explanations and examples here should have you competent within a day or two.
William Yarberry
Chapter 3. Lubridate: Date and Time Processing
Abstract
Lubridate starts out answering a simple question—is it AM or PM, based on a date and hour? From there, it gets more complex but maintains a consistent approach to working with dates, times, and the combination date-times.
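A minimal sketch of that starting question, assuming the lubridate package is installed. The timestamps are made up for illustration and are not taken from the book.

```r
library(lubridate)

# Hypothetical timestamps, parsed as UTC date-times
stamps <- ymd_hms(c("2021-03-01 09:15:00", "2021-03-01 17:40:00"))

is_am <- am(stamps)    # TRUE for times before noon
is_pm <- pm(stamps)    # TRUE for times from noon onward
hrs   <- hour(stamps)  # the hour component of each date-time

is_am
is_pm
hrs
```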
William Yarberry
Chapter 4. Regular Expressions: Introduction
Abstract
First impressions of regular expressions are rarely positive. They look arcane, a throwback to the early decades of modern computing, when GUIs and object-oriented programming were a distant future. However, once you get past their look and feel, regular expressions give you serious power to search, filter, and manipulate text and numbers with speed and minimal code.
William Yarberry
Chapter 5. Typical Uses
Abstract
The simplest patterns match exact strings:

              library(stringr)
              x <- c("apple", "banana", "pear")
              # Extract the first occurrence of the literal "an" (NA where absent)
              str_extract(x, "an")
             
              ## [1] NA   "an" NA
            
William Yarberry
Chapter 6. Some Simple Patterns
Abstract
Regular expressions, taken to their extreme, can make you feel irregular. For many R programming tasks, only a few meta-characters need be used. Table 6-1 is from an excellent Loyola Marymount University website. In R, the regular expression will be enclosed in quotes and used in one of the seven functions listed in Chapter 9, “The Magnificent Seven.”
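Table 6-1 is not reproduced here, so the following is only a sketch of three widely used meta-characters (`^`, `$`, and `.`) in base R. The choice of meta-characters and the sample vector are assumptions for illustration, not the table's contents.

```r
x <- c("apple", "banana", "pear")

starts_with_a <- grepl("^a", x)   # "^" anchors the match at the start
ends_with_r   <- grepl("r$", x)   # "$" anchors the match at the end
p_any_a       <- grepl("p.a", x)  # "." stands for any single character

starts_with_a  # TRUE FALSE FALSE
ends_with_r    # FALSE FALSE TRUE
p_any_a        # FALSE FALSE TRUE ("pea" in "pear")
```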
William Yarberry
Chapter 7. Character Classes
Abstract
A character class is how regex understands which characters should be considered for a match or “anti-match”—anything but what is shown. Note that “class” in this context has nothing to do with the R statement “class(x),” which gives you the class of object x.
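As a small illustration in base R, a bracketed class matches any one of the listed characters, and a leading `^` inside the brackets inverts it. The sample vector is hypothetical.

```r
x <- c("apple", "rhythm", "1234")

has_vowel    <- grepl("[aeiou]", x)  # at least one lowercase vowel
has_nondigit <- grepl("[^0-9]", x)   # at least one character that is not a digit

has_vowel     # TRUE FALSE FALSE
has_nondigit  # TRUE TRUE FALSE

class(x)      # "character": the unrelated R notion of class
```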
William Yarberry
Chapter 8. Elements of Regular Expressions
Abstract
Literals are simply the characters themselves, such as “a”, “boat”, or “123”. Some characters are reserved and carry special meanings, such as “+”. The plus sign is a quantifier: it means “one or more of the element immediately to its left.” To use a reserved character as a literal in a regex, escape it with a backslash. To match 1+1=2, the regex is 1\+1=2, which is written "1\\+1=2" in R because the backslash must itself be escaped inside an R string. Without the escape, the plus sign keeps its special meaning.

              string1 <- "This is elementary Watson. 1+1=2"
              my.regex <- "1\\+1=2"
              my.regex.replacement.value <- "two plus two equals four "
              sub(pattern = my.regex,
                  replacement = my.regex.replacement.value,
                  x = string1)
             
              ## [1] "This is elementary Watson. two plus two equals four "
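To see why the escape matters, this sketch compares the unescaped pattern, the escaped pattern, and base R's `fixed = TRUE` option, which treats the whole pattern as plain text. The `fixed = TRUE` variant is an alternative offered here, not something the abstract mentions.

```r
string1 <- "This is elementary Watson. 1+1=2"

# Unescaped, "+" is a quantifier: "1+1=2" means one or more "1" followed by
# "1=2", so it cannot match the literal plus sign in string1.
unescaped <- grepl("1+1=2", string1)
escaped   <- grepl("1\\+1=2", string1)              # plus sign escaped
as_text   <- grepl("1+1=2", string1, fixed = TRUE)  # no regex interpretation

unescaped  # FALSE
escaped    # TRUE
as_text    # TRUE
```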
            
William Yarberry
Chapter 9. The Magnificent Seven
Abstract
R uses seven regular expression functions for pattern matching and replacement. If you know how to use these functions with appropriate regular expression patterns, then you have a worthy and efficient toolkit for most data science applications.
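The abstract does not name the seven functions, so this sketch only shows a few base R functions that accept regular expressions. Which functions to show is an assumption, not the chapter's own list, and the sample vector is hypothetical.

```r
x <- c("apple pie", "banana split", "pear tart")

hits      <- grepl("an", x)                      # does each element match?
matching  <- grep("an", x, value = TRUE)         # the matching elements
first_sub <- sub("a", "A", x)                    # replace the first match only
all_sub   <- gsub("a", "A", x)                   # replace every match
first_wrd <- regmatches(x, regexpr("[a-z]+", x)) # extract the first match

hits       # FALSE TRUE FALSE
matching   # "banana split"
first_sub  # "Apple pie" "bAnana split" "peAr tart"
all_sub    # "Apple pie" "bAnAnA split" "peAr tArt"
first_wrd  # "apple" "banana" "pear"
```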
William Yarberry
Chapter 10. Regular Expressions in Stringr
Abstract
You can use the following prebuilt classes in the package stringr:
William Yarberry
Chapter 11. Unicode
Abstract
ASCII coding is familiar to most of us. Unfortunately, ASCII is the exception: a simple, plain, “white bread” version of the character world. Latin-1 is less commonly used. Other representations of characters depend on the standard selected, which in turn may depend on the locale. As an example, consider the trademark sign. In MS Word and other MS Office applications, Ctrl-Alt-T or Alt8482 will create the trademark sign: ™. In R, it is written \u2122; outside R, the Unicode code point is U+2122. In HTML hex, use &#x2122;. Once you determine the coding scheme, finding the values is straightforward, since they are well documented on the Internet. To find information on Unicode characters, see the official website of the Unicode Consortium, www.Unicode.org. Another helpful resource is www.utf8-chartable.de/.
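The relationship between the trademark sign, its decimal value 8482, and its hexadecimal code point 2122 can be checked in base R. This is a small sketch, and the sample string is made up.

```r
tm <- "\u2122"  # the trademark sign, written as a Unicode escape

code_point <- utf8ToInt(tm)                  # decimal code point
round_trip <- intToUtf8(code_point)          # back to the character
u_notation <- sprintf("U+%04X", code_point)  # conventional U+ notation

# The escape can also be searched for as plain text
found <- grepl("\u2122", "Acme\u2122 widgets", fixed = TRUE)

code_point  # 8482
u_notation  # "U+2122"
found       # TRUE
```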
William Yarberry
Chapter 12. Tools for Development and Resources
Abstract
The free and easy-to-use website www.regex101.com serves as a go-to sandbox when you are trying different approaches and need immediate feedback. See Figures 12-1 and 12-2. It is not specifically tuned to R, but the great majority of R regex configurations will be properly evaluated. To the right of the screen, explanatory comments are provided. Although it has far fewer features than Regex Buddy (discussed later), for many people, it will be sufficient. Since doing is closely tied to learning, spending a few hours with this tool on the front end will accelerate your learning curve.
William Yarberry
Chapter 13. RegEx Summary
Abstract
Pain, then gain: that’s the typical result of learning regular expressions. R includes many functions which duplicate regex’s capability for specific actions. However, the scope of regex pattern matching exceeds traditional R logic. Even knowing just a few regex examples will speed your code development and possibly reduce execution time of your script.
William Yarberry
Chapter 14. Recipes for Common R Tasks
Abstract
Load the following packages to execute the code in subsequent sections:

              library(tidyverse)
              library(readr)
              library(datasets)
            
William Yarberry
Chapter 15. Data Structures
Abstract
MASS is a commonly used source of datasets for practicing R code. Figure 15-1 shows a partial list of datasets available. Although it is tempting when you have a new project to jump in using your actual data, consider working through your code first with a toy, built-in dataset so that you are varying only one thing at a time—get the code logic down first, and then work with your own data.

              library(MASS)
              data() # lists datasets in all attached packages, now including MASS
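
To narrow the listing to MASS alone, one option is the `package` argument of `data()`. This is a sketch, and the choice of `Cars93` as the practice dataset is only an example.

```r
library(MASS)

# Names of the datasets shipped with MASS only
mass_items <- data(package = "MASS")$results[, "Item"]
head(mass_items)

# Rehearse code on a built-in dataset before switching to your own data
str(Cars93)
```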
            
William Yarberry
Chapter 16. Visualization
Abstract
Most of these examples use the R workhorse, ggplot2. The ggplot2 package has had many spin-offs and is a gold mine of logically structured visualizations. The following sections are a smorgasbord of visualizations from ggplot2 and other packages. ggplot2 gets loaded automatically with the library(tidyverse) command.
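As a minimal sketch of the ggplot2 layering style, assuming the package is installed, here is a scatter plot of the built-in mtcars data. The choice of variables is for illustration only.

```r
library(ggplot2)

# Map columns to axes, then add a point layer and axis labels
p <- ggplot(mtcars, aes(x = wt, y = mpg)) +
  geom_point() +
  labs(x = "Weight (1000 lbs)", y = "Miles per gallon")

print(p)
```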
William Yarberry
Chapter 17. Simple Prediction Methods
Abstract
Predictive modeling has gotten sophisticated over the years. It is a major discipline of data science and includes some of the most sophisticated mathematics and statistical engineering found anywhere. That being said, some of the tools for prediction are straightforward. We can start with the package prophet, graciously given to the world by Facebook. If you compare prophet to some of the older time series prediction methods, you will be impressed with its simplicity. It has all the trending, seasonality, and other mathematical patterns of older systems but does not force you to “get involved.” You can just enter a dataframe with one column of dates and another column of some numeric value (birds per square mile, prisoners in Alabama, rain forest size, etc.) and then predict future values. The most time-consuming part is getting the dates and column headers in the format prophet requires. After prophet, I’ll present an older method, Holt-Winters (time series) and multivariate regression, where one or more variables (predictors) are used to estimate some variable of interest, termed the response.
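The two older methods mentioned above can be sketched with base R alone. The datasets (`AirPassengers`, `mtcars`) and the predictors are assumptions chosen for illustration, not the book's own examples. prophet is left out here; it expects a data frame with a date column named `ds` and a numeric column named `y`.

```r
# Holt-Winters on a built-in monthly time series
hw_fit      <- HoltWinters(AirPassengers)
hw_forecast <- predict(hw_fit, n.ahead = 12)  # forecast the next 12 months

# Regression with two predictors (wt, hp) and one response (mpg)
lm_fit <- lm(mpg ~ wt + hp, data = mtcars)
coef(lm_fit)

# Predicted response for a hypothetical car
predict(lm_fit, newdata = data.frame(wt = 3, hp = 110))
```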
William Yarberry
Chapter 18. Smorgasbord of Simple Statistical Tests
Abstract
The following is a sampling of one-liner tests. They provide a wealth of information on any numeric series/column with little code required. If you are doing upfront data exploration, put a chunk of these at the beginning of your program.
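The chapter's own list is not shown in this abstract, so the following is only a sketch of the kind of base R one-liners that fit the description, run on a built-in column.

```r
x <- mtcars$mpg

summary(x)                       # min, quartiles, mean, max
sd(x)                            # standard deviation
quantile(x, c(0.05, 0.95))       # 5th and 95th percentiles
normality <- shapiro.test(x)     # Shapiro-Wilk normality test
t.test(x, mu = 20)               # one-sample t-test against a mean of 20
cor.test(mtcars$mpg, mtcars$wt)  # Pearson correlation test

normality
```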
William Yarberry
Chapter 19. Validation of Data
Abstract
After spending five minutes doing data science, everyone knows that data preparation, including validation, is the most time-consuming step of any analysis. Several cleanup packages have been developed, including janitor and validate. Figure 19-1, from the validate package, shows a convenient graphic of three mtcars variables. It meets the data science trifecta: simple, quick, and handy.
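The validate and janitor calls themselves are not shown in this abstract. As a stand-in, this sketch runs a few rule-style checks on mtcars in base R. The three variables and the thresholds are assumptions for illustration, and this is not the validate package's API.

```r
# Each entry is one validation rule that should hold for the whole column
checks <- c(
  no_missing   = !anyNA(mtcars[, c("mpg", "cyl", "hp")]),
  mpg_positive = all(mtcars$mpg > 0),
  cyl_valid    = all(mtcars$cyl %in% c(4, 6, 8)),
  hp_in_range  = all(mtcars$hp > 0 & mtcars$hp < 1000)
)

checks                  # named logical vector, one result per rule
names(checks)[!checks]  # names of any failed rules
```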
William Yarberry
Chapter 20. Shortcuts and Miscellaneous
Abstract
You can certainly use Notepad or base R to write your scripts, in the same sense as a dog could walk on his hind legs to get places. It works but not well. RStudio is built for the purpose of efficient, effective, and beginner-friendly script coding.
William Yarberry
Chapter 21. Conclusion
Abstract
My intent in writing this book was to provide useful code that you can put to work quickly. Some of the topics, such as the prediction models, are the subject of thick books and thousands of mathematical, scientific, and business-related papers. I wanted to show that with a few lines of code some functionality is possible without investing months or years going up the learning curve. Decades ago, those who controlled the old mainframes were called “high priests”—not necessarily a complimentary term. R has the potential to help people across the planet get value from data and so should be democratized as much as possible. Obviously if you have a need to do more in-depth analysis, there are terrific books available from Apress and others to get the expertise you need. But everyone has a “day one” in any subject. We should not discourage new users of the language by making it onerous to perform basic analysis. Success motivates.
William Yarberry
Backmatter
Metadata
Title
CRAN Recipes
Written by
William Yarberry
Copyright Year
2021
Publisher
Apress
Electronic ISBN
978-1-4842-6876-6
Print ISBN
978-1-4842-6875-9
DOI
https://doi.org/10.1007/978-1-4842-6876-6