2021 | Book

CRAN Recipes

DPLYR, Stringr, Lubridate, and RegEx in R

Author: William Yarberry

Publisher: Apress

Part of: Springer Professional "Wirtschaft+Technik", Springer Professional "Technik", Springer Professional "Wirtschaft"

About this book

Want to use the power of R sooner rather than later? Don’t have time to plow through wordy texts and online manuals? Use this book for quick, simple code to get your projects up and running. It includes code and examples applicable to many disciplines. Written in everyday language with a minimum of complexity, each chapter provides the building blocks you need to fit R’s astounding capabilities to your analytics, reporting, and visualization needs.

CRAN Recipes recognizes how needless jargon and complexity get in your way. Busy professionals need simple examples and intuitive descriptions; side trips and meandering philosophical discussions are left for other books.

Here R scripts are condensed, to the extent possible, to copy-paste-run format. Chapters and examples are structured to purpose rather than particular functions (e.g., “dirty data cleanup” rather than the R package name “janitor”). Everyday language eliminates the need to know functions/packages in advance.

What You Will Learn

Carry out input/output; visualizations; data munging; manipulations at the group level; and quick data exploration
Handle forecasting (multivariate, time series, logistic regression, Facebook’s Prophet, and others)
Use text analytics; sampling; financial analysis; and advanced pattern matching (regex)
Manipulate data using DPLYR: filter, sort, summarize, add new fields to datasets, and apply powerful IF functions
Create combinations or subsets of files using joins
Write efficient code using pipes to eliminate intermediate steps (MAGRITTR)
Work with string/character manipulation of all types (STRINGR)
Discover counts, patterns, and how to locate whole words
Do wild-card matching, extraction, and invert-match
Work with dates using LUBRIDATE
Fix dirty data; attractive formatting; bad habits to avoid

Who This Book Is For

Programmers/data scientists with at least some prior exposure to R.

Frontmatter

Chapter 1. DPLYR

Abstract

Dplyr is one of my favorite R packages. Its logical and consistent rules replace the older, motley collection of syntactically inconsistent packages and functions. It’s like a Swiss Army knife in the woods—don’t leave home without it.

William Yarberry

Chapter 2. Stringr

Abstract

The next two packages, Lubridate and Stringr, omit many exceptions and tricky, oddball situations that standard manuals include by necessity. From a technical perspective, there is nothing new in this book. Indeed, much of the code is copied, with slight modifications, from the excellent, free online manuals. The narrow purpose of this book is to give you enough knowledge to use the packages as quickly as possible. If you have programming experience, the explanations and examples here should have you competent within a day or two.

William Yarberry

Chapter 3. Lubridate: Date and Time Processing

Abstract

Lubridate starts out answering a simple question—is it AM or PM, based on a date and hour? From there, it gets more complex but maintains a consistent approach to working with dates, times, and the combination date-times.
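
As a small illustration of the AM/PM question, the sketch below uses lubridate's am(), pm(), and hour() helpers. The two timestamps are invented for the example and are not taken from the book.

```r
library(lubridate)

# Two hypothetical timestamps (assumed values); ymd_hm() parses them as UTC
stamps <- ymd_hm(c("2021-03-15 09:30", "2021-03-15 18:45"))

hours      <- hour(stamps)  # hour component of each date-time
is_morning <- am(stamps)    # TRUE when the hour is before noon
is_evening <- pm(stamps)    # TRUE when the hour is noon or later

print(data.frame(stamps, hours, is_morning, is_evening))
```

The same helpers work on a whole column of date-times, so they can be used directly to build a filter or a new field.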

William Yarberry

Chapter 4. Regular Expressions: Introduction

Abstract

First impressions of regular expressions are rarely positive. They look arcane, a throwback to the early decades of modern computing, when GUIs and object-oriented programming were a distant future. However, once you get past their look and feel, regular expressions give you serious power to search, filter, and manipulate text and numbers with speed and minimal code.

William Yarberry

Chapter 5. Typical Uses

Abstract

The simplest patterns match exact strings:


              x <- c("apple", "banana", "pear")
              str_extract(x, "an")
             
              ## [1] NA   "an" NA

William Yarberry

Chapter 6. Some Simple Patterns

Abstract

Regular expressions, taken to their extreme, can make you feel irregular. For many R programming tasks, only a few meta-characters need be used. Table 6-1 is from an excellent Loyola Marymount University website. In R, the regular expression will be enclosed in quotes and used in one of the seven functions listed in Chapter 9, “The Magnificent Seven.”

William Yarberry

Chapter 7. Character Classes

Abstract

A character class is how regex understands which characters should be considered for a match or “anti-match”—anything but what is shown. Note that “class” in this context has nothing to do with the R statement “class(x),” which gives you the class of object x.
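
A minimal base R sketch of a class and its “anti-match” form may help here. The word list is made up for illustration.

```r
words <- c("sky", "tree", "rhythm", "idea")  # hypothetical sample words

# [aeiou] matches any single character listed inside the brackets
has_vowel <- grepl("[aeiou]", words)

# [^aeiou] is the negated class: any single character NOT listed.
# Removing every such character leaves only the vowels behind.
vowels_only <- gsub("[^aeiou]", "", words)

print(has_vowel)    # FALSE TRUE FALSE TRUE
print(vowels_only)  # "" "ee" "" "iea"
```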

William Yarberry

Chapter 8. Elements of Regular Expressions

Abstract

Literals are simply the characters themselves, such as “a” or “boat” or “123.” Some characters are “reserved” with special meanings, such as “+.” The plus sign is a quantifier: it means “one or more of the item just to its left.” If you want to use any of these reserved characters as a literal in a regex, you need to escape it with a backslash. To match 1+1=2, the regex is 1\+1=2, which is typed in an R string as "1\\+1=2" because R itself treats a single backslash as an escape character. Without the escape, the plus sign keeps its special meaning.


              string1 <- "This is elementary Watson. 1+1=2"
              my.regex <- "1\\+1=2"
              my.regex.replacement.value <- "two plus two equals four "
              sub(pattern = my.regex,replacement =
                my.regex.replacement.value,x = string1)
             
              ## [1] "This is elementary Watson. two plus two equals four "
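
To see what goes wrong without the escape, the following base R sketch compares the unescaped and escaped patterns. The fixed = TRUE variant is an alternative offered by base R, not something the abstract itself mentions.

```r
target <- "1+1=2"

# Unescaped: "1+" means "one or more 1s", so the pattern needs text like "11=2"
unescaped_hit <- grepl("1+1=2", target)   # FALSE
quantifier_hit <- grepl("1+1=2", "11=2")  # TRUE

# Escaped: "\\+" in R source is the two-character regex \+, a literal plus
escaped_hit <- grepl("1\\+1=2", target)   # TRUE

# The R string "1\\+1=2" holds six characters; the doubled backslash is one
pattern_length <- nchar("1\\+1=2")        # 6

# Alternative: treat the whole pattern as literal text, no regex at all
fixed_hit <- grepl("1+1=2", target, fixed = TRUE)  # TRUE
```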

William Yarberry

Chapter 9. The Magnificent Seven

Abstract

R uses seven regular expression functions for pattern matching and replacement. If you know how to use these functions with appropriate regular expression patterns, then you have a worthy and efficient toolkit for most data science applications.
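
The abstract does not name the seven functions. The sketch below assumes they are the base R functions documented together under ?grep (grep, grepl, sub, gsub, regexpr, gregexpr, and regexec) and runs each one on an invented vector.

```r
x <- c("apple pie", "banana split", "pear tart")  # hypothetical sample data

idx     <- grep("an", x)       # indices of matching elements: 2
flags   <- grepl("an", x)      # logical per element: FALSE TRUE FALSE
first   <- sub("a", "A", x)    # replace the first match in each element
every   <- gsub("a", "A", x)   # replace every match in each element
pos     <- regexpr("an", x)    # position of the first match, -1 if none
all_pos <- gregexpr("an", x)   # positions of all matches, one list entry each
parts   <- regexec("([a-z]+) ([a-z]+)", x)  # full match plus capture groups

# regmatches() is a helper (not one of the seven) that pulls out the text
pieces <- regmatches(x, parts)

print(first)        # "Apple pie" "bAnana split" "peAr tart"
print(every)        # "Apple pie" "bAnAnA split" "peAr tArt"
print(pieces[[1]])  # "apple pie" "apple" "pie"
```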

William Yarberry

Chapter 10. Regular Expressions in Stringr

Abstract

You can use the following prebuilt classes in the package stringr:
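
The list of classes is not reproduced in this abstract. As a sketch, assuming it includes the POSIX-style classes such as [[:digit:]] and [[:alpha:]], they can be used in stringr like this (the sample string is invented):

```r
library(stringr)

s <- "Order 42: 3 apples and 7 pears"  # hypothetical sample text

# [[:digit:]]+ matches runs of digits; [[:alpha:]]+ matches runs of letters
digits <- str_extract_all(s, "[[:digit:]]+")[[1]]
words  <- str_extract_all(s, "[[:alpha:]]+")[[1]]

print(digits)  # "42" "3" "7"
print(words)   # "Order" "apples" "and" "pears"
```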

William Yarberry

Chapter 11. Unicode

Abstract

ASCII coding is familiar to most of us. Unfortunately, ASCII is the exception: a simple, plain, “white bread” version of the character world. Latin-1 is less commonly used. Other representations of characters depend on the standard selected, which in turn may depend on the locale. As an example, consider the trademark sign. In MS Word and other MS Office applications, Ctrl-Alt-T or Alt8482 will create the trademark sign: ™. In an R string, it is written \u2122; outside R, the code point is written U+2122. In HTML hex, use &#x2122;. Once you determine the coding scheme, finding the values is straightforward, since they are well documented on the Internet. To find information on various Unicodes, see the official website of the Unicode Consortium, www.Unicode.org. Another helpful resource is www.utf8-chartable.de/.
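
The trademark example can be checked in base R. This sketch converts between the character, its decimal code point, and the U+ and HTML hex notations.

```r
tm <- "\u2122"                         # the trademark sign as an R string

code_point <- utf8ToInt(tm)            # decimal code point: 8482
u_notation <- sprintf("U+%04X", code_point)  # "U+2122"
html_hex   <- sprintf("&#x%X;", code_point)  # "&#x2122;"
round_trip <- intToUtf8(8482)          # back from the number to the character

print(c(u_notation, html_hex))
```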

William Yarberry

Chapter 12. Tools for Development and Resources

Abstract

The free and easy-to-use website www.regex101.com serves as a go-to sandbox when you are trying different approaches and need immediate feedback. See Figures 12-1 and 12-2. It is not specifically tuned to R, but the great majority of R regex configurations will be properly evaluated. To the right of the screen, explanatory comments are provided. Although it has far fewer features than Regex Buddy (discussed later), for many people, it will be sufficient. Since doing is closely tied to learning, spending a few hours with this tool on the front end will accelerate your learning curve.

William Yarberry

Chapter 13. RegEx Summary

Abstract

Pain, then gain: that’s the typical result of learning regular expressions. R includes many functions which duplicate regex’s capability for specific actions. However, the scope of regex pattern matching exceeds traditional R logic. Even knowing just a few regex examples will speed your code development and possibly reduce execution time of your script.

William Yarberry

Chapter 14. Recipes for Common R Tasks

Abstract

Load the following packages to execute the code in subsequent sections:


              library(tidyverse)
              library(readr)
              library(datasets)

William Yarberry

Chapter 15. Data Structures

Abstract

MASS is a commonly used source of datasets for practicing R code. Figure 15-1 shows a partial list of datasets available. Although it is tempting when you have a new project to jump in using your actual data, consider working through your code first with a toy, built-in dataset so that you are varying only one thing at a time—get the code logic down first, and then work with your own data.


              library(MASS)
              data() # lists the datasets in all attached packages, including MASS

William Yarberry

Chapter 16. Visualization

Abstract

Most of these examples use the R workhorse, ggplot2. The ggplot2 package has had many spin-offs and is a gold mine of logically structured visualizations. The following sections are a smorgasbord of visualizations from ggplot2 and other packages. ggplot2 gets loaded automatically with the library(tidyverse) command.
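
A minimal ggplot2 sketch, using the built-in mtcars data, shows the layered structure the abstract refers to. The choice of variables and labels is an assumption made for illustration.

```r
library(ggplot2)  # also attached by library(tidyverse)

# Data and aesthetic mappings first, then a geometry layer, then labels
p <- ggplot(mtcars, aes(x = wt, y = mpg, colour = factor(cyl))) +
  geom_point(size = 2) +
  labs(x = "Weight (1000 lbs)", y = "Miles per gallon", colour = "Cylinders")

print(p)  # draws the scatter plot on the active graphics device
```

Swapping geom_point() for another geom changes the chart type while the rest of the code stays the same.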

William Yarberry

Chapter 17. Simple Prediction Methods

Abstract

Predictive modeling has gotten sophisticated over the years. It is a major discipline of data science and includes some of the most sophisticated mathematics and statistical engineering found anywhere. That being said, some of the tools for prediction are straightforward. We can start with the package prophet, graciously given to the world by Facebook. If you compare prophet to some of the older time series prediction methods, you will be impressed with its simplicity. It has all the trending, seasonality, and other mathematical patterns of older systems but does not force you to “get involved.” You can just enter a dataframe with one column of dates and another column of some numeric value (birds per square mile, prisoners in Alabama, rain forest size, etc.) and then predict future values. The most time-consuming part is getting the dates and column headers in the format prophet requires. After prophet, I’ll present an older method, Holt-Winters (time series) and multivariate regression, where one or more variables (predictors) are used to estimate some variable of interest, termed the response.
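
The two older methods named above can be sketched with base R alone. The example below is not the book's code: it fits Holt-Winters to the built-in co2 series and a regression with two predictors to mtcars, and the new_car values are hypothetical. The prophet package is left out so the block runs without extra installs.

```r
# Holt-Winters on the built-in monthly co2 time series
hw_fit      <- HoltWinters(co2)
hw_forecast <- predict(hw_fit, n.ahead = 12)  # forecast the next 12 months

# Regression: mpg is the response, wt and hp are the predictors
lm_fit  <- lm(mpg ~ wt + hp, data = mtcars)
new_car <- data.frame(wt = 3.0, hp = 110)     # hypothetical car to score
lm_pred <- predict(lm_fit, newdata = new_car)

print(hw_forecast)
print(lm_pred)
```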

William Yarberry

Chapter 18. Smorgasbord of Simple Statistical Tests

Abstract

The following is a sampling of one-liner tests. They provide a wealth of information on any numeric series/column with little code required. If you are doing upfront data exploration, put a chunk of these at the beginning of your program.
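
The chapter's own list is not shown in this abstract. As a sketch of what such one-liners can look like, here are a few base R calls applied to one numeric column; the column choice and the test value of 20 are assumptions.

```r
x <- mtcars$mpg                                # any numeric column works here

overview  <- summary(x)                        # min, quartiles, mean, max
spread    <- sd(x)                             # standard deviation
quartiles <- quantile(x, c(0.25, 0.5, 0.75))   # selected percentiles
normality <- shapiro.test(x)                   # Shapiro-Wilk normality test
mean_test <- t.test(x, mu = 20)                # is the mean different from 20?

print(overview)
print(normality)
print(mean_test)
```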

William Yarberry

Chapter 19. Validation of Data

Abstract

After spending five minutes doing data science, everyone knows that data preparation, including validation, is the most time-consuming step of any analysis. Several cleanup packages have been developed, including janitor and validate. Figure 19-1, from the validate package, shows a convenient graphic of three mtcars variables. It meets the data science trifecta: simple, quick, and handy.
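
The idea of rule-based validation can be sketched in base R without either package. This is not the validate package's API; the three mtcars variables, the injected errors, and the rules are all assumptions made for the example.

```r
# Take three mtcars variables and inject two problems for illustration
df <- mtcars[, c("mpg", "cyl", "hp")]
df$mpg[3] <- NA     # a missing value
df$hp[5]  <- -10    # an impossible horsepower

# One row per rule, counting how many records break it
checks <- data.frame(
  rule = c("mpg is not missing", "cyl is 4, 6, or 8", "hp is positive"),
  failures = c(
    sum(is.na(df$mpg)),
    sum(!df$cyl %in% c(4, 6, 8)),
    sum(df$hp <= 0)
  )
)

print(checks)
```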

William Yarberry

Chapter 20. Shortcuts and Miscellaneous

Abstract

You can certainly use Notepad or base R to write your scripts, in the same sense as a dog could walk on his hind legs to get places. It works but not well. RStudio is built for the purpose of efficient, effective, and beginner-friendly script coding.

William Yarberry

Chapter 21. Conclusion

Abstract

My intent in writing this book was to provide useful code that you can put to work quickly. Some of the topics, such as the prediction models, are the subject of thick books and thousands of mathematical, scientific, and business-related papers. I wanted to show that with a few lines of code some functionality is possible without investing months or years going up the learning curve. Decades ago, those who controlled the old mainframes were called “high priests”—not necessarily a complimentary term. R has the potential to help people across the planet get value from data and so should be democratized as much as possible. Obviously if you have a need to do more in-depth analysis, there are terrific books available from Apress and others to get the expertise you need. But everyone has a “day one” in any subject. We should not discourage new users of the language by making it onerous to perform basic analysis. Success motivates.

William Yarberry

Backmatter

Title: CRAN Recipes
Author: William Yarberry
Publisher: Apress
Electronic ISBN: 978-1-4842-6876-6
Print ISBN: 978-1-4842-6875-9
DOI: https://doi.org/10.1007/978-1-4842-6876-6
