Chapter 14: Strings

class: inverse, center, bottom

# Chapter 14: Strings

## RStudio Instructor Training Study Session

### Silvia Canelón, PhD

### November 14th, 2020

---
class: center, middle

# [Introduction](https://r4ds.had.co.nz/strings.html#introduction-8)

This chapter focuses on **regular expressions**, or **regexps**

"**regexps** are a concise language for describing patterns in strings"

Focus will be on the `stringr` package

![:scale 15%](https://raw.githubusercontent.com/rstudio/hex-stickers/master/PNG/stringr.png)

---

# String basics

` "Quotes can be written with double quotes"`...

` 'or with single quotes'`

The recommendation is to use double quotes, unless the string itself contains double quotes: `'like this one, which uses "double quotes" inside of it'`

----

To include a quote or backslash literally, use `\` to "escape" it:

- `"\""` for a double quote
- `'\''` for a single quote
- `"\\"` for a literal backslash (you have to "escape" the "escape")
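
A minimal sketch of the difference between how an escaped string is printed and what it actually contains, assuming `stringr` is loaded. The vector name `escaped` is only illustrative:

```r
library(stringr)

# Illustrative vector: one double quote and one backslash
escaped <- c("\"", "\\")

# print() shows the escaped representation
print(escaped)
## [1] "\"" "\\"

# writeLines() shows the raw contents of each string
writeLines(escaped)
## "
## \

# Each element is a single character, even though we typed two
escaped_lengths <- str_length(escaped)
escaped_lengths
## [1] 1 1
```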

---

### String length

We can use `str_length()` to find the **length** of a **str**ing

```r
str_length(c("MiR", "hosts", "fun and supportive", "study sessions"))
## [1]  3  5 18 14
```

### Combining strings

We can use `str_c()` to combine strings; I like to read it as "**str**ing **c**ombine"

```r
str_c("Dorris", "Yanina", "Laurie", "Silvia")
## [1] "DorrisYaninaLaurieSilvia"
```

And we can use `sep = ` to specify how we want to **sep**arate them

```r
str_c("Dorris", "Yanina", "Laurie", "Silvia", sep = ", ")
## [1] "Dorris, Yanina, Laurie, Silvia"
```
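
`str_c()` is also vectorised, which the examples above do not show. A small sketch, with the greeting text and the name `greetings` made up for illustration:

```r
library(stringr)

# Length-one arguments are recycled across the longer vector
greetings <- str_c("Hello, ", c("Dorris", "Yanina"), "!")
greetings
## [1] "Hello, Dorris!" "Hello, Yanina!"

# Missing values are contagious: combining with NA gives NA
with_missing <- str_c("Hello, ", NA)
with_missing
## [1] NA
```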

---

### Combining strings (cont'd)

We can collapse a _vector_ of strings into a _single_ string using `collapse`

```r
str_c(c("Dorris", "Yanina", "Laurie", "Silvia"), collapse = ", ")
## [1] "Dorris, Yanina, Laurie, Silvia"
```

This can be helpful when we're writing a formula to be used in a model:

```r
predictors <- c("species", "island", "sex")
predictors_collapsed <- str_c(predictors, collapse = " + ")
predictors_collapsed
## [1] "species + island + sex"

formula_penguins <- as.formula(str_c("body_mass_g ~ ", predictors_collapsed))
formula_penguins
## body_mass_g ~ species + island + sex

```r
lm(data = palmerpenguins::penguins, formula = formula_penguins)  #linear model
```
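
As an alternative sketch, base R's `reformulate()` builds the same kind of formula directly from a character vector, without pasting a string first. This is not the approach used in the slide; the name `formula_alt` is illustrative:

```r
predictors <- c("species", "island", "sex")

# reformulate() (from the stats package) joins the term labels with "+"
# and puts the response on the left-hand side
formula_alt <- reformulate(predictors, response = "body_mass_g")
formula_alt
## body_mass_g ~ species + island + sex
```

Either version can be passed to `lm()` in the same way.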

---

### Subsetting strings

We can use `str_sub()` to extract parts of a string.

It helps me to think about it like **str**ing **sub**set.

```r
study_buddies <- c("Dorris", "Yanina", "Laurie", "Silvia")

# positive numbers count forwards from beginning
str_sub(study_buddies, 1, 3)
## [1] "Dor" "Yan" "Lau" "Sil"

# negative numbers count backwards from end
str_sub(study_buddies, -3, -1)
## [1] "ris" "ina" "rie" "via"
```

We all go by first names with 6 letters!
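
`str_sub()` also has an assignment form, which can modify part of a string in place. A short sketch using a copy of the names (the name `buddies_lower_initial` is illustrative):

```r
library(stringr)

buddies_lower_initial <- c("Dorris", "Yanina", "Laurie", "Silvia")

# Replace the first character of each name with its lowercase version
str_sub(buddies_lower_initial, 1, 1) <-
  str_to_lower(str_sub(buddies_lower_initial, 1, 1))
buddies_lower_initial
## [1] "dorris" "yanina" "laurie" "silvia"

# Asking for more characters than exist returns as much as possible
short_result <- str_sub("MiR", 1, 10)
short_result
## [1] "MiR"
```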

---

### Transforming strings

`str_to_lower()` and `str_to_upper()` are two examples of string transformations

```r
study_buddies
## [1] "Dorris" "Yanina" "Laurie" "Silvia"

str_to_lower(study_buddies)                   # transforms all to lowercase
## [1] "dorris" "yanina" "laurie" "silvia"

str_to_upper(study_buddies)                   # transforms all to uppercase
## [1] "DORRIS" "YANINA" "LAURIE" "SILVIA"
```

**Note:** Lower- and uppercase rules can vary by language, so you may want to also specify the `locale` using [ISO 639-1 codes](https://en.wikipedia.org/wiki/List_of_ISO_639-1_codes) for the specific language

```r
str_to_upper(study_buddies, locale = "en")    # example for English
## [1] "DORRIS" "YANINA" "LAURIE" "SILVIA"

str_to_upper(study_buddies, locale = "tr")    # example for Turkish
## [1] "DORRİS" "YANİNA" "LAURİE" "SİLVİA"
```
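
The same locale rules apply in the other direction. A small sketch, assuming the Turkish rules behave as in the uppercase example above, where `I` and `İ` are distinct letters:

```r
library(stringr)

# In English, uppercase I lowercases to a dotted i
lower_en <- str_to_lower("I", locale = "en")

# In Turkish, uppercase I lowercases to a dotless ı (U+0131)
lower_tr <- str_to_lower("I", locale = "tr")

c(en = lower_en, tr = lower_tr)
```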

---

# Matching patterns with regular expressions

### Basic matches

.pull-left[

```r
str_view(study_buddies, "i")
```

<div id="htmlwidget-4fda11c6266b396f6d2e" style="width:960px;height:100%;" class="str_view html-widget"></div>
<script type="application/json" data-for="htmlwidget-4fda11c6266b396f6d2e">{"x":{"html":"<ul>\n  <li>Dorr<span class='match'>i<\/span>s<\/li>\n  <li>Yan<span class='match'>i<\/span>na<\/li>\n  <li>Laur<span class='match'>i<\/span>e<\/li>\n  <li>S<span class='match'>i<\/span>lvia<\/li>\n<\/ul>"},"evals":[],"jsHooks":[]}</script>

`str_view()` highlights the **first** match in each string,<br>like the first `i` in Silvia
]

.pull-right[

```r
str_view_all(study_buddies, ".i.")
```

<div id="htmlwidget-b937ec0ac5b9e0e2974c" style="width:960px;height:100%;" class="str_view html-widget"></div>
<script type="application/json" data-for="htmlwidget-b937ec0ac5b9e0e2974c">{"x":{"html":"<ul>\n  <li>Dor<span class='match'>ris<\/span><\/li>\n  <li>Ya<span class='match'>nin<\/span>a<\/li>\n  <li>Lau<span class='match'>rie<\/span><\/li>\n  <li><span class='match'>Sil<\/span><span class='match'>via<\/span><\/li>\n<\/ul>"},"evals":[],"jsHooks":[]}</script>

`str_view_all()` highlights **all** matches,<br>like both `.i.` segments in Silvia
]
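
Because `.` matches any character, matching a literal dot needs an escape. A minimal sketch with a made-up vector `dotted`:

```r
library(stringr)

dotted <- c("abc", "a.c", "bef")

# "." is a wildcard here, so both "abc" and "a.c" match
wildcard_hits <- str_detect(dotted, "a.c")
wildcard_hits
## [1]  TRUE  TRUE FALSE

# "\\." escapes the dot: the regexp is \. and the R string needs a second backslash
literal_hits <- str_detect(dotted, "a\\.c")
literal_hits
## [1] FALSE  TRUE FALSE

# fixed() skips regular expression rules and compares the text as written
fixed_hits <- str_detect(dotted, fixed("a.c"))
fixed_hits
## [1] FALSE  TRUE FALSE
```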

---

### Anchors

You can **anchor** the regular expression to be more specific: `^` matches the start of the string and `$` matches the end

.pull-left[

```r
str_view(study_buddies, "^S")
```

<div id="htmlwidget-11d5e995c060d3cba76c" style="width:960px;height:100%;" class="str_view html-widget"></div>
<script type="application/json" data-for="htmlwidget-11d5e995c060d3cba76c">{"x":{"html":"<ul>\n  <li>Dorris<\/li>\n  <li>Yanina<\/li>\n  <li>Laurie<\/li>\n  <li><span class='match'>S<\/span>ilvia<\/li>\n<\/ul>"},"evals":[],"jsHooks":[]}</script>
]

.pull-right[

```r
str_view(study_buddies, "s$")
```

<div id="htmlwidget-a2c90b3da29e6e78cdcb" style="width:960px;height:100%;" class="str_view html-widget"></div>
<script type="application/json" data-for="htmlwidget-a2c90b3da29e6e78cdcb">{"x":{"html":"<ul>\n  <li>Dorri<span class='match'>s<\/span><\/li>\n  <li>Yanina<\/li>\n  <li>Laurie<\/li>\n  <li>Silvia<\/li>\n<\/ul>"},"evals":[],"jsHooks":[]}</script>

]
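
Using both anchors together forces the pattern to match the whole string. A short sketch with an illustrative vector `foods`:

```r
library(stringr)

foods <- c("apple pie", "apple", "apple cake")

# Without anchors, "apple" is found anywhere in the string
anywhere <- str_detect(foods, "apple")
anywhere
## [1] TRUE TRUE TRUE

# With both anchors, only the string that is exactly "apple" matches
whole_string <- str_detect(foods, "^apple$")
whole_string
## [1] FALSE  TRUE FALSE
```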

<br>
One way to remember when to use each is:

> if you begin with power (`^`), you end up with money (`$`)

> -- R4DS from [Evan Misshula](https://twitter.com/emisshula/status/323863393167613953)

---

### Character classes and alternatives

Some special patterns match more than one character

- `.`: matches any character (apart from a new line `\n`)
- `\d`: matches any digit.
- `\s`: matches any whitespace (e.g. space, tab, new line `\n`).
- `[abc]`: matches a, b, or c.
- `[^abc]`: matches anything except a, b, or c.

.pull-left[

```r
str_view(c("grey", "gray"), "[abc]y")
```

<div id="htmlwidget-c862012b850a488e0dcf" style="width:960px;height:100%;" class="str_view html-widget"></div>
<script type="application/json" data-for="htmlwidget-c862012b850a488e0dcf">{"x":{"html":"<ul>\n  <li>grey<\/li>\n  <li>gr<span class='match'>ay<\/span><\/li>\n<\/ul>"},"evals":[],"jsHooks":[]}</script>
]

.pull-right[

```r
str_view(c("grey", "gray"), "gr(e|a)y")
```

<div id="htmlwidget-c96669df060c62a0be6f" style="width:960px;height:100%;" class="str_view html-widget"></div>
<script type="application/json" data-for="htmlwidget-c96669df060c62a0be6f">{"x":{"html":"<ul>\n  <li><span class='match'>grey<\/span><\/li>\n  <li><span class='match'>gray<\/span><\/li>\n<\/ul>"},"evals":[],"jsHooks":[]}</script>
]
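
The character classes listed above can be sketched as follows. Note that inside an R string the backslash itself must be escaped, so `\d` is written `"\\d"`. The input strings are made up for illustration:

```r
library(stringr)

# \d+ matches the first run of digits
first_number <- str_extract("November 14th, 2020", "\\d+")
first_number
## [1] "14"

# \s matches each whitespace character
no_spaces <- str_replace_all("fun and supportive", "\\s", "_")
no_spaces
## [1] "fun_and_supportive"

# [^e] matches any single character except "e"
not_e <- str_detect(c("grey", "gray"), "gr[^e]y")
not_e
## [1] FALSE  TRUE
```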

---

### Repetition

.pull-left[
Decide how many times a pattern matches:

- `?`: 0 or 1
- `+`: 1 or more
- `*`: 0 or more

```r
str_view(study_buddies, "n+")
```

<div id="htmlwidget-162e1602e5cb58adc889" style="width:960px;height:100%;" class="str_view html-widget"></div>
<script type="application/json" data-for="htmlwidget-162e1602e5cb58adc889">{"x":{"html":"<ul>\n  <li>Dorris<\/li>\n  <li>Ya<span class='match'>n<\/span>ina<\/li>\n  <li>Laurie<\/li>\n  <li>Silvia<\/li>\n<\/ul>"},"evals":[],"jsHooks":[]}</script>
]

.pull-right[

Decide the number of matches precisely:

- `{n}`: exactly n
- `{n,}`: n or more
- `{0,m}`: at most m
- `{n,m}`: between n and m

```r
str_view(study_buddies, "r{1,}")
```

<div id="htmlwidget-b74e74ee318694d2a294" style="width:960px;height:100%;" class="str_view html-widget"></div>
<script type="application/json" data-for="htmlwidget-b74e74ee318694d2a294">{"x":{"html":"<ul>\n  <li>Do<span class='match'>rr<\/span>is<\/li>\n  <li>Yanina<\/li>\n  <li>Lau<span class='match'>r<\/span>ie<\/li>\n  <li>Silvia<\/li>\n<\/ul>"},"evals":[],"jsHooks":[]}</script>

]
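
A small sketch of how the quantifiers differ on the same input, using the Roman numeral for 1888 as a test string. By default these matches are greedy, so each takes as many characters as it is allowed to:

```r
library(stringr)

roman <- "MDCCCLXXXVIII"

# One "C" followed by an optional second "C"
optional_c <- str_extract(roman, "CC?")
optional_c
## [1] "CC"

# One "C" followed by one or more further "C"s
one_or_more <- str_extract(roman, "CC+")
one_or_more
## [1] "CCC"

# Exactly two "C"s
exactly_two <- str_extract(roman, "C{2}")
exactly_two
## [1] "CC"

# Between one and two "C"s
one_to_two <- str_extract(roman, "C{1,2}")
one_to_two
## [1] "CC"
```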

---

### Grouping and backreferences

Parentheses create _numbered_ capturing groups. Each group stores the part of the string matched by the part of the regular expression inside its parentheses

> You can refer to the same text as previously matched by a capturing group with backreferences, like `\1`, `\2` 🤔

This finds all study buddy names with a doubled letter (the same character twice in a row): `(.)` captures any character and `\\1` requires that same character again

```r
str_view(study_buddies, "(.)\\1", match = TRUE)
```

<div id="htmlwidget-46c19363f54a6a2fc388" style="width:960px;height:100%;" class="str_view html-widget"></div>
<script type="application/json" data-for="htmlwidget-46c19363f54a6a2fc388">{"x":{"html":"<ul>\n  <li>Do<span class='match'>rr<\/span>is<\/li>\n<\/ul>"},"evals":[],"jsHooks":[]}</script>
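
The same idea extends to longer groups. A sketch with made-up fruit names, where `(..)` captures a pair of characters and `\\1` requires that pair to repeat immediately; `str_match()` is used to show what the group captured:

```r
library(stringr)

fruits <- c("banana", "coconut", "papaya")

# Extract the first repeated pair of characters in each word
repeated_pairs <- str_extract(fruits, "(..)\\1")
repeated_pairs
## [1] "anan" "coco" "papa"

# str_match() returns the full match plus one column per capturing group
groups <- str_match("Dorris", "(.)\\1")
groups
##      [,1] [,2]
## [1,] "rr" "r"
```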

---

# Tools

`stringr` functions let us:

- Determine which strings match a pattern.
- Find the positions of matches.
- Extract the content of matches.
- Replace matches with new values.
- Split a string based on a match.

### Which strings match a pattern

```r
study_buddies
## [1] "Dorris" "Yanina" "Laurie" "Silvia"

str_detect(study_buddies, "o")
## [1]  TRUE FALSE FALSE FALSE

str_count(study_buddies, "a")
## [1] 0 2 1 1
```
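
Because `str_detect()` returns a logical vector, it combines naturally with `sum()` and `mean()`, and `str_subset()` keeps only the matching strings. A short sketch that redefines `study_buddies` so it runs on its own:

```r
library(stringr)

study_buddies <- c("Dorris", "Yanina", "Laurie", "Silvia")

# How many names end in "a"?
n_ending_in_a <- sum(str_detect(study_buddies, "a$"))
n_ending_in_a
## [1] 2

# What proportion of names contain "ri"?
prop_with_ri <- mean(str_detect(study_buddies, "ri"))
prop_with_ri
## [1] 0.5

# Which names start with "D" or "L"?
starts_d_or_l <- str_subset(study_buddies, "^[DL]")
starts_d_or_l
## [1] "Dorris" "Laurie"
```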

---

### Extracting matches

Here's an example that **extracts** a study buddy's name from a sentence.

```r
sentence <- c("The newest member of the study group is Laurie")
sentence
## [1] "The newest member of the study group is Laurie"

name_match <- str_c(study_buddies, collapse = "|")
name_match
## [1] "Dorris|Yanina|Laurie|Silvia"

matches <- str_extract(sentence, name_match)
matches
## [1] "Laurie"
```
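
`str_extract()` only returns the first match in each string. When a sentence mentions several names, `str_extract_all()` returns all of them. A sketch with a made-up sentence `sentence_many`:

```r
library(stringr)

study_buddies <- c("Dorris", "Yanina", "Laurie", "Silvia")
name_match <- str_c(study_buddies, collapse = "|")

sentence_many <- "Laurie and Silvia joined after Dorris"

# Only the first (leftmost) match is returned
first_name <- str_extract(sentence_many, name_match)
first_name
## [1] "Laurie"

# All matches, returned as a list with one element per input string
all_names <- str_extract_all(sentence_many, name_match)
all_names[[1]]
## [1] "Laurie" "Silvia" "Dorris"
```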

---

### Replacing matches

```r
str_replace(study_buddies, "[aeiou]", "-")
## [1] "D-rris" "Y-nina" "L-urie" "S-lvia"

str_replace_all(study_buddies, "[aeiou]", "-")
## [1] "D-rr-s" "Y-n-n-" "L--r--" "S-lv--"

str_replace_all(study_buddies, 
                c("Dorris" = "Scott", "Yanina" = "Bellini Saibene", 
                  "Laurie" = "Baker", "Silvia" = "Canelon"))
## [1] "Scott"           "Bellini Saibene" "Baker"           "Canelon"
```
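
Replacements can also use backreferences to reuse parts of the match. A minimal sketch that swaps first and last names; the vector `full_names` is illustrative and assumes exactly two words per name:

```r
library(stringr)

full_names <- c("Silvia Canelon", "Laurie Baker")

# \\1 and \\2 refer to the first and second capturing groups
last_first <- str_replace(full_names, "(\\w+) (\\w+)", "\\2, \\1")
last_first
## [1] "Canelon, Silvia" "Baker, Laurie"
```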

---

.pull-left[
### Splitting

```r
study_buddies %>% str_split("i")
## [[1]]
## [1] "Dorr" "s"   
## 
## [[2]]
## [1] "Yan" "na" 
## 
## [[3]]
## [1] "Laur" "e"   
## 
## [[4]]
## [1] "S"  "lv" "a"
```

]

.pull-right[
### Finding matches

```r
str_locate_all(study_buddies, "i")
## [[1]]
##      start end
## [1,]     5   5
## 
## [[2]]
##      start end
## [1,]     4   4
## 
## [[3]]
##      start end
## [1,]     5   5
## 
## [[4]]
##      start end
## [1,]     2   2
## [2,]     5   5
```
]
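
A few variations on the two functions above, sketched with illustrative inputs: `simplify = TRUE` returns a matrix instead of a list, `n` caps the number of pieces, and the positions from `str_locate()` can be fed back into `str_sub()`:

```r
library(stringr)

# A character matrix, padded with "" where a string has fewer pieces
split_matrix <- str_split(c("Dorris", "Silvia"), "i", simplify = TRUE)
split_matrix
##      [,1]   [,2] [,3]
## [1,] "Dorr" "s"  ""
## [2,] "S"    "lv" "a"

# Split into at most two pieces
two_pieces <- str_split("one, two, three", ", ", n = 2)[[1]]
two_pieces
## [1] "one"        "two, three"

# Locate the first match, then pull it out by position
position <- str_locate("Silvia", "lv")
located_text <- str_sub("Silvia", position[, "start"], position[, "end"])
located_text
## [1] "lv"
```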

---

# Further reading: more advanced patterns and matching

### [Other types of patterns](https://r4ds.had.co.nz/strings.html#other-types-of-pattern)

### [Other uses of regular expressions](https://r4ds.had.co.nz/strings.html#other-uses-of-regular-expressions)

### [stringi](https://r4ds.had.co.nz/strings.html#stringi)

`stringi` is a comprehensive package that contains almost every string manipulation function you might ever need, whereas `stringr`, which is built on top of it, focuses on the most common ones

---
class: inverse, center, middle

# The End

## <i class="fas  fa-book-open "></i>