class: inverse, center, bottom

<img src="https://d33wubrfki0l68.cloudfront.net/b88ef926a004b0fce72b2526b0b5c4413666a4cb/24a30/cover.png" width="150" />

# Chapter 14: Strings
## RStudio Instructor Training Study Session
### Silvia Canelón, PhD
### November 14th, 2020

---
class: center, middle

# [Introduction](https://r4ds.had.co.nz/strings.html#introduction-8)

This chapter focuses on **regular expressions**, or **regexps**

--

"**regexps** are a concise language for describing patterns in strings"

--

The focus will be on the `stringr` package

![:scale 15%](https://raw.githubusercontent.com/rstudio/hex-stickers/master/PNG/stringr.png)

---

# String basics

` "Quotes can be written with double quotes"`...

` 'or with single quotes'`

but the recommendation is to use double quotes unless you want to include a `'string like this one which uses "double quotes" inside of it'`

----

To include a quote literally, use `\` to "escape" it:

- `"\""` for a double quote
- `'\''` for a single quote
- `"\\"` for a literal backslash (you have to "escape" the "escape")

---

### String length

We can use `str_length()` to find the **length** of a **str**ing

```r
str_length(c("MiR", "hosts", "fun and supportive", "study sessions"))
## [1]  3  5 18 14
```

--

### Combining strings

We can use `str_c()` to combine strings; I like to read it as "**str**ing **c**ombine"

```r
str_c("Dorris", "Yanina", "Laurie", "Silvia")
## [1] "DorrisYaninaLaurieSilvia"
```

And we can use `sep = ` to specify how we want to **sep**arate them

```r
str_c("Dorris", "Yanina", "Laurie", "Silvia", sep = ", ")
## [1] "Dorris, Yanina, Laurie, Silvia"
```

---

### Combining strings (cont'd)

We can collapse a _vector_ of strings into a _single_ string using `collapse`

```r
str_c(c("Dorris", "Yanina", "Laurie", "Silvia"), collapse = ", ")
## [1] "Dorris, Yanina, Laurie, Silvia"
```

--

This can be helpful when we're writing a formula to be used in a model:

```r
predictors = c("species", 
"island", "sex")
predictors_collapsed = str_c(predictors, collapse = " + ")
predictors_collapsed
## [1] "species + island + sex"

formula_penguins = as.formula(str_c('body_mass_g ~ ', predictors_collapsed))
formula_penguins
## body_mass_g ~ species + island + sex
```

```r
lm(data = palmerpenguins::penguins, formula = formula_penguins) # linear model
```

---

### Subsetting strings

We can use `str_sub()` to extract parts of a string. It helps me to think about it like **str**ing **sub**set.

```r
study_buddies <- c("Dorris", "Yanina", "Laurie", "Silvia")

# positive numbers count forwards from beginning
str_sub(study_buddies, 1, 3)
## [1] "Dor" "Yan" "Lau" "Sil"

# negative numbers count backwards from end
str_sub(study_buddies, -3, -1)
## [1] "ris" "ina" "rie" "via"
```

--

We all go by first names with 6 letters!

---

### Transforming strings

`str_to_lower()` and `str_to_upper()` are two examples of string transformations

```r
study_buddies
## [1] "Dorris" "Yanina" "Laurie" "Silvia"

str_to_lower(study_buddies) # transforms all to lowercase
## [1] "dorris" "yanina" "laurie" "silvia"

str_to_upper(study_buddies) # transforms all to uppercase
## [1] "DORRIS" "YANINA" "LAURIE" "SILVIA"
```

--

**Note:** Lower- and uppercase rules can vary by language, so you may want to also specify the `locale` using [ISO 639-1 codes](https://en.wikipedia.org/wiki/List_of_ISO_639-1_codes) for the specific language

```r
str_to_upper(study_buddies, locale = "en") # example for English
## [1] "DORRIS" "YANINA" "LAURIE" "SILVIA"

str_to_upper(study_buddies, locale = "tr") # example for Turkish
## [1] "DORRİS" "YANİNA" "LAURİE" "SİLVİA"
```

---

# Matching patterns with regular expressions

### Basic matches

.pull-left[
```r
str_view(study_buddies, "i")
```
`str_view()` highlights the **first** match in each string,<br>like the first `i` in Silvia
]

--

.pull-right[
```r
str_view_all(study_buddies, ".i.")
```
`str_view_all()` highlights **all** matches in each string,<br>like both `i` segments in Silvia
]

---

### Anchors

You can **anchor** the regular expression so that it matches only at the start (`^`) or the end (`$`) of the string

.pull-left[
```r
str_view(study_buddies, "^S")
```
]

.pull-right[
```r
str_view(study_buddies, "s$")
```
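
The same anchored patterns can also be checked with `str_detect()` (covered under Tools). This is only a minimal sketch; `starts_with_s` and `ends_with_s` are names made up for this example.

```r
library(stringr)

study_buddies <- c("Dorris", "Yanina", "Laurie", "Silvia")

# "^S" only matches an uppercase S at the start of the string
starts_with_s <- str_detect(study_buddies, "^S")
study_buddies[starts_with_s]
## [1] "Silvia"

# "s$" only matches a lowercase s at the end of the string
ends_with_s <- str_detect(study_buddies, "s$")
study_buddies[ends_with_s]
## [1] "Dorris"
```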
]

--

<br>

One way to remember when to use each is:

> if you begin with power (`^`), you end up with money (`$`)
>
> -- R4DS, from [Evan Misshula](https://twitter.com/emisshula/status/323863393167613953)

---

### Character classes and alternatives

Some special patterns match more than one character

- `.`: matches any character (apart from a newline `\n`)
- `\d`: matches any digit
- `\s`: matches any whitespace (e.g. space, tab, newline `\n`)
- `[abc]`: matches a, b, or c
- `[^abc]`: matches anything except a, b, or c

.pull-left[
```r
str_view(c("grey", "gray"), "[abc]y")
```
]

.pull-right[
```r
str_view(c("grey", "gray"), "gr(e|a)y")
```
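
A small sketch of the classes listed on this slide. The `spellings` vector and the example strings are made up for illustration. Note that inside an R string `\d` and `\s` have to be written as `"\\d"` and `"\\s"`, because the backslash itself needs escaping.

```r
library(stringr)

# hypothetical example vector, not from the chapter
spellings <- c("grey", "gray", "gruy")

# a character class and an alternation can describe the same choices
str_detect(spellings, "gr[ea]y")
## [1]  TRUE  TRUE FALSE
str_detect(spellings, "gr(e|a)y")
## [1]  TRUE  TRUE FALSE

# \d (written "\\d" in R) matches a digit
str_extract("Chapter 14: Strings", "\\d+")
## [1] "14"

# \s (written "\\s" in R) matches whitespace
str_replace_all("study   sessions", "\\s+", " ")
## [1] "study sessions"

# [^aeiou] matches any character except the ones listed
str_extract("island", "[^aeiou]")
## [1] "s"
```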
]

---

### Repetition

.pull-left[
Decide how many times a pattern matches:

- `?`: 0 or 1
- `+`: 1 or more
- `*`: 0 or more

```r
str_view(study_buddies, "n+")
```
]

.pull-right[
Decide the number of matches precisely:

- `{n}`: exactly n
- `{n,}`: n or more
- `{,m}`: at most m
- `{n,m}`: between n and m

```r
str_view(study_buddies, "r{1,}")
```
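
To see what each quantifier actually grabs, here is a minimal sketch using `str_extract()` and `str_detect()` (both covered under Tools). The pattern `"i?a"` is an extra example that is not in the chapter slides.

```r
library(stringr)

study_buddies <- c("Dorris", "Yanina", "Laurie", "Silvia")

# "r+" and "r{1,}" both mean "one or more r"
str_extract(study_buddies, "r+")
## [1] "rr" NA   "r"  NA
str_extract(study_buddies, "r{1,}")
## [1] "rr" NA   "r"  NA

# "r{2}" needs exactly two r's in a row
str_detect(study_buddies, "r{2}")
## [1]  TRUE FALSE FALSE FALSE

# "i?a" is an "a" with an optional "i" right before it
str_extract(study_buddies, "i?a")
## [1] NA   "a"  "a"  "ia"
```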
]

---

### Grouping and backreferences

Parentheses create _numbered_ capturing groups, which store the part of the string matched by the part of the regular expression inside the parentheses.

> You can refer to the same text as previously matched by a capturing group with backreferences, like `\1`, `\2`

🤔 This finds all study buddy names with a doubled letter (the same character twice in a row) using `(.)` followed by the backreference `\1`:

```r
str_view(study_buddies, "(.)\\1", match = TRUE)
```
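
A minimal sketch of the same pattern outside of `str_view()`, to show what the capturing group holds. `has_double` is a name made up for this example, and `str_match()` is a `stringr` function that is not otherwise used in these slides.

```r
library(stringr)

study_buddies <- c("Dorris", "Yanina", "Laurie", "Silvia")

# (.) captures any one character; "\\1" requires that same character again
has_double <- str_detect(study_buddies, "(.)\\1")
study_buddies[has_double]
## [1] "Dorris"

# the full match is the doubled letter
str_extract(study_buddies, "(.)\\1")
## [1] "rr" NA   NA   NA

# str_match() returns a matrix: column 1 is the full match,
# column 2 is what capturing group 1 stored
str_match(study_buddies, "(.)\\1")[1, ]
## [1] "rr" "r"
```

Yanina repeats letters too, but never twice in a row, so the backreference does not match it.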
---

# Tools

- Determine which strings match a pattern.
- Find the positions of matches.
- Extract the content of matches.
- Replace matches with new values.
- Split a string based on a match.

### Which strings match a pattern

```r
study_buddies
## [1] "Dorris" "Yanina" "Laurie" "Silvia"

str_detect(study_buddies, "o")
## [1]  TRUE FALSE FALSE FALSE

str_count(study_buddies, "a")
## [1] 0 2 1 1
```

---

### Extracting matches

Here's an example used to **extract** any study buddies from a sentence.

```r
sentence <- c("The newest member of the study group is Laurie")
sentence
## [1] "The newest member of the study group is Laurie"

name_match <- str_c(study_buddies, collapse = "|")
name_match
## [1] "Dorris|Yanina|Laurie|Silvia"

matches <- str_extract(sentence, name_match)
matches
## [1] "Laurie"
```

---

### Replacing matches

```r
str_replace(study_buddies, "[aeiou]", "-")
## [1] "D-rris" "Y-nina" "L-urie" "S-lvia"

str_replace_all(study_buddies, "[aeiou]", "-")
## [1] "D-rr-s" "Y-n-n-" "L--r--" "S-lv--"

str_replace_all(study_buddies, c("Dorris" = "Scott",
                                 "Yanina" = "Bellini Saibene",
                                 "Laurie" = "Baker",
                                 "Silvia" = "Canelon"))
## [1] "Scott"           "Bellini Saibene" "Baker"           "Canelon"
```

---

.pull-left[
### Splitting

```r
study_buddies %>%
  str_split("i")
## [[1]]
## [1] "Dorr" "s"
##
## [[2]]
## [1] "Yan" "na"
##
## [[3]]
## [1] "Laur" "e"
##
## [[4]]
## [1] "S"  "lv" "a"
```
]

--

.pull-right[
### Finding matches

```r
str_locate_all(study_buddies, "i")
## [[1]]
##      start end
## [1,]     5   5
##
## [[2]]
##      start end
## [1,]     4   4
##
## [[3]]
##      start end
## [1,]     5   5
##
## [[4]]
##      start end
## [1,]     2   2
## [2,]     5   5
```
]

---

# Further reading and more advanced string patterns and matching

### [Other types of patterns](https://r4ds.had.co.nz/strings.html#other-types-of-pattern)

### [Other uses of regular expressions](https://r4ds.had.co.nz/strings.html#other-uses-of-regular-expressions)

### [stringi](https://r4ds.had.co.nz/strings.html#stringi)

`stringi` is a more comprehensive package that contains _almost every_ string manipulation function you might ever need, whereas `stringr` contains the most common ones

---
class: inverse, center, middle

# The End