MP#03 peer feedback in process
Brightspace - Wednesdays at 11:45
Next (and final!) pre-assignment is 2025-05-07 at 11:45pm ET
I am behind on reading PA comments:
I returned:
I still owe you:
I will owe you:
Semester end is coming quickly!
That’s it!
I have posted a course feedback survey at
https://baruch.az1.qualtrics.com/jfe/form/SV_9uyZ4YFsrcRRPIG
Comments very welcome (but not required)
Possible MP ideas:
Good - Purpose (“Business Logic”):
Good - Higher Level Structure (Example from googledrive package):
# https://github.com/gaborcsardi/rencfaq#with-base-r
write_utf8 <- function(text, path = NULL) {
  # sometimes we use writeLines() basically to print something for a snapshot
  if (is.null(path)) {
    return(base::writeLines(text))
  }
  # step 1: ensure our text is utf8 encoded
  utf8 <- enc2utf8(text)
  upath <- enc2utf8(path)
  # step 2: create a connection with 'native' encoding
  # this signals to R that translation before writing
  # to the connection should be skipped
  con <- file(upath, open = "w+", encoding = "native.enc")
  withr::defer(close(con))
  # step 3: write to the connection with 'useBytes = TRUE',
  # telling R to skip translation to the native encoding
  base::writeLines(utf8, con = con, useBytes = TRUE)
}
More Advice on StackOverflow
In R, strings and characters are basically interchangeable
stringr provides basic tools for string manipulation
stringi provides advanced functionality
Easy to get 90% of the way correct - very hard to get 100% correct
Human language is messy - choices are culturally-specific
Unicode standard exists to make it easy (easier…) to do the right thing
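When to use the ^ and $ anchors?
To pin a match to the start (^) or end ($) of the string; with regex(multiline = TRUE), they match at the start and end of each line.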
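A minimal sketch of the anchors in action, assuming stringr is installed; the example strings are made up for illustration:

```r
library(stringr)

x <- c("apple pie", "pineapple", "apple")

# "^apple" matches only strings that START with "apple"
str_detect(x, "^apple")   # TRUE FALSE TRUE

# "apple$" matches only strings that END with "apple"
str_detect(x, "apple$")   # FALSE TRUE TRUE

# Both anchors together require the WHOLE string to match
str_detect(x, "^apple$")  # FALSE FALSE TRUE

# With multiline = TRUE the anchors apply to each line inside one string
two_lines <- "first\nsecond"
str_detect(two_lines, "^second")                           # FALSE
str_detect(two_lines, regex("^second", multiline = TRUE))  # TRUE
```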
str_split
str_detect has a negate option:
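A small illustration of negate, using a made-up vector; it should give the same answer as negating the result yourself:

```r
library(stringr)

fruits <- c("apple", "banana", "cherry")

# Which elements contain "an"?
str_detect(fruits, "an")                 # FALSE TRUE FALSE

# negate = TRUE flips the answer: which elements do NOT contain "an"?
str_detect(fruits, "an", negate = TRUE)  # TRUE FALSE TRUE

# Same result as applying ! to the ordinary call
identical(str_detect(fruits, "an", negate = TRUE), !str_detect(fruits, "an"))
```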
str_detect vs str_match vs str_extract
str_detect: is there a ‘fit’?
str_extract: extract the whole ‘fit’
str_match: extract specific groups
str_extract(group=) is useful for complex data extraction.
x <- c("Michael Weylandt teaches STA9750", "KRR teaches STA9891")
pattern <- c("(.*) teaches (.*)")
stringr::str_extract(x, pattern, group=1)
[1] "Michael Weylandt" "KRR"
stringr::str_extract(x, pattern, group=2)
[1] "STA9750" "STA9891"
(Not sure what negatives do here…)
Also allows named groups:
x <- c("Michael Weylandt teaches STA9750 on Thursday", "KRR teaches STA9891 on Wednesday")
pattern <- c("(?<instructor>.*) teaches (?<course>.*) on (?<weekday>.*)")
stringr::str_match(x, pattern) |> as.data.frame()
                                            V1       instructor  course   weekday
1 Michael Weylandt teaches STA9750 on Thursday Michael Weylandt STA9750  Thursday
2             KRR teaches STA9891 on Wednesday              KRR STA9891 Wednesday
Why the ? symbol? (Quantifiers)
Quantifiers (multiple matches):
.{a,b}: anywhere from a to b copies (inclusive)
.{,b}: no more than b copies
.{a,}: at least a copies
.?: zero-or-one, same as .{0,1}
.*: zero-or-more, same as .{0,}
.+: one-or-more, same as .{1,}
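To see the quantifiers side by side, here is a short sketch that applies each one to the letter "a" rather than to "."; the test strings are invented for illustration:

```r
library(stringr)

# Strings with 0, 1, 2, 3 and 4 copies of "a" between "c" and "t"
x <- c("ct", "cat", "caat", "caaat", "caaaat")

str_detect(x, "^ca?t$")      # zero-or-one:   TRUE  TRUE  FALSE FALSE FALSE
str_detect(x, "^ca*t$")      # zero-or-more:  TRUE  TRUE  TRUE  TRUE  TRUE
str_detect(x, "^ca+t$")      # one-or-more:   FALSE TRUE  TRUE  TRUE  TRUE
str_detect(x, "^ca{2,3}t$")  # two to three:  FALSE FALSE TRUE  TRUE  FALSE
str_detect(x, "^ca{2,}t$")   # at least two:  FALSE FALSE TRUE  TRUE  TRUE
```

The anchors matter here: without ^ and $ the pattern only has to match somewhere inside the string.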
stringr vs grep / grepl
Ultimately the same functionality, but stringr has a more consistent interface.
Conversion table online
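A rough side-by-side of the two interfaces, with made-up course codes. The main difference is argument order: base R puts the pattern first, stringr puts the data first, which is what makes it pipe-friendly.

```r
library(stringr)

x <- c("STA9750", "STA9891", "CIS9000")

# Base R: pattern first, data second
grepl("^STA", x)

# stringr: data first, pattern second
str_detect(x, "^STA")

# Both give the same logical vector
identical(grepl("^STA", x), str_detect(x, "^STA"))  # TRUE

# The same pairing holds for replacement: gsub() vs str_replace_all()
gsub("[0-9]", "#", x)
str_replace_all(x, "[0-9]", "#")
```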
All stringr functions work well in dplyr pipelines (“vectorized”):
library(dplyr); library(stringr)
df <- data.frame(lower_letters = letters)
df |> mutate(upper_letters = str_to_upper(lower_letters))
lower_letters upper_letters
1 a A
2 b B
3 c C
4 d D
5 e E
6 f F
7 g G
8 h H
9 i I
10 j J
11 k K
12 l L
13 m M
14 n N
15 o O
16 p P
17 q Q
18 r R
19 s S
20 t T
21 u U
22 v V
23 w W
24 x X
25 y Y
26 z Z
If you know the source encoding:
If you don’t know the source, ….
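The code for these two cases is not shown on the slide, so the following is only one possible sketch. It assumes the text arrived as Latin-1 bytes, uses base R's iconv() when the source encoding is known, and uses stringi::stri_enc_detect() to guess when it is not. The byte vector is a made-up example.

```r
# Latin-1 bytes for "café": the é is the single byte 0xE9
raw_bytes <- as.raw(c(0x63, 0x61, 0x66, 0xe9))
x <- rawToChar(raw_bytes)

# Known source encoding: convert explicitly to UTF-8
y <- iconv(x, from = "latin1", to = "UTF-8")
y == "caf\u00e9"   # TRUE
nchar(y)           # 4 characters

# Unknown source encoding: ask stringi for ranked guesses.
# The result is a list of data frames (encoding, language, confidence);
# treat it as a hint, not a guarantee, especially for short inputs.
guesses <- stringi::stri_enc_detect(raw_bytes)
head(guesses[[1]])
```

Detection is statistical, so a known encoding is always preferable to a guessed one.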
As of Thursday morning, on the fritz so you likely need to copy exercises into local RStudio
Room | Team | Room | Team |
---|---|---|---|
1 | Team Mystic | 5 | Money Team + CWo. |
2 | Subway Metrics | 6 | Lit Group |
3 | Noise Busters | 7 | Cinephiles + VG |
4 | AI Impact Col | 8 | |
First, we will complete the cocktail scraping exercise from last week.
Instructions and pointers can be found here
Now, complete the second scraping exercise in your small groups
Recall the basic theory of statistical tests - “goodness of fit”
75+ Years of Theory
\(Z\)-values, \(t\)-values, \(p\)-values, etc.
Typically requires ‘big math’
Alternative:
Using a computer’s pseudo-random number generator (PRNG)
Repeat:
Sample average (LLN)
\[\frac{1}{n} \sum_{i=1}^n f(X_i) \to \mathbb{E}[f(X)]\]
Holds for arbitrary related quantities (quantiles, medians, variances)
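A quick numerical check of this idea, assuming standard normal draws; the seed and sample size are arbitrary choices:

```r
set.seed(9750)   # arbitrary seed, for reproducibility only
n <- 1e5
x <- rnorm(n)    # X_i ~ N(0, 1)

mean(x^2)            # estimates E[X^2] = 1
mean(abs(x))         # estimates E|X| = sqrt(2 / pi), about 0.798
median(x)            # estimates the median, 0
quantile(x, 0.975)   # estimates the 97.5% quantile, about 1.96
```

Increasing n tightens all four estimates around their theoretical values.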
Example: suppose we have \(X_i \overset{\text{iid}}{\sim} \mathcal{N}(0, \sigma^2)\) and we want to test \(H_0: \sigma=1\)
How to test?
Per Cochran’s theorem, under \(H_0\) the sample standard deviation satisfies \(S \sim \sqrt{\frac{\chi^2_{n-1}}{n-1}} = \frac{1}{\sqrt{n-1}} \chi_{n-1}\), a scaled \(\chi\) (not \(\chi^2\)) distribution
So reject \(H_0\) if \(S\) is above the critical value (1.26)
To get a critical value
To get a \(p\)-value:
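The simulation steps are not spelled out here, so this is one possible sketch. It assumes n = 20 and a one-sided 5% level (neither is stated on the slide, but together they give an analytic critical value of about 1.26), and the observed value s_obs is hypothetical.

```r
set.seed(9750)    # arbitrary seed
n      <- 20      # assumed sample size
alpha  <- 0.05    # assumed one-sided level
n_sims <- 10000

# Null distribution of S: draw n points from N(0, 1) and take sd(), many times
null_sds <- replicate(n_sims, sd(rnorm(n)))

# Critical value: upper-alpha quantile of the simulated null distribution
crit_sim <- quantile(null_sds, 1 - alpha)
# Analytic value from the chi distribution, for comparison
crit_exact <- sqrt(qchisq(1 - alpha, df = n - 1) / (n - 1))

# p-value: share of simulated S at least as large as the observed S
s_obs   <- 1.4    # hypothetical observed sample standard deviation
p_sim   <- mean(null_sds >= s_obs)
p_exact <- pchisq((n - 1) * s_obs^2, df = n - 1, lower.tail = FALSE)

c(crit_sim = unname(crit_sim), crit_exact = crit_exact)
c(p_sim = p_sim, p_exact = p_exact)
```

The simulated and analytic numbers should agree up to Monte Carlo error, which shrinks as n_sims grows.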
infer
The infer package automates much of this for common tests
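As an illustration of the workflow, here is a hedged sketch of a one-sample test of a mean with infer. The data are simulated, the null value mu = 0 is an assumption, and this is not necessarily the example used in class.

```r
library(infer)

set.seed(9750)                                # arbitrary seed
df <- data.frame(x = rnorm(20, mean = 0.3))   # made-up data

# Observed statistic: the sample mean
obs_mean <- df |>
  specify(response = x) |>
  calculate(stat = "mean")

# Null distribution: bootstrap resamples recentred at the hypothesised mean
null_dist <- df |>
  specify(response = x) |>
  hypothesize(null = "point", mu = 0) |>
  generate(reps = 1000, type = "bootstrap") |>
  calculate(stat = "mean")

# Simulation-based p-value
p_val <- get_p_value(null_dist, obs_stat = obs_mean, direction = "two-sided")
p_val
```

The same specify / hypothesize / generate / calculate pipeline covers other common tests by changing the statistic and the null.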
Seeking suggestions for next semester
This Week:
Longer Term:
If you want to vote in the upcoming NYC Mayoral Primary, it’s time to register to vote:
Primary voting begins in mid-June: need to register 10 days before
Comments
Bad - Trivial:
Bad - Opaque:
Bad - Redundant / Explaining Code
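The slide examples for these categories are not reproduced here, so the snippets below are hypothetical illustrations only, with made-up prices and a made-up tax rate:

```r
prices <- c(10, 20, 30)  # hypothetical pre-tax prices

# Bad - Trivial: restates what the code already says
total <- sum(prices)  # sum the prices

# Bad - Opaque: points at context the reader cannot see
adjusted <- prices * 1.08875  # see email

# Bad - Redundant / Explaining Code: narrates syntax, not intent
# call round() on adjusted with digits = 2 and store in rounded
rounded <- round(adjusted, digits = 2)

# Good - Purpose: says why the step exists
# Prices are quoted pre-tax; add sales tax (example rate) so the
# report shows what customers actually pay
with_tax <- round(prices * 1.08875, digits = 2)
```

The last comment survives a rewrite of the code beneath it; the first three become noise or go stale as soon as the code changes.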