STA 9750 - Week 12

Michael Weylandt

STA 9750 Mini-Project #04

MP#04 online now

  • Due 2024-12-04 (\(\approx\) 3 weeks - 2 remaining)
  • Topic: financial modeling
    • Comparison of two retirement plans
    • Historical data + Monte Carlo (“bootstrapping”)
  • Format:
    • Decision Analytics - Play the role of financial advisor
    • GitHub post AND Brightspace submission

Week 12 Pre-Assignment

Due at midnight tonight - take a moment to do it now if you haven’t already!

Going Forward

Pre-Assignments

Brightspace - Wednesdays at 11:45

  • Reading, typically on course website
  • Brightspace auto-grades
    • I have to manually change to completion grading

Next (and final!) pre-assignment is December 4th

Thank you for FAQs and (honest) team feedback. Keep it coming!

Next Semester Topics

  • NYC Open Data
  • Sports (Baseball?)
  • Spotify / Music
  • Healthcare / Pharmaceutical (might be tricky…)
  • Video Games
  • Quant Finance / Time Series?
  • Baruch Demographics
  • Job Market
  • Real Estate

Grading

Returned:

  • MP#02 grade
  • MP#02 meta-review grade
  • Videos now on Vocat

We owe you:

  • Mid-Term Check-In Feedback

Grading - Ex Post Adjustments

FYI: At the end of the course, I curve individual peer grades.

Example: If grader \(X\) is, on average, 5 points lower than other graders, I re-center all of their grades, raising their gradees' scores by an average of 1.25.

Try to be consistent over the semester so I can calibrate this correctly.
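
The arithmetic in that example can be sketched in R. The scores, the grader labels, and the assumption of four peer graders per gradee are all hypothetical; they are chosen only because a 5-point shift in one of four grades moves the average by 5 / 4 = 1.25. This illustrates the idea of re-centering and is not the actual adjustment procedure.

```r
# Hypothetical peer grades for one gradee from four graders (A, B, C, X).
# Assumption: grader X runs 5 points below the other graders on average.
grades <- c(A = 40, B = 42, C = 41, X = 36)
grader_offset <- c(A = 0, B = 0, C = 0, X = -5)

# Re-center: remove each grader's average offset from the grades they gave.
adjusted <- grades - grader_offset

# Change in the gradee's mean score: 5 / 4 = 1.25 points.
shift <- mean(adjusted) - mean(grades)
shift
```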

GitHub Notifications

Check your GitHub notifications, via email or at https://github.com/notifications, so that you see all peer feedback assignments.

  • I tag you in other folks’ repo when you are supposed to review

  • People tagged in your repo are evaluating you

Course Support

  • Synchronous
    • Office Hours 4x / week
      • MW Office Hours on Monday + Thursday
      • CR Tuesday + Friday
      • No OH during Thanksgiving break
  • Asynchronous
    • Piazza (\(38\) minute average response time)

Change: MW Thursday Zoom OH now 4:00pm to 5:00pm

Upcoming

Nov 27 - Thanksgiving Holiday (No Class on Nov 28)

  • Check-In Peer Feedback (Vocat)

Comments

Bad - Trivial:

# Set x to 3
x <- 3

Bad - Opaque:

# Follow instructions
x <- 3

Bad - Redundant / Explaining Code:

# Fit a model
mod <- model(x, y)

# Build a query
query_build() |> query_add() |> query_formulate()

Comments

Good - Purpose (“Business Logic”):

# Regulation XX.YY requires us to apply a risk multiplier to output
# As of 2024-11-01, CFO says risk multiplier is 3
# Cf. Email to risk group, subject "New Risk Multiplier"
RISK_MULTIPLIER <- 3

Good - Higher Level Structure (Example from googledrive package):

# https://github.com/gaborcsardi/rencfaq#with-base-r
write_utf8 <- function(text, path = NULL) {
  # sometimes we use writeLines() basically to print something for a snapshot
  if (is.null(path)) {
    return(base::writeLines(text))
  }

  # step 1: ensure our text is utf8 encoded
  utf8 <- enc2utf8(text)
  upath <- enc2utf8(path)

  # step 2: create a connection with 'native' encoding
  # this signals to R that translation before writing
  # to the connection should be skipped
  con <- file(upath, open = "w+", encoding = "native.enc")
  withr::defer(close(con))

  # step 3: write to the connection with 'useBytes = TRUE',
  # telling R to skip translation to the native encoding
  base::writeLines(utf8, con = con, useBytes = TRUE)
}

Comments

More Advice on StackOverflow

Pre-Assignment #12 FAQs

FAQ: Unicode Resources

FAQ: Regular Expression Tools

FAQ: Substrings and String Splitting

fruits <- c("apples and oranges and pears and bananas", 
            "pineapples and mangos and guavas")

stringr::str_split(fruits, " and ")
[[1]]
[1] "apples"  "oranges" "pears"   "bananas"

[[2]]
[1] "pineapples" "mangos"     "guavas"    
stringr::str_split_fixed(fruits, "and", n=2)
     [,1]          [,2]                            
[1,] "apples "     " oranges and pears and bananas"
[2,] "pineapples " " mangos and guavas"            

Sub-Strings / Splitting

x <- "Baruch College, CUNY"
stringr::str_sub(x, end=6) # Includes endpoints
[1] "Baruch"
stringr::str_sub(x, start=-4) # Count from end
[1] "CUNY"
x <- c("Baruch College, CUNY", "Brooklyn College, CUNY")
stringr::str_sub(x, end=-7) # Drop last _6_
[1] "Baruch College"   "Brooklyn College"

FAQ: Start and End Anchors

When to use the ^ and $ anchors?

Start and end of the string. (With regex(multiline = TRUE), they match the start and end of each line instead.)

  • Very useful for structured text (computer log outputs)
  • In data analysis, a bit less useful
    • Applied to output of str_split

FAQ: Exclusion + Detection

x <- c("10 blue fish", "three wet goats")
stringr::str_detect(x, "[^0123456789]")
[1] TRUE TRUE

str_detect has a negate option:

stringr::str_detect(x, "[0-9]", negate=TRUE)
[1] FALSE  TRUE
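
The first call returns TRUE for both strings because each contains at least one non-digit character. If negate is not an option, a sketch of the same "no digits anywhere" test using anchors might look like this:

```r
library(stringr)

x <- c("10 blue fish", "three wet goats")

# "[^0-9]" asks: is there at least one non-digit character? True for both.
any_non_digit <- str_detect(x, "[^0-9]")

# Anchoring the negated class forces it to cover the entire string,
# so this asks: does the string contain no digits at all?
no_digits <- str_detect(x, "^[^0-9]*$")

any_non_digit
no_digits
```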

FAQ: str_detect vs str_match vs str_extract

  • str_detect: is there a ‘fit’?
  • str_extract: extract the whole ‘fit’
  • str_match: extract specific groups
x <- "Baruch College, CUNY is a wonderful place to work!"
stringr::str_detect(x, "(.*), CUNY")
[1] TRUE
stringr::str_extract(x, "(.*), CUNY")
[1] "Baruch College, CUNY"
stringr::str_match(x, "(.*), CUNY")
     [,1]                   [,2]            
[1,] "Baruch College, CUNY" "Baruch College"

FAQ: Subset Selection + Indexing

str_extract(group=) is useful for complex data extraction.

x <- c("Michael Weylandt teaches STA9750", "KRR teaches STA9891")
pattern <- c("(.*) teaches (.*)")
stringr::str_extract(x, pattern, group=1)
[1] "Michael Weylandt" "KRR"             
stringr::str_extract(x, pattern, group=2)
[1] "STA9750" "STA9891"

(Not sure what negatives do here…)

Also allows named groups:

x <- c("Michael Weylandt teaches STA9750 on Thursday", "KRR teaches STA9891 on Wednesday")
pattern <- c("(?<instructor>.*) teaches (?<course>.*) on (?<weekday>.*)")
stringr::str_match(x, pattern) |> as.data.frame()
                                            V1       instructor  course
1 Michael Weylandt teaches STA9750 on Thursday Michael Weylandt STA9750
2             KRR teaches STA9891 on Wednesday              KRR STA9891
    weekday
1  Thursday
2 Wednesday

FAQ: Homoglyphs

x <- c("Η", "H")
tolower(x)
[1] "η" "h"

Why?

uni_info <- Vectorize(function(x) Unicode::u_char_name(utf8ToInt(x)), "x")
uni_info(x)
                         Η                          H 
"GREEK CAPITAL LETTER ETA"   "LATIN CAPITAL LETTER H" 

Particularly nasty with dashes - lean on [[:punct:]] where possible.

x <- c("Em Dash —", "En Dash –", "Hyphen ‐")
stringr::str_remove(x, "[[:punct:]]") # Works
[1] "Em Dash " "En Dash " "Hyphen " 
stringr::str_remove(x, "-")  # Keyboard minus = Fail
[1] "Em Dash —" "En Dash –" "Hyphen ‐" 
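
When the goal is to normalize dashes rather than strip all punctuation, one option is the Unicode "dash punctuation" property class \p{Pd}, which the ICU regex engine behind stringr supports. The sketch below uses illustrative strings written with Unicode escapes so that it does not depend on file encoding.

```r
library(stringr)

# Em dash, en dash, Unicode hyphen, and the keyboard hyphen-minus.
x <- c("Em Dash \u2014", "En Dash \u2013", "Hyphen \u2010", "Minus -")

# \p{Pd} matches any character in the "dash punctuation" category,
# so every variant is replaced by a plain keyboard hyphen-minus.
normalized <- str_replace_all(x, "\\p{Pd}", "-")
normalized
```

After normalization, a pattern containing a plain "-" behaves the same way for every variant.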

FAQ: ? Symbol (Quantifiers)

Quantifiers (multiple matches):

  • .{a,b}: anywhere from a to b copies (inclusive)
  • .{0,b}: no more than b copies
  • .{a,}: at least a copies
  • .?: zero-or-one, same as .{0,1}
  • .*: zero-or-more, same as .{0,}
  • .+: one-or-more, same as .{1,}
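
A quick way to see the quantifiers in action is to apply each one to the same small vector. The test strings below are made up for illustration; the anchors ensure the quantifier has to account for every "a" in the string.

```r
library(stringr)

x <- c("ct", "cat", "caat", "caaat")

zero_or_one  <- str_detect(x, "^ca?t$")      # ct, cat
zero_or_more <- str_detect(x, "^ca*t$")      # all four
one_or_more  <- str_detect(x, "^ca+t$")      # cat, caat, caaat
two_to_three <- str_detect(x, "^ca{2,3}t$")  # caat, caaat
at_least_two <- str_detect(x, "^ca{2,}t$")   # caat, caaat

data.frame(x, zero_or_one, zero_or_more, one_or_more, two_to_three, at_least_two)
```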

FAQ: stringr vs grep / grepl

Ultimately the same functionality, but stringr has a more consistent interface.

Conversion table online
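
A few of the most common correspondences, sketched on a made-up vector. Note the argument order: base R takes the pattern first, stringr takes the string first, which is what makes stringr pipe-friendly.

```r
library(stringr)

x <- c("apple pie", "banana split", "cherry tart")

# Detection
base_detect    <- grepl("an", x)
stringr_detect <- str_detect(x, "an")

# Subsetting to the matching elements
base_subset    <- grep("an", x, value = TRUE)
stringr_subset <- str_subset(x, "an")

# Replacing the first match and all matches
base_first    <- sub("a", "A", x)
stringr_first <- str_replace(x, "a", "A")
base_all      <- gsub("a", "A", x)
stringr_all   <- str_replace_all(x, "a", "A")

stringr_all
```

For simple patterns like these the two families agree; differences can appear with more exotic syntax because base R and stringr default to different regex engines.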

FAQ: Working Columnwise

All stringr functions work well in dplyr pipelines (“vectorized”):

library(dplyr); library(stringr)
df <- data.frame(lower_letters = letters)
df |> mutate(upper_letters = str_to_upper(lower_letters))
   lower_letters upper_letters
1              a             A
2              b             B
3              c             C
4              d             D
5              e             E
6              f             F
7              g             G
8              h             H
9              i             I
10             j             J
11             k             K
12             l             L
13             m             M
14             n             N
15             o             O
16             p             P
17             q             Q
18             r             R
19             s             S
20             t             T
21             u             U
22             v             V
23             w             W
24             x             X
25             y             Y
26             z             Z
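
The same vectorization applies to pattern functions inside filter() and mutate(). A minimal sketch, using a hypothetical table of calendar entries that is not from the course materials:

```r
library(dplyr)
library(stringr)

# Hypothetical calendar entries, for illustration only.
entries <- data.frame(
  label = c("STA 9750 lecture", "STA 9891 lecture", "Office hours")
)

result <- entries |>
  filter(str_detect(label, "^STA")) |>              # keep rows starting with "STA"
  mutate(number = str_extract(label, "[0-9]{4}"))   # pull out the 4-digit number

result
```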

FAQ: How to Convert to UTF-8

If you know the source encoding:

iconv(STRING, from="latin1", to="UTF-8")
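
As a concrete sketch of the known-encoding case, the example below builds a Latin-1 byte string by hand (the word and variable names are illustrative) and converts it with base R's iconv():

```r
# Latin-1 bytes for "caf" followed by e-acute (0xE9 in Latin-1).
latin1_bytes <- rawToChar(as.raw(c(0x63, 0x61, 0x66, 0xe9)))

# Convert to UTF-8, where e-acute becomes the two bytes 0xC3 0xA9.
utf8_string <- iconv(latin1_bytes, from = "latin1", to = "UTF-8")

# The original bytes are not valid UTF-8; the converted string is.
validity <- validUTF8(c(latin1_bytes, utf8_string))
validity

charToRaw(utf8_string)
```

validUTF8() can also serve as a quick check for whether a conversion is needed at all.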

If you don’t know the source, ….

Today

Agenda

  • Unicode Discussion
  • Regex Discussion
  • Regex Exercises
  • Completion of Cocktail Exercise
  • Time Permitting: More Scraping
  • Time Permitting: Statistical Inference

Breakout Rooms

Order  Team            Order  Team
1      Rat Pack        6      Ca$h VZ
2      Subway Surfers  7      Listing Legends
3      Chart Toppers   8      TDSSG
4      Metro Mindset   9      Broker T’s
5      Apple Watch     10     EVengers