Software Tools for Data Analysis
STA 9750
Michael Weylandt
Week 3

STA 9750 Week 3

Today:

  • Tuesday Section: 2025-09-09
  • Thursday Section: 2025-09-11

Lecture #03: Coding in R: Variables, Control Flow, Packages, Function Calls

Mini-Project #00

Mini-Project #00 is due 2025-09-12 at 11:59pm ET

As of 2025-09-11:

  • 67 Users registered via Piazza
  • 65 Posts on GitHub
    • 62 pass automated tests
    • 3 need more work (see comments)

Recall that you need to post on both GitHub and Piazza for this assignment - sending your GitHub ID is the most important step

MP #00 Goals

Mini-Project #00 is ungraded to:

  1. Sort out tech trouble
  2. Show danger of waiting to the last minute
  3. Demonstrate importance of closely following instructions

Lessons learned will be helpful in future assignments

Some sites look really great - excited to see headshots, resumes, etc. and to learn more about you

MP #00 Peer Feedback

On 2025-09-15, I will assign peer-feedback on GitHub:

Aims:

  • Learn to read and evaluate code
  • In analysis, rarely right and wrong; definitely better and worse
  • Learn tricks to improve your own site

“Good artists copy; great artists steal.” – Steve Jobs

More discussion next week

Course Support

Asynchronous Support: Piazza

  • 96 students registered in Piazza
  • 1 still needs to register(!)
  • Average response time: 10 minutes

Synchronous Support: Office Hours

  • Tuesdays and Thursdays at 5pm
  • Still waiting on CUNY HR before TA office hours begin

Pre-Assignments

Pre-Assignment #03

  • Ignore Brightspace’s Grading
    • Brightspace calls all short answers wrong
    • Gradebook shows complete/incomplete grading
  • I often give feedback through Brightspace, so check ‘feedback’ if you are alerted to any

Pre-Assignment #04

  • Due the day before class at midnight each week
  • Available on course website + Brightspace after 9pm

Today

Today

  • Brief Review of Quarto Render + Git Commit Cycle
  • Introduction to Course Project
  • R: Data Frames, Functions, Packages and Control Flow

Render + Commit Cycle

In Mini-Project #00, your goal is to get rendered HTML to GitHub

  • ‘Kick-the-tires’ on Set-Up
  • Things will go wrong - better now than later

Once you do this once, cycle becomes easier

Render + Commit

For ongoing changes / updates:

  1. Render button in RStudio - generate new / change HTML
  2. Check boxes next to all changed files - Stage updates
  3. Hit Commit and use modal to make a Commit - Commit a new version in Git
  4. Push - Send new version to GitHub. Website will update automatically

Render + Commit

Tips:

  1. Go to your GitHub.com - see if expected files are present
  2. When in doubt, better to include everything in docs
  3. Cycle rapidly - find problems ASAP

Storage is cheap - your time isn’t

Live demo

Course Project Overview

Course Project

  • Teams: 4-6 classmates
  • Stages:
    • Proposal (in class presentation)
    • Mid-semester check-in (in class presentation)
    • Final: in class presentation, individual report, summary report
  • Structure:
    • Shared “Overarching Question” (OQ)
    • Individual “Specific Question” (SQ, one per teammate)

Full description online

Course Project Conceit

  • Consultants hired by client interested in qualitative OQ
  • You break OQ into several quantitative SQs
  • Combine results from SQs to answer OQ

Process:

  • Proposal: Sales pitch for consultants - here’s the OQ we will answer
  • Mid-Semester: Scope of work - here’s the SQs and data we will use
  • End-of-Semester: Presentation of Results
    • Group Report: “Executive Summary”
    • Individual Reports: “Technical Appendices”

Finding Data

  • Start early!
  • NYC Open Data is great
  • Nothing paid or private without express instructor permission
  • Everyone loves spatial data!

Presentation Hints

  • Longest time \(\neq\) most important
  • Story, story, story! Why are you making these choices?
  • Hourglass Structure
    • Start big
    • Motivate your overarching question
    • Specific questions
    • Tie specific to overarching
    • From overarching back to big motivation
  • At least one figure every other slide

Next Steps

First step: By 2025-09-30, email me your group members.

Proposal Presentations:

  • In-class October 07 (Tuesday) and October 09 (Thursday)
  • See details and rubric online
  • No Pre-Assignment that week

Getting Started with R

Programming in R

It’s now time for us to start writing code in R

No more copy and trust

Goals:

  1. Modify existing code to new applications
  2. Write code to use existing libraries
  3. Read and debug code

Execution Model

Three models of executing code:

  1. Line-by-line at Console
    • REPL: Read Evaluate Print Loop
    • Best for transient, one-off actions; trying new things
  2. Script writing in a separate file
    • Write in a separate (.R) file
    • Executes in same session; persistent state
    • Best for longer analyses with complex commands, developing code
  3. Code in a Quarto document
    • Write code in chunks inside a qmd file
    • Executes in a fresh session
    • Best for documenting and conveying analysis, archiving results

Arithmetic in R

Basic arithmetic in R runs as expected

1 + 2 + 3 + 4 + 5
[1] 15

PEMDAS Ordering: Parentheses, Exponents, Multiplication/Division, Addition/Subtraction

Arithmetic in R

\[3^{2 * 5 - 1} / 24^5\]

3^(2 * 5 - 1) / 24^5
[1] 0.002471924

\[\frac{1^1 + 2^2 + 3^3}{3^1 + 2^2 + 1^3}\]

(1^1 + 2^2 + 3^3) / (3^1 + 2^2 + 1^3)
[1] 4

When in doubt, extra parentheses don’t hurt
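
A few cases where the ordering is easy to misread. This is a minimal sketch; the specific expressions are illustrative and not from the lecture:

```r
# Exponentiation binds tighter than multiplication
2^3 * 2      # (2^3) * 2 = 16
2^(3 * 2)    # 2^6 = 64

# Exponentiation also binds tighter than a leading minus sign
-2^2         # -(2^2) = -4
(-2)^2       # 4

# Multiplication and division are evaluated left to right
8 / 4 * 2    # (8 / 4) * 2 = 4
8 / (4 * 2)  # 1
```

The parenthesized versions cost a few extra characters but leave no doubt about what was intended.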

Function Calls

To go beyond arithmetic, need to invoke functions

\[ \cos(\pi) + \tan\left(\frac{\pi}{4}\right) + \sqrt{\sin(\pi/2)} - e^1\]

cos(pi) + tan(pi / 4) + sqrt(sin(pi/2)) - exp(1)
[1] -1.718282

Function Calls

All function calls have a fundamental syntax:

name()

e.g.,

R.Version()

To get help with any function in R, type ?name

Function Calls

Most interesting functions require input:

name(argument)

Here, the argument is passed as input to the function:

cos(pi)

Multiple arguments are separated by commas

atan2(-1, 1)
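
With more than one argument, order starts to matter. A small sketch using atan2, whose arguments are (y, x) in that order:

```r
# atan2(y, x) returns the angle (in radians) of the point (x, y)
first  <- atan2(-1, 1)   # point (1, -1): angle -pi/4
second <- atan2(1, -1)   # point (-1, 1): angle 3*pi/4

# Swapping the arguments gives a different answer
first == second          # FALSE
```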

Function Calls

Type a name without () to see its implementation

cos
function (x)  .Primitive("cos")

and

lm
function (formula, data, subset, weights, na.action, method = "qr", 
    model = TRUE, x = FALSE, y = FALSE, qr = TRUE, singular.ok = TRUE, 
    contrasts = NULL, offset, ...) 
{
    ret.x <- x
...

Conceptually sqrt vs sqrt(4) is “concept of square rooting” vs “the actual square root of 4, i.e., 2”
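
One way to see the difference in code. This is only a sketch; square_root is a hypothetical name introduced for illustration:

```r
# Without parentheses, sqrt is the function itself
is.function(sqrt)        # TRUE

# With parentheses, we get the value the function computes
sqrt(4)                  # 2
is.function(sqrt(4))     # FALSE

# Because a function is an object, it can be given another name
square_root <- sqrt
square_root(4)           # 2
```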

Function Calls

Most important for users are the first few lines (args)

  • Define optional and required inputs

args(log)
function (x, base = exp(1)) 

Two arguments:

  • x: the input
  • base: optional 2nd argument with default \(e\) (natural log)

Function Calls

Every argument has a name, but you don’t always have to write it

R is usually smart about knowing what you meant

These are equivalent:

log(10)
log(10, exp(1))
log(x = 10)
log(x = 10, exp(1))
log(x = 10, base = exp(1))
log(base = exp(1), x = 10)
log(base = exp(1), 10)
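
A quick check that the calls above agree, plus a case where dropping the names changes the answer. The values 8 and 2 are illustrative choices, not from the lecture:

```r
# All seven calls from the list above, collected in one vector
results <- c(
  log(10),
  log(10, exp(1)),
  log(x = 10),
  log(x = 10, exp(1)),
  log(x = 10, base = exp(1)),
  log(base = exp(1), x = 10),
  log(base = exp(1), 10)
)

# Largest gap between any two results: zero, up to floating-point error
max(results) - min(results)

# Without names, position decides which value is x and which is base
log(8, 2)   # log base 2 of 8 = 3
log(2, 8)   # log base 8 of 2 = 1/3
```

Naming arguments is the safer habit whenever a function has several inputs of the same type.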

Vector Semantics

Often when dealing with data, we want to transform related data similarly:

  • E.g., change all temperatures in data set from F to C

Dangerous to transform only part of the data

R has vectorized semantics - whenever possible, do same operation to all numbers together

1:10 # A vector
 [1]  1  2  3  4  5  6  7  8  9 10
sqrt(1:10) # Keeps same vector structure
 [1] 1.000000 1.414214 1.732051 2.000000 2.236068 2.449490 2.645751 2.828427
 [9] 3.000000 3.162278
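
The temperature example could look like this. The values in temps_f are made up for illustration, not course data:

```r
# Hypothetical temperatures in Fahrenheit
temps_f <- c(32, 50, 68, 212)

# One expression converts every element at once
temps_c <- (temps_f - 32) * 5 / 9
temps_c   # 0, 10, 20, 100
```

No element can be skipped by accident, because the same arithmetic is applied to the whole vector.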

Vector Semantics

Most functions in R try to vectorize, but this is not always possible

sqrt(1:10)
 [1] 1.000000 1.414214 1.732051 2.000000 2.236068 2.449490 2.645751 2.828427
 [9] 3.000000 3.162278
cospi((1:8) / 4)
[1]  0.7071068  0.0000000 -0.7071068 -1.0000000 -0.7071068  0.0000000  0.7071068
[8]  1.0000000

But

sum(1:10)
[1] 55

and

rev(1:10)
 [1] 10  9  8  7  6  5  4  3  2  1
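
A rough way to tell these cases apart is to compare the length of the output with the length of the input. A sketch, with cumsum added as an extra illustration:

```r
x <- 1:10

# Element-wise: one output per input
length(sqrt(x))    # 10

# Summaries: many inputs, one output
length(sum(x))     # 1
mean(x)            # 5.5

# Reordering: same length, but each output depends on the whole vector
rev(x)[1]          # 10

# cumsum() is a running total: one output per input,
# but each output depends on all earlier inputs
cumsum(x)[10]      # 55
```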

Vectorized Semantics

The [1] you sometimes see is R just letting you know where in a vector you are

sqrt(1:25)
 [1] 1.000000 1.414214 1.732051 2.000000 2.236068 2.449490 2.645751 2.828427
 [9] 3.000000 3.162278 3.316625 3.464102 3.605551 3.741657 3.872983 4.000000
[17] 4.123106 4.242641 4.358899 4.472136 4.582576 4.690416 4.795832 4.898979
[25] 5.000000

When we do 2D data (later), we get column and row indices

     [,1] [,2] [,3] [,4] [,5]
[1,]    1    6   11   16   21
[2,]    2    7   12   17   22
[3,]    3    8   13   18   23
[4,]    4    9   14   19   24
[5,]    5   10   15   20   25
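
The grid above matches what matrix(1:25, nrow = 5, ncol = 5) prints; that is an assumption about how it was produced. A sketch of how the labels relate to indexing:

```r
# matrix() fills column by column by default
m <- matrix(1:25, nrow = 5, ncol = 5)

# [row, column] indexing matches the printed labels
m[2, 3]   # row 2, column 3: 12
m[2, ]    # all of row 2: 2 7 12 17 22
m[, 3]    # all of column 3: 11 12 13 14 15
```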

Variables

Often, we want to save several values as a single ‘thing’

x <- 1:5

Here <- is the assignment operator. The vector formed by 1:5 is labeled x

Naming x later gives us this vector

x
[1] 1 2 3 4 5

Can be used in functions

sqrt(x)
[1] 1.000000 1.414214 1.732051 2.000000 2.236068

Vectors and Variables

If you need to write a vector ‘by hand’, use the c function:

x <- c(1, 4, 9, 16, 25)
x
[1]  1  4  9 16 25

then

sqrt(x)
[1] 1 2 3 4 5

We won’t usually hand-write vectors like this:

  • Data comes in vectors (e.g., spreadsheet columns)
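
For the occasional hand-written vector, c can also extend an existing one. A small sketch; y and z are names chosen for illustration:

```r
x <- c(1, 4, 9, 16, 25)

# c() also combines existing vectors with new values
y <- c(x, 36, 49)
length(y)        # 7
sqrt(y)          # 1 2 3 4 5 6 7

# For regular patterns, a sequence avoids typing each value
z <- (1:7)^2
all(y == z)      # TRUE
```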

Vector Access

Use [] operator to get individual elements of a vector:

x <- sqrt(1:10)
x[4]
[1] 2

Can do more complex indexing, but we won’t use it much:

x[1:5]
[1] 1.000000 1.414214 1.732051 2.000000 2.236068
x[6:10]
[1] 2.449490 2.645751 2.828427 3.000000 3.162278
x[-1]
[1] 1.414214 1.732051 2.000000 2.236068 2.449490 2.645751 2.828427 3.000000
[9] 3.162278
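
A few more indexing patterns, offered as a sketch rather than something needed for the course right away:

```r
x <- sqrt(1:10)

# A vector of positions picks out several elements at once
x[c(1, 4, 9)]      # 1 2 3

# Negative positions drop elements instead of selecting them
length(x[-1])      # 9
length(x[-(1:3)])  # 7

# A logical condition keeps the elements where it is TRUE
x[x > 3]           # only sqrt(10), about 3.162278
```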

In-Class Activity

Weekly Lab

Lab #03

New topics to cover:

  1. Classes of objects
  2. Using Packages
  3. Comments
  4. Writing functions
  5. Control Flow

Object classes

Everything in R has a type or class:

  • Integer, Double (i.e. 64 bit number allowing decimals), Character, …

All vector elements must have the same class - this is the vector’s class

x <- 1:5
y <- sqrt(x)

class(x)
[1] "integer"
class(y)
[1] "numeric"
class(letters)
[1] "character"
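
Because a vector has only one class, mixing types forces R to convert. A short sketch of what happens:

```r
class(c(1L, 2L))       # "integer"
class(c(1L, 2.5))      # "numeric": integers are promoted to doubles
class(c(1, "a"))       # "character": numbers are converted to text
class(c(TRUE, FALSE))  # "logical"

# The converted values are no longer numbers
mixed <- c(1, "a")
mixed                  # "1" "a"
```

This is a common reason a column of numbers unexpectedly behaves like text: one stray non-numeric entry changes the class of the whole vector.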

Using Packages

A package is a set of code (and data) packaged up for distribution and use

R has many helpful packages - these are distributed via CRAN (currently over 20,000)

Using packages is a two-step process:

  1. Get package from CRAN to your computer (one time)
  2. Load it into R (every time)

Think of regular software: you download MS Office once but need to start it whenever you want to use it

Using Packages

The install.packages function will download and install a package:

install.packages("ggplot2")

If that package uses other packages, R will sort that out automatically

When ready to use a package, use the library() command to ‘start’ it:

library(ggplot2)

Now I have access to everything in that package
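
A sketch of how the two steps can be checked in code. The helper is_installed is hypothetical (written for this example, not part of R); requireNamespace and the :: operator are standard R:

```r
# is_installed() is a hypothetical helper for this sketch
is_installed <- function(pkg) {
  isTRUE(requireNamespace(pkg, quietly = TRUE))
}

# The stats package ships with R, so this is TRUE on any installation
is_installed("stats")

# Only suggest installing when the package is actually missing
if (!is_installed("ggplot2")) {
  message("Run install.packages(\"ggplot2\") once, then library(ggplot2)")
}

# package::function calls a single function without library()
stats::median(c(1, 5, 9))   # 5
```

The sketch prints a reminder instead of installing, so it can be run safely without a network connection.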

Comments

Comments are text inside the code that R ignores

  • Everything following a # gets ignored

Compare

tan(45 * pi / 180) Compute the tangent of 45 degrees
Error in parse(text = input): <text>:1:20: unexpected symbol
1: tan(45 * pi / 180) Compute
                       ^

with

tan(45 * pi / 180) # Compute the tangent of 45 degrees
[1] 1

Comments

The best comments don’t just say what you are doing. They say why you are doing something in the way it is being done

More discussion of comments later as we write more complex code
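
One possible contrast between a "what" comment and a "why" comment. The variable angle_degrees is a name chosen for illustration:

```r
angle_degrees <- 45

# Less helpful: restates what the code already says
# Multiply by pi and divide by 180
tan(angle_degrees * pi / 180)

# More helpful: explains why the conversion is there
# R's trig functions expect radians, so convert from degrees first
tan(angle_degrees * pi / 180)
```

Both lines compute the same value; only the second comment would help a reader who does not already know the reason for the conversion.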

FAQs from PA #03

FAQ: Vector Index Printout

Default vector printing:

1:10
 [1]  1  2  3  4  5  6  7  8  9 10

Each line gets a new index:

sqrt(1:10)
 [1] 1.000000 1.414214 1.732051 2.000000 2.236068 2.449490 2.645751 2.828427
 [9] 3.000000 3.162278

More complex objects have alternate print styles:

matrix(1:9, nrow=3, ncol=3)
     [,1] [,2] [,3]
[1,]    1    4    7
[2,]    2    5    8
[3,]    3    6    9

Print width is controlled by getOption("width").

FAQ: Recycling Rules

Alignment by default:

x <- 1:3
y <- 4:6
x + y
[1] 5 7 9

Recycling by default:

x <- 1
y <- 4:6
x + y
[1] 5 6 7

Recycle warning when vectors don’t fit together cleanly:

x <- 1:2
y <- 4:6
x + y
Warning in x + y: longer object length is not a multiple of shorter object
length
[1] 5 7 7

FAQ: Recycling Warning

x <- 1:2
y <- 4:6
x + y
Warning in x + y: longer object length is not a multiple of shorter object
length
[1] 5 7 7

Not a problem per se, but often a sign that something has gone wrong.

  • scalar + vector is usually safe
  • 2 vectors of the same size are usually safe
  • vectors of different sizes are usually a programming mistake
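
The warning only appears when the lengths don't divide evenly. A sketch of a mismatch that R accepts silently, using illustrative vectors short and long:

```r
# Lengths 2 and 4: recycles cleanly, so R gives no warning
short <- 1:2
long  <- 1:4
short + long    # 2 4 4 6

# If the vectors were meant to line up, compare lengths before combining
same_length <- length(short) == length(long)
same_length     # FALSE: a hint that something upstream went wrong
```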

FAQ: Warnings vs Errors

  • Warnings: heuristics pointing at a typical problem
    • Code still ran to completion
    • Try to fix these unless you’re certain it’s not a problem
  • Errors: code failed to execute
    • You have to fix these to run your code
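
A sketch of the difference in practice. tryCatch and suppressWarnings are base R functions not covered in these slides, used here only to make the two cases visible:

```r
# A warning: R still returns a value (here NaN)
result <- suppressWarnings(log(-1))
is.nan(result)                      # TRUE

# An error: no value is produced, so we catch the condition instead
outcome <- tryCatch(
  sqrt("a"),
  error = function(e) "failed"
)
outcome                             # "failed"

# tryCatch() can also report that a warning occurred
flagged <- tryCatch(log(-1), warning = function(w) "warned")
flagged                             # "warned"
```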

FAQ: Changing Functions

Most built-in functions can’t / shouldn’t be changed.

Some allow alternate behavior via additional arguments:

log(10) # Default is natural (base e) logarithm
[1] 2.302585
log(10, base=10)
[1] 1

If you want different behavior, write your own function:

cosd <- function(x){
    ## Cosine in degrees
    cos(x * pi / 180)
}
cosd(90)
[1] 6.123234e-17
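
The result above is tiny but not exactly zero, because pi is stored to finite precision. A sketch of two ways to handle that; cosd_exact is a hypothetical variant built on cospi:

```r
cosd <- function(x) {
    ## Cosine in degrees
    cos(x * pi / 180)
}

# Not exactly zero, so compare with a tolerance rather than ==
cosd(90) == 0                   # FALSE
isTRUE(all.equal(cosd(90), 0))  # TRUE

# cospi(x) computes cos(pi * x) without forming pi * x first
cosd_exact <- function(x) {
    cospi(x / 180)
}
cosd_exact(90)                  # 0
```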

Always try ?name to see documentation.

FAQ: Git Workflow

Three key commands:

  • git add: add some changes to a ‘box’
  • git commit: seal the ‘box’
  • git push: send the ‘box’ to GitHub

Git pane in RStudio shows uncommitted changes, not files.

If a file ‘vanishes’ after a commit, that’s good!

Wrap-Up

Review

Introduction to R:

  • Arithmetic
  • Variables and Vectors
  • Functions: Calls, Arguments, Defining
  • Packages
  • Help System

Upcoming Work

Upcoming work from course calendar

Looking Ahead

Course Project:

  • Start looking for teammates and topics

Life Tip of the Week

ZSB / Baruch / CUNY Benefits

As a student, you have many free and discounted benefits.

I have collected some of these on the course page, but there are many more if you look around.

Places love to give discounts to students - use them!

CUNY-Wide

  • Free New York Times and Wall Street Journal
  • Free and Discounted Museum Access via CUNY Arts
  • Discounted Broadway and Off-Broadway via TDF

Baruch / ZSB

  • Free Barron’s Subscription
  • Newman Library Databases

Any Student

  • Free Trial and Discounted Rate Amazon Prime
  • Discounted Spotify + Streaming Subscriptions
  • GitHub Student Developer Pack

Musical Treat