Software Tools for Data Analysis
STA 9750
Michael Weylandt
Week 3 – Thursday 2026-02-19
Last Updated: 2026-02-19

STA 9750 Week 3

Today: Lecture #03: Coding in R: Variables, Control Flow, Packages, Function Calls

These slides can be found online at:

https://michael-weylandt.com/STA9750/slides/slides03.html

In-class activities (if any) can be found at:

https://michael-weylandt.com/STA9750/labs/lab03.html

Upcoming TODO

Upcoming student responsibilities:

Date        Time        Details
2026-02-19  11:59pm ET  Team Roster Submission
2026-02-20  11:59pm ET  Mini-Project #00 Due
2026-02-26  6:00pm ET   Team Contract Due
2026-02-26  6:00pm ET   Project Proposal Presentation Slides Due
2026-03-01  11:59pm ET  Mini-Project Peer Feedback #00 Due
2026-03-05  6:00pm ET   Pre-Assignment #05 Due
2026-03-12  6:00pm ET   Pre-Assignment #06 Due
2026-03-13  11:59pm ET  Mini-Project #01 Due
2026-03-19  6:00pm ET   Pre-Assignment #07 Due

Course Project

Course project description is now online

Detailed discussion of:

  • Project structure
  • Key deadlines
  • Grading rubrics

Please send me questions!

Team rosters due at midnight tonight. If you don’t have a team, stick around after class and meet others without a team.

Course Project

  • Teams: 4-6 classmates
  • Stages:
    • Proposal (in class presentation)
    • Mid-semester check-in (in class presentation)
    • Final: in class presentation, individual report, summary report
  • Structure:
    • Shared “Overarching Question” (OQ)
    • Individual “Specific Question” (SQ, one per teammate)

Full description online

Course Project

Conceit:

  • Consultants hired by client interested in qualitative OQ
  • You break OQ into several quantitative SQs
  • Combine results from SQs to answer OQ

Process:

  • Proposal: Sales pitch for consultants - here’s the OQ we will answer
  • Mid-Semester: Scope of work - here’s the SQs and data we will use
  • End-of-Semester: Presentation of Results
    • Group Report: “Executive Summary”
    • Individual Reports: “Technical Appendices”

Course Project

Next week: project proposal presentations

Major goal:

  • What is the overarching / motivating question for your project?

See instructions for more

Post slides and team contract to Brightspace before class - you will present from your computer

Overarching Question

Example Topic: NYC Apartment Prices

  • Very Bad: Is there a relationship between rent and location?

    • Problems: Binary, Obvious, Limited room for exploration
  • Bad: What is the correlation between rent and location?

    • Problems: Simple (one numeric answer), Linear, Limited room for exploration
  • Good: What is the relationship between rent and location?

    • Strengths: Allows more interesting analysis, subgroups / differential impact

    • Weaknesses: Very broad (lots has been said about this), not clear how to split up

  • Great: What are the key drivers of NYC rents and do they suggest multiple market segments with different relative priorities?

    • Strengths: Allows project to be split up (different factors), clear intent to integrate findings, working hypothesis (multiple segments)

Finding Data

Advice:

  • Start early!
  • NYC Open Data is great
  • Nothing paid or private without express instructor permission
  • Everyone loves spatial data!
  • Avoid overdone topics (NYC rents, NYC schools) - new topics => low-hanging fruit

Presentation Hints

  • Longest time \(\neq\) most important
  • Story, story, story! Why are you making these choices?
  • Hourglass Structure
    • Start big
    • Motivate your overarching question
    • Specific questions
    • Tie specific to overarching
    • From overarching back to big motivation
  • At least one figure every other slide

Special Presentation

Data Science Resources at the Baruch Libraries

Presenter: Jason Amey (Baruch Business Librarian and Former STA 9750 Student)

Mini-Project #00

Mini-Project #00 is due 2026-02-20 at 11:59pm ET

  • Automatic grace period until Sunday at midnight

As of 2026-02-19:

  • 9 Users registered via Piazza
  • 9 Posts on GitHub
    • 8 pass automated tests
    • 1 needs more work (see comments)

Recall that you need to submit via GitHub, Brightspace, and Piazza for this assignment - sending GitHub ID is most important

MP #00 Goals

Mini-Project #00 is ungraded to:

  1. Sort out tech trouble
  2. Show the danger of waiting until the last minute

Lessons learned will be helpful in future assignments

Some sites look really great - excited to see headshots, resumes, etc. and to learn more about you

MP #00 Peer Feedback

On 2026-02-23, I will assign peer feedback:

  • For MP #00, no real scores (give all 10s!): practice with process
  • I will provide an R function to solicit your peer feedback
    • You provide your GitHub ID and your secret phrase
    • Answer questions
    • Upload result file to Brightspace
  • Full details to follow via Brightspace announcement

MP #00 Peer Feedback

Aims:

  • Learn to read and evaluate code
  • In analysis, rarely right and wrong; definitely better and worse
  • Learn tricks to improve your own site

“Good artists copy; great artists steal.” – Steve Jobs

More discussion in advance of MP #01 feedback cycle

CISSOID

New this semester: cissoid

  • CIS Scripts for Organizing Instructional Delivery

Course robot to help guarantee submissions are complete and to provide automated help where possible

Feature requests welcome

Course Support

Asynchronous Support: Piazza

  • 21 students registered in Piazza

Synchronous Support: Office Hours

  • Wednesdays at 5pm (In-Person)
  • Thursdays at 5pm (Zoom)

Pre-Assignments

Pre-assignment quizzes

  • 30 point quiz
    • My intent is for everyone to get 30/30
    • Take as many times as needed
    • Settings tweaked to only re-show incorrect questions
  • I sometimes give feedback through Brightspace, so check ‘feedback’ if you are alerted to any

No pre-assignment next week - presentations instead

Mini-Project #01

Mini-Project #01 officially released today

  • Due 2026-03-13 at 11:59pm ET (22 Days)
  • Topic: Assessing the Impact of SFFA on Campus Diversity One-Year Later
  • Analysis of IPEDS Data
  • Graded out of 80
    • 30 points for completed submissions

Mini-Project #01

7 ‘Tasks’ covering:

  • data import (#1),
  • data preparation / feature engineering (#2-3)
  • Exploratory Data Analysis (#4-6), and
  • writing up your analysis (#7)

For this MP, you are a college president describing changes (or lack of changes) to the demographics of your student body post-SFFA.

Today

Today

  • Brief Review of Quarto Render + Git Commit Cycle
  • Introduction to R Programming
  • Key Ideas in R: Data Frames, Functions, Packages and Control Flow
  • Wrap Up

Git Review

Render + Commit Cycle

In Mini-Project #00, your goal is to get rendered HTML to GitHub

  • ‘Kick-the-tires’ on Set-Up
  • Things will go wrong - better now than later

Once you have done this once, the cycle becomes easier

Git Workflow

Three key commands:

  • git add: add some changes to a ‘box’
  • git commit: seal the ‘box’
  • git push: send the ‘box’ to GitHub

Git pane in RStudio shows uncommitted changes, not files.

If a file ‘vanishes’ after a commit, that’s good!

Render + Commit

For ongoing changes / updates:

  1. Render button in RStudio - generate new / changed HTML
  2. Check boxes next to all changed files - Stage updates
  3. Hit Commit and use modal to make a Commit - Commit a new version in Git
  4. Push - Send new version to GitHub. Website will update automatically

Render + Commit

Tips:

  1. Go to your repository on GitHub.com - see if expected files are present
  2. When in doubt, better to include everything in docs
  3. Cycle rapidly - find problems ASAP

Storage is cheap - your time isn’t

Getting Started with R

Programming in R

It’s now time for us to start writing code in R

No more copy and trust

Goals:

  1. Modify existing code to new applications
  2. Write code to use existing libraries
  3. Read and debug code

Execution Model

Three models of executing code:

  1. Line-by-line at Console
    • REPL: Read Evaluate Print Loop
    • Best for transient, one-off actions; trying new things
  2. Script writing in a separate file
    • Write in a separate (.R) file
    • Executes in same session; persistent state
    • Best for longer analyses with complex commands, developing code
  3. Code in a Quarto document
    • Write code in chunks inside a qmd file
    • Executes in a fresh session
    • Best for documenting and conveying analysis, archiving results

Arithmetic in R

Basic arithmetic in R runs as expected

1 + 2 + 3 + 4 + 5
[1] 15

PEMDAS Ordering: Parentheses, Exponents, Multiplication/Division, Addition/Subtraction

Arithmetic in R

\[3^{2 * 5 - 1} / 24^5\]

3^(2 * 5 - 1) / 24^5
[1] 0.002471924

\[\frac{1^1 + 2^2 + 3^3}{3^1 + 2^2 + 1^3}\]

(1^1 + 2^2 + 3^3) / (3^1 + 2^2 + 1^3)
[1] 4

When in doubt, extra parentheses don’t hurt
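
A small sketch of how the precedence rules play out. The variable names are illustrative only and are not part of the course materials; the values in the comments follow directly from PEMDAS.

```r
# Exponents bind tighter than multiplication, which binds tighter than addition
a <- 2 + 3 * 4^2      # 2 + (3 * 16) = 50
b <- (2 + 3) * 4^2    # 5 * 16 = 80

# A common surprise: the exponent is applied before the leading minus sign
neg_first <- -2^2     # -(2^2) = -4
neg_paren <- (-2)^2   # 4
```

The last two lines are a case where the extra parentheses change the answer, not just the readability.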

Function Calls

To go beyond arithmetic, need to invoke functions

\[ \cos(\pi) + \tan\left(\frac{\pi}{4}\right) + \sqrt{\sin(\pi/2)} - e^1\]

cos(pi) + tan(pi / 4) + sqrt(sin(pi/2)) - exp(1)
[1] -1.718282

Function Calls

All function calls have a fundamental syntax:

name()

e.g.,

R.Version()

To get help with any function in R, type ?name

Function Calls

Most interesting functions require input:

name(argument)

Here, the argument is input to the function:

cos(pi)

Multiple arguments are separated by commas

atan2(-1, 1)

Function Calls

Type a name without () to see its implementation

cos
function (x)  .Primitive("cos")

and

lm
function (formula, data, subset, weights, na.action, method = "qr", 
    model = TRUE, x = FALSE, y = FALSE, qr = TRUE, singular.ok = TRUE, 
    contrasts = NULL, offset, ...) 
{
    ret.x <- x
...

Conceptually sqrt vs sqrt(4) is “concept of square rooting” vs “the actual square root of 4, i.e., 2”
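
One way to see the difference between a function and a function call is to treat the function as an object in its own right. A minimal sketch; the alias `root` is a made-up name for illustration.

```r
is.function(sqrt)        # TRUE: sqrt by itself is the function object
is.function(sqrt(4))     # FALSE: sqrt(4) is the number produced by calling it

root <- sqrt             # hypothetical alias: a second label for the same function
root_of_nine <- root(9)  # calling through the alias gives 3
```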

Function Calls

Most important for users are the first few lines (args)

  • Define optional and required inputs

args(log)
function (x, base = exp(1)) 

Two arguments:

  • x: the input
  • base: optional 2nd argument with default \(e\) (natural log)

Function Calls

Every argument has a name, but you are not always required to type it

R is usually smart about knowing what you meant

These are equivalent:

log(10)
log(10, exp(1))
log(x = 10)
log(x = 10, exp(1))
log(x = 10, base = exp(1))
log(base = exp(1), x = 10)
log(base = exp(1), 10)
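
Names matter most when the values could plausibly go in either order. A minimal sketch using only log from base R; the variable names are illustrative.

```r
by_position <- log(100, 10)          # x = 100, base = 10, so 2
swapped     <- log(10, 100)          # x = 10, base = 100, so 0.5
by_name     <- log(base = 10, 100)   # base is named; 100 fills x, so 2 again

# Compare with a tolerance rather than ==, since these are floating-point values
same <- isTRUE(all.equal(by_position, by_name))
```

Naming the argument makes the intent explicit, so a swapped order does not silently change the result.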

Vector Semantics

Often when dealing with data, we want to transform related data similarly:

  • E.g., change all temperatures in data set from F to C

Dangerous to only do part

R has vectorized semantics - whenever possible, do same operation to all numbers together

1:10 # A vector
 [1]  1  2  3  4  5  6  7  8  9 10
sqrt(1:10) # Keeps same vector structure
 [1] 1.000000 1.414214 1.732051 2.000000 2.236068 2.449490 2.645751 2.828427
 [9] 3.000000 3.162278
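
The temperature example can be sketched the same way. The readings below are made up for illustration; the point is that one expression converts every element, so no value is left behind in the wrong units.

```r
temps_f <- c(32, 50, 68, 212)        # hypothetical readings in Fahrenheit
temps_c <- (temps_f - 32) * 5 / 9    # one expression converts every element
temps_c                              # 0, 10, 20, 100
```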

Vector Semantics

Most functions in R try to vectorize, but this is not always possible

sqrt(1:10)
 [1] 1.000000 1.414214 1.732051 2.000000 2.236068 2.449490 2.645751 2.828427
 [9] 3.000000 3.162278
cospi((1:8) / 4)
[1]  0.7071068  0.0000000 -0.7071068 -1.0000000 -0.7071068  0.0000000  0.7071068
[8]  1.0000000

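But some functions collapse the whole vector into a single value:

sum(1:10)
[1] 55

and others rearrange the vector as a whole: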

rev(1:10)
 [1] 10  9  8  7  6  5  4  3  2  1
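
A quick way to tell these kinds of function apart is to compare the length of the output with the length of the input. A small sketch with illustrative variable names:

```r
x <- 1:10
len_elementwise <- length(sqrt(x))   # 10: one output per input
len_summary     <- length(sum(x))    # 1: the whole vector becomes one number
len_reordered   <- length(rev(x))    # 10: same length, but in a different order
```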

Vectorized Semantics

The [1] you sometimes see is R just letting you know where in a vector you are

sqrt(1:25)
 [1] 1.000000 1.414214 1.732051 2.000000 2.236068 2.449490 2.645751 2.828427
 [9] 3.000000 3.162278 3.316625 3.464102 3.605551 3.741657 3.872983 4.000000
[17] 4.123106 4.242641 4.358899 4.472136 4.582576 4.690416 4.795832 4.898979
[25] 5.000000

When we work with 2D data (later), R prints both row and column indices

     [,1] [,2] [,3] [,4] [,5]
[1,]    1    6   11   16   21
[2,]    2    7   12   17   22
[3,]    3    8   13   18   23
[4,]    4    9   14   19   24
[5,]    5   10   15   20   25
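
The grid above is consistent with a 5-by-5 matrix filled column by column. A sketch assuming it was built with matrix(); the slide does not show the original call.

```r
m <- matrix(1:25, nrow = 5, ncol = 5)  # values fill down the columns by default
m[2, 3]   # row 2, column 3: 12
dim(m)    # 5 5
```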

Variables

Often, we want to save several values as a single ‘thing’

x <- 1:5

Here <- is the assignment operator. The vector formed by 1:5 is labeled x

Naming x later gives us this vector

x
[1] 1 2 3 4 5

Can be used in functions

sqrt(x)
[1] 1.000000 1.414214 1.732051 2.000000 2.236068

Vectors and Variables

If you need to write a vector ‘by hand’, use the c function:

x <- c(1, 4, 9, 16, 25)
x
[1]  1  4  9 16 25

then

sqrt(x)
[1] 1 2 3 4 5

We won’t usually hand-write vectors like this:

  • Data comes in vectors (e.g., spreadsheet columns)

Vector Access

Use [] operator to get individual elements of a vector:

x <- sqrt(1:10)
x[4]
[1] 2

Can do more complex indexing, but we won’t use it much:

x[1:5]
[1] 1.000000 1.414214 1.732051 2.000000 2.236068
x[6:10]
[1] 2.449490 2.645751 2.828427 3.000000 3.162278
x[-1]
[1] 1.414214 1.732051 2.000000 2.236068 2.449490 2.645751 2.828427 3.000000
[9] 3.162278
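
Two more indexing patterns in the same spirit, offered as a sketch rather than something the course relies on: a vector of positions picks several elements at once, and a negative range drops several at once.

```r
x <- sqrt(1:10)
picked  <- x[c(1, 4, 9)]    # elements 1, 4 and 9: 1 2 3
dropped <- x[-(1:8)]        # everything except the first eight elements
n_left  <- length(dropped)  # 2: only elements 9 and 10 remain
```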

In-Class Activity

Weekly Lab

Lab #03

New topics to cover:

  1. Classes of objects
  2. Using Packages
  3. Comments
  4. Writing functions
  5. Control Flow

Object classes

Everything in R has a type or class:

  • Integer, Double (i.e. 64 bit number allowing decimals), Character, …

All vector elements must have the same class - this is the vector’s class

x <- 1:5
y <- sqrt(x)

class(x)
[1] "integer"
class(y)
[1] "numeric"
class(letters)
[1] "character"
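
Because every element must share one class, R quietly converts mixed inputs to the most general class present. A short sketch of that behavior; the variable names are illustrative.

```r
mixed_num  <- c(1L, 2.5)    # integer + double
mixed_char <- c(1, "a")     # number + text
mixed_lgl  <- c(TRUE, 2L)   # logical + integer

class(mixed_num)    # "numeric"
class(mixed_char)   # "character": the 1 is stored as "1"
class(mixed_lgl)    # "integer": TRUE becomes 1
```

This is worth remembering when a column of numbers unexpectedly shows up as character: a single stray text value is usually the cause.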

Using Packages

A package is a set of code (and data) packaged up for distribution and use

R has many helpful packages - these are distributed via CRAN (currently over 20,000)

Using packages is a two-step process:

  1. Get package from CRAN to your computer (one time)
  2. Load package into R (every session)

Think of regular software: you download MS Office once but need to start it whenever you want to use it

Using Packages

The install.packages function will download and install a package:

install.packages("ggplot2")

If that package uses other packages, R will sort that out automatically

When ready to use a package, use the library() command to ‘start’ it:

library(ggplot2)

Now I have access to everything in that package
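
The two steps can be combined so that the install only happens when needed. This is one possible pattern, not a course requirement. The sketch uses the stats package, which ships with R, so nothing is downloaded when it runs; substituting "ggplot2" or another CRAN package name is the assumed real use.

```r
pkg <- "stats"  # ships with R; substitute "ggplot2" or any CRAN package

# Step 1 (one time): install only if the package is not already available
if (!requireNamespace(pkg, quietly = TRUE)) {
  install.packages(pkg)
}

# Step 2 (every session): attach the package; character.only is needed
# because the package name is stored in a variable
library(pkg, character.only = TRUE)

# A single function can also be used without attaching, via package::function
med <- stats::median(c(1, 5, 9))   # 5
```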

Comments

Comments are text inside the code that R ignores

  • Everything following a # gets ignored

Compare

tan(45 * pi / 180) Compute the tangent of 45 degrees
Error in parse(text = input): <text>:1:20: unexpected symbol
1: tan(45 * pi / 180) Compute
                       ^

with

tan(45 * pi / 180) # Compute the tangent of 45 degrees
[1] 1

Comments

The best comments don’t just say what you are doing. They say why you are doing something in the way it is being done

More discussion of comments later as we write more complex code
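
A small sketch of the difference between a "what" comment and a "why" comment, reusing the tangent example. The variable names are illustrative.

```r
angle_deg <- 45

# "What" comment (adds little): multiply by pi and divide by 180
# "Why" comment (more useful): tan() expects radians, so convert from degrees first
angle_rad <- angle_deg * pi / 180
result <- tan(angle_rad)   # approximately 1
```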

FAQs from PA #03

Vector Index Printing

Default vector printing:

1:10
 [1]  1  2  3  4  5  6  7  8  9 10

Each line gets a new index:

sqrt(1:10)
 [1] 1.000000 1.414214 1.732051 2.000000 2.236068 2.449490 2.645751 2.828427
 [9] 3.000000 3.162278

More complex objects have alternate print styles:

matrix(1:9, nrow=3, ncol=3)
     [,1] [,2] [,3]
[1,]    1    4    7
[2,]    2    5    8
[3,]    3    6    9

Print width is controlled by getOption("width").

Recycling Rules

Alignment by default:

x <- 1:3
y <- 4:6
x + y
[1] 5 7 9

Recycling by default:

x <- 1
y <- 4:6
x + y
[1] 5 6 7

Recycle warning when vectors don’t fit together cleanly:

x <- 1:2
y <- 4:6
x + y
Warning in x + y: longer object length is not a multiple of shorter object
length
[1] 5 7 7

Recycling Warning

x <- 1:2
y <- 4:6
x + y
Warning in x + y: longer object length is not a multiple of shorter object
length
[1] 5 7 7

Not a problem per se, but often a sign that something has gone wrong.

  • scalar + vector is usually safe
  • 2 vectors of same size is usually safe
  • 2 vectors of different sizes usually signal a programming mistake
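
Recycling is silent when the longer length is an exact multiple of the shorter one, which is why it can hide mistakes. The sketch below shows that case, plus a hypothetical helper (add_strict, not a built-in) that refuses mismatched lengths outright.

```r
x <- 1:2
y <- 1:6
z <- x + y   # x is reused as 1 2 1 2 1 2; no warning because 6 is a multiple of 2
z            # 2 4 4 6 6 8

# Hypothetical helper: stop with an error unless both vectors have the same length
add_strict <- function(a, b) {
  stopifnot(length(a) == length(b))
  a + b
}
strict_sum <- add_strict(1:3, 4:6)   # 5 7 9
```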

Warnings vs Errors

  • Warnings: heuristics pointing at a typical problem
    • Code still ran and returned a result
    • Try to fix these unless you’re certain it’s not a problem
  • Errors: code failed to execute
    • You have to fix these to run your code
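
The difference can be seen directly: a warning still leaves you with a result, while an error leaves you with nothing unless you catch it. A minimal sketch using base R's suppressWarnings and tryCatch; the variable names are illustrative.

```r
# A warning: R complains, but a result still comes back
# (suppressWarnings hides the message here so only the value is kept)
partial <- suppressWarnings(1:2 + 4:6)   # 5 7 7

# An error: no result at all, so we catch it to keep the script going
outcome <- tryCatch(
  1 + "a",                               # adding a number and text fails
  error = function(cond) "error caught"
)
outcome                                  # "error caught"
```

Hiding a warning is only reasonable once you are certain it is harmless; the usual fix is to change the code that triggers it.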

Changing Functions

Most built-in functions can’t / shouldn’t be changed.

Some allow alternate behavior via additional arguments:

log(10) # Default is natural (base e) logarithm
[1] 2.302585
log(10, base=10)
[1] 1

If you want different behavior, write your own function:

cosd <- function(x){
    ## Cosine in degrees
    cos(x * pi / 180)
}
cosd(90)
[1] 6.123234e-17
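
Your own functions can also take optional arguments with defaults, just like base in log. A sketch under that assumption; cos_angle and its degrees argument are made-up names, not part of R.

```r
cos_angle <- function(x, degrees = TRUE) {
    ## Cosine of x, read as degrees unless degrees = FALSE
    if (degrees) {
        x <- x * pi / 180   # cos() itself works in radians
    }
    cos(x)
}
in_degrees <- cos_angle(180)                  # -1
in_radians <- cos_angle(pi, degrees = FALSE)  # -1
```

As with cosd(90) above, results that should be exactly zero may come back as tiny nonzero numbers, so compare with all.equal rather than ==.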

Always try ?name to see documentation.

Wrap-Up

Orientation

  • Communicating Results (quarto) ✅
  • R Basics ⬅️
  • Data Manipulation in R
  • Data Visualization in R
  • Getting Data into R
  • Statistical Modeling in R

Review

Introduction to R:

  • Arithmetic
  • Variables and Vectors
  • Functions: Calls, Arguments, Defining
  • Packages
  • Help System

Next Time

Data Frames:

  • Organizing several ‘connected’ vectors into a table
  • Table operations with dplyr

Life Tip of the Week

It’s time to start preparing your taxes. (I know, I know …)

  • Preparing is not the same as filing
    • Preparing is doing the calculations
    • Filing is submitting to IRS
  • Employers and financial institutions should be sending you documents (W2, 1099, etc.)
    • Easier to use them now so you don’t lose them
  • Benefits of starting early:
    • If you get a refund, great!
    • If you owe money, avoid nasty surprise.
  • You can still make certain 2025 tax moves and get the tax benefit (IRA, HSA, etc.)

IRS Free File

If your income is less than ~$90K, the IRS FreeFile program means you can use TaxAct, etc. for free.

Musical Treat