Software Tools for Data Analysis
STA 9750
Michael Weylandt
Week 3 – Thursday 2026-02-19
Last Updated: 2026-02-19
Upcoming TODO
Upcoming student responsibilities:
| Date | Time | Deliverable |
|---|---|---|
| 2026-02-19 | 11:59pm ET | Team Roster Submission |
| 2026-02-20 | 11:59pm ET | Mini-Project #00 Due |
| 2026-02-26 | 6:00pm ET | Team Contract Due |
| 2026-02-26 | 6:00pm ET | Project Proposal Presentation Slides Due |
| 2026-03-01 | 11:59pm ET | Mini-Project Peer Feedback #00 Due |
| 2026-03-05 | 6:00pm ET | Pre-Assignment #05 Due |
| 2026-03-12 | 6:00pm ET | Pre-Assignment #06 Due |
| 2026-03-13 | 11:59pm ET | Mini-Project #01 Due |
| 2026-03-19 | 6:00pm ET | Pre-Assignment #07 Due |
Course Project
Course project description is now online
Detailed discussion of:
- Project structure
- Key deadlines
- Grading rubrics
Please send me questions!
Team rosters due at midnight tonight. If you don’t have a team, stick around after class and meet others without a team.
Course Project
- Teams: 4-6 classmates
- Stages:
- Proposal (in class presentation)
- Mid-semester check-in (in class presentation)
- Final: in class presentation, individual report, summary report
- Structure:
- Shared “Overarching Question” (OQ)
- Individual “Specific Question” (SQ, one per teammate)
Full description online
Course Project
Conceit:
- Consultants hired by client interested in qualitative OQ
- You break OQ into several quantitative SQs
- Combine results from SQs to answer OQ
Process:
- Proposal: Sales pitch for consultants - here’s the OQ we will answer
- Mid-Semester: Scope of work - here’s the SQs and data we will use
- End-of-Semester: Presentation of Results
- Group Report: “Executive Summary”
- Individual Reports: “Technical Appendices”
Course Project
Next week: project proposal presentations
Major goal:
- What is the overarching / motivating question for your project?
See instructions for more
Post slides and team contract to Brightspace before class - you will present from your computer
Overarching Question
Example Topic: NYC Apartment Prices
Very Bad: Is there a relationship between rent and location?
- Problems: Binary, Obvious, Limited room for exploration
Bad: What is the correlation between rent and location?
- Problems: Simple (one numeric answer), Linear, Limited room for exploration
Good: What is the relationship between rent and location?
- Strengths: Allows more interesting analysis, subgroups / differential impact
- Weaknesses: Very broad (lots has been said about this), not clear how to split up
Great: What are the key drivers of NYC rents and do they suggest multiple market segments with different relative priorities?
- Strengths: Allows project to be split up (different factors), clear intent to integrate findings, working hypothesis (multiple segments)
Finding Data
Advice:
- Start early!
- NYC Open Data is great
- Nothing paid or private without express instructor permission
- Everyone loves spatial data!
- Avoid overdone topics (NYC rents, NYC schools) - new topics => low-hanging fruit
Presentation Hints
- Longest time \(\neq\) most important
- Story, story, story! Why are you making these choices?
- Hourglass Structure
- Start big
- Motivate your overarching question
- Specific questions
- Tie specific to overarching
- From overarching back to big motivation
- At least one figure every other slide
Special Presentation
Data Science Resources at the Baruch Libraries
Presenter: Jason Amey (Baruch Business Librarian and Former STA 9750 Student)
Mini-Project #00
Mini-Project #00 is due 2026-02-20 at 11:59pm ET
- Automatic grace period until Sunday at midnight
As of 2026-02-19:
- 9 Users registered via Piazza
- 9 Posts on GitHub
- 8 pass automated tests
- 1 need more work (see comments)
Recall that you need to submit via GitHub, Brightspace, and Piazza for this assignment - sending GitHub ID is most important
MP #00 Goals
Mini-Project #00 is ungraded to:
- Sort out tech trouble
- Show danger of waiting to the last minute
Lessons learned will be helpful in future assignments
Some sites look really great - excited to see headshots, resumes, etc. and to learn more about you
MP #00 Peer Feedback
On 2026-02-23, I will assign peer-feedback:
- For MP #00, no real scores (give all 10s!): practice with process
- I will provide an R function to solicit your peer feedback
- You provide your GitHub ID and your secret phrase
- Answer questions
- Upload result file to Brightspace
- Full details to follow via Brightspace announcement
MP #00 Peer Feedback
Aims:
- Learn to read and evaluate code
- In analysis, rarely right and wrong; definitely better and worse
- Learn tricks to improve your own site
“Good artists copy; great artists steal.” – Steve Jobs
More discussion in advance of MP #01 feedback cycle
CISSOID
New this semester: cissoid
CIS Scripts for Organizing Instructional Delivery
Course robot to help guarantee submissions are complete and to provide automated help where possible
Course Support
Asynchronous Support: Piazza
- 21 students registered in Piazza
Synchronous Support: Office Hours
- Wednesdays at 5pm (In-Person)
- Thursdays at 5pm (Zoom)
Pre-Assignments
Pre-assignment quizzes
- 30 point quiz
- My intent is for everyone to get 30/30
- Take as many times as needed
- Settings tweaked to only re-show incorrect questions
- I sometimes give feedback through Brightspace, so check ‘feedback’ if you are alerted to any
No pre-assignment next week - presentations instead
Mini-Project #01
Mini-Project #01 officially released today
- Due 2026-03-13 at 11:59pm ET (22 Days)
- Topic: Assessing the Impact of SFFA on Campus Diversity One-Year Later
- Analysis of IPEDS Data
- Graded out of 80
- 30 points for completed submissions
Mini-Project #01
7 ‘Tasks’ covering:
- data import (#1),
- data preparation / feature engineering (#2-3)
- Exploratory Data Analysis (#4-6), and
- writing up your analysis (#7)
For this MP, you are a college president describing changes (or lack of changes) to the demographics of your student body post-SFFA.
Today
- Brief Review of Quarto Render + Git Commit Cycle
- Introduction to R Programming
- Key Ideas in R: Data Frames, Functions, Packages and Control Flow
- Wrap Up
Render + Commit Cycle
In Mini-Project #00, your goal is to get rendered HTML to GitHub
- ‘Kick-the-tires’ on Set-Up
- Things will go wrong - better now than later
Once you do this once, cycle becomes easier
Git Workflow
Three key commands:
git add: add some changes to a ‘box’
git commit: seal the ‘box’
git push: send the ‘box’ to GitHub
Git pane in RStudio shows uncommitted changes, not files.
If a file ‘vanishes’ after a commit, that’s good!
Render + Commit
For ongoing changes / updates:
- Render button in RStudio - generate new / changed HTML
- Check boxes next to all changed files - Stage updates
- Hit Commit and use modal to make a Commit - Commit a new version in Git
- Push - Send new version to GitHub. Website will update automatically
Render + Commit
Tips:
- Go to your GitHub.com - see if expected files are present
- When in doubt, better to include everything in docs
- Cycle rapidly - find problems ASAP
Storage is cheap - your time isn’t
Programming in R
It’s now time for us to start writing code in R
No more copy and trust
Goals:
- Modify existing code to new applications
- Write code to use existing libraries
- Read and debug code
Execution Model
Three models of executing code:
- Line-by-line at Console
- REPL: Read Evaluate Print Loop
- Best for transient, one-off actions; trying new things
- Script writing in a separate file
- Write in a separate (.R) file
- Executes in same session; persistent state
- Best for longer analyses with complex commands, developing code
- Code in a Quarto document
- Write code in chunks inside a qmd file
- Executes in a fresh session
- Best for documenting and conveying analysis, archiving results
Arithmetic in R
Basic arithmetic in R runs as expected
PEMDAS Ordering: Parentheses, Exponents, Multiplication/Division, Addition/Subtraction
Arithmetic in R
\[3^{2 * 5 - 1} / 24^5\]
\[\frac{1^1 + 2^2 + 3^3}{3^1 + 2^2 + 1^3}\]
(1^1 + 2^2 + 3^3) / (3^1 + 2^2 + 1^3)
When in doubt, extra parentheses don’t hurt
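A sketch of how the two expressions above translate into R (the variable names `first` and `second` are illustrative, not from the slides). The exponent in the first expression needs explicit parentheses so that the whole of `2 * 5 - 1` is used as the power:

```r
# 3^(2*5 - 1) / 24^5: parentheses keep the whole exponent together
first <- 3^(2 * 5 - 1) / 24^5

# Parentheses group the numerator and the denominator before dividing
second <- (1^1 + 2^2 + 3^3) / (3^1 + 2^2 + 1^3)

first
second  # (1 + 4 + 27) / (3 + 4 + 1) = 32 / 8 = 4
```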
Function Calls
To go beyond arithmetic, need to invoke functions
\[ \cos(\pi) + \tan\left(\frac{\pi}{4}\right) + \sqrt{\sin(\pi/2)} - e^1\]
cos(pi) + tan(pi / 4) + sqrt(sin(pi/2)) - exp(1)
Function Calls
All function calls have a fundamental syntax:
name()
e.g.,
To get help with any function in R, type ?name
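As a small illustration of the `name()` syntax, here are two calls that take no arguments (these particular functions are chosen for illustration and are not taken from the slides):

```r
# Empty parentheses: the function runs without any input
today <- Sys.Date()   # the current date
here <- getwd()       # the current working directory

today
here
```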
Function Calls
Most interesting functions require input:
name(argument)
Here, the argument is input to the function:
Multiple arguments are separated by commas
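A minimal sketch of calls with one and with two arguments (the specific functions and the names `root` and `rounded` are illustrative choices):

```r
root <- sqrt(16)                        # one argument
rounded <- round(3.14159, digits = 2)   # two arguments, separated by a comma

root      # 4
rounded   # 3.14
```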
Function Calls
Type a name without () to see its implementation
function (x) .Primitive("cos")
and
function (formula, data, subset, weights, na.action, method = "qr",
model = TRUE, x = FALSE, y = FALSE, qr = TRUE, singular.ok = TRUE,
contrasts = NULL, offset, ...)
{
ret.x <- x
Conceptually sqrt vs sqrt(4) is “concept of square rooting” vs “the actual square root of 4, i.e., 2”
Function Calls
Most important for users are the first few lines (args)
- Define optional and required inputs
function (x, base = exp(1))
Two arguments:
x: the input
base: optional 2nd argument with default \(e\) (natural log)
Function Calls
Every argument has a name, but you are not always required to write it
R is usually smart about knowing what you meant
These are equivalent:
log(10)
log(10, exp(1))
log(x = 10)
log(x = 10, exp(1))
log(x = 10, base = exp(1))
log(base = exp(1), x = 10)
log(base = exp(1), 10)
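One way to check the claim is to evaluate every form and confirm the results agree. This is a sketch; the name `results` is illustrative:

```r
results <- c(
  log(10),
  log(10, exp(1)),
  log(x = 10),
  log(x = 10, exp(1)),
  log(x = 10, base = exp(1)),
  log(base = exp(1), x = 10),
  log(base = exp(1), 10)
)

# All seven calls should give the same value (about 2.302585)
all_same <- all(abs(results - log(10)) < 1e-12)
all_same
```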
Vector Semantics
Often when dealing with data, we want to transform related data similarly:
- E.g., change all temperatures in data set from F to C
Dangerous to only do part
R has vectorized semantics - whenever possible, do same operation to all numbers together
sqrt(1:10) # Keeps same vector structure
[1] 1.000000 1.414214 1.732051 2.000000 2.236068 2.449490 2.645751 2.828427
[9] 3.000000 3.162278
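The Fahrenheit-to-Celsius case mentioned above can be sketched the same way. The readings in `temps_f` are made-up values for illustration:

```r
temps_f <- c(32, 68, 212)           # hypothetical Fahrenheit readings
temps_c <- (temps_f - 32) * 5 / 9   # the arithmetic is applied to every element
temps_c                             # 0 20 100
```

Because the whole vector is converted in one step, there is no risk of converting only part of the data.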
Vector Semantics
Most functions in R try to vectorize, but not always possible
[1] 1.000000 1.414214 1.732051 2.000000 2.236068 2.449490 2.645751 2.828427
[9] 3.000000 3.162278
[1] 0.7071068 0.0000000 -0.7071068 -1.0000000 -0.7071068 0.0000000 0.7071068
[8] 1.0000000
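The calls behind the two outputs above are not shown. As an assumption, `sqrt(1:10)` matches the first and `cospi((1:8) / 4)` gives values like the second. `max()` is added here as an example of a function that summarizes a vector instead of working element by element:

```r
x <- 1:10
roots <- sqrt(x)           # vectorized: ten inputs, ten outputs
roots

waves <- cospi((1:8) / 4)  # cos(pi * x) for each element; also vectorized
waves

biggest <- max(x)          # a summary: ten inputs, one output
biggest
```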
Vectorized Semantics
The [1] you sometimes see is R just letting you know where in a vector you are
[1] 1.000000 1.414214 1.732051 2.000000 2.236068 2.449490 2.645751 2.828427
[9] 3.000000 3.162278 3.316625 3.464102 3.605551 3.741657 3.872983 4.000000
[17] 4.123106 4.242641 4.358899 4.472136 4.582576 4.690416 4.795832 4.898979
[25] 5.000000
When we do 2D data (later), we get column and row indices
[,1] [,2] [,3] [,4] [,5]
[1,] 1 6 11 16 21
[2,] 2 7 12 17 22
[3,] 3 8 13 18 23
[4,] 4 9 14 19 24
[5,] 5 10 15 20 25
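Assuming the outputs above came from `sqrt(1:25)` and from a 5-by-5 matrix filled with `1:25` (the printed values are consistent with that), a sketch of the calls:

```r
roots <- sqrt(1:25)
roots        # the bracketed number is the position of the first value on each line

grid <- matrix(1:25, nrow = 5, ncol = 5)   # filled column by column
grid         # [i,] labels row i, [,j] labels column j
grid[2, 3]   # row 2, column 3: 12
```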
Variables
Often, we want to save several values as a single ‘thing’
The <- symbol is the assignment operator: in x <- 1:5, the vector formed by 1:5 is labeled x
Naming x later gives us this vector
Can be used in functions
[1] 1.000000 1.414214 1.732051 2.000000 2.236068
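Put together, the steps on this slide can be sketched as:

```r
x <- 1:5          # assignment: the name x now refers to the vector 1, 2, 3, 4, 5
x                 # naming x prints the stored vector
roots <- sqrt(x)  # x can be passed to a function like any other value
roots
```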
Vectors and Variables
If you need to write a vector ‘by hand’, use the c function:
x <- c(1, 4, 9, 16, 25)
x
then
We won’t usually hand-write vectors like this:
- Data comes in vectors (e.g., spreadsheet columns)
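A short sketch of what can be done with a hand-written vector once it exists (the follow-up calls are illustrative):

```r
x <- c(1, 4, 9, 16, 25)   # c() combines the values into one vector
x
n <- length(x)            # number of elements: 5
roots <- sqrt(x)          # 1 2 3 4 5
n
roots
```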
Vector Access
Use [] operator to get individual elements of a vector:
Can do more complex indexing, but we won’t use it much:
[1] 1.000000 1.414214 1.732051 2.000000 2.236068
[1] 2.449490 2.645751 2.828427 3.000000 3.162278
[1] 1.414214 1.732051 2.000000 2.236068 2.449490 2.645751 2.828427 3.000000
[9] 3.162278
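Assuming `x <- sqrt(1:10)`, which matches the printed values above, indexing calls along these lines would produce them (the exact calls used on the slide are not shown, so treat these as one possibility):

```r
x <- sqrt(1:10)

x[4]      # a single element: 2
x[1:5]    # the first five elements
x[6:10]   # the last five elements
x[-1]     # a negative index drops that element: everything except the first
```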
Weekly Lab
Lab #03
New topics to cover:
- Classes of objects
- Using Packages
- Comments
- Writing functions
- Control Flow
Object classes
Everything in R has a type or class:
- Integer, Double (i.e. 64 bit number allowing decimals), Character, …
All elements of a vector must have the same class - this is the vector’s class
x <- 1:5
y <- sqrt(x)
class(x)
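Extending the snippet above slightly (the vector `z` is an added illustration), a sketch of how classes behave:

```r
x <- 1:5
y <- sqrt(x)

class(x)    # "integer"
class(y)    # "numeric" (stored as a double)
typeof(y)   # "double"

z <- c(1, "a", TRUE)   # mixed input is coerced to a single class
class(z)               # "character"
```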
Using Packages
A package is a set of code (and data) packaged up for distribution and use
R has many helpful packages - these are distributed via CRAN (currently over 20,000)
Using packages is a two-step process:
- Get package from CRAN to your computer (one time)
- Load package into R (every time)
Think of regular software: you download MS Office once but need to start it whenever you want to use it
Using Packages
The install.packages function will download and install a package:
install.packages("ggplot2")
If that package uses other packages, R will sort that out automatically
When ready to use a package, use the library() command to ‘start’ it:
Now I have access to everything in that package
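A minimal sketch of the two-step process. To keep it runnable without a network connection, it uses `tools`, a package that ships with R, as a stand-in; swapping in `"ggplot2"` follows the same pattern. The `requireNamespace()` guard is an added convenience so the install step only runs when the package is missing:

```r
pkg <- "tools"   # stand-in package that ships with R; could be "ggplot2"

# Step 1 (one time): install only if the package is not already on this computer
if (!requireNamespace(pkg, quietly = TRUE)) {
  install.packages(pkg)
}

# Step 2 (every session): load the package
library(pkg, character.only = TRUE)

# Functions from the package can now be called directly
ext <- file_ext("analysis.qmd")
ext   # "qmd"
```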
Vector Index Printing
Default vector printing:
Each line gets a new index:
[1] 1.000000 1.414214 1.732051 2.000000 2.236068 2.449490 2.645751 2.828427
[9] 3.000000 3.162278
More complex objects have alternate print styles:
matrix(1:9, nrow=3, ncol=3)
[,1] [,2] [,3]
[1,] 1 4 7
[2,] 2 5 8
[3,] 3 6 9
Print width is controlled by getOption("width").
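A sketch of how the width option changes where lines break (the widths 80 and 40 are arbitrary choices for illustration):

```r
old <- options(width = 80)                    # remember the previous setting
wide <- capture.output(print(sqrt(1:10)))     # printed lines at width 80

options(width = 40)
narrow <- capture.output(print(sqrt(1:10)))   # same vector, narrower console

options(old)                                  # restore the previous width

length(wide)     # fewer, longer lines
length(narrow)   # more, shorter lines, each starting with its own index
narrow
```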
Recycling Rules
Alignment by default:
Recycling by default:
Recycle warning when vectors don’t fit together cleanly:
Warning in x + y: longer object length is not a multiple of shorter object
length
Recycling Warning
Warning in x + y: longer object length is not a multiple of shorter object
length
Not a problem per se, but often a sign that something has gone wrong.
- scalar + vector is usually safe
- 2 vectors of same size is usually safe
- vectors of different size is usually a programming mistake
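A sketch of the three cases, using made-up vectors. The last case captures the warning message with `tryCatch()` so that it can be inspected rather than just printed:

```r
x <- c(1, 2, 3, 4)
y <- c(10, 20, 30, 40)

aligned <- x + y            # same length, matched element by element: 11 22 33 44
scalar <- x + 10            # scalar recycled to every element: 11 12 13 14
recycled <- x + c(10, 20)   # shorter vector used twice: 11 22 13 24

# Lengths 5 and 2 do not fit together cleanly: R still returns a result...
uneven <- suppressWarnings(1:5 + 1:2)   # 2 4 4 6 6

# ...but it also raises a warning, captured here as text
msg <- tryCatch(1:5 + 1:2, warning = function(w) conditionMessage(w))

aligned
scalar
recycled
uneven
msg
```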
Warnings vs Errors
- Warnings: heuristics pointing at typical problem
- Code still executed without a problem
- Try to fix these unless you’re certain it’s not a problem
- Errors: code failed to execute
- You have to fix these to run your code
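A minimal sketch of the difference, using two illustrative calls. The first produces a warning but still returns a value; the second fails outright, so `tryCatch()` is used to capture the error message without stopping the script:

```r
# Warning: the code runs and returns a value (NA), but R flags a likely problem
value <- as.numeric("12a")
value

# Error: the expression cannot be evaluated at all
result <- tryCatch(
  "a" + 1,
  error = function(e) conditionMessage(e)
)
result   # the text of the error message
```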
Changing Functions
Most built-in functions can’t / shouldn’t be changed.
Some allow alternate behavior via additional arguments:
log(10) # Default is natural (base e) logarithm
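For example, the `base` argument of `log()` switches to a different logarithm (the result names here are illustrative):

```r
base10 <- log(100, base = 10)   # base-10 logarithm: 2
base2 <- log(8, base = 2)       # base-2 logarithm: 3
shortcut <- log10(100)          # built-in shortcut for base 10: 2

base10
base2
shortcut
```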
If you want different behavior, write your own function:
cosd <- function(x){
## Cosine in degrees
cos(x * pi / 180)
}
cosd(90)
Always try ?name to see documentation.
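A sketch that combines a custom function with an optional argument and simple control flow. The function name `cos_angle` and its `degrees` argument are hypothetical, not part of R or of the slides:

```r
cos_angle <- function(x, degrees = FALSE) {
  ## Cosine of x; set degrees = TRUE when x is measured in degrees
  if (degrees) {
    x <- x * pi / 180   # convert degrees to radians first
  }
  cos(x)
}

in_radians <- cos_angle(pi)                    # default: radians, gives -1
in_degrees <- cos_angle(180, degrees = TRUE)   # same angle in degrees, gives -1

in_radians
in_degrees
```

Giving `degrees` a default value keeps the common case short while still allowing the alternate behavior, the same pattern `log()` uses for `base`.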
Orientation
- Communicating Results (quarto) ✅
- R Basics ⬅️
- Data Manipulation in R
- Data Visualization in R
- Getting Data into R
- Statistical Modeling in R
Review
Introduction to R:
- Arithmetic
- Variables and Vectors
- Functions: Calls, Arguments, Defining
- Packages
- Help System
Next Time
Data Frames:
- Organizing several ‘connected’ vectors into a table
- Table operations with dplyr
IRS Free File
If your income is less than ~$90K, the IRS FreeFile program means you can use TaxAct, etc. for free.
Comments
Comments are text inside the code that R ignores
Compare the same code with and without comments
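A hypothetical pair of snippets showing the same computation without and with comments (the compound-interest calculation and all variable names are illustrative, not from the slides):

```r
# Version 1: no explanation, so the reader has to work out the intent
r <- 0.05
v <- 1000 * (1 + r)^10

# Version 2: the same calculation, with comments and clearer names
rate <- 0.05                       # annual interest rate
years <- 10                        # investment horizon in years
value <- 1000 * (1 + rate)^years   # value of $1,000 after compounding

v
value
```

R skips everything after `#` on a line, so both versions compute the same number.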