Software Tools for Data Analysis
STA 9750
Michael Weylandt
Week 6 – Thursday 2026-03-12
Last Updated: 2026-01-14

STA 9750 Week 6

Today: Lecture #05: Multi-Table dplyr Verbs

Today

Course Administration
Mini-Project #01 Reflection
Project Proposal Presentations
Wrap-Up
- Life Tip of the Day
Time Allowing: dplyr vs SQL

MP#01 - Peer Feedback

Assigned on GitHub - due on 2026-03-22

\(\approx 4\) feedbacks each
Take this seriously: around 20% of this assignment is “meta-review”
Goal: rigorous constructive critique

Use helper functions to find submissions assigned to you. Ask on Piazza if still having trouble.

MP#01 - Peer Feedback

Submissions may not map perfectly to rubric - use your best judgement

Be generous but serious:

Goal is improvement, so “everything is great, no comments” is unhelpful
- Nothing is completely right nor completely wrong
Remember, meta-review (TA scores of your feedback) to follow
Extra Credit is limited per instructions; only award for exceptional work

Learn from this! What can you adapt for MP#02?

MP#01 - Peer Feedback

Example of poor feedback:

Scores
Written Communication: 10
Project Skeleton: 10
Formatting & Display: 10
Code Quality: 10
Data Preparation: 10
Extra Credit: 10

Comments
Website looks really good, nice and clean!

Written Communication
Excellent and straight to the point.

Project Skeleton
Solid skeleton, well-organized .

Formatting & Display
Nicely formatted and balanced across the website.

Code Quality
Code runs like Forrest Gump in a Slump.

Data Preparation
Very great. 

Extra Credit
Added graphes for even better understanding.

Improper formatting
Superficial comments / no sign of actually reading work

Red Flag: Repeated verbatim on several posts

Reminder: Poor feedback \(\neq\) poor work.

MP#01 - Peer Feedback

Example of medium feedback:

## Scores 

- Written Communication: 8
- Project Skeleton: 10
- Formatting & Display: 9
- Code Quality: 10
- Data Preparation: 10
- Extra Credit: 1

## Comments

Love the visuals for Press Releases.

### Written Communication

A short summary or a description of this project would be great to add.

### Project Skeleton

Code completes all instructor-provided tasks correctly.

### Formatting & Display

Tables have well-formatted column names; caption would be great.

### Code Quality

Code is clear and well written.

### Data Preparation

Automatic (10/10). Out of scope for this mini-project.

### Extra Credit

I found all the Press Releases very interesting. The visuals were a great touch.

Proper formatting
Gave actionable suggestions
Directionally correct, but a bit vague

MP#01 - Peer Feedback

Example of great feedback:


## Scores

- Written Communication: 9
- Project Skeleton: 10
- Formatting & Display: 9
- Code Quality: 9
- Data Preparation: 10
- Extra Credit: 2

## Comments

### Written Communication

Overall, your writing is clear and easy to follow. I noticed a few small typos, but nothing major — using the built-in spell check in RStudio should catch those quickly. Everything else looks solid, and I didn’t have any major concerns based on my review.

### Project Skeleton

All tasks were completed satisfactorily. 

### Formatting & Display

Overall, your tables and figures are well-organized and clear. There is one table in the "Data" section with column titles that could be formatted a little more cleanly for easier reading. I also noticed a few small typos in some of the table captions, but they should be easy to fix.

### Code Quality

The code quality is generally good, but there are a few minor linter issues. The comments could be more frequent and clearer in some places. For example, the comment `#checking dupolicate so that github will not block it` should be corrected to "duplicate," and the explanation could be clarified to avoid confusion–what duplicate are you referring to, and how does it block GitHub? It might confuse someone else reading the code.

Also, I would recommend moving all library imports to the top of the script, as that is typically considered standard practice.

### Data Preparation

The data preparation looks solid overall. I like how you handle missing files and JSON parsing failures–it demonstrates strong defensive programming.

### Extra Credit

Additional two points for using quarto's video support!

Positive tone
Detailed comments
Noted issues & gave suggestions on how to fix
Noted unclear sections for improvement

MP#01 - Peer Feedback

Lack of prior experience is not a hinderance here:

If something is unclear to you, that’s a problem!
Nothing required super-complex code, so anything overly complex probably could have been a simpler way (except for some above-and-beyond stuff)
You don’t have to be definitive in comments - impressions and questions are just as helpful.

MP#01 - Impressions

👏 Thank you for excellent work here! 👏

I’ve started reading some submissions - really impressive!

Plots (exciting!)
Sophisticated and Insightful Analyses
Lots of K-Pop Demon Hunters
Fun for me to get to learn your interests

Hopefully automated checks helpful. If you have ideas for other useful automated ‘supports’ let me know

MP#01 - Common Questions

Q: Why doesn’t my site look the same on GitHub as it does on my laptop?

A: Missing css and js files. Need to upload everything in docs.

New helper function:

source("https://michael-weylandt.com/STA9750/load_helpers.R")
mp_submission_ready()

Tries to make sure all ‘secondary’ files are present.

MP#01 - Common Questions

Q: Why doesn’t my submission have the right URL?

A: Most computers have case-sensitive file names, Mac doesn’t, and Windows is iffy.

New helper function:

source("https://michael-weylandt.com/STA9750/load_helpers.R")
mp_start()

Generates a very basic qmd file with the correct file name and some relevant context.

Mini-Project #02

MP#02 released - TBD

Due2026-04-03 at 11:59pm ET

GitHub post (used for peer feedback) AND Brightspace
\(\approx\) One Month: don’t wait until the very end

Pay attention to the rubric

Writing and presentation are about 50% of your grade
Evaluated on rigor and thoughtfulness, not necessarily correctness

Thanks to HI and SK for help with Windows-compatibility!

Upcoming Mini-Projects

Tentative Topics

MP#03: NYC Trees - 2026-04-24 at 11:59pm ET
MP#04: US Employment Surveys - 2026-05-15 at 11:59pm ET

Pre-Assignments

Brightspace - day before class

Reading, typically on course website
Brightspace auto-grades
- I have to manually change to completion grading

Next pre-assignment is 2026-03-12 at 6:00pm ET

I missed a few comments in the previous cycle (sorry!) - trying to catch up in the next few days

Course Support

Synchronous
- Virtual Office Hours 2x / week on Class Days
Asynchronous
- Piazza (\(<20\) minute average response time)

Upcoming Week

Due Wednesday at 11:45pm:

Pre-Assignment #07 (Brightspace)
- Introduction to plotting with ggplot2

Expect back:

MP#01 grades
Project proposal instructor feedback

Project Proposals

Official Description

6 minute presentation
Key topics:
- Animating Question
- Team Roster
Also discuss: Possible specific questions, data sources, analytical plan, anticipated challenges

Most important: team names!

Previous: Rat Pack, Subway Surfers, Going for Gold, etc.

After Proposals

100% optional discussion of dplyr vs SQL

SQL is a very common Data Scientist interview topic so if you’re not taking a database course, might be useful

On to the Show!

Presentation Order

TBD

Wrap-Up

Orientation

Communicating Results (quarto) ✅
R Basics ✅
Data Manipulation in R ✅
Data Visualization in R ⬅️
Getting Data into R
Statistical Modeling in R

Upcoming Work

Upcoming work from course calendar

Mini-Project #01 Peer Feedback due on 2026-03-22
Pre-Assignment #07 due 2026-03-12 at 6:00pm ET
Mini-Project #02 due on 2026-04-03 at 11:59pm ET

Life Tip of the Week

Register to Vote

If you want to vote in the upcoming NYC Election, it’s time to register to vote:

https://www.vote.nyc/page/register-vote

Early voting begins October 25th: need to be registered by then

Forms must be received not just mailed

Life Tip of the Week

On the Ballot

Mayor, Comptroller, Advocate
Borough President, Council Member
Judges
5 City Charter Amendments (Official NYC-CFB Voter Guide)

Musical Treat

Optional Material (Time Allowing)

SQL

Structured Query Language (ISO 9075)

Standard language for interacting with Relational Database Management Systems (RDMS)
Divided into three main sub-languages:
- Data Manipulation Language (DML)
- Data Declaration Language (DDL)
- Data Control Language (DCL)
Declarative, not Imperative Programming

SQL Languages

Data Definition Language (DDL)

“Design” language for DB
What tables are present? What types are the columns? What relationships (keys) are enforced?

Data Manipulation Language (DML)

“User” interaction with DB: queries, adding, changing, deleting data
SELECT, GROUP BY, WHERE, HAVING, JOIN, etc.
Roughly coextensive with dplyr

Data Control Language (DCL)

“Security” language for DB
What permissions do different users have?

RDMS

Relational Database Management System (RDMS)

Data management server for complex environments
Relational tabular data structures
Manipulated using SQL
- Common SQL Standard + Many RDMS-specific extensions
ACID Guarantees

DDL

RDMS has a fixed number of tables (roughly data.frames). DDL specifies

Column names and type
Can values be left blank (nullability)?
Identifiers (Primary Keys)
Constraints (non-negative, a < b, etc.)
Relationships to other tables (Foreign Keys)

DDL

Basic command is the CREATE TABLE statement.

TYPE can be INTEGER, DECIMAL, CHAR, VARCHAR, DATE, TIMESTAMP and more

DDL

Our beloved penguins:

Compare:

Rows: 344
Columns: 8
$ species     <fct> Adelie, Adelie, Adelie, Adelie, Adelie, Adelie, Adelie, Ad…
$ island      <fct> Torgersen, Torgersen, Torgersen, Torgersen, Torgersen, Tor…
$ bill_len    <dbl> 39.1, 39.5, 40.3, NA, 36.7, 39.3, 38.9, 39.2, 34.1, 42.0, …
$ bill_dep    <dbl> 18.7, 17.4, 18.0, NA, 19.3, 20.6, 17.8, 19.6, 18.1, 20.2, …
$ flipper_len <int> 181, 186, 195, NA, 193, 190, 181, 195, 193, 190, 186, 180,…
$ body_mass   <int> 3750, 3800, 3250, NA, 3450, 3650, 3625, 4675, 3475, 4250, …
$ sex         <fct> male, female, female, NA, female, male, female, male, NA, …
$ year        <int> 2007, 2007, 2007, 2007, 2007, 2007, 2007, 2007, 2007, 2007…

DDL

We can add a constraint so the year is 2007-2009:

DDL

Constraints for categorical variables (R factors) can be tricky. Create a normalized set of tables:

Sex ID	Name
1	Female
2	Male

Penguin ID	Sex ID	Weight
1	1	4000
2	1	3500
3	2	5000

We want to make sure each Sex ID in table 2 matches a row of the first table

DDL

Constraints for categorical variables (R factors) can be tricky. Create a normalized set of tables:

DDL Normalization

Normalization:

Increases integrity via constraints
Reduces redundancy (need to change value in 1 place only)
Slight performance benefits

More JOIN work in SQL than in dplyr

Proper table design is a science and an art

DML

Data Manipulation Language is for DB users.

Close parallels to dplyr
Compare:

vs.

SELECT * means all columns which matches R’s default

DML

Data Manipulation Language is for DB users.

Close parallels to dplyr
Compare:

vs.

DML

Data Manipulation Language is for DB users.

Close parallels to dplyr
Compare:

vs.

DML

Note that

and

are both

Why?

They commute – same result in either order

DML

R is an imperative paradigm - you say what to do
SQL is a declarative paradigm - you say what you want

Declarative paradigms let the server figure out the best way to compute, but are more restrictive

For a well-defined paradigm like SQL, much work has gone into query optimization

DML

Compare:

vs.

summarize() is implicit in SELECT clause when we have a GROUP BY

DML

But:

Why not WHERE again?

DML

In SQL: - WHERE happens before a GROUP BY; and - HAVING happens after a GROUP BY

In R: - Ordering of summarize() and filter controls execution order

ACID Guarantees

Most RDMSs provide a set of four ACID guarantees:

Atomicity: Changes occur all-or-nothing
- E.g.: Add charge to customer’s account + schedule item for shipping
Consistency: DDL Constraints are always valid
- E.g.: Can’t create a user without having an email on file
Isolation: Two different users don’t “step on each other’s toes”
- E.g.: If I’m updating prices while you are checking out, you get your original price
Durability: When a change is complete, it is guaranteed to be saved to disk

Getting Started with RDMS

Easiest RDMS to use is SQLite

Most ubiquitous DB in the world (by a long-shot)
Already installed on your computer (many times over!)
Access directly through sqlite3 or sqlite3.exe
Access from R with RSQLite package

Excellent documentation

dbplyr

Can use dbplyr (extra b!) to put dplyr-UX on DB

E.g.

dbplyr

# Source:   table<`penguins`> [?? x 8]
# Database: sqlite 3.50.4 [:memory:]
   species island    bill_len bill_dep flipper_len body_mass sex     year
   <chr>   <chr>        <dbl>    <dbl>       <int>     <int> <chr>  <int>
 1 Adelie  Torgersen     39.1     18.7         181      3750 male    2007
 2 Adelie  Torgersen     39.5     17.4         186      3800 female  2007
 3 Adelie  Torgersen     40.3     18           195      3250 female  2007
 4 Adelie  Torgersen     NA       NA            NA        NA <NA>    2007
 5 Adelie  Torgersen     36.7     19.3         193      3450 female  2007
 6 Adelie  Torgersen     39.3     20.6         190      3650 male    2007
 7 Adelie  Torgersen     38.9     17.8         181      3625 female  2007
 8 Adelie  Torgersen     39.2     19.6         195      4675 male    2007
 9 Adelie  Torgersen     34.1     18.1         193      3475 <NA>    2007
10 Adelie  Torgersen     42       20.2         190      4250 <NA>    2007
# ℹ more rows

dbplyr

<SQL>
SELECT `species`, AVG(`body_mass`) AS `mean_mass`
FROM `penguins`
GROUP BY `species`

Extra step - show_query to see SQL or collect() to run in DB:

# A tibble: 3 × 2
  species   mean_mass
  <chr>         <dbl>
1 Adelie        3701.
2 Chinstrap     3733.
3 Gentoo        5076.

dbplyr

<SQL>
SELECT `species`, AVG(`body_mass`) AS `mean_mass`
FROM (
  SELECT `penguins`.*
  FROM `penguins`
  WHERE (`body_mass` > 4000.0)
) AS `q01`
GROUP BY `species`

Compared to:

<SQL>
SELECT `species`, AVG(`body_mass`) AS `mean_mass`
FROM `penguins`
GROUP BY `species`
HAVING (AVG(`body_mass`) > 4000.0)

Not the best SQL, but RDMS can make quick work of it

dbplyr

A more substantial example:

dbplyr

A more substantial example:

CREATE TABLE `AllstarFull` (
  `playerID` TEXT,
  `yearID` INTEGER,
  `gameNum` INTEGER,
  `gameID` TEXT,
  `teamID` TEXT,
  `lgID` TEXT,
  `GP` INTEGER,
  `startingPos` TEXT
);
CREATE INDEX `AllstarFull_playerID` ON `AllstarFull` (`playerID`);
CREATE INDEX `AllstarFull_yearID` ON `AllstarFull` (`yearID`);
CREATE INDEX `AllstarFull_gameID` ON `AllstarFull` (`gameID`);
CREATE INDEX `AllstarFull_teamID` ON `AllstarFull` (`teamID`);
CREATE INDEX `AllstarFull_lgID` ON `AllstarFull` (`lgID`);
CREATE TABLE sqlite_stat1(tbl,idx,stat);
CREATE TABLE sqlite_stat4(tbl,idx,neq,nlt,ndlt,sample);
CREATE TABLE `Appearances` (
  `yearID` INTEGER,
  `teamID` TEXT,
  `lgID` TEXT,
  `playerID` TEXT,
  `G_all` INTEGER,
  `GS` INTEGER,
  `G_batting` INTEGER,
  `G_defense` INTEGER,
  `G_p` INTEGER,
  `G_c` INTEGER,
  `G_1b` INTEGER,
  `G_2b` INTEGER,
  `G_3b` INTEGER,
  `G_ss` INTEGER,
  `G_lf` INTEGER,
  `G_cf` INTEGER,
  `G_rf` INTEGER,
  `G_of` INTEGER,
  `G_dh` INTEGER,
  `G_ph` INTEGER,
  `G_pr` INTEGER
);
CREATE INDEX `Appearances_yearID` ON `Appearances` (`yearID`);
CREATE INDEX `Appearances_teamID` ON `Appearances` (`teamID`);
CREATE INDEX `Appearances_lgID` ON `Appearances` (`lgID`);
CREATE INDEX `Appearances_playerID` ON `Appearances` (`playerID`);
CREATE TABLE `AwardsManagers` (
  `playerID` TEXT,
  `awardID` TEXT,
  `yearID` INTEGER,
  `lgID` TEXT,
  `tie` TEXT,
  `notes` TEXT
);
CREATE INDEX `AwardsManagers_playerID` ON `AwardsManagers` (`playerID`);
CREATE INDEX `AwardsManagers_awardID` ON `AwardsManagers` (`awardID`);
CREATE INDEX `AwardsManagers_yearID` ON `AwardsManagers` (`yearID`);
CREATE INDEX `AwardsManagers_lgID` ON `AwardsManagers` (`lgID`);
CREATE TABLE `AwardsPlayers` (
  `playerID` TEXT,
  `awardID` TEXT,
  `yearID` INTEGER,
  `lgID` TEXT,
  `tie` TEXT,
  `notes` TEXT
);
CREATE INDEX `AwardsPlayers_playerID` ON `AwardsPlayers` (`playerID`);
CREATE INDEX `AwardsPlayers_awardID` ON `AwardsPlayers` (`awardID`);
CREATE INDEX `AwardsPlayers_yearID` ON `AwardsPlayers` (`yearID`);
CREATE INDEX `AwardsPlayers_lgID` ON `AwardsPlayers` (`lgID`);
CREATE TABLE `AwardsShareManagers` (
  `awardID` TEXT,
  `yearID` INTEGER,
  `lgID` TEXT,
  `playerID` TEXT,
  `pointsWon` INTEGER,
  `pointsMax` INTEGER,
  `votesFirst` INTEGER
);
CREATE INDEX `AwardsShareManagers_awardID` ON `AwardsShareManagers` (`awardID`);
CREATE INDEX `AwardsShareManagers_yearID` ON `AwardsShareManagers` (`yearID`);
CREATE INDEX `AwardsShareManagers_lgID` ON `AwardsShareManagers` (`lgID`);
CREATE INDEX `AwardsShareManagers_playerID` ON `AwardsShareManagers` (`playerID`);
CREATE TABLE `AwardsSharePlayers` (
  `awardID` TEXT,
  `yearID` INTEGER,
  `lgID` TEXT,
  `playerID` TEXT,
  `pointsWon` INTEGER,
  `pointsMax` INTEGER,
  `votesFirst` INTEGER
);
CREATE INDEX `AwardsSharePlayers_awardID` ON `AwardsSharePlayers` (`awardID`);
CREATE INDEX `AwardsSharePlayers_yearID` ON `AwardsSharePlayers` (`yearID`);
CREATE INDEX `AwardsSharePlayers_lgID` ON `AwardsSharePlayers` (`lgID`);
CREATE INDEX `AwardsSharePlayers_playerID` ON `AwardsSharePlayers` (`playerID`);
CREATE TABLE `Batting` (
  `playerID` TEXT,
  `yearID` INTEGER,
  `stint` INTEGER,
  `teamID` TEXT,
  `lgID` TEXT,
  `G` INTEGER,
  `AB` INTEGER,
  `R` INTEGER,
  `H` INTEGER,
  `X2B` INTEGER,
  `X3B` INTEGER,
  `HR` INTEGER,
  `RBI` INTEGER,
  `SB` INTEGER,
  `CS` INTEGER,
  `BB` INTEGER,
  `SO` INTEGER,
  `IBB` INTEGER,
  `HBP` INTEGER,
  `SH` INTEGER,
  `SF` INTEGER,
  `GIDP` INTEGER
);
CREATE INDEX `Batting_playerID` ON `Batting` (`playerID`);
CREATE INDEX `Batting_yearID` ON `Batting` (`yearID`);
CREATE INDEX `Batting_teamID` ON `Batting` (`teamID`);
CREATE INDEX `Batting_lgID` ON `Batting` (`lgID`);
CREATE TABLE `BattingPost` (
  `yearID` INTEGER,
  `round` TEXT,
  `playerID` TEXT,
  `teamID` TEXT,
  `lgID` TEXT,
  `G` INTEGER,
  `AB` INTEGER,
  `R` INTEGER,
  `H` INTEGER,
  `X2B` INTEGER,
  `X3B` INTEGER,
  `HR` INTEGER,
  `RBI` INTEGER,
  `SB` INTEGER,
  `CS` INTEGER,
  `BB` INTEGER,
  `SO` INTEGER,
  `IBB` INTEGER,
  `HBP` INTEGER,
  `SH` INTEGER,
  `SF` INTEGER,
  `GIDP` INTEGER
);
CREATE INDEX `BattingPost_yearID` ON `BattingPost` (`yearID`);
CREATE INDEX `BattingPost_playerID` ON `BattingPost` (`playerID`);
CREATE INDEX `BattingPost_teamID` ON `BattingPost` (`teamID`);
CREATE INDEX `BattingPost_lgID` ON `BattingPost` (`lgID`);
CREATE TABLE `CollegePlaying` (
  `playerID` TEXT,
  `schoolID` TEXT,
  `yearID` INTEGER
);
CREATE INDEX `CollegePlaying_playerID` ON `CollegePlaying` (`playerID`);
CREATE INDEX `CollegePlaying_schoolID` ON `CollegePlaying` (`schoolID`);
CREATE INDEX `CollegePlaying_yearID` ON `CollegePlaying` (`yearID`);
CREATE TABLE `Fielding` (
  `playerID` TEXT,
  `yearID` INTEGER,
  `stint` INTEGER,
  `teamID` TEXT,
  `lgID` TEXT,
  `POS` TEXT,
  `G` INTEGER,
  `GS` INTEGER,
  `InnOuts` INTEGER,
  `PO` INTEGER,
  `A` INTEGER,
  `E` INTEGER,
  `DP` INTEGER,
  `PB` INTEGER,
  `WP` INTEGER,
  `SB` INTEGER,
  `CS` INTEGER,
  `ZR` INTEGER
);
CREATE INDEX `Fielding_playerID` ON `Fielding` (`playerID`);
CREATE INDEX `Fielding_yearID` ON `Fielding` (`yearID`);
CREATE INDEX `Fielding_teamID` ON `Fielding` (`teamID`);
CREATE INDEX `Fielding_lgID` ON `Fielding` (`lgID`);
CREATE TABLE `FieldingOF` (
  `playerID` TEXT,
  `yearID` INTEGER,
  `stint` INTEGER,
  `Glf` INTEGER,
  `Gcf` INTEGER,
  `Grf` INTEGER
);
CREATE INDEX `FieldingOF_playerID` ON `FieldingOF` (`playerID`);
CREATE INDEX `FieldingOF_yearID` ON `FieldingOF` (`yearID`);
CREATE TABLE `FieldingOFsplit` (
  `playerID` TEXT,
  `yearID` INTEGER,
  `stint` INTEGER,
  `teamID` TEXT,
  `lgID` TEXT,
  `POS` TEXT,
  `G` INTEGER,
  `GS` INTEGER,
  `InnOuts` INTEGER,
  `PO` INTEGER,
  `A` INTEGER,
  `E` INTEGER,
  `DP` INTEGER,
  `PB` INTEGER,
  `WP` INTEGER,
  `SB` INTEGER,
  `CS` INTEGER,
  `ZR` INTEGER
);
CREATE INDEX `FieldingOFsplit_playerID` ON `FieldingOFsplit` (`playerID`);
CREATE INDEX `FieldingOFsplit_yearID` ON `FieldingOFsplit` (`yearID`);
CREATE INDEX `FieldingOFsplit_teamID` ON `FieldingOFsplit` (`teamID`);
CREATE INDEX `FieldingOFsplit_lgID` ON `FieldingOFsplit` (`lgID`);
CREATE TABLE `FieldingPost` (
  `playerID` TEXT,
  `yearID` INTEGER,
  `teamID` TEXT,
  `lgID` TEXT,
  `round` TEXT,
  `POS` TEXT,
  `G` INTEGER,
  `GS` INTEGER,
  `InnOuts` INTEGER,
  `PO` INTEGER,
  `A` INTEGER,
  `E` INTEGER,
  `DP` INTEGER,
  `TP` INTEGER,
  `PB` INTEGER,
  `SB` INTEGER,
  `CS` INTEGER
);
CREATE INDEX `FieldingPost_playerID` ON `FieldingPost` (`playerID`);
CREATE INDEX `FieldingPost_yearID` ON `FieldingPost` (`yearID`);
CREATE INDEX `FieldingPost_teamID` ON `FieldingPost` (`teamID`);
CREATE INDEX `FieldingPost_lgID` ON `FieldingPost` (`lgID`);
CREATE TABLE `HallOfFame` (
  `playerID` TEXT,
  `yearID` INTEGER,
  `votedBy` TEXT,
  `ballots` REAL,
  `needed` REAL,
  `votes` REAL,
  `inducted` TEXT,
  `category` TEXT,
  `needed_note` TEXT
);
CREATE INDEX `HallOfFame_playerID` ON `HallOfFame` (`playerID`);
CREATE INDEX `HallOfFame_yearID` ON `HallOfFame` (`yearID`);
CREATE TABLE `HomeGames` (
  `year.key` INTEGER,
  `league.key` TEXT,
  `team.key` TEXT,
  `park.key` TEXT,
  `span.first` TEXT,
  `span.last` TEXT,
  `games` INTEGER,
  `openings` INTEGER,
  `attendance` INTEGER
);
CREATE TABLE `LahmanData` (
  `file` TEXT,
  `class` TEXT,
  `nobs` REAL,
  `nvar` REAL,
  `title` TEXT
);
CREATE TABLE `Managers` (
  `playerID` TEXT,
  `yearID` INTEGER,
  `teamID` TEXT,
  `lgID` TEXT,
  `inseason` INTEGER,
  `G` INTEGER,
  `W` INTEGER,
  `L` INTEGER,
  `rank` INTEGER,
  `plyrMgr` TEXT
);
CREATE INDEX `Managers_playerID` ON `Managers` (`playerID`);
CREATE INDEX `Managers_yearID` ON `Managers` (`yearID`);
CREATE INDEX `Managers_teamID` ON `Managers` (`teamID`);
CREATE INDEX `Managers_lgID` ON `Managers` (`lgID`);
CREATE TABLE `ManagersHalf` (
  `playerID` TEXT,
  `yearID` INTEGER,
  `teamID` TEXT,
  `lgID` TEXT,
  `inseason` INTEGER,
  `half` INTEGER,
  `G` INTEGER,
  `W` INTEGER,
  `L` INTEGER,
  `rank` INTEGER
);
CREATE INDEX `ManagersHalf_playerID` ON `ManagersHalf` (`playerID`);
CREATE INDEX `ManagersHalf_yearID` ON `ManagersHalf` (`yearID`);
CREATE INDEX `ManagersHalf_teamID` ON `ManagersHalf` (`teamID`);
CREATE INDEX `ManagersHalf_lgID` ON `ManagersHalf` (`lgID`);
CREATE TABLE `Parks` (
  `park.alias` TEXT,
  `park.key` TEXT,
  `park.name` TEXT,
  `city` TEXT,
  `state` TEXT,
  `country` TEXT
);
CREATE TABLE `People` (
  `playerID` TEXT,
  `birthYear` INTEGER,
  `birthMonth` INTEGER,
  `birthDay` INTEGER,
  `birthCity` TEXT,
  `birthCountry` TEXT,
  `birthState` TEXT,
  `deathYear` INTEGER,
  `deathMonth` INTEGER,
  `deathDay` INTEGER,
  `deathCountry` TEXT,
  `deathState` TEXT,
  `deathCity` TEXT,
  `nameFirst` TEXT,
  `nameLast` TEXT,
  `nameGiven` TEXT,
  `weight` INTEGER,
  `height` INTEGER,
  `bats` TEXT,
  `throws` TEXT,
  `debut` TEXT,
  `bbrefID` TEXT,
  `finalGame` TEXT,
  `retroID` TEXT,
  `deathDate` REAL,
  `birthDate` REAL
);
CREATE INDEX `People_playerID` ON `People` (`playerID`);
CREATE INDEX `People_bbrefID` ON `People` (`bbrefID`);
CREATE INDEX `People_retroID` ON `People` (`retroID`);
CREATE TABLE `Pitching` (
  `playerID` TEXT,
  `yearID` INTEGER,
  `stint` INTEGER,
  `teamID` TEXT,
  `lgID` TEXT,
  `W` INTEGER,
  `L` INTEGER,
  `G` INTEGER,
  `GS` INTEGER,
  `CG` INTEGER,
  `SHO` INTEGER,
  `SV` INTEGER,
  `IPouts` INTEGER,
  `H` INTEGER,
  `ER` INTEGER,
  `HR` INTEGER,
  `BB` INTEGER,
  `SO` INTEGER,
  `BAOpp` REAL,
  `ERA` REAL,
  `IBB` INTEGER,
  `WP` INTEGER,
  `HBP` INTEGER,
  `BK` INTEGER,
  `BFP` INTEGER,
  `GF` INTEGER,
  `R` INTEGER,
  `SH` INTEGER,
  `SF` INTEGER,
  `GIDP` INTEGER
);
CREATE INDEX `Pitching_playerID` ON `Pitching` (`playerID`);
CREATE INDEX `Pitching_yearID` ON `Pitching` (`yearID`);
CREATE INDEX `Pitching_teamID` ON `Pitching` (`teamID`);
CREATE INDEX `Pitching_lgID` ON `Pitching` (`lgID`);
CREATE TABLE `PitchingPost` (
  `playerID` TEXT,
  `yearID` INTEGER,
  `round` TEXT,
  `teamID` TEXT,
  `lgID` TEXT,
  `W` INTEGER,
  `L` INTEGER,
  `G` INTEGER,
  `GS` INTEGER,
  `CG` INTEGER,
  `SHO` INTEGER,
  `SV` INTEGER,
  `IPouts` INTEGER,
  `H` INTEGER,
  `ER` INTEGER,
  `HR` INTEGER,
  `BB` INTEGER,
  `SO` INTEGER,
  `BAOpp` REAL,
  `ERA` REAL,
  `IBB` INTEGER,
  `WP` INTEGER,
  `HBP` INTEGER,
  `BK` INTEGER,
  `BFP` INTEGER,
  `GF` INTEGER,
  `R` INTEGER,
  `SH` INTEGER,
  `SF` INTEGER,
  `GIDP` INTEGER
);
CREATE INDEX `PitchingPost_playerID` ON `PitchingPost` (`playerID`);
CREATE INDEX `PitchingPost_yearID` ON `PitchingPost` (`yearID`);
CREATE INDEX `PitchingPost_teamID` ON `PitchingPost` (`teamID`);
CREATE INDEX `PitchingPost_lgID` ON `PitchingPost` (`lgID`);
CREATE TABLE `Salaries` (
  `yearID` INTEGER,
  `teamID` TEXT,
  `lgID` TEXT,
  `playerID` TEXT,
  `salary` INTEGER
);
CREATE INDEX `Salaries_yearID` ON `Salaries` (`yearID`);
CREATE INDEX `Salaries_teamID` ON `Salaries` (`teamID`);
CREATE INDEX `Salaries_lgID` ON `Salaries` (`lgID`);
CREATE INDEX `Salaries_playerID` ON `Salaries` (`playerID`);
CREATE TABLE `Schools` (
  `schoolID` TEXT,
  `name_full` TEXT,
  `city` TEXT,
  `state` TEXT,
  `country` TEXT
);
CREATE INDEX `Schools_schoolID` ON `Schools` (`schoolID`);
CREATE TABLE `SeriesPost` (
  `yearID` INTEGER,
  `round` TEXT,
  `teamIDwinner` TEXT,
  `lgIDwinner` TEXT,
  `teamIDloser` TEXT,
  `lgIDloser` TEXT,
  `wins` INTEGER,
  `losses` INTEGER,
  `ties` INTEGER
);
CREATE INDEX `SeriesPost_yearID` ON `SeriesPost` (`yearID`);
CREATE TABLE `Teams` (
  `yearID` INTEGER,
  `lgID` TEXT,
  `teamID` TEXT,
  `franchID` TEXT,
  `divID` TEXT,
  `Rank` INTEGER,
  `G` INTEGER,
  `Ghome` INTEGER,
  `W` INTEGER,
  `L` INTEGER,
  `DivWin` TEXT,
  `WCWin` TEXT,
  `LgWin` TEXT,
  `WSWin` TEXT,
  `R` INTEGER,
  `AB` INTEGER,
  `H` INTEGER,
  `X2B` INTEGER,
  `X3B` INTEGER,
  `HR` INTEGER,
  `BB` INTEGER,
  `SO` INTEGER,
  `SB` INTEGER,
  `CS` INTEGER,
  `HBP` INTEGER,
  `SF` INTEGER,
  `RA` INTEGER,
  `ER` INTEGER,
  `ERA` REAL,
  `CG` INTEGER,
  `SHO` INTEGER,
  `SV` INTEGER,
  `IPouts` INTEGER,
  `HA` INTEGER,
  `HRA` INTEGER,
  `BBA` INTEGER,
  `SOA` INTEGER,
  `E` INTEGER,
  `DP` INTEGER,
  `FP` REAL,
  `name` TEXT,
  `park` TEXT,
  `attendance` INTEGER,
  `BPF` INTEGER,
  `PPF` INTEGER,
  `teamIDBR` TEXT,
  `teamIDlahman45` TEXT,
  `teamIDretro` TEXT
);
CREATE INDEX `Teams_yearID` ON `Teams` (`yearID`);
CREATE INDEX `Teams_lgID` ON `Teams` (`lgID`);
CREATE INDEX `Teams_teamID` ON `Teams` (`teamID`);
CREATE INDEX `Teams_franchID` ON `Teams` (`franchID`);
CREATE INDEX `Teams_divID` ON `Teams` (`divID`);
CREATE TABLE `TeamsFranchises` (
  `franchID` TEXT,
  `franchName` TEXT,
  `active` TEXT,
  `NAassoc` TEXT
);
CREATE INDEX `TeamsFranchises_franchID` ON `TeamsFranchises` (`franchID`);
CREATE TABLE `TeamsHalf` (
  `yearID` INTEGER,
  `lgID` TEXT,
  `teamID` TEXT,
  `Half` INTEGER,
  `divID` TEXT,
  `DivWin` TEXT,
  `Rank` INTEGER,
  `G` INTEGER,
  `W` INTEGER,
  `L` INTEGER
);
CREATE INDEX `TeamsHalf_yearID` ON `TeamsHalf` (`yearID`);
CREATE INDEX `TeamsHalf_lgID` ON `TeamsHalf` (`lgID`);
CREATE INDEX `TeamsHalf_teamID` ON `TeamsHalf` (`teamID`);
CREATE INDEX `TeamsHalf_divID` ON `TeamsHalf` (`divID`);

dbplyr

List the 5 modern-era managers with the highest winning percentages. (Define modern-era as managed over 1000 games post 1920.)

<SQL>
SELECT `nameGiven` || ' ' || `nameLast` AS `name`, `W`, `L`, `G`, `record`
FROM (
  SELECT `q01`.*, RANK() OVER (ORDER BY `record` DESC) AS `col01`
  FROM (
    SELECT
      `LHS`.*,
      `birthYear`,
      `birthMonth`,
      `birthDay`,
      `birthCity`,
      `birthCountry`,
      `birthState`,
      `deathYear`,
      `deathMonth`,
      `deathDay`,
      `deathCountry`,
      `deathState`,
      `deathCity`,
      `nameFirst`,
      `nameLast`,
      `nameGiven`,
      `weight`,
      `height`,
      `bats`,
      `throws`,
      `debut`,
      `bbrefID`,
      `finalGame`,
      `retroID`,
      `deathDate`,
      `birthDate`
    FROM (
      SELECT `q01`.*, `W` / (`G` + 0.0) AS `record`
      FROM (
        SELECT `q01`.*, `W` + `L` AS `G`
        FROM (
          SELECT `playerID`, SUM(`W`) AS `W`, SUM(`L`) AS `L`
          FROM (
            SELECT `Managers`.*
            FROM `Managers`
            WHERE (`yearID` >= 1920.0)
          ) AS `q01`
          GROUP BY `playerID`
        ) AS `q01`
      ) AS `q01`
      WHERE (`G` > 1000.0)
    ) AS `LHS`
    INNER JOIN `People`
      ON (`LHS`.`playerID` = `People`.`playerID`)
  ) AS `q01`
) AS `q01`
WHERE (`col01` <= 5)

dbplyr

Add |> explain() to see the DB internal query plan:

<SQL>
SELECT `nameGiven` || ' ' || `nameLast` AS `name`, `W`, `L`, `G`, `record`
FROM (
  SELECT `q01`.*, RANK() OVER (ORDER BY `record` DESC) AS `col01`
  FROM (
    SELECT
      `LHS`.*,
      `birthYear`,
      `birthMonth`,
      `birthDay`,
      `birthCity`,
      `birthCountry`,
      `birthState`,
      `deathYear`,
      `deathMonth`,
      `deathDay`,
      `deathCountry`,
      `deathState`,
      `deathCity`,
      `nameFirst`,
      `nameLast`,
      `nameGiven`,
      `weight`,
      `height`,
      `bats`,
      `throws`,
      `debut`,
      `bbrefID`,
      `finalGame`,
      `retroID`,
      `deathDate`,
      `birthDate`
    FROM (
      SELECT `q01`.*, `W` / (`G` + 0.0) AS `record`
      FROM (
        SELECT `q01`.*, `W` + `L` AS `G`
        FROM (
          SELECT `playerID`, SUM(`W`) AS `W`, SUM(`L`) AS `L`
          FROM (
            SELECT `Managers`.*
            FROM `Managers`
            WHERE (`yearID` >= 1920.0)
          ) AS `q01`
          GROUP BY `playerID`
        ) AS `q01`
      ) AS `q01`
      WHERE (`G` > 1000.0)
    ) AS `LHS`
    INNER JOIN `People`
      ON (`LHS`.`playerID` = `People`.`playerID`)
  ) AS `q01`
) AS `q01`
WHERE (`col01` <= 5)

<PLAN>
   id parent notused                                                 detail
1   2      0       0                                         CO-ROUTINE q01
2   5      2       0                                CO-ROUTINE (subquery-8)
3   7      5       0                                         CO-ROUTINE q01
4  14      7     142            SCAN Managers USING INDEX Managers_playerID
5  55      5      82                                               SCAN q01
6  62      5      42 SEARCH People USING INDEX People_playerID (playerID=?)
7 120      5       0                           USE TEMP B-TREE FOR ORDER BY
8 163      2      81                                      SCAN (subquery-8)
9 285      0      81                                               SCAN q01

Beyond this class to interpret, but in general use of INDEX yields good performance. Take DB classes for much more

dbplyr

List the 5 modern-era managers with the highest winning percentages. (Define modern-era as managed over 1000 games post 1920.)

# A tibble: 5 × 5
  name                          W     L     G record
  <chr>                     <int> <int> <int>  <dbl>
1 David Ray Roberts           851   507  1358  0.627
2 Joseph Vincent McCarthy    2125  1333  3458  0.615
3 Miller James Huggins        927   597  1524  0.608
4 William Harold Southworth  1044   704  1748  0.597
5 Aaron John Boone            603   428  1031  0.585

Software Tools for Data AnalysisSTA 9750Michael WeylandtWeek 6 – Thursday 2026-03-12Last Updated: 2026-01-14

STA 9750 Week 6

Today

Today

MP#01 - Peer Feedback

MP#01 - Peer Feedback

MP#01 - Peer Feedback

MP#01 - Peer Feedback

MP#01 - Peer Feedback

MP#01 - Peer Feedback

MP#01 - Impressions

MP#01 - Common Questions

MP#01 - Common Questions

Mini-Project #02

Upcoming Mini-Projects

Pre-Assignments

Course Support

Upcoming Week

Project Proposals

After Proposals

On to the Show!

Presentation Order

Wrap-Up

Orientation

Upcoming Work

Life Tip of the Week

Register to Vote

Life Tip of the Week

On the Ballot

Musical Treat

Optional Material (Time Allowing)

SQL

SQL Languages

RDMS

DDL

DDL

DDL

DDL

DDL

DDL

DDL Normalization

DML

DML

DML

DML

DML

DML

DML

DML

ACID Guarantees

Getting Started with RDMS

dbplyr

dbplyr

dbplyr

dbplyr

dbplyr

dbplyr

dbplyr

dbplyr

dbplyr

Software Tools for Data Analysis
STA 9750
Michael Weylandt
Week 6 – Thursday 2026-03-12
Last Updated: 2026-01-14