Software Tools for Data Analysis
STA 9750
Michael Weylandt
Week 6 – Thursday 2026-03-12
Last Updated: 2026-01-14

STA 9750 Week 6

Today: Lecture #05: Multi-Table dplyr Verbs

Today

Today

  • Course Administration
  • Mini-Project #01 Reflection
  • Project Proposal Presentations
  • Wrap-Up
    • Life Tip of the Day
  • Time Allowing: dplyr vs SQL

MP#01 - Peer Feedback

Assigned on GitHub - due on 2026-03-22

  • \(\approx 4\) feedbacks each
  • Take this seriously: around 20% of this assignment is “meta-review”
  • Goal: rigorous constructive critique

Use helper functions to find submissions assigned to you. Ask on Piazza if still having trouble.

MP#01 - Peer Feedback

Submissions may not map perfectly to rubric - use your best judgement

Be generous but serious:

  • Goal is improvement, so “everything is great, no comments” is unhelpful
    • Nothing is completely right nor completely wrong
  • Remember, meta-review (TA scores of your feedback) to follow
  • Extra Credit is limited per instructions; only award for exceptional work

Learn from this! What can you adapt for MP#02?

MP#01 - Peer Feedback

Example of poor feedback:

Scores
Written Communication: 10
Project Skeleton: 10
Formatting & Display: 10
Code Quality: 10
Data Preparation: 10
Extra Credit: 10

Comments
Website looks really good, nice and clean!

Written Communication
Excellent and straight to the point.

Project Skeleton
Solid skeleton, well-organized .

Formatting & Display
Nicely formatted and balanced across the website.

Code Quality
Code runs like Forrest Gump in a Slump.

Data Preparation
Very great. 

Extra Credit
Added graphes for even better understanding.
  • Improper formatting
  • Superficial comments / no sign of actually reading work

Red Flag: Repeated verbatim on several posts

Reminder: Poor feedback \(\neq\) poor work.

MP#01 - Peer Feedback

Example of medium feedback:

## Scores 

- Written Communication: 8
- Project Skeleton: 10
- Formatting & Display: 9
- Code Quality: 10
- Data Preparation: 10
- Extra Credit: 1

## Comments

Love the visuals for Press Releases.

### Written Communication

A short summary or a description of this project would be great to add.

### Project Skeleton

Code completes all instructor-provided tasks correctly.

### Formatting & Display

Tables have well-formatted column names; caption would be great.

### Code Quality

Code is clear and well written.

### Data Preparation

Automatic (10/10). Out of scope for this mini-project.

### Extra Credit

I found all the Press Releases very interesting. The visuals were a great touch.
  • Proper formatting
  • Gave actionable suggestions
  • Directionally correct, but a bit vague

MP#01 - Peer Feedback

Example of great feedback:


## Scores

- Written Communication: 9
- Project Skeleton: 10
- Formatting & Display: 9
- Code Quality: 9
- Data Preparation: 10
- Extra Credit: 2

## Comments

### Written Communication

Overall, your writing is clear and easy to follow. I noticed a few small typos, but nothing major — using the built-in spell check in RStudio should catch those quickly. Everything else looks solid, and I didn’t have any major concerns based on my review.

### Project Skeleton

All tasks were completed satisfactorily. 

### Formatting & Display

Overall, your tables and figures are well-organized and clear. There is one table in the "Data" section with column titles that could be formatted a little more cleanly for easier reading. I also noticed a few small typos in some of the table captions, but they should be easy to fix.

### Code Quality

The code quality is generally good, but there are a few minor linter issues. The comments could be more frequent and clearer in some places. For example, the comment `#checking dupolicate so that github will not block it` should be corrected to "duplicate," and the explanation could be clarified to avoid confusion–what duplicate are you referring to, and how does it block GitHub? It might confuse someone else reading the code.

Also, I would recommend moving all library imports to the top of the script, as that is typically considered standard practice.

### Data Preparation

The data preparation looks solid overall. I like how you handle missing files and JSON parsing failures–it demonstrates strong defensive programming.

### Extra Credit

Additional two points for using quarto's video support!
  • Positive tone
  • Detailed comments
  • Noted issues & gave suggestions on how to fix
  • Noted unclear sections for improvement

MP#01 - Peer Feedback

Lack of prior experience is not a hinderance here:

  • If something is unclear to you, that’s a problem!
  • Nothing required super-complex code, so anything overly complex probably could have been a simpler way (except for some above-and-beyond stuff)
  • You don’t have to be definitive in comments - impressions and questions are just as helpful.

MP#01 - Impressions

👏 Thank you for excellent work here! 👏

I’ve started reading some submissions - really impressive!

  • Plots (exciting!)
  • Sophisticated and Insightful Analyses
  • Lots of K-Pop Demon Hunters
  • Fun for me to get to learn your interests

Hopefully automated checks helpful. If you have ideas for other useful automated ‘supports’ let me know

MP#01 - Common Questions

Q: Why doesn’t my site look the same on GitHub as it does on my laptop?

A: Missing css and js files. Need to upload everything in docs.

New helper function:

source("https://michael-weylandt.com/STA9750/load_helpers.R")
mp_submission_ready()

Tries to make sure all ‘secondary’ files are present.

MP#01 - Common Questions

Q: Why doesn’t my submission have the right URL?

A: Most computers have case-sensitive file names, Mac doesn’t, and Windows is iffy.

New helper function:

source("https://michael-weylandt.com/STA9750/load_helpers.R")
mp_start()

Generates a very basic qmd file with the correct file name and some relevant context.

Mini-Project #02

MP#02 released - TBD

Due2026-04-03 at 11:59pm ET

  • GitHub post (used for peer feedback) AND Brightspace
  • \(\approx\) One Month: don’t wait until the very end

Pay attention to the rubric

  • Writing and presentation are about 50% of your grade
  • Evaluated on rigor and thoughtfulness, not necessarily correctness

Thanks to HI and SK for help with Windows-compatibility!

Upcoming Mini-Projects

Tentative Topics

  • MP#03: NYC Trees - 2026-04-24 at 11:59pm ET
  • MP#04: US Employment Surveys - 2026-05-15 at 11:59pm ET

Pre-Assignments

Brightspace - day before class

  • Reading, typically on course website
  • Brightspace auto-grades
    • I have to manually change to completion grading

Next pre-assignment is 2026-03-12 at 6:00pm ET

I missed a few comments in the previous cycle (sorry!) - trying to catch up in the next few days

Course Support

  • Synchronous
    • Virtual Office Hours 2x / week on Class Days
  • Asynchronous
    • Piazza (\(<20\) minute average response time)

Upcoming Week

Due Wednesday at 11:45pm:

  • Pre-Assignment #07 (Brightspace)
    • Introduction to plotting with ggplot2

Expect back:

  • MP#01 grades
  • Project proposal instructor feedback

Project Proposals

Official Description

  • 6 minute presentation
  • Key topics:
    • Animating Question
    • Team Roster
  • Also discuss: Possible specific questions, data sources, analytical plan, anticipated challenges

Most important: team names!

Previous: Rat Pack, Subway Surfers, Going for Gold, etc.

After Proposals

100% optional discussion of dplyr vs SQL

  • SQL is a very common Data Scientist interview topic so if you’re not taking a database course, might be useful

On to the Show!

Presentation Order

TBD

Wrap-Up

Orientation

  • Communicating Results (quarto) ✅
  • R Basics ✅
  • Data Manipulation in R
  • Data Visualization in R ⬅️
  • Getting Data into R
  • Statistical Modeling in R

Upcoming Work

Upcoming work from course calendar

Life Tip of the Week

Register to Vote

If you want to vote in the upcoming NYC Election, it’s time to register to vote:

https://www.vote.nyc/page/register-vote

Early voting begins October 25th: need to be registered by then

  • Forms must be received not just mailed

Life Tip of the Week

On the Ballot

Musical Treat


Optional Material (Time Allowing)

SQL

Structured Query Language (ISO 9075)

  • Standard language for interacting with Relational Database Management Systems (RDMS)
  • Divided into three main sub-languages:
    • Data Manipulation Language (DML)
    • Data Declaration Language (DDL)
    • Data Control Language (DCL)
  • Declarative, not Imperative Programming

SQL Languages

Data Definition Language (DDL)

  • “Design” language for DB
  • What tables are present? What types are the columns? What relationships (keys) are enforced?

Data Manipulation Language (DML)

  • “User” interaction with DB: queries, adding, changing, deleting data
  • SELECT, GROUP BY, WHERE, HAVING, JOIN, etc.
  • Roughly coextensive with dplyr

Data Control Language (DCL)

  • “Security” language for DB
  • What permissions do different users have?

RDMS

Relational Database Management System (RDMS)

  • Data management server for complex environments
  • Relational tabular data structures
  • Manipulated using SQL
    • Common SQL Standard + Many RDMS-specific extensions
  • ACID Guarantees

DDL

RDMS has a fixed number of tables (roughly data.frames). DDL specifies

  • Column names and type
  • Can values be left blank (nullability)?
  • Identifiers (Primary Keys)
  • Constraints (non-negative, a < b, etc.)
  • Relationships to other tables (Foreign Keys)

DDL

Basic command is the CREATE TABLE statement.

TYPE can be INTEGER, DECIMAL, CHAR, VARCHAR, DATE, TIMESTAMP and more

DDL

Our beloved penguins:

Compare:

Rows: 344
Columns: 8
$ species     <fct> Adelie, Adelie, Adelie, Adelie, Adelie, Adelie, Adelie, Ad…
$ island      <fct> Torgersen, Torgersen, Torgersen, Torgersen, Torgersen, Tor…
$ bill_len    <dbl> 39.1, 39.5, 40.3, NA, 36.7, 39.3, 38.9, 39.2, 34.1, 42.0, …
$ bill_dep    <dbl> 18.7, 17.4, 18.0, NA, 19.3, 20.6, 17.8, 19.6, 18.1, 20.2, …
$ flipper_len <int> 181, 186, 195, NA, 193, 190, 181, 195, 193, 190, 186, 180,…
$ body_mass   <int> 3750, 3800, 3250, NA, 3450, 3650, 3625, 4675, 3475, 4250, …
$ sex         <fct> male, female, female, NA, female, male, female, male, NA, …
$ year        <int> 2007, 2007, 2007, 2007, 2007, 2007, 2007, 2007, 2007, 2007…

DDL

We can add a constraint so the year is 2007-2009:

DDL

Constraints for categorical variables (R factors) can be tricky. Create a normalized set of tables:

Sex ID Name
1 Female
2 Male
Penguin ID Sex ID Weight
1 1 4000
2 1 3500
3 2 5000

We want to make sure each Sex ID in table 2 matches a row of the first table

DDL

Constraints for categorical variables (R factors) can be tricky. Create a normalized set of tables:

DDL Normalization

Normalization:

  1. Increases integrity via constraints
  2. Reduces redundancy (need to change value in 1 place only)
  3. Slight performance benefits

More JOIN work in SQL than in dplyr

Proper table design is a science and an art

DML

Data Manipulation Language is for DB users.

  • Close parallels to dplyr
  • Compare:

vs.

SELECT * means all columns which matches R’s default

DML

Data Manipulation Language is for DB users.

  • Close parallels to dplyr
  • Compare:

vs.

DML

Data Manipulation Language is for DB users.

  • Close parallels to dplyr
  • Compare:

vs.

DML

Note that

and

are both

Why?

They commute – same result in either order

DML

  • R is an imperative paradigm - you say what to do
  • SQL is a declarative paradigm - you say what you want

Declarative paradigms let the server figure out the best way to compute, but are more restrictive

For a well-defined paradigm like SQL, much work has gone into query optimization

DML

  • Compare:

vs.

summarize() is implicit in SELECT clause when we have a GROUP BY

DML

But:

is

Why not WHERE again?

DML

In SQL: - WHERE happens before a GROUP BY; and - HAVING happens after a GROUP BY

In R: - Ordering of summarize() and filter controls execution order

ACID Guarantees

Most RDMSs provide a set of four ACID guarantees:

  • Atomicity: Changes occur all-or-nothing
    • E.g.: Add charge to customer’s account + schedule item for shipping
  • Consistency: DDL Constraints are always valid
    • E.g.: Can’t create a user without having an email on file
  • Isolation: Two different users don’t “step on each other’s toes”
    • E.g.: If I’m updating prices while you are checking out, you get your original price
  • Durability: When a change is complete, it is guaranteed to be saved to disk

Getting Started with RDMS

Easiest RDMS to use is SQLite

  • Most ubiquitous DB in the world (by a long-shot)
  • Already installed on your computer (many times over!)
  • Access directly through sqlite3 or sqlite3.exe
  • Access from R with RSQLite package

dbplyr

Can use dbplyr (extra b!) to put dplyr-UX on DB

E.g.

dbplyr

# Source:   table<`penguins`> [?? x 8]
# Database: sqlite 3.50.4 [:memory:]
   species island    bill_len bill_dep flipper_len body_mass sex     year
   <chr>   <chr>        <dbl>    <dbl>       <int>     <int> <chr>  <int>
 1 Adelie  Torgersen     39.1     18.7         181      3750 male    2007
 2 Adelie  Torgersen     39.5     17.4         186      3800 female  2007
 3 Adelie  Torgersen     40.3     18           195      3250 female  2007
 4 Adelie  Torgersen     NA       NA            NA        NA <NA>    2007
 5 Adelie  Torgersen     36.7     19.3         193      3450 female  2007
 6 Adelie  Torgersen     39.3     20.6         190      3650 male    2007
 7 Adelie  Torgersen     38.9     17.8         181      3625 female  2007
 8 Adelie  Torgersen     39.2     19.6         195      4675 male    2007
 9 Adelie  Torgersen     34.1     18.1         193      3475 <NA>    2007
10 Adelie  Torgersen     42       20.2         190      4250 <NA>    2007
# ℹ more rows

dbplyr

<SQL>
SELECT `species`, AVG(`body_mass`) AS `mean_mass`
FROM `penguins`
GROUP BY `species`

Extra step - show_query to see SQL or collect() to run in DB:

# A tibble: 3 × 2
  species   mean_mass
  <chr>         <dbl>
1 Adelie        3701.
2 Chinstrap     3733.
3 Gentoo        5076.

dbplyr

<SQL>
SELECT `species`, AVG(`body_mass`) AS `mean_mass`
FROM (
  SELECT `penguins`.*
  FROM `penguins`
  WHERE (`body_mass` > 4000.0)
) AS `q01`
GROUP BY `species`

Compared to:

<SQL>
SELECT `species`, AVG(`body_mass`) AS `mean_mass`
FROM `penguins`
GROUP BY `species`
HAVING (AVG(`body_mass`) > 4000.0)

Not the best SQL, but RDMS can make quick work of it

dbplyr

A more substantial example:

dbplyr

A more substantial example:

CREATE TABLE `AllstarFull` (
  `playerID` TEXT,
  `yearID` INTEGER,
  `gameNum` INTEGER,
  `gameID` TEXT,
  `teamID` TEXT,
  `lgID` TEXT,
  `GP` INTEGER,
  `startingPos` TEXT
);
CREATE INDEX `AllstarFull_playerID` ON `AllstarFull` (`playerID`);
CREATE INDEX `AllstarFull_yearID` ON `AllstarFull` (`yearID`);
CREATE INDEX `AllstarFull_gameID` ON `AllstarFull` (`gameID`);
CREATE INDEX `AllstarFull_teamID` ON `AllstarFull` (`teamID`);
CREATE INDEX `AllstarFull_lgID` ON `AllstarFull` (`lgID`);
CREATE TABLE sqlite_stat1(tbl,idx,stat);
CREATE TABLE sqlite_stat4(tbl,idx,neq,nlt,ndlt,sample);
CREATE TABLE `Appearances` (
  `yearID` INTEGER,
  `teamID` TEXT,
  `lgID` TEXT,
  `playerID` TEXT,
  `G_all` INTEGER,
  `GS` INTEGER,
  `G_batting` INTEGER,
  `G_defense` INTEGER,
  `G_p` INTEGER,
  `G_c` INTEGER,
  `G_1b` INTEGER,
  `G_2b` INTEGER,
  `G_3b` INTEGER,
  `G_ss` INTEGER,
  `G_lf` INTEGER,
  `G_cf` INTEGER,
  `G_rf` INTEGER,
  `G_of` INTEGER,
  `G_dh` INTEGER,
  `G_ph` INTEGER,
  `G_pr` INTEGER
);
CREATE INDEX `Appearances_yearID` ON `Appearances` (`yearID`);
CREATE INDEX `Appearances_teamID` ON `Appearances` (`teamID`);
CREATE INDEX `Appearances_lgID` ON `Appearances` (`lgID`);
CREATE INDEX `Appearances_playerID` ON `Appearances` (`playerID`);
CREATE TABLE `AwardsManagers` (
  `playerID` TEXT,
  `awardID` TEXT,
  `yearID` INTEGER,
  `lgID` TEXT,
  `tie` TEXT,
  `notes` TEXT
);
CREATE INDEX `AwardsManagers_playerID` ON `AwardsManagers` (`playerID`);
CREATE INDEX `AwardsManagers_awardID` ON `AwardsManagers` (`awardID`);
CREATE INDEX `AwardsManagers_yearID` ON `AwardsManagers` (`yearID`);
CREATE INDEX `AwardsManagers_lgID` ON `AwardsManagers` (`lgID`);
CREATE TABLE `AwardsPlayers` (
  `playerID` TEXT,
  `awardID` TEXT,
  `yearID` INTEGER,
  `lgID` TEXT,
  `tie` TEXT,
  `notes` TEXT
);
CREATE INDEX `AwardsPlayers_playerID` ON `AwardsPlayers` (`playerID`);
CREATE INDEX `AwardsPlayers_awardID` ON `AwardsPlayers` (`awardID`);
CREATE INDEX `AwardsPlayers_yearID` ON `AwardsPlayers` (`yearID`);
CREATE INDEX `AwardsPlayers_lgID` ON `AwardsPlayers` (`lgID`);
CREATE TABLE `AwardsShareManagers` (
  `awardID` TEXT,
  `yearID` INTEGER,
  `lgID` TEXT,
  `playerID` TEXT,
  `pointsWon` INTEGER,
  `pointsMax` INTEGER,
  `votesFirst` INTEGER
);
CREATE INDEX `AwardsShareManagers_awardID` ON `AwardsShareManagers` (`awardID`);
CREATE INDEX `AwardsShareManagers_yearID` ON `AwardsShareManagers` (`yearID`);
CREATE INDEX `AwardsShareManagers_lgID` ON `AwardsShareManagers` (`lgID`);
CREATE INDEX `AwardsShareManagers_playerID` ON `AwardsShareManagers` (`playerID`);
CREATE TABLE `AwardsSharePlayers` (
  `awardID` TEXT,
  `yearID` INTEGER,
  `lgID` TEXT,
  `playerID` TEXT,
  `pointsWon` INTEGER,
  `pointsMax` INTEGER,
  `votesFirst` INTEGER
);
CREATE INDEX `AwardsSharePlayers_awardID` ON `AwardsSharePlayers` (`awardID`);
CREATE INDEX `AwardsSharePlayers_yearID` ON `AwardsSharePlayers` (`yearID`);
CREATE INDEX `AwardsSharePlayers_lgID` ON `AwardsSharePlayers` (`lgID`);
CREATE INDEX `AwardsSharePlayers_playerID` ON `AwardsSharePlayers` (`playerID`);
CREATE TABLE `Batting` (
  `playerID` TEXT,
  `yearID` INTEGER,
  `stint` INTEGER,
  `teamID` TEXT,
  `lgID` TEXT,
  `G` INTEGER,
  `AB` INTEGER,
  `R` INTEGER,
  `H` INTEGER,
  `X2B` INTEGER,
  `X3B` INTEGER,
  `HR` INTEGER,
  `RBI` INTEGER,
  `SB` INTEGER,
  `CS` INTEGER,
  `BB` INTEGER,
  `SO` INTEGER,
  `IBB` INTEGER,
  `HBP` INTEGER,
  `SH` INTEGER,
  `SF` INTEGER,
  `GIDP` INTEGER
);
CREATE INDEX `Batting_playerID` ON `Batting` (`playerID`);
CREATE INDEX `Batting_yearID` ON `Batting` (`yearID`);
CREATE INDEX `Batting_teamID` ON `Batting` (`teamID`);
CREATE INDEX `Batting_lgID` ON `Batting` (`lgID`);
CREATE TABLE `BattingPost` (
  `yearID` INTEGER,
  `round` TEXT,
  `playerID` TEXT,
  `teamID` TEXT,
  `lgID` TEXT,
  `G` INTEGER,
  `AB` INTEGER,
  `R` INTEGER,
  `H` INTEGER,
  `X2B` INTEGER,
  `X3B` INTEGER,
  `HR` INTEGER,
  `RBI` INTEGER,
  `SB` INTEGER,
  `CS` INTEGER,
  `BB` INTEGER,
  `SO` INTEGER,
  `IBB` INTEGER,
  `HBP` INTEGER,
  `SH` INTEGER,
  `SF` INTEGER,
  `GIDP` INTEGER
);
CREATE INDEX `BattingPost_yearID` ON `BattingPost` (`yearID`);
CREATE INDEX `BattingPost_playerID` ON `BattingPost` (`playerID`);
CREATE INDEX `BattingPost_teamID` ON `BattingPost` (`teamID`);
CREATE INDEX `BattingPost_lgID` ON `BattingPost` (`lgID`);
CREATE TABLE `CollegePlaying` (
  `playerID` TEXT,
  `schoolID` TEXT,
  `yearID` INTEGER
);
CREATE INDEX `CollegePlaying_playerID` ON `CollegePlaying` (`playerID`);
CREATE INDEX `CollegePlaying_schoolID` ON `CollegePlaying` (`schoolID`);
CREATE INDEX `CollegePlaying_yearID` ON `CollegePlaying` (`yearID`);
CREATE TABLE `Fielding` (
  `playerID` TEXT,
  `yearID` INTEGER,
  `stint` INTEGER,
  `teamID` TEXT,
  `lgID` TEXT,
  `POS` TEXT,
  `G` INTEGER,
  `GS` INTEGER,
  `InnOuts` INTEGER,
  `PO` INTEGER,
  `A` INTEGER,
  `E` INTEGER,
  `DP` INTEGER,
  `PB` INTEGER,
  `WP` INTEGER,
  `SB` INTEGER,
  `CS` INTEGER,
  `ZR` INTEGER
);
CREATE INDEX `Fielding_playerID` ON `Fielding` (`playerID`);
CREATE INDEX `Fielding_yearID` ON `Fielding` (`yearID`);
CREATE INDEX `Fielding_teamID` ON `Fielding` (`teamID`);
CREATE INDEX `Fielding_lgID` ON `Fielding` (`lgID`);
CREATE TABLE `FieldingOF` (
  `playerID` TEXT,
  `yearID` INTEGER,
  `stint` INTEGER,
  `Glf` INTEGER,
  `Gcf` INTEGER,
  `Grf` INTEGER
);
CREATE INDEX `FieldingOF_playerID` ON `FieldingOF` (`playerID`);
CREATE INDEX `FieldingOF_yearID` ON `FieldingOF` (`yearID`);
CREATE TABLE `FieldingOFsplit` (
  `playerID` TEXT,
  `yearID` INTEGER,
  `stint` INTEGER,
  `teamID` TEXT,
  `lgID` TEXT,
  `POS` TEXT,
  `G` INTEGER,
  `GS` INTEGER,
  `InnOuts` INTEGER,
  `PO` INTEGER,
  `A` INTEGER,
  `E` INTEGER,
  `DP` INTEGER,
  `PB` INTEGER,
  `WP` INTEGER,
  `SB` INTEGER,
  `CS` INTEGER,
  `ZR` INTEGER
);
CREATE INDEX `FieldingOFsplit_playerID` ON `FieldingOFsplit` (`playerID`);
CREATE INDEX `FieldingOFsplit_yearID` ON `FieldingOFsplit` (`yearID`);
CREATE INDEX `FieldingOFsplit_teamID` ON `FieldingOFsplit` (`teamID`);
CREATE INDEX `FieldingOFsplit_lgID` ON `FieldingOFsplit` (`lgID`);
CREATE TABLE `FieldingPost` (
  `playerID` TEXT,
  `yearID` INTEGER,
  `teamID` TEXT,
  `lgID` TEXT,
  `round` TEXT,
  `POS` TEXT,
  `G` INTEGER,
  `GS` INTEGER,
  `InnOuts` INTEGER,
  `PO` INTEGER,
  `A` INTEGER,
  `E` INTEGER,
  `DP` INTEGER,
  `TP` INTEGER,
  `PB` INTEGER,
  `SB` INTEGER,
  `CS` INTEGER
);
CREATE INDEX `FieldingPost_playerID` ON `FieldingPost` (`playerID`);
CREATE INDEX `FieldingPost_yearID` ON `FieldingPost` (`yearID`);
CREATE INDEX `FieldingPost_teamID` ON `FieldingPost` (`teamID`);
CREATE INDEX `FieldingPost_lgID` ON `FieldingPost` (`lgID`);
CREATE TABLE `HallOfFame` (
  `playerID` TEXT,
  `yearID` INTEGER,
  `votedBy` TEXT,
  `ballots` REAL,
  `needed` REAL,
  `votes` REAL,
  `inducted` TEXT,
  `category` TEXT,
  `needed_note` TEXT
);
CREATE INDEX `HallOfFame_playerID` ON `HallOfFame` (`playerID`);
CREATE INDEX `HallOfFame_yearID` ON `HallOfFame` (`yearID`);
CREATE TABLE `HomeGames` (
  `year.key` INTEGER,
  `league.key` TEXT,
  `team.key` TEXT,
  `park.key` TEXT,
  `span.first` TEXT,
  `span.last` TEXT,
  `games` INTEGER,
  `openings` INTEGER,
  `attendance` INTEGER
);
CREATE TABLE `LahmanData` (
  `file` TEXT,
  `class` TEXT,
  `nobs` REAL,
  `nvar` REAL,
  `title` TEXT
);
CREATE TABLE `Managers` (
  `playerID` TEXT,
  `yearID` INTEGER,
  `teamID` TEXT,
  `lgID` TEXT,
  `inseason` INTEGER,
  `G` INTEGER,
  `W` INTEGER,
  `L` INTEGER,
  `rank` INTEGER,
  `plyrMgr` TEXT
);
CREATE INDEX `Managers_playerID` ON `Managers` (`playerID`);
CREATE INDEX `Managers_yearID` ON `Managers` (`yearID`);
CREATE INDEX `Managers_teamID` ON `Managers` (`teamID`);
CREATE INDEX `Managers_lgID` ON `Managers` (`lgID`);
CREATE TABLE `ManagersHalf` (
  `playerID` TEXT,
  `yearID` INTEGER,
  `teamID` TEXT,
  `lgID` TEXT,
  `inseason` INTEGER,
  `half` INTEGER,
  `G` INTEGER,
  `W` INTEGER,
  `L` INTEGER,
  `rank` INTEGER
);
CREATE INDEX `ManagersHalf_playerID` ON `ManagersHalf` (`playerID`);
CREATE INDEX `ManagersHalf_yearID` ON `ManagersHalf` (`yearID`);
CREATE INDEX `ManagersHalf_teamID` ON `ManagersHalf` (`teamID`);
CREATE INDEX `ManagersHalf_lgID` ON `ManagersHalf` (`lgID`);
CREATE TABLE `Parks` (
  `park.alias` TEXT,
  `park.key` TEXT,
  `park.name` TEXT,
  `city` TEXT,
  `state` TEXT,
  `country` TEXT
);
CREATE TABLE `People` (
  `playerID` TEXT,
  `birthYear` INTEGER,
  `birthMonth` INTEGER,
  `birthDay` INTEGER,
  `birthCity` TEXT,
  `birthCountry` TEXT,
  `birthState` TEXT,
  `deathYear` INTEGER,
  `deathMonth` INTEGER,
  `deathDay` INTEGER,
  `deathCountry` TEXT,
  `deathState` TEXT,
  `deathCity` TEXT,
  `nameFirst` TEXT,
  `nameLast` TEXT,
  `nameGiven` TEXT,
  `weight` INTEGER,
  `height` INTEGER,
  `bats` TEXT,
  `throws` TEXT,
  `debut` TEXT,
  `bbrefID` TEXT,
  `finalGame` TEXT,
  `retroID` TEXT,
  `deathDate` REAL,
  `birthDate` REAL
);
CREATE INDEX `People_playerID` ON `People` (`playerID`);
CREATE INDEX `People_bbrefID` ON `People` (`bbrefID`);
CREATE INDEX `People_retroID` ON `People` (`retroID`);
CREATE TABLE `Pitching` (
  `playerID` TEXT,
  `yearID` INTEGER,
  `stint` INTEGER,
  `teamID` TEXT,
  `lgID` TEXT,
  `W` INTEGER,
  `L` INTEGER,
  `G` INTEGER,
  `GS` INTEGER,
  `CG` INTEGER,
  `SHO` INTEGER,
  `SV` INTEGER,
  `IPouts` INTEGER,
  `H` INTEGER,
  `ER` INTEGER,
  `HR` INTEGER,
  `BB` INTEGER,
  `SO` INTEGER,
  `BAOpp` REAL,
  `ERA` REAL,
  `IBB` INTEGER,
  `WP` INTEGER,
  `HBP` INTEGER,
  `BK` INTEGER,
  `BFP` INTEGER,
  `GF` INTEGER,
  `R` INTEGER,
  `SH` INTEGER,
  `SF` INTEGER,
  `GIDP` INTEGER
);
CREATE INDEX `Pitching_playerID` ON `Pitching` (`playerID`);
CREATE INDEX `Pitching_yearID` ON `Pitching` (`yearID`);
CREATE INDEX `Pitching_teamID` ON `Pitching` (`teamID`);
CREATE INDEX `Pitching_lgID` ON `Pitching` (`lgID`);
CREATE TABLE `PitchingPost` (
  `playerID` TEXT,
  `yearID` INTEGER,
  `round` TEXT,
  `teamID` TEXT,
  `lgID` TEXT,
  `W` INTEGER,
  `L` INTEGER,
  `G` INTEGER,
  `GS` INTEGER,
  `CG` INTEGER,
  `SHO` INTEGER,
  `SV` INTEGER,
  `IPouts` INTEGER,
  `H` INTEGER,
  `ER` INTEGER,
  `HR` INTEGER,
  `BB` INTEGER,
  `SO` INTEGER,
  `BAOpp` REAL,
  `ERA` REAL,
  `IBB` INTEGER,
  `WP` INTEGER,
  `HBP` INTEGER,
  `BK` INTEGER,
  `BFP` INTEGER,
  `GF` INTEGER,
  `R` INTEGER,
  `SH` INTEGER,
  `SF` INTEGER,
  `GIDP` INTEGER
);
CREATE INDEX `PitchingPost_playerID` ON `PitchingPost` (`playerID`);
CREATE INDEX `PitchingPost_yearID` ON `PitchingPost` (`yearID`);
CREATE INDEX `PitchingPost_teamID` ON `PitchingPost` (`teamID`);
CREATE INDEX `PitchingPost_lgID` ON `PitchingPost` (`lgID`);
CREATE TABLE `Salaries` (
  `yearID` INTEGER,
  `teamID` TEXT,
  `lgID` TEXT,
  `playerID` TEXT,
  `salary` INTEGER
);
CREATE INDEX `Salaries_yearID` ON `Salaries` (`yearID`);
CREATE INDEX `Salaries_teamID` ON `Salaries` (`teamID`);
CREATE INDEX `Salaries_lgID` ON `Salaries` (`lgID`);
CREATE INDEX `Salaries_playerID` ON `Salaries` (`playerID`);
CREATE TABLE `Schools` (
  `schoolID` TEXT,
  `name_full` TEXT,
  `city` TEXT,
  `state` TEXT,
  `country` TEXT
);
CREATE INDEX `Schools_schoolID` ON `Schools` (`schoolID`);
CREATE TABLE `SeriesPost` (
  `yearID` INTEGER,
  `round` TEXT,
  `teamIDwinner` TEXT,
  `lgIDwinner` TEXT,
  `teamIDloser` TEXT,
  `lgIDloser` TEXT,
  `wins` INTEGER,
  `losses` INTEGER,
  `ties` INTEGER
);
CREATE INDEX `SeriesPost_yearID` ON `SeriesPost` (`yearID`);
CREATE TABLE `Teams` (
  `yearID` INTEGER,
  `lgID` TEXT,
  `teamID` TEXT,
  `franchID` TEXT,
  `divID` TEXT,
  `Rank` INTEGER,
  `G` INTEGER,
  `Ghome` INTEGER,
  `W` INTEGER,
  `L` INTEGER,
  `DivWin` TEXT,
  `WCWin` TEXT,
  `LgWin` TEXT,
  `WSWin` TEXT,
  `R` INTEGER,
  `AB` INTEGER,
  `H` INTEGER,
  `X2B` INTEGER,
  `X3B` INTEGER,
  `HR` INTEGER,
  `BB` INTEGER,
  `SO` INTEGER,
  `SB` INTEGER,
  `CS` INTEGER,
  `HBP` INTEGER,
  `SF` INTEGER,
  `RA` INTEGER,
  `ER` INTEGER,
  `ERA` REAL,
  `CG` INTEGER,
  `SHO` INTEGER,
  `SV` INTEGER,
  `IPouts` INTEGER,
  `HA` INTEGER,
  `HRA` INTEGER,
  `BBA` INTEGER,
  `SOA` INTEGER,
  `E` INTEGER,
  `DP` INTEGER,
  `FP` REAL,
  `name` TEXT,
  `park` TEXT,
  `attendance` INTEGER,
  `BPF` INTEGER,
  `PPF` INTEGER,
  `teamIDBR` TEXT,
  `teamIDlahman45` TEXT,
  `teamIDretro` TEXT
);
CREATE INDEX `Teams_yearID` ON `Teams` (`yearID`);
CREATE INDEX `Teams_lgID` ON `Teams` (`lgID`);
CREATE INDEX `Teams_teamID` ON `Teams` (`teamID`);
CREATE INDEX `Teams_franchID` ON `Teams` (`franchID`);
CREATE INDEX `Teams_divID` ON `Teams` (`divID`);
CREATE TABLE `TeamsFranchises` (
  `franchID` TEXT,
  `franchName` TEXT,
  `active` TEXT,
  `NAassoc` TEXT
);
CREATE INDEX `TeamsFranchises_franchID` ON `TeamsFranchises` (`franchID`);
CREATE TABLE `TeamsHalf` (
  `yearID` INTEGER,
  `lgID` TEXT,
  `teamID` TEXT,
  `Half` INTEGER,
  `divID` TEXT,
  `DivWin` TEXT,
  `Rank` INTEGER,
  `G` INTEGER,
  `W` INTEGER,
  `L` INTEGER
);
CREATE INDEX `TeamsHalf_yearID` ON `TeamsHalf` (`yearID`);
CREATE INDEX `TeamsHalf_lgID` ON `TeamsHalf` (`lgID`);
CREATE INDEX `TeamsHalf_teamID` ON `TeamsHalf` (`teamID`);
CREATE INDEX `TeamsHalf_divID` ON `TeamsHalf` (`divID`);

dbplyr

List the 5 modern-era managers with the highest winning percentages. (Define modern-era as managed over 1000 games post 1920.)

<SQL>
SELECT `nameGiven` || ' ' || `nameLast` AS `name`, `W`, `L`, `G`, `record`
FROM (
  SELECT `q01`.*, RANK() OVER (ORDER BY `record` DESC) AS `col01`
  FROM (
    SELECT
      `LHS`.*,
      `birthYear`,
      `birthMonth`,
      `birthDay`,
      `birthCity`,
      `birthCountry`,
      `birthState`,
      `deathYear`,
      `deathMonth`,
      `deathDay`,
      `deathCountry`,
      `deathState`,
      `deathCity`,
      `nameFirst`,
      `nameLast`,
      `nameGiven`,
      `weight`,
      `height`,
      `bats`,
      `throws`,
      `debut`,
      `bbrefID`,
      `finalGame`,
      `retroID`,
      `deathDate`,
      `birthDate`
    FROM (
      SELECT `q01`.*, `W` / (`G` + 0.0) AS `record`
      FROM (
        SELECT `q01`.*, `W` + `L` AS `G`
        FROM (
          SELECT `playerID`, SUM(`W`) AS `W`, SUM(`L`) AS `L`
          FROM (
            SELECT `Managers`.*
            FROM `Managers`
            WHERE (`yearID` >= 1920.0)
          ) AS `q01`
          GROUP BY `playerID`
        ) AS `q01`
      ) AS `q01`
      WHERE (`G` > 1000.0)
    ) AS `LHS`
    INNER JOIN `People`
      ON (`LHS`.`playerID` = `People`.`playerID`)
  ) AS `q01`
) AS `q01`
WHERE (`col01` <= 5)

dbplyr

Add |> explain() to see the DB internal query plan:

<SQL>
SELECT `nameGiven` || ' ' || `nameLast` AS `name`, `W`, `L`, `G`, `record`
FROM (
  SELECT `q01`.*, RANK() OVER (ORDER BY `record` DESC) AS `col01`
  FROM (
    SELECT
      `LHS`.*,
      `birthYear`,
      `birthMonth`,
      `birthDay`,
      `birthCity`,
      `birthCountry`,
      `birthState`,
      `deathYear`,
      `deathMonth`,
      `deathDay`,
      `deathCountry`,
      `deathState`,
      `deathCity`,
      `nameFirst`,
      `nameLast`,
      `nameGiven`,
      `weight`,
      `height`,
      `bats`,
      `throws`,
      `debut`,
      `bbrefID`,
      `finalGame`,
      `retroID`,
      `deathDate`,
      `birthDate`
    FROM (
      SELECT `q01`.*, `W` / (`G` + 0.0) AS `record`
      FROM (
        SELECT `q01`.*, `W` + `L` AS `G`
        FROM (
          SELECT `playerID`, SUM(`W`) AS `W`, SUM(`L`) AS `L`
          FROM (
            SELECT `Managers`.*
            FROM `Managers`
            WHERE (`yearID` >= 1920.0)
          ) AS `q01`
          GROUP BY `playerID`
        ) AS `q01`
      ) AS `q01`
      WHERE (`G` > 1000.0)
    ) AS `LHS`
    INNER JOIN `People`
      ON (`LHS`.`playerID` = `People`.`playerID`)
  ) AS `q01`
) AS `q01`
WHERE (`col01` <= 5)

<PLAN>
   id parent notused                                                 detail
1   2      0       0                                         CO-ROUTINE q01
2   5      2       0                                CO-ROUTINE (subquery-8)
3   7      5       0                                         CO-ROUTINE q01
4  14      7     142            SCAN Managers USING INDEX Managers_playerID
5  55      5      82                                               SCAN q01
6  62      5      42 SEARCH People USING INDEX People_playerID (playerID=?)
7 120      5       0                           USE TEMP B-TREE FOR ORDER BY
8 163      2      81                                      SCAN (subquery-8)
9 285      0      81                                               SCAN q01

Beyond this class to interpret, but in general use of INDEX yields good performance. Take DB classes for much more

dbplyr

List the 5 modern-era managers with the highest winning percentages. (Define modern-era as managed over 1000 games post 1920.)

# A tibble: 5 × 5
  name                          W     L     G record
  <chr>                     <int> <int> <int>  <dbl>
1 David Ray Roberts           851   507  1358  0.627
2 Joseph Vincent McCarthy    2125  1333  3458  0.615
3 Miller James Huggins        927   597  1524  0.608
4 William Harold Southworth  1044   704  1748  0.597
5 Aaron John Boone            603   428  1031  0.585