Software Tools for Data Analysis
STA 9750
Michael Weylandt
Week 8 – Thursday 2026-03-26
Last Updated: 2026-03-23

STA 9750 Week 8

Today: Mid-Semester Check-In Presentation + Enrichment: SQL

These slides can be found online at:

https://michael-weylandt.com/STA9750/slides/slides08.html

Upcoming TODO

Upcoming student responsibilities:

Date Time Details
2026-04-02 11:59pm ET Mid-Semester Teammate Peer Evaluations Due
2026-04-02 NA Classes Cancelled (Spring Break – Week 1)
2026-04-03 11:59pm ET Mini-Project #02 Due
2026-04-09 NA Classes Cancelled (Spring Break – Week 2)
2026-04-12 11:59pm ET Mini-Project Peer Feedback #02 Due
2026-04-16 6:00pm ET Pre-Assignment #09 Due
2026-04-21 6:00pm ET Pre-Assignment #10 Due
2026-04-23 6:00pm ET Pre-Assignment #11 Due

STA 9750 Week 8

Today: Mid-Semester Check-In Presentation + Enrichment: SQL

  • Team Formation ✅
  • Project Proposals ✅
  • Check-In Presentations ⬅️
    • Mid-Semester Peer Evaluations
  • Final Presentations
  • Final Reports
    • Individual Technical Report
    • Group Summary Report
    • Final Peer Evaluations

After Proposals

Introduction to SQL and Databases - 100% Optional

  • If you’ve taken DB courses already, won’t add much.
  • SQL is a common lingua franca of data science
  • Useful in interviews

Schedule Notes

Upcoming schedule is a bit confusing:

  • Today: Mid-Semester Check-In Presentations
  • Two Weeks of Spring Break (April 2 and 9)
    • MP#02 due over Spring Break
    • MP#02 Peer Feedback due over Spring Break
    • MP#03 Assigned
  • April 16: Data Import Part 1
  • Tuesday April 21: Make-Up Class (Advanced Plotting)
  • April 23: Data Import Part 2

Mini-Project #02

MP#02 - How Do You Do ‘You Do You’?

Due 2026-04-03 at 11:59pm ET

  • GitHub post (used for peer feedback) AND Brightspace
  • Start early to avoid Git issues

Use Piazza for questions over Spring Break (no Office Hours)

Mini-Project #02

Topic:

  • ATUS Analysis of “People Like You”

Skills:

  • Joining tables
  • Elementary visualization
  • Survey Weights

Mini-Project #03

MP#03 - Who Goes There? US Internal Migration and Implications for Congressional Reapportionment

Due 2026-04-24 at 11:59pm ET

Analysis of US internal migration

  • Who moves between states?
  • Predict populations as of 2030
  • Implications for Congress

Mini-Project #03

Topics Covered:

  • Using API Wrapper Packages
  • Using APIs Directly
  • Downloading and Parsing Files
  • Cleaning and standardizing inputs
  • Complex group_by structures

Warning: This is the longest MP. Plan accordingly!

Check-In Presentations

Check-In Presentations

Today: Mid-Semester Check-In Presentations (Official Description)

I’m looking for:

  • Finalized overarching question
  • Locking in on specific questions
  • Evaluation of relevant data sources
    • Anticipated data or statistical challenges
  • Engagement with existing literature
  • 6 minutes

Mainly, I want to see that you will be able to succeed

Peer Evaluations

After today, you will be able to upload (anonymous) evaluations of your project teammates:

source("https://michael-weylandt.com/STA9750/load_helpers.R")
project_peer_evals(cycle="mid")

Upload result bspf file to Brightspace by 2026-04-02.

I will collate and summarize feedback and assign scores.

On to the Show!

Presentation Order:

Presentation Order Team
1 Water Benders (JE+JABB+MTP+JA+AS)
2 3-1-Fun! (XC+ML+ER+RJSN)
3 Maniac Braniacs (HHS+KK+FC+DN)
4 Inspector Gadget (MUO+KN+CM+ID+KM)
5 Emissions Impossible (LR+MOG+APTL)

Wrap-Up

Next Steps

Now you should have your questions and data finalized

Time to begin the actual analysis

Starting next class (mid-April): getting data into R

Final Deliverables

Final Deliverables:

  • Final Presentations: 2026-05-07
  • Individual Technical Report: 2026-05-21
  • Group Summary Report: 2026-05-21

I’ve moved presentaitons forward a week this semester to give you two weeks to write final reports. Feedback on this decision welcome!

Life Tip of the Week

Tips for Technical Interviews

Some tips for technical interviews:

  1. Communicate!

  2. Test!

  3. Anticipate the Danger Zone!

Communicate!

Talk / comment throughout a technical interview

  • Live interview: vocalize thought process
  • Async interview: comment all your choices

Advantages of Communication:

  • Allow your interviewer to give “partial credit”
  • Most questions aren’t fully-precise, seek clarification
  • Share your thought process

Test!

Check your work as you go

  • Mistakes that only appear ‘at scale’ are rare
  • Test on small data, first terms of sequence, etc.

Easier to check simple output than code

Test!

Suppose you are asked to compute 100th number of a sequence

  • Don’t code and submit
  • Look at first 5 elements and check by hand

Suppose you are asked to analyze a data set

  • Try your code on a small data set first
  • Design your data to make correct answer evident

Danger Zone!

Know the ‘sharp edges’ of the tools you will be tested on:

  • NULL handling in SQL
  • Missing values in R
  • Groups of 0 or 1 observation

etc.

Technical Interviews

Your interviewer doesn’t ultimately care if you get the right answer on a puzzle

Do you have a process to get the right answer on a problem that matters?

  • Do you handle edge cases?
  • Do you resolve ambiguities in problem statement?
  • Do you test well or just “code and pray”?

Musical Treat


Enrichment Material: SQL

SQL

Structured Query Language (ISO 9075)

  • Standard language for interacting with Relational Database Management Systems (RDMS)
  • Divided into three main sub-languages:
    • Data Manipulation Language (DML)
    • Data Declaration Language (DDL)
    • Data Control Language (DCL)
  • Declarative, not Imperative Programming

SQL Languages

Data Definition Language (DDL)

  • “Design” language for DB
  • What tables are present? What types are the columns? What relationships (keys) are enforced?

Data Manipulation Language (DML)

  • “User” interaction with DB: queries, adding, changing, deleting data
  • SELECT, GROUP BY, WHERE, HAVING, JOIN, etc.
  • Roughly coextensive with dplyr

Data Control Language (DCL)

  • “Security” language for DB
  • What permissions do different users have?

RDMS

Relational Database Management System (RDMS)

  • Data management server for complex environments
  • Relational tabular data structures
  • Manipulated using SQL
    • Common SQL Standard + Many RDMS-specific extensions
  • ACID Guarantees

DDL

RDMS has a fixed number of tables (roughly data.frames). DDL specifies

  • Column names and type
  • Can values be left blank (nullability)?
  • Identifiers (Primary Keys)
  • Constraints (non-negative, a < b, etc.)
  • Relationships to other tables (Foreign Keys)

DDL

Basic command is the CREATE TABLE statement.

CREATE TABLE table_name (
  column_name1 TYPE1 PRIMARY KEY -- <- Guaranteed Unique and not blank
  column_name2 TYPE2 NOT NULL -- <- Can't be left blank
  column_name3 TYPE3 -- <- Can be blank
  column_name4 TYPE4 DEFAULT 0 -- <- Has default value
);

TYPE can be INTEGER, DECIMAL, CHAR, VARCHAR, DATE, TIMESTAMP and more

DDL

Our beloved penguins:

CREATE TABLE penguins (
  species VARCHAR(9) NOT NULL, -- Name of up to 9 characters
  island  VARCHAR(9) NOT NULL, -- Also can't be left blank
  bill_len NUMERIC,
  bill_dep NUMERIC,
  flipper_len INTEGER,
  body_mass INTEGER, 
  sex VARCHAR(6), -- Name of 6 (female) but also allow missing
  year INTEGER NOT NULL
);

Compare:

glimpse(penguins)
Rows: 344
Columns: 8
$ species     <fct> Adelie, Adelie, Adelie, Adelie, Adelie, Adelie, Adelie, Ad…
$ island      <fct> Torgersen, Torgersen, Torgersen, Torgersen, Torgersen, Tor…
$ bill_len    <dbl> 39.1, 39.5, 40.3, NA, 36.7, 39.3, 38.9, 39.2, 34.1, 42.0, …
$ bill_dep    <dbl> 18.7, 17.4, 18.0, NA, 19.3, 20.6, 17.8, 19.6, 18.1, 20.2, …
$ flipper_len <int> 181, 186, 195, NA, 193, 190, 181, 195, 193, 190, 186, 180,…
$ body_mass   <int> 3750, 3800, 3250, NA, 3450, 3650, 3625, 4675, 3475, 4250, …
$ sex         <fct> male, female, female, NA, female, male, female, male, NA, …
$ year        <int> 2007, 2007, 2007, 2007, 2007, 2007, 2007, 2007, 2007, 2007…

DDL

We can add a constraint so the year is 2007-2009:

CREATE TABLE penguins (
  species VARCHAR(9) NOT NULL, -- Name of up to 9 characters
  island  VARCHAR(9) NOT NULL, -- Also can't be left blank
  bill_len NUMERIC,
  bill_dep NUMERIC,
  flipper_len INTEGER,
  body_mass INTEGER, 
  sex VARCHAR(6), -- Name of 6 (female) but also allow missing
  year INTEGER NOT NULL CHECK (year > 2006 AND year < 2010)
);

DDL

Constraints for categorical variables (R factors) can be tricky. Create a normalized set of tables:

Sex ID Name
1 Female
2 Male
Penguin ID Sex ID Weight
1 1 4000
2 1 3500
3 2 5000

We want to make sure each Sex ID in table 2 matches a row of the first table

DDL

Constraints for categorical variables (R factors) can be tricky. Create a normalized set of tables:

CREATE TABLE sex (
  sex_id INTEGER PRIMARY KEY, 
  sex_name VARCHAR(6) NOT NULL
)

CREATE TABLE penguins (
  species VARCHAR(9) NOT NULL, -- Name of up to 9 characters
  island  VARCHAR(9) NOT NULL, -- Also can't be left blank
  bill_len NUMERIC,
  bill_dep NUMERIC,
  flipper_len INTEGER,
  body_mass INTEGER, 
  sex_id INTEGER, 
  year INTEGER NOT NULL CHECK (year > 2006 AND year < 2010), 
  FOREIGN KEY (sex_id) REFERENCES sex(sex_id)
);

DDL Normalization

Normalization:

  1. Increases integrity via constraints
  2. Reduces redundancy (need to change value in 1 place only)
  3. Slight performance benefits

More JOIN work in SQL than in dplyr

Proper table design is a science and an art

DML

Data Manipulation Language is for DB users.

  • Close parallels to dplyr
  • Compare:
penguins

vs.

SELECT * FROM penguins; 

SELECT * means all columns which matches R’s default

DML

Data Manipulation Language is for DB users.

  • Close parallels to dplyr
  • Compare:
penguins |> select(species, island, bill_dep)

vs.

SELECT species, island, bill_dep FROM penguins; 

DML

Data Manipulation Language is for DB users.

  • Close parallels to dplyr
  • Compare:
penguins |> select(species, island, bill_dep) |> filter(bill_dep > 20)

vs.

SELECT species, island, bill_dep FROM penguins WHERE bill_dep > 20; 

DML

Note that

penguins |> filter(bill_dep > 20) |> select(species, island, bill_dep)

and

penguins |> select(species, island, bill_dep) |> filter(bill_dep > 20)

are both

SELECT species, island, bill_dep FROM penguins WHERE bill_dep > 20; 

Why?

They commute – same result in either order

DML

  • R is an imperative paradigm - you say what to do
  • SQL is a declarative paradigm - you say what you want

Declarative paradigms let the server figure out the best way to compute, but are more restrictive

For a well-defined paradigm like SQL, much work has gone into query optimization

DML

  • Compare:
penguins |> group_by(species) |> summarize(mean_mass = mean(body_mass, na.rm=TRUE))

vs.

SELECT species, AVERAGE(body_mass) FROM penguins GROUP BY species;

summarize() is implicit in SELECT clause when we have a GROUP BY

DML

But:

penguins |> 
  group_by(species) |> 
  summarize(mean_mass = mean(body_mass, na.rm=TRUE)) |>
  filter(mean_mass > 4000)

is

SELECT species, AVERAGE(body_mass) AS avg_mass FROM penguins 
  GROUP BY species HAVING avg_mass > 4000;

Why not WHERE again?

DML

In SQL:

  • WHERE happens before a GROUP BY; and
  • HAVING happens after a GROUP BY

In R:

  • Ordering of summarize() and filter controls execution order

Because SQL is declarative, we need different words to encode filtering at different stages (pre/post-grouping)

ACID Guarantees

Most RDMSs provide a set of four ACID guarantees:

  • Atomicity: Changes occur all-or-nothing
    • E.g.: Add charge to customer’s account + schedule item for shipping
  • Consistency: DDL Constraints are always valid
    • E.g.: Can’t create a user without having an email on file
  • Isolation: Two different users don’t “step on each other’s toes”
    • E.g.: If I’m updating prices while you are checking out, you get your original price
  • Durability: When a change is complete, it is guaranteed to be saved to disk

Getting Started with RDMS

Easiest RDMS to use is SQLite

  • Most ubiquitous DB in the world (by a long-shot)
  • Already installed on your computer (many times over!)
  • Access directly through sqlite3 or sqlite3.exe
  • Access from R with RSQLite package

dbplyr

Can use dbplyr (extra b!) to put dplyr-UX on DB

E.g.

library(dbplyr)
my_db <- DBI::dbConnect(RSQLite::SQLite(), ":memory:")
copy_to(my_db, penguins) # Copy penguins data into a temporary SQLite DB

dbplyr

penguins_db <- tbl(my_db, "penguins") # Connect to my_db and load penguins table
penguins_db
# Source:   table<`penguins`> [?? x 8]
# Database: sqlite 3.50.4 [:memory:]
   species island    bill_len bill_dep flipper_len body_mass sex     year
   <chr>   <chr>        <dbl>    <dbl>       <int>     <int> <chr>  <int>
 1 Adelie  Torgersen     39.1     18.7         181      3750 male    2007
 2 Adelie  Torgersen     39.5     17.4         186      3800 female  2007
 3 Adelie  Torgersen     40.3     18           195      3250 female  2007
 4 Adelie  Torgersen     NA       NA            NA        NA <NA>    2007
 5 Adelie  Torgersen     36.7     19.3         193      3450 female  2007
 6 Adelie  Torgersen     39.3     20.6         190      3650 male    2007
 7 Adelie  Torgersen     38.9     17.8         181      3625 female  2007
 8 Adelie  Torgersen     39.2     19.6         195      4675 male    2007
 9 Adelie  Torgersen     34.1     18.1         193      3475 <NA>    2007
10 Adelie  Torgersen     42       20.2         190      4250 <NA>    2007
# ℹ more rows

dbplyr

penguins_db |> 
  group_by(species) |> 
  summarize(mean_mass = mean(body_mass, na.rm=TRUE)) |> 
  show_query()
<SQL>
SELECT `species`, AVG(`body_mass`) AS `mean_mass`
FROM `penguins`
GROUP BY `species`

Extra step - show_query to see SQL or collect() to run in DB:

penguins_db |> 
  group_by(species) |> 
  summarize(mean_mass = mean(body_mass, na.rm=TRUE)) |> 
  collect()
# A tibble: 3 × 2
  species   mean_mass
  <chr>         <dbl>
1 Adelie        3701.
2 Chinstrap     3733.
3 Gentoo        5076.

dbplyr

penguins_db |> 
  filter(body_mass > 4000) |>
  group_by(species) |> 
  summarize(mean_mass = mean(body_mass, na.rm=TRUE)) |> 
  show_query()
<SQL>
SELECT `species`, AVG(`body_mass`) AS `mean_mass`
FROM (
  SELECT `penguins`.*
  FROM `penguins`
  WHERE (`body_mass` > 4000.0)
) AS `q01`
GROUP BY `species`

Compared to:

penguins_db |> 
  group_by(species) |> 
  summarize(mean_mass = mean(body_mass, na.rm=TRUE)) |> 
  filter(mean_mass > 4000) |>
  show_query()
<SQL>
SELECT `species`, AVG(`body_mass`) AS `mean_mass`
FROM `penguins`
GROUP BY `species`
HAVING (AVG(`body_mass`) > 4000.0)

Not the best SQL, but RDMS can make quick work of it

dbplyr

A more substantial example:

library(dbplyr)
baseball_db <- lahman_sqlite()

dbplyr

A more substantial example:

library(dbplyr)
baseball_db <- lahman_sqlite()
CREATE TABLE `AllstarFull` (
  `playerID` TEXT,
  `yearID` INTEGER,
  `gameNum` INTEGER,
  `gameID` TEXT,
  `teamID` TEXT,
  `lgID` TEXT,
  `GP` INTEGER,
  `startingPos` TEXT
);
CREATE INDEX `AllstarFull_playerID` ON `AllstarFull` (`playerID`);
CREATE INDEX `AllstarFull_yearID` ON `AllstarFull` (`yearID`);
CREATE INDEX `AllstarFull_gameID` ON `AllstarFull` (`gameID`);
CREATE INDEX `AllstarFull_teamID` ON `AllstarFull` (`teamID`);
CREATE INDEX `AllstarFull_lgID` ON `AllstarFull` (`lgID`);
CREATE TABLE sqlite_stat1(tbl,idx,stat);
CREATE TABLE sqlite_stat4(tbl,idx,neq,nlt,ndlt,sample);
CREATE TABLE `Appearances` (
  `yearID` INTEGER,
  `teamID` TEXT,
  `lgID` TEXT,
  `playerID` TEXT,
  `G_all` INTEGER,
  `GS` INTEGER,
  `G_batting` INTEGER,
  `G_defense` INTEGER,
  `G_p` INTEGER,
  `G_c` INTEGER,
  `G_1b` INTEGER,
  `G_2b` INTEGER,
  `G_3b` INTEGER,
  `G_ss` INTEGER,
  `G_lf` INTEGER,
  `G_cf` INTEGER,
  `G_rf` INTEGER,
  `G_of` INTEGER,
  `G_dh` INTEGER,
  `G_ph` INTEGER,
  `G_pr` INTEGER
);
CREATE INDEX `Appearances_yearID` ON `Appearances` (`yearID`);
CREATE INDEX `Appearances_teamID` ON `Appearances` (`teamID`);
CREATE INDEX `Appearances_lgID` ON `Appearances` (`lgID`);
CREATE INDEX `Appearances_playerID` ON `Appearances` (`playerID`);
CREATE TABLE `AwardsManagers` (
  `playerID` TEXT,
  `awardID` TEXT,
  `yearID` INTEGER,
  `lgID` TEXT,
  `tie` TEXT,
  `notes` TEXT
);
CREATE INDEX `AwardsManagers_playerID` ON `AwardsManagers` (`playerID`);
CREATE INDEX `AwardsManagers_awardID` ON `AwardsManagers` (`awardID`);
CREATE INDEX `AwardsManagers_yearID` ON `AwardsManagers` (`yearID`);
CREATE INDEX `AwardsManagers_lgID` ON `AwardsManagers` (`lgID`);
CREATE TABLE `AwardsPlayers` (
  `playerID` TEXT,
  `awardID` TEXT,
  `yearID` INTEGER,
  `lgID` TEXT,
  `tie` TEXT,
  `notes` TEXT
);
CREATE INDEX `AwardsPlayers_playerID` ON `AwardsPlayers` (`playerID`);
CREATE INDEX `AwardsPlayers_awardID` ON `AwardsPlayers` (`awardID`);
CREATE INDEX `AwardsPlayers_yearID` ON `AwardsPlayers` (`yearID`);
CREATE INDEX `AwardsPlayers_lgID` ON `AwardsPlayers` (`lgID`);
CREATE TABLE `AwardsShareManagers` (
  `awardID` TEXT,
  `yearID` INTEGER,
  `lgID` TEXT,
  `playerID` TEXT,
  `pointsWon` INTEGER,
  `pointsMax` INTEGER,
  `votesFirst` INTEGER
);
CREATE INDEX `AwardsShareManagers_awardID` ON `AwardsShareManagers` (`awardID`);
CREATE INDEX `AwardsShareManagers_yearID` ON `AwardsShareManagers` (`yearID`);
CREATE INDEX `AwardsShareManagers_lgID` ON `AwardsShareManagers` (`lgID`);
CREATE INDEX `AwardsShareManagers_playerID` ON `AwardsShareManagers` (`playerID`);
CREATE TABLE `AwardsSharePlayers` (
  `awardID` TEXT,
  `yearID` INTEGER,
  `lgID` TEXT,
  `playerID` TEXT,
  `pointsWon` INTEGER,
  `pointsMax` INTEGER,
  `votesFirst` INTEGER
);
CREATE INDEX `AwardsSharePlayers_awardID` ON `AwardsSharePlayers` (`awardID`);
CREATE INDEX `AwardsSharePlayers_yearID` ON `AwardsSharePlayers` (`yearID`);
CREATE INDEX `AwardsSharePlayers_lgID` ON `AwardsSharePlayers` (`lgID`);
CREATE INDEX `AwardsSharePlayers_playerID` ON `AwardsSharePlayers` (`playerID`);
CREATE TABLE `Batting` (
  `playerID` TEXT,
  `yearID` INTEGER,
  `stint` INTEGER,
  `teamID` TEXT,
  `lgID` TEXT,
  `G` INTEGER,
  `AB` INTEGER,
  `R` INTEGER,
  `H` INTEGER,
  `X2B` INTEGER,
  `X3B` INTEGER,
  `HR` INTEGER,
  `RBI` INTEGER,
  `SB` INTEGER,
  `CS` INTEGER,
  `BB` INTEGER,
  `SO` INTEGER,
  `IBB` INTEGER,
  `HBP` INTEGER,
  `SH` INTEGER,
  `SF` INTEGER,
  `GIDP` INTEGER
);
CREATE INDEX `Batting_playerID` ON `Batting` (`playerID`);
CREATE INDEX `Batting_yearID` ON `Batting` (`yearID`);
CREATE INDEX `Batting_teamID` ON `Batting` (`teamID`);
CREATE INDEX `Batting_lgID` ON `Batting` (`lgID`);
CREATE TABLE `BattingPost` (
  `yearID` INTEGER,
  `round` TEXT,
  `playerID` TEXT,
  `teamID` TEXT,
  `lgID` TEXT,
  `G` INTEGER,
  `AB` INTEGER,
  `R` INTEGER,
  `H` INTEGER,
  `X2B` INTEGER,
  `X3B` INTEGER,
  `HR` INTEGER,
  `RBI` INTEGER,
  `SB` INTEGER,
  `CS` INTEGER,
  `BB` INTEGER,
  `SO` INTEGER,
  `IBB` INTEGER,
  `HBP` INTEGER,
  `SH` INTEGER,
  `SF` INTEGER,
  `GIDP` INTEGER
);
CREATE INDEX `BattingPost_yearID` ON `BattingPost` (`yearID`);
CREATE INDEX `BattingPost_playerID` ON `BattingPost` (`playerID`);
CREATE INDEX `BattingPost_teamID` ON `BattingPost` (`teamID`);
CREATE INDEX `BattingPost_lgID` ON `BattingPost` (`lgID`);
CREATE TABLE `CollegePlaying` (
  `playerID` TEXT,
  `schoolID` TEXT,
  `yearID` INTEGER
);
CREATE INDEX `CollegePlaying_playerID` ON `CollegePlaying` (`playerID`);
CREATE INDEX `CollegePlaying_schoolID` ON `CollegePlaying` (`schoolID`);
CREATE INDEX `CollegePlaying_yearID` ON `CollegePlaying` (`yearID`);
CREATE TABLE `Fielding` (
  `playerID` TEXT,
  `yearID` INTEGER,
  `stint` INTEGER,
  `teamID` TEXT,
  `lgID` TEXT,
  `POS` TEXT,
  `G` INTEGER,
  `GS` INTEGER,
  `InnOuts` INTEGER,
  `PO` INTEGER,
  `A` INTEGER,
  `E` INTEGER,
  `DP` INTEGER,
  `PB` INTEGER,
  `WP` INTEGER,
  `SB` INTEGER,
  `CS` INTEGER,
  `ZR` INTEGER
);
CREATE INDEX `Fielding_playerID` ON `Fielding` (`playerID`);
CREATE INDEX `Fielding_yearID` ON `Fielding` (`yearID`);
CREATE INDEX `Fielding_teamID` ON `Fielding` (`teamID`);
CREATE INDEX `Fielding_lgID` ON `Fielding` (`lgID`);
CREATE TABLE `FieldingOF` (
  `playerID` TEXT,
  `yearID` INTEGER,
  `stint` INTEGER,
  `Glf` INTEGER,
  `Gcf` INTEGER,
  `Grf` INTEGER
);
CREATE INDEX `FieldingOF_playerID` ON `FieldingOF` (`playerID`);
CREATE INDEX `FieldingOF_yearID` ON `FieldingOF` (`yearID`);
CREATE TABLE `FieldingOFsplit` (
  `playerID` TEXT,
  `yearID` INTEGER,
  `stint` INTEGER,
  `teamID` TEXT,
  `lgID` TEXT,
  `POS` TEXT,
  `G` INTEGER,
  `GS` INTEGER,
  `InnOuts` INTEGER,
  `PO` INTEGER,
  `A` INTEGER,
  `E` INTEGER,
  `DP` INTEGER,
  `PB` INTEGER,
  `WP` INTEGER,
  `SB` INTEGER,
  `CS` INTEGER,
  `ZR` INTEGER
);
CREATE INDEX `FieldingOFsplit_playerID` ON `FieldingOFsplit` (`playerID`);
CREATE INDEX `FieldingOFsplit_yearID` ON `FieldingOFsplit` (`yearID`);
CREATE INDEX `FieldingOFsplit_teamID` ON `FieldingOFsplit` (`teamID`);
CREATE INDEX `FieldingOFsplit_lgID` ON `FieldingOFsplit` (`lgID`);
CREATE TABLE `FieldingPost` (
  `playerID` TEXT,
  `yearID` INTEGER,
  `teamID` TEXT,
  `lgID` TEXT,
  `round` TEXT,
  `POS` TEXT,
  `G` INTEGER,
  `GS` INTEGER,
  `InnOuts` INTEGER,
  `PO` INTEGER,
  `A` INTEGER,
  `E` INTEGER,
  `DP` INTEGER,
  `TP` INTEGER,
  `PB` INTEGER,
  `SB` INTEGER,
  `CS` INTEGER
);
CREATE INDEX `FieldingPost_playerID` ON `FieldingPost` (`playerID`);
CREATE INDEX `FieldingPost_yearID` ON `FieldingPost` (`yearID`);
CREATE INDEX `FieldingPost_teamID` ON `FieldingPost` (`teamID`);
CREATE INDEX `FieldingPost_lgID` ON `FieldingPost` (`lgID`);
CREATE TABLE `HallOfFame` (
  `playerID` TEXT,
  `yearID` INTEGER,
  `votedBy` TEXT,
  `ballots` REAL,
  `needed` REAL,
  `votes` REAL,
  `inducted` TEXT,
  `category` TEXT,
  `needed_note` TEXT
);
CREATE INDEX `HallOfFame_playerID` ON `HallOfFame` (`playerID`);
CREATE INDEX `HallOfFame_yearID` ON `HallOfFame` (`yearID`);
CREATE TABLE `HomeGames` (
  `year.key` INTEGER,
  `league.key` TEXT,
  `team.key` TEXT,
  `park.key` TEXT,
  `span.first` TEXT,
  `span.last` TEXT,
  `games` INTEGER,
  `openings` INTEGER,
  `attendance` INTEGER
);
CREATE TABLE `LahmanData` (
  `file` TEXT,
  `class` TEXT,
  `nobs` REAL,
  `nvar` REAL,
  `title` TEXT
);
CREATE TABLE `Managers` (
  `playerID` TEXT,
  `yearID` INTEGER,
  `teamID` TEXT,
  `lgID` TEXT,
  `inseason` INTEGER,
  `G` INTEGER,
  `W` INTEGER,
  `L` INTEGER,
  `rank` INTEGER,
  `plyrMgr` TEXT
);
CREATE INDEX `Managers_playerID` ON `Managers` (`playerID`);
CREATE INDEX `Managers_yearID` ON `Managers` (`yearID`);
CREATE INDEX `Managers_teamID` ON `Managers` (`teamID`);
CREATE INDEX `Managers_lgID` ON `Managers` (`lgID`);
CREATE TABLE `ManagersHalf` (
  `playerID` TEXT,
  `yearID` INTEGER,
  `teamID` TEXT,
  `lgID` TEXT,
  `inseason` INTEGER,
  `half` INTEGER,
  `G` INTEGER,
  `W` INTEGER,
  `L` INTEGER,
  `rank` INTEGER
);
CREATE INDEX `ManagersHalf_playerID` ON `ManagersHalf` (`playerID`);
CREATE INDEX `ManagersHalf_yearID` ON `ManagersHalf` (`yearID`);
CREATE INDEX `ManagersHalf_teamID` ON `ManagersHalf` (`teamID`);
CREATE INDEX `ManagersHalf_lgID` ON `ManagersHalf` (`lgID`);
CREATE TABLE `Parks` (
  `park.alias` TEXT,
  `park.key` TEXT,
  `park.name` TEXT,
  `city` TEXT,
  `state` TEXT,
  `country` TEXT
);
CREATE TABLE `People` (
  `playerID` TEXT,
  `birthYear` INTEGER,
  `birthMonth` INTEGER,
  `birthDay` INTEGER,
  `birthCity` TEXT,
  `birthCountry` TEXT,
  `birthState` TEXT,
  `deathYear` INTEGER,
  `deathMonth` INTEGER,
  `deathDay` INTEGER,
  `deathCountry` TEXT,
  `deathState` TEXT,
  `deathCity` TEXT,
  `nameFirst` TEXT,
  `nameLast` TEXT,
  `nameGiven` TEXT,
  `weight` INTEGER,
  `height` INTEGER,
  `bats` TEXT,
  `throws` TEXT,
  `debut` TEXT,
  `bbrefID` TEXT,
  `finalGame` TEXT,
  `retroID` TEXT,
  `deathDate` REAL,
  `birthDate` REAL
);
CREATE INDEX `People_playerID` ON `People` (`playerID`);
CREATE INDEX `People_bbrefID` ON `People` (`bbrefID`);
CREATE INDEX `People_retroID` ON `People` (`retroID`);
CREATE TABLE `Pitching` (
  `playerID` TEXT,
  `yearID` INTEGER,
  `stint` INTEGER,
  `teamID` TEXT,
  `lgID` TEXT,
  `W` INTEGER,
  `L` INTEGER,
  `G` INTEGER,
  `GS` INTEGER,
  `CG` INTEGER,
  `SHO` INTEGER,
  `SV` INTEGER,
  `IPouts` INTEGER,
  `H` INTEGER,
  `ER` INTEGER,
  `HR` INTEGER,
  `BB` INTEGER,
  `SO` INTEGER,
  `BAOpp` REAL,
  `ERA` REAL,
  `IBB` INTEGER,
  `WP` INTEGER,
  `HBP` INTEGER,
  `BK` INTEGER,
  `BFP` INTEGER,
  `GF` INTEGER,
  `R` INTEGER,
  `SH` INTEGER,
  `SF` INTEGER,
  `GIDP` INTEGER
);
CREATE INDEX `Pitching_playerID` ON `Pitching` (`playerID`);
CREATE INDEX `Pitching_yearID` ON `Pitching` (`yearID`);
CREATE INDEX `Pitching_teamID` ON `Pitching` (`teamID`);
CREATE INDEX `Pitching_lgID` ON `Pitching` (`lgID`);
CREATE TABLE `PitchingPost` (
  `playerID` TEXT,
  `yearID` INTEGER,
  `round` TEXT,
  `teamID` TEXT,
  `lgID` TEXT,
  `W` INTEGER,
  `L` INTEGER,
  `G` INTEGER,
  `GS` INTEGER,
  `CG` INTEGER,
  `SHO` INTEGER,
  `SV` INTEGER,
  `IPouts` INTEGER,
  `H` INTEGER,
  `ER` INTEGER,
  `HR` INTEGER,
  `BB` INTEGER,
  `SO` INTEGER,
  `BAOpp` REAL,
  `ERA` REAL,
  `IBB` INTEGER,
  `WP` INTEGER,
  `HBP` INTEGER,
  `BK` INTEGER,
  `BFP` INTEGER,
  `GF` INTEGER,
  `R` INTEGER,
  `SH` INTEGER,
  `SF` INTEGER,
  `GIDP` INTEGER
);
CREATE INDEX `PitchingPost_playerID` ON `PitchingPost` (`playerID`);
CREATE INDEX `PitchingPost_yearID` ON `PitchingPost` (`yearID`);
CREATE INDEX `PitchingPost_teamID` ON `PitchingPost` (`teamID`);
CREATE INDEX `PitchingPost_lgID` ON `PitchingPost` (`lgID`);
CREATE TABLE `Salaries` (
  `yearID` INTEGER,
  `teamID` TEXT,
  `lgID` TEXT,
  `playerID` TEXT,
  `salary` INTEGER
);
CREATE INDEX `Salaries_yearID` ON `Salaries` (`yearID`);
CREATE INDEX `Salaries_teamID` ON `Salaries` (`teamID`);
CREATE INDEX `Salaries_lgID` ON `Salaries` (`lgID`);
CREATE INDEX `Salaries_playerID` ON `Salaries` (`playerID`);
CREATE TABLE `Schools` (
  `schoolID` TEXT,
  `name_full` TEXT,
  `city` TEXT,
  `state` TEXT,
  `country` TEXT
);
CREATE INDEX `Schools_schoolID` ON `Schools` (`schoolID`);
CREATE TABLE `SeriesPost` (
  `yearID` INTEGER,
  `round` TEXT,
  `teamIDwinner` TEXT,
  `lgIDwinner` TEXT,
  `teamIDloser` TEXT,
  `lgIDloser` TEXT,
  `wins` INTEGER,
  `losses` INTEGER,
  `ties` INTEGER
);
CREATE INDEX `SeriesPost_yearID` ON `SeriesPost` (`yearID`);
CREATE TABLE `Teams` (
  `yearID` INTEGER,
  `lgID` TEXT,
  `teamID` TEXT,
  `franchID` TEXT,
  `divID` TEXT,
  `Rank` INTEGER,
  `G` INTEGER,
  `Ghome` INTEGER,
  `W` INTEGER,
  `L` INTEGER,
  `DivWin` TEXT,
  `WCWin` TEXT,
  `LgWin` TEXT,
  `WSWin` TEXT,
  `R` INTEGER,
  `AB` INTEGER,
  `H` INTEGER,
  `X2B` INTEGER,
  `X3B` INTEGER,
  `HR` INTEGER,
  `BB` INTEGER,
  `SO` INTEGER,
  `SB` INTEGER,
  `CS` INTEGER,
  `HBP` INTEGER,
  `SF` INTEGER,
  `RA` INTEGER,
  `ER` INTEGER,
  `ERA` REAL,
  `CG` INTEGER,
  `SHO` INTEGER,
  `SV` INTEGER,
  `IPouts` INTEGER,
  `HA` INTEGER,
  `HRA` INTEGER,
  `BBA` INTEGER,
  `SOA` INTEGER,
  `E` INTEGER,
  `DP` INTEGER,
  `FP` REAL,
  `name` TEXT,
  `park` TEXT,
  `attendance` INTEGER,
  `BPF` INTEGER,
  `PPF` INTEGER,
  `teamIDBR` TEXT,
  `teamIDlahman45` TEXT,
  `teamIDretro` TEXT
);
CREATE INDEX `Teams_yearID` ON `Teams` (`yearID`);
CREATE INDEX `Teams_lgID` ON `Teams` (`lgID`);
CREATE INDEX `Teams_teamID` ON `Teams` (`teamID`);
CREATE INDEX `Teams_franchID` ON `Teams` (`franchID`);
CREATE INDEX `Teams_divID` ON `Teams` (`divID`);
CREATE TABLE `TeamsFranchises` (
  `franchID` TEXT,
  `franchName` TEXT,
  `active` TEXT,
  `NAassoc` TEXT
);
CREATE INDEX `TeamsFranchises_franchID` ON `TeamsFranchises` (`franchID`);
CREATE TABLE `TeamsHalf` (
  `yearID` INTEGER,
  `lgID` TEXT,
  `teamID` TEXT,
  `Half` INTEGER,
  `divID` TEXT,
  `DivWin` TEXT,
  `Rank` INTEGER,
  `G` INTEGER,
  `W` INTEGER,
  `L` INTEGER
);
CREATE INDEX `TeamsHalf_yearID` ON `TeamsHalf` (`yearID`);
CREATE INDEX `TeamsHalf_lgID` ON `TeamsHalf` (`lgID`);
CREATE INDEX `TeamsHalf_teamID` ON `TeamsHalf` (`teamID`);
CREATE INDEX `TeamsHalf_divID` ON `TeamsHalf` (`divID`);

dbplyr

List the 5 modern-era managers with the highest winning percentages. (Define modern-era as managed over 1000 games post 1920.)

managers <- tbl(baseball_db, "Managers")
people   <- tbl(baseball_db, "People")

managers |> 
    filter(yearID >= 1920) |> 
    group_by(playerID) |> 
    summarize(W = sum(W, na.rm=TRUE),
              L = sum(L, na.rm=TRUE)) |> 
    mutate(G = W + L, 
           record = W / (G + 0.0)) |>  # + 0.0 forces to work with decimals, not ints
    filter(G > 1000) |> 
    inner_join(people, 
               join_by(playerID == playerID)) |> 
    slice_max(record, n=5) |> 
    mutate(name = paste(nameGiven, nameLast)) |> 
    select(name, W, L, G, record) |>
    show_query()
<SQL>
SELECT `nameGiven` || ' ' || `nameLast` AS `name`, `W`, `L`, `G`, `record`
FROM (
  SELECT `q01`.*, RANK() OVER (ORDER BY `record` DESC) AS `col01`
  FROM (
    SELECT
      `LHS`.*,
      `birthYear`,
      `birthMonth`,
      `birthDay`,
      `birthCity`,
      `birthCountry`,
      `birthState`,
      `deathYear`,
      `deathMonth`,
      `deathDay`,
      `deathCountry`,
      `deathState`,
      `deathCity`,
      `nameFirst`,
      `nameLast`,
      `nameGiven`,
      `weight`,
      `height`,
      `bats`,
      `throws`,
      `debut`,
      `bbrefID`,
      `finalGame`,
      `retroID`,
      `deathDate`,
      `birthDate`
    FROM (
      SELECT `q01`.*, `W` / (`G` + 0.0) AS `record`
      FROM (
        SELECT `q01`.*, `W` + `L` AS `G`
        FROM (
          SELECT `playerID`, SUM(`W`) AS `W`, SUM(`L`) AS `L`
          FROM (
            SELECT `Managers`.*
            FROM `Managers`
            WHERE (`yearID` >= 1920.0)
          ) AS `q01`
          GROUP BY `playerID`
        ) AS `q01`
      ) AS `q01`
      WHERE (`G` > 1000.0)
    ) AS `LHS`
    INNER JOIN `People`
      ON (`LHS`.`playerID` = `People`.`playerID`)
  ) AS `q01`
) AS `q01`
WHERE (`col01` <= 5)

dbplyr

Add |> explain() to see the DB internal query plan:

managers <- tbl(baseball_db, "Managers")
people   <- tbl(baseball_db, "People")

managers |> 
    filter(yearID >= 1920) |> 
    group_by(playerID) |> 
    summarize(W = sum(W, na.rm=TRUE),
              L = sum(L, na.rm=TRUE)) |> 
    mutate(G = W + L, 
           record = W / (G + 0.0)) |>  # + 0.0 forces to work with decimals, not ints
    filter(G > 1000) |> 
    inner_join(people, 
               join_by(playerID == playerID)) |> 
    slice_max(record, n=5) |> 
    mutate(name = paste(nameGiven, nameLast)) |> 
    select(name, W, L, G, record) |>
    explain() # 
<SQL>
SELECT `nameGiven` || ' ' || `nameLast` AS `name`, `W`, `L`, `G`, `record`
FROM (
  SELECT `q01`.*, RANK() OVER (ORDER BY `record` DESC) AS `col01`
  FROM (
    SELECT
      `LHS`.*,
      `birthYear`,
      `birthMonth`,
      `birthDay`,
      `birthCity`,
      `birthCountry`,
      `birthState`,
      `deathYear`,
      `deathMonth`,
      `deathDay`,
      `deathCountry`,
      `deathState`,
      `deathCity`,
      `nameFirst`,
      `nameLast`,
      `nameGiven`,
      `weight`,
      `height`,
      `bats`,
      `throws`,
      `debut`,
      `bbrefID`,
      `finalGame`,
      `retroID`,
      `deathDate`,
      `birthDate`
    FROM (
      SELECT `q01`.*, `W` / (`G` + 0.0) AS `record`
      FROM (
        SELECT `q01`.*, `W` + `L` AS `G`
        FROM (
          SELECT `playerID`, SUM(`W`) AS `W`, SUM(`L`) AS `L`
          FROM (
            SELECT `Managers`.*
            FROM `Managers`
            WHERE (`yearID` >= 1920.0)
          ) AS `q01`
          GROUP BY `playerID`
        ) AS `q01`
      ) AS `q01`
      WHERE (`G` > 1000.0)
    ) AS `LHS`
    INNER JOIN `People`
      ON (`LHS`.`playerID` = `People`.`playerID`)
  ) AS `q01`
) AS `q01`
WHERE (`col01` <= 5)

<PLAN>
   id parent notused                                                 detail
1   2      0       0                                         CO-ROUTINE q01
2   5      2       0                                CO-ROUTINE (subquery-8)
3   7      5       0                                         CO-ROUTINE q01
4  14      7     142            SCAN Managers USING INDEX Managers_playerID
5  55      5      82                                               SCAN q01
6  62      5      42 SEARCH People USING INDEX People_playerID (playerID=?)
7 120      5       0                           USE TEMP B-TREE FOR ORDER BY
8 163      2      81                                      SCAN (subquery-8)
9 285      0      81                                               SCAN q01

Beyond this class to interpret, but in general use of INDEX yields good performance. Take DB classes for much more

dbplyr

List the 5 modern-era managers with the highest winning percentages. (Define modern-era as managed over 1000 games post 1920.)

managers <- tbl(baseball_db, "Managers")
people   <- tbl(baseball_db, "People")

managers |> 
    filter(yearID >= 1920) |> 
    group_by(playerID) |> 
    summarize(W = sum(W, na.rm=TRUE),
              L = sum(L, na.rm=TRUE)) |> 
    mutate(G = W + L, 
           record = W / (G + 0.0)) |>  # + 0.0 forces to work with decimals, not ints
    filter(G > 1000) |> 
    inner_join(people, 
               join_by(playerID == playerID)) |> 
    slice_max(record, n=5) |> 
    mutate(name = paste(nameGiven, nameLast)) |> 
    select(name, W, L, G, record) |>
    collect()
# A tibble: 5 × 5
  name                          W     L     G record
  <chr>                     <int> <int> <int>  <dbl>
1 David Ray Roberts           851   507  1358  0.627
2 Joseph Vincent McCarthy    2125  1333  3458  0.615
3 Miller James Huggins        927   597  1524  0.608
4 William Harold Southworth  1044   704  1748  0.597
5 Aaron John Boone            603   428  1031  0.585