Today: Lecture #05: Multi-Table dplyr Verbs
dplyr vs SQLAssigned on GitHub - due on 2026-03-22
Use helper functions to find submissions assigned to you. Ask on Piazza if still having trouble.
Submissions may not map perfectly to rubric - use your best judgement
Be generous but serious:
Learn from this! What can you adapt for MP#02?
Example of poor feedback:
Scores
Written Communication: 10
Project Skeleton: 10
Formatting & Display: 10
Code Quality: 10
Data Preparation: 10
Extra Credit: 10
Comments
Website looks really good, nice and clean!
Written Communication
Excellent and straight to the point.
Project Skeleton
Solid skeleton, well-organized .
Formatting & Display
Nicely formatted and balanced across the website.
Code Quality
Code runs like Forrest Gump in a Slump.
Data Preparation
Very great.
Extra Credit
Added graphes for even better understanding.
Red Flag: Repeated verbatim on several posts
Reminder: Poor feedback \(\neq\) poor work.
Example of medium feedback:
## Scores
- Written Communication: 8
- Project Skeleton: 10
- Formatting & Display: 9
- Code Quality: 10
- Data Preparation: 10
- Extra Credit: 1
## Comments
Love the visuals for Press Releases.
### Written Communication
A short summary or a description of this project would be great to add.
### Project Skeleton
Code completes all instructor-provided tasks correctly.
### Formatting & Display
Tables have well-formatted column names; caption would be great.
### Code Quality
Code is clear and well written.
### Data Preparation
Automatic (10/10). Out of scope for this mini-project.
### Extra Credit
I found all the Press Releases very interesting. The visuals were a great touch.
Example of great feedback:
## Scores
- Written Communication: 9
- Project Skeleton: 10
- Formatting & Display: 9
- Code Quality: 9
- Data Preparation: 10
- Extra Credit: 2
## Comments
### Written Communication
Overall, your writing is clear and easy to follow. I noticed a few small typos, but nothing major — using the built-in spell check in RStudio should catch those quickly. Everything else looks solid, and I didn’t have any major concerns based on my review.
### Project Skeleton
All tasks were completed satisfactorily.
### Formatting & Display
Overall, your tables and figures are well-organized and clear. There is one table in the "Data" section with column titles that could be formatted a little more cleanly for easier reading. I also noticed a few small typos in some of the table captions, but they should be easy to fix.
### Code Quality
The code quality is generally good, but there are a few minor linter issues. The comments could be more frequent and clearer in some places. For example, the comment `#checking dupolicate so that github will not block it` should be corrected to "duplicate," and the explanation could be clarified to avoid confusion–what duplicate are you referring to, and how does it block GitHub? It might confuse someone else reading the code.
Also, I would recommend moving all library imports to the top of the script, as that is typically considered standard practice.
### Data Preparation
The data preparation looks solid overall. I like how you handle missing files and JSON parsing failures–it demonstrates strong defensive programming.
### Extra Credit
Additional two points for using quarto's video support!
Lack of prior experience is not a hinderance here:
👏 Thank you for excellent work here! 👏
I’ve started reading some submissions - really impressive!
Hopefully automated checks helpful. If you have ideas for other useful automated ‘supports’ let me know
Q: Why doesn’t my site look the same on GitHub as it does on my laptop?
A: Missing css and js files. Need to upload everything in docs.
Q: Why doesn’t my submission have the right URL?
A: Most computers have case-sensitive file names, Mac doesn’t, and Windows is iffy.
MP#02 released - TBD
Due2026-04-03 at 11:59pm ET
Pay attention to the rubric
Thanks to HI and SK for help with Windows-compatibility!
Tentative Topics
Brightspace - day before class
Next pre-assignment is 2026-03-12 at 6:00pm ET
I missed a few comments in the previous cycle (sorry!) - trying to catch up in the next few days
Due Wednesday at 11:45pm:
ggplot2Expect back:
Most important: team names!
Previous: Rat Pack, Subway Surfers, Going for Gold, etc.
100% optional discussion of dplyr vs SQL
SQL is a very common Data Scientist interview topic so if you’re not taking a database course, might be usefulTBD
quarto) ✅R Basics ✅R ✅R ⬅️RRUpcoming work from course calendar
If you want to vote in the upcoming NYC Election, it’s time to register to vote:
Early voting begins October 25th: need to be registered by then
Structured Query Language (ISO 9075)
Data Definition Language (DDL)
Data Manipulation Language (DML)
SELECT, GROUP BY, WHERE, HAVING, JOIN, etc.dplyrData Control Language (DCL)
Relational Database Management System (RDMS)
RDMS has a fixed number of tables (roughly data.frames). DDL specifies
a < b, etc.)Basic command is the CREATE TABLE statement.
TYPE can be INTEGER, DECIMAL, CHAR, VARCHAR, DATE, TIMESTAMP and more
Our beloved penguins:
Compare:
Rows: 344
Columns: 8
$ species <fct> Adelie, Adelie, Adelie, Adelie, Adelie, Adelie, Adelie, Ad…
$ island <fct> Torgersen, Torgersen, Torgersen, Torgersen, Torgersen, Tor…
$ bill_len <dbl> 39.1, 39.5, 40.3, NA, 36.7, 39.3, 38.9, 39.2, 34.1, 42.0, …
$ bill_dep <dbl> 18.7, 17.4, 18.0, NA, 19.3, 20.6, 17.8, 19.6, 18.1, 20.2, …
$ flipper_len <int> 181, 186, 195, NA, 193, 190, 181, 195, 193, 190, 186, 180,…
$ body_mass <int> 3750, 3800, 3250, NA, 3450, 3650, 3625, 4675, 3475, 4250, …
$ sex <fct> male, female, female, NA, female, male, female, male, NA, …
$ year <int> 2007, 2007, 2007, 2007, 2007, 2007, 2007, 2007, 2007, 2007…
We can add a constraint so the year is 2007-2009:
Constraints for categorical variables (R factors) can be tricky. Create a normalized set of tables:
| Sex ID | Name |
|---|---|
| 1 | Female |
| 2 | Male |
| Penguin ID | Sex ID | Weight |
|---|---|---|
| 1 | 1 | 4000 |
| 2 | 1 | 3500 |
| 3 | 2 | 5000 |
We want to make sure each Sex ID in table 2 matches a row of the first table
Constraints for categorical variables (R factors) can be tricky. Create a normalized set of tables:
Normalization:
More JOIN work in SQL than in dplyr
Proper table design is a science and an art
Data Manipulation Language is for DB users.
dplyrvs.
SELECT * means all columns which matches R’s default
Data Manipulation Language is for DB users.
dplyrvs.
Data Manipulation Language is for DB users.
dplyrvs.
Note that
and
are both
Why?
They commute – same result in either order
R is an imperative paradigm - you say what to doSQL is a declarative paradigm - you say what you wantDeclarative paradigms let the server figure out the best way to compute, but are more restrictive
For a well-defined paradigm like SQL, much work has gone into query optimization
vs.
summarize() is implicit in SELECT clause when we have a GROUP BY
But:
is
Why not WHERE again?
In SQL: - WHERE happens before a GROUP BY; and - HAVING happens after a GROUP BY
In R: - Ordering of summarize() and filter controls execution order
Most RDMSs provide a set of four ACID guarantees:
Easiest RDMS to use is SQLite
sqlite3 or sqlite3.exeR with RSQLite packageCan use dbplyr (extra b!) to put dplyr-UX on DB
E.g.
# Source: table<`penguins`> [?? x 8]
# Database: sqlite 3.50.4 [:memory:]
species island bill_len bill_dep flipper_len body_mass sex year
<chr> <chr> <dbl> <dbl> <int> <int> <chr> <int>
1 Adelie Torgersen 39.1 18.7 181 3750 male 2007
2 Adelie Torgersen 39.5 17.4 186 3800 female 2007
3 Adelie Torgersen 40.3 18 195 3250 female 2007
4 Adelie Torgersen NA NA NA NA <NA> 2007
5 Adelie Torgersen 36.7 19.3 193 3450 female 2007
6 Adelie Torgersen 39.3 20.6 190 3650 male 2007
7 Adelie Torgersen 38.9 17.8 181 3625 female 2007
8 Adelie Torgersen 39.2 19.6 195 4675 male 2007
9 Adelie Torgersen 34.1 18.1 193 3475 <NA> 2007
10 Adelie Torgersen 42 20.2 190 4250 <NA> 2007
# ℹ more rows
<SQL>
SELECT `species`, AVG(`body_mass`) AS `mean_mass`
FROM `penguins`
GROUP BY `species`
Extra step - show_query to see SQL or collect() to run in DB:
# A tibble: 3 × 2
species mean_mass
<chr> <dbl>
1 Adelie 3701.
2 Chinstrap 3733.
3 Gentoo 5076.
<SQL>
SELECT `species`, AVG(`body_mass`) AS `mean_mass`
FROM (
SELECT `penguins`.*
FROM `penguins`
WHERE (`body_mass` > 4000.0)
) AS `q01`
GROUP BY `species`
Compared to:
<SQL>
SELECT `species`, AVG(`body_mass`) AS `mean_mass`
FROM `penguins`
GROUP BY `species`
HAVING (AVG(`body_mass`) > 4000.0)
Not the best SQL, but RDMS can make quick work of it
A more substantial example:
A more substantial example:
CREATE TABLE `AllstarFull` (
`playerID` TEXT,
`yearID` INTEGER,
`gameNum` INTEGER,
`gameID` TEXT,
`teamID` TEXT,
`lgID` TEXT,
`GP` INTEGER,
`startingPos` TEXT
);
CREATE INDEX `AllstarFull_playerID` ON `AllstarFull` (`playerID`);
CREATE INDEX `AllstarFull_yearID` ON `AllstarFull` (`yearID`);
CREATE INDEX `AllstarFull_gameID` ON `AllstarFull` (`gameID`);
CREATE INDEX `AllstarFull_teamID` ON `AllstarFull` (`teamID`);
CREATE INDEX `AllstarFull_lgID` ON `AllstarFull` (`lgID`);
CREATE TABLE sqlite_stat1(tbl,idx,stat);
CREATE TABLE sqlite_stat4(tbl,idx,neq,nlt,ndlt,sample);
CREATE TABLE `Appearances` (
`yearID` INTEGER,
`teamID` TEXT,
`lgID` TEXT,
`playerID` TEXT,
`G_all` INTEGER,
`GS` INTEGER,
`G_batting` INTEGER,
`G_defense` INTEGER,
`G_p` INTEGER,
`G_c` INTEGER,
`G_1b` INTEGER,
`G_2b` INTEGER,
`G_3b` INTEGER,
`G_ss` INTEGER,
`G_lf` INTEGER,
`G_cf` INTEGER,
`G_rf` INTEGER,
`G_of` INTEGER,
`G_dh` INTEGER,
`G_ph` INTEGER,
`G_pr` INTEGER
);
CREATE INDEX `Appearances_yearID` ON `Appearances` (`yearID`);
CREATE INDEX `Appearances_teamID` ON `Appearances` (`teamID`);
CREATE INDEX `Appearances_lgID` ON `Appearances` (`lgID`);
CREATE INDEX `Appearances_playerID` ON `Appearances` (`playerID`);
CREATE TABLE `AwardsManagers` (
`playerID` TEXT,
`awardID` TEXT,
`yearID` INTEGER,
`lgID` TEXT,
`tie` TEXT,
`notes` TEXT
);
CREATE INDEX `AwardsManagers_playerID` ON `AwardsManagers` (`playerID`);
CREATE INDEX `AwardsManagers_awardID` ON `AwardsManagers` (`awardID`);
CREATE INDEX `AwardsManagers_yearID` ON `AwardsManagers` (`yearID`);
CREATE INDEX `AwardsManagers_lgID` ON `AwardsManagers` (`lgID`);
CREATE TABLE `AwardsPlayers` (
`playerID` TEXT,
`awardID` TEXT,
`yearID` INTEGER,
`lgID` TEXT,
`tie` TEXT,
`notes` TEXT
);
CREATE INDEX `AwardsPlayers_playerID` ON `AwardsPlayers` (`playerID`);
CREATE INDEX `AwardsPlayers_awardID` ON `AwardsPlayers` (`awardID`);
CREATE INDEX `AwardsPlayers_yearID` ON `AwardsPlayers` (`yearID`);
CREATE INDEX `AwardsPlayers_lgID` ON `AwardsPlayers` (`lgID`);
CREATE TABLE `AwardsShareManagers` (
`awardID` TEXT,
`yearID` INTEGER,
`lgID` TEXT,
`playerID` TEXT,
`pointsWon` INTEGER,
`pointsMax` INTEGER,
`votesFirst` INTEGER
);
CREATE INDEX `AwardsShareManagers_awardID` ON `AwardsShareManagers` (`awardID`);
CREATE INDEX `AwardsShareManagers_yearID` ON `AwardsShareManagers` (`yearID`);
CREATE INDEX `AwardsShareManagers_lgID` ON `AwardsShareManagers` (`lgID`);
CREATE INDEX `AwardsShareManagers_playerID` ON `AwardsShareManagers` (`playerID`);
CREATE TABLE `AwardsSharePlayers` (
`awardID` TEXT,
`yearID` INTEGER,
`lgID` TEXT,
`playerID` TEXT,
`pointsWon` INTEGER,
`pointsMax` INTEGER,
`votesFirst` INTEGER
);
CREATE INDEX `AwardsSharePlayers_awardID` ON `AwardsSharePlayers` (`awardID`);
CREATE INDEX `AwardsSharePlayers_yearID` ON `AwardsSharePlayers` (`yearID`);
CREATE INDEX `AwardsSharePlayers_lgID` ON `AwardsSharePlayers` (`lgID`);
CREATE INDEX `AwardsSharePlayers_playerID` ON `AwardsSharePlayers` (`playerID`);
CREATE TABLE `Batting` (
`playerID` TEXT,
`yearID` INTEGER,
`stint` INTEGER,
`teamID` TEXT,
`lgID` TEXT,
`G` INTEGER,
`AB` INTEGER,
`R` INTEGER,
`H` INTEGER,
`X2B` INTEGER,
`X3B` INTEGER,
`HR` INTEGER,
`RBI` INTEGER,
`SB` INTEGER,
`CS` INTEGER,
`BB` INTEGER,
`SO` INTEGER,
`IBB` INTEGER,
`HBP` INTEGER,
`SH` INTEGER,
`SF` INTEGER,
`GIDP` INTEGER
);
CREATE INDEX `Batting_playerID` ON `Batting` (`playerID`);
CREATE INDEX `Batting_yearID` ON `Batting` (`yearID`);
CREATE INDEX `Batting_teamID` ON `Batting` (`teamID`);
CREATE INDEX `Batting_lgID` ON `Batting` (`lgID`);
CREATE TABLE `BattingPost` (
`yearID` INTEGER,
`round` TEXT,
`playerID` TEXT,
`teamID` TEXT,
`lgID` TEXT,
`G` INTEGER,
`AB` INTEGER,
`R` INTEGER,
`H` INTEGER,
`X2B` INTEGER,
`X3B` INTEGER,
`HR` INTEGER,
`RBI` INTEGER,
`SB` INTEGER,
`CS` INTEGER,
`BB` INTEGER,
`SO` INTEGER,
`IBB` INTEGER,
`HBP` INTEGER,
`SH` INTEGER,
`SF` INTEGER,
`GIDP` INTEGER
);
CREATE INDEX `BattingPost_yearID` ON `BattingPost` (`yearID`);
CREATE INDEX `BattingPost_playerID` ON `BattingPost` (`playerID`);
CREATE INDEX `BattingPost_teamID` ON `BattingPost` (`teamID`);
CREATE INDEX `BattingPost_lgID` ON `BattingPost` (`lgID`);
CREATE TABLE `CollegePlaying` (
`playerID` TEXT,
`schoolID` TEXT,
`yearID` INTEGER
);
CREATE INDEX `CollegePlaying_playerID` ON `CollegePlaying` (`playerID`);
CREATE INDEX `CollegePlaying_schoolID` ON `CollegePlaying` (`schoolID`);
CREATE INDEX `CollegePlaying_yearID` ON `CollegePlaying` (`yearID`);
CREATE TABLE `Fielding` (
`playerID` TEXT,
`yearID` INTEGER,
`stint` INTEGER,
`teamID` TEXT,
`lgID` TEXT,
`POS` TEXT,
`G` INTEGER,
`GS` INTEGER,
`InnOuts` INTEGER,
`PO` INTEGER,
`A` INTEGER,
`E` INTEGER,
`DP` INTEGER,
`PB` INTEGER,
`WP` INTEGER,
`SB` INTEGER,
`CS` INTEGER,
`ZR` INTEGER
);
CREATE INDEX `Fielding_playerID` ON `Fielding` (`playerID`);
CREATE INDEX `Fielding_yearID` ON `Fielding` (`yearID`);
CREATE INDEX `Fielding_teamID` ON `Fielding` (`teamID`);
CREATE INDEX `Fielding_lgID` ON `Fielding` (`lgID`);
CREATE TABLE `FieldingOF` (
`playerID` TEXT,
`yearID` INTEGER,
`stint` INTEGER,
`Glf` INTEGER,
`Gcf` INTEGER,
`Grf` INTEGER
);
CREATE INDEX `FieldingOF_playerID` ON `FieldingOF` (`playerID`);
CREATE INDEX `FieldingOF_yearID` ON `FieldingOF` (`yearID`);
CREATE TABLE `FieldingOFsplit` (
`playerID` TEXT,
`yearID` INTEGER,
`stint` INTEGER,
`teamID` TEXT,
`lgID` TEXT,
`POS` TEXT,
`G` INTEGER,
`GS` INTEGER,
`InnOuts` INTEGER,
`PO` INTEGER,
`A` INTEGER,
`E` INTEGER,
`DP` INTEGER,
`PB` INTEGER,
`WP` INTEGER,
`SB` INTEGER,
`CS` INTEGER,
`ZR` INTEGER
);
CREATE INDEX `FieldingOFsplit_playerID` ON `FieldingOFsplit` (`playerID`);
CREATE INDEX `FieldingOFsplit_yearID` ON `FieldingOFsplit` (`yearID`);
CREATE INDEX `FieldingOFsplit_teamID` ON `FieldingOFsplit` (`teamID`);
CREATE INDEX `FieldingOFsplit_lgID` ON `FieldingOFsplit` (`lgID`);
CREATE TABLE `FieldingPost` (
`playerID` TEXT,
`yearID` INTEGER,
`teamID` TEXT,
`lgID` TEXT,
`round` TEXT,
`POS` TEXT,
`G` INTEGER,
`GS` INTEGER,
`InnOuts` INTEGER,
`PO` INTEGER,
`A` INTEGER,
`E` INTEGER,
`DP` INTEGER,
`TP` INTEGER,
`PB` INTEGER,
`SB` INTEGER,
`CS` INTEGER
);
CREATE INDEX `FieldingPost_playerID` ON `FieldingPost` (`playerID`);
CREATE INDEX `FieldingPost_yearID` ON `FieldingPost` (`yearID`);
CREATE INDEX `FieldingPost_teamID` ON `FieldingPost` (`teamID`);
CREATE INDEX `FieldingPost_lgID` ON `FieldingPost` (`lgID`);
CREATE TABLE `HallOfFame` (
`playerID` TEXT,
`yearID` INTEGER,
`votedBy` TEXT,
`ballots` REAL,
`needed` REAL,
`votes` REAL,
`inducted` TEXT,
`category` TEXT,
`needed_note` TEXT
);
CREATE INDEX `HallOfFame_playerID` ON `HallOfFame` (`playerID`);
CREATE INDEX `HallOfFame_yearID` ON `HallOfFame` (`yearID`);
CREATE TABLE `HomeGames` (
`year.key` INTEGER,
`league.key` TEXT,
`team.key` TEXT,
`park.key` TEXT,
`span.first` TEXT,
`span.last` TEXT,
`games` INTEGER,
`openings` INTEGER,
`attendance` INTEGER
);
CREATE TABLE `LahmanData` (
`file` TEXT,
`class` TEXT,
`nobs` REAL,
`nvar` REAL,
`title` TEXT
);
CREATE TABLE `Managers` (
`playerID` TEXT,
`yearID` INTEGER,
`teamID` TEXT,
`lgID` TEXT,
`inseason` INTEGER,
`G` INTEGER,
`W` INTEGER,
`L` INTEGER,
`rank` INTEGER,
`plyrMgr` TEXT
);
CREATE INDEX `Managers_playerID` ON `Managers` (`playerID`);
CREATE INDEX `Managers_yearID` ON `Managers` (`yearID`);
CREATE INDEX `Managers_teamID` ON `Managers` (`teamID`);
CREATE INDEX `Managers_lgID` ON `Managers` (`lgID`);
CREATE TABLE `ManagersHalf` (
`playerID` TEXT,
`yearID` INTEGER,
`teamID` TEXT,
`lgID` TEXT,
`inseason` INTEGER,
`half` INTEGER,
`G` INTEGER,
`W` INTEGER,
`L` INTEGER,
`rank` INTEGER
);
CREATE INDEX `ManagersHalf_playerID` ON `ManagersHalf` (`playerID`);
CREATE INDEX `ManagersHalf_yearID` ON `ManagersHalf` (`yearID`);
CREATE INDEX `ManagersHalf_teamID` ON `ManagersHalf` (`teamID`);
CREATE INDEX `ManagersHalf_lgID` ON `ManagersHalf` (`lgID`);
CREATE TABLE `Parks` (
`park.alias` TEXT,
`park.key` TEXT,
`park.name` TEXT,
`city` TEXT,
`state` TEXT,
`country` TEXT
);
CREATE TABLE `People` (
`playerID` TEXT,
`birthYear` INTEGER,
`birthMonth` INTEGER,
`birthDay` INTEGER,
`birthCity` TEXT,
`birthCountry` TEXT,
`birthState` TEXT,
`deathYear` INTEGER,
`deathMonth` INTEGER,
`deathDay` INTEGER,
`deathCountry` TEXT,
`deathState` TEXT,
`deathCity` TEXT,
`nameFirst` TEXT,
`nameLast` TEXT,
`nameGiven` TEXT,
`weight` INTEGER,
`height` INTEGER,
`bats` TEXT,
`throws` TEXT,
`debut` TEXT,
`bbrefID` TEXT,
`finalGame` TEXT,
`retroID` TEXT,
`deathDate` REAL,
`birthDate` REAL
);
CREATE INDEX `People_playerID` ON `People` (`playerID`);
CREATE INDEX `People_bbrefID` ON `People` (`bbrefID`);
CREATE INDEX `People_retroID` ON `People` (`retroID`);
CREATE TABLE `Pitching` (
`playerID` TEXT,
`yearID` INTEGER,
`stint` INTEGER,
`teamID` TEXT,
`lgID` TEXT,
`W` INTEGER,
`L` INTEGER,
`G` INTEGER,
`GS` INTEGER,
`CG` INTEGER,
`SHO` INTEGER,
`SV` INTEGER,
`IPouts` INTEGER,
`H` INTEGER,
`ER` INTEGER,
`HR` INTEGER,
`BB` INTEGER,
`SO` INTEGER,
`BAOpp` REAL,
`ERA` REAL,
`IBB` INTEGER,
`WP` INTEGER,
`HBP` INTEGER,
`BK` INTEGER,
`BFP` INTEGER,
`GF` INTEGER,
`R` INTEGER,
`SH` INTEGER,
`SF` INTEGER,
`GIDP` INTEGER
);
CREATE INDEX `Pitching_playerID` ON `Pitching` (`playerID`);
CREATE INDEX `Pitching_yearID` ON `Pitching` (`yearID`);
CREATE INDEX `Pitching_teamID` ON `Pitching` (`teamID`);
CREATE INDEX `Pitching_lgID` ON `Pitching` (`lgID`);
CREATE TABLE `PitchingPost` (
`playerID` TEXT,
`yearID` INTEGER,
`round` TEXT,
`teamID` TEXT,
`lgID` TEXT,
`W` INTEGER,
`L` INTEGER,
`G` INTEGER,
`GS` INTEGER,
`CG` INTEGER,
`SHO` INTEGER,
`SV` INTEGER,
`IPouts` INTEGER,
`H` INTEGER,
`ER` INTEGER,
`HR` INTEGER,
`BB` INTEGER,
`SO` INTEGER,
`BAOpp` REAL,
`ERA` REAL,
`IBB` INTEGER,
`WP` INTEGER,
`HBP` INTEGER,
`BK` INTEGER,
`BFP` INTEGER,
`GF` INTEGER,
`R` INTEGER,
`SH` INTEGER,
`SF` INTEGER,
`GIDP` INTEGER
);
CREATE INDEX `PitchingPost_playerID` ON `PitchingPost` (`playerID`);
CREATE INDEX `PitchingPost_yearID` ON `PitchingPost` (`yearID`);
CREATE INDEX `PitchingPost_teamID` ON `PitchingPost` (`teamID`);
CREATE INDEX `PitchingPost_lgID` ON `PitchingPost` (`lgID`);
CREATE TABLE `Salaries` (
`yearID` INTEGER,
`teamID` TEXT,
`lgID` TEXT,
`playerID` TEXT,
`salary` INTEGER
);
CREATE INDEX `Salaries_yearID` ON `Salaries` (`yearID`);
CREATE INDEX `Salaries_teamID` ON `Salaries` (`teamID`);
CREATE INDEX `Salaries_lgID` ON `Salaries` (`lgID`);
CREATE INDEX `Salaries_playerID` ON `Salaries` (`playerID`);
CREATE TABLE `Schools` (
`schoolID` TEXT,
`name_full` TEXT,
`city` TEXT,
`state` TEXT,
`country` TEXT
);
CREATE INDEX `Schools_schoolID` ON `Schools` (`schoolID`);
CREATE TABLE `SeriesPost` (
`yearID` INTEGER,
`round` TEXT,
`teamIDwinner` TEXT,
`lgIDwinner` TEXT,
`teamIDloser` TEXT,
`lgIDloser` TEXT,
`wins` INTEGER,
`losses` INTEGER,
`ties` INTEGER
);
CREATE INDEX `SeriesPost_yearID` ON `SeriesPost` (`yearID`);
CREATE TABLE `Teams` (
`yearID` INTEGER,
`lgID` TEXT,
`teamID` TEXT,
`franchID` TEXT,
`divID` TEXT,
`Rank` INTEGER,
`G` INTEGER,
`Ghome` INTEGER,
`W` INTEGER,
`L` INTEGER,
`DivWin` TEXT,
`WCWin` TEXT,
`LgWin` TEXT,
`WSWin` TEXT,
`R` INTEGER,
`AB` INTEGER,
`H` INTEGER,
`X2B` INTEGER,
`X3B` INTEGER,
`HR` INTEGER,
`BB` INTEGER,
`SO` INTEGER,
`SB` INTEGER,
`CS` INTEGER,
`HBP` INTEGER,
`SF` INTEGER,
`RA` INTEGER,
`ER` INTEGER,
`ERA` REAL,
`CG` INTEGER,
`SHO` INTEGER,
`SV` INTEGER,
`IPouts` INTEGER,
`HA` INTEGER,
`HRA` INTEGER,
`BBA` INTEGER,
`SOA` INTEGER,
`E` INTEGER,
`DP` INTEGER,
`FP` REAL,
`name` TEXT,
`park` TEXT,
`attendance` INTEGER,
`BPF` INTEGER,
`PPF` INTEGER,
`teamIDBR` TEXT,
`teamIDlahman45` TEXT,
`teamIDretro` TEXT
);
CREATE INDEX `Teams_yearID` ON `Teams` (`yearID`);
CREATE INDEX `Teams_lgID` ON `Teams` (`lgID`);
CREATE INDEX `Teams_teamID` ON `Teams` (`teamID`);
CREATE INDEX `Teams_franchID` ON `Teams` (`franchID`);
CREATE INDEX `Teams_divID` ON `Teams` (`divID`);
CREATE TABLE `TeamsFranchises` (
`franchID` TEXT,
`franchName` TEXT,
`active` TEXT,
`NAassoc` TEXT
);
CREATE INDEX `TeamsFranchises_franchID` ON `TeamsFranchises` (`franchID`);
CREATE TABLE `TeamsHalf` (
`yearID` INTEGER,
`lgID` TEXT,
`teamID` TEXT,
`Half` INTEGER,
`divID` TEXT,
`DivWin` TEXT,
`Rank` INTEGER,
`G` INTEGER,
`W` INTEGER,
`L` INTEGER
);
CREATE INDEX `TeamsHalf_yearID` ON `TeamsHalf` (`yearID`);
CREATE INDEX `TeamsHalf_lgID` ON `TeamsHalf` (`lgID`);
CREATE INDEX `TeamsHalf_teamID` ON `TeamsHalf` (`teamID`);
CREATE INDEX `TeamsHalf_divID` ON `TeamsHalf` (`divID`);
List the 5 modern-era managers with the highest winning percentages. (Define modern-era as managed over 1000 games post 1920.)
<SQL>
SELECT `nameGiven` || ' ' || `nameLast` AS `name`, `W`, `L`, `G`, `record`
FROM (
SELECT `q01`.*, RANK() OVER (ORDER BY `record` DESC) AS `col01`
FROM (
SELECT
`LHS`.*,
`birthYear`,
`birthMonth`,
`birthDay`,
`birthCity`,
`birthCountry`,
`birthState`,
`deathYear`,
`deathMonth`,
`deathDay`,
`deathCountry`,
`deathState`,
`deathCity`,
`nameFirst`,
`nameLast`,
`nameGiven`,
`weight`,
`height`,
`bats`,
`throws`,
`debut`,
`bbrefID`,
`finalGame`,
`retroID`,
`deathDate`,
`birthDate`
FROM (
SELECT `q01`.*, `W` / (`G` + 0.0) AS `record`
FROM (
SELECT `q01`.*, `W` + `L` AS `G`
FROM (
SELECT `playerID`, SUM(`W`) AS `W`, SUM(`L`) AS `L`
FROM (
SELECT `Managers`.*
FROM `Managers`
WHERE (`yearID` >= 1920.0)
) AS `q01`
GROUP BY `playerID`
) AS `q01`
) AS `q01`
WHERE (`G` > 1000.0)
) AS `LHS`
INNER JOIN `People`
ON (`LHS`.`playerID` = `People`.`playerID`)
) AS `q01`
) AS `q01`
WHERE (`col01` <= 5)
Add |> explain() to see the DB internal query plan:
<SQL>
SELECT `nameGiven` || ' ' || `nameLast` AS `name`, `W`, `L`, `G`, `record`
FROM (
SELECT `q01`.*, RANK() OVER (ORDER BY `record` DESC) AS `col01`
FROM (
SELECT
`LHS`.*,
`birthYear`,
`birthMonth`,
`birthDay`,
`birthCity`,
`birthCountry`,
`birthState`,
`deathYear`,
`deathMonth`,
`deathDay`,
`deathCountry`,
`deathState`,
`deathCity`,
`nameFirst`,
`nameLast`,
`nameGiven`,
`weight`,
`height`,
`bats`,
`throws`,
`debut`,
`bbrefID`,
`finalGame`,
`retroID`,
`deathDate`,
`birthDate`
FROM (
SELECT `q01`.*, `W` / (`G` + 0.0) AS `record`
FROM (
SELECT `q01`.*, `W` + `L` AS `G`
FROM (
SELECT `playerID`, SUM(`W`) AS `W`, SUM(`L`) AS `L`
FROM (
SELECT `Managers`.*
FROM `Managers`
WHERE (`yearID` >= 1920.0)
) AS `q01`
GROUP BY `playerID`
) AS `q01`
) AS `q01`
WHERE (`G` > 1000.0)
) AS `LHS`
INNER JOIN `People`
ON (`LHS`.`playerID` = `People`.`playerID`)
) AS `q01`
) AS `q01`
WHERE (`col01` <= 5)
<PLAN>
id parent notused detail
1 2 0 0 CO-ROUTINE q01
2 5 2 0 CO-ROUTINE (subquery-8)
3 7 5 0 CO-ROUTINE q01
4 14 7 142 SCAN Managers USING INDEX Managers_playerID
5 55 5 82 SCAN q01
6 62 5 42 SEARCH People USING INDEX People_playerID (playerID=?)
7 120 5 0 USE TEMP B-TREE FOR ORDER BY
8 163 2 81 SCAN (subquery-8)
9 285 0 81 SCAN q01
Beyond this class to interpret, but in general use of INDEX yields good performance. Take DB classes for much more
List the 5 modern-era managers with the highest winning percentages. (Define modern-era as managed over 1000 games post 1920.)
# A tibble: 5 × 5
name W L G record
<chr> <int> <int> <int> <dbl>
1 David Ray Roberts 851 507 1358 0.627
2 Joseph Vincent McCarthy 2125 1333 3458 0.615
3 Miller James Huggins 927 597 1524 0.608
4 William Harold Southworth 1044 704 1748 0.597
5 Aaron John Boone 603 428 1031 0.585