Also discuss: Possible specific questions, data sources, analytical plan, anticipated challenges
Peer feedback mechanism TBD
Pre-Assignment #05 FAQs
FAQ: Subqueries
[W]ill we be learning how to perform joins within a subquery?
You don’t need subqueries in R since it’s an imperative language. Just create a new variable to represent the result of the subquery and use that in the next command.
SELECT first_name, last_name
FROM collectors
WHERE id IN (
SELECT collector_id
FROM sales
);
# A tibble: 3 × 4
species female male gender_diff
<fct> <dbl> <dbl> <dbl>
1 Adelie 3369. 4043. 675.
2 Chinstrap 3527. 3939. 412.
3 Gentoo 4680. 5485. 805.
FAQ: dplyr joins vs SQL joins
What is the difference between dplyr and SQL joins?
Not too much - biggest difference is no INDEX or FOREIGN KEY in R so less guarantees of data integrity.
FAQ: When to use anti_join?
Rare: looking for unmatched rows. - Useful to find data integrity issues or ‘implicit’ missingness. - I use an anti_join to find students who haven’t submitted an assignment.
semi_join appears in MP #02.
FAQ: many-to-many Warning
Tricky to address, but fortunately pretty rare.
SQL explicitly forbids many-to-many
Usually a sign that a “key” isn’t really unique
Check for duplicates in x and y tables
Can occur with “fancy” joins (rolling, inequality)
Add additional join variables to break “duplication”
FAQ: How to Check Efficiency?
No automatic way. Some rules of thumb:
Don’t create large tables just to filter down
filter before join when possible
full_outer join is a bit dangerous
cross_join is rarely the right answer
FAQ: tidyr vs dplyr
Is tidyr more efficient than dplyr?
Nope - different packages from the same developers.
Designed to work together elegantly.
FAQ: Rare Joins
Please explain what cross_join, filter joins, and nest_join are?
cross_join: dangerous.
Creates “all pairs” of rows. Useful for ‘design’ problems
filter joins (anti_, semi_):
Hunting down quietly missing data.
Filtering to sub-samples (see MP#02)
nest_join: beyond this course.
left_join with extra structure to output.
FAQ: How to Pick a Join
How do I decide which type of join is most approriate for a given analysis?
Topic of today’s work.
Other Tips
Disable RStudio’s visual Quarto editor. It’s more trouble than it’s worth. To stop it from opening by default, add editor: source in the header of your qmd files.
Quarto depends on file structure for organizing content. The main directory (STA9750-2024-FALL) should hold all of your input files. You should never directly put anything in the docs/ folder. That’s where generated output should live.
When I leave <GITHUB_NAME> or similar in instructions, put in your GitHub ID. (And make sure to remove the < and > symbols)