STA 9750 - Final Project

In lieu of exams, STA 9750 has an end-of-semester project, worth 40% of your final grade. This project is intended to showcase the data analysis technologies covered in this course.

The project will be graded out of a total of 400 points, divided as follows:

  • Project Proposal Presentation: 50 points
  • Mid-Semester Check-In Presentation: 50 points
  • Final Presentation: 100 points
  • Final Summary Report (group): 75 points
  • Final Individual Report: 100 points
  • Final Peer Evaluations: 25 points

Projects can be completed in groups of 4-6 students.[1] All group members are responsible for all portions of the work and will receive the same grade, except on the individual evaluation.

Group Membership: By 2025-09-30 at 11:59pm ET, email the instructor with a list of group members, cc-ing all group members. Once the original email is sent, other group members must reply acknowledging their intention to work with this group. After this date, group membership may only be changed for extraordinary circumstances.

Feel free to work with students in the other section in forming your teams. If your team has a mix of sections, please inform the instructor in the team registration email of the dates on which you will make your proposal, check-in, and final presentations.

As you form your team, you may optionally construct a Work Plan Agreement for your team and register it with the instructor. If you choose to do this, please include:

  1. Names and a short biography of all teammates. Be sure to include your data science background (if any), prior education, your career trajectory up to this point, your career goals for your time here at Baruch, and one additional ‘fun fact’ about yourself.
  2. Preferred Means and Timing of Communication. How does your team plan on meeting (Zoom, asynchronous chat, discussion boards…) and how often? Ideally, you will establish at least one synchronous meeting a week and a means of asynchronous communication (WeChat, Email, Discord, …).
  3. Workload Expectations. How much work will each teammate be able to contribute and when? Some of you may have more or less time at different points in the term and it will help your teammates to know when you may be less available. Ideally, you will agree on a schedule of weekly ‘internal deadlines’ to keep yourselves on target.
  4. Accountability Mechanisms. How do you intend to ensure that all teammates are staying on task and fulfilling their responsibilities to the larger group? There is room for creativity and flexibility here, but you may agree to ‘self-impose’ sanctions up to partial loss of credit to penalize students who fail to perform at group expectations. (This is a rather drastic mechanism and you should agree on some forms of accountability that can be invoked before grade penalties.)

The instructor is willing to implement agreed accountability mechanisms, but will not referee disputes within teams as to accountability. For example, a team may agree that a delinquent member will have their grade lowered by a pre-determined amount upon a unanimous vote of the rest of the team. The instructor will apply this penalty, but will not evaluate whether it is properly deserved. Teams will not be allowed to add or remove members except in truly exceptional circumstances, so having some sort of enforcement mechanism is an uncomfortable, but effective, means of ensuring all team members stay engaged throughout the semester.

Note that a Work Plan Agreement is not required, but this type of structure has helped teams remain focused and organized in the past.

Project Proposal Presentations

On Tuesday October 07, 2025 and Thursday October 09, 2025, project teams will present a 6-minute project proposal in the form of an in-class presentation. This presentation must cover:

  • The animating or overarching question (OQ) of the project: e.g., “Is MTA subway and bus service less reliable in poorer areas of NYC?”

  • Public data sources you intend to analyze (at least three): e.g., MTA on-time arrival statistics by station and household income data by ZIP code.[2] The presentation should include a brief discussion of what relevant data elements these sources provide; note that this discussion should be selective and informative, not simply a list of column names. You do not have to wind up using all of these: I just want to see that you have several potential avenues to be successful.

  • Specific questions (SQs) you hope to answer in your analysis: e.g.

    1. average arrival delay by station;
    2. average household income for families nearest to each station; and
    3. which routes are busiest at which time of day.

    There should be (at least) one specific question per group member. (This forms part of your individual evaluation.) Regardless of the size of your group, there must be at least three specific questions.

  • Rough analytical plan: e.g., we plan to divide NY up into regions based on nearest subway station and to compute average household income within those regions; we will correlate those income statistics with average MTA service delays in the direction of work-day travel; finally, we will use arrival data to identify portions of the MTA network where delays tend to occur.

  • Anticipated challenges: e.g. how to disentangle the effect of being further away from the central business district (and delays accumulating) from specific socioeconomic effects; or lack of historical data going back further than 1 year.

  • List of team members

This presentation will be graded out of 50 points, divided as:

  • Quality of presentation (10 points): Are the slides clearly designed, making effective use of visual elements? Does the oral presentation supplement the slides, or does it simply read text off the slides?

  • Clarity of motivating question (15 points): Is the key question clearly stated and fully motivated? Is the question designed with sufficient specificity to both:

    1. be feasible within the project scope; and
    2. be of genuine interest?
  • Quality of proposed data sources (5 points): are the data sources proposed sufficient to answer the question?

  • Quality of specific questions (10 points): how well do the specific questions support the motivating question? Do they take full advantage of the proposed data sources?

  • Timing of presentation (10 points): Does the project proposal actually take 6 minutes (not going over!)? Presentations that are too short (less than 5.5 minutes) or too long (more than 6.5 minutes) will be penalized in proportion to the violation of the 6-minute limit.

Points will be roughly awarded according to the following rubric:

Quality of Presentation
  • Needs Improvement (0-2/10): Weak presentation, evidencing little preparation. Fails to discuss all required elements.
  • Poor (3-4/10): Weak presentation, but covers all required elements, at least nominally.
  • Fair (5-6/10): Moderate presentation quality; slides have either too much or too little text.
  • Good (7-8/10): Presentation clearly addresses all required elements. Slides have a good balance of text and images.
  • Great (9-10/10): Excellent, compelling, and dynamic presentation covering all required elements. May include preliminary results.

Clarity of Motivating Question
  • Needs Improvement (0-3/15): Project domain and motivating question are not well identified.
  • Poor (4-6/15): Project lacks sufficient motivation. Domain identified, but question needs further refinement. (For example, “We want to do something about X.”)
  • Fair (7-9/15): Motivating question not well-formed or not suitable for quantitative analysis.
  • Good (10-12/15): Good, well-motivated question. Project will answer some, but not all, important questions in the domain.
  • Great (13-15/15): Excellent motivating question: strong motivation, suitability for quantitative analysis, and high potential impact.

Quality of Proposed Data Sources
  • Needs Improvement (1/5): Data sources not clearly identified or inappropriate to the questions asked.
  • Poor (2/5): Data sources clearly identified, but not well-suited to the questions.
  • Fair (3/5): Data sources well-suited to the question, but of questionable quality and reliability.
  • Good (4/5): Quality, relevant data sources; no concerns about usability for the project.
  • Great (5/5): Excellent data sources identified: well-targeted to the question and not previously analyzed extensively.

Quality of Specific Questions
  • Needs Improvement (1-2/10): Presentation does not clearly state a sufficient number of distinct specific questions (at least one per group member).
  • Poor (3-4/10): Questions are poorly structured, lacking clear connections to the motivating question and/or to the project data.
  • Fair (5-6/10): Specific questions are acceptable, but do not fully address the animating question. Questions are somewhat repetitive.
  • Good (7-8/10): Specific questions are well-designed and fully support the motivating question, but are not well-separated and/or may be difficult to address with the data sources.
  • Great (9-10/10): Specific questions are well-designed and fully support the motivating question. Each question is clearly distinct and can be addressed with the data sources.

Timing of Presentation
  • Needs Improvement (1-2/10): Presentation lasted more than 8:00 or less than 4:45.
  • Poor (3-4/10): Presentation took between 7:30 and 8:00 or between 4:45 and 5:00.
  • Fair (5-6/10): Presentation took between 7:00 and 7:30 or between 5:00 and 5:15.
  • Good (7-8/10): Presentation took between 6:30 and 7:00 or between 5:15 and 5:30.
  • Great (9-10/10): Presentation took between 5:30 and 6:30.

At this point, only the Overarching Question and team roster are locked. If you discover alternate data sources, better specific questions, or superior analytical strategies that help better address the OQ, you may (and should!) change your project plan.

In the interest of time, it is not required that all team members present.

You may find it helpful to think of your team as a set of consultants hired to take on a project for a non-technical customer. The customer will have a vague and likely qualitative question (the OQ) that they seek to answer. Once your team of consultants has been engaged, you divide the OQ into distinct, actionable work for each team member, centered around an SQ. The SQs are set by your team based on how you think the OQ can best be answered. As you form your SQs, make sure that they are both doable and, if successful, sufficient to answer the OQ. Then, as you progress through the project, you can think of:

  • Project Proposal: The ‘sales pitch’ of your team hoping to get hired to answer an OQ. At this point, you’re more trying to indicate to your client that you understand their OQ, why it is important, and have initial plans on how best to answer it.
  • Mid-Term Check-In: At this point, your consultants have started the project in earnest. You are sharing your work to date with the client, updating them on your progress and any challenges encountered along the way.
  • Final Presentation: This is the final presentation to your client and their organization covering how you answered their OQ and giving the highlights of your work. This is a presentation to the entire organization and should be non-technical / accessible to everyone.
  • Group Final Report: This is the ‘Executive Summary’ prepared and shared with the client. You expect it to be sent to the highest levels of their organization and should focus on conveying the highlights and limitations of your analysis with minimal jargon.
  • Individual Final Reports: These are the technical appendices of your work, where you describe what you did for each SQ in detail. Your client may not engage with this material at first, but you share it with them so they know that they got their money’s worth and to give them a resource to apply and extend your work in the future.

Note that while a client might not read your technical report closely, I absolutely will, so it requires just as much polish as everything else you submit.

Mid-Semester Check-In Presentations

On Thursday November 06, 2025 and Tuesday November 11, 2025, project teams will present a 6-minute check-in in the form of an in-class presentation. This presentation must cover:

  • The Overarching Question of the project.

  • Public data sources you are in the process of analyzing (at least two). At this point, the description of the used data sources should include critical evaluation of both data quality and data relevance to the overarching question. In particular, teams should be able to describe relevant challenges and how, if possible, the team overcame those challenges.

    I recommend structuring your evaluation of data sources around two separate concerns: the quality of a data source and its suitability. By quality, you essentially seek to answer “how well does this data set do what it claims to do?” In assessing suitability, you ask “how well does this data set do what I need it to do?”

    Quality can be further broken into two sub-parts: recording quality and sampling quality. Recording quality examines the actual values in the data set: are there lots of missing data? Are there significant numbers of outliers? Are the measurements accurate, or do we have to worry about issues with the tools used to capture values? For text / qualitative responses, are the answers easy to use and standardized, or are they messy? Are sensitive quantities presented in some sort of privacy-respecting manner (e.g., bucketing or rounding)?

    Concepts like ‘sampling error’ or ‘margin of error’ may be helpful here, but they won’t capture everything. There’s no precise measure of recording quality, and it’s certainly impossible to list all the ways in which data might exhibit sub-optimal recording, but issues usually manifest quite quickly when you start creating exploratory data analysis (EDA) plots of the data.
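
    For instance, a few lines of R often surface recording problems immediately. Here is a minimal sketch, using hypothetical stand-in data (names like mta_delays and delay_minutes are placeholders, not a real data set):

    ```r
    # Recording-quality checks on hypothetical stand-in data;
    # a real project would substitute its own data frame here.
    library(dplyr)
    library(ggplot2)

    mta_delays <- data.frame(
      station       = sample(LETTERS[1:5], 500, replace = TRUE),
      delay_minutes = c(rexp(490, rate = 1 / 4), rep(NA, 5), rep(900, 5))
    )

    # What fraction of each column is missing?
    mta_delays |> summarise(across(everything(), ~ mean(is.na(.x))))

    # Do any recorded values look physically implausible?
    ggplot(mta_delays, aes(x = delay_minutes)) +
      geom_histogram(bins = 50)
    ```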

    To assess sampling quality, ask whether the data has the right rows (observations). Are the data comprehensive of the entire population of interest? If they are not comprehensive, are they a representative sample of the population of interest? If the data has missing values, are they truly random or are they perhaps correlated with other quantities of interest? (E.g., you might have less accurate information about the personal finances of the ultra-wealthy because taxable income does not represent their total financial resources.) As with recording quality, the questions here don’t have specific metrics or checklists, but you want to simply ask yourself whether the data can represent what it claims to represent accurately, fairly, and completely.
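
    One concrete version of this check is to compare group shares in your sample against known population shares, e.g., with a goodness-of-fit test. In the sketch below, both the sample and the “census” shares are hypothetical placeholders:

    ```r
    # Representativeness check on hypothetical data: do sampled borough
    # shares match (made-up) population shares?
    sampled_boroughs <- sample(
      c("Bronx", "Brooklyn", "Manhattan", "Queens"),
      size = 400, replace = TRUE,
      prob = c(0.15, 0.35, 0.25, 0.25)
    )

    population_shares <- c(Bronx = 0.17, Brooklyn = 0.31,
                           Manhattan = 0.19, Queens = 0.33)

    observed <- table(factor(sampled_boroughs, levels = names(population_shares)))
    chisq.test(observed, p = population_shares)  # small p-value suggests non-representative sampling
    ```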

    Suitability is a distinct concept from quality. While quality can be assessed by looking solely at the data collection and recording mechanisms, suitability also requires us to think about what we intend to use the data for. The most complete, most accurate data in the world can only take us so far (or perhaps nowhere at all) if it’s off topic. To assess suitability, think about what the data can do and what you need it to do. To the extent that the answers to these questions are not one-and-the-same, you have issues of suitability you need to plan to address.

    When working with free data, there are almost always issues of suitability. After all, it’s incredibly unlikely that someone purely randomly collected a perfect data set for your project and then made it freely available. Don’t fret: this is just a fact of life! Your analysis can still have great value for your clients. You simply need to consider the mismatch between your data and your questions and to think about how that mismatch will limit the applicability of your findings. In essence, even if you do the best you can, what caveats or footnotes do you need to add to your final report?

    We often talk about “communicating uncertainty” as a key skill in statistics. While things like p-values and margins of error are important, the implications of quality and suitability are often much more relevant here. The biggest data sets in the world will drive all your standard errors to zero, but if your data has systematic sampling errors, you still need to communicate those to your clients. Your findings will be a bit uncertain because you couldn’t quite measure what you needed to.

    You can (and should!) try to drive this mismatch to be as small as possible, but ultimately the question is the question and the data are the data: it is more responsible and more ethical to honestly communicate the limitations of your work rather than pretending those limits don’t exist. If your client finds value in your work and respects your discussion of limits, you always have the option of discussing next steps with them: how can they give you more data or adjust their question to the data at hand? Maybe they can give you some extra funding to collect data that is more closely aligned with their goals.

  • Specific questions you hope to answer in your analysis. At this point, each SQ should be assigned to a single team member. While the SQs work together to answer the overarching question, they should also be sufficiently distinct to allow individual evaluation.

  • Relevant prior art: what prior work has been done on this topic? How does the project complement and contrast with this work?

    For your project to be “worth doing”, it needs to be novel in some way. You are not required to have a high degree of novelty, but your work should at least be distinguishable. Novelty may include differentiation like:

    1. an analysis done for LA now being done for NYC;
    2. a pre-COVID analysis being repeated post-COVID; and
    3. using a new or updated data source to see if you can reproduce the same phenomenon.
  • Anticipated challenges: e.g. how to disentangle the effect of being further away from the central business district (and delays accumulating) from specific socioeconomic effects; or lack of historical data going back further than 1 year.

This presentation will be graded out of 50 points, divided as:

  • Quality of presentation (10 points): Are the slides clearly designed, making effective use of visual elements? Does the oral presentation supplement the slides, or does it simply read text off the slides?
  • Initial analysis of proposed data sources (15 points): are the data sources proposed sufficient to answer the question? Has the team begun to analyze the existing data in an exploratory fashion, determining the degree to which it is comprehensive (representing an unbiased, and ideally full, sample of a relevant population) and internally consistent (are the data well recorded or do they have tell-tale signs of inaccuracy)?
  • Quality of specific questions (10 points): How well do the SQs support the OQ? Do they take full advantage of the proposed data sources? If all SQs are answered, would they form a coherent answer to the OQ?
  • Engagement with Relevant Literature (10 points): how well does the team ground their project in relevant academic publications and/or reputable news media reports?
  • Timing of presentation (5 points): Does the check-in actually take 6 minutes (not going over!)? Presentations that are too short (less than 5.5 minutes) or too long (more than 6.5 minutes) will be penalized in proportion to the violation of the 6-minute limit.

Points will be roughly awarded according to the following rubric:

Quality of Presentation
  • Needs Improvement (0-2/10): Weak presentation, evidencing little preparation. Fails to discuss all required elements.
  • Poor (3-4/10): Weak presentation, but covers all required elements, at least nominally.
  • Fair (5-6/10): Moderate presentation quality; slides have either too much or too little text.
  • Good (7-8/10): Presentation clearly addresses all required elements. Slides have a good balance of text and images.
  • Great (9-10/10): Excellent, compelling, and dynamic presentation covering all required elements.

Initial Analysis of Proposed Data Sources
  • Needs Improvement (0-3/15): Minimal analysis of data sources. Team only provides a cursory data description (sources and dimensions) with no discussion of quality.
  • Poor (4-6/15): Poor analysis of data sources: team describes data providers and evaluates quality OR demonstrates initial quality checks.
  • Fair (7-9/15): Fair analysis of data sources: team describes data providers and evaluates quality AND demonstrates initial quality checks.
  • Good (10-12/15): Team assesses data quality thoroughly, evaluating both sampling and recording quality. Potential issues (if any) are identified.
  • Great (13-15/15): Excellent analysis of data: full assessment of both sampling and recording quality, identification of possible issues, and a clear plan to remediate / supplement data sources in order to complete the analysis.

Quality of Specific Questions
  • Needs Improvement (1-2/10): Presentation does not clearly state a sufficient number of distinct specific questions (at least one per group member).
  • Poor (3-4/10): Questions are poorly structured, lacking clear connections to the motivating question and/or to the project data.
  • Fair (5-6/10): Specific questions are acceptable, but do not fully address the animating question. Questions are somewhat repetitive.
  • Good (7-8/10): Specific questions are well-designed and fully support the motivating question, but are not well-separated and/or may be difficult to address with the data sources.
  • Great (9-10/10): Specific questions are well-designed and fully support the motivating question. Each question is clearly distinct and can be addressed with the data sources.

Engagement with Relevant Literature
  • Needs Improvement (1-2/10): Presentation does not engage with relevant literature.
  • Poor (3-4/10): Engagement with prior literature is poor: citations only, without comparison to the proposed project.
  • Fair (5-6/10): Acceptable engagement with prior literature: presentation compares proposed work with prior art.
  • Good (7-8/10): Good engagement with prior work: team is able to compare and contrast with prior work.
  • Great (9-10/10): Excellent engagement with prior literature: team uses its review of existing work to tailor the project to fill a “gap” in the literature.

Timing of Presentation
  • Needs Improvement (1/5): Presentation lasted more than 8:00 or less than 4:45.
  • Poor (2/5): Presentation took between 7:30 and 8:00 or between 4:45 and 5:00.
  • Fair (3/5): Presentation took between 7:00 and 7:30 or between 5:00 and 5:15.
  • Good (4/5): Presentation took between 6:30 and 7:00 or between 5:15 and 5:30.
  • Great (5/5): Presentation took between 5:30 and 6:30.

At this point, both the OQ and SQs should be essentially “locked.” While you may adjust the SQs between now and the final report, you will be asked to justify deviation.

In the interest of time, it is not required that all team members present.

Final Presentations

On Tuesday December 09, 2025 and Thursday December 11, 2025, student teams will present a 10-minute final presentation describing their project. This presentation must cover:

  • The OQ of the project: this is essentially a restatement from the prior presentations, though it may be refined in light of work performed.
  • Prior art
  • Data sources used: if you changed data, or used additional data, explain what motivated the change from your original plan. Describe any difficulties you encountered in working with this data.
  • Specific analytical questions (and answers) supporting the animating question. Describe the major analytical stages of your project and summarize the results.
  • Summary of overall findings: relate your SQs to your OQ; describe any limitations of the approach used.
  • Proposals for future work: if this work could be continued beyond the end of the semester, what additional steps would you suggest to a client / boss?

All team members must present part of this presentation and each team member must present on their specific question.

This presentation will be graded out of 100 points, divided as:

  • Quality of presentation (20 points): Are the slides clearly designed to make use of attractive and effective visual elements? Does the oral presentation supplement the slides, or does it simply read text off the slides?
  • Relationship of OQ and SQs (10 points): Are the specific questions well-suited for the motivating question? Does the team address limitations of their analysis? Does the motivating question lead naturally to the specific questions?
  • Discussion of data sources (20 points): How well does the team describe the data used for the analysis (its size, structure, and provenance) and why it is suitable for their motivating question?
  • Communication of findings (25 points): are the visualizations in the presentation effective at communicating statistical findings? Does the team effectively communicate limitations and uncertainties of their approach?
  • Contextualization of project (15 points): is the project well situated in the existing literature? Are the findings of the specific questions well integrated to answer the overarching question?
  • Timing of presentation (10 points): Does the final presentation actually take 10 minutes (not going over!)? Presentations that are too short (less than 9.5 minutes) or too long (more than 10.5 minutes) will be penalized.

Points will be roughly awarded according to the following rubric:

Quality of Presentation
  • Needs Improvement (0-4/20): Weak presentation, evidencing little preparation. Fails to discuss all required elements.
  • Poor (5-8/20): Weak presentation, but covers all required elements, at least nominally.
  • Fair (9-12/20): Moderate presentation quality; slides have either too much or too little text.
  • Good (13-16/20): Presentation clearly addresses all required elements. Slides have a good balance of text and images.
  • Great (17-20/20): Excellent, compelling, and dynamic presentation covering all required elements.

Relationship of Motivating and Specific Questions
  • Needs Improvement (0-2/10): Specific questions poorly address the motivating question.
  • Poor (3-4/10): Specific questions give some insight into the motivating question but leave major factors unaddressed.
  • Fair (5-6/10): Specific questions give real insight into the motivating question but may leave minor factors unaddressed.
  • Good (7-8/10): Specific questions fully address the motivating question and deliver meaningful insights.
  • Great (9-10/10): Specific questions impressively address the motivating question and deliver novel and meaningful insights. Additionally, evidence is provided supporting the idea that the specific questions considered are indeed the most important and best questions that could be used to support the motivating question. (I.e., you don’t just find some factors that matter; you find the most important factors.)

Discussion of Data Sources
  • Needs Improvement (1-4/20): Data sources are of limited relevance or have meaningful and obvious quality issues. Data structure and provenance are not discussed.
  • Poor (5-8/20): Data sources used are relevant to the problem, but the team does not ensure quality. Presentation includes only cursory discussion of data structure OR provenance.
  • Fair (9-12/20): Data sources used are relevant to the problem, but the team performs only cursory analysis to ensure quality. Presentation includes discussion of data structure OR provenance.
  • Good (13-16/20): Data sources used are relevant to the problem; the team performs detailed analysis of sampling and recording quality but fails to address data limitations. Presentation includes discussion of data structure AND provenance.
  • Great (17-20/20): Data sources used are relevant to the problem; the team performs detailed analysis of sampling and recording quality and actively addresses any limitations. Presentation includes discussion of data structure AND provenance.

Communication of Findings
  • Needs Improvement (1-5/25): Poor communication of findings. Visualizations and tables are of rough quality and are not ‘publication ready’, instead remaining close to software defaults. Verbal discussion of methodology and data is muddled or missing. Significant elements missing.
  • Poor (6-10/25): Visualizations and tables evidence attempts at improvement, but still have notable flaws. Script omits discussion of one or more key analytical steps or findings.
  • Fair (11-15/25): Communication includes attractive visualizations and tables, but the script does not successfully communicate key analytical steps and findings.
  • Good (16-20/25): Strong communication throughout, with professional-grade data visualizations and tables. Script highlights key findings and broader implications, but discussion of methodology is limited or confused in parts.
  • Great (21-25/25): Excellent communication of findings. Visualizations, tables, and script convey the essence of sophisticated analyses without getting lost in details. Visualizations are compelling and attractive. Script highlights key findings and clearly connects quantitative findings with qualitative interpretation.

Contextualization of Project
  • Needs Improvement (1-3/15): Project is not well situated in the existing literature.
  • Poor (4-6/15): Project addresses existing literature, but leaves significant related work unaddressed.
  • Fair (7-9/15): Project capably situates itself in the existing literature, but does not actively demonstrate novelty or impact.
  • Good (10-12/15): Project capably situates itself in the existing literature and has a non-trivial degree of novelty, but is not naturally extended beyond the ‘four corners’ of the project.
  • Great (13-15/15): Project capably situates itself in the existing literature and answers a novel question, or an important question in a novel way, that can be used to drive meaningful future research.

Timing of Presentation
  • Needs Improvement (1-2/10): Presentation lasted more than 12:30 or less than 8:30.
  • Poor (3-4/10): Presentation took between 11:30 and 12:30 or between 8:30 and 9:00.
  • Fair (5-6/10): Presentation took between 11:00 and 11:30 or between 9:00 and 9:15.
  • Good (7-8/10): Presentation took between 10:30 and 11:00 or between 9:15 and 9:30.
  • Great (9-10/10): Presentation took between 9:30 and 10:30.

In the past, I have recommended this outline for the final presentation. You are not required to follow this outline, but I provide it in case it may be helpful.

  • Motivation and Context
    • Why should I care about this problem?
    • Hourglass structure: Big picture down to this question
  • Prior Art:
    • How does your work relate to other previous work?
    • What gap are you filling in the literature?
    • What is novel about your study?
  • Overarching Question
    • What is your major question?
  • Discussion of Data Sources
    • What data sources did you use?
    • SWOT each data source:
      • What is good about it (Strengths)?
      • What is bad (Weaknesses)?
      • What novel insights does it let you generate (Opportunities)?
      • What limitations or biases might it induce in your findings (Threats)?
  • Specific Questions
    • What are each of your specific questions?
    • How do they support and tie back to the overarching question?
    • How did you answer each of them? What did you find?
  • Integration of Findings
    • How do quantitative specific findings provide qualitative insights for overarching question?
    • What can you see by combining specific questions that you can’t see from a single specific question?
    • What limits are there to your findings?
  • Overarching Answer
  • Potential Future Work
    • What would you do with more time?
    • What additional data source would you like to access?

Final Summary Report

By the registrar-assigned ‘final exam date’ for this course (tentatively 2025-12-18 at 11:59pm ET), the team will post a summary project report of no more than 2000 words summarizing their findings. This is a “non-technical” document, suitable for someone who cares about the motivating question, not for a data scientist. This document should focus on:

  1. the motivation and importance of the analysis;
  2. briefly, how the specific analyses help to address the motivating question;
  3. the choice of data used, including discussion of any limitations;
  4. visualization of the most important findings;
  5. relation to prior work (“the literature”); and
  6. potential next steps.

Furthermore, this document should link to individual reports (more detail below) which work through the project specific questions in detail. Students are responsible for ensuring stable links between postings throughout the grading window.

This report should be written using Quarto and formatted as a web page, submitted using the same process as the course mini-projects. This document is not required to be a “reproducible research” document since it is “non-technical”. As a general rule, this is a “words and pictures” document (possibly including a few tables), not a “code” document. You are encouraged to re-use material from your final presentation. Students are encouraged to re-use one or two key figures from individual reports in this document; there is no “disadvantage” to not having one of your individual figures used here. It is more important to select the right figures for the report.

For portfolio purposes, students are encouraged to each post a copy of the summary report to their own web presence, though this is not required.

This summary document will be graded out of 75 points, divided as:

  • Clarity of writing and motivation (50 points): is the report written accessibly for a non-technical audience? Is the motivating question well-posed and supported by the specific questions? Do the authors engage with prior work on this topic well?
  • Clarity of visuals (25 points): are visuals chosen to support the overall narrative? Are they “basic static” plots or have the authors gone “above and beyond” in formatting and structure? Do they clearly convey relevant uncertainty and key analytic choices?

Final Individual Report

By the Registrar-assigned ‘final exam date’ for this course (tentatively 2025-12-18 at 11:59pm ET), each team member will post an individual project report of no more than 2000 words summarizing their work on the individual specific question(s) for which they were responsible.[3]

This is a “technical” document and should be structured as a “reproducible research” document, including all code needed to acquire, process, visualize, and analyze data. (Code does not count toward the word count.) This report should be written using Quarto and formatted as a web page, submitted using the same process as the course mini-projects.

Once both the summary and individual reports are submitted, students should open a new GitHub issue, tagging the instructor and linking to both reports using the template below:

Hello @michaelweylandt!

My team is ready to submit our STA 9750 course project reports. You can
find mine at: 

- Summary Report: http://link.to.my/summary_report.html
- Individual Report: http://link.to.my/individual_report.html

Thanks,
@<MY_GITHUB_NAME>

Additionally, each student should submit a PDF copy of both their group report and their individual report via CUNY Brightspace.

The final individual report will be graded out of 100, divided roughly as follows:

  • Code quality (20 points)
  • Data acquisition and processing (20 points)
  • Data analysis (30 points)
  • Communication and presentation of results (30 points)

Note that the individual reports may cross-reference each other and share code (suitably attributed) as appropriate. Students are encouraged to consider this project as a “series of posts” on a suitably technical blog.

The following rubric will be used to assess the final individual report, though the instructor may deviate to recognize exceptional work.

Each rubric element is graded on a scale from D to A, with each level requiring everything from the levels below it.

Code Quality

  • D: The code runs on the instructor’s machine without errors.

  • C: Everything above, and:
    • The code is well-organized, with good variable names and use of functions (subroutines) to avoid repeated code.
    • The code is well-formatted.
    • The code uses comments effectively.
    • Code is written efficiently, making use of R’s vectorized semantics.

  • B: Everything above, and:
    • Code is organized into subroutines that could easily be adapted to support other analyses and are not overly specific to the particular data being analyzed.

  • A: Everything above, and:
    • Code is suitable to be re-used for similar analyses without effort, e.g., by organization into a readily accessible R package.

Data Acquisition and Processing[4]

  • D: The code loads data from a static web-based source.[5]

  • C: Everything above, and:
    • The code uses a dynamic API or basic web-scraping techniques to download data.

  • B: Everything above, and:
    • The code fully prepares and cleans the data, including fully investigating and properly handling any outliers, missing data, or other irregularities. (Note that discarding or Winsorizing is not necessarily ‘properly handling’.)

  • A: Everything above, and:
    • The code acquires data using techniques not presented in class, such as headless browsers, logging into password-protected resources using an httr2 session, or scraping data from non-tabular HTML.
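
For a sense of scale, the baseline level here (a static web-based source; see footnote 5) can be met with a “download, then read” pattern like the following sketch, in which the URL and file name are hypothetical placeholders:

```r
# "Download, then read" pattern described in footnote 5; the URL and
# file name below are hypothetical.
library(readr)

data_url  <- "https://example.com/open-data/ridership.csv"
data_file <- "ridership.csv"

# Re-create the local file so the report remains reproducible
if (!file.exists(data_file)) {
  download.file(data_url, destfile = data_file, mode = "wb")
}

ridership <- read_csv(data_file)
```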

Data Analysis

  • D: The analysis consists primarily of basic descriptive statistics.

  • C: Everything above, and:
    • Advanced descriptive statistics.
    • Basic “pre-packaged” tests used for all inferential statistics.

  • B: Everything above, and:
    • A computer-based inference strategy, such as bootstrapping, permutation testing, or cross-validation, is used.

  • A: Everything above, and:
    • Sophisticated computer-based inference exceeding the techniques presented in class.
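
To make the “computer-based inference” level concrete, here is a minimal bootstrap sketch; the data are simulated stand-ins, and a real analysis would bootstrap a statistic computed from its own data:

```r
# Bootstrap confidence interval for a mean, on simulated stand-in data
set.seed(9750)
delays <- rexp(200, rate = 1 / 5)  # hypothetical observed delays

boot_means <- replicate(
  5000,
  mean(sample(delays, replace = TRUE))
)

quantile(boot_means, c(0.025, 0.975))  # 95% bootstrap CI for the mean delay
```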

Communication and Presentation of Results

  • D: The report consists of:
    • A static Markdown/Quarto document or presentation with basic graphics and tables.
    • “Baseline” graphics that do not adapt the default formatting or styling.

  • C: Everything above, and:
    • Advanced / interactive graphics.
    • “Publication quality” graphics using advanced plotting functionality.

  • B: Everything above, and:
    • A basic interactive dashboard.[6]

  • A: Everything above, and:
    • A fully-interactive “dashboard”-type product that reacts to data in real time and allows for customizable visualization and/or data export.
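
As a rough illustration of the “basic interactive dashboard” level, here is a minimal Shiny sketch built on a hypothetical data frame; footnote 6 discusses options for embedding such an app in your report:

```r
# Minimal Shiny dashboard sketch; delay_data is hypothetical stand-in data
library(shiny)

delay_data <- data.frame(
  borough       = sample(c("Bronx", "Brooklyn", "Manhattan"), 300, replace = TRUE),
  delay_minutes = rexp(300, rate = 1 / 4)
)

ui <- fluidPage(
  selectInput("borough", "Borough", choices = unique(delay_data$borough)),
  plotOutput("delay_plot"),
  downloadButton("export", "Download filtered data")
)

server <- function(input, output) {
  filtered <- reactive(subset(delay_data, borough == input$borough))

  output$delay_plot <- renderPlot(
    hist(filtered()$delay_minutes, main = input$borough, xlab = "Delay (minutes)")
  )

  output$export <- downloadHandler(
    filename = function() "delays.csv",
    content  = function(file) write.csv(filtered(), file, row.names = FALSE)
  )
}

shinyApp(ui, server)
```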

Note that this rubric is intentionally quite difficult. It is designed to be able to properly recognize teams and individuals that go above and beyond in specific elements of the course project. You should not try to do everything listed in this rubric as a pure box-checking exercise: it is better to focus on a few of the challenges most central to your project and to address them well.

Final Peer Evaluations

The final 25 points will be assigned based on anonymous peer feedback collected through Brightspace. In particular, after completing a qualitative quiz designed to assess relative contributions, each teammate will be given 100 “points” to distribute among teammates (not self). Peer feedback scores will be based upon the total amount of “points” received from teammates adjusted by the instructor based on qualitative peer feedback. The Peer Feedback quiz will be made available on Brightspace only after submitting both the individual and group reports. Like the reports, the Peer Feedback quizzes are due on the Registrar-assigned final exam date for this course (tentatively, 2025-12-18 at 11:59pm ET). If the peer feedback quiz is not completed, you will receive 0 points for this component of your final grade, even if you received high scores from your teammates.

Footnotes

  1. If desired, students can work in pairs or even individually. That “team” is still responsible for a minimum of four specific questions, so you will have to do extra work if you have a team of fewer than four people.

  2. More properly, you would want to use ZIP Code Tabulation Areas (ZCTAs) for this sort of analysis. The distinction is subtle, but while all ZCTAs have geographic extents, not all ZIP codes do. For example, there are dedicated ZIP codes for the IRS and the Department of Defense that have no associated geographic boundaries. Most open data sources will omit this distinction, but if you see it, you should be aware of it.

  3. If students choose to take on multiple specific questions (perhaps because they were in a small group or if a classmate had to drop the course), they may submit multiple individual reports (one per question). If doing so, please modify the GitHub message template to link all reports.

  4. Your report must import and analyze a real data set. Hard-coded values will be penalized heavily without specific, highly compelling, and pre-approved justification.

  5. It is insufficient to simply read data from a static file hosted locally. That is, a command like read_csv should be preceded by the code necessary to re-create that file, e.g., download.file or similar. If your data is only available through resources that require manual (browser-based) access patterns, you may ask the instructor for permission to use a local static file. Requests must be made via email at least one week before the final individual report is due.

  6. There are several options for embedding an interactive dashboard in your final submission. The easiest is probably to host your dashboard on https://www.shinyapps.io/ and to either use a link or HTML windowing to put it inside your qmd document. If you want to make everything truly seamless and integrated, you can also use the shinylive framework to run R and your dashboard fully in-browser. To embed the resulting dashboard in your qmd document, you may need to use the shinylive Quarto extension.