| title | Stat 184 Course Project Guidelines |
|---|---|
| date | Spring 2026 |
This repo will serve as a template for your Stat 184 course project, providing you with an initial README file, two CSL files (APA7 and MLA9) for reference citation styles, initial .gitignore and .lintr files, and the project guidelines (this file).
This Project Guidelines file contains all of the guidelines for the project, including key dates and a list of learning outcomes assessed. Further, we've attempted to provide you with some hints and suggestions. Be sure to read through all portions of the Project Guidelines carefully.
The context for this project that you’ve made it through several phases of a hiring process at Big Data, Inc. They are looking for the best statisticians, data scientists, and data analysts to join the firm.
They have hired you as a Probationary Data Scientist and want to see how you work as part of a team and what you can produce. Thus, this project acts as your demonstration of your skills and knowledge.
The purpose of this project is to provide you with an opportunity to put into practice everything that you have learned over the course of the semester and to push yourselves. To this end, you’ll work in teams (2-3 people) to conduct your own Exploratory Data Analysis, posing and answering your own research question(s).
There are multiple key dates that you and your project team are going need to be aware of as you complete this project.
| Date | Notes |
|---|---|
| Sec. 1 & 4: 4/20/26 Sec. 2 & 5: 4/16/26 |
Checkpoint #1: Form Teams and make your team contract (on paper, due by end of class) You must be physically present in class in order to join a team. |
| Sec. 1 & 4: 4/22/26 Sec. 2 & 5: 4/21/26 |
Checkpoint #2: Approval of Initial Data and Plan (in MOM, due by end of class) |
| Sec. 1 & 4: 4/24/26 Sec. 2 & 5: 4/23/26 |
Checkpoint #3: GitHub Repo Check (in MOM, due by end of class) |
| Sec. 1 & 4: 4/24/26 Sec. 2 & 5: 4/23/26 |
Checkpoint #4: Sign up for Work-in-Progress Presentation Slot (in class) |
| 4/27/26--5/1/26 | Checkpoint #5A: Work-in-Progress Presentations and Checkpoint #5B: Peer Feedback Each person MUST be present for all presentations. |
| 5/6/26 Sec. 2: 5/5/26 |
Checkpoint #6: Submission of Final Report (PDF, Canvas) and Link to Repo (submission comment) |
| 5/7/26 Sec. 2: 5/5/26 |
Checkpoint #7: Submission of Self- & Peer Evaluations |
Checkpoints #1-#5 must be completed in person.
The following checklist provides your team with an expanded listing of all elements necessary in your group project. You can keep track of what you have and have yet to complete by editing this file and placing an x inside of the square brackets at the start of each item (i.e., [x]).
- Read through the Project Guidelines.
- Form a team with 1-2 other individuals in your section of Stat 184.
- Complete Checkpoint #1 and make a team contract.
- Create (and update) a detailed plan for your project.
- Place a copy of your plan in your repo.
- Brainstorm what you're interested in exploring and look for potential data sources. (See the Data and Topics for resources.)
- Locate appropriate data sources for your project. See the Data Requirements section for important restrictions.
- Create a repo using the template repo. Name your repo
Sec#_CP_Name1_Name2_Name3whereSec#is your section of Stat184 (i.e.,Sec1,Sec2,Sec4, orSec5) andName#is replaced with the names of each team member.- The owner of the repo should be
Stat184-Spring2026. - Make sure that each team member has access to the repo.
- The owner of the repo should be
- Use GitHub appropriately.
- Have a proper README file for the repo.
- Each member must have at least five (5) commits with appropriate subjects/messages.
- The team should use the Issues feature regularly and appropriately.
- The team should use branches; there should be at least a main branch and one "dev" branch for each team member.
- The team should appropriately use Pull Requests to merge branches; one member should make the Pull Request while another should review and approve the request.
- Read in your data and perform any necessary data tidying, wrangling, and cleaning.
- Conduct Exploratory Data Analysis.
- Set the stage for your analysis (i.e., prepare an introduction on your topic).
- Describe the data provenance; see See the Data Requirements for additional information.
- Explain how the data satisfy (or not) both the FAIR and CARE principles.
- Describe what attributes you'll focus your analysis on (mention if they are part of your data sets or if you created them out of your data sets).
- Create multiple data visualizations (tables and figures) that assist both the team and readers in understanding the data.
- Data visualizations should show a variety of your skills and geometries.
- Each team member should be the main author behind at least one professional looking table (not a raw data table).
- Each team member should be the main author behind at least one professional looking plot/graph. The plots/graphs should not all be the same kind of plot/graph.
- Prepare a reproducible report.
- Use a QMD file; the output type is PDF.
- The report should be well organized with section headings.
- Code should only be found in a Code Appendix at the end for a PDF, not in the body of your report.
- Data visualizations should be appropriately sized--not too small and not too big.
- Figures and Tables should have appropriate captions and appropriately cross-referenced in the body of your report.
- All figures should have Alt Text set.
- There should be narrative text helping readers to better understand what each visualization helps them to learn about the data and context.
- Your report should contain narrative text (beyond explaining tables and figures) that explains the overall data story or context and helps the reader make sense of what is going on. That is, take the reader on a journey.
- You should properly cite any work you reference (including data) according to your choice of citation style. We've included files for APA7 and MLA9 as part of this template. If you want to use a different citation style, you will need download the CSL file from the Zotero Style Respository and include it in your team's repo.
- All code should be written according to a Style Guide of your choice. List this Style Guide as a code comment in your first code chunk.
- All code chunks should have a Code Header that contains who the primary author of the chunk is and who reviewed the chunk.
- Follow and apply the principles and elements of Open Science (and PCIP).
- Include an Author Contribution section at the end of the report, before the Code Appendix.
- Sign up for and complete a Work-in-Progress Presentation (details below)
- Finalize your work and submit your report as a PDF to the appropriate submission portal in Canvas by the deadline.
- As a comment on your submission, include a link to your team's project repo.
- Each team member must complete their Peer & Self Evaluations for working together on this project. This will be an assignment in MyOpenMath and needs to be completed by the deadline.
Your team may work with any data collection you wish, provided you satisfy certain conditions. These conditions include:
- The data may not by any of the sets/collections we've used in the class including any examples, homework, activities, quizzes, or other assignments.
- The data should be real, not fake or synthetic. You may not use genAI tools to create or find data sets. If you are unsure, talk to your instructor.
- Your data sets/collections should be rich for exploration. That is, there should be multiple attributes (i.e., more than 3) and multiple types (e.g. categorical/qualitative and quantitative) for you to explore in your team. We strongly encourage your team to look for multiple data sets/collections that you can link together so that the sets complement/supplement each other.
- You must be able to explain the data provenance as fully as possible for each data set/collection you use. This includes the who, what, when, where, why, and how the data were originally collected.
- You must be able to explain how your data satisfy (or not) both the FAIR and CARE principles.
- Your data may not come from or be found on Kaggle. It is your team's responsibility to complete this check. Projects using data from Kaggle will be treated as non-submissions.
- Your data may not come from another project you are doing or have done for a grade in any other course without prior permission. This also includes any Honors thesis or projects.
- You may only use data that you have a legal right to be using.
The key factor in determining whether a certain data set/collection is allowed is the analysis. We want your project to reflect your thinking and skill sets from this course, not someone else's (including genAI). To this end, you cannot submit data analysis work that you've done on a particular data set for another course as this is a violation of Academic Integrity (i.e., self-plagiarism). Thus, if you want to use a data set/collection you've used in another course, you are going to have to provide clear documentation of what you've done for that other course and ensure that you (or any member of your team) do not duplicate that work for this project.
We do not allow Kaggle data sets/collections to be used for this project for three reasons. First, the data that is shared on Kaggle is not necessarily real data. Second, the data (real or fake) may have been posted without permission or acknowledgement of the original data stewards. Finally, Kaggle data sets/collections are often accompanied by data analysis reports from one or more Kaggle users. This then raises questions about whether we are looking at your work or you copying someone else.
If there are data sets you are interested in using that come from R packages, talk to your instructor before using them. If you or a team member works in a research lab on campus and/or are doing research on campus, you may use that data provided you've been given written permission to use that data from the lab's Principle Investigator (typically, a faculty member in charge).
In the last week of classes (Apr. 27th--May 1st, 2026), each group will give a short/brief presentation to the class (approximately 5 minutes). In the presentation, each team will share
- what they are exploring,
- one (1) insight they have had so far,and
- one (1) challenge they have encountered as they work (if any).
Each member of the team must take part in the presentation and speak. As the name suggests, these presentations are not meant for teams to present polished/completed reports. Rather, the idea is to convey what you're currently working on.
While each team is giving their presentation, all other students in the class will be filling out a simple peer feedback form. Students who do not pay attention to their peers and provide meaningful feedback will receive a lower grade.
There are a couple of pieces of advice that we want to give your team at this time for your project.
- Planning will help your team get and stay organized.
- Communicate! The success of your team will hinge upon how well you communicate with each other.
- Draw upon each other's strengths and have everyone take a look at each other's work.
- Proofread your report before you submit.
- When you're unsure of something or running into a problem the team can't figure out, use your network of peers and your instructor.
Is your team stuck trying to figure out where to look for data or a topic to explore? Check out these resources.
- https://www.icpsr.umich.edu/sites/icpsr/home
- https://www.re3data.org/
- https://openpaymentsdata.cms.gov/
- https://aact.ctti-clinicaltrials.org/
- https://www.esportsearnings.com/
- SCORE Sports Data Repository: https://data.scorenetwork.org/
- CMU Statistics and Data Science Repository: https://cmustatistics.github.io/data-repository/
- The R package
{quantmod}allows you to read Yahoo Finance data. See more at https://www.quantmod.com/examples/intro/ - https://www.data.gov/
- https://github.com/fivethirtyeight/data
- https://github.com/nytimes
- https://github.com/rfordatascience/tidytuesday
- https://www.data-is-plural.com/ A newsletter dedicated to useful/curious data sets (over 400 editions of the newsletter are available)
You can also check with your instructor to see if they might have some data files you might be able to use.
Keep in mind that if any team submits a project that looks too similar to work done by someone else it would be an academic integrity violation.
- How have popular songs changed over the last 5-6 decades? For instance, beats per minute, genres, number of unique words, etc.
- How have food prices changed in the past two decades? For instance, the average cost of vegetables, fruit, and meat, and the year-to-year inflation rates.
- How have the key economic factors changed over the last 5-6 decades? For instance, the unemployment rate, S&P 500 index, CPI, median salary, and Housing price index.
- How has the eSports industry changed over the last decades? For instance, the number of tournaments, the number of people viewing the game, the total prize money, and the total estimated economic income (direct and indirect).
- Do parodies of songs or the originals use more unique words? Does the genre of the original song appear to have an effect?
- Investigate some part(s) of daily life that you can compare and contrast between your teenage years and the time period when your parents were teenagers (e.g., weather, economy, politics).
As a team you can discuss prompts such as the following to help you come up with ideas:
- What shared interests does your team have?
- Do you all like a particular sport?
- Do you like a particular type of music?
- Do you like to play video games?
If your team is completely struck, talk with your instructor.
- The PCIP System
- The Data Wrangling Shiny App
- The ASU Image Accessiblity AI Tool
- The Quarto Crash Course Guide
- PSU Libraries Zotero Guide
- Adopt a coding style guide for your team to follow throughout your files. Here are a few options:
- Make sure to use the
{lintr}and lint your files to ensure that stylistic error get flagged and addressed.- The defaults of the
lintfunction are for the Tidyverse style. - The provided
.lintrfile in this template is configured for BOAST/Tidyverse.
- The defaults of the
- Draw on the Contributor Role Taxonomy to report who did what.
Channel the beliefs and dispositions of EDA in your work here. That is,
- We construct our understandings of the data by investigating broad questions of "What is going on here?" and "What do we have?"
- Data visualizations (tables, plots, graphs) play a central and vital role in making sense of data and the surrounding context.
- Model building/hypothesis generation is an iterative process.
- Use methods that are robust, resistant, smooth, and have breadth.
- Maintain dispositions of skepticism, flexibility, and statistical ecumenism for methods used.
The EDA portion of work is what many people mess up: either because they skip it, don't do a good job, or don't spend the time on it that they should. However, this portion of data analysis is vital for you to truly be able to communicate later types of analysis in coherent and valid ways.
For this project, we recommend that you really focus on exploring the data and propose hypotheses/models for future analysis work.
A common way that Stat 184 students introduce unnecessary problems into their course project is attempting to move beyond EDA and use tools/methods from beyond the course. For example, using different forms of regression, ANOVA, machine learning algorithms (supervised and unsupervised), etc. We do not recommend that you attempt to use these, even if you have seen them in another course. It has been our collective experience over the years that students who attempt to incorporate these techniques into this project 1) never do so correctly/coherently, and 2) end up focusing on those method instead of of the EDA.
Teams will not receive any points/benefit for using methods from beyond Stat 184, however teams can lose points if such methods are used incorrectly, poorly, and/or are explained in less than ideal ways.
-
I want to work alone. Can I do this project by myself?
- No. Statistics and Data Science are deeply collaborative fields and you need to learn how to work effectively in a team setting. As such this project was designed with collaboration and teamwork as a core component.
-
Can we use a past project from another course for this project?
- No. Submitted work that you already received a grade in another course is a violation of the University's Academic Integrity policy (i.e., self-plagiarism).
-
Can we use a project from another (current) course for this project?
- No. There is not enough time for the faculty to meet and negotiate how one project will fully satisfy the requirements for the separate courses.
-
One of the team members works in a research lab. Can we use data from that lab for our project?
- YES! Provided you have permission from the lab's Principle Investigator (PI).
-
Can we use the data from another course's project?
- Maybe. Your team will need to disclose this as part of your data provenance and check with your instructor early on in the process. While you may use the data, you may not use the same analysis as what was used in the other course's project. Be prepared to show the other course's project upon instructor request.
-
Can I use Python (or another language)?
- Not as the majority language. STAT 184 is an R programming course, and the project is intended to evaluate learning objectives of this course so you should mostly be using R and your entire analysis must be self-contained in a single QMD. However, if you want to do something in the project that we have not learned about in class (using R) and prefer to use Python or some other language for that purpose it's fine to include some Python chunks in your QMD file. Try to get your usage of non-R programming languages under 20% (total).
This project is meant to provide one the last opportunities for each student to demonstrate their growth and development. In particular, this project provides us with data on the following learning objectives and outcomes. We will be looking at several items including the body of your rendered output file, the code appendix of your rendered output file, and your GitHub repo (linked in a comment on your submission in Canvas).
- Data Analysis: Students will develop their skills in using statistical software to engage in data analysis.
- DA.5: The student will learn to create data visualizations that support data analysis.
- DA.3: The student will learn to describe the components of data visualizations.
- Communication: Students will develop their communication skills as related to statistical programming and data analysis.
- Comm.3: The student will learn to meaningfully discuss data visualizations (e.g., plots, tables) to support others in their learning about the current context.
- Reproducibility: Students will develop their skills in creating reproducible code and data analyses.
- Repro.6: The student will learn to create reproducible analysis reports.
- Computational Thinking: Students will develop ways of thinking that make use of R’s computing power to solve problems.
- CT.1: The student will learn to use statistical software (R) to solve problems.
- Professionalism: Students will develop their professional identity through self-reflection and working with others.
- Prof.3: The student will demonstrate that they can work/collaborate effectively with others.
- Computational Thinking: Students will develop ways of thinking that make use of R’s computing power to solve problems.
- CT.1: The student will learn to use statistical software (R) to solve problems.
- CT.2: The student will learn to apply the PCIP System to their work.
- CT.4: The student will learn to check code for effectiveness and issues.
- Programming: Students will develop their skills in programming (creating code) with statistical software.
- Prog. 3: The student will learn to apply the core programming principles to their work.
- Prog. 4: The student will learn to write code that works for achieving their goals.
- Prog. 5: The student will learn to organize their code to assist with the code’s readability.
- Prog. 6: The student will learn to implement a coding style for their code.
- Data Analysis: Students will develop their skills in using statistical software to engage in data analysis.
- DA.1: The student will learn to import files into R from a variety of sources and file types.
- DA.2: The student will learn to wrangle (clean, prepare) data for further analysis.
- DA.5: The student will learn to create data visualizations that support data analysis.
- DA.6: The student will learn to generate the values of descriptive statistics that support data analysis.
- DA.7: The student will learn to identify both what makes up a case and the case attributes for different situations.
- Reproducibility: Students will develop their skills in creating reproducible code and data analyses.
- Repro.2: The student will learn to apply the principles of Open Science to their work.
- Repro.3: The student will learn to assess whether data collections meet the FAIR/CARE principles.
- Repro.4: The student will learn to utilize version control tools as a means of creating reproducible and collaborative work.
- Repro.5: The student will learn to create reproducible and reusable code.
- Repro.6: The student will learn to create reproducible analysis reports.
- Communication: Students will develop their communication skills as related to statistical programming and data analysis.
- Comm.1. The student will learn to generate documentation for their code that not only they, but others can use to help make sense of the code.
- Comm.3: The student will learn to meaningfully discuss data visualizations (e.g., plots, tables) to support others in their learning about the current context.
- Professionalism: Students will develop their professional identity through self-reflection and working with others.
- Prof.2: The student will learn to attribute coding and data analysis work to different paradigms/perspectives such as Exploratory vs. Confirmatory Data Analysis, Tidyverse vs. Base R, etc.
- Professionalism: Students will develop their professional identity through self-reflection and working with others.
- Prof.3: The student will demonstrate that they can work/collaborate effectively with others.
- Prof.5: The student will demonstrate that they can keep their (professional) commitments through actions such as attending class (on time), engaging with the class, and submitting work in a timely fashion.
If at any point in time you have questions, uncertainties, doubts, etc. or if you run into problems, errors, mysterious bugs, etc. talk with your instructor. We are here to help and will function as your Technical Supervisor/Mentor to your Probationary Data Scientist status. Make use of our knowledge, wisdom, and skills!