Final Project Introduction

This final project, which is worth 45% of your final grade, is your opportunity to test the progress you’ve made as an R programmer. Once you’ve completed it, you’ll be able to say that you have experience in the following:

Your task for the final project is to construct a small analysis of a real-world dataset relevant to economics, finance, or business. The main deliverable will be one or more R scripts (you are more than welcome to have, for example, a get_data.R and a separate analyze_data.R) which accomplish all of the following tasks:

  1. Read in data from the internet
    • Anyone should be able to run your code to load the data, without needing any data files you provide. If you have manually-collected data, I can show you how to host it publicly.
    • Please see this list of example data sources for some ideas to get you started
  2. Preprocess those data so they are ready for analysis
    • Put the data into a data.frame, data.table, or matrix
    • Ensure that all column types are correct
    • Check for missing values and do something to handle them
    • Any other preprocessing or transformations your project requires
  3. Conduct a statistical analysis with the data. Examples include:
    • Estimating one or more regressions to examine relationships in the data
    • Using summary statistics like grouped means to compare different subsets of data
    • Using hypothesis tests like the t-test and chi-squared test to make assertions about similarities and differences between variables
    • Training a classifier that can be used to make predictions on new, incoming data
    • Anything else that interests you
  4. Create at least two visualizations to communicate results. Examples include:
    • Histograms/densities to compare probability distributions
    • Line plots to compare progression of variables over time
    • Tables of regression results (formatted with {stargazer}, perhaps) showing comparisons across models
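To make steps 1 to 3 concrete, here is a minimal sketch in base R. It is only an illustration under some assumptions, not a template you must follow: the built-in airquality dataset is written to a temporary CSV as an offline stand-in for a file hosted on the internet, and every object name (data_source, air_clean, and so on) is made up for this example. In your own project, data_source would be a public URL, and you would swap in packages from the package list where they help.

```r
# Step 1: read in the data.
# Offline stand-in: write a built-in dataset to a temporary CSV so this sketch
# runs anywhere. In a real project, `data_source` would be a public URL string,
# which read.csv() accepts in the same way as a file path.
data_source <- tempfile(fileext = ".csv")
utils::write.csv(datasets::airquality, data_source, row.names = FALSE)
air <- utils::read.csv(data_source, stringsAsFactors = FALSE)

# Step 2: preprocess.
# Inspect the column types, then store Month as a labelled factor
# because it is a category here, not a quantity.
str(air)
air$Month <- factor(air$Month, levels = 5:9, labels = month.abb[5:9])

# Count missing values per column, then drop incomplete rows.
# Dropping rows is only one option; another project might impute instead.
missing_counts <- colSums(is.na(air))
print(missing_counts)
air_clean <- air[stats::complete.cases(air), ]

# Step 3: analyze.
# Grouped means: average ozone reading in each month.
ozone_by_month <- stats::aggregate(Ozone ~ Month, data = air_clean, FUN = mean)
print(ozone_by_month)

# Regression: ozone as a function of temperature and wind speed.
ozone_model <- stats::lm(Ozone ~ Temp + Wind, data = air_clean)
print(summary(ozone_model))

# Hypothesis test: do July and September differ in mean ozone?
# droplevels() leaves exactly two groups, which the t-test formula requires.
two_months <- droplevels(air_clean[air_clean$Month %in% c("Jul", "Sep"), ])
ozone_test <- stats::t.test(Ozone ~ Month, data = two_months)
print(ozone_test)
```

Counting the missing values before handling them gives you something specific to say in your report about how much data was lost.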

While it is certainly possible to complete the tasks listed above with the base functionality of R, you will likely find that using external libraries will make your code more powerful and expressive. The ability to use external packages is another key learning outcome for the course. For the remainder of this assignment, “the package list” refers to this package list I’ve hosted on GitHub. In this assignment, you must use at least one external package from each of the three sections in the package list (Data Retrieval and Transformation; Math and Statistics; Visualization, Presentation, and Reporting).

You are, of course, welcome to use any other packages you deem appropriate in addition to this minimum requirement.

See the following sections for submission details on each of the four parts of this assignment.


Part 1: Project Proposal

Description

The first deliverable for this project is a 1-2 page written report detailing your plans for the final project. This is not meant to lock you into a particular data set, set of packages, or approach…all those details can change between this report and your final project submission in Week 5. It is just meant to get you thinking about the project and ensure that you’re on the right track.

This report should answer the following items:

Submission

This component of the project is due prior to our Week 4 session.

Please submit your report (as a Word doc, PDF, or HTML doc) to the “Final Project Proposal” dropbox on D2L.


Part 2: R Script(s)

Description

This code is the main deliverable for the class…it is worth 20% of your final grade. This is the part of the assignment where you get to show off what you’ve learned! Your goal is to create one or more R scripts that meet the requirements listed in the project description above.

Submission

This component of the project is due prior to our Week 5 session.

Please submit your code to the “Final Project (script + report)” dropbox on D2L. If you use multiple scripts (totally acceptable!), please include an additional script called “build.R” which specifies the order to run the scripts in.
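If you are unsure what a build.R might contain, here is one possible sketch. It assumes the two example script names mentioned earlier (get_data.R and analyze_data.R); the helper run_scripts() and the variable script_order are hypothetical names invented for this illustration, and your own build.R can be simpler.

```r
# build.R: run the project's scripts in the intended order.
# These file names are the hypothetical examples from the assignment text;
# replace them with the names of your own scripts.
script_order <- c("get_data.R", "analyze_data.R")

# Run each script in sequence and stop with a clear message if one is missing.
run_scripts <- function(scripts, dir = ".") {
  for (script in scripts) {
    path <- file.path(dir, script)
    if (!file.exists(path)) {
      stop("Missing script: ", path)
    }
    message("Running ", path)
    source(path)
  }
  invisible(scripts)
}

# Only run when every script is present in the working directory.
if (all(file.exists(script_order))) {
  run_scripts(script_order)
}
```

A plain sequence of source() calls, one per script, would communicate the same thing; the loop just adds a friendlier error if a file is missing.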

Your script will be scored out of 100 points, using the following rubric:

Grade Item Total Possible Points
All-or-Nothing Grade Items
  1. All code runs without error (with no reliance on local data) 25
  2. Uses at least one Data Retrieval and Transformation package 4
  3. Uses at least one Math and Statistics package 4
  4. Uses at least one Visualization, Presentation, and Reporting package 4
  5. Every external package used is imported with library() 3
  6. All library() calls are at the top of submitted script(s) 1
  7. Code does not call install.packages(), install_github(), etc. 4
Code Quality Items
  1. Code commenting 5
    (4-5) Clearly commented
    (1-3) Minimal comments, code is hard to understand
    (0) No comments
  2. Code Organization 15
    (12-15) Well-organized, intuitive flow
    (8-11) Difficult to understand without comments
    (0-7) Takes significant effort to understand
  3. Use of External Functions 15
    (11-15) All/most external functions are called with ::
    (6-10) Some external function calls use ::
    (0-5) Unclear use of external functions
  4. Problem Solving 20
    (15-20) Excellent problem decomposition, R solution
    (8-14) Good solution, meets minimum requirements
    (0-7) Solution is significantly incomplete

There are definitely good and bad ways to comment code. For some tips, see this code-commenting tutorial I really like.
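As a rough illustration of rubric items 5 to 7 and the “Use of External Functions” item, the sketch below puts every library() call at the top, never installs anything, and calls functions with ::. Note the assumption: stats and utils ship with R and are used here only as stand-ins so the example runs anywhere. In your submission, the same pattern would apply to the external packages you pick from the package list. The object names are invented for this example.

```r
# All library() calls go at the top of the script, one per package used.
# `stats` and `utils` ship with R and stand in here for external packages.
library(stats)
library(utils)

# No install.packages() calls anywhere: installing packages is left to
# whoever runs the script.

# Keep a small sample so the example output stays short.
cars_data <- utils::head(datasets::mtcars, n = 20)

# Namespace-qualified calls (package::function) show where each function
# comes from, even in a script that loads many packages.
mpg_model <- stats::lm(mpg ~ wt, data = cars_data)

# Pull out the estimated change in miles per gallon per 1,000 lbs of weight.
wt_slope <- stats::coef(mpg_model)[["wt"]]
print(wt_slope)
```

The comments above try to explain why each step exists rather than restating what the code literally does, which is usually the more helpful kind of comment.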


Part 3: Code Review

In professional data science teams, it is common for team members to present their work in internal “code reviews”, small meetings where a data scientist shares brief background on the problem he/she sought to solve and then invites criticism of his/her code.

This can be a nerve-wracking experience in the context of a new job (trust me), so I’d like to give you the opportunity to practice sharing code with others in the safe setting of this introductory course. In the code review component of this project, you will do a 5-10 minute live presentation of your final project.

Submission (in-person class)

Code reviews will be done in-class during the Week 5 session.

You do not need to present slides or turn in anything on D2L.

When presentations begin, a Google Doc will be shared with the class.

While you present, everyone in the class will have this doc up on their machines and use it to give you comments and questions. The benefit of this practice, in professional settings, is that days after your code review you’ll have a written record of your audience’s feedback.

Description and Rubric

Be prepared to show your code + report in front of the class.

Your code review should consist of the following:

Your presentation will be scored out of 100 points, using the following rubric:

Grade Item Total Possible Points
All-or-Nothing Grade Items
  1. Shows at least one data visualization 10
  2. Describes at least one learning or piece of advice for classmates 10
Other Presentation Items
  1. Problem Introduction 10
    (5-10) introduction is clear and concise
    (1-4) introduction is confusing or rambling
    (0) no introduction of the problem
  2. Explanation of the dataset 30
    (20-30) explains source and real-world meaning of the data
    (10-19) literal description with no connection to problem
    (0-9) inaccurate or confusing explanation
  3. Explanation of the Code 40
    (30-40) clear explanation of how the code solves the problem
    (10-29) literal description of the code as-is
    (0-9) inaccurate or confusing explanation

Avoid these common code review issues:

You’ll do great!


Part 4: Written Report

Description

The final deliverable for this project is a written report with your findings. This should be a 2-4 page “executive briefing”, the type that you would write if you were doing this analysis for a consulting client.

Your report should focus on the problem, not the specifics of the code.

It should describe the problem, a brief summary of the work that was done (including the data that was used), and the result.

The problem should be stated as a falsifiable research question.

Bad

Everyone knows inflation is a big problem. I looked at how high inflation is.

Good

This project explored whether the relationship between observed inflation and consumer sentiment in the U.S. changed in the period 2021-2023.

Your report should NOT include any raw R code, but it should include the output of the visualization step of your script (i.e. a table, plot, or other viz).
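One way to get visualizations out of your script and into the report is to write each plot to an image file. The sketch below is a base R illustration under stated assumptions: it uses the built-in airquality dataset, a temporary output folder, and a hypothetical helper called open_device(). It writes PNG files when the local R build supports them and falls back to PDF otherwise.

```r
# Hypothetical output folder; a temporary directory keeps this sketch tidy.
output_dir <- file.path(tempdir(), "figures")
dir.create(output_dir, showWarnings = FALSE, recursive = TRUE)

# Use PNG when this R build supports it; otherwise fall back to PDF.
use_png <- capabilities("png")[[1]]

# Open a graphics device for a figure and return the path it writes to.
open_device <- function(name) {
  if (use_png) {
    path <- file.path(output_dir, paste0(name, ".png"))
    grDevices::png(path, width = 800, height = 600)
  } else {
    path <- file.path(output_dir, paste0(name, ".pdf"))
    grDevices::pdf(path, width = 8, height = 6)
  }
  path
}

air <- datasets::airquality

# Figure 1: histogram of daily temperatures.
hist_path <- open_device("temperature_histogram")
graphics::hist(air$Temp,
               main = "Daily temperature, May to September",
               xlab = "Temperature (degrees F)")
grDevices::dev.off()

# Figure 2: line plot of temperature across the observation period.
line_path <- open_device("temperature_over_time")
graphics::plot(seq_len(nrow(air)), air$Temp, type = "l",
               main = "Temperature over time",
               xlab = "Day of observation period",
               ylab = "Temperature (degrees F)")
grDevices::dev.off()

print(c(hist_path, line_path))
```

The saved files can then be placed in a Word, PDF, or HTML report without showing any of the code that produced them.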

This report should be free from statistical jargon, or such jargon should be clearly and concisely explained in the report.

Bad

I used a Breusch-Pagan test to check for heteroskedasticity, and it had a Chi-squared statistic of 46.98 with a p-value of 0.00715.

Good

I observed larger model errors for higher-value stocks, so I’m not confident that these results would hold for a portfolio with a different mix of company sizes.

For example, none of the following phrases should be used (unless they’re clearly explained in jargon-free language):

Submission

This component of the project is due prior to our Week 5 session. Please upload your report to the “Final Project (script + report)” dropbox on D2L.

Your report will be scored out of 100 points, using the following rubric:

Grade Item Total Possible Points
All-or-Nothing Grade Items
  1. Report does not contain any raw R code 5
  2. Report contains data visualizations created by the code 10
Other Report Items
  1. Problem Explanation 10
    (5-10) introduction is clear and concise
    (1-4) introduction is not clear
    (0) no introduction of the problem
  2. Explanation of where the data came from 10
    (8-10) clear explanation
    (0-7) unclear or inaccurate explanation
  3. Explanation of what the dataset contains 20
    (15-20) clear, executive-level explanation
    (10-14) overly-technical description or somewhat unclear or inaccurate
    (0-9) very unclear, inaccurate, or confusing explanation or no explanation
  4. Explanation of the result 25
    (20-25) result of the project clearly expressed in business terms
    (10-19) overly-technical explanation of the result
    (0-9) inaccurate, confusing, or missing explanation of the result
  5. Grammar and Formatting 20
    (18-20) at most two minor grammar and spelling issues
    (8-17) some grammar and spelling issues
    (0-7) many grammar and spelling issues