43 Lab: Ugly charts and Simpson’s paradox

The two data visualizations embedded in this lab violate many data visualization best practices. Improve these visualizations using R and the tips for effective visualizations that we’ve introduced. You should produce one visualization per dataset. Accompany each visualization with a brief paragraph describing the choices you made: what you didn’t like in the original plot and why, and how you addressed those problems in the visualization you created.

The learning goals for this lab are:

  • Telling a story with data
  • Data visualization best practices
  • Reshaping data

Getting started

Go to the course GitHub organization and locate your lab repo. Either fork it or copy it as a template, then clone it in RStudio. Refer to Lab 01 if you would like to see step-by-step instructions for cloning a repo into an RStudio project.

First, open the R Markdown document and Knit it. Make sure it compiles without errors. (Also, remember to check the final version after you upload!)

The output will be in the Markdown (.md) file with the same name.

Housekeeping

Remember: Your email address is the address tied to your GitHub account and your name should be first and last name.

Before we can get started we need to take care of some required housekeeping. Specifically, we need to do some configuration so that RStudio can communicate with GitHub. This requires two pieces of information: your email address and your name.

Run the following (but update it for your name and email!) in the Console to configure git:

library(usethis)
use_git_config(
  user.name = "Your Name",
  user.email = "your.email@address.com"
)

Packages

Run the following code in the Console to load this package.

library(tidyverse)

Take a sad plot and make it better

Fisheries

The Fisheries and Aquaculture Department of the Food and Agriculture Organization of the United Nations (FAO) collects data on the fisheries production of different countries. You can find a list of fishery production for various countries in 2016 on this Wikipedia page. The data includes the tonnage of fish captured and farmed for each country. Note that countries whose total harvest was less than 100,000 tons are excluded from the visualization.

A researcher has shared a visualization they created using these data with you.

  1. Can you help them improve it? First, brainstorm how you would improve it. Then create the improved visualization and document your changes and decisions with bullet points. It’s OK if some of your improvements are aspirational, i.e. you don’t know how to implement them but think they are a good idea. Implement what you can and leave notes identifying the aspirational improvements that could not be made. (You don’t need to recreate their plots in order to improve them.)
fisheries <- read_csv("data/fisheries.csv")
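
Reshaping is one of the learning goals for this lab, and a long layout (one row per country and harvest type) is often easier to map to a fill or facet than one column per type. The sketch below is only one possible approach. It assumes the data has columns named country, capture, and aquaculture (check with names(fisheries)), and it uses a made-up three-row stand-in, fisheries_toy, whose values are placeholders rather than FAO figures.

```r
library(tidyverse)

# Toy stand-in for `fisheries`: the column names are an assumption and
# the values are invented for illustration only (not real FAO data).
fisheries_toy <- tibble(
  country     = c("A", "B", "C"),
  capture     = c(100, 250, 40),
  aquaculture = c(300, 50, 10)
)

# Reshape to one row per country and harvest type, so that `type`
# can be mapped to an aesthetic such as fill in ggplot2.
fisheries_long <- fisheries_toy %>%
  pivot_longer(
    cols = c(capture, aquaculture),
    names_to = "type",
    values_to = "tons"
  )

fisheries_long
```

If the real column names differ, only the cols argument needs to change.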

✅ ⬆️ Commit and push your changes to GitHub with an appropriate commit message. Make sure to commit and push all changed files so that your Git pane is cleared up afterwards.

Stretch Practice with Smokers in Whickham

A study conducted in Whickham, England recorded participants’ age and smoking status at baseline, and then 20 years later, their health outcome was recorded.

Packages

Now, we will work with the mosaicData package.

Because this is the first time we’re using the mosaicData package, make sure to install it first by running install.packages("mosaicData") in the Console.

library(tidyverse)
library(mosaicData)

Note that these packages are also loaded in your R Markdown document.

The data

The data is in the mosaicData package. You can load it with

data(Whickham)

Take a peek at the codebook with

?Whickham
library(performance)
performance::compare_performance()

Exercises

  1. What type of study do you think these data come from: observational or experiment? Why?

  2. How many observations are in this dataset? What does each observation represent?

  3. How many variables are in this dataset? What type of variable is each? Display each variable using an appropriate visualization.

  4. What would you expect the relationship between smoking status and health outcome to be?

  5. Create a visualization depicting the relationship between smoking status and health outcome. Briefly describe the relationship, and evaluate whether this meets your expectations. Additionally, calculate the relevant conditional probabilities to help your narrative. Here is some code to get you started:

Whickham %>%
  count(smoker, outcome)

  6. Create a new variable called age_cat using the following scheme:
  • age <= 44 ~ "18-44"
  • age > 44 & age <= 64 ~ "45-64"
  • age > 64 ~ "65+"
  7. Re-create the visualization depicting the relationship between smoking status and health outcome, faceted by age_cat. What changed? What might explain this change? Extend the contingency table from earlier by breaking it down by age category, and use it to support your narrative. The extended table shows how the relationship between smoking status and health outcome differs across age groups, which helps explain the patterns and changes you see in the visualization.
Whickham %>%
  count(smoker, age_cat, outcome)
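
The binning scheme and the age-specific contingency table above can be sketched as follows. This is one possible way to do it, not the only one: case_when() is chosen here because it mirrors the condition ~ label scheme in the exercise, and the object names whickham_cat and outcome_by_age as well as the prop column are hypothetical names introduced for illustration.

```r
library(tidyverse)
library(mosaicData)

data(Whickham)

# Bin age into the three ranges given in the exercise.
whickham_cat <- Whickham %>%
  mutate(
    age_cat = case_when(
      age <= 44            ~ "18-44",
      age > 44 & age <= 64 ~ "45-64",
      age > 64             ~ "65+"
    )
  )

# Count outcomes, then convert counts to proportions within each
# smoker-by-age group, so `prop` is the conditional probability of
# the outcome given smoking status and age category.
outcome_by_age <- whickham_cat %>%
  count(smoker, age_cat, outcome) %>%
  group_by(smoker, age_cat) %>%
  mutate(prop = n / sum(n)) %>%
  ungroup()

outcome_by_age
```

Dropping age_cat from both count() and group_by() gives the corresponding proportions for the unfaceted comparison.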

Wrapping up

Go back through your write-up to make sure you’re following the coding style guidelines we discussed in class. Make any edits as needed.

Also, make sure all of your R chunks are properly labeled, and your figures are reasonably sized.