80 Lab: So what if you smoke when pregnant?

Non-parametric-based inference

In 2004, North Carolina released a comprehensive birth record dataset that holds valuable insights for researchers examining the connection between expectant mothers’ habits and practices and their child’s birth. In this lab, we’ll be exploring a randomly selected subset of the data. You’ll learn how to use non-parametric-based inference to analyze the impact of maternal smoking during pregnancy on the weight of the baby. You will also explore the relationships between other variables, such as the mother’s age and the baby’s birth weight. Through the exercises, you will practice data manipulation, visualization, hypothesis testing, and calculating confidence intervals. You can find the lab here

Getting started

Packages

In this lab, we will work with the tidyverse, infer, and openintro packages. We can install and load them with the following code:

library(tidyverse)
library(infer)
library(openintro)

Housekeeping

Git configuration / password caching

Configure your Git user name and email. If you cannot remember the instructions, refer to an earlier lab. Also remember that you can cache your password for a limited amount of time.

Project name

Update the name of your project to match the lab’s title.

Warm up

Before diving into the dataset, let’s warm up with some simple exercises.

YAML

Open the R Markdown (Rmd) file in your project, change the author name to your name, and knit the document. Doing so will allow you to personalize your Rmd file.

Commiting and pushing changes

  • Go to the Git pane in your RStudio.
  • View the Diff and confirm that you are happy with the changes.
  • Add a commit message like “Update team name” in the Commit message box and hit Commit.
  • Click on Push. This will prompt a dialogue box where you first need to enter your user name, and then your password.

Set a seed!

In this lab, we’ll be generating random samples. To make sure our results stay consistent, make sure to set a seed before you dive into the sampling process. Otherwise, the data will change each time you knit your lab. To set your seed, simply find the designated R chunk in your R Markdown file and insert the seed value there.

80.1 The data

Load the ncbirths data from the openintro package:

data(ncbirths)

We have observations on 13 different variables, some categorical and some numerical. The meaning of each variable is provided in the following table.

variable description
fage father’s age in years.
mage mother’s age in years.
mature maturity status of mother.
weeks length of pregnancy in weeks.
premie whether the birth was classified as premature (premie) or full-term.
visits number of hospital visits during pregnancy.
marital whether mother is married or not married at birth.
gained weight gained by mother during pregnancy in pounds.
weight weight of the baby at birth in pounds.
lowbirthweight whether baby was classified as low birthweight (low) or not (not low).
gender gender of the baby, female or male.
habit status of the mother as a nonsmoker or a smoker.
whitemom whether mom is white or not white.

Before analyzing any new dataset, it’s important to get to know your data. Start by summarizing the variables and determining if their data type. Are they categorical? Are they numerical? For numerical variables, check for outliers. If you aren’t sure or want to take a closer look at the data, create a graph.

80.2 Exercises

  1. What are the cases in this data set? How many cases are there in our sample?

80.2.1 Baby weights

Wen, Shi Wu, Michael S. Kramer, and Robert H. Usher. “Comparison of birth weight distributions between Chinese and Caucasian infants.” American Journal of Epidemiology 141.12 (1995): 1177-1187.

A 1995 study suggests that average weight of White babies born in the US is 3,369 grams (7.43 pounds).In this dataset, we have pretty limited information on race, we only know whether the mother is White. We will make the simplifying assumption that babies of White mothers are also White, i.e. whitemom = "white". (Yes, I know that this assumption is a gross oversimplification).

We want to evaluate whether the average weight of White babies has changed since 1995.

Our null hypothesis should state “there is nothing going on”, i.e. no change since 1995: \(H_0: \mu = 7.43~pounds\).

Our alternative hypothesis should reflect the research question, i.e., some change since 1995. Since the research question doesn’t state a direction for the change, we use a two-sided alternative hypothesis: \(H_A: \mu \ne 7.43~pounds\).

  1. Create a filtered data frame called ncbirths_white that contains data only from White mothers. Then, calculate the mean of the weights of their babies.
  1. Are the criteria necessary for conducting simulation-based inference satisfied? Explain your reasoning.

Let’s discuss how this test would work. Our goal is to simulate a null distribution of sample means that is centered at the null value of 7.43 pounds. In order to do so, we: - take a bootstrap sample of from the original sample, - calculate this bootstrap sample’s mean, - repeat these two steps a large number of times to create a bootstrap distribution of means centered at the observed sample mean, - shift this distribution to be centered at the null value by subtracting / adding X to all bootstrap mean (X = difference between mean of bootstrap distribution and null value), and - calculate the p-value as the proportion of bootstrap samples that yielded a sample mean at least as extreme as the observed sample mean.

  1. Run the appropriate hypothesis test, visualize the null distribution, calculate the p-value, and interpret the results in the context of the data and the hypothesis test.

80.2.2 Baby weight vs. smoking

Consider the relationship between a mother’s smoking habit and the weight of her baby. Plotting the data is a useful first step because it helps us quickly visualize trends, identify strong associations, and develop research questions.

  1. Make side-by-side box plots displaying the relationship between habit and weight. What does the plot highlight about the relationship between these two variables?

  2. Before continuing, create a cleaned version of the dataset by removing any rows with missing values for habit or weight. Name this version ncbirths_clean.

  3. Calculate the observed difference in means between the baby weights of smoking and non-smoking mothers.

ncbirths_clean %>%
  group_by(habit) %>%
  summarise(mean_weight = mean(weight))

We can see that there’s an observable difference, but is this difference meaningful? Is it statistically significant? We can answer this question by conducting a hypothesis test.

  1. Write the hypotheses for testing if the average weights of babies born to smoking and non-smoking mothers are different.

\(H_0\): _________ (\(\mu_1 = \mu_2\))

\(H_A\): _________ (\(\mu_1 \ne \mu_2\))

  1. Run the appropriate hypothesis test, calculate the p-value, and interpret the results in context of the data and the hypothesis test.

  2. Construct a 95% confidence interval for the difference between the average weights of babies born to smoking and non-smoking mothers.

80.2.3 Mother’s age vs. baby weight

In this portion of the analysis, we focus on two variables. The first one is maturemom.

  1. First, a non-inference task: Determine the age cutoff for younger and mature mothers. Use a method of your choice, and explain how your method works.

The other variable of interest is lowbirthweight.

  1. Conduct a hypothesis test evaluating whether the proportion of low birth weight babies is higher for mature mothers. Use \(\alpha = 0.05\).
  • State the hypotheses
  • Verify the conditions
  • Run the test and calculate the p-value
  • State your conclusion within context of the research question
  1. Calculate a confidence interval for the difference between the proportions of low birth weight babies between mature and younger mothers. Interpret the interval in the context of the data and explain what it means.

80.3 Wrap up

In this lab, you practiced using non-parametric inference to analyze the impact of maternal smoking during pregnancy on the weight of the baby. You also explored the relationships between other variables, such as the mother’s age and the baby’s birth weight. You’ve gained experience with data manipulation, visualization, hypothesis testing, and calculating confidence intervals.

Remember to save your work, commit your changes, and push them to your Git repository!