# 80 Lab: So what if you smoke when pregnant?

## Non-parametric-based inference

In 2004, North Carolina released a comprehensive birth record dataset that holds valuable insights for researchers examining the connection between expectant mothers’ habits and practices and their child’s birth. In this lab, we’ll be exploring a randomly selected subset of the data. You’ll learn how to use non-parametric-based inference to analyze the impact of maternal smoking during pregnancy on the weight of the baby. You will also explore the relationships between other variables, such as the mother’s age and the baby’s birth weight. Through the exercises, you will practice data manipulation, visualization, hypothesis testing, and calculating confidence intervals.

## Getting started

### Packages

In this lab, we will work with the **tidyverse**, **infer**, and **openintro** packages. We can install and load them with the following code:
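A minimal sketch of that setup is given here. The `install.packages()` line is commented out on the assumption that the packages only need to be installed once per machine.

```r
# Install once if needed (assumption: the packages are not yet installed)
# install.packages(c("tidyverse", "infer", "openintro"))

# Load the packages used throughout the lab
library(tidyverse)  # data manipulation and plotting
library(infer)      # simulation-based inference
library(openintro)  # provides the ncbirths data
```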

### Housekeeping

### Warm up

Before diving into the dataset, let’s warm up with some simple exercises.

#### YAML

Open the R Markdown (Rmd) file in your project, change the author name to your name, and knit the document. Doing so will allow you to personalize your Rmd file.

#### Committing and pushing changes

- Go to the **Git** pane in your RStudio.
- View the **Diff** and confirm that you are happy with the changes.
- Add a commit message like “Update team name” in the **Commit message** box and hit **Commit**.
- Click on **Push**. This will prompt a dialogue box where you first need to enter your user name, and then your password.

### Set a seed!

In this lab, we’ll be generating random samples. To keep our results consistent, set a seed before you start sampling. Otherwise, the results will change each time you knit your lab. To set your seed, find the designated R chunk in your R Markdown file and insert the seed value there.
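As a small illustration of why the seed matters, the sketch below draws the same random sample twice. The seed value `1234` is arbitrary and chosen only for this example; use whatever value your lab template asks for.

```r
# Setting the same seed before each draw reproduces the same sample
set.seed(1234)  # arbitrary value, for illustration only
first_draw <- sample(1:100, 5)

set.seed(1234)
second_draw <- sample(1:100, 5)

identical(first_draw, second_draw)  # TRUE: both draws match
```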

## 80.1 The data

Load the `ncbirths` data from the `openintro` package:
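One way to do this, assuming the **openintro** package is installed, is sketched below.

```r
library(openintro)

# Load the dataset into the workspace
data(ncbirths)

# Number of rows (cases) and columns (variables)
dim(ncbirths)
```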

We have observations on 13 different variables, some categorical and some numerical. The meaning of each variable is provided in the following table.

| variable | description |
|---|---|
| `fage` | father’s age in years. |
| `mage` | mother’s age in years. |
| `mature` | maturity status of mother. |
| `weeks` | length of pregnancy in weeks. |
| `premie` | whether the birth was classified as premature (premie) or full-term. |
| `visits` | number of hospital visits during pregnancy. |
| `marital` | whether mother is `married` or `not married` at birth. |
| `gained` | weight gained by mother during pregnancy in pounds. |
| `weight` | weight of the baby at birth in pounds. |
| `lowbirthweight` | whether baby was classified as low birthweight (`low`) or not (`not low`). |
| `gender` | gender of the baby, `female` or `male`. |
| `habit` | status of the mother as a `nonsmoker` or a `smoker`. |
| `whitemom` | whether mom is `white` or `not white`. |

Before analyzing any new dataset, it’s important to get to know your data. Start by summarizing the variables and determining their data types. Are they categorical? Are they numerical? For numerical variables, check for outliers. If you aren’t sure or want to take a closer look at the data, create a graph.
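A first look at the data might go something like the sketch below. The choice of `weight` for the histogram is only an example; any numerical variable could be inspected the same way.

```r
library(tidyverse)
library(openintro)
data(ncbirths)

# Variable names, types, and the first few values of each column
glimpse(ncbirths)

# Five-number summaries for numerical variables, counts for categorical ones
summary(ncbirths)

# A histogram helps spot skew and potential outliers in a numerical variable
weight_hist <- ggplot(ncbirths, aes(x = weight)) +
  geom_histogram(bins = 30) +
  labs(x = "Birth weight (pounds)", y = "Count")
weight_hist
```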

## 80.2 Exercises

- What are the cases in this data set? How many cases are there in our sample?

### 80.2.1 Baby weights

Wen, Shi Wu, Michael S. Kramer, and Robert H. Usher. “Comparison of birth weight distributions between Chinese and Caucasian infants.” American Journal of Epidemiology 141.12 (1995): 1177-1187.

A 1995 study suggests that the average weight of White babies born in the US is 3,369 grams (7.43 pounds). In this dataset, we have limited information on race: we only know whether the mother is White. We will make the simplifying assumption that babies of White mothers are also White, i.e. `whitemom = "white"`. (Yes, I know that this assumption is a gross oversimplification).

We want to evaluate whether the average weight of White babies has changed since 1995.

Our null hypothesis should state “there is nothing going on”, i.e. no change since 1995: \(H_0: \mu = 7.43~pounds\).

Our alternative hypothesis should reflect the research question, i.e., some change since 1995. Since the research question doesn’t state a direction for the change, we use a two-sided alternative hypothesis: \(H_A: \mu \ne 7.43~pounds\).

- Create a filtered data frame called `ncbirths_white` that contains data only from White mothers. Then, calculate the mean of the weights of their babies.

- Are the criteria necessary for conducting simulation-based inference satisfied? Explain your reasoning.

Let’s discuss how this test would work. Our goal is to simulate a null distribution of sample means that is centered at the null value of 7.43 pounds. In order to do so, we:

- take a bootstrap sample from the original sample,
- calculate this bootstrap sample’s mean,
- repeat these two steps a large number of times to create a bootstrap distribution of means centered at the observed sample mean,
- shift this distribution to be centered at the null value by subtracting / adding X to each bootstrap mean (X = difference between the mean of the bootstrap distribution and the null value), and
- calculate the p-value as the proportion of bootstrap samples that yielded a sample mean at least as extreme as the observed sample mean.
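The steps above can be sketched by hand in base R to show the mechanics. This is an illustration under stated assumptions, not the required solution: the seed, the number of repetitions (1,000), and the helper names `white_weights`, `boot_means`, and `null_means` are all choices made for this example.

```r
library(openintro)
data(ncbirths)

set.seed(1234)  # arbitrary seed, for illustration only

# Weights of babies born to White mothers (rows with missing values dropped)
white_weights <- ncbirths$weight[which(ncbirths$whitemom == "white")]
white_weights <- white_weights[!is.na(white_weights)]

null_value <- 7.43
obs_mean <- mean(white_weights)

# Steps 1-3: resample with replacement, take the mean, repeat many times
boot_means <- replicate(1000, mean(sample(white_weights, replace = TRUE)))

# Step 4: shift the bootstrap distribution so it is centered at the null value
null_means <- boot_means - mean(boot_means) + null_value

# Step 5: two-sided p-value, the share of simulated means at least as far
# from the null value as the observed mean is
p_value <- mean(abs(null_means - null_value) >= abs(obs_mean - null_value))
p_value
```

Shifting by the mean of the bootstrap distribution keeps the spread of the simulated means unchanged, so only the center moves to 7.43.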

- Run the appropriate hypothesis test, visualize the null distribution, calculate the p-value, and interpret the results in the context of the data and the hypothesis test.
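With the **infer** package, the same test could be set up roughly as follows. The seed and `reps = 1000` are assumptions for this sketch, and the interpretation is left to you.

```r
library(tidyverse)
library(infer)
library(openintro)
data(ncbirths)

set.seed(1234)  # arbitrary seed, for illustration only

# Babies of White mothers only
ncbirths_white <- ncbirths %>%
  filter(whitemom == "white")

# Observed sample mean
obs_mean <- ncbirths_white %>%
  specify(response = weight) %>%
  calculate(stat = "mean")

# Null distribution of sample means centered at 7.43 pounds
null_dist <- ncbirths_white %>%
  specify(response = weight) %>%
  hypothesize(null = "point", mu = 7.43) %>%
  generate(reps = 1000, type = "bootstrap") %>%
  calculate(stat = "mean")

# Plot the null distribution and shade the region used for the p-value
null_plot <- visualize(null_dist) +
  shade_p_value(obs_stat = obs_mean, direction = "two-sided")
null_plot

# Two-sided p-value
p_value <- null_dist %>%
  get_p_value(obs_stat = obs_mean, direction = "two-sided")
p_value
```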

### 80.2.2 Baby weight vs. smoking

Consider the relationship between a mother’s smoking habit and the weight of her baby. Plotting the data is a useful first step because it helps us quickly visualize trends, identify strong associations, and develop research questions.

- Make side-by-side box plots displaying the relationship between `habit` and `weight`. What does the plot highlight about the relationship between these two variables?

- Before continuing, create a cleaned version of the dataset by removing any rows with missing values for `habit` or `weight`. Name this version `ncbirths_clean`.

- Calculate the observed difference in means between the baby weights of smoking and non-smoking mothers.
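These three tasks could be approached along the lines of the sketch below. The order `nonsmoker` minus `smoker` is an assumption made for this example; the reverse order only flips the sign of the difference.

```r
library(tidyverse)
library(infer)
library(openintro)
data(ncbirths)

# Side-by-side box plots of birth weight by smoking habit
habit_plot <- ggplot(ncbirths, aes(x = habit, y = weight)) +
  geom_boxplot() +
  labs(x = "Mother's smoking habit", y = "Birth weight (pounds)")
habit_plot

# Drop rows where habit or weight is missing
ncbirths_clean <- ncbirths %>%
  filter(!is.na(habit), !is.na(weight))

# Observed difference in mean weights (nonsmoker minus smoker)
obs_diff <- ncbirths_clean %>%
  specify(weight ~ habit) %>%
  calculate(stat = "diff in means", order = c("nonsmoker", "smoker"))
obs_diff
```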

We can see that there’s an observable difference, but is this difference meaningful? Is it statistically significant? We can answer this question by conducting a hypothesis test.

- Write the hypotheses for testing if the average weights of babies born to smoking and non-smoking mothers are different.

\(H_0\): _________ (\(\mu_1 = \mu_2\))

\(H_A\): _________ (\(\mu_1 \ne \mu_2\))

- Run the appropriate hypothesis test, calculate the p-value, and interpret the results in context of the data and the hypothesis test.

- Construct a 95% confidence interval for the difference between the average weights of babies born to smoking and non-smoking mothers.
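One possible approach uses a permutation test for the hypothesis test and a percentile bootstrap interval for the confidence interval. The seed, `reps = 1000`, and the choice of the percentile method are assumptions made for this sketch.

```r
library(tidyverse)
library(infer)
library(openintro)
data(ncbirths)

set.seed(1234)  # arbitrary seed, for illustration only

ncbirths_clean <- ncbirths %>%
  filter(!is.na(habit), !is.na(weight))

# Observed difference in means (nonsmoker minus smoker)
obs_diff <- ncbirths_clean %>%
  specify(weight ~ habit) %>%
  calculate(stat = "diff in means", order = c("nonsmoker", "smoker"))

# Null distribution: shuffle habit labels to break any association with weight
null_dist <- ncbirths_clean %>%
  specify(weight ~ habit) %>%
  hypothesize(null = "independence") %>%
  generate(reps = 1000, type = "permute") %>%
  calculate(stat = "diff in means", order = c("nonsmoker", "smoker"))

p_value <- null_dist %>%
  get_p_value(obs_stat = obs_diff, direction = "two-sided")
p_value

# Bootstrap distribution of the difference in means, then a 95% percentile interval
boot_dist <- ncbirths_clean %>%
  specify(weight ~ habit) %>%
  generate(reps = 1000, type = "bootstrap") %>%
  calculate(stat = "diff in means", order = c("nonsmoker", "smoker"))

ci <- boot_dist %>%
  get_confidence_interval(level = 0.95, type = "percentile")
ci
```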

### 80.2.3 Mother’s age vs. baby weight

In this portion of the analysis, we focus on two variables. The first one is `mature`.

- First, a non-inference task: Determine the age cutoff for younger and mature mothers. Use a method of your choice, and explain how your method works.
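One method, among several that would work, is to compare the range of mothers’ ages within each maturity group. The name `age_ranges` is introduced only for this sketch.

```r
library(tidyverse)
library(openintro)
data(ncbirths)

# Youngest and oldest mother in each maturity group
age_ranges <- ncbirths %>%
  group_by(mature) %>%
  summarize(
    min_age = min(mage, na.rm = TRUE),
    max_age = max(mage, na.rm = TRUE)
  )
age_ranges
```

If the two age ranges do not overlap, the cutoff lies between the maximum age of one group and the minimum age of the other.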

The other variable of interest is `lowbirthweight`.

- Conduct a hypothesis test evaluating whether the proportion of low birth weight babies is higher for mature mothers. Use \(\alpha = 0.05\).

- State the hypotheses
- Verify the conditions
- Run the test and calculate the p-value
- State your conclusion within context of the research question

- Calculate a confidence interval for the difference in the proportions of low birth weight babies between mature and younger mothers. Interpret the interval in the context of the data and explain what it means.
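A rough sketch of how the test and the interval could be run with **infer** follows. Several things here are assumptions: the level names `"mature mom"` and `"younger mom"` for `mature` (check with `levels(ncbirths$mature)`), the 95% confidence level, the seed, and `reps = 1000`. Stating the hypotheses, checking conditions, and writing the conclusion are left to you.

```r
library(tidyverse)
library(infer)
library(openintro)
data(ncbirths)

set.seed(1234)  # arbitrary seed, for illustration only

# Keep rows where both variables are recorded
births <- ncbirths %>%
  filter(!is.na(mature), !is.na(lowbirthweight))

# Assumed level names for `mature`; the order gives mature minus younger
grp_order <- c("mature mom", "younger mom")

# Observed difference in proportions of low birth weight babies
obs_diff <- births %>%
  specify(lowbirthweight ~ mature, success = "low") %>%
  calculate(stat = "diff in props", order = grp_order)

# Null distribution under independence of maturity and low birth weight
null_dist <- births %>%
  specify(lowbirthweight ~ mature, success = "low") %>%
  hypothesize(null = "independence") %>%
  generate(reps = 1000, type = "permute") %>%
  calculate(stat = "diff in props", order = grp_order)

# One-sided p-value: is the proportion higher for mature mothers?
p_value <- null_dist %>%
  get_p_value(obs_stat = obs_diff, direction = "greater")
p_value

# Bootstrap percentile interval for the difference in proportions
boot_dist <- births %>%
  specify(lowbirthweight ~ mature, success = "low") %>%
  generate(reps = 1000, type = "bootstrap") %>%
  calculate(stat = "diff in props", order = grp_order)

ci <- boot_dist %>%
  get_confidence_interval(level = 0.95, type = "percentile")
ci
```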

## 80.3 Wrap up

In this lab, you practiced using non-parametric inference to analyze the impact of maternal smoking during pregnancy on the weight of the baby. You also explored the relationships between other variables, such as the mother’s age and the baby’s birth weight. You’ve gained experience with data manipulation, visualization, hypothesis testing, and calculating confidence intervals.

Remember to save your work, commit your changes, and push them to your Git repository!