class: center, middle, inverse, title-slide

.title[
# Simulating Correlated Data
]
.author[
### S. Mason Garrison
]

---

layout: true

<div class="my-footer">
<span>
<a href="https://DataScience4Psych.github.io/DataScience4Psych/" target="_blank">Data Science for Psychologists</a>
</span>
</div>

---

class: middle

# Learning Goals

---

## Learning Goals

By the end of this session, you will be able to...

- Explain why and when simulation is a useful tool in data science and psychology
- Simulate correlated variables using `mvrnorm()` from the MASS package
- Use simulation to demonstrate statistical concepts like the Central Limit Theorem
- Conduct a basic power analysis through simulation
- Evaluate how sample size affects the precision and reliability of estimates

---

class: middle

# Why Simulate?

---

.pull-left-narrow[
We've already learned the **mechanics** of simulation:

- `rnorm()`, `runif()`, `rpois()`
- `rep()` for categorical variables
- `replicate()` and `for()` loops
]

--

.pull-right-wide[
Now, we ask: **what can we actually do with all of this?**

Simulation lets us...

- Develop intuition about chance
- Demonstrate statistical principles
- Plan studies before collecting data
- Test whether our methods actually work
]

--

.hand[.light-blue[We have the tools. Now let's put them to work.]]

---

## Simulation as a Way of Thinking

.alert[Have you ever found a "significant" result and wondered... is this real, or did I just get lucky?]

--

- Simulation gives us a way to answer that question **directly**: create a world where we **know** the truth, generate data, and see what happens.

---

class: middle

# Spurious Correlations

---

## The Danger of Small Samples

.question[
You collect data on two variables from 10 people and find r = 0.45. Is that a real relationship?
]

--

- Let's ask: **how often would we see r = 0.45 or more from two completely unrelated variables?**

---

## Simulating Spurious Correlations

.midi.pull-left-narrow[

``` r
set.seed(2024)

# Correlate two unrelated variables, 10,000 times
spurious_cors <- replicate(10000, {
  x <- rnorm(10)
  y <- rnorm(10)
  cor(x, y)
})

mean(abs(spurious_cors) >= 0.45)
```

```
## [1] 0.1896
```
]

--

- About 19% of the time!

--

.hand[.light-blue[With n = 10, even moderate correlations are expected by chance alone.]]

.footnote[
See [Spurious Correlations](https://www.tylervigen.com/spurious-correlations) for fun real-world examples.
]

---

## Let's Watch It Happen

Here's one random sample of n = 10 from two unrelated variables:

<img src="d31_simulations_files/figure-html/unnamed-chunk-3-1.png" alt="" width="50%" style="display: block; margin: auto;" />

---

## Another Random Sample

A **second** sample, same population (no relationship):

<img src="d31_simulations_files/figure-html/unnamed-chunk-4-1.png" alt="" width="50%" style="display: block; margin: auto;" />

--

Different sample, different correlation, and the true value is **zero** in both cases.

---

## And another...

And a **third**...

<img src="d31_simulations_files/figure-html/unnamed-chunk-5-1.png" alt="" width="50%" style="display: block; margin: auto;" />

--

.hand[.light-blue[We could keep going...]]

---

## 10,000 Samples Later

<img src="d31_simulations_files/figure-html/unnamed-chunk-6-1.png" alt="" width="70%" style="display: block; margin: auto;" />

---

## What About Larger Samples?

<img src="d31_simulations_files/figure-html/unnamed-chunk-7-1.png" alt="" width="70%" style="display: block; margin: auto;" />

--

.footnote[
This is why we need large samples. Simulation lets us **see** this behavior directly.
]

---

class: middle

# Wrapping Up...

---

class: middle

# The Central Limit Theorem via Simulation

---

## The Central Limit Theorem

- The sampling distribution of the mean approaches a normal distribution as n increases, regardless of the population distribution.
--

- That's a big claim! Let's test it with a decidedly non-normal distribution.

---

## Start with a Skewed Distribution

.pull-left[
<img src="d31_simulations_files/figure-html/unnamed-chunk-8-1.png" alt="" width="95%" style="display: block; margin: auto;" />
]

.pull-right[
This is clearly **not** normal: it's discrete, bounded at 0, and skewed right.
]

--

.question[
What do you think happens when we take the **mean** of samples from this distribution?
]

---

## CLT in Action

<img src="d31_simulations_files/figure-html/unnamed-chunk-9-1.png" alt="" width="70%" style="display: block; margin: auto;" />

--

.footnote[Even with n = 30, the distribution of means looks approximately normal!]

---

## The CLT Code

.midi[

``` r
set.seed(42)

# 5000 sample means, each from a Poisson sample of size 30
sample_means <- replicate(
  n = 5000,
  expr = mean(rpois(n = 30, lambda = 2))
)

data.frame(sample_means) %>%
  ggplot(aes(x = sample_means)) +
  geom_histogram(binwidth = 0.1)
```
]

--

.question[
What would happen if you replaced `rpois(30, lambda = 2)` with `runif(30, min = 0, max = 1)`? Would the CLT still hold?
]

--

.tip[Yes! The CLT holds for **any** population distribution with finite variance, as long as n is large enough.]

---

class: middle

# Wrapping Up...

---

class: middle

# Simulating Correlated Data with `mvrnorm()`

---

## The Problem with One Variable at a Time

.alert[Imagine simulating data for a study on exam performance. You need midterm and final scores that are **correlated**, but building that with `y = 0.5*x + noise` is clunky.]

--

- You need to **know the equation** in advance
- It's hard to control the **exact correlation**
- It doesn't scale to 3, 4, 5+ variables

--

.tip[`mvrnorm()` from the **MASS** package lets you specify the correlation structure directly.]

---

## What Does `mvrnorm()` Need?
``` r
mvrnorm(n, mu, Sigma)
```

--

- **`n`**: number of observations (e.g., 200 students)

--

- **`mu`**: vector of means (one per variable)

--

- **`Sigma`**: **covariance matrix** (variances on the diagonal, covariances off the diagonal)

.footnote[A correlation matrix is a standardized covariance matrix where the diagonal is all 1s.]

---

## Step 1: Define Means and Correlation

Let's simulate midterm and final exam scores (mean = 100, SD = 15, r = 0.4):

``` r
mu <- c(100, 100)

cor_mat <- matrix(c(1.0, 0.4,
                    0.4, 1.0), ncol = 2)
```

--

## Step 2: Convert to Covariance

`\(\text{Cov}(X_i, X_j) = r_{ij} \cdot SD_i \cdot SD_j\)`

``` r
sd_vec <- c(15, 15)
cov_mat <- cor_mat * (sd_vec %*% t(sd_vec))
cov_mat
```

```
##      [,1] [,2]
## [1,]  225   90
## [2,]   90  225
```

--

.midi[
The diagonal values (225) are the **variances** (`\(15^2 = 225\)`). The off-diagonal values (90) are the **covariances** (`\(0.4 \times 15 \times 15 = 90\)`).
]

---

## Step 3: Generate the Data

.pull-left[

``` r
library(MASS)
set.seed(11)

sim <- mvrnorm(
  n = 200,
  mu = mu,
  Sigma = cov_mat
)
colnames(sim) <- c("Midterm", "Final")
head(sim, 4)
```

```
##       Midterm     Final
## [1,] 87.17429  97.99095
## [2,] 86.76120 113.90631
## [3,] 91.28541  70.64941
## [4,] 75.66035  90.13732
```
]

.pull-right[
<img src="d31_simulations_files/figure-html/unnamed-chunk-15-1.png" alt="" width="95%" style="display: block; margin: auto;" />
]

---

## Step 4: Check Your Work

.question[How do we know the simulated data matches what we asked for?]

--

Always verify your parameters after generating data:

.pull-left[

``` r
# Check means (target: 100, 100)
colMeans(sim)
```

```
##   Midterm     Final
## 100.27172  99.71524
```

``` r
# Check SDs (target: 15, 15)
apply(sim, 2, sd)
```

```
##  Midterm    Final
## 14.38604 14.51655
```
]

.pull-right[

``` r
# Check correlation (target: 0.4)
cor(sim)
```

```
##           Midterm     Final
## Midterm 1.0000000 0.3731749
## Final   0.3731749 1.0000000
```
]

--

.footnote[Close but not exact: that's sampling variability!]
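
---

## Aside: `cov2cor()` as a Sanity Check

Base R can also invert the hand-built conversion. A minimal sketch (reusing the `cor_mat` and `sd_vec` values from the steps above) confirms the covariance matrix encodes exactly the correlation we asked for:

``` r
sd_vec <- c(15, 15)
cor_mat <- matrix(c(1.0, 0.4,
                    0.4, 1.0), ncol = 2)

# Same conversion as before: the outer product of SDs scales the correlations
cov_mat <- cor_mat * (sd_vec %*% t(sd_vec))

# cov2cor() standardizes a covariance matrix back to a correlation matrix
cov2cor(cov_mat)
```

```
##      [,1] [,2]
## [1,]  1.0  0.4
## [2,]  0.4  1.0
```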
---

## Scaling Up: Four Exam Scores

.small[

``` r
var_names <- c("Quiz", "Midterm", "Project", "Final")
mu <- c(75, 70, 80, 72)
sd_exams <- c(10, 12, 8, 15)

cor_exams <- matrix(
  c(1.00, 0.55, 0.35, 0.50,
    0.55, 1.00, 0.40, 0.65,
    0.35, 0.40, 1.00, 0.45,
    0.50, 0.65, 0.45, 1.00),
  nrow = 4, byrow = TRUE,
  dimnames = list(var_names, var_names))

cov_exams <- cor_exams * (sd_exams %*% t(sd_exams))
```
]

--

.tip[
When building correlation matrices, make sure they are **symmetric** (same value above and below the diagonal) and **positive definite** (all eigenvalues > 0). R will throw an error if the matrix isn't valid.
]

---

## Did It Work?

<img src="d31_simulations_files/figure-html/unnamed-chunk-20-1.png" alt="" width="60%" style="display: block; margin: auto;" />

--

The recovered correlations are close to the population values, so the simulation is working as intended.

---

class: middle

# Power Analysis via Simulation

---

## What is Statistical Power?

**Power** = the probability of detecting an effect *when one truly exists*.

--

- Low power means we miss real effects
- High power means we reliably detect effects

--

.question[You're planning a study comparing reaction times between two groups. How many participants do you need?]

--

We can answer this **entirely through simulation**.

---

## Building a Power Simulation, Step by Step

First, let's think about what a single simulated experiment looks like:

``` r
group1 <- rnorm(n_per_group, mean = 0, sd = 10)
group2 <- rnorm(n_per_group, mean = 5, sd = 10)
t.test(group1, group2)$p.value < 0.05
```

--

We generate two groups with a **known** difference (5 points), then test whether we can detect it...

---

## Repeating the Experiment

Now repeat that experiment many times:

``` r
results <- replicate(1000, {
  group1 <- rnorm(n_per_group, mean = 0, sd = 10)
  group2 <- rnorm(n_per_group, mean = 5, sd = 10)
  t.test(group1, group2)$p.value < 0.05
})
```

--

Each replication returns `TRUE` (significant) or `FALSE` (not significant).
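
--

Because R treats `TRUE` as 1 and `FALSE` as 0, the mean of those logicals is the estimated power. A runnable sketch (picking `n_per_group = 30` purely for illustration):

``` r
set.seed(2024)
n_per_group <- 30

results <- replicate(1000, {
  group1 <- rnorm(n_per_group, mean = 0, sd = 10)
  group2 <- rnorm(n_per_group, mean = 5, sd = 10)
  t.test(group1, group2)$p.value < 0.05
})

mean(results)  # proportion of significant results = estimated power
```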
---

## Wrapping It in a Function

.pull-left[

``` r
power_sim <- function(n_per_group,
                      effect = 5,
                      sd = 10,
                      alpha = 0.05,
                      nsim = 1000) {
  significant <- replicate(nsim, {
    group1 <- rnorm(n_per_group, mean = 0, sd = sd)
    group2 <- rnorm(n_per_group, mean = effect, sd = sd)
    t.test(group1, group2)$p.value < alpha
  })
  mean(significant)
}
```
]

--

.pull-right[
This function:

1. Generates two groups with a known difference
2. Runs a t-test
3. Checks if p < .05
4. Repeats 1000 times
5. Returns the proportion of significant results = **power**
]

---

## Let's Try a Few Sample Sizes

``` r
set.seed(2024)
power_sim(n_per_group = 10)
```

```
## [1] 0.185
```

--

``` r
power_sim(n_per_group = 30)
```

```
## [1] 0.484
```

--

``` r
power_sim(n_per_group = 60)
```

```
## [1] 0.774
```

--

.midi[
Power goes up as sample size increases. But where does it cross the conventional 80% threshold?
]

---

## The Power Curve

<img src="d31_simulations_files/figure-html/unnamed-chunk-27-1.png" alt="" width="80%" style="display: block; margin: auto;" />

--

.footnote[We need about 65 per group for 80% power. No formula required!]

---

## Why Simulation-Based Power?

.pull-left[
### Formulas work for simple cases

- Two groups, equal variances
- One predictor, normal outcome
- Standard tests (t, ANOVA, chi-square)
]

--

.pull-right[
### Simulation works for everything

- Complex designs (repeated measures, nested)
- Non-normal data (e.g., reaction times)
- Missing data patterns
- Any model you can fit in R
]

--

.footnote[See Fan (2012) in the *APA Handbook of Research Methods in Psychology* for a great introduction.]

---

class: middle

# How Much Data Do We Need?

---

## Sample Size and Precision

.question[If we increase our sample size, do we get more precise estimates? How much more precise?]

--

Let's go back to our exam score simulation (r = 0.4) and see how the estimate varies.
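
--

One way to quantify "how much more precise": simulate many studies at each sample size and compare the spread of the estimated correlations. A sketch under our exam-score setup (true r = 0.4; the helper name `cor_spread` is ours, not from the slides):

``` r
library(MASS)
set.seed(2024)

mu <- c(100, 100)
cov_mat <- matrix(c(225, 90,
                    90, 225), ncol = 2)

# SD of the estimated correlation across 2000 simulated studies of size n
cor_spread <- function(n) {
  sd(replicate(2000, cor(mvrnorm(n, mu = mu, Sigma = cov_mat))[1, 2]))
}

cor_spread(25)   # wide spread at small n
cor_spread(400)  # much narrower at large n
```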
---

## Watching Estimates Converge

<img src="d31_simulations_files/figure-html/unnamed-chunk-28-1.png" alt="" width="80%" style="display: block; margin: auto;" />

---

## The Distribution Narrows

<img src="d31_simulations_files/figure-html/unnamed-chunk-29-1.png" alt="" width="70%" style="display: block; margin: auto;" />

--

.hand[.light-blue[Larger samples give estimates closer to the population value: the law of large numbers in action.]]

---

class: middle

# Monte Carlo Simulation as a Research Tool

---

## What is a Monte Carlo Study?

.alert[You've developed a new statistical method. How do you know it actually works?]

--

- You can't test it on real data, because you never know the **truth**. A Monte Carlo study lets you create a world where you **do** know the truth, then see how your method performs.

--

.hand[.light-blue[It's like a laboratory experiment, but for statistical methods.]]

---

## The Simulation Workflow

.question[What are the key steps?]

--

**Step 1:** Define population parameters (means, SDs, correlations, effects)

--

**Step 2:** Generate one dataset (`rnorm`, `runif`, `mvrnorm`)

--

**Step 3:** Apply your method and extract results (`lm`, `t.test`, `cor`)

--

**Step 4:** Repeat many times (`replicate` or `for` loop)

--

**Step 5:** Summarize across replications (bias, power, coverage)

--

.tip[Save results into a list and use `purrr::map()` to extract whatever you need.]

---

## Questions a Monte Carlo Study Can Answer

- How often does my test reject when it shouldn't? **Type I error rate**

--

- What happens when assumptions are violated? **Robustness**

--

- How large a sample do I need? **Sample size planning**

--

.footnote[
For a deeper dive, see Muthén & Muthén (2002) on using Monte Carlo simulation for sample size and power determination, and Fan (2012) in the APA Handbook for a wonderful introduction aimed at psychology researchers.
]

---

class: middle

# Summary: Learning Goals Achieved

---

## What We've Learned

Today, you should now be able to...
.pull-left[
### Concepts

- ✅ Simulation as a way of thinking
- ✅ Spurious correlations and sample size
- ✅ Central Limit Theorem
- ✅ Power analysis
- ✅ Monte Carlo methodology
]

.pull-right[
### Skills

- ✅ Simulate correlated data with `mvrnorm()`
- ✅ Convert correlation → covariance matrices
- ✅ Run power analyses by simulation
- ✅ Design and evaluate simulation studies
]

---

# Wrapping Up...

---

# Sources

- Ariel Muldoon's [tutorial](https://github.com/aosmith16/simulation-helper-functions)
- Mine Cetinkaya-Rundel's Data Science in a Box ([link](https://datasciencebox.org/))
- Fan, X. (2012). Designing simulation studies. In H. Cooper (Ed.), *APA Handbook of Research Methods in Psychology* (Vol. 2, pp. 427-444). APA.