class: center, middle, inverse, title-slide .title[ # Data Simulations in R
🎲
]
.author[
### S. Mason Garrison
]

---
layout: true

<div class="my-footer">
<span>
<a href="https://DataScience4Psych.github.io/DataScience4Psych/" target="_blank">Data Science for Psychologists</a>
</span>
</div>

---
class: middle

# Learning Goals

---

## Learning Goals

By the end of this session, you will be able to...

- Generate random numbers from various distributions (normal, uniform, Poisson)
- Use `set.seed()` for reproducible simulations
- Create simulated datasets with quantitative and categorical variables
- Use simulations to develop intuition about statistical distributions
- Apply `replicate()` for repeated simulation experiments

---
class: middle

# Why Simulate Data?

---

## Why Simulate Data?

.alert[Imagine you're designing a psychology experiment, but you haven't collected any data yet. How do you know if your study will actually work?]

--

- Maybe you want to know: "If the true effect size is *d* = 0.5, will my sample of 30 participants be enough to detect it?"

--

- Or maybe you want to practice your analysis pipeline *before* you spend months in the lab.

--

- Simulation lets you create artificial data that behaves like real data — so you can plan, test, and build intuition.

---

## Why Simulate Data?

- Simulation is a powerful tool for data scientists, especially in psychology, where we often deal with complex, noisy data.

.pull-left[
- It allows us to explore "what if" scenarios and understand the behavior of statistical methods under controlled conditions.
- Useful for:
  - Understanding statistical properties.
  - Testing hypotheses.
  - Demonstrating data analysis techniques.
  - Diplomatically resolving reviewer comments about "what if the data looked like this instead?"
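For instance, that *d* = 0.5 question can be explored directly. A minimal sketch of one hypothetical run (the p-value you get is just one random outcome, not a power estimate):

``` r
set.seed(42)
# One hypothetical run: two groups of 30, true effect d = 0.5
control   <- rnorm(n = 30, mean = 0,   sd = 1)
treatment <- rnorm(n = 30, mean = 0.5, sd = 1)
t.test(treatment, control)$p.value
```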
]
.pull-right[
<br><br>
<img src="img/data scientist at work.png" alt="" width="80%" style="display: block; margin: auto;" />
]

---
class: middle

# Generating Random Numbers

---

## The Normal Distribution: `rnorm()`

.question[
If you measured the exam scores of every introductory psychology student at this university, what would the distribution look like?
]

--

- Probably something close to a **normal distribution** — a bell curve centered around some average.

---

# Perhaps something like this:

.pull-left[
<img src="d30_simulations_files/figure-html/unnamed-chunk-3-1.png" alt="" width="90%" style="display: block; margin: auto;" />
]
.pull-right[
- R can generate data that looks just like that, using `rnorm()`.
]

---

## `rnorm()`: Start Simple

Let's generate 5 random numbers from a standard normal distribution:

``` r
rnorm(5)
```

```
## [1] -0.84085548 1.38435934 -1.25549186 0.07014277 1.71144087
```

--

.midi[
Notice how the values cluster around 0? That's because the default mean is 0, and the default standard deviation is 1.
] --- ## `rnorm()`: Name Your Arguments It's clearer to specify the arguments explicitly: ``` r rnorm(n = 5, mean = 0, sd = 1) ``` ``` ## [1] -1.3705874 0.9512344 -2.0176590 0.8445475 0.3004160 ``` -- - `n` = how many values to generate - `mean` = the center of the distribution - `sd` = how spread out the values are --- ## `rnorm()`: Make It Meaningful .question[ Suppose exam scores in your class have a mean of 75 and a standard deviation of 10. How would you simulate 5 students' scores? ] -- ``` r rnorm(n = 5, mean = 75, sd = 10) ``` ``` ## [1] 86.36701 71.54886 84.49494 70.19404 79.74409 ``` -- .midi[ These look like plausible exam scores! They're scattered around 75, mostly within about 10 points of the mean — exactly what we'd expect. ] --- ## Changing Parameters in `rnorm()` ``` r rnorm(n = 5, mean = 50, sd = 20) ``` ``` ## [1] 63.97630 99.30568 46.03789 48.02040 21.58854 ``` -- Using vectors for arguments: ``` r rnorm(n = 10, mean = c(0, 5, 20), sd = c(1, 5, 20)) ``` ``` ## [1] -0.212758449 5.362138890 30.932826978 0.005486896 ## [5] 15.720273382 12.015329504 1.416949724 0.536333842 ## [9] 46.535237825 0.039130412 ``` -- .midi[ The `mean` and `sd` vectors **recycle** — the first value gets mean 0 and sd 1, the second gets mean 5 and sd 5, the third gets mean 20 and sd 20, the fourth wraps back to mean 0 and sd 1, and so on. ] --- ## Visualizing `rnorm()` Output .pull-left[ <img src="d30_simulations_files/figure-html/norm-plot-1.png" alt="" width="100%" style="display: block; margin: auto;" /> ] .pull-right[ - The curve shows the theoretical normal distribution. - The .hand[.light-blue[colored dots]] are our 5 simulated values. - Each time we run `rnorm()`, we get *different* dots — that's the randomness! ] --- ## But Wait — Reproducibility! .question[ If random numbers are different every time, how can we share our results with others and get the same answer? 
] --- ## `set.seed()`: Same Randomness Every Time ``` r set.seed(16) rnorm(n = 5, mean = 0, sd = 1) ``` ``` ## [1] 0.4764134 -0.1253800 1.0962162 -1.4442290 1.1478293 ``` -- Run it again with the **same seed**: ``` r set.seed(16) rnorm(n = 5, mean = 0, sd = 1) ``` ``` ## [1] 0.4764134 -0.1253800 1.0962162 -1.4442290 1.1478293 ``` -- .hand[Identical!] --- ## Without `set.seed()`, Results Change ``` r rnorm(n = 5, mean = 0, sd = 1) ``` ``` ## [1] -0.46841204 -1.00595059 0.06356268 1.02497260 0.57314202 ``` -- ``` r rnorm(n = 5, mean = 0, sd = 1) ``` ``` ## [1] 1.8471821 0.1119334 -0.7460373 1.6582137 0.7217206 ``` -- .tip[ Always use `set.seed()` at the top of your script when doing simulations. Pick any number you like — it's just a starting point for the random number generator. ] --- ## Uniform Distribution: `runif()` .question[ What if every value in a range is equally likely — like reaction times that are randomly jittered between 0 and 500 ms? ] -- .pull-left[ ``` r set.seed(1235) runif(n = 5, min = 0, max = 1) ``` ``` ## [1] 0.24259237 0.51535594 0.09942167 0.90153593 0.83890292 ``` - `min` and `max` define the range - Every value in that range is equally probable ] .pull-right[ <img src="d30_simulations_files/figure-html/unnamed-chunk-16-1.png" alt="" width="100%" style="display: block; margin: auto;" /> ] --- ## Poisson Distribution: `rpois()` .question[ How many times does a participant press a button in a 30-second window? How many errors does a student make on a task? These are **count data** — and the Poisson distribution models them. ] -- .pull-left[ ``` r set.seed(123) rpois(n = 5, lambda = 2) ``` ``` ## [1] 1 3 2 4 4 ``` - `lambda` = the average count (mean) - Values are always whole numbers (0, 1, 2, ...) ] .pull-right[ <img src="d30_simulations_files/figure-html/unnamed-chunk-18-1.png" alt="" width="100%" style="display: block; margin: auto;" /> ] --- class: middle # Let's dig deeper into the Poisson distribution... 
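Before looking at any plots, a quick simulation can confirm the Poisson's defining property: its mean and its variance both equal lambda. A minimal sketch:

``` r
set.seed(123)
draws <- rpois(n = 10000, lambda = 2)
mean(draws)  # should land close to 2
var(draws)   # should also land close to 2
```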
.hint[This is a great example of how simulation can help us understand a distribution that might be less familiar than the normal.]

---

# Digging Deeper into the Poisson Distribution

.pull-left-narrow[
- Lambda is the mean number of events in a fixed interval.
- For demonstration purposes, we will generate data for different lambda values one at a time.
]
.pull-right-wide[
.tip[
You don't have to generate Poisson-distributed data for different lambda values one at a time (`lapply()` or `purrr` is great for this).
]
.midi[

``` r
set.seed(123)
poisson_df1 <- tibble(x = rpois(1000, 1), lambda = 1)
poisson_df2 <- tibble(x = rpois(1000, 2), lambda = 2)
poisson_df3 <- tibble(x = rpois(1000, 3), lambda = 3)
poisson_df4 <- tibble(x = rpois(1000, 4), lambda = 4)
poisson_df5 <- tibble(x = rpois(1000, 5), lambda = 5)

poisson_df <- bind_rows(poisson_df1, poisson_df2,
                        poisson_df3, poisson_df4, poisson_df5)
```
]]

---

## Let's plot the Poisson Distributions

<img src="d30_simulations_files/figure-html/unnamed-chunk-20-1.png" alt="" width="65%" style="display: block; margin: auto;" />

--

.footnote[
The plot shows the Poisson distributions for different lambda values. The center of the distribution increases as lambda does. It also looks like it becomes more symmetric as lambda increases.]

---

## Let's facet the plot

<img src="d30_simulations_files/figure-html/unnamed-chunk-21-1.png" alt="" width="80%" style="display: block; margin: auto;" />

--

.footnote[It definitely looks like it becomes more symmetric as lambda increases.]

---

# We can keep playing around with the Poisson...

.pull-left-narrow[
- We can keep generating data for different lambda values.
- We can keep plotting the distributions for different lambda values.
- Frankly, it's a great way to develop your intuition for any distribution (Poisson included).
]

--

.pull-right-wide[
<img src="d30_simulations_files/figure-html/unnamed-chunk-22-1.png" alt="" width="77%" style="display: block; margin: auto;" />
]

---

# The possibilities are endless

<img src="d30_simulations_files/figure-html/unnamed-chunk-23-1.png" alt="" width="77%" style="display: block; margin: auto;" />

---
class: middle

## Deep Dive into the Normal Distribution

---

## Using `rnorm()`

### Generating Five Random Numbers

``` r
rnorm(5)
```

```
## [1] -0.84085548 1.38435934 -1.25549186 0.07014277 1.71144087
```

- Specify arguments explicitly for clarity:

``` r
rnorm(n = 5, mean = 0, sd = 1)
```

```
## [1] -1.3705874 0.9512344 -2.0176590 0.8445475 0.3004160
```

---

### Setting the Random Seed for Reproducibility

- Setting the seed ensures reproducibility:

``` r
set.seed(16)
rnorm(n = 5, mean = 0, sd = 1)
```

```
## [1] 0.4764134 -0.1253800 1.0962162 -1.4442290 1.1478293
```

- The same seed gives the same results:

``` r
set.seed(16)
rnorm(n = 5, mean = 0, sd = 1)
```

```
## [1] 0.4764134 -0.1253800 1.0962162 -1.4442290 1.1478293
```

- Without resetting the seed, the next call gives new values:

``` r
rnorm(n = 5, mean = 0, sd = 1)
```

```
## [1] -0.46841204 -1.00595059 0.06356268 1.02497260 0.57314202
```

---

### Changing Parameters in `rnorm()`

``` r
rnorm(n = 5, mean = 50, sd = 20)
```

```
## [1] 86.94364 52.23867 35.07925 83.16427 64.43441
```

- Using vectors for arguments:

``` r
rnorm(n = 10, mean = c(0, 5, 20), sd = c(1, 5, 20))
```

```
## [1] -1.6630805 7.8795477 29.4552023 -0.5427317 10.6384354
## [6] -12.9559523 -0.3141739 4.0865922 49.4095699 -0.8658988
```

--

.tip[
The `mean` and `sd` vectors **recycle** — the first value gets mean 0 and sd 1, the second gets mean 5 and sd 5, the third gets mean 20 and sd 20, the fourth wraps back to mean 0 and sd 1, and so on.
]

---
class: middle

# Simulating Categorical Data with `rep()`

---

## Why Do We Need `rep()`?

.alert[You're setting up a between-subjects experiment with two conditions: **control** and **treatment**.
You need to create group labels for your participants.] -- - `rep()` repeats values in patterns — perfect for creating experimental conditions! --- ## `rep()`: The Basics .question[ What's the difference between `each` and `times`? ] -- ``` r rep(c("control", "treatment"), each = 3) ``` ``` ## [1] "control" "control" "control" "treatment" "treatment" ## [6] "treatment" ``` .midi[`each = 3`: repeat each element 3 times before moving to the next] -- ``` r rep(c("control", "treatment"), times = 3) ``` ``` ## [1] "control" "treatment" "control" "treatment" "control" ## [6] "treatment" ``` .midi[`times = 3`: repeat the whole sequence 3 times] --- ## `rep()`: More Options Unequal group sizes: ``` r rep(c("control", "treatment"), times = c(2, 4)) ``` ``` ## [1] "control" "control" "treatment" "treatment" "treatment" ## [6] "treatment" ``` -- Fixed total length: ``` r rep(c("control", "treatment"), length.out = 5) ``` ``` ## [1] "control" "treatment" "control" "treatment" "control" ``` --- ## `rep()`: Combining `each` and `times` ``` r rep(letters[1:2], each = 2, times = 3) ``` ``` ## [1] "a" "a" "b" "b" "a" "a" "b" "b" "a" "a" "b" "b" ``` -- ``` r rep(letters[1:2], each = 2, length.out = 7) ``` ``` ## [1] "a" "a" "b" "b" "a" "a" "b" ``` -- .tip[ `each` and `times` can be combined! This is useful for factorial designs where you need crossed conditions. ] --- ## Using `letters` and `LETTERS` R has built-in vectors `letters` (a-z) and `LETTERS` (A-Z) that are handy for creating group labels: ``` r rep(letters[1:2], each = 3) ``` ``` ## [1] "a" "a" "a" "b" "b" "b" ``` ``` r rep(LETTERS[1:3], times = 2) ``` ``` ## [1] "A" "B" "C" "A" "B" "C" ``` --- class: middle # Building Simulated Datasets --- ## Putting It Together: A Simple Experiment .question[ You're simulating an experiment where two groups (control vs. treatment) complete a reaction time task. Under the null hypothesis, there's no group difference. What would that data look like? 
]

--

## Simulate Data with No Differences Between Two Groups

``` r
set.seed(42)
data.frame(
  group = rep(letters[1:2], each = 3),
  response = rnorm(n = 6, mean = 0, sd = 1)
)
```

```
##   group   response
## 1     a  1.3709584
## 2     a -0.5646982
## 3     a  0.3631284
## 4     b  0.6328626
## 5     b  0.4042683
## 6     b -0.1061245
```

--

.midi[
Both groups draw from the *same* distribution. Any differences are just random noise.
]

---

## Simulate Data with Differences Between Groups

What if the treatment group is genuinely different?

``` r
set.seed(42)
data.frame(
  group = rep(letters[1:2], each = 3),
  factor = rep(LETTERS[3:5], times = 2),
  response = rnorm(n = 6, mean = c(5, 10), sd = 1)
)
```

```
##   group factor  response
## 1     a      C  6.370958
## 2     a      D  9.435302
## 3     a      E  5.363128
## 4     b      C 10.632863
## 5     b      D  5.404268
## 6     b      E  9.893875
```

--

.midi[
Now the `mean` argument recycles: the means alternate 5, 10, 5, 10, ... down the rows. Watch out, though: because `group` uses `each = 3`, each group here ends up with a mix of both means. To simulate a clean group difference, make the patterns line up, e.g., `group = rep(letters[1:2], times = 3)`, so that every `a` row draws from mean 5 and every `b` row from mean 10.
]

---
class: middle

# Repeated Simulations with `replicate()`

---

## Why Repeat a Simulation?

.alert[Running a simulation once tells you what *could* happen. Running it 1,000 times tells you what *tends* to happen.]

--

.question[
If you flip a coin 10 times, you might get 7 heads. But if you repeat that experiment 1,000 times, you'll see that the average is close to 5. That's the power of repetition!
]

---

## `replicate()`: The Basics

`replicate()` runs the same expression multiple times and collects the results.
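With the default `simplify = TRUE`, the results are collapsed into a vector or matrix, which is handy when each run returns a single number. A minimal sketch:

``` r
set.seed(16)
# Ten sample means, each computed from a fresh draw of 25 values
sample_means <- replicate(n = 10, expr = mean(rnorm(n = 25)))
length(sample_means)  # 10: one mean per replication
```

With `simplify = FALSE`, as on this slide, you get a list instead, which is better when each run returns a whole dataset.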
``` r
set.seed(16)
replicate(
  n = 3,
  expr = rnorm(n = 5, mean = 0, sd = 1),
  simplify = FALSE
)
```

```
## [[1]]
## [1] 0.4764134 -0.1253800 1.0962162 -1.4442290 1.1478293
## 
## [[2]]
## [1] -0.46841204 -1.00595059 0.06356268 1.02497260 0.57314202
## 
## [[3]]
## [1] 1.8471821 0.1119334 -0.7460373 1.6582137 0.7217206
```

---

## `replicate()`: Simulating Datasets

.hand[.light-blue[This is where it gets powerful.]] We can generate entire datasets over and over:

.small[

``` r
set.seed(16)
simlist <- replicate(
  n = 3,
  expr = data.frame(
    group = rep(letters[1:2], each = 3),
    response = rnorm(n = 6, mean = 0, sd = 1)
  ),
  simplify = FALSE
)
```
]

---

# So what?

.question[
Why would you want 3 (or 3,000!) copies of the same experiment? Think about it before the next slide...
]

---

## Peeking at Our Simulated Datasets

``` r
simlist[[1]]
```

```
##   group   response
## 1     a  0.4764134
## 2     a -0.1253800
## 3     a  1.0962162
## 4     b -1.4442290
## 5     b  1.1478293
## 6     b -0.4684120
```

--

``` r
simlist[[2]]
```

```
##   group    response
## 1     a -1.00595059
## 2     a  0.06356268
## 3     a  1.02497260
## 4     b  0.57314202
## 5     b  1.84718210
## 6     b  0.11193337
```

--

.midi[
Same structure, different random values. Each one is like running the experiment again in a parallel universe.
]

---
class: middle

# A Preview of What's Next...

---

## Simulating Correlated Variables

So far, every variable we've simulated has been **independent**. But real psychological data has *structure*.

--

.question[
If a student scores high on a midterm, what do you expect on the final? Probably also high — the variables are **correlated**.
]

--

``` r
library(MASS)  # provides mvrnorm()

set.seed(123)
sigma <- matrix(c(1, 0.5, 0.5, 1), nrow = 2)
correlated_scores <- mvrnorm(n = 5, mu = c(80, 75), Sigma = sigma)
colnames(correlated_scores) <- c("midterm", "final")
correlated_scores
```

```
##       midterm    final
## [1,] 78.65708 75.37215
## [2,] 79.57020 75.03112
## [3,] 81.98241 75.71735
## [4,] 80.40449 74.71764
## [5,] 80.33480 74.88914
```

---

## What Does `mvrnorm()` Do?
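Before unpacking the arguments, one quick sanity check: if we crank `n` way up, the sample correlation should approach the 0.5 we specified. (`mvrnorm()` lives in the MASS package, so it needs `library(MASS)` first.)

``` r
library(MASS)  # provides mvrnorm()

set.seed(123)
sigma <- matrix(c(1, 0.5, 0.5, 1), nrow = 2)
scores <- mvrnorm(n = 10000, mu = c(80, 75), Sigma = sigma)
cor(scores)[1, 2]  # should land close to 0.5
```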
- `mu` = the means of each variable - `Sigma` = a **covariance matrix** that encodes how the variables relate -- ``` r sigma ``` ``` ## [,1] [,2] ## [1,] 1.0 0.5 ## [2,] 0.5 1.0 ``` .midi[ The 0.5 off-diagonal values mean our two variables have a positive correlation of 0.5 — knowing one gives you some information about the other. ] -- .tip[ Next time, we'll go deeper: building full correlation structures, using simulation for power analysis, and understanding the Central Limit Theorem — all through simulation. ] --- class: middle # Summary: Learning Goals Achieved --- ## What We've Learned Today, you should now be able to... .pull-left[ ### Concepts - ✅ Why simulation matters for experiment planning - ✅ How distributions (normal, uniform, Poisson) shape data - ✅ Why reproducibility requires `set.seed()` ] .pull-right[ ### Skills - ✅ Generate random data with `rnorm()`, `runif()`, `rpois()` - ✅ Create group labels with `rep()` - ✅ Build simulated datasets with multiple variables - ✅ Repeat simulations with `replicate()` ] --- class: middle # Wrapping Up... --- # Sources - Ariel Muldoon's [tutorial](https://github.com/aosmith16/simulation-helper-functions) - Mine Cetinkaya-Rundel's Data Science in a Box ([link](https://datasciencebox.org/))