class: center, middle, inverse, title-slide

.title[
# Data Simulations in R
📈
]
.author[
### S. Mason Garrison
]

---

layout: true

<div class="my-footer">
<span>
<a href="https://DataScience4Psych.github.io/DataScience4Psych/" target="_blank">Data Science for Psychologists</a>
</span>
</div>

---

class: middle

# Getting Started with Data Simulations in R

---

## Introduction to Data Simulations

Simulating data is a powerful tool in statistics and data science.

.pull-left[
- It allows us to create artificial data that mimics real-world scenarios.
- Useful for:
  - Understanding statistical properties.
  - Testing hypotheses.
  - Demonstrating data analysis techniques.
]

.pull-right[
<br><br>
<img src="img/data scientist at work.png" width="90%" style="display: block; margin: auto;" />
]

---

## Generating Random Numbers

Normal Distribution (`rnorm()`)

.pull-left[
- Commonly used for generating data that follows a Gaussian distribution.
- Parameters: `n` (number of observations), `mean`, `sd` (standard deviation).

```r
set.seed(1234)
observations <- rnorm(n = 5, mean = 0, sd = 1)
observations
```

```
## [1] -1.2070657  0.2774292  1.0844412 -2.3456977  0.4291247
```
]

.pull-right[
<img src="d30_simulations_files/figure-html/unnamed-chunk-4-1.png" width="90%" style="display: block; margin: auto;" />
]

---

## Uniform Distribution (`runif()`)

.pull-left[
- Generates data evenly distributed over a specified range.
- Parameters: `n`, `min`, `max`.

<br><br>

```r
set.seed(1235)
observations <- runif(n = 5, min = 0, max = 1)
observations
```

```
## [1] 0.24259237 0.51535594 0.09942167 0.90153593 0.83890292
```
]

.pull-right[
<br>
<img src="d30_simulations_files/figure-html/unnamed-chunk-6-1.png" width="90%" style="display: block; margin: auto;" />
]

---

## Poisson Distribution (`rpois()`)

.pull-left[
- Used for count-based data, such as the number of events in a fixed period.
- Parameter: `lambda` (mean).
<br><br>

```r
set.seed(123)
observations <- rpois(n = 5, lambda = 2)
observations
```

```
## [1] 1 3 2 4 4
```
]

.pull-right[
<br><br>
<img src="d30_simulations_files/figure-html/unnamed-chunk-8-1.png" width="90%" style="display: block; margin: auto;" />
]

---

# Let's dig deeper into the Poisson distribution.

.pull-left-narrow[
- Lambda is the mean number of events in a fixed interval.
- For demonstration purposes, we will generate data for different lambda values one at a time.
]

.pull-right-wide[
.tip[
You don't have to generate Poisson-distributed data for different lambda values one at a time (`lapply()` or purrr is great for this).
]

.midi[
```r
library(tidyverse) # provides tibble() and bind_rows()

set.seed(123)
# One tibble of 1,000 draws per lambda value
poisson_df1 <- tibble(x = rpois(1000, 1), lambda = 1)
poisson_df2 <- tibble(x = rpois(1000, 2), lambda = 2)
poisson_df3 <- tibble(x = rpois(1000, 3), lambda = 3)
poisson_df4 <- tibble(x = rpois(1000, 4), lambda = 4)
poisson_df5 <- tibble(x = rpois(1000, 5), lambda = 5)

# Stack the five tibbles into one long data frame
poisson_df <- bind_rows(poisson_df1, poisson_df2, poisson_df3,
                        poisson_df4, poisson_df5)
```
]]

---

## Let's plot the Poisson Distributions

<img src="d30_simulations_files/figure-html/unnamed-chunk-10-1.png" width="65%" style="display: block; margin: auto;" />

--

.footnote[
The plot shows the Poisson distributions for different lambda values. The center of the distribution increases as lambda does. It also looks like the distribution becomes more symmetric as lambda increases.
]

---

## Let's facet the plot

<img src="d30_simulations_files/figure-html/unnamed-chunk-11-1.png" width="80%" style="display: block; margin: auto;" />

--

.footnote[It definitely looks like it becomes more symmetric as lambda increases.]

---

# We can keep playing around with the Poisson...

.pull-left-narrow[
- We can keep generating data for different lambda values.
- We can keep plotting the distributions for different lambda values.
- Frankly, it's a great way to develop your intuition for any distribution (Poisson included).
]

--

.pull-right-wide[
<img src="d30_simulations_files/figure-html/unnamed-chunk-12-1.png" width="77%" style="display: block; margin: auto;" />
]

---

# The possibilities are endless

<img src="d30_simulations_files/figure-html/unnamed-chunk-13-1.png" width="77%" style="display: block; margin: auto;" />

---

class: middle

## Deep Dive into the Normal Distribution

---

## Using `rnorm()`

### Generating Five Random Numbers

```r
rnorm(5)
```

```
## [1] -0.84085548  1.38435934 -1.25549186  0.07014277  1.71144087
```

- Specify arguments explicitly for clarity:

```r
rnorm(n = 5, mean = 0, sd = 1)
```

```
## [1] -1.3705874  0.9512344 -2.0176590  0.8445475  0.3004160
```

---

### Setting the Random Seed for Reproducibility

- Setting the seed ensures reproducibility:

```r
set.seed(16)
rnorm(n = 5, mean = 0, sd = 1)
```

```
## [1]  0.4764134 -0.1253800  1.0962162 -1.4442290  1.1478293
```

- Resetting the same seed gives the same results:

```r
set.seed(16)
rnorm(n = 5, mean = 0, sd = 1)
```

```
## [1]  0.4764134 -0.1253800  1.0962162 -1.4442290  1.1478293
```

- Without resetting the seed, the next call returns different numbers:

```r
rnorm(n = 5, mean = 0, sd = 1)
```

```
## [1] -0.46841204 -1.00595059  0.06356268  1.02497260  0.57314202
```

---

### Changing Parameters in `rnorm()`

```r
rnorm(n = 5, mean = 50, sd = 20)
```

```
## [1] 86.94364 52.23867 35.07925 83.16427 64.43441
```

- Using vectors for arguments (shorter vectors are recycled across the `n` draws):

```r
rnorm(n = 10, mean = c(0, 5, 20), sd = c(1, 5, 20))
```

```
## [1]  -1.6630805   7.8795477  29.4552023  -0.5427317  10.6384354
## [6] -12.9559523  -0.3141739   4.0865922  49.4095699  -0.8658988
```

---

class: middle

# Simulating Categorical Data with `rep()`

---

## Generate Character Vectors with `rep()`

---

### Using `letters` and `LETTERS`

```r
rep(letters[1:2], each = 3)
```

```
## [1] "a" "a" "a" "b" "b" "b"
```

```r
rep(letters[1:2], times = 3)
```

```
## [1] "a" "b" "a" "b" "a" "b"
```

```r
rep(letters[1:2], length.out = 5)
```

```
## [1] "a" "b" "a" "b" "a"
```

```r
rep(letters[1:2], times = c(2, 4))
```

```
## [1] "a" "a" "b" "b" "b" "b"
```

```r
rep(letters[1:2], each = 2, times = 3)
```

```
## [1] "a" "a" "b" "b" "a" "a" "b" "b" "a" "a" "b" "b"
```

```r
rep(letters[1:2], each = 2, length.out = 7)
```

```
## [1] "a" "a" "b" "b" "a" "a" "b"
```

---

class: middle

# Creating Datasets with Quantitative and Categorical Variables

---

## Simulate Data with No Difference Between Two Groups

```r
data.frame(group = rep(letters[1:2], each = 3),
           response = rnorm(n = 6, mean = 0, sd = 1))
```

```
##   group   response
## 1     a  1.5274670
## 2     a  1.0541781
## 3     a  1.0300710
## 4     b  0.8401609
## 5     b  0.2169647
## 6     b -0.6725256
```

---

## Simulate Data with Differences Among Groups

```r
data.frame(group = rep(letters[1:2], each = 3),
           factor = rep(LETTERS[3:5], times = 2),
           response = rnorm(n = 6, mean = c(5, 10), sd = 1))
```

```
##   group factor  response
## 1     a      C  5.132599
## 2     a      D  9.929073
## 3     a      E  4.057305
## 4     b      C  8.977969
## 5     b      D  5.280555
## 6     b      E 10.544783
```

- Careful: `mean = c(5, 10)` is recycled row by row (5, 10, 5, ...), so the means alternate within each group. To give each group its own mean, use `mean = rep(c(5, 10), each = 3)`.

---

class: middle

# Repeated Simulations with `replicate()`

---

## Using `replicate()` for Repeated Simulations

### Simple Example of `replicate()`

```r
set.seed(16)
replicate(n = 3,
          expr = rnorm(n = 5, mean = 0, sd = 1),
          simplify = FALSE)
```

```
## [[1]]
## [1]  0.4764134 -0.1253800  1.0962162 -1.4442290  1.1478293
## 
## [[2]]
## [1] -0.46841204 -1.00595059  0.06356268  1.02497260  0.57314202
## 
## [[3]]
## [1]  1.8471821  0.1119334 -0.7460373  1.6582137  0.7217206
```

---

### Creating Multiple Datasets

```r
# simplify = FALSE keeps the result as a list of three data frames
simlist <- replicate(n = 3,
                     expr = data.frame(group = rep(letters[1:2], each = 3),
                                       response = rnorm(n = 6, mean = 0, sd = 1)),
                     simplify = FALSE)
```

---

class: middle

# Wrapping Up...

---

# Sources

- Ariel Muldoon's [tutorial](https://github.com/aosmith16/simulation-helper-functions)
- Mine Çetinkaya-Rundel's Data Science in a Box ([link](https://datasciencebox.org/))

---