84 Repeatedly simulate data with replicate()

The replicate() function is a real workhorse when making repeated simulations. It is a member of the apply family in R, and is specifically made (per the documentation) for the repeated evaluation of an expression (which will usually involve random number generation).

We want to repeatedly simulate data that involves random number generation, so that sounds like a useful tool.

The replicate() function takes three arguments:

  • n, which is the number of replications to perform. This is to set the number of repeated runs we want.
  • expr, the expression that should be run repeatedly. This is often a function.
  • simplify, which controls the type of output the results of expr are saved into. Use simplify = FALSE to get output saved into a list instead of in an array.

84.1 Simple example of replicate()

Let’s say we want to simulate some values from a normal distribution, which we can do using the rnorm() function as above. But now instead of drawing some number of values from a distribution one time we want to do it many times. This could be something we’d do when demonstrating the central limit theorem, for example.

Doing the random number generation many times is where replicate() comes in. It allows us to run the function in expr exactly n times.

Here I’ll generate 5 values from a standard normal distribution three times. Notice the addition of simplify = FALSE to get a list as output.

The output below is a list of three vectors. Each vector is from a unique run of the function, so contains five random numbers drawn from the normal distribution with a mean of 0 and standard deviation of 1.

set.seed(16)
replicate(n = 3, 
          expr = rnorm(n = 5, mean = 0, sd = 1), 
          simplify = FALSE )
#> [[1]]
#> [1]  0.476 -0.125  1.096 -1.444  1.148
#> 
#> [[2]]
#> [1] -0.4684 -1.0060  0.0636  1.0250  0.5731
#> 
#> [[3]]
#> [1]  1.847  0.112 -0.746  1.658  0.722

Note if I don’t use simplify = FALSE I will get a matrix of values instead of a list. Each column in the matrix is the output from one run of the function.

In this case there will be three columns in the output, one for each run, and 5 rows. This can be a useful output type for some simulations. I focus on list output throughout the rest of this post only because that’s what I have been using recently for simulations.

set.seed(16)
replicate(n = 3, 
          expr = rnorm(n = 5, mean = 0, sd = 1) )
#>        [,1]    [,2]   [,3]
#> [1,]  0.476 -0.4684  1.847
#> [2,] -0.125 -1.0060  0.112
#> [3,]  1.096  0.0636 -0.746
#> [4,] -1.444  1.0250  1.658
#> [5,]  1.148  0.5731  0.722

84.2 An equivalent for() loop example

A for() loop can be used in place of replicate() for simulations. With time and practice I’ve found replicate() to be much more convenient in terms of writing the code. However, in my experience some folks find for() loops intuitive when they are starting out in R. I think it’s because for() loops are more explicit on the looping process: the user can see the values that i takes and the output for each i iteration is saved into the output object because the code is written out explicitly.

In my example I’ll save the output of each iteration of the loop into a list called list1. I initialize this as an empty list prior to starting the loop. To match what I did with replicate() I do three iterations of the loop (i in 1:3), drawing 5 values via rnorm() each time.

The result is identical to my replicate() code above. It took a little more code to do it but the process is very clear since it is explicitly written out.

set.seed(16)
list1 <- list() # Make an empty list to save output in
for (i in 1:3) { # Indicate number of iterations with "i"
    list1[[i]] <- rnorm(n = 5, mean = 0, sd = 1) # Save output in list for each iteration
}
list1
#> [[1]]
#> [1]  0.476 -0.125  1.096 -1.444  1.148
#> 
#> [[2]]
#> [1] -0.4684 -1.0060  0.0636  1.0250  0.5731
#> 
#> [[3]]
#> [1]  1.847  0.112 -0.746  1.658  0.722

84.3 Using replicate() to repeatedly make a dataset

Earlier we were making datasets with random numbers and some grouping variables. Our code looked like

data.frame(group = rep(letters[1:2], each = 3),
           response = rnorm(n = 6, mean = 0, sd = 1) )
#>   group response
#> 1     a   -1.663
#> 2     a    0.576
#> 3     a    0.473
#> 4     b   -0.543
#> 5     b    1.128
#> 6     b   -1.648

We could put this process as the expr argument in replicate() to get many simulated datasets. I would do something like this if I wanted to compare the long-run performance of two different statistical tools using the exact same random datasets.

I’ll replicate things 3 times again to easily see the output. I still use simplify = FALSE to get things into a list.

simlist <- replicate(n = 3, 
          expr = data.frame(group = rep(letters[1:2], each = 3),
                            response = rnorm(n = 6, mean = 0, sd = 1) ),
          simplify = FALSE)

We can see this result is a list of three data.frames.

str(simlist)
#> List of 3
#>  $ :'data.frame':    6 obs. of  2 variables:
#>   ..$ group   : chr [1:6] "a" "a" "a" "b" ...
#>   ..$ response: num [1:6] -0.314 -0.183 1.47 -0.866 1.527 ...
#>  $ :'data.frame':    6 obs. of  2 variables:
#>   ..$ group   : chr [1:6] "a" "a" "a" "b" ...
#>   ..$ response: num [1:6] 1.03 0.84 0.217 -0.673 0.133 ...
#>  $ :'data.frame':    6 obs. of  2 variables:
#>   ..$ group   : chr [1:6] "a" "a" "a" "b" ...
#>   ..$ response: num [1:6] -0.943 -1.022 0.281 0.545 0.131 ...

Here is the first one.

simlist[[1]]
#>   group response
#> 1     a   -0.314
#> 2     a   -0.183
#> 3     a    1.470
#> 4     b   -0.866
#> 5     b    1.527
#> 6     b    1.054