83 Creating datasets with quantitative and categorical variables

We now have tools for creating both quantitative and categorical data, which means it’s time to make some datasets! We’ll create several simple ones to get the general idea.

83.1 Simulate data with no difference between two groups

Let’s start by simulating data that we would use in a simple two-sample analysis with no difference between groups. We’ll make a total of 6 observations, three in each group.

We’ll be using the tools we reviewed above but will now name the outputs and combine them into a data.frame. This last step isn’t always necessary, but it does help keep things organized in certain types of simulations.

First we’ll make separate vectors for the continuous and categorical data and then bind them together via data.frame().

Notice there is no need to use cbind() here, which is commonly done by R beginners (I know I did!). Instead we can use data.frame() directly.

group <- rep(letters[1:2], each = 3)
response <- rnorm(n = 6, mean = 0, sd = 1)
data.frame(group,
           response)
#>   group response
#> 1     a    0.493
#> 2     a    0.523
#> 3     a    1.237
#> 4     b    0.356
#> 5     b    0.575
#> 6     b   -0.422
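
As a small sketch of why data.frame() is the better tool here: cbind() on a character vector and a numeric vector returns a character matrix, so the numbers are stored as text, while data.frame() keeps each column's own type. The seed and the object names m and dat are only for illustration.

```r
set.seed(1)  # arbitrary seed, only so the sketch is reproducible
group <- rep(letters[1:2], each = 3)
response <- rnorm(n = 6, mean = 0, sd = 1)

# cbind() coerces both vectors to a common type: a character matrix
m <- cbind(group, response)
is.character(m)
#> [1] TRUE

# data.frame() keeps the response column numeric
dat <- data.frame(group, response)
is.numeric(dat$response)
#> [1] TRUE
```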

When I make a data.frame like this I prefer to make my vectors and the data.frame simultaneously to avoid having a lot of variables cluttering up my R Environment.

I often teach/blog with all the steps clearly delineated as I think it’s easier when you are starting out, so (as always) use the method that works for you.

data.frame(group = rep(letters[1:2], each = 3),
           response = rnorm(n = 6, mean = 0, sd = 1) )
#>   group response
#> 1     a    0.402
#> 2     a    0.959
#> 3     a   -1.876
#> 4     b   -0.212
#> 5     b    1.437
#> 6     b    0.386

Now let’s add another categorical variable to this dataset.

Say we are in a situation involving two factors, not one. We have a single observation for every combination of the two factors (i.e., the two factors are crossed).

The second factor, which we’ll call factor, will take on the values “C”, “D”, and “E”.

LETTERS[3:5]
#> [1] "C" "D" "E"

We need to repeat the values so that every combination of group and factor is present in the dataset exactly once.

Remember the group factor is repeated elementwise.

rep(letters[1:2], each = 3)
#> [1] "a" "a" "a" "b" "b" "b"

We need to repeat the three values twice. But what argument do we use in rep() to do so?

rep(LETTERS[3:5], ?)

Does each work?

rep(LETTERS[3:5], each = 2)
#> [1] "C" "C" "D" "D" "E" "E"

No, if we use each then each element is repeated twice and some of the combinations of group and factor are missing.

This is a job for the times or length.out arguments, so the whole vector is repeated. We can repeat the whole vector twice using times, or use length.out = 6. I do the former.

In the result below we can see every combination of the two factors is present once.

data.frame(group = rep(letters[1:2], each = 3),
           factor = rep(LETTERS[3:5], times = 2),
           response = rnorm(n = 6, mean = 0, sd = 1) )
#>   group factor response
#> 1     a      C    0.426
#> 2     a      D    0.290
#> 3     a      E   -0.364
#> 4     b      C    1.978
#> 5     b      D    1.087
#> 6     b      E   -0.587
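
If reading the rows by eye feels error-prone, one possible check (a sketch, not the only way to do it) is to count the rows for each combination with table(). The names dat and counts are just for illustration. The last lines confirm that times = 2 and length.out = 6 give the same vector in this case.

```r
dat <- data.frame(group = rep(letters[1:2], each = 3),
                  factor = rep(LETTERS[3:5], times = 2),
                  response = rnorm(n = 6, mean = 0, sd = 1))

# Count the rows for each group-by-factor combination
counts <- table(dat$group, dat$factor)

# Crossed with one observation per cell means every count is 1
all(counts == 1)
#> [1] TRUE

# Two ways to repeat the whole vector, same result here
identical(rep(LETTERS[3:5], times = 2),
          rep(LETTERS[3:5], length.out = 6))
#> [1] TRUE
```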

83.2 Simulate data with a difference among groups

The dataset above is one with “no difference” among groups. What if the means were different between groups? Let’s make two groups of three observations where the mean of one group is 5 and the other is 10. The two groups have a shared variance (and so standard deviation) of 1.

Remembering how rnorm() works with a vector of means is key here. The mean vector is recycled across the draws, so the first draw uses a mean of 5, the second a mean of 10, the third a mean of 5 again, and so on.

response <- rnorm(n = 6, mean = c(5, 10), sd = 1)
response
#> [1]  4.41 12.48  4.74 10.27  5.37 10.02
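
One way to see the order of the means without the noise getting in the way is to set sd = 0, so that every draw equals its mean exactly. This is only a diagnostic sketch, and the name pattern is just for illustration.

```r
# With sd = 0 each draw equals its mean, which exposes the recycling order
pattern <- rnorm(n = 6, mean = c(5, 10), sd = 0)
pattern
#> [1]  5 10  5 10  5 10
```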

How do we get the group pattern correct?

group <- rep(letters[1:2], ?)

We need to repeat the whole vector three times instead of elementwise.

To get the groups in the correct order we need to use times or length.out in rep(). With length.out we define the output length of the vector, which is 6. Alternatively we could use times = 3 to repeat the whole vector 3 times.

group <- rep(letters[1:2], length.out = 6)
group
#> [1] "a" "b" "a" "b" "a" "b"

These can then be combined into a data.frame. Working through this process is another reason to sometimes build each vector separately before combining them.

data.frame(group,
           response)
#>   group response
#> 1     a     4.41
#> 2     b    12.48
#> 3     a     4.74
#> 4     b    10.27
#> 5     a     5.37
#> 6     b    10.02
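
If we would rather keep the rows sorted by group, an alternative sketch (not the approach used above) is to repeat the means elementwise so they line up with a group vector built with each = 3. The seed and the names means and dat are only for illustration.

```r
set.seed(16)  # arbitrary seed, only so the sketch is reproducible

# Repeat each mean three times so it lines up with the sorted groups
means <- rep(c(5, 10), each = 3)
means
#> [1]  5  5  5 10 10 10

dat <- data.frame(group = rep(letters[1:2], each = 3),
                  response = rnorm(n = 6, mean = means, sd = 1))

# Group means should land near 5 and 10, though not exactly with n = 3
tapply(dat$response, dat$group, mean)
```

Either layout works, as long as the pattern used for the means matches the pattern used for the groups.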

83.3 Multiple quantitative variables with groups

For our last dataset we’ll have two groups, with 10 observations per group.

rep(LETTERS[3:4], each = 10)
#>  [1] "C" "C" "C" "C" "C" "C" "C" "C" "C" "C" "D" "D" "D" "D" "D" "D" "D" "D" "D"
#> [20] "D"

Let’s make a dataset that has two quantitative variables, unrelated to both each other and the groups. One variable ranges from 10 to 15 and the other from 100 to 150.

How many observations should we draw from each uniform distribution?

runif(n = ?, min = 10, max = 15)

We had 2 groups with 10 observations each and 2*10 = 20. So we need to use n = 20 in runif().

Here is the dataset made in a single step.

data.frame(group = rep(LETTERS[3:4], each = 10),
           x = runif(n = 20, min = 10, max = 15),
           y = runif(n = 20, min = 100, max = 150))
#>    group    x   y
#> 1      C 13.2 127
#> 2      C 13.9 137
#> 3      C 12.7 135
#> 4      C 14.3 123
#> 5      C 11.7 118
#> 6      C 14.6 108
#> 7      C 12.8 142
#> 8      C 13.5 104
#> 9      C 13.9 107
#> 10     C 12.2 145
#> 11     D 12.0 117
#> 12     D 13.7 121
#> 13     D 14.8 145
#> 14     D 11.7 120
#> 15     D 13.3 140
#> 16     D 10.8 107
#> 17     D 14.0 148
#> 18     D 14.9 113
#> 19     D 13.5 105
#> 20     D 14.4 120

What happens if we get this wrong? If we’re lucky we get an error.

data.frame(group = rep(LETTERS[3:4], each = 10),
           x = runif(n = 15, min = 10, max = 15),
           y = runif(n = 15, min = 100, max = 150))
#> Error in data.frame(group = rep(LETTERS[3:4], each = 10), x = runif(n = 15, : arguments imply differing number of rows: 20, 15

But if we get things wrong and the number we use divides evenly into the number we need, R will silently recycle the shorter vector to fill the data.frame.

This is a hard mistake to catch. If you look carefully through the output below you can see that the continuous variables start to repeat at row 11.

data.frame(group = rep(LETTERS[3:4], each = 10),
           x = runif(n = 10, min = 10, max = 15),
           y = runif(n = 10, min = 100, max = 150))
#>    group    x   y
#> 1      C 12.3 108
#> 2      C 13.8 115
#> 3      C 12.4 105
#> 4      C 10.1 125
#> 5      C 10.8 130
#> 6      C 11.0 129
#> 7      C 11.5 149
#> 8      C 13.5 139
#> 9      C 11.6 120
#> 10     C 12.9 120
#> 11     D 12.3 108
#> 12     D 13.8 115
#> 13     D 12.4 105
#> 14     D 10.1 125
#> 15     D 10.8 130
#> 16     D 11.0 129
#> 17     D 11.5 149
#> 18     D 13.5 139
#> 19     D 11.6 120
#> 20     D 12.9 120
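
One way to guard against this kind of silent recycling, sketched here as a suggestion rather than a rule, is to derive n from the group vector instead of typing it by hand, and to check the lengths before building the data.frame. The names n_per_group, n_total, dat, and recycled are only for illustration. The anyDuplicated() check relies on the assumption that independent draws from a continuous distribution essentially never repeat.

```r
n_per_group <- 10
group <- rep(LETTERS[3:4], each = n_per_group)
n_total <- length(group)  # derived from the groups rather than typed by hand

x <- runif(n = n_total, min = 10, max = 15)
y <- runif(n = n_total, min = 100, max = 150)

# Stop with an error if any vector has the wrong length
stopifnot(length(x) == n_total, length(y) == n_total)
dat <- data.frame(group, x, y)

# After the fact, repeated values in a continuous variable hint at recycling.
# anyDuplicated() returns the index of the first repeat, or 0 if there is none.
recycled <- data.frame(group, x = runif(n = 10, min = 10, max = 15))
anyDuplicated(recycled$x)  # nonzero: the 10 values were recycled
anyDuplicated(dat$x)       # 0: all 20 values are distinct draws
```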