84 Lab: Simulating data
Welcome to the Mars Colony Simulation Lab! In this lab, we will embark on a journey to Mars, simulating potential colonists’ data to plan our future Mars colony. As potential planners of a future Mars settlement, it is crucial to understand the demographic and professional attributes that will help sustain a human presence on Mars. This lab teaches you to simulate these attributes using R.
Learning goals
Simulate Quantitative Variables: Learn to create simulated quantitative variables representing colonist attributes using
rnorm()
,runif()
, andrpois()
.Generate Character Variables: Use
rep()
to create categorical variables that represent different groups within the colonists, such as their professions or roles in the colony.Replication of Data Simulation: Utilize
replicate()
to repeat the data simulation process, representing multiple scenarios or batches of potential colonists.
Getting started and warming up
Exercises
Instead of analyzing other people’s data, we’ll be generating our own! You will generate basic attributes for 100 potential colonists. These attributes include age, health, and fitness levels, which are critical for survival on Mars.
Exercise 1: Simulating Our Colonists
Imagine you’re choosing the first 100 colonists for Mars. Let’s simulate their ages assuming a mean age of 30 and a standard deviation of 5—aiming for youth but with sufficient experience.
Setting Up Your Simulation. We’ll use the rnorm()
function, which is perfect for this task because it generates normally distributed random numbers. Let’s examine the arguments for rnorm()
using the ?rnorm
command in R.
Parameters:
- n = 100: The number of random ages we want to generate.
- mean = 30: The average age of our colonists. We’re targeting a young, but experienced group of individuals.
- sd = 5: The standard deviation, representing age variability around the mean.
Here’s code to generate an age distribution:
1.1. Create a dataframe to store your colonists’ attributes. I’ve already gotten you started, but you’ll need to add their ages.
## Error in data.frame(id = 1:100, ): argument is missing, with no default
Now, let’s visualize the age distribution of our colonists using a histogram. This will help us understand the diversity within our potential Mars colony.
1.2. Consider the histograms of the age distribution that we’ve generated (as well as two more I added). What do you notice about the age distribution of our colonists? How does the seed affect the spread of the distribution?
What roles would these colonists have? Let’s decide….. We need engineers, scientists, and medics in equal numbers.
We’ll create a variable, role
, with three categories: engineer, scientist, and medic. We’ll use the rep()
function to simulate this categorical variable. I’ve demonstrated several ways you can use rep()
to create the role
variable. Try them out.
set.seed(1)
roleA <- rep(c("engineer", "scientist", "medic"),
length.out = 100)
# this works if we want to set a specific number of each role
roleB <- rep(c("engineer", "scientist", "medic"),
each = 34,
length.out = 100)
roleC <- rep(c("engineer", "scientist", "medic"),
times = c(33, 33, 33),
length.out = 100)
# if you want to use sampling weights
roleD <- sample(c("engineer", "scientist", "medic"),
replace = TRUE,
size = 100, prob = c(1,1,1))
# if you want to randomly shuffle
roleE <- sample(roleB, size = 100, replace = FALSE)
If you want to see the raw data as a table, click here
df <- data.frame(row = rep(1:100, 5),
role = c(roleA, roleB, roleC, roleD, roleE),
method = c(rep("Method A", length(roleA)),
rep("Method B", length(roleB)),
rep("Method C", length(roleC)),
rep("Method D", length(roleD)),
rep("Method E", length(roleE))))
df %>%
DT::datatable(
rownames = FALSE,
class = "cell-border stripe",
filter = list(position = "top"),
options = list(
pageLength = nrow(df)/5,
autoWidth = TRUE,
bInfo = FALSE,
paging = FALSE
)
)
1.3. Create a role variable that suits the needs of your colony, and explain why you chose that method.
Next, we’ll simulate the health and fitness levels of our colonists. Health and fitness are essential for the physical and mental well-being of our colonists, ensuring they can adapt to the harsh Martian environment.
Setting Up Your Simulation. We’ll use the runif()
function to generate random numbers drawn from a uniform distribution. This distribution is ideal for simulating health and fitness levels, as it ensures equal probability across the range of values.
Examine the arguments for runif()
, using the ?runif
command in R. Notice that although some of the arguments are the same, such as n, and parameters for the distribution are different.
On Mars, health is measured using the MARSGAR (Martian Adult Resilience Score Gauging Astronaut Readiness; Musk, 2025) scale, which ranges from 0 to 100. Like the suspiciously-similar Apgar score for newborns, a score of 0 indicates that a colonist has no signs of life: no heartbeat, no breathing, no response to stimulation, no muscle tone, and central cyanosis/pallor. A score of 100 indicates a colonist in perfect health. Take this scoring into consideration when you simulate the health of your colonists.
1.4. Add a uniformally-distributed MARSGAR variable to your colony.
Now, let us examine our colonists. We will create a scatter plot to visualize the relationship between age and health. This will help us understand the health distribution of our potential Mars colony.
What a contrived plot! It seems that there’s no relationship between age and health in the data. How odd! (Or is it?)
Spoiler: It’s not odd. We generated the data univariately, so there’s no reason for there to be a relationship between age and health.
Exercise 2: Growing Our Colonists
Next, we’ll simulate the relationship among several attributes of our colonists. We’ll consider technical skills, problem-solving abilities, psychological resilience, teamwork, and adaptability. These attributes are crucial for the success of our Mars colony, ensuring that our colonists can work together effectively and overcome challenges.
There are several ways to simulate variables with specific correlations. I will show two approaches:
- using multiple steps with rnorm
, which is a less coputationally efficient, but more accessible approach and
- using mvrnorm
, which generates data from a multivariate distribustion.
Basic method
To start, let’s illustrate a simpler method using rnorm
to simulate how certain attributes like age could influence others such as technical skills—assuming that older colonists, with more experience, likely have higher technical skills.
To model this, we define the variable technical_skills
as a linear function of age
, incorporating random noise for variability. Specifically, we use the regression equation technical_skills = 2 * age + noise
, where noise
is normally distributed with a mean of zero and a standard deviation of one. This implies that for each additional year of age, a colonist’s technical skills increase by two points, with some random noise added in.
2.1. Simulate technical skills based on age using the equation technical_skills = 2 * age + noise
.
# Simulate technical skills based on age
# Set the seed for reproducibility
set.seed(1235)
# Create a variable for technical skills
df_colonists$technical_skills <- 2 * df_colonists$age + rnorm(100, mean = 0, sd = 1)
Now, let’s visualize the relationship between age and technical skills using a scatter plot.
2.2. Simulate problem-solving abilities based on their assigned role
Now its your turn! Simulate problem-solving abilities based on their assigned role. Recall that the options are “engineer”, “scientist”, “medic”. First you need to think about the relationship between the role AND the variable being simulated. Then you can write up the equation that would create that relationship. And then you can write the code to simulate the variable.
Click this to see a hint
If you’re stuck, you can check the source code for the solution by examining the following chuck in this corresponding rmd file exercise2.2solution.Click this to see a solution
df_colonists$problem_solving[df_colonists$role == "engineer"] <- rnorm(sum(df_colonists$role == "engineer"), mean = 100, sd = 10)
df_colonists$problem_solving[df_colonists$role == "scientist"] <- rnorm(sum(df_colonists$role == "scientist"), mean = 80, sd = 10)
df_colonists$problem_solving[df_colonists$role == "medic"] <- rnorm(sum(df_colonists$role == "medic"), mean = 60, sd = 10)
Exploring Correlations with mvrnorm
Now, this approach works well if you already know what you want your model to look like. But what if you want to simulate multiple variables with specific correlations? That’s when the mvrnorm
function comes in handy from the MASS
package. This function generates data from a multivariate normal distribution, where we can specify means, standard deviations, and the correlation between variables. This is ideal for simulating more complex interdependencies that might exist among colonists’ traits.
Setting Up Your Simulation. As you’ve hopefully gathered, we’ll use the `mvrnorm
function to generate random numbers drawn from a multivariate distribution. I encourage you to examine the arguments for mvrnorm()
, using the ?mvrnorm
command in R.
Examine the arguments for mvrnorm()
, using the ?mvrnorm
command in R. Notice that the arguments are similar to rnorm()
, but with the addition of the Sigma
parameter, which represents the covariance matrix. This matrix allows us to specify the relationships between the variables we’re simulating.
(Remember, the covariance matrix is a square matrix that shows the variances of each variable on the diagonal and the covariances between variables on the off-diagonals. A helpful trick is that the correlation matrix is just a standardized covariance matrix, where the diagonals are all 1s, and the off-diagonals are the correlations between variables. You can convert between the two using the cov2cor()
and cor2cov()
functions in R.)
Parameters for Simulation:
- n = 100: We are simulating attributes for 100 colonists.
- mu = c( 50, 50): Represents the average scores for Resilience and Agreeableness.
- Sigma = matrix(c(100, 50, 50, 100), ncol = 2): The covariance matrix, where diagonals represent variances and the off-diagonals represent the covariance between the traits.
3.2. Simulate Resilience and Agreeableness
library(MASS)
# Define mean and covariance matrix
mean_traits <- c(50, 50)
cov_matrix <- matrix(c(100, 50,
50, 100), ncol = 2)
# Generate correlated data
traits_data <- mvrnorm(n = 100, mu = mean_traits, Sigma = cov_matrix, empirical = FALSE)
colnames(traits_data) <- c("resilience", "agreeableness")
df_colonists <- cbind(df_colonists, traits_data)
But, now, we need to simulate more attributes for our colonists that don’t all have the same summary statistics. We need to simulate the big five personality traits: openness, conscientiousness, extraversion, agreeableness, and neuroticism. These traits are essential for understanding how colonists will interact with each other and cope with the challenges of living on Mars. Your task to simulate these traits using the mvrnorm
function, using parameters that would make sense for these traits, and examine how closely your colonists match up with your population parameters.
3.3. Simulate Big Five Traits using the Correlation Matrix provided and examine how closely your colonists match up with your parameters.
Population Parameters for Big Five Traits I’ve provided a correlation matrix from Park et al. (2020)6 7 that you can use to get you started on simulating these traits.
Click this to see the code that built the matrix
# Define the bigfive
bigfive <- c("EX", "ES", "AG", "CO", "OP")
# Create an empty matrix with dimensions equal to the number of bigfive
cor_matrix <- matrix(0, ncol = length(bigfive),
nrow = length(bigfive),
dimnames = list(bigfive, bigfive))
# Initialize the diagonal to 1
diag(cor_matrix) <- 1
# Function to set correlation values ensuring symmetry
set_correlation <- function(matrix, row, column, value) {
matrix <- as.matrix(matrix)
matrix[row, column] <- value
matrix[column, row] <- value
return(matrix)
}
#for(i in 1:10) {
# print(paste0("set_correlation(cor_matrix, '",ma_res$construct_x[i],"' ,'", ma_res$construct_y[i],"',",round(ma_res[[6]][[i]][["barebones"]][["mean_r"]],4),")"))}
cor_matrix_bigfive <- cor_matrix <- cor_matrix %>%
set_correlation("ES", "AG", 0.1576) %>%
set_correlation("ES", "CO", 0.2306) %>%
set_correlation("ES", "EX", 0.2599) %>%
set_correlation("ES", "OP", 0.072) %>%
set_correlation("AG", "CO", 0.2866) %>%
set_correlation("AG", "EX", 0.1972) %>%
set_correlation("AG", "OP", 0.1951) %>%
set_correlation("CO", "EX", 0.186) %>%
set_correlation("CO", "OP", 0.1574) %>%
set_correlation("EX", "OP", 0.2949)
print(cor_matrix_bigfive)
## EX ES AG CO OP
## EX 1.00 0.260 0.20 0.19 0.295
## ES 0.26 1.000 0.16 0.23 0.072
## AG 0.20 0.158 1.00 0.29 0.195
## CO 0.19 0.231 0.29 1.00 0.157
## OP 0.29 0.072 0.20 0.16 1.000
Click this to get a matrix you can quickly paste
cor_matrix_bigfive <- matrix(c(
1.0000, 0.2599, 0.1972, 0.1860, 0.2949,
0.2599, 1.0000, 0.1576, 0.2306, 0.0720,
0.1972, 0.1576, 1.0000, 0.2866, 0.1951,
0.1860, 0.2306, 0.2866, 1.0000, 0.1574,
0.2949, 0.0720, 0.1951, 0.1574, 1.0000
), nrow=5, ncol=5, byrow=TRUE,
dimnames=list(c("EX", "ES", "AG", "CO", "OP"),
c("EX", "ES", "AG", "CO", "OP")))
Because this task is a tad more complex than the previous simulation, I’ve created some optional scaffolding by structured hints.
Click here to see the hints
Step 1 Hint
- Begin by defining the parameters for your simulation. This includes the number of colonists, the mean and standard deviation for each of the Big Five personality traits, and the correlation matrix that models the relationships between these traits.
Step 2 Hint
- Simulate the Big Five personality traits using the mvrnorm function. This function will use your defined means, standard deviations, and correlation matrix to generate a dataset that reflects the psychological profiles of your simulated colonists.
Step 3 Hint
- Calculate the means and standard deviations of the Big Five personality traits in your simulated dataset. This will allow you to compare the simulated data to the population parameters you defined earlier.
Click this text for a solution
# Set seed for reproducibility
seed <- 123
set.seed(seed)
# Load the Mass, Matrix, and then tidyverse (otherwise have to use conflicted package to handle conflict
## (note to self that MASS has a select function
library(MASS); library(Matrix); library(tidyverse); library(conflicted)
conflicts_prefer(dplyr::select())
# Number of colonists
n_colonists <- 100
# Define mean and covariance matrix
## "EX", "ES", "AG", "CO", "OP"
var_names <- c( "EX", "ES", "AG", "CO", "OP")
mean_traits <- c(-.5, .5, .25, .5, 0) # Mean of each trait
sd_traits <- c(1, .9, 1, 1, 1) # Standard deviation of each trait
# Get correlation matrix
cor_matrix_bigfive <- matrix(c(
1.0000, 0.2599, 0.1972, 0.1860, 0.2949,
0.2599, 1.0000, 0.1576, 0.2306, 0.0720,
0.1972, 0.1576, 1.0000, 0.2866, 0.1951,
0.1860, 0.2306, 0.2866, 1.0000, 0.1574,
0.2949, 0.0720, 0.1951, 0.1574, 1.0000
), nrow=5, ncol=5, byrow=TRUE,
dimnames=list(c("EX", "ES", "AG", "CO", "OP"),
c("EX", "ES", "AG", "CO", "OP")))
# Check if correlation matrix is positive definite (if all eigenvalues are positive)
eigen_values <- eigen(cor_matrix_bigfive)$values
is_positive_definite <- all(eigen_values > 0)
print(is_positive_definite)
## [1] TRUE
# Convert correlation matrix to covariance matrix
cov_matrix_bigfive <- cor_matrix_bigfive * (sd_traits %*% t(sd_traits))
# Print the covariance matrix
print(cov_matrix_bigfive)
## EX ES AG CO OP
## EX 1.00 0.234 0.20 0.19 0.295
## ES 0.23 0.810 0.14 0.21 0.065
## AG 0.20 0.142 1.00 0.29 0.195
## CO 0.19 0.208 0.29 1.00 0.157
## OP 0.29 0.065 0.20 0.16 1.000
simulated_data <- mvrnorm(n = 100, mu = mean_traits, Sigma = cov_matrix_bigfive)
# Add colonist_id and seed
simulated_data <- cbind.data.frame(colonist_id = 1:n_colonists, # add colonist_id
seed = seed, # add seed
simulated_data) # add simulated data
# Print the first few rows of the simulated data
print(head(simulated_data))
## colonist_id seed EX ES AG CO OP
## 1 1 123 0.90 1.806 -1.00 0.82 0.141
## 2 2 123 -0.55 1.933 -0.48 0.92 0.023
## 3 3 123 -2.01 -0.095 -1.18 -0.23 -0.218
## 4 4 123 -0.40 0.642 -0.72 0.83 0.262
## 5 5 123 -0.18 -0.346 -0.22 0.31 0.585
## 6 6 123 -2.46 0.393 -0.42 -1.11 -0.413
summary_stats_mean <- simulated_data %>%
select(-colonist_id, -seed) %>%
summarise(across(everything(), list(mean = mean))) %>%
rbind(mean_traits) # compare with population parameters from mean_traits
summary_stats_sd <- simulated_data %>%
select(-colonist_id, -seed) %>%
summarise(across(everything(), list(sd = sd))) %>%
rbind(sd_traits) # compare with population parameters from sd_traits
# compare with population parameters from mean_traits and sd_traits
summary_stats <- cbind(summary_stats_mean, summary_stats_sd)
#
summary_stats_cor <- simulated_data %>%
select(-colonist_id, -seed) %>%
cor() # compare with population parameters from cor_matrix_bigfive
summary_stats
## EX_mean ES_mean AG_mean CO_mean OP_mean EX_sd ES_sd AG_sd CO_sd OP_sd
## 1 -0.43 0.45 0.081 0.43 -0.053 1 0.88 0.97 0.97 0.85
## 2 -0.50 0.50 0.250 0.50 0.000 1 0.90 1.00 1.00 1.00
Exercise 4: Preparing for the Unexpected
Generate the big five for 100 colonies of 100 colonists, repeating this process multiple times to how much our colony might look if we settled on 100 different planets. We’ve already made the colonists for one planet, so we’ll just need to replicate that process 99 more times. There are two major approaches to take here. You can either use replicate, which is easier, or a for loop, which is more flexible.
4.1. Generate 100 colonies and extract the mean and standard deviation for extraversion.
Although you can get the summary statistic for each variable, let’s focus on the mean and standard deviation for extraversion as well as its correlation with openness. Please plot the distribution of the correlation coefficients to see how consistent the relationship between extraversion and openness is across different planets. Consider how this distribution might differ across colonists. How large of a sample size would you need to get a stable estimate of the correlation between extraversion and openness?
4.2. Plot the distributions of those statistics from your new empire of colonies, and include an annotation in the plot to show the population parameters.
Click this text for a hint about implementing with replicate
You can use thereplicate
function to repeat the simulation process multiple times. Remember to append the results of each simulation to a single dataset.
After running the simulations, you can use the group_by
and summarize
functions to get the mean and standard deviation for extraversion and its correlation with openness.
Click this text for a hint about implementing with a forloop
You can use a for loop to repeat the simulation process multiple times. Remember to append the results of each simulation to a single dataset. After running the simulations, you can use thegroup_by
and summarize
functions to get the mean and standard deviation for extraversion and its correlation with openness.
Click this text for forloop based solution
library(dplyr)
conflicts_prefer(dplyr::select())
set.seed(124)
num_simulations <- 100 # Number of times to simulate the colonist data
all_simulations <- data.frame() # Create an empty data frame to store the results
for (i in 1:num_simulations) {
# Simulate the big five personality traits
simulated_data <- mvrnorm(n = 100, mu = mean_traits, Sigma = cov_matrix_bigfive)
# Add colonist_id and seed
simulated_data <- cbind.data.frame(colonist_id = 1:n_colonists, # add colonist_id
rep = i,
simulated_data) # add simulated data
# Append the results
all_simulations <- rbind(all_simulations, simulated_data)
}
summary_stats <- all_simulations %>% group_by(rep) %>%
select(-colonist_id) %>%
summarise(across(everything(), list(mean = mean, sd = sd)))
# compare with population parameters from mean_traits
summary_stats_mean <- all_simulations %>%
select(-colonist_id) %>%
summarise(across(everything(), list(mean = mean))) %>%
rbind(mean_traits) # compare with population parameters from mean_traits
summary_stats_sd <- all_simulations %>%
select(-colonist_id) %>%
summarise(across(everything(), list(sd = sd))) %>%
rbind(sd_traits) # compare with population parameters from sd_traits
# compare with population parameters from mean_traits and sd_traits
summary_stats <- cbind(summary_stats_mean, summary_stats_sd)
#
summary_stats_cor <- simulated_data %>%
select(-colonist_id) %>%
cor() # compare with population parameters from cor_matrix_bigfive
Stretch Tasks (Optional)
In space colonization, just like in any complex project management, it’s essential to prepare for variability and uncertainty. To test the resilience of our simulated Mars colony, we’ll generate multiple sets of potential colonists at different sample sizes. By examining these various batches, we can assess how many people we actually need in our colony and how that affects are summary statistics.
Click this text for forloop based solution
library(dplyr)
set.seed(124)
sample_sizes <- seq(30, 300, by = 15) # Varying sample sizes
repetitions_per_condition <- 20 # Number of repetitions for each sample size
mean_skills <- c(50, 50) # Mean technical skills and problem-solving abilities
cov_skills <- matrix(c(100, 50, 50, 100), ncol = 2) # Covariance matrix for skills
# Initialize a DataFrame to store results
simulation_results <- data.frame(
Condition = integer(),
SampleSize = integer(),
Repetition = integer(),
Covariance = numeric()
)
# Nested loop for simulations
for (size in sample_sizes) {
for (rep in 1:repetitions_per_condition) {
skills_data <- mvrnorm(n = size, mu = mean_skills, Sigma = cov_skills, empirical = FALSE)
current_covariance <- cov(skills_data[, 1], skills_data[, 2])
# Append results
simulation_results <- rbind(simulation_results, data.frame(
SampleSize = size,
Repetition = rep,
Covariance = current_covariance
))
}
}
num_simulations <- 100 # Number of times to simulate the colonist data
all_simulations <- replicate(100, mvrnorm(n = 100, mu = mean_skills, Sigma = cov_skills, empirical = FALSE))
# Initialize a DataFrame to store results
set.seed(124)
mean_skills <- c(50, 50) # Mean technical skills and problem-solving abilities
cov_skills <- matrix(c(100, 50, 50, 100), ncol = 2) # Covariance matrix for skills
num_simulations <- 100 # Number of times to simulate the colonist data
all_simulations <- replicate(num_simulations,
mvrnorm(n = 100, mu = mean_traits,
Sigma = cov_matrix_bigfive,
empirical = FALSE), simplify = FALSE)
sample_sizes <- seq(30, 300, by = 15) # Varying sample sizes
repetitions_per_condition <- 20 # Number of repetitions for each sample size
# Initialize a DataFrame to store results
simulation_results_cov <- data.frame(
Condition = integer(),
SampleSize = integer(),
Repetition = integer(),
Covariance = numeric()
)
# Nested loop for simulations
for (size in sample_sizes) {
for (rep in 1:repetitions_per_condition) {
skills_data <- mvrnorm(n = size, mu = mean_skills, Sigma = cov_skills, empirical = FALSE)
current_covariance <- cov(skills_data[, 1], skills_data[, 2])
# Append results
simulation_results_cov <- rbind(simulation_results_cov, data.frame(
SampleSize = size,
Repetition = rep,
Covariance = current_covariance
))
}
}
# Plotting the average covariance for each sample size
average_covariances <- simulation_results_cov %>%
group_by(SampleSize) %>%
summarize(AverageCovariance = mean(Covariance))
simulation_results_cov$SampleSize_factor <- as.factor(simulation_results_cov$SampleSize)
ggplot(simulation_results_cov, aes(x = Covariance, fill = SampleSize_factor)) +
# geom_histogram(position = position_dodge()) +
geom_density(position = position_dodge(), alpha=.6) +
labs(title = "Average Covariance by Sample Size", ) +
scale_fill_viridis_d() + # This adds a nice color gradient based on the 'method'
theme_minimal()
Conclusion
Congratulations on completing the Mars Colony Simulation Lab! Today, you’ve applied your skills in R to simulate and analyze potential attributes of Mars colonists, preparing you for complex data analysis tasks in both theoretical and practical scenarios. Through various exercises, you practiced using R to create distributions, generate categorical variables, and explore relationships between simulated traits.
Final Checklist
- Ensure that you have executed all the codes and documented your findings in this lab.
- Review the simulations you have created and confirm that all data frames and plots are correctly generated and annotated.
Reflect on Your Learning
- What insights have you gained about the potential challenges and needs of a Mars colony?
- How can the simulation techniques you practiced today be applied to other areas of research or data science?
Next Steps
- Commit all your changes to your project repository. Use a comprehensive commit message that reflects the completion of this lab.
- Push your changes to ensure everything is updated in your remote repository.
I encourage you to explore further with the stretch tasks provided and utilize the resources listed to deepen your understanding of statistical simulations.
Park, H. H., Wiernik, B. M., Oh, I. S., Gonzalez-Mulé, E., Ones, D. S., & Lee, Y. (2020). Meta-analytic five-factor model personality intercorrelations: Eeny, meeny, miney, moe, how, which, why, and where to go. The Journal of applied psychology, 105(12), 1490–1529. https://doi.org/10.1037/apl0000476↩︎