79 Lab: Cross Validation in Action
Don’t Judge the Ship After It Sinks
We’ve been hired by Lloyd’s of London to help quantify risk in maritime travel. After the Titanic disaster in 1912 and the claims that followed, Lloyd’s wants models that can predict survival outcomes using passenger information.
In this lab, you will build classification models for Titanic survival and learn why apparent accuracy can be misleading. You’ll then evaluate performance more honestly using data splitting and cross validation.
Learning goals
- Fit and interpret logistic regression classifiers for binary outcomes in R.
- Compute predicted probabilities, convert them into class predictions, and calculate accuracy.
- Explain why apparent (in-sample) accuracy is optimistic.
- Evaluate models using held-out data and implement k-fold cross validation.
Warm up
Before we introduce the data, let’s warm up with some simple exercises.
YAML
Open the R Markdown (Rmd) file in your project, change the author name to your name, and knit the document.
Committing and pushing changes
- Go to the Git pane in your RStudio.
- View the Diff and confirm that you are happy with the changes.
- Add a commit message like “Update author name” in the Commit message box and hit Commit.
- Click on Push. This will prompt a dialogue box where you first need to enter your user name, and then your password.
The data
We will work with:
- titanic_train and titanic_test from the titanic package (a convenient train/test split often used in Kaggle contexts).
- titanic3, loaded from data/titanic3.xls, which includes additional fields. Originally from here.
library(tidyverse)  # dplyr, stringr, and friends, used throughout the lab
library(readxl)     # read_excel()
library(titanic)    # titanic_train and titanic_test
titanic3 <- read_excel("data/titanic3.xls",
  col_types = c("numeric", "numeric", "text",
                "text", "numeric", "numeric", "numeric",
                "text", "numeric", "text", "text",
                "text", "text", "text"))
data("titanic_train")
data("titanic_test")
Before modeling, we will do one small cleanup: standardize column names to lowercase for titanic_test and titanic_train.
Exercises
Apparent accuracy
Before we talk about cross validation, we need to see the problem it is designed to fix.
Suppose we fit a model and then immediately ask: “How well does this model predict the data?” If we use the same data to fit the model and to evaluate it, we are letting the model see the answers ahead of time.
Let’s do that on purpose.
- Fit a logistic regression predicting survival from passenger sex and class. Save the model as m_apparent.
m_apparent <- glm(
survived ~ sex + pclass,
data = titanic3,
family = binomial
)
summary(m_apparent)
#>
#> Call:
#> glm(formula = survived ~ sex + pclass, family = binomial, data = titanic3)
#>
#> Coefficients:
#> Estimate Std. Error z value Pr(>|z|)
#> (Intercept) 2.963 0.235 12.6 <2e-16
#> sexmale -2.515 0.147 -17.1 <2e-16
#> pclass -0.860 0.085 -10.1 <2e-16
#>
#> (Dispersion parameter for binomial family taken to be 1)
#>
#> Null deviance: 1741.0 on 1308 degrees of freedom
#> Residual deviance: 1257.2 on 1306 degrees of freedom
#> AIC: 1263
#>
#> Number of Fisher Scoring iterations: 4
This model has access to every passenger and the final outcome. It is not being asked to predict the future. It is being asked to summarize the past.
Now we use the model to generate predicted survival probabilities for every passenger.
Each probability represents how confident the model is that a passenger survived.
To turn probabilities into decisions, we need a rule. We’ll use a cutoff of 0.5.
Finally, we compute the model’s apparent accuracy.
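A minimal sketch of those three steps (the object names p_apparent, yhat_apparent, and acc_apparent are just suggestions):
# Predicted survival probabilities for the same passengers the model was fit on
p_apparent <- predict(m_apparent, type = "response")
# Convert probabilities into class predictions with a 0.5 cutoff
yhat_apparent <- ifelse(p_apparent > 0.5, 1, 0)
# Apparent (in-sample) accuracy: share of passengers classified correctly
acc_apparent <- mean(yhat_apparent == titanic3$survived)
acc_apparent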
At this point, the model looks fairly good. In fact, the number looks reassuring. But it answers the wrong question. The model is being evaluated on passengers it already “knows.” This is like asking, after the Titanic sank, whether you can explain who survived. Of course you can.
Cross validation exists because this number tells us very little about how well the model would perform before the disaster. Real predictions are made before the ship hits the iceberg.
Holding passengers back
To get a more honest answer, we need to pretend that some passengers are unknown.
We’ll do this by splitting the data into two groups:
- a training set, used to fit the model
- a test set, used only for evaluation
Conveniently, the titanic package already provides such a split.
titanic_test %>% glimpse()
#> Rows: 418
#> Columns: 11
#> $ passengerid <int> 892, 893, 894, 895, 896, 897, 898, 899, 900, 901, 902, 903…
#> $ pclass <int> 3, 3, 2, 3, 3, 3, 3, 2, 3, 3, 3, 1, 1, 2, 1, 2, 2, 3, 3, 3…
#> $ name <chr> "Kelly, Mr. James", "Wilkes, Mrs. James (Ellen Needs)", "M…
#> $ sex <chr> "male", "female", "male", "male", "female", "male", "femal…
#> $ age <dbl> 34, 47, 62, 27, 22, 14, 30, 26, 18, 21, NA, 46, 23, 63, 47…
#> $ sibsp <int> 0, 1, 0, 0, 1, 0, 0, 1, 0, 2, 0, 0, 1, 1, 1, 1, 0, 0, 1, 0…
#> $ parch <int> 0, 0, 0, 0, 1, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0…
#> $ ticket <chr> "330911", "363272", "240276", "315154", "3101298", "7538",…
#> $ fare <dbl> 7.8, 7.0, 9.7, 8.7, 12.3, 9.2, 7.6, 29.0, 7.2, 24.1, 7.9, …
#> $ cabin <chr> "", "", "", "", "", "", "", "", "", "", "", "", "B45", "",…
#> $ embarked <chr> "Q", "S", "Q", "S", "S", "S", "Q", "S", "C", "S", "S", "S"…
Now we fit the same model as before, but only on the training data.
m_split <- glm(
survived ~ sex + pclass,
data = titanic_train,
family = binomial
)
summary(m_split)
#>
#> Call:
#> glm(formula = survived ~ sex + pclass, family = binomial, data = titanic_train)
#>
#> Coefficients:
#> Estimate Std. Error z value Pr(>|z|)
#> (Intercept) 3.295 0.297 11.08 <2e-16
#> sexmale -2.643 0.184 -14.38 <2e-16
#> pclass -0.961 0.106 -9.06 <2e-16
#>
#> (Dispersion parameter for binomial family taken to be 1)
#>
#> Null deviance: 1186.7 on 890 degrees of freedom
#> Residual deviance: 827.2 on 888 degrees of freedom
#> AIC: 833.2
#>
#> Number of Fisher Scoring iterations: 4
p_split <- predict(m_split, type = "response")
yhat_split <- ifelse(p_split > 0.5, 1, 0)
acc_split <- mean(yhat_split == titanic_train$survived)
acc_split
#> [1] 0.79
Click for a solution
m_split <- glm(
survived ~ sex + pclass,
data = titanic_train,
family = binomial
)
summary(m_split)
#>
#> Call:
#> glm(formula = survived ~ sex + pclass, family = binomial, data = titanic_train)
#>
#> Coefficients:
#> Estimate Std. Error z value Pr(>|z|)
#> (Intercept) 3.295 0.297 11.08 <2e-16
#> sexmale -2.643 0.184 -14.38 <2e-16
#> pclass -0.961 0.106 -9.06 <2e-16
#>
#> (Dispersion parameter for binomial family taken to be 1)
#>
#> Null deviance: 1186.7 on 890 degrees of freedom
#> Residual deviance: 827.2 on 888 degrees of freedom
#> AIC: 833.2
#>
#> Number of Fisher Scoring iterations: 4
2.3 Training accuracy
Exercise 3: Cross validation across timelines
A single split is a single alternate timeline. Maybe your held-out passengers were unusually predictable, maybe unusually hard.
Cross validation reduces dependence on one split by repeating the train/test game across multiple partitions.
3.1 Create folds
Create titanic_cv by:
- filtering to complete cases on survived, sex, and pclass,
- setting a seed,
- adding a fold variable with values 1 through 10.
set.seed(___)
titanic_cv <- titanic3 %>%
filter(!is.na(___), !is.na(___), !is.na(___)) %>%
mutate(fold = sample(rep(___, length.out = n())))
Click for a hint
Use rep(1:10, length.out = n()) so folds are roughly equal-sized.
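To see what that produces, here is a tiny standalone illustration (the length of 25 simply stands in for n()):
set.seed(1)  # any seed works; 1 is only for this illustration
fold_demo <- sample(rep(1:10, length.out = 25))
table(fold_demo)  # folds 1 through 5 get three rows each, folds 6 through 10 get two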
3.2 Fit and evaluate 10 models
Complete the loop so each fold is used as test exactly once, and store fold accuracies in cv_results.
cv_results <- data.frame(fold = sort(unique(titanic_cv$fold)), accuracy = NA_real_)
for (j in cv_results$fold) {
train_j <- titanic_cv %>% filter(fold != ___)
test_j <- titanic_cv %>% filter(fold == ___)
m_j <- glm(
formula = ___,
data = ___,
family = ___
)
p_j <- predict(m_j, newdata = ___, type = ___)
yhat_j <- ifelse(___ > ___, ___, ___)
cv_results$accuracy[cv_results$fold == j] <- mean(___ == ___, na.rm = TRUE)
}
3.3 Summarize cross validation
Compute mean, SD, min, and max of fold accuracy.
cv_mean <- mean(___)
cv_sd <- sd(___)
cv_min <- min(___)
cv_max <- max(___)
c(cv_mean = cv_mean, cv_sd = cv_sd, cv_min = cv_min, cv_max = cv_max)
Click for a solution
cv_results <- data.frame(fold = sort(unique(titanic_cv$fold)), accuracy = NA_real_)
for (j in cv_results$fold) {
train_j <- titanic_cv %>% filter(fold != j)
test_j <- titanic_cv %>% filter(fold == j)
m_j <- glm(survived ~ sex + pclass, data = train_j, family = binomial)
p_j <- predict(m_j, newdata = test_j, type = "response")
yhat_j <- ifelse(p_j > 0.5, 1, 0)
cv_results$accuracy[cv_results$fold == j] <- mean(yhat_j == test_j$survived, na.rm = TRUE)
}
cv_mean <- mean(cv_results$accuracy)
cv_sd <- sd(cv_results$accuracy)
cv_min <- min(cv_results$accuracy)
cv_max <- max(cv_results$accuracy)
c(cv_mean = cv_mean, cv_sd = cv_sd, cv_min = cv_min, cv_max = cv_max)
Exercise 4: When the cutoff is a policy decision
A probability cutoff is not a law of nature. It is a decision.
Lloyd’s might treat “predict survival” as a proxy for “low risk,” but different cutoffs change what counts as low risk. In this exercise you will examine how accuracy changes as the cutoff changes.
4.1 Evaluate multiple cutoffs on the test set
Using p_test, the predicted survival probabilities for the test set (see the sketch below), compute test accuracy for the following cutoffs: 0.3, 0.5, 0.7.
Create a data frame called cutoff_results with columns cutoff and accuracy.
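The earlier exercises never actually created p_test, so here is one plausible definition, offered as an assumption: predict with m_split on the lowercased titanic_test.
# Assumed definition of p_test: predicted survival probabilities for the
# held-out passengers, from the model fit only on titanic_train
p_test <- predict(m_split, newdata = titanic_test, type = "response")
Note that titanic_test ships without a survived column (see the glimpse() output above), which is why the solution below prints NaN; to report a real test accuracy you would first need to attach the outcomes, for example by matching passengers against titanic3.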
cutoffs <- c(0.3, 0.5, 0.7)
cutoff_results <- data.frame(
cutoff = cutoffs,
accuracy = NA_real_
)
for (i in seq_along(cutoffs)) {
c0 <- cutoffs[i]
yhat_c <- ifelse(___ > ___, ___, ___)
cutoff_results$accuracy[i] <- mean(___ == ___, na.rm = TRUE)
}
cutoff_results
Click for a solution
cutoffs <- c(0.3, 0.5, 0.7)
cutoff_results <- data.frame(
cutoff = cutoffs,
accuracy = NA_real_
)
for (i in seq_along(cutoffs)) {
c0 <- cutoffs[i]
yhat_c <- ifelse(p_test > c0, 1, 0)
cutoff_results$accuracy[i] <- mean(yhat_c == titanic_test$survived, na.rm = TRUE)
}
cutoff_results
#> cutoff accuracy
#> 1 0.3 NaN
#> 2 0.5 NaN
#> 3 0.7 NaN
Stretch tasks (optional)
These are optional, but they bring the lab closer to how real modeling work looks.
5.1 Add one more predictor and re-run CV
Pick one additional predictor from titanic3 that you think should matter (for example, age or fare).
- Fit survived ~ sex + pclass + <your variable>.
- Run 10-fold CV again.
- Compare cv_mean and cv_sd to the baseline model.
Important: decide how you will handle missing values for your added predictor (listwise deletion is allowed here; missingness is treated more carefully in the super stretch goals below).
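A sketch of the setup, assuming you add age and use listwise deletion (the object name titanic_cv_age is just a suggestion):
set.seed(2025)  # any seed
titanic_cv_age <- titanic3 %>%
  filter(!is.na(survived), !is.na(sex), !is.na(pclass), !is.na(age)) %>%
  mutate(fold = sample(rep(1:10, length.out = n())))
# Then rerun the 10-fold loop from 3.2, swapping in survived ~ sex + pclass + age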
Super Stretch goals
Lloyd’s Actually Has to Use This Model
In the exercises above, we evaluated a simple Titanic model using holdout testing and k-fold cross validation. That work began with a deliberately constrained model (sex and pclass) so you could focus on evaluation without getting lost in feature engineering.
Now we do the part Lloyd’s actually cares about: building a model that uses more information, and doing so in a way that does not quietly sabotage evaluation.
This lab is about the uncomfortable truth that modeling choices are evaluation choices:
- If you add predictors with missing data, you change the population your model is trained and tested on.
- If you impute missing values using the full dataset, you leak information across folds.
- If you select a cutoff to maximize accuracy, you are making a policy decision without naming it.
You will build richer models, compare them using cross validation, and then stress-test the evaluation pipeline itself.
Learning goals
- Compare multiple candidate models using k-fold cross validation.
- Handle missing data with listwise deletion and simple imputation.
- Avoid information leakage by imputing within folds.
- Evaluate classifiers beyond accuracy (confusion matrices, sensitivity/specificity).
- Reason about performance vs stability when recommending a model.
Exercise 6: Candidate models
Lloyd’s will not price risk using only sex and class. Your job is to propose reasonable candidate models and compare them using the same evaluation method.
6.1 Choose candidate predictors
Pick two additional predictors from the data set. (This can include a constructed variable that you create from existing variables.)
names(titanic3)
#> [1] "pclass" "survived" "name" "sex" "age" "sibsp"
#> [7] "parch" "ticket" "fare" "cabin" "embarked" "boat"
#> [13] "body" "home.dest"Why do you think each predictor you selected should matter for survival?
For example, a title variable can be constructed from the name field:
titanic3 %>%
  mutate(title = str_extract(name, "(?<=, )[^.]+")) %>%
  count(title) %>%
  arrange(desc(n))
#> # A tibble: 18 × 2
#> title n
#> <chr> <int>
#> 1 Mr 757
#> 2 Miss 260
#> 3 Mrs 197
#> 4 Master 61
#> 5 Dr 8
#> 6 Rev 8
#> 7 Col 4
#> 8 Major 2
#> 9 Mlle 2
#> 10 Ms 2
#> 11 Capt 1
#> 12 Don 1
#> 13 Dona 1
#> 14 Jonkheer 1
#> 15 Lady 1
#> 16 Mme 1
#> 17 Sir 1
#> 18 the Countess 1
Exercise 7: Cross validation comparison (listwise deletion)
We will start with the simplest missing-data strategy: drop rows with missing values in any variable used by the model.
This is not always the best approach, but it is transparent.
7.1 Create a CV dataset for both models
Create cv_df that keeps only rows with non-missing values for all variables needed in f_expanded, and add a fold variable with values 1 through 10 (as in Exercise 3). A sketch is given below.
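No skeleton is provided here, so the sketch below makes some assumptions: f_base is the model from the first half of the lab, f_expanded adds age plus a constructed title variable (swap in your own predictors), and folds are assigned exactly as in Exercise 3.
# Candidate formulas; the extra predictors are examples only
f_base <- survived ~ sex + pclass
f_expanded <- survived ~ sex + pclass + age + title
set.seed(2025)  # any seed
cv_df <- titanic3 %>%
  mutate(title = str_extract(name, "(?<=, )[^.]+")) %>%
  filter(!is.na(survived), !is.na(sex), !is.na(pclass), !is.na(age), !is.na(title)) %>%
  mutate(fold = sample(rep(1:10, length.out = n())))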
7.3 Write a function to compute CV accuracy
Complete the function below so it returns a data frame with fold accuracies for a given formula.
cv_glm_accuracy <- function(df, formula, cutoff = 0.5) {
out <- data.frame(fold = sort(unique(df$fold)), accuracy = NA_real_)
for (j in out$fold) {
train_j <- df %>% filter(fold != ___)
test_j <- df %>% filter(fold == ___)
m_j <- glm(
formula = ___,
data = ___,
family = ___
)
p_j <- predict(m_j, newdata = ___, type = ___)
yhat_j <- ifelse(___ > ___, ___, ___)
out$accuracy[out$fold == j] <- mean(___ == ___, na.rm = TRUE)
}
out
}
Click for a solution
cv_glm_accuracy <- function(df, formula, cutoff = 0.5) {
out <- data.frame(fold = sort(unique(df$fold)), accuracy = NA_real_)
for (j in out$fold) {
train_j <- df %>% filter(fold != j)
test_j <- df %>% filter(fold == j)
m_j <- glm(
formula = formula,
data = train_j,
family = binomial
)
p_j <- predict(m_j, newdata = test_j, type = "response")
yhat_j <- ifelse(p_j > cutoff, 1, 0)
out$accuracy[out$fold == j] <- mean(yhat_j == test_j$survived, na.rm = TRUE)
}
out
}
7.4 Compare models
Use cv_glm_accuracy() to compute CV results for both formulas. Compute mean and SD of accuracy for each.
cv_base <- cv_glm_accuracy(cv_df, formula = ___)
cv_exp <- cv_glm_accuracy(cv_df, formula = ___)
summary_table <- data.frame(
model = c("base", "expanded"),
mean_acc = c(mean(___), mean(___)),
sd_acc = c(sd(___), sd(___))
)
summary_table
Click for a solution
cv_base <- cv_glm_accuracy(cv_df, formula = f_base)
cv_exp <- cv_glm_accuracy(cv_df, formula = f_expanded)
summary_table <- data.frame(
model = c("base", "expanded"),
mean_acc = c(mean(cv_base$accuracy), mean(cv_exp$accuracy)),
sd_acc = c(sd(cv_base$accuracy), sd(cv_exp$accuracy))
)
summary_table
#> model mean_acc sd_acc
#> 1 base 0.78 0.039
#> 2 expanded 0.78 0.036
Exercise 8: Missingness is a modeling decision
Listwise deletion changes your population: you are now modeling a subset of passengers with complete records on your chosen predictors.
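A quick way to see how much the population shifts is to compare row counts and survival rates before and after deletion (assuming the cv_df you built in the previous exercise):
# How many passengers does listwise deletion drop, and does the survival
# rate change in the retained subset?
c(n_all = nrow(titanic3),
  n_complete = nrow(cv_df),
  surv_all = mean(titanic3$survived, na.rm = TRUE),
  surv_complete = mean(cv_df$survived))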
Exercise 9: Simple imputation without leakage
Now you will use a very simple imputation strategy for a numeric predictor:
- Replace missing values with the training-fold mean (not the overall mean).
This is the smallest step toward preventing leakage: the test fold should not influence how you fill in missing values.
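For contrast, here is what the leaky version looks like (do not use it): the imputation mean is computed once on the full data set, so rows that will later sit in a test fold influence the value used to fill training-fold NAs. This sketch assumes age is the variable being imputed.
# LEAKY: mu_leaky is computed from every row, including future test folds
mu_leaky <- mean(cv_df$age, na.rm = TRUE)
cv_df_leaky <- cv_df %>%
  mutate(age = ifelse(is.na(age), mu_leaky, age))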
9.1 Implement within-fold mean imputation for one variable
Choose one numeric predictor you used (for example, age). Complete the loop to:
- compute the mean of that variable in train_j (excluding NAs),
- fill missing values in both train_j and test_j using that training mean,
- fit the expanded model and compute fold accuracy.
Store results in cv_imp.
cv_imp <- data.frame(fold = sort(unique(cv_df$fold)), accuracy = NA_real_)
for (j in cv_imp$fold) {
train_j <- cv_df %>% filter(fold != ___)
test_j <- cv_df %>% filter(fold == ___)
mu <- mean(train_j$___, na.rm = TRUE)
train_j$___[is.na(train_j$___)] <- ___
test_j$___[is.na(test_j$___)] <- ___
m_j <- glm(
formula = ___,
data = ___,
family = ___
)
p_j <- predict(m_j, newdata = ___, type = ___)
yhat_j <- ifelse(___ > ___, ___, ___)
cv_imp$accuracy[cv_imp$fold == j] <- mean(___ == ___, na.rm = TRUE)
}
c(mean = mean(cv_imp$accuracy), sd = sd(cv_imp$accuracy))
Click for a solution example
cv_imp <- data.frame(fold = sort(unique(cv_df$fold)), accuracy = NA_real_)
for (j in cv_imp$fold) {
train_j <- cv_df %>% filter(fold != j)
test_j <- cv_df %>% filter(fold == j)
mu <- mean(train_j$age, na.rm = TRUE)
train_j$age[is.na(train_j$age)] <- mu
test_j$age[is.na(test_j$age)] <- mu
m_j <- glm(
formula = f_expanded,
data = train_j,
family = binomial
)
p_j <- predict(m_j, newdata = test_j, type = "response")
yhat_j <- ifelse(p_j > 0.5, 1, 0)
cv_imp$accuracy[cv_imp$fold == j] <- mean(yhat_j == test_j$survived, na.rm = TRUE)
}
c(mean = mean(cv_imp$accuracy), sd = sd(cv_imp$accuracy))
#> mean sd
#> 0.777 0.036
Exercise 10: Beyond accuracy (confusion matrix)
Accuracy treats all mistakes as equally bad. Lloyd’s rarely agrees.
10.1 Compute confusion matrix elements on one fold
Pick a single fold (for example, fold 1). Using your chosen model and cutoff 0.5:
- compute TP, FP, TN, FN
- compute sensitivity and specificity
# Choose a fold
j <- 1
train_j <- cv_df %>% filter(fold != j)
test_j <- cv_df %>% filter(fold == j)
m_j <- glm(formula = f_expanded, data = train_j, family = binomial)
p_j <- predict(m_j, newdata = test_j, type = "response")
yhat_j <- ifelse(p_j > 0.5, 1, 0)
y_j <- test_j$survived
TP <- sum(yhat_j == 1 & y_j == 1, na.rm = TRUE)
FP <- sum(yhat_j == 1 & y_j == 0, na.rm = TRUE)
TN <- sum(yhat_j == 0 & y_j == 0, na.rm = TRUE)
FN <- sum(yhat_j == 0 & y_j == 1, na.rm = TRUE)
sensitivity <- TP / (TP + FN)
specificity <- TN / (TN + FP)
c(TP = TP, FP = FP, TN = TN, FN = FN, sensitivity = sensitivity, specificity = specificity)
#> TP FP TN FN sensitivity specificity
#> 24.00 13.00 53.00 15.00 0.62 0.80