Lloyd’s Actually Has to Use This Model
In the previous lab, we evaluated a simple Titanic model using holdout testing and k-fold cross validation. That lab began with a deliberately constrained model (sex and pclass) so you could focus on evaluation without getting lost in feature engineering.
Now we do the part Lloyd’s actually cares about: building a model that uses more information, and doing so in a way that does not quietly sabotage evaluation.
This lab is about the uncomfortable truth that modeling choices are evaluation choices:
- If you add predictors with missing data, you change the population your model is trained and tested on.
- If you impute missing values using the full dataset, you leak information across folds.
- If you select a cutoff to maximize accuracy, you are making a policy decision without naming it.
You will build richer models, compare them using cross validation, and then stress-test the evaluation pipeline itself.
Learning goals
- Compare multiple candidate models using k-fold cross validation.
- Handle missing data with listwise deletion and simple imputation.
- Avoid information leakage by imputing within folds.
- Evaluate classifiers beyond accuracy (confusion matrices, sensitivity/specificity).
- Reason about performance vs stability when recommending a model.
Exercise 6: Candidate models
Lloyd’s will not price risk using only sex and class. Your job is to propose reasonable candidate models and compare them using the same evaluation method.
6.1 Choose candidate predictors
Pick two additional predictors from the data set. (This can include a constructed variable that you create from existing variables.)
names(titanic3)
#> [1] "pclass" "survived" "name" "sex" "age" "sibsp"
#> [7] "parch" "ticket" "fare" "cabin" "embarked" "boat"
#> [13] "body"     "home.dest"
Why do you think each predictor you selected should matter for survival?
titanic3 %>%
  # extract the honorific between the comma and the period in each name
  # (e.g., "Braund, Mr. Owen Harris" -> "Mr")
  mutate(title = str_extract(name, "(?<=, )[^.]+")) %>%
  count(title) %>%
  arrange(desc(n))
#> # A tibble: 18 × 2
#> title n
#> <chr> <int>
#> 1 Mr 757
#> 2 Miss 260
#> 3 Mrs 197
#> 4 Master 61
#> 5 Dr 8
#> 6 Rev 8
#> 7 Col 4
#> 8 Major 2
#> 9 Mlle 2
#> 10 Ms 2
#> 11 Capt 1
#> 12 Don 1
#> 13 Dona 1
#> 14 Jonkheer 1
#> 15 Lady 1
#> 16 Mme 1
#> 17 Sir 1
#> 18 the Countess     1
Exercise 7: Cross validation comparison (listwise deletion)
We will start with the simplest missing-data strategy: drop rows with missing values in any variable used by the model.
This is not always the best approach, but it is transparent.
7.1 Create a CV dataset for both models
Create cv_df that keeps only rows with non-missing values for all variables needed in f_expanded.
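If your fold assignments did not carry over from the previous lab, a minimal sketch is below. It assumes the tidyverse is loaded, that f_expanded adds age and fare to sex and pclass, and that you use k = 10 folds; substitute your own predictors and k.
set.seed(1)   # make the random fold assignment reproducible
k <- 10       # number of folds (assumption; use whatever k you used before)

cv_df <- titanic3 %>%
  # listwise deletion: keep only rows complete on every variable in f_expanded
  # (age and fare stand in for YOUR two extra predictors)
  drop_na(survived, pclass, sex, age, fare) %>%
  # assign each remaining row to one of k roughly equal folds at random
  mutate(fold = sample(rep(1:k, length.out = n())))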
7.3 Write a function to compute CV accuracy
Complete the function below so it returns a data frame with fold accuracies for a given formula.
cv_glm_accuracy <- function(df, formula, cutoff = 0.5) {
  out <- data.frame(fold = sort(unique(df$fold)), accuracy = NA_real_)
  for (j in out$fold) {
    train_j <- df %>% filter(fold != ___)
    test_j <- df %>% filter(fold == ___)
    m_j <- glm(
      formula = ___,
      data = ___,
      family = ___
    )
    p_j <- predict(m_j, newdata = ___, type = ___)
    yhat_j <- ifelse(___ > ___, ___, ___)
    out$accuracy[out$fold == j] <- mean(___ == ___, na.rm = TRUE)
  }
  out
}
Click for a solution
cv_glm_accuracy <- function(df, formula, cutoff = 0.5) {
  out <- data.frame(fold = sort(unique(df$fold)), accuracy = NA_real_)
  for (j in out$fold) {
    train_j <- df %>% filter(fold != j)
    test_j <- df %>% filter(fold == j)
    m_j <- glm(
      formula = formula,
      data = train_j,
      family = binomial
    )
    p_j <- predict(m_j, newdata = test_j, type = "response")
    yhat_j <- ifelse(p_j > cutoff, 1, 0)
    out$accuracy[out$fold == j] <- mean(yhat_j == test_j$survived, na.rm = TRUE)
  }
  out
}
7.4 Compare models
Use cv_glm_accuracy() to compute CV results for both formulas. Compute mean and SD of accuracy for each.
cv_base <- cv_glm_accuracy(cv_df, formula = ___)
cv_exp <- cv_glm_accuracy(cv_df, formula = ___)
summary_table <- data.frame(
  model = c("base", "expanded"),
  mean_acc = c(mean(___), mean(___)),
  sd_acc = c(sd(___), sd(___))
)
summary_table
Click for a solution
cv_base <- cv_glm_accuracy(cv_df, formula = f_base)
cv_exp <- cv_glm_accuracy(cv_df, formula = f_expanded)
summary_table <- data.frame(
  model = c("base", "expanded"),
  mean_acc = c(mean(cv_base$accuracy), mean(cv_exp$accuracy)),
  sd_acc = c(sd(cv_base$accuracy), sd(cv_exp$accuracy))
)
summary_table
#>      model mean_acc sd_acc
#> 1     base     0.78  0.039
#> 2 expanded     0.78  0.036
Exercise 8: Missingness is a modeling decision
Listwise deletion changes your population: you are now modeling a subset of passengers with complete records on your chosen predictors.
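One concrete way to see the cost is to count how many passengers survive listwise deletion, and which predictor is responsible for most of the loss. A sketch, again using age and fare as stand-ins for your chosen predictors:
# How many rows remain after listwise deletion on the expanded model's variables?
titanic3 %>%
  summarize(
    n_total    = n(),
    n_complete = sum(complete.cases(survived, pclass, sex, age, fare)),
    prop_kept  = n_complete / n_total
  )

# Which predictor accounts for most of the missingness?
titanic3 %>%
  summarize(across(c(age, fare), ~ sum(is.na(.x))))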
Exercise 9: Simple imputation without leakage
Now you will use a very simple imputation strategy for a numeric predictor:
- Replace missing values with the training-fold mean (not the overall mean).
This is the smallest step toward preventing leakage: the test fold should not influence how you fill in missing values.
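For contrast, the leaky version computes the imputation value once, from the full dataset, before any train/test split. A sketch of what not to do (df_all is a hypothetical name for whichever dataset still contains missing age values):
# LEAKY: the mean is computed from ALL rows at once, so every test fold
# has already influenced the value used to fill in its own missing entries.
df_leaky <- df_all %>%
  mutate(age = replace(age, is.na(age), mean(age, na.rm = TRUE)))
The within-fold version you implement in 9.1 computes the same mean inside the CV loop, from train_j only.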
9.1 Implement within-fold mean imputation for one variable
Choose one numeric predictor you used (for example, age). Complete the loop to:
- compute the mean of that variable in train_j (excluding NAs),
- fill missing values in both train_j and test_j using that training mean,
- fit the expanded model and compute fold accuracy.
Store results in cv_imp.
cv_imp <- data.frame(fold = sort(unique(cv_df$fold)), accuracy = NA_real_)
for (j in cv_imp$fold) {
  train_j <- cv_df %>% filter(fold != ___)
  test_j <- cv_df %>% filter(fold == ___)
  mu <- mean(train_j$___, na.rm = TRUE)
  train_j$___[is.na(train_j$___)] <- ___
  test_j$___[is.na(test_j$___)] <- ___
  m_j <- glm(
    formula = ___,
    data = ___,
    family = ___
  )
  p_j <- predict(m_j, newdata = ___, type = ___)
  yhat_j <- ifelse(___ > ___, ___, ___)
  cv_imp$accuracy[cv_imp$fold == j] <- mean(___ == ___, na.rm = TRUE)
}
c(mean = mean(cv_imp$accuracy), sd = sd(cv_imp$accuracy))
Click for a solution example
cv_imp <- data.frame(fold = sort(unique(cv_df$fold)), accuracy = NA_real_)
for (j in cv_imp$fold) {
  train_j <- cv_df %>% filter(fold != j)
  test_j <- cv_df %>% filter(fold == j)
  mu <- mean(train_j$age, na.rm = TRUE)
  train_j$age[is.na(train_j$age)] <- mu
  test_j$age[is.na(test_j$age)] <- mu
  m_j <- glm(
    formula = f_expanded,
    data = train_j,
    family = binomial
  )
  p_j <- predict(m_j, newdata = test_j, type = "response")
  yhat_j <- ifelse(p_j > 0.5, 1, 0)
  cv_imp$accuracy[cv_imp$fold == j] <- mean(yhat_j == test_j$survived, na.rm = TRUE)
}
c(mean = mean(cv_imp$accuracy), sd = sd(cv_imp$accuracy))
#> mean sd
#> 0.777 0.036
Exercise 10: Beyond accuracy (confusion matrix)
Accuracy treats all mistakes as equally bad. Lloyd’s rarely agrees.
10.1 Compute confusion matrix elements on one fold
Pick a single fold (for example, fold 1). Using your chosen model and cutoff 0.5:
- compute TP, FP, TN, FN
- compute sensitivity and specificity
# Choose a fold
j <- 1
train_j <- cv_df %>% filter(fold != j)
test_j <- cv_df %>% filter(fold == j)
m_j <- glm(formula = f_expanded, data = train_j, family = binomial)
p_j <- predict(m_j, newdata = test_j, type = "response")
yhat_j <- ifelse(p_j > 0.5, 1, 0)
y_j <- test_j$survived
TP <- sum(yhat_j == 1 & y_j == 1, na.rm = TRUE)
FP <- sum(yhat_j == 1 & y_j == 0, na.rm = TRUE)
TN <- sum(yhat_j == 0 & y_j == 0, na.rm = TRUE)
FN <- sum(yhat_j == 0 & y_j == 1, na.rm = TRUE)
sensitivity <- TP / (TP + FN)
specificity <- TN / (TN + FP)
c(TP = TP, FP = FP, TN = TN, FN = FN, sensitivity = sensitivity, specificity = specificity)
#> TP FP TN FN sensitivity specificity
#> 24.00 13.00 53.00 15.00 0.62 0.80
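Finally, recall the claim at the top of this lab: choosing a cutoff is a policy decision. One way to make that visible is to sweep the cutoff on the same held-out fold and watch sensitivity and specificity trade off. A sketch reusing p_j and y_j from above (the grid of cutoffs is an arbitrary choice):
cutoffs <- seq(0.1, 0.9, by = 0.1)

cutoff_tradeoff <- map_dfr(cutoffs, function(cut) {
  # classify the held-out fold at this cutoff
  yhat <- ifelse(p_j > cut, 1, 0)
  tibble(
    cutoff      = cut,
    accuracy    = mean(yhat == y_j, na.rm = TRUE),
    sensitivity = sum(yhat == 1 & y_j == 1, na.rm = TRUE) / sum(y_j == 1, na.rm = TRUE),
    specificity = sum(yhat == 0 & y_j == 0, na.rm = TRUE) / sum(y_j == 0, na.rm = TRUE)
  )
})

cutoff_tradeoff
Lowering the cutoff catches more survivors (higher sensitivity) at the cost of more false positives (lower specificity); where you sit on that curve is exactly the unnamed policy decision from the introduction.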