Lloyd’s Actually Has to Use This Model

In the previous lab, we evaluated a simple Titanic model using holdout testing and k-fold cross validation. That lab began with a deliberately constrained model (sex and pclass) so you could focus on evaluation without getting lost in feature engineering.

Now we do the part Lloyd’s actually cares about: building a model that uses more information, and doing so in a way that does not quietly sabotage evaluation.

This lab is about the uncomfortable truth that modeling choices are evaluation choices:

  • If you add predictors with missing data, you change the population your model is trained and tested on.
  • If you impute missing values using the full dataset, you leak information across folds.
  • If you select a cutoff to maximize accuracy, you are making a policy decision without naming it.

You will build richer models, compare them using cross validation, and then stress-test the evaluation pipeline itself.

Learning goals

  • Compare multiple candidate models using k-fold cross validation.
  • Handle missing data with listwise deletion and simple imputation.
  • Avoid information leakage by imputing within folds.
  • Evaluate classifiers beyond accuracy (confusion matrices, sensitivity/specificity).
  • Reason about performance vs stability when recommending a model.

Exercises

Exercise 6: Candidate models

Lloyd’s will not price risk using only sex and class. Your job is to propose reasonable candidate models and compare them using the same evaluation method.

6.1 Choose candidate predictors

Pick two additional predictors from the data set. (This can include a constructed variable that you create from existing variables.)

names(titanic3)
#>  [1] "pclass"    "survived"  "name"      "sex"       "age"       "sibsp"    
#>  [7] "parch"     "ticket"    "fare"      "cabin"     "embarked"  "boat"     
#> [13] "body"      "home.dest"

Why do you think each predictor you selected should matter for survival?

Click for a solution example

titanic3 %>%
  mutate(title = str_extract(name, "(?<=, )[^.]+")) %>%
  count(title) %>%
  arrange(desc(n))
#> # A tibble: 18 × 2
#>    title            n
#>    <chr>        <int>
#>  1 Mr             757
#>  2 Miss           260
#>  3 Mrs            197
#>  4 Master          61
#>  5 Dr               8
#>  6 Rev              8
#>  7 Col              4
#>  8 Major            2
#>  9 Mlle             2
#> 10 Ms               2
#> 11 Capt             1
#> 12 Don              1
#> 13 Dona             1
#> 14 Jonkheer         1
#> 15 Lady             1
#> 16 Mme              1
#> 17 Sir              1
#> 18 the Countess     1

6.2 Define candidate formulas

Create two model formulas:

  • Baseline: survived ~ sex + pclass
  • Expanded: survived ~ sex + pclass + <two predictors you chose>

Store them as f_base and f_expanded.

f_base <- survived ~ ___ + ___
f_expanded <- survived ~ ___ + ___ + ___ + ___
Click for a solution example
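One possibility, matching the age and fare predictors used in the later solutions:

f_base <- survived ~ sex + pclass
f_expanded <- survived ~ sex + pclass + age + fare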

Exercise 7: Cross validation comparison (listwise deletion)

We will start with the simplest missing-data strategy: drop rows with missing values in any variable used by the model.

This is not always the best approach, but it is transparent.

7.1 Create a CV dataset for both models

Create cv_df that keeps only rows with non-missing values for all variables needed in f_expanded.

cv_df <- titanic3 %>%
  filter(!is.na(survived)) %>%
  filter(!is.na(___)) %>%
  filter(!is.na(___)) %>%
  filter(!is.na(___)) %>%
  filter(!is.na(___))
Click for a solution example
cv_df <- titanic3 %>%
  filter(!is.na(survived)) %>%
  filter(!is.na(sex)) %>%
  filter(!is.na(pclass)) %>%
  filter(!is.na(age)) %>%
  filter(!is.na(fare))
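As an aside, if the tidyr package is loaded alongside dplyr, the same listwise deletion can be written more compactly (a sketch assuming the age and fare choices from the solution above):

library(tidyr)

cv_df <- titanic3 %>%
  drop_na(survived, sex, pclass, age, fare)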

7.2 Create folds

Assign each row to one of 10 folds.

set.seed(___)

cv_df <- cv_df %>%
  mutate(fold = sample(rep(___, length.out = n())))
Click for a solution
set.seed(21)

cv_df <- cv_df %>%
  mutate(fold = sample(rep(1:10, length.out = n())))
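A quick sanity check, if you want one, that the folds came out roughly equal in size:

table(cv_df$fold)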

7.3 Write a function to compute CV accuracy

Complete the function below so it returns a data frame with fold accuracies for a given formula.

cv_glm_accuracy <- function(df, formula, cutoff = 0.5) {

  out <- data.frame(fold = sort(unique(df$fold)), accuracy = NA_real_)

  for (j in out$fold) {

    train_j <- df %>% filter(fold != ___)
    test_j  <- df %>% filter(fold == ___)

    m_j <- glm(
      formula = ___,
      data = ___,
      family = ___
    )

    p_j <- predict(m_j, newdata = ___, type = ___)
    yhat_j <- ifelse(___ > ___, ___, ___)

    out$accuracy[out$fold == j] <- mean(___ == ___, na.rm = TRUE)
  }

  out
}
Click for a solution
cv_glm_accuracy <- function(df, formula, cutoff = 0.5) {

  out <- data.frame(fold = sort(unique(df$fold)), accuracy = NA_real_)

  for (j in out$fold) {

    train_j <- df %>% filter(fold != j)
    test_j  <- df %>% filter(fold == j)

    m_j <- glm(
      formula = formula,
      data = train_j,
      family = binomial
    )

    p_j <- predict(m_j, newdata = test_j, type = "response")
    yhat_j <- ifelse(p_j > cutoff, 1, 0)

    out$accuracy[out$fold == j] <- mean(yhat_j == test_j$survived, na.rm = TRUE)
  }

  out
}

7.4 Compare models

Use cv_glm_accuracy() to compute CV results for both formulas. Compute mean and SD of accuracy for each.

cv_base <- cv_glm_accuracy(cv_df, formula = ___)
cv_exp  <- cv_glm_accuracy(cv_df, formula = ___)

summary_table <- data.frame(
  model = c("base", "expanded"),
  mean_acc = c(mean(___), mean(___)),
  sd_acc   = c(sd(___), sd(___))
)

summary_table
Click for a solution
cv_base <- cv_glm_accuracy(cv_df, formula = f_base)
cv_exp  <- cv_glm_accuracy(cv_df, formula = f_expanded)

summary_table <- data.frame(
  model = c("base", "expanded"),
  mean_acc = c(mean(cv_base$accuracy), mean(cv_exp$accuracy)),
  sd_acc   = c(sd(cv_base$accuracy), sd(cv_exp$accuracy))
)

summary_table
#>      model mean_acc sd_acc
#> 1     base     0.78  0.039
#> 2 expanded     0.78  0.036
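Before answering the reflection questions below, it can help to look at the fold-by-fold accuracies side by side rather than only the summaries (a sketch using the cv_base and cv_exp objects from above):

data.frame(
  fold     = cv_base$fold,
  base     = cv_base$accuracy,
  expanded = cv_exp$accuracy
)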

7.5 Reflection

  1. Which model has higher mean CV accuracy?
  2. Which model has higher variability across folds?
  3. If the accuracy improvement is small but the expanded model is more variable, how would you advise Lloyd’s?

Exercise 8: Missingness is a modeling decision

Listwise deletion changes your population: you are now modeling a subset of passengers with complete records on your chosen predictors.
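A quick way to see which variables drive the row loss is to count missing values per candidate predictor. The variables below (age, fare, embarked) are only examples; swap in whichever ones you chose.

titanic3 %>%
  summarise(across(c(age, fare, embarked), ~ sum(is.na(.x))))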

8.1 Quantify the missingness impact

Compute the number of rows in:

  • the original titanic3 with non-missing survived
  • cv_df

Report the proportion retained.

n_all <- sum(!is.na(titanic3$survived))
n_kept <- nrow(cv_df)
prop_kept <- n_kept / n_all

c(n_all = n_all, n_kept = n_kept, prop_kept = prop_kept)
#>     n_all    n_kept prop_kept 
#>    1309.0    1045.0       0.8

8.2 Reflection

  • What kinds of passengers might be disproportionately removed by listwise deletion when you include age?
  • Why does this matter for Lloyd’s, even if accuracy improves?
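One way to probe the first question is to compare the proportion of missing age across passenger groups (class is just one example of a grouping to try):

titanic3 %>%
  group_by(pclass) %>%
  summarise(prop_missing_age = mean(is.na(age)))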

Exercise 9: Simple imputation without leakage

Now you will use a very simple imputation strategy for a numeric predictor:

  • Replace missing values with the training-fold mean (not the overall mean).

This is the smallest step toward preventing leakage: the test fold should not influence how you fill in missing values.
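To make the distinction concrete, here is a minimal sketch (using age, the variable imputed in the solution below). The leaky version averages over every row; the leakage-free version uses only the training folds.

j <- 1  # any fold

# Leaky: the held-out fold influences the fill-in value
mu_leaky <- mean(cv_df$age, na.rm = TRUE)

# Leakage-free: only training folds contribute
mu_clean <- mean(cv_df$age[cv_df$fold != j], na.rm = TRUE)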

9.1 Implement within-fold mean imputation for one variable

Choose one numeric predictor you used (for example, age). Complete the loop to:

  • compute the mean of that variable in train_j (excluding NAs),
  • fill missing values in both train_j and test_j using that training mean,
  • fit the expanded model and compute fold accuracy.

Store results in cv_imp.

cv_imp <- data.frame(fold = sort(unique(cv_df$fold)), accuracy = NA_real_)

for (j in cv_imp$fold) {

  train_j <- cv_df %>% filter(fold != ___)
  test_j  <- cv_df %>% filter(fold == ___)

  mu <- mean(train_j$___, na.rm = TRUE)

  train_j$___[is.na(train_j$___)] <- ___
  test_j$___[is.na(test_j$___)] <- ___

  m_j <- glm(
    formula = ___,
    data = ___,
    family = ___
  )

  p_j <- predict(m_j, newdata = ___, type = ___)
  yhat_j <- ifelse(___ > ___, ___, ___)

  cv_imp$accuracy[cv_imp$fold == j] <- mean(___ == ___, na.rm = TRUE)
}

c(mean = mean(cv_imp$accuracy), sd = sd(cv_imp$accuracy))
Click for a solution example
cv_imp <- data.frame(fold = sort(unique(cv_df$fold)), accuracy = NA_real_)

for (j in cv_imp$fold) {

  train_j <- cv_df %>% filter(fold != j)
  test_j  <- cv_df %>% filter(fold == j)

  mu <- mean(train_j$age, na.rm = TRUE)

  train_j$age[is.na(train_j$age)] <- mu
  test_j$age[is.na(test_j$age)] <- mu

  m_j <- glm(
    formula = f_expanded,
    data = train_j,
    family = binomial
  )

  p_j <- predict(m_j, newdata = test_j, type = "response")
  yhat_j <- ifelse(p_j > 0.5, 1, 0)

  cv_imp$accuracy[cv_imp$fold == j] <- mean(yhat_j == test_j$survived, na.rm = TRUE)
}

c(mean = mean(cv_imp$accuracy), sd = sd(cv_imp$accuracy))
#>  mean    sd 
#> 0.777 0.036

9.2 Reflection

  1. Why is imputing using the overall mean (computed on the full dataset) a form of leakage?
  2. Did within-fold imputation change mean CV accuracy compared to listwise deletion? Why might it?

Exercise 10: Beyond accuracy (confusion matrix)

Accuracy treats all mistakes as equally bad. Lloyd’s rarely agrees.

10.1 Compute confusion matrix elements on one fold

Pick a single fold (for example, fold 1). Using your chosen model and cutoff 0.5:

  • compute TP, FP, TN, FN
  • compute sensitivity and specificity

# Choose a fold
j <- 1

train_j <- cv_df %>% filter(fold != j)
test_j  <- cv_df %>% filter(fold == j)

m_j <- glm(formula = f_expanded, data = train_j, family = binomial)
p_j <- predict(m_j, newdata = test_j, type = "response")
yhat_j <- ifelse(p_j > 0.5, 1, 0)
y_j <- test_j$survived

TP <- sum(yhat_j == 1 & y_j == 1, na.rm = TRUE)
FP <- sum(yhat_j == 1 & y_j == 0, na.rm = TRUE)
TN <- sum(yhat_j == 0 & y_j == 0, na.rm = TRUE)
FN <- sum(yhat_j == 0 & y_j == 1, na.rm = TRUE)

sensitivity <- TP / (TP + FN)
specificity <- TN / (TN + FP)

c(TP = TP, FP = FP, TN = TN, FN = FN, sensitivity = sensitivity, specificity = specificity)
#>          TP          FP          TN          FN sensitivity specificity 
#>       24.00       13.00       53.00       15.00        0.62        0.80

10.2 Reflection

  • Which error type (FP or FN) would you expect to be more costly for Lloyd’s in a real risk context?
  • How would that change how you choose a cutoff?
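One way to explore the second question is to recompute sensitivity and specificity at several cutoffs. This sketch reuses p_j and y_j from 10.1; the cutoff values are arbitrary.

for (cut in c(0.3, 0.5, 0.7)) {
  yhat <- ifelse(p_j > cut, 1, 0)
  sens <- sum(yhat == 1 & y_j == 1) / sum(y_j == 1)
  spec <- sum(yhat == 0 & y_j == 0) / sum(y_j == 0)
  print(round(c(cutoff = cut, sensitivity = sens, specificity = spec), 2))
}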