class: center, middle, inverse, title-slide # Models with multiple predictors
👯♂️ ### S. Mason Garrison --- layout: true <div class="my-footer"> <span> <a href="https://DataScience4Psych.github.io/DataScience4Psych/" target="_blank">Data Science for Psychologists</a> </span> </div> --- class: middle # The linear model with multiple predictors --- ## Data: Book weight and volume The `allbacks` data frame gives measurements on the volume and weight of 15 books, some of which are paperback and some of which are hardback .pull-left[ - Volume - cubic centimeters - Area - square centimeters - Weight - grams ] .pull-right[ .small[ ``` ## # A tibble: 15 x 4 ## volume area weight cover ## <dbl> <dbl> <dbl> <fct> ## 1 885 382 800 hb ## 2 1016 468 950 hb ## 3 1125 387 1050 hb ## 4 239 371 350 hb ## 5 701 371 750 hb ## 6 641 367 600 hb ## 7 1228 396 1075 hb ## 8 412 0 250 pb ## 9 953 0 700 pb ## 10 929 0 650 pb ## 11 1492 0 975 pb ## 12 419 0 350 pb ## 13 1010 0 950 pb ## 14 595 0 425 pb ## 15 1034 0 725 pb ``` ] ] .footnote[ .small[ These books are from the bookshelf of J. H. Maindonald at Australian National University. ] ] --- ## Book weight vs. volume .pull-left[ ```r linear_reg() %>% set_engine("lm") %>% fit(weight ~ volume, data = allbacks) %>% tidy() ``` ``` ## # A tibble: 2 x 5 ## term estimate std.error statistic p.value ## <chr> <dbl> <dbl> <dbl> <dbl> ## 1 (Intercept) 108. 88.4 1.22 0.245 ## 2 volume 0.709 0.0975 7.27 0.00000626 ``` ] .pull-right[ <img src="d23_multiple_files/figure-html/unnamed-chunk-4-1.png" width="75%" style="display: block; margin: auto 0 auto auto;" /> ] --- ## Book weight vs. volume and cover .pull-left[ ```r linear_reg() %>% set_engine("lm") %>% fit(weight ~ volume + cover, data = allbacks) %>% tidy() ``` ``` ## # A tibble: 3 x 5 ## term estimate std.error statistic p.value ## <chr> <dbl> <dbl> <dbl> <dbl> ## 1 (Intercept) 198. 59.2 3.34 0.00584 ## 2 volume 0.718 0.0615 11.7 0.0000000660 ## 3 coverpb -184. 40.5 -4.55 0.000672 ``` ] .pull-right[ <img src="d23_multiple_files/figure-html/unnamed-chunk-6-1.png" width="75%" style="display: block; margin: auto 0 auto auto;" /> ] --- ## Interpretation of estimates ``` ## # A tibble: 3 x 5 ## term estimate std.error statistic p.value ## <chr> <dbl> <dbl> <dbl> <dbl> ## 1 (Intercept) 198. 59.2 3.34 0.00584 ## 2 volume 0.718 0.0615 11.7 0.0000000660 ## 3 coverpb -184. 40.5 -4.55 0.000672 ``` -- - **Slope - volume:** *All else held constant*, for each additional cubic centimeter books are larger in volume, we would expect the weight to be higher, on average, by 0.718 grams. -- - **Slope - cover:** *All else held constant*, paperback books are weigh, on average, by 184 grams less than hardcover books. -- - **Intercept:** Hardcover books with 0 volume are expected to weigh 198 grams, on average. (Doesn't make sense in context.) --- ## Main vs. interaction effects .question[ Suppose we want to predict weight of books from their volume and cover type (hardback vs. paperback). Do you think a model with main effects or interaction effects is more appropriate? Explain your reasoning. **Hint:** Main effects would mean rate at which weight changes as volume increases would be the same for hardback and paperback books and interaction effects would mean the rate at which weight changes as volume increases would be different for hardback and paperback books. ] --- <img src="d23_multiple_files/figure-html/book-main-int-1.png" width="65%" style="display: block; margin: auto;" /> --- ## In pursuit of Occam's razor - Occam's Razor states that among competing hypotheses that predict equally well, the one with the fewest assumptions should be selected. -- - Model selection follows this principle. -- - We only want to add another variable to the model if the addition of that variable brings something valuable in terms of predictive power to the model. -- - In other words, we prefer the simplest best model, i.e. **parsimonious** model. --- .pull-left[ <img src="d23_multiple_files/figure-html/unnamed-chunk-8-1.png" width="100%" style="display: block; margin: auto;" /> ] .pull-right[ .question[ Visually, which of the two models is preferable under Occam's razor? ] ] --- ## R-squared - `\(R^2\)` is the percentage of variability in the response variable explained by the regression model. ```r glance(book_main_fit)$r.squared ``` ``` ## [1] 0.9274776 ``` ```r glance(book_int_fit)$r.squared ``` ``` ## [1] 0.9297137 ``` -- - Clearly the model with interactions has a higher `\(R^2\)`. -- - However using `\(R^2\)` for model selection in models with multiple explanatory variables is not a good idea as `\(R^2\)` increases when **any** variable is added to the model. --- ## Adjusted R-squared ... a (more) objective measure for model selection - Adjusted `\(R^2\)` doesn't increase if the new variable does not provide any new information or is completely unrelated, as it applies a penalty for number of variables included in the model. - This makes adjusted `\(R^2\)` a preferable metric for model selection in multiple regression models. --- ## Comparing models .pull-left[ ```r glance(book_main_fit)$r.squared ``` ``` ## [1] 0.9274776 ``` ```r glance(book_int_fit)$r.squared ``` ``` ## [1] 0.9297137 ``` ] .pull-right[ ```r glance(book_main_fit)$adj.r.squared ``` ``` ## [1] 0.9153905 ``` ```r glance(book_int_fit)$adj.r.squared ``` ``` ## [1] 0.9105447 ``` ] -- .pull-left-wide[ .small[ - Is R-sq higher for int model? ```r glance(book_int_fit)$r.squared > glance(book_main_fit)$r.squared ``` ``` ## [1] TRUE ``` - Is R-sq adj. higher for int model? ```r glance(book_int_fit)$adj.r.squared > glance(book_main_fit)$adj.r.squared ``` ``` ## [1] FALSE ``` ] ] --- class: middle # Wrapping Up... --- class: middle # Two numerical predictors --- ## The data ```r pp <- read_csv( "data/paris-paintings.csv", na = c("n/a", "", "NA") ) %>% mutate(log_price = log(price)) ``` --- ## Multiple predictors - Response variable: `log_price` - Explanatory variables: Width and height ```r pp_fit <- linear_reg() %>% set_engine("lm") %>% fit(log_price ~ Width_in + Height_in, data = pp) tidy(pp_fit) ``` ``` ## # A tibble: 3 x 5 ## term estimate std.error statistic p.value ## <chr> <dbl> <dbl> <dbl> <dbl> ## 1 (Intercept) 4.77 0.0579 82.4 0 ## 2 Width_in 0.0269 0.00373 7.22 6.58e-13 ## 3 Height_in -0.0133 0.00395 -3.36 7.93e- 4 ``` --- ## Linear model with multiple predictors ``` ## # A tibble: 3 x 5 ## term estimate std.error statistic p.value ## <chr> <dbl> <dbl> <dbl> <dbl> ## 1 (Intercept) 4.77 0.0579 82.4 0 ## 2 Width_in 0.0269 0.00373 7.22 6.58e-13 ## 3 Height_in -0.0133 0.00395 -3.36 7.93e- 4 ``` <br> `$$\widehat{log\_price} = 4.77 + 0.0269 \times width - 0.0133 \times height$$` --- ## Visualizing models with multiple predictors .panelset[ .panel[.panel-name[Plot] .pull-left-wide[
] ] .panel[.panel-name[Code] ```r p <- plot_ly(pp, x = ~Width_in, y = ~Height_in, z = ~log_price, marker = list(size = 3, color = "lightgray", alpha = 0.5, line = list(color = "gray", width = 2))) %>% add_markers() %>% plotly::layout(scene = list( xaxis = list(title = "Width (in)"), yaxis = list(title = "Height (in)"), zaxis = list(title = "log_price") )) %>% config(displayModeBar = FALSE) frameWidget(p) ``` ] ] --- class: middle # Numerical and categorical predictors --- ## Price, surface area, and living artist - Explore the relationship between price of paintings and surface area, conditioned on whether or not the artist is still living - First visualize and explore, then model - But first, prep the data .midi[ ```r pp <- pp %>% mutate(artistliving = if_else(artistliving == 0, "Deceased", "Living")) pp %>% count(artistliving) ``` ``` ## # A tibble: 2 x 2 ## artistliving n ## <chr> <int> ## 1 Deceased 2937 ## 2 Living 456 ``` ] --- ## Typical surface area .panelset[ .panel[.panel-name[Plot] .pull-left-narrow[ Typical surface area appears to be less than 1000 square inches (~ 80cm x 80cm). There are very few paintings that have surface area above 5000 square inches. ] .pull-right-wide[ <img src="d23_multiple_files/figure-html/unnamed-chunk-15-1.png" width="90%" style="display: block; margin: auto;" /> ] ] .panel[.panel-name[Code] ```r ggplot(data = pp, aes(x = Surface, fill = artistliving)) + geom_histogram(binwidth = 500) + facet_grid(artistliving ~ .) + scale_fill_manual(values = c("#E48957", "#071381")) + guides(fill = FALSE) + labs(x = "Surface area", y = NULL) + geom_vline(xintercept = 1000) + geom_vline(xintercept = 5000, linetype = "dashed", color = "gray") ``` ``` ## Warning: `guides(<scale> = FALSE)` is deprecated. Please use ## `guides(<scale> = "none")` instead. ``` ``` ## Warning: Removed 176 rows containing non-finite values ## (stat_bin). ``` ] ] --- ## Narrowing the scope .panel[.panel-name[Plot] For simplicity let's focus on the paintings with `Surface < 5000`: <img src="d23_multiple_files/figure-html/unnamed-chunk-16-1.png" width="55%" style="display: block; margin: auto;" /> ] --- .panel[.panel-name[Code] ```r pp_Surf_lt_5000 <- pp %>% filter(Surface < 5000) ggplot(data = pp_Surf_lt_5000, aes(y = log_price, x = Surface, color = artistliving, shape = artistliving)) + geom_point(alpha = 0.5) + labs(color = "Artist", shape = "Artist") + scale_color_manual(values = c("#E48957", "#071381")) ``` ] --- ## Facet to get a better look .panel[.panel-name[Plot] <img src="d23_multiple_files/figure-html/unnamed-chunk-17-1.png" width="80%" style="display: block; margin: auto;" /> ] --- .panel[.panel-name[Code] ```r ggplot(data = pp_Surf_lt_5000, aes(y = log_price, x = Surface, color = artistliving, shape = artistliving)) + geom_point(alpha = 0.5) + facet_wrap(~artistliving) + scale_color_manual(values = c("#E48957", "#071381")) + labs(color = "Artist", shape = "Artist") ``` ] --- ## Two ways to model - **Main effects:** Assuming relationship between surface and logged price **does not vary** by whether or not the artist is living. - **Interaction effects:** Assuming relationship between surface and logged price **varies** by whether or not the artist is living. --- ## Interacting explanatory variables - Including an interaction effect in the model allows for different slopes, i.e. nonparallel lines. - This model implies that the regression coefficient for an explanatory variable would change as another explanatory variable changes. - This can be accomplished by adding an interaction variable: the product of two explanatory variables. --- ## Two ways to model .pull-left-narrow[ - **Main effects:** Assuming relationship between surface and logged price **does not vary** by whether or not the artist is living - **Interaction effects:** Assuming relationship between surface and logged price **varies** by whether or not the artist is living ] .pull-right-wide[ <img src="d23_multiple_files/figure-html/pp-main-int-viz-1.png" width="70%" style="display: block; margin: auto;" /> ] --- ## Fit model with main effects - Response variable: `log_price` - Explanatory variables: `Surface` area and `artistliving` .midi[ ```r pp_main_fit <- linear_reg() %>% set_engine("lm") %>% fit(log_price ~ Surface + artistliving, data = pp_Surf_lt_5000) tidy(pp_main_fit) ``` ``` ## # A tibble: 3 x 5 ## term estimate std.error statistic p.value ## <chr> <dbl> <dbl> <dbl> <dbl> ## 1 (Intercept) 4.88 0.0424 115. 0 ## 2 Surface 0.000265 0.0000415 6.39 1.85e-10 ## 3 artistlivingLiving 0.137 0.0970 1.41 1.57e- 1 ``` ] -- `$$\widehat{log\_price} = 4.88 + 0.000265 \times surface + 0.137 \times artistliving$$` --- ## Solving the model - Non-living artist: Plug in 0 for `artistliving` `\(\widehat{log\_price} = 4.88 + 0.000265 \times surface + 0.137 \times 0\)` `\(= 4.88 + 0.000265 \times surface\)` -- - Living artist: Plug in 1 for `artistliving` `\(\widehat{log\_price} = 4.88 + 0.000265 \times surface + 0.137 \times 1\)` `\(= 5.017 + 0.000265 \times surface\)` --- ## Visualizing main effects .pull-left-narrow[ - **Same slope:** Rate of change in price as the surface area increases does not vary between paintings by living and non-living artists. - **Different intercept:** Paintings by living artists are consistently more expensive than paintings by non-living artists. ] .pull-right-wide[ <img src="d23_multiple_files/figure-html/unnamed-chunk-18-1.png" width="100%" style="display: block; margin: auto;" /> ] --- ## Interpreting main effects .midi[ ```r tidy(pp_main_fit) %>% mutate(exp_estimate = exp(estimate)) %>% select(term, estimate, exp_estimate) ``` ``` ## # A tibble: 3 x 3 ## term estimate exp_estimate ## <chr> <dbl> <dbl> ## 1 (Intercept) 4.88 132. ## 2 Surface 0.000265 1.00 ## 3 artistlivingLiving 0.137 1.15 ``` ] - All else held constant, for each additional square inch in painting's surface area, the price of the painting is predicted, on average, to be higher by a factor of 1. - All else held constant, paintings by a living artist are predicted, on average, to be higher by a factor of 1.15 compared to paintings by an artist who is no longer alive. - Paintings that are by an artist who is not alive and that have a surface area of 0 square inches are predicted, on average, to be 132 livres. --- ## Main vs. interaction effects - The way we specified our main effects model only lets `artistliving` affect the intercept. - Model implicitly assumes that paintings with living and deceased artists have the *same slope* and only allows for *different intercepts*. .question[ What seems more appropriate in this case? + Same slope and same intercept for both colors + Same slope and different intercept for both colors + Different slope and different intercept for both colors ] --- ## Interaction: `Surface * artistliving` <img src="d23_multiple_files/figure-html/unnamed-chunk-19-1.png" width="60%" style="display: block; margin: auto;" /> --- ## Fit model with interaction effects - Response variable: log_price - Explanatory variables: `Surface` area, `artistliving`, and their interaction .midi[ ```r pp_int_fit <- linear_reg() %>% set_engine("lm") %>% fit(log_price ~ Surface * artistliving, data = pp_Surf_lt_5000) tidy(pp_int_fit) ``` ``` ## # A tibble: 4 x 5 ## term estimate std.error statistic p.value ## <chr> <dbl> <dbl> <dbl> <dbl> ## 1 (Intercept) 4.91e+0 0.0432 114. 0 ## 2 Surface 2.06e-4 0.0000442 4.65 3.37e-6 ## 3 artistlivingLiving -1.26e-1 0.119 -1.06 2.89e-1 ## 4 Surface:artistlivingLiving 4.79e-4 0.000126 3.81 1.39e-4 ``` ] --- ## Linear model with interaction effects .midi[ ``` ## # A tibble: 4 x 5 ## term estimate std.error statistic p.value ## <chr> <dbl> <dbl> <dbl> <dbl> ## 1 (Intercept) 4.91e+0 0.0432 114. 0 ## 2 Surface 2.06e-4 0.0000442 4.65 3.37e-6 ## 3 artistlivingLiving -1.26e-1 0.119 -1.06 2.89e-1 ## 4 Surface:artistlivingLiving 4.79e-4 0.000126 3.81 1.39e-4 ``` ] `$$\widehat{log\_price} = 4.91 + 0.00021 \times surface - 0.126 \times artistliving$$` `$$+ ~ 0.00048 \times surface * artistliving$$` --- ## Interpretation of interaction effects - Rate of change in price as the surface area of the painting increases does vary between paintings by living and non-living artists (different slopes), - Some paintings by living artists are more expensive than paintings by non-living artists, and some are not (different intercept). .small[ .pull-left[ - Non-living artist: `\(\widehat{log\_price} = 4.91 + 0.00021 \times surface\)` `\(- 0.126 \times 0 + 0.00048 \times surface \times 0\)` `\(= 4.91 + 0.00021 \times surface\)` - Living artist: `\(\widehat{log\_price} = 4.91 + 0.00021 \times surface\)` `\(- 0.126 \times 1 + 0.00048 \times surface \times 1\)` `\(= 4.91 + 0.00021 \times surface\)` `\(- 0.126 + 0.00048 \times surface\)` `\(= 4.784 + 0.00069 \times surface\)` ] .pull-right[ <img src="d23_multiple_files/figure-html/viz-interaction-effects2-1.png" width="80%" style="display: block; margin: auto;" /> ] ] --- ## Comparing models It appears that adding the interaction actually increased adjusted `\(R^2\)`, so we should indeed use the model with the interactions. ```r glance(pp_main_fit)$adj.r.squared ``` ``` ## [1] 0.01258977 ``` ```r glance(pp_int_fit)$adj.r.squared ``` ``` ## [1] 0.01676753 ``` --- ## Third order interactions - Can you? Yes - Should you? Probably not if you want to interpret these interactions in context of the data. --- class: middle # Wrapping Up... <br> Sources: - Mine Çetinkaya-Rundel's Data Science in a Box ([link](https://datasciencebox.org/)) - Julia Fukuyama's EDA ([link](https://jfukuyama.github.io/))