Models with multiple predictors 👯‍♂️

# Models with multiple predictors<br> 👯‍♂️
### S. Mason Garrison

---

layout: true
  
<div class="my-footer">
<span>
<a href="https://DataScience4Psych.github.io/DataScience4Psych/" target="_blank">Data Science for Psychologists</a>
</span>
</div>

---

# The linear model with multiple predictors

---

## Data: Book weight and volume

The `allbacks` data frame gives measurements on the volume and weight of 15 books, some of which are paperback and some of which are hardback

.pull-left[
- Volume - cubic centimeters
- Area - square centimeters
- Weight - grams
]
.pull-right[
.small[

```
## # A tibble: 15 x 4
##    volume  area weight cover
##     <dbl> <dbl>  <dbl> <fct>
##  1    885   382    800 hb   
##  2   1016   468    950 hb   
##  3   1125   387   1050 hb   
##  4    239   371    350 hb   
##  5    701   371    750 hb   
##  6    641   367    600 hb   
##  7   1228   396   1075 hb   
##  8    412     0    250 pb   
##  9    953     0    700 pb   
## 10    929     0    650 pb   
## 11   1492     0    975 pb   
## 12    419     0    350 pb   
## 13   1010     0    950 pb   
## 14    595     0    425 pb   
## 15   1034     0    725 pb
```
]
]

.footnote[
.small[
These books are from the bookshelf of J. H. Maindonald at Australian National University.
]
]

---

## Book weight vs. volume

```r
linear_reg() %>%
  set_engine("lm") %>%
  fit(weight ~ volume, data = allbacks) %>%
  tidy()
```

```
## # A tibble: 2 x 5
##   term        estimate std.error statistic    p.value
##   <chr>          <dbl>     <dbl>     <dbl>      <dbl>
## 1 (Intercept)  108.      88.4         1.22 0.245     
## 2 volume         0.709    0.0975      7.27 0.00000626
```
]
.pull-right[
<img src="d23_multiple_files/figure-html/unnamed-chunk-4-1.png" width="75%" style="display: block; margin: auto 0 auto auto;" />
]

---

## Book weight vs. volume and cover

```r
linear_reg() %>%
  set_engine("lm") %>%
  fit(weight ~ volume + cover, data = allbacks) %>%
  tidy()
```

```
## # A tibble: 3 x 5
##   term        estimate std.error statistic      p.value
##   <chr>          <dbl>     <dbl>     <dbl>        <dbl>
## 1 (Intercept)  198.      59.2         3.34 0.00584     
## 2 volume         0.718    0.0615     11.7  0.0000000660
## 3 coverpb     -184.      40.5        -4.55 0.000672
```
]
.pull-right[
<img src="d23_multiple_files/figure-html/unnamed-chunk-6-1.png" width="75%" style="display: block; margin: auto 0 auto auto;" />
]

---

## Interpretation of estimates

--
- **Slope - volume:** *All else held constant*, for each additional cubic centimeter books are larger in volume, we would expect the weight to be higher, on average, by 0.718 grams.

--
- **Slope - cover:** *All else held constant*, paperback books are weigh, on average, by 184 grams less than hardcover books.

--
- **Intercept:** Hardcover books with 0 volume are expected to weigh 198 grams, on average. (Doesn't make sense in context.)

---

## Main vs. interaction effects

.question[
Suppose we want to predict weight of books from their volume and cover type (hardback vs. paperback). Do you think a model with main effects or interaction effects is more appropriate? Explain your reasoning.

**Hint:** Main effects would mean rate at which weight changes as volume increases would be the same for hardback and paperback books and interaction effects would mean the rate at which weight changes as volume increases would be different for hardback and paperback books.
]

---

---

## In pursuit of Occam's razor

- Occam's Razor states that among competing hypotheses that predict equally well, the one with the fewest assumptions should be selected.

--
- Model selection follows this principle.

--
- We only want to add another variable to the model if the addition of that variable brings something valuable in terms of predictive power to the model.

--
- In other words, we prefer the simplest best model, i.e. **parsimonious** model.

---

.pull-left[
<img src="d23_multiple_files/figure-html/unnamed-chunk-8-1.png" width="100%" style="display: block; margin: auto;" />
]
.pull-right[
.question[
Visually, which of the two models is preferable under Occam's razor?
]
]

---

## R-squared

- `$R^2$` is the percentage of variability in the response variable explained by the regression model.

```r
glance(book_main_fit)$r.squared
```

```
## [1] 0.9274776
```

```r
glance(book_int_fit)$r.squared
```

```
## [1] 0.9297137
```

--
- Clearly the model with interactions has a higher `$R^2$`.

--
- However using `$R^2$` for model selection in models with multiple explanatory variables is not a good idea as `$R^2$` increases when **any** variable is added to the model.

---

## Adjusted R-squared

... a (more) objective measure for model selection

- Adjusted `$R^2$` doesn't increase if the new variable does not provide any new information or is completely unrelated, as it applies a penalty for number of variables included in the model.
- This makes adjusted `$R^2$` a preferable metric for model selection in multiple regression models.

---

## Comparing models

```r
glance(book_main_fit)$r.squared
```

```
## [1] 0.9274776
```

```r
glance(book_int_fit)$r.squared
```

```
## [1] 0.9297137
```
]
.pull-right[

```r
glance(book_main_fit)$adj.r.squared
```

```
## [1] 0.9153905
```

```r
glance(book_int_fit)$adj.r.squared
```

```
## [1] 0.9105447
```
]

```r
glance(book_int_fit)$r.squared > glance(book_main_fit)$r.squared 
```

```
## [1] TRUE
```

- Is R-sq adj. higher for int model?

```r
glance(book_int_fit)$adj.r.squared > glance(book_main_fit)$adj.r.squared
```

```
## [1] FALSE
```
]
]

---

# Wrapping Up...

---

# Two numerical predictors

---

## The data

```r
pp <- read_csv(
  "data/paris-paintings.csv",
  na = c("n/a", "", "NA")
) %>%
  mutate(log_price = log(price))
```

---

## Multiple predictors

- Response variable: `log_price` 
- Explanatory variables: Width and height

```r
pp_fit <- linear_reg() %>%
  set_engine("lm") %>%
  fit(log_price ~ Width_in + Height_in, data = pp)
tidy(pp_fit)
```

```
## # A tibble: 3 x 5
##   term        estimate std.error statistic  p.value
##   <chr>          <dbl>     <dbl>     <dbl>    <dbl>
## 1 (Intercept)   4.77     0.0579      82.4  0       
## 2 Width_in      0.0269   0.00373      7.22 6.58e-13
## 3 Height_in    -0.0133   0.00395     -3.36 7.93e- 4
```

---

##  Linear model with multiple predictors

<br>

`$$\widehat{log\_price} = 4.77 + 0.0269 \times width - 0.0133 \times height$$`

---

## Visualizing models with multiple predictors

.panelset[
.panel[.panel-name[Plot]
.pull-left-wide[
<div id="htmlwidget-dd92764ef4558d96bb92" style="width:100%;height:1483.2px;" class="widgetframe html-widget"></div>
<script type="application/json" data-for="htmlwidget-dd92764ef4558d96bb92">{"x":{"url":"d23_multiple_files/figure-html//widgets/widget_plotly-plot.html","options":{"xdomain":"*","allowfullscreen":false,"lazyload":false}},"evals":[],"jsHooks":[]}</script>
]
]
.panel[.panel-name[Code]

```r
p <- plot_ly(pp,
  x = ~Width_in, y = ~Height_in, z = ~log_price,
  marker = list(size = 3, color = "lightgray", alpha = 0.5, 
                line = list(color = "gray", width = 2))) %>%
  add_markers() %>%
  plotly::layout(scene = list(
    xaxis = list(title = "Width (in)"),
    yaxis = list(title = "Height (in)"),
    zaxis = list(title = "log_price")
  )) %>%
  config(displayModeBar = FALSE)
frameWidget(p)
```
]
]

---

# Numerical and categorical predictors

---

## Price, surface area, and living artist

- Explore the relationship between price of paintings and surface area, conditioned 
on whether or not the artist is still living
- First visualize and explore, then model
- But first, prep the data

```r
pp <- pp %>%
  mutate(artistliving = if_else(artistliving == 0, "Deceased", "Living"))

pp %>%
  count(artistliving)
```

```
## # A tibble: 2 x 2
##   artistliving     n
##   <chr>        <int>
## 1 Deceased      2937
## 2 Living         456
```
]

---

## Typical surface area

.panelset[
.panel[.panel-name[Plot]
.pull-left-narrow[
Typical surface area appears to be less than 1000 square inches (~ 80cm x 80cm). There are very few paintings that have surface area above 5000 square inches.
]
.pull-right-wide[
<img src="d23_multiple_files/figure-html/unnamed-chunk-15-1.png" width="90%" style="display: block; margin: auto;" />
]
]
.panel[.panel-name[Code]

```r
ggplot(data = pp, aes(x = Surface, fill = artistliving)) +
  geom_histogram(binwidth = 500) + 
  facet_grid(artistliving ~ .) +
  scale_fill_manual(values = c("#E48957", "#071381")) +
  guides(fill = FALSE) +
  labs(x = "Surface area", y = NULL) +
  geom_vline(xintercept = 1000) +
  geom_vline(xintercept = 5000, linetype = "dashed", color = "gray")
```

```
## Warning: `guides(<scale> = FALSE)` is deprecated. Please use
## `guides(<scale> = "none")` instead.
```

```
## Warning: Removed 176 rows containing non-finite values
## (stat_bin).
```
]
]

---

## Narrowing the scope

<img src="d23_multiple_files/figure-html/unnamed-chunk-16-1.png" width="55%" style="display: block; margin: auto;" />
]

---

```r
pp_Surf_lt_5000 <- pp %>%
  filter(Surface < 5000)

ggplot(data = pp_Surf_lt_5000, 
       aes(y = log_price, x = Surface, color = artistliving, shape = artistliving)) +
  geom_point(alpha = 0.5) +
  labs(color = "Artist", shape = "Artist") +
  scale_color_manual(values = c("#E48957", "#071381"))
```
]

---

## Facet to get a better look

.panel[.panel-name[Plot]
<img src="d23_multiple_files/figure-html/unnamed-chunk-17-1.png" width="80%" style="display: block; margin: auto;" />
]

---
.panel[.panel-name[Code]

```r
ggplot(data = pp_Surf_lt_5000, 
       aes(y = log_price, x = Surface, color = artistliving, shape = artistliving)) +
  geom_point(alpha = 0.5) +
  facet_wrap(~artistliving) +
  scale_color_manual(values = c("#E48957", "#071381")) +
  labs(color = "Artist", shape = "Artist")
```

]

---

## Two ways to model

- **Main effects:** Assuming relationship between surface and logged price 
**does not vary** by whether or not the artist is living.
- **Interaction effects:** Assuming relationship between surface and logged price **varies** by whether or not the artist is living.

---

## Interacting explanatory variables

- Including an interaction effect in the model allows for different slopes, i.e. nonparallel lines.
- This model implies that the regression coefficient for an explanatory variable would change as another explanatory variable changes.
- This can be accomplished by adding an interaction variable: the product of two explanatory variables.

---

## Two ways to model

.pull-left-narrow[
- **Main effects:** Assuming relationship between surface and logged price **does not vary** by whether or not the artist is living
- **Interaction effects:** Assuming relationship between surface and logged price **varies** by whether or not the artist is living
]
.pull-right-wide[
<img src="d23_multiple_files/figure-html/pp-main-int-viz-1.png" width="70%" style="display: block; margin: auto;" />
]

---

## Fit model with main effects

- Response variable: `log_price`
- Explanatory variables: `Surface` area and `artistliving`

```r
pp_main_fit <- linear_reg() %>%
  set_engine("lm") %>%
  fit(log_price ~ Surface + artistliving, data = pp_Surf_lt_5000)
tidy(pp_main_fit)
```

```
## # A tibble: 3 x 5
##   term               estimate std.error statistic  p.value
##   <chr>                 <dbl>     <dbl>     <dbl>    <dbl>
## 1 (Intercept)        4.88     0.0424       115.   0       
## 2 Surface            0.000265 0.0000415      6.39 1.85e-10
## 3 artistlivingLiving 0.137    0.0970         1.41 1.57e- 1
```
]

`$$\widehat{log\_price} = 4.88 + 0.000265 \times surface + 0.137 \times artistliving$$`

---

## Solving the model

- Non-living artist: Plug in 0 for `artistliving`

`$\widehat{log\_price} = 4.88 + 0.000265 \times surface + 0.137 \times 0$`  
`$= 4.88 + 0.000265 \times surface$`

--
- Living artist: Plug in 1 for `artistliving`

`$\widehat{log\_price} = 4.88 + 0.000265 \times surface + 0.137 \times 1$`   
`$= 5.017 + 0.000265 \times surface$`

---

## Visualizing main effects

.pull-left-narrow[
- **Same slope:** Rate of change in price as the surface area increases does 
not vary between paintings by living and non-living artists.
- **Different intercept:** Paintings by living artists are consistently more 
expensive than paintings by non-living artists.
]
.pull-right-wide[
<img src="d23_multiple_files/figure-html/unnamed-chunk-18-1.png" width="100%" style="display: block; margin: auto;" />
]

---

## Interpreting main effects

```r
tidy(pp_main_fit) %>% 
  mutate(exp_estimate = exp(estimate)) %>%
  select(term, estimate, exp_estimate)
```

```
## # A tibble: 3 x 3
##   term               estimate exp_estimate
##   <chr>                 <dbl>        <dbl>
## 1 (Intercept)        4.88           132.  
## 2 Surface            0.000265         1.00
## 3 artistlivingLiving 0.137            1.15
```
]

- All else held constant, for each additional square inch in painting's surface area, the price of the painting is predicted, on average, to be higher by a factor of 1.
- All else held constant, paintings by a living artist are predicted, on average, to be higher by a factor of 1.15 compared to paintings by an artist who is no longer alive.
- Paintings that are by an artist who is not alive and that have a surface area of 0 square inches are predicted, on average, to be 132 livres.

---

## Main vs. interaction effects

- The way we specified our main effects model only lets `artistliving` affect the intercept.
- Model implicitly assumes that paintings with living and deceased artists have the *same slope* and only allows for *different intercepts*.

.question[
What seems more appropriate in this case?
+ Same slope and same intercept for both colors
+ Same slope and different intercept for both colors
+ Different slope and different intercept for both colors
]

---

## Interaction: `Surface * artistliving`

---

## Fit model with interaction effects

- Response variable: log_price
- Explanatory variables: `Surface` area, `artistliving`, and their interaction

```r
pp_int_fit <- linear_reg() %>%
  set_engine("lm") %>%
  fit(log_price ~ Surface * artistliving, data = pp_Surf_lt_5000)
tidy(pp_int_fit)
```

```
## # A tibble: 4 x 5
##   term                       estimate std.error statistic p.value
##   <chr>                         <dbl>     <dbl>     <dbl>   <dbl>
## 1 (Intercept)                 4.91e+0 0.0432       114.   0      
## 2 Surface                     2.06e-4 0.0000442      4.65 3.37e-6
## 3 artistlivingLiving         -1.26e-1 0.119         -1.06 2.89e-1
## 4 Surface:artistlivingLiving  4.79e-4 0.000126       3.81 1.39e-4
```
]

---

## Linear model with interaction effects

`$$\widehat{log\_price} = 4.91 + 0.00021 \times surface - 0.126 \times artistliving$$`
`$$+ ~ 0.00048 \times surface * artistliving$$`

---

## Interpretation of interaction effects

- Rate of change in price as the surface area of the painting increases does 
vary between paintings by living and non-living artists (different slopes), 
- Some paintings by living artists are more expensive than paintings by
non-living artists, and some are not (different intercept).

.small[
.pull-left[
- Non-living artist: 
`$\widehat{log\_price} = 4.91 + 0.00021 \times surface$`
`$- 0.126 \times 0 + 0.00048 \times surface \times 0$`
`$= 4.91 + 0.00021 \times surface$`
- Living artist: 
`$\widehat{log\_price} = 4.91 + 0.00021 \times surface$`
`$- 0.126 \times 1 + 0.00048 \times surface \times 1$`
`$= 4.91 + 0.00021 \times surface$`
`$- 0.126 + 0.00048 \times surface$`
`$= 4.784 + 0.00069 \times surface$`
]
.pull-right[
<img src="d23_multiple_files/figure-html/viz-interaction-effects2-1.png" width="80%" style="display: block; margin: auto;" />
]
]

---

## Comparing models

It appears that adding the interaction actually increased adjusted `$R^2$`, so we 
should indeed use the model with the interactions.

```r
glance(pp_main_fit)$adj.r.squared
```

```
## [1] 0.01258977
```

```r
glance(pp_int_fit)$adj.r.squared
```

```
## [1] 0.01676753
```

---

## Third order interactions

- Can you? Yes
- Should you? Probably not if you want to interpret these interactions in context
of the data.

---

# Wrapping Up...

<br>

Sources:
- Mine Çetinkaya-Rundel's Data Science in a Box ([link](https://datasciencebox.org/))
- Julia Fukuyama's EDA ([link](https://jfukuyama.github.io/))