class: center, middle, inverse, title-slide .title[ # Scientific studies and confounding
😖
]
.author[
### S. Mason Garrison
]

---
layout: true

<div class="my-footer">
<span>
<a href="https://DataScience4Psych.github.io/DataScience4Psych/" target="_blank">Data Science for Psychologists</a>
</span>
</div>

---
class: middle

# Scientific studies

---

## Scientific studies

.pull-left[
**Observational**

- Collect data in a way that does not interfere with how the data arise ("observe")
- Establish associations
]
.pull-right[
**Experimental**

- Randomly assign subjects to treatments
- Establish causal connections
]

<br>

--

.question[
Obviously, there is more to causal inference, but for the sake of time...
]

---

.question[
What type of study is the following, observational or experimental? What does that mean in terms of causal conclusions?
]

- Researchers studying the relationship between exercising and energy levels asked participants in their study how many times a week they exercise and whether they have high or low energy when they wake up in the morning.
- Based on responses to the exercise question, the researchers grouped people into three categories (no exercise, exercise 1-3 times a week, and exercise more than 3 times a week).
- The researchers then compared the proportions of people who said they have high energy in the mornings across the three exercise categories.

---

.question[
What type of study is the following, observational or experimental? What does that mean in terms of causal conclusions?
]

- Researchers studying the relationship between exercising and energy levels randomly assigned participants in their study into three groups: no exercise, exercise 1-3 times a week, and exercise more than 3 times a week.
- After one week, participants were asked whether they have high or low energy when they wake up in the morning.
- The researchers then compared the proportions of people who said they have high energy in the mornings across the three exercise categories.
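
---

## Sketch: random assignment in R

The random assignment described in the second scenario can be sketched in base R. This is only an illustration under stated assumptions: the 30 participant ids, the seed, and the object names are hypothetical and do not come from any real study.

``` r
# Hypothetical roster of 30 participants (not real data)
set.seed(1)
participants <- data.frame(id = 1:30)

# The three treatment groups named on the slide
groups <- c("no exercise", "1-3 times a week", "more than 3 times a week")

# Build a balanced vector of labels, then shuffle it so that
# chance alone decides who lands in which group
participants$group <- sample(rep(groups, length.out = nrow(participants)))

# Each group should contain 10 participants
group_sizes <- table(participants$group)
group_sizes
```

Because the labels are shuffled rather than chosen by the participants, any pre-existing differences between people are spread across the groups on average, which is what supports a causal reading.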
---
class: middle

# Case study: Breakfast cereal keeps girls slim

---

.medi[
> *Girls who ate breakfast of any type had a lower average body mass index (BMI), a common obesity gauge, than those who said they didn't. The index was even lower for girls who said they ate cereal for breakfast, according to findings of the study conducted by the Maryland Medical Research Institute with funding from the National Institutes of Health (NIH) and cereal-maker General Mills.* [...]
]

--

.medi[
> *The results were gleaned from a larger NIH survey of 2,379 girls in California, Ohio, and Maryland who were tracked between the ages of 9 and 19.* [...]
]

--

.medi[
> *As part of the survey, the girls were asked once a year what they had eaten during the previous three days.* [...]
]

--

.footnote[
Source: [Study: Cereal Keeps Girls Slim](https://www.cbsnews.com/news/study-cereal-keeps-girls-slim/)
]

---

## Explanatory and response variables

- Explanatory variable: Whether the participant ate breakfast or not
- Response variable: BMI of the participant

---

## Three possible explanations

--

1. Eating breakfast causes girls to be slimmer

--

2. Being slim causes girls to eat breakfast

--

3. A third variable is responsible for both. This is a **confounding** variable:
    - an extraneous variable that affects both the explanatory and the response variable,
    - and that makes it seem like there is a relationship between them

---

## Correlation != causation

.center[
<img src="img/xkcdcorrelation.png" width="60%" height="100%" style="display: block; margin: auto;" />
]

.footnote[
Randall Munroe CC BY-NC 2.5 http://xkcd.com/552/
]

---

## Studies and conclusions

<img src="img/random_sample_assign_grid.png" width="70%" height="70%" style="display: block; margin: auto;" />

---
class: middle

# Wrapping Up...
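
---

## Sketch: simulating a confounder

The third explanation, a confounding variable, can be illustrated with simulated data. This is a minimal sketch, not an analysis of the breakfast study: `z` is a hypothetical confounder, `x` and `y` stand in for an explanatory and a response variable, and the sample size and seed are arbitrary choices.

``` r
set.seed(42)
n <- 5000

z <- rnorm(n)        # hypothetical confounder
x <- z + rnorm(n)    # explanatory variable, driven by z
y <- z + rnorm(n)    # response variable, driven by z only (x has no effect on y)

# x and y are clearly correlated even though neither causes the other
naive_cor <- cor(x, y)

# Once z is included in the model, the coefficient on x is close to zero
adjusted_slope <- coef(lm(y ~ x + z))[["x"]]

naive_cor
adjusted_slope
```

In an observational study the confounder is often unmeasured, so this kind of adjustment is not always available. Random assignment sidesteps the problem by breaking the link between `z` and `x`.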
--- class: middle # Climate change survey: A Conditional Probability Case Study --- ## Survey question >A July 2019 YouGov survey asked 1633 GB and 1333 USA randomly selected adults which of the following statements about the global environment best describes their view: > >- The climate is changing and human activity is mainly responsible >- The climate is changing and human activity is partly responsible, together with other factors >- The climate is changing but human activity is not responsible at all >- The climate is not changing --- ## Survey data <br> .small[ <table> <thead> <tr> <th style="text-align:left;"> </th> <th style="text-align:right;"> The climate is changing and human activity is mainly responsible </th> <th style="text-align:right;"> The climate is changing and human activity is partly responsible, together with other factors </th> <th style="text-align:right;"> The climate is changing but human activity is not responsible at all </th> <th style="text-align:right;"> The climate is not changing </th> <th style="text-align:right;"> Don't know </th> <th style="text-align:right;"> Sum </th> </tr> </thead> <tbody> <tr> <td style="text-align:left;"> GB </td> <td style="text-align:right;width: 0.5 in; "> 833 </td> <td style="text-align:right;width: 0.5 in; "> 604 </td> <td style="text-align:right;width: 0.5 in; "> 49 </td> <td style="text-align:right;width: 0.5 in; "> 33 </td> <td style="text-align:right;"> 114 </td> <td style="text-align:right;"> 1633 </td> </tr> <tr> <td style="text-align:left;"> US </td> <td style="text-align:right;width: 0.5 in; "> 507 </td> <td style="text-align:right;width: 0.5 in; "> 493 </td> <td style="text-align:right;width: 0.5 in; "> 120 </td> <td style="text-align:right;width: 0.5 in; "> 80 </td> <td style="text-align:right;"> 133 </td> <td style="text-align:right;"> 1333 </td> </tr> <tr> <td style="text-align:left;"> Sum </td> <td style="text-align:right;width: 0.5 in; "> 1340 </td> <td style="text-align:right;width: 0.5 
in; "> 1097 </td> <td style="text-align:right;width: 0.5 in; "> 169 </td> <td style="text-align:right;width: 0.5 in; "> 113 </td> <td style="text-align:right;"> 247 </td> <td style="text-align:right;"> 2966 </td> </tr> </tbody> </table> ] .footnote[ Source: [YouGov - International Climate Change Survey](https://d25d2506sfb94s.cloudfront.net/cumulus_uploads/document/epjj0nusce/YouGov%20-%20International%20climate%20change%20survey.pdf) ] --- .question[ What percent of **all respondents** think the climate is changing and human activity is mainly responsible? ] .small[ <table> <thead> <tr> <th style="text-align:left;"> </th> <th style="text-align:right;"> The climate is changing and human activity is mainly responsible </th> <th style="text-align:right;"> The climate is changing and human activity is partly responsible, together with other factors </th> <th style="text-align:right;"> The climate is changing but human activity is not responsible at all </th> <th style="text-align:right;"> The climate is not changing </th> <th style="text-align:right;"> Don't know </th> <th style="text-align:right;"> Sum </th> </tr> </thead> <tbody> <tr> <td style="text-align:left;"> GB </td> <td style="text-align:right;width: 0.5 in; "> 833 </td> <td style="text-align:right;width: 0.5 in; "> 604 </td> <td style="text-align:right;width: 0.5 in; "> 49 </td> <td style="text-align:right;width: 0.5 in; "> 33 </td> <td style="text-align:right;"> 114 </td> <td style="text-align:right;"> 1633 </td> </tr> <tr> <td style="text-align:left;"> US </td> <td style="text-align:right;width: 0.5 in; "> 507 </td> <td style="text-align:right;width: 0.5 in; "> 493 </td> <td style="text-align:right;width: 0.5 in; "> 120 </td> <td style="text-align:right;width: 0.5 in; "> 80 </td> <td style="text-align:right;"> 133 </td> <td style="text-align:right;"> 1333 </td> </tr> <tr> <td style="text-align:left;"> Sum </td> <td style="text-align:right;width: 0.5 in; "> 1340 </td> <td style="text-align:right;width: 0.5 
in; "> 1097 </td> <td style="text-align:right;width: 0.5 in; "> 169 </td> <td style="text-align:right;width: 0.5 in; "> 113 </td> <td style="text-align:right;"> 247 </td> <td style="text-align:right;"> 2966 </td> </tr> </tbody> </table> ] -- ``` r (all <- 1340 / 2966) ``` ``` ## [1] 0.4517869 ``` --- .question[ What percent of **GB respondents** think the climate is changing and human activity is mainly responsible? ] .small[ <table> <thead> <tr> <th style="text-align:left;"> </th> <th style="text-align:right;"> The climate is changing and human activity is mainly responsible </th> <th style="text-align:right;"> The climate is changing and human activity is partly responsible, together with other factors </th> <th style="text-align:right;"> The climate is changing but human activity is not responsible at all </th> <th style="text-align:right;"> The climate is not changing </th> <th style="text-align:right;"> Don't know </th> <th style="text-align:right;"> Sum </th> </tr> </thead> <tbody> <tr> <td style="text-align:left;"> GB </td> <td style="text-align:right;width: 0.5 in; "> 833 </td> <td style="text-align:right;width: 0.5 in; "> 604 </td> <td style="text-align:right;width: 0.5 in; "> 49 </td> <td style="text-align:right;width: 0.5 in; "> 33 </td> <td style="text-align:right;"> 114 </td> <td style="text-align:right;"> 1633 </td> </tr> <tr> <td style="text-align:left;"> US </td> <td style="text-align:right;width: 0.5 in; "> 507 </td> <td style="text-align:right;width: 0.5 in; "> 493 </td> <td style="text-align:right;width: 0.5 in; "> 120 </td> <td style="text-align:right;width: 0.5 in; "> 80 </td> <td style="text-align:right;"> 133 </td> <td style="text-align:right;"> 1333 </td> </tr> <tr> <td style="text-align:left;"> Sum </td> <td style="text-align:right;width: 0.5 in; "> 1340 </td> <td style="text-align:right;width: 0.5 in; "> 1097 </td> <td style="text-align:right;width: 0.5 in; "> 169 </td> <td style="text-align:right;width: 0.5 in; "> 113 </td> <td 
style="text-align:right;"> 247 </td> <td style="text-align:right;"> 2966 </td> </tr> </tbody> </table> ] -- ``` r (gb <- 833 / 1633) ``` ``` ## [1] 0.5101041 ``` --- .question[ What percent of **US respondents** think the climate is changing and human activity is mainly responsible? ] .small[ <table> <thead> <tr> <th style="text-align:left;"> </th> <th style="text-align:right;"> The climate is changing and human activity is mainly responsible </th> <th style="text-align:right;"> The climate is changing and human activity is partly responsible, together with other factors </th> <th style="text-align:right;"> The climate is changing but human activity is not responsible at all </th> <th style="text-align:right;"> The climate is not changing </th> <th style="text-align:right;"> Don't know </th> <th style="text-align:right;"> Sum </th> </tr> </thead> <tbody> <tr> <td style="text-align:left;"> GB </td> <td style="text-align:right;width: 0.5 in; "> 833 </td> <td style="text-align:right;width: 0.5 in; "> 604 </td> <td style="text-align:right;width: 0.5 in; "> 49 </td> <td style="text-align:right;width: 0.5 in; "> 33 </td> <td style="text-align:right;"> 114 </td> <td style="text-align:right;"> 1633 </td> </tr> <tr> <td style="text-align:left;"> US </td> <td style="text-align:right;width: 0.5 in; "> 507 </td> <td style="text-align:right;width: 0.5 in; "> 493 </td> <td style="text-align:right;width: 0.5 in; "> 120 </td> <td style="text-align:right;width: 0.5 in; "> 80 </td> <td style="text-align:right;"> 133 </td> <td style="text-align:right;"> 1333 </td> </tr> <tr> <td style="text-align:left;"> Sum </td> <td style="text-align:right;width: 0.5 in; "> 1340 </td> <td style="text-align:right;width: 0.5 in; "> 1097 </td> <td style="text-align:right;width: 0.5 in; "> 169 </td> <td style="text-align:right;width: 0.5 in; "> 113 </td> <td style="text-align:right;"> 247 </td> <td style="text-align:right;"> 2966 </td> </tr> </tbody> </table> ] -- ``` r (us <- 507 / 1333) ``` ``` ## 
[1] 0.3803451
```

---

.question[
- Based on the percentages we calculated, does there appear to be a relationship between country and beliefs about climate change?
- If yes, could there be another variable that explains this relationship?
]

.pull-left[
``` r
all
```

```
## [1] 0.4517869
```

``` r
gb
```

```
## [1] 0.5101041
```

``` r
us
```

```
## [1] 0.3803451
```
]

---

# Abridged Properties of Probabilities

- `\(0 \leq P(A) \leq 1\)`
- `\(P(S) = 1\)`
    - Probability of sample space must be 1
- `\(P(A \text{ or } B) = P(A) + P(B)\)` if `\(A\)` and `\(B\)` are disjoint
    - Disjoint (A and B don't overlap)
- `\(P(\text{not } A) = 1 - P(A)\)`

---

## Unabridged Properties of Probabilities

.midi[
- Any probability is a number between 0 and 1.
    - `\(0 \leq P(A) \leq 1\)`
- All possible outcomes together must have probability 1.
    - Because some outcome must occur on every trial, the sum of the probabilities for all possible outcomes must be exactly 1
    - `\(P(S) = 1\)`
        - Probability of sample space must be 1
- If two events have no outcomes in common (disjoint), the probability that one or the other occurs is the sum of their individual probabilities.
    - `\(P(A \text{ or } B) = P(A) + P(B)\)` if `\(A\)` and `\(B\)` are disjoint
        - Disjoint (A and B don't overlap)
- The probability that an event does not occur is 1 minus the probability that the event does occur.
    - The probability that an event occurs and the probability that it does not occur always add to 1, or 100%.
    - `\(P(\text{not } A) = 1 - P(A)\)`
]

---

# Abridged Properties of Probabilities

- `\(0 \leq P(A) \leq 1\)`
- `\(P(S) = 1\)`
    - Probability of sample space must be 1
- `\(P(A \text{ or } B) = P(A) + P(B)\)` if `\(A\)` and `\(B\)` are disjoint
    - Disjoint (A and B don't overlap)
- `\(P(\text{not } A) = 1 - P(A)\)`

---

## Conditional probability

**Notation**: `\(P(A | B)\)`: Probability of event A given event B

The probability we assign to an event can change if we know that some other event has occurred.
- If the occurrence of event B depends in some sense on the occurrence of event A,
    - then we can talk about the conditional probability of B, given A.
    - `\(P(B | A)\)`
        - B given A

---

# Examples

- Weather
    - `\(B\)` = John carries an umbrella
    - `\(A\)` = the weather forecast predicted rain
    - So tied together that `\(P(B|A)\)` is likely close to 1

--

- Drawing a King
    - `\(P(K) = 4/52\)`
    - `\(P(\text{2nd card is a king} \mid \text{1st card was a king}) = 3/51\)`
    - `\(P(\text{2nd card is a king} \mid \text{1st card was not a king}) = 4/51\)`

---

## Independence

- Two events are independent if `\(P(A|B) = P(A)\)` and `\(P(B|A) = P(B)\)`
- Two events `\(A\)` and `\(B\)` are independent
    - if knowing that one occurs does not change the probability that the other occurs

--

- Card Example
    - `\(H\)` = heart; `\(J\)` = Jack
    - `\(P(H) = 13/52\)`
    - `\(P(H|J) = 1/4\)`
    - `\(P(H|J) = P(H)\)`, so the two events are independent

--

- Be careful not to confuse disjointness and independence.
    - If `\(A\)` and `\(B\)` are disjoint,
    - then the fact that `\(A\)` occurs tells us that `\(B\)` cannot occur
    - very dependent!

---
class: middle

# Wrapping Up...
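
---

## Sketch: checking independence in R

The definition of independence, `\(P(A|B) = P(A)\)`, can be checked numerically. The sketch below does this for the card example and for the climate survey counts. The object names (`deck`, `counts`, and so on) and the shortened column labels are my own assumptions, while the counts are copied from the survey table.

``` r
# Card example: build a 52-card deck and compare P(heart) with P(heart | jack)
deck <- expand.grid(
  rank = c("A", 2:10, "J", "Q", "K"),
  suit = c("hearts", "diamonds", "clubs", "spades")
)
p_heart <- mean(deck$suit == "hearts")
p_heart_given_jack <- mean(deck$suit[deck$rank == "J"] == "hearts")

# Survey example: counts by country (rows) and view (columns)
counts <- matrix(
  c(833, 604,  49, 33, 114,
    507, 493, 120, 80, 133),
  nrow = 2, byrow = TRUE,
  dimnames = list(
    country = c("GB", "US"),
    view = c("mainly", "partly", "not_responsible", "not_changing", "dont_know")
  )
)
p_mainly <- sum(counts[, "mainly"]) / sum(counts)             # P(mainly)
p_mainly_given_country <- counts[, "mainly"] / rowSums(counts)  # P(mainly | country)

c(p_heart = p_heart, p_heart_given_jack = p_heart_given_jack)
p_mainly
p_mainly_given_country
```

For the cards the two probabilities agree, which is what independence means. For the survey, `\(P(\text{mainly} \mid \text{GB})\)` and `\(P(\text{mainly} \mid \text{US})\)` both differ from `\(P(\text{mainly})\)`, so in this sample country and view are not independent.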
--- class: middle # Introducing Simpson's Paradox --- ## Relationships between variables - Relationship between two variables: Fitness `\(\rightarrow\)` Heart health - Relationship between multiple variables: Calories + Age + Fitness `\(\rightarrow\)` Heart health --- ## Relationship between two variables <br> .pull-right-narrow[ <table> <thead> <tr> <th style="text-align:left;"> </th> <th style="text-align:right;"> </th> <th style="text-align:right;"> </th> <th style="text-align:right;"> </th> <th style="text-align:right;"> </th> <th style="text-align:right;"> </th> <th style="text-align:right;"> </th> <th style="text-align:right;"> </th> <th style="text-align:right;"> </th> </tr> </thead> <tbody> <tr> <td style="text-align:left;font-weight: bold;border-right:1px solid;"> x </td> <td style="text-align:right;"> 2 </td> <td style="text-align:right;"> 3 </td> <td style="text-align:right;"> 4 </td> <td style="text-align:right;"> 5 </td> <td style="text-align:right;"> 6 </td> <td style="text-align:right;"> 7 </td> <td style="text-align:right;"> 8 </td> <td style="text-align:right;"> 9 </td> </tr> <tr> <td style="text-align:left;font-weight: bold;border-right:1px solid;"> y </td> <td style="text-align:right;"> 4 </td> <td style="text-align:right;"> 3 </td> <td style="text-align:right;"> 2 </td> <td style="text-align:right;"> 1 </td> <td style="text-align:right;"> 11 </td> <td style="text-align:right;"> 10 </td> <td style="text-align:right;"> 9 </td> <td style="text-align:right;"> 8 </td> </tr> </tbody> </table> ] --- ## Relationship between two variables <br> .pull-left-wide[ <img src="d14_confound_files/figure-html/unnamed-chunk-13-1.png" width="100%" style="display: block; margin: auto;" /> ] .pull-right-narrow[ <table> <thead> <tr> <th style="text-align:left;"> </th> <th style="text-align:right;"> </th> <th style="text-align:right;"> </th> <th style="text-align:right;"> </th> <th style="text-align:right;"> </th> <th style="text-align:right;"> </th> <th 
style="text-align:right;"> </th> <th style="text-align:right;"> </th> <th style="text-align:right;"> </th> </tr> </thead> <tbody> <tr> <td style="text-align:left;font-weight: bold;border-right:1px solid;"> x </td> <td style="text-align:right;"> 2 </td> <td style="text-align:right;"> 3 </td> <td style="text-align:right;"> 4 </td> <td style="text-align:right;"> 5 </td> <td style="text-align:right;"> 6 </td> <td style="text-align:right;"> 7 </td> <td style="text-align:right;"> 8 </td> <td style="text-align:right;"> 9 </td> </tr> <tr> <td style="text-align:left;font-weight: bold;border-right:1px solid;"> y </td> <td style="text-align:right;"> 4 </td> <td style="text-align:right;"> 3 </td> <td style="text-align:right;"> 2 </td> <td style="text-align:right;"> 1 </td> <td style="text-align:right;"> 11 </td> <td style="text-align:right;"> 10 </td> <td style="text-align:right;"> 9 </td> <td style="text-align:right;"> 8 </td> </tr> </tbody> </table> ] --- ## Relationship between two variables <br> .pull-left-wide[ <img src="d14_confound_files/figure-html/unnamed-chunk-15-1.png" width="100%" style="display: block; margin: auto;" /> ] .pull-right-narrow[ <table> <thead> <tr> <th style="text-align:left;"> </th> <th style="text-align:right;"> </th> <th style="text-align:right;"> </th> <th style="text-align:right;"> </th> <th style="text-align:right;"> </th> <th style="text-align:right;"> </th> <th style="text-align:right;"> </th> <th style="text-align:right;"> </th> <th style="text-align:right;"> </th> </tr> </thead> <tbody> <tr> <td style="text-align:left;font-weight: bold;border-right:1px solid;"> x </td> <td style="text-align:right;"> 2 </td> <td style="text-align:right;"> 3 </td> <td style="text-align:right;"> 4 </td> <td style="text-align:right;"> 5 </td> <td style="text-align:right;"> 6 </td> <td style="text-align:right;"> 7 </td> <td style="text-align:right;"> 8 </td> <td style="text-align:right;"> 9 </td> </tr> <tr> <td style="text-align:left;font-weight: 
bold;border-right:1px solid;"> y </td> <td style="text-align:right;"> 4 </td> <td style="text-align:right;"> 3 </td> <td style="text-align:right;"> 2 </td> <td style="text-align:right;"> 1 </td> <td style="text-align:right;"> 11 </td> <td style="text-align:right;"> 10 </td> <td style="text-align:right;"> 9 </td> <td style="text-align:right;"> 8 </td> </tr> </tbody> </table> ] --- ## Considering a third variable <br> .pull-left-wide[ <img src="d14_confound_files/figure-html/unnamed-chunk-17-1.png" width="100%" style="display: block; margin: auto;" /> ] .pull-right-narrow[ <table> <thead> <tr> <th style="text-align:left;"> </th> <th style="text-align:left;"> </th> <th style="text-align:left;"> </th> <th style="text-align:left;"> </th> <th style="text-align:left;"> </th> <th style="text-align:left;"> </th> <th style="text-align:left;"> </th> <th style="text-align:left;"> </th> <th style="text-align:left;"> </th> </tr> </thead> <tbody> <tr> <td style="text-align:left;font-weight: bold;border-right:1px solid;"> x </td> <td style="text-align:left;"> 2 </td> <td style="text-align:left;"> 3 </td> <td style="text-align:left;"> 4 </td> <td style="text-align:left;"> 5 </td> <td style="text-align:left;"> 6 </td> <td style="text-align:left;"> 7 </td> <td style="text-align:left;"> 8 </td> <td style="text-align:left;"> 9 </td> </tr> <tr> <td style="text-align:left;font-weight: bold;border-right:1px solid;"> y </td> <td style="text-align:left;"> 4 </td> <td style="text-align:left;"> 3 </td> <td style="text-align:left;"> 2 </td> <td style="text-align:left;"> 1 </td> <td style="text-align:left;"> 11 </td> <td style="text-align:left;"> 10 </td> <td style="text-align:left;"> 9 </td> <td style="text-align:left;"> 8 </td> </tr> <tr> <td style="text-align:left;font-weight: bold;border-right:1px solid;"> z </td> <td style="text-align:left;"> A </td> <td style="text-align:left;"> A </td> <td style="text-align:left;"> A </td> <td style="text-align:left;"> A </td> <td 
style="text-align:left;"> B </td> <td style="text-align:left;"> B </td> <td style="text-align:left;"> B </td> <td style="text-align:left;"> B </td> </tr> </tbody> </table> ] --- ## Relationship between three variables <br> .pull-left-wide[ <img src="d14_confound_files/figure-html/unnamed-chunk-19-1.png" width="100%" style="display: block; margin: auto;" /> ] .pull-right-narrow[ <table> <thead> <tr> <th style="text-align:left;"> </th> <th style="text-align:left;"> </th> <th style="text-align:left;"> </th> <th style="text-align:left;"> </th> <th style="text-align:left;"> </th> <th style="text-align:left;"> </th> <th style="text-align:left;"> </th> <th style="text-align:left;"> </th> <th style="text-align:left;"> </th> </tr> </thead> <tbody> <tr> <td style="text-align:left;font-weight: bold;border-right:1px solid;"> x </td> <td style="text-align:left;"> 2 </td> <td style="text-align:left;"> 3 </td> <td style="text-align:left;"> 4 </td> <td style="text-align:left;"> 5 </td> <td style="text-align:left;"> 6 </td> <td style="text-align:left;"> 7 </td> <td style="text-align:left;"> 8 </td> <td style="text-align:left;"> 9 </td> </tr> <tr> <td style="text-align:left;font-weight: bold;border-right:1px solid;"> y </td> <td style="text-align:left;"> 4 </td> <td style="text-align:left;"> 3 </td> <td style="text-align:left;"> 2 </td> <td style="text-align:left;"> 1 </td> <td style="text-align:left;"> 11 </td> <td style="text-align:left;"> 10 </td> <td style="text-align:left;"> 9 </td> <td style="text-align:left;"> 8 </td> </tr> <tr> <td style="text-align:left;font-weight: bold;border-right:1px solid;"> z </td> <td style="text-align:left;"> A </td> <td style="text-align:left;"> A </td> <td style="text-align:left;"> A </td> <td style="text-align:left;"> A </td> <td style="text-align:left;"> B </td> <td style="text-align:left;"> B </td> <td style="text-align:left;"> B </td> <td style="text-align:left;"> B </td> </tr> </tbody> </table> ] --- class: middle # Wrapping Up... 
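
---

## Sketch: the toy data in R

The reversal shown in the plots can be reproduced numerically from the small `x`, `y`, `z` table. This is a sketch in base R. The data frame name `toy` is an assumption, and the values are copied from the table on the slides.

``` r
toy <- data.frame(
  x = 2:9,
  y = c(4, 3, 2, 1, 11, 10, 9, 8),
  z = rep(c("A", "B"), each = 4)
)

# Ignoring z: x and y are positively associated (correlation of about 0.68)
overall_cor <- cor(toy$x, toy$y)

# Within each level of z: x and y are perfectly negatively associated
within_cor <- sapply(split(toy, toy$z), function(g) cor(g$x, g$y))

# The same reversal appears in the regression slope on x
overall_slope <- coef(lm(y ~ x, data = toy))[["x"]]
adjusted_slope <- coef(lm(y ~ x + z, data = toy))[["x"]]

overall_cor
within_cor
c(overall = overall_slope, adjusted = adjusted_slope)
```

The sign of the relationship flips once `z` is taken into account, which is the pattern the rest of the deck calls Simpson's paradox.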
---
class: middle

# Case study: Berkeley admission data

---

## Berkeley admission data

- Study carried out by the Graduate Division of the University of California, Berkeley in the early 1970s to evaluate whether there was a gender bias in graduate admissions.
- The data come from six departments. For confidentiality we'll call them A-F.
- We have information on whether the applicant was male or female and whether they were admitted or rejected.
- First, we will evaluate whether the percentage of males admitted is indeed higher than that of females, overall. Next, we will calculate the same percentage for each department.

---

## Data

.pull-left[
```
## # A tibble: 4,526 × 3
##    admit    gender dept 
##    <fct>    <fct>  <ord>
##  1 Admitted Male   A    
##  2 Admitted Male   A    
##  3 Admitted Male   A    
##  4 Admitted Male   A    
##  5 Admitted Male   A    
##  6 Admitted Male   A    
##  7 Admitted Male   A    
##  8 Admitted Male   A    
##  9 Admitted Male   A    
## 10 Admitted Male   A    
## 11 Admitted Male   A    
## 12 Admitted Male   A    
## 13 Admitted Male   A    
## 14 Admitted Male   A    
## 15 Admitted Male   A    
## # ℹ 4,511 more rows
```
]
.medi[.pull-right[
```
## # A tibble: 2 × 2
##   gender     n
##   <fct>  <int>
## 1 Female  1835
## 2 Male    2691
```

```
## # A tibble: 6 × 2
##   dept      n
##   <ord> <int>
## 1 A       933
## 2 B       585
## 3 C       918
## 4 D       792
## 5 E       584
## 6 F       714
```

```
## # A tibble: 2 × 2
##   admit        n
##   <fct>    <int>
## 1 Rejected  2771
## 2 Admitted  1755
```
]
]

---

.question[
What can you say about the overall gender distribution?
]

.small.tip[Calculate the following probabilities: `\(P(Admit | Male)\)` and `\(P(Admit | Female)\)`.
]

``` r
ucbadmit %>%
  count(gender, admit)
```

```
## # A tibble: 4 × 3
##   gender admit        n
##   <fct>  <fct>    <int>
## 1 Female Rejected  1278
## 2 Female Admitted   557
## 3 Male   Rejected  1493
## 4 Male   Admitted  1198
```

---

``` r
ucbadmit %>%
  count(gender, admit) %>%
  group_by(gender) %>%
  mutate(prop_admit = n / sum(n))
```

```
## # A tibble: 4 × 4
## # Groups:   gender [2]
##   gender admit        n prop_admit
##   <fct>  <fct>    <int>      <dbl>
## 1 Female Rejected  1278      0.696
## 2 Female Admitted   557      0.304
## 3 Male   Rejected  1493      0.555
## 4 Male   Admitted  1198      0.445
```

- `\(P(Admit | Female)\)` = 0.304
- `\(P(Admit | Male)\)` = 0.445

---

## Overall gender distribution

.pull-left[
<img src="d14_confound_files/figure-html/unnamed-chunk-24-1.png" width="100%" style="display: block; margin: auto;" />
]
.medi[.pull-right[
``` r
ggplot(ucbadmit, aes(y = gender, fill = admit)) +
  geom_bar(position = "fill") +
  labs(title = "Admit by gender", y = NULL, x = NULL)
```
]
]

---

.question[
What can you say about the gender distribution by department?
]

.center[
``` r
ucbadmit %>%
  count(dept, gender, admit)
```

```
## # A tibble: 24 × 4
##    dept  gender admit        n
##    <ord> <fct>  <fct>    <int>
##  1 A     Female Rejected    19
##  2 A     Female Admitted    89
##  3 A     Male   Rejected   313
##  4 A     Male   Admitted   512
##  5 B     Female Rejected     8
##  6 B     Female Admitted    17
##  7 B     Male   Rejected   207
##  8 B     Male   Admitted   353
##  9 C     Female Rejected   391
## 10 C     Female Admitted   202
## # ℹ 14 more rows
```
]

---

.question[
Let's try again... What can you say about the gender distribution by department?
] ``` r ucbadmit %>% count(dept, gender, admit) %>% pivot_wider(names_from = dept, values_from = n) ``` ``` ## # A tibble: 4 × 8 ## gender admit A B C D E F ## <fct> <fct> <int> <int> <int> <int> <int> <int> ## 1 Female Rejected 19 8 391 244 299 317 ## 2 Female Admitted 89 17 202 131 94 24 ## 3 Male Rejected 313 207 205 279 138 351 ## 4 Male Admitted 512 353 120 138 53 22 ``` --- ## Gender distribution, by department .pull-left[ <img src="d14_confound_files/figure-html/unnamed-chunk-27-1.png" width="100%" style="display: block; margin: auto;" /> ] .midi.pull-right[ ``` r ggplot(ucbadmit, aes(y = gender, fill = admit)) + geom_bar(position = "fill") + facet_wrap(. ~ dept) + scale_x_continuous(labels = label_percent()) + labs(title = "Admissions by gender and department", x = NULL, y = NULL, fill = NULL) + theme(legend.position = "bottom") ``` ] --- ## Case for gender discrimination? .pull-left[ <img src="d14_confound_files/figure-html/unnamed-chunk-28-1.png" width="100%" style="display: block; margin: auto;" /> ] .pull-right[ <img src="d14_confound_files/figure-html/unnamed-chunk-29-1.png" width="100%" style="display: block; margin: auto;" /> ] --- ## Closer look at departments .pull-left-wide[ ``` ## # A tibble: 12 × 5 ## # Groups: dept, gender [12] ## dept gender n_admitted n_applied prop_admit ## <ord> <fct> <int> <int> <dbl> ## 1 A Female 89 108 0.824 ## 2 A Male 512 825 0.621 ## 3 B Female 17 25 0.68 ## 4 B Male 353 560 0.630 ## 5 C Female 202 593 0.341 ## 6 C Male 120 325 0.369 ## 7 D Female 131 375 0.349 ## 8 D Male 138 417 0.331 ## 9 E Female 94 393 0.239 ## 10 E Male 53 191 0.277 ## 11 F Female 24 341 0.0704 ## 12 F Male 22 373 0.0590 ``` ] .pull-right-narrow.midi[ ``` r ucbadmit %>% count(dept, gender, admit) %>% group_by(dept, gender) %>% mutate( n_applied = sum(n), prop_admit = n / n_applied ) %>% filter(admit == "Admitted") %>% rename(n_admitted = n) %>% select(-admit) %>% print(n = 12) ``` ] --- class:middle # Wrapping Up... 
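
---

## Sketch: the same comparison in base R

The overall and per-department admission rates can also be computed without the tidyverse. This sketch uses the built-in `UCBAdmissions` table from the `datasets` package and assumes it holds the same counts as `ucbadmit`. The object names are my own.

``` r
# UCBAdmissions is a 3-way table: Admit x Gender x Dept
by_gender <- margin.table(UCBAdmissions, c(1, 2))   # collapse over Dept

# P(Admit | Gender): proportions within each gender column
p_admit_gender <- prop.table(by_gender, margin = 2)["Admitted", ]

# P(Admit | Gender, Dept): proportions within each Gender x Dept cell
p_admit_dept <- prop.table(UCBAdmissions, margin = c(2, 3))["Admitted", , ]

# In which departments is the admission rate higher for women than for men?
women_higher <- p_admit_dept["Female", ] > p_admit_dept["Male", ]

round(p_admit_gender, 3)
round(p_admit_dept, 3)
women_higher
```

Overall, the rate is higher for men, yet within four of the six departments it is higher for women. The department an applicant applies to is the third variable that changes the apparent relationship.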
---
class: middle

# Revisiting Simpson's Paradox

---

## Relationship between two variables

.pull-left-narrow[
```
## # A tibble: 8 × 3
##       x     y z    
##   <dbl> <dbl> <chr>
## 1     2     4 A    
## 2     3     3 A    
## 3     4     2 A    
## 4     5     1 A    
## 5     6    11 B    
## 6     7    10 B    
## 7     8     9 B    
## 8     9     8 B    
```
]
.pull-right-wide[
<img src="d14_confound_files/figure-html/unnamed-chunk-32-1.png" width="100%" style="display: block; margin: auto;" />
]

---

## Relationship between two variables

.pull-left-narrow[
```
## # A tibble: 8 × 3
##       x     y z    
##   <dbl> <dbl> <chr>
## 1     2     4 A    
## 2     3     3 A    
## 3     4     2 A    
## 4     5     1 A    
## 5     6    11 B    
## 6     7    10 B    
## 7     8     9 B    
## 8     9     8 B    
```
]
.pull-right-wide[
<img src="d14_confound_files/figure-html/unnamed-chunk-34-1.png" width="100%" style="display: block; margin: auto;" />
]

---

## Considering a third variable

.pull-left-narrow[
```
## # A tibble: 8 × 3
##       x     y z    
##   <dbl> <dbl> <chr>
## 1     2     4 A    
## 2     3     3 A    
## 3     4     2 A    
## 4     5     1 A    
## 5     6    11 B    
## 6     7    10 B    
## 7     8     9 B    
## 8     9     8 B    
```
]
.pull-right-wide[
<img src="d14_confound_files/figure-html/unnamed-chunk-36-1.png" width="100%" style="display: block; margin: auto;" />
]

---

## Relationship between three variables

.pull-left-narrow[
```
## # A tibble: 8 × 3
##       x     y z    
##   <dbl> <dbl> <chr>
## 1     2     4 A    
## 2     3     3 A    
## 3     4     2 A    
## 4     5     1 A    
## 5     6    11 B    
## 6     7    10 B    
## 7     8     9 B    
## 8     9     8 B    
```
]
.pull-right-wide[
<img src="d14_confound_files/figure-html/unnamed-chunk-38-1.png" width="100%" style="display: block; margin: auto;" />
]

---

# Simpson's paradox

- Not considering an important variable when studying a relationship can result in **Simpson's paradox**
- Simpson's paradox illustrates
    - the effect that omission of an explanatory variable can have
    - on the measure of association between
    - another explanatory variable and a response variable
- The inclusion of a third variable in the analysis can change
    - the apparent relationship between the other two variables

---
class: middle

# Aside: `group_by()` and `count()`

---

## What does group_by() do?

`group_by()` takes an existing data frame and converts it into a grouped data frame where subsequent operations are performed "once per group" .medi.pull-left[ ``` r ucbadmit ``` ``` ## # A tibble: 4,526 × 3 ## admit gender dept ## <fct> <fct> <ord> ## 1 Admitted Male A ## 2 Admitted Male A ## 3 Admitted Male A ## 4 Admitted Male A ## 5 Admitted Male A ## 6 Admitted Male A ## 7 Admitted Male A ## 8 Admitted Male A ## 9 Admitted Male A ## 10 Admitted Male A ## # ℹ 4,516 more rows ``` ] .medi.pull-right[ ``` r ucbadmit %>% group_by(gender) ``` ``` ## # A tibble: 4,526 × 3 ## # Groups: gender [2] ## admit gender dept ## <fct> <fct> <ord> ## 1 Admitted Male A ## 2 Admitted Male A ## 3 Admitted Male A ## 4 Admitted Male A ## 5 Admitted Male A ## 6 Admitted Male A ## 7 Admitted Male A ## 8 Admitted Male A ## 9 Admitted Male A ## 10 Admitted Male A ## # ℹ 4,516 more rows ``` ] --- ## What does group_by() not do? `group_by()` does not sort the data, `arrange()` does .medi.pull-left[ ``` r ucbadmit %>% group_by(gender) ``` ``` ## # A tibble: 4,526 × 3 ## # Groups: gender [2] ## admit gender dept ## <fct> <fct> <ord> ## 1 Admitted Male A ## 2 Admitted Male A ## 3 Admitted Male A ## 4 Admitted Male A ## 5 Admitted Male A ## 6 Admitted Male A ## 7 Admitted Male A ## 8 Admitted Male A ## 9 Admitted Male A ## 10 Admitted Male A ## # ℹ 4,516 more rows ``` ] .medi.pull-right[ ``` r ucbadmit %>% arrange(gender) ``` ``` ## # A tibble: 4,526 × 3 ## admit gender dept ## <fct> <fct> <ord> ## 1 Admitted Female A ## 2 Admitted Female A ## 3 Admitted Female A ## 4 Admitted Female A ## 5 Admitted Female A ## 6 Admitted Female A ## 7 Admitted Female A ## 8 Admitted Female A ## 9 Admitted Female A ## 10 Admitted Female A ## # ℹ 4,516 more rows ``` ] --- ## What does group_by() not do? 
`group_by()` does not create frequency tables, `count()` does .medi.pull-left[ ``` r ucbadmit %>% group_by(gender) ``` ``` ## # A tibble: 4,526 × 3 ## # Groups: gender [2] ## admit gender dept ## <fct> <fct> <ord> ## 1 Admitted Male A ## 2 Admitted Male A ## 3 Admitted Male A ## 4 Admitted Male A ## 5 Admitted Male A ## 6 Admitted Male A ## 7 Admitted Male A ## 8 Admitted Male A ## 9 Admitted Male A ## 10 Admitted Male A ## # ℹ 4,516 more rows ``` ] .medi.pull-right[ ``` r ucbadmit %>% count(gender) ``` ``` ## # A tibble: 2 × 2 ## gender n ## <fct> <int> ## 1 Female 1835 ## 2 Male 2691 ``` ] --- ## Undo grouping with ungroup() .medi.pull-left[ ``` r ucbadmit %>% count(gender, admit) %>% group_by(gender) %>% mutate(prop_admit = n / sum(n)) %>% select(gender, prop_admit) ``` ``` ## # A tibble: 4 × 2 ## # Groups: gender [2] ## gender prop_admit ## <fct> <dbl> ## 1 Female 0.696 ## 2 Female 0.304 ## 3 Male 0.555 ## 4 Male 0.445 ``` ] .medi.pull-right[ ``` r ucbadmit %>% count(gender, admit) %>% group_by(gender) %>% mutate(prop_admit = n / sum(n)) %>% select(gender, prop_admit) %>% ungroup() ``` ``` ## # A tibble: 4 × 2 ## gender prop_admit ## <fct> <dbl> ## 1 Female 0.696 ## 2 Female 0.304 ## 3 Male 0.555 ## 4 Male 0.445 ``` ] --- ## count() is a short-hand `count()` is a short-hand for `group_by()` and then `summarize()` to count the number of observations in each group .pull-left[ ``` r ucbadmit %>% group_by(gender) %>% summarize(n = n()) ``` ``` ## # A tibble: 2 × 2 ## gender n ## <fct> <int> ## 1 Female 1835 ## 2 Male 2691 ``` ] .pull-right[ ``` r ucbadmit %>% count(gender) ``` ``` ## # A tibble: 2 × 2 ## gender n ## <fct> <int> ## 1 Female 1835 ## 2 Male 2691 ``` ] --- ## count can take multiple arguments .pull-left[ ``` r ucbadmit %>% group_by(gender, admit) %>% summarize(n = n()) ``` ``` ## # A tibble: 4 × 3 ## # Groups: gender [2] ## gender admit n ## <fct> <fct> <int> ## 1 Female Rejected 1278 ## 2 Female Admitted 557 ## 3 Male Rejected 1493 ## 4 Male Admitted 
1198 ``` ] .pull-right[ ``` r ucbadmit %>% count(gender, admit) ``` ``` ## # A tibble: 4 × 3 ## gender admit n ## <fct> <fct> <int> ## 1 Female Rejected 1278 ## 2 Female Admitted 557 ## 3 Male Rejected 1493 ## 4 Male Admitted 1198 ``` ] --- ## `summarize()` after `group_by()` - `count()` ungroups after itself - `summarize()` peels off one layer of grouping by default, or you can specify a different behavior .medi[ ``` r ucbadmit %>% group_by(gender, admit) %>% summarize(n = n()) ``` ``` ## # A tibble: 4 × 3 ## # Groups: gender [2] ## gender admit n ## <fct> <fct> <int> ## 1 Female Rejected 1278 ## 2 Female Admitted 557 ## 3 Male Rejected 1493 ## 4 Male Admitted 1198 ``` ] --- class: middle # Wrapping Up... <br> Sources: - Mine Çetinkaya-Rundel's Data Science in a Box ([link](https://datasciencebox.org/))