class: center, middle, inverse, title-slide .title[ # Communicating data science results effectively
📰 ] .author[ ### S. Mason Garrison ] --- layout: true <div class="my-footer"> <span> <a href="https://DataScience4Psych.github.io/DataScience4Psych/" target="_blank">Data Science for Psychologists</a> </span> </div> --- class: middle # Communicating data science results effectively --- # Five core activities of data science 1. Stating and refining the question 1. Exploring the data 1. Building formal statistical models 1. Interpreting the results 1. Communicating the results .center[ <img src="img/data-science.png" width="60%" height="100%" style="display: block; margin: auto;" /> ] .footnote[ Roger D. Peng and Elizabeth Matsui. "The Art of Data Science." A Guide for Anyone Who Works with Data. Skybrude Consulting, LLC (2015). ] --- class: middle # Stating and refining the question --- ## Six types of questions 1. **Descriptive:** summarize a characteristic of a set of data 1. **Exploratory:** analyze to see if there are patterns, trends, or relationships between variables (hypothesis generating) 1. **Inferential:** analyze patterns, trends, or relationships in representative data from a population 1. **Predictive:** make predictions for individuals or groups of individuals 1. **Causal:** whether changing one factor will change another factor, on average, in a population 1. **Mechanistic:** explore "how" as opposed to whether .footnote[ Jeffery T. Leek and Roger D. Peng. "What is the question?." Science 347.6228 (2015): 1314-1315. ] --- ## Ex: COVID-19 and Vitamin D 1. **Descriptive:** frequency of hospitalizations due to COVID-19 in a set of data collected from a group of individuals -- 1. **Exploratory:** examine relationships between a range of dietary factors and COVID-19 hospitalizations -- 1. **Inferential:** examine whether any relationship between taking Vitamin D supplements and COVID-19 hospitalizations found in the sample hold for the population at large -- 1. **Predictive:** what types of people will take Vitamin D supplements during the next year -- 1. **Causal:** whether people with COVID-19 who were randomly assigned to take Vitamin D supplements or those who were not are hospitalized -- 1. **Mechanistic:** how increased vitamin D intake leads to a reduction in the number of viral illnesses --- ## Questions to data science problems - Do you have appropriate data to answer your question? - Do you have information on confounding variables? - Was the data you're working with collected in a way that introduces bias? -- .pull-left[ .question[ - Suppose I want to estimate the average number of children in households in Winston-Salem. - I conduct a survey at an elementary school in Winston-Salem and ask those students: - how many children, including themselves, live in their house. - Then, I take the average of the responses. ] ] -- .pull-right[ .question[ - Is this a biased or an unbiased estimate of the number of children in households in Winston-Salem? - If biased, will the value be an overestimate or underestimate? ] ] --- class: middle # Exploratory data analysis --- ## Checklist - Formulate your question - Read in your data - Check the dimensions - Look at the top and the bottom of your data - Validate with at least one external data source - Make a plot - Try the easy solution first --- ## Formulate your question - Consider scope: - Are air pollution levels higher on the east coast than on the west coast? - Are hourly ozone levels on average higher in New York City than they are in Los Angeles? - Do counties in the eastern United States have higher ozone levels than counties in the western United States? - Most importantly: "Do I have the right data to answer this question?" --- ## Read in your data - Place your data in a folder called `data` - Read it into R with `read_csv()` or friends (`read_delim()`, `read_excel()`, etc.) ``` r library(readxl) fav_food <- read_excel("data/favorite-food.xlsx") fav_food ``` ``` ## # A tibble: 5 × 4 ## `Student ID` `Full Name` favorite.food mealPlan ## <dbl> <chr> <chr> <chr> ## 1 1 Sunil Huffmann Strawberry yoghurt Lunch only ## 2 2 Barclay Lynn French fries Lunch only ## 3 3 Jayendra Lyne Peaches Breakfast and… ## 4 4 Leon Rossini Anchovies Lunch only ## 5 5 Chidiegwu Dunkel Pizza Breakfast and… ``` --- ## `clean_names()` If the variable names are malformatted, use `janitor::clean_names()` .small[ ``` r library(janitor) fav_food %>% clean_names() ``` ``` ## # A tibble: 5 × 4 ## student_id full_name favorite_food meal_plan ## <dbl> <chr> <chr> <chr> ## 1 1 Sunil Huffmann Strawberry yoghurt Lunch only ## 2 2 Barclay Lynn French fries Lunch only ## 3 3 Jayendra Lyne Peaches Breakfast and l… ## 4 4 Leon Rossini Anchovies Lunch only ## 5 5 Chidiegwu Dunkel Pizza Breakfast and l… ``` ] --- class: middle # Wrapping Up... --- class: middle # Case study: NYC Squirrels! --- ## NYC Squirrels! - [The Squirrel Census](https://www.thesquirrelcensus.com/) is a multimedia science, design, and storytelling project focusing on the Eastern gray (*Sciurus carolinensis*). They count squirrels and present their findings to the public. - This table contains squirrel data for each of the 3,023 sightings, including location coordinates, age, primary and secondary fur color, elevation, activities, communications, and interactions between squirrels and with humans. ``` r #install_github("mine-cetinkaya-rundel/nycsquirrels18") library(nycsquirrels18) ``` --- ## Locate the codebook [mine-cetinkaya-rundel.github.io/nycsquirrels18/reference/squirrels.html](https://mine-cetinkaya-rundel.github.io/nycsquirrels18/reference/squirrels.html) <br><br> -- ## Check the dimensions ``` r dim(squirrels) ``` ``` ## [1] 3023 35 ``` --- ## Look at the top... .midi[ ``` r squirrels %>% head() ``` ``` ## # A tibble: 6 × 35 ## long lat unique_squirrel_id hectare shift date ## <dbl> <dbl> <chr> <chr> <chr> <date> ## 1 -74.0 40.8 13A-PM-1014-04 13A PM 2018-10-14 ## 2 -74.0 40.8 15F-PM-1010-06 15F PM 2018-10-10 ## 3 -74.0 40.8 19C-PM-1018-02 19C PM 2018-10-18 ## 4 -74.0 40.8 21B-AM-1019-04 21B AM 2018-10-19 ## 5 -74.0 40.8 23A-AM-1018-02 23A AM 2018-10-18 ## 6 -74.0 40.8 38H-PM-1012-01 38H PM 2018-10-12 ## # ℹ 29 more variables: hectare_squirrel_number <dbl>, age <chr>, ## # primary_fur_color <chr>, highlight_fur_color <chr>, ## # combination_of_primary_and_highlight_color <chr>, ## # color_notes <chr>, location <chr>, ## # above_ground_sighter_measurement <chr>, ## # specific_location <chr>, running <lgl>, chasing <lgl>, ## # climbing <lgl>, eating <lgl>, foraging <lgl>, … ``` ] --- ## ...and the bottom .midi[ ``` r squirrels %>% tail() ``` ``` ## # A tibble: 6 × 35 ## long lat unique_squirrel_id hectare shift date ## <dbl> <dbl> <chr> <chr> <chr> <date> ## 1 -74.0 40.8 6D-PM-1020-01 06D PM 2018-10-20 ## 2 -74.0 40.8 21H-PM-1018-01 21H PM 2018-10-18 ## 3 -74.0 40.8 31D-PM-1006-02 31D PM 2018-10-06 ## 4 -74.0 40.8 37B-AM-1018-04 37B AM 2018-10-18 ## 5 -74.0 40.8 21C-PM-1006-01 21C PM 2018-10-06 ## 6 -74.0 40.8 7G-PM-1018-04 07G PM 2018-10-18 ## # ℹ 29 more variables: hectare_squirrel_number <dbl>, age <chr>, ## # primary_fur_color <chr>, highlight_fur_color <chr>, ## # combination_of_primary_and_highlight_color <chr>, ## # color_notes <chr>, location <chr>, ## # above_ground_sighter_measurement <chr>, ## # specific_location <chr>, running <lgl>, chasing <lgl>, ## # climbing <lgl>, eating <lgl>, foraging <lgl>, … ``` ] --- ## Validate with at least one external data source .medi[.pull-left-narrow[ ``` ## # A tibble: 3,023 × 2 ## long lat ## <dbl> <dbl> ## 1 -74.0 40.8 ## 2 -74.0 40.8 ## 3 -74.0 40.8 ## 4 -74.0 40.8 ## 5 -74.0 40.8 ## 6 -74.0 40.8 ## 7 -74.0 40.8 ## 8 -74.0 40.8 ## 9 -74.0 40.8 ## 10 -74.0 40.8 ## 11 -74.0 40.8 ## 12 -74.0 40.8 ## 13 -74.0 40.8 ## 14 -74.0 40.8 ## 15 -74.0 40.8 ## # ℹ 3,008 more rows ``` ] ] .pull-right-wide[ <img src="img/central-park-coords.png" width="100%" style="display: block; margin: auto;" /> ] --- ## Make a plot .medi[.pull-right[ ``` r ggplot(squirrels, aes(x = long, y = lat)) + geom_point(alpha = 0.2) ``` ] ] .pull-left-wide[ <img src="d15_goodtalk_files/figure-html/unnamed-chunk-11-1.png" width="90%" style="display: block; margin: auto;" /> ] -- .pull-left-narrow[ **Hypothesis:** Squirrel sightings will have a higher density on the perimeter than inside the park. ] --- ## Try the easy solution first .small[ .pull-right-narrow[ ``` r squirrels <- squirrels %>% separate(hectare, into = c("NS", "EW"), sep = 2, remove = FALSE) %>% mutate(where = if_else(NS %in% c("01", "42") | EW %in% c("A", "I"), "perimeter", "inside")) ggplot(squirrels, aes(x = long, y = lat, color = where)) + geom_point(alpha = 0.2) ``` ] ] .pull-left-wide[ <img src="d15_goodtalk_files/figure-html/unnamed-chunk-12-1.png" width="90%" style="display: block; margin: auto;" /> ] --- ## Then go deeper... .medi[.pull-left[ ``` r hectare_counts <- squirrels %>% group_by(hectare) %>% summarize(n = n()) hectare_centroids <- squirrels %>% group_by(hectare) %>% summarize( centroid_x = mean(long), centroid_y = mean(lat)) squirrels %>% left_join(hectare_counts, by = "hectare") %>% left_join(hectare_centroids, by = "hectare") %>% ggplot(aes(x = centroid_x, y = centroid_y, color = n)) + geom_hex() ``` ``` ## Warning: The following aesthetics were dropped during statistical ## transformation: colour. ## ℹ This can happen when ggplot fails to infer the correct ## grouping structure in the data. ## ℹ Did you forget to specify a `group` aesthetic or to convert a ## numerical variable into a factor? ``` ] ] .pull-right[ ``` ## Warning: The following aesthetics were dropped during statistical ## transformation: colour. ## ℹ This can happen when ggplot fails to infer the correct ## grouping structure in the data. ## ℹ Did you forget to specify a `group` aesthetic or to convert a ## numerical variable into a factor? ``` <img src="d15_goodtalk_files/figure-html/unnamed-chunk-13-1.png" width="100%" style="display: block; margin: auto;" /> ] --- ``` r hectare_counts <- squirrels %>% group_by(hectare) %>% summarize(n = n()) hectare_centroids <- squirrels %>% group_by(hectare) %>% summarize( centroid_x = mean(long), centroid_y = mean(lat) ) squirrels %>% left_join(hectare_counts, by = "hectare") %>% left_join(hectare_centroids, by = "hectare") %>% ggplot(aes(x = centroid_x, y = centroid_y, color = n)) + geom_hex() ``` --- ## The squirrel is staring at me! .medi[ ``` r squirrels %>% filter(str_detect(other_interactions, "star")) %>% select(shift, age, other_interactions) ``` ``` ## # A tibble: 11 × 3 ## shift age other_interactions ## <chr> <chr> <chr> ## 1 AM Adult staring at us ## 2 PM Adult he took 2 steps then turned and stared at me ## 3 PM Adult stared ## 4 PM Adult stared ## 5 PM Adult stared ## 6 PM Adult stared & then went back up tree—then ran to diffe… ## 7 PM Adult stared at me ## 8 AM Adult approaches (saw me & came forward),runs from (sta… ## 9 AM Adult started climbing down to me ## 10 PM Adult stared at me ## # ℹ 1 more row ``` ] --- ## Communicating for your audience .pull-left-narrow[ - Avoid: - jargon, - uninterpreted results, - lengthy output - Pay attention to: - organization, - presentation, - flow ] .pull-right-wide[ - Don't forget about: - code style, - coding best practices, - meaningful commits - Be open to: - suggestions, - feedback, - taking (calculated) risks ] --- # Sources - Mine Çetinkaya-Rundel's Data Science in a Box ([link](https://datasciencebox.org/)) - Jeffery T. Leek and Roger D. Peng. "What is the question?." Science 347.6228 (2015): 1314-1315. - Roger D. Peng and Elizabeth Matsui. "The Art of Data Science." A Guide for Anyone Who Works with Data. Skybrude Consulting, LLC (2015). --- class: middle # Wrapping Up...