class: center, middle, inverse, title-slide

.title[
# Tidy data
🔧
]
.author[
### S. Mason Garrison
]

---
layout: true

<div class="my-footer">
<span>
<a href="https://DataScience4Psych.github.io/DataScience4Psych/" target="_blank">Data Science for Psychologists</a>
</span>
</div>

---

## Tidy data

>Happy families are all alike; every unhappy family is unhappy in its own way.
>
>Leo Tolstoy

--

.pull-left[
**Characteristics of tidy data:**

- Each variable forms a column.
- Each observation forms a row.
- Each type of observational unit forms a table.
]

--

.pull-right[
**Characteristics of untidy data:**

!@#$%^&*()
]

---

## .question[ What makes this data not tidy? ]

<img src="img/hyperwar-airplanes-on-hand.png" width="70%" style="display: block; margin: auto;" />

.footnote[
Source: [Army Air Forces Statistical Digest, WW II](https://www.ibiblio.org/hyperwar/AAF/StatDigest/aafsd-3.html)
]

---

.question[
What makes this data not tidy?
]

<br>

<img src="img/hiv-est-prevalence-15-49.png" width="70%" style="display: block; margin: auto;" />

.footnote[
Source: [Gapminder, Estimated HIV prevalence among 15-49 year olds](https://www.gapminder.org/data)
]

---

.question[
What makes this data not tidy?
]

<br>

<img src="img/us-general-economic-characteristic-acs-2017.png" width="85%" style="display: block; margin: auto;" />

.footnote[
Source: [US Census Fact Finder, General Economic Characteristics, ACS 2017](https://factfinder.census.gov/faces/tableservices/jsf/pages/productview.xhtml?pid=ACS_17_5YR_DP03)
]

---

## Displaying vs. summarizing data

.pull-left[

```
## # A tibble: 87 × 3
##    name               height  mass
##    <chr>               <int> <dbl>
##  1 Luke Skywalker        172    77
##  2 C-3PO                 167    75
##  3 R2-D2                  96    32
##  4 Darth Vader           202   136
##  5 Leia Organa           150    49
##  6 Owen Lars             178   120
##  7 Beru Whitesun Lars    165    75
##  8 R5-D4                  97    32
##  9 Biggs Darklighter     183    84
## 10 Obi-Wan Kenobi        182    77
## # ℹ 77 more rows
```
]

.pull-right[

```
## # A tibble: 3 × 2
##   gender    avg_ht
##   <chr>      <dbl>
## 1 feminine    167.
## 2 masculine   177.
## 3 <NA>        175
```
]

---

.pull-left[

``` r
starwars %>%
  select(name, height, mass)
```
]

.pull-right[

``` r
starwars %>%
  group_by(gender) %>%
  summarize(
    avg_ht = mean(height, na.rm = TRUE)
  )
```
]

---
class: middle

# Wrapping Up...

---
class: middle

# Data structures in R

---

## Data structures in R

Three main ways to store data:

- Matrix:
    - Requires all the elements to be of the same type (e.g. numeric or character)
- Data frame:
    - Allows for mixed types of variables
- Tibble:
    - Like a data frame, but with stricter, more predictable behavior

---

## Tibbles

Tibbles are a better version of data frames.

- The [documentation](https://tibble.tidyverse.org/) describes them as

--

>"data.frames that are lazy and surly:
>
>they do less
>
>(i.e. they don’t change variable names or types,
>
>and don’t do partial matching)
>
>and complain more
>
>(e.g. when a variable does not exist)."

---

## Two other advantages:

1. They subset more consistently:
    - when you use square brackets to subset a tibble,
    - you always get another tibble.
    - With data frames, you sometimes get a vector and sometimes get a data frame.
2. They print more elegantly.

--

#### Compared to base R

+ Base R functions (including those for data frames)
+ try to guess what you want to do.
+ Sometimes, this "helpful" guessing leads to unexpected outcomes,
+ so tibbles have done away with most of that.

---

## Tibbles don't change names . . .

.midi[

``` r
library(tidyverse)
(df = data.frame("a" = 1:5, "b 2" = 5:1))
```

```
##   a b.2
## 1 1   5
## 2 2   4
## 3 3   3
## 4 4   2
## 5 5   1
```

``` r
(ti = tibble("a" = 1:5, "b 2" = 5:1))
```

```
## # A tibble: 5 × 2
##       a `b 2`
##   <int> <int>
## 1     1     5
## 2     2     4
## 3     3     3
## 4     4     2
## 5     5     1
```
]

---

## Tibbles complain about bad column names . . .

``` r
df$c
```

```
## NULL
```

``` r
ti$c
```

```
## Warning: Unknown or uninitialised column: `c`.
```

```
## NULL
```

---

## Tibbles subset consistently . . .

``` r
df[,1]
```

```
## [1] 1 2 3 4 5
```

``` r
ti[,1]
```

```
## # A tibble: 5 × 1
##       a
##   <int>
## 1     1
## 2     2
## 3     3
## 4     4
## 5     5
```

---

## Tibbles don't *do* partial matching . . .

``` r
df$b
```

```
## [1] 5 4 3 2 1
```

``` r
ti$b
```

```
## Warning: Unknown or uninitialised column: `b`.
```

```
## NULL
```

---

## Tibbles don't coerce strings to factors

Previously (before R 4.0.0)...

``` r
df = data.frame(l1 = letters[1:5])
```

The most annoying thing about data frames:

``` r
df$l2 = letters[1:5]
class(df$l1)
```

```
## [1] "factor"
```

``` r
class(df$l2)
```

```
## [1] "character"
```

---

## Tibbles don't have this problem

``` r
## but with tibbles it's all the same
ti = tibble(l1 = letters[1:5])
ti$l2 = letters[1:5]
class(ti$l1)
```

```
## [1] "character"
```

``` r
class(ti$l2)
```

```
## [1] "character"
```

---

# Sources

- Mine Çetinkaya-Rundel's Data Science in a Box ([link](https://datasciencebox.org/))
- Julia Fukuyama's EDA ([link](https://jfukuyama.github.io/))

---
class: middle

# Wrapping Up...

---

## Ok...So... technically...

Base R has fixed this issue: since R 4.0.0, `data.frame()` no longer converts strings to factors by default.

``` r
df = data.frame(l1 = letters[1:5])
df$l2 = letters[1:5]
class(df$l1)
```

```
## [1] "character"
```

``` r
class(df$l2)
```

```
## [1] "character"
```
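
---

## Reproducing either behavior on any R version

A minimal sketch, not part of the original deck: the factor-versus-character difference shown on the earlier slides depends on which R version you run. Passing the `stringsAsFactors` argument of base R's `data.frame()` explicitly should make the comparison independent of that default. The object names `old_style` and `new_style` are illustrative only.

``` r
# Explicitly request the pre-4.0.0 default: strings are converted to factors
old_style = data.frame(l1 = letters[1:5], stringsAsFactors = TRUE)

# Explicitly request the current default: strings stay character vectors
new_style = data.frame(l1 = letters[1:5], stringsAsFactors = FALSE)

class(old_style$l1)  # "factor"
class(new_style$l1)  # "character"
```

Setting the argument explicitly is one way to keep older scripts and teaching examples behaving the same across R installations. Tibbles never do this conversion, so they need no such switch.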