Tidy data 🔧

---

layout: true
  
<div class="my-footer">
<span>
<a href="https://DataScience4Psych.github.io/DataScience4Psych/" target="_blank">Data Science for Psychologists</a>
</span>
</div>

---

## Tidy data

>Happy families are all alike; every unhappy family is unhappy in its own way. 
>
>Leo Tolstoy

.pull-left[
**Characteristics of tidy data:**

- Each variable forms a column.
- Each observation forms a row.
- Each type of observational unit forms a table.
]

--
.pull-right[
**Characteristics of untidy data:**

!@#$%^&*()
]
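
As a rough illustration of the first two characteristics, the sketch below reshapes a small wide table, where the variable "year" is hidden in the column names, into a tidy one. The data, the column names, and the choice of `tidyr::pivot_longer()` are assumptions made for this example; they do not come from the sources shown on the following slides.

``` r
library(tibble)
library(tidyr)

# Hypothetical untidy table: the variable "year" is spread across column names
untidy <- tibble(
  country = c("A", "B"),
  `2009`  = c(0.1, 0.5),
  `2010`  = c(0.2, 0.4)
)

# One row per country-year observation, one column per variable
tidy <- pivot_longer(
  untidy,
  cols      = c(`2009`, `2010`),
  names_to  = "year",
  values_to = "prevalence"
)

tidy
```
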

---

.footnote[
Source: [Army Air Forces Statistical Digest, WW II](https://www.ibiblio.org/hyperwar/AAF/StatDigest/aafsd-3.html)
]

---

<br>

.footnote[
Source: [Gapminder, Estimated HIV prevalence among 15-49 year olds](https://www.gapminder.org/data)
]

---

<br>

.footnote[
Source: [US Census Fact Finder, General Economic Characteristics, ACS 2017](https://factfinder.census.gov/faces/tableservices/jsf/pages/productview.xhtml?pid=ACS_17_5YR_DP03)
]

---

## Displaying vs. summarizing data

.pull-left[

```
## # A tibble: 87 × 3
##    name               height  mass
##    <chr>               <int> <dbl>
##  1 Luke Skywalker        172    77
##  2 C-3PO                 167    75
##  3 R2-D2                  96    32
##  4 Darth Vader           202   136
##  5 Leia Organa           150    49
##  6 Owen Lars             178   120
##  7 Beru Whitesun Lars    165    75
##  8 R5-D4                  97    32
##  9 Biggs Darklighter     183    84
## 10 Obi-Wan Kenobi        182    77
## # ℹ 77 more rows
```
]
.pull-right[

```
## # A tibble: 3 × 2
##   gender    avg_ht
##   <chr>      <dbl>
## 1 feminine    167.
## 2 masculine   177.
## 3 <NA>        175
```
]

---

.pull-left[

``` r
starwars %>%
  select(name, height, mass)
```
]
.pull-right[

``` r
starwars %>%
  group_by(gender) %>%
  summarize(
    avg_ht = mean(height, na.rm = TRUE)
    )
```
]
---

# Wrapping Up...
---

# Data structures in R

---
## Data structures in R
Three main ways to store rectangular data:

- Matrix: 
  - Requires all the elements to be of the same type (e.g. numeric or character)
- Data frame: 
  - Allows for mixed types of variables
- Tibble: 
  - Like a data frame, but works more consistently

---
## Tibbles

Tibbles are a modern take on data frames: the same idea, with stricter and more predictable behavior.

- The [documentation](https://tibble.tidyverse.org/) describes them as

--
>"data.frames that are lazy and surly:
>
>they do less 
>
>(i.e. they don’t change variable names or types,
>
>and don’t do partial matching)
>
>and complain more
>
>(e.g. when a variable does not exist)."

---

## Two other advantages:

1. They subset more consistently:
  - Subsetting a tibble with square brackets always returns another tibble.
  - Subsetting a data frame with square brackets returns a vector in some cases and a data frame in others.
2. They print more elegantly.

#### Compared to base R
+ Base R functions, including those for data frames, try to guess what you want to do.
+ Sometimes this "helpful" guessing leads to unexpected outcomes, so tibbles have done away with most of it.

---

## Tibbles don't change names

. . .
.midi[

``` r
library(tidyverse)
(df = data.frame("a" = 1:5, "b 2" = 5:1))
```

```
##   a b.2
## 1 1   5
## 2 2   4
## 3 3   3
## 4 4   2
## 5 5   1
```

``` r
(ti = tibble("a" = 1:5, "b 2" = 5:1))
```

```
## # A tibble: 5 × 2
##       a `b 2`
##   <int> <int>
## 1     1     5
## 2     2     4
## 3     3     3
## 4     4     2
## 5     5     1
```
]
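
If you need base R to keep the original name as well, one option is the `check.names` argument of `data.frame()`. The sketch below assumes that is what you want, and also shows how to reach a non-syntactic column such as `b 2` afterwards. `df_raw` is a name introduced here for illustration.

``` r
library(tibble)

# check.names = FALSE tells data.frame() not to repair the names
df_raw <- data.frame("a" = 1:5, "b 2" = 5:1, check.names = FALSE)
names(df_raw)

# Non-syntactic names need backticks with $, or a string with [[ ]]
ti <- tibble("a" = 1:5, "b 2" = 5:1)
ti$`b 2`
ti[["b 2"]]
```
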
---

## Tibbles complain about missing columns

. . .

``` r
df$c
```

```
## NULL
```

``` r
ti$c
```

```
## Warning: Unknown or uninitialised column: `c`.
```

```
## NULL
```
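
Because the tibble still returns `NULL` after warning, code that must not continue with a missing column has to check for itself. The following is one possible sketch, not the only approach: it tests the name up front, then captures the warning with `tryCatch()`. The object `msg` is a hypothetical name.

``` r
library(tibble)

ti <- tibble(a = 1:5, `b 2` = 5:1)

# Check for the column explicitly before using it
"c" %in% names(ti)

# Or intercept the warning and keep its message for reporting
msg <- tryCatch(
  ti$c,
  warning = function(w) conditionMessage(w)
)
msg
```
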

---

## Tibbles subset consistently

. . .

``` r
df[,1]
```

```
## [1] 1 2 3 4 5
```

``` r
ti[,1]
```

```
## # A tibble: 5 × 1
##       a
##   <int>
## 1     1
## 2     2
## 3     3
## 4     4
## 5     5
```
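
Either behavior can be requested explicitly. The sketch below, using a simplified pair of objects with syntactic names (an assumption for brevity), keeps the data frame shape in base R with `drop = FALSE` and pulls a plain vector out of a tibble with `[[ ]]`.

``` r
library(tibble)

df <- data.frame(a = 1:5, b = 5:1)
ti <- tibble(a = 1:5, b = 5:1)

# Data frame: drop = FALSE keeps the result a data frame
class(df[, 1, drop = FALSE])

# Tibble: [ ] always returns a tibble; use [[ ]] or $ when you want the vector
class(ti[, 1])
ti[["a"]]
```
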

---

## Tibbles don't *do* partial matching

. . .

``` r
df$b
```

```
## [1] 5 4 3 2 1
```

``` r
ti$b
```

```
## Warning: Unknown or uninitialised column: `b`.
```

```
## NULL
```
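
Partial matching in base R is specific to `$`. As a small sketch, assuming the same `a` and `b.2` columns as the data frame above, `[[ ]]` matches names exactly by default, and the global option `warnPartialMatchDollar` makes `$` warn when it matches partially.

``` r
df <- data.frame(a = 1:5, b.2 = 5:1)

# $ partially matches "b" to "b.2"
df$b

# [[ ]] matches names exactly by default, so the same lookup returns NULL
df[["b"]]

# Base R can warn about partial matches through a global option
old <- options(warnPartialMatchDollar = TRUE)
df$b          # still returns the column, now with a warning
options(old)  # restore the previous setting
```
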
---

## Tibbles don't coerce strings to factors

Previously (before R 4.0.0, when `stringsAsFactors` defaulted to `TRUE`)...

``` r
df = data.frame(l1 = letters[1:5])
```

The most annoying thing about data frames:

``` r
df$l2 = letters[1:5]
class(df$l1)
```

```
## [1] "factor"
```

``` r
class(df$l2)
```

```
## [1] "character"
```

---

## Tibbles don't have this problem

``` r
# With tibbles, both columns stay character
ti = tibble(l1 = letters[1:5])
ti$l2 = letters[1:5]
class(ti$l1)
```

```
## [1] "character"
```

``` r
class(ti$l2)
```

```
## [1] "character"
```

---

# Sources

- Mine Çetinkaya-Rundel's Data Science in a Box ([link](https://datasciencebox.org/))
- Julia Fukuyama's EDA ([link](https://jfukuyama.github.io/))

---

# Wrapping Up...

---

## Ok...So... technically...

Base R has fixed this issue: since R 4.0.0, `data.frame()` defaults to `stringsAsFactors = FALSE`.

``` r
df = data.frame(l1 = letters[1:5])
df$l2 = letters[1:5]
class(df$l1)
```

```
## [1] "character"
```

``` r
class(df$l2)
```

```
## [1] "character"
```
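
When a script has to behave the same way regardless of the R version, the argument can be set explicitly instead of relying on the default. A minimal sketch (`df_fct` and `df_chr` are names chosen for this example):

``` r
# Request the old behavior explicitly: strings become factors
df_fct <- data.frame(l1 = letters[1:5], stringsAsFactors = TRUE)
class(df_fct$l1)

# Request the current behavior explicitly: strings stay character
df_chr <- data.frame(l1 = letters[1:5], stringsAsFactors = FALSE)
class(df_chr$l1)
```
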