Data and visualization 💹

---

layout: true
  
<div class="my-footer">
<span>
<a href="https://DataScience4Psych.github.io/DataScience4Psych/" target="_blank">Data Science for Psychologists</a>
</span>
</div>

---

# Exploratory data analysis

---

.pull-left[ 
> "We will be exploring numbers. We need to handle them easily and look at them effectively. Techniques for handling and looking — whether graphical, arithmetic, or intermediate — will be important."
]

---

## What is EDA?

- Exploratory data analysis (EDA) is an approach to analyzing data sets to summarize its main characteristics.
- Today, we'll be focusing on the visual aspects
- But we might also calculate summary statistics and perform data wrangling/ manipulation/transformation at (or before) this stage of the analysis. 
  - That's what we'll focus on next.

---

# Data visualization

---

## Data visualization

> "The simple graph has brought more information to the data analyst's mind than any other device."

> John Tukey

- Data visualization is the creation and study of the visual representation of data.
- Many tools for visualizing data (R is one of them) exist, as do many approaches/systems within R for making data visualizations (**ggplot2** is one of them, and that's what we're going to use).

---

## ggplot2 `$\in$` tidyverse

.pull-left[
<img src="img/ggplot2-part-of-tidyverse.png" width="80%" style="display: block; margin: auto;" />
] 
.pull-right[ 
- **ggplot2** is tidyverse's data visualization package 
- `gg` in "ggplot2" stands for Grammar of Graphics 
- Inspired by the book **Grammar of Graphics** by Leland Wilkinson
]

---

## Grammar of Graphics

.pull-left-narrow[
A grammar of graphics is a tool to concisely describe the components of a graphic
]
.pull-right-wide[
<img src="img/grammar-of-graphics.png" width="90%" style="display: block; margin: auto;" />
]

.footnote[ Source: [BloggoType](http://bloggotype.blogspot.com/2016/08/holiday-notes2-grammar-of-graphics.html)]

---

# A First (and Reproducible) Example

---

## Reading in the Data

+ Heights of the highest points by state

``` r
## 
## load required packages and data
library(tidyverse)
options(tibble.print_min = 10)
heights = read_csv("data/highest-points-by-state.csv")
## switch from feet to meters
heights$elevation = heights$elevation * .3048
```
---
## Taking a look

``` r
heights
```

```
## # A tibble: 50 × 2
##    elevation state      
##        <dbl> <chr>      
##  1      733. Alabama    
##  2     6168. Alaska     
##  3     3851. Arizona    
##  4      839. Arkansas   
##  5     4418. California 
##  6     4399. Colorado   
##  7      725. Connecticut
##  8      137. Delaware   
##  9      105. Florida    
## 10     1458. Georgia    
## # ℹ 40 more rows
```

---
## Taking another look

``` r
arrange(heights, elevation)
```

```
## # A tibble: 50 × 2
##    elevation state       
##        <dbl> <chr>       
##  1      105. Florida     
##  2      137. Delaware    
##  3      163. Louisiana   
##  4      246. Mississippi 
##  5      247. Rhode Island
##  6      376. Illinois    
##  7      383. Indiana     
##  8      472. Ohio        
##  9      509. Iowa        
## 10      540. Missouri    
## # ℹ 40 more rows
```

---
## Taking another another look

``` r
arrange(heights, desc(elevation))
```

```
## # A tibble: 50 × 2
##    elevation state     
##        <dbl> <chr>     
##  1     6168. Alaska    
##  2     4418. California
##  3     4399. Colorado  
##  4     4392. Washington
##  5     4207. Wyoming   
##  6     4205. Hawaii    
##  7     4123. Utah      
##  8     4011. New Mexico
##  9     4005. Nevada    
## 10     3901. Montana   
## # ℹ 40 more rows
```

---
## Taking another³ look... thoughtfully

Goals:
+ Write down the set of numbers, keeping as much detail as
possible
+ Pack the numbers efficiently, so you can see all of them at once

---

## Taking another³ look: Stem and Leaf
.pull-left[

``` r
stem(heights$elevation)
```

```
## 
##   The decimal point is 3 digit(s) to the right of the |
## 
##   0 | 11222445555667778
##   1 | 0011123355566779
##   2 | 0027
##   3 | 4999
##   4 | 00122444
##   5 | 
##   6 | 2
```
]
--

.pull-right[ 
<br><br><br><br>
+ Notice that parts of the numbers (the beginnings) are
repeated.
+ The first digit of each number is printed at the beginning
of the line, the remainder at the ends.
+ The first digit is the "stem", the remainder are the "leaves".
]
---

# What have we learned?

The stem-and-leaf plot shows that there are three groups of
states:
+ Alaska
+ The western and Rocky Mountain states (California, Colorado, Washington, Wyoming, Hawaii, Utah, New Mexico, Nevada, Montana, Idaho, Arizona, Oregon)
+ All the other states

```
## 
##   The decimal point is 3 digit(s) to the right of the |
## 
##   0 | 11222445555667778
##   1 | 0011123355566779
##   2 | 0027
##   3 | 4999
##   4 | 00122444
##   5 | 
##   6 | 2
```
---

## Taking another⁴ look: Density

Compare the stem-and-leaf plot with a density estimate

``` r
ggplot(heights, aes(x = elevation)) + geom_density()
```

<img src="d03_dataviz_files/figure-html/unnamed-chunk-11-1.png" width="60%" style="display: block; margin: auto;" />
--

Alaska?

---
## Taking another⁴ look: Density+

Compare the stem-and-leaf plot with a density estimate

``` r
ggplot(heights, aes(x = elevation)) + geom_density() + geom_rug()
```

Alaska!

---
class: middle

# Wrapping Up...

---

# What is in a dataset?

---

## Dataset terminology

- Each row is an **observation**
- Each column is a **variable**

``` r
starwars
```

```
## # A tibble: 87 × 14
##    name   height  mass hair_color skin_color eye_color birth_year
##    <chr>   <int> <dbl> <chr>      <chr>      <chr>          <dbl>
##  1 Luke …    172    77 blond      fair       blue            19  
##  2 C-3PO     167    75 <NA>       gold       yellow         112  
##  3 R2-D2      96    32 <NA>       white, bl… red             33  
##  4 Darth…    202   136 none       white      yellow          41.9
##  5 Leia …    150    49 brown      light      brown           19  
##  6 Owen …    178   120 brown, gr… light      blue            52  
##  7 Beru …    165    75 brown      light      blue            47  
##  8 R5-D4      97    32 <NA>       white, red red             NA  
##  9 Biggs…    183    84 black      light      brown           24  
## 10 Obi-W…    182    77 auburn, w… fair       blue-gray       57  
## # ℹ 77 more rows
## # ℹ 7 more variables: sex <chr>, gender <chr>, homeworld <chr>,
## #   species <chr>, films <list>, vehicles <list>,
## #   starships <list>
```

]

---

## Luke Skywalker

![luke-skywalker](img/luke-skywalker.png)

---

## What's in the Star Wars data?

Take a `glimpse` at the data:

``` r
glimpse(starwars)
```

```
## Rows: 87
## Columns: 14
## $ name       <chr> "Luke Skywalker", "C-3PO", "R2-D2", "Darth V…
## $ height     <int> 172, 167, 96, 202, 150, 178, 165, 97, 183, 1…
## $ mass       <dbl> 77.0, 75.0, 32.0, 136.0, 49.0, 120.0, 75.0, …
## $ hair_color <chr> "blond", NA, NA, "none", "brown", "brown, gr…
## $ skin_color <chr> "fair", "gold", "white, blue", "white", "lig…
## $ eye_color  <chr> "blue", "yellow", "red", "yellow", "brown", …
## $ birth_year <dbl> 19.0, 112.0, 33.0, 41.9, 19.0, 52.0, 47.0, N…
## $ sex        <chr> "male", "none", "none", "male", "female", "m…
## $ gender     <chr> "masculine", "masculine", "masculine", "masc…
## $ homeworld  <chr> "Tatooine", "Tatooine", "Naboo", "Tatooine",…
## $ species    <chr> "Human", "Droid", "Droid", "Human", "Human",…
## $ films      <list> <"A New Hope", "The Empire Strikes Back", "…
## $ vehicles   <list> <"Snowspeeder", "Imperial Speeder Bike">, <…
## $ starships  <list> <"X-wing", "Imperial shuttle">, <>, <>, "TI…
```

---

.question[
How many rows and columns does this dataset have?
What does each row represent?
What does each column represent?
]

``` r
?starwars
```

---

``` r
nrow(starwars) # number of rows
```

```
## [1] 87
```

``` r
ncol(starwars) # number of columns
```

```
## [1] 14
```

``` r
dim(starwars)  # dimensions (row column)
```

```
## [1] 87 14
```
]

---

## Mass vs. height

.question[ 
How would you describe the relationship between mass and height of Starwars characters?
What other variables would help us understand data points that don't follow the overall trend?
Who is the not so tall but really chubby character?
]

---

## Jabba!

---

## Mass vs. height

``` r
ggplot(data = starwars, mapping = aes(x = height, y = mass)) +
  geom_point() +
  labs(title = "Mass vs. height of Starwars characters",
       x = "Height (cm)", y = "Weight (kg)")
```

```
## Warning: Removed 28 rows containing missing values or values outside the
## scale range (`geom_point()`).
```

---

.question[ 
- What are the functions doing the plotting?
- What is the dataset being plotted?
- Which variables map to which features (aesthetics) of the plot?
- What does the warning mean?<sup>+</sup>
]

``` r
ggplot(data = starwars, mapping = aes(x = height, y = mass)) +
  geom_point() +
  labs(title = "Mass vs. height of Starwars characters",
       x = "Height (cm)", y = "Weight (kg)")
```

```
## Warning: Removed 28 rows containing missing values or values outside the
## scale range (`geom_point()`).
```

---

## Hello ggplot2!

.pull-left-wide[
- `ggplot()` is the main function in ggplot2
- Plots are constructed in layers
- Structure of the code for plots can be summarized as

``` r
ggplot(data = [dataset], 
       mapping = aes(x = [x-variable], y = [y-variable])) +
   geom_xxx() +
   other options
```

- The ggplot2 package comes with the tidyverse

``` r
library(tidyverse)
```

- For help with ggplot2, see [ggplot2.tidyverse.org](http://ggplot2.tidyverse.org/)
]

---
class: middle

# Wrapping Up...

---

# Why do we visualize?

---

## Anscombe's quartet

```
##    set  x     y
## 1    I 10  8.04
## 2    I  8  6.95
## 3    I 13  7.58
## 4    I  9  8.81
## 5    I 11  8.33
## 6    I 14  9.96
## 7    I  6  7.24
## 8    I  4  4.26
## 9    I 12 10.84
## 10   I  7  4.82
## 11   I  5  5.68
## 12  II 10  9.14
## 13  II  8  8.14
## 14  II 13  8.74
## 15  II  9  8.77
## 16  II 11  9.26
## 17  II 14  8.10
## 18  II  6  6.13
## 19  II  4  3.10
## 20  II 12  9.13
## 21  II  7  7.26
## 22  II  5  4.74
```
] 
.pull-right[

```
##    set  x     y
## 23 III 10  7.46
## 24 III  8  6.77
## 25 III 13 12.74
## 26 III  9  7.11
## 27 III 11  7.81
## 28 III 14  8.84
## 29 III  6  6.08
## 30 III  4  5.39
## 31 III 12  8.15
## 32 III  7  6.42
## 33 III  5  5.73
## 34  IV  8  6.58
## 35  IV  8  5.76
## 36  IV  8  7.71
## 37  IV  8  8.84
## 38  IV  8  8.47
## 39  IV  8  7.04
## 40  IV  8  5.25
## 41  IV 19 12.50
## 42  IV  8  5.56
## 43  IV  8  7.91
## 44  IV  8  6.89
```
]

---

## Summarising Anscombe's quartet

``` r
quartet %>%
  group_by(set) %>%
  summarize(
    mean_x = mean(x), 
    mean_y = mean(y),
    sd_x = sd(x),
    sd_y = sd(y),
    r = cor(x, y)
  )
```

```
## # A tibble: 4 × 6
##   set   mean_x mean_y  sd_x  sd_y     r
##   <fct>  <dbl>  <dbl> <dbl> <dbl> <dbl>
## 1 I          9   7.50  3.32  2.03 0.816
## 2 II         9   7.50  3.32  2.03 0.816
## 3 III        9   7.5   3.32  2.03 0.816
## 4 IV         9   7.50  3.32  2.03 0.817
```

---

## Visualizing Anscombe's quartet

---

## Age at first kiss

---

## Facebook visits

---
# Sources

- Mine Çetinkaya-Rundel's Data Science in a Box ([link](https://datasciencebox.org/))
- Jenny Bryan's Stat545 ([link](https://stat545.com))
- Julia Fukuyama's EDA ([link](https://jfukuyama.github.io/))

---
class: middle

# Wrapping Up...

document.addEventListener('DOMContentLoaded', function() {
  var hiddenSlides = document.getElementsByClassName('my-hidden-slide');
  for (var i = 0; i < hiddenSlides.length; i++) {
    hiddenSlides[i].style.display = 'none';
  }
});

</script>