Visualizing numerical and categorical data 🌠

---

layout: true
  
<div class="my-footer">
<span>
<a href="https://DataScience4Psych.github.io/DataScience4Psych/" target="_blank">Data Science for Psychologists</a>
</span>
</div>

---

# Introduction to Visualizing Data

---

## Terminology and Overview

<br>

In this section, we look at how to visualize numerical and categorical data.

We'll cover:

- Univariate data analysis: the distribution of a single variable
- Bivariate data analysis: the relationship between two variables
- Multivariate data analysis: the relationship between many variables at once, usually focusing on the relationship between two while conditioning on the others

Variables can be:

- **Numerical**, classified as **continuous** or **discrete**¹
- **Categorical**, classified as **ordinal** or not, based on whether the levels have a natural ordering

.footnote[¹ Based on whether the variable can take on an infinite number of values or only non-negative whole numbers, respectively.]

---

# Data

---

## Data: Lending Club

.pull-left-wide[
- Thousands of loans made through the Lending Club,
  - a platform that allows individuals to lend to each other
- Not all loans are created equal
  - ease of getting a loan depends on (apparent) ability to repay the loan
- Data includes loans *made*, rather than loan applications.
]
.pull-right-narrow[
<img src="img/lending-club.png" width="100%" style="display: block; margin: auto;" />
]

---

## A Glimpse at the Data

We'll start by getting a brief overview of our dataset. 
.medi[

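```r
library(tidyverse)  # dplyr (glimpse, select, %>%), ggplot2, forcats
library(openintro)  # provides the loans_full_schema dataset
glimpse(loans_full_schema)
```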

```
## Rows: 10,000
## Columns: 55
## $ emp_title                        <chr> "global config enginee…
## $ emp_length                       <dbl> 3, 10, 3, 1, 10, NA, 1…
## $ state                            <fct> NJ, HI, WI, PA, CA, KY…
## $ homeownership                    <fct> MORTGAGE, RENT, RENT, …
## $ annual_income                    <dbl> 90000, 40000, 40000, 3…
## $ verified_income                  <fct> Verified, Not Verified…
## $ debt_to_income                   <dbl> 18.01, 5.04, 21.15, 10…
## $ annual_income_joint              <dbl> NA, NA, NA, NA, 57000,…
## $ verification_income_joint        <fct> , , , , Verified, , No…
## $ debt_to_income_joint             <dbl> NA, NA, NA, NA, 37.66,…
## $ delinq_2y                        <int> 0, 0, 0, 0, 0, 1, 0, 1…
## $ months_since_last_delinq         <int> 38, NA, 28, NA, NA, 3,…
## $ earliest_credit_line             <dbl> 2001, 1996, 2006, 2007…
...
```
]
---

## Selected variables

We then select a subset of variables that are particularly relevant for our exploration:

```r
loans <- loans_full_schema %>%
  select(loan_amount, interest_rate, term, grade, 
         state, annual_income, homeownership, debt_to_income)
glimpse(loans)
```

```
## Rows: 10,000
## Columns: 8
## $ loan_amount    <int> 28000, 5000, 2000, 21600, 23000, 5000, 2…
## $ interest_rate  <dbl> 14.07, 12.61, 17.09, 6.72, 14.07, 6.72, …
## $ term           <dbl> 60, 36, 36, 36, 36, 36, 60, 60, 36, 36, …
## $ grade          <ord> C, C, D, A, C, A, C, B, C, A, C, B, C, B…
## $ state          <fct> NJ, HI, WI, PA, CA, KY, MI, AZ, NV, IL, …
## $ annual_income  <dbl> 90000, 40000, 40000, 30000, 35000, 34000…
## $ homeownership  <fct> MORTGAGE, RENT, RENT, RENT, RENT, OWN, M…
## $ debt_to_income <dbl> 18.01, 5.04, 21.15, 10.16, 57.96, 6.46, …
```

---

## Selected variables

Here's a brief description of these variables:

<br>

.midi[
variable         | description | type
-----------------|-------------|------
`loan_amount`    | Amount of the loan received, in US dollars | numerical, continuous
`interest_rate`  | Interest rate on the loan, in an annual percentage | numerical, continuous
`term`           | The length of the loan, which is always set as a whole number of months | numerical, discrete
`grade`          | Loan grade, which takes values A through G and represents the quality of the loan and its likelihood of being repaid | categorical, ordinal
`state`          | US state where the borrower resides | categorical, not ordinal
`annual_income`  | Borrower's annual income, including any second income, in US dollars | numerical, continuous
`homeownership`  | Indicates whether the person owns, owns but has a mortgage, or rents | categorical, not ordinal
`debt_to_income` | Debt-to-income ratio | numerical, continuous
]
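
The types in the table can be checked against how R stores each column. The following is a minimal sketch, assuming the `openintro` and `dplyr` packages are installed; the name `col_classes` is only illustrative.

```r
library(openintro)
library(dplyr)

loans <- loans_full_schema %>%
  select(loan_amount, interest_rate, term, grade,
         state, annual_income, homeownership, debt_to_income)

# First class of each column: integer/numeric columns are numerical,
# factor columns are categorical (an "ordered" class marks an ordinal factor).
col_classes <- sapply(loans, function(col) class(col)[1])
col_classes

# The number of distinct values helps judge discrete vs. continuous.
n_distinct(loans$term)

# The levels of a factor show its categories and their stored order.
levels(loans$grade)
```

Note that the storage class is only a hint: `term` is stored as a double but is treated as discrete because it takes whole-number values.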

---

# Visualizing Numerical Data

---
class: middle

## Histograms

---

## Histogram: `loan_amount`

```r
ggplot(loans, aes(x = loan_amount)) +
  geom_histogram()
```

```
## `stat_bin()` using `bins = 30`. Pick better value with
## `binwidth`.
```

---

## Histograms and Binwidth

Let's explore the `loan_amount` variable with different binwidths.

.pull-left[
```r
ggplot(loans, 
       aes(x = loan_amount)) +
  geom_histogram(binwidth = 1000)
```

<img src="d05_06_viznumcat_files/figure-html/unnamed-chunk-8-1.png" width="80%" style="display: block; margin: auto;" />
]
--
.pull-right[

```r
ggplot(loans, 
       aes(x = loan_amount)) +
  geom_histogram(binwidth = 5000)
```

<img src="d05_06_viznumcat_files/figure-html/unnamed-chunk-9-1.png" width="80%" style="display: block; margin: auto;" />
]
---

```r
ggplot(loans, aes(x = loan_amount)) +
  geom_histogram(binwidth = 20000)
```
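
To see what a histogram computes, the sketch below counts loans per 5,000-dollar bin by hand. It assumes the `openintro`, `dplyr`, and `ggplot2` packages are installed. The choice of `boundary = 0` and the name `bin_counts` are illustrative assumptions; `geom_histogram()` may align its bins differently, so these counts need not match the plot exactly.

```r
library(openintro)
library(dplyr)
library(ggplot2)

# cut_width() assigns each loan amount to an interval of width 5000;
# boundary = 0 makes the intervals start at 0 (an illustrative choice).
bin_counts <- loans_full_schema %>%
  count(bin = cut_width(loan_amount, width = 5000, boundary = 0))

bin_counts
```

A larger `width` produces fewer rows in `bin_counts`, which is the tabular counterpart of a coarser histogram.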

---

## Customizing histograms

We can further customize the histograms by adjusting labels, colors, and other properties.

.pull-right[.small[

```r
ggplot(loans, aes(x = loan_amount)) +
  geom_histogram(binwidth = 5000) +
* labs(
*   x = "Loan amount ($)",
*   y = "Frequency",
*   title = "Amounts of Lending Club loans"
* )
```
]
]

---

## Fill with a categorical variable

.pull-left[
<img src="d05_06_viznumcat_files/figure-html/unnamed-chunk-12-1.png" width="100%" style="display: block; margin: auto;" />
]
.pull-right[.small[

```r
ggplot(loans, aes(x = loan_amount, 
*                 fill = homeownership)) +
  geom_histogram(binwidth = 5000,
*                alpha = 0.5) +
  labs(
    x = "Loan amount ($)",
    y = "Frequency",
    title = "Amounts of Lending Club loans"
  )
```
]
]

---

## Facet with a categorical variable

.pull-left[
<img src="d05_06_viznumcat_files/figure-html/unnamed-chunk-13-1.png" width="100%" style="display: block; margin: auto;" />
]
.pull-right[.small[

```r
ggplot(loans, aes(x = loan_amount, fill = homeownership)) + 
  geom_histogram(binwidth = 5000) +
  labs(
    x = "Loan amount ($)",
    y = "Frequency",
    title = "Amounts of Lending Club loans"
  ) +
* facet_wrap(~ homeownership, nrow = 3)
```
]
]

---

## Density Plots

---

## Density Plot

```r
ggplot(loans, aes(x = loan_amount)) +
  geom_density()
```

---

## Density plots and adjusting bandwidth

Density plots allow us to visualize the distribution of numerical data with varying bandwidths.

.pull-left[
```r
ggplot(loans, 
       aes(x = loan_amount)) +
  geom_density(adjust = 0.5)
```

<img src="d05_06_viznumcat_files/figure-html/unnamed-chunk-15-1.png" width="50%" style="display: block; margin: auto;" />
]
.pull-right[
+ adjust = 1

```r
ggplot(loans, 
       aes(x = loan_amount)) +
  geom_density(adjust = 1) # default bandwidth
```

<img src="d05_06_viznumcat_files/figure-html/unnamed-chunk-16-1.png" width="50%" style="display: block; margin: auto;" />
]
---

## Density plot: `adjust = 2`

```r
ggplot(loans, aes(x = loan_amount)) +
  geom_density(adjust = 2)
```
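
The effect of `adjust` can be checked numerically with base R's `density()`, which also takes an `adjust` argument that multiplies the default bandwidth. This is a sketch under the assumption that `geom_density()` treats `adjust` the same way; the variable names are illustrative.

```r
library(openintro)

x <- loans_full_schema$loan_amount

# $bw holds the bandwidth actually used by the kernel density estimate.
bw_default <- density(x, na.rm = TRUE)$bw                # adjust = 1
bw_narrow  <- density(x, adjust = 0.5, na.rm = TRUE)$bw  # wigglier curve
bw_wide    <- density(x, adjust = 2, na.rm = TRUE)$bw    # smoother curve

c(narrow = bw_narrow, default = bw_default, wide = bw_wide)
```

A smaller bandwidth follows the data more closely, while a larger one smooths over local bumps.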

---

## Customizing Density Plots

We can customize density plots by adding labels, titles, and adjusting other visual properties.

.pull-left[
<img src="d05_06_viznumcat_files/figure-html/unnamed-chunk-18-1.png" width="100%" style="display: block; margin: auto;" />
]
.pull-right[.small[

```r
ggplot(loans, aes(x = loan_amount)) +
  geom_density(adjust = 2) +
* labs(
*   x = "Loan amount ($)",
*   y = "Density",
*   title = "Amounts of Lending Club loans"
* )
```
]
]

---

## Adding a categorical variable

.pull-left[
<img src="d05_06_viznumcat_files/figure-html/unnamed-chunk-19-1.png" width="100%" style="display: block; margin: auto;" />
]
.pull-right[.small[

```r
ggplot(loans, aes(x = loan_amount, 
*                 fill = homeownership)) +
  geom_density(adjust = 2, 
*              alpha = 0.5) +
  labs(
    x = "Loan amount ($)",
    y = "Density",
    title = "Amounts of Lending Club loans", 
*   fill = "Homeownership"
  )
```
]
]

---

# Box plot

---

## Box plot

```r
ggplot(loans, aes(x = interest_rate)) +
  geom_boxplot()
```

---

## Box plot and outliers

```r
ggplot(loans, aes(x = annual_income)) +
  geom_boxplot()
```
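
The points drawn beyond the whiskers can be identified by hand. The sketch below applies the common 1.5 × IQR rule, which is assumed here to match the default used by `geom_boxplot()`; the variable names are illustrative.

```r
library(openintro)

income <- loans_full_schema$annual_income

# Quartiles and interquartile range
q1  <- quantile(income, 0.25, na.rm = TRUE)
q3  <- quantile(income, 0.75, na.rm = TRUE)
iqr <- q3 - q1

# Fences: values beyond these are flagged as outliers
lower_fence <- q1 - 1.5 * iqr
upper_fence <- q3 + 1.5 * iqr

is_outlier <- !is.na(income) & (income < lower_fence | income > upper_fence)

# Number of flagged observations
sum(is_outlier)
```

For a strongly right-skewed variable such as income, most flagged values tend to sit above the upper fence.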

---

## Customizing box plots

.pull-left[
<img src="d05_06_viznumcat_files/figure-html/unnamed-chunk-22-1.png" width="100%" style="display: block; margin: auto;" />
]
.pull-right[.small[

```r
ggplot(loans, aes(x = interest_rate)) +
  geom_boxplot() +
  labs(
    x = "Interest rate (%)",
    y = NULL,
    title = "Interest rates of Lending Club loans"
  ) +
* theme(
*   axis.ticks.y = element_blank(),
*   axis.text.y = element_blank()
* )
```
]
]

---

## Adding a categorical variable

.pull-left[
<img src="d05_06_viznumcat_files/figure-html/unnamed-chunk-23-1.png" width="100%" style="display: block; margin: auto;" />
]
.pull-right[.small[

```r
ggplot(loans, aes(x = interest_rate,
*                 y = grade)) +
  geom_boxplot() +
  labs(
    x = "Interest rate (%)",
    y = "Grade",
    title = "Interest rates of Lending Club loans",
*   subtitle = "by grade of loan"
  )
```
]
]

---

# Bar plot

---

## Bar plot

```r
ggplot(data = starwars, mapping = aes(x = gender)) +
  geom_bar()
```

---

## Segmented bar plot: counts

```r
ggplot(data = starwars, mapping = aes(x = gender, 
*                 fill = hair_color))+
  geom_bar()
```

---

## Segmented bar plots

```r
ggplot(data = starwars, mapping = aes(x = gender, 
*   fill = hair_color2))+
  geom_bar()
```

---
# For the curious...

```r
starwars <- starwars %>%
  mutate(hair_color2 =
           fct_other(hair_color,
                     keep = c("black", 
                              "brown",
                              "blond")
           )
  )
```

---

## Segmented bar plots

```r
ggplot(data = starwars, mapping = aes(x = gender, 
*   fill = hair_color2))+
* geom_bar()+
  coord_flip()
```

---

## Segmented bar plots: proportions

```r
ggplot(data = starwars,
       mapping = aes(x = gender, fill = hair_color2)) +
  geom_bar(position = "fill") +
  coord_flip() +
  labs(y = "proportion")
```
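
The proportions that `position = "fill"` displays can also be computed as a table. The following is a minimal sketch, assuming the `dplyr` and `forcats` packages are installed; `hair_props` and `prop` are illustrative names.

```r
library(dplyr)
library(forcats)

hair_props <- starwars %>%
  # Lump all hair colors except black, brown, and blond into "Other"
  mutate(hair_color2 = fct_other(hair_color,
                                 keep = c("black", "brown", "blond"))) %>%
  # Count characters in each gender / hair color combination
  count(gender, hair_color2) %>%
  # Within each gender, convert counts to proportions
  group_by(gender) %>%
  mutate(prop = n / sum(n)) %>%
  ungroup()

hair_props
```

Within each gender the `prop` values sum to 1, which is why every bar in the filled plot has the same length.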

---

.question[
Which bar plot is a more useful representation for visualizing the relationship between gender and hair color?
]

.pull-left[
<img src="d05_06_viznumcat_files/figure-html/unnamed-chunk-30-1.png" width="100%" style="display: block; margin: auto;" />
]
.pull-right[
<img src="d05_06_viznumcat_files/figure-html/unnamed-chunk-31-1.png" width="100%" style="display: block; margin: auto;" />
]

---

## Customizing bar plots

We have flexibility in customizing bar plots by adjusting labels, titles, colors, and other visual properties.

.pull-left[
<img src="d05_06_viznumcat_files/figure-html/unnamed-chunk-32-1.png" width="60%" style="display: block; margin: auto;" />
]
.pull-right[.small[

```r
*ggplot(starwars, aes(y = gender,
                  fill = hair_color2)) +
  geom_bar(position = "fill") +
* labs(
*   x = "Proportion",
*   y = "Gender",
*   fill = "Hair Color",
*   title = "Hair colors of Star Wars characters",
*   subtitle = "by gender"
* )
```
]
]

---

## Summary and Wrapping Up

- In this session, we explored various techniques for visualizing numerical and categorical data. 
- We examined histograms, density plots, box plots, and bar plots to see how individual variables are distributed and how they relate to categorical variables.
- By effectively visualizing data, we can uncover patterns and trends that lead to deeper understanding.

---

# Sources

- Mine Çetinkaya-Rundel's Data Science in a Box ([link](https://datasciencebox.org/))