class: center, middle, inverse, title-slide

.title[
# Visualizing numerical and categorical data 🌠
]
.author[
### S. Mason Garrison
]

---
layout: true

<div class="my-footer">
<span>
<a href="https://DataScience4Psych.github.io/DataScience4Psych/" target="_blank">Data Science for Psychologists</a>
</span>
</div>

---
class: middle

# Introduction to Visualizing Data

---

## Terminology and Overview

<br>

In this chunk, we'll cover:

- Univariate data analysis - distribution of a single variable
- Bivariate data analysis - relationship between two variables
- Multivariate data analysis - relationship between many variables at once, usually focusing on the relationship between two while conditioning on others
- Variables can be
  - **Numerical** (classified as **continuous** or **discrete**)¹
  - **Categorical** (classified as **ordinal** or not, based on whether the levels have a natural ordering)

.footnote[¹ based on whether the variable can take on an infinite number of values or only non-negative whole numbers, respectively.]

---
class: middle

# Data

---

## Data: Lending Club

.pull-left-wide[
- Thousands of loans made through the Lending Club, a platform that allows individuals to lend to each other
- Not all loans are created equal: ease of getting a loan depends on (apparent) ability to repay the loan
- Data includes loans *made*, rather than loan applications
]
.pull-right-narrow[
<img src="img/lending-club.png" width="100%" style="display: block; margin: auto;" />
]

---

## A Glimpse at the Data

We'll start with a brief overview of the dataset.
.medi[

```r
library(tidyverse) # provides glimpse(), %>%, select(), and ggplot()
library(openintro) # provides the loans_full_schema data
glimpse(loans_full_schema)
```

```
## Rows: 10,000
## Columns: 55
## $ emp_title                 <chr> "global config enginee…
## $ emp_length                <dbl> 3, 10, 3, 1, 10, NA, 1…
## $ state                     <fct> NJ, HI, WI, PA, CA, KY…
## $ homeownership             <fct> MORTGAGE, RENT, RENT, …
## $ annual_income             <dbl> 90000, 40000, 40000, 3…
## $ verified_income           <fct> Verified, Not Verified…
## $ debt_to_income            <dbl> 18.01, 5.04, 21.15, 10…
## $ annual_income_joint       <dbl> NA, NA, NA, NA, 57000,…
## $ verification_income_joint <fct> , , , , Verified, , No…
## $ debt_to_income_joint      <dbl> NA, NA, NA, NA, 37.66,…
## $ delinq_2y                 <int> 0, 0, 0, 0, 0, 1, 0, 1…
## $ months_since_last_delinq  <int> 38, NA, 28, NA, NA, 3,…
## $ earliest_credit_line      <dbl> 2001, 1996, 2006, 2007…
...
```
]

---

## Selected variables

We then select the subset of variables used in the rest of this exploration:

```r
loans <- loans_full_schema %>%
  select(loan_amount, interest_rate, term, grade,
         state, annual_income, homeownership, debt_to_income)
glimpse(loans)
```

```
## Rows: 10,000
## Columns: 8
## $ loan_amount    <int> 28000, 5000, 2000, 21600, 23000, 5000, 2…
## $ interest_rate  <dbl> 14.07, 12.61, 17.09, 6.72, 14.07, 6.72, …
## $ term           <dbl> 60, 36, 36, 36, 36, 36, 60, 60, 36, 36, …
## $ grade          <ord> C, C, D, A, C, A, C, B, C, A, C, B, C, B…
## $ state          <fct> NJ, HI, WI, PA, CA, KY, MI, AZ, NV, IL, …
## $ annual_income  <dbl> 90000, 40000, 40000, 30000, 35000, 34000…
## $ homeownership  <fct> MORTGAGE, RENT, RENT, RENT, RENT, OWN, M…
## $ debt_to_income <dbl> 18.01, 5.04, 21.15, 10.16, 57.96, 6.46, …
```

---

## Selected variables

Here's a brief description of these variables:

<br>

.midi[
variable        | description | type
----------------|----------------|-------------
`loan_amount`   | Amount of the loan received, in US dollars | numerical, continuous
`interest_rate` | Interest rate on the loan, as an annual percentage | numerical, continuous
`term`          | The length of the loan, which is always set as a whole number of months | numerical, discrete
`grade`         | .midi[Loan grade, which takes values A through G and represents the quality of the loan and its likelihood of being repaid ]| categorical, ordinal
`state`         | US state where the borrower resides | categorical, not ordinal
`annual_income` | Borrower's annual income, including any second income, in US dollars | numerical, continuous
`homeownership` | Indicates whether the person owns, owns but has a mortgage, or rents | categorical, .medi[not ordinal]
`debt_to_income` | Debt-to-income ratio | numerical, continuous
]

---
class: middle

# Visualizing Numerical Data

---
class: middle

## Histograms

---

## Histogram: `loan_amount`

```r
ggplot(loans, aes(x = loan_amount)) +
  geom_histogram()
```

```
## `stat_bin()` using `bins = 30`. Pick better value with
## `binwidth`.
```

<img src="d05_06_viznumcat_files/figure-html/unnamed-chunk-7-1.png" width="30%" style="display: block; margin: auto;" />

---

## Histograms and Binwidth

Let's explore the `loan_amount` variable with different binwidths.

.pull-left[

```r
ggplot(loans, aes(x = loan_amount)) +
  geom_histogram(binwidth = 1000)
```

<img src="d05_06_viznumcat_files/figure-html/unnamed-chunk-8-1.png" width="80%" style="display: block; margin: auto;" />
]

--

.pull-right[

```r
ggplot(loans, aes(x = loan_amount)) +
  geom_histogram(binwidth = 5000)
```

<img src="d05_06_viznumcat_files/figure-html/unnamed-chunk-9-1.png" width="80%" style="display: block; margin: auto;" />
]

---

```r
ggplot(loans, aes(x = loan_amount)) +
  geom_histogram(binwidth = 20000)
```

<img src="d05_06_viznumcat_files/figure-html/unnamed-chunk-10-1.png" width="50%" style="display: block; margin: auto;" />

---

## Customizing histograms

We can further customize histograms by adjusting labels, colors, and other properties.
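
Colors that do not depend on the data are set as arguments to the geom rather than inside `aes()`. The following is a minimal sketch, assuming a simulated data frame: `toy_loans` and its values are hypothetical stand-ins, not the Lending Club data.

```r
library(ggplot2)

# Hypothetical stand-in for `loans`: 500 simulated loan amounts
set.seed(1)
toy_loans <- data.frame(
  loan_amount = round(runif(500, min = 1000, max = 40000))
)

# Constant colors go outside aes(): `fill` colors the bars,
# `color` draws the bar outlines
p <- ggplot(toy_loans, aes(x = loan_amount)) +
  geom_histogram(binwidth = 5000, fill = "steelblue", color = "white") +
  labs(x = "Loan amount ($)", y = "Frequency")

p
```

Mapping a variable to `fill` inside `aes()`, by contrast, colors the bars by group.

---

## Customizing histograms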
.pull-left[
<img src="d05_06_viznumcat_files/figure-html/unnamed-chunk-11-1.png" width="100%" style="display: block; margin: auto;" />
]
.pull-right[.small[

```r
ggplot(loans, aes(x = loan_amount)) +
  geom_histogram(binwidth = 5000) +
* labs(
*   x = "Loan amount ($)",
*   y = "Frequency",
*   title = "Amounts of Lending Club loans"
* )
```
]
]

---

## Fill with a categorical variable

.pull-left[
<img src="d05_06_viznumcat_files/figure-html/unnamed-chunk-12-1.png" width="100%" style="display: block; margin: auto;" />
]
.pull-right[.small[

```r
ggplot(loans, aes(x = loan_amount,
*                 fill = homeownership)) +
  geom_histogram(binwidth = 5000,
*                alpha = 0.5) +
  labs(
    x = "Loan amount ($)",
    y = "Frequency",
    title = "Amounts of Lending Club loans"
  )
```
]
]

---

## Facet with a categorical variable

.pull-left[
<img src="d05_06_viznumcat_files/figure-html/unnamed-chunk-13-1.png" width="100%" style="display: block; margin: auto;" />
]
.pull-right[.small[

```r
ggplot(loans, aes(x = loan_amount, fill = homeownership)) +
  geom_histogram(binwidth = 5000) +
  labs(
    x = "Loan amount ($)",
    y = "Frequency",
    title = "Amounts of Lending Club loans"
  ) +
* facet_wrap(~ homeownership, nrow = 3)
```
]
]

---
class: middle

## Density Plots

---

## Density Plot

```r
ggplot(loans, aes(x = loan_amount)) +
  geom_density()
```

<img src="d05_06_viznumcat_files/figure-html/unnamed-chunk-14-1.png" width="50%" style="display: block; margin: auto;" />

---

## Density plots and adjusting bandwidth

Density plots show the distribution of a numerical variable as a smooth curve, and the bandwidth controls how smooth that curve is.
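
In `geom_density()`, the `adjust` argument multiplies the default bandwidth. The following is a minimal base R sketch of that scaling, assuming simulated values: `x` is hypothetical, not the Lending Club data. It uses `stats::density()`, whose `adjust` argument works the same way.

```r
# Hypothetical data: 500 simulated loan amounts (not the Lending Club data)
set.seed(1)
x <- runif(500, min = 1000, max = 40000)

# density() returns the bandwidth it actually used in `$bw`
bw_half    <- density(x, adjust = 0.5)$bw
bw_default <- density(x, adjust = 1)$bw
bw_double  <- density(x, adjust = 2)$bw

# Bandwidths relative to the default
c(half = bw_half / bw_default, double = bw_double / bw_default)
```

The two ratios are 0.5 and 2: a smaller `adjust` narrows the kernel, so the curve follows the data more closely, and a larger `adjust` widens it, so the curve is smoother.

---

## Density plots and adjusting bandwidth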
.pull-left[
+ adjust = 0.5

```r
ggplot(loans, aes(x = loan_amount)) +
  geom_density(adjust = 0.5)
```

<img src="d05_06_viznumcat_files/figure-html/unnamed-chunk-15-1.png" width="50%" style="display: block; margin: auto;" />
]
.pull-right[
+ adjust = 1

```r
ggplot(loans, aes(x = loan_amount)) +
  geom_density(adjust = 1) # default bandwidth
```

<img src="d05_06_viznumcat_files/figure-html/unnamed-chunk-16-1.png" width="50%" style="display: block; margin: auto;" />
]

---

# adjust = 2

```r
ggplot(loans, aes(x = loan_amount)) +
  geom_density(adjust = 2)
```

<img src="d05_06_viznumcat_files/figure-html/unnamed-chunk-17-1.png" width="50%" style="display: block; margin: auto;" />

---

## Customizing Density Plots

We can customize density plots by adding labels, titles, and adjusting other visual properties.

.pull-left[
<img src="d05_06_viznumcat_files/figure-html/unnamed-chunk-18-1.png" width="100%" style="display: block; margin: auto;" />
]
.pull-right[.small[

```r
ggplot(loans, aes(x = loan_amount)) +
  geom_density(adjust = 2) +
* labs(
*   x = "Loan amount ($)",
*   y = "Density",
*   title = "Amounts of Lending Club loans"
* )
```
]
]

---

## Adding a categorical variable

.pull-left[
<img src="d05_06_viznumcat_files/figure-html/unnamed-chunk-19-1.png" width="100%" style="display: block; margin: auto;" />
]
.pull-right[.small[

```r
ggplot(loans, aes(x = loan_amount,
*                 fill = homeownership)) +
  geom_density(adjust = 2,
*              alpha = 0.5) +
  labs(
    x = "Loan amount ($)",
    y = "Density",
    title = "Amounts of Lending Club loans",
*   fill = "Homeownership"
  )
```
]
]

---
class: middle

# Box plot

---

## Box plot

```r
ggplot(loans, aes(x = interest_rate)) +
  geom_boxplot()
```

<img src="d05_06_viznumcat_files/figure-html/unnamed-chunk-20-1.png" width="60%" style="display: block; margin: auto;" />

---

## Box plot and outliers

```r
ggplot(loans, aes(x = annual_income)) +
  geom_boxplot()
```

<img src="d05_06_viznumcat_files/figure-html/unnamed-chunk-21-1.png" width="60%" style="display: block; margin: auto;" />

---

## Customizing box plots

.pull-left[
<img src="d05_06_viznumcat_files/figure-html/unnamed-chunk-22-1.png" width="100%" style="display: block; margin: auto;" />
]
.pull-right[.small[

```r
ggplot(loans, aes(x = interest_rate)) +
  geom_boxplot() +
  labs(
    x = "Interest rate (%)",
    y = NULL,
    title = "Interest rates of Lending Club loans"
  ) +
* theme(
*   axis.ticks.y = element_blank(),
*   axis.text.y = element_blank()
* )
```
]
]

---

## Adding a categorical variable

.pull-left[
<img src="d05_06_viznumcat_files/figure-html/unnamed-chunk-23-1.png" width="100%" style="display: block; margin: auto;" />
]
.pull-right[.small[

```r
ggplot(loans, aes(x = interest_rate,
*                 y = grade)) +
  geom_boxplot() +
  labs(
    x = "Interest rate (%)",
    y = "Grade",
    title = "Interest rates of Lending Club loans",
*   subtitle = "by grade of loan"
  )
```
]
]

---
class: middle

# Bar plot

---

## Bar plot

```r
ggplot(data = starwars, mapping = aes(x = gender)) +
  geom_bar()
```

<img src="d05_06_viznumcat_files/figure-html/unnamed-chunk-24-1.png" width="55%" style="display: block; margin: auto;" />

---

## Segmented bar plot: counts

```r
ggplot(data = starwars, mapping = aes(x = gender,
*                                     fill = hair_color))+
  geom_bar()
```

<img src="d05_06_viznumcat_files/figure-html/unnamed-chunk-25-1.png" width="55%" style="display: block; margin: auto;" />

---

## Segmented bar plots

```r
ggplot(data = starwars, mapping = aes(x = gender,
*                                     fill = hair_color2))+
  geom_bar()
```

<img src="d05_06_viznumcat_files/figure-html/unnamed-chunk-26-1.png" width="55%" style="display: block; margin: auto;" />

---

# For the curious...
```r
# Collapse every hair color other than these three into "Other"
starwars <- starwars %>%
  mutate(hair_color2 = fct_other(hair_color,
                                 keep = c("black", "brown", "blond")))
```

---

## Segmented bar plots

```r
ggplot(data = starwars, mapping = aes(x = gender,
*                                     fill = hair_color2))+
* geom_bar()+
  coord_flip()
```

<img src="d05_06_viznumcat_files/figure-html/unnamed-chunk-28-1.png" width="55%" style="display: block; margin: auto;" />

---

## Segmented bar plots: proportions

```r
ggplot(data = starwars, mapping = aes(x = gender, fill = hair_color2)) +
  geom_bar(position = "fill") +
  coord_flip() +
  labs(y = "proportion")
```

<img src="d05_06_viznumcat_files/figure-html/unnamed-chunk-29-1.png" width="43%" style="display: block; margin: auto;" />

---

.question[
Which bar plot is a more useful representation for visualizing the relationship between gender and hair color?
]

.pull-left[
<img src="d05_06_viznumcat_files/figure-html/unnamed-chunk-30-1.png" width="100%" style="display: block; margin: auto;" />
]
.pull-right[
<img src="d05_06_viznumcat_files/figure-html/unnamed-chunk-31-1.png" width="100%" style="display: block; margin: auto;" />
]

---

## Customizing bar plots

We can customize bar plots by adjusting labels, titles, colors, and other visual properties.

.pull-left[
<img src="d05_06_viznumcat_files/figure-html/unnamed-chunk-32-1.png" width="60%" style="display: block; margin: auto;" />
]
.pull-right[.small[

```r
*ggplot(starwars, aes(y = gender, fill = hair_color2)) +
  geom_bar(position = "fill") +
* labs(
*   x = "Proportion",
*   y = "Gender",
*   fill = "Hair Color",
*   title = "Hair Colors of Starwars",
*   subtitle = "by gender"
* )
```
]
]

---
class: middle

## Summary and Wrapping Up

- In this session, we explored techniques for visualizing numerical and categorical data.
- We examined histograms, density plots, box plots, and bar plots to look at distributions and at relationships within the data.
- By effectively visualizing data, we can uncover patterns and trends that lead to deeper understanding.

---

# Sources

- Mine Çetinkaya-Rundel's Data Science in a Box ([link](https://datasciencebox.org/))