class: center, middle, inverse, title-slide .title[ # Importing data
⬆️ ] .author[ ### S. Mason Garrison ] --- layout: true <div class="my-footer"> <span> <a href="https://DataScience4Psych.github.io/DataScience4Psych/" target="_blank">Data Science for Psychologists</a> </span> </div> --- class: middle # Reading rectangular data into R --- class: middle .pull-left[ <img src="img/readr.png" width="80%" style="display: block; margin: auto;" /> ] .pull-right[ <img src="img/readxl.png" width="80%" style="display: block; margin: auto;" /> ] --- ## readr - `read_csv()` - comma delimited files - `read_csv2()` - semicolon separated files (common in countries where , is used as the decimal place) - `read_tsv()` - tab delimited files - `read_delim()` - reads in files with any delimiter - `read_fwf()` - fixed width files - `read_table()` - common variation of fixed width files where columns are separated by white space - ... --- ## Reading data ``` r nobel <- read_csv(file = "data/nobel.csv") ``` --- .xxsmall[ ``` ## ── Data Summary ──────────────────────── ## Values ## Name nobel ## Number of rows 935 ## Number of columns 26 ## _______________________ ## Column type frequency: ## character 21 ## Date 2 ## numeric 3 ## ________________________ ## Group variables None ## ## ── Variable type: character ──────────────────────────────────────────────────────────────────────────────────────────── ## skim_variable n_missing complete_rate min max empty n_unique whitespace ## 1 firstname 0 1 2 59 0 720 0 ## 2 surname 29 0.969 2 26 0 851 0 ## 3 category 0 1 5 10 0 6 0 ## 4 affiliation 250 0.733 4 110 0 303 0 ## 5 city 255 0.727 4 27 0 185 0 ## 6 country 254 0.728 3 14 0 27 0 ## 7 gender 0 1 3 6 0 3 0 ## 8 born_city 28 0.970 3 29 0 613 0 ## 9 born_country 28 0.970 3 28 0 80 0 ## 10 born_country_code 28 0.970 2 2 0 77 0 ## 11 died_city 327 0.650 4 29 0 303 0 ## 12 died_country 321 0.657 3 16 0 48 0 ## 13 died_country_code 321 0.657 2 2 0 46 0 ## 14 overall_motivation 918 0.0182 55 114 0 7 0 ## 15 motivation 0 1 24 337 0 656 0 ## 16 born_country_original 28 0.970 3 52 0 122 0 ## 17 born_city_original 28 0.970 3 36 0 616 0 ## 18 died_country_original 321 0.657 3 35 0 52 0 ## 19 died_city_original 327 0.650 4 29 0 303 0 ## 20 city_original 255 0.727 4 27 0 185 0 ## 21 country_original 254 0.728 3 35 0 29 0 ## ## ── Variable type: Date ───────────────────────────────────────────────────────────────────────────────────────────────── ## skim_variable n_missing complete_rate min max median n_unique ## 1 born_date 33 0.965 1817-11-30 1997-07-12 1916-06-28 885 ## 2 died_date 308 0.671 1903-11-01 2019-08-07 1983-03-09 616 ## ## ── Variable type: numeric ────────────────────────────────────────────────────────────────────────────────────────────── ## skim_variable n_missing complete_rate mean sd p0 p25 p50 p75 p100 hist ## 1 id 0 1 475. 278. 1 234. 470 716. 969 ▇▇▇▇▇ ## 2 year 0 1 1970. 33.3 1901 1947 1976 1999 2018 ▃▃▅▆▇ ## 3 share 0 1 1.99 0.936 1 1 2 3 4 ▇▇▁▅▂ ``` ] --- ## Writing data - Write a file ``` r df <- tribble( ~x, ~y, 1, "a", 2, "b", 3, "c" ) write_csv(df, file = "data/df.csv") ``` --- - Check that it got written out ``` r fs::dir_ls("data") ``` ``` ## data/df-na.csv data/df.csv ## data/edi-airbnb.csv data/favourite-food.xlsx ## data/nobel.csv data/sales ## data/sales.xlsx ``` --- .your-turn[ - [class git repo](https://github.com/DataScience4Psych) > `AE 06 - Nobels and sales + Data import` > open `nobels-csv.Rmd` and knit. - Read in the `nobels.csv` file from the `data-raw/` folder. - Split into two (STEM and non-STEM): - Create a new data frame, `nobel_stem`, that filters for the STEM fields (Physics, Medicine, Chemistry, and Economics). - Create another data frame, `nobel_nonstem`, that filters for the remaining fields. - Write out the two data frames to `nobel-stem.csv` and `nobel-nonstem.csv`, respectively, to `data/`. **Hint:** Use the `%in%` operator when filtering. ] --- class: middle # Pausing... --- class: middle # Variable names --- ``` r edi_airbnb <- read_csv("data/edi-airbnb.csv") names(edi_airbnb) ``` ``` ## [1] "ID" "Price" ## [3] "neighbourhood" "accommodates" ## [5] "Number of bathrooms" "Number of Bedrooms" ## [7] "n beds" "Review Scores Rating" ## [9] "Number of reviews" "listing_url" ``` -- ... but R doesn't allow spaces in variable names ``` r ggplot(edi_airbnb, aes(x = Number of bathrooms, y = Price)) + geom_point() ``` ``` ## Error in parse(text = input): <text>:1:35: unexpected symbol ## 1: ggplot(edi_airbnb, aes(x = Number of ## ^ ``` --- ## Option 1 - Define column names ``` r edi_airbnb_col_names <- read_csv("data/edi-airbnb.csv", col_names = c("id", "price", "neighbourhood", "accommodates", "bathroom", "bedroom", "bed", "review_scores_rating", "n_reviews", "url")) names(edi_airbnb_col_names) ``` ``` ## [1] "id" "price" ## [3] "neighbourhood" "accommodates" ## [5] "bathroom" "bedroom" ## [7] "bed" "review_scores_rating" ## [9] "n_reviews" "url" ``` --- ## Option 2 - Format text to snake_case ``` r edi_airbnb_cleaned_names <- edi_airbnb %>% janitor::clean_names() names(edi_airbnb_cleaned_names) ``` ``` ## [1] "id" "price" ## [3] "neighbourhood" "accommodates" ## [5] "number_of_bathrooms" "number_of_bedrooms" ## [7] "n_beds" "review_scores_rating" ## [9] "number_of_reviews" "listing_url" ``` --- class: middle # Wrapping Up... --- class: middle # Variable types --- .question[ Which class is `x`? Why? ] .pull-left[ <br><br><br> <img src="img/df-na.png" width="100%" style="display: block; margin: auto;" /> ] .pull-right[ ``` r read_csv("data/df-na.csv") ``` ``` ## # A tibble: 9 × 3 ## x y z ## <chr> <chr> <chr> ## 1 1 a hi ## 2 <NA> b hello ## 3 3 Not applicable 9999 ## 4 4 d ola ## 5 5 e hola ## 6 . f whatup ## 7 7 g wassup ## 8 8 h sup ## 9 9 i <NA> ``` ] --- ## Option 1. Explicit NAs ``` r read_csv("data/df-na.csv", na = c("", "NA", ".", "9999", "Not applicable")) ``` .pull-left[ <br> <img src="img/df-na.png" width="100%" style="display: block; margin: auto;" /> ] .pull-right[ ``` ## # A tibble: 9 × 3 ## x y z ## <dbl> <chr> <chr> ## 1 1 a hi ## 2 NA b hello ## 3 3 <NA> <NA> ## 4 4 d ola ## 5 5 e hola ## 6 NA f whatup ## 7 7 g wassup ## 8 8 h sup ## 9 9 i <NA> ``` ] --- ## Option 2. Specify column types .midi[ ``` r read_csv("data/df-na.csv", col_types = list(col_double(), col_character(), col_character())) ``` ``` ## Warning: One or more parsing issues, call `problems()` on your data frame ## for details, e.g.: ## dat <- vroom(...) ## problems(dat) ``` ``` ## # A tibble: 9 × 3 ## x y z ## <dbl> <chr> <chr> ## 1 1 a hi ## 2 NA b hello ## 3 3 Not applicable 9999 ## 4 4 d ola ## 5 5 e hola ## 6 NA f whatup ## 7 7 g wassup ## 8 8 h sup ## 9 9 i <NA> ``` ] --- ## Column types .small[ **type function** | **data type** ------------------ | ------------- `col_character()` | character `col_date()` | date `col_datetime()` | POSIXct (date-time) `col_double()` | double (numeric) `col_factor()` | factor `col_guess()` | let readr guess (default) `col_integer()` | integer `col_logical()` | logical `col_number()` | numbers mixed with non-number characters `col_numeric()` | double or integer `col_skip()` | do not read `col_time()` | time ] --- # Pause the video... .your-turn[ - [class git repo](https://github.com/DataScience4Psych) > `AE 06 - Nobels and sales + Data import` > open `food-excel.Rmd` and knit. Work on **Exercise 1**. - Read in the Excel file called `favourite-food.xlsx` from the `data-raw/` folder. - Clean up `NA`s and make sure you're happy with variable types. - Convert SES (socio economic status) to a factor variables with levels in the following order: `Low`, `Middle`, `High`. - Write out the resulting data frame to `favourite-food.csv` in the `data/` folder. - Finally, read `favourite-food.csv` back in from the `data/` folder and observe the variable types. Are they as you left them? ] --- class: middle # Ready to move forward? --- ## `read_rds()` and `write_rds()` - CSVs can be unreliable for saving interim results if there is specific variable type information you want to hold on to. - An alternative is RDS files, you can read and write them with `read_rds()` and `write_rds()`, respectively. ``` r read_rds(path) write_rds(x, path) ``` --- .your-turn[ - [class git repo](https://github.com/DataScience4Psych) > `AE 06 - Nobels and sales + Data import` > open `food-excel.Rmd` and knit. Work on **Exercise 2**. - Repeat the first three steps from Exercise 1. - Write out the resulting data frame to `favourite-food.rds` in the `data/` folder. - Read `favourite-food.rds` back in from the `data/` folder and observe the variable types. Are they as you left them? ] --- class: middle # Ready to move forward? --- .pull-left[ .your-turn[ - [class git repo](https://github.com/DataScience4Psych) > `AE 06 - Nobels and sales + Data import` > open `sales-excel.Rmd` and knit. - Load the `sales.xlsx` file from the `data-raw/` folder, using appropriate arguments for the `read_excel()` function such that it looks like the following. ] ] .medi[ .pull-right[ ``` ## # A tibble: 9 × 2 ## id n ## <chr> <chr> ## 1 Brand 1 n ## 2 1234 8 ## 3 8721 2 ## 4 1822 3 ## 5 Brand 2 n ## 6 3333 1 ## 7 2156 3 ## 8 3987 6 ## 9 3216 5 ``` ] ] --- .pull-left[ .your-turn[ - [class git repo](https://github.com/DataScience4Psych) > `AE 06 - Nobels and sales + Data import` > open `sales-excel.Rmd` and knit. - Manipulate the sales data such that it looks like the following. ] ] .pull-right[ ``` ## # A tibble: 7 × 3 ## brand id n ## <chr> <dbl> <dbl> ## 1 Brand 1 1234 8 ## 2 Brand 1 8721 2 ## 3 Brand 1 1822 3 ## 4 Brand 2 3333 1 ## 5 Brand 2 2156 3 ## 6 Brand 2 3987 6 ## 7 Brand 2 3216 5 ``` ] --- class: middle # Wrapping Up... --- class: middle # Importing many files --- .medi[ ``` r sales_files <- fs::dir_ls("data/sales") sales_files ``` ``` ## data/sales/01-sales.csv data/sales/02-sales.csv ## data/sales/03-sales.csv ``` ``` r #library(vroom) sales <- vroom::vroom(sales_files, id = "filename") sales ``` ``` ## # A tibble: 19 × 6 ## filename month year brand item n ## <chr> <chr> <dbl> <dbl> <dbl> <dbl> ## 1 data/sales/01-sales.csv January 2019 1 1234 3 ## 2 data/sales/01-sales.csv January 2019 1 8721 9 ## 3 data/sales/01-sales.csv January 2019 1 1822 2 ## 4 data/sales/01-sales.csv January 2019 2 3333 1 ## 5 data/sales/01-sales.csv January 2019 2 2156 9 ## 6 data/sales/01-sales.csv January 2019 2 3987 6 ## 7 data/sales/01-sales.csv January 2019 2 3827 6 ## 8 data/sales/02-sales.csv February 2019 1 1234 8 ## 9 data/sales/02-sales.csv February 2019 1 8721 2 ## 10 data/sales/02-sales.csv February 2019 1 1822 3 ## # ℹ 9 more rows ``` ] --- ## vroom vroom!! .pull-left[ <img src="img/vroom.png" width="80%" style="display: block; margin: auto;" /> ] .pull-right[ - **vroom** is most useful for reading large amounts of data in, fast! - and it has nice bells-and-whistles like delimiter guessing, reading many files in at once, etc. - Learn more at [vroom.r-lib.org](https://vroom.r-lib.org/) ] --- class: middle # Other types of data --- ## Other types of data - **googlesheets4:** Google Sheets - **haven**: SPSS, Stata, and SAS files - **DBI**, along with a database specific backend (e.g. RMySQL, RSQLite, RPostgreSQL etc): - allows you to run SQL queries against a database and return a data frame - **jsonline**: JSON - **xml2**: xml - **rvest**: web scraping - **httr**: web APIs - **sparklyr**: data loaded into spark --- # Sources - Mine Çetinkaya-Rundel's Data Science in a Box ([link](https://datasciencebox.org/)) --- class: middle # Wrapping Up...