Importing data ⬆️

---

layout: true
  
<div class="my-footer">
<span>
<a href="https://DataScience4Psych.github.io/DataScience4Psych/" target="_blank">Data Science for Psychologists</a>
</span>
</div>

---

# Learning Goals

---

## Learning Goals

By the end of this session, you will be able to...

- Read rectangular data from CSV, Excel, and other file formats
- Handle variable naming issues and define appropriate column types
- Clean data through NA management and type specification
- Write data to multiple formats and import multiple files efficiently

---

# Reading rectangular data into R

---

.pull-left[
<img src="img/readr.png" alt="" width="80%" style="display: block; margin: auto;" />
]
.pull-right[
<img src="img/readxl.png" alt="" width="80%" style="display: block; margin: auto;" />
]

---

## readr

- `read_csv()` - comma-delimited files
- `read_csv2()` - semicolon-delimited files (common in countries where `,` is used as the decimal separator)
- `read_tsv()` - tab-delimited files
- `read_delim()` - files with any delimiter
- `read_fwf()` - fixed-width files
- `read_table()` - a common variation of fixed-width files where columns are separated by white space
- ...
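
---

As a rough illustration of how the delimiter-specific readers above differ, the sketch below parses two small made-up strings instead of files. It assumes readr 2.0 or later, where `I()` marks a string as literal data; the object names and values are invented for the example.

```r
library(readr)

# Hypothetical data: semicolon-delimited, with a comma as the decimal separator
semi_text <- I("x;y\n1,5;a\n2,5;b\n")
semi <- read_csv2(semi_text, show_col_types = FALSE)

# The same values, pipe-delimited, with a period as the decimal separator
pipe_text <- I("x|y\n1.5|a\n2.5|b\n")
piped <- read_delim(pipe_text, delim = "|", show_col_types = FALSE)

# Both readers should recover x as a numeric column
semi$x
piped$x
```

`read_delim()` is the general-purpose option: the other readers can be thought of as the same idea with the delimiter chosen for you.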

---

## Reading data

``` r
nobel <- read_csv(file = "data/nobel.csv")
```

---

```
## -- Data Summary ------------------------
##                            Values
## Name                       nobel 
## Number of rows             935   
## Number of columns          26    
## _______________________          
## Column type frequency:           
##   character                21    
##   Date                     2     
##   numeric                  3     
## ________________________         
## Group variables            None  
## 
## -- Variable type: character --------------------------------------------------------------------------------------------
##    skim_variable         n_missing complete_rate min max empty n_unique whitespace
##  1 firstname                     0        1        2  59     0      720          0
##  2 surname                      29        0.969    2  26     0      851          0
##  3 category                      0        1        5  10     0        6          0
##  4 affiliation                 250        0.733    4 110     0      303          0
##  5 city                        255        0.727    4  27     0      185          0
##  6 country                     254        0.728    3  14     0       27          0
##  7 gender                        0        1        3   6     0        3          0
##  8 born_city                    28        0.970    3  29     0      613          0
##  9 born_country                 28        0.970    3  28     0       80          0
## 10 born_country_code            28        0.970    2   2     0       77          0
## 11 died_city                   327        0.650    4  29     0      303          0
## 12 died_country                321        0.657    3  16     0       48          0
## 13 died_country_code           321        0.657    2   2     0       46          0
## 14 overall_motivation          918        0.0182  55 114     0        7          0
## 15 motivation                    0        1       24 337     0      656          0
## 16 born_country_original        28        0.970    3  52     0      122          0
## 17 born_city_original           28        0.970    3  36     0      616          0
## 18 died_country_original       321        0.657    3  35     0       52          0
## 19 died_city_original          327        0.650    4  29     0      303          0
## 20 city_original               255        0.727    4  27     0      185          0
## 21 country_original            254        0.728    3  35     0       29          0
## 
## -- Variable type: Date -------------------------------------------------------------------------------------------------
##   skim_variable n_missing complete_rate min        max        median     n_unique
## 1 born_date            33         0.965 1817-11-30 1997-07-12 1916-06-28      885
## 2 died_date           308         0.671 1903-11-01 2019-08-07 1983-03-09      616
## 
## -- Variable type: numeric ----------------------------------------------------------------------------------------------
##   skim_variable n_missing complete_rate    mean      sd   p0   p25  p50   p75 p100 hist 
## 1 id                    0             1  475.   278.       1  234.  470  716.  969 ▇▇▇▇▇
## 2 year                  0             1 1970.    33.3   1901 1947  1976 1999  2018 ▃▃▅▆▇
## 3 share                 0             1    1.99   0.936    1    1     2    3     4 ▇▇▁▅▂
```

---

## Writing data

- Write a file

``` r
df <- tribble(
  ~x, ~y,
  1,  "a",
  2,  "b",
  3,  "c"
)

write_csv(df, file = "data/df.csv")
```

---

- Check that it got written out

``` r
fs::dir_ls("data")
```

```
## data/df-na.csv           data/df.csv              
## data/edi-airbnb.csv      data/favourite-food.xlsx 
## data/nobel.csv           data/sales               
## data/sales.xlsx
```
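
---

Listing the folder confirms that a file exists, but not that its contents survived the trip. One possible check, sketched here, is to read the file back and compare it with the original. The example writes to a temporary file (an assumption made so that nothing in `data/` is overwritten), and `tmp_csv` and `df_back` are names invented for the illustration.

```r
library(readr)
library(tibble)

df <- tribble(
  ~x, ~y,
  1,  "a",
  2,  "b",
  3,  "c"
)

# Write to a temporary file so the project folder is left untouched
tmp_csv <- tempfile(fileext = ".csv")
write_csv(df, file = tmp_csv)

# Read the file back and compare each column with the original
df_back <- read_csv(tmp_csv, show_col_types = FALSE)
identical(df$x, df_back$x)
identical(df$y, df_back$y)
```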

---

.your-turn[
- [class git repo](https://github.com/DataScience4Psych) > `AE 06 - Nobels and sales + Data import` > open `nobels-csv.Rmd` and knit.
- Read in the `nobels.csv` file from the `data-raw/` folder.
- Split into two (STEM and non-STEM):
  - Create a new data frame, `nobel_stem`, that filters for the STEM fields (Physics, Medicine, Chemistry, and Economics).
  - Create another data frame, `nobel_nonstem`, that filters for the remaining fields.
- Write out the two data frames to `nobel-stem.csv` and `nobel-nonstem.csv`, respectively, in the `data/` folder.

**Hint:** Use the `%in%` operator when filtering.
]
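
---

For anyone who has not used `%in%` before, here is a minimal sketch of the pattern on made-up data. The `pets` tibble and its values are hypothetical and unrelated to the Nobel data; only the filtering idiom carries over.

```r
library(dplyr)
library(tibble)

# Hypothetical toy data, not the Nobel data set
pets <- tibble(
  name    = c("Rex", "Tom", "Nemo", "Polly"),
  species = c("dog", "cat", "fish", "bird")
)

mammal_species <- c("dog", "cat")

# Keep the rows whose species appears in the vector
mammals <- pets %>% filter(species %in% mammal_species)

# Negate with ! to keep everything else
non_mammals <- pets %>% filter(!(species %in% mammal_species))
```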

---

# Pausing...

---

# Variable names

---

``` r
edi_airbnb <- read_csv("data/edi-airbnb.csv")
names(edi_airbnb)
```

```
##  [1] "ID"                   "Price"               
##  [3] "neighbourhood"        "accommodates"        
##  [5] "Number of bathrooms"  "Number of Bedrooms"  
##  [7] "n beds"               "Review Scores Rating"
##  [9] "Number of reviews"    "listing_url"
```

... but R doesn't allow spaces in variable names unless the name is wrapped in backticks

``` r
ggplot(edi_airbnb, aes(x = Number of bathrooms, y = Price)) +
  geom_point()
```

```
## Error in parse(text = input): <text>:1:35: unexpected symbol
## 1: ggplot(edi_airbnb, aes(x = Number of
##                                       ^
```
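
---

Names that contain spaces can still be used if they are quoted with backticks, although that gets tedious quickly. The sketch below uses a hypothetical two-row stand-in for the Airbnb data to show the idea.

```r
library(tibble)

# Hypothetical two-row stand-in for the Airbnb data
listings <- tibble(
  `Number of bathrooms` = c(1, 2),
  Price = c(80, 120)
)

# Backticks let R treat the whole name, spaces included, as one symbol
listings$`Number of bathrooms`
mean(listings$`Number of bathrooms`)

# The same quoting works inside aes(), for example:
# ggplot(listings, aes(x = `Number of bathrooms`, y = Price)) + geom_point()
```

Renaming the columns once, as in the options that follow, is usually less error-prone than quoting them every time.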

---

## Option 1 - Define column names

``` r
edi_airbnb_col_names <- read_csv("data/edi-airbnb.csv",
  skip = 1, # skip the original header row so it is not read in as data
  col_names = c("id", "price", "neighbourhood", "accommodates",
                "bathroom", "bedroom", "bed", 
                "review_scores_rating", "n_reviews", "url"))

names(edi_airbnb_col_names)
```

```
##  [1] "id"                   "price"               
##  [3] "neighbourhood"        "accommodates"        
##  [5] "bathroom"             "bedroom"             
##  [7] "bed"                  "review_scores_rating"
##  [9] "n_reviews"            "url"
```

---

## Option 2 - Format text to snake_case

``` r
edi_airbnb_cleaned_names <- edi_airbnb %>%
  janitor::clean_names()

names(edi_airbnb_cleaned_names)
```

```
##  [1] "id"                   "price"               
##  [3] "neighbourhood"        "accommodates"        
##  [5] "number_of_bathrooms"  "number_of_bedrooms"  
##  [7] "n_beds"               "review_scores_rating"
##  [9] "number_of_reviews"    "listing_url"
```
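
---

Names can also be repaired at import time instead of afterwards. The sketch below passes a small function to the `name_repair` argument of `read_csv()`. It assumes readr 2.0 or later; `to_snake()` is a hypothetical helper written for this example (much cruder than `janitor::clean_names()`), and the inline data are made up.

```r
library(readr)

# Hypothetical helper: lower-case the names and turn runs of
# non-alphanumeric characters into a single underscore
to_snake <- function(nms) tolower(gsub("[^A-Za-z0-9]+", "_", nms))

listings <- read_csv(
  I("ID,Number of bathrooms,n beds\n1,2,3\n"),
  name_repair = to_snake,
  show_col_types = FALSE
)

names(listings)
```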

---

# Wrapping Up...

---

# Variable types

---

.pull-left[
<br><br><br>
<img src="img/df-na.png" alt="" width="100%" style="display: block; margin: auto;" />
]
.pull-right[

``` r
read_csv("data/df-na.csv")
```

```
## # A tibble: 9 x 3
##   x     y              z     
##   <chr> <chr>          <chr> 
## 1 1     a              hi    
## 2 <NA>  b              hello 
## 3 3     Not applicable 9999  
## 4 4     d              ola   
## 5 5     e              hola  
## 6 .     f              whatup
## 7 7     g              wassup
## 8 8     h              sup   
## 9 9     i              <NA>
```
]

---

## Option 1. Explicit NAs

``` r
read_csv("data/df-na.csv", 
         na = c("", "NA", ".", "9999", "Not applicable"))
```

.pull-left[
<br>
<img src="img/df-na.png" alt="" width="100%" style="display: block; margin: auto;" />
]
.pull-right[

```
## # A tibble: 9 x 3
##       x y     z     
##   <dbl> <chr> <chr> 
## 1     1 a     hi    
## 2    NA b     hello 
## 3     3 <NA>  <NA>  
## 4     4 d     ola   
## 5     5 e     hola  
## 6    NA f     whatup
## 7     7 g     wassup
## 8     8 h     sup   
## 9     9 i     <NA>
```
]

---

## Option 2. Specify column types

``` r
read_csv("data/df-na.csv", 
  col_types = list(col_double(), col_character(), col_character()))
```

```
## # A tibble: 9 x 3
##       x y              z     
##   <dbl> <chr>          <chr> 
## 1     1 a              hi    
## 2    NA b              hello 
## 3     3 Not applicable 9999  
## 4     4 d              ola   
## 5     5 e              hola  
## 6    NA f              whatup
## 7     7 g              wassup
## 8     8 h              sup   
## 9     9 i              <NA>
```
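
---

Column types can also be given by name with `cols()`, which tends to be easier to read than a positional list. The sketch below uses made-up inline data (assuming readr 2.0 or later for `I()`) and then calls `problems()` to see which values did not fit the requested type. Expect a parsing warning when it runs.

```r
library(readr)

# Hypothetical inline data with a "." standing in for a missing number
raw <- I("x,y,z\n1,a,hi\n.,b,hello\n3,c,9999\n")

typed <- read_csv(
  raw,
  col_types = cols(
    x = col_double(),
    y = col_character(),
    z = col_character()
  )
)

typed$x          # the "." could not be parsed as a number
problems(typed)  # lists the values that did not match the requested type
```

Forcing a type does not say why a value is missing, so it is often worth combining this with an explicit `na` argument.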

---

## Column types

.small[
**type function**  | **data type**
------------------ | -------------
`col_character()`  | character
`col_date()`       | date
`col_datetime()`   | POSIXct (date-time)
`col_double()`     | double (numeric)
`col_factor()`     | factor
`col_guess()`      | let readr guess (default)
`col_integer()`    | integer
`col_logical()`    | logical
`col_number()`     | numbers mixed with non-number characters
`col_numeric()`    | double or integer
`col_skip()`       | do not read
`col_time()`       | time
]
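
---

To make the table more concrete, here is a small sketch that mixes several of these type functions in one `cols()` specification. The `survey` data, column names, and factor levels are invented for the example, and `I()` assumes readr 2.0 or later.

```r
library(readr)

# Hypothetical survey data supplied as literal text
survey_text <- I(
  "id,ses,joined,notes\n1,Low,2020-01-15,a\n2,High,2021-03-02,b\n3,Middle,2019-11-30,c\n"
)

survey <- read_csv(
  survey_text,
  col_types = cols(
    id     = col_integer(),
    ses    = col_factor(levels = c("Low", "Middle", "High")),
    joined = col_date(format = "%Y-%m-%d"),
    notes  = col_skip()  # dropped on import
  )
)

str(survey)
```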

---

# Pause the video...

.your-turn[
- [class git repo](https://github.com/DataScience4Psych) > `AE 06 - Nobels and sales + Data import` > open `food-excel.Rmd` and knit. Work on **Exercise 1**.
- Read in the Excel file called `favourite-food.xlsx` from the `data-raw/` folder.
- Clean up `NA`s and make sure you're happy with variable types.
- Convert SES (socioeconomic status) to a factor variable with levels in the following order: `Low`, `Middle`, `High`.
- Write out the resulting data frame to `favourite-food.csv` in the `data/` folder.
- Finally, read `favourite-food.csv` back in from the `data/` folder and observe the variable types. Are they as you left them?
]

---

# Ready to move forward?

---

## `read_rds()` and `write_rds()`

- CSVs can be unreliable for saving interim results because they store only text: variable type information you want to hold on to, such as factor levels, is lost.
- An alternative is RDS files, which you can read and write with `read_rds()` and `write_rds()`, respectively.

``` r
read_rds(path)
write_rds(x, path)
```
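
---

A minimal sketch of the difference, using a made-up tibble with an ordered set of factor levels: the same object is written to a CSV file and to an RDS file, then read back from each. The temporary file paths and the `ratings` data are assumptions made for the illustration.

```r
library(readr)
library(tibble)

# Hypothetical data with a factor whose level order matters
ratings <- tibble(
  id  = 1:3,
  ses = factor(c("Low", "High", "Middle"), levels = c("Low", "Middle", "High"))
)

csv_path <- tempfile(fileext = ".csv")
rds_path <- tempfile(fileext = ".rds")

write_csv(ratings, csv_path)
write_rds(ratings, rds_path)

from_csv <- read_csv(csv_path, show_col_types = FALSE)
from_rds <- read_rds(rds_path)

class(from_csv$ses)   # the factor comes back as plain text
class(from_rds$ses)   # the factor and its levels are preserved
levels(from_rds$ses)
```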

---

.your-turn[
- [class git repo](https://github.com/DataScience4Psych) > `AE 06 - Nobels and sales + Data import` > open `food-excel.Rmd` and knit. Work on **Exercise 2**.
- Repeat the first three steps from Exercise 1.
- Write out the resulting data frame to `favourite-food.rds` in the `data/` folder.
- Read `favourite-food.rds` back in from the `data/` folder and observe the variable types. Are they as you left them?
]

---

# Ready to move forward?

---

.pull-left[
.your-turn[
- [class git repo](https://github.com/DataScience4Psych) > `AE 06 - Nobels and sales + Data import` > open `sales-excel.Rmd` and knit. 
- Load the `sales.xlsx` file from the `data-raw/` folder, using appropriate 
arguments for the `read_excel()` function such that it looks like the following.
]
]
.medi[
.pull-right[

```
## # A tibble: 9 x 2
##   id      n    
##   <chr>   <chr>
## 1 Brand 1 n    
## 2 1234    8    
## 3 8721    2    
## 4 1822    3    
## 5 Brand 2 n    
## 6 3333    1    
## 7 2156    3    
## 8 3987    6    
## 9 3216    5
```
]
]
---

.pull-left[
.your-turn[
- [class git repo](https://github.com/DataScience4Psych) > `AE 06 - Nobels and sales + Data import` > open `sales-excel.Rmd` and knit. 
- Manipulate the sales data such that it looks like the following.
]
]
.pull-right[

```
## # A tibble: 7 x 3
##   brand      id     n
##   <chr>   <dbl> <dbl>
## 1 Brand 1  1234     8
## 2 Brand 1  8721     2
## 3 Brand 1  1822     3
## 4 Brand 2  3333     1
## 5 Brand 2  2156     3
## 6 Brand 2  3987     6
## 7 Brand 2  3216     5
```
]

---

# Wrapping Up...

---

# Importing many files

---
.medi[

``` r
sales_files <- fs::dir_ls("data/sales")
sales_files
```

```
## data/sales/01-sales.csv data/sales/02-sales.csv 
## data/sales/03-sales.csv
```

``` r
#library(vroom)
sales <- vroom::vroom(sales_files, id = "filename")
sales
```

```
## # A tibble: 19 x 6
##    filename                month     year brand  item     n
##    <chr>                   <chr>    <dbl> <dbl> <dbl> <dbl>
##  1 data/sales/01-sales.csv January   2019     1  1234     3
##  2 data/sales/01-sales.csv January   2019     1  8721     9
##  3 data/sales/01-sales.csv January   2019     1  1822     2
##  4 data/sales/01-sales.csv January   2019     2  3333     1
##  5 data/sales/01-sales.csv January   2019     2  2156     9
##  6 data/sales/01-sales.csv January   2019     2  3987     6
##  7 data/sales/01-sales.csv January   2019     2  3827     6
##  8 data/sales/02-sales.csv February  2019     1  1234     8
##  9 data/sales/02-sales.csv February  2019     1  8721     2
## 10 data/sales/02-sales.csv February  2019     1  1822     3
## # i 9 more rows
```
]
---

## vroom vroom!!

.pull-left[
<img src="img/vroom.png" alt="" width="80%" style="display: block; margin: auto;" />
]
.pull-right[
- **vroom** is most useful for reading in large amounts of data, fast!
- It also has nice bells and whistles, like delimiter guessing and reading many files in at once.
- Learn more at [vroom.r-lib.org](https://vroom.r-lib.org/)
]
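
---

To try the many-files pattern without the course data, the sketch below first creates two small CSV files in a temporary folder and then reads them in one call. The folder name, file contents, and object names are all made up for the example; `delim` is set explicitly instead of relying on delimiter guessing.

```r
library(vroom)

# Create two small hypothetical CSV files in a temporary folder
sales_dir <- file.path(tempdir(), "sales-demo")
dir.create(sales_dir, showWarnings = FALSE)
writeLines(c("month,n", "January,3", "January,9"),
           file.path(sales_dir, "01-sales.csv"))
writeLines(c("month,n", "February,8", "February,2"),
           file.path(sales_dir, "02-sales.csv"))

demo_files <- list.files(sales_dir, pattern = "-sales\\.csv$", full.names = TRUE)

# One call reads every file and records which file each row came from
demo_sales <- vroom(demo_files, delim = ",", id = "filename",
                    show_col_types = FALSE)
demo_sales
```

The `id` column is what makes it possible to trace a row back to its source file after the files have been stacked.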

---

# Other types of data

---

## Other types of data

- **googlesheets4**: Google Sheets
- **haven**: SPSS, Stata, and SAS files
- **DBI**, along with a database-specific backend (e.g., RMySQL, RSQLite, RPostgreSQL):
  - allows you to run SQL queries against a database and return a data frame
- **jsonlite**: JSON
- **xml2**: XML
- **rvest**: web scraping
- **httr**: web APIs
- **sparklyr**: data loaded into Spark
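
---

As one small example from the list above, the sketch below reads JSON with the jsonlite package. The JSON string is made up, and the example assumes the simplest case: an array of records that all share the same fields.

```r
library(jsonlite)

# Hypothetical JSON: an array of records with the same fields
json_text <- '[{"x": 1, "y": "a"}, {"x": 2, "y": "b"}]'

# fromJSON() simplifies an array of records into a data frame by default
records <- fromJSON(json_text)
str(records)
```

More deeply nested JSON usually needs extra reshaping before it becomes rectangular.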

---
# Sources

- Mine Çetinkaya-Rundel's Data Science in a Box ([link](https://datasciencebox.org/))

---

# Summary: Learning Goals Achieved

---

## What We've Learned

You should now be able to...

.pull-left[
### Data Import
- ✅ Read from CSV, Excel, etc.
- ✅ Import multiple files
- ✅ Handle naming issues
]

.pull-right[
### Cleaning and Export
- ✅ Define column types
- ✅ Clean and manage NAs
- ✅ Write to multiple formats
]

---

# Wrapping Up...