31 Data types and recoding

You can follow along with the slides here if they do not appear below.

31.1 Why should you care about data types?

31.2 Data types

31.2.1 Another Hotels Activity

You can find the materials for the Hotels activity here. The compiled version should look something like the following…

31.3 Special Values

You can follow along with the slides here if they do not appear below.

R has a few special values you’ll bump into when your calculations don’t work as expected.

Divide a nonzero number by zero in R and you’ll get Inf (infinity), unlike some other languages that might crash or throw an error. R is pretty chill about it (to an annoying degree):


pi / 0  # Returns Inf
#> [1] Inf

But if you try something truly undefined like dividing zero by zero, you’ll get NaN (Not a Number):

0 / 0  # Returns NaN
#> [1] NaN

These special values appear when dealing with unusual or undefined operations in R:

  • When dividing by zero: pi / 0 gives Inf
  • When performing undefined math: 0 / 0 gives NaN
  • When subtracting infinity from infinity: 1/0 - 1/0 gives NaN (Inf - Inf is undefined)
  • When adding infinity to infinity: 1/0 + 1/0 gives Inf
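
To spot these values in your own results, base R offers is.infinite(), is.nan(), and is.finite(). Here is a minimal sketch; the vector x is made up for illustration:

```r
# Made-up vector holding Inf, NaN, NaN, Inf, and -Inf
x <- c(pi / 0, 0 / 0, 1 / 0 - 1 / 0, 1 / 0 + 1 / 0, -1 / 0)

is.infinite(x)  # TRUE for Inf and -Inf
#> [1]  TRUE FALSE FALSE  TRUE  TRUE
is.nan(x)       # TRUE only for NaN
#> [1] FALSE  TRUE  TRUE FALSE FALSE
is.finite(c(1, Inf, NaN, NA))  # only ordinary numbers count as finite
#> [1]  TRUE FALSE FALSE FALSE
```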

The most common special value you’ll encounter is NA - missing data. NAs are sneaky because they’re “contagious” - almost any calculation involving an NA will give you NA as the result:


mean(c(1, 2, NA, 4))  # Returns NA
#> [1] NA
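
If you want the result computed from the values that are present, many summary functions (mean(), sum(), max(), and friends) accept na.rm = TRUE. A small sketch, reusing the same made-up numbers:

```r
x <- c(1, 2, NA, 4)  # made-up vector with one missing value

is.na(x)               # which elements are missing?
#> [1] FALSE FALSE  TRUE FALSE
sum(is.na(x))          # how many are missing?
#> [1] 1
mean(x, na.rm = TRUE)  # drop NAs before averaging: (1 + 2 + 4) / 3
#> [1] 2.333333
```

Dropping NAs silently changes what you are averaging over, so it is worth counting them first.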

The cool thing about NAs is that they’re logically consistent. When you work with them in logical operations:

  • TRUE | NA is TRUE (because “true or anything” is always true)
  • FALSE | NA is NA (because we need to know what NA is to determine the result)

It’s like NA is saying “I don’t know what I am, but I’ll follow the rules of logic!”
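
You can check these rules yourself at the console. The same reasoning applies to & (and), just mirrored:

```r
TRUE | NA   # TRUE: the answer is TRUE whatever NA turns out to be
#> [1] TRUE
FALSE | NA  # NA: the answer depends on the unknown value
#> [1] NA
FALSE & NA  # FALSE: "false and anything" is always false
#> [1] FALSE
TRUE & NA   # NA: again depends on the unknown value
#> [1] NA
NA == NA    # NA: two unknowns can't be compared, so test with is.na() instead
#> [1] NA
```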

31.4 Data classes

You can follow along with the slides here if they do not appear below.

Think of R’s data classes as Lego sets built from basic building blocks. The basic types (logical, character, numeric) are the individual Lego pieces, but classes are the cool structures you build with them.

Take factors - they look like character strings when you print them, but under the hood they’re actually integers with labels:

x <- factor(c("BS", "MS", "PhD", "MS"))
x  # Looks like text
#> [1] BS  MS  PhD MS 
#> Levels: BS MS PhD
typeof(x)  # But it's stored as integers!
#> [1] "integer"
as.integer(x)  # See the numbers behind the scenes
#> [1] 1 2 3 2
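
This matters most when a factor's labels look like numbers. In the sketch below (the years vector is made up for illustration), as.integer() returns the hidden codes rather than the values you see printed:

```r
years <- factor(c("2019", "2021", "2020"))  # made-up example data

levels(years)                    # the labels, sorted alphabetically
#> [1] "2019" "2020" "2021"
as.integer(years)                # the integer codes, not the years!
#> [1] 1 3 2
as.integer(as.character(years))  # convert the labels first to get the years
#> [1] 2019 2021 2020
```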

Or dates - they look like calendar dates when you print them:

Y2kday <- as.Date("2000-01-01")
Y2kday  # Shows as "2000-01-01"
#> [1] "2000-01-01"

But they’re actually just counting days since January 1, 1970:


as.integer(Y2kday)  # Days since 1970-01-01
#> [1] 10957

Because a date is just a count of days since a fixed starting point, you can do math with it, like finding the date 30 days after Y2kday:

Y2kday + 30  # Adds 30 days
#> [1] "2000-01-31"
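
A few more base R calls make the "number wearing a date costume" idea concrete. This is just a sketch; the second date is picked arbitrarily for illustration:

```r
Y2kday <- as.Date("2000-01-01")

class(Y2kday)    # the class tells R to print it as a date
#> [1] "Date"
typeof(Y2kday)   # the underlying storage is a number
#> [1] "double"
unclass(Y2kday)  # strip the class to see the raw day count
#> [1] 10957

# Subtracting two dates gives the number of days between them
as.Date("2000-03-01") - Y2kday
#> Time difference of 60 days
```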

Even data frames are secretly just lists where all the elements have the same length:

df <- data.frame(x = 1:2, y = 3:4)
typeof(df)  # "list"
#> [1] "list"
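
Since a data frame is a list of columns, the usual list tools work on it. A short sketch using the same df:

```r
df <- data.frame(x = 1:2, y = 3:4)

class(df)    # the class is what makes it print as a table
#> [1] "data.frame"
is.list(df)  # but it still counts as a list
#> [1] TRUE
length(df)   # list length = number of columns, not rows
#> [1] 2
df[["y"]]    # pull out one element (a column) like any list
#> [1] 3 4
```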

Understanding these “secret identities” helps you avoid common pitfalls and work more effectively with your data.

31.5 Working with factors

You can follow along with the slides here if they do not appear below.

Ever made a bar chart where the categories were in a weird order? That’s where factors come to the rescue!

When you make a simple plot like this:

ggplot(cat_lovers, mapping = aes(x = handedness, fill = handedness)) +
  geom_bar() +
  labs(title = "Cat lovers by handedness") +
  theme_minimal() +
  scale_fill_viridis_d(option = "plasma")

ggplot2 treats your text variable as categorical and, by default, arranges the categories in alphabetical order. That’s rarely what you want! The forcats package (part of the tidyverse) gives you superpowers for controlling factor order. Want categories ordered by frequency? Just use fct_infreq():

cat_lovers %>%
  mutate(handedness = fct_infreq(handedness)) %>%
  ggplot(mapping = aes(x = handedness, fill = handedness)) +
  geom_bar() +
  labs(title = "Cat lovers by handedness") +
  theme_minimal() +
  scale_fill_viridis_d(option = "plasma")

Now your most common category appears first - much more informative!

The slides show an example with months, which is a classic problem. If you don’t use factors, your months end up in alphabetical order (April, August, December…) instead of calendar order. Using fct_relevel() with month.name fixes this:

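# FOO and BAR are placeholders: BAR is the column holding the month names,
# and FOO is the name you give the reordered column
hotels %>%
  mutate(FOO = fct_relevel(BAR, month.name))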
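
To see what these functions do without drawing a plot, you can inspect the factor levels directly. The sketch below uses small made-up vectors (not the cat_lovers or hotels data) and assumes the forcats package is installed:

```r
library(forcats)

# Made-up data for illustration only
handedness <- factor(c("right", "left", "right", "ambidextrous", "right", "left"))
by_default <- levels(handedness)              # "ambidextrous" "left" "right"
by_freq    <- levels(fct_infreq(handedness))  # "right" "left" "ambidextrous"

# month.name is a built-in vector of the twelve month names in calendar order
months <- factor(month.name)
alphabetical <- levels(months)                            # starts "April", "August", ...
calendar     <- levels(fct_relevel(months, month.name))   # starts "January", "February", ...
```

The data itself never changes here; only the order of the levels does, and that order is what ggplot2 uses to arrange the bars.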

So next time your plot looks oddly ordered, remember: there’s probably a forcats function that can fix it in one line!

31.5.1 (An) Another Hotels Activity

You can find the materials for the Hotels activity here. The compiled version should look something like the following…

31.6 Working with Dates

You can follow along with the slides here if they do not appear below.