31 Data types and recoding
You can follow along with the slides here if they do not appear below.
That’s when you realize you can educate those you work with to give you better data, better questions to answer, etc., so that you can give them better insights. You have more power than you think!
— Isabella R. Ghement (@IsabellaGhement) December 22, 2021
Wonderful visual illustration of data types in #rstats! 🤩
— Indrajeet Patil (इंद्रजीत पाटील) (@patilindrajeets) December 29, 2021
Image credits: https://t.co/ESrtYBwtNm
For further reading, see: https://t.co/nrR59hLniV #DataScience pic.twitter.com/iO0key2mnh
31.2 Data types
31.2.1 Another Hotels Activity
You can find the materials for the Hotels activity here. The compiled version should look something like the following…
31.3 Special Values
You can follow along with the slides here if they do not appear below.
R has a few special values you’ll bump into when your calculations don’t work as expected.
Try dividing by zero in R (for example, 1 / 0) and you’ll get Inf (infinity) - unlike other languages that might crash or throw an error. R is pretty chill about it (to an annoying degree).
But if you try something truly undefined like dividing zero by zero (0 / 0), you’ll get NaN (Not a Number).
These special values appear when dealing with unusual or undefined operations in R:
- When dividing by zero: pi / 0 gives Inf
- When performing undefined math: 0 / 0 gives NaN
- With contradictory operations: 1/0 - 1/0 gives NaN
- With consistent operations: 1/0 + 1/0 gives Inf
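As a quick check, the four cases above can be run at the console. This is only a minimal sketch: the variable names are made up for illustration, and the is.infinite(), is.nan(), and is.finite() calls at the end are base R helpers that the text above does not mention.

```r
# Division by zero does not raise an error in R
pos_inf       <- pi / 0         # Inf
undefined     <- 0 / 0          # NaN
contradictory <- 1/0 - 1/0      # Inf - Inf is NaN
consistent    <- 1/0 + 1/0      # Inf + Inf is Inf

pos_inf
#> [1] Inf
undefined
#> [1] NaN
contradictory
#> [1] NaN
consistent
#> [1] Inf

# Base R predicates for spotting these values in your data
is.infinite(pos_inf)
#> [1] TRUE
is.nan(undefined)
#> [1] TRUE
is.finite(c(1, pos_inf, undefined))
#> [1]  TRUE FALSE FALSE
```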
The most common special value you’ll encounter is NA - missing data. NAs are sneaky because they’re “contagious” - almost any calculation involving an NA will give you NA as the result:
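Here is a small sketch of that contagion. The scores vector is made-up data, and the na.rm argument and is.na() shown at the end are standard base R tools that go slightly beyond what the text describes.

```r
scores <- c(10, NA, 30)   # hypothetical data with one missing value

1 + NA
#> [1] NA
scores * 2
#> [1] 20 NA 60
sum(scores)
#> [1] NA
mean(scores)
#> [1] NA

# Many summary functions can drop NAs if you ask them to
mean(scores, na.rm = TRUE)
#> [1] 20

# Even NA == NA is NA, so use is.na() to find missing values
NA == NA
#> [1] NA
is.na(scores)
#> [1] FALSE  TRUE FALSE
```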
The cool thing about NAs is that they’re logically consistent. When you work with them in logical operations:
- TRUE | NA is TRUE (because “true or anything” is always true)
- FALSE | NA is NA (because we need to know what NA is to determine the result)
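The same reasoning carries over to "and", which the list above does not cover. A minimal sketch (the variable names are just for illustration):

```r
or_true   <- TRUE | NA    # TRUE: the answer is TRUE whatever NA turns out to be
or_false  <- FALSE | NA   # NA: the answer depends on the unknown value
and_false <- FALSE & NA   # FALSE: "false and anything" is always false
and_true  <- TRUE & NA    # NA: the answer depends on the unknown value

or_true
#> [1] TRUE
or_false
#> [1] NA
and_false
#> [1] FALSE
and_true
#> [1] NA
```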
It’s like NA is saying “I don’t know what I am, but I’ll follow the rules of logic!”
31.4 Data classes
You can follow along with the slides here if they do not appear below.
Think of R’s data classes as Lego sets built from basic building blocks. The basic types (logical, character, numeric) are the individual Lego pieces, but classes are the cool structures you build with them.
Take factors - they look like character strings when you print them, but under the hood they’re actually integers with labels:
x <- factor(c("BS", "MS", "PhD", "MS"))
x # Looks like text
#> [1] BS MS PhD MS
#> Levels: BS MS PhD
typeof(x) # But it's stored as integers!
#> [1] "integer"
as.integer(x) # See the numbers behind the scenes
#> [1] 1 2 3 2
Or dates - they look like calendar dates when you print them, but they’re actually just counting days since January 1, 1970. Because a date is just a count of days since a fixed date, you can do math with dates, like finding out what date is 30 days from now:
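A minimal sketch of a date's "secret identity". It uses an arbitrary fixed date rather than today's date so the output is reproducible; for "30 days from now" you would use Sys.Date() + 30 instead.

```r
d <- as.Date("2020-01-01")   # arbitrary example date

d                            # prints like a calendar date
#> [1] "2020-01-01"
class(d)
#> [1] "Date"
typeof(d)                    # but it is stored as a number
#> [1] "double"
as.numeric(d)                # days since 1970-01-01
#> [1] 18262
as.numeric(as.Date("1970-01-01"))
#> [1] 0

later <- d + 30              # date arithmetic is just adding days
later
#> [1] "2020-01-31"
```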
Even data frames are secretly just lists where all the elements have the same length:
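You can check this with a small made-up data frame (the column names and values here are hypothetical):

```r
df <- data.frame(
  x = 1:3,
  y = c("a", "b", "c"),
  stringsAsFactors = FALSE   # keep y as character on older R versions too
)

class(df)      # the class says "data.frame"
#> [1] "data.frame"
typeof(df)     # but the underlying type is a list
#> [1] "list"
is.list(df)
#> [1] TRUE

col_lengths <- lengths(df)   # every element (column) has the same length
col_lengths
#> x y 
#> 3 3 

df[["y"]]      # list-style extraction pulls out one column
#> [1] "a" "b" "c"
```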
Understanding these “secret identities” helps you avoid common pitfalls and work more effectively with your data.
31.5 Working with factors
You can follow along with the slides here if they do not appear below.
Ever made a bar chart where the categories were in a weird order? That’s where factors come to the rescue!
When you make a simple plot like this:
ggplot(cat_lovers, mapping = aes(x = handedness, fill = handedness)) +
  geom_bar() +
  labs(title = "Cat lovers by handedness") +
  theme_minimal() +
  scale_fill_viridis_d(option = "plasma")
R automatically converts your text variable to a factor, but it uses alphabetical order by default. That’s rarely what you want!
The forcats package (part of the tidyverse) gives you superpowers for controlling factor order. Want categories ordered by frequency? Just use fct_infreq():
cat_lovers %>%
  mutate(handedness = fct_infreq(handedness)) %>%
  ggplot(mapping = aes(x = handedness, fill = handedness)) +
  geom_bar() +
  labs(title = "Cat lovers by handedness") +
  theme_minimal() +
  scale_fill_viridis_d(option = "plasma")
Now your most common category appears first - much more informative!
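To see what changes under the hood, you can compare the factor levels directly, without plotting. This is a sketch on a made-up handedness vector (not the real cat_lovers data). It shows a base R equivalent first, then calls fct_infreq() only if forcats happens to be installed.

```r
# Hypothetical values standing in for cat_lovers$handedness
handedness <- c("right", "left", "right", "ambidextrous", "right", "left")

# Default factor levels are alphabetical
default_levels <- levels(factor(handedness))
default_levels
#> [1] "ambidextrous" "left"         "right"

# Base R equivalent: order the levels by decreasing count
counts  <- sort(table(handedness), decreasing = TRUE)
by_freq <- factor(handedness, levels = names(counts))
levels(by_freq)
#> [1] "right"        "left"         "ambidextrous"

# The forcats one-liner should give the same level order
if (requireNamespace("forcats", quietly = TRUE)) {
  print(levels(forcats::fct_infreq(handedness)))
}
```

The bars in a plot follow the level order, which is why reordering the levels reorders the chart.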
The slides show an example with months, which is a classic problem. If you don’t use factors, your months end up in alphabetical order (April, August, December…) instead of calendar order. Using fct_relevel() with month.name fixes this:
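The slide code itself is not reproduced here, so the following is only a sketch on a made-up vector of month names. It shows the base R way of setting calendar order with month.name, and then the fct_relevel() version, guarded so it runs only if forcats is installed. Restricting to the months actually present is an assumption made here to avoid warnings about levels that do not occur in the data.

```r
# Hypothetical data: a handful of month names
months <- c("March", "January", "December", "April", "January")

# Default: levels come out alphabetical
alphabetical <- factor(months)
levels(alphabetical)
#> [1] "April"    "December" "January"  "March"

# Base R: supply the calendar order explicitly via the built-in month.name
calendar <- factor(months, levels = month.name)
calendar_levels <- levels(droplevels(calendar))   # drop months not in the data
calendar_levels
#> [1] "January"  "March"    "April"    "December"

# forcats: move the months to the front in calendar order
if (requireNamespace("forcats", quietly = TRUE)) {
  present <- intersect(month.name, levels(alphabetical))
  print(levels(forcats::fct_relevel(alphabetical, present)))
}
```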
So next time your plot looks oddly ordered, remember: there’s probably a forcats function that can fix it in one line!
31.5.1 (An) Another Hotels Activity
You can find the materials for the Hotels activity here. The compiled version should look something like the following…
31.7 Working with Dates
You can follow along with the slides here if they do not appear below.