57 Write your own R functions
Writing your own functions in R is a fundamental skill that enhances your ability to perform repetitive tasks efficiently, customize analyses, and improve the readability of your code. A function in R is a set of instructions designed to perform a specific task, which can be as simple or complex as needed. By now, you’ve used plenty of functions in R. Hopefully, you’ve absorbed some of their logic, and have seen first-hand how they simplify complex tasks. It’s time to take that experience and start crafting your own. Doing so isn’t just about following a set of instructions; it’s about embracing the modular, building-block nature of R. This approach doesn’t just make your code smarter; it makes it significantly more readable and customizable. Let’s dive in and transform how you interact with R, turning you from a useR into a creatoR.
57.1 What and why?
This section aims to demystify the process experienced R useRs follow to write functions. I want to shed light on the rationale behind each step. Merely looking at the finished product, e.g., source code for R packages, can be extremely deceiving. Reality is generally much uglier … but more interesting!
Why are we covering this now, smack in the middle of data aggregation? Powerful machines like dplyr, purrr, and the built-in “apply” family of functions, are ready and waiting to apply your purpose-built functions to various bits of your data. If you can express your analytical wishes in a function, these tools will give you great power.
57.2 Load the nycflights13 data
We’ll begin by loading the nycflights13
dataset, which contains information about all flights that departed from New York City in 2013. This dataset provides a rich source of real-world data for practicing data manipulation and analysis
57.3 Example Analysis: Average Delay by Airline
Consider we want to compute the average delay experienced by each airline. This is a great example of a typical input for a function. You can imagine wanting to get this statistic to evaluate airline performance. You might want to do this for different years, months, or days of the week. You might want to do this for different airports, or for different combinations of airports. You might want to do this for different types of delays. You might want to do this for different subsets of the data, e.g., only for flights that were delayed. You might want to do this for different airlines. You might want to do this for different combinations of the above.
57.4 Get something that works
First, develop some working code for interactive use, using a representative input. I’ll use flights operated by a specific airline as an example.
R functions that will be useful: mean()
and filter()
from the dplyr
package.
## Investigate the structure of the flights dataset
str(flights)
#> tibble [336,776 × 19] (S3: tbl_df/tbl/data.frame)
#> $ year : int [1:336776] 2013 2013 2013 2013 2013 2013 2013 2013 2013..
#> $ month : int [1:336776] 1 1 1 1 1 1 1 1 1 1 ...
#> $ day : int [1:336776] 1 1 1 1 1 1 1 1 1 1 ...
#> $ dep_time : int [1:336776] 517 533 542 544 554 554 555 557 557 558 ...
#> $ sched_dep_time: int [1:336776] 515 529 540 545 600 558 600 600 600 600 ...
#> $ dep_delay : num [1:336776] 2 4 2 -1 -6 -4 -5 -3 -3 -2 ...
#> $ arr_time : int [1:336776] 830 850 923 1004 812 740 913 709 838 753 ...
#> $ sched_arr_time: int [1:336776] 819 830 850 1022 837 728 854 723 846 745 ...
#> $ arr_delay : num [1:336776] 11 20 33 -18 -25 12 19 -14 -8 8 ...
#> $ carrier : chr [1:336776] "UA" "UA" "AA" "B6" ...
#> $ flight : int [1:336776] 1545 1714 1141 725 461 1696 507 5708 79 301 ..
#> $ tailnum : chr [1:336776] "N14228" "N24211" "N619AA" "N804JB" ...
#> $ origin : chr [1:336776] "EWR" "LGA" "JFK" "JFK" ...
#> $ dest : chr [1:336776] "IAH" "IAH" "MIA" "BQN" ...
#> $ air_time : num [1:336776] 227 227 160 183 116 150 158 53 140 138 ...
#> $ distance : num [1:336776] 1400 1416 1089 1576 762 ...
#> $ hour : num [1:336776] 5 5 5 5 6 5 6 6 6 6 ...
#> $ minute : num [1:336776] 15 29 40 45 0 58 0 0 0 0 ...
#> $ time_hour : POSIXct[1:336776], format: "2013-01-01 05:00:00" "2013-01-"..
## get to know the functions mentioned above
mean(flights$dep_delay)
#> [1] NA
filter(.data = flights, carrier == "AA")
#> # A tibble: 32,729 × 19
#> year month day dep_time sched_dep_time dep_delay arr_time sched_arr_time
#> <int> <int> <int> <int> <int> <dbl> <int> <int>
#> 1 2013 1 1 542 540 2 923 850
#> 2 2013 1 1 558 600 -2 753 745
#> 3 2013 1 1 559 600 -1 941 910
#> 4 2013 1 1 606 610 -4 858 910
#> 5 2013 1 1 623 610 13 920 915
#> 6 2013 1 1 628 630 -2 1137 1140
#> 7 2013 1 1 629 630 -1 824 810
#> 8 2013 1 1 635 635 0 1028 940
#> 9 2013 1 1 656 700 -4 854 850
#> 10 2013 1 1 656 659 -3 949 959
#> # ℹ 32,719 more rows
#> # ℹ 11 more variables: arr_delay <dbl>, carrier <chr>, flight <int>,
#> # tailnum <chr>, origin <chr>, dest <chr>, air_time <dbl>, distance <dbl>,
#> # hour <dbl>, minute <dbl>, time_hour <dttm>
Carrier | Average Departure Delay |
---|---|
9E | 16.44 |
AA | 8.57 |
AS | 5.83 |
B6 | 12.97 |
DL | 9.22 |
EV | 19.84 |
F9 | 20.20 |
FL | 18.61 |
HA | 4.90 |
MQ | 10.45 |
OO | 12.59 |
UA | 12.02 |
US | 3.75 |
VX | 12.76 |
WN | 17.66 |
YV | 18.90 |
Now lets go through some natural solutions to get the average delay for the airline “AA”
57.4.1 Using dplyr
for Data Filtering and Summary
This solution employs the dplyr package to filter flights by the airline code and then calculate the average departure delay.
57.4.2 Using Base R with Subsetting
Here, we use base R to achieve the same task without the dplyr
package, directly subsetting the dataframe.
57.4.3 Using with()
Function
The with() function provides a convenient way to perform operations within a dataframe subset, making the code more readable.
57.4.4 Using aggregate()
Function
The aggregate()
function in R can be used to compute summary statistics for subgroups of data, which in this case are flights operated by “AA”.
57.4.5 Using tapply() Function
The tapply() function applies a function to subsets of a vector, which we can use to calculate the average delay for “AA” flights.
Now, internalize this “answer” because our informal testing relies on you noticing departures from this number when we generalize the function.
57.5 Turn the Working Interactive Code into a Function
When crafting your own functions in R, it’s beneficial to start with a straightforward, minimal version that accomplishes the basic task at hand. This approach is akin to building a ‘skateboard’—a simple, yet functional product. Let’s apply this philosophy to our task of calculating the average delay for a specific airline in the nycflights13 dataset.
57.5.1 Initial Simple Function: The ‘Skateboard’
average_delay_by_airline <- function(airline_code) {
flights %>%
filter(carrier == airline_code) %>%
summarize(average_delay = mean(dep_delay, na.rm = TRUE))
}
Check that you’re getting the same answer as you did with your interactive code.
# Test the function with American Airlines (AA)
average_delay_by_airline("AA")
#> # A tibble: 1 × 1
#> average_delay
#> <dbl>
#> 1 8.59
This function represents our ‘skateboard’. It’s basic, and we have added no new functionality. Yet, it gets the job done by providing the average delay for a given airline code. It doesn’t include error handling or support for additional details like distinguishing between departure and arrival delays, but it serves as a solid starting point. This is a minimal viable product (MVP) that we can build upon to create a more complex function (the ‘car’).
![This image [widely attributed to the Spotify development team][min-viable-product] conveys an important point. From [Your ultimate guide to Minimum Viable Product (+great examples)](https://web.archive.org/web/20220318100638/https://blog.fastmonkeys.com/2014/06/18/minimum-viable-product-your-ultimate-guide-to-mvp-great-examples/)](img/spotify-howtobuildmvp.jpg)
Figure 57.1: This image widely attributed to the Spotify development team conveys an important point. From Your ultimate guide to Minimum Viable Product (+great examples)
This idea is related to the valuable Telescope Rule:
It is faster to make a four-inch mirror then a six-inch mirror than to make a six-inch mirror.
57.6 Test the Function
57.6.1 Test on new inputs
Pick some new artificial inputs where you know (at least approximately) what your function should return.
I know that UA had about 12 minutes of a delay.
57.6.2 Test on real data but different real data
Back to the real world now. So typically, the next step is to check to see if your function can handle different data. This is a good way to check if your function is robust and generalizable. However, ours doesn’t. It’s hard-wired to the flights
dataset. We’ll fix that in the next section.
average_delay_by_airline <- function(data = flights, airline_code) {
data %>%
filter(carrier == airline_code) %>%
summarize(average_delay = mean(dep_delay, na.rm = TRUE))
}
I’ve now added another variable to the function, data
, which defaults to flights
. This is a good habit to get into. It makes your function more flexible and more generalizable. It also makes it easier to test your function on different datasets. Now, we can test our function on a modified flights
dataset, that I have named the flights2
dataset. The only thing I have done to this dataset is multiplied all of the delays by 2.