53 Write your own R functions

Writing your own functions in R is a fundamental skill that enhances your ability to perform repetitive tasks efficiently, customize analyses, and improve the readability of your code. A function in R is a set of instructions designed to perform a specific task, which can be as simple or complex as needed. By now, you’ve used plenty of functions in R. Hopefully, you’ve absorbed some of their logic, and have seen first-hand how they simplify complex tasks. It’s time to take that experience and start crafting your own. Doing so isn’t just about following a set of instructions; it’s about embracing the modular, building-block nature of R. This approach doesn’t just make your code smarter; it makes it significantly more readable and customizable. Let’s dive in and transform how you interact with R, turning you from a useR into a creatoR.

53.1 What and why?

This section aims to demystify the process experienced R useRs follow to write functions. I want to shed light on the rationale behind each step. Merely looking at the finished product, e.g., source code for R packages, can be extremely deceiving. Reality is generally much uglier … but more interesting!

Why are we covering this now, smack in the middle of data aggregation? Powerful machines like dplyr, purrr, and the built-in “apply” family of functions, are ready and waiting to apply your purpose-built functions to various bits of your data. If you can express your analytical wishes in a function, these tools will give you great power.

53.2 Load the nycflights13 data

We’ll begin by loading the nycflights13 dataset, which contains information about all flights that departed from New York City in 2013. This dataset provides a rich source of real-world data for practicing data manipulation and analysis

library(nycflights13)
library(dplyr)
data("flights")
#str(flights)

53.3 Example Analysis: Average Delay by Airline

Consider we want to compute the average delay experienced by each airline. This is a great example of a typical input for a function. You can imagine wanting to get this statistic to evaluate airline performance. You might want to do this for different years, months, or days of the week. You might want to do this for different airports, or for different combinations of airports. You might want to do this for different types of delays. You might want to do this for different subsets of the data, e.g., only for flights that were delayed. You might want to do this for different airlines. You might want to do this for different combinations of the above.

53.4 Get something that works

First, develop some working code for interactive use, using a representative input. I’ll use flights operated by a specific airline as an example.

R functions that will be useful: mean() and filter() from the dplyr package.


## Investigate the structure of the flights dataset
str(flights)
#> tibble [336,776 × 19] (S3: tbl_df/tbl/data.frame)
#>  $ year          : int [1:336776] 2013 2013 2013 2013 2013 2013 2013 2013 2013..
#>  $ month         : int [1:336776] 1 1 1 1 1 1 1 1 1 1 ...
#>  $ day           : int [1:336776] 1 1 1 1 1 1 1 1 1 1 ...
#>  $ dep_time      : int [1:336776] 517 533 542 544 554 554 555 557 557 558 ...
#>  $ sched_dep_time: int [1:336776] 515 529 540 545 600 558 600 600 600 600 ...
#>  $ dep_delay     : num [1:336776] 2 4 2 -1 -6 -4 -5 -3 -3 -2 ...
#>  $ arr_time      : int [1:336776] 830 850 923 1004 812 740 913 709 838 753 ...
#>  $ sched_arr_time: int [1:336776] 819 830 850 1022 837 728 854 723 846 745 ...
#>  $ arr_delay     : num [1:336776] 11 20 33 -18 -25 12 19 -14 -8 8 ...
#>  $ carrier       : chr [1:336776] "UA" "UA" "AA" "B6" ...
#>  $ flight        : int [1:336776] 1545 1714 1141 725 461 1696 507 5708 79 301 ..
#>  $ tailnum       : chr [1:336776] "N14228" "N24211" "N619AA" "N804JB" ...
#>  $ origin        : chr [1:336776] "EWR" "LGA" "JFK" "JFK" ...
#>  $ dest          : chr [1:336776] "IAH" "IAH" "MIA" "BQN" ...
#>  $ air_time      : num [1:336776] 227 227 160 183 116 150 158 53 140 138 ...
#>  $ distance      : num [1:336776] 1400 1416 1089 1576 762 ...
#>  $ hour          : num [1:336776] 5 5 5 5 6 5 6 6 6 6 ...
#>  $ minute        : num [1:336776] 15 29 40 45 0 58 0 0 0 0 ...
#>  $ time_hour     : POSIXct[1:336776], format: "2013-01-01 05:00:00" "2013-01-"..
## get to know the functions mentioned above

mean(flights$dep_delay)
#> [1] NA

filter(.data = flights, carrier == "AA")
#> # A tibble: 32,729 × 19
#>     year month   day dep_time sched_dep_time dep_delay arr_time sched_arr_time
#>    <int> <int> <int>    <int>          <int>     <dbl>    <int>          <int>
#>  1  2013     1     1      542            540         2      923            850
#>  2  2013     1     1      558            600        -2      753            745
#>  3  2013     1     1      559            600        -1      941            910
#>  4  2013     1     1      606            610        -4      858            910
#>  5  2013     1     1      623            610        13      920            915
#>  6  2013     1     1      628            630        -2     1137           1140
#>  7  2013     1     1      629            630        -1      824            810
#>  8  2013     1     1      635            635         0     1028            940
#>  9  2013     1     1      656            700        -4      854            850
#> 10  2013     1     1      656            659        -3      949            959
#> # ℹ 32,719 more rows
#> # ℹ 11 more variables: arr_delay <dbl>, carrier <chr>, flight <int>,
#> #   tailnum <chr>, origin <chr>, dest <chr>, air_time <dbl>, distance <dbl>,
#> #   hour <dbl>, minute <dbl>, time_hour <dttm>
Carrier Average Departure Delay
9E 16.44
AA 8.57
AS 5.83
B6 12.97
DL 9.22
EV 19.84
F9 20.20
FL 18.61
HA 4.90
MQ 10.45
OO 12.59
UA 12.02
US 3.75
VX 12.76
WN 17.66
YV 18.90

Now lets go through some natural solutions to get the average delay for the airline “AA”

53.4.1 Using dplyr for Data Filtering and Summary

This solution employs the dplyr package to filter flights by the airline code and then calculate the average departure delay.

flights %>%
  filter(carrier == "AA") %>%
  summarise(average_delay = mean(dep_delay, na.rm = TRUE))
#> # A tibble: 1 × 1
#>   average_delay
#>           <dbl>
#> 1          8.59

53.4.2 Using Base R with Subsetting

Here, we use base R to achieve the same task without the dplyr package, directly subsetting the dataframe.


mean(flights$dep_delay[flights$carrier=="AA"], na.rm = TRUE)
#> [1] 8.59

53.4.3 Using with() Function

The with() function provides a convenient way to perform operations within a dataframe subset, making the code more readable.

with(flights[flights$carrier == "AA", ], mean(dep_delay, na.rm = TRUE))
#> [1] 8.59

53.4.4 Using aggregate() Function

The aggregate() function in R can be used to compute summary statistics for subgroups of data, which in this case are flights operated by “AA”.

aggregate(dep_delay ~ carrier, data = flights[flights$carrier == "AA", ], FUN = mean, na.rm = TRUE)$dep_delay
#> [1] 8.59

53.4.5 Using tapply() Function

The tapply() function applies a function to subsets of a vector, which we can use to calculate the average delay for “AA” flights.

tapply(flights$dep_delay, flights$carrier, mean, na.rm = TRUE)["AA"]
#>   AA 
#> 8.59

Now, internalize this “answer” because our informal testing relies on you noticing departures from this number when we generalize the function.

53.5 Turn the Working Interactive Code into a Function

When crafting your own functions in R, it’s beneficial to start with a straightforward, minimal version that accomplishes the basic task at hand. This approach is akin to building a ‘skateboard’—a simple, yet functional product. Let’s apply this philosophy to our task of calculating the average delay for a specific airline in the nycflights13 dataset.

53.5.1 Initial Simple Function: The ‘Skateboard’

average_delay_by_airline <- function(airline_code) {
  flights %>%
    filter(carrier == airline_code) %>%
    summarise(average_delay = mean(dep_delay, na.rm = TRUE))
}

Check that you’re getting the same answer as you did with your interactive code.

# Test the function with American Airlines (AA)
average_delay_by_airline("AA")
#> # A tibble: 1 × 1
#>   average_delay
#>           <dbl>
#> 1          8.59

This function represents our ‘skateboard’. It’s basic, and we have added no new functionality. Yet, it gets the job done by providing the average delay for a given airline code. It doesn’t include error handling or support for additional details like distinguishing between departure and arrival delays, but it serves as a solid starting point. This is a minimal viable product (MVP) that we can build upon to create a more complex function (the ‘car’).

This idea is related to the valuable Telescope Rule:

It is faster to make a four-inch mirror then a six-inch mirror than to make a six-inch mirror.

53.6 Test the Function

53.6.1 Test on new inputs

Pick some new artificial inputs where you know (at least approximately) what your function should return.

average_delay_by_airline("UA")
#> # A tibble: 1 × 1
#>   average_delay
#>           <dbl>
#> 1          12.1

I know that UA had about 12 minutes of a delay.

53.6.2 Test on real data but different real data

Back to the real world now. So typically, the next step is to check to see if your function can handle different data. This is a good way to check if your function is robust and generalizable. However, ours doesn’t. It’s hard-wired to the flights dataset. We’ll fix that in the next section.

average_delay_by_airline <- function(data = flights, airline_code) {
  data %>%
    filter(carrier == airline_code) %>%
    summarise(average_delay = mean(dep_delay, na.rm = TRUE))
}

I’ve now added another variable to the function, data, which defaults to flights. This is a good habit to get into. It makes your function more flexible and more generalizable. It also makes it easier to test your function on different datasets. Now, we can test our function on a modified flights dataset, that I have named the flights2 dataset. The only thing I have done to this dataset is multiplied all of the delays by 2.

flights2 <- flights

flights2$dep_delay <- flights2$dep_delay * 2

average_delay_by_airline(flights2, "AA")
#> # A tibble: 1 × 1
#>   average_delay
#>           <dbl>
#> 1          17.2