53 Write your own R functions

Writing your own functions in R is a fundamental skill that enhances your ability to perform repetitive tasks efficiently, customize analyses, and improve the readability of your code. A function in R is a set of instructions designed to perform a specific task, which can be as simple or complex as needed. By now, you’ve used plenty of functions in R. Hopefully, you’ve absorbed some of their logic, and have seen first-hand how they simplify complex tasks. It’s time to take that experience and start crafting your own. Doing so isn’t just about following a set of instructions; it’s about embracing the modular, building-block nature of R. This approach doesn’t just make your code smarter; it makes it significantly more readable and customizable. Let’s dive in and transform how you interact with R, turning you from a useR into a creatoR.

53.1 What and why?

This section aims to demystify the process experienced R useRs follow to write functions. I want to shed light on the rationale behind each step. Merely looking at the finished product, e.g., source code for R packages, can be extremely deceiving. Reality is generally much uglier … but more interesting!

Why are we covering this now, smack in the middle of data aggregation? Powerful machines like dplyr, purrr, and the built-in “apply” family of functions are ready and waiting to apply your purpose-built functions to various bits of your data. If you can express your analytical wishes in a function, these tools will give you great power.
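As a small illustration of that idea, here is a minimal sketch in base R. The helper `share_above()` and the toy `delays` list are made up for this example, and `sapply()` simply stands in for any of the tools named above.

```r
# Hypothetical helper: share of values above a threshold, ignoring NAs
share_above <- function(x, threshold = 15) {
  mean(x > threshold, na.rm = TRUE)
}

# Toy delays (in minutes) for two made-up groups
delays <- list(
  a = c(5, 20, 30, NA),
  b = c(0, 10, 40, 50)
)

# sapply() calls share_above() once per list element
result <- sapply(delays, share_above)
result
```

Once the logic lives in a function, applying it to every group is a one-liner, and changing the threshold means editing one place.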

53.2 Load the nycflights13 data

We’ll begin by loading the nycflights13 dataset, which contains information about all flights that departed from New York City in 2013. This dataset provides a rich source of real-world data for practicing data manipulation and analysis.

# install.packages("nycflights13")  # run once if the package is not installed
library(nycflights13)
library(dplyr)
data("flights")
#str(flights)

53.3 Example Analysis: Average Delay by Airline

Suppose we want to compute the average delay experienced by each airline. This is a great example of a task worth wrapping in a function. You can imagine wanting this statistic to evaluate airline performance, and wanting it again for different years, months, or days of the week; for different airports or combinations of airports; for different types of delays; for different subsets of the data, e.g., only flights that were delayed; for different airlines; or for any combination of the above.

53.4 Get something that works

First, develop some working code for interactive use, using a representative input. I’ll use flights operated by a specific airline as an example.

R functions that will be useful: mean() from base R and filter() from the dplyr package.


## Investigate the structure of the flights dataset
str(flights)

## get to know the functions mentioned above
## note: this returns NA because dep_delay contains missing values;
## the solutions below pass na.rm = TRUE to handle them
mean(flights$dep_delay)

filter(.data = flights, carrier == "AA")

Now let’s go through some natural solutions to get the average delay for the airline “AA”.

53.4.1 Using `dplyr` for Data Filtering and Summary

This solution employs the dplyr package to filter flights by the airline code and then calculate the average departure delay.

flights %>%
  filter(carrier == "AA") %>%
  summarise(average_delay = mean(dep_delay, na.rm = TRUE))

53.4.2 Using Base R with Subsetting

Here, we use base R to achieve the same task without the dplyr package, directly subsetting the dataframe.


mean(flights$dep_delay[flights$carrier=="AA"], na.rm = TRUE)

53.4.3 Using `with()` Function

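The with() function evaluates an expression using the columns of a dataframe as variables, so we can refer to dep_delay directly instead of writing flights$dep_delay. Here we apply it to the subset of rows for “AA”, which makes the code more readable.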

with(flights[flights$carrier == "AA", ], mean(dep_delay, na.rm = TRUE))

53.4.4 Using `aggregate()` Function

The aggregate() function in R can be used to compute summary statistics for subgroups of data, which in this case are flights operated by “AA”.

aggregate(dep_delay ~ carrier, data = flights[flights$carrier == "AA", ], FUN = mean, na.rm = TRUE)$dep_delay

53.4.5 Using `tapply()` Function

The tapply() function applies a function to subsets of a vector, which we can use to calculate the average delay for “AA” flights.

tapply(flights$dep_delay, flights$carrier, mean, na.rm = TRUE)["AA"]

Now, internalize this “answer” because our informal testing relies on you noticing departures from this number when we generalize the function.
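One way to convince yourself that the base R approaches agree is to run them on a tiny data frame where the answer is easy to compute by hand. The `toy_flights` data below is hypothetical (it is not part of nycflights13), and the dplyr version is left out so that the sketch runs with base R alone.

```r
# Hypothetical toy data: the "AA" delays are 5, NA, and 15,
# so the average with NAs removed should be 10
toy_flights <- data.frame(
  carrier   = c("AA", "AA", "AA", "UA", "UA"),
  dep_delay = c(5, NA, 15, 2, 4)
)
aa_rows <- toy_flights[toy_flights$carrier == "AA", ]

# Direct subsetting
by_subset <- mean(toy_flights$dep_delay[toy_flights$carrier == "AA"], na.rm = TRUE)

# with(): evaluate the expression inside the subsetted data frame
by_with <- with(aa_rows, mean(dep_delay, na.rm = TRUE))

# aggregate(): the formula interface drops rows with NA by default
by_aggregate <- aggregate(dep_delay ~ carrier, data = aa_rows, FUN = mean)$dep_delay

# tapply(): one mean per carrier, then pick "AA" and strip the name
by_tapply <- as.numeric(
  tapply(toy_flights$dep_delay, toy_flights$carrier, mean, na.rm = TRUE)["AA"]
)

# All four approaches should give the same number
stopifnot(
  isTRUE(all.equal(by_subset, by_with)),
  isTRUE(all.equal(by_subset, by_aggregate)),
  isTRUE(all.equal(by_subset, by_tapply))
)
by_subset
```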

53.5 Turn the Working Interactive Code into a Function

When crafting your own functions in R, it’s beneficial to start with a straightforward, minimal version that accomplishes the basic task at hand. This approach is akin to building a ‘skateboard’—a simple, yet functional product. Let’s apply this philosophy to our task of calculating the average delay for a specific airline in the nycflights13 dataset.

53.5.1 Initial Simple Function: The ‘Skateboard’

average_delay_by_airline <- function(airline_code) {
  flights %>%
    filter(carrier == airline_code) %>%
    summarise(average_delay = mean(dep_delay, na.rm = TRUE))
}

Check that you’re getting the same answer as you did with your interactive code.

# Test the function with American Airlines (AA)
average_delay_by_airline("AA")

This function represents our ‘skateboard’. It’s basic, and we have added no new functionality, yet it gets the job done by providing the average delay for a given airline code. It doesn’t include error handling or support for additional details like distinguishing between departure and arrival delays, but it serves as a solid starting point. This is a minimum viable product (MVP) that we can build upon to create a more complex function (the ‘car’).

Figure 53.1: This image, widely attributed to the Spotify development team, conveys an important point. From [Your ultimate guide to Minimum Viable Product (+great examples)](https://web.archive.org/web/20220318100638/https://blog.fastmonkeys.com/2014/06/18/minimum-viable-product-your-ultimate-guide-to-mvp-great-examples/)

This idea is related to the valuable Telescope Rule:

It is faster to make a four-inch mirror then a six-inch mirror than to make a six-inch mirror.

53.6 Test the Function

53.6.1 Test on new inputs

Pick some new artificial inputs where you know (at least approximately) what your function should return.

average_delay_by_airline("UA")

I know that UA had an average departure delay of about 12 minutes.

53.6.2 Test on real data but different real data

Back to the real world now. Typically, the next step is to check whether your function can handle different data. This is a good way to check that your function is robust and generalizable. However, ours can’t do that yet: it’s hard-wired to the flights dataset. We’ll fix that now.

average_delay_by_airline <- function(data = flights, airline_code) {
  data %>%
    filter(carrier == airline_code) %>%
    summarise(average_delay = mean(dep_delay, na.rm = TRUE))
}

I’ve now added another argument to the function, data, which defaults to flights. This is a good habit to get into. It makes your function more flexible and more generalizable, and it makes the function easier to test on different datasets. Note that because data now comes first, a call that relies on the default must name the other argument, e.g., average_delay_by_airline(airline_code = "AA"). Now we can test our function on a modified copy of the flights dataset, which I have named flights2. The only change I have made is to multiply all of the departure delays by 2.

flights2 <- flights

flights2$dep_delay <- flights2$dep_delay * 2

average_delay_by_airline(flights2, "AA")
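The same doubling check is easier to reason about on artificial data where the right answer is known in advance. The sketch below is an assumption-laden stand-in, not the function above: `average_delay_base()` is a hypothetical base R analogue of `average_delay_by_airline()`, and `toy_flights` is made-up data, so the example runs without nycflights13 or dplyr.

```r
# Hypothetical toy data with hand-checkable averages:
# AA -> mean(10, 20) = 15; UA -> mean(0, 6) = 3 once the NA is removed
toy_flights <- data.frame(
  carrier   = c("AA", "AA", "UA", "UA", "UA"),
  dep_delay = c(10, 20, 0, 6, NA)
)

# Base R analogue of average_delay_by_airline(); data has a default,
# mirroring the signature used in the text
average_delay_base <- function(data = toy_flights, airline_code) {
  mean(data$dep_delay[data$carrier == airline_code], na.rm = TRUE)
}

# Relying on the default for data means naming the second argument
aa_default <- average_delay_base(airline_code = "AA")
ua_default <- average_delay_base(airline_code = "UA")

# Doubling every delay should exactly double the average
toy_flights2 <- toy_flights
toy_flights2$dep_delay <- toy_flights2$dep_delay * 2
aa_doubled <- average_delay_base(toy_flights2, "AA")

stopifnot(aa_default == 15, ua_default == 3, aa_doubled == 2 * aa_default)
```

If the doubled dataset did not return exactly twice the original average, that would be a sign the function is still reading from the wrong data.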