class: center, middle, inverse, title-slide

.title[
# Working with Web APIs
🔌
]

.author[
### S. Mason Garrison
]

---

layout: true

<div class="my-footer">
<span>
<a href="https://DataScience4Psych.github.io/DataScience4Psych/" target="_blank">Data Science for Psychologists</a>
</span>
</div>

---

class: middle

# Learning Goals

---

## Learning Goals

By the end of this session, you will be able to...

- Distinguish between web scraping and web APIs as data collection methods
- Explain the structure of HTTP requests and responses
- Use the httr package to make API requests in R
- Parse JSON responses into tidy data frames
- Apply best practices for authentication, rate limiting, and ethical API use

---

class: middle

# Recap: Scraping vs. APIs

---

## Two approaches to web data

In the web scraping module, we saw two scenarios:

--

- **Screen scraping**: extract data from the source code of a website, with an HTML parser or regular expressions

--

- **Web APIs**: a website offers a set of structured HTTP requests that return JSON or XML files

--

Today we focus on the second approach.

---

## Why use an API?

- **Structured data**: APIs return data in a consistent, machine-readable format (JSON, XML)

--

- **Consistency**: The data format is documented and maintained by the provider

--

- **Permission**: APIs are *intended* for programmatic access

--

- **Efficiency**: You get exactly the data you need, with no HTML to parse

--

- **Ethics**: Using an API respects the data provider's terms of service

---

class: middle

# What is an API?

---

## Application Programming Interface

An **API** is a set of rules that allows one piece of software to talk to another.
--

- Think of it like a menu at a restaurant:
  - The menu (API documentation) tells you what you can order
  - You place an order (make a request)
  - The kitchen (server) prepares your food
  - You receive your meal (get a response)

--

- A **web API** specifically uses HTTP (the same protocol your browser uses) to send and receive data over the internet

---

.xlarge[.question[
What kinds of data have you encountered that might be available through an API?
]]

---

## Anatomy of an API request

--

1. **Base URL**: the address of the API <br> `https://api.example.com/`

--

2. **Endpoint**: the specific resource you want <br> `https://api.example.com/data/`

--

3. **Parameters**: filters or options for your request <br> `https://api.example.com/data?year=2024&limit=10`

--

4. **HTTP method**: what action you want to perform
    - `GET` - retrieve data (most common)
    - `POST` - send data
    - `PUT` - update data
    - `DELETE` - remove data

---

## HTTP response

When the server responds, you get:

--

.pull-left[

- **Status code**: was the request successful?

| Code | Meaning |
|------|---------|
| `200` | OK - success! |
| `301` | Moved permanently (redirect) |
| `400` | Bad request (something wrong with your request) |
| `401` | Unauthorized (need authentication) |
| `403` | Forbidden (not allowed) |
| `404` | Not found |
| `429` | Too many requests (rate limited) |
| `500` | Internal server error |

]

--

.pull-right[

- **Headers**: metadata about the response
- **Body**: the actual data (usually JSON)

]

---

class: middle

# JSON: The language of APIs

---

## JavaScript Object Notation (JSON)

Most modern APIs return data in **JSON** format:

.pull-left.small[
```json
{
  "name": "R Project",
  "year": 1993,
  "open_source": true,
  "creators": ["Ross Ihaka", "Robert Gentleman"],
  "stats": {
    "packages": 20000,
    "users": "millions"
  }
}
```
]

--

.pull-right[
- Key-value pairs: `"name": "R Project"`
- Nested structures: objects within objects
- Arrays: lists of values `["Ross Ihaka", "Robert Gentleman"]`
- Data types: strings, numbers, booleans, null
]

---

## JSON to R

The **jsonlite** package converts JSON into R objects:

- JSON objects become named lists or data frames
- JSON arrays become vectors or lists
- Nested JSON becomes nested lists

--

.pull-left[
``` r
library(jsonlite)
json_text <- '{
  "name": "R Project",
  "year": 1993,
  "open_source": true
}'
```
]

.pull-right[
``` r
fromJSON(json_text)
```

```
## $name
## [1] "R Project"
## 
## $year
## [1] 1993
## 
## $open_source
## [1] TRUE
```
]

---

class: middle

# Making API requests in R

---

## The httr package

.pull-left[
- **httr** makes HTTP requests from R simple and consistent
- Works well with the tidyverse pipeline
- Handles authentication, headers, and content parsing
]

.pull-right[
Key functions:

- `GET()` - make a GET request
- `POST()` - make a POST request
- `content()` - extract response content
- `status_code()` - check the status code
- `headers()` - view response headers
]

---

## A first API call

Let's query a free, public API. The [Open-Meteo API](https://open-meteo.com/) provides weather data with no authentication required.
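One more jsonlite behavior worth knowing before we parse a real response: a JSON *array of objects* becomes a data frame, one row per object. A minimal offline sketch (the JSON string here is invented for illustration):

``` r
library(jsonlite)

# Made-up JSON array, shaped like a typical API response
json_array <- '[
  {"city": "Winston-Salem", "temp_f": 52.6},
  {"city": "Durham", "temp_f": 55.1}
]'

records <- fromJSON(json_array)
records  # a 2-row data frame with columns city and temp_f
```

`fromJSON()` applies the same simplification to the weather response we parse on the upcoming slides.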
.your-turn[
- [class git repo](https://github.com/DataScience4Psych) > `AE 09 - APIs`.
- Open `01-weather-api.R`.
- Follow along, and fill in the blanks as we go based on upcoming slides.
]

---

# Querying the Open-Meteo API

.midi[
``` r
response <- GET(
  "https://api.open-meteo.com/v1/forecast",
  query = list(
    latitude = 36.00,   # Winston-Salem, NC
    longitude = -80.24,
    models = "gfs_seamless",
    current = "temperature_2m",   # current temp at 2 meters above ground
    wind_speed_unit = "mph",
    precipitation_unit = "inch",
    temperature_unit = "fahrenheit"
  )
)
```
]

--

``` r
status_code(response)
```

```
## [1] 200
```

---

## Inspecting the response

.pull-left[
``` r
weather_data <- httr::content(response, as = "text", encoding = "UTF-8") %>%
  jsonlite::fromJSON()

names(weather_data)  # What did we get?
```

```
## [1] "latitude" "longitude"
## [3] "generationtime_ms" "utc_offset_seconds"
## [5] "timezone" "timezone_abbreviation"
## [7] "elevation" "current_units"
## [9] "current" "hourly_units"
## [11] "hourly"
```
]

--

.midi[
``` r
weather_data$current
```

```
## $time
## [1] "2026-02-27T00:15"
## 
## $interval
## [1] 900
## 
## $temperature_2m
## [1] 52.6
```
]

---

## From API response to tibble

``` r
hourly <- weather_data$hourly

tidy_forecast <- tibble(
  temperature = hourly$temperature_2m,
  time = as.POSIXct(hourly$time),
  date = as.Date(hourly$time)
) %>%
  arrange(date) %>%
  mutate(date = as.factor(date))
```

---

## Visualize the forecast

.pull-left[
<img src="d16b_apis_files/figure-html/forecast-plot-1.png" alt="" width="100%" style="display: block; margin: auto;" />
]

--

.midi.pull-right[
``` r
tidy_forecast %>%
  mutate(time = as.POSIXct(time)) %>%
  ggplot(aes(x = time, y = temperature)) +
  geom_line() +
  labs(
    title = "Hourly temperature forecast – Winston-Salem, NC",
    x = "Time",
    y = "Temperature (°F)"
  )
```
]

---

.pull-left[
<img src="d16b_apis_files/figure-html/forecast-ridgeplot-1.png" alt="" width="100%" style="display: block; margin: auto;" />
]

--

.midi.pull-right[
``` r
library(ggridges)

tidy_forecast %>%
  ggplot(aes(x = temperature, y = date, fill = after_stat(x))) +
  geom_density_ridges_gradient(scale = 4, rel_min_height = 0.01, alpha = 0.5) +
  labs(
    title = "Hourly temperature forecast – Winston-Salem, NC",
    y = "Date",
    x = "Temperature (°F)"
  ) +
  viridis::scale_fill_viridis(name = "Temperature (°F)", option = "C") +
  theme_ridges() +
  theme(legend.position = "right")
```
]

---

class: middle

# Wrapping Up...

---

class: middle

# A richer example: University data

---

## Hipolabs Universities API

The [Hipolabs Universities API](https://universities.hipolabs.com/) lets you search for universities worldwide -- no API key needed.

.pull-left[
``` r
response <- GET(
  "http://universities.hipolabs.com/search",
  query = list(
    country = "United States",
    name = "Wake Forest"
  )
)

status_code(response)
```

```
## [1] 200
```
]

---

# Code along!

.your-turn[
- [class git repo](https://github.com/DataScience4Psych) > `AE 09 - APIs`.
- Open `02-university-api.R`.
- Use the Hipolabs Universities API to search for universities in a country of your choice.
- Parse the JSON response and create a tidy data frame.
- Try querying multiple countries and combining the results.
]

---

## Parse the response

``` r
universities <- content(response, as = "text", encoding = "UTF-8") %>%
  fromJSON(flatten = TRUE)

glimpse(universities)
```

```
## Rows: 2
## Columns: 6
## $ country          <chr> "United States", "United States"
## $ web_pages        <list> "http://www.wfu.edu/", "http://www.wak…
## $ alpha_two_code   <chr> "US", "US"
## $ `state-province` <lgl> NA, NA
## $ name             <chr> "Wake Forest University", "Wake Fores…
## $ domains          <list> "wfu.edu", "wakehealth.edu"
```

---

## Build a tidy data frame

``` r
uni_df <- universities %>%
  select(name, country, state = `state-province`, website = web_pages) %>%
  unnest(website)

uni_df
```

```
## # A tibble: 2 × 4
##   name                       country       state website
##   <chr>                      <chr>         <lgl> <chr>
## 1 Wake Forest University     United States NA    http://www.wfu.…
## 2 Wake Forest Baptist Health United States NA    http://www.wake…
```

---

.question[
What information would you want to extract from the university data? How would you reshape it into a tidy format?
]

---

## Scaling up: multiple queries

What if we want universities from several countries?

``` r
countries <- c("United States", "United Kingdom", "Canada")

all_unis <- map_dfr(countries, function(country) {
  resp <- GET(
    "http://universities.hipolabs.com/search",
    query = list(country = country)
  )
  content(resp, as = "text", encoding = "UTF-8") %>%
    fromJSON(flatten = TRUE) %>%
    select(name, country) %>%
    head(10)  # first 10 per country for demo
})
```

---

``` r
all_unis %>%
  count(country, sort = TRUE)
```

```
## # A tibble: 3 × 2
##   country            n
##   <chr>          <int>
## 1 Canada            10
## 2 United Kingdom    10
## 3 United States     10
```
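---

## The pattern, minus the network

`map_dfr()` does the heavy lifting above: call a function once per input, then row-bind the resulting data frames. A minimal offline sketch, where the made-up `fake_fetch()` stands in for a real API call:

``` r
library(purrr)
library(tibble)

# Hypothetical stand-in for one API request: returns a small data frame
fake_fetch <- function(country) {
  tibble(
    name = paste(country, "University", 1:2),
    country = country
  )
}

countries <- c("Freedonia", "Sylvania")
all_unis <- map_dfr(countries, fake_fetch)

nrow(all_unis)  # 4: two rows per country, bound together
```

Swap `fake_fetch()` for a real request function and the pattern is unchanged.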
---

class: middle

# Wrapping Up...

---

class: middle

# Writing functions for API calls

---

## Why wrap API calls in functions?

When you call the same API repeatedly, it can be helpful to wrap the logic in a function:

``` r
get_universities <- function(country, name = NULL) {
  params <- list(country = country)
  if (!is.null(name)) params$name <- name

  resp <- GET(
    "http://universities.hipolabs.com/search",
    query = params
  )

  if (status_code(resp) != 200) {
    warning("API request failed with status: ", status_code(resp))
    return(tibble())
  }

  content(resp, as = "text", encoding = "UTF-8") %>%
    fromJSON(flatten = TRUE) %>%
    as_tibble()
}
```

---

## Using the helper function

``` r
# Search for psychology-related universities
psych_unis <- get_universities("United States", name = "psychology")
nrow(psych_unis)
```

```
## [1] 1
```

--

``` r
# Get Canadian universities
canadian_unis <- get_universities("Canada")
nrow(canadian_unis)
```

```
## [1] 157
```

--

This pattern -- wrapping API calls in functions -- makes your code:

- Reusable: call it many times with different parameters
- Readable: clear what the function does
- Robust: error handling in one place

---

class: middle

# Authentication

---

## API keys

Many APIs require an **API key** to:

- Identify who is making requests
- Enforce rate limits
- Track usage

--

Getting an API key typically involves:

1. Creating an account on the API provider's website
2. Generating a key in your dashboard
3.
Including the key in your requests

---

## Using API keys safely

.pull-left[
### Do
- Store keys in `.Renviron` file
- Use `Sys.getenv()` to read them
- Add `.Renviron` to `.gitignore`
]

.pull-right[
### Don't
- Hard-code keys in your scripts
- Commit keys to git
- Share keys publicly
]

--

``` r
# In your .Renviron file:
# MY_API_KEY=abc123xyz

# In your R script:
api_key <- Sys.getenv("MY_API_KEY")

response <- GET(
  "https://api.example.com/data",
  add_headers(Authorization = paste("Bearer", api_key))
)
```

---

## Common authentication patterns

``` r
# 1. API key as query parameter
GET("https://api.example.com/data",
    query = list(api_key = Sys.getenv("API_KEY")))

# 2. API key in header
GET("https://api.example.com/data",
    add_headers("X-API-Key" = Sys.getenv("API_KEY")))

# 3. Bearer token
GET("https://api.example.com/data",
    add_headers(Authorization = paste("Bearer", Sys.getenv("TOKEN"))))

# 4. Basic authentication
GET("https://api.example.com/data",
    authenticate("username", "password"))
```

---

class: middle

# Best practices and ethics

---

## Rate limiting

APIs limit how many requests you can make in a given time period.
--

- Check the API documentation for rate limits
- Add delays between requests with `Sys.sleep()`

``` r
results <- map(1:100, function(i) {
  Sys.sleep(0.5)  # wait half a second between requests
  GET(paste0("https://api.example.com/item/", i))
})
```

--

- Watch for `429 Too Many Requests` status codes
- Some APIs include rate limit info in response headers:
  - `X-RateLimit-Remaining`
  - `X-RateLimit-Reset`

---

## Caching API responses

Just like with web scraping, avoid making unnecessary API calls:

``` r
# Save API response data locally
if (!file.exists("data/api_results.rds")) {
  results <- get_universities("United States")
  saveRDS(results, "data/api_results.rds")
} else {
  results <- readRDS("data/api_results.rds")
}
```

--

This is especially important when:

- Knitting R Markdown documents (avoid re-fetching each time)
- Working with APIs that have strict rate limits
- The data doesn't change frequently

---

## Ethical API use

.pull-left[
### Read the documentation
- Understand the terms of service
- Respect rate limits
- Know what data you're allowed to collect
]

.pull-right[
### Be a good citizen
- Cache responses when possible
- Don't scrape what the API provides
- Give attribution when required
- Don't redistribute data without permission
]

---

class: middle

# Wrapping Up...

---

class: middle

# Common API patterns in R

---

## Working with paginated APIs

Many APIs return results in pages:

``` r
all_results <- list()
page <- 1

repeat {
  resp <- GET("https://api.example.com/data",
              query = list(page = page, per_page = 100))
  data <- content(resp, as = "text", encoding = "UTF-8") %>%
    fromJSON()

  if (length(data$results) == 0) break

  all_results[[page]] <- data$results
  page <- page + 1
  Sys.sleep(0.5)
}

final_data <- bind_rows(all_results)
```

---

## Error handling

APIs can fail for many reasons.
Always handle errors gracefully:

``` r
safe_api_call <- function(url, params) {
  tryCatch({
    resp <- GET(url, query = params)
    if (status_code(resp) != 200) {
      warning("Request failed: HTTP ", status_code(resp))
      return(NULL)
    }
    content(resp, as = "text", encoding = "UTF-8") %>%
      fromJSON(flatten = TRUE)
  }, error = function(e) {
    warning("Error: ", e$message)
    return(NULL)
  })
}
```

---

.question[
When would you choose web scraping over using an API? When would an API be the better choice?
]

---

## Comparing approaches: scraping vs. API

| | Web Scraping | Web API |
|--|--|--|
| **Data format** | HTML (unstructured) | JSON/XML (structured) |
| **Tools** | rvest | httr, jsonlite |
| **Reliability** | Breaks when site changes | Stable, versioned |
| **Permission** | Check robots.txt | Check API docs/ToS |
| **Speed** | Need to parse HTML | Direct data access |
| **Authentication** | Usually none | Often requires API key |
| **Rate limits** | Be polite | Documented limits |

---

# Sources

- Hadley Wickham, Mine Çetinkaya-Rundel, & Garrett Grolemund, [R for Data Science](https://r4ds.hadley.nz/)
- Mine Cetinkaya-Rundel's Data Science in a Box ([link](https://datasciencebox.org/))
- httr package documentation ([link](https://httr.r-lib.org/))
- Open-Meteo API ([link](https://open-meteo.com/))
- Hipolabs Universities API ([link](https://universities.hipolabs.com/))

---

class: middle

# Summary: Learning Goals Achieved

---

## What We've Learned

Today, you should now be able to...

.pull-left[
### Concepts
- ✅ Distinguish scraping vs. APIs
- ✅ HTTP request/response cycle
- ✅ JSON data format
- ✅ Authentication patterns
]

.pull-right[
### Skills
- ✅ Make API calls with httr
- ✅ Parse JSON with jsonlite
- ✅ Write reusable API functions
- ✅ Handle errors and rate limits
]

---

class: middle

# Wrapping Up...

---

# Bonus Exercise: Open Library API

.your-turn[
- [class git repo](https://github.com/DataScience4Psych) > `AE 09 - APIs`.
- Open `03-open-library.R`.
- Use the [Open Library API](https://openlibrary.org/developers/api) to search for books on a topic of your choice.
- Parse the response and create a data frame with title, author, and year.
]

---
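# Bonus hint: parsing the response shape

Per the Open Library docs, `search.json` returns a `docs` array whose entries include `title`, `author_name`, and `first_publish_year` (verify these against the current documentation). The JSON below is a hand-made miniature in that shape, so you can practice the parsing step offline before making live requests:

``` r
library(jsonlite)

# Invented two-record sample shaped like an Open Library search response
sample_json <- '{
  "numFound": 2,
  "docs": [
    {"title": "Book One", "author_name": ["A. Author"], "first_publish_year": 1999},
    {"title": "Book Two", "author_name": ["B. Writer"], "first_publish_year": 2004}
  ]
}'

books <- fromJSON(sample_json, flatten = TRUE)

# author_name parses as a list column; collapse multi-author entries
book_df <- data.frame(
  title = books$docs$title,
  author = sapply(books$docs$author_name, paste, collapse = "; "),
  year = books$docs$first_publish_year
)
```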