92 Working with OpenAI’s API

This module introduces the basics of interacting with OpenAI’s API from R. We’ll explore how to make API calls, handle responses, and integrate AI capabilities into data science workflows.

92.1 Getting Started

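First, we need to load the required packages: httr, which provides POST(), add_headers(), content_type_json(), and content(), and jsonlite, which provides fromJSON().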

92.1.1 API Authentication

To use OpenAI’s API, you’ll need an API key. Like we learned with other APIs, it’s important to keep this secure:

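# Store the API key outside version control (NEVER commit it to Git!)
# Read only the first line and strip stray whitespace or newlines
openai_api_key <- trimws(readLines("path/to/api_key.txt", n = 1, warn = FALSE))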
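An alternative that avoids keeping a key file inside the project is to read the key from an environment variable. The following is a minimal sketch, assuming the key has been stored under the name OPENAI_API_KEY (for example in ~/.Renviron); the helper name get_openai_key() is hypothetical and not part of any package.

```r
# Hypothetical helper: read the API key from an environment variable.
# Assumes the variable is named OPENAI_API_KEY (e.g., set in ~/.Renviron).
get_openai_key <- function(var = "OPENAI_API_KEY") {
  key <- trimws(Sys.getenv(var, unset = ""))
  if (!nzchar(key)) {
    stop("Environment variable ", var, " is not set.", call. = FALSE)
  }
  key
}

# Usage (commented out so the sketch runs without a key):
# openai_api_key <- get_openai_key()
```

Failing early with a clear message is a design choice here: an empty key would otherwise only surface later as an authentication error from the API.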

92.1.2 Making API Requests

The core workflow involves:

Constructing the API request
Sending it to OpenAI’s endpoint
Processing the response

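Next, we define a basic function for text generation. It takes a prompt as input, sends it to the chat completions endpoint, and returns the full parsed response; we extract the generated text from that response in the next section. The comments show the equivalent curl command: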

generate_text <- function(prompt) {
  response <- POST(
    # curl https://api.openai.com/v1/chat/completions
    url = "https://api.openai.com/v1/chat/completions",
    # -H "Authorization: Bearer $OPENAI_API_KEY"
    add_headers(Authorization = paste("Bearer", openai_api_key)),
    # -H "Content-Type: application/json"
    content_type_json(),
    # -d '{
    #   "model": "gpt-3.5-turbo",
    #   "messages": [{"role": "user", "content": "What is a banana?"}]
    # }'
    encode = "json",
    body = list(
      model = "gpt-3.5-turbo",
      messages = list(list(role = "user", content = prompt))
    )
  )

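  # Turn HTTP errors (e.g., 401 or 429) into R errors so callers can catch them
  stop_for_status(response)

  str_content <- content(response, "text", encoding = "UTF-8")
  parsed <- fromJSON(str_content)

  # Return the full parsed response; the generated text is extracted later
  return(parsed)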
}
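To see how the nested R list in body corresponds to the JSON shown in the curl comments, the sketch below serializes the same structure with jsonlite::toJSON(). It is only an illustration: httr performs its own conversion when encode = "json" is set, and the auto_unbox = TRUE option used here is an assumption about what produces the matching shape, not a description of httr's internals.

```r
library(jsonlite)

# Same body structure that generate_text() passes to POST()
body <- list(
  model = "gpt-3.5-turbo",
  messages = list(list(role = "user", content = "What is a banana?"))
)

# auto_unbox = TRUE writes length-one vectors as JSON scalars, not arrays
body_json <- toJSON(body, auto_unbox = TRUE, pretty = TRUE)
print(body_json)
```

The outer list() becomes a JSON object, while the unnamed inner list(list(...)) becomes an array holding one message object, which is the shape the messages field expects.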

92.2 Example Usage and Handling the Response

Now that we’ve defined our generate_text() function, let’s test it by sending a request to OpenAI’s API and working with the response.

92.2.1 Step 1: Send a Request

prompt <- "Summarize the key steps in a data science workflow:"
generated_text <- generate_text(prompt)

92.2.2 Step 2: Examine the Raw API Response

When we call the generate_text(prompt) function, OpenAI’s API returns a structured response in JSON format, which R reads as a list. This response contains multiple components, but the most important part is the generated text.

Let’s print the raw response to see its structure.

print(generated_text)
#> $id
#> [1] "chatcmpl-BSCrtN3Y1U1afbmC1G3dnk1BqiiKD"
#> 
#> $object
#> [1] "chat.completion"
#> 
#> $created
#> [1] 1746062349
#> 
#> $model
#> [1] "gpt-3.5-turbo-0125"
#> 
#> $choices
#>   index message.role
#> 1     0    assistant
#>   message.content
#> 1 1. Define the problem: Clearly outline the objectives and goals of the project and determine what type of data is needed to solve the problem.\n\n2. Data collection: Gather the necessary data from various sources, which may include databases, surveys, APIs, etc.\n\n3. Data preparation: Clean, transform, and preprocess the data to make it suitable for analysis. This may involve dealing with missing values, outliers, and formatting issues.\n\n4. Exploratory data analysis (EDA): Explore the data to gain a better understanding of its characteristics, relationships, and patterns. This may involve visualization techniques and statistical analysis.\n\n5. Feature engineering: Create new features or transform existing features to improve the performance of the model.\n\n6. Model selection: Choose the appropriate machine learning algorithm(s) for the problem at hand based on the nature of the data and the objectives of the project.\n\n7. Model training: Split the data into training and testing sets, and train the selected model(s) on the training data.\n\n8. Model evaluation: Assess the performance of the model(s) using appropriate evaluation metrics and techniques, such as cross-validation or grid search.\n\n9. Model deployment: Deploy the trained model to production and make predictions on new data.\n\n10. Model monitoring and maintenance: Continuously monitor the model's performance in production, retraining and updating it as needed to ensure it remains accurate and relevant.
#>   message.refusal message.annotations logprobs finish_reason
#> 1              NA                NULL       NA          stop
#> 
#> $usage
#> $usage$prompt_tokens
#> [1] 19
#> 
#> $usage$completion_tokens
#> [1] 279
#> 
#> $usage$total_tokens
#> [1] 298
#> 
#> $usage$prompt_tokens_details
#> $usage$prompt_tokens_details$cached_tokens
#> [1] 0
#> 
#> $usage$prompt_tokens_details$audio_tokens
#> [1] 0
#> 
#> 
#> $usage$completion_tokens_details
#> $usage$completion_tokens_details$reasoning_tokens
#> [1] 0
#> 
#> $usage$completion_tokens_details$audio_tokens
#> [1] 0
#> 
#> $usage$completion_tokens_details$accepted_prediction_tokens
#> [1] 0
#> 
#> $usage$completion_tokens_details$rejected_prediction_tokens
#> [1] 0
#> 
#> 
#> 
#> $service_tier
#> [1] "default"
#> 
#> $system_fingerprint
#> NULL

As you can see, the response is a nested list containing various metadata (e.g., request ID, model name, creation time), the AI-generated response (inside $choices$message$content, because fromJSON() simplifies the choices array into a data frame), token usage information (inside $usage), and more.

92.2.3 Step 3: Extract the AI-Generated Text

Since the response contains both metadata and content, we need to extract only the generated text, which is stored in the message$content column of $choices:

ai_response <- generated_text$choices$message$content
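The extraction can be tried without calling the API by building a mock object with the same shape as the parsed response shown above. This is a sketch under assumptions: the mock values are invented, and extract_text() is a hypothetical helper that also tolerates a NULL response.

```r
# Mock object shaped like the parsed response shown above (not real API output)
mock_choices <- data.frame(index = 0L, finish_reason = "stop",
                           stringsAsFactors = FALSE)
mock_choices$message <- data.frame(
  role = "assistant",
  content = "Line one\n\nLine two",
  stringsAsFactors = FALSE
)
mock_response <- list(model = "gpt-3.5-turbo-0125", choices = mock_choices)

# Hypothetical helper: return the first generated text, or NA if absent
extract_text <- function(response) {
  text <- response$choices$message$content
  if (is.null(text) || length(text) == 0) {
    return(NA_character_)
  }
  text[[1]]
}

extract_text(mock_response)
extract_text(NULL)
```

Returning NA_character_ instead of failing keeps the helper usable on results where a request did not succeed.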

Now, let’s print the AI-generated text:

print(ai_response)
#> [1] "1. Defining the problem: Identify the business problem or question that needs to be answered.\n\n2. Data collection: Collect relevant data from various sources, including databases, APIs, and external sources.\n\n3. Data cleaning: Clean and preprocess the data to remove any inconsistencies, missing values, duplicates, or outliers.\n\n4. Exploratory data analysis: Explore and analyze the data to understand its key characteristics, patterns, and relationships.\n\n5. Feature engineering: Create new features or transform existing ones to improve model performance.\n\n6. Model selection: Choose the appropriate machine learning algorithm or model that best fits the problem and data.\n\n7. Model training: Train the selected model on the training data to learn the underlying patterns.\n\n8. Model evaluation: Evaluate the model's performance using metrics like accuracy, precision, recall, and F1 score.\n\n9. Model tuning: Fine-tune the model by adjusting hyperparameters to improve performance.\n\n10. Deployment: Deploy the model into production to make predictions on new data and monitor its performance over time."

OK, so that wasn’t really readable: print() shows each line break as a literal \n. cat() writes the string as-is, so the line breaks are rendered:

cat(ai_response)

1. Define the problem: Clearly outline the objectives and goals of the project and determine what type of data is needed to solve the problem.
2. Data collection: Gather the necessary data from various sources, which may include databases, surveys, APIs, etc.
3. Data preparation: Clean, transform, and preprocess the data to make it suitable for analysis. This may involve dealing with missing values, outliers, and formatting issues.
4. Exploratory data analysis (EDA): Explore the data to gain a better understanding of its characteristics, relationships, and patterns. This may involve visualization techniques and statistical analysis.
5. Feature engineering: Create new features or transform existing features to improve the performance of the model.
6. Model selection: Choose the appropriate machine learning algorithm(s) for the problem at hand based on the nature of the data and the objectives of the project.
7. Model training: Split the data into training and testing sets, and train the selected model(s) on the training data.
8. Model evaluation: Assess the performance of the model(s) using appropriate evaluation metrics and techniques, such as cross-validation or grid search.
9. Model deployment: Deploy the trained model to production and make predictions on new data.
10. Model monitoring and maintenance: Continuously monitor the model's performance in production, retraining and updating it as needed to ensure it remains accurate and relevant.

92.2.4 Step 4: Understanding Token Usage

Since OpenAI charges based on token usage, it’s useful to monitor how many tokens are used per request. The API response includes:

usage$prompt_tokens → Tokens in the input prompt
usage$completion_tokens → Tokens generated by the model
usage$total_tokens → The total token count for billing

To check token usage:

print(generated_text$usage$total_tokens) # Total tokens used
#> [1] 298
print(generated_text$usage$completion_tokens) # Tokens used for output
#> [1] 279
print(generated_text$usage$prompt_tokens) # Tokens used for input
#> [1] 19
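Token counts can be turned into a rough cost estimate. The sketch below assumes input and output tokens are priced separately per one million tokens; estimate_cost() is a hypothetical helper, and the prices passed in the example are placeholders, not OpenAI's actual rates.

```r
# Hypothetical helper: estimate request cost from the usage element.
# Prices are per 1 million tokens and must be supplied by the caller;
# check OpenAI's pricing page for the current values for your model.
estimate_cost <- function(usage, input_price_per_m, output_price_per_m) {
  input_cost <- usage$prompt_tokens / 1e6 * input_price_per_m
  output_cost <- usage$completion_tokens / 1e6 * output_price_per_m
  input_cost + output_cost
}

# Token counts from the response above; the prices below are placeholders
usage <- list(prompt_tokens = 19, completion_tokens = 279, total_tokens = 298)
estimate_cost(usage, input_price_per_m = 1, output_price_per_m = 2)
```

With a real response, generated_text$usage can be passed in place of the usage list.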

92.3 Error Handling

As with any API call, errors can occur due to network issues, invalid requests, or rate limits, so it’s important to handle them gracefully. To ensure our script doesn’t crash, we can wrap API calls in tryCatch():

generate_text_safe <- function(prompt) {
  tryCatch(
    {
      generate_text(prompt)
    },
    error = function(e) {
      warning("API call failed: ", e$message)
      return(NULL)
    }
  )
}

Now, we can use generate_text_safe() in place of generate_text(). If an error occurs, the function emits a warning and returns NULL instead of stopping the script.
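The behavior of this pattern can be checked without a network connection. The sketch below applies the same tryCatch() structure to a stand-in function that always fails; always_fails() and safe_call() are hypothetical names introduced only for this illustration.

```r
# Stand-in for generate_text() that always fails (no network call is made)
always_fails <- function(prompt) {
  stop("simulated network failure")
}

# Same tryCatch() pattern as generate_text_safe(), applied to any function
safe_call <- function(prompt, fn) {
  tryCatch(
    fn(prompt),
    error = function(e) {
      warning("API call failed: ", conditionMessage(e), call. = FALSE)
      NULL
    }
  )
}

# The error is converted to a warning and the result is NULL
result <- suppressWarnings(safe_call("Define p-value", always_fails))
is.null(result)
```

Because the failure becomes a NULL result, any code that consumes the output needs to check for NULL before extracting text.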

92.4 Processing Multiple Requests

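When working with multiple prompts, we can use purrr::map() to apply the function to each one. We use map() rather than map_chr() because each result is a full response list (or NULL), not a single string: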

library(purrr)
prompts <- c(
  "Define p-value",
  "Explain Type I error",
  "What is statistical power?"
)
# map() returns a list with one element per prompt (NULL for failed requests)
responses <- map(prompts, generate_text_safe)

This code generates text for each prompt in the prompts vector. If an error occurs, the response will be NULL. After running this code, we can examine the responses and handle any errors. I’ve included a table below to display the responses.

As you can see, the table displays the prompts, AI-generated responses, token usage, model name, and completion time for each request. This information can help us monitor the API usage and response quality.
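One way to assemble a summary like the one described above is sketched below in base R. It uses mock responses rather than real API output, and make_mock() and summarize_response() are hypothetical helpers; the second mock entry is NULL to mimic a failed request.

```r
# Mock responses shaped like generate_text_safe() results (not real API output)
make_mock <- function(text, tokens) {
  choices <- data.frame(index = 0L)
  choices$message <- data.frame(role = "assistant", content = text,
                                stringsAsFactors = FALSE)
  list(model = "gpt-3.5-turbo-0125", created = 1746062349,
       choices = choices, usage = list(total_tokens = tokens))
}
prompts <- c("Define p-value", "Explain Type I error")
responses <- list(make_mock("Mock answer", 42L), NULL)

# One row per prompt; failed requests get NA in every response column
summarize_response <- function(prompt, response) {
  failed <- is.null(response)
  data.frame(
    prompt = prompt,
    response = if (failed) NA_character_ else response$choices$message$content[[1]],
    total_tokens = if (failed) NA_integer_ else as.integer(response$usage$total_tokens),
    model = if (failed) NA_character_ else response$model,
    created = as.POSIXct(if (failed) NA_real_ else response$created,
                         origin = "1970-01-01", tz = "UTC"),
    stringsAsFactors = FALSE
  )
}

results_table <- do.call(rbind, Map(summarize_response, prompts, responses))
rownames(results_table) <- NULL
results_table
```

Keeping failed requests as NA rows, rather than dropping them, makes it easy to see which prompts need to be retried.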

92.4.1 Rate Limiting

OpenAI has rate limits we need to respect. We can add delays between requests to avoid exceeding these limits. Here’s a throttled version of the generate_text() function:

generate_text_throttled <- function(prompt) {
  Sys.sleep(1) # Wait 1 second between requests
  generate_text_safe(prompt)
}

Because Sys.sleep(1) runs before every call, including the first, requests are spaced at least one second apart. You can adjust the delay as needed based on the API’s rate limits.
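A variation on the fixed delay is to wait only for the part of the interval that has not already elapsed since the previous call. The following is a sketch of that idea using a closure; throttle() is a hypothetical helper, and toupper() stands in for the API call so the example runs offline.

```r
# Hypothetical helper: wrap any function so that successive calls are at
# least `min_interval` seconds apart. Unlike a fixed Sys.sleep(), it only
# waits for whatever part of the interval has not already elapsed.
throttle <- function(fn, min_interval = 1) {
  last_call <- -Inf  # time of the previous call; -Inf means "never called"
  function(...) {
    wait <- min_interval - (as.numeric(Sys.time()) - last_call)
    if (wait > 0) Sys.sleep(wait)
    last_call <<- as.numeric(Sys.time())
    fn(...)
  }
}

# Usage with a stand-in function (swap in generate_text_safe for real calls)
slow_upper <- throttle(toupper, min_interval = 0.2)
slow_upper("first call runs immediately")
slow_upper("second call waits for the rest of the interval")
```

This avoids paying the full delay when the surrounding code has already spent time between requests, for example while processing the previous response.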

92.5 Conclusion

In this module, we’ve covered how to generate text with OpenAI’s chat completions API from R, using the gpt-3.5-turbo model. We defined a function to call the API, examined the response, extracted the generated text, monitored token usage, and processed multiple requests. We also discussed error handling and rate limiting. By following these steps, you can use the API to generate text in R for a variety of applications. For the curious: yes, these prompts and responses are generated using the OpenAI API every time you render this notebook.