Lab: Ethics in Data Science
“With great power comes great responsibility”: Exploring Algorithmic Bias
Note: This lab is in beta testing…
Data scientists wield considerable power in today’s algorithmic society. The models we build can determine who gets a loan, who gets hired, and even who goes to jail. As the famous quote (often attributed to Spider-Man) goes: “With great power comes great responsibility.” In this lab, we’ll explore what that responsibility looks like by analyzing algorithmic bias in the COMPAS recidivism risk score used in the U.S. criminal justice system.
In 2016, ProPublica published an investigative report titled “Machine Bias” that examined how the COMPAS (Correctional Offender Management Profiling for Alternative Sanctions) algorithm was being used to inform judicial decisions across the United States, from bail to sentencing to parole. The algorithm assigns risk scores that predict a defendant’s likelihood of reoffending.
ProPublica analyzed the COMPAS algorithm by looking at the outcomes of more than 7,000 people arrested in Broward County, Florida. They found that the algorithm was biased against Black defendants, who were more likely to be incorrectly classified as high risk compared to White defendants. In this lab, we’ll work with the same dataset that ProPublica used and conduct our own investigation into algorithmic bias.
Getting started
Go to the course GitHub organization and locate the lab repo, which should be named something like lab-09-ethics-algorithmic-bias. Either fork it or use it as a template, then clone it in RStudio. First, open the R Markdown document lab-09.Rmd and Knit it. Make sure it compiles without errors. The output will be a markdown (.md) file with the same name.
Packages
In this lab, we will use the tidyverse package for data manipulation and visualization. We’ll also use the fairness package, which provides tools for measuring algorithmic fairness.
# Install the fairness package from GitHub
install.packages("devtools")
devtools::install_github("kozodoi/fairness")
If you cannot get the fairness package to install, don’t worry - we’ll implement the fairness metrics ourselves.
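If you do end up computing the metrics yourself, every fairness measure in this lab reduces to a filter / group_by / summarize pattern. Here is a minimal sketch of a helper we might write ourselves (the function name and arguments are our own invention, not part of the fairness package):

# Compute a false positive rate within each level of a grouping variable:
# among observations whose true outcome is 0, the share flagged as positive.
library(dplyr)

group_false_positive_rate <- function(data, group, truth, prediction) {
  data %>%
    filter({{ truth }} == 0) %>%        # keep the actual negatives
    group_by({{ group }}) %>%
    summarize(fpr = mean({{ prediction }}))
}

# Example usage once the COMPAS data is loaded:
# group_false_positive_rate(compas, race, two_year_recid, decile_score >= 7)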
The data
For this lab, we’ll use the COMPAS dataset compiled by ProPublica. The data has been preprocessed and mostly cleaned for you. I recommend using the janitor
package to clean the column names.
# Load the packages and the COMPAS data
library(tidyverse)
library(janitor)

compas <- read_csv("data/compas-scores-two-years.csv") %>%
  clean_names() %>%
  rename(decile_score = decile_score_12, priors_count = priors_count_15)
# Take a look at the data
glimpse(compas)
Each observation in this dataset represents a defendant. Here are the key variables:
- id: A unique identifier for each defendant
- name: The defendant’s name (anonymized)
- sex: The defendant’s sex (Female, Male)
- race: The defendant’s race (African-American, Caucasian, Hispanic, Other)
- age: The defendant’s age
- age_cat: The defendant’s age category (Less than 25, 25-45, Greater than 45)
- juv_fel_count: Number of juvenile felonies
- juv_misd_count: Number of juvenile misdemeanors
- juv_other_count: Number of other juvenile offenses
- priors_count: Number of prior offenses
- c_charge_degree: Degree of the current charge (F = felony, M = misdemeanor)
- c_charge_desc: Description of the current charge
- decile_score: COMPAS risk score from 1-10 (higher = greater risk)
- score_text: COMPAS score category (Low, Medium, High)
- v_decile_score: COMPAS violent risk score from 1-10
- v_score_text: COMPAS violent score category
- two_year_recid: Whether the defendant recidivated within two years (1 = yes, 0 = no)
- is_recid: Whether the defendant recidivated at all (1 = yes, 0 = no)
Exercises
Part 1: Exploring the data
What are the dimensions of the COMPAS dataset? (Hint: Use inline R code and functions like nrow and ncol to compose your answer.) What does each row in the dataset represent? What are the variables?

How many unique defendants are in the dataset? Is this the same as the number of rows? If not, why might there be a difference?
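A minimal sketch of how you might check these at the console (in the .Rmd itself you would report the values with inline R code; the id column is described in the data dictionary above):

# Dimensions of the dataset
nrow(compas)
ncol(compas)

# Number of unique defendants, to compare against the number of rows
library(dplyr)
n_distinct(compas$id)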
Let’s examine the demographic distribution in our dataset. Create visualizations to show:
- The distribution of defendants by race
- The distribution of defendants by sex
- The distribution of defendants by age category
For an extra challenge, try to create a single visualization that shows all three distributions side by side.
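One possible approach to the extra challenge (a sketch, assuming race, sex, and age_cat are all character columns) is to reshape the three variables into long format and facet on the variable name:

# Reshape the three demographic variables and show one bar chart per variable
library(tidyverse)

compas %>%
  select(race, sex, age_cat) %>%
  pivot_longer(everything(), names_to = "variable", values_to = "value") %>%
  ggplot(aes(y = value)) +
  geom_bar() +
  facet_wrap(~ variable, scales = "free_y") +
  labs(x = "Count", y = NULL)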
Create a visualization of the distribution of COMPAS risk scores (decile_score). What do you observe about the shape of this distribution?
Part 2: Risk scores and recidivism
Create a visualization showing the relationship between risk scores (decile_score) and actual recidivism (two_year_recid). Do higher risk scores actually correspond to higher rates of recidivism?

Calculate the overall accuracy of the COMPAS algorithm. For this exercise, consider a prediction “correct” if:
- A defendant with a high risk score (decile_score >= 7) did recidivate (two_year_recid = 1)
- A defendant with a low risk score (decile_score <= 4) did not recidivate (two_year_recid = 0)
How well does the COMPAS algorithm perform overall? What percentage of its predictions are correct based on your calculation above?
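One way to compute this (a sketch; note that defendants with mid-range scores of 5-6 fall outside the definition above, so this version simply excludes them):

# Keep only clearly high- or low-scored defendants, then compare the implied
# prediction against the observed two-year recidivism outcome
library(dplyr)

compas %>%
  filter(decile_score >= 7 | decile_score <= 4) %>%
  mutate(
    predicted_recid = if_else(decile_score >= 7, 1, 0),
    correct = predicted_recid == two_year_recid
  ) %>%
  summarize(accuracy = mean(correct))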
Part 3: Investigating disparities
Now let’s assess the predictive accuracy of the COMPAS algorithm across different demographic groups. For this exercise, we’ll focus on race, but you can also explore other demographic variables.
Create visualizations comparing the distribution of risk scores (decile_score) between Black and white defendants. Do you observe any differences?

Calculate the percentage of Black defendants and white defendants who were classified as high risk (decile_score >= 7). Is there a disparity?
Now, let’s look at the accuracy of predictions for different racial groups. Calculate the following metrics separately for Black defendants and white defendants:
- False Positive Rate: Proportion of non-recidivists (two_year_recid = 0) who were classified as high risk (decile_score >= 7)
- False Negative Rate: Proportion of recidivists (two_year_recid = 1) who were classified as low risk (decile_score <= 4)
# To filter for specific conditions:
non_recidivists <- compas %>%
  filter(two_year_recid == 0)
# Now calculate false positive rates by race
Create a visualization comparing these metrics between Black and white defendants. What disparities do you observe?
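If you are computing these rates by hand, here is a sketch of the false positive rate piece, continuing the starter code above (the false negative rate follows the same pattern, using recidivists and the low-risk cutoff):

# False positive rate by race: among defendants who did not recidivate,
# the share who were nonetheless classified as high risk
library(dplyr)

compas %>%
  filter(race %in% c("African-American", "Caucasian")) %>%
  filter(two_year_recid == 0) %>%
  group_by(race) %>%
  summarize(false_positive_rate = mean(decile_score >= 7))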
Part 4: Understanding the sources of bias
Note that there are many ways to measure bias in an algorithm. In this exercise, we’ll focus on disparities in the false positive and false negative rates. You can also explore other measures of bias, including in the stretch goals.
Let’s investigate what factors might be contributing to the disparities we’ve observed. Create a visualization showing the relationship between prior convictions (priors_count) and risk score (decile_score), colored by race. Does the algorithm weigh prior convictions differently for different racial groups?

In 2016, ProPublica and Northpointe (the company that created COMPAS) had a disagreement about how to measure fairness. ProPublica focused on error rates (false positives and false negatives), while Northpointe focused on calibration (whether the same score means the same probability of recidivism across groups).
Based on your analysis, do you see evidence supporting ProPublica’s claim that the algorithm is biased? Explain your reasoning.
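One way to look at the calibration side of this argument (a sketch, not a required part of the exercise) is to compare, at each decile score, the observed two-year recidivism rate for Black and white defendants:

# Observed recidivism rate at each risk score, by race
library(tidyverse)

compas %>%
  filter(race %in% c("African-American", "Caucasian")) %>%
  group_by(race, decile_score) %>%
  summarize(recid_rate = mean(two_year_recid), .groups = "drop") %>%
  ggplot(aes(x = decile_score, y = recid_rate, color = race)) +
  geom_line() +
  geom_point() +
  labs(x = "COMPAS decile score", y = "Observed two-year recidivism rate")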
Part 5: Designing fairer algorithms
If you were tasked with creating a fairer risk assessment algorithm, what changes would you make to address the disparities you’ve observed?
Different definitions of fairness can sometimes be mathematically incompatible with each other. What trade-offs might be involved in designing a “fair” algorithm for criminal risk assessment?
Beyond technical solutions, what policy changes might be needed to ensure that algorithmic risk assessments are used fairly in the criminal justice system?
Stretch goals
Investigating the sources of bias
Now let’s investigate what factors might be contributing to the disparities we’ve observed. This stretch goal will involve more advanced calculations and visualizations, so you may need to refer to additional resources or documentation to complete these exercises.
Create a visualization showing the relationship between prior convictions (priors_count) and risk score (decile_score), colored by race. Does the algorithm weigh prior convictions differently for different racial groups?

Look at the distribution of prior convictions (priors_count) by race. Are there differences in the distribution that might help explain the disparities in risk scores?

Consider other variables in the dataset that might influence risk scores, such as age or type of charge (c_charge_degree). Do these variables show different distributions across racial groups? Might these differences contribute to disparities in risk scores?
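A sketch of one way to start on the prior convictions comparison (the binwidth and alpha transparency here are arbitrary choices):

# Overlapping histograms of prior convictions for Black and white defendants
library(tidyverse)

compas %>%
  filter(race %in% c("African-American", "Caucasian")) %>%
  ggplot(aes(x = priors_count, fill = race)) +
  geom_histogram(binwidth = 1, position = "identity", alpha = 0.5) +
  labs(x = "Number of prior offenses", y = "Count")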
Building a fairer algorithm
Let’s try to build our own recidivism prediction model using logistic regression. We’ll use the following variables as predictors:
- age
- priors_count
- c_charge_degree
# Create a logistic regression model
recid_model <- glm(
two_year_recid ~ age + priors_count + c_charge_degree,
data = compas,
family = binomial()
)
# Add predicted probabilities to the dataset
compas <- compas %>%
mutate(
predicted_prob = predict(recid_model, type = "response"),
our_high_risk = predicted_prob >= 0.5
)
Evaluate the fairness of your model using the same metrics we calculated for the COMPAS algorithm. Does your model show less bias than the COMPAS algorithm? If so, why might that be the case?
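A sketch of what that evaluation could look like for the false positive rate, using the our_high_risk column created above (the false negative rate is analogous):

# False positive rate of our model by race: among actual non-recidivists,
# the share our model flags as high risk
library(dplyr)

compas %>%
  filter(race %in% c("African-American", "Caucasian")) %>%
  filter(two_year_recid == 0) %>%
  group_by(race) %>%
  summarize(false_positive_rate = mean(our_high_risk))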
Now build a more complex model that includes race as a predictor:
# Create a logistic regression model with race
recid_model_with_race <- glm(
two_year_recid ~ age + priors_count + c_charge_degree + race,
data = compas,
family = binomial()
)
# Add predicted probabilities to the dataset
compas <- compas %>%
mutate(
predicted_prob_with_race = predict(recid_model_with_race, type = "response"),
race_high_risk = predicted_prob_with_race >= 0.5
)
Compare the fairness metrics for this model with your previous model. What happened to the disparities between racial groups? Does including race as a variable make the algorithm more or less fair?
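A sketch of a side-by-side comparison of the two models’ false positive rates, using the our_high_risk and race_high_risk columns created above:

# False positive rates by race for the model without race and the model with race
library(dplyr)

compas %>%
  filter(race %in% c("African-American", "Caucasian")) %>%
  filter(two_year_recid == 0) %>%
  group_by(race) %>%
  summarize(
    fpr_without_race = mean(our_high_risk),
    fpr_with_race = mean(race_high_risk)
  )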
Based on your analysis, write a brief policy recommendation for how risk assessment algorithms should be used in the criminal justice system. Consider the following questions:
- Should algorithms like COMPAS be used in criminal justice decisions? Why or why not?
- If they are used, what safeguards should be put in place?
- How should the trade-off between accuracy and fairness be handled?
- What role should transparency play in algorithmic decision-making?