33 ODD: Data Transformations and Tukey’s Ladder of Powers
This optional deep dive explores data transformations, focusing on Tukey’s Ladder of Powers—an groundbreaking approach for reshaping data distributions and relationships between variables. Our journey through this topic is inspired by Fox (2016) (ch. 4, pp. 28 - 80)
33.1 Transforming Data: Tukey’s Ladder of Powers
John Tukey introduced a methodological toolkit, likened to a set of drill bits of varying sizes, for modifying the shapes of distributions and relationships between variables. This section delves into these transformations, particularly focusing on the mathematical formulations that underpin this approach.
33.1.1 Dataset Preparation and Visualization
We’ll be using data from Fox (2016)’s book. You may need to download the data manually and save it in the “data” directory within your current working directory. The dataset is available here. You can use this r code to download the data.
download.file("http://socserv.socsci.mcmaster.ca/jfox/Books/Applied-Regression-3E/datasets/UnitedNations.txt", "data/UnitedNations.txt")
#if (!require("p3d")) install.packages("p3d", repos = "http://R-Forge.R-project.org")
library(p3d)
library(car)
library(latticeExtra)
library(gridExtra)
# read data
un <- read.table("data/UnitedNations.txt", header = TRUE)%>%
mutate(country = rownames(.)) # Assigning country names
Let’s visualize an attempt to make our regression behave. Below are the original and log-transformed relationships with ggplot2. You’ve likely used a log-transformation before, but we’ll be exploring the more general family of transformations.
The log transformation has straightened out the data, allowing us to fit a more useful regression line. However, we’re not limited to just a log transformation. We can use an entire family of transformations called Tukey’s Ladder of Powers.
33.2 Introduction to Tukey’s Ladder of Powers
John Tukey suggested this simple toolkit, like a set drill bits of varying sizes, to modify the shape of distributions and the shape of relationships between variables. The basic idea stems from the fact that functions of the form \(y' = y^p, \quad y > 0\) have a graph that is concave up if \(p >1\), and concave down if \(0<p<1\) or \(p<0\).
Here are some examples to illustrate these concepts:
Concave Up (\(y= x^2\)): This plot shows a parabolic curve opening upwards, illustrating the concept of a concave up graph where the slope of the tangent line increases as \(x\) increases.
Concave Down (\(y= -x^2\)): Conversely, this plot shows a parabolic curve opening downwards, illustrating a concave down graph where the slope of the tangent line decreases as \(x\) increases.
We can use this information to modify the shape of a distribution or the shape of a relationship between variables by using a power transformation (\(p\)) on our data.
33.2.1 Mathematical Formulation of Tukey’s Ladder of Powers
To understand Tukey’s Ladder of Powers, let’s start with the general form of the transformation:
\[y' = \frac{y^p - 1}{p}\]
where \(p\) is the power parameter. This formula helps us adjust the shape of the data distribution. When \(p\) is positive, the transformation can help reduce right skewness. When \(p\) is negative, it can help reduce left skewness.
Special Case: \(p = 0\)
When \(p = 0\), the transformation simplifies to a logarithmic transformation. This is because as \(p\) approaches 0, the formula \(\frac{y^p - 1}{p}\) approaches \(\ln y\). This can be understood using a concept from calculus called l’Hôpital’s rule, which is used to evaluate limits of indeterminate forms. For our purposes, we can state that:
\[\lim_{p \to 0} \frac{y^p - 1}{p}= \lim_{p \to 0}\, e^{\, p \ln y} \ln y = \ln y\]
To make this concept more intuitive, imagine \(y^p\) as a function that gets closer and closer to 1 as \(p\) gets closer to 0. The subtraction of 1 and division by \(p\) in the formula adjust this function to reflect the logarithmic behavior.
33.2.2 Defining the Transformation Function in R
We define a function that produces this transformation. The easy way to define it is:
pow <- function(y, p) {
if (p == 0) {
log(y)
} else {
(y^p - 1) / p
}
}
# Testing the transformation with a sequence of x values
x <- seq(-1, 3, by = .5)
# Applying the transformation for different powers of p
x_transformed <- tibble(
x = x,
`p=2` = pow(x, 2),
`p=0` = pow(x, 0),
`p=-1` = pow(x, -1)
)
#> Warning in log(y): NaNs produced
# Visualizing the transformed data
x_transformed_long <- pivot_longer(x_transformed, cols = starts_with("p"), names_to = "Power", values_to = "Transformed")
ggplot(x_transformed_long, aes(x = x, y = Transformed, color = Power)) +
geom_line() +
labs(title = "Visualization of Power Transformations", x = "Original x", y = "Transformed x") +
theme_minimal()
#> Warning: Removed 2 rows containing missing values or values outside the scale range
#> (`geom_line()`).
# test:
x <- seq(-1, 3,by = .5)
x # note that these transformations are really intended for y > 0
#> [1] -1.0 -0.5 0.0 0.5 1.0 1.5 2.0 2.5 3.0
pow(x, 2)
#> [1] 0.000 -0.375 -0.500 -0.375 0.000 0.625 1.500 2.625 4.000
#pow(x, 0) %>%
# name(x)
#pow(x, -1) %>%
# name(x) %>%
# cbind()
plot(exp) # easy plotting of a function
However, this approach works correctly only for a single value of \(p\) because the statement if(p == 0)
only tests the first element of p
.
33.3 Vectorizing a function
Most operators in R are vectorized so they work element-wise when their arguments are vectors. When the arguments have incompatible lengths, the shorter argument is recycled to have the same length as the longer one. That is why the following produces sensible results:
z <- c(3, 5, 9)
z + c(1, 1, 1)
#> [1] 4 6 10
z + 1 # 1 is recycled so the result is equivalent to the previous line
#> [1] 4 6 10
z + c(1, 2, 3)
#> [1] 4 7 12
z + c(1, 2) # recycles but gives a warning
#> Warning in z + c(1, 2): longer object length is not a multiple of shorter
#> object length
#> [1] 4 7 10
z + z
#> [1] 6 10 18
z^2
#> [1] 9 25 81
z^z
#> [1] 2.70e+01 3.12e+03 3.87e+08
We can use ifelse
which works on a vector instead of a single value.
pow <- function(y, p) {
p <- rep(p, length.out = length(y))
y <- rep(y, length.out = length(p))
ifelse(p == 0, log(y), (y^p - 1) / p)
}
# To apply the function over vectors of y and p, ensuring vectorized operations:
vectorized_pow <- function(y, p) {
map2(y, p, pow)
}
# test:
pow(-1:4, c(2, 0, -1, 1, 3))
#> Warning in log(y): NaNs produced
#> [1] 0.00 -Inf 0.00 1.00 8.67 7.50
pow(-1:4, 2)
#> [1] 0.0 -0.5 0.0 1.5 4.0 7.5
With a bit more work, we can avoid unnecessary evaluations:
pow <- function(y, p) {
p <- rep(p, length.out = length(y))
y <- rep(y, length.out = length(p))
y[p == 0] <- log(y[p == 0])
y[p != 0] <- (y[p != 0]^p[p != 0] - 1) / p[p != 0]
y
}
# Test:
pow(1:10, 0) == log(1:10)
#> [1] TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE
pow(1:10, -1)
#> [1] 0.000 0.500 0.667 0.750 0.800 0.833 0.857 0.875 0.889 0.900
pow(1:10, .5)
#> [1] 0.000 0.828 1.464 2.000 2.472 2.899 3.292 3.657 4.000 4.325
pow(1:10, -1:8)
#> [1] 0.00e+00 6.93e-01 2.00e+00 7.50e+00 4.13e+01 3.24e+02 3.36e+03 4.37e+04
#> [9] 6.83e+05 1.25e+07
Let’s plot this transformation for a range of values of \(p\). The value of expand.grid
is a data frame whose rows consist of the Cartesian product (i.e. all possible combinations) of its arguments.
expand.grid(a = c("A", "B"), x = 1:3)
#> a x
#> 1 A 1
#> 2 B 1
#> 3 A 2
#> 4 B 2
#> 5 A 3
#> 6 B 3
dd <- expand.grid(y = seq(.01, 3, .01), p = c(-2, -1, -.5, 0, .5, 1, 2, 3))
dim(dd)
#> [1] 2400 2
head(dd)
#> y p
#> 1 0.01 -2
#> 2 0.02 -2
#> 3 0.03 -2
#> 4 0.04 -2
#> 5 0.05 -2
#> 6 0.06 -2
some(dd) # 10 rows at random
#> y p
#> 80 0.80 -2.0
#> 137 1.37 -2.0
#> 181 1.81 -2.0
#> 281 2.81 -2.0
#> 615 0.15 -0.5
#> 974 0.74 0.0
#> 1200 3.00 0.0
#> 2130 0.30 3.0
#> 2136 0.36 3.0
#> 2325 2.25 3.0
dd$yval <- with(dd, pow(y, p))
xyplot(yval ~ y | factor(p), dd, type = "l")
gd(8, lwd = 2) # number of colours needed
#> Error in gd(8, lwd = 2): could not find function "gd"
xyplot(yval ~ y, dd,
groups = p,
type = "l",
xlim = c(0, 3),
ylim = c(-2, max(dd$yval))
)
xyplot(yval ~ y, dd,
groups = p,
type = "l",
auto.key = list(space = "right", lines = T, points = F),
xlim = c(0, 3),
ylim = c(-2, max(dd$yval))
)
It’s much better to have the legend in the same order as the lines in the graph. We can turn p
into a factor and reverse its order.
dd$po <- factor(dd$p)
dd$po <- reorder(dd$po, -dd$p)
xyplot(yval ~ y, dd,
groups = po,
type = "l",
auto.key = list(space = "right", lines = T, points = F, title = "power"),
xlim = c(0, 3),
ylim = c(-2, max(dd$yval))
)
From quantile plots:
Uniform quantiles…
Normal quantiles
xqplot(un, ptype = "normal")
#> Error in xqplot(un, ptype = "normal"): could not find function "xqplot"
We see that none of the numeric variables have normal distributions.
- ‘age’ is somewhat platykurtic compared with a normal
- ‘compositeHourlyWages’ has both a categorical (0) and a continuous component
- ‘education’ is also platykurtic
- ‘working’ is dichotomous
- ‘familyIncome’ is skewed to the right
Note that the fact that \(x\) or \(y\) variables are not normal does not mean that the conditional distribution of \(y\) given \(x\) is not normal.
Let’s explore wages of working women as a function of education.
library(latticeExtra)
un %>%
xyplot(infantMortality ~ GDPperCapita, .) +
layer(panel.loess(..., lwd = 2))
# Scatterplot showing curvature in relationship
trellis.focus()
panel.identify(labels = rownames(un))
#> integer(0)
trellis.unfocus()
un %>% subset(country %in% c("United.States", "Canada"))
#> region tfr contraception educationMale educationFemale lifeMale
#> Canada America 1.61 66 17.2 17.8 76.1
#> United.States America 1.96 71 15.4 16.2 73.4
#> lifeFemale infantMortality GDPperCapita economicActivityMale
#> Canada 81.8 6 18943 72.4
#> United.States 80.1 7 26037 74.9
#> economicActivityFemale illiteracyMale illiteracyFemale
#> Canada 57.6 NA NA
#> United.States 59.3 2.24 2.23
#> country
#> Canada Canada
#> United.States United.States
between wage and education, and heteroskedasticity in wage as a function of education.
33.4 Box Cox Transformation
This video was made by math et al. I like their channel and found this video to be a good one.
33.4.1 Additional Resources
Salvatore S. Mangiafico’s Summary and Analysis of Extension Program Evaluation in R, rcompanion.org/handbook/. Pdf version
http://blackwell.math.yorku.ca/math4939/lectures/transforming_data_tukeys_ladder_of_powers.html
https://thomaselove.github.io/431-notes/re-expression-tukeys-ladder-box-cox-plot.html