30 ODD: Data Transformations and Tukey’s Ladder of Powers

This optional deep dive explores data transformations, focusing on Tukey’s Ladder of Powers, a groundbreaking approach for reshaping data distributions and relationships between variables. Our journey through this topic is inspired by Fox (2016, ch. 4, pp. 28-80).

30.1 Transforming Data: Tukey’s Ladder of Powers

John Tukey introduced a methodological toolkit, likened to a set of drill bits of varying sizes, for modifying the shapes of distributions and relationships between variables. This section delves into these transformations, particularly focusing on the mathematical formulations that underpin this approach.

30.1.1 Dataset Preparation and Visualization

We’ll be using data from Fox (2016)’s book. You may need to create a “data” directory within your current working directory; the following R code downloads the dataset from the book’s website into it.

download.file("http://socserv.socsci.mcmaster.ca/jfox/Books/Applied-Regression-3E/datasets/UnitedNations.txt", "data/UnitedNations.txt")
#if (!require("p3d")) install.packages("p3d", repos = "http://R-Forge.R-project.org")
library(p3d)
library(car)
library(latticeExtra)
library(gridExtra)
# read data
un <- read.table("data/UnitedNations.txt", header = TRUE)%>%
  mutate(country = rownames(.)) # Assigning country names

Let’s visualize an attempt to make our regression behave: below are the original and log-transformed relationships, plotted with ggplot2. You’ve likely used a log transformation before, but here we’ll explore the more general family of transformations it belongs to.
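One way to produce such a comparison (a sketch; as an assumption, it uses infantMortality and GDPperCapita from the un data loaded above, the pair we analyze later in this chapter):

# Raw scale: the relationship is strongly curved and heteroskedastic
p_raw <- ggplot(un, aes(GDPperCapita, infantMortality)) +
  geom_point() +
  geom_smooth(method = "lm", se = FALSE) +
  labs(title = "Original scale")

# Log-log scale: the same data and fit, now much closer to linear
p_log <- p_raw +
  scale_x_log10() +
  scale_y_log10() +
  labs(title = "Log-log scale")

grid.arrange(p_raw, p_log, ncol = 2) # gridExtra is loaded above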

The log transformation has straightened out the data, allowing us to fit a more useful regression line. However, we’re not limited to just a log transformation. We can use an entire family of transformations called Tukey’s Ladder of Powers.

30.2 Introduction to Tukey’s Ladder of Powers

John Tukey suggested this simple toolkit, like a set of drill bits of varying sizes, to modify the shape of distributions and the shape of relationships between variables. The basic idea stems from the fact that functions of the form \(y' = y^p\), for \(y > 0\), have a graph that is concave up if \(p > 1\) and concave down if \(0 < p < 1\); for \(p < 0\) the ladder uses \(y' = -y^p\), which is increasing and again concave down.

Here are some examples to illustrate these concepts:

Concave Up (\(y= x^2\)): This plot shows a parabolic curve opening upwards, illustrating the concept of a concave up graph where the slope of the tangent line increases as \(x\) increases.

Concave Down (\(y= -x^2\)): Conversely, this plot shows a parabolic curve opening downwards, illustrating a concave down graph where the slope of the tangent line decreases as \(x\) increases.
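Both curves are easy to reproduce (a minimal sketch with base graphics):

# The two parabolas described above
curve(x^2, from = -2, to = 2, ylab = "y", main = "Concave up: y = x^2")
curve(-x^2, from = -2, to = 2, ylab = "y", main = "Concave down: y = -x^2")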

We can use this information to modify the shape of a distribution or the shape of a relationship between variables by using a power transformation (\(p\)) on our data.

30.2.1 Mathematical Formulation of Tukey’s Ladder of Powers

To understand Tukey’s Ladder of Powers, let’s start with the general form of the transformation:

\[y' = \frac{y^p - 1}{p}\]

where \(p\) is the power parameter. This formula lets us adjust the shape of the data distribution: powers below 1 (including the logarithm at \(p = 0\) and the negative powers) compress the upper tail and so help reduce right (positive) skewness, while powers above 1 stretch the upper tail and help reduce left (negative) skewness.
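A quick numeric illustration (a sketch applying the formula directly; the pow() function defined below wraps the same computation):

y <- c(1, 10, 100, 1000) # a long right tail: gaps of 9, 90, and 900
(y^0.5 - 1) / 0.5        # p = 0.5 compresses the upper tail
log(y)                   # p = 0 compresses it even more
(y^2 - 1) / 2            # p = 2 stretches the upper tail instead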

Special Case: \(p = 0\)

When \(p = 0\), the transformation simplifies to a logarithmic transformation. This is because as \(p\) approaches 0, the formula \(\frac{y^p - 1}{p}\) approaches \(\ln y\). This can be understood using a concept from calculus called l’Hôpital’s rule, which is used to evaluate limits of indeterminate forms. For our purposes, we can state that:

\[\lim_{p \to 0} \frac{y^p - 1}{p} = \lim_{p \to 0} \frac{\frac{d}{dp}\left(y^p - 1\right)}{\frac{d}{dp}\, p} = \lim_{p \to 0}\, e^{\, p \ln y} \ln y = \ln y\]

To make this concept more intuitive, imagine \(y^p\) as a function that gets closer and closer to 1 as \(p\) gets closer to 0. The subtraction of 1 and division by \(p\) in the formula adjust this function to reflect the logarithmic behavior.
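We can also check the limit numerically (a small sketch; the choice y = 5 is arbitrary):

y <- 5
sapply(c(1, 0.1, 0.01, 0.001), function(p) (y^p - 1) / p) # approaches log(5)
log(y) # the limiting value, about 1.609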

Negative Powers

For negative values of \(p\), the raw power \(y^p\) is a decreasing function of \(y\), so Tukey’s ladder uses \(y' = -y^p\) instead, which is increasing and concave down. The scaled form \(\frac{y^p - 1}{p}\) achieves the same thing automatically, because dividing by a negative \(p\) flips the sign. The concave-down shape pulls in large values, which helps these transformations handle extreme values in a long right tail.
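A short demonstration of why the sign matters (a sketch):

y <- c(1, 2, 4, 8)
y^(-1)            # decreasing: a raw negative power reverses the order of the data
-y^(-1)           # increasing again, and concave down
(y^(-1) - 1) / -1 # the scaled form: dividing by the negative p flips the sign for us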

30.2.2 Defining the Transformation Function in R

We define a function that produces this transformation. A simple way to define it is:

pow <- function(y, p) {
  if (p == 0) {
    log(y)
  } else {
    (y^p - 1) / p
  }
}

# Testing the transformation with a sequence of x values
x <- seq(-1, 3, by = .5)

# Applying the transformation for different powers of p
x_transformed <- tibble(
  x = x,
  `p=2` = pow(x, 2),
  `p=0` = pow(x, 0),
  `p=-1` = pow(x, -1)
)
#> Warning in log(y): NaNs produced


# Visualizing the transformed data
x_transformed_long <- pivot_longer(x_transformed, cols = starts_with("p"), names_to = "Power", values_to = "Transformed")

ggplot(x_transformed_long, aes(x = x, y = Transformed, color = Power)) +
  geom_line() +
  labs(title = "Visualization of Power Transformations", x = "Original x", y = "Transformed x") +
  theme_minimal()
#> Warning: Removed 2 rows containing missing values or values outside the scale range
#> (`geom_line()`).



# test:
x <- seq(-1, 3,by =  .5)
x # note that these transformations are really intended for y > 0
#> [1] -1.0 -0.5  0.0  0.5  1.0  1.5  2.0  2.5  3.0
pow(x, 2)
#> [1]  0.000 -0.375 -0.500 -0.375  0.000  0.625  1.500  2.625  4.000


plot(exp) # easy plotting of a function

plot(function(x) pow(x, p = 2)) # anonymous function or 'lambda'

plot(function(x) pow(x, p = .5), xlim = c(0, 3))

However, this approach works correctly only for a single value of \(p\): the statement if (p == 0) tests only the first element of p (and, since R 4.2.0, a condition of length greater than one is an error).
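A quick illustration of the failure mode (a sketch; the exact behaviour depends on your version of R):

# With the scalar if() above, a vector p is mishandled:
# R >= 4.2.0 stops with an error; older versions silently use only p[1]
try(pow(2, c(0, 2))) # intended result: c(log(2), 1.5)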

30.3 Vectorizing a Function

Most operators in R are vectorized so they work element-wise when their arguments are vectors. When the arguments have incompatible lengths, the shorter argument is recycled to have the same length as the longer one. That is why the following produces sensible results:

z <- c(3, 5, 9)
z + c(1, 1, 1)
#> [1]  4  6 10
z + 1 # 1 is recycled so the result is equivalent to the previous line
#> [1]  4  6 10
z + c(1, 2, 3)
#> [1]  4  7 12
z + c(1, 2) # recycles but gives a warning
#> Warning in z + c(1, 2): longer object length is not a multiple of shorter
#> object length
#> [1]  4  7 10
z + z
#> [1]  6 10 18
z^2
#> [1]  9 25 81
z^z
#> [1] 2.70e+01 3.12e+03 3.87e+08

We can use ifelse(), which operates elementwise on a vector instead of on a single value.

pow <- function(y, p) {
  p <- rep(p, length.out = length(y))
  y <- rep(y, length.out = length(p))
  ifelse(p == 0, log(y), (y^p - 1) / p)
}

# Alternative: purrr::map2() applies pow() to each (y, p) pair, returning a list
vectorized_pow <- function(y, p) {
  map2(y, p, pow)
}
# test:
pow(-1:4, c(2, 0, -1, 1, 3))
#> Warning in log(y): NaNs produced
#> [1] 0.00 -Inf 0.00 1.00 8.67 7.50
pow(-1:4, 2)
#> [1]  0.0 -0.5  0.0  1.5  4.0  7.5

The warning above occurs because ifelse() evaluates both branches over the entire vector, so log() still sees every value of y (including the negative ones) before the unwanted results are discarded. With a bit more work, we can avoid these unnecessary evaluations:

pow <- function(y, p) {
  p <- rep(p, length.out = length(y))
  y <- rep(y, length.out = length(p))
  y[p == 0] <- log(y[p == 0])
  y[p != 0] <- (y[p != 0]^p[p != 0] - 1) / p[p != 0]
  y
}

# Test:

pow(1:10, 0) == log(1:10)
#>  [1] TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE
pow(1:10, -1)
#>  [1] 0.000 0.500 0.667 0.750 0.800 0.833 0.857 0.875 0.889 0.900
pow(1:10, .5)
#>  [1] 0.000 0.828 1.464 2.000 2.472 2.899 3.292 3.657 4.000 4.325
pow(1:10, -1:8)
#>  [1] 0.00e+00 6.93e-01 2.00e+00 7.50e+00 4.13e+01 3.24e+02 3.36e+03 4.37e+04
#>  [9] 6.83e+05 1.25e+07

Let’s plot this transformation for a range of values of \(p\). expand.grid() returns a data frame whose rows consist of the Cartesian product (i.e., all possible combinations) of its arguments.

expand.grid(a = c("A", "B"), x = 1:3)
#>   a x
#> 1 A 1
#> 2 B 1
#> 3 A 2
#> 4 B 2
#> 5 A 3
#> 6 B 3


dd <- expand.grid(y = seq(.01, 3, .01), p = c(-2, -1, -.5, 0, .5, 1, 2, 3))
dim(dd)
#> [1] 2400    2
head(dd)
#>      y  p
#> 1 0.01 -2
#> 2 0.02 -2
#> 3 0.03 -2
#> 4 0.04 -2
#> 5 0.05 -2
#> 6 0.06 -2
some(dd) # 10 rows at random
#>         y    p
#> 125  1.25 -2.0
#> 497  1.97 -1.0
#> 1180 2.80  0.0
#> 1225 0.25  0.5
#> 1504 0.04  1.0
#> 1523 0.23  1.0
#> 1653 1.53  1.0
#> 1876 0.76  2.0
#> 2168 0.68  3.0
#> 2190 0.90  3.0


dd$yval <- with(dd, pow(y, p))
xyplot(yval ~ y | factor(p), dd, type = "l")

xyplot(yval ~ y | factor(p), dd, type = "l", ylim = c(-2, max(dd$yval)))

xyplot(yval ~ y, dd, groups = p, type = "l", ylim = c(-2, max(dd$yval)))

xyplot(yval ~ y, dd,
  groups = p,
  type = "l",
  xlim = c(0, 3),
  ylim = c(-2, max(dd$yval))
)

# gd() appears to come from the 'spida2' package, where it sets lattice colours
# and line widths (here, 8 colours); we keep the default theme instead
# gd(8, lwd = 2)

xyplot(yval ~ y, dd,
  groups = p,
  type = "l",
  auto.key = list(space = "right", lines = T, points = F),
  xlim = c(0, 3),
  ylim = c(-2, max(dd$yval))
)

It’s much better to have the legend in the same order as the lines in the graph. We can turn p into a factor and reverse its order.

dd$po <- factor(dd$p)
dd$po <- reorder(dd$po, -dd$p)
xyplot(yval ~ y, dd,
  groups = po,
  type = "l",
  auto.key = list(space = "right", lines = T, points = F, title = "power"),
  xlim = c(0, 3),
  ylim = c(-2, max(dd$yval))
)

From quantile plots we can examine the marginal distributions of the variables. Note that xqplot() comes from the spida2 package (on GitHub at gmonette/spida2), so load that package first.

Uniform quantiles:

xqplot(un)

Normal quantiles:

xqplot(un, ptype = "normal")
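If spida2 is not available, the already-loaded car package can draw normal quantile plots one variable at a time; a minimal substitute:

# car::qqPlot adds a reference line and a pointwise confidence envelope
qqPlot(un$infantMortality, ylab = "infantMortality")
qqPlot(un$GDPperCapita, ylab = "GDPperCapita")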

We see that none of the numeric variables have normal distributions.

  • ‘infantMortality’ and ‘GDPperCapita’ are strongly skewed to the right
  • ‘tfr’ (total fertility rate) is skewed to the right
  • ‘lifeMale’ and ‘lifeFemale’ are skewed to the left
  • ‘illiteracyMale’ and ‘illiteracyFemale’ are skewed to the right

Note that the fact that \(x\) or \(y\) variables are not normal does not mean that the conditional distribution of \(y\) given \(x\) is not normal.

Let’s explore infant mortality as a function of GDP per capita.

library(latticeExtra)
un %>%
  xyplot(infantMortality ~ GDPperCapita, .) +
  layer(panel.loess(..., lwd = 2))


# Scatterplot showing curvature in the relationship between
# infant mortality and GDP per capita

trellis.focus() # focus on the panel of the last lattice plot
panel.identify(labels = rownames(un)) # interactive: click points to label countries
#> integer(0)
trellis.unfocus()


un %>%
  xyplot(log(infantMortality) ~ GDPperCapita | region, .) +
  layer(panel.loess(..., lwd = 2))


un %>% subset(country %in% c("United.States", "Canada")) # compare Canada and the United States
#>                region  tfr contraception educationMale educationFemale lifeMale
#> Canada        America 1.61            66          17.2            17.8     76.1
#> United.States America 1.96            71          15.4            16.2     73.4
#>               lifeFemale infantMortality GDPperCapita economicActivityMale
#> Canada              81.8               6        18943                 72.4
#> United.States       80.1               7        26037                 74.9
#>               economicActivityFemale illiteracyMale illiteracyFemale
#> Canada                          57.6             NA               NA
#> United.States                   59.3           2.24             2.23
#>                     country
#> Canada               Canada
#> United.States United.States

The commented-out code below sketches a parallel analysis of the SLID data (the relationship between wage and education, and the heteroskedasticity in wage as a function of education), along with p3d’s interactive 3D view of the UN data.

# library(p3d)

# slid %>%
#  xyplot(sqrt(wage) ~ education, .) +
#  layer(panel.loess(...))

# Init3d()
# Plot3d(log(infantMortality) ~ GDPperCapita + lifeFemale | region, un)
# Id3d()
# Id3d('United.States')
# Id3d('Canada')
# rownames(un)
# names(un)

30.4 Box-Cox Transformation

The Box-Cox transformation uses the same scaled power family, but chooses the exponent \(\lambda\) by maximum likelihood under an assumption of normality. The video below was made by math et al. I like their channel and found this video to be a good one.
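A minimal sketch of estimating the Box-Cox power \(\lambda\) for our running example, assuming the MASS package is installed (car’s powerTransform() is an alternative):

library(MASS)
# Profile log-likelihood over lambda for a simple regression;
# the peak and its confidence interval suggest a power for infantMortality
boxcox(lm(infantMortality ~ GDPperCapita, data = un))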