30 ODD: Data Transformations and Tukey’s Ladder of Powers

This optional deep dive covers data transformations, with a special focus on Tukey’s Ladder of Powers, an invaluable tool for reshaping data distributions and relationships between variables. Our journey through this topic is inspired by Fox (2016) (ch. 4, pp. 28 - 80)

30.1 Transforming Data: Tukey’s Ladder of Powers

John Tukey introduced a methodological toolkit, likened to a set of drill bits of varying sizes, for modifying the shapes of distributions and relationships between variables. This section delves into these transformations, particularly focusing on the mathematical formulations that underpin this approach.

30.1.1 Dataset Preparation and Visualization

We’ll be using data from Fox (2016). You may need to download the data manually and save it in the “data” directory within your current working directory. The dataset is available here. You can use this r code download.file("http://socserv.socsci.mcmaster.ca/jfox/Books/Applied-Regression-3E/datasets/UnitedNations.txt"), "data/UnitedNations.txt") to download the data.

#if (!require("p3d")) install.packages("p3d", repos = "http://R-Forge.R-project.org")
library(p3d)
library(car)
library(latticeExtra)
library(gridExtra)
# read data
un <- read.table("data/UnitedNations.txt", header = TRUE)%>%
  mutate(country = rownames(.)) # Assigning country names

Let’s visualize an attempt to make our regression behave. Below are the original and log-transformed relationships with ggplot2. You’ve likely used a log-transformation before, but we’ll be exploring the more general family of transformations.

#> `geom_smooth()` using formula = 'y ~ x'
#> `geom_smooth()` using formula = 'y ~ x'

It’s clear that the log-transformation has straightened out the data, allowing us to fit a more useful regression line. However, we’re not limited to just a log transformation. We can use an entire family of transformations to modify our data. This family of transformations is called Tukey’s Ladder of Powers.

John Tukey suggested this simple toolkit, like a set drill bits of varying sizes, to modify the shape of distributions and the shape of relationships between variables. The basic idea stems from the fact that functions of the form \(y' = y^p, \quad y > 0\) have a graph that is concave up if \(p >1\), and concave down if \(0<p<1\) (or \(p < 0\)).

Concave Up (\(y= x^2\)): This plot shows a parabolic curve opening upwards, illustrating the concept of a concave up graph where the slope of the tangent line increases as \(x\) increases.

Concave Down (\(y= -x^2\)): Conversely, this plot shows a parabolic curve opening downwards, illustrating a concave down graph where the slope of the tangent line decreases as \(x\) increases.

We can use this information to modify the shape of a distribution or the shape of a relationship between variables by using a power transformation ( \(p\)) on our data. This family of transformations is called Tukey’s Ladder of Powers.

Notably, \(p < 0\), \(y' = - y^p, \quad y > 0\), is also concave down.

Now, this graph leaves out \(p=0\) but we will see shortly that \(y' = \ln y\) is the sensible transformation that corresponds to \(p = 0\).

Now, we standardize the family of power transformations so that they have the value 0 when \(y = 1\) and so their derivative is equal to 1 when \(y=1\).

For \(p \ne 0\), this yields \[y' = \frac{y^p - 1}{p}\]

Note that, by l’Hôpital’s rule:

\[\lim_{p \to 0} \frac{y^p - 1}{p}= \lim_{p \to 0}\, e^{\, p \ln y} \ln y = \ln y\]

We define a function that produces this transformation. The easy way to define it is:

pow <- function(y, p) {
  if (p == 0) {
    log(y)
  } else {
    (y^p - 1) / p
  }
}

# Testing the transformation with a sequence of x values
x <- seq(-1, 3, by = .5)

# Applying the transformation for different powers of p
x_transformed <- tibble(
  x = x,
  `p=2` = pow(x, 2),
  `p=0` = pow(x, 0),
  `p=-1` = pow(x, -1)
)
#> Warning in log(y): NaNs produced


# Visualizing the transformed data
x_transformed_long <- pivot_longer(x_transformed, cols = starts_with("p"), names_to = "Power", values_to = "Transformed")

ggplot(x_transformed_long, aes(x = x, y = Transformed, color = Power)) +
  geom_line() +
  labs(title = "Visualization of Power Transformations", x = "Original x", y = "Transformed x")
#> Warning: Removed 2 rows containing missing values or values outside the scale range
#> (`geom_line()`).



# test:
x <- seq(-1, 3,by =  .5)
x # note that these transformations are really intended for y > 0
#> [1] -1.0 -0.5  0.0  0.5  1.0  1.5  2.0  2.5  3.0
pow(x, 2)
#> [1]  0.000 -0.375 -0.500 -0.375  0.000  0.625  1.500  2.625  4.000
pow(x, 0) %>%
  name(x)
#> Error in name(., x): could not find function "name"
pow(x, -1) %>%
  name(x) %>%
  cbind()
#> Error in name(., x): could not find function "name"

plot(exp) # easy plotting of a function

plot(function(x) pow(x, p = 2)) # anonymous function or 'lambda'

plot(function(x) pow(x, p = .5), xlim = c(0, 3))

But this approach has the disadvantage that it works correctly only for a single value of \(p\) because the statement if(p == 0) only tests the first element of p.

30.2 Vectorizing a function

Most operators in R are vectorized so they work element-wise when their arguments are vectors. When the arguments have incompatible lengths, the shorter argument is recycled to have the same length as the longer one. That is why the following produces sensible results:

z <- c(3, 5, 9)
z + c(1, 1, 1)
#> [1]  4  6 10
z + 1 # 1 is recycled so the result is equivalent to the previous line
#> [1]  4  6 10
z + c(1, 2, 3)
#> [1]  4  7 12
z + c(1, 2) # recycles but gives a warning
#> Warning in z + c(1, 2): longer object length is not a multiple of shorter
#> object length
#> [1]  4  7 10
z + z
#> [1]  6 10 18
z^2
#> [1]  9 25 81
z^z
#> [1] 2.70e+01 3.12e+03 3.87e+08

We can use ifelse which works on a vector instead of a single value.

pow <- function(y, p) {
  p <- rep(p, length.out = length(y))
  y <- rep(y, length.out = length(p))
  ifelse(p == 0, log(y), (y^p - 1) / p)
}

# To apply the function over vectors of y and p, ensuring vectorized operations:
vectorized_pow <- function(y, p) {
  map2(y, p, pow)
}
# test:
pow(-1:4, c(2, 0, -1, 1, 3))
#> Warning in log(y): NaNs produced
#> [1] 0.00 -Inf 0.00 1.00 8.67 7.50
pow(-1:4, 2)
#> [1]  0.0 -0.5  0.0  1.5  4.0  7.5

With a bit more work, we can avoid unnecessary evaluations:

pow <- function(y, p) {
  p <- rep(p, length.out = length(y))
  y <- rep(y, length.out = length(p))
  y[p == 0] <- log(y[p == 0])
  y[p != 0] <- (y[p != 0]^p[p != 0] - 1) / p[p != 0]
  y
}

# Test:

pow(1:10, 0) == log(1:10)
#>  [1] TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE
pow(1:10, -1)
#>  [1] 0.000 0.500 0.667 0.750 0.800 0.833 0.857 0.875 0.889 0.900
pow(1:10, .5)
#>  [1] 0.000 0.828 1.464 2.000 2.472 2.899 3.292 3.657 4.000 4.325
pow(1:10, -1:8)
#>  [1] 0.00e+00 6.93e-01 2.00e+00 7.50e+00 4.13e+01 3.24e+02 3.36e+03 4.37e+04
#>  [9] 6.83e+05 1.25e+07

Let’s plot this transformation for a range of values of \(p\). The value of expand.grid is a data frame whose rows consist of the Cartesian product (i.e. all possible combinations) of its arguments.

expand.grid(a = c("A", "B"), x = 1:3)
#>   a x
#> 1 A 1
#> 2 B 1
#> 3 A 2
#> 4 B 2
#> 5 A 3
#> 6 B 3


dd <- expand.grid(y = seq(.01, 3, .01), p = c(-2, -1, -.5, 0, .5, 1, 2, 3))
dim(dd)
#> [1] 2400    2
head(dd)
#>      y  p
#> 1 0.01 -2
#> 2 0.02 -2
#> 3 0.03 -2
#> 4 0.04 -2
#> 5 0.05 -2
#> 6 0.06 -2
some(dd) # 10 rows at random
#>         y    p
#> 115  1.15 -2.0
#> 332  0.32 -1.0
#> 363  0.63 -1.0
#> 564  2.64 -1.0
#> 838  2.38 -0.5
#> 1182 2.82  0.0
#> 1320 1.20  0.5
#> 1548 0.48  1.0
#> 2207 1.07  3.0
#> 2255 1.55  3.0


dd$yval <- with(dd, pow(y, p))
xyplot(yval ~ y | factor(p), dd, type = "l")

xyplot(yval ~ y | factor(p), dd, type = "l", ylim = c(-2, max(dd$yval)))

xyplot(yval ~ y, dd, groups = p, type = "l", ylim = c(-2, max(dd$yval)))

xyplot(yval ~ y, dd,
  groups = p,
  type = "l",
  xlim = c(0, 3),
  ylim = c(-2, max(dd$yval))
)

gd(8, lwd = 2) # number of colours needed
#> Error in gd(8, lwd = 2): could not find function "gd"
xyplot(yval ~ y, dd,
  groups = p,
  type = "l",
  xlim = c(0, 3),
  ylim = c(-2, max(dd$yval))
)

xyplot(yval ~ y, dd,
  groups = p,
  type = "l",
  auto.key = list(space = "right", lines = T, points = F),
  xlim = c(0, 3),
  ylim = c(-2, max(dd$yval))
)

It’s much better to have the legend in the same order as the lines in the graph. We can turn p into a factor and reverse its order.

dd$po <- factor(dd$p)
dd$po <- reorder(dd$po, -dd$p)
xyplot(yval ~ y, dd,
  groups = po,
  type = "l",
  auto.key = list(space = "right", lines = T, points = F, title = "power"),
  xlim = c(0, 3),
  ylim = c(-2, max(dd$yval))
)

From quantile plots:

Uniform quantiles…

xqplot(un)
#> Error in xqplot(un): could not find function "xqplot"

Normal quantiles

xqplot(un, ptype = "normal")
#> Error in xqplot(un, ptype = "normal"): could not find function "xqplot"

We see that none of the numeric variables have normal distributions.

  • ‘age’ is somewhat platykurtic compared with a normal
  • ‘compositeHourlyWages’ has both a categorical (0) and a continuous component
  • ‘education’ is also platykurtic
  • ‘working’ is dichotomous
  • ‘familyIncome’ is skewed to the right

Note that the fact that \(x\) or \(y\) variables are not normal does not mean that the conditional distribution of \(y\) given \(x\) is not normal.

Let’s explore wages of working women as a function of education.

library(latticeExtra)
un %>%
  xyplot(infantMortality ~ GDPperCapita, .) +
  layer(panel.loess(..., lwd = 2))


# Scatterplot showing curvature in relationship

trellis.focus()
panel.identify(labels = rownames(un))
#> integer(0)
trellis.unfocus()


un %>%
  xyplot(log(infantMortality) ~ GDPperCapita | region, .) +
  layer(panel.loess(..., lwd = 2))


un %>% subset(country %in% c("United.States", "Canada"))
#>                region  tfr contraception educationMale educationFemale lifeMale
#> Canada        America 1.61            66          17.2            17.8     76.1
#> United.States America 1.96            71          15.4            16.2     73.4
#>               lifeFemale infantMortality GDPperCapita economicActivityMale
#> Canada              81.8               6        18943                 72.4
#> United.States       80.1               7        26037                 74.9
#>               economicActivityFemale illiteracyMale illiteracyFemale
#> Canada                          57.6             NA               NA
#> United.States                   59.3           2.24             2.23
#>                     country
#> Canada               Canada
#> United.States United.States

between wage and education, and heteroskedasticity in wage as a function of education.

# library(p3d)

# slid %>%
#  xyplot(sqrt(wage) ~ education, .) +
#  layer(panel.loess(...))

# Init3d()
# Plot3d(log(infantMortality) ~ GDPperCapita + lifeFemale | region, un)
# Id3d()
# Id3d('United.States')
# Id3d('Canada')
# rownames(un)
# names(un)

30.3 Box Cox Transformation

This video was made by math et al. I like their channel and found this video to be a good one.