73 Lab: Modeling professor attractiveness and course evaluations

Why are hot professors “better” teachers?

At the end of most college courses, students are asked to evaluate the class and the instructor—usually anonymously, often hastily, sometimes with one hand already on the doorknob. These are often used to assess instructor effectiveness, allocate merit raises, and sometimes even decide whether people keep their jobs.

But are course evaluations actually measuring teaching quality? Or are they picking up on other things—like a professor’s appearance?

In a now-classic economics paper, Daniel Hamermesh and Amy Parker looked at whether professors who are considered more physically attractive get higher evaluation scores. The short answer? Yeah, they do. You can read their study here.¹ The dataset we'll use comes from a slightly modified version of the replication data included with Data Analysis Using Regression and Multilevel/Hierarchical Models by Gelman and Hill.

In this lab, you’ll explore that dataset—focusing on one predictor at a time—to get a feel for how linear models behave, how to interpret them, and how to visualize their results. Along the way, you’ll also get a preview of just how messy “evaluation” can be when the outcome depends on variables that have nothing to do with teaching.

Packages

We’ll use tidyverse, openintro, and broom to wrangle, model, and tidy up our regression output.

library(tidyverse)
library(broom)
library(openintro)

The data

The dataset we’ll be using is called evals from the openintro package. Take a peek at the codebook with ?evals.
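
If you want a quick overview of the variables before digging into the codebook, one option (assuming the packages above are loaded) is:

# A quick look at the variables and their types
glimpse(evals)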

Exercises

Part 1: Getting to know the outcome

Before we model anything, let’s look at what students actually do with the evaluation scale.

  1. Visualize the distribution of score. Is the distribution skewed? Are students generous? Harsh? Do most scores pile up near the top? Include at least one plot and a couple of numerical summaries, and tell me what you notice.
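
If you're not sure where to start, here is a minimal sketch of one possible approach; the binwidth and the particular summaries are just suggestions.

# One way to look at the distribution of score
ggplot(evals, aes(x = score)) +
  geom_histogram(binwidth = 0.25)

# A few numerical summaries to go with the plot
evals |>
  summarise(
    mean_score   = mean(score),
    median_score = median(score),
    sd_score     = sd(score)
  )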

Now let’s introduce the main character in this dataset: bty_avg. This variable is a professor’s average beauty rating.

  2. Create a scatterplot of score versus bty_avg. Trend? Noise? Clusters? Total chaos? Don’t overthink it; just describe what you see.

Hint: If you need help with the plotting functions, see the ggplot2 reference index at http://ggplot2.tidyverse.org/reference/index.html.

  3. Recreate your scatterplot, but use geom_jitter() instead of geom_point(). What does jittering fix here, and what would someone misunderstand if they only saw the non-jittered plot?
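
Here is a sketch of both versions of the plot so you can compare them directly; tweak the aesthetics however you like.

# The plain scatterplot
ggplot(evals, aes(x = bty_avg, y = score)) +
  geom_point()

# The same plot with jittering
ggplot(evals, aes(x = bty_avg, y = score)) +
  geom_jitter()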

Part 2: Beauty as a predictor

Recall: a linear model has the form \(\hat{y} = b_0 + b_1 x\).

Let’s see if the apparent trend in the plot is something more than natural variation.

  1. Fit a linear model called m_bty to predict average professor evaluation score by average beauty rating (bty_avg). Write the fitted regression equation, and then interpret the slope in plain language: What does a one-point increase in beauty predict for evaluation scores?

  2. Replot your existing visualization, this time adding a regression line in orange and turning off the shading that appears around the line by default. (See the sketch after this list for one way to do this.)

  3. Then, stepping back, interpret what this model is really saying:

  • How much do evaluation scores change with beauty ratings (slope)?
  • What does the intercept represent here? Is it meaningful in this context, or just a mathematical artifact? Explain your reasoning.
  • What is the \(R^2\) value of this model? Interpret it in context. What does it tell you about whether beauty is a “big deal” or a pretty minor piece of the story?
  • That shading you turned off represents uncertainty around the line—why might it be easy to read too much into it at this stage?
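
Here is one way to approach the model fit and the plot from the exercises above; it's a sketch, not the only answer. The orange color and the se = FALSE choice come straight from the instructions.

# Fit the single-predictor model and look at its coefficients and R-squared
m_bty <- lm(score ~ bty_avg, data = evals)
tidy(m_bty)
glance(m_bty)$r.squared

# Scatterplot with the regression line in orange and no shading
ggplot(evals, aes(x = bty_avg, y = score)) +
  geom_jitter() +
  geom_smooth(method = "lm", se = FALSE, color = "orange")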

Part 3: Linear regression with a categorical predictor

Let’s switch gears from numeric predictors to categorical ones. Beauty scores might be (somewhat) continuous, but characteristics like gender and rank are categorical, meaning they fall into distinct groups.

We’ll start by seeing whether evaluation scores differ by gender.

m_gen <- lm(score ~ gender, data = evals)
tidy(m_gen)
#> # A tibble: 2 × 5
#>   term        estimate std.error statistic p.value
#>   <chr>          <dbl>     <dbl>     <dbl>   <dbl>
#> 1 (Intercept)    4.09     0.0387    106.   0      
#> 2 gendermale     0.142    0.0508      2.78 0.00558
  1. Take a look at the model output. What’s the reference level? What do the coefficients tell you about how evaluation scores differ between male and female professors? Write the predicted mean evaluation score for each gender.

Now let’s do one more categorical predictor: rank.

Actually, let’s do three slightly different versions of this model to see how changing the reference level or regrouping categories affects the output.

Create a new variable from rank called rank_relevel where "tenure track" is the baseline level. Create another new variable called tenure_eligible that labels "teaching" faculty as "no" and labels "tenure track" and "tenured" faculty as "yes".
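
Here is a sketch of one way to create those two variables (fct_relevel() comes from forcats, which loads with the tidyverse):

# Add the releveled and regrouped versions of rank to the data
evals <- evals |>
  mutate(
    rank_relevel    = fct_relevel(rank, "tenure track"),
    tenure_eligible = if_else(rank == "teaching", "no", "yes")
  )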

  2. Fit three different linear models to explore how rank affects evaluation scores. Each model should use a different version of the rank variable (rank, rank_relevel, and tenure_eligible) to predict average professor evaluation score. (A sketch appears after this list.)

  3. Based on the regression outputs, interpret how teaching faculty and tenured faculty differ from the baseline. (Hint: interpret the slopes and intercepts for all three models in the context of the data.)

  4. Report the \(R^2\). Does rank explain much, or not really?
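
A sketch of the three fits described above, assuming you created rank_relevel and tenure_eligible as shown earlier (the model names are placeholders):

# One model per version of the rank variable
m_rank            <- lm(score ~ rank,            data = evals)
m_rank_relevel    <- lm(score ~ rank_relevel,    data = evals)
m_tenure_eligible <- lm(score ~ tenure_eligible, data = evals)

# Inspect one of them; repeat for the others
tidy(m_rank)
glance(m_rank)$r.squared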

Part 4: Multiple linear regression

Now we’re going to ask a more interesting question:

Is the “beauty effect” still there once we account for gender?

Fit these two models:

  • m_bty: score ~ bty_avg

  • m_bty_gen: score ~ bty_avg + gender
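
For example, a sketch of the two fits and the pieces you'll need for the questions below:

# m_bty is the same model as in Part 2; m_bty_gen adds gender
m_bty     <- lm(score ~ bty_avg,          data = evals)
m_bty_gen <- lm(score ~ bty_avg + gender, data = evals)

# Compare the beauty slopes and the adjusted R-squared values
tidy(m_bty)
tidy(m_bty_gen)
glance(m_bty)$adj.r.squared
glance(m_bty_gen)$adj.r.squared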

Then answer:

  1. What changes in the beauty slope when gender is added?

  2. For two professors with the same beauty rating, does gender still shift the predicted score?

  3. Compare the adjusted \(R^2\) values. Is gender actually helping much, or is beauty doing most of the work already?

Finally, swap gender out and add rank instead.

  4. Fit m_bty_rank: score ~ bty_avg + rank and then interpret one rank coefficient and the beauty slope in context.
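
The fit follows the same pattern as before, for example:

# Beauty plus rank as predictors
m_bty_rank <- lm(score ~ bty_avg + rank, data = evals)
tidy(m_bty_rank)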

Part 5: The search for the best model

Going forward, only consider the following variables as potential predictors: rank, ethnicity, gender, language, age, cls_perc_eval, cls_did_eval, cls_students, cls_level, cls_profs, cls_credits, bty_avg.

  1. Which variable, on its own, would you expect to be the worst predictor of evaluation scores? Why? Hint: Think about which variable you would expect to have no association with the professor’s score.

  2. Check your suspicions from the previous exercise. Include the model output for that variable in your response.

  3. Suppose you wanted to fit a full model with the variables listed above. If you are already going to include cls_perc_eval and cls_students, which variable should you not include as an additional predictor? Why?

  4. Fit a full model with all of the predictors listed above, except for the one you decided to exclude in the previous question. (A sketch of a possible starting point appears after this list.)

  5. Using backward selection with adjusted \(R^2\) as the selection criterion, determine the best model. You do not need to show all steps in your answer, just the output for the final model. Also, write out the linear model for predicting score based on the final model you settle on.

  6. Interpret the slopes of one numerical and one categorical predictor based on your final model.

  7. Based on your final model, describe the characteristics of a professor and course at University of Texas at Austin that would be associated with a high evaluation score.

  8. Would you be comfortable generalizing your conclusions to apply to professors generally (at any university)? Why or why not?
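
Here is a sketch of a possible starting point for the full model and one backward-selection step. It assumes, purely for illustration, that cls_did_eval is the variable you chose to exclude and that cls_profs is the first candidate you consider dropping; substitute your own choices.

# Full model, assuming cls_did_eval was the excluded variable
m_full <- lm(
  score ~ rank + ethnicity + gender + language + age + cls_perc_eval +
    cls_students + cls_level + cls_profs + cls_credits + bty_avg,
  data = evals
)
glance(m_full)$adj.r.squared

# One backward-selection step: drop a candidate predictor (here, cls_profs)
# and check whether the adjusted R-squared goes up
m_step <- lm(
  score ~ rank + ethnicity + gender + language + age + cls_perc_eval +
    cls_students + cls_level + cls_credits + bty_avg,
  data = evals
)
glance(m_step)$adj.r.squared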


  1. Daniel S. Hamermesh and Amy Parker, Beauty in the classroom: Instructors’ pulchritude and putative pedagogical productivity, Economics of Education Review, Volume 24, Issue 4, August 2005, Pages 369–376, ISSN 0272-7757, https://doi.org/10.1016/j.econedurev.2004.07.013.