Reliability coefficients are for squares

W. Joel Schneider

The more reliable a score is, the more certain we can be about what it means (provided its validity is close to its reliability). Certain rules-of-thumb about score reliability are sometimes proposed:

Base high-stakes decisions only on scores with reliability coefficients of 0.98 or better.
Base substantive interpretations on scores with reliability coefficients of 0.90 or better.
Base decisions to give more tests or not on scores with reliability coefficients of 0.80 or more.

Such guidelines seem reasonable to me, but I do not find reliability coefficients to be intuitively easy to understand. How much uncertainty is associated with a reliability coefficient of .80? The value of the coefficient (.80) is not directly informative about individual scores. Instead, it refers to the correlation the scores have with a repeated measurement.

Another way to think about the reliability coefficient is that it is a ratio of true score variance to observed score variance. In classical test theory, an observed score (X) is influenced by a reliable component, the true score (T), and also by measurement error (e). That is

X=T+e

rxx <- .9
sigma <- 15
mu <- 100
sigma_ts <- sigma * rxx
sigma_e <- sigma * sqrt(1 - rxx)

x <- seq(mu - 4 * sigma, mu + 4 * sigma, sigma / 15)
y <- dnorm(x,mu, sigma_ts)
ts <- 120
tibble(ts = ts) %>% 
  ggplot() +
  ggnormalviolin::geom_normalviolin(aes(x = ts, mu = 100, sigma = sigma_ts))

The true score is static for each person, and the error fluctuates randomly.

Because the true score and error are uncorrelated, the variance of X is the sum of the variance of T and the variance fo e:

\sigma^2_X=\sigma^2_T+\sigma^2_e

Therefore, the reliability coefficient of an observed score is the ratio of the true score variance over the total variance:

r_{XX}=\frac{\sigma^2_T}{\sigma^2_X}

In other words, what proportion of the observed score’s variability is consistent?

Okay, so what is variance? Variance is the average squared deviation from the mean. Squared quantities are not easy to think about for most of us. For this reason, I prefer to convert reliability coefficients into confidence interval widths. Confidence interval widths and reliability coefficients have a non-linear relationship:

\text{CI Width}=2z\sigma_{x}\sqrt{r_{xx}-r^2_{xx}}

Where:

z is the z-score associated with the level of confidence you want (e.g., 1.96 for a 95% confidence interval)

\sigma_{x} is the standard deviation of X

r_{xx} is the classical test theory reliability coefficient for X

For index scores (μ = 100, σ = 15), a reliability coefficient of .80 is associated with a 95% confidence interval that is 24 points wide. That to me is much more informative than knowing that 80% of the variance is reliable.

Calculating a lower and upper bounds of a confidence interval for a score looks complex with all the symbols and subscripts, but after doing it a few times, it is not so bad. Basically, compute the estimated true score (\hat{T}) and then add (or subtract) the margin of error.

\hat{T}=\mu_x+r_{xx}\left(X-\mu_x\right)

\text{CI} = \hat{T} \pm z\sigma_x\sqrt{r_{xx}-r^2_{xx}}

The interactive app in Figure 1 graph below shows the non-linear relationship between reliability and 95% confidence interval widths for different observed index scores. The confidence interval width is widest when the reliability coefficient is .5 and tapers to 0 when the reliability coefficient is 0 or 1.

The idea that the confidence interval width is zero when reliability is perfect makes sense. However, it might be counterintuitive that the confidence interval width is also zero when the reliability coefficient is zero.

How is this possible? To make sense of this, we have to remember what the true score is. It is the long term average score after repeated measurements (assuming no carryover effects). If a score has no reliable component, it is pure error. When r_{XX}=0, the score is X=T+e, where \sigma_T^2=0 and \sigma_e^2=\sigma_X^2. So the true has no variance, meaning that its mean is exactly the mean of X for everyone.

The reason that this is true is that the true score for a variable with no reliability is a constant.

#| '!! shinylive warning !!': |
#|   shinylive does not work in self-contained HTML documents.
#|   Please set `embed-resources: false` in your metadata.
#| standalone: true
#| viewerHeight: 650
library(shiny)
library(dplyr)
library(ggplot2)
library(ggtext)
library(tibble)
library(ragg)
library(scales)


ui <- fluidPage(
  sidebarLayout(
      mainPanel(
      plotOutput(outputId = "distPlot",width = "400px", height = "400px")
    ),
    
    sidebarPanel(
      div(style="display: inline-block; width: 200px;",
      sliderInput(
        inputId = "x",
        label = HTML("Score (<i>X</i>)"),
        min = 40,
        max = 160,
        value = 120,
        step = 1
      )),
      div(style="display: inline-block; width: 200px;",
          sliderInput(
        inputId = "my_rxx",
        label = shiny::HTML("Reliability (<i>r<sub>XX</sub></i>)"),
        min = 0,
        max = 1,
        value = .9,
        step = .01
      )),
      shiny::tags$br(),
      div(style="display: inline-block; width: 75px;",
      numericInput(
        inputId = "mu",
        label = "Mean",
        min = 0,
        max = 100,
        value = 100,
        step = 1
      )),
      div(style="display: inline-block; width: 75px;",
      numericInput(
        inputId = "sigma",
        label = "SD",
        min = 1,
        max = 15,
        value = 15, 
        step = 1
      ))
    )
    
  )
)
server <- function(input, output, session) {
  output$distPlot <- renderPlot({
    mu <- input$mu
    sigma <- input$sigma
    my_rxx <- input$my_rxx
    x <- input$x
    observe(updateSliderInput(session, "x", max = input$mu + input$sigma * 4, min = input$mu - input$sigma * 4))
    # mu = 100
    # sigma = 15
    # my_rxx  = .9
    # x = 120

p <- .95
z <- (x - mu)  / sigma
z_ci <- qnorm(1 - (1 - p) / 2)
rxx = c(seq(0,.019,.001),seq(.02,.98,.01), seq(.981,1,.001))
rxx_rev <- rev(rxx)
lb <- sigma * (z * rxx - z_ci * sqrt(rxx - rxx ^ 2)) + mu
ub <- sigma * (z * rxx_rev + z_ci * sqrt(rxx_rev - rxx_rev ^ 2)) + mu


my_see <- sigma * sqrt(my_rxx - my_rxx ^ 2)
my_moe <- z_ci * my_see
my_tau <- sigma * z * my_rxx  + mu
my_lb <- sigma * z * my_rxx - my_moe + mu
my_ub <- sigma * z * my_rxx + my_moe + mu

my_xhat <- (x - mu) * sqrt(my_rxx) + mu
my_see2 <- sqrt(1 - my_rxx) 
my_moe2 <- z_ci * my_see2
my_lb2 <- my_xhat - my_moe2
my_ub2 <- my_xhat + my_moe2

lb2 <- sigma * (z * sqrt(rxx) - z_ci * sqrt(1 - rxx )) + mu
ub2 <- sigma * (z * sqrt(rxx_rev) + z_ci * sqrt(1 - rxx_rev)) + mu


d_arrow <- tibble(Reliability = my_rxx, 
                  ci = c(my_lb, my_ub))



tibble(Reliability = c(rxx, rxx_rev),
       ci = c(lb, ub),
       ci2 = c(lb2, ub2)) %>% 
  ggplot(aes(Reliability, ci)) +
  geom_polygon(fill = "dodgerblue4", alpha = .2, aes(y = ci2)) +
  geom_polygon(fill = "dodgerblue3", alpha = .5) + 
  ggnormalviolin::geom_normalviolin(data = tibble(mu = my_tau, x = my_rxx, sigma = my_see), fill = "black", p_tail = .05, aes(x = x, mu = mu, sigma = sigma, width = .15, face_left = F), inherit.aes = F, color = NA, alpha = .3) +
  scale_x_continuous("Reliability Coefficient", breaks = seq(0,1,.2), labels = c("0", ".20", ".40", ".60", ".80", "1"), expand = expansion(add = .09)) + 
  scale_y_continuous("Score", breaks = -4:4 * sigma + mu, 
                     minor_breaks = seq(-4 * sigma + mu, 
                                        4 * sigma + mu,
                                        ifelse((sigma %% 3) == 0, 
                                               sigma / 3, 
                                               sigma / 2)) ) + 
  coord_fixed(ratio = 1 / (sigma * 8),
              ylim = c(-4 * sigma + mu, 4 * sigma + mu),
              clip = "off") +
  theme_minimal(16, "sans") +
  theme(panel.spacing.x = unit(5, "mm")) +
  geom_line(data = d_arrow) +
  geom_text(data = d_arrow,
            aes(label = scales::number(ci, .1), x = Reliability - .02),
            hjust = 1,
            family = "sans") +
  annotate(
    "richtext",
    x = my_rxx + .02,
    y = x,
    hjust = 0,
    label = paste0("*X* = ", x),
    color = "firebrick",
    family = "sans"
  ) +
  annotate(
    "text",
    x = my_rxx - .02,
    y = my_tau,
    hjust = 1,
    label = scales::number(my_tau, .1),
    family = "sans"
  ) +
  annotate("point", x = my_rxx, y = my_tau) +
  annotate("point", x = my_rxx, y = x, size = 3, color = "firebrick") + 
  ggtitle(paste0("Confidence Interval Width = ", scales::number(my_ub - my_lb, .1)))
  
    
  })
}
shinyApp(ui = ui, server = server)

Figure 1: The relationship of reliability coefficients and confidence interval widths.

Paying close attention to confidence intervals allows you to do away with rough rules-of-thumb about reliability and make more direct and accurate interpretations about individual scores.

Citation

BibTeX citation:

@misc{schneider2014,
  author = {Schneider, W. Joel},
  title = {Reliability Coefficients Are for Squares},
  date = {2014-01-16},
  url = {https://wjschne.github.io/AssessingPsyche/2014-01-16-rereliability-is-for-squares/},
  langid = {en}
}

For attribution, please cite this work as:

Schneider, W. J. (2014, January 16). Reliability coefficients are for squares. AssessingPsyche. https://wjschne.github.io/AssessingPsyche/2014-01-16-rereliability-is-for-squares/