12  Reliability

A psychometrically sound measure is both reliable and valid. Reliability and validity are multidimensional concepts. There are different kinds of reliability and validity, and each kind can, in principle, fall anywhere between zero and perfect.

Reliability refers to the consistency of measurements. Test validity refers to whether appropriate inferences can be drawn from test scores.

Although physicists have been able to measure some quantities with extraordinary precision, they know that perfect measurement will elude them forever. Physical quantities like length, mass, temperature, and time are always measured with at least some amount of error (Taylor, 2022, pp. 4–5). Indeed, Heisenberg (1927) showed that the act of measurement itself disturbs physical systems such that the more accurately some quantities are measured, the less precisely other aspects of the system can be known. For example, the act of measuring a particle’s position accurately makes the measurement of the particle’s momentum less certain, and vice versa.

Taylor, J. R. (2022). An introduction to error analysis: The study of uncertainties in physical measurements (Third edition). University Science Books.
Heisenberg, W. (1927). Über den anschaulichen Inhalt der quantentheoretischen Kinematik und Mechanik. Zeitschrift für Physik, 43(3–4), 172–198. https://doi.org/10.1007/BF01397280

There are no psychological measurements with anywhere near the same level of precision that is possible in the physical sciences. Measurement error is not only always present in psychological measures, it is always consequential. For example, IQ tests, whatever the controversies that surround their use, are among the most reliable measures of psychological traits we have. Yet, IQ estimates for individuals can only be narrowed confidently to an interval that is still fairly wide, usually around 10 points (two-thirds of a standard deviation).
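
To give a rough sense of where an interval of about 10 points might come from, the sketch below uses the classical formula for the standard error of measurement, SD × √(1 − reliability), which the text above has not introduced. The reliability of .97, the 95% confidence level, and the function names are assumptions chosen for illustration, not values taken from any particular test.

```python
import math


def standard_error_of_measurement(sd, reliability):
    """Classical test theory estimate: SD * sqrt(1 - reliability)."""
    return sd * math.sqrt(1.0 - reliability)


def confidence_interval(observed, sd, reliability, z=1.96):
    """Simple interval centered on the observed score (z = 1.96 for about 95%)."""
    margin = z * standard_error_of_measurement(sd, reliability)
    return observed - margin, observed + margin


# Assumed values for illustration only
sd = 15.0            # conventional standard deviation of IQ scores
reliability = 0.97   # hypothetical reliability coefficient

lower, upper = confidence_interval(100.0, sd, reliability)
width = upper - lower
print(f"95% interval: {lower:.1f} to {upper:.1f} (width {width:.1f} points)")
```

Under these assumptions the interval is roughly 10 points wide. A lower reliability coefficient would widen it considerably.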

Measurement error refers to any imperfection in measurement that causes measured values to differ from true values.

12.1 True Scores

In psychological measurement, there is no way to know the true value of what is being measured. Indeed, psychological attributes are in constant flux. Most changes occur naturally in real time, but sometimes psychological variables respond to attempts to measure them. That is, psychological measurements have carryover effects, some of which enhance performance (i.e., practice effects). If people are asked to repeat a task during testing, they can learn how to do it more accurately or more quickly. Any carryover effect that worsens performance is referred to as a fatigue effect. Sooner or later, people tire of repeated testing, and they become unable or unwilling to perform to the best of their ability. Fatigue effects are still called fatigue effects even when performance worsens for reasons other than literal fatigue (e.g., boredom or annoyance).

Carryover effects refer to the various ways in which previous measurements can influence subsequent measurements. Practice effects occur when testing causes people to learn to perform better on the test. Fatigue effects occur when measurement causes people to perform worse on subsequent measurements.

Although carryover effects can be minimized, they cannot be eliminated entirely. To understand the consistency of measurements, psychologists employ a useful fiction. Imagine that we could rewind time so that we could measure things with the same measurement instrument as many times as we wanted with no carryover effects. Because time is rewound, people never learn or tire between measurements. They experience each measurement as if it were the first time. That is, each new measurement is completely independent of all previous measurements. The average of all such hypothetical measurements is called the true score. The term true score is something of a misnomer. Notice that the definition requires that we use the same measurement instrument (i.e., test, device, or questionnaire) every time. Any flaws in the measurement instrument are passed along to the true score. Averaging can cancel out random errors, converging on a specific value, but it does nothing to correct persistently wrong answers. That is, the average of many biased measurements is itself biased.
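
The rewind-time thought experiment can be imitated with a small simulation. The sketch below is only an illustration: the attribute value, the size of the instrument's bias, the spread of random error, and the function name are all hypothetical, and the random errors are assumed to be normally distributed.

```python
import random
import statistics


def simulate_measurements(true_attribute, bias, error_sd, n, seed=0):
    """Simulate n independent measurements from one flawed instrument.

    Each measurement is the attribute's actual value, plus the instrument's
    constant bias, plus a fresh random error (no carryover effects).
    """
    rng = random.Random(seed)
    return [true_attribute + bias + rng.gauss(0, error_sd) for _ in range(n)]


# Hypothetical values chosen only for illustration
true_attribute = 100.0  # the quantity we wish we could observe directly
bias = 3.0              # persistent flaw built into the instrument
error_sd = 5.0          # spread of the random measurement error

scores = simulate_measurements(true_attribute, bias, error_sd, n=100_000)

# The "true score" is the long-run average of the repeated measurements.
true_score = statistics.fmean(scores)
print(f"Average of repeated measurements: {true_score:.2f}")
print(f"Attribute value + bias:           {true_attribute + bias:.2f}")
```

With this many simulated measurements, the average settles very close to the attribute value plus the bias, not to the attribute value itself. The random errors cancel, and the instrument's persistent flaw remains.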

True scores are the average of all possible measurements using a specific measurement instrument if such measurements could be repeated without carryover effects.
Figure 12.1

12.2 Retest reliability

12.3 Alternate-form reliability

Split-half reliability

Internal consistency

  • Cronbach’s Alpha
  • McDonald’s Omega

Conditional reliability (IRT)