Regression in R

A step-by-step introduction to regression in R
Author

W. Joel Schneider

Published

March 7, 2024

Here we are going to use a small data set to predict people's heights from the heights of their parents.

Install packages

An R package is software that extends the capabilities of R. To use a package, you must first install it on your machine. You only need to install it once.

Install tidyverse

Some packages are designed to work with several other packages as a system. The tidyverse package is a “meta-package” that installs and loads a coherent set of packages designed to help you import, manipulate, visualize, and interpret data. It also installs the haven package, which we will use to import some data. If you do not have a recent version of tidyverse already installed, you can install it with this code:

install.packages("tidyverse")
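If you are not sure whether you already have it, a conditional install will skip the download when the package is present. This is a minimal sketch using base R's requireNamespace function:

if (!requireNamespace("tidyverse", quietly = TRUE)) {
  install.packages("tidyverse")
}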

Install easystats

The easystats package is another “meta-package” that installs a set of packages designed to work together to make data analysis easier.

If you do not have a recent version of easystats already installed, you can install it with this code:

install.packages("easystats")

Load packages

A package that is installed on your machine has additional functions that you can use. Each session, you can “load” a package with the library function:

library(tidyverse) # Loads primary packages for data wrangling and plotting
library(easystats) # Loads packages that make extracting data from model fit objects easier

Import data

If I use only one data set in an analysis, I call it d. If I need multiple data sets, I use a d_ prefix to differentiate them. For example, if I have separate data sets for juniors and seniors, I might call them, d_juniors and d_seniors, respectively. This kind of consistency seems like extra work, but it pays off in making it easy for future-you to read and understand your own code.

You can import the height.sav data set directly from my GitHub repository for this course. There is no need to save the data anywhere. This code loads the data into the variable d using the read_spss function from the haven package.

We could load haven using library(haven), but we only need to use one function one time. So instead of loading the whole package, we can call a function without loading the package by using the package name as a prefix, followed by two colons, like so:

d <- haven::read_spss("https://github.com/wjschne/EDUC5325/raw/master/height.sav")

This code reaches across the web to the file in quotes, “reads” it into our session, and holds the data in a new variable we called d. Every time we need the data, we will get it from the d variable.
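If you want to confirm that the import worked, you can inspect d. For example, the glimpse function (loaded with the tidyverse) shows each column's name, type, and first few values:

glimpse(d)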

Make a plot

Use the ggplot function from the ggplot2 package to make a scatterplot with points and a regression line:

ggplot(d, aes(avgphgt,height)) +
  geom_point() +
  geom_smooth(method = "lm")
`geom_smooth()` using formula = 'y ~ x'

Annotated, here is what the code does:

[Figure: Annotated ggplot2 code]

Save the plot!

You can save a high-quality plot to your hard drive with the ggsave function. It will save whatever plot you created last with ggplot, guessing the file format from the file extension.

Here I save a .pdf file of the plot to the working directory:

ggsave("my_plot.pdf")

What is the working directory? It is the folder on your machine where R looks first when it needs to find or save files. If you are ever curious about which directory is the working directory, you can use the getwd function:

getwd()

If you saved the plot above as "my_plot.pdf", you will find the file in the working directory returned by getwd.

Vector-based images

The .pdf format gives the best image quality but can only be viewed in a .pdf reader. The .svg format is almost as good and can be incorporated into webpages and Office documents. One downside of their near-perfect image quality is that .pdf and .svg image file sizes can become quite large.

Raster images

The .png format gives good image quality and renders small file sizes.
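For example, here is one way to save the last plot as a .png with explicit dimensions. The width, height, and dpi values below are illustrative, not required:

ggsave("my_plot.png", width = 6, height = 4, dpi = 300)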

The primary use of the .gif format is to create animated plots. Otherwise, stick with .png.

Although the .jpg format is good for photos, it is terrible for plots—it often renders text and sharp corners with pixelated smudges.

Creating the regression model

To run regression models, use the lm function. The lm stands for “linear model.” The general pattern is lm(Y~X, data = d), which means “Y is predicted by X, using the data set d.”

Here we predict height from Average of Parent Height (avgphgt):

m1 <- lm(height~avgphgt, data = d)

Notice that we did not get any results. Instead, we created the model fit object m1, which contains information about the model. There are a variety of functions we can use to extract information from m1.
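For example, base R has several extractor functions that work on any lm object. A quick sampler, not an exhaustive list:

coef(m1)      # intercept and slope
fitted(m1)    # model-predicted height for each observation
residuals(m1) # prediction errors
confint(m1)   # confidence intervals for the coefficients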

Checking assumptions

Regression assumes that the observations are independent and that the residuals are normal and homoscedastic. The performance package has a great function for checking model assumptions: check_model.

check_model(m1)

Here we see that none of the assumptions have been severely violated.
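If you prefer to test assumptions one at a time, the performance package also has targeted functions that print plain-language verdicts. Two examples:

check_normality(m1)          # are the residuals normal?
check_heteroscedasticity(m1) # is the residual variance constant?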

Summarizing results

Base R gives you most of what you would want to know about the regression results with the summary function:

summary(m1)

Call:
lm(formula = height ~ avgphgt, data = d)

Residuals:
   Min     1Q Median     3Q    Max 
-5.241 -1.649 -0.539  1.224  6.163 

Coefficients:
            Estimate Std. Error t value Pr(>|t|)    
(Intercept)  -12.878      9.779   -1.32      0.2    
avgphgt        1.193      0.145    8.23  5.7e-10 ***
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: 2.68 on 38 degrees of freedom
Multiple R-squared:  0.641, Adjusted R-squared:  0.631 
F-statistic: 67.7 on 1 and 38 DF,  p-value: 5.7e-10

This output is not pretty, nor was it intended to be. It is designed for you, the analyst. The summary function’s print method is optimized for reading results in the console, not in a document. Presentation-worthy results need a lot more care and attention.

An automated report from the report package:

report::report(m1)
We fitted a linear model (estimated using OLS) to predict height with avgphgt
(formula: height ~ avgphgt). The model explains a statistically significant and
substantial proportion of variance (R2 = 0.64, F(1, 38) = 67.73, p < .001, adj.
R2 = 0.63). The model's intercept, corresponding to avgphgt = 0, is at -12.88
(95% CI [-32.67, 6.92], t(38) = -1.32, p = 0.196). Within this model:

  - The effect of avgphgt is statistically significant and positive (beta = 1.19,
95% CI [0.90, 1.49], t(38) = 8.23, p < .001; Std. beta = 0.80, 95% CI [0.60,
1.00])

Standardized parameters were obtained by fitting the model on a standardized
version of the dataset. 95% Confidence Intervals (CIs) and p-values were
computed using a Wald t-distribution approximation.

Model-level statistics

Some statistics like the coefficient of determination (R2) or the standard error of the estimate (σe) describe the model as a whole.
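As a reminder of what R2 means, you can compute it by hand as 1 minus the ratio of the residual sum of squares to the total sum of squares. This sketch uses base R and should match the output below:

1 - sum(residuals(m1)^2) / sum((d$height - mean(d$height))^2)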

The model-level statistics can be extracted with the performance package's model_performance function (performance is a shorter alias for it).

performance(m1)
AIC   AICc   BIC      R2   R2_adjusted   RMSE   Sigma
196    197   201   0.641         0.631   2.61    2.68

R2 is in the output's R2 column, and the standard error of the estimate is in the Sigma column. If all you wanted was the R2, you could do this:

r2(m1)
# R2 for Linear Regression
       R2: 0.641
  adj. R2: 0.631

For the standard error of the estimate:

sigma(m1)
[1] 2.68

Why would you want just one number instead of reading it from a table? In reproducible research, we intermingle text and code so that it is clear where every number came from. Thus, “hard-coding” your results like this is considered poor practice:

The model explains 64% of the variance.

Using rmarkdown, instead of typing the numeric results, we pull the results with an inline code chunk like this one:

The model explains `r round(100 * r2(m1)$R2)`% of the variance.

Which, when rendered, produces the correct output:

The model explains 64% of the variance.

That seems like a lot of extra work, right? Yes, it is—unless there is a possibility that your underlying data might change or that you might copy your numbers incorrectly. If you are imperfect, the extra time and effort is worth it. It makes it easy for other scholars to see exactly where each number came from. Hard-coded results are harder to trust.

Coefficient-level statistics

The regression coefficients—the intercept (b0) and the slope (b1)—have a number of statistics associated with them, which we will discuss later in the course.

If you just wanted the intercept and the slope coefficients, use the coef function:

coef(m1)
(Intercept)     avgphgt 
     -12.88        1.19 

Thus, the intercept is -12.88 and the slope is 1.19.
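Together, the coefficients define the regression equation: predicted height = -12.88 + 1.19 × average parent height. For example, to predict the height of someone whose parents average 67 inches tall (the value 67 is just an illustration), you can compute it by hand or let the predict function do it:

# By hand, from the coefficients
coef(m1)[1] + coef(m1)[2] * 67

# With predict
predict(m1, newdata = tibble(avgphgt = 67))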

To get the model parameters (intercept and slope coefficients) along with their p-values and other statistical information:

parameters(m1) 
Parameter     Coefficient      SE     CI    CI_low   CI_high       t   df_error       p
(Intercept)        -12.88   9.779   0.95   -32.675      6.92   -1.32         38   0.196
avgphgt              1.19   0.145   0.95     0.899      1.49    8.23         38   0.000

If you want standardized parameters:

standardise_parameters(m1)
Parameter     Std_Coefficient     CI   CI_low   CI_high
(Intercept)               0.0   0.95   -0.194     0.194
avgphgt                   0.8   0.95    0.603     0.997

Multiple regression

In this data set, all participants identified as either male or female. We assume that males are, on average, taller than females. The codes are male = 1 and female = 2. Let’s mutate the gender variable so that it becomes a categorical “factor” instead of a numeric variable.

d <- d %>% 
  mutate(gender = factor(gender, levels = c(1, 2), 
                         labels = c("Males", "Females"))) 

By default, the first category in a factor is the "reference level" that all other categories are compared to. Thus, the regression coefficient for gender will refer to how much "Females" differ from "Males," on average. If you want to do it the other way around, you can reverse the order of the levels and labels in the factor function. See also the fct_rev and fct_relevel functions for other ways to do this.
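For example, here is a sketch of making "Females" the reference level with fct_relevel (not assigned back to d, so the analyses below are unaffected):

d %>% 
  mutate(gender = fct_relevel(gender, "Females"))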

We run the regression model with 2 predictors like so:

m2 <- lm(height ~ avgphgt + gender, data = d)

Checking assumptions

check_model(m2)

All looks well.

Summarizing results

To summarize the results, use the summary function:

summary(m2)

Call:
lm(formula = height ~ avgphgt + gender, data = d)

Residuals:
   Min     1Q Median     3Q    Max 
-3.826 -1.094 -0.347  1.656  4.503 

Coefficients:
              Estimate Std. Error t value Pr(>|t|)    
(Intercept)     24.567     10.932    2.25  0.03068 *  
avgphgt          0.671      0.157    4.26  0.00013 ***
genderFemales   -4.468      0.920   -4.85  2.2e-05 ***
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: 2.12 on 37 degrees of freedom
Multiple R-squared:  0.78,  Adjusted R-squared:  0.769 
F-statistic: 65.7 on 2 and 37 DF,  p-value: 6.6e-13

Standardized coefficients:

standardise_parameters(m2)
Parameter       Std_Coefficient     CI   CI_low   CI_high
(Intercept)               0.506   0.95    0.245     0.768
avgphgt                   0.450   0.95    0.236     0.664
genderFemales            -1.012   0.95   -1.435    -0.590

An automated report:

report(m2)
We fitted a linear model (estimated using OLS) to predict height with avgphgt
and gender (formula: height ~ avgphgt + gender). The model explains a
statistically significant and substantial proportion of variance (R2 = 0.78,
F(2, 37) = 65.75, p < .001, adj. R2 = 0.77). The model's intercept,
corresponding to avgphgt = 0 and gender = Males, is at 24.57 (95% CI [2.42,
46.72], t(37) = 2.25, p = 0.031). Within this model:

  - The effect of avgphgt is statistically significant and positive (beta = 0.67,
95% CI [0.35, 0.99], t(37) = 4.26, p < .001; Std. beta = 0.45, 95% CI [0.24,
0.66])
  - The effect of gender [Females] is statistically significant and negative
(beta = -4.47, 95% CI [-6.33, -2.60], t(37) = -4.85, p < .001; Std. beta =
-1.01, 95% CI [-1.43, -0.59])

Standardized parameters were obtained by fitting the model on a standardized
version of the dataset. 95% Confidence Intervals (CIs) and p-values were
computed using a Wald t-distribution approximation.

Comparing models

To compare two regression models:

test_wald(m1, m2)
Name   Model   df   df_diff      F   p
m1     lm      38
m2     lm      37         1   23.6   0

The p-value is significant, meaning that m2 explains significantly more variance than m1.
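If you prefer base R, the anova function runs the same F test for nested models and should give matching results:

anova(m1, m2)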

This uses Bayes Factor instead of a p-value:

test_bf(m1, m2)
Model              log_BF     BF
avgphgt
avgphgt + gender     8.01   3010

A BF > 1 means that m2 is more strongly supported than m1. A BF < 1 means that m1 is more strongly supported than m2. If you are not sure how to interpret the output of test_bf, you can get an automated interpretation:

test_bf(m1, m2) %>% 
  report()
Bayes factors were computed using the BIC approximation, by which BF10 =
exp((BIC0 - BIC1)/2). Compared to the avgphgt model, we found extreme evidence
(BF = 3.01e+03) in favour of the avgphgt + gender model (the least supported
model).

To compare many performance statistics at once:

compare_performance(m1, m2) 
Name   Model   AIC   AIC_wt   AICc   AICc_wt   BIC   BIC_wt      R2   R2_adjusted   RMSE   Sigma
m1     lm      196        0    197         0   201        0   0.641         0.631   2.61    2.68
m2     lm      179        1    180         1   185        1   0.780         0.769   2.04    2.12
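If you have the see package installed (it is part of the easystats ecosystem), you can visualize this comparison. Adding rank = TRUE normalizes the indices so that higher is always better:

plot(compare_performance(m1, m2, rank = TRUE))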

Pretty Regression Tables

There are several packages that automate the production of pretty regression tables. There is a trade-off here between ease of use and getting exactly what you want. If you are okay with a function’s defaults and the programmer’s choices, then there is no problem. If you want full control of the display, you are best off creating your own functions or adapting existing functions.

The tab_model function in sjPlot looks great as is, but if it gives you too much or not enough, check out its options to see if it can do what you want.

Show a single model

sjPlot::tab_model(m1)
                            height (in inches)
Predictors                  Estimates   CI              p
(Intercept)                 -12.88      -32.67 – 6.92   0.196
Average height of parents   1.19        0.90 – 1.49     <0.001
Observations                40
R2 / R2 adjusted            0.641 / 0.631

Compare nested models

sjPlot::tab_model(m1, m2)
                            height (in inches)                   height (in inches)
Predictors                  Estimates   CI              p        Estimates   CI              p
(Intercept)                 -12.88      -32.67 – 6.92   0.196    24.57       2.42 – 46.72    0.031
Average height of parents   1.19        0.90 – 1.49     <0.001   0.67        0.35 – 0.99     <0.001
gender: Females                                                  -4.47       -6.33 – -2.60   <0.001
Observations                40                                   40
R2 / R2 adjusted            0.641 / 0.631                        0.780 / 0.769
library(gtsummary)
# Plain table
tbl_regression(m1)
Characteristic              Beta   95% CI      p-value
Average height of parents   1.2    0.90, 1.5   <0.001
CI = Confidence Interval
# Merging tables
tbl_merge(tbls = list(tbl_regression(m1) %>% add_glance_table(),
                      tbl_regression(m2) %>% add_glance_table()),
          tab_spanner = c("Parent Height", "Parent Height + Gender")) 
                            Parent Height                 Parent Height + Gender
Characteristic              Beta   95% CI      p-value    Beta    95% CI       p-value
Average height of parents   1.2    0.90, 1.5   <0.001     0.67    0.35, 0.99   <0.001
gender
    Males                                                 —
    Females                                               -4.5    -6.3, -2.6   <0.001
R²                          0.641                         0.780
Adjusted R²                 0.631                         0.769
Sigma                       2.68                          2.12
Statistic                   67.7                          65.7
p-value                     <0.001                        <0.001
df                          1                             2
Log-likelihood              -95.2                         -85.3
AIC                         196                           179
BIC                         201                           185
Deviance                    273                           167
Residual df                 38                            37
No. Obs.                    40                            40
CI = Confidence Interval
# Stacking tables
tbl_stack(list(tbl_regression(m1), tbl_regression(m2)),
          group_header = c("Parent Height", "Parent Height + Gender"))
Characteristic              Beta   95% CI       p-value
Parent Height
Average height of parents   1.2    0.90, 1.5    <0.001
Parent Height + Gender
Average height of parents   0.67   0.35, 0.99   <0.001
gender
    Males                   —
    Females                 -4.5   -6.3, -2.6   <0.001
CI = Confidence Interval
parameters(m2) %>% 
  rempsyc::nice_table()

Parameter       b       SE      CI     t       df   p           95% CI
(Intercept)     24.57   10.93   0.95   2.25    37   .031*       [2.42, 46.72]
avgphgt         0.67    0.16    0.95   4.26    37   < .001***   [0.35, 0.99]
genderFemales   -4.47   0.92    0.95   -4.85   37   < .001***   [-6.33, -2.60]

All the code in one place:

The preceding analyses might seem like a lot, but it is not really so much when you see it all in just a few lines of code. Here are all the main analyses:

# Load packages
library(tidyverse)
library(easystats)

# Import data
d <- haven::read_spss("https://github.com/wjschne/EDUC5325/raw/master/height.sav")

# Plot data
ggplot(d, aes(avgphgt, height)) +
  geom_point() +
  geom_smooth(method = "lm")

# Save plot
ggsave("my_plot.pdf")

# Create regression model
m1 <- lm(height~avgphgt, data = d)

# Check model assumptions
check_model(m1)

# Display results
summary(m1)
# Model fit
performance(m1)
# Model parameters
parameters(m1)

# Multiple regression
m2 <- lm(height ~ avgphgt + gender, data = d)
summary(m2)
parameters(m2, standardize = "refit")

# Compare model m1 and model m2
test_wald(m1, m2)
test_bf(m1, m2)
compare_performance(m1, m2)

Questions

Use the Transfer of Learning data set. A set of 38 inmates participated in a study that tested the Transfer of Learning Hypothesis. Some people believe that studying Latin is particularly beneficial for progress in other academic disciplines. Each inmate was given a reading test and a math test before the study began. Some inmates were randomly assigned to participate in a 48-week Latin course. The control group studied the lives of famous athletes for the same time period. Each inmate took the reading and math test again to see if studying Latin improved their academic skills. Personal information was also collected about each inmate including their annual income before going to prison, whether or not the inmate had a documented learning disorder, and whether or not the inmate had been convicted of a violent offense. Here are the variable names:

  • read_1 = Reading before study
  • read_2 = Reading after study
  • math_1 = Math before study
  • math_2 = Math after study
  • latin = 1 (Studied Latin), 0 (Studied sports history)
  • violent = 1 (convicted of a violent offense), 0 (convicted of a non-violent offense)
  • learning = 1 (Learning disabled), 0 (Not learning disabled)
  • income = Annual income before incarceration
d_learn <- read_csv("https://github.com/wjschne/EDUC5529/raw/master/transfer_of_learning.csv") 

Assume α = 0.05 for all hypothesis tests.

Create a regression model in which you predict reading at time 2 (read_2) using reading at time 1 (read_1). Call the model fit object m_read.

Question 1

Is read_1 a significant predictor of read_2?

Make model fit object called m_read using the lm function.

m_read <- lm(read_2 ~ read_1, data = d_learn)

View coefficient-level statistics with the parameters function.

You could also use Base R’s summary function.

parameters(m_read)

# or

summary(m_read)

In the read_1 row, is the value in the p column less than 0.05?

parameters(m_read)
Parameter     Coefficient       SE     CI   CI_low   CI_high      t   df_error       p
(Intercept)        83.591   10.922   0.95    61.44   105.741   7.65         36   0.000
read_1              0.251    0.104   0.95     0.04     0.463   2.41         36   0.021

Question 2

What is the R2 for the m_read model?

View model-level statistics with the performance function.

You could also use Base R’s summary function.

performance(m_read)

# or

summary(m_read)
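As shown earlier with m1, the r2 function returns just this statistic:

r2(m_read)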

Question 3

What does the scatter plot look like when read_1 is on the x-axis and read_2 is on the y-axis? Also plot the regression line. Save your plot using the ggsave function.

This will get you started

ggplot(d_learn, aes(read_1, read_2))

Here is how you add points.

ggplot(d_learn, aes(read_1, read_2)) +
  geom_point() 

Here is how you add a regression line.

ggplot(d_learn, aes(read_1, read_2)) +
  geom_point() + 
  geom_smooth(method = "lm") 

This is overkill for now. But someday you might want to be able to make your plots as polished as possible.

# I want to put the equation at x = 130
equation_x <- 130
equation_y <- predict(m_read, newdata = tibble(read_1 = equation_x))

# Extracting the coefficients
b_read <- round(coef(m_read),2)

# The angle of the regression line is the inverse tangent of the slope (converted to degrees)
eq_angle <- atan(b_read[2]) * 180 / pi

# Equation
eq_read <- paste0("italic(Y) == ", 
                  b_read[1], 
                  " + ", 
                  b_read[2], 
                  " *  italic(X) + italic(e)")

ggplot(d_learn, aes(read_1, read_2)) +
  geom_point() +
  geom_smooth(method = "lm") +
  labs(x = "Reading at Time 1",
       y = "Reading at Time 2",
       title = "Using Reading at Time 1 to Predict Reading at Time 2") +
  coord_fixed() +
  theme_minimal() +
  annotate(
    geom = "text",
    x = equation_x,
    y = equation_y,
    angle = eq_angle,
    label = eq_read,
    parse = TRUE,
    vjust = -0.5)

Question 4

Create a regression model in which you predict math at time 2 (math_2) using math at time 1 (math_1). Call the model fit object m_math1.

Does math_1 predict math_2?

Does math_1 still predict math_2 after controlling for read_1?

m_math1 <- lm(math_2 ~ math_1, data = d_learn)
m_math2 <- lm(math_2 ~ math_1 + read_1, data = d_learn)
summary(m_math1)
summary(m_math2)