Here we are going to use a small data set to predict people’s height using the height of their parents.
Install packages
An R package is software that extends the capabilities of R. To use a package, you must first install it on your machine. You only need to install it once.
Install tidyverse
Some packages are designed to work with several other packages as a system. The tidyverse package is a “meta-package” that installs and loads a coherent set of packages designed to help you import, manipulate, visualize, and interpret data. It also installs the haven package, which we will use to import some data. If you do not have a recent version of tidyverse already installed, you can install it with this code:
install.packages("tidyverse")
Install easystats
The easystats package is another “meta-package” that installs a set of packages designed to work together to make data analysis easier.
If you do not have a recent version of easystats already installed, you can install it with this code:
install.packages("easystats")
Load packages
A package that is installed on your machine has additional functions that you can use. Each session, you can “load” a package with the library function:
library(tidyverse) # Loads primary packages for data wrangling and plotting
library(easystats) # Loads packages that make extracting data from model fit objects easier
Import data
If I use only one data set in an analysis, I call it d. If I need multiple data sets, I use a d_ prefix to differentiate them. For example, if I have separate data sets for juniors and seniors, I might call them d_juniors and d_seniors, respectively. This kind of consistency seems like extra work, but it pays off by making it easy for future-you to read and understand your own code.
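For example, a sketch of that convention in action (the file names here are hypothetical):
# One variable per data set, each named with a d_ prefix
d_juniors <- readr::read_csv("juniors.csv")  # hypothetical file
d_seniors <- readr::read_csv("seniors.csv")  # hypothetical file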
You can import the height.sav data set directly from my GitHub repository for this course. There is no need to save the data anywhere. This code loads the data into variable d using the read_spss function from the haven package.
We could load haven using library(haven), but we only need one function one time. So instead of loading the whole package, we can call a single function without loading the package by prefixing it with the package name followed by two colons, like so:
d <- haven::read_spss("https://github.com/wjschne/EDUC5325/raw/master/height.sav")
This code reaches across the web to the file in quotes, “reads” it into our session, and holds the data in a new variable we called d. Every time we need the data, we will get it from the d variable.
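If you want to verify that the import worked, you can take a quick look at d. The glimpse function loads with the tidyverse:
# Show column names, types, and the first few values
glimpse(d)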
Make a plot
Use the ggplot function from the ggplot2 package to make a scatterplot with points and a regression line:
ggplot(d, aes(avgphgt,height)) +
geom_point() +
geom_smooth(method = "lm")
`geom_smooth()` using formula = 'y ~ x'
Here is what the code does: ggplot(d, aes(avgphgt, height)) sets up a plot of d with avgphgt on the x-axis and height on the y-axis, geom_point() draws a point for each person, and geom_smooth(method = "lm") overlays the linear regression line. The printed message simply reports the formula geom_smooth used.
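If you would rather draw the line without the shaded confidence band, geom_smooth has an se argument. A small variation on the plot above:
ggplot(d, aes(avgphgt, height)) +
  geom_point() +
  geom_smooth(method = "lm", se = FALSE) # same plot, no confidence band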
Save the plot!
You can save a high-quality plot to your hard drive with the ggsave function. It saves whatever plot you most recently created with ggplot, and it guesses the file format from the file extension.
Here I save a .pdf file of the plot to the working directory:
ggsave("my_plot.pdf")
What is the working directory? It is the folder on your machine where R looks first when it needs to find or save files. If you are ever curious about which directory is the working directory, you can use the getwd function:
getwd()
If you saved the plot above as "my_plot.pdf", you will find the file in the working directory returned by getwd.
Vector-based images
The .pdf format gives the best image quality but can only be viewed in a .pdf reader. The .svg format is almost as good and can be incorporated into webpages and Office documents. One downside of their near-perfect image quality is that .pdf and .svg image file sizes can become quite large.
Raster images
The .png format gives good image quality and renders small file sizes.
The primary use of the .gif format is to create animated plots. Otherwise, stick with .png.
Although the .jpg format is good for photos, it is terrible for plots—it often renders text and sharp corners with pixelated smudges.
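To make the trade-offs concrete, here is a sketch of saving the same plot in several formats. The width and height values here are arbitrary (ggsave measures them in inches by default), and writing .svg files requires the svglite package:
ggsave("my_plot.svg", width = 6, height = 4)            # vector: crisp at any size
ggsave("my_plot.png", width = 6, height = 4, dpi = 300) # raster: small file, good quality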
Creating the regression model
To run regression models, use the lm function. The lm stands for “linear model.” The general pattern is lm(Y ~ X, data = d), which means “Y is predicted by X, using the data set d.”
Here we predict height from Average of Parent Height (avgphgt):
m1 <- lm(height ~ avgphgt, data = d)
Notice that we did not get any results. Instead, we created the model fit object m1, which contains information about the model. There are a variety of functions we can use to extract information from m1.
Checking assumptions
Regression assumes that the observations are independent and that the residuals are normal and homoscedastic. The performance package has a great function for checking model assumptions: check_model.
check_model(m1)
Here we see that none of the assumptions have been severely violated.
Summarizing results
Base R gives you most of what you would want to know about the regression results with the summary function:
summary(m1)
Call:
lm(formula = height ~ avgphgt, data = d)
Residuals:
Min 1Q Median 3Q Max
-5.241 -1.649 -0.539 1.224 6.163
Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) -12.878 9.779 -1.32 0.2
avgphgt 1.193 0.145 8.23 5.7e-10 ***
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
Residual standard error: 2.68 on 38 degrees of freedom
Multiple R-squared: 0.641, Adjusted R-squared: 0.631
F-statistic: 67.7 on 1 and 38 DF, p-value: 5.7e-10
This output is not pretty, nor was it intended to be. It is designed for you, the analyst. The summary function’s print method is optimized for reading results in the console, not in a document. Presentation-worthy results need a lot more care and attention.
An automated report from the report package:
report::report(m1)
We fitted a linear model (estimated using OLS) to predict height with avgphgt
(formula: height ~ avgphgt). The model explains a statistically significant and
substantial proportion of variance (R2 = 0.64, F(1, 38) = 67.73, p < .001, adj.
R2 = 0.63). The model's intercept, corresponding to avgphgt = 0, is at -12.88
(95% CI [-32.67, 6.92], t(38) = -1.32, p = 0.196). Within this model:
- The effect of avgphgt is statistically significant and positive (beta = 1.19,
95% CI [0.90, 1.49], t(38) = 8.23, p < .001; Std. beta = 0.80, 95% CI [0.60,
1.00])
Standardized parameters were obtained by fitting the model on a standardized
version of the dataset. 95% Confidence Intervals (CIs) and p-values were
computed using a Wald t-distribution approximation.
Model-level statistics
Some statistics like the coefficient of determination (R2) or the standard error of the estimate (σe) describe the model as a whole.
The model-level statistics can be extracted with the performance package’s model_performance function (performance is a shorter alias for it).
performance(m1)
AIC | AICc | BIC | R2 | R2_adjusted | RMSE | Sigma |
---|---|---|---|---|---|---|
196 | 197 | 201 | 0.641 | 0.631 | 2.61 | 2.68 |
R2 is in the output’s R2 column, and the standard error of the estimate is in the Sigma column. If all you wanted was the R2, you could do this:
r2(m1)
# R2 for Linear Regression
R2: 0.641
adj. R2: 0.631
For the standard error of the estimate:
sigma(m1)
[1] 2.68
Why would you want just one number instead of reading it from a table? In reproducible research, we intermingle text and code so that it is clear where every number came from. Thus, “hard-coding” your results like this is considered poor practice:
The model explains 64% of the variance.
Using R Markdown, instead of typing the numeric results, we pull them with an inline code chunk, something like this in the source document:
The model explains `r round(100 * r2(m1)$R2)`% of the variance.
Which, when rendered, produces the correct output:
The model explains 64% of the variance.
That seems like a lot of extra work, right? Yes, it is—unless there is a possibility that your underlying data might change or that you might copy your numbers incorrectly. If you are imperfect, the extra time and effort is worth it. It makes it easy for other scholars to see exactly where each number came from. Hard-coded results are harder to trust.
Coefficient-level statistics
The regression coefficients—the intercept (b0) and the slope (b1)—have a number of statistics associated with them, which we will discuss later in the course.
If you just wanted the intercept and the slope coefficients, use the coef function:
coef(m1)
(Intercept) avgphgt
-12.88 1.19
Thus, the intercept is -12.88 and the slope is 1.19.
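As a quick check on what the coefficients mean, we can plug a value into the regression equation by hand and compare the result with the predict function. The value 68 here is arbitrary:
# Predicted height when parents average 68 inches: b0 + b1 * 68
-12.88 + 1.19 * 68 # about 68.0 inches
# predict does the same computation from the model itself
predict(m1, newdata = tibble(avgphgt = 68))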
To get the model parameters (intercept and slope coefficients) along with their p-values and other statistical information, use the parameters function:
parameters(m1)
Parameter | Coefficient | SE | CI | CI_low | CI_high | t | df_error | p |
---|---|---|---|---|---|---|---|---|
(Intercept) | -12.88 | 9.779 | 0.95 | -32.675 | 6.92 | -1.32 | 38 | 0.196 |
avgphgt | 1.19 | 0.145 | 0.95 | 0.899 | 1.49 | 8.23 | 38 | 0.000 |
If you want standardized parameters:
standardise_parameters(m1)
Parameter | Std_Coefficient | CI | CI_low | CI_high |
---|---|---|---|---|
(Intercept) | 0.0 | 0.95 | -0.194 | 0.194 |
avgphgt | 0.8 | 0.95 | 0.603 | 0.997 |
Multiple regression
In this data set, all participants identified as either male or female. We assume that males are, on average, taller than females. The codes are male = 1 and female = 2. Let’s mutate the gender variable so that it becomes a categorical “factor” instead of a numeric variable.
d <- d %>%
  mutate(gender = factor(gender,
                         levels = c(1, 2),
                         labels = c("Males", "Females")))
By default, the first category in a factor is the “reference level” that all other categories are compared to. Thus, the regression coefficient for gender will refer to how much “Females” differ from “Males,” on average. If you want to do it the other way around, you can reverse the order of the levels and labels in the factor function. See also the fct_rev and fct_relevel functions for other ways to do this.
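For example, either of these would make “Females” the reference level instead. This is a sketch; run one or the other, not both:
# Option 1: move Females to the front of the level order
d <- d %>% mutate(gender = fct_relevel(gender, "Females"))
# Option 2: reverse the existing level order
d <- d %>% mutate(gender = fct_rev(gender))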
We run the regression model with 2 predictors like so:
m2 <- lm(height ~ avgphgt + gender, data = d)
Checking assumptions
check_model(m2)
All looks well.
Summarizing results
To summarize the results, use the summary function:
summary(m2)
Call:
lm(formula = height ~ avgphgt + gender, data = d)
Residuals:
Min 1Q Median 3Q Max
-3.826 -1.094 -0.347 1.656 4.503
Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) 24.567 10.932 2.25 0.03068 *
avgphgt 0.671 0.157 4.26 0.00013 ***
genderFemales -4.468 0.920 -4.85 2.2e-05 ***
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
Residual standard error: 2.12 on 37 degrees of freedom
Multiple R-squared: 0.78, Adjusted R-squared: 0.769
F-statistic: 65.7 on 2 and 37 DF, p-value: 6.6e-13
Standardized coefficients:
standardise_parameters(m2)
Parameter | Std_Coefficient | CI | CI_low | CI_high |
---|---|---|---|---|
(Intercept) | 0.506 | 0.95 | 0.245 | 0.768 |
avgphgt | 0.450 | 0.95 | 0.236 | 0.664 |
genderFemales | -1.012 | 0.95 | -1.435 | -0.590 |
An automated report:
report(m2)
We fitted a linear model (estimated using OLS) to predict height with avgphgt
and gender (formula: height ~ avgphgt + gender). The model explains a
statistically significant and substantial proportion of variance (R2 = 0.78,
F(2, 37) = 65.75, p < .001, adj. R2 = 0.77). The model's intercept,
corresponding to avgphgt = 0 and gender = Males, is at 24.57 (95% CI [2.42,
46.72], t(37) = 2.25, p = 0.031). Within this model:
- The effect of avgphgt is statistically significant and positive (beta = 0.67,
95% CI [0.35, 0.99], t(37) = 4.26, p < .001; Std. beta = 0.45, 95% CI [0.24,
0.66])
- The effect of gender [Females] is statistically significant and negative
(beta = -4.47, 95% CI [-6.33, -2.60], t(37) = -4.85, p < .001; Std. beta =
-1.01, 95% CI [-1.43, -0.59])
Standardized parameters were obtained by fitting the model on a standardized
version of the dataset. 95% Confidence Intervals (CIs) and p-values were
computed using a Wald t-distribution approximation.
Comparing models
To compare two regression models:
test_wald(m1, m2)
Name | Model | df | df_diff |    F |      p
-------------------------------------------
m1   |    lm | 38 |         |      |
m2   |    lm | 37 |       1 | 23.6 | < .001
The p-value is significant, meaning that m2 explains more variance than m1.
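If you prefer Base R, the anova function runs the same nested-model F test:
anova(m1, m2)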
This comparison uses a Bayes factor instead of a p-value:
test_bf(m1, m2)
Model            | log_BF |   BF
--------------------------------
avgphgt          |        |
avgphgt + gender |   8.01 | 3010
A BF > 1 means that m2 is more strongly supported than m1. A BF < 1 means that m1 is more strongly supported than m2. If you are not sure how to interpret the output of test_bf, you can get an automated interpretation:
test_bf(m1, m2) %>%
report()
Bayes factors were computed using the BIC approximation, by which BF10 =
exp((BIC0 - BIC1)/2). Compared to the avgphgt model (the least supported
model), we found extreme evidence (BF = 3.01e+03) in favour of the avgphgt +
gender model.
To compare many performance statistics at once:
compare_performance(m1, m2)
Name | Model | AIC | AIC_wt | AICc | AICc_wt | BIC | BIC_wt | R2 | R2_adjusted | RMSE | Sigma |
---|---|---|---|---|---|---|---|---|---|---|---|
m1 | lm | 196 | 0 | 197 | 0 | 201 | 0 | 0.641 | 0.631 | 2.61 | 2.68 |
m2 | lm | 179 | 1 | 180 | 1 | 185 | 1 | 0.780 | 0.769 | 2.04 | 2.12 |
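If the see package is installed (easystats installs it), you can also visualize the comparison. The rank argument asks compare_performance to order the models from best to worst:
plot(compare_performance(m1, m2, rank = TRUE))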
Pretty Regression Tables
There are several packages that automate the production of pretty regression tables. There is a trade-off here between ease of use and getting exactly what you want. If you are okay with a function’s defaults and the programmer’s choices, then there is no problem. If you want full control of the display, you are best off creating your own functions or adapting existing functions.
The tab_model function in sjPlot looks great as is, but if it gives you too much or not enough, check out its options to see if it can do what you want.
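For example, here is a sketch of two of its options (argument names as documented in sjPlot; see ?tab_model for the full list):
# Add standardized coefficients and standard errors to the table
sjPlot::tab_model(m1, show.std = TRUE, show.se = TRUE)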
Show a single model
sjPlot::tab_model(m1)
                          |       height (in inches)
Predictors                | Estimates | CI            | p
(Intercept)               | -12.88    | -32.67 – 6.92 | 0.196
Average height of parents | 1.19      | 0.90 – 1.49   | <0.001
Observations              | 40
R2 / R2 adjusted          | 0.641 / 0.631
Compare nested models
sjPlot::tab_model(m1, m2)
                          |        height (in inches)       |        height (in inches)
Predictors                | Estimates | CI            | p      | Estimates | CI            | p
(Intercept)               | -12.88    | -32.67 – 6.92 | 0.196  | 24.57     | 2.42 – 46.72  | 0.031
Average height of parents | 1.19      | 0.90 – 1.49   | <0.001 | 0.67      | 0.35 – 0.99   | <0.001
gender: Females           |           |               |        | -4.47     | -6.33 – -2.60 | <0.001
Observations              | 40        |               |        | 40        |               |
R2 / R2 adjusted          | 0.641 / 0.631 |           |        | 0.780 / 0.769 |           |
library(gtsummary)
# Plain table
tbl_regression(m1)
Characteristic | Beta | 95% CI1 | p-value |
---|---|---|---|
Average height of parents | 1.2 | 0.90, 1.5 | <0.001 |
1 CI = Confidence Interval |
# Merging tables
tbl_merge(tbls = list(tbl_regression(m1) %>% add_glance_table(),
tbl_regression(m2) %>% add_glance_table()),
tab_spanner = c("Parent Height", "Parent Height + Gender"))
                          |       Parent Height        |   Parent Height + Gender
Characteristic            | Beta  | 95% CI1   | p-value | Beta  | 95% CI1    | p-value
Average height of parents | 1.2   | 0.90, 1.5 | <0.001  | 0.67  | 0.35, 0.99 | <0.001
R²                        | 0.641 |           |         | 0.780 |            |
Adjusted R²               | 0.631 |           |         | 0.769 |            |
Sigma                     | 2.68  |           |         | 2.12  |            |
Statistic                 | 67.7  |           |         | 65.7  |            |
p-value                   | <0.001 |          |         | <0.001 |           |
df                        | 1     |           |         | 2     |            |
Log-likelihood            | -95.2 |           |         | -85.3 |            |
AIC                       | 196   |           |         | 179   |            |
BIC                       | 201   |           |         | 185   |            |
Deviance                  | 273   |           |         | 167   |            |
Residual df               | 38    |           |         | 37    |            |
No. Obs.                  | 40    |           |         | 40    |            |
gender                    |       |           |         |       |            |
  Males                   |       |           |         | —     | —          |
  Females                 |       |           |         | -4.5  | -6.3, -2.6 | <0.001
1 CI = Confidence Interval
# Stacking tables
tbl_stack(list(tbl_regression(m1), tbl_regression(m2)),
group_header = c("Parent Height", "Parent Height + Gender"))
Characteristic            | Beta | 95% CI1    | p-value
Parent Height             |      |            |
Average height of parents | 1.2  | 0.90, 1.5  | <0.001
Parent Height + Gender    |      |            |
Average height of parents | 0.67 | 0.35, 0.99 | <0.001
gender                    |      |            |
  Males                   | —    | —          |
  Females                 | -4.5 | -6.3, -2.6 | <0.001
1 CI = Confidence Interval
The rempsyc package can format a parameters table in APA style:
parameters(m2) %>%
  rempsyc::nice_table()
Parameter | b | SE | CI | t | df | p | 95% CI |
---|---|---|---|---|---|---|---|
(Intercept) | 24.57 | 10.93 | 0.95 | 2.25 | 37 | .031* | [2.42, 46.72] |
avgphgt | 0.67 | 0.16 | 0.95 | 4.26 | 37 | < .001*** | [0.35, 0.99] |
genderFemales | -4.47 | 0.92 | 0.95 | -4.85 | 37 | < .001*** | [-6.33, -2.60] |
All the code in one place:
The preceding analyses might seem like a lot, but it is not really so much when you see it all in just a few lines of code. Here are all the main analyses:
# Load packages
library(tidyverse)
library(easystats)
# Import data
d <- haven::read_spss("https://github.com/wjschne/EDUC5325/raw/master/height.sav")
# Plot data
ggplot(d, aes(avgphgt, height)) +
geom_point() +
geom_smooth(method = "lm")
# Save plot
ggsave("my_plot.pdf")
# Create regression model
m1 <- lm(height ~ avgphgt, data = d)
# Check model assumptions
check_model(m1)
# Display results
summary(m1)
# Model fit
performance(m1)
# Model parameters
parameters(m1)
# Multiple regression
d <- d %>%
  mutate(gender = factor(gender, levels = c(1, 2),
                         labels = c("Males", "Females")))
m2 <- lm(height ~ avgphgt + gender, data = d)
summary(m2)
parameters(m2, standardize = "refit")
# Compare model m1 and model m2
test_wald(m1, m2)
test_bf(m1, m2)
compare_performance(m1, m2)
Questions
Use the Transfer of Learning data set. A set of 38 inmates participated in a study that tested the Transfer of Learning Hypothesis. Some people believe that studying Latin is particularly beneficial for progress in other academic disciplines. Each inmate was given a reading test and a math test before the study began. Some inmates were randomly assigned to participate in a 48-week Latin course. The control group studied the lives of famous athletes for the same time period. Each inmate took the reading and math test again to see if studying Latin improved their academic skills. Personal information was also collected about each inmate including their annual income before going to prison, whether or not the inmate had a documented learning disorder, and whether or not the inmate had been convicted of a violent offense. Here are the variable names:
read_1 = Reading before study
read_2 = Reading after study
math_1 = Math before study
math_2 = Math after study
latin = 1 (Studied Latin), 0 (Studied sports history)
violent = 1 (convicted of a violent offense), 0 (convicted of a non-violent offense)
learning = 1 (Learning disabled), 0 (Not learning disabled)
income = Annual income before incarceration
d_learn <- read_csv("https://github.com/wjschne/EDUC5529/raw/master/transfer_of_learning.csv")
Assume α = 0.05 for all hypothesis tests.
Create a regression model in which you predict reading at time 2 (read_2) using reading at time 1 (read_1). Call the model fit object m_read.
Question 1
Is read_1 a significant predictor of read_2?
Make a model fit object called m_read using the lm function.
m_read <- lm(read_2 ~ read_1, data = d_learn)
View coefficient-level statistics with the parameters function. You could also use Base R’s summary function.
parameters(m_read)
# or
summary(m_read)
In the read_1 row, is the p column less than 0.05?
parameters(m_read)
Parameter | Coefficient | SE | CI | CI_low | CI_high | t | df_error | p |
---|---|---|---|---|---|---|---|---|
(Intercept) | 83.591 | 10.922 | 0.95 | 61.44 | 105.741 | 7.65 | 36 | 0.000 |
read_1 | 0.251 | 0.104 | 0.95 | 0.04 | 0.463 | 2.41 | 36 | 0.021 |
Question 2
What is the R2 for the m_read
model?
View model-level statistics with the performance function. You could also use Base R’s summary function.
performance(m_read)
# or
summary(m_read)
Question 3
What does the scatter plot look like when read_1
is on the x-axis and read_2
is on the y-axis? Also plot the regression line. Save your plot using the ggsave
function.
This will get you started.
ggplot(d_learn, aes(read_1, read_2))
Here is how you add points.
ggplot(d_learn, aes(read_1, read_2)) +
geom_point()
Here is how you add a regression line.
ggplot(d_learn, aes(read_1, read_2)) +
geom_point() +
geom_smooth(method = "lm")
This is overkill for now. But someday you might want to be able to make your plots as polished as possible.
# I want to put the equation at x = 130
equation_x <- 130
equation_y <- predict(m_read, newdata = tibble(read_1 = equation_x))

# Extracting the coefficients
b_read <- round(coef(m_read), 2)

# The angle of the regression line is the inverse tangent of the slope (converted to degrees)
eq_angle <- atan(b_read[2]) * 180 / pi

# Equation
eq_read <- paste0("italic(Y) == ",
                  b_read[1],
                  " + ",
                  b_read[2],
                  " * italic(X) + italic(e)")
ggplot(d_learn, aes(read_1, read_2)) +
geom_point() +
geom_smooth(method = "lm") +
labs(x = "Reading at Time 1",
y = "Reading at Time 2",
title = "Using Reading at Time 1 to Predict Reading at Time 2") +
coord_fixed() +
theme_minimal() +
annotate(
geom = "text",
x = equation_x,
y = equation_y,
angle = eq_angle,
label = eq_read,
parse = TRUE,
vjust = -0.5)
Question 4
Create a regression model in which you predict math at time 2 (math_2) using math at time 1 (math_1). Call the model fit object m_math1.
Does math_1 predict math_2? Does math_1 still predict math_2 after controlling for read_1?
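A sketch of the first step, which creates the m_math1 object the question names:
m_math1 <- lm(math_2 ~ math_1, data = d_learn)
parameters(m_math1)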
m_math <- lm(math_2 ~ math_1 + read_1, data = d_learn)
summary(m_math)