Generalised Linear Models
👋 Welcome
This is a training workshop for BIOL8001 students but extended to also staff and students affiliated with the Australian National University (ANU) offered by the ANU Biological Data Science Institute (BDSI).
The workshop aims to teach generalised linear models for modeling data with non-normal distributions. Participants are expected to have a basic understanding of R and linear models prior to the start of the workshop.
🎯 Learning objectives
Upon completion of this workshop, participants should be able to:
- Understand and apply generalized linear models (GLMs) for modeling data with non-normal distributions
- Apply GLMs effectively to model binary outcomes (logistic regression) and count data (Poisson or negative binomial regression).
- Fit, interpret, and assess the adequacy of GLMs
🔧 Preparation
Please ensure that you download and install
- the latest version of R,
- the latest version of RStudio Desktop or Positron, and
- the following packages by opening RStudio Desktop or Positron, then copy and paste the command below in the Console section, pushing Enter after pasting.
install.packages(c("tidyverse", "broom", "faraway", "GLMsData", "GGally", "statmod"))
- For Window users, you may need to install Rtools to install R packages.
📚 Slides
Click on the heading to open the slides in a new tab, or click on details to see the slides on this page.
Modelling for binary or proportion responses
Modelling for count responses
General structure and diagnostics
📑 Resources
- Faraway (2009) Linear Models with R (for review of linear models)
- Chapters 2, 3, and 5 from Faraway (2016) Extending the Linear Model with R. 2nd Edition.
- Dunn and Smyth (2018) Generalized Linear Models With Examples in R
🏋️♀️ Self-paced exercises
The following self-paced exercises are designed to check your understanding of the material.
- Understand and apply generalized linear models (GLMs) for modeling data with non-normal distributions
- Apply GLMs effectively to model binary outcomes (logistic regression) and count data (Poisson or negative binomial regression).
- Fit, interpret, and assess the adequacy of GLMs
Question: What might affect the chance of getting heart disease?
- We have the data from Western Collaborative Group Study contains 3154 healthy men, aged from 39 to 59, from the San Francisco area.
- At the start of the study, all were free of heart disease.
- Eight and a half years later, the study recorded whether these men now suffered from coronary heart disease (
chd
). - Other recorded variables that might be related to the chance of developing this disease are:
age
- age in yearsheight
- height in inchesweight
- weight in poundssdp
- systolic blood pressure in mm Hgdbp
- diastolic blood pressure in mm Hgchol
- fasting serum cholesterol in mm %behave
- behaviour type (A1, A2, B3, B4)cigs
- number of cigarettes smoked per daydibep
- behaviour type (A = Aggressive, P = Passive)typechd
- type of coronary heart disease (angina, infdeath, none, or silent)timechd
- time of coronary heart disease or end of follow-uparcus
- arcus senilis (absent or present)
Load the data
- Is this data experimental or observational?
Explore the data (initial data analysis)
- There are a number of ways to explore data.
- First let’s start by looking at the distribution of the outcome (coronary heart disease). What do you notice?
- Next let’s looks at the marginal distribution of numerical covariates by the outcome. Why is the median value for
timechd
clustered around 3000 for those that didn’t have a coronary heart disease?
Answer
The study went on for 8.5 years (which is about 8.5 \(\times\) 365.25 = 3105 days). The variable timechd
for the group that didn’t have a coronary heart disease is related to number of days to end of follow-up. This means that those with earlier time (e.g. 1000 days) within this group means that they dropped out of study for some reason (e.g. due to death or unable to contact participant).
- Let’s look also at the marginal distribution of categorical variables by the outcome. Do the levels within a factor have a higher association with
chd
? What do you notice in particular abouttypechd
?
Answer
typechd
is derived from chd
so those that did not have a coronary heart disease is all assigned “none” as expected.
ggpairs()
fromGGally
package is useful for looking at the pairwise relationship between any two covariates in the data. It will be slow to compute if you have many variables (and individual graph may be too slow to see) so you may need to subset to a smaller number of variables first.
- Let’s zoom in and have a look at the relationship between
sdp
anddbp
. What do you notice about the relationship between these variables?
Answer
The variablessdp
and dbp
are highly correlated. Systolic pressure (sdp
) is the maximum blood pressure during contraction of the ventricles; while diastolic pressure (dbp
) is the minimum pressure recorded just prior to the next contraction. Both measure blood pressure (in different ways) so the high correlation is perhaps expected!
- What is the relationship between
behave
anddibep
?
Answer
All those that are assigned with value “A” fordibep
are either “A1” or “A2” for behave
. Similarly all that are assigned with value “B” for dipep
are either “B1” or “B2” for behave
. This suggests that perhaps behave
was a further refinement of behaviour type based on dibep
(so behave
is nested within dibep
).
Model the data
- Why would you not (or would you) use
typechd
andtimechd
as predictors in the model?
Answer
The variables typechd
and timechd
are calculated based on the outcome! You can’t use predictors that were derived based on the outcome.
- Do you think the behaviour type variable (
behave
) should be included in the model?
Answer
behave
does not appear to be (statistically) significantly contribute to explaining the chd
so we omit from the model.
- What do you think the best model that explains
chd
is?
Answer
Variable selection is a hard problem! We can use stepwise selection (which goes through backward selection - drop the least signiciant variable - and forward selection - add the most significant variable, until it meets a certain criteria), but an “automated” selection like this doesn’t account for the domain context.
For example, sdp
is in the final model selected by stepwise selection but sdp
is highly correlated with dbp
. Should dbp
been included instead of sdp
? Also weight
is included but not height
. Tall people would naturally weight more than short people with the same body type. Would it have been better to feature engineer another variable that normalises weight with respect to height? Body mass index (which is weight in kg divided by square of body height in metres) is supposed to account for weight with respect to height.
Question: What is the effect of gamma radiation (characterised by dose amount and dose rate) on the number of chromosomal abnormalities?
- The data is from Purott and Reeder (1976) where an experiment was conducted to determine the effect of gamma radiation on chromosomal abnormalities.
- The data recorded:
ca
: the number of chromosomal abnormalitiescells
: the number of cellsdoseamt
: the dose amountdoserate
: the dose rate
Load the data
- Is this data experimental or observational?
Explore the data
- What are the combination of
doseamt
anddoserate
that was used?
Answer
There are 9 levels ofdoserate
spanning from 0.1 to 4 and 3 levels of doseamt
(1, 2.5, 5).
doseamt
anddoserate
are recorded as numeric variables. Should we consider it as categorical variables?
Answer
The effect of the doserate
may be multiplicative (as indicated by the increasing spread) so we can consider taking the log of the variable and consider its interaction with doseamt
.
There are only 3 levels of the doseamt
and the quantity doesn’t appear to be linearly related to the response rate. We can consider taking it as a categorical variable.
Model the data
- Let’s consider using a linear model for the rate of chromosomal abnormalities. What proportion of the variation does the model explain?
Answer
Remember that you can find the proportion of (response) variation that the model explains from the coefficient of determination (or \(R^2\)). The value is 0.9874, which is quite high and suggest that model fits well.
- Let’s check residual plot for model diagnostics. Do you see any issues from the residual plot?
Answer
The variance does not appear to be constant as we see that the spread of the residuals is larger as the fitted value increases.
Let’s fit a Poisson regression instead. What is the default link for Poisson regression?
Now try fitting the Poisson regression model to this data.
Answer
Aim: Propose a model for forest biomass of small-leaved lime trees.
- A series of studies sampled the forest biomass in Eurasia.
- We use part of that data for small-leaved lime trees (Tilia cordata).
- The data contains the variables:
Foliage
: the foliage biomass (in kg)DBH
: the tree diameter at breast height (in cm)Age
: the age of tree (in years)Origin
: the origin of the tree (Coppice, Natural, or Planted)
Load the data
- Is this data experimental or observational?
Explore the data
- Let’s start by looking at the pairwise plots of the data.
- Before building the model, let’s take into account some domain context. A foliage mostly grows on the outer canopy, which could be crudely approximated as a spherical shape. The surface area of a sphere is given as \(4\pi r^2\) where \(r\) is radius. So we could consider that \(\texttt{Foliage} \propto 4 \pi (\texttt{DBH}/2)^2 = \pi \texttt{DBH}^2\). Taking log of both sides we have \[\log(\texttt{Foliage}) \approx \text{constant} + 2\log(\texttt{DBH}).\] This suggests that we should take the log transformation for
Foliage
andDBH
. What does the data show?
Answer
The scatter plot shows that the log transformations of Foliage
and DBH
show a linear relationship. The coefficient of log(DBH)
is close to 2.
Model the data
- Let’s consider the systematic component to be
log(Foliage) ~ log(DBH) * Age * Origin
for now. Are the two models shown below the same?
- What about the two models shown below? Are they the same? Why or why not?
Answer
The models are not the same. The first model is modelling the expected value of the log of the response variable, while the second model is modelling the log of the expected response value, i.e. \(E(\log(Y)) \neq \log(E(Y))\). This is a subtle but important distinction.
- Let’s consider breaking the data into small groups by
Age
andOrigin
and look at the relationship between mean and variance forFoliage
of each group.
Answer
We have \(\log(\text{group variance}) \approx 2\log(\text{group mean})\). This means that \(\text{group variance} \approx \text{group mean}^2\). Recall that \(Var(\mu) = \mu^2\) for Gamma distribution, so this suggests that the Gamma distribution could be a good choice for modelling the data.
- Let’s consider the systematic component to be
Foliage ~ log(DBH) * Age * Origin
and fit a Gamma regression with log link. Do you thinkAge
should be included in the model?
- Update the model without
Age
. ShouldOrigin
be included in the model?
- Check some model diagnostics to assess your selected model.