Generalised Linear Models

BDSI Workshop

Published

September 24, 2024

👋 Welcome

This is a training workshop for BIOL8001 students but extended to also staff and students affiliated with the Australian National University (ANU) offered by the ANU Biological Data Science Institute (BDSI).

The workshop aims to teach generalised linear models for modeling data with non-normal distributions. Participants are expected to have a basic understanding of R and linear models prior to the start of the workshop.

🎯 Learning objectives

Upon completion of this workshop, participants should be able to:

Understand and apply generalized linear models (GLMs) for modeling data with non-normal distributions
Apply GLMs effectively to model binary outcomes (logistic regression) and count data (Poisson or negative binomial regression).
Fit, interpret, and assess the adequacy of GLMs

🔧 Preparation

Please ensure that you download and install

the latest version of R,
the latest version of RStudio Desktop or Positron, and
the following packages by opening RStudio Desktop or Positron, then copy and paste the command below in the Console section, pushing Enter after pasting.

install.packages(c("tidyverse", "broom", "faraway", "GLMsData", "GGally", "statmod"))

For Window users, you may need to install Rtools to install R packages.

📚 Slides

Click on the heading to open the slides in a new tab, or click on details to see the slides on this page.

📑 Resources

Faraway (2009) Linear Models with R (for review of linear models)
Chapters 2, 3, and 5 from Faraway (2016) Extending the Linear Model with R. 2nd Edition.
Dunn and Smyth (2018) Generalized Linear Models With Examples in R

🏋️‍♀️ Self-paced exercises

The following self-paced exercises are designed to check your understanding of the material.

Reflect on learning objectives

You should be able to:

Understand and apply generalized linear models (GLMs) for modeling data with non-normal distributions
Apply GLMs effectively to model binary outcomes (logistic regression) and count data (Poisson or negative binomial regression).
Fit, interpret, and assess the adequacy of GLMs

Question: What might affect the chance of getting heart disease?

About the data

We have the data from Western Collaborative Group Study contains 3154 healthy men, aged from 39 to 59, from the San Francisco area.
At the start of the study, all were free of heart disease.
Eight and a half years later, the study recorded whether these men now suffered from coronary heart disease (chd).
Other recorded variables that might be related to the chance of developing this disease are:
- age - age in years
- height - height in inches
- weight - weight in pounds
- sdp - systolic blood pressure in mm Hg
- dbp - diastolic blood pressure in mm Hg
- chol - fasting serum cholesterol in mm %
- behave - behaviour type (A1, A2, B3, B4)
- cigs - number of cigarettes smoked per day
- dibep - behaviour type (A = Aggressive, P = Passive)
- typechd - type of coronary heart disease (angina, infdeath, none, or silent)
- timechd - time of coronary heart disease or end of follow-up
- arcus - arcus senilis (absent or present)

Load the data

Is this data experimental or observational?

Explore the data (initial data analysis)

There are a number of ways to explore data.
First let’s start by looking at the distribution of the outcome (coronary heart disease). What do you notice?

Next let’s looks at the marginal distribution of numerical covariates by the outcome. Why is the median value for timechd clustered around 3000 for those that didn’t have a coronary heart disease?

Answer

The study went on for 8.5 years (which is about 8.5 \(\times\) 365.25 = 3105 days). The variable timechd for the group that didn’t have a coronary heart disease is related to number of days to end of follow-up. This means that those with earlier time (e.g. 1000 days) within this group means that they dropped out of study for some reason (e.g. due to death or unable to contact participant).

Let’s look also at the marginal distribution of categorical variables by the outcome. Do the levels within a factor have a higher association with chd? What do you notice in particular about typechd?

Answer

typechd is derived from chd so those that did not have a coronary heart disease is all assigned “none” as expected.

ggpairs() from GGally package is useful for looking at the pairwise relationship between any two covariates in the data. It will be slow to compute if you have many variables (and individual graph may be too slow to see) so you may need to subset to a smaller number of variables first.

Let’s zoom in and have a look at the relationship between sdp and dbp. What do you notice about the relationship between these variables?

Answer

The variables sdp and dbp are highly correlated. Systolic pressure (sdp) is the maximum blood pressure during contraction of the ventricles; while diastolic pressure (dbp) is the minimum pressure recorded just prior to the next contraction. Both measure blood pressure (in different ways) so the high correlation is perhaps expected!

What is the relationship between behave and dibep?

Answer

All those that are assigned with value “A” for dibep are either “A1” or “A2” for behave. Similarly all that are assigned with value “B” for dipep are either “B1” or “B2” for behave. This suggests that perhaps behave was a further refinement of behaviour type based on dibep (so behave is nested within dibep).

Model the data

Why would you not (or would you) use typechd and timechd as predictors in the model?

Answer

The variables typechd and timechd are calculated based on the outcome! You can’t use predictors that were derived based on the outcome.

Do you think the behaviour type variable (behave) should be included in the model?

Answer

behave does not appear to be (statistically) significantly contribute to explaining the chd so we omit from the model.

What do you think the best model that explains chd is?

Answer

Variable selection is a hard problem! We can use stepwise selection (which goes through backward selection - drop the least signiciant variable - and forward selection - add the most significant variable, until it meets a certain criteria), but an “automated” selection like this doesn’t account for the domain context.

For example, sdp is in the final model selected by stepwise selection but sdp is highly correlated with dbp. Should dbp been included instead of sdp? Also weight is included but not height. Tall people would naturally weight more than short people with the same body type. Would it have been better to feature engineer another variable that normalises weight with respect to height? Body mass index (which is weight in kg divided by square of body height in metres) is supposed to account for weight with respect to height.

Remember that all models are wrong, but some are useful. Your goal is just to find a useful model that approximates the reality well enough.

Question: What is the effect of gamma radiation (characterised by dose amount and dose rate) on the number of chromosomal abnormalities?

About the data

The data is from Purott and Reeder (1976) where an experiment was conducted to determine the effect of gamma radiation on chromosomal abnormalities.
The data recorded:
- ca: the number of chromosomal abnormalities
- cells: the number of cells
- doseamt: the dose amount
- doserate: the dose rate

Load the data

Is this data experimental or observational?

Explore the data

What are the combination of doseamt and doserate that was used?

Answer

There are 9 levels of doserate spanning from 0.1 to 4 and 3 levels of doseamt (1, 2.5, 5).

doseamt and doserate are recorded as numeric variables. Should we consider it as categorical variables?

Answer

The effect of the doserate may be multiplicative (as indicated by the increasing spread) so we can consider taking the log of the variable and consider its interaction with doseamt.

There are only 3 levels of the doseamt and the quantity doesn’t appear to be linearly related to the response rate. We can consider taking it as a categorical variable.

Model the data

Let’s consider using a linear model for the rate of chromosomal abnormalities. What proportion of the variation does the model explain?

Answer

Remember that you can find the proportion of (response) variation that the model explains from the coefficient of determination (or \(R^2\)). The value is 0.9874, which is quite high and suggest that model fits well.

Let’s check residual plot for model diagnostics. Do you see any issues from the residual plot?

Answer

The variance does not appear to be constant as we see that the spread of the residuals is larger as the fitted value increases.

Let’s fit a Poisson regression instead. What is the default link for Poisson regression?
Now try fitting the Poisson regression model to this data.

Answer

Aim: Propose a model for forest biomass of small-leaved lime trees.

About the data

A series of studies sampled the forest biomass in Eurasia.
We use part of that data for small-leaved lime trees (Tilia cordata).
The data contains the variables:
- Foliage: the foliage biomass (in kg)
- DBH: the tree diameter at breast height (in cm)
- Age: the age of tree (in years)
- Origin: the origin of the tree (Coppice, Natural, or Planted)

Load the data

Is this data experimental or observational?

Explore the data

Let’s start by looking at the pairwise plots of the data.

Before building the model, let’s take into account some domain context. A foliage mostly grows on the outer canopy, which could be crudely approximated as a spherical shape. The surface area of a sphere is given as \(4\pi r^2\) where \(r\) is radius. So we could consider that \(\texttt{Foliage} \propto 4 \pi (\texttt{DBH}/2)^2 = \pi \texttt{DBH}^2\). Taking log of both sides we have \[\log(\texttt{Foliage}) \approx \text{constant} + 2\log(\texttt{DBH}).\] This suggests that we should take the log transformation for Foliage and DBH. What does the data show?

Answer

The scatter plot shows that the log transformations of Foliage and DBH show a linear relationship. The coefficient of log(DBH) is close to 2.

Model the data

Let’s consider the systematic component to be log(Foliage) ~ log(DBH) * Age * Origin for now. Are the two models shown below the same?

What about the two models shown below? Are they the same? Why or why not?

Answer

The models are not the same. The first model is modelling the expected value of the log of the response variable, while the second model is modelling the log of the expected response value, i.e. \(E(\log(Y)) \neq \log(E(Y))\). This is a subtle but important distinction.

Let’s consider breaking the data into small groups by Age and Origin and look at the relationship between mean and variance for Foliage of each group.

Answer

We have \(\log(\text{group variance}) \approx 2\log(\text{group mean})\). This means that \(\text{group variance} \approx \text{group mean}^2\). Recall that \(Var(\mu) = \mu^2\) for Gamma distribution, so this suggests that the Gamma distribution could be a good choice for modelling the data.

Let’s consider the systematic component to be Foliage ~ log(DBH) * Age * Origin and fit a Gamma regression with log link. Do you think Age should be included in the model?

Update the model without Age. Should Origin be included in the model?

Check some model diagnostics to assess your selected model.

Answer

This website is brought to you by the ANU Biological Data Science Institute.

👋 Welcome

🎯 Learning objectives

🔧 Preparation

📚 Slides

Modelling for binary or proportion responses

Modelling for count responses

General structure and diagnostics

📑 Resources

🏋️‍♀️ Self-paced exercises

Load the data

Explore the data (initial data analysis)

Model the data

Load the data

Explore the data

Model the data

Load the data

Explore the data

Model the data