General structure and diagnostics

ANU BDSI
workshop Generalised Linear Models

Emi Tanaka

Biological Data Science Institute

27th September 2024

Probability Distributions

Generalised linear models

Generalised linear models (GLMs) has two components:

A random component

Y_i has a distribution from the exponential family with mean \mu_i = E(Y_i).

A systematic component

A linear predictor: g(\mu_i) = \eta_i = o_i + \beta_0 + \beta_1 x_{i1} + \ldots + \beta_k x_{ik}

where

g(\cdot) is a (smooth and invertible) link function, and
o_i is the offset value for the i-th observation (if any).

Exponential family

A distribution is part of the exponential family if the probability density/mass function can be written in the form:

f(y;\theta, \phi) = a(y, \theta)\text{exp}\left(\frac{y\cdot\theta - \kappa(\theta)}{\phi}\right)

\theta is the canonical parameter
\kappa(\theta) is a known function called the cumulant function
\phi > 0 is the dispersion parameter
a(y, \theta) is a normalising constant.

Common GLMs

Family	Default Link	g(\mu)	g^{-1}(\eta)	Range of Y_i	Var(\mu)
Normal/Gaussian	Identity	\mu	\eta	(-\infty, \infty)	1
Gamma	Inverse	\frac{1}{\mu}	\frac{1}{\eta}	(0, \infty)	\mu^2
Binomial	Logit	\log_e(\frac{\mu}{1 - \mu})	\frac{\text{exp}(\eta)}{1 + \text{exp}(\eta)}	[0, 1]	\mu(1-\mu)
Poisson	Log	\log_e(\mu)	\text{exp}(\eta)	[0, \infty)	\mu
Negative Binomial	Log	\log_e(\mu)	\text{exp}(\eta)	[0, \infty)	\mu + \mu^2/k

\mu is the expected value on the response scale
\eta is the expected value on the link scale

Fitting GLMs in R

glm(y ~ x, family = dist(link = "function"), data = data)

Family	Function	Default Link	Other Links
Normal/Gaussian	`gaussian`	identity	log, inverse
Gamma	`Gamma`	inverse	identity, log
Binomial	`binomial`	logit	probit, cauchit, log, cloglog
Poisson	`poisson`	log	log, identity, sqrt
Negative Binomial^*	`MASS::glm.nb`	log	sqrt, identity

The default links in R are the canonical link (except for the Negative Binomial)
^*Negative Binomial is specified with glm.nb from the MASS package and does not require family to be specified.

Which GLM to use?

For the random component:
- What is the response type? E.g. positive integer, continuous over infinite support, etc.
- Look at the relationship between the group mean and the group variance
  (left as an exercises – see Exercise 3)
For the systematic component:
- How are the explanatory variables related to the mean of the response variable?
- Do the explanatory variables need to be transformed or removed?
For the practical component:
- Are the interpretation from the fitted model understandable?
- Can you fit the model in practice?

Diagnostics

Residual plot

Plots of residuals vs fitted values and each of the covariates is important to check.
If there are any trends in the plot, then the systematic component can be improved, e.g. change the link function, add extra predictors or transform predictors.
If the random component is correct, then the variance should be approximately constant across fitted values and predictors.
For discrete responses, it is better to use quantile residuals.

Quantile residual plot

Code

cancer <- read_csv("https://anu-bdsi.github.io/workshop-GLM/data/cancer.csv")
cancer_fit <- glm(diagnosis_malignant ~ radius_mean + concave_points_mean, 
                  data = cancer, 
                  family = binomial(link = "logit"))

cancer_augment <- cancer_fit |> 
  broom::augment(type.predict = "response") |> 
  mutate(qres = statmod::qresid(cancer_fit))

Code

cancer_augment |> 
  ggplot(aes(.fitted, .resid)) +
    geom_point() + 
    geom_hline(yintercept = 0, linetype = 2) +
    labs(y = "Deviance residuals", x = "Fitted values")

Code

cancer_augment |> 
  ggplot(aes(.fitted, qres)) +
    geom_point() + 
    geom_hline(yintercept = 0, linetype = 2) +
    labs(y = "Quantile residuals", x = "Fitted values")

Q-Q plot

Code

cancer_augment |> 
  ggplot(aes(sample = qres)) +
    stat_qq() + 
    stat_qq_line() +
    labs(title = "Q-Q plot of quantile residuals",
         x = "Theoretical quantiles", 
         y = "Sample quantiles")

Simulate from fitted model and check the Q-Q plot of quantile residuals to know the “typical” Q-Q plot:

Code

simulate(cancer_fit, nsim = 6) |> 
  bind_cols(cancer) |> 
  pivot_longer(starts_with("sim_"),
               names_to = "sim",
               values_to = "sim_response") |> 
  nest(.by = sim) |> 
  mutate(fit = map(data, ~update(cancer_fit, data = .))) |> 
  mutate(qres = map(fit, ~statmod::qresid(.))) |> 
  unnest_longer(qres) |> 
  ggplot(aes(sample = qres)) +
  stat_qq() +
  stat_qq_line() + 
  facet_wrap(~sim) + 
  coord_equal()

Summary

A generalised linear model are regression models that consist of two components:
- a random component (requiring decision on a distribution from the exponential family): Y_i \sim \text{Exponential family}
- a systematic component (requiring decision on the link function and predictors): g(\mu_i) = \eta_i = o_i + \beta_0 + \beta_1 x_{i1} + \ldots + \beta_k x_{ik}.
Diagnose the model using (quantile) residual plots, Q-Q plots, and other methods (e.g. influential points or outliers).