Probability Distributions

Generalised linear models

Generalised linear models (GLMs) has two components:

A random component

Y_i has a distribution from the exponential family with mean \mu_i = E(Y_i).

A systematic component

A linear predictor: g(\mu_i) = \eta_i = o_i + \beta_0 + \beta_1 x_{i1} + \ldots + \beta_k x_{ik}

where

  • g(\cdot) is a (smooth and invertible) link function, and
  • o_i is the offset value for the i-th observation (if any).

Exponential family

A distribution is part of the exponential family if the probability density/mass function can be written in the form:

f(y;\theta, \phi) = a(y, \theta)\text{exp}\left(\frac{y\cdot\theta - \kappa(\theta)}{\phi}\right)

  • \theta is the canonical parameter
  • \kappa(\theta) is a known function called the cumulant function
  • \phi > 0 is the dispersion parameter
  • a(y, \theta) is a normalising constant.

Common GLMs

Family Default Link g(\mu) g^{-1}(\eta) Range of Y_i Var(\mu)
Normal/Gaussian Identity \mu \eta (-\infty, \infty) 1
Gamma Inverse \frac{1}{\mu} \frac{1}{\eta} (0, \infty) \mu^2
Binomial Logit \log_e(\frac{\mu}{1 - \mu}) \frac{\text{exp}(\eta)}{1 + \text{exp}(\eta)} [0, 1] \mu(1-\mu)
Poisson Log \log_e(\mu) \text{exp}(\eta) [0, \infty) \mu
Negative Binomial Log \log_e(\mu) \text{exp}(\eta) [0, \infty) \mu + \mu^2/k
  • \mu is the expected value on the response scale
  • \eta is the expected value on the link scale

Fitting GLMs in R

glm(y ~ x, family = dist(link = "function"), data = data)


Family Function Default Link Other Links
Normal/Gaussian gaussian identity log, inverse
Gamma Gamma inverse identity, log
Binomial binomial logit probit, cauchit, log, cloglog
Poisson poisson log log, identity, sqrt
Negative Binomial* MASS::glm.nb log sqrt, identity
  • The default links in R are the canonical link (except for the Negative Binomial)
  • *Negative Binomial is specified with glm.nb from the MASS package and does not require family to be specified.

Which GLM to use?

  • For the random component:
    • What is the response type? E.g. positive integer, continuous over infinite support, etc.
    • Look at the relationship between the group mean and the group variance
      (left as an exercises – see Exercise 3)
  • For the systematic component:
    • How are the explanatory variables related to the mean of the response variable?
    • Do the explanatory variables need to be transformed or removed?
  • For the practical component:
    • Are the interpretation from the fitted model understandable?
    • Can you fit the model in practice?

Diagnostics

Residual plot

  • Plots of residuals vs fitted values and each of the covariates is important to check.
  • If there are any trends in the plot, then the systematic component can be improved, e.g. change the link function, add extra predictors or transform predictors.
  • If the random component is correct, then the variance should be approximately constant across fitted values and predictors.
  • For discrete responses, it is better to use quantile residuals.

Quantile residual plot

Code
cancer <- read_csv("https://anu-bdsi.github.io/workshop-GLM/data/cancer.csv")
cancer_fit <- glm(diagnosis_malignant ~ radius_mean + concave_points_mean, 
                  data = cancer, 
                  family = binomial(link = "logit"))

cancer_augment <- cancer_fit |> 
  broom::augment(type.predict = "response") |> 
  mutate(qres = statmod::qresid(cancer_fit))
Code
cancer_augment |> 
  ggplot(aes(.fitted, .resid)) +
    geom_point() + 
    geom_hline(yintercept = 0, linetype = 2) +
    labs(y = "Deviance residuals", x = "Fitted values")

Code
cancer_augment |> 
  ggplot(aes(.fitted, qres)) +
    geom_point() + 
    geom_hline(yintercept = 0, linetype = 2) +
    labs(y = "Quantile residuals", x = "Fitted values")

Q-Q plot

Code
cancer_augment |> 
  ggplot(aes(sample = qres)) +
    stat_qq() + 
    stat_qq_line() +
    labs(title = "Q-Q plot of quantile residuals",
         x = "Theoretical quantiles", 
         y = "Sample quantiles")

Simulate from fitted model and check the Q-Q plot of quantile residuals to know the “typical” Q-Q plot:

Code
simulate(cancer_fit, nsim = 6) |> 
  bind_cols(cancer) |> 
  pivot_longer(starts_with("sim_"),
               names_to = "sim",
               values_to = "sim_response") |> 
  nest(.by = sim) |> 
  mutate(fit = map(data, ~update(cancer_fit, data = .))) |> 
  mutate(qres = map(fit, ~statmod::qresid(.))) |> 
  unnest_longer(qres) |> 
  ggplot(aes(sample = qres)) +
  stat_qq() +
  stat_qq_line() + 
  facet_wrap(~sim) + 
  coord_equal()

Summary

  • A generalised linear model are regression models that consist of two components:
    • a random component (requiring decision on a distribution from the exponential family): Y_i \sim \text{Exponential family}
    • a systematic component (requiring decision on the link function and predictors): g(\mu_i) = \eta_i = o_i + \beta_0 + \beta_1 x_{i1} + \ldots + \beta_k x_{ik}.
  • Diagnose the model using (quantile) residual plots, Q-Q plots, and other methods (e.g. influential points or outliers).