Modelling for binary or proportion responses

ANU BDSI
workshop
Generalised Linear Models

Emi Tanaka

Biological Data Science Institute

24th September 2024

Workshop materials

All materials will be hosted at
https://anu-bdsi.github.io/workshop-GLM/

Case study: Longevity of fruitflies

Aim: Study longevity of fruitflies depending on sexual activity and thorax length

Code
library(tidyverse)
gfly <- faraway::fruitfly |> 
  mutate(activity = factor(activity, levels = c("isolated", "low", "high", "one", "many"))) |> 
  ggplot(aes(x = thorax, y = longevity, color = activity)) + 
  geom_point(data = ~select(., -activity), color = "lightgrey") +
  geom_point(size = 4) + 
  facet_wrap(~activity, labeller = label_both) +
  guides(color = "none") + 
  colorspace::scale_color_discrete_qualitative()

gfly
  • 125 fruitflies were divided randomly into 5 groups of 25 each.
  • The response was the longevity of the fruitfly in days.
  • One group was kept solitary (“isolated”).
  • Another was kept individually with a virgin female each day (“low”).
  • Another group was given 8 virgin females per day (“high”).
  • As an additional control the fourth and fifth groups were kept with one (“one”) or eight (“many”) pregnant females per day.
  • Pregnant fruitflies will not mate.
  • The thorax length of each male was measured as this was known to affect longevity.
  • One observation in the many group has been lost.

Linear model 1

lm(longevity ~ thorax + activity + thorax:activity)

\texttt{longevity}_i = \beta_0 + \beta_1\texttt{thorax}_i + \beta_{2,T(i)} + \beta_{3,T(i)}\texttt{thorax}_i + \epsilon_i

where

  • \beta_0 is the overall intercept,
  • \beta_1 is the effect of thorax length on longevity,
  • \beta_{2,T(i)} is the effect of activity on longevity,
  • T(i) maps to the activity type of the i-th observation,
  • \beta_{3,T(i)} is the interaction effect between thorax length and activity type, and
  • \epsilon_i is the error term.

Note that:

  • We assume \epsilon_i \stackrel{iid}{\sim} N(0, \sigma^2).
  • We use the constraints: \beta_{2,1} = 0 and \beta_{3,1} = 0.
  • \beta_0 + \beta_{2,k} is the intercept for the k-th activity type.
  • \beta_1 + \beta_{3,k} is the slope for the k-th activity type (see the sketch below).
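As a rough sketch (not part of the original slides), the per-group intercepts and slopes can be recovered from the fitted coefficients; the activity factor is reordered as in the plotting code so that "isolated" is the reference level.
Code
library(dplyr)  # already attached if library(tidyverse) was run earlier
fly <- faraway::fruitfly |> 
  mutate(activity = factor(activity, levels = c("isolated", "low", "high", "one", "many")))
fit_fly <- lm(longevity ~ thorax + activity + thorax:activity, data = fly)
cf <- coef(fit_fly)
# "isolated" is the reference level, so beta_{2,1} = beta_{3,1} = 0
cf["(Intercept)"] + cf["activityhigh"]    # intercept for the "high" group
cf["thorax"] + cf["thorax:activityhigh"]  # slope for the "high" group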
Code
gfly + geom_smooth(method = "lm", se = FALSE, color = "black", linewidth = 2)
Code
flyfit1 <- lm(longevity ~ thorax + activity + thorax:activity, data = faraway::fruitfly) |> 
  broom::augment() |> 
  mutate(activity = factor(activity, levels = c("isolated", "low", "high", "one", "many"))) 
gres <- flyfit1 |> 
  ggplot(aes(.fitted, .resid, color = activity)) + 
  geom_point(data = ~select(., -activity), color = "lightgrey") + 
  geom_point(size = 4) +
  #facet_wrap(~activity, labeller = label_both) +
  guides(color = "none") +
  geom_hline(yintercept = 0, color = "black", linetype = 2) +
  colorspace::scale_color_discrete_qualitative() +
  labs(x = "Fitted values", y = "Residual", title = "Residual vs fitted values plot")

gres

Linear model 2

lm(log10(longevity) ~ thorax + activity + thorax:activity)

\log_{10}(\texttt{longevity}_i) = \beta_0 + \beta_1\texttt{thorax}_i + \beta_{2,T(i)} + \beta_{3,T(i)}\texttt{thorax}_i + \epsilon_i

Code
gfly + geom_smooth(method = "lm", se = FALSE, color = "black", linewidth = 2) +
  scale_y_log10()
Code
flyfit2 <- lm(log10(longevity) ~ thorax + activity + thorax:activity, data = faraway::fruitfly) |> 
  broom::augment() |> 
  mutate(.fitted = 10^.fitted) |> 
  mutate(activity = factor(activity, levels = c("isolated", "low", "high", "one", "many"))) 

gres %+% flyfit2

Linear model 3

lm(log10(longevity) ~ thorax + activity)

\log_{10}(\texttt{longevity}_i) = \beta_0 + \beta_1\texttt{thorax}_i + \beta_{2,T(i)} + \epsilon_i

fit1 <- lm(log10(longevity) ~ thorax + activity, data = faraway::fruitfly)
fit2 <- lm(log10(longevity) ~ thorax + activity + thorax:activity, data = faraway::fruitfly)
anova(fit1, fit2)
Analysis of Variance Table

Model 1: log10(longevity) ~ thorax + activity
Model 2: log10(longevity) ~ thorax + activity + thorax:activity
  Res.Df     RSS Df Sum of Sq      F Pr(>F)
1    118 0.83011                           
2    114 0.78511  4  0.044997 1.6334 0.1706

The interaction between thorax and activity is not significant (p = 0.17), so we retain the simpler common-slope model.
Code
fit <- lm(log10(longevity) ~ -1 + thorax + activity, data = faraway::fruitfly)
est <- tibble(slope = coef(fit)[1], 
              intercept = coef(fit)[-1],
              activity = str_remove(names(coef(fit)[-1]), "^activity")) |> 
mutate(activity = factor(activity, levels = c("isolated", "low", "high", "one", "many"))) 
gfly + geom_abline(aes(slope = slope, intercept = intercept), color = "black", linewidth = 2, data = est) + scale_y_log10()
Code
gres %+% (broom::augment(fit) |> 
  mutate(activity = factor(activity, levels = c("isolated", "low", "high", "one", "many"))))

Multiple linear regression

For i = 1, ..., n, the linear regression model is given by

y_i = \beta_0 + \beta_1x_{i1} + \beta_2x_{i2} + \cdots + \beta_kx_{ik} + \epsilon_i= \sum_{j=0}^{k}\beta_jx_{ij} + \epsilon_i

where

  • y_i is the response variable,
  • x_{ij} is the j-th predictor variable for the i-th observation and x_{i0} = 1 for all i,
  • \beta_j is the coefficient for the j-th predictor variable,
  • \epsilon_i is the error term, and
  • we assume that \epsilon_i \stackrel{iid}{\sim} N(0, \sigma^2).

Case study: Breast cancer diagnosis

Aim: diagnose breast cancer from features of a digitized image of a breast mass sample

Code
library(tidyverse)
cancer <- read_csv("https://anu-bdsi.github.io/workshop-GLM/data/cancer.csv") 

cancer |> 
  select(diagnosis, radius_mean, concave_points_mean) |> 
  GGally::ggpairs(mapping = aes(color = diagnosis)) 

  • The image features include the average radius (radius_mean) and the average concave portion of the contour (concave_points_mean).
  • The breast mass sample was diagnosed as malignant (M) or benign (B).

Binary response

Code
cancer |> 
  count(diagnosis) |> 
  ggplot(aes(diagnosis, n)) +
  geom_col(aes(fill = diagnosis)) +
  scale_y_continuous(expand = expansion(add = c(0, 0.1))) +
  labs(y = "Count") +
  guides(fill = "none")

  • The outcome of interest is a binary response (M or B).
  • We can convert it to a numeric value such that M = 1 and B = 0.
  • Should we model this using a linear model?
Code
cancer |> 
  mutate(diagnosis_malignant = as.integer(diagnosis == "M")) |>  # 0/1 indicator: M = 1, B = 0 (in case it is not already in the data)
  ggplot(aes(radius_mean, diagnosis_malignant)) +
  geom_point(alpha = 0.25, size = 2, aes(color = diagnosis)) +
  geom_smooth(method = lm,
              formula = y ~ x,
              se = FALSE,
              color = "black",
              linewidth = 1.2) +
  scale_y_continuous(breaks = c(0, 1)) +
  labs(y = "diagnosis") +
  guides(color = "none")

Concepts

Odds of a class

  • Suppose we consider Y_i as a binary category: Y_i = \begin{cases} 0 & \text{ if $i$-th observation is in class 1}\\ 1 & \text{ if $i$-th observation is in class 2}\\ \end{cases}
  • P(Y_i = 1|\boldsymbol{x}_i), where \boldsymbol{x}_i = (x_{i1}, x_{i2}, \ldots, x_{ik}) are the covariates of the i-th observational unit, is the conditional probability that the i-th observation belongs to class 2.
  • Note P(Y_i=0|\boldsymbol{x}_i) = 1 - P(Y_i=1|\boldsymbol{x}_i).
  • The odds of class 2 (for the i-th observation) is then the ratio: \text{odds}_i = \frac{P(Y_i=1|\boldsymbol{x}_i)}{1-P(Y_i=1|\boldsymbol{x}_i)}. For example, if P(Y_i=1|\boldsymbol{x}_i) = 0.75 then \text{odds}_i = 0.75/0.25 = 3.

Logistic function

f(z) = \frac{e^z}{1+e^z} = \frac{1}{1+e^{-z}}

0 < f(z) < 1 for all finite values of z.

Logit function

g(p) = \log_e \left(\frac{p}{1- p}\right)

-\infty < g(p) < \infty for all p \in (0, 1)

  • Note that the logit and logistic functions are inverses of one another, i.e. f(g(p)) = p for all p \in (0, 1) and g(f(z)) = z for all z (see the check below).
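As a quick check (a sketch added here), both functions can be written directly in R; base R also provides them as plogis() (logistic) and qlogis() (logit).
Code
logistic <- function(z) 1 / (1 + exp(-z))  # f(z) = e^z / (1 + e^z)
logit    <- function(p) log(p / (1 - p))   # g(p) = log_e(p / (1 - p))
logistic(logit(0.2))  # returns 0.2, i.e. f(g(p)) = p
logit(logistic(1.5))  # returns 1.5, i.e. g(f(z)) = z
# Base R equivalents
all.equal(logistic(1.5), plogis(1.5))
all.equal(logit(0.2), qlogis(0.2))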

Binomial distribution

  • Suppose Y is binomially distributed B(n, p).
  • This means that there were n trials, each with a probability p of success, and Y is the number of successes out of n.
  • The expected value of Y is E(Y) = np and Var(Y) = np(1-p) (a quick simulation check follows this list).
  • If n=1, Y can only take the values 0 or 1 and the distribution is referred to as the Bernoulli distribution.
  • If Y\sim Bernoulli(p) then E(Y) = p and Var(Y) = p(1-p).
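A minimal simulation sketch (added here) to confirm the mean and variance formulas:
Code
# Simulate many draws from B(n = 10, p = 0.3) and compare with np and np(1 - p)
set.seed(2024)
y <- rbinom(100000, size = 10, prob = 0.3)
mean(y)  # close to 10 * 0.3 = 3
var(y)   # close to 10 * 0.3 * 0.7 = 2.1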

Logistic regression

Logistic regression for a binary response

  • We assume that Y_i \sim B(1, p_i) where p_i = P(Y_i=1|\boldsymbol{x}_i).
  • Note that E(Y_i|\boldsymbol{x}_i) = p_i.
  • We model the log odds (or logit of p_i) as a linear combination of predictors: \text{logit}(p_i) = \log_e \left(\frac{p_i}{1-p_i}\right) = \sum_{j=0}^k\beta_jx_{ij}
  • The \hat{\beta}_js are found through maximum likelihood estimation.
  • Then \hat{p}_i = \text{logistic}\left(\displaystyle\sum_{j=0}^k\hat{\beta}_jx_{ij}\right) = \dfrac{e^{\sum_{j=0}^k\hat{\beta}_jx_{ij}}}{1+e^{\sum_{j=0}^k\hat{\beta}_jx_{ij}}}.

Logistic regression for a binary response

  • When the response is a binary value (0 or 1), the logistic model is fitted in R with glm() using family = binomial(link = "logit") (see the sketch below).
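A sketch of the fit, assuming the cancer data loaded earlier and a 0/1 indicator diagnosis_malignant (M = 1, B = 0); the two predictors are those interpreted on the following slides.
Code
library(tidyverse)
cancer <- read_csv("https://anu-bdsi.github.io/workshop-GLM/data/cancer.csv") |> 
  mutate(diagnosis_malignant = as.integer(diagnosis == "M"))  # 0/1 indicator: M = 1, B = 0

fit_binary <- glm(diagnosis_malignant ~ radius_mean + concave_points_mean,
                  family = binomial(link = "logit"),
                  data = cancer)
summary(fit_binary)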

Logistic regression for a factor response

  • When the response is a factor, the first level is treated as failure and every other level as success.
  • Watch out for the order of the levels! (A sketch follows below.)
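A sketch assuming the same cancer data, this time with diagnosis supplied as a factor; with alphabetical ordering "B" is the first level and is treated as failure.
Code
# glm() treats the first level of a factor response as failure and the rest as success
cancer_fct <- cancer |> 
  mutate(diagnosis = factor(diagnosis))  # levels: "B" (failure), "M" (success)
levels(cancer_fct$diagnosis)
fit_factor <- glm(diagnosis ~ radius_mean + concave_points_mean,
                  family = binomial(link = "logit"),
                  data = cancer_fct)
# To model P(benign) instead, make "M" the first (failure) level:
# cancer_fct |> mutate(diagnosis = relevel(diagnosis, ref = "M"))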

Interpretation of logistic regression

  • Increasing radius_mean by one unit changes the log odds by \hat{\beta}_1 = 0.639, or equivalently multiplies the odds by e^{\hat\beta_1} \approx 1.894, provided concave_points_mean is held fixed (see the sketch below).
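The same quantities can be read off the fitted object (a sketch, assuming fit_binary from the earlier sketch):
Code
coef(fit_binary)["radius_mean"]       # change in log odds per unit increase (about 0.639)
exp(coef(fit_binary))["radius_mean"]  # multiplicative change in odds (about 1.894)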

Predicting from logistic regression

Alternative method
  • You can also obtain the prediction in R rather than by hand; a sketch using predict() follows the by-hand calculation below.

By hand:

  • \hat{\eta} = -13.699 + 0.639\times 15 + 84.223\times 0.05 \approx 0.0961
  • \hat{p} = \dfrac{\exp(0.0961)}{1 + \exp(0.0961)} \approx 0.524
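A sketch of the same prediction using predict(), assuming fit_binary from the earlier sketch; the quoted values are those from the by-hand calculation above.
Code
newdata <- data.frame(radius_mean = 15, concave_points_mean = 0.05)
predict(fit_binary, newdata = newdata, type = "link")      # eta-hat, approx. 0.0961
predict(fit_binary, newdata = newdata, type = "response")  # p-hat, approx. 0.524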

Logistic regression is a linear classifier

  • We choose a threshold q such that an observation with P(Y_i=1|\boldsymbol{x}_i) \ge q is classified as class 2 (and as class 1 otherwise).
  • The decision boundary between the classes is a point for one predictor, a line for two predictors and a hyperplane for more than two predictors (see the sketch below).
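A sketch of classification with a threshold, assuming fit_binary from the earlier sketch and an illustrative choice of q = 0.5.
Code
q <- 0.5  # illustrative threshold
phat <- predict(fit_binary, type = "response")  # fitted probabilities for the training data
predicted_class <- ifelse(phat >= q, "M", "B")
table(predicted = predicted_class, observed = cancer$diagnosis)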

Case study: Germination of seeds

  • The response does not have to be binary to fit a logistic regression.
  • Observations may be counts (out of a fixed number of trials).
  • Below is an experiment where the number of seeds that germinated was recorded for two types of seeds and two types of root extract.
Code
data(germ, package = "GLMsData")
germ <- germ |> 
  mutate(Proportion_Germ = Germ / Total)
Germ Total Extract Seeds Proportion_Germ
10 39 Bean OA75 0.2564103
23 62 Bean OA75 0.3709677
23 81 Bean OA75 0.2839506
26 51 Bean OA75 0.5098039
17 39 Bean OA75 0.4358974
5 6 Cucumber OA75 0.8333333
53 74 Cucumber OA75 0.7162162
55 72 Cucumber OA75 0.7638889
32 51 Cucumber OA75 0.6274510
46 79 Cucumber OA75 0.5822785
10 13 Cucumber OA75 0.7692308
8 16 Bean OA73 0.5000000
10 30 Bean OA73 0.3333333
8 28 Bean OA73 0.2857143
23 45 Bean OA73 0.5111111
0 4 Bean OA73 0.0000000
3 12 Cucumber OA73 0.2500000
22 41 Cucumber OA73 0.5365854
15 30 Cucumber OA73 0.5000000
32 51 Cucumber OA73 0.6274510
3 7 Cucumber OA73 0.4285714
Code
germ |>  
  mutate(NotGerm = Total - Germ) |> 
  pivot_longer(cols = c(Germ, NotGerm), names_to = "Germ", values_to = "Freq") |>
  ggplot(aes(Germ, Freq)) +
  geom_col() +
  facet_grid(Extract ~ Seeds, labeller = label_both) +
  labs(y = "Count")  

Logistic regression with count responses

  • The response needs to be specified as a two-column matrix with the counts of successes and failures (see the sketch below).
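A sketch with the germ data; the model formula (Seeds + Extract) is an assumption for illustration, not taken from the slides.
Code
data(germ, package = "GLMsData")
# Two-column response: successes (Germ) and failures (Total - Germ)
fit_count <- glm(cbind(Germ, Total - Germ) ~ Seeds + Extract,
                 family = binomial(link = "logit"),
                 data = germ)
summary(fit_count)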

Logistic regression with proportion responses

  • You need the total number of cases for each proportion to fit the model, supplied via the weights argument (see the sketch below).
  • The exception is if the total cases are the same for all observations.
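A sketch of the equivalent proportion-response fit, with the same assumed formula and fit_count from the previous sketch; the totals are passed via weights.
Code
germ <- germ |> 
  mutate(Proportion_Germ = Germ / Total)
fit_prop <- glm(Proportion_Germ ~ Seeds + Extract,
                family = binomial(link = "logit"),
                weights = Total,
                data = germ)
all.equal(coef(fit_prop), coef(fit_count))  # same estimates as the count form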

Summary

  • Logistic regression can be used to model binary, proportion or count (out of total cases) responses by modelling the log odds as a linear combination of predictors.
  • Logistic regression is a linear classifier so it doesn’t work well when the separation between classes is non-linear.
  • To fit a logistic regression in R, it depends on how the response is encoded:
    • if y is binary or factor then use glm(y ~ x, family = binomial(link = "logit")),
    • if y is count and total is total cases then use glm(cbind(y, total - y) ~ x, family = binomial(link = "logit")), and
    • if y is proportion and total is total cases then use glm(y ~ x, family = binomial(link = "logit"), weights = total).