Modelling for binary or proportion responses

ANU BDSI
workshop
Generalised Linear Models

Emi Tanaka

Biological Data Science Institute

24th September 2024

Workshop materials

All materials will be hosted at
https://anu-bdsi.github.io/workshop-GLM/

Case study: Longevity of fruitflies

Aim: Study longevity of fruitflies depending on sexual activity and thorax length

Code
library(tidyverse)
gfly <- faraway::fruitfly |> 
  mutate(activity = factor(activity, levels = c("isolated", "low", "high", "one", "many"))) |> 
  ggplot(aes(x = thorax, y = longevity, color = activity)) + 
  geom_point(data = ~select(., -activity), color = "lightgrey") +
  geom_point(size = 4) + 
  facet_wrap(~activity, labeller = label_both) +
  guides(color = "none") + 
  colorspace::scale_color_discrete_qualitative()

gfly
  • 125 fruitflies were divided randomly into 5 groups of 25 each.
  • The response was the longevity of the fruitfly in days.
  • One group was kept solitary (“isolated”).
  • Another was kept individually with a virgin female each day (“low”).
  • Another group was given 8 virgin females per day (“high”).
  • As an additional control the fourth and fifth groups were kept with one (“one”) or eight (“many”) pregnant females per day.
  • Pregnant fruitflies will not mate.
  • The thorax length of each male was measured as this was known to affect longevity.
  • One observation in the many group has been lost.

Linear model 1

lm(longevity ~ thorax + activity + thorax:activity)

\texttt{longevity}_i = \beta_0 + \beta_1\texttt{thorax}_i + \beta_{2,T(i)} + \beta_{3,T(i)}\texttt{thorax}_i + \epsilon_i

where

  • \beta_0 is the overall intercept,
  • \beta_1 is the effect of thorax length on longevity,
  • \beta_{2,T(i)} is the effect of activity on longevity,
  • T(i) maps to the activity type of the i-th observation,
  • \beta_{3,T(i)} is the interaction effect between thorax length and activity type, and
  • \epsilon_i is the error term.

Note that:

  • We assume \epsilon_i \stackrel{iid}{\sim} N(0, \sigma^2).
  • We use the constraints: \beta_{2,1} = 0 and \beta_{3,1} = 0.
  • \beta_0 + \beta_{2,k} is the intercept for the k-th activity type.
  • \beta_1 + \beta_{3,k} is the slope for the k-th activity type (see the sketch below).
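As a rough sketch (not part of the original slides), the per-group intercepts and slopes can be recovered from the fitted coefficients; the activity factor is reordered as in the plotting code so that "isolated" is the reference level.
Code
library(dplyr)  # already attached if library(tidyverse) was run earlier
fly <- faraway::fruitfly |> 
  mutate(activity = factor(activity, levels = c("isolated", "low", "high", "one", "many")))
fit_fly <- lm(longevity ~ thorax + activity + thorax:activity, data = fly)
cf <- coef(fit_fly)
# "isolated" is the reference level, so beta_{2,1} = beta_{3,1} = 0
cf["(Intercept)"] + cf["activityhigh"]    # intercept for the "high" group
cf["thorax"] + cf["thorax:activityhigh"]  # slope for the "high" group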
Code
gfly + geom_smooth(method = "lm", se = FALSE, color = "black", linewidth = 2)
Code
flyfit1 <- lm(longevity ~ thorax + activity + thorax:activity, data = faraway::fruitfly) |> 
  broom::augment() |> 
  mutate(activity = factor(activity, levels = c("isolated", "low", "high", "one", "many"))) 
gres <- flyfit1 |> 
  ggplot(aes(.fitted, .resid, color = activity)) + 
  geom_point(data = ~select(., -activity), color = "lightgrey") + 
  geom_point(size = 4) +
  #facet_wrap(~activity, labeller = label_both) +
  guides(color = "none") +
  geom_hline(yintercept = 0, color = "black", linetype = 2) +
  colorspace::scale_color_discrete_qualitative() +
  labs(x = "Fitted values", y = "Residual", title = "Residual vs fitted values plot")

gres

Linear model 2

lm(log10(longevity) ~ thorax + activity + thorax:activity)

\log_{10}(\texttt{longevity}_i) = \beta_0 + \beta_1\texttt{thorax}_i + \beta_{2,T(i)} + \beta_{3,T(i)}\texttt{thorax}_i + \epsilon_i

Code
gfly + geom_smooth(method = "lm", se = FALSE, color = "black", linewidth = 2) +
  scale_y_log10()
Code
flyfit2 <- lm(log10(longevity) ~ thorax + activity + thorax:activity, data = faraway::fruitfly) |> 
  broom::augment() |> 
  mutate(.fitted = 10^.fitted) |> 
  mutate(activity = factor(activity, levels = c("isolated", "low", "high", "one", "many"))) 

gres %+% flyfit2

Linear model 3

lm(log10(longevity) ~ thorax + activity)

\log_{10}(\texttt{longevity}_i) = \beta_0 + \beta_1\texttt{thorax}_i + \beta_{2,T(i)} + \epsilon_i

fit1 <- lm(log10(longevity) ~ thorax + activity, data = faraway::fruitfly)
fit2 <- lm(log10(longevity) ~ thorax + activity + thorax:activity, data = faraway::fruitfly)
anova(fit1, fit2)
Analysis of Variance Table

Model 1: log10(longevity) ~ thorax + activity
Model 2: log10(longevity) ~ thorax + activity + thorax:activity
  Res.Df     RSS Df Sum of Sq      F Pr(>F)
1    118 0.83011                           
2    114 0.78511  4  0.044997 1.6334 0.1706

The interaction between thorax and activity is not significant (p = 0.17), so we retain the simpler common-slope model.
Code
fit <- lm(log10(longevity) ~ -1 + thorax + activity, data = faraway::fruitfly)
est <- tibble(slope = coef(fit)[1], 
              intercept = coef(fit)[-1],
              activity = str_remove(names(coef(fit)[-1]), "^activity")) |> 
mutate(activity = factor(activity, levels = c("isolated", "low", "high", "one", "many"))) 
gfly + geom_abline(aes(slope = slope, intercept = intercept), color = "black", linewidth = 2, data = est) + scale_y_log10()
Code
gres %+% (broom::augment(fit) |> 
  mutate(activity = factor(activity, levels = c("isolated", "low", "high", "one", "many"))))

Multiple linear regression

For i = 1, ..., n, the linear regression model is given by

y_i = \beta_0 + \beta_1x_{i1} + \beta_2x_{i2} + \cdots + \beta_kx_{ik} + \epsilon_i= \sum_{j=0}^{k}\beta_jx_{ij} + \epsilon_i

where

  • y_i is the response variable,
  • x_{ij} is the j-th predictor variable for the i-th observation and x_{i0} = 1 for all i,
  • \beta_j is the coefficient for the j-th predictor variable,
  • \epsilon_i is the error term, and
  • we assume that \epsilon_i \stackrel{iid}{\sim} N(0, \sigma^2).

Case study: Breast cancer diagnosis

Aim: diagnose breast cancer from features of a digitized image of a breast mass sample

Code
library(tidyverse)
cancer <- read_csv("https://anu-bdsi.github.io/workshop-GLM/data/cancer.csv") 

cancer |> 
  select(diagnosis, radius_mean, concave_points_mean) |> 
  GGally::ggpairs(mapping = aes(color = diagnosis)) 

  • The image features include the average radius (radius_mean) and the average concave portion of the contour (concave_points_mean).
  • The breast mass sample was diagnosed as malignant (M) or benign (B).

Binary response

Code
cancer |> 
  count(diagnosis) |> 
  ggplot(aes(diagnosis, n)) +
  geom_col(aes(fill = diagnosis)) +
  scale_y_continuous(expand = expansion(add = c(0, 0.1))) +
  labs(y = "Count") +
  guides(fill = "none")

  • The outcome of interest is a binary response (M or B).
  • We can convert it to a numeric value such that M = 1 and B = 0.
  • Should we model this using a linear model?
Code
cancer |> 
  mutate(diagnosis_malignant = as.integer(diagnosis == "M")) |>  # 0/1 indicator: M = 1, B = 0 (in case it is not already in the data)
  ggplot(aes(radius_mean, diagnosis_malignant)) +
  geom_point(alpha = 0.25, size = 2, aes(color = diagnosis)) +
  geom_smooth(method = lm,
              formula = y ~ x,
              se = FALSE,
              color = "black",
              linewidth = 1.2) +
  scale_y_continuous(breaks = c(0, 1)) +
  labs(y = "diagnosis") +
  guides(color = "none")

Concepts

Odds of a class

  • Suppose we consider Y_i as a binary category: Y_i = \begin{cases} 0 & \text{ if $i$-th observation is in class 1}\\ 1 & \text{ if $i$-th observation is in class 2}\\ \end{cases}
  • P(Y_i = 1|\boldsymbol{x}_i), where \boldsymbol{x}_i = (x_{i1}, x_{i2}, \ldots, x_{ik}) are the covariates of the i-th observational unit, is the conditional probability that the i-th observation belongs to class 2.
  • Note P(Y_i=0|\boldsymbol{x}_i) = 1 - P(Y_i=1|\boldsymbol{x}_i).
  • The odds of class 2 (for the i-th observation) is then the ratio: \text{odds}_i = \frac{P(Y_i=1|\boldsymbol{x}_i)}{1-P(Y_i=1|\boldsymbol{x}_i)}. For example, if P(Y_i=1|\boldsymbol{x}_i) = 0.75 then \text{odds}_i = 0.75/0.25 = 3.

Logistic function

f(z) = \frac{e^z}{1+e^z} = \frac{1}{1+e^{-z}}

0 < f(z) < 1 for all finite values of z.

Logit function

g(p) = \log_e \left(\frac{p}{1- p}\right)

-\infty < g(p) < \infty for all p \in (0, 1)

  • Note that the logit and logistic functions are inverses of one another, i.e. f(g(p)) = p for all p \in (0, 1) and g(f(z)) = z for all z (see the check below).
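As a quick check (a sketch added here), both functions can be written directly in R; base R also provides them as plogis() (logistic) and qlogis() (logit).
Code
logistic <- function(z) 1 / (1 + exp(-z))  # f(z) = e^z / (1 + e^z)
logit    <- function(p) log(p / (1 - p))   # g(p) = log_e(p / (1 - p))
logistic(logit(0.2))  # returns 0.2, i.e. f(g(p)) = p
logit(logistic(1.5))  # returns 1.5, i.e. g(f(z)) = z
# Base R equivalents
all.equal(logistic(1.5), plogis(1.5))
all.equal(logit(0.2), qlogis(0.2))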

Binomial distribution

  • Suppose Y is binomially distributed B(n, p).
  • This means that there were n trials, each with a probability p of success, and Y is the number of successes out of n.
  • The expected value of Y is E(Y) = np and Var(Y) = np(1-p) (a quick simulation check follows this list).
  • If n=1, Y can only take the values 0 or 1 and the distribution is referred to as the Bernoulli distribution.
  • If Y\sim Bernoulli(p) then E(Y) = p and Var(Y) = p(1-p).
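A minimal simulation sketch (added here) to confirm the mean and variance formulas:
Code
# Simulate many draws from B(n = 10, p = 0.3) and compare with np and np(1 - p)
set.seed(2024)
y <- rbinom(100000, size = 10, prob = 0.3)
mean(y)  # close to 10 * 0.3 = 3
var(y)   # close to 10 * 0.3 * 0.7 = 2.1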

Logistic regression

Logistic regression for a binary response

  • We assume that Y_i \sim B(1, p_i) where p_i = P(Y_i=1|\boldsymbol{x}_i).
  • Note that E(Y_i|\boldsymbol{x}_i) = p_i.
  • We model the log odds (or logit of p_i) as a linear combination of predictors: \text{logit}(p_i) = \log_e \left(\frac{p_i}{1-p_i}\right) = \sum_{j=0}^k\beta_jx_{ij}
  • The \hat{\beta}_js are found through maximum likelihood estimation.
  • Then \hat{p}_i = \text{logistic}\left(\displaystyle\sum_{j=0}^k\hat{\beta}_jx_{ij}\right) = \dfrac{e^{\sum_{j=0}^k\hat{\beta}_jx_{ij}}}{1+e^{\sum_{j=0}^k\hat{\beta}_jx_{ij}}}.

Logistic regression for a binary response

  • When the response is a binary value (0 or 1), the logistic model is fitted in R with glm() using family = binomial(link = "logit") (see the sketch below).
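A sketch of the fit, assuming the cancer data loaded earlier and a 0/1 indicator diagnosis_malignant (M = 1, B = 0); the two predictors are those interpreted on the following slides.
Code
library(tidyverse)
cancer <- read_csv("https://anu-bdsi.github.io/workshop-GLM/data/cancer.csv") |> 
  mutate(diagnosis_malignant = as.integer(diagnosis == "M"))  # 0/1 indicator: M = 1, B = 0

fit_binary <- glm(diagnosis_malignant ~ radius_mean + concave_points_mean,
                  family = binomial(link = "logit"),
                  data = cancer)
summary(fit_binary)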

Logistic regression for a factor response

  • When the response is a factor, the first level is treated as failure and every other level as success.
  • Watch out for the order of the levels! (A sketch follows below.)
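A sketch assuming the same cancer data, this time with diagnosis supplied as a factor; with alphabetical ordering "B" is the first level and is treated as failure.
Code
# glm() treats the first level of a factor response as failure and the rest as success
cancer_fct <- cancer |> 
  mutate(diagnosis = factor(diagnosis))  # levels: "B" (failure), "M" (success)
levels(cancer_fct$diagnosis)
fit_factor <- glm(diagnosis ~ radius_mean + concave_points_mean,
                  family = binomial(link = "logit"),
                  data = cancer_fct)
# To model P(benign) instead, make "M" the first (failure) level:
# cancer_fct |> mutate(diagnosis = relevel(diagnosis, ref = "M"))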

Interpretation of logistic regression

  • Increasing radius_mean by one unit changes the log odds by \hat{\beta}_1 = 0.639, or equivalently multiplies the odds by e^{\hat\beta_1} \approx 1.894, provided concave_points_mean is held fixed (see the sketch below).
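The same quantities can be read off the fitted object (a sketch, assuming fit_binary from the earlier sketch):
Code
coef(fit_binary)["radius_mean"]       # change in log odds per unit increase (about 0.639)
exp(coef(fit_binary))["radius_mean"]  # multiplicative change in odds (about 1.894)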

Predicting from logistic regression

Alternative method
  • You can also obtain the prediction in R rather than by hand; a sketch using predict() follows the by-hand calculation below.

By hand:

  • \hat{\eta} = -13.699 + 0.639\times 15 + 84.223\times 0.05 \approx 0.0961
  • \hat{p} = \dfrac{\exp(0.0961)}{1 + \exp(0.0961)} \approx 0.524
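A sketch of the same prediction using predict(), assuming fit_binary from the earlier sketch; the quoted values are those from the by-hand calculation above.
Code
newdata <- data.frame(radius_mean = 15, concave_points_mean = 0.05)
predict(fit_binary, newdata = newdata, type = "link")      # eta-hat, approx. 0.0961
predict(fit_binary, newdata = newdata, type = "response")  # p-hat, approx. 0.524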

Logistic regression is a linear classifier

  • We choose a threshold q such that an observation with P(Y_i=1|\boldsymbol{x}_i) \ge q is classified as class 2 (and as class 1 otherwise).
  • The decision boundary between the classes is a point for one predictor, a line for two predictors and a hyperplane for more than two predictors (see the sketch below).
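A sketch of classification with a threshold, assuming fit_binary from the earlier sketch and an illustrative choice of q = 0.5.
Code
q <- 0.5  # illustrative threshold
phat <- predict(fit_binary, type = "response")  # fitted probabilities for the training data
predicted_class <- ifelse(phat >= q, "M", "B")
table(predicted = predicted_class, observed = cancer$diagnosis)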

Case study: Germination of seeds

  • The response does not have to be binary to fit a logistic regression.
  • Observations may be counts (out of a fixed number of trials).
  • Below is an experiment where the number of seeds that germinated was recorded for two types of seeds and two types of root extract.
Code
data(germ, package = "GLMsData")
germ <- germ |> 
  mutate(Proportion_Germ = Germ / Total)
Germ Total Extract Seeds Proportion_Germ
10 39 Bean OA75 0.2564103
23 62 Bean OA75 0.3709677
23 81 Bean OA75 0.2839506
26 51 Bean OA75 0.5098039
17 39 Bean OA75 0.4358974
5 6 Cucumber OA75 0.8333333
53 74 Cucumber OA75 0.7162162
55 72 Cucumber OA75 0.7638889
32 51 Cucumber OA75 0.6274510
46 79 Cucumber OA75 0.5822785
10 13 Cucumber OA75 0.7692308
8 16 Bean OA73 0.5000000
10 30 Bean OA73 0.3333333
8 28 Bean OA73 0.2857143
23 45 Bean OA73 0.5111111
0 4 Bean OA73 0.0000000
3 12 Cucumber OA73 0.2500000
22 41 Cucumber OA73 0.5365854
15 30 Cucumber OA73 0.5000000
32 51 Cucumber OA73 0.6274510
3 7 Cucumber OA73 0.4285714
Code
germ |>  
  mutate(NotGerm = Total - Germ) |> 
  pivot_longer(cols = c(Germ, NotGerm), names_to = "Germ", values_to = "Freq") |>
  ggplot(aes(Germ, Freq)) +
  geom_col() +
  facet_grid(Extract ~ Seeds, labeller = label_both) +
  labs(y = "Count")  

Logistic regression with count responses

  • The response needs to be specified as a two-column matrix with the counts of successes and failures (see the sketch below).
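A sketch with the germ data; the model formula (Seeds + Extract) is an assumption for illustration, not taken from the slides.
Code
data(germ, package = "GLMsData")
# Two-column response: successes (Germ) and failures (Total - Germ)
fit_count <- glm(cbind(Germ, Total - Germ) ~ Seeds + Extract,
                 family = binomial(link = "logit"),
                 data = germ)
summary(fit_count)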

Logistic regression with proportion responses

  • You need the total number of cases for each proportion to fit the model, supplied via the weights argument (see the sketch below).
  • The exception is if the total cases are the same for all observations.
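A sketch of the equivalent proportion-response fit, with the same assumed formula and fit_count from the previous sketch; the totals are passed via weights.
Code
germ <- germ |> 
  mutate(Proportion_Germ = Germ / Total)
fit_prop <- glm(Proportion_Germ ~ Seeds + Extract,
                family = binomial(link = "logit"),
                weights = Total,
                data = germ)
all.equal(coef(fit_prop), coef(fit_count))  # same estimates as the count form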

Summary

  • Logistic regression can be used to model binary, proportion or count (out of total cases) responses by modelling the log odds as a linear combination of predictors.
  • Logistic regression is a linear classifier so it doesn’t work well when the separation between classes is non-linear.
  • To fit a logistic regression in R, it depends on how the response is encoded:
    • if y is binary or factor then use glm(y ~ x, family = binomial(link = "logit")),
    • if y is count and total is total cases then use glm(cbind(y, total - y) ~ x, family = binomial(link = "logit")), and
    • if y is proportion and total is total cases then use glm(y ~ x, family = binomial(link = "logit"), weights = total).