Day 2
🎯 Aim
- Be familiar with the
tidyverse
ecosystem - Understand the concept of tidy data
- Tidy data with
tidyr
- Wrangle data with
dplyr
- Plot a layer with
ggplot2
🕙 Schedule
Please note this is an indicative schedule only!
Time | Content |
---|---|
10.00-10.25 | Welcome to the tidyverse |
10.25-10.30 | Break |
10.30-11.00 | Data wrangling with dplyr |
11.00-11.05 | Break |
11.05-11.40 | Getting started with ggplot2 |
11.40-12.00 | Your turn |
📑 Resources
🏋️♀️ Exercises
This exercise is written by Dr. Terry Neeman.
The data for this exercise can be downloaded here.
Write the answer to the following exercise using an R script
Erythrocyte-Platelet complex formation
We have developed an assay to measure the formation of platelet-mediated erythrocyte-parasite complexes in the presence of P. falciparum infected erythrocytes, expressed as the percentage of erythrocytes (out of 500) that have formed complexes. We tested our assay using:
- infected and uninfected erythrocytes,
- 3 different platelet:erythrocyte concentration ratios (0.02:1, 0.2:1, and 2:1),
- 2 different incubation times (2 hours - baseline and 24 hours - final).
We ran the experiment FIVE times. We would like to see if our new assay shows that malaria-infected red blood cells (RBC) induce complex formation.
R workflow
We organise our analyses in R as follows:
make sure spreadsheet is properly formatted.
Preparation: import libraries that we’ll need for the analyses.
Set-up: import the data set, and do some initial checks.
Data management: set data types/ restructure/subset data as needed.
Data exploration: visualisation for assessing patterns/associations.
Fit statistical model:
- assess model assumptions
- statistical inference
- obtain estimates of treatment effects (plus/minus uncertainty)
Graphical or tabular summary of statistical model.
Set up libraries
library(tidyverse)
library(lmerTest)
library(emmeans)
library(ggResidpanel)
Import data and check data structure
We can check the data structure either using the function str()
or glimpse()
How many observations for each Experiment, infection status, concentration ratio and incubation time? Remember to use group_by()
and summarise()
. Can you use pivot_wider()
to make the table easier to read?
Data exploration
Think of a few different ways to explore your data, keeping mind your research question. We would like to know if infected cells induce complex formation. Be sensitive to the order within factors, and change factor levels, if needed.
You might try a few different plots instead of just a single plot.
Fit a model to these data to address the research question. Include Experiment as a blocking factor. (Note: we will talk more about modelling in R next time).
<- lmer(complex ~ time * RBC * ratio + (1|Experiment), data = ery) lm.ery
Use the anova() function to get the ANOVA table. Interpret the table.
anova(lm.ery)
Looking at your exploratory graphs, we can start to piece together the story the data are telling.
Typically, we look at the highest level interactions first. They tell us about interesting patterns. If an interaction is “statistically significant”, but the main effect is not, it could be that the individual effects (contrasts) average out to 0.
(I1) Time:RBC interaction: There is a difference in the incubation time effect between infected and uninfected RBC (averaged across ratio) and across Experiment. This is what we are hoping to see! Describe this difference.
In the absence of treatment interactions, we can describe the main effects.
(M1) There is a platelet:erythrocyte ratio effect. This is estimated as the effect of ratio, averaged across the other 2 conditions (time, RBC) and across Experiments. Referring to your graph, describe that pattern.
(M2) There is an incubation time effect. This is estimated as the effect of incubation time, averaged across the other 2 conditions (ratio, RBC) and across Experiments. Referring to your graph, describe that pattern.
(M3) There is an infection effect. This is estimated as the difference between infected and uninfected RBC, averaged across the other 2 conditions (time, ratio) and across Experiments. Referring to your graph, describe that pattern.
Assess model assumptions with residual plots
resid_panel(lm.ery)
Look at parameter estimates and between-Experiment variation using the summary() function
summary(lm.ery)
Obtain mean estimates and standard errors for plotting using the emmeans() function
<- summary(emmeans(lm.ery, ~time * RBC * ratio))
results1 results1
Finally, create a graphic summarising the model
ggplot(results1, aes(x = RBC, y = emmean, fill = time)) +
geom_col(position = "dodge") +
geom_errorbar(aes(ymin = emmean - SE, ymax = emmean + SE),
width = 0.2,
position = position_dodge(width = 0.9))+
facet_wrap(~ratio) +
scale_fill_brewer(palette = "Paired") +
ylab("Mean Complex formation (%)") +
theme_bw()
Homework
- Go through the exercises here.
- See the Resources section for some helpful related materials.