Data wrangling with dplyr

BDSI R Training I

Emi Tanaka

Biological Data Science Institute

3rd November 2023

A grammar of data manipulation

library(dplyr) # or library(tidyverse)
  • dplyr is a core package in tidyverse
  • The earlier concept of dplyr (first on CRAN in 2014-01-29) was implemented in plyr (first on CRAN in 2008-10-08)
  • The functions in dplyr has been evolving frequently but dplyr v1.0.0 was released on CRAN in 2020-05-29
  • This new version contained new “verbs”
  • The major release suggests that functions in dplyr are maturing and thus the user interface is unlikely to change

Hadley Wickham, Romain François, Lionel Henry and Kirill Müller (2020). dplyr: A Grammar of Data Manipulation. R package version 1.0.2.

Hadley Wickham (2011). The Split-Apply-Combine Strategy for Data Analysis. Journal of Statistical Software, 40(1), 1-29.

dplyr “verbs”

  • The main functions of dplyr include:
    arrange select mutate rename group_by summarise
  • Notice that these functions are verbs
  • Functions in dplyr generally have the form:

verb(data, args)

  • I.e., the first argument data is a data.frame object
  • What do you think the following will do?

rename(mtcars, miles_per_gallon = mpg)
arrange(mtcars, wt)

Pipe operator %>%

  • Almost all tidyverse packages import the magrittr package to use %>%

  • x %>% f(y) is the same as f(x, y)

  • x %>% f(y) %>% g(z) is the same as g(f(x, y), z)

  • When you see the pipe operator %>%, read it as “and then”

mtcars %>%                          # take mtcars data, and then
 rename(miles_per_gallon = mpg) %>% # rename mpg as miles_per_gallon, and then
 arrange(wt)                        # arrange row by wt
                    miles_per_gallon cyl  disp  hp drat
Lotus Europa                    30.4   4  95.1 113 3.77
Honda Civic                     30.4   4  75.7  52 4.93
Toyota Corolla                  33.9   4  71.1  65 4.22
Fiat X1-9                       27.3   4  79.0  66 4.08
Porsche 914-2                   26.0   4 120.3  91 4.43
Fiat 128                        32.4   4  78.7  66 4.08
Datsun 710                      22.8   4 108.0  93 3.85
Toyota Corona                   21.5   4 120.1  97 3.70
Mazda RX4                       21.0   6 160.0 110 3.90
Ferrari Dino                    19.7   6 145.0 175 3.62
Volvo 142E                      21.4   4 121.0 109 4.11
Mazda RX4 Wag                   21.0   6 160.0 110 3.90
Merc 230                        22.8   4 140.8  95 3.92
Ford Pantera L                  15.8   8 351.0 264 4.22
Merc 240D                       24.4   4 146.7  62 3.69
Hornet 4 Drive                  21.4   6 258.0 110 3.08
AMC Javelin                     15.2   8 304.0 150 3.15
Hornet Sportabout               18.7   8 360.0 175 3.15
Merc 280                        19.2   6 167.6 123 3.92
Merc 280C                       17.8   6 167.6 123 3.92
Valiant                         18.1   6 225.0 105 2.76
Dodge Challenger                15.5   8 318.0 150 2.76
Duster 360                      14.3   8 360.0 245 3.21
Maserati Bora                   15.0   8 301.0 335 3.54
Merc 450SL                      17.3   8 275.8 180 3.07
Merc 450SLC                     15.2   8 275.8 180 3.07
Camaro Z28                      13.3   8 350.0 245 3.73
Pontiac Firebird                19.2   8 400.0 175 3.08
Merc 450SE                      16.4   8 275.8 180 3.07
Cadillac Fleetwood              10.4   8 472.0 205 2.93
Chrysler Imperial               14.7   8 440.0 230 3.23
Lincoln Continental             10.4   8 460.0 215 3.00
                       wt  qsec vs am gear carb
Lotus Europa        1.513 16.90  1  1    5    2
Honda Civic         1.615 18.52  1  1    4    2
Toyota Corolla      1.835 19.90  1  1    4    1
Fiat X1-9           1.935 18.90  1  1    4    1
Porsche 914-2       2.140 16.70  0  1    5    2
Fiat 128            2.200 19.47  1  1    4    1
Datsun 710          2.320 18.61  1  1    4    1
Toyota Corona       2.465 20.01  1  0    3    1
Mazda RX4           2.620 16.46  0  1    4    4
Ferrari Dino        2.770 15.50  0  1    5    6
Volvo 142E          2.780 18.60  1  1    4    2
Mazda RX4 Wag       2.875 17.02  0  1    4    4
Merc 230            3.150 22.90  1  0    4    2
Ford Pantera L      3.170 14.50  0  1    5    4
Merc 240D           3.190 20.00  1  0    4    2
Hornet 4 Drive      3.215 19.44  1  0    3    1
AMC Javelin         3.435 17.30  0  0    3    2
Hornet Sportabout   3.440 17.02  0  0    3    2
Merc 280            3.440 18.30  1  0    4    4
Merc 280C           3.440 18.90  1  0    4    4
Valiant             3.460 20.22  1  0    3    1
Dodge Challenger    3.520 16.87  0  0    3    2
Duster 360          3.570 15.84  0  0    3    4
Maserati Bora       3.570 14.60  0  1    5    8
Merc 450SL          3.730 17.60  0  0    3    3
Merc 450SLC         3.780 18.00  0  0    3    3
Camaro Z28          3.840 15.41  0  0    3    4
Pontiac Firebird    3.845 17.05  0  0    3    2
Merc 450SE          4.070 17.40  0  0    3    3
Cadillac Fleetwood  5.250 17.98  0  0    3    4
Chrysler Imperial   5.345 17.42  0  0    3    4
Lincoln Continental 5.424 17.82  0  0    3    4

Stefan Milton Bache and Hadley Wickham (2020). magrittr: A Forward-Pipe Operator for R. R package version 2.0.1.

Subsetting by column

select(mtcars, c(mpg, cyl))
select(mtcars, c("mpg", "cyl"))
select(mtcars, mpg, cyl)
select(mtcars, "mpg", "cyl")

All the same result as below

mtcars %>% 
  select(mpg, cyl)
                     mpg cyl
Mazda RX4           21.0   6
Mazda RX4 Wag       21.0   6
Datsun 710          22.8   4
Hornet 4 Drive      21.4   6
Hornet Sportabout   18.7   8
Valiant             18.1   6
Duster 360          14.3   8
Merc 240D           24.4   4
Merc 230            22.8   4
Merc 280            19.2   6
Merc 280C           17.8   6
Merc 450SE          16.4   8
Merc 450SL          17.3   8
Merc 450SLC         15.2   8
Cadillac Fleetwood  10.4   8
Lincoln Continental 10.4   8
Chrysler Imperial   14.7   8
Fiat 128            32.4   4
Honda Civic         30.4   4
Toyota Corolla      33.9   4
Toyota Corona       21.5   4
Dodge Challenger    15.5   8
AMC Javelin         15.2   8
Camaro Z28          13.3   8
Pontiac Firebird    19.2   8
Fiat X1-9           27.3   4
Porsche 914-2       26.0   4
Lotus Europa        30.4   4
Ford Pantera L      15.8   8
Ferrari Dino        19.7   6
Maserati Bora       15.0   8
Volvo 142E          21.4   4
mtcars %>% select(mpg)
                     mpg
Mazda RX4           21.0
Mazda RX4 Wag       21.0
Datsun 710          22.8
Hornet 4 Drive      21.4
Hornet Sportabout   18.7
Valiant             18.1
Duster 360          14.3
Merc 240D           24.4
Merc 230            22.8
Merc 280            19.2
Merc 280C           17.8
Merc 450SE          16.4
Merc 450SL          17.3
Merc 450SLC         15.2
Cadillac Fleetwood  10.4
Lincoln Continental 10.4
Chrysler Imperial   14.7
Fiat 128            32.4
Honda Civic         30.4
Toyota Corolla      33.9
Toyota Corona       21.5
Dodge Challenger    15.5
AMC Javelin         15.2
Camaro Z28          13.3
Pontiac Firebird    19.2
Fiat X1-9           27.3
Porsche 914-2       26.0
Lotus Europa        30.4
Ford Pantera L      15.8
Ferrari Dino        19.7
Maserati Bora       15.0
Volvo 142E          21.4
  • Selecting one column doesn’t “drop” it to a vector.
  • If you really want the vector then use pull(mpg).

Subsetting by row

mtcars %>% 
  slice(3:1)
               mpg cyl disp  hp drat    wt  qsec vs am gear carb
Datsun 710    22.8   4  108  93 3.85 2.320 18.61  1  1    4    1
Mazda RX4 Wag 21.0   6  160 110 3.90 2.875 17.02  0  1    4    4
Mazda RX4     21.0   6  160 110 3.90 2.620 16.46  0  1    4    4
mtcars %>% 
  filter(rownames(.) %in% c("Datsun 710", "Mazda RX4"))
            mpg cyl disp  hp drat   wt  qsec vs am gear carb
Mazda RX4  21.0   6  160 110 3.90 2.62 16.46  0  1    4    4
Datsun 710 22.8   4  108  93 3.85 2.32 18.61  1  1    4    1
  • Use slice to subset by index and filter by logical vector
  • Placeholder binding
    • x %>% f(y, g(.)) is the same as f(x, y, g(x))
    • x %>% f(y, .) is the same as f(y, x)
    • Note: row names do not follow tidy data principles
    • Use tibble::rownames_to_column() to convert rownames to a column to make into a tidy data

Adding or modifying a column

mtcars %>% 
  mutate(gpm = 1 / mpg,
          wt = gpm^2,
          wt = if_else(cyl==6, 10, wt),
          hp = case_when(cyl==6 ~ 11,
                         cyl==4 ~ 10,
                           TRUE ~ 3))
                     mpg cyl  disp hp drat           wt  qsec vs am gear carb        gpm
Mazda RX4           21.0   6 160.0 11 3.90 1.000000e+01 16.46  0  1    4    4 0.04761905
Mazda RX4 Wag       21.0   6 160.0 11 3.90 1.000000e+01 17.02  0  1    4    4 0.04761905
Datsun 710          22.8   4 108.0 10 3.85 1.923669e-03 18.61  1  1    4    1 0.04385965
Hornet 4 Drive      21.4   6 258.0 11 3.08 1.000000e+01 19.44  1  0    3    1 0.04672897
Hornet Sportabout   18.7   8 360.0  3 3.15 2.859676e-03 17.02  0  0    3    2 0.05347594
Valiant             18.1   6 225.0 11 2.76 1.000000e+01 20.22  1  0    3    1 0.05524862
Duster 360          14.3   8 360.0  3 3.21 4.890215e-03 15.84  0  0    3    4 0.06993007
Merc 240D           24.4   4 146.7 10 3.69 1.679656e-03 20.00  1  0    4    2 0.04098361
Merc 230            22.8   4 140.8 10 3.92 1.923669e-03 22.90  1  0    4    2 0.04385965
Merc 280            19.2   6 167.6 11 3.92 1.000000e+01 18.30  1  0    4    4 0.05208333
Merc 280C           17.8   6 167.6 11 3.92 1.000000e+01 18.90  1  0    4    4 0.05617978
Merc 450SE          16.4   8 275.8  3 3.07 3.718025e-03 17.40  0  0    3    3 0.06097561
Merc 450SL          17.3   8 275.8  3 3.07 3.341241e-03 17.60  0  0    3    3 0.05780347
Merc 450SLC         15.2   8 275.8  3 3.07 4.328255e-03 18.00  0  0    3    3 0.06578947
Cadillac Fleetwood  10.4   8 472.0  3 2.93 9.245562e-03 17.98  0  0    3    4 0.09615385
Lincoln Continental 10.4   8 460.0  3 3.00 9.245562e-03 17.82  0  0    3    4 0.09615385
Chrysler Imperial   14.7   8 440.0  3 3.23 4.627701e-03 17.42  0  0    3    4 0.06802721
Fiat 128            32.4   4  78.7 10 4.08 9.525987e-04 19.47  1  1    4    1 0.03086420
Honda Civic         30.4   4  75.7 10 4.93 1.082064e-03 18.52  1  1    4    2 0.03289474
Toyota Corolla      33.9   4  71.1 10 4.22 8.701630e-04 19.90  1  1    4    1 0.02949853
Toyota Corona       21.5   4 120.1 10 3.70 2.163332e-03 20.01  1  0    3    1 0.04651163
Dodge Challenger    15.5   8 318.0  3 2.76 4.162331e-03 16.87  0  0    3    2 0.06451613
AMC Javelin         15.2   8 304.0  3 3.15 4.328255e-03 17.30  0  0    3    2 0.06578947
Camaro Z28          13.3   8 350.0  3 3.73 5.653231e-03 15.41  0  0    3    4 0.07518797
Pontiac Firebird    19.2   8 400.0  3 3.08 2.712674e-03 17.05  0  0    3    2 0.05208333
Fiat X1-9           27.3   4  79.0 10 4.08 1.341760e-03 18.90  1  1    4    1 0.03663004
Porsche 914-2       26.0   4 120.3 10 4.43 1.479290e-03 16.70  0  1    5    2 0.03846154
Lotus Europa        30.4   4  95.1 10 3.77 1.082064e-03 16.90  1  1    5    2 0.03289474
Ford Pantera L      15.8   8 351.0  3 4.22 4.005768e-03 14.50  0  1    5    4 0.06329114
Ferrari Dino        19.7   6 145.0 11 3.62 1.000000e+01 15.50  0  1    5    6 0.05076142
Maserati Bora       15.0   8 301.0  3 3.54 4.444444e-03 14.60  0  1    5    8 0.06666667
Volvo 142E          21.4   4 121.0 10 4.11 2.183597e-03 18.60  1  1    4    2 0.04672897
  • Evaluation in mutate is done sequentially based on input order
  • So you refer to the newly created variable in later input
  • You can call multiple mutate but computational performance is usually better if done within the same mutate call
  • mtcars %>%
      mutate(gpm = 1 / mpg) %>% 
      mutate(wt = gpm^2)
    

Sorting columns

mtcars %>% 
  select(sort(names(.)))
                    am carb cyl  disp drat gear  hp  mpg  qsec vs    wt
Mazda RX4            1    4   6 160.0 3.90    4 110 21.0 16.46  0 2.620
Mazda RX4 Wag        1    4   6 160.0 3.90    4 110 21.0 17.02  0 2.875
Datsun 710           1    1   4 108.0 3.85    4  93 22.8 18.61  1 2.320
Hornet 4 Drive       0    1   6 258.0 3.08    3 110 21.4 19.44  1 3.215
Hornet Sportabout    0    2   8 360.0 3.15    3 175 18.7 17.02  0 3.440
Valiant              0    1   6 225.0 2.76    3 105 18.1 20.22  1 3.460
Duster 360           0    4   8 360.0 3.21    3 245 14.3 15.84  0 3.570
Merc 240D            0    2   4 146.7 3.69    4  62 24.4 20.00  1 3.190
Merc 230             0    2   4 140.8 3.92    4  95 22.8 22.90  1 3.150
Merc 280             0    4   6 167.6 3.92    4 123 19.2 18.30  1 3.440
Merc 280C            0    4   6 167.6 3.92    4 123 17.8 18.90  1 3.440
Merc 450SE           0    3   8 275.8 3.07    3 180 16.4 17.40  0 4.070
Merc 450SL           0    3   8 275.8 3.07    3 180 17.3 17.60  0 3.730
Merc 450SLC          0    3   8 275.8 3.07    3 180 15.2 18.00  0 3.780
Cadillac Fleetwood   0    4   8 472.0 2.93    3 205 10.4 17.98  0 5.250
Lincoln Continental  0    4   8 460.0 3.00    3 215 10.4 17.82  0 5.424
Chrysler Imperial    0    4   8 440.0 3.23    3 230 14.7 17.42  0 5.345
Fiat 128             1    1   4  78.7 4.08    4  66 32.4 19.47  1 2.200
Honda Civic          1    2   4  75.7 4.93    4  52 30.4 18.52  1 1.615
Toyota Corolla       1    1   4  71.1 4.22    4  65 33.9 19.90  1 1.835
Toyota Corona        0    1   4 120.1 3.70    3  97 21.5 20.01  1 2.465
Dodge Challenger     0    2   8 318.0 2.76    3 150 15.5 16.87  0 3.520
AMC Javelin          0    2   8 304.0 3.15    3 150 15.2 17.30  0 3.435
Camaro Z28           0    4   8 350.0 3.73    3 245 13.3 15.41  0 3.840
Pontiac Firebird     0    2   8 400.0 3.08    3 175 19.2 17.05  0 3.845
Fiat X1-9            1    1   4  79.0 4.08    4  66 27.3 18.90  1 1.935
Porsche 914-2        1    2   4 120.3 4.43    5  91 26.0 16.70  0 2.140
Lotus Europa         1    2   4  95.1 3.77    5 113 30.4 16.90  1 1.513
Ford Pantera L       1    4   8 351.0 4.22    5 264 15.8 14.50  0 3.170
Ferrari Dino         1    6   6 145.0 3.62    5 175 19.7 15.50  0 2.770
Maserati Bora        1    8   8 301.0 3.54    5 335 15.0 14.60  0 3.570
Volvo 142E           1    2   4 121.0 4.11    4 109 21.4 18.60  1 2.780
mtcars %>% 
  relocate(wt, gear, .after = mpg)
                     mpg    wt gear cyl  disp  hp drat  qsec vs am carb
Mazda RX4           21.0 2.620    4   6 160.0 110 3.90 16.46  0  1    4
Mazda RX4 Wag       21.0 2.875    4   6 160.0 110 3.90 17.02  0  1    4
Datsun 710          22.8 2.320    4   4 108.0  93 3.85 18.61  1  1    1
Hornet 4 Drive      21.4 3.215    3   6 258.0 110 3.08 19.44  1  0    1
Hornet Sportabout   18.7 3.440    3   8 360.0 175 3.15 17.02  0  0    2
Valiant             18.1 3.460    3   6 225.0 105 2.76 20.22  1  0    1
Duster 360          14.3 3.570    3   8 360.0 245 3.21 15.84  0  0    4
Merc 240D           24.4 3.190    4   4 146.7  62 3.69 20.00  1  0    2
Merc 230            22.8 3.150    4   4 140.8  95 3.92 22.90  1  0    2
Merc 280            19.2 3.440    4   6 167.6 123 3.92 18.30  1  0    4
Merc 280C           17.8 3.440    4   6 167.6 123 3.92 18.90  1  0    4
Merc 450SE          16.4 4.070    3   8 275.8 180 3.07 17.40  0  0    3
Merc 450SL          17.3 3.730    3   8 275.8 180 3.07 17.60  0  0    3
Merc 450SLC         15.2 3.780    3   8 275.8 180 3.07 18.00  0  0    3
Cadillac Fleetwood  10.4 5.250    3   8 472.0 205 2.93 17.98  0  0    4
Lincoln Continental 10.4 5.424    3   8 460.0 215 3.00 17.82  0  0    4
Chrysler Imperial   14.7 5.345    3   8 440.0 230 3.23 17.42  0  0    4
Fiat 128            32.4 2.200    4   4  78.7  66 4.08 19.47  1  1    1
Honda Civic         30.4 1.615    4   4  75.7  52 4.93 18.52  1  1    2
Toyota Corolla      33.9 1.835    4   4  71.1  65 4.22 19.90  1  1    1
Toyota Corona       21.5 2.465    3   4 120.1  97 3.70 20.01  1  0    1
Dodge Challenger    15.5 3.520    3   8 318.0 150 2.76 16.87  0  0    2
AMC Javelin         15.2 3.435    3   8 304.0 150 3.15 17.30  0  0    2
Camaro Z28          13.3 3.840    3   8 350.0 245 3.73 15.41  0  0    4
Pontiac Firebird    19.2 3.845    3   8 400.0 175 3.08 17.05  0  0    2
Fiat X1-9           27.3 1.935    4   4  79.0  66 4.08 18.90  1  1    1
Porsche 914-2       26.0 2.140    5   4 120.3  91 4.43 16.70  0  1    2
Lotus Europa        30.4 1.513    5   4  95.1 113 3.77 16.90  1  1    2
Ford Pantera L      15.8 3.170    5   8 351.0 264 4.22 14.50  0  1    4
Ferrari Dino        19.7 2.770    5   6 145.0 175 3.62 15.50  0  1    6
Maserati Bora       15.0 3.570    5   8 301.0 335 3.54 14.60  0  1    8
Volvo 142E          21.4 2.780    4   4 121.0 109 4.11 18.60  1  1    2

Calculating statistical summaries by group

🎯 Calculate the average weight (wt) of a car for each gear type in (gear) mtcars

mtcars %>% 
  group_by(gear) %>% 
  summarise(avg_wt = mean(wt))
# A tibble: 3 × 2
   gear avg_wt
  <dbl>  <dbl>
1     3   3.89
2     4   2.62
3     5   2.63

🎯 Calculate the median weight (wt) of a car for each gear (gear) and engine (vs) type in mtcars

mtcars %>% 
  group_by(gear, vs) %>% 
  summarise(avg_wt = mean(wt),
            med_wt = median(wt))
# A tibble: 6 × 4
# Groups:   gear [3]
   gear    vs avg_wt med_wt
  <dbl> <dbl>  <dbl>  <dbl>
1     3     0   4.10   3.81
2     3     1   3.05   3.22
3     4     0   2.75   2.75
4     4     1   2.59   2.55
5     5     0   2.91   2.97
6     5     1   1.51   1.51

across

  • Using across, you can more easily apply a function to multiple columns
mtcars %>% 
  group_by(gear, vs) %>% 
  summarise(across(everything(), mean)) 
# A tibble: 6 × 11
# Groups:   gear [3]
   gear    vs   mpg   cyl  disp    hp  drat    wt  qsec    am  carb
  <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
1     3     0  15.0  8    358.  194.   3.12  4.10  17.1   0    3.08
2     3     1  20.3  5.33 201.  104    3.18  3.05  19.9   0    1   
3     4     0  21    6    160   110    3.9   2.75  16.7   1    4   
4     4     1  25.2  4.4  116.   85.4  4.07  2.59  19.4   0.6  2   
5     5     0  19.1  6.5  229.  216.   3.95  2.91  15.3   1    5   
6     5     1  30.4  4     95.1 113    3.77  1.51  16.9   1    2   

where

  • You can combine across with the selection helper where
mtcars %>% 
  group_by(gear, vs) %>% 
  summarise(across(where(function(x) n_distinct(x) > 10), mean)) 
# A tibble: 6 × 8
# Groups:   gear [3]
   gear    vs   mpg  disp    hp  drat    wt  qsec
  <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
1     3     0  15.0 358.  194.   3.12  4.10  17.1
2     3     1  20.3 201.  104    3.18  3.05  19.9
3     4     0  21   160   110    3.9   2.75  16.7
4     4     1  25.2 116.   85.4  4.07  2.59  19.4
5     5     0  19.1 229.  216.   3.95  2.91  15.3
6     5     1  30.4  95.1 113    3.77  1.51  16.9

c_across

  • Remember tidy selection only works with functions that are compatible
mtcars %>% 
  rowwise() %>% 
  summarise(disp = disp, hp = hp, drat = drat, wt = wt,
            score = sum(c_across(disp:wt))) 
# A tibble: 32 × 5
    disp    hp  drat    wt score
   <dbl> <dbl> <dbl> <dbl> <dbl>
 1  160    110  3.9   2.62  277.
 2  160    110  3.9   2.88  277.
 3  108     93  3.85  2.32  207.
 4  258    110  3.08  3.22  374.
 5  360    175  3.15  3.44  542.
 6  225    105  2.76  3.46  336.
 7  360    245  3.21  3.57  612.
 8  147.    62  3.69  3.19  216.
 9  141.    95  3.92  3.15  243.
10  168.   123  3.92  3.44  298.
# ℹ 22 more rows

rowwise?

  • What happens if you omit rowwise?
mtcars %>% 
  #rowwise() %>% 
  summarise(disp = disp, hp = hp, drat = drat, wt = wt,
            score = sum(c_across(disp:wt))) 
    disp  hp drat    wt    score
1  160.0 110 3.90 2.620 12295.14
2  160.0 110 3.90 2.875 12295.14
3  108.0  93 3.85 2.320 12295.14
4  258.0 110 3.08 3.215 12295.14
5  360.0 175 3.15 3.440 12295.14
6  225.0 105 2.76 3.460 12295.14
7  360.0 245 3.21 3.570 12295.14
8  146.7  62 3.69 3.190 12295.14
9  140.8  95 3.92 3.150 12295.14
10 167.6 123 3.92 3.440 12295.14
11 167.6 123 3.92 3.440 12295.14
12 275.8 180 3.07 4.070 12295.14
13 275.8 180 3.07 3.730 12295.14
14 275.8 180 3.07 3.780 12295.14
15 472.0 205 2.93 5.250 12295.14
16 460.0 215 3.00 5.424 12295.14
17 440.0 230 3.23 5.345 12295.14
18  78.7  66 4.08 2.200 12295.14
19  75.7  52 4.93 1.615 12295.14
20  71.1  65 4.22 1.835 12295.14
21 120.1  97 3.70 2.465 12295.14
22 318.0 150 2.76 3.520 12295.14
23 304.0 150 3.15 3.435 12295.14
24 350.0 245 3.73 3.840 12295.14
25 400.0 175 3.08 3.845 12295.14
26  79.0  66 4.08 1.935 12295.14
27 120.3  91 4.43 2.140 12295.14
28  95.1 113 3.77 1.513 12295.14
29 351.0 264 4.22 3.170 12295.14
30 145.0 175 3.62 2.770 12295.14
31 301.0 335 3.54 3.570 12295.14
32 121.0 109 4.11 2.780 12295.14

Lazy and non-standard evaluation

  • In Base R:
subset(mtcars, mpg > 31)
                mpg cyl disp hp drat    wt  qsec vs am gear carb
Fiat 128       32.4   4 78.7 66 4.08 2.200 19.47  1  1    4    1
Toyota Corolla 33.9   4 71.1 65 4.22 1.835 19.90  1  1    4    1
  • But the second argument cannot be evaluated:
mpg > 31
Error in eval(expr, envir, enclos): object 'mpg' not found
  • R employs what is called lazy evaluation for function inputs

  • Non-standard evaluation uses this feature to capture the input expression within the function and evaluate only when requested

Tidy evaluation Part 1

  • Tidy evaluation builds on the lazy and non-standard evaluation and is implemented in rlang
  • All core tidyverse packages import rlang
  • So what does it do?
  • Let’s consider filter, the Tidyverse version of subset
filter(mtcars, mpg > 31)
                mpg cyl disp hp drat    wt  qsec vs am gear carb
Fiat 128       32.4   4 78.7 66 4.08 2.200 19.47  1  1    4    1
Toyota Corolla 33.9   4 71.1 65 4.22 1.835 19.90  1  1    4    1
  • If you get an error using filter, replace it with dplyr::filter
    for those interested, dplyr::filter is a conflict with stats::filter and it may be using stats::filter instead… I’ve fallen into this trap so many times!

Lionel Henry and Hadley Wickham (2020). rlang: Functions for Base Types and Core R and ‘Tidyverse’ Features. R package version 0.4.8.

Tidy evaluation Part 2

  • Suppose we have a silly function that subsets mtcars for a given condition
myCarSubset <- function(cond) subset(mtcars, cond)
myCarFilter <- function(cond) filter(mtcars, cond)
  • This causes an issue because cond is evaluated before it is parsed into subset or filter
myCarSubset(mpg > 31)
Error in eval(expr, envir, enclos): object 'mpg' not found
myCarFilter(mpg > 31)
Error in `filter()`:
ℹ In argument: `cond`.
Caused by error:
! object 'mpg' not found

Tidy evaluation Part 3

  • Functions that use non-standard evaluation is problematic
myCarSubsetNew <- function(cond) subset(mtcars, {{ cond }})
myCarFilterNew <- function(cond) filter(mtcars, {{ cond }})
myCarSubsetNew(mpg > 31)
Error in eval(expr, envir, enclos): object 'mpg' not found
myCarFilterNew(mpg > 31)
                mpg cyl disp hp drat    wt  qsec vs am gear carb
Fiat 128       32.4   4 78.7 66 4.08 2.200 19.47  1  1    4    1
Toyota Corolla 33.9   4 71.1 65 4.22 1.835 19.90  1  1    4    1
  • {{ }} only works if the underlying function implements rlang

Data masking Part 1

ind <- 1:nrow(cars) # nrow(cars) = 50
filter(cars, ind > 49)
  speed dist
1    25   85
  • For any variables that don’t exist in the data, R searches the parental environment for evaluation.
speed <- c(40, 51)
filter(cars, speed > 24)
  speed dist
1    25   85
  • The variables in data take priority for reference over those in parental environment

Data masking Part 2

speed <- 1:nrow(cars)
filter(cars, .data$speed > 24)
  speed dist
1    25   85
filter(cars, .env$speed > 24)
   speed dist
1     15   26
2     15   54
3     16   32
4     16   40
5     17   32
6     17   40
7     17   50
8     18   42
9     18   56
10    18   76
11    18   84
12    19   36
13    19   46
14    19   68
15    20   32
16    20   48
17    20   52
18    20   56
19    20   64
20    22   66
21    23   54
22    24   70
23    24   92
24    24   93
25    24  120
26    25   85
  • In Tidyverse, you can be explicit whether the variable is in the data or in the parental environment
  • .data is a special pronoun referring to variables in the data parsed in the first argument
  • .env is a special pronoun referring to variables in the environment (i.e. not in the data parsed in the first argument)

Tidy select Part 1

  • Tidyverse packages generally use syntax from the tidyselect package for variable/column selection
immunity <- readr::read_csv("https://github.com/anu-bdsi/training-R1/raw/main/data/marcato-2021-immunity.csv")
str(immunity) 
spc_tbl_ [1,024 × 37] (S3: spec_tbl_df/tbl_df/tbl/data.frame)
 $ Timepoint         : chr [1:1024] "CC" "CC" "CC" "CC" ...
 $ Batch             : num [1:1024] 1 1 1 1 1 1 1 1 1 1 ...
 $ Animal            : num [1:1024] 1 2 3 4 5 6 7 8 9 10 ...
 $ Eartag            : num [1:1024] 48454 4796 85085 884 59787 ...
 $ Uplo              : chr [1:1024] "up" "up" "low" "low" ...
 $ Bafr              : chr [1:1024] "back" "back" "back" "back" ...
 $ Pen               : num [1:1024] 4 4 6 6 7 7 2 2 10 10 ...
 $ Corridor          : num [1:1024] 1 1 1 1 1 1 1 1 2 2 ...
 $ Diet              : chr [1:1024] "E" "E" "M" "M" ...
 $ Duration          : num [1:1024] 6 6 6 6 6 6 6 6 6 6 ...
 $ Type              : chr [1:1024] "U" "U" "C" "C" ...
 $ Bilirubin_1       : chr [1:1024] "12.95" "18.51" "26.11" "9.84" ...
 $ Cortisol_1        : chr [1:1024] "7.58" "16.55" "17.49" "12.37" ...
 $ Haptoglobin       : chr [1:1024] "0.31" "0.28" "0.18" "0.55" ...
 $ Titer_IgG         : chr [1:1024] "4.7" "3.9" "4.4" "4.9" ...
 $ Titer_IgM         : chr [1:1024] "6.6" "4.0" "5.3" "3.8" ...
 $ wbc               : num [1:1024] 6.79 13.31 11.68 5.29 6.87 ...
 $ neutro1           : num [1:1024] 2.77 6.02 4.17 1.6 2.22 4.94 3 8.65 3.81 6.93 ...
 $ lympho1           : num [1:1024] 2.69 4.75 3.8 2.33 3.58 1.79 4.65 5.55 4.87 3.77 ...
 $ mono1             : num [1:1024] 1.15 1.87 2.2 1.04 0.9 1.1 1.59 2.07 1.48 1.09 ...
 $ eo1               : num [1:1024] 0.07 0.16 0.46 0.14 0.08 0.73 0.41 0.51 0.06 0.15 ...
 $ baso1             : num [1:1024] 0.11 0.52 1.06 0.18 0.1 0.4 0.14 0.17 0.23 0.19 ...
 $ neutro2           : num [1:1024] 40.8 45.3 35.6 30.2 32.3 ...
 $ lympho2           : num [1:1024] 39.6 35.6 32.5 44 52.1 ...
 $ mono2             : num [1:1024] 16.9 14.1 18.8 19.8 13.1 ...
 $ eo2               : num [1:1024] 1.05 1.15 3.95 2.6 1.15 8.15 4.15 3.05 0.55 1.2 ...
 $ baso2             : num [1:1024] 1.55 3.85 9.1 3.5 1.4 4.5 1.4 1 2.2 1.5 ...
 $ CD8+              : num [1:1024] 7.05 5.73 10.7 5.15 8.08 12.1 15.5 5.19 14.4 11.5 ...
 $ CD8+ perf+        : num [1:1024] 0.67 1.31 1.41 2.08 0.78 1.24 0 0.34 0.86 0.68 ...
 $ NK+               : num [1:1024] 4.52 6.52 13.3 3.16 14.4 13.8 8.3 5.22 8.28 12.3 ...
 $ NK+ perf+         : num [1:1024] 1.64 2.86 1.3 1.43 2.17 0.94 0.82 1.39 1.66 2.83 ...
 $ CD4+              : num [1:1024] 17.9 11.2 0.1 0.09 8.47 22.5 14.9 8.19 24.9 11.5 ...
 $ CD4+ perf+        : num [1:1024] 0.02 0.25 4.17 11.8 0.19 0.12 0 0.06 0.49 0.12 ...
 $ gamma delta+      : num [1:1024] 41.6 48.9 23.5 55.9 36.9 18.9 3.17 63.5 0.11 37 ...
 $ gamma delta+ perf+: num [1:1024] 0.06 0.33 0.12 0.23 0.07 0.19 0 0.12 0 0.09 ...
 $ CD14+ large       : num [1:1024] 35.2 34.2 44.7 24.1 19.1 61.1 3.93 43.7 28.3 25.1 ...
 $ CD21+ large       : num [1:1024] 8.04 12.5 12.8 6.85 16.3 4.3 29.1 3.4 11.5 6.92 ...
 - attr(*, "spec")=
  .. cols(
  ..   Timepoint = col_character(),
  ..   Batch = col_double(),
  ..   Animal = col_double(),
  ..   Eartag = col_double(),
  ..   Uplo = col_character(),
  ..   Bafr = col_character(),
  ..   Pen = col_double(),
  ..   Corridor = col_double(),
  ..   Diet = col_character(),
  ..   Duration = col_double(),
  ..   Type = col_character(),
  ..   Bilirubin_1 = col_character(),
  ..   Cortisol_1 = col_character(),
  ..   Haptoglobin = col_character(),
  ..   Titer_IgG = col_character(),
  ..   Titer_IgM = col_character(),
  ..   wbc = col_double(),
  ..   neutro1 = col_double(),
  ..   lympho1 = col_double(),
  ..   mono1 = col_double(),
  ..   eo1 = col_double(),
  ..   baso1 = col_double(),
  ..   neutro2 = col_double(),
  ..   lympho2 = col_double(),
  ..   mono2 = col_double(),
  ..   eo2 = col_double(),
  ..   baso2 = col_double(),
  ..   `CD8+` = col_double(),
  ..   `CD8+ perf+` = col_double(),
  ..   `NK+` = col_double(),
  ..   `NK+ perf+` = col_double(),
  ..   `CD4+` = col_double(),
  ..   `CD4+ perf+` = col_double(),
  ..   `gamma delta+` = col_double(),
  ..   `gamma delta+ perf+` = col_double(),
  ..   `CD14+ large` = col_double(),
  ..   `CD21+ large` = col_double()
  .. )
 - attr(*, "problems")=<externalptr> 

Tidy select Part 2

immunity %>% 
 select(Timepoint, Batch, Animal)
# A tibble: 1,024 × 3
   Timepoint Batch Animal
   <chr>     <dbl>  <dbl>
 1 CC            1      1
 2 CC            1      2
 3 CC            1      3
 4 CC            1      4
 5 CC            1      5
 6 CC            1      6
 7 CC            1      7
 8 CC            1      8
 9 CC            1      9
10 CC            1     10
# ℹ 1,014 more rows
immunity %>% 
  select(Timepoint:Animal)
# A tibble: 1,024 × 3
   Timepoint Batch Animal
   <chr>     <dbl>  <dbl>
 1 CC            1      1
 2 CC            1      2
 3 CC            1      3
 4 CC            1      4
 5 CC            1      5
 6 CC            1      6
 7 CC            1      7
 8 CC            1      8
 9 CC            1      9
10 CC            1     10
# ℹ 1,014 more rows
  • The tidyselect syntax : can be used to select contiguous columns in the data.

Tidy select Part 3

immunity %>% 
  select(contains("perf"))
# A tibble: 1,024 × 4
   `CD8+ perf+` `NK+ perf+` `CD4+ perf+` `gamma delta+ perf+`
          <dbl>       <dbl>        <dbl>                <dbl>
 1         0.67        1.64         0.02                 0.06
 2         1.31        2.86         0.25                 0.33
 3         1.41        1.3          4.17                 0.12
 4         2.08        1.43        11.8                  0.23
 5         0.78        2.17         0.19                 0.07
 6         1.24        0.94         0.12                 0.19
 7         0           0.82         0                    0   
 8         0.34        1.39         0.06                 0.12
 9         0.86        1.66         0.49                 0   
10         0.68        2.83         0.12                 0.09
# ℹ 1,014 more rows
immunity %>% 
  select(starts_with("Titer"))
# A tibble: 1,024 × 2
   Titer_IgG Titer_IgM
   <chr>     <chr>    
 1 4.7       6.6      
 2 3.9       4.0      
 3 4.4       5.3      
 4 4.9       3.8      
 5 5.3       4.5      
 6 5.7       5.7      
 7 5.4       2.0      
 8 5.2       6.9      
 9 3.5       5.0      
10 3.9       4.8      
# ℹ 1,014 more rows

Selection language Part 1

  • : for selecting contiguous variables
  • ! for taking complement set of variables
  • & or | for selecting intersection or union of two sets of variables, e.g. 
immunity %>% 
  select(starts_with("Titer") & ends_with("IgG")) %>% 
  str()
tibble [1,024 × 1] (S3: tbl_df/tbl/data.frame)
 $ Titer_IgG: chr [1:1024] "4.7" "3.9" "4.4" "4.9" ...
  • c() for combining selections
  • everything() to select all variables
  • last_col() to select last variable, with option of an offset

Selection language Part 2

  • starts_with() selects columns with the given prefix
  • ends_with() selects columns with the given suffix
  • contains() selects columns with a literal string
  • matches() selects columns that match the regular expression
  • num_range() selects columns with a numerical range
  • all_of() for selecting columns based on a character vector
  • any_of() is the same as all_of() but no error when variables do not exist
  • where() selects based on where given function return TRUE
help(language, package = "tidyselect")