Computing numerical summaries and parametric distributions

BDSI R Training I

Emi Tanaka

Biological Data Science Institute

2nd November 2023

Functions in R

  • There are many functions in R!
  • If you need to compute a numerical summary that is common in your field, there is probably already a function for it in the R ecosystem.
  • Try searching for it with a search engine (e.g. Google) using the right keywords.
  • If the function comes from a community-contributed R package, check for some quality indicators:
    • Is it actively maintained?
    • Is it widely used?
    • Does the package have tests for its functions? Etc.

Base packages

  • R has 7 packages (stats, graphics, grDevices, utils, datasets, methods, base), collectively referred to as the “base packages”, that are loaded automatically when you launch it.
  • The functions in the base packages are generally well-tested and trustworthy.
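One way to check which package a function comes from is find(), which reports where a name is defined on the search path. This is a small sketch; the two function names looked up are just illustrative choices.

```r
# find() reports where on the search path a name is defined.
find("median")
# [1] "package:stats"
find("sqrt")
# [1] "package:base"

# search() lists what is attached in the current session,
# which includes the base packages.
search()
```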

Arithmetic

  • Many of the arithmetic functions come from base.
  • Run library(help = "base") to see an index of its help files.
sqrt(3)
[1] 1.732051
abs(-3)
[1] 3
exp(1)
[1] 2.718282
log(4, base = exp(1))
[1] 1.386294
sum(1:3)
[1] 6
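As a short aside on log(): its default base is e, so the base argument in the call above can be dropped. The sketch below also shows the log10() and log2() shortcuts; the input values are arbitrary.

```r
# log() uses the natural logarithm by default.
log(4)
# [1] 1.386294

# Other bases: pass base =, or use the dedicated helpers.
log(100, base = 10)
log10(100)
log2(8)
# [1] 3
```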

Numerical summaries

  • Numerical summaries generally come from the base or stats package.
  • Some common numerical summaries include:
    • Mean: mean()
    • Median: median()
    • Five number summary: fivenum()
    • Minimum: min()
    • Maximum: max()
    • Quantile: quantile()
    • Correlation coefficient: cor()
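A minimal sketch of these functions on a small made-up vector. The data x and y are illustrative only, with y constructed as an exact linear function of x so the correlation is easy to predict.

```r
# Illustrative data (not from a real study).
x <- c(2, 4, 4, 5, 7, 9, 10)
y <- 2 * x + 1

mean(x)
# [1] 5.857143
median(x)
# [1] 5
fivenum(x)    # minimum, lower hinge, median, upper hinge, maximum
# [1]  2  4  5  8 10
min(x)
max(x)
quantile(x, probs = c(0.25, 0.75))
cor(x, y)     # y is an increasing linear function of x, so this is 1
```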

Missing values

  • NA in R denotes a missing value. There are in fact different types of missing values (NA_character_, NA_integer_, NA_real_, NA_complex_), in addition to the default logical NA.
  • Missing values can cause issues in computation: by default, many summary functions return NA if any value is missing.
x <- c(2.3, NA, 4.7)
mean(x)
[1] NA
  • Below we remove the missing values before computing the mean:
mean(x, na.rm = TRUE)
[1] 3.5
  • Note that this differs from the following when there are missing values, because length(x) still counts the NA:
sum(x, na.rm = TRUE) / length(x)
[1] 2.333333
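To make the difference concrete, the sketch below uses is.na() to locate and count the missing values in the same x. Dividing by the number of non-missing values, rather than by length(x), recovers the result of mean(x, na.rm = TRUE).

```r
x <- c(2.3, NA, 4.7)

is.na(x)          # which elements are missing
# [1] FALSE  TRUE FALSE
sum(is.na(x))     # how many are missing
# [1] 1

# Divide by the count of non-missing values instead of length(x).
sum(x, na.rm = TRUE) / sum(!is.na(x))
# [1] 3.5

# The typed missing values differ only in their class.
class(NA)
# [1] "logical"
class(NA_real_)
# [1] "numeric"
class(NA_character_)
# [1] "character"
```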

Some parametric distributions

  • The density (d), cumulative distribution (p) and quantile (q) functions of a parametric distribution are generally in the stats package, named by that prefix letter followed by an abbreviation of the distribution (e.g. dnorm).

  • There are also functions to generate random values from a particular parametric distribution (r).

  • Some examples are:

Normal distribution

  • dnorm()
  • pnorm()
  • qnorm()
  • rnorm()

t-distribution

  • dt()
  • pt()
  • qt()
  • rt()

Poisson distribution

  • dpois()
  • ppois()
  • qpois()
  • rpois()

F-distribution

  • df()
  • pf()
  • qf()
  • rf()
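A brief sketch of how the four prefixes relate, using a few of the functions listed above. The parameter values (lambda = 2, 10 degrees of freedom, and so on) and the seed are arbitrary choices for illustration.

```r
# Normal distribution (defaults: mean = 0, sd = 1)
dnorm(0)              # density at 0
# [1] 0.3989423
pnorm(1.96)           # P(X <= 1.96)
# [1] 0.9750021
qnorm(pnorm(1.96))    # the quantile function inverts pnorm()
# [1] 1.96

# Random draws change from run to run unless a seed is set.
set.seed(1)           # arbitrary seed, for reproducibility only
rnorm(3)

# Other distributions need their parameters supplied.
dpois(0, lambda = 2)          # P(X = 0), which equals exp(-2)
# [1] 0.1353353
ppois(2, lambda = 2)          # P(X <= 2)
qt(0.975, df = 10)            # 97.5% quantile of t with 10 df
qf(0.95, df1 = 2, df2 = 10)   # 95% quantile of F with (2, 10) df
```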