Pivotting data

ANU BDSI
workshop
Data Wrangling with R Part 1

Emi Tanaka

Biological Data Science Institute

8th April 2024

Current learning objective

  • -Recognize the characteristics of tidy data
  • -Differentiate between the Base and Tidyverse paradigms
  • -Acquire the skills to add/modify columns, subset data by rows and columns, rename column names, and perform group operations using dplyr
  • Pivot data into longer or wider format using tidyr
  • -Join datasets using dplyr

Tidy data

Definition of a tidy data (Wickham, 2014)

  • Each variable must have its own column
  • Each observation must have its own row
  • Each value must have its own cell
Produced by OmniGraffle 7.18\n2020-11-23 23:57:30 +0000 Canvas 2 Layer 1 variables observations values
  • Tidyverse functions are designed to work with tidy data.

Wickham (2014) Tidy Data. Journal of Statistical Software

Evolving language

  • Earlier efforts to transform data from wide to long was in the reshape, first released on CRAN in 2005-08-05.
  • It was then superseded by reshape2 released on CRAN in 2010-09-10.
  • Then finally tidyr released on CRAN in 2014-07-21 *v1.0.0 released 2019-09-12.

Wide to long

  • reshape::melt
  • reshape2::melt
  • tidyr::gather
  • tidyr::pivot_longer*

Long to wide

  • reshape::cast
  • reshape2::dcast
  • tidyr::spread
  • tidyr::pivot_wider*

Hadley Wickham (2020). tidyr: Tidy Messy Data. R package version 1.1.2.

Hadley Wickham (2007). Reshaping Data with the reshape Package. Journal of Statistical Software, 21(12), 1-20

Lifecycle

Canvas 1 Layer 1 experimental stable superseded deprecated

lifecycle: deprecated lifecycle deprecated lifecycle: experimental lifecycle experimental lifecycle: stable lifecycle stable lifecycle: superseded lifecycle superseded

  • Functions (and sometimes arguments of functions) in tidyverse packages often are labelled with a badge like on the left

Lionel Henry (2020). lifecycle: Manage the Life Cycle of your Package Functions. R package version 0.2.0.

Pivotting data with tidyr Part 1

Data
df_wide <- tibble::tibble(state = c("NSW", "VIC", "ACT"),
                          `2019` = c(8130159, 6655284, 427892),
                          `2018` = c(80366651, 6528601, 423169),
                          `2017` = c(7919815, 6387081, 415874))

df_long <- data.frame(state = c("NSW", "NSW", "NSW", "VIC", "VIC", "VIC", "ACT", "ACT", "ACT"),
                      year = c("2019", "2018", "2017", "2019", 
                               "2018", "2017", "2019", "2018", "2017"), 
                      population = c(8130159, 80366651, 7919815, 6655284, 6528601, 6387081, 427892, 423169, 415874))
df_wide
state 2019 2018 2017
NSW 8130159 80366651 7919815
VIC 6655284 6528601 6387081
ACT 427892 423169 415874
df_long
state year population
NSW 2019 8130159
NSW 2018 80366651
NSW 2017 7919815
VIC 2019 6655284
VIC 2018 6528601
VIC 2017 6387081
ACT 2019 427892
ACT 2018 423169
ACT 2017 415874

Values adapted from Australian Bureau of Statistics. (2020). Table 04. Estimated Resident Population, States and Territories [Time series spreadsheet]. National, state and territory population, Australia Mar 2020. Retrieved Nov 24, 2020. https://www.abs.gov.au/statistics/people/population/national-state-and-territory-population/mar-2020/310104.xls

Pivotting data with tidyr Part 2

Data
yield_long <- data.frame(year = c(1900, 1900, 2000, 1900, 1900, 2000, 2000),
                         state = c("Iowa", "Kansas", "Kansas", "Iowa", "Kansas", "Iowa", "Kansas"),
                         crop = c("barley", "barley", "barley", "wheat", "wheat", "wheat", "wheat"),
                         yield = c(28.5, 18, 35, 14.4, 18.2, 47, 37))
yield_long
year state crop yield
1900 Iowa barley 28.5
1900 Kansas barley 18.0
2000 Kansas barley 35.0
1900 Iowa wheat 14.4
1900 Kansas wheat 18.2
2000 Iowa wheat 47.0
2000 Kansas wheat 37.0
yield_wide
year state barley_yield wheat_yield
1900 Iowa 28.5 14.4
1900 Kansas 18.0 18.2
2000 Kansas 35.0 37.0
2000 Iowa NA 47.0

yield_long yield_wide

United States Department of Agriculture, National Agricultural Statistics Service. http://quickstats.nass.usda.gov/

Kevin Wright (2020). agridat: Agricultural Datasets. R package version 1.17

Pivotting data with tidyr Part 3

Data
yield_wide <- data.frame(year = c(1900, 1900, 2000, 2000), 
                         state = c("Iowa", "Kansas", "Kansas", "Iowa"),
                         barley_yield = c(28.5, 18, 35, NA), 
                         wheat_yield = c(14.4, 18.2, 37, 47))
yield_wide
year state barley_yield wheat_yield
1900 Iowa 28.5 14.4
1900 Kansas 18.0 18.2
2000 Kansas 35.0 37.0
2000 Iowa NA 47.0
yield_long
year state crop yield
1900 Iowa barley 28.5
1900 Kansas barley 18.0
2000 Kansas barley 35.0
1900 Iowa wheat 14.4
1900 Kansas wheat 18.2
2000 Iowa wheat 47.0
2000 Kansas wheat 37.0

yield_wide yield_long

United States Department of Agriculture, National Agricultural Statistics Service. http://quickstats.nass.usda.gov/

Kevin Wright (2020). agridat: Agricultural Datasets. R package version 1.17

Pivotting data with tidyr Part 4

Data
crop_long <- data.frame(year = c(1900L, 1900L, 1900L, 1900L, 2000L, 2000L, 1900L, 1900L, 1900L, 1900L, 2000L, 2000L, 2000L, 2000L),
                        state = c("Iowa", "Iowa", "Kansas", "Kansas", "Kansas", "Kansas", "Iowa", "Iowa", "Kansas", "Kansas", "Iowa", "Iowa", "Kansas", "Kansas"),
                        crop = c("barley", "barley", "barley", "barley", "barley", "barley", "wheat", "wheat", "wheat", "wheat", "wheat", "wheat", "wheat", "wheat"),
                        metric = c("yield", "acres", "yield", "acres", "yield", "acres", "yield", "acres", "yield", "acres", "yield", "acres", "yield", "acres"),
                        value = c(28.5, 620000, 18, 127000, 35, 7000, 14.4, 1450000, 18.2, 4290000, 47, 18000, 37, 9400000))
crop_long
year state crop metric value
1900 Iowa barley yield 28.5
1900 Iowa barley acres 620,000.0
1900 Kansas barley yield 18.0
1900 Kansas barley acres 127,000.0
2000 Kansas barley yield 35.0
2000 Kansas barley acres 7,000.0
1900 Iowa wheat yield 14.4
1900 Iowa wheat acres 1,450,000.0
1900 Kansas wheat yield 18.2
1900 Kansas wheat acres 4,290,000.0
2000 Iowa wheat yield 47.0
2000 Iowa wheat acres 18,000.0
2000 Kansas wheat yield 37.0
2000 Kansas wheat acres 9,400,000.0
crop_wide
year state barley_yield wheat_yield barley_acres wheat_acres
1900 Iowa 28.5 14.4 620000 1450000
1900 Kansas 18.0 18.2 127000 4290000
2000 Kansas 35.0 37.0 7000 9400000
2000 Iowa NA 47.0 NA 18000

crop_long crop_wide

United States Department of Agriculture, National Agricultural Statistics Service. http://quickstats.nass.usda.gov/

Kevin Wright (2020). agridat: Agricultural Datasets. R package version 1.17

Pivotting data with tidyr Part 5

Data
crop_wide <- data.frame(year = c(1900L, 1900L, 2000L, 2000L), 
                        state = c("Iowa", "Kansas", "Kansas", "Iowa"), 
                        barley_yield = c(28.5, 18, 35, NA), 
                        wheat_yield = c(14.4, 18.2, 37, 47), 
                        barley_acres = c(620000, 127000, 7000, NA), 
                        wheat_acres = c(1450000, 4290000, 9400000, 18000))
crop_wide
year state barley_yield wheat_yield barley_acres wheat_acres
1900 Iowa 28.5 14.4 620000 1450000
1900 Kansas 18.0 18.2 127000 4290000
2000 Kansas 35.0 37.0 7000 9400000
2000 Iowa NA 47.0 NA 18000

crop_wide crop_long

crop_long
year state crop metric value
1900 Iowa barley yield 28.5
1900 Iowa barley acres 620,000.0
1900 Kansas barley yield 18.0
1900 Kansas barley acres 127,000.0
2000 Kansas barley yield 35.0
2000 Kansas barley acres 7,000.0
1900 Iowa wheat yield 14.4
1900 Iowa wheat acres 1,450,000.0
1900 Kansas wheat yield 18.2
1900 Kansas wheat acres 4,290,000.0
2000 Iowa wheat yield 47.0
2000 Iowa wheat acres 18,000.0
2000 Kansas wheat yield 37.0
2000 Kansas wheat acres 9,400,000.0

United States Department of Agriculture, National Agricultural Statistics Service. http://quickstats.nass.usda.gov/

Kevin Wright (2020). agridat: Agricultural Datasets. R package version 1.17

Pivotting data with tidyr Part 6

Data
crop_long <- data.frame(year = c(1900L, 1900L, 1900L, 1900L, 2000L, 2000L, 1900L, 1900L, 1900L, 1900L, 2000L, 2000L, 2000L, 2000L),
                        state = c("Iowa", "Iowa", "Kansas", "Kansas", "Kansas", "Kansas", "Iowa", "Iowa", "Kansas", "Kansas", "Iowa", "Iowa", "Kansas", "Kansas"),
                        crop = c("barley", "barley", "barley", "barley", "barley", "barley", "wheat", "wheat", "wheat", "wheat", "wheat", "wheat", "wheat", "wheat"),
                        metric = c("yield", "acres", "yield", "acres", "yield", "acres", "yield", "acres", "yield", "acres", "yield", "acres", "yield", "acres"),
                        value = c(28.5, 620000, 18, 127000, 35, 7000, 14.4, 1450000, 18.2, 4290000, 47, 18000, 37, 9400000))
crop_long
year state crop metric value
1900 Iowa barley yield 2.85e+01
1900 Iowa barley acres 6.20e+05
1900 Kansas barley yield 1.80e+01
1900 Kansas barley acres 1.27e+05
2000 Kansas barley yield 3.50e+01
2000 Kansas barley acres 7.00e+03
1900 Iowa wheat yield 1.44e+01
1900 Iowa wheat acres 1.45e+06
1900 Kansas wheat yield 1.82e+01
1900 Kansas wheat acres 4.29e+06
2000 Iowa wheat yield 4.70e+01
2000 Iowa wheat acres 1.80e+04
2000 Kansas wheat yield 3.70e+01
2000 Kansas wheat acres 9.40e+06
crop_tidy
year state crop yield acres
1900 Iowa barley 28.5 620000
1900 Kansas barley 18.0 127000
2000 Kansas barley 35.0 7000
1900 Iowa wheat 14.4 1450000
1900 Kansas wheat 18.2 4290000
2000 Iowa wheat 47.0 18000
2000 Kansas wheat 37.0 9400000

crop_long crop_tidy

United States Department of Agriculture, National Agricultural Statistics Service. http://quickstats.nass.usda.gov/

Kevin Wright (2020). agridat: Agricultural Datasets. R package version 1.17

Separate values into columns

Data
pkg_dat <- data.frame(package = c("dplyr", "magrittr", "rlang", "stringr", "tibble", "tidyr", "tidyselect"), 
                      maintainer = c("Hadley Wickham", "Lionel Henry", "Lionel Henry", "Hadley Wickham", "Kirill Müller", "Hadley Wickham", "Lionel Henry"))
pkg_dat
package maintainer
dplyr Hadley Wickham
magrittr Lionel Henry
rlang Lionel Henry
stringr Hadley Wickham
tibble Kirill Müller
tidyr Hadley Wickham
tidyselect Lionel Henry

🎯 separate maintainer name to columns, first name and last name

Separate values into rows

Data
author_dat <- data.frame(package = c("dplyr", "magrittr", "rlang", "stringr", "tibble", "tidyr", "tidyselect"), 
                         author = c("Hadley Wickham, Romain François, Lionel Henry, Kirill Müller", "Lionel Henry, Stefan Milton Bache, Hadley Wickham", "Lionel Henry, Hadley Wickham", "Hadley Wickham", "Kirill Müller, Hadley Wickham", "Hadley Wickham", "Lionel Henry, Hadley Wickham"))
author_dat
package author
dplyr Hadley Wickham, Romain François, Lionel Henry, Kirill Müller
magrittr Lionel Henry, Stefan Milton Bache, Hadley Wickham
rlang Lionel Henry, Hadley Wickham
stringr Hadley Wickham
tibble Kirill Müller, Hadley Wickham
tidyr Hadley Wickham
tidyselect Lionel Henry, Hadley Wickham

Summary

  • tidyr package provides tools for data cleaning and reshaping.
  • pivot_longer() and pivot_wider() are used to convert data between long and wide formats.
  • separate() and separate_rows() functions are used to split a single column into multiple columns or rows.

tidyr cheatsheet

Exercise time

20:00