Title: | Tidy Messy Data |
---|---|
Description: | Tools to help to create tidy data, where each column is a variable, each row is an observation, and each cell contains a single value. 'tidyr' contains tools for changing the shape (pivoting) and hierarchy (nesting and 'unnesting') of a dataset, turning deeply nested lists into rectangular data frames ('rectangling'), and extracting values out of string columns. It also includes tools for working with missing values (both implicit and explicit). |
Authors: | Hadley Wickham [aut, cre], Davis Vaughan [aut], Maximilian Girlich [aut], Kevin Ushey [ctb], Posit Software, PBC [cph, fnd] |
Maintainer: | Hadley Wickham <[email protected]> |
License: | MIT + file LICENSE |
Version: | 1.3.1.9000 |
Built: | 2024-08-28 19:14:32 UTC |
Source: | https://github.com/tidyverse/tidyr |
Song rankings for Billboard top 100 in the year 2000
billboard
A dataset with variables:
Artist name
Song name
Date the song entered the top 100
Rank of the song in each week after it entered
The "Whitburn" project, https://waxy.org/2008/05/the_whitburn_project/, (downloaded April 2008)
Chopping and unchopping preserve the width of a data frame, changing its length. chop() makes df shorter by converting rows within each group into list-columns. unchop() makes df longer by expanding list-columns so that each element of the list-column gets its own row in the output.
chop() and unchop() are building blocks for more complicated functions (like unnest(), unnest_longer(), and unnest_wider()) and are generally more suitable for programming than interactive data analysis.
chop(data, cols, ..., error_call = current_env())

unchop(
  data,
  cols,
  ...,
  keep_empty = FALSE,
  ptype = NULL,
  error_call = current_env()
)
data |
A data frame. |
cols |
< For |
... |
These dots are for future extensions and must be empty. |
error_call |
The execution environment of a currently
running function, e.g. |
keep_empty |
By default, you get one row of output for each element
of the list that you are unchopping/unnesting. This means that if there's a
size-0 element (like |
ptype |
Optionally, a named list of column name-prototype pairs to
coerce |
Generally, unchopping is more useful than chopping because it simplifies a complex data structure, and nest()ing is usually more appropriate than chop()ing since it better preserves the connections between observations.
chop() creates list-columns of class vctrs::list_of() to ensure consistent behaviour when the chopped data frame is emptied. For instance, this helps to get back the original column types after a chop() and unchop() roundtrip. Because <list_of> keeps track of the type of its elements, unchop() is able to reconstitute the correct vector type even for empty list-columns.
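The roundtrip behaviour described above can be checked with a small sketch. The data frame and the names chopped, emptied and roundtrip are illustrative assumptions, not part of the package documentation.

```r
library(tidyr)

# Illustrative data: `x` is the grouping column, `y` is an integer column
df <- tibble::tibble(x = c(1, 1, 2), y = 1:3)

# chop() collapses `y` into a list-column with one element per value of `x`
chopped <- chop(df, y)
is_list_of <- inherits(chopped$y, "vctrs_list_of")

# Remove every row, then unchop: the element type survives the roundtrip
emptied <- chopped[0, ]
roundtrip <- unchop(emptied, y)
typeof(roundtrip$y)
```

With a plain list-column the element type could not be recovered once all rows were gone; here the list_of class carries it along.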
# Chop ----------------------------------------------------------------------
df <- tibble(x = c(1, 1, 1, 2, 2, 3), y = 1:6, z = 6:1)
# Note that we get one row of output for each unique combination of
# non-chopped variables
df %>% chop(c(y, z))
# cf nest
df %>% nest(data = c(y, z))

# Unchop --------------------------------------------------------------------
df <- tibble(x = 1:4, y = list(integer(), 1L, 1:2, 1:3))
df %>% unchop(y)
df %>% unchop(y, keep_empty = TRUE)

# unchop will error if the types are not compatible:
df <- tibble(x = 1:2, y = list("1", 1:3))
try(df %>% unchop(y))

# Unchopping a list-col of data frames must generate a df-col because
# unchop leaves the column names unchanged
df <- tibble(x = 1:3, y = list(NULL, tibble(x = 1), tibble(y = 1:2)))
df %>% unchop(y)
df %>% unchop(y, keep_empty = TRUE)
Two datasets from public data provided by the Centers for Medicare & Medicaid Services, https://data.cms.gov.
cms_patient_experience contains some lightly cleaned data from "Hospice - Provider Data", which provides a list of hospice agencies along with some data on quality of patient care, https://data.cms.gov/provider-data/dataset/252m-zfp9.
cms_patient_care contains data from "Doctors and Clinicians Quality Payment Program PY 2020 Virtual Group Public Reporting", https://data.cms.gov/provider-data/dataset/8c70-d353.
cms_patient_experience
cms_patient_care
cms_patient_experience is a data frame with 500 observations and five variables:
Organisation ID and name
Measure code and title
Measure performance rate
cms_patient_care is a data frame with 252 observations and five variables:
Facility ID and name
Abbreviated measurement title, suitable for use as variable name
Measure score
Whether score refers to the rating out of 100 ("observed"), or the maximum possible value of the raw score ("denominator")
cms_patient_experience %>%
  dplyr::distinct(measure_cd, measure_title)

cms_patient_experience %>%
  pivot_wider(
    id_cols = starts_with("org"),
    names_from = measure_cd,
    values_from = prf_rate
  )

cms_patient_care %>%
  pivot_wider(
    names_from = type,
    values_from = score
  )

cms_patient_care %>%
  pivot_wider(
    names_from = measure_abbr,
    values_from = score
  )

cms_patient_care %>%
  pivot_wider(
    names_from = c(measure_abbr, type),
    values_from = score
  )
Turns implicit missing values into explicit missing values. This is a wrapper around expand(), dplyr::full_join() and replace_na() that's useful for completing missing combinations of data.
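As a rough sketch of what "wrapper" means here, the three steps can be spelled out by hand and compared with a single complete() call. The data frame and the names by_hand and wrapped are assumptions made for illustration; the comparison only shows that both routes give the same result on this small input.

```r
library(tidyr)
library(dplyr)

# Illustrative data: the combination (g = "b", t = 2) is implicitly missing
df <- tibble::tibble(
  g = c("a", "a", "b"),
  t = c(1, 2, 1),
  v = c(10, 20, 30)
)

# Step by step: build all combinations, join the data back, fill the NAs
by_hand <- df %>%
  expand(g, t) %>%
  full_join(df, by = c("g", "t")) %>%
  replace_na(list(v = 0))

# The same operation in one call
wrapped <- df %>% complete(g, t, fill = list(v = 0))

all.equal(as.data.frame(by_hand), as.data.frame(wrapped))
```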
complete(data, ..., fill = list(), explicit = TRUE)
data |
A data frame. |
... |
<
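When used with factors, the full set of levels is used, not just those that appear in the data. When used with continuous variables, you may need to fill in values that do not appear in the data: to do so use expressions like year = 2010:2020 or year = full_seq(year, 1).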
|
fill |
A named list that for each variable supplies a single value to
use instead of |
explicit |
Should both implicit (newly created) and explicit
(pre-existing) missing values be filled by |
With grouped data frames created by dplyr::group_by(), complete() operates within each group. Because of this, you cannot complete a grouping column.
df <- tibble(
  group = c(1:2, 1, 2),
  item_id = c(1:2, 2, 3),
  item_name = c("a", "a", "b", "b"),
  value1 = c(1, NA, 3, 4),
  value2 = 4:7
)
df

# Combinations --------------------------------------------------------------
# Generate all possible combinations of `group`, `item_id`, and `item_name`
# (whether or not they appear in the data)
df %>% complete(group, item_id, item_name)

# Cross all possible `group` values with the unique pairs of
# `(item_id, item_name)` that already exist in the data
df %>% complete(group, nesting(item_id, item_name))

# Within each `group`, generate all possible combinations of
# `item_id` and `item_name` that occur in that group
df %>%
  dplyr::group_by(group) %>%
  complete(item_id, item_name)

# Supplying values for new rows ---------------------------------------------
# Use `fill` to replace NAs with some value. By default, affects both new
# (implicit) and pre-existing (explicit) missing values.
df %>%
  complete(
    group,
    nesting(item_id, item_name),
    fill = list(value1 = 0, value2 = 99)
  )

# Limit the fill to only the newly created (i.e. previously implicit)
# missing values with `explicit = FALSE`
df %>%
  complete(
    group,
    nesting(item_id, item_name),
    fill = list(value1 = 0, value2 = 99),
    explicit = FALSE
  )
Completed construction in the US in 2018
construction
A dataset with variables:
Record date
1 unit, 2 to 4 units, 5 units or more
Number of completed units of each size
Number of completed units in each region
Completions of "New Residential Construction" found in Table 5 at https://www.census.gov/construction/nrc/xls/newresconst.xls (downloaded March 2019)
drop_na() drops rows where any column specified by ... contains a missing value.
drop_na(data, ...)
data |
A data frame. |
... |
< |
Another way to interpret drop_na() is that it only keeps the "complete" rows (rows in which none of the selected columns contains a missing value). Internally, this completeness is computed through vctrs::vec_detect_complete().
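The "complete rows" reading can be illustrated by computing the completeness mask directly and comparing it with drop_na(). This is a minimal sketch on made-up data; the names keep, by_mask and dropped are illustrative, and it is not a description of how drop_na() is implemented beyond what is stated above.

```r
library(tidyr)

# Illustrative data: only the first row has no missing values
df <- tibble::tibble(x = c(1, 2, NA), y = c("a", NA, "b"))

# One logical value per row: TRUE when every column is non-missing
keep <- vctrs::vec_detect_complete(df)

# Subsetting by the mask gives the same rows as drop_na()
by_mask <- df[keep, ]
dropped <- drop_na(df)

all.equal(as.data.frame(by_mask), as.data.frame(dropped))
```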
df <- tibble(x = c(1, 2, NA), y = c("a", NA, "b"))
df %>% drop_na()
df %>% drop_na(x)

vars <- "y"
df %>% drop_na(x, any_of(vars))
expand() generates all combinations of variables found in a dataset. It is paired with nesting() and crossing() helpers. crossing() is a wrapper around expand_grid() that de-duplicates and sorts its inputs; nesting() is a helper that only finds combinations already present in the data.
expand() is often useful in conjunction with joins:
use it with right_join() to convert implicit missing values to explicit missing values (e.g., fill in gaps in your data frame).
use it with anti_join() to figure out which combinations are missing (e.g., identify gaps in your data frame).
expand(data, ..., .name_repair = "check_unique")

crossing(..., .name_repair = "check_unique")

nesting(..., .name_repair = "check_unique")
data |
A data frame. |
... |
<
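When used with factors, the full set of levels is used, not just those that appear in the data. When used with continuous variables, you may need to fill in values that do not appear in the data: to do so use expressions like year = 2010:2020 or year = full_seq(year, 1).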
|
.name_repair |
One of |
With grouped data frames created by dplyr::group_by(), expand() operates within each group. Because of this, you cannot expand on a grouping column.
complete() to expand list objects.
expand_grid() to input vectors rather than a data frame.
# Finding combinations ------------------------------------------------------
fruits <- tibble(
  type = c("apple", "orange", "apple", "orange", "orange", "orange"),
  year = c(2010, 2010, 2012, 2010, 2011, 2012),
  size = factor(
    c("XS", "S", "M", "S", "S", "M"),
    levels = c("XS", "S", "M", "L")
  ),
  weights = rnorm(6, as.numeric(size) + 2)
)

# All combinations, including factor levels that are not used
fruits %>% expand(type)
fruits %>% expand(size)
fruits %>% expand(type, size)
fruits %>% expand(type, size, year)

# Only combinations that already appear in the data
fruits %>% expand(nesting(type))
fruits %>% expand(nesting(size))
fruits %>% expand(nesting(type, size))
fruits %>% expand(nesting(type, size, year))

# Other uses ----------------------------------------------------------------
# Use with `full_seq()` to fill in values of continuous variables
fruits %>% expand(type, size, full_seq(year, 1))
fruits %>% expand(type, size, 2010:2013)

# Use `anti_join()` to determine which observations are missing
all <- fruits %>% expand(type, size, year)
all
all %>% dplyr::anti_join(fruits)

# Use with `right_join()` to fill in missing rows (like `complete()`)
fruits %>% dplyr::right_join(all)

# Use with `group_by()` to expand within each group
fruits %>%
  dplyr::group_by(type) %>%
  expand(year, size)
expand_grid() is heavily motivated by expand.grid(). Compared to expand.grid(), it:
Produces sorted output by varying the first column the slowest by default.
Returns a tibble, not a data frame.
Never converts strings to factors.
Does not add any additional attributes.
Can expand any generalised vector, including data frames.
expand_grid(..., .name_repair = "check_unique", .vary = "slowest")
... |
Name-value pairs. The name will become the column name in the output. |
.name_repair |
One of |
.vary |
One of:
|
A tibble with one column for each input in the ... argument. The output will have one row for each combination of the inputs, i.e. the size will be equal to the product of the sizes of the inputs. This implies that if any input has length 0, the output will have zero rows. The ordering of the output depends on the .vary argument.
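The size rule above is easy to confirm with a short sketch. The inputs and the names grid and empty are illustrative.

```r
library(tidyr)

# Three inputs of sizes 3, 2 and 2: the output has 3 * 2 * 2 rows
grid <- expand_grid(x = 1:3, y = c("a", "b"), z = c(TRUE, FALSE))
nrow(grid)

# A zero-length input makes the product zero, so no rows are produced,
# but both columns are still present
empty <- expand_grid(x = 1:3, y = character())
nrow(empty)
names(empty)
```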
# Default behavior varies the first column "slowest"
expand_grid(x = 1:3, y = 1:2)

# Vary the first column "fastest", like `expand.grid()`
expand_grid(x = 1:3, y = 1:2, .vary = "fastest")

# Can also expand data frames
expand_grid(df = tibble(x = 1:2, y = c(2, 1)), z = 1:3)

# And matrices
expand_grid(x1 = matrix(1:4, nrow = 2), x2 = matrix(5:8, nrow = 2))
extract() has been superseded in favour of separate_wider_regex() because it has a more polished API and better handling of problems. Superseded functions will not go away, but will only receive critical bug fixes.
Given a regular expression with capturing groups, extract() turns each group into a new column. If the groups don't match, or the input is NA, the output will be NA.
extract(
  data,
  col,
  into,
  regex = "([[:alnum:]]+)",
  remove = TRUE,
  convert = FALSE,
  ...
)
data |
A data frame. |
col |
< |
into |
Names of new variables to create as character vector.
Use |
regex |
A string representing a regular expression used to extract the
desired values. There should be one group (defined by |
remove |
If |
convert |
If NB: this will cause string |
... |
Additional arguments passed on to methods. |
separate() to split up by a separator.
df <- tibble(x = c(NA, "a-b", "a-d", "b-c", "d-e"))
df %>% extract(x, "A")
df %>% extract(x, c("A", "B"), "([[:alnum:]]+)-([[:alnum:]]+)")

# Now recommended
df %>%
  separate_wider_regex(
    x,
    patterns = c(A = "[[:alnum:]]+", "-", B = "[[:alnum:]]+")
  )

# If no match, NA:
df %>% extract(x, c("A", "B"), "([a-d]+)-([a-d]+)")
Fills missing values in selected columns using the next or previous entry. This is useful in the common output format where values are not repeated, and are only recorded when they change.
fill(data, ..., .by = NULL, .direction = c("down", "up", "downup", "updown"))
data |
A data frame. |
... |
< |
.by |
< |
.direction |
Direction in which to fill missing values. Currently either "down" (the default), "up", "downup" (i.e. first down and then up) or "updown" (first up and then down). |
Missing values are replaced in atomic vectors; NULLs are replaced in lists.
With grouped data frames created by dplyr::group_by(), fill() will be applied within each group, meaning that it won't fill across group boundaries. This can also be accomplished using the .by argument to fill(), which creates a temporary grouping for just this operation.
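The distinction between atomic vectors and lists can be seen in a small sketch with one column of each kind. The data frame and the name filled are illustrative assumptions.

```r
library(tidyr)

# Illustrative data: `val` is an atomic vector with NAs,
# `lst` is a list-column with NULL elements
df <- tibble::tibble(
  id = 1:4,
  val = c(NA, 1, NA, 2),
  lst = list(NULL, "a", NULL, "b")
)

# Default direction is "down": each gap takes the previous non-missing entry.
# The leading NA / NULL has nothing above it, so it stays missing.
filled <- fill(df, val, lst)
filled$val
filled$lst
```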
# direction = "down" --------------------------------------------------------
# Value (year) is recorded only when it changes
sales <- tibble::tribble(
  ~quarter, ~year, ~sales,
  "Q1",     2000,  66013,
  "Q2",     NA,    69182,
  "Q3",     NA,    53175,
  "Q4",     NA,    21001,
  "Q1",     2001,  46036,
  "Q2",     NA,    58842,
  "Q3",     NA,    44568,
  "Q4",     NA,    50197,
  "Q1",     2002,  39113,
  "Q2",     NA,    41668,
  "Q3",     NA,    30144,
  "Q4",     NA,    52897,
  "Q1",     2004,  32129,
  "Q2",     NA,    67686,
  "Q3",     NA,    31768,
  "Q4",     NA,    49094
)
# `fill()` defaults to replacing missing data from top to bottom
sales %>% fill(year)

# direction = "up" ----------------------------------------------------------
# Value (pet_type) is missing above
tidy_pets <- tibble::tribble(
  ~rank, ~pet_type, ~breed,
  1L,    NA,        "Boston Terrier",
  2L,    NA,        "Retrievers (Labrador)",
  3L,    NA,        "Retrievers (Golden)",
  4L,    NA,        "French Bulldogs",
  5L,    NA,        "Bulldogs",
  6L,    "Dog",     "Beagles",
  1L,    NA,        "Persian",
  2L,    NA,        "Maine Coon",
  3L,    NA,        "Ragdoll",
  4L,    NA,        "Exotic",
  5L,    NA,        "Siamese",
  6L,    "Cat",     "American Short"
)
# For values that are missing above you can use `.direction = "up"`
tidy_pets %>% fill(pet_type, .direction = "up")

# direction = "downup" ------------------------------------------------------
# Value (n_squirrels) is missing above and below within a group
squirrels <- tibble::tribble(
  ~group, ~name,      ~role,         ~n_squirrels,
  1,      "Sam",      "Observer",    NA,
  1,      "Mara",     "Scorekeeper", 8,
  1,      "Jesse",    "Observer",    NA,
  1,      "Tom",      "Observer",    NA,
  2,      "Mike",     "Observer",    NA,
  2,      "Rachael",  "Observer",    NA,
  2,      "Sydekea",  "Scorekeeper", 14,
  2,      "Gabriela", "Observer",    NA,
  3,      "Derrick",  "Observer",    NA,
  3,      "Kara",     "Scorekeeper", 9,
  3,      "Emily",    "Observer",    NA,
  3,      "Danielle", "Observer",    NA
)
# The values are inconsistently missing by position within the `group`.
# Use `.direction = "downup"` to fill missing values in both directions
# and `.by = group` to apply the fill per group.
squirrels %>% fill(n_squirrels, .direction = "downup", .by = group)

# If you want, you can also supply a data frame grouped with `group_by()`,
# but don't forget to `ungroup()`!
squirrels %>%
  dplyr::group_by(group) %>%
  fill(n_squirrels, .direction = "downup") %>%
  dplyr::ungroup()
Information about fish swimming down a river: each station represents an autonomous monitor that records if a tagged fish was seen at that location. Fish travel in one direction (migrating downstream). Information about misses is just as important as hits, but is not directly recorded in this form of the data.
fish_encounters
A dataset with variables:
Fish identifier
Measurement station
Was the fish seen? (1 if yes, and true for all rows)
Dataset provided by Myfanwy Johnston; more details at https://fishsciences.github.io/post/visualizing-fish-encounter-histories/
This is useful if you want to fill in missing values that should have been observed but weren't. For example, full_seq(c(1, 2, 4, 6), 1) will return 1:6.
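A short sketch of this behaviour, including the case where the input does not sit on a regular grid for the requested period. The input vectors and the names filled, by_ten and irregular are illustrative; the failing call is wrapped in try() so the script keeps running.

```r
library(tidyr)

# Gaps at 3 and 5 are filled in
filled <- full_seq(c(1, 2, 4, 6), 1)
filled

# The period does not have to be 1
by_ten <- full_seq(c(0, 10, 30), 10)
by_ten

# 2.5 is not a whole number of periods away from the minimum,
# so the periodicity check fails with an error
irregular <- try(full_seq(c(1, 2.5, 4), 1), silent = TRUE)
inherits(irregular, "try-error")
```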
full_seq(x, period, tol = 1e-06)
x |
A numeric vector. |
period |
Gap between each observation. The existing data will be checked to ensure that it is actually of this periodicity. |
tol |
Numerical tolerance for checking periodicity. |
full_seq(c(1, 2, 4, 5, 10), 1)
Development on gather() is complete, and for new code we recommend switching to pivot_longer(), which is easier to use, more featureful, and still under active development. df %>% gather("key", "value", x, y, z) is equivalent to df %>% pivot_longer(c(x, y, z), names_to = "key", values_to = "value")
See more details in vignette("pivot").
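The equivalence can be checked on a small example. One detail worth knowing when migrating: the two functions return the same rows in a different order (gather() stacks one column after another, while pivot_longer() works through the input row by row), so the sketch below sorts both results before comparing. The data frame and helper names are illustrative assumptions.

```r
library(tidyr)

# Illustrative data with one id column and three value columns
df <- tibble::tibble(id = 1:2, x = c(1, 2), y = c(3, 4), z = c(5, 6))

long_old <- gather(df, "key", "value", x, y, z)
long_new <- pivot_longer(df, c(x, y, z), names_to = "key", values_to = "value")

# Hypothetical helper: sort by id and key, then drop to a base data frame
sort_rows <- function(d) {
  d <- as.data.frame(d[order(d$id, d$key), ])
  rownames(d) <- NULL
  d
}

all.equal(sort_rows(long_old), sort_rows(long_new))
```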
gather(
  data,
  key = "key",
  value = "value",
  ...,
  na.rm = FALSE,
  convert = FALSE,
  factor_key = FALSE
)
data |
A data frame. |
key , value
|
Names of new key and value columns, as strings or symbols. This argument is passed by expression and supports
quasiquotation (you can unquote strings
and symbols). The name is captured from the expression with
|
... |
A selection of columns. If empty, all variables are
selected. You can supply bare variable names, select all
variables between x and z with |
na.rm |
If |
convert |
If |
factor_key |
If |
Arguments for selecting columns are passed to tidyselect::vars_select() and are treated specially. Unlike other verbs, selecting functions make a strict distinction between data expressions and context expressions.
A data expression is either a bare name like x or an expression like x:y or c(x, y). In a data expression, you can only refer to columns from the data frame.
Everything else is a context expression in which you can only refer to objects that you have defined with <-.
For instance, col1:col3 is a data expression that refers to data columns, while seq(start, end) is a context expression that refers to objects from the context.
If you need to refer to contextual objects from a data expression, you can use all_of() or any_of(). These functions are used to select data-variables whose names are stored in an env-variable. For instance, all_of(a) selects the variables listed in the character vector a.
For more details, see the tidyselect::select_helpers() documentation.
# From https://stackoverflow.com/questions/1181060
stocks <- tibble(
  time = as.Date("2009-01-01") + 0:9,
  X = rnorm(10, 0, 1),
  Y = rnorm(10, 0, 2),
  Z = rnorm(10, 0, 4)
)

gather(stocks, "stock", "price", -time)
stocks %>% gather("stock", "price", -time)

# get first observation for each Species in iris data -- base R
mini_iris <- iris[c(1, 51, 101), ]
# gather Sepal.Length, Sepal.Width, Petal.Length, Petal.Width
gather(mini_iris, key = "flower_att", value = "measurement",
  Sepal.Length, Sepal.Width, Petal.Length, Petal.Width)
# same result but less verbose
gather(mini_iris, key = "flower_att", value = "measurement", -Species)
hoist() allows you to selectively pull components of a list-column into their own top-level columns, using the same syntax as purrr::pluck().
Learn more in vignette("rectangle").
hoist(
  .data,
  .col,
  ...,
  .remove = TRUE,
  .simplify = TRUE,
  .ptype = NULL,
  .transform = NULL
)
.data |
A data frame. |
.col |
< |
... |
< The column names must be unique in a call to |
.remove |
If |
.simplify |
If |
.ptype |
Optionally, a named list of prototypes declaring the desired output type of each component. Alternatively, a single empty prototype can be supplied, which will be applied to all components. Use this argument if you want to check that each element has the type you expect when simplifying. If a |
.transform |
Optionally, a named list of transformation functions applied to each component. Alternatively, a single function can be supplied, which will be applied to all components. Use this argument if you want to transform or parse individual elements as they are extracted. When both |
Other rectangling: unnest(), unnest_longer(), unnest_wider()
df <- tibble(
  character = c("Toothless", "Dory"),
  metadata = list(
    list(
      species = "dragon",
      color = "black",
      films = c(
        "How to Train Your Dragon",
        "How to Train Your Dragon 2",
        "How to Train Your Dragon: The Hidden World"
      )
    ),
    list(
      species = "blue tang",
      color = "blue",
      films = c("Finding Nemo", "Finding Dory")
    )
  )
)
df

# Extract only specified components
df %>% hoist(metadata,
  "species",
  first_film = list("films", 1L),
  third_film = list("films", 3L)
)
This dataset is based on an example in vignette("datatable-reshape", package = "data.table")
household
A data frame with 5 rows and 5 columns:
Family identifier
Date of birth of first child
Date of birth of second child
Name of first child
?
Name of second child
Nesting creates a list-column of data frames; unnesting flattens it back out into regular columns. Nesting is implicitly a summarising operation: you get one row for each group defined by the non-nested columns. This is useful in conjunction with other summaries that work with whole datasets, most notably models.
Learn more in vignette("nest").
nest(.data, ..., .by = NULL, .key = NULL, .names_sep = NULL)
.data |
A data frame. |
... |
< Specified using name-variable pairs of the form
If not supplied, then :
previously you could write |
.by |
<
If not supplied, then |
.key |
The name of the resulting nested column. Only applicable when
If |
.names_sep |
If |
If neither ... nor .by are supplied, nest() will nest all variables, and will use the column name supplied through .key.
tidyr 1.0.0 introduced a new syntax for nest() and unnest() that's designed to be more similar to other functions. Converting to the new syntax should be straightforward (guided by the message you'll receive) but if you just need to run an old analysis, you can easily revert to the previous behaviour using nest_legacy() and unnest_legacy() as follows:
library(tidyr)
nest <- nest_legacy
unnest <- unnest_legacy
df %>% nest(data = c(x, y)) specifies the columns to be nested; i.e. the columns that will appear in the inner data frame. df %>% nest(.by = c(x, y)) specifies the columns to nest by; i.e. the columns that will remain in the outer data frame. An alternative way to achieve the latter is to nest() a grouped data frame created by dplyr::group_by(). The grouping variables remain in the outer data frame and the others are nested. The result preserves the grouping of the input.
Variables supplied to nest() will override grouping variables so that df %>% group_by(x, y) %>% nest(data = !z) will be equivalent to df %>% nest(data = !z).
You can't supply .by with a grouped data frame, as the groups already represent what you are nesting by.
df <- tibble(x = c(1, 1, 1, 2, 2, 3), y = 1:6, z = 6:1)

# Specify variables to nest using name-variable pairs.
# Note that we get one row of output for each unique combination of
# non-nested variables.
df %>% nest(data = c(y, z))

# Specify variables to nest by (rather than variables to nest) using `.by`
df %>% nest(.by = x)

# In this case, since `...` isn't used you can specify the resulting column
# name with `.key`
df %>% nest(.by = x, .key = "cols")

# Use tidyselect syntax and helpers, just like in `dplyr::select()`
df %>% nest(data = any_of(c("y", "z")))

# `...` and `.by` can be used together to drop columns you no longer need,
# or to include the columns you are nesting by in the inner data frame too.
# This drops `z`:
df %>% nest(data = y, .by = x)
# This includes `x` in the inner data frame:
df %>% nest(data = everything(), .by = x)

# Multiple nesting structures can be specified at once
iris %>%
  nest(petal = starts_with("Petal"), sepal = starts_with("Sepal"))
iris %>%
  nest(width = contains("Width"), length = contains("Length"))

# Nesting a grouped data frame nests all variables apart from the group vars
fish_encounters %>%
  dplyr::group_by(fish) %>%
  nest()

# That is similar to `nest(.by = )`, except here the result isn't grouped
fish_encounters %>%
  nest(.by = fish)

# Nesting is often useful for creating per group models
mtcars %>%
  nest(.by = cyl) %>%
  dplyr::mutate(models = lapply(data, function(df) lm(mpg ~ wt, data = df)))
nest()
and unnest()
tidyr 1.0.0 introduced a new syntax for nest()
and unnest()
. The majority
of existing usage should be automatically translated to the new syntax with a
warning. However, if you need to quickly roll back to the previous behaviour,
these functions provide the previous interface. To make old code work as is,
add the following code to the top of your script:
library(tidyr)
nest <- nest_legacy
unnest <- unnest_legacy

nest_legacy(data, ..., .key = "data")

unnest_legacy(data, ..., .drop = NA, .id = NULL, .sep = NULL, .preserve = NULL)
data |
A data frame. |
... |
Specification of columns to unnest. Use bare variable names or functions of variables. If omitted, defaults to all list-cols. |
.key |
The name of the new column, as a string or symbol. This argument
is passed by expression and supports
quasiquotation (you can unquote strings and
symbols). The name is captured from the expression with |
.drop |
Should additional list columns be dropped? By default,
|
.id |
Data frame identifier - if supplied, will create a new column with
name |
.sep |
If non- |
.preserve |
Optionally, list-columns to preserve in the output. These
will be duplicated in the same way as atomic vectors. This has
|
# Nest and unnest are inverses
df <- tibble(x = c(1, 1, 2), y = 3:1)
df %>% nest_legacy(y)
df %>% nest_legacy(y) %>% unnest_legacy()

# nesting -------------------------------------------------------------------
as_tibble(iris) %>% nest_legacy(!Species)
as_tibble(chickwts) %>% nest_legacy(weight)

# unnesting -----------------------------------------------------------------
df <- tibble(
  x = 1:2,
  y = list(
    tibble(z = 1),
    tibble(z = 3:4)
  )
)
df %>% unnest_legacy(y)

# You can also unnest multiple columns simultaneously
df <- tibble(
  a = list(c("a", "b"), "c"),
  b = list(1:2, 3),
  c = c(11, 22)
)
df %>% unnest_legacy(a, b)
# If you omit the column names, it'll unnest all list-cols
df %>% unnest_legacy()
Packing and unpacking preserve the length of a data frame, changing its
width. pack()
makes df
narrow by collapsing a set of columns into a
single df-column. unpack()
makes data
wider by expanding df-columns
back out into individual columns.
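As a rough illustration of "same length, different width", the sketch below packs two columns and unpacks them again. The tibble and the names `packed` and `unpacked` are made up for this example.

```r
library(tidyr)

# Hypothetical toy data: three rows, three columns
df <- tibble(x1 = 1:3, x2 = 4:6, y = 1:3)

# pack(): x1 and x2 collapse into a single df-column `x`;
# the number of rows is unchanged
packed <- df %>% pack(x = c(x1, x2))
dim(packed)

# unpack(): the df-column expands back into x1 and x2
unpacked <- packed %>% unpack(x)
dim(unpacked)
```

The row count stays at three throughout; only the column count changes.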
pack(.data, ..., .names_sep = NULL, .error_call = current_env())

unpack(
  data,
  cols,
  ...,
  names_sep = NULL,
  names_repair = "check_unique",
  error_call = current_env()
)
... |
For For |
data , .data
|
A data frame. |
cols |
< |
names_sep , .names_sep
|
If If a string, the inner and outer names will be used together. In
|
names_repair |
Used to check that output data frame has valid names. Must be one of the following options:
See |
error_call , .error_call
|
The execution environment of a currently
running function, e.g. |
Generally, unpacking is more useful than packing because it simplifies a complex data structure. Currently, few functions work with df-cols, and they are mostly a curiosity, but seem worth exploring further because they mimic the nested column headers that are so popular in Excel.
# Packing -------------------------------------------------------------------
# It's not currently clear why you would ever want to pack columns
# since few functions work with this sort of data.
df <- tibble(x1 = 1:3, x2 = 4:6, x3 = 7:9, y = 1:3)
df
df %>% pack(x = starts_with("x"))
df %>% pack(x = c(x1, x2, x3), y = y)

# .names_sep allows you to strip off common prefixes; this
# acts as a natural inverse to names_sep in unpack()
iris %>%
  as_tibble() %>%
  pack(
    Sepal = starts_with("Sepal"),
    Petal = starts_with("Petal"),
    .names_sep = "."
  )

# Unpacking -----------------------------------------------------------------
df <- tibble(
  x = 1:3,
  y = tibble(a = 1:3, b = 3:1),
  z = tibble(X = c("a", "b", "c"), Y = runif(3), Z = c(TRUE, FALSE, NA))
)
df
df %>% unpack(y)
df %>% unpack(c(y, z))
df %>% unpack(c(y, z), names_sep = "_")
pivot_longer()
"lengthens" data, increasing the number of rows and
decreasing the number of columns. The inverse transformation is
pivot_wider()
.
Learn more in vignette("pivot")
.
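To make the "inverse transformation" concrete, here is a small sketch that lengthens a made-up tibble and then widens it back. The data and the names `wide`, `long`, and `back` are illustrative assumptions.

```r
library(tidyr)

# Hypothetical toy data: one row per id, one column per measurement
wide <- tibble(id = 1:2, a = c(10, 20), b = c(30, 40))

# Lengthen: columns a and b become rows, using the default
# `name` and `value` output columns
long <- wide %>% pivot_longer(cols = c(a, b))
dim(long)

# Widen again: the defaults names_from = name and values_from = value
# are spelled out here for clarity
back <- long %>% pivot_wider(names_from = name, values_from = value)
all.equal(as.data.frame(back), as.data.frame(wide))
```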
pivot_longer(
  data,
  cols,
  ...,
  cols_vary = "fastest",
  names_to = "name",
  names_prefix = NULL,
  names_sep = NULL,
  names_pattern = NULL,
  names_ptypes = NULL,
  names_transform = NULL,
  names_repair = "check_unique",
  values_to = "value",
  values_drop_na = FALSE,
  values_ptypes = NULL,
  values_transform = NULL
)
data |
A data frame to pivot. |
cols |
< |
... |
Additional arguments passed on to methods. |
cols_vary |
When pivoting
|
names_to |
A character vector specifying the new column or columns to
create from the information stored in the column names of
|
names_prefix |
A regular expression used to remove matching text from the start of each variable name. |
names_sep , names_pattern
|
If
If these arguments do not give you enough control, use
|
names_ptypes , values_ptypes
|
Optionally, a list of column name-prototype
pairs. Alternatively, a single empty prototype can be supplied, which will
be applied to all columns. A prototype (or ptype for short) is a
zero-length vector (like |
names_transform , values_transform
|
Optionally, a list of column
name-function pairs. Alternatively, a single function can be supplied,
which will be applied to all columns. Use these arguments if you need to
change the types of specific columns. For example, If not specified, the type of the columns generated from |
names_repair |
What happens if the output has invalid column names?
The default, |
values_to |
A string specifying the name of the column to create
from the data stored in cell values. If |
values_drop_na |
If |
pivot_longer()
is an updated approach to gather()
, designed to be both
simpler to use and to handle more use cases. We recommend you use
pivot_longer()
for new code; gather()
isn't going away but is no longer
under active development.
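The naming arguments are easier to follow on a tiny example. The sketch below assumes a made-up tibble whose column names encode two pieces of information separated by `_`. It splits them with `names_sep` and converts one piece to integer with `names_transform`.

```r
library(tidyr)

# Hypothetical toy data: column names hold a measure and a round number
scores <- tibble(id = 1:2, score_1 = c(5, 7), score_2 = c(6, 8))

long <- scores %>%
  pivot_longer(
    cols = !id,
    names_to = c("measure", "round"),           # two new columns from the names
    names_sep = "_",                            # split "score_1" at the underscore
    names_transform = list(round = as.integer), # "1" -> 1L
    values_to = "points"
  )
str(long)
```

Without `names_transform`, `round` would be a character column, because it is derived from column names.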
# See vignette("pivot") for examples and explanation

# Simplest case where column names are character data
relig_income
relig_income %>%
  pivot_longer(!religion, names_to = "income", values_to = "count")

# Slightly more complex case where columns have common prefix,
# and missing values are structural so should be dropped.
billboard
billboard %>%
  pivot_longer(
    cols = starts_with("wk"),
    names_to = "week",
    names_prefix = "wk",
    values_to = "rank",
    values_drop_na = TRUE
  )

# Multiple variables stored in column names
who %>% pivot_longer(
  cols = new_sp_m014:newrel_f65,
  names_to = c("diagnosis", "gender", "age"),
  names_pattern = "new_?(.*)_(.)(.*)",
  values_to = "count"
)

# Multiple observations per row. Since all columns are used in the pivoting
# process, we'll use `cols_vary` to keep values from the original columns
# close together in the output.
anscombe
anscombe %>%
  pivot_longer(
    everything(),
    cols_vary = "slowest",
    names_to = c(".value", "set"),
    names_pattern = "(.)(.)"
  )
pivot_wider()
"widens" data, increasing the number of columns and
decreasing the number of rows. The inverse transformation is
pivot_longer()
.
Learn more in vignette("pivot")
.
pivot_wider(
  data,
  ...,
  id_cols = NULL,
  id_expand = FALSE,
  names_from = name,
  names_prefix = "",
  names_sep = "_",
  names_glue = NULL,
  names_sort = FALSE,
  names_vary = "fastest",
  names_expand = FALSE,
  names_repair = "check_unique",
  values_from = value,
  values_fill = NULL,
  values_fn = NULL,
  unused_fn = NULL
)
data |
A data frame to pivot. |
... |
Additional arguments passed on to methods. |
id_cols |
< Defaults to all columns in |
id_expand |
Should the values in the |
names_from , values_from
|
< If |
names_prefix |
String added to the start of every variable name. This is
particularly useful if |
names_sep |
If |
names_glue |
Instead of |
names_sort |
Should the column names be sorted? If |
names_vary |
When
|
names_expand |
Should the values in the |
names_repair |
What happens if the output has invalid column names?
The default, |
values_fill |
Optionally, a (scalar) value that specifies what each
This can be a named list if you want to apply different fill values to different value columns. |
values_fn |
Optionally, a function applied to the value in each cell
in the output. You will typically use this when the combination of
This can be a named list if you want to apply different aggregations
to different |
unused_fn |
Optionally, a function applied to summarize the values from
the unused columns (i.e. columns not identified by The default drops all unused columns from the result. This can be a named list if you want to apply different aggregations to different unused columns.
This is similar to grouping by the |
pivot_wider()
is an updated approach to spread()
, designed to be both
simpler to use and to handle more use cases. We recommend you use
pivot_wider()
for new code; spread()
isn't going away but is no longer
under active development.
pivot_wider_spec()
to pivot "by hand" with a data frame that
defines a pivoting specification.
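The interaction between `values_fn` and `values_fill` can be sketched on a few rows. The `sales` tibble below is invented for illustration: it has a duplicated shop/item pair, which `values_fn` aggregates, and a missing pair, which `values_fill` fills.

```r
library(tidyr)

# Hypothetical toy data: "north"/"tea" appears twice, "south"/"tea" never
sales <- tibble(
  shop = c("north", "north", "south"),
  item = c("tea", "tea", "coffee"),
  qty  = c(1, 2, 3)
)

wide <- sales %>%
  pivot_wider(
    names_from = item,
    values_from = qty,
    values_fn = sum,   # collapse duplicate shop/item cells by summing
    values_fill = 0    # cells with no matching input row become 0
  )
wide
```

Here `north` gets `tea = 3` (1 + 2) and `coffee = 0`, while `south` gets `tea = 0` and `coffee = 3`.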
# See vignette("pivot") for examples and explanation

fish_encounters
fish_encounters %>%
  pivot_wider(names_from = station, values_from = seen)
# Fill in missing values
fish_encounters %>%
  pivot_wider(names_from = station, values_from = seen, values_fill = 0)

# Generate column names from multiple variables
us_rent_income
us_rent_income %>%
  pivot_wider(
    names_from = variable,
    values_from = c(estimate, moe)
  )

# You can control whether `names_from` values vary fastest or slowest
# relative to the `values_from` column names using `names_vary`.
us_rent_income %>%
  pivot_wider(
    names_from = variable,
    values_from = c(estimate, moe),
    names_vary = "slowest"
  )

# When there are multiple `names_from` or `values_from`, you can
# use `names_sep` or `names_glue` to control the output variable names
us_rent_income %>%
  pivot_wider(
    names_from = variable,
    names_sep = ".",
    values_from = c(estimate, moe)
  )
us_rent_income %>%
  pivot_wider(
    names_from = variable,
    names_glue = "{variable}_{.value}",
    values_from = c(estimate, moe)
  )

# Can perform aggregation with `values_fn`
warpbreaks <- as_tibble(warpbreaks[c("wool", "tension", "breaks")])
warpbreaks
warpbreaks %>%
  pivot_wider(
    names_from = wool,
    values_from = breaks,
    values_fn = mean
  )

# Can pass an anonymous function to `values_fn` when you
# need to supply additional arguments
warpbreaks$breaks[1] <- NA
warpbreaks %>%
  pivot_wider(
    names_from = wool,
    values_from = breaks,
    values_fn = ~ mean(.x, na.rm = TRUE)
  )
Pew religion and income survey
relig_income
A dataset with variables:
Name of religion
<$10k-Don't know/refused
Number of respondees with income range in column name
Downloaded from https://www.pewresearch.org/religious-landscape-study/database/ (downloaded November 2009)
Replace NAs with specified values
replace_na(data, replace, ...)
data |
A data frame or vector. |
replace |
If If |
... |
Additional arguments for methods. Currently unused. |
replace_na()
returns an object with the same type as data
.
dplyr::na_if()
to replace specified values with NA
s;
dplyr::coalesce()
to replace NA
s with values from other vectors.
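The three helpers mentioned here are easy to confuse, so a short side-by-side sketch may help. The vectors are made up, and `-99` is an assumed sentinel value chosen only for this example; dplyr is assumed to be installed.

```r
library(tidyr)

x <- c(1, NA, 3)

# replace_na(): NA -> one fixed value
filled <- replace_na(x, 0)

# dplyr::na_if(): the opposite direction, a chosen value -> NA
# (-99 is a hypothetical "missing" code)
sentinel_to_na <- dplyr::na_if(c(1, -99, 3), -99)

# dplyr::coalesce(): NA -> the value at the same position in another vector
merged <- dplyr::coalesce(x, c(10, 20, 30))

filled
sentinel_to_na
merged
```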
# Replace NAs in a data frame
df <- tibble(x = c(1, 2, NA), y = c("a", NA, "b"))
df %>% replace_na(list(x = 0, y = "unknown"))

# Replace NAs in a vector
df %>% dplyr::mutate(x = replace_na(x, 0))
# OR
df$x %>% replace_na(0)
df$y %>% replace_na("unknown")

# Replace NULLs in a list: NULLs are the list-col equivalent of NAs
df_list <- tibble(z = list(1:5, NULL, 10:20))
df_list %>% replace_na(list(z = list(5)))
separate()
has been superseded in favour of separate_wider_position()
and separate_wider_delim()
because the two functions make the two uses
more obvious, the API is more polished, and the handling of problems is
better. Superseded functions will not go away, but will only receive
critical bug fixes.
Given either a regular expression or a vector of character positions,
separate()
turns a single character column into multiple columns.
separate(
  data,
  col,
  into,
  sep = "[^[:alnum:]]+",
  remove = TRUE,
  convert = FALSE,
  extra = "warn",
  fill = "warn",
  ...
)
data |
A data frame. |
col |
< |
into |
Names of new variables to create as character vector.
Use |
sep |
Separator between columns. If character, If numeric, |
remove |
If |
convert |
If NB: this will cause string |
extra |
If
|
fill |
If
|
... |
Additional arguments passed on to methods. |
unite()
, the complement, extract()
which uses regular
expression capturing groups.
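Since `separate()` covers two distinct uses (a regular-expression separator or character positions), it can help to see each next to its recommended replacement. The following is a sketch on invented data; the column and variable names are illustrative.

```r
library(tidyr)

# Use 1: split on a delimiter
df <- tibble(x = c("a-1", "b-2"))
old <- df %>% separate(x, into = c("key", "value"), sep = "-")
new <- df %>% separate_wider_delim(x, delim = "-", names = c("key", "value"))

# Use 2: split at a character position
codes <- tibble(code = c("ab12", "cd34"))
old_pos <- codes %>%
  separate(code, into = c("letters", "digits"), sep = 2)
new_pos <- codes %>%
  separate_wider_position(code, widths = c(letters = 2, digits = 2))

all.equal(as.data.frame(old), as.data.frame(new))
all.equal(as.data.frame(old_pos), as.data.frame(new_pos))
```

Note that `separate()` treats a character `sep` as a regular expression, whereas `separate_wider_delim()` takes a literal delimiter by default, so patterns with special characters do not transfer one-to-one.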
# If you want to split by any non-alphanumeric value (the default):
df <- tibble(x = c(NA, "x.y", "x.z", "y.z"))
df %>% separate(x, c("A", "B"))

# If you just want the second variable:
df %>% separate(x, c(NA, "B"))

# We now recommend separate_wider_delim() instead:
df %>% separate_wider_delim(x, ".", names = c("A", "B"))
df %>% separate_wider_delim(x, ".", names = c(NA, "B"))

# Controlling uneven splits -------------------------------------------------
# If every row doesn't split into the same number of pieces, use
# the extra and fill arguments to control what happens:
df <- tibble(x = c("x", "x y", "x y z", NA))
df %>% separate(x, c("a", "b"))
# The same behaviour as previous, but drops the c without warnings:
df %>% separate(x, c("a", "b"), extra = "drop", fill = "right")
# Opposite of previous, keeping the c and filling left:
df %>% separate(x, c("a", "b"), extra = "merge", fill = "left")
# Or you can keep all three:
df %>% separate(x, c("a", "b", "c"))

# To only split a specified number of times use extra = "merge":
df <- tibble(x = c("x: 123", "y: error: 7"))
df %>% separate(x, c("key", "value"), ": ", extra = "merge")

# Controlling column types --------------------------------------------------
# convert = TRUE detects column classes:
df <- tibble(x = c("x:1", "x:2", "y:4", "z", NA))
df %>% separate(x, c("key", "value"), ":") %>% str()
df %>% separate(x, c("key", "value"), ":", convert = TRUE) %>% str()
Each of these functions takes a string and splits it into multiple rows:
separate_longer_delim()
splits by a delimiter.
separate_longer_position()
splits by a fixed width.
separate_longer_delim(data, cols, delim, ...)

separate_longer_position(data, cols, width, ..., keep_empty = FALSE)
data |
A data frame. |
cols |
< |
delim |
For |
... |
These dots are for future extensions and must be empty. |
width |
For |
keep_empty |
By default, you'll get |
A data frame based on data
. It has the same columns, but different
rows.
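A short sketch of "same columns, different rows", using made-up data. It also shows the effect of `keep_empty` on an empty string; the names `by_delim`, `dropped`, and `kept` are illustrative.

```r
library(tidyr)

# Split by delimiter: each piece gets its own row, `id` is repeated
df <- tibble(id = 1:2, x = c("a,b", "c"))
by_delim <- df %>% separate_longer_delim(x, delim = ",")
by_delim

# Split by fixed width: "def" becomes "de" and "f"
fixed <- tibble(id = 1:3, x = c("ab", "def", ""))

# By default the row holding the empty string disappears ...
dropped <- fixed %>% separate_longer_position(x, width = 2)

# ... while keep_empty = TRUE retains it with a missing value
kept <- fixed %>% separate_longer_position(x, width = 2, keep_empty = TRUE)

nrow(dropped)
nrow(kept)
```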
df <- tibble(id = 1:4, x = c("x", "x y", "x y z", NA))
df %>% separate_longer_delim(x, delim = " ")

# You can separate multiple columns at once if they have the same structure
df <- tibble(id = 1:3, x = c("x", "x y", "x y z"), y = c("a", "a b", "a b c"))
df %>% separate_longer_delim(c(x, y), delim = " ")

# Or instead split by a fixed length
df <- tibble(id = 1:3, x = c("ab", "def", ""))
df %>% separate_longer_position(x, 1)
df %>% separate_longer_position(x, 2)
df %>% separate_longer_position(x, 2, keep_empty = TRUE)
separate_rows()
has been superseded in favour of separate_longer_delim()
because it has a more consistent API with other separate functions.
Superseded functions will not go away, but will only receive critical bug
fixes.
If a variable contains observations with multiple delimited values,
separate_rows()
separates the values and places each one in its own row.
separate_rows(data, ..., sep = "[^[:alnum:].]+", convert = FALSE)
data |
A data frame. |
... |
< |
sep |
Separator delimiting collapsed values. |
convert |
If |
df <- tibble(
  x = 1:3,
  y = c("a", "d,e,f", "g,h"),
  z = c("1", "2,3,4", "5,6")
)
separate_rows(df, y, z, convert = TRUE)

# Now recommended
df %>%
  separate_longer_delim(c(y, z), delim = ",")
Each of these functions takes a string column and splits it into multiple new columns:
separate_wider_delim()
splits by delimiter.
separate_wider_position()
splits at fixed widths.
separate_wider_regex()
splits with regular expression matches.
These functions are equivalent to separate()
and extract()
, but use
stringr as the underlying string
manipulation engine, and their interfaces reflect what we've learned from
unnest_wider()
and unnest_longer()
.
separate_wider_delim(
  data,
  cols,
  delim,
  ...,
  names = NULL,
  names_sep = NULL,
  names_repair = "check_unique",
  too_few = c("error", "debug", "align_start", "align_end"),
  too_many = c("error", "debug", "drop", "merge"),
  cols_remove = TRUE
)

separate_wider_position(
  data,
  cols,
  widths,
  ...,
  names_sep = NULL,
  names_repair = "check_unique",
  too_few = c("error", "debug", "align_start"),
  too_many = c("error", "debug", "drop"),
  cols_remove = TRUE
)

separate_wider_regex(
  data,
  cols,
  patterns,
  ...,
  names_sep = NULL,
  names_repair = "check_unique",
  too_few = c("error", "debug", "align_start"),
  cols_remove = TRUE
)
data |
A data frame. |
cols |
< |
delim |
For |
... |
These dots are for future extensions and must be empty. |
names |
For |
names_sep |
If supplied, output names will be composed
of the input column name followed by the separator followed by the
new column name. Required when For |
names_repair |
Used to check that output data frame has valid names. Must be one of the following options:
See |
too_few |
What should happen if a value separates into too few pieces?
|
too_many |
What should happen if a value separates into too many pieces?
|
cols_remove |
Should the input |
widths |
A named numeric vector where the names become column names, and the values specify the column width. Unnamed components will match, but not be included in the output. |
patterns |
A named character vector where the names become column names and the values are regular expressions that match the contents of the vector. Unnamed components will match, but not be included in the output. |
A data frame based on data
. It has the same rows, but different
columns:
The primary purpose of these functions is to create new columns from
components of the string.
For separate_wider_delim()
the names of new columns come from names
.
For separate_wider_position()
the names come from the names of widths
.
For separate_wider_regex()
the names come from the names of
patterns
.
If too_few
or too_many
is "debug"
, the output will contain additional
columns useful for debugging:
{col}_ok
: a logical vector which tells you if the input was ok or
not. Use to quickly find the problematic rows.
{col}_remainder
: any text remaining after separation.
{col}_pieces
, {col}_width
, {col}_matches
: number of pieces,
number of characters, and number of matches for separate_wider_delim()
,
separate_wider_position()
and separate_wider_regex()
respectively.
If cols_remove = TRUE
(the default), the input cols
will be removed
from the output.
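The debugging columns described above can be inspected on a two-row example. This is a sketch with invented data; the warning that announces debug mode is suppressed here only to keep the output quiet.

```r
library(tidyr)

# Hypothetical toy data: the second value has only one piece
df <- tibble(x = c("a-b", "c"))

debugged <- suppressWarnings(
  df %>%
    separate_wider_delim(
      x,
      delim = "-",
      names = c("left", "right"),
      too_few = "debug"
    )
)

# Extra columns are named after the input column, here `x`
names(debugged)

# Use the logical `x_ok` column to pull out the problematic rows
debugged[!debugged$x_ok, ]
```

Once the problem rows are understood, `too_few` can be switched to `"align_start"` or `"align_end"` to handle them without the debugging columns.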
df <- tibble(id = 1:3, x = c("m-123", "f-455", "f-123"))
# There are three basic ways to split up a string into pieces:
# 1. with a delimiter
df %>% separate_wider_delim(x, delim = "-", names = c("gender", "unit"))
# 2. by length
df %>% separate_wider_position(x, c(gender = 1, 1, unit = 3))
# 3. defining each component with a regular expression
df %>% separate_wider_regex(x, c(gender = ".", ".", unit = "\\d+"))

# Sometimes you split on the "last" delimiter
df <- tibble(var = c("race_1", "race_2", "age_bucket_1", "age_bucket_2"))
# _delim won't help because it always splits on the first delimiter
try(df %>% separate_wider_delim(var, "_", names = c("var1", "var2")))
df %>% separate_wider_delim(var, "_", names = c("var1", "var2"), too_many = "merge")
# Instead, you can use _regex
df %>% separate_wider_regex(var, c(var1 = ".*", "_", var2 = ".*"))
# this works because * is greedy; you can mimic the _delim behaviour with .*?
df %>% separate_wider_regex(var, c(var1 = ".*?", "_", var2 = ".*"))

# If the number of components varies, it's most natural to split into rows
df <- tibble(id = 1:4, x = c("x", "x y", "x y z", NA))
df %>% separate_longer_delim(x, delim = " ")
# But separate_wider_delim() provides some tools to deal with the problem
# The default behaviour tells you that there's a problem
try(df %>% separate_wider_delim(x, delim = " ", names = c("a", "b")))
# You can get additional insight by using the debug options
df %>%
  separate_wider_delim(
    x,
    delim = " ",
    names = c("a", "b"),
    too_few = "debug",
    too_many = "debug"
  )

# But you can suppress the warnings
df %>%
  separate_wider_delim(
    x,
    delim = " ",
    names = c("a", "b"),
    too_few = "align_start",
    too_many = "merge"
  )

# Or choose to automatically name the columns, producing as many as needed
df %>% separate_wider_delim(x, delim = " ", names_sep = "", too_few = "align_start")
A small demo dataset describing John and Mary Smith.
smiths
A data frame with 2 rows and 5 columns.
Development on spread() is complete, and for new code we recommend
switching to pivot_wider(), which is easier to use, more featureful, and
still under active development.
df %>% spread(key, value) is equivalent to
df %>% pivot_wider(names_from = key, values_from = value).
See more details in vignette("pivot").
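The equivalence stated above can be checked on a small example. This is a minimal sketch: the tibble and its column names (`id`, `key`, `value`) are made up for illustration, and the keys are chosen so that both functions produce columns in the same order.

```r
library(tidyr)

# Hypothetical long-format data: one row per (id, key) pair
df <- tibble(
  id    = c(1, 1, 2, 2),
  key   = c("a", "b", "a", "b"),
  value = 1:4
)

# Superseded interface
old <- df %>% spread(key, value)

# Recommended replacement
new <- df %>% pivot_wider(names_from = key, values_from = value)

old
new
```

On data where keys are not already in sorted order, the column order of the two results may differ, so compare by column name rather than by position.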
spread(data, key, value, fill = NA, convert = FALSE, drop = TRUE, sep = NULL)
data |
A data frame. |
key , value
|
< |
fill |
If set, missing values will be replaced with this value. Note
that there are two types of missingness in the input: explicit missing
values (i.e. |
convert |
If |
drop |
If |
sep |
If |
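The fill argument distinguishes explicit missing values from rows that are simply absent from the input. The sketch below assumes the behaviour described in the upstream tidyr documentation, namely that both kinds are replaced by fill; the tibble is invented for illustration.

```r
library(tidyr)

# Hypothetical input:
# - id 1 has an explicit NA for key "b"
# - id 2 has no row at all for key "b" (implicit missing value)
df <- tibble(
  id    = c(1, 1, 2),
  key   = c("a", "b", "a"),
  value = c(1, NA, 3)
)

# Default: both cells show up as NA in the wide result
df %>% spread(key, value)

# With fill = 0, both the explicit and the implicit gap become 0
filled <- df %>% spread(key, value, fill = 0)
filled
```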
stocks <- tibble( time = as.Date("2009-01-01") + 0:9, X = rnorm(10, 0, 1), Y = rnorm(10, 0, 2), Z = rnorm(10, 0, 4) ) stocksm <- stocks %>% gather(stock, price, -time) stocksm %>% spread(stock, price) stocksm %>% spread(time, price) # Spread and gather are complements df <- tibble(x = c("a", "b"), y = c(3, 4), z = c(5, 6)) df %>% spread(x, y) %>% gather("x", "y", a:b, na.rm = TRUE) # Use 'convert = TRUE' to produce variables of mixed type df <- tibble( row = rep(c(1, 51), each = 3), var = rep(c("Sepal.Length", "Species", "Species_num"), 2), value = c(5.1, "setosa", 1, 7.0, "versicolor", 2) ) df %>% spread(var, value) %>% str() df %>% spread(var, value, convert = TRUE) %>% str()
Data sets that demonstrate multiple ways to lay out the same tabular data.
table1 table2 table3 table4a table4b table5
table1, table2, table3, table4a, table4b, and table5
all display the number of TB cases documented by the World
Health Organization in Afghanistan, Brazil, and China between 1999 and 2000.
The data contains values associated with four variables (country, year,
cases, and population), but each table organizes the values in a different
layout.
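Because the tables hold the same values in different layouts, pivoting can move between them. The following is a sketch rather than part of the original dataset documentation; it assumes table2 has `type` and `count` columns and table4a has one column per year, as shipped with tidyr.

```r
library(tidyr)

# table2 stores cases and population in alternating rows of `count`,
# labelled by `type`; widening it recovers the table1 layout.
wide <- table2 %>%
  pivot_wider(names_from = type, values_from = count)

# table4a stores cases with one column per year; lengthening it
# gives one row per country-year (`year` comes back as character).
long <- table4a %>%
  pivot_longer(c(`1999`, `2000`), names_to = "year", values_to = "cases")

wide
long
```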
The data is a subset of the data contained in the World Health Organization Global Tuberculosis Report
https://www.who.int/teams/global-tuberculosis-programme/data
Performs the opposite operation to dplyr::count(), duplicating rows
according to a weighting variable (or expression).
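A short round trip illustrates the relationship with dplyr::count(). This sketch assumes dplyr is installed (it is a dependency of tidyr); the input tibble is invented for illustration.

```r
library(tidyr)

# Hypothetical data with a repeated value
df <- tibble(x = c("a", "b", "b"))

# Collapse to one row per value, with the frequency in `n`
counted <- dplyr::count(df, x)

# Expand back out: each row is repeated `n` times and, because
# .remove = TRUE by default, the `n` column is dropped
restored <- uncount(counted, n)
restored
```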
uncount(data, weights, ..., .remove = TRUE, .id = NULL)
data |
A data frame, tibble, or grouped tibble. |
weights |
A vector of weights. Evaluated in the context of |
... |
Additional arguments passed on to methods. |
.remove |
If |
.id |
Supply a string to create a new variable which gives a unique identifier for each created row. |
df <- tibble(x = c("a", "b"), n = c(1, 2)) uncount(df, n) uncount(df, n, .id = "id") # You can also use constants uncount(df, 2) # Or expressions uncount(df, 2 / n)
Convenience function to paste together multiple columns into one.
unite(data, col, ..., sep = "_", remove = TRUE, na.rm = FALSE)
data |
A data frame. |
col |
The name of the new column, as a string or symbol. This argument is passed by expression and supports
quasiquotation (you can unquote strings
and symbols). The name is captured from the expression with
|
... |
< |
sep |
Separator to use between values. |
remove |
If |
na.rm |
If |
separate(), the complement.
df <- expand_grid(x = c("a", NA), y = c("b", NA)) df df %>% unite("z", x:y, remove = FALSE) # To remove missing values: df %>% unite("z", x:y, na.rm = TRUE, remove = FALSE) # Separate is almost the complement of unite df %>% unite("xy", x:y) %>% separate(xy, c("x", "y")) # (but note `x` and `y` contain now "NA" not NA)
Unnest expands a list-column containing data frames into rows and columns.
unnest( data, cols, ..., keep_empty = FALSE, ptype = NULL, names_sep = NULL, names_repair = "check_unique", .drop = deprecated(), .id = deprecated(), .sep = deprecated(), .preserve = deprecated() )
data |
A data frame. |
cols |
< When selecting multiple columns, values from the same row will be recycled to their common size. |
... |
:
previously you could write |
keep_empty |
By default, you get one row of output for each element
of the list that you are unchopping/unnesting. This means that if there's a
size-0 element (like |
ptype |
Optionally, a named list of column name-prototype pairs to
coerce |
names_sep |
If |
names_repair |
Used to check that output data frame has valid names. Must be one of the following options:
See |
.drop , .preserve
|
:
all list-columns are now preserved; If there are any that you
don't want in the output use |
.id |
:
convert |
.sep |
tidyr 1.0.0 introduced a new syntax for nest() and unnest() that's
designed to be more similar to other functions. Converting to the new syntax
should be straightforward (guided by the message you'll receive) but if
you just need to run an old analysis, you can easily revert to the previous
behaviour using nest_legacy() and unnest_legacy() as follows:
library(tidyr) nest <- nest_legacy unnest <- unnest_legacy
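For comparison, a minimal sketch of the current (1.0.0 and later) calling style is given here. The tibble and the list-column name `data` are illustrative choices, not something prescribed by the text above.

```r
library(tidyr)

# Hypothetical data: two groups identified by `g`
df <- tibble(g = c(1, 1, 2), y = 1:3, z = c("a", "b", "c"))

# Current syntax: name the new list-column and select the columns
# that should be packed into it
nested <- df %>% nest(data = c(y, z))

# Current syntax: name the list-column to expand explicitly
roundtrip <- nested %>% unnest(data)

nested
roundtrip
```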
Other rectangling: hoist(), unnest_longer(), unnest_wider()
# unnest() is designed to work with lists of data frames df <- tibble( x = 1:3, y = list( NULL, tibble(a = 1, b = 2), tibble(a = 1:3, b = 3:1, c = 4) ) ) # unnest() recycles input rows for each row of the list-column # and adds a column for each column df %>% unnest(y) # input rows with 0 rows in the list-column will usually disappear, # but you can keep them (generating NAs) with keep_empty = TRUE: df %>% unnest(y, keep_empty = TRUE) # Multiple columns ---------------------------------------------------------- # You can unnest multiple columns simultaneously df <- tibble( x = 1:2, y = list( tibble(a = 1, b = 2), tibble(a = 3:4, b = 5:6) ), z = list( tibble(c = 1, d = 2), tibble(c = 3:4, d = 5:6) ) ) df %>% unnest(c(y, z)) # Compare with unnesting one column at a time, which generates # the Cartesian product df %>% unnest(y) %>% unnest(z)
unnest_longer() turns each element of a list-column into a row. It
is most naturally suited to list-columns where the elements are unnamed
and the length of each element varies from row to row.
unnest_longer() generally preserves the number of columns of x while
modifying the number of rows.
Learn more in vignette("rectangle").
unnest_longer( data, col, values_to = NULL, indices_to = NULL, indices_include = NULL, keep_empty = FALSE, names_repair = "check_unique", simplify = TRUE, ptype = NULL, transform = NULL )
data |
A data frame. |
col |
< When selecting multiple columns, values from the same row will be recycled to their common size. |
values_to |
A string giving the column name (or names) to store the
unnested values in. If multiple columns are specified in |
indices_to |
A string giving the column name (or names) to store the
inner names or positions (if not named) of the values. If multiple columns
are specified in |
indices_include |
A single logical value specifying whether or not to
add an index column. If any value has inner names, the index column will be
a character vector of those names, otherwise it will be an integer vector
of positions. If |
keep_empty |
By default, you get one row of output for each element
of the list that you are unchopping/unnesting. This means that if there's a
size-0 element (like |
names_repair |
Used to check that output data frame has valid names. Must be one of the following options:
See |
simplify |
If |
ptype |
Optionally, a named list of prototypes declaring the desired output type of each component. Alternatively, a single empty prototype can be supplied, which will be applied to all components. Use this argument if you want to check that each element has the type you expect when simplifying. If a |
transform |
Optionally, a named list of transformation functions applied to each component. Alternatively, a single function can be supplied, which will be applied to all components. Use this argument if you want to transform or parse individual elements as they are extracted. When both |
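The values_to and indices_to arguments described above control the names of the output columns. A small sketch, using an invented tibble and the arbitrary column names "value" and "name":

```r
library(tidyr)

# Hypothetical list-column of named numeric vectors
df <- tibble(
  x = 1:2,
  y = list(c(a = 1, b = 2), c(a = 10, b = 11, c = 12))
)

# Store the unnested values in `value` and the inner names in `name`,
# instead of the default `y` and `y_id`
out <- df %>%
  unnest_longer(y, values_to = "value", indices_to = "name")
out
```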
Other rectangling: hoist(), unnest(), unnest_wider()
# `unnest_longer()` is useful when each component of the list should # form a row df <- tibble( x = 1:4, y = list(NULL, 1:3, 4:5, integer()) ) df %>% unnest_longer(y) # Note that empty values like `NULL` and `integer()` are dropped by # default. If you'd like to keep them, set `keep_empty = TRUE`. df %>% unnest_longer(y, keep_empty = TRUE) # If the inner vectors are named, the names are copied to an `_id` column df <- tibble( x = 1:2, y = list(c(a = 1, b = 2), c(a = 10, b = 11, c = 12)) ) df %>% unnest_longer(y) # Multiple columns ---------------------------------------------------------- # If columns are aligned, you can unnest simultaneously df <- tibble( x = 1:2, y = list(1:2, 3:4), z = list(5:6, 7:8) ) df %>% unnest_longer(c(y, z)) # This is important because sequential unnesting would generate the # Cartesian product of the rows df %>% unnest_longer(y) %>% unnest_longer(z)
unnest_wider() turns each element of a list-column into a column. It
is most naturally suited to list-columns where every element is named,
and the names are consistent from row-to-row.
unnest_wider() preserves the rows of x while modifying the columns.
Learn more in vignette("rectangle").
unnest_wider( data, col, names_sep = NULL, simplify = TRUE, strict = FALSE, names_repair = "check_unique", ptype = NULL, transform = NULL )
data |
A data frame. |
col |
< When selecting multiple columns, values from the same row will be recycled to their common size. |
names_sep |
If any values being unnested are unnamed, then |
simplify |
If |
strict |
A single logical specifying whether or not to apply strict
vctrs typing rules. If |
names_repair |
Used to check that output data frame has valid names. Must be one of the following options:
See |
ptype |
Optionally, a named list of prototypes declaring the desired output type of each component. Alternatively, a single empty prototype can be supplied, which will be applied to all components. Use this argument if you want to check that each element has the type you expect when simplifying. If a |
transform |
Optionally, a named list of transformation functions applied to each component. Alternatively, a single function can be supplied, which will be applied to all components. Use this argument if you want to transform or parse individual elements as they are extracted. When both |
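As an illustration of the transform argument, the sketch below parses one component while it is extracted. The tibble and its component names (`a`, `b`) are hypothetical.

```r
library(tidyr)

# Hypothetical list-column where component `a` arrives as text
df <- tibble(
  id   = 1:2,
  meta = list(
    list(a = "1", b = "x"),
    list(a = "2", b = "y")
  )
)

# Apply as.integer() to every `a` element as it is extracted;
# `b` is left untouched because no function is supplied for it
out <- df %>%
  unnest_wider(meta, transform = list(a = as.integer))
out
```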
Other rectangling: hoist(), unnest(), unnest_longer()
df <- tibble( character = c("Toothless", "Dory"), metadata = list( list( species = "dragon", color = "black", films = c( "How to Train Your Dragon", "How to Train Your Dragon 2", "How to Train Your Dragon: The Hidden World" ) ), list( species = "blue tang", color = "blue", films = c("Finding Nemo", "Finding Dory") ) ) ) df # Turn all components of metadata into columns df %>% unnest_wider(metadata) # Choose not to simplify list-cols of length-1 elements df %>% unnest_wider(metadata, simplify = FALSE) df %>% unnest_wider(metadata, simplify = list(color = FALSE)) # You can also widen unnamed list-cols: df <- tibble( x = 1:3, y = list(NULL, 1:3, 4:5) ) # but you must supply `names_sep` to do so, which generates automatic names: df %>% unnest_wider(y, names_sep = "_") # 0-length elements --------------------------------------------------------- # The defaults of `unnest_wider()` treat empty types (like `list()`) as `NULL`. json <- list( list(x = 1:2, y = 1:2), list(x = list(), y = 3:4), list(x = 3L, y = list()) ) df <- tibble(json = json) df %>% unnest_wider(json) # To instead enforce strict vctrs typing rules, use `strict` df %>% unnest_wider(json, strict = TRUE)
Captured from the 2017 American Community Survey using the tidycensus package.
us_rent_income
A dataset with variables:
FIPS state identifier
Name of state
Variable name: income = median yearly income, rent = median monthly rent
Estimated value
90% margin of error
A subset of data from the World Health Organization Global Tuberculosis
Report, and accompanying global populations. who uses the original
codes from the World Health Organization. The column names for columns
5 through 60 are made by combining new_ with:
the method of diagnosis (rel = relapse, sn = negative pulmonary smear, sp = positive pulmonary smear, ep = extrapulmonary),
gender (f = female, m = male), and
age group (014 = 0-14 yrs of age, 1524 = 15-24, 2534 = 25-34, 3544 = 35-44 years of age, 4554 = 45-54, 5564 = 55-64, 65 = 65 years or older).
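The naming scheme above can be decoded with pivot_longer(). This sketch follows the approach used in the tidyr pivoting vignette; the output column names (`diagnosis`, `gender`, `age`, `count`) are choices made for illustration.

```r
library(tidyr)

# Lengthen the 56 case-count columns and split each column name into
# its three encoded parts. The optional "_" after "new" accounts for
# the "newrel" columns, which lack it in the original WHO codes.
long <- who %>%
  pivot_longer(
    cols = starts_with("new"),
    names_to = c("diagnosis", "gender", "age"),
    names_pattern = "new_?(.*)_(.)(.*)",
    values_to = "count"
  )
long
```

The age codes are kept as character strings here (for example "014"); converting them to readable labels is a separate step.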
who2 is a lightly modified version that makes teaching the basics
easier by tweaking the variables to be slightly more consistent and
dropping iso2 and iso3. newrel is replaced by new_rel, and a
_ is added after the gender.
who who2 population
who
A data frame with 7,240 rows and 60 columns:
Country name
2 & 3 letter ISO country codes
Year
Counts of new TB cases recorded by group. Column names encode three variables that describe the group.
who2
A data frame with 7,240 rows and 58 columns.
population
A data frame with 4,060 rows and three columns:
Country name
Year
Population
https://www.who.int/teams/global-tuberculosis-programme/data
Data about population from the World Bank.
world_bank_pop
A dataset with variables:
Three letter country code
Indicator name: SP.POP.GROW = population growth,
SP.POP.TOTL = total population, SP.URB.GROW = urban population
growth, SP.URB.TOTL = total urban population
Value for each year
Dataset from the World Bank data bank: https://data.worldbank.org