Title: | Read and Write Rectangular Text Data Quickly |
---|---|
Description: | The goal of 'vroom' is to read and write data (like 'csv', 'tsv' and 'fwf') quickly. When reading it uses a quick initial indexing step, then reads the values lazily, so only the data you actually use needs to be read. The writer formats the data in parallel and writes to disk asynchronously from formatting. |
Authors: | Jim Hester [aut], Hadley Wickham [aut], Jennifer Bryan [aut, cre], Shelby Bearrows [ctb], https://github.com/mandreyel/ [cph] (mio library), Jukka Jylänki [cph] (grisu3 implementation), Mikkel Jørgensen [cph] (grisu3 implementation), Posit Software, PBC [cph, fnd] |
Maintainer: | Jennifer Bryan |
License: | MIT + file LICENSE |
Version: | 1.6.5.9000 |
Built: | 2024-11-19 04:12:09 UTC |
Source: | https://github.com/tidyverse/vroom |
cols() includes all columns in the input data, guessing the column types by default.
cols_only() includes only the columns you explicitly specify, skipping the rest.
cols(..., .default = col_guess(), .delim = NULL)
cols_only(...)
col_logical(...)
col_integer(...)
col_big_integer(...)
col_double(...)
col_character(...)
col_skip(...)
col_number(...)
col_guess(...)
col_factor(levels = NULL, ordered = FALSE, include_na = FALSE, ...)
col_datetime(format = "", ...)
col_date(format = "", ...)
col_time(format = "", ...)
... |
Either column objects created by |
.default |
Any named columns not explicitly overridden in |
.delim |
The delimiter to use when parsing. If the |
levels |
Character vector of the allowed levels. When |
ordered |
Is it an ordered factor? |
include_na |
If |
format |
A format specification, as described below. If set to "",
date times are parsed as ISO8601, while dates and times use the date and
time formats specified in the locale(). Unlike strptime(), the format specification must match the complete string. |
The available specifications are: (long names in quotes and string abbreviations in brackets)
function | long name | short name | description |
---|---|---|---|
col_logical() | "logical" | "l" | Logical values containing only T, F, TRUE or FALSE. |
col_integer() | "integer" | "i" | Integer numbers. |
col_big_integer() | "big_integer" | "I" | Big integers (64-bit), requires the bit64 package. |
col_double() | "double", "numeric" | "d" | 64-bit double floating point numbers. |
col_character() | "character" | "c" | Character string data. |
col_factor(levels, ordered) | "factor" | "f" | A fixed set of values. |
col_date(format = "") | "date" | "D" | Calendar dates formatted with the locale's date_format. |
col_time(format = "") | "time" | "t" | Times formatted with the locale's time_format. |
col_datetime(format = "") | "datetime", "POSIXct" | "T" | ISO8601 date times. |
col_number() | "number" | "n" | Human readable numbers containing the grouping_mark. |
col_skip() | "skip", "NULL" | "_", "-" | Skip and don't import this column. |
col_guess() | "guess", "NA" | "?" | Parse using the "best" guessed type based on the input. |
cols(a = col_integer())
cols_only(a = col_integer())

# You can also use the standard abbreviations
cols(a = "i")
cols(a = "i", b = "d", c = "_")

# Or long names (like utils::read.csv)
cols(a = "integer", b = "double", c = "skip")

# You can also use multiple sets of column definitions by combining
# them like so:
t1 <- cols(
  column_one = col_integer(),
  column_two = col_number())
t2 <- cols(
  column_three = col_character())
t3 <- t1
t3$cols <- c(t1$cols, t2$cols)
t3
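The difference between cols() and cols_only() is easiest to see when the specification is actually applied to data. The following is a minimal sketch, assuming the vroom package is installed; the three-column input and the object names are made up for illustration.

```r
library(vroom)

# Hypothetical three-column input, supplied as literal data via I().
input <- I("a,b,c\n1,2.5,x\n3,4.5,y\n")

# cols(): every column is kept; `c` is not listed, so its type is guessed.
all_cols <- vroom(input, delim = ",", col_types = cols(a = "i", b = "d"))

# cols_only(): only the listed columns are read; `c` is skipped.
some_cols <- vroom(input, delim = ",", col_types = cols_only(a = "i", b = "d"))

names(all_cols)
names(some_cols)
```

Using cols_only() can be a convenient alternative to listing every unwanted column as col_skip().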
cols_condense() takes a spec object and condenses its definition by setting
the default column type to the most frequent type and only listing columns
with a different type.
spec() extracts the full column specification from a tibble created by vroom.
cols_condense(x)
spec(x)
x |
The data frame object to extract from |
A col_spec object.
df <- vroom(vroom_example("mtcars.csv"))
s <- spec(df)
s
cols_condense(s)
When parsing dates, you often need to know how weekdays and months are
represented as text. This pair of functions allows you to either create
your own, or retrieve from a standard list. The standard list is
derived from ICU (https://site.icu-project.org) via the stringi package.
date_names(mon, mon_ab = mon, day, day_ab = day, am_pm = c("AM", "PM"))
date_names_lang(language)
date_names_langs()
mon, mon_ab |
Full and abbreviated month names. |
day, day_ab |
Full and abbreviated week day names. Starts with Sunday. |
am_pm |
Names used for AM and PM. |
language |
A BCP 47 locale, made up of a language and a region,
e.g. |
date_names_lang("en")
date_names_lang("ko")
date_names_lang("fr")
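To show why the month names matter, here is a hedged sketch that parses a French-formatted date. It assumes vroom is installed and that the `%d %B %Y` format (day, full month name, year) is applied through col_date(); the input data and column names are invented for the example.

```r
library(vroom)

# Hypothetical input with a date written using a French month name.
input <- I("id,d\n1,1 janvier 2015\n")

# locale("fr") looks up French date names, so %B can match "janvier".
df <- vroom(
  input,
  delim = ",",
  col_types = cols(id = col_integer(), d = col_date(format = "%d %B %Y")),
  locale = locale("fr")
)

df$d
```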
gen_tbl() generates a random tibble. This is useful for benchmarking, and also for bug reports when you cannot share the real dataset.
gen_tbl( rows, cols = NULL, col_types = NULL, locale = default_locale(), missing = 0 )
rows |
Number of rows to generate |
cols |
Number of columns to generate, if |
col_types |
One of If Column specifications created by Alternatively, you can use a compact string representation where each character represents one column:
|
locale |
The locale controls defaults that vary from place to place.
The default locale is US-centric (like R), but you can use
|
missing |
The percentage (from 0 to 1) of missing data to use |
There is also a family of functions to generate individual vectors of each type.
# random 10 x 5 table with random column types
rand_tbl <- gen_tbl(10, 5)
rand_tbl

# all double 25 x 4 table
dbl_tbl <- gen_tbl(25, 4, col_types = "dddd")
dbl_tbl

# Use the dots in long form column types to change the random function and options
types <- rep(times = 4, list(col_double(f = stats::runif, min = -10, max = 25)))
types
dbl_tbl2 <- gen_tbl(25, 4, col_types = types)
dbl_tbl2
Generate individual vectors of the types supported by vroom
gen_character(n, min = 5, max = 25, values = c(letters, LETTERS, 0:9), ...)
gen_double(n, f = stats::rnorm, ...)
gen_number(n, f = stats::rnorm, ...)
gen_integer(n, min = 1L, max = .Machine$integer.max, prob = NULL, ...)
gen_factor(n, levels = NULL, ordered = FALSE, num_levels = gen_integer(1L, 1L, 25L), ...)
gen_time(n, min = 0, max = hms::hms(days = 1), fractional = FALSE, ...)
gen_date(n, min = as.Date("2001-01-01"), max = as.Date("2021-01-01"), ...)
gen_datetime(n, min = as.POSIXct("2001-01-01"), max = as.POSIXct("2021-01-01"), tz = "UTC", ...)
gen_logical(n, ...)
gen_name(n)
n |
The size of the vector to generate |
min |
The minimum range for the vector |
max |
The maximum range for the vector |
values |
The explicit values to use. |
... |
Additional arguments passed to internal generation functions |
f |
The random function to use. |
prob |
a vector of probability weights for obtaining the elements of the vector being sampled. |
levels |
The explicit levels to use, if |
ordered |
Should the factors be ordered factors? |
num_levels |
The number of factor levels to generate |
fractional |
Whether to generate times with fractional seconds |
tz |
The timezone to use for dates |
# characters
gen_character(4)

# factors
gen_factor(4)

# logical
gen_logical(4)

# numbers
gen_double(4)
gen_integer(4)

# temporal data
gen_time(4)
gen_date(4)
gen_datetime(4)
Guess the type of a vector
guess_type( x, na = c("", "NA"), locale = default_locale(), guess_integer = FALSE )
x |
Character vector of values to parse. |
na |
Character vector of strings to interpret as missing values. Set this
option to |
locale |
The locale controls defaults that vary from place to place.
The default locale is US-centric (like R), but you can use
|
guess_integer |
If |
# Logical vectors
guess_type(c("FALSE", "TRUE", "F", "T"))

# Integers and doubles
guess_type(c("1","2","3"))
guess_type(c("1.6","2.6","3.4"))

# Numbers containing grouping mark
guess_type("1,234,566")

# ISO 8601 date times
guess_type(c("2010-10-10"))
guess_type(c("2010-10-10 01:02:03"))
guess_type(c("01:02:03 AM"))
A locale object tries to capture all the defaults that can vary between
countries. You set the locale once, and the details are automatically
passed on down to the column parsers. The defaults have been chosen to
match R (i.e. US English) as closely as possible. See
vignette("locales") for more details.
locale( date_names = "en", date_format = "%AD", time_format = "%AT", decimal_mark = ".", grouping_mark = ",", tz = "UTC", encoding = "UTF-8" ) default_locale()
date_names |
Character representations of day and month names. Either
the language code as string (passed on to |
date_format, time_format |
Default date and time formats. |
decimal_mark, grouping_mark |
Symbols used to indicate the decimal
place, and to chunk larger numbers. Decimal mark can only be |
tz |
Default tz. This is used both for input (if the time zone isn't present in individual strings), and for output (to control the default display). The default is to use "UTC", a time zone that does not use daylight saving time (DST) and hence is typically most useful for data. The absence of time zones makes it approximately 50x faster to generate UTC times than any other time zone. Use For a complete list of possible time zones, see |
encoding |
Default encoding. |
locale()
locale("fr")

# South American locale
locale("es", decimal_mark = ",")
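A locale only has a visible effect once it is passed to a reader. The sketch below, which assumes vroom is installed and uses an invented semicolon-delimited input, reads numbers that use a comma as the decimal mark.

```r
library(vroom)

# Hypothetical input: ";" separates fields and "," is the decimal mark.
input <- I("x;y\n1,5;2,25\n")

df <- vroom(
  input,
  delim = ";",
  col_types = "dd",
  locale = locale(decimal_mark = ",")
)

df
```

With the default locale the same fields would not be read as the doubles 1.5 and 2.25, because "." is the default decimal mark.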
vroom will only fail to parse a file if the file is invalid in a way that is unrecoverable. However, there are a number of non-fatal problems that you might want to know about. You can retrieve a data frame of these problems with this function.
problems(x = .Last.value, lazy = FALSE)
x |
A data frame from |
lazy |
If |
A data frame with one row for each problem and five columns:
row, col - Row and column number that caused the problem, referencing the original input
expected - What vroom expected to find
actual - What it actually found
file - The file with the problem
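As a rough illustration of the return value described above, the following sketch forces a non-fatal parsing problem and then inspects it. It assumes vroom is installed; the input is invented, and the exact wording of the `expected` and `actual` values is not shown because it may differ between versions.

```r
library(vroom)

# Hypothetical input: "z" cannot be parsed as an integer.
input <- I("x,y\n1,a\n2,b\nz,c\n")

# The read still succeeds; the bad value becomes NA and a warning is raised,
# which is silenced here to keep the example output short.
df <- suppressWarnings(
  vroom(input, delim = ",", col_types = "ic", altrep = FALSE)
)

probs <- problems(df)
names(probs)
probs
```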
Read a delimited file into a tibble
vroom( file, delim = NULL, col_names = TRUE, col_types = NULL, col_select = NULL, id = NULL, skip = 0, n_max = Inf, na = c("", "NA"), quote = "\"", comment = "", skip_empty_rows = TRUE, trim_ws = TRUE, escape_double = TRUE, escape_backslash = FALSE, locale = default_locale(), guess_max = 100, altrep = TRUE, altrep_opts = deprecated(), num_threads = vroom_threads(), progress = vroom_progress(), show_col_types = NULL, .name_repair = "unique" )
file |
Either a path to a file, a connection, or literal data (either a
single string or a raw vector). Files ending in Literal data is most useful for examples and tests. To be recognised as
literal data, wrap the input with |
delim |
One or more characters used to delimit fields within a
file. If |
col_names |
Either If If Missing ( |
col_types |
One of If Column specifications created by Alternatively, you can use a compact string representation where each character represents one column:
|
col_select |
Columns to include in the results. You can use the same
mini-language as |
id |
Either a string or 'NULL'. If a string, the output will contain a variable with that name with the filename(s) as the value. If 'NULL', the default, no variable will be created. |
skip |
Number of lines to skip before reading data. If |
n_max |
Maximum number of lines to read. |
na |
Character vector of strings to interpret as missing values. Set this
option to |
quote |
Single character used to quote strings. |
comment |
A string used to identify comments. Any text after the comment characters will be silently ignored. |
skip_empty_rows |
Should blank rows be ignored altogether? i.e. If this
option is |
trim_ws |
Should leading and trailing whitespace (ASCII spaces and tabs) be trimmed from each field before parsing it? |
escape_double |
Does the file escape quotes by doubling them?
i.e. If this option is |
escape_backslash |
Does the file use backslashes to escape special
characters? This is more general than |
locale |
The locale controls defaults that vary from place to place.
The default locale is US-centric (like R), but you can use
|
guess_max |
Maximum number of lines to use for guessing column types.
See |
altrep |
Control which column types use Altrep representations,
either a character vector of types, |
altrep_opts |
|
num_threads |
Number of threads to use when reading and materializing vectors. If your data contains newlines within fields the parser will automatically be forced to use a single thread only. |
progress |
Display a progress bar? By default it will only display
in an interactive session and not while knitting a document. The automatic
progress bar can be disabled by setting option |
show_col_types |
Control showing the column specifications. If |
.name_repair |
Handling of column names. The default behaviour is to
ensure column names are
This argument is passed on as |
# get path to example file
input_file <- vroom_example("mtcars.csv")
input_file

# Input sources -------------------------------------------------------------
# Read from a path
vroom(input_file)
# You can also use paths directly
# vroom("mtcars.csv")

## Not run:
# Including remote paths
vroom("https://github.com/tidyverse/vroom/raw/main/inst/extdata/mtcars.csv")
## End(Not run)

# Or directly from a string with `I()`
vroom(I("x,y\n1,2\n3,4\n"))

# Column selection ----------------------------------------------------------
# Pass column names or indexes directly to select them
vroom(input_file, col_select = c(model, cyl, gear))
vroom(input_file, col_select = c(1, 3, 11))

# Or use the selection helpers
vroom(input_file, col_select = starts_with("d"))

# You can also rename specific columns
vroom(input_file, col_select = c(car = model, everything()))

# Column types --------------------------------------------------------------
# By default, vroom guesses the column types, looking at up to `guess_max`
# rows (100 by default) throughout the dataset.
# You can specify them explicitly with a compact specification:
vroom(I("x,y\n1,2\n3,4\n"), col_types = "dc")

# Or with a list of column types:
vroom(I("x,y\n1,2\n3,4\n"), col_types = list(col_double(), col_character()))

# File types ----------------------------------------------------------------
# csv
vroom(I("a,b\n1.0,2.0\n"), delim = ",")

# tsv
vroom(I("a\tb\n1.0\t2.0\n"))

# Other delimiters
vroom(I("a|b\n1.0|2.0\n"), delim = "|")

# Read datasets across multiple files ---------------------------------------
mtcars_by_cyl <- vroom_example(vroom_examples("mtcars-"))
mtcars_by_cyl

# Pass the filenames directly to vroom, they are efficiently combined
vroom(mtcars_by_cyl)

# If you need to extract data from the filenames, use `id` to request a
# column that reveals the underlying file path
dat <- vroom(mtcars_by_cyl, id = "source")
dat$source <- basename(dat$source)
dat
vroom_altrep() can be used directly as input to the altrep argument of vroom().
vroom_altrep(which = NULL)
which |
A character vector of column types to use Altrep for. Can also
take |
Alternatively there is also a family of environment variables to control use of
the Altrep framework. These can then be set in your .Renviron file, e.g.
with usethis::edit_r_environ(). For versions of R where the Altrep
framework is unavailable (R < 3.5.0) they are automatically turned off and
the variables have no effect. The variables can take one of true, false,
TRUE, FALSE, 1, or 0.
VROOM_USE_ALTREP_NUMERICS - If set, use Altrep for all numeric types (default false).
There are also individual variables for each type. Currently only
VROOM_USE_ALTREP_CHR defaults to true.
VROOM_USE_ALTREP_CHR
VROOM_USE_ALTREP_FCT
VROOM_USE_ALTREP_INT
VROOM_USE_ALTREP_BIG_INT
VROOM_USE_ALTREP_DBL
VROOM_USE_ALTREP_NUM
VROOM_USE_ALTREP_LGL
VROOM_USE_ALTREP_DTTM
VROOM_USE_ALTREP_DATE
VROOM_USE_ALTREP_TIME
vroom_altrep()
vroom_altrep(c("chr", "fct", "int"))
vroom_altrep(TRUE)
vroom_altrep(FALSE)
This function is deprecated in favor of vroom_altrep().
vroom_altrep_opts(which = NULL)
which |
A character vector of column types to use Altrep for. Can also
take |
vroom comes bundled with a number of sample files in
its 'inst/extdata' directory. Use vroom_examples() to list all the
available examples and vroom_example() to retrieve the path to one
example.
vroom_example(path)
vroom_examples(pattern = NULL)
path |
Name of file. |
pattern |
A regular expression of filenames to match. If |
# List all available examples
vroom_examples()

# Get path to one example
vroom_example("mtcars.csv")
This is equivalent to vroom_write(), but instead of writing to
disk, it returns a string. It is primarily useful for examples and for
testing.
vroom_format( x, delim = "\t", eol = "\n", na = "NA", col_names = TRUE, escape = c("double", "backslash", "none"), quote = c("needed", "all", "none"), bom = FALSE, num_threads = vroom_threads() )
x |
A data frame or tibble to write to disk. |
delim |
Delimiter used to separate values. Defaults to |
eol |
The end of line character to use. Most commonly either |
na |
String used for missing values. Defaults to 'NA'. |
col_names |
If |
escape |
The type of escape to use when quotes are in the data.
|
quote |
How to handle fields which contain characters that need to be quoted.
|
bom |
If |
num_threads |
Number of threads to use when reading and materializing vectors. If your data contains newlines within fields the parser will automatically be forced to use a single thread only. |
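The following is a small sketch of vroom_format() returning the delimited text as a string rather than writing a file. It assumes vroom is installed; the data frame is made up for the example.

```r
library(vroom)

# Hypothetical two-column data frame.
df <- data.frame(x = 1:2, y = c("a", "b"), stringsAsFactors = FALSE)

# Format as comma-separated text instead of the tab-delimited default.
csv_text <- vroom_format(df, delim = ",")

cat(csv_text)
```

Because the result is an ordinary string, it can be compared directly in a unit test or passed back to vroom() wrapped in I().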
Read a fixed width file into a tibble
vroom_fwf( file, col_positions = fwf_empty(file, skip, n = guess_max), col_types = NULL, col_select = NULL, id = NULL, locale = default_locale(), na = c("", "NA"), comment = "", skip_empty_rows = TRUE, trim_ws = TRUE, skip = 0, n_max = Inf, guess_max = 100, altrep = TRUE, altrep_opts = deprecated(), num_threads = vroom_threads(), progress = vroom_progress(), show_col_types = NULL, .name_repair = "unique" )
fwf_empty(file, skip = 0, col_names = NULL, comment = "", n = 100L)
fwf_widths(widths, col_names = NULL)
fwf_positions(start, end = NULL, col_names = NULL)
fwf_cols(...)
file |
Either a path to a file, a connection, or literal data (either a single string or a raw vector). Files ending in Literal data is most useful for examples and tests. To be recognised as
literal data, the input must be either wrapped with Using a value of |
col_positions |
Column positions, as created by |
col_types |
One of If Column specifications created by Alternatively, you can use a compact string representation where each character represents one column:
By default, reading a file without a column specification will print a
message showing what |
col_select |
Columns to include in the results. You can use the same
mini-language as |
id |
The name of a column in which to store the file path. This is
useful when reading multiple input files and there is data in the file
paths, such as the data collection date. If |
locale |
The locale controls defaults that vary from place to place.
The default locale is US-centric (like R), but you can use
|
na |
Character vector of strings to interpret as missing values. Set this
option to |
comment |
A string used to identify comments. Any text after the comment characters will be silently ignored. |
skip_empty_rows |
Should blank rows be ignored altogether? i.e. If this
option is |
trim_ws |
Should leading and trailing whitespace (ASCII spaces and tabs) be trimmed from each field before parsing it? |
skip |
Number of lines to skip before reading data. |
n_max |
Maximum number of lines to read. |
guess_max |
Maximum number of lines to use for guessing column types.
Will never use more than the number of lines read.
See |
altrep |
Control which column types use Altrep representations,
either a character vector of types, |
altrep_opts |
|
num_threads |
The number of processing threads to use for initial
parsing and lazy reading of data. If your data contains newlines within
fields the parser should automatically detect this and fall back to using
one thread only. However if you know your file has newlines within quoted
fields it is safest to set |
progress |
Display a progress bar? By default it will only display
in an interactive session and not while knitting a document. The automatic
progress bar can be disabled by setting option |
show_col_types |
If |
.name_repair |
Handling of column names. The default behaviour is to
ensure column names are
This argument is passed on as |
col_names |
Either NULL, or a character vector column names. |
n |
Number of lines the tokenizer will read to determine file structure. By default it is set to 100. |
widths |
Width of each field. Use NA as width of last field when reading a ragged fwf file. |
start, end |
Starting and ending (inclusive) positions of each field. Use NA as last end field when reading a ragged fwf file. |
... |
If the first element is a data frame,
then it must have all numeric columns and either one or two rows.
The column names are the variable names. The column values are the
variable widths if a length one vector, and if length two, variable start and end
positions. The elements of |
Note: fwf_empty() cannot take an R connection such as a URL as input, as
this would result in reading from the connection twice. In these cases it is
better to download the file first before reading.
fwf_sample <- vroom_example("fwf-sample.txt")
writeLines(vroom_lines(fwf_sample))

# You can specify column positions in several ways:
# 1. Guess based on position of empty columns
vroom_fwf(fwf_sample, fwf_empty(fwf_sample, col_names = c("first", "last", "state", "ssn")))

# 2. A vector of field widths
vroom_fwf(fwf_sample, fwf_widths(c(20, 10, 12), c("name", "state", "ssn")))

# 3. Paired vectors of start and end positions
vroom_fwf(fwf_sample, fwf_positions(c(1, 30), c(20, 42), c("name", "ssn")))

# 4. Named arguments with start and end positions
vroom_fwf(fwf_sample, fwf_cols(name = c(1, 20), ssn = c(30, 42)))

# 5. Named arguments with column widths
vroom_fwf(fwf_sample, fwf_cols(name = 20, state = 10, ssn = 12))
vroom_lines() is similar to readLines(), however it reads the lines
lazily like vroom(), so operations like length(), head(), tail()
and sample() can be done much more efficiently without reading all the data into R.
vroom_lines( file, n_max = Inf, skip = 0, na = character(), skip_empty_rows = FALSE, locale = default_locale(), altrep = TRUE, altrep_opts = deprecated(), num_threads = vroom_threads(), progress = vroom_progress() )
file |
Either a path to a file, a connection, or literal data (either a
single string or a raw vector). Files ending in Literal data is most useful for examples and tests. To be recognised as
literal data, wrap the input with |
n_max |
Maximum number of lines to read. |
skip |
Number of lines to skip before reading data. If |
na |
Character vector of strings to interpret as missing values. Set this
option to |
skip_empty_rows |
Should blank rows be ignored altogether? i.e. If this
option is |
locale |
The locale controls defaults that vary from place to place.
The default locale is US-centric (like R), but you can use
|
altrep |
Control which column types use Altrep representations,
either a character vector of types, |
altrep_opts |
|
num_threads |
Number of threads to use when reading and materializing vectors. If your data contains newlines within fields the parser will automatically be forced to use a single thread only. |
progress |
Display a progress bar? By default it will only display
in an interactive session and not while knitting a document. The automatic
progress bar can be disabled by setting option |
lines <- vroom_lines(vroom_example("mtcars.csv"))
length(lines)
head(lines, n = 2)
tail(lines, n = 2)
sample(lines, size = 2)
By default, vroom shows progress bars. However, progress reporting is suppressed if any of the following conditions hold:
The bar is explicitly disabled by setting the environment variable VROOM_SHOW_PROGRESS to "false".
The code is run in a non-interactive session, as determined by rlang::is_interactive().
The code is run in an RStudio notebook chunk, as determined by getOption("rstudio.notebook.executing").
vroom_progress()
vroom_progress()
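The first condition above can be checked directly. This is a minimal sketch, assuming vroom is installed, that temporarily sets VROOM_SHOW_PROGRESS and then restores whatever value was there before; the variable names `old` and `progress_enabled` are only for illustration.

```r
library(vroom)

# Remember the current setting so it can be restored afterwards.
old <- Sys.getenv("VROOM_SHOW_PROGRESS", unset = NA)

# Explicitly disable progress bars for this session.
Sys.setenv(VROOM_SHOW_PROGRESS = "false")
progress_enabled <- vroom_progress()
progress_enabled

# Restore the previous state of the environment variable.
if (is.na(old)) {
  Sys.unsetenv("VROOM_SHOW_PROGRESS")
} else {
  Sys.setenv(VROOM_SHOW_PROGRESS = old)
}
```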
Similar to str() but with more information for Altrep objects.
vroom_str(x)
x |
a vector |
# when used on non-altrep objects altrep will always be false
vroom_str(mtcars)

mt <- vroom(vroom_example("mtcars.csv"), ",", altrep = c("chr", "dbl"))
vroom_str(mt)
Write a data frame to a delimited file
vroom_write( x, file, delim = "\t", eol = "\n", na = "NA", col_names = !append, append = FALSE, quote = c("needed", "all", "none"), escape = c("double", "backslash", "none"), bom = FALSE, num_threads = vroom_threads(), progress = vroom_progress(), path = deprecated() )
x |
A data frame or tibble to write to disk. |
file |
File or connection to write to. |
delim |
Delimiter used to separate values. Defaults to |
eol |
The end of line character to use. Most commonly either |
na |
String used for missing values. Defaults to 'NA'. |
col_names |
If |
append |
If |
quote |
How to handle fields which contain characters that need to be quoted.
|
escape |
The type of escape to use when quotes are in the data.
|
bom |
If |
num_threads |
Number of threads to use when reading and materializing vectors. If your data contains newlines within fields the parser will automatically be forced to use a single thread only. |
progress |
Display a progress bar? By default it will only display
in an interactive session and not while knitting a document. The display
is updated every 50,000 values and will only display if estimated reading
time is 5 seconds or more. The automatic progress bar can be disabled by
setting option |
path |
# If you only specify a file name, vroom_write() will write
# the file to your current working directory.
out_file <- tempfile(fileext = ".csv")
vroom_write(mtcars, out_file, ",")

# You can also use a literal filename
# vroom_write(mtcars, "mtcars.tsv")

# If you add a compression extension to the file name, vroom_write() will
# automatically compress the output.
# vroom_write(mtcars, "mtcars.tsv.gz")
# vroom_write(mtcars, "mtcars.tsv.bz2")
# vroom_write(mtcars, "mtcars.tsv.xz")
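A quick way to sanity-check a write is to read the file back. The sketch below assumes vroom is installed and uses a made-up data frame; `altrep = FALSE` is passed so the values are fully read before the temporary file is deleted.

```r
library(vroom)

# Hypothetical data frame to round-trip through a CSV file.
df <- data.frame(id = 1:3, label = c("a", "b", "c"), stringsAsFactors = FALSE)

out_file <- tempfile(fileext = ".csv")
vroom_write(df, out_file, delim = ",")

# Read the file back with explicit column types (integer, character).
back <- vroom(out_file, delim = ",", col_types = "ic", altrep = FALSE)
unlink(out_file)

back
```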
Write lines to a file
vroom_write_lines( x, file, eol = "\n", na = "NA", append = FALSE, num_threads = vroom_threads() )
x |
A character vector. |
file |
File or connection to write to. |
eol |
The end of line character to use. Most commonly either |
na |
String used for missing values. Defaults to 'NA'. |
append |
If |
num_threads |
Number of threads to use when reading and materializing vectors. If your data contains newlines within fields the parser will automatically be forced to use a single thread only. |
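As an illustration of the arguments above, this sketch writes a character vector to a temporary file and then appends one more line. It assumes vroom is installed; the file name and line contents are invented.

```r
library(vroom)

out_file <- tempfile(fileext = ".txt")

# Write two lines, then append a third to the same file.
vroom_write_lines(c("first", "second"), out_file)
vroom_write_lines("third", out_file, append = TRUE)

# Read the result back with base R to confirm what was written.
written <- readLines(out_file)
unlink(out_file)

written
```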