This vignette demonstrates how to use duckplyr selectively, for individual data frames or for other packages.
library(conflicted)
library(dplyr)
conflict_prefer("filter", "dplyr")
#> [conflicted] Will prefer dplyr::filter over any other package.
To enable duckplyr for individual data frames instead of session-wide,
library()
.duckplyr::as_duckdb_tibble()
as the first step in
your pipe, without attaching the package.lazy <-
duckplyr::flights_df() |>
duckplyr::as_duckdb_tibble() |>
filter(!is.na(arr_delay), !is.na(dep_delay)) |>
mutate(inflight_delay = arr_delay - dep_delay) |>
summarize(
.by = c(year, month),
mean_inflight_delay = mean(inflight_delay),
median_inflight_delay = median(inflight_delay),
) |>
filter(month <= 6)
The result is a tibble, with its own class.
class(lazy)
#> [1] "duckplyr_df" "tbl_df" "tbl" "data.frame"
names(lazy)
#> [1] "year" "month" "mean_inflight_delay"
#> [4] "median_inflight_delay"
DuckDB is responsible for eventually carrying out the operations. Despite the filter coming very late in the pipeline, it is applied to the raw data.
lazy |>
explain()
#> ┌---------------------------┐
#> │ PROJECTION │
#> │ -------------------- │
#> │ year │
#> │ month │
#> │ mean_inflight_delay │
#> │ median_inflight_delay │
#> │ │
#> │ ~33677 Rows │
#> └-------------┬-------------┘
#> ┌-------------┴-------------┐
#> │ HASH_GROUP_BY │
#> │ -------------------- │
#> │ Groups: │
#> │ #0 │
#> │ #1 │
#> │ │
#> │ Aggregates: │
#> │ sum_no_overflow(#2) │
#> │ avg(#3) │
#> │ quantile_cont(#4) │
#> │ │
#> │ ~33677 Rows │
#> └-------------┬-------------┘
#> ┌-------------┴-------------┐
#> │ PROJECTION │
#> │ -------------------- │
#> │ year │
#> │ month │
#> │ CASE WHEN ( │
#> │ (inflight_delay IS NULL)) │
#> │ THEN (1) ELSE 0 END │
#> │ inflight_delay │
#> │ inflight_delay │
#> │ │
#> │ ~67355 Rows │
#> └-------------┬-------------┘
#> ┌-------------┴-------------┐
#> │ PROJECTION │
#> │ -------------------- │
#> │ year │
#> │ month │
#> │ inflight_delay │
#> │ │
#> │ ~67355 Rows │
#> └-------------┬-------------┘
#> ┌-------------┴-------------┐
#> │ FILTER │
#> │ -------------------- │
#> │ ((CAST(month AS DOUBLE) <=│
#> │ 6.0) AND (NOT (arr_delay │
#> │ IS NULL)) AND (NOT │
#> │ (dep_delay IS NULL))) │
#> │ │
#> │ ~67355 Rows │
#> └-------------┬-------------┘
#> ┌-------------┴-------------┐
#> │ R_DATAFRAME_SCAN │
#> │ -------------------- │
#> │ data.frame │
#> │ │
#> │ Projections: │
#> │ year │
#> │ month │
#> │ dep_delay │
#> │ arr_delay │
#> │ │
#> │ ~336776 Rows │
#> └---------------------------┘
All data frame operations are supported. Computation happens upon the first request.
After the computation has been carried out, the results are preserved and available immediately:
Construct duckplyr frames directly with
duckplyr::duckdb_tibble()
:
Like other dependencies, duckplyr must be declared in the
DESCRIPTION
file and optionally imported in the
NAMESPACE
file. Because duckplyr does not import dplyr, it
is necessary to import both packages. The recipe below shows how to
achieve this with the usethis package.
usethis::use_package("dplyr")
usethis::use_package("duckplyr")
data |> duckplyr::as_duckdb_tibble() |> dplyr::filter(...)
as_duckdb_tibble()
or filter()
:
usethis::use_import_from("duckplyr", "as_duckdb_tibble")
usethis::use_import_from("dplyr", "filter")
Learn more about usethis at https://usethis.r-lib.org/.
The default mode for as_duckdb_tibble()
and
duckdb_tibble()
is lavish. This means that the dplyr
operations are carried out by DuckDB when possible, and also available
as data frames upon first request. Use
as_duckdb_tibble(prudence = "frugal")
or
duckdb_tibble(.prudence = "frugal")
to avoid materializing
intermediate data and to ensure that all operations are carried out by
DuckDB or fail. Prudence can also limit the number of rows or cells that
are materialized:
data <- duckplyr::duckdb_tibble(x = 1:5, .prudence = c(rows = 3))
data
#> # A duckplyr data frame: 1 variable
#> x
#> <int>
#> 1 1
#> 2 2
#> 3 3
#> 4 4
#> 5 5
nrow(data)
#> Error: Materialization would result in 4 rows, which exceeds the limit of 3. Use collect() or as_tibble() to materialize.
Learn more about prudence in vignette("prudence")
, about
fallbacks to dplyr in vignette("fallback")
, and about the
translation employed by duckplyr in vignette("limits")
.