Selective use of duckplyr

This vignette demonstrates how to use duckplyr selectively, for individual data frames or for other packages.

library(conflicted)
library(dplyr)
conflict_prefer("filter", "dplyr")
#> [conflicted] Will prefer dplyr::filter over any other package.

External data frame

To enable duckplyr for individual data frames instead of session-wide,

  • do not load duckplyr with library().
  • use duckplyr::as_duckdb_tibble() as the first step in your pipe, without attaching the package.
lazy <-
  duckplyr::flights_df() |>
  duckplyr::as_duckdb_tibble() |>
  filter(!is.na(arr_delay), !is.na(dep_delay)) |>
  mutate(inflight_delay = arr_delay - dep_delay) |>
  summarize(
    .by = c(year, month),
    mean_inflight_delay = mean(inflight_delay),
    median_inflight_delay = median(inflight_delay),
  ) |>
  filter(month <= 6)

The result is a tibble, with its own class.

class(lazy)
#> [1] "duckplyr_df" "tbl_df"      "tbl"         "data.frame"
names(lazy)
#> [1] "year"                  "month"                 "mean_inflight_delay"  
#> [4] "median_inflight_delay"

DuckDB is responsible for eventually carrying out the operations. Despite the filter coming very late in the pipeline, it is applied to the raw data.

lazy |>
  explain()
#> ┌---------------------------┐
#> │         PROJECTION        │
#> │    --------------------   │
#> │            year           │
#> │           month           │
#> │    mean_inflight_delay    │
#> │   median_inflight_delay   │
#> │                           │
#> │        ~33677 Rows        │
#> └-------------┬-------------┘
#> ┌-------------┴-------------┐
#> │       HASH_GROUP_BY       │
#> │    --------------------   │
#> │          Groups:          │
#> │             #0            │
#> │             #1            │
#> │                           │
#> │        Aggregates:        │
#> │    sum_no_overflow(#2)    │
#> │          avg(#3)          │
#> │     quantile_cont(#4)     │
#> │                           │
#> │        ~33677 Rows        │
#> └-------------┬-------------┘
#> ┌-------------┴-------------┐
#> │         PROJECTION        │
#> │    --------------------   │
#> │            year           │
#> │           month           │
#> │        CASE  WHEN (       │
#> │ (inflight_delay IS NULL)) │
#> │     THEN (1) ELSE 0 END
#> │       inflight_delay      │
#> │       inflight_delay      │
#> │                           │
#> │        ~67355 Rows        │
#> └-------------┬-------------┘
#> ┌-------------┴-------------┐
#> │         PROJECTION        │
#> │    --------------------   │
#> │            year           │
#> │           month           │
#> │       inflight_delay      │
#> │                           │
#> │        ~67355 Rows        │
#> └-------------┬-------------┘
#> ┌-------------┴-------------┐
#> │           FILTER          │
#> │    --------------------   │
#> │ ((CAST(month AS DOUBLE) <=│
#> │  6.0) AND (NOT (arr_delay │
#> │     IS NULL)) AND (NOT    │
#> │   (dep_delay IS NULL)))   │
#> │                           │
#> │        ~67355 Rows        │
#> └-------------┬-------------┘
#> ┌-------------┴-------------┐
#> │     R_DATAFRAME_SCAN      │
#> │    --------------------   │
#> │         data.frame        │
#> │                           │
#> │        Projections:       │
#> │            year           │
#> │           month           │
#> │         dep_delay         │
#> │         arr_delay         │
#> │                           │
#> │        ~336776 Rows       │
#> └---------------------------┘

All data frame operations are supported. Computation happens upon the first request.

lazy$mean_inflight_delay
#> [1] -7.356713 -2.673124 -4.244284 -3.855519 -5.147220 -9.370201

After the computation has been carried out, the results are preserved and available immediately:

lazy
#> # A duckplyr data frame: 4 variables
#>    year month mean_inflight_delay median_inflight_delay
#>   <int> <int>               <dbl>                 <dbl>
#> 1  2013     3               -7.36                    -9
#> 2  2013     4               -2.67                    -5
#> 3  2013     6               -4.24                    -7
#> 4  2013     1               -3.86                    -5
#> 5  2013     2               -5.15                    -6
#> 6  2013     5               -9.37                   -10

Own data

Construct duckplyr frames directly with duckplyr::duckdb_tibble():

data <- duckplyr::duckdb_tibble(
  x = 1:10,
  y = 5,
  z = letters[1:10]
)
data
#> # A duckplyr data frame: 3 variables
#>        x     y z    
#>    <int> <dbl> <chr>
#>  1     1     5 a    
#>  2     2     5 b    
#>  3     3     5 c    
#>  4     4     5 d    
#>  5     5     5 e    
#>  6     6     5 f    
#>  7     7     5 g    
#>  8     8     5 h    
#>  9     9     5 i    
#> 10    10     5 j

In other packages

Like other dependencies, duckplyr must be declared in the DESCRIPTION file and optionally imported in the NAMESPACE file. Because duckplyr does not import dplyr, it is necessary to import both packages. The recipe below shows how to achieve this with the usethis package.

  • Add dplyr as a dependency with usethis::use_package("dplyr")
  • Add duckplyr as a dependency with usethis::use_package("duckplyr")
  • In your code, use a pattern like data |> duckplyr::as_duckdb_tibble() |> dplyr::filter(...)
  • To avoid the package prefix and simply write as_duckdb_tibble() or filter():
    • Import the duckplyr function with usethis::use_import_from("duckplyr", "as_duckdb_tibble")
    • Import the dplyr function with usethis::use_import_from("dplyr", "filter")

Learn more about usethis at https://usethis.r-lib.org/.

Prudence

The default mode for as_duckdb_tibble() and duckdb_tibble() is lavish. This means that the dplyr operations are carried out by DuckDB when possible, and also available as data frames upon first request. Use as_duckdb_tibble(prudence = "frugal") or duckdb_tibble(.prudence = "frugal") to avoid materializing intermediate data and to ensure that all operations are carried out by DuckDB or fail. Prudence can also limit the number of rows or cells that are materialized:

data <- duckplyr::duckdb_tibble(x = 1:5, .prudence = c(rows = 3))
data
#> # A duckplyr data frame: 1 variable
#>       x
#>   <int>
#> 1     1
#> 2     2
#> 3     3
#> 4     4
#> 5     5
nrow(data)
#> Error: Materialization would result in 4 rows, which exceeds the limit of 3. Use collect() or as_tibble() to materialize.

Learn more about prudence in vignette("prudence"), about fallbacks to dplyr in vignette("fallback"), and about the translation employed by duckplyr in vignette("limits").