Memory protection: Prudence

This vignette discusses eager and lazy computation, and prudence.

library(conflicted)
library(dplyr)
conflict_prefer("filter", "dplyr")
#> [conflicted] Removing existing preference.
#> [conflicted] Will prefer dplyr::filter over any other package.

Introduction

Data frames backed by duckplyr, with class "duckplyr_df", behave as regular data frames in almost all respects. In particular, direct column access like df$x, or retrieving the number of rows with nrow(), works identically. Conceptually, duckplyr frames are “eager”: from a user’s perspective, they behave like regular data frames.

df <-
  duckplyr::duckdb_tibble(x = 1:5) |>
  mutate(y = x + 1)
df
#> # A duckplyr data frame: 2 variables
#>       x     y
#>   <int> <dbl>
#> 1     1     2
#> 2     2     3
#> 3     3     4
#> 4     4     5
#> 5     5     6
class(df)
#> [1] "duckplyr_df" "tbl_df"      "tbl"         "data.frame"
df$y
#> [1] 2 3 4 5 6
nrow(df)
#> [1] 5

Under the hood, two key differences provide improved performance and usability: lazy materialization and prudence.

Eager and lazy computation

For a duckplyr frame that is the result of a dplyr operation, accessing column data or retrieving the number of rows will trigger a computation that is carried out by DuckDB, not dplyr. In this sense, duckplyr frames are also “lazy”: the computation is deferred until the last possible moment, allowing DuckDB to optimize the whole pipeline.

Example

This is illustrated by the following example, which computes arrival delay statistics for flights departing from Newark airport (EWR) by month:

flights <- duckplyr::flights_df()

flights_duckdb <-
  flights |>
  duckplyr::as_duckdb_tibble()

system.time(
  mean_arr_delay_ewr <-
    flights_duckdb |>
    filter(origin == "EWR", !is.na(arr_delay)) |>
    summarize(
      .by = month,
      mean_arr_delay = mean(arr_delay),
      min_arr_delay = min(arr_delay),
      max_arr_delay = max(arr_delay),
      median_arr_delay = median(arr_delay),
    )
)
#>    user  system elapsed 
#>   0.005   0.000   0.005

Setting up the pipeline is fast: the size of the data does not affect the setup cost. Because the computation is deferred, DuckDB can optimize the whole pipeline, as the output below shows:

mean_arr_delay_ewr |>
  explain()
#> ┌---------------------------┐
#> │       HASH_GROUP_BY       │
#> │    --------------------   │
#> │         Groups: #0        │
#> │                           │
#> │        Aggregates:        │
#> │          mean(#1)         │
#> │          min(#2)          │
#> │          max(#3)          │
#> │         median(#4)        │
#> │                           │
#> │        ~33677 Rows        │
#> └-------------┬-------------┘
#> ┌-------------┴-------------┐
#> │         PROJECTION        │
#> │    --------------------   │
#> │           month           │
#> │         arr_delay         │
#> │         arr_delay         │
#> │         arr_delay         │
#> │         arr_delay         │
#> │                           │
#> │        ~67355 Rows        │
#> └-------------┬-------------┘
#> ┌-------------┴-------------┐
#> │         PROJECTION        │
#> │    --------------------   │
#> │           month           │
#> │         arr_delay         │
#> │                           │
#> │        ~67355 Rows        │
#> └-------------┬-------------┘
#> ┌-------------┴-------------┐
#> │           FILTER          │
#> │    --------------------   │
#> │ ((NOT (arr_delay IS NULL))│
#> │    AND (origin = 'EWR'))  │
#> │                           │
#> │        ~67355 Rows        │
#> └-------------┬-------------┘
#> ┌-------------┴-------------┐
#> │     R_DATAFRAME_SCAN      │
#> │    --------------------   │
#> │         data.frame        │
#> │                           │
#> │        Projections:       │
#> │           month           │
#> │         arr_delay         │
#> │           origin          │
#> │                           │
#> │        ~336776 Rows       │
#> └---------------------------┘

The first step is to prune the unneeded columns: only origin, month, and arr_delay are kept. The result becomes available when accessed:

system.time(mean_arr_delay_ewr$mean_arr_delay[[1]])
#>    user  system elapsed 
#>   0.020   0.000   0.017
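Because the materialized result is cached, repeating the access does not trigger the DuckDB pipeline again. A minimal sketch (timings will vary by machine):

```r
# The first access above materialized and cached the result,
# so this second access returns almost instantly.
system.time(mean_arr_delay_ewr$mean_arr_delay[[1]])
```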

Comparison

The functionality is similar to lazy tables in dbplyr and lazy frames in dtplyr. However, the behavior differs: at the time of writing, the internal structure of a lazy table or frame is not that of a data frame, and columns cannot be accessed directly.

           Eager 😃   Lazy 😴
dplyr         ✅
dbplyr                   ✅
dtplyr                   ✅
duckplyr      ✅         ✅

In contrast, with dplyr, each intermediate step, and also the final result, is a proper data frame, computed right away, forfeiting the opportunity for optimization:

system.time(
  flights |>
    filter(origin == "EWR", !is.na(arr_delay)) |>
    summarize(
      .by = month,
      mean_arr_delay = mean(arr_delay),
      min_arr_delay = min(arr_delay),
      max_arr_delay = max(arr_delay),
      median_arr_delay = median(arr_delay),
    )
)
#>    user  system elapsed 
#>   0.006   0.000   0.005

Prudence

Being both “eager” and “lazy” at the same time introduces a challenge: it is too easy to accidentally trigger computation, which may be prohibitive if an intermediate result is too large. This is where prudence comes in.

Concept

For lavish duckplyr frames, as in the two previous examples, the underlying DuckDB computation is carried out upon the first request. Once the results are computed, they are cached, and subsequent requests are fast. This is a good choice for small to medium-sized data, where DuckDB can provide a nice speedup but materializing the data is affordable at any stage. This is the default for duckdb_tibble() and as_duckdb_tibble().

For frugal duckplyr frames, accessing a column or requesting the number of rows triggers an error. This is a good choice for large data sets where the cost of materializing the data may be prohibitive due to size or computation time, and the user wants to control when the computation is carried out.

Example

The example below demonstrates the use of frugal duckplyr frames.

flights_frugal <-
  flights |>
  duckplyr::as_duckdb_tibble(prudence = "frugal")

In this example, flights_frugal is a frugal duckplyr frame. The data can be displayed, and column names and types can be accessed.

flights_frugal
#> # A duckplyr data frame: 19 variables
#>     year month   day dep_time sched_dep_time dep_delay arr_time sched_arr_time
#>    <int> <int> <int>    <int>          <int>     <dbl>    <int>          <int>
#>  1  2013     1     1      517            515         2      830            819
#>  2  2013     1     1      533            529         4      850            830
#>  3  2013     1     1      542            540         2      923            850
#>  4  2013     1     1      544            545        -1     1004           1022
#>  5  2013     1     1      554            600        -6      812            837
#>  6  2013     1     1      554            558        -4      740            728
#>  7  2013     1     1      555            600        -5      913            854
#>  8  2013     1     1      557            600        -3      709            723
#>  9  2013     1     1      557            600        -3      838            846
#> 10  2013     1     1      558            600        -2      753            745
#> # ℹ more rows
#> # ℹ 11 more variables: arr_delay <dbl>, carrier <chr>, flight <int>,
#> #   tailnum <chr>, origin <chr>, dest <chr>, air_time <dbl>, distance <dbl>,
#> #   hour <dbl>, minute <dbl>, time_hour <dttm>
names(flights_frugal)[1:10]
#>  [1] "year"           "month"          "day"            "dep_time"      
#>  [5] "sched_dep_time" "dep_delay"      "arr_time"       "sched_arr_time"
#>  [9] "arr_delay"      "carrier"
class(flights_frugal)
#> [1] "prudent_duckplyr_df" "duckplyr_df"         "tbl_df"             
#> [4] "tbl"                 "data.frame"
class(flights_frugal[[1]])
#> [1] "integer"

On the other hand, accessing a column or requesting the number of rows triggers an error:

nrow(flights_frugal)
#> Error: Materialization is disabled, use collect() or as_tibble() to materialize.
flights_frugal[[1]]
#> Error: Materialization is disabled, use collect() or as_tibble() to materialize.
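With a frugal frame, materialization has to be requested explicitly. A sketch using collect(), assuming flights_frugal from above:

```r
# collect() materializes the frugal frame into a plain tibble;
# afterwards, nrow() and column access work as usual.
flights_collected <- collect(flights_frugal)
nrow(flights_collected)
#> [1] 336776
```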

Enforcing DuckDB operation

For operations not supported by duckplyr, the original dplyr implementation is used as a fallback. As the original dplyr implementation accesses columns directly, the data must be materialized before a fallback can be executed. Therefore, frugal frames allow you to check that all operations are supported by DuckDB: for a frugal frame, fallbacks to dplyr are not possible.

flights_frugal |>
  group_by(origin) |>
  summarize(n = n()) |>
  ungroup()
#> Error in `group_by()`:
#> ! This operation cannot be carried out by DuckDB, and the input is a
#>   frugal duckplyr frame.
#> • Try `summarise(.by = ...)` or `mutate(.by = ...)` instead of `group_by()` and
#>   `ungroup()`.
#> ℹ Use `compute(prudence = "lavish")` to materialize to temporary storage and
#>   continue with duckplyr.
#> ℹ See `vignette("funnel")` for other options.
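Following the hint in the error message, the same aggregation can be expressed with summarize(.by = ...), which duckplyr supports, so the whole computation stays in DuckDB. A sketch, assuming flights_frugal from above:

```r
# .by replaces group_by()/ungroup(); no dplyr fallback is needed,
# so no implicit materialization occurs. collect() then
# materializes the small result explicitly.
flights_frugal |>
  summarize(.by = origin, n = n()) |>
  collect()
```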

The same pipeline with a lavish frame works, but the computation is carried out by dplyr:

flights_frugal |>
  duckplyr::as_duckdb_tibble(prudence = "lavish") |>
  group_by(origin) |>
  summarize(n = n()) |>
  ungroup()
#> # A tibble: 3 × 2
#>   origin      n
#>   <chr>   <int>
#> 1 EWR    120835
#> 2 JFK    111279
#> 3 LGA    104662

By using operations supported by duckplyr and avoiding fallbacks as much as possible, your pipelines will be executed by DuckDB in an optimized way. See ?fallback for details on fallbacks, and vignette("limits") for the operations supported by duckplyr.

From frugal to lavish

A frugal duckplyr frame can be converted to a lavish one with as_duckdb_tibble(prudence = "lavish"). The collect.duckplyr_df() method triggers computation and converts to a plain tibble. The difference between the two is the class of the returned object:

flights_frugal |>
  duckplyr::as_duckdb_tibble(prudence = "lavish") |>
  class()
#> [1] "duckplyr_df" "tbl_df"      "tbl"         "data.frame"

flights_frugal |>
  collect() |>
  class()
#> [1] "tbl_df"     "tbl"        "data.frame"

The same behavior is achieved with as_tibble() and as.data.frame():

flights_frugal |>
  as_tibble() |>
  class()
#> [1] "tbl_df"     "tbl"        "data.frame"

flights_frugal |>
  as.data.frame() |>
  class()
#> [1] "data.frame"

See vignette("large") for techniques for working with large data sets.

Comparison

Frugal duckplyr frames behave like lazy tables in dbplyr and lazy frames in dtplyr: the computation only starts when you explicitly request it with collect.duckplyr_df() or through other means. However, frugal duckplyr frames can be converted to lavish ones at any time, and vice versa. In dtplyr and dbplyr, there are no lavish frames: collection always needs to be explicit.

Thrift

Thrifty prudence is a compromise between frugal and lavish: materialization is allowed, but only for results up to a certain size, measured in cells (values) and rows of the resulting data frame.

nrow(flights)
#> [1] 336776
flights_partial <-
  flights |>
  duckplyr::as_duckdb_tibble(prudence = "thrifty")

With this setting, the data is materialized only if the result has fewer than 1,000,000 cells (rows multiplied by columns).

flights_partial |>
  select(origin) |>
  nrow()
#> [1] 336776

Selecting a single column keeps the result below the limit, so it can be materialized; the full data set, with 19 variables and over six million cells, would be too large. The result after aggregation is smaller still:

flights_partial |>
  count(origin) |>
  nrow()
#> [1] 3

Thrifty prudence is a good choice for data sets where materializing the full data would be prohibitive, but small results are safe to materialize. This is the default for ingestion functions like read_parquet_duckdb(). See vignette("large") for more details on working with large data sets.