This vignette discusses how to work with large data sets using duckplyr.
library(conflicted)
library(duckplyr)
conflict_prefer("filter", "dplyr")
#> [conflicted] Removing existing preference.
#> [conflicted] Will prefer dplyr::filter over any other package.
Data frames and other objects in R are stored in RAM. This becomes problematic when the data grows larger than the available memory.
A variety of tools have been developed to work with large data sets, including in R. One example is the dbplyr package, a dplyr backend that connects to SQL databases and is designed to work with various databases that support SQL. This is a viable approach if the data is already stored in a database, or if the data is stored in Parquet or CSV files and loaded as a lazy table via duckdb::tbl_file().
The dbplyr package translates dplyr code to SQL. The syntax and semantics are very similar, but not identical to plain dplyr. In contrast, the duckplyr package aims to be a fully compatible drop-in replacement for dplyr, with exactly the same syntax and semantics:
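For illustration, a duckplyr pipeline reads exactly like the equivalent dplyr pipeline. The following minimal sketch uses the built-in mtcars data set and as_duckdb_tibble() (described below); it is not taken from the original text:
# Minimal sketch: duckplyr uses plain dplyr syntax and semantics
library(duckplyr)

mtcars |>
  as_duckdb_tibble() |>
  filter(cyl == 6) |>
  summarise(mean_mpg = mean(mpg), .by = gear)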
Full compatibility means fewer surprises and less cognitive load for the user. With DuckDB as the backend, duckplyr can also handle large data sets that do not fit into RAM, keeping full dplyr compatibility. The tools for bringing data into and out of R memory are modeled after the dplyr and dbplyr packages, and are described in the following sections.
See vignette("funnel")
on eager and lazy data,
vignette("limits")
for limitations in the translation
employed by duckplyr, and ?fallback
for more information on
fallback.
The duckdb_tibble() function creates a duckplyr data frame from vectors:
df <- duckdb_tibble(x = 1:5, y = letters[1:5])
df
#> # A duckplyr data frame: 2 variables
#> x y
#> <int> <chr>
#> 1 1 a
#> 2 2 b
#> 3 3 c
#> 4 4 d
#> 5 5 e
The duckdb_tibble() function is a drop-in replacement for tibble() and can be used in the same way. Similarly, as_duckdb_tibble() can be used to convert a data frame or another object to a duckplyr data frame:
flights_df() |>
as_duckdb_tibble()
#> # A duckplyr data frame: 19 variables
#> year month day dep_time sched_dep_time dep_delay arr_time sched_arr_time
#> <int> <int> <int> <int> <int> <dbl> <int> <int>
#> 1 2013 1 1 517 515 2 830 819
#> 2 2013 1 1 533 529 4 850 830
#> 3 2013 1 1 542 540 2 923 850
#> 4 2013 1 1 544 545 -1 1004 1022
#> 5 2013 1 1 554 600 -6 812 837
#> 6 2013 1 1 554 558 -4 740 728
#> 7 2013 1 1 555 600 -5 913 854
#> 8 2013 1 1 557 600 -3 709 723
#> 9 2013 1 1 557 600 -3 838 846
#> 10 2013 1 1 558 600 -2 753 745
#> # ℹ more rows
#> # ℹ 11 more variables: arr_delay <dbl>, carrier <chr>, flight <int>,
#> # tailnum <chr>, origin <chr>, dest <chr>, air_time <dbl>, distance <dbl>,
#> # hour <dbl>, minute <dbl>, time_hour <dttm>
Existing code that uses DuckDB via dbplyr can also take advantage of duckplyr. The following code creates a DuckDB connection and writes a data frame to a table:
path_duckdb <- tempfile(fileext = ".duckdb")
con <- DBI::dbConnect(duckdb::duckdb(path_duckdb))
DBI::dbWriteTable(con, "data", data.frame(x = 1:5, y = letters[1:5]))
dbplyr_data <- tbl(con, "data")
dbplyr_data
#> # Source: table<"data"> [?? x 2]
#> # Database: DuckDB v1.1.3 [unknown@Linux 6.5.0-1025-azure:R 4.4.2//tmp/RtmpIdtW3W/file11094c7cf0de.duckdb]
#> x y
#> <int> <chr>
#> 1 1 a
#> 2 2 b
#> 3 3 c
#> 4 4 d
#> 5 5 e
dbplyr_data |>
explain()
#> <SQL>
#> SELECT *
#> FROM "data"
#>
#> <PLAN>
#> physical_plan
#> ┌---------------------------┐
#> │ SEQ_SCAN │
#> │ -------------------- │
#> │ data │
#> │ │
#> │ Projections: │
#> │ x │
#> │ y │
#> │ │
#> │ ~5 Rows │
#> └---------------------------┘
The explain() output shows that the data is actually coming from a DuckDB table. The as_duckdb_tibble() function can then be used to seamlessly convert the data to a duckplyr frame:
dbplyr_data |>
as_duckdb_tibble()
#> # A duckplyr data frame: 2 variables
#> x y
#> <int> <chr>
#> 1 1 a
#> 2 2 b
#> 3 3 c
#> 4 4 d
#> 5 5 e
dbplyr_data |>
as_duckdb_tibble() |>
explain()
#> ┌---------------------------┐
#> │ SEQ_SCAN │
#> │ -------------------- │
#> │ data │
#> │ │
#> │ Projections: │
#> │ x │
#> │ y │
#> │ │
#> │ ~5 Rows │
#> └---------------------------┘
This only works for DuckDB connections. For other databases, turn the data into an R data frame or export it to a file before using as_duckdb_tibble().
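A hedged sketch of this pattern, where other_con is a placeholder for a connection to a non-DuckDB database:
# Hypothetical: other_con is a connection to some other database
tbl(other_con, "data") |>
  collect() |>              # bring the data into R memory first
  as_duckdb_tibble()        # then continue with duckplyr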
For other common cases, the as_duckdb_tibble() function fails with a helpful error message:
duckdb_tibble(a = 1) |>
group_by(a) |>
as_duckdb_tibble()
#> Error in `as_duckdb_tibble()`:
#> ! duckplyr does not support `group_by()`.
#> ℹ Use `.by` instead.
#> ℹ To proceed with dplyr, use `as_tibble()` or `as.data.frame()`.
duckdb_tibble(a = 1) |>
rowwise() |>
as_duckdb_tibble()
#> Error in `as_duckdb_tibble()`:
#> ! duckplyr does not support `rowwise()`.
#> ℹ To proceed with dplyr, use `as_tibble()` or `as.data.frame()`.
readr::read_csv("a\n1", show_col_types = FALSE) |>
as_duckdb_tibble()
#> Error in `as_duckdb_tibble()`:
#> ! The input is data read by readr, and duckplyr supports reading CSV
#> files directly.
#> ℹ Use `read_csv_duckdb()` to read with the built-in reader.
#> ℹ To proceed with the data as read by readr, use `as_tibble()` before
#> `as_duckdb_tibble()`.
In all cases, as_tibble() can be used to proceed with the existing code.
DuckDB supports data ingestion from CSV, Parquet, and JSON files. The read_csv_duckdb() function accepts a file path and returns a duckplyr frame:
path_csv_1 <- tempfile(fileext = ".csv")
writeLines("x,y\n1,a\n2,b\n3,c", path_csv_1)
read_csv_duckdb(path_csv_1)
#> # A duckplyr data frame: 2 variables
#> x y
#> <dbl> <chr>
#> 1 1 a
#> 2 2 b
#> 3 3 c
Reading multiple files is also supported:
path_csv_2 <- tempfile(fileext = ".csv")
writeLines("x,y\n4,d\n5,e\n6,f", path_csv_2)
read_csv_duckdb(c(path_csv_1, path_csv_2))
#> # A duckplyr data frame: 2 variables
#> x y
#> <dbl> <chr>
#> 1 1 a
#> 2 2 b
#> 3 3 c
#> 4 4 d
#> 5 5 e
#> 6 6 f
The options argument can be used to control how the files are read.
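For example, a delimiter other than a comma can be specified through DuckDB's CSV reader options (a sketch; the delim option is a DuckDB read_csv parameter, and the file below is created only for illustration):
path_csv_semi <- tempfile(fileext = ".csv")
writeLines("x;y\n7;g\n8;h", path_csv_semi)
# Pass options through to DuckDB's CSV reader
read_csv_duckdb(path_csv_semi, options = list(delim = ";"))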
Similarly, the read_parquet_duckdb() and read_json_duckdb() functions can be used to read Parquet and JSON files, respectively.
For reading from HTTPS or S3 URLs, the httpfs extension must be installed and loaded in each session. Installation is fast if the extension is already installed.
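For illustration, the extension can be installed and loaded with db_exec(), which executes a plain DuckDB statement (a sketch; INSTALL and LOAD are standard DuckDB SQL):
db_exec("INSTALL httpfs")
db_exec("LOAD httpfs")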
Once loaded, the read_csv_duckdb(), read_parquet_duckdb(), and read_json_duckdb() functions can be used with URLs:
url <- "https://blobs.duckdb.org/flight-data-partitioned/Year=2024/data_0.parquet"
flights_parquet <- read_parquet_duckdb(url)
flights_parquet
#> # A duckplyr data frame: 110 variables
#> Year Quarter Month DayofMonth DayOfWeek FlightDate Reporting_Airline
#> <dbl> <dbl> <dbl> <dbl> <dbl> <date> <chr>
#> 1 2024 1 1 8 1 2024-01-08 9E
#> 2 2024 1 1 9 2 2024-01-09 9E
#> 3 2024 1 1 10 3 2024-01-10 9E
#> 4 2024 1 1 11 4 2024-01-11 9E
#> 5 2024 1 1 12 5 2024-01-12 9E
#> 6 2024 1 1 15 1 2024-01-15 9E
#> 7 2024 1 1 16 2 2024-01-16 9E
#> 8 2024 1 1 17 3 2024-01-17 9E
#> 9 2024 1 1 18 4 2024-01-18 9E
#> 10 2024 1 1 19 5 2024-01-19 9E
#> # ℹ more rows
#> # ℹ 103 more variables: DOT_ID_Reporting_Airline <dbl>,
#> # IATA_CODE_Reporting_Airline <chr>, Tail_Number <chr>,
#> # Flight_Number_Reporting_Airline <dbl>, OriginAirportID <dbl>,
#> # OriginAirportSeqID <dbl>, OriginCityMarketID <dbl>, Origin <chr>,
#> # OriginCityName <chr>, OriginState <chr>, OriginStateFips <chr>,
#> # OriginStateName <chr>, OriginWac <dbl>, DestAirportID <dbl>, …
In all cases, the data is read lazily: only the metadata is read
initially, and the data is read as required. This means that data can be
read from files that are larger than the available RAM. The Parquet
format is particularly efficient for this purpose, as it stores data in
a columnar format and allows reading only the columns that are required.
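As a minimal sketch of this, selecting a subset of columns before calling explain() shows that only those columns are projected from the Parquet file (output not shown here):
flights_parquet |>
  select(Year, Month) |>
  explain()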
See vignette("funnel")
for more details on the concept of
lazy data.
In addition to as_duckdb_tibble(), arbitrary DuckDB queries can be executed and the result converted to a duckplyr frame. For this, attach an existing DuckDB database first:
sql_attach <- paste0(
"ATTACH DATABASE '",
path_duckdb,
"' AS external (READ_ONLY)"
)
db_exec(sql_attach)
Then, use read_sql_duckdb() to execute a query and return a duckplyr frame:
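For example (a sketch; the exact query is illustrative and refers to the data table written above in the attached external database):
# Query the attached database by its alias
read_sql_duckdb("SELECT * FROM external.data WHERE x > 2")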
In dbplyr, compute() is used to materialize a lazy table in a temporary table on the database, and collect() is used to bring the data into R memory. This interface works exactly the same in duckplyr:
simple_data <-
duckdb_tibble(a = 1) |>
mutate(b = 2)
simple_data |>
explain()
#> ┌---------------------------┐
#> │ PROJECTION │
#> │ -------------------- │
#> │ a │
#> │ b │
#> │ │
#> │ ~1 Rows │
#> └-------------┬-------------┘
#> ┌-------------┴-------------┐
#> │ R_DATAFRAME_SCAN │
#> │ -------------------- │
#> │ data.frame │
#> │ │
#> │ Projections: a │
#> │ │
#> │ ~1 Rows │
#> └---------------------------┘
simple_data_computed <-
simple_data |>
compute()
The compute.duckplyr_df() function materializes the data in a temporary table and returns a duckplyr frame backed by that table, which can be used in further computations. Because the table is temporary, the data is not persisted after the session ends:
simple_data_computed |>
explain()
#> ┌---------------------------┐
#> │ SEQ_SCAN │
#> │ -------------------- │
#> │ duckplyr_FDn9c63wNw │
#> │ │
#> │ Projections: │
#> │ a │
#> │ b │
#> │ │
#> │ ~1 Rows │
#> └---------------------------┘
The collect() function brings the data into R memory and returns a plain tibble:
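For example, collecting the frame computed above returns an ordinary tibble:
simple_data_computed |>
  collect()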
To materialize data in a persistent file, the compute_csv() and compute_parquet() functions can be used. The compute_csv() function writes the data to a CSV file:
path_csv_out <- tempfile(fileext = ".csv")
duckdb_tibble(a = 1) |>
mutate(b = 2) |>
compute_csv(path_csv_out)
#> # A duckplyr data frame: 2 variables
#> a b
#> <dbl> <dbl>
#> 1 1 2
writeLines(readLines(path_csv_out))
#> a,b
#> 1.0,2.0
The compute_parquet() function writes the data to a Parquet file:
path_parquet_out <- tempfile(fileext = ".parquet")
duckdb_tibble(a = 1) |>
mutate(b = 2) |>
compute_parquet(path_parquet_out) |>
explain()
#> ┌---------------------------┐
#> │ READ_PARQUET │
#> │ -------------------- │
#> │ Function: │
#> │ READ_PARQUET │
#> │ │
#> │ Projections: │
#> │ a │
#> │ b │
#> │ │
#> │ ~1 Rows │
#> └---------------------------┘
Just like compute.duckplyr_df(), the compute_csv() and compute_parquet() functions return a duckplyr frame that is backed by the newly created CSV or Parquet file and can be used in further computations. At the time of writing, direct JSON export is not supported.
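A minimal sketch of continuing a pipeline on the frame returned by compute_parquet() (the temporary file path is illustrative):
duckdb_tibble(a = 1) |>
  mutate(b = 2) |>
  compute_parquet(tempfile(fileext = ".parquet")) |>
  # the frame is now backed by the Parquet file
  mutate(c = a + b)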
The functions shown in this vignette allow the construction of data transformation pipelines spanning multiple data sources and data that is too large to fit into memory. Full compatibility with dplyr is provided, so existing code can be used with duckplyr with minimal changes. The lazy evaluation of duckplyr frames allows for efficient data processing, as only the required data is read from disk. The materialization functions allow the data to be persisted in temporary tables or files, depending on the use case. A typical workflow might look like this:
- Use compute.duckplyr_df() or compute_parquet() to materialize any intermediate result that is too large to fit into memory.
- Bring the end result into R memory using collect.duckplyr_df(), or write it to a file using compute_csv() or compute_parquet(), as sketched below.
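Put together, such a workflow might be sketched as follows (file names and column names are placeholders, not from the original text):
# Ingest lazily from a (placeholder) Parquet file
large_data <- read_parquet_duckdb("input.parquet")

# Materialize a large intermediate result to a Parquet file
intermediate <-
  large_data |>
  filter(x > 0) |>
  compute_parquet("intermediate.parquet")

# Bring the small end result into R memory
result <-
  intermediate |>
  summarise(n = n(), .by = y) |>
  collect()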
There is a caveat: due to the design of duckplyr, if a dplyr verb is not supported or uses a function that is not supported, the data will be read into memory before being processed further. By default, if the data pipeline starts with an ingestion function, the data will only be read into memory if it is not “too large” (currently defined as 1 million cells or values in the table):
flights_parquet |>
group_by(Month)
#> Error in `group_by()`:
#> ! This operation cannot be carried out by DuckDB, and the input is a
#> frugal duckplyr frame.
#> • Try `summarise(.by = ...)` or `mutate(.by = ...)` instead of `group_by()` and
#> `ungroup()`.
#> ℹ Use `compute(prudence = "lavish")` to materialize to temporary storage and
#> continue with duckplyr.
#> ℹ See `vignette("funnel")` for other options.
Because group_by() is not supported, duckplyr attempts to read the data into memory before the group_by() operation is executed. Once the data is small enough to fit into memory, this works transparently:
flights_parquet |>
count(Month, DayofMonth) |>
group_by(Month)
#> # A tibble: 182 × 3
#> # Groups: Month [6]
#> Month DayofMonth n
#> <dbl> <dbl> <int>
#> 1 1 1 17265
#> 2 1 2 18977
#> 3 1 3 18520
#> 4 1 4 18066
#> 5 1 5 18109
#> 6 1 6 16950
#> 7 1 7 18812
#> 8 1 8 18472
#> 9 1 9 16775
#> 10 1 10 16795
#> # ℹ 172 more rows
See vignette("funnel")
for the concepts and mechanisms
at play.