Title: | A 'dplyr' Back End for Databases |
---|---|
Description: | A 'dplyr' back end for databases that allows you to work with remote database tables as if they are in-memory data frames. Basic features works with any database that has a 'DBI' back end; more advanced features require 'SQL' translation to be provided by the package author. |
Authors: | Hadley Wickham [aut, cre], Maximilian Girlich [aut], Edgar Ruiz [aut], Posit Software, PBC [cph, fnd] |
Maintainer: | Hadley Wickham <[email protected]> |
License: | MIT + file LICENSE |
Version: | 2.5.0.9000 |
Built: | 2024-11-12 17:17:32 UTC |
Source: | https://github.com/tidyverse/dbplyr |
This is an method for the dplyr arrange()
generic. It generates
the ORDER BY
clause of the SQL query. It also affects the
window_order()
of windowed expressions in mutate.tbl_lazy()
.
Note that ORDER BY
clauses can not generally appear in subqueries, which
means that you should arrange()
as late as possible in your pipelines.
## S3 method for class 'tbl_lazy' arrange(.data, ..., .by_group = FALSE)
## S3 method for class 'tbl_lazy' arrange(.data, ..., .by_group = FALSE)
.data |
A lazy data frame backed by a database query. |
... |
< |
.by_group |
If |
Another tbl_lazy
. Use show_query()
to see the generated
query, and use collect()
to execute the query
and return data to R.
Unlike R, most databases sorts NA
(NULL
s) at the front. You can
can override this behaviour by explicitly sorting on is.na(x)
.
library(dplyr, warn.conflicts = FALSE) db <- memdb_frame(a = c(3, 4, 1, 2), b = c(5, 1, 2, NA)) db %>% arrange(a) %>% show_query() # Note that NAs are sorted first db %>% arrange(b) # override by sorting on is.na() first db %>% arrange(is.na(b), b)
library(dplyr, warn.conflicts = FALSE) db <- memdb_frame(a = c(3, 4, 1, 2), b = c(5, 1, 2, NA)) db %>% arrange(a) %>% show_query() # Note that NAs are sorted first db %>% arrange(b) # override by sorting on is.na() first db %>% arrange(is.na(b), b)
See vignette("translation-function")
and vignette("translation-verb")
for
details of overall translation technology. Key differences for this backend
are:
SELECT
uses TOP
, not LIMIT
Non-standard types and mathematical functions
String concatenation uses &
No ANALYZE
equivalent
TRUE
and FALSE
converted to 1 and 0
Use simulate_access()
with lazy_frame()
to see simulated SQL without
converting to live access database.
simulate_access()
simulate_access()
library(dplyr, warn.conflicts = FALSE) lf <- lazy_frame(x = 1, y = 2, z = "a", con = simulate_access()) lf %>% head() lf %>% mutate(y = as.numeric(y), z = sqrt(x^2 + 10)) lf %>% mutate(a = paste0(z, " times"))
library(dplyr, warn.conflicts = FALSE) lf <- lazy_frame(x = 1, y = 2, z = "a", con = simulate_access()) lf %>% head() lf %>% mutate(y = as.numeric(y), z = sqrt(x^2 + 10)) lf %>% mutate(a = paste0(z, " times"))
See vignette("translation-function")
and vignette("translation-verb")
for
details of overall translation technology. Key differences for this backend
are:
Temporary tables get #
prefix and use LOCAL TEMPORARY COLUMN
.
No table analysis performed in copy_to()
.
paste()
uses ||
Note that you can't create new boolean columns from logical expressions;
you need to wrap with explicit ifelse
: ifelse(x > y, TRUE, FALSE)
.
Use simulate_hana()
with lazy_frame()
to see simulated SQL without
converting to live access database.
simulate_hana()
simulate_hana()
library(dplyr, warn.conflicts = FALSE) lf <- lazy_frame(a = TRUE, b = 1, c = 2, d = "z", con = simulate_hana()) lf %>% transmute(x = paste0(d, " times"))
library(dplyr, warn.conflicts = FALSE) lf <- lazy_frame(a = TRUE, b = 1, c = 2, d = "z", con = simulate_hana()) lf %>% transmute(x = paste0(d, " times"))
See vignette("translation-function")
and vignette("translation-verb")
for
details of overall translation technology. Key differences for this backend
are a scattering of custom translations provided by users.
Use simulate_hive()
with lazy_frame()
to see simulated SQL without
converting to live access database.
simulate_hive()
simulate_hive()
library(dplyr, warn.conflicts = FALSE) lf <- lazy_frame(a = TRUE, b = 1, d = 2, c = "z", con = simulate_hive()) lf %>% transmute(x = cot(b)) lf %>% transmute(x = bitwShiftL(c, 1L)) lf %>% transmute(x = str_replace_all(c, "a", "b")) lf %>% summarise(x = median(d, na.rm = TRUE)) lf %>% summarise(x = var(c, na.rm = TRUE))
library(dplyr, warn.conflicts = FALSE) lf <- lazy_frame(a = TRUE, b = 1, d = 2, c = "z", con = simulate_hive()) lf %>% transmute(x = cot(b)) lf %>% transmute(x = bitwShiftL(c, 1L)) lf %>% transmute(x = str_replace_all(c, "a", "b")) lf %>% summarise(x = median(d, na.rm = TRUE)) lf %>% summarise(x = var(c, na.rm = TRUE))
See vignette("translation-function")
and vignette("translation-verb")
for
details of overall translation technology. Key differences for this backend
are a scattering of custom translations provided by users, mostly focussed
on bitwise operations.
Use simulate_impala()
with lazy_frame()
to see simulated SQL without
converting to live access database.
simulate_impala()
simulate_impala()
library(dplyr, warn.conflicts = FALSE) lf <- lazy_frame(a = TRUE, b = 1, c = 2, d = "z", con = simulate_impala()) lf %>% transmute(X = bitwNot(bitwOr(b, c)))
library(dplyr, warn.conflicts = FALSE) lf <- lazy_frame(a = TRUE, b = 1, c = 2, d = "z", con = simulate_impala()) lf %>% transmute(X = bitwNot(bitwOr(b, c)))
See vignette("translation-function")
and vignette("translation-verb")
for
details of overall translation technology. Key differences for this backend
are:
SELECT
uses TOP
not LIMIT
Automatically prefixes #
to create temporary tables. Add the prefix
yourself to avoid the message.
String basics: paste()
, substr()
, nchar()
Custom types for as.*
functions
Lubridate extraction functions, year()
, month()
, day()
etc
Semi-automated bit <-> boolean translation (see below)
Use simulate_mssql()
with lazy_frame()
to see simulated SQL without
converting to live access database.
simulate_mssql(version = "15.0")
simulate_mssql(version = "15.0")
version |
Version of MS SQL to simulate. Currently only, difference is
that 15.0 and above will use |
SQL server uses two incompatible types to represent TRUE
and FALSE
values:
The BOOLEAN
type is the result of logical comparisons (e.g. x > y
)
and can be used WHERE
but not to create new columns in SELECT
.
https://docs.microsoft.com/en-us/sql/t-sql/language-elements/comparison-operators-transact-sql
The BIT
type is a special type of numeric column used to store
TRUE
and FALSE
values, but can't be used in WHERE
clauses.
https://learn.microsoft.com/en-us/sql/t-sql/data-types/bit-transact-sql?view=sql-server-ver15
dbplyr does its best to automatically create the correct type when needed, but can't do it 100% correctly because it does not have a full type inference system. This means that you many need to manually do conversions from time to time.
To convert from bit to boolean use x == 1
To convert from boolean to bit use as.logical(if(x, 0, 1))
library(dplyr, warn.conflicts = FALSE) lf <- lazy_frame(a = TRUE, b = 1, c = 2, d = "z", con = simulate_mssql()) lf %>% head() lf %>% transmute(x = paste(b, c, d)) # Can use boolean as is: lf %>% filter(c > d) # Need to convert from boolean to bit: lf %>% transmute(x = c > d) # Can use boolean as is: lf %>% transmute(x = ifelse(c > d, "c", "d"))
library(dplyr, warn.conflicts = FALSE) lf <- lazy_frame(a = TRUE, b = 1, c = 2, d = "z", con = simulate_mssql()) lf %>% head() lf %>% transmute(x = paste(b, c, d)) # Can use boolean as is: lf %>% filter(c > d) # Need to convert from boolean to bit: lf %>% transmute(x = c > d) # Can use boolean as is: lf %>% transmute(x = ifelse(c > d, "c", "d"))
See vignette("translation-function")
and vignette("translation-verb")
for
details of overall translation technology. Key differences for this backend
are:
paste()
uses CONCAT_WS()
String translations for str_detect()
, str_locate()
, and
str_replace_all()
Clear error message for unsupported full joins
Use simulate_mysql()
with lazy_frame()
to see simulated SQL without
converting to live access database.
simulate_mysql() simulate_mariadb()
simulate_mysql() simulate_mariadb()
library(dplyr, warn.conflicts = FALSE) lf <- lazy_frame(a = TRUE, b = 1, c = 2, d = "z", con = simulate_mysql()) lf %>% transmute(x = paste0(d, " times"))
library(dplyr, warn.conflicts = FALSE) lf <- lazy_frame(a = TRUE, b = 1, c = 2, d = "z", con = simulate_mysql()) lf %>% transmute(x = paste0(d, " times"))
See vignette("translation-function")
and vignette("translation-verb")
for
details of overall translation technology. Key differences for this backend
are minor translations for common data types.
Use simulate_odbc()
with lazy_frame()
to see simulated SQL without
converting to live access database.
simulate_odbc()
simulate_odbc()
library(dplyr, warn.conflicts = FALSE) lf <- lazy_frame(a = TRUE, b = 1, d = 2, c = "z", con = simulate_odbc()) lf %>% transmute(x = as.numeric(b)) lf %>% transmute(x = as.integer(b)) lf %>% transmute(x = as.character(b))
library(dplyr, warn.conflicts = FALSE) lf <- lazy_frame(a = TRUE, b = 1, d = 2, c = "z", con = simulate_odbc()) lf %>% transmute(x = as.numeric(b)) lf %>% transmute(x = as.integer(b)) lf %>% transmute(x = as.character(b))
See vignette("translation-function")
and vignette("translation-verb")
for
details of overall translation technology. Key differences for this backend
are:
Use FETCH FIRST
instead of LIMIT
Custom types
paste()
uses ||
Custom subquery generation (no AS
)
setdiff()
uses MINUS
instead of EXCEPT
Note that versions of Oracle prior to 23c have limited supported for
TRUE
and FALSE
and you may need to use 1
and 0
instead.
See https://oracle-base.com/articles/23c/boolean-data-type-23c for
more details.
Use simulate_oracle()
with lazy_frame()
to see simulated SQL without
converting to live access database.
simulate_oracle()
simulate_oracle()
library(dplyr, warn.conflicts = FALSE) lf <- lazy_frame(a = TRUE, b = 1, c = 2, d = "z", con = simulate_oracle()) lf %>% transmute(x = paste0(c, " times")) lf %>% setdiff(lf)
library(dplyr, warn.conflicts = FALSE) lf <- lazy_frame(a = TRUE, b = 1, c = 2, d = "z", con = simulate_oracle()) lf %>% transmute(x = paste0(c, " times")) lf %>% setdiff(lf)
See vignette("translation-function")
and vignette("translation-verb")
for
details of overall translation technology. Key differences for this backend
are:
Many stringr functions
lubridate date-time extraction functions
More standard statistical summaries
Use simulate_postgres()
with lazy_frame()
to see simulated SQL without
converting to live access database.
simulate_postgres()
simulate_postgres()
library(dplyr, warn.conflicts = FALSE) lf <- lazy_frame(a = TRUE, b = 1, c = 2, d = "z", con = simulate_postgres()) lf %>% summarise(x = sd(b, na.rm = TRUE)) lf %>% summarise(y = cor(b, c), z = cov(b, c))
library(dplyr, warn.conflicts = FALSE) lf <- lazy_frame(a = TRUE, b = 1, c = 2, d = "z", con = simulate_postgres()) lf %>% summarise(x = sd(b, na.rm = TRUE)) lf %>% summarise(y = cor(b, c), z = cov(b, c))
Base translations come from PostgreSQL backend. There are generally few differences, apart from string manipulation.
Use simulate_redshift()
with lazy_frame()
to see simulated SQL without
converting to live access database.
simulate_redshift()
simulate_redshift()
library(dplyr, warn.conflicts = FALSE) lf <- lazy_frame(a = TRUE, b = 1, c = 2, d = "z", con = simulate_redshift()) lf %>% transmute(x = paste(c, " times")) lf %>% transmute(x = substr(c, 2, 3)) lf %>% transmute(x = str_replace_all(c, "a", "z"))
library(dplyr, warn.conflicts = FALSE) lf <- lazy_frame(a = TRUE, b = 1, c = 2, d = "z", con = simulate_redshift()) lf %>% transmute(x = paste(c, " times")) lf %>% transmute(x = substr(c, 2, 3)) lf %>% transmute(x = str_replace_all(c, "a", "z"))
See vignette("translation-function")
and vignette("translation-verb")
for
details of overall translation technology.
Use simulate_snowflake()
with lazy_frame()
to see simulated SQL without
converting to live access database.
simulate_snowflake()
simulate_snowflake()
library(dplyr, warn.conflicts = FALSE) lf <- lazy_frame(a = TRUE, b = 1, c = 2, d = "z", con = simulate_snowflake()) lf %>% transmute(x = paste0(d, " times"))
library(dplyr, warn.conflicts = FALSE) lf <- lazy_frame(a = TRUE, b = 1, c = 2, d = "z", con = simulate_snowflake()) lf %>% transmute(x = paste0(d, " times"))
See vignette("translation-function")
and vignette("translation-verb")
for
details of overall translation technology. Key differences for this backend
are better translation of statistical aggregate functions
(e.g. var()
, median()
) and use of temporary views instead of temporary
tables when copying data.
Use simulate_spark_sql()
with lazy_frame()
to see simulated SQL without
converting to live access database.
simulate_spark_sql()
simulate_spark_sql()
library(dplyr, warn.conflicts = FALSE) lf <- lazy_frame(a = TRUE, b = 1, d = 2, c = "z", con = simulate_spark_sql()) lf %>% summarise(x = median(d, na.rm = TRUE)) lf %>% summarise(x = var(c, na.rm = TRUE), .by = d) lf %>% mutate(x = first(c)) lf %>% mutate(x = first(c), .by = d)
library(dplyr, warn.conflicts = FALSE) lf <- lazy_frame(a = TRUE, b = 1, d = 2, c = "z", con = simulate_spark_sql()) lf %>% summarise(x = median(d, na.rm = TRUE)) lf %>% summarise(x = var(c, na.rm = TRUE), .by = d) lf %>% mutate(x = first(c)) lf %>% mutate(x = first(c), .by = d)
See vignette("translation-function")
and vignette("translation-verb")
for
details of overall translation technology. Key differences for this backend
are:
Uses non-standard LOG()
function
Date-time extraction functions from lubridate
Custom median translation
Right and full joins are simulated using left joins
Use simulate_sqlite()
with lazy_frame()
to see simulated SQL without
converting to live access database.
simulate_sqlite()
simulate_sqlite()
library(dplyr, warn.conflicts = FALSE) lf <- lazy_frame(a = TRUE, b = 1, c = 2, d = "z", con = simulate_sqlite()) lf %>% transmute(x = paste(c, " times")) lf %>% transmute(x = log(b), y = log(b, base = 2))
library(dplyr, warn.conflicts = FALSE) lf <- lazy_frame(a = TRUE, b = 1, c = 2, d = "z", con = simulate_sqlite()) lf %>% transmute(x = paste(c, " times")) lf %>% transmute(x = log(b), y = log(b, base = 2))
See vignette("translation-function")
and vignette("translation-verb")
for
details of overall translation technology. Key differences for this backend
are:
Uses TOP
instead of LIMIT
Selection of user supplied translations
Use simulate_teradata()
with lazy_frame()
to see simulated SQL without
converting to live access database.
simulate_teradata()
simulate_teradata()
library(dplyr, warn.conflicts = FALSE) lf <- lazy_frame(a = TRUE, b = 1, c = 2, d = "z", con = simulate_teradata()) lf %>% head()
library(dplyr, warn.conflicts = FALSE) lf <- lazy_frame(a = TRUE, b = 1, c = 2, d = "z", con = simulate_teradata()) lf %>% head()
These are methods for the dplyr generics collapse()
, compute()
,
and collect()
. collapse()
creates a subquery, compute()
stores
the results in a remote table, and collect()
executes the query and
downloads the data into R.
## S3 method for class 'tbl_sql' collapse(x, ...) ## S3 method for class 'tbl_sql' compute( x, name = NULL, temporary = TRUE, unique_indexes = list(), indexes = list(), analyze = TRUE, ..., cte = FALSE ) ## S3 method for class 'tbl_sql' collect(x, ..., n = Inf, warn_incomplete = TRUE, cte = FALSE)
## S3 method for class 'tbl_sql' collapse(x, ...) ## S3 method for class 'tbl_sql' compute( x, name = NULL, temporary = TRUE, unique_indexes = list(), indexes = list(), analyze = TRUE, ..., cte = FALSE ) ## S3 method for class 'tbl_sql' collect(x, ..., n = Inf, warn_incomplete = TRUE, cte = FALSE)
x |
A lazy data frame backed by a database query. |
... |
other parameters passed to methods. |
name |
Table name in remote database. |
temporary |
Should the table be temporary ( |
unique_indexes |
a list of character vectors. Each element of the list will create a new unique index over the specified column(s). Duplicate rows will result in failure. |
indexes |
a list of character vectors. Each element of the list will create a new index. |
analyze |
if |
cte |
|
n |
Number of rows to fetch. Defaults to |
warn_incomplete |
Warn if |
library(dplyr, warn.conflicts = FALSE) db <- memdb_frame(a = c(3, 4, 1, 2), b = c(5, 1, 2, NA)) db %>% filter(a <= 2) %>% collect()
library(dplyr, warn.conflicts = FALSE) db <- memdb_frame(a = c(3, 4, 1, 2), b = c(5, 1, 2, NA)) db %>% filter(a <= 2) %>% collect()
Turns implicit missing values into explicit missing values. This is a method
for the tidyr::complete()
generic.
## S3 method for class 'tbl_lazy' complete(data, ..., fill = list())
## S3 method for class 'tbl_lazy' complete(data, ..., fill = list())
data |
A lazy data frame backed by a database query. |
... |
Specification of columns to expand. See tidyr::expand for more details. |
fill |
A named list that for each variable supplies a single value to use instead of NA for missing combinations. |
Another tbl_lazy
. Use show_query()
to see the generated
query, and use collect()
to execute the query
and return data to R.
df <- memdb_frame( group = c(1:2, 1), item_id = c(1:2, 2), item_name = c("a", "b", "b"), value1 = 1:3, value2 = 4:6 ) df %>% tidyr::complete(group, nesting(item_id, item_name)) # You can also choose to fill in missing values df %>% tidyr::complete(group, nesting(item_id, item_name), fill = list(value1 = 0))
df <- memdb_frame( group = c(1:2, 1), item_id = c(1:2, 2), item_name = c("a", "b", "b"), value1 = 1:3, value2 = 4:6 ) df %>% tidyr::complete(group, nesting(item_id, item_name)) # You can also choose to fill in missing values df %>% tidyr::complete(group, nesting(item_id, item_name), fill = list(value1 = 0))
This is an alternative to copy_to()
that does not need write access and
is faster for small data.
copy_inline(con, df, types = NULL)
copy_inline(con, df, types = NULL)
con |
A database connection. |
df |
A local data frame. The data is written directly in the SQL query so it should be small. |
types |
A named character vector of SQL data types to use for the columns.
The data types are backend specific. For example for Postgres this could
be |
It writes the data directly in the SQL query via the VALUES
clause.
A tbl_lazy
.
copy_to()
to copy the data into a new database table.
df <- data.frame(x = 1:3, y = c("a", "b", "c")) con <- DBI::dbConnect(RSQLite::SQLite(), ":memory:") copy_inline(con, df) copy_inline(con, df) %>% dplyr::show_query()
df <- data.frame(x = 1:3, y = c("a", "b", "c")) con <- DBI::dbConnect(RSQLite::SQLite(), ":memory:") copy_inline(con, df) copy_inline(con, df) %>% dplyr::show_query()
This is an implementation of the dplyr copy_to()
generic and it mostly
a wrapper around DBI::dbWriteTable()
.
It is useful for copying small amounts of data to a database for examples, experiments, and joins. By default, it creates temporary tables which are only visible within the current connection to the database.
## S3 method for class 'src_sql' copy_to( dest, df, name = deparse(substitute(df)), overwrite = FALSE, types = NULL, temporary = TRUE, unique_indexes = NULL, indexes = NULL, analyze = TRUE, ..., in_transaction = TRUE )
## S3 method for class 'src_sql' copy_to( dest, df, name = deparse(substitute(df)), overwrite = FALSE, types = NULL, temporary = TRUE, unique_indexes = NULL, indexes = NULL, analyze = TRUE, ..., in_transaction = TRUE )
dest |
remote data source |
df |
A local data frame, a |
name |
Name of new remote table. Use a string to create the table
in the current catalog/schema. Use |
overwrite |
If |
types |
a character vector giving variable types to use for the columns. See https://www.sqlite.org/datatype3.html for available types. |
temporary |
if |
unique_indexes |
a list of character vectors. Each element of the list will create a new unique index over the specified column(s). Duplicate rows will result in failure. |
indexes |
a list of character vectors. Each element of the list will create a new index. |
analyze |
if |
... |
other parameters passed to methods. |
in_transaction |
Should the table creation be wrapped in a transaction? This typically makes things faster, but you may want to suppress if the database doesn't support transactions, or you're wrapping in a transaction higher up (and your database doesn't support nested transactions.) |
Another tbl_lazy
. Use show_query()
to see the generated
query, and use collect()
to execute the query
and return data to R.
copy_inline()
to use small data in an SQL query without actually
writing to a table.
library(dplyr, warn.conflicts = FALSE) df <- data.frame(x = 1:5, y = letters[5:1]) db <- copy_to(src_memdb(), df) db df2 <- data.frame(y = c("a", "d"), fruit = c("apple", "date")) # copy_to() is called automatically if you set copy = TRUE # in the join functions db %>% left_join(df2, copy = TRUE)
library(dplyr, warn.conflicts = FALSE) df <- data.frame(x = 1:5, y = letters[5:1]) db <- copy_to(src_memdb(), df) db df2 <- data.frame(y = c("a", "d"), fruit = c("apple", "date")) # copy_to() is called automatically if you set copy = TRUE # in the join functions db %>% left_join(df2, copy = TRUE)
These are methods for the dplyr count()
and tally()
generics. They
wrap up group_by.tbl_lazy()
, summarise.tbl_lazy()
and, optionally,
arrange.tbl_lazy()
.
## S3 method for class 'tbl_lazy' count(x, ..., wt = NULL, sort = FALSE, name = NULL) ## S3 method for class 'tbl_lazy' add_count(x, ..., wt = NULL, sort = FALSE, name = NULL, .drop = NULL) ## S3 method for class 'tbl_lazy' tally(x, wt = NULL, sort = FALSE, name = NULL)
## S3 method for class 'tbl_lazy' count(x, ..., wt = NULL, sort = FALSE, name = NULL) ## S3 method for class 'tbl_lazy' add_count(x, ..., wt = NULL, sort = FALSE, name = NULL, .drop = NULL) ## S3 method for class 'tbl_lazy' tally(x, wt = NULL, sort = FALSE, name = NULL)
x |
A data frame, data frame extension (e.g. a tibble), or a lazy data frame (e.g. from dbplyr or dtplyr). |
... |
< |
wt |
<
|
sort |
If |
name |
The name of the new column in the output. If omitted, it will default to |
.drop |
Not supported for lazy tables. |
library(dplyr, warn.conflicts = FALSE) db <- memdb_frame(g = c(1, 1, 1, 2, 2), x = c(4, 3, 6, 9, 2)) db %>% count(g) %>% show_query() db %>% count(g, wt = x) %>% show_query() db %>% count(g, wt = x, sort = TRUE) %>% show_query()
library(dplyr, warn.conflicts = FALSE) db <- memdb_frame(g = c(1, 1, 1, 2, 2), x = c(4, 3, 6, 9, 2)) db %>% count(g) %>% show_query() db %>% count(g, wt = x) %>% show_query() db %>% count(g, wt = x, sort = TRUE) %>% show_query()
This is a method for the tidyr uncount()
generic. It uses a temporary
table, so your database user needs permissions to create one.
dbplyr_uncount(data, weights, .remove = TRUE, .id = NULL)
dbplyr_uncount(data, weights, .remove = TRUE, .id = NULL)
data |
A lazy data frame backed by a database query. |
weights |
A vector of weights. Evaluated in the context of |
.remove |
If |
.id |
Supply a string to create a new variable which gives a unique identifier for each created row. |
df <- memdb_frame(x = c("a", "b"), n = c(1, 2)) dbplyr_uncount(df, n) dbplyr_uncount(df, n, .id = "id") # You can also use constants dbplyr_uncount(df, 2) # Or expressions dbplyr_uncount(df, 2 / n)
df <- memdb_frame(x = c("a", "b"), n = c(1, 2)) dbplyr_uncount(df, n) dbplyr_uncount(df, n, .id = "id") # You can also use constants dbplyr_uncount(df, 2) # Or expressions dbplyr_uncount(df, 2 / n)
These are methods for the dplyr generics slice_min()
, slice_max()
, and
slice_sample()
. They are translated to SQL using filter()
and
window functions (ROWNUMBER
, MIN_RANK
, or CUME_DIST
depending on
arguments). slice()
, slice_head()
, and slice_tail()
are not supported
since database tables have no intrinsic order.
If data is grouped, the operation will be performed on each group so that
(e.g.) slice_min(db, x, n = 3)
will select the three rows with the smallest
value of x
in each group.
## S3 method for class 'tbl_lazy' slice_min( .data, order_by, ..., n, prop, by = NULL, with_ties = TRUE, na_rm = TRUE ) ## S3 method for class 'tbl_lazy' slice_max( .data, order_by, ..., n, by = NULL, prop, with_ties = TRUE, na_rm = TRUE ) ## S3 method for class 'tbl_lazy' slice_sample(.data, ..., n, prop, by = NULL, weight_by = NULL, replace = FALSE)
## S3 method for class 'tbl_lazy' slice_min( .data, order_by, ..., n, prop, by = NULL, with_ties = TRUE, na_rm = TRUE ) ## S3 method for class 'tbl_lazy' slice_max( .data, order_by, ..., n, by = NULL, prop, with_ties = TRUE, na_rm = TRUE ) ## S3 method for class 'tbl_lazy' slice_sample(.data, ..., n, prop, by = NULL, weight_by = NULL, replace = FALSE)
.data |
A lazy data frame backed by a database query. |
order_by |
Variable or function of variables to order by. |
... |
Not used. |
n , prop
|
Provide either If |
by |
< |
with_ties |
Should ties be kept together? The default, |
na_rm |
Should missing values in |
weight_by , replace
|
Not supported for database backends. |
library(dplyr, warn.conflicts = FALSE) db <- memdb_frame(x = 1:3, y = c(1, 1, 2)) db %>% slice_min(x) %>% show_query() db %>% slice_max(x) %>% show_query() db %>% slice_sample() %>% show_query() db %>% group_by(y) %>% slice_min(x) %>% show_query() # By default, ties are includes so you may get more rows # than you expect db %>% slice_min(y, n = 1) db %>% slice_min(y, n = 1, with_ties = FALSE) # Non-integer group sizes are rounded down db %>% slice_min(x, prop = 0.5)
library(dplyr, warn.conflicts = FALSE) db <- memdb_frame(x = 1:3, y = c(1, 1, 2)) db %>% slice_min(x) %>% show_query() db %>% slice_max(x) %>% show_query() db %>% slice_sample() %>% show_query() db %>% group_by(y) %>% slice_min(x) %>% show_query() # By default, ties are includes so you may get more rows # than you expect db %>% slice_min(y, n = 1) db %>% slice_min(y, n = 1, with_ties = FALSE) # Non-integer group sizes are rounded down db %>% slice_min(x, prop = 0.5)
This is a method for the dplyr distinct()
generic. It adds the
DISTINCT
clause to the SQL query.
## S3 method for class 'tbl_lazy' distinct(.data, ..., .keep_all = FALSE)
## S3 method for class 'tbl_lazy' distinct(.data, ..., .keep_all = FALSE)
.data |
A lazy data frame backed by a database query. |
... |
< |
.keep_all |
If |
Another tbl_lazy
. Use show_query()
to see the generated
query, and use collect()
to execute the query
and return data to R.
library(dplyr, warn.conflicts = FALSE) db <- memdb_frame(x = c(1, 1, 2, 2), y = c(1, 2, 1, 1)) db %>% distinct() %>% show_query() db %>% distinct(x) %>% show_query()
library(dplyr, warn.conflicts = FALSE) db <- memdb_frame(x = c(1, 1, 2, 2), y = c(1, 2, 1, 1)) db %>% distinct() %>% show_query() db %>% distinct(x) %>% show_query()
Perform arbitrary computation on remote backend
## S3 method for class 'tbl_sql' do(.data, ..., .chunk_size = 10000L)
## S3 method for class 'tbl_sql' do(.data, ..., .chunk_size = 10000L)
.data |
a tbl |
... |
Expressions to apply to each group. If named, results will be
stored in a new column. If unnamed, must return a data frame. You can
use |
.chunk_size |
The size of each chunk to pull into R. If this number is too big, the process will be slow because R has to allocate and free a lot of memory. If it's too small, it will be slow, because of the overhead of talking to the database. |
escape()
requires you to provide a database connection to control the
details of escaping. escape_ansi()
uses the SQL 92 ANSI standard.
escape(x, parens = NA, collapse = " ", con = NULL) escape_ansi(x, parens = NA, collapse = "") sql_vector(x, parens = NA, collapse = " ", con = NULL)
escape(x, parens = NA, collapse = " ", con = NULL) escape_ansi(x, parens = NA, collapse = "") sql_vector(x, parens = NA, collapse = " ", con = NULL)
x |
An object to escape. Existing sql vectors will be left as is,
character vectors are escaped with single quotes, numeric vectors have
trailing |
parens , collapse
|
Controls behaviour when multiple values are supplied.
Default behaviour: lists are always wrapped in parens and separated by commas, identifiers are separated by commas and never wrapped, atomic vectors are separated by spaces and wrapped in parens if needed. |
con |
Database connection. |
# Doubles vs. integers escape_ansi(1:5) escape_ansi(c(1, 5.4)) # String vs known sql vs. sql identifier escape_ansi("X") escape_ansi(sql("X")) escape_ansi(ident("X")) # Escaping is idempotent escape_ansi("X") escape_ansi(escape_ansi("X")) escape_ansi(escape_ansi(escape_ansi("X")))
# Doubles vs. integers escape_ansi(1:5) escape_ansi(c(1, 5.4)) # String vs known sql vs. sql identifier escape_ansi("X") escape_ansi(sql("X")) escape_ansi(ident("X")) # Escaping is idempotent escape_ansi("X") escape_ansi(escape_ansi("X")) escape_ansi(escape_ansi(escape_ansi("X")))
This is a method for the tidyr::expand generics. It doesn't sort the
result explicitly, so the order might be different to what expand()
returns for data frames.
## S3 method for class 'tbl_lazy' expand(data, ..., .name_repair = "check_unique")
## S3 method for class 'tbl_lazy' expand(data, ..., .name_repair = "check_unique")
data |
A lazy data frame backed by a database query. |
... |
Specification of columns to expand. See tidyr::expand for more details. |
.name_repair |
Treatment of problematic column names:
This argument is passed on as |
Another tbl_lazy
. Use show_query()
to see the generated
query, and use collect()
to execute the query
and return data to R.
fruits <- memdb_frame( type = c("apple", "orange", "apple", "orange", "orange", "orange"), year = c(2010, 2010, 2012, 2010, 2010, 2012), size = c("XS", "S", "M", "S", "S", "M"), weights = rnorm(6) ) # All possible combinations --------------------------------------- fruits %>% tidyr::expand(type) fruits %>% tidyr::expand(type, size) # Only combinations that already appear in the data --------------- fruits %>% tidyr::expand(nesting(type, size))
fruits <- memdb_frame( type = c("apple", "orange", "apple", "orange", "orange", "orange"), year = c(2010, 2010, 2012, 2010, 2010, 2012), size = c("XS", "S", "M", "S", "S", "M"), weights = rnorm(6) ) # All possible combinations --------------------------------------- fruits %>% tidyr::expand(type) fruits %>% tidyr::expand(type, size) # Only combinations that already appear in the data --------------- fruits %>% tidyr::expand(nesting(type, size))
Fill in missing values with previous or next value
## S3 method for class 'tbl_lazy' fill(.data, ..., .direction = c("down", "up", "updown", "downup"))
## S3 method for class 'tbl_lazy' fill(.data, ..., .direction = c("down", "up", "updown", "downup"))
.data |
A lazy data frame backed by a database query. |
... |
Columns to fill. |
.direction |
Direction in which to fill missing values. Currently
either "down" (the default) or "up". Note that "up" does not work when
|
squirrels <- tibble::tribble( ~group, ~name, ~role, ~n_squirrels, ~ n_squirrels2, 1, "Sam", "Observer", NA, 1, 1, "Mara", "Scorekeeper", 8, NA, 1, "Jesse", "Observer", NA, NA, 1, "Tom", "Observer", NA, 4, 2, "Mike", "Observer", NA, NA, 2, "Rachael", "Observer", NA, 6, 2, "Sydekea", "Scorekeeper", 14, NA, 2, "Gabriela", "Observer", NA, NA, 3, "Derrick", "Observer", NA, NA, 3, "Kara", "Scorekeeper", 9, 10, 3, "Emily", "Observer", NA, NA, 3, "Danielle", "Observer", NA, NA ) squirrels$id <- 1:12 tbl_memdb(squirrels) %>% window_order(id) %>% tidyr::fill( n_squirrels, n_squirrels2, )
squirrels <- tibble::tribble( ~group, ~name, ~role, ~n_squirrels, ~ n_squirrels2, 1, "Sam", "Observer", NA, 1, 1, "Mara", "Scorekeeper", 8, NA, 1, "Jesse", "Observer", NA, NA, 1, "Tom", "Observer", NA, 4, 2, "Mike", "Observer", NA, NA, 2, "Rachael", "Observer", NA, 6, 2, "Sydekea", "Scorekeeper", 14, NA, 2, "Gabriela", "Observer", NA, NA, 3, "Derrick", "Observer", NA, NA, 3, "Kara", "Scorekeeper", 9, 10, 3, "Emily", "Observer", NA, NA, 3, "Danielle", "Observer", NA, NA ) squirrels$id <- 1:12 tbl_memdb(squirrels) %>% window_order(id) %>% tidyr::fill( n_squirrels, n_squirrels2, )
This is a method for the dplyr filter()
generic. It generates the
WHERE
clause of the SQL query.
## S3 method for class 'tbl_lazy' filter(.data, ..., .by = NULL, .preserve = FALSE)
## S3 method for class 'tbl_lazy' filter(.data, ..., .by = NULL, .preserve = FALSE)
.data |
A lazy data frame backed by a database query. |
... |
< |
.by |
< |
.preserve |
Not supported by this method. |
Another tbl_lazy
. Use show_query()
to see the generated
query, and use collect()
to execute the query
and return data to R.
library(dplyr, warn.conflicts = FALSE) db <- memdb_frame(x = c(2, NA, 5, NA, 10), y = 1:5) db %>% filter(x < 5) %>% show_query() db %>% filter(is.na(x)) %>% show_query()
library(dplyr, warn.conflicts = FALSE) db <- memdb_frame(x = c(2, NA, 5, NA, 10), y = 1:5) db %>% filter(x < 5) %>% show_query() db %>% filter(is.na(x)) %>% show_query()
RETURNING
rowsget_returned_rows()
extracts the RETURNING
rows produced by
rows_insert()
, rows_append()
, rows_update()
, rows_upsert()
,
or rows_delete()
if these are called with the returning
argument.
An error is raised if this information is not available.
has_returned_rows()
checks if x
has stored RETURNING rows produced by
rows_insert()
, rows_append()
, rows_update()
, rows_upsert()
,
or rows_delete()
.
get_returned_rows(x) has_returned_rows(x)
get_returned_rows(x) has_returned_rows(x)
x |
A lazy tbl. |
For get_returned_rows()
, a tibble.
For has_returned_rows()
, a scalar logical.
library(dplyr) con <- DBI::dbConnect(RSQLite::SQLite(), ":memory:") DBI::dbExecute(con, "CREATE TABLE Info ( id INTEGER PRIMARY KEY AUTOINCREMENT, number INTEGER )") info <- tbl(con, "Info") rows1 <- copy_inline(con, data.frame(number = c(1, 5))) rows_insert(info, rows1, conflict = "ignore", in_place = TRUE) info # If the table has an auto incrementing primary key, you can use # the returning argument + `get_returned_rows()` its value rows2 <- copy_inline(con, data.frame(number = c(13, 27))) info <- rows_insert( info, rows2, conflict = "ignore", in_place = TRUE, returning = id ) info get_returned_rows(info)
library(dplyr) con <- DBI::dbConnect(RSQLite::SQLite(), ":memory:") DBI::dbExecute(con, "CREATE TABLE Info ( id INTEGER PRIMARY KEY AUTOINCREMENT, number INTEGER )") info <- tbl(con, "Info") rows1 <- copy_inline(con, data.frame(number = c(1, 5))) rows_insert(info, rows1, conflict = "ignore", in_place = TRUE) info # If the table has an auto incrementing primary key, you can use # the returning argument + `get_returned_rows()` its value rows2 <- copy_inline(con, data.frame(number = c(13, 27))) info <- rows_insert( info, rows2, conflict = "ignore", in_place = TRUE, returning = id ) info get_returned_rows(info)
This is a method for the dplyr group_by()
generic. It is translated to
the GROUP BY
clause of the SQL query when used with
summarise()
and to the PARTITION BY
clause of
window functions when used with mutate()
.
## S3 method for class 'tbl_lazy' group_by(.data, ..., .add = FALSE, add = deprecated(), .drop = TRUE)
## S3 method for class 'tbl_lazy' group_by(.data, ..., .add = FALSE, add = deprecated(), .drop = TRUE)
.data |
A lazy data frame backed by a database query. |
... |
< |
.add |
When This argument was previously called |
add |
Deprecated. Please use |
.drop |
Not supported by this method. |
library(dplyr, warn.conflicts = FALSE) db <- memdb_frame(g = c(1, 1, 1, 2, 2), x = c(4, 3, 6, 9, 2)) db %>% group_by(g) %>% summarise(n()) %>% show_query() db %>% group_by(g) %>% mutate(x2 = x / sum(x, na.rm = TRUE)) %>% show_query()
library(dplyr, warn.conflicts = FALSE) db <- memdb_frame(g = c(1, 1, 1, 2, 2), x = c(4, 3, 6, 9, 2)) db %>% group_by(g) %>% summarise(n()) %>% show_query() db %>% group_by(g) %>% mutate(x2 = x / sum(x, na.rm = TRUE)) %>% show_query()
This is a method for the head()
generic. It is usually translated to the
LIMIT
clause of the SQL query. Because LIMIT
is not an official part of
the SQL specification, some database use other clauses like TOP
or
FETCH ROWS
.
Note that databases don't really have a sense of row order, so what "first"
means is subject to interpretation. Most databases will respect ordering
performed with arrange()
, but it's not guaranteed. tail()
is not
supported at all because the situation is even murkier for the "last" rows.
## S3 method for class 'tbl_lazy' head(x, n = 6L, ...)
## S3 method for class 'tbl_lazy' head(x, n = 6L, ...)
x |
A lazy data frame backed by a database query. |
n |
Number of rows to return |
... |
Not used. |
Another tbl_lazy
. Use show_query()
to see the generated
query, and use collect()
to execute the query
and return data to R.
library(dplyr, warn.conflicts = FALSE) db <- memdb_frame(x = 1:100) db %>% head() %>% show_query() # Pretend we have data in a SQL server database db2 <- lazy_frame(x = 1:100, con = simulate_mssql()) db2 %>% head() %>% show_query()
library(dplyr, warn.conflicts = FALSE) db <- memdb_frame(x = 1:100) db %>% head() %>% show_query() # Pretend we have data in a SQL server database db2 <- lazy_frame(x = 1:100, con = simulate_mssql()) db2 %>% head() %>% show_query()
These are methods for the dplyr generics dplyr::intersect()
,
dplyr::union()
, and dplyr::setdiff()
. They are translated to
INTERSECT
, UNION
, and EXCEPT
respectively.
## S3 method for class 'tbl_lazy' intersect(x, y, copy = FALSE, ..., all = FALSE) ## S3 method for class 'tbl_lazy' union(x, y, copy = FALSE, ..., all = FALSE) ## S3 method for class 'tbl_lazy' union_all(x, y, copy = FALSE, ...) ## S3 method for class 'tbl_lazy' setdiff(x, y, copy = FALSE, ..., all = FALSE)
## S3 method for class 'tbl_lazy' intersect(x, y, copy = FALSE, ..., all = FALSE) ## S3 method for class 'tbl_lazy' union(x, y, copy = FALSE, ..., all = FALSE) ## S3 method for class 'tbl_lazy' union_all(x, y, copy = FALSE, ...) ## S3 method for class 'tbl_lazy' setdiff(x, y, copy = FALSE, ..., all = FALSE)
x , y
|
A pair of lazy data frames backed by database queries. |
copy |
If This allows you to join tables across srcs, but it's potentially expensive operation so you must opt into it. |
... |
Not currently used; provided for future extensions. |
all |
If |
These are methods for the dplyr join generics. They are translated to the following SQL queries:
inner_join(x, y)
: SELECT * FROM x JOIN y ON x.a = y.a
left_join(x, y)
: SELECT * FROM x LEFT JOIN y ON x.a = y.a
right_join(x, y)
: SELECT * FROM x RIGHT JOIN y ON x.a = y.a
full_join(x, y)
: SELECT * FROM x FULL JOIN y ON x.a = y.a
semi_join(x, y)
: SELECT * FROM x WHERE EXISTS (SELECT 1 FROM y WHERE x.a = y.a)
anti_join(x, y)
: SELECT * FROM x WHERE NOT EXISTS (SELECT 1 FROM y WHERE x.a = y.a)
## S3 method for class 'tbl_lazy' inner_join( x, y, by = NULL, copy = FALSE, suffix = NULL, ..., keep = NULL, na_matches = c("never", "na"), multiple = NULL, unmatched = "drop", relationship = NULL, sql_on = NULL, auto_index = FALSE, x_as = NULL, y_as = NULL ) ## S3 method for class 'tbl_lazy' left_join( x, y, by = NULL, copy = FALSE, suffix = NULL, ..., keep = NULL, na_matches = c("never", "na"), multiple = NULL, unmatched = "drop", relationship = NULL, sql_on = NULL, auto_index = FALSE, x_as = NULL, y_as = NULL ) ## S3 method for class 'tbl_lazy' right_join( x, y, by = NULL, copy = FALSE, suffix = NULL, ..., keep = NULL, na_matches = c("never", "na"), multiple = NULL, unmatched = "drop", relationship = NULL, sql_on = NULL, auto_index = FALSE, x_as = NULL, y_as = NULL ) ## S3 method for class 'tbl_lazy' full_join( x, y, by = NULL, copy = FALSE, suffix = NULL, ..., keep = NULL, na_matches = c("never", "na"), multiple = NULL, relationship = NULL, sql_on = NULL, auto_index = FALSE, x_as = NULL, y_as = NULL ) ## S3 method for class 'tbl_lazy' cross_join( x, y, ..., copy = FALSE, suffix = c(".x", ".y"), x_as = NULL, y_as = NULL ) ## S3 method for class 'tbl_lazy' semi_join( x, y, by = NULL, copy = FALSE, ..., na_matches = c("never", "na"), sql_on = NULL, auto_index = FALSE, x_as = NULL, y_as = NULL ) ## S3 method for class 'tbl_lazy' anti_join( x, y, by = NULL, copy = FALSE, ..., na_matches = c("never", "na"), sql_on = NULL, auto_index = FALSE, x_as = NULL, y_as = NULL )
## S3 method for class 'tbl_lazy' inner_join( x, y, by = NULL, copy = FALSE, suffix = NULL, ..., keep = NULL, na_matches = c("never", "na"), multiple = NULL, unmatched = "drop", relationship = NULL, sql_on = NULL, auto_index = FALSE, x_as = NULL, y_as = NULL ) ## S3 method for class 'tbl_lazy' left_join( x, y, by = NULL, copy = FALSE, suffix = NULL, ..., keep = NULL, na_matches = c("never", "na"), multiple = NULL, unmatched = "drop", relationship = NULL, sql_on = NULL, auto_index = FALSE, x_as = NULL, y_as = NULL ) ## S3 method for class 'tbl_lazy' right_join( x, y, by = NULL, copy = FALSE, suffix = NULL, ..., keep = NULL, na_matches = c("never", "na"), multiple = NULL, unmatched = "drop", relationship = NULL, sql_on = NULL, auto_index = FALSE, x_as = NULL, y_as = NULL ) ## S3 method for class 'tbl_lazy' full_join( x, y, by = NULL, copy = FALSE, suffix = NULL, ..., keep = NULL, na_matches = c("never", "na"), multiple = NULL, relationship = NULL, sql_on = NULL, auto_index = FALSE, x_as = NULL, y_as = NULL ) ## S3 method for class 'tbl_lazy' cross_join( x, y, ..., copy = FALSE, suffix = c(".x", ".y"), x_as = NULL, y_as = NULL ) ## S3 method for class 'tbl_lazy' semi_join( x, y, by = NULL, copy = FALSE, ..., na_matches = c("never", "na"), sql_on = NULL, auto_index = FALSE, x_as = NULL, y_as = NULL ) ## S3 method for class 'tbl_lazy' anti_join( x, y, by = NULL, copy = FALSE, ..., na_matches = c("never", "na"), sql_on = NULL, auto_index = FALSE, x_as = NULL, y_as = NULL )
x , y
|
A pair of lazy data frames backed by database queries. |
by |
A join specification created with If To join on different variables between To join by multiple variables, use a
For simple equality joins, you can alternatively specify a character vector
of variable names to join by. For example, To perform a cross-join, generating all combinations of |
copy |
If This allows you to join tables across srcs, but it's potentially expensive operation so you must opt into it. |
suffix |
If there are non-joined duplicate variables in |
... |
Other parameters passed onto methods. |
keep |
Should the join keys from both
|
na_matches |
Should NA (NULL) values match one another?
The default, "never", is how databases usually work. |
multiple , unmatched
|
Unsupported in database backends. As a workaround for multiple use a unique key and for unmatched a foreign key constraint. |
relationship |
Unsupported in database backends. |
sql_on |
A custom join predicate as an SQL expression.
Usually joins use column equality, but you can perform more complex
queries by supply |
auto_index |
if |
x_as , y_as
|
Alias to use for |
Another tbl_lazy
. Use show_query()
to see the generated
query, and use collect()
to execute the query
and return data to R.
library(dplyr, warn.conflicts = FALSE) band_db <- tbl_memdb(dplyr::band_members) instrument_db <- tbl_memdb(dplyr::band_instruments) band_db %>% left_join(instrument_db) %>% show_query() # Can join with local data frames by setting copy = TRUE band_db %>% left_join(dplyr::band_instruments, copy = TRUE) # Unlike R, joins in SQL don't usually match NAs (NULLs) db <- memdb_frame(x = c(1, 2, NA)) label <- memdb_frame(x = c(1, NA), label = c("one", "missing")) db %>% left_join(label, by = "x") # But you can activate R's usual behaviour with the na_matches argument db %>% left_join(label, by = "x", na_matches = "na") # By default, joins are equijoins, but you can use `sql_on` to # express richer relationships db1 <- memdb_frame(x = 1:5) db2 <- memdb_frame(x = 1:3, y = letters[1:3]) db1 %>% left_join(db2) %>% show_query() db1 %>% left_join(db2, sql_on = "LHS.x < RHS.x") %>% show_query()
library(dplyr, warn.conflicts = FALSE) band_db <- tbl_memdb(dplyr::band_members) instrument_db <- tbl_memdb(dplyr::band_instruments) band_db %>% left_join(instrument_db) %>% show_query() # Can join with local data frames by setting copy = TRUE band_db %>% left_join(dplyr::band_instruments, copy = TRUE) # Unlike R, joins in SQL don't usually match NAs (NULLs) db <- memdb_frame(x = c(1, 2, NA)) label <- memdb_frame(x = c(1, NA), label = c("one", "missing")) db %>% left_join(label, by = "x") # But you can activate R's usual behaviour with the na_matches argument db %>% left_join(label, by = "x", na_matches = "na") # By default, joins are equijoins, but you can use `sql_on` to # express richer relationships db1 <- memdb_frame(x = 1:5) db2 <- memdb_frame(x = 1:3, y = letters[1:3]) db1 %>% left_join(db2) %>% show_query() db1 %>% left_join(db2, sql_on = "LHS.x < RHS.x") %>% show_query()
memdb_frame()
works like tibble::tibble()
, but instead of creating a new
data frame in R, it creates a table in src_memdb()
.
memdb_frame(..., .name = unique_table_name()) tbl_memdb(df, name = deparse(substitute(df))) src_memdb()
memdb_frame(..., .name = unique_table_name()) tbl_memdb(df, name = deparse(substitute(df))) src_memdb()
... |
< Arguments are evaluated sequentially.
You can refer to previously created elements directly or using the .data
pronoun.
To refer explicitly to objects in the calling environment, use |
df |
Data frame to copy |
name , .name
|
Name of table in database: defaults to a random name that's unlikely to conflict with an existing table. |
library(dplyr) df <- memdb_frame(x = runif(100), y = runif(100)) df %>% arrange(x) df %>% arrange(x) %>% show_query() mtcars_db <- tbl_memdb(mtcars) mtcars_db %>% group_by(cyl) %>% summarise(n = n()) %>% show_query()
library(dplyr) df <- memdb_frame(x = runif(100), y = runif(100)) df %>% arrange(x) df %>% arrange(x) %>% show_query() mtcars_db <- tbl_memdb(mtcars) mtcars_db %>% group_by(cyl) %>% summarise(n = n()) %>% show_query()
These are methods for the dplyr mutate()
and transmute()
generics.
They are translated to computed expressions in the SELECT
clause of
the SQL query.
## S3 method for class 'tbl_lazy' mutate( .data, ..., .by = NULL, .keep = c("all", "used", "unused", "none"), .before = NULL, .after = NULL )
## S3 method for class 'tbl_lazy' mutate( .data, ..., .by = NULL, .keep = c("all", "used", "unused", "none"), .before = NULL, .after = NULL )
.data |
A lazy data frame backed by a database query. |
... |
< |
.by |
< |
.keep |
Control which columns from
|
.before , .after
|
< |
Another tbl_lazy
. Use show_query()
to see the generated
query, and use collect()
to execute the query
and return data to R.
library(dplyr, warn.conflicts = FALSE) db <- memdb_frame(x = 1:5, y = 5:1) db %>% mutate(a = (x + y) / 2, b = sqrt(x^2L + y^2L)) %>% show_query() # dbplyr automatically creates subqueries as needed db %>% mutate(x1 = x + 1, x2 = x1 * 2) %>% show_query()
library(dplyr, warn.conflicts = FALSE) db <- memdb_frame(x = 1:5, y = 5:1) db %>% mutate(a = (x + y) / 2, b = sqrt(x^2L + y^2L)) %>% show_query() # dbplyr automatically creates subqueries as needed db %>% mutate(x1 = x + 1, x2 = x1 * 2) %>% show_query()
pivot_longer()
"lengthens" data, increasing the number of rows and
decreasing the number of columns. The inverse transformation is
tidyr::pivot_wider()
.
Learn more in vignette("pivot", "tidyr")
.
While most functionality is identical there are some differences to
pivot_longer()
on local data frames:
the output is sorted differently/not explicitly,
the coercion of mixed column types is left to the database,
values_ptypes
NOT supported.
Note that build_longer_spec()
and pivot_longer_spec()
do not work with
remote tables.
## S3 method for class 'tbl_lazy' pivot_longer( data, cols, ..., cols_vary, names_to = "name", names_prefix = NULL, names_sep = NULL, names_pattern = NULL, names_ptypes = NULL, names_transform = NULL, names_repair = "check_unique", values_to = "value", values_drop_na = FALSE, values_ptypes, values_transform = NULL )
## S3 method for class 'tbl_lazy' pivot_longer( data, cols, ..., cols_vary, names_to = "name", names_prefix = NULL, names_sep = NULL, names_pattern = NULL, names_ptypes = NULL, names_transform = NULL, names_repair = "check_unique", values_to = "value", values_drop_na = FALSE, values_ptypes, values_transform = NULL )
data |
A data frame to pivot. |
cols |
Columns to pivot into longer format. |
... |
Additional arguments passed on to methods. |
cols_vary |
Unsupported; included for compatibility with the generic. |
names_to |
A string specifying the name of the column to create
from the data stored in the column names of |
names_prefix |
A regular expression used to remove matching text from the start of each variable name. |
names_sep , names_pattern
|
If |
names_ptypes |
A list of column name-prototype pairs. |
names_transform , values_transform
|
A list of column name-function pairs. |
names_repair |
What happens if the output has invalid column names? |
values_to |
A string specifying the name of the column to create
from the data stored in cell values. If |
values_drop_na |
If |
values_ptypes |
Not supported. |
The SQL translation basically works as follows:
split the specification by its key columns i.e. by variables crammed into the column names.
for each part in the split specification transmute()
data
into the
following columns
id columns i.e. columns that are not pivotted
key columns
value columns i.e. columns that are pivotted
combine all the parts with union_all()
# See vignette("pivot") for examples and explanation # Simplest case where column names are character data memdb_frame( id = c("a", "b"), x = 1:2, y = 3:4 ) %>% tidyr::pivot_longer(-id)
# See vignette("pivot") for examples and explanation # Simplest case where column names are character data memdb_frame( id = c("a", "b"), x = 1:2, y = 3:4 ) %>% tidyr::pivot_longer(-id)
pivot_wider()
"widens" data, increasing the number of columns and
decreasing the number of rows. The inverse transformation is
pivot_longer()
.
Learn more in vignette("pivot", "tidyr")
.
Note that pivot_wider()
is not and cannot be lazy because we need to look
at the data to figure out what the new column names will be.
If you have a long running query you have two options:
(temporarily) store the result of the query via compute()
.
Create a spec before and use dbplyr_pivot_wider_spec()
- dbplyr's version
of tidyr::pivot_wider_spec()
. Note that this function is only a temporary
solution until pivot_wider_spec()
becomes a generic. It will then be
removed soon afterwards.
## S3 method for class 'tbl_lazy' pivot_wider( data, ..., id_cols = NULL, id_expand = FALSE, names_from = name, names_prefix = "", names_sep = "_", names_glue = NULL, names_sort = FALSE, names_vary = "fastest", names_expand = FALSE, names_repair = "check_unique", values_from = value, values_fill = NULL, values_fn = ~max(.x, na.rm = TRUE), unused_fn = NULL ) dbplyr_pivot_wider_spec( data, spec, ..., names_repair = "check_unique", id_cols = NULL, id_expand = FALSE, values_fill = NULL, values_fn = ~max(.x, na.rm = TRUE), unused_fn = NULL, error_call = current_env() )
## S3 method for class 'tbl_lazy' pivot_wider( data, ..., id_cols = NULL, id_expand = FALSE, names_from = name, names_prefix = "", names_sep = "_", names_glue = NULL, names_sort = FALSE, names_vary = "fastest", names_expand = FALSE, names_repair = "check_unique", values_from = value, values_fill = NULL, values_fn = ~max(.x, na.rm = TRUE), unused_fn = NULL ) dbplyr_pivot_wider_spec( data, spec, ..., names_repair = "check_unique", id_cols = NULL, id_expand = FALSE, values_fill = NULL, values_fn = ~max(.x, na.rm = TRUE), unused_fn = NULL, error_call = current_env() )
data |
A lazy data frame backed by a database query. |
... |
Unused; included for compatibility with generic. |
id_cols |
A set of columns that uniquely identifies each observation. |
id_expand |
Unused; included for compatibility with the generic. |
names_from , values_from
|
A pair of
arguments describing which column (or columns) to get the name of the
output column ( If |
names_prefix |
String added to the start of every variable name. |
names_sep |
If |
names_glue |
Instead of |
names_sort |
Should the column names be sorted? If |
names_vary |
When
|
names_expand |
Should the values in the |
names_repair |
What happens if the output has invalid column names? |
values_fill |
Optionally, a (scalar) value that specifies what each
|
values_fn |
A function, the default is |
unused_fn |
Optionally, a function applied to summarize the values from
the unused columns (i.e. columns not identified by The default drops all unused columns from the result. This can be a named list if you want to apply different aggregations to different unused columns.
This is similar to grouping by the |
spec |
A specification data frame. This is useful for more complex pivots because it gives you greater control on how metadata stored in the columns become column names in the result. Must be a data frame containing character |
error_call |
The execution environment of a currently
running function, e.g. |
The big difference to pivot_wider()
for local data frames is that
values_fn
must not be NULL
. By default it is max()
which yields
the same results as for local data frames if the combination of id_cols
and value
column uniquely identify an observation.
Mind that you also do not get a warning if an observation is not uniquely
identified.
The translation to SQL code basically works as follows:
Get unique keys in names_from
column.
For each key value generate an expression of the form:
value_fn( CASE WHEN (`names from column` == `key value`) THEN (`value column`) END ) AS `output column`
Group data by id columns.
Summarise the grouped data with the expressions from step 2.
memdb_frame( id = 1, key = c("x", "y"), value = 1:2 ) %>% tidyr::pivot_wider( id_cols = id, names_from = key, values_from = value )
memdb_frame( id = 1, key = c("x", "y"), value = 1:2 ) %>% tidyr::pivot_wider( id_cols = id, names_from = key, values_from = value )
This is a method for the dplyr pull()
generic. It evaluates the query
retrieving just the specified column.
## S3 method for class 'tbl_sql' pull(.data, var = -1, name = NULL, ...)
## S3 method for class 'tbl_sql' pull(.data, var = -1, name = NULL, ...)
.data |
A lazy data frame backed by a database query. |
var |
A variable specified as:
The default returns the last column (on the assumption that's the column you've created most recently). This argument is taken by expression and supports quasiquotation (you can unquote column names and column locations). |
name |
An optional parameter that specifies the column to be used
as names for a named vector. Specified in a similar manner as |
... |
< |
A vector of data.
library(dplyr, warn.conflicts = FALSE) db <- memdb_frame(x = 1:5, y = 5:1) db %>% mutate(z = x + y * 2) %>% pull()
library(dplyr, warn.conflicts = FALSE) db <- memdb_frame(x = 1:5, y = 5:1) db %>% mutate(z = x + y * 2) %>% pull()
remote_name()
gives the unescaped name of the remote table, or NULL
if it
is a query (created by sql()
) or already escape (created by ident_q()
).
remote_table()
gives the remote table or the query.
remote_query()
gives the text of the query, and remote_query_plan()
the query plan (as computed by the remote database). remote_src()
and
remote_con()
give the dplyr source and DBI connection respectively.
remote_name(x, null_if_local = TRUE) remote_table(x, null_if_local = TRUE) remote_src(x) remote_con(x) remote_query(x, cte = FALSE, sql_options = NULL) remote_query_plan(x, ...)
remote_name(x, null_if_local = TRUE) remote_table(x, null_if_local = TRUE) remote_src(x) remote_con(x) remote_query(x, cte = FALSE, sql_options = NULL) remote_query_plan(x, ...)
x |
Remote table, currently must be a tbl_sql. |
null_if_local |
Return |
cte |
|
sql_options |
|
... |
Additional arguments passed on to methods. |
The value, or NULL
if not remote table, or not applicable.
For example, computed queries do not have a "name"
mf <- memdb_frame(x = 1:5, y = 5:1, .name = "blorp") remote_name(mf) remote_src(mf) remote_con(mf) remote_query(mf) mf2 <- dplyr::filter(mf, x > 3) remote_name(mf2) remote_src(mf2) remote_con(mf2) remote_query(mf2)
mf <- memdb_frame(x = 1:5, y = 5:1, .name = "blorp") remote_name(mf) remote_src(mf) remote_con(mf) remote_query(mf) mf2 <- dplyr::filter(mf, x > 3) remote_name(mf2) remote_src(mf2) remote_con(mf2) remote_query(mf2)
This is a method for the tidyr::replace_na()
generic.
## S3 method for class 'tbl_lazy' replace_na(data, replace = list(), ...)
## S3 method for class 'tbl_lazy' replace_na(data, replace = list(), ...)
data |
A pair of lazy data frame backed by database queries. |
replace |
A named list of values, with one value for each column that has NA values to be replaced. |
... |
Unused; included for compatibility with generic. |
Another tbl_lazy
. Use show_query()
to see the generated
query, and use collect()
to execute the query
and return data to R.
df <- memdb_frame(x = c(1, 2, NA), y = c("a", NA, "b")) df %>% tidyr::replace_na(list(x = 0, y = "unknown"))
df <- memdb_frame(x = c(1, 2, NA), y = c("a", NA, "b")) df %>% tidyr::replace_na(list(x = 0, y = "unknown"))
These are methods for the dplyr rows_insert()
, rows_append()
,
rows_update()
, rows_patch()
, rows_upsert()
, and rows_delete()
generics.
When in_place = TRUE
these verbs do not generate SELECT
queries, but
instead directly modify the underlying data using INSERT
, UPDATE
, or
DELETE
operators. This will require that you have write access to
the database: the connection needs permission to insert, modify or delete
rows, but not to alter the structure of the table.
The default, in_place = FALSE
, generates equivalent lazy tables (using
SELECT
queries) that allow previewing the result without actually
modifying the underlying table on the database.
## S3 method for class 'tbl_lazy' rows_insert( x, y, by = NULL, ..., conflict = c("error", "ignore"), copy = FALSE, in_place = FALSE, returning = NULL, method = NULL ) ## S3 method for class 'tbl_lazy' rows_append(x, y, ..., copy = FALSE, in_place = FALSE, returning = NULL) ## S3 method for class 'tbl_lazy' rows_update( x, y, by = NULL, ..., unmatched = c("error", "ignore"), copy = FALSE, in_place = FALSE, returning = NULL ) ## S3 method for class 'tbl_lazy' rows_patch( x, y, by = NULL, ..., unmatched = c("error", "ignore"), copy = FALSE, in_place = FALSE, returning = NULL ) ## S3 method for class 'tbl_lazy' rows_upsert( x, y, by = NULL, ..., copy = FALSE, in_place = FALSE, returning = NULL, method = NULL ) ## S3 method for class 'tbl_lazy' rows_delete( x, y, by = NULL, ..., unmatched = c("error", "ignore"), copy = FALSE, in_place = FALSE, returning = NULL )
## S3 method for class 'tbl_lazy' rows_insert( x, y, by = NULL, ..., conflict = c("error", "ignore"), copy = FALSE, in_place = FALSE, returning = NULL, method = NULL ) ## S3 method for class 'tbl_lazy' rows_append(x, y, ..., copy = FALSE, in_place = FALSE, returning = NULL) ## S3 method for class 'tbl_lazy' rows_update( x, y, by = NULL, ..., unmatched = c("error", "ignore"), copy = FALSE, in_place = FALSE, returning = NULL ) ## S3 method for class 'tbl_lazy' rows_patch( x, y, by = NULL, ..., unmatched = c("error", "ignore"), copy = FALSE, in_place = FALSE, returning = NULL ) ## S3 method for class 'tbl_lazy' rows_upsert( x, y, by = NULL, ..., copy = FALSE, in_place = FALSE, returning = NULL, method = NULL ) ## S3 method for class 'tbl_lazy' rows_delete( x, y, by = NULL, ..., unmatched = c("error", "ignore"), copy = FALSE, in_place = FALSE, returning = NULL )
x |
A lazy table.
For |
y |
A lazy table, data frame, or data frame extensions (e.g. a tibble). |
by |
An unnamed character vector giving the key columns. The key columns
must exist in both By default, we use the first column in |
... |
Other parameters passed onto methods. |
conflict |
For One of:
|
copy |
If |
in_place |
Should |
returning |
Columns to return. See |
method |
A string specifying the method to use. This is only relevant for
|
unmatched |
For One of:
|
A new tbl_lazy
of the modified data.
With in_place = FALSE
, the result is a lazy query that prints visibly,
because the purpose of this operation is to preview the results.
With in_place = TRUE
, x
is returned invisibly,
because the purpose of this operation is the side effect of modifying rows
in the table behind x
.
library(dplyr) con <- DBI::dbConnect(RSQLite::SQLite(), ":memory:") DBI::dbExecute(con, "CREATE TABLE Ponies ( id INTEGER PRIMARY KEY AUTOINCREMENT, name TEXT, cutie_mark TEXT )") ponies <- tbl(con, "Ponies") applejack <- copy_inline(con, data.frame( name = "Apple Jack", cutie_mark = "three apples" )) # The default behavior is to generate a SELECT query rows_insert(ponies, applejack, conflict = "ignore") # And the original table is left unchanged: ponies # You can also choose to modify the table with in_place = TRUE: rows_insert(ponies, applejack, conflict = "ignore", in_place = TRUE) # In this case `rows_insert()` returns nothing and the underlying # data is modified ponies
library(dplyr) con <- DBI::dbConnect(RSQLite::SQLite(), ":memory:") DBI::dbExecute(con, "CREATE TABLE Ponies ( id INTEGER PRIMARY KEY AUTOINCREMENT, name TEXT, cutie_mark TEXT )") ponies <- tbl(con, "Ponies") applejack <- copy_inline(con, data.frame( name = "Apple Jack", cutie_mark = "three apples" )) # The default behavior is to generate a SELECT query rows_insert(ponies, applejack, conflict = "ignore") # And the original table is left unchanged: ponies # You can also choose to modify the table with in_place = TRUE: rows_insert(ponies, applejack, conflict = "ignore", in_place = TRUE) # In this case `rows_insert()` returns nothing and the underlying # data is modified ponies
These are methods for the dplyr select()
, rename()
, and relocate()
generics. They generate the SELECT
clause of the SQL query.
These functions do not support predicate functions, i.e. you can
not use where(is.numeric)
to select all numeric variables.
## S3 method for class 'tbl_lazy' select(.data, ...) ## S3 method for class 'tbl_lazy' rename(.data, ...) ## S3 method for class 'tbl_lazy' rename_with(.data, .fn, .cols = everything(), ...) ## S3 method for class 'tbl_lazy' relocate(.data, ..., .before = NULL, .after = NULL)
## S3 method for class 'tbl_lazy' select(.data, ...) ## S3 method for class 'tbl_lazy' rename(.data, ...) ## S3 method for class 'tbl_lazy' rename_with(.data, .fn, .cols = everything(), ...) ## S3 method for class 'tbl_lazy' relocate(.data, ..., .before = NULL, .after = NULL)
.data |
A lazy data frame backed by a database query. |
... |
< |
.fn |
A function used to transform the selected |
.cols |
< |
.before , .after
|
< |
library(dplyr, warn.conflicts = FALSE) db <- memdb_frame(x = 1, y = 2, z = 3) db %>% select(-y) %>% show_query() db %>% relocate(z) %>% show_query() db %>% rename(first = x, last = z) %>% show_query()
library(dplyr, warn.conflicts = FALSE) db <- memdb_frame(x = 1, y = 2, z = 3) db %>% select(-y) %>% show_query() db %>% relocate(z) %>% show_query() db %>% rename(first = x, last = z) %>% show_query()
These functions are critical when writing functions that translate R functions to sql functions. Typically a conversion function should escape all its inputs and return an sql object.
sql(...) is.sql(x) as.sql(x, con)
sql(...) is.sql(x) as.sql(x, con)
... |
Character vectors that will be combined into a single SQL expression. |
x |
Object to coerce |
con |
Needed when |
Options for generating SQL
sql_options(cte = FALSE, use_star = TRUE, qualify_all_columns = FALSE)
sql_options(cte = FALSE, use_star = TRUE, qualify_all_columns = FALSE)
cte |
If |
use_star |
If |
qualify_all_columns |
If |
A <dbplyr_sql_options> object.
library(dplyr, warn.conflicts = FALSE) lf1 <- lazy_frame(key = 1, a = 1, b = 2) lf2 <- lazy_frame(key = 1, a = 1, c = 3) result <- left_join(lf1, lf2, by = "key") %>% filter(c >= 3) show_query(result) sql_options <- sql_options(cte = TRUE, qualify_all_columns = TRUE) show_query(result, sql_options = sql_options)
library(dplyr, warn.conflicts = FALSE) lf1 <- lazy_frame(key = 1, a = 1, b = 2) lf2 <- lazy_frame(key = 1, a = 1, c = 3) result <- left_join(lf1, lf2, by = "key") %>% filter(c >= 3) show_query(result) sql_options <- sql_options(cte = TRUE, qualify_all_columns = TRUE) show_query(result, sql_options = sql_options)
These functions generate the SQL used in rows_*(in_place = TRUE)
.
sql_query_insert( con, table, from, insert_cols, by, ..., conflict = c("error", "ignore"), returning_cols = NULL, method = NULL ) sql_query_append(con, table, from, insert_cols, ..., returning_cols = NULL) sql_query_update_from( con, table, from, by, update_values, ..., returning_cols = NULL ) sql_query_upsert( con, table, from, by, update_cols, ..., returning_cols = NULL, method = NULL ) sql_query_delete(con, table, from, by, ..., returning_cols = NULL)
sql_query_insert( con, table, from, insert_cols, by, ..., conflict = c("error", "ignore"), returning_cols = NULL, method = NULL ) sql_query_append(con, table, from, insert_cols, ..., returning_cols = NULL) sql_query_update_from( con, table, from, by, update_values, ..., returning_cols = NULL ) sql_query_upsert( con, table, from, by, update_cols, ..., returning_cols = NULL, method = NULL ) sql_query_delete(con, table, from, by, ..., returning_cols = NULL)
con |
Database connection. |
table |
Table to update. Must be a table identifier.
Use a string to refer to tables in the current schema/catalog or
|
from |
Table or query that contains the new data. Either a table identifier or SQL. |
insert_cols |
Names of columns to insert. |
by |
An unnamed character vector giving the key columns. The key columns
must exist in both By default, we use the first column in |
... |
Other parameters passed onto methods. |
conflict |
For One of:
|
returning_cols |
Optional. Names of columns to return. |
method |
Optional. The method to use. |
update_values |
A named SQL vector that specify how to update the columns. |
update_cols |
Names of columns to update. |
Insert Methods
"where_not_exists"
The default for most databases.
INSERT INTO x_name SELECT * FROM y WHERE NOT EXISTS <match on by columns>
"on_conflict"
Supported by:
Postgres
SQLite
This method uses the ON CONFLICT
clause and therefore requires a unique
index on the columns specified in by
.
Upsert Methods
"merge"
The upsert method according to the SQL standard. It uses the MERGE
statement
MERGE INTO x_name USING y ON <match on by columns> WHEN MATCHED THEN UPDATE SET ... WHEN NOT MATCHED THEN INSERT ...
"on_conflict"
Supported by:
Postgres
SQLite
This method uses the ON CONFLICT
clause and therefore requires a unique
index on the columns specified in by
.
"cte_update"
Supported by:
Postgres
SQLite
Oracle
The classical way to upsert in Postgres and SQLite before support for
ON CONFLICT
was added. The update is done in a CTE clause and the unmatched
values are then inserted outside of the CTE.
A SQL query.
sql_query_upsert( con = simulate_postgres(), table = ident("airlines"), from = ident("df"), by = "carrier", update_cols = "name" )
sql_query_upsert( con = simulate_postgres(), table = ident("airlines"), from = ident("df"), by = "carrier", update_cols = "name" )
This is a method for the dplyr summarise()
generic. It generates the
SELECT
clause of the SQL query, and generally needs to be combined with
group_by()
.
## S3 method for class 'tbl_lazy' summarise(.data, ..., .by = NULL, .groups = NULL)
## S3 method for class 'tbl_lazy' summarise(.data, ..., .by = NULL, .groups = NULL)
.data |
A lazy data frame backed by a database query. |
... |
< |
.by |
< |
.groups |
Grouping structure of the result.
When In addition, a message informs you of that choice, unless the result is ungrouped,
the option "dplyr.summarise.inform" is set to |
Another tbl_lazy
. Use show_query()
to see the generated
query, and use collect()
to execute the query
and return data to R.
library(dplyr, warn.conflicts = FALSE) db <- memdb_frame(g = c(1, 1, 1, 2, 2), x = c(4, 3, 6, 9, 2)) db %>% summarise(n()) %>% show_query() db %>% group_by(g) %>% summarise(n()) %>% show_query()
library(dplyr, warn.conflicts = FALSE) db <- memdb_frame(g = c(1, 1, 1, 2, 2), x = c(4, 3, 6, 9, 2)) db %>% summarise(n()) %>% show_query() db %>% group_by(g) %>% summarise(n()) %>% show_query()
All data manipulation on SQL tbls are lazy: they will not actually
run the query or retrieve the data unless you ask for it: they all return
a new tbl_dbi
object. Use compute()
to run the query and save the
results in a temporary table in the database, or use collect()
to retrieve the
results to R. You can see the query with show_query()
.
## S3 method for class 'src_dbi' tbl(src, from, ...)
## S3 method for class 'src_dbi' tbl(src, from, ...)
src |
A |
from |
Either a table identifier or a literal Use a string to identify a table in the current schema/catalog. We
recommend using |
... |
Passed on to |
For best performance, the database should have an index on the variables
that you are grouping by. Use explain()
to check that the database is using
the indexes that you expect.
There is one verb that is not lazy: do()
is eager because it must pull
the data into R.
library(dplyr) # Connect to a temporary in-memory SQLite database con <- DBI::dbConnect(RSQLite::SQLite(), ":memory:") # Add some data copy_to(con, mtcars) DBI::dbListTables(con) # To retrieve a single table from a source, use `tbl()` con %>% tbl("mtcars") # Use `I()` for qualified table names con %>% tbl(I("temp.mtcars")) %>% head(1) # You can also use pass raw SQL if you want a more sophisticated query con %>% tbl(sql("SELECT * FROM mtcars WHERE cyl = 8")) # If you just want a temporary in-memory database, use src_memdb() src2 <- src_memdb() # To show off the full features of dplyr's database integration, # we'll use the Lahman database. lahman_sqlite() takes care of # creating the database. if (requireNamespace("Lahman", quietly = TRUE)) { batting <- copy_to(con, Lahman::Batting) batting # Basic data manipulation verbs work in the same way as with a tibble batting %>% filter(yearID > 2005, G > 130) batting %>% select(playerID:lgID) batting %>% arrange(playerID, desc(yearID)) batting %>% summarise(G = mean(G), n = n()) # There are a few exceptions. For example, databases give integer results # when dividing one integer by another. Multiply by 1 to fix the problem batting %>% select(playerID:lgID, AB, R, G) %>% mutate( R_per_game1 = R / G, R_per_game2 = R * 1.0 / G ) # All operations are lazy: they don't do anything until you request the # data, either by `print()`ing it (which shows the first ten rows), # or by `collect()`ing the results locally. system.time(recent <- filter(batting, yearID > 2010)) system.time(collect(recent)) # You can see the query that dplyr creates with show_query() batting %>% filter(G > 0) %>% group_by(playerID) %>% summarise(n = n()) %>% show_query() }
library(dplyr) # Connect to a temporary in-memory SQLite database con <- DBI::dbConnect(RSQLite::SQLite(), ":memory:") # Add some data copy_to(con, mtcars) DBI::dbListTables(con) # To retrieve a single table from a source, use `tbl()` con %>% tbl("mtcars") # Use `I()` for qualified table names con %>% tbl(I("temp.mtcars")) %>% head(1) # You can also use pass raw SQL if you want a more sophisticated query con %>% tbl(sql("SELECT * FROM mtcars WHERE cyl = 8")) # If you just want a temporary in-memory database, use src_memdb() src2 <- src_memdb() # To show off the full features of dplyr's database integration, # we'll use the Lahman database. lahman_sqlite() takes care of # creating the database. if (requireNamespace("Lahman", quietly = TRUE)) { batting <- copy_to(con, Lahman::Batting) batting # Basic data manipulation verbs work in the same way as with a tibble batting %>% filter(yearID > 2005, G > 130) batting %>% select(playerID:lgID) batting %>% arrange(playerID, desc(yearID)) batting %>% summarise(G = mean(G), n = n()) # There are a few exceptions. For example, databases give integer results # when dividing one integer by another. Multiply by 1 to fix the problem batting %>% select(playerID:lgID, AB, R, G) %>% mutate( R_per_game1 = R / G, R_per_game2 = R * 1.0 / G ) # All operations are lazy: they don't do anything until you request the # data, either by `print()`ing it (which shows the first ten rows), # or by `collect()`ing the results locally. system.time(recent <- filter(batting, yearID > 2010)) system.time(collect(recent)) # You can see the query that dplyr creates with show_query() batting %>% filter(G > 0) %>% group_by(playerID) %>% summarise(n = n()) %>% show_query() }
dbplyr translates commonly used base functions including logical
(!
, &
, |
), arithmetic (^
), and comparison (!=
) operators, as well
as common summary (mean()
, var()
), and transformation (log()
)
functions. All other functions will be preserved as is. R's infix functions
(e.g. %like%
) will be converted to their SQL equivalents (e.g. LIKE
).
Learn more in vignette("translation-function")
.
translate_sql( ..., con, vars_group = NULL, vars_order = NULL, vars_frame = NULL, window = TRUE ) translate_sql_( dots, con, vars_group = NULL, vars_order = NULL, vars_frame = NULL, window = TRUE, context = list() )
translate_sql( ..., con, vars_group = NULL, vars_order = NULL, vars_frame = NULL, window = TRUE ) translate_sql_( dots, con, vars_group = NULL, vars_order = NULL, vars_frame = NULL, window = TRUE, context = list() )
... , dots
|
Expressions to translate. |
con |
An optional database connection to control the details of
the translation. The default, |
vars_group , vars_order , vars_frame
|
Parameters used in the |
window |
Use |
context |
Use to carry information for special translation cases. For example, MS SQL needs a different conversion for is.na() in WHERE vs. SELECT clauses. Expects a list. |
con <- simulate_dbi() # Regular maths is translated in a very straightforward way translate_sql(x + 1, con = con) translate_sql(sin(x) + tan(y), con = con) # Note that all variable names are escaped translate_sql(like == "x", con = con) # In ANSI SQL: "" quotes variable _names_, '' quotes strings # Logical operators are converted to their sql equivalents translate_sql(x < 5 & !(y >= 5), con = con) # xor() doesn't have a direct SQL equivalent translate_sql(xor(x, y), con = con) # If is translated into case when translate_sql(if (x > 5) "big" else "small", con = con) # Infix functions are passed onto SQL with % removed translate_sql(first %like% "Had%", con = con) translate_sql(first %is% NA, con = con) translate_sql(first %in% c("John", "Roger", "Robert"), con = con) # And be careful if you really want integers translate_sql(x == 1, con = con) translate_sql(x == 1L, con = con) # If you have an already quoted object, use translate_sql_: x <- quote(y + 1 / sin(t)) translate_sql_(list(x), con = simulate_dbi()) # Windowed translation -------------------------------------------- # Known window functions automatically get OVER() translate_sql(mpg > mean(mpg), con = con) # Suppress this with window = FALSE translate_sql(mpg > mean(mpg), window = FALSE, con = con) # vars_group controls partition: translate_sql(mpg > mean(mpg), vars_group = "cyl", con = con) # and vars_order controls ordering for those functions that need it translate_sql(cumsum(mpg), con = con) translate_sql(cumsum(mpg), vars_order = "mpg", con = con)
con <- simulate_dbi() # Regular maths is translated in a very straightforward way translate_sql(x + 1, con = con) translate_sql(sin(x) + tan(y), con = con) # Note that all variable names are escaped translate_sql(like == "x", con = con) # In ANSI SQL: "" quotes variable _names_, '' quotes strings # Logical operators are converted to their sql equivalents translate_sql(x < 5 & !(y >= 5), con = con) # xor() doesn't have a direct SQL equivalent translate_sql(xor(x, y), con = con) # If is translated into case when translate_sql(if (x > 5) "big" else "small", con = con) # Infix functions are passed onto SQL with % removed translate_sql(first %like% "Had%", con = con) translate_sql(first %is% NA, con = con) translate_sql(first %in% c("John", "Roger", "Robert"), con = con) # And be careful if you really want integers translate_sql(x == 1, con = con) translate_sql(x == 1L, con = con) # If you have an already quoted object, use translate_sql_: x <- quote(y + 1 / sin(t)) translate_sql_(list(x), con = simulate_dbi()) # Windowed translation -------------------------------------------- # Known window functions automatically get OVER() translate_sql(mpg > mean(mpg), con = con) # Suppress this with window = FALSE translate_sql(mpg > mean(mpg), window = FALSE, con = con) # vars_group controls partition: translate_sql(mpg > mean(mpg), vars_group = "cyl", con = con) # and vars_order controls ordering for those functions that need it translate_sql(cumsum(mpg), con = con) translate_sql(cumsum(mpg), vars_order = "mpg", con = con)
These allow you to override the PARTITION BY
and ORDER BY
clauses
of window functions generated by grouped mutates.
window_order(.data, ...) window_frame(.data, from = -Inf, to = Inf)
window_order(.data, ...) window_frame(.data, from = -Inf, to = Inf)
.data |
A lazy data frame backed by a database query. |
... |
Variables to order by |
from , to
|
Bounds of the frame. |
library(dplyr, warn.conflicts = FALSE) db <- memdb_frame(g = rep(1:2, each = 5), y = runif(10), z = 1:10) db %>% window_order(y) %>% mutate(z = cumsum(y)) %>% show_query() db %>% group_by(g) %>% window_frame(-3, 0) %>% window_order(z) %>% mutate(z = sum(y)) %>% show_query()
library(dplyr, warn.conflicts = FALSE) db <- memdb_frame(g = rep(1:2, each = 5), y = runif(10), z = 1:10) db %>% window_order(y) %>% mutate(z = cumsum(y)) %>% show_query() db %>% group_by(g) %>% window_frame(-3, 0) %>% window_order(z) %>% mutate(z = sum(y)) %>% show_query()