--- title: "Translation" output: rmarkdown::html_vignette vignette: > %\VignetteIndexEntry{translation} %\VignetteEngine{knitr::rmarkdown} %\VignetteEncoding{UTF-8} --- ```{r, include = FALSE} knitr::opts_chunk$set( collapse = TRUE, comment = "#>" ) ``` ## Introduction This vignette shows the details of how dtplyr translates dplyr expressions into the equivalent [data.table](http://r-datatable.com/) code. If you see places where you think I could generate better data.table code, please [let me know](https://github.com/tidyverse/dtplyr/issues)! This document assumes that you're familiar with the basics of data.table; if you're not, I recommend starting at `vignette("datatable-intro.html")`. ```{r setup, message = FALSE} library(dtplyr) library(data.table) library(dplyr) ``` ## The basics To get started, I'll create a simple lazy table with `lazy_dt()`: ```{r} df <- data.frame(a = 1:5, b = 1:5, c = 1:5, d = 1:5) dt <- lazy_dt(df) ``` The actual data doesn't matter here since we're just looking at the translation. When you print a lazy frame, it tells you that it's a local data table with four rows. It also prints the call that dtplyr will evaluate when we execute the lazy table. In this case it's very simple: ```{r} dt ``` If we just want to see the generated code, you can use `show_query()`. I'll use that a lot in this vignette. ```{r} dt %>% show_query() ``` ## Simple verbs Many dplyr verbs have a straightforward translation to either the `i` or `j` component of `[.data.table`. ### `filter()` and `arrange()` `filter()` and `arrange()` become elements of `i`: ```{r} dt %>% arrange(a, b, c) %>% show_query() dt %>% filter(b == c) %>% show_query() dt %>% filter(b == c, c == d) %>% show_query() ``` ### `select()`, `summarise()`, `transmute()` `select()`, `summarise()` and `transmute()` all become elements of `j`: ```{r} dt %>% select(a:b) %>% show_query() dt %>% summarise(a = mean(a)) %>% show_query() dt %>% transmute(a2 = a * 2) %>% show_query() ``` `mutate()` also uses the `j` component with data.table's special `:=` operator: ```{r} dt %>% mutate(a2 = a * 2, b2 = b * 2) %>% show_query() ``` Note that dplyr will not copy the input data by default, see below for more details. `mutate()` allows to refer to variables that you just created using an "extended `j`" expression: ```{r} dt %>% mutate(a2 = a * 2, b2 = b * 2, a4 = a2 * 2) %>% show_query() ``` `transmute()` works similarly: ```{r} dt %>% transmute(a2 = a * 2, b2 = b * 2, a4 = a2 * 2) %>% show_query() ``` ## Other calls Other verbs require calls to other functions: ### `rename()` `rename()` uses `setnames()`: ```{r} dt %>% rename(x = a, y = b) %>% show_query() ``` ### `distinct()` `distinct()` uses `unique()`: ```{r} dt %>% distinct() %>% show_query() dt %>% distinct(a, b) %>% show_query() dt %>% distinct(a, b, .keep_all = TRUE) %>% show_query() ``` `distinct()` on a computed column uses an intermediate mutate: ```{r} dt %>% distinct(c = a + b) %>% show_query() dt %>% distinct(c = a + b, .keep_all = TRUE) %>% show_query() ``` ### Joins Most joins use the `[.data.table` equivalent: ```{r} dt2 <- lazy_dt(data.frame(a = 1)) dt %>% inner_join(dt2, by = "a") %>% show_query() dt %>% right_join(dt2, by = "a") %>% show_query() dt %>% left_join(dt2, by = "a") %>% show_query() dt %>% anti_join(dt2, by = "a") %>% show_query() ``` But `full_join()` uses `merge()` ```{r} dt %>% full_join(dt2, by = "a") %>% show_query() ``` In some case extra calls to `data.table::setcolorder()` and `data.table::setnames()` are required to ensure correct column order and names in: ```{r} dt3 <- lazy_dt(data.frame(b = 1, a = 1)) dt %>% left_join(dt3, by = "a") %>% show_query() dt %>% full_join(dt3, by = "b") %>% show_query() ``` Semi-joins are little more complex: ```{r} dt %>% semi_join(dt2, by = "a") %>% show_query() ``` ### Set operations Set operations use the fast data.table alternatives: ```{r} dt %>% intersect(dt2) %>% show_query() dt %>% setdiff(dt2) %>% show_query() dt %>% union(dt2) %>% show_query() ``` ## Grouping Just like in dplyr, `group_by()` doesn't do anything by itself, but instead modifies the operation of downstream verbs. This generally just involves using the `keyby` argument: ```{r} dt %>% group_by(a) %>% summarise(b = mean(b)) %>% show_query() ``` You may use `by` instead of `keyby` if you set `arrange = FALSE`: ```{r} dt %>% group_by(a, arrange = FALSE) %>% summarise(b = mean(b)) %>% show_query() ``` Often, there won't be too much of a difference between these, but for larger grouped operations, the overhead of reordering data may become significant. In these situations, using `arrange = FALSE` becomes preferable. The primary exception is grouped `filter()`, which requires the use of `.SD`: ```{r} dt %>% group_by(a) %>% filter(b < mean(b)) %>% show_query() ``` ## Combinations dtplyr tries to generate generate data.table code as close as possible to what you'd write by hand, as this tends to unlock data.table's tremendous speed. For example, if you `filter()` and then `select()`, dtplyr generates a single `[`: ```{r} dt %>% filter(a == 1) %>% select(-a) %>% show_query() ``` And similarly when combining filtering and summarising: ```{r} dt %>% group_by(a) %>% filter(b < mean(b)) %>% summarise(c = max(c)) %>% show_query() ``` This is particularly nice when joining two tables together because you can select variables after you have joined and data.table will only carry those into the join: ```{r} dt3 <- lazy_dt(data.frame(x = 1, y = 2)) dt4 <- lazy_dt(data.frame(x = 1, a = 2, b = 3, c = 4, d = 5, e = 7)) dt3 %>% left_join(dt4) %>% select(x, a:c) %>% show_query() ``` Note, however, that `select()`ing and then `filter()`ing must generate two separate calls to `[`, because data.table evaluates `i` before `j`. ```{r} dt %>% select(X = a, Y = b) %>% filter(X == 1) %>% show_query() ``` Similarly, a `filter()` and `mutate()` can't be combined because `dt[a == 1, .(b2 := b * 2)]` would modify the selected rows in place: ```{r} dt %>% filter(a == 1) %>% mutate(b2 = b * 2) %>% show_query() ``` ## Copies By default dtplyr avoids mutating the input data, automatically creating a `copy()` if needed: ```{r} dt %>% mutate(a2 = a * 2, b2 = b * 2) %>% show_query() ``` Note that dtplyr does its best to avoid needless copies, so it won't explicitly copy if there's already an implicit copy produced by `[`, `head()`, `merge()` or similar: ```{r} dt %>% filter(x == 1) %>% mutate(a2 = a * 2, b2 = b * 2) %>% show_query() ``` You can choose to opt out of this copy, and take advantage of data.table's reference semantics (see `vignette("datatable-reference-semantics")` for more details). Do this by setting `immutable = FALSE` on construction: ```{r} dt2 <- data.table(a = 1:10) dt_inplace <- lazy_dt(dt2, immutable = FALSE) dt_inplace %>% mutate(a2 = a * 2, b2 = b * 2) %>% show_query() ``` ## Performance There are two components to the performance of dtplyr: how long it takes to generate the translation, and how well the translation performs. Given my explorations so far, I'm reasonably confident that we're generating high-quality data.table code, so most of the cost should be in the translation itself. The following code briefly explores the performance of a few different translations. A significant amount of work is done by the dplyr verbs, so we benchmark the whole process. ```{r} bench::mark( filter = dt %>% filter(a == b, c == d), mutate = dt %>% mutate(a = a * 2, a4 = a2 * 2, a8 = a4 * 2) %>% show_query(), summarise = dt %>% group_by(a) %>% summarise(b = mean(b)) %>% show_query(), check = FALSE )[1:6] ``` These translations all take less than a millisecond, suggesting that the performance overhead of dtplyr should be negligible for realistic data sizes. Note that dtplyr run-time scales with the complexity of the pipeline, not the size of the data, so these timings should apply regardless of the size of the underlying data[^copy]. [^copy]: Unless a copy is performed.