This document is focused on using data.table as a dependency in other R packages. If you are interested to use data.table C code from non-R application, or call its C functions directly, jump to the last section of this vignette.
Importing data.table is no different from importing other R packages. This vignette meant to answer most common questions which popups around that subject. Defining dependency presented here can be applied to other R packages.
One of the biggest features of data.table is the concise syntax which makes exploratory analysis faster and easier to perceive. Yet the same reason can drive packages authors to use data.table in their own packages. Another maybe even more important reason is high performance. When outsourcing heavy computing tasks from your package to data.table you usually get top performance without needing to learn any high performance computing programming or tricks.
It is very easy to use data.table as a dependency due to the fact that data.table does not have any dependencies. This statement is valid for both operating system dependencies and R dependencies. It means that if you have R installed on your machine, it already has everything what is needed to install data.table. This also means that adding data.table as a dependency of your package will not result in chain of other recursive dependencies required to be installed, making it very convenient for offline installation.
First place to define a dependency in a package is DESCRIPTION file. Most commonly you will need to use Imports: data.table
keyword there. This definition will force to install data.table before your package installation. As mentioned above no other packages will be installed because data.table does not have own dependencies. You can also specify the lowest required version of a dependency, for example if your package is using fwrite
function which was introduced in data.table 1.9.8 you may define Imports: data.table (>= 1.9.8)
. This way you can ensure that data.table is installed in 1.9.8 or later version before you will be able to install your package. Another way is to use Depends: data.table
instead of Imports
although this should be avoided because it forces end users of your package to use data.table.
Next thing is to define what content of data.table your package is using. This needs to be done in NAMESPACE file. Most commonly package authors will want to use import(data.table)
which will import all (exported) functions from data.table package.
If you want to use just a subset of data.table functions, lets say only fast read and write CSV files, you can use importFrom(data.table, fread, fwrite)
in NAMESPACE file. It is possible to import all functions from a package excluding particular ones using import(data.table, except=c(fread, fwrite))
.
As an example we will define two functions in a.pkg
package that uses data.table. gen
function will generate data.table, aggr
will aggregate that data.table.
gen = function (n = 100L) {
dt = as.data.table(list(id = seq_len(n)))
dt[, grp := ((id - 1) %% 26) + 1
][, grp := letters[grp]
][]
}
aggr = function (x) {
stopifnot(
is.data.table(x),
"grp" %in% names(x)
)
x[, .N, by = grp]
}
Be sure to include tests in your package. Before each major release of data.table we check reverse dependencies so if any changes in data.table would break your code we will be able to spot breaking changes and inform you before releasing new version. This of course assumes you will publish your package to CRAN. The most basic test can be a plaintext R script in your package directory tests/test.R
:
library(a.pkg)
dt = gen()
stopifnot(nrow(dt) == 100)
dt2 = aggr(dt)
stopifnot(nrow(dt2) < 100)
When testing package you may want to use R CMD check --no-stop-on-test-error
which will continue to run all your tests and not stop on first script that failed (requires R 3.4.0).
It is very common to use testthat package for purpose of tests. Testing package that imports data.table is no different from testing other packages. An example test script tests/testthat/test-pkg.R
:
context("pkg tests")
test_that("generate dt", { expect_true(nrow(gen()) == 100) })
test_that("aggregate dt", { expect_true(nrow(aggr(gen())) < 100) })
data.table
's use of Non-Standard Evaluation (especially on the left-hand side of :=
) is not well-recognised by R CMD check
. This results NOTE
s like the following during package check:
* checking R code for possible problems ... NOTE
aggr: no visible binding for global variable 'grp'
gen: no visible binding for global variable 'grp'
gen: no visible binding for global variable 'id'
Undefined global functions or variables:
grp id
The easiest way to deal with this is to pre-define those variables and set them to NULL
, eventually adding comment (as was done in the function gen
above). When possible, you can also use a character vector instead of symbols (as in aggr
). The functions from above example would then look like:
gen = function (n = 100L) {
id = grp = NULL # due to NSE notes in R CMD check
dt = as.data.table(list(id = seq_len(n)))
dt[, grp := ((id - 1) %% 26) + 1
][, grp := letters[grp]
][]
}
aggr = function (x) {
stopifnot(
is.data.table(x),
"grp" %in% names(x)
)
x[, .N, by = "grp"]
}
The case for :=
is slightly different, because :=
is interpreted as a function in the above code; so instead of registering :=
as NULL
, you must register it as a function, e.g.:
`:=` = function(...) NULL
If you don't mind having id
and grp
registered as variables globally in your package namespace you can use ?globalVariables
. Be aware that these notes do not have any impact on the code or its functionality; if you are not going to publish your package, you may simply choose to ignore them.
If you face any problems in this process before trying to ask questions or reporting issues please confirm that problem is reproducible in clean R session using R console R CMD check package.name
.
Some of the most common issues developers are facing are usually related to helpers tools that meant to automate some package development tasks. For example using roxygen to generate NAMESPACE file from metadata in R code files. Others are related to helpers that build and check package. Thus be sure to double check using R console, also ensure proper import is defined in DESCRIPTION and NAMESPACE files. If are not able to reproduce problems you have using just R console build and check you may try to get some support in devtools#192 or devtools#1472.
Since version 1.10.5 data.table is licensed as Mozilla Public License (MPL). The reasons for the change from GPL should be read in full here and you can read more about MPL on Wikipedia here and here.
If you want to use data.table conditionally only when it is installed you should use Suggests: data.table
in your DESCRIPTION file instead of using Imports: data.table
. By default this definition will not force to install data.table when installing your package. This also requires you to conditionally use data.table in your package code which should be done using ?requireNamespace
function. Below example demonstrates conditional use of data.table's fast CSV writer ?fwrite
. If data.table package is not installed then much slower base R ?write.table
function is used to write CSV file.
my.write = function (x) {
if(requireNamespace("data.table", quietly=TRUE)) {
data.table::fwrite(x, "data.csv")
} else {
write.table(x, "data.csv")
}
}
Slightly more extended way which also ensure that installed data.table is in version recent enough to have fwrite
function available.
my.write = function (x) {
if(requireNamespace("data.table", quietly=TRUE) &&
utils::packageVersion("data.table") >= "1.9.8") {
data.table::fwrite(x, "data.csv")
} else {
write.table(x, "data.csv")
}
}
When using package as a suggested dependency you should not import it in NAMESPACE file, just mention it in DESCRIPTION file. You also have to use data.table::
prefix when calling data.table functions because none of them are imported.
For more canonical documentation of defining packages dependency check Writing R Extensions official manual.
Some tiny parts of data.table C code were isolated from R C API and now can be used from non-R application by linking to .so / .dll files. More details about this will be provided later, for now you can study C code that were isolated from R C API in src/fread.c and src/fwrite.c.