A Future for R: Non-Exportable Objects

Certain types of objects are tied to a given R session. Such objects cannot be saved to file by one R process and then later be reloaded in another R process and expected to work correctly. If attempted, we will often get an informative error but not always. For the same reason, these type of objects cannot be exported to another R processes(*) for parallel processing. We refer to these type of objects as “non-exportable objects”.

(*) The exception is when forked processes are used, i.e. plan(multicore). However, such attempts to work around the underlying problem, which is non-exportable objects, should be avoided and considered non-stable. Moreover, such code will fail to parallelize when using other future backends.

A first example - file connections

An example of a non-exportable object is a connection, e.g. a file connection. For instance, if you create a file connection,

con <- file("output.log", open = "wb")
cat("hello ", file = con)
flush(con)
readLines("output.log", warn = FALSE)
## [1] "hello "

it will not work when used in another R process. If we try, the result is “unknown”, e.g.

library("future")
plan(multisession)
f <- future({ cat("world!", file = con); flush(con) })
value(f)
## NULL
close(con)
readLines("output.log", warn = FALSE)
## [1] "hello "

In other words, the output "world!" written by the R worker is completely lost.

The culprit here is that the connection uses a so called external pointer:

str(con)
## Classes 'file', 'connection'  atomic [1:1] 3
##   ..- attr(*, "conn_id")=<externalptr> 

which is bound to the main R process and makes no sense to the worker. Ideally, the R process of the worker would detect this and produce an informative error message, but as seen here, that does not always occur.

Protect against non-exportable objects

To help avoiding exporting non-exportable objects by mistakes, which typically happens because a global variable is non-exportable, the future framework provides a mechanism for automatically detecting such objects. To enable it, do:

options(future.globals.onReference = "error")
f <- future({ cat("world!", file = con); flush(con) })
## Error: Detected a non-exportable reference ('externalptr') in one of the globals
## ('con' of class 'file') used in the future expression

Comment: The future.globals.onReference options is set to "ignore" by default due to the extra overhead "error" introduces, which can be significant for very large nested objects. Furthermore, some subclasses of external pointers can be exported without causing problems.

Packages with non-exportable objects

The below table and sections provide a few examples of non-exportable R objects that you may run into when trying to parallelize your code, or simply from trying reload objects saved in a previous R session. If you identify other cases, please consider reporting them so they can be documented here and possibly even be fixed.

Package Example of non-exportable types or classes
base connection
DBI DBIConnection
Rcpp NativeSymbol
reticulate python.builtin.function, python.builtin.module
rJava jclassName
rstan stanmodel
sparklyr tbl_spark
udpipe udpipe_model
xml2 xml_document

Package: DBI

DBI provides a unified database interface for communication between R and various database engines. Analogously to regular connections in R, DBIConnection objects can not safely be exported to another R process, e.g.

library("future")
options(future.globals.onReference = "error")
plan(multisession)
library("DBI")
con <- dbConnect(RSQLite::SQLite(), ":memory:")
dummy %<-% print(con)
## Error: Detected a non-exportable reference ('externalptr') in one of the globals
## ('con' of class 'SQLiteConnection') used in the future expression

Package: Rcpp

Another example is Rcpp , which allow us to easily create R functions that are implemented in C++, e.g.

Rcpp::sourceCpp(code = "
#include <Rcpp.h>
using namespace Rcpp;

// [[Rcpp::export]]
int my_length(NumericVector x) {
    return x.size();
}
")

so that:

x <- 1:10
my_length(x)
## [1] 10

However, since this function uses an external pointer internally, we cannot pass it to another R process:

library("future")
plan(multisession)
n %<-% my_length(x)
n
## Error in .Call(<pointer: (nil)>, x) : NULL value passed as symbol address

We can detect / protect against this using:

options(future.globals.onReference = "error")
n %<-% my_length(x)
## Error: Detected a non-exportable reference ('externalptr' of class
## 'NativeSymbol') in one of the globals ('my_length' of class 'function')
## used in the future expression

Package: reticulate

The reticulate package provides methods for creating and calling Python code from within R. If one attempt to use Python-binding objects from this package, we get errors like:

library("future")
plan(multisession)
library("reticulate")
os <- import("os")
pwd %<-% os$getcwd()
pwd
## Error in eval(quote(os$getcwd()), new.env()) : 
##   attempt to apply non-function

and by telling the future package to validate globals further, we get:

options(future.globals.onReference = "error")
pwd %<-% os$getcwd()
## Error: Detected a non-exportable reference ('externalptr') in one of the
## globals ('os' of class 'python.builtin.module') used in the future expression

Another reticulate example is when we try to use a Python function that we create ourselves as in:

cat("def twice(x):\n    return 2*x\n", file = "twice.py")
source_python("twice.py")
twice(1.2)
## [1] 2.4
y %<-% twice(1.2)
y
## Error in unserialize(node$con) : 
##   Failed to retrieve the value of MultisessionFuture from cluster node #1
##   (on 'localhost').  The reason reported was 'error reading from connection'

which, again, is because:

options(future.globals.onReference = "error")
y %<-% twice(1.2)
## Error: Detected a non-exportable reference ('externalptr') in one of the globals
## ('twice' of class 'python.builtin.function') used in the future expression

Package: rJava

Here is an example that shows how rJava objects cannot be exported to external R processes.

library("future")
plan(multisession)
library("rJava")
.jinit() ## Initialize Java VM on master

Double <- J("java.lang.Double")
d0 <- new(Double, "3.14")
d0
## [1] "Java-Object{3.14}"

f <- future({
  .jinit() ## Initialize Java VM on worker
  new(Double, "3.14")
})
d1 <- value(f)
d1
## [1] "Java-Object<null>"

Although no error is produced, we see that the value d1 is a Java NULL Object. As before, we can catch this by using:

options(future.globals.onReference = "error")
f <- future({
  .jinit() ## Initialize Java VM on worker
  new(Double, "3.14")
})
## Error: Detected a non-exportable reference ('externalptr') in one of the
## globals ('Double' of class 'jclassName') used in the future expression

Package: udpipe

library("future")
plan(multisession)
library("sparklyr")
sc <- spark_connect(master = "local")

file <- system.file("misc", "exDIF.csv", package = "utils")
data <- spark_read_csv(sc, "exDIF", file)
d %<-% dim(data)
d
Error in unserialize(node$con) : 
  Failed to retrieve the value of MultisessionFuture (<none>) from cluster
SOCKnode #1 (PID 29864 on localhost 'localhost'). The reason reported was
'unknown input format'. Post-mortem diagnostic: A process with this PID
exists, which suggests that the localhost worker is still alive.

To catch this as soon as possible,

options(future.globals.onReference = "error")
d %<-% dim(data)
## Error: Detected a non-exportable reference ('externalptr') in one of
the globals ('data' of class 'tbl_spark') used in the future expression
``


### Package: udpipe

```r
library("future")
plan(multisession)
library("udpipe")
udmodel <- udpipe_download_model(language = "dutch")
udmodel <- udpipe_load_model(file = udmodel$file_model)
x %<-% udpipe_annotate(udmodel, x = "Ik ging op reis en ik nam mee.")
x
## Error in udp_tokenise_tag_parse(object$model, x, doc_id, tokenizer, tagger,  : 
##   external pointer is not valid

To catch this as soon as possible,

options(future.globals.onReference = "error")
x %<-% udpipe_annotate(udmodel, x = "Ik ging op reis en ik nam mee.")
## Error: Detected a non-exportable reference ('externalptr') in one of the
## globals ('udmodel' of class 'udpipe_model') used in the future expression

Now, it is indeed possible to parallelize \pkg{udpipe} calls. For details on how to do this, see the 'UDPipe Natural Language Processing - Parallel' vignette that comes with the \pkg{udpipe} package.

Package: xml2

Another example is XML objects of the xml2 package, which may produce evaluation errors (or just invalid results depending on how they are used), e.g.

library("future")
plan(multisession)
library("xml2")
xml <- read_xml("<body></body>")
f <- future(xml_children(xml))
value(f)
## Error: external pointer is not valid

The future framework can help detect this before sending off the future to the worker;

options(future.globals.onReference = "error")
f <- future(xml_children(xml))
## Error: Detected a non-exportable reference ('externalptr') in one of the
## globals ('xml' of class 'xml_document') used in the future expression

Copyright Henrik Bengtsson, 2019