This work is funded by the National Science Foundation grant NSF-IOS 1546858.

`rmonad` offers

- a stateful pipeline framework
- pure error handling
- access to the intermediate results of a pipeline
- effects – e.g. plotting, caching – within a pipeline
- branching and chaining of pipelines
- a flexible approach to literate programming

I will introduce `rmonad` with a simple sequence of square roots:

```
# %>>% corresponds to Haskell's >>=
1:5 %>>%
sqrt %>>%
sqrt %>>%
sqrt
```

```
## N1> "1:5"
## N2> "sqrt"
## N3> "sqrt"
## N4> "sqrt"
##
## -----------------
##
## [1] 1.000000 1.090508 1.147203 1.189207 1.222845
```

So what exactly did `rmonad` do with your data? It is still there, sitting
happily inside the monad.

In `magrittr` you could do something similar:

```
1:5 %>%
sqrt %>%
sqrt %>%
sqrt
```

```
## [1] 1.000000 1.090508 1.147203 1.189207 1.222845
```

`%>%` takes the value on the left and passes it to the function on the right.
`%>>%` takes a monad on the left and a function on the right, then builds a
new monad from them. This new monad holds the computed value, if the
computation succeeded. It collates all errors, warnings, and messages. These
are stored in a step-by-step history of the pipeline.
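To make the mechanics concrete, here is a minimal base-R sketch of a bind-like operator. It is not how `rmonad` is implemented: `unit`, `%bind%`, and the fields `value`, `ok`, and `history` are names invented for this illustration. The sketch only shows the idea of carrying a value together with a record of each step's warnings and errors.

```r
# Toy stand-in for a monad: a value, a success flag, and a step history.
# All names here are hypothetical and unrelated to rmonad's internals.
unit <- function(x) list(value = x, ok = TRUE, history = list())

`%bind%` <- function(m, f) {
  if (!m$ok) return(m)              # a failed pipeline skips later steps
  err   <- NULL
  warns <- character(0)
  value <- withCallingHandlers(
    tryCatch(f(m$value), error = function(e) {
      err <<- conditionMessage(e)   # record the error message
      m$value                       # keep the last valid value
    }),
    warning = function(w) {
      warns <<- c(warns, conditionMessage(w))
      invokeRestart("muffleWarning")  # continue after recording the warning
    }
  )
  step <- list(code = deparse(substitute(f)), warnings = warns, error = err)
  list(value = value, ok = is.null(err), history = c(m$history, list(step)))
}

# A warning is recorded and the pipeline continues.
m <- unit(c(-1, 4, 9)) %bind% sqrt %bind% sum

# An error stops the pipeline; the second sqrt never runs.
bad <- unit("wrench") %bind% sqrt %bind% sqrt
```

In this sketch `m$history[[1]]$warnings` holds the warning raised by the first `sqrt`, while `bad$ok` is `FALSE` and `bad$value` is still the original input.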

`%>%` is an application operator; `%>>%` is a *monadic bind* operator.
`magrittr` and `rmonad` complement each other. `%>%` can be used inside a
monadic sequence to perform operations *on* monads, whereas `%>>%` performs
operations *in* them. If this is all too mystical, just hold on: you don't
need to understand monads to understand the examples.

Below, we store an intermediate value in the monad:

```
1:5 %>>%
sqrt %v>% # store this result
sqrt %>>%
sqrt
```

```
## N1> "1:5"
## N2> "sqrt"
## [1] 1.000000 1.414214 1.732051 2.000000 2.236068
##
## N3> "sqrt"
## N4> "sqrt"
##
## -----------------
##
## [1] 1.000000 1.090508 1.147203 1.189207 1.222845
```

The `%v>%` variant of the *monadic bind* operator stores the results as they
are passed.

Following the example of `magrittr`, arbitrary anonymous functions of '.' are
supported:

```
1:5 %>>% { o <- . * 2 ; { o + . } %>% { . + o } }
```

```
## N1> "1:5"
## N2> "function (.)
## {
## o <- . * 2
## {
## o + .
## } %>% {
## . + o
## }
## }"
##
## -----------------
##
## [1] 5 10 15 20 25
```

Warnings are caught and stored

```
-1:3 %>>%
sqrt %v>%
sqrt %>>%
sqrt
```

```
## N1> "-1:3"
## N2> "sqrt"
## * WARNING: NaNs produced
## [1] NaN 0.000000 1.000000 1.414214 1.732051
##
## N3> "sqrt"
## N4> "sqrt"
##
## -----------------
##
## [1] NaN 0.000000 1.000000 1.090508 1.147203
```

Similarly for errors

```
"wrench" %>>%
sqrt %v>%
sqrt %>>%
sqrt
```

```
## N1> ""wrench""
## N2> "sqrt"
## * ERROR: non-numeric argument to mathematical function
##
## -----------------
##
## [1] "wrench"
## *** FAILURE ***
```

The first `sqrt` failed, and this step was coupled to the resultant error.
Contrast this with `magrittr`, where the location of the error is lost:

```
"wrench" %>%
sqrt %>%
sqrt %>%
sqrt
```

```
## Error in sqrt(.): non-numeric argument to mathematical function
```

Also note that a value was still produced. This value will never be used in the downstream monadic sequence (except when explicitly doing error handling). However, it and all other information in the monad can be easily accessed.

If you want to extract the terminal result from the monad, you can use the `esc` function:

```
1:5 %>>% sqrt %>% esc
```

```
## [1] 1.000000 1.414214 1.732051 2.000000 2.236068
```

`esc` is our first example of a class of functions that work on monads, rather
than the values they wrap. We use `magrittr`'s application operator `%>%` here,
rather than the monadic bind operator `%>>%`, because we are passing a literal
monad to `esc`.

If the monad is in a failed state, `esc` will raise an error.

```
"wrench" %>>% sqrt %>>% sqrt %>% esc
```

```
## Error: in "sqrt":
## non-numeric argument to mathematical function
```

If you prefer a tabular summary of your results, you can pipe the monad into
the `mtabulate` function.

```
1:5 %>>%
sqrt %v>%
sqrt %>>%
sqrt %>% mtabulate
```

```
##   id   OK cached time space is_nested ndependents nnotes nwarnings error doc
## 1  1 TRUE  FALSE    0    80         0           1      0         0     0   0
## 2  2 TRUE   TRUE    0    96         0           1      0         0     0   0
## 3  3 TRUE  FALSE    0    96         0           1      0         0     0   0
## 4  4 TRUE   TRUE    0    96         0           0      0         0     0   0
```

Internal states can be accessed by converting the monad to a list of past states and simply indexing out the ones you want.

All errors, warnings and notes can be extracted with the `missues`
command:

```
-2:2 %>>% sqrt %>>% colSums %>% missues
```

```
##   id    type                                           issue
## 1  3   error 'x' must be an array of at least two dimensions
## 2  2 warning                                   NaNs produced
```

The `id` column refers to row numbers in the `mtabulate` output. Internal
values can be extracted:

```
result <- 1:5 %v>% sqrt %v>% sqrt %v>% sqrt
get_value(result)[[2]]
```

```
## [1] 1.000000 1.414214 1.732051 2.000000 2.236068
```

The `%>_%` operator is useful when you want to include a function inside a
pipeline that should be bypassed, but you want the errors, warnings, and
messages to pass along with the main pipeline.

You can cache an intermediate result

```
cars %>_% write.csv(file="cars.tab") %>>% summary
```

Or plot a value along with a summary

```
cars %>_% plot(xlab="index", ylab="value") %>>% summary
```

```
## N1> "cars"
## N2> "plot(xlab = "index", ylab = "value")"
## N3> "summary"
##
## -----------------
##
## speed dist
## Min. : 4.0 Min. : 2.00
## 1st Qu.:12.0 1st Qu.: 26.00
## Median :15.0 Median : 36.00
## Mean :15.4 Mean : 42.98
## 3rd Qu.:19.0 3rd Qu.: 56.00
## Max. :25.0 Max. :120.00
```

I pipe the final monad into `forget`, which is (like `esc`) a function for
operating on monads. `forget` removes history from a monad. I do this just to
de-clutter the output.

You can call multiple effects

```
cars %>_%
plot(xlab="index", ylab="value") %>_%
write.csv(file="cars.tab") %>>%
summary
```
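The bypass idea can be approximated in plain R. The sketch below is an illustration only: `%tee%`, `save_copy`, and `out_file` are made-up names, and unlike `%>_%` this toy version does not record errors or warnings raised by the effect (`magrittr`'s `%T>%` plays a similar role for ordinary pipes).

```r
# Hypothetical helper, not part of rmonad or magrittr:
# call f on x for its side effect, discard f's result, and return x unchanged.
`%tee%` <- function(x, f) {
  f(x)
  x
}

# An effect: write a copy of the data to a temporary file.
out_file  <- tempfile(fileext = ".csv")
save_copy <- function(d) write.csv(d, file = out_file, row.names = FALSE)

# The data frame flows past save_copy and on to summary.
result <- summary(cars %tee% save_copy)
```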

Since state is passed, you can make assertions about the data inside a pipeline.

```
iris %>_%
{ stopifnot(is.data.frame(.)) } %>_%
{ stopifnot(sapply(.,is.numeric)) } %>>%
colSums %|>% head
```

```
## N1> "iris"
## N2> "function (.)
## {
## stopifnot(is.data.frame(.))
## }"
## N3> "function (.)
## {
## stopifnot(sapply(., is.numeric))
## }"
## * ERROR: sapply(., is.numeric) are not all TRUE
## N4> "head"
##
## -----------------
##
## Sepal.Length Sepal.Width Petal.Length Petal.Width Species
## 1 5.1 3.5 1.4 0.2 setosa
## 2 4.9 3.0 1.4 0.2 setosa
## 3 4.7 3.2 1.3 0.2 setosa
## 4 4.6 3.1 1.5 0.2 setosa
## 5 5.0 3.6 1.4 0.2 setosa
## 6 5.4 3.9 1.7 0.4 setosa
```

The above code will enter a failed state if the input is either not a data
frame or the columns are not all numeric. The braced expressions are anonymous
functions of '.' (as in `magrittr`). The final expression `%|>%` catches an
error and performs `head` on the last valid input (`iris`).

Errors needn't be viewed as abnormal. For example, we might want to try several alternative functions, and use the first that works.

```
1:10 %>>% colSums %|>% sum
```

```
## N1> "1:10"
## N2> "colSums"
## * ERROR: 'x' must be an array of at least two dimensions
## N3> "sum"
##
## -----------------
##
## [1] 55
```

Here we will do either `colSums` or `sum`. The pipeline fails only if both
fail.
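The same "first alternative that works" pattern can be sketched in base R with `tryCatch`. This is an illustration of the idea, not `rmonad`'s implementation; `first_success` is a hypothetical helper introduced here.

```r
# Hypothetical helper: try each function on x in order and return the
# first result that does not raise an error, along with the errors seen.
first_success <- function(x, ...) {
  errors <- character(0)
  for (f in list(...)) {
    attempt <- tryCatch(
      list(value = f(x)),
      error = function(e) {
        errors <<- c(errors, conditionMessage(e))  # remember why it failed
        NULL
      }
    )
    if (!is.null(attempt)) {
      return(list(value = attempt$value, errors = errors))
    }
  }
  stop("all alternatives failed: ", paste(errors, collapse = "; "))
}

# colSums fails on a plain vector, so sum is used instead.
out <- first_success(1:10, colSums, sum)
out$value
```

As with `%|>%`, the failure of `colSums` is kept (in `out$errors`) rather than thrown away, and the call as a whole fails only if every alternative fails.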

Sometimes you want to ignore the previous failure completely, and make a new call – for example in reading files:

```
# try to load a cached file, on failure rerun the analysis
read.table("analysis_cache.tab") %||% run_analysis(x)
```

This can also be used to replace if-else if-else chains:

```
x <- list()
# compare
if(length(x) > 0) { x[[1]] } else { NULL }
```

```
## NULL
```

```
# to
x[[1]] %||% NULL %>% esc
```

```
## NULL
```

Or maybe you want to support multiple extensions for an input file

```
read.table("a.tab") %||% read.table("a.tsv") %>>% dostuff
```
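For readers who want to see the underlying idea without monads, a fallback operator can be sketched in a single line of base R by relying on lazy argument evaluation. The name `%or%` is invented for this example (chosen to avoid clashing with other definitions of `%||%`), and unlike `rmonad`'s operator it keeps no record of the failure.

```r
# Hypothetical operator: evaluate the left side; if it raises an error,
# evaluate and return the right side instead. Both sides are lazy promises,
# so the right side is only evaluated when it is needed.
`%or%` <- function(lhs, rhs) tryCatch(lhs, error = function(e) rhs)

x <- list()
first  <- x[[1]] %or% NULL                 # indexing fails, so NULL is used
second <- log("a") %or% 0                  # log of a string fails, so 0 is used
third  <- stop("no cache") %or% stop("no backup") %or% "recomputed"
```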

Used together with `%|>%`, we can build full error handling pipelines:

```
letters[1:10] %v>% colSums %|>% sum %||% message("Can't process this")
```

```
## Can't process this
```

```
## N1> "letters[1:10]"
## [1] "a" "b" "c" "d" "e" "f" "g" "h" "i" "j"
##
## N2> "colSums"
## * ERROR: 'x' must be an array of at least two dimensions
## N3> "sum"
## * ERROR: invalid 'type' (character) of argument
## N4> "message("Can't process this")"
##
## -----------------
##
## NULL
```

Overall, in `rmonad`, errors are well-behaved. It is reasonable to write
functions that return an error rather than one of the myriad default values
(`NULL`, `NA`, `logical(0)`, `list()`, `FALSE`). This approach is unambiguous.
`rmonad` can catch the error and allow the programmer to deal with it
accordingly.

If you want to perform an operation on a value inside the chain, but don't want
to pass it, you can use the branch operator `%>^%`.

```
rnorm(30) %>^% qplot(xlab="index", ylab="value") %>>% mean
```

This stores the result of `qplot` in a branch off the main pipeline. This means
that `qplot` could fail, but the rest of the pipeline could continue. You can
store multiple branches.

```
rnorm(30) %>^% qplot(xlab="index", ylab="value") %>^% summary %>>% mean
```

Branches can be used as input, as well.

```
x <- 1:10 %>^% dgamma(10, 1) %>^% dgamma(10, 5) %^>% cor
get_value(x)
```

```
## [[1]]
## [1] 1 2 3 4 5 6 7 8 9 10
##
## [[2]]
## [1] 1.013777e-06 1.909493e-04 2.700504e-03 1.323119e-02 3.626558e-02
## [6] 6.883849e-02 1.014047e-01 1.240769e-01 1.317556e-01 1.251100e-01
##
## [[3]]
## [1] 1.813279e-01 6.255502e-01 1.620358e-01 1.454077e-02 7.299700e-04
## [6] 2.537837e-05 6.847192e-07 1.534503e-08 2.984475e-10 5.190544e-12
##
## [[4]]
## [1] -0.5838848
```

Note the branches could be long monadic chains themselves, which might have their own branches.

Use of the `%>^%` and `%^>%` operators is a little awkward. A more general
option is to use tags and views. This allows the *head* of a pipeline to
be reset.

```
# build memory cacher
f <- make_recacher(memory_cache)
# make core dataset
m <- as_monad(iris) %>>%
  dplyr::select(
    sepal_length = Sepal.Length,
    sepal_width = Sepal.Width,
    species = Species
  ) %>%
  # cache value with tag 'iris'
  f('iris') %>>%
  # some downstream stuff
  nrow
# Now can pick from the tagged node
m <- view(m, 'iris') %>>% {
  qplot(
    x=sepal_length,
    y=sepal_width,
    color=species,
    data=.
  )} %>% f('plot')
# and repeat however many times we like
m <- view(m, 'iris') %>>% summary %>% f('sum')
plot(m)
```

If you want to connect many chains, all with independent inputs, you can do so
with the `%__%` operator.

```
runif(10) %>>% sum %__%
rnorm(10) %>>% sum %__%
rexp(10) %>>% sum
```

```
## N1> "runif(10)"
## N2> "sum"
## [1] 4.583261
##
## N3> "rexp(10)"
## N4> "sum"
## [1] -1.522069
##
## N5> "sum"
##
## -----------------
##
## [1] 14.88481
```

The `%__%` operator records the output of the lhs and evaluates the rhs into an
`rmonad`. This operator is a little like a semicolon, in that it demarcates
independent statements. Each statement, though, is wrapped into a graph of
operations. This graph is itself data, and can be computed on. You could take
any analysis and recompose it as `%__%` delimited blocks. The result of running
the analysis would be a data structure containing all results and errors.

```
program <-
{
  x = 2
  y = 5
  x * y
} %__% {
  letters %>% sqrt
} %__% {
  10 * x
}
```

You can link chunks of code, with their results, and performance information.

So far our pipelines have been limited to either linear paths or the somewhat
awkward branch merging. An easier approach is to read inputs from a list. But
we want to be able to catch errors resulting from evaluation of each member of
the list. We can do this with `funnel`.

```
funnel(
  "yolo",
  stop("stop, drop, and die"),
  runif("simon"),
  k = 2
)
```

```
## N1> "2"
## N2> "runif("simon")"
## * ERROR: invalid arguments
## * WARNING: NAs introduced by coercion
## N3> "stop("stop, drop, and die")"
## * ERROR: stop, drop, and die
## N4> "yolo"
## N5> "funnel("yolo", stop("stop, drop, and die"), runif("simon"), k = 2)"
##
## -----------------
##
## [[1]]
## [1] "yolo"
##
## [[2]]
## NULL
##
## [[3]]
## NULL
##
## $k
## [1] 2
##
## *** FAILURE ***
```

This returns a monad which fails if any of the components evaluate to an error. But it does not toss the rest of the inputs, instead returning a clean list with a NULL filling in missing pieces. Contrast this with normal list evaluation:

```
list( "yolo", stop("stop, drop, and die"), runif("simon"), 2)
```

```
## Error in eval(expr, envir, enclos): stop, drop, and die
```

`funnel` records each failure in each element of the list independently.
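The element-by-element behavior can be imitated in base R by capturing the arguments unevaluated and evaluating each one inside its own `tryCatch`. The sketch below is only an illustration of that idea; `safe_list` and its return fields are hypothetical and are not part of `rmonad`.

```r
# Hypothetical helper: evaluate each argument independently, storing NULL
# for any element that raises an error and remembering its error message.
safe_list <- function(...) {
  exprs  <- as.list(substitute(list(...)))[-1]  # unevaluated arguments
  env    <- parent.frame()                      # evaluate in the caller's scope
  values <- vector("list", length(exprs))       # starts as a list of NULLs
  errors <- rep(NA_character_, length(exprs))
  for (i in seq_along(exprs)) {
    attempt <- tryCatch(
      list(value = eval(exprs[[i]], env)),
      error = function(e) list(error = conditionMessage(e))
    )
    if (is.null(attempt$error)) {
      values[i] <- list(attempt$value)          # list() keeps NULL results intact
    } else {
      errors[i] <- attempt$error
    }
  }
  names(values) <- names(exprs)
  list(values = values, errors = errors, ok = all(is.na(errors)))
}

res <- safe_list("yolo", stop("stop, drop, and die"), k = 2)
res$ok
res$errors
```

Here the second element fails, so `res$ok` is `FALSE`, yet the first and third values are still available in `res$values`.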

This approach can also be used with the infix operator `%*>%`.

```
funnel(read.csv("a.csv"), read.csv("b.csv")) %*>% merge
```

Now, of course, we can add monads to the mix

```
funnel(
  a = read.csv("a.csv") %>>% do_analysis_a,
  b = read.csv("b.csv") %>>% do_analysis_b,
  k = 5
) %*>% joint_analysis
```

Monadic list evaluation is the natural way to build large programs from smaller pieces.

As our pipelines become more complex, it becomes essential to document them. We can do that as follows:

```
{
"This is docstring. The following list is metadata associated with this
node. Both the docstring and the metadata list will be processed out of
this function before it is executed. They also will not appear in the code
stored in the Rmonad object."
list(sys = sessionInfo(), foo = "This can be anything")
# This NULL is necessary, otherwise the metadata list above would be
# treated as the node output
NULL
} %__% # The %__% operator connects independent pieces of a pipeline.
"a" %>>% {
"The docstrings are stored in the Rmonad objects. They may be extracted in
the generation of reports. For example, they could go into a text block
below the code in a knitr document. The advantage of having documentation
here, is that it is coupled unambiguously to the generating function. These
annotations, together with the ability to chain chains of monads, allows
whole complex workflows to be built, with the results collated into a
single object. All errors propagate exactly as errors should, only
affecting downstream computations. The final object can be converted into a
markdown document and automatically generated function graphs."
paste(., "b")
}
```

```
##
##
## This is docstring. The following list is metadata associated with this
## node. Both the docstring and the metadata list will be processed out of
## this function before it is executed. They also will not appear in the code
## stored in the Rmonad object.
##
## N1> "{
## NULL
## }"
## NULL
##
## N2> ""a""
##
##
## The docstrings are stored in the Rmonad objects. They may be extracted in
## the generation of reports. For example, they could go into a text block
## below the code in a knitr document. The advantage of having documentation
## here, is that it is coupled unambiguously to the generating function. These
## annotations, together with the ability to chain chains of monads, allows
## whole complex workflows to be built, with the results collated into a
## single object. All errors propagate exactly as errors should, only
## affecting downstream computations. The final object can be converted into a
## markdown document and automatically generated function graphs.
##
## N3> "function (.)
## {
## paste(., "b")
## }"
##
## -----------------
##
## [1] "a b"
```

`rmonad` pipelines may be nested to arbitrary depth.

```
foo <- function(x, y) {
  "This is a function containing a pipeline. It always fails"
  "a" %>>% paste(x) %>>% paste(y) %>>% log
}
bar <- function(x) {
  "this is another function, it doesn't fail"
  funnel("b", "c") %*>% foo %>>% paste(x)
}
"d" %>>% bar
```

```
## N1> ""d""
## N2> "c"
## N3> "b"
## N4> "funnel("b", "c")"
## N5> ""a""
## N6> "paste(x)"
## N7> "paste(y)"
## N8> "log"
## * ERROR: non-numeric argument to mathematical function
## [1] "a b c"
##
##
##
## This is a function containing a pipeline. It always fails
##
## N9> "foo"
## [[1]]
## [1] "b"
##
## [[2]]
## [1] "c"
##
##
##
##
## this is another function, it doesn't fail
##
## N10> "bar"
##
## -----------------
##
## [1] "d"
## *** FAILURE ***
```

This function descends through three levels of nesting. There is a failure at
the deepest level. This failing node, where a string is passed to a `log`
function, stores the error message and the input. Each node ascending from the
point of failure stores its respective input. This allows debugging to resume
from any desired level.

A feature new in `rmonad` v0.4 is a set of post-processors. These act on an
`Rmonad` object after the code the object wraps has been evaluated.

Here are the currently supported post-processors:

- `format_warnings` - A function of the final value and the list of warnings that formats the node's warning message
- `format_error` - Like `format_warnings` but for errors
- `format_notes` - Like `format_warnings` but for messages/notes
- `summarize` - A function of the final value that stores a summary of the data
- `cache` - A function of the final value that caches the value
- `format_log` - A function of the final state that prints a progress message

These are all quite experimental at this point.

The post-processors are included in the node metadata, for example

```
"hello world" %>>% {
  list(
    format_error=function(x, err){
      paste0("Failure on input '", x, "': ", err)
    }
  )
  sqrt(.)
}
```

```
## N1> ""hello world""
## N2> "function (.)
## {
## sqrt(.)
## }"
## * ERROR: Failure on input 'hello world': non-numeric argument to mathematical function
##
## -----------------
##
## [1] "hello world"
## *** FAILURE ***
```

`summarize` is useful because it is often worth storing information about an
intermediate step when storing the full data is too memory intensive. Rather
than stopping the flow of an analysis with a bunch of intermediate analytic
code, a summary function can be nested in a node that holds an arbitrary
description of the data, coupled immediately to the function that produced it.

```
d <- mtcars %>>% {
  list(summarize=summary)
  subset(., mpg > 20)
} %>>% nrow
get_summary(d)[[2]]
```

```
## [[1]]
## mpg cyl disp hp
## Min. :21.00 Min. :4.000 Min. : 71.10 Min. : 52.0
## 1st Qu.:21.43 1st Qu.:4.000 1st Qu.: 83.03 1st Qu.: 66.0
## Median :23.60 Median :4.000 Median :120.20 Median : 94.0
## Mean :25.48 Mean :4.429 Mean :123.89 Mean : 88.5
## 3rd Qu.:29.62 3rd Qu.:4.000 3rd Qu.:145.22 3rd Qu.:109.8
## Max. :33.90 Max. :6.000 Max. :258.00 Max. :113.0
## drat wt qsec vs
## Min. :3.080 Min. :1.513 Min. :16.46 Min. :0.0000
## 1st Qu.:3.790 1st Qu.:1.986 1st Qu.:17.39 1st Qu.:1.0000
## Median :3.910 Median :2.393 Median :18.75 Median :1.0000
## Mean :3.976 Mean :2.418 Mean :18.82 Mean :0.7857
## 3rd Qu.:4.103 3rd Qu.:2.851 3rd Qu.:19.79 3rd Qu.:1.0000
## Max. :4.930 Max. :3.215 Max. :22.90 Max. :1.0000
## am gear carb
## Min. :0.0000 Min. :3 Min. :1.000
## 1st Qu.:0.2500 1st Qu.:4 1st Qu.:1.000
## Median :1.0000 Median :4 Median :2.000
## Mean :0.7143 Mean :4 Mean :1.857
## 3rd Qu.:1.0000 3rd Qu.:4 3rd Qu.:2.000
## Max. :1.0000 Max. :5 Max. :4.000
```

The summary information will be tucked away invisibly in the `Rmonad` object until
a debugger or report generator extracts it. Of course, this could also be used
to just store a full copy of the output in memory, by setting the `summarize`
function to `identity`.
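The coupling of a value to its summary can be pictured with a small base-R sketch. This is not how `rmonad` stores summaries; `with_summary` and the fields of its result are hypothetical names used only to illustrate the idea.

```r
# Hypothetical helper: apply f to x, then keep the result together with
# whatever the summarize function says about it.
with_summary <- function(x, f, summarize) {
  value <- f(x)
  list(value = value, summary = summarize(value))
}

# Keep a compact description of the filtered data next to the data itself.
node <- with_summary(mtcars, function(d) subset(d, mpg > 20), summary)
nrow(node$value)

# Passing identity as the summarizer stores a full copy instead.
full <- with_summary(mtcars, function(d) subset(d, mpg > 20), identity)
```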

Summaries like this will be more useful in the `rmonad` world when a Shiny app
(or something comparable) makes the workflow graph interactive. Then the
summary for a node can automatically be displayed when the node is accessed.

The `cache` and `log` post-processors are not yet well developed. But they are
intended to do what their names suggest. `cache` is not yet useful since I
don't have the infrastructure to test whether the cache is valid. `log` will
eventually allow progress messages to be passed to STDOUT as `rmonad` is
running (by default messages are captured and stored).