Banner

Abraham Lincoln once said, "Give me six hours to chop down a tree and I will spend the first four sharpening the axe."
Aunt Margaret used to say, "If you dream of a forest, you'd better learn how to plant a tree."
data.tree says, "No matter if you are a lumberjack or a tree hugger. I will be your sanding block, and I will be your seed."

Introduction

Trees

Trees are ubiquitous in mathematics, computer science, data sciences, finance, and in many other fields. Trees are especially useful when we are facing hierarchical data. For example, trees are used:

  • in decision theory (cf. decision trees)
  • in machine learning (e.g. classification trees)
  • in finance, e.g. to classify financial instruments into asset classes
  • in routing algorithms
  • in computer science and programming (e.g. binary search trees, XML)
  • e.g. for family trees

For more details, see the applications vignette by typing vignette("applications", package = "data.tree")

Trees in R

Tree-like structures are already used in R. For example, environments can be seen as nodes in a tree. And CRAN provides numerous packages that deal with tree-like structures, especially in the area of decision theory. Yet, there is no general purpose hierarchical data structure that could be used as conveniently and generically as, say, data.frame.

As a result, people often try to resolve hierarchical problems in a tabular fashion, for instance with data.frames. But often, hierarchies don’t marry with tables, and various workarounds are usually required.

Trees in data.tree

This package offers an alternative. The data.tree package lets you create hierarchies, called data.tree structures. The building block of theses structures are Node objects. The package provides basic traversal, search, and sort operations, and an infrastructure for recursive tree programming. You can decorate Nodes with your own fields and methods, so as to extend the package to your needs.

The package also provides convenience methods for neatly printing and plotting trees. It supports conversion from and to data.frames, lists, and other tree structures such as dendrogram, phylo objects from the ape package, igraph, and other packages.

Technically, data.tree structures are bi-directional, ordered trees. Bi-directional means that you can navigate from parent to chidren and vice versa. Ordered means that the sort order of the children of a parent node is well-defined.

data.tree basics

Definitions

  • data.tree structure: a tree, consisting of multiple Node objects. Often, the entry point to a data.tree structure is the root Node
  • Node: both a class and the basic building block of data.tree structures
  • attribute: an active, a field, or a method. **Not to be confused with standard R attributes, c.f. ?attr, which have a different meaning. Many methods and functions have an attribute arg, which can refer to a an active, a field or a method. For example, see ?Get
  • active (sometimes called property): a field on a Node that can be called like an attribute, but behaves like a function without arguments. For example: node$position
  • field: a named value on a Node, e.g. node$cost <- 2500
  • method: a function acting on an object (on a Node in this context). Many methods are available in OO style (e.g. node$Revert()) or in traditional style (Revert(node))
  • inheritance: in this context, inheritance refers to a situation in which a child Node inherits e.g. an attribute from one of its ancestors. For example, see ?Get, ?SetNodeStyle

Tree creation

There are different ways to create a data.tree structure. For example, you can create a tree programmatically, by conversion from other R objects, or from a file.

Create a tree programmatically

Let’s start by creating a tree programmatically. We do this by creating Node objects, and linking them together so as to define the parent-child relationships.

In this example, we are looking at a company, Acme Inc., and the tree reflects its organisational structure. The root (level 1) is the company. On level 2, the nodes represent departments, and the leaves of the tree represent projects that the company is considering for next year:

library(data.tree)

acme <- Node$new("Acme Inc.")
  accounting <- acme$AddChild("Accounting")
    software <- accounting$AddChild("New Software")
    standards <- accounting$AddChild("New Accounting Standards")
  research <- acme$AddChild("Research")
    newProductLine <- research$AddChild("New Product Line")
    newLabs <- research$AddChild("New Labs")
  it <- acme$AddChild("IT")
    outsource <- it$AddChild("Outsource")
    agile <- it$AddChild("Go agile")
    goToR <- it$AddChild("Switch to R")

print(acme)
##                           levelName
## 1  Acme Inc.                       
## 2   ¦--Accounting                  
## 3   ¦   ¦--New Software            
## 4   ¦   °--New Accounting Standards
## 5   ¦--Research                    
## 6   ¦   ¦--New Product Line        
## 7   ¦   °--New Labs                
## 8   °--IT                          
## 9       ¦--Outsource               
## 10      ¦--Go agile                
## 11      °--Switch to R

As you can see from the previous example, each Node is identified by its name, i.e. the argument you pass into the Node$new(name) constructor. The name needs to be unique among siblings, such that paths to Nodes are unambiguous.

Node inherits from R6 reference class. This has the following implications:

  1. You can call methods on a Node in OO style, e.g. acme$Get("name")
  2. Node exhibits reference semantics. Thus, multiple variables in R can point to the same Node, and modifying a Node will modify it for all referencing variables. In the above code example, both acme$IT and it reference the same object. This is different from the value semantics, which is much more widely used in R.

Create a tree from a data.frame

Creating a tree programmatically is useful especially in the context of algorithms. However, most times you will create a tree by conversion. This could be by conversion from a nested list-of-lists, by conversion from another R tree-structure (e.g. an ape phylo), or by conversion from a data.frame. For more details on all the options, type ?as.Node and refer to the See Also section.

One of the most common conversions is the one from a data.frame in table format. The following code illustrates this. We load the GNI2014 data from the treemap package. This data.frame is in table format, meaning that each row will represent a leaf in the data.tree structure:

library(treemap)
data(GNI2014)
head(GNI2014)
##   iso3          country     continent population    GNI
## 3  BMU          Bermuda North America      67837 106140
## 4  NOR           Norway        Europe    4676305 103630
## 5  QAT            Qatar          Asia     833285  92200
## 6  CHE      Switzerland        Europe    7604467  88120
## 7  MAC Macao SAR, China          Asia     559846  76270
## 8  LUX       Luxembourg        Europe     491775  75990

Let’s convert that into a data.tree structure! We start by defining a pathString. The pathString describes the hierarchy by defining a path from the root to each leaf. In this example, the hierarchy comes very naturally:

GNI2014$pathString <- paste("world", 
                            GNI2014$continent, 
                            GNI2014$country, 
                            sep = "/")

Once our pathString is defined, conversion to Node is very easy:

population <- as.Node(GNI2014)
print(population, "iso3", "population", "GNI", limit = 20)
##                                 levelName iso3 population    GNI
## 1  world                                               NA     NA
## 2   ¦--North America                                   NA     NA
## 3   ¦   ¦--Bermuda                         BMU      67837 106140
## 4   ¦   ¦--United States                   USA  313973000  55200
## 5   ¦   ¦--Canada                          CAN   33487208  51630
## 6   ¦   ¦--Bahamas, The                    BHS     309156  20980
## 7   ¦   ¦--Trinidad and Tobago             TTO    1310000  20070
## 8   ¦   ¦--Puerto Rico                     PRI    3971020  19310
## 9   ¦   ¦--Barbados                        BRB     284589  15310
## 10  ¦   ¦--St. Kitts and Nevis             KNA      40131  14920
## 11  ¦   ¦--Antigua and Barbuda             ATG      85632  13300
## 12  ¦   ¦--Panama                          PAN    3360474  11130
## 13  ¦   ¦--Costa Rica                      CRI    4253877  10120
## 14  ¦   ¦--Mexico                          MEX  111211789   9870
## 15  ¦   ¦--Grenada                         GRD      90739   7910
## 16  ¦   ¦--St. Lucia                       LCA     160267   7260
## 17  ¦   ¦--Dominica                        DMA      72660   6930
## 18  ¦   ¦--St. Vincent and the Grenadines  VCT     104574   6610
## 19  ¦   ¦--Dominican Republic              DOM    9650054   6040
## 20  ¦   °--... 7 nodes w/ 0 sub                        NA     NA
## 21  °--... 6 nodes w/ 171 sub                          NA     NA

This is a simple example, and more options are available. Type ?FromDataFrameTable for all the details.

Create a tree from a file

Ofte