Introduction to crandep

2020-05-12

This vignette provides an introduction to the functions facilitating the analysis of the dependencies of CRAN packages.

library(crandep)
library(dplyr)
library(ggplot2)
library(igraph)

One type of dependencies

To obtain a particular kind of dependency of a package, we can use the function get_dep_all(), which takes the package name and the type of dependency as the first and second arguments, respectively. Currently, the second argument accepts Depends, Imports, LinkingTo, Suggests, Reverse_depends, Reverse_imports, Reverse_linking_to, and Reverse_suggests, or any variation in letter case.

get_dep_all("dplyr", "Imports")
#>  [1] "ellipsis"   "assertthat" "glue"       "magrittr"   "methods"   
#>  [6] "pkgconfig"  "R6"         "Rcpp"       "rlang"      "tibble"    
#> [11] "tidyselect" "utils"
get_dep_all("MASS", "depends")
#> [1] "grDevices" "graphics"  "stats"     "utils"
get_dep_all("MASS", "dePends") # should give same result
#> [1] "grDevices" "graphics"  "stats"     "utils"

Imports and Depends are the most common types of dependencies in R packages, but there are other types such as Suggests. For more information on different types of dependencies, see the official guidelines and http://r-pkgs.had.co.nz/description.html.

Multiple types of dependencies

As the information on all dependencies of one package is on the same page on CRAN, we can use get_dep_df() instead of get_dep_all() to avoid scraping the same page multiple times. The output will be a data frame instead of a character vector.

get_dep_df("dplyr", c("imports", "LinkingTo"))
#>     from         to       type reverse
#> 1  dplyr   ellipsis    imports   FALSE
#> 2  dplyr assertthat    imports   FALSE
#> 3  dplyr       glue    imports   FALSE
#> 4  dplyr   magrittr    imports   FALSE
#> 5  dplyr    methods    imports   FALSE
#> 6  dplyr  pkgconfig    imports   FALSE
#> 7  dplyr         R6    imports   FALSE
#> 8  dplyr       Rcpp    imports   FALSE
#> 9  dplyr      rlang    imports   FALSE
#> 10 dplyr     tibble    imports   FALSE
#> 11 dplyr tidyselect    imports   FALSE
#> 12 dplyr      utils    imports   FALSE
#> 13 dplyr         BH linking_to   FALSE
#> 14 dplyr      plogr linking_to   FALSE
#> 15 dplyr       Rcpp linking_to   FALSE

The column type is the type of the dependency converted to lower case. Also, LinkingTo is now converted to linking_to for consistency. For the four reverse dependencies, the substring "reverse_" will not be shown in type; instead the reverse column will be TRUE. This can be illustrated by the following:

get_dep_all("abc", "depends")
#> [1] "abc.data" "nnet"     "quantreg" "MASS"     "locfit"
get_dep_all("abc", "reverse_depends")
#> [1] "abctools" "EasyABC"
get_dep_df("abc", c("depends", "reverse_depends"))
#>   from       to    type reverse
#> 1  abc abc.data depends   FALSE
#> 2  abc     nnet depends   FALSE
#> 3  abc quantreg depends   FALSE
#> 4  abc     MASS depends   FALSE
#> 5  abc   locfit depends   FALSE
#> 6  abc abctools depends    TRUE
#> 7  abc  EasyABC depends    TRUE
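
The naming convention above can be mimicked in a few lines of base R. The following is a minimal sketch and not the code used by the package: the helper normalise_type() is hypothetical and only illustrates how a user-supplied type might be mapped to the type and reverse columns.

```r
# Hypothetical helper (not part of crandep): map a user-supplied dependency
# type to the lower-case `type` and the logical `reverse` described above.
normalise_type <- function(x) {
  x <- tolower(x)                                        # letter case is ignored
  x <- sub("linkingto", "linking_to", x, fixed = TRUE)   # LinkingTo -> linking_to
  data.frame(
    type = sub("^reverse_", "", x),                      # drop the "reverse_" prefix
    reverse = startsWith(x, "reverse_"),                 # flag reverse dependencies
    stringsAsFactors = FALSE
  )
}

normalise_type(c("Imports", "LinkingTo", "Reverse_depends", "reverse_linking_to"))
```

Each input is thus reduced to one of four values of type, with the direction kept separately in reverse.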

Theoretically, for each forward dependency

#>   from to type reverse
#> 1    A  B    c   FALSE

there should be an equivalent reverse dependency

#>   from to type reverse
#> 1    B  A    c    TRUE

Aligning the type in the forward and reverse dependencies enables this to be checked easily.
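
As a rough illustration of such a check, the sketch below uses a made-up edge list in the same layout as the output of get_dep_df(). The package names A, B and C and the helper key() are assumptions for illustration only. Each reverse row is rewritten in the forward orientation by swapping from and to, after which the two sets of edges can be compared directly.

```r
# Made-up edge list in the layout returned by get_dep_df().
deps <- data.frame(
  from    = c("A", "B", "C"),
  to      = c("B", "A", "A"),
  type    = c("imports", "imports", "imports"),
  reverse = c(FALSE, TRUE, TRUE),
  stringsAsFactors = FALSE
)

forward <- deps[!deps$reverse, ]
rev_dep <- deps[deps$reverse, ]

# Rewrite each reverse row in the forward orientation (swap from and to).
flipped <- data.frame(from = rev_dep$to, to = rev_dep$from, type = rev_dep$type,
                      stringsAsFactors = FALSE)

# Hypothetical helper: one key per edge, so that edges can be matched.
key <- function(d) paste(d$from, d$to, d$type, sep = "|")

# Does every forward edge have a reverse counterpart?
all(key(forward) %in% key(flipped))

# Reverse edges whose forward counterpart is missing from the toy data.
flipped[!key(flipped) %in% key(forward), ]
```

In this toy data the forward edge from A to B is matched, while the reverse row for C has no forward counterpart.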

Building and visualising a dependency network

To build a dependency network, we have to obtain the dependencies for multiple packages. For illustration, we choose the core packages of the tidyverse, and find out what each package Imports. We put all the dependencies into one data frame, in which the package in the from column imports the package in the to column. This is essentially the edge list of the dependency network.

df0.imports <- rbind(
    get_dep_df("ggplot2", "Imports"),
    get_dep_df("dplyr", "Imports"),
    get_dep_df("tidyr", "Imports"),
    get_dep_df("readr", "Imports"),
    get_dep_df("purrr", "Imports"),
    get_dep_df("tibble", "Imports"),
    get_dep_df("stringr", "Imports"),
    get_dep_df("forcats", "Imports")
)
head(df0.imports)
#>      from        to    type reverse
#> 1 ggplot2    digest imports   FALSE
#> 2 ggplot2      glue imports   FALSE
#> 3 ggplot2 grDevices imports   FALSE
#> 4 ggplot2      grid imports   FALSE
#> 5 ggplot2    gtable imports   FALSE
#> 6 ggplot2   isoband imports   FALSE
tail(df0.imports)
#>       from       to    type reverse
#> 61 stringr magrittr imports   FALSE
#> 62 stringr  stringi imports   FALSE
#> 63 forcats ellipsis imports   FALSE
#> 64 forcats magrittr imports   FALSE
#> 65 forcats    rlang imports   FALSE
#> 66 forcats   tibble imports   FALSE

With the help of the ‘igraph’ package, we can use this data frame to build a graph object that represents the dependency network.

g0.imports <- igraph::graph_from_data_frame(df0.imports)
set.seed(1457L)
old.par <- par(mar = rep(0.0, 4))
plot(g0.imports, vertex.label.cex = 1.5)
par(old.par)

The nature of a dependency network makes it a directed acyclic graph (DAG). We can use the ‘igraph’ function is_dag() to check.

igraph::is_dag(g0.imports)
#> [1] TRUE

Note that this holds only for Imports and Depends, because of the nature of these two types of dependencies. A network of, for example, Suggests need not be acyclic.
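
To see what a violation looks like, consider a small made-up example in which package A suggests package B and B suggests A. The names are hypothetical and the data are not taken from CRAN; the sketch only shows how is_dag() responds to a cycle.

```r
# Made-up packages: A suggests B, and B suggests A, which forms a cycle.
df.suggests <- data.frame(from = c("A", "B"), to = c("B", "A"),
                          stringsAsFactors = FALSE)
g.suggests <- igraph::graph_from_data_frame(df.suggests)
igraph::is_dag(g.suggests)
#> [1] FALSE
```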

Boundary and giant component

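It is possible to set a boundary on the nodes to which the edges are directed, using the function df_to_graph(). The second argument takes a data frame that lists such nodes in the column name. Unless gc = FALSE is specified, only the giant (largest connected) component of the resulting graph is kept, which is why stringr, which is not connected to the other core packages by Imports, is absent from the results below.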

df0.nodes <- data.frame(name = c("ggplot2", "dplyr", "tidyr", "readr", "purrr", "tibble", "stringr", "forcats"), stringsAsFactors = FALSE)
g0.core <- df_to_graph(df0.imports, df0.nodes)
set.seed(259L)
old.par <- par(mar = rep(0.0, 4))
plot(g0.core, vertex.label.cex = 1.5)
par(old.par)
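
Conceptually, setting a boundary amounts to keeping only the edges that point to a node in the given list. The base R sketch below shows this on a made-up edge list. It is an illustration of the idea under that assumption, not the implementation of df_to_graph(), which also converts the result to an ‘igraph’ object.

```r
# Toy edge list; rlang is outside the boundary defined by `nodes`.
edges <- data.frame(
  from = c("dplyr", "dplyr", "tidyr"),
  to   = c("tibble", "rlang", "dplyr"),
  stringsAsFactors = FALSE
)
nodes <- data.frame(name = c("dplyr", "tidyr", "tibble"), stringsAsFactors = FALSE)

# Keep only the edges directed to a node inside the boundary.
edges.core <- edges[edges$to %in% nodes$name, ]
edges.core
```

Here the edge from dplyr to rlang is dropped, while the two edges between packages inside the boundary are kept.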

Topological ordering of nodes

Since the networks of Imports or Depends are DAGs, we can obtain a topological ordering of the packages using, for example, Kahn’s (1962) sorting algorithm.

topo_sort_kahn(g0.core)
#>        id id_num
#> 1 forcats      1
#> 2 ggplot2      2
#> 3   readr      3
#> 4   tidyr      4
#> 5   dplyr      5
#> 6   purrr      6
#> 7  tibble      7

In the topological ordering, represented by the column id_num, a low (high) number means being at the front (back) of the ordering. If package A Imports package B, i.e. there is a directed edge from A to B, then A comes before B in the ordering. As the package ‘tibble’ doesn’t import any of the other core packages but is imported by most of them, it naturally goes to the back of the ordering. The topological ordering of a DAG may not be unique, and other admissible orderings can be obtained by setting random = TRUE in the function:

set.seed(387L); topo_sort_kahn(g0.core, random = TRUE)
#>        id id_num
#> 1 ggplot2      1
#> 2   readr      2
#> 3 forcats      3
#> 4   tidyr      4
#> 5   purrr      5
#> 6   dplyr      6
#> 7  tibble      7
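
For readers unfamiliar with the algorithm, the following is a minimal base R sketch of Kahn’s method on an edge list. The function kahn_sort() and the toy edges are assumptions for illustration and are not the code behind topo_sort_kahn(). The idea is to repeatedly take a node with no incoming edges, place it next in the ordering, and remove its outgoing edges.

```r
# Hypothetical sketch of Kahn's algorithm; `edges` has columns `from` and `to`.
kahn_sort <- function(edges) {
  from <- as.character(edges$from)
  to <- as.character(edges$to)
  nodes <- unique(c(from, to))
  # In-degree: number of edges pointing to each node.
  indeg <- setNames(as.integer(table(factor(to, levels = nodes))), nodes)
  remaining <- rep(TRUE, length(from))   # edges not yet removed
  queue <- nodes[indeg == 0L]            # nodes with no incoming edges
  ordering <- character(0)
  while (length(queue) > 0L) {
    v <- queue[1L]
    queue <- queue[-1L]
    ordering <- c(ordering, v)
    # Remove the outgoing edges of v and update the in-degrees.
    out <- which(remaining & from == v)
    for (e in out) {
      w <- to[e]
      indeg[w] <- indeg[w] - 1L
      if (indeg[w] == 0L) queue <- c(queue, w)
    }
    remaining[out] <- FALSE
  }
  # Any edge left over means that the graph contains a cycle.
  if (any(remaining)) stop("the graph has a cycle, so no topological ordering exists")
  data.frame(id = ordering, id_num = seq_along(ordering), stringsAsFactors = FALSE)
}

# Toy edges: an edge from A to B means that A imports B.
edges <- data.frame(
  from = c("forcats", "dplyr", "tidyr", "tidyr"),
  to   = c("tibble", "tibble", "dplyr", "tibble"),
  stringsAsFactors = FALSE
)
kahn_sort(edges)
```

Choosing the next node at random among those with no incoming edges, instead of always taking the first one, would give other admissible orderings.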

We can also apply the topological sorting to the larger dependency network.

df0.topo <- topo_sort_kahn(g0.imports)
head(df0.topo)
#>        id id_num
#> 1 forcats      1
#> 2 ggplot2      2
#> 3   readr      3
#> 4 stringr      4
#> 5   tidyr      5
#> 6  digest      6
tail(df0.topo)
#>           id id_num
#> 33   methods     33
#> 34    pillar     34
#> 35 pkgconfig     35
#> 36     rlang     36
#> 37     utils     37
#> 38     vctrs     38

The dependency network of all CRAN packages

Ultimately, we can use get_dep_df() to obtain all dependencies of all packages available on CRAN. This package provides an example dataset cran_dependencies that contains all such dependencies as of 2020-05-09.

data(cran_dependencies)
cran_dependencies
#> # A tibble: 211,381 x 4
#>    from  to             type     reverse
#>    <chr> <chr>          <chr>    <lgl>  
#>  1 A3    xtable         depends  FALSE  
#>  2 A3    pbapply        depends  FALSE  
#>  3 A3    randomForest   suggests FALSE  
#>  4 A3    e1071          suggests FALSE  
#>  5 aaSEA DT             imports  FALSE  
#>  6 aaSEA networkD3      imports  FALSE  
#>  7 aaSEA shiny          imports  FALSE  
#>  8 aaSEA shinydashboard imports  FALSE  
#>  9 aaSEA magrittr       imports  FALSE  
#> 10 aaSEA Bios2cor       imports  FALSE  
#> # … with 211,371 more rows

We can build the dependency network in the same way as above. Furthermore, we can verify that the forward and reverse dependency networks are (almost) the same.

g0.depends <- cran_dependencies %>%
    dplyr::filter(type == "depends" & !reverse) %>%
    df_to_graph(nodelist = dplyr::rename(cran_dependencies, name = from))
g0.rev_depends <- cran_dependencies %>%
    dplyr::filter(type == "depends" & reverse) %>%
    df_to_graph(nodelist = dplyr::rename(cran_dependencies, name = from))
g0.depends
#> IGRAPH 4f80ca7 DN-- 4810 8070 -- 
#> + attr: name (v/c), type (e/c), reverse (e/l)
#> + edges from 4f80ca7 (vertex names):
#>  [1] A3         ->xtable   A3         ->pbapply  abc        ->abc.data
#>  [4] abc        ->nnet     abc        ->quantreg abc        ->MASS    
#>  [7] abc        ->locfit   abcdeFBA   ->Rglpk    abcdeFBA   ->rgl     
#> [10] abcdeFBA   ->corrplot abcdeFBA   ->lattice  ABCp2      ->MASS    
#> [13] abctools   ->abc      abctools   ->abind    abctools   ->plyr    
#> [16] abctools   ->Hmisc    abd        ->nlme     abd        ->lattice 
#> [19] abd        ->mosaic   abn        ->nnet     abn        ->MASS    
#> [22] abn        ->lme4     abodOutlier->cluster  AbSim      ->ape     
#> + ... omitted several edges
g0.rev_depends
#> IGRAPH 4678a3b DN-- 4810 8070 -- 
#> + attr: name (v/c), type (e/c), reverse (e/l)
#> + edges from 4678a3b (vertex names):
#>  [1] abc     ->abctools   abc     ->EasyABC    abc.data->abc       
#>  [4] abd     ->tigerstats abind   ->abctools   abind   ->BCBCSF    
#>  [7] abind   ->CPMCGLM    abind   ->depth      abind   ->dgmb      
#> [10] abind   ->dynamo     abind   ->fractaldim abind   ->informR   
#> [13] abind   ->interplot  abind   ->magic      abind   ->mlma      
#> [16] abind   ->mlogitBMA  abind   ->multicon   abind   ->MultiPhen 
#> [19] abind   ->multipol   abind   ->mvmesh     abind   ->mvSLOUCH  
#> [22] abind   ->plfm      
#> + ... omitted several edges

Their size (number of edges) and order (number of nodes) should be very close if not identical to each other. Because of the dependency direction, their edge lists should be the same but with the column names from and to swapped.

External reverse dependencies & defunct packages

One may notice that there are external reverse dependencies which won’t appear in the forward dependencies if the scraping is limited to CRAN packages. We can find these external reverse dependencies by setting nodelist = NULL in df_to_graph():

df1.rev_depends <- cran_dependencies %>%
    dplyr::filter(type == "depends" & reverse) %>%
    df_to_graph(nodelist = NULL, gc = FALSE) %>%
    igraph::as_data_frame() # to obtain the edge list
df1.depends <- cran_dependencies %>%
    dplyr::filter(type == "depends" & !reverse) %>%
    df_to_graph(nodelist = NULL, gc = FALSE) %>%
    igraph::as_data_frame()
dfa.diff.depends <- dplyr::anti_join(
    df1.rev_depends,
    df1.depends,
    c("from" = "to", "to" = "from")
)
head(dfa.diff.depends)
#>    from          to    type reverse
#> 1 abind      baySeq depends    TRUE
#> 2 abind      CNORdt depends    TRUE
#> 3 abind  FISHalyseR depends    TRUE
#> 4 abind     flowMap depends    TRUE
#> 5 abind    riboSeqR depends    TRUE
#> 6 abind RNAinteract depends    TRUE

This means we are extracting the reverse dependencies whose forward equivalents are not listed. The column to shows the packages external to CRAN, such as those on Bioconductor. On the other hand, if we apply dplyr::anti_join() with the order of the two edge lists switched,

dfb.diff.depends <- dplyr::anti_join(
    df1.depends,
    df1.rev_depends,
    c("from" = "to", "to" = "from")
)
head(dfb.diff.depends)
#>                 from       to    type reverse
#> 1           abctools parallel depends   FALSE
#> 2                abd     grid depends   FALSE
#> 3 AcceptanceSampling  methods depends   FALSE
#> 4 AcceptanceSampling    stats depends   FALSE
#> 5            accrued     grid depends   FALSE
#> 6               acid    stats depends   FALSE

the column to lists the packages that are not (or no longer) on the CRAN page of available packages. These are either defunct packages or core packages that ship with R, such as grid and stats.

Summary statistics

We can also obtain the degree for each package and each type:

df0.summary <- dplyr::count(cran_dependencies, from, type, reverse)
df0.summary
#> # A tibble: 34,861 x 4
#>    from        type       reverse     n
#>    <chr>       <chr>      <lgl>   <int>
#>  1 A3          depends    FALSE       2
#>  2 A3          suggests   FALSE       2
#>  3 ABACUS      imports    FALSE       2
#>  4 ABACUS      suggests   FALSE       2
#>  5 ABC.RAP     imports    FALSE       3
#>  6 ABC.RAP     suggests   FALSE       2
#>  7 ABCanalysis imports    FALSE       1
#>  8 ABCanalysis suggests   TRUE        4
#>  9 ABCoptim    imports    FALSE       4
#> 10 ABCoptim    linking_to FALSE       1
#> # … with 34,851 more rows
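
To make the meaning of the degree concrete, the counting can be reproduced on a made-up edge list with base R. The sketch below uses aggregate() rather than dplyr::count(), and the package names are hypothetical. The degree n is the number of rows for each combination of package, type and direction, and tabulating the degrees once more gives the number of packages nn that have each degree.

```r
# Made-up edge list in the layout of cran_dependencies.
deps <- data.frame(
  from    = c("A", "A", "A", "B", "B"),
  to      = c("B", "C", "D", "C", "A"),
  type    = c("imports", "imports", "suggests", "imports", "imports"),
  reverse = c(FALSE, FALSE, FALSE, FALSE, TRUE),
  stringsAsFactors = FALSE
)

# Degree: number of rows per package, type and direction.
deps$n <- 1L
deg <- aggregate(n ~ from + type + reverse, data = deps, FUN = sum)
deg

# Frequency of each degree within a type and direction.
deg$nn <- 1L
freq <- aggregate(nn ~ type + reverse + n, data = deg, FUN = sum)
freq
```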

We can look at the “winner” in each of the reverse dependencies:

df0.summary %>%
    dplyr::filter(reverse) %>%
    dplyr::group_by(type) %>%
    dplyr::top_n(1, n)
#> # A tibble: 4 x 4
#> # Groups:   type [4]
#>   from    type       reverse     n
#>   <chr>   <chr>      <lgl>   <int>
#> 1 MASS    depends    TRUE      455
#> 2 Rcpp    linking_to TRUE     2082
#> 3 ggplot2 imports    TRUE     2038
#> 4 knitr   suggests   TRUE     5806

This is not surprising given the nature of each package. To take the summarisation one step further, we can obtain the frequencies of the degrees, and visualise the empirical degree distribution neatly on the log-log scale:

df1.summary <- df0.summary %>%
    dplyr::count(type, reverse, n, name = "nn") # nn: number of packages with degree n
gg0.summary <- df1.summary %>%
    dplyr::mutate(reverse = ifelse(reverse, "reverse", "forward")) %>%
    ggplot2::ggplot() +
    ggplot2::geom_point(ggplot2::aes(n, nn)) +
    ggplot2::facet_grid(type ~ reverse) +
    ggplot2::scale_x_log10() +
    ggplot2::scale_y_log10() +
    ggplot2::labs(x = "Degree", y = "Number of packages") +
    ggplot2::theme_bw(20)
gg0.summary

This suggests that the degree distributions of the reverse dependencies, in particular Reverse_depends and Reverse_imports, approximately follow a power law, a pattern that is observed empirically in networks in various academic fields.
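
The reason for using the log-log scale is that a power law, in which the frequency is proportional to the degree raised to a negative exponent, appears as a straight line whose slope is that exponent. The sketch below illustrates this with synthetic frequencies that follow an exact power law with exponent 2. The numbers are made up and are not the CRAN data, and a least-squares line on the log-log scale is only a rough visual diagnostic, not a rigorous way of fitting a power law to real data.

```r
# Synthetic frequencies following an exact power law with exponent 2.
degree <- 1:50
frequency <- 1000 * degree^(-2)

# On the log-log scale the relationship is linear with slope -2.
fit <- lm(log10(frequency) ~ log10(degree))
slope <- unname(coef(fit)[2])
slope
```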

Going forward

Methods in social network analysis, such as community detection algorithms and/or stochastic block models, can be applied to study the properties of the dependency network. Ideally, by analysing the dependencies of all CRAN packages, we can obtain a bird’s-eye view of the ecosystem.