With nominal data, analysis is usually limited to simple methods such as frequency statistics. The NIMAA package proposes a pipeline for nominal data mining, which can effectively find special relationships in the data.
It represents the relationship between two different types of nominal data as a bipartite graph, organizes that graph as an incidence matrix, and searches the matrix for sub-matrices that are as large as possible along different dimensions and contain no missing values.
Then, two unipartite (one-mode) graphs are obtained from the sub-matrix using a choice of projection methods. For either of them, NIMAA offers several clustering methods, and users can draw on external prior knowledge to select the result of the best clustering method as the ‘reference cluster’.
After that, we can apply several numerical imputation methods to the matrix with missing data, cluster each imputed matrix with the same method used for the ‘reference cluster’, and select the imputation method whose clustering result is closest to the ‘reference cluster’.
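The projection step can be pictured with a small base-R sketch. It assumes a made-up binary incidence matrix and uses a plain co-occurrence count, which is only one of many possible projection methods and is not necessarily what NIMAA does internally.

```r
# Toy binary incidence matrix (made-up data): rows are patients, columns are inhibitors.
B <- matrix(
  c(1, 1, 1,
    1, 1, 0,
    0, 0, 1),
  nrow = 3, byrow = TRUE,
  dimnames = list(c("p1", "p2", "p3"), c("A", "B", "C"))
)

# Row projection: entry [i, j] counts the inhibitors shared by patients i and j.
patient_proj <- B %*% t(B)
diag(patient_proj) <- 0  # drop self-links

# Column projection: entry [i, j] counts the patients shared by inhibitors i and j.
inhibitor_proj <- t(B) %*% B
diag(inhibitor_proj) <- 0

print(patient_proj)
print(inhibitor_proj)
```

Each projection is a symmetric weighted adjacency matrix, so either one can be handed to a clustering method on its own.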
Here we use the ‘beatAML’ dataset as an example. It has three columns: the first two are nominal data and the third is numerical data.
| Doramapimod (BIRB 796) | 11-00261 | 101.52120 |
Read the data from the package:
# read the data
beatAML_data <- NIMAA::beatAML
The plotInput() function prints the incidence matrix plot of the input data and returns that matrix.
NB: To keep the size of vignette small enough for CRAN rules, we won’t output the interactive figure here.
beatAML_incidence_matrix <- plotInput(
  x = beatAML_data,         # original data with 3 columns
  index_nominal = c(2, 1),  # the first two columns are nominal data
  index_numeric = 3,        # the third column is numeric data
  print_skim = FALSE,       # to check the skim output, set this to TRUE (the default)
  plot_weight = TRUE,       # when plotting the figure, show the weights
  verbose = FALSE           # do NOT save the figures to a local folder
)
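To make the idea of an incidence matrix concrete, here is a minimal base-R sketch that turns a three-column table into a matrix with NA for pairs that were never observed. The toy data, its column names, and the choice of patients as rows are assumptions for illustration. This is not the NIMAA implementation.

```r
# Toy long-format table (made-up values): two nominal columns and one numeric column.
toy <- data.frame(
  inhibitor  = c("A", "A", "B", "C"),
  patient_id = c("p1", "p2", "p1", "p2"),
  value      = c(10.5, 20.0, 30.25, 40.0)
)

# Rows come from the second nominal column, columns from the first.
# Pairs that never occur in the table stay NA, which marks a missing value.
inc <- tapply(toy$value, list(toy$patient_id, toy$inhibitor), FUN = mean)

print(inc)

# Share of cells without an observation.
missing_share <- mean(is.na(inc))
print(missing_share)
```

Using `mean` as the aggregation function only matters if a pair appears more than once, which does not happen in this toy table.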
With the incidence matrix in hand, we can easily form a bipartite graph from it. There are two ways to visualize the bipartite graph: static or interactive.
The plotBipartite() function prints the bipartite graph using the igraph package and returns the igraph graph object.
graph <- plotBipartite(inc_mat = beatAML_incidence_matrix, vertex.label.display = TRUE)
# show the igraph graph object
graph
#> IGRAPH ad6c4b9 UNWB 650 47636 --
#> + attr: name (v/c), type (v/l), shape (v/c), color (v/c), weight (e/n)
#> + edges from ad6c4b9 (vertex names):
#>  Alisertib (MLN8237) --11-00261 Barasertib (AZD1152-HQPA)--11-00261
#>  Bortezomib (Velcade) --11-00261 Canertinib (CI-1033) --11-00261
#>  Crenolanib --11-00261 CYT387 --11-00261
#>  Dasatinib --11-00261 Doramapimod (BIRB 796) --11-00261
#>  Dovitinib (CHIR-258) --11-00261 Erlotinib --11-00261
#>  Flavopiridol --11-00261 GDC-0941 --11-00261
#>  Gefitinib --11-00261 Go6976 --11-00261
#>  GW-2580 --11-00261 Idelalisib --11-00261
#> + ... omitted several edges
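The edges printed above correspond to the non-missing cells of the incidence matrix. The following base-R sketch, which uses a made-up toy matrix and does not reproduce how plotBipartite() builds its graph, lists those cells as a weighted edge list.

```r
# Toy incidence matrix (made-up values); NA means the pair was never measured.
inc <- matrix(
  c(10.5, 20.0, 30.25, NA, NA, 40.0),
  nrow = 2,
  dimnames = list(c("p1", "p2"), c("A", "B", "C"))
)

# Every non-missing cell becomes one edge between a row node and a column node.
idx <- which(!is.na(inc), arr.ind = TRUE)
edges <- data.frame(
  from   = rownames(inc)[idx[, "row"]],
  to     = colnames(inc)[idx[, "col"]],
  weight = inc[idx]
)

print(edges)
```

The number of rows in `edges` equals the number of observed pairs, which is what the edge count in the igraph summary reports for the real data.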
The plotBipartiteInteractive() function prints an interactive bipartite graph using the visNetwork package.
NB: To keep the size of the vignette small enough, we do not output the interactive figure here and show a screenshot instead.
plotBipartiteInteractive(inc_mat = beatAML_incidence_matrix)
The extractSubMatrix() function extracts sub-matrices that either contain no missing values or contain a specified proportion of missing values (the latter is not available for the elements-max matrix), depending on the user’s input. The result is also shown as a plotly figure.
The extraction process offers two types of data preprocessing: the first uses the original input matrix directly (row-wise), while the second uses the transposed matrix (column-wise).
After preprocessing, the matrix goes through a “three-step arrangement”:
the first step is row arranging;
the second step is column arranging;
the third step is total rearranging.
The function then looks for the largest possible sub-matrix (with no missing values or with a specified proportion of missing values) in the four dimensions, outputs the result, and prints the visualization.
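The idea of arranging first and then searching for a complete block can be sketched in base R. This is a deliberately simplified illustration and not NIMAA's algorithm: the helper name `largest_complete_block` is hypothetical, the arrangement only sorts rows and columns by their number of missing values, and the search only considers blocks anchored at the top-left corner and maximizes the number of elements.

```r
# Hypothetical helper: find the top-left NA-free block with the most elements
# after sorting rows and columns by how many values they are missing.
largest_complete_block <- function(m) {
  # Row arranging, then column arranging: fewest missing values first.
  m <- m[order(rowSums(is.na(m))), , drop = FALSE]
  m <- m[, order(colSums(is.na(m))), drop = FALSE]

  best <- c(rows = 0, cols = 0)
  for (r in seq_len(nrow(m))) {
    for (k in seq_len(ncol(m))) {
      block <- m[seq_len(r), seq_len(k), drop = FALSE]
      # Keep the block only if it is complete and larger than the current best.
      if (!anyNA(block) && r * k > prod(best)) {
        best <- c(rows = r, cols = k)
      }
    }
  }
  m[seq_len(best[["rows"]]), seq_len(best[["cols"]]), drop = FALSE]
}

# Toy matrix (made-up values) with two missing cells.
toy <- matrix(
  c(1, NA,  3,
    4,  5,  6,
    7,  8, NA,
    9, 10, 11),
  nrow = 4, byrow = TRUE,
  dimnames = list(paste0("p", 1:4), c("A", "B", "C"))
)

res <- largest_complete_block(toy)
print(res)
```

On this toy input the complete rows p2 and p4 move to the top, so the returned block has two rows and all three columns. A square variant would only need the extra condition `r == k` inside the loop.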
Here we extract two sub-matrices of the
sub_matrices <- extractSubMatrix(
  x = beatAML_incidence_matrix,
  shape = c("Square", "Rectangular_element_max"),  # the shapes wanted
  row.vars = "patient_id",
  col.vars = "inhibitor",
  plot_weight = TRUE,
  verbose = FALSE,
  print_skim = TRUE  # just to reduce the length of vignette
)
We can see that the output includes binmatnest2.temperature, which is the nestedness measure of the matrix. If the input is highly nested (nestedness temperature less than 1), we suggest dividing the data into different parts.