iGraphMatch

library(iGraphMatch)

General Description of Graph Matching

Graph matching is an increasingly important problem with applications in a wide variety of fields, including biology, neuroscience, pattern recognition, and machine learning. For example, as a fundamental tool for solving pattern recognition problems, graph matching is widely used in computer vision. There, one seeks a correspondence between local features of two images, which are labeled as nodes of the graphs; relational aspects between features are modeled by the edges.

While packages such as igraph and GraphM also have graph matching functionality, our goal is to provide a single centralized repository for graph matching that confronts many of its additional pathologies. The igraph package provides versatile tools for descriptive network analysis and graph visualization based on igraph objects in R, Python, and C/C++, but it does not focus on implementing the most commonly used and cutting-edge graph matching algorithms. In contrast to igraph, the iGraphMatch package is also more flexible in handling different types of graph objects, ranging from igraph objects to matrix objects. GraphM provides tools for approximately solving large-scale graph matching problems. It implements, in C/C++, a variety of methods proposed between 1999 and 2009, including the Umeyama algorithm, a linear programming approach, the RANK algorithm, the QCV (quadratic convex relaxation) algorithm, and the PATH (path following) algorithm. Its corresponding R package, RGraphM, has not been published yet.

What Is the Graph Matching Problem

The graph matching problem seeks an alignment between the vertices of two graphs whose labels have been shuffled. To formulate the problem: given two adjacency matrices \(A\) and \(B\) corresponding to graphs \(G1\) and \(G2\) with \(n\) nodes, the goal is to find the true permutation matrix \(P=\operatorname{argmin}_{P\in\Pi} \Vert A-PBP^T \Vert_F\), where \(\Pi\) denotes the set of all \(n \times n\) permutation matrices and \(\Vert \bullet \Vert_F\) denotes the Frobenius norm. In a mathematical sense, the graph matching problem is a quadratic assignment problem. Since the problem is NP-hard, research in graph matching has focused on developing more accurate or faster algorithms to approximately match two graphs.
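For concreteness, here is a toy illustration in R of the objective (an illustrative snippet, not package code): when \(B\) is an exact relabeling of \(A\), the correct permutation drives the objective to zero.

# Toy illustration: the objective ||A - PBP^T||_F is zero exactly
# when the permutation undoes the label shuffle.
A <- matrix(c(0,1,0, 1,0,1, 0,1,0), 3, 3)  # path graph on 3 vertices
P <- diag(3)[c(2, 1, 3), ]                 # permutation swapping vertices 1 and 2
B <- P %*% A %*% t(P)                      # B is A with shuffled labels
norm(A - t(P) %*% B %*% P, "F")            # 0: t(P) recovers the alignment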

Seeded graph matching has been extensively studied in the literature. Suppose we know the correspondence between a subset of \(s\) vertices in both graphs, called seeds. With these seeds available, the problem is to minimize \(\Vert A-(I\oplus P)B(I\oplus P)^T\Vert_F\) over all \(m\)-by-\(m\) permutation matrices \(P\), where \(m:=n-s\), \(\oplus\) is the direct sum, and \(I\) is the \(s\)-by-\(s\) identity matrix.

What Can the iGraphMatch Package Do

The iGraphMatch package aims to provide useful and convenient tools for problems related to graph matching, working with either igraph objects or matrix objects. Among its capabilities, the package lets you sample pairs of correlated random graphs, match two graphs with or without seeds, initialize the matching in several ways, and measure the goodness of a matching.

Outline of This Vignette

This document introduces iGraphMatch's basic set of tools by showing how to apply them to several problems in the graph matching field: the graph matching algorithms and random graph models in the background section, initialization methods for soft seeding, identification of core vertices, and graph matching with adaptive seeds.

Background

Graph Matching Algorithms

The graph matching algorithms currently implemented in iGraphMatch are the FAQ algorithm (graph_match_FW) and the convex relaxed algorithm (graph_match_convex). The seeded graph matching algorithm is an extension of the FAQ algorithm that incorporates seeds.

Formally, for any two \(n\)-by-\(n\) adjacency matrices \(A\) and \(B\) corresponding to two graphs \(G1\) and \(G2\) with \(n\) nodes, the graph matching problem is to minimize \(\Vert A-PBP^T \Vert_F^2\) over all \(n\)-by-\(n\) permutation matrices, where \(\Vert \bullet \Vert_F\) denotes the Frobenius norm; that is, to minimize the edge disagreements between \(G1\) and \(G2\). For any \(P\in\Pi\), where \(\Pi\) is the set of all permutation matrices, the original objective function of the graph matching problem can be expanded as \[\Vert A-PBP^T \Vert_F^2 = \Vert AP-PB \Vert^2_F = \Vert A \Vert^2_F + \Vert B \Vert^2_F - 2\langle AP,PB \rangle.\] The first equality holds because permutation matrices are orthogonal. Optimizing any of these equivalent forms of the objective function over all permutation matrices is an NP-hard quadratic assignment problem due to the combinatorial constraint \(P\in\Pi\).

Relaxation techniques can be applied by replacing the constraint set, all permutation matrices, with all doubly stochastic matrices, the convex hull of \(\Pi\). There are two ways to relax the original graph matching problem, corresponding to two equivalent forms of the objective function. When the relaxation is applied to the middle term, we get the convex relaxed graph matching problem: minimize \(\Vert AD-DB \Vert^2_F\) over all \(n\)-by-\(n\) doubly stochastic matrices. When the relaxation is applied to the third term, the graph matching problem becomes maximizing \(\mathrm{tr}\,BDAD^T\) over all \(n\)-by-\(n\) doubly stochastic matrices; we call this the indefinite relaxed graph matching problem because the Hessian of \(\mathrm{tr}\,BDAD^T\) is not necessarily positive definite. To clearly present the relationships between the original graph matching problem and these two relaxations, we list their descriptions and properties in the table below.

Graph Matching Problem | Objective Function | Constraints | Optimization Guarantees
Original form | \(\Vert A-PBP^T \Vert _F^2\) | \(P\in \Pi\) | NP-hard
Indefinite relaxed form | \(\mathrm{tr}\,BDAD^T\) | \(D\in \mathcal{D}\) | Local convergence
Convex relaxed form | \(\Vert AD-DB \Vert _F^2\) | \(D\in \mathcal{D}\) | Global convergence

The FAQ algorithm is an efficient approximate graph matching algorithm based on the Frank-Wolfe methodology, a gradient ascent approach: starting from a doubly stochastic matrix \(D^0\), each iteration computes the gradient of the objective at the current iterate, solves a linear assignment problem to find the best direction, and moves toward it via line search; the final iterate is projected back onto \(\Pi\). Experimental studies suggest that it is reasonable to set the number of iterations to fewer than 25; in the graph_match_FW function, the default setting for the max_iter parameter is 100. Previous studies have proved that if \(n\ge 100\) and \(G2\) is an isomorphic copy of \(G1\) with \(V(G2)\) a discrete-uniform random permutation of \(V(G1)\), then the probability that FAQ yields the correct alignment is nearly 1.
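As a minimal usage sketch (with illustrative parameter values), the iteration budget can be capped via max_iter:

set.seed(1)
pair <- sample_correlated_gnp_pair(n = 50, corr = .5, p = .5)
# cap the Frank-Wolfe iterations at 25 instead of the default 100
m <- graph_match_FW(pair$graph1, pair$graph2, seeds = 1:50 <= 3,
                    start = "bari", max_iter = 25)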

Finally, let us look at the optimization guarantees for the different relaxation methods. The convex relaxed problem can achieve the global minimum, but for the indefinite relaxed problem we can only employ the tools of continuous optimization to find a local maximum, so initialization matters. The resulting global or local optimum is then projected back onto \(\Pi\), yielding an approximate solution to the original optimization problem.

In the context of seeded graph matching, without loss of generality assume the correct matching is \(I\) and that the first \(s\) nodes are seeds. Suppose \(A\) and \(B\) are adjacency matrices of graphs \(G1\) and \(G2\) with \(n\) nodes. To formulate the seeded graph matching algorithm, denote the number of nonseeds by \(m:=n-s\), and partition the adjacency matrices \(A\) and \(B\) as \[A=\begin{bmatrix} A_{11} & A^T_{21} \\ A_{21} & A_{22} \end{bmatrix}\quad B=\begin{bmatrix} B_{11} & B^T_{21} \\ B_{21} & B_{22} \end{bmatrix}\] where \(A_{11},B_{11}\in \{0,1\}^{s\times s}\) are the adjacency matrices corresponding to seeds, \(A_{22},B_{22}\in \{0,1\}^{m\times m}\) are the adjacency matrices corresponding to nonseeds, and \(A_{21},B_{21}\in \{0,1\}^{m\times s}\) carry the nonseed-to-seed information. The seeded graph matching problem is then to find \(P=\operatorname{argmax}_{P\in \Pi}\mathrm{tr}\,A^T(I\oplus P)B(I\oplus P)^T\), where \(\Pi\) is the set of all \(m\)-by-\(m\) permutation matrices. To apply the Frank-Wolfe methodology, relax the maximization of \(\mathrm{tr}\,A^T(I\oplus P)B(I\oplus P)^T\) from all permutation matrices to all doubly stochastic matrices \(D\), forming a continuous optimization problem. Simplification yields the objective function \[f(D)=\mathrm{tr}\,A_{11}B_{11}+2\,\mathrm{tr}\,D^TA_{21}B_{21}^T+\mathrm{tr}\,A_{22}DB_{22}D^T\] with gradient \[\nabla f(D)=2A_{21}B_{21}^T+2A_{22}DB_{22},\] where \(D\) denotes a doubly stochastic matrix. Then apply the Frank-Wolfe methodology with this objective and gradient, optimizing over all \(m\)-by-\(m\) doubly stochastic matrices. In short, seeded graph matching is still based on the Frank-Wolfe methodology, only with a different objective function that incorporates seeds.
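To see where \(f(D)\) comes from, expand block-wise (assuming symmetric adjacency matrices): \[(I\oplus D)B(I\oplus D)^T=\begin{bmatrix} B_{11} & B_{21}^TD^T \\ DB_{21} & DB_{22}D^T \end{bmatrix},\] so \[\mathrm{tr}\,A^T(I\oplus D)B(I\oplus D)^T=\mathrm{tr}\,A_{11}B_{11}+\mathrm{tr}\,A_{21}^TDB_{21}+\mathrm{tr}\,A_{21}B_{21}^TD^T+\mathrm{tr}\,A_{22}DB_{22}D^T,\] and the two cross terms are equal by the cyclic and transpose invariance of the trace, giving the \(2\,\mathrm{tr}\,D^TA_{21}B_{21}^T\) term.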

Suppose we are matching two graphs, each with \(n\) vertices; the running time of both the FAQ algorithm and the convex relaxed algorithm is \(O(n^3)\) per iteration. In the iGraphMatch package, both the graph_match_FW function (the FAQ algorithm) and the graph_match_convex function (the convex relaxed algorithm) take a seeds argument to incorporate seed information.
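For example, both functions accept the same seeds specification (a minimal sketch; the parameter values are illustrative):

set.seed(2)
pair <- sample_correlated_gnp_pair(n = 20, corr = .5, p = .5)
seeds <- 1:20 <= 3    # the first three vertex pairs are seeds
m_fw <- graph_match_FW(pair$graph1, pair$graph2, seeds = seeds, start = "bari")
m_cv <- graph_match_convex(pair$graph1, pair$graph2, seeds = seeds)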

Correlated Erdős-Rényi Random Graphs

Presently, we describe the correlated Erdős-Rényi random graph model. This model is an extensively used theoretical framework within which many theorems are proved, and it is frequently used in simulations as well.

The correlated Erdős-Rényi random graph model is specified by three parameters: a positive integer \(n\), a real number \(p\in (0,1)\), and a real number \(\rho \in [0,1]\). Suppose we have two graphs \(G1\) and \(G2\) on a common vertex set of cardinality \(n\). For each pair of vertices \(\{v,v'\}\in {{V}\choose {2}}\), let \(\mathds{1} \{ \{v,v' \}\in E(Gi) \}\) denote the indicator of the event \(\{v,v'\}\in E(Gi)\), for \(i=1,2\). The model assumes each indicator \(\mathds{1} \{ \{v,v' \}\in E(Gi)\}\) follows a Bernoulli(\(p\)) distribution; we call \(p\) the edge probability, since it is the probability that an edge exists between \(v\) and \(v'\). The indicator variables are all independent within a graph, but for each pair \(\{ v,v' \}\in {{V}\choose {2}}\) the corresponding indicators across the two graphs have Pearson product-moment correlation coefficient \(\rho\). If \(\rho\) is 0, then \(G1\) and \(G2\) are independent; at the other extreme, if \(\rho\) is 1, then \(G1\) and \(G2\) are identical almost surely.
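As a minimal sketch of one standard construction with these properties (illustrative only, not necessarily the package's internal sampling scheme): draw the \(G1\) indicator as Bernoulli(\(p\)), then draw the \(G2\) indicator conditionally so that its marginal stays Bernoulli(\(p\)) and the correlation is \(\rho\).

# Illustrative sketch: a pair of rho-correlated Bernoulli(p) edge
# indicators for one vertex pair {v, v'}.
p <- 0.5; rho <- 0.5
x <- rbinom(1, 1, p)                                   # indicator of {v,v'} in E(G1)
p_cond <- ifelse(x == 1, p + rho * (1 - p), p * (1 - rho))
y <- rbinom(1, 1, p_cond)                              # indicator of {v,v'} in E(G2)
# marginally y ~ Bernoulli(p), and cor(x, y) = rho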

The random dot product graph (RDPG) model is another frequently used random graph model, which can be regarded as an extension of the correlated Erdős-Rényi model. In the RDPG model, each vertex is assigned a latent position vector in \(\mathds{R}^d\) drawn from a distribution \(F\), and for each pair of vertices \(\{v,v'\}\in {{V}\choose {2}}\) the probability of an edge occurring between them is determined by a Bernoulli trial with probability given by the dot product of their latent position vectors.

In the iGraphMatch package, sample_correlated_gnp_pair and sample_correlated_gnp_pair_w_junk are two useful functions for generating graph pairs from the correlated Erdős-Rényi model; the sample_correlated_rdpg function samples a pair of graphs from the random dot product graph model.
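For example (the argument values are illustrative; both calls mirror usage later in this vignette):

# sample a correlated G(n, p) pair, and a pair in which only the first
# ncore vertices have true matches (the rest are junk vertices)
pair <- sample_correlated_gnp_pair(n = 50, corr = .5, p = .5)
pair_junk <- sample_correlated_gnp_pair_w_junk(n = 50, corr = .5, p = .5, ncore = 45)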

Initialization Methods For Soft Seeding

Soft Seeding Graph Matching

In previous studies on seeded graph matching, authors often assume prior knowledge of the vertex correspondence, typically that part of the bijection between the two vertex sets is known. However, in many cases knowledge of seeds is quite limited: there might be errors in the seeds, or we may only know the range of possible matches for a seed. If we still incorporate this information as hard seeds and keep them fixed during graph matching, incorrect information can yield bad matching results. As a result, we propose another way to make use of such information: incorporate it into the initialization of \(D^0\) in the first step of the FAQ algorithm. We call this kind of information soft seeds.

Soft seeds can improve graph matching performance when the prior information is uncertain, since soft seeds are not fixed. Suppose \(n_{ss}\) is the number of soft seeds and \(n_{ns}\) is the number of nonseeds. The soft seeding algorithm is based on the Frank-Wolfe methodology described above. Without loss of generality, assume the first \(n_{ss}\) of the non-hard-seed vertices are soft seeds. In step 1, soft seeding initializes the start matrix at \(D^0=\begin{bmatrix} D_{n_{ss}}^0 & \textbf{0}^T \\ \textbf{0} & D_{n_{ns}}^0 \end{bmatrix}P\), where \(D_{n_{ss}}^0\) is an \(n_{ss}\)-by-\(n_{ss}\) doubly stochastic matrix encoding the soft seed correspondences from graph \(A\) to graph \(B\); it can be an identity matrix or any doubly stochastic matrix representing many-to-many soft seed information. \(\textbf{0}\) is the \(n_{ns}\)-by-\(n_{ss}\) zero matrix, and \(D_{n_{ns}}^0\) is an \(n_{ns}\)-by-\(n_{ns}\) doubly stochastic matrix for the vertices without soft seeds. \(P\), a permutation matrix multiplied on the right of the block matrix, permutes the columns.

Initialization Methods

Soft seeds are implemented by choosing a specific initialization for the Frank-Wolfe iterations, that is, by initializing the matrix \(D^0\). Notice that \(D^0\) is composed of two doubly stochastic matrices, \(D_{n_{ss}}^0\) and \(D_{n_{ns}}^0\), which are initialized separately. \(D_{n_{ss}}^0\) is specified by the information in the soft seeds. If the soft seeds are one-to-one, meaning each soft seed is matched to a specific vertex in the other graph, then \(D_{n_{ss}}^0 \in\{0,1\}^{n_{ss}\times n_{ss}}\). If instead the soft seeds are many-to-many, so that we only know probabilistic estimates of the correspondences, \(D_{n_{ss}}^0\) is a general doubly stochastic matrix. While \(D_{n_{ss}}^0\) is uniquely specified by the soft seeds, there are multiple ways to initialize \(D_{n_{ns}}^0\), and this choice affects the performance of soft seeding. Here we discuss three initialization methods: barycenter start, random doubly stochastic start, and convex start. A good choice of start matrix contributes both to finding the globally optimal permutation matrix and to reaching the result faster.

The most straightforward way is to initialize at the barycenter, indicating that each nonseed vertex is equally likely to be matched to any other nonseed vertex. Here \(n_{ns}\) counts all non-hard-seeded vertices and \(n_{ss}\) the soft seeds, so the block has dimension \(n_{ns}-n_{ss}\). The formula for the barycenter start is \[D_{n_{ns}}^0=D_{bari}=\frac{1}{n_{ns}-n_{ss}}\mathds{1}\mathds{1}^T,\] where \(\mathds{1}\) is the all-ones vector of length \(n_{ns}-n_{ss}\).
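In the toy example later in this section (7 nonseeds, 2 soft seeds), the barycenter block can be built directly:

# 7 nonseeds and 2 soft seeds give a 5 x 5 block with every entry
# 1/5 = 0.2, matching the entries in the init_start output shown below
nns <- 7; nss <- 2
D_bari <- matrix(1 / (nns - nss), nns - nss, nns - nss)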

Another approach is to initialize at a random doubly stochastic matrix, generated via the Sinkhorn iterative renormalization approach: starting from a random positive matrix, normalize its rows and columns alternately until it converges to a doubly stochastic matrix. Note that the expectation of the random doubly stochastic matrix is the barycenter.
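A minimal sketch of Sinkhorn renormalization (a hypothetical helper for illustration, not the package's internal code):

rds_sinkhorn <- function(n, tol = 1e-8, max_iter = 100) {
  D <- matrix(runif(n * n), n, n)    # random positive start
  for (i in seq_len(max_iter)) {
    D <- D / rowSums(D)              # normalize each row to sum to 1
    D <- t(t(D) / colSums(D))        # normalize each column to sum to 1
    if (max(abs(rowSums(D) - 1)) < tol) break
  }
  D
}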

Here we illustrate the convex initialization method in the general setting where we have both hard seeds and soft seeds; suppose the hard seeds provide a correct partial bijection between the two vertex sets. To generate the convex start matrix, we first run the convex relaxed graph matching algorithm, regarding all the soft seeds as hard seeds, so the correspondence of soft seeds does not change during this initial matching. As a result, the convex initialization method can only be applied to one-to-one soft seeds; it does not make sense for many-to-many soft seeds. We then have \(n_s+n_{ss}\) seeds, where \(n_s\) denotes the number of hard seeds and \(n_{ss}\) the number of soft seeds, and we denote the number of nonseeds by \(n_{ns}:=n-n_s-n_{ss}\). Following the seeded graph matching algorithm, we find the doubly stochastic matrix corresponding to the nonseeded vertices: \(\hat{D}_{n_{ns}}=\operatorname{argmin}_{D_{n_{ns}}\in\mathcal{D'}} \Vert A-(I_{n_s}\oplus (I_{n_{ss}}\oplus D_{n_{ns}})P)B(I_{n_s}\oplus (I_{n_{ss}}\oplus D_{n_{ns}})P)^T \Vert ^2_F\), where \(\mathcal{D'}\) denotes the set of all \(n_{ns}\)-by-\(n_{ns}\) doubly stochastic matrices, \(I_{n_s}\) is the \(n_s\)-by-\(n_s\) identity matrix representing the hard seeds, \(I_{n_{ss}}\) is the \(n_{ss}\)-by-\(n_{ss}\) identity matrix representing the one-to-one soft seeds, \(D_{n_{ns}}\) is the \(n_{ns}\)-by-\(n_{ns}\) block for the remaining nonseeds, and \(P\) is an \((n-n_s)\)-by-\((n-n_s)\) permutation matrix multiplied on the right of the block matrix to permute the columns. The convex start matrix is defined as \(D_{convex}:=(D_{n_{ss}}\oplus \hat{D}_{n_{ns}})P\), an \((n-n_s)\)-by-\((n-n_s)\) matrix.

Simulations

The soft seeding algorithm is easy to implement in R using existing functions in the iGraphMatch package: the main function is graph_match_FW, with the start method specified and with seeds corresponding to the hard seeds. As an example, we compare the random doubly stochastic start, the bari (barycenter) start, and the convex start discussed above in simulations on the correlated Erdős-Rényi model. To keep the results easy to display, we sample a graph pair with only 10 vertices from the correlated Erdős-Rényi model with edge probability 0.5 and correlation 0.5, and set the first three pairs of vertices to be hard seeds.

library(igraph)
library(iGraphMatch)

set.seed(8)
cgnp_pair <- sample_correlated_gnp_pair(n = 10, corr = .5, p = .5)
g1 <- cgnp_pair$graph1
g2 <- cgnp_pair$graph2

(seeds <- 1:10 <= 3)
#>  [1]  TRUE  TRUE  TRUE FALSE FALSE FALSE FALSE FALSE FALSE FALSE

Then we randomly pick two pairs of vertices as soft seeds. Note that soft seeds may be incorrect, but their correspondences can improve during the matching procedure. Let's first pick the soft seeds {4,4} and {8,6}, a combination of one good soft seed and one bad soft seed:

bad_soft_seeds <- rbind(c(4,4),c(8,6))

Initializing the different start matrices is very convenient in R with the iGraphMatch package. The function init_start returns an \(n_{ns}\)-by-\(n_{ns}\) matrix, where \(n_{ns}\) denotes the number of nonseed vertices.

nns <- 7
ns <- 3
(start_bari <- init_start(start = "bari", nns = nns, ns = ns, soft_seeds = bad_soft_seeds))
#> Sparse part
#> 7 x 7 sparse Matrix of class "dgCMatrix"
#>                   
#> [1,] 1 . . . . . .
#> [2,] . . . . . . .
#> [3,] . . . . . . .
#> [4,] . . . . . . .
#> [5,] . . 1 . . . .
#> [6,] . . . . . . .
#> [7,] . . . . . . .
#> plus left factor
#> 7 x 1 sparse Matrix of class "dgCMatrix"
#>       
#> [1,] .
#> [2,] 1
#> [3,] 1
#> [4,] 1
#> [5,] .
#> [6,] 1
#> [7,] 1
#> times right factor transpose
#> 7 x 1 sparse Matrix of class "dgCMatrix"
#>         
#> [1,] .  
#> [2,] 0.2
#> [3,] .  
#> [4,] 0.2
#> [5,] 0.2
#> [6,] 0.2
#> [7,] 0.2
set.seed(123)
(start_rds <- init_start(start = "rds", nns = nns, ns = ns, soft_seeds = bad_soft_seeds))
#> 7 x 7 sparse Matrix of class "dgCMatrix"
#>                                                               
#> [1,] 1 .          . .          .          .          .        
#> [2,] . 0.08224823 . 0.01678766 0.32382493 0.35940770 0.2177315
#> [3,] . 0.26796914 . 0.23130193 0.18235230 0.11682563 0.2015510
#> [4,] . 0.13781874 . 0.38747748 0.27018824 0.01979391 0.1847216
#> [5,] . .          1 .          .          .          .        
#> [6,] . 0.24665273 . 0.19846281 0.18927607 0.12792141 0.2376870
#> [7,] . 0.26531116 . 0.16597012 0.03435846 0.37605135 0.1583089
(start_convex <- init_start(start = "convex", nns = nns, ns = ns, soft_seeds = bad_soft_seeds, 
                           A = g1, B = g2, seeds = seeds))
#> Sparse part
#> 7 x 7 sparse Matrix of class "dgCMatrix"
#>                                                                      
#> [1,] 0.29659736 0.16740579 .         0.12411063 0.20217656 0.07348296
#> [2,] 0.05777703 0.14300415 .         .          0.07796460 0.63939292
#> [3,] 0.23157496 0.41937506 .         .          0.12471556 0.13051394
#> [4,] 0.09105435 0.05728915 0.2497145 0.49848492 0.05364411 .         
#> [5,] .          0.07032994 0.3898809 0.05364411 0.36359585 .         
#> [6,] 0.29561797 0.06519647 0.1334635 0.29554317 .          0.09343773
#> [7,] 0.02737833 0.04918226 0.2269410 .          0.14968614 0.03495526
#>                
#> [1,] 0.13622670
#> [2,] 0.05364411
#> [3,] 0.06560330
#> [4,] 0.02159578
#> [5,] 0.12254920
#> [6,] 0.08852395
#> [7,] 0.48363979
#> plus left factor
#> 7 x 1 sparse Matrix of class "dgCMatrix"
#>                
#> [1,] .         
#> [2,] 0.02821718
#> [3,] 0.02821718
#> [4,] 0.02821718
#> [5,] .         
#> [6,] 0.02821718
#> [7,] 0.02821718
#> times right factor transpose
#> 7 x 1 sparse Matrix of class "dgCMatrix"
#>         
#> [1,] .  
#> [2,] 0.2
#> [3,] .  
#> [4,] 0.2
#> [5,] 0.2
#> [6,] 0.2
#> [7,] 0.2

Then we implement seeded graph matching via the FAQ algorithm using the graph_match_FW function. To incorporate soft seeds, we specify the start argument accordingly. Note that in this example we assume the true permutation to be the identity matrix.

set.seed(123)
match_bari <- graph_match_FW(g1, g2, seeds, start = start_bari)
match_rds <- graph_match_FW(g1, g2, seeds, start = start_rds)
match_convex <- graph_match_FW(g1, g2, seeds, start = start_convex)
err_bari_bad <- mean(match_bari$corr$corr_A[!seeds] != match_bari$corr$corr_B[!seeds])
err_rds_bad <- mean(match_rds$corr$corr_A[!seeds] != match_rds$corr$corr_B[!seeds])
err_convex_bad <- mean(match_convex$corr$corr_A[!seeds] != match_convex$corr$corr_B[!seeds])

Since in this example one of the soft seeds is incorrect, intuitively not incorporating such incorrect information might yield a better matching. This motivates a comparative experiment on whether or not to incorporate these soft seeds. The R code for matching with only the hard seeds is the same as for soft seeding except for the specification of the start argument.

set.seed(123)
match_nss_bari <- graph_match_FW(g1, g2, seeds, start = "bari")
match_nss_rds <- graph_match_FW(g1, g2, seeds, start = "rds")
match_nss_convex <- graph_match_FW(g1, g2, seeds, start = "convex")

For completeness, we also include an example with two good soft seeds, {4,4} and {8,8}. We follow the same procedure as in the previous case and obtain the matching results for good soft seeds on the same graphs. Since the soft seeding code is otherwise identical, we skip it and only show the results comparing the three cases.

good_soft_seeds <- rbind(c(4,4),c(8,8))
Table 2: Matching Errors With Various Initialization Methods

                | bari      | rds       | convex
good soft seeds | 0.7142857 | 0.5714286 | 0.7142857
bad soft seeds  | 0.7142857 | 0.7142857 | 0.7142857
no soft seeds   | 0.8571429 | 0.8571429 | 0.8571429

Table 2 presents the matching error on the nonseeded vertices for the various initialization methods. Compared with the good/bad mixture and with matching without soft seeds, incorporating the good soft seeds decreases or maintains the matching error for every initialization method; the random doubly stochastic start with good soft seeds achieves the highest matching accuracy here. Using the partially good soft seeds also improves the matching performance of all three starts relative to using no soft seeds at all.

We also illustrate the differences between the initialization methods with experiments on larger graphs, under more parameter settings, and with more Monte Carlo replicates. Since the experiment is relatively time consuming for larger graphs, we recommend running larger simulations on a server, or using statistical computing techniques, e.g., a divide-and-conquer strategy, to speed up the experiments. Here we just show the resulting figure and skip the R code.

Figure 1 shows the average performance over 300 graph pairs sampled from the correlated Erdős-Rényi model with settings \(p\in\{0.1,0.2,0.5\}\), \(\rho=0.5\), \(n_v=250\), and \(n_s=10\). Wrong soft seeds are randomly sampled from the nonseed vertices. We observe that with a moderate number of incorrect soft seeds, the convex start outperforms the other two start methods; switching from a random doubly stochastic start to a convex start can reduce the matching error by 0.1 to 0.3. As a result, when implementing soft seeded graph matching, if the proportion of incorrect soft seeds is known to be not too high (approximately \(\le70\%\)), a convex start is recommended.

Identification of Core Vertices

Graph Matching with Junk Vertices Setting

The previous discussions all assume that every vertex in \(G1\) has a corresponding vertex in \(G2\); however, this is not always the case. Social networks offer a compelling example: matching across different social platforms (or within a single time-varying social network) requires the understanding that not all users participate in both networks.

Junk vertices are vertices that have no true alignment in the other graph. This can be modeled by correlated heterogeneous Erdős-Rényi random graphs.

Definition
For \(R\) and \(\Lambda\) symmetric, hollow matrices in [0,1]\(^{n\times n}\), we say \(A\), \(B\) are \(R\)-correlated heterogeneous Erdős-Rényi(\(\Lambda\)) random graphs (abbreviated \(CorrER(\Lambda, R)\)) if:
  1. \(A\) and \(B\) are marginally \(ER(\Lambda)\); i.e., for all \(u\), \(v\in [n]\), \(u<v\), \(A_{uv} \stackrel{\text{iid}}{\sim} Bern(\Lambda_{uv})\) and \(B_{uv} \stackrel{\text{iid}}{\sim} Bern(\Lambda_{uv})\), with \(A_{uv}=A_{vu}\) and \(B_{uv}=B_{vu}\).

  2. For all \(u\),\(v\),\(w\),\(r\in [n]\), \(u<v\), \(w<r\), it holds that \(A_{uv}\) and \(B_{wr}\) are independent unless \(u=w\) and \(v=r\), in which case the correlation between \(A_{uv}\) and \(B_{uv}\) is \(R_{u,v}\ge 0\).

At one extreme, if \(R=[0]_n\) then the graphs are independent \(ER(\Lambda)\), and at the other extreme, if \(R=J_n\) then \(A\) and \(B\) are isomorphic almost surely. This model is a generalization of the homogeneous correlated \(ER\) model discussed earlier, and similarly allows for the addition of “junk” vertices—those without a probabilistic match across graphs—by setting \(R=R_k\oplus[0]_{n-k}\) for some \(k\le n\).

We say \((A,B) \sim CorrER(\Lambda, R)\) are matchable if \(\operatorname{argmin}_{P\in \Pi(n)} \Vert AP-PB \Vert _F=\{I_n\}\), where \(I_n\) denotes the identity matrix. Our goals are to uncover the alignment between the core vertices, i.e., those \(u\in[n]\) with \(R_{u,v}>0\) for some \(v \in [n]\), and to detect which vertices are core versus junk.

Different Measures for Goodness of Matching

Having a measure for the goodness of a matching is important. Such a measure can be used to find core vertices in the junk vertex setting, and, since it lets us rank all the vertices by how well they are matched, it is also useful for choosing the best matched vertices. The best matched core vertices can then be used as additional seeds in the next iteration of graph matching; we call this process adaptive seeding, and it is introduced in a later section. Here we list three measures for goodness of matching: the row permutation statistic, the row difference, and the row correlation.

The row permutation statistic is based on a graph matching variant of the permutation test. For a vertex \(v\), test the hypotheses \[H_0^{(v)}: \forall P\in\mathcal{P}, u\neq v,\ corr(A_{vu},(PBP^T)_{vu})=0,\] \[H_A^{(v)}: \exists P\in\mathcal{P}, u\neq v,\ corr(A_{vu},(PBP^T)_{vu})>0.\] To test these, we use the relationship between edge-wise correlation and the number of errors induced between \(A\) and \(P^*B{P^*}^T\). Namely, if \(v\) is a core vertex (or \(v\) is correctly matched), then the number of errors induced by \(P^*\) across the neighborhoods of \(v\) in \(A\) and \(B\) (i.e., \(\Vert(AP^*-P^*B)_{v,\bullet}\Vert_1\)) should be significantly smaller than the number of errors induced by a randomly chosen permutation \(P\) (i.e., \(\Vert(AP-PB)_{v,\bullet}\Vert_1\)). With this in mind, we define \(\Delta_v(P)=\Vert(AP-PB)_{v,\bullet}\Vert_1\) and let \(\mathbb{E}_P\) and Var\(_P\) denote the conditional expectation and variance of these quantities with respect to uniform sampling of \(P\) over all permutation matrices. Inspired by the permutation test, we define the row permutation statistic as \[T_p(v,P^*):=\frac{\Delta_v(P^*)-\mathbb{E}_P\Delta_v(P)}{\sqrt{Var_P\Delta_v(P)}}.\] Intuitively, the smaller \(T_p(v,P^*)\), the more likely \(v\) is to be a core vertex (or the more likely we have found the true alignment of \(v\)), since a correct match drives \(\Delta_v(P^*)\) well below its null expectation.

The row difference is defined as the \(L_1\) norm of the difference between the rows of \(A\) and \(P^*B{P^*}^T\): \[T_d(v,P^*):=\Vert A_{v\bullet}-(P^*B{P^*}^T)_{v\bullet}\Vert_1.\] Intuitively, a correctly matched vertex \(v\) (or core vertex \(v\)) should induce a smaller \(T_d(v,P^*)\).

The row correlation is a statistic based on the correlation between \(A\) and \(P^*B{P^*}^T\), defined as \[T_c(v,P^*):=1-corr(A_{v\bullet},(P^*B{P^*}^T)_{v\bullet}).\] If \(v\) is correctly matched, the correlation between the neighborhoods of \(v\) in \(A\) and \(B\) should be larger, so a smaller \(T_c(v,P^*)\) value indicates a higher chance that vertex \(v\) is correctly matched (or that \(v\) is a core vertex).
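All three measures are available through the best_matches function used in the next section. As a sketch (applied to the match object created there; the measure names "row_diff" and "row_cor" are assumptions by analogy with "row_perm_stat"):

# rank vertices under each measure; smaller values indicate better matches
r_perm <- best_matches(A = g1, B = g2, match = match,
                       measure = "row_perm_stat", num = 5)
r_diff <- best_matches(A = g1, B = g2, match = match,
                       measure = "row_diff", num = 5)  # assumed measure name
r_cor  <- best_matches(A = g1, B = g2, match = match,
                       measure = "row_cor", num = 5)   # assumed measure name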

Algorithm for Finding Core Vertices

We next develop an approach to correctly identify core vertices, using a graph matching variant of the permutation test as the main tool. We use a test statistic \(T(\bullet)\) to rank vertices by how likely they are to be core vertices. Suppose \(A,B\in\{0,1\}^{(n_c+n_j)\times (n_c+n_j)}\) are the adjacency matrices of graphs \(G1\) and \(G2\), where \(n_c\) denotes the number of core vertices and \(n_j\) the number of junk vertices, and denote the available seeded vertices by \(S\). The algorithm for finding core vertices consists of the following steps. First, use the seeds \(S\) to match \(A\) and \(B\), yielding the optimal permutation \(P^*\). Then plug \(P^*\) in to compute \(T(v,P^*)\) for each vertex \(v\). Finally, rank all the vertices in increasing order of \(T(v,P^*)\), so that the likelihood of being a core vertex decreases down the ranking; the last \(n_j\) vertices are identified as junk vertices.

Simulations

We now implement the algorithm for finding core vertices in R with the iGraphMatch package. First we sample a pair of graphs with 50 vertices from the correlated Erdős-Rényi model, letting 5 of the 50 vertices be junk vertices, using the function sample_correlated_gnp_pair_w_junk. We also take the first 10 vertices to be seeds. Then we apply the FAQ algorithm to match the two graphs.

set.seed(5)
cgnp_pair <- sample_correlated_gnp_pair_w_junk(n = 50, corr = .5, p = .5, ncore = 45)
g1 <- cgnp_pair$graph1
g2 <- cgnp_pair$graph2

seeds <- 1 : 50 <= 10
core <- 1 : 50 <= 45
junk <- !core
non_seeds <- !seeds

match <- graph_match_FW(g1, g2, seeds, start = "rds")

Then we implement the algorithm for finding core vertices, using the row permutation statistic as our measure in this example. The main steps are to calculate the row permutation statistic for each nonseed vertex and to rank the vertices in increasing order of the statistic; both can be done with the best_matches function, which finds the best matched vertices and identifies core vertices. Since we want to rank all the nonseed vertices (seeded vertices are not ranked), we set the argument num to the number of nonseeds, i.e., the number of top-ranked vertices we want returned. The A_best element of the result gives the indices of vertices in \(G1\) in order of decreasing likelihood of being core vertices.

r <- best_matches(A = g1, B = g2, match = match,
                  measure = "row_perm_stat", num = sum(non_seeds))$A_best
r
#>  [1] 18 31 32 24 19 13 27 28 30 17 36 20 37 33 42 44 34 21 29 15 48 12 49 14 39
#> [26] 11 38 46 35 16 43 50 40 41 25 45 22 26 47 23
Table 3: Core Identification Precision

                              | k=1  | k=5 | k=10 | k=25      | k=35
core identification precision | 1    | 1   | 1    | 0.8687371 | 0.8125

Table 4: Junk Identification Precision

                              | k=1  | k=2 | k=3  | k=4  | k=5
junk identification precision | 0.25 | 0   | 0.08 | 0.21 | 0

Since there are 5 junk vertices, the last 5 vertices in the output are identified as junk vertices. Note that seeded vertices are not ranked. To evaluate the performance of the core identification algorithm in this example, we calculate the identification precision and summarize it in Tables 3 and 4. Table 3 shows the core identification precision at \(k=1,5,10,25,35\), where \(k\) is the number of top-ranked vertices considered; a higher rank indicates a higher estimated probability of being a core vertex. The algorithm achieves quite high precision for identifying core vertices. In contrast, Table 4 shows that the junk identification precision drops quickly as we average over more of the lowest-ranked vertices.

The figure shows simulation results for random graph pairs sampled from CorrER\((0.2, 0.5J_{n_c}\oplus[0]_{n_j})\) with \(n_j \in\{20,50\}\) and \(n_s\in\{5,10,20\}\). The figure averages the results of 1000 Monte Carlo replicates, plotting the mean precision at each rank, with lower ranks indicating core vertices and ranks near \(n\) indicating junk vertices. With a small number of seeded vertices, the core identification algorithm does not outperform the random baseline (indicated by the horizontal lines) by much. But with 20 seeds, the algorithm achieves substantially higher precision than the random baseline, especially for junk vertices. As expected, the algorithm performs better when there are fewer junk vertices.

Graph Matching with Adaptive Seeds

In this part, we show how to conduct FW graph matching while adding adaptive seeds iteratively. The idea is motivated by the fact that in many realistic applications we know only a small number of seeds while the number of vertices to be matched is large. Since acquiring seeds is costly and difficult, the following algorithm provides a feasible way to acquire additional seeds and use them in graph matching to achieve higher matching accuracy.

The inputs are the adjacency matrices of two graphs \(A\) and \(B\), each composed of \(n_c\) matchable core vertices and \(n_j\) junk vertices (vertices with no bijection in the other graph), together with some available seeds \(S\). The algorithm iterates the following steps: match \(A\) and \(B\) using the current seed set \(S\); rank the nonseed vertices with a goodness-of-matching measure; add the best matched vertices to \(S\); and repeat, as sketched below.
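A minimal sketch of one round-based implementation under the assumptions above. The two-column seed-pair format for seeds and the B_best element of best_matches are assumptions, and the fixed number of rounds is an illustrative stopping rule, not the package's built-in procedure:

adaptive_seed_match <- function(g1, g2, seed_pairs, n_new = 5, n_iter = 3) {
  match <- NULL
  for (i in seq_len(n_iter)) {
    # match with the current seed set (seed pairs as a two-column matrix)
    match <- graph_match_FW(g1, g2, seeds = seed_pairs, start = "bari")
    # rank nonseeds by goodness of matching (smaller statistic = better)
    best <- best_matches(A = g1, B = g2, match = match,
                         measure = "row_perm_stat", num = n_new)
    # promote the top-ranked matched pairs to seeds for the next round
    seed_pairs <- rbind(seed_pairs, cbind(best$A_best, best$B_best))
  }
  match
}

# e.g., starting from 10 known seed pairs:
# match_adapt <- adaptive_seed_match(g1, g2, seed_pairs = cbind(1:10, 1:10))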

Discussion

The primary goal of this package is to facilitate research related to graph matching by providing a variety of useful tools for several aspects of the graph matching problem, including graph matching algorithms, quality measures, and random graph models. In the iGraphMatch package we implement two graph matching algorithms, the FAQ algorithm and the convex relaxed algorithm; both are applicable to the more general setting with junk vertices. Moreover, the package can handle weighted graphs and graphs of different orders (by padding). In future work, we aim to implement more graph matching algorithms with better computational efficiency, which is especially important when matching two large-scale graphs.