Conceptual Introduction to Dominance Analysis

Examples and Implementation with {domir}’s domin

Joseph Luchman

2021-09-27

Conceptual Introduction

Dominance Analysis

The focus of the {domir} package is on the dominance analysis (DA) methodology for determining the relative importance of independent variables (IVs)/predictors/features in a predictive model. DA determines the relative importance of IVs in a predictive model based on each IV’s contribution to an overall model fit statistic. More specifically, DA produces several results that decompose the fit statistic into the contribution each IV makes and compare those contributions to one another. Conceptually, DA is an extension of Shapley value decomposition from Cooperative Game Theory (see Grömping, 2007 for a discussion), which seeks solutions for how to subdivide “payoffs” (i.e., the fit statistic) among “players” (i.e., the IVs) based on their relative contribution to the payoff; the payoff is usually assumed to be a single value (i.e., scalar valued).

The implementation DA uses to estimate the Shapley value decomposition takes a pre-selected predictive model and applies a very extensive experimental design to it. Effectively, DA treats the predictive model as data and the IVs as factors in an experiment with a within-subjects full-factorial design (i.e., all possible combinations of the IVs are applied to the data). Another way of describing the process is as a brute force method in which every combination of IVs included or excluded is estimated from the data and the fit statistic associated with each sub-model is collected.

Assuming there are \(p\) IVs in the pre-selected model, obtaining all possible combinations of the IVs results in \(2^p\) combinations, \(2^p - 1\) of which include at least one IV, each with a model and fit statistic to be estimated. Each combination is one pattern of the \(p\) variables alternating between included and excluded (see Budescu, 1993).
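The full-factorial view of the IVs can be sketched in base R as an inclusion/exclusion design matrix. This is only an illustration: the IV names x1, x2, and x3 are hypothetical placeholders, and the code does not reflect how {domir} builds its combinations internally.

```r
# Hypothetical IV names used only for illustration
ivs <- c("x1", "x2", "x3")
p <- length(ivs)

# One row per combination; TRUE = IV included, FALSE = IV excluded
design <- expand.grid(rep(list(c(FALSE, TRUE)), p))
names(design) <- ivs

n_combinations <- nrow(design)               # 2^p rows, including the row with no IVs
n_with_ivs <- sum(rowSums(design) > 0)       # rows that include at least one IV

design
```

Each row of design corresponds to one sub-model; the row in which every IV is excluded is the model with no IVs.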

Key Caveat

As is mentioned above, an implication of the DA method is that the predictive model used in it is “pre-selected”. That is, the model is assumed to have already passed through model selection procedures, and the user is confident that the model used in the dominance analysis is a good reflection of the “payoff” structure. DA, at least the implementation of DA that acts like a full-factorial experiment, is not effective for model selection of IVs/finding IVs with trivial effects and removing them.

“Relative importance” as a concept is used in many different ways in statistics and data science and many of them refer to methods that are focused on, or are probably best used for, model selection/identifying trivial IVs for removal. DA is a method that is more focused on importance in a “model evaluation” sense. I see this as being an application where the user describes/interprets IVs’ effects in the context of a finalized, predictive approach.

DA Implementation with domin

The domin function of the {domir} package implements the DA, full-factorial methodology with a flexible R function application programming interface/API. The API is based on the R formula which is used in many predictive R modeling functions in base and recommended packages such as lm, glm, arima, polr, nnet, lme, and gam. Functions that do not use formulas, or use variants of the standard R formula (i.e., those in the Formula package), can be accommodated using wrapper functions.

I see the DA methodology as a general approach to model evaluation using relative importance, and it is this perspective that has guided the development of this package. In my view, if you have a pre-selected predictive model and a fit statistic that can be applied to it, domin can help you evaluate/interpret the contribution of the IVs in that model.

Extensive Introductory Example

This section introduces some of the concepts in DA using a simple predictive model and discusses how each is displayed in the domin implementation when print-ed.

The focus of this section is on the concepts underlying DA and less on the implementation of DA in, and how to work with, the domin function.

Linear Regression-based DA

The example below uses the mtcars data in the {datasets} package. Assume I have pre-selected, through whatever means, the following linear model:

lm(mpg ~ am + cyl + carb, data = mtcars)

Applying a DA to this model using domin, assuming the fit statistic of interest is the explained variance \(R^2\), would look like:

library(domir)
library(datasets)

domin(mpg ~ am + cyl + carb, 
      lm, 
      list(summary, "r.squared"), 
      data = mtcars)
#> Overall Fit Statistic:      0.8113023 
#> 
#> General Dominance Statistics:
#>      General Dominance Standardized Ranks
#> am           0.2156848    0.2658501     2
#> cyl          0.4173094    0.5143698     1
#> carb         0.1783081    0.2197801     3
#> 
#> Conditional Dominance Statistics:
#>         IVs: 1    IVs: 2     IVs: 3
#> am   0.3597989 0.2164938 0.07076149
#> cyl  0.7261800 0.4181185 0.10762967
#> carb 0.3035184 0.1791172 0.05228872
#> 
#> Complete Dominance Designations:
#>              Dmnated?am Dmnated?cyl Dmnated?carb
#> Dmnates?am           NA       FALSE         TRUE
#> Dmnates?cyl        TRUE          NA         TRUE
#> Dmnates?carb      FALSE       FALSE           NA

domin prints results in four sections including fit statistics overall, as well as general and conditional dominance statistics, and complete dominance designations.

Fit Statistics

#> Overall Fit Statistic:      0.8113023 

This results section reports on the value of the fit statistic of the pre-selected model. Other fit statistic value adjustments are reported in this section as well.

The \(R^2\) for this model is ~\(.8113\), meaning the model explains about 81% of the variance in mpg; this is the limiting value for how much the IVs will be able to explain in total. The three IVs in this model will be ascribed separate components of this ~\(.8113\) value.

General Dominance Statistics

#> General Dominance Statistics:
#>      General Dominance Standardized Ranks
#> am           0.2156848    0.2658501     2
#> cyl          0.4173094    0.5143698     1
#> carb         0.1783081    0.2197801     3

This section reports on how the fit statistic value from the last section is divided up among the IVs. The statistics produced when dividing up the overall fit statistic associated with the pre-selected model are known as general dominance statistics. Because they sum to the overall fit statistic, general dominance statistics are the results from the DA that are considered easiest to interpret.

General dominance statistics are computed as a weighted average of the marginal/incremental contribution to prediction (as measured by the focal fit metric, in this case the \(R^2\)) the IV makes across all sub-models in which it is included. For example, am has a value of ~\(0.2157\) which means, on average, am results in an increment to the \(R^2\) of about twenty-two percentage points when it is included in the model versus when it is not.

The Standardized column of statistics expresses each general dominance statistic value as a proportion of the overall fit statistic value, and thus the column sums to 100%. am’s contributions to prediction comprise ~27% of the predictive usefulness of the pre-selected model (i.e., \(\frac{.2157}{.8113} = .2659\)).

Finally, the general dominance statistics can be compared to one another to determine which variables generally dominate others. Here, cyl’s larger general dominance statistic than am indicates that it generally dominates, and is thus more important than, am. These general dominance designations are reflected in the Ranks column.
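As a small sketch of how the Standardized and Ranks columns relate to the general dominance statistics, the values printed above can be copied into a named vector and re-derived in base R. The object names here are illustrative and are not part of the object domin returns.

```r
# General dominance statistics copied from the printed output above
gen_dom <- c(am = 0.2156848, cyl = 0.4173094, carb = 0.1783081)

# The statistics sum to the overall fit statistic
overall_fit <- sum(gen_dom)

# Standardized: each statistic as a share of the overall fit statistic
standardized <- gen_dom / overall_fit

# Ranks: 1 is assigned to the largest general dominance statistic
ranks <- rank(-gen_dom)

round(standardized, 4)
ranks
```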

Although simple to interpret, the general dominance statistics are the least stringent of the dominance statistics/designations as they average across the most sub-models (more specifics provided below). Though least stringent, the general dominance statistics almost always allow for a comparison between two IVs. The only time a general dominance designation cannot be made between two IVs is when they are exactly equal to one another in terms of predicting the dependent variable (DV)/outcome/response.

General dominance statistics for an IV are also the arithmetic average of that IV’s conditional dominance statistics - these statistics are discussed next.

Conditional Dominance Statistics

#> Conditional Dominance Statistics:
#>         IVs: 1    IVs: 2     IVs: 3
#> am   0.3597989 0.2164938 0.07076149
#> cyl  0.7261800 0.4181185 0.10762967
#> carb 0.3035184 0.1791172 0.05228872

This section reports on how the predictive usefulness for any given IV changes as more IVs are added to the model. These statistics describing change as more IVs are added to the model are known as conditional dominance statistics. Conditional dominance statistics are more challenging to interpret as they are best evaluated across their trajectory within an IV but across different numbers of IVs in the model. They also do not cleanly decompose the \(R^2\) value like the general dominance statistics but rather show the variation in \(R^2\) values as more IVs are added to the model.

Conditional dominance statistics are computed as the average incremental contribution to prediction an IV makes within a single order of models–where order refers to a distinct number of IVs in the prediction model. As such, each IV will have \(p\) different conditional dominance statistics.

In the example above, order one is IVs: 1 and refers to the incremental contribution an IV makes to predicting the DV when by itself. Order two is IVs: 2 and refers to the average incremental contribution an IV makes in predicting the DV when another IV is included in the model. Finally, order three is IVs: 3 and refers to the incremental contribution an IV makes beyond the other two IVs.

Conditional dominance statistics provide more information about each IV than general dominance statistics as they reveal the effect that IV redundancy has on prediction for each IV. In the above conditional dominance statistic matrix, am is slightly less affected by variable redundancy overall as its average incremental contribution to fit shrinks relatively less than the other two IVs across all three orders.

As with general dominance statistics, conditional dominance statistics can be compared to determine which IVs conditionally dominate others. One IV conditionally dominates another when its within-order conditional dominance statistics are larger than the other IV’s statistics at all \(p\) orders. If, at any order, the conditional dominance statistics for two IVs are equal or there is a change in rank order between the IVs, no conditional dominance designation can be made between those IVs. As applied to the results above, am is conditionally dominated by cyl as its conditional dominance statistics are smaller than cyl’s at orders 1, 2, and 3.
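The within-order comparison can be sketched in base R using the conditional dominance matrix printed above. The helper cond_dominates is hypothetical and written only for this illustration; it is not a {domir} function.

```r
# Conditional dominance statistics copied from the printed output above
cond_dom <- rbind(
  am   = c(0.3597989, 0.2164938, 0.07076149),
  cyl  = c(0.7261800, 0.4181185, 0.10762967),
  carb = c(0.3035184, 0.1791172, 0.05228872)
)
colnames(cond_dom) <- paste0("IVs_", 1:3)

# Hypothetical helper: does IV `a` have the larger statistic at every order?
cond_dominates <- function(a, b, m) all(m[a, ] > m[b, ])

cyl_over_am  <- cond_dominates("cyl", "am", cond_dom)
am_over_carb <- cond_dominates("am", "carb", cond_dom)
carb_over_am <- cond_dominates("carb", "am", cond_dom)

c(cyl_over_am = cyl_over_am, am_over_carb = am_over_carb, carb_over_am = carb_over_am)
```

A single order at which the inequality fails, or ties, is enough for the helper to return FALSE, which mirrors why conditional dominance is harder to establish than general dominance.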

The increase in complexity with conditional dominance from general also results in a more stringent set of comparisons. It is more difficult for one IV to conditionally dominate than it is for one IV to generally dominate another IV as there are more ways in which conditional dominance can fail to be achieved. Additionally, conditional dominance implies general dominance–but the reverse is not true. An IV can generally, but not conditionally, dominate another IV.

The next section discusses the most stringent of the dominance designations.

Complete Dominance Designations

#> Complete Dominance Designations:
#>              Dmnated?am Dmnated?cyl Dmnated?carb
#> Dmnates?am           NA       FALSE         TRUE
#> Dmnates?cyl        TRUE          NA         TRUE
#> Dmnates?carb      FALSE       FALSE           NA

This section reports on how the predictive usefulness for an IV compares with another IV across all models that are comparable for the two IVs. The comparisons described in this section do not form statistics but only dominance designations; these are known as complete dominance designations. Complete dominance designations are not difficult to interpret but are the most complex set of comparisons and only exist as a set of comparisons across IV pairs.

Complete dominance designations are made by comparing all possible incremental contributions to predicting the DV between two IVs across models that are comparable, that is, across every subset of the remaining IVs (including the empty subset) to which each of the two IVs is added in turn. For example, the comparison between am and cyl would include comparing the model with am alone to the model with cyl alone, as well as comparing the increment of the model with am and carb beyond the model with carb to the increment of the model with cyl and carb beyond the model with carb.

Complete dominance is the most stringent dominance criterion to satisfy as it requires that an IV always have a larger increment to prediction across all \(2^{p-2}\) comparable individual models between two IVs.

As is noted above, the complete dominance designation has no statistic. The matrix domin returns is a set of logicals that reads from left to right. A value of TRUE means that the IV in the row completely dominates the IV in the column. Conversely, a value of FALSE means the opposite, that the IV in the row is completely dominated by the IV in the column. An NA value means no complete dominance designation could be made as the comparison IVs’ incremental contributions differ in relative magnitude from model to model; the diagonal is also NA, as an IV is not compared with itself.

In the example above, the complete dominance designations show a clear hierarchy among the IVs with cyl completely dominating am which completely dominates carb. Complete dominance implies both general and conditional dominance, but, again, the reverse is not true. An IV can conditionally or generally, but not completely, dominate another IV.

How Dominance Statistics Work

In this section, I expand more on what goes on during a DA and how statistics are computed using an example.

In the example below, I create a custom function that exports model results to an external R object (i.e., the results data frame).

The programming that is involved in the function below is a little complex (how it works and how to develop something like it will be the subject of another vignette!) but the idea is that it captures the formula domin feeds into the lm model and collects the \(R^2\) value the model produces.

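lm_capture <- 
  function(x, ...) {
    # advance the counter that lives outside the function
    count <<- count + 1
    # fit the sub-model for the formula domin passes in
    y <- lm(x, ...)
    # record the formula and its R^2 in the external results data frame
    results[count, "fmla"] <<- deparse(x)
    results[count, "r2"] <<- summary(y)[["r.squared"]]
    # return the fitted model so domin can extract the fit statistic
    return(y)
  }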

count <- 0

results <- data.frame(fmla = rep("", times = 2^3-1), 
                      r2 = rep(NA, times = 2^3-1) )

lm_da <- domin(mpg ~ am + cyl + carb, lm_capture, list(summary, "r.squared"), data = mtcars)

results
#>                    fmla        r2
#> 1              mpg ~ am 0.3597989
#> 2             mpg ~ cyl 0.7261800
#> 3            mpg ~ carb 0.3035184
#> 4        mpg ~ am + cyl 0.7590135
#> 5       mpg ~ am + carb 0.7036726
#> 6      mpg ~ cyl + carb 0.7405408
#> 7 mpg ~ am + cyl + carb 0.8113023

As the printed result from the results data frame shows, domin runs 7 linear regressions with all the different combinations of the IVs and collects the \(R^2\) value associated with each model.
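As a cross-check that does not rely on domin, the same seven sub-models can be enumerated and fit with base R alone. This is a sketch of the brute force idea, not a description of domin’s internals; the object names subsets and check are illustrative.

```r
ivs <- c("am", "cyl", "carb")

# All non-empty subsets of the IVs, ordered by the number of IVs included
subsets <- unlist(
  lapply(seq_along(ivs), function(k) combn(ivs, k, simplify = FALSE)),
  recursive = FALSE
)

# Build one formula string per subset
fmla <- vapply(
  subsets,
  function(s) paste("mpg ~", paste(s, collapse = " + ")),
  character(1)
)

# Fit each sub-model on mtcars and collect its R^2
r2 <- vapply(
  fmla,
  function(f) summary(lm(as.formula(f), data = mtcars))$r.squared,
  numeric(1),
  USE.NAMES = FALSE
)

check <- data.frame(fmla = fmla, r2 = r2)
check
```

The rows of check should line up with the rows of the results data frame shown above.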

Designating Complete Dominance

Given the results in the results data frame, I will explain first how to determine complete dominance of two IVs–in this case, between am and carb. Complete dominance (\(D\)) is computed as:

\(X_vDX_z\; \text{if}\; 2^{p-2} = \sum^{2^{p-2}}_{j=1} \begin{cases} 1 & \text{if}\; R^2_{X_vS2_j} - R^2_{S2_j} > R^2_{X_zS2_j} - R^2_{S2_j} \\ 0 & \text{if}\; R^2_{X_vS2_j} - R^2_{S2_j} \le R^2_{X_zS2_j} - R^2_{S2_j} \end{cases}\)

Where \(X_v\) and \(X_z\) are two IVs, \(S2_j\) is a set of the other IVs not including \(X_v\) and \(X_z\) which can include the null set with no other IVs, and \(R^2\) is a model fit statistic (does not have to be an \(R^2\) metric).

When comparing am and carb, \(2^{p-2} = 2\) comparisons are made using the 7 model results. The first is between mpg ~ am and mpg ~ carb; the second is between the increments that mpg ~ am + cyl and mpg ~ cyl + carb each make beyond mpg ~ cyl. These comparisons cover all the possible sub-models in which models containing am and carb can be directly compared.

The comparisons suggest the complete dominance of am as mpg ~ am’s 0.3597989 is greater than mpg ~ carb’s 0.3035184 and mpg ~ am + cyl’s 0.0328335 increment beyond mpg ~ cyl is greater than mpg ~ cyl + carb’s 0.0143608 increment beyond mpg ~ cyl.

This result is reported in row 3-column 1 and row 1-column 3 of the Complete_Dominance matrix in the object domin returns.

lm_da$Complete_Dominance
#>              Dmnated_am Dmnated_cyl Dmnated_carb
#> Dmnates_am           NA       FALSE         TRUE
#> Dmnates_cyl        TRUE          NA         TRUE
#> Dmnates_carb      FALSE       FALSE           NA
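The am versus carb comparison can be reproduced directly in base R as a rough check. The helper r2 is hypothetical and written for this sketch; it also assumes the fit of the model with no IVs is 0, which holds for \(R^2\) but would need revisiting for other fit statistics.

```r
# Hypothetical helper: R^2 of a linear model of mpg on the named IVs
r2 <- function(rhs) {
  summary(lm(reformulate(rhs, response = "mpg"), data = mtcars))$r.squared
}

# Increments beyond each subset of the remaining IVs: the empty set and {cyl}
inc_am   <- c(r2("am"),   r2(c("am", "cyl"))   - r2("cyl"))
inc_carb <- c(r2("carb"), r2(c("carb", "cyl")) - r2("cyl"))

# am completely dominates carb only if its increment is larger in every comparison
am_dominates_carb <- all(inc_am > inc_carb)

rbind(inc_am, inc_carb)
am_dominates_carb
```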

The process used to compare models containing am and carb is repeated for the comparison between am and cyl as well as carb and cyl.

Computing Conditional Dominance

Computing conditional dominance statistics is the next process to be described in this vignette. Conditional dominance statistics are computed as:

\(C^i_{X_v} = \sum^{\binom{p-1}{i-1}}_{j=1}{\frac{R^2_{X_vS_{ij}}\, - R^2_{S_{ij}}}{\binom{p-1}{i-1}}}\)

Where \(i\) is the order (the number of IVs in the model, from 1 to \(p\)), \(S_{ij}\) is the \(j\)th distinct subset of \(i-1\) IVs not including \(X_v\), and \(\binom{p-1}{i-1}\) is the number of distinct combinations produced by choosing \(i-1\) elements (the bottom value) from \(p-1\) elements (the top value), which is the value that would be produced by choose(p-1, i-1).

The user can think of conditional dominance statistics as a windowed or grouped statistic in which fit statistics from the results data frame are averaged within each order. Specifically, at the order in question (i.e., number of IVs in the model), the conditional dominance statistic is the sum of the fit statistics of the models at that order that include the IV, less the sum of the fit statistics of the corresponding models at the previous order that omit the IV. This difference is then divided by the number of models at the order in question that include the IV.

Put another way, it is the arithmetic average of the increment to the fit statistic for the models including the IV at the order in question.

As applied to the example above involving am, the three conditional dominance statistics for am are computed as the value of am by itself in the model (i.e., 0.3597989) for order 1, the average increment of am beyond each of the other two IVs alone (i.e., the average of 0.7590135 - 0.7261800 and 0.7036726 - 0.3035184) for order 2, and the increment am provides to the full model beyond the other two IVs together (i.e., 0.8113023 - 0.7405408) for order 3.

This result is reported in row 1 of the Conditional_Dominance matrix in the object domin returns.

lm_da$Conditional_Dominance
#>          IVs_1     IVs_2      IVs_3
#> am   0.3597989 0.2164938 0.07076149
#> cyl  0.7261800 0.4181185 0.10762967
#> carb 0.3035184 0.1791172 0.05228872
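The three conditional dominance statistics for am can be recomputed by hand in base R as a sketch of the averaging within each order. The helper r2 is hypothetical and, as an assumption, treats the fit of the model with no IVs as 0.

```r
# Hypothetical helper: R^2 of a linear model of mpg on the named IVs
r2 <- function(rhs) {
  summary(lm(reformulate(rhs, response = "mpg"), data = mtcars))$r.squared
}

cond_am <- c(
  # order 1: am alone (increment beyond the model with no IVs)
  IVs_1 = r2("am"),
  # order 2: average increment of am beyond cyl alone and beyond carb alone
  IVs_2 = mean(c(r2(c("am", "cyl"))  - r2("cyl"),
                 r2(c("am", "carb")) - r2("carb"))),
  # order 3: increment of am beyond cyl and carb together
  IVs_3 = r2(c("am", "cyl", "carb")) - r2(c("cyl", "carb"))
)

cond_am
```

The number of increments averaged at each order is choose(2, 0) = 1, choose(2, 1) = 2, and choose(2, 2) = 1, matching the denominator in the formula above.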

Because am completely dominates carb and is completely dominated by cyl, it also conditionally dominates carb and is conditionally dominated by cyl.

Computing General Dominance

General dominance is computed as:

\(C_{X_v} = \sum^p_{i=1}{\frac{C^i_{X_v}}{p}}\)

Thus, the general dominance statistics are the arithmetic average of all the conditional dominance statistics for an IV.

Another way of thinking about the general dominance statistic is as a weighted average of all the fit statistics in the model. This is clearer when substituting the equation for the conditional dominance statistics into the equation for general dominance as each model is weighted both by the number of submodels at the order in which it was estimated (the average taken for conditional dominance statistic computation) as well as the number of IVs in the model total (the average taken for general dominance statistic computation).

The general dominance statistic for am is then computed as the average of 0.3597989, 0.2164938, and 0.0707615.

This result is reported in element 1 of the General_Dominance vector in the object domin returns.

lm_da$General_Dominance
#>        am       cyl      carb 
#> 0.2156848 0.4173094 0.1783081
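Both views of the general dominance statistic for am, as an average of conditional dominance statistics and as a weighted sum of individual increments, can be sketched with the values reported above. The weights follow from substituting the conditional dominance formula into the general dominance formula; the object names are illustrative.

```r
p <- 3

# View 1: arithmetic average of am's conditional dominance statistics
cond_am <- c(0.3597989, 0.2164938, 0.07076149)
gen_am_avg <- mean(cond_am)

# View 2: weighted sum of am's individual increments to R^2
increments <- c(
  0.3597989,              # am alone
  0.7590135 - 0.7261800,  # am beyond cyl
  0.7036726 - 0.3035184,  # am beyond carb
  0.8113023 - 0.7405408   # am beyond cyl and carb
)
orders <- c(1, 2, 2, 3)   # number of IVs in the model that includes am

# Weight = 1 / (p * number of increments averaged at that order)
weights <- 1 / (p * choose(p - 1, orders - 1))
gen_am_weighted <- sum(weights * increments)

c(average = gen_am_avg, weighted = gen_am_weighted)
```

The weights here are 1/3, 1/6, 1/6, and 1/3, so the single increments at orders 1 and 3 count twice as much as each of the two increments at order 2.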

Because am conditionally and completely dominates carb, and is conditionally and completely dominated by cyl, it also generally dominates carb and is generally dominated by cyl.