1 Model, explanation & bias

In this tutorial you will learn how to tackle bias using the bias-mitigation techniques supported by fairmodels. As always, we will start with the data.

library(fairmodels)

data("adult")
head(adult)
# output truncated for this vignette

We will use the adult data to predict whether a person's yearly salary exceeds 50 000 or not. Our protected variable will be sex. For this tutorial we will use gbm and, of course, we will explain it with DALEX.

library(gbm)
library(DALEX)

adult$salary <- as.numeric(adult$salary) - 1 # 0 if salary <= 50K, 1 if salary > 50K
protected    <- adult$sex
adult        <- adult[colnames(adult) != "sex"] # drop sex so the model cannot use it directly

# making model
set.seed(1)
gbm_model <- gbm(salary ~ ., data = adult, distribution = "bernoulli")

# making explainer
gbm_explainer <- explain(gbm_model,
                         data = adult[,-1],
                         y = adult$salary,
                         colorize = FALSE)
#> Preparation of a new explainer is initiated
#>   -> model label       :  gbm  (  default  )
#>   -> data              :  32561  rows  13  cols 
#>   -> target variable   :  32561  values 
#>   -> predict function  :  yhat.gbm  will be used (  default  )
#>   -> predicted values  :  numerical, min =  0.0101498 , mean =  0.2409542 , max =  0.9864558  
#>   -> model_info        :  package gbm , ver. 2.1.8 , task classification (  default  ) 
#>   -> residual function :  difference between y and yhat (  default  )
#>   -> residuals         :  numerical, min =  -0.9790795 , mean =  -0.0001445991 , max =  0.9864904  
#>   A new explainer has been created!
model_performance(gbm_explainer)
#> Measures for:  classification
#> recall     : 0.5353909 
#> precision  : 0.7999238 
#> f1         : 0.6414547 
#> accuracy   : 0.8558705 
#> auc        : 0.9093789
#> 
#> Residuals:
#>          0%         10%         20%         30%         40%         50% 
#> -0.97907954 -0.31512711 -0.20921106 -0.12255213 -0.06480941 -0.04122486 
#>         60%         70%         80%         90%        100% 
#> -0.02832005 -0.01740186  0.13344943  0.53676046  0.98649038

Our model has around 86% accuracy. But what about bias? Sex is our protected variable, and we should suspect that men will be assigned the favorable label (salary above 50 000) more frequently.

fobject <- fairness_check(gbm_explainer, 
                          protected  = protected, 
                          privileged = "Male", 
                          colorize = FALSE)
#> Creating fairness object
#> -> Privileged subgroup       : character ( Ok  )
#> -> Protected variable        : factor ( Ok  ) 
#> -> Cutoff values for explainers  : 0.5 ( for all subgroups ) 
#> -> Fairness objects      : 0 objects 
#> -> Checking explainers       : 1 in total (  compatible  )
#> -> Metric calculation        : 12/12 metrics calculated for all models
#>  Fairness object created succesfully
print(fobject, colorize = FALSE)
#> 
#> Fairness check for models: gbm 
#> 
#> gbm passes 2/5 metrics
#> Total loss:  2.028729
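Whether a metric "passes" is governed by the epsilon parameter of fairness_check (default 0.8, the four-fifths rule): the ratio of each subgroup's metric score to the privileged group's score must fall between epsilon and 1/epsilon. A stricter check could look like this (epsilon = 0.9 is an arbitrary value chosen for illustration):

# stricter check: require subgroup metric ratios between 0.9 and 1/0.9
fobject_strict <- fairness_check(gbm_explainer,
                                 protected  = protected,
                                 privileged = "Male",
                                 epsilon    = 0.9,
                                 verbose    = FALSE)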

Our model passes only 2 of the 5 metrics. How big is the bias?

plot(fobject)

The biggest bias is in the statistical parity loss metric. This metric is frequently examined because it answers the question of how much the rates of positive predictions differ between the subgroups of the protected variable. Let's make it the metric we will try to mitigate.
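To see what this metric is based on, we can compute the rate of positive predictions (with the default 0.5 cutoff) in each subgroup by hand. A minimal sketch, not part of the fairmodels API:

# proportion of observations predicted as positive, per subgroup
positive <- gbm_explainer$y_hat > 0.5
tapply(positive, protected, mean)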

2 Bias mitigation strategies

2.1 Pre-processing techniques

Pre-processing techniques focus on changing the data before the model is trained, so that the bias is reduced at its source.

2.1.1 Distribution changing

The first technique you will learn about is disparate_impact_remover. It is somewhat limited, as it works only on ordinal, numeric data. The technique returns a "fixed" data frame. Through the lambda parameter we can control how strongly the distributions are repaired: lambda = 1 (the default) returns data with identical distributions for all levels of the protected variable, whereas lambda = 0 barely changes anything. We will transform a few features.

data_fixed <- disparate_impact_remover(data = adult, protected = protected, 
                            features_to_transform = c("age", "hours_per_week",
                                                      "capital_loss",
                                                      "capital_gain"))

set.seed(1)
gbm_model     <- gbm(salary ~ ., data = data_fixed, distribution = "bernoulli")
gbm_explainer_dir <- explain(gbm_model,
                             data = data_fixed[,-1],
                             y = adult$salary,
                             label = "gbm_dir",
                             verbose = FALSE)
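
To get a feel for lambda, we can repair a single feature only partially and compare the subgroup means before and after. A minimal sketch; lambda = 0.8 and the choice of hours_per_week are arbitrary, for illustration only:

# partially repair one feature; lambda = 0.8 keeps some of the original disparity
data_partial <- disparate_impact_remover(data = adult, protected = protected,
                                         features_to_transform = "hours_per_week",
                                         lambda = 0.8)

tapply(adult$hours_per_week, protected, mean)        # original subgroup means
tapply(data_partial$hours_per_week, protected, mean) # after partial repair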

Now we will compare the old explainer with the new one.

fobject <- fairness_check(gbm_explainer, gbm_explainer_dir,
                          protected = protected, 
                          privileged = "Male",
                          verbose = FALSE)
plot(fobject)
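
Besides the plot, the parity loss values can be compared numerically. A minimal sketch, assuming the fairness object stores them in its parity_loss_metric_data field (as fairmodels fairness objects do); the STP column holds statistical parity:

# parity loss for each metric and each model; lower is better
fobject$parity_loss_metric_data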