Authors:

- Norm Matloff
- Aditya Mittal
- Taha Abdullah
- Arjun Ashok
- Shubhada Martha
- Billy Ouattara
- Jonathan Tran
- Brandon Zarate

Discrimination is a key social issue in the United States and in a number of other countries. There is a wealth of available data with which one might investigate possible discrimination. But how might such investigations be conducted?

Our **DSLD** package provides statistical and graphical
tools for detecting and measuring discrimination and bias, whether
racial, gender, age, or other. It is an R package with wide
applicability; here are just a few possible use cases:

- Quantitative analysis in instruction and research in the social sciences.
- Corporate HR analysis and research.
- Litigation involving discrimination and related issues.
- Concerned citizenry.

This package is aimed at a broad range of users, from instructors of
statistics classes to legal professionals, as it offers a powerful yet
intuitive approach to discrimination analysis. It also includes an
80-page **Quarto book** that serves as a guide to the key
statistical principles and their applications.

As of now, the package may be installed using the
**devtools** package:

```
library(devtools)
install_github("matloff/dsld", force = TRUE)
```
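
Once installed, the package is loaded in the usual way:

```
library(dsld)
```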

[WAITING TO PUT ON CRAN]

Here are the main categories of analysis:

- In the *estimation* realm, we may be investigating, say, a possible
  gender pay gap. In doing so, we must be careful to account for
  *confounders*, variables that may affect wages other than through
  gender.
- In a *prediction* context, we may be concerned that an ML algorithm
  has built-in bias against some racial group. We want to eliminate
  race from the analysis, while also limiting the effect of *proxies*,
  other variables that may be strongly related to race.

In the first case, we are checking for *societal* or
*institutional* bias. In the second, the issue is
*algorithmic* bias.

It is important to distinguish between the “statistics” and “fair ML” settings. Here is a side-by-side comparison:

statistics | fair ML |
---|---|
estimate an effect | predict an outcome |
harm comes from society | harm comes from an algorithm |
include sensitive variables | exclude sensitive variables |
adjust for covariates | use proxies but limit their impact |

Here we will take a quick tour of a subset of dsld’s features, using
the **svcensus** data included in the package.

The **svcensus** dataset consists of recorded income
across six different engineering occupations, with the columns
‘age’, ‘education level’, ‘occupation’, ‘wage income’, ‘number of weeks
worked’, and ‘gender’.

```
> data(svcensus)
> head(svcensus)
       age     educ occ wageinc wkswrkd gender
1 50.30082 zzzOther 102   75000      52 female
2 41.10139 zzzOther 101   12300      20   male
3 24.67374 zzzOther 102   15400      52 female
4 50.19951 zzzOther 100       0      52   male
5 51.18112 zzzOther 100     160       1 female
6 57.70413 zzzOther 100       0       0   male
```

We will use only a few features, to keep things simple. Note that the
*Quarto book* provides an extensive analysis of the examples shown
below.

We wish to *estimate the impact* of a sensitive variable [S]
on an outcome variable [Y], while *accounting for confounders*
[C]. Let’s call such analysis “confounder adjustment.” The package
provides several graphical and analytical tools for this purpose.
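
For example, one such graphical tool is *dsldDensityByS()*, which
graphs the density of a response variable for each level of the
sensitive variable. Here is a minimal sketch; the argument names are
assumptions, so consult the package documentation for the exact
interface:

```
# sketch: compare wage-income densities across gender groups
# (argument names are assumptions; consult the dsld docs)
library(dsld)
data(svcensus)
dsldDensityByS(svcensus, cName = 'wageinc', sName = 'gender')
```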

We are investigating a possible gender pay gap between men and women. Here, [Y] is wage and [S] is gender. We will treat age as a confounder [C], using a linear model. For simplicity, no other confounders (such as occupation) or any other predictors [X] are included in this example.

```
> data(svcensus)
> svcensus <- svcensus[, c(1, 4, 6)]  # subset columns: age, wage, gender
> z <- dsldLinear(svcensus, 'wageinc', 'gender')
> coef(z)  # show coefficients of linear model
$gender
(Intercept)         age  gendermale
 31079.9174    489.5728  13098.2091
```

Our linear model can be written as:

\(E(W) = \beta_0 + \beta_1 A + \beta_2 M\)

Consider first the case without any interaction terms: here *W* denotes
wage income, *A* is age, and *M* is an indicator
variable, with M = 1 for men and M = 0 for women.
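
To make this concrete, here is a quick check of what the fitted
coefficients above imply at, say, age 30:

```
# predicted mean wage at age 30, from the coefficients fitted above
b0 <- 31079.9174; b1 <- 489.5728; b2 <- 13098.2091
b0 + b1*30 + b2  # men:   about $58,865
b0 + b1*30       # women: about $45,767
```

The two predictions differ by exactly \(\beta_2\), at every age.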

We can speak of \(\beta_2\) as
*the* gender wage gap, the same at any age: according to the model,
younger men earn an estimated $13,000 more than younger women, and the
gap between older men and older women is the *same size*. However,
it may be, for instance, that the gender gap is small at younger ages
but much larger for older people. We can account for such an interaction
by fitting two separate linear models, one for men and one for women.
Suppose we wish to compare the gap at ages 36 and 43:

```
> newData <- data.frame(age = c(36, 43))
> z <- dsldLinear(svcensus, 'wageinc', 'gender', interactions = TRUE,
+    newData)
> summary(z)
$female
    Covariate   Estimate StandardError PValue
1 (Intercept) 30551.4302    2123.44361      0
2         age   502.9624      52.07742      0

$male
    Covariate  Estimate StandardError PValue
1 (Intercept) 44313.159    1484.82216      0
2         age   486.161      36.02116      0

$`Sensitive Factor Level Comparisons`
  Factors Compared New Data Row Estimates Standard Errors
1    female - male            1 -13156.88        710.9696
2    female - male            2 -13039.27        710.7782
```

The gender pay gap is estimated to be -13156.88 at age 36 and -13039.27 at age 43, differing by only about $100. The estimated gap between ages 36 and 53, not shown, is larger, close to $300, but it seems there is not much interaction here. Note that we chose only one [C] variable, age, for this analysis. We might also choose occupation, or any other combination of confounders, depending on the application and dataset; this choice can affect our results. The package also provides several graphical and analytical tools to aid users further.
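
As a sketch, occupation could be added as a second confounder simply by
retaining its column; the results would of course differ from the
age-only model above:

```
# sketch: age and occupation as confounders [C]
data(svcensus)
svc <- svcensus[, c('age', 'occ', 'wageinc', 'gender')]
z2 <- dsldLinear(svc, 'wageinc', 'gender')
coef(z2)
```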

Our goal is to predict [Y] from [X] and [O], omitting the sensitive variable [S]. However, we are concerned that we may be indirectly using [S] via the proxies [O], so we want to limit their influence. The inherent tradeoff is that increasing fairness reduces utility, i.e., predictive power/accuracy. The package provides wrappers for several functions for this purpose.

Here we predict wage [Y], with gender as the sensitive variable [S] and
occupation as the proxy [O]. The proxy will be deweighted to 0.2 using
the *dsldQeFairKNN()* function, limiting its predictive power.
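
Here is a sketch of such a call; the *deweightPars* argument and the
output components shown are assumptions based on the package's
qeML-style interface, so consult the documentation:

```
# sketch: fair KNN, deweighting the proxy 'occ' to 0.2
# (deweightPars and output component names are assumptions)
data(svcensus)
z <- dsldQeFairKNN(svcensus, 'wageinc', 'gender',
   deweightPars = list(occ = 0.2))
z$testAcc  # mean absolute prediction error on a holdout set
z$corrs    # correlation between predicted wage and gender
```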

Fairness/Utility Tradeoff | Fairness | Accuracy |
---|---|---|
K-Nearest Neighbors | 0.1943313 | 25452.08 |
Fair K-NN (via EDFFair) | 0.0814919 | 26291.38 |

In the base K-NN model, the correlation between predicted wage and
gender was 0.1943, with a mean prediction error of about $25,452. Using
*dsldQeFairKNN*, the correlation between predicted wage and gender
decreased substantially, to 0.0815. On the other hand, mean prediction
error increased by about $800. Hence we gain an increase in fairness at
some expense of accuracy.

Finally, here is a list of the package's main functions:

- **dsldLinear**/**dsldLogit**: Comparison of conditions for sensitive groups via linear and logistic models
- **dsldML**: Comparison of conditions for sensitive groups via ML algorithms
- **dsldTakeLookAround**: Evaluates feature sets for predicting Y while considering correlation with the sensitive variable S
- **dsldCHunting**: Confounder hunting; searches for variables C that predict both Y and S
- **dsldOHunting**: Proxy hunting; searches for variables O that predict S
- **dsldScatterPlot3D**: Plots a dataset on 3 axes, with point color depending on a 4th variable
- **dsldConditsDisparity**: Plots mean Y against X for each level of S, revealing potential Simpson’s-Paradox-like differences under specified conditions
- **dsldConfounders**: Analyzes confounding variables in a data frame
- **dsldFreqPCoord**: Wrapper for the freqparcoord function from the freqparcoord package
- **dsldDensityByS**: Graphs densities of a response variable, grouped by a sensitive variable
- **dsldFrequencyByS**: Assesses a possible confounding relationship between a sensitive variable and a categorical variable via graphical means
- **dsldFairML**: Wrappers for several fair machine learning algorithms provided via the FairML package
- **dsldEDFFair**: Wrappers for several EDFFair functions provided via the EDFFair package