fairlib: A Unified Framework for Assessing and Improving Fairness

Xudong Han\(^{1}\),   Aili Shen\(^{1,2, a}\),   Yitong Li\(^{3}\),   Lea Frermann\(^{1}\),   Timothy Baldwin\(^{1,4}\),   and   Trevor Cohn\(^{1}\)

\(^{1}\) The University of Melbourne

\(^{2}\) Alexa AI, Amazon

\(^{3}\) Huawei Technologies Co., Ltd.

\(^{4}\) MBZUAI



  • \(^a\) Work carried out at The University of Melbourne

  • fairlib is licensed under the Apache License 2.0

GitHub, Docs, PyPI


In this video, we will demonstrate how to:

  1. Install fairlib
  2. Access fairness benchmark datasets
  3. Train a vanilla model without debiasing, and measure fairness
  4. Improve fairness with recent debiasing methods
  5. Analyze the results, e.g., by creating tables and figures

1. Installation

The most straightforward way to install fairlib is with pip:

[1]:
!pip install -q fairlib

See the fairlib documentation for other installation options.

After installation, let’s import fairlib.

[2]:
import fairlib

2. Build a dataset

fairlib provides simple APIs to access fairness benchmark datasets that are publicly available and under strict ethical guidelines.

In this video, we will use the preprocessed Moji dataset, where each tweet is annotated with a binary sentiment label (happy vs. sad) and a binary race label (African American English, AAE, vs. Standard American English, SAE). The original tweets are encoded with the pre-trained DeepMoji model as 2304-dimensional vectors.

The following are random examples from the Moji dataset.

| Text | Sentiment | Race |
|---|---|---|
| Dfl somebody said to me yesterday that how can u u have a iPhone or an S3 an ur phone off dfl | Positive | AAE |
| smh I bet maybe u just don’t care bout poor boo no more | Negative | AAE |
| I actually put jeans on today and I already wanna go put on leggings or yogas | Positive | SAE |
| I’m sitting next to the most awkward couple on the plane like they are making out and holding hands , I just can’t | Negative | SAE |

See here for other available datasets, including text, image, and structured inputs.

[3]:
from fairlib import datasets
[4]:
datasets.prepare_dataset("moji", "data/deepmoji")
saving to /content/data/deepmoji/pos_pos.npy
saving to /content/data/deepmoji/pos_neg.npy
saving to /content/data/deepmoji/neg_pos.npy
saving to /content/data/deepmoji/neg_neg.npy

Datasets are downloaded and saved to data/deepmoji, and then used to create training, validation, and test splits, following previous work.

[5]:
datasets.name2class.keys()
[5]:
dict_keys(['moji', 'bios', 'coloredmnist', 'compas', 'tp_pos', 'adult', 'coco', 'imsitu'])

3. Train a vanilla model without debiasing

Specify hyperparameters

By default, fairlib trains vanilla models without debiasing.

Here we specify the experiment id (exp_id = vanilla), which identifies the directory for saving experimental results.

[6]:
args = {
    # Name of the dataset; the corresponding dataloader will be used.
    "dataset":  "Moji",

    # Specify the path to the input data.
    "data_dir": "data/deepmoji",

    # Device for computing: -1 is the CPU; non-negative numbers indicate the GPU id.
    "device_id":    -1,

    # Name of the experiment, which will be used in the output path.
    "exp_id":"vanilla",
}

# Init the argument
options = fairlib.BaseOptions()
state = options.get_state(args=args, silence=True)
INFO:root:Unexpected args: ['-f', '/root/.local/share/jupyter/runtime/kernel-8eb44448-ab67-4429-be7a-d5fc3fe3b2ce.json']
INFO:root:Logging to ./results/dev/Moji/vanilla/output.log
2022-07-21 06:53:09 [INFO ]  ======================================== 2022-07-21 06:53:09 ========================================
2022-07-21 06:53:09 [INFO ]  Base directory is ./results/dev/Moji/vanilla
Loaded data shapes: (99998, 2304), (99998,), (99998,)
Loaded data shapes: (8000, 2304), (8000,), (8000,)
Loaded data shapes: (7998, 2304), (7998,), (7998,)

state contains the hyperparameters for the experiment. Other required components, such as dataloaders, are also initialized automatically.
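The log line "Base directory is ./results/dev/Moji/vanilla" suggests that the output directory is composed of a results directory, a project directory, the dataset name, and exp_id. The sketch below illustrates that composition; `experiment_dir` is a hypothetical helper written for this tutorial, not a fairlib function, and the ordering is an assumption read off the log.

```python
from pathlib import PurePosixPath


def experiment_dir(results_dir, project_dir, dataset, exp_id):
    """Hypothetical helper (not part of fairlib).

    Joins the path components in the order they appear in the
    'Base directory is ...' log line.
    """
    return PurePosixPath(results_dir) / project_dir / dataset / exp_id


vanilla_dir = experiment_dir("results", "dev", "Moji", "vanilla")
print(vanilla_dir)  # results/dev/Moji/vanilla
```

Changing exp_id therefore gives each experiment its own sub-directory, which is what later allows results from different runs to be told apart.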

Initialize a model

The default architecture is a 3-layer MLP classifier (two hidden layers plus an output layer) with Tanh activation functions in between:

[7]:
print(state.hidden_size, state.n_hidden, state.activation_function)
300 2 Tanh

Users can easily specify model architectures in the options. Please see the model architecture section for more details about the corresponding hyperparameters.

[8]:
fairlib.utils.seed_everything(2022)

# Init Model
model = fairlib.networks.get_main_model(state)
2022-07-21 06:53:10 [INFO ]  MLP(
2022-07-21 06:53:10 [INFO ]    (output_layer): Linear(in_features=300, out_features=2, bias=True)
2022-07-21 06:53:10 [INFO ]    (AF): Tanh()
2022-07-21 06:53:10 [INFO ]    (hidden_layers): ModuleList(
2022-07-21 06:53:10 [INFO ]      (0): Linear(in_features=2304, out_features=300, bias=True)
2022-07-21 06:53:10 [INFO ]      (1): Tanh()
2022-07-21 06:53:10 [INFO ]      (2): Linear(in_features=300, out_features=300, bias=True)
2022-07-21 06:53:10 [INFO ]      (3): Tanh()
2022-07-21 06:53:10 [INFO ]    )
2022-07-21 06:53:10 [INFO ]    (criterion): CrossEntropyLoss()
2022-07-21 06:53:10 [INFO ]  )
2022-07-21 06:53:10 [INFO ]  Total number of parameters: 782402
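As a sanity check, the parameter count in the log can be reproduced by hand from the layer shapes printed above. The sketch below uses plain arithmetic only; `linear_params` and `mlp_params` are illustrative names, not fairlib functions.

```python
def linear_params(n_in, n_out):
    # A fully connected layer has n_in * n_out weights plus n_out biases.
    return n_in * n_out + n_out


def mlp_params(input_dim, hidden_size, n_hidden, n_classes):
    # Hidden layers: input_dim -> hidden_size -> ... -> hidden_size
    dims = [input_dim] + [hidden_size] * n_hidden
    total = sum(linear_params(a, b) for a, b in zip(dims[:-1], dims[1:]))
    # Output layer: hidden_size -> n_classes
    return total + linear_params(hidden_size, n_classes)


# 2304-d inputs, hidden_size=300, n_hidden=2, binary sentiment labels
total = mlp_params(2304, 300, 2, 2)
print(total)  # 782402
```

The Tanh activations and the loss criterion contribute no trainable parameters, so the three linear layers account for the whole total.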

Train a model

A list of hyperparameters has been predefined in fairlib, so we can now directly train a model with the model class’s built-in train_self method.

Please see the link for all hyperparameters associated with model training.

[9]:
model.train_self()
2022-07-21 06:53:10 [INFO ]  Epoch:    0 [      0/  99998 ( 0%)]        Loss: 0.6907     Data Time: 0.06s       Train Time: 0.18s
2022-07-21 06:53:14 [INFO ]  Epoch:    0 [  51200/  99998 (51%)]        Loss: 0.4075     Data Time: 0.44s       Train Time: 3.34s
2022-07-21 06:53:18 [INFO ]  Evaluation at Epoch 0
2022-07-21 06:53:18 [INFO ]  Validation accuracy: 71.44 macro_fscore: 71.35     micro_fscore: 71.44     TPR_GAP: 39.59  FPR_GAP: 39.59  PPR_GAP: 38.67
2022-07-21 06:53:18 [INFO ]  Test accuracy: 71.21       macro_fscore: 71.12     micro_fscore: 71.21     TPR_GAP: 39.42  FPR_GAP: 39.42  PPR_GAP: 38.49
2022-07-21 06:53:18 [INFO ]  Epoch:    1 [      0/  99998 ( 0%)]        Loss: 0.4195     Data Time: 0.01s       Train Time: 0.07s
2022-07-21 06:53:22 [INFO ]  Epoch:    1 [  51200/  99998 (51%)]        Loss: 0.3780     Data Time: 0.42s       Train Time: 3.29s
2022-07-21 06:53:26 [INFO ]  Evaluation at Epoch 1
2022-07-21 06:53:26 [INFO ]  Validation accuracy: 72.10 macro_fscore: 72.06     micro_fscore: 72.10     TPR_GAP: 37.69  FPR_GAP: 37.69  PPR_GAP: 37.00
2022-07-21 06:53:26 [INFO ]  Test accuracy: 71.72       macro_fscore: 71.68     micro_fscore: 71.72     TPR_GAP: 37.11  FPR_GAP: 37.11  PPR_GAP: 36.41
2022-07-21 06:53:27 [INFO ]  Epoch:    2 [      0/  99998 ( 0%)]        Loss: 0.3608     Data Time: 0.01s       Train Time: 0.07s
2022-07-21 06:53:30 [INFO ]  Epoch:    2 [  51200/  99998 (51%)]        Loss: 0.3567     Data Time: 0.42s       Train Time: 3.29s
2022-07-21 06:53:34 [INFO ]  Epochs since last improvement: 1
2022-07-21 06:53:34 [INFO ]  Evaluation at Epoch 2
2022-07-21 06:53:34 [INFO ]  Validation accuracy: 71.53 macro_fscore: 71.44     micro_fscore: 71.53     TPR_GAP: 39.40  FPR_GAP: 39.40  PPR_GAP: 38.40
2022-07-21 06:53:34 [INFO ]  Test accuracy: 71.38       macro_fscore: 71.28     micro_fscore: 71.38     TPR_GAP: 39.07  FPR_GAP: 39.07  PPR_GAP: 37.99
2022-07-21 06:53:35 [INFO ]  Epoch:    3 [      0/  99998 ( 0%)]        Loss: 0.3584     Data Time: 0.01s       Train Time: 0.07s
2022-07-21 06:53:38 [INFO ]  Epoch:    3 [  51200/  99998 (51%)]        Loss: 0.3430     Data Time: 0.42s       Train Time: 3.28s
2022-07-21 06:53:42 [INFO ]  Epochs since last improvement: 2
2022-07-21 06:53:42 [INFO ]  Evaluation at Epoch 3
2022-07-21 06:53:43 [INFO ]  Validation accuracy: 71.08 macro_fscore: 70.73     micro_fscore: 71.08     TPR_GAP: 41.08  FPR_GAP: 41.08  PPR_GAP: 38.75
2022-07-21 06:53:43 [INFO ]  Test accuracy: 70.87       macro_fscore: 70.49     micro_fscore: 70.87     TPR_GAP: 40.67  FPR_GAP: 40.67  PPR_GAP: 38.42
2022-07-21 06:53:43 [INFO ]  Epoch:    4 [      0/  99998 ( 0%)]        Loss: 0.3578     Data Time: 0.02s       Train Time: 0.07s
2022-07-21 06:53:46 [INFO ]  Epoch:    4 [  51200/  99998 (51%)]        Loss: 0.3881     Data Time: 0.44s       Train Time: 3.36s
2022-07-21 06:53:50 [INFO ]  Epochs since last improvement: 3
2022-07-21 06:53:50 [INFO ]  Evaluation at Epoch 4
2022-07-21 06:53:51 [INFO ]  Validation accuracy: 72.06 macro_fscore: 71.96     micro_fscore: 72.06     TPR_GAP: 38.04  FPR_GAP: 38.04  PPR_GAP: 36.87
2022-07-21 06:53:51 [INFO ]  Test accuracy: 71.72       macro_fscore: 71.61     micro_fscore: 71.72     TPR_GAP: 38.09  FPR_GAP: 38.09  PPR_GAP: 36.96
2022-07-21 06:53:51 [INFO ]  Epoch:    5 [      0/  99998 ( 0%)]        Loss: 0.4008     Data Time: 0.02s       Train Time: 0.07s
2022-07-21 06:53:54 [INFO ]  Epoch:    5 [  51200/  99998 (51%)]        Loss: 0.3393     Data Time: 0.43s       Train Time: 3.29s
2022-07-21 06:53:58 [INFO ]  Epochs since last improvement: 4
2022-07-21 06:53:58 [INFO ]  Evaluation at Epoch 5
2022-07-21 06:53:59 [INFO ]  Validation accuracy: 71.44 macro_fscore: 71.41     micro_fscore: 71.44     TPR_GAP: 40.50  FPR_GAP: 40.50  PPR_GAP: 39.97
2022-07-21 06:53:59 [INFO ]  Test accuracy: 71.13       macro_fscore: 71.11     micro_fscore: 71.13     TPR_GAP: 39.69  FPR_GAP: 39.69  PPR_GAP: 39.19
2022-07-21 06:53:59 [INFO ]  Epoch:    6 [      0/  99998 ( 0%)]        Loss: 0.3890     Data Time: 0.01s       Train Time: 0.07s
2022-07-21 06:54:03 [INFO ]  Epoch:    6 [  51200/  99998 (51%)]        Loss: 0.3668     Data Time: 0.44s       Train Time: 3.32s
2022-07-21 06:54:07 [INFO ]  Epochs since last improvement: 5
2022-07-21 06:54:07 [INFO ]  Evaluation at Epoch 6
2022-07-21 06:54:07 [INFO ]  Validation accuracy: 71.96 macro_fscore: 71.92     micro_fscore: 71.96     TPR_GAP: 37.52  FPR_GAP: 37.52  PPR_GAP: 36.82
2022-07-21 06:54:07 [INFO ]  Test accuracy: 72.04       macro_fscore: 72.00     micro_fscore: 72.04     TPR_GAP: 36.75  FPR_GAP: 36.75  PPR_GAP: 36.01

After each epoch, a subset of evaluation results over the validation set and test set is logged, including metrics for both performance and fairness.

It can be seen that the vanilla model has a TPR GAP of roughly 37 to 41 percentage points across epochs.
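To make the TPR GAP number more concrete, here is a minimal sketch of one way such a gap can be computed: for each class, take the difference in true positive rate between demographic groups, then aggregate the per-class gaps with a root mean square. The aggregation choice and the function names are assumptions for illustration; fairlib's own implementation may differ in its details, and it reports the value in percentage points rather than as a fraction.

```python
from math import sqrt


def class_recall(y_true, y_pred, cls):
    # Fraction of instances of class `cls` that are predicted as `cls`.
    idx = [i for i, y in enumerate(y_true) if y == cls]
    return sum(1 for i in idx if y_pred[i] == cls) / len(idx)


def rms_tpr_gap(y_true, y_pred, groups):
    """Illustrative TPR gap: per-class recall gap between groups, RMS over classes."""
    gaps = []
    for cls in sorted(set(y_true)):
        rates = []
        for g in sorted(set(groups)):
            yt = [y for y, a in zip(y_true, groups) if a == g]
            yp = [p for p, a in zip(y_pred, groups) if a == g]
            rates.append(class_recall(yt, yp, cls))
        gaps.append(max(rates) - min(rates))
    return sqrt(sum(x * x for x in gaps) / len(gaps))


# Toy data: group 0 is always classified correctly, group 1 only half the time.
y_true = [1, 1, 1, 1, 0, 0, 0, 0]
groups = [0, 0, 1, 1, 0, 0, 1, 1]
y_pred = [1, 1, 1, 0, 0, 0, 0, 1]

gap = rms_tpr_gap(y_true, y_pred, groups)
print(f"TPR gap: {gap:.2f}")
```

A gap of 0 would mean that every class is recognized equally well for both groups; larger values indicate a larger disparity.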

4. Improve Fairness

To mitigate bias, we show an example of employing BTEO (Han et al., 2021a) and adversarial training (Li et al., 2018) simultaneously.

Enable debiasing

The only difference for debiasing is to specify the corresponding arguments, as shown in the following cell. Everything else is identical to standard training.

  • A list of supported bias mitigation methods is shown here.

  • The usage introduces further options associated with each debiasing method.

[10]:
debiasing_args = args.copy()

# Update the experiment name
debiasing_args["exp_id"] = "BT_Adv"

# Perform adversarial training if True
debiasing_args["adv_debiasing"] = True

# Specify the hyperparameters for Balanced Training
debiasing_args["BT"] = "Downsampling"
debiasing_args["BTObj"] = "EO"

debias_options = fairlib.BaseOptions()
debias_state = debias_options.get_state(args=debiasing_args, silence=True)

fairlib.utils.seed_everything(2022)

debias_model = fairlib.networks.get_main_model(debias_state)
2022-07-21 06:54:07 [INFO ]  Unexpected args: ['-f', '/root/.local/share/jupyter/runtime/kernel-8eb44448-ab67-4429-be7a-d5fc3fe3b2ce.json']
2022-07-21 06:54:07 [INFO ]  Logging to ./results/dev/Moji/BT_Adv/output.log
2022-07-21 06:54:07 [INFO ]  ======================================== 2022-07-21 06:54:07 ========================================
2022-07-21 06:54:07 [INFO ]  Base directory is ./results/dev/Moji/BT_Adv
Loaded data shapes: (39996, 2304), (39996,), (39996,)
Loaded data shapes: (8000, 2304), (8000,), (8000,)
Loaded data shapes: (7998, 2304), (7998,), (7998,)
2022-07-21 06:54:16 [INFO ]  SubDiscriminator(
2022-07-21 06:54:16 [INFO ]    (grad_rev): GradientReversal()
2022-07-21 06:54:16 [INFO ]    (output_layer): Linear(in_features=300, out_features=2, bias=True)
2022-07-21 06:54:16 [INFO ]    (AF): ReLU()
2022-07-21 06:54:16 [INFO ]    (hidden_layers): ModuleList(
2022-07-21 06:54:16 [INFO ]      (0): Linear(in_features=300, out_features=300, bias=True)
2022-07-21 06:54:16 [INFO ]      (1): ReLU()
2022-07-21 06:54:16 [INFO ]      (2): Linear(in_features=300, out_features=300, bias=True)
2022-07-21 06:54:16 [INFO ]      (3): ReLU()
2022-07-21 06:54:16 [INFO ]    )
2022-07-21 06:54:16 [INFO ]    (criterion): CrossEntropyLoss()
2022-07-21 06:54:16 [INFO ]  )
2022-07-21 06:54:16 [INFO ]  Total number of parameters: 181202

2022-07-21 06:54:16 [INFO ]  Discriminator built!
2022-07-21 06:54:16 [INFO ]  MLP(
2022-07-21 06:54:16 [INFO ]    (output_layer): Linear(in_features=300, out_features=2, bias=True)
2022-07-21 06:54:16 [INFO ]    (AF): Tanh()
2022-07-21 06:54:16 [INFO ]    (hidden_layers): ModuleList(
2022-07-21 06:54:16 [INFO ]      (0): Linear(in_features=2304, out_features=300, bias=True)
2022-07-21 06:54:16 [INFO ]      (1): Tanh()
2022-07-21 06:54:16 [INFO ]      (2): Linear(in_features=300, out_features=300, bias=True)
2022-07-21 06:54:16 [INFO ]      (3): Tanh()
2022-07-21 06:54:16 [INFO ]    )
2022-07-21 06:54:16 [INFO ]    (criterion): CrossEntropyLoss()
2022-07-21 06:54:16 [INFO ]  )
2022-07-21 06:54:16 [INFO ]  Total number of parameters: 782402

It can be seen from the last cell that the training dataset is smaller than before (40k versus 100k) due to the downsampling for balanced training, and that an MLP adversary is initialized for adversarial debiasing.
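The drop from 99,998 to 39,996 training instances is what one would expect if every combination of target label and protected group were cut down to the size of the smallest combination. The sketch below illustrates that idea on toy data; it is an assumption about what downsampling for balanced training does, written with hypothetical names, and not fairlib's implementation.

```python
import random
from collections import defaultdict


def downsample_balanced(labels, groups, seed=0):
    """Keep the same number of instances for every (label, group) cell."""
    cells = defaultdict(list)
    for i, (y, g) in enumerate(zip(labels, groups)):
        cells[(y, g)].append(i)

    # Every cell is reduced to the size of the smallest cell.
    target = min(len(indices) for indices in cells.values())

    rng = random.Random(seed)
    kept = []
    for key in sorted(cells):
        kept.extend(rng.sample(cells[key], target))
    return sorted(kept)


# Toy data with a skewed label/group joint distribution (4:1 and 1:4).
labels = [1] * 5 + [0] * 5
groups = [0, 0, 0, 0, 1] + [0, 1, 1, 1, 1]

kept = downsample_balanced(labels, groups)
print(len(labels), "->", len(kept))
```

After downsampling, the protected attribute is no longer predictive of the target label within the training data, at the cost of discarding instances from the over-represented cells.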

Mitigate bias

[11]:
debias_model.train_self()
2022-07-21 06:54:17 [INFO ]  Epoch:    0 [      0/  39996 ( 0%)]        Loss: 0.0003     Data Time: 0.02s       Train Time: 0.22s
2022-07-21 06:54:25 [INFO ]  Evaluation at Epoch 0
2022-07-21 06:54:26 [INFO ]  Validation accuracy: 73.72 macro_fscore: 73.08     micro_fscore: 73.72     TPR_GAP: 19.20  FPR_GAP: 19.20  PPR_GAP: 16.80
2022-07-21 06:54:26 [INFO ]  Test accuracy: 73.59       macro_fscore: 72.98     micro_fscore: 73.59     TPR_GAP: 20.70  FPR_GAP: 20.70  PPR_GAP: 17.86
2022-07-21 06:54:26 [INFO ]  Epoch:    1 [      0/  39996 ( 0%)]        Loss: -0.1368    Data Time: 0.01s       Train Time: 0.20s
2022-07-21 06:54:34 [INFO ]  Epochs since last improvement: 1
2022-07-21 06:54:34 [INFO ]  Evaluation at Epoch 1
2022-07-21 06:54:35 [INFO ]  Validation accuracy: 68.96 macro_fscore: 67.36     micro_fscore: 68.96     TPR_GAP: 10.57  FPR_GAP: 10.57  PPR_GAP: 4.67
2022-07-21 06:54:35 [INFO ]  Test accuracy: 68.89       macro_fscore: 67.33     micro_fscore: 68.89     TPR_GAP: 11.46  FPR_GAP: 11.46  PPR_GAP: 6.46
2022-07-21 06:54:35 [INFO ]  Epoch:    2 [      0/  39996 ( 0%)]        Loss: -0.1266    Data Time: 0.01s       Train Time: 0.20s
2022-07-21 06:54:43 [INFO ]  Evaluation at Epoch 2
2022-07-21 06:54:44 [INFO ]  Validation accuracy: 74.94 macro_fscore: 74.68     micro_fscore: 74.94     TPR_GAP: 14.13  FPR_GAP: 14.13  PPR_GAP: 11.37
2022-07-21 06:54:44 [INFO ]  Test accuracy: 75.54       macro_fscore: 75.33     micro_fscore: 75.54     TPR_GAP: 14.66  FPR_GAP: 14.66  PPR_GAP: 11.41
2022-07-21 06:54:44 [INFO ]  Epoch:    3 [      0/  39996 ( 0%)]        Loss: -0.1809    Data Time: 0.01s       Train Time: 0.20s
2022-07-21 06:54:52 [INFO ]  Evaluation at Epoch 3
2022-07-21 06:54:53 [INFO ]  Validation accuracy: 74.99 macro_fscore: 74.98     micro_fscore: 74.99     TPR_GAP: 12.09  FPR_GAP: 12.09  PPR_GAP: 8.67
2022-07-21 06:54:53 [INFO ]  Test accuracy: 75.47       macro_fscore: 75.47     micro_fscore: 75.47     TPR_GAP: 11.43  FPR_GAP: 11.43  PPR_GAP: 7.95
2022-07-21 06:54:53 [INFO ]  Epoch:    4 [      0/  39996 ( 0%)]        Loss: -0.1584    Data Time: 0.01s       Train Time: 0.20s
2022-07-21 06:55:01 [INFO ]  Evaluation at Epoch 4
2022-07-21 06:55:02 [INFO ]  Validation accuracy: 75.36 macro_fscore: 75.34     micro_fscore: 75.36     TPR_GAP: 12.23  FPR_GAP: 12.23  PPR_GAP: 9.27
2022-07-21 06:55:02 [INFO ]  Test accuracy: 75.73       macro_fscore: 75.70     micro_fscore: 75.73     TPR_GAP: 13.03  FPR_GAP: 13.03  PPR_GAP: 9.38
2022-07-21 06:55:02 [INFO ]  Epoch:    5 [      0/  39996 ( 0%)]        Loss: -0.1741    Data Time: 0.02s       Train Time: 0.20s
2022-07-21 06:55:10 [INFO ]  Epochs since last improvement: 1
2022-07-21 06:55:10 [INFO ]  Evaluation at Epoch 5
2022-07-21 06:55:11 [INFO ]  Validation accuracy: 75.22 macro_fscore: 75.15     micro_fscore: 75.22     TPR_GAP: 10.86  FPR_GAP: 10.86  PPR_GAP: 6.30
2022-07-21 06:55:11 [INFO ]  Test accuracy: 75.44       macro_fscore: 75.36     micro_fscore: 75.44     TPR_GAP: 10.99  FPR_GAP: 10.99  PPR_GAP: 6.50
2022-07-21 06:55:11 [INFO ]  Epoch:    6 [      0/  39996 ( 0%)]        Loss: -0.2291    Data Time: 0.01s       Train Time: 0.20s
2022-07-21 06:55:19 [INFO ]  Epochs since last improvement: 2
2022-07-21 06:55:19 [INFO ]  Evaluation at Epoch 6
2022-07-21 06:55:20 [INFO ]  Validation accuracy: 74.94 macro_fscore: 74.93     micro_fscore: 74.94     TPR_GAP: 9.14   FPR_GAP: 9.14   PPR_GAP: 1.42
2022-07-21 06:55:20 [INFO ]  Test accuracy: 75.17       macro_fscore: 75.17     micro_fscore: 75.17     TPR_GAP: 9.69   FPR_GAP: 9.69   PPR_GAP: 2.85
2022-07-21 06:55:20 [INFO ]  Epoch:    7 [      0/  39996 ( 0%)]        Loss: -0.1665    Data Time: 0.01s       Train Time: 0.21s
2022-07-21 06:55:28 [INFO ]  Epochs since last improvement: 3
2022-07-21 06:55:28 [INFO ]  Evaluation at Epoch 7
2022-07-21 06:55:29 [INFO ]  Validation accuracy: 74.98 macro_fscore: 74.97     micro_fscore: 74.98     TPR_GAP: 9.40   FPR_GAP: 9.40   PPR_GAP: 5.25
2022-07-21 06:55:29 [INFO ]  Test accuracy: 75.38       macro_fscore: 75.38     micro_fscore: 75.38     TPR_GAP: 10.85  FPR_GAP: 10.85  PPR_GAP: 6.88
2022-07-21 06:55:29 [INFO ]  Epoch:    8 [      0/  39996 ( 0%)]        Loss: -0.2003    Data Time: 0.01s       Train Time: 0.21s
2022-07-21 06:55:37 [INFO ]  Epochs since last improvement: 4
2022-07-21 06:55:37 [INFO ]  Evaluation at Epoch 8
2022-07-21 06:55:38 [INFO ]  Validation accuracy: 74.04 macro_fscore: 73.54     micro_fscore: 74.04     TPR_GAP: 10.18  FPR_GAP: 10.18  PPR_GAP: 6.12
2022-07-21 06:55:38 [INFO ]  Test accuracy: 74.22       macro_fscore: 73.75     micro_fscore: 74.22     TPR_GAP: 10.17  FPR_GAP: 10.17  PPR_GAP: 6.96
2022-07-21 06:55:38 [INFO ]  Epoch:    9 [      0/  39996 ( 0%)]        Loss: -0.2118    Data Time: 0.01s       Train Time: 0.21s
2022-07-21 06:55:46 [INFO ]  Epochs since last improvement: 5
2022-07-21 06:55:46 [INFO ]  Evaluation at Epoch 9
2022-07-21 06:55:47 [INFO ]  Validation accuracy: 73.14 macro_fscore: 72.87     micro_fscore: 73.14     TPR_GAP: 9.11   FPR_GAP: 9.11   PPR_GAP: 5.97
2022-07-21 06:55:47 [INFO ]  Test accuracy: 73.36       macro_fscore: 73.14     micro_fscore: 73.36     TPR_GAP: 8.46   FPR_GAP: 8.46   PPR_GAP: 6.07
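For a quick comparison between runs, the evaluation lines in the training logs can be parsed into dictionaries. This is a minimal sketch based only on the log format shown above; `parse_eval_line` is a hypothetical helper, and the analysis module used in the next section is the intended way to collect results.

```python
import re

# Matches "name: number" pairs such as "TPR_GAP: 9.11".
METRIC = re.compile(r"(\w+):\s+(-?\d+(?:\.\d+)?)")


def parse_eval_line(line):
    """Return (split, metrics) for an evaluation log line, or None otherwise."""
    match = re.search(r"\b(Validation|Test)\b(.*)", line)
    if match is None:
        return None
    split, rest = match.group(1), match.group(2)
    return split, {name: float(value) for name, value in METRIC.findall(rest)}


line = (
    "2022-07-21 06:55:47 [INFO ]  Validation accuracy: 73.14 macro_fscore: 72.87"
    "     micro_fscore: 73.14     TPR_GAP: 9.11   FPR_GAP: 9.11   PPR_GAP: 5.97"
)
split, metrics = parse_eval_line(line)
print(split, metrics["accuracy"], metrics["TPR_GAP"])
```

Lines that report training loss or epoch boundaries contain neither "Validation" nor "Test", so the helper returns None for them.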

5. Analyze the results

[12]:
from fairlib import analysis

Here we define a list of hyperparameters that will be repeatedly used for analysis.

[13]:
Shared_options = {
    # Random seed
    "seed": 2022,

    # Name of the dataset; the corresponding dataloader will be used.
    "dataset":  "Moji",

    # Specify the path to the input data.
    "data_dir": "data/deepmoji",

    # Device for computing: -1 is the CPU; non-negative numbers indicate the GPU id.
    "device_id":    -1,

    # The default path for saving experimental results
    "results_dir":  "results",

    # Will be used for saving experimental results
    "project_dir":  "dev",

    # We will focus on TPR GAP, which corresponds to Equalized Odds for binary classification.
    "GAP_metric_name":  "TPR_GAP",

    # The overall performance will be measured as accuracy
    "Performance_metric_name":  "accuracy",

    # Model selections are based on distance to optimum, see section 4 in our paper for more details
    "selection_criterion":  "DTO",

    # Default dirs for saving checkpoints
    "checkpoint_dir":   "models",
    "checkpoint_name":  "checkpoint_epoch",

    # Loading experimental results
    "n_jobs":   1,

}

Epoch Selection

Here we demonstrate the use of DTO (distance to the optimum) for epoch selection (i.e., post-hoc early stopping). model_selection retrieves experimental results for a single method, selects the desired epoch for each run, and saves the resulting DataFrame for later processing.
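The idea behind DTO-based epoch selection can be sketched in a few lines: treat each epoch as a point (performance, fairness), measure its Euclidean distance to a reference "utopia" point, and keep the epoch with the smallest distance. The reference point (1, 1), the toy numbers, and the function names below are assumptions for illustration; fairlib may use a different reference point, such as the best values observed among the candidates.

```python
from math import hypot


def dto(performance, fairness, utopia=(1.0, 1.0)):
    # Euclidean distance from (performance, fairness) to the utopia point.
    return hypot(utopia[0] - performance, utopia[1] - fairness)


def select_epoch(history, utopia=(1.0, 1.0)):
    """history: list of (epoch, performance, fairness); returns the best epoch."""
    return min(history, key=lambda row: dto(row[1], row[2], utopia))[0]


# Hypothetical validation results for three epochs.
history = [
    (0, 0.70, 0.60),
    (1, 0.75, 0.90),
    (2, 0.76, 0.80),
]

best = select_epoch(history)
print("selected epoch:", best)
```

Here epoch 2 has the highest performance, but epoch 1 is selected because it offers a better balance between performance and fairness.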

[14]:
analysis.model_selection(
    # exp_ids starting with model_id are treated as the same method, e.g., vanilla and adv
    model_id= ("vanilla"),

    # the tuned hyperparameters of a method, which are used to group multiple runs together
    index_column_names = ["BT", "BTObj", "adv_debiasing"],

    # for convenience in further analysis, the resulting DataFrame is stored at the specified path
    save_path = r"results/Vanilla_df.pkl",

    # The following options are predefined
    results_dir= Shared_options["results_dir"],
    project_dir= Shared_options["project_dir"]+"/"+Shared_options["dataset"],
    GAP_metric_name = Shared_options["GAP_metric_name"],
    Performance_metric_name = Shared_options["Performance_metric_name"],
    # We use DTO for epoch selection
    selection_criterion = Shared_options["selection_criterion"],
    checkpoint_dir= Shared_options["checkpoint_dir"],
    checkpoint_name= Shared_options["checkpoint_name"],
    # Number of parallel jobs used to retrieve results
    n_jobs=Shared_options["n_jobs"],
)
[Parallel(n_jobs=1)]: Using backend SequentialBackend with 1 concurrent workers.
[Parallel(n_jobs=1)]: Done   1 out of   1 | elapsed:    0.3s remaining:    0.0s
[Parallel(n_jobs=1)]: Done   1 out of   1 | elapsed:    0.3s finished
[14]:
| BT | BTObj | adv_debiasing | epoch | dev_DTO | test_DTO | dev_performance | dev_fairness | test_performance | test_fairness | opt_dir |
|---|---|---|---|---|---|---|---|---|---|---|
| NaN | NaN | False | 1 | 0.001907 | 0.0 | 0.719625 | 0.624825 | 0.72043 | 0.632535 | results/dev/Moji/vanilla/opt.yaml |
[15]:
analysis.model_selection(
    model_id= ("BT_Adv"),
    index_column_names = ["BT", "BTObj", "adv_debiasing"],
    save_path = r"results/BT_ADV_df.pkl",
    # The following options are predefined
    results_dir= Shared_options["results_dir"],
    project_dir= Shared_options["project_dir"]+"/"+Shared_options["dataset"],
    GAP_metric_name = Shared_options["GAP_metric_name"],
    Performance_metric_name = Shared_options["Performance_metric_name"],
    selection_criterion = Shared_options["selection_criterion"],
    checkpoint_dir= Shared_options["checkpoint_dir"],
    checkpoint_name= Shared_options["checkpoint_name"],
    n_jobs=Shared_options["n_jobs"],
)
[Parallel(n_jobs=1)]: Using backend SequentialBackend with 1 concurrent workers.
[Parallel(n_jobs=1)]: Done   1 out of   1 | elapsed:    0.5s remaining:    0.0s
[Parallel(n_jobs=1)]: Done   1 out of   1 | elapsed:    0.5s finished
[15]:
| BT | BTObj | adv_debiasing | epoch | dev_DTO | test_DTO | dev_performance | dev_fairness | test_performance | test_fairness | opt_dir |
|---|---|---|---|---|---|---|---|---|---|---|
| Downsampling | EO | True | 1 | 0.005648 | 0.015392 | 0.749375 | 0.908632 | 0.751688 | 0.903068 | results/dev/Moji/BT_Adv/opt.yaml |

We have preprocessed the results with the model_selection function, and the resulting DataFrames can be downloaded as follows:

[16]:
!wget --no-check-certificate 'https://docs.google.com/uc?export=download&id=1M0G6PyPuDC8Y_2nL9XKYCt10IUzbSvfl' -O retrived_results.tar.gz
--2022-07-21 06:55:48--  https://docs.google.com/uc?export=download&id=1M0G6PyPuDC8Y_2nL9XKYCt10IUzbSvfl
Resolving docs.google.com (docs.google.com)... 108.177.97.138, 108.177.97.102, 108.177.97.100, ...
Connecting to docs.google.com (docs.google.com)|108.177.97.138|:443... connected.
HTTP request sent, awaiting response... 303 See Other
Location: https://doc-0g-0k-docs.googleusercontent.com/docs/securesc/ha0ro937gcuc7l7deffksulhg5h7mbp1/d9ah1be1dhjtrm0rinvfdje85fkvmrdh/1658386500000/17527887236587461918/*/1M0G6PyPuDC8Y_2nL9XKYCt10IUzbSvfl?e=download&uuid=2bc658ea-acf2-4d5e-8ae5-920657110366 [following]
Warning: wildcards not supported in HTTP.
--2022-07-21 06:55:51--  https://doc-0g-0k-docs.googleusercontent.com/docs/securesc/ha0ro937gcuc7l7deffksulhg5h7mbp1/d9ah1be1dhjtrm0rinvfdje85fkvmrdh/1658386500000/17527887236587461918/*/1M0G6PyPuDC8Y_2nL9XKYCt10IUzbSvfl?e=download&uuid=2bc658ea-acf2-4d5e-8ae5-920657110366
Resolving doc-0g-0k-docs.googleusercontent.com (doc-0g-0k-docs.googleusercontent.com)... 74.125.204.132, 2404:6800:4008:c04::84
Connecting to doc-0g-0k-docs.googleusercontent.com (doc-0g-0k-docs.googleusercontent.com)|74.125.204.132|:443... connected.
HTTP request sent, awaiting response... 200 OK
Length: 790461 (772K) [application/x-gzip]
Saving to: ‘retrived_results.tar.gz’

retrived_results.ta 100%[===================>] 771.93K  --.-KB/s    in 0.006s

2022-07-21 06:55:51 (129 MB/s) - ‘retrived_results.tar.gz’ saved [790461/790461]

[17]:
!tar -xf retrived_results.tar.gz

Model Selection

Here we demonstrate final_results_df. Cached results for all methods are first loaded with retrive_results; final_results_df then selects the best hyperparameter combination for each method and presents the results in a DataFrame.

[18]:
Moji_results = analysis.retrive_results("Moji", log_dir="analysis/results")
[19]:
Moji_main_results = analysis.final_results_df(
    results_dict = Moji_results,
    pareto = False,
    pareto_selection = "test",
    selection_criterion = "DTO",
    return_dev = True,
    return_conf=True,
    )
Moji_main_results
[19]:
| | Models | test_performance mean | test_performance std | test_fairness mean | test_fairness std | dev_performance mean | dev_performance std | dev_fairness mean | dev_fairness std | DTO | epoch list | opt_dir list | is_pareto |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | GDEO | 0.752763 | 0.004999 | 0.892255 | 0.007860 | 0.749350 | 0.003494 | 0.912672 | 0.002766 | 0.269694 | [2, 12, 5, 9, 10] | [G:\Experimental_results\GroupDifference\Moji\... | True |
| 1 | BTFairBatch | 0.746837 | 0.003407 | 0.899351 | 0.004936 | 0.743975 | 0.004236 | 0.919254 | 0.004731 | 0.272437 | [8, 6, 9, 5, 7] | [G:\Experimental_results\FairBatch\Moji\BTInit... | False |
| 2 | Vanilla | 0.722981 | 0.004576 | 0.611870 | 0.014356 | 0.726650 | 0.003673 | 0.632302 | 0.013370 | 0.476849 | [2, 11, 2, 5, 2] | [G:\Experimental_results\vanilla\Moji\0\opt.ya... | True |
| 3 | BTEO | 0.753927 | 0.001433 | 0.877469 | 0.003756 | 0.746325 | 0.000998 | 0.896874 | 0.005401 | 0.274892 | [8, 6, 9, 4, 5] | [G:\Experimental_results\GatedBT\Moji\GatedBT_... | True |
| 4 | GatedDAdv | 0.750163 | 0.006945 | 0.908679 | 0.021678 | 0.745600 | 0.004828 | 0.928670 | 0.022488 | 0.266004 | [24, 13, 19, 4, 3] | [G:\Experimental_results\hypertune3\Moji\hyper... | False |
| 5 | FairBatch | 0.751488 | 0.005772 | 0.904373 | 0.008213 | 0.746050 | 0.003896 | 0.914526 | 0.006020 | 0.266276 | [9, 9, 6, 6, 5] | [G:\Experimental_results\FairBatch\Moji\FairBa... | True |
| 6 | GatedAdv | 0.753113 | 0.005196 | 0.890065 | 0.013302 | 0.748975 | 0.003805 | 0.910838 | 0.010314 | 0.270257 | [11, 4, 4, 13, 12] | [G:\Experimental_results\hypertune2\Moji\hyper... | False |
| 7 | DelayedCLS_Adv | 0.761015 | 0.003081 | 0.882425 | 0.015918 | 0.751675 | 0.003481 | 0.899346 | 0.011417 | 0.266341 | [13, 1, 3, 13, 1] | [/data/cephfs/punim1421/Fair_NLP_Classificatio... | True |
| 8 | GDMean | 0.752163 | 0.002130 | 0.901389 | 0.003916 | 0.749050 | 0.001368 | 0.922430 | 0.005829 | 0.266735 | [11, 12, 7, 2, 2] | [G:\Experimental_results\GroupDifference\Moji\... | True |
| 9 | GatedBTEO | 0.762106 | 0.002592 | 0.900764 | 0.014701 | 0.759775 | 0.003798 | 0.909445 | 0.006631 | 0.257762 | [3, 1, 3, 1, 11] | [G:\Experimental_results\GatedBT\Moji\GatedBT_... | True |
| 10 | BTGatedAdv | 0.735459 | 0.028830 | 0.866150 | 0.028232 | 0.730150 | 0.024594 | 0.886862 | 0.030537 | 0.296476 | [5, 19, 19, 6, 1] | [G:\Experimental_results\hypertune2\Moji\hyper... | True |
| 11 | Adv | 0.756414 | 0.007271 | 0.893286 | 0.005623 | 0.747425 | 0.004549 | 0.912125 | 0.008507 | 0.265936 | [9, 14, 18, 5, 16] | [G:\Experimental_results\hypertune\Moji\hypert... | True |
| 12 | OldFairBatch | 0.750638 | 0.006012 | 0.905537 | 0.005046 | 0.744525 | 0.004995 | 0.917734 | 0.004761 | 0.266655 | [5, 7, 8, 13, 7] | [G:\Experimental_results\Original_FairBatch\Mo... | True |
| 13 | FairSCL | 0.757314 | 0.003441 | 0.878219 | 0.004314 | 0.752825 | 0.001872 | 0.898325 | 0.002579 | 0.271527 | [12, 5, 13, 1, 1] | [G:\Experimental_results\FairSCL\Moji\FSCL_0.3... | False |
| 14 | INLP | 0.733433 | NaN | 0.855982 | NaN | 0.727625 | NaN | 0.859686 | NaN | 0.302983 | [103] | [G:\Experimental_results\INLP\Moji\INLP_True_b... | True |
| 15 | DAdv | 0.755464 | 0.004076 | 0.904023 | 0.011218 | 0.748550 | 0.002405 | 0.915601 | 0.005007 | 0.262697 | [6, 1, 7, 3, 5] | [G:\Experimental_results\hypertune\Moji\hypert... | True |
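The DTO column in this table appears to be consistent with the Euclidean distance from (test_performance mean, test_fairness mean) to the point (1, 1). The check below recomputes it for three rows copied from the table; the formula is inferred from the numbers rather than taken from fairlib's source code.

```python
from math import hypot

# (test_performance mean, test_fairness mean, reported DTO), copied from the table
rows = {
    "Vanilla": (0.722981, 0.611870, 0.476849),
    "GDEO": (0.752763, 0.892255, 0.269694),
    "GatedBTEO": (0.762106, 0.900764, 0.257762),
}

# Distance of each method to the ideal point (performance=1, fairness=1)
recomputed = {name: hypot(1 - perf, 1 - fair) for name, (perf, fair, _) in rows.items()}

for name, (_, _, reported) in rows.items():
    print(f"{name:10s} recomputed={recomputed[name]:.6f} reported={reported:.6f}")
```

Lower DTO is better: the vanilla model is far from the ideal point mainly because of its low fairness score, while the debiasing methods cluster much closer to it.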

Create \(\LaTeX{}\) tables

[20]:
print(Moji_main_results.to_latex(index=False))
\begin{tabular}{lrrrrrrrrrlll}
\toprule
        Models &  test\_performance mean &  test\_performance std &  test\_fairness mean &  test\_fairness std &  dev\_performance mean &  dev\_performance std &  dev\_fairness mean &  dev\_fairness std &      DTO &         epoch list &                                       opt\_dir list &  is\_pareto \\
\midrule
          GDEO &               0.752763 &              0.004999 &            0.892255 &           0.007860 &              0.749350 &             0.003494 &           0.912672 &          0.002766 & 0.269694 &  [2, 12, 5, 9, 10] & [G:\textbackslash Experimental\_results\textbackslash GroupDifference\textbackslash Moji\textbackslash G... &       True \\
   BTFairBatch &               0.746837 &              0.003407 &            0.899351 &           0.004936 &              0.743975 &             0.004236 &           0.919254 &          0.004731 & 0.272437 &    [8, 6, 9, 5, 7] & [G:\textbackslash Experimental\_results\textbackslash FairBatch\textbackslash Moji\textbackslash BTInitF... &      False \\
       Vanilla &               0.722981 &              0.004576 &            0.611870 &           0.014356 &              0.726650 &             0.003673 &           0.632302 &          0.013370 & 0.476849 &   [2, 11, 2, 5, 2] & [G:\textbackslash Experimental\_results\textbackslash vanilla\textbackslash Moji\textbackslash 0\textbackslash opt.yam... &       True \\
          BTEO &               0.753927 &              0.001433 &            0.877469 &           0.003756 &              0.746325 &             0.000998 &           0.896874 &          0.005401 & 0.274892 &    [8, 6, 9, 4, 5] & [G:\textbackslash Experimental\_results\textbackslash GatedBT\textbackslash Moji\textbackslash GatedBT\_R... &       True \\
     GatedDAdv &               0.750163 &              0.006945 &            0.908679 &           0.021678 &              0.745600 &             0.004828 &           0.928670 &          0.022488 & 0.266004 & [24, 13, 19, 4, 3] & [G:\textbackslash Experimental\_results\textbackslash hypertune3\textbackslash Moji\textbackslash hypert... &      False \\
     FairBatch &               0.751488 &              0.005772 &            0.904373 &           0.008213 &              0.746050 &             0.003896 &           0.914526 &          0.006020 & 0.266276 &    [9, 9, 6, 6, 5] & [G:\textbackslash Experimental\_results\textbackslash FairBatch\textbackslash Moji\textbackslash FairBat... &       True \\
      GatedAdv &               0.753113 &              0.005196 &            0.890065 &           0.013302 &              0.748975 &             0.003805 &           0.910838 &          0.010314 & 0.270257 & [11, 4, 4, 13, 12] & [G:\textbackslash Experimental\_results\textbackslash hypertune2\textbackslash Moji\textbackslash hypert... &      False \\
DelayedCLS\_Adv &               0.761015 &              0.003081 &            0.882425 &           0.015918 &              0.751675 &             0.003481 &           0.899346 &          0.011417 & 0.266341 &  [13, 1, 3, 13, 1] & [/data/cephfs/punim1421/Fair\_NLP\_Classification... &       True \\
        GDMean &               0.752163 &              0.002130 &            0.901389 &           0.003916 &              0.749050 &             0.001368 &           0.922430 &          0.005829 & 0.266735 &  [11, 12, 7, 2, 2] & [G:\textbackslash Experimental\_results\textbackslash GroupDifference\textbackslash Moji\textbackslash G... &       True \\
     GatedBTEO &               0.762106 &              0.002592 &            0.900764 &           0.014701 &              0.759775 &             0.003798 &           0.909445 &          0.006631 & 0.257762 &   [3, 1, 3, 1, 11] & [G:\textbackslash Experimental\_results\textbackslash GatedBT\textbackslash Moji\textbackslash GatedBT\_R... &       True \\
    BTGatedAdv &               0.735459 &              0.028830 &            0.866150 &           0.028232 &              0.730150 &             0.024594 &           0.886862 &          0.030537 & 0.296476 &  [5, 19, 19, 6, 1] & [G:\textbackslash Experimental\_results\textbackslash hypertune2\textbackslash Moji\textbackslash hypert... &       True \\
           Adv &               0.756414 &              0.007271 &            0.893286 &           0.005623 &              0.747425 &             0.004549 &           0.912125 &          0.008507 & 0.265936 & [9, 14, 18, 5, 16] & [G:\textbackslash Experimental\_results\textbackslash hypertune\textbackslash Moji\textbackslash hypertu... &       True \\
  OldFairBatch &               0.750638 &              0.006012 &            0.905537 &           0.005046 &              0.744525 &             0.004995 &           0.917734 &          0.004761 & 0.266655 &   [5, 7, 8, 13, 7] & [G:\textbackslash Experimental\_results\textbackslash Original\_FairBatch\textbackslash Moj... &       True \\
       FairSCL &               0.757314 &              0.003441 &            0.878219 &           0.004314 &              0.752825 &             0.001872 &           0.898325 &          0.002579 & 0.271527 &  [12, 5, 13, 1, 1] & [G:\textbackslash Experimental\_results\textbackslash FairSCL\textbackslash Moji\textbackslash FSCL\_0.31... &      False \\
          INLP &               0.733433 &                   NaN &            0.855982 &                NaN &              0.727625 &                  NaN &           0.859686 &               NaN & 0.302983 &              [103] & [G:\textbackslash Experimental\_results\textbackslash INLP\textbackslash Moji\textbackslash INLP\_True\_ba... &       True \\
          DAdv &               0.755464 &              0.004076 &            0.904023 &           0.011218 &              0.748550 &             0.002405 &           0.915601 &          0.005007 & 0.262697 &    [6, 1, 7, 3, 5] & [G:\textbackslash Experimental\_results\textbackslash hypertune\textbackslash Moji\textbackslash hypertu... &       True \\
\bottomrule
\end{tabular}

[21]:
%matplotlib inline

Create plots

[22]:
Moji_plot_df = analysis.final_results_df(
    results_dict = Moji_results,
    pareto = True, pareto_selection = "test",
    selection_criterion = None, return_dev = True,
    )
[23]:
analysis.tables_and_figures.make_zoom_plot(
    Moji_plot_df, dpi = 100,
    zoom_xlim=(0.6, 0.78),
    zoom_ylim=(0.8, 0.98),
    )
[Figure: zoomed-in plot produced by make_zoom_plot]
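The plotting DataFrame above is built with pareto = True. As background, the sketch below shows the generic notion of a Pareto frontier over (performance, fairness) pairs: a candidate is kept only if no other candidate is at least as good on both axes and strictly better on one. How fairlib itself applies this filter is not shown in this tutorial, so the function and the toy values are purely illustrative.

```python
def pareto_front(points):
    """points: dict name -> (performance, fairness); higher is better on both axes."""
    front = []
    for name, (perf, fair) in points.items():
        dominated = any(
            (p >= perf and f >= fair) and (p > perf or f > fair)
            for other, (p, f) in points.items()
            if other != name
        )
        if not dominated:
            front.append(name)
    return front


# Hypothetical candidates, e.g. different hyperparameter settings of one method.
points = {
    "A": (0.72, 0.61),
    "B": (0.75, 0.89),
    "C": (0.76, 0.88),
    "D": (0.74, 0.85),
}

front = pareto_front(points)
print(front)
```

Candidates B and C survive because each is better than the other on one axis, whereas A and D are both dominated by B.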

6. Customize pipeline for fairness

Check out the website https://hanxudong.github.io/fairlib/ for detailed docs.

  • Visualization

    • Interactive plots demonstrates creating interactive plots for comparing different methods, and demonstrating DTO and constrained selection.

    • Plot gallery presents a list of examples for presenting experimental results, e.g., hyperparameter tuning and trade-off plots with zoomed-in area.

  • Customized Dataset and Models

  • Customized Metrics

    • This document provides instructions for customizing evaluation metrics.

    • Single-metric evaluations can be found there.

    • Metric aggregations, such as the default root mean square aggregation, can be found there.

  • Customized Debiasing Methods

    • Please see the document for instructions about adding method-specific options and integrating methods with fairlib.