fairlib: A Unified Framework for Assessing and Improving Fairness

Xudong Han\(^{1}\), Aili Shen\(^{1,2, a}\), Yitong Li\(^{3}\), Lea Frermann\(^{1}\), Timothy Baldwin\(^{1,4}\), and Trevor Cohn\(^{1}\)

\(^{1}\) The University of Melbourne

\(^{2}\) Alexa AI, Amazon

\(^{3}\) Huawei Technologies Co., Ltd.

\(^{4}\) MBZUAI

042dd7ed9e1f49bd94e4eae9f390594d

\(^a\) Work carried out at The University of Melbourne
fairlib is licensed under the Apache License 2.0

GitHub, Docs, PyPI

In this video, we will demostrate how to: 1. Install fairlib 2. Access fairness benchmark datasets 3. Train a vanilla model without debiasing, and measure fairness 4. Improve fairness with most recent debiasing methods 5. Analyze the results, such as creating tables and figures

1. Installation

The most straightforward way to install fairlib is with pip:

[1]:

!pip install -q fairlib

Check out fairlib documents for other installation options.

After installation, let’s import fairlib.

[2]:

import fairlib

2. Build a dataset

fairlib provides simple APIs to access fairness benchmark datasets that are publicly available and under strict ethical guidelines.

In this video, we will be using the preprocessed Moji dataset, where each tweet is annotated with a binary sentiment label (happy vs sad) and a binary race label (AAE vs SAE). Original tweets are encoded with the pre-trained DeepMoji model as 2304d vectors.

Followings are random examples from the Moji dataset.

Text	Sentiment	Race
Dfl somebody said to me yesterday that how can u u have a iPhone or an S3 an ur phone off dfl	Positive	AAE
smh I bet maybe u just don’t care bout poor boo no more	Negative	AAE
I actually put jeans on today and I already wanna go put on leggings or yogas	Positive	SAE
I’m sitting next to the most awkward couple on the plane like they are making out and holding hands , I just can’t	Negative	SAE

See here for other avaliable datasets, including texts, images, and structured inputs.

[3]:

from fairlib import datasets

[4]:

datasets.prepare_dataset("moji", "data/deepmoji")

saving to /content/data/deepmoji/pos_pos.npy

saving to /content/data/deepmoji/pos_neg.npy

saving to /content/data/deepmoji/neg_pos.npy

saving to /content/data/deepmoji/neg_neg.npy

Datasets will be downloaded and saved to data/deepmoji, and then be used to create training, validation, and test splits as previous work.

[5]:

datasets.name2class.keys()

[5]:

dict_keys(['moji', 'bios', 'coloredmnist', 'compas', 'tp_pos', 'adult', 'coco', 'imsitu'])

3. Train a vanilla model without debiasing

Specify hyperparameters

By default, fairlib trains vanilla models without debiasing.

Here we specify the experiment id (exp_id = vanilla), which identify the directory for saving experimental results.

[6]:

args = {
    # The name of the dataset, corresponding dataloader will be used,
    "dataset":  "Moji",

    # Specifiy the path to the input data
    "data_dir": "data/deepmoji",

    # Device for computing, -1 is the cpu; non-negative numbers indicate GPU id.
    "device_id":    -1,

    # Give a name to the exp, which will be used in the path
    "exp_id":"vanilla",
}

# Init the argument
options = fairlib.BaseOptions()
state = options.get_state(args=args, silence=True)

INFO:root:Unexpected args: ['-f', '/root/.local/share/jupyter/runtime/kernel-8eb44448-ab67-4429-be7a-d5fc3fe3b2ce.json']
INFO:root:Logging to ./results/dev/Moji/vanilla/output.log

2022-07-21 06:53:09 [INFO ]  ======================================== 2022-07-21 06:53:09 ========================================
2022-07-21 06:53:09 [INFO ]  Base directory is ./results/dev/Moji/vanilla
Loaded data shapes: (99998, 2304), (99998,), (99998,)
Loaded data shapes: (8000, 2304), (8000,), (8000,)
Loaded data shapes: (7998, 2304), (7998,), (7998,)

state contains a list of hyperparameters for experiments. Besides, other components that are needed, such as dataloaders, will also automatically be initialized

Initialize a model

The default architecutre is a 3-layer MLP classifier with Tanh activation functions in between:

[7]:

print(state.hidden_size, state.n_hidden, state.activation_function)

300 2 Tanh

Users can easily specify model architecutres in the option. Please see the model architecture section for more details about the hyperparameters corresponding to the model architecture.

[8]:

fairlib.utils.seed_everything(2022)

# Init Model
model = fairlib.networks.get_main_model(state)

2022-07-21 06:53:10 [INFO ]  MLP(
2022-07-21 06:53:10 [INFO ]    (output_layer): Linear(in_features=300, out_features=2, bias=True)
2022-07-21 06:53:10 [INFO ]    (AF): Tanh()
2022-07-21 06:53:10 [INFO ]    (hidden_layers): ModuleList(
2022-07-21 06:53:10 [INFO ]      (0): Linear(in_features=2304, out_features=300, bias=True)
2022-07-21 06:53:10 [INFO ]      (1): Tanh()
2022-07-21 06:53:10 [INFO ]      (2): Linear(in_features=300, out_features=300, bias=True)
2022-07-21 06:53:10 [INFO ]      (3): Tanh()
2022-07-21 06:53:10 [INFO ]    )
2022-07-21 06:53:10 [INFO ]    (criterion): CrossEntropyLoss()
2022-07-21 06:53:10 [INFO ]  )
2022-07-21 06:53:10 [INFO ]  Total number of parameters: 782402

Train a model

A list of hyperparameters has been predefined in fairlib, so we can now directly train a model with the model class’s built-in train_self method.

Please see the link for all hyperparameters associated with model training.

[9]:

model.train_self()

2022-07-21 06:53:10 [INFO ]  Epoch:    0 [      0/  99998 ( 0%)]        Loss: 0.6907     Data Time: 0.06s       Train Time: 0.18s
2022-07-21 06:53:14 [INFO ]  Epoch:    0 [  51200/  99998 (51%)]        Loss: 0.4075     Data Time: 0.44s       Train Time: 3.34s
2022-07-21 06:53:18 [INFO ]  Evaluation at Epoch 0
2022-07-21 06:53:18 [INFO ]  Validation accuracy: 71.44 macro_fscore: 71.35     micro_fscore: 71.44     TPR_GAP: 39.59  FPR_GAP: 39.59  PPR_GAP: 38.67
2022-07-21 06:53:18 [INFO ]  Test accuracy: 71.21       macro_fscore: 71.12     micro_fscore: 71.21     TPR_GAP: 39.42  FPR_GAP: 39.42  PPR_GAP: 38.49
2022-07-21 06:53:18 [INFO ]  Epoch:    1 [      0/  99998 ( 0%)]        Loss: 0.4195     Data Time: 0.01s       Train Time: 0.07s
2022-07-21 06:53:22 [INFO ]  Epoch:    1 [  51200/  99998 (51%)]        Loss: 0.3780     Data Time: 0.42s       Train Time: 3.29s
2022-07-21 06:53:26 [INFO ]  Evaluation at Epoch 1
2022-07-21 06:53:26 [INFO ]  Validation accuracy: 72.10 macro_fscore: 72.06     micro_fscore: 72.10     TPR_GAP: 37.69  FPR_GAP: 37.69  PPR_GAP: 37.00
2022-07-21 06:53:26 [INFO ]  Test accuracy: 71.72       macro_fscore: 71.68     micro_fscore: 71.72     TPR_GAP: 37.11  FPR_GAP: 37.11  PPR_GAP: 36.41
2022-07-21 06:53:27 [INFO ]  Epoch:    2 [      0/  99998 ( 0%)]        Loss: 0.3608     Data Time: 0.01s       Train Time: 0.07s
2022-07-21 06:53:30 [INFO ]  Epoch:    2 [  51200/  99998 (51%)]        Loss: 0.3567     Data Time: 0.42s       Train Time: 3.29s
2022-07-21 06:53:34 [INFO ]  Epochs since last improvement: 1
2022-07-21 06:53:34 [INFO ]  Evaluation at Epoch 2
2022-07-21 06:53:34 [INFO ]  Validation accuracy: 71.53 macro_fscore: 71.44     micro_fscore: 71.53     TPR_GAP: 39.40  FPR_GAP: 39.40  PPR_GAP: 38.40
2022-07-21 06:53:34 [INFO ]  Test accuracy: 71.38       macro_fscore: 71.28     micro_fscore: 71.38     TPR_GAP: 39.07  FPR_GAP: 39.07  PPR_GAP: 37.99
2022-07-21 06:53:35 [INFO ]  Epoch:    3 [      0/  99998 ( 0%)]        Loss: 0.3584     Data Time: 0.01s       Train Time: 0.07s
2022-07-21 06:53:38 [INFO ]  Epoch:    3 [  51200/  99998 (51%)]        Loss: 0.3430     Data Time: 0.42s       Train Time: 3.28s
2022-07-21 06:53:42 [INFO ]  Epochs since last improvement: 2
2022-07-21 06:53:42 [INFO ]  Evaluation at Epoch 3
2022-07-21 06:53:43 [INFO ]  Validation accuracy: 71.08 macro_fscore: 70.73     micro_fscore: 71.08     TPR_GAP: 41.08  FPR_GAP: 41.08  PPR_GAP: 38.75
2022-07-21 06:53:43 [INFO ]  Test accuracy: 70.87       macro_fscore: 70.49     micro_fscore: 70.87     TPR_GAP: 40.67  FPR_GAP: 40.67  PPR_GAP: 38.42
2022-07-21 06:53:43 [INFO ]  Epoch:    4 [      0/  99998 ( 0%)]        Loss: 0.3578     Data Time: 0.02s       Train Time: 0.07s
2022-07-21 06:53:46 [INFO ]  Epoch:    4 [  51200/  99998 (51%)]        Loss: 0.3881     Data Time: 0.44s       Train Time: 3.36s
2022-07-21 06:53:50 [INFO ]  Epochs since last improvement: 3
2022-07-21 06:53:50 [INFO ]  Evaluation at Epoch 4
2022-07-21 06:53:51 [INFO ]  Validation accuracy: 72.06 macro_fscore: 71.96     micro_fscore: 72.06     TPR_GAP: 38.04  FPR_GAP: 38.04  PPR_GAP: 36.87
2022-07-21 06:53:51 [INFO ]  Test accuracy: 71.72       macro_fscore: 71.61     micro_fscore: 71.72     TPR_GAP: 38.09  FPR_GAP: 38.09  PPR_GAP: 36.96
2022-07-21 06:53:51 [INFO ]  Epoch:    5 [      0/  99998 ( 0%)]        Loss: 0.4008     Data Time: 0.02s       Train Time: 0.07s
2022-07-21 06:53:54 [INFO ]  Epoch:    5 [  51200/  99998 (51%)]        Loss: 0.3393     Data Time: 0.43s       Train Time: 3.29s
2022-07-21 06:53:58 [INFO ]  Epochs since last improvement: 4
2022-07-21 06:53:58 [INFO ]  Evaluation at Epoch 5
2022-07-21 06:53:59 [INFO ]  Validation accuracy: 71.44 macro_fscore: 71.41     micro_fscore: 71.44     TPR_GAP: 40.50  FPR_GAP: 40.50  PPR_GAP: 39.97
2022-07-21 06:53:59 [INFO ]  Test accuracy: 71.13       macro_fscore: 71.11     micro_fscore: 71.13     TPR_GAP: 39.69  FPR_GAP: 39.69  PPR_GAP: 39.19
2022-07-21 06:53:59 [INFO ]  Epoch:    6 [      0/  99998 ( 0%)]        Loss: 0.3890     Data Time: 0.01s       Train Time: 0.07s
2022-07-21 06:54:03 [INFO ]  Epoch:    6 [  51200/  99998 (51%)]        Loss: 0.3668     Data Time: 0.44s       Train Time: 3.32s
2022-07-21 06:54:07 [INFO ]  Epochs since last improvement: 5
2022-07-21 06:54:07 [INFO ]  Evaluation at Epoch 6
2022-07-21 06:54:07 [INFO ]  Validation accuracy: 71.96 macro_fscore: 71.92     micro_fscore: 71.96     TPR_GAP: 37.52  FPR_GAP: 37.52  PPR_GAP: 36.82
2022-07-21 06:54:07 [INFO ]  Test accuracy: 72.04       macro_fscore: 72.00     micro_fscore: 72.04     TPR_GAP: 36.75  FPR_GAP: 36.75  PPR_GAP: 36.01

After each iteration, a subset of evaluation results over the validation set and test set will be logged, including metrics for both performance and fairness.

It can be seen that the vanilla model achieves around 40% TPR GAP.

4. Improve Fairness

To mitigate bias, we show an example of employing BTEO (Han et al., 2021a) and adversarial training (Li et al., 2018) simultaneously.

Enable debiasing

The only difference for debebiasing is to specify corresponding arguments, as shown in the following cell. Everything else are identical to the standard training.

A list of supported bias mitigation methods is shown here.
The usage introduces further options associated with each debiasing method.

[10]:

debiasing_args = args.copy()

# Update the experiment name
debiasing_args["exp_id"] = "BT_Adv"

# Perform adversarial training if True
debiasing_args["adv_debiasing"] = True

# Specify the hyperparameters for Balanced Training
debiasing_args["BT"] = "Downsampling"
debiasing_args["BTObj"] = "EO"

debias_options = fairlib.BaseOptions()
debias_state = debias_options.get_state(args=debiasing_args, silence=True)

fairlib.utils.seed_everything(2022)

debias_model = fairlib.networks.get_main_model(debias_state)

2022-07-21 06:54:07 [INFO ]  Unexpected args: ['-f', '/root/.local/share/jupyter/runtime/kernel-8eb44448-ab67-4429-be7a-d5fc3fe3b2ce.json']
2022-07-21 06:54:07 [INFO ]  Logging to ./results/dev/Moji/BT_Adv/output.log
2022-07-21 06:54:07 [INFO ]  ======================================== 2022-07-21 06:54:07 ========================================
2022-07-21 06:54:07 [INFO ]  Base directory is ./results/dev/Moji/BT_Adv
Loaded data shapes: (39996, 2304), (39996,), (39996,)
Loaded data shapes: (8000, 2304), (8000,), (8000,)
Loaded data shapes: (7998, 2304), (7998,), (7998,)
2022-07-21 06:54:16 [INFO ]  SubDiscriminator(
2022-07-21 06:54:16 [INFO ]    (grad_rev): GradientReversal()
2022-07-21 06:54:16 [INFO ]    (output_layer): Linear(in_features=300, out_features=2, bias=True)
2022-07-21 06:54:16 [INFO ]    (AF): ReLU()
2022-07-21 06:54:16 [INFO ]    (hidden_layers): ModuleList(
2022-07-21 06:54:16 [INFO ]      (0): Linear(in_features=300, out_features=300, bias=True)
2022-07-21 06:54:16 [INFO ]      (1): ReLU()
2022-07-21 06:54:16 [INFO ]      (2): Linear(in_features=300, out_features=300, bias=True)
2022-07-21 06:54:16 [INFO ]      (3): ReLU()
2022-07-21 06:54:16 [INFO ]    )
2022-07-21 06:54:16 [INFO ]    (criterion): CrossEntropyLoss()
2022-07-21 06:54:16 [INFO ]  )
2022-07-21 06:54:16 [INFO ]  Total number of parameters: 181202

2022-07-21 06:54:16 [INFO ]  Discriminator built!
2022-07-21 06:54:16 [INFO ]  MLP(
2022-07-21 06:54:16 [INFO ]    (output_layer): Linear(in_features=300, out_features=2, bias=True)
2022-07-21 06:54:16 [INFO ]    (AF): Tanh()
2022-07-21 06:54:16 [INFO ]    (hidden_layers): ModuleList(
2022-07-21 06:54:16 [INFO ]      (0): Linear(in_features=2304, out_features=300, bias=True)
2022-07-21 06:54:16 [INFO ]      (1): Tanh()
2022-07-21 06:54:16 [INFO ]      (2): Linear(in_features=300, out_features=300, bias=True)
2022-07-21 06:54:16 [INFO ]      (3): Tanh()
2022-07-21 06:54:16 [INFO ]    )
2022-07-21 06:54:16 [INFO ]    (criterion): CrossEntropyLoss()
2022-07-21 06:54:16 [INFO ]  )
2022-07-21 06:54:16 [INFO ]  Total number of parameters: 782402

It can be seen from the last cell that the training dataset size is smaller than before (40k verse 100k) due to the downsampling for balanced training, and an MLP adversary is initialized for adversarial debiasing.

Mitigate bias

[11]:

debias_model.train_self()

2022-07-21 06:54:17 [INFO ]  Epoch:    0 [      0/  39996 ( 0%)]        Loss: 0.0003     Data Time: 0.02s       Train Time: 0.22s
2022-07-21 06:54:25 [INFO ]  Evaluation at Epoch 0
2022-07-21 06:54:26 [INFO ]  Validation accuracy: 73.72 macro_fscore: 73.08     micro_fscore: 73.72     TPR_GAP: 19.20  FPR_GAP: 19.20  PPR_GAP: 16.80
2022-07-21 06:54:26 [INFO ]  Test accuracy: 73.59       macro_fscore: 72.98     micro_fscore: 73.59     TPR_GAP: 20.70  FPR_GAP: 20.70  PPR_GAP: 17.86
2022-07-21 06:54:26 [INFO ]  Epoch:    1 [      0/  39996 ( 0%)]        Loss: -0.1368    Data Time: 0.01s       Train Time: 0.20s
2022-07-21 06:54:34 [INFO ]  Epochs since last improvement: 1
2022-07-21 06:54:34 [INFO ]  Evaluation at Epoch 1
2022-07-21 06:54:35 [INFO ]  Validation accuracy: 68.96 macro_fscore: 67.36     micro_fscore: 68.96     TPR_GAP: 10.57  FPR_GAP: 10.57  PPR_GAP: 4.67
2022-07-21 06:54:35 [INFO ]  Test accuracy: 68.89       macro_fscore: 67.33     micro_fscore: 68.89     TPR_GAP: 11.46  FPR_GAP: 11.46  PPR_GAP: 6.46
2022-07-21 06:54:35 [INFO ]  Epoch:    2 [      0/  39996 ( 0%)]        Loss: -0.1266    Data Time: 0.01s       Train Time: 0.20s
2022-07-21 06:54:43 [INFO ]  Evaluation at Epoch 2
2022-07-21 06:54:44 [INFO ]  Validation accuracy: 74.94 macro_fscore: 74.68     micro_fscore: 74.94     TPR_GAP: 14.13  FPR_GAP: 14.13  PPR_GAP: 11.37
2022-07-21 06:54:44 [INFO ]  Test accuracy: 75.54       macro_fscore: 75.33     micro_fscore: 75.54     TPR_GAP: 14.66  FPR_GAP: 14.66  PPR_GAP: 11.41
2022-07-21 06:54:44 [INFO ]  Epoch:    3 [      0/  39996 ( 0%)]        Loss: -0.1809    Data Time: 0.01s       Train Time: 0.20s
2022-07-21 06:54:52 [INFO ]  Evaluation at Epoch 3
2022-07-21 06:54:53 [INFO ]  Validation accuracy: 74.99 macro_fscore: 74.98     micro_fscore: 74.99     TPR_GAP: 12.09  FPR_GAP: 12.09  PPR_GAP: 8.67
2022-07-21 06:54:53 [INFO ]  Test accuracy: 75.47       macro_fscore: 75.47     micro_fscore: 75.47     TPR_GAP: 11.43  FPR_GAP: 11.43  PPR_GAP: 7.95
2022-07-21 06:54:53 [INFO ]  Epoch:    4 [      0/  39996 ( 0%)]        Loss: -0.1584    Data Time: 0.01s       Train Time: 0.20s
2022-07-21 06:55:01 [INFO ]  Evaluation at Epoch 4
2022-07-21 06:55:02 [INFO ]  Validation accuracy: 75.36 macro_fscore: 75.34     micro_fscore: 75.36     TPR_GAP: 12.23  FPR_GAP: 12.23  PPR_GAP: 9.27
2022-07-21 06:55:02 [INFO ]  Test accuracy: 75.73       macro_fscore: 75.70     micro_fscore: 75.73     TPR_GAP: 13.03  FPR_GAP: 13.03  PPR_GAP: 9.38
2022-07-21 06:55:02 [INFO ]  Epoch:    5 [      0/  39996 ( 0%)]        Loss: -0.1741    Data Time: 0.02s       Train Time: 0.20s
2022-07-21 06:55:10 [INFO ]  Epochs since last improvement: 1
2022-07-21 06:55:10 [INFO ]  Evaluation at Epoch 5
2022-07-21 06:55:11 [INFO ]  Validation accuracy: 75.22 macro_fscore: 75.15     micro_fscore: 75.22     TPR_GAP: 10.86  FPR_GAP: 10.86  PPR_GAP: 6.30
2022-07-21 06:55:11 [INFO ]  Test accuracy: 75.44       macro_fscore: 75.36     micro_fscore: 75.44     TPR_GAP: 10.99  FPR_GAP: 10.99  PPR_GAP: 6.50
2022-07-21 06:55:11 [INFO ]  Epoch:    6 [      0/  39996 ( 0%)]        Loss: -0.2291    Data Time: 0.01s       Train Time: 0.20s
2022-07-21 06:55:19 [INFO ]  Epochs since last improvement: 2
2022-07-21 06:55:19 [INFO ]  Evaluation at Epoch 6
2022-07-21 06:55:20 [INFO ]  Validation accuracy: 74.94 macro_fscore: 74.93     micro_fscore: 74.94     TPR_GAP: 9.14   FPR_GAP: 9.14   PPR_GAP: 1.42
2022-07-21 06:55:20 [INFO ]  Test accuracy: 75.17       macro_fscore: 75.17     micro_fscore: 75.17     TPR_GAP: 9.69   FPR_GAP: 9.69   PPR_GAP: 2.85
2022-07-21 06:55:20 [INFO ]  Epoch:    7 [      0/  39996 ( 0%)]        Loss: -0.1665    Data Time: 0.01s       Train Time: 0.21s
2022-07-21 06:55:28 [INFO ]  Epochs since last improvement: 3
2022-07-21 06:55:28 [INFO ]  Evaluation at Epoch 7
2022-07-21 06:55:29 [INFO ]  Validation accuracy: 74.98 macro_fscore: 74.97     micro_fscore: 74.98     TPR_GAP: 9.40   FPR_GAP: 9.40   PPR_GAP: 5.25
2022-07-21 06:55:29 [INFO ]  Test accuracy: 75.38       macro_fscore: 75.38     micro_fscore: 75.38     TPR_GAP: 10.85  FPR_GAP: 10.85  PPR_GAP: 6.88
2022-07-21 06:55:29 [INFO ]  Epoch:    8 [      0/  39996 ( 0%)]        Loss: -0.2003    Data Time: 0.01s       Train Time: 0.21s
2022-07-21 06:55:37 [INFO ]  Epochs since last improvement: 4
2022-07-21 06:55:37 [INFO ]  Evaluation at Epoch 8
2022-07-21 06:55:38 [INFO ]  Validation accuracy: 74.04 macro_fscore: 73.54     micro_fscore: 74.04     TPR_GAP: 10.18  FPR_GAP: 10.18  PPR_GAP: 6.12
2022-07-21 06:55:38 [INFO ]  Test accuracy: 74.22       macro_fscore: 73.75     micro_fscore: 74.22     TPR_GAP: 10.17  FPR_GAP: 10.17  PPR_GAP: 6.96
2022-07-21 06:55:38 [INFO ]  Epoch:    9 [      0/  39996 ( 0%)]        Loss: -0.2118    Data Time: 0.01s       Train Time: 0.21s
2022-07-21 06:55:46 [INFO ]  Epochs since last improvement: 5
2022-07-21 06:55:46 [INFO ]  Evaluation at Epoch 9
2022-07-21 06:55:47 [INFO ]  Validation accuracy: 73.14 macro_fscore: 72.87     micro_fscore: 73.14     TPR_GAP: 9.11   FPR_GAP: 9.11   PPR_GAP: 5.97
2022-07-21 06:55:47 [INFO ]  Test accuracy: 73.36       macro_fscore: 73.14     micro_fscore: 73.36     TPR_GAP: 8.46   FPR_GAP: 8.46   PPR_GAP: 6.07

5. Analyze the results

[12]:

from fairlib import analysis

Here we define a list of hyperparameters that will be repeatedly used for analysis.

[13]:

Shared_options = {
    # Random seed
    "seed": 2022,

    # The name of the dataset, corresponding dataloader will be used,
    "dataset":  "Moji",

    # Specifiy the path to the input data
    "data_dir": "data/deepmoji",

    # Device for computing, -1 is the cpu; non-negative numbers indicate GPU id.
    "device_id":    -1,

    # The default path for saving experimental results
    "results_dir":  "results",

    # Will be used for saving experimental results
    "project_dir":  "dev",

    # We will focusing on TPR GAP, implying the Equalized Odds for binary classification.
    "GAP_metric_name":  "TPR_GAP",

    # The overall performance will be measured as accuracy
    "Performance_metric_name":  "accuracy",

    # Model selections are based on distance to optimum, see section 4 in our paper for more details
    "selection_criterion":  "DTO",

    # Default dirs for saving checkpoints
    "checkpoint_dir":   "models",
    "checkpoint_name":  "checkpoint_epoch",

    # Loading experimental results
    "n_jobs":   1,

}

Epoch Selection

Here we demostrate the usage of DTO for epoch selection (i.e., post-hoc early stopping). model_selection retrieves experimental results for a single method, selects the desired epoch for each run, and saves the resulting df for a later process.

[14]:

analysis.model_selection(
    # exp_id started with model_id will be treated as the same method, e.g, vanilla, and adv
    model_id= ("vanilla"),

    # the tuned hyperparameters of a methods, which will be used to group multiple runs together.
    index_column_names = ["BT", "BTObj", "adv_debiasing"],

    # to convenient the further analysis, we will store the resulting DataFrame to the specified path
    save_path = r"results/Vanilla_df.pkl",

    # Follwoing options are predefined
    results_dir= Shared_options["results_dir"],
    project_dir= Shared_options["project_dir"]+"/"+Shared_options["dataset"],
    GAP_metric_name = Shared_options["GAP_metric_name"],
    Performance_metric_name = Shared_options["Performance_metric_name"],
    # We use DTO for epoch selection
    selection_criterion = Shared_options["selection_criterion"],
    checkpoint_dir= Shared_options["checkpoint_dir"],
    checkpoint_name= Shared_options["checkpoint_name"],
    # If retrive results in parallel
    n_jobs=Shared_options["n_jobs"],
)

[Parallel(n_jobs=1)]: Using backend SequentialBackend with 1 concurrent workers.
[Parallel(n_jobs=1)]: Done   1 out of   1 | elapsed:    0.3s remaining:    0.0s
[Parallel(n_jobs=1)]: Done   1 out of   1 | elapsed:    0.3s finished

[14]:

			epoch	dev_DTO	test_DTO	dev_performance	dev_fairness	test_performance	test_fairness	opt_dir
BT	BTObj	adv_debiasing
NaN	NaN	False	1	0.001907	0.0	0.719625	0.624825	0.72043	0.632535	results/dev/Moji/vanilla/opt.yaml

[15]:

analysis.model_selection(
    model_id= ("BT_Adv"),
    index_column_names = ["BT", "BTObj", "adv_debiasing"],
    save_path = r"results/BT_ADV_df.pkl",
    # Follwoing options are predefined
    results_dir= Shared_options["results_dir"],
    project_dir= Shared_options["project_dir"]+"/"+Shared_options["dataset"],
    GAP_metric_name = Shared_options["GAP_metric_name"],
    Performance_metric_name = Shared_options["Performance_metric_name"],
    selection_criterion = Shared_options["selection_criterion"],
    checkpoint_dir= Shared_options["checkpoint_dir"],
    checkpoint_name= Shared_options["checkpoint_name"],
    n_jobs=Shared_options["n_jobs"],
)

[Parallel(n_jobs=1)]: Using backend SequentialBackend with 1 concurrent workers.
[Parallel(n_jobs=1)]: Done   1 out of   1 | elapsed:    0.5s remaining:    0.0s
[Parallel(n_jobs=1)]: Done   1 out of   1 | elapsed:    0.5s finished

[15]:

			epoch	dev_DTO	test_DTO	dev_performance	dev_fairness	test_performance	test_fairness	opt_dir
BT	BTObj	adv_debiasing
Downsampling	EO	True	1	0.005648	0.015392	0.749375	0.908632	0.751688	0.903068	results/dev/Moji/BT_Adv/opt.yaml

We have preprocessed the results with the model_selection function, and the resulting dfs can be downloaded as follows:

[16]:

!wget --no-check-certificate 'https://docs.google.com/uc?export=download&id=1M0G6PyPuDC8Y_2nL9XKYCt10IUzbSvfl' -O retrived_results.tar.gz

--2022-07-21 06:55:48--  https://docs.google.com/uc?export=download&id=1M0G6PyPuDC8Y_2nL9XKYCt10IUzbSvfl
Resolving docs.google.com (docs.google.com)... 108.177.97.138, 108.177.97.102, 108.177.97.100, ...
Connecting to docs.google.com (docs.google.com)|108.177.97.138|:443... connected.
HTTP request sent, awaiting response... 303 See Other
Location: https://doc-0g-0k-docs.googleusercontent.com/docs/securesc/ha0ro937gcuc7l7deffksulhg5h7mbp1/d9ah1be1dhjtrm0rinvfdje85fkvmrdh/1658386500000/17527887236587461918/*/1M0G6PyPuDC8Y_2nL9XKYCt10IUzbSvfl?e=download&uuid=2bc658ea-acf2-4d5e-8ae5-920657110366 [following]
Warning: wildcards not supported in HTTP.
--2022-07-21 06:55:51--  https://doc-0g-0k-docs.googleusercontent.com/docs/securesc/ha0ro937gcuc7l7deffksulhg5h7mbp1/d9ah1be1dhjtrm0rinvfdje85fkvmrdh/1658386500000/17527887236587461918/*/1M0G6PyPuDC8Y_2nL9XKYCt10IUzbSvfl?e=download&uuid=2bc658ea-acf2-4d5e-8ae5-920657110366
Resolving doc-0g-0k-docs.googleusercontent.com (doc-0g-0k-docs.googleusercontent.com)... 74.125.204.132, 2404:6800:4008:c04::84
Connecting to doc-0g-0k-docs.googleusercontent.com (doc-0g-0k-docs.googleusercontent.com)|74.125.204.132|:443... connected.
HTTP request sent, awaiting response... 200 OK
Length: 790461 (772K) [application/x-gzip]
Saving to: ‘retrived_results.tar.gz’

retrived_results.ta 100%[===================>] 771.93K  --.-KB/s    in 0.006s

2022-07-21 06:55:51 (129 MB/s) - ‘retrived_results.tar.gz’ saved [790461/790461]

[17]:

!tar -xf retrived_results.tar.gz

Model Selection

Here we demonstrate the application of final_results_df, which loads cached results with retrive_results for all methods, select the best hyperparameter combinations for each technique, and present the result in a DataFrame

[18]:

Moji_results = analysis.retrive_results("Moji", log_dir="analysis/results")

[19]:

Moji_main_results = analysis.final_results_df(
    results_dict = Moji_results,
    pareto = False,
    pareto_selection = "test",
    selection_criterion = "DTO",
    return_dev = True,
    return_conf=True,
    )
Moji_main_results

[19]:

	Models	test_performance mean	test_performance std	test_fairness mean	test_fairness std	dev_performance mean	dev_performance std	dev_fairness mean	dev_fairness std	DTO	epoch list	opt_dir list	is_pareto
0	GDEO	0.752763	0.004999	0.892255	0.007860	0.749350	0.003494	0.912672	0.002766	0.269694	[2, 12, 5, 9, 10]	[G:\Experimental_results\GroupDifference\Moji\...	True
1	BTFairBatch	0.746837	0.003407	0.899351	0.004936	0.743975	0.004236	0.919254	0.004731	0.272437	[8, 6, 9, 5, 7]	[G:\Experimental_results\FairBatch\Moji\BTInit...	False
2	Vanilla	0.722981	0.004576	0.611870	0.014356	0.726650	0.003673	0.632302	0.013370	0.476849	[2, 11, 2, 5, 2]	[G:\Experimental_results\vanilla\Moji\0\opt.ya...	True
3	BTEO	0.753927	0.001433	0.877469	0.003756	0.746325	0.000998	0.896874	0.005401	0.274892	[8, 6, 9, 4, 5]	[G:\Experimental_results\GatedBT\Moji\GatedBT_...	True
4	GatedDAdv	0.750163	0.006945	0.908679	0.021678	0.745600	0.004828	0.928670	0.022488	0.266004	[24, 13, 19, 4, 3]	[G:\Experimental_results\hypertune3\Moji\hyper...	False
5	FairBatch	0.751488	0.005772	0.904373	0.008213	0.746050	0.003896	0.914526	0.006020	0.266276	[9, 9, 6, 6, 5]	[G:\Experimental_results\FairBatch\Moji\FairBa...	True
6	GatedAdv	0.753113	0.005196	0.890065	0.013302	0.748975	0.003805	0.910838	0.010314	0.270257	[11, 4, 4, 13, 12]	[G:\Experimental_results\hypertune2\Moji\hyper...	False
7	DelayedCLS_Adv	0.761015	0.003081	0.882425	0.015918	0.751675	0.003481	0.899346	0.011417	0.266341	[13, 1, 3, 13, 1]	[/data/cephfs/punim1421/Fair_NLP_Classificatio...	True
8	GDMean	0.752163	0.002130	0.901389	0.003916	0.749050	0.001368	0.922430	0.005829	0.266735	[11, 12, 7, 2, 2]	[G:\Experimental_results\GroupDifference\Moji\...	True
9	GatedBTEO	0.762106	0.002592	0.900764	0.014701	0.759775	0.003798	0.909445	0.006631	0.257762	[3, 1, 3, 1, 11]	[G:\Experimental_results\GatedBT\Moji\GatedBT_...	True
10	BTGatedAdv	0.735459	0.028830	0.866150	0.028232	0.730150	0.024594	0.886862	0.030537	0.296476	[5, 19, 19, 6, 1]	[G:\Experimental_results\hypertune2\Moji\hyper...	True
11	Adv	0.756414	0.007271	0.893286	0.005623	0.747425	0.004549	0.912125	0.008507	0.265936	[9, 14, 18, 5, 16]	[G:\Experimental_results\hypertune\Moji\hypert...	True
12	OldFairBatch	0.750638	0.006012	0.905537	0.005046	0.744525	0.004995	0.917734	0.004761	0.266655	[5, 7, 8, 13, 7]	[G:\Experimental_results\Original_FairBatch\Mo...	True
13	FairSCL	0.757314	0.003441	0.878219	0.004314	0.752825	0.001872	0.898325	0.002579	0.271527	[12, 5, 13, 1, 1]	[G:\Experimental_results\FairSCL\Moji\FSCL_0.3...	False
14	INLP	0.733433	NaN	0.855982	NaN	0.727625	NaN	0.859686	NaN	0.302983	[103]	[G:\Experimental_results\INLP\Moji\INLP_True_b...	True
15	DAdv	0.755464	0.004076	0.904023	0.011218	0.748550	0.002405	0.915601	0.005007	0.262697	[6, 1, 7, 3, 5]	[G:\Experimental_results\hypertune\Moji\hypert...	True

Create \(\LaTeX{}\) tabels

[20]:

print(Moji_main_results.to_latex(index=False))

\begin{tabular}{lrrrrrrrrrlll}
\toprule
        Models &  test\_performance mean &  test\_performance std &  test\_fairness mean &  test\_fairness std &  dev\_performance mean &  dev\_performance std &  dev\_fairness mean &  dev\_fairness std &      DTO &         epoch list &                                       opt\_dir list &  is\_pareto \\
\midrule
          GDEO &               0.752763 &              0.004999 &            0.892255 &           0.007860 &              0.749350 &             0.003494 &           0.912672 &          0.002766 & 0.269694 &  [2, 12, 5, 9, 10] & [G:\textbackslash Experimental\_results\textbackslash GroupDifference\textbackslash Moji\textbackslash G... &       True \\
   BTFairBatch &               0.746837 &              0.003407 &            0.899351 &           0.004936 &              0.743975 &             0.004236 &           0.919254 &          0.004731 & 0.272437 &    [8, 6, 9, 5, 7] & [G:\textbackslash Experimental\_results\textbackslash FairBatch\textbackslash Moji\textbackslash BTInitF... &      False \\
       Vanilla &               0.722981 &              0.004576 &            0.611870 &           0.014356 &              0.726650 &             0.003673 &           0.632302 &          0.013370 & 0.476849 &   [2, 11, 2, 5, 2] & [G:\textbackslash Experimental\_results\textbackslash vanilla\textbackslash Moji\textbackslash 0\textbackslash opt.yam... &       True \\
          BTEO &               0.753927 &              0.001433 &            0.877469 &           0.003756 &              0.746325 &             0.000998 &           0.896874 &          0.005401 & 0.274892 &    [8, 6, 9, 4, 5] & [G:\textbackslash Experimental\_results\textbackslash GatedBT\textbackslash Moji\textbackslash GatedBT\_R... &       True \\
     GatedDAdv &               0.750163 &              0.006945 &            0.908679 &           0.021678 &              0.745600 &             0.004828 &           0.928670 &          0.022488 & 0.266004 & [24, 13, 19, 4, 3] & [G:\textbackslash Experimental\_results\textbackslash hypertune3\textbackslash Moji\textbackslash hypert... &      False \\
     FairBatch &               0.751488 &              0.005772 &            0.904373 &           0.008213 &              0.746050 &             0.003896 &           0.914526 &          0.006020 & 0.266276 &    [9, 9, 6, 6, 5] & [G:\textbackslash Experimental\_results\textbackslash FairBatch\textbackslash Moji\textbackslash FairBat... &       True \\
      GatedAdv &               0.753113 &              0.005196 &            0.890065 &           0.013302 &              0.748975 &             0.003805 &           0.910838 &          0.010314 & 0.270257 & [11, 4, 4, 13, 12] & [G:\textbackslash Experimental\_results\textbackslash hypertune2\textbackslash Moji\textbackslash hypert... &      False \\
DelayedCLS\_Adv &               0.761015 &              0.003081 &            0.882425 &           0.015918 &              0.751675 &             0.003481 &           0.899346 &          0.011417 & 0.266341 &  [13, 1, 3, 13, 1] & [/data/cephfs/punim1421/Fair\_NLP\_Classification... &       True \\
        GDMean &               0.752163 &              0.002130 &            0.901389 &           0.003916 &              0.749050 &             0.001368 &           0.922430 &          0.005829 & 0.266735 &  [11, 12, 7, 2, 2] & [G:\textbackslash Experimental\_results\textbackslash GroupDifference\textbackslash Moji\textbackslash G... &       True \\
     GatedBTEO &               0.762106 &              0.002592 &            0.900764 &           0.014701 &              0.759775 &             0.003798 &           0.909445 &          0.006631 & 0.257762 &   [3, 1, 3, 1, 11] & [G:\textbackslash Experimental\_results\textbackslash GatedBT\textbackslash Moji\textbackslash GatedBT\_R... &       True \\
    BTGatedAdv &               0.735459 &              0.028830 &            0.866150 &           0.028232 &              0.730150 &             0.024594 &           0.886862 &          0.030537 & 0.296476 &  [5, 19, 19, 6, 1] & [G:\textbackslash Experimental\_results\textbackslash hypertune2\textbackslash Moji\textbackslash hypert... &       True \\
           Adv &               0.756414 &              0.007271 &            0.893286 &           0.005623 &              0.747425 &             0.004549 &           0.912125 &          0.008507 & 0.265936 & [9, 14, 18, 5, 16] & [G:\textbackslash Experimental\_results\textbackslash hypertune\textbackslash Moji\textbackslash hypertu... &       True \\
  OldFairBatch &               0.750638 &              0.006012 &            0.905537 &           0.005046 &              0.744525 &             0.004995 &           0.917734 &          0.004761 & 0.266655 &   [5, 7, 8, 13, 7] & [G:\textbackslash Experimental\_results\textbackslash Original\_FairBatch\textbackslash Moj... &       True \\
       FairSCL &               0.757314 &              0.003441 &            0.878219 &           0.004314 &              0.752825 &             0.001872 &           0.898325 &          0.002579 & 0.271527 &  [12, 5, 13, 1, 1] & [G:\textbackslash Experimental\_results\textbackslash FairSCL\textbackslash Moji\textbackslash FSCL\_0.31... &      False \\
          INLP &               0.733433 &                   NaN &            0.855982 &                NaN &              0.727625 &                  NaN &           0.859686 &               NaN & 0.302983 &              [103] & [G:\textbackslash Experimental\_results\textbackslash INLP\textbackslash Moji\textbackslash INLP\_True\_ba... &       True \\
          DAdv &               0.755464 &              0.004076 &            0.904023 &           0.011218 &              0.748550 &             0.002405 &           0.915601 &          0.005007 & 0.262697 &    [6, 1, 7, 3, 5] & [G:\textbackslash Experimental\_results\textbackslash hypertune\textbackslash Moji\textbackslash hypertu... &       True \\
\bottomrule
\end{tabular}

[21]:

%matplotlib inline

Create plots

[22]:

Moji_plot_df = analysis.final_results_df(
    results_dict = Moji_results,
    pareto = True, pareto_selection = "test",
    selection_criterion = None, return_dev = True,
    )

[23]:

analysis.tables_and_figures.make_zoom_plot(
    Moji_plot_df, dpi = 100,
    zoom_xlim=(0.6, 0.78),
    zoom_ylim=(0.8, 0.98),
    )

../_images/tutorial_fairlib_demo_47_0.png

6. Cutomize pipeline for fairness

Check out the website https://hanxudong.github.io/fairlib/ for detailed docs.

Visualization
- Interactive plots demonstrates creating interactive plots for comparing different methods, and demonstrating DTO and constrained selection.
- Plot gallery presents a list of examples for presenting experimental results, e.g., hyperparameter tuning and trade-off plots with zoomed-in area.
Customized Dataset and Models
- The dataset document provides instructions for customizing datasets and dataloaders.
- The model document provides instructions for customizing model architectures and register models.
- Besides text inputs, we also provide examples for customizing datasets and models with structured inputs and images.
Customized Metrics
- This document provides instructions for customizing evaluation metrics.
- Single metric evaluations can be seen from there.
- Metric aggregations such as the default root mean square aggregation can be found from there.
Customized Debiasing Methods
- Please see the document for instructions about adding method-specific options and integrating methods with fairlib.