Benchmark ========= To the best of our knowledge, there are only three other libraries besides WEFE that implement bias measurement and mitigation methods for word embeddings: Fair Embedding Engine (FEE), Responsibly, and EmbeddingBiasScores. According to its authors, Fair Embedding Engine is defined as “A library for analyzing and mitigating gender bias in word embeddings”, Responsibly is defined as “A toolkit for auditing and mitigating bias and fairness of machine learning systems”. Finally, EmbeddingBiasScores describes itself as a collection of implementations and wrappers of bias scores for text embeddings. The documentation for these three libraries can be found at the following links: - https://github.com/FEE-Fair-Embedding-Engine/FEE - https://docs.responsibly.ai/ - https://github.com/HammerLabML/EmbeddingBiasScores The benchmark presented here compares these three libraries against WEFE according to the following criteria: 1. Ease of installation. 2. Quality of the package and documentation. 3. Ease of loading models. 4. Ease of running bias measurements. 5. Ease of running bias mitigation algorithms. 6. Implemented metrics and mitigation methods. 1. Ease of installation ----------------------- This comparison aims to evaluate how easy it is to install the library. WEFE ~~~~ According to the documentation, WEFE is available for installation using the Python Package Index (via pip) as well as via conda. .. code:: bash pip install --upgrade wefe # or conda install -c pbadilla wefe Fair Embedding Engine ~~~~~~~~~~~~~~~~~~~~~ In the case of FEE, neither the documentation nor the repository indicates how to install the package. Therefore, the easiest thing to do in this case is to clone the repository and then install the requirements , as described in the following steps: 1. Clone the repo .. code:: bash $ git clone https://github.com/FEE-Fair-Embedding-Engine/FEE 2. Install the requirements. .. code:: bash $ pip install -r FEE/requirements.txt $ pip install sympy $ pip install -U gensim==3.8.3 Responsibly ~~~~~~~~~~~ According to its documentation, responsibly is hosted in the Python Package Index so it can be installed using pip. .. code:: bash $ pip install responsibly EmbeddingBiasScores ~~~~~~~~~~~~~~~~~~~ In the case of EmbeddingBiasScores, the documentation indicates that the repository can be cloned and then installed locally. .. code:: bash $ git clone https://github.com/HammerLabML/EmbeddingBiasScores.git $ pip install -r EmbeddingBiasScores/requirements.txt Conclusion ~~~~~~~~~~ Both WEFE and Responsibly are hosted in the Python package index, which simplifies their installation and dependency handling, lowering the barrier to entry. FEE and EmbeddingBiasScores, on the other hand, require ad hoc installation procedures that require more advanced knowledge of Python and Pip. 2. Source Code Quality and Documentation ---------------------------------------- This benchmark seeks to compare the quality of documentation as well as other software quality features such as testing and continuous integration. WEFE ~~~~ WEFE has a complete documentation site that explains in detail how to use the package: an about page with the motivation and goals of the project, a quick start page showing how to install the library, several user guides on how to measure and mitigate bias in word embeddings, a detailed API of the implemented methods, theoretical background in the area, and finally implementations of previous case studies. In addition, most of the code is tested and developed using continuous integration mechanisms (through a linter and testing mechanisms in Github Actions), which are well-established practices in software development. Fair Embedding Engine ~~~~~~~~~~~~~~~~~~~~~ FEE’ documentation covers only the basic aspects of the API and a flowchart showing the main concepts of the library. The documentation does not include user guides, code examples, or theoretical background on the implemented methods. In terms of software engineering practices and standards, no tests, linter, or continuous integration mechanisms could be identified. Responsibly ~~~~~~~~~~~ Responsibly has a complete documentation site that explains how to use the package: an index page with the main project information and a quick start page that shows how to install the library, demos that act as user manuals, and a detailed API of the implemented methods. In addition, most of the code is tested and developed using continuous integration mechanisms (through a linter and testing in Github Actions). EmbeddingBiasScores ~~~~~~~~~~~~~~~~~~~ It was not possible to find formal documentation explaining how to run bias tests in EmbeddingBiasScores. There is only a small Jupyter notebook with some use cases, which at the time of writing had several flaws that made it difficult to understand and use. No testing, linter, or continuous integration mechanisms could be identified. Conclusion ~~~~~~~~~~ In terms of documentation, WEFE contains much more detailed documentation than the other libraries, with more extensive manuals and replications of previous case studies. Responsibly has sufficient documentation to execute its main functionalities without major problems, however, it is not as exhaustive as that of WEFE. FEE, only provides API documentation, which in our opinion is not sufficient for new users to use it without problems. Finally, EmbeddingBiasScores only presents a Jupyter notebook with some implementation examples. With respect to software quality, both FEE and Responsibly comply with well-established software development practices (i.e., testing, continuous integration, linter). FEE and EmbeddingBiasScores, on the other hand, do not have any of these practices in place 3. Ease of loading models ------------------------- In this section we will compare how easy it is to load a pre-trained word embedding (WE) model from each library. Two settings are compared: loading a model from Gensim’s API (``glove-twitter-25``) and loading a model from a binary file (``word2vec``). The second setting requires downloading a WE model trained with the original word2vec implementation, which can be obtained as follows: .. code:: ipython3 # !wget https://github.com/RaRe-Technologies/gensim-data/releases/download/word2vec-google-news-300/word2vec-google-news-300.gz # !gzip -dv word2vec-google-news-300.gz WEFE ~~~~ In WEFE, WE models are represented internally by wrapping Gensim models. This means that the model loading process (either from the API or from a file) is handled by Gensim loaders, while the class that generates the objects that allow access to the embeddings is managed by WEFE. The following code shows how to load a glove model using the Gensim API from within WEFE: .. code:: ipython3 from wefe.word_embedding_model import WordEmbeddingModel import gensim.downloader as api # load glove twitter_25 = api.load("glove-twitter-25") model = WordEmbeddingModel(twitter_25, "glove twitter dim=25") The following code shows how to load a word2vec model trained with the original implementation. .. code:: ipython3 from wefe.word_embedding_model import WordEmbeddingModel from gensim.models.keyedvectors import KeyedVectors # load word2vec word2vec = api.load("word2vec-google-news-300") model = WordEmbeddingModel(word2vec, "word2vec-google-news-300") FEE ~~~ FEE also offers direct support for loading WE models from its API through the following code. In this case, model loading is coupled to the WE class, which provides the methods to access the embeddings. .. code:: ipython3 from FEE.fee.embedding.loader import WE fee_model = WE().load(ename="glove-twitter-25") .. code:: ipython3 from FEE.fee.embedding.loader import WE fee_model = WE().load(fname="word2vec-google-news-300", format="bin") Responsibly and EmbeddingBiasScores ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~ Neither Responsibly nor EmbeddingBiasScores implement their own interfaces to handle WE models. Users must rely on Gensim or other external libraries for this purpose. This can be expressed as shown in the following script: .. code:: ipython3 # load twitter_25 model from gensim api twitter_25 = api.load("glove-twitter-25") # load word2vec model from file word2vec = KeyedVectors.load_word2vec_format("word2vec-google-news-300", binary=True) Conclusion ~~~~~~~~~~ As discussed above, both WEFE and FEE implement their own interfaces to internally manage access to WE models. Responsibly and EmbeddingBiasScores lack such functionalities, which may complicate their use. 4. Ease of running bias measurements. ------------------------------------- The following section aims to compare the execution of fairness metrics in the libraries included in this study. To make the benchmark as objective as possible, the set of words and the WE model are kept fixed throughout the comparison, and only the metrics are allowed to vary. .. code:: ipython3 # words to evaluate female_terms = ["female", "woman", "girl", "sister", "she", "her", "hers", "daughter"] male_terms = ["male", "man", "boy", "brother", "he", "him", "his", "son"] family_terms = [ "home", "parents", "children", "family", "cousins", "marriage", "wedding", "relatives", ] career_terms = [ "executive", "management", "professional", "corporation", "salary", "office", "business", "career", ] # optional, only for wefe usage. target_sets_names = ["Female terms", "Male terms"] attribute_sets_names = ["Family terms", "Career terms"] WEFE ~~~~ WEFE defines a standardized framework for executing metrics: in short, it is necessary to define a query that will act as a container for the words to be tested and then, together with the model, will be provided as input to some metric. The outputs of the metrics are contained in dictionaries that allow additional metadata to be included to the output. .. code:: ipython3 # import the modules from wefe.query import Query # 1. create the query query = Query( [female_terms, male_terms], [family_terms, career_terms], target_sets_names, attribute_sets_names, ) query .. parsed-literal:: .. code:: ipython3 from wefe.metrics.WEAT import WEAT # 2. instance a WEAT metric and pass the query plus the model. weat = WEAT() result = weat.run_query(query, model) result .. parsed-literal:: {'query_name': 'Female terms and Male terms wrt Family terms and Career terms', 'result': 0.46343881433131173, 'weat': 0.46343881433131173, 'effect_size': 0.4507652792646716, 'p_value': nan} Since the ``run_query`` method is independent of the query and the model, it can receive additional parameters that customize the process. In this case, we show how to normalize the words before searching for them in the model (i.e., lowercase them and remove their accents). .. code:: ipython3 weat = WEAT() result = weat.run_query( query, model, preprocessors=[{"lowercase": True, "strip_accents": True}], ) result .. parsed-literal:: {'query_name': 'Female terms and Male terms wrt Family terms and Career terms', 'result': 0.46343881433131173, 'weat': 0.46343881433131173, 'effect_size': 0.4507652792646716, 'p_value': nan} Next, we show how to report the corresponding p-value through a permutation test. .. code:: ipython3 weat = WEAT() result = weat.run_query( query, model, calculate_p_value=True, ) result .. parsed-literal:: {'query_name': 'Female terms and Male terms wrt Family terms and Career terms', 'result': 0.46343881433131173, 'weat': 0.46343881433131173, 'effect_size': 0.4507652792646716, 'p_value': 0.19068093190680932} This interface allows us to easily switch to similar metrics (i.e., supporting the same number of number of word sets). .. code:: ipython3 from wefe.metrics import RNSB rnsb = RNSB() result = rnsb.run_query(query, model) result .. parsed-literal:: {'query_name': 'Female terms and Male terms wrt Family terms and Career terms', 'result': 0.09051558681296493, 'rnsb': 0.09051558681296493, 'negative_sentiment_probabilities': {'female': 0.5285811053851917, 'woman': 0.3031782770423851, 'girl': 0.20810547466232254, 'sister': 0.17327510211466302, 'she': 0.4165425516161486, 'her': 0.3895078245770702, 'hers': 0.31412920848479164, 'daughter': 0.13146512364633123, 'male': 0.42679205714649815, 'man': 0.43079499436045987, 'boy': 0.21701323144255624, 'brother': 0.19983034212661, 'he': 0.5645185337599223, 'him': 0.49470907399126185, 'his': 0.552712793795697, 'son': 0.17457869573293805}, 'negative_sentiment_distribution': {'female': 0.09565807331470504, 'woman': 0.054866603359974946, 'girl': 0.03766114329405169, 'sister': 0.031357841309175544, 'she': 0.07538229712572722, 'her': 0.07048978417965314, 'hers': 0.05684840897525258, 'daughter': 0.02379143012863325, 'male': 0.07723716469755836, 'man': 0.0779615819300061, 'boy': 0.03927319268906782, 'brother': 0.036163580806998274, 'he': 0.10216172076480977, 'him': 0.0895282036894233, 'his': 0.10002521923736822, 'son': 0.03159375449759469}} .. code:: ipython3 from wefe.metrics import MAC mac = MAC() result = mac.run_query(query, model) result .. parsed-literal:: {'query_name': 'Female terms and Male terms wrt Family terms and Career terms', 'result': 0.8416415235615204, 'mac': 0.8416415235615204, 'targets_eval': {'Female terms': {'female': {'Family terms': 0.9185737599618733, 'Career terms': 0.916069650076679}, 'woman': {'Family terms': 0.752434104681015, 'Career terms': 0.9377805145923048}, 'girl': {'Family terms': 0.707457959651947, 'Career terms': 0.9867974997032434}, 'sister': {'Family terms': 0.5973392464220524, 'Career terms': 0.9482253392925486}, 'she': {'Family terms': 0.7872791914269328, 'Career terms': 0.9161583095556125}, 'her': {'Family terms': 0.7883057091385126, 'Career terms': 0.9237247597193345}, 'hers': {'Family terms': 0.7385367527604103, 'Career terms': 0.9480051446007565}, 'daughter': {'Family terms': 0.5472579970955849, 'Career terms': 0.9277344475267455}}, 'Male terms': {'male': {'Family terms': 0.8735092766582966, 'Career terms': 0.9468009045813233}, 'man': {'Family terms': 0.8249392118304968, 'Career terms': 0.9350165261421353}, 'boy': {'Family terms': 0.7106057899072766, 'Career terms': 0.9879048476286698}, 'brother': {'Family terms': 0.6280269809067249, 'Career terms': 0.9477180293761194}, 'he': {'Family terms': 0.8693044614046812, 'Career terms': 0.8771287016716087}, 'him': {'Family terms': 0.8230192996561527, 'Career terms': 0.888683641096577}, 'his': {'Family terms': 0.8876195731572807, 'Career terms': 0.8920885202242061}, 'son': {'Family terms': 0.5764635019004345, 'Career terms': 0.9220191016211174}}}} Fair Embedding Engine ~~~~~~~~~~~~~~~~~~~~~ In the case of Fair Embedding Engine, the WE model is passed in the metric instantiation. Then, the output value of the metric is computed using the ``compute`` method of the metric object. FEE differs somewhat from the WEFE standardization by making mandatory to provide the model when instantiating each metric, making the metric object model dependent. This makes it difficult to test several models at once since you have to instantiate a different metric object for each model. On the other hand, FEE does not establish a clear mechanism for passing sets of words of different sizes to the computation method: sets of words are delivered directly with a star parameter \*, which defines an arbitrary number of positional arguments. This lack of definition makes it difficult for the user to understand how many and which word sets to pass. .. code:: ipython3 from FEE.fee.metrics import WEAT as FEE_WEAT fee_weat = FEE_WEAT(fee_model) fee_weat.compute(female_terms, male_terms, family_terms, career_terms) .. parsed-literal:: 0.39821118 The FEE implementation of WEAT also allows the calculation of the p-value. .. code:: ipython3 fee_weat.compute(female_terms, male_terms, family_terms, career_terms, p_val=True) .. parsed-literal:: (0.39821118, 0.0) Finally, the implementation of the metric does not support the execution of more complex actions, such as preprocessing word sets. We could not find any other metric that was easily replaceable using the same or a similar interface (with respect to the WEFE standardization layer). Responsibly ~~~~~~~~~~~ Similar to WEFE, responsibly has a function that takes the model and word sets as input and returns the WEAT score as output. .. code:: ipython3 from responsibly.we.weat import calc_single_weat calc_single_weat( twitter_25, first_target={"name": "female_terms", "words": female_terms}, second_target={"name": "male_terms", "words": male_terms}, first_attribute={"name": "family_terms", "words": family_terms}, second_attribute={"name": "career_terms", "words": career_terms}, ) .. parsed-literal:: {'Target words': 'female_terms vs. male_terms', 'Attrib. words': 'family_terms vs. career_terms', 's': 0.31658393144607544, 'd': 0.67794365, 'p': 0.09673659673659674, 'Nt': '8x2', 'Na': '8x2'} The p-value can also be obtained from the same function by setting the ``with_pvalue`` parameter to ``True``. .. code:: ipython3 calc_single_weat( twitter_25, first_target={"name": "female_terms", "words": female_terms}, second_target={"name": "male_terms", "words": male_terms}, first_attribute={"name": "family_terms", "words": family_terms}, second_attribute={"name": "career_terms", "words": career_terms}, with_pvalue=True, ) .. parsed-literal:: {'Target words': 'female_terms vs. male_terms', 'Attrib. words': 'family_terms vs. career_terms', 's': 0.31658393144607544, 'd': 0.67794365, 'p': 0.09673659673659674, 'Nt': '8x2', 'Na': '8x2'} The implementation of this metric does not include the ability to perform more complex actions such as preprocessing word sets. In addition, we were unable to find any metrics in this library other than WEAT that are directly comparable to those implemented by WEFE. EmbeddingBiasScores ~~~~~~~~~~~~~~~~~~~ EmbeddingBiasScores formalizes how bias is measured in a different way than WEFE: it classifies the methods into clustering or geometric methods (note that WEFE only implements the geometric equivalents). As part of their standardization, each geometric metric must first define the direction of the bias using the ``define_bias_space`` function with attribute_embeddings (attribute words) as input; and then use the ``group_bias`` or ``mean_individual_bias`` methods to compute the value of the metric. Examples of use are shown below: .. code:: ipython3 # the embeddings to be used must be transformed by hand from words to arrays. target_embeddings = [ [model[word] for word in female_terms], [model[word] for word in male_terms], ] attribute_embeddings = [ [model[word] for word in family_terms], [model[word] for word in career_terms], ] .. code:: ipython3 from EmbeddingBiasScores.geometrical_bias import WEAT weat = WEAT() weat.define_bias_space(attribute_embeddings) # group bias returns the effect size. weat.group_bias(target_embeddings) .. parsed-literal:: 0.4364516797305417 This implementation of WEAT returns the effect size by default. There is no way to parameterize the metric to compute the WEAT score or the p-value. Similar to WEFE, the standardization implemented by EmbeddingBiasScores allows to easily change the used metric to another with the same input word sets. .. code:: ipython3 from EmbeddingBiasScores.geometrical_bias import MAC mac = MAC() mac.define_bias_space(attribute_embeddings) # mac does not accept more than one target set, so we have to calculate it manually. target_0_mac = mac.mean_individual_bias(target_embeddings[0]) target_1_mac = mac.mean_individual_bias(target_embeddings[1]) (target_0_mac + target_1_mac) / 2 .. parsed-literal:: 0.8416415235615204 EmbeddingBiasScores includes metrics that WEFE does not yet implement, such as ``GeneralizedWEAT`` and ``SAME``. .. code:: ipython3 from EmbeddingBiasScores.geometrical_bias import GeneralizedWEAT gweat = GeneralizedWEAT() gweat.define_bias_space(attribute_embeddings) gweat.group_bias(target_embeddings) .. parsed-literal:: 0.02896493 .. code:: ipython3 from EmbeddingBiasScores.geometrical_bias import SAME same = SAME() same.define_bias_space(attribute_embeddings) same.mean_individual_bias(target_embeddings[0]) .. parsed-literal:: 0.2677120929221758 Finally, EmbeddingBiasScores does not allow any of its metrics to perform more complex actions, such as preprocessing word set or customizing some performance settings. Conclusion ~~~~~~~~~~ In WEFE, having the input words as query objects decoupled from the execution of metrics allows both parameterization of metric execution and easy exchange of one metric for another. In addition, the clean and unified interface for all metrics makes the execution of bias measurements intuitive. Responsibly and FEE share a similar interface, in which the metric arguments are sets of words (which lack the expressiveness of WEFE queries to declare the number of sets of words supported by each metric), making it difficult to standardize inputs across metrics. We were unable to find any metrics other than WEAT to include in the benchmarking of FEE and Responsibly. On the other hand, EmbeddingBiasScores also presents its own mathematical standardization for each metric as well as some metrics that WEFE does not yet implement. While the standardization they present may be a bit more specific, it makes it more complex to use. The increased difficulty is mainly due to two factors: users have to manually define the bias space (using the ``define_bias_space`` parameter) and then investigate whether to use the parameters ``group_bias`` or ``mean_individual_bias``, which is not clear at first sight unless the basics of the standardization proposed by this library have been previously studied. Finally, we highlight WEFE’s ``run_query`` method, which allows the user to customize the execution of metrics, such as word preprocessing, normalization of embeddings, and calculation of submetrics or statistical tests. 5. Ease of Running Bias Mitigation Algorithms --------------------------------------------- Next we will compare how to run bias mitigation methods on the libraries included in the benchmark. In order to make the comparison as objective as possible, the set of words and the embedding model remain fixed; only the algorithms executed vary. Furthermore, to evaluate the performance of the implemented methods, we will use the same query defined in the previous section using WEAT (female vs. male terms with respect to family vs. career). .. code:: ipython3 from wefe.datasets import fetch_debiaswe from wefe.utils import load_test_model # word sets to be used debiaswe_wordsets = fetch_debiaswe() definitional_pairs = debiaswe_wordsets["definitional_pairs"] gender_specific = debiaswe_wordsets["gender_specific"] targets = [ "executive", "management", "professional", "corporation", "salary", "office", "business", "career", "home", "parents", "children", "family", "cousins", "marriage", "wedding", "relatives", ] WEFE ~~~~ WEFE defines a standardized framework for executing bias mitigation algorithms based on the scikit-learn fit transform interface. The fit-transform interface allows the user to select the sets of words and parameters that will be used to learn the debiasing transformation (``fit``), as well as to select the words that will be effectively debiased by the method (``transform``). This allows the user to change the words used to define the bias criterion (which is usually gender, but could be easily changed), as well as the vocabulary word to which the mitigation is applied. This software design pattern is useful for comparing different de-biasing methods, as the user can ensure that the same parameters are used across methods. Below we show how to execute a mitigation method with WEFE: .. code:: ipython3 from wefe.debias.hard_debias import HardDebias from wefe.debias.hard_debias import HardDebias from wefe.word_embedding_model import WordEmbeddingModel from gensim import downloader as api # load glove model twitter_25 = api.load("glove-twitter-25") model = WordEmbeddingModel(twitter_25, "glove twitter dim=25") # 1. instance Hard Debias algortihm hd = HardDebias( verbose=False, criterion_name="gender", ) # 2. apply fit method and pass the model and definitional pairs. hd.fit(model, definitional_pairs=definitional_pairs) # 3. apply transform method passing the model, target and ignore word sets resulting in the debiased model hd_debiased_model = hd.transform( model, target=targets, ignore=gender_specific, copy=True, ) .. parsed-literal:: Copy argument is True. Transform will attempt to create a copy of the original model. This may fail due to lack of memory. Model copy created successfully. .. parsed-literal:: 100%|██████████████████████████████████████████████████████████████████████████| 16/16 [00:00<00:00, 21809.84it/s] Next, we show how to change the debiasing method while keeping a very similar parameter configuration. .. code:: ipython3 from wefe.debias.repulsion_attraction_neutralization import ( RepulsionAttractionNeutralization, ) ran = RepulsionAttractionNeutralization().fit( model=model, definitional_pairs=definitional_pairs, ) ran_debiased_model = ran.transform( model=model, target=targets, ignore=gender_specific, copy=True, ) .. parsed-literal:: Copy argument is True. Transform will attempt to create a copyof the original model. This may fail due to lack of memory. Model copy created successfully. .. parsed-literal:: 100%|█████████████████████████████████████████████████████████████████████████████| 16/16 [00:03<00:00, 5.23it/s] 100%|██████████████████████████████████████████████████████████████████████████| 16/16 [00:00<00:00, 45964.98it/s] As can be seen, the fit-transform standardization implemented in WEFE allows to easily execute and exchange the different bias mitigation methods implemented in the library. .. code:: ipython3 from wefe.metrics import WEAT weat = WEAT() result = weat.run_query( query, model, ) print("Original model WEAT evaluation: ", result["weat"]) weat = WEAT() result = weat.run_query( query, hd_debiased_model, ) print("Hard Debias debiased model WEAT evaluation: ", result["weat"]) weat = WEAT() result = weat.run_query( query, ran_debiased_model, ) print( "Repulsion Attraction Neutralization debiased model WEAT evaluation: ", result["weat"], ) .. parsed-literal:: Original model WEAT evaluation: 0.31658415612764657 Hard Debias debiased model WEAT evaluation: 0.002320525236427784 Repulsion Attraction Neutralization debiased model WEAT evaluation: 0.26007230998948216 Fair Embedding Engine ~~~~~~~~~~~~~~~~~~~~~ The Fair Embedding Engine (FEE) requires the embedding model to be passed during instantiation of the algorithm. It currently does not support user-given definitional pairs, as the word sets used are fixed in this implementation, focusing only on gender bias at the moment. Debiasing is performed by executing the run method. The list of target words to be debiased must be provided in this implementation. .. code:: ipython3 import copy from FEE.fee.embedding.loader import WE # load model fee_model = WE().load(ename="glove-twitter-25") # model must be normalized fee_model.normalize() .. code:: ipython3 from FEE.fee.debias import HardDebias # instance the algortihm and apply it to the embedding model fee_hd_debiased_model = HardDebias(copy.deepcopy(fee_model)).run(word_list=targets) FEE allows easy use of different debiasing methods with a similar interface .. code:: ipython3 from FEE.fee.debias import RANDebias # instance the algortihm and apply it to the embedding model ran_hd_debiased_model = RANDebias(copy.deepcopy(fee_model)).run(words=targets) .. code:: ipython3 # in the case, we generate a custom weat calculation using the fee debiasing methods. result = WEAT()._calc_weat( [fee_model.v(word) for word in query.target_sets[0]], [fee_model.v(word) for word in query.target_sets[1]], [fee_model.v(word) for word in query.attribute_sets[0]], [fee_model.v(word) for word in query.attribute_sets[1]], ) print("Original model WEAT evaluation: ", result) result = WEAT()._calc_weat( [fee_hd_debiased_model.v(word) for word in query.target_sets[0]], [fee_hd_debiased_model.v(word) for word in query.target_sets[1]], [fee_hd_debiased_model.v(word) for word in query.attribute_sets[0]], [fee_hd_debiased_model.v(word) for word in query.attribute_sets[1]], ) print("Hard Debias debiased model WEAT evaluation: ", result) result = WEAT()._calc_weat( [ran_hd_debiased_model.v(word) for word in query.target_sets[0]], [ran_hd_debiased_model.v(word) for word in query.target_sets[1]], [ran_hd_debiased_model.v(word) for word in query.attribute_sets[0]], [ran_hd_debiased_model.v(word) for word in query.attribute_sets[1]], ) print("Repulsion Attraction Neutralization debiased model WEAT evaluation: ", result) .. parsed-literal:: Original model WEAT evaluation: 0.31658416730351746 Hard Debias debiased model WEAT evaluation: -0.061893132515251637 Repulsion Attraction Neutralization debiased model WEAT evaluation: 0.17548414319753647 Responsibly ~~~~~~~~~~~ In Responsibly the embedding model is provided during the instantiation of the ``GenderBiasWe`` class. Definitional pairs cannot be provided by the user, as the bias being mitigated is set specifically to gender bias. To perform the debiasing process, one simply needs to execute the ``debias`` method. However, it should be noted that the mitigation method cannot be run on the benchmark model chosen, as it is not compatible with uncased models such as ``twitter-25``. .. code:: ipython3 from responsibly.we import GenderBiasWE # does not work with twitter_25. gender_bias_we = GenderBiasWE(word2vec) # instance the GenderBiasWE gender_bias_we.debias(neutral_words=targets) # apply the debias EmbeddingBiasScore ~~~~~~~~~~~~~~~~~~ The library does not implement mitigation methods, so it is not included in this comparison. Conclusion ~~~~~~~~~~ All three libraries offer a simple way to apply bias mitigation algorithms in a similar way and all of them are able to mitigate bias in the word embedding model by similar amounts, depending on the metric used. The main difference between them is that WEFE offers more flexibility to users, allowing them to choose the bias criteria through the words used to learn the transformation and the words that are mitigated. On the other hand, FEE and Responsibly only work with gender bias because the set of words is fixed by default. Finally, WEFE includes more mitigation algorithms than the other two frameworks. 6. Metrics and Mitigation Methods Implemented --------------------------------------------- The following tables provide a comparison of the libraries included in this benchmarking, with respect to the bias metrics and mitigation methods they implement to date. Fairness Metrics ~~~~~~~~~~~~~~~~ =================== ========================= ==== === =========== =================== Metric Implementable in WEFE WEFE FEE Responsibly EmbeddingBiasScores =================== ========================= ==== === =========== =================== WEAT ✔ ✔ ✔ ✔ ✔ WEAT ES ✔ ✔ ✖ ✖ ✖ RNSB ✔ ✔ ✖ ✖ ✖ RIPA ✔ ✔ ✖ ✖ ✔ ECT ✔ ✔ ✖ ✖ ✖ RND ✔ ✔ ✖ ✖ ✖ MAC ✔ ✔ ✖ ✖ ✔ Direct Bias ✔ ✖ ✔ ✔ ✔ SAME ✔ ✖ ✖ ✖ ✔ Generalized WEAT ✔ ✖ ✖ ✖ ✔ IndirectBias ✖ ✖ ✖ ✔ ✖ GIPE ✖ ✖ ✔ ✖ ✖ PMN ✖ ✖ ✔ ✖ ✖ Proximity Bias ✖ ✖ ✔ ✖ ✖ =================== ========================= ==== === =========== =================== Metrics marked as "✔" in the "Implementable in WEFE" column can be implemented directly within the WEFE framework using word sets as input. Metrics marked as "✖" require additional representations such as gender directions or apply before/after transformations, and are therefore currently out of WEFE's scope. Mitigation algorithms ~~~~~~~~~~~~~~~~~~~~~ ======================= ==== === =========== =================== Algorithm WEFE FEE Responsibly EmbeddingBiasScores ======================= ==== === =========== =================== Hard Debias ✔ ✔ ✔ ✖ Double Hard Debias ✔ ✖ ✖ ✖ Half Sibling Regression ✔ ✔ ✖ ✖ RAN ✔ ✔ ✖ ✖ Multiclass HD ✔ ✖ ✖ ✖ ======================= ==== === =========== =================== Conclusion ---------- The following table summarizes the main differences between the libraries analyzed in this benchmark study. ==================================================== ========================================= ========================================================== ========================================== ==================================== Item WEFE FEE Responsibly EmbeddingBiasScores ==================================================== ========================================= ========================================================== ========================================== ==================================== Implemented Metrics 7 5 3 6 Implemented Mitigation Algorithms 5 3 1 0 Extensible Easy Easy Difficult, not very modular. Easy Well-defined interface for metrics ✔ ✖ ✖ ✔ Well-defined interface for mitigation algorithms ✔ ✖ ✖ ✖ Lastest update July 2025 October 2020 April 2021 April 2023 Installation Easy: pip or conda No instructions. It can be installed from the repository Only with pip. Presents problems Only from the repository Documentation Extensive documentation with examples Almost no documentation Limited documentation with some examples No documentation, only examples. ==================================================== ========================================= ========================================================== ========================================== ====================================