User guide¶
This guide presents the general details of using the package. Below:

- We first present how to run a simple query using some embedding model.
- We then show how to run multiple queries on multiple embeddings.
- After that, we show how to compare, via rankings, the results obtained from running multiple sets of queries on multiple embeddings with different metrics.
- Finally, we show how to calculate the correlations between the rankings obtained.
Warning
To accurately study the biases contained in word embeddings, queries may contain words that could be offensive to certain groups or individuals. The relationships studied between these words DO NOT represent the ideas, thoughts or beliefs of the authors of this library. This applies to this and all pages of the documentation.
Note
If you are not familiar with the concepts of query, target and attribute set, please visit the framework section on the library’s about page. These concepts will be widely used in the following sections.
A Jupyter notebook with this code is available at the following link: WEFE User Guide.
Run a Query¶
The following code shows how to run a gender query using Glove embeddings and the Word Embedding Association Test (WEAT) as the fairness metric.
Below we show the three usual steps for performing a query in WEFE:
# Load the package
from wefe.query import Query
from wefe.word_embedding_model import WordEmbeddingModel
from wefe.metrics.WEAT import WEAT
from wefe.datasets.datasets import load_weat
import gensim.downloader as api
Load a word embedding model as a WordEmbeddingModel object¶
Here, we load a pretrained word embedding model using the gensim library and then create a WordEmbeddingModel instance. For this example, we will use a 25-dimensional Glove embedding model trained on a Twitter dataset.
twitter_25 = api.load('glove-twitter-25')
model = WordEmbeddingModel(twitter_25, 'glove twitter dim=25')
Create the query using a Query object¶
Define the target and attribute word sets and create a Query object that contains them. Some well-known word sets are already provided by the package and can be easily loaded by the user. Users can also provide their own custom sets.
For this example, we will create a query with gender terms with respect to family and career. The words we will use are taken from the word sets of the WEAT paper (included in the package).
# load the weat word sets
word_sets = load_weat()
gender_query = Query([word_sets['male_terms'], word_sets['female_terms']],
[word_sets['career'], word_sets['family']],
['Male terms', 'Female terms'], ['Career', 'Family'])
Run the Query¶
Instantiate the metric that you will use and then execute run_query with the parameters created in the previous steps.
The bias measurement process consists of three stages:

- Checking the measurement parameters.
- Transforming the word sets into word embeddings.
- Calculating the metric.

In this case we are going to use the WEAT metric.
weat = WEAT()
result = weat.run_query(gender_query, model)
print(result)
{'query_name': 'Male terms and Female terms wrt Career and Family',
'result': 0.3165841,
'weat': 0.3165841,
'effect_size': 0.677944,
'p-value': None}
By default, the result is a dict containing the query name (under the key query_name) and the calculated value of the metric under the key result. It also contains a key with the name of the calculated metric and its value (duplicated from the result key). Depending on the metric class used, the result dict can also include additional metrics, detailed word-by-word values or other statistics. Some metrics also allow you to change which value is reported under result, which will have implications a little further down the line.
In this case, WEAT returns the values of weat and effect_size, with weat as the default value of the result key.
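For intuition, the statistic that WEAT reports can be sketched in a few lines of numpy. This is not WEFE's implementation, only a toy illustration of the WEAT test statistic on made-up two-dimensional vectors (all names here are hypothetical):

```python
import numpy as np

def cosine(u, v):
    # cosine similarity between two embedding vectors
    return float(u @ v / (np.linalg.norm(u) * np.linalg.norm(v)))

def association(w, A, B):
    # s(w, A, B): mean similarity of w to attribute set A minus to set B
    return np.mean([cosine(w, a) for a in A]) - np.mean([cosine(w, b) for b in B])

def weat_statistic(T1, T2, A, B):
    # sum of associations over target set 1 minus over target set 2
    return sum(association(t, A, B) for t in T1) - sum(association(t, A, B) for t in T2)

# Toy "embeddings": the first target set points toward A, the second toward B
A  = [np.array([1.0, 0.0])]
B  = [np.array([0.0, 1.0])]
T1 = [np.array([0.9, 0.1])]
T2 = [np.array([0.1, 0.9])]
print(weat_statistic(T1, T2, A, B) > 0)  # True: T1 is closer to A than T2 is
```

A positive score indicates that the first target set is more associated with the first attribute set than the second target set is, which is the reading of the weat value above.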
Metric Params¶
Each metric allows varying the behavior of run_query through different parameters. For example, there are parameters to change the preprocessing of the words, to warn about errors, or to modify what the method returns by default. The parameters of each metric are detailed in the API documentation.
In this case, if we want run_query to return effect_size instead of weat under the result key, we can pass the parameter return_effect_size=True when executing run_query. Note that this parameter is specific to the WEAT class.
weat = WEAT()
result = weat.run_query(gender_query, model, return_effect_size = True)
print(result)
{'query_name': 'Male terms and Female terms wrt Career and Family',
'result': 0.677944,
'weat': 0.3165841,
'effect_size': 0.677944,
'p-value': None}
Word preprocessors¶
There may be word embeddings models whose words are not cased or that do
not have accents. In Glove
, for example, all its words in its
vocabulary are lowercase. However, many words in WEAT’s ethnicity
dataset contain cased words.
print(word_sets['european_american_names_5'])
['Adam', 'Harry', 'Josh', 'Roger', 'Alan', 'Frank', 'Justin', 'Ryan', 'Andrew', 'Jack', 'Matthew', 'Stephen', 'Brad', 'Greg', 'Paul', 'Jonathan', 'Peter', 'Amanda', 'Courtney', 'Heather', 'Melanie', 'Sara', 'Amber', 'Katie', 'Betsy', 'Kristin', 'Nancy', 'Stephanie', 'Ellen', 'Lauren', 'Colleen', 'Emily', 'Megan', 'Rachel']
If we carelessly execute the following query, many words, or even entire sets, could be lost when transforming the word sets into embeddings.
You can ask run_query to log the words that were lost in the transformation to vectors by using the parameter warn_not_found_words=True.
ethnicity_query = Query(
[
word_sets['european_american_names_5'],
word_sets['african_american_names_5']
], [word_sets['pleasant_5'], word_sets['unpleasant_5']],
['European american names(5)', 'African american names(5)'],
['Pleasant(5)', 'Unpleasant(5)'])
result = weat.run_query(ethnicity_query,
model,
warn_not_found_words=True)
print(result)
WARNING:root:The following words from set 'European american names(5)' do not exist within the vocabulary of glove twitter dim=25: ['Adam', 'Harry', 'Josh', 'Roger', 'Alan', 'Frank', 'Justin', 'Ryan', 'Andrew', 'Jack', 'Matthew', 'Stephen', 'Brad', 'Greg', 'Paul', 'Jonathan', 'Peter', 'Amanda', 'Courtney', 'Heather', 'Melanie', 'Sara', 'Amber', 'Katie', 'Betsy', 'Kristin', 'Nancy', 'Stephanie', 'Ellen', 'Lauren', 'Colleen', 'Emily', 'Megan', 'Rachel']
WARNING:root:The transformation of 'European american names(5)' into glove twitter dim=25 embeddings lost proportionally more words than specified in 'lost_words_threshold': 1.0 lost with respect to 0.2 maximum loss allowed.
WARNING:root:The following words from set 'African american names(5)' do not exist within the vocabulary of glove twitter dim=25: ['Alonzo', 'Jamel', 'Theo', 'Alphonse', 'Jerome', 'Leroy', 'Torrance', 'Darnell', 'Lamar', 'Lionel', 'Tyree', 'Deion', 'Lamont', 'Malik', 'Terrence', 'Tyrone', 'Lavon', 'Marcellus', 'Wardell', 'Nichelle', 'Shereen', 'Ebony', 'Latisha', 'Shaniqua', 'Jasmine', 'Tanisha', 'Tia', 'Lakisha', 'Latoya', 'Yolanda', 'Malika', 'Yvette']
WARNING:root:The transformation of 'African american names(5)' into glove twitter dim=25 embeddings lost proportionally more words than specified in 'lost_words_threshold': 1.0 lost with respect to 0.2 maximum loss allowed.
ERROR:root:At least one set of 'European american names(5) and African american names(5) wrt Pleasant(5) and Unpleasant(5)' query has proportionally fewer embeddings than allowed by the lost_vocabulary_threshold parameter (0.2). This query will return np.nan.
{'query_name': 'European american names(5) and African american names(5) wrt Pleasant(5) and Unpleasant(5)', 'result': nan, 'weat': nan, 'effect_size': nan}
Warning
To make the results more robust, if more than 20% (by default) of the words from any of the word sets of the query are not included in the word embedding model, the result of the metric will be np.nan. This behavior can be changed through a float parameter called lost_vocabulary_threshold.
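The threshold check can be illustrated with a few lines of plain Python. The helper below is hypothetical, not part of WEFE's API; it only shows why the cased names above fail against a lowercase vocabulary:

```python
def lost_fraction(words, vocabulary):
    """Fraction of a word set that is missing from the model vocabulary."""
    missing = [w for w in words if w not in vocabulary]
    return len(missing) / len(words)

# glove-twitter vocabularies are lowercase, so cased names are all missed
vocabulary = {"adam", "harry", "josh", "roger"}
names = ["Adam", "Harry", "Josh", "Roger"]
print(lost_fraction(names, vocabulary))        # 1.0: every word is lost
print(lost_fraction(names, vocabulary) > 0.2)  # True: exceeds the default threshold
```

Whenever this fraction exceeds the threshold for any set, the whole query returns np.nan, as in the log above.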
One of the parameters of run_query, preprocessor_args, allows running a preprocessor on each word of all sets before fetching its vector. This preprocessor can lowercase the words, strip their accents, or apply any other custom preprocessing given by the user.
The possible options for preprocessor_args are:

- lowercase: bool. Indicates whether the words are transformed to lowercase.
- strip_accents: bool or {'ascii', 'unicode'}. Specifies whether accents are removed from the words; the stripping type can be specified. True uses 'unicode' by default.
- preprocessor: Callable. A function that operates on each word. If a function is specified, it overrides the previous options.
weat = WEAT()
result = weat.run_query(ethnicity_query,
model,
preprocessor_args={
'lowercase': True,
'strip_accents': True
})
print(result)
{'query_name': 'European american names(5) and African american names(5) wrt Pleasant(5) and Unpleasant(5)', 'result': 3.7529151, 'weat': 3.7529151, 'effect_size': 1.2746819, 'p-value': None}
It may happen that you first want to look up the vector of a word in its original casing (since this vector may contain more information than the lowercased one) and, only if it does not exist in the model, then try its lowercase representation. This behavior can be obtained by specifying the preprocessing options in secondary_preprocessor_args and leaving the primary preprocessor at its default (i.e., not providing it).
In general, vectors are searched for first using the preprocessor specified in preprocessor_args and then using the one specified in secondary_preprocessor_args, if provided. Any combination of the two is therefore supported.
weat = WEAT()
result = weat.run_query(ethnicity_query,
model,
secondary_preprocessor_args={
'lowercase': True,
'strip_accents': True
})
print(result)
{'query_name': 'European american names(5) and African american names(5) wrt Pleasant(5) and Unpleasant(5)',
'result': 3.7529151,
'weat': 3.7529151,
'effect_size': 1.2746819,
'p-value': None}
Running multiple Queries¶
We usually want to test several queries that study some criterion of bias: gender, ethnicity, religion, politics, socioeconomic status, among others. Suppose you have created 20 queries that study gender bias on different embedding models. Running run_query on each embedding-query pair would be cumbersome and would require extra implementation work.
This is why the library also provides a utility to run multiple queries on multiple word embedding models in a single call: run_queries.
The following code shows how to run several gender queries on Glove embedding models of different dimensions trained on the Twitter dataset. The queries are executed using the WEAT metric.
from wefe.query import Query
from wefe.word_embedding_model import WordEmbeddingModel
from wefe.metrics import WEAT, RNSB
from wefe.datasets import load_weat
from wefe.utils import run_queries
import gensim.downloader as api
Load the models¶
Load three Glove Twitter embedding models, trained on the same dataset but with different numbers of dimensions.
model_1 = WordEmbeddingModel(api.load('glove-twitter-25'),
'glove twitter dim=25')
model_2 = WordEmbeddingModel(api.load('glove-twitter-50'),
'glove twitter dim=50')
model_3 = WordEmbeddingModel(api.load('glove-twitter-100'),
'glove twitter dim=100')
models = [model_1, model_2, model_3]
Load the word sets and create the queries¶
Now, we will load the WEAT word sets and create three queries intended to measure gender bias.
# Load the WEAT word sets
word_sets = load_weat()
# Create gender queries
gender_query_1 = Query([word_sets['male_terms'], word_sets['female_terms']],
[word_sets['career'], word_sets['family']],
['Male terms', 'Female terms'], ['Career', 'Family'])
gender_query_2 = Query([word_sets['male_terms'], word_sets['female_terms']],
[word_sets['science'], word_sets['arts']],
['Male terms', 'Female terms'], ['Science', 'Arts'])
gender_query_3 = Query([word_sets['male_terms'], word_sets['female_terms']],
[word_sets['math'], word_sets['arts_2']],
['Male terms', 'Female terms'], ['Math', 'Arts'])
gender_queries = [gender_query_1, gender_query_2, gender_query_3]
Run the queries on all word embeddings using WEAT¶
Now, to run our lists of queries and models, we call run_queries with the parameters defined in the previous steps. The function has three mandatory parameters:

- a metric,
- a list of queries, and
- a list of embedding models.

It is also possible to provide a name for the criterion studied by this set of queries through the parameter queries_set_name.
# Run the queries
WEAT_gender_results = run_queries(WEAT,
gender_queries,
models,
queries_set_name='Gender Queries')
WEAT_gender_results
WARNING:root:The transformation of 'Science' into glove twitter dim=25 embeddings lost proportionally more words than specified in 'lost_words_threshold': 0.25 lost with respect to 0.2 maximum loss allowed.
ERROR:root:At least one set of 'Male terms and Female terms wrt Science and Arts' query has proportionally fewer embeddings than allowed by the lost_vocabulary_threshold parameter (0.2). This query will return np.nan.
WARNING:root:The transformation of 'Science' into glove twitter dim=50 embeddings lost proportionally more words than specified in 'lost_words_threshold': 0.25 lost with respect to 0.2 maximum loss allowed.
ERROR:root:At least one set of 'Male terms and Female terms wrt Science and Arts' query has proportionally fewer embeddings than allowed by the lost_vocabulary_threshold parameter (0.2). This query will return np.nan.
WARNING:root:The transformation of 'Science' into glove twitter dim=100 embeddings lost proportionally more words than specified in 'lost_words_threshold': 0.25 lost with respect to 0.2 maximum loss allowed.
ERROR:root:At least one set of 'Male terms and Female terms wrt Science and Arts' query has proportionally fewer embeddings than allowed by the lost_vocabulary_threshold parameter (0.2). This query will return np.nan.
| model_name | Male terms and Female terms wrt Career and Family | Male terms and Female terms wrt Science and Arts | Male terms and Female terms wrt Math and Arts |
|---|---|---|---|
| glove twitter dim=25 | 0.316584 | nan | -0.0221328 |
| glove twitter dim=50 | 0.363743 | nan | -0.272334 |
| glove twitter dim=100 | 0.385352 | nan | -0.0825434 |
Warning
If more than 20% (by default) of the words from any of the word sets of the query are not included in the word embedding model, the metric will return np.nan. This behavior can be changed through a float parameter called lost_vocabulary_threshold.
Setting metric params¶
As you can see in the results above, an entire column has no results. As the warnings point out, when transforming the words of the sets into embeddings, more words were lost than allowed by the parameter lost_vocabulary_threshold, so all those queries returned np.nan. In this case, the word preprocessors seen above are very useful.
When using run_queries, we can also provide specific parameters for each metric by passing a dict with the metric params to the metric_params parameter. In this case, we will use preprocessor_args to lowercase the words.
WEAT_gender_results = run_queries(
WEAT,
gender_queries,
models,
metric_params={'preprocessor_args': {
'lowercase': True
}},
queries_set_name='Gender Queries')
WEAT_gender_results
| model_name | Male terms and Female terms wrt Career and Family | Male terms and Female terms wrt Science and Arts | Male terms and Female terms wrt Math and Arts |
|---|---|---|---|
| glove twitter dim=25 | 0.316584 | 0.167431 | -0.0339119 |
| glove twitter dim=50 | 0.363743 | -0.0846904 | -0.307589 |
| glove twitter dim=100 | 0.385352 | 0.0996324 | -0.15579 |
As the results table shows, no query is lost now.
Plot the results in a barplot¶
The library also provides an easy way to plot the results of a run_queries execution as a plotly barplot.
from wefe.utils import run_queries, plot_queries_results
# Plot the results
plot_queries_results(WEAT_gender_results).show()
Aggregating Results¶
The execution of run_queries in the previous step gave us several results evaluating the gender bias of the tested embeddings. However, these do not tell us much about the overall fairness of the embedding models with respect to the evaluated criterion. We would therefore like some mechanism to aggregate the results obtained from run_queries so that we can evaluate the bias as a whole.
A simple way to aggregate the results is to average their absolute values. To do so, when using run_queries, set the aggregate_results parameter to True. This activates the option that aggregates the results by averaging their absolute values and places them in the last column.
This aggregation function can be modified through the aggregation_function parameter. You can either specify a string that identifies one of the pre-implemented aggregation types, or provide a function that operates on the results dataframe.
The aggregation functions available are:

- Average: avg.
- Average of the absolute values: abs_avg.
- Sum: sum.
- Sum of the absolute values: abs_sum.
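For instance, the abs_avg aggregation corresponds to taking the row-wise mean of absolute values over the results DataFrame. A minimal pandas sketch, not WEFE's code, using shortened (hypothetical) column names and two of the scores from the table above:

```python
import pandas as pd

# WEAT scores per query (columns) and per model (rows), as run_queries returns them
results = pd.DataFrame(
    {'Career and Family': [0.316584, 0.363743, 0.385352],
     'Math and Arts': [-0.0339119, -0.307589, -0.15579]},
    index=['glove twitter dim=25', 'glove twitter dim=50', 'glove twitter dim=100'])

# abs_avg: average of the absolute values of each row
abs_avg = results.abs().mean(axis=1)
print(round(abs_avg['glove twitter dim=25'], 6))  # 0.175248
```

Note how the absolute value keeps the negative Math and Arts scores from cancelling the positive ones, which is exactly the point of abs_avg over plain avg.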
Note
Some functions are more appropriate for certain metrics. For metrics that return only positive numbers, all the previous aggregation functions are fine. In contrast, for metrics that return real values (e.g., WEAT, RND, etc.), aggregation functions such as sum would allow outputs of different signs to cancel each other out.
Let’s aggregate the results from the previous example by the average of the absolute values:
WEAT_gender_results_agg = run_queries(
WEAT,
gender_queries,
models,
metric_params={'preprocessor_args': {
'lowercase': True
}},
aggregate_results=True,
aggregation_function='abs_avg',
queries_set_name='Gender Queries')
WEAT_gender_results_agg
| model_name | Male terms and Female terms wrt Career and Family | Male terms and Female terms wrt Science and Arts | Male terms and Female terms wrt Math and Arts | WEAT: Gender Queries average of abs values score |
|---|---|---|---|---|
| glove twitter dim=25 | 0.316584 | 0.167431 | -0.0339119 | 0.172642 |
| glove twitter dim=50 | 0.363743 | -0.0846904 | -0.307589 | 0.252008 |
| glove twitter dim=100 | 0.385352 | 0.0996324 | -0.15579 | 0.213591 |
plot_queries_results(WEAT_gender_results_agg).show()
Finally, we can ask the function to return only the aggregated values (through the return_only_aggregation parameter) and then plot them.
WEAT_gender_results_only_agg = run_queries(
WEAT,
gender_queries,
models,
metric_params={'preprocessor_args': {
'lowercase': True
}},
aggregate_results=True,
aggregation_function='abs_avg',
return_only_aggregation=True,
queries_set_name='Gender Queries')
WEAT_gender_results_only_agg
| model_name | WEAT: Gender Queries average of abs values score |
|---|---|
| glove twitter dim=25 | 0.172642 |
| glove twitter dim=50 | 0.252008 |
| glove twitter dim=100 | 0.213591 |
plot_queries_results(WEAT_gender_results_only_agg).show()
Calculate Rankings¶
When we want to measure various bias criteria in different embedding models, two major problems arise:

- One type of bias can dominate the other because of significant differences in magnitude.
- Different metrics can operate on different scales, which makes them difficult to compare.

To illustrate this, suppose we have two sets of queries: one that explores gender biases and another that explores ethnicity biases, and we want to test these sets of queries on three Twitter Glove models of 25, 50 and 100 dimensions, using both WEAT and the Relative Negative Sentiment Bias (RNSB) as bias metrics.
Let’s show the first problem: the bias scores obtained from one set of queries can be much higher than those from the other set, even when the same metric is used.
We executed the gender and ethnicity queries using WEAT and the three models mentioned above. The results obtained are:
| model_name | WEAT: Gender Queries average of abs values score | WEAT: Ethnicity Queries average of abs values score |
|---|---|---|
| glove twitter dim=25 | 0.210556 | 2.64632 |
| glove twitter dim=50 | 0.292373 | 1.87431 |
| glove twitter dim=100 | 0.225116 | 1.78469 |
As can be seen, the ethnicity bias results are much greater than the gender ones.
The second problem is that metrics deliver their results on different scales.
We executed the gender queries using the WEAT and RNSB metrics and the three models mentioned above. The results obtained are:
| model_name | WEAT: Gender Queries average of abs values score | RNSB: Gender Queries average of abs values score |
|---|---|---|
| glove twitter dim=25 | 0.210556 | 0.032673 |
| glove twitter dim=50 | 0.292373 | 0.049429 |
| glove twitter dim=100 | 0.225116 | 0.0312772 |
We can see differences of an order of magnitude between the results of the two metrics.
To address these two problems, we propose creating rankings. Rankings allow us to focus on the relative differences reported by the metrics across models instead of on their absolute values.
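The idea behind a ranking can be sketched with pandas rank: each metric column is converted into positions, so scales and magnitudes no longer matter. The snippet below is a simplified stand-in for WEFE's create_ranking, assuming that lower aggregated scores get better (smaller) ranks, and reusing the gender scores from the tables above:

```python
import pandas as pd

# aggregated bias scores per metric (columns) and per model (rows)
scores = pd.DataFrame(
    {'WEAT gender': [0.210556, 0.292373, 0.225116],
     'RNSB gender': [0.032673, 0.049429, 0.0312772]},
    index=['glove twitter dim=25', 'glove twitter dim=50', 'glove twitter dim=100'])

# rank each column independently: 1 = lowest score in that column
ranking = scores.rank(method='min').astype(int)
print(ranking['WEAT gender'].tolist())  # [1, 3, 2]
print(ranking['RNSB gender'].tolist())  # [2, 3, 1]
```

Once converted to positions, the two metrics become directly comparable even though their raw scores differed by an order of magnitude.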
Now, let’s create rankings using the data used above. The following code will load the models and create the queries:
from wefe.query import Query
from wefe.datasets.datasets import load_weat
from wefe.word_embedding_model import WordEmbeddingModel
from wefe.metrics import WEAT, RNSB
from wefe.utils import run_queries, create_ranking, plot_ranking, plot_ranking_correlations
import gensim.downloader as api
# Load the models
model_1 = WordEmbeddingModel(api.load('glove-twitter-25'),
'glove twitter dim=25')
model_2 = WordEmbeddingModel(api.load('glove-twitter-50'),
'glove twitter dim=50')
model_3 = WordEmbeddingModel(api.load('glove-twitter-100'),
'glove twitter dim=100')
models = [model_1, model_2, model_3]
# Load the WEAT word sets
word_sets = load_weat()
# Create gender queries
gender_query_1 = Query([word_sets['male_terms'], word_sets['female_terms']],
[word_sets['career'], word_sets['family']],
['Male terms', 'Female terms'], ['Career', 'Family'])
gender_query_2 = Query([word_sets['male_terms'], word_sets['female_terms']],
[word_sets['science'], word_sets['arts']],
['Male terms', 'Female terms'], ['Science', 'Arts'])
gender_query_3 = Query([word_sets['male_terms'], word_sets['female_terms']],
[word_sets['math'], word_sets['arts_2']],
['Male terms', 'Female terms'], ['Math', 'Arts'])
# Create ethnicity queries
ethnicity_query_1 = Query([word_sets['european_american_names_5'],
word_sets['african_american_names_5']],
[word_sets['pleasant_5'], word_sets['unpleasant_5']],
['European Names', 'African Names'],
['Pleasant', 'Unpleasant'])
ethnicity_query_2 = Query([word_sets['european_american_names_7'],
word_sets['african_american_names_7']],
[word_sets['pleasant_9'], word_sets['unpleasant_9']],
['European Names', 'African Names'],
['Pleasant 2', 'Unpleasant 2'])
gender_queries = [gender_query_1, gender_query_2, gender_query_3]
ethnicity_queries = [ethnicity_query_1, ethnicity_query_2]
Now, we will run the queries with WEAT, WEAT Effect Size and RNSB:
# Run the queries WEAT
WEAT_gender_results = run_queries(
WEAT,
gender_queries,
models,
metric_params={'preprocessor_args': {
'lowercase': True
}},
aggregate_results=True,
return_only_aggregation=True,
queries_set_name='Gender Queries')
WEAT_ethnicity_results = run_queries(
WEAT,
ethnicity_queries,
models,
metric_params={'preprocessor_args': {
'lowercase': True
}},
aggregate_results=True,
return_only_aggregation=True,
queries_set_name='Ethnicity Queries')
# Run the queries with WEAT Effect Size
WEAT_EZ_gender_results = run_queries(WEAT,
gender_queries,
models,
metric_params={
'preprocessor_args': {
'lowercase': True
},
'return_effect_size': True
},
aggregate_results=True,
return_only_aggregation=True,
queries_set_name='Gender Queries')
WEAT_EZ_ethnicity_results = run_queries(WEAT,
ethnicity_queries,
models,
metric_params={
'preprocessor_args': {
'lowercase': True
},
'return_effect_size': True
},
aggregate_results=True,
return_only_aggregation=True,
queries_set_name='Ethnicity Queries')
# Run the queries using RNSB
RNSB_gender_results = run_queries(
RNSB,
gender_queries,
models,
metric_params={'preprocessor_args': {
'lowercase': True
}},
aggregate_results=True,
return_only_aggregation=True,
queries_set_name='Gender Queries')
RNSB_ethnicity_results = run_queries(
RNSB,
ethnicity_queries,
models,
metric_params={'preprocessor_args': {
'lowercase': True
}},
aggregate_results=True,
return_only_aggregation=True,
queries_set_name='Ethnicity Queries')
To create the ranking we use the create_ranking function. It takes all the DataFrames containing the calculated scores and uses their last column to create the ranking, assuming that this column holds an aggregated value.
from wefe.utils import run_queries, create_ranking, plot_ranking, plot_ranking_correlations
gender_ranking = create_ranking([
WEAT_gender_results, WEAT_EZ_gender_results, RNSB_gender_results
])
gender_ranking
| model_name | WEAT: Gender Queries average of abs values score (1) | WEAT: Gender Queries average of abs values score (2) | RNSB: Gender Queries average of abs values score |
|---|---|---|---|
| glove twitter dim=25 | 1 | 1 | 3 |
| glove twitter dim=50 | 3 | 2 | 2 |
| glove twitter dim=100 | 2 | 3 | 1 |
ethnicity_ranking = create_ranking([
WEAT_ethnicity_results, WEAT_EZ_gender_results, RNSB_ethnicity_results
])
ethnicity_ranking
| model_name | WEAT: Ethnicity Queries average of abs values score | WEAT: Gender Queries average of abs values score | RNSB: Ethnicity Queries average of abs values score |
|---|---|---|---|
| glove twitter dim=25 | 3 | 1 | 3 |
| glove twitter dim=50 | 2 | 2 | 2 |
| glove twitter dim=100 | 1 | 3 | 1 |
Plotting the rankings¶
Finally, we can plot the rankings as barplots using the plot_ranking function. The function can be used in two ways:
With facet by metric and criterion:
This figure shows the rankings separated by bias criterion and metric (i.e., by each column). Each bar represents the position of the embedding in the corresponding criterion-metric ranking.
plot_ranking(gender_ranking, use_metric_as_facet=True)
plot_ranking(ethnicity_ranking, use_metric_as_facet=True)
Without facet:
This figure shows the accumulated rankings for each embedding model. Each bar represents the sum of the rankings obtained by each embedding, and each color within a bar represents a different criterion-metric ranking.
plot_ranking(gender_ranking)
plot_ranking(ethnicity_ranking)
Ranking Correlations¶
We can see how the rankings obtained in the previous section relate to
each other by using a correlation matrix. To do this we provide a
function called calculate_ranking_correlations
. This function takes
the rankings as input and calculates the Spearman correlation between
them.
from wefe.utils import calculate_ranking_correlations, plot_ranking_correlations
correlations = calculate_ranking_correlations(gender_ranking)
correlations
| Model | WEAT: Gender Queries average of abs values score (1) | WEAT: Gender Queries average of abs values score (2) | RNSB: Gender Queries average of abs values score |
|---|---|---|---|
| WEAT: Gender Queries average of abs values score (1) | 1 | 0.5 | -0.5 |
| WEAT: Gender Queries average of abs values score (2) | 0.5 | 1 | -1 |
| RNSB: Gender Queries average of abs values score | -0.5 | -1 | 1 |
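For reference, a Spearman matrix like the one above can be reproduced with plain pandas: Spearman correlation on ranking columns is just corr(method='spearman'). A sketch using the gender ranking values from the previous section (shortened, hypothetical column names):

```python
import pandas as pd

# the gender ranking computed earlier, one column per metric
ranking = pd.DataFrame(
    {'WEAT (1)': [1, 3, 2],
     'WEAT (2)': [1, 2, 3],
     'RNSB':     [3, 2, 1]},
    index=['glove twitter dim=25', 'glove twitter dim=50', 'glove twitter dim=100'])

corr = ranking.corr(method='spearman')
print(corr.loc['WEAT (1)', 'WEAT (2)'])  # 0.5
print(corr.loc['WEAT (2)', 'RNSB'])      # -1.0
```

A value of 1 means two rankings order the models identically, while -1 means they order them in exactly opposite ways.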
This function uses the corr() method of the ranking dataframe, so the correlation method can be changed to 'pearson', 'spearman' or 'kendall'. In the following example we use the Kendall correlation.
calculate_ranking_correlations(gender_ranking, method='kendall')
| Model | WEAT: Gender Queries average of abs values score (1) | WEAT: Gender Queries average of abs values score (2) | RNSB: Gender Queries average of abs values score |
|---|---|---|---|
| WEAT: Gender Queries average of abs values score (1) | 1 | 0.333333 | -0.333333 |
| WEAT: Gender Queries average of abs values score (2) | 0.333333 | 1 | -1 |
| RNSB: Gender Queries average of abs values score | -0.333333 | -1 | 1 |
Finally, we also provide a function to plot the correlations, which lets us visually analyze how the rankings relate to each other.
correlation_fig = plot_ranking_correlations(correlations)
correlation_fig.show()