Loading Embeddings From Different Sources

WEFE depends on gensim’s KeyedVectors to operate the word embeddings models. Therefore, any embedding you want to experiment with must be a model loaded through gensim’s APIs or any library that extends it.

In technical terms, the minimum requirement for WEFE to operate with a model is that it extends the BaseKeyedVectors class.

Next we show several options to load models using different sources.

Create a example query

In this section we only create an example query (same as the query of user guide) to be used in the following sections.

>>> # Load the query
>>> from wefe.query import Query
>>> from wefe.word_embedding_model import WordEmbeddingModel
>>> from wefe.metrics.WEAT import WEAT
>>> from wefe.datasets.datasets import load_weat
>>>
>>> # load the weat word sets
>>> word_sets = load_weat()
>>>
>>> # create the query
>>> query = Query([word_sets['male_terms'], word_sets['female_terms']],
>>>               [word_sets['career'], word_sets['family']],
>>>               ['Male terms', 'Female terms'],
>>>               ['Career', 'Family'])
>>>
>>> # instantiate the metric
>>> weat = WEAT()

Load from Gensim API

Gensim provides an extensive list of pre-trained models that can be used directly. Below we show an example of use.

>>> import gensim.downloader as api
>>>
>>> # Load from gensim.downloader some model, for example: glove-twitter-25
>>> glove_25_keyed_vectors = api.load('glove-twitter-25')
>>>
>>> # The resulting object is already a BaseKeyedVectors subclass object.
>>> # so we can wrap directly using .
>>> glove_25_model = WordEmbeddingModel(glove_25_keyed_vectors, 'glove-25')
>>>
>>> # Execute the query
>>> result = weat.run_query(query, glove_25_model)
>>> print(result)
{'query_name': 'Male terms and Female terms wrt Career and Family', 'result': 0.33814692}

Using Gensim Load

As we said before, any model that is loaded with gensim and extends BaseKeyedVectors can be used in WEFE to measure bias. In this section we will see how to load a word2vec model and Fasttext.

Note

Gensim is not directly compatible with glove model file format. However, they provide a script that allows you to transform any glove model into a word2vec format.

Loading Word2vec

For example, let us load word2vec from a .bin file The procedure is quite simple: first we download word2vec binary file from its source and then we load it using the KeyedVectors.load_word2vec_format function.

>>> from gensim.models import KeyedVectors
>>>
>>> w2v_embeddings = KeyedVectors.load_word2vec_format("/path/to/your/embeddings/model", binary=True)
>>> word2vec = WordEmbeddingModel(w2v_embeddings, 'word2vec')
>>>
>>> result = weat.run_query(query, word2vec)
>>> result
{'query_name': 'Male terms and Female terms wrt Career and Family',
 'result': 0.7280304}

Loading FastText

The same method works for Fasttext.

>>> from gensim.models import KeyedVectors
>>> fast_embeddings = KeyedVectors.load_word2vec_format('path/to/fast/embeddings.vec')
>>>
>>> fast = WordEmbeddingModel(fast_embeddings, 'fast')
>>> result = weat.run_query(query, fast)
>>>
>>> result
{'query_name': 'Male terms and Female terms wrt Career and Family',
 'result': 0.34870023}

While we load FastText here as KeyedVectors (i.e., in word2vec format), it can also be used via FastTextKeyedVectors.

Flair

WEFE does not yet support flair interfaces. However, you can use static embeddings of flair ( Classic Word Embeddings ) which are based on gensim’s KeyedVectors, to load embedding models. The following code is an example of this:

>>> from flair.embeddings import WordEmbeddings
>>>
>>> glove_embedding = WordEmbeddings('glove') # 100 dim glove
>>>
>>> # extract KeyedVectors object
>>> glove_keyed_vectors = glove_embedding.precomputed_word_embeddings
>>> glove_100 = WordEmbeddingModel(glove_keyed_vectors, 'glove-100')
>>>
>>> result = weat.run_query(query, glove_100)
>>> print(result)
{'query_name': 'Male terms and Female terms wrt Career and Family', 'result': 1.0486683}