wefe.RND

class wefe.RND

Relative Norm Distance (RND).

It measures the relative strength of association of a set of neutral words with respect to two groups.
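
As a rough illustration of the idea, the sketch below computes, for each neutral word, its Euclidean distance to the mean vector of the first group minus its distance to the mean vector of the second group, and then averages those per-word values. This is a dependency-free sketch under stated assumptions, not the library's implementation: the helper names (`centroid`, `euclidean`, `rnd_sketch`), the toy 2-d vectors, the choice of averaging, and the sign convention (first group minus second group) are all assumptions made for the example.

```python
import math
from statistics import fmean
from typing import Dict, List, Sequence

Vector = Sequence[float]


def centroid(vectors: List[Vector]) -> List[float]:
    """Component-wise mean of a non-empty list of equal-length vectors."""
    return [fmean(component) for component in zip(*vectors)]


def euclidean(u: Vector, v: Vector) -> float:
    """Euclidean norm of the difference u - v."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(u, v)))


def rnd_sketch(
    target_1: List[Vector],
    target_2: List[Vector],
    attributes: Dict[str, Vector],
) -> Dict[str, object]:
    """Hypothetical helper: average of the per-word relative distances."""
    mean_1 = centroid(target_1)
    mean_2 = centroid(target_2)
    # Assumed sign convention: a negative value means the word lies
    # closer to the mean of the first group than to that of the second.
    distances = {
        word: euclidean(vec, mean_1) - euclidean(vec, mean_2)
        for word, vec in attributes.items()
    }
    return {"rnd": fmean(distances.values()), "distances_by_word": distances}


# Toy 2-d vectors, invented for illustration only.
toy_result = rnd_sketch(
    target_1=[(0.0, 0.0), (0.0, 2.0)],  # mean vector is (0, 1)
    target_2=[(4.0, 0.0), (4.0, 2.0)],  # mean vector is (4, 1)
    attributes={"near_first": (1.0, 1.0), "near_second": (3.0, 1.0)},
)
```

In this toy case the two words sit symmetrically between the groups, so their relative distances cancel and the averaged score is zero.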

References

[1]: Nikhil Garg, Londa Schiebinger, Dan Jurafsky, and James Zou.
Word embeddings quantify 100 years of gender and ethnic stereotypes.
Proceedings of the National Academy of Sciences, 115(16):E3635–E3644, 2018.
__init__(*args, **kwargs)
run_query(query: wefe.query.Query, model: wefe.word_embedding_model.WordEmbeddingModel, distance: str = 'norm', lost_vocabulary_threshold: float = 0.2, preprocessors: List[Dict[str, Union[str, bool, Callable]]] = [{}], strategy: str = 'first', normalize: bool = False, warn_not_found_words: bool = False, *args: Any, **kwargs: Any) -> Dict[str, Any]

Calculate the RND metric over the provided parameters.

Parameters
query : Query

A Query object that contains the target and attribute sets to be tested.

model : WordEmbeddingModel

A word embedding model.

distance : str, optional

Specifies which type of distance is calculated. One of {'norm', 'cos'}, by default 'norm'.

preprocessors : List[Dict[str, Union[str, bool, Callable]]]

A list with preprocessor options.

A preprocessor is a dictionary that specifies what processing is performed on each word before it is looked up in the model vocabulary. For example, the preprocessor {'lowercase': True, 'strip_accents': True} lowercases each word and strips its accents before it is searched for in the model vocabulary. Note that an empty dictionary {} indicates that no preprocessing is done.

The possible options for a preprocessor are:

  • lowercase: bool. Indicates that the words are transformed to lowercase.

  • uppercase: bool. Indicates that the words are transformed to uppercase.

  • titlecase: bool. Indicates that the words are transformed to titlecase.

  • strip_accents: bool or {'ascii', 'unicode'}. Indicates that accents are removed from the words. The stripping type can be specified explicitly; True uses 'unicode'.

  • preprocessor: Callable. A function that operates on each word. When a function is specified, it overrides the default preprocessor (i.e., the previous options are ignored).

A list of preprocessor options allows you to search for several variants of the words in the model. For example, with the preprocessors [{}, {"lowercase": True, "strip_accents": True}], the first entry {} searches for the original words in the vocabulary of the model. If some of them are not found, {"lowercase": True, "strip_accents": True} is applied to those words and they are then searched for again in the model vocabulary.

strategy : str, optional

Indicates how the preprocessed words are used: 'first' includes only the first transformed word found; 'all' includes all transformed words found. By default 'first'.

normalize : bool, optional

True indicates that embeddings will be normalized, by default False.

warn_not_found_words : bool, optional

Specifies whether the function logs a warning listing the words that were not found in the model’s vocabulary, by default False.
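
The interaction between the preprocessors list and the strategy parameter described above can be sketched as follows. This is a simplified stand-in, not the library's code: `apply_preprocessor`, `find_variants`, the plain `set` used as a vocabulary, and the order in which options are applied are all hypothetical choices made for the example, and accent stripping is approximated with Unicode decomposition regardless of the requested stripping type.

```python
import unicodedata
from typing import Callable, Dict, List, Set, Union

PreprocessorOptions = Dict[str, Union[str, bool, Callable[[str], str]]]


def apply_preprocessor(word: str, options: PreprocessorOptions) -> str:
    """Transform one word according to a single preprocessor dictionary."""
    custom = options.get("preprocessor")
    if callable(custom):
        # A custom function overrides every other option.
        return custom(word)
    if options.get("lowercase"):
        word = word.lower()
    elif options.get("uppercase"):
        word = word.upper()
    elif options.get("titlecase"):
        word = word.title()
    if options.get("strip_accents"):
        # Decompose characters, then drop the combining accent marks.
        decomposed = unicodedata.normalize("NFKD", word)
        word = "".join(c for c in decomposed if not unicodedata.combining(c))
    return word


def find_variants(
    word: str,
    vocabulary: Set[str],
    preprocessors: List[PreprocessorOptions],
    strategy: str = "first",
) -> List[str]:
    """Return the variants of `word` that exist in `vocabulary`."""
    found: List[str] = []
    for options in preprocessors:
        candidate = apply_preprocessor(word, options)
        if candidate in vocabulary and candidate not in found:
            found.append(candidate)
            if strategy == "first":
                break  # stop at the first variant that is found
    return found


# Toy vocabulary, invented for illustration only.
vocabulary = {"cafe", "Paris", "paris"}
preprocessors = [{}, {"lowercase": True, "strip_accents": True}]

first_match = find_variants("Café", vocabulary, preprocessors)
all_matches = find_variants("Paris", vocabulary, preprocessors, strategy="all")
```

Here "Café" is missing from the toy vocabulary in its original form, so only the lowercased, accent-stripped variant is returned, while "Paris" with strategy "all" yields both the original and the lowercased form.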

Returns
Dict[str, Any]

A dictionary with the query name, the resulting score of the metric, and a dictionary with the distance of each attribute word with respect to the means of the target sets.

Examples

>>> from wefe.metrics import RND
>>> from wefe.query import Query
>>> from wefe.utils import load_test_model
>>>
>>> # define the query
>>> query = Query(
...     target_sets=[
...         ["female", "woman", "girl", "sister", "she", "her", "hers",
...          "daughter"],
...         ["male", "man", "boy", "brother", "he", "him", "his", "son"],
...     ],
...     attribute_sets=[
...         [
...             "home", "parents", "children", "family", "cousins", "marriage",
...             "wedding", "relatives",
...         ],
...     ],
...     target_sets_names=["Female terms", "Male Terms"],
...     attribute_sets_names=["Family"],
... )
>>>
>>> # load the model (in this case, the test model included in wefe)
>>> model = load_test_model()
>>>
>>> # instantiate the metric and run the query
>>> RND().run_query(query, model) 
{'query_name': 'Female terms and Male Terms wrt Family',
 'result': 0.030381828546524048,
 'rnd': 0.030381828546524048,
 'distances_by_word': {'wedding': -0.1056304,
                       'marriage': -0.10163283,
                       'children': -0.068374634,
                       'parents': 0.00097084045,
                       'relatives': 0.0483346,
                       'family': 0.12408042,
                       'cousins': 0.17195654,
                       'home': 0.1733501}}
>>>
>>> # to normalize the embeddings before calculating the metric,
>>> # pass normalize=True when executing the query.
>>> RND().run_query(query, model, normalize=True) 
{'query_name': 'Female terms and Male Terms wrt Family',
 'result': -0.006278775632381439,
 'rnd': -0.006278775632381439,
 'distances_by_word': {'children': -0.05244279,
                       'wedding': -0.04642248,
                       'marriage': -0.04268837,
                       'parents': -0.022358716,
                       'relatives': 0.005497098,
                       'family': 0.023389697,
                       'home': 0.04009247,
                       'cousins': 0.044702888}}
>>>
>>> # to use cosine distance instead of the Euclidean norm,
>>> # pass distance='cos' when executing the query.
>>> RND().run_query(query, model, normalize=True, distance='cos') 
{'query_name': 'Female terms and Male Terms wrt Family',
 'result': 0.03643466345965862,
 'rnd': 0.03643466345965862,
 'distances_by_word': {'cousins': -0.035989374,
                       'home': -0.026971221,
                       'family': -0.009296179,
                       'relatives': 0.015690982,
                       'parents': 0.051281124,
                       'children': 0.09255883,
                       'marriage': 0.09959312,
                       'wedding': 0.104610026}}
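
The returned dictionary can be post-processed with ordinary Python. The sketch below copies the values from the first example output above into a literal, checks that the reported score agrees with the mean of the per-word distances, and splits the words by sign. Reading a negative value as "closer to the first target set" is an assumption about the sign convention, not something this page states.

```python
from statistics import fmean

# Values copied from the first example output above.
result = {
    "query_name": "Female terms and Male Terms wrt Family",
    "result": 0.030381828546524048,
    "rnd": 0.030381828546524048,
    "distances_by_word": {
        "wedding": -0.1056304,
        "marriage": -0.10163283,
        "children": -0.068374634,
        "parents": 0.00097084045,
        "relatives": 0.0483346,
        "family": 0.12408042,
        "cousins": 0.17195654,
        "home": 0.1733501,
    },
}

distances = result["distances_by_word"]

# The reported score agrees with the mean of the per-word distances
# (up to the rounding of the printed values).
mean_distance = fmean(distances.values())

# Assumed reading: negative values lean toward the first target set,
# positive values toward the second.
negative_words = [word for word, d in distances.items() if d < 0]
positive_words = [word for word, d in distances.items() if d > 0]
```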