wefe.RND
- class wefe.RND
Relative Norm Distance (RND).
It measures the relative strength of association of a set of neutral words with respect to two groups.
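As a rough illustration of the idea, the sketch below computes, for each neutral (attribute) word, the norm distance to the mean vector of the first group minus the norm distance to the mean vector of the second group, and then averages these differences. The function name `relative_norm_distance`, the sign convention (first group minus second group), and the plain-dictionary input format are assumptions made for this example and are not WEFE's API; averaging is used because it is consistent with the example output on this page, where `result` equals the mean of `distances_by_word`.

```python
import numpy as np


def relative_norm_distance(target_1, target_2, attributes):
    """Hypothetical helper, not part of WEFE.

    target_1, target_2: sequences of embedding vectors, one per target word.
    attributes: dict mapping each attribute word to its embedding vector.
    """
    # Represent each target group by the mean of its word vectors.
    mean_1 = np.mean(np.asarray(target_1, dtype=float), axis=0)
    mean_2 = np.mean(np.asarray(target_2, dtype=float), axis=0)

    distances_by_word = {}
    for word, vector in attributes.items():
        v = np.asarray(vector, dtype=float)
        # Assumed sign convention: distance to group 1 minus distance to group 2.
        distances_by_word[word] = float(
            np.linalg.norm(v - mean_1) - np.linalg.norm(v - mean_2)
        )

    # Aggregate the per-word differences with a mean.
    score = float(np.mean(list(distances_by_word.values())))
    return score, distances_by_word
```

Under this convention, a negative value for a word means it lies closer to the first group's mean than to the second's.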
References
[1]: Nikhil Garg, Londa Schiebinger, Dan Jurafsky, and James Zou. Word embeddings quantify 100 years of gender and ethnic stereotypes. Proceedings of the National Academy of Sciences, 115(16):E3635–E3644, 2018.
- __init__(*args, **kwargs)
- run_query(query: wefe.query.Query, model: wefe.word_embedding_model.WordEmbeddingModel, distance: str = 'norm', lost_vocabulary_threshold: float = 0.2, preprocessors: List[Dict[str, Union[str, bool, Callable]]] = [{}], strategy: str = 'first', normalize: bool = False, warn_not_found_words: bool = False, *args: Any, **kwargs: Any) Dict[str, Any]
Calculate the RND metric over the provided parameters.
- Parameters
- query : Query
A Query object that contains the target and attribute sets to be tested.
- model : WordEmbeddingModel
A word embedding model.
- distance : str, optional
Specifies the type of distance to calculate: either ‘norm’ or ‘cos’. Defaults to ‘norm’.
- preprocessors : List[Dict[str, Union[str, bool, Callable]]]
A list of preprocessor options.
A preprocessor is a dictionary that specifies the processing applied to each word before it is looked up in the model vocabulary. For example, the preprocessor {'lowercase': True, 'strip_accents': True} lowercases each word and removes its accents before searching for it in the model vocabulary. Note that an empty dictionary {} indicates that no preprocessing is done.
The possible options for a preprocessor are:
lowercase: bool. Indicates that the words are transformed to lowercase.
uppercase: bool. Indicates that the words are transformed to uppercase.
titlecase: bool. Indicates that the words are transformed to titlecase.
strip_accents: bool, {'ascii', 'unicode'}. Specifies that the accents of the words are removed. The stripping type can be specified; True uses ‘unicode’ by default.
preprocessor: Callable. A function that operates on each word. If a function is specified, it overrides the default preprocessor (i.e., the previous options are ignored).
A list of preprocessor options allows you to search the model for several variants of each word. For example, with the preprocessors [{}, {"lowercase": True, "strip_accents": True}], {} first searches for the original words in the model vocabulary. If some of them are not found, {"lowercase": True, "strip_accents": True} is applied to those words, and they are then searched for again in the model vocabulary.
- strategy : str, optional
Indicates how the preprocessed words are used: ‘first’ includes only the first transformed word found; ‘all’ includes all transformed words found. Defaults to ‘first’.
- normalize : bool, optional
If True, the embeddings are normalized before the metric is calculated. Defaults to False.
- warn_not_found_words : bool, optional
Specifies whether the function logs a warning for the words that were not found in the model’s vocabulary. Defaults to False.
- Returns
- Dict[str, Any]
A dictionary with the query name, the resulting score of the metric, and a dictionary with the distance of each attribute word with respect to the means of the target sets.
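The interaction between the preprocessors list and the strategy parameter described above can be sketched in plain Python. This is only an illustration of the documented behavior, not WEFE's implementation: the helper names `strip_accents`, `preprocess_word`, and `find_in_vocabulary` are hypothetical, accent stripping is approximated with Unicode NFKD decomposition, the order in which options are applied is a guess, and dropping duplicate matches under ‘all’ is an assumption.

```python
import unicodedata
from typing import Callable, Dict, List, Set, Union

Preprocessor = Dict[str, Union[str, bool, Callable]]


def strip_accents(word: str) -> str:
    # Decompose accented characters, then drop the combining marks.
    decomposed = unicodedata.normalize("NFKD", word)
    return "".join(ch for ch in decomposed if not unicodedata.combining(ch))


def preprocess_word(word: str, options: Preprocessor) -> str:
    # A custom callable overrides every other option.
    custom = options.get("preprocessor")
    if callable(custom):
        return custom(word)
    if options.get("strip_accents"):
        word = strip_accents(word)
    if options.get("lowercase"):
        word = word.lower()
    elif options.get("uppercase"):
        word = word.upper()
    elif options.get("titlecase"):
        word = word.title()
    return word


def find_in_vocabulary(
    word: str,
    vocabulary: Set[str],
    preprocessors: List[Preprocessor],
    strategy: str = "first",
) -> List[str]:
    # Try each preprocessor in order and collect the variants found.
    found: List[str] = []
    for options in preprocessors:
        candidate = preprocess_word(word, options)
        if candidate in vocabulary and candidate not in found:
            found.append(candidate)
            if strategy == "first":
                break  # keep only the first variant found
    return found
```

For instance, with a toy vocabulary {"cafe", "Café"} and the preprocessors [{}, {"lowercase": True, "strip_accents": True}], looking up "Café" keeps only "Café" under ‘first’ and both "Café" and "cafe" under ‘all’.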
Examples
>>> from wefe.metrics import RND
>>> from wefe.query import Query
>>> from wefe.utils import load_test_model
>>>
>>> # define the query
>>> query = Query(
...     target_sets=[
...         ["female", "woman", "girl", "sister", "she", "her", "hers",
...          "daughter"],
...         ["male", "man", "boy", "brother", "he", "him", "his", "son"],
...     ],
...     attribute_sets=[
...         [
...             "home", "parents", "children", "family", "cousins", "marriage",
...             "wedding", "relatives",
...         ],
...     ],
...     target_sets_names=["Female terms", "Male Terms"],
...     attribute_sets_names=["Family"],
... )
>>>
>>> # load the model (in this case, the test model included in wefe)
>>> model = load_test_model()
>>>
>>> # instantiate the metric and run the query
>>> RND().run_query(query, model)
{'query_name': 'Female terms and Male Terms wrt Family',
 'result': 0.030381828546524048,
 'rnd': 0.030381828546524048,
 'distances_by_word': {'wedding': -0.1056304,
                       'marriage': -0.10163283,
                       'children': -0.068374634,
                       'parents': 0.00097084045,
                       'relatives': 0.0483346,
                       'family': 0.12408042,
                       'cousins': 0.17195654,
                       'home': 0.1733501}}
>>>
>>> # to normalize the embeddings before calculating the metric,
>>> # set the normalize parameter to True when running the query.
>>> RND().run_query(query, model, normalize=True)
{'query_name': 'Female terms and Male Terms wrt Family',
 'result': -0.006278775632381439,
 'rnd': -0.006278775632381439,
 'distances_by_word': {'children': -0.05244279,
                       'wedding': -0.04642248,
                       'marriage': -0.04268837,
                       'parents': -0.022358716,
                       'relatives': 0.005497098,
                       'family': 0.023389697,
                       'home': 0.04009247,
                       'cousins': 0.044702888}}
>>>
>>> # to use cosine distance instead of the Euclidean norm,
>>> # set the distance parameter to 'cos' when running the query.
>>> RND().run_query(query, model, normalize=True, distance='cos')
{'query_name': 'Female terms and Male Terms wrt Family',
 'result': 0.03643466345965862,
 'rnd': 0.03643466345965862,
 'distances_by_word': {'cousins': -0.035989374,
                       'home': -0.026971221,
                       'family': -0.009296179,
                       'relatives': 0.015690982,
                       'parents': 0.051281124,
                       'children': 0.09255883,
                       'marriage': 0.09959312,
                       'wedding': 0.104610026}}