wefe.RNSB

class wefe.RNSB[source]

An implementation of Relative Negative Sentiment Bias (RNSB).

References

[1] Chris Sweeney and Maryam Najafian. A transparent framework for evaluating unintended demographic bias in word embeddings. In Proceedings of the 57th Annual Meeting of the Association for Computational Linguistics, pages 1662–1667, 2019.

__init__(*args, **kwargs)
metric_name: str = 'Relative Negative Sentiment Bias'
metric_short_name: str = 'RNSB'
metric_template: Tuple[Union[int, str], Union[int, str]] = ('n', 2)
run_query(query: wefe.query.Query, word_embedding: wefe.word_embedding_model.WordEmbeddingModel, estimator: sklearn.base.BaseEstimator = <class 'sklearn.linear_model._logistic.LogisticRegression'>, estimator_params: Dict[str, Any] = {'max_iter': 10000, 'solver': 'liblinear'}, num_iterations: int = 1, random_state: Optional[int] = None, print_model_evaluation: bool = False, lost_vocabulary_threshold: float = 0.2, preprocessor_args: Dict[str, Optional[Union[bool, str, Callable]]] = {'lowercase': False, 'preprocessor': None, 'strip_accents': False}, secondary_preprocessor_args: Optional[Dict[str, Optional[Union[bool, str, Callable]]]] = None, warn_not_found_words: bool = False, *args: Any, **kwargs: Any) → Dict[str, Any][source]

Calculate the RNSB metric over the provided parameters.

Note: if you want to use the Bing Liu lexicon, the positive and negative words must be passed as the first and second attribute sets of the query, respectively. The scores produced by this metric vary from run to run because a new classifier instance is trained each time; their robustness can be improved by repeating the test several times and averaging the obtained scores, which is controlled by the num_iterations parameter.
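The following is a minimal usage sketch, not taken from the official WEFE documentation: the embedding model, the word lists and the import paths for Query, WordEmbeddingModel and RNSB are illustrative assumptions, so adjust them to your installation and to the bias query you want to test.

import gensim.downloader as api

from wefe.query import Query
from wefe.word_embedding_model import WordEmbeddingModel
from wefe.metrics import RNSB

# Wrap a pretrained embedding model (downloaded on first use).
model = WordEmbeddingModel(api.load("glove-wiki-gigaword-100"), "glove-100")

# Two target sets and two attribute sets. When a sentiment lexicon such as
# Bing Liu's is used, the positive words go in the first attribute set and
# the negative words in the second.
query = Query(
    [["she", "woman", "girl", "mother"], ["he", "man", "boy", "father"]],
    [["good", "nice", "excellent", "happy"], ["bad", "awful", "terrible", "sad"]],
    ["Female terms", "Male terms"],
    ["Positive words", "Negative words"],
)

# Averaging over several iterations reduces the classifier-induced variance.
result = RNSB().run_query(query, model, num_iterations=10)
print(result)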

Parameters
query : Query

A Query object that contains the target and attribute word sets to be tested.

word_embedding : WordEmbeddingModel

A WordEmbeddingModel object that contains a pretrained word embedding model.

estimator : BaseEstimator, optional

A scikit-learn classifier class that implements the predict_proba method, by default LogisticRegression.

estimator_params : dict, optional

Parameters that will be passed to the classifier (see the sketch after this parameter list), by default { ‘solver’: ‘liblinear’, ‘max_iter’: 10000, }.

num_iterations : int, optional

Number of times the metric is run; the results are then averaged. Repeating the test in this way strengthens the robustness of the obtained scores, by default 1.

random_state : Union[int, None], optional

Seed that makes the execution of the query reproducible. Warning: if a random_state other than None is provided together with num_iterations, every iteration will split the dataset and train the classifier using that same seed, so all iterations will produce identical results, by default None.

print_model_evaluation : bool, optional

Indicates whether the classifier evaluation is printed after the training process is completed, by default False.

lost_vocabulary_threshold : float, optional

Specifies the proportional limit of words that any set of the query is allowed to lose when transforming its words into embeddings. In the case that any set of the query loses proportionally more words than this limit, the result values will be np.nan, by default 0.2

preprocessor_args : PreprocessorArgs, optional

Dictionary with the arguments that specify how the pre-processing of the words will be done. The possible arguments for the function are:

  • lowercase: bool. Indicates if the words are transformed to lowercase.

  • strip_accents: bool, {‘ascii’, ‘unicode’}. Specifies if the accents of the words are eliminated. The stripping type can be specified; True uses ‘unicode’ by default.

  • preprocessor: Callable. It receives a function that operates on each word. In the case of specifying a function, it overrides the default preprocessor (i.e., the previous options stop working).

By default { ‘strip_accents’: False, ‘lowercase’: False, ‘preprocessor’: None, }.

secondary_preprocessor_args : PreprocessorArgs, optional

Dictionary with the arguments that specify how the secondary pre-processing of the words will be done, by default None. If a word is not found in the model’s vocabulary using the default preprocessor or the one specified in preprocessor_args, a second search for that word is performed using the preprocessor specified in this parameter.

warn_not_found_words : bool, optional

Specifies whether the function will warn (in the logger) about the words that were not found in the model’s vocabulary, by default False.
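The sketch below is an assumption-laden example rather than an official recipe; it shows how the parameters above can be combined, reusing the query and model objects from the example near the top. The SVC classifier and the fallback preprocessor are arbitrary illustrative choices.

from sklearn.svm import SVC

result = RNSB().run_query(
    query,
    model,
    # Any scikit-learn classifier class exposing predict_proba can be used;
    # probability=True is what enables predict_proba on SVC.
    estimator=SVC,
    estimator_params={"probability": True, "kernel": "linear"},
    # Fix the seed to make the run reproducible. Note the warning above:
    # combined with num_iterations > 1, every iteration reuses the same seed.
    random_state=42,
    # Lowercase the words and strip their accents before looking them up ...
    preprocessor_args={"lowercase": True, "strip_accents": True},
    # ... and, if a word is still not found, retry it with this fallback.
    secondary_preprocessor_args={"preprocessor": lambda w: w.replace("-", "")},
    warn_not_found_words=True,
)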

Returns
Dict[str, Any]

A dictionary with the query name, the calculated KL divergence, the negative probabilities for all tested target words, and the normalized distribution of probabilities.
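As a final illustrative sketch, the returned dictionary can simply be iterated; the exact key names are not hard-coded below because they may differ between WEFE versions, so inspect result.keys() on your installation.

result = RNSB().run_query(query, model, num_iterations=10)

# Print every entry: the RNSB score (the KL divergence), the per-word negative
# probabilities and the normalized probability distribution.
for key, value in result.items():
    print(f"{key}: {value}")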