wefe.preprocessing.get_embeddings_from_tuples

wefe.preprocessing.get_embeddings_from_tuples(model: WordEmbeddingModel, sets: Sequence[Sequence[str]], sets_name: str | None = None, preprocessors: list[dict[str, str | bool | Callable]] = [{}], strategy: str = 'first', normalize: bool = False, discard_incomplete_sets: bool = True, warn_lost_sets: bool = True, verbose: bool = False) → list[dict[str, ndarray]]

Given a sequence of word sets, obtain their corresponding embeddings.

Parameters:
  • model (WordEmbeddingModel) – The word embedding model whose vocabulary is searched for the words.

  • sets (Sequence[Sequence[str]]) – A sequence containing word sets. Example: [['woman', 'man'], ['she', 'he'], ['mother', 'father'], ...].

  • sets_name (str, optional) – The name of the collection of word sets, for example "defining sets". This parameter is used only for printing. By default None.

  • preprocessors (List[Dict[str, Union[str, bool, Callable]]]) –

    A list with preprocessor options.

    A preprocessor is a dictionary that specifies which transformations are applied to each word before it is looked up in the model vocabulary. For example, the preprocessor {'lowercase': True, 'strip_accents': True} lowercases each word and removes its accents before searching for it in the model vocabulary. An empty dictionary {} indicates that no preprocessing is done.

    The possible options for a preprocessor are:

    • lowercase: bool. Indicates that the words are transformed to lowercase.

    • uppercase: bool. Indicates that the words are transformed to uppercase.

    • titlecase: bool. Indicates that the words are transformed to titlecase.

    • strip_accents: bool, {'ascii', 'unicode'}. Indicates that accents are removed from the words. The stripping method can be given explicitly; True uses 'unicode'.

    • preprocessor: Callable. A function that operates on each word. When a function is specified, it overrides the default preprocessor (i.e., the previous options are ignored).

    A list of preprocessor options allows you to search the model for several variants of each word. For example, with the preprocessors [{}, {"lowercase": True, "strip_accents": True}], {} first searches for the original words in the model vocabulary. If some of them are not found, {"lowercase": True, "strip_accents": True} is applied to those words and they are searched for again. By default [{}].

  • strategy (str, optional) – Indicates how the preprocessed words are used: 'first' includes only the first transformed word found, while 'all' includes all transformed words found. By default 'first'.

  • normalize (bool, optional) – If True, the embeddings are normalized. By default False.

  • discard_incomplete_sets (bool, optional) – If True, a set that could not be completely converted to embeddings is discarded. By default True.

  • warn_lost_sets (bool, optional) – Indicates whether a warning is logged for each word set that cannot be fully converted to embeddings. By default True.

  • verbose (bool, optional) – Indicates whether the execution status of this function is printed, by default False
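
The interaction between preprocessors and strategy can be illustrated with a minimal pure-Python sketch. It is not WEFE's implementation: the helper names apply_preprocessor and find_words, the toy vocabulary, and the order in which the options are applied are assumptions made for illustration, and only the 'unicode' accent-stripping variant is modeled.

```python
import unicodedata


def strip_accents_unicode(word: str) -> str:
    """Drop combining marks after NFKD decomposition (e.g. 'é' -> 'e')."""
    decomposed = unicodedata.normalize("NFKD", word)
    return "".join(ch for ch in decomposed if not unicodedata.combining(ch))


def apply_preprocessor(word: str, options: dict) -> str:
    """Hypothetical helper: apply one preprocessor dictionary to one word."""
    custom = options.get("preprocessor")
    if callable(custom):
        # A custom callable overrides every other option.
        return custom(word)
    if options.get("lowercase"):
        word = word.lower()
    elif options.get("uppercase"):
        word = word.upper()
    elif options.get("titlecase"):
        word = word.title()
    if options.get("strip_accents"):
        word = strip_accents_unicode(word)
    return word


def find_words(word, preprocessors, vocab, strategy="first"):
    """Hypothetical helper: try each preprocessor in order against vocab."""
    found = []
    for options in preprocessors:
        candidate = apply_preprocessor(word, options)
        if candidate in vocab and candidate not in found:
            found.append(candidate)
            if strategy == "first":
                break  # stop at the first variant present in the vocabulary
    return found


# Toy vocabulary standing in for the model vocabulary (an assumption).
vocab = {"cafe", "Madrid", "madrid"}
preprocessors = [{}, {"lowercase": True, "strip_accents": True}]

print(find_words("Café", preprocessors, vocab))                    # ['cafe']
print(find_words("Madrid", preprocessors, vocab))                  # ['Madrid']
print(find_words("Madrid", preprocessors, vocab, strategy="all"))  # ['Madrid', 'madrid']
```

In this sketch, {} is tried first, so a word already present in the vocabulary is found unchanged; the second preprocessor only matters when the original form is missing or when strategy is 'all'.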

Returns:

A list of dictionaries, one per word set. Each dictionary has the words of the set as keys and their associated embeddings as values.

Return type:

List[EmbeddingDict]
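
The shape of the returned value and the effect of discard_incomplete_sets and normalize can be sketched in plain Python. This is an illustration under stated assumptions, not the library's code: embed_sets and the toy vectors are hypothetical, normalization is assumed to mean scaling to unit L2 norm, and an incomplete set is assumed to be kept with only its found words when discard_incomplete_sets is False.

```python
import math


def embed_sets(sets, vectors, normalize=False, discard_incomplete_sets=True):
    """Hypothetical helper: map each word set to a {word: embedding} dict."""
    result = []
    for word_set in sets:
        embedded = {}
        for word in word_set:
            if word not in vectors:
                continue  # word missing from the toy "model"
            vec = list(vectors[word])
            if normalize:
                # Assumption: normalization means unit L2 norm.
                norm = math.sqrt(sum(x * x for x in vec))
                if norm > 0:
                    vec = [x / norm for x in vec]
            embedded[word] = vec
        complete = len(embedded) == len(word_set)
        if complete or not discard_incomplete_sets:
            result.append(embedded)
    return result


# Toy vectors standing in for the model (an assumption); 'he' is missing.
vectors = {"woman": [3.0, 4.0], "man": [0.0, 2.0], "she": [1.0, 0.0]}
sets = [["woman", "man"], ["she", "he"]]

complete_only = embed_sets(sets, vectors)
with_partial = embed_sets(sets, vectors, discard_incomplete_sets=False)
unit_length = embed_sets(sets, vectors, normalize=True)

print(len(complete_only))         # 1: the ['she', 'he'] set is discarded
print(len(with_partial))          # 2: the incomplete set is kept
print(list(with_partial[1]))      # ['she']
```

Each element of the result is one dictionary per retained word set, which mirrors the list[dict[str, ndarray]] return type, with plain lists used here in place of ndarray.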