wefe.preprocessing.get_embeddings_from_set
- wefe.preprocessing.get_embeddings_from_set(model: WordEmbeddingModel, word_set: Sequence[str], preprocessors: list[dict[str, str | bool | Callable]] = [{}], strategy: str = 'first', normalize: bool = False, verbose: bool = False) → tuple[list[str], dict[str, ndarray]]
Transform a sequence of words into a dictionary that maps each word to its embedding.
Words that are not in the model’s vocabulary (after applying the rules specified in the preprocessors) are discarded.
- Parameters:
model (WordEmbeddingModel) – A word embedding model.
word_set (Sequence[str]) – A sequence with the words that this function will convert to embeddings.
preprocessors (List[Dict[str, Union[str, bool, Callable]]]) –
A list with preprocessor options.
A preprocessor is a dictionary that specifies what processing(s) are performed on each word before it is looked up in the model vocabulary. For example, the preprocessor {'lowercase': True, 'strip_accents': True} lowercases each word and removes its accents before searching for it in the model vocabulary. Note that an empty dictionary {} indicates that no preprocessing is done.
The possible options for a preprocessor are:
lowercase: bool. Indicates that the words are transformed to lowercase.
uppercase: bool. Indicates that the words are transformed to uppercase.
titlecase: bool. Indicates that the words are transformed to titlecase.
strip_accents: bool, {'ascii', 'unicode'}. Specifies that the accents of the words are removed. The stripping type can be specified. True uses 'unicode' by default.
preprocessor: Callable. A function that operates on each word. If a function is specified, it overrides the default preprocessor (i.e., the previous options are ignored).
A list of preprocessor options allows you to search for several variants of the words in the model. For example, with the preprocessors [{}, {"lowercase": True, "strip_accents": True}], {} searches first for the original words in the vocabulary of the model. If some of them are not found, {"lowercase": True, "strip_accents": True} is applied to these words and they are searched for again in the model vocabulary. By default [{}].
strategy (str, optional) – Indicates how the preprocessed words are used: 'first' includes only the first transformed word found; 'all' includes all transformed words found. By default 'first'.
normalize (bool, optional) – True indicates that embeddings will be normalized, by default False
verbose (bool, optional) – Indicates whether the execution status of this function is printed, by default False
- Returns:
A tuple containing the words that could not be found and a dictionary with the found words and their corresponding embeddings.
- Return type:
tuple[list[str], dict[str, ndarray]]
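
Example: the sketch below illustrates the lookup behaviour described for preprocessors and strategy. It is not WEFE's implementation. It uses a plain dictionary as a stand-in for the model vocabulary, and the helper names apply_preprocessor and lookup_words are hypothetical. The order in which the options are applied and the accent-stripping method (Unicode decomposition) are assumptions made for illustration.

```python
import unicodedata
from typing import Callable, Dict, List, Sequence, Tuple, Union

Preprocessor = Dict[str, Union[str, bool, Callable[[str], str]]]


def apply_preprocessor(word: str, options: Preprocessor) -> str:
    """Apply one preprocessor dictionary to a word (hypothetical helper)."""
    custom = options.get("preprocessor")
    if callable(custom):
        # A custom callable overrides all other options.
        return custom(word)
    if options.get("lowercase"):
        word = word.lower()
    elif options.get("uppercase"):
        word = word.upper()
    elif options.get("titlecase"):
        word = word.title()
    if options.get("strip_accents"):
        # Assumption: decompose characters and drop combining marks.
        decomposed = unicodedata.normalize("NFKD", word)
        word = "".join(c for c in decomposed if not unicodedata.combining(c))
    return word


def lookup_words(
    vocab: Dict[str, List[float]],
    word_set: Sequence[str],
    preprocessors: Sequence[Preprocessor] = ({},),
    strategy: str = "first",
) -> Tuple[List[str], Dict[str, List[float]]]:
    """Return (not_found_words, found_embeddings) for a toy vocabulary."""
    not_found: List[str] = []
    found: Dict[str, List[float]] = {}
    for word in word_set:
        hits: Dict[str, List[float]] = {}
        # Try each preprocessor in order until a variant is found.
        for options in preprocessors:
            candidate = apply_preprocessor(word, options)
            if candidate in vocab:
                hits[candidate] = vocab[candidate]
                if strategy == "first":
                    break  # keep only the first variant found
        if hits:
            found.update(hits)
        else:
            not_found.append(word)
    return not_found, found


# Toy vocabulary standing in for a word embedding model.
vocab = {"Cat": [0.5, 0.5], "cat": [1.0, 0.0], "cafe": [0.0, 1.0]}
words = ["Cat", "Café", "dog"]
preprocessors = [{}, {"lowercase": True, "strip_accents": True}]

missing_first, found_first = lookup_words(vocab, words, preprocessors, "first")
missing_all, found_all = lookup_words(vocab, words, preprocessors, "all")
print(missing_first, list(found_first))
print(missing_all, list(found_all))
```

With strategy 'first', "Cat" is matched by the empty preprocessor and its lowercase variant is never tried. With 'all', both "Cat" and "cat" are kept. "Café" is found only after lowercasing and accent stripping, and "dog" ends up in the not-found list in both cases.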