wefe.WordEmbeddingModel

class wefe.WordEmbeddingModel(model: gensim.models.keyedvectors.KeyedVectors, model_name: Optional[str] = None, vocab_prefix: Optional[str] = None)

A container for pre-trained word embedding models.

It can hold a gensim KeyedVectors instance or a model loaded through gensim’s API, together with the name of the model and an optional vocabulary prefix.

__init__(model: gensim.models.keyedvectors.KeyedVectors, model_name: Optional[str] = None, vocab_prefix: Optional[str] = None)

Initializes the container.

Parameters

model (BaseKeyedVectors): A word embedding model loaded through gensim’s KeyedVectors interface or gensim’s API.
model_name (str, optional): The name of the model, by default None.
vocab_prefix (str, optional): A prefix that will be concatenated with every word in the model vocabulary, by default None.

Raises

TypeError: If model is not a KeyedVectors instance.
TypeError: If model_name is not None and not an instance of str.
TypeError: If vocab_prefix is not None and not an instance of str.

Examples

>>> from gensim.test.utils import common_texts
>>> from gensim.models import Word2Vec
>>> from wefe.word_embedding_model import WordEmbeddingModel

>>> dummy_model = Word2Vec(common_texts, window=5,
...                        min_count=1, workers=1).wv

>>> model = WordEmbeddingModel(dummy_model, 'Dummy model dim=10',
...                            vocab_prefix='/en/')
>>> print(model.model_name)
Dummy model dim=10
>>> print(model.vocab_prefix)
/en/

Attributes

model (BaseKeyedVectors): The model.
vocab (dict): The vocabulary of the model (a dict with the words that have an associated embedding in the model).
model_name (str): The name of the model.
vocab_prefix (str): A prefix that will be concatenated with each word of the vocabulary of the model.

get_embeddings_from_query(query: wefe.query.Query, lost_vocabulary_threshold: float = 0.2, preprocessor_args: Dict[str, Optional[Union[bool, str, Callable]]] = {}, secondary_preprocessor_args: Optional[Dict[str, Optional[Union[bool, str, Callable]]]] = None, warn_not_found_words: bool = False) → Optional[Tuple[Dict[str, Dict[str, numpy.ndarray]], Dict[str, Dict[str, numpy.ndarray]]]]

Obtains the word vectors associated with the provided Query.

Words that do not appear in the vocabulary of the pre-trained word embedding model under the specified pre-processing are discarded. If any set of the query loses proportionally more words than the specified threshold, the function returns None.

Parameters

query (Query)

The query to be processed.

lost_vocabulary_threshold (float, optional, by default 0.2)

The maximum proportion of words that any set of the query is allowed to lose when its words are transformed into embeddings. If any set of the query loses proportionally more words than this limit, this method returns None.
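The threshold check described above can be sketched in plain Python. This is an illustration only, not WEFE's implementation: the helper names `lost_fraction` and `exceeds_threshold` are hypothetical, the vocabulary is a toy set, and the strict `>` comparison is an assumption about how the boundary case is handled.

```python
from typing import Iterable, Set


def lost_fraction(words: Iterable[str], vocab: Set[str]) -> float:
    """Return the fraction of `words` that have no entry in `vocab`."""
    words = list(words)
    if not words:
        return 0.0
    missing = [w for w in words if w not in vocab]
    return len(missing) / len(words)


def exceeds_threshold(
    words: Iterable[str], vocab: Set[str], lost_vocabulary_threshold: float = 0.2
) -> bool:
    """True when the set lost proportionally more words than allowed.

    Assumption: the comparison is strict; the library may treat the
    boundary case differently.
    """
    return lost_fraction(words, vocab) > lost_vocabulary_threshold


# Toy vocabulary standing in for the model's vocab.
vocab = {"she", "he", "woman", "man", "girl"}
word_set = ["she", "woman", "girl", "queen", "lady"]  # 2 of 5 words are missing

fraction = lost_fraction(word_set, vocab)
too_many_lost = exceeds_threshold(word_set, vocab, 0.2)
```

Here two of the five words are missing, so the set exceeds the default limit of 0.2; in that situation the method documented here would return None for the whole query.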

preprocessor_args (PreprocessorArgs, optional)

Dictionary with the arguments that specify how the words are pre-processed, by default {}. The possible arguments are:

- lowercase: bool. Indicates whether the words are transformed to lowercase.
- strip_accents: bool or one of {‘ascii’, ‘unicode’}. Specifies whether the accents of the words are removed. The stripping type can be specified; True uses ‘unicode’.
- preprocessor: Callable. A function that operates on each word. If a function is specified, it overrides the default preprocessor (i.e., the previous options are ignored).
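A minimal sketch of how the three options could interact, written with the standard library only. The function `preprocess_word` below is a hypothetical stand-in, not the library's own code; the accent-stripping helpers assume NFKD-based behaviour for ‘unicode’ and ‘ascii’, and the order of the steps is also an assumption.

```python
import unicodedata
from typing import Callable, Optional, Union


def strip_accents_unicode(word: str) -> str:
    """Decompose characters and drop the combining marks (accents)."""
    decomposed = unicodedata.normalize("NFKD", word)
    return "".join(ch for ch in decomposed if not unicodedata.combining(ch))


def strip_accents_ascii(word: str) -> str:
    """Decompose characters and drop everything that is not ASCII."""
    decomposed = unicodedata.normalize("NFKD", word)
    return decomposed.encode("ascii", "ignore").decode("ascii")


def preprocess_word(
    word: str,
    lowercase: bool = False,
    strip_accents: Union[bool, str] = False,
    preprocessor: Optional[Callable[[str], str]] = None,
) -> str:
    # A custom preprocessor overrides the other options entirely.
    if preprocessor is not None:
        return preprocessor(word)
    if lowercase:
        word = word.lower()
    # True falls back to the 'unicode' method.
    if strip_accents is True or strip_accents == "unicode":
        word = strip_accents_unicode(word)
    elif strip_accents == "ascii":
        word = strip_accents_ascii(word)
    return word


plain = preprocess_word("José")
normalized = preprocess_word("José", lowercase=True, strip_accents=True)
custom = preprocess_word("José", lowercase=True, preprocessor=str.upper)
```

In the last call the `lowercase` flag has no effect, because the custom callable replaces the default pre-processing.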

secondary_preprocessor_args (PreprocessorArgs, optional)

Dictionary with the arguments that specify how the secondary pre-processing of the words is done, by default None. If a word is not found in the model’s vocabulary using the primary preprocessor (the default one or the one specified in preprocessor_args), the function searches for that word a second time using the preprocessor specified in this parameter.

Example: suppose the query contains the word “John” and only the lowercase version “john” is in the model’s vocabulary. With the default preprocessor_args (kept so as not to affect the search for other words that may appear capitalized in the model), the function cannot retrieve the representation of “John”, even though its lowercase form exists. Passing {‘lowercase’: True} in secondary_preprocessor_args makes the function also look for the lowercase version “john”, without affecting the first preprocessor. The secondary preprocessor is therefore used only as a fallback when the first one fails.
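The fallback search can be sketched as a two-step lookup. The code below is an illustration under stated assumptions: `lookup` is a hypothetical helper, the vocabulary is a toy dict mapping words to short lists instead of numpy arrays, and applying the secondary preprocessor to the original word (rather than to the output of the first one) is an assumption.

```python
from typing import Callable, Dict, List, Optional

Vector = List[float]


def lookup(
    word: str,
    vocab: Dict[str, Vector],
    primary: Callable[[str], str],
    secondary: Optional[Callable[[str], str]] = None,
) -> Optional[Vector]:
    """Try the primary preprocessor first, then the secondary as a fallback."""
    candidate = primary(word)
    if candidate in vocab:
        return vocab[candidate]
    if secondary is not None:
        candidate = secondary(word)
        if candidate in vocab:
            return vocab[candidate]
    return None


def identity(word: str) -> str:
    """Stands in for the default preprocessor (no transformation)."""
    return word


# Toy vocabulary: "john" exists only in lowercase, "Paris" only capitalized.
vocab = {"john": [0.1, 0.2], "Paris": [0.3, 0.4]}

without_fallback = lookup("John", vocab, identity)
with_fallback = lookup("John", vocab, identity, secondary=str.lower)
capitalized = lookup("Paris", vocab, identity, secondary=str.lower)
```

Because the primary lookup runs first, “Paris” is still found in its capitalized form; the lowercase fallback only applies to words the first attempt misses.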

warn_not_found_words (bool, optional)

A flag that indicates whether the function logs a warning listing the words that were not found in the model’s vocabulary, by default False.

Returns

Union[Tuple[EmbeddingSets, EmbeddingSets], None]: A tuple of dictionaries containing the target and attribute sets, or None if any set lost proportionally more words than the allowed threshold.

Raises

TypeError: If query is not an instance of Query.
TypeError: If lost_vocabulary_threshold is not a float.
TypeError: If preprocessor_args is not a dictionary.
TypeError: If secondary_preprocessor_args is not a dictionary.
TypeError: If warn_not_found_words is not a boolean.

get_embeddings_from_word_set(word_set: List[str], preprocessor_args: Dict[str, Optional[Union[bool, str, Callable]]] = {}, secondary_preprocessor_args: Optional[Dict[str, Optional[Union[bool, str, Callable]]]] = None) → Tuple[List[str], Dict[str, numpy.ndarray]]

Transforms a set of words into their respective embeddings and discards the words that are not in the model’s vocabulary (according to the rules specified in the preprocessors).

Parameters

word_set (List[str])

The list/array with the words that will be transformed.

preprocessor_args (PreprocessorArgs, optional)

Dictionary with the arguments that specify how the words are pre-processed, by default {}. The options for the dict are:

- lowercase: bool. Indicates whether the words are transformed to lowercase.
- strip_accents: bool or one of {‘ascii’, ‘unicode’}. Specifies whether the accents of the words are removed. The stripping type can be specified; True uses ‘unicode’.
- preprocessor: Callable. A function that operates on each word. If a function is specified, it overrides the default preprocessor (i.e., the previous options are ignored).

secondary_preprocessor_args (PreprocessorArgs, optional)

Dictionary with arguments for pre-processing words (same options as the previous parameter), by default None. If a word is not found in the model’s vocabulary using the primary preprocessor (the default one or the one specified in preprocessor_args), the function searches for that word a second time using the preprocessor specified in this parameter.

Returns

Tuple[List[str], Dict[str, np.ndarray]]: A tuple with a list of missing words and a dictionary that maps words to embeddings.
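The shape of the return value, and the role of vocab_prefix, can be illustrated with a small stand-alone sketch. It does not call WEFE: `embeddings_from_word_set` is a hypothetical function, vectors are plain lists instead of numpy arrays, and keying the result by the original (unprefixed) word is an assumption that may differ from what the library does.

```python
from typing import Dict, List, Optional, Sequence, Tuple

Vector = List[float]


def embeddings_from_word_set(
    word_set: Sequence[str],
    vectors: Dict[str, Vector],
    vocab_prefix: Optional[str] = None,
) -> Tuple[List[str], Dict[str, Vector]]:
    """Split `word_set` into missing words and a word-to-vector mapping."""
    missing: List[str] = []
    found: Dict[str, Vector] = {}
    for word in word_set:
        # The prefix is concatenated with the word before the lookup.
        key = word if vocab_prefix is None else vocab_prefix + word
        if key in vectors:
            found[word] = vectors[key]  # assumption: keyed by the original word
        else:
            missing.append(word)
    return missing, found


# Toy model whose vocabulary entries carry an "/en/" prefix.
vectors = {"/en/cat": [1.0, 0.0], "/en/dog": [0.0, 1.0]}

missing, found = embeddings_from_word_set(
    ["cat", "dog", "axolotl"], vectors, vocab_prefix="/en/"
)
```

The first element of the tuple lists the words that could not be resolved, and the second maps each resolved word to its vector, mirroring the documented return type.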
