wefe.WordEmbeddingModel

class wefe.WordEmbeddingModel(wv: gensim.models.keyedvectors.KeyedVectors, name: Optional[str] = None, vocab_prefix: Optional[str] = None)

A wrapper for Word Embedding pre-trained models.

It can hold gensim’s KeyedVectors or gensim’s api loaded models. It includes the name of the model and some vocab prefix if needed.

__init__(wv: gensim.models.keyedvectors.KeyedVectors, name: Optional[str] = None, vocab_prefix: Optional[str] = None)

Initialize the word embedding model.

Parameters
wv : KeyedVectors

A word embedding instance loaded through gensim’s KeyedVectors interface or gensim’s API.

name : str, optional

The name of the model, by default None.

vocab_prefix : str, optional

A prefix that will be concatenated with each word in the model vocabulary, by default None.

Raises
TypeError

if wv is not a KeyedVectors instance.

TypeError

if name is not None and not an instance of str.

TypeError

if vocab_prefix is not None and not an instance of str.

Examples

>>> from gensim.test.utils import common_texts
>>> from gensim.models import Word2Vec
>>> from wefe.word_embedding_model import WordEmbeddingModel
>>> dummy_model = Word2Vec(common_texts, window=5,
...                        min_count=1, workers=1).wv
>>> model = WordEmbeddingModel(dummy_model, 'Dummy model dim=10',
...                            vocab_prefix='/en/')
>>> print(model.name)
Dummy model dim=10
>>> print(model.vocab_prefix)
/en/
Attributes
wv : KeyedVectors

The model.

vocab : dict

The vocabulary of the model (a dict with the words that have an associated embedding in the model).

name : str

The name of the model.

vocab_prefix : str

A prefix that will be concatenated with each word of the vocab of the model.
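As an illustration of what vocab_prefix is for, here is a minimal sketch that assumes the prefix is prepended to each query word before it is looked up in the vocabulary. The function lookup_with_prefix and the toy vocabulary are hypothetical and are not part of WEFE.

```python
from typing import Dict, List, Optional


def lookup_with_prefix(
    vocab: Dict[str, List[float]],
    word: str,
    vocab_prefix: Optional[str] = None,
) -> Optional[List[float]]:
    """Return the vector stored under prefix + word, or None if absent.

    Hypothetical helper: it only illustrates the idea of a vocabulary
    whose keys carry a prefix such as '/en/'.
    """
    key = word if vocab_prefix is None else vocab_prefix + word
    return vocab.get(key)


# Toy vocabulary whose keys carry a language prefix.
toy_vocab = {"/en/cat": [0.1, 0.2], "/en/dog": [0.3, 0.4]}

cat_vector = lookup_with_prefix(toy_vocab, "cat", vocab_prefix="/en/")
missing = lookup_with_prefix(toy_vocab, "cat")  # no prefix, so no match
```

Without the prefix, the plain word "cat" does not match any key, which is why the wrapper keeps the prefix alongside the model.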

batch_update(words: Sequence[str], embeddings: Union[Sequence[numpy.ndarray], numpy.ndarray])

Update a batch of embeddings.

This method calls update once for each word-embedding pair. Every word must be in the vocabulary, and words and embeddings must have the same number of elements; otherwise an exception is raised.

Parameters
words : Sequence[str]

A sequence (list, tuple or np.array) that contains the words whose representations will be updated.

embeddings : Union[Sequence[np.ndarray], np.ndarray]

A sequence (list or tuple) of embeddings, or a single np.ndarray that contains all the new embeddings. The embeddings must have the same size and data type as the model’s embeddings.

Raises
TypeError

if words is not a list, tuple or np.ndarray.

TypeError

if embeddings is not a list, tuple or np.ndarray.

Exception

if words and embeddings do not have the same number of elements.
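The pairing behaviour described for batch_update can be sketched with a plain dict standing in for the model. This is not WEFE’s implementation: batch_update_sketch and toy_vectors are hypothetical names, the accepted container types are an assumption, and the per-pair assignment stands in for a call to update.

```python
import numpy as np


def batch_update_sketch(vectors, words, embeddings):
    """Replace several vectors at once in a toy word -> vector dict."""
    if not isinstance(words, (list, tuple, np.ndarray)):
        raise TypeError("words must be a list, tuple or np.ndarray")
    if not isinstance(embeddings, (list, tuple, np.ndarray)):
        raise TypeError("embeddings must be a list, tuple or np.ndarray")
    # The reference only says "Exception"; ValueError is this sketch's choice.
    if len(words) != len(embeddings):
        raise ValueError("words and embeddings must have the same length")
    # One update per word-embedding pair.
    for word, embedding in zip(words, embeddings):
        if word not in vectors:
            raise ValueError(f"'{word}' is not in the vocabulary")
        vectors[word] = np.asarray(embedding)


toy_vectors = {
    "cat": np.zeros(2, dtype=np.float32),
    "dog": np.zeros(2, dtype=np.float32),
}
# A 2-D array is iterated row by row, so each row pairs with one word.
new_rows = np.array([[1.0, 2.0], [3.0, 4.0]], dtype=np.float32)
batch_update_sketch(toy_vectors, ["cat", "dog"], new_rows)
```

Passing a 2-D array works the same way as passing a list of 1-D arrays, because iterating over the array yields one row per word.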

normalize()

Normalize word embeddings in the model by using the L2 norm.

Uses the init_sims method of gensim’s KeyedVectors class. Warning: this operation is performed in place. In other words, it replaces the embeddings with their L2-normalized versions.
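To show what in-place L2 normalization does to the stored vectors, here is a small NumPy sketch on a toy matrix. It does not call gensim or WEFE; the array named vectors is a hypothetical stand-in for the model’s embedding matrix.

```python
import numpy as np

# Toy embedding matrix: one row per word.
vectors = np.array([[3.0, 4.0], [0.0, 2.0]], dtype=np.float32)
original = vectors  # second reference to the same array

# L2 norm of each row, kept as a column so it broadcasts over the rows.
norms = np.linalg.norm(vectors, axis=1, keepdims=True)

# In-place division: the original values are overwritten, not copied.
vectors /= norms
```

After the division every row has unit length, and because the array was modified in place, any other reference to it (here original) sees the normalized values as well. Keep a copy beforehand if the raw vectors are still needed.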

update(word: str, embedding: numpy.ndarray)

Update the value of an embedding of the model.

If the method is executed with a word that is not in the vocabulary, an exception will be raised.

Parameters
wordstr

The word whose embedding will be replaced. This word must be in the model’s vocabulary.

embeddingnp.ndarray

An embedding representing the word. It must have the same dimensions and data type as the model embeddings.

Raises
TypeError

if word is not a string.

TypeError

if embedding is not an np.ndarray.

ValueError

if word is not in the model’s vocabulary.

ValueError

if the embedding does not have the same size as the model’s embeddings.

ValueError

if the dtype of the embedding values is not the same as the model’s embeddings.
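The checks listed above can be sketched in the order they are documented, using a plain dict in place of the model. This is an illustrative reconstruction under stated assumptions, not WEFE’s source: update_sketch and toy_model are hypothetical names, and the error messages are invented for the example.

```python
import numpy as np


def update_sketch(vectors, word, embedding):
    """Replace one vector in a toy word -> vector dict after validation."""
    if not isinstance(word, str):
        raise TypeError("word must be a string")
    if not isinstance(embedding, np.ndarray):
        raise TypeError("embedding must be an np.ndarray")
    if word not in vectors:
        raise ValueError(f"'{word}' is not in the vocabulary")
    current = vectors[word]
    if embedding.shape != current.shape:
        raise ValueError("embedding size differs from the model's embeddings")
    if embedding.dtype != current.dtype:
        raise ValueError("embedding dtype differs from the model's embeddings")
    vectors[word] = embedding


toy_model = {"cat": np.zeros(3, dtype=np.float32)}

# Matching shape and dtype: the vector is replaced.
update_sketch(toy_model, "cat", np.ones(3, dtype=np.float32))

# Same shape but float64 instead of float32: rejected.
try:
    update_sketch(toy_model, "cat", np.ones(3, dtype=np.float64))
except ValueError as error:
    dtype_error = str(error)
```

A common way to avoid the dtype error is to cast the new vector with astype to the model’s dtype before passing it in.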