wefe.word_embedding_model.WordEmbeddingModel

class wefe.word_embedding_model.WordEmbeddingModel(wv: KeyedVectors, name: str | None = None, vocab_prefix: str | None = None)[source]

Bases: object

A wrapper for Word Embedding pre-trained models.

It can hold gensim KeyedVectors or models loaded through gensim’s API. It also stores the model’s name and an optional vocabulary prefix.

__init__(wv: KeyedVectors, name: str | None = None, vocab_prefix: str | None = None) None[source]

Initialize the word embedding model.

Parameters:
  • wv (KeyedVectors) – Word embeddings loaded through gensim’s KeyedVectors interface or gensim’s API.

  • name (str, optional) – The name of the model, by default None.

  • vocab_prefix (str, optional) – A prefix that will be concatenated with every word in the model’s vocabulary, by default None.

Raises:
  • TypeError – if wv is not a KeyedVectors instance.

  • TypeError – if name is not None and not an instance of str.

  • TypeError – if vocab_prefix is not None and not an instance of str.

Examples

>>> from gensim.test.utils import common_texts
>>> from gensim.models import Word2Vec
>>> from wefe.word_embedding_model import WordEmbeddingModel
>>> dummy_model = Word2Vec(common_texts, vector_size=10, window=5,
...                        min_count=1, workers=1).wv
>>> model = WordEmbeddingModel(dummy_model, 'Dummy model dim=10',
...                            vocab_prefix='/en/')
>>> print(model.name)
Dummy model dim=10
>>> print(model.vocab_prefix)
/en/
batch_update(words: Sequence[str], embeddings: Sequence[ndarray] | ndarray) None[source]

Update a batch of embeddings in the model.

This method updates the embeddings for a given sequence of words efficiently by leveraging NumPy’s advanced indexing. All validations (word existence, embedding shape, and data type) are performed collectively before any modifications are applied to the model. This ensures atomicity: either all updates succeed, or none do.

Parameters:
  • words (Sequence[str]) – A sequence (list, tuple, or np.ndarray) containing the words whose representations will be updated. All words must already exist in the model’s vocabulary and must be strings.

  • embeddings (Union[Sequence[np.ndarray], np.ndarray]) – A sequence (list or tuple) of NumPy arrays, or a 2D NumPy array, that contains all the new embeddings. Each embedding must be a 1D NumPy array with the same size and data type as the model’s embeddings. The length of embeddings must match the length of words.

Raises:
  • TypeError – If words is not a sequence of strings, or if embeddings is not a sequence of NumPy arrays or a single NumPy array. Also, if individual elements within words are not strings, or elements within embeddings are not NumPy arrays.

  • ValueError – If words and embeddings do not have the same number of elements. If any word in words is not found in the model’s vocabulary. If any embedding has a different dimension than the model’s embeddings. If any embedding has a data type incompatible with the model’s embeddings.

Examples

>>> from gensim.test.utils import common_texts
>>> from gensim.models import Word2Vec
>>> from wefe.word_embedding_model import WordEmbeddingModel
>>> import numpy as np
>>> # Create a dummy WordEmbeddingModel
>>> kv_model = Word2Vec(common_texts, vector_size=10, min_count=1).wv
>>> model = WordEmbeddingModel(kv_model, 'Dummy Model')
>>> original_embedding_computer = model['computer']
>>> original_embedding_system = model['system']
>>> # Prepare words and new embeddings (both words occur in common_texts)
>>> words_to_update = ['computer', 'system']
>>> new_embeddings = [
...     np.zeros(10, dtype=model.wv.vectors.dtype),
...     np.ones(10, dtype=model.wv.vectors.dtype)
... ]
>>> # Update embeddings
>>> model.batch_update(words_to_update, new_embeddings)
>>> # Verify updates
>>> assert np.all(model['computer'] == np.zeros(10))
>>> assert np.all(model['system'] == np.ones(10))
>>> # Test with missing word (will raise error)
>>> try:
...     model.batch_update(['nonexistent_word'], [np.zeros(10)])
... except ValueError as e:
...     print(e)
The following words are not in the model's vocabulary: nonexistent_word.
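
The validate-then-write behaviour described above can be illustrated without WEFE or gensim. The following is a minimal sketch, not the library’s implementation: `batch_update_sketch`, `vectors`, and `key_to_index` are hypothetical stand-ins for the model’s internals, and the order of the checks is an assumption.

```python
import numpy as np


def batch_update_sketch(vectors, key_to_index, words, embeddings):
    """Hypothetical helper: run every check first, then write all rows at once."""
    new_rows = np.asarray(embeddings)
    if len(words) != len(new_rows):
        raise ValueError("words and embeddings must have the same length.")
    missing = [w for w in words if w not in key_to_index]
    if missing:
        raise ValueError(f"Words not in the vocabulary: {', '.join(missing)}.")
    if new_rows.ndim != 2 or new_rows.shape[1] != vectors.shape[1]:
        raise ValueError("Each embedding must match the model's vector size.")
    if new_rows.dtype != vectors.dtype:
        raise ValueError("Embeddings must match the model's dtype.")
    # Nothing has been modified so far. Advanced indexing with a list of
    # row indices assigns all target rows in a single operation.
    rows = [key_to_index[w] for w in words]
    vectors[rows] = new_rows


# Toy stand-in for a model: 3 words, 4 dimensions, float32.
key_to_index = {"human": 0, "computer": 1, "system": 2}
vectors = np.arange(12, dtype=np.float32).reshape(3, 4)
before = vectors.copy()

# A batch with one unknown word is rejected as a whole.
try:
    batch_update_sketch(
        vectors,
        key_to_index,
        ["human", "missing_word"],
        np.zeros((2, 4), dtype=np.float32),
    )
except ValueError:
    pass
unchanged_after_failure = np.array_equal(vectors, before)

# A fully valid batch is applied in one step.
batch_update_sketch(
    vectors,
    key_to_index,
    ["human", "system"],
    np.stack([np.zeros(4, dtype=np.float32), np.ones(4, dtype=np.float32)]),
)
```

Because every check runs before the assignment, the failed call leaves `vectors` identical to `before`, which is the atomicity property the method documents.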
get(word: str, default: Any | None = None) ndarray[Any, dtype[float64]] | None[source]

Retrieve a word’s embedding, returning a default value if not found.
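
The lookup-with-fallback behaviour can be pictured with a small sketch. It does not use WEFE itself: `get_sketch`, `vectors`, and `key_to_index` are hypothetical names that stand in for the model and its vocabulary.

```python
import numpy as np


def get_sketch(vectors, key_to_index, word, default=None):
    """Hypothetical helper: return the word's row, or `default` if it is unknown."""
    index = key_to_index.get(word)
    if index is None:
        return default
    return vectors[index]


key_to_index = {"human": 0, "computer": 1, "system": 2}
vectors = np.arange(6, dtype=np.float32).reshape(3, 2)

known = get_sketch(vectors, key_to_index, "computer")
unknown = get_sketch(vectors, key_to_index, "unicorn")
fallback = get_sketch(vectors, key_to_index, "unicorn", default=np.zeros(2))
```

Passing an explicit default such as a zero vector lets calling code treat known and unknown words uniformly instead of checking for None.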

normalize() None[source]

Normalize the word vectors to unit L2 length.

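This method uses the underlying gensim functionality to perform L2 normalization. The model’s vectors are modified in-place. Warning: This is a destructive operation; the original, unnormalized vectors are overwritten and cannot be recovered from the model afterwards.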

Raises:
  • AttributeError – If the underlying model does not support normalization.
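
To make the effect of in-place L2 normalization concrete, here is a minimal NumPy sketch. It is not the gensim routine that the method delegates to: `normalize_rows_in_place` is a hypothetical helper, and the guard for all-zero rows is an assumption of this sketch.

```python
import numpy as np


def normalize_rows_in_place(vectors):
    """Hypothetical helper: scale every row of `vectors` to unit L2 length."""
    norms = np.linalg.norm(vectors, axis=1, keepdims=True)
    # Assumption of this sketch: leave all-zero rows untouched
    # instead of dividing by zero.
    norms[norms == 0] = 1.0
    # In-place division overwrites the original values.
    vectors /= norms


vectors = np.array([[3.0, 4.0], [0.0, 2.0]], dtype=np.float32)
original = vectors.copy()  # keep a copy if the raw vectors are still needed
normalize_rows_in_place(vectors)
row_norms = np.linalg.norm(vectors, axis=1)
```

After the call, each row has length 1 and the first row becomes (0.6, 0.8). The raw values survive only in `original`, which is why taking a copy beforehand matters when the unnormalized vectors are needed later.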

update(word: str, embedding: ndarray[Any, dtype[float64]]) None[source]

Update the value of an embedding of the model.

If the method is executed with a word that is not in the vocabulary, an exception will be raised.

Parameters:
  • word (str) – The word to update. It must already exist in the vocabulary.

  • embedding (NDArray[np.float64]) – The new embedding for the word. Must match the model’s vector size and dtype.

Raises:
  • TypeError – if word is not a string.

  • TypeError – if embedding is not an np.ndarray.

  • ValueError – if word is not in the model’s vocabulary.

  • ValueError – if the embedding does not have the same size as the model’s embeddings.

  • ValueError – if the dtype of the embedding values is not the same as the model’s embeddings.
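
The dtype requirement is easy to trip over, because NumPy creates float64 arrays by default while a model’s vectors may use another dtype such as float32. The sketch below mirrors the documented checks on a toy matrix. It is not WEFE’s code: `update_sketch`, `vectors`, and `key_to_index` are hypothetical, and the order of the checks is an assumption.

```python
import numpy as np


def update_sketch(vectors, key_to_index, word, embedding):
    """Hypothetical helper mirroring the documented checks for a single word."""
    if not isinstance(word, str):
        raise TypeError("word must be a string.")
    if not isinstance(embedding, np.ndarray):
        raise TypeError("embedding must be a NumPy array.")
    if word not in key_to_index:
        raise ValueError(f"'{word}' is not in the vocabulary.")
    if embedding.shape != (vectors.shape[1],):
        raise ValueError("embedding must match the model's vector size.")
    if embedding.dtype != vectors.dtype:
        raise ValueError("embedding must match the model's dtype.")
    vectors[key_to_index[word]] = embedding


key_to_index = {"human": 0, "computer": 1}
vectors = np.zeros((2, 3), dtype=np.float32)
new_embedding = np.array([0.1, 0.2, 0.3])  # float64 by default

# The float64 array is rejected because the toy matrix is float32.
try:
    update_sketch(vectors, key_to_index, "human", new_embedding)
    dtype_error_raised = False
except ValueError:
    dtype_error_raised = True

# Casting to the matrix dtype first makes the update go through.
update_sketch(vectors, key_to_index, "human", new_embedding.astype(vectors.dtype))
```

With the real class, the analogous precaution would be to cast the new embedding with `astype(model.wv.vectors.dtype)` before calling `update`, as the `batch_update` example does when it builds its arrays.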