zensols.deepnlp.embed package

Submodules

zensols.deepnlp.embed.doc module

A zensols.nlp.container.FeatureDocument decorator that populates sentence and token embeddings.

class zensols.deepnlp.embed.doc.WordEmbedDocumentDecorator(model, torch_config=None, token_embeddings=True, sent_embeddings=True, skip_oov=False)[source]

Bases: FeatureDocumentDecorator

Populates sentence and token embeddings in the documents. Token embeddings have shape (1, d), where d is the embedding dimension; the first dimension is always 1 to be compatible with the word piece embeddings populated by transformer.WordPieceDocumentDecorator.

See:

WordEmbedModel
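
A minimal usage sketch, assuming model is an already configured WordEmbedModel and doc is a FeatureDocument parsed elsewhere (both names are hypothetical placeholders):

from zensols.deepnlp.embed.doc import WordEmbedDocumentDecorator

# populate embeddings in place; with torch_config=None the embeddings
# are numpy.ndarray instances rather than torch.Tensor
decorator = WordEmbedDocumentDecorator(model, skip_oov=True)
decorator.decorate(doc)
for tok in doc.token_iter():
    # per WordPieceFeatureToken.embeddings; each has shape (1, d)
    print(tok.norm, tok.embeddings.shape)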

__init__(model, torch_config=None, token_embeddings=True, sent_embeddings=True, skip_oov=False)
decorate(doc)[source]
model: WordEmbedModel

The word embedding model for populating tokens and sentences.

sent_embeddings: bool = True

Whether to add WordPieceFeatureSentence.embeddings.

skip_oov: bool = False

Whether to skip out-of-vocabulary tokens that have no embeddings.

token_embeddings: bool = True

Whether to add WordPieceFeatureToken.embeddings.

torch_config: Optional[TorchConfig] = None

The Torch configuration used to allocate the embeddings on either the GPU or the CPU. If None, NumPy numpy.ndarray arrays are used instead of torch.Tensor tensors.

zensols.deepnlp.embed.domain module

Interface file for word vectors, aka non-contextual word embeddings.

class zensols.deepnlp.embed.domain.NoOpWordEmbedModel(name, *args, **kwargs)[source]

Bases: WordEmbedModel

A non-operational implementation of a WordEmbedModel. This is useful in unit tests that would otherwise download large models that do not fit in GitHub's workflow actions environments.

__init__(name, *args, **kwargs)[source]
exception zensols.deepnlp.embed.domain.WordEmbedError[source]

Bases: DeepLearnError

Raised for any errors pertaining to word vectors.

__module__ = 'zensols.deepnlp.embed.domain'
class zensols.deepnlp.embed.domain.WordEmbedModel(name, cache=True, lowercase=False)[source]

Bases: PersistableContainer

This is an abstract base class that represents a set of word vectors (e.g. GloVe).

UNKNOWN: ClassVar[str] = '<unk>'

The unknown symbol used for out of vocabulary words.

ZERO: ClassVar[str] = '<unk>'

The zero vector symbol used for padding vectors.

__init__(name, cache=True, lowercase=False)
cache: bool = True

If True, globally cache all data structures, which should be False if more than one embedding across a model type is used.

clear_cache()[source]
deallocate()[source]

Deallocate all resources for this instance.

get(key, default=None)[source]

Just like dict.get(), but returns the vector for a word.

Parameters:
  • key (str) – the word whose vector to return

  • default (ndarray) – what to return if key doesn’t exist in the dict

Return type:

ndarray

Returns:

the word vector
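
For example, a sketch assuming model is a loaded WordEmbedModel:

import numpy as np

vec = model.get('dog')  # ndarray of shape (vector_dimension,)
# fall back to a zero vector for out-of-vocabulary words
zero = np.zeros(model.vector_dimension)
oov = model.get('qwertyuiop', default=zero)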

property keyed_vectors: KeyedVectors

Adapt instances of this class to a gensim KeyedVectors instance.
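
Since the return value is a gensim KeyedVectors, gensim's query API can be used directly (a sketch with a hypothetical loaded model):

# nearest-neighbor query via gensim's KeyedVectors.most_similar
similar = model.keyed_vectors.most_similar('dog', topn=5)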

keys()[source]

Return the keys, which are the word2vec words.

Return type:

Iterable[str]

lowercase: bool = False

If True, downcase each word for all methods that take a word as input. Use this for embeddings that are only lower case in order to find more hits when querying for words that have uppercase characters.
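
For example, with lowercase=True a query such as get('Dog') is looked up as 'dog'.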

property matrix: ndarray

The word vector matrix.

property model_id: str

Return a string that uniquely identifies this instance of the embedding model. This should have the type, size and dimension of the embedding.

This string is used to cache models in both CPU and GPU memory so the layers can have the benefit of reusing the same in-memory word embedding matrix.

name: str

The name of the model given by the configuration; it must be unique across word vector type and dimension.

prime()[source]
property shape: Tuple[int, int]

The shape of the word vector matrix.

to_matrix(torch_config)[source]

Return a matrix that represents the entire vector embedding as a tensor.

Parameters:

torch_config (TorchConfig) – indicates where to load the new tensor

Return type:

Tensor
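
A sketch of copying the embedding matrix to the GPU, assuming TorchConfig's use_gpu flag:

from zensols.deeplearn import TorchConfig

tc = TorchConfig(use_gpu=True)
emb = model.to_matrix(tc)  # a torch.Tensor with dimensions model.shape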

property unk_idx: int

The index of the out-of-vocabulary symbol (UNKNOWN).

property vector_dimension: int

Return the dimension of the word vectors.

property vectors: Dict[str, ndarray]

Return all word vectors with the string words as keys.

word2idx(word, default=None)[source]

Return the index of word, or default if it is not indexed.

Return type:

Optional[int]

word2idx_or_unk(word)[source]

Return the index of word, or the index of UNKNOWN if it is not indexed.

Return type:

int
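
The two lookups differ only in their out-of-vocabulary behavior; a quick sketch with a hypothetical loaded model:

idx = model.word2idx('dog')                 # int, or None when OOV
idx = model.word2idx('dog', default=-1)     # int, or -1 when OOV
unk = model.word2idx_or_unk('qwertyuiop')   # falls back to model.unk_idx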

class zensols.deepnlp.embed.domain.WordVectorModel(vectors, word2vec, words, word2idx)[source]

Bases: object

Vector data from the model.

__init__(vectors, word2vec, words, word2idx)
to_matrix(torch_config)[source]
Return type:

Tensor

vectors: ndarray

The word vectors.

word2idx: Dict[str, int]

The word to word vector index mapping.

word2vec: Dict[str, ndarray]

The word to word vector mapping.

words: List[str]

The vocabulary.

zensols.deepnlp.embed.fasttext module

Fast text word vector implementation.

class zensols.deepnlp.embed.fasttext.FastTextEmbedModel(name, cache=True, lowercase=False, path=None, installer=None, resource=None, desc='2M', dimension=300, corpus='crawl')[source]

Bases: TextWordEmbedModel

This class reads the FastText word vector text data format and provides an instance of a WordEmbedModel. Files in this format, such as crawl-300d-2M.vec, can be downloaded with the link below.

See:

English word vectors

__init__(name, cache=True, lowercase=False, path=None, installer=None, resource=None, desc='2M', dimension=300, corpus='crawl')
corpus: str = 'crawl'

The corpus the embeddings were trained on, such as crawl and web.

desc: str = '2M'

The size description (e.g. 2M for the two million word vectors).

dimension: int = 300

The word vector dimension.
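
A direct instantiation sketch (the file path is hypothetical; alternatively an installer/resource pair can locate the model files):

from pathlib import Path
from zensols.deepnlp.embed import FastTextEmbedModel

model = FastTextEmbedModel(
    name='fasttext_crawl',
    path=Path('corpus/fasttext/crawl-300d-2M.vec'),
    desc='2M', dimension=300, corpus='crawl')
model.prime()  # compile the text vectors into a fast-loading binary
print(model.shape)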

zensols.deepnlp.embed.glove module

This module contains the definition of a class that operates like a dict to retrieve GloVe word embeddings. It also creates, stores, and reads a binary representation for quick loading on startup.

class zensols.deepnlp.embed.glove.GloveWordEmbedModel(name, cache=True, lowercase=False, path=None, installer=None, resource=None, desc='6B', dimension=50, vocab_size=400000)[source]

Bases: TextWordEmbedModel

This class uses the Stanford pretrained GloVe embeddings as a dict-like Python object. It loads the GloVe vectors from a text file and then creates a binary file that is quick to load on subsequent uses.

An example configuration would be:

[glove_embedding]
class_name = zensols.deepnlp.embed.GloveWordEmbedModel
path = path: ${default:corpus_dir}/glove
desc = 6B
dimension = 50
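
Given such a configuration, the model can then be created with the framework's configuration factory (a sketch; app.conf is a hypothetical file containing the section above):

from zensols.config import ImportIniConfig, ImportConfigFactory

factory = ImportConfigFactory(ImportIniConfig('app.conf'))
glove = factory.instance('glove_embedding')
print(glove.vector_dimension)  # 50
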
__init__(name, cache=True, lowercase=False, path=None, installer=None, resource=None, desc='6B', dimension=50, vocab_size=400000)
desc: str = '6B'

The size description (e.g. 6B for vectors trained on a corpus of six billion tokens).

dimension: int = 50

The word vector dimension.

vocab_size: int = 400000

Vocabulary size.

zensols.deepnlp.embed.word2vec module

Convenience Gensim glue code for word embeddings/vectors.

class zensols.deepnlp.embed.word2vec.Word2VecModel(name, cache=True, lowercase=False, installer=None, resource=None, dimension=300, model_type='keyed')[source]

Bases: WordEmbedModel

Load keyed or non-keyed Gensim models.

__init__(name, cache=True, lowercase=False, installer=None, resource=None, dimension=300, model_type='keyed')
dimension: int = 300

The dimension of the word embedding.

installer: Installer = None

The installer used to install the text vector zip file.

model_type: str = 'keyed'

The type of the embeddings, which is either keyed or gensim.

property path: Path
resource: Resource = None

The zip resource used to find the path to the model files.

zensols.deepnlp.embed.wordtext module

Contains an abstract class that makes it easier to implement loading word vectors from text files.

class zensols.deepnlp.embed.wordtext.DefaultTextWordEmbedModel(name='unknown_name', cache=True, lowercase=False, path=None, installer=None, resource=None, desc='unknown_desc', dimension=50, vocab_size=0, file_name_pattern='{name}.{desc}.{dimension}d.txt')[source]

Bases: TextWordEmbedModel

This class uses the Stanford pretrained GloVe embeddings as a dict-like Python object. It loads the GloVe vectors from a text file and then creates a binary file that is quick to load on subsequent uses.

An example configuration would be:

[glove_embedding]
class_name = zensols.deepnlp.embed.GloveWordEmbedModel
path = path: ${default:corpus_dir}/glove
desc = 6B
dimension = 50
__init__(name='unknown_name', cache=True, lowercase=False, path=None, installer=None, resource=None, desc='unknown_desc', dimension=50, vocab_size=0, file_name_pattern='{name}.{desc}.{dimension}d.txt')
desc: str = 'unknown_desc'

The size description (e.g. 6B for vectors trained on a corpus of six billion tokens).

dimension: int = 50

The word vector dimension.

property file_name: str
file_name_pattern: str = '{name}.{desc}.{dimension}d.txt'

The format of the file to create.
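
For example, with name='glove', desc='6B' and dimension=50, the default pattern resolves to the file name glove.6B.50d.txt.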

name: str = 'unknown_name'

The name of the word vector set (i.e. glove).

vocab_size: int = 0

Vocabulary size.

class zensols.deepnlp.embed.wordtext.TextWordEmbedModel(name, cache=True, lowercase=False, path=None, installer=None, resource=None)[source]

Bases: WordEmbedModel, Primeable

Extensions of this class read a text vector file, compile it, and then write a binary representation for fast loading.

DATASET_NAME = 'vec'

Name of the dataset in the HDF5 file.

__init__(name, cache=True, lowercase=False, path=None, installer=None, resource=None)
installer: Installer = None

The installer used to install the text vector zip file.

property metadata

Return the metadata used to construct paths for both the text source vector file and all generated binary files.

path: Path = None

The path to the model file(s).

prime()[source]
resource: Resource = None

The zip resource used to find the path to the model files.

class zensols.deepnlp.embed.wordtext.TextWordModelMetadata(name, desc, dimension, n_vocab, source_path, sub_directory=None)[source]

Bases: Dictable

Describes a text-based WordEmbedModel. The information in this class is used to construct paths for both the text source vector file and all generated binary files.

__init__(name, desc, dimension, n_vocab, source_path, sub_directory=None)
desc: str

A descriptor about this particular word vector set (e.g. 6B).

dimension: int

The dimension of the word vectors.

n_vocab: int

The number of words in the vocabulary.

name: str

The name of the word vector set (i.e. glove).

source_path: Path

The path to the text file.

sub_directory: InitVar = None

The subdirectory to be appended to self.bin_dir, which defaults to the directory bin/<description>.<dimension>.

Module contents