zensols.deepnlp.embed package#

Submodules#

zensols.deepnlp.embed.doc#

Inheritance diagram of zensols.deepnlp.embed.doc

A zensols.nlp.container.FeatureDocument decorator that populates sentence and token embeddings.

class zensols.deepnlp.embed.doc.WordEmbedDocumentDecorator(model, torch_config=None, token_embeddings=True, sent_embeddings=True, skip_oov=False)[source]#

Bases: FeatureDocumentDecorator

Populates sentence and token embeddings in the documents. Tokens have shape (1, d), where d is the embedding dimension. The first dimension is always 1 to be compatible with word piece embeddings populated by transformer.WordPieceDocumentDecorator.

See:

WordEmbedModel

__init__(model, torch_config=None, token_embeddings=True, sent_embeddings=True, skip_oov=False)#
decorate(doc)[source]#
model: WordEmbedModel#

The word embedding model for populating tokens and sentences.

sent_embeddings: bool = True#

Whether to add WordPieceFeatureSentence.embeddings.

skip_oov: bool = False#

Whether to skip out-of-vocabulary tokens that have no embeddings.

token_embeddings: bool = True#

Whether to add WordPieceFeatureToken.embeddings.

torch_config: Optional[TorchConfig] = None#

The Torch configuration used to allocate the embeddings on either the GPU or the CPU. If None, NumPy numpy.ndarray arrays are used instead of torch.Tensor.

zensols.deepnlp.embed.domain#

Inheritance diagram of zensols.deepnlp.embed.domain

Interface file for word vectors, aka non-contextual word embeddings.

exception zensols.deepnlp.embed.domain.WordEmbedError[source]#

Bases: DeepLearnError

Raised for any errors pertaining to word vectors.

__module__ = 'zensols.deepnlp.embed.domain'#
class zensols.deepnlp.embed.domain.WordEmbedModel(name, cache=True, lowercase=False)[source]#

Bases: PersistableContainer

This is an abstract base class that represents a set of word vectors (e.g. GloVe).

UNKNOWN: ClassVar[str] = '<unk>'#

The unknown symbol used for out of vocabulary words.

ZERO: ClassVar[str] = '<unk>'#

The zero vector symbol used for padding vectors.

__init__(name, cache=True, lowercase=False)#
cache: bool = True#

If True, globally cache all data structures. This should be False if more than one embedding across a model type is used.

clear_cache()[source]#
deallocate()[source]#

Deallocate all resources for this instance.

get(key, default=None)[source]#

Just like dict.get(), but return the vector for a word.

Parameters:
  • key (str) – the word whose vector to get

  • default (ndarray) – what to return if key doesn’t exist in the dict

Return type:

ndarray

Returns:

the word vector
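The lookup semantics described above can be sketched without the library. This is only an illustration: ToyEmbedModel is a hypothetical stand-in, plain lists replace ndarray, and the handling of the lowercase flag is an assumption based on that attribute's description.

```python
from typing import Dict, List, Optional

Vector = List[float]  # stand-in for numpy.ndarray in this sketch


class ToyEmbedModel:
    """Hypothetical stand-in for a word embedding model (not the real class)."""

    def __init__(self, vectors: Dict[str, Vector], lowercase: bool = False):
        self.vectors = vectors
        self.lowercase = lowercase

    def get(self, key: str, default: Optional[Vector] = None) -> Optional[Vector]:
        # Optionally downcase the word, then fall back to ``default`` when
        # the word has no vector, exactly as dict.get() would.
        if self.lowercase:
            key = key.lower()
        return self.vectors.get(key, default)


model = ToyEmbedModel({'cat': [0.1, 0.2], 'dog': [0.3, 0.4]}, lowercase=True)
found = model.get('Cat')                    # matches 'cat' after downcasing
missing = model.get('zebra')                # None, no default given
fallback = model.get('zebra', [0.0, 0.0])   # the supplied default
```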

property keyed_vectors: KeyedVectors#

Adapt instances of this class to a gensim keyed vector instance.

keys()[source]#

Return the keys, which are the word2vec words.

Return type:

Iterable[str]

lowercase: bool = False#

If True, downcase each word for all methods that take a word as input. Use this for embeddings that are only lower case in order to find more hits when querying for words that have uppercase characters.

property matrix: ndarray#

The word vector matrix.

property model_id: str#

Return a string that uniquely identifies this instance of the embedding model. This should have the type, size and dimension of the embedding.

This string is used to cache models in both CPU and GPU memory so that layers can reuse the same in-memory word embedding matrix.
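A minimal sketch of caching keyed by a model ID follows. The cache dictionary, the loader function and the ID string format are all hypothetical; the library's actual cache layout is not shown in this reference.

```python
from typing import Callable, Dict, List

Matrix = List[List[float]]  # stand-in for an embedding matrix

# Hypothetical process-wide cache keyed by model ID.
_CACHE: Dict[str, Matrix] = {}
load_count = 0


def load_matrix() -> Matrix:
    """Pretend to do the expensive load and count how often it happens."""
    global load_count
    load_count += 1
    return [[0.0, 0.0], [0.1, 0.2]]


def cached_matrix(model_id: str, loader: Callable[[], Matrix]) -> Matrix:
    # Only invoke the loader the first time a given model ID is requested.
    if model_id not in _CACHE:
        _CACHE[model_id] = loader()
    return _CACHE[model_id]


# The ID format below is made up; it only needs to be unique per model.
first = cached_matrix('glove-6B-50', load_matrix)
second = cached_matrix('glove-6B-50', load_matrix)
```

Both calls return the same object, which is why the ID needs to encode everything that distinguishes one embedding from another.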

name: str#

The name of the model as given by the configuration. It must be unique across word vector type and dimension.

prime()[source]#
property shape: Tuple[int, int]#

The shape of the word vector matrix.

to_matrix(torch_config)[source]#

Return a matrix that represents the entire vector embedding as a tensor.

Parameters:

torch_config (TorchConfig) – indicates where to load the new tensor

Return type:

Tensor

property unk_idx: int#

The index of the out-of-vocabulary (unknown) token.

property vector_dimension: int#

Return the dimension of the word vectors.

property vectors: Dict[str, ndarray]#

Return all word vectors with the string words as keys.

word2idx(word, default=None)[source]#

Return the index of word, or default if it is not indexed.

Return type:

Optional[int]

word2idx_or_unk(word)[source]#

Return the index of word, or the index of UNKNOWN if it is not indexed.

Return type:

int
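The difference between the two lookups, as suggested by their signatures and return types, can be sketched as below. The vocabulary and the position of the unknown symbol are invented for illustration, and the functions are free-standing stand-ins rather than the library methods.

```python
from typing import Dict, Optional

UNKNOWN = '<unk>'  # same symbol as the UNKNOWN class variable

# Toy vocabulary; where '<unk>' sits is an arbitrary choice for this sketch.
word2idx_map: Dict[str, int] = {'cat': 0, 'dog': 1, UNKNOWN: 2}
unk_idx: int = word2idx_map[UNKNOWN]


def word2idx(word: str, default: Optional[int] = None) -> Optional[int]:
    """Return the index of ``word`` or ``default`` when it is not indexed."""
    return word2idx_map.get(word, default)


def word2idx_or_unk(word: str) -> int:
    """Return the index of ``word`` or the unknown index, never ``None``."""
    return word2idx_map.get(word, unk_idx)


known = word2idx('cat')
absent = word2idx('zebra')
absent_as_unk = word2idx_or_unk('zebra')
```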

class zensols.deepnlp.embed.domain.WordVectorModel(vectors, word2vec, words, word2idx)[source]#

Bases: object

Vector data from the model.

__init__(vectors, word2vec, words, word2idx)#
to_matrix(torch_config)[source]#
Return type:

Tensor

vectors: ndarray#

The word vectors.

word2idx: Dict[str, int]#

The word to word vector index mapping.

word2vec: Dict[str, ndarray]#

The word to word vector mapping.

words: List[str]#

The vocabulary.
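The four fields describe the same data from different angles. The sketch below shows one consistent way they could relate, assuming row i of vectors belongs to words[i]; ToyWordVectorData is a hypothetical mirror of the fields and plain lists stand in for ndarray.

```python
from dataclasses import dataclass
from typing import Dict, List

Vector = List[float]  # stand-in for numpy.ndarray rows


@dataclass
class ToyWordVectorData:
    """Hypothetical mirror of the four fields; not the library class."""
    vectors: List[Vector]
    word2vec: Dict[str, Vector]
    words: List[str]
    word2idx: Dict[str, int]


def build(entries: Dict[str, Vector]) -> ToyWordVectorData:
    # Fix one ordering of the vocabulary and derive every view from it.
    words = list(entries)
    vectors = [entries[w] for w in words]
    return ToyWordVectorData(
        vectors=vectors,
        word2vec=dict(zip(words, vectors)),
        words=words,
        word2idx={w: i for i, w in enumerate(words)})


data = build({'cat': [0.1, 0.2], 'dog': [0.3, 0.4]})
# Looking up by index and looking up by word give the same vector.
same = data.vectors[data.word2idx['dog']] == data.word2vec['dog']
```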

zensols.deepnlp.embed.fasttext#

Inheritance diagram of zensols.deepnlp.embed.fasttext

FastText word vector implementation.

class zensols.deepnlp.embed.fasttext.FastTextEmbedModel(name, cache=True, lowercase=False, path=None, installer=None, resource=None, desc='2M', dimension=300, corpus='crawl')[source]#

Bases: TextWordEmbedModel

This class reads the FastText word vector text data format and provides an instance of a WordEmbedModel. Files with names such as crawl-300d-2M.vec can be downloaded from the link below.

See:

English word vectors

__init__(name, cache=True, lowercase=False, path=None, installer=None, resource=None, desc='2M', dimension=300, corpus='crawl')#
corpus: str = 'crawl'#

The corpus the embeddings were trained on, such as crawl or web.

desc: str = '2M'#

The size description (e.g. 6B for the six billion word trained vectors).

dimension: str = 300#

The word vector dimension.
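The default field values line up with the example file name crawl-300d-2M.vec. The helper below is a hypothetical illustration of that correspondence; how the class actually composes the file name is not documented here.

```python
def fasttext_file_name(corpus: str = 'crawl', dimension: int = 300,
                       desc: str = '2M') -> str:
    """Compose a file name in the style of crawl-300d-2M.vec (assumed layout)."""
    return f'{corpus}-{dimension}d-{desc}.vec'


name = fasttext_file_name()
```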

zensols.deepnlp.embed.glove#

Inheritance diagram of zensols.deepnlp.embed.glove

This module contains the definition of a class that operates like a dict to retrieve GloVe word embeddings. It also creates, stores and reads a binary representation for quick loading on startup.

class zensols.deepnlp.embed.glove.GloveWordEmbedModel(name, cache=True, lowercase=False, path=None, installer=None, resource=None, desc='6B', dimension=50, vocab_size=400000)[source]#

Bases: TextWordEmbedModel

This class exposes the Stanford pretrained GloVe embeddings as a dict-like Python object. It loads the GloVe vectors from a text file and then creates a binary file that’s quick to load on subsequent uses.

An example configuration would be:

[glove_embedding]
class_name = zensols.deepnlp.embed.GloveWordEmbedModel
path = path: ${default:corpus_dir}/glove
desc = 6B
dimension = 50
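
The ${default:corpus_dir} reference in the section above uses a section:option substitution syntax. As a rough sketch, the standard library's configparser with ExtendedInterpolation resolves the same syntax. The [default] section and its corpus_dir value are invented here, and the path: prefix is left as plain text, since how the framework's own loader interprets it is outside this sketch.

```python
from configparser import ConfigParser, ExtendedInterpolation

# The [default] section and its value are assumptions made for this example.
CONFIG = """
[default]
corpus_dir = /data/corpus

[glove_embedding]
class_name = zensols.deepnlp.embed.GloveWordEmbedModel
path = path: ${default:corpus_dir}/glove
desc = 6B
dimension = 50
"""

parser = ConfigParser(interpolation=ExtendedInterpolation())
parser.read_string(CONFIG)
section = parser['glove_embedding']

# ${default:corpus_dir} is replaced by the value from the [default] section.
path_value = section['path']
dimension = section.getint('dimension')
desc = section['desc']
```
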
__init__(name, cache=True, lowercase=False, path=None, installer=None, resource=None, desc='6B', dimension=50, vocab_size=400000)#
desc: str = '6B'#

The size description (e.g. 6B for the six billion word trained vectors).

dimension: int = 50#

The word vector dimension.

vocab_size: int = 400000#

Vocabulary size.

zensols.deepnlp.embed.word2vec#

Inheritance diagram of zensols.deepnlp.embed.word2vec

Convenience Gensim glue code for word embeddings/vectors.

class zensols.deepnlp.embed.word2vec.Word2VecModel(name, cache=True, lowercase=False, installer=None, resource=None, dimension=300, model_type='keyed')[source]#

Bases: WordEmbedModel

Load keyed or non-keyed Gensim models.

__init__(name, cache=True, lowercase=False, installer=None, resource=None, dimension=300, model_type='keyed')#
dimension: int = 300#

The dimension of the word embedding.

installer: Installer = None#

The installer used for the text vector zip file.

model_type: str = 'keyed'#

The type of the embeddings, which is either keyed or gensim.

property path: Path#
resource: Resource = None#

The zip resource used to find the path to the model files.

zensols.deepnlp.embed.wordtext#

Inheritance diagram of zensols.deepnlp.embed.wordtext

Contains an abstract class that makes it easier to implement loading word vectors from text files.

class zensols.deepnlp.embed.wordtext.DefaultTextWordEmbedModel(name='unknown_name', cache=True, lowercase=False, path=None, installer=None, resource=None, desc='unknown_desc', dimension=50, vocab_size=0, file_name_pattern='{name}.{desc}.{dimension}d.txt')[source]#

Bases: TextWordEmbedModel

This class exposes the Stanford pretrained GloVe embeddings as a dict-like Python object. It loads the GloVe vectors from a text file and then creates a binary file that’s quick to load on subsequent uses.

An example configuration would be:

[glove_embedding]
class_name = zensols.deepnlp.embed.GloveWordEmbedModel
path = path: ${default:corpus_dir}/glove
desc = 6B
dimension = 50
__init__(name='unknown_name', cache=True, lowercase=False, path=None, installer=None, resource=None, desc='unknown_desc', dimension=50, vocab_size=0, file_name_pattern='{name}.{desc}.{dimension}d.txt')#
desc: str = 'unknown_desc'#

The size description (e.g. 6B for the six billion word trained vectors).

dimension: int = 50#

The word vector dimension.

property file_name: str#
file_name_pattern: str = '{name}.{desc}.{dimension}d.txt'#

The format of the file to create.
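The default pattern uses str.format-style placeholders. Assuming it is filled with the name, desc and dimension fields (the values below are examples, not defaults), it expands as follows:

```python
pattern = '{name}.{desc}.{dimension}d.txt'

# Example values; any name, description and dimension could be substituted.
file_name = pattern.format(name='glove', desc='6B', dimension=50)
```

With these values the result is glove.6B.50d.txt.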

name: str = 'unknown_name'#

The name of the word vector set (e.g. glove).

vocab_size: int = 0#

Vocabulary size.

class zensols.deepnlp.embed.wordtext.TextWordEmbedModel(name, cache=True, lowercase=False, path=None, installer=None, resource=None)[source]#

Bases: WordEmbedModel, Primeable

Subclasses read a text vector file, compile it, and then write a binary representation for fast loading.
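The reading step can be sketched as below, assuming the common text layout used by GloVe files in which each line holds a word followed by space-separated floats. The function name is hypothetical, and writing the binary representation is left out.

```python
import io
from typing import List, TextIO, Tuple


def parse_text_vectors(reader: TextIO) -> Tuple[List[str], List[List[float]]]:
    """Parse ``word v1 v2 ...`` lines into a word list and a vector list."""
    words: List[str] = []
    vectors: List[List[float]] = []
    for line in reader:
        parts = line.rstrip().split(' ')
        if len(parts) < 2:
            # Skip blank or malformed lines that carry no vector values.
            continue
        words.append(parts[0])
        vectors.append([float(x) for x in parts[1:]])
    return words, vectors


sample = io.StringIO('the 0.1 0.2 0.3\ncat 0.4 0.5 0.6\n')
words, vectors = parse_text_vectors(sample)
```

Keeping words and vectors in the same order is what lets a later step store the vectors as one matrix and the vocabulary as an index into it.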

DATASET_NAME = 'vec'#

Name of the dataset in the HDF5 file.

__init__(name, cache=True, lowercase=False, path=None, installer=None, resource=None)#
installer: Installer = None#

The installer used for the text vector zip file.

property metadata#

Return the metadata used to construct paths to both the text source vector file and all generated binary files.

path: Path = None#

The path to the model file(s).

prime()[source]#
resource: Resource = None#

The zip resource used to find the path to the model files.

class zensols.deepnlp.embed.wordtext.TextWordModelMetadata(name, desc, dimension, n_vocab, source_path, sub_directory=None)[source]#

Bases: Dictable

Describes a text-based WordEmbedModel. The information in this class is used to construct paths to both the text source vector file and all generated binary files.

__init__(name, desc, dimension, n_vocab, source_path, sub_directory=None)#
desc: str#

A descriptor for this particular word vector set (e.g. 6B).

dimension: int#

The dimension of the word vectors.

n_vocab: int#

The number of words in the vocabulary.

name: str#

The name of the word vector set (e.g. glove).

source_path: Path#

The path to the text file.

sub_directory: InitVar = None#

The subdirectory to be appended to self.bin_dir, which defaults to the directory bin/<description>.<dimension>.

Module contents#