zensols.deepnlp.embed package
Submodules
zensols.deepnlp.embed.doc module
A zensols.nlp.container.FeatureDocument decorator that populates sentence and token embeddings.
- class zensols.deepnlp.embed.doc.WordEmbedDocumentDecorator(model, torch_config=None, token_embeddings=True, sent_embeddings=True, skip_oov=False)[source]¶
Bases:
FeatureDocumentDecorator
Populates sentence and token embeddings in the documents. Token embeddings have shape (1, d), where d is the embedding dimension. The first dimension is always 1 for compatibility with the word piece embeddings populated by transformer.WordPieceDocumentDecorator.
- See:
- __init__(model, torch_config=None, token_embeddings=True, sent_embeddings=True, skip_oov=False)¶
-
model:
WordEmbedModel
¶ The word embedding model for populating tokens and sentences.
-
torch_config:
Optional
[TorchConfig
] = None¶ The Torch configuration used to allocate the embeddings on either the GPU or the CPU. If None, then NumPy numpy.ndarray arrays are used instead of torch.Tensor.
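The (1, d) convention can be illustrated without the library. The following is a minimal sketch that uses nested lists as a stand-in for numpy.ndarray or torch.Tensor; the helper names `shape` and `mean_rows` are hypothetical and not part of zensols.deepnlp. It shows why a leading dimension of 1 lets code treat word embeddings and multi-row word piece embeddings the same way.

```python
from typing import List, Tuple

Matrix = List[List[float]]  # a 2-D embedding with shape (rows, d)


def shape(m: Matrix) -> Tuple[int, int]:
    """Return (rows, d) for a non-empty 2-D list."""
    return (len(m), len(m[0]))


def mean_rows(m: Matrix) -> List[float]:
    """Average over the first axis, giving a single d-dimensional vector."""
    n = len(m)
    return [sum(col) / n for col in zip(*m)]


# A word embedding token: always exactly one row, so shape (1, d).
word_token: Matrix = [[0.25, 0.5, 0.75]]
# A word piece token: one row per word piece (two pieces in this toy case).
piece_token: Matrix = [[1.0, 2.0, 3.0], [3.0, 4.0, 5.0]]

# The same reduction works on both because both are 2-D.
for emb in (word_token, piece_token):
    print(shape(emb), mean_rows(emb))
```

Because both kinds of token embedding are two dimensional, downstream code does not need a special case for non-contextual vectors.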
zensols.deepnlp.embed.domain module
Interface file for word vectors, aka non-contextual word embeddings.
- class zensols.deepnlp.embed.domain.NoOpWordEmbedModel(name, *args, **kwargs)[source]¶
Bases:
WordEmbedModel
A no-op implementation of a WordEmbedModel. This is useful in unit test cases that would otherwise download large models that do not fit in GitHub’s workflow actions environments.
- exception zensols.deepnlp.embed.domain.WordEmbedError[source]¶
Bases:
DeepLearnError
Raised for any errors pertaining to word vectors.
- __module__ = 'zensols.deepnlp.embed.domain'¶
- class zensols.deepnlp.embed.domain.WordEmbedModel(name, cache=True, lowercase=False)[source]¶
Bases:
PersistableContainer
This is an abstract base class that represents a set of word vectors (e.g. GloVe).
- __init__(name, cache=True, lowercase=False)¶
-
cache:
bool
= True¶ If True, globally cache all data structures. This should be False if more than one embedding across a model type is used.
- property keyed_vectors: KeyedVectors¶
Adapt instances of this class to a gensim keyed vector instance.
-
lowercase:
bool
= False¶ If
True
, downcase each word for all methods that take a word as input. Use this for embeddings that are only lower case in order to find more hits when querying for words that have uppercase characters.
- property model_id: str¶
Return a string that uniquely identifies this instance of the embedding model. This should have the type, size and dimension of the embedding.
This string is used to cache models in both CPU and GPU memory so that layers can reuse the same in-memory word embedding matrix.
-
name:
str
¶ The name of the model given by the configuration, which must be unique across word vector type and dimension.
- to_matrix(torch_config)[source]¶
Return a matrix that represents the entire vector embedding as a tensor.
- Parameters:
torch_config (
TorchConfig
) – indicates where to load the new tensor
- Return type:
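The interplay of lowercase, cache and model_id described above can be sketched with a toy class. This is an illustration only: `TinyEmbedModel`, `_MATRIX_CACHE` and the exact model_id format are hypothetical and are not the library's implementation. The sketch assumes a dictionary of word vectors and shows how downcasing finds more hits and how a shared identifier lets two instances reuse one matrix.

```python
from typing import Dict, List, Optional

# Hypothetical global cache shared by all instances, keyed by model_id.
_MATRIX_CACHE: Dict[str, List[List[float]]] = {}


class TinyEmbedModel:
    """Illustrative stand-in; not the library's WordEmbedModel."""

    def __init__(self, name: str, vectors: Dict[str, List[float]],
                 cache: bool = True, lowercase: bool = False):
        self.name = name
        self.vectors = vectors
        self.cache = cache
        self.lowercase = lowercase

    @property
    def model_id(self) -> str:
        # Encode the name, vocabulary size and dimension of the embedding.
        dim = len(next(iter(self.vectors.values())))
        return f'{self.name}: vocab={len(self.vectors)}, dim={dim}'

    def get(self, word: str) -> Optional[List[float]]:
        # Downcase the query when the vocabulary is lower case only.
        if self.lowercase:
            word = word.lower()
        return self.vectors.get(word)

    def to_matrix(self) -> List[List[float]]:
        # Reuse the matrix already built for the same model_id, if any.
        if self.cache and self.model_id in _MATRIX_CACHE:
            return _MATRIX_CACHE[self.model_id]
        matrix = [self.vectors[w] for w in sorted(self.vectors)]
        if self.cache:
            _MATRIX_CACHE[self.model_id] = matrix
        return matrix


vecs = {'paris': [0.1, 0.2], 'rome': [0.3, 0.4]}
cased = TinyEmbedModel('toy', vecs)
uncased = TinyEmbedModel('toy', vecs, lowercase=True)
print(cased.get('Paris'))    # miss: the vocabulary is lower case only
print(uncased.get('Paris'))  # hit after downcasing
print(cased.to_matrix() is uncased.to_matrix())  # same cached object
```

The last line also hints at why cache should be False when several embeddings of the same model type are in play: instances that share an identifier share one matrix.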
zensols.deepnlp.embed.fasttext module
FastText word vector implementation.
- class zensols.deepnlp.embed.fasttext.FastTextEmbedModel(name, cache=True, lowercase=False, path=None, installer=None, resource=None, desc='2M', dimension=300, corpus='crawl')[source]¶
Bases:
TextWordEmbedModel
This class reads the FastText word vector text data format and provides an instance of a WordEmbedModel. Files with names that look like crawl-300d-2M.vec can be downloaded with the link below.
- See:
- __init__(name, cache=True, lowercase=False, path=None, installer=None, resource=None, desc='2M', dimension=300, corpus='crawl')¶
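For orientation, the FastText text format starts with a header line holding the vocabulary size and dimension, followed by one line per word with its space-separated vector values. The following is a minimal sketch of a parser for that layout; `parse_vec` is a hypothetical helper and not how FastTextEmbedModel is implemented.

```python
from typing import Dict, List, Tuple


def parse_vec(text: str) -> Tuple[int, int, Dict[str, List[float]]]:
    """Parse .vec formatted text into (n_vocab, dimension, vectors)."""
    lines = text.strip().splitlines()
    # The header declares the number of words and the vector dimension.
    n_vocab, dim = (int(x) for x in lines[0].split())
    vectors: Dict[str, List[float]] = {}
    for line in lines[1:]:
        word, *values = line.rstrip().split(' ')
        if len(values) != dim:
            raise ValueError(f'expected {dim} values for {word!r}')
        vectors[word] = [float(v) for v in values]
    return n_vocab, dim, vectors


sample = "2 3\nhello 0.1 0.2 0.3\nworld 0.4 0.5 0.6\n"
n_vocab, dim, vectors = parse_vec(sample)
print(n_vocab, dim, vectors['world'])
```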
zensols.deepnlp.embed.glove module
This module contains the definition of a class that operates like a dict to retrieve GloVe word embeddings. It also creates, stores and reads a binary representation for quick loading on start up.
- class zensols.deepnlp.embed.glove.GloveWordEmbedModel(name, cache=True, lowercase=False, path=None, installer=None, resource=None, desc='6B', dimension=50, vocab_size=400000)[source]¶
Bases:
TextWordEmbedModel
This class uses the Stanford pretrained GloVe embeddings as a dict-like Python object. It loads the GloVe vectors from a text file and then creates a binary file that’s quick to load on subsequent uses.
An example configuration would be:
    [glove_embedding]
    class_name = zensols.deepnlp.embed.GloveWordEmbedModel
    path = path: ${default:corpus_dir}/glove
    desc = 6B
    dimension = 50
- __init__(name, cache=True, lowercase=False, path=None, installer=None, resource=None, desc='6B', dimension=50, vocab_size=400000)¶
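To show how the `${default:corpus_dir}` reference in the example configuration resolves, here is a rough sketch that reads the same section with only the standard library configparser. The library has its own configuration loader, so this is not how it parses the file. The `[default]` section, its `/data/corpus` value, and the handling of the `path:` prefix are assumptions made for illustration.

```python
from configparser import ConfigParser, ExtendedInterpolation
from pathlib import PurePosixPath

# The [default] section and its value are hypothetical.
CONFIG = """\
[default]
corpus_dir = /data/corpus

[glove_embedding]
class_name = zensols.deepnlp.embed.GloveWordEmbedModel
path = path: ${default:corpus_dir}/glove
desc = 6B
dimension = 50
"""

parser = ConfigParser(interpolation=ExtendedInterpolation())
parser.read_string(CONFIG)
section = parser['glove_embedding']

# After interpolation the value reads 'path: /data/corpus/glove'.
raw_path = section['path']
# Assumption: the 'path:' prefix marks the value as a file system path.
glove_dir = PurePosixPath(raw_path.split(':', 1)[1].strip())
dimension = section.getint('dimension')
print(glove_dir, section['desc'], dimension)
```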
zensols.deepnlp.embed.word2vec module
Convenience Gensim glue code for word embeddings/vectors.
- class zensols.deepnlp.embed.word2vec.Word2VecModel(name, cache=True, lowercase=False, installer=None, resource=None, dimension=300, model_type='keyed')[source]¶
Bases:
WordEmbedModel
Load keyed or non-keyed Gensim models.
- __init__(name, cache=True, lowercase=False, installer=None, resource=None, dimension=300, model_type='keyed')¶
-
installer:
Installer
= None¶ The installer used for the text vector zip file.
-
resource:
Resource
= None¶ The zip resource used to find the path to the model files.
zensols.deepnlp.embed.wordtext module
Contains an abstract class that makes it easier to implement loading word vectors from text files.
- class zensols.deepnlp.embed.wordtext.DefaultTextWordEmbedModel(name='unknown_name', cache=True, lowercase=False, path=None, installer=None, resource=None, desc='unknown_desc', dimension=50, vocab_size=0, file_name_pattern='{name}.{desc}.{dimension}d.txt')[source]¶
Bases:
TextWordEmbedModel
This class uses the Stanford pretrained GloVe embeddings as a dict-like Python object. It loads the GloVe vectors from a text file and then creates a binary file that’s quick to load on subsequent uses.
An example configuration would be:
    [glove_embedding]
    class_name = zensols.deepnlp.embed.GloveWordEmbedModel
    path = path: ${default:corpus_dir}/glove
    desc = 6B
    dimension = 50
- __init__(name='unknown_name', cache=True, lowercase=False, path=None, installer=None, resource=None, desc='unknown_desc', dimension=50, vocab_size=0, file_name_pattern='{name}.{desc}.{dimension}d.txt')¶
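The default file_name_pattern reads like a str.format template. The sketch below assumes the placeholders are filled from the name, desc and dimension parameters of the same signature; `vector_file_name` is a hypothetical helper, and how the library actually expands the pattern is not confirmed here.

```python
def vector_file_name(pattern: str, name: str, desc: str,
                     dimension: int) -> str:
    """Fill the placeholders of a file name pattern (illustrative only)."""
    return pattern.format(name=name, desc=desc, dimension=dimension)


pattern = '{name}.{desc}.{dimension}d.txt'
file_name = vector_file_name(pattern, 'glove', '6B', 50)
print(file_name)
```

With these values the pattern produces a name in the same style as the published GloVe text files.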
- class zensols.deepnlp.embed.wordtext.TextWordEmbedModel(name, cache=True, lowercase=False, path=None, installer=None, resource=None)[source]¶
Bases:
WordEmbedModel
,Primeable
Extensions of this class read a text vectors file, compile it, and then write a binary representation for fast loading.
- DATASET_NAME = 'vec'¶
Name of the dataset in the HDF5 file.
- __init__(name, cache=True, lowercase=False, path=None, installer=None, resource=None)¶
-
installer:
Installer
= None¶ The installer used for the text vector zip file.
- property metadata¶
Return the metadata used to construct paths for both the text source vector file and all generated binary files.
-
resource:
Resource
= None¶ The zip resource used to find the path to the model files.
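The compile-once, load-fast idea behind this class can be sketched with the standard library alone. Note the substitution: the class documentation refers to an HDF5 file, whereas this sketch writes a raw float array plus a small JSON index, purely to stay dependency free. The function names and the file names `vec.bin` and `meta.json` are hypothetical.

```python
import json
import tempfile
from array import array
from pathlib import Path
from typing import List, Tuple


def compile_vectors(text_path: Path, bin_dir: Path) -> None:
    """Parse the text vectors once and write a binary representation."""
    words: List[str] = []
    flat = array('f')  # 32-bit floats stored contiguously
    dim = 0
    with open(text_path, encoding='utf-8') as f:
        for line in f:
            word, *values = line.split()
            dim = len(values)
            words.append(word)
            flat.extend(float(v) for v in values)
    bin_dir.mkdir(parents=True, exist_ok=True)
    with open(bin_dir / 'vec.bin', 'wb') as f:
        flat.tofile(f)
    (bin_dir / 'meta.json').write_text(
        json.dumps({'words': words, 'dim': dim}))


def load_vectors(text_path: Path,
                 bin_dir: Path) -> Tuple[List[str], List[List[float]]]:
    """Load from the binary files, compiling them first if missing."""
    if not (bin_dir / 'vec.bin').exists():
        compile_vectors(text_path, bin_dir)
    meta = json.loads((bin_dir / 'meta.json').read_text())
    words, dim = meta['words'], meta['dim']
    flat = array('f')
    with open(bin_dir / 'vec.bin', 'rb') as f:
        flat.fromfile(f, len(words) * dim)
    rows = [list(flat[i * dim:(i + 1) * dim]) for i in range(len(words))]
    return words, rows


with tempfile.TemporaryDirectory() as tmp:
    src = Path(tmp) / 'toy.2d.txt'
    src.write_text('cat 0.5 0.25\ndog 1.0 2.0\n', encoding='utf-8')
    bin_dir = Path(tmp) / 'bin'
    first = load_vectors(src, bin_dir)   # compiles, then loads
    second = load_vectors(src, bin_dir)  # loads the binary directly
    print(first == second, first)
```

Only the first call pays the cost of parsing text; later calls read the binary file directly.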
- class zensols.deepnlp.embed.wordtext.TextWordModelMetadata(name, desc, dimension, n_vocab, source_path, sub_directory=None)[source]¶
Bases:
Dictable
Describes a text-based WordEmbedModel. The information in this class is used to construct paths for both the text source vector file and all generated binary files.
- __init__(name, desc, dimension, n_vocab, source_path, sub_directory=None)¶
-
sub_directory:
InitVar
= None¶ The subdirectory to be appended to self.bin_dir, which defaults to the directory bin/<description>.<dimension>.
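One possible reading of the sub_directory default is sketched below. It assumes that the binary directory is resolved relative to the parent directory of source_path, which the documentation above does not state. `resolve_bin_dir` is a hypothetical helper and not the library's code.

```python
from pathlib import PurePosixPath
from typing import Optional


def resolve_bin_dir(source_path: PurePosixPath, desc: str, dimension: int,
                    sub_directory: Optional[PurePosixPath] = None
                    ) -> PurePosixPath:
    """Return the directory for generated binary files (illustrative)."""
    if sub_directory is None:
        # Default documented above: bin/<description>.<dimension>
        sub_directory = PurePosixPath('bin') / f'{desc}.{dimension}'
    # Assumption: binaries live next to the text source vector file.
    return source_path.parent / sub_directory


src = PurePosixPath('/data/corpus/glove/glove.6B.50d.txt')
print(resolve_bin_dir(src, '6B', 50))
print(resolve_bin_dir(src, '6B', 50, PurePosixPath('cache')))
```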