zensols.deepnlp.embed package¶
Submodules¶
zensols.deepnlp.embed.doc module¶
A zensols.nlp.container.FeatureDocument decorator that populates
sentence and token embeddings.
- class zensols.deepnlp.embed.doc.WordEmbedDocumentDecorator(model, torch_config=None, token_embeddings=True, sent_embeddings=True, skip_oov=False)[source]¶
Bases: FeatureDocumentDecorator
Populates sentence and token embeddings in the documents. Tokens have shape (1, d), where d is the embedding dimension; the first dimension is always 1 to be compatible with word piece embeddings populated by transformer.WordPieceDocumentDecorator.
- See:
- __init__(model, torch_config=None, token_embeddings=True, sent_embeddings=True, skip_oov=False)¶
- model: WordEmbedModel¶
The word embedding model for populating tokens and sentences.
- torch_config: Optional[TorchConfig] = None¶
The Torch configuration used to allocate the embeddings on either the GPU or the CPU. If None, then NumPy numpy.ndarray arrays are used instead of torch.Tensor.
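The (1, d) token shape described above can be illustrated with a minimal sketch. It uses plain nested lists rather than numpy.ndarray or torch.Tensor, and the `shape` and `pool` helpers are hypothetical names introduced here, not part of the library. The point is only that a non-contextual token embedding looks like a word piece embedding with a single piece, so the same code can handle both.

```python
from typing import List, Tuple

Matrix = List[List[float]]


def shape(m: Matrix) -> Tuple[int, int]:
    """Return (rows, columns) of a nested-list matrix."""
    return (len(m), len(m[0]))


def pool(m: Matrix) -> List[float]:
    """Average the rows, giving one vector of dimension d per token."""
    n_rows = len(m)
    return [sum(column) / n_rows for column in zip(*m)]


# A non-contextual word vector: always one row, so the shape is (1, d).
word_token: Matrix = [[1.0, 2.0, 3.0]]

# A token split into two word pieces (illustrative values): shape (2, d).
piece_token: Matrix = [[1.0, 2.0, 3.0], [3.0, 4.0, 5.0]]

# The same pooling code works for both because the first axis always exists.
word_vector = pool(word_token)
piece_vector = pool(piece_token)
```

Because the leading dimension is always present, downstream code does not need a separate branch for word vectors and word piece vectors.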
zensols.deepnlp.embed.domain module¶
Interface file for word vectors, aka non-contextual word embeddings.
- class zensols.deepnlp.embed.domain.NoOpWordEmbedModel(name, *args, **kwargs)[source]¶
Bases: WordEmbedModel
A no-op implementation of a WordEmbedModel. This is useful in unit test cases that would otherwise download large models that do not fit in GitHub’s workflow actions environments.
- exception zensols.deepnlp.embed.domain.WordEmbedError[source]¶
Bases: DeepLearnError
Raised for any errors pertaining to word vectors.
- __module__ = 'zensols.deepnlp.embed.domain'¶
- class zensols.deepnlp.embed.domain.WordEmbedModel(name, cache=True, lowercase=False)[source]¶
Bases: PersistableContainer
This is an abstract base class that represents a set of word vectors (e.g. GloVe).
- __init__(name, cache=True, lowercase=False)¶
- cache: bool = True¶
If True, globally cache all data structures. This should be False if more than one embedding across a model type is used.
- property keyed_vectors: KeyedVectors¶
Adapt instances of this class to a gensim keyed vector instance.
- lowercase: bool = False¶
If True, downcase each word for all methods that take a word as input. Use this for embeddings that are only lowercase in order to find more hits when querying for words that have uppercase characters.
- property model_id: str¶
Return a string that uniquely identifies this instance of the embedding model. This should have the type, size and dimension of the embedding.
This string is used to cache models in both CPU and GPU memory so that layers can reuse the same in-memory word embedding matrix.
- name: str¶
The name of the model given by the configuration. It must be unique across word vector type and dimension.
- to_matrix(torch_config)[source]¶
Return a matrix that represents the entire vector embedding as a tensor.
- Parameters:
torch_config (TorchConfig) – indicates where to load the new tensor
- Return type:
Tensor
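The lowercase and model_id behavior described for WordEmbedModel can be sketched with a toy class. Everything here is hypothetical: `ToyEmbedModel`, its `get` method, the `_MATRIX_CACHE` dictionary and the ID format are illustrations of the idea, not the library's implementation, and nested lists stand in for tensors.

```python
from typing import Dict, List, Optional

# Hypothetical process-wide cache keyed by model ID (not the library's own).
_MATRIX_CACHE: Dict[str, List[List[float]]] = {}


class ToyEmbedModel:
    """Hypothetical stand-in for a word embedding model."""

    def __init__(self, name: str, vectors: Dict[str, List[float]],
                 lowercase: bool = False):
        self.name = name
        self.vectors = vectors
        self.lowercase = lowercase

    @property
    def model_id(self) -> str:
        # Encode the type, vocabulary size and dimension of the embedding.
        dim = len(next(iter(self.vectors.values())))
        return f'toy: {self.name}, size={len(self.vectors)}, dim={dim}'

    def get(self, word: str) -> Optional[List[float]]:
        # Downcase the query so cased input can hit a lowercase vocabulary.
        if self.lowercase:
            word = word.lower()
        return self.vectors.get(word)

    def to_matrix(self) -> List[List[float]]:
        # Build the matrix once per model ID, then reuse the same object.
        if self.model_id not in _MATRIX_CACHE:
            rows = [self.vectors[w] for w in sorted(self.vectors)]
            _MATRIX_CACHE[self.model_id] = rows
        return _MATRIX_CACHE[self.model_id]


vecs = {'paris': [1.0, 0.0], 'london': [0.0, 1.0]}
cased = ToyEmbedModel('demo', vecs)
uncased = ToyEmbedModel('demo', vecs, lowercase=True)
```

In this sketch `cased.get('Paris')` misses while `uncased.get('Paris')` hits, and both instances share one matrix because they report the same model ID. This is also why the ID needs to be unique per embedding: two different embeddings with the same ID would silently share a matrix.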
zensols.deepnlp.embed.fasttext module¶
FastText word vector implementation.
- class zensols.deepnlp.embed.fasttext.FastTextEmbedModel(name, cache=True, lowercase=False, path=None, installer=None, resource=None, desc='2M', dimension=300, corpus='crawl')[source]¶
Bases: TextWordEmbedModel
This class reads the FastText word vector text data format and provides an instance of a WordEmbedModel. Files with names that look like crawl-300d-2M.vec can be downloaded with the link below.
- See:
- __init__(name, cache=True, lowercase=False, path=None, installer=None, resource=None, desc='2M', dimension=300, corpus='crawl')¶
zensols.deepnlp.embed.glove module¶
This module contains the definition of a class that operates like a dict to retrieve GloVe word embeddings. It also creates, stores and reads a binary representation for quick loading on startup.
- class zensols.deepnlp.embed.glove.GloveWordEmbedModel(name, cache=True, lowercase=False, path=None, installer=None, resource=None, desc='6B', dimension=50, vocab_size=400000)[source]¶
Bases: TextWordEmbedModel
This class uses the Stanford pretrained GloVe embeddings as a dict-like Python object. It loads the GloVe vectors from a text file and then creates a binary file that’s quick to load on subsequent uses.
An example configuration would be:
    [glove_embedding]
    class_name = zensols.deepnlp.embed.GloveWordEmbedModel
    path = path: ${default:corpus_dir}/glove
    desc = 6B
    dimension = 50
- __init__(name, cache=True, lowercase=False, path=None, installer=None, resource=None, desc='6B', dimension=50, vocab_size=400000)¶
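To see how the `${default:corpus_dir}` reference in the example configuration resolves, the following sketch parses the same section with the standard library's configparser. This is only an approximation: the library reads configuration through its own framework, which is not shown here, and the `[default]` section with its `/tmp/corpus` value is a hypothetical addition. The `path:` prefix is left as a literal string since its handling is not described in this section.

```python
from configparser import ConfigParser, ExtendedInterpolation

# The [default] section and its value are assumptions made for illustration.
CONFIG = """
[default]
corpus_dir = /tmp/corpus

[glove_embedding]
class_name = zensols.deepnlp.embed.GloveWordEmbedModel
path = path: ${default:corpus_dir}/glove
desc = 6B
dimension = 50
"""

# ExtendedInterpolation understands the ${section:option} syntax.
parser = ConfigParser(interpolation=ExtendedInterpolation())
parser.read_string(CONFIG)

section = parser['glove_embedding']
resolved_path = section['path']          # interpolated value
dimension = section.getint('dimension')  # parsed as an integer
desc = section['desc']
```

The `desc` and `dimension` options correspond to the constructor parameters of the same names listed in the signature above.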
zensols.deepnlp.embed.word2vec module¶
Convenience Gensim glue code for word embeddings/vectors.
- class zensols.deepnlp.embed.word2vec.Word2VecModel(name, cache=True, lowercase=False, installer=None, resource=None, dimension=300, model_type='keyed')[source]¶
Bases: WordEmbedModel
Load keyed or non-keyed Gensim models.
- __init__(name, cache=True, lowercase=False, installer=None, resource=None, dimension=300, model_type='keyed')¶
- installer: Installer = None¶
The installer used for the text vector zip file.
- resource: Resource = None¶
The zip resource used to find the path to the model files.
zensols.deepnlp.embed.wordtext module¶
Contains an abstract class that makes it easier to implement loading word vectors from text files.
- class zensols.deepnlp.embed.wordtext.DefaultTextWordEmbedModel(name='unknown_name', cache=True, lowercase=False, path=None, installer=None, resource=None, desc='unknown_desc', dimension=50, vocab_size=0, file_name_pattern='{name}.{desc}.{dimension}d.txt')[source]¶
Bases: TextWordEmbedModel
This class uses the Stanford pretrained GloVe embeddings as a dict-like Python object. It loads the GloVe vectors from a text file and then creates a binary file that’s quick to load on subsequent uses.
An example configuration would be:
    [glove_embedding]
    class_name = zensols.deepnlp.embed.GloveWordEmbedModel
    path = path: ${default:corpus_dir}/glove
    desc = 6B
    dimension = 50
- __init__(name='unknown_name', cache=True, lowercase=False, path=None, installer=None, resource=None, desc='unknown_desc', dimension=50, vocab_size=0, file_name_pattern='{name}.{desc}.{dimension}d.txt')¶
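The default `file_name_pattern` in the signature above reads like a Python format string. As a hedged sketch, assuming the pattern is expanded with `str.format` style substitution (the section does not confirm this), the GloVe values from the example configuration would give the following file name. The `name='glove'` value and the `/tmp/corpus/glove` directory are hypothetical.

```python
from pathlib import Path

# Default pattern copied from the constructor signature.
file_name_pattern = '{name}.{desc}.{dimension}d.txt'

# Assumed expansion; name='glove' is an illustrative value.
file_name = file_name_pattern.format(name='glove', desc='6B', dimension=50)

# Hypothetical corpus directory used only to show the joined path.
vector_file = Path('/tmp/corpus/glove') / file_name
```

Under these assumptions `file_name` is `glove.6B.50d.txt`, which matches the naming of the publicly distributed GloVe text files.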
- class zensols.deepnlp.embed.wordtext.TextWordEmbedModel(name, cache=True, lowercase=False, path=None, installer=None, resource=None)[source]¶
Bases: WordEmbedModel, Primeable
Extensions of this class read a text vectors file, compile it, and then write a binary representation for fast loading.
- DATASET_NAME = 'vec'¶
Name of the dataset in the HDF5 file.
- __init__(name, cache=True, lowercase=False, path=None, installer=None, resource=None)¶
- installer: Installer = None¶
The installer used for the text vector zip file.
- property metadata¶
Return the metadata used to construct paths to both the text source vector file and all generated binary files.
- resource: Resource = None¶
The zip resource used to find the path to the model files.
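The read, compile and write flow that TextWordEmbedModel extensions perform can be sketched with the standard library alone. This is not the library's code: the binary representation here is a small custom layout (a row and column header followed by raw doubles) swapped in for the HDF5 file the class uses, and `parse_text_vectors`, `write_binary` and `read_binary` are hypothetical helper names. The input is assumed to follow the GloVe text layout of one word followed by space-separated floats per line.

```python
import io
import struct
from array import array
from typing import Iterable, List, Tuple


def parse_text_vectors(lines: Iterable[str]) -> Tuple[List[str], List[List[float]]]:
    """Split each 'word v1 v2 ...' line into a word and its vector."""
    words: List[str] = []
    vectors: List[List[float]] = []
    for line in lines:
        parts = line.rstrip().split(' ')
        if len(parts) < 2:
            continue  # skip blank or malformed lines
        words.append(parts[0])
        vectors.append([float(x) for x in parts[1:]])
    return words, vectors


def write_binary(vectors: List[List[float]], out) -> None:
    """Write a (rows, dim) header followed by the flattened doubles."""
    rows, dim = len(vectors), len(vectors[0])
    out.write(struct.pack('<II', rows, dim))
    out.write(array('d', (x for row in vectors for x in row)).tobytes())


def read_binary(inp) -> List[List[float]]:
    """Read back the matrix written by write_binary."""
    rows, dim = struct.unpack('<II', inp.read(8))
    flat = array('d')
    flat.frombytes(inp.read(rows * dim * flat.itemsize))
    return [list(flat[i * dim:(i + 1) * dim]) for i in range(rows)]


text = ['the 0.25 0.5 0.75', 'cat 1.0 1.5 2.0']
words, vectors = parse_text_vectors(text)

buffer = io.BytesIO()          # stands in for the binary file on disk
write_binary(vectors, buffer)
buffer.seek(0)
loaded = read_binary(buffer)   # fast path used on subsequent loads
```

Parsing floats from text is the slow step, so paying that cost once and reloading the packed matrix afterwards is the motivation for keeping a binary copy.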
- class zensols.deepnlp.embed.wordtext.TextWordModelMetadata(name, desc, dimension, n_vocab, source_path, sub_directory=None)[source]¶
Bases: Dictable
Describes a text-based WordEmbedModel. The information in this class is used to construct paths to both the text source vector file and all generated binary files.
- __init__(name, desc, dimension, n_vocab, source_path, sub_directory=None)¶