zensols.deepnlp.vectorize package¶
Submodules¶
zensols.deepnlp.vectorize.embed module¶
This file contains a stash used to load an embedding layer. It creates features in batches of matrices and persists only the matrices (sans features) for efficient retrieval.
- class zensols.deepnlp.vectorize.embed.EmbeddingFeatureVectorizer(name, config_factory, feature_id, manager, encode_transformed, fold_method, embed_model, decode_embedding=False)[source]¶
Bases: FoldingDocumentVectorizer, Primeable, Dictable

Vectorize a FeatureDocument as a vector of embedding indexes. Later, these indexes are used in an EmbeddingLayer to create the input word embedding during execution of the model.

- __init__(name, config_factory, feature_id, manager, encode_transformed, fold_method, embed_model, decode_embedding=False)¶
- decode_embedding: bool = False¶
Whether or not to decode the embedding during the decode phase, which is helpful when caching batches; otherwise, the data is decoded from indexes to embeddings each epoch.
Note that this option and functionality cannot be obviated by that implemented with the encode_transformed attribute. The difference is whether more work is done during decoding rather than encoding. An example of when this is useful is for large word embeddings (e.g. Google 300D pretrained), where the index-to-tensor embedding transform is done while decoding rather than in the forward pass, so it is not done for every epoch.
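To make the tradeoff concrete, the following is a minimal, illustrative PyTorch sketch (not the library's implementation): with decode_embedding=False only the indexes are cached and the lookup runs in every forward pass, while decode_embedding=True resolves indexes to dense vectors once at decode time:

    # Illustrative sketch of the decode_embedding tradeoff (toy sizes
    # are assumptions; this is not the library's implementation).
    import torch

    embedding = torch.nn.Embedding(num_embeddings=10, embedding_dim=4)
    token_indexes = torch.tensor([[1, 3, 5]])  # what the encoder persists

    # decode_embedding=False: cache the indexes; the index-to-vector
    # lookup happens in the model's forward pass on every epoch.
    def forward_lookup(indexes: torch.Tensor) -> torch.Tensor:
        return embedding(indexes)

    # decode_embedding=True: resolve indexes to dense vectors while
    # decoding, so cached batches already hold the embedding tensors.
    decoded = embedding(token_indexes).detach()

    assert torch.equal(forward_lookup(token_indexes), decoded)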
- embed_model: Union[WordEmbedModel, TransformerEmbedding]¶
The word vector model.
Types for this value include WordEmbedModel and TransformerEmbedding.
- class zensols.deepnlp.vectorize.embed.WordVectorEmbeddingFeatureVectorizer(name, config_factory, feature_id, manager, encode_transformed, fold_method, embed_model, decode_embedding=False, token_feature_id='norm')[source]¶
Bases: EmbeddingFeatureVectorizer

Vectorize sentences using an embedding model (embed_model) of type WordEmbedModel.

The encoder returns the indices of the word embedding for each token in the input FeatureDocument. The decoder returns the corresponding word embedding vectors if decode_embedding is True. Otherwise, it returns the same indices, which are later used by the embedding layer (usually EmbeddingLayer).

- DESCRIPTION: ClassVar[str] = 'word vector document embedding'¶
- FEATURE_TYPE: ClassVar[TextFeatureType] = 4¶
- __init__(name, config_factory, feature_id, manager, encode_transformed, fold_method, embed_model, decode_embedding=False, token_feature_id='norm')¶
- token_feature_id: str = 'norm'¶
The FeatureToken attribute used to index the embedding vectors.
- property vectors: Tensor¶
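As an illustration of how token_feature_id selects the token attribute used for indexing, here is a minimal sketch with hypothetical token and vocabulary objects (Token and word_to_index are assumptions, not the library's API):

    # Hypothetical sketch: map each token's 'norm' attribute to an
    # embedding row index (names are illustrative).
    word_to_index = {'<unk>': 0, 'the': 1, 'cat': 2, 'sat': 3}

    class Token:
        def __init__(self, norm: str):
            self.norm = norm

    tokens = [Token('the'), Token('cat'), Token('sat')]
    token_feature_id = 'norm'
    indexes = [word_to_index.get(getattr(tok, token_feature_id), 0)
               for tok in tokens]
    print(indexes)  # [1, 2, 3]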
zensols.deepnlp.vectorize.manager module¶
An extension of a feature vectorizer manager that parses and vectorizes natural language.
- class zensols.deepnlp.vectorize.manager.FeatureDocumentVectorizer(name, config_factory, feature_id, manager, encode_transformed)[source]¶
Bases: TransformableFeatureVectorizer

Creates document or sentence level features using instances of TokenContainer.

Subclasses implement specific vectorization on a single document using _encode(), and it is up to the subclass to decide how to vectorize the document.

Multiple documents, given as an aggregate list or tuple of documents, are supported. Only the document level vectorization is supported to provide one standard contract across framework components and vectorizers.
If more than one document is given during encoding, it will be combined into one document as described using FoldingDocumentVectorizer.encoding_level = concat_tokens.

- __init__(name, config_factory, feature_id, manager, encode_transformed)¶
- encode(doc)[source]¶
Encode by combining documents into one monolithic document when a tuple is passed; otherwise, default to the superclass's encode functionality.
- Return type:
FeatureContext
- property feature_type: TextFeatureType¶
The type of feature this vectorizer generates. This is used by classes such as EmbeddingNetworkModule to determine where to add the features, such as concatenating to the embedding layer, the join layer, etc.
- class zensols.deepnlp.vectorize.manager.FeatureDocumentVectorizerManager(name, config_factory, torch_config, configured_vectorizers, doc_parser, token_length, token_feature_ids=None, configured_spacy_vectorizers=())[source]¶
Bases: FeatureVectorizerManager

Creates and manages instances of FeatureDocumentVectorizer and parses text into feature based documents.

This is used to manage the relationship of a given set of parsed features, keeping in mind that parsing will usually happen as a preprocessing step. A second step is the vectorization of those features, which can be any proper subset of those features parsed in the previous step. However, these checks are of course not necessary if pickling isn't used across the parse and vectorization steps.
Instances can set a hard fixed token length, in which case vectorized tensors have a fixed width based on the setting of token_length. However, this can also be set to use the longest sentence of the document, which is useful when computing vectorized tensors from the document as a batch, even if the input data are batched as a group of sentences in a document.
- See:
parse()
- __init__(name, config_factory, torch_config, configured_vectorizers, doc_parser, token_length, token_feature_ids=None, configured_spacy_vectorizers=())¶
- configured_spacy_vectorizers: Tuple[SpacyFeatureVectorizer, ...] = ()¶
Additional vectorizers that aren't registered, such as those added from external packages.
- doc_parser: FeatureDocumentParser¶
Used to parse() documents.
- get_token_length(doc)[source]¶
Get the token length for the document. If is_batch_token_length is True, then the token length is computed based on the longest sentence in the document doc. See the class docs.
- Parameters:
doc (FeatureDocument) – used to compute the longest sentence if is_batch_token_length is True
- Return type:
int
- Returns:
the (global) token length for the document
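A minimal sketch of this behavior, assuming a document represented as sentences of token lists (the names here are illustrative, not the library's API):

    # Illustrative: batch token length uses the longest sentence;
    # otherwise a hard fixed token_length is used.
    doc = [['the', 'cat', 'sat'], ['down']]  # sentences as token lists

    def get_token_length(doc, is_batch_token_length: bool,
                         token_length: int = 20) -> int:
        if is_batch_token_length:
            return max(len(sent) for sent in doc)
        return token_length

    print(get_token_length(doc, True))   # 3
    print(get_token_length(doc, False))  # 20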
- property is_batch_token_length: bool¶
Return whether or not the token length is variable based on the longest token length in the batch.
- property ordered_spacy_vectorizers: Tuple[Tuple[str, SpacyFeatureVectorizer], ...]¶
The spaCy vectorizers in a guaranteed stable ordering.
- parse(text, *args, **kwargs)[source]¶
Parse text given either as a string or as a list of sentence strings.
Important: parsing documents through this manager instance is preferred, since safety checks verify that the features available match those used when the documents were parsed before pickling.
- property spacy_vectorizers: Dict[str, SpacyFeatureVectorizer]¶
Return vectorizers based on the token_feature_ids configured on this instance. Keys are token level feature ids found in SpacyFeatureVectorizer.VECTORIZERS.
- Returns:
a collections.OrderedDict of vectorizers
- token_feature_ids: Set[str] = None¶
Indicates which spaCy parsed features to generate in the vectorizers held in this instance. Examples include norm, ent, dep, tag.
If this is not set, it defaults to the token_feature_ids in doc_parser.
- See:
SpacyFeatureVectorizer.VECTORIZERS
- class zensols.deepnlp.vectorize.manager.FoldingDocumentVectorizer(name, config_factory, feature_id, manager, encode_transformed, fold_method)[source]¶
Bases: FeatureDocumentVectorizer

This class is like FeatureDocumentVectorizer, but provides more options in how to fold multiple documents into a single document for vectorization.

Based on the value of fold_method, this class encodes a sequence of FeatureDocument instances differently.

Subclasses must implement _encode().

Note: this is not to be confused with the MultiDocumentVectorizer vectorizer, which vectorizes multiple documents into document level features.

- __init__(name, config_factory, feature_id, manager, encode_transformed, fold_method)¶
- decode(context)[source]¶
Decode a (potentially) unpickled context and return a tensor using the manager's torch_config.
- Return type:
Tensor
- encode(doc)[source]¶
Encode by combining documents into one monolithic document when a tuple is passed; otherwise, default to the superclass's encode functionality.
- Return type:
FeatureContext
- fold_method: str¶
How multiple documents are merged into a single document for vectorization, which is one of:
- raise: raise an error allowing only single documents to be vectorized
- concat_tokens: concatenate tokens of each document into singleton sentence documents; uses combine_documents() with concat_tokens = True
- sentence: all sentences of all documents become singleton sentence documents; uses combine_documents() with concat_tokens = False
- separate: every sentence of each document is encoded separately, then each sentence output is concatenated as the respective document during decoding; this uses _encode() for each sentence of each document and _decode() to decode back into the same represented document structure as the original
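The following toy sketch illustrates the concat_tokens and sentence strategies on plain token lists (an assumed representation for illustration; the library operates on FeatureDocument instances):

    # Illustrative folding of two "documents" (lists of sentences,
    # which are lists of tokens); not the library's implementation.
    doc_a = [['the', 'cat'], ['sat', 'down']]
    doc_b = [['a', 'dog']]

    def fold(docs, method: str):
        if method == 'concat_tokens':
            # one singleton-sentence document of all tokens concatenated
            return [[tok for doc in docs for sent in doc for tok in sent]]
        if method == 'sentence':
            # every sentence of every document becomes its own document
            return [sent for doc in docs for sent in doc]
        raise ValueError(f'unknown fold method: {method}')

    print(fold([doc_a, doc_b], 'concat_tokens'))
    # [['the', 'cat', 'sat', 'down', 'a', 'dog']]
    print(fold([doc_a, doc_b], 'sentence'))
    # [['the', 'cat'], ['sat', 'down'], ['a', 'dog']]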
- class zensols.deepnlp.vectorize.manager.MultiDocumentVectorizer(name, config_factory, feature_id, manager, encode_transformed)[source]¶
Bases: FeatureDocumentVectorizer

Vectorizes multiple documents into document level features. Features generated by subclasses are sometimes used in join layers. Examples include OverlappingFeatureDocumentVectorizer.

This is not to be confused with FoldingDocumentVectorizer, which merges multiple documents into a single document for vectorization.

- FEATURE_TYPE = 2¶
- __init__(name, config_factory, feature_id, manager, encode_transformed)¶
- class zensols.deepnlp.vectorize.manager.TextFeatureType(value, names=<not given>, *values, module=None, qualname=None, type=None, start=1, boundary=None)[source]¶
Bases: Enum

The type of FeatureDocumentVectorizer.

- DOCUMENT = 2¶
Document level, typically added to a join layer.
- EMBEDDING = 4¶
Embedding layer, typically used as the input layer.
- MULTI_DOCUMENT = 3¶
Multiple documents for the purposes of aggregating shared features.
- NONE = 5¶
Other type, which tells the framework to ignore the vectorized features.
- TOKEN = 1¶
Token level with a shape congruent with the number of tokens, typically concatenated with the embedding layer.
zensols.deepnlp.vectorize.spacy module¶
Feature (ID) normalization.
- class zensols.deepnlp.vectorize.spacy.DependencyFeatureVectorizer(name, config_factory, feature_id, description, torch_config, model, symbols)[source]¶
Bases: SpacyFeatureVectorizer

A feature vectorizer for dependency head trees.
- DESCRIPTION: ClassVar[str] = 'dependency'¶
- FEATURE_ID: ClassVar[str] = 'dep'¶
- LANG: ClassVar[str] = 'en'¶
- SYMBOLS: ClassVar[str] = 'acl acomp advcl advmod agent amod appos attr aux\nauxpass case cc ccomp clf complm compound conj cop csubj csubjpass dative dep\ndet discourse dislocated dobj expl fixed flat goeswith hmod hyph infmod intj\niobj list mark meta neg nmod nn npadvmod nsubj nsubjpass nounmod npmod num\nnumber nummod oprd obj obl orphan parataxis partmod pcomp pobj poss possessive\npreconj prep prt punct quantmod rcmod relcl reparandum root vocative xcomp ROOT'¶
- __init__(name, config_factory, feature_id, description, torch_config, model, symbols)¶
- class zensols.deepnlp.vectorize.spacy.NamedEntityRecognitionFeatureVectorizer(name, config_factory, feature_id, description, torch_config, model, symbols)[source]¶
Bases: SpacyFeatureVectorizer

A feature vectorizer for NER tags.
- DESCRIPTION: ClassVar[str] = 'named entity recognition'¶
- FEATURE_ID: ClassVar[str] = 'ent'¶
- LANG: ClassVar[str] = 'en'¶
- SYMBOLS: ClassVar[str] = 'PERSON NORP FACILITY FAC ORG GPE LOC PRODUCT\nEVENT WORK_OF_ART LAW LANGUAGE DATE TIME PERCENT MONEY QUANTITY ORDINAL CARDINAL\nPER MISC'¶
- __init__(name, config_factory, feature_id, description, torch_config, model, symbols)¶
- class zensols.deepnlp.vectorize.spacy.PartOfSpeechFeatureVectorizer(name, config_factory, feature_id, description, torch_config, model, symbols)[source]¶
Bases: SpacyFeatureVectorizer

A feature vectorizer for POS tags.
- DESCRIPTION: ClassVar[str] = 'part of speech'¶
- FEATURE_ID: ClassVar[str] = 'tag'¶
- LANG: ClassVar[str] = 'en'¶
- SYMBOLS: ClassVar[str] = 'ADJ ADP ADV AUX CONJ CCONJ DET INTJ NOUN NUM\nPART PRON PROPN PUNCT SCONJ SYM VERB X EOL SPACE . , -LRB- -RRB- `` " \' $ # AFX\nCC CD DT EX FW HYPH IN JJ JJR JJS LS MD NIL NN NNP NNPS NNS PDT POS PRP PRP$ RB\nRBR RBS RP TO UH VB VBD VBG VBN VBP VBZ WDT WP WP$ WRB SP ADD NFP GW XX BES HVS\nNP PP VP ADVP ADJP SBAR PRT PNP'¶
- __init__(name, config_factory, feature_id, description, torch_config, model, symbols)¶
- class zensols.deepnlp.vectorize.spacy.SpacyFeatureVectorizer(name, config_factory, feature_id, description, torch_config, model, symbols)[source]¶
Bases: FeatureVectorizer

This normalizes feature IDs of parsed token features into a number between [0, 1]. This is useful for normalized feature vectors as input to neural networks. Input to this would be strings like token.ent_ found on a zensols.nlp.feature.TokenAttributes instance.

The class is also designed to create features using indexes, so there are methods to resolve to a unique ID from an identifier.

Instances of this class behave like a dict.

All symbols are taken from spacy.glossary.GLOSSARY.
- Parameters:
vocab – the vocabulary used for from_spacy to compute the normalized feature from the spaCy ID (e.g. token.ent_, token.tag_, etc.)
- See:
spacy.glossary.GLOSSARY
- See:
zensols.nlp.feature.TokenAttributes
- __init__(name, config_factory, feature_id, description, torch_config, model, symbols)¶
- dist(symbol)[source]¶
Return a normalized feature float if symbol is found.
- Return type:
float
- Returns:
a normalized value between [0, 1] or None if the symbol isn't found
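A minimal sketch of such a normalization, assuming a fixed symbol vocabulary (illustrative, not the library's code):

    # Illustrative normalization of a symbol's vocabulary position
    # into [0, 1]; the symbol set here is a truncated assumption.
    symbols = 'PERSON NORP ORG GPE LOC'.split()

    def dist(symbol: str):
        """Return a normalized float in [0, 1], or None if not found."""
        if symbol not in symbols:
            return None
        return symbols.index(symbol) / (len(symbols) - 1)

    print(dist('PERSON'))  # 0.0
    print(dist('LOC'))     # 1.0
    print(dist('MISC'))    # None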
- from_spacy(id)[source]¶
Return a binary feature from a spaCy ID, or None if it doesn't have a mapping for the ID.
- Return type:
Tensor
- id_from_spacy(id, default=-1)[source]¶
Return the ID of this vectorizer for the spaCy ID, or -1 if not found.
- Return type:
int
- id_from_spacy_symbol(id, default=-1)[source]¶
Return the spaCy text symbol for its ID (token.ent -> token.ent_).
- Return type:
int
- model: Language¶
The spaCy vocabulary used to create IDs from strings.
- See:
id_from_spacy_symbol()
- property symbols: Sequence[str]¶
The list of symbols to vectorize, provided by spaCy as a feature if a tuple or list. If a string, it is used as the name of the pipe with the labels attribute.
- torch_config: TorchConfig¶
The torch configuration used to create tensors.
zensols.deepnlp.vectorize.vectorizers module¶
Generate and vectorize language features.
- class zensols.deepnlp.vectorize.vectorizers.CountEnumContainerFeatureVectorizer(name, config_factory, feature_id, manager, encode_transformed, decoded_feature_ids=None, string_symbol_feature_ids=None)[source]¶
Bases: DecodedContainerFeatureVectorizer

Vectorize the counts of parsed spaCy features. This generates the count of tokens as an S × (M × N) tensor, where S is the number of sentences, M is the number of token feature ids and N is the number of columns of the output of the SpacyFeatureVectorizer vectorizer. Each column position's count represents the number of occurrences of that spaCy symbol at that index position in the output of SpacyFeatureVectorizer.

This class uses the same efficiency in decoding features given in EnumContainerFeatureVectorizer.
- Shape:
- ATTR_EXP_META = ('decoded_feature_ids',)¶
- DESCRIPTION = 'token level feature counts'¶
- FEATURE_TYPE = 2¶
- __init__(name, config_factory, feature_id, manager, encode_transformed, decoded_feature_ids=None, string_symbol_feature_ids=None)¶
- get_feature_counts(sent, fvec)[source]¶
Return the count of all tokens as an S × N tensor, where S is the number of sentences and N is the number of columns of the fvec vectorizer. Each column position's count represents the number of occurrences of that spaCy symbol at that index position in fvec.
- Return type:
Tensor
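An illustrative version of this counting, using a toy symbol vocabulary in place of a SpacyFeatureVectorizer:

    # Illustrative S x N count tensor: S sentences by N symbols (the
    # vocabulary is an assumption, not the library's implementation).
    import torch

    symbols = ['NOUN', 'VERB', 'DET']
    sents = [['DET', 'NOUN', 'VERB'], ['NOUN', 'NOUN']]
    counts = torch.zeros(len(sents), len(symbols))
    for i, sent in enumerate(sents):
        for tag in sent:
            counts[i, symbols.index(tag)] += 1
    print(counts)
    # tensor([[1., 1., 1.],
    #         [2., 0., 0.]])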
- class zensols.deepnlp.vectorize.vectorizers.DecodedContainerFeatureVectorizer(name, config_factory, feature_id, manager, encode_transformed, decoded_feature_ids=None)[source]¶
Bases: FeatureDocumentVectorizer

A base class that allows for configuring decoded features after batches are created at train time.
- __init__(name, config_factory, feature_id, manager, encode_transformed, decoded_feature_ids=None)¶
- decoded_feature_ids: Set[str] = None¶
The spaCy generated features used only during decoding (see class docs). Examples include norm, ent, dep, tag. When set to None, all those given in the spacy_vectorizers are used.
- write(depth=0, writer=<_io.TextIOWrapper name='<stdout>' mode='w' encoding='utf-8'>)[source]¶
Write the contents of this instance to writer using indentation depth.
- Parameters:
depth (int) – the starting indentation depth
writer (TextIOBase) – the writer to dump the content of this writable
- class zensols.deepnlp.vectorize.vectorizers.DepthFeatureDocumentVectorizer(name, config_factory, feature_id, manager, encode_transformed)[source]¶
Bases: FeatureDocumentVectorizer

Generate the depths of tokens based on how deep they are in a head dependency tree.

Even though this is a document level vectorizer and is usually added in a join layer rather than stacked on to the embedded layer, it still assumes congruence with the token length, which is used in its shape.

Important: do not combine sentences into a single document with combine_sentences(), since features are created as a dependency parse tree at the sentence level. Otherwise, the dependency relations are broken and result in a zeroed tensor.
- Shape:
(|sentences|, |sentinel tokens|, 1)
- DESCRIPTION = 'head depth'¶
- FEATURE_TYPE = 1¶
- __init__(name, config_factory, feature_id, manager, encode_transformed)¶
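A sketch of the underlying idea, computing each token's depth from per-token head indexes (the array encoding is an assumption for illustration):

    # Illustrative token depth in a head dependency tree: heads[i] is
    # the index of token i's head, and the root points to itself.
    heads = [1, 1, 1, 2]  # token 1 is the root

    def depth(i: int) -> int:
        d = 0
        while heads[i] != i:
            i = heads[i]
            d += 1
        return d

    print([depth(i) for i in range(len(heads))])  # [1, 0, 1, 2]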
- class zensols.deepnlp.vectorize.vectorizers.EnumContainerFeatureVectorizer(name, config_factory, feature_id, manager, encode_transformed, decoded_feature_ids=None, string_symbol_feature_ids=None)[source]¶
Bases: DecodedContainerFeatureVectorizer

Encode tokens found in the container by aggregating the spaCy vectorizers' output. The result is a concatenated binary representation of all configured token level features for each token. This adds only token vectorizer features generated by the spaCy vectorizers (subclasses of SpacyFeatureVectorizer), and not the features themselves (such as is_stop, etc.).

All spaCy features are encoded as given by spacy_vectorizers. However, only those given in decoded_feature_ids are produced in the output tensor after decoding.

The motivation for encoding all features but decoding only a subset is feature selection during training: encoding the features (in a sparse matrix) takes comparatively less time and space than having to re-encode all batches.

Rows are tokens; columns are intervals of features. The encoded matrix is sparse, and is decoded as a dense matrix.
- Shape:
- See:
- ATTR_EXP_META = ('decoded_feature_ids',)¶
- DESCRIPTION = 'spacy feature vectorizer'¶
- FEATURE_TYPE = 1¶
- __init__(name, config_factory, feature_id, manager, encode_transformed, decoded_feature_ids=None, string_symbol_feature_ids=None)¶
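The encode-all, decode-a-subset idea above can be sketched as follows (the toy vocabularies and column slicing are assumptions for illustration):

    # Illustrative: encode all features as concatenated one-hot
    # columns, then decode only a selected feature's column interval.
    import torch

    vocab = {'tag': ['NOUN', 'VERB'], 'dep': ['nsubj', 'ROOT']}
    tokens = [{'tag': 'NOUN', 'dep': 'nsubj'},
              {'tag': 'VERB', 'dep': 'ROOT'}]

    def encode(tokens, feature_ids):
        rows = []
        for tok in tokens:
            row = []
            for fid in feature_ids:
                one_hot = [0.] * len(vocab[fid])
                one_hot[vocab[fid].index(tok[fid])] = 1.
                row.extend(one_hot)
            rows.append(row)
        return torch.tensor(rows)

    full = encode(tokens, ['tag', 'dep'])   # encode everything once
    tag_cols = slice(0, len(vocab['tag']))  # decode only 'tag' later
    print(full)
    print(full[:, tag_cols])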
- class zensols.deepnlp.vectorize.vectorizers.MutualFeaturesContainerFeatureVectorizer(name, config_factory, feature_id, manager, encode_transformed, count_vectorizer_feature_id)[source]¶
Bases: MultiDocumentVectorizer

Vectorize the shared count of all tokens as an S × (M × N) tensor, where S is the number of sentences, M is the number of token feature ids and N is the number of columns of the output of the SpacyFeatureVectorizer vectorizer.

This uses an instance of CountEnumContainerFeatureVectorizer to compute counts across each spaCy feature and then sums them up for only those features shared. If at least one shared document has a zero count, the feature is zeroed.

The input to this feature vectorizer is a tuple of N TokenContainer instances.
- Shape:
(|sentences|, |decoded features|,) from the referenced CountEnumContainerFeatureVectorizer given by count_vectorizer_feature_id
- DESCRIPTION = 'mutual feature counts'¶
- __init__(name, config_factory, feature_id, manager, encode_transformed, count_vectorizer_feature_id)¶
- property count_vectorizer: CountEnumContainerFeatureVectorizer¶
Return the count vectorizer used for the count features.
- count_vectorizer_feature_id: str¶
The string feature ID, configured in the FeatureDocumentVectorizerManager, of the CountEnumContainerFeatureVectorizer to use for the count features.
- property ones: Tensor¶
Return a tensor of ones for the shape of this instance.
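A toy sketch of the mutual count computation (the tensor layout is an assumption, not the library's implementation):

    # Illustrative mutual counts: sum per-document counts, zeroing any
    # feature column where some document has a zero count.
    import torch

    doc_counts = torch.tensor([[2., 1., 0.],   # rows: documents
                               [1., 0., 3.]])  # cols: features
    shared = (doc_counts > 0).all(dim=0).float()
    mutual = doc_counts.sum(dim=0) * shared
    print(mutual)  # tensor([3., 0., 0.])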
- class zensols.deepnlp.vectorize.vectorizers.OneHotEncodedFeatureDocumentVectorizer(name, config_factory, feature_id, manager, categories, optimize_bools, encode_transformed, feature_attribute=None, level='token')[source]¶
Bases: FeatureDocumentVectorizer, OneHotEncodedEncodableFeatureVectorizer

Vectorize nominal enumerated features into one-hot encoded vectors. The feature is taken from a FeatureToken. If level is token, then the features are token attributes identified by feature_attribute. If the level is document, the feature is taken from the document.
- Shape:
level = document: (1, |categories|)
level = token: (|<sentences>|, |<sentinel tokens>|, |categories|)
- DESCRIPTION = 'encoded feature document vectorizer'¶
- __init__(name, config_factory, feature_id, manager, categories, optimize_bools, encode_transformed, feature_attribute=None, level='token')¶
- property feature_type: TextFeatureType¶
The type of feature this vectorizer generates. This is used by classes such as EmbeddingNetworkModule to determine where to add the features, such as concatenating to the embedding layer, the join layer, etc.
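An illustrative token-level one-hot encoding over a small set of categories (the categories and attribute values are assumptions):

    # Illustrative one-hot encoding of a nominal token attribute.
    import torch

    categories = ['PER', 'ORG', 'LOC']
    token_values = ['ORG', 'PER']  # one attribute value per token
    one_hot = torch.zeros(len(token_values), len(categories))
    for i, value in enumerate(token_values):
        one_hot[i, categories.index(value)] = 1.
    print(one_hot)
    # tensor([[0., 1., 0.],
    #         [1., 0., 0.]])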
- class zensols.deepnlp.vectorize.vectorizers.OverlappingFeatureDocumentVectorizer(name, config_factory, feature_id, manager, encode_transformed)[source]¶
Bases: MultiDocumentVectorizer

Vectorize the number of normalized and lemmatized tokens (in this order) across multiple documents.

The input to this feature vectorizer is a tuple of N FeatureDocument instances.
- Shape:
(2,)
- DESCRIPTION = 'overlapping token counts'¶
- __init__(name, config_factory, feature_id, manager, encode_transformed)¶
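A sketch of the overlap computation on hypothetical token sets (the sets themselves are assumptions for illustration):

    # Illustrative overlap counts of normalized and lemmatized tokens
    # across two documents; output shape is (2,).
    norms_a, norms_b = {'the', 'cat', 'sat'}, {'a', 'cat', 'sat'}
    lemmas_a, lemmas_b = {'the', 'cat', 'sit'}, {'a', 'cat', 'sit'}
    overlap = (len(norms_a & norms_b), len(lemmas_a & lemmas_b))
    print(overlap)  # (2, 2)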
- class zensols.deepnlp.vectorize.vectorizers.StatisticsFeatureDocumentVectorizer(name, config_factory, feature_id, manager, encode_transformed)[source]¶
Bases: FeatureDocumentVectorizer

Vectorizes basic surface language statistics, which include:
character count
token count
min token length in characters
max token length in characters
average token length in characters (|characters| / |tokens|)
sentence count (for FeatureDocuments)
average sentence length (|tokens| / |sentences|)
min sentence length
max sentence length
- Shape:
(1, 9,)
- DESCRIPTION = 'statistics'¶
- FEATURE_TYPE = 2¶
- __init__(name, config_factory, feature_id, manager, encode_transformed)¶
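The nine statistics can be sketched directly from a document given as sentences of token strings (an illustrative representation, not the library's code):

    # Illustrative computation of the nine surface statistics.
    sents = [['The', 'cat', 'sat'], ['It', 'slept']]
    tokens = [tok for sent in sents for tok in sent]
    lens = [len(tok) for tok in tokens]
    stats = [
        sum(lens),                    # character count
        len(tokens),                  # token count
        min(lens),                    # min token length
        max(lens),                    # max token length
        sum(lens) / len(tokens),      # average token length
        len(sents),                   # sentence count
        len(tokens) / len(sents),     # average sentence length
        min(len(s) for s in sents),   # min sentence length
        max(len(s) for s in sents),   # max sentence length
    ]
    print(stats)  # [16, 5, 2, 5, 3.2, 2, 2.5, 2, 3]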
- class zensols.deepnlp.vectorize.vectorizers.TokenEmbeddingFeatureVectorizer(name, config_factory, feature_id, manager, encode_transformed, delegate_feature_id, size=-1, pad_label=-100, level=TextFeatureType.TOKEN, add_dims=0)[source]¶
Bases: AggregateEncodableFeatureVectorizer, FeatureDocumentVectorizer

An AggregateEncodableFeatureVectorizer that is useful for token level classification (e.g. NER). It uses a delegate to first vectorize the features, then concatenates them into one aggregate.

In shape terms, this takes the single sentence position. The additional unsqueezed dimensions set with n_unsqueeze are useful when the delegate vectorizer encodes booleans or any other value that does not take an additional dimension.
- Shape:
(1, |tokens|, <delegate vectorizer shape>[, <unsqueeze dimensions>])
- DESCRIPTION = 'token aggregate vectorizer'¶
- __init__(name, config_factory, feature_id, manager, encode_transformed, delegate_feature_id, size=-1, pad_label=-100, level=TextFeatureType.TOKEN, add_dims=0)¶
- property feature_type: TextFeatureType¶
The type of feature this vectorizer generates. This is used by classes such as EmbeddingNetworkModule to determine where to add the features, such as concatenating to the embedding layer, the join layer, etc.
- level: TextFeatureType = 1¶
The level at which to take the attribute value, which is one of document, sentence or token.
- class zensols.deepnlp.vectorize.vectorizers.WordEmbeddingFeatureVectorizer(name, config_factory, feature_id, manager, embed_model)[source]¶
Bases:
EncodableFeatureVectorizer

Vectorizes string tokens into word embedded vectors. This class works directly with the string tokens rather than FeatureDocument instances. It can be useful when there's a need to vectorize tokens outside of a feature document (e.g. cui2vec).

- DESCRIPTION = 'word embedding encoder'¶
- FEATURE_TYPE = 4¶
- __init__(name, config_factory, feature_id, manager, embed_model)¶
- embed_model: WordEmbedModel¶
The word embedding model that has the string token to vector mapping.
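A minimal sketch of vectorizing raw string tokens through a token-to-row mapping (all names here are assumptions, not the library's API):

    # Illustrative word embedding lookup for raw string tokens.
    import torch

    vectors = torch.randn(3, 5)  # |vocabulary| x embedding dimension
    word_to_row = {'heart': 0, 'attack': 1, '<unk>': 2}

    def vectorize(tokens):
        rows = [word_to_row.get(tok, word_to_row['<unk>'])
                for tok in tokens]
        return vectors[torch.tensor(rows)]

    print(vectorize(['heart', 'attack']).shape)  # torch.Size([2, 5])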
Module contents¶
This module vectorizes natural language features into PyTorch tensors.