zensols.deepnlp.vectorize package¶
Submodules¶
zensols.deepnlp.vectorize.embed module¶
This file contains a stash used to load an embedding layer. It creates features in batches of matrices and persists only the matrix (sans features) for efficient retrieval.
- class zensols.deepnlp.vectorize.embed.EmbeddingFeatureVectorizer(name, config_factory, feature_id, manager, encode_transformed, fold_method, embed_model, decode_embedding=False)[source]¶
Bases: FoldingDocumentVectorizer, Primeable, Dictable
Vectorize a FeatureDocument as a vector of embedding indexes. Later, these indexes are used in an EmbeddingLayer to create the input word embedding during execution of the model.
- __init__(name, config_factory, feature_id, manager, encode_transformed, fold_method, embed_model, decode_embedding=False)¶
- decode_embedding: bool = False¶
Whether or not to decode the embedding during the decode phase, which is helpful when caching batches; otherwise, the data is decoded from indexes to embeddings each epoch.
Note that this option and functionality cannot be obviated by what the encode_transformed attribute implements. The difference is whether more work is done during decoding rather than encoding. An example of when this is useful is for large word embeddings (i.e. the Google 300D pretrained embeddings), where the index to tensor embedding transform is done while decoding rather than in the forward pass so it is not repeated every epoch.
- embed_model: Union[WordEmbedModel, TransformerEmbedding]¶
The word vector model.
Types for this value include:
WordEmbedModel
TransformerEmbedding
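To make the decode_embedding tradeoff concrete, here is a minimal PyTorch sketch; the toy vectors and token_indexes names are hypothetical, not part of this API:

    import torch

    vectors = torch.randn(5, 4)                # toy |vocab| x |dimension| matrix
    token_indexes = torch.tensor([[0, 3, 2]])  # an encoded document of word indexes

    # decode_embedding = False: keep the indexes; the index-to-vector lookup
    # happens in the model's forward pass, so it repeats every epoch
    decoded_as_indexes = token_indexes

    # decode_embedding = True: resolve indexes to vectors once while decoding,
    # which is helpful when batches are cached
    decoded_as_vectors = vectors[token_indexes]
    print(decoded_as_vectors.shape)            # torch.Size([1, 3, 4])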
- class zensols.deepnlp.vectorize.embed.WordVectorEmbeddingFeatureVectorizer(name, config_factory, feature_id, manager, encode_transformed, fold_method, embed_model, decode_embedding=False, token_feature_id='norm')[source]¶
Bases: EmbeddingFeatureVectorizer
Vectorize sentences using an embedding model (embed_model) of type WordEmbedModel.
The encoder returns the indices of the word embedding for each token in the input FeatureDocument. The decoder returns the corresponding word embedding vectors if decode_embedding is True. Otherwise it returns the same indices, which are later used by the embedding layer (usually EmbeddingLayer).
- DESCRIPTION = 'word vector document embedding'¶
- FEATURE_TYPE = 4¶
- __init__(name, config_factory, feature_id, manager, encode_transformed, fold_method, embed_model, decode_embedding=False, token_feature_id='norm')¶
- token_feature_id: str = 'norm'¶
The FeatureToken attribute used to index the embedding vectors.
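A simplified sketch of the encode step described above, assuming a toy word-to-index mapping in place of a WordEmbedModel; all names here are illustrative:

    # map each token's 'norm' text to its word embedding index, falling back
    # to an unknown-token index, as the encoder conceptually does
    word2idx = {'the': 0, 'cat': 1, 'sat': 2, '<unk>': 3}

    def encode_tokens(tokens):
        return [word2idx.get(t, word2idx['<unk>']) for t in tokens]

    print(encode_tokens(['the', 'cat', 'sat', 'down']))  # -> [0, 1, 2, 3]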
zensols.deepnlp.vectorize.manager module¶
An extension of a feature vectorizer manager that parses and vectorizes natural language.
- class zensols.deepnlp.vectorize.manager.FeatureDocumentVectorizer(name, config_factory, feature_id, manager, encode_transformed)[source]¶
Bases: TransformableFeatureVectorizer
Creates document or sentence level features using instances of TokenContainer.
Subclasses implement specific vectorization on a single document using _encode(), and it is up to the subclass to decide how to vectorize the document.
Multiple documents given as an aggregate in a list or tuple of documents are supported. Only document level vectorization is supported to provide one standard contract across framework components and vectorizers.
If more than one document is given during encoding, they will be combined into one document as described using FoldingDocumentVectorizer.encoding_level = concat_tokens.
- __init__(name, config_factory, feature_id, manager, encode_transformed)¶
- encode(doc)[source]¶
Encode by combining documents into one monolithic document when a tuple is passed; otherwise, default to the superclass's encode functionality.
- Return type:
- property feature_type: TextFeatureType¶
The type of feature this vectorizer generates. This is used by classes such as EmbeddingNetworkModule to determine where to add the features, such as concatenating to the embedding layer, join layer, etc.
- class zensols.deepnlp.vectorize.manager.FeatureDocumentVectorizerManager(name, config_factory, torch_config, configured_vectorizers, doc_parser, token_length, token_feature_ids=None)[source]¶
Bases: FeatureVectorizerManager
Creates and manages instances of FeatureDocumentVectorizer and parses text into feature based documents.
This is used to manage the relationship of a given set of parsed features, keeping in mind that parsing will usually happen as a preprocessing step. A second step is the vectorization of those features, which can be any proper subset of those features parsed in the previous step. However, these checks are not necessary if pickling isn't used across the parse and vectorization steps.
Instances can set a hard fixed token length, in which case vectorized tensors have a fixed width based on the setting of token_length. However, this can also be set to use the longest sentence of the document, which is useful when computing vectorized tensors from the document as a batch, even if the input data are batched as a group of sentences in a document.
See: parse()
- __init__(name, config_factory, torch_config, configured_vectorizers, doc_parser, token_length, token_feature_ids=None)¶
- doc_parser: FeatureDocumentParser¶
Used to parse() documents.
- get_token_length(doc)[source]¶
Get the token length for the document. If is_batch_token_length is True, then the token length is computed based on the longest sentence in the document doc. See the class docs.
- Parameters:
doc (FeatureDocument) – used to compute the longest sentence if is_batch_token_length is True
- Return type:
- Returns:
the (global) token length for the document
- property is_batch_token_length: bool¶
Return whether or not the token length is variable based on the longest token length in the batch.
- parse(text, *args, **kwargs)[source]¶
Parse a text or a list of texts as sentences.
Important: parsing documents through this manager instance is preferred since safety checks are made that the features available match those used when documents were parsed before pickling.
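A hypothetical usage sketch follows; manager stands in for a configured FeatureDocumentVectorizerManager created elsewhere from an application configuration, and 'count' is an illustrative vectorizer feature ID:

    def vectorize_text(manager, text: str):
        # parse through the manager so the feature availability checks apply
        doc = manager.parse(text)
        # look up a configured vectorizer by its feature ID and produce a
        # tensor via encode-then-decode
        vectorizer = manager['count']
        return vectorizer.transform(doc)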
- property spacy_vectorizers: Dict[str, SpacyFeatureVectorizer]¶
Return vectorizers based on the token_feature_ids configured on this instance. Keys are token level feature IDs found in SpacyFeatureVectorizer.VECTORIZERS.
- Returns:
a collections.OrderedDict of vectorizers
- token_feature_ids: Set[str] = None¶
Indicates which spaCy parsed features to generate in the vectorizers held in this instance. Examples include norm, ent, dep, tag.
If this is not set, it defaults to the token_feature_ids in doc_parser.
- class zensols.deepnlp.vectorize.manager.FoldingDocumentVectorizer(name, config_factory, feature_id, manager, encode_transformed, fold_method)[source]¶
Bases: FeatureDocumentVectorizer
This class is like FeatureDocumentVectorizer, but provides more options in how to fold multiple documents into a single document for vectorization.
Based on the value of fold_method, this class encodes a sequence of FeatureDocument instances differently.
Subclasses must implement _encode().
Note: this is not to be confused with the MultiDocumentVectorizer vectorizer, which vectorizes multiple documents into document level features.
- __init__(name, config_factory, feature_id, manager, encode_transformed, fold_method)¶
- decode(context)[source]¶
Decode a (potentially) unpickled context and return a tensor using the manager's torch_config.
.- Return type:
- encode(doc)[source]¶
Encode by combining documents into one monolithic document when a tuple is passed; otherwise, default to the superclass's encode functionality.
- Return type:
- fold_method: str¶
How multiple documents are merged into a single document for vectorization (a toy sketch follows this list), which is one of:
raise: raise an error allowing only single documents to be vectorized
concat_tokens: concatenate tokens of each document into singleton sentence documents; uses combine_documents() with concat_tokens = True
sentence: all sentences of all documents become singleton sentence documents; uses combine_documents() with concat_tokens = False
separate: every sentence of each document is encoded separately, then each sentence output is concatenated as the respective document during decoding; this uses _encode() for each sentence of each document and _decode() to decode back into the same represented document structure as the original
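The following toy sketch approximates the first three folding options using nested lists in place of FeatureDocument instances (documents of sentences of tokens); it is illustrative only, not the framework's implementation:

    def fold(docs, fold_method: str):
        if fold_method == 'raise':
            if len(docs) > 1:
                raise ValueError('only single documents may be vectorized')
            return docs
        if fold_method == 'concat_tokens':
            # each document becomes a singleton sentence of all its tokens
            return [[[tok for sent in doc for tok in sent]] for doc in docs]
        if fold_method == 'sentence':
            # every sentence of every document becomes its own document
            return [[sent] for doc in docs for sent in doc]
        raise ValueError(f'unknown fold method: {fold_method}')

    docs = [[['the', 'cat'], ['sat']], [['it', 'was', 'happy']]]
    print(fold(docs, 'concat_tokens'))
    # [[['the', 'cat', 'sat']], [['it', 'was', 'happy']]]
    print(fold(docs, 'sentence'))
    # [[['the', 'cat']], [['sat']], [['it', 'was', 'happy']]]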
- class zensols.deepnlp.vectorize.manager.MultiDocumentVectorizer(name, config_factory, feature_id, manager, encode_transformed)[source]¶
Bases: FeatureDocumentVectorizer
Vectorizes multiple documents into document level features. Features generated by subclasses are sometimes used in join layers. Examples include OverlappingFeatureDocumentVectorizer.
This is not to be confused with FoldingDocumentVectorizer, which merges multiple documents into a single document for vectorization.
, which merges multiple documents in to a single document for vectorization.- FEATURE_TYPE = 2¶
- __init__(name, config_factory, feature_id, manager, encode_transformed)¶
- class zensols.deepnlp.vectorize.manager.TextFeatureType(value, names=None, *, module=None, qualname=None, type=None, start=1, boundary=None)[source]¶
Bases: Enum
The type of FeatureDocumentVectorizer.
- DOCUMENT = 2¶
Document level, typically added to a join layer.
- EMBEDDING = 4¶
Embedding layer, typically used as the input layer.
- MULTI_DOCUMENT = 3¶
Multiple documents for the purposes of aggregating shared features.
- NONE = 5¶
Other type, which tells the framework to ignore the vectorized features.
- TOKEN = 1¶
Token level with a shape congruent with the number of tokens, typically concatenated with the embedding layer.
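As a hedged sketch of how a network module might use these values, the routing function below is hypothetical and only mirrors the descriptions above, not the framework's actual logic:

    from enum import Enum

    class TextFeatureType(Enum):
        TOKEN = 1
        DOCUMENT = 2
        MULTI_DOCUMENT = 3
        EMBEDDING = 4
        NONE = 5

    def route_feature(ftype: TextFeatureType) -> str:
        # embeddings feed the input layer, token features concatenate with
        # it, document level features go to a join layer, NONE is ignored
        if ftype is TextFeatureType.EMBEDDING:
            return 'input (embedding) layer'
        if ftype is TextFeatureType.TOKEN:
            return 'concatenated with the embedding layer'
        if ftype in (TextFeatureType.DOCUMENT, TextFeatureType.MULTI_DOCUMENT):
            return 'join layer'
        return 'ignored'

    print(route_feature(TextFeatureType.DOCUMENT))  # join layer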
zensols.deepnlp.vectorize.spacy module¶
Feature (ID) normalization.
- class zensols.deepnlp.vectorize.spacy.DependencyFeatureVectorizer(name, config_factory, feature_id, torch_config, vocab)[source]¶
Bases: SpacyFeatureVectorizer
A feature vectorizer for dependency head trees.
- DESCRIPTION = 'dependency'¶
- FEATURE_ID = 'dep'¶
- LANG = 'en'¶
- SYMBOLS = 'acl acomp advcl advmod agent amod appos attr aux auxpass case cc ccomp clf\ncomplm compound conj cop csubj csubjpass dative dep det discourse dislocated\ndobj expl fixed flat goeswith hmod hyph infmod intj iobj list mark meta neg\nnmod nn npadvmod nsubj nsubjpass nounmod npmod num number nummod oprd obj obl\norphan parataxis partmod pcomp pobj poss possessive preconj prep prt punct\nquantmod rcmod relcl reparandum root vocative xcomp ROOT'¶
- __init__(name, config_factory, feature_id, torch_config, vocab)¶
- class zensols.deepnlp.vectorize.spacy.NamedEntityRecognitionFeatureVectorizer(name, config_factory, feature_id, torch_config, vocab)[source]¶
Bases: SpacyFeatureVectorizer
A feature vectorizer for NER tags.
- DESCRIPTION = 'named entity recognition'¶
- FEATURE_ID = 'ent'¶
- LANG = 'en'¶
- SYMBOLS = 'PERSON NORP FACILITY FAC ORG GPE LOC PRODUCT EVENT WORK_OF_ART LAW LANGUAGE\n DATE TIME PERCENT MONEY QUANTITY ORDINAL CARDINAL PER MISC'¶
- __init__(name, config_factory, feature_id, torch_config, vocab)¶
- class zensols.deepnlp.vectorize.spacy.PartOfSpeechFeatureVectorizer(name, config_factory, feature_id, torch_config, vocab)[source]¶
Bases: SpacyFeatureVectorizer
A feature vectorizer for POS tags.
- DESCRIPTION = 'part of speech'¶
- FEATURE_ID = 'tag'¶
- LANG = 'en'¶
- SYMBOLS = 'ADJ ADP ADV AUX CONJ CCONJ DET INTJ NOUN NUM PART PRON PROPN PUNCT SCONJ SYM\nVERB X EOL SPACE . , -LRB- -RRB- `` " \' $ # AFX CC CD DT EX FW HYPH IN JJ JJR\nJJS LS MD NIL NN NNP NNPS NNS PDT POS PRP PRP$ RB RBR RBS RP TO UH VB VBD VBG\nVBN VBP VBZ WDT WP WP$ WRB SP ADD NFP GW XX BES HVS NP PP VP ADVP ADJP SBAR PRT\nPNP'¶
- __init__(name, config_factory, feature_id, torch_config, vocab)¶
- class zensols.deepnlp.vectorize.spacy.SpacyFeatureVectorizer(name, config_factory, feature_id, torch_config, vocab)[source]¶
Bases: FeatureVectorizer
This normalizes feature IDs of parsed token features into a number between [0, 1]. This is useful for normalized feature vectors as input to neural networks. Input to this would be strings like token.ent_ found on a zensols.nlp.feature.TokenAttributes instance.
The class is also designed to create features using indexes, so there are methods to resolve to a unique ID from an identifier.
Instances of this class behave like a dict.
All symbols are taken from spacy.glossary.GLOSSARY.
- Parameters:
vocab (Vocab) – the vocabulary used for from_spacy to compute the normalized feature from the spaCy ID (i.e. token.ent_, token.tag_, etc.)
- See:
spacy.glossary.GLOSSARY
- See:
zensols.nlp.feature.TokenAttributes
- VECTORIZERS = {'dep': <class 'zensols.deepnlp.vectorize.spacy.DependencyFeatureVectorizer'>, 'ent': <class 'zensols.deepnlp.vectorize.spacy.NamedEntityRecognitionFeatureVectorizer'>, 'tag': <class 'zensols.deepnlp.vectorize.spacy.PartOfSpeechFeatureVectorizer'>}¶
- __init__(name, config_factory, feature_id, torch_config, vocab)¶
- dist(symbol)[source]¶
Return a normalized feature float if symbol is found.
- Return type:
- Returns:
a normalized value between [0 - 1] or None if the symbol isn't found
- from_spacy(id)[source]¶
Return a binary feature from a spaCy ID, or None if it doesn't have a mapping for the ID.
- Return type:
- id_from_spacy(id, default=-1)[source]¶
Return the ID of this vectorizer for the spaCy ID, or -1 if not found.
- Return type:
- id_from_spacy_symbol(id, default=-1)[source]¶
Return the spaCy text symbol for its ID (token.ent -> token.ent_).
- Return type:
- torch_config: TorchConfig¶
The torch configuration used to create tensors.
- vocab: Vocab¶
The spaCy vocabulary used to create IDs from strings.
See: id_from_spacy_symbol()
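A toy sketch of the [0, 1] normalization this class provides; the even spacing here is an assumption for illustration (the real class derives its symbols from spacy.glossary.GLOSSARY):

    # evenly space each known symbol in [0, 1]; an unknown symbol maps to
    # None, mirroring dist() on a symbol that isn't found
    symbols = 'PERSON NORP ORG GPE LOC'.split()
    norm = {s: i / (len(symbols) - 1) for i, s in enumerate(symbols)}

    print(norm['PERSON'])        # 0.0
    print(norm['LOC'])           # 1.0
    print(norm.get('MISSING'))   # None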
zensols.deepnlp.vectorize.vectorizers module¶
Generate and vectorize language features.
- class zensols.deepnlp.vectorize.vectorizers.CountEnumContainerFeatureVectorizer(name, config_factory, feature_id, manager, encode_transformed, decoded_feature_ids=None)[source]¶
Bases: FeatureDocumentVectorizer
Vectorize the counts of parsed spaCy features. This generates the count of tokens as an S X (M * N) tensor, where S is the number of sentences, M is the number of token feature IDs, and N is the number of columns of the output of the SpacyFeatureVectorizer vectorizer. Each column position's count represents the number of counts for that spaCy symbol for that index position in the output of SpacyFeatureVectorizer.
This class uses the same efficiency in decoding features given in EnumContainerFeatureVectorizer.
- Shape:
- ATTR_EXP_META = ('decoded_feature_ids',)¶
- DESCRIPTION = 'token level feature counts'¶
- FEATURE_TYPE = 2¶
- __init__(name, config_factory, feature_id, manager, encode_transformed, decoded_feature_ids=None)¶
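A worked shape example under assumed sizes: with S = 2 sentences, M = 3 token feature IDs (e.g. ent, tag, dep) and N = 10 symbol columns per feature, the output is 2 x 30:

    import torch

    S, M, N = 2, 3, 10
    counts = torch.zeros(S, M * N)  # one count column per (feature, symbol) pair
    print(counts.shape)             # torch.Size([2, 30])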
- class zensols.deepnlp.vectorize.vectorizers.DepthFeatureDocumentVectorizer(name, config_factory, feature_id, manager, encode_transformed)[source]¶
Bases: FeatureDocumentVectorizer
Generate the depths of tokens based on how deep they are in a head dependency tree.
Even though this is a document level vectorizer and is usually added in a join layer rather than stacked on the embedding layer, it still assumes congruence with the token length, which is used in its shape.
Important: do not combine sentences into a single document with combine_sentences() since features are created as a dependency parse tree at the sentence level. Otherwise, the dependency relations are broken and result in a zeroed tensor.
- Shape:
(|sentences|, |sentinel tokens|, 1)
- DESCRIPTION = 'head depth'¶
- FEATURE_TYPE = 1¶
- __init__(name, config_factory, feature_id, manager, encode_transformed)¶
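A minimal sketch of head-tree depth, assuming a toy head-index array in place of a parsed sentence (heads[i] is the index of token i's head and the root points to itself); the vectorizer's exact per-token value may differ, e.g. by normalization:

    heads = [1, 1, 1, 2]  # e.g. 'the' -> 'cat' (root), 'sat' -> 'cat', 'down' -> 'sat'

    def depth(i: int) -> int:
        # count hops from the token up to the root of its dependency tree
        d = 0
        while heads[i] != i:
            i = heads[i]
            d += 1
        return d

    print([depth(i) for i in range(len(heads))])  # [1, 0, 1, 2]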
- class zensols.deepnlp.vectorize.vectorizers.EnumContainerFeatureVectorizer(name, config_factory, feature_id, manager, encode_transformed, decoded_feature_ids=None)[source]¶
Bases: FeatureDocumentVectorizer
Encode tokens found in the container by aggregating the spaCy vectorizers' output. The result is a concatenated binary representation of all configured token level features for each token. This adds only the token vectorizer features generated by the spaCy vectorizers (subclasses of SpacyFeatureVectorizer), and not the features themselves (such as is_stop, etc.).
All spaCy features are encoded as given by spacy_vectorizers. However, only those given in decoded_feature_ids are produced in the output tensor after decoding.
The motivation for encoding all features but decoding only a subset is feature selection during training. This is because encoding the features (in a sparse matrix) takes comparatively less time and space than re-encoding all batches.
Rows are tokens, columns are intervals of features. The encoded matrix is sparse and decoded as a dense matrix.
- Shape:
- See:
- ATTR_EXP_META = ('decoded_feature_ids',)¶
- DESCRIPTION = 'spacy feature vectorizer'¶
- FEATURE_TYPE = 1¶
- __init__(name, config_factory, feature_id, manager, encode_transformed, decoded_feature_ids=None)¶
- decoded_feature_ids: Set[str] = None¶
The spaCy generated features used only during decoding (see class docs). Examples include norm, ent, dep, tag. When set to None, all those given in the spacy_vectorizers are used.
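A toy sketch of the encode-all, decode-a-subset idea: every feature block is encoded once, but only the requested IDs are concatenated at decode time. The block sizes and IDs here are illustrative, not the class's actual layout:

    import torch

    encoded = {                   # one binary block per token level feature ID
        'ent': torch.eye(3)[:2],  # 2 tokens x 3 ent symbols
        'tag': torch.eye(4)[:2],  # 2 tokens x 4 tag symbols
    }

    def decode(decoded_feature_ids):
        # select and concatenate only the requested feature blocks
        return torch.cat([encoded[fid] for fid in decoded_feature_ids], dim=1)

    print(decode(['ent', 'tag']).shape)  # torch.Size([2, 7])
    print(decode(['tag']).shape)         # torch.Size([2, 4]) -- no re-encoding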
- class zensols.deepnlp.vectorize.vectorizers.MutualFeaturesContainerFeatureVectorizer(name, config_factory, feature_id, manager, encode_transformed, count_vectorizer_feature_id)[source]¶
Bases: MultiDocumentVectorizer
Vectorize the shared count of all tokens as an S X (M * N) tensor, where S is the number of sentences, M is the number of token feature IDs, and N is the number of columns of the output of the SpacyFeatureVectorizer vectorizer.
This uses an instance of CountEnumContainerFeatureVectorizer to compute across each spaCy feature and then sums them up for only those features shared. If at least one shared document has a zero count, the feature is zeroed.
The input to this feature vectorizer is a tuple of N TokenContainer instances.
- Shape:
(|sentences|, |decoded features|,) from the referenced CountEnumContainerFeatureVectorizer given by count_vectorizer_feature_id
- DESCRIPTION = 'mutual feature counts'¶
- __init__(name, config_factory, feature_id, manager, encode_transformed, count_vectorizer_feature_id)¶
- property count_vectorizer: CountEnumContainerFeatureVectorizer¶
Return the count vectorizer used for the count features.
- count_vectorizer_feature_id: str¶
The string feature ID configured in the FeatureDocumentVectorizerManager of the CountEnumContainerFeatureVectorizer to use for the count features.
- class zensols.deepnlp.vectorize.vectorizers.OneHotEncodedFeatureDocumentVectorizer(name, config_factory, feature_id, manager, categories, optimize_bools, encode_transformed, feature_attribute=None, level='token')[source]¶
Bases: FeatureDocumentVectorizer, OneHotEncodedEncodableFeatureVectorizer
Vectorize nominal enumerated features into one-hot encoded vectors. The feature is taken from a FeatureToken. If level is token, then the features are token attributes identified by feature_attribute. If the level is document, the feature is taken from the document.
- Shape:
level = document: (1, |categories|)
level = token: (|<sentences>|, |<sentinel tokens>|, |categories|)
- DESCRIPTION = 'encoded feature document vectorizer'¶
- __init__(name, config_factory, feature_id, manager, categories, optimize_bools, encode_transformed, feature_attribute=None, level='token')¶
- property feature_type: TextFeatureType¶
The type of feature this vectorizer generates. This is used by classes such as EmbeddingNetworkModule to determine where to add the features, such as concatenating to the embedding layer, join layer, etc.
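A toy sketch of the token level case under assumed categories and token attribute values; all names here are illustrative:

    import torch

    categories = ['NOUN', 'VERB', 'DET']
    cat2idx = {c: i for i, c in enumerate(categories)}
    # e.g. the value of a feature_attribute such as the part of speech
    token_values = ['DET', 'NOUN', 'VERB']

    one_hot = torch.zeros(1, len(token_values), len(categories))
    for i, value in enumerate(token_values):
        one_hot[0, i, cat2idx[value]] = 1
    print(one_hot.shape)  # (|sentences|, |sentinel tokens|, |categories|)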
- class zensols.deepnlp.vectorize.vectorizers.OverlappingFeatureDocumentVectorizer(name, config_factory, feature_id, manager, encode_transformed)[source]¶
Bases: MultiDocumentVectorizer
Vectorize the number of normalized and lemmatized tokens (in this order) across multiple documents.
The input to this feature vectorizer is a tuple of N FeatureDocument instances.
- Shape:
(2,)
- DESCRIPTION = 'overlapping token counts'¶
- __init__(name, config_factory, feature_id, manager, encode_transformed)¶
- class zensols.deepnlp.vectorize.vectorizers.StatisticsFeatureDocumentVectorizer(name, config_factory, feature_id, manager, encode_transformed)[source]¶
Bases: FeatureDocumentVectorizer
Vectorizes basic surface language statistics, which include:
character count
token count
min token length in characters
max token length in characters
average token length in characters (|characters| / |tokens|)
sentence count (for FeatureDocuments)
average sentence length (|tokens| / |sentences|)
min sentence length
max sentence length
- Shape:
(1, 9,)
- DESCRIPTION = 'statistics'¶
- FEATURE_TYPE = 2¶
- __init__(name, config_factory, feature_id, manager, encode_transformed)¶
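A plain-Python sketch of the nine statistics above, computed over a toy document given as sentences of tokens; this is illustrative only, as the vectorizer computes them from FeatureDocument instances:

    doc = [['the', 'cat', 'sat'], ['happy']]            # 2 sentences
    tokens = [tok for sent in doc for tok in sent]
    lens = [len(tok) for tok in tokens]
    stats = (
        sum(lens),                  # character count
        len(tokens),                # token count
        min(lens),                  # min token length
        max(lens),                  # max token length
        sum(lens) / len(tokens),    # average token length
        len(doc),                   # sentence count
        len(tokens) / len(doc),     # average sentence length
        min(len(s) for s in doc),   # min sentence length
        max(len(s) for s in doc),   # max sentence length
    )
    print(stats)  # 9 values, matching the (1, 9) shape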
- class zensols.deepnlp.vectorize.vectorizers.TokenEmbeddingFeatureVectorizer(name, config_factory, feature_id, manager, encode_transformed, delegate_feature_id, size=-1, pad_label=-100, level=TextFeatureType.TOKEN, add_dims=0)[source]¶
Bases: AggregateEncodableFeatureVectorizer, FeatureDocumentVectorizer
An AggregateEncodableFeatureVectorizer that is useful for token level classification (i.e. NER). It uses a delegate to first vectorize the features, then concatenates them into one aggregate.
In shape terms, this takes the single sentence position. The additional unsqueezed dimensions set with n_unsqueeze are useful when the delegate vectorizer encodes booleans or any other value that does not take an additional dimension.
- Shape:
(1, |tokens|, <delegate vectorizer shape>[, <unsqueeze dimensions>])
- DESCRIPTION = 'token aggregate vectorizer'¶
- __init__(name, config_factory, feature_id, manager, encode_transformed, delegate_feature_id, size=-1, pad_label=-100, level=TextFeatureType.TOKEN, add_dims=0)¶
- property feature_type: TextFeatureType¶
The type of feature this vectorizer generates. This is used by classes such as EmbeddingNetworkModule to determine where to add the features, such as concatenating to the embedding layer, join layer, etc.
- level: TextFeatureType = 1¶
The level at which to take the attribute value, which is document, sentence or token.
- class zensols.deepnlp.vectorize.vectorizers.WordEmbeddingFeatureVectorizer(name, config_factory, feature_id, manager, embed_model)[source]¶
Bases: EncodableFeatureVectorizer
Vectorizes string tokens into word embedded vectors. This class works directly with the string tokens rather than FeatureDocument instances. It can be useful when there's a need to vectorize tokens outside of a feature document (i.e. cui2vec).
- DESCRIPTION = 'word embedding encoder'¶
- FEATURE_TYPE = 4¶
- __init__(name, config_factory, feature_id, manager, embed_model)¶
- embed_model: WordEmbedModel¶
The word embedding model that has the string tokens to vector mapping.
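A hypothetical usage sketch; vectorizer stands in for a configured WordEmbeddingFeatureVectorizer, and transform() is assumed to be the inherited encode-then-decode entry point:

    def embed_tokens(vectorizer, tokens):
        # works on plain string tokens; no FeatureDocument is needed
        return vectorizer.transform(tokens)

    # e.g. embed_tokens(vectorizer, ['heart', 'attack']) would yield a tensor
    # with one embedding vector per token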
Module contents¶
This module vectorizes natural language features into PyTorch tensors.