zensols.deepnlp.vectorize package

Submodules

zensols.deepnlp.vectorize.embed module

This file contains a stash used to load an embedding layer. It creates features in batches of matrices and persists only the matrix (sans features) for efficient retrieval.

class zensols.deepnlp.vectorize.embed.EmbeddingFeatureVectorizer(name, config_factory, feature_id, manager, encode_transformed, fold_method, embed_model, decode_embedding=False)[source]

Bases: FoldingDocumentVectorizer, Primeable, Dictable

Vectorize a FeatureDocument as a vector of embedding indexes. Later, these indexes are used in an EmbeddingLayer to create the input word embedding during execution of the model.

__init__(name, config_factory, feature_id, manager, encode_transformed, fold_method, embed_model, decode_embedding=False)
decode_embedding: bool = False

Whether or not to decode the embedding during the decode phase, which is helpful when caching batches; otherwise, the data is decoded from indexes to embeddings each epoch.

Note that this option and functionality cannot be obviated by that implemented with the encode_transformed attribute. The difference is whether more work is done during decoding rather than encoding. An example of when this is useful is for large word embeddings (i.e. Google 300D pretrained), where the index to tensor embedding transform is done while decoding rather than in the forward pass, so it is not done for every epoch.
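
A minimal sketch of the trade-off (the vectorizer and document are assumed to be configured and parsed elsewhere; names are hypothetical):

    # 'vectorizer' is an already configured EmbeddingFeatureVectorizer;
    # 'doc' is a parsed FeatureDocument
    ctx = vectorizer.encode(doc)   # stores embedding indexes; cheap to pickle
    arr = vectorizer.decode(ctx)
    # decode_embedding=False -> 'arr' holds indexes, resolved to embedding
    #                           vectors in the forward pass every epoch
    # decode_embedding=True  -> 'arr' already holds embedding vectors,
    #                           which is helpful when caching batches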

embed_model: Union[WordEmbedModel, TransformerEmbedding]

The word vector model.

Types for this value include WordEmbedModel and TransformerEmbedding.

prime()[source]
class zensols.deepnlp.vectorize.embed.WordVectorEmbeddingFeatureVectorizer(name, config_factory, feature_id, manager, encode_transformed, fold_method, embed_model, decode_embedding=False, token_feature_id='norm')[source]

Bases: EmbeddingFeatureVectorizer

Vectorize sentences using an embedding model (embed_model) of type WordEmbedModel.

The encoder returns the indices of the word embedding for each token in the input FeatureDocument. The decoder returns the corresponding word embedding vectors if decode_embedding is True. Otherwise it returns the same indices, which are later used by the embedding layer (usually EmbeddingLayer).
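
A usage sketch (instance names hypothetical; the vectorizer is assumed configured with a WordEmbedModel and the default token_feature_id of 'norm'):

    # parse text, then encode each token's normalized text as the index
    # of its vector in the word embedding matrix
    doc = doc_parser.parse('The boy threw the ball.')
    ctx = vectorizer.encode(doc)
    ixs = vectorizer.decode(ctx)   # index tensor when decode_embedding=False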

DESCRIPTION = 'word vector document embedding'
FEATURE_TYPE = 4
__init__(name, config_factory, feature_id, manager, encode_transformed, fold_method, embed_model, decode_embedding=False, token_feature_id='norm')
token_feature_id: str = 'norm'

The FeatureToken attribute used to index the embedding vectors.

property vectors: Tensor

zensols.deepnlp.vectorize.manager module

An extension of a feature vectorizer manager that parses and vectorizes natural language.

class zensols.deepnlp.vectorize.manager.FeatureDocumentVectorizer(name, config_factory, feature_id, manager, encode_transformed)[source]

Bases: TransformableFeatureVectorizer

Creates document or sentence level features using instances of TokenContainer.

Subclasses implement specific vectorization on a single document using _encode(), and it is up to the subclass to decide how to vectorize the document.

Multiple documents, given as an aggregate in a list or tuple of documents, are supported. Only document level vectorization is supported to provide one standard contract across framework components and vectorizers.

If more than one document is given during encoding, they are combined into one document as described using FoldingDocumentVectorizer.fold_method = concat_tokens.

See:

FoldingDocumentVectorizer

__init__(name, config_factory, feature_id, manager, encode_transformed)
encode(doc)[source]

Encode by combining documents into one monolithic document when a tuple is passed; otherwise, default to the superclass's encode functionality.

Return type:

FeatureContext

property feature_type: TextFeatureType

The type of feature this vectorizer generates. This is used by classes such as EmbeddingNetworkModule to determine where to add the features, such as concatenating to the embedding layer, join layer, etc.

property token_length: int

The number of token features (if token level) generated.

class zensols.deepnlp.vectorize.manager.FeatureDocumentVectorizerManager(name, config_factory, torch_config, configured_vectorizers, doc_parser, token_length, token_feature_ids=None)[source]

Bases: FeatureVectorizerManager

Creates and manages instances of FeatureDocumentVectorizer and parses text into feature-based documents.

This is used to manage the relationship of a given set of parsed features, keeping in mind that parsing usually happens as a preprocessing step. A second step is the vectorization of those features, which can be any proper subset of the features parsed in the previous step. However, these checks are, of course, not necessary if pickling isn't used across the parse and vectorization steps.

Instances can set a hard fixed token length via token_length, in which case vectorized tensors have a like fixed width. However, this can also be set to use the longest sentence of the document, which is useful when computing vectorized tensors from the document as a batch, even if the input data are batched as a group of sentences in a document.

See:

FeatureDocumentVectorizer

parse()

__init__(name, config_factory, torch_config, configured_vectorizers, doc_parser, token_length, token_feature_ids=None)
deallocate()[source]

Deallocate all resources for this instance.

doc_parser: FeatureDocumentParser

Used to parse() documents.

get_token_length(doc)[source]

Get the token length for the document. If is_batch_token_length is True, then the token length is computed based on the longest sentence in the document doc. See the class docs.

Parameters:

doc (FeatureDocument) – used to compute the longest sentence if is_batch_token_length is True

Return type:

int

Returns:

the (global) token length for the document
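
For example ('mng' is an assumed configured manager instance; see parse() below):

    doc = mng.parse('One short sentence. And a slightly longer sentence.')
    n = mng.get_token_length(doc)
    # a fixed width when token_length >= 0; the longest sentence's token
    # count when token_length == -1 (is_batch_token_length is True)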

property is_batch_token_length: bool

Return whether or not the token length is variable based on the longest token length in the batch.

parse(text, *args, **kwargs)[source]

Parse text, given either as a string or as a list of sentence strings.

Important: it is better to parse documents through this manager instance, since safety checks ensure that the features used when documents were parsed before pickling are still available.

Parameters:

text (Union[str, List[str]]) – either a string or a list of strings; if the former a document with one sentence will be created, otherwise a document is returned with a sentence for each string in the list

Return type:

FeatureDocument
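
A sketch of both input forms ('mng' is an assumed configured manager instance):

    single = mng.parse('All of this text makes one sentence.')     # 1 sentence
    multi = mng.parse(['First sentence.', 'Second sentence.'])     # 2 sentences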

property spacy_vectorizers: Dict[str, SpacyFeatureVectorizer]

Return vectorizers based on the token_feature_ids configured on this instance. Keys are token level feature ids found in SpacyFeatureVectorizer.VECTORIZERS.

Returns:

a collections.OrderedDict of vectorizers

token_feature_ids: Set[str] = None

Indicates which spaCy parsed features to generate in the vectorizers held in this instance. Examples include norm, ent, dep, tag.

If this is not set, it defaults to the token_feature_ids in doc_parser.

See:

SpacyFeatureVectorizer.VECTORIZERS

token_length: int

The length of tokens used in fixed length features. This is used as a dimension in decoded tensors. If this value is -1, the longest sentence of the document is used as the token length, since the document usually constitutes the batch.

See:

get_token_length()

class zensols.deepnlp.vectorize.manager.FoldingDocumentVectorizer(name, config_factory, feature_id, manager, encode_transformed, fold_method)[source]

Bases: FeatureDocumentVectorizer

This class is like FeatureDocumentVectorizer, but provides more options in how to fold multiple documents into a single document for vectorization.

Based on the value of fold_method, this class encodes a sequence of FeatureDocument instances differently.

Subclasses must implement _encode().

Note: this is not to be confused with the MultiDocumentVectorizer vectorizer, which vectorizes multiple documents into document level features.

__init__(name, config_factory, feature_id, manager, encode_transformed, fold_method)
decode(context)[source]

Decode a (potentially) unpickled context and return a tensor using the manager’s torch_config.

Return type:

Tensor

encode(doc)[source]

Encode by combining documents into one monolithic document when a tuple is passed; otherwise, default to the superclass's encode functionality.

Return type:

FeatureContext

fold_method: str

How multiple documents are merged into a single document for vectorization (see the sketch after this list), which is one of:

  • raise: raise an error, allowing only single documents to be vectorized

  • concat_tokens: concatenate the tokens of each document into singleton sentence documents; uses combine_documents() with concat_tokens = True

  • sentence: all sentences of all documents become singleton sentence documents; uses combine_documents() with concat_tokens = False

  • separate: every sentence of each document is encoded separately, then each sentence's output is concatenated as the respective document during decoding; this uses _encode() for each sentence of each document and _decode() to decode back into the same document structure as the original
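
A hedged sketch of how the setting changes encoding (instance names are hypothetical; fold_method is set where the vectorizer is configured):

    docs = (doc_a, doc_b)          # two parsed FeatureDocument instances
    # fold_method='raise'         -> encode(docs) raises an error
    # fold_method='concat_tokens' -> tokens of both folded into one document
    # fold_method='sentence'      -> each sentence becomes its own document
    # fold_method='separate'      -> sentences encoded separately, then
    #                                concatenated back per document on decode
    ctx = fold_vectorizer.encode(docs)
    arr = fold_vectorizer.decode(ctx)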

class zensols.deepnlp.vectorize.manager.MultiDocumentVectorizer(name, config_factory, feature_id, manager, encode_transformed)[source]

Bases: FeatureDocumentVectorizer

Vectorizes multiple documents into document level features. Features generated by subclasses are sometimes used in join layers. Examples include OverlappingFeatureDocumentVectorizer.

This is not to be confused with FoldingDocumentVectorizer, which merges multiple documents into a single document for vectorization.

FEATURE_TYPE = 2
__init__(name, config_factory, feature_id, manager, encode_transformed)
encode(docs)[source]

Encode by combining documents into one monolithic document when a tuple is passed; otherwise, default to the superclass's encode functionality.

Return type:

FeatureContext

class zensols.deepnlp.vectorize.manager.TextFeatureType(value, names=None, *, module=None, qualname=None, type=None, start=1, boundary=None)[source]

Bases: Enum

The type of FeatureDocumentVectorizer.

DOCUMENT = 2

Document level, typically added to a join layer.

EMBEDDING = 4

Embedding layer, typically used as the input layer.

MULTI_DOCUMENT = 3

Multiple documents for the purposes of aggregating shared features.

NONE = 5

Other type, which tells the framework to ignore the vectorized features.

See:

EmbeddingNetworkModule

TOKEN = 1

Token level with a shape congruent with the number of tokens, typically concatenated with the embedding layer.

zensols.deepnlp.vectorize.spacy module

Feature (ID) normalization.

class zensols.deepnlp.vectorize.spacy.DependencyFeatureVectorizer(name, config_factory, feature_id, torch_config, vocab)[source]

Bases: SpacyFeatureVectorizer

A feature vectorizer for dependency head trees.

See:

SpacyFeatureVectorizer

DESCRIPTION = 'dependency'
FEATURE_ID = 'dep'
LANG = 'en'
SYMBOLS = 'acl acomp advcl advmod agent amod appos attr aux auxpass case cc ccomp clf\ncomplm compound conj cop csubj csubjpass dative dep det discourse dislocated\ndobj expl fixed flat goeswith hmod hyph infmod intj iobj list mark meta neg\nnmod nn npadvmod nsubj nsubjpass nounmod npmod num number nummod oprd obj obl\norphan parataxis partmod pcomp pobj poss possessive preconj prep prt punct\nquantmod rcmod relcl reparandum root vocative xcomp ROOT'
__init__(name, config_factory, feature_id, torch_config, vocab)
class zensols.deepnlp.vectorize.spacy.NamedEntityRecognitionFeatureVectorizer(name, config_factory, feature_id, torch_config, vocab)[source]

Bases: SpacyFeatureVectorizer

A feature vectorizer for NER tags.

See:

SpacyFeatureVectorizer

DESCRIPTION = 'named entity recognition'
FEATURE_ID = 'ent'
LANG = 'en'
SYMBOLS = 'PERSON NORP FACILITY FAC ORG GPE LOC PRODUCT EVENT WORK_OF_ART LAW LANGUAGE\n    DATE TIME PERCENT MONEY QUANTITY ORDINAL CARDINAL PER MISC'
__init__(name, config_factory, feature_id, torch_config, vocab)
class zensols.deepnlp.vectorize.spacy.PartOfSpeechFeatureVectorizer(name, config_factory, feature_id, torch_config, vocab)[source]

Bases: SpacyFeatureVectorizer

A feature vectorizer for POS tags.

See:

SpacyFeatureVectorizer

DESCRIPTION = 'part of speech'
FEATURE_ID = 'tag'
LANG = 'en'
SYMBOLS = 'ADJ ADP ADV AUX CONJ CCONJ DET INTJ NOUN NUM PART PRON PROPN PUNCT SCONJ SYM\nVERB X EOL SPACE . , -LRB- -RRB- `` " \' $ # AFX CC CD DT EX FW HYPH IN JJ JJR\nJJS LS MD NIL NN NNP NNPS NNS PDT POS PRP PRP$ RB RBR RBS RP TO UH VB VBD VBG\nVBN VBP VBZ WDT WP WP$ WRB SP ADD NFP GW XX BES HVS NP PP VP ADVP ADJP SBAR PRT\nPNP'
__init__(name, config_factory, feature_id, torch_config, vocab)
class zensols.deepnlp.vectorize.spacy.SpacyFeatureVectorizer(name, config_factory, feature_id, torch_config, vocab)[source]

Bases: FeatureVectorizer

This normalizes feature IDs of parsed token features into a number in the interval [0, 1]. This is useful for normalized feature vectors as input to neural networks. Input to this would be strings like token.ent_ found on a zensols.nlp.feature.TokenAttributes instance.

The class is also designed to create features using indexes, so there are methods to resolve to a unique ID from an identifier.

Instances of this class behave like a dict.

All symbols are taken from spacy.glossary.GLOSSARY.

Parameters:

vocab (Vocab) – the vocabulary used for from_spacy to compute the normalized feature from the spacy ID (i.e. token.ent_, token.tag_ etc.)

See:

spacy.glossary.GLOSSARY

See:

zensols.nlp.feature.TokenAttributes
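
A minimal sketch of normalizing a symbol ('tag_vec' is a hypothetical, already configured PartOfSpeechFeatureVectorizer instance):

    x = tag_vec.dist('NN')        # float in [0, 1] for the 'NN' POS tag
    t = tag_vec.transform('NN')   # the same normalized feature as a tensor
    # instances behave like a dict over their symbols (see class docs)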

VECTORIZERS = {'dep': <class 'zensols.deepnlp.vectorize.spacy.DependencyFeatureVectorizer'>, 'ent': <class 'zensols.deepnlp.vectorize.spacy.NamedEntityRecognitionFeatureVectorizer'>, 'tag': <class 'zensols.deepnlp.vectorize.spacy.PartOfSpeechFeatureVectorizer'>}
__init__(name, config_factory, feature_id, torch_config, vocab)
dist(symbol)[source]

Return a normalized feature float if symbol is found.

Return type:

float

Returns:

a normalized value in the interval [0, 1], or None if the symbol isn't found

from_spacy(id)[source]

Return a binary feature from a spaCy ID, or None if it doesn't have a mapping for the ID.

Return type:

Tensor

id_from_spacy(id, default=-1)[source]

Return the ID of this vectorizer for the spaCy ID, or default (-1) if not found.

Return type:

int

id_from_spacy_symbol(id, default=-1)[source]

Return the spaCy text symbol for its ID (token.ent -> token.ent_).

Return type:

str

torch_config: TorchConfig

The torch configuration used to create tensors.

transform(symbol)[source]

Transform data to a tensor data format.

Return type:

Tensor

vocab: Vocab

The spaCy vocabulary used to create IDs from strings.

See:

id_from_spacy_symbol()

write(writer=<_io.TextIOWrapper name='<stdout>' mode='w' encoding='utf-8'>)[source]

Pretty print a human readable representation of this feature vectorizer.

zensols.deepnlp.vectorize.vectorizers module

Generate and vectorize language features.

class zensols.deepnlp.vectorize.vectorizers.CountEnumContainerFeatureVectorizer(name, config_factory, feature_id, manager, encode_transformed, decoded_feature_ids=None)[source]

Bases: FeatureDocumentVectorizer

Vectorize the counts of parsed spaCy features. This generates the count of tokens as an S X M * N tensor, where S is the number of sentences, M is the number of token feature IDs and N is the number of columns of the output of the SpacyFeatureVectorizer vectorizer. Each column position's count represents the number of counts for that spaCy symbol for that index position in the output of SpacyFeatureVectorizer.

This class uses the same efficiency in decoding features given in EnumContainerFeatureVectorizer.

Shape:

(|sentences|, |decoded features|)

ATTR_EXP_META = ('decoded_feature_ids',)
DESCRIPTION = 'token level feature counts'
FEATURE_TYPE = 2
__init__(name, config_factory, feature_id, manager, encode_transformed, decoded_feature_ids=None)
decoded_feature_ids: Set[str] = None
get_feature_counts(sent, fvec)[source]

Return the count of all tokens as an S X N tensor, where S is the number of sentences and N is the number of columns of the fvec vectorizer. Each column position's count represents the number of counts for that spaCy symbol for that index position in fvec.

Return type:

Tensor

to_symbols(tensor)[source]

Reverse map the tensor to spaCy features.

Return type:

List[Dict[str, float]]

Returns:

a list of sentences, each a map of name/count pairs.
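
A usage sketch ('count_vec' is an assumed configured instance):

    ctx = count_vec.encode(doc)
    arr = count_vec.decode(ctx)          # (|sentences|, |decoded features|)
    counts = count_vec.to_symbols(arr)   # e.g. [{'NN': 2.0, 'VB': 1.0, ...}]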

class zensols.deepnlp.vectorize.vectorizers.DepthFeatureDocumentVectorizer(name, config_factory, feature_id, manager, encode_transformed)[source]

Bases: FeatureDocumentVectorizer

Generate the depths of tokens based on how deep they are in a head dependency tree.

Even though this is a document level vectorizer and is usually added in a join layer rather than stacked onto the embedding layer, it still assumes congruence with the token length, which is used in its shape.

Important: do not combine sentences into a single document with combine_sentences(), since features are created as a dependency parse tree at the sentence level. Otherwise, the dependency relations are broken and result in a zeroed tensor.

Shape:

(|sentences|, |sentinel tokens|, 1)

DESCRIPTION = 'head depth'
FEATURE_TYPE = 1
__init__(name, config_factory, feature_id, manager, encode_transformed)
encode(doc)[source]

Encode by combining documents into one monolithic document when a tuple is passed; otherwise, default to the superclass's encode functionality.

Return type:

FeatureContext

class zensols.deepnlp.vectorize.vectorizers.EnumContainerFeatureVectorizer(name, config_factory, feature_id, manager, encode_transformed, decoded_feature_ids=None)[source]

Bases: FeatureDocumentVectorizer

Encode tokens found in the container by aggregating the spaCy vectorizers' output. The result is a concatenated binary representation of all configured token level features for each token. This adds only token vectorizer features generated by the spaCy vectorizers (subclasses of SpacyFeatureVectorizer), and not the features themselves (such as is_stop, etc.).

All spaCy features are encoded given by spacy_vectorizers. However, only those given in decoded_feature_ids are produced in the output tensor after decoding.

The motivation for encoding all, but decoding a subset of features is for feature selection during training. This is because encoding the features (in a sparse matrix) takes comparatively less time and space over having to re-encode all batches.

Rows are tokens, columns intervals of features. The encoded matrix is sparse, and decoded as a dense matrix.

Shape:

(|sentences|, |sentinel tokens|, |decoded features|)

See:

SpacyFeatureVectorizer

ATTR_EXP_META = ('decoded_feature_ids',)
DESCRIPTION = 'spacy feature vectorizer'
FEATURE_TYPE = 1
__init__(name, config_factory, feature_id, manager, encode_transformed, decoded_feature_ids=None)
decoded_feature_ids: Set[str] = None

The spaCy generated features used only during decoding (see class docs). Examples include norm, ent, dep, tag. When set to None, all those given in spacy_vectorizers are used.
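
A sketch of decode-time feature selection (instance name and feature IDs are examples):

    # everything configured in spacy_vectorizers is encoded once ...
    ctx = enum_vec.encode(doc)
    # ... but only these features appear in the decoded tensor, so other
    # subsets can be tried during training without re-encoding batches
    enum_vec.decoded_feature_ids = {'tag', 'dep'}
    arr = enum_vec.decode(ctx)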

to_symbols(tensor)[source]

Reverse map the tensor to spaCy features.

Return type:

List[List[Dict[str, float]]]

Returns:

a list of sentences, each with a list of tokens, each having a map of name/count pairs

class zensols.deepnlp.vectorize.vectorizers.MutualFeaturesContainerFeatureVectorizer(name, config_factory, feature_id, manager, encode_transformed, count_vectorizer_feature_id)[source]

Bases: MultiDocumentVectorizer

Vectorize the shared count of all tokens as a S X M * N tensor, where S is the number of sentences, M is the number of token feature ids and N is the columns of the output of the SpacyFeatureVectorizer vectorizer.

This uses an instance of CountEnumContainerFeatureVectorizer to compute counts across each spaCy feature and then sums them for only those features shared. If at least one shared document has a zero count, the feature is zeroed.

The input to this feature vectorizer is a tuple of N TokenContainer instances.

Shape:

(|sentences|, |decoded features|,) from the referenced CountEnumContainerFeatureVectorizer given by count_vectorizer_feature_id
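
A usage sketch ('mutual_vec' assumed configured with count_vectorizer_feature_id naming a count vectorizer in the same manager):

    ctx = mutual_vec.encode((doc_a, doc_b))   # tuple of documents
    arr = mutual_vec.decode(ctx)              # summed shared counts; zeroed
                                              # where any document's count is 0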

DESCRIPTION = 'mutual feature counts'
__init__(name, config_factory, feature_id, manager, encode_transformed, count_vectorizer_feature_id)
property count_vectorizer: CountEnumContainerFeatureVectorizer

Return the count vectorizer used for the count features.

See:

count_vectorizer_feature_id

count_vectorizer_feature_id: str

The string feature ID configured in the FeatureDocumentVectorizerManager of the CountEnumContainerFeatureVectorizer to use for the count features.

property ones: Tensor

Return a tensor of ones for the shape of this instance.

class zensols.deepnlp.vectorize.vectorizers.OneHotEncodedFeatureDocumentVectorizer(name, config_factory, feature_id, manager, categories, optimize_bools, encode_transformed, feature_attribute=None, level='token')[source]

Bases: FeatureDocumentVectorizer, OneHotEncodedEncodableFeatureVectorizer

Vectorize nominal enumerated features into one-hot encoded vectors. The feature is taken from a FeatureToken. If level is token, the features are token attributes identified by feature_attribute. If the level is document, the feature is taken from the document.

Shape:
DESCRIPTION = 'encoded feature document vectorizer'
__init__(name, config_factory, feature_id, manager, categories, optimize_bools, encode_transformed, feature_attribute=None, level='token')
feature_attribute: Tuple[str] = None

The feature attributes to vectorize.

property feature_type: TextFeatureType

The type of feature this vectorizer generates. This is used by classes such as EmbeddingNetworkModule to determine where to add the features, such as concating to the embedding layer, join layer etc.

level: str = 'token'

The level at which to take the attribute value, which is document, sentence or token.
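
A hedged sketch ('onehot_vec' assumed configured with the categories to encode and feature_attribute naming a FeatureToken attribute, e.g. ent_):

    arr = onehot_vec.decode(onehot_vec.encode(doc))
    # level='token'    -> one one-hot vector per token attribute value
    # level='document' -> the attribute is taken from the document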

class zensols.deepnlp.vectorize.vectorizers.OverlappingFeatureDocumentVectorizer(name, config_factory, feature_id, manager, encode_transformed)[source]

Bases: MultiDocumentVectorizer

Vectorize the number of normalized and lemmatized tokens (in this order) across multiple documents.

The input to this feature vectorizer is a tuple of N FeatureDocument instances.

Shape:

(2,)

DESCRIPTION = 'overlapping token counts'
__init__(name, config_factory, feature_id, manager, encode_transformed)
class zensols.deepnlp.vectorize.vectorizers.StatisticsFeatureDocumentVectorizer(name, config_factory, feature_id, manager, encode_transformed)[source]

Bases: FeatureDocumentVectorizer

Vectorizes basic surface language statistics, which include:

  • character count

  • token count

  • min token length in characters

  • max token length in characters

  • average token length in characters (|characters| / |tokens|)

  • sentence count (for FeatureDocuments)

  • average sentence length (|tokens| / |sentences|)

  • min sentence length

  • max sentence length

Shape:

(1, 9,)
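
A usage sketch ('stats_vec' assumed configured):

    arr = stats_vec.decode(stats_vec.encode(doc))    # shape (1, 9)
    # columns follow the order listed above, e.g.:
    char_count, token_count = arr[0, 0], arr[0, 1]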

DESCRIPTION = 'statistics'
FEATURE_TYPE = 2
__init__(name, config_factory, feature_id, manager, encode_transformed)
class zensols.deepnlp.vectorize.vectorizers.TokenEmbeddingFeatureVectorizer(name, config_factory, feature_id, manager, encode_transformed, delegate_feature_id, size=-1, pad_label=-100, level=TextFeatureType.TOKEN, add_dims=0)[source]

Bases: AggregateEncodableFeatureVectorizer, FeatureDocumentVectorizer

An AggregateEncodableFeatureVectorizer that is useful for token level classification (i.e. NER). It uses a delegate to first vectorize the features, then concatenates them into one aggregate.

In shape terms, this takes the single sentence position. The additional unsqueezed dimensions set with add_dims are useful when the delegate vectorizer encodes booleans or any other value that does not take an additional dimension.

Shape:

(1, |tokens|, <delegate vectorizer shape>[, <unsqueeze dimensions>])
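
A hedged sketch for token level classification ('tok_vec' assumed configured with a delegate label vectorizer via delegate_feature_id):

    ctx = tok_vec.encode(doc)
    arr = tok_vec.decode(ctx)   # (1, |tokens|, <delegate vectorizer shape>)
    # positions past a sentence's length are filled with pad_label (-100,
    # the default ignore index of PyTorch's cross entropy loss)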

DESCRIPTION = 'token aggregate vectorizer'
__init__(name, config_factory, feature_id, manager, encode_transformed, delegate_feature_id, size=-1, pad_label=-100, level=TextFeatureType.TOKEN, add_dims=0)
add_dims: int = 0

The number of dimensions to add (see class docs).

encode(doc)[source]

Encode data to a context ready to (potentially) be pickled.

Return type:

FeatureContext

property feature_type: TextFeatureType

The type of feature this vectorizer generates. This is used by classes such as EmbeddingNetworkModule to determine where to add the features, such as concating to the embedding layer, join layer etc.

level: TextFeatureType = 1

The level at which to take the attribute value, which is document, sentence or token.

class zensols.deepnlp.vectorize.vectorizers.WordEmbeddingFeatureVectorizer(name, config_factory, feature_id, manager, embed_model)[source]

Bases: EncodableFeatureVectorizer

Vectorizes string tokens into word embedding vectors. This class works directly with string tokens rather than FeatureDocument instances. It can be useful when there's a need to vectorize tokens outside of a feature document (i.e. cui2vec).
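
A hedged sketch; the instance name and the input form (a nested list of raw string tokens) are assumptions:

    # 'wordemb_vec' works on raw tokens, not FeatureDocument instances
    arr = wordemb_vec.transform([['heart', 'attack'], ['heart', 'disease']])
    # each token is resolved through embed_model's token-to-vector mapping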

DESCRIPTION = 'word embedding encoder'
FEATURE_TYPE = 4
__init__(name, config_factory, feature_id, manager, embed_model)
embed_model: WordEmbedModel

The word embedding model that has the string tokens to vector mapping.

Module contents

This module vectorizes natural language features into PyTorch tensors.