zensols.deepnlp.vectorize package¶
Submodules¶
zensols.deepnlp.vectorize.embed module¶
This file contains a stash used to load an embedding layer. It creates features in batches of matrices and persists only the matrices (sans features) for efficient retrieval.
- class zensols.deepnlp.vectorize.embed.EmbeddingFeatureVectorizer(name, config_factory, feature_id, manager, encode_transformed, fold_method, embed_model, decode_embedding=False)[source]¶
- Bases: FoldingDocumentVectorizer, Primeable, Dictable

  Vectorize a FeatureDocument as a vector of embedding indexes. Later, these indexes are used in an EmbeddingLayer to create the input word embedding during execution of the model.

  - __init__(name, config_factory, feature_id, manager, encode_transformed, fold_method, embed_model, decode_embedding=False)¶
 - decode_embedding: bool = False¶
- Whether or not to decode the embedding during the decode phase, which is helpful when caching batches; otherwise, the data is decoded from indexes to embeddings each epoch.

  Note that this option and functionality cannot be obviated by that implemented with the encode_transformed attribute. The difference is whether more work is done during decoding rather than encoding. An example of when this is useful is for large word embeddings (i.e. Google 300D pretrained) where the index to tensor embedding transform is done while decoding rather than in the forward pass, so it is not done for every epoch.
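The tradeoff can be illustrated outside the framework with plain PyTorch; the sizes and tensors below are made up for illustration and are not part of the API:

```python
import torch

# a small "pretrained" embedding matrix: vocabulary of 10 words, 4 dimensions
embed_weights = torch.randn(10, 4)
# the encoded representation: compact word indexes for three tokens
token_indexes = torch.tensor([3, 1, 7])

# decode_embedding=False: keep the indexes and look the vectors up in the
# forward pass, repeating the lookup every epoch
looked_up_each_epoch = embed_weights[token_indexes]

# decode_embedding=True: do the lookup once while decoding the batch and
# cache the dense tensor, trading memory for per-epoch compute
cached_once = embed_weights[token_indexes]

assert torch.equal(looked_up_each_epoch, cached_once)
```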
 - embed_model: Union[WordEmbedModel, TransformerEmbedding]¶
- The word vector model. Types for this value include WordEmbedModel and TransformerEmbedding.
 
- class zensols.deepnlp.vectorize.embed.WordVectorEmbeddingFeatureVectorizer(name, config_factory, feature_id, manager, encode_transformed, fold_method, embed_model, decode_embedding=False, token_feature_id='norm')[source]¶
- Bases: EmbeddingFeatureVectorizer

  Vectorize sentences using an embedding model (embed_model) of type WordEmbedModel.

  The encoder returns the indices of the word embedding for each token in the input FeatureDocument. The decoder returns the corresponding word embedding vectors if decode_embedding is True. Otherwise it returns the same indices, which are later used by the embedding layer (usually EmbeddingLayer).

  - DESCRIPTION: ClassVar[str] = 'word vector document embedding'¶
 - FEATURE_TYPE: ClassVar[TextFeatureType] = 4¶
 - __init__(name, config_factory, feature_id, manager, encode_transformed, fold_method, embed_model, decode_embedding=False, token_feature_id='norm')¶
 - token_feature_id: str = 'norm'¶
- The FeatureToken attribute used to index the embedding vectors.
 - property vectors: Tensor¶
 
zensols.deepnlp.vectorize.manager module¶
An extension of a feature vectorizer manager that parses and vectorizes natural language.
- class zensols.deepnlp.vectorize.manager.FeatureDocumentVectorizer(name, config_factory, feature_id, manager, encode_transformed)[source]¶
- Bases: TransformableFeatureVectorizer

  Creates document or sentence level features using instances of TokenContainer.

  Subclasses implement specific vectorization on a single document using _encode(), and it is up to the subclass to decide how to vectorize the document.

  Multiple documents given as an aggregate in a list or tuple of documents are supported. Only document level vectorization is supported to provide one standard contract across framework components and vectorizers.

  If more than one document is given during encoding, they will be combined into one document as described by FoldingDocumentVectorizer.encoding_level = concat_tokens.

  - __init__(name, config_factory, feature_id, manager, encode_transformed)¶
 - encode(doc)[source]¶
- Encode by combining documents into one monolithic document when a tuple is passed; otherwise, default to the superclass's encode functionality.
- Return type:
 
 - property feature_type: TextFeatureType¶
- The type of feature this vectorizer generates. This is used by classes such as EmbeddingNetworkModule to determine where to add the features, such as concatenating to the embedding layer, join layer, etc.
 
- class zensols.deepnlp.vectorize.manager.FeatureDocumentVectorizerManager(name, config_factory, torch_config, configured_vectorizers, doc_parser, token_length, token_feature_ids=None, configured_spacy_vectorizers=())[source]¶
- Bases: FeatureVectorizerManager

  Creates and manages instances of FeatureDocumentVectorizer and parses text into feature based documents.

  This is used to manage the relationship of a given set of parsed features, keeping in mind that parsing will usually happen as a preprocessing step. A second step is the vectorization of those features, which can be any proper subset of those features parsed in the previous step. However, these checks, of course, are not necessary if pickling isn't used across the parse and vectorization steps.

  Instances can set a hard fixed token length, in which case vectorized tensors have a fixed width based on the setting of token_length. However, this can also be set to use the longest sentence of the document, which is useful when computing vectorized tensors from the document as a batch, even if the input data are batched as a group of sentences in a document.

  See: parse()

  - __init__(name, config_factory, torch_config, configured_vectorizers, doc_parser, token_length, token_feature_ids=None, configured_spacy_vectorizers=())¶
- configured_spacy_vectorizers: Tuple[SpacyFeatureVectorizer, ...] = ()¶
- Additional vectorizers that aren’t registered, such as those added from external packages. 
- doc_parser: FeatureDocumentParser¶
- Used to parse() documents.
 - get_token_length(doc)[source]¶
- Get the token length for the document. If is_batch_token_length is True, then the token length is computed based on the longest sentence in the document doc. See the class docs.
- Parameters:
- doc (FeatureDocument) – used to compute the longest sentence if is_batch_token_length is True
- Return type:
- Returns:
- the (global) token length for the document 
 
 - property is_batch_token_length: bool¶
- Return whether or not the token length is variable based on the longest token length in the batch. 
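A behavioral sketch of the documented logic in plain Python; the function and parameter names below are illustrative, not the class's actual implementation:

```python
def token_length(sent_lengths, fixed_length, is_batch_token_length):
    # variable width: use the longest sentence in the document
    if is_batch_token_length:
        return max(sent_lengths)
    # fixed width: use the configured token_length
    return fixed_length

assert token_length([5, 9, 3], 20, is_batch_token_length=True) == 9
assert token_length([5, 9, 3], 20, is_batch_token_length=False) == 20
```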
 - property ordered_spacy_vectorizers: Tuple[Tuple[str, SpacyFeatureVectorizer], ...]¶
- The spaCy vectorizers in a guaranteed stable ordering. 
 - parse(text, *args, **kwargs)[source]¶
- Parse text or a text as a list of sentences.

  Important: Parsing documents through this manager instance is preferred since safety checks are made that the features available are those used when documents were parsed before pickling.
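A hedged usage sketch; the configuration file and section name below are hypothetical and depend entirely on how the application wires its resources:

```python
from zensols.config import ImportIniConfig, ImportConfigFactory

# hypothetical: 'app.conf' defines a FeatureDocumentVectorizerManager
# in a section named 'language_vectorizer_manager'
factory = ImportConfigFactory(ImportIniConfig('app.conf'))
mng = factory.instance('language_vectorizer_manager')

# parsing through the manager applies the safety checks described above
doc = mng.parse('The quick brown fox jumps over the lazy dog.')
```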
 - property spacy_vectorizers: Dict[str, SpacyFeatureVectorizer]¶
- Return vectorizers based on the token_feature_ids configured on this instance. Keys are token level feature IDs found in SpacyFeatureVectorizer.VECTORIZERS.
- Returns:
- a collections.OrderedDict of vectorizers
 
- token_feature_ids: Set[str] = None¶
- Indicates which spaCy parsed features to generate in the vectorizers held in this instance. Examples include norm, ent, dep, tag.

  If this is not set, it defaults to the token_feature_ids in doc_parser.

- See:
- SpacyFeatureVectorizer.VECTORIZERS
 
 
- class zensols.deepnlp.vectorize.manager.FoldingDocumentVectorizer(name, config_factory, feature_id, manager, encode_transformed, fold_method)[source]¶
- Bases: FeatureDocumentVectorizer

  This class is like FeatureDocumentVectorizer, but provides more options in how to fold multiple documents into a single document for vectorization.

  Based on the value of fold_method, this class encodes a sequence of FeatureDocument instances differently.

  Subclasses must implement _encode().

  Note: this is not to be confused with the MultiDocumentVectorizer vectorizer, which vectorizes multiple documents into document level features.

  - __init__(name, config_factory, feature_id, manager, encode_transformed, fold_method)¶
 - decode(context)[source]¶
- Decode a (potentially) unpickled context and return a tensor using the manager's torch_config.
- Return type:
- Tensor
 
 - encode(doc)[source]¶
- Encode by combining documents into one monolithic document when a tuple is passed; otherwise, default to the superclass's encode functionality.
- Return type:
 
- fold_method: str¶
- How multiple documents are merged into a single document for vectorization, which is one of the following (sketched after this list):
- raise: raise an error, allowing only single documents to be vectorized
- concat_tokens: concatenate the tokens of each document into singleton sentence documents; uses combine_documents() with concat_tokens = True
- sentence: all sentences of all documents become singleton sentence documents; uses combine_documents() with concat_tokens = False
- separate: every sentence of each document is encoded separately, then each sentence's output is concatenated as the respective document during decoding; this uses _encode() for each sentence of each document and _decode() to decode back into the same document structure as the original
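The folding strategies can be sketched with plain Python lists standing in for documents of sentences of tokens (illustrative only; the real methods operate on FeatureDocument instances):

```python
# two "documents", each a list of sentences, each a list of tokens
doc_a = [['the', 'cat'], ['sat']]
doc_b = [['a', 'dog']]
docs = (doc_a, doc_b)

# concat_tokens: one document with a single sentence of all tokens
concat_tokens = [[tok for doc in docs for sent in doc for tok in sent]]

# sentence: every sentence of every document becomes its own singleton
sentence = [sent for doc in docs for sent in doc]

print(concat_tokens)  # [['the', 'cat', 'sat', 'a', 'dog']]
print(sentence)       # [['the', 'cat'], ['sat'], ['a', 'dog']]
```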
 
 
- class zensols.deepnlp.vectorize.manager.MultiDocumentVectorizer(name, config_factory, feature_id, manager, encode_transformed)[source]¶
- Bases: FeatureDocumentVectorizer

  Vectorizes multiple documents into document level features. Features generated by subclasses are sometimes used in join layers. Examples include OverlappingFeatureDocumentVectorizer.

  This is not to be confused with FoldingDocumentVectorizer, which merges multiple documents into a single document for vectorization.

  - FEATURE_TYPE = 2¶
 - __init__(name, config_factory, feature_id, manager, encode_transformed)¶
 
- class zensols.deepnlp.vectorize.manager.TextFeatureType(value, names=<not given>, *values, module=None, qualname=None, type=None, start=1, boundary=None)[source]¶
- Bases: Enum

  The type of FeatureDocumentVectorizer.

  - DOCUMENT = 2¶
- Document level, typically added to a join layer. 
 - EMBEDDING = 4¶
- Embedding layer, typically used as the input layer. 
 - MULTI_DOCUMENT = 3¶
- Multiple documents for the purposes of aggregating shared features.
 - NONE = 5¶
- Other type, which tells the framework to ignore the vectorized features. 
 - TOKEN = 1¶
- Token level with a shape congruent with the number of tokens, typically concatenated with the embedding layer. 
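For illustration, the documented member values can be mirrored and branched on to place features; the dispatch below is a hypothetical sketch, not EmbeddingNetworkModule's code:

```python
from enum import Enum

class TextFeatureType(Enum):
    # values mirror the documented enum members
    TOKEN = 1
    DOCUMENT = 2
    MULTI_DOCUMENT = 3
    EMBEDDING = 4
    NONE = 5

def placement(ftype: TextFeatureType) -> str:
    if ftype is TextFeatureType.EMBEDDING:
        return 'input (embedding) layer'
    if ftype is TextFeatureType.TOKEN:
        return 'concatenate with the embedding layer output'
    if ftype is TextFeatureType.DOCUMENT:
        return 'concatenate at the join layer'
    return 'aggregate shared features or ignore'

assert placement(TextFeatureType.EMBEDDING) == 'input (embedding) layer'
```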
 
zensols.deepnlp.vectorize.spacy module¶
Feature (ID) normalization.
- class zensols.deepnlp.vectorize.spacy.DependencyFeatureVectorizer(name, config_factory, feature_id, description, torch_config, model, symbols)[source]¶
- Bases: SpacyFeatureVectorizer

  A feature vectorizer for dependency head trees.

  - DESCRIPTION: ClassVar[str] = 'dependency'¶
 - FEATURE_ID: ClassVar[str] = 'dep'¶
 - LANG: ClassVar[str] = 'en'¶
 - SYMBOLS: ClassVar[str] = 'acl acomp advcl advmod agent amod appos attr aux\nauxpass case cc ccomp clf complm compound conj cop csubj csubjpass dative dep\ndet discourse dislocated dobj expl fixed flat goeswith hmod hyph infmod intj\niobj list mark meta neg nmod nn npadvmod nsubj nsubjpass nounmod npmod num\nnumber nummod oprd obj obl orphan parataxis partmod pcomp pobj poss possessive\npreconj prep prt punct quantmod rcmod relcl reparandum root vocative xcomp ROOT'¶
 - __init__(name, config_factory, feature_id, description, torch_config, model, symbols)¶
 
- class zensols.deepnlp.vectorize.spacy.NamedEntityRecognitionFeatureVectorizer(name, config_factory, feature_id, description, torch_config, model, symbols)[source]¶
- Bases: SpacyFeatureVectorizer

  A feature vectorizer for NER tags.

  - DESCRIPTION: ClassVar[str] = 'named entity recognition'¶
 - FEATURE_ID: ClassVar[str] = 'ent'¶
 - LANG: ClassVar[str] = 'en'¶
 - SYMBOLS: ClassVar[str] = 'PERSON NORP FACILITY FAC ORG GPE LOC PRODUCT\nEVENT WORK_OF_ART LAW LANGUAGE DATE TIME PERCENT MONEY QUANTITY ORDINAL CARDINAL\nPER MISC'¶
 - __init__(name, config_factory, feature_id, description, torch_config, model, symbols)¶
 
- class zensols.deepnlp.vectorize.spacy.PartOfSpeechFeatureVectorizer(name, config_factory, feature_id, description, torch_config, model, symbols)[source]¶
- Bases: SpacyFeatureVectorizer

  A feature vectorizer for POS tags.

  - DESCRIPTION: ClassVar[str] = 'part of speech'¶
 - FEATURE_ID: ClassVar[str] = 'tag'¶
 - LANG: ClassVar[str] = 'en'¶
 - SYMBOLS: ClassVar[str] = 'ADJ ADP ADV AUX CONJ CCONJ DET INTJ NOUN NUM\nPART PRON PROPN PUNCT SCONJ SYM VERB X EOL SPACE . , -LRB- -RRB- `` " \' $ # AFX\nCC CD DT EX FW HYPH IN JJ JJR JJS LS MD NIL NN NNP NNPS NNS PDT POS PRP PRP$ RB\nRBR RBS RP TO UH VB VBD VBG VBN VBP VBZ WDT WP WP$ WRB SP ADD NFP GW XX BES HVS\nNP PP VP ADVP ADJP SBAR PRT PNP'¶
 - __init__(name, config_factory, feature_id, description, torch_config, model, symbols)¶
 
- class zensols.deepnlp.vectorize.spacy.SpacyFeatureVectorizer(name, config_factory, feature_id, description, torch_config, model, symbols)[source]¶
- Bases: FeatureVectorizer

  This normalizes feature IDs of parsed token features into a number between [0, 1]. This is useful for normalized feature vectors as input to neural networks. Input to this would be strings like token.ent_ found on a zensols.nlp.feature.TokenAttributes instance.

  The class is also designed to create features using indexes, so there are methods to resolve a unique ID from an identifier.

  Instances of this class behave like a dict.

  All symbols are taken from spacy.glossary.GLOSSARY.

  - Parameters:
- vocab – the vocabulary used for from_spacy to compute the normalized feature from the spaCy ID (i.e. token.ent_, token.tag_ etc.)
- See:
- spacy.glossary.GLOSSARY
- See:
- zensols.nlp.feature.TokenAttributes
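A conceptual sketch of mapping a closed symbol set into [0, 1]; the exact value spacing is an implementation detail of the class, and the symbol subset below is made up:

```python
from typing import Optional

symbols = 'PERSON NORP ORG GPE LOC'.split()  # tiny illustrative subset

def dist(symbol: str) -> Optional[float]:
    # a normalized value in [0, 1], or None if the symbol isn't found
    if symbol not in symbols:
        return None
    return symbols.index(symbol) / (len(symbols) - 1)

assert dist('PERSON') == 0.0
assert dist('LOC') == 1.0
assert dist('MISSING') is None
```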
 - __init__(name, config_factory, feature_id, description, torch_config, model, symbols)¶
 - dist(symbol)[source]¶
- Return a normalized feature float if symbol is found.
- Return type:
- Returns:
- a normalized value between [0, 1] or None if the symbol isn't found
 
 - from_spacy(id)[source]¶
- Return a binary feature from a spaCy ID, or None if it doesn't have a mapping for the ID.
- Return type:
- Tensor
 
 - id_from_spacy(id, default=-1)[source]¶
- Return the ID of this vectorizer for the Spacy ID or -1 if not found. - Return type:
 
 - id_from_spacy_symbol(id, default=-1)[source]¶
- Return the spaCy text symbol for its ID (token.ent -> token.ent_).
- Return type:
 
 - model: Language¶
- The spaCy vocabulary used to create IDs from strings.

  See: id_from_spacy_symbol()
 - property symbols: Sequence[str]¶
- The list of symbols to vectorize, provided by spaCy as a feature if given as a tuple or list. If a string, it is used as the name of the pipe whose labels attribute provides the symbols.
 - torch_config: TorchConfig¶
- The torch configuration used to create tensors. 
 
zensols.deepnlp.vectorize.vectorizers module¶
Generate and vectorize language features.
- class zensols.deepnlp.vectorize.vectorizers.CountEnumContainerFeatureVectorizer(name, config_factory, feature_id, manager, encode_transformed, decoded_feature_ids=None, string_symbol_feature_ids=None)[source]¶
- Bases: DecodedContainerFeatureVectorizer

  Vectorize the counts of parsed spaCy features. This generates the count of tokens as an S × (M · N) tensor, where S is the number of sentences, M is the number of token feature IDs and N is the number of columns of the output of the SpacyFeatureVectorizer vectorizer. Each column position's count represents the number of counts for that spaCy symbol for that index position in the output of SpacyFeatureVectorizer.

  This class uses the same efficiency in decoding features given in EnumContainerFeatureVectorizer.

  - Shape:
 - ATTR_EXP_META = ('decoded_feature_ids',)¶
 - DESCRIPTION = 'token level feature counts'¶
 - FEATURE_TYPE = 2¶
 - __init__(name, config_factory, feature_id, manager, encode_transformed, decoded_feature_ids=None, string_symbol_feature_ids=None)¶
 - get_feature_counts(sent, fvec)[source]¶
- Return the count of all tokens as an S × N tensor, where S is the number of sentences and N is the number of columns of the fvec vectorizer. Each column position's count represents the number of counts for that spaCy symbol for that index position in the fvec.
- Return type:
- Tensor
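A conceptual sketch of the per-sentence count layout using plain PyTorch (the symbol set and tags are made up; the real vectorizer derives them from spaCy):

```python
import torch

symbols = ['NOUN', 'VERB', 'ADJ']    # N columns, one per spaCy symbol
sents = [['NOUN', 'NOUN', 'VERB'],   # S sentences given as token tags
         ['ADJ']]

counts = torch.zeros(len(sents), len(symbols))  # S x N
for i, sent in enumerate(sents):
    for tag in sent:
        counts[i, symbols.index(tag)] += 1

print(counts)  # tensor([[2., 1., 0.],
               #         [0., 0., 1.]])
```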
 
 
- class zensols.deepnlp.vectorize.vectorizers.DecodedContainerFeatureVectorizer(name, config_factory, feature_id, manager, encode_transformed, decoded_feature_ids=None)[source]¶
- Bases: FeatureDocumentVectorizer

  A base class that allows for configuring decoded features after batches are created at train time.

  - __init__(name, config_factory, feature_id, manager, encode_transformed, decoded_feature_ids=None)¶
- decoded_feature_ids: Set[str] = None¶
- The spaCy generated features used only during decoding (see class docs). Examples include norm, ent, dep, tag. When set to None, all those given in the spacy_vectorizers are used.
 - write(depth=0, writer=<_io.TextIOWrapper name='<stdout>' mode='w' encoding='utf-8'>)[source]¶
- Write the contents of this instance to writer using indentation depth.
- Parameters:
- depth (int) – the starting indentation depth
- writer (TextIOBase) – the writer to dump the content of this writable
 
 
 
- class zensols.deepnlp.vectorize.vectorizers.DepthFeatureDocumentVectorizer(name, config_factory, feature_id, manager, encode_transformed)[source]¶
- Bases: FeatureDocumentVectorizer

  Generate the depths of tokens based on how deep they are in a head dependency tree.

  Even though this is a document level vectorizer and is usually added in a join layer rather than stacked onto the embedding layer, it still assumes congruence with the token length, which is used in its shape.

  Important: do not combine sentences into a single document with combine_sentences() since features are created as a dependency parse tree at the sentence level. Otherwise, the dependency relations are broken and result in a zeroed tensor.

  - Shape:
- (|sentences|, |sentinel tokens|, 1) 
 - DESCRIPTION = 'head depth'¶
 - FEATURE_TYPE = 1¶
 - __init__(name, config_factory, feature_id, manager, encode_transformed)¶
 
- class zensols.deepnlp.vectorize.vectorizers.EnumContainerFeatureVectorizer(name, config_factory, feature_id, manager, encode_transformed, decoded_feature_ids=None, string_symbol_feature_ids=None)[source]¶
- Bases: DecodedContainerFeatureVectorizer

  Encode tokens found in the container by aggregating the spaCy vectorizers' output. The result is a concatenated binary representation of all configured token level features for each token. This adds only token vectorizer features generated by the spaCy vectorizers (subclasses of SpacyFeatureVectorizer), and not the features themselves (such as is_stop etc.).

  All spaCy features given by spacy_vectorizers are encoded. However, only those given in decoded_feature_ids are produced in the output tensor after decoding.

  The motivation for encoding all, but decoding a subset of, features is feature selection during training. This is because encoding the features (in a sparse matrix) takes comparatively less time and space than having to re-encode all batches.

  Rows are tokens, columns are intervals of features. The encoded matrix is sparse, and decoded as a dense matrix.

  - Shape:
- See:
 - ATTR_EXP_META = ('decoded_feature_ids',)¶
 - DESCRIPTION = 'spacy feature vectorizer'¶
 - FEATURE_TYPE = 1¶
 - __init__(name, config_factory, feature_id, manager, encode_transformed, decoded_feature_ids=None, string_symbol_feature_ids=None)¶
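A conceptual sketch of the concatenated binary layout, where rows are tokens and column intervals are feature sets (the two symbol sets below are made up):

```python
import torch

tags = ['NOUN', 'VERB']        # first column interval
ents = ['PERSON', 'ORG', '-']  # second column interval
tokens = [('NOUN', 'PERSON'), ('VERB', '-')]

rows = []
for tag, ent in tokens:
    row = torch.zeros(len(tags) + len(ents))
    row[tags.index(tag)] = 1              # one-hot within the tag interval
    row[len(tags) + ents.index(ent)] = 1  # one-hot within the ent interval
    rows.append(row)

encoded = torch.stack(rows)
print(encoded)  # tensor([[1., 0., 1., 0., 0.],
                #         [0., 1., 0., 0., 1.]])
```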
 
- class zensols.deepnlp.vectorize.vectorizers.MutualFeaturesContainerFeatureVectorizer(name, config_factory, feature_id, manager, encode_transformed, count_vectorizer_feature_id)[source]¶
- Bases: MultiDocumentVectorizer

  Vectorize the shared count of all tokens as an S × (M · N) tensor, where S is the number of sentences, M is the number of token feature IDs and N is the number of columns of the output of the SpacyFeatureVectorizer vectorizer.

  This uses an instance of CountEnumContainerFeatureVectorizer to compute across each spaCy feature and then sums them up for only those features shared. If at least one shared document has a zero count, the feature is zeroed.

  The input to this feature vectorizer is a tuple of N TokenContainer instances.

  - Shape:
- (|sentences|, |decoded features|,) from the referenced CountEnumContainerFeatureVectorizer given by count_vectorizer_feature_id
 - DESCRIPTION = 'mutual feature counts'¶
 - __init__(name, config_factory, feature_id, manager, encode_transformed, count_vectorizer_feature_id)¶
 - property count_vectorizer: CountEnumContainerFeatureVectorizer¶
- Return the count vectorizer used for the count features. 
- count_vectorizer_feature_id: str¶
- The string feature ID configured in the FeatureDocumentVectorizerManager of the CountEnumContainerFeatureVectorizer to use for the count features.
 - property ones: Tensor¶
- Return a tensor of ones for the shape of this instance. 
 
- class zensols.deepnlp.vectorize.vectorizers.OneHotEncodedFeatureDocumentVectorizer(name, config_factory, feature_id, manager, categories, optimize_bools, encode_transformed, feature_attribute=None, level='token')[source]¶
- Bases: FeatureDocumentVectorizer, OneHotEncodedEncodableFeatureVectorizer

  Vectorize nominal enumerated features into one-hot encoded vectors. The feature is taken from a FeatureToken. If level is token, then the features are token attributes identified by feature_attribute. If level is document, the feature is taken from the document.

  - Shape:
- level = document: (1, |categories|) 
- level = token: (|<sentences>|, |<sentinel tokens>|, |categories|) 
 
 - DESCRIPTION = 'encoded feature document vectorizer'¶
 - __init__(name, config_factory, feature_id, manager, categories, optimize_bools, encode_transformed, feature_attribute=None, level='token')¶
 - property feature_type: TextFeatureType¶
- The type of feature this vectorizer generates. This is used by classes such as EmbeddingNetworkModule to determine where to add the features, such as concatenating to the embedding layer, join layer, etc.
 
- class zensols.deepnlp.vectorize.vectorizers.OverlappingFeatureDocumentVectorizer(name, config_factory, feature_id, manager, encode_transformed)[source]¶
- Bases: MultiDocumentVectorizer

  Vectorize the number of normalized and lemmatized tokens (in this order) across multiple documents.

  The input to this feature vectorizer is a tuple of N FeatureDocument instances.

  - Shape:
- (2,) 
 - DESCRIPTION = 'overlapping token counts'¶
 - __init__(name, config_factory, feature_id, manager, encode_transformed)¶
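A conceptual sketch of the two overlap counts, with sets standing in for the documents' normalized and lemmatized token forms (made-up data):

```python
# normalized and lemmatized token forms of two documents
a_norm, a_lemma = {'the', 'cats', 'sat'}, {'the', 'cat', 'sit'}
b_norm, b_lemma = {'the', 'cat'}, {'the', 'cat'}

# shape (2,): normalized overlap count, then lemmatized overlap count
overlap = (len(a_norm & b_norm), len(a_lemma & b_lemma))
print(overlap)  # (1, 2)
```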
 
- class zensols.deepnlp.vectorize.vectorizers.StatisticsFeatureDocumentVectorizer(name, config_factory, feature_id, manager, encode_transformed)[source]¶
- Bases: FeatureDocumentVectorizer

  Vectorizes basic surface language statistics, which include (see the sketch after this entry):

  - character count
- token count 
- min token length in characters 
- max token length in characters 
- average token length in characters (|characters| / |tokens|) 
- sentence count (for FeatureDocuments) 
- average sentence length (|tokens| / |sentences|) 
- min sentence length 
- max sentence length 
 - Shape:
- (1, 9,) 
 - DESCRIPTION = 'statistics'¶
 - FEATURE_TYPE = 2¶
 - __init__(name, config_factory, feature_id, manager, encode_transformed)¶
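A plain Python sketch computing the nine statistics in the documented order (illustrative, not the class's implementation):

```python
sents = [['The', 'cat', 'sat'], ['It', 'purred']]
toks = [t for s in sents for t in s]
char_cnt = sum(len(t) for t in toks)

stats = (
    char_cnt,                    # character count
    len(toks),                   # token count
    min(len(t) for t in toks),   # min token length in characters
    max(len(t) for t in toks),   # max token length in characters
    char_cnt / len(toks),        # average token length (|characters| / |tokens|)
    len(sents),                  # sentence count
    len(toks) / len(sents),      # average sentence length (|tokens| / |sentences|)
    min(len(s) for s in sents),  # min sentence length
    max(len(s) for s in sents),  # max sentence length
)
assert len(stats) == 9           # matches the (1, 9) shape
```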
 
- class zensols.deepnlp.vectorize.vectorizers.TokenEmbeddingFeatureVectorizer(name, config_factory, feature_id, manager, encode_transformed, delegate_feature_id, size=-1, pad_label=-100, level=TextFeatureType.TOKEN, add_dims=0)[source]¶
- Bases: AggregateEncodableFeatureVectorizer, FeatureDocumentVectorizer

  An AggregateEncodableFeatureVectorizer that is useful for token level classification (i.e. NER). It uses a delegate to first vectorize the features, then concatenates them into one aggregate.

  In shape terms, this takes the single sentence position. The additional unsqueezed dimensions set with n_unsqueeze are useful when the delegate vectorizer encodes booleans or any other value that does not take an additional dimension.

  - Shape:
- (1, |tokens|, <delegate vectorizer shape>[, <unsqueeze dimensions>])
 - DESCRIPTION = 'token aggregate vectorizer'¶
 - __init__(name, config_factory, feature_id, manager, encode_transformed, delegate_feature_id, size=-1, pad_label=-100, level=TextFeatureType.TOKEN, add_dims=0)¶
 - property feature_type: TextFeatureType¶
- The type of feature this vectorizer generates. This is used by classes such as EmbeddingNetworkModule to determine where to add the features, such as concatenating to the embedding layer, join layer, etc.
- level: TextFeatureType = 1¶
- The level at which to take the attribute value, which is document, sentence or token.
 
- class zensols.deepnlp.vectorize.vectorizers.WordEmbeddingFeatureVectorizer(name, config_factory, feature_id, manager, embed_model)[source]¶
- Bases: EncodableFeatureVectorizer

  Vectorizes string tokens into word embedded vectors. This class works directly with the string tokens rather than FeatureDocument instances. It can be useful when there's a need to vectorize tokens outside of a feature document (i.e. cui2vec).

  - DESCRIPTION = 'word embedding encoder'¶
 - FEATURE_TYPE = 4¶
 - __init__(name, config_factory, feature_id, manager, embed_model)¶
- embed_model: WordEmbedModel¶
- The word embedding model that has the string tokens to vector mapping. 
 
Module contents¶
This module vectorizes natural language features into PyTorch tensors.