zensols.deepnlp.transformer package¶
Submodules¶
zensols.deepnlp.transformer.domain module¶
Container classes for BERT models.
- class zensols.deepnlp.transformer.domain.TokenizedDocument(tensor, boundary_tokens)[source]¶
- Bases: PersistableContainer, Writable

  This is the tokenized document output of TransformerDocumentTokenizer. Instances of this class are picklable in a feature context, and are then given to the decoding phase to create a tensor with a transformer model such as TransformerEmbedding.

  __init__(tensor, boundary_tokens)¶
 - property attention_mask: Tensor¶
- The attention mask (0/1s). 
 - classmethod from_tensor(tensor)[source]¶
- Create an instance of the class using a tensor. This is useful for re-creating documents for mapping with map_word_pieces() after unpickling from a document created with TransformerDocumentTokenizer.tokenize.
  Parameters:
- tensor (Tensor) – the tensor to set in tensor
- Return type:
  TokenizedDocument
 - get_wordpiece_count(**kwargs)[source]¶
- The size of the document (sum over sentences) in number of word pieces. To keep special tokens (such as BERT's [CLS] and [SEP] tokens) when passing in a tokenizer in kwargs, add special_tokens={}.
  Parameters:
- kwargs – any keyword arguments passed on to map_to_word_pieces(), except do not add index_tokens or includes
- Return type:
  int
 - property input_ids: Tensor¶
- The token IDs as the output from the tokenizer. 
 - map_to_word_pieces(sentences=None, map_wp=None, add_indices=False, special_tokens=None, index_tokens=True, includes=frozenset({'map'}))[source]¶
- Map word piece tokens to linguistic tokens.
  Parameters:
- sentences (Iterable[List[Any]]) – an iteration of sentences, which is returned in the output (i.e. FeatureSentence), or input_ids if None
- map_wp (Union[Callable, Dict[int, str]]) – either a function that takes the token index, sentence ID and input IDs, or the mapping from word piece ID to string token; the return output is the string token (or numerical output if no mapping is provided); if an instance of TransformerDocumentTokenizer, its vocabulary and special tokens are utilized for mapping and special token consideration
- add_indices (bool) – whether to add the token ID and index after the token string when id2tok is provided for map_wp
- special_tokens (Set[str]) – a list of tokens (such as BERT's [CLS] and [SEP] tokens) to remove; to keep special tokens when passing in a tokenizer in kwargs, add special_tokens={}
- index_tokens (bool) – whether to index tokens positionally, which is used for mapping with feature or tokenized sentences; set this to False when sentences are anything but a feature document / sentences
- includes (Set[str]) – what data to return, which is a set of the keys listed in the return documentation below
 
- Return type:
- Returns:
- a list of sentence maps, each with:
  - sent_ix -> the i-th sentence (always provided)
  - map -> list of (sentence 'token', word pieces)
  - sent -> a FeatureSentence or a tensor of vocab indexes if map_wp is None
  - word_pieces -> the word pieces of the sentences
 
 
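To make the encode/decode life cycle above concrete, here is a minimal sketch (illustrative, not from the package documentation); it assumes tokenizer is a configured TransformerDocumentTokenizer and doc is a parsed FeatureDocument:

```python
import pickle
from zensols.deepnlp.transformer import TokenizedDocument

# `tokenizer`: a configured TransformerDocumentTokenizer
# `doc`: a parsed FeatureDocument
tdoc = tokenizer.tokenize(doc)

# the tensor attribute is what survives pickling in a feature context
tensor = pickle.loads(pickle.dumps(tdoc.tensor))

# re-create a detached document from the tensor in the decode phase
detached = TokenizedDocument.from_tensor(tensor)

# map word pieces back to linguistic tokens; passing the tokenizer as
# map_wp uses its vocabulary and handles its special tokens
for sent_map in detached.map_to_word_pieces(sentences=doc, map_wp=tokenizer):
    for tok, word_pieces in sent_map['map']:
        print(tok, word_pieces)
```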
 - static map_word_pieces(token_offsets)[source]¶
- Map word piece tokens to linguistic tokens. - Return type:
- List[Tuple[FeatureToken, List[int]]]
- Returns:
- a list of tuples in the form: - (<token index>, <list of word piece indexes>)
 
 - property offsets: Tensor¶
- The offsets from word piece (transformer’s tokenizer) to feature document index mapping. 
 - property shape: Size¶
- Return the shape of the vectorized document. 
- tensor: Tensor¶
- Encodes the input IDs, attention mask, and word piece offset map. 
 - property token_type_ids: Tensor¶
- The token type IDs (0/1s). 
 - truncate(size)[source]¶
- Truncate the last (token) dimension to size.
  Return type:
  TokenizedDocument
- Returns:
- a new instance of this class truncated to size 
 
 - write(depth=0, writer=<_io.TextIOWrapper name='<stdout>' mode='w' encoding='utf-8'>, include_tokens=True, id2tok=None)[source]¶
- Write the contents of this instance to writer using indentation depth.
  Parameters:
- depth ( - int) – the starting indentation depth
- writer ( - TextIOBase) – the writer to dump the content of this writable
 
 
 
- class zensols.deepnlp.transformer.domain.TokenizedFeatureDocument(tensor, boundary_tokens, feature, id2tok, char_offsets)[source]¶
- Bases: TokenizedDocument

  Instances of this class are created, then a picklable version is returned with detach() as an instance of the superclass.

  __init__(tensor, boundary_tokens, feature, id2tok, char_offsets)¶
- feature: FeatureDocument¶
- The document to tokenize. 
- id2tok: Dict[int, str]¶
- If provided, a mapping of indexes to transformer tokens. This attribute is always nulled out after being persisted. 
 - map_to_word_pieces(sentences=None, map_wp=None, **kwargs)[source]¶
- Map word piece tokens to linguistic tokens.
  Parameters:
- sentences (Iterable[List[Any]]) – an iteration of sentences, which is returned in the output (i.e. FeatureSentence), or input_ids if None
- map_wp (Union[Callable, Dict[int, str]]) – either a function that takes the token index, sentence ID and input IDs, or the mapping from word piece ID to string token; the return output is the string token (or numerical output if no mapping is provided); if an instance of TransformerDocumentTokenizer, its vocabulary and special tokens are utilized for mapping and special token consideration
- add_indices – whether to add the token ID and index after the token string when id2tok is provided for map_wp
- special_tokens – a list of tokens (such as BERT's [CLS] and [SEP] tokens) to remove; to keep special tokens when passing in a tokenizer in kwargs, add special_tokens={}
- index_tokens – whether to index tokens positionally, which is used for mapping with feature or tokenized sentences; set this to False when sentences are anything but a feature document / sentences
- includes – what data to return, which is a set of the keys listed in the return documentation below
 
- Return type:
- Returns:
- a list of sentence maps, each with:
  - sent_ix -> the i-th sentence (always provided)
  - map -> list of (sentence 'token', word pieces)
  - sent -> a FeatureSentence or a tensor of vocab indexes if map_wp is None
  - word_pieces -> the word pieces of the sentences
 
 
 - write(depth=0, writer=<_io.TextIOWrapper name='<stdout>' mode='w' encoding='utf-8'>, include_tokens=True, id2tok=None)[source]¶
- Write the contents of this instance to writer using indentation depth.
  Parameters:
- depth ( - int) – the starting indentation depth
- writer ( - TextIOBase) – the writer to dump the content of this writable
 
 
 
zensols.deepnlp.transformer.embed module¶
The embedding model object.
- class zensols.deepnlp.transformer.embed.TransformerEmbedding(name, tokenizer, output='pooler_output', output_attentions=False)[source]¶
- Bases: PersistableContainer, Dictable

  A model for transformer embeddings (such as BERT) that wraps the HuggingFace transformers API.

  __init__(name, tokenizer, output='pooler_output', output_attentions=False)¶
 - property cache¶
- When set to True, cache a global space model using the parameters from the first instance creation.
 - property model: PreTrainedModel¶
- output: str = 'pooler_output'¶
- The output from the huggingface transformer API to return. This is set to one of:
  - LAST_HIDDEN_STATE_OUTPUT: the output embeddings of the last layer with shape: (batch, N sentences, hidden layer dimension)
  - POOLER_OUTPUT: the last layer hidden-state of the first token of the sequence (classification token) further processed by a Linear layer and a Tanh activation function with shape: (batch, hidden layer dimension)
  - ALL_OUTPUT: includes both as a dictionary with corresponding keys
 
 
 - property resource: TransformerResource¶
- The transformer resource containing the model. 
 - tokenize(doc)[source]¶
- Tokenize the feature document, which is used as the input to transform().
  Parameters:
- doc – the document to tokenize
- Return type:
  TokenizedFeatureDocument
- Returns:
- the tokenization of doc
 
- tokenizer: TransformerDocumentTokenizer¶
- The tokenizer used for creating the input for the model. 
 - transform(doc, output=None)[source]¶
- Transform the document into the transformer output.
  Parameters:
- doc – the document to transform
- output (str) – the output from the huggingface transformer API to return (see class docs)
 
- Return type:
- Returns:
- a container object instance with the output, which contains (among other data) last_hidden_state with the output embeddings of the last layer with shape: (batch, N sentences, hidden layer dimension)
 
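A short usage sketch of the tokenize/transform pair (variable names are illustrative; embed is assumed to be a configured TransformerEmbedding):

```python
# `embed`: a configured TransformerEmbedding; `doc`: a FeatureDocument
tdoc = embed.tokenize(doc)

# with output='last_hidden_state' the result holds the output embeddings
# of the last layer: (batch, N sentences, hidden layer dimension)
emb = embed.transform(tdoc, output='last_hidden_state')
print(emb.shape)
```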
 
zensols.deepnlp.transformer.layer module¶
Contains transformer embedding layers.
- class zensols.deepnlp.transformer.layer.TransformerEmbeddingLayer(*args, embed_model, **kwargs)[source]¶
- Bases: EmbeddingLayer

  A transformer (i.e. BERT) embedding layer. This class generates embeddings on a per sentence basis. See the initializer documentation for configuration requirements.

  MODULE_NAME: ClassVar[str] = 'transformer embedding'¶
- The module name used in the logging message. This is set in each inherited class. 
 - __init__(*args, embed_model, **kwargs)[source]¶
- Initialize with an embedding model. This embedding model must be configured with TransformerEmbedding.output set to last_hidden_state.
  Parameters:
- embed_model (TransformerEmbedding) – used to generate the transformer (i.e. BERT) embeddings
 
 - forward(x)[source]¶
- Define the computation performed at every call. Should be overridden by all subclasses.
  Return type: Tensor

  Note

  Although the recipe for the forward pass needs to be defined within this function, one should call the Module instance afterwards instead of this since the former takes care of running the registered hooks while the latter silently ignores them.
 
- class zensols.deepnlp.transformer.layer.TransformerSequence(net_settings, sub_logger=None)[source]¶
- Bases: EmbeddingNetworkModule, SequenceNetworkModule

  A sequence based model for token classification that uses HuggingFace transformers layers (not their token classification API).

  MODULE_NAME: ClassVar[str] = 'transformer sequence'¶
- The module name used in the logging message. This is set in each inherited class. 
 - __init__(net_settings, sub_logger=None)[source]¶
- Initialize the embedding layer.
  Parameters:
- net_settings (TransformerSequenceNetworkSettings) – the embedding layer configuration
- logger – the logger to use for the forward process in this layer
- filter_attrib_fn – if provided, called with a BatchFieldMetadata for each field, returning True if the batch field should be retained and used in the embedding layer (see class docs); if None all fields are considered
 
 
 
- class zensols.deepnlp.transformer.layer.TransformerSequenceNetworkSettings(name, config_factory, torch_config, dropout, batch_stash, embedding_layer, decoder_settings)[source]¶
- Bases: EmbeddingNetworkSettings, DropoutNetworkSettings

  Settings configuration for TransformerSequence.

  __init__(name, config_factory, torch_config, dropout, batch_stash, embedding_layer, decoder_settings)¶
- decoder_settings: DeepLinearNetworkSettings¶
- The decoder feed forward network. 
 - get_module_class_name()[source]¶
- Returns the fully qualified class name of the module to create by ModelManager. This module takes as the first parameter an instance of this class.

  Important: This method is not used for nested modules. You must declare specific class names in the configuration for those nested class names.

  Return type:
  str
 
 
zensols.deepnlp.transformer.mask module¶
Classes to predict fill-mask tasks.
- class zensols.deepnlp.transformer.mask.MaskFiller(resource, k=1, feature_id='norm', feature_value='MASK')[source]¶
- Bases: object

  This class fills masked tokens with the predictions of the underlying masked language model. Masked tokens with attribute feature_id having value feature_value (norm and MASK by default, respectively) are substituted with model predictions.

  To use this class, parse a sentence with a FeatureDocumentParser with masked tokens using the string feature_value.

  For example (with class defaults), the sentence:

  Paris is the MASK of France.

  becomes:

  Paris is the <mask> of France.

  The <mask> string becomes the mask_token for the model's tokenizer.

  __init__(resource, k=1, feature_id='norm', feature_value='MASK')¶
- feature_value: str = 'MASK'¶
- The value of the feature ID feature_id to match on masked tokens.
- k: int = 1¶
- The number of top-k predicted masked words per mask. The total number of predictions will be <number of masks> × k in the source document.
 - predict(source)[source]¶
- Predict substitution values for the token masks.

  Important: source is modified as a side-effect of this method. Use clone() on the source document passed to this method to preserve the original if necessary.

  Parameters:
- source (TokenContainer) – the source document, sentence, or span for which to substitute values
- Return type:
  Prediction
 
- resource: TransformerResource¶
- A container class with the Huggingface tokenizer and model. 
 
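A sketch of the prediction flow described above; doc_parser and filler stand in for configured FeatureDocumentParser and MaskFiller instances:

```python
# parse a sentence containing the (default) mask placeholder token
doc = doc_parser.parse('Paris is the MASK of France.')

# predict() substitutes the masked tokens; note it modifies `doc` in
# place, so use doc.clone() first to preserve the original
pred = filler.predict(doc)

# the returned Prediction is iterable over the top-k predicted sentences
for sent in pred:
    print(sent)
```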
- class zensols.deepnlp.transformer.mask.Prediction(cont, masked_tokens, df)[source]¶
- Bases: Dictable

  A container class for masked token predictions produced by MaskFiller. This class offers many ways to get the predictions, including getting the sentences as instances of TokenContainer by using it as an iterable.

  The sentences are also available as the pred_sentences key when using asdict().

  __init__(cont, masked_tokens, df)¶
- cont: TokenContainer¶
- The document, sentence or span to predict masked tokens. 
- df: DataFrame¶
- The predictions with dataframe columns:
  - k: the k in the top-k highest scored masked token match
  - mask_id: the N-th masked token in the source ordered by position
  - token: the predicted token
  - score: the score of the prediction ([0, 1], higher is better)
 
 - get_container(k=0)[source]¶
- Get the k-th top scored sentence. This method modifies the tokens of the container on each invocation, but a client may call it as many times as necessary (i.e. for multiple values of k) since the cont tokens are modified while retaining the original masked tokens in masked_tokens.

  Parameters:
- k (int) – as k increases, the mask substitutions (and thus the sentence) become less likely; k = 0 is the most likely given the sentence and masks
- Return type:
 
 - get_tokens()[source]¶
- Return an iterable of the prediction coupled with the token it belongs to and its score. - Return type:
 
 - property masked_token_dicts: Tuple[Dict[str, Any]]¶
- A tuple of builtins.dict each having token index, norm and text data.
- masked_tokens: Tuple[FeatureToken]¶
- The masked tokens matched. 
 - write(depth=0, writer=<_io.TextIOWrapper name='<stdout>' mode='w' encoding='utf-8'>, include_masked_tokens=True, include_predicted_tokens=True, include_predicted_sentences=True)[source]¶
- Write this instance as either a Writable or as a Dictable. If the class attribute _DICTABLE_WRITABLE_DESCENDANTS is set to True, then use the write() method on children instead of writing the generated dictionary. Otherwise, write this instance by first creating a dict recursively using asdict(), then formatting the output.

  If the attribute _DICTABLE_WRITE_EXCLUDES is set, those attributes are removed from what is written in the write() method.

  Note that this attribute will need to be set in all descendants in the instance hierarchy since writing the object instance graph is done recursively.

  Parameters:
- depth ( - int) – the starting indentation depth
- writer ( - TextIOBase) – the writer to dump the content of this writable
 
 
 
zensols.deepnlp.transformer.optimizer module¶
Adapts the huggingface transformer weight decay optimizer.
- class zensols.deepnlp.transformer.optimizer.TransformerAdamFactory[source]¶
- Bases: ModelResourceFactory
- class zensols.deepnlp.transformer.optimizer.TransformerSchedulerFactory[source]¶
- Bases: ModelResourceFactory

  Unified API to get any scheduler from its name. This simply calls transformers.get_scheduler() and calculates num_training_steps as epochs * batch_size.

  Documentation taken directly from the get_scheduler function in the transformers source tree.
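For reference, the underlying HuggingFace call the factory wraps looks roughly like the following sketch (not the factory's own API; optimizer, epochs and batch_size are assumed to come from the training configuration):

```python
import transformers

scheduler = transformers.get_scheduler(
    name='linear',            # the scheduler name
    optimizer=optimizer,      # a torch.optim.Optimizer instance
    num_warmup_steps=0,
    # the factory calculates this as epochs * batch_size
    num_training_steps=epochs * batch_size)
```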
zensols.deepnlp.transformer.pred module¶
Predictions output for transformer models.
- class zensols.deepnlp.transformer.pred.TransformerSequencePredictionsDataFrameFactory(source, result, stash, column_names=None, data_point_transform=None, batch_limit=9223372036854775807, epoch_result=None, label_vectorizer_name=None, metric_metadata=None, name=None, embedded_document_attribute=None)[source]¶
- Bases: SequencePredictionsDataFrameFactory

  Like the superclass, but creates predictions for transformer sequence models. By default, transformer input is truncated at the model's max token length (usually 512 word piece tokens). It then truncates the tokens that are added as the text column (configured by default from classify.TokenClassifyModelFacade).

  For all predictions where the sequence exceeded the model's maximum, this class maps the last word piece token output to the respective token in the predictions_dataframe_factory_class instance's transform output.

  __init__(source, result, stash, column_names=None, data_point_transform=None, batch_limit=9223372036854775807, epoch_result=None, label_vectorizer_name=None, metric_metadata=None, name=None, embedded_document_attribute=None)¶
 
zensols.deepnlp.transformer.resource module¶
Provide BERT embeddings on a per sentence level.
- exception zensols.deepnlp.transformer.resource.TransformerError[source]¶
- Bases: DeepLearnError

  Raised for any transformer-specific errors in this and child modules of the parent.
 
- class zensols.deepnlp.transformer.resource.TransformerResource(name, torch_config, model_id, cased=None, trainable=False, args=<factory>, tokenizer_args=<factory>, model_args=<factory>, model_class='transformers.AutoModel', tokenizer_class='transformers.AutoTokenizer', cache=False, cache_dir=None)[source]¶
- Bases: PersistableContainer, Dictable

  A container base class that allows configuration and creates various huggingface models.

  __init__(name, torch_config, model_id, cased=None, trainable=False, args=<factory>, tokenizer_args=<factory>, model_args=<factory>, model_class='transformers.AutoModel', tokenizer_class='transformers.AutoTokenizer', cache=False, cache_dir=None)¶
- args: Dict[str, Any]¶
- Additional arguments to pass to the from_pretrained method for the tokenizer and the model. 
- cache: InitVar = False¶
- When set to True, cache a global space model using the parameters from the first instance creation.
- cased: bool = None¶
- True for case sensitive models, False (default) otherwise. The negated value of this is also used as the do_lower_case parameter in the *.from_pretrained calls to huggingface transformers.
 - property model: PreTrainedModel¶
- model_args: Dict[str, Any]¶
- Additional arguments to pass to the from_pretrained method for the model. 
- model_class: str = 'transformers.AutoModel'¶
- The fully qualified model class used to create models with the from_pretrained static method.
- model_id: str¶
- The ID of the model (i.e. bert-base-uncased). If this is not set, it is derived from the model_name and case.

  Token embedding using TransformerEmbedding has been tested with:
- bert-base-cased
- bert-large-cased
- roberta-base
- distilbert-base-cased
 - See:
 
 - property tokenizer: PreTrainedTokenizer¶
- tokenizer_args: Dict[str, Any]¶
- Additional arguments to pass to the from_pretrained method for the tokenizer. 
- tokenizer_class: str = 'transformers.AutoTokenizer'¶
- The fully qualified tokenizer class used to create tokenizers with the from_pretrained static method.
- torch_config: TorchConfig¶
- The config device used to copy the embedding data. 
 
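In practice these resources are created from the application configuration, but a direct-instantiation sketch (assuming the shown imports resolve in your installation) looks like:

```python
from zensols.deeplearn import TorchConfig
from zensols.deepnlp.transformer import TransformerResource

resource = TransformerResource(
    name='transformer',
    torch_config=TorchConfig(),    # CPU/GPU device configuration
    model_id='bert-base-cased',
    cased=True)

# lazily creates the HuggingFace artifacts with from_pretrained
model = resource.model             # a transformers.PreTrainedModel
tokenizer = resource.tokenizer     # a transformers.PreTrainedTokenizer
```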
zensols.deepnlp.transformer.tokenizer module¶
The tokenizer object.
- class zensols.deepnlp.transformer.tokenizer.TransformerDocumentTokenizer(resource, word_piece_token_length=None, params=None, feature_id='text')[source]¶
- Bases: PersistableContainer

  Creates instances of TokenizedFeatureDocument using a HuggingFace PreTrainedTokenizer.

- DEFAULT_PARAMS: ClassVar[Dict[str, Any]] = {'is_split_into_words': True, 'padding': 'longest', 'return_offsets_mapping': True, 'return_special_tokens_mask': True}¶
- Default parameters for the HuggingFace tokenizer. These get overridden by the tokenizer_kwargs in tokenize() and the processing of the value word_piece_token_length.
 - __init__(resource, word_piece_token_length=None, params=None, feature_id='text')¶
 - property all_special_tokens: Set[str]¶
- Special tokens used by the model (such as BERT's [CLS] and [SEP] tokens).
- feature_id: str = 'text'¶
- The feature ID to use for token string values from FeatureToken.
 - property id2tok: Dict[int, str]¶
- A mapping from the HuggingFace tokenizer's vocabulary to its word piece equivalent.
 - property pretrained_tokenizer: PreTrainedTokenizer¶
- The HuggingFace tokenizer used to create tokenized documents.
- resource: TransformerResource¶
- Contains the model used to create the tokenizer. 
 - tokenize(doc, tokenizer_kwargs=None)[source]¶
- Tokenize a feature document in a form that's easy to inspect and provide to TransformerEmbedding to transform.
  Parameters:
- doc (FeatureDocument) – the document to tokenize
- Return type:
  TokenizedFeatureDocument
 
- word_piece_token_length: int = None¶
- The max number of word piece tokens. The word piece length is always the same or greater in count than linguistic tokens because the word piece algorithm tokenizes on characters.

  If this value is less than 0, then do not fix sentence lengths. If the value is 0, then truncate to the model's longest max length. Otherwise, if this value is None (default), set the length to the model's longest max length using the model's model_max_length value.

  Setting this to 0, making documents multi-length, has the potential of creating token spans longer than the model can tolerate (usually 512 word piece tokens). In these cases, this value must be set to (or lower than) the model's model_max_length.

  Tokenization padding is on by default.

  See:
 
 
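A sketch tying the tokenizer fields together (resource and doc are assumed from the earlier examples):

```python
from zensols.deepnlp.transformer import TransformerDocumentTokenizer

tokenizer = TransformerDocumentTokenizer(
    resource=resource,
    word_piece_token_length=0)    # truncate to the model's max length

tdoc = tokenizer.tokenize(doc)    # a TokenizedFeatureDocument
print(tdoc.shape)                 # shape of the vectorized document
print(tokenizer.id2tok[101])      # e.g. '[CLS]' in BERT vocabularies
```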
zensols.deepnlp.transformer.vectorizers module¶
Contains classes that are used to vectorize documents into transformer embeddings.
- class zensols.deepnlp.transformer.vectorizers.DocumentEmbeddingFeatureVectorizer(name, config_factory, feature_id, manager, encode_transformed, fold_method, embed_model, decode_embedding=False, is_labeler=False, encode_tokenized=False, token_pattern='{norm}', token_feature_ids=frozenset({'norm'}))[source]¶
- Bases: TransformerEmbeddingFeatureVectorizer

  Vectorizes a feature from each token as a single sentence document. It does this by tracking the sentence and token positions that have tokens with the necessary features to create what become the sentences to parse and vectorize. During decoding, each pooled sentence's embedding is added to the respective position in the returned data.

  DESCRIPTION: ClassVar[str] = 'transformer document embedding'¶
 - FEATURE_TYPE: ClassVar[TextFeatureType] = 1¶
 - __init__(name, config_factory, feature_id, manager, encode_transformed, fold_method, embed_model, decode_embedding=False, is_labeler=False, encode_tokenized=False, token_pattern='{norm}', token_feature_ids=frozenset({'norm'}))¶
 - token_feature_ids: Set[str] = frozenset({'norm'})¶
- The feature IDs used in token_pattern.
 - token_pattern: str = '{norm}'¶
- The builtins.str.format() string used to format the sentence to be parsed and vectorized.
 
- class zensols.deepnlp.transformer.vectorizers.DocumentMappedTransformerFeatureContext(feature_id, document, sent_len, pos)[source]¶
- Bases: TransformerFeatureContext
- class zensols.deepnlp.transformer.vectorizers.LabelTransformerFeatureVectorizer(name, config_factory, feature_id, manager, encode_transformed, fold_method, embed_model, decode_embedding=False, is_labeler=True, encode_tokenized=False)[source]¶
- Bases: TransformerFeatureVectorizer

  A base class for vectorizing by mapping tokens to transformer consumable word piece tokens. This includes creating labels and masks.

  Shape:
 - FEATURE_TYPE: ClassVar[TextFeatureType] = 1¶
 - __init__(name, config_factory, feature_id, manager, encode_transformed, fold_method, embed_model, decode_embedding=False, is_labeler=True, encode_tokenized=False)¶
 - is_labeler: bool = True¶
- If True, make this a labeling specific vectorizer. Otherwise, certain layers will use the output of the vectorizer as features rather than the labels.
 
- class zensols.deepnlp.transformer.vectorizers.TransformerEmbeddingFeatureVectorizer(name, config_factory, feature_id, manager, encode_transformed, fold_method, embed_model, decode_embedding=False, is_labeler=False, encode_tokenized=False)[source]¶
- Bases: TransformerFeatureVectorizer

  A feature vectorizer used to create transformer (i.e. BERT) embeddings. The class uses the embed_model, which is of type TransformerEmbedding.

  Note that the encoding input ideally consists of sentences shorter than 512 tokens. However, this vectorizer can accommodate both FeatureSentence and FeatureDocument instances.

  DESCRIPTION: ClassVar[str] = 'transformer document embedding'¶
 - FEATURE_TYPE: ClassVar[TextFeatureType] = 4¶
 - __init__(name, config_factory, feature_id, manager, encode_transformed, fold_method, embed_model, decode_embedding=False, is_labeler=False, encode_tokenized=False)¶
 
- class zensols.deepnlp.transformer.vectorizers.TransformerExpanderFeatureContext(feature_id, contexts, document)[source]¶
- Bases: TransformerFeatureContext

  A vectorizer feature context used with TransformerExpanderFeatureVectorizer.

  __init__(feature_id, contexts, document)[source]¶
- Parameters:
  - feature_id – the feature ID used to identify this context
  - contexts – subordinate contexts given to MultiFeatureContext
  - document – the document used to create the transformer embeddings
 
- contexts: Tuple[FeatureContext]¶
- The subordinate contexts. 
 
- class zensols.deepnlp.transformer.vectorizers.TransformerExpanderFeatureVectorizer(name, config_factory, feature_id, manager, encode_transformed, fold_method, embed_model, decode_embedding=False, is_labeler=False, encode_tokenized=False, delegate_feature_ids=None)[source]¶
- Bases: TransformerFeatureVectorizer

  A vectorizer that expands linguistic feature vectors to their respective locations as word piece token vectors.

  This is used to concatenate linguistic features with BERT (and other transformer) embeddings. Each linguistic token is copied in the word piece token location across all vectorizers and sentences.

  Shape:
- (-1, token length, X), where X is the sum of all the delegate shapes across all three dimensions
 - DESCRIPTION: ClassVar[str] = 'transformer expander'¶
 - FEATURE_TYPE: ClassVar[TextFeatureType] = 1¶
 - __init__(name, config_factory, feature_id, manager, encode_transformed, fold_method, embed_model, decode_embedding=False, is_labeler=False, encode_tokenized=False, delegate_feature_ids=None)¶
 - delegate_feature_ids: Tuple[str] = None¶
- A list of feature IDs of vectorizers whose output will be expanded. 
 - property delegates: EncodableFeatureVectorizer¶
- The delegates used for encoding and decoding the linguistic features.
 
- class zensols.deepnlp.transformer.vectorizers.TransformerFeatureContext(feature_id, document)[source]¶
- Bases: FeatureContext, Deallocatable

  A vectorizer feature context used with TransformerEmbeddingFeatureVectorizer.

  __init__(feature_id, document)[source]¶
- Parameters:
  - feature_id – the feature ID used to identify this context
  - document – the document used to create the transformer embeddings
 
 
- class zensols.deepnlp.transformer.vectorizers.TransformerFeatureVectorizer(name, config_factory, feature_id, manager, encode_transformed, fold_method, embed_model, decode_embedding=False, is_labeler=False, encode_tokenized=False)[source]¶
- Bases: EmbeddingFeatureVectorizer, FeatureDocumentVectorizer

  Base class for classes that vectorize transformer models. This class also tokenizes documents.

  __init__(name, config_factory, feature_id, manager, encode_transformed, fold_method, embed_model, decode_embedding=False, is_labeler=False, encode_tokenized=False)¶
 - encode_tokenized: bool = False¶
- Whether to tokenize the document on encoding. Set this to True only if the huggingface model ID (i.e. bert-base-cased) will not change after vectorization/batching.

  Setting this to True tells the vectorizer to tokenize during encoding, and thus will speed experimentation by providing the tokenized tensors to the model directly.
 - property feature_type: TextFeatureType¶
- The type of feature this vectorizer generates. This is used by classes such as EmbeddingNetworkModule to determine where to add the features, such as concatenating to the embedding layer, join layer, etc.
 - is_labeler: bool = False¶
- If True, make this a labeling specific vectorizer. Otherwise, certain layers will use the output of the vectorizer as features rather than the labels.
 - tokenize(doc)[source]¶
- Tokenize the document into a token document used by the encoding phase.
  Parameters:
- doc (FeatureDocument) – the document to be tokenized
- Return type:
 
 
- class zensols.deepnlp.transformer.vectorizers.TransformerMaskFeatureVectorizer(name, config_factory, feature_id, manager, encode_transformed, fold_method, embed_model, decode_embedding=False, is_labeler=True, encode_tokenized=False, data_type='bool')[source]¶
- Bases: LabelTransformerFeatureVectorizer

  Creates a mask with word piece tokens set to True, and special tokens and padding set to False. This maps tokens to word piece tokens like TransformerNominalFeatureVectorizer.

  Shape:
 - DESCRIPTION: ClassVar[str] = 'transformer mask'¶
 - __init__(name, config_factory, feature_id, manager, encode_transformed, fold_method, embed_model, decode_embedding=False, is_labeler=True, encode_tokenized=False, data_type='bool')¶
 - data_type: Union[str, None, torch.dtype] = 'bool'¶
- The mask tensor type. To use the int type that matches the resolution of the manager's torch_config, use DEFAULT_INT.
 
- class zensols.deepnlp.transformer.vectorizers.TransformerNominalFeatureVectorizer(name, config_factory, feature_id, manager, encode_transformed, fold_method, embed_model, decode_embedding=False, is_labeler=True, encode_tokenized=False, delegate_feature_id=None, size=-1, pad_label=-100, label_all_tokens=False, annotations_attribute='annotations')[source]¶
- Bases: AggregateEncodableFeatureVectorizer, LabelTransformerFeatureVectorizer

  This creates word piece labels that map to tokens. This class uses a NominalEncodedEncodableFeatureVectorizer to map from string labels to their nominal long values. This allows a single instance and centralized location where the label mapping happens in case other (non-transformer) components need to vectorize labels.

  Shape:
 - DESCRIPTION: ClassVar[str] = 'transformer seq labeler'¶
 - __init__(name, config_factory, feature_id, manager, encode_transformed, fold_method, embed_model, decode_embedding=False, is_labeler=True, encode_tokenized=False, delegate_feature_id=None, size=-1, pad_label=-100, label_all_tokens=False, annotations_attribute='annotations')¶
 - annotations_attribute: str = 'annotations'¶
- The attribute used to get the features from the FeatureSentence. For example, TokenAnnotatedFeatureSentence has an annotations attribute.
 - delegate_feature_id: str = None¶
- The feature ID for the aggregate encodeable feature vectorizer. 
 - label_all_tokens: bool = False¶
- If True, label all word piece tokens with the corresponding linguistic token label. Otherwise, the default padded value is used and thus ignored by the loss function when calculating loss.
 - write(depth=0, writer=<_io.TextIOWrapper name='<stdout>' mode='w' encoding='utf-8'>)[source]¶
- Write this instance as either a Writable or as a Dictable. If the class attribute _DICTABLE_WRITABLE_DESCENDANTS is set to True, then use the write() method on children instead of writing the generated dictionary. Otherwise, write this instance by first creating a dict recursively using asdict(), then formatting the output.

  If the attribute _DICTABLE_WRITE_EXCLUDES is set, those attributes are removed from what is written in the write() method.

  Note that this attribute will need to be set in all descendants in the instance hierarchy since writing the object instance graph is done recursively.

  Parameters:
- depth ( - int) – the starting indentation depth
- writer ( - TextIOBase) – the writer to dump the content of this writable
 
 
 
zensols.deepnlp.transformer.wordpiece module¶
Word piece mappings to feature tokens, sentences and documents.
There are often edge cases and tricky situations with certain models' usage of special tokens (i.e. [CLS]) and where they are used. With this in mind, this module attempts to:
- Assist in debugging (works with a detached TokenizedDocument) in cases where token level embeddings are directly accessed, and
- Map both token and sentence level embeddings to their respective originating natural language feature set data structures.
- class zensols.deepnlp.transformer.wordpiece.CachingWordPieceFeatureDocumentFactory(tokenizer, embed_model, token_embeddings=True, sent_embeddings=True, stash=None, hasher=<factory>)[source]¶
- Bases: WordPieceFeatureDocumentFactory

  Caches the documents and their embeddings in a Stash. For those that are cached, the embeddings are copied over to the passed document in create().

  __init__(tokenizer, embed_model, token_embeddings=True, sent_embeddings=True, stash=None, hasher=<factory>)¶
 - create(fdoc, tdoc=None)[source]¶
- Create a document as an object graph that relates word pieces to feature tokens. Note that if tdoc is provided, it must have been tokenized from fdoc.
  Parameters:
- fdoc (FeatureDocument) – the feature document used to create tdoc
- tdoc (TokenizedFeatureDocument) – a tokenized feature document generated by tokenize()
 
- Return type:
- Returns:
- a data structure with the word piece information 
 
 
- class zensols.deepnlp.transformer.wordpiece.WordPiece(word, vocab_index, index)[source]¶
- Bases: PersistableContainer, Dictable

  The word piece data.

  __init__(word, vocab_index, index)¶
- index: int¶
- The index of the word piece subword in the tokenization tensor, which will have the same index in the output embeddings for TransformerEmbedding.output = last_hidden_state.
 
- class zensols.deepnlp.transformer.wordpiece.WordPieceDocumentDecorator(word_piece_doc_factory)[source]¶
- Bases: FeatureDocumentDecorator

  Populates sentence and token embeddings in the documents.

  __init__(word_piece_doc_factory)¶
- word_piece_doc_factory: WordPieceFeatureDocumentFactory¶
- The feature document factory that populates embeddings. 
 
- class zensols.deepnlp.transformer.wordpiece.WordPieceFeatureDocument(sents, text=None, spacy_doc=None, tokenized=None)[source]¶
- Bases: FeatureDocument, WordPieceTokenContainer

  A document made up of word piece sentences.

  __init__(sents, text=None, spacy_doc=None, tokenized=None)¶
 - property embedding: Tensor¶
- The document embedding (see WordPieceFeatureSpan.embedding).
  Shape:
- (|sentences|, <embedding dimension>) 
 
 - tokenized: TokenizedFeatureDocument = None¶
- The tokenized feature document. 
 - write(depth=0, writer=<_io.TextIOWrapper name='<stdout>' mode='w' encoding='utf-8'>)[source]¶
- Write the document and optionally sentence features. - Parameters:
- n_sents – the number of sentences to write 
- n_tokens – the number of tokens to print across all sentences 
- include_original – whether to include the original text 
- include_normalized – whether to include the normalized text 
 
 
 
- class zensols.deepnlp.transformer.wordpiece.WordPieceFeatureDocumentFactory(tokenizer, embed_model, token_embeddings=True, sent_embeddings=True)[source]¶
- Bases: object

  Create instances of WordPieceFeatureDocument from FeatureDocument instances. It does this by iterating through a feature document data structure and adding WordPiece* object data, and optionally adding the corresponding sentence and/or token level embeddings.

  The embeddings can also be added with add_token_embeddings() and add_sent_embeddings() individually. If all you want are the sentence level embeddings, you can use add_sent_embeddings() on a FeatureSentence instance.

  __init__(tokenizer, embed_model, token_embeddings=True, sent_embeddings=True)¶
 - add_sent_embeddings(doc, arr)[source]¶
- Add sentence embeddings to the sentences of doc.
  Parameters:
- doc (Union[WordPieceFeatureDocument, FeatureDocument]) – sentences of this doc have embeddings set to the corresponding sentence tensor with shape (1, <embedding dimension>)
 
 - add_token_embeddings(doc, arr)[source]¶
- Add token embeddings to the sentences of doc. This assumes tokens are of type WordPieceFeatureToken since the token indices are needed.
  Parameters:
- doc (WordPieceFeatureDocument) – sentences of this doc have embeddings set to the corresponding sentence tensor with shape (1, <embedding dimension>)
 
 - create(fdoc, tdoc=None)[source]¶
- Create a document as an object graph that relates word pieces to feature tokens. Note that if tdoc is provided, it must have been tokenized from fdoc.
  Parameters:
- fdoc (FeatureDocument) – the feature document used to create tdoc
- tdoc (TokenizedFeatureDocument) – a tokenized feature document generated by tokenize()
 
- Return type:
- Returns:
- a data structure with the word piece information 
 
- embed_model: TransformerEmbedding¶
- Used to populate the embeddings in WordPiece* classes.
 - populate(doc, truncate=False)[source]¶
- Populate sentence embeddings in a document by first feature parsing a new document with create() and then copying the embeddings with WordPieceFeatureDocument.copy_embeddings().
  Parameters:
- truncate (bool) – if sentence lengths differ (i.e. from using different models to chunk sentences), trim the longer document to match the shorter
 
- tokenizer: TransformerDocumentTokenizer¶
- Used to tokenize documents that aren't already tokenized when given to __call__().
 
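A sketch of creating word piece mapped documents (fac stands in for a configured WordPieceFeatureDocumentFactory):

```python
# `fac`: a configured WordPieceFeatureDocumentFactory
# `doc`: a parsed FeatureDocument
wp_doc = fac.create(doc)            # a WordPieceFeatureDocument

for sent in wp_doc.sents:
    print(sent.embedding.shape)     # (<embedding dimension>,)
    for tok in sent.tokens:
        # each token carries its word pieces as WordPiece instances
        print(tok.norm, [str(wp) for wp in tok.words])
```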
- class zensols.deepnlp.transformer.wordpiece.WordPieceFeatureSentence(tokens, text=None, spacy_span=None, embedding=None)[source]¶
- Bases: WordPieceFeatureSpan, FeatureSentence

  __init__(tokens, text=None, spacy_span=None, embedding=None)¶
 
- class zensols.deepnlp.transformer.wordpiece.WordPieceFeatureSpan(tokens, text=None, spacy_span=None, embedding=None)[source]¶
- Bases: FeatureSentence, WordPieceTokenContainer

  A sentence made up of word pieces.

  __init__(tokens, text=None, spacy_span=None, embedding=None)¶
- embedding: Tensor = None¶
- The sentence level (i.e. [CLS]) embedding from the transformer.
  Shape:
- (<embedding dimension>,) 
 
 - write(depth=0, writer=<_io.TextIOWrapper name='<stdout>' mode='w' encoding='utf-8'>)[source]¶
- Write the text container. - Parameters:
- include_original – whether to include the original text 
- include_normalized – whether to include the normalized text 
- n_tokens – the number of tokens to write 
- inline – whether to print the tokens on one line each 
 
 
 
- class zensols.deepnlp.transformer.wordpiece.WordPieceFeatureToken(i, idx, i_sent, norm, lexspan, words, embedding=None)[source]¶
- Bases: FeatureToken

  The token and the word pieces that represent it.

  __init__(i, idx, i_sent, norm, lexspan, words, embedding=None)¶
 - clone(cls=None, **kwargs)[source]¶
- Clone an instance of this token.
  Parameters:
- cls (Type) – the type of the new instance
- kwargs – arguments to add as attributes to the clone
 
- Return type:
- Returns:
- the cloned instance of this instance 
 
 - detach(*args, **kwargs)[source]¶
- Create a detached token (i.e. from spaCy artifacts).
  Parameters:
- feature_ids – the features to write, which defaults to FEATURE_IDS
- skip_missing – whether to only keep feature_ids
- cls – the type of the new instance
 
- Return type:
 
 - embedding: Tensor = None¶
- The embedding for words after using the transformer.
  Shape:
- (|words|, <embedding dimension>) 
 
 - property indexes: Tuple[int]¶
- The indexes of the word piece subwords (see WordPiece.index).
 - property token_embedding: Tensor¶
- The embedding of this token, which is the sum of the word piece embeddings. 
 - words: Tuple[WordPiece]¶
- The word pieces that make up this token. 
 - write(depth=0, writer=<_io.TextIOWrapper name='<stdout>' mode='w' encoding='utf-8'>)[source]¶
- Write this instance as either a Writable or as a Dictable. If the class attribute _DICTABLE_WRITABLE_DESCENDANTS is set to True, then use the write() method on children instead of writing the generated dictionary. Otherwise, write this instance by first creating a dict recursively using asdict(), then formatting the output.

  If the attribute _DICTABLE_WRITE_EXCLUDES is set, those attributes are removed from what is written in the write() method.

  Note that this attribute will need to be set in all descendants in the instance hierarchy since writing the object instance graph is done recursively.

  Parameters:
- depth ( - int) – the starting indentation depth
- writer ( - TextIOBase) – the writer to dump the content of this writable
 
 
 
- class zensols.deepnlp.transformer.wordpiece.WordPieceFeatureVectorizer(name, config_factory, feature_id, manager, encode_transformed=False, fold_method='raise', embed_model=None, decode_embedding=True, word_piece_doc_factory=None, access='raise')[source]¶
- Bases: EmbeddingFeatureVectorizer

  Uses the embeddings attributes added to documents, sentences and tokens populated by WordPieceFeatureDocumentFactory. Currently only sentence sequences are supported. For single sentence or token classification, use zensols.deepnlp.vectorizers.

  If aggregated documents are given to the vectorizer, they are flattened into sentences and vectorized the same way a single document's sentences would be vectorized. A batch is created for each document, and only one batch is created for singleton documents.

  This embedding layer expects the following attribute settings to be left with the defaults: encode_transformed, fold_method, decode_embedding.

  Shape:
 - DESCRIPTION: ClassVar[str] = 'wordpiece'¶
 - FEATURE_TYPE: ClassVar[TextFeatureType] = 4¶
 - __init__(name, config_factory, feature_id, manager, encode_transformed=False, fold_method='raise', embed_model=None, decode_embedding=True, word_piece_doc_factory=None, access='raise')¶
 - access: str = 'raise'¶
- What to do when accessing the sentence embedding when encoding. This is one of:
  - raise: raises an error when missing
  - add_missing: create the embedding only if missing
  - clobber: always create a new embedding, replacing any that existed
 
 - decode_embedding: bool = True¶
- Turn off the embed_model forward pass to use the embeddings vectorized from the embedding attribute(s). Keep the default.
 - embed_model: TransformerEmbedding = None¶
- This field is not applicable to this vectorizer; keep the default.
 - encode(doc)[source]¶
- Encode by combining documents into one monolithic document when a tuple is passed; otherwise, default to the superclass's encode functionality.
  Return type:
 
 - encode_transformed: bool = False¶
- This field is not applicable to this vectorizer; keep the default.
 - fold_method: str = 'raise'¶
- This field is not applicable to this vectorizer; keep the default.
 - word_piece_doc_factory: WordPieceFeatureDocumentFactory = None¶
- The feature document factory that populates embeddings. 
 
- class zensols.deepnlp.transformer.wordpiece.WordPieceTokenContainer[source]¶
- Bases: TokenContainer

  Like TokenContainer but contains word pieces.
Module contents¶
Contains classes that adapt the huggingface transformers to the Zensols deep learning framework.
- zensols.deepnlp.transformer.normalize_huggingface_logging()[source]¶
- Make the transformers package use default logging. Using this and setting the transformers logging package to ERROR level logging has the same effect as suppress_warnings().