zensols.deepnlp.transformer package#

Submodules#

zensols.deepnlp.transformer.domain#

Inheritance diagram of zensols.deepnlp.transformer.domain

Container classes for BERT models.

class zensols.deepnlp.transformer.domain.TokenizedDocument(tensor, boundary_tokens)[source]#

Bases: PersistableContainer, Writable

This is the tokenized document output of TransformerDocumentTokenizer. Instances of this class are pickleable in a feature context, and are then given to the decoding phase to create a tensor with a transformer model such as TransformerEmbedding.

__init__(tensor, boundary_tokens)#
property attention_mask: Tensor#

The attention mask (0/1s).

boundary_tokens: bool#

Whether the token document has sentence boundary tokens, such as [CLS] for BERT.

deallocate()[source]#

Deallocate all resources for this instance.

detach()[source]#

Return a version of the document that is pickleable.

Return type:

TokenizedDocument

classmethod from_tensor(tensor)[source]#

Create an instance of the class using a tensor. This is useful for re-creating documents for mapping with map_word_pieces() after unpickling from a document created with TransformerDocumentTokenizer.tokenize.

Parameters:

tensor (Tensor) – the tensor to set in tensor

Return type:

TokenizedDocument
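
A minimal sketch of this round trip, assuming tok_doc is a TokenizedDocument created elsewhere by TransformerDocumentTokenizer.tokenize() (the variable is a placeholder for an already configured pipeline):

    import pickle
    from zensols.deepnlp.transformer.domain import TokenizedDocument

    # `tok_doc` is an assumed TokenizedDocument from the encoding phase
    detached: TokenizedDocument = tok_doc.detach()
    blob: bytes = pickle.dumps(detached.tensor)

    # later, in the decoding phase, rebuild the document from the tensor alone
    restored = TokenizedDocument.from_tensor(pickle.loads(blob))
    print(restored.input_ids.shape, restored.attention_mask.shape)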

get_wordpiece_count(**kwargs)[source]#

The size of the document (sum over sentences) in number of word pieces. To keep special tokens (such as BERT’s [CLS] and [SEP] tokens) when passing in a tokenizer in kwargs, add special_tokens={}.

Parameters:

kwargs – any keyword arguments passed on to map_to_word_pieces(), except index_tokens and includes (do not add these)

Return type:

int
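
For example, a hedged sketch where tokenizer is an assumed, already configured TransformerDocumentTokenizer:

    # special tokens removed (the default behavior)
    n_pieces: int = tok_doc.get_wordpiece_count(map_wp=tokenizer)

    # keep special tokens such as [CLS] and [SEP]
    n_with_specials: int = tok_doc.get_wordpiece_count(
        map_wp=tokenizer, special_tokens={})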

property input_ids: Tensor#

The token IDs as the output from the tokenizer.

map_to_word_pieces(sentences=None, map_wp=None, add_indices=False, special_tokens=None, index_tokens=True, includes=frozenset({'map'}))[source]#

Map word piece tokens to linguistic tokens.

Parameters:
  • sentences (Iterable[List[Any]]) – an iteration of sentences, which is returned in the output (i.e. FeatureSentence), or input_ids if None

  • map_wp (Union[Callable, Dict[int, str]]) – either a function that takes the token index, sentence ID and input IDs, or the mapping from word piece ID to string token; return output is the string token (or numerical output if no mapping is provided); if an instance of TransformerDocumentTokenizer, its vocabulary and special tokens are utilized for mapping and special token consideration

  • add_indices (bool) – whether to add the token ID and index after the token string when id2tok is provided for map_wp

  • special_tokens (Set[str]) – a list of tokens (such as BERT’s [CLS] and [SEP] tokens) to remove; to keep special tokens when passing in a tokenizer in kwargs, add special_tokens={}.

  • index_tokens (bool) – whether to index tokens positionally, which is used for mapping with feature or tokenized sentences; set this to False when sentences are anything but a feature document / sentences

  • includes (Set[str]) – what data to return, which is a set of the keys listed in the return documentation below

Return type:

List[Dict[str, Any]]

Returns:

a list of sentence maps, each with:

  • sent_ix -> the ``i``th sentence (always provided)

  • map -> list of (sentence 'token', word pieces)

  • sent -> a FeatureSentence or a tensor of vocab indexes if map_wp is None

  • word_pieces -> the word pieces of the sentences
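
A hedged sketch of this mapping, assuming tokenizer is a configured TransformerDocumentTokenizer and feat_doc is the FeatureDocument from which the tokenized document was created:

    maps = tok_doc.map_to_word_pieces(
        sentences=feat_doc.sents,   # return FeatureSentence instances in 'sent'
        map_wp=tokenizer,           # map word piece IDs to string tokens
        includes={'map', 'sent'})
    for sent_map in maps:
        print(sent_map['sent'])
        for tok, word_pieces in sent_map['map']:
            print(' ', tok, '->', word_pieces)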

static map_word_pieces(token_offsets)[source]#

Map word piece tokens to linguistic tokens.

Return type:

List[Tuple[FeatureToken, List[int]]]

Returns:

a list of tuples in the form:

(<token index>, <list of word piece indexes>)

property offsets: Tensor#

The offsets from word piece (transformer’s tokenizer) to feature document index mapping.

params()[source]#
Return type:

Dict[str, Any]

property shape: Size#

Return the shape of the vectorized document.

tensor: Tensor#

Encodes the input IDs, attention mask, and word piece offset map.

property token_type_ids: Tensor#

The token type IDs (0/1s).

truncate(size)[source]#

Truncate the last (token) dimension to size.

Return type:

TokenizedDocument

Returns:

a new instance of this class truncated to size

write(depth=0, writer=<_io.TextIOWrapper name='<stdout>' mode='w' encoding='utf-8'>, include_tokens=True, id2tok=None)[source]#

Write the contents of this instance to writer using indention depth.

Parameters:
  • depth (int) – the starting indentation depth

  • writer (TextIOBase) – the writer to dump the content of this writable

class zensols.deepnlp.transformer.domain.TokenizedFeatureDocument(tensor, boundary_tokens, feature, id2tok, char_offsets)[source]#

Bases: TokenizedDocument

Instances of this class are created, then a pickleable version is returned with detach() as an instance of the super class.

__init__(tensor, boundary_tokens, feature, id2tok, char_offsets)#
char_offsets: Tuple[Tuple[int, int]]#

The valid character offsets for each word piece token.

detach()[source]#

Return a version of the document that is pickleable.

Return type:

TokenizedDocument

feature: FeatureDocument#

The document to tokenize.

id2tok: Dict[int, str]#

If provided, a mapping of indexes to transformer tokens. This attribute is always nulled out after being persisted.

map_to_word_pieces(sentences=None, map_wp=None, **kwargs)[source]#

Map word piece tokens to linguistic tokens.

Parameters:
  • sentences (Iterable[List[Any]]) – an iteration of sentences, which is returned in the output (i.e. FeatureSentence), or input_ids if None

  • map_wp (Union[Callable, Dict[int, str]]) – either a function that takes the token index, sentence ID and input IDs, or the mapping from word piece ID to string token; return output is the string token (or numerical output if no mapping is provided); if an instance of TransformerDocumentTokenizer, its vocabulary and special tokens are utilized for mapping and special token consideration

  • add_indices – whether to add the token ID and index after the token string when id2tok is provided for map_wp

  • special_tokens – a list of tokens (such as BERT’s [CLS] and [SEP] tokens) to remove; to keep special tokens when passing in a tokenizer in kwargs, add special_tokens={}.

  • index_tokens – whether to index tokens positionally, which is used for mapping with feature or tokenized sentences; set this to False when sentences are anything but a feature document / sentences

  • includes – what data to return, which is a set of the keys listed in the return documentation below

Return type:

List[Dict[str, Any]]

Returns:

a list of sentence maps, each with:

  • sent_ix -> the ``i``th sentence (always provided)

  • map -> list of (sentence 'token', word pieces)

  • sent -> a FeatureSentence or a tensor of vocab indexes if map_wp is None

  • word_pieces -> the word pieces of the sentences

write(depth=0, writer=<_io.TextIOWrapper name='<stdout>' mode='w' encoding='utf-8'>, include_tokens=True, id2tok=None)[source]#

Write the contents of this instance to writer using indention depth.

Parameters:
  • depth (int) – the starting indentation depth

  • writer (TextIOBase) – the writer to dump the content of this writable

zensols.deepnlp.transformer.embed#

Inheritance diagram of zensols.deepnlp.transformer.embed

The tokenizer object.

class zensols.deepnlp.transformer.embed.TransformerEmbedding(name, tokenizer, output='pooler_output', output_attentions=False)[source]#

Bases: PersistableContainer, Dictable

A model for transformer embeddings (such as BERT) that wraps the HuggingFace transformers API.

ALL_OUTPUT: ClassVar[str] = 'all_output'#
LAST_HIDDEN_STATE_OUTPUT: ClassVar[str] = 'last_hidden_state'#
POOLER_OUTPUT: ClassVar[str] = 'pooler_output'#
__init__(name, tokenizer, output='pooler_output', output_attentions=False)#
property cache#

When set to True, cache a global space model using the parameters from the first instance creation.

property model: PreTrainedModel#
name: str#

The name of the embedding as given in the configuration.

output: str = 'pooler_output'#

The output from the huggingface transformer API to return.

This is set to one of:

  • LAST_HIDDEN_STATE_OUTPUT: with the output embeddings of the last layer with shape: (batch, N sentences, hidden layer dimension)

  • POOLER_OUTPUT: the last layer hidden-state of the first token of the sequence (classification token) further processed by a Linear layer and a Tanh activation function with shape: (batch, hidden layer dimension)

  • ALL_OUTPUT: includes both as a dictionary with corresponding keys

output_attentions: bool = False#

Whether or not to output the attention layer.

property resource: TransformerResource#

The transformer resource containing the model.

tokenize(doc)[source]#

Tokenize the feature document, which is used as the input to transform().

Parameters:

doc – the document to tokenize

Return type:

TokenizedFeatureDocument

Returns:

the tokenization of doc

tokenizer: TransformerDocumentTokenizer#

The tokenizer used for creating the input for the model.

property trainable: bool#

Whether or not the model is trainable or frozen.

transform(doc, output=None)[source]#

Transform the document into the transformer output.

Parameters:
  • doc – the tokenized document to transform (see tokenize())

  • output (str) – the output from the huggingface transformer API to return (see class docs)

Return type:

Union[Tensor, Dict[str, Tensor]]

Returns:

a container object instance with the output, which contains (among other data) last_hidden_state with the output embeddings of the last layer with shape: (batch, N sentences, hidden layer dimension)
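
A hedged usage sketch; embedding is an assumed, configured TransformerEmbedding and feat_doc an assumed parsed FeatureDocument:

    from torch import Tensor
    from zensols.deepnlp.transformer.domain import TokenizedFeatureDocument

    tok_doc: TokenizedFeatureDocument = embedding.tokenize(feat_doc)
    emb: Tensor = embedding.transform(tok_doc, output='last_hidden_state')
    # per the class docs: (batch, N sentences, hidden layer dimension)
    print(emb.shape, embedding.vector_dimension)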

property vector_dimension: int#

Return the output embedding dimension of the final layer.

zensols.deepnlp.transformer.layer#

Inheritance diagram of zensols.deepnlp.transformer.layer

Contains transformer embedding layers.

class zensols.deepnlp.transformer.layer.TransformerEmbeddingLayer(*args, embed_model, **kwargs)[source]#

Bases: EmbeddingLayer

A transformer (i.e. BERT) embedding layer. This class generates embeddings on a per sentence basis. See the initializer documentation for configuration requirements.

MODULE_NAME: ClassVar[str] = 'transformer embedding'#

The module name used in the logging message. This is set in each inherited class.

__init__(*args, embed_model, **kwargs)[source]#

Initialize with an embedding model. This embedding model must be configured with TransformerEmbedding.output set to last_hidden_state.

Parameters:

embed_model (TransformerEmbedding) – used to generate the transformer (i.e. BERT) embeddings

deallocate()[source]#

Deallocate all resources for this instance.

forward(x)[source]#

Defines the computation performed at every call.

Should be overridden by all subclasses.

Return type:

Tensor

Note

Although the recipe for forward pass needs to be defined within this function, one should call the Module instance afterwards instead of this since the former takes care of running the registered hooks while the latter silently ignores them.

class zensols.deepnlp.transformer.layer.TransformerSequence(net_settings, sub_logger=None)[source]#

Bases: EmbeddingNetworkModule, SequenceNetworkModule

A sequence based model for token classification using HuggingFace transformers layers (not their token classification API).

MODULE_NAME: ClassVar[str] = 'transformer sequence'#

The module name used in the logging message. This is set in each inherited class.

__init__(net_settings, sub_logger=None)[source]#

Initialize the embedding layer.

Parameters:
  • net_settings (TransformerSequenceNetworkSettings) – the embedding layer configuration

  • logger – the logger to use for the forward process in this layer

  • filter_attrib_fn – if provided, called with a BatchFieldMetadata for each field returning True if the batch field should be retained and used in the embedding layer (see class docs); if None all fields are considered

deallocate()[source]#

Deallocate all resources for this instance.

class zensols.deepnlp.transformer.layer.TransformerSequenceNetworkSettings(name, config_factory, dropout, batch_stash, embedding_layer, decoder_settings)[source]#

Bases: EmbeddingNetworkSettings, DropoutNetworkSettings

Settings configuration for TransformerSequence.

__init__(name, config_factory, dropout, batch_stash, embedding_layer, decoder_settings)#
decoder_settings: DeepLinearNetworkSettings#

The decoder feed forward network.

get_module_class_name()[source]#

Returns the fully qualified class name of the module to create by ModelManager. This module takes as the first parameter an instance of this class.

Important: This method is not used for nested modules. You must declare specific class names in the configuration for those nested class names.

Return type:

str

zensols.deepnlp.transformer.mask#

Inheritance diagram of zensols.deepnlp.transformer.mask

Classes to predict fill-mask tasks.

class zensols.deepnlp.transformer.mask.MaskFiller(resource, k=1, feature_id='norm', feature_value='MASK')[source]#

Bases: object

The class fills masked tokens with the prediction of the underlying masked model. Masked tokens with attribute feature_id having value feature_value (norm and MASK by default respectively) are substituted with model values.

To use this class, parse a sentence with a FeatureDocumentParser with masked tokens using the string feature_value.

For example (with class defaults), the sentence:

Paris is the MASK of France.

becomes:

Paris is the <mask> of France.

The <mask> string becomes the mask_token for the model’s tokenizer.
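
A hedged end-to-end sketch; resource (a TransformerResource configured with a fill-mask capable model) and doc_parser (a FeatureDocumentParser) are assumed to come from the application configuration:

    from zensols.deepnlp.transformer.mask import MaskFiller, Prediction

    filler = MaskFiller(resource=resource, k=3)
    doc = doc_parser.parse('Paris is the MASK of France.')
    pred: Prediction = filler.predict(doc)
    print(pred.get_container(k=0))   # the highest scored substitution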

__init__(resource, k=1, feature_id='norm', feature_value='MASK')#
feature_id: str = 'norm'#

The FeatureToken feature ID to match on masked tokens.

See:

feature_value

feature_value: str = 'MASK'#

The value of feature ID feature_id to match on masked tokens.

k: int = 1#

The number of top K predicted masked words per mask. The total number of predictions will be <number of masks> X k in the source document.

predict(source)[source]#

Predict substitution values for token masks.

Important: source is modified as a side-effect of this method. Use clone() on the source document passed to this method to preserve the original if necessary.

Parameters:

source (TokenContainer) – the source document, sentence, or span for which to substitute values

Return type:

Prediction

resource: TransformerResource#

A container class with the Huggingface tokenizer and model.

class zensols.deepnlp.transformer.mask.Prediction(cont, masked_tokens, df)[source]#

Bases: Dictable

A container class for masked token predictions produced by MaskFiller. This class offers many ways to get the predictions, including getting the sentences as instances of TokenContainer by using it as an iterable.

The sentences are also available as the pred_sentences key when using asdict().

__init__(cont, masked_tokens, df)#
cont: TokenContainer#

The document, sentence or span to predict masked tokens.

df: DataFrame#

The predictions with dataframe columns:

  • k: the k in the top-k highest scored masked token match

  • mask_id: the N-th masked token in the source ordered by position

  • token: the predicted token

  • score: the score of the prediction ([0, 1], higher the better)

get_container(k=0)[source]#

Get the k-th top scored sentence. This method modifies the tokens of the container on each invocation; however, a client may call it as many times as necessary (i.e. for multiple values of k) since the cont tokens are modified while retaining the original masked tokens in masked_tokens.

Parameters:

k (int) – as k increases, the mask substitutions (and thus the sentence) become less likely; k = 0 is the most likely given the sentence and masks

Return type:

TokenContainer

get_tokens()[source]#

Return an iterable of the prediction coupled with the token it belongs to and its score.

Return type:

Iterable[TokenPrediction]
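
Continuing the hedged MaskFiller sketch above, where pred is a Prediction instance:

    # the predicted sentences, most to least likely, via the iterable interface
    for k, sent in enumerate(pred):
        print(k, sent)

    # per token predictions coupled with their scores
    for tp in pred.get_tokens():
        print(tp.token.norm, '->', tp.prediction, tp.score)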

property masked_token_dicts: Tuple[Dict[str, Any]]#

A tuple of builtins.dict each having token index, norm and text data.

masked_tokens: Tuple[FeatureToken]#

The masked tokens matched.

write(depth=0, writer=<_io.TextIOWrapper name='<stdout>' mode='w' encoding='utf-8'>, include_masked_tokens=True, include_predicted_tokens=True, include_predicted_sentences=True)[source]#

Write this instance as either a Writable or as a Dictable. If class attribute _DICTABLE_WRITABLE_DESCENDANTS is set as True, then use the write() method on children instead of writing the generated dictionary. Otherwise, write this instance by first creating a dict recursively using asdict(), then formatting the output.

If the attribute _DICTABLE_WRITE_EXCLUDES is set, those attributes are removed from what is written in the write() method.

Note that this attribute will need to be set in all descendants in the instance hierarchy since writing the object instance graph is done recursively.

Parameters:
  • depth (int) – the starting indentation depth

  • writer (TextIOBase) – the writer to dump the content of this writable

class zensols.deepnlp.transformer.mask.TokenPrediction(token, prediction, score)[source]#

Bases: Dictable

Couples a masked model prediction with the token to which it belongs and its score.

__init__(token, prediction, score)#
prediction: str#
score: float#
token: FeatureToken#

zensols.deepnlp.transformer.optimizer#

Inheritance diagram of zensols.deepnlp.transformer.optimizer

Adapts the huggingface transformer weight decay optimizer.

class zensols.deepnlp.transformer.optimizer.TransformerAdamFactory[source]#

Bases: ModelResourceFactory

class zensols.deepnlp.transformer.optimizer.TransformerSchedulerFactory[source]#

Bases: ModelResourceFactory

Unified API to get any scheduler from its name. This simply calls transformers.get_scheduler() and calculates num_training_steps as epochs * batch_size.

Documentation taken directly from get_scheduler function in the PyTorch source tree.
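
Roughly, the factory resolves to the HuggingFace helper along these lines (a hedged illustration only, not the factory’s actual code; the optimizer and step counts are made-up values):

    import torch
    from transformers import get_scheduler

    model = torch.nn.Linear(10, 2)
    optimizer = torch.optim.AdamW(model.parameters(), lr=2e-5)
    epochs, batch_size = 3, 32

    scheduler = get_scheduler(
        'linear', optimizer=optimizer,
        num_warmup_steps=0,
        num_training_steps=epochs * batch_size)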

zensols.deepnlp.transformer.pred#

Inheritance diagram of zensols.deepnlp.transformer.pred

Predictions output for transformer models.

class zensols.deepnlp.transformer.pred.TransformerSequencePredictionsDataFrameFactory(source, result, stash, column_names=None, data_point_transform=None, batch_limit=9223372036854775807, epoch_result=None, label_vectorizer_name=None, embedded_document_attribute=None)[source]#

Bases: SequencePredictionsDataFrameFactory

Like the super class, but creates predictions for transformer sequence models. By default, transformer input is truncated at the model’s max token length (usually 512 word piece tokens). It then truncates the tokens that are added as the text column from the (configured by default) classify.TokenClassifyModelFacade.

For all predictions where the sequence exceeded the model’s maximum, this class maps the last word piece token output to the respective token in the predictions_dataframe_factory_class instance’s transform output.

__init__(source, result, stash, column_names=None, data_point_transform=None, batch_limit=9223372036854775807, epoch_result=None, label_vectorizer_name=None, embedded_document_attribute=None)#
embedded_document_attribute: str = None#

The Batch attribute key for the tensor that contains the vectorized document.

zensols.deepnlp.transformer.resource#

Inheritance diagram of zensols.deepnlp.transformer.resource

Provide BERT embeddings on a per sentence level.

exception zensols.deepnlp.transformer.resource.TransformerError[source]#

Bases: DeepLearnError

Raised for any transformer specific errors in this and child modules of the parent.

__annotations__ = {}#
__module__ = 'zensols.deepnlp.transformer.resource'#
class zensols.deepnlp.transformer.resource.TransformerResource(name, torch_config, model_id, cased=None, trainable=False, args=<factory>, tokenizer_args=<factory>, model_args=<factory>, model_class='transformers.AutoModel', tokenizer_class='transformers.AutoTokenizer', cache=False, cache_dir=None)[source]#

Bases: PersistableContainer, Dictable

A container base class that allows configuration and creates various huggingface models.

__init__(name, torch_config, model_id, cased=None, trainable=False, args=<factory>, tokenizer_args=<factory>, model_args=<factory>, model_class='transformers.AutoModel', tokenizer_class='transformers.AutoTokenizer', cache=False, cache_dir=None)#
args: Dict[str, Any]#

Additional arguments to pass to the from_pretrained method for the tokenizer and the model.

cache: InitVar = False#

When set to True, cache a global space model using the parameters from the first instance creation.

cache_dir: Path = None#

The directory that contains the BERT model(s).

property cached: bool#

If the model is cached.

See:

cache

cased: bool = None#

True for case sensitive models, False (default) otherwise. Its negated value is also used as the do_lower_case parameter in the *.from_pretrained calls to huggingface transformers.

clear()[source]#
property model: PreTrainedModel#
model_args: Dict[str, Any]#

Additional arguments to pass to the from_pretrained method for the model.

model_class: str = 'transformers.AutoModel'#

The model fully qualified class used to create models with the from_pretrained static method.

model_id: str#

The ID of the model (i.e. bert-base-uncased). If this is not set, it is derived from the model_name and case.

Token embedding using TransformerEmbedding has been tested with:

  • bert-base-cased

  • bert-large-cased

  • roberta-base

  • distilbert-base-cased

See:

Pretrained Models

name: str#

The name of the model given by the configuration. Used for debugging.

property tokenizer: PreTrainedTokenizer#
tokenizer_args: Dict[str, Any]#

Additional arguments to pass to the from_pretrained method for the tokenizer.

tokenizer_class: str = 'transformers.AutoTokenizer'#

The model fully qualified class used to create tokenizers with the from_pretrained static method.

torch_config: TorchConfig#

The config device used to copy the embedding data.

trainable: bool = False#

If False, the weights of the transformer model are frozen and use of the model (including in subclasses) turns off autograd when executing.
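
A hedged, programmatic sketch (in practice instances are usually created from the application configuration, and TorchConfig’s defaults are assumed to be acceptable here):

    from zensols.deeplearn import TorchConfig
    from zensols.deepnlp.transformer.resource import TransformerResource

    resource = TransformerResource(
        name='transformer',
        torch_config=TorchConfig(),
        model_id='bert-base-cased',
        cased=True,
        trainable=False)
    model = resource.model          # a transformers PreTrainedModel
    tokenizer = resource.tokenizer  # a transformers PreTrainedTokenizer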

zensols.deepnlp.transformer.tokenizer#

Inheritance diagram of zensols.deepnlp.transformer.tokenizer

The tokenizer object.

class zensols.deepnlp.transformer.tokenizer.TransformerDocumentTokenizer(resource, word_piece_token_length=None, params=None)[source]#

Bases: PersistableContainer

Creates instances of TokenizedFeatureDocument using a HuggingFace PreTrainedTokenizer.

DEFAULT_PARAMS: ClassVar[Dict[str, Any]] = {'is_split_into_words': True, 'padding': 'longest', 'return_offsets_mapping': True, 'return_special_tokens_mask': True}#

Default parameters for the HuggingFace tokenizer. These get overridden by the tokenizer_kwargs in tokenize() and the processing of value word_piece_token_length.

__init__(resource, word_piece_token_length=None, params=None)#
property all_special_tokens: Set[str]#

Special tokens used by the model (such as BERT’s [CLS] and [SEP] tokens).

property id2tok: Dict[int, str]#

A mapping from the HuggingFace tokenizer’s vocabulary to its word piece equivalent.

params: Dict[str, Any] = None#

Additional parameters given to the transformers.PreTrainedTokenizer.

property pretrained_tokenizer: PreTrainedTokenizer#

The HuggingFace tokenizer used to create tokenized documents.

resource: TransformerResource#

Contains the model used to create the tokenizer.

property token_max_length: int#

The word piece token maximum length supported by the model.

tokenize(doc, tokenizer_kwargs=None)[source]#

Tokenize a feature document in a form that’s easy to inspect and provide to TransformerEmbedding to transform.

Parameters:

doc (FeatureDocument) – the document to tokenize

Return type:

TokenizedFeatureDocument

word_piece_token_length: int = None#

The max number of word piece tokens. The word piece length is always the same or greater in count than linguistic tokens because the word piece algorithm tokenizes on characters.

If this value is less than 0, then do not fix sentence lengths. If the value is 0 (default), then truncate to the model’s longest max length. Otherwise, if this value is None, set the length to the model’s longest max length using the model’s model_max_length value.

Setting this to 0, making documents multi-length, has the potential of creating token spans longer than the model can tolerate (usually 512 word piece tokens). In these cases, this value must be set to (or lower than) the model’s model_max_length.

Tokenization padding is on by default.

See:

HF Docs
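
A hedged sketch, assuming tokenizer is a configured TransformerDocumentTokenizer and feat_doc an assumed parsed FeatureDocument:

    from zensols.deepnlp.transformer.domain import TokenizedFeatureDocument

    tok_doc: TokenizedFeatureDocument = tokenizer.tokenize(feat_doc)
    print(tok_doc.shape)                     # shape of the vectorized document
    tok_doc.write(id2tok=tokenizer.id2tok)   # human readable word piece dump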

zensols.deepnlp.transformer.vectorizers#

Inheritance diagram of zensols.deepnlp.transformer.vectorizers

Contains classes that are used to vectorize documents into transformer embeddings.

class zensols.deepnlp.transformer.vectorizers.LabelTransformerFeatureVectorizer(name, config_factory, feature_id, manager, encode_transformed, fold_method, embed_model, decode_embedding=False, is_labeler=True, encode_tokenized=False)[source]#

Bases: TransformerFeatureVectorizer

A base class for vectorizing by mapping tokens to transformer consumable word piece tokens. This includes creating labels and masks.

Shape:

(|sentences|, |max word piece length|)

FEATURE_TYPE = 1#
__init__(name, config_factory, feature_id, manager, encode_transformed, fold_method, embed_model, decode_embedding=False, is_labeler=True, encode_tokenized=False)#
is_labeler: bool = True#

If True, make this a labeling specific vectorizer. Otherwise, certain layers will use the output of the vectorizer as features rather than the labels.

class zensols.deepnlp.transformer.vectorizers.TransformerEmbeddingFeatureVectorizer(name, config_factory, feature_id, manager, encode_transformed, fold_method, embed_model, decode_embedding=False, is_labeler=False, encode_tokenized=False)[source]#

Bases: TransformerFeatureVectorizer

A feature vectorizer used to create transformer (i.e. BERT) embeddings. The class uses the embed_model, which is of type TransformerEmbedding.

Note that the encoding input should ideally be sentences shorter than 512 tokens. However, this vectorizer can accommodate both FeatureSentence and FeatureDocument instances.

DESCRIPTION = 'transformer document embedding'#
FEATURE_TYPE = 4#
__init__(name, config_factory, feature_id, manager, encode_transformed, fold_method, embed_model, decode_embedding=False, is_labeler=False, encode_tokenized=False)#
class zensols.deepnlp.transformer.vectorizers.TransformerExpanderFeatureContext(feature_id, contexts, document)[source]#

Bases: TransformerFeatureContext

A vectorizer feature context used with TransformerExpanderFeatureVectorizer.

__init__(feature_id, contexts, document)[source]#
Params feature_id:

the feature ID used to identify this context

Params contexts:

subordinate contexts given to MultiFeatureContext

Params document:

document used to create the transformer embeddings

contexts: Tuple[FeatureContext]#

The subordinate contexts.

deallocate()[source]#

Deallocate all resources for this instance.

class zensols.deepnlp.transformer.vectorizers.TransformerExpanderFeatureVectorizer(name, config_factory, feature_id, manager, encode_transformed, fold_method, embed_model, decode_embedding=False, is_labeler=False, encode_tokenized=False, delegate_feature_ids=None)[source]#

Bases: TransformerFeatureVectorizer

A vectorizer that expands linguistic feature vectors to their respective locations as word piece token vectors.

This is used to concatenate linguistic features with BERT (and other transformer) embeddings. Each linguistic token is copied in the word piece token location across all vectorizers and sentences.

Shape:

(-1, token length, X), where X is the sum of all the delegate shapes across all three dimensions

DESCRIPTION = 'transformer expander'#
FEATURE_TYPE = 1#
__init__(name, config_factory, feature_id, manager, encode_transformed, fold_method, embed_model, decode_embedding=False, is_labeler=False, encode_tokenized=False, delegate_feature_ids=None)#
delegate_feature_ids: Tuple[str] = None#

A list of feature IDs of vectorizers whose output will be expanded.

property delegates: EncodableFeatureVectorizer#

The delegates used for encoding and decoding the linguistic features.

class zensols.deepnlp.transformer.vectorizers.TransformerFeatureContext(feature_id, document)[source]#

Bases: FeatureContext, Deallocatable

A vectorizer feature context used with TransformerEmbeddingFeatureVectorizer.

__init__(feature_id, document)[source]#
Params feature_id:

the feature ID used to identify this context

Params document:

document used to create the transformer embeddings

deallocate()[source]#

Deallocate all resources for this instance.

get_document(vectorizer)[source]#
Return type:

TokenizedDocument

get_feature_document()[source]#
Return type:

FeatureDocument

class zensols.deepnlp.transformer.vectorizers.TransformerFeatureVectorizer(name, config_factory, feature_id, manager, encode_transformed, fold_method, embed_model, decode_embedding=False, is_labeler=False, encode_tokenized=False)[source]#

Bases: EmbeddingFeatureVectorizer, FeatureDocumentVectorizer

Base class for classes that vectorize transformer models. This class also tokenizes documents.

__init__(name, config_factory, feature_id, manager, encode_transformed, fold_method, embed_model, decode_embedding=False, is_labeler=False, encode_tokenized=False)#
encode_tokenized: bool = False#

Whether to tokenize the document on encoding. Set this to True only if the huggingface model ID (i.e. bert-base-cased) will not change after vectorization/batching.

Setting this to True tells the vectorizer to tokenize during encoding, and thus will speed experimentation by providing the tokenized tensors to the model directly.

property feature_type: TextFeatureType#

The type of feature this vectorizer generates. This is used by classes such as EmbeddingNetworkModule to determine where to add the features, such as concatenating to the embedding layer, join layer, etc.

is_labeler: bool = False#

If True, make this a labeling specific vectorizer. Otherwise, certain layers will use the output of the vectorizer as features rather than the labels.

tokenize(doc)[source]#

Tokenize the document in to a token document used by the encoding phase.

Parameters:

doc (FeatureDocument) – the document to be tokenized

Return type:

TokenizedFeatureDocument

property word_piece_token_length: int#
class zensols.deepnlp.transformer.vectorizers.TransformerMaskFeatureVectorizer(name, config_factory, feature_id, manager, encode_transformed, fold_method, embed_model, decode_embedding=False, is_labeler=True, encode_tokenized=False, data_type='bool')[source]#

Bases: LabelTransformerFeatureVectorizer

Creates a mask that sets word piece tokens to True and special tokens and padding to False. This maps tokens to word piece tokens like TransformerNominalFeatureVectorizer.

Shape:

(|sentences|, |max word piece length|)

DESCRIPTION = 'transformer mask'#
__init__(name, config_factory, feature_id, manager, encode_transformed, fold_method, embed_model, decode_embedding=False, is_labeler=True, encode_tokenized=False, data_type='bool')#
data_type: Union[str, None, torch.dtype] = 'bool'#

The mask tensor type. To use the int type that matches the resolution of the manager’s torch_config, use DEFAULT_INT.

class zensols.deepnlp.transformer.vectorizers.TransformerNominalFeatureVectorizer(name, config_factory, feature_id, manager, encode_transformed, fold_method, embed_model, decode_embedding=False, is_labeler=True, encode_tokenized=False, delegate_feature_id=None, size=-1, pad_label=-100, label_all_tokens=False, annotations_attribute='annotations')[source]#

Bases: AggregateEncodableFeatureVectorizer, LabelTransformerFeatureVectorizer

This creates word piece labels that map to tokens. This class uses a NominalEncodedEncodableFeatureVectorizer to map from string labels to their nominal long values. This allows a single instance and centralized location where the label mapping happens in case other (non-transformer) components need to vectorize labels.

Shape:

(|sentences|, |max word piece length|)

DESCRIPTION = 'transformer seq labeler'#
__init__(name, config_factory, feature_id, manager, encode_transformed, fold_method, embed_model, decode_embedding=False, is_labeler=True, encode_tokenized=False, delegate_feature_id=None, size=-1, pad_label=-100, label_all_tokens=False, annotations_attribute='annotations')#
annotations_attribute: str = 'annotations'#

The attribute used to get the features from the FeatureSentence. For example, TokenAnnotatedFeatureSentence has an annotations attribute.

delegate_feature_id: str = None#

The feature ID for the aggregate encodeable feature vectorizer.

label_all_tokens: bool = False#

If True, label all word piece tokens with the corresponding linguistic token label. Otherwise, the default padded value is used, and thus, ignored by the loss function when calculating loss.

write(depth=0, writer=<_io.TextIOWrapper name='<stdout>' mode='w' encoding='utf-8'>)[source]#

Write this instance as either a Writable or as a Dictable. If class attribute _DICTABLE_WRITABLE_DESCENDANTS is set as True, then use the write() method on children instead of writing the generated dictionary. Otherwise, write this instance by first creating a dict recursively using asdict(), then formatting the output.

If the attribute _DICTABLE_WRITE_EXCLUDES is set, those attributes are removed from what is written in the write() method.

Note that this attribute will need to be set in all descendants in the instance hierarchy since writing the object instance graph is done recursively.

Parameters:
  • depth (int) – the starting indentation depth

  • writer (TextIOBase) – the writer to dump the content of this writable

zensols.deepnlp.transformer.wordpiece#

Inheritance diagram of zensols.deepnlp.transformer.wordpiece

Word piece mappings to feature tokens, sentences and documents.

There are often edge cases and tricky situations with certain models’ usage of special tokens (i.e. [CLS]) and where they are used. With this in mind, this module attempts to:

  • Assist in debugging (works with detached TokenizedDocument) in cases where token level embeddings are directly accessed, and

  • Map both token and sentence level embeddings to their respective originating natural language feature data structures.

class zensols.deepnlp.transformer.wordpiece.CachingWordPieceFeatureDocumentFactory(tokenizer, embed_model, token_embeddings=True, sent_embeddings=True, stash=None, hasher=<factory>)[source]#

Bases: WordPieceFeatureDocumentFactory

Caches the documents and their embeddings in a Stash. For those that are cached, the embeddings are copied over to the passed document in create().

__init__(tokenizer, embed_model, token_embeddings=True, sent_embeddings=True, stash=None, hasher=<factory>)#
clear()[source]#

Clear the caching stash.

create(fdoc, tdoc=None)[source]#

Create, from a feature document, an object graph that relates word pieces to feature tokens. Note that if tdoc is provided, it must have been tokenized from fdoc.

Parameters:
  • fdoc – the feature document used to create the word piece document

  • tdoc – the corresponding tokenized document, if already created from fdoc

Return type:

WordPieceFeatureDocument

Returns:

a data structure with the word piece information

hasher: Hasher#

Used to hash the natural language text into string keys.

stash: Stash = None#

The stash that persists the feature document instances. If this is not provided, no caching will happen.

class zensols.deepnlp.transformer.wordpiece.WordPiece(word, vocab_index, index)[source]#

Bases: PersistableContainer, Dictable

The word piece data.

UNKNOWN_TOKEN: ClassVar[str] = '[UNK]'#

The string used for out of vocabulary word piece tokens.

__init__(word, vocab_index, index)#
index: int#

The index of the word piece subword in the tokenization tensor, which will have the same index in the output embeddings for TransformerEmbedding.output = last_hidden_state.

property is_unknown: bool#

Whether this token is out of vocabulary.

vocab_index: int#

The vocabulary index.

word: str#

The string representation of the word piece.

class zensols.deepnlp.transformer.wordpiece.WordPieceDocumentDecorator(word_piece_doc_factory)[source]#

Bases: FeatureDocumentDecorator

Populates sentence and token embeddings in the documents.

See:

WordPieceFeatureDocumentFactory

__init__(word_piece_doc_factory)#
decorate(doc)[source]#
word_piece_doc_factory: WordPieceFeatureDocumentFactory#

The feature document factory that populates embeddings.

class zensols.deepnlp.transformer.wordpiece.WordPieceFeatureDocument(sents, text=None, spacy_doc=None, tokenized=None)[source]#

Bases: FeatureDocument, WordPieceTokenContainer

A document made up of word piece sentences.

__init__(sents, text=None, spacy_doc=None, tokenized=None)#
copy_embedding(target)[source]#

Copy embeddings (and children) from this instance to target.

property embedding: Tensor#

The document embedding (see WordPieceFeatureSpan.embedding).

Shape:

(|sentences|, <embedding dimension>)

tokenized: TokenizedFeatureDocument = None#

The tokenized feature document.

write(depth=0, writer=<_io.TextIOWrapper name='<stdout>' mode='w' encoding='utf-8'>)[source]#

Write the document and optionally sentence features.

Parameters:
  • n_sents – the number of sentences to write

  • n_tokens – the number of tokens to print across all sentences

  • include_original – whether to include the original text

  • include_normalized – whether to include the normalized text

class zensols.deepnlp.transformer.wordpiece.WordPieceFeatureDocumentFactory(tokenizer, embed_model, token_embeddings=True, sent_embeddings=True)[source]#

Bases: object

Create instances of WordPieceFeatureDocument from FeatureDocument instances. It does this by iterating through a feature document data structure and adding WordPiece* object data and optionally adding the corresponding sentence and/or token level embeddings.

The embeddings can also be added with add_token_embeddings() and add_sent_embeddings() individually. If all you want are the sentence level embeddings, you can use add_sent_embeddings() on a FeatureSentence instance.

__init__(tokenizer, embed_model, token_embeddings=True, sent_embeddings=True)#
add_sent_embeddings(doc, arr)[source]#

Add sentence embeddings to the sentences of doc.

Parameters:

doc (Union[WordPieceFeatureDocument, FeatureDocument]) – sentences of this doc have embeddings set to the corresponding sentence tensor with shape (1, <embedding dimension>).

add_token_embeddings(doc, arr)[source]#

Add token embeddings to the sentences of doc. This assumes tokens are of type WordPieceFeatureToken since the token indices are needed.

Parameters:

doc (WordPieceFeatureDocument) – sentences of this doc have embeddings set to the corresponding sentence tensor with shape (1, <embedding dimension>).

create(fdoc, tdoc=None)[source]#

Create, from a feature document, an object graph that relates word pieces to feature tokens. Note that if tdoc is provided, it must have been tokenized from fdoc.

Parameters:
  • fdoc – the feature document used to create the word piece document

  • tdoc – the corresponding tokenized document, if already created from fdoc

Return type:

WordPieceFeatureDocument

Returns:

a data structure with the word piece information

embed_model: TransformerEmbedding#

Used to populate the embeddings in WordPiece* classes.

sent_embeddings: bool = True#

Whether to add WordPieceFeatureSentence.embeddings.

token_embeddings: bool = True#

Whether to add WordPieceFeatureToken.embeddings.

tokenizer: TransformerDocumentTokenizer#

Used to tokenize documents that aren’t already tokenized in __call__().
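
A hedged sketch; wp_factory is an assumed, configured WordPieceFeatureDocumentFactory and feat_doc an assumed parsed FeatureDocument:

    wp_doc = wp_factory.create(feat_doc)
    print(wp_doc.embedding.shape)    # (|sentences|, <embedding dimension>)
    for sent in wp_doc.sents:
        for tok in sent.tokens:
            print(tok.norm, tok.token_embedding.shape)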

class zensols.deepnlp.transformer.wordpiece.WordPieceFeatureSentence(tokens, text=None, spacy_span=None, embedding=None)[source]#

Bases: WordPieceFeatureSpan, FeatureSentence

__init__(tokens, text=None, spacy_span=None, embedding=None)#
class zensols.deepnlp.transformer.wordpiece.WordPieceFeatureSpan(tokens, text=None, spacy_span=None, embedding=None)[source]#

Bases: FeatureSentence, WordPieceTokenContainer

A sentence made up of word pieces.

__init__(tokens, text=None, spacy_span=None, embedding=None)#
copy_embedding(target)[source]#

Copy embeddings (and children) from this instance to target.

embedding: Tensor = None#

The sentence level (i.e. [CLS]) embedding from the transformer.

Shape:

(<embedding dimension>,)

write(depth=0, writer=<_io.TextIOWrapper name='<stdout>' mode='w' encoding='utf-8'>)[source]#

Write the text container.

Parameters:
  • include_original – whether to include the original text

  • include_normalized – whether to include the normalized text

  • n_tokens – the number of tokens to write

  • inline – whether to print the tokens on one line each

class zensols.deepnlp.transformer.wordpiece.WordPieceFeatureToken(i, idx, i_sent, norm, words, embedding=None)[source]#

Bases: FeatureToken

The token and the word pieces that represent it.

__init__(i, idx, i_sent, norm, words, embedding=None)#
clone(cls=None, **kwargs)[source]#

Clone an instance of this token.

Parameters:
  • cls (Type) – the type of the new instance

  • kwargs – arguments to add to as attributes to the clone

Return type:

FeatureToken

Returns:

the cloned instance of this instance

copy_embedding(target)[source]#

Copy embedding (and children) from this instance to target.

detach(*args, **kwargs)[source]#

Create a detached token (i.e. from spaCy artifacts).

Parameters:
  • feature_ids – the features to write, which defaults to FEATURE_IDS

  • skip_missing – whether to only keep feature_ids

  • cls – the type of the new instance

Return type:

FeatureToken

embedding: Tensor = None#

The embedding for words after using the transformer.

Shape:

(|words|, <embedding dimension>)

property indexes: Tuple[int]#

The indexes of the word piece subwords (see WordPiece.index).

property is_unknown: bool#

Whether this token is out of vocabulary.

property token_embedding: Tensor#

The embedding of this token, which is the sum of the word piece embeddings.

word_iter()[source]#

Return an iterable over the word pieces.

Return type:

Iterable[WordPiece]

words: Tuple[WordPiece]#

The word pieces that make up this token.

write(depth=0, writer=<_io.TextIOWrapper name='<stdout>' mode='w' encoding='utf-8'>)[source]#

Write this instance as either a Writable or as a Dictable. If class attribute _DICTABLE_WRITABLE_DESCENDANTS is set as True, then use the write() method on children instead of writing the generated dictionary. Otherwise, write this instance by first creating a dict recursively using asdict(), then formatting the output.

If the attribute _DICTABLE_WRITE_EXCLUDES is set, those attributes are removed from what is written in the write() method.

Note that this attribute will need to be set in all descendants in the instance hierarchy since writing the object instance graph is done recursively.

Parameters:
  • depth (int) – the starting indentation depth

  • writer (TextIOBase) – the writer to dump the content of this writable

class zensols.deepnlp.transformer.wordpiece.WordPieceTokenContainer[source]#

Bases: TokenContainer

Like TokenContainer but contains word pieces.

property unknown_count: int#

Return the number of out of vocabulary tokens in the container.

word_iter()[source]#

Return an iterable over the word pieces.

Return type:

Iterable[WordPiece]

Module contents#

Contains classes that adapt the huggingface transformers to the Zensols deep learning framework.

zensols.deepnlp.transformer.normalize_huggingface_logging()[source]#

Make the transformers package use default logging. Using this and setting the transformers logging package to ERROR level logging has the same effect as suppress_warnings().

zensols.deepnlp.transformer.suppress_warnings()[source]#

Suppress the `Some weights of the model checkpoint...` warnings from huggingface transformers.

See:

normalize_huggingface_logging()

zensols.deepnlp.transformer.turn_off_huggingface_downloads()[source]#

Turn off automatic model checks and downloads.
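
A hedged usage sketch of these module level helpers:

    import logging
    from zensols.deepnlp import transformer

    # silence the "Some weights of the model checkpoint..." warnings
    transformer.suppress_warnings()

    # or route the transformers package through standard logging at ERROR level
    transformer.normalize_huggingface_logging()
    logging.getLogger('transformers').setLevel(logging.ERROR)

    # avoid model checks and downloads when models are already cached locally
    transformer.turn_off_huggingface_downloads()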