zensols.deepnlp.transformer package

Submodules

zensols.deepnlp.transformer.domain module

Container classes for BERT models.

class zensols.deepnlp.transformer.domain.TokenizedDocument(tensor, boundary_tokens)[source]

Bases: PersistableContainer, Writable

This is the tokenized document output of TransformerDocumentTokenizer. Instances of this class are pickleable in a feature context. They are then used in the decoding phase to create a tensor with a transformer model such as TransformerEmbedding.

__init__(tensor, boundary_tokens)
property attention_mask: Tensor

The attention mask (0/1s).

boundary_tokens: bool

If the token document has sentence boundary tokens, such as [CLS] for BERT.

deallocate()[source]

Deallocate all resources for this instance.

detach()[source]

Return a version of the document that is pickleable.

Return type:

TokenizedDocument

classmethod from_tensor(tensor)[source]

Create an instance of the class using a tensor. This is useful for re-creating documents for mapping with map_word_pieces() after unpickling from a document created with TransformerDocumentTokenizer.tokenize.

Parameters:

tensor (Tensor) – the tensor to set as the tensor attribute

Return type:

TokenizedDocument
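
As a rough sketch of the pickle round trip described above (the tokenizer and parsed document are assumed to already exist, and the import path follows the module layout of this page):

import pickle
from zensols.deepnlp.transformer.domain import TokenizedDocument

# `tokenizer` is assumed to be a TransformerDocumentTokenizer and `doc` a
# parsed FeatureDocument, both created elsewhere
tok_doc = tokenizer.tokenize(doc)

# detach() returns a pickleable version backed only by its tensor
blob: bytes = pickle.dumps(tok_doc.detach())

# later, in the decode phase, re-create the document from the tensor so
# map_word_pieces()/map_to_word_pieces() can be used again
restored = TokenizedDocument.from_tensor(pickle.loads(blob).tensor)
print(restored.shape, restored.input_ids.shape, restored.attention_mask.shape)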

get_wordpiece_count(**kwargs)[source]

The size of the document (sum over sentences) in number of word pieces. To keep special tokens (such as BERT’s [CLS] and [SEP] tokens) when passing in a tokenizer in kwargs, add special_tokens={}.

Parameters:

kwargs – any keyword arguments passed on to map_to_word_pieces(), except index_tokens and includes (do not add these)

Return type:

int
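
For example, a small sketch of counting word pieces (continuing the variables from the previous sketch; passing the tokenizer as map_wp is an assumption drawn from the map_to_word_pieces() documentation below):

# drop special tokens (the default when a tokenizer is provided)
n = tok_doc.get_wordpiece_count(map_wp=tokenizer)

# keep special tokens such as [CLS] and [SEP]
n_special = tok_doc.get_wordpiece_count(map_wp=tokenizer, special_tokens={})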

property input_ids: Tensor

The token IDs as the output from the tokenizer.

map_to_word_pieces(sentences=None, map_wp=None, add_indices=False, special_tokens=None, index_tokens=True, includes=frozenset({'map'}))[source]

Map word piece tokens to linguistic tokens.

Parameters:
  • sentences (Iterable[List[Any]]) – an iteration of sentences, which is returned in the output (i.e. FeatureSentence), or input_ids if None

  • map_wp (Union[Callable, Dict[int, str]]) – either a function that takes the token index, sentence ID and input IDs, or the mapping from word piece ID to string token; return output is the string token (or numerical output if no mapping is provided); if an instance of TransformerDocumentTokenizer, its vocabulary and special tokens are utilized for mapping and special token consideration

  • add_indices (bool) – whether to add the token ID and index after the token string when id2tok is provided for map_wp

  • special_tokens (Set[str]) – a list of tokens (such as BERT’s [CLS] and [SEP] tokens) to remove; to keep special tokens when passing in a tokenizer in kwargs, add special_tokens={}.

  • index_tokens (bool) – whether to index tokens positionally, which is used for mapping with feature or tokenized sentences; set this to False when sentences are anything but a feature document / sentences

  • includes (Set[str]) – what data to return, which is a set of the keys listed in the return documentation below

Return type:

List[Dict[str, Any]]

Returns:

a list of sentence maps, each with:

  • sent_ix -> the ``i``th sentence (always provided)

  • map -> list of (sentence 'token', word pieces)

  • sent -> a FeatureSentence or a tensor of vocab indexes if map_wp is None

  • word_pieces -> the word pieces of the sentences
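
A hedged usage sketch of this mapping; feat_doc (the originating FeatureDocument), tok_doc and tokenizer are assumed from the earlier sketches:

# pass the tokenizer as map_wp so its vocabulary and special tokens are used
for sent_map in tok_doc.map_to_word_pieces(
        sentences=feat_doc.sents,
        map_wp=tokenizer,
        includes={'map', 'sent'}):
    print('sentence:', sent_map['sent'])
    for token, word_pieces in sent_map['map']:
        print(f'  {token} -> {word_pieces}')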

static map_word_pieces(token_offsets)[source]

Map word piece tokens to linguistic tokens.

Return type:

List[Tuple[FeatureToken, List[int]]]

Returns:

a list of tuples in the form:

(<token index>, <list of word piece indexes>)

property offsets: Tensor

The offsets from word piece (transformer’s tokenizer) to feature document index mapping.

params()[source]
Return type:

Dict[str, Any]

property shape: Size

Return the shape of the vectorized document.

tensor: Tensor

Encodes the input IDs, attention mask, and word piece offset map.

property token_type_ids: Tensor

The token type IDs (0/1s).

truncate(size)[source]

Truncate the last (token) dimension to size.

Return type:

TokenizedDocument

Returns:

a new instance of this class truncated to size

write(depth=0, writer=<_io.TextIOWrapper name='<stdout>' mode='w' encoding='utf-8'>, include_tokens=True, id2tok=None)[source]

Write the contents of this instance to writer using indention depth.

Parameters:
  • depth (int) – the starting indentation depth

  • writer (TextIOBase) – the writer to dump the content of this writable

class zensols.deepnlp.transformer.domain.TokenizedFeatureDocument(tensor, boundary_tokens, feature, id2tok, char_offsets)[source]

Bases: TokenizedDocument

Instances of this class are created, then a pickleable version is returned with detach() as an instance of the superclass.

__init__(tensor, boundary_tokens, feature, id2tok, char_offsets)
char_offsets: Tuple[Tuple[int, int]]

The valid character offsets for each word piece token.

detach()[source]

Return a version of the document that is pickleable.

Return type:

TokenizedDocument

feature: FeatureDocument

The document to tokenize.

id2tok: Dict[int, str]

If provided, a mapping of indexes to transformer tokens. This attribute is always nulled out after being persisted.

map_to_word_pieces(sentences=None, map_wp=None, **kwargs)[source]

Map word piece tokens to linguistic tokens.

Parameters:
  • sentences (Iterable[List[Any]]) – an iteration of sentences, which is returned in the output (i.e. FeatureSentence), or input_ids if None

  • map_wp (Union[Callable, Dict[int, str]]) – either a function that takes the token index, sentence ID and input IDs, or the mapping from word piece ID to string token; return output is the string token (or numerical output if no mapping is provided); if an instance of TransformerDocumentTokenizer, its vocabulary and special tokens are utilized for mapping and special token consideration

  • add_indices – whether to add the token ID and index after the token string when id2tok is provided for map_wp

  • special_tokens – a list of tokens (such as BERT’s [CLS] and [SEP] tokens) to remove; to keep special tokens when passing in a tokenizer in kwargs, add special_tokens={}.

  • index_tokens – whether to index tokens positionally, which is used for mapping with feature or tokenized sentences; set this to False when sentences are anything but a feature document / sentences

  • includes – what data to return, which is a set of the keys listed in the return documentation below

Return type:

List[Dict[str, Any]]

Returns:

a list of sentence maps, each with:

  • sent_ix -> the ``i``th sentence (always provided)

  • map -> list of (sentence 'token', word pieces)

  • sent -> a FeatureSentence or a tensor of vocab indexes if map_wp is None

  • word_pieces -> the word pieces of the sentences

write(depth=0, writer=<_io.TextIOWrapper name='<stdout>' mode='w' encoding='utf-8'>, include_tokens=True, id2tok=None)[source]

Write the contents of this instance to writer using indention depth.

Parameters:
  • depth (int) – the starting indentation depth

  • writer (TextIOBase) – the writer to dump the content of this writable

zensols.deepnlp.transformer.embed module

The embedding model object.

class zensols.deepnlp.transformer.embed.TransformerEmbedding(name, tokenizer, output='pooler_output', output_attentions=False)[source]

Bases: PersistableContainer, Dictable

A model for transformer embeddings (such as BERT) that wraps the HuggingFace transformers API.

ALL_OUTPUT: ClassVar[str] = 'all_output'
LAST_HIDDEN_STATE_OUTPUT: ClassVar[str] = 'last_hidden_state'
POOLER_OUTPUT: ClassVar[str] = 'pooler_output'
__init__(name, tokenizer, output='pooler_output', output_attentions=False)
property cache

When set to True cache a global space model using the parameters from the first instance creation.

property model: PreTrainedModel
name: str

The name of the embedding as given in the configuration.

output: str = 'pooler_output'

The output from the huggingface transformer API to return.

This is set to one of:

  • LAST_HIDDEN_STATE_OUTPUT: with the output embeddings of the last layer with shape: (batch, N sentences, hidden layer dimension)

  • POOLER_OUTPUT: the last layer hidden-state of the first token of the sequence (classification token) further processed by a Linear layer and a Tanh activation function with shape: (batch, hidden layer dimension)

  • ALL_OUTPUT: includes both as a dictionary with corresponding keys

output_attentions: bool = False

Whether or not to output the attention layer.

property resource: TransformerResource

The transformer resource containing the model.

tokenize(doc)[source]

Tokenize the feature document, which is used as the input to transform().

Parameters:

doc (FeatureDocument) – the document to tokenize

Return type:

TokenizedFeatureDocument

Returns:

the tokenization of doc

tokenizer: TransformerDocumentTokenizer

The tokenizer used for creating the input for the model.

property trainable: bool

Whether or not the model is trainable or frozen.

transform(doc, output=None)[source]

Transform the document into the transformer output.

Parameters:
  • doc – the document to transform (the output of tokenize())

  • output (str) – the output from the huggingface transformer API to return (see class docs)

Return type:

Union[Tensor, Dict[str, Tensor]]

Returns:

a container object instance with the output, which contains (among other data) last_hidden_state with the output embeddings of the last layer with shape: (batch, N sentences, hidden layer dimension)

property vector_dimension: int

Return the output embedding dimension of the final layer.
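
A minimal sketch of the tokenize/transform flow, assuming embedding is a configured TransformerEmbedding (normally assembled by the zensols configuration framework) and doc is a parsed FeatureDocument:

# tokenize the feature document into the input expected by transform()
tok_doc = embedding.tokenize(doc)

# the last hidden layer: (batch, N sentences, hidden layer dimension)
hidden = embedding.transform(tok_doc, output=embedding.LAST_HIDDEN_STATE_OUTPUT)

# the pooler output: (batch, hidden layer dimension)
pooled = embedding.transform(tok_doc, output=embedding.POOLER_OUTPUT)

print(hidden.shape, pooled.shape, embedding.vector_dimension)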

zensols.deepnlp.transformer.layer module

Contains transformer embedding layers.

class zensols.deepnlp.transformer.layer.TransformerEmbeddingLayer(*args, embed_model, **kwargs)[source]

Bases: EmbeddingLayer

A transformer (i.e. BERT) embedding layer. This class generates embeddings on a per sentence basis. See the initializer documentation for configuration requirements.

MODULE_NAME: ClassVar[str] = 'transformer embedding'

The module name used in the logging message. This is set in each inherited class.

__init__(*args, embed_model, **kwargs)[source]

Initialize with an embedding model. This embedding model must be configured with TransformerEmbedding.output set to last_hidden_state.

Parameters:

embed_model (TransformerEmbedding) – used to generate the transformer (i.e. BERT) embeddings

deallocate()[source]

Deallocate all resources for this instance.

forward(x)[source]

Defines the computation performed at every call.

Should be overridden by all subclasses.

Return type:

Tensor

Note

Although the recipe for forward pass needs to be defined within this function, one should call the Module instance afterwards instead of this since the former takes care of running the registered hooks while the latter silently ignores them.

class zensols.deepnlp.transformer.layer.TransformerSequence(net_settings, sub_logger=None)[source]

Bases: EmbeddingNetworkModule, SequenceNetworkModule

A sequence based model for token classification using HuggingFace transformers layers (not their token classification API).

MODULE_NAME: ClassVar[str] = 'transformer sequence'

The module name used in the logging message. This is set in each inherited class.

__init__(net_settings, sub_logger=None)[source]

Initialize the embedding layer.

Parameters:
  • net_settings (TransformerSequenceNetworkSettings) – the embedding layer configuration

  • logger – the logger to use for the forward process in this layer

  • filter_attrib_fn – if provided, called with a BatchFieldMetadata for each field returning True if the batch field should be retained and used in the embedding layer (see class docs); if None all fields are considered

deallocate()[source]

Deallocate all resources for this instance.

class zensols.deepnlp.transformer.layer.TransformerSequenceNetworkSettings(name, config_factory, dropout, batch_stash, embedding_layer, decoder_settings)[source]

Bases: EmbeddingNetworkSettings, DropoutNetworkSettings

Settings configuration for TransformerSequence.

__init__(name, config_factory, dropout, batch_stash, embedding_layer, decoder_settings)
decoder_settings: DeepLinearNetworkSettings

The decoder feed forward network.

get_module_class_name()[source]

Returns the fully qualified class name of the module to create by ModelManager. This module takes as the first parameter an instance of this class.

Important: This method is not used for nested modules. You must declare specific class names in the configuration for those nested class names.

Return type:

str

zensols.deepnlp.transformer.mask module

Classes to predict fill-mask tasks.

class zensols.deepnlp.transformer.mask.MaskFiller(resource, k=1, feature_id='norm', feature_value='MASK')[source]

Bases: object

The class fills masked tokens with the prediction of the underlying masked model. Masked tokens with attribute feature_id having value feature_value (norm and MASK by default respectively) are substituted with model values.

To use this class, parse a sentence with a FeatureDocumentParser with masked tokens using the string feature_value.

For example (with class defaults), the sentence:

Paris is the MASK of France.

becomes:

Paris is the <mask> of France.

The <mask> string becomes the mask_token for the model’s tokenizer.

__init__(resource, k=1, feature_id='norm', feature_value='MASK')
feature_id: str = 'norm'

The FeatureToken feature ID to match on masked tokens.

See:

feature_value

feature_value: str = 'MASK'

The value of feature ID feature_id to match on masked tokens.

k: int = 1

The number of top K predicted masked words per mask. The total number of predictions will be <number of masks> X k in the source document.

predict(source)[source]

Predict substitution values for token masks.

Important: source is modified as a side-effect of this method. Use clone() on the source document passed to this method to preserve the original if necessary.

Parameters:

source (TokenContainer) – the source document, sentence, or span for which to substitute values

Return type:

Prediction

resource: TransformerResource

A container class with the Huggingface tokenizer and model.
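
A hedged end-to-end sketch of the fill-mask workflow; doc_parser (a configured FeatureDocumentParser) and resource (a TransformerResource wrapping a masked language model) are assumed to exist:

from zensols.deepnlp.transformer.mask import MaskFiller

filler = MaskFiller(resource=resource, k=3)

# tokens whose norm feature is 'MASK' are substituted by the model
doc = doc_parser.parse('Paris is the MASK of France.')
pred = filler.predict(doc)            # note: modifies doc as a side effect

print(pred.get_container(0))          # the top scored sentence
for tp in pred.get_tokens():          # per-token predictions with scores
    print(tp.token, tp.prediction, tp.score)
print(pred.df)                        # columns: k, mask_id, token, score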

class zensols.deepnlp.transformer.mask.Prediction(cont, masked_tokens, df)[source]

Bases: Dictable

A container class for masked token predictions produced by MaskFiller. This class offers many ways to get the predictions, including getting the sentences as instances of TokenContainer by using it as an iterable.

The sentences are also available as the pred_sentences key when using asdict().

__init__(cont, masked_tokens, df)
cont: TokenContainer

The document, sentence or span to predict masked tokens.

df: DataFrame

The predictions with dataframe columns:

  • k: the k in the top-k highest scored masked token match

  • mask_id: the N-th masked token in the source ordered by position

  • token: the predicted token

  • score: the score of the prediction ([0, 1], higher the better)

get_container(k=0)[source]

Get the k-th top scored sentence. Note that this method modifies the tokens of the container on each invocation.

A client may call this method as many times as necessary (i.e. for multiple values of k) since the cont tokens are modified while retaining the original masked tokens in masked_tokens.

Parameters:

k (int) – as k increases, the mask substitutions (and thus the sentence) become less likely; k = 0 is the most likely given the sentence and masks

Return type:

TokenContainer

get_tokens()[source]

Return an iterable of the prediction coupled with the token it belongs to and its score.

Return type:

Iterable[TokenPrediction]

property masked_token_dicts: Tuple[Dict[str, Any]]

A tuple of builtins.dict each having token index, norm and text data.

masked_tokens: Tuple[FeatureToken]

The masked tokens matched.

write(depth=0, writer=<_io.TextIOWrapper name='<stdout>' mode='w' encoding='utf-8'>, include_masked_tokens=True, include_predicted_tokens=True, include_predicted_sentences=True)[source]

Write this instance as either a Writable or as a Dictable. If class attribute _DICTABLE_WRITABLE_DESCENDANTS is set as True, then use the write() method on children instead of writing the generated dictionary. Otherwise, write this instance by first creating a dict recursively using asdict(), then formatting the output.

If the attribute _DICTABLE_WRITE_EXCLUDES is set, those attributes are removed from what is written in the write() method.

Note that this attribute will need to be set in all descendants in the instance hierarchy since writing the object instance graph is done recursively.

Parameters:
  • depth (int) – the starting indentation depth

  • writer (TextIOBase) – the writer to dump the content of this writable

class zensols.deepnlp.transformer.mask.TokenPrediction(token, prediction, score)[source]

Bases: Dictable

Couples a masked model prediction with the token to which it belongs and its score.

__init__(token, prediction, score)
prediction: str
score: float
token: FeatureToken

zensols.deepnlp.transformer.optimizer module

Adapts the huggingface transformer weight decay optimizer.

class zensols.deepnlp.transformer.optimizer.TransformerAdamFactory[source]

Bases: ModelResourceFactory

class zensols.deepnlp.transformer.optimizer.TransformerSchedulerFactory[source]

Bases: ModelResourceFactory

Unified API to get any scheduler from its name. This simply calls transformers.get_scheduler() and calculates num_training_steps as epochs * batch_size.

Documentation taken directly from the get_scheduler function in the huggingface transformers source tree.

zensols.deepnlp.transformer.pred module

Predictions output for transformer models.

class zensols.deepnlp.transformer.pred.TransformerSequencePredictionsDataFrameFactory(source, result, stash, column_names=None, data_point_transform=None, batch_limit=9223372036854775807, epoch_result=None, label_vectorizer_name=None, embedded_document_attribute=None)[source]

Bases: SequencePredictionsDataFrameFactory

Like the superclass, but creates predictions for transformer sequence models. By default, transformer input is truncated at the model’s max token length (usually 512 word piece tokens). It then truncates the tokens that are added as the text column from the (configured by default) classify.TokenClassifyModelFacade.

For all predictions where the sequence exceeded the model’s maximum, this class maps the last word piece token output to the respective token in the predictions_dataframe_factory_class instance’s transform output.

__init__(source, result, stash, column_names=None, data_point_transform=None, batch_limit=9223372036854775807, epoch_result=None, label_vectorizer_name=None, embedded_document_attribute=None)
embedded_document_attribute: str = None

The Batch attribute key for the tensor that contains the vectorized document.

zensols.deepnlp.transformer.resource module

Provide BERT embeddings on a per sentence level.

exception zensols.deepnlp.transformer.resource.TransformerError[source]

Bases: DeepLearnError

Raised for any transformer specific errors in this and child modules of the parent.

class zensols.deepnlp.transformer.resource.TransformerResource(name, torch_config, model_id, cased=None, trainable=False, args=<factory>, tokenizer_args=<factory>, model_args=<factory>, model_class='transformers.AutoModel', tokenizer_class='transformers.AutoTokenizer', cache=False, cache_dir=None)[source]

Bases: PersistableContainer, Dictable

A container base class that allows configuration and creates various huggingface models.

__init__(name, torch_config, model_id, cased=None, trainable=False, args=<factory>, tokenizer_args=<factory>, model_args=<factory>, model_class='transformers.AutoModel', tokenizer_class='transformers.AutoTokenizer', cache=False, cache_dir=None)
args: Dict[str, Any]

Additional arguments to pass to the from_pretrained method for the tokenizer and the model.

cache: InitVar = False

When set to True cache a global space model using the parameters from the first instance creation.

cache_dir: Path = None

The directory that contains the BERT model(s).

property cached: bool

If the model is cached.

See:

cache

cased: bool = None

True for case sensitive models, False (default) otherwise. Its negated value is also used as the do_lower_case parameter in the *.from_pretrained calls to huggingface transformers.

clear()[source]
property model: PreTrainedModel
model_args: Dict[str, Any]

Additional arguments to pass to the from_pretrained method for the model.

model_class: str = 'transformers.AutoModel'

The model fully qualified class used to create models with the from_pretrained static method.

model_id: str

The ID of the model (i.e. bert-base-uncased). If this is not set, it is derived from the model_name and case.

Token embedding using TransformerEmbedding has been tested with:

  • bert-base-cased

  • bert-large-cased

  • roberta-base

  • distilbert-base-cased

See:

Pretrained Models

name: str

The name of the model given by the configuration. Used for debugging.

property tokenizer: PreTrainedTokenizer
tokenizer_args: Dict[str, Any]

Additional arguments to pass to the from_pretrained method for the tokenizer.

tokenizer_class: str = 'transformers.AutoTokenizer'

The model fully qualified class used to create tokenizers with the from_pretrained static method.

torch_config: TorchConfig

The config device used to copy the embedding data.

trainable: bool = False

If False, the weights of the transformer model are frozen and use of the model (including in subclasses) turns off autograd when executing.
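
A minimal sketch of building a TransformerResource directly in Python (in practice these are usually assembled through the zensols configuration framework); the TorchConfig construction is an assumption about the companion zensols.deeplearn package:

from zensols.deeplearn import TorchConfig
from zensols.deepnlp.transformer.resource import TransformerResource

resource = TransformerResource(
    name='transformer',
    torch_config=TorchConfig(),       # device/dtype configuration (assumed API)
    model_id='bert-base-cased',
    cased=True,
    trainable=False)                  # freeze weights and disable autograd

tokenizer = resource.tokenizer        # a transformers PreTrainedTokenizer
model = resource.model                # created with transformers.AutoModel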

zensols.deepnlp.transformer.tokenizer module

The tokenizer object.

class zensols.deepnlp.transformer.tokenizer.TransformerDocumentTokenizer(resource, word_piece_token_length=None, params=None)[source]

Bases: PersistableContainer

Creates instances of TokenizedFeatureDocument using a HuggingFace PreTrainedTokenizer.

DEFAULT_PARAMS: ClassVar[Dict[str, Any]] = {'is_split_into_words': True, 'padding': 'longest', 'return_offsets_mapping': True, 'return_special_tokens_mask': True}

Default parameters for the HuggingFace tokenizer. These get overridden by the tokenizer_kwargs in tokenize() and by the processing of the word_piece_token_length value.

__init__(resource, word_piece_token_length=None, params=None)
property all_special_tokens: Set[str]

Special tokens used by the model (such as BERT’s [CLS] and [SEP] tokens).

property id2tok: Dict[int, str]

A mapping from the HuggingFace tokenizer’s vocabulary to its word piece equivalent.

params: Dict[str, Any] = None

Additional parameters given to the transformers.PreTrainedTokenizer.

property pretrained_tokenizer: PreTrainedTokenizer

The HuggingFace tokenizer used to create tokenized documents.

resource: TransformerResource

Contains the model used to create the tokenizer.

property token_max_length: int

The word piece token maximum length supported by the model.

tokenize(doc, tokenizer_kwargs=None)[source]

Tokenize a feature document in a form that’s easy to inspect and provide to TransformerEmbedding to transform.

Parameters:

doc (FeatureDocument) – the document to tokenize

Return type:

TokenizedFeatureDocument

word_piece_token_length: int = None

The max number of word piece tokens. The word piece length is always the same or greater in count than linguistic tokens because the word piece algorithm tokenizes on characters.

If this value is less than 0, then do not fix sentence lengths. If the value is 0, then truncate to the model’s longest max length. Otherwise, if this value is None (the default), set the length to the model’s longest max length using the model’s model_max_length value.

Setting this to a value of 0, making documents multi-length, has the potential of creating token spans longer than the model can tolerate (usually 512 word piece tokens). In these cases, this value must be set to (or lower than) the model’s model_max_length.

Tokenization padding is on by default.

See:

HF Docs
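
A short sketch of tokenizing a parsed document, assuming resource from the previous sketch and doc an already parsed FeatureDocument:

from zensols.deepnlp.transformer.tokenizer import TransformerDocumentTokenizer

# 0 truncates to the model's maximum word piece length (see above)
tokenizer = TransformerDocumentTokenizer(
    resource=resource, word_piece_token_length=0)

tok_doc = tokenizer.tokenize(doc)
print(tok_doc.shape)
print(tokenizer.token_max_length, tokenizer.all_special_tokens)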

zensols.deepnlp.transformer.vectorizers module

Contains classes that are used to vectorize documents into transformer embeddings.

class zensols.deepnlp.transformer.vectorizers.LabelTransformerFeatureVectorizer(name, config_factory, feature_id, manager, encode_transformed, fold_method, embed_model, decode_embedding=False, is_labeler=True, encode_tokenized=False)[source]

Bases: TransformerFeatureVectorizer

A base class for vectorizing by mapping tokens to transformer consumable word piece tokens. This includes creating labels and masks.

Shape:

(|sentences|, |max word piece length|)

FEATURE_TYPE = 1
__init__(name, config_factory, feature_id, manager, encode_transformed, fold_method, embed_model, decode_embedding=False, is_labeler=True, encode_tokenized=False)
is_labeler: bool = True

If True, make this a labeling specific vectorizer. Otherwise, certain layers will use the output of the vectorizer as features rather than the labels.

class zensols.deepnlp.transformer.vectorizers.TransformerEmbeddingFeatureVectorizer(name, config_factory, feature_id, manager, encode_transformed, fold_method, embed_model, decode_embedding=False, is_labeler=False, encode_tokenized=False)[source]

Bases: TransformerFeatureVectorizer

A feature vectorizer used to create transformer (i.e. BERT) embeddings. The class uses the embed_model, which is of type TransformerEmbedding.

Note that the encoding input is ideally sentences shorter than 512 tokens. However, this vectorizer can accommodate both FeatureSentence and FeatureDocument instances.

DESCRIPTION = 'transformer document embedding'
FEATURE_TYPE = 4
__init__(name, config_factory, feature_id, manager, encode_transformed, fold_method, embed_model, decode_embedding=False, is_labeler=False, encode_tokenized=False)
class zensols.deepnlp.transformer.vectorizers.TransformerExpanderFeatureContext(feature_id, contexts, document)[source]

Bases: TransformerFeatureContext

A vectorizer feature context used with TransformerExpanderFeatureVectorizer.

__init__(feature_id, contexts, document)[source]
Parameters:
  • feature_id – the feature ID used to identify this context

  • contexts – subordinate contexts given to MultiFeatureContext

  • document – document used to create the transformer embeddings

contexts: Tuple[FeatureContext]

The subordinate contexts.

deallocate()[source]

Deallocate all resources for this instance.

class zensols.deepnlp.transformer.vectorizers.TransformerExpanderFeatureVectorizer(name, config_factory, feature_id, manager, encode_transformed, fold_method, embed_model, decode_embedding=False, is_labeler=False, encode_tokenized=False, delegate_feature_ids=None)[source]

Bases: TransformerFeatureVectorizer

A vectorizer that expands linguistic feature vectors to their respective locations as word piece token vectors.

This is used to concatenate linguistic features with BERT (and other transformer) embeddings. Each linguistic token is copied in the word piece token location across all vectorizers and sentences.

Shape:

(-1, token length, X), where X is the sum of all the delegate shapes across all three dimensions

DESCRIPTION = 'transformer expander'
FEATURE_TYPE = 1
__init__(name, config_factory, feature_id, manager, encode_transformed, fold_method, embed_model, decode_embedding=False, is_labeler=False, encode_tokenized=False, delegate_feature_ids=None)
delegate_feature_ids: Tuple[str] = None

A list of feature IDs of vectorizers whose output will be expanded.

property delegates: EncodableFeatureVectorizer

The delegates used for encoding and decoding the linguistic features.

class zensols.deepnlp.transformer.vectorizers.TransformerFeatureContext(feature_id, document)[source]

Bases: FeatureContext, Deallocatable

A vectorizer feature context used with TransformerEmbeddingFeatureVectorizer.

__init__(feature_id, document)[source]
Parameters:
  • feature_id – the feature ID used to identify this context

  • document – document used to create the transformer embeddings

deallocate()[source]

Deallocate all resources for this instance.

get_document(vectorizer)[source]
Return type:

TokenizedDocument

get_feature_document()[source]
Return type:

FeatureDocument

class zensols.deepnlp.transformer.vectorizers.TransformerFeatureVectorizer(name, config_factory, feature_id, manager, encode_transformed, fold_method, embed_model, decode_embedding=False, is_labeler=False, encode_tokenized=False)[source]

Bases: EmbeddingFeatureVectorizer, FeatureDocumentVectorizer

Base class for classes that vectorize transformer models. This class also tokenizes documents.

__init__(name, config_factory, feature_id, manager, encode_transformed, fold_method, embed_model, decode_embedding=False, is_labeler=False, encode_tokenized=False)
encode_tokenized: bool = False

Whether to tokenize the document on encoding. Set this to True only if the huggingface model ID (i.e. bert-base-cased) will not change after vectorization/batching.

Setting this to True tells the vectorizer to tokenize during encoding, and thus will speed experimentation by providing the tokenized tensors to the model directly.

property feature_type: TextFeatureType

The type of feature this vectorizer generates. This is used by classes such as EmbeddingNetworkModule to determine where to add the features, such as concatenating to the embedding layer, join layer, etc.

is_labeler: bool = False

If True, make this a labeling specific vectorizer. Otherwise, certain layers will use the output of the vectorizer as features rather than the labels.

tokenize(doc)[source]

Tokenize the document into a token document used by the encoding phase.

Parameters:

doc (FeatureDocument) – the document to be tokenized

Return type:

TokenizedFeatureDocument

property word_piece_token_length: int
class zensols.deepnlp.transformer.vectorizers.TransformerMaskFeatureVectorizer(name, config_factory, feature_id, manager, encode_transformed, fold_method, embed_model, decode_embedding=False, is_labeler=True, encode_tokenized=False, data_type='bool')[source]

Bases: LabelTransformerFeatureVectorizer

Creates a mask that sets word piece tokens to True, and special tokens and padding to False. This maps tokens to word piece tokens like TransformerNominalFeatureVectorizer.

Shape:

(|sentences|, |max word piece length|)

DESCRIPTION = 'transformer mask'
__init__(name, config_factory, feature_id, manager, encode_transformed, fold_method, embed_model, decode_embedding=False, is_labeler=True, encode_tokenized=False, data_type='bool')
data_type: Union[str, None, torch.dtype] = 'bool'

The mask tensor type. To use the int type that matches the resolution of the manager’s torch_config, use DEFAULT_INT.

class zensols.deepnlp.transformer.vectorizers.TransformerNominalFeatureVectorizer(name, config_factory, feature_id, manager, encode_transformed, fold_method, embed_model, decode_embedding=False, is_labeler=True, encode_tokenized=False, delegate_feature_id=None, size=-1, pad_label=-100, label_all_tokens=False, annotations_attribute='annotations')[source]

Bases: AggregateEncodableFeatureVectorizer, LabelTransformerFeatureVectorizer

This creates word piece labels that map to tokens. This class uses a NominalEncodedEncodableFeatureVectorizer to map from string labels to their nominal long values. This allows a single instance and centralized location where the label mapping happens in case other (non-transformer) components need to vectorize labels.

Shape:

(|sentences|, |max word piece length|)

DESCRIPTION = 'transformer seq labeler'
__init__(name, config_factory, feature_id, manager, encode_transformed, fold_method, embed_model, decode_embedding=False, is_labeler=True, encode_tokenized=False, delegate_feature_id=None, size=-1, pad_label=-100, label_all_tokens=False, annotations_attribute='annotations')
annotations_attribute: str = 'annotations'

The attribute used to get the features from the FeatureSentence. For example, TokenAnnotatedFeatureSentence has an annotations attribute.

delegate_feature_id: str = None

The feature ID for the aggregate encodeable feature vectorizer.

label_all_tokens: bool = False

If True, label all word piece tokens with the corresponding linguistic token label. Otherwise, the default padded value is used, and thus, ignored by the loss function when calculating loss.

write(depth=0, writer=<_io.TextIOWrapper name='<stdout>' mode='w' encoding='utf-8'>)[source]

Write this instance as either a Writable or as a Dictable. If class attribute _DICTABLE_WRITABLE_DESCENDANTS is set as True, then use the write() method on children instead of writing the generated dictionary. Otherwise, write this instance by first creating a dict recursively using asdict(), then formatting the output.

If the attribute _DICTABLE_WRITE_EXCLUDES is set, those attributes are removed from what is written in the write() method.

Note that this attribute will need to be set in all descendants in the instance hierarchy since writing the object instance graph is done recursively.

Parameters:
  • depth (int) – the starting indentation depth

  • writer (TextIOBase) – the writer to dump the content of this writable

zensols.deepnlp.transformer.wordpiece module

Word piece mappings to feature tokens, sentences and documents.

There are often edge cases and tricky situations with certain models’ usage of special tokens (i.e. [CLS]) and where they are used. With this in mind, this module attempts to:

  • Assist in debugging (works with detached TokenizedDocument) in cases where token level embeddings are directly accessed, and

  • Map both corresponding token and sentence level embeddings to their respective origin natural language feature set data structures.

class zensols.deepnlp.transformer.wordpiece.CachingWordPieceFeatureDocumentFactory(tokenizer, embed_model, token_embeddings=True, sent_embeddings=True, stash=None, hasher=<factory>)[source]

Bases: WordPieceFeatureDocumentFactory

Caches the documents and their embeddings in a Stash. For those that are cached, the embeddings are copied over to the passed document in create().

__init__(tokenizer, embed_model, token_embeddings=True, sent_embeddings=True, stash=None, hasher=<factory>)
clear()[source]

Clear the caching stash.

create(fdoc, tdoc=None)[source]

Create an object graph from a document that relates word pieces to feature tokens. Note that if tdoc is provided, it must have been tokenized from fdoc.

Return type:

WordPieceFeatureDocument

Returns:

a data structure with the word piece information

hasher: Hasher

Used to hash the natural language text into string keys.

stash: Stash = None

The stash that persists the feature document instances. If this is not provided, no caching will happen.

class zensols.deepnlp.transformer.wordpiece.WordPiece(word, vocab_index, index)[source]

Bases: PersistableContainer, Dictable

The word piece data.

UNKNOWN_TOKEN: ClassVar[str] = '[UNK]'

The string used for out of vocabulary word piece tokens.

__init__(word, vocab_index, index)
index: int

The index of the word piece subword in the tokenization tensor, which will have the same index in the output embeddings for TransformerEmbedding.output = last_hidden_state.

property is_unknown: bool

Whether this token is out of vocabulary.

vocab_index: int

The vocabulary index.

word: str

The string representation of the word piece.

class zensols.deepnlp.transformer.wordpiece.WordPieceDocumentDecorator(word_piece_doc_factory)[source]

Bases: FeatureDocumentDecorator

Populates sentence and token embeddings in the documents.

See:

WordPieceFeatureDocumentFactory

__init__(word_piece_doc_factory)
decorate(doc)[source]
word_piece_doc_factory: WordPieceFeatureDocumentFactory

The feature document factory that populates embeddings.

class zensols.deepnlp.transformer.wordpiece.WordPieceFeatureDocument(sents, text=None, spacy_doc=None, tokenized=None)[source]

Bases: FeatureDocument, WordPieceTokenContainer

A document made up of word piece sentences.

__init__(sents, text=None, spacy_doc=None, tokenized=None)
copy_embedding(target)[source]

Copy embeddings (and children) from this instance to target.

property embedding: Tensor

The document embedding (see WordPieceFeatureSpan.embedding).

Shape:

(|sentences|, <embedding dimension>)

tokenized: TokenizedFeatureDocument = None

The tokenized feature document.

write(depth=0, writer=<_io.TextIOWrapper name='<stdout>' mode='w' encoding='utf-8'>)[source]

Write the document and optionally sentence features.

Parameters:
  • n_sents – the number of sentences to write

  • n_tokens – the number of tokens to print across all sentences

  • include_original – whether to include the original text

  • include_normalized – whether to include the normalized text

class zensols.deepnlp.transformer.wordpiece.WordPieceFeatureDocumentFactory(tokenizer, embed_model, token_embeddings=True, sent_embeddings=True)[source]

Bases: object

Create instances of WordPieceFeatureDocument from FeatureDocument instances. It does this by iterating through a feature document data structure and adding WordPiece* object data and optionally adding the corresponding sentence and/or token level embeddings.

The embeddings can also be added with add_token_embeddings() and add_sent_embeddings() individually. If all you want are the sentence level embeddings, you can use add_sent_embeddings() on a FeatureSentence instance.

__init__(tokenizer, embed_model, token_embeddings=True, sent_embeddings=True)
add_sent_embeddings(doc, arr)[source]

Add sentence embeddings to the sentences of doc.

Parameters:

doc (Union[WordPieceFeatureDocument, FeatureDocument]) – sentences of this doc have embeddings set to the corresponding sentence tensor with shape (1, <embedding dimension>).

add_token_embeddings(doc, arr)[source]

Add token embeddings to the sentences of doc. This assumes tokens are of type WordPieceFeatureToken since the token indices are needed.

Parameters:

doc (WordPieceFeatureDocument) – sentences of this doc have embeddings set to the corresponding sentence tensor with shape (1, <embedding dimension>).

create(fdoc, tdoc=None)[source]

Create an object graph from a document that relates word pieces to feature tokens. Note that if tdoc is provided, it must have been tokenized from fdoc.

Return type:

WordPieceFeatureDocument

Returns:

a data structure with the word piece information

embed_model: TransformerEmbedding

Used to populate the embeddings in WordPiece* classes.

populate(doc, truncate=False)[source]

Populate sentence embeddings in a document by first creating a new word piece document with create() and then copying the embeddings with WordPieceFeatureDocument.copy_embedding().

Parameters:

truncate (bool) – if sentence lengths differ (i.e. from using different models to chunk sentences) trim the longer document to match the shorter

sent_embeddings: bool = True

Whether to add WordPieceFeatureSentence.embeddings.

token_embeddings: bool = True

Whether to add WordPieceFeatureToken.embeddings.

tokenizer: TransformerDocumentTokenizer

Used to tokenize documents that aren’t already tokenized in __call__().
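
A hedged sketch that puts the factory together; tokenizer, embedding (a TransformerEmbedding configured for last_hidden_state output) and doc are assumed from the earlier sketches:

from zensols.deepnlp.transformer.wordpiece import WordPieceFeatureDocumentFactory

factory = WordPieceFeatureDocumentFactory(
    tokenizer=tokenizer,
    embed_model=embedding,
    token_embeddings=True,
    sent_embeddings=True)

wp_doc = factory.create(doc)
print(wp_doc.embedding.shape)         # (|sentences|, <embedding dimension>)
for sent in wp_doc.sents:
    for tok in sent.tokens:
        # token_embedding sums the word piece embeddings of the token
        print(tok.norm, tok.token_embedding.shape)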

class zensols.deepnlp.transformer.wordpiece.WordPieceFeatureSentence(tokens, text=None, spacy_span=None, embedding=None)[source]

Bases: WordPieceFeatureSpan, FeatureSentence

__init__(tokens, text=None, spacy_span=None, embedding=None)
class zensols.deepnlp.transformer.wordpiece.WordPieceFeatureSpan(tokens, text=None, spacy_span=None, embedding=None)[source]

Bases: FeatureSentence, WordPieceTokenContainer

A sentence made up of word pieces.

__init__(tokens, text=None, spacy_span=None, embedding=None)
copy_embedding(target)[source]

Copy embeddings (and children) from this instance to target.

embedding: Tensor = None

The sentence level (i.e. [CLS]) embedding from the transformer.

Shape:

(<embedding dimension>,)

write(depth=0, writer=<_io.TextIOWrapper name='<stdout>' mode='w' encoding='utf-8'>)[source]

Write the text container.

Parameters:
  • include_original – whether to include the original text

  • include_normalized – whether to include the normalized text

  • n_tokens – the number of tokens to write

  • inline – whether to print the tokens on one line each

class zensols.deepnlp.transformer.wordpiece.WordPieceFeatureToken(i, idx, i_sent, norm, words, embedding=None)[source]

Bases: FeatureToken

The token and the word pieces that represent it.

__init__(i, idx, i_sent, norm, words, embedding=None)
clone(cls=None, **kwargs)[source]

Clone an instance of this token.

Parameters:
  • cls (Type) – the type of the new instance

  • kwargs – arguments to add to as attributes to the clone

Return type:

FeatureToken

Returns:

the cloned instance of this instance

copy_embedding(target)[source]

Copy embedding (and children) from this instance to target.

detach(*args, **kwargs)[source]

Create a detached token (i.e. from spaCy artifacts).

Parameters:
  • feature_ids – the features to write, which defaults to FEATURE_IDS

  • skip_missing – whether to only keep feature_ids

  • cls – the type of the new instance

Return type:

FeatureToken

embedding: Tensor = None

The embedding for words after using the transformer.

Shape:

(|words|, <embedding dimension>)

property indexes: Tuple[int]

The indexes of the word piece subwords (see WordPiece.index).

property is_unknown: bool

Whether this token is out of vocabulary.

property token_embedding: Tensor

The embedding of this token, which is the sum of the word piece embeddings.

word_iter()[source]

Return an iterable over the word pieces.

Return type:

Iterable[WordPiece]

words: Tuple[WordPiece]

The word pieces that make up this token.

write(depth=0, writer=<_io.TextIOWrapper name='<stdout>' mode='w' encoding='utf-8'>)[source]

Write this instance as either a Writable or as a Dictable. If class attribute _DICTABLE_WRITABLE_DESCENDANTS is set as True, then use the write() method on children instead of writing the generated dictionary. Otherwise, write this instance by first creating a dict recursively using asdict(), then formatting the output.

If the attribute _DICTABLE_WRITE_EXCLUDES is set, those attributes are removed from what is written in the write() method.

Note that this attribute will need to be set in all descendants in the instance hierarchy since writing the object instance graph is done recursively.

Parameters:
  • depth (int) – the starting indentation depth

  • writer (TextIOBase) – the writer to dump the content of this writable

class zensols.deepnlp.transformer.wordpiece.WordPieceFeatureVectorizer(name, config_factory, feature_id, manager, encode_transformed=False, fold_method='raise', embed_model=None, decode_embedding=True, word_piece_doc_factory=None, access='raise')[source]

Bases: EmbeddingFeatureVectorizer

Uses the embeddings attributes added to documents, sentences and tokens populated by WordPieceFeatureDocumentFactory. Currently only sentence sequences are supported. For single sentence or token classification, use zensols.deepnlp.vectorizers.

If aggregated documents are given to the vectorizer, they are flattened into sentences and vectorized in the same way a single document’s sentences would be vectorized. A batch is created for each document and only one batch is created for singleton documents.

This embedding layer expects the following attributes to be left with their default settings: encode_transformed, fold_method, decode_embedding.

Shape:

(|documents|, |sentences|, |embedding dimension|)

DESCRIPTION: ClassVar[str] = 'wordpiece'
FEATURE_TYPE: ClassVar[TextFeatureType] = 4
__init__(name, config_factory, feature_id, manager, encode_transformed=False, fold_method='raise', embed_model=None, decode_embedding=True, word_piece_doc_factory=None, access='raise')
access: str = 'raise'

What to do when accessing the sentence embedding when encoding. This is one of:

  • raise: raises an error when missing

  • add_missing: create the embedding only if missing

  • clobber: always create a new embedding by replacing (if existed)

decode_embedding: bool = True

Turn off the embed_model forward pass to use the embeddings we vectorized from the embedding attribute(s). Keep the default.

embed_model: TransformerEmbedding = None

This field is not applicable to this vectorizer–keep the default.

encode(doc)[source]

Encode by combining documents in to one monolithic document when a tuple is passed, otherwise default to the super class’s encode functionality.

Return type:

FeatureContext

encode_transformed: bool = False

This field is not applicable to this vectorizer–keep the default.

fold_method: str = 'raise'

This field is not applicable to this vectorizer–keep the default.

word_piece_doc_factory: WordPieceFeatureDocumentFactory = None

The feature document factory that populates embeddings.

class zensols.deepnlp.transformer.wordpiece.WordPieceTokenContainer[source]

Bases: TokenContainer

Like TokenContainer but contains word pieces.

property unknown_count: int

Return the number of out of vocabulary tokens in the container.

word_iter()[source]

Return an iterable over the word pieces.

Return type:

Iterable[WordPiece]

Module contents

Contains classes that adapt the huggingface transformers to the Zensols deeplearning framework.

zensols.deepnlp.transformer.normalize_huggingface_logging()[source]

Make the transformers package use default logging. Using this and setting the transformers logging package to ERROR level logging has the same effect as suppress_warnings().

zensols.deepnlp.transformer.suppress_warnings()[source]

Suppress the `Some weights of the model checkpoint...` warnings from huggingface transformers.

See:

normalize_huggingface_logging()

zensols.deepnlp.transformer.turn_off_huggingface_downloads()[source]

Turn off automatic model checks and downloads.
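
For example, these module level helpers can be called once at startup (a small sketch):

from zensols.deepnlp import transformer

# silence the 'Some weights of the model checkpoint...' warnings
transformer.suppress_warnings()

# or route huggingface through default logging, then raise its level
transformer.normalize_huggingface_logging()

# skip automatic model checks and downloads (e.g. for offline use)
transformer.turn_off_huggingface_downloads()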