zensols.deepnlp.transformer package¶
Submodules¶
zensols.deepnlp.transformer.domain module¶
Container classes for BERT models.
- class zensols.deepnlp.transformer.domain.TokenizedDocument(tensor, boundary_tokens)[source]¶
Bases: PersistableContainer, Writable
This is the tokenized document output of TransformerDocumentTokenizer. Instances of this class are picklable in a feature context, then given to a transformer model such as TransformerEmbedding in the decoding phase to create a tensor.
- __init__(tensor, boundary_tokens)¶
- property attention_mask: Tensor¶
The attention mask (0/1s).
- classmethod from_tensor(tensor)[source]¶
Create an instance of the class using a tensor. This is useful for re-creating documents for mapping with map_word_pieces() after unpickling from a document created with TransformerDocumentTokenizer.tokenize.
- Parameters:
tensor (Tensor) – the tensor to set in tensor
- Return type:
- get_wordpiece_count(**kwargs)[source]¶
The size of the document (sum over sentences) in number of word pieces. To keep special tokens (such as BERT's [CLS] and [SEP] tokens) when passing in a tokenizer in kwargs, add special_tokens={}.
- Parameters:
kwargs – any keyword arguments passed on to map_to_word_pieces(); do not add index_tokens or includes
- Return type:
- property input_ids: Tensor¶
The token IDs as the output from the tokenizer.
- map_to_word_pieces(sentences=None, map_wp=None, add_indices=False, special_tokens=None, index_tokens=True, includes=frozenset({'map'}))[source]¶
Map word piece tokens to linguistic tokens.
- Parameters:
sentences (Iterable[List[Any]]) – an iteration of sentences, which is returned in the output (i.e. FeatureSentence), or input_ids if None
map_wp (Union[Callable, Dict[int, str]]) – either a function that takes the token index, sentence ID and input IDs, or the mapping from word piece ID to string token; the return output is the string token (or numerical output if no mapping is provided); if an instance of TransformerDocumentTokenizer, its vocabulary and special tokens are utilized for mapping and special token consideration
add_indices (bool) – whether to add the token ID and index after the token string when id2tok is provided for map_wp
special_tokens (Set[str]) – a set of tokens (such as BERT's [CLS] and [SEP] tokens) to remove; to keep special tokens when passing in a tokenizer in kwargs, add special_tokens={}
index_tokens (bool) – whether to index tokens positionally, which is used for mapping with feature or tokenized sentences; set this to False when sentences are anything but a feature document / sentences
includes (Set[str]) – what data to return, which is a set of the keys listed in the return documentation below
- Return type:
- Returns:
a list of sentence maps, each with:
- sent_ix -> the i-th sentence (always provided)
- map -> list of (sentence 'token', word pieces)
- sent -> a FeatureSentence or a tensor of vocab indexes if map_wp is None
- word_pieces -> the word pieces of the sentences
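The following is a minimal sketch of mapping word pieces back to linguistic tokens; it assumes tokenizer is a configured TransformerDocumentTokenizer and doc a parsed FeatureDocument (neither shown here):

    # map word pieces back to linguistic tokens (hedged example)
    tok_doc = tokenizer.tokenize(doc)
    for sent_map in tok_doc.map_to_word_pieces(sentences=doc, map_wp=tokenizer):
        print('sentence:', sent_map['sent_ix'])
        for token, word_pieces in sent_map['map']:
            print(token, '->', word_pieces)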
- static map_word_pieces(token_offsets)[source]¶
Map word piece tokens to linguistic tokens.
- Return type:
List[Tuple[FeatureToken, List[int]]]
- Returns:
a list of tuples in the form:
(<token index>, <list of word piece indexes>)
- property offsets: Tensor¶
The offsets from word piece (transformer’s tokenizer) to feature document index mapping.
- property shape: Size¶
Return the shape of the vectorized document.
- tensor: Tensor¶
Encodes the input IDs, attention mask, and word piece offset map.
- property token_type_ids: Tensor¶
The token type IDs (0/1s).
- truncate(size)[source]¶
Truncate the last (token) dimension to size.
- Return type:
- Returns:
a new instance of this class truncated to size
- write(depth=0, writer=<_io.TextIOWrapper name='<stdout>' mode='w' encoding='utf-8'>, include_tokens=True, id2tok=None)[source]¶
Write the contents of this instance to writer using indentation depth.
- Parameters:
depth (int) – the starting indentation depth
writer (TextIOBase) – the writer to dump the content of this writable
- class zensols.deepnlp.transformer.domain.TokenizedFeatureDocument(tensor, boundary_tokens, feature, id2tok, char_offsets)[source]¶
Bases: TokenizedDocument
Instances of this class are created, then a picklable version is returned with detach() as an instance of the superclass.
- __init__(tensor, boundary_tokens, feature, id2tok, char_offsets)¶
- feature: FeatureDocument¶
The document to tokenize.
- id2tok: Dict[int, str]¶
If provided, a mapping of indexes to transformer tokens. This attribute is always nulled out after being persisted.
- map_to_word_pieces(sentences=None, map_wp=None, **kwargs)[source]¶
Map word piece tokens to linguistic tokens.
- Parameters:
sentences (Iterable[List[Any]]) – an iteration of sentences, which is returned in the output (i.e. FeatureSentence), or input_ids if None
map_wp (Union[Callable, Dict[int, str]]) – either a function that takes the token index, sentence ID and input IDs, or the mapping from word piece ID to string token; the return output is the string token (or numerical output if no mapping is provided); if an instance of TransformerDocumentTokenizer, its vocabulary and special tokens are utilized for mapping and special token consideration
add_indices – whether to add the token ID and index after the token string when id2tok is provided for map_wp
special_tokens – a set of tokens (such as BERT's [CLS] and [SEP] tokens) to remove; to keep special tokens when passing in a tokenizer in kwargs, add special_tokens={}
index_tokens – whether to index tokens positionally, which is used for mapping with feature or tokenized sentences; set this to False when sentences are anything but a feature document / sentences
includes – what data to return, which is a set of the keys listed in the return documentation below
- Return type:
- Returns:
a list of sentence maps, each with:
- sent_ix -> the i-th sentence (always provided)
- map -> list of (sentence 'token', word pieces)
- sent -> a FeatureSentence or a tensor of vocab indexes if map_wp is None
- word_pieces -> the word pieces of the sentences
- write(depth=0, writer=<_io.TextIOWrapper name='<stdout>' mode='w' encoding='utf-8'>, include_tokens=True, id2tok=None)[source]¶
Write the contents of this instance to writer using indentation depth.
- Parameters:
depth (int) – the starting indentation depth
writer (TextIOBase) – the writer to dump the content of this writable
zensols.deepnlp.transformer.embed module¶
The transformer embedding object.
- class zensols.deepnlp.transformer.embed.TransformerEmbedding(name, tokenizer, output='pooler_output', output_attentions=False)[source]¶
Bases: PersistableContainer, Dictable
A model for transformer embeddings (such as BERT) that wraps the HuggingFace transformers API.
- __init__(name, tokenizer, output='pooler_output', output_attentions=False)¶
- property cache¶
When set to True, cache a global space model using the parameters from the first instance creation.
- property model: PreTrainedModel¶
- output: str = 'pooler_output'¶
The output from the HuggingFace transformer API to return.
This is set to one of:
- LAST_HIDDEN_STATE_OUTPUT: the output embeddings of the last layer with shape: (batch, N sentences, hidden layer dimension)
- POOLER_OUTPUT: the last layer hidden-state of the first token of the sequence (classification token), further processed by a Linear layer and a Tanh activation function, with shape: (batch, hidden layer dimension)
- ALL_OUTPUT: includes both as a dictionary with corresponding keys
- property resource: TransformerResource¶
The transformer resource containing the model.
- tokenize(doc)[source]¶
Tokenize the feature document, which is used as the input to transform().
- Parameters:
doc – the document to tokenize
- Return type:
- Returns:
the tokenization of doc
- tokenizer: TransformerDocumentTokenizer¶
The tokenizer used for creating the input for the model.
- transform(doc, output=None)[source]¶
Transform the document into the transformer output.
- Parameters:
doc – the document to transform
output (str) – the output from the HuggingFace transformer API to return (see class docs)
- Return type:
- Returns:
a container object instance with the output, which contains (among other data) last_hidden_state with the output embeddings of the last layer with shape: (batch, N sentences, hidden layer dimension)
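As an illustration, the sketch below tokenizes and then transforms a document; embed_model is assumed to be a configured TransformerEmbedding and doc a parsed FeatureDocument (configuration not shown):

    # produce embeddings from a parsed document (hedged example)
    tok_doc = embed_model.tokenize(doc)   # a TokenizedFeatureDocument
    out = embed_model.transform(doc, output='last_hidden_state')
    # the output type and shape depend on the output argument (see class docs)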
zensols.deepnlp.transformer.layer module¶
Contains transformer embedding layers.
- class zensols.deepnlp.transformer.layer.TransformerEmbeddingLayer(*args, embed_model, **kwargs)[source]¶
Bases: EmbeddingLayer
A transformer (i.e. BERT) embedding layer. This class generates embeddings on a per-sentence basis. See the initializer documentation for configuration requirements.
- MODULE_NAME: ClassVar[str] = 'transformer embedding'¶
The module name used in the logging message. This is set in each inherited class.
- __init__(*args, embed_model, **kwargs)[source]¶
Initialize with an embedding model. This embedding model must be configured with TransformerEmbedding.output set to last_hidden_state.
- Parameters:
embed_model (TransformerEmbedding) – used to generate the transformer (i.e. BERT) embeddings
- forward(x)[source]¶
Define the computation performed at every call.
Should be overridden by all subclasses.
- Return type:
Tensor
Note: Although the recipe for the forward pass needs to be defined within this function, one should call the Module instance afterwards instead of this since the former takes care of running the registered hooks while the latter silently ignores them.
- class zensols.deepnlp.transformer.layer.TransformerSequence(net_settings, sub_logger=None)[source]¶
Bases: EmbeddingNetworkModule, SequenceNetworkModule
A sequence based model for token classification using HuggingFace transformers layers (not their token classification API).
- MODULE_NAME: ClassVar[str] = 'transformer sequence'¶
The module name used in the logging message. This is set in each inherited class.
- __init__(net_settings, sub_logger=None)[source]¶
Initialize the embedding layer.
- Parameters:
net_settings (TransformerSequenceNetworkSettings) – the embedding layer configuration
logger – the logger to use for the forward process in this layer
filter_attrib_fn – if provided, called with a BatchFieldMetadata for each field, returning True if the batch field should be retained and used in the embedding layer (see class docs); if None all fields are considered
- class zensols.deepnlp.transformer.layer.TransformerSequenceNetworkSettings(name, config_factory, torch_config, dropout, batch_stash, embedding_layer, decoder_settings)[source]¶
Bases: EmbeddingNetworkSettings, DropoutNetworkSettings
Settings configuration for TransformerSequence.
- __init__(name, config_factory, torch_config, dropout, batch_stash, embedding_layer, decoder_settings)¶
- decoder_settings: DeepLinearNetworkSettings¶
The decoder feed forward network.
- get_module_class_name()[source]¶
Returns the fully qualified class name of the module to create by ModelManager. This module takes as the first parameter an instance of this class.
Important: This method is not used for nested modules. You must declare specific class names in the configuration for those nested class names.
- Return type:
zensols.deepnlp.transformer.mask module¶
Classes to predict fill-mask tasks.
- class zensols.deepnlp.transformer.mask.MaskFiller(resource, k=1, feature_id='norm', feature_value='MASK')[source]¶
Bases: object
This class fills masked tokens with the predictions of the underlying masked model. Masked tokens with attribute feature_id having value feature_value (norm and MASK by default, respectively) are substituted with model values.
To use this class, parse a sentence with a FeatureDocumentParser with masked tokens using the string feature_value.
For example (with class defaults), the sentence:
Paris is the MASK of France.
becomes:
Paris is the <mask> of France.
The <mask> string becomes the mask_token for the model's tokenizer.
- __init__(resource, k=1, feature_id='norm', feature_value='MASK')¶
- feature_value: str = 'MASK'¶
The value of feature ID feature_id to match on masked tokens.
- k: int = 1¶
The number of top-K predicted masked words per mask. The total number of predictions will be <number of masks> X k in the source document.
- predict(source)[source]¶
Predict substitution values for token masks.
Important: source is modified as a side-effect of this method. Use clone() on the source document passed to this method to preserve the original if necessary.
- Parameters:
source (TokenContainer) – the source document, sentence, or span for which to substitute values
- Return type:
- resource: TransformerResource¶
A container class with the HuggingFace tokenizer and model.
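A hedged usage sketch follows; resource is assumed to be a TransformerResource configured with a fill-mask capable model and doc_parser a FeatureDocumentParser (both configured elsewhere):

    # predict substitutions for masked tokens (hedged example)
    from zensols.deepnlp.transformer.mask import MaskFiller

    filler = MaskFiller(resource, k=3)
    doc = doc_parser('Paris is the MASK of France.')
    pred = filler.predict(doc)     # a Prediction instance
    for sent in pred:              # iterate the top-k substituted sentences
        print(sent)
    print(pred.df)                 # per-mask predicted tokens and scores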
- class zensols.deepnlp.transformer.mask.Prediction(cont, masked_tokens, df)[source]¶
Bases: Dictable
A container class for masked token predictions produced by MaskFiller. This class offers many ways to get the predictions, including getting the sentences as instances of TokenContainer by using it as an iterable.
The sentences are also available as the pred_sentences key when using asdict().
- __init__(cont, masked_tokens, df)¶
- cont: TokenContainer¶
The document, sentence or span to predict masked tokens.
- df: DataFrame¶
The predictions with dataframe columns:
- k: the k in the top-k highest scored masked token match
- mask_id: the N-th masked token in the source ordered by position
- token: the predicted token
- score: the score of the prediction ([0, 1], higher is better)
- get_container(k=0)[source]¶
Get the k-th top scored sentence.
A client may call this method as many times as necessary (i.e. for multiple values of k) since the cont tokens are modified while retaining the original masked tokens in masked_tokens.
- Parameters:
k (int) – as k increases, the mask substitutions (and thus the sentence) become less likely; k = 0 is the most likely given the sentence and masks
- Return type:
- get_tokens()[source]¶
Return an iterable of the prediction coupled with the token it belongs to and its score.
- Return type:
- property masked_token_dicts: Tuple[Dict[str, Any]]¶
A tuple of builtins.dict, each having token index, norm and text data.
- masked_tokens: Tuple[FeatureToken]¶
The masked tokens matched.
- write(depth=0, writer=<_io.TextIOWrapper name='<stdout>' mode='w' encoding='utf-8'>, include_masked_tokens=True, include_predicted_tokens=True, include_predicted_sentences=True)[source]¶
Write this instance as either a Writable or as a Dictable. If the class attribute _DICTABLE_WRITABLE_DESCENDANTS is set to True, then use the write() method on children instead of writing the generated dictionary. Otherwise, write this instance by first creating a dict recursively using asdict(), then formatting the output.
If the attribute _DICTABLE_WRITE_EXCLUDES is set, those attributes are removed from what is written in the write() method.
Note that this attribute will need to be set in all descendants in the instance hierarchy since writing the object instance graph is done recursively.
- Parameters:
depth (int) – the starting indentation depth
writer (TextIOBase) – the writer to dump the content of this writable
zensols.deepnlp.transformer.optimizer module¶
Adapt the HuggingFace transformer weight decay optimizer.
- class zensols.deepnlp.transformer.optimizer.TransformerAdamFactory[source]¶
Bases: ModelResourceFactory
- class zensols.deepnlp.transformer.optimizer.TransformerSchedulerFactory[source]¶
Bases: ModelResourceFactory
Unified API to get any scheduler from its name. This simply calls transformers.get_scheduler() and calculates num_training_steps as epochs * batch_size.
Documentation taken directly from the get_scheduler function in the HuggingFace transformers source tree.
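For reference, the factory's behavior corresponds roughly to the HuggingFace call below; optimizer, epochs and batch_size are hypothetical names assumed to exist:

    # illustration only: what the factory ultimately delegates to
    from transformers import get_scheduler

    scheduler = get_scheduler(
        name='linear',                            # scheduler type by name
        optimizer=optimizer,                      # an existing torch optimizer
        num_warmup_steps=0,
        num_training_steps=epochs * batch_size)   # as computed by the factory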
zensols.deepnlp.transformer.pred module¶
Predictions output for transformer models.
- class zensols.deepnlp.transformer.pred.TransformerSequencePredictionsDataFrameFactory(source, result, stash, column_names=None, data_point_transform=None, batch_limit=9223372036854775807, epoch_result=None, label_vectorizer_name=None, metric_metadata=None, name=None, embedded_document_attribute=None)[source]¶
Bases:
SequencePredictionsDataFrameFactoryLike the super class but create predictions for transformer sequence models. By default, transformer input is truncated at the model’s max token length (usually 512 word piece tokens). It then truncate the tokens that are added as the
textcolumn from (configured by default)classify.TokenClassifyModelFacade.For all predictions where the sequence passed the model’s maximum, this class maps that last word piece token output to the respective token in the
predictions_dataframe_factory_classinstance’stransformoutput.- __init__(source, result, stash, column_names=None, data_point_transform=None, batch_limit=9223372036854775807, epoch_result=None, label_vectorizer_name=None, metric_metadata=None, name=None, embedded_document_attribute=None)¶
zensols.deepnlp.transformer.resource module¶
Provide BERT embeddings on a per sentence level.
- exception zensols.deepnlp.transformer.resource.TransformerError[source]¶
Bases: DeepLearnError
Raised for any transformer specific errors in this and child modules of the parent.
- class zensols.deepnlp.transformer.resource.TransformerResource(name, torch_config, model_id, cased=None, trainable=False, args=<factory>, tokenizer_args=<factory>, model_args=<factory>, model_class='transformers.AutoModel', tokenizer_class='transformers.AutoTokenizer', cache=False, cache_dir=None)[source]¶
Bases: PersistableContainer, Dictable
A container base class that allows configuration and creates various HuggingFace models.
- __init__(name, torch_config, model_id, cased=None, trainable=False, args=<factory>, tokenizer_args=<factory>, model_args=<factory>, model_class='transformers.AutoModel', tokenizer_class='transformers.AutoTokenizer', cache=False, cache_dir=None)¶
- args: Dict[str, Any]¶
Additional arguments to pass to the from_pretrained method for the tokenizer and the model.
- cache: InitVar = False¶
When set to True, cache a global space model using the parameters from the first instance creation.
- cased: bool = None¶
True for case sensitive models, False (default) otherwise. The negated value of it is also used as the do_lower_case parameter in the *.from_pretrained calls to HuggingFace transformers.
- property model: PreTrainedModel¶
- model_args: Dict[str, Any]¶
Additional arguments to pass to the from_pretrained method for the model.
- model_class: str = 'transformers.AutoModel'¶
The fully qualified class name used to create models with the from_pretrained static method.
- model_id: str¶
The ID of the model (i.e. bert-base-uncased). If this is not set, it is derived from the model_name and case.
Token embedding using TransformerEmbedding has been tested with:
- bert-base-cased
- bert-large-cased
- roberta-base
- distilbert-base-cased
- See:
- property tokenizer: PreTrainedTokenizer¶
- tokenizer_args: Dict[str, Any]¶
Additional arguments to pass to the from_pretrained method for the tokenizer.
- tokenizer_class: str = 'transformers.AutoTokenizer'¶
The fully qualified class name used to create tokenizers with the from_pretrained static method.
- torch_config: TorchConfig¶
The config device used to copy the embedding data.
zensols.deepnlp.transformer.tokenizer module¶
The tokenizer object.
- class zensols.deepnlp.transformer.tokenizer.TransformerDocumentTokenizer(resource, word_piece_token_length=None, params=None, feature_id='text')[source]¶
Bases: PersistableContainer
Creates instances of TokenizedFeatureDocument using a HuggingFace PreTrainedTokenizer.
- DEFAULT_PARAMS: ClassVar[Dict[str, Any]] = {'is_split_into_words': True, 'padding': 'longest', 'return_offsets_mapping': True, 'return_special_tokens_mask': True}¶
Default parameters for the HuggingFace tokenizer. These get overridden by the tokenizer_kwargs in tokenize() and the processing of the value word_piece_token_length.
- __init__(resource, word_piece_token_length=None, params=None, feature_id='text')¶
- property all_special_tokens: Set[str]¶
Special tokens used by the model (such as BERT's [CLS] and [SEP] tokens).
- feature_id: str = 'text'¶
The feature ID to use for token string values from FeatureToken.
- property id2tok: Dict[int, str]¶
A mapping from the HuggingFace tokenizer's vocabulary to its word piece equivalent.
- property pretrained_tokenizer: PreTrainedTokenizer¶
The HuggingFace tokenizer used to create tokenized documents.
- resource: TransformerResource¶
Contains the model used to create the tokenizer.
- tokenize(doc, tokenizer_kwargs=None)[source]¶
Tokenize a feature document in a form that's easy to inspect and provide to TransformerEmbedding to transform.
- Parameters:
doc (FeatureDocument) – the document to tokenize
- Return type:
- word_piece_token_length: int = None¶
The max number of word piece tokens. The word piece length is always the same or greater in count than linguistic tokens because the word piece algorithm tokenizes on characters.
If this value is less than 0, then do not fix sentence lengths. If the value is 0, then truncate to the model's longest max length. Otherwise, if this value is None (default), set the length to the model's longest max length using the model's model_max_length value.
Setting this to a value less than 0, making documents multi-length, has the potential of creating token spans longer than the model can tolerate (usually 512 word piece tokens). In these cases, this value must be set to (or lower than) the model's model_max_length.
Tokenization padding is on by default.
- See: DEFAULT_PARAMS
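A minimal sketch of tokenizing a document with this class; resource is assumed to be a configured TransformerResource and doc a parsed FeatureDocument:

    # tokenize a feature document for inspection (hedged example)
    from zensols.deepnlp.transformer.tokenizer import TransformerDocumentTokenizer

    tokenizer = TransformerDocumentTokenizer(
        resource=resource,
        word_piece_token_length=None)   # default: use the model's model_max_length
    tok_doc = tokenizer.tokenize(doc)
    tok_doc.write()                     # inspect the word piece mapping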
zensols.deepnlp.transformer.vectorizers module¶
Contains classes that are used to vectorize documents into transformer embeddings.
- class zensols.deepnlp.transformer.vectorizers.DocumentEmbeddingFeatureVectorizer(name, config_factory, feature_id, manager, encode_transformed, fold_method, embed_model, decode_embedding=False, is_labeler=False, encode_tokenized=False, token_pattern='{norm}', token_feature_ids=frozenset({'norm'}))[source]¶
Bases: TransformerEmbeddingFeatureVectorizer
Vectorizes a feature from each token as a single sentence document. It does this by tracking the sentence and token positions that have tokens with the necessary features to create what becomes the sentences to parse and vectorize. During decoding, each pooled sentence's embedding is added to the respective position in the returned data.
- DESCRIPTION: ClassVar[str] = 'transformer document embedding'¶
- FEATURE_TYPE: ClassVar[TextFeatureType] = 1¶
- __init__(name, config_factory, feature_id, manager, encode_transformed, fold_method, embed_model, decode_embedding=False, is_labeler=False, encode_tokenized=False, token_pattern='{norm}', token_feature_ids=frozenset({'norm'}))¶
- token_feature_ids: Set[str] = frozenset({'norm'})¶
The feature IDs used in token_pattern.
- token_pattern: str = '{norm}'¶
The builtins.str.format() string used to format the sentence to be parsed and vectorized.
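To illustrate with hypothetical feature values, token_pattern is applied per token with str.format() keyed on the IDs in token_feature_ids:

    # how a token pattern formats a single token's features (hypothetical values)
    token_pattern = '{norm} ({ent_})'   # assumes 'ent_' was added to token_feature_ids
    print(token_pattern.format(norm='Paris', ent_='GPE'))   # -> 'Paris (GPE)'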
- class zensols.deepnlp.transformer.vectorizers.DocumentMappedTransformerFeatureContext(feature_id, document, sent_len, pos)[source]¶
Bases: TransformerFeatureContext
- class zensols.deepnlp.transformer.vectorizers.LabelTransformerFeatureVectorizer(name, config_factory, feature_id, manager, encode_transformed, fold_method, embed_model, decode_embedding=False, is_labeler=True, encode_tokenized=False)[source]¶
Bases: TransformerFeatureVectorizer
A base class for vectorizing by mapping tokens to transformer consumable word piece tokens. This includes creating labels and masks.
- Shape:
- FEATURE_TYPE: ClassVar[TextFeatureType] = 1¶
- __init__(name, config_factory, feature_id, manager, encode_transformed, fold_method, embed_model, decode_embedding=False, is_labeler=True, encode_tokenized=False)¶
- is_labeler: bool = True¶
If True, make this a labeling specific vectorizer. Otherwise, certain layers will use the output of the vectorizer as features rather than the labels.
- class zensols.deepnlp.transformer.vectorizers.TransformerEmbeddingFeatureVectorizer(name, config_factory, feature_id, manager, encode_transformed, fold_method, embed_model, decode_embedding=False, is_labeler=False, encode_tokenized=False)[source]¶
Bases: TransformerFeatureVectorizer
A feature vectorizer used to create transformer (i.e. BERT) embeddings. The class uses the embed_model, which is of type TransformerEmbedding.
Note the encoding input ideally consists of sentences shorter than 512 tokens. However, this vectorizer can accommodate both FeatureSentence and FeatureDocument instances.
- DESCRIPTION: ClassVar[str] = 'transformer document embedding'¶
- FEATURE_TYPE: ClassVar[TextFeatureType] = 4¶
- __init__(name, config_factory, feature_id, manager, encode_transformed, fold_method, embed_model, decode_embedding=False, is_labeler=False, encode_tokenized=False)¶
- class zensols.deepnlp.transformer.vectorizers.TransformerExpanderFeatureContext(feature_id, contexts, document)[source]¶
Bases: TransformerFeatureContext
A vectorizer feature context used with TransformerExpanderFeatureVectorizer.
- __init__(feature_id, contexts, document)[source]¶
- Parameters:
feature_id – the feature ID used to identify this context
contexts – subordinate contexts given to MultiFeatureContext
document – the document used to create the transformer embeddings
- contexts: Tuple[FeatureContext]¶
The subordinate contexts.
- class zensols.deepnlp.transformer.vectorizers.TransformerExpanderFeatureVectorizer(name, config_factory, feature_id, manager, encode_transformed, fold_method, embed_model, decode_embedding=False, is_labeler=False, encode_tokenized=False, delegate_feature_ids=None)[source]¶
Bases: TransformerFeatureVectorizer
A vectorizer that expands linguistic feature vectors to their respective locations as word piece token vectors.
This is used to concatenate linguistic features with BERT (and other transformer) embeddings. Each linguistic token is copied in the word piece token location across all vectorizers and sentences.
- Shape:
(-1, token length, X), where X is the sum of all the delegate shapes across all three dimensions
- DESCRIPTION: ClassVar[str] = 'transformer expander'¶
- FEATURE_TYPE: ClassVar[TextFeatureType] = 1¶
- __init__(name, config_factory, feature_id, manager, encode_transformed, fold_method, embed_model, decode_embedding=False, is_labeler=False, encode_tokenized=False, delegate_feature_ids=None)¶
- delegate_feature_ids: Tuple[str] = None¶
A list of feature IDs of vectorizers whose output will be expanded.
- property delegates: EncodableFeatureVectorizer¶
The delegates used for encoding and decoding the linguistic features.
- class zensols.deepnlp.transformer.vectorizers.TransformerFeatureContext(feature_id, document)[source]¶
Bases: FeatureContext, Deallocatable
A vectorizer feature context used with TransformerEmbeddingFeatureVectorizer.
- __init__(feature_id, document)[source]¶
- Parameters:
feature_id – the feature ID used to identify this context
document – the document used to create the transformer embeddings
- class zensols.deepnlp.transformer.vectorizers.TransformerFeatureVectorizer(name, config_factory, feature_id, manager, encode_transformed, fold_method, embed_model, decode_embedding=False, is_labeler=False, encode_tokenized=False)[source]¶
Bases: EmbeddingFeatureVectorizer, FeatureDocumentVectorizer
Base class for classes that vectorize transformer models. This class also tokenizes documents.
- __init__(name, config_factory, feature_id, manager, encode_transformed, fold_method, embed_model, decode_embedding=False, is_labeler=False, encode_tokenized=False)¶
- encode_tokenized: bool = False¶
Whether to tokenize the document on encoding. Set this to True only if the HuggingFace model ID (i.e. bert-base-cased) will not change after vectorization/batching.
Setting this to True tells the vectorizer to tokenize during encoding, and thus will speed experimentation by providing the tokenized tensors to the model directly.
- property feature_type: TextFeatureType¶
The type of feature this vectorizer generates. This is used by classes such as EmbeddingNetworkModule to determine where to add the features, such as concatenating to the embedding layer, join layer, etc.
- is_labeler: bool = False¶
If True, make this a labeling specific vectorizer. Otherwise, certain layers will use the output of the vectorizer as features rather than the labels.
- tokenize(doc)[source]¶
Tokenize the document into a token document used by the encoding phase.
- Parameters:
doc (FeatureDocument) – the document to be tokenized
- Return type:
- class zensols.deepnlp.transformer.vectorizers.TransformerMaskFeatureVectorizer(name, config_factory, feature_id, manager, encode_transformed, fold_method, embed_model, decode_embedding=False, is_labeler=True, encode_tokenized=False, data_type='bool')[source]¶
Bases: LabelTransformerFeatureVectorizer
Creates a mask of word piece tokens mapped to True, and special tokens and padding mapped to False. This maps tokens to word piece tokens like TransformerNominalFeatureVectorizer.
- Shape:
- DESCRIPTION: ClassVar[str] = 'transformer mask'¶
- __init__(name, config_factory, feature_id, manager, encode_transformed, fold_method, embed_model, decode_embedding=False, is_labeler=True, encode_tokenized=False, data_type='bool')¶
- data_type: Union[str, None, torch.dtype] = 'bool'¶
The mask tensor type. To use the int type that matches the resolution of the manager's torch_config, use DEFAULT_INT.
- class zensols.deepnlp.transformer.vectorizers.TransformerNominalFeatureVectorizer(name, config_factory, feature_id, manager, encode_transformed, fold_method, embed_model, decode_embedding=False, is_labeler=True, encode_tokenized=False, delegate_feature_id=None, size=-1, pad_label=-100, label_all_tokens=False, annotations_attribute='annotations')[source]¶
Bases: AggregateEncodableFeatureVectorizer, LabelTransformerFeatureVectorizer
This creates word piece (mapped to token) labels. This class uses a NominalEncodedEncodableFeatureVectorizer to map from string labels to their nominal long values. This allows a single instance and centralized location where the label mapping happens in case other (non-transformer) components need to vectorize labels.
- Shape:
- DESCRIPTION: ClassVar[str] = 'transformer seq labeler'¶
- __init__(name, config_factory, feature_id, manager, encode_transformed, fold_method, embed_model, decode_embedding=False, is_labeler=True, encode_tokenized=False, delegate_feature_id=None, size=-1, pad_label=-100, label_all_tokens=False, annotations_attribute='annotations')¶
- annotations_attribute: str = 'annotations'¶
The attribute used to get the features from the FeatureSentence. For example, TokenAnnotatedFeatureSentence has an annotations attribute.
- delegate_feature_id: str = None¶
The feature ID for the aggregate encodeable feature vectorizer.
- label_all_tokens: bool = False¶
If True, label all word piece tokens with the corresponding linguistic token label. Otherwise, the default padded value is used, and thus ignored by the loss function when calculating loss.
- write(depth=0, writer=<_io.TextIOWrapper name='<stdout>' mode='w' encoding='utf-8'>)[source]¶
Write this instance as either a Writable or as a Dictable. If the class attribute _DICTABLE_WRITABLE_DESCENDANTS is set to True, then use the write() method on children instead of writing the generated dictionary. Otherwise, write this instance by first creating a dict recursively using asdict(), then formatting the output.
If the attribute _DICTABLE_WRITE_EXCLUDES is set, those attributes are removed from what is written in the write() method.
Note that this attribute will need to be set in all descendants in the instance hierarchy since writing the object instance graph is done recursively.
- Parameters:
depth (int) – the starting indentation depth
writer (TextIOBase) – the writer to dump the content of this writable
zensols.deepnlp.transformer.wordpiece module¶
Word piece mappings to feature tokens, sentences and documents.
There are often edge cases and tricky situations with certain models' usage of special tokens (i.e. [CLS]) and where they are used. With this in mind, this module attempts to:
- Assist in debugging (works with detached TokenizedDocument) in cases where token level embeddings are directly accessed, and
- Map corresponding token and sentence level embeddings to their respective origin natural language feature set data structures.
- class zensols.deepnlp.transformer.wordpiece.CachingWordPieceFeatureDocumentFactory(tokenizer, embed_model, token_embeddings=True, sent_embeddings=True, stash=None, hasher=<factory>)[source]¶
Bases: WordPieceFeatureDocumentFactory
Caches the documents and their embeddings in a Stash. For those that are cached, the embeddings are copied over to the passed document in create().
- create(fdoc, tdoc=None)[source]¶
Create a document into an object graph that relates word pieces to feature tokens. Note that if tdoc is provided, it must have been tokenized from fdoc.
- Parameters:
fdoc (FeatureDocument) – the feature document used to create tdoc
tdoc (TokenizedFeatureDocument) – a tokenized feature document generated by tokenize()
- Return type:
- Returns:
a data structure with the word piece information
- class zensols.deepnlp.transformer.wordpiece.WordPiece(word, vocab_index, index)[source]¶
Bases: PersistableContainer, Dictable
The word piece data.
- __init__(word, vocab_index, index)¶
- index: int¶
The index of the word piece subword in the tokenization tensor, which will have the same index in the output embeddings for TransformerEmbedding.output = last_hidden_state.
- class zensols.deepnlp.transformer.wordpiece.WordPieceDocumentDecorator(word_piece_doc_factory)[source]¶
Bases: FeatureDocumentDecorator
Populates sentence and token embeddings in the documents.
- __init__(word_piece_doc_factory)¶
- word_piece_doc_factory: WordPieceFeatureDocumentFactory¶
The feature document factory that populates embeddings.
- class zensols.deepnlp.transformer.wordpiece.WordPieceFeatureDocument(sents, text=None, spacy_doc=None, tokenized=None)[source]¶
Bases: FeatureDocument, WordPieceTokenContainer
A document made up of word piece sentences.
- __init__(sents, text=None, spacy_doc=None, tokenized=None)¶
- property embedding: Tensor¶
The document embedding (see WordPieceFeatureSpan.embedding).
- Shape:
(|sentences|, <embedding dimension>)
- tokenized: TokenizedFeatureDocument = None¶
The tokenized feature document.
- write(depth=0, writer=<_io.TextIOWrapper name='<stdout>' mode='w' encoding='utf-8'>)[source]¶
Write the document and optionally sentence features.
- Parameters:
n_sents – the number of sentences to write
n_tokens – the number of tokens to print across all sentences
include_original – whether to include the original text
include_normalized – whether to include the normalized text
- class zensols.deepnlp.transformer.wordpiece.WordPieceFeatureDocumentFactory(tokenizer, embed_model, token_embeddings=True, sent_embeddings=True)[source]¶
Bases: object
Create instances of WordPieceFeatureDocument from FeatureDocument instances. It does this by iterating through a feature document data structure and adding WordPiece* object data and optionally adding the corresponding sentence and/or token level embeddings.
The embeddings can also be added with add_token_embeddings() and add_sent_embeddings() individually. If all you want are the sentence level embeddings, you can use add_sent_embeddings() on a FeatureSentence instance.
- add_sent_embeddings(doc, arr)[source]¶
Add sentence embeddings to the sentences of doc.
- Parameters:
doc (Union[WordPieceFeatureDocument, FeatureDocument]) – sentences of this doc have embeddings set to the corresponding sentence tensor with shape (1, <embedding dimension>)
- add_token_embeddings(doc, arr)[source]¶
Add token embeddings to the sentences of doc. This assumes tokens are of type WordPieceFeatureToken since the token indices are needed.
- Parameters:
doc (WordPieceFeatureDocument) – sentences of this doc have embeddings set to the corresponding sentence tensor with shape (1, <embedding dimension>)
- create(fdoc, tdoc=None)[source]¶
Create a document into an object graph that relates word pieces to feature tokens. Note that if tdoc is provided, it must have been tokenized from fdoc.
- Parameters:
fdoc (FeatureDocument) – the feature document used to create tdoc
tdoc (TokenizedFeatureDocument) – a tokenized feature document generated by tokenize()
- Return type:
- Returns:
a data structure with the word piece information
- embed_model: TransformerEmbedding¶
Used to populate the embeddings in WordPiece* classes.
- populate(doc, truncate=False)[source]¶
Populate sentence embeddings in a document by first feature parsing a new document with create() and then copying the embeddings with WordPieceFeatureDocument.copy_embeddings().
- Parameters:
truncate (bool) – if sentence lengths differ (i.e. from using different models to chunk sentences), trim the longer document to match the shorter
- tokenizer: TransformerDocumentTokenizer¶
Used to tokenize documents that aren't already tokenized in __call__().
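A hedged sketch of the factory in use; tokenizer (a TransformerDocumentTokenizer), embed_model (a TransformerEmbedding) and doc (a parsed FeatureDocument) are assumed to be configured elsewhere:

    # relate word pieces to feature tokens with embeddings (hedged example)
    factory = WordPieceFeatureDocumentFactory(
        tokenizer=tokenizer, embed_model=embed_model)
    wp_doc = factory.create(doc)       # a WordPieceFeatureDocument
    print(wp_doc.embedding.shape)      # (|sentences|, <embedding dimension>)
    for tok in wp_doc.token_iter():
        print(tok.norm, tok.words)     # each token and its word pieces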
- class zensols.deepnlp.transformer.wordpiece.WordPieceFeatureSentence(tokens, text=None, spacy_span=None, embedding=None)[source]¶
Bases: WordPieceFeatureSpan, FeatureSentence
- __init__(tokens, text=None, spacy_span=None, embedding=None)¶
- class zensols.deepnlp.transformer.wordpiece.WordPieceFeatureSpan(tokens, text=None, spacy_span=None, embedding=None)[source]¶
Bases: FeatureSentence, WordPieceTokenContainer
A sentence made up of word pieces.
- __init__(tokens, text=None, spacy_span=None, embedding=None)¶
- embedding: Tensor = None¶
The sentence level (i.e. [CLS]) embedding from the transformer.
- Shape:
(<embedding dimension>,)
- write(depth=0, writer=<_io.TextIOWrapper name='<stdout>' mode='w' encoding='utf-8'>)[source]¶
Write the text container.
- Parameters:
include_original – whether to include the original text
include_normalized – whether to include the normalized text
n_tokens – the number of tokens to write
inline – whether to print the tokens on one line each
- class zensols.deepnlp.transformer.wordpiece.WordPieceFeatureToken(i, idx, i_sent, norm, lexspan, words, embedding=None)[source]¶
Bases: FeatureToken
The token and the word pieces that represent it.
- __init__(i, idx, i_sent, norm, lexspan, words, embedding=None)¶
- clone(cls=None, **kwargs)[source]¶
Clone an instance of this token.
- Parameters:
cls (Type) – the type of the new instance
kwargs – arguments to add as attributes to the clone
- Return type:
- Returns:
the cloned instance of this instance
- detach(*args, **kwargs)[source]¶
Create a detached token (i.e. from spaCy artifacts).
- Parameters:
feature_ids – the features to write, which defaults to FEATURE_IDS
skip_missing – whether to only keep feature_ids
cls – the type of the new instance
- Return type:
- embedding: Tensor = None¶
The embedding for words after using the transformer.
- Shape:
(|words|, <embedding dimension>)
- property indexes: Tuple[int]¶
The indexes of the word piece subwords (see WordPiece.index).
- property token_embedding: Tensor¶
The embedding of this token, which is the sum of the word piece embeddings.
- words: Tuple[WordPiece]¶
The word pieces that make up this token.
- write(depth=0, writer=<_io.TextIOWrapper name='<stdout>' mode='w' encoding='utf-8'>)[source]¶
Write this instance as either a Writable or as a Dictable. If the class attribute _DICTABLE_WRITABLE_DESCENDANTS is set to True, then use the write() method on children instead of writing the generated dictionary. Otherwise, write this instance by first creating a dict recursively using asdict(), then formatting the output.
If the attribute _DICTABLE_WRITE_EXCLUDES is set, those attributes are removed from what is written in the write() method.
Note that this attribute will need to be set in all descendants in the instance hierarchy since writing the object instance graph is done recursively.
- Parameters:
depth (int) – the starting indentation depth
writer (TextIOBase) – the writer to dump the content of this writable
- class zensols.deepnlp.transformer.wordpiece.WordPieceFeatureVectorizer(name, config_factory, feature_id, manager, encode_transformed=False, fold_method='raise', embed_model=None, decode_embedding=True, word_piece_doc_factory=None, access='raise')[source]¶
Bases: EmbeddingFeatureVectorizer
Uses the embeddings attributes added to documents, sentences and tokens populated by WordPieceFeatureDocumentFactory. Currently only sentence sequences are supported. For single sentence or token classification, use zensols.deepnlp.vectorizers.
If aggregated documents are given to the vectorizer, they are flattened into sentences and vectorized in the same way a single document's sentences would be vectorized. A batch is created for each document and only one batch is created for singleton documents.
This embedding layer expects the following attribute settings to be left with the defaults set: encode_transformed, fold_method, decode_embedding.
- Shape:
- DESCRIPTION: ClassVar[str] = 'wordpiece'¶
- FEATURE_TYPE: ClassVar[TextFeatureType] = 4¶
- __init__(name, config_factory, feature_id, manager, encode_transformed=False, fold_method='raise', embed_model=None, decode_embedding=True, word_piece_doc_factory=None, access='raise')¶
- access: str = 'raise'¶
What to do when accessing the sentence embedding when encoding. This is one of:
- raise: raises an error when missing
- add_missing: create the embedding only if missing
- clobber: always create a new embedding, replacing any existing one
- decode_embedding: bool = True¶
Turn off the embed_model forward pass to use the embeddings vectorized from the embedding attribute(s). Keep the default.
- embed_model: TransformerEmbedding = None¶
This field is not applicable to this vectorizer; keep the default.
- encode(doc)[source]¶
Encode by combining documents into one monolithic document when a tuple is passed, otherwise default to the super class's encode functionality.
- Return type:
- encode_transformed: bool = False¶
This field is not applicable to this vectorizer; keep the default.
- fold_method: str = 'raise'¶
This field is not applicable to this vectorizer; keep the default.
- word_piece_doc_factory: WordPieceFeatureDocumentFactory = None¶
The feature document factory that populates embeddings.
- class zensols.deepnlp.transformer.wordpiece.WordPieceTokenContainer[source]¶
Bases: TokenContainer
Like TokenContainer but contains word pieces.
Module contents¶
Contains classes that adapt the HuggingFace transformers to the Zensols deep learning framework.
- zensols.deepnlp.transformer.normalize_huggingface_logging()[source]¶
Make the transformers package use default logging. Using this and setting the transformers logging package to ERROR level logging has the same effect as suppress_warnings().
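For example, the following has the same effect as suppress_warnings() per the description above:

    # quiet HuggingFace transformers logging
    import logging
    from zensols.deepnlp import transformer

    transformer.normalize_huggingface_logging()
    logging.getLogger('transformers').setLevel(logging.ERROR)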