zensols.deepnlp.transformer package¶
Submodules¶
zensols.deepnlp.transformer.domain module¶
Container classes for BERT models.
- class zensols.deepnlp.transformer.domain.TokenizedDocument(tensor, boundary_tokens)[source]¶
Bases:
PersistableContainer
,Writable
This is the tokenized document output of
TransformerDocumentTokenizer
. Instances of this class are picklable in a feature context, and are then given to a transformer model such as TransformerEmbedding in the decoding phase to create a tensor.
- __init__(tensor, boundary_tokens)¶
- classmethod from_tensor(tensor)[source]¶
Create an instance of the class using a tensor. This is useful for re-creating documents for mapping with
map_word_pieces()
after unpickling from a document created withTransformerDocumentTokenizer.tokenize
.- Parameters:
- Return type:
- get_wordpiece_count(**kwargs)[source]¶
The size of the document (sum over sentences) in number of word pieces. To keep special tokens (such as BERT's [CLS] and [SEP] tokens) when passing in a tokenizer in kwargs, add special_tokens={}.
- Parameters:
kwargs – any keyword arguments passed on to map_to_word_pieces() (do not add index_tokens or includes)
- Return type:
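A minimal usage sketch (an assumption, not from the generated docs): tdoc is a TokenizedDocument and tokenizer a TransformerDocumentTokenizer created elsewhere:
# count word pieces without special tokens
n_tokens = tdoc.get_wordpiece_count(map_wp=tokenizer)
# keep special tokens (such as [CLS] and [SEP]) in the count
n_all = tdoc.get_wordpiece_count(map_wp=tokenizer, special_tokens={})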
- map_to_word_pieces(sentences=None, map_wp=None, add_indices=False, special_tokens=None, index_tokens=True, includes=frozenset({'map'}))[source]¶
Map word piece tokens to linguistic tokens.
- Parameters:
sentences (Iterable[List[Any]]) – an iteration of sentences, which is returned in the output (i.e. FeatureSentence), or input_ids if None
map_wp (Union[Callable, Dict[int, str]]) – either a function that takes the token index, sentence ID and input IDs, or the mapping from word piece ID to string token; the return output is the string token (or numerical output if no mapping is provided); if an instance of TransformerDocumentTokenizer, its vocabulary and special tokens are utilized for mapping and special token consideration
add_indices (bool) – whether to add the token ID and index after the token string when id2tok is provided for map_wp
special_tokens (Set[str]) – a set of tokens (such as BERT's [CLS] and [SEP] tokens) to remove; to keep special tokens when passing in a tokenizer in kwargs, add special_tokens={}
index_tokens (bool) – whether to index tokens positionally, which is used for mapping with feature or tokenized sentences; set this to False when sentences are anything but a feature document / sentences
includes (Set[str]) – what data to return, which is a set of the keys listed in the return documentation below
- Return type:
- Returns:
a list of sentence maps, each with:
sent_ix -> the i-th sentence (always provided)
map -> list of (sentence 'token', word pieces)
sent -> a FeatureSentence or a tensor of vocab indexes if map_wp is None
word_pieces -> the word pieces of the sentence
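A hedged sketch of mapping word pieces back to linguistic tokens; the doc (a parsed FeatureDocument) and tokenizer (a TransformerDocumentTokenizer) are assumed to come from the application context:
tdoc = tokenizer.tokenize(doc)
for sent_map in tdoc.map_to_word_pieces(
        sentences=doc, map_wp=tokenizer, includes={'map', 'sent'}):
    sent = sent_map['sent']                # a FeatureSentence
    for token, word_pieces in sent_map['map']:
        print(token, '->', word_pieces)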
- static map_word_pieces(token_offsets)[source]¶
Map word piece tokens to linguistic tokens.
- Return type:
List[Tuple[FeatureToken, List[int]]]
- Returns:
a list of tuples in the form:
(<token index>, <list of word piece indexes>)
- property offsets: Tensor¶
The offsets from word piece (transformer’s tokenizer) to feature document index mapping.
- property shape: Size¶
Return the shape of the vectorized document.
- truncate(size)[source]¶
Truncate the last (token) dimension to size.
- Return type:
- Returns:
a new instance of this class truncated to size
- write(depth=0, writer=<_io.TextIOWrapper name='<stdout>' mode='w' encoding='utf-8'>, include_tokens=True, id2tok=None)[source]¶
Write the contents of this instance to writer using indentation depth.
- Parameters:
depth (int) – the starting indentation depth
writer (TextIOBase) – the writer to dump the content of this writable
- class zensols.deepnlp.transformer.domain.TokenizedFeatureDocument(tensor, boundary_tokens, feature, id2tok, char_offsets)[source]¶
Bases:
TokenizedDocument
Instances of this class are created, then a picklable version is returned with detach() as an instance of the superclass.
- __init__(tensor, boundary_tokens, feature, id2tok, char_offsets)¶
- feature: FeatureDocument¶
The document to tokenize.
- id2tok: Dict[int, str]¶
If provided, a mapping of indexes to transformer tokens. This attribute is always nulled out after being persisted.
- map_to_word_pieces(sentences=None, map_wp=None, **kwargs)[source]¶
Map word piece tokens to linguistic tokens.
- Parameters:
sentences (Iterable[List[Any]]) – an iteration of sentences, which is returned in the output (i.e. FeatureSentence), or input_ids if None
map_wp (Union[Callable, Dict[int, str]]) – either a function that takes the token index, sentence ID and input IDs, or the mapping from word piece ID to string token; the return output is the string token (or numerical output if no mapping is provided); if an instance of TransformerDocumentTokenizer, its vocabulary and special tokens are utilized for mapping and special token consideration
add_indices – whether to add the token ID and index after the token string when id2tok is provided for map_wp
special_tokens – a set of tokens (such as BERT's [CLS] and [SEP] tokens) to remove; to keep special tokens when passing in a tokenizer in kwargs, add special_tokens={}
index_tokens – whether to index tokens positionally, which is used for mapping with feature or tokenized sentences; set this to False when sentences are anything but a feature document / sentences
includes – what data to return, which is a set of the keys listed in the return documentation below
- Return type:
- Returns:
a list of sentence maps, each with:
sent_ix -> the i-th sentence (always provided)
map -> list of (sentence 'token', word pieces)
sent -> a FeatureSentence or a tensor of vocab indexes if map_wp is None
word_pieces -> the word pieces of the sentence
- write(depth=0, writer=<_io.TextIOWrapper name='<stdout>' mode='w' encoding='utf-8'>, include_tokens=True, id2tok=None)[source]¶
Write the contents of this instance to writer using indentation depth.
- Parameters:
depth (int) – the starting indentation depth
writer (TextIOBase) – the writer to dump the content of this writable
zensols.deepnlp.transformer.embed module¶
The transformer embedding model.
- class zensols.deepnlp.transformer.embed.TransformerEmbedding(name, tokenizer, output='pooler_output', output_attentions=False)[source]¶
Bases:
PersistableContainer
,Dictable
A model for transformer embeddings (such as BERT) that wraps the HuggingFace transformers API.
- __init__(name, tokenizer, output='pooler_output', output_attentions=False)¶
- property cache¶
When set to True, cache a globally scoped model using the parameters from the first instance creation.
- property model: PreTrainedModel¶
- output: str = 'pooler_output'¶
The output from the huggingface transformer API to return.
This is set to one of:
LAST_HIDDEN_STATE_OUTPUT
: with the output embeddings of the last layer with shape:(batch, N sentences, hidden layer dimension)
POOLER_OUTPUT
: the last layer hidden-state of the first token of the sequence (classification token) further processed by a Linear layer and a Tanh activation function with shape:(batch, hidden layer dimension)
ALL_OUTPUT
: includes both as a dictionary with corresponding keys
- property resource: TransformerResource¶
The transformer resource containing the model.
- tokenize(doc)[source]¶
Tokenize the feature document, which is used as the input to
transform()
.- Parameters:
doc – the document to tokenize
- Return type:
- Returns:
the tokenization of
doc
-
tokenizer:
TransformerDocumentTokenizer
¶ The tokenizer used for creating the input for the model.
- transform(doc, output=None)[source]¶
Transform the documents into the transformer output.
- Parameters:
doc – the document to transform
output (
str
) – the output from the huggingface transformer API to return (see class docs)
- Return type:
- Returns:
a container object instance with the output, which contains (among other data)
last_hidden_state
with the output embeddings of the last layer with shape:(batch, N sentences, hidden layer dimension)
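A hedged usage sketch; the embed_model (a configured TransformerEmbedding) and doc (a parsed FeatureDocument) instances are assumed to come from the application's configuration factory:
tdoc = embed_model.tokenize(doc)     # a TokenizedFeatureDocument
out = embed_model.transform(
    doc, output=TransformerEmbedding.LAST_HIDDEN_STATE_OUTPUT)
# per the class docs, the last layer embeddings have shape:
# (batch, N sentences, hidden layer dimension)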
zensols.deepnlp.transformer.layer module¶
Contains transformer embedding layers.
- class zensols.deepnlp.transformer.layer.TransformerEmbeddingLayer(*args, embed_model, **kwargs)[source]¶
Bases:
EmbeddingLayer
A transformer (i.e. BERT) embedding layer. This class generates embeddings on a per sentence basis. See the initializer documentation for configuration requirements.
- MODULE_NAME: ClassVar[str] = 'transformer embedding'¶
The module name used in the logging message. This is set in each inherited class.
- __init__(*args, embed_model, **kwargs)[source]¶
Initialize with an embedding model. This embedding model must be configured with TransformerEmbedding.output set to last_hidden_state.
- Parameters:
embed_model (
TransformerEmbedding
) – used to generate the transformer (i.e. BERT) embeddings
- forward(x)[source]¶
Defines the computation performed at every call.
Should be overridden by all subclasses.
- Return type:
Tensor
Note
Although the recipe for forward pass needs to be defined within this function, one should call the
Module
instance afterwards instead of this since the former takes care of running the registered hooks while the latter silently ignores them.
- class zensols.deepnlp.transformer.layer.TransformerSequence(net_settings, sub_logger=None)[source]¶
Bases:
EmbeddingNetworkModule
,SequenceNetworkModule
A sequence based model for token classification that uses HuggingFace transformers layers (not their token classification API).
- MODULE_NAME: ClassVar[str] = 'transformer sequence'¶
The module name used in the logging message. This is set in each inherited class.
- __init__(net_settings, sub_logger=None)[source]¶
Initialize the embedding layer.
- Parameters:
net_settings (
TransformerSequenceNetworkSettings
) – the embedding layer configuration
logger – the logger to use for the forward process in this layer
filter_attrib_fn – if provided, called with a
BatchFieldMetadata
for each field returningTrue
if the batch field should be retained and used in the embedding layer (see class docs); ifNone
all fields are considered
- class zensols.deepnlp.transformer.layer.TransformerSequenceNetworkSettings(name, config_factory, dropout, batch_stash, embedding_layer, decoder_settings)[source]¶
Bases:
EmbeddingNetworkSettings
,DropoutNetworkSettings
Settings configuration for
TransformerSequence
.- __init__(name, config_factory, dropout, batch_stash, embedding_layer, decoder_settings)¶
-
decoder_settings:
DeepLinearNetworkSettings
¶ The decoder feed forward network.
- get_module_class_name()[source]¶
Returns the fully qualified class name of the module to create by
ModelManager
. This module takes as the first parameter an instance of this class.Important: This method is not used for nested modules. You must declare specific class names in the configuration for those nested class names.
- Return type:
zensols.deepnlp.transformer.mask module¶
Classes to predict fill-mask tasks.
- class zensols.deepnlp.transformer.mask.MaskFiller(resource, k=1, feature_id='norm', feature_value='MASK')[source]¶
Bases:
object
The class fills masked tokens with the prediction of the underlying masked model. Masked tokens with attribute
feature_id
having valuefeature_value
(norm
andMASK
by default respectively) are substituted with model values.To use this class, parse a sentence with a
FeatureDocumentParser
with masked tokens using the stringfeature_value
.For example (with class defaults), the sentence:
Paris is the MASK of France.
becomes:
Paris is the <mask> of France.
The
<mask>
string becomes themask_token
for the model’s tokenizer.- __init__(resource, k=1, feature_id='norm', feature_value='MASK')¶
-
feature_value:
str
= 'MASK'¶ The value of feature ID
feature_id
to match on masked tokens.
-
k:
int
= 1¶ The number of top K predicted masked words per mask. The total number of predictions will be <number of masks> X
k
in the source document.
- predict(source)[source]¶
Predict substitution values for token masks.
Important:
source
is modified as a side-effect of this method. Useclone()
on thesource
document passed to this method to preserve the original if necessary.- Parameters:
source (
TokenContainer
) – the source document, sentence, or span for which to substitute values- Return type:
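A hedged sketch of the workflow; the doc_parser (a FeatureDocumentParser) and resource (a TransformerResource) are assumed to be created from the application configuration:
from zensols.deepnlp.transformer.mask import MaskFiller

filler = MaskFiller(resource, k=3)
doc = doc_parser.parse('Paris is the MASK of France.')
pred = filler.predict(doc)     # note: modifies doc as a side effect
for sent in pred:              # the top-k filled sentences
    print(sent.norm)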
-
resource:
TransformerResource
¶ A container class with the Huggingface tokenizer and model.
- class zensols.deepnlp.transformer.mask.Prediction(cont, masked_tokens, df)[source]¶
Bases:
Dictable
A container class for masked token predictions produced by
MaskFiller
. This class offers many ways to get the predictions, including getting the sentences as instances ofTokenContainer
by using it as an iterable.The sentences are also available as the
pred_sentences
key when usingasdict()
.- __init__(cont, masked_tokens, df)¶
-
cont:
TokenContainer
¶ The document, sentence or span to predict masked tokens.
-
df:
DataFrame
¶ The predictions with dataframe columns:
k
: the k in the top-k highest scored masked token matchmask_id
: the N-th masked token in the source ordered by positiontoken
: the predicted tokenscore
: the score of the prediction ([0, 1]
, higher the better)
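Continuing the hedged MaskFiller sketch above, the best scored substitution per mask can be read off the dataframe:
best = pred.df[pred.df['k'] == 0]
print(best[['mask_id', 'token', 'score']])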
- get_container(k=0)[source]¶
Get the k-th top scored sentence. This method modifies the tokens of the container on each invocation.
A client may call this method as many times as necessary (i.e. for multiple values of
k
) since cont
tokens are modified while retaining the original masked tokens masked_tokens
.- Parameters:
k (int) – as k increases, the mask substitutions (and thus the sentence) become less likely; k = 0 is the most likely given the sentence and masks
- Return type:
- get_tokens()[source]¶
Return an iterable of the prediction coupled with the token it belongs to and its score.
- Return type:
- property masked_token_dicts: Tuple[Dict[str, Any]]¶
A tuple of
builtins.dict
each having token index, norm and text data.
-
masked_tokens:
Tuple
[FeatureToken
]¶ The masked tokens matched.
- write(depth=0, writer=<_io.TextIOWrapper name='<stdout>' mode='w' encoding='utf-8'>, include_masked_tokens=True, include_predicted_tokens=True, include_predicted_sentences=True)[source]¶
Write this instance as either a
Writable
or as aDictable
. If class attribute_DICTABLE_WRITABLE_DESCENDANTS
is set asTrue
, then use thewrite()
method on children instead of writing the generated dictionary. Otherwise, write this instance by first creating adict
recursively usingasdict()
, then formatting the output.If the attribute
_DICTABLE_WRITE_EXCLUDES
is set, those attributes are removed from what is written in thewrite()
method.Note that this attribute will need to be set in all descendants in the instance hierarchy since writing the object instance graph is done recursively.
- Parameters:
depth (
int
) – the starting indentation depthwriter (
TextIOBase
) – the writer to dump the content of this writable
zensols.deepnlp.transformer.optimizer module¶
Adapt the huggingface transformer weight decay optimizer.
- class zensols.deepnlp.transformer.optimizer.TransformerAdamFactory[source]¶
Bases:
ModelResourceFactory
- class zensols.deepnlp.transformer.optimizer.TransformerSchedulerFactory[source]¶
Bases:
ModelResourceFactory
Unified API to get any scheduler from its name. This simply calls
transformers.get_scheduler()
and calculatesnum_training_steps
asepochs * batch_size
Documentation taken directly from the
get_scheduler
function in the HuggingFace transformers source tree.
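A hedged sketch of what the factory delegates to; the optimizer, epochs and batch_size values are illustrative assumptions:
from transformers import get_scheduler

num_training_steps = epochs * batch_size   # as computed by the factory
scheduler = get_scheduler(
    'linear', optimizer=optimizer,
    num_warmup_steps=0, num_training_steps=num_training_steps)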
zensols.deepnlp.transformer.pred module¶
Predictions output for transformer models.
- class zensols.deepnlp.transformer.pred.TransformerSequencePredictionsDataFrameFactory(source, result, stash, column_names=None, data_point_transform=None, batch_limit=9223372036854775807, epoch_result=None, label_vectorizer_name=None, embedded_document_attribute=None)[source]¶
Bases:
SequencePredictionsDataFrameFactory
Like the superclass, but creates predictions for transformer sequence models. By default, transformer input is truncated at the model's max token length (usually 512 word piece tokens). It then truncates the tokens that are added as the
text
column from (configured by default) classify.TokenClassifyModelFacade
.For all predictions where the sequence exceeded the model's maximum, this class maps the last word piece token output to the respective token in the
predictions_dataframe_factory_class
instance’stransform
output.- __init__(source, result, stash, column_names=None, data_point_transform=None, batch_limit=9223372036854775807, epoch_result=None, label_vectorizer_name=None, embedded_document_attribute=None)¶
zensols.deepnlp.transformer.resource module¶
Provide BERT embeddings on a per sentence level.
- exception zensols.deepnlp.transformer.resource.TransformerError[source]¶
Bases:
DeepLearnError
Raised for any transformer specific errors in this and child modules of the parent.
- class zensols.deepnlp.transformer.resource.TransformerResource(name, torch_config, model_id, cased=None, trainable=False, args=<factory>, tokenizer_args=<factory>, model_args=<factory>, model_class='transformers.AutoModel', tokenizer_class='transformers.AutoTokenizer', cache=False, cache_dir=None)[source]¶
Bases:
PersistableContainer
,Dictable
A container base class that allows configuration and creates various huggingface models.
- __init__(name, torch_config, model_id, cased=None, trainable=False, args=<factory>, tokenizer_args=<factory>, model_args=<factory>, model_class='transformers.AutoModel', tokenizer_class='transformers.AutoTokenizer', cache=False, cache_dir=None)¶
-
args:
Dict
[str
,Any
]¶ Additional arguments to pass to the from_pretrained method for the tokenizer and the model.
-
cache:
InitVar
= False¶ When set to True, cache a globally scoped model using the parameters from the first instance creation.
-
cased:
bool
= None¶
True for case sensitive models, False (default) otherwise. Its negated value is also used as the do_lower_case parameter in the *.from_pretrained calls to huggingface transformers.
- property model: PreTrainedModel¶
-
model_args:
Dict
[str
,Any
]¶ Additional arguments to pass to the from_pretrained method for the model.
-
model_class:
str
= 'transformers.AutoModel'¶ The fully qualified class name used to create models with the
from_pretrained
static method.
-
model_id:
str
¶ The ID of the model (i.e.
bert-base-uncased
). If this is not set, it is derived from the model_name and cased
.Token embedding using
TransformerEmbedding
has been tested with:
bert-base-cased
bert-large-cased
roberta-base
distilbert-base-cased
- See:
- property tokenizer: PreTrainedTokenizer¶
-
tokenizer_args:
Dict
[str
,Any
]¶ Additional arguments to pass to the from_pretrained method for the tokenizer.
-
tokenizer_class:
str
= 'transformers.AutoTokenizer'¶ The fully qualified class name used to create tokenizers with the
from_pretrained
static method.
-
torch_config:
TorchConfig
¶ The config device used to copy the embedding data.
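A hedged sketch of creating the resource directly (in practice the framework's configuration factory creates it); the argument values are illustrative assumptions:
from zensols.deeplearn import TorchConfig
from zensols.deepnlp.transformer.resource import TransformerResource

resource = TransformerResource(
    name='transformer',
    torch_config=TorchConfig(),
    model_id='bert-base-cased',
    cased=True)
model = resource.model          # a PreTrainedModel
tokenizer = resource.tokenizer  # a PreTrainedTokenizer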
zensols.deepnlp.transformer.tokenizer module¶
The tokenizer object.
- class zensols.deepnlp.transformer.tokenizer.TransformerDocumentTokenizer(resource, word_piece_token_length=None, params=None)[source]¶
Bases:
PersistableContainer
Creates instances of
TokenizedFeatureDocument
using a HuggingFacePreTrainedTokenizer
.-
DEFAULT_PARAMS:
ClassVar
[Dict
[str
,Any
]] = {'is_split_into_words': True, 'padding': 'longest', 'return_offsets_mapping': True, 'return_special_tokens_mask': True}¶ Default parameters for the HuggingFace tokenizer. These get overridden by the
tokenizer_kwargs
intokenize()
and the processing of valueword_piece_token_length
.
- __init__(resource, word_piece_token_length=None, params=None)¶
- property all_special_tokens: Set[str]¶
Special tokens used by the model (such as BERT's
[CLS]
and[SEP]
tokens).
- property id2tok: Dict[int, str]¶
A mapping from the HuggingFace tokenizer's vocabulary to its word piece equivalent.
- property pretrained_tokenizer: PreTrainedTokenizer¶
The HuggingFace tokenizer used to create tokenized documents.
-
resource:
TransformerResource
¶ Contains the model used to create the tokenizer.
- tokenize(doc, tokenizer_kwargs=None)[source]¶
Tokenize a feature document in a form that’s easy to inspect and provide to
TransformerEmbedding
to transform.- Parameters:
doc (
FeatureDocument
) – the document to tokenize- Return type:
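A hedged sketch; the resource (a TransformerResource) and doc (a parsed FeatureDocument) are assumed to be configured elsewhere:
from zensols.deepnlp.transformer.tokenizer import \
    TransformerDocumentTokenizer

tokenizer = TransformerDocumentTokenizer(resource)
tdoc = tokenizer.tokenize(doc)   # a TokenizedFeatureDocument
print(tdoc.shape)                # shape of the vectorized document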
-
word_piece_token_length:
int
= None¶ The max number of word piece tokens. The word piece length is always the same or greater in count than linguistic tokens because the word piece algorithm tokenizes on characters.
If this value is less than 0, then do not fix sentence lengths. If the value is 0 (default), then truncate to the model's longest max length. Otherwise, if this value is
None
, set the length to the model's longest max length using the model's model_max_length
value.Setting this to a value less than 0, making documents multi-length, has the potential of creating token spans longer than the model can tolerate (usually 512 word piece tokens). In these cases, this value must be set to (or lower than) the model's
model_max_length
.Tokenization padding is on by default.
- See: DEFAULT_PARAMS
zensols.deepnlp.transformer.vectorizers module¶
Contains classes that are used to vectorize documents into transformer embeddings.
- class zensols.deepnlp.transformer.vectorizers.LabelTransformerFeatureVectorizer(name, config_factory, feature_id, manager, encode_transformed, fold_method, embed_model, decode_embedding=False, is_labeler=True, encode_tokenized=False)[source]¶
Bases:
TransformerFeatureVectorizer
A base class for vectorizing by mapping tokens to transformer consumable word piece tokens. This includes creating labels and masks.
- Shape:
- FEATURE_TYPE = 1¶
- __init__(name, config_factory, feature_id, manager, encode_transformed, fold_method, embed_model, decode_embedding=False, is_labeler=True, encode_tokenized=False)¶
- is_labeler: bool = True¶
If
True
, make this a labeling specific vectorizer. Otherwise, certain layers will use the output of the vectorizer as features rather than the labels.
- class zensols.deepnlp.transformer.vectorizers.TransformerEmbeddingFeatureVectorizer(name, config_factory, feature_id, manager, encode_transformed, fold_method, embed_model, decode_embedding=False, is_labeler=False, encode_tokenized=False)[source]¶
Bases:
TransformerFeatureVectorizer
A feature vectorizer used to create transformer (i.e. BERT) embeddings. The class uses the
embed_model
, which is of typeTransformerEmbedding
.Note that the encoding input should ideally be sentences shorter than 512 tokens. However, this vectorizer can accommodate both
FeatureSentence
andFeatureDocument
instances.- DESCRIPTION = 'transformer document embedding'¶
- FEATURE_TYPE = 4¶
- __init__(name, config_factory, feature_id, manager, encode_transformed, fold_method, embed_model, decode_embedding=False, is_labeler=False, encode_tokenized=False)¶
- class zensols.deepnlp.transformer.vectorizers.TransformerExpanderFeatureContext(feature_id, contexts, document)[source]¶
Bases:
TransformerFeatureContext
A vectorizer feature context used with
TransformerExpanderFeatureVectorizer
.- __init__(feature_id, contexts, document)[source]¶
- Parameters:
feature_id – the feature ID used to identify this context
contexts – subordinate contexts given to
MultiFeatureContext
document – the document used to create the transformer embeddings
- contexts: Tuple[FeatureContext]¶
The subordinate contexts.
- class zensols.deepnlp.transformer.vectorizers.TransformerExpanderFeatureVectorizer(name, config_factory, feature_id, manager, encode_transformed, fold_method, embed_model, decode_embedding=False, is_labeler=False, encode_tokenized=False, delegate_feature_ids=None)[source]¶
Bases:
TransformerFeatureVectorizer
A vectorizer that expands linguistic feature vectors to their respective locations as word piece token vectors.
This is used to concatenate linguistic features with BERT (and other transformer) embeddings. Each linguistic token is copied in the word piece token location across all vectorizers and sentences.
- Shape:
(-1, token length, X), where X is the sum of all the delegate shapes across all three dimensions
- DESCRIPTION = 'transformer expander'¶
- FEATURE_TYPE = 1¶
- __init__(name, config_factory, feature_id, manager, encode_transformed, fold_method, embed_model, decode_embedding=False, is_labeler=False, encode_tokenized=False, delegate_feature_ids=None)¶
- delegate_feature_ids: Tuple[str] = None¶
A list of feature IDs of vectorizers whose output will be expanded.
- property delegates: EncodableFeatureVectorizer¶
The delegates used for encoding and decoding the linguistic features.
- class zensols.deepnlp.transformer.vectorizers.TransformerFeatureContext(feature_id, document)[source]¶
Bases:
FeatureContext
,Deallocatable
A vectorizer feature context used with
TransformerEmbeddingFeatureVectorizer
.- __init__(feature_id, document)[source]¶
- Parameters:
feature_id – the feature ID used to identify this context
document – the document used to create the transformer embeddings
- class zensols.deepnlp.transformer.vectorizers.TransformerFeatureVectorizer(name, config_factory, feature_id, manager, encode_transformed, fold_method, embed_model, decode_embedding=False, is_labeler=False, encode_tokenized=False)[source]¶
Bases:
EmbeddingFeatureVectorizer
,FeatureDocumentVectorizer
Base class for classes that vectorize transformer models. This class also tokenizes documents.
- __init__(name, config_factory, feature_id, manager, encode_transformed, fold_method, embed_model, decode_embedding=False, is_labeler=False, encode_tokenized=False)¶
- encode_tokenized: bool = False¶
Whether to tokenize the document on encoding. Set this to
True
only if the huggingface model ID (i.e.bert-base-cased
) will not change after vectorization/batching.Setting this to
True
tells the vectorizer to tokenize during encoding, and thus will speed experimentation by providing the tokenized tensors to the model directly.
- property feature_type: TextFeatureType¶
The type of feature this vectorizer generates. This is used by classes such as
EmbeddingNetworkModule
to determine where to add the features, such as concatenating to the embedding layer, join layer, etc.
- is_labeler: bool = False¶
If
True
, make this a labeling specific vectorizer. Otherwise, certain layers will use the output of the vectorizer as features rather than the labels.
- tokenize(doc)[source]¶
Tokenize the document into a token document used by the encoding phase.
- Parameters:
doc (
FeatureDocument
) – the document to be tokenized- Return type:
- class zensols.deepnlp.transformer.vectorizers.TransformerMaskFeatureVectorizer(name, config_factory, feature_id, manager, encode_transformed, fold_method, embed_model, decode_embedding=False, is_labeler=True, encode_tokenized=False, data_type='bool')[source]¶
Bases:
LabelTransformerFeatureVectorizer
Creates a mask with word piece tokens set to
True
and special tokens and padding set to False
. This maps tokens to word piece tokens likeTransformerNominalFeatureVectorizer
.- Shape:
- DESCRIPTION = 'transformer mask'¶
- __init__(name, config_factory, feature_id, manager, encode_transformed, fold_method, embed_model, decode_embedding=False, is_labeler=True, encode_tokenized=False, data_type='bool')¶
- data_type: Union[str, None, torch.dtype] = 'bool'¶
The mask tensor type. To use the int type that matches the resolution of the manager’s
torch_config
, useDEFAULT_INT
.
- class zensols.deepnlp.transformer.vectorizers.TransformerNominalFeatureVectorizer(name, config_factory, feature_id, manager, encode_transformed, fold_method, embed_model, decode_embedding=False, is_labeler=True, encode_tokenized=False, delegate_feature_id=None, size=-1, pad_label=-100, label_all_tokens=False, annotations_attribute='annotations')[source]¶
Bases:
AggregateEncodableFeatureVectorizer
,LabelTransformerFeatureVectorizer
This creates word piece labels (mapped to tokens). This class uses a
NominalEncodedEncodableFeatureVectorizer
to map from string labels to their nominal long values. This allows a single instance and centralized location where the label mapping happens in case other (non-transformer) components need to vectorize labels.- Shape:
- DESCRIPTION = 'transformer seq labeler'¶
- __init__(name, config_factory, feature_id, manager, encode_transformed, fold_method, embed_model, decode_embedding=False, is_labeler=True, encode_tokenized=False, delegate_feature_id=None, size=-1, pad_label=-100, label_all_tokens=False, annotations_attribute='annotations')¶
- annotations_attribute: str = 'annotations'¶
The attribute used to get the features from the
FeatureSentence
. For example,TokenAnnotatedFeatureSentence
has anannotations
attribute.
- delegate_feature_id: str = None¶
The feature ID for the aggregate encodeable feature vectorizer.
- label_all_tokens: bool = False¶
If
True
, label all word piece tokens with the corresponding linguistic token label. Otherwise, the default padded value is used, and thus ignored by the loss function when calculating the loss.
- write(depth=0, writer=<_io.TextIOWrapper name='<stdout>' mode='w' encoding='utf-8'>)[source]¶
Write this instance as either a
Writable
or as aDictable
. If class attribute_DICTABLE_WRITABLE_DESCENDANTS
is set asTrue
, then use thewrite()
method on children instead of writing the generated dictionary. Otherwise, write this instance by first creating adict
recursively usingasdict()
, then formatting the output.If the attribute
_DICTABLE_WRITE_EXCLUDES
is set, those attributes are removed from what is written in thewrite()
method.Note that this attribute will need to be set in all descendants in the instance hierarchy since writing the object instance graph is done recursively.
- Parameters:
depth (
int
) – the starting indentation depthwriter (
TextIOBase
) – the writer to dump the content of this writable
zensols.deepnlp.transformer.wordpiece module¶
Word piece mappings to feature tokens, sentences and documents.
There are often edge cases and tricky situations with certain models' usage of
special tokens (i.e. [CLS]
) and where they are used. With this in mind,
this module attempts to:
Assist in debugging (works with detached
TokenizedDocument
) in cases where token level embeddings are directly accessed, and
Map both token and sentence level embeddings to their respective originating natural language feature set data structures.
- class zensols.deepnlp.transformer.wordpiece.CachingWordPieceFeatureDocumentFactory(tokenizer, embed_model, token_embeddings=True, sent_embeddings=True, stash=None, hasher=<factory>)[source]¶
Bases:
WordPieceFeatureDocumentFactory
Caches the documents and their embeddings in a
Stash
. For those that are cached, the embeddings are copied over to the passed document increate()
.- __init__(tokenizer, embed_model, token_embeddings=True, sent_embeddings=True, stash=None, hasher=<factory>)¶
- create(fdoc, tdoc=None)[source]¶
Create a document as an object graph that relates word pieces to feature tokens. Note that if
tdoc
is provided, it must have been tokenized fromfdoc
.- Parameters:
fdoc (
FeatureDocument
) – the feature document used to create tdoctdoc (
TokenizedFeatureDocument
) – a tokenized feature document generated bytokenize()
- Return type:
- Returns:
a data structure with the word piece information
- class zensols.deepnlp.transformer.wordpiece.WordPiece(word, vocab_index, index)[source]¶
Bases:
PersistableContainer
,Dictable
The word piece data.
- __init__(word, vocab_index, index)¶
-
index:
int
¶ The index of the word piece subword in the tokenization tensor, which will have the same index in the output embeddings for
TransformerEmbedding.output
=last_hidden_state
.
- class zensols.deepnlp.transformer.wordpiece.WordPieceDocumentDecorator(word_piece_doc_factory)[source]¶
Bases:
FeatureDocumentDecorator
Populates sentence and token embeddings in the documents.
- __init__(word_piece_doc_factory)¶
-
word_piece_doc_factory:
WordPieceFeatureDocumentFactory
¶ The feature document factory that populates embeddings.
- class zensols.deepnlp.transformer.wordpiece.WordPieceFeatureDocument(sents, text=None, spacy_doc=None, tokenized=None)[source]¶
Bases:
FeatureDocument
,WordPieceTokenContainer
A document made up of word piece sentences.
- __init__(sents, text=None, spacy_doc=None, tokenized=None)¶
- property embedding: Tensor¶
The document embedding (see
WordPieceFeatureSpan.embedding
).- Shape:
(|sentences|, <embedding dimension>)
- tokenized: TokenizedFeatureDocument = None¶
The tokenized feature document.
- write(depth=0, writer=<_io.TextIOWrapper name='<stdout>' mode='w' encoding='utf-8'>)[source]¶
Write the document and optionally sentence features.
- Parameters:
n_sents – the number of sentences to write
n_tokens – the number of tokens to print across all sentences
include_original – whether to include the original text
include_normalized – whether to include the normalized text
- class zensols.deepnlp.transformer.wordpiece.WordPieceFeatureDocumentFactory(tokenizer, embed_model, token_embeddings=True, sent_embeddings=True)[source]¶
Bases:
object
Create instances of
WordPieceFeatureDocument
fromFeatureDocument
instances. It does this by iterating through a feature document data structure and addingWordPiece*
object data and optionally adding the corresponding sentence and/or token level embeddings.The embeddings can also be added with
add_token_embeddings()
andadd_sent_embeddings()
individually. If all you want are the sentence level embeddings, you can useadd_sent_embeddings()
on aFeatureSentence
instance.- __init__(tokenizer, embed_model, token_embeddings=True, sent_embeddings=True)¶
- add_sent_embeddings(doc, arr)[source]¶
Add sentence embeddings to the sentences of
doc
.- Parameters:
doc (
Union
[WordPieceFeatureDocument
,FeatureDocument
]) – sentences of this doc have embeddings
set to the corresponding sentence tensor with shape (1, <embedding dimension>)
.
- add_token_embeddings(doc, arr)[source]¶
Add token embeddings to the sentences of
doc
. This assumes tokens are of typeWordPieceFeatureToken
since the token indices are needed.- Parameters:
doc (
WordPieceFeatureDocument
) – sentences of this doc have embeddings
set to the corresponding sentence tensor with shape (1, <embedding dimension>).
- create(fdoc, tdoc=None)[source]¶
Create a document as an object graph that relates word pieces to feature tokens. Note that if
tdoc
is provided, it must have been tokenized fromfdoc
.- Parameters:
fdoc (
FeatureDocument
) – the feature document used to create tdoctdoc (
TokenizedFeatureDocument
) – a tokenized feature document generated bytokenize()
- Return type:
- Returns:
a data structure with the word piece information
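A hedged sketch; the tokenizer, embed_model and doc instances are assumed to be configured as in the previous modules:
from zensols.deepnlp.transformer.wordpiece import \
    WordPieceFeatureDocumentFactory

factory = WordPieceFeatureDocumentFactory(tokenizer, embed_model)
wp_doc = factory.create(doc)     # a WordPieceFeatureDocument
for sent in wp_doc.sents:
    # sentence level embedding with shape (<embedding dimension>,)
    print(sent.embedding.shape)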
-
embed_model:
TransformerEmbedding
¶ Used to populate the embeddings in
WordPiece*
classes.
- populate(doc, truncate=False)[source]¶
Populate sentence embeddings in a document by first feature parsing a new document with
create()
and then copying the embeddings with WordPieceFeatureDocument.copy_embeddings().
- Parameters:
truncate (
bool
) – if sentence lengths differ (i.e. from using different models to chunk sentences) trim the longer document to match the shorter
-
tokenizer:
TransformerDocumentTokenizer
¶ Used to tokenize documents that aren't already tokenized in
__call__()
.
- class zensols.deepnlp.transformer.wordpiece.WordPieceFeatureSentence(tokens, text=None, spacy_span=None, embedding=None)[source]¶
Bases:
WordPieceFeatureSpan
,FeatureSentence
- __init__(tokens, text=None, spacy_span=None, embedding=None)¶
- class zensols.deepnlp.transformer.wordpiece.WordPieceFeatureSpan(tokens, text=None, spacy_span=None, embedding=None)[source]¶
Bases:
FeatureSentence
,WordPieceTokenContainer
A sentence made up of word pieces.
- __init__(tokens, text=None, spacy_span=None, embedding=None)¶
- embedding: Tensor = None¶
The sentence level (i.e.
[CLS]
) embedding from the transformer.- Shape:
(<embedding dimension>,)
- write(depth=0, writer=<_io.TextIOWrapper name='<stdout>' mode='w' encoding='utf-8'>)[source]¶
Write the text container.
- Parameters:
include_original – whether to include the original text
include_normalized – whether to include the normalized text
n_tokens – the number of tokens to write
inline – whether to print the tokens on one line each
- class zensols.deepnlp.transformer.wordpiece.WordPieceFeatureToken(i, idx, i_sent, norm, words, embedding=None)[source]¶
Bases:
FeatureToken
The token and the word pieces that represent it.
- __init__(i, idx, i_sent, norm, words, embedding=None)¶
- clone(cls=None, **kwargs)[source]¶
Clone an instance of this token.
- Parameters:
cls (
Type
) – the type of the new instance
kwargs – arguments to add as attributes to the clone
- Return type:
- Returns:
the cloned instance of this instance
- detach(*args, **kwargs)[source]¶
Create a detached token (i.e. from spaCy artifacts).
- Parameters:
feature_ids – the features to write, which defaults to
FEATURE_IDS
skip_missing – whether to only keep
feature_ids
cls – the type of the new instance
- Return type:
-
embedding:
Tensor
= None¶ The embedding for
words
after using the transformer.- Shape:
(|words|, <embedding dimension>)
- property indexes: Tuple[int]¶
The indexes of the word piece subwords (see
WordPiece.index
).
- property token_embedding: Tensor¶
The embedding of this token, which is the sum of the word piece embeddings.
- write(depth=0, writer=<_io.TextIOWrapper name='<stdout>' mode='w' encoding='utf-8'>)[source]¶
Write this instance as either a
Writable
or as aDictable
. If class attribute_DICTABLE_WRITABLE_DESCENDANTS
is set asTrue
, then use thewrite()
method on children instead of writing the generated dictionary. Otherwise, write this instance by first creating adict
recursively usingasdict()
, then formatting the output.If the attribute
_DICTABLE_WRITE_EXCLUDES
is set, those attributes are removed from what is written in thewrite()
method.Note that this attribute will need to be set in all descendants in the instance hierarchy since writing the object instance graph is done recursively.
- Parameters:
depth (
int
) – the starting indentation depthwriter (
TextIOBase
) – the writer to dump the content of this writable
- class zensols.deepnlp.transformer.wordpiece.WordPieceFeatureVectorizer(name, config_factory, feature_id, manager, encode_transformed=False, fold_method='raise', embed_model=None, decode_embedding=True, word_piece_doc_factory=None, access='raise')[source]¶
Bases:
EmbeddingFeatureVectorizer
Uses the
embeddings
attributes added to documents, sentences and tokens populated byWordPieceFeatureDocumentFactory
. Currently only sentence sequences are supported. For single sentence or token classification, usezensols.deepnlp.vectorizers
If aggregated documents are given to the vectorizer, they are flattened into sentences and vectorized in the same way a single document's sentences would be vectorized. A batch is created for each document and only one batch is created for singleton documents.
This embedding layer expects the following attribute settings to be left with their defaults: encode_transformed,
fold_method
, decode_embedding
.- Shape:
- DESCRIPTION: ClassVar[str] = 'wordpiece'¶
- FEATURE_TYPE: ClassVar[TextFeatureType] = 4¶
- __init__(name, config_factory, feature_id, manager, encode_transformed=False, fold_method='raise', embed_model=None, decode_embedding=True, word_piece_doc_factory=None, access='raise')¶
- access: str = 'raise'¶
What to do when accessing the sentence embedding when encoding. This is one of:
raise
: raises an error when missingadd_missing
: create the embedding only if missingclobber
: always create a new embedding by replacing (if existed)
- decode_embedding: bool = True¶
Turn off the
embed_model
forward pass to use the embeddings we vectorized from theembedding
attribute(s). Keep the default.
- embed_model: TransformerEmbedding = None¶
This field is not applicable to this vectorizer–keep the default.
- encode(doc)[source]¶
Encode by combining documents into one monolithic document when a tuple is passed, otherwise default to the superclass's encode functionality.
- Return type:
- encode_transformed: bool = False¶
This field is not applicable to this vectorizer–keep the default.
- fold_method: str = 'raise'¶
This field is not applicable to this vectorizer–keep the default.
- word_piece_doc_factory: WordPieceFeatureDocumentFactory = None¶
The feature document factory that populates embeddings.
- class zensols.deepnlp.transformer.wordpiece.WordPieceTokenContainer[source]¶
Bases:
TokenContainer
Like
TokenContainer
but contains word pieces.
Module contents¶
Contains classes that adapt the huggingface transformers to the Zensols deep learning framework.
- zensols.deepnlp.transformer.normalize_huggingface_logging()[source]¶
Make the transformers package use default logging. Using this and setting the
transformers
logging package to ERROR
level logging has the same effect as suppress_warnings()
.