Resource Library¶
The DeepZensols NLP framework has a comprehensive resource library that configures popular models, so that many standard language models can be used with little to no code. This document highlights the configuration available in the API and in the deepnlp resource library included with this package.
Embedding¶
The models configured by the deepnlp resource library files include non-contextual word embeddings (e.g. GloVE), a frozen transformer model (e.g. BERT) and a fine-tunable (trainable) transformer model.
The Zensols Deep NLP library supports word embeddings for GloVE, word2vec,
fastText and BERT. The embedding section of the GloVE resource library
specifies the word vector models and the layers that use them:
[glove_50_embedding]
class_name = zensols.deepnlp.embed.GloveWordEmbedModel
path = path: ${default:corpus_dir}/glove
desc = 6B
dimension = 50
lowercase = True
This defines the 6 billion token (400K vocabulary) 50 dimension GloVE model
with a GloveWordEmbedModel instance. The lowercase property tells the
framework to lowercase all queries to the model, since the word vectors were
trained on a lowercased corpus.
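As a rough illustration of what this section expresses, the sketch below reads the same INI text with Python's standard configparser and applies the lowercase flag to a toy lookup table. It is not the framework's configuration loader or the GloveWordEmbedModel implementation: the corpus_dir value, the TOY_VECTORS table and the lookup helper are assumptions made for the example.

```python
from configparser import ConfigParser, ExtendedInterpolation

# Same section as above, plus an assumed [default] section so that
# ${default:corpus_dir} has something to resolve to.
CONFIG = """
[default]
corpus_dir = /tmp/corpus

[glove_50_embedding]
class_name = zensols.deepnlp.embed.GloveWordEmbedModel
path = path: ${default:corpus_dir}/glove
desc = 6B
dimension = 50
lowercase = True
"""

parser = ConfigParser(interpolation=ExtendedInterpolation())
parser.read_string(CONFIG)
section = parser["glove_50_embedding"]

dimension = section.getint("dimension")
lowercase = section.getboolean("lowercase")
resolved_path = section["path"]

# Hypothetical stand-in for the real vectors: keys are lowercased, as they
# would be for a model trained on a lowercased corpus.
TOY_VECTORS = {"paris": [0.0] * dimension}


def lookup(word):
    """Return the toy vector for ``word`` or ``None`` if it is unknown."""
    key = word.lower() if lowercase else word
    return TOY_VECTORS.get(key)


print(resolved_path)
print(lookup("Paris") is not None)
```

Without the lowercase flag, the query "Paris" would miss the "paris" entry even though the vector exists.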
Next, the WordVectorSentenceFeatureVectorizer feature vectorizer that uses the above embedding is defined. It converts the word vector indexes (depending on the configuration) to a tensor of the word embeddings representing the corresponding sentence:
[glove_50_feature_vectorizer]
class_name = zensols.deepnlp.vectorize.WordVectorSentenceFeatureVectorizer
feature_id = wvglove50
embed_model = instance: glove_50_embedding
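Conceptually, an index is stored per token and the embedding matrix turns those indexes into rows. The following sketch shows that idea with plain Python lists. The vocabulary, the three dimensional vectors and the encode and decode helpers are invented for illustration; the real class works with PyTorch tensors and its internals may differ.

```python
# Toy vocabulary and embedding matrix (row i is the vector for word i).
# Index 0 is reserved here for unknown words; this is an assumption of the
# sketch, not a statement about the library.
VOCAB = {"<unk>": 0, "the": 1, "cat": 2, "sat": 3}
MATRIX = [
    [0.0, 0.0, 0.0],  # <unk>
    [0.1, 0.2, 0.3],  # the
    [0.4, 0.5, 0.6],  # cat
    [0.7, 0.8, 0.9],  # sat
]


def encode(tokens):
    """Map tokens to vocabulary indexes (the compact form of a sentence)."""
    return [VOCAB.get(tok.lower(), VOCAB["<unk>"]) for tok in tokens]


def decode(indexes):
    """Map indexes to embedding rows, giving a (tokens x dimension) matrix."""
    return [MATRIX[i] for i in indexes]


indexes = encode(["The", "cat", "slept"])
sentence = decode(indexes)
print(indexes)                          # [1, 2, 0]
print(len(sentence), len(sentence[0]))  # 3 3
```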
The last configuration needed is a WordVectorEmbeddingLayer, which extends
torch.nn.Module and is used by the PyTorch framework to utilize the word
embedding:
[glove_50_embedding_layer]
class_name = zensols.deepnlp.vectorize.WordVectorEmbeddingLayer
embed_model = instance: glove_50_embedding
feature_vectorizer = instance: language_feature_manager
This module uses the GloVE embedding model, by way of a torch.nn.Embedding,
as the input at the beginning of the PyTorch forward process. The reference
to language_feature_manager is covered later.
The embedding resource libraries have similar definitions: the GloVE resource
library also defines the 300 dimension model, the word2vec resource library
defines Google’s pre-trained 300 dimension model, the fasttext resource
library defines Facebook’s pre-trained News and Crawl embeddings, and the
transformer resource library contains BERT embeddings. When decode_embedding
is set to true, the embeddings are created at decode time rather than at the
time the batch is processed. The transformer_trainable_resource:model_id
option is the HuggingFace model identifier to use, such as bert-base-cased,
bert-large-cased, distilbert-base-cased or roberta-base.
Vectorizer Configuration¶
Linguistic features are vectorized at one of the following levels:
token: token level with a shape congruent with the number of tokens, typically concatenated with the embedding layer
document: document level, typically added to a join layer
embedding: embedding layer, typically used as the input layer
Each FeatureDocumentVectorizer, which extends the deeplearn API
EncodableFeatureVectorizer class, defines a FEATURE_TYPE of type
TextFeatureType that indicates this level. We’ll see examples of these later
in the configuration. See the deeplearn API for more information on the base
class deeplearn vectorizers.
The next configuration defines an EnumContainerFeatureVectorizer in the vectorizer resource library, which vectorizes spaCy features into one-hot encoded vectors at the token level. In this configuration, POS tags, NER tags and the dependency head tree are vectorized. See SpacyFeatureVectorizer for more information.
[enum_feature_vectorizer]
class_name = zensols.deepnlp.vectorize.EnumContainerFeatureVectorizer
feature_id = enum
decoded_feature_ids = set: ent, tag, dep
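To make the token level one-hot encoding concrete, here is a minimal sketch for a single feature. The three tag inventory and the helper names are assumptions for the example (spaCy's real tag sets are much larger), and the framework's vectorizer is not implemented this way line for line.

```python
# Hypothetical tag inventory used only for this sketch.
TAGS = ["DET", "NOUN", "VERB"]
TAG_INDEX = {tag: i for i, tag in enumerate(TAGS)}


def one_hot(tag):
    """Return a one-hot list for ``tag``; unknown tags give all zeros."""
    vec = [0] * len(TAGS)
    if tag in TAG_INDEX:
        vec[TAG_INDEX[tag]] = 1
    return vec


def vectorize_tokens(tags):
    """One row per token, so the shape is (tokens x tag inventory)."""
    return [one_hot(tag) for tag in tags]


token_matrix = vectorize_tokens(["DET", "NOUN", "VERB", "NOUN"])
for row in token_matrix:
    print(row)
```

Because there is one row per token, the result lines up with the token embeddings and can be concatenated with them.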
Similarly, the CountEnumContainerFeatureVectorizer encodes counts of each feature in the text at the document level.
[count_feature_vectorizer]
class_name = zensols.deepnlp.vectorize.CountEnumContainerFeatureVectorizer
feature_id = count
decoded_feature_ids = eval: set('ent tag dep'.split())
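The document level counterpart can be sketched the same way: instead of one row per token, the whole text collapses into a single vector of counts. As before, the tag inventory and the count_vector helper are assumptions for illustration only.

```python
from collections import Counter

# Hypothetical tag inventory used only for this sketch.
TAGS = ["DET", "NOUN", "VERB"]


def count_vector(tags):
    """Return one count per known tag for the whole document."""
    counts = Counter(tags)
    return [counts[tag] for tag in TAGS]


doc_vector = count_vector(["DET", "NOUN", "VERB", "DET", "NOUN"])
print(doc_vector)  # [2, 2, 1]
```

The output length depends only on the tag inventory, not on the number of tokens, which is why such features are typically added at a join layer rather than concatenated with token embeddings.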
The language_feature_manager configuration is used to create a
FeatureDocumentVectorizerManager, which is a language specific vectorizer
manager that uses the FeatureDocumentParser we defined earlier with the
doc_parser entry. This class extends FeatureVectorizerManager as an NLP
specific manager that creates and encodes the word embeddings and the other
configured linguistic feature vectorizers. The token_length parameter is the
length of sentences or documents in number of tokens.
[language_feature_manager]
class_name = zensols.deepnlp.vectorize.FeatureDocumentVectorizerManager
torch_config = instance: torch_config
configured_vectorizers = eval: [
  'word2vec_300_feature_vectorizer',
  'glove_50_feature_vectorizer',
  'glove_300_feature_vectorizer',
  'transformer_feature_vectorizer',
  'enum_feature_vectorizer',
  'count_feature_vectorizer',
  'language_stats_feature_vectorizer',
  'depth_token_feature_vectorizer']
doc_parser = instance: doc_parser
token_length = ${language_defaults:token_length}
token_feature_ids = ${doc_parser:token_feature_ids}
Text Classification¶
The text classification resource library provides configuration for components and models used to classify tokens and text.
See the Clickbate example of how this resource library is used.
Vectorization (Text)¶
This configuration set defines the vectorizer for the label itself. It uses
the categories option, which is provided in the application context, as the
labels:
[classify_label_vectorizer]
class_name = zensols.deeplearn.vectorize.NominalEncodedEncodableFeatureVectorizer
#categories = y, n
feature_id = lblabel
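A nominal encoding simply assigns each category an integer and keeps the reverse mapping for decoding. The sketch below assumes the two categories from the commented-out option (y and n) and uses plain dictionaries; the framework's vectorizer may order or store labels differently.

```python
# Hypothetical categories; the commented-out option above suggests 'y, n'.
CATEGORIES = ["y", "n"]

# Sort so the label to index mapping is stable across runs.
LABEL_TO_INDEX = {label: i for i, label in enumerate(sorted(CATEGORIES))}
INDEX_TO_LABEL = {i: label for label, i in LABEL_TO_INDEX.items()}


def encode_labels(labels):
    """Turn human readable labels into integer class indexes."""
    return [LABEL_TO_INDEX[label] for label in labels]


def decode_labels(indexes):
    """Turn integer class indexes back into human readable labels."""
    return [INDEX_TO_LABEL[i] for i in indexes]


encoded = encode_labels(["y", "n", "n"])
print(encoded)                 # [1, 0, 0]
print(decode_labels(encoded))  # ['y', 'n', 'n']
```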
We define a manager and manager set separate from the linguistic configuration since the package space is different:
# the vectorizer for labels is not language specific and lives in the
# zensols.deeplearn.vectorize package, so it needs its own instance
[classify_label_vectorizer_manager]
class_name = zensols.deeplearn.vectorize.FeatureVectorizerManager
torch_config = instance: torch_config
configured_vectorizers = list: classify_label_vectorizer
[vectorizer_manager_set]
names = list: language_vectorizer_manager, classify_label_vectorizer_manager
Batch Stash¶
The batch stash configuration should look familiar if you have read through the deeplearn API batch stash documentation. The configuration below is for a BatchDirectoryCompositeStash, which splits data into separate files across features for each batch.
In this configuration, we split the label, embeddings, and linguistic features into their own groups so that we can experiment using different embeddings for each test. Using BERT will take the longest since each sentence will be computed during decoding.
However, GloVE 50D embeddings vectorize much more quickly, as only the indexes are stored and then retrieved on demand in the PyTorch API. Our caching strategy also changes as we can (with most graphics cards) fit the entire GloVE 50D embedding in GPU memory. Our composite stash configuration follows:
[batch_dir_stash]
groups = eval: (
    set('label'.split()),
    set('glove_50_embedding'.split()),
    ...
    set('transformer_enum_expander transformer_dep_expander'.split()))
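The benefit of grouping can be seen in a small standalone sketch: each group is written to its own file, so an experiment that only needs the label and one embedding never reads the other files. This is not the BatchDirectoryCompositeStash implementation; the file naming scheme, the third group and the dump_batch and load_batch helpers are assumptions for the example.

```python
import pickle
import tempfile
from pathlib import Path

# Feature groups, mirroring the shape of the ``groups`` option above.
GROUPS = (
    {"label"},
    {"glove_50_embedding"},
    {"enum", "count"},  # hypothetical linguistic feature group
)


def group_name(group):
    """Build a stable file name stem from the feature names in a group."""
    return "-".join(sorted(group))


def dump_batch(root, batch_id, features):
    """Write one file per group holding only that group's features."""
    for group in GROUPS:
        part = {name: features[name] for name in group if name in features}
        path = Path(root) / f"{batch_id}.{group_name(group)}.pkl"
        path.write_bytes(pickle.dumps(part))


def load_batch(root, batch_id, wanted):
    """Read back only the group files that contain a wanted feature."""
    loaded = {}
    for group in GROUPS:
        if group & set(wanted):
            path = Path(root) / f"{batch_id}.{group_name(group)}.pkl"
            loaded.update(pickle.loads(path.read_bytes()))
    return loaded


with tempfile.TemporaryDirectory() as root:
    batch = {"label": [1, 0], "glove_50_embedding": [[3, 7], [2, 9]],
             "enum": [[1, 0], [0, 1]], "count": [2, 1]}
    dump_batch(root, 0, batch)
    files = sorted(p.name for p in Path(root).iterdir())
    subset = load_batch(root, 0, ["label", "glove_50_embedding"])

print(files)
print(sorted(subset))
```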
The batch stash is configured next. This configuration uses dynamic batch mappings, which map feature attribute names used in the code with the feature IDs used in vectorizers:
[batch_stash]
data_point_type = eval({'import': ['zensols.deepnlp.classify']}): zensols.deepnlp.classify.LabeledFeatureDocumentDataPoint
batch_feature_mappings = dataclass(zensols.deeplearn.batch.ConfigBatchFeatureMapping): classify_batch_mappings
LabeledFeatureDocumentDataPoint is a subclass of the DataPoint class that
contains a FeatureDocument, and classify_batch_mappings is a reference to the
batch binding in classify-batch.yml, which is defined as:
classify_batch_mappings:
  batch_feature_mapping_adds:
    - 'dataclass(zensols.deeplearn.batch.BatchFeatureMapping): classify_label_batch_mappings'
    - 'dataclass(zensols.deeplearn.batch.BatchFeatureMapping): lang_batch_mappings'
The root defines a section, and the second level adds the classification and language specific mappings. The classify batch mappings are:
classify_label_batch_mappings:
  label_attribute_name: label
  manager_mappings:
    - vectorizer_manager_name: classify_label_vectorizer_manager
      fields:
        - attr: label
          feature_id: lblabel
          is_agg: true
This says to use the single label mapping under fields for the label, which
the framework also uses to calculate performance metrics.
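One way to read this mapping is as a lookup from an attribute name used in code to the vectorizer manager and feature ID that handle it. The sketch below writes the YAML as the nested structure a YAML parser would return and walks it. The resolve helper is hypothetical and only illustrates the data layout, not how BatchFeatureMapping works internally.

```python
# The YAML above, written as the nested structure a YAML parser would give.
MAPPINGS = {
    "classify_label_batch_mappings": {
        "label_attribute_name": "label",
        "manager_mappings": [
            {
                "vectorizer_manager_name": "classify_label_vectorizer_manager",
                "fields": [
                    {"attr": "label", "feature_id": "lblabel", "is_agg": True},
                ],
            }
        ],
    }
}


def resolve(mapping, attr):
    """Return (manager name, feature ID) for a batch attribute name."""
    for manager in mapping["manager_mappings"]:
        for field in manager["fields"]:
            if field["attr"] == attr:
                return manager["vectorizer_manager_name"], field["feature_id"]
    raise KeyError(attr)


mapping = MAPPINGS["classify_label_batch_mappings"]
label_attr = mapping["label_attribute_name"]
resolved = resolve(mapping, label_attr)
print(resolved)
```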
Facade¶
The facade is configured as a ClassifyModelFacade:
[facade]
class_name = zensols.deepnlp.classify.ClassifyModelFacade
This class extends LanguageModelFacade, which supports natural language model feature updating and sets up logging. This class is used both from the command line and the Jupyter notebook via the CLI facade applications.
This facade class adds classification specific functionality, including feature updating from a Jupyter notebook or Python REPL.
@dataclass
class ClassifyModelFacade(LanguageModelFacade):
    LANGUAGE_MODEL_CONFIG = LanguageModelFacadeConfig(
        manager_name=ReviewBatch.LANGUAGE_FEATURE_MANAGER_NAME,
        attribs=ReviewBatch.LANGUAGE_ATTRIBUTES,
        embedding_attribs=ReviewBatch.EMBEDDING_ATTRIBUTES)
The framework uses this configuration through the following override:
    def _get_language_model_config(self) -> LanguageModelFacadeConfig:
        return self.LANGUAGE_MODEL_CONFIG
Setting dropout on the facade triggers property setters that propagate the value to the sub-settings (linear and recurrent layers):
    def __post_init__(self, *args, **kwargs):
        super().__post_init__(*args, **kwargs)
        settings: NetworkSettings = self.executor.net_settings
        if hasattr(settings, 'dropout'):
            # set to trigger writeback through to sub settings (linear, recur)
            self.dropout = self.executor.net_settings.dropout
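The write-back pattern itself is ordinary Python: a property setter stores the value and pushes it to nested settings objects. The classes below are hypothetical stand-ins written for this sketch and are not the framework's NetworkSettings or facade classes.

```python
class SubSettings:
    """Hypothetical settings for one sub-network (linear or recurrent)."""

    def __init__(self, dropout=None):
        self.dropout = dropout


class NetSettings:
    """Hypothetical container that owns the sub-network settings."""

    def __init__(self, linear, recurrent, dropout=None):
        self.linear = linear
        self.recurrent = recurrent
        self._dropout = None
        # Assign through the property so the initial value propagates too.
        self.dropout = dropout

    @property
    def dropout(self):
        return self._dropout

    @dropout.setter
    def dropout(self, value):
        # Store locally, then write the value through to each sub-setting.
        self._dropout = value
        for sub in (self.linear, self.recurrent):
            if sub is not None:
                sub.dropout = value


settings = NetSettings(SubSettings(), SubSettings())
settings.dropout = 0.2
print(settings.linear.dropout, settings.recurrent.dropout)  # 0.2 0.2
```

Re-assigning a value that is already set, as the facade does in its post-init hook, is what forces the setter to run and the sub-settings to be synchronized.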
We can also override the get_predictions method to include the review text and its length when creating the data frame and respective CSV export:
    def get_predictions(self, *args, **kwargs) -> pd.DataFrame:
        return super().get_predictions(
            ('text', 'len'),
            lambda dp: (dp.doc.text, len(dp.doc.text)),
            *args, **kwargs)
Model (Text)¶
The model section configures the ClassifyNetworkSettings, which is either a BiLSTM with an optional CRF output layer or a transformer (see the movie review sentiment example for how this can be configured in both settings).
[classify_net_settings]
class_name = zensols.deepnlp.classify.ClassifyNetworkSettings
#embedding_layer = instance: ${deepnlp_default:embedding}_layer
recurrent_settings = None
linear_settings = instance: linear_settings
batch_stash = instance: batch_stash
dropout = None
The batch_stash instance is configured on this model so it has access to the
dynamic batch metadata for the embedding layer. The commented out
embedding_layer option has to be overridden and set to the embedding layer
instance that creates the input embeddings from the input text. The
linear_settings option configures the network between the recurrent network
and the output CRF (if one is configured).
Prediction (Text)¶
The prediction mapper uses the model to classify text from the command line. For text classification, the ClassificationPredictionMapper is used and takes text given from the command line and predicts a label:
[classify_feature_prediction_mapper]
class_name = zensols.deepnlp.classify.ClassificationPredictionMapper
vec_manager = instance: language_vectorizer_manager
label_feature_id = classify_label_vectorizer_manager.lblabel
This component needs the vectorizer manager that creates the vectorized label, and the nominal vectorizer, which uses a scikit-learn LabelEncoder to reverse map the prediction back to the human readable label.
Token Classification¶
Token classification refers to labeling tokens instead of a string of text as
with text classification. However, there is some crossover functionality
between these two tasks, so the token classification resource library uses
some of the same components (not configuration) defined in the text
classification resource library. For example, we reuse the
ClassifyModelFacade by overriding the class in the facade section.
Note: despite this overlap, import only the text classification resource library for text classification projects and only the token classification resource library for token classification projects, never both.
Only the notable differences compared to the text classification section are documented.
See the NER example of how this resource library is used.
Vectorization (Token)¶
This section has the token label vectorizers and mask vectorizers. The mask is needed for the CRF (when configured) to mask out blank tokens for sentences shorter than the maximum length. Usually, zeroed tensors are used for unused token slots, for example in the word embedding layer of deep learning networks. This works because the network learns the zero vectors for shorter sentences. However, the CRF layer needs to block these as valid state transitions during training and testing.
tok_label_1_vectorizer:
  class_name: zensols.deeplearn.vectorize.NominalEncodedEncodableFeatureVectorizer
  feature_id: tclabel1
tok_label_vectorizer:
  class_name: zensols.deeplearn.vectorize.AggregateEncodableFeatureVectorizer
  feature_id: tclabel
  size: -1
  delegate_feature_id: tclabel1
tok_mask_vectorizer:
  class_name: zensols.deeplearn.vectorize.MaskFeatureVectorizer
  feature_id: tmask
  size: -1
tok_label_batch_mappings:
  manager_mappings:
    - vectorizer_manager_name: tok_label_vectorizer_manager
      fields:
        - attr: tok_labels
          feature_id: tclabel
          is_agg: true
          is_label: True
        - attr: tok_mask
          feature_id: tmask
          is_agg: true
          attr_access: tok_labels
tok_label_vectorizer_manager:
  class_name: zensols.deeplearn.vectorize.FeatureVectorizerManager
  torch_config: 'instance: torch_config'
  configured_vectorizers:
    - tok_label_1_vectorizer
    - tok_label_vectorizer
    - tok_mask_vectorizer
# add new feature vectorizer managers
vectorizer_manager_set:
  names:
    - language_vectorizer_manager
    - tok_label_vectorizer_manager
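A mask of this kind marks which token slots hold real tokens and which are padding. The sketch below builds one from sentence lengths using plain lists. It assumes the maximum length is the longest sentence in the batch, which is a choice made for the example rather than a statement about what size: -1 means; make_mask is a hypothetical helper, not the MaskFeatureVectorizer implementation.

```python
def make_mask(lengths, max_len):
    """Return one row per sentence: 1 for real tokens, 0 for padding."""
    return [[1 if i < n else 0 for i in range(max_len)] for n in lengths]


# Token labels for three sentences of different lengths.
sentences = [["B-PER", "O", "O"], ["O"], ["B-LOC", "I-LOC"]]

# Assumption of this sketch: pad to the longest sentence in the batch.
max_len = max(len(s) for s in sentences)
mask = make_mask([len(s) for s in sentences], max_len)

for row in mask:
    print(row)
```

A sequence layer can then skip every position where the mask is 0, so that padding never counts as a label transition.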
Model (Token)¶
The SequenceBatchIterator configured in model_settings indicates that a
different scoring method is used. This class is used in the framework to
calculate a different loss and produce the output, which must be treated
differently than neural float tensor output. This is because the Viterbi
algorithm is used to determine the lowest cost path through the elements. The
sum of this path is used as the cost instead of a differential optimization
function.
model_settings:
  batch_iteration_class_name: zensols.deeplearn.model.SequenceBatchIterator
  reduce_outcomes: none
  prediction_mapper_name: feature_prediction_mapper
recurrent_crf_net_settings:
  mask_attribute: tok_mask
Because we use a CRF as the output layer for EmbeddedRecurrentCRF, our
output is the NER labels. Therefore, we must also set reduce_outcomes to none
to pass the CRF output through unaltered.
In the recurrent_crf_net_settings section, we override the mask_attribute,
which tells the recurrent CRF to use the tok_mask attribute when masking the
label output.
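For readers unfamiliar with Viterbi decoding, the following is a minimal lowest cost path sketch over per-position label costs and label to label transition costs. It is a textbook version written for illustration, with made-up cost values and no masking; it is not the CRF layer used by the framework.

```python
def viterbi(emissions, transitions):
    """Return (cost, path) of the lowest cost label sequence.

    emissions[t][j] is the cost of label j at position t and
    transitions[i][j] is the cost of moving from label i to label j.
    """
    n_labels = len(emissions[0])
    cost = list(emissions[0])
    back = []
    for t in range(1, len(emissions)):
        new_cost, pointers = [], []
        for j in range(n_labels):
            # Cheapest previous label to arrive at label j from.
            best_i = min(range(n_labels),
                         key=lambda i: cost[i] + transitions[i][j])
            pointers.append(best_i)
            new_cost.append(cost[best_i] + transitions[best_i][j]
                            + emissions[t][j])
        cost = new_cost
        back.append(pointers)
    # Follow the back pointers from the cheapest final label.
    last = min(range(n_labels), key=lambda j: cost[j])
    path = [last]
    for pointers in reversed(back):
        path.append(pointers[path[-1]])
    path.reverse()
    return cost[last], path


# Made-up costs for three positions and two labels.
emissions = [[1.0, 3.0], [2.0, 1.0], [3.0, 1.0]]
transitions = [[0.5, 2.0], [2.0, 0.5]]
best_cost, best_path = viterbi(emissions, transitions)
print(best_cost, best_path)  # 5.5 [0, 1, 1]
```

The returned cost is the sum along the chosen path, which corresponds to the path sum described above.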
Prediction (Token)¶
This section is the same as in text classification, but we instead use the sequence based version (SequencePredictionMapper), where the token stream is used as the sequence.
feature_prediction_mapper:
  class_name: zensols.deepnlp.classify.SequencePredictionMapper
  vec_manager: 'instance: language_vectorizer_manager'
  label_feature_id: tok_label_vectorizer_manager.tclabel1