Resource Library#

The Zensols Deep NLP framework has a comprehensive resource library that configures popular models, enabling little to no code to be written for many standard language models. This document highlights the available configuration of the API and the deepnlp resource library included with this package.

Embedding#

The models configured by the deepnlp resource library files include non-contextual word embeddings (e.g. GloVe), a frozen transformer model (e.g. BERT), and a fine-tunable transformer model.

The Zensols Deep NLP library supports word embeddings for GloVe, word2vec, fastText and BERT. The embedding section of the GloVe resource library specifies which word vector models to use and the layers that use them:

[glove_50_embedding]
class_name = zensols.deepnlp.embed.GloveWordEmbedModel
path = path: ${default:corpus_dir}/glove
desc = 6B
dimension = 50
lowercase = True

which defines the 6 billion token (400K vocabulary) 50 dimension GloVe model with a GloveWordEmbedModel instance. The lowercase property tells the framework to down-case all queries to the model, since the word vectors were trained on a down-cased corpus.
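The effect of lowercase can be sketched with a toy vocabulary (values hypothetical, not the library's lookup code):

```python
# Toy GloVe-like vocabulary trained on a down-cased corpus (values made up).
vectors = {'winter': [0.12, -0.48], 'rain': [0.91, 0.33]}

def lookup(word: str, lowercase: bool = True):
    """Return the vector for ``word``, down-casing first when configured."""
    key = word.lower() if lowercase else word
    return vectors.get(key)

print(lookup('Winter'))         # → [0.12, -0.48] (query is down-cased first)
print(lookup('Winter', False))  # → None ('Winter' is not in the vocabulary)
```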

Next, the feature vectorizer WordVectorSentenceFeatureVectorizer, which uses the above embedding, is defined. It converts word vector indexes (depending on the configuration) to a tensor of the word embeddings representing the corresponding sentence:

[glove_50_feature_vectorizer]
class_name = zensols.deepnlp.vectorize.WordVectorSentenceFeatureVectorizer
feature_id = wvglove50
embed_model = instance: glove_50_embedding

The last configuration needed is a WordVectorEmbeddingLayer, which extends torch.nn.Module and is used by the PyTorch framework to utilize the word embedding:

[glove_50_embedding_layer]
class_name = zensols.deepnlp.vectorize.WordVectorEmbeddingLayer
embed_model = instance: glove_50_embedding
feature_vectorizer = instance: language_feature_manager

This module uses the GloVe embedding model as a torch.nn.Embedding input at the beginning of the PyTorch forward process. The reference to language_feature_manager is covered later.
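What the embedding layer does at forward time can be sketched in plain Python (toy matrix, not the framework's implementation):

```python
# The vectorizer stores token *indexes*; the embedding layer looks them up in
# the embedding matrix at the start of the forward pass (values made up).
embedding_matrix = [[0.1, 0.2], [0.3, 0.4], [0.5, 0.6]]  # vocab=3, dim=2

def forward(token_indexes):
    """Replace each token index with its embedding vector."""
    return [embedding_matrix[i] for i in token_indexes]

print(forward([2, 0]))  # → [[0.5, 0.6], [0.1, 0.2]]
```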

The embedding resource libraries have a similar definition for the GloVe 300 dimension model, the word2vec resource library for Google's pre-trained 300 dimension embeddings, the fasttext resource library for Facebook's pre-trained News and Crawl embeddings, and the transformer resource library for BERT embeddings. When decode_embedding is set to true, the embeddings are created at decode time rather than when the batch is processed. The transformer_trainable_resource:model_id is the HuggingFace model identifier to use, such as bert-base-cased, bert-large-cased, distilbert-base-cased, or roberta-base.
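For example, a different HuggingFace model can be selected by overriding that property (a sketch following the section and property names above; confirm against the transformer resource library):

```ini
[transformer_trainable_resource]
model_id = distilbert-base-cased
```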

Vectorizer Configuration#

Linguistic features are vectorized at one of the following levels:

  • token: token level with a shape congruent with the number of tokens, typically concatenated with the embedding layer

  • document: document level, typically added to a join layer

  • embedding: embedding layer, typically used as the input layer
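The shape distinction for the token level can be sketched with plain lists (toy sizes, no framework):

```python
# A token-level feature of shape (tokens, n_features) is concatenated row-wise
# with an embedding of shape (tokens, dim), yielding (tokens, dim + n_features).
tokens, dim, n_features = 4, 5, 3
embedding = [[0.0] * dim for _ in range(tokens)]           # (tokens, dim)
token_feats = [[1.0] * n_features for _ in range(tokens)]  # (tokens, n_features)
joined = [e + f for e, f in zip(embedding, token_feats)]   # row-wise concat
print(len(joined), len(joined[0]))  # → 4 8
```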

Each FeatureDocumentVectorizer, which extends the deeplearn API's EncodableFeatureVectorizer class, defines a FEATURE_TYPE of type TextFeatureType that indicates this level. We'll see examples of these later in the configuration. See the deeplearn API for more information on the base class vectorizers.

The next configuration defines an EnumContainerFeatureVectorizer in the vectorizer resource library, which vectorizes spaCy features into one-hot encoded vectors at the token level. In this configuration, POS tags, NER tags and the dependency head tree are vectorized. See SpacyFeatureVectorizer for more information.

[enum_feature_vectorizer]
class_name = zensols.deepnlp.vectorize.EnumContainerFeatureVectorizer
feature_id = enum
decoded_feature_ids = set: ent, tag, dep

Similarly, the CountEnumContainerFeatureVectorizer encodes counts of each feature in the text at the document level.

[count_feature_vectorizer]
class_name = zensols.deepnlp.vectorize.CountEnumContainerFeatureVectorizer
feature_id = count
decoded_feature_ids = eval: set('ent tag dep'.split())
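The difference between the token level enum encoding and the document level counts can be sketched without the framework (POS tag values hypothetical):

```python
from collections import Counter

# Hypothetical POS tags for a short sentence
tags = ['DT', 'NN', 'VBZ', 'DT', 'NN']
vocab = sorted(set(tags))            # ['DT', 'NN', 'VBZ']

# Token level: one one-hot row per token (enum-style encoding)
one_hot = [[1.0 if t == v else 0.0 for v in vocab] for t in tags]

# Document level: one count per feature value (count-style encoding)
counts = Counter(tags)
count_vec = [float(counts[v]) for v in vocab]

print(one_hot[0])   # → [1.0, 0.0, 0.0]
print(count_vec)    # → [2.0, 2.0, 1.0]
```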

The language_feature_manager configuration is used to create a FeatureDocumentVectorizerManager, which is a language specific vectorizer manager that uses the FeatureDocumentParser we defined earlier with the doc_parser entry. This class extends from FeatureVectorizerManager as an NLP specific manager that creates and encodes the word embeddings and the other linguistic feature vectorizers configured. The token_length parameter is the length of sentences or documents in number of tokens.

[language_feature_manager]
class_name = zensols.deepnlp.vectorize.FeatureDocumentVectorizerManager
torch_config = instance: torch_config
configured_vectorizers = eval: [
  'word2vec_300_feature_vectorizer',
  'glove_50_feature_vectorizer',
  'glove_300_feature_vectorizer',
  'transformer_feature_vectorizer',
  'enum_feature_vectorizer',
  'count_feature_vectorizer',
  'language_stats_feature_vectorizer',
  'depth_token_feature_vectorizer']
doc_parser = instance: doc_parser
token_length = ${language_defaults:token_length}
token_feature_ids = ${doc_parser:token_feature_ids}

Text Classification#

The text classification resource library provides configuration for components and models used to classify tokens and text.

See the Clickbate example of how this resource library is used.

Vectorization (Text)#

This configuration set defines the vectorizer for the label itself, which uses the categories option as the labels provided in the application context:

[classify_label_vectorizer]
class_name = zensols.deeplearn.vectorize.NominalEncodedEncodableFeatureVectorizer
#categories = y, n
feature_id = lblabel

We define a manager and manager set separate from the linguistic configuration since the package space is different:

# the vectorizer for labels is not language specific and lives in the
# zensols.deeplearn.vectorize package, so it needs its own instance
[classify_label_vectorizer_manager]
class_name = zensols.deeplearn.vectorize.FeatureVectorizerManager
torch_config = instance: torch_config
configured_vectorizers = list: classify_label_vectorizer

[vectorizer_manager_set]
names = list: language_vectorizer_manager, classify_label_vectorizer_manager

Batch Stash#

The batch stash configuration should look familiar if you have read through the deeplearn API batch stash documentation. The configuration below is for a BatchDirectoryCompositeStash, which splits data into separate files across features for each batch.

In this configuration, we split the label, embeddings, and linguistic features into their own groups so that we can experiment with different embeddings for each test. Using BERT will take the longest since each sentence is computed during decoding.

However, GloVe 50D embeddings vectorize much more quickly, as only the indexes are stored and quickly retrieved by the PyTorch API on demand. Our caching strategy also changes, as we can (with most graphics cards) fit the entire GloVe 50D embedding in GPU memory. Our composite stash configuration follows:

[batch_dir_stash]
groups = eval: (
       set('label'.split()),
       set('glove_50_embedding'.split()),
...
       set('transformer_enum_expander transformer_dep_expander'.split()))
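The grouping idea can be sketched in plain Python (a toy stand-in, not the library's stash implementation):

```python
# Each batch's features are split across groups so one group (e.g. an
# embedding) can be swapped or re-vectorized without touching the others.
groups = [{'label'}, {'glove_50_embedding'}]
features = {'label': [1, 0], 'glove_50_embedding': [[0.1], [0.2]]}

# one dictionary (file) per group, holding only that group's features
split = [{k: v for k, v in features.items() if k in g} for g in groups]
print([sorted(d) for d in split])  # → [['label'], ['glove_50_embedding']]
```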

The batch stash is configured next. This configuration uses dynamic batch mappings, which map feature attribute names used in the code with the feature IDs used in vectorizers:

[batch_stash]
data_point_type = eval({'import': ['zensols.deepnlp.classify']}): zensols.deepnlp.classify.LabeledFeatureDocumentDataPoint
batch_feature_mappings = dataclass(zensols.deeplearn.batch.ConfigBatchFeatureMapping): classify_batch_mappings

LabeledFeatureDocumentDataPoint is a subclass of the DataPoint class that contains a FeatureDocument, and classify_batch_mappings is a reference to the batch binding in classify-batch.yml, which is defined as:

classify_batch_mappings:
  batch_feature_mapping_adds:
    - 'dataclass(zensols.deeplearn.batch.BatchFeatureMapping): classify_label_batch_mappings'
    - 'dataclass(zensols.deeplearn.batch.BatchFeatureMapping): lang_batch_mappings'

The root defines a section; the second level adds classification and language specific mappings. The classify batch mappings are:

classify_label_batch_mappings:
  label_attribute_name: label
  manager_mappings:
    - vectorizer_manager_name: classify_label_vectorizer_manager
      fields:
        - attr: label
          feature_id: lblabel
          is_agg: true

This says to use the singleton label mapping under fields for the label, which is used by the framework to calculate performance metrics.

Facade#

The facade is configured as a ClassifyModelFacade:

[facade]
class_name = zensols.deepnlp.classify.ClassifyModelFacade

This class extends LanguageModelFacade, which supports natural language model feature updating and sets up logging. This class is used both from the command line and the Jupyter notebook via the CLI facade applications.

This facade class adds classification specific functionality, including feature updating from a Jupyter notebook or Python REPL.

@dataclass
class ClassifyModelFacade(LanguageModelFacade):
    LANGUAGE_MODEL_CONFIG = LanguageModelFacadeConfig(
        manager_name=ReviewBatch.LANGUAGE_FEATURE_MANAGER_NAME,
        attribs=ReviewBatch.LANGUAGE_ATTRIBUTES,
        embedding_attribs=ReviewBatch.EMBEDDING_ATTRIBUTES)

and used by the framework by overriding:

def _get_language_model_config(self) -> LanguageModelFacadeConfig:
	return self.LANGUAGE_MODEL_CONFIG

Setting the dropout on the facade triggers property setters that propagate the setting to the linear and recurrent layers:

def __post_init__(self, *args, **kwargs):
	super().__post_init__(*args, **kwargs)
	settings: NetworkSettings = self.executor.net_settings
	if hasattr(settings, 'dropout'):
		# set to trigger writeback through to sub settings (linear, recur)
		self.dropout = self.executor.net_settings.dropout

We can also override the get_predictions method to include the review text and its length when creating the data frame and respective CSV export:

def get_predictions(self, *args, **kwargs) -> pd.DataFrame:
	return super().get_predictions(
		('text', 'len'),
		lambda dp: (dp.doc.text, len(dp.doc.text)),
		*args, **kwargs)

Model (Text)#

The model section configures the ClassifyNetworkSettings, which is either a BiLSTM with an optional CRF output layer or a transformer (see the movie review sentiment example for how this can be configured in both settings).

[classify_net_settings]
class_name = zensols.deepnlp.classify.ClassifyNetworkSettings
#embedding_layer = instance: ${deepnlp_default:embedding}_layer
recurrent_settings = None
linear_settings = instance: linear_settings
batch_stash = instance: batch_stash
dropout = None

The batch_stash instance is configured on this model so it has access to the dynamic batch metadata for the embedding layer. The commented-out embedding_layer must be overridden and set to the embedding layer instance used to create the input embeddings from the input text. The linear_settings is the network between the recurrent network and the output CRF (if one is configured).

Prediction (Text)#

The prediction mapper uses the model to classify text from the command line. For text classification, the ClassificationPredictionMapper is used and takes text given from the command line and predicts a label:

[classify_feature_prediction_mapper]
class_name = zensols.deepnlp.classify.ClassificationPredictionMapper
vec_manager = instance: language_vectorizer_manager
label_feature_id = classify_label_vectorizer_manager.lblabel

This component needs the vectorizer manager that creates the vectorized label and the nominal vectorizer to reverse map, using a scikit-learn LabelEncoder, back to the human readable label.
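The reverse mapping behaves like scikit-learn's LabelEncoder.inverse_transform (integers in, labels out); a toy stand-in:

```python
# Toy nominal encoder (hypothetical): forward map during vectorization,
# reverse map at prediction time to recover human readable labels.
labels = ['n', 'y']
to_index = {lb: i for i, lb in enumerate(labels)}   # encode: {'n': 0, 'y': 1}

def inverse_transform(indexes):
    """Map predicted class indexes back to their nominal labels."""
    return [labels[i] for i in indexes]

print(inverse_transform([1, 0, 1]))  # → ['y', 'n', 'y']
```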

Token Classification#

Token classification refers to labeling tokens instead of a string of text as with text classification. However, there is some crossover functionality between these two tasks, so the token classification resource library uses some of the same components (not configuration) defined in the text classification resource library. For example, we reuse the ClassifyModelFacade by overriding the class in the facade section.

Note: despite this overlap, import only the text classification resource library for text classification projects and only the token classification resource library for token classification projects, but not both.

Only the notable differences compared to the text classification section are documented.

See the NER example of how this resource library is used.

Vectorization (Token)#

This section has the token label vectorizers and mask vectorizers. The mask is needed for the CRF (when configured) to mask out blank tokens for sentences shorter than the maximum length. Usually, zeroed tensors are used for unused token slots, for example in the word embedding layer of deep learning networks, since zero vectors are learned for shorter sentences. However, the CRF layer must block these as valid state transitions during training and testing.
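Padding and masking can be sketched without the framework (token values hypothetical):

```python
# Sentences as token lists; pad to the longest and build a boolean mask so a
# CRF can ignore the padded slots when scoring transitions.
sentences = [['the', 'cat', 'sat'], ['hello']]
max_len = max(len(s) for s in sentences)  # 3

padded = [s + ['<pad>'] * (max_len - len(s)) for s in sentences]
mask = [[tok != '<pad>' for tok in s] for s in padded]
print(mask)  # → [[True, True, True], [True, False, False]]
```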

tok_label_1_vectorizer:
  class_name: zensols.deeplearn.vectorize.NominalEncodedEncodableFeatureVectorizer
  feature_id: tclabel1

tok_label_vectorizer:
  class_name: zensols.deeplearn.vectorize.AggregateEncodableFeatureVectorizer
  feature_id: tclabel
  size: -1
  delegate_feature_id: tclabel1

tok_mask_vectorizer:
  class_name: zensols.deeplearn.vectorize.MaskFeatureVectorizer
  feature_id: tmask
  size: -1

tok_label_batch_mappings:
  manager_mappings:
    - vectorizer_manager_name: tok_label_vectorizer_manager
      fields:
        - attr: tok_labels
          feature_id: tclabel
          is_agg: true
          is_label: True
        - attr: tok_mask
          feature_id: tmask
          is_agg: true
          attr_access: tok_labels

tok_label_vectorizer_manager:
  class_name: zensols.deeplearn.vectorize.FeatureVectorizerManager
  torch_config: 'instance: torch_config'
  configured_vectorizers:
    - tok_label_1_vectorizer
    - tok_label_vectorizer
    - tok_mask_vectorizer

# add new feature vectorizer managers
vectorizer_manager_set:
  names:
    - language_vectorizer_manager
    - tok_label_vectorizer_manager

Model (Token)#

The SequenceBatchIterator configured in the model_settings indicates to use a different scoring method. This class is used in the framework to calculate a different loss and produce the output, which must be treated differently than neural float tensor output. This is because the Viterbi algorithm is used to determine the lowest cost path through the elements; the sum of this path is used as the cost instead of a differentiable optimization function.

model_settings:
  batch_iteration_class_name: zensols.deeplearn.model.SequenceBatchIterator
  reduce_outcomes: none
  prediction_mapper_name: feature_prediction_mapper

recurrent_crf_net_settings:
  mask_attribute: tok_mask

Because we use a CRF as the output layer for EmbeddedRecurrentCRF, our output is the NER labels. Therefore, we must also set reduce_outcomes = none to pass the CRF output through unaltered.

In the recurrent_crf_net_settings section, we override the mask_attribute, which tells the recurrent CRF to use the tok_mask attribute when masking the label output.
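The Viterbi decoding described above can be sketched minimally (toy scores, not the framework's CRF implementation):

```python
# Find the highest scoring label path through per-token emission scores with
# pairwise transition scores (dynamic programming over label indexes).
def viterbi(emissions, transitions):
    """Return (best_score, best_path) over label indexes."""
    n_labels = len(emissions[0])
    score = list(emissions[0])  # best score of any path ending in each label
    back = []                   # back pointers, one row per subsequent token
    for em in emissions[1:]:
        prev, step, score = score, [], []
        for j in range(n_labels):
            best_i = max(range(n_labels),
                         key=lambda i: prev[i] + transitions[i][j])
            score.append(prev[best_i] + transitions[best_i][j] + em[j])
            step.append(best_i)
        back.append(step)
    # follow back pointers from the best final label
    j = max(range(n_labels), key=lambda k: score[k])
    path = [j]
    for step in reversed(back):
        j = step[j]
        path.append(j)
    path.reverse()
    return max(score), path

emissions = [[2.0, 0.5], [0.5, 2.0], [2.0, 0.5]]  # 3 tokens, 2 labels
transitions = [[0.0, 0.0], [0.0, 0.0]]            # neutral transitions
print(viterbi(emissions, transitions))  # → (6.0, [0, 1, 0])
```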

Prediction (Token)#

This section is the same as for text classification, but we instead use the sequence-based version (SequencePredictionMapper), where the token stream is used as the sequence.

feature_prediction_mapper:
  class_name: zensols.deepnlp.classify.SequencePredictionMapper
  vec_manager: 'instance: language_vectorizer_manager'
  label_feature_id: tok_label_vectorizer_manager.tclabel1