zensols.deepnlp.layer package

Submodules

zensols.deepnlp.layer.conv module

Contains convolution functionality useful for NLP tasks.

class zensols.deepnlp.layer.conv.DeepConvolution1d(net_settings, logger)[source]

Bases: BaseNetworkModule

Configurable repeated series of 1-dimensional convolution, pooling, batch norm and activation layers. See get_layers().

See:

DeepConvolution1dNetworkSettings

MODULE_NAME: ClassVar[str] = 'conv'

The module name used in the logging message. This is set in each inherited class.

__init__(net_settings, logger)[source]

Initialize the deep convolution layer.

Implementation note: all layers are stored sequentially using a torch.nn.Sequential to get normal weight persistence on torch save/loads.

Parameters:
deallocate()[source]

Deallocate all resources for this instance.

get_layers()[source]

Return a tuple of layer sets, with each having the form: (convolution, max pool, batch_norm). The batch_norm element is None if not configured.

Return type:

Tuple[Tuple[Module, Module, Module]]
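The implementation note and the return form of get_layers() can be illustrated with a minimal sketch. It assumes plain torch.nn layers; the helper names build_layers and layer_sets, the kernel sizes and the tensor shapes are hypothetical and are not taken from the library.

```python
import torch
from torch import nn


def build_layers(in_channels: int, n_filters: int, repeats: int,
                 batch_norm: bool = True) -> nn.Sequential:
    """Hypothetical helper: stack (conv, max pool, batch norm) `repeats` times."""
    layers = []
    for _ in range(repeats):
        layers.append(nn.Conv1d(in_channels, n_filters, kernel_size=2, padding=1))
        layers.append(nn.MaxPool1d(kernel_size=2, stride=1))
        if batch_norm:
            layers.append(nn.BatchNorm1d(n_filters))
        # the next repeat consumes the filters produced by this one
        in_channels = n_filters
    # a Sequential registers every layer, so weights persist on save/load
    return nn.Sequential(*layers)


def layer_sets(seq: nn.Sequential, batch_norm: bool = True):
    """Group the flat sequence back into (conv, pool, batch_norm) tuples."""
    step = 3 if batch_norm else 2
    mods = list(seq)
    sets = []
    for i in range(0, len(mods), step):
        conv, pool = mods[i], mods[i + 1]
        norm = mods[i + 2] if batch_norm else None
        sets.append((conv, pool, norm))
    return tuple(sets)


seq = build_layers(in_channels=50, n_filters=8, repeats=2)
sets = layer_sets(seq)
x = torch.randn(4, 50, 20)  # (batch, embedding dimension, tokens), assumed layout
y = seq(x)
```

Because the layers live in one torch.nn.Sequential, seq.state_dict() contains the weights of every repeat without any extra bookkeeping.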

class zensols.deepnlp.layer.conv.DeepConvolution1dNetworkSettings(name, config_factory, dropout, activation, token_length=None, embedding_dimension=None, token_kernel=2, stride=1, n_filters=1, padding=1, pool_token_kernel=2, pool_stride=1, pool_padding=0, repeats=1, batch_norm_d=None)[source]

Bases: ActivationNetworkSettings, DropoutNetworkSettings, Writable

Configurable repeated series of 1-dimensional convolution, pooling, batch norm and activation layers. This layer is specifically designed for natural language processing tasks, which is why this configuration includes parameters for token counts.

Each layer repeat consists of:

  1. convolution

  2. max pool

  3. batch norm (optional)

  4. activation

This class is used directly after embedding, in conjunction with a layer class that extends EmbeddingNetworkModule. The lifecycle of this class starts with being instantiated (usually configured using an ImportConfigFactory), then cloned with clone() during the initialization of the layer from which it’s used.

Parameters:
  • token_length (int) – the number of tokens processed through the layer (used as the width kernel parameter W)

  • embedding_dimension (int) – the dimension of the embedding (word vector) layer (height dimension H and the kernel parameter F)

  • token_kernel (int) – the size of the kernel in number of tokens (width dimension of kernel parameter F)

  • n_filters (int) – number of filters to use, aka filter depth/volume (K)

  • stride (int) – the stride, which is the number of cells to skip for each convolution (S)

  • padding (int) – the number of zeroed cells added on the ends of the tokens × embedding neurons (P)

  • pool_token_kernel (int) – like token_kernel but in the pooling layer

  • pool_stride (int) – like stride but in the pooling layer

  • pool_padding (int) – like padding but in the pooling layer

  • repeats (int) – number of times the convolution, max pool, batch norm and activation layers are repeated

  • batch_norm_d (int) – the dimension of the batch norm (should be 1) or None to disable
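The symbols W, F, S and P used above are commonly related by the output-length formula floor((W - F + 2P) / S) + 1. The sketch below applies it to the token (width) dimension only, using the default values from the signature. The function name conv_out_length and the token length of 20 are assumptions for illustration; the library may compute its shapes differently.

```python
def conv_out_length(w: int, f: int, s: int = 1, p: int = 0) -> int:
    """Output length of a 1D convolution or pooling window:
    floor((W - F + 2P) / S) + 1."""
    if w + 2 * p < f:
        raise ValueError('kernel is larger than the padded input')
    return (w - f + 2 * p) // s + 1


token_length = 20  # assumed value for illustration

# defaults from the signature: token_kernel=2, stride=1, padding=1
conv_len = conv_out_length(token_length, f=2, s=1, p=1)

# defaults from the signature: pool_token_kernel=2, pool_stride=1, pool_padding=0
pool_len = conv_out_length(conv_len, f=2, s=1, p=0)
```

With these defaults a padding of 1 lengthens the sequence by one position in the convolution, and the pooling window shortens it by one again.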

See:

DeepConvolution1d

EmbeddingNetworkModule

__init__(name, config_factory, dropout, activation, token_length=None, embedding_dimension=None, token_kernel=2, stride=1, n_filters=1, padding=1, pool_token_kernel=2, pool_stride=1, pool_padding=0, repeats=1, batch_norm_d=None)
batch_norm_d: int = None
clone(module, **kwargs)[source]

Clone this network settings configuration with different embedding settings.

Parameters:
  • module (EmbeddingNetworkModule) – the embedding settings to use in the clone

  • kwargs – arguments as attributes on the clone

embedding_dimension: int = None
get_module_class_name()[source]

Returns the fully qualified class name of the module to create by ModelManager. This module takes as the first parameter an instance of this class.

Important: This method is not used for nested modules. You must declare specific class names in the configuration for those nested class names.

Return type:

str

property layer_factory: ConvolutionLayerFactory

Return the factory used to create convolution layers.

n_filters: int = 1
padding: int = 1
property pool_factory: MaxPool1dFactory

Return the factory used to create max 1D pool layers.

pool_padding: int = 0
pool_stride: int = 1
pool_token_kernel: int = 2
repeats: int = 1
stride: int = 1
token_kernel: int = 2
token_length: int = None
write(depth=0, writer=sys.stdout)[source]

Write this instance as either a Writable or as a Dictable. If class attribute _DICTABLE_WRITABLE_DESCENDANTS is set as True, then use the write() method on children instead of writing the generated dictionary. Otherwise, write this instance by first creating a dict recursively using asdict(), then formatting the output.

If the attribute _DICTABLE_WRITE_EXCLUDES is set, those attributes are removed from what is written in the write() method.

Note that this attribute will need to be set in all descendants in the instance hierarchy since writing the object instance graph is done recursively.

Parameters:
  • depth (int) – the starting indentation depth

  • writer (TextIOBase) – the writer to dump the content of this writable

zensols.deepnlp.layer.embed module

An embedding layer module useful for models that use embeddings as input.

class zensols.deepnlp.layer.embed.EmbeddingLayer(feature_vectorizer_manager, embedding_dim, sub_logger=None, trainable=False)[source]

Bases: DebugModule, Deallocatable

A class used as an input layer to provide word embeddings to a deep neural network.

Important: you must always check for attributes in deallocate() since it might be called more than once (i.e. from directly deallocating and then from the factory).

Implementation note: No dataclasses are usable since PyTorch is picky about initialization order.
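The requirement that deallocate() tolerate repeated calls can be met by checking for each attribute before releasing it. The sketch below uses a hypothetical Resource class with a hypothetical emb attribute rather than the library's own classes.

```python
class Resource:
    """Hypothetical stand-in for an object that owns a releasable resource."""

    def __init__(self):
        self.emb = [0.0] * 4

    def deallocate(self):
        # guard with hasattr so that a second call is a harmless no-op
        if hasattr(self, 'emb'):
            del self.emb


r = Resource()
r.deallocate()  # direct deallocation
r.deallocate()  # e.g. a later call from the factory; must not raise
```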

__init__(feature_vectorizer_manager, embedding_dim, sub_logger=None, trainable=False)[source]

Initialize.

Parameters:
  • feature_vectorizer_manager (FeatureDocumentVectorizerManager) – the feature vectorizer manager that manages this instance

  • embedding_dim (int) – the vector dimension of the embedding

  • trainable (bool) – True if the embedding layer is to be trained

deallocate()[source]

Deallocate all resources for this instance.

property token_length
property torch_config
class zensols.deepnlp.layer.embed.EmbeddingNetworkModule(net_settings, module_logger=None, filter_attrib_fn=None)[source]

Bases: BaseNetworkModule

A module that uses an embedding as the input layer. This class uses an instance of EmbeddingLayer provided by the network settings configuration for resolving the embedding during the forward phase.

The following attributes are created and/or set during initialization:

  • embedding: the EmbeddingLayer instance used to get the input embedding tensors

  • embedding_attribute_names: the names of the word embedding vectorized feature attributes (usually one, but possible to have more)

  • embedding_output_size: the output size of the embedding layer; note this includes any features layered/concatenated given in all token level vectorizers’ configuration

  • join_size: if a join layer is to be used, this has the size of the part of the join layer that will have the document level features

  • token_attribs: the token level feature names (see forward_token_features())

  • doc_attribs: the doc level feature names (see forward_document_features())

The initializer adds additional attributes conditional on the EmbeddingNetworkSettings instance’s batch_metadata property (type BatchMetadata). For each metadata field’s vectorizer that extends class FeatureDocumentVectorizer, the following is set on this instance based on the value of feature_type (of type TextFeatureType):

  • TOKEN: embedding_output_size is increased by the vectorizer’s shape

  • DOCUMENT: join_size is increased by the vectorizer’s shape

Fields can be filtered by passing a filter function to the initializer. See __init__() for more information.
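The size bookkeeping and the optional filter described above can be sketched in plain Python. FeatureType, Field and compute_sizes are hypothetical stand-ins for TextFeatureType, the batch field metadata and the initializer logic, and reducing a vectorizer's shape to a single integer size is an assumption made for brevity.

```python
from dataclasses import dataclass
from enum import Enum, auto
from typing import Callable, Iterable, Optional, Tuple


class FeatureType(Enum):
    """Hypothetical stand-in for TextFeatureType."""
    TOKEN = auto()
    DOCUMENT = auto()


@dataclass
class Field:
    """Hypothetical stand-in for a batch field and its vectorizer."""
    name: str
    feature_type: FeatureType
    size: int


def compute_sizes(embedding_dim: int, fields: Iterable[Field],
                  filter_fn: Optional[Callable[[Field], bool]] = None
                  ) -> Tuple[int, int]:
    """Return (embedding_output_size, join_size) for the retained fields."""
    embedding_output_size, join_size = embedding_dim, 0
    for field in fields:
        # a filter returning False drops the field; None keeps every field
        if filter_fn is not None and not filter_fn(field):
            continue
        if field.feature_type == FeatureType.TOKEN:
            embedding_output_size += field.size
        elif field.feature_type == FeatureType.DOCUMENT:
            join_size += field.size
    return embedding_output_size, join_size


fields = [Field('pos_tags', FeatureType.TOKEN, 17),
          Field('doc_counts', FeatureType.DOCUMENT, 5),
          Field('dependencies', FeatureType.TOKEN, 1)]
all_sizes = compute_sizes(300, fields)
filtered = compute_sizes(300, fields, lambda f: f.name != 'dependencies')
```

Token level features widen every token vector, while document level features only contribute to the join layer that follows.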

MODULE_NAME: ClassVar[str] = 'embed'

The module name used in the logging message. This is set in each inherited class.

__init__(net_settings, module_logger=None, filter_attrib_fn=None)[source]

Initialize the embedding layer.

Parameters:
  • net_settings (EmbeddingNetworkSettings) – the embedding layer configuration

  • module_logger – the logger to use for the forward process in this layer

  • filter_attrib_fn (Callable[[BatchFieldMetadata], bool]) – if provided, called with a BatchFieldMetadata for each field returning True if the batch field should be retained and used in the embedding layer (see class docs); if None all fields are considered

property embedding_dimension: int

Return the dimension of the embeddings, which doesn’t include any additional token or document features potentially added.

forward_document_features(batch, x=None, include_fn=None)[source]

Concatenate any document features given by the vectorizer configuration.

Return type:

Tensor

forward_embedding_features(batch)[source]

Use the embedding layer to return the word embedding tensors.

Return type:

Tensor

forward_token_features(batch, x=None)[source]

Concatenate any token features given by the vectorizer configuration.

Parameters:
  • batch (Batch) – contains token level attributes to concatenate to x

  • x (Tensor) – if given, the first tensor to be concatenated

Return type:

Tensor
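Concatenating token level features onto x amounts to joining tensors along the feature dimension. The sketch below assumes a (batch, tokens, features) layout; the tensor names and sizes are illustrative only and do not come from the library.

```python
import torch

batch_size, n_tokens = 2, 5
x = torch.randn(batch_size, n_tokens, 300)   # word embeddings
pos = torch.randn(batch_size, n_tokens, 17)  # hypothetical token level feature

# join along the last (feature) dimension; batch and token dimensions must match
out = torch.cat((x, pos), dim=2)
```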

get_embedding_tensors(batch)[source]

Get the embedding tensors (or indexes depending on how it was vectorized) from a batch.

Parameters:

batch (Batch) – contains the vectorized embeddings

Return type:

Tuple[Tensor]

Returns:

the vectorized embedding as tensors, one for each embedding

vectorizer_by_name(name)[source]

Utility method to get a vectorizer by name.

Parameters:

name (str) – the name of the vectorizer as given in the vectorizer manager

Return type:

FeatureVectorizer

class zensols.deepnlp.layer.embed.EmbeddingNetworkSettings(name, config_factory, batch_stash, embedding_layer)[source]

Bases: MetadataNetworkSettings

A utility container settings class for models that use an embedding input layer that inherit from EmbeddingNetworkModule.

__init__(name, config_factory, batch_stash, embedding_layer)
embedding_layer: EmbeddingLayer

The word embedding layer used to vectorize.

get_module_class_name()[source]

Returns the fully qualified class name of the module to create by ModelManager. This module takes as the first parameter an instance of this class.

Important: This method is not used for nested modules. You must declare specific class names in the configuration for those nested class names.

Return type:

str

class zensols.deepnlp.layer.embed.TrainableEmbeddingLayer(feature_vectorizer_manager, embedding_dim, sub_logger=None, trainable=False)[source]

Bases: EmbeddingLayer

A non-frozen embedding layer that has grad on parameters.

reset_parameters()[source]
state_dict(*args, destination=None, prefix='', keep_vars=False)[source]

Returns a dictionary containing references to the whole state of the module.

Both parameters and persistent buffers (e.g. running averages) are included. Keys are corresponding parameter and buffer names. Parameters and buffers set to None are not included.

Note

The returned object is a shallow copy. It contains references to the module’s parameters and buffers.

Warning

Currently state_dict() also accepts positional arguments for destination, prefix and keep_vars in order. However, this is being deprecated and keyword arguments will be enforced in future releases.

Warning

Please avoid the use of argument destination as it is not designed for end-users.

Parameters:
  • destination (dict, optional) – If provided, the state of module will be updated into the dict and the same object is returned. Otherwise, an OrderedDict will be created and returned. Default: None.

  • prefix (str, optional) – a prefix added to parameter and buffer names to compose the keys in state_dict. Default: ''.

  • keep_vars (bool, optional) – by default the Tensors returned in the state dict are detached from autograd. If it’s set to True, detaching will not be performed. Default: False.

Returns:

a dictionary containing a whole state of the module

Return type:

dict

Example:

>>> # xdoctest: +SKIP("undefined vars")
>>> module.state_dict().keys()
['bias', 'weight']

zensols.deepnlp.layer.embrecurcrf module

Embedding input layer classes.

class zensols.deepnlp.layer.embrecurcrf.EmbeddedRecurrentCRF(net_settings, sub_logger=None)[source]

Bases: EmbeddingNetworkModule, SequenceNetworkModule

A recurrent neural network composed of an embedding input, a recurrent network, and a linear conditional random field output layer. When configured with an LSTM, this becomes a (Bi)LSTM-CRF. More specifically, this network has the following:

  1. Input embeddings mapped from tokens.

  2. Recurrent network (i.e. LSTM).

  3. Fully connected feed forward deep linear layer(s) as the decoder.

  4. Linear chain conditional random field (CRF) layer.

  5. Output the labels.
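Steps 1 to 3 of the list above can be sketched with plain torch.nn layers. PyTorch has no built-in CRF layer, so this sketch substitutes a greedy argmax for the CRF decoding of step 4; a real CRF would instead score whole label sequences. The class name TaggerSketch and all sizes are hypothetical.

```python
import torch
from torch import nn


class TaggerSketch(nn.Module):
    """Hypothetical embedding -> BiLSTM -> linear decoder stack."""

    def __init__(self, vocab_size: int, emb_dim: int, hidden: int,
                 n_labels: int):
        super().__init__()
        self.emb = nn.Embedding(vocab_size, emb_dim)
        self.rnn = nn.LSTM(emb_dim, hidden, batch_first=True,
                           bidirectional=True)
        # a bidirectional LSTM emits both directions, hence hidden * 2
        self.decoder = nn.Linear(hidden * 2, n_labels)

    def forward(self, token_ids: torch.Tensor) -> torch.Tensor:
        x = self.emb(token_ids)
        x, _ = self.rnn(x)
        # emission scores per token and label
        return self.decoder(x)


model = TaggerSketch(vocab_size=100, emb_dim=16, hidden=8, n_labels=5)
tokens = torch.randint(0, 100, (3, 7))  # (batch, tokens)
emissions = model(tokens)
# greedy stand-in for step 4; this is not CRF decoding
labels = emissions.argmax(dim=-1)
```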

MODULE_NAME: ClassVar[str] = 'emb-recur-crf'

The module name used in the logging message. This is set in each inherited class.

__init__(net_settings, sub_logger=None)[source]

Initialize the embedding layer.

Parameters:
  • net_settings (EmbeddedRecurrentCRFSettings) – the embedding layer configuration

  • sub_logger – the logger to use for the forward process in this layer

  • filter_attrib_fn – if provided, called with a BatchFieldMetadata for each field returning True if the batch field should be retained and used in the embedding layer (see class docs); if None all fields are considered

deallocate()[source]

Deallocate all resources for this instance.

class zensols.deepnlp.layer.embrecurcrf.EmbeddedRecurrentCRFSettings(name, config_factory, batch_stash, embedding_layer, recurrent_crf_settings, mask_attribute, tensor_predictions=False, use_crf=True)[source]

Bases: EmbeddingNetworkSettings

A utility container settings class for EmbeddedRecurrentCRF network models.

__init__(name, config_factory, batch_stash, embedding_layer, recurrent_crf_settings, mask_attribute, tensor_predictions=False, use_crf=True)
get_module_class_name()[source]

Returns the fully qualified class name of the module to create by ModelManager. This module takes as the first parameter an instance of this class.

Important: This method is not used for nested modules. You must declare specific class names in the configuration for those nested class names.

Return type:

str

mask_attribute: str

The vectorizer attribute name for the mask feature.

recurrent_crf_settings: RecurrentCRFNetworkSettings

The RNN settings (configure this with an LSTM for (Bi)LSTM CRFs).

tensor_predictions: bool = False

Whether or not to return predictions as tensors. There are currently no identified use cases to do this as setting this to True will inflate performance metrics. This is because the batch iterator will create a tensor with the entire batch, adding a lot of default padded values that will be counted as results.
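The inflation effect can be seen in a small, self-contained calculation. The padding label 0, the accuracy helper and the toy label sequences are assumptions for illustration and are not taken from the library.

```python
PAD = 0  # assumed padding label


def accuracy(gold, pred, ignore_pad=False):
    """Token level accuracy over a batch of label sequences."""
    pairs = [(g, p)
             for g_row, p_row in zip(gold, pred)
             for g, p in zip(g_row, p_row)
             if not (ignore_pad and g == PAD)]
    return sum(g == p for g, p in pairs) / len(pairs)


# the first sentence has two real tokens and is padded to the batch width
gold = [[1, 2, PAD, PAD, PAD, PAD],
        [2, 1, 1, 2, 1, 2]]
pred = [[1, 1, PAD, PAD, PAD, PAD],
        [2, 1, 2, 2, 1, 1]]

padded_acc = accuracy(gold, pred)                 # padding counted as correct
true_acc = accuracy(gold, pred, ignore_pad=True)  # real tokens only
```

Here the four padded positions all count as correct predictions, which lifts the score from 0.625 on real tokens to 0.75.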

use_crf: bool = True

zensols.deepnlp.layer.wordvec module

Glue between WordEmbedModel and torch.nn.Embedding.

class zensols.deepnlp.layer.wordvec.WordVectorEmbeddingLayer(embed_model, *args, **kwargs)[source]

Bases: TrainableEmbeddingLayer

An input embedding layer. This uses an instance of WordEmbedModel to compose the word embeddings from indexes. Each index is that of a word vector, which is stacked to create the embedding. This happens in the PyTorch framework, and is fast.

This class overrides PyTorch methods to disable persistence of the embedding weights when configured to be frozen (not trainable). Otherwise, the entire embedding model is saved every time the model is saved for each epoch, which is both unnecessary and costly in terms of time and memory.
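One way to keep frozen weights out of saved models is to filter them from state_dict(). The sketch below shows that general idea only; the class FrozenAwareEmbedding is hypothetical, handles only the top-level (non-nested) case, and is not the library's implementation.

```python
import torch
from torch import nn


class FrozenAwareEmbedding(nn.Module):
    """Hypothetical sketch: omit frozen embedding weights from state_dict()."""

    def __init__(self, vectors: torch.Tensor, trainable: bool = False):
        super().__init__()
        self.trainable = trainable
        self.emb = nn.Embedding.from_pretrained(vectors, freeze=not trainable)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        return self.emb(x)

    def state_dict(self, *args, **kwargs):
        state = super().state_dict(*args, **kwargs)
        if not self.trainable:
            # drop the (large) frozen weights so they are not saved each epoch
            for key in [k for k in state if k.endswith('emb.weight')]:
                del state[key]
        return state


frozen = FrozenAwareEmbedding(torch.randn(10, 4))
trainable = FrozenAwareEmbedding(torch.randn(10, 4), trainable=True)
frozen_keys = list(frozen.state_dict().keys())
trainable_keys = list(trainable.state_dict().keys())
```

With plain PyTorch, restoring such a reduced state requires load_state_dict(..., strict=False) or a matching override, since the embedding weight key is missing from the saved dictionary.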

__init__(embed_model, *args, **kwargs)[source]

Initialize.

Parameters:

embed_model (WordEmbedModel) – contains the word embedding model, such as GloVe or word2vec

forward(x)[source]

Defines the computation performed at every call.

Should be overridden by all subclasses.

Return type:

Tensor

Note

Although the recipe for forward pass needs to be defined within this function, one should call the Module instance afterwards instead of this since the former takes care of running the registered hooks while the latter silently ignores them.
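The difference between calling the module instance and calling forward() directly can be observed with a forward hook. The sketch below uses a torch.nn.Linear layer purely for illustration.

```python
import torch
from torch import nn

calls = []
layer = nn.Linear(3, 2)
# record every time the hook fires
layer.register_forward_hook(lambda mod, inp, out: calls.append('hook'))

x = torch.randn(1, 3)
layer(x)          # calling the instance runs the registered hook
layer.forward(x)  # calling forward() directly bypasses the hook
```

After both calls the hook has fired only once, for the call made through the module instance.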

Module contents

Layers specific to natural language processing.