zensols.deeplearn.batch package#

Submodules#

zensols.deeplearn.batch.domain#

Inheritance diagram of zensols.deeplearn.batch.domain

This module contains a stash used to load an embedding layer. It creates features in batches of matrices and persists only the matrices (sans features) for efficient retrieval.

class zensols.deeplearn.batch.domain.Batch(batch_stash, id, split_name, data_points)[source]#

Bases: PersistableContainer, Writable

Contains a batch of data used in the first layer of a net. This class holds the labels, but is otherwise useless without at least one embedding layer matrix defined.

The user must subclass, add mapping metadata, and optionally (though suggested) add getters and/or properties for the specific data so the model can be more Pythonic in the PyTorch torch.nn.Module.
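A hedged sketch of this subclassing pattern follows, reusing the iris mappings shown later under BatchFeatureMapping (the iris_vectorizer_manager section name, feature IDs, and attribute names are illustrative, not part of the API):

from dataclasses import dataclass
import torch
from zensols.deeplearn.batch import (
    Batch, DataPoint, BatchFeatureMapping,
    ManagerFeatureMapping, FieldFeatureMapping,
)

@dataclass
class IrisDataPoint(DataPoint):
    # feature attributes populated by the client from the source data
    label: str = None
    flower_dims: list = None

@dataclass
class IrisBatch(Batch):
    MAPPINGS = BatchFeatureMapping(
        'label',
        [ManagerFeatureMapping(
            'iris_vectorizer_manager',
            (FieldFeatureMapping('label', 'ilabel', True),
             FieldFeatureMapping('flower_dims', 'iseries')))])

    def _get_batch_feature_mappings(self) -> BatchFeatureMapping:
        return self.MAPPINGS

    @property
    def flower_dims(self) -> torch.Tensor:
        # a Pythonic accessor for use in torch.nn.Module.forward
        return self.attributes['flower_dims']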

STATES = {'d': 'decoded', 'e': 'encoded', 'k': 'deallocated', 'n': 'nascent', 't': 'memory copied'}#

A human-friendly mapping of the encoded states.

__init__(batch_stash, id, split_name, data_points)#
property attributes: Dict[str, Tensor]#

Return the attribute batched tensors as a dictionary using the attribute names as the keys.

batch_stash: BatchStash#

Ephemeral instance of the stash used during encoding and decoding.

property data_points: Tuple[DataPoint, ...]#

The data points given on creation for encoding, which are set to None after encoding/pickling.

deallocate()[source]#

Deallocate all resources for this instance.

get_label_classes()[source]#

Return the labels in this batch in their string form. This assumes the label vectorizer is an instance of CategoryEncodableFeatureVectorizer.

Return type:

List[str]

Returns:

the labels reverse mapped from their nominal values

get_label_feature_vectorizer()[source]#

Return the label vectorizer used in the batch. This assumes there’s only one vectorizer found in the vectorizer manager.

Parameters:

batch – used to access the vectorizer set via the batch stash

Return type:

FeatureVectorizer

get_labels()[source]#

Return the label tensor for this batch.

Return type:

Tensor

property has_labels: bool#

Return whether or not this batch has labels. If it doesn’t, it is a batch used for prediction.

id: int#

The ID of this batch instance, which is the sequence number of the batch given during child processing of the chunked data point ID sets.

keys()[source]#
Return type:

Tuple[str, ...]

size()[source]#

Return the size of this batch, which is the number of data points.

Return type:

int

split_name: str#

The name of the split for this batch (i.e. train vs test).

property state_name#
to()[source]#

Clone this instance and copy data to the CUDA device configured in the batch stash.

Return type:

Batch

Returns:

a clone of this instance with all attribute tensors copied to the given torch configuration device
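For example, a minimal hedged sketch (assumes batch is a decoded instance loaded from a BatchStash):

# clone the batch with tensors copied to the stash's configured device
batch = batch.to()
assert batch.get_labels().device == batch.torch_config.device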

property torch_config: TorchConfig#

The torch config used to copy from CPU to GPU memory.

write(depth=0, writer=<_io.TextIOWrapper name='<stdout>' mode='w' encoding='utf-8'>, include_data_points=False)[source]#

Write the contents of this instance to writer using indentation depth.

Parameters:
  • depth (int) – the starting indentation depth

  • writer (TextIOBase) – the writer to dump the content of this writable

class zensols.deeplearn.batch.domain.DataPoint(id, batch_stash)[source]#

Bases: Writable

Abstract class that acts as a container for features created from sentences.

__init__(id, batch_stash)#
batch_stash: BatchStash#

Ephemeral instance of the stash used during encoding only.

id: int#

The ID of this data point, which maps back to the BatchStash instance’s subordinate stash.

write(depth=0, writer=<_io.TextIOWrapper name='<stdout>' mode='w' encoding='utf-8'>)[source]#

Write the contents of this instance to writer using indentation depth.

Parameters:
  • depth (int) – the starting indentation depth

  • writer (TextIOBase) – the writer to dump the content of this writable

class zensols.deeplearn.batch.domain.DefaultBatch(batch_stash, id, split_name, data_points, batch_feature_mappings=None)[source]#

Bases: Batch

A concrete implementation that uses batch_feature_mappings, usually configured with ConfigBatchFeatureMapping and provided by BatchStash.

__init__(batch_stash, id, split_name, data_points, batch_feature_mappings=None)#
batch_feature_mappings: BatchFeatureMapping = None#

The mappings used by this instance.

zensols.deeplearn.batch.interface#

Inheritance diagram of zensols.deeplearn.batch.interface

Interface and simple domain classes.

class zensols.deeplearn.batch.interface.BatchDirectoryCompositeStash(path, groups)[source]#

Bases: DirectoryCompositeStash

A composite stash used for instances of BatchStash.

__init__(path, groups)[source]#

Initialize using the parent class’s default pattern.

Parameters:
  • path (Path) – the directory that will have two subdirectories containing the files, named INSTANCE_DIRECTORY_NAME and COMPOSITE_DIRECTORY_NAME

  • groups (Tuple[Set[str]]) – the groups of the dict composite attribute, which are sets of keys, each of which is persisted to its respective directory

  • attribute_name – the name of the attribute in each item to split across groups/directories; the instance data to persist has the composite attribute of type dict

  • load_keys – the keys used to load the data from the composite stashes into the attribute dict instance; only these keys will exist in the loaded data, or None for all keys; this can also be set after the instance is created
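A minimal hedged construction sketch (the path and attribute group names are assumptions for illustration):

from pathlib import Path
from zensols.deeplearn.batch import BatchDirectoryCompositeStash

# each group is a set of batch attribute names persisted to its own
# subdirectory, so attribute groups can be loaded independently
stash = BatchDirectoryCompositeStash(
    path=Path('target/batch'),
    groups=({'label'},
            {'glove_embedding', 'token_stats'}))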

exception zensols.deeplearn.batch.interface.BatchError[source]#

Bases: DeepLearnError

Thrown for any batch related error.

__annotations__ = {}#
__module__ = 'zensols.deeplearn.batch.interface'#
class zensols.deeplearn.batch.interface.DataPointIDSet(batch_id, data_point_ids, split_name, torch_seed_context)[source]#

Bases: object

Set of subordinate stash IDs with feature values to be vectorized with BatchStash. Groups of these are sent to subprocesses for processing into Batch instances.

__init__(batch_id, data_point_ids, split_name, torch_seed_context)#
batch_id: str#

The ID of the batch.

data_point_ids: Tuple[str]#

The IDs of each data point in the set.

split_name: str#

The split (i.e. train, test, validation).

torch_seed_context: Dict[str, Any]#

The seed context given by TorchConfig.

zensols.deeplearn.batch.mapping#

Inheritance diagram of zensols.deeplearn.batch.mapping

Mapping metadata for batch domain specific instances.

class zensols.deeplearn.batch.mapping.BatchFeatureMapping(label_attribute_name='label', manager_mappings=<factory>)[source]#

Bases: Dictable

The metadata used to encode and decode each feature into tensors. It is best to define a class-level instance of this in the Batch class and return it from _get_batch_feature_mappings.

An example from the iris data set test:

MAPPINGS = BatchFeatureMapping(
    'label',
    [ManagerFeatureMapping(
        'iris_vectorizer_manager',
        (FieldFeatureMapping('label', 'ilabel', True),
         FieldFeatureMapping('flower_dims', 'iseries')))])
__init__(label_attribute_name='label', manager_mappings=<factory>)#
get_attributes()[source]#
Return type:

Iterable[FieldFeatureMapping]

get_field_map_by_attribute(attribute_name)[source]#
Return type:

Optional[Tuple[ManagerFeatureMapping, FieldFeatureMapping]]

get_field_map_by_feature_id(feature_id)[source]#
Return type:

Optional[Tuple[ManagerFeatureMapping, FieldFeatureMapping]]

label_attribute_name: str = 'label'#

The name of the attribute used for labels.

property label_feature_id: None | str#

Return the feature ID of the label. This identifies the vectorizer used to transform the label data.

property label_vectorizer_manager: FeatureVectorizerManager | None#

Return the feature vectorizer manager for the label. This manager contains the vectorizer used to transform the label data.

manager_mappings: List[ManagerFeatureMapping]#

The manager level attribute mapping meta data.

write(depth=0, writer=<_io.TextIOWrapper name='<stdout>' mode='w' encoding='utf-8'>)[source]#

Write this instance as either a Writable or as a Dictable. If class attribute _DICTABLE_WRITABLE_DESCENDANTS is set as True, then use the write() method on children instead of writing the generated dictionary. Otherwise, write this instance by first creating a dict recursively using asdict(), then formatting the output.

If the attribute _DICTABLE_WRITE_EXCLUDES is set, those attributes are removed from what is written in the write() method.

Note that this attribute will need to be set in all descendants in the instance hierarchy since writing the object instance graph is done recursively.

Parameters:
  • depth (int) – the starting indentation depth

  • writer (TextIOBase) – the writer to dump the content of this writable

class zensols.deeplearn.batch.mapping.ConfigBatchFeatureMapping(label_attribute_name='label', manager_mappings=<factory>, batch_feature_mapping_adds=<factory>, field_remove=<factory>, field_keep=None)[source]#

Bases: BatchFeatureMapping

A utility class that allows an easy, configuration-driven way of refining manager_mappings by adding and deleting them at both the mapping and field levels. These edits happen during the class's __init__.

__init__(label_attribute_name='label', manager_mappings=<factory>, batch_feature_mapping_adds=<factory>, field_remove=<factory>, field_keep=None)#
batch_feature_mapping_adds: List[BatchFeatureMapping]#

Mappings to add.

field_keep: Set[str] = None#

Only these fields remain from all batch mappings.

field_remove: Set[str]#

Fields removed from all batch mappings.
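A hedged sketch of refining mappings programmatically (in practice the mappings usually come from application configuration; the manager and field names are illustrative):

from zensols.deeplearn.batch import (
    ConfigBatchFeatureMapping, BatchFeatureMapping,
    ManagerFeatureMapping, FieldFeatureMapping,
)

base = BatchFeatureMapping(
    'label',
    [ManagerFeatureMapping(
        'lang_vectorizer_manager',
        (FieldFeatureMapping('label', 'label_vec', True),
         FieldFeatureMapping('tokens', 'token_vec', True),
         FieldFeatureMapping('stats', 'stats_vec')))])

# keep only the label and token fields; 'stats' is dropped in __init__
mapping = ConfigBatchFeatureMapping(
    batch_feature_mapping_adds=[base],
    field_keep={'label', 'tokens'})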

class zensols.deeplearn.batch.mapping.FieldFeatureMapping(attr, feature_id, is_agg=False, attr_access=None, is_label=False)[source]#

Bases: Dictable

Meta data describing an attribute of the data point.

__init__(attr, feature_id, is_agg=False, attr_access=None, is_label=False)#
attr: str#

The human readable name used for the mapping.

attr_access: str = None#

The attribute on the source DataPoint instance (see attribute_accessor).

property attribute_accessor#

Return the attribute name on the DataPoint instance. This uses attr_access if it is not None; otherwise it uses attr.

feature_id: str#

Indicates which vectorizer to use.

is_agg: bool = False#

If True, tuplize across all data points and encode as one tuple of data to create the batched tensor on decode; otherwise, each data point feature is encoded and concatenated on decode.

is_label: bool = False#

Whether or not this field is a label. This is True in cases where there is more than one label. In these cases, which label to use usually changes based on the model (i.e. word embedding vs. BERT word piece token IDs).

This is used in Batch to skip label vectorization when encoding prediction-based batches.

write(depth=0, writer=<_io.TextIOWrapper name='<stdout>' mode='w' encoding='utf-8'>)[source]#

Write this instance as either a Writable or as a Dictable. If class attribute _DICTABLE_WRITABLE_DESCENDANTS is set as True, then use the write() method on children instead of writing the generated dictionary. Otherwise, write this instance by first creating a dict recursively using asdict(), then formatting the output.

If the attribute _DICTABLE_WRITE_EXCLUDES is set, those attributes are removed from what is written in the write() method.

Note that this attribute will need to be set in all descendants in the instance hierarchy since writing the object instance graph is done recursively.

Parameters:
  • depth (int) – the starting indentation depth

  • writer (TextIOBase) – the writer to dump the content of this writable

class zensols.deeplearn.batch.mapping.ManagerFeatureMapping(vectorizer_manager_name, fields)[source]#

Bases: Dictable

Metadata for a vectorizer manager with fields describing attributes to be vectorized from features into feature contexts.

__init__(vectorizer_manager_name, fields)#
fields: Tuple[FieldFeatureMapping]#

The fields of the data point to be vectorized.

remove_field(attr)[source]#

Remove a field by attribute if it exists.

Parameters:

attr (str) – the name of the field’s attribute to remove

Return type:

bool

Returns:

True if the field was removed, False otherwise

vectorizer_manager_name: str#

The configuration name that identifies an instance of FeatureVectorizerManager.

write(depth=0, writer=<_io.TextIOWrapper name='<stdout>' mode='w' encoding='utf-8'>)[source]#

Write this instance as either a Writable or as a Dictable. If class attribute _DICTABLE_WRITABLE_DESCENDANTS is set as True, then use the write() method on children instead of writing the generated dictionary. Otherwise, write this instance by first creating a dict recursively using asdict(), then formatting the output.

If the attribute _DICTABLE_WRITE_EXCLUDES is set, those attributes are removed from what is written in the write() method.

Note that this attribute will need to be set in all descendants in the instance hierarchy since writing the object instance graph is done recursively.

Parameters:
  • depth (int) – the starting indentation depth

  • writer (TextIOBase) – the writer to dump the content of this writable

zensols.deeplearn.batch.meta#

Inheritance diagram of zensols.deeplearn.batch.meta

Contains container classes for batch data.

class zensols.deeplearn.batch.meta.BatchFieldMetadata(field, vectorizer)[source]#

Bases: Dictable

Data that describes a field mapping in a batch object.

__init__(field, vectorizer)#
field: FieldFeatureMapping#

The field mapping.

property shape#
vectorizer: FeatureVectorizer#

The vectorizer used to map the field.

write(depth=0, writer=<_io.TextIOWrapper name='<stdout>' mode='w' encoding='utf-8'>)[source]#

Write this instance as either a Writable or as a Dictable. If class attribute _DICTABLE_WRITABLE_DESCENDANTS is set as True, then use the write() method on children instead of writing the generated dictionary. Otherwise, write this instance by first creating a dict recursively using asdict(), then formatting the output.

If the attribute _DICTABLE_WRITE_EXCLUDES is set, those attributes are removed from what is written in the write() method.

Note that this attribute will need to be set in all descendants in the instance hierarchy since writing the object instance graph is done recursively.

Parameters:
  • depth (int) – the starting indentation depth

  • writer (TextIOBase) – the writer to dump the content of this writable

class zensols.deeplearn.batch.meta.BatchMetadata(data_point_class, batch_class, mapping, fields_by_attribute)[source]#

Bases: Dictable

Describes metadata about a Batch instance.

__init__(data_point_class, batch_class, mapping, fields_by_attribute)#
batch_class: Type[Batch]#

The Batch class, instances of which are created at encoding time.

data_point_class: Type[DataPoint]#

The DataPoint class, instances of which are created at encoding time.

fields_by_attribute: Dict[str, BatchFieldMetadata]#

The field metadata keyed by attribute name.

mapping: BatchFeatureMapping#

The mapping used for encoding and decoding the batch.

write(depth=0, writer=<_io.TextIOWrapper name='<stdout>' mode='w' encoding='utf-8'>)[source]#

Write this instance as either a Writable or as a Dictable. If class attribute _DICTABLE_WRITABLE_DESCENDANTS is set as True, then use the write() method on children instead of writing the generated dictionary. Otherwise, write this instance by first creating a dict recursively using asdict(), then formatting the output.

If the attribute _DICTABLE_WRITE_EXCLUDES is set, those attributes are removed from what is written in the write() method.

Note that this attribute will need to be set in all descendants in the instance hierarchy since writing the object instance graph is done recursively.

Parameters:
  • depth (int) – the starting indentation depth

  • writer (TextIOBase) – the writer to dump the content of this writable

class zensols.deeplearn.batch.meta.MetadataNetworkSettings(name, config_factory, batch_stash)[source]#

Bases: NetworkSettings

A network settings container that has metadata about the batches it receives for its model.

__init__(name, config_factory, batch_stash)#
property batch_metadata: BatchMetadata#

Return the batch metadata used by this model.

batch_stash: BatchStash#

The batch stash that created the batches and has the batch metadata.
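A hedged sketch of how a model might use this metadata to size a layer (the glove_embedding attribute name is an assumption for illustration):

from zensols.deeplearn.batch import BatchMetadata, MetadataNetworkSettings

def embedding_width(settings: MetadataNetworkSettings) -> int:
    """Derive a layer width from a vectorized feature's shape."""
    meta: BatchMetadata = settings.batch_metadata
    field_meta = meta.fields_by_attribute['glove_embedding']
    # last dimension of the vectorizer's output shape
    return field_meta.shape[-1]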

zensols.deeplearn.batch.multi#

Inheritance diagram of zensols.deeplearn.batch.multi

Multi processing with torch.

class zensols.deeplearn.batch.multi.TorchMultiProcessStash(delegate, config, name, chunk_size, workers)[source]#

Bases: MultiProcessStash

A multiprocessing stash that interacts with PyTorch in a way that allows access to the GPU(s) in forked subprocesses using the multiprocessing library.

See:

torch.multiprocessing

See:

zensols.deeplearn.TorchConfig.init()

__init__(delegate, config, name, chunk_size, workers)#

zensols.deeplearn.batch.stash#

Inheritance diagram of zensols.deeplearn.batch.stash

This module contains a stash used to load an embedding layer.

class zensols.deeplearn.batch.stash.BatchStash(name, config_factory, delegate, config, chunk_size, workers, data_point_type, batch_type, split_stash_container, vectorizer_manager_set, batch_size, model_torch_config, data_point_id_sets_path, decoded_attributes=<property object>, batch_feature_mappings=None, batch_limit=9223372036854775807)[source]#

Bases: TorchMultiProcessStash, SplitKeyContainer, Writeback, Deallocatable

A stash that vectorizes features into easily consumable tensors for training and testing. This stash produces instances of Batch, which is a batch in the machine learning sense and the first dimension of what will become the tensor used in PyTorch. Each of these batches has a logical one-to-many relationship to that batch's respective set of data points, which is encapsulated in the DataPoint class.

The stash creates subprocesses to vectorize features into tensors in chunks of IDs (data point IDs) from the subordinate stash using DataPointIDSet instances.

To speed up experiments, all available features configured in vectorizer_manager_set are encoded to disk. However, only the decoded_attributes (see the attribute below) are available to the model regardless of what was created during encoding time.

The lifecycle of the data follows:

  1. Feature data created by the client, which could be language features, row data etc.

  2. Vectorize the feature data using the vectorizers in vectorizer_manager_set. This creates the feature contexts (FeatureContext) specifically meant to be pickled.

  3. Pickle the feature contexts when dumping to disk, which is invoked in the child processes of this class.

  4. At train time, load the feature contexts from disk.

  5. Decode the feature contexts into PyTorch tensors.

  6. The model manager uses the to method to copy the CPU tensors to the GPU (where GPUs are available).

Use the split_stash_container to get the dataset, which has a splits property for the feature data. Use the dataset_stash from the application context ConfigFactory for the batch splits.

See _process:

for details on the pickling of the batch instances
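Putting the lifecycle together, a minimal hedged usage sketch (the configuration file name and the batch_stash section name are assumptions):

from zensols.config import ImportConfigFactory, IniConfig
from zensols.deeplearn.batch import BatchStash

factory = ImportConfigFactory(IniConfig('app.conf'))
stash: BatchStash = factory.instance('batch_stash')
stash.prime()  # vectorize and pickle batches in child processes if needed
for batch in stash.values():
    batch = batch.to()  # copy tensors to the configured device
    print(batch.id, batch.size(), batch.get_labels().shape)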

_process(chunk)[source]#

Create the batches by creating the set of data points for each DataPointIDSet instance. When the subordinate stash dumps the batch (specifically a subclass of Batch), the overridden pickle logic is used to detach the batch by encoding all data into FeatureContext instances.

Return type:

Iterable[Tuple[str, Any]]

__init__(name, config_factory, delegate, config, chunk_size, workers, data_point_type, batch_type, split_stash_container, vectorizer_manager_set, batch_size, model_torch_config, data_point_id_sets_path, decoded_attributes=<property object>, batch_feature_mappings=None, batch_limit=9223372036854775807)#
property batch_data_point_sets: List[DataPointIDSet]#

Create the data point ID sets. Each instance returned correlates to a batch, and each key in the set points to a feature DataPoint.

batch_feature_mappings: BatchFeatureMapping = None#

The metadata used to encode and decode each feature into tensors.

batch_limit: int = 9223372036854775807#

The max number of batches to process, which is useful for debugging.

property batch_metadata: BatchMetadata#
batch_size: int#

The number of data points in each batch, except the last, which may be smaller unless the batch size evenly divides the number of data points.

batch_type: Type[Batch]#

The batch class to be instantiated when creating batches.

clear()[source]#

Clear the batches and batch data point sets.

clear_all()[source]#

Clear the batches, batch data point sets, and the source data (split_stash_container).

create_batch(points, split_name=None, batch_id=None)[source]#

Create a new batch instance with data points, which happens when primed.

data_point_id_sets_path: Path#

The path of where to store key data for the splits; note that the container might store its key splits in some other location.

data_point_type: Type[DataPoint]#

A subclass type of DataPoint implemented for the specific feature.

deallocate()[source]#

Deallocate all resources for this instance.

property decoded_attributes: Set[str]#

The attributes to decode; only these are available to the model regardless of what was created during encoding time; if None, all are available.
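For example, a hedged sketch assuming the property is writable, as its appearance in __init__ suggests (the attribute names are illustrative):

# everything was encoded to disk; decode only what this model needs
stash.decoded_attributes = {'label', 'glove_embedding'}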

load(name)[source]#

Load a data value from the pickled data with key name. Semantically, this method loads the data using the stash's implementation. For example, DirectoryStash loads the data from a file if it exists, but factory type stashes will always re-generate the data.

See:

get()

model_torch_config: TorchConfig#

The PyTorch configuration used to (optionally) copy CPU to GPU memory.

populate_batch_feature_mapping(batch)[source]#

Add batch feature mappings to a batch instance.

prime()[source]#

If the delegate stash data does not exist, use this implementation to generate the data and process it in child processes.

split_stash_container: SplitStashContainer#

The source data stash that has both the data and data set keys for each split (i.e. train vs test).

vectorizer_manager_set: FeatureVectorizerManagerSet#

Used to vectorize features into tensors.

Module contents#

Contains classes that batch vectorized data in forked subprocesses.