zensols.deeplearn.batch package¶
Submodules¶
zensols.deeplearn.batch.domain module¶
This file contains a stash used to load an embedding layer. It creates features in batches of matrices and persists only the matrices (sans features) for efficient retrieval.
- class zensols.deeplearn.batch.domain.Batch(batch_stash, id, split_name, data_points)[source]¶
Bases: PersistableContainer, Writable
Contains a batch of data used in the first layer of a net. This class holds the labels, but is otherwise useless without at least one embedding layer matrix defined.
The user must subclass, add mapping meta data, and optionally (suggested) add getters and/or properties for the specific data so the model can be more Pythonic in the PyTorch torch.nn.Module.
- STATES = {'d': 'decoded', 'e': 'encoded', 'k': 'deallocated', 'n': 'nascent', 't': 'memory copied'}¶
A human friendly mapping of the encoded states.
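As a minimal sketch of how such a mapping might be used, the snippet below looks up a one-character state code. The helper `describe_state` is hypothetical and not part of the library; how the `state_name` property actually resolves its value is not shown in this reference.

```python
# Copy of the STATES mapping documented above.
STATES = {'d': 'decoded', 'e': 'encoded', 'k': 'deallocated',
          'n': 'nascent', 't': 'memory copied'}


def describe_state(code: str) -> str:
    """Hypothetical helper: translate a one-character state code."""
    # Fall back to a readable marker for codes missing from the mapping.
    return STATES.get(code, 'unknown: ' + code)


print(describe_state('n'))
```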
- __init__(batch_stash, id, split_name, data_points)¶
- property attributes: Dict[str, Tensor]¶
Return the attribute batched tensors as a dictionary using the attribute names as the keys.
- batch_stash: BatchStash¶
Ephemeral instance of the stash used during encoding and decoding.
- property data_points: Tuple[DataPoint, ...]¶
The list of the data points given on creation for encoding, and set to None after encoding/pickling.
- get_label_classes()[source]¶
Return the labels in this batch in their string form. This assumes the label vectorizer is an instance of CategoryEncodableFeatureVectorizer.
- get_label_feature_vectorizer()[source]¶
Return the label vectorizer used in the batch. This assumes there’s only one vectorizer found in the vectorizer manager.
- Parameters:
batch – used to access the vectorizer set via the batch stash
- Return type:
- property has_labels: bool¶
Return whether or not this batch has labels. If it doesn’t, it is a batch used for prediction.
- id: int¶
The ID of this batch instance, which is the sequence number of the batch given during child processing of the chunked data point ID sets.
- split_name: str¶
The name of the split for this batch (e.g. train vs. test).
- property state_name¶
- to()[source]¶
Clone this instance and copy data to the CUDA device configured in the batch stash.
- Return type:
- Returns:
a clone of this instance with all attribute tensors copied to the given torch configuration device
- property torch_config: TorchConfig¶
The torch config used to copy from CPU to GPU memory.
- write(depth=0, writer=<_io.TextIOWrapper name='<stdout>' mode='w' encoding='utf-8'>, include_data_points=False)[source]¶
Write the contents of this instance to writer using indentation depth.
- Parameters:
depth (int) – the starting indentation depth
writer (TextIOBase) – the writer to dump the content of this writable
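The `depth` and `writer` parameters can be illustrated with a small stand-in class. This is only a sketch: `WritableSketch` is hypothetical, and the four-spaces-per-level indentation is an assumption rather than documented library behavior.

```python
import io
import sys
from typing import TextIO


class WritableSketch:
    """Hypothetical stand-in for a writable object; not the library class."""

    def __init__(self, name: str, values: dict):
        self.name = name
        self.values = values

    def write(self, depth: int = 0, writer: TextIO = sys.stdout) -> None:
        # Assumption: each indentation level is four spaces wide.
        indent = ' ' * (4 * depth)
        writer.write(f'{indent}{self.name}:\n')
        for key, value in self.values.items():
            # Entries are nested one level below the name line.
            writer.write(f'{indent}    {key}: {value}\n')


# Write to an in-memory buffer instead of standard output.
buffer = io.StringIO()
WritableSketch('batch', {'id': 0, 'split_name': 'train'}).write(1, buffer)
output = buffer.getvalue()
```

Passing an `io.StringIO` as the writer is a convenient way to capture the output in tests or logs.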
- class zensols.deeplearn.batch.domain.DataPoint(id, batch_stash)[source]¶
Bases:
Writable
Abstract class that makes up a container class for features created from sentences.
- __init__(id, batch_stash)¶
- batch_stash: BatchStash¶
Ephemeral instance of the stash used during encoding only.
- id: int¶
The ID of this data point, which maps back to the
BatchStash
instance’s subordinate stash.
- write(depth=0, writer=<_io.TextIOWrapper name='<stdout>' mode='w' encoding='utf-8'>)[source]¶
Write the contents of this instance to writer using indentation depth.
- Parameters:
depth (int) – the starting indentation depth
writer (TextIOBase) – the writer to dump the content of this writable
- class zensols.deeplearn.batch.domain.DefaultBatch(batch_stash, id, split_name, data_points, batch_feature_mappings=None)[source]¶
Bases:
Batch
A concrete implementation that uses a batch_feature_mapping usually configured with ConfigBatchFeatureMapping and provided by BatchStash.
- __init__(batch_stash, id, split_name, data_points, batch_feature_mappings=None)¶
- batch_feature_mappings: BatchFeatureMapping = None¶
The mappings used by this instance.
zensols.deeplearn.batch.interface module¶
Interface and simple domain classes.
- class zensols.deeplearn.batch.interface.BatchDirectoryCompositeStash(path, groups)[source]¶
Bases:
DirectoryCompositeStash
A composite stash used for instances of BatchStash.
- __init__(path, groups)[source]¶
Initialize using the parent class’s default pattern.
- Parameters:
path (Path) – the directory that will have two subdirectories with the files; they are named INSTANCE_DIRECTORY_NAME and COMPOSITE_DIRECTORY_NAME
groups (Tuple[Set[str]]) – the groups of the dict composite attribute, which are sets of keys, each of which is persisted to its respective directory
attribute_name – the name of the attribute in each item to split across groups/directories; the instance data to persist has the composite attribute of type dict
load_keys – the keys used to load the data from the composite stashes in to the attribute dict instance; only these keys will exist in the loaded data, or None for all keys; this can be set after the creation of the instance as well
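The idea behind `groups` and `load_keys` can be sketched with plain dictionaries. The functions `split_by_groups` and `merge_groups` below are hypothetical illustrations; the real stash persists each group to its own directory, which this sketch does not attempt.

```python
from typing import Any, Dict, List, Optional, Set, Tuple


def split_by_groups(composite: Dict[str, Any],
                    groups: Tuple[Set[str], ...]) -> List[Dict[str, Any]]:
    """Partition a dict into one dict per group of keys."""
    return [{k: composite[k] for k in sorted(group) if k in composite}
            for group in groups]


def merge_groups(parts: List[Dict[str, Any]],
                 load_keys: Optional[Set[str]] = None) -> Dict[str, Any]:
    """Rebuild the dict, keeping only load_keys (None keeps all keys)."""
    merged: Dict[str, Any] = {}
    for part in parts:
        for key, value in part.items():
            if load_keys is None or key in load_keys:
                merged[key] = value
    return merged


features = {'label': [1], 'embedding': [0.1, 0.2], 'pos': [3]}
groups = ({'label'}, {'embedding', 'pos'})
parts = split_by_groups(features, groups)
partial = merge_groups(parts, load_keys={'label', 'pos'})
```

Grouping keys this way means a consumer that needs only some attributes does not have to read the data of every group.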
- exception zensols.deeplearn.batch.interface.BatchError[source]¶
Bases:
DeepLearnError
Thrown for any batch related error.
- __annotations__ = {}¶
- __module__ = 'zensols.deeplearn.batch.interface'¶
- class zensols.deeplearn.batch.interface.DataPointIDSet(batch_id, data_point_ids, split_name, torch_seed_context)[source]¶
Bases:
object
Set of subordinate stash IDs with feature values to be vectorized with BatchStash. Groups of these are sent to subprocesses for processing in to Batch instances.
- __init__(batch_id, data_point_ids, split_name, torch_seed_context)¶
- torch_seed_context: Dict[str, Any]¶
The seed context given by TorchConfig.
zensols.deeplearn.batch.mapping module¶
Mapping metadata for batch domain specific instances.
- class zensols.deeplearn.batch.mapping.BatchFeatureMapping(label_attribute_name='label', manager_mappings=<factory>)[source]¶
Bases:
Dictable
The meta data used to encode and decode each feature in to tensors. It is best to define a class level instance of this in the Batch class and return it with _get_batch_feature_mappings.
An example from the iris data set test:
    MAPPINGS = BatchFeatureMapping(
        'label',
        [ManagerFeatureMapping(
            'iris_vectorizer_manager',
            (FieldFeatureMapping('label', 'ilabel', True),
             FieldFeatureMapping('flower_dims', 'iseries')))])
- __init__(label_attribute_name='label', manager_mappings=<factory>)¶
- property label_feature_id: None | str¶
Return the feature id of the label. This is the vectorizer used to transform the label data.
- property label_vectorizer_manager: FeatureVectorizerManager | None¶
Return the vectorizer manager used for the label, or None.
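To make the structure of the iris mapping concrete, the sketch below mimics it with plain dataclasses. The classes `FieldSketch`, `ManagerSketch` and `MappingSketch` are hypothetical stand-ins, and the lookup rule (the label is the field whose `attr` equals `label_attribute_name`) is an assumption, not the library's confirmed logic.

```python
from dataclasses import dataclass, field
from typing import List, Optional, Tuple


@dataclass
class FieldSketch:
    attr: str
    feature_id: str
    is_agg: bool = False


@dataclass
class ManagerSketch:
    vectorizer_manager_name: str
    fields: Tuple[FieldSketch, ...]


@dataclass
class MappingSketch:
    label_attribute_name: str = 'label'
    manager_mappings: List[ManagerSketch] = field(default_factory=list)

    @property
    def label_feature_id(self) -> Optional[str]:
        # Assumption: the label field is the one whose attribute name
        # matches label_attribute_name.
        for manager in self.manager_mappings:
            for fld in manager.fields:
                if fld.attr == self.label_attribute_name:
                    return fld.feature_id
        return None


mappings = MappingSketch(
    'label',
    [ManagerSketch(
        'iris_vectorizer_manager',
        (FieldSketch('label', 'ilabel', True),
         FieldSketch('flower_dims', 'iseries')))])
```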
- manager_mappings: List[ManagerFeatureMapping]¶
The manager level attribute mapping meta data.
- write(depth=0, writer=<_io.TextIOWrapper name='<stdout>' mode='w' encoding='utf-8'>)[source]¶
Write this instance as either a Writable or as a Dictable. If the class attribute _DICTABLE_WRITABLE_DESCENDANTS is set as True, then use the write() method on children instead of writing the generated dictionary. Otherwise, write this instance by first creating a dict recursively using asdict(), then formatting the output.
If the attribute _DICTABLE_WRITE_EXCLUDES is set, those attributes are removed from what is written in the write() method.
Note that this attribute will need to be set in all descendants in the instance hierarchy since writing the object instance graph is done recursively.
- Parameters:
depth (int) – the starting indentation depth
writer (TextIOBase) – the writer to dump the content of this writable
- class zensols.deeplearn.batch.mapping.ConfigBatchFeatureMapping(label_attribute_name='label', manager_mappings=<factory>, batch_feature_mapping_adds=<factory>, field_remove=<factory>, field_keep=None)[source]¶
Bases:
BatchFeatureMapping
A utility class that allows an easy configuration driven way of refining manager_mappings by adding and deleting them both at the mapping and field levels. These edits happen during the class's __init__.
- __init__(label_attribute_name='label', manager_mappings=<factory>, batch_feature_mapping_adds=<factory>, field_remove=<factory>, field_keep=None)¶
- batch_feature_mapping_adds: List[BatchFeatureMapping]¶
Mappings to add.
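The `field_remove` and `field_keep` parameters appear in the signature but are not described here. The sketch below shows one plausible reading, purely as an assumption: removed names are dropped first, and a non-None keep set then acts as an allow list. The function `refine_fields` is hypothetical.

```python
from typing import List, Optional, Set


def refine_fields(fields: List[str],
                  field_remove: Set[str],
                  field_keep: Optional[Set[str]] = None) -> List[str]:
    """Assumed semantics: drop field_remove, then keep only field_keep."""
    kept = [f for f in fields if f not in field_remove]
    if field_keep is not None:
        # A None keep set means no allow list is applied.
        kept = [f for f in kept if f in field_keep]
    return kept


all_fields = ['label', 'flower_dims', 'extra']
without_extra = refine_fields(all_fields, {'extra'})
only_label = refine_fields(all_fields, set(), {'label'})
```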
- class zensols.deeplearn.batch.mapping.FieldFeatureMapping(attr, feature_id, is_agg=False, attr_access=None, is_label=False)[source]¶
Bases:
Dictable
Meta data describing an attribute of the data point.
- __init__(attr, feature_id, is_agg=False, attr_access=None, is_label=False)¶
- attr_access: str = None¶
The attribute on the source DataPoint instance (see attribute_accessor).
- property attribute_accessor¶
Return the attribute name on the DataPoint instance. This uses attr_access if it is not None, otherwise, use attr.
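The fallback rule for `attribute_accessor` can be sketched as follows. `FieldAccessSketch` and `PointSketch` are hypothetical stand-ins; only the rule itself (use `attr_access` unless it is `None`, otherwise `attr`) comes from the description above.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class FieldAccessSketch:
    attr: str
    attr_access: Optional[str] = None

    @property
    def attribute_accessor(self) -> str:
        # Prefer the explicit accessor name when one was given.
        return self.attr if self.attr_access is None else self.attr_access


@dataclass
class PointSketch:
    flower_dims: tuple
    raw_dims: tuple


point = PointSketch(flower_dims=(5.1, 3.5), raw_dims=(51, 35))
default_value = getattr(
    point, FieldAccessSketch('flower_dims').attribute_accessor)
override_value = getattr(
    point, FieldAccessSketch('flower_dims', 'raw_dims').attribute_accessor)
```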
- is_agg: bool = False¶
If True, tuplize across all data points and encode as one tuple of data to create the batched tensor on decode; otherwise, each data point feature is encoded and concatenated on decode.
- is_label: bool = False¶
Whether or not this field is a label. This is True in cases where there is more than one label. In these cases, usually which label to use changes based on the model (i.e. word embedding vs. BERT word piece token IDs).
This is used in Batch to skip label vectorization while encoding prediction based batches.
- write(depth=0, writer=<_io.TextIOWrapper name='<stdout>' mode='w' encoding='utf-8'>)[source]¶
Write this instance as either a Writable or as a Dictable. If the class attribute _DICTABLE_WRITABLE_DESCENDANTS is set as True, then use the write() method on children instead of writing the generated dictionary. Otherwise, write this instance by first creating a dict recursively using asdict(), then formatting the output.
If the attribute _DICTABLE_WRITE_EXCLUDES is set, those attributes are removed from what is written in the write() method.
Note that this attribute will need to be set in all descendants in the instance hierarchy since writing the object instance graph is done recursively.
- Parameters:
depth (int) – the starting indentation depth
writer (TextIOBase) – the writer to dump the content of this writable
- class zensols.deeplearn.batch.mapping.ManagerFeatureMapping(vectorizer_manager_name, fields)[source]¶
Bases:
Dictable
Meta data for a vectorizer manager with fields describing attributes to be vectorized from features in to feature contexts.
- __init__(vectorizer_manager_name, fields)¶
- fields: Tuple[FieldFeatureMapping]¶
The fields of the data point to be vectorized.
- vectorizer_manager_name: str¶
The configuration name that identifies an instance of FeatureVectorizerManager.
- write(depth=0, writer=<_io.TextIOWrapper name='<stdout>' mode='w' encoding='utf-8'>)[source]¶
Write this instance as either a Writable or as a Dictable. If the class attribute _DICTABLE_WRITABLE_DESCENDANTS is set as True, then use the write() method on children instead of writing the generated dictionary. Otherwise, write this instance by first creating a dict recursively using asdict(), then formatting the output.
If the attribute _DICTABLE_WRITE_EXCLUDES is set, those attributes are removed from what is written in the write() method.
Note that this attribute will need to be set in all descendants in the instance hierarchy since writing the object instance graph is done recursively.
- Parameters:
depth (int) – the starting indentation depth
writer (TextIOBase) – the writer to dump the content of this writable
zensols.deeplearn.batch.meta module¶
Contains container classes for batch data.
- class zensols.deeplearn.batch.meta.BatchFieldMetadata(field, vectorizer)[source]¶
Bases:
Dictable
Data that describes a field mapping in a batch object.
- __init__(field, vectorizer)¶
- field: FieldFeatureMapping¶
The field mapping.
- property shape¶
- vectorizer: FeatureVectorizer¶
The vectorizer used to map the field.
- write(depth=0, writer=<_io.TextIOWrapper name='<stdout>' mode='w' encoding='utf-8'>)[source]¶
Write this instance as either a Writable or as a Dictable. If the class attribute _DICTABLE_WRITABLE_DESCENDANTS is set as True, then use the write() method on children instead of writing the generated dictionary. Otherwise, write this instance by first creating a dict recursively using asdict(), then formatting the output.
If the attribute _DICTABLE_WRITE_EXCLUDES is set, those attributes are removed from what is written in the write() method.
Note that this attribute will need to be set in all descendants in the instance hierarchy since writing the object instance graph is done recursively.
- Parameters:
depth (int) – the starting indentation depth
writer (TextIOBase) – the writer to dump the content of this writable
- class zensols.deeplearn.batch.meta.BatchMetadata(data_point_class, batch_class, mapping, fields_by_attribute)[source]¶
Bases:
Dictable
Describes metadata about a Batch instance.
- __init__(data_point_class, batch_class, mapping, fields_by_attribute)¶
- fields_by_attribute: Dict[str, BatchFieldMetadata]¶
Mapping by field name to attribute.
- mapping: BatchFeatureMapping¶
The mapping used for encoding and decoding the batch.
- write(depth=0, writer=<_io.TextIOWrapper name='<stdout>' mode='w' encoding='utf-8'>)[source]¶
Write this instance as either a Writable or as a Dictable. If the class attribute _DICTABLE_WRITABLE_DESCENDANTS is set as True, then use the write() method on children instead of writing the generated dictionary. Otherwise, write this instance by first creating a dict recursively using asdict(), then formatting the output.
If the attribute _DICTABLE_WRITE_EXCLUDES is set, those attributes are removed from what is written in the write() method.
Note that this attribute will need to be set in all descendants in the instance hierarchy since writing the object instance graph is done recursively.
- Parameters:
depth (int) – the starting indentation depth
writer (TextIOBase) – the writer to dump the content of this writable
- class zensols.deeplearn.batch.meta.MetadataNetworkSettings(name, config_factory, torch_config, batch_stash)[source]¶
Bases:
NetworkSettings
A network settings container that has metadata about batches it receives for its model.
- __init__(name, config_factory, torch_config, batch_stash)¶
- property batch_metadata: BatchMetadata¶
Return the batch metadata used by this model.
- batch_stash: BatchStash¶
The batch stash that created the batches and has the batch metadata.
zensols.deeplearn.batch.multi module¶
Multi processing with torch.
- class zensols.deeplearn.batch.multi.TorchMultiProcessStash(delegate, config, name, chunk_size, workers)[source]¶
Bases:
MultiProcessStash
A multiprocessing stash that interacts with PyTorch in a way that it can access the GPU(s) in forked subprocesses using the multiprocessing library.
- See:
torch.multiprocessing
- See:
zensols.deeplearn.TorchConfig.init()
- __init__(delegate, config, name, chunk_size, workers)¶
zensols.deeplearn.batch.stash module¶
This file contains a stash used to load an embedding layer.
- class zensols.deeplearn.batch.stash.BatchStash(name, config_factory, delegate, config, chunk_size, workers, data_point_type, batch_type, split_stash_container, vectorizer_manager_set, batch_size, model_torch_config, data_point_id_sets_path, decoded_attributes, batch_feature_mappings=None, batch_limit=9223372036854775807)[source]¶
Bases: TorchMultiProcessStash, SplitKeyContainer, Writeback, Deallocatable
A stash that vectorizes features in to easily consumable tensors for training and testing. This stash produces instances of Batch, which is a batch in the machine learning sense, and the first dimension of what will become the tensor used in PyTorch. Each of these batches has a logical one to many relationship to that batch's respective set of data points, which is encapsulated in the DataPoint class.
The stash creates subprocesses to vectorize features in to tensors in chunks of IDs (data point IDs) from the subordinate stash using DataPointIDSet instances.
To speed up experiments, all available features configured in vectorizer_manager_set are encoded on disk. However, only the decoded_attributes (see attribute below) are available to the model regardless of what was created during encoding time.
The lifecycle of the data follows:
1. Feature data created by the client, which could be language features, row data etc.
2. Vectorize the feature data using the vectorizers in vectorizer_manager_set. This creates the feature contexts (FeatureContext) specifically meant to be pickled.
3. Pickle the feature contexts when dumping to disk, which is invoked in the child processes of this class.
4. At train time, load the feature contexts from disk.
5. Decode the feature contexts in to PyTorch tensors.
6. The model manager uses the to method to copy the CPU tensors to the GPU (where GPUs are available).
Use the split_stash_container to get the dataset, which has a splits property for the feature data. Use the dataset_stash from the application context ConfigFactory for the batch splits.
- See _process: for details on the pickling of the batch instances
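The encode, pickle, load and decode stages of the lifecycle described above can be sketched with the standard library alone. Everything here is a simplification under stated assumptions: a plain dict stands in for a feature context, nested lists stand in for a tensor, and `encode`, `decode` and `LABELS` are hypothetical names.

```python
import pickle
from typing import Dict, List

LABELS = ['setosa', 'versicolor', 'virginica']


def encode(labels: List[str]) -> Dict[str, object]:
    """Vectorize raw labels into a compact, picklable context."""
    return {'feature_id': 'ilabel',
            'indexes': [LABELS.index(lb) for lb in labels]}


def decode(context: Dict[str, object]) -> List[List[float]]:
    """Expand a context into one-hot rows (a stand-in for a tensor)."""
    return [[1.0 if i == ix else 0.0 for i in range(len(LABELS))]
            for ix in context['indexes']]


# Encoding time: vectorize, then pickle the context (bytes here, not a file).
blob = pickle.dumps(encode(['versicolor', 'setosa']))
# Train time: load the context, then decode it.
restored = pickle.loads(blob)
matrix = decode(restored)
```

Persisting the small context rather than the decoded matrix is what keeps the stored form compact in this sketch; the decoded form is rebuilt on demand.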
- _process(chunk)[source]¶
Create the batches by creating the set of data points for each DataPointIDSet instance. When the subordinate stash dumps the batch (specifically a subclass of Batch), the overridden pickle logic is used to detach the batch by encoding all data in to FeatureContext instances.
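One common way to override pickle logic in Python is the `__getstate__` and `__setstate__` pair, sketched below. Whether the library uses exactly this mechanism is an assumption; `DetachableBatchSketch` and its string-length "encoding" are hypothetical and chosen only to show ephemeral references being dropped from the persisted state.

```python
from typing import Any, Dict, Optional, Tuple


class DetachableBatchSketch:
    """Hypothetical stand-in; not the library's Batch class."""

    def __init__(self, batch_stash: Any, id: int,
                 data_points: Tuple[str, ...]):
        self.batch_stash = batch_stash  # ephemeral, never persisted
        self.id = id
        self.data_points: Optional[Tuple[str, ...]] = data_points
        self.contexts: Optional[Dict[str, list]] = None

    def __getstate__(self) -> Dict[str, Any]:
        # pickle calls this when the instance is dumped: encode the data
        # and leave out the stash and the data points.
        contexts = {'length': [len(p) for p in self.data_points]}
        return {'id': self.id, 'contexts': contexts,
                'data_points': None, 'batch_stash': None}

    def __setstate__(self, state: Dict[str, Any]) -> None:
        # pickle calls this when the instance is loaded.
        self.__dict__.update(state)


batch = DetachableBatchSketch(object(), 3, ('a flower', 'iris'))
# The state that pickle would persist for this instance.
state = batch.__getstate__()
```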
- __init__(name, config_factory, delegate, config, chunk_size, workers, data_point_type, batch_type, split_stash_container, vectorizer_manager_set, batch_size, model_torch_config, data_point_id_sets_path, decoded_attributes, batch_feature_mappings=None, batch_limit=9223372036854775807)¶
- property batch_data_point_sets: List[DataPointIDSet]¶
Create the data point ID sets. Each instance returned will correlate to a batch and each set of keys points to a feature DataPoint.
- batch_feature_mappings: BatchFeatureMapping = None¶
The meta data used to encode and decode each feature in to tensors.
- batch_limit: int = 9223372036854775807¶
The max number of batches to process, which is useful for debugging.
- property batch_metadata: BatchMetadata¶
- batch_size: int¶
The number of data points in each batch, except the last (unless the batch size evenly divides the number of data points).
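How `batch_size` and `batch_limit` interact can be sketched with a small chunking generator. The function `chunk_ids` is hypothetical and is not how the stash is confirmed to build its ID sets; the default limit uses `sys.maxsize`, which equals 9223372036854775807 on a 64-bit build.

```python
import sys
from itertools import islice
from typing import Iterable, Iterator, List


def chunk_ids(ids: Iterable[str], batch_size: int,
              batch_limit: int = sys.maxsize) -> Iterator[List[str]]:
    """Yield lists of at most batch_size IDs, up to batch_limit lists."""
    it = iter(ids)
    produced = 0
    while produced < batch_limit:
        chunk = list(islice(it, batch_size))
        if not chunk:
            # The IDs are exhausted.
            break
        yield chunk
        produced += 1


ten_ids = [str(i) for i in range(10)]
sizes = [len(b) for b in chunk_ids(ten_ids, batch_size=4)]
limited = list(chunk_ids(ten_ids, 4, batch_limit=1))
```

With ten IDs and a batch size of four, the last batch holds the two leftover IDs; with eight IDs every batch would be full.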
- batch_type: Type[Batch]¶
The batch class to be instantiated when creating batches.
- clear_all()[source]¶
Clear the batch, batch data point sets, and the source data (
split_stash_container
).
- create_batch(points, split_name=None, batch_id=None)[source]¶
Create a new batch instance with data points, which happens when primed.
- data_point_id_sets_path: Path¶
The path of where to store key data for the splits; note that the container might store its key splits in some other location.
- data_point_type: Type[DataPoint]¶
A subclass type of
DataPoint
implemented for the specific feature.
- property decoded_attributes: Set[str]¶
The attributes to decode; only these are available to the model regardless of what was created during encoding time; if None, all are available. Sequences are converted to sets, which makes configuration easier in YAML files.
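The sequence-to-set conversion and the "None means all" rule can be sketched with a property setter. `DecodedAttributesSketch` and its `filter` method are hypothetical illustrations, not the stash's actual implementation.

```python
from typing import Dict, Iterable, Optional, Set


class DecodedAttributesSketch:
    """Hypothetical holder for the decoded attribute names."""

    def __init__(self, decoded_attributes: Optional[Iterable[str]] = None):
        # Assigning here goes through the property setter below.
        self.decoded_attributes = decoded_attributes

    @property
    def decoded_attributes(self) -> Optional[Set[str]]:
        return self._decoded_attributes

    @decoded_attributes.setter
    def decoded_attributes(self, attribs: Optional[Iterable[str]]) -> None:
        # A YAML list arrives as a Python list; store it as a set.
        self._decoded_attributes = None if attribs is None else set(attribs)

    def filter(self, encoded: Dict[str, list]) -> Dict[str, list]:
        """Keep only the decoded attributes; None keeps everything."""
        keep = self._decoded_attributes
        return {k: v for k, v in encoded.items()
                if keep is None or k in keep}


stash = DecodedAttributesSketch(['label', 'flower_dims', 'label'])
encoded = {'label': [0], 'flower_dims': [5.1], 'unused': [9]}
visible = stash.filter(encoded)
everything = DecodedAttributesSketch(None).filter(encoded)
```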
- load(name)[source]¶
Load a data value from the pickled data with key name. Semantically, this method loads the data using the stash's implementation. For example, DirectoryStash loads the data from a file if it exists, but factory type stashes will always re-generate the data.
- See:
get()
- model_torch_config: TorchConfig¶
The PyTorch configuration used to (optionally) copy CPU to GPU memory.
- prime()[source]¶
If the delegate stash data does not exist, use this implementation to generate the data and process it in child processes.
- split_stash_container: SplitStashContainer¶
The source data stash that has both the data and data set keys for each split (e.g. train vs. test).
- vectorizer_manager_set: FeatureVectorizerManagerSet¶
Used to vectorize features in to tensors.
Module contents¶
Contains classes that batch vectorized data in forked subprocesses.