zensols.deeplearn.dataframe package

Submodules

zensols.deeplearn.dataframe.batch module

An implementation of batch level API for Pandas dataframe based data.

class zensols.deeplearn.dataframe.batch.DataframeBatch(batch_stash, id, split_name, data_points)[source]

Bases: Batch

A batch of data that contains instances of DataframeDataPoint, each of which has the row data from the dataframe.

__init__(batch_stash, id, split_name, data_points)
get_features()[source]

A utility method that returns a tensor of all features of all columns in the data points.

Return type:

Tensor

Returns:

a tensor of shape (batch size, feature size), where the feature size is the total number of vectorized features; that is, each data instance in the batch is a flattened set of features that represent the respective row of the dataframe
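
A minimal usage sketch (the helper name and how the batch is obtained are assumptions, not part of this API): inspect the flattened feature tensor of a decoded batch.

    from zensols.deeplearn.dataframe.batch import DataframeBatch

    def describe_batch(batch: DataframeBatch):
        # one row per data point, one column per vectorized feature
        feats = batch.get_features()
        print(f'batch {batch.id}: feature shape = {tuple(feats.shape)}')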

class zensols.deeplearn.dataframe.batch.DataframeBatchStash(name, config_factory, delegate, config, chunk_size, workers, data_point_type, batch_type, split_stash_container, vectorizer_manager_set, batch_size, model_torch_config, data_point_id_sets_path, decoded_attributes=<property object>, batch_feature_mappings=None, batch_limit=9223372036854775807)[source]

Bases: BatchStash

A stash used for batches of data using DataframeBatch instances. This stash uses an instance of DataframeFeatureVectorizerManager to vectorize the data in the batches.

__init__(name, config_factory, delegate, config, chunk_size, workers, data_point_type, batch_type, split_stash_container, vectorizer_manager_set, batch_size, model_torch_config, data_point_id_sets_path, decoded_attributes=<property object>, batch_feature_mappings=None, batch_limit=9223372036854775807)
property feature_vectorizer_manager: DataframeFeatureVectorizerManager
property flattened_features_shape: Tuple[int]
property label_shape: Tuple[int]
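
Usage sketch (assumes the stash is created by the application's configuration factory; the helper and the iteration are illustrative only):

    from zensols.deeplearn.dataframe.batch import DataframeBatchStash

    def inspect_stash(stash: DataframeBatchStash):
        # shapes come from the vectorizer manager, not from a trained model
        print('flattened feature shape:', stash.flattened_features_shape)
        print('label shape:', stash.label_shape)
        for batch in stash.values():  # Stash API: iterate decoded batches
            print(batch.id, batch.split_name, len(batch.data_points))
            break
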
class zensols.deeplearn.dataframe.batch.DataframeDataPoint(id, batch_stash, row)[source]

Bases: DataPoint

A data point used in a batch, which contains a single row of data from the Pandas dataframe. When created, each column is saved as an attribute of the instance.

__init__(id, batch_stash, row)
row: InitVar
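
Illustrative only: for a dataframe with hypothetical columns age and label, each DataframeDataPoint exposes the row values as attributes named after the columns.

    def show_point(point):
        # attribute names mirror the dataframe column names (assumed here)
        print(point.id, point.age, point.label)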

zensols.deeplearn.dataframe.util module

Utility functionality for dataframe related containers.

class zensols.deeplearn.dataframe.util.DataFrameDictable[source]

Bases: Dictable

A container with utility methods to write Pandas dataframes and render them as JSON.

DEFAULT_COLS = 40

Default width when writing the dataframe.

NONE_REPR = ''

The string used to represent NaN values.

__init__()
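
A minimal sketch of a subclass (ResultsContainer and its field are hypothetical; the exact rendering depends on the inherited Dictable/Writable machinery):

    from dataclasses import dataclass, field
    import pandas as pd
    from zensols.deeplearn.dataframe.util import DataFrameDictable

    @dataclass
    class ResultsContainer(DataFrameDictable):
        # hypothetical container holding a results dataframe
        results: pd.DataFrame = field(default_factory=pd.DataFrame)

    cont = ResultsContainer(pd.DataFrame({'name': ['a', None], 'score': [0.9, 0.5]}))
    # DEFAULT_COLS caps the dataframe width; NONE_REPR is used for NaN cells
    print(DataFrameDictable.DEFAULT_COLS, repr(DataFrameDictable.NONE_REPR))
    cont.write()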

zensols.deeplearn.dataframe.vectorize module

Contains classes used to vectorize dataframe data.

class zensols.deeplearn.dataframe.vectorize.DataframeFeatureVectorizerManager(name, config_factory, torch_config, configured_vectorizers, prefix, label_col, stash, include_columns=None, exclude_columns=None)[source]

Bases: FeatureVectorizerManager, Writable

A pure, instance-based feature vectorizer manager for a Pandas dataframe. All vectorizers used by this vectorizer manager are dynamically allocated and attached.

This class not only acts as the feature manager itself to be used in a FeatureVectorizerManager, but also provides a batch mapping to be used in a BatchStash.

__init__(name, config_factory, torch_config, configured_vectorizers, prefix, label_col, stash, include_columns=None, exclude_columns=None)
property batch_feature_mapping: BatchFeatureMapping

Return the mapping for zensols.deeplearn.batch.Batch instances.

column_to_feature_id(col)[source]

Generate a feature id from the column name. This simply prepends the prefix to the column name.

Return type:

str
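
For example (vec_mgr is assumed to be an already configured DataframeFeatureVectorizerManager with prefix adl_, the Adult dataset example used in these docstrings):

    feature_id = vec_mgr.column_to_feature_id('age')
    assert feature_id == 'adl_age'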

property dataset_metadata: DataframeMetadata

Create metadata from the data in the dataframe.

exclude_columns: Tuple[str] = None

The columns to be excluded, or if None (the default), no columns are excluded as features.

get_flattened_features_shape(attribs)[source]

Return the shape if all vectorizers were used.

Return type:

Tuple[int]

include_columns: Tuple[str] = None

The columns to be included, or if None (the default), all columns are used as features.

property label_attribute_name: str

Return the label attribute.

label_col: str

The column that contains the label/class.

property label_shape: Tuple[int]

Return the shape if all vectorizers were used.

prefix: str

The prefix to use for all vectorizers in the dataframe (e.g. adl_ for the Adult dataset test case example).

stash: DataframeStash

The stash that contains the dataframe.
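
Usage sketch (vec_mgr is assumed to be a DataframeFeatureVectorizerManager created by the application's ConfigFactory; only members documented above are used):

    def summarize(vec_mgr):
        # metadata derived from the stash's dataframe
        meta = vec_mgr.dataset_metadata
        print('label attribute:', vec_mgr.label_attribute_name)
        print('label values:', meta.label_values)
        print('continuous columns:', meta.continuous)
        # the mapping a BatchStash uses to wire batch attributes to vectorizers
        print(vec_mgr.batch_feature_mapping)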

write(depth=0, writer=<_io.TextIOWrapper name='<stdout>' mode='w' encoding='utf-8'>)[source]

Write the contents of this instance to writer using indentation depth.

Parameters:
  • depth (int) – the starting indentation depth

  • writer (TextIOBase) – the writer to dump the content of this writable
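
Usage sketch: capture the human readable dump in a string rather than writing to stdout (vec_mgr as in the earlier sketch):

    import io

    sio = io.StringIO()
    vec_mgr.write(depth=1, writer=sio)
    print(sio.getvalue())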

class zensols.deeplearn.dataframe.vectorize.DataframeMetadata(prefix, label_col, label_values, continuous, descrete)[source]

Bases: Writable

Metadata for a Pandas dataframe.

__init__(prefix, label_col, label_values, continuous, descrete)
continuous: Tuple[str]

The list of data columns that are continuous.

descrete: Dict[str, Tuple[str]]

A mapping from column to the nominal values the column takes, used for discrete (categorical) mappings.

label_col: str

The column that contains the label/class.

label_values: Tuple[str]

All classes (the unique values of label_col).

prefix: str

The prefix to use for all vectorizers in the dataframe (e.g. adl_ for the Adult dataset test case example).
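
Illustrative only: a hand built instance for a toy version of the Adult dataset; real instances are produced by DataframeFeatureVectorizerManager.dataset_metadata.

    from zensols.deeplearn.dataframe.vectorize import DataframeMetadata

    meta = DataframeMetadata(
        prefix='adl_',
        label_col='label',
        label_values=('<=50K', '>50K'),
        continuous=('age', 'hours_per_week'),
        descrete={'education': ('HS-grad', 'Bachelors', 'Masters')})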

write(depth=0, writer=<_io.TextIOWrapper name='<stdout>' mode='w' encoding='utf-8'>)[source]

Write the contents of this instance to writer using indentation depth.

Parameters:
  • depth (int) – the starting indentation depth

  • writer (TextIOBase) – the writer to dump the content of this writable

Module contents

Contains API framework code for vectorizing and batching dataframe data without requiring a domain specific model implementation.