zensols.deeplearn.dataframe package

Submodules

zensols.deeplearn.dataframe.batch module

An implementation of batch level API for Pandas dataframe based data.

class zensols.deeplearn.dataframe.batch.DataframeBatch(batch_stash, id, split_name, data_points)[source]

Bases: Batch

A batch of data that contains instances of DataframeDataPoint, each of which has the row data from the dataframe.

__init__(batch_stash, id, split_name, data_points)
get_features()[source]

A utility method that returns a tensor of all features of all columns in the data points.

Return type:

Tensor

Returns:

a tensor of shape (batch size, feature size), where the feature size is the total number of vectorized features; that is, each data instance in the batch is a flattened set of features that represent the respective row of the dataframe
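
A minimal usage sketch (the helper name and how the batch is obtained are assumptions, not part of this API): inspect the flattened feature tensor of a decoded batch.

    from zensols.deeplearn.dataframe.batch import DataframeBatch

    def describe_batch(batch: DataframeBatch):
        # one row per data point, one column per vectorized feature
        feats = batch.get_features()
        print(f'batch {batch.id}: feature shape = {tuple(feats.shape)}')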

class zensols.deeplearn.dataframe.batch.DataframeBatchStash(name, config_factory, delegate, config, chunk_size, workers, data_point_type, batch_type, split_stash_container, vectorizer_manager_set, batch_size, model_torch_config, data_point_id_sets_path, decoded_attributes=<property object>, batch_feature_mappings=None, batch_limit=9223372036854775807)[source]

Bases: BatchStash

A stash used for batches of data using DataframeBatch instances. This stash uses an instance of DataframeFeatureVectorizerManager to vectorize the data in the batches.

__init__(name, config_factory, delegate, config, chunk_size, workers, data_point_type, batch_type, split_stash_container, vectorizer_manager_set, batch_size, model_torch_config, data_point_id_sets_path, decoded_attributes=<property object>, batch_feature_mappings=None, batch_limit=9223372036854775807)
property feature_vectorizer_manager: DataframeFeatureVectorizerManager
property flattened_features_shape: Tuple[int]
property label_shape: Tuple[int]
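
Usage sketch (assumes the stash is created by the application's configuration factory; the helper and the iteration are illustrative only):

    from zensols.deeplearn.dataframe.batch import DataframeBatchStash

    def inspect_stash(stash: DataframeBatchStash):
        # shapes come from the vectorizer manager, not from a trained model
        print('flattened feature shape:', stash.flattened_features_shape)
        print('label shape:', stash.label_shape)
        for batch in stash.values():  # Stash API: iterate decoded batches
            print(batch.id, batch.split_name, len(batch.data_points))
            break
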
class zensols.deeplearn.dataframe.batch.DataframeDataPoint(id, batch_stash, row)[source]

Bases: DataPoint

A data point used in a batch, which contains a single row of data from the Pandas dataframe. When created, each column is saved as an attribute of the instance.

__init__(id, batch_stash, row)
row: InitVar
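
Illustrative only: for a dataframe with hypothetical columns age and label, each DataframeDataPoint exposes the row values as attributes named after the columns.

    def show_point(point):
        # attribute names mirror the dataframe column names (assumed here)
        print(point.id, point.age, point.label)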

zensols.deeplearn.dataframe.util module

Utility functionality for dataframe related containers.

class zensols.deeplearn.dataframe.util.DataFrameDictable[source]

Bases: Dictable

A container with utility methods to write Pandas dataframes and render them as JSON.

DEFAULT_COLS = 40

Default width when writing the dataframe.

NONE_REPR = ''

The string used to represent NaN values.

__init__()
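
A minimal sketch of a subclass (ResultsContainer and its field are hypothetical; the exact rendering depends on the inherited Dictable/Writable machinery):

    from dataclasses import dataclass, field
    import pandas as pd
    from zensols.deeplearn.dataframe.util import DataFrameDictable

    @dataclass
    class ResultsContainer(DataFrameDictable):
        # hypothetical container holding a results dataframe
        results: pd.DataFrame = field(default_factory=pd.DataFrame)

    cont = ResultsContainer(pd.DataFrame({'name': ['a', None], 'score': [0.9, 0.5]}))
    # DEFAULT_COLS caps the dataframe width; NONE_REPR is used for NaN cells
    print(DataFrameDictable.DEFAULT_COLS, repr(DataFrameDictable.NONE_REPR))
    cont.write()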

zensols.deeplearn.dataframe.vectorize module

Contains classes used to vectorize dataframe data.

class zensols.deeplearn.dataframe.vectorize.DataframeFeatureVectorizerManager(name, config_factory, torch_config, configured_vectorizers, prefix, label_col, stash, include_columns=None, exclude_columns=None)[source]

Bases: FeatureVectorizerManager, Writable

A pure, instance-based feature vectorizer manager for a Pandas dataframe. All vectorizers used by this vectorizer manager are dynamically allocated and attached.

This class not only acts as the feature manager itself to be used in a FeatureVectorizerManager, but also provides a batch mapping to be used in a BatchStash.

__init__(name, config_factory, torch_config, configured_vectorizers, prefix, label_col, stash, include_columns=None, exclude_columns=None)
property batch_feature_mapping: BatchFeatureMapping

Return the mapping for zensols.deeplearn.batch.Batch instances.

column_to_feature_id(col)[source]

Generate a feature id from the column name. This simply prepends the prefix to the column name.

Return type:

str
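
For example (vec_mgr is assumed to be an already configured DataframeFeatureVectorizerManager with prefix adl_, the Adult dataset example used in these docstrings):

    feature_id = vec_mgr.column_to_feature_id('age')
    assert feature_id == 'adl_age'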

property dataset_metadata: DataframeMetadata

Create metadata from the data in the dataframe.

exclude_columns: Tuple[str] = None

The columns to be excluded, or if None (the default), no columns are excluded as features.

get_flattened_features_shape(attribs)[source]

Return the shape if all vectorizers were used.

Return type:

Tuple[int]

include_columns: Tuple[str] = None

The columns to be included, or if None (the default), all columns are used as features.

property label_attribute_name: str

Return the label attribute.

label_col: str

The column that contains the label/class.

property label_shape: Tuple[int]

Return the shape if all vectorizers were used.

prefix: str

The prefix to use for all vectorizers in the dataframe (e.g. adl_ for the Adult dataset test case example).

stash: DataframeStash

The stash that contains the dataframe.
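
Usage sketch (vec_mgr is assumed to be a DataframeFeatureVectorizerManager created by the application's ConfigFactory; only members documented above are used):

    def summarize(vec_mgr):
        # metadata derived from the stash's dataframe
        meta = vec_mgr.dataset_metadata
        print('label attribute:', vec_mgr.label_attribute_name)
        print('label values:', meta.label_values)
        print('continuous columns:', meta.continuous)
        # the mapping a BatchStash uses to wire batch attributes to vectorizers
        print(vec_mgr.batch_feature_mapping)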

write(depth=0, writer=<_io.TextIOWrapper name='<stdout>' mode='w' encoding='utf-8'>)[source]

Write the contents of this instance to writer using indentation depth.

Parameters:
  • depth (int) – the starting indentation depth

  • writer (TextIOBase) – the writer to dump the content of this writable
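
Usage sketch: capture the human readable dump in a string rather than writing to stdout (vec_mgr as in the earlier sketch):

    import io

    sio = io.StringIO()
    vec_mgr.write(depth=1, writer=sio)
    print(sio.getvalue())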

class zensols.deeplearn.dataframe.vectorize.DataframeMetadata(prefix, label_col, label_values, continuous, descrete)[source]

Bases: Writable

Metadata for a Pandas dataframe.

__init__(prefix, label_col, label_values, continuous, descrete)
continuous: Tuple[str]

The list of data columns that are continuous.

descrete: Dict[str, Tuple[str]]

A mapping from column to the nominal values the column takes, used for discrete (categorical) mappings.

label_col: str

The column that contains the label/class.

label_values: Tuple[str]

All classes (the unique values of label_col).

prefix: str

The prefix to use for all vectorizers in the dataframe (e.g. adl_ for the Adult dataset test case example).
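
Illustrative only: a hand built instance for a toy version of the Adult dataset; real instances are produced by DataframeFeatureVectorizerManager.dataset_metadata.

    from zensols.deeplearn.dataframe.vectorize import DataframeMetadata

    meta = DataframeMetadata(
        prefix='adl_',
        label_col='label',
        label_values=('<=50K', '>50K'),
        continuous=('age', 'hours_per_week'),
        descrete={'education': ('HS-grad', 'Bachelors', 'Masters')})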

write(depth=0, writer=<_io.TextIOWrapper name='<stdout>' mode='w' encoding='utf-8'>)[source]

Write the contents of this instance to writer using indentation depth.

Parameters:
  • depth (int) – the starting indentation depth

  • writer (TextIOBase) – the writer to dump the content of this writable

Module contents

Contains API framework code for vectorizing and batching dataframe data without requiring a domain specific model implementation.