zensols.dataframe package

Submodules

zensols.dataframe.config module

Configuration classes using dataframes as sources.

class zensols.dataframe.config.DataframeConfig(csv_path, default_section, columns=None, column_eval=None, counts=None)[source]

Bases: DictionaryConfig

A Configurable that uses dataframes as sources. This is useful for providing labels to nominal label vectorizers.

__init__(csv_path, default_section, columns=None, column_eval=None, counts=None)[source]

Initialize the configuration from a dataframe (see parameters).

Parameters:
  • csv_path (Path) – the path to the CSV file to create the dataframe

  • default_section (str) – the name of the single section, whose options are the columns of the dataframe

  • columns (Dict[str, str]) – the columns to add to the configuration from the dataframe, with column names as keys and option names as values

  • column_eval (str) – Python code to evaluate and apply to each column if provided

  • counts (Dict[str, str]) – additional option entries to add to the section as counts of the respective columns, with column option names as keys and new entry option names as values; the column option names are those given as values in the columns dict
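
As a rough illustration of how the columns and counts parameters relate, the sketch below builds a one-section dictionary from CSV text using only the standard library. It is not the library's implementation: the helper name build_section, the sample data, and the choice to store each column as a list of its values and each count as the length of that list are assumptions made for illustration. The column_eval parameter is left out.

```python
import csv
import io
from typing import Dict, Optional


def build_section(csv_text: str, columns: Dict[str, str],
                  counts: Optional[Dict[str, str]] = None) -> Dict[str, object]:
    """Hypothetical helper: map CSV columns to option names in one section."""
    rows = list(csv.DictReader(io.StringIO(csv_text)))
    section: Dict[str, object] = {}
    # key: column name in the CSV, value: option name in the section
    for col_name, option_name in columns.items():
        section[option_name] = [row[col_name] for row in rows]
    # key: an option name added above, value: new option holding its length
    for option_name, count_name in (counts or {}).items():
        section[count_name] = len(section[option_name])
    return section


csv_text = "label,text\npos,good\nneg,bad\nneu,fine\n"
section = build_section(csv_text, columns={'label': 'labels'},
                        counts={'labels': 'num_labels'})
print(section)
```

Note that the keys of counts refer to option names (the values of columns), not to the original CSV column names.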

default_section
serializer

zensols.dataframe.stash module

Stashes that operate on a dataframe, which are useful for common machine learning tasks.

class zensols.dataframe.stash.AutoSplitDataframeStash(dataframe_path, split_col, key_path, distribution)[source]

Bases: SplitKeyDataframeStash

Automatically splits a dataframe into train, test and validation datasets by adding a split_col with the split name.

__init__(dataframe_path, split_col, key_path, distribution)
distribution: Dict[str, float]

The distribution as a proportion across all key splits. The distribution values must add to 1. The keys must include train, test and validate.
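
To make the role of distribution concrete, here is a minimal sketch of assigning keys to splits by proportion. It only illustrates the idea and is not how AutoSplitDataframeStash is implemented: the helper split_keys, the fixed seed, and the rule that the last split takes the remainder are assumptions.

```python
import random
from typing import Dict, List, Sequence


def split_keys(keys: Sequence[str], distribution: Dict[str, float],
               seed: int = 0) -> Dict[str, List[str]]:
    """Hypothetical helper: assign each key to exactly one split."""
    if abs(sum(distribution.values()) - 1.0) > 1e-9:
        raise ValueError('distribution values must add to 1')
    shuffled = list(keys)
    random.Random(seed).shuffle(shuffled)
    names = list(distribution)
    splits: Dict[str, List[str]] = {}
    start = 0
    for i, name in enumerate(names):
        if i == len(names) - 1:
            # the last split takes the remainder so no key is dropped
            end = len(shuffled)
        else:
            end = start + int(round(distribution[name] * len(shuffled)))
        splits[name] = shuffled[start:end]
        start = end
    return splits


keys = [str(i) for i in range(100)]
splits = split_keys(keys, {'train': 0.8, 'test': 0.1, 'validate': 0.1})
sizes = {name: len(ks) for name, ks in splits.items()}
print(sizes)
```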

exception zensols.dataframe.stash.DataframeError[source]

Bases: APIError

Thrown for dataframe stash issues.

class zensols.dataframe.stash.DataframeStash(dataframe_path)[source]

Bases: ReadOnlyStash, Deallocatable, Writable, PrimeableStash

A factory stash that loads its data from a Pandas dataframe. It uses the dataframe index as the keys and pandas.Series as values. The dataframe is usually constructed by reading a file (e.g. CSV) and doing some transformation before using it in an implementation of this stash.

The dataframe created by _get_dataframe() must have a string or integer index since keys for all stashes are of type str. An integer index is automatically mapped to strings.
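
The key handling described above can be sketched without Pandas. In the stand-in below, a plain list of dicts takes the place of the dataframe and each index value is converted to str; the class RowStash and its layout are hypothetical and only mirror the keys(), exists() and load() methods listed for this class.

```python
from typing import Dict, Iterable, List


class RowStash:
    """Hypothetical read-only stash over rows keyed by a stringified index."""

    def __init__(self, index: List[object], rows: List[dict]):
        # map every index value to str, since stash keys are strings
        self._rows: Dict[str, dict] = {str(i): r for i, r in zip(index, rows)}

    def keys(self) -> Iterable[str]:
        return self._rows.keys()

    def exists(self, name: str) -> bool:
        # constant-time membership test rather than a scan over keys()
        return name in self._rows

    def load(self, name: str) -> dict:
        return self._rows[name]


stash = RowStash(index=[10, 11], rows=[{'text': 'a'}, {'text': 'b'}])
print(list(stash.keys()))
print(stash.load('10'))
```

Because the keys are strings, a lookup with the integer 10 would not match; callers pass '10' instead.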

__init__(dataframe_path)
clear()[source]

Delete all data from the stash.

Important: Exercise caution with this method, of course.

property dataframe
dataframe_path: Path

The path to store the pickled version of the generated dataframe created with _get_dataframe().

deallocate()[source]

Deallocate all resources for this instance.

exists(name)[source]

Return True if data with key name exists.

Implementation note: This Stash.exists() method is very inefficient and should be overridden.

Return type:

bool

keys()[source]

Return an iterable of keys in the collection.

Return type:

Iterable[str]

load(name)[source]

Load a data value from the pickled data with key name. Semantically, this method loads the data using the stash’s implementation. For example, DirectoryStash loads the data from a file if it exists, but factory type stashes will always re-generate the data.

See:

get()

Return type:

Series
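
The difference between the two loading behaviors described for load() can be sketched with two small stand-in classes. Neither is part of the library: StoredLoader and FactoryLoader are hypothetical names, and returning None for a missing key is an assumption.

```python
from typing import Callable, Dict, Optional


class StoredLoader:
    """Hypothetical stand-in: returns only data that was stored earlier."""

    def __init__(self) -> None:
        self.store: Dict[str, object] = {}

    def load(self, name: str) -> Optional[object]:
        # nothing is generated here; a missing key yields None (assumption)
        return self.store.get(name)


class FactoryLoader:
    """Hypothetical stand-in: re-generates the data on every call."""

    def __init__(self, factory: Callable[[str], object]) -> None:
        self.factory = factory
        self.calls = 0

    def load(self, name: str) -> object:
        self.calls += 1
        return self.factory(name)


stored = StoredLoader()
factory = FactoryLoader(lambda name: f'generated-{name}')
first = factory.load('a')
second = factory.load('a')
print(stored.load('a'), first, second, factory.calls)
```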

prime()[source]
write(depth=0, writer=sys.stdout)[source]

Write the contents of this instance to writer using indentation depth.

Parameters:
  • depth (int) – the starting indentation depth

  • writer (TextIOBase) – the writer to dump the content of this writable
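
A minimal sketch of the write(depth, writer) pattern follows. The Report class is hypothetical and the four spaces per depth level are an assumption; only the parameter names and the standard output default come from the signature above.

```python
import io
import sys
from typing import List, TextIO


class Report:
    """Hypothetical writable object with a name and a list of items."""

    def __init__(self, name: str, items: List[str]) -> None:
        self.name = name
        self.items = items

    def write(self, depth: int = 0, writer: TextIO = sys.stdout) -> None:
        # assumption: each depth level indents by four spaces
        indent = ' ' * (depth * 4)
        writer.write(f'{indent}{self.name}:\n')
        for item in self.items:
            # items are nested one level deeper than the name
            writer.write(f'{indent}    {item}\n')


buf = io.StringIO()
Report('splits', ['train', 'test']).write(depth=1, writer=buf)
output = buf.getvalue()
print(output, end='')
```

Passing an io.StringIO as writer captures the text instead of printing it, which is convenient in tests and logging.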

class zensols.dataframe.stash.DefaultDataframeStash(dataframe_path, split_col, key_path, input_csv_path)[source]

Bases: SplitKeyDataframeStash

A default implementation of SplitKeyDataframeStash that creates the Pandas dataframe by simply reading it from a specified CSV file. The index is a string type appropriate for a stash.

__init__(dataframe_path, split_col, key_path, input_csv_path)
input_csv_path: Path

A path to the CSV of the source data.

class zensols.dataframe.stash.ResourceFeatureDataframeStash(dataframe_path, split_col, installer, resource)[source]

Bases: SplitColumnDataframeStash

A dataframe stash that installs a corpus and then reads a file to create the Pandas dataframe.

__init__(dataframe_path, split_col, installer, resource)
installer: Installer

The installer used to download and uncompress the dataset.

resource: Resource

Used to resolve the corpus file.

class zensols.dataframe.stash.SplitColumnDataframeStash(dataframe_path, split_col)[source]

Bases: DataframeStash

A stash that provides a way to get the labels and label count of the dataframe.

__init__(dataframe_path, split_col)
get_label_count()[source]
Return type:

int

get_labels(**kwargs) → Tuple[str, ...]
Return type:

Tuple[str, ...]

split_col: str

The column name in the dataframe used to indicate the split (e.g. train vs. test).

class zensols.dataframe.stash.SplitKeyDataframeStash(dataframe_path, split_col, key_path)[source]

Bases: SplitColumnDataframeStash, SplitKeyContainer

A stash and split key container that reads from a dataframe.

__init__(dataframe_path, split_col, key_path)
clear()[source]

Clear any cached state.

clear_keys()[source]

Clear only the cache of keys generated from the group by.

deallocate()[source]

Deallocate all resources for this instance.

key_path: Path

The path where the key splits (as a dict) are pickled.
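
As an illustration of what a pickled key split dict might look like, the sketch below groups row keys by the value in a split column and round-trips the result through pickle. The helper keys_by_split, the sample rows, the temporary file name, and the tuple-valued dict layout are all assumptions, not the library's actual format.

```python
import pickle
import tempfile
from collections import defaultdict
from pathlib import Path
from typing import Dict, Tuple


def keys_by_split(rows: Dict[str, dict],
                  split_col: str) -> Dict[str, Tuple[str, ...]]:
    """Hypothetical helper: group row keys by the value of the split column."""
    grouped = defaultdict(list)
    for key, row in rows.items():
        grouped[row[split_col]].append(key)
    return {name: tuple(keys) for name, keys in grouped.items()}


rows = {'0': {'split': 'train'},
        '1': {'split': 'test'},
        '2': {'split': 'train'}}
with tempfile.TemporaryDirectory() as tmp:
    key_path = Path(tmp) / 'keys.dat'
    # write the split dict once, then read it back as a later run would
    with open(key_path, 'wb') as f:
        pickle.dump(keys_by_split(rows, 'split'), f)
    with open(key_path, 'rb') as f:
        restored = pickle.load(f)
print(restored)
```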

write(depth=0, writer=sys.stdout)[source]

Write the contents of this instance to writer using indentation depth.

Parameters:
  • depth (int) – the starting indentation depth

  • writer (TextIOBase) – the writer to dump the content of this writable

Module contents