zensols.dataframe package¶

Submodules¶

zensols.dataframe.config module¶

Configuration classes using dataframes as sources.

class zensols.dataframe.config.DataframeConfig(csv_path, default_section, columns=None, column_eval=None, counts=None)[source]¶

Bases: DictionaryConfig

A Configurable that uses dataframes as sources. This is useful for providing labels to nominal label vectorizers.

__init__(csv_path, default_section, columns=None, column_eval=None, counts=None)[source]¶

Initialize the configuration from a dataframe (see parameters).

Parameters:

csv_path (Path) – the path to the CSV file to create the dataframe
default_section (str) – the singleton section name, which has as options a list of the columns of the dataframe
columns (Dict[str, str]) – the columns to add to the configuration from the dataframe, given as a mapping of column names to option names
column_eval (str) – Python code to evaluate and apply to each column if provided
counts (Dict[str, str]) – additional option entries to add to the section as counts of the respective columns, given as a mapping of column option names to new entry option names, where the column option names are those given as values in the columns dict
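
The relationship between columns and counts can be sketched with plain dictionaries. This only illustrates the parameter semantics and is not the class's implementation: build_section is a hypothetical helper, the CSV content is made up, and the sketch assumes that a count is the number of values read for the column. The column_eval parameter is left out.

```python
import csv
import io
from typing import Dict, Optional


def build_section(csv_text: str, columns: Dict[str, str],
                  counts: Optional[Dict[str, str]] = None) -> Dict[str, object]:
    """Hypothetical helper: turn CSV columns into option entries."""
    rows = list(csv.DictReader(io.StringIO(csv_text)))
    section: Dict[str, object] = {}
    # columns: key is the CSV column name, value is the option name
    for col_name, option_name in columns.items():
        section[option_name] = [row[col_name] for row in rows]
    # counts: key is an option name created above, value is the new entry name
    # (assumption: the count is the number of values in that option)
    for option_name, count_name in (counts or {}).items():
        section[count_name] = len(section[option_name])
    return section


csv_text = "label,desc\npos,positive\nneg,negative\nneu,neutral\n"
section = build_section(
    csv_text,
    columns={"label": "labels"},
    counts={"labels": "num_labels"})
print(section)
```

Note that the keys of counts refer to option names (the values of columns), not to the original CSV column names.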

default_section¶

serializer¶

zensols.dataframe.stash module¶

Stashes that operate on a dataframe, which are useful for common machine learning tasks.

class zensols.dataframe.stash.AutoSplitDataframeStash(dataframe_path, split_col, key_path, distribution)[source]¶

Bases: SplitKeyDataframeStash

Automatically splits a dataframe into train, test and validation datasets by adding a split_col with the split name.

__init__(dataframe_path, split_col, key_path, distribution)¶

distribution: Dict[str, float]¶: The distribution as a proportion across all key splits. The values must sum to 1, and the keys must include train, test and validate.
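
A minimal sketch of how such a distribution could be validated and applied to a list of keys. The split_keys function, the seed parameter and the rounding strategy are assumptions made for illustration; the class itself may shuffle and assign rows differently.

```python
import random
from typing import Dict, List, Sequence


def split_keys(keys: Sequence[str], distribution: Dict[str, float],
               seed: int = 0) -> Dict[str, List[str]]:
    """Hypothetical helper: assign keys to splits by proportion."""
    required = {"train", "test", "validate"}
    missing = required - set(distribution)
    if missing:
        raise ValueError(f"missing split names: {sorted(missing)}")
    # tolerate floating point error when checking the sum
    if abs(sum(distribution.values()) - 1.0) > 1e-9:
        raise ValueError("distribution values must add to 1")
    shuffled = list(keys)
    random.Random(seed).shuffle(shuffled)
    names = list(distribution)
    splits: Dict[str, List[str]] = {}
    start = 0
    for i, name in enumerate(names):
        if i == len(names) - 1:
            # the last split takes the remainder so rounding drops no key
            end = len(shuffled)
        else:
            end = start + int(round(distribution[name] * len(shuffled)))
        splits[name] = shuffled[start:end]
        start = end
    return splits


keys = [str(i) for i in range(10)]
splits = split_keys(keys, {"train": 0.8, "test": 0.1, "validate": 0.1})
print({name: len(members) for name, members in splits.items()})
```

Every key lands in exactly one split, which is the property a split column needs.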

exception zensols.dataframe.stash.DataframeError[source]¶

Bases: APIError

Thrown for dataframe stash issues.

class zensols.dataframe.stash.DataframeStash(dataframe_path)[source]¶

Bases: ReadOnlyStash, Deallocatable, Writable, PrimeableStash

A factory stash that loads its data from a Pandas dataframe. It uses the dataframe index as the keys and pandas.Series as values. The dataframe is usually constructed by reading a file (e.g., CSV) and doing some transformation before using it in an implementation of this stash.

The dataframe created by _get_dataframe() must have a string or integer index since keys for all stashes are of type str. An integer index is automatically mapped to strings.
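
The index requirement can be illustrated without Pandas. In this sketch a plain dictionary stands in for the dataframe, with its keys playing the role of the index; to_stash_keys is a hypothetical helper and not part of the package.

```python
from typing import Any, Dict, Hashable


def to_stash_keys(rows_by_index: Dict[Hashable, Any]) -> Dict[str, Any]:
    """Hypothetical helper: normalize index values to string keys."""
    normalized: Dict[str, Any] = {}
    for index, row in rows_by_index.items():
        # bool is a subclass of int in Python, so it is rejected explicitly
        if isinstance(index, bool) or not isinstance(index, (int, str)):
            raise TypeError(f"unsupported index type: {type(index).__name__}")
        normalized[str(index)] = row
    return normalized


rows = {0: {"text": "first"}, 1: {"text": "second"}}
stash_data = to_stash_keys(rows)
print(list(stash_data))
```

With an actual dataframe, a comparable conversion is `df.index = df.index.map(str)`, though whether the stash does it this way is not stated here.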

__init__(dataframe_path)¶

clear()[source]¶

Delete all data from the stash.

Important: Exercise caution with this method, of course.

property dataframe¶

dataframe_path: Path¶: The path to store the pickled version of the generated dataframe created with _get_dataframe().

deallocate()[source]¶: Deallocate all resources for this instance.

exists(name)[source]¶

Return True if data with key name exists.

Implementation note: This Stash.exists() method is very inefficient and should be overridden.

Return type:: bool
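
To show why overriding a generic existence check can pay off, the sketch below contrasts a fallback that scans every key with an override that tests membership directly. Both classes are hypothetical stand-ins; how the base Stash.exists() is actually implemented is an assumption not confirmed by this page.

```python
from typing import Dict, Iterable


class ScanningStash:
    """Hypothetical stash whose exists() is a generic fallback."""

    def __init__(self, data: Dict[str, object]):
        self._data = data

    def keys(self) -> Iterable[str]:
        return iter(self._data)

    def exists(self, name: str) -> bool:
        # generic fallback: linear scan over every key
        return any(key == name for key in self.keys())


class IndexedStash(ScanningStash):
    """Hypothetical subclass that overrides exists()."""

    def exists(self, name: str) -> bool:
        # override: hash-based membership test on the backing mapping
        return name in self._data


data = {str(i): i * i for i in range(1000)}
slow, fast = ScanningStash(data), IndexedStash(data)
print(slow.exists("999"), fast.exists("999"), fast.exists("1000"))
```

Both versions return the same answers; only the cost per lookup differs.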

keys()[source]¶

Return an iterable of keys in the collection.

Return type:: Iterable[str]

load(name)[source]¶

Load a data value from the pickled data with key name. Semantically, this method loads the data using the stash’s implementation. For example, DirectoryStash loads the data from a file if it exists, but factory type stashes will always re-generate the data.

See:: get()
Return type:: Series

prime()[source]¶

write(depth=0, writer=sys.stdout)[source]¶

Write the contents of this instance to writer using indentation depth.

Parameters:

depth (int) – the starting indentation depth
writer (TextIOBase) – the writer to dump the content of this writable
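
A minimal sketch of the depth and writer convention, using a hypothetical Inventory class that is not part of the package. It assumes each depth level is rendered as four spaces; the library's actual indentation width is not stated here.

```python
import io
import sys
from typing import List, TextIO


class Inventory:
    """Hypothetical writable object used only for illustration."""

    def __init__(self, name: str, items: List[str]):
        self.name = name
        self.items = items

    def write(self, depth: int = 0, writer: TextIO = sys.stdout) -> None:
        # assumption: one depth level is rendered as four spaces
        indent = " " * (4 * depth)
        writer.write(f"{indent}{self.name}:\n")
        for item in self.items:
            # children are written one level deeper than the parent line
            writer.write(f"{indent}    {item}\n")


buffer = io.StringIO()
Inventory("splits", ["train", "test"]).write(depth=1, writer=buffer)
output = buffer.getvalue()
print(output, end="")
```

Passing an in-memory writer such as io.StringIO makes the output easy to capture in tests, while the default writes to standard output.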

class zensols.dataframe.stash.DefaultDataframeStash(dataframe_path, split_col, key_path, input_csv_path)[source]¶

Bases: SplitKeyDataframeStash

A default implementation of SplitKeyDataframeStash that creates the Pandas dataframe by simply reading it from a specified CSV file. The index is a string type appropriate for a stash.

__init__(dataframe_path, split_col, key_path, input_csv_path)¶

input_csv_path: Path¶: A path to the CSV of the source data.

class zensols.dataframe.stash.ResourceFeatureDataframeStash(dataframe_path, split_col, installer, resource)[source]¶

Bases: SplitColumnDataframeStash

A dataframe stash that installs a corpus and then reads a file to create the Pandas dataframe.

__init__(dataframe_path, split_col, installer, resource)¶

installer: Installer¶: The installer used to download and uncompress the dataset.

resource: Resource¶: Used to resolve the corpus file.

class zensols.dataframe.stash.SplitColumnDataframeStash(dataframe_path, split_col)[source]¶

Bases: DataframeStash

A stash that provides a way to get the labels and label count of the dataframe.

__init__(dataframe_path, split_col)¶

get_label_count()[source]¶

Return type:: int

get_labels(**kwargs) → Tuple[str, ...]¶

Return type:: Tuple[str, ...]

split_col: str¶: The column name in the dataframe used to indicate the split (e.g., train vs. test).

class zensols.dataframe.stash.SplitKeyDataframeStash(dataframe_path, split_col, key_path)[source]¶

Bases: SplitColumnDataframeStash, SplitKeyContainer

A stash and split key container that reads from a dataframe.

__init__(dataframe_path, split_col, key_path)¶

clear()[source]¶: Clear any cached state.

clear_keys()[source]¶: Clear only the cache of keys generated from the group by.

deallocate()[source]¶: Deallocate all resources for this instance.

key_path: Path¶: The path where the key splits (as a dict) are pickled.
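
The idea of grouping keys by the split column and pickling the resulting dict can be sketched as follows. The keys_by_split helper, the sample rows and the file name are hypothetical, and plain dictionaries stand in for dataframe rows.

```python
import pickle
import tempfile
from collections import defaultdict
from pathlib import Path
from typing import Dict, List


def keys_by_split(rows: Dict[str, Dict[str, str]],
                  split_col: str) -> Dict[str, List[str]]:
    """Hypothetical helper: group row keys by their split column value."""
    splits: Dict[str, List[str]] = defaultdict(list)
    for key, row in rows.items():
        splits[row[split_col]].append(key)
    return dict(splits)


rows = {
    "0": {"text": "a", "split": "train"},
    "1": {"text": "b", "split": "test"},
    "2": {"text": "c", "split": "train"},
    "3": {"text": "d", "split": "validate"},
}
key_splits = keys_by_split(rows, "split")

with tempfile.TemporaryDirectory() as tmp:
    # hypothetical file name standing in for key_path
    key_path = Path(tmp) / "keys.dat"
    with open(key_path, "wb") as f:
        pickle.dump(key_splits, f)
    with open(key_path, "rb") as f:
        restored = pickle.load(f)

print(restored)
```

Caching the grouped keys on disk avoids recomputing the group-by on every run, which is presumably why a separate way to clear that cache exists.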

write(depth=0, writer=sys.stdout)[source]¶

Write the contents of this instance to writer using indentation depth.

Parameters:

depth (int) – the starting indentation depth
writer (TextIOBase) – the writer to dump the content of this writable
