zensols.dataset package¶
Submodules¶
zensols.dataset.dimreduce module¶
Dimension reduction wrapper and utility classes.
- class zensols.dataset.dimreduce.DecomposeDimensionReducer(data, dim, reduction_meth='pca', normalize='unit', model_args=<factory>)[source]¶
- Bases: - DimensionReducer- A dimensionality reducer that uses eigenvector decomposition such as PCA or SVD. - __init__(data, dim, reduction_meth='pca', normalize='unit', model_args=<factory>)¶
 - property description: Dict[str, Any]¶
- An object graph of data that describes the results of the model. 
 - get_components(data=None, one_dir=True)[source]¶
- Create start and end points that make up the PCA components, which are useful for rendering lines for visualization. - Parameters:
- data – use in place of the - data for component calculation using the (already) trained model
- one_dir ( - bool) – whether or not to create components one way from the mean, or two ways (forward and backward) from the mean
- Return type:
- Returns:
- a tuple of numpy arrays, each as a start and end stacked for each component 
 
 
- class zensols.dataset.dimreduce.DimensionReducer(data, dim, reduction_meth='pca', normalize='unit', model_args=<factory>)[source]¶
- Bases: - Dictable- Reduce the dimensionality of a dataset. - __init__(data, dim, reduction_meth='pca', normalize='unit', model_args=<factory>)¶
 - property model: PCA | TruncatedSVD | TSNE¶
 
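A minimal usage sketch, assuming a small random feature matrix with data points as rows; only the constructor arguments and members documented above are used:

    import numpy as np
    from zensols.dataset.dimreduce import DecomposeDimensionReducer

    # toy data: 100 data points (rows) with 5 features (columns)
    X = np.random.rand(100, 5)

    # reduce to 2 dimensions with the default PCA decomposition
    reducer = DecomposeDimensionReducer(data=X, dim=2)

    # the fitted decomposition model and a description of its results
    print(type(reducer.model).__name__)
    print(reducer.description)

    # start/end points for each component, useful for rendering lines
    components = reducer.get_components(one_dir=True)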
zensols.dataset.interface module¶
Interfaces used for dealing with dataset splits.
- exception zensols.dataset.interface.DatasetError[source]¶
- Bases: - APIError- Thrown when any dataset related error is raised. - __annotations__ = {}¶
 - __module__ = 'zensols.dataset.interface'¶
 
- class zensols.dataset.interface.SplitKeyContainer[source]¶
- Bases: - Writable- An interface defining a container that partitions data sets (i.e. - train vs - test). For instances of this class, the data are the unique keys that point at the data.- __init__()¶
 - property counts_by_key: Dict[str, int]¶
- Return a mapping of data set split name to the count for that respective split. 
 - property keys_by_split: Dict[str, Tuple[str, ...]]¶
- Generate a dictionary of split name to keys for that split. It is expected this method will be very expensive. 
 - write(depth=0, writer=<_io.TextIOWrapper name='<stdout>' mode='w' encoding='utf-8'>, include_delegate=False)[source]¶
- Write the contents of this instance to - writer using indention - depth.- Parameters:
- depth ( - int) – the starting indentation depth
- writer ( - TextIOBase) – the writer to dump the content of this writable
 
 
 
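As an illustrative sketch, any concrete implementation of this interface can be consumed through the two properties above; the container instance is assumed to come from the application's configuration:

    from zensols.dataset.interface import SplitKeyContainer

    def report_splits(container: SplitKeyContainer) -> None:
        # counts_by_key maps a split name (i.e. 'train') to its key count
        for split, count in container.counts_by_key.items():
            print(f'{split}: {count} keys')
        # keys_by_split maps a split name to the keys in that split; the
        # docstring above warns this can be expensive, so call it sparingly
        keys = container.keys_by_split
        print('first training key:', keys['train'][0])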
- class zensols.dataset.interface.SplitStashContainer[source]¶
- Bases: - PrimeableStash,- SplitKeyContainer- An interface like - SplitKeyContainer, but whose implementations are of - Stash containing the instance data.- For a default implementation, see - DatasetSplitStash.- __init__()¶
 - property split_name: str¶
- Return the name of the split this stash contains. Thus, all data/items returned by this stash are in the data set given by this name (i.e. - train).
 
zensols.dataset.leaveout module¶
A split key container for leave-one-out dataset splits.
- class zensols.dataset.leaveout.LeaveNOutSplitKeyContainer(delegate, distribution=<factory>, shuffle=True, path=None)[source]¶
- Bases: - SplitKeyContainer- A split key container that leaves one out of the dataset. By default, this creates a dataset that has one data point for validation, another for test, and the rest of the data for training. - __init__(delegate, distribution=<factory>, shuffle=True, path=None)¶
 - 
distribution: Dict[str, int]¶
- The number of data points for each split type. If the value is an integer, that number of data points is used. Otherwise, if it is a float, then that percentage of the entire key set is used. 
 - next_split()[source]¶
- Create the next split so that the next access to properties such as - keys_by_split provides the next key split permutation.- Return type:
 
 
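A hedged sketch of iterating leave-one-out permutations; the stash of data points is assumed to be configured elsewhere and the default distribution (one validation key, one test key, the rest for training) is used:

    from zensols.persist import Stash
    from zensols.dataset.leaveout import LeaveNOutSplitKeyContainer

    def show_permutations(stash: Stash, n: int = 3) -> None:
        # leave one key out for validation and one for test by default
        container = LeaveNOutSplitKeyContainer(delegate=stash)
        for _ in range(n):
            # key counts per split for the current permutation
            print(container.counts_by_key)
            # advance to the next key split permutation
            container.next_split()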
zensols.dataset.multilabel module¶
A multilabel stratifier.
- class zensols.dataset.multilabel.MultiLabelStratifierSplitKeyContainer(key_path, pattern, stash, distribution=<factory>, shuffle=True, partition_attr=None, stratified_write=True, split_labels_path=None, split_preference=None, move_portion=0.5, min_source_occurances=0)[source]¶
- Bases: - StratifiedStashSplitKeyContainer- Creates stratified two-way splits between token-level annotated feature sentences. - __init__(key_path, pattern, stash, distribution=<factory>, shuffle=True, partition_attr=None, stratified_write=True, split_labels_path=None, split_preference=None, move_portion=0.5, min_source_occurances=0)¶
 - 
min_source_occurances: int = 0¶
- The minimum number of occurrences for a label to trigger the key move described in - split_preference.
 - 
move_portion: float = 0.5¶
- The portion of data points per label to move based on - split_preference.
 - 
split_preference: Tuple[str, ...] = None¶
- The list of splits to give preference to by moving data for labels that have no instances. For example, - ('test', 'validation') would move data points from - validation to - test for labels that have no occurrences in - test.
 - write(depth=0, writer=<_io.TextIOWrapper name='<stdout>' mode='w' encoding='utf-8'>)[source]¶
- Write the contents of this instance to - writer using indention - depth.- Parameters:
- depth ( - int) – the starting indentation depth
- writer ( - TextIOBase) – the writer to dump the content of this writable
 
 
 
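A hedged construction sketch showing how the move related fields fit together; the stash, the key_path location, the distribution proportions and the partition_attr value are assumptions about the surrounding application rather than part of this API:

    from pathlib import Path
    from zensols.persist import Stash
    from zensols.dataset.multilabel import MultiLabelStratifierSplitKeyContainer

    def create_container(stash: Stash) -> MultiLabelStratifierSplitKeyContainer:
        return MultiLabelStratifierSplitKeyContainer(
            key_path=Path('target/keys'),
            pattern='{name}.dat',
            stash=stash,
            distribution={'train': 0.8, 'validation': 0.1, 'test': 0.1},
            # label attribute on each data point used to stratify (assumed name)
            partition_attr='label',
            # when a label has no occurrences in 'test', move data points
            # from 'validation' to 'test' (see split_preference above)
            split_preference=('test', 'validation'),
            move_portion=0.5,
            min_source_occurances=1)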
zensols.dataset.outlier module¶
A simple outlier detection class.
- class zensols.dataset.outlier.OutlierDetector(data, default_method='mahalanobis', threshold=None, proportion=None, return_indicators=None)[source]¶
- Bases: - object- Simple outlier detection utility that provides a few different methods of calculation. These include - z_score(),- mahalanobis() and - robust_mahalanobis().- This class removes outliers either using a method specific - threshold or by a - proportion of the data set.- DETECTION_METHODS = frozenset({'mahalanobis', 'robust_mahalanobis', 'z_score'})¶
 - __init__(data, default_method='mahalanobis', threshold=None, proportion=None, return_indicators=None)¶
 - 
data: Union[ndarray, DataFrame]¶
- The data in which to find outliers. Data points are rows and the feature vectors are columns. 
 - 
default_method: str = 'mahalanobis'¶
- The method used when invoking as a - Callable with the - __call__() method. This must be one of - DETECTION_METHODS.
 - mahalanobis(significance=0.001)[source]¶
- Detect outliers using the Mahalanobis distance in high dimension. - Assuming a multivariate normal distribution of the data with K variables, the Mahalanobis distance follows a chi-squared distribution with K degrees of freedom. For this reason, the cut-off is defined by the square root of the Chi^2 percent point function. - Parameters:
- significance ( - float) – 1 - the Chi^2 percent point function (inverse of cdf / percentiles) outlier threshold; reasonable values include 2.5%, 1%, 0.01%; if None use - threshold or - proportion
- Return type:
- Returns:
- indexes into - data rows (indexes of a dataframe) of the outliers
 
 - property numpy: ndarray¶
- The numpy form of - data. If - data is a dataframe, it is converted to a numpy array.
 - 
proportion: float = None¶
- The proportion of the dataset to use for outliers. The higher the number the more outliers. - See:
 
 - 
return_indicators: bool = None¶
- Whether to return a list of - False (not outlier) or - True (outlier) instead of indexes into the input matrix/dataframe (- data).
 - robust_mahalanobis(significance=0.001, random_state=0)[source]¶
- Like - mahalanobis() but uses a robust mean and covariance matrix by sampling the dataset.- Parameters:
- significance ( - float) – 1 - the Chi^2 percent point function (inverse of cdf / percentiles) outlier threshold; reasonable values include 2.5%, 1%, 0.01%; if None use - threshold or - proportion
- Return type:
- Returns:
- indexes into - data rows (indexes of a dataframe) of the outliers
 
 - 
threshold: float = None¶
- The outlier threshold, which is method dependent. This is ignored if - proportion is set.
 
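A self-contained sketch on synthetic data with a few injected extreme rows; only the constructor arguments and methods documented above are used:

    import numpy as np
    from zensols.dataset.outlier import OutlierDetector

    # 200 inlier rows plus 3 extreme rows appended at the end
    rng = np.random.default_rng(0)
    data = np.vstack([rng.normal(0, 1, size=(200, 3)),
                      rng.normal(10, 1, size=(3, 3))])

    # invoking the detector as a callable uses default_method ('mahalanobis')
    # and returns the row indexes of the outliers
    detector = OutlierDetector(data)
    print(detector())

    # the same calculation with an explicit significance level
    print(detector.mahalanobis(significance=0.01))

    # proportion based: significance=None falls back to proportion, and
    # return_indicators=True yields True/False per row instead of indexes
    prop_detector = OutlierDetector(data, proportion=0.02, return_indicators=True)
    print(prop_detector.mahalanobis(significance=None))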
zensols.dataset.split module¶
Implementations (some abstract) of split key containers.
- class zensols.dataset.split.AbstractSplitKeyContainer(key_path, pattern)[source]¶
- Bases: - PersistableContainer,- SplitKeyContainer,- Primeable,- Writable- A default implementation of a - SplitKeyContainer. This implementation also keeps the order of the keys consistent, which is stored at the path given in - key_path. Once the keys are generated for the first time, they will persist on the file system.- This abstract class requires an implementation of - _create_splits().- abstract _create_splits()[source]¶
- Create the key splits, using the split name (i.e. - train) as keys and a list of the keys for the corresponding split as values.
 - __init__(key_path, pattern)¶
 - 
pattern: str¶
- The file name pattern to use for the keys file - key_path on the file system; each file is named after the key split. For example, if - {name}.dat is used, - train.dat will be a file with the ordered keys.
 - write(depth=0, writer=<_io.TextIOWrapper name='<stdout>' mode='w' encoding='utf-8'>)[source]¶
- Write the contents of this instance to - writer using indention - depth.- Parameters:
- depth ( - int) – the starting indentation depth
- writer ( - TextIOBase) – the writer to dump the content of this writable
 
 
 
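An illustrative sketch of a concrete subclass; the hard-coded key lists and the key_path location are hypothetical, relying only on the dataclass fields and the abstract _create_splits() contract described above:

    from typing import Dict, Tuple
    from pathlib import Path
    from zensols.dataset.split import AbstractSplitKeyContainer

    class FixedSplitKeyContainer(AbstractSplitKeyContainer):
        """Toy container with hard-coded key splits."""

        def _create_splits(self) -> Dict[str, Tuple[str, ...]]:
            # split name -> ordered keys for that split
            return {'train': ('1', '2', '3', '4'),
                    'validation': ('5',),
                    'test': ('6',)}

    # the first access generates the splits and persists them under key_path
    container = FixedSplitKeyContainer(
        key_path=Path('target/keys'), pattern='{name}.dat')
    print(container.keys_by_split)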
- class zensols.dataset.split.StashSplitKeyContainer(key_path, pattern, stash, distribution=<factory>, shuffle=True)[source]¶
- Bases: - AbstractSplitKeyContainer- A default implementation of - AbstractSplitKeyContainer that uses a delegate stash as the source of the keys.- __init__(key_path, pattern, stash, distribution=<factory>, shuffle=True)¶
 
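A brief wiring sketch; the delegate stash is assumed to exist, and the distribution of split proportions is an illustrative guess at the expected shape rather than a documented default:

    from pathlib import Path
    from zensols.persist import Stash
    from zensols.dataset.split import StashSplitKeyContainer

    def create_key_container(stash: Stash) -> StashSplitKeyContainer:
        # keys come from the delegate stash and are persisted under key_path,
        # one file per split named with the pattern (i.e. train.dat)
        return StashSplitKeyContainer(
            key_path=Path('target/keys'),
            pattern='{name}.dat',
            stash=stash,
            distribution={'train': 0.8, 'validation': 0.1, 'test': 0.1},
            shuffle=True)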
- class zensols.dataset.split.StratifiedCrossFoldSplitKeyContainer(key_path, pattern, stash, distribution=<factory>, shuffle=True, partition_attr=None, stratified_write=True, split_labels_path=None, n_folds=None)[source]¶
- Bases: - StratifiedStashSplitKeyContainer- Like - StratifiedStashSplitKeyContainer, but creates splits used for cross-fold validation when batching. This creates a new dataset for each fold by setting - distribution.- 
FOLD_FORMAT: ClassVar[str] = 'fold-{fold_ix}-{iter_ix}'¶
- The format used for naming results in - ModelExecutor.
 - __init__(key_path, pattern, stash, distribution=<factory>, shuffle=True, partition_attr=None, stratified_write=True, split_labels_path=None, n_folds=None)¶
 
- class zensols.dataset.split.StratifiedStashSplitKeyContainer(key_path, pattern, stash, distribution=<factory>, shuffle=True, partition_attr=None, stratified_write=True, split_labels_path=None)[source]¶
- Bases: - StashSplitKeyContainer- Like - StashSplitKeyContainer but data is stratified by a label (- partition_attr) across each split.- __init__(key_path, pattern, stash, distribution=<factory>, shuffle=True, partition_attr=None, stratified_write=True, split_labels_path=None)¶
 - 
split_labels_path: Path = None¶
- If provided, the path is a pickled cache of - stratified_count_dataframe.
 - property stratified_count_dataframe: DataFrame¶
- A count summarization of - stratified_split_labels.
 - property stratified_split_labels: DataFrame¶
- A dataframe with all keys, their respective labels and split. 
 - 
stratified_write: bool = True¶
- Whether or not to include the stratified counts when writing with - write().
 - write(depth=0, writer=<_io.TextIOWrapper name='<stdout>' mode='w' encoding='utf-8'>)[source]¶
- Write the contents of this instance to - writer using indention - depth.- Parameters:
- depth ( - int) – the starting indentation depth
- writer ( - TextIOBase) – the writer to dump the content of this writable
 
 
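A sketch exercising the stratification members above; the stash, the label attribute name and the split proportions are assumptions about the application:

    import sys
    from pathlib import Path
    from zensols.persist import Stash
    from zensols.dataset.split import StratifiedStashSplitKeyContainer

    def report_stratification(stash: Stash) -> None:
        container = StratifiedStashSplitKeyContainer(
            key_path=Path('target/keys'),
            pattern='{name}.dat',
            stash=stash,
            distribution={'train': 0.8, 'validation': 0.1, 'test': 0.1},
            # attribute on each data point holding the label to stratify by
            partition_attr='label')
        # per split label counts as a Pandas DataFrame
        print(container.stratified_count_dataframe)
        # write() includes the stratified counts since stratified_write is True
        container.write(writer=sys.stdout)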
 
zensols.dataset.stash module¶
Utility stashes useful to common machine learning tasks.
- class zensols.dataset.stash.DatasetSplitStash(delegate, split_container)[source]¶
- Bases: - DelegateStash,- SplitStashContainer,- PersistableContainer,- Writable- A default implementation of - SplitStashContainer. However, it needs an instance of a - SplitKeyContainer. This implementation generates a separate stash instance for each data set split (i.e. - train vs - test). Each split instance holds the data (keys and values) for each split.- Stash instances by split are obtained with - splits, and will have a - split attribute that gives the name of the split.- To maintain reproducibility, key ordering must be considered (see - SortedDatasetSplitStash).- __init__(delegate, split_container)¶
 - check_key_consistent()[source]¶
- Return whether the - split_container has the same key count division as this stash’s split counts.- Return type:
 
 - clear_keys()[source]¶
- Clear any cache state for keys, and keys by split. It does this by clearing the key state for the stash, and then calling - clear() on the - split_container.
 - exists(name)[source]¶
- Return - True if data with key - name exists.- Implementation note: This - Stash.exists() method is very inefficient and should be overridden.- Return type:
 
 - get(name, default=None)[source]¶
- Load an object or a default if key - name doesn’t exist.- Implementation note: subclasses will probably want to override this method given the super method is cavalier about calling - exists() and - load(). Based on the implementation, this can be problematic.- Return type:
 
 - load(name)[source]¶
- Load a data value from the pickled data with key - name. Semantically, this method loads using the stash’s implementation. For example, - DirectoryStash loads the data from a file if it exists, but factory type stashes will always re-generate the data.
 - 
split_container: SplitKeyContainer¶
- The instance that provides the splits in the dataset. 
 - write(depth=0, writer=<_io.TextIOWrapper name='<stdout>' mode='w' encoding='utf-8'>, include_delegate=False)[source]¶
- Write the contents of this instance to - writer using indention - depth.- Parameters:
- depth ( - int) – the starting indentation depth
- writer ( - TextIOBase) – the writer to dump the content of this writable
 
 
 
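A consumption sketch; the stash instance is assumed to be assembled elsewhere (typically through application configuration) from a delegate stash and a split key container, and only members documented above are used:

    from zensols.dataset.stash import DatasetSplitStash

    def iterate_training(stash: DatasetSplitStash) -> None:
        # sanity check: the split container and stash agree on key counts
        assert stash.check_key_consistent()
        # keys_by_split comes from the split container's interface
        for key in stash.split_container.keys_by_split['train']:
            item = stash.get(key)
            print(key, type(item).__name__)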
- class zensols.dataset.stash.SortedDatasetSplitStash(delegate, split_container, sort_function=None)[source]¶
- Bases: - DatasetSplitStash- A sorted version of a - DatasetSplitStash, where keys, values, items and iterations are sorted by key. This is important for reproducibility of results.- An alternative is to use - DatasetSplitStash with an instance of - StashSplitKeyContainer set as the - delegate since the key container keeps key ordering consistent.- Any shuffling of the dataset, for the sake of training on non-uniform data, needs to come before using this class. This class also sorts the keys in each split given in - splits.- ATTR_EXP_META = ('sort_function',)¶
 - __init__(delegate, split_container, sort_function=None)¶
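A brief construction sketch; sort_function=int is an illustrative assumption that the keys are numeric strings, and the delegate and split container stand for configured instances:

    from zensols.persist import Stash
    from zensols.dataset.interface import SplitKeyContainer
    from zensols.dataset.stash import SortedDatasetSplitStash

    def create_sorted(delegate: Stash,
                      split_container: SplitKeyContainer) -> SortedDatasetSplitStash:
        # keys, values and items iterate in sorted key order for reproducibility
        return SortedDatasetSplitStash(
            delegate=delegate,
            split_container=split_container,
            sort_function=int)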