zensols.dataset package¶
Submodules¶
zensols.dataset.dimreduce module¶
Dimension reduction wrapper and utility classes.
- class zensols.dataset.dimreduce.DecomposeDimensionReducer(data, dim, reduction_meth='pca', normalize='unit', model_args=<factory>)[source]¶
Bases:
DimensionReducer
A dimensionality reducer that uses eigenvector decomposition such as PCA or SVD.
- __init__(data, dim, reduction_meth='pca', normalize='unit', model_args=<factory>)¶
- property description: Dict[str, Any]¶
An object graph of data that describes the results of the model.
- get_components(data=None, one_dir=True)[source]¶
Create start and end points that make up the PCA components, which is useful for rendering lines for visualization.
- Parameters:
data – use in place of the data for component calculation using the (already) trained model
one_dir (bool) – whether or not to create components one way from the mean, or two ways (forward and backward) from the mean
- Return type:
- Returns:
a tuple of numpy arrays, each as a start and end stacked for each component
- class zensols.dataset.dimreduce.DimensionReducer(data, dim, reduction_meth='pca', normalize='unit', model_args=<factory>)[source]¶
Bases:
Dictable
Reduce the dimensionality of a dataset.
- __init__(data, dim, reduction_meth='pca', normalize='unit', model_args=<factory>)¶
- property model: PCA | TruncatedSVD | TSNE¶
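Example (a minimal sketch, not taken from the library's documentation; the random data, the target dimension and the keyword values below are illustrative assumptions):

    import numpy as np
    from zensols.dataset.dimreduce import DecomposeDimensionReducer

    # 100 data points with 10 features each (rows are data points)
    data = np.random.rand(100, 10)
    # reduce to 2 dimensions using PCA (the default reduction_meth)
    reducer = DecomposeDimensionReducer(data=data, dim=2, reduction_meth='pca')
    print(reducer.model)                    # the underlying scikit-learn PCA instance
    print(reducer.description)              # object graph describing the decomposition
    starts_ends = reducer.get_components()  # start/end points per component for plotting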
zensols.dataset.interface module¶
Interfaces used for dealing with dataset splits.
- exception zensols.dataset.interface.DatasetError[source]¶
Bases:
APIError
Thrown when any dataset related error is raised.
- __annotations__ = {}¶
- __module__ = 'zensols.dataset.interface'¶
- class zensols.dataset.interface.SplitKeyContainer[source]¶
Bases:
Writable
An interface defining a container that partitions data sets (i.e. train vs. test). For instances of this class, the data are the unique keys that point at the data.
- __init__()¶
- property counts_by_key: Dict[str, int]¶
Return a mapping of data set split name to the count for that respective split.
- property keys_by_split: Dict[str, Tuple[str]]¶
Generate a dictionary of split name to keys for that split. It is expected this method will be very expensive.
- write(depth=0, writer=<_io.TextIOWrapper name='<stdout>' mode='w' encoding='utf-8'>, include_delegate=False)[source]¶
Write the contents of this instance to writer using indentation depth.
- Parameters:
depth (int) – the starting indentation depth
writer (TextIOBase) – the writer to dump the content of this writable
- class zensols.dataset.interface.SplitStashContainer[source]¶
Bases:
PrimeableStash, SplitKeyContainer
An interface like SplitKeyContainer, but whose implementations are of Stash containing the instance data.
For a default implementation, see DatasetSplitStash.
- __init__()¶
- property split_name: str¶
Return the name of the split this stash contains. Thus, all data/items returned by this stash are in the data set given by this name (i.e. train).
zensols.dataset.leaveout module¶
A split key container for leave-one-out dataset splits.
- class zensols.dataset.leaveout.LeaveNOutSplitKeyContainer(delegate, distribution=<factory>, shuffle=True, path=None)[source]¶
Bases:
SplitKeyContainer
A split key container that leaves one out of the dataset. By default, this creates a dataset that has one data point for validation, another for test, and the rest of the data for training.
- __init__(delegate, distribution=<factory>, shuffle=True, path=None)¶
- distribution: Dict[str, int]¶
The number of data points by each split type. If the value is an integer, that number of data points is used. Otherwise, if it is a float, then that percentage of the entire key set is used.
- next_split()[source]¶
Create the next split so that the next access to properties such as keys_by_split provides the next key split permutation.
- Return type:
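Example (a minimal sketch; the in-memory DictionaryStash delegate and the split names 'validation' and 'test' are assumptions based on the default distribution described above):

    from zensols.persist import DictionaryStash  # assumed in-memory delegate stash
    from zensols.dataset.leaveout import LeaveNOutSplitKeyContainer

    # a delegate stash with ten data points keyed '0'..'9'
    stash = DictionaryStash()
    for i in range(10):
        stash.dump(str(i), i)

    container = LeaveNOutSplitKeyContainer(delegate=stash)
    for _ in range(10):
        splits = container.keys_by_split
        # by default one key lands in validation, one in test, the rest in train
        print(splits['validation'], splits['test'])
        container.next_split()  # rotate to the next leave-one-out permutation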
zensols.dataset.outlier module¶
A simple outlier detection class.
- class zensols.dataset.outlier.OutlierDetector(data, default_method='mahalanobis', threshold=None, proportion=None, return_indicators=None)[source]¶
Bases:
object
Simple outlier detection utility that provides a few different methods of calculation. These include z_score(), mahalanobis() and robust_mahalanobis().
This class removes outliers using either a method specific threshold or a proportion of the data set.
- DETECTION_METHODS = frozenset({'mahalanobis', 'robust_mahalanobis', 'z_score'})¶
- __init__(data, default_method='mahalanobis', threshold=None, proportion=None, return_indicators=None)¶
- data: Union[ndarray, DataFrame]¶
The data in which to find outliers. Data points are rows and the feature vectors are columns.
- default_method: str = 'mahalanobis'¶
The method used when invoking as a Callable with the __call__() method. This must be one of DETECTION_METHODS.
- mahalanobis(significance=0.001)[source]¶
Detect outliers using the Mahalanobis distance in high dimension.
Assuming a multivariate normal distribution of the data with K variables, the Mahalanobis distance follows a chi-squared distribution with K degrees of freedom. For this reason, the cut-off is defined by the square root of the Chi^2 percent point function.
- Parameters:
significance (float) – 1 - the Chi^2 percent point function (inverse of CDF / percentiles) outlier threshold; reasonable values include 2.5%, 1%, 0.01%; if None, use threshold or proportion
- Return type:
- Returns:
indexes into data rows (indexes of a dataframe) of the outliers
- property numpy: ndarray¶
The numpy form of data. If data is a dataframe, it is converted to a numpy array.
- proportion: float = None¶
The proportion of the dataset to use for outliers. The higher the number, the more outliers.
- return_indicators: bool = None¶
Whether to return a list of False (not outlier) or True (outlier) instead of indexes into the input matrix/dataframe (data).
- robust_mahalanobis(significance=0.001, random_state=0)[source]¶
Like mahalanobis() but uses a robust mean and covariance matrix by sampling the dataset.
- Parameters:
significance (float) – 1 - the Chi^2 percent point function (inverse of CDF / percentiles) outlier threshold; reasonable values include 2.5%, 1%, 0.01%; if None, use threshold or proportion
- Return type:
- Returns:
indexes into data rows (indexes of a dataframe) of the outliers
- threshold: float = None¶
The outlier threshold, which is method dependent. This is ignored if proportion is set.
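Example (a minimal sketch with randomly generated data; the feature count and the use of np.delete for removal are illustrative):

    import numpy as np
    from zensols.dataset.outlier import OutlierDetector

    # rows are data points, columns are features
    data = np.random.rand(200, 3)
    detector = OutlierDetector(data)      # default_method='mahalanobis'
    outlier_ixs = detector.mahalanobis()  # row indexes of the detected outliers
    clean = np.delete(data, outlier_ixs, axis=0)
    print(f'removed {len(data) - len(clean)} outliers')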
zensols.dataset.split module¶
Implementations (some abstract) of split key containers.
- class zensols.dataset.split.AbstractSplitKeyContainer(key_path, pattern)[source]¶
Bases:
PersistableContainer, SplitKeyContainer, Primeable, Writable
A default implementation of a SplitKeyContainer. This implementation keeps the order of the keys consistent as well, which is stored at the path given in key_path. Once the keys are generated for the first time, they will persist on the file system.
This abstract class requires an implementation of _create_splits().
- abstract _create_splits()[source]¶
Create the key splits using keys as the split name (i.e. train) and the values as a list of the keys for the corresponding split.
- __init__(key_path, pattern)¶
- pattern: str¶
The file name pattern to use for the keys file key_path on the file system; each file is named after the key split. For example, if {name}.dat is used, train.dat will be a file with the ordered keys.
- write(depth=0, writer=<_io.TextIOWrapper name='<stdout>' mode='w' encoding='utf-8'>)[source]¶
Write the contents of this instance to writer using indentation depth.
- Parameters:
depth (int) – the starting indentation depth
writer (TextIOBase) – the writer to dump the content of this writable
- class zensols.dataset.split.StashSplitKeyContainer(key_path, pattern, stash, distribution=<factory>, shuffle=True)[source]¶
Bases:
AbstractSplitKeyContainer
A default implementation of AbstractSplitKeyContainer that uses a delegate stash as the source of the keys.
- __init__(key_path, pattern, stash, distribution=<factory>, shuffle=True)¶
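Example (a minimal sketch; the DictionaryStash delegate, the paths and the 80/10/10 distribution are illustrative assumptions):

    from pathlib import Path
    from zensols.persist import DictionaryStash  # assumed in-memory delegate stash
    from zensols.dataset.split import StashSplitKeyContainer

    # a delegate stash whose keys will be partitioned into splits
    stash = DictionaryStash()
    for i in range(100):
        stash.dump(str(i), i)

    container = StashSplitKeyContainer(
        key_path=Path('./splits'),   # directory where the key files persist
        pattern='{name}.dat',        # yields train.dat, validation.dat, test.dat
        stash=stash,
        distribution={'train': 0.8, 'validation': 0.1, 'test': 0.1})
    print(container.counts_by_key)   # e.g. {'train': 80, 'validation': 10, 'test': 10}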
- class zensols.dataset.split.StratifiedStashSplitKeyContainer(key_path, pattern, stash, distribution=<factory>, shuffle=True, partition_attr=None, stratified_write=True, split_labels_path=None)[source]¶
Bases:
StashSplitKeyContainer
Like StashSplitKeyContainer but data is stratified by a label (partition_attr) across each split.
- __init__(key_path, pattern, stash, distribution=<factory>, shuffle=True, partition_attr=None, stratified_write=True, split_labels_path=None)¶
- split_labels_path: Path = None¶
If provided, the path is a pickled cache of stratified_count_dataframe.
- property stratified_count_dataframe: DataFrame¶
A count summarization of stratified_split_labels.
- property stratified_split_labels: DataFrame¶
A dataframe with all keys, their respective labels and split.
- stratified_write: bool = True¶
Whether or not to include the stratified counts when writing with write().
- write(depth=0, writer=<_io.TextIOWrapper name='<stdout>' mode='w' encoding='utf-8'>)[source]¶
Write the contents of this instance to writer using indentation depth.
- Parameters:
depth (int) – the starting indentation depth
writer (TextIOBase) – the writer to dump the content of this writable
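Example (a minimal sketch; the Review dataclass, its label attribute and the distribution values are illustrative assumptions):

    from dataclasses import dataclass
    from pathlib import Path
    from zensols.persist import DictionaryStash  # assumed in-memory delegate stash
    from zensols.dataset.split import StratifiedStashSplitKeyContainer

    @dataclass
    class Review:
        text: str
        label: str

    # three negatives for every positive; the stratified split keeps this ratio
    stash = DictionaryStash()
    for i in range(100):
        stash.dump(str(i), Review(f'text {i}', 'pos' if i % 4 == 0 else 'neg'))

    container = StratifiedStashSplitKeyContainer(
        key_path=Path('./strat-splits'),
        pattern='{name}.dat',
        stash=stash,
        distribution={'train': 0.8, 'test': 0.2},
        partition_attr='label')      # stratify on each item's 'label' attribute
    print(container.stratified_count_dataframe)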
zensols.dataset.stash module¶
Utility stashes useful to common machine learning tasks.
- class zensols.dataset.stash.DatasetSplitStash(delegate, split_container)[source]¶
Bases:
DelegateStash, SplitStashContainer, PersistableContainer, Writable
A default implementation of SplitStashContainer. However, it needs an instance of a SplitKeyContainer. This implementation generates a separate stash instance for each data set split (i.e. train vs. test). Each split instance holds the data (keys and values) for each split.
Stash instances by split are obtained with splits, and will have a split attribute that gives the name of the split.
To maintain reproducibility, key ordering must be considered (see SortedDatasetSplitStash).
- __init__(delegate, split_container)¶
- check_key_consistent()[source]¶
Return whether the split_container has the same key count division as this stash’s split counts.
- Return type:
- clear_keys()[source]¶
Clear any cache state for keys, and keys by split. It does this by clearing the key state for the stash, and then the clear() of the split_container.
- exists(name)[source]¶
Return True if data with key name exists.
Implementation note: This Stash.exists() method is very inefficient and should be overridden.
- Return type:
- get(name, default=None)[source]¶
Load an object or a default if key name doesn’t exist.
Implementation note: subclasses will probably want to override this method given the super method is cavalier about calling exists() and load(). Based on the implementation, this can be problematic.
- Return type:
- load(name)[source]¶
Load a data value from the pickled data with key name. Semantically, this method loads the data using the stash’s implementation. For example, DirectoryStash loads the data from a file if it exists, but factory type stashes will always re-generate the data.
- split_container: SplitKeyContainer¶
The instance that provides the splits in the dataset.
- write(depth=0, writer=<_io.TextIOWrapper name='<stdout>' mode='w' encoding='utf-8'>, include_delegate=False)[source]¶
Write the contents of this instance to writer using indentation depth.
- Parameters:
depth (int) – the starting indentation depth
writer (TextIOBase) – the writer to dump the content of this writable
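Example (a minimal sketch; it reuses the stash and container built in the StashSplitKeyContainer sketch above, and assumes splits is available as a property as the class description indicates):

    from zensols.dataset.stash import DatasetSplitStash

    # 'stash' and 'container' as constructed in the StashSplitKeyContainer sketch
    ds_stash = DatasetSplitStash(delegate=stash, split_container=container)
    ds_stash.prime()                        # prepare keys/splits (PrimeableStash)
    train = ds_stash.splits['train']        # a stash holding only the training keys
    print(ds_stash.check_key_consistent())  # True when delegate and container keys agree
    print(len(tuple(train.keys())))         # number of training data points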
- class zensols.dataset.stash.SortedDatasetSplitStash(delegate, split_container, sort_function=None)[source]¶
Bases:
DatasetSplitStash
A sorted version of a DatasetSplitStash, where keys, values, items and iterations are sorted by key. This is important for reproducibility of results.
An alternative is to use DatasetSplitStash with an instance of StashSplitKeyContainer set as the delegate since the key container keeps key ordering consistent.
Any shuffling of the dataset, for the sake of training on non-uniform data, needs to come before using this class. This class also sorts the keys in each split given in splits.
- ATTR_EXP_META = ('sort_function',)¶
- __init__(delegate, split_container, sort_function=None)¶