zensols.dataset package¶
Submodules¶
zensols.dataset.dimreduce module¶
Dimension reduction wrapper and utility classes.
- class zensols.dataset.dimreduce.DecomposeDimensionReducer(data, dim, reduction_meth='pca', normalize='unit', model_args=<factory>)[source]¶
- Bases: - DimensionReducer- A dimensionality reducer that uses eigenvector decomposition such as PCA or SVD. - __init__(data, dim, reduction_meth='pca', normalize='unit', model_args=<factory>)¶
 - property description: Dict[str, Any]¶
- An object graph of data that describes the results of the model. 
 - get_components(data=None, one_dir=True)[source]¶
- Create start and end points that make up the PCA components, which are useful for rendering lines for visualization. - Parameters:
- data – use in place of the - data for component calculation using the (already) trained model
- one_dir ( - bool) – whether or not to create components one way from the mean, or two ways (forward and backward) from the mean
- Return type:
- Returns:
- a tuple of numpy arrays, each as a start and end stacked for each component 
 
 
- class zensols.dataset.dimreduce.DimensionReducer(data, dim, reduction_meth='pca', normalize='unit', model_args=<factory>)[source]¶
- Bases: - Dictable- Reduce the dimensionality of a dataset. - __init__(data, dim, reduction_meth='pca', normalize='unit', model_args=<factory>)¶
 - property model: PCA | TruncatedSVD | TSNE¶
 
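A minimal usage sketch, assuming a small random feature matrix with data points as rows; only the constructor arguments and members documented above are used:

    import numpy as np
    from zensols.dataset.dimreduce import DecomposeDimensionReducer

    # toy data: 100 data points (rows) with 5 features (columns)
    X = np.random.rand(100, 5)

    # reduce to 2 dimensions with the default PCA decomposition
    reducer = DecomposeDimensionReducer(data=X, dim=2)

    # the fitted decomposition model and a description of its results
    print(type(reducer.model).__name__)
    print(reducer.description)

    # start/end points for each component, useful for rendering lines
    components = reducer.get_components(one_dir=True)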
zensols.dataset.interface module¶
Interfaces used for dealing with dataset splits.
- exception zensols.dataset.interface.DatasetError[source]¶
- Bases: - APIError- Thrown when any dataset related error is raised. - __annotations__ = {}¶
 - __module__ = 'zensols.dataset.interface'¶
 
- class zensols.dataset.interface.SplitKeyContainer[source]¶
- Bases: - Writable- An interface defining a container that partitions data sets (i.e. - train vs - test). For instances of this class, the data are the unique keys that point at the data.- __init__()¶
 - property counts_by_key: Dict[str, int]¶
- Return a mapping of data set split name to the count for that respective split. 
 - property keys_by_split: Dict[str, Tuple[str, ...]]¶
- Generate a dictionary of split name to keys for that split. It is expected this method will be very expensive. 
 - write(depth=0, writer=<_io.TextIOWrapper name='<stdout>' mode='w' encoding='utf-8'>, include_delegate=False)[source]¶
- Write the contents of this instance to - writer using indention - depth.- Parameters:
- depth ( - int) – the starting indentation depth
- writer ( - TextIOBase) – the writer to dump the content of this writable
 
 
 
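As an illustrative sketch, any concrete implementation of this interface can be consumed through the two properties above; the container instance is assumed to come from the application's configuration:

    from zensols.dataset.interface import SplitKeyContainer

    def report_splits(container: SplitKeyContainer) -> None:
        # counts_by_key maps a split name (i.e. 'train') to its key count
        for split, count in container.counts_by_key.items():
            print(f'{split}: {count} keys')
        # keys_by_split maps a split name to the keys in that split; the
        # docstring above warns this can be expensive, so call it sparingly
        keys = container.keys_by_split
        print('first training key:', keys['train'][0])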
- class zensols.dataset.interface.SplitStashContainer[source]¶
- Bases: - PrimeableStash,- SplitKeyContainer- An interface like - SplitKeyContainer, but whose implementations are of - Stash containing the instance data.- For a default implementation, see - DatasetSplitStash.- __init__()¶
 - property split_name: str¶
- Return the name of the split this stash contains. Thus, all data/items returned by this stash are in the data set given by this name (i.e. - train).
 
zensols.dataset.leaveout module¶
A split key container for leave-one-out dataset splits.
- class zensols.dataset.leaveout.LeaveNOutSplitKeyContainer(delegate, distribution=<factory>, shuffle=True, path=None)[source]¶
- Bases: - SplitKeyContainer- A split key container that leaves one out of the dataset. By default, this creates a dataset that has one data point for validation, another for test, and the rest of the data for training. - __init__(delegate, distribution=<factory>, shuffle=True, path=None)¶
 - 
distribution: Dict[str, int]¶
- The number of data points for each split type. If the value is an integer, that number of data points is used. Otherwise, if it is a float, then that percentage of the entire key set is used. 
 - next_split()[source]¶
- Create the next split so that the next access to properties such as - keys_by_split provides the next key split permutation.- Return type:
 
 
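A hedged sketch of iterating leave-one-out permutations; the stash of data points is assumed to be configured elsewhere and the default distribution (one validation key, one test key, the rest for training) is used:

    from zensols.persist import Stash
    from zensols.dataset.leaveout import LeaveNOutSplitKeyContainer

    def show_permutations(stash: Stash, n: int = 3) -> None:
        # leave one key out for validation and one for test by default
        container = LeaveNOutSplitKeyContainer(delegate=stash)
        for _ in range(n):
            # key counts per split for the current permutation
            print(container.counts_by_key)
            # advance to the next key split permutation
            container.next_split()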
zensols.dataset.multilabel module¶
A multilabel stratifier.
- class zensols.dataset.multilabel.MultiLabelStratifierSplitKeyContainer(key_path, pattern, stash, distribution=<factory>, shuffle=True, partition_attr=None, stratified_write=True, split_labels_path=None, split_preference=None, move_portion=0.5, min_source_occurances=0)[source]¶
- Bases: - StratifiedStashSplitKeyContainer- Creates stratified two-way splits between token-level annotated feature sentences. - __init__(key_path, pattern, stash, distribution=<factory>, shuffle=True, partition_attr=None, stratified_write=True, split_labels_path=None, split_preference=None, move_portion=0.5, min_source_occurances=0)¶
 - 
min_source_occurances: int = 0¶
- The minimum number of occurrences for a label to trigger the key move described in - split_preference.
 - 
move_portion: float = 0.5¶
- The portion of data points per label to move based on - split_preference.
 - 
split_preference: Tuple[str, ...] = None¶
- The list of splits to give preference to by moving data for labels that have no instances. For example, - ('test', 'validation') would move data points from - validation to - test for labels that have no occurrences in - test.
 - write(depth=0, writer=<_io.TextIOWrapper name='<stdout>' mode='w' encoding='utf-8'>)[source]¶
- Write the contents of this instance to - writer using indention - depth.- Parameters:
- depth ( - int) – the starting indentation depth
- writer ( - TextIOBase) – the writer to dump the content of this writable
 
 
 
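A hedged construction sketch showing how the move related fields fit together; the stash, the key_path location, the distribution proportions and the partition_attr value are assumptions about the surrounding application rather than part of this API:

    from pathlib import Path
    from zensols.persist import Stash
    from zensols.dataset.multilabel import MultiLabelStratifierSplitKeyContainer

    def create_container(stash: Stash) -> MultiLabelStratifierSplitKeyContainer:
        return MultiLabelStratifierSplitKeyContainer(
            key_path=Path('target/keys'),
            pattern='{name}.dat',
            stash=stash,
            distribution={'train': 0.8, 'validation': 0.1, 'test': 0.1},
            # label attribute on each data point used to stratify (assumed name)
            partition_attr='label',
            # when a label has no occurrences in 'test', move data points
            # from 'validation' to 'test' (see split_preference above)
            split_preference=('test', 'validation'),
            move_portion=0.5,
            min_source_occurances=1)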
zensols.dataset.outlier module¶
A simple outlier detection class.
- class zensols.dataset.outlier.OutlierDetector(data, default_method='mahalanobis', threshold=None, proportion=None, return_indicators=None)[source]¶
- Bases: - object- Simple outlier detection utility that provides a few different methods of calculation. These include - z_score(),- mahalanobis() and - robust_mahalanobis().- This class removes outliers either using a method specific - threshold or by a - proportion of the data set.- DETECTION_METHODS = frozenset({'mahalanobis', 'robust_mahalanobis', 'z_score'})¶
 - __init__(data, default_method='mahalanobis', threshold=None, proportion=None, return_indicators=None)¶
 - 
data: Union[ndarray, DataFrame]¶
- The data in which to find outliers. Data points are rows and the feature vectors are columns. 
 - 
default_method: str = 'mahalanobis'¶
- The method used when invoking as a - Callable with the - __call__() method. This must be one of - DETECTION_METHODS.
 - mahalanobis(significance=0.001)[source]¶
- Detect outliers using the Mahalanobis distance in high dimension. - Assuming a multivariate normal distribution of the data with K variables, the Mahalanobis distance follows a chi-squared distribution with K degrees of freedom. For this reason, the cut-off is defined by the square root of the Chi^2 percent point function. - Parameters:
- significance ( - float) – 1 - the Chi^2 percent point function (inverse of cdf / percentiles) outlier threshold; reasonable values include 2.5%, 1%, 0.01%; if None use - threshold or - proportion
- Return type:
- Returns:
- indexes into - data rows (indexes of a dataframe) of the outliers
 
 - property numpy: ndarray¶
- The numpy form of - data. If - data is a dataframe, it is converted to a numpy array.
 - 
proportion: float = None¶
- The proportion of the dataset to use for outliers. The higher the number the more outliers. - See:
 
 - 
return_indicators: bool = None¶
- Whether to return a list of - False (not outlier) or - True (outlier) instead of indexes into the input matrix/dataframe (- data).
 - robust_mahalanobis(significance=0.001, random_state=0)[source]¶
- Like - mahalanobis() but uses a robust mean and covariance matrix by sampling the dataset.- Parameters:
- significance ( - float) – 1 - the Chi^2 percent point function (inverse of cdf / percentiles) outlier threshold; reasonable values include 2.5%, 1%, 0.01%; if None use - threshold or - proportion
- Return type:
- Returns:
- indexes into - data rows (indexes of a dataframe) of the outliers
 
 - 
threshold: float = None¶
- The outlier threshold, which is method dependent. This is ignored if - proportion is set.
 
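A self-contained sketch on synthetic data with a few injected extreme rows; only the constructor arguments and methods documented above are used:

    import numpy as np
    from zensols.dataset.outlier import OutlierDetector

    # 200 inlier rows plus 3 extreme rows appended at the end
    rng = np.random.default_rng(0)
    data = np.vstack([rng.normal(0, 1, size=(200, 3)),
                      rng.normal(10, 1, size=(3, 3))])

    # invoking the detector as a callable uses default_method ('mahalanobis')
    # and returns the row indexes of the outliers
    detector = OutlierDetector(data)
    print(detector())

    # the same calculation with an explicit significance level
    print(detector.mahalanobis(significance=0.01))

    # proportion based: significance=None falls back to proportion, and
    # return_indicators=True yields True/False per row instead of indexes
    prop_detector = OutlierDetector(data, proportion=0.02, return_indicators=True)
    print(prop_detector.mahalanobis(significance=None))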
zensols.dataset.split module¶
Implementations (some abstract) of split key containers.
- class zensols.dataset.split.AbstractSplitKeyContainer(key_path, pattern)[source]¶
- Bases: - PersistableContainer,- SplitKeyContainer,- Primeable,- Writable- A default implementation of a - SplitKeyContainer. This implementation also keeps the order of the keys consistent, which is stored at the path given in - key_path. Once the keys are generated for the first time, they will persist on the file system.- This abstract class requires an implementation of - _create_splits().- abstract _create_splits()[source]¶
- Create the key splits, using the split name (i.e. - train) as keys and a list of the keys for the corresponding split as values.
 - __init__(key_path, pattern)¶
 - 
pattern: str¶
- The file name pattern to use for the keys file - key_path on the file system; each file is named after the key split. For example, if - {name}.dat is used, - train.dat will be a file with the ordered keys.
 - write(depth=0, writer=<_io.TextIOWrapper name='<stdout>' mode='w' encoding='utf-8'>)[source]¶
- Write the contents of this instance to - writer using indention - depth.- Parameters:
- depth ( - int) – the starting indentation depth
- writer ( - TextIOBase) – the writer to dump the content of this writable
 
 
 
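An illustrative sketch of a concrete subclass; the hard-coded key lists and the key_path location are hypothetical, relying only on the dataclass fields and the abstract _create_splits() contract described above:

    from typing import Dict, Tuple
    from pathlib import Path
    from zensols.dataset.split import AbstractSplitKeyContainer

    class FixedSplitKeyContainer(AbstractSplitKeyContainer):
        """Toy container with hard-coded key splits."""

        def _create_splits(self) -> Dict[str, Tuple[str, ...]]:
            # split name -> ordered keys for that split
            return {'train': ('1', '2', '3', '4'),
                    'validation': ('5',),
                    'test': ('6',)}

    # the first access generates the splits and persists them under key_path
    container = FixedSplitKeyContainer(
        key_path=Path('target/keys'), pattern='{name}.dat')
    print(container.keys_by_split)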
- class zensols.dataset.split.StashSplitKeyContainer(key_path, pattern, stash, distribution=<factory>, shuffle=True)[source]¶
- Bases: - AbstractSplitKeyContainer- A default implementation of - AbstractSplitKeyContainer that uses a delegate stash as the source of the keys.- __init__(key_path, pattern, stash, distribution=<factory>, shuffle=True)¶
 
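A brief wiring sketch; the delegate stash is assumed to exist, and the distribution of split proportions is an illustrative guess at the expected shape rather than a documented default:

    from pathlib import Path
    from zensols.persist import Stash
    from zensols.dataset.split import StashSplitKeyContainer

    def create_key_container(stash: Stash) -> StashSplitKeyContainer:
        # keys come from the delegate stash and are persisted under key_path,
        # one file per split named with the pattern (i.e. train.dat)
        return StashSplitKeyContainer(
            key_path=Path('target/keys'),
            pattern='{name}.dat',
            stash=stash,
            distribution={'train': 0.8, 'validation': 0.1, 'test': 0.1},
            shuffle=True)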
- class zensols.dataset.split.StratifiedCrossFoldSplitKeyContainer(key_path, pattern, stash, distribution=<factory>, shuffle=True, partition_attr=None, stratified_write=True, split_labels_path=None, n_folds=None)[source]¶
- Bases: - StratifiedStashSplitKeyContainer- Like - StratifiedStashSplitKeyContainer, but creates splits used for cross-fold validation when batching. This creates a new dataset for each fold by setting - distribution.- 
FOLD_FORMAT: ClassVar[str] = 'fold-{fold_ix}-{iter_ix}'¶
- The format used for naming results in - ModelExecutor.
 - __init__(key_path, pattern, stash, distribution=<factory>, shuffle=True, partition_attr=None, stratified_write=True, split_labels_path=None, n_folds=None)¶
 
- class zensols.dataset.split.StratifiedStashSplitKeyContainer(key_path, pattern, stash, distribution=<factory>, shuffle=True, partition_attr=None, stratified_write=True, split_labels_path=None)[source]¶
- Bases: - StashSplitKeyContainer- Like - StashSplitKeyContainer but data is stratified by a label (- partition_attr) across each split.- __init__(key_path, pattern, stash, distribution=<factory>, shuffle=True, partition_attr=None, stratified_write=True, split_labels_path=None)¶
 - 
split_labels_path: Path = None¶
- If provided, the path is a pickled cache of - stratified_count_dataframe.
 - property stratified_count_dataframe: DataFrame¶
- A count summarization of - stratified_split_labels.
 - property stratified_split_labels: DataFrame¶
- A dataframe with all keys, their respective labels and split. 
 - 
stratified_write: bool = True¶
- Whether or not to include the stratified counts when writing with - write().
 - write(depth=0, writer=<_io.TextIOWrapper name='<stdout>' mode='w' encoding='utf-8'>)[source]¶
- Write the contents of this instance to - writer using indention - depth.- Parameters:
- depth ( - int) – the starting indentation depth
- writer ( - TextIOBase) – the writer to dump the content of this writable
 
 
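A sketch exercising the stratification members above; the stash, the label attribute name and the split proportions are assumptions about the application:

    import sys
    from pathlib import Path
    from zensols.persist import Stash
    from zensols.dataset.split import StratifiedStashSplitKeyContainer

    def report_stratification(stash: Stash) -> None:
        container = StratifiedStashSplitKeyContainer(
            key_path=Path('target/keys'),
            pattern='{name}.dat',
            stash=stash,
            distribution={'train': 0.8, 'validation': 0.1, 'test': 0.1},
            # attribute on each data point holding the label to stratify by
            partition_attr='label')
        # per split label counts as a Pandas DataFrame
        print(container.stratified_count_dataframe)
        # write() includes the stratified counts since stratified_write is True
        container.write(writer=sys.stdout)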
 
zensols.dataset.stash module¶
Utility stashes useful to common machine learning tasks.
- class zensols.dataset.stash.DatasetSplitStash(delegate, split_container)[source]¶
- Bases: - DelegateStash,- SplitStashContainer,- PersistableContainer,- Writable- A default implementation of - SplitStashContainer. However, it needs an instance of a - SplitKeyContainer. This implementation generates a separate stash instance for each data set split (i.e. - train vs - test). Each split instance holds the data (keys and values) for each split.- Stash instances by split are obtained with - splits, and will have a - split attribute that gives the name of the split.- To maintain reproducibility, key ordering must be considered (see - SortedDatasetSplitStash).- __init__(delegate, split_container)¶
 - check_key_consistent()[source]¶
- Return whether the - split_container has the same key count division as this stash’s split counts.- Return type:
 
 - clear_keys()[source]¶
- Clear any cache state for keys, and keys by split. It does this by clearing the key state for the stash, and then calling - clear() on the - split_container.
 - exists(name)[source]¶
- Return - True if data with key - name exists.- Implementation note: This - Stash.exists() method is very inefficient and should be overridden.- Return type:
 
 - get(name, default=None)[source]¶
- Load an object or a default if key - name doesn’t exist.- Implementation note: subclasses will probably want to override this method given the super method is cavalier about calling - exists() and - load(). Based on the implementation, this can be problematic.- Return type:
 
 - load(name)[source]¶
- Load a data value from the pickled data with key - name. Semantically, this method loads using the stash’s implementation. For example, - DirectoryStash loads the data from a file if it exists, but factory type stashes will always re-generate the data.
 - 
split_container: SplitKeyContainer¶
- The instance that provides the splits in the dataset. 
 - write(depth=0, writer=<_io.TextIOWrapper name='<stdout>' mode='w' encoding='utf-8'>, include_delegate=False)[source]¶
- Write the contents of this instance to - writer using indention - depth.- Parameters:
- depth ( - int) – the starting indentation depth
- writer ( - TextIOBase) – the writer to dump the content of this writable
 
 
 
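A consumption sketch; the stash instance is assumed to be assembled elsewhere (typically through application configuration) from a delegate stash and a split key container, and only members documented above are used:

    from zensols.dataset.stash import DatasetSplitStash

    def iterate_training(stash: DatasetSplitStash) -> None:
        # sanity check: the split container and stash agree on key counts
        assert stash.check_key_consistent()
        # keys_by_split comes from the split container's interface
        for key in stash.split_container.keys_by_split['train']:
            item = stash.get(key)
            print(key, type(item).__name__)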
- class zensols.dataset.stash.SortedDatasetSplitStash(delegate, split_container, sort_function=None)[source]¶
- Bases: - DatasetSplitStash- A sorted version of a - DatasetSplitStash, where keys, values, items and iterations are sorted by key. This is important for reproducibility of results.- An alternative is to use - DatasetSplitStash with an instance of - StashSplitKeyContainer set as the - delegate since the key container keeps key ordering consistent.- Any shuffling of the dataset, for the sake of training on non-uniform data, needs to come before using this class. This class also sorts the keys in each split given in - splits.- ATTR_EXP_META = ('sort_function',)¶
 - __init__(delegate, split_container, sort_function=None)¶
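A brief construction sketch; sort_function=int is an illustrative assumption that the keys are numeric strings, and the delegate and split container stand for configured instances:

    from zensols.persist import Stash
    from zensols.dataset.interface import SplitKeyContainer
    from zensols.dataset.stash import SortedDatasetSplitStash

    def create_sorted(delegate: Stash,
                      split_container: SplitKeyContainer) -> SortedDatasetSplitStash:
        # keys, values and items iterate in sorted key order for reproducibility
        return SortedDatasetSplitStash(
            delegate=delegate,
            split_container=split_container,
            sort_function=int)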