zensols.dataset.db
Preemptively compute a dataset (i.e. features from natural language utterances) and store it in Elasticsearch. This is useful for training, testing, validating and developing machine learning models.
The unit of data is an instance. An instance set (or just instances) makes up the dataset.
The idea is to abstract away Elasticsearch, but that might be a future enhancement. At the moment functions don't carry Elasticsearch artifacts, but they are exposed.
There are three basic ways to use this data:
- Get all instances (i.e. an utterance or a feature set). In this case all data returned from ids is considered training data. This is the default nascent state.
- Split the data into a train and test set (see divide-by-set).
- Use the data as a cross fold validation and iterate folds (see divide-by-fold).
The information used to represent either fold or the test/train split is referred to as the dataset split state and is stored in Elasticsearch under a different mapping-type in the same index as the instances.
See ids for more information.
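As an orientation, the three modes can be sketched as follows. This is a hedged sketch: it assumes a connection built with elasticsearch-connection (e.g. via the create-iter-connection example under elasticsearch-connection) and a running Elasticsearch instance.

```clojure
;; sketch only: create-iter-connection is the example helper shown
;; under elasticsearch-connection; Elasticsearch must be running
(with-connection (create-iter-connection)
  ;; 1. nascent state: everything returned by ids is training data
  (ids)
  ;; 2. test/train split: 75% of instances go to the train bucket
  (divide-by-set 0.75)
  (ids :set-type :test)
  ;; 3. cross fold validation: 10 folds, then select the first fold
  (divide-by-fold 10)
  (set-fold 0))
```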
clear
(clear & {:keys [wipe-persistent?], :or {wipe-persistent? false}})
Clear the in-memory instance data. If key :wipe-persistent? is true, all fold and test/train split data is also cleared.
distribution
(distribution)
Return maps representing the data set distribution by class label. Each element of the returned sequence has the following keys:
- :class-label the class-label of the instances
- :count the number of instances for :class-label
divide-by-fold
(divide-by-fold)
(divide-by-fold folds & {:keys [shuffle?], :or {shuffle? true}})
Divide the data into folds and initialize the current fold in the dataset split state. Using this kind of dataset split is useful for cross fold validation.
- folds number of folds to use, which defaults to 10
See set-fold
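A typical cross fold validation loop might look like the following sketch. Here evaluate-fold is a hypothetical function (not part of this namespace) standing in for the actual train/score step, and a connection is assumed to be bound with with-connection.

```clojure
;; split into 10 folds, then visit each fold in turn;
;; evaluate-fold is hypothetical: it would train on the train bucket
;; and score on the test bucket
(divide-by-fold 10)
(doseq [fold (range 10)]
  (set-fold fold)
  (evaluate-fold (instances :set-type :train)
                 (instances :set-type :test)))
```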
divide-by-preset
(divide-by-preset)
Divide the data into test and training buckets. The respective train/test buckets are dictated by the :set-type label given as a parameter to the :create-instances-fn function, as documented in elasticsearch-connection.
divide-by-set
(divide-by-set)
(divide-by-set train-ratio & {:keys [dist-type shuffle? max-instances seed], :as opts, :or {shuffle? true, dist-type (quote uneven)}})
Divide the dataset into test and training buckets.
- train-ratio the percentage of data in the train bucket, which defaults to 0.5
Keys
- :dist-type one of the following symbols:
  - even: each test/training set has an even distribution by class label
  - uneven: each test/training set has an uneven distribution by class label
- :shuffle? if true then shuffle the set before partitioning, otherwise just update the demarcation boundary
- :filter-fn if given, a filter function that takes a key as input
- :max-instances the maximum number of instances per class
- :seed if given, seed the random number generator, otherwise don't return random documents
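For example, the following sketch (assuming a bound connection) creates an 80/20 split with an even per-class distribution and a reproducible shuffle:

```clojure
;; 80% of instances land in the train bucket; the seed makes the
;; shuffle deterministic across runs
(divide-by-set 0.8 :dist-type 'even :shuffle? true :seed 1)
(count (ids :set-type :train))
```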
elasticsearch-connection
(elasticsearch-connection index-name & {:keys [create-instances-fn population-use set-type url mapping-type-def cache-inst], :or {create-instances-fn identity, population-use 1.0, set-type :train, mapping-type-def {instance-key {:type "nested"}, class-label-key {:type "string", :index "not_analyzed"}}, url "http://localhost:9200"}})
Create a connection to the dataset DB cache.
Parameters
- index-name the name of the Elasticsearch index
Keys
- :create-instances-fn a function that computes the instance set (i.e. parses the utterance) and is invoked by instances-load; this function takes a single argument, which is itself a function used to load utterances into the DB; that function takes the following forms:
  - (fn [instance class-label] …)
  - (fn [id instance class-label] …)
  - (fn [id instance class-label set-type] …)
  where:
  - id the unique identifier of the data point
  - instance the dataset instance (can be an N-deep map)
  - class-label the label of the class (can be nominal, double, integer)
  - set-type either :test, :train or :train-test (all), used to presort the data with divide-by-preset; note that it isn't necessary to call divide-by-preset for the first invocation of instances-load
- :url the URL to the DB (defaults to http://localhost:9200)
- :mapping-type map type name (see the Elasticsearch docs)
- :cache-inst an atom used to cache instances by ID; if given, this retrieves instances from the in-memory map stored in the atom; otherwise it goes to Elasticsearch each time
Example
Create a connection that produces a list of 20 instances:
(defn- create-iter-connection []
(letfn [(load-fn [add-fn]
(doseq [i (range 20)]
(add-fn (str i) (format "inst %d" i) (format "class %d" i))))]
(elasticsearch-connection "tmp" :create-instances-fn load-fn)))
freeze-dataset
(freeze-dataset & {:keys [output-file id-key set-type-key], :or {set-type-key :set-type}})
Distill the current dataset (data and test/train splits) into output-file. See freeze-dataset-to-writer.
freeze-dataset-to-writer
(freeze-dataset-to-writer writer & {:keys [set-type-key]})
Distill the current dataset (data and test/train splits) to writer to be later restored with zensols.dataset.thaw/thaw-connection.
ids
(ids & {:keys [set-type]})
Return all IDs based on the dataset split (see class docs).
Keys
- :set-type is either :train, :test or :train-test (all) and defaults to set-default-set-type, or :train if not set
instance-by-id
(instance-by-id conn id)
(instance-by-id id)
Get a specific instance by its ID.
This returns a map that has the following keys:
- :instance the instance data, which was set with :create-instances-fn in elasticsearch-connection
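For example, a sketch that pulls out the instance data for one data point (the ID "1" is illustrative and a bound connection is assumed):

```clojure
;; look up one data point by its ID and extract the instance data
(-> (instance-by-id "1")
    :instance)
```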
instance-count
(instance-count)
Get the number of total instances in the database. This result is independent of the dataset split state.
instances
(instances & {:keys [set-type include-ids? id-set]})
Return all instance data based on the dataset split (see class docs).
See instance-by-id for the data in each map sequence returned.
Keys
- :set-type is either :train, :test or :train-test (all) and defaults to set-default-set-type, or :train if not set
- :include-ids? if non-nil return keys in the map as well
instances-by-class-label
(instances-by-class-label & {:keys [max-instances type seed], :or {max-instances Integer/MAX_VALUE}})
instances-load
(instances-load & {:keys [recreate-index?], :or {recreate-index? true}})
Parse and load the dataset in the DB.
set-default-connection
(set-default-connection)
(set-default-connection conn)
Set the default connection.
Parameter conn is used in place of what is set with with-connection. This is very convenient and saves typing, but will get clobbered if a with-connection is used further down in the stack frame.
If the parameter is missing, it’s unset.
set-default-set-type
(set-default-set-type set-type)
Set the default bucket (training or testing) to get data.
- :set-type is either :train (default) or :test; see elasticsearch-connection
See ids
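For example, the following sketch makes subsequent calls default to the test bucket:

```clojure
;; after this, (ids) and (instances) return the test bucket
;; unless :set-type is passed explicitly
(set-default-set-type :test)
```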
set-fold
(set-fold fold)
Set the current fold in the dataset split state.
You must call divide-by-fold before calling this.
See the namespace docs for more information.
set-population-use
(set-population-use ratio)
with-connection
macro
(with-connection connection & body)
Execute a body with the form (with-connection connection …)
- connection is created with elasticsearch-connection
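For example, a sketch that scopes dataset calls to one connection; load-fn is assumed to be defined as in the example under elasticsearch-connection, and the index name is illustrative:

```clojure
;; all dataset calls in the body use this connection
(let [conn (elasticsearch-connection "tmp" :create-instances-fn load-fn)]
  (with-connection conn
    (instances-load)
    (instance-count)))
```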
write-dataset
(write-dataset & {:keys [output-file single? instance-fn columns-fn], :or {instance-fn identity, columns-fn (constantly ["Instance"])}})
Write the dataset to a spreadsheet. If the file name ends with .csv a CSV file is written, otherwise an Excel file is written.
Keys
- :output-file where to write the file, which defaults to res/resource-path :analysis-report
- :single? if true then create a single sheet, otherwise the training and testing buckets are split between sheets
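For example, a sketch that writes the training and testing buckets to separate sheets of an Excel file. The file name is illustrative; a java.io.File is assumed for :output-file here, though the accepted type depends on the implementation.

```clojure
(require '[clojure.java.io :as io])

;; the .xlsx extension selects Excel output; :single? false puts the
;; train and test buckets on separate sheets
(write-dataset :output-file (io/file "dataset.xlsx")
               :single? false)
```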