zensols.dataset.db
Preemptively compute a dataset (i.e. features from natural language utterances) and store it in Elasticsearch. This is useful for training, testing, validating and developing machine learning models.
The unit of data is an instance. An instance set (or just instances) makes up the dataset.
The idea is to abstract away Elasticsearch, but that might be a future enhancement. At the moment functions don't carry Elasticsearch artifacts, but they are exposed.
There are three basic ways to use this data:
- Get all instances (i.e. an utterance or a feature set). In this case all data returned from ids is considered training data. This is the default nascent state.
- Split the data into a train and test set (see divide-by-set).
- Use the data as a cross fold validation and iterate folds (see divide-by-fold).
The information used to represent either fold or the test/train split is referred to as the dataset split state and is stored in Elasticsearch under a different mapping-type in the same index as the instances.
See ids for more information.
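As an orientation, the three modes can be sketched as follows. This is a hedged sketch: it assumes a connection built with elasticsearch-connection (e.g. via the create-iter-connection example under elasticsearch-connection) and a running Elasticsearch instance.

```clojure
;; sketch only: create-iter-connection is the example helper shown
;; under elasticsearch-connection; Elasticsearch must be running
(with-connection (create-iter-connection)
  ;; 1. nascent state: everything returned by ids is training data
  (ids)
  ;; 2. test/train split: 75% of instances go to the train bucket
  (divide-by-set 0.75)
  (ids :set-type :test)
  ;; 3. cross fold validation: 10 folds, then select the first fold
  (divide-by-fold 10)
  (set-fold 0))
```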
clear
(clear & {:keys [wipe-persistent?], :or {wipe-persistent? false}})
Clear the in-memory instance data. If key :wipe-persistent? is true, all fold and test/train split data is also cleared.
distribution
(distribution)
Return maps representing the data set distribution by class label. Each element of the returned sequence has the following keys:
- :class-label the class-label of the instances
- :count the number of instances for :class-label
divide-by-fold
(divide-by-fold)
(divide-by-fold folds & {:keys [shuffle?], :or {shuffle? true}})
Divide the data into folds and initialize the current fold in the dataset split state. Using this kind of dataset split is useful for cross fold validation.
- folds number of folds to use, which defaults to 10
See set-fold
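A typical cross fold validation loop might look like the following sketch. Here evaluate-fold is a hypothetical function (not part of this namespace) standing in for the actual train/score step, and a connection is assumed to be bound with with-connection.

```clojure
;; split into 10 folds, then visit each fold in turn;
;; evaluate-fold is hypothetical: it would train on the train bucket
;; and score on the test bucket
(divide-by-fold 10)
(doseq [fold (range 10)]
  (set-fold fold)
  (evaluate-fold (instances :set-type :train)
                 (instances :set-type :test)))
```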
divide-by-preset
(divide-by-preset)
Divide the data into test and training buckets. The respective train/test buckets are dictated by the :set-type label given as a parameter to the :create-instances-fn function, as documented in elasticsearch-connection.
divide-by-set
(divide-by-set)
(divide-by-set train-ratio & {:keys [dist-type shuffle? max-instances seed], :as opts, :or {shuffle? true, dist-type (quote uneven)}})
Divide the dataset into test and training buckets.
- train-ratio the percentage of data in the train bucket, which defaults to 0.5
Keys
- :dist-type one of the following symbols:
  - even: each test/training set has an even distribution by class label
  - uneven: each test/training set has an uneven distribution by class label
- :shuffle? if true then shuffle the set before partitioning, otherwise just update the demarcation boundary
- :filter-fn if given, a filter function that takes a key as input
- :max-instances the maximum number of instances per class
- :seed if given, seed the random number generator, otherwise don't return random documents
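For example, the following sketch (assuming a bound connection) creates an 80/20 split with an even per-class distribution and a reproducible shuffle:

```clojure
;; 80% of instances land in the train bucket; the seed makes the
;; shuffle deterministic across runs
(divide-by-set 0.8 :dist-type 'even :shuffle? true :seed 1)
(count (ids :set-type :train))
```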
elasticsearch-connection
(elasticsearch-connection index-name & {:keys [create-instances-fn population-use set-type url mapping-type-def cache-inst], :or {create-instances-fn identity, population-use 1.0, set-type :train, mapping-type-def {instance-key {:type "nested"}, class-label-key {:type "string", :index "not_analyzed"}}, url "http://localhost:9200"}})
Create a connection to the dataset DB cache.
Parameters
- index-name the name of the Elasticsearch index
Keys
- :create-instances-fn a function that computes the instance set (i.e. parses the utterance) and is invoked by instances-load; this function takes a single argument, which is itself a function used to load utterances into the DB; that function takes the following forms:
  - (fn [instance class-label] …)
  - (fn [id instance class-label] …)
  - (fn [id instance class-label set-type] …)
  where:
  - id the unique identifier of the data point
  - instance the dataset instance (can be an N-deep map)
  - class-label the label of the class (can be nominal, double, integer)
  - set-type either :test, :train or :train-test (all), used to presort the data with divide-by-preset; note that it isn't necessary to call divide-by-preset for the first invocation of instances-load
- :url the URL to the DB (defaults to http://localhost:9200)
- :mapping-type map type name (see the Elasticsearch docs)
- :cache-inst an atom used to cache instances by ID; if given, this retrieves instances from the in-memory map stored in the atom; otherwise it goes to Elasticsearch each time
Example
Create a connection that produces a list of 20 instances:
(defn- create-iter-connection []
(letfn [(load-fn [add-fn]
(doseq [i (range 20)]
(add-fn (str i) (format "inst %d" i) (format "class %d" i))))]
(elasticsearch-connection "tmp" :create-instances-fn load-fn)))
freeze-dataset
(freeze-dataset & {:keys [output-file id-key set-type-key], :or {set-type-key :set-type}})
Distill the current dataset (data and test/train splits) into output-file. See freeze-dataset-to-writer.
freeze-dataset-to-writer
(freeze-dataset-to-writer writer & {:keys [set-type-key]})
Distill the current dataset (data and test/train splits) to writer to be later restored with zensols.dataset.thaw/thaw-connection.
ids
(ids & {:keys [set-type]})
Return all IDs based on the dataset split (see class docs).
Keys
- :set-type is either :train, :test or :train-test (all) and defaults to set-default-set-type, or :train if not set
instance-by-id
(instance-by-id conn id)
(instance-by-id id)
Get a specific instance by its ID.
This returns a map that has the following keys:
- :instance the instance data, which was set with :create-instances-fn in elasticsearch-connection
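For example, a sketch that pulls out the instance data for one data point (the ID "1" is illustrative and a bound connection is assumed):

```clojure
;; look up one data point by its ID and extract the instance data
(-> (instance-by-id "1")
    :instance)
```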
instance-count
(instance-count)
Get the number of total instances in the database. This result is independent of the dataset split state.
instances
(instances & {:keys [set-type include-ids? id-set]})
Return all instance data based on the dataset split (see class docs).
See instance-by-id for the data in each map sequence returned.
Keys
- :set-type is either :train, :test or :train-test (all) and defaults to set-default-set-type, or :train if not set
- :include-ids? if non-nil return keys in the map as well
instances-by-class-label
(instances-by-class-label & {:keys [max-instances type seed], :or {max-instances Integer/MAX_VALUE}})
instances-load
(instances-load & {:keys [recreate-index?], :or {recreate-index? true}})
Parse and load the dataset in the DB.
set-default-connection
(set-default-connection)
(set-default-connection conn)
Set the default connection.
Parameter conn is used in place of what is set with with-connection. This is very convenient and saves typing, but will get clobbered if a with-connection is used further down in the stack frame.
If the parameter is missing, it’s unset.
set-default-set-type
(set-default-set-type set-type)
Set the default bucket (training or testing) to get data.
- :set-type is either :train (default) or :test; see elasticsearch-connection
See ids
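For example, the following sketch makes subsequent calls default to the test bucket:

```clojure
;; after this, (ids) and (instances) return the test bucket
;; unless :set-type is passed explicitly
(set-default-set-type :test)
```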
set-fold
(set-fold fold)
Set the current fold in the dataset split state.
You must call divide-by-fold before calling this.
See the namespace docs for more information.
set-population-use
(set-population-use ratio)
with-connection
macro
(with-connection connection & body)
Execute a body with the form (with-connection connection …)
- connection is created with elasticsearch-connection
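For example, a sketch that scopes dataset calls to one connection; load-fn is assumed to be defined as in the example under elasticsearch-connection, and the index name is illustrative:

```clojure
;; all dataset calls in the body use this connection
(let [conn (elasticsearch-connection "tmp" :create-instances-fn load-fn)]
  (with-connection conn
    (instances-load)
    (instance-count)))
```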
write-dataset
(write-dataset & {:keys [output-file single? instance-fn columns-fn], :or {instance-fn identity, columns-fn (constantly ["Instance"])}})
Write the dataset to a spreadsheet. If the file name ends with .csv a CSV file is written, otherwise an Excel file is written.
Keys
- :output-file where to write the file, which defaults to res/resource-path :analysis-report
- :single? if true then create a single sheet, otherwise the training and testing buckets are split between sheets
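For example, a sketch that writes the training and testing buckets to separate sheets of an Excel file. The file name is illustrative; a java.io.File is assumed for :output-file here, though the accepted type depends on the implementation.

```clojure
(require '[clojure.java.io :as io])

;; the .xlsx extension selects Excel output; :single? false puts the
;; train and test buckets on separate sheets
(write-dataset :output-file (io/file "dataset.xlsx")
               :single? false)
```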