zensols.nlparse.config

Configure the Stanford CoreNLP parser.

This provides a plugin architecture for natural language processing tasks in a pipeline. A parser takes either a human language utterance or previously annotated data parsed from an utterance.

Parser Libraries

Each parser provides a set of components that make up the pipeline. Each component (e.g. tokenize) is a function that returns a map containing the keys:

  • :component a key whose value is the name of the component to create.
  • :parser a key whose value is the name of the parser it belongs to.

For example, the Stanford CoreNLP word tokenizer has the following return map:

  • :component :tokenize
  • :lang lang-code (e.g. en)
  • :parser :stanford

The map also has additional key/value pairs that represent the remaining configuration given to the parser library used to create its pipeline components. All parse library names (keys) are given in all-parsers.
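As a rough illustration of the map shape described above, the following plain Clojure data sketches what the tokenizer's return map might look like. Holding :lang as a string and the helper component-key are assumptions made for illustration only; the real map carries additional configuration keys.

```clojure
;; Illustrative data only: mirrors the three keys listed above.
;; Assumption: :lang is held as a string; the library may store it differently.
(def tokenize-component
  {:component :tokenize
   :lang "en"
   :parser :stanford})

;; Hypothetical helper (not part of zensols.nlparse.config): identify a
;; component by its parser library name and its component name.
(defn component-key [comp-map]
  [(:parser comp-map) (:component comp-map)])

(component-key tokenize-component)
```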

Use register-library to add your library with the key name of your parser.

Usage

You can create your own custom parser configuration with create-parse-config and then create its respective context with create-context. If you do this, each parse call needs to be made within a with-context lexical context. If you don’t, a default context is created and used for each parse invocation.

Once/if configured, use zensols.nlparse.parse/parse to invoke the parsing pipeline.
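The two usage paths might look like the following sketch. It uses only functions documented on this page plus zensols.nlparse.parse/parse; calling parse with a single utterance string and the example sentence are assumptions, and the required models must be available on the classpath for it to run.

```clojure
(require '[zensols.nlparse.config :as conf :refer [with-context]]
         '[zensols.nlparse.parse :refer [parse]])

;; Path 1: no configuration, so a default context is created and used.
(parse "The quick brown fox jumps over the lazy dog.")

;; Path 2: a custom parse configuration and its context.
(def ctx
  (conf/create-context
   (conf/create-parse-config
    :pipeline [(conf/tokenize)
               (conf/sentence)
               (conf/part-of-speech)
               (conf/morphology)])))

;; Every parse call that should use the custom context goes inside
;; with-context.
(with-context ctx
  (parse "The quick brown fox jumps over the lazy dog."))
```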

all-parsers

All parsers available in this package (jar).

component-documentation

(component-documentation)

Return the component documentation as maps with keys :name and :doc.
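A small sketch of walking the result, assuming the return value is a sequence of such maps:

```clojure
(require '[zensols.nlparse.config :as conf])

;; Assumption: component-documentation returns a sequence of maps,
;; each with the :name and :doc keys described above.
(doseq [m (conf/component-documentation)]
  (println (:name m) "-" (:doc m)))
```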

component-from-config

(component-from-config config name)

Return a component by name from parse config.

component-from-context

(component-from-context context name)

Return a component by name from parse context.

components-as-string

(components-as-string)

Return all available components as a string.

context

(context lib-name)

Return context created with create-context.

See the usage section.

coreference

(coreference)

Create an annotator that adds a coreference tree structure.

create-context

(create-context)
(create-context parse-config & {:keys [timeout-millis]})

Return a context used during parsing. This calls the create functions of all registered (see register-library) parse libraries and returns an object to be used with the parse function zensols.nlparse.parse/parse.

The parameter parse-config is either a parse configuration created with create-parse-config or a string. If a string is given, the pipeline is created from the component names it lists, separated by commas. See zensols.nlparse.config-parse for more information on this DSL.

Passing the output of components-as-string would create all components. However, the easier way to use all components is to call this function with no parameters.

See the usage section.

Keys

  • :timeout-millis number of milliseconds to allow the parser to complete before java.util.concurrent.TimeoutException is thrown or nil for no timeout; no timeout is the default
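A minimal sketch of the string form combined with a timeout. It assumes the DSL component names match the component function names on this page; the 5000 ms value and the example sentence are arbitrary.

```clojure
(require '[zensols.nlparse.config :as conf :refer [with-context]]
         '[zensols.nlparse.parse :refer [parse]])

;; Assumption: the DSL names are the same as the component function names.
(def ctx
  (conf/create-context "tokenize,sentence,part-of-speech"
                       :timeout-millis 5000))

;; The documented exception is raised when the parse exceeds the timeout.
(try
  (with-context ctx
    (parse "The quick brown fox jumps over the lazy dog."))
  (catch java.util.concurrent.TimeoutException e
    (println "parse timed out:" (.getMessage e))))
```

Calling (conf/create-context) with no arguments instead configures all components and no timeout.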

create-parse-config

(create-parse-config & {:keys [parsers only-tokenize? pipeline], :or {parsers all-parsers}})

Create a parse configuration given as input to create-context.

If no keys are given, all components are configured (see components-as-string).

Keys

  • :only-tokenize? create a parse configuration that only utilizes the tokenization of the Stanford CoreNLP library.
  • :pipeline a list of components created with one of the many component create functions (e.g. tokenize) or from a roll-your-own add-on library; this renders the :parsers key unused
  • :parsers a set of parser library names (keys) used to indicate which components to return (e.g. :stanford); see all-parsers
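The keys above could be exercised as in the sketch below. The def names are illustrative, and passing "en" to tokenize as a string is an assumption about the lang-code type.

```clojure
(require '[zensols.nlparse.config :as conf])

;; No keys: all components are configured.
(def all-config (conf/create-parse-config))

;; Tokenization only.
(def tok-config (conf/create-parse-config :only-tokenize? true))

;; Only the components of the :stanford parser library.
(def stanford-config (conf/create-parse-config :parsers #{:stanford}))

;; An explicit pipeline; the :parsers key goes unused in this case.
;; Assumption: the language code is passed as a string.
(def pipeline-config
  (conf/create-parse-config
   :pipeline [(conf/tokenize "en")
              (conf/sentence)]))
```

Any of these values can then be handed to create-context.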

dependency-parse-tree

(dependency-parse-tree)

Create an annotator to create a dependency parse tree.

See the dependencies manual for definitions.

morphology

(morphology)

Create a morphology annotator, which adds the lemmatization of a word. This adds the :lemma keyword to each token.

named-entity-recognizer

(named-entity-recognizer)
(named-entity-recognizer paths)
(named-entity-recognizer paths lang)

Create annotator to do named entity recognition. All models in the paths sequence are loaded. The lang parameter is the language, which can be either ENGLISH or CHINESE and defaults to ENGLISH. See the NERClassifierCombiner Javadoc for more information.

By default, the English CoNLL 4 class model is used. See the Stanford NER documentation for more information.

natural-logic

(natural-logic)

Create a natural logic annotator.

See the Stanford CoreNLP documentation for more information.

parse-functions

(parse-functions)

Return all registered parse functions in the order they are to be called.

See the usage section.

parse-timeout

(parse-timeout)

Return the number of milliseconds to timeout the parse or nil if none.

See create-context.

parse-tree

(parse-tree)
(parse-tree {:keys [include-score? maxtime use-shift-reduce? language], :as conf})

Create annotator to create head and parse trees.

Keys

  • :include-score? true if computed per node accuracy scores are included in parse tree
  • :maxtime the maximum time in milliseconds to wait for the tree parser to complete (per sentence)
  • :use-shift-reduce? if true, use the faster and smaller shift reduce model; the model must be present, and model load time is slower (see the shift reduce doc)
  • :language the parse language model (currently only used for shift reduce), defaults to english
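A sketch of passing these keys as a map to parse-tree inside a pipeline. The particular values and the choice of upstream components are assumptions for illustration.

```clojure
(require '[zensols.nlparse.config :as conf])

;; Assumption: the parse tree annotator is placed after tokenization,
;; sentence splitting and part of speech tagging.
(def tree-config
  (conf/create-parse-config
   :pipeline [(conf/tokenize)
              (conf/sentence)
              (conf/part-of-speech)
              (conf/parse-tree {:include-score? true   ; keep per node scores
                                :maxtime 5000})]))     ; 5 seconds per sentence
```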

part-of-speech

(part-of-speech)
(part-of-speech pos-model-resource)

Create annotator to do part of speech tagging. You can set the model with a resource identified with the pos-model-resource string, which defaults to the English WSJ trained corpus.

print-component-documentation

(print-component-documentation)

Print the formatted component documentation (see component-documentation).

register-library

(register-library lib-name lib-cfg & {:keys [force?]})

Register plugin library lib-name with lib-cfg a map containing:

  • :create-fn a function that takes a parse configuration (see create-parse-config) to create a context later returned with context
  • :reset-fn a function that takes the parse context to null out any atoms or cached data structures; this is called by reset
  • :parse-fn a function that takes a single human language utterance string or the output of another parse library
  • :component-fns all component creating functions from this library

Implementation note: registering a library forces re-creation of the default context (see the usage section), so that create-context is invoked for the newly registered library the next time context is called.
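A minimal sketch of registering an add-on library, using the four keys listed above. Everything named :my-lib or shout is hypothetical, the function bodies are placeholders, and supplying :component-fns as a vector of vars is an assumption that this page does not confirm.

```clojure
(require '[zensols.nlparse.config :as conf])

;; Hypothetical component-creating function for the add-on library.
(defn shout
  "Create a made-up component belonging to the :my-lib parser."
  []
  {:component :shout
   :parser :my-lib})

(conf/register-library
 :my-lib
 {;; takes a parse configuration and returns this library's context
  :create-fn (fn [parse-config] {:config parse-config})
  ;; takes the parse context; nothing is cached in this sketch
  :reset-fn (fn [context] nil)
  ;; takes an utterance string or the output of another parse library;
  ;; the returned map shape is a placeholder, not the real annotation format
  :parse-fn (fn [input]
              (if (string? input)
                {:text input}
                input))
  ;; assumption: component-creating functions are supplied as vars
  :component-fns [#'shout]})
```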

reset

(reset & {:keys [hard?], :or {hard? true}})

Reset the cached data structures and configuration in the default (or currently bound with-context) context. This is also called by zensols.actioncli.dynamic/purge.

semantic-role-labeler

(semantic-role-labeler)
(semantic-role-labeler lang-code)

Create a semantic role labeler annotator. You can configure the language with the lang-code, which is a two-letter language code string and defaults to English.

Keys

  • :lang language used to create the SRL pipeline
  • :model-type model type used to create the SRL pipeline
  • :first-label-token-threshold token minimum position that contains a label to help decide the best SRL labeled sentence to choose.

sentence

(sentence)

Create annotator to group tokens into sentences per configured language.

sentiment

(sentiment)
(sentiment aggregate?)

Create annotator for sentiment analysis. The aggregate? parameter tells the parser to create a top (root) sentiment level score for the entire parse utterance.

stopword

(stopword)

Create annotator to annotate stop words (boolean).

token-regex

(token-regex)
(token-regex paths)

Create an annotator for token regular expressions. You can configure an array of strings identifying either resources or files using the paths parameter, which defaults to token-regex.txt; this file is included in the resources of this package as an example and is used by the test cases.

The :tok-re-resources key is either a sequence of string paths, which creates a single annotator, or a sequence of sequences of string paths. If more than one annotator is created, the output of one annotator can be used in the patterns of the next.

tokenize

(tokenize)
(tokenize lang-code)

Create annotator to split words per configured language. The tokenization language is set with the lang-code parameter, which is a two-letter language code string and defaults to en (English).

with-context

macro

(with-context context & forms)

Use the parser with a context created with create-context. This context is optionally configured. Without this macro, the default context is used as described in the usage section.