zensols.nlparse.tok-re

This namespace extends the NER system so that entities can be defined with arbitrary regular expressions through the Stanford TokensRegex API.

It takes a sequence of regular expressions and entity metadata as input and produces a file in the format that the TokensRegex API consumes to tag entities.

This is an example of the output.

item

(item content label & opts)

Create an item that becomes one pattern (line) in the Stanford CoreNLP regular expression definition file, with a regex generated from content and tagged with the NER label.

The opts parameter accepts the following keys:

  • :lem-min-len minimum item utterance length at which lemmatization is turned on for the last token (default -1), for example:
    • 2: if the string is 2 characters or longer, lemmatize the last token
    • 0: always lemmatize
    • -1: never lemmatize
  • :case-min-tok must have at least N tokens to turn on case sensitivity (default -1), for example:
    • 2: if there are 1 or 2 tokens make it case sensitive
    • 1: if there is only one token then make it case sensitive
    • 0: always case sensitive
    • -1: always case insensitive
  • :conj-regexp? add an and|& regex so that either form matches (default true)
  • :first-det-chop? chop off ‘the’ at the beginning of the item utterance (default true)
  • :is-regexp? if true, write the regular expression verbatim instead of generating one from the utterance-like form
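
The following is a minimal sketch of how two of the options above could be read, not the library's implementation. The helpers lemmatize-last-token? and chop-first-det are hypothetical names introduced only for illustration, and the keyword-argument call shape in the comment form is an assumption based on the (item content label & opts) signature.

```clojure
(require '[clojure.string :as str])

(defn lemmatize-last-token?
  "Hypothetical helper: true when the last token of an utterance would be
  lemmatized under the documented :lem-min-len thresholds
  (-1 never, 0 always, N when the string has N or more characters)."
  [utterance lem-min-len]
  (and (>= lem-min-len 0)
       (>= (count utterance) lem-min-len)))

(defn chop-first-det
  "Hypothetical helper: drop a leading 'the' (any case) followed by
  whitespace, as :first-det-chop? is described. The library's actual
  matching rules may differ."
  [utterance]
  (str/replace utterance #"(?i)^the\s+" ""))

(comment
  ;; Assumed call shape; needs zensols.nlparse.tok-re on the classpath.
  (require '[zensols.nlparse.tok-re :as tre])
  (tre/item "the New York Times" "ORGANIZATION"
            :lem-min-len 2
            :first-det-chop? true))
```

The comment form is not evaluated when the block is loaded, so the two helpers can be tried at a REPL without the library present.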

parse-features

(parse-features feature-string)

write-regex-files

(write-regex-files regex-output-file features-output-file items)

Write all items to the Stanford token regular expression file regex-output-file, and write all possible features to features-output-file.