zensols.nlparse.tok-re
This namespace extends the NER system to easily add any regular expression using the Stanford TokensRegex API.
This takes a sequence of regular expressions and entity metadata as input and produces a file format the TokensRegex API consumes to tag entities.
This is an example of the output.
item
(item content label & opts)
Create an item used to create a pattern/line in the Stanford CoreNLP regular expression definition file with a regex created from content and NER label.
The opts parameter accepts the following keys:
- :lem-min-len minimum item utterance length to turn on lemmatization for the last token (default -1), for example:
  - 2: if the string is 2 characters or longer, lemmatize the last token
  - 0: always lemmatize
  - -1: never lemmatize
- :case-min-tok must have at least N tokens to turn on case sensitivity (default -1), for example:
  - 2: if there are 1 or 2 tokens make it case sensitive
  - 1: if there is only one token then make it case sensitive
  - 0: always case sensitive
  - -1: always case insensitive
- :conj-regexp? add and|& regex to match both symbols, defaults to true
- :first-det-chop? chop off ‘the’ at the beginning of the item utterance, defaults to true
- :is-regexp? if true, write the regular expression verbatim instead of generating one from the utterance-like form
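The option descriptions above can be read as small decision rules. The following is a minimal, self-contained sketch of three of them (:lem-min-len, :first-det-chop? and :conj-regexp?) as they are described here. The helper names lemmatize-last?, chop-first-det and conj-alternation are hypothetical and are not part of zensols.nlparse.tok-re; the library's actual implementation and the exact text it emits may differ.

```clojure
(require '[clojure.string :as str])

;; Hypothetical helpers that mirror the documented option semantics.
;; They are illustrative only and not part of zensols.nlparse.tok-re.

(defn lemmatize-last?
  "True when the :lem-min-len rule, as documented, would lemmatize the last
  token: never for negative values, otherwise when the utterance is at least
  lem-min-len characters long."
  [utterance lem-min-len]
  (and (>= lem-min-len 0)
       (>= (count utterance) lem-min-len)))

(defn chop-first-det
  "Drop a leading 'the' (any case), as :first-det-chop? is described."
  [utterance]
  (str/replace utterance #"(?i)^the\s+" ""))

(defn conj-alternation
  "Replace an 'and' or '&' token with an alternation that matches both,
  as :conj-regexp? is described."
  [tokens]
  (map #(if (contains? #{"and" "&"} %) "and|&" %) tokens))

(lemmatize-last? "cats" 2)                    ;; => true
(lemmatize-last? "cats" -1)                   ;; => false
(chop-first-det "The Rolling Stones")         ;; => "Rolling Stones"
(conj-alternation ["Simon" "&" "Garfunkel"])  ;; => ("Simon" "and|&" "Garfunkel")
```

The sketch leaves out :case-min-tok and :is-regexp?, since their effect depends on how the library writes each pattern line.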
write-regex-files
(write-regex-files regex-output-file features-output-file items)
Write all items to the Stanford token regular expression file regex-output-file, and write all possible features to features-output-file.
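A usage sketch that combines item and write-regex-files, based only on the signatures shown on this page. It assumes the library is on the classpath, that opts are passed as keyword arguments after label, and that the output arguments may be given as path strings. The labels and file names are made up for illustration.

```clojure
;; Assumes the library that provides zensols.nlparse.tok-re is on the classpath.
(require '[zensols.nlparse.tok-re :as tr])

;; Build items from utterances and NER labels. Passing options as keyword
;; arguments is an assumption drawn from the (item content label & opts)
;; signature.
(def items
  [(tr/item "the Rolling Stones" "BAND")
   (tr/item "Simon and Garfunkel" "BAND" :case-min-tok 0)])

;; Write the regular expression definitions and the feature list.
;; Both file names are hypothetical.
(tr/write-regex-files "band-regex.txt" "band-features.txt" items)
```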