# NER Example

This document describes the [named entity task example], which demonstrates conditional random fields and other features of the DeepZensols NLP framework.  Before working through this example, please first read through the [movie review example].  The difference between this project and the movie review sentiment example is that it classifies at the token level instead of the sentence level.  For this reason, only the parts that differ from the movie review example are documented.


## Configuration

The [app.conf] is nearly identical to that of the [movie review example], except that it adds `--override` option defaults for the [HuggingFace] transformer model name `bert-base-cased`.  The `app_imp_conf` section maps [YAML] file extensions to [ConditionalYamlConfig] with the `type_map` property.

This example's [obj.yml] file is similar to the [movie review example]'s, but differs in the following ways:

* Corpus resources are defined and downloaded the first time the corpus is accessed.  However, this downloads three separate files that are not compressed.
* This example does not import the [feature resource library], since it is different enough that it is easier to redefine the configuration found in the *Corpus/feature creation* section.
* The *Language parsing* section overrides the [FeatureDocumentParser] to remove all space tokens and empty sentences, to not add named entities (since that is what this project classifies), and to keep only the features parsed from the CoNLL corpus.
* The *Vectorization* section has vectorizers for the CoNLL corpus features and adds them to the language vectorizer manager.
* The *Batch* configuration differs on a per section basis in the following ways:
    * `conll_lang_batch_mappings`: [batch mappings](reslib.html#batch-stash) are added for the CoNLL corpus features.  A transformer specific label must also be added since there is not a one-to-one mapping from tokens to word piece tokens.
    * `ner_batch_mappings`: this transformer label is only used when selected by the embeddings name (`ner_default:name`) given in the aforementioned `--override` configuration.  See [YAML conditionals] for more information on how this if/then/else logic is utilized.  Note that the mappings we add and keep closely resemble those of the [movie review example].
    * `batch_dir_stash`: which features are grouped together also resembles that of the [movie review example].
    * `batch_stash`: we refer to the `ner_batch_mappings` defined earlier, use our own custom [DataPoint] class, set the number of sub-processes to 2 (a memory constraint on large feature sets), and use a mini-batch size of 32 sentences per batch.
* The *Model* section shows how the `executor` is configured with the `net_settings`, which tells the framework which network model to use.  For our example, we configure a BiLSTM-CRF, which is a bi-directional LSTM with a decoding layer connected to a CRF terminal layer.  This network learns sequences of nominal labels, which in our case are the NER tags (see the sketch in the *Code* section below).  The `recurrent_crf_settings` entry contains the configuration for this BiLSTM-CRF.
* The *Transformer* section overrides the resource library configuration to use the vectorizers and feature attributes defined for our application.


## Code

As mentioned, no code is necessary for the model itself since it is already provided in configuration using the framework.
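For intuition only, the following is a minimal sketch of the network the *Model* section configures: a bi-directional LSTM whose decoding layer produces per-token label scores.  This is *not* the framework's implementation; the CRF terminal layer (configured by `recurrent_crf_settings`) is elided and replaced by a per-token argmax, and all names, dimensions and hyperparameters shown (such as `BiLstmTagger`) are hypothetical.

```python
"""Minimal sketch of the BiLSTM portion of a BiLSTM-CRF sequence tagger."""
import torch
from torch import nn, Tensor


class BiLstmTagger(nn.Module):
    """Bi-directional LSTM that emits per-token label scores (emissions)."""

    def __init__(self, embed_dim: int, hidden_dim: int, n_labels: int):
        super().__init__()
        # bi-directional LSTM over the token embeddings
        self.lstm = nn.LSTM(embed_dim, hidden_dim, batch_first=True,
                            bidirectional=True)
        # decoding layer: project the (2 * hidden) LSTM output to label space;
        # in the configured model these emissions feed a CRF terminal layer
        self.decode = nn.Linear(hidden_dim * 2, n_labels)

    def forward(self, emb: Tensor) -> Tensor:
        # emb: (batch size, sequence length, embedding dimension)
        hidden, _ = self.lstm(emb)
        return self.decode(hidden)


if __name__ == '__main__':
    # a batch of 32 sentences of 20 tokens with 50-dimensional embeddings
    emb = torch.rand(32, 20, 50)
    tagger = BiLstmTagger(embed_dim=50, hidden_dim=100, n_labels=9)
    emissions = tagger(emb)
    # without the CRF, a naive decoding is the per-token argmax over labels
    print(emissions.shape, emissions.argmax(dim=-1).shape)
```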
The code that is necessary includes:

* [corpus.py]: parses the [CoNLL 2003 data set] (a simplified, framework-free sketch of this kind of parsing appears at the end of this document)
* [domain.py]: defines the data point class and the overridden prediction mapper that sets the `is_pred` flag
* [app.py]: a small application to demonstrate how to prototype


### Command Line

Everything can be done with the [harness.py] script:

```bash
# get the command line help using a thin wrapper around the framework
./harness.py -h

# the executor trains and tests the model; use this command to get the stats used to train
./harness.py info

# print a sample GloVe 50 (default) batch of what the model will get during training
./harness.py info -i batch

# print a sample transformer batch of what the model will get during training
./harness.py info -i batch -c models/transformer.conf

# train and test the model, but switch to the model profile with optimized settings
./harness.py traintest -p

# the model, its (hyper)parameters, metadata and results are stored in a subdirectory of files
./harness.py result

# predict labels for an ad hoc sentence
./harness.py predtext 'Mozambique and Switzerland will join the UN body responsible for the maintenance of global peace.'
```


### Jupyter Notebook

To run the [Jupyter NER notebook]:

1. Install Jupyter: `pip install notebook`
1. Go to the notebook directory: `cd examples/ner/notebook`
1. Start the notebook: `jupyter notebook`
1. Start the execution in the notebook with `Cell > Run All`.


[CoNLL 2003 data set]: https://aclanthology.org/W03-0419.pdf
[HuggingFace]: https://github.com/huggingface/transformers
[YAML]: https://yaml.org
[Jupyter NER notebook]: https://github.com/plandes/deepnlp/blob/master/example/ner/notebook/ner.ipynb
[named entity task example]: https://github.com/plandes/deepnlp/blob/master/example/ner
[movie review example]: movie-example.html
[YAML conditionals]: https://plandes.github.io/util/doc/config.html#yaml-conditionals
[feature resource library]: https://github.com/plandes/deepnlp/blob/master/resources/feature.conf
[app.conf]: https://github.com/plandes/deepnlp/blob/master/example/ner/resources/app.conf
[obj.yml]: https://github.com/plandes/deepnlp/blob/master/example/ner/resources/obj.yml
[domain.py]: https://github.com/plandes/deepnlp/blob/master/example/ner/ner/domain.py
[app.py]: https://github.com/plandes/deepnlp/blob/master/example/ner/ner/app.py
[corpus.py]: https://github.com/plandes/deepnlp/blob/master/example/ner/ner/corpus.py
[harness.py]: https://github.com/plandes/deepnlp/blob/master/example/ner/harness.py
[FeatureDocumentParser]: ../api/zensols.deepnlp.html#zensols.deepnlp.parse.FeatureDocumentParser
[DataPoint]: https://plandes.github.io/deeplearn/api/zensols.deeplearn.batch.html?highlight=datapoint#zensols.deeplearn.batch.domain.DataPoint
[ConditionalYamlConfig]: https://plandes.github.io/util/api/zensols.config.html#zensols.config.condyaml.ConditionalYamlConfig
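For readers unfamiliar with the corpus format, below is a simplified, framework-free sketch of the kind of parsing [corpus.py] performs.  It is not the example's implementation; the function name, return type and file path are hypothetical.  The English CoNLL 2003 files contain one token per line with four space-separated columns (token, POS tag, chunk tag, NER tag), blank lines between sentences, and `-DOCSTART-` lines between documents.

```python
"""Simplified sketch of parsing a CoNLL 2003 formatted file."""
from pathlib import Path
from typing import List, Tuple

# a sentence is a list of (token, NER tag) pairs
Sentence = List[Tuple[str, str]]


def parse_conll(path: Path) -> List[Sentence]:
    """Parse a CoNLL 2003 file into sentences of (token, NER tag) pairs."""
    sentences: List[Sentence] = []
    current: Sentence = []
    for line in path.read_text(encoding='utf-8').splitlines():
        line = line.strip()
        if not line or line.startswith('-DOCSTART-'):
            # sentence (or document) boundary
            if current:
                sentences.append(current)
                current = []
            continue
        # columns: token, POS tag, chunk tag, NER tag
        token, _pos, _chunk, ner = line.split()
        current.append((token, ner))
    if current:
        sentences.append(current)
    return sentences


if __name__ == '__main__':
    # example path to a CoNLL 2003 formatted file
    for sent in parse_conll(Path('eng.testa'))[:3]:
        print(sent)
```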