# Natural Language Parsing

Before reading this, please read [feature documents]. If you want to jump right in, it is recommended that you at least peruse the simple CLI example.

This framework wraps the [spaCy] framework and creates features. The motivation is to generate features from the parsed text in an object-oriented fashion that is fast and easy to pickle, since many spaCy objects are C data structures.

A secondary use of this package is to provide a simple yet robust way to generate a string stream of tokens using a [TokenNormalizer]. This allows for a configuration-driven way of generating tokens used for downstream feature vectorization such as word vectors, text classification, information retrieval/search, latent semantic indexing, or any task that uses a single string token.

Token streams can be transformed using [TokenMapper] instances. These take the output of a tokenizer and modify it in various ways. Finally, the [MapTokenNormalizer] is a normalizer that uses a list of [TokenMapper]s to first create the token stream and then transform it.

See the [norm package] for token normalizers and mappers.


## Resource Library

The [NLP resource library] contains configuration for a language parser that works for most use cases. However, like any [resource library], importing and overriding is straightforward.

A [TokenNormalizer] is defined in [obj.conf]:
```ini
[filter_token_mapper]
class_name = zensols.nlp.FilterTokenMapper
#remove_stop = True
#remove_punctuation = True
#remove_space = True

[map_filter_token_normalizer]
class_name = zensols.nlp.MapTokenNormalizer
mapper_class_list = list: filter_token_mapper
```

A [SpacyFeatureDocumentParser], which is used to parse text into [spaCy] documents, is configured as:
```ini
[doc_parser]
class_name = zensols.nlp.sparser.SpacyFeatureDocumentParser
lang = en
model_name = ${lang}_core_web_sm
token_normalizer = instance: map_filter_token_normalizer
```
This defines a language resource for English that uses the previously defined token normalizer. Note that the API provides for these two tasks (parsing and token normalization) separately.


## Example

The [application example] consists of a full CLI application that configures and uses a document parser. In the example, the `app.conf` file imports the [NLP resource libraries].

By default, the `parse` action shows all features for all tokens. However, when `--config terse.conf` is added, stop words, punctuation and white space tokens are removed. Similarly, adding `--config lemma.conf` configures the parser to use `lemma_token_mapper`, which uses the lemmas as normalized tokens.

The `detailed` action/method in the `app.py` Python source code file illustrates basic usage of the parser. To get a [feature document], which has all the configured parsed artifacts typically needed in machine learning models, use the [FeatureDocumentParser], which is the base class of [SpacyFeatureDocumentParser]:
```python
doc: FeatureDocument = self.doc_parser(sentence)
```

If you only want a [spaCy] `Doc` instance, use the [FeatureDocumentParser]'s `parse_spacy_doc` method:
```python
doc: Doc = self.doc_parser.parse_spacy_doc(sentence)
```

See the inline documentation/comments in the `app.py` Python file, which explain how to use the API, and the `makefile` to run each example.

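
To make the effect of the terse and lemma configurations more concrete, here is a minimal, framework-free sketch of a mapper chain. It does not use the `zensols.nlp` API: the `Token` class, `filter_mapper`, and `normalize` are hypothetical names invented for illustration, and the token annotations are written by hand rather than produced by [spaCy].

```python
from dataclasses import dataclass
from typing import Callable, Iterable, Iterator, List


@dataclass(frozen=True)
class Token:
    """Hypothetical stand-in for a parsed token (not a spaCy or zensols class)."""
    text: str
    lemma: str
    is_stop: bool = False
    is_punct: bool = False
    is_space: bool = False


# A mapper takes a stream of tokens and returns a (possibly modified) stream.
Mapper = Callable[[Iterable[Token]], Iterator[Token]]


def filter_mapper(tokens: Iterable[Token]) -> Iterator[Token]:
    """Drop stop words, punctuation and white space tokens."""
    return (t for t in tokens
            if not (t.is_stop or t.is_punct or t.is_space))


def normalize(tokens: Iterable[Token], mappers: List[Mapper],
              use_lemma: bool = False) -> List[str]:
    """Apply each mapper in order, then emit one string per remaining token."""
    stream: Iterator[Token] = iter(tokens)
    for mapper in mappers:
        stream = mapper(stream)
    return [t.lemma if use_lemma else t.text for t in stream]


# Hand-annotated tokens for "The dogs were barking."
SENT = [
    Token('The', 'the', is_stop=True),
    Token('dogs', 'dog'),
    Token('were', 'be', is_stop=True),
    Token('barking', 'bark'),
    Token('.', '.', is_punct=True),
]

all_tokens = normalize(SENT, [])                           # no mappers
terse = normalize(SENT, [filter_mapper])                   # filtered surface forms
lemmas = normalize(SENT, [filter_mapper], use_lemma=True)  # filtered lemmas
```

With no mappers every token survives; adding the filter keeps only the content words; switching to lemmas changes what string is emitted for each surviving token. The actual mappers in the [norm package] are configured through the resource library rather than passed in as plain functions.
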
[spaCy]: https://spacy.io
[NLP resource library]: https://github.com/plandes/nlparse/tree/master/resources
[NLP resource libraries]: https://github.com/plandes/nlparse/tree/master/resources
[resource library]: https://plandes.github.io/util/doc/config.html#resource-libraries
[norm package]: ../api/zensols.nlp.html#module-zensols.nlp.norm
[FeatureDocumentParser]: ../api/zensols.nlp.html#zensols.nlp.parser.FeatureDocumentParser
[SpacyFeatureDocumentParser]: ../api/zensols.nlp.html#zensols.nlp.parser.SpacyFeatureDocumentParser
[TokenNormalizer]: ../api/zensols.nlp.html#zensols.nlp.norm.TokenNormalizer
[TokenMapper]: ../api/zensols.nlp.html#zensols.nlp.norm.TokenMapper
[MapTokenNormalizer]: ../api/zensols.nlp.html#zensols.nlp.norm.MapTokenNormalizer
[feature documents]: feature-doc.html
[feature document]: feature-doc.html
[application example]: https://github.com/plandes/nlparse/tree/master/example/config