# Movie Review Example

This document describes the [movie review task example], which demonstrates the Zensols DeepNLP framework on a sentiment analysis task using the Stanford movie review corpus. It is highly recommended to first read through the [clickbate example], which covers concepts assumed to be understood in this example. For this reason, only new configuration and concepts are detailed here.

## Corpus

The corpus used for this example is fairly small, so the models train fast. It is the Stanford movie review dataset with the Cornell labels:

* [Stanford movie review]
* [Cornell sentiment polarity]

The corpus is automatically downloaded to the `corpus` directory the first time the model is trained or the batch set is accessed.

## Model Configuration

The model specific configuration is located in the `models` directory. Each model has a file that is given with the `--config` command line option to the [harness.py] entry point Python file and contains configuration that overrides the defaults on a per model basis.

## Data Set and Corpus

This example utilizes much of the [deeplearn API] framework code. The main thrust is to create a [Pandas] data frame, which is then used to provide the natural language text and labels. All features are taken only from the text.

## Application Configuration

Like the [clickbate example], the [app.conf] contains the command line configuration used by the entry point [harness.py] script to invoke the example application. The model and other application configuration is given in the [obj.yml] resource library file. Also like the clickbate example, we will not detail each section that was already covered, since both projects are text classification tasks. Instead, this document focuses on more advanced areas, such as extending the feature creation aspect of the application.

Generally, the term *section* refers to a configuration section (like those described in the [INI format]). However, for the remainder of this document, *section group* refers to a grouping of sections demarcated by two hashes (`##`) in the configuration file, such as `## Install the corpus`.

The [obj.yml] file contains the application specific configuration for reading the corpus files and parsing them into features that are later vectorized. It also contains the model configuration. Each subsection below describes the respective named section group (root YAML node) in the [obj.yml] application configuration file.

### Install the Corpus

As in the [clickbate example], we download two files: the corpus and the labels. The [corpus.py] file provides the `DatasetFactory` class, which merges the Stanford movie review corpus with the Cornell labels and is populated with the install resources to locate the local corpus files. This class is configured as the `dataset_factory` in the [obj.yml] application configuration file; a rough sketch of the merge it performs follows.
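Conceptually, the merge can be thought of as a [Pandas] join of the review text with its polarity label. The sketch below is only an illustration, not the `DatasetFactory` implementation: the `id` join key and the literal rows are hypothetical, while the `sentence` and `polarity` column names match those used later in feature parsing.

```python
import pandas as pd

# hypothetical, simplified stand-ins for the downloaded corpus files
sentences = pd.DataFrame(
    {'id': [1, 2],
     'sentence': ['effective but too-tepid biopic',
                  'a sour little movie']})
labels = pd.DataFrame({'id': [1, 2], 'polarity': ['p', 'n']})

# join the Stanford sentences with the Cornell polarity labels
corpus_df = sentences.merge(labels, on='id')
print(corpus_df)
```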
### Natural Language Parsing

Like in the [clickbate example], we configure a [FeatureDocumentParser]. However, this time we use the one provided by the `MovieReview` project, described further in the next section.

### Feature Creation

Most of the feature creation code comes with the package, and the respective configuration comes with the [feature resource library], which is why the project has only three Python source code files, two resource library files, and three model configuration files.

The `MovieReviewRowStash` (configured by overriding the [feature resource library] `dataframe_stash` section in [obj.yml]) in the [domain.py] source file is used by the framework to get a [Pandas] data frame. It extends [ResourceFeatureDataframeStash], which first downloads the corpus. Afterward, it uses the `DatasetFactory` to create a data frame containing the label and the natural language text used to train and test the model.

The [domain.py] file also defines a `MovieReview` container class that holds the polarity (positive or negative) label for each review. All we need to do is extend [FeatureDocument] and add the label to get the complete domain feature document for our application. This class is used by `MovieReviewFeatureStash`, which handles the work of parsing the text into features and extends [DocumentFeatureStash] to use the parser to create instances of `MovieReview` feature documents with the polarity (positive or negative sentiment) labels.

Note that we could have used the default [LabeledFeatureDocument] from the [classify resource library], but this example shows how to create our own specific labels by overriding a method that invokes the parsing with the text and sets the label from the [Pandas] data frame:

```python
def _parse_document(self, id: int, row: pd.Series) -> MovieReview:
    # text to parse with spaCy
    text = row['sentence']
    # the class label
    polarity = row['polarity']
    return self.vec_manager.parse(text, polarity)
```

See the [deeplearn API] documentation for more on [data frame stashes].

So far, we have defined the base feature class `MovieReview` and the stash that keeps track of its instances, `MovieReviewFeatureStash`. Now we need to extend the [deeplearn API batch] classes. We'll start with the data point class:

```python
@dataclass
class MovieReviewDataPoint(FeatureDocumentDataPoint):
    @property
    def label(self) -> str:
        # the label comes from the parsed review document
        return self.doc.polarity
```

which extends the linguistic specific data point class [FeatureDocumentDataPoint]. There is not much more to this than returning the label from the `MovieReview` instance, which is set as the `doc` attribute on the [FeatureDocumentDataPoint] by `MovieReviewFeatureStash` (via the superclass [DocumentFeatureStash]).

### Vectorization

We still have to configure the polarity labels (`n` and `p`) for the label vectorizer so it knows what nominal values to use during batch processing. We also declare the vectorizer managers to include the language features, the label vectorizer (`classify_label_vectorizer_manager`, provided in the [feature resource library]), and the transformer expanders that allow language features to be concatenated to BERT embeddings.
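As a concrete illustration of what the label vectorizer conceptually does with these nominal values, the following sketch one-hot encodes a polarity label. This is not the framework's API; the `LABELS` tuple and `encode_polarity` function are hypothetical:

```python
import torch

# the nominal values configured for the label vectorizer
LABELS = ('n', 'p')

def encode_polarity(label: str) -> torch.Tensor:
    """One-hot encode a polarity label as a tensor."""
    one_hot = torch.zeros(len(LABELS))
    one_hot[LABELS.index(label)] = 1.
    return one_hot

# 'p' (positive) encodes as tensor([0., 1.])
print(encode_polarity('p'))
```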
### Batch

We configure our custom `MovieReviewDataPoint` class in the `batch_stash` section. The rest of this section is self explanatory if the [clickbate example] has been reviewed.

## Running

The movie review data set example can be run from the command line or as a Jupyter notebook.

### Command Line

Everything can be done with the harness script:

```bash
# get the command line help using a thin wrapper around the framework
./harness.py -h

# the executor tests and trains the model; use it to get the stats used to train
./harness.py info

# print a sample GloVe 50 (default) batch of what the model will get during training
./harness.py info -i batch

# print a sample transformer batch of what the model will get during training
./harness.py info -i batch -c models/transformer-trainable.conf

# train and test the model, switching to the optimized model profile
./harness.py traintest -p

# the model, its (hyper)parameters, metadata and results are stored in a subdirectory of files
./harness.py result

# predict and write the test set to a CSV file
./harness.py -c models/glove50.conf preds

# predict a new sentence ad hoc
./harness.py predtext 'Great movie'
```

Note that the `run.sh` script in the same directory provides a simpler API and more prediction examples as a way of calling the [harness.py] entry point. It also serves as an example of how one might simplify a command line for a specific model.

### Jupyter Notebook

To run the [Jupyter movie notebook]:

1. Install Jupyter: `pip install notebook`
1. Go to the notebook directory: `cd examples/movie/notebook`
1. Start the notebook: `jupyter notebook`
1. Start the execution in the notebook with `Cell > Run All`.

[Pandas]: https://pandas.pydata.org
[Stanford movie review]: https://nlp.stanford.edu/sentiment/
[Cornell sentiment polarity]: https://www.cs.cornell.edu/people/pabo/movie-review-data/
[deeplearn API]: https://plandes.github.io/deeplearn/index.html
[data frame stashes]: https://plandes.github.io/deeplearn/doc/preprocess.html#data-as-a-pandas-data-frame
[deeplearn API batch]: https://plandes.github.io/deeplearn/doc/preprocess.html#batches
[classify resource library]: https://github.com/plandes/deepnlp/blob/master/resources/classify.conf
[INI format]: https://plandes.github.io/util/doc/config.html#ini-format
[movie review task example]: https://github.com/plandes/deepnlp/blob/master/example/movie
[obj.yml]: https://github.com/plandes/deepnlp/blob/master/example/movie/resources/obj.yml
[corpus.py]: https://github.com/plandes/deepnlp/blob/master/example/movie/mr/corpus.py
[domain.py]: https://github.com/plandes/deepnlp/blob/master/example/movie/mr/domain.py
[Jupyter movie notebook]: https://github.com/plandes/deepnlp/blob/master/example/movie/notebook/movie.ipynb
[ResourceFeatureDataframeStash]: https://plandes.github.io/deeplearn/api/zensols.dataframe.html#zensols.dataframe.stash.ResourceFeatureDataframeStash
[DocumentFeatureStash]: ../api/zensols.deepnlp.feature.html#zensols.deepnlp.feature.stash.DocumentFeatureStash
[FeatureDocument]: ../api/zensols.deepnlp.html#zensols.deepnlp.domain.FeatureDocument
[FeatureDocumentParser]: ../api/zensols.deepnlp.html#zensols.deepnlp.parse.FeatureDocumentParser
[FeatureDocumentDataPoint]: ../api/zensols.deepnlp.batch.html#zensols.deepnlp.batch.domain.FeatureDocumentDataPoint
[LabeledFeatureDocument]: ../api/zensols.deepnlp.classify.html#zensols.deepnlp.classify.domain.LabeledFeatureDocument
[clickbate example]: clickbate.md
[harness.py]: https://github.com/plandes/deepnlp/blob/master/example/movie/harness.py
[app.conf]: https://github.com/plandes/deepnlp/blob/master/example/movie/resources/app.conf
[feature resource library]: https://github.com/plandes/deepnlp/blob/master/resources/feature.conf