Movie Review Example

This document describes the movie review example, which demonstrates the DeepZensols NLP framework on a sentiment analysis task using the Stanford sentiment analysis corpus. It is highly recommended to first read through the clickbate example, which covers concepts that are assumed to be understood in this example. For this reason, only new configuration and concepts are described here.

Corpus

The corpus used for this example is fairly small, so the models train fast. It is the Stanford movie review dataset combined with the Cornell labels.

The corpus is automatically downloaded to the corpus directory the first time the model is trained or the batch set is accessed.

Model Configuration

The model specific configuration is located in the models directory. Each model has a file that is given with the --config command line option to the harness.py entry point and contains configuration that overrides settings on a per model basis.

Data Set and Corpus

This example utilizes much of the deeplearn API framework code. The main thrust is to create a Pandas data frame, which is then used to provide the natural language text and labels. All features are taken only from the text.
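
To make this concrete, the following is a minimal sketch (not part of the example's source) of the kind of data frame the framework consumes; the polarity and sentence column names follow the _parse_document method shown later in this document:

import pandas as pd

# a hypothetical two row data frame with the label ('polarity') and
# the natural language text ('sentence') columns
df = pd.DataFrame([
    {'polarity': 'p', 'sentence': 'A deeply moving film.'},
    {'polarity': 'n', 'sentence': 'Two hours I will never get back.'}])
print(df)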

Application Configuration

Like the clickbate example, the app.conf file contains the command line configuration used by the entry point harness.py script to invoke the example application. The model and other application configuration is given in the obj.yml resource library file. Also like the clickbate example, both projects are text classification, so sections already covered there are not detailed again. Instead, this document focuses on more advanced areas, such as extending the feature creation aspect of the application.

Generally, the term section refers to a configuration section (like those described in the INI format). However, for the remainder of this document, section group refers to a grouping of sections demarcated by two hashes (##) in the configuration file, such as ## Install the corpus.

The obj.yml file contains the application specific configuration for reading the corpus files and parsing them into features that will later be vectorized. It also contains the model configuration. All of this is described in the sub sections of this document, each named for its respective group section (root YAML node) in the obj.yml application configuration file.
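
For orientation, here is a minimal sketch (assuming PyYAML is installed and the file is read from the example's directory) that lists those root YAML nodes:

import yaml

# print the named group sections (root YAML nodes) of the application
# configuration file
with open('obj.yml') as f:
    config = yaml.safe_load(f)
print(list(config.keys()))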

Install the Corpus

As in the clickbate example, we download several files: the corpus and the labels. The corpus.py file provides the DatasetFactory class, which merges the Stanford movie review corpus with the Cornell labels and is populated with the install resources used to locate the local corpus files. This class is configured as the dataset_factory in the obj.yml application configuration file.
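
The merge itself amounts to joining each review's text with its label. The following hypothetical sketch (the column and key names are illustrative, not the corpus's actual schema) shows the idea with Pandas:

import pandas as pd

# hypothetical frames standing in for the Stanford review sentences
# and the Cornell polarity labels, joined on a shared review identifier
sentences = pd.DataFrame({'id': [1, 2],
                          'sentence': ['Great movie', 'Awful plot']})
labels = pd.DataFrame({'id': [1, 2], 'polarity': ['p', 'n']})
print(sentences.merge(labels, on='id'))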

Natural Language Parsing

As in the clickbate example, we configure a FeatureDocumentParser. However, this time we use the one provided by the MovieReview project, which is described in more detail in the next section.
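
Under the hood, the parsing is done with spaCy (see the _parse_document method later in this document). Here is a minimal sketch of that kind of parse, assuming the small English spaCy model is installed (python -m spacy download en_core_web_sm); the FeatureDocumentParser wraps such a parse and adds the framework's feature extraction:

import spacy

# parse a short review and print token level features; the framework's
# parser builds FeatureDocument instances on top of this kind of output
nlp = spacy.load('en_core_web_sm')
doc = nlp('Great movie')
print([(tok.text, tok.pos_, tok.lemma_) for tok in doc])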

Feature Creation

Most of the feature creation code comes with the package, and the respective configuration comes with the feature resource library, which is why the project has only three Python source code files, two resource library files, and three model configuration files.

The MovieReviewRowStash (configured by overriding the feature resource library's dataframe_stash section in obj.yml) in the domain.py source file is used by the framework to get a Pandas data frame, and it inherits from ResourceFeatureDataframeStash to first download the corpus. Afterward, it uses the DatasetFactory to create a data frame containing the label and the natural language text used to train and test the model.

The domain.py file also defines a MovieReview container class that holds the polarity (the positive or negative feedback) for each review. All we need to do is extend FeatureDocument and add the label to get the complete domain feature document for our application. This class is used by MovieReviewFeatureStash, which handles the work of parsing the text into features and extends DocumentFeatureStash to use the parser to create instances of MovieReview feature documents with the polarity (positive or negative sentiment) labels. Note that we could have used the default LabeledFeatureDocument from the classify resource library, but this example shows how to create our own specific labels by overriding a method that invokes the parsing with the text and sets the label from the Pandas data frame:

def _parse_document(self, id: int, row: pd.Series) -> MovieReview:
    # text to parse with spaCy
    text = row['sentence']
    # the class label
    polarity = row['polarity']
    return self.vec_manager.parse(text, polarity)

See the deeplearn API documentation for more on data frame stashes.

So far we've defined the base feature class MovieReview and the stash that keeps track of its instances, MovieReviewFeatureStash. Now we need to extend the deeplearn API batch classes. We'll start with the data point class:

from dataclasses import dataclass
# import path per the deepnlp framework; may vary by version
from zensols.deepnlp.batch import FeatureDocumentDataPoint

@dataclass
class MovieReviewDataPoint(FeatureDocumentDataPoint):
    @property
    def label(self) -> str:
        # the polarity label set on the MovieReview document
        return self.doc.polarity

which extends from the linguistic specific data point class FeatureDocumentDataPoint. There's not much more to this than returning the label from the MovieReview instance, which is set as the doc attribute on the data point by MovieReviewFeatureStash (via its superclass DocumentFeatureStash).

Vectorization

We still have to configure the polarity labels (n and p) for the label vectorizer so it knows what to use for nominal values during batch processing. We also declare the vectorizer managers to include the language features, the label vectorizer (classify_label_vectorizer_manager, provided in the feature resource library), and the transformer expanders that allow language features to be concatenated to BERT embeddings.
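
To illustrate what a label vectorizer does with those nominal values, here is a minimal sketch (not the framework's actual implementation) of encoding the n and p polarity labels as integer indexes for a batch:

import torch

# map the nominal polarity labels to integer indexes, then encode a
# hypothetical batch of labels as a tensor
labels = ['n', 'p']
index = {lb: ix for ix, lb in enumerate(labels)}
batch = ['p', 'n', 'p']
print(torch.tensor([index[lb] for lb in batch]))  # tensor([1, 0, 1])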

Batch

We configure our custom MovieReviewDataPoint class in the batch_stash section. The rest of this section is self explanatory if the clickbate example has been reviewed.

Running

The movie review data set example can be run from the command line or as a Jupyter notebook.

Command Line

Everything can be done with the harness script:

# get the command line help using a thin wrapper around the framework
./harness.py -h
# the executor tests and trains the model; use it to get the stats used to train
./harness.py info
# print a sample Glove 50 (default) batch of what the model will get during training
./harness.py info -i batch
# print a sample transformer batch of what the model will get during training
./harness.py info -i batch -c models/transformer-trainable.conf 
# train and test the model, switching to the optimized model profile
./harness.py traintest -p
# the model, its (hyper)parameters, metadata and results are stored in a subdirectory of files
./harness.py result
# predict and write the test set to a CSV file
./harness.py -c models/glove50.conf preds
# make an ad hoc prediction on a new sentence
./harness.py predtext 'Great movie'

Note that the run.sh script in the same directory provides a simpler API and more prediction examples as a way of calling the harness.py entry point. It also serves as an example of how one might simplify the command line for a specific model.

Jupyter Notebook

To run the Jupyter movie notebook:

  1. Install Jupyter: pip install notebook

  2. Go to the notebook directory: cd examples/movie/notebook

  3. Start the notebook: jupyter notebook

  4. Start the execution in the notebook with Cell > Run All.