Clickbate Example¶
This example provides a good starting point since it contains only the code needed to parse a corpus. It also shows how to use your own data with your own model by synthesizing positive and negative dataset sources into one, which is done in the only source code file for the project (excluding the entry point harness.py script).
The example shows how to create, train, validate and test a model that determines whether a headline is clickbait (see the corpus for details). It comes with two models: one that uses word vectors (the 50 dimension GloVe and the fastText news pre-trained embeddings) with additional language features, and one that uses BERT word embeddings.
Note that there is quite a bit of inline documentation in the app.conf and obj.yml configuration files, so it is recommended the reader follow along in them while reading this tutorial.
Command Line Interface¶
All of the examples for this package are written as Zensols CLI applications. The entry point program is harness.py. However, the command line is configured in app.conf and the application in obj.yml, which is where almost all of the example is configured. These files configure the file system paths, tell where to load other resource libraries, point to corpus resources, and are also used by the Jupyter notebook example to allow for more rapid prototyping and experimentation.
Because the examples (including this one) use resource libraries, the configuration is much smaller and more manageable. First we start by adding the application defaults, which allow the name option to be overridden with the --override command line option:
[cb_default]
lang_features = dependencies, enums,
embedding = ${name}_embedding
The --override command line option takes a string (or file) containing any configuration as comma delimited <section>.<option>=<value> entries, and is given on the command line to specify which word embeddings to use. For example, --override cb_default.name=glove_50 would specify the 50 dimension GloVe resource library.
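To see how the override and the ${name}_embedding substitution interact, the following is a minimal sketch using Python's configparser extended interpolation, which behaves like the INI interpolation the framework performs (the real loading is done by the Zensols configuration classes, so this is only an illustration):

from configparser import ConfigParser, ExtendedInterpolation

# simulate the cb_default section with a command line override of 'name'
parser = ConfigParser(interpolation=ExtendedInterpolation())
parser.read_string("""
[cb_default]
name = glove_50
lang_features = dependencies, enums,
embedding = ${name}_embedding
""")
# --override cb_default.name=glove_50 sets 'name', which the 'embedding'
# option then expands to 'glove_50_embedding'
print(parser['cb_default']['embedding'])  # -> glove_50_embedding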
Now we add defaults for the deep learning package that set the model name, which appears in results and file system naming. In this example, we simply set the model name to the embeddings we'll use:
[deeplearn_default]
model_name = ${cb_default:embedding}
The following configuration adds the default applications, which are invoked from the command line by the CliHarness defined in the harness.py entry point and are imported from resource libraries as first pass actions:
[cli]
apps = list: ${cli_config_default:apps}, ${cli_deeplearn_default:apps}, ${cli_deepnlp_default:apps},
deepnlp_fac_text_classify_app, cleaner_cli
cleanups = list: ${cli_config_default:cleanups}, ${cli_deeplearn_default:cleanups},
${cli_deepnlp_default:cleanups}, deepnlp_fac_text_classify_app, cleaner_cli
cleanup_removes = set: log_cli
The applications defined in the loaded sections are simply Python dataclasses whose class and method docstrings serve as help for the command line interface. Each method is mapped to an action with positional and optional parameters. A sketch of such a class is given after the note below.
Note the log_cli section is mentioned in cleanup_removes because it is listed as a cleanup in a resource library. However, we must keep this section because it is used to configure child processes when batches are created, keeping the logging configuration consistent.
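As an illustration of that mapping, here is a minimal sketch of a hypothetical application dataclass (not one of the classes actually used by this example): the class docstring becomes the application's help text, and each method becomes an action whose docstring and parameters become that action's help and options:

from dataclasses import dataclass

@dataclass
class CorpusStatsApplication:
    """Report statistics about the clickbait corpus."""

    def stats(self, split_name: str = 'train'):
        """Print the number of headlines in a dataset split.

        :param split_name: the name of the split to report on

        """
        # a real action would use the configured stashes to compute this
        print(f'split: {split_name}')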
We can also configure a default for the --override flag that indicates the word embedding with:
[override_cli_decorator]
option_overrides = dict: {'override': {'default': 'cb_default.name=glove_50'}}
While a user can create a model specific configuration file specified with the --config option (such as in the other examples), this example is so simple as to not need it. For this reason, we make it optional:
[config_cli]
expect = False
The configured actions and their options for the CLI in the cli section described earlier must be imported from their respective resource libraries, which is done with:
[import]
config_files = list:
resource(zensols.util): resources/default.conf,
...
resource(zensols.deepnlp): resources/cleaner.conf
Finally we import the model configuration from the resource libraries with a special section used by the --config option's first pass action. This provides special directives for loading the --override overrides and the configuration file. We reference the default and cb_default sections since they are used in the subordinate configuration files that are loaded:
[config_import]
references = list: default, cb_default
sections = list: app_imp_conf
[app_imp_conf]
type = import
config_files = list:
^{override},
^{config_path},
resource(zensols.deeplearn): resources/default.conf,
...
resource(cb): resources/obj.yml,
^{config_path}
This first loads the overrides given with --override, then the user provided configuration file given with --config (if provided). Then the defaults and model configuration are loaded. Finally the configuration file is loaded again, giving the user the option to override anything clobbered by the resource libraries, since everything loaded is either added or overwritten in order. Nested in this list of resource files is the obj.yml file, which is this application example's specific configuration.
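The "added or overwritten in order" behavior can be illustrated with plain configparser (the framework uses its own import machinery, so this is only a rough analogy): a later source wins for any option it redefines, while options it does not mention are kept.

from configparser import ConfigParser

# rough analogy of layered loading: each read_string call is a later source
config = ConfigParser()
config.read_string("""
[cb_default]
name = glove_50
lang_features = dependencies, enums,
""")
# a later file (for example the --config file read a second time) overwrites
# 'name' but leaves 'lang_features' untouched
config.read_string("""
[cb_default]
name = fasttext_news_300
""")
print(dict(config['cb_default']))
# -> {'name': 'fasttext_news_300', 'lang_features': 'dependencies, enums,'}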
Application Configuration¶
As mentioned in the previous section, the app.conf specifies resource libraries to load, allowing the obj.yml to add to and modify the existing configuration. This file could have been written as an ini file (like app.conf). However, it contains the hierarchical vectorizer to batch mappings, which lend themselves better to a hierarchical data format such as YAML.
Generally, the term section refers to a configuration section (like those described in the INI format). However, for the remainder of this document, section group refers to a grouping of sections demarcated by two hashes (##) in the configuration file, such as ## Install the corpus.
The obj.yml file contains the application specific configuration for reading the corpus files and parsing them into features that will later be vectorized. It also contains the model configuration. All of this is described in the subsections of this document, each named after the respective section group (root YAML nodes) in the obj.yml application configuration file.
Install the Corpus¶
First the configuration defines where to download the corpus and uncompress the files to make them available to the program. The Install the corpus sections tell where the resources are on the Internet and how the files are named, which by default is taken from the URL.
cb_data_resource:
class_name: zensols.install.Resource
url: 'https://github.com/bhargaviparanjape/clickbait/raw/master/dataset/clickbait_data.gz'
non_cb_data_resource:
class_name: zensols.install.Resource
url: 'https://github.com/bhargaviparanjape/clickbait/raw/master/dataset/non_clickbait_data.gz'
feature_installer:
resources: 'instance: list: cb_data_resource, non_cb_data_resource'
The installer has a list of resources it uses to download the files and uncompress them on the file system. The local directory where they are installed is set in the feature resource library.
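For reference, the following is a rough sketch of what the installer automates for each resource (download, then decompress next to the download); the corpus/ target directory is hypothetical since the real install location comes from the feature resource library:

import gzip
import shutil
import urllib.request
from pathlib import Path

URL = ('https://github.com/bhargaviparanjape/clickbait/raw/master/'
       'dataset/clickbait_data.gz')
# hypothetical target; the installer derives the real path from configuration
target = Path('corpus/clickbait_data.gz')
target.parent.mkdir(parents=True, exist_ok=True)

# download the compressed corpus file, taking its name from the URL
urllib.request.urlretrieve(URL, target)

# uncompress it, yielding corpus/clickbait_data
with gzip.open(target, 'rb') as fin, open(target.with_suffix(''), 'wb') as fout:
    shutil.copyfileobj(fin, fout)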
Feature Creation¶
This section contains several sections that describe Stash instances that cache the vectorized batches to the file system (see batch encoding in the paper). This process as it relates to this section includes:
1. Parse the downloaded corpus from the sentence text files (dataframe_stash).
2. Randomly split the dataset into train, validation and test sets, then store the data from the dataframe as a pickled file on the file system (dataframe_stash).
3. Parse the English sentences from the dataframe_stash using spaCy across as many processes as the CPU has cores (feature_factory_stash) and persist them to the file system in directories by feature (feature_dir_stash, found only in the deeplearn resource library).
4. Read only certain files (based on the feature selection for the particular model) from the file system to reconstruct batches (see batch decoding in the paper).
5. Train, validate and test the model using the same ordering and splits sampled by the dataframe_stash from step 1.
A sketch of the first two steps is given below.
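The following is a minimal sketch (not the example's actual code) of what the first two steps amount to: merge the two headline files into a labeled dataframe and randomly assign split names. The column names, the 'y'/'n' labels and the 80/10/10 proportions are assumptions taken from the corpus layout and the output shown later in this document:

import numpy as np
import pandas as pd

def create_dataframe(cb_path: str, non_cb_path: str,
                     seed: int = 0) -> pd.DataFrame:
    """Merge the positive and negative headline files and assign splits."""
    rows = []
    for path, label in ((cb_path, 'y'), (non_cb_path, 'n')):
        with open(path) as f:
            rows.extend((line.strip(), label) for line in f if line.strip())
    df = pd.DataFrame(rows, columns=['text', 'label'])
    # randomly assign 80/10/10 train/validation/test splits
    rng = np.random.default_rng(seed)
    df['split'] = rng.choice(
        ['train', 'validation', 'test'], size=len(df), p=[0.8, 0.1, 0.1])
    return df

df = create_dataframe('corpus/clickbait_data', 'corpus/non_clickbait_data')
print(df.groupby('split').size())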
Natural Language Processing¶
The doc_parser section tells the parser to create instances of a different class than what was defined in its resource library (FeatureDocument), using the classification setup loaded from the classify resource library by app.conf. The class we provide for this example contains an attribute to carry a label for our text classification task.
doc_parser:
doc_class: 'class: zensols.deepnlp.classify.LabeledFeatureDocument'
components: 'instance: list: remove_sent_boundaries_component'
classify_label_vectorizer:
categories: ${dataframe_stash:labels}
The classify_label_vectorizer comes from the feature resource library, which needs the output nominal label names for encoding/vectorization.
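Conceptually, the linguistic features parsed here resemble what plain spaCy produces, with the label attribute added by the classify document class. A sketch with plain spaCy (not the framework's FeatureDocument classes, and assuming the en_core_web_sm model is installed) shows the kind of information behind the dependencies and enums language features:

import spacy

# plain spaCy illustration of the features that later get vectorized; the
# framework wraps these in a LabeledFeatureDocument carrying the label
nlp = spacy.load('en_core_web_sm')
doc = nlp("Can't Wait For Summer? You've Got To See These Pics")
for tok in doc:
    # dependency relations and tag/entity enumerations are among the
    # configured lang_features (dependencies, enums)
    print(tok.text, tok.dep_, tok.tag_, tok.ent_type_)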
Batch¶
The Batch section provides all the configuration necessary to batch the vectorized data into chunks usable by the model. Specifically, the batch_stash section describes how to map the vectorized output to entries in the batches and their grouping. It also gives the default set of attributes to experiment with and the number of sub-process workers to use during batching.
batch_stash:
batch_feature_mappings: 'dataclass(zensols.deeplearn.batch.ConfigBatchFeatureMapping): cb_batch_mappings'
decoded_attributes: 'set: label, ${cb_default:lang_features} ${cb_default:embedding}'
workers: -2
cb_batch_mappings:
batch_feature_mapping_adds:
- 'dataclass(zensols.deeplearn.batch.BatchFeatureMapping): classify_label_batch_mappings'
- 'dataclass(zensols.deeplearn.batch.BatchFeatureMapping): lang_batch_mappings'
field_keep: [label, enums, dependencies, glove_50_embedding, fasttext_news_300_embedding]
The cb_batch_mappings section indicates to reuse the text classification mappings from the classify resource library, and the more general language features (such as spaCy parsed vectorized data) from the lang-batch resource library. The workers: -2 setting says to use all but two cores for the number of sub-processes used for vectorization/batch creation.
Model¶
The executor in the Model section sets net_settings to classify_net_settings to provide the top level text classification for the application using a BiLSTM-CRF. This model is provided in the classify resource library in deepnlp (this project), with little left to specify. The remaining portions of the model that are specified are:
- The dense output layer that connects the LSTM to the CRF and specifies the output label cardinality, which is set to two (either clickbait or not).
- The embedding layer to use, which is string substituted with the cb_default:embedding section/option injected with the overridden (--override) option.
- The section containing the LSTM configuration (recurrent_settings).
- The model settings, which include the model name, learning rate, default epoch count and the component that decodes model output to labels and softmaxes (confidence like scores).
executor:
net_settings: 'instance: classify_net_settings'
linear_settings:
out_features: "eval: '${dataframe_stash:labels}'.count(',') + 1"
classify_net_settings:
embedding_layer: 'instance: ${cb_default:embedding}_layer'
recurrent_settings: 'instance: recurrent_settings'
dropout: 0.2
model_settings:
model_name: 'clickbate: ${cb_default:name}'
learning_rate: 1e-3
epochs: 35
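The out_features expression counts the labels in the comma separated ${dataframe_stash:labels} string. Assuming the labels interpolate to 'y, n' (the labels seen in the prediction output later in this document), the arithmetic works out as follows:

# what the 'eval:' expression computes, assuming the interpolated label
# string is 'y, n' (two nominal labels: clickbait or not)
labels = 'y, n'
out_features = labels.count(',') + 1
print(out_features)  # -> 2, the label cardinality of the dense output layer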
Imported from Resource Libraries¶
Other important components of the application not specified in the obj.yml but present from being imported from resource libraries include:
- The facade class (ClassifyModelFacade) provided in the classify resource library, which is used by a second pass CLI application to predict ad-hoc text.
- Model events (i.e. when training or validation starts/ends) to track model train/test time consumption using an observer pattern from the observer resource library.
- Vectorizer configuration, vectorizer managers and manager sets, which take data (in our case English text) and vectorize it into binary form usable by the model. See the paper for more information.
- A Stash that stratifies each dataset by label, and other components that enable batching, from the feature resource library.
Code¶
All the code for this example is in cb.py, which merges the corpus files, and the harness.py entry point application that invokes the command line interface API.
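For orientation, an entry point of this kind typically looks something like the following minimal sketch; the constructor arguments shown are assumptions based on the framework's usual pattern, so consult the actual harness.py for the exact form:

#!/usr/bin/env python

from zensols.cli import CliHarness

if __name__ == '__main__':
    # hand control to the CLI framework, which reads app.conf (and through it
    # obj.yml and the resource libraries) to build and run the actions
    CliHarness(
        src_dir_name='src',                        # assumed: local source directory
        app_config_resource='resources/app.conf',  # assumed: the CLI configuration
        proto_args='info',                         # assumed: default action when prototyping
    ).run()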
Running¶
The clickbait dataset example can be run from the command line or as a Jupyter notebook.
Command Line¶
Everything can be done with the harness script:
Get the command line help using a thin wrapper around the framework:
$ ./harness.py -h
Usage: harness.py <actions> [options]:
Options:
-h, --help show this help message and exit
--version show the program version and exit
--level X the level to set the application logger, X is one of: debug, err, info, warn
-c, --config FILE the path to the configuration file
--override <FILE|DIR|STRING> cb_default.n... a config file/dir or a comma delimited section.key=value string that overrides
configuration with default cb_default.name=glove_50
Actions:
list list all actions and help
--lstfmt <json|name|text> text the output format for the action listing
...
The executor trains and tests the model; use the info action to get the dataset statistics used for training:
$ ./harness.py info
clickbate: glove_50:
executor:
model: clickbate: glove_50
feature splits:
split stash splits:
train: 25598 (80.0%)
validation: 3201 (10.0%)
test: 3201 (10.0%)
total: 32000
total this instance: 32000
keys consistent: True
...
Print a sample GloVe 50 (default) batch showing what the model will get during training:
$ ./harness.py info -i batch
clickbate: glove_50:
DefaultBatch
size: 200
label: torch.Size([200])
glove_50_embedding: torch.Size([200, 20])
enums: torch.Size([200, 20, 174])
dependencies: torch.Size([200, 20, 1])
...
Train and test the model, but switch to the model profile with optimized settings:
$ ./harness.py traintest -p
2022-06-14 13:55:06,094 resetting executor
...
2022-06-14 13:55:06,947 training model <class 'zensols.deepnlp.classify.model.ClassifyNetwork'> on cpu for 35 epochs using learning rate 0.001
The model, its (hyper)parameters, metadata and results are all stored in a subdirectory on the file system:
$ ./harness.py result
Name: clickbate: glove_50: 1
Run index: 2
Learning rate: 0.001
train:
started: 06/14/2022 13:55:06:957329
...
Predict and write the test set to a CSV file:
$ ./harness.py preds
2022-06-14 13:58:57,186 wrote predictions: clickbate-glove_50.csv
Predict a few ad-hoc sentences:
$ ./harness.py predtext "Can't Wait For Summer? You've Got To See These Pics"
pred=y, logit=0.9994343519210815: Can't Wait For Summer? You've Got To See These Pics
$ ./harness.py predtext "Biden is fired up over inflation."
pred=n, logit=0.9951714277267456: Biden is fired up over inflation.
Note the run.sh script in the same directory provides a simpler API and more prediction examples as a way of calling the harness.py entry point. It also serves as an example of how one might simplify a command line for a specific model.
Jupyter Notebook¶
There is a Jupyter notebook that executes the entire download, train, validate, test and report process for both models. In the notebook directory are the notebook itself, a Python source file mngfac.py (facade manager factory) that “glues” the CLI to the notebook API, and the output of a previous run of the notebook.
The mngfac.py file contains a convenience class used by the notebook to add directories to the Python path, which is useful for debugging when the package isn’t installed. It also has life cycle methods to manage instances of ModelFacade and configure the Jupyter notebook for things such as logging and page width.
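As an illustration only (the manager and factory names below are hypothetical stand-ins for what mngfac.py actually provides, while train and test are ModelFacade methods), a notebook run boils down to creating a facade and driving its life cycle:

# hypothetical sketch of the notebook flow; mngfac.py provides the real glue
from mngfac import JupyterManagerFactory  # hypothetical import

mng = JupyterManagerFactory()()  # hypothetical: set up paths, logging, page width
facade = mng.create_facade()     # hypothetical: build the ModelFacade from app.conf

# train the model, then test it on the held out test set
facade.train()
facade.test()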
To run the Jupyter notebook:
1. Install Jupyter: pip install notebook
2. Go to the notebook directory: cd examples/clickbate/notebook
3. Start the notebook: jupyter notebook
4. Start the execution in the notebook with Cell > Run All.