zensols.deepnlp.classify package

Submodules

zensols.deepnlp.classify.domain module

Domain objects for the natural language text classification task.

class zensols.deepnlp.classify.domain.LabeledFeatureDocument(sents, text=None, spacy_doc=None, softmax_logit=None, label=None, pred=None)[source]

Bases: PredictionFeatureDocument

A feature document with a label, used for text classification.

__init__(sents, text=None, spacy_doc=None, softmax_logit=None, label=None, pred=None)
label: str = None

The document level classification gold label.

pred: str = None

The document level prediction label.

See:

ClassificationPredictionMapper.pred_attribute

write(depth=0, writer=<_io.TextIOWrapper name='<stdout>' mode='w' encoding='utf-8'>)[source]

Write the document and optionally sentence features.

Parameters:
  • n_sents – the number of sentences to write

  • n_tokens – the number of tokens to print across all sentences

  • include_original – whether to include the original text

  • include_normalized – whether to include the normalized text

class zensols.deepnlp.classify.domain.LabeledFeatureDocumentDataPoint(id, batch_stash, container)[source]

Bases: TokenContainerDataPoint

A representation of the data for a review document containing the sentiment polarity as the label.

__init__(id, batch_stash, container)
property label: str

The label for the textual data point.

class zensols.deepnlp.classify.domain.PredictionFeatureDocument(sents, text=None, spacy_doc=None, softmax_logit=None)[source]

Bases: FeatureDocument

A feature document with a label, used for text classification.

__init__(sents, text=None, spacy_doc=None, softmax_logit=None)
softmax_logit: Dict[str, ndarray] = None

The document level softmax of the logits.

See:

ClassificationPredictionMapper.softmax_logit_attribute
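As a rough illustration of what a softmax of the logits contains, the sketch below normalizes raw scores into probabilities that sum to one. It uses only the standard library. The label names and the softmax function are made up for the example, and plain floats stand in for the NumPy arrays the attribute is typed with.

```python
import math
from typing import Dict


def softmax(logits: Dict[str, float]) -> Dict[str, float]:
    """Return label -> probability for raw (hypothetical) logit scores."""
    # subtract the maximum score for numerical stability
    peak = max(logits.values())
    exps = {label: math.exp(score - peak) for label, score in logits.items()}
    total = sum(exps.values())
    return {label: val / total for label, val in exps.items()}


# hypothetical logits for three made-up labels
probs = softmax({'positive': 2.0, 'negative': 0.5, 'neutral': -1.0})
best = max(probs, key=probs.get)
```

The label with the largest logit also has the largest probability, so best is the label a classifier would predict from these scores.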

write(depth=0, writer=<_io.TextIOWrapper name='<stdout>' mode='w' encoding='utf-8'>)[source]

Write the document and optionally sentence features.

Parameters:
  • n_sents – the number of sentences to write

  • n_tokens – the number of tokens to print across all sentences

  • include_original – whether to include the original text

  • include_normalized – whether to include the normalized text

class zensols.deepnlp.classify.domain.TokenContainerDataPoint(id, batch_stash, container)[source]

Bases: DataPoint

A convenience class that uses data, such as tokens, a sentence or a document (TokenContainer) as a data point.

__init__(id, batch_stash, container)
container: TokenContainer

The token container used for this data point.

property doc: FeatureDocument

The container as a document. If it is a sentence, it will create a document with the single sentence. This is usually used by the embeddings vectorizer.
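The wrapping behavior described for doc can be sketched as follows. Sentence, Document and as_document are hypothetical stand-ins written for this example; they are not the library's classes, and the real property may do more than this.

```python
from dataclasses import dataclass
from typing import Tuple, Union


@dataclass
class Sentence:
    """Hypothetical stand-in for a feature sentence."""
    tokens: Tuple[str, ...]


@dataclass
class Document:
    """Hypothetical stand-in for a feature document."""
    sents: Tuple[Sentence, ...]


def as_document(container: Union[Sentence, Document]) -> Document:
    """Return a document as is, or wrap a single sentence in a new one."""
    if isinstance(container, Document):
        return container
    return Document(sents=(container,))


sent = Sentence(tokens=('a', 'short', 'sentence'))
doc = as_document(sent)
```

Code that consumes the result, such as an embeddings vectorizer, can then assume it always receives a document.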

property token_labels: Tuple[Any, ...]

The label that corresponds to each normalized token.

write(depth=0, writer=<_io.TextIOWrapper name='<stdout>' mode='w' encoding='utf-8'>)[source]

Write the contents of this instance to writer using indentation depth.

Parameters:
  • depth (int) – the starting indentation depth

  • writer (TextIOBase) – the writer to dump the content of this writable
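A minimal sketch of the depth and writer parameters, assuming each depth level adds a fixed number of spaces. The Point class and the indent width of four are assumptions for illustration, not the library's implementation.

```python
import io
import sys
from typing import TextIO


class Point:
    """Hypothetical writable object; not part of the library."""

    def __init__(self, id: int, label: str):
        self.id = id
        self.label = label

    def write(self, depth: int = 0, writer: TextIO = sys.stdout,
              indent: int = 4) -> None:
        # each depth level adds ``indent`` spaces (an assumed width)
        pad = ' ' * (depth * indent)
        writer.write(f'{pad}id: {self.id}\n')
        writer.write(f'{pad}label: {self.label}\n')


# capture the output in memory instead of printing to standard out
buf = io.StringIO()
Point(1, 'positive').write(depth=1, writer=buf)
text = buf.getvalue()
```

Passing an in-memory buffer as writer is a convenient way to capture the output in tests or logs.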

zensols.deepnlp.classify.facade module

A facade for simple text classification tasks.

class zensols.deepnlp.classify.facade.ClassifyModelFacade(config, config_factory=<property object>, progress_bar=True, progress_bar_cols='term', executor_name='executor', writer=<_io.TextIOWrapper name='<stdout>' mode='w' encoding='utf-8'>, predictions_dataframe_factory_class=<class 'zensols.deeplearn.result.pred.PredictionsDataFrameFactory'>, model_result_reporter_class=<class 'zensols.deeplearn.result.report.ModelResultReporter'>, result_name=None, suppress_transformer_warnings=True)[source]

Bases: LanguageModelFacade

A facade for text classification. See the super classes for more information on the purpose of this class.

All the set_* methods set parameters in the model.

COUNTS_ATTRIBUTE = 'counts'

The feature counts attribute name.

DEPENDENCIES_ATTRIBUTE = 'dependencies'

The dependency feature attribute name.

DEPENDENCY_EXPANDER_ATTRIBTE = 'transformer_dep_expander'

Expands dependency tree spaCy features to transformer wordpiece alignment.

EMBEDDING_ATTRIBUTES = {'fasttext_crawl_300_embedding', 'fasttext_news_300_embedding', 'glove_300_embedding', 'glove_50_embedding', 'transformer_fixed_embedding', 'transformer_trainable_embedding', 'word2vec_300_embedding'}

All embedding feature section names.

ENUMS_ATTRIBUTE = 'enums'

The enumeration feature attribute name.

ENUM_EXPANDER_ATTRIBUTE = 'transformer_enum_expander'

Expands enumerated spaCy features to transformer wordpiece alignment.

FASTTEXT_CRAWL_300_EMBEDDING = 'fasttext_crawl_300_embedding'

The configuration section name of the fasttext crawl embedding FastTextEmbedModel class.

FASTTEXT_NEWS_300_EMBEDDING = 'fasttext_news_300_embedding'

The configuration section name of the fasttext news embedding FastTextEmbedModel class.

GLOVE_300_EMBEDDING = 'glove_300_embedding'

The configuration section name of the glove embedding GloveWordEmbedModel class.

GLOVE_50_EMBEDDING = 'glove_50_embedding'

The configuration section name of the glove embedding GloveWordEmbedModel class.

LANGUAGE_ATTRIBUTES = {'counts', 'dependencies', 'enums', 'stats', 'transformer_dep_expander', 'transformer_enum_expander'}

All linguistic feature attribute names.

LANGUAGE_FEATURE_MANAGER_NAME = 'language_vectorizer_manager'

The configuration section of the definition of the FeatureDocumentVectorizerManager.

LANGUAGE_MODEL_CONFIG = LanguageModelFacadeConfig(manager_name='language_vectorizer_manager', attribs={'dependencies', 'enums', 'stats', 'transformer_enum_expander', 'transformer_dep_expander', 'counts'}, embedding_attribs={'fasttext_news_300_embedding', 'fasttext_crawl_300_embedding', 'word2vec_300_embedding', 'transformer_trainable_embedding', 'glove_50_embedding', 'transformer_fixed_embedding', 'glove_300_embedding'})

The label model configuration constructed from the batch metadata.

STATS_ATTRIBUTE = 'stats'

The statistics feature attribute name.

TRANSFORMER_FIXED_EMBEDDING = 'transformer_fixed_embedding'

Like TRANSFORMER_TRAINBLE_EMBEDDING, but all layers of the transformer are frozen and only the static embeddings are used.

TRANSFORMER_TRAINBLE_EMBEDDING = 'transformer_trainable_embedding'

The configuration section name of the BERT transformer contextual embedding TransformerEmbedding class.

WORD2VEC_300_EMBEDDING = 'word2vec_300_embedding'

The configuration section name of the Google word2vec embedding Word2VecModel class.
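The attribute name sets can be used to sanity check feature names before handing them to a facade. The sketch below copies the values of LANGUAGE_ATTRIBUTES and EMBEDDING_ATTRIBUTES from the constants documented above; the partition helper is hypothetical and not part of the library.

```python
from typing import Iterable, Set, Tuple

# values copied from the constants documented above
LANGUAGE_ATTRIBUTES = {
    'counts', 'dependencies', 'enums', 'stats',
    'transformer_dep_expander', 'transformer_enum_expander'}
EMBEDDING_ATTRIBUTES = {
    'fasttext_crawl_300_embedding', 'fasttext_news_300_embedding',
    'glove_300_embedding', 'glove_50_embedding',
    'transformer_fixed_embedding', 'transformer_trainable_embedding',
    'word2vec_300_embedding'}


def partition(names: Iterable[str]) -> Tuple[Set[str], Set[str], Set[str]]:
    """Split names into (language, embedding, unknown) sets."""
    names = set(names)
    lang = names & LANGUAGE_ATTRIBUTES
    emb = names & EMBEDDING_ATTRIBUTES
    return lang, emb, names - lang - emb


# the last name is a deliberate misspelling to show the unknown set
lang, emb, unknown = partition(
    ['enums', 'glove_50_embedding', 'glove_5_embedding'])
```

A non-empty unknown set points to a misspelled or unsupported attribute name.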

__init__(config, config_factory=<property object>, progress_bar=True, progress_bar_cols='term', executor_name='executor', writer=<_io.TextIOWrapper name='<stdout>' mode='w' encoding='utf-8'>, predictions_dataframe_factory_class=<class 'zensols.deeplearn.result.pred.PredictionsDataFrameFactory'>, model_result_reporter_class=<class 'zensols.deeplearn.result.report.ModelResultReporter'>, result_name=None, suppress_transformer_warnings=True)
predict(datas)[source]

Make ad-hoc predictions on batches without labels, and return the results.

Parameters:

datas (Iterable[Any]) – the data to predict on, each as a separate element as a data point in a batch

Return type:

Any

class zensols.deepnlp.classify.facade.MultilabelClassifyModelFacade(config, config_factory=<property object>, progress_bar=True, progress_bar_cols='term', executor_name='executor', writer=<_io.TextIOWrapper name='<stdout>' mode='w' encoding='utf-8'>, predictions_dataframe_factory_class=<class 'zensols.deeplearn.result.pred.MultiLabelPredictionsDataFrameFactory'>, model_result_reporter_class=<class 'zensols.deeplearn.result.report.ModelResultReporter'>, result_name=None, suppress_transformer_warnings=True)[source]

Bases: ClassifyModelFacade

A multi-label sentence and document classification facade.

__init__(config, config_factory=<property object>, progress_bar=True, progress_bar_cols='term', executor_name='executor', writer=<_io.TextIOWrapper name='<stdout>' mode='w' encoding='utf-8'>, predictions_dataframe_factory_class=<class 'zensols.deeplearn.result.pred.MultiLabelPredictionsDataFrameFactory'>, model_result_reporter_class=<class 'zensols.deeplearn.result.report.ModelResultReporter'>, result_name=None, suppress_transformer_warnings=True)
predictions_dataframe_factory_class

alias of MultiLabelPredictionsDataFrameFactory

class zensols.deepnlp.classify.facade.TokenClassifyModelFacade(config, config_factory=<property object>, progress_bar=True, progress_bar_cols='term', executor_name='executor', writer=<_io.TextIOWrapper name='<stdout>' mode='w' encoding='utf-8'>, predictions_dataframe_factory_class=<class 'zensols.deeplearn.result.pred.SequencePredictionsDataFrameFactory'>, model_result_reporter_class=<class 'zensols.deeplearn.result.report.ModelResultReporter'>, result_name=None, suppress_transformer_warnings=True)[source]

Bases: ClassifyModelFacade

A token level classification model facade.

__init__(config, config_factory=<property object>, progress_bar=True, progress_bar_cols='term', executor_name='executor', writer=<_io.TextIOWrapper name='<stdout>' mode='w' encoding='utf-8'>, predictions_dataframe_factory_class=<class 'zensols.deeplearn.result.pred.SequencePredictionsDataFrameFactory'>, model_result_reporter_class=<class 'zensols.deeplearn.result.report.ModelResultReporter'>, result_name=None, suppress_transformer_warnings=True)
predictions_dataframe_factory_class

alias of SequencePredictionsDataFrameFactory

zensols.deepnlp.classify.model module

Contains classes that make up a text classification model.

class zensols.deepnlp.classify.model.ClassifyNetwork(net_settings)[source]

Bases: EmbeddingNetworkModule

A model that allows either an RNN or a masked trained transformer model to classify text for document level classification. An RNN should be used when the input is non-contextual word vectors, such as GloVe.

For transformer input, the pooled output (i.e. the [CLS] BERT token) may be used with document level features. Token output (the last transformer layer output) may also be used, but in this case the input must be truncated and padded to the wordpiece size by setting the deepnlp_default:word_piece_token_length resource library configuration.

The RNN should not be set for transformer input, but the linear fully connected terminal output is used for both.
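To make the truncation and padding requirement concrete, the sketch below forces a list of wordpiece IDs to a fixed length. The fit_length function, the padding ID of 0 and the sample IDs are assumptions for illustration; the library performs this step itself based on the configured length.

```python
from typing import List


def fit_length(ids: List[int], length: int, pad_id: int = 0) -> List[int]:
    """Truncate or right-pad a list of wordpiece IDs to exactly ``length``."""
    if len(ids) >= length:
        return ids[:length]
    return ids + [pad_id] * (length - len(ids))


# arbitrary sample IDs; one input is shorter and one longer than the target
short = fit_length([101, 7592, 102], 5)
long = fit_length([101, 7592, 2088, 999, 2003, 102], 5)
```

With every input at the same length, the token level outputs can be stacked into a single fixed-size tensor for the terminal linear layer.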

MODULE_NAME: ClassVar[str] = 'classify'

The module name used in the logging message. This is set in each inherited class.

__init__(net_settings)[source]

Initialize the embedding layer.

Parameters:
  • net_settings (ClassifyNetworkSettings) – the embedding layer configuration

  • logger – the logger to use for the forward process in this layer

  • filter_attrib_fn – if provided, called with a BatchFieldMetadata for each field returning True if the batch field should be retained and used in the embedding layer (see class docs); if None all fields are considered

class zensols.deepnlp.classify.model.ClassifyNetworkSettings(name, config_factory, torch_config, batch_stash, embedding_layer, dropout, recurrent_settings, convolution_settings, linear_settings)[source]

Bases: DropoutNetworkSettings, EmbeddingNetworkSettings

A utility container settings class for convolution network models. This class also updates the recurrent network’s dropout settings when changed.

See:

ClassifyNetwork

__init__(name, config_factory, torch_config, batch_stash, embedding_layer, dropout, recurrent_settings, convolution_settings, linear_settings)
convolution_settings: DeepConvolution1dNetworkSettings

Contains the configuration for the model’s convolution layer(s).

get_module_class_name()[source]

Returns the fully qualified class name of the module to create by ModelManager. This module takes as the first parameter an instance of this class.

Important: This method is not used for nested modules. You must declare specific class names in the configuration for those nested class names.

Return type:

str
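A fully qualified class name is the module path joined to the class name, and it can be resolved back to the class at run time. The sketch below shows the general Python technique with a standard library class; class_name and resolve are hypothetical helpers, and how ModelManager performs the lookup is not shown here.

```python
import importlib
from collections import OrderedDict


def class_name(cls: type) -> str:
    """Return the fully qualified name of a top-level class."""
    return f'{cls.__module__}.{cls.__qualname__}'


def resolve(name: str) -> type:
    """Import and return the class for a fully qualified name."""
    module_name, _, attr = name.rpartition('.')
    return getattr(importlib.import_module(module_name), attr)


name = class_name(OrderedDict)
cls = resolve(name)
```

Returning a string rather than the class itself lets the name be stored in configuration and the module be imported only when it is needed.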

linear_settings: DeepLinearNetworkSettings

Contains the configuration for the model’s terminal layer.

recurrent_settings: RecurrentAggregationNetworkSettings

Contains the configuration for the model’s RNN.

zensols.deepnlp.classify.multilabel module

Classes that enable multi-label classification.

class zensols.deepnlp.classify.multilabel.MultiLabelFeatureDocument(sents, text=None, spacy_doc=None, softmax_logit=None, labels=None, preds=None)[source]

Bases: PredictionFeatureDocument

A feature document with a label, used for text classification.

__init__(sents, text=None, spacy_doc=None, softmax_logit=None, labels=None, preds=None)
labels: Tuple[str, ...] = None

The document level classification gold label.

preds: Tuple[str, ...] = None

The document level prediction label.

See:

ClassificationPredictionMapper.pred_attribute
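One common way to arrive at several predicted labels per document is to score each label independently and keep those above a threshold. Whether this library decides that way is not stated here, so the sketch below is an assumption; select_labels, the threshold of 0.5 and the label names are hypothetical.

```python
import math
from typing import Dict, Tuple


def select_labels(logits: Dict[str, float],
                  threshold: float = 0.5) -> Tuple[str, ...]:
    """Keep every label whose sigmoid probability reaches ``threshold``."""
    def sigmoid(x: float) -> float:
        return 1.0 / (1.0 + math.exp(-x))

    # sort so the result does not depend on dictionary order
    return tuple(sorted(
        label for label, score in logits.items()
        if sigmoid(score) >= threshold))


# hypothetical per-label logits for one document
preds = select_labels({'sports': 1.2, 'politics': -0.7, 'finance': 0.3})
```

Unlike a softmax, the per-label probabilities here do not compete with each other, so any number of labels (including none) can be selected.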

write(depth=0, writer=<_io.TextIOWrapper name='<stdout>' mode='w' encoding='utf-8'>)[source]

Write the document and optionally sentence features.

Parameters:
  • n_sents – the number of sentences to write

  • n_tokens – the number of tokens to print across all sentences

  • include_original – whether to include the original text

  • include_normalized – whether to include the normalized text

class zensols.deepnlp.classify.multilabel.MultiLabelFeatureDocumentDataPoint(id, batch_stash, container)[source]

Bases: TokenContainerDataPoint

A representation of the data for a review document containing the sentiment polarity as the label.

__init__(id, batch_stash, container)
property labels: str

The label for the textual data point.

zensols.deepnlp.classify.pred module

Prediction mapper support for NLP applications.

class zensols.deepnlp.classify.pred.ClassificationPredictionMapper(datas, batch_stash, vec_manager, label_feature_id, pred_attribute='pred', softmax_logit_attribute='softmax_logit')[source]

Bases: PredictionMapper

A prediction mapper for text classification. This mapper works at any level (document, sentence, token).

__init__(datas, batch_stash, vec_manager, label_feature_id, pred_attribute='pred', softmax_logit_attribute='softmax_logit')
label_feature_id: str

The feature ID for the label vectorizer.

property label_vectorizer: CategoryEncodableFeatureVectorizer

The label vectorizer used to map classes in get_classes().

map_results(result)[source]

Map class predictions, logits, and documents generated during use of this instance. Each data point is aggregated across batches.

Return type:

Tuple[LabeledFeatureDocument, ...]

Returns:

a Settings instance with classes, logits and docs attributes

pred_attribute: str = 'pred'

The prediction attribute to set on the FeatureDocument returned from map_results().

softmax_logit_attribute: str = 'softmax_logit'

The softmax of the logits attribute to set on the FeatureDocument returned from map_results().

See:

On Calibration of Modern Neural Networks
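Because pred_attribute and softmax_logit_attribute are plain strings, the mapper can attach its results under whatever names the caller configures. The sketch below shows that idea with setattr; the annotate helper and the SimpleNamespace documents are stand-ins for illustration, not the library's code.

```python
from types import SimpleNamespace
from typing import Dict


def annotate(doc: object, label: str, probs: Dict[str, float],
             pred_attribute: str = 'pred',
             softmax_logit_attribute: str = 'softmax_logit') -> object:
    """Attach a prediction and its probabilities under configurable names."""
    setattr(doc, pred_attribute, label)
    setattr(doc, softmax_logit_attribute, probs)
    return doc


# default attribute names
doc = annotate(SimpleNamespace(text='great film'), 'positive',
               {'positive': 0.9, 'negative': 0.1})
# a caller-chosen name for the prediction attribute
renamed = annotate(SimpleNamespace(text='dull plot'), 'negative',
                   {'positive': 0.2, 'negative': 0.8},
                   pred_attribute='sentiment')
```

With the defaults, the values land on the pred and softmax_logit attributes documented for LabeledFeatureDocument and PredictionFeatureDocument.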

vec_manager: FeatureDocumentVectorizerManager

The vectorizer manager used to parse and get the label vectorizer.

class zensols.deepnlp.classify.pred.SequencePredictionMapper(datas, batch_stash, vec_manager, label_feature_id, pred_attribute='pred', softmax_logit_attribute='softmax_logit')[source]

Bases: ClassificationPredictionMapper

Predicts sequences as a Settings with keys classes as the token level predictions and docs containing the parsed documents from the sentence text.

__init__(datas, batch_stash, vec_manager, label_feature_id, pred_attribute='pred', softmax_logit_attribute='softmax_logit')
map_results(result)[source]

Map class predictions, logits, and documents generated during use of this instance. Each data point is aggregated across batches.

Return type:

Settings

Returns:

a Settings instance with classes, logits and docs attributes
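For token level output, the predictions have to line up with the tokens of each sentence. The sketch below regroups a flat run of labels by sentence length and bundles the result in an object with classes and docs attributes. SimpleNamespace stands in for Settings, and group_by_sentence, the sample sentences and the labels are all hypothetical.

```python
from itertools import islice
from types import SimpleNamespace
from typing import List, Sequence, Tuple


def group_by_sentence(flat: Sequence[str],
                      lengths: Sequence[int]) -> List[Tuple[str, ...]]:
    """Split a flat run of token labels into one tuple per sentence."""
    it = iter(flat)
    return [tuple(islice(it, n)) for n in lengths]


# hypothetical tokenized sentences and flat token level predictions
sents = [['Paris', 'is', 'lovely'], ['Hello']]
flat = ['LOC', 'O', 'O', 'O']
result = SimpleNamespace(
    classes=group_by_sentence(flat, [len(s) for s in sents]),
    docs=sents)
```

Each entry in result.classes then has exactly one label per token of the matching entry in result.docs.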

Module contents