zensols.deepnlp.classify package¶
Submodules¶
zensols.deepnlp.classify.domain module¶
Domain objects for the natural language text classification task.
- class zensols.deepnlp.classify.domain.LabeledFeatureDocument(sents, text=None, spacy_doc=None, softmax_logit=None, label=None, pred=None)[source]¶
Bases:
PredictionFeatureDocument
A feature document with a label, used for text classification.
- __init__(sents, text=None, spacy_doc=None, softmax_logit=None, label=None, pred=None)¶
- write(depth=0, writer=<_io.TextIOWrapper name='<stdout>' mode='w' encoding='utf-8'>)[source]¶
Write the document and optionally sentence features.
- Parameters:
n_sents – the number of sentences to write
n_tokens – the number of tokens to print across all sentences
include_original – whether to include the original text
include_normalized – whether to include the normalized text
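Below is a minimal, hedged sketch of inspecting such a document after prediction; the summarize helper is hypothetical and the attribute names are taken from the constructor signature above.
```python
from zensols.deepnlp.classify.domain import LabeledFeatureDocument

def summarize(doc: LabeledFeatureDocument):
    """Hypothetical helper: print the gold label, prediction and logits."""
    # ``label``, ``pred`` and ``softmax_logit`` default to ``None`` in the
    # constructor, so they are populated only for labeled or predicted data
    print(f'label={doc.label} pred={doc.pred} logits={doc.softmax_logit}')
    # dump the document and sentence level features to stdout
    doc.write()
```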
- class zensols.deepnlp.classify.domain.LabeledFeatureDocumentDataPoint(id, batch_stash, container)[source]¶
Bases:
TokenContainerDataPoint
A representation of the data for a review document containing the sentiment polarity as the label.
- __init__(id, batch_stash, container)¶
- class zensols.deepnlp.classify.domain.PredictionFeatureDocument(sents, text=None, spacy_doc=None, softmax_logit=None)[source]¶
Bases:
FeatureDocument
A feature document used for prediction in text classification.
- __init__(sents, text=None, spacy_doc=None, softmax_logit=None)¶
- write(depth=0, writer=<_io.TextIOWrapper name='<stdout>' mode='w' encoding='utf-8'>)[source]¶
Write the document and optionally sentence features.
- Parameters:
n_sents – the number of sentences to write
n_tokens – the number of tokens to print across all sentences
include_original – whether to include the original text
include_normalized – whether to include the normalized text
- class zensols.deepnlp.classify.domain.TokenContainerDataPoint(id, batch_stash, container)[source]¶
Bases:
DataPoint
A convenience class that uses data, such as tokens, a sentence or a document (
TokenContainer
) as a data point.
- __init__(id, batch_stash, container)¶
-
container:
TokenContainer
¶ The token container used for this data point.
- property doc: FeatureDocument¶
The container as a document. If it is a sentence, it will create a document with the single sentence. This is usually used by the embeddings vectorizer.
- write(depth=0, writer=<_io.TextIOWrapper name='<stdout>' mode='w' encoding='utf-8'>)[source]¶
Write the contents of this instance to
writer
using indention
depth
.
- Parameters:
depth (
int
) – the starting indentation depth
writer (
TextIOBase
) – the writer to dump the content of this writable
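As an illustration, a data point created by the framework can be inspected through its container and doc accessors; this is only a sketch and the inspect_point helper is hypothetical.
```python
from zensols.deepnlp.classify.domain import TokenContainerDataPoint

def inspect_point(dp: TokenContainerDataPoint):
    """Hypothetical helper to examine a data point created by a batch stash."""
    # the wrapped sentence or document (a ``TokenContainer``)
    container = dp.container
    # ``doc`` is always a ``FeatureDocument``; a single sentence is wrapped
    # in a one-sentence document, which is what embedding vectorizers use
    doc = dp.doc
    print(f'data point {dp.id}: {container.token_len} token(s), '
          f'{len(doc.sents)} sentence(s)')
    # write the data point's contents with one level of indentation
    dp.write(depth=1)
```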
zensols.deepnlp.classify.facade module¶
A facade for simple text classification tasks.
- class zensols.deepnlp.classify.facade.ClassifyModelFacade(config, config_factory=<property object>, progress_bar=True, progress_bar_cols='term', executor_name='executor', writer=<_io.TextIOWrapper name='<stdout>' mode='w' encoding='utf-8'>, predictions_dataframe_factory_class=<class 'zensols.deeplearn.result.pred.PredictionsDataFrameFactory'>, model_result_reporter_class=<class 'zensols.deeplearn.result.report.ModelResultReporter'>, result_name=None, suppress_transformer_warnings=True)[source]¶
Bases:
LanguageModelFacade
A facade for text classification. See super classes for more information on the purpose of this class.
All the
set_*
methods set parameters in the model.
- COUNTS_ATTRIBUTE = 'counts'¶
The feature counts attribute name.
- DEPENDENCIES_ATTRIBUTE = 'dependencies'¶
The dependency feature attribute name.
- DEPENDENCY_EXPANDER_ATTRIBTE = 'transformer_dep_expander'¶
Expands dependency tree spaCy features to transformer wordpiece alignment.
- EMBEDDING_ATTRIBUTES = {'fasttext_crawl_300_embedding', 'fasttext_news_300_embedding', 'glove_300_embedding', 'glove_50_embedding', 'transformer_fixed_embedding', 'transformer_trainable_embedding', 'word2vec_300_embedding'}¶
All embedding feature section names.
- ENUMS_ATTRIBUTE = 'enums'¶
The enumeration feature attribute name.
- ENUM_EXPANDER_ATTRIBUTE = 'transformer_enum_expander'¶
Expands enumerated spaCy features to transformer wordpiece alignment.
- FASTTEXT_CRAWL_300_EMBEDDING = 'fasttext_crawl_300_embedding'¶
The configuration section name of the fasttext crawl embedding
FastTextEmbedModel
class.
- FASTTEXT_NEWS_300_EMBEDDING = 'fasttext_news_300_embedding'¶
The configuration section name of the fasttext news embedding
FastTextEmbedModel
class.
- GLOVE_300_EMBEDDING = 'glove_300_embedding'¶
The configuration section name of the glove embedding
GloveWordEmbedModel
class.
- GLOVE_50_EMBEDDING = 'glove_50_embedding'¶
The configuration section name of the glove embedding
GloveWordEmbedModel
class.
- LANGUAGE_ATTRIBUTES = {'counts', 'dependencies', 'enums', 'stats', 'transformer_dep_expander', 'transformer_enum_expander'}¶
All linguistic feature attribute names.
- LANGUAGE_FEATURE_MANAGER_NAME = 'language_vectorizer_manager'¶
The configuration section of the definition of the
FeatureDocumentVectorizerManager
.
- LANGUAGE_MODEL_CONFIG = LanguageModelFacadeConfig(manager_name='language_vectorizer_manager', attribs={'dependencies', 'enums', 'stats', 'transformer_enum_expander', 'transformer_dep_expander', 'counts'}, embedding_attribs={'fasttext_news_300_embedding', 'fasttext_crawl_300_embedding', 'word2vec_300_embedding', 'transformer_trainable_embedding', 'glove_50_embedding', 'transformer_fixed_embedding', 'glove_300_embedding'})¶
The language model configuration constructed from the batch metadata.
- STATS_ATTRIBUTE = 'stats'¶
The statistics feature attribute name.
- TRANSFORMER_FIXED_EMBEDDING = 'transformer_fixed_embedding'¶
Like
TRANSFORMER_TRAINBLE_EMBEDDING
, but all layers of the transformer are frozen and only the static embeddings are used.
- TRANSFORMER_TRAINBLE_EMBEDDING = 'transformer_trainable_embedding'¶
The configuration section name of the BERT transformer contextual embedding
TransformerEmbedding
class.
- WORD2VEC_300_EMBEDDING = 'word2vec_300_embedding'¶
The configuration section name of the Google word2vec embedding
Word2VecModel
class.
- __init__(config, config_factory=<property object>, progress_bar=True, progress_bar_cols='term', executor_name='executor', writer=<_io.TextIOWrapper name='<stdout>' mode='w' encoding='utf-8'>, predictions_dataframe_factory_class=<class 'zensols.deeplearn.result.pred.PredictionsDataFrameFactory'>, model_result_reporter_class=<class 'zensols.deeplearn.result.report.ModelResultReporter'>, result_name=None, suppress_transformer_warnings=True)¶
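A hedged sketch of driving the facade programmatically follows; in practice the facade is usually created through the application configuration factory. The 'model.conf' path is a placeholder and the train, test and predict calls are assumed to be the methods inherited from the base deep learning facade.
```python
from zensols.config import ImportIniConfig
from zensols.deepnlp.classify.facade import ClassifyModelFacade

# placeholder path to the application's resource configuration
config = ImportIniConfig('model.conf')
facade = ClassifyModelFacade(config)
facade.train()   # train on the configured training split (assumed inherited API)
facade.test()    # evaluate on the test split (assumed inherited API)
# ad hoc prediction over raw text (assumed inherited API)
docs = facade.predict(['the movie was great'])
```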
- class zensols.deepnlp.classify.facade.MultilabelClassifyModelFacade(config, config_factory=<property object>, progress_bar=True, progress_bar_cols='term', executor_name='executor', writer=<_io.TextIOWrapper name='<stdout>' mode='w' encoding='utf-8'>, predictions_dataframe_factory_class=<class 'zensols.deeplearn.result.pred.MultiLabelPredictionsDataFrameFactory'>, model_result_reporter_class=<class 'zensols.deeplearn.result.report.ModelResultReporter'>, result_name=None, suppress_transformer_warnings=True)[source]¶
Bases:
ClassifyModelFacade
A multi-label sentence and document classification facade.
- __init__(config, config_factory=<property object>, progress_bar=True, progress_bar_cols='term', executor_name='executor', writer=<_io.TextIOWrapper name='<stdout>' mode='w' encoding='utf-8'>, predictions_dataframe_factory_class=<class 'zensols.deeplearn.result.pred.MultiLabelPredictionsDataFrameFactory'>, model_result_reporter_class=<class 'zensols.deeplearn.result.report.ModelResultReporter'>, result_name=None, suppress_transformer_warnings=True)¶
- predictions_dataframe_factory_class¶
alias of
MultiLabelPredictionsDataFrameFactory
- class zensols.deepnlp.classify.facade.TokenClassifyModelFacade(config, config_factory=<property object>, progress_bar=True, progress_bar_cols='term', executor_name='executor', writer=<_io.TextIOWrapper name='<stdout>' mode='w' encoding='utf-8'>, predictions_dataframe_factory_class=<class 'zensols.deeplearn.result.pred.SequencePredictionsDataFrameFactory'>, model_result_reporter_class=<class 'zensols.deeplearn.result.report.ModelResultReporter'>, result_name=None, suppress_transformer_warnings=True)[source]¶
Bases:
ClassifyModelFacade
A token level classification model facade.
- __init__(config, config_factory=<property object>, progress_bar=True, progress_bar_cols='term', executor_name='executor', writer=<_io.TextIOWrapper name='<stdout>' mode='w' encoding='utf-8'>, predictions_dataframe_factory_class=<class 'zensols.deeplearn.result.pred.SequencePredictionsDataFrameFactory'>, model_result_reporter_class=<class 'zensols.deeplearn.result.report.ModelResultReporter'>, result_name=None, suppress_transformer_warnings=True)¶
- predictions_dataframe_factory_class¶
alias of
SequencePredictionsDataFrameFactory
zensols.deepnlp.classify.model module¶
Contains classes that make up a text classification model.
- class zensols.deepnlp.classify.model.ClassifyNetwork(net_settings)[source]¶
Bases:
EmbeddingNetworkModule
A model that uses either an RNN or a masked trained transformer model to classify text at the document level. An RNN should be used when the inputs are non-contextual word vectors, such as GloVe.
For transformer input, either the pooled output (i.e. the
[CLS]
BERT token) may be used with document level features, or the token level output (the last transformer layer) may be used; in the latter case, the input must be truncated and padded to the wordpiece size by setting the
deepnlp_default:word_piece_token_length
resource library configuration.
The RNN should not be set for transformer input, but the linear fully connected terminal output is used for both.
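For example, the wordpiece length mentioned above could be fixed with a resource library override along these lines (the section and option come from the text above; the value 128 is only illustrative):
```ini
[deepnlp_default]
# illustrative value; choose a length that fits the corpus
word_piece_token_length = 128
```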
-
MODULE_NAME:
ClassVar
[str
] = 'classify'¶ The module name used in the logging message. This is set in each inherited class.
- __init__(net_settings)[source]¶
Initialize the embedding layer.
- Parameters:
net_settings (
ClassifyNetworkSettings
) – the embedding layer configuration
logger – the logger to use for the forward process in this layer
filter_attrib_fn – if provided, called with a
BatchFieldMetadata
for each field returning
True
if the batch field should be retained and used in the embedding layer (see class docs); if
None
all fields are considered
- class zensols.deepnlp.classify.model.ClassifyNetworkSettings(name, config_factory, torch_config, batch_stash, embedding_layer, dropout, recurrent_settings, convolution_settings, linear_settings)[source]¶
Bases:
DropoutNetworkSettings
, EmbeddingNetworkSettings
A utility container settings class for convolution network models. This class also updates the recurrent network’s dropout settings when changed.
- See:
ClassifyNetwork
- __init__(name, config_factory, torch_config, batch_stash, embedding_layer, dropout, recurrent_settings, convolution_settings, linear_settings)¶
-
convolution_settings:
DeepConvolution1dNetworkSettings
¶ Contains the configuration for the model’s convolution layer(s).
- get_module_class_name()[source]¶
Returns the fully qualified class name of the module to create by
ModelManager
. This module takes as the first parameter an instance of this class.
Important: This method is not used for nested modules. You must declare specific class names in the configuration for those nested class names.
- Return type:
str
-
linear_settings:
DeepLinearNetworkSettings
¶ Contains the configuration for the model’s terminal layer.
-
recurrent_settings:
RecurrentAggregationNetworkSettings
Contains the configuration for the model’s RNN.
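As a rough illustration, such settings are typically wired together in the application configuration; the section and instance names below are placeholders, the option names mirror the constructor above, and other required options (e.g. the torch configuration and batch stash) are omitted for brevity.
```ini
# hypothetical wiring; section and instance names are placeholders
[classify_net_settings]
class_name = zensols.deepnlp.classify.model.ClassifyNetworkSettings
embedding_layer = instance: glove_50_embedding_layer
recurrent_settings = instance: recurrent_settings
linear_settings = instance: linear_settings
dropout = 0.1
```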
zensols.deepnlp.classify.multilabel module¶
Classes that enable multi-label classification.
- class zensols.deepnlp.classify.multilabel.MultiLabelFeatureDocument(sents, text=None, spacy_doc=None, softmax_logit=None, labels=None, preds=None)[source]¶
Bases:
PredictionFeatureDocument
A feature document with labels, used for multi-label text classification.
- __init__(sents, text=None, spacy_doc=None, softmax_logit=None, labels=None, preds=None)¶
- write(depth=0, writer=<_io.TextIOWrapper name='<stdout>' mode='w' encoding='utf-8'>)[source]¶
Write the document and optionally sentence features.
- Parameters:
n_sents – the number of sentences to write
n_tokens – the number of tokens to print across all sentences
include_original – whether to include the original text
include_normalized – whether to include the normalized text
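A small, hedged sketch of reading the multi-label fields (the report_labels helper is hypothetical; labels and preds come from the constructor above):
```python
from zensols.deepnlp.classify.multilabel import MultiLabelFeatureDocument

def report_labels(doc: MultiLabelFeatureDocument):
    """Hypothetical helper: print the gold and predicted label sets."""
    # both fields are sequences since a document may belong to several classes
    print('gold:', ', '.join(map(str, doc.labels or ())))
    print('pred:', ', '.join(map(str, doc.preds or ())))
```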
- class zensols.deepnlp.classify.multilabel.MultiLabelFeatureDocumentDataPoint(id, batch_stash, container)[source]¶
Bases:
TokenContainerDataPoint
A representation of the data for a review document containing the sentiment polarity as the label.
- __init__(id, batch_stash, container)¶
zensols.deepnlp.classify.pred module¶
Prediction mapper support for NLP applications.
- class zensols.deepnlp.classify.pred.ClassificationPredictionMapper(datas, batch_stash, vec_manager, label_feature_id, pred_attribute='pred', softmax_logit_attribute='softmax_logit')[source]¶
Bases:
PredictionMapper
A prediction mapper for text classification. This mapper works at any level (document, sentence, token).
- __init__(datas, batch_stash, vec_manager, label_feature_id, pred_attribute='pred', softmax_logit_attribute='softmax_logit')¶
- property label_vectorizer: CategoryEncodableFeatureVectorizer¶
The label vectorizer used to map classes in
get_classes()
.
- map_results(result)[source]¶
Map class predictions, logits, and documents generated during use of this instance. Each data point is aggregated across batches.
- Return type:
Settings
- Returns:
a
Settings
instance with
classes
,
logits
and
docs
attributes
-
pred_attribute:
str
= 'pred'¶ The prediction attribute to set on the
FeatureDocument
returned frommap_results()
.
-
softmax_logit_attribute:
str
= 'softmax_logit'¶ The softmax of the logits attribute to set on the
FeatureDocument
returned frommap_results()
.
-
vec_manager:
FeatureDocumentVectorizerManager
¶ The vectorizer manager used to parse and get the label vectorizer.
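A hedged sketch of consuming the object returned by map_results; the attribute names follow the return documentation above, and the results argument is assumed to come from a prediction run.
```python
def show_predictions(results):
    """Hypothetical helper: print each predicted class next to its text."""
    # ``classes`` and ``docs`` are assumed to be parallel sequences, one
    # entry per predicted data point (see the ``map_results`` docs above)
    for cls, doc in zip(results.classes, results.docs):
        print(f'{doc.text!r} -> {cls}')
```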
- class zensols.deepnlp.classify.pred.SequencePredictionMapper(datas, batch_stash, vec_manager, label_feature_id, pred_attribute='pred', softmax_logit_attribute='softmax_logit')[source]¶
Bases:
ClassificationPredictionMapper
Predicts sequences as a
Settings
with keys classes as the token level predictions and docs containing the parsed documents from the sentence text.
- __init__(datas, batch_stash, vec_manager, label_feature_id, pred_attribute='pred', softmax_logit_attribute='softmax_logit')¶