zensols.deepnlp.classify package¶
Submodules¶
zensols.deepnlp.classify.domain module¶
Domain objects for the natural language text classification task.
- class zensols.deepnlp.classify.domain.LabeledFeatureDocument(sents, text=None, spacy_doc=None, softmax_logit=None, label=None, pred=None)[source]¶
- Bases: - PredictionFeatureDocument- A feature document with a label, used for text classification. - __init__(sents, text=None, spacy_doc=None, softmax_logit=None, label=None, pred=None)¶
 - write(depth=0, writer=<_io.TextIOWrapper name='<stdout>' mode='w' encoding='utf-8'>)[source]¶
- Write the document and optionally sentence features. - Parameters:
- n_sents – the number of sentences to write 
- n_tokens – the number of tokens to print across all sentences 
- include_original – whether to include the original text 
- include_normalized – whether to include the normalized text 
 
 
 
- class zensols.deepnlp.classify.domain.LabeledFeatureDocumentDataPoint(id, batch_stash, container)[source]¶
- Bases: - TokenContainerDataPoint- A representation of the data for a review document containing the sentiment polarity as the label. - __init__(id, batch_stash, container)¶
 
- class zensols.deepnlp.classify.domain.PredictionFeatureDocument(sents, text=None, spacy_doc=None, softmax_logit=None)[source]¶
- Bases: - FeatureDocument- A feature document with a label, used for text classification. - __init__(sents, text=None, spacy_doc=None, softmax_logit=None)¶
 - write(depth=0, writer=<_io.TextIOWrapper name='<stdout>' mode='w' encoding='utf-8'>)[source]¶
- Write the document and optionally sentence features. - Parameters:
- n_sents – the number of sentences to write 
- n_tokens – the number of tokens to print across all sentences 
- include_original – whether to include the original text 
- include_normalized – whether to include the normalized text 
 
 
 
- class zensols.deepnlp.classify.domain.TokenContainerDataPoint(id, batch_stash, container)[source]¶
- Bases: - DataPoint- A convenience class that uses data, such as tokens, a sentence or a document (TokenContainer) as a data point. - __init__(id, batch_stash, container)¶
 - 
container: TokenContainer¶
- The token container used for this data point. 
 - property doc: FeatureDocument¶
- The container as a document. If it is a sentence, it will create a document with the single sentence. This is usually used by the embeddings vectorizer. 
 - write(depth=0, writer=<_io.TextIOWrapper name='<stdout>' mode='w' encoding='utf-8'>)[source]¶
- Write the contents of this instance to writer using indentation depth. - Parameters:
- depth (int) – the starting indentation depth
- writer (TextIOBase) – the writer to dump the content of this writable
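
The behavior described for the doc property can be sketched with stand-in classes. Sentence, Document and as_document below are hypothetical names used only for illustration and are not the zensols API; the sketch assumes nothing beyond what the description states, namely that a lone sentence is wrapped in a single-sentence document.

```python
from dataclasses import dataclass
from typing import List, Tuple, Union


@dataclass
class Sentence:
    """Hypothetical stand-in for a sentence-level token container."""
    tokens: Tuple[str, ...]


@dataclass
class Document:
    """Hypothetical stand-in for a document-level token container."""
    sents: List[Sentence]


def as_document(container: Union[Sentence, Document]) -> Document:
    """Return the container as a document, wrapping a lone sentence."""
    if isinstance(container, Document):
        return container
    # a sentence becomes a document holding that single sentence
    return Document(sents=[container])


sent = Sentence(tokens=("a", "short", "review"))
doc = as_document(sent)
print(len(doc.sents))
```

Passing an existing document returns it unchanged, so downstream code such as a vectorizer can always assume document-shaped input.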
 
 
 
zensols.deepnlp.classify.facade module¶
A facade for simple text classification tasks.
- class zensols.deepnlp.classify.facade.ClassifyModelFacade(config, config_factory=<property object>, progress_bar=True, progress_bar_cols='term', executor_name='executor', writer=<_io.TextIOWrapper name='<stdout>' mode='w' encoding='utf-8'>, predictions_dataframe_factory_class=<class 'zensols.deeplearn.result.pred.PredictionsDataFrameFactory'>, model_result_reporter_class=<class 'zensols.deeplearn.result.report.ModelResultReporter'>, result_name=None, suppress_transformer_warnings=True)[source]¶
- Bases: - LanguageModelFacade- A facade for the text classification. See super classes for more information on the purpose of this class. - All the set_* methods set parameters in the model. - COUNTS_ATTRIBUTE = 'counts'¶
- The feature counts attribute name. 
 - DEPENDENCIES_ATTRIBUTE = 'dependencies'¶
- The dependency feature attribute name. 
 - DEPENDENCY_EXPANDER_ATTRIBTE = 'transformer_dep_expander'¶
- Expands dependency tree spaCy features to transformer wordpiece alignment. 
 - EMBEDDING_ATTRIBUTES = {'fasttext_crawl_300_embedding', 'fasttext_news_300_embedding', 'glove_300_embedding', 'glove_50_embedding', 'transformer_fixed_embedding', 'transformer_trainable_embedding', 'word2vec_300_embedding'}¶
- All embedding feature section names. 
 - ENUMS_ATTRIBUTE = 'enums'¶
- The enumeration feature attribute name. 
 - ENUM_EXPANDER_ATTRIBUTE = 'transformer_enum_expander'¶
- Expands enumerated spaCy features to transformer wordpiece alignment. 
 - FASTTEXT_CRAWL_300_EMBEDDING = 'fasttext_crawl_300_embedding'¶
- The configuration section name of the fasttext crawl embedding FastTextEmbedModel class.
 - FASTTEXT_NEWS_300_EMBEDDING = 'fasttext_news_300_embedding'¶
- The configuration section name of the fasttext news embedding FastTextEmbedModel class.
 - GLOVE_300_EMBEDDING = 'glove_300_embedding'¶
- The configuration section name of the glove embedding GloveWordEmbedModel class.
 - GLOVE_50_EMBEDDING = 'glove_50_embedding'¶
- The configuration section name of the glove embedding GloveWordEmbedModel class.
 - LANGUAGE_ATTRIBUTES = {'counts', 'dependencies', 'enums', 'stats', 'transformer_dep_expander', 'transformer_enum_expander'}¶
- All linguistic feature attribute names. 
 - LANGUAGE_FEATURE_MANAGER_NAME = 'language_vectorizer_manager'¶
- The configuration section of the definition of the FeatureDocumentVectorizerManager.
 - LANGUAGE_MODEL_CONFIG = LanguageModelFacadeConfig(manager_name='language_vectorizer_manager', attribs={'transformer_enum_expander', 'dependencies', 'transformer_dep_expander', 'stats', 'counts', 'enums'}, embedding_attribs={'transformer_fixed_embedding', 'glove_300_embedding', 'word2vec_300_embedding', 'fasttext_crawl_300_embedding', 'glove_50_embedding', 'fasttext_news_300_embedding', 'transformer_trainable_embedding'})¶
- The label model configuration constructed from the batch metadata. 
 - STATS_ATTRIBUTE = 'stats'¶
- The statistics feature attribute name. 
 - TRANSFORMER_FIXED_EMBEDDING = 'transformer_fixed_embedding'¶
- Like TRANSFORMER_TRAINBLE_EMBEDDING, but all layers of the transformer are frozen and only the static embeddings are used.
 - TRANSFORMER_TRAINBLE_EMBEDDING = 'transformer_trainable_embedding'¶
- The configuration section name of the BERT transformer contextual embedding TransformerEmbedding class.
 - WORD2VEC_300_EMBEDDING = 'word2vec_300_embedding'¶
- The configuration section name of the Google word2vec embedding Word2VecModel class.
 - __init__(config, config_factory=<property object>, progress_bar=True, progress_bar_cols='term', executor_name='executor', writer=<_io.TextIOWrapper name='<stdout>' mode='w' encoding='utf-8'>, predictions_dataframe_factory_class=<class 'zensols.deeplearn.result.pred.PredictionsDataFrameFactory'>, model_result_reporter_class=<class 'zensols.deeplearn.result.report.ModelResultReporter'>, result_name=None, suppress_transformer_warnings=True)¶
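
As a rough illustration of how the attribute name sets above relate, the sketch below copies their documented values into local variables and partitions a list of names into embedding, linguistic and unrecognized groups. The helper split_attributes is hypothetical and is not part of the facade; it only shows one way the two sets might be used to validate names before configuring a model.

```python
# Values copied from the constants documented above; these are local
# variables for the sketch, not imports from the library.
EMBEDDING_ATTRIBUTES = {
    'fasttext_crawl_300_embedding', 'fasttext_news_300_embedding',
    'glove_300_embedding', 'glove_50_embedding',
    'transformer_fixed_embedding', 'transformer_trainable_embedding',
    'word2vec_300_embedding'}
LANGUAGE_ATTRIBUTES = {
    'counts', 'dependencies', 'enums', 'stats',
    'transformer_dep_expander', 'transformer_enum_expander'}


def split_attributes(names):
    """Partition attribute names into embedding, linguistic and unknown."""
    names = set(names)
    embedding = names & EMBEDDING_ATTRIBUTES
    language = names & LANGUAGE_ATTRIBUTES
    unknown = names - embedding - language
    return embedding, language, unknown


embedding, language, unknown = split_attributes(
    ['glove_50_embedding', 'enums', 'stats', 'typo_embedding'])
print(sorted(embedding), sorted(language), sorted(unknown))
```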
 
- class zensols.deepnlp.classify.facade.MultilabelClassifyModelFacade(config, config_factory=<property object>, progress_bar=True, progress_bar_cols='term', executor_name='executor', writer=<_io.TextIOWrapper name='<stdout>' mode='w' encoding='utf-8'>, predictions_dataframe_factory_class=<class 'zensols.deeplearn.result.pred.MultiLabelPredictionsDataFrameFactory'>, model_result_reporter_class=<class 'zensols.deeplearn.result.report.ModelResultReporter'>, result_name=None, suppress_transformer_warnings=True)[source]¶
- Bases: - ClassifyModelFacade- A multi-label sentence and document classification facade. - __init__(config, config_factory=<property object>, progress_bar=True, progress_bar_cols='term', executor_name='executor', writer=<_io.TextIOWrapper name='<stdout>' mode='w' encoding='utf-8'>, predictions_dataframe_factory_class=<class 'zensols.deeplearn.result.pred.MultiLabelPredictionsDataFrameFactory'>, model_result_reporter_class=<class 'zensols.deeplearn.result.report.ModelResultReporter'>, result_name=None, suppress_transformer_warnings=True)¶
 - predictions_dataframe_factory_class¶
 
- class zensols.deepnlp.classify.facade.TokenClassifyModelFacade(config, config_factory=<property object>, progress_bar=True, progress_bar_cols='term', executor_name='executor', writer=<_io.TextIOWrapper name='<stdout>' mode='w' encoding='utf-8'>, predictions_dataframe_factory_class=<class 'zensols.deeplearn.result.pred.SequencePredictionsDataFrameFactory'>, model_result_reporter_class=<class 'zensols.deeplearn.result.report.ModelResultReporter'>, result_name=None, suppress_transformer_warnings=True)[source]¶
- Bases: - ClassifyModelFacade- A token level classification model facade. - __init__(config, config_factory=<property object>, progress_bar=True, progress_bar_cols='term', executor_name='executor', writer=<_io.TextIOWrapper name='<stdout>' mode='w' encoding='utf-8'>, predictions_dataframe_factory_class=<class 'zensols.deeplearn.result.pred.SequencePredictionsDataFrameFactory'>, model_result_reporter_class=<class 'zensols.deeplearn.result.report.ModelResultReporter'>, result_name=None, suppress_transformer_warnings=True)¶
 - predictions_dataframe_factory_class¶
- alias of - SequencePredictionsDataFrameFactory
 
zensols.deepnlp.classify.model module¶
Contains classes that make up a text classification model.
- class zensols.deepnlp.classify.model.ClassifyNetwork(net_settings)[source]¶
- Bases: - EmbeddingNetworkModule- A model that allows either an RNN or a masked trained transformer model to classify text for document level classification. An RNN should be used when the input is non-contextual word vectors, such as GloVe. - For transformer input, the pooled output (i.e. the [CLS] BERT token) may be used with document level features. Token output (the last transformer layer output) may also be used, but in this case the input must be truncated and padded to the wordpiece size by setting the deepnlp_default:word_piece_token_length resource library configuration. - The RNN should not be set for transformer input, but the linear fully connected terminal output is used for both. - 
MODULE_NAME: ClassVar[str] = 'classify'¶
- The module name used in the logging message. This is set in each inherited class. 
 - __init__(net_settings)[source]¶
- Initialize the embedding layer. - Parameters:
- net_settings ( - ClassifyNetworkSettings) – the embedding layer configuration
- logger – the logger to use for the forward process in this layer 
- filter_attrib_fn – if provided, called with a BatchFieldMetadata for each field, returning True if the batch field should be retained and used in the embedding layer (see class docs); if None, all fields are considered
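
The class description notes that token level transformer output requires input truncated and padded to a fixed wordpiece length. The following is a minimal sketch of that idea in plain Python, not the library's implementation: fit_to_length and the [PAD] marker are assumptions made for illustration, and unlike a real tokenizer the sketch does not preserve the closing [SEP] piece when truncating.

```python
from typing import List


def fit_to_length(word_pieces: List[str], length: int,
                  pad: str = '[PAD]') -> List[str]:
    """Truncate or right-pad a word piece sequence to exactly ``length``."""
    if length < 0:
        raise ValueError('length must be non-negative')
    clipped = word_pieces[:length]
    # pad only when the clipped sequence is shorter than the target
    return clipped + [pad] * (length - len(clipped))


short = fit_to_length(['[CLS]', 'good', 'film', '[SEP]'], 6)
long = fit_to_length(['[CLS]', 'a', 'very', 'long', 'review', '[SEP]'], 4)
print(short)
print(long)
```

With every sequence at the same length, the per-token outputs can be stacked into one fixed-shape tensor for the terminal linear layer.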
 
 
 
- class zensols.deepnlp.classify.model.ClassifyNetworkSettings(name, config_factory, torch_config, batch_stash, embedding_layer, dropout, recurrent_settings, convolution_settings, linear_settings)[source]¶
- Bases: - DropoutNetworkSettings,- EmbeddingNetworkSettings- A utility container settings class for convolution network models. This class also updates the recurrent network’s dropout settings when changed. - See:
 - __init__(name, config_factory, torch_config, batch_stash, embedding_layer, dropout, recurrent_settings, convolution_settings, linear_settings)¶
 - 
convolution_settings: DeepConvolution1dNetworkSettings¶
- Contains the configuration for the model’s convolution layer(s). 
 - get_module_class_name()[source]¶
- Returns the fully qualified class name of the module to create by ModelManager. This module takes as the first parameter an instance of this class. - Important: This method is not used for nested modules. You must declare specific class names in the configuration for those nested modules. - Return type:
 
 - 
linear_settings: DeepLinearNetworkSettings¶
- Contains the configuration for the model’s terminal layer. 
 - 
recurrent_settings: RecurrentAggregationNetworkSettings¶
- Contains the configuration for the model’s RNN. 
 
zensols.deepnlp.classify.multilabel module¶
Classes that enable multi-label classification.
- class zensols.deepnlp.classify.multilabel.MultiLabelFeatureDocument(sents, text=None, spacy_doc=None, softmax_logit=None, labels=None, preds=None)[source]¶
- Bases: - PredictionFeatureDocument- A feature document with a label, used for text classification. - __init__(sents, text=None, spacy_doc=None, softmax_logit=None, labels=None, preds=None)¶
 - write(depth=0, writer=<_io.TextIOWrapper name='<stdout>' mode='w' encoding='utf-8'>)[source]¶
- Write the document and optionally sentence features. - Parameters:
- n_sents – the number of sentences to write 
- n_tokens – the number of tokens to print across all sentences 
- include_original – whether to include the original text 
- include_normalized – whether to include the normalized text 
 
 
 
- class zensols.deepnlp.classify.multilabel.MultiLabelFeatureDocumentDataPoint(id, batch_stash, container)[source]¶
- Bases: - TokenContainerDataPoint- A representation of the data for a review document containing the sentiment polarity as the label. - __init__(id, batch_stash, container)¶
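
Multi-label classification differs from the single-label case in that a document carries several labels and predictions (the labels and preds arguments above) rather than one. The decision rule used by this package is not documented here, so the sketch below assumes a common approach: an independent sigmoid per label with a fixed threshold. The names predict_labels and threshold are hypothetical.

```python
import math
from typing import Dict, List


def sigmoid(x: float) -> float:
    """Map a raw logit to an independent probability in (0, 1)."""
    return 1.0 / (1.0 + math.exp(-x))


def predict_labels(logits: Dict[str, float],
                   threshold: float = 0.5) -> List[str]:
    """Return every label whose probability meets the threshold."""
    return sorted(name for name, value in logits.items()
                  if sigmoid(value) >= threshold)


preds = predict_labels({'sports': 2.0, 'politics': -1.5, 'finance': 0.3})
print(preds)
```

Because each label is scored on its own, a document may receive zero, one or many labels, unlike a softmax that always selects exactly one.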
 
zensols.deepnlp.classify.pred module¶
Prediction mapper support for NLP applications.
- class zensols.deepnlp.classify.pred.ClassificationPredictionMapper(datas, batch_stash, vec_manager, label_feature_id, pred_attribute='pred', softmax_logit_attribute='softmax_logit')[source]¶
- Bases: - PredictionMapper- A prediction mapper for text classification. This mapper works at any level (document, sentence, token). - __init__(datas, batch_stash, vec_manager, label_feature_id, pred_attribute='pred', softmax_logit_attribute='softmax_logit')¶
 - property label_vectorizer: CategoryEncodableFeatureVectorizer¶
- The label vectorizer used to map classes in get_classes().
 - map_results(result)[source]¶
- Map class predictions, logits, and documents generated during use of this instance. Each data point is aggregated across batches. - Return type:
- Returns:
- a Settings instance with classes, logits and docs attributes
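
The aggregation described for map_results() can be pictured as flattening per-batch outputs into three parallel lists. The sketch below is an assumption-laden illustration rather than the library's code: types.SimpleNamespace stands in for Settings, map_batches and the batch dictionary layout are hypothetical, and the predicted class is taken to be the label with the highest logit.

```python
from types import SimpleNamespace
from typing import List, Sequence


def map_batches(batches: Sequence[dict],
                label_names: List[str]) -> SimpleNamespace:
    """Flatten per-batch outputs into parallel classes, logits and docs."""
    classes, logits, docs = [], [], []
    for batch in batches:
        for doc, row in zip(batch['docs'], batch['logits']):
            # the predicted class is the label with the highest logit
            best = max(range(len(row)), key=row.__getitem__)
            classes.append(label_names[best])
            logits.append(row)
            docs.append(doc)
    return SimpleNamespace(classes=classes, logits=logits, docs=docs)


batches = [
    {'docs': ['great movie', 'dull plot'],
     'logits': [[0.1, 2.3], [1.7, -0.4]]},
    {'docs': ['fine acting'], 'logits': [[0.2, 0.9]]},
]
result = map_batches(batches, label_names=['neg', 'pos'])
print(result.classes)
```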
 
 - 
pred_attribute: str= 'pred'¶
- The prediction attribute to set on the FeatureDocument returned from map_results().
 - 
softmax_logit_attribute: str= 'softmax_logit'¶
- The softmax of the logits attribute to set on the FeatureDocument returned from map_results().
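
Since pred_attribute and softmax_logit_attribute are attribute names rather than values, one way to picture their use is a dynamic setattr on each returned document. This is a sketch under that assumption: annotate is a hypothetical helper, SimpleNamespace stands in for FeatureDocument, and the form in which the library stores the softmax values is not specified here.

```python
import math
from types import SimpleNamespace
from typing import List


def softmax(logits: List[float]) -> List[float]:
    """Numerically stable softmax over a list of logits."""
    peak = max(logits)
    exps = [math.exp(value - peak) for value in logits]
    total = sum(exps)
    return [e / total for e in exps]


def annotate(doc, label: str, logits: List[float],
             pred_attribute: str = 'pred',
             softmax_logit_attribute: str = 'softmax_logit'):
    """Set the prediction and softmax values under configurable names."""
    setattr(doc, pred_attribute, label)
    setattr(doc, softmax_logit_attribute, softmax(logits))
    return doc


doc = annotate(SimpleNamespace(text='great movie'), 'pos', [0.0, 0.0])
print(doc.pred, doc.softmax_logit)
```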
 - 
vec_manager: FeatureDocumentVectorizerManager¶
- The vectorizer manager used to parse and get the label vectorizer. 
 
- class zensols.deepnlp.classify.pred.SequencePredictionMapper(datas, batch_stash, vec_manager, label_feature_id, pred_attribute='pred', softmax_logit_attribute='softmax_logit')[source]¶
- Bases: - ClassificationPredictionMapper- Predicts sequences as a Settings with the key classes holding the token level predictions and the key docs holding the parsed documents from the sentence text. - __init__(datas, batch_stash, vec_manager, label_feature_id, pred_attribute='pred', softmax_logit_attribute='softmax_logit')¶
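
Token level predictions are most useful once paired back with the tokens they describe. The sketch below assumes one predicted label per token in each sentence; align is a hypothetical helper, and the tokens and tag names are made-up example data rather than output from the library.

```python
from typing import List, Tuple


def align(sent_tokens: List[List[str]], sent_classes: List[List[str]]
          ) -> List[List[Tuple[str, str]]]:
    """Pair each token with its predicted label, sentence by sentence."""
    paired = []
    for tokens, classes in zip(sent_tokens, sent_classes):
        if len(tokens) != len(classes):
            raise ValueError('token and prediction counts differ')
        paired.append(list(zip(tokens, classes)))
    return paired


tagged = align([['Obama', 'visited', 'Paris']], [['PER', 'O', 'LOC']])
print(tagged)
```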