zensols.deepnlp.model package

Submodules

zensols.deepnlp.model.facade module

A facade that supports natural language model feature updating.

class zensols.deepnlp.model.facade.LanguageModelFacade(config, config_factory=<property object>, progress_bar=True, progress_bar_cols='term', executor_name='executor', writer=<_io.TextIOWrapper name='<stdout>' mode='w' encoding='utf-8'>, predictions_dataframe_factory_class=<class 'zensols.deeplearn.result.pred.PredictionsDataFrameFactory'>, suppress_transformer_warnings=True)[source]

Bases: ModelFacade

A facade that supports natural language model feature updating. It also provides logging configuration for the NLP domains in this package.

This class makes assumptions about the naming of the embedding layer vectorizer. See embedding.

__init__(config, config_factory=<property object>, progress_bar=True, progress_bar_cols='term', executor_name='executor', writer=<_io.TextIOWrapper name='<stdout>' mode='w' encoding='utf-8'>, predictions_dataframe_factory_class=<class 'zensols.deeplearn.result.pred.PredictionsDataFrameFactory'>, suppress_transformer_warnings=True)
property count_feature_ids: Set[str]

The spaCy token features used in the join layer.

property doc_parser: FeatureDocumentParser

Return the document parser associated with the language vectorizer manager.

See: language_vectorizer_manager

property embedding: str

The embedding layer.

Important: the embedding parameter takes the name given in the configuration without the _layer postfix. For example, embedding is glove_50_embedding when:

  • glove_50_embedding is the name of the GloveWordEmbedModel

  • glove_50_feature_vectorizer is the name of the WordVectorEmbeddingFeatureVectorizer

  • glove_50_embedding_layer is the name of the WordVectorEmbeddingLayer

Parameters:

embedding – the kind of embedding, i.e. glove_50_embedding
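The naming pattern can be sketched as a configuration fragment. Only the section names follow the documented convention; the bodies are placeholders, since the keys each section needs depend on the application configuration:

```ini
# Sketch only: section names follow the documented naming convention;
# the keys each section requires depend on your application config.
[glove_50_embedding]
# defines the GloveWordEmbedModel

[glove_50_feature_vectorizer]
# defines the WordVectorEmbeddingFeatureVectorizer

[glove_50_embedding_layer]
# defines the WordVectorEmbeddingLayer
```

With this layout, the facade's embedding property would be set to glove_50_embedding.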

property enum_feature_ids: Set[str]

The spaCy enumeration encodings used token-wise to widen the input embeddings.

get_max_word_piece_len()[source]

Get the longest word piece length for the first found configured transformer embedding feature vectorizer.

Return type:

int

get_transformer_vectorizer()[source]

Return the first found transformer token vectorizer.

Return type:

TransformerEmbeddingFeatureVectorizer

property language_attributes: Set[str]

The language attributes to be used.

property language_vectorizer_manager: FeatureVectorizerManager

Return the language vectorizer manager for the class.

suppress_transformer_warnings: bool = True

If True, suppress the `Some weights of the model checkpoint...` warnings from the Hugging Face transformers library.

class zensols.deepnlp.model.facade.LanguageModelFacadeConfig(manager_name, attribs, embedding_attribs)[source]

Bases: object

Configuration that defines how and what to access language configuration data. Note that this data reflects how you have the model configured per the configuration file. Parameter examples are given per the Movie Review example.

__init__(manager_name, attribs, embedding_attribs)
attribs: Set[str]

The language attributes (all levels: token, document, etc.), such as enum, count, dep, etc.

embedding_attribs: Set[str]

All embedding attributes used in the configuration, such as glove_50_embedding, word2vec_300, bert_embedding, etc.

manager_name: str

The name of the language based feature vectorizer, such as language_vectorizer_manager.
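As a rough pure-Python sketch of what such a configuration carries (a stand-in mirroring the documented fields, not the actual LanguageModelFacadeConfig class; the concrete attribute values below are illustrative assumptions):

```python
from dataclasses import dataclass
from typing import Set


# Pure-Python stand-in mirroring the documented fields of
# LanguageModelFacadeConfig; the values used below are illustrative only.
@dataclass
class FacadeLangConfig:
    manager_name: str            # name of the language feature vectorizer manager
    attribs: Set[str]            # language attributes (enum, count, dep, ...)
    embedding_attribs: Set[str]  # embedding attributes (glove_50_embedding, ...)


cfg = FacadeLangConfig(
    manager_name='language_vectorizer_manager',
    attribs={'enum', 'count', 'dep'},
    embedding_attribs={'glove_50_embedding', 'transformer_embedding'},
)
```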

zensols.deepnlp.model.sequence module

Utility classes for mapping, aggregating, and collating sequence (i.e. NER) labels.

class zensols.deepnlp.model.sequence.BioSequenceAnnotationMapper(begin_tag='B', in_tag='I', out_tag='O')[source]

Bases: object

Matches feature documents/tokens with spaCy document/tokens and entity labels.

__init__(begin_tag='B', in_tag='I', out_tag='O')
begin_tag: str = 'B'

The sequence begin tag class.

in_tag: str = 'I'

The sequence in tag class.

map(classes, docs)[source]

Map BIO entities and documents to pairings as annotations.

Parameters:
  • classes (Tuple[List[str]]) – a tuple of lists, each list containing the class of the token in BIO format

  • docs (Tuple[FeatureDocument]) – the feature documents to which to assign labels

Return type:

Tuple[SequenceDocumentAnnotation]

Returns:

a tuple of annotation instances, each coupling a label, a feature token, and a spaCy token

out_tag: str = 'O'

The sequence out tag class.
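The BIO scheme the mapper consumes can be illustrated independently of the library. The following is a minimal pure-Python pass that groups B/I/O tags into labeled spans; it sketches the tagging format only, not BioSequenceAnnotationMapper's implementation:

```python
from typing import List, Tuple


def bio_spans(tags: List[str]) -> List[Tuple[str, int, int]]:
    """Group BIO tags into (label, start, end) spans; end is exclusive.

    Tags look like 'B-PER', 'I-PER' or 'O'.  This is a sketch of the
    BIO scheme only, not the library's mapping logic.
    """
    spans = []
    start, label = None, None
    for i, tag in enumerate(tags):
        if tag.startswith('B'):
            # a new entity begins; close any span still open
            if start is not None:
                spans.append((label, start, i))
            start, label = i, tag.split('-', 1)[1]
        elif tag.startswith('I') and start is not None:
            # continue the current entity
            continue
        else:
            # 'O' (or a stray 'I') closes any open span
            if start is not None:
                spans.append((label, start, i))
            start, label = None, None
    if start is not None:
        spans.append((label, start, len(tags)))
    return spans


# tokens: ['John', 'Smith', 'visited', 'Paris']
print(bio_spans(['B-PER', 'I-PER', 'O', 'B-LOC']))
# → [('PER', 0, 2), ('LOC', 3, 4)]
```

The mapper pairs spans like these with the matching feature and spaCy tokens to produce SequenceAnnotation instances.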

class zensols.deepnlp.model.sequence.SequenceAnnotation(label, doc, tokens)[source]

Bases: PersistableContainer, Dictable

An annotation of a pair matching feature and spaCy tokens.

__init__(label, doc, tokens)
doc: FeatureDocument

The feature document associated with this annotation.

label: str

The string label of this annotation.

property mention: str

The mention text.

property sent: FeatureSentence

The sentence containing the annotated tokens.

property token_matches: Tuple[FeatureToken, Token]

Pairs mapping each feature token to its matching spaCy token. This is useful for annotating spaCy documents.

tokens: Tuple[FeatureToken]

The tokens annotated with label.

write(depth=0, writer=<_io.TextIOWrapper name='<stdout>' mode='w' encoding='utf-8'>, short=False)[source]

Write this instance as either a Writable or as a Dictable. If class attribute _DICTABLE_WRITABLE_DESCENDANTS is set as True, then use the write() method on children instead of writing the generated dictionary. Otherwise, write this instance by first creating a dict recursively using asdict(), then formatting the output.

If the attribute _DICTABLE_WRITE_EXCLUDES is set, those attributes are removed from what is written in the write() method.

Note that this attribute will need to be set in all descendants in the instance hierarchy since writing the object instance graph is done recursively.

Parameters:
  • depth (int) – the starting indentation depth

  • writer (TextIOBase) – the writer to dump the content of this writable

class zensols.deepnlp.model.sequence.SequenceDocumentAnnotation(doc, sequence_anons)[source]

Bases: Dictable

Contains token annotations for a FeatureDocument as a tuple of SequenceAnnotation instances.

__init__(doc, sequence_anons)
doc: FeatureDocument

The feature document associated with this annotation.

sequence_anons: Tuple[SequenceAnnotation]

The annotations for the respective doc.

property spacy_doc: Doc

The spaCy document associated with this annotation.

property token_matches: Tuple[str, FeatureToken, Token]

Triples mapping feature tokens to spaCy tokens in the form (label, feature token, spaCy token). This is useful for annotating spaCy documents.

write(depth=0, writer=<_io.TextIOWrapper name='<stdout>' mode='w' encoding='utf-8'>, short=False)[source]

Write this instance as either a Writable or as a Dictable. If class attribute _DICTABLE_WRITABLE_DESCENDANTS is set as True, then use the write() method on children instead of writing the generated dictionary. Otherwise, write this instance by first creating a dict recursively using asdict(), then formatting the output.

If the attribute _DICTABLE_WRITE_EXCLUDES is set, those attributes are removed from what is written in the write() method.

Note that this attribute will need to be set in all descendants in the instance hierarchy since writing the object instance graph is done recursively.

Parameters:
  • depth (int) – the starting indentation depth

  • writer (TextIOBase) – the writer to dump the content of this writable

Module contents