zensols.nlp package ¶

source_parsers: List[FeatureDocumentParser] = None¶: The language resource used to parse documents and create token attributes.

validate_features: Set[str]¶: A set of features to compare across all tokens when copying. If any of the given features don’t match, an mismatch token error is raised.

yield_feature_defaults: Any = None¶: A default value to use when no yielded value is found. If None, do not add the feature if missing.

yield_features: Set[str]¶: A list of features to be copied (in order) if the target token is not set.

class zensols.nlp.combine.MappingCombinerFeatureDocumentParser(name, delegate, token_decorators=(), sentence_decorators=(), document_decorators=(), token_feature_ids=<factory>, silencer=None, source_parsers=None, validate_features=frozenset({'idx'}), yield_features=<factory>, yield_feature_defaults=None, overwrite_features=<factory>, overwrite_nones=False, map_features=<factory>, merge_sentences=True)[source]¶

Bases: CombinerFeatureDocumentParser

Maps the source to respective tokens in the target document using spaCy artifacts.

__init__(name, delegate, token_decorators=(), sentence_decorators=(), document_decorators=(), token_feature_ids=<factory>, silencer=None, source_parsers=None, validate_features=frozenset({'idx'}), yield_features=<factory>, yield_feature_defaults=None, overwrite_features=<factory>, overwrite_nones=False, map_features=<factory>, merge_sentences=True)¶

merge_sentences: bool = True¶: If False ignore sentences and map everything at the token level. Otherwise, the same hierarchy mapping as the super class is used. This is useful when sentence demarcations are not aligned across source document parsers and this parser.

validate_features: Set[str] = frozenset({'idx'})¶: A set of features to compare across all tokens when copying. If any of the given features don’t match, an mismatch token error is raised. The default is the token’s index in the document, which should not change in most cases.

zensols.nlp.component module¶

Components useful for reuse.

class zensols.nlp.component.EntityRecognizer(nlp, name, import_file, patterns)[source]¶

Bases: object

Base class regular expression and spaCy match patterns named entity recognizer. Both subclasses allow for an optional label for each respective pattern or regular expression. If the label is provided, then the match is made a named entity with a label. In any case, a span is created on the token, and in some cases, retokenized.

__init__(nlp, name, import_file, patterns)¶

import_file: Optional[str]¶: An optional JSON file used to append the pattern configuration.

name: str¶: The component name.

nlp: Language¶: The NLP model.

patterns: List¶: A list of the regular expressions to find.

class zensols.nlp.component.PatternEntityRecognizer(nlp, name, import_file, patterns)[source]¶

Bases: EntityRecognizer

Adds entities based on regular epxressions.

See:: Rule matching

__init__(nlp, name, import_file, patterns)¶

patterns: List[Tuple[str, List[List[Dict[str, Any]]]]]¶: The patterns given to the Matcher.

class zensols.nlp.component.RegexEntityRecognizer(nlp, name, import_file, patterns)[source]¶

Bases: EntityRecognizer

Merges regular expression matches as a Span. After matches are found, re-tokenization merges them in to one token per match.

__init__(nlp, name, import_file, patterns)¶

patterns: List[Tuple[str, List[Pattern]]]¶: A list of the regular expressions to find.

class zensols.nlp.component.RegexSplitter(nlp, name, import_file, patterns)[source]¶

Bases: EntityRecognizer

Splits on regular expressions.

__init__(nlp, name, import_file, patterns)¶

patterns: List[Tuple[str, List[Pattern]]]¶: A list of the regular expressions to find.

zensols.nlp.component.create_patner_component(nlp, name, patterns, path=None)[source]¶

zensols.nlp.component.create_regexner_component(nlp, name, patterns, path=None)[source]¶

zensols.nlp.component.create_regexsplit_component(nlp, name, patterns, path=None)[source]¶

zensols.nlp.component.create_remove_sent_boundaries_component(doc)[source]¶

Remove sentence boundaries from tokens.

Parameters:: doc (Doc) – the spaCy document to remove sentence boundaries

zensols.nlp.component.create_whitespace_tokenizer_component(nlp, name)[source]¶

zensols.nlp.container module¶

Domain objects that define features associated with text.

class zensols.nlp.container.FeatureDocument(sents, text=None, spacy_doc=None)[source]¶

Bases: TokenContainer

A container class of tokens that make a document. This class contains a one to many of sentences. However, it can be treated like any TokenContainer to fetch tokens. Instances of this class iterate over FeatureSentence instances.

Parameters:: sents (Tuple[FeatureSentence, ...]) – the sentences defined for this document

_combine_documents(docs, cls, concat_tokens, **kwargs)[source]¶

Override if there are any fields in your dataclass. In most cases, the only time this is called is by an embedding vectorizer to batch muultiple sentences in to a single document, so the only feature that matter are the sentence level.

Parameters:

docs (Tuple[FeatureDocument, ...]) – the documents to combine in to one
cls (Type[FeatureDocument]) – the class of the instance to create
concat_tokens (bool) – if True each sentence of the returned document are the concatenated tokens of each respective document; otherwise simply concatenate sentences in to one document
kwargs – additional keyword arguments to pass to the new feature document’s initializer

Return type:

EMPTY_DOCUMENT: ClassVar[FeatureDocument] = <>¶: A zero length document.

__init__(sents, text=None, spacy_doc=None)¶

clear()[source]¶: Clear all cached state.

clone(cls=None, **kwargs)[source]¶

Parameters:: kwargs – if copy_spacy is True, the spacy document is copied to the clone in addition parameters passed to new clone initializer
Return type:: TokenContainer

classmethod combine_documents(docs, concat_tokens=True, **kwargs)[source]¶

Coerce a tuple of token containers (either documents or sentences) in to one synthesized document.

Parameters:

docs (Iterable[FeatureDocument]) – the documents to combine in to one
cls – the class of the instance to create
concat_tokens (bool) – if True each sentence of the returned document are the concatenated tokens of each respective document; otherwise simply concatenate sentences in to one document
kwargs – additional keyword arguments to pass to the new feature document’s initializer

Return type:

combine_sentences(sents=None)[source]¶

Combine the sentences in this document in to a new document with a single sentence.

Parameters:: sents (Iterable[FeatureSentence]) – the sentences to combine in the new document or all if None
Return type:: FeatureDocument

from_sentences(sents, deep=False)[source]¶

Return a new cloned document using the given sentences.

Parameters:

sents (Iterable[FeatureSentence]) – the sentences to add to the new cloned document
deep (bool) – whether or not to clone the sentences

See:

clone()

Return type:

get_overlapping_document(span, inclusive=True)[source]¶

Get the portion of the document that overlaps span. Sentences completely enclosed in a span are copied. Otherwise, new sentences are created from those tokens that overlap the span.

Parameters:

span (LexicalSpan) – indicates the portion of the document to retain
inclusive (bool) – whether to check include +1 on the end component

Return type:

Iterable[FeatureSentence]

Returns:

a new document that contains the 0 index offset of span

get_overlapping_sentences(span, inclusive=True)[source]¶

Return sentences that overlaps with span from this document.

Parameters:

span (LexicalSpan) – indicates the portion of the document to retain
inclusive (bool) – whether to check include +1 on the end component

Return type:

get_overlapping_span(span, inclusive=True)[source]¶

Return a feature span that includes the lexical scope of span.

Return type:: TokenContainer

property max_sentence_len: int¶: Return the length of tokens from the longest sentence in the document.

sent_iter(*args, **kwargs)[source]¶

Return type:: Iterable[FeatureSentence]

sentence_for_token(token)[source]¶

Return the parent sentence that has token.

Return type:: FeatureSentence

sentence_index_for_token(token)[source]¶

Return index of the parent sentence having token.

Return type:: int

sentences_for_tokens(tokens)[source]¶

Find sentences having a set of tokens.

Parameters:: tokens (Tuple[FeatureToken, ...]) – the query used to finding containing sentences
Return type:: Tuple[FeatureSentence, ...]
Returns:: the document ordered tuple of sentences containing tokens

sents: Tuple[FeatureSentence, ...]¶: The sentences that make up the document.

set_spacy_doc(doc)[source]¶

spacy_doc: Doc = None¶: The parsed spaCy document this feature set is based. As explained in FeatureToken, spaCy documents are heavy weight and problematic to pickle. For this reason, this attribute is dropped when pickled, and only here for ad-hoc predictions.

text: str = None¶: The original raw text of the sentence.

to_document()[source]¶

Coerce this instance in to a document.

Return type:: FeatureDocument

to_sentence(limit=9223372036854775807, contiguous_i_sent=False, delim='')[source]¶

Coerce this instance to a single sentence. No tokens data is updated so FeatureToken.i_sent keep their original indexes. These sentence indexes will be inconsistent when called on FeatureDocument unless contiguous_i_sent is set to True.

Parameters:

limit (int) – the max number of sentences to create (only starting kept)
contiguous_i_sent (Union[str, bool]) – if True, ensures all tokens have FeatureToken.i_sent value that is contiguous for the returned instance; if this value is reset, the token indicies start from 0
delim (str) – a string added between each constituent sentence

Return type:

Returns:

an instance of FeatureSentence that represents this token sequence

token_iter(*args, **kwargs)[source]¶

Return an iterator over the token features.

Parameters:: args – the arguments given to itertools.islice()
Return type:: Iterable[FeatureToken]

uncombine_sentences()[source]¶

Reconstruct the sentence structure that we combined in combine_sentences(). If that has not been done in this instance, then return self.

Return type:: FeatureDocument

update_entity_spans(include_idx=True)[source]¶

Update token entity to norm text. This is helpful when entities are embedded after splitting text, which becomes FeatureToken.norm values. However, the token spans still index the original entities that are multi-word, which leads to norms that are not equal to the text spans. This synchronizes the token span indexes with the norms.

Parameters:: include_idx (bool) – whether to update SpacyFeatureToken.idx as well

update_indexes()[source]¶

Update all FeatureToken.i attributes to those provided by tokens_by_i. This corrects the many-to-one token index mapping for split multi-word named entities.

See:: tokens_by_i

write(depth=0, writer=<_io.TextIOWrapper name='<stdout>' mode='w' encoding='utf-8'>, n_sents=9223372036854775807, n_tokens=0, include_original=False, include_normalized=True)[source]¶

Write the document and optionally sentence features.

Parameters:

n_sents (int) – the number of sentences to write
n_tokens (int) – the number of tokens to print across all sentences
include_original (bool) – whether to include the original text
include_normalized (bool) – whether to include the normalized text

class zensols.nlp.container.FeatureSentence(tokens, text=None, spacy_span=None)[source]¶

Bases: FeatureSpan

A container class of tokens that make a sentence. Instances of this class iterate over FeatureToken instances, and can create documents with to_document().

EMPTY_SENTENCE: ClassVar[FeatureSentence] = <>¶

__init__(tokens, text=None, spacy_span=None)¶

get_overlapping_span(span, inclusive=True)[source]¶

Return a feature span that includes the lexical scope of span.

Return type:: TokenContainer

to_document()[source]¶

Coerce this instance in to a document.

Return type:: FeatureDocument

to_sentence(limit=9223372036854775807, contiguous_i_sent=False, delim='')[source]¶

Parameters:

limit (int) – the max number of sentences to create (only starting kept)
contiguous_i_sent (Union[str, bool]) – if True, ensures all tokens have FeatureToken.i_sent value that is contiguous for the returned instance; if this value is reset, the token indicies start from 0
delim (str) – a string added between each constituent sentence

Return type:

Returns:

an instance of FeatureSentence that represents this token sequence

class zensols.nlp.container.FeatureSpan(tokens, text=None, spacy_span=None)[source]¶

Bases: TokenContainer

A span of tokens as a TokenContainer, much like spacy.tokens.Span.

__init__(tokens, text=None, spacy_span=None)¶

clone(cls=None, **kwargs)[source]¶

Clone an instance of this token container.

Parameters:

cls (Type[TokenContainer]) – the type of the new instance
kwargs – arguments to add to as attributes to the clone

Return type:

TokenContainer

Returns:

the cloned instance of this instance

property dependency_tree: Dict[FeatureToken, List[Dict[FeatureToken]]]¶

spacy_span: Span = None¶

The parsed spaCy span this feature set is based.

See:: FeatureDocument.spacy_doc()

text: str = None¶: The original raw text of the span.

to_document()[source]¶

Coerce this instance in to a document.

Return type:: FeatureDocument

to_sentence(limit=9223372036854775807, contiguous_i_sent=False, delim='')[source]¶

Parameters:

limit (int) – the max number of sentences to create (only starting kept)
contiguous_i_sent (Union[str, bool]) – if True, ensures all tokens have FeatureToken.i_sent value that is contiguous for the returned instance; if this value is reset, the token indicies start from 0
delim (str) – a string added between each constituent sentence

Return type:

Iterable[FeatureToken]

Returns:

an instance of FeatureSentence that represents this token sequence

token_iter(*args, **kwargs)[source]¶

Return an iterator over the token features.

Parameters:: args – the arguments given to itertools.islice()
Return type:: Iterable[FeatureToken]

property token_len: int¶: Return the number of tokens.

property tokens: Tuple[FeatureToken, ...]¶: The tokens that make up the span.

property tokens_by_i_sent: Dict[int, FeatureToken]¶

A map of tokens with keys as their sentanal position offset and values as tokens.

See:: zensols.nlp.FeatureToken.i

update_entity_spans(include_idx=True)[source]¶

Parameters:: include_idx (bool) – whether to update SpacyFeatureToken.idx as well

update_indexes()[source]¶

Update all FeatureToken.i attributes to those provided by tokens_by_i. This corrects the many-to-one token index mapping for split multi-word named entities.

See:: tokens_by_i

class zensols.nlp.container.TokenAnnotatedFeatureDocument(sents, text=None, spacy_doc=None)[source]¶

Bases: FeatureDocument

A feature sentence that contains token annotations. Sentences can be modeled with TokenAnnotatedFeatureSentence or just FeatureSentence since this sets the annotations attribute when combining.

__init__(sents, text=None, spacy_doc=None)¶

property annotations: Tuple[Any, ...]¶: A token level annotation, which is one-to-one to tokens.

combine_sentences(**kwargs) → FeatureDocument¶

Combine all the sentences in this document in to a new document with a single sentence.

Return type:: FeatureDocument

class zensols.nlp.container.TokenAnnotatedFeatureSentence(tokens, text=None, spacy_span=None, annotations=())[source]¶

Bases: FeatureSentence

A feature sentence that contains token annotations.

__init__(tokens, text=None, spacy_span=None, annotations=())¶

annotations: Tuple[Any, ...] = ()¶: A token level annotation, which is one-to-one to tokens.

to_document()[source]¶

Coerce this instance in to a document.

Return type:: FeatureDocument

write(depth=0, writer=<_io.TextIOWrapper name='<stdout>' mode='w' encoding='utf-8'>, **kwargs)[source]¶

Write the text container.

Parameters:

include_original – whether to include the original text
include_normalized – whether to include the normalized text
n_tokens – the number of tokens to write
inline – whether to print the tokens on one line each

class zensols.nlp.container.TokenContainer[source]¶

Bases: PersistableContainer, TextContainer

A base class for token container classes such as FeatureSentence and FeatureDocument. In addition to the defined methods, each instance has a text attribute, which is the original text of the document.

property canonical: str¶: A canonical representation of the container, which are non-space tokens separated by CANONICAL_DELIMITER.

clear()[source]¶: Clear all cached state.

clone(cls=None, **kwargs)[source]¶

Clone an instance of this token container.

Parameters:

cls (Type[TokenContainer]) – the type of the new instance
kwargs – arguments to add to as attributes to the clone

Return type:

TokenContainer

Returns:

the cloned instance of this instance

property entities: Tuple[FeatureSpan, ...]¶: The named entities of the container with each multi-word entity as elements.

get_overlapping_span(span, inclusive=True)[source]¶

Return a feature span that includes the lexical scope of span.

Return type:: TokenContainer

get_overlapping_tokens(span, inclusive=True)[source]¶

Get all tokens that overlap lexical span span.

Parameters:

span (LexicalSpan) – the document 0-index character based inclusive span to compare with FeatureToken.lexspan
inclusive (bool) – whether to check include +1 on the end component

Return type:

Returns:

a token sequence containing the 0 index offset of span

property lexspan: LexicalSpan¶: The document indexed lexical span using idx.

map_overlapping_tokens(spans, inclusive=True)[source]¶

Return a tuple of tokens, each tuple in the range given by the respective span in spans.

Parameters:

spans (Iterable[LexicalSpan]) – the document 0-index character based inclusive spans to compare with FeatureToken.lexspan
inclusive (bool) – whether to check include +1 on the end component

Return type:

Iterable[Tuple[FeatureToken, ...]]

Returns:

a tuple of matching tokens for the respective span query

property norm: str¶: The normalized version of the sentence.

property norm_orth: str¶: The normalized version of the sentence using the orignal rather than the token normalized text.

norm_token_iter(*args, **kwargs)[source]¶

Return a list of normalized tokens.

Parameters:: args – the arguments given to itertools.islice()
Return type:: Iterable[str]

reindex(reference_token=None)[source]¶

Re-index tokens, which is useful for situtations where a 0-index offset is assumed for sub-documents created with FeatureDocument.get_overlapping_document() or FeatureDocument.get_overlapping_sentences(). The following data are modified:

FeatureToken.i

FeatureToken.idx

FeatureToken.i_sent

FeatureToken.sent_i (see SpacyFeatureToken.sent_i)

FeatureToken.lexspan (see SpacyFeatureToken.lexspan)

entities

lexspan

tokens_by_i

tokens_by_idx

FeatureSpan.tokens_by_i_sent

FeatureSpan.dependency_tree

strip(in_place=True)[source]¶

Strip beginning and ending whitespace (see strip_tokens()) and text.

Return type:: TokenContainer

strip_token_iter(*args, **kwargs)[source]¶

Strip beginning and ending whitespace (see strip_tokens()) using token_iter().

Return type:: Iterable[FeatureToken]

static strip_tokens(token_iter)[source]¶

Strip beginning and ending whitespace. This uses is_space, which is True for spaces, tabs and newlines.

Parameters:: token_iter (Iterable[FeatureToken]) – an stream of tokens
Return type:: Iterable[FeatureToken]
Returns:: non-whitespace middle tokens

abstract to_document(limit=9223372036854775807)[source]¶

Coerce this instance in to a document.

Return type:: FeatureDocument

abstract to_sentence(limit=9223372036854775807, contiguous_i_sent=False, delim='')[source]¶

Parameters:

limit (int) – the max number of sentences to create (only starting kept)
contiguous_i_sent (Union[str, bool]) – if True, ensures all tokens have FeatureToken.i_sent value that is contiguous for the returned instance; if this value is reset, the token indicies start from 0
delim (str) – a string added between each constituent sentence

Return type:

Returns:

an instance of FeatureSentence that represents this token sequence

abstract token_iter(*args, **kwargs)[source]¶

Return an iterator over the token features.

Parameters:: args – the arguments given to itertools.islice()
Return type:: Iterable[FeatureToken]

property token_len: int¶: Return the number of tokens.

property tokens: Tuple[FeatureToken, ...]¶: Return the token features as a tuple.

property tokens_by_i: Dict[int, FeatureToken]¶

A map of tokens with keys as their position offset and values as tokens. The entries also include named entity tokens that are grouped as multi-word tokens. This is helpful for multi-word entities that were split (for example with SplitTokenMapper), and thus, have many-to-one mapped indexes.

See:: zensols.nlp.FeatureToken.i

property tokens_by_idx: Dict[int, FeatureToken]¶

A map of tokens with keys as their character offset and values as tokens.

Limitations: Multi-word entities will have have a mapping only for the first word of that entity if tokens were split by spaces (for example with SplitTokenMapper). However, tokens_by_i does not have this limitation.

See:: obj:tokens_by_i
See:: zensols.nlp.FeatureToken.idx

abstract update_entity_spans(include_idx=True)[source]¶

Parameters:: include_idx (bool) – whether to update SpacyFeatureToken.idx as well

update_indexes()[source]¶

Update all FeatureToken.i attributes to those provided by tokens_by_i. This corrects the many-to-one token index mapping for split multi-word named entities.

See:: tokens_by_i

write(depth=0, writer=<_io.TextIOWrapper name='<stdout>' mode='w' encoding='utf-8'>, include_original=False, include_normalized=True, n_tokens=9223372036854775807, inline=False, feature_ids=None)[source]¶

Write the text container.

Parameters:

include_original (bool) – whether to include the original text
include_normalized (bool) – whether to include the normalized text
n_tokens (int) – the number of tokens to write
inline (bool) – whether to print the tokens on one line each

write_text(depth=0, writer=<_io.TextIOWrapper name='<stdout>' mode='w' encoding='utf-8'>, include_original=False, include_normalized=True, limit=9223372036854775807)[source]¶

Write only the text of the container.

Parameters:

include_original (bool) – whether to include the original text
include_normalized (bool) – whether to include the normalized text
limit (int) – the max number of characters to print

zensols.nlp.dataframe module¶

zensols.nlp.decorate module¶

Contains useful classes for decorating feature sentences.

class zensols.nlp.decorate.CopyFeatureTokenContainerDecorator(feature_ids)[source]¶

Copies feature(s) for each token in the container. For each token, each source / target tuple pair in feature_ids is copied. If the feature is missing (does not include for existing FeatureToken.NONE values) an exception is raised.

__init__(feature_ids)¶

decorate(container)[source]¶

feature_ids: Tuple[Tuple[str, str], ...]¶: The features to copy in the form ((<source>, <target>), …).

class zensols.nlp.decorate.FilterEmptySentenceDocumentDecorator(filter_space=True)[source]¶

Bases: FeatureDocumentDecorator

Filter zero length sentences.

__init__(filter_space=True)¶

decorate(doc)[source]¶

filter_space: bool = True¶: Whether to filter space tokens when comparing zero length sentences.

class zensols.nlp.decorate.FilterTokenSentenceDecorator(remove_stop=False, remove_space=False, remove_pronouns=False, remove_punctuation=False, remove_determiners=False, remove_empty=False)[source]¶

Bases: FeatureSentenceDecorator

A decorator that strips whitespace from sentences.

See:: TokenContainer.strip()

__init__(remove_stop=False, remove_space=False, remove_pronouns=False, remove_punctuation=False, remove_determiners=False, remove_empty=False)¶

decorate(sent)[source]¶

remove_determiners: bool = False¶: Whether to remove determiners (i.e. the).

remove_empty: bool = False¶: Whether to 0-length tokens (using normalized text).

remove_pronouns: bool = False¶: Whether to remove pronouns (i.e. he).

remove_punctuation: bool = False¶: Whether to remove punctuation (i.e. periods).

remove_space: bool = False¶: Whether to remove white space (i.e. new lines).

remove_stop: bool = False¶: Whether to remove stop words.

class zensols.nlp.decorate.RemoveFeatureTokenContainerDecorator(exclude_feature_ids)[source]¶

Removes features each token in the container.

__init__(exclude_feature_ids)¶

decorate(container)[source]¶

exclude_feature_ids: Set[str]¶: The features to remove from the tokens.

class zensols.nlp.decorate.SplitTokenSentenceDecorator[source]¶

Bases: FeatureSentenceDecorator

A decorator that splits feature tokens by white space.

__init__()¶

decorate(sent)[source]¶

class zensols.nlp.decorate.StripTokenContainerDecorator[source]¶

A decorator that strips whitespace from sentences (or TokenContainer).

See:: TokenContainer.strip()

__init__()¶

decorate(container)[source]¶

class zensols.nlp.decorate.UpdateTokenContainerDecorator(update_indexes=True, update_entity_spans=True, reindex=False)[source]¶

Updates document indexes and spans (see fields).

__init__(update_indexes=True, update_entity_spans=True, reindex=False)¶

decorate(container)[source]¶

reindex: bool = False¶: Whether to invoke TokenContainer.reindex() after.

update_entity_spans: bool = True¶: Whether to update the document indexes with FeatureDocument.update_entity_spans().

update_indexes: bool = True¶: Whether to update the document indexes with FeatureDocument.update_indexes().

zensols.nlp.domain module¶

Interfaces, contracts and errors.

class zensols.nlp.domain.LexicalSpan(begin, end)[source]¶

Bases: Dictable

A lexical character span of text in a document. The span has two positions: begin and end, which is indexed respectively as an operator as well. The left (begin) is inclusive and the right (:obj:`end) is exclusive to conform to Python array slicing conventions.

One span is less than the other when the beginning position is less. When the beginnign positions are the same, the one with the smaller end position is less.

The length of the span is the distance between the end and the beginning positions.

EMPTY_SPAN: ClassVar[LexicalSpan] = (0, 0)¶: The span (0, 0).

__init__(begin, end)[source]¶

Initialize the interval.

Parameters:

begin (int) – the begin of the span
end (int) – the end of the span

property astuple: Tuple[int, int]¶: The span as a (begin, end) tuple.

classmethod from_token(tok)[source]¶

Create a span from a spaCy Token or Span.

Return type:: Tuple[int, int]

classmethod from_tuples(tups)[source]¶

Create spans from tuples.

Parameters:: tups (Iterable[Tuple[int, int]]) – an iterable of (<begin>, <end) tuples
Return type:: Iterable[LexicalSpan]

static gaps(spans, end=None)[source]¶

Return the spans for the “holes” in spans. For example, if spans is ((0, 5), (10, 12), (15, 17)), then return ((5, 10), (12, 15)).

Parameters:

spans (Iterable[LexicalSpan]) – the spans used to find gaps
end (Optional[int]) – an end position for the last gap so that if the last item in spans end does not match, another is added

Return type:

List[LexicalSpan]

Returns:

a list of spans that “fill” any holes in spans

narrow(other)[source]¶

Return the shortest span that inclusively fits in both this and other.

Parameters:: other (LexicalSpan) – the second span to narrow with this span
Retun:: a span so that beginning is maximized and end is minimized or None if the two spans do not overlap
Return type:: Optional[LexicalSpan]

static overlaps(a0, a1, b0, b1, inclusive=True)[source]¶

Return whether or not one text span overlaps with another.

Parameters:: inclusive (bool) – whether to check include +1 on the end component
Returns:: any overlap detected returns True

overlaps_with(other, inclusive=True)[source]¶

Return whether or not one text span overlaps non-inclusively with another.

Parameters:

other (LexicalSpan) – the other location
inclusive (bool) – whether to check include +1 on the end component

Return type:

bool

Returns:

any overlap detected returns True

static widen(others)[source]¶

Take the span union by using the left most begin and the right most end.

Parameters:: others (Iterable[LexicalSpan]) – the spans to union
Return type:: Optional[LexicalSpan]
Returns:: the widest span that inclusively aggregates others, or None if an empty sequence is passed

write(depth=0, writer=<_io.TextIOWrapper name='<stdout>' mode='w' encoding='utf-8'>)[source]¶

Write this instance as either a Writable or as a Dictable. If class attribute _DICTABLE_WRITABLE_DESCENDANTS is set as True, then use the write() method on children instead of writing the generated dictionary. Otherwise, write this instance by first creating a dict recursively using asdict(), then formatting the output.

If the attribute _DICTABLE_WRITE_EXCLUDES is set, those attributes are removed from what is written in the write() method.

Note that this attribute will need to be set in all descendants in the instance hierarchy since writing the object instance graph is done recursively.

Parameters:

depth (int) – the starting indentation depth
writer (TextIOBase) – the writer to dump the content of this writable

exception zensols.nlp.domain.MissingFeatureError(token, feature_id, msg=None)[source]¶

Bases: NLPError

Raised on attempting to access a non-existant feature in FeatureToken.

__init__(token, feature_id, msg=None)[source]¶

Initialize.

Parameters:

token (FeatureToken) – the token for which access was attempted
feature_id (str) – the feature_id that is missing in token

__module__ = 'zensols.nlp.domain'¶

exception zensols.nlp.domain.NLPError[source]¶

Bases: APIError

Raised for any errors for this library.

__annotations__ = {}¶

__module__ = 'zensols.nlp.domain'¶

exception zensols.nlp.domain.ParseError[source]¶

Bases: APIError

Raised for any parsing errors.

__annotations__ = {}¶

__module__ = 'zensols.nlp.domain'¶

class zensols.nlp.domain.TextContainer[source]¶

Bases: Dictable

A writable class that has a text property or attribute. All subclasses need a norm attribute or property.

write(depth=0, writer=<_io.TextIOWrapper name='<stdout>' mode='w' encoding='utf-8'>, include_original=True, include_normalized=True)[source]¶

If the attribute _DICTABLE_WRITE_EXCLUDES is set, those attributes are removed from what is written in the write() method.

Note that this attribute will need to be set in all descendants in the instance hierarchy since writing the object instance graph is done recursively.

Parameters:

depth (int) – the starting indentation depth
writer (TextIOBase) – the writer to dump the content of this writable

zensols.nlp.index module¶

A heuristic text indexing and search class.

class zensols.nlp.index.FeatureDocumentIndexer(doc)[source]¶

Bases: object

A utility class that indexes and searches for text in potentially whitespace mangled documents. It does this by trying more efficient means first, then resorts to methods that are more computationaly expensive.

__init__(doc)¶

doc: FeatureDocument¶: The document to index.

property doc_tok_orths: Tuple[Tuple[str, FeatureToken], ...]¶: Reutrn tuples of (<orthographic text>, <token>).

find(query, sent_ix=None)[source]¶

Find a sentence in document doc. If a sentence index is given, it treats the query as a sentence to find in doc.

Parameters:

query (TokenContainer) – the sentence to find in doc
sent_ix (int) – the sentence index hint if available

Return type:

TokenContainer

Returns:

the matched text from doc

property pack2ix: Dict[int, int]¶: Return a dictionary of character positions in the document (doc) text to respective positions in the same string without whitespace.

property packed_doc_text: str¶: Return the document’ (doc) no-space normalized text.

property text2sent: Dict[str, FeatureSentence]¶: Return a dictionary of sentence normalized text to respective sentence in doc.

zensols.nlp.nerscore module¶

Wraps the SemEval-2013 Task 9.1 NER evaluation API as a ScoreMethod.

From the David Batista blog post:

The SemEval’13 introduced four different ways to measure precision/recall/f1-score results based on the metrics defined by MUC:

Strict: exact boundary surface string match and entity type

Exact: exact boundary match over the surface string, regardless of the type

Partial: partial boundary match over the surface string, regardless of the type

Type: some overlap between the system tagged entity and the gold annotation is required

Each of these ways to measure the performance accounts for correct, incorrect, partial, missed and spurious in different ways. Let’s look in detail and see how each of the metrics defined by MUC falls into each of the scenarios described above.

see:: SemEval-2013 Task 9.1
see:: David Batista

class zensols.nlp.nerscore.SemEvalHarmonicMeanScore(precision, recall, f_score, correct, incorrect, partial, missed, spurious, possible, actual)[source]¶

Bases: HarmonicMeanScore

A harmonic mean score with the additional SemEval computed scores (see module zensols.nlp.nerscore docs).

NAN_INSTANCE: ClassVar[SemEvalHarmonicMeanScore] = SemEvalHarmonicMeanScore(precision=nan, recall=nan, f_score=nan, correct=nan, incorrect=nan, partial=nan, missed=nan, spurious=nan, possible=nan, actual=nan)¶: Used to add to ErrorScore for harmonic means replacements.

__init__(precision, recall, f_score, correct, incorrect, partial, missed, spurious, possible, actual)¶

actual: int¶

correct: int¶

both are the same.

Type:: The number of correct (COR)

incorrect: int¶

the output of a system and the golden annotation don’t match.

Type:: The number of incorrect (INC)

missed: int¶

a golden annotation is not captured by a system.

Type:: The number of missed (MIS)

partial: int¶

system and the golden annotation are somewhat “similar” but not the same.

Type:: The number of partial (PAR)

possible: int¶

spurious: int¶

system produces a response which does not exist in the golden annotation.

Type:: The number of spurious (SPU)

class zensols.nlp.nerscore.SemEvalScore(strict, exact, partial, ent_type)[source]¶

Bases: Score

Contains all four harmonic mean SemEval scores (see module zensols.nlp.nerscore docs). This score has four harmonic means providing various levels of accuracy.

NAN_INSTANCE: ClassVar[SemEvalScore] = SemEvalScore(strict=SemEvalHarmonicMeanScore(precision=nan, recall=nan, f_score=nan, correct=nan, incorrect=nan, partial=nan, missed=nan, spurious=nan, possible=nan, actual=nan), exact=SemEvalHarmonicMeanScore(precision=nan, recall=nan, f_score=nan, correct=nan, incorrect=nan, partial=nan, missed=nan, spurious=nan, possible=nan, actual=nan), partial=SemEvalHarmonicMeanScore(precision=nan, recall=nan, f_score=nan, correct=nan, incorrect=nan, partial=nan, missed=nan, spurious=nan, possible=nan, actual=nan), ent_type=SemEvalHarmonicMeanScore(precision=nan, recall=nan, f_score=nan, correct=nan, incorrect=nan, partial=nan, missed=nan, spurious=nan, possible=nan, actual=nan))¶

__init__(strict, exact, partial, ent_type)¶

asrow(meth)[source]¶

Return type:: Dict[str, float]

ent_type: SemEvalHarmonicMeanScore¶: Some overlap between the system tagged entity and the gold annotation is required.

exact: SemEvalHarmonicMeanScore¶: Exact boundary match over the surface string, regardless of the type.

partial: SemEvalHarmonicMeanScore¶: Partial boundary match over the surface string, regardless of the type.

strict: SemEvalHarmonicMeanScore¶: Exact boundary surface string match and entity type.

class zensols.nlp.nerscore.SemEvalScoreMethod(reverse_sents=False, labels=None)[source]¶

Bases: ScoreMethod

A Semeval-2013 Task 9.1 score (see module zensols.nlp.nerscore docs). This score has four harmonic means providing various levels of accuracy. Sentence pairs are ordered as (<gold>, <prediction>).

__init__(reverse_sents=False, labels=None)¶

labels: Optional[Set[str]] = None¶: The NER labels on which to evaluate. If not provided, text is evaluated under a (stubbed tag) label.

zensols.nlp.norm module¶

Normalize text and map Spacy documents.

class zensols.nlp.norm.FilterRegularExpressionMapper(regex='[ ]+', invert=False)[source]¶

Bases: TokenMapper

Filter tokens based on normalized form regular expression.

__init__(regex='[ ]+', invert=False)¶

invert: bool = False¶: If True then remove rather than keep everything that matches..

map_tokens(token_tups)[source]¶

Transform token tuples.

Return type:: Iterable[Tuple[Token, str]]

regex: Union[Pattern, str] = '[ ]+'¶: The regular expression to use for splitting tokens.

class zensols.nlp.norm.FilterTokenMapper(remove_stop=False, remove_space=False, remove_pronouns=False, remove_punctuation=False, remove_determiners=False)[source]¶

Bases: TokenMapper

Filter tokens based on token (Spacy) attributes.

Configuration example:

[filter_token_mapper]
class_name = zensols.nlp.FilterTokenMapper
remove_stop = True
remove_punctuation = True

__init__(remove_stop=False, remove_space=False, remove_pronouns=False, remove_punctuation=False, remove_determiners=False)¶

map_tokens(token_tups)[source]¶

Transform token tuples.

Return type:: Iterable[Tuple[Token, str]]

remove_determiners: bool = False¶

remove_pronouns: bool = False¶

remove_punctuation: bool = False¶

remove_space: bool = False¶

remove_stop: bool = False¶

class zensols.nlp.norm.JoinTokenMapper(regex='[ ]', separator=None)[source]¶

Bases: object

Join tokens based on a regular expression. It does this by creating spans in the spaCy component (first in the tuple) and using the span text as the normalized token.

__init__(regex='[ ]', separator=None)¶

map_tokens(token_tups)[source]¶

Return type:: Iterable[Tuple[Token, str]]

regex: Union[Pattern, str] = '[ ]'¶: The regular expression to use for joining tokens

separator: str = None¶: The string used to separate normalized tokens in matches. If None, use the token text.

class zensols.nlp.norm.LambdaTokenMapper(add_lambda=None, map_lambda=None)[source]¶

Bases: TokenMapper

Use a lambda expression to map a token tuple.

This is handy for specialized behavior that can be added directly to a configuration file.

Configuration example:

[lc_lambda_token_mapper]
class_name = zensols.nlp.LambdaTokenMapper
map_lambda = lambda x: (x[0], f'<{x[1].lower()}>')

__init__(add_lambda=None, map_lambda=None)¶

add_lambda: str = None¶

map_lambda: str = None¶

map_tokens(token_tups)[source]¶

Transform token tuples.

Return type:: Iterable[Tuple[Token, str]]

class zensols.nlp.norm.LemmatizeTokenMapper(lemmatize=True, remove_first_stop=False)[source]¶

Bases: TokenMapper

Lemmatize tokens and optional remove entity stop words.

Important: This completely ignores the normalized input token string and essentially just replaces it with the lemma found in the token instance.

Configuration example:

[lemma_token_mapper]
class_name = zensols.nlp.LemmatizeTokenMapper

Parameters:

lemmatize (bool) – lemmatize if True; this is an option to allow (only) the removal of the first top word in named entities
remove_first_stop (bool) – whether to remove the first top word in named entities when embed_entities is True

__init__(lemmatize=True, remove_first_stop=False)¶

lemmatize: bool = True¶

map_tokens(token_tups)[source]¶

Transform token tuples.

Return type:: Iterable[Tuple[Token, str]]

remove_first_stop: bool = False¶

class zensols.nlp.norm.MapTokenNormalizer(embed_entities=True, config_factory=None, mapper_class_list=<factory>)[source]¶

Bases: TokenNormalizer

A normalizer that applies a sequence of TokenMapper instances to transform the normalized token text. The members of the mapper_class_list are sections of the application configuration.

Configuration example:

[map_filter_token_normalizer]
class_name = zensols.nlp.MapTokenNormalizer
mapper_class_list = list: filter_token_mapper

__init__(embed_entities=True, config_factory=None, mapper_class_list=<factory>)¶

config_factory: ConfigFactory = None¶: The factory that created this instance and used to create the mappers.

mapper_class_list: List[str]¶: The configuration section names to create from the application configuration factory, which is added to mappers. This field settings is deprecated; use mappers instead.

class zensols.nlp.norm.SplitEntityTokenMapper(token_unit_type=False, copy_attributes=('label', 'label_'))[source]¶

Bases: TokenMapper

Splits embedded entities (or any Span) in to separate tokens. This is useful for splitting up entities as tokens after being grouped with TokenNormalizer.embed_entities. Note, embed_entities must be True to create the entities as they come from spaCy as spans. This then can be used to create SpacyFeatureToken with spans that have the entity.

__init__(token_unit_type=False, copy_attributes=('label', 'label_'))¶

copy_attributes: Tuple[str, ...] = ('label', 'label_')¶: Attributes to copy from the span to the split token.

map_tokens(token_tups)[source]¶

Transform token tuples.

Return type:: Iterable[Tuple[Token, str]]

token_unit_type: bool = False¶: Whether to generate tokens for each split span or a one token span.

class zensols.nlp.norm.SplitTokenMapper(regex='[ ]')[source]¶

Bases: TokenMapper

Splits the normalized text on a per token basis with a regular expression.

Configuration example:

[split_token_mapper]
class_name = zensols.nlp.SplitTokenMapper
regex = r'[ ]'

__init__(regex='[ ]')¶

map_tokens(token_tups)[source]¶

Transform token tuples.

Return type:: Iterable[Tuple[Token, str]]

regex: Union[Pattern, str] = '[ ]'¶: The regular expression to use for splitting tokens.

class zensols.nlp.norm.SubstituteTokenMapper(regex='', replace_char='')[source]¶

Bases: TokenMapper

Replace a regular expression in normalized token text.

Configuration example:

[subs_token_mapper]
class_name = zensols.nlp.SubstituteTokenMapper
regex = r'[ \t]'
replace_char = _

__init__(regex='', replace_char='')¶

map_tokens(token_tups)[source]¶

Transform token tuples.

Return type:: Iterable[Tuple[Token, str]]

regex: str = ''¶: The regular expression to use for substitution.

replace_char: str = ''¶: The character that is used for replacement.

class zensols.nlp.norm.TokenMapper[source]¶

Bases: ABC

Abstract class used to transform token tuples generated from TokenNormalizer.normalize().

__init__()¶

abstract map_tokens(token_tups)[source]¶

Transform token tuples.

Return type:: Iterable[Tuple[Token, str]]

class zensols.nlp.norm.TokenNormalizer(embed_entities=True)[source]¶

Bases: object

Base token extractor returns tuples of tokens and their normalized version.

Configuration example:

[default_token_normalizer]
class_name = zensols.nlp.TokenNormalizer
embed_entities = False

__init__(embed_entities=True)¶

embed_entities: bool = True¶: Whether or not to replace tokens with their respective named entity version.

normalize(doc)[source]¶

Normalize Spacey document doc in to (token, normal text) tuples.

Return type:: Iterable[Tuple[Token, str]]

zensols.nlp.parser module¶

Parse documents and generate features in an organized taxonomy.

class zensols.nlp.parser.CachingFeatureDocumentParser(name, delegate, token_decorators=(), sentence_decorators=(), document_decorators=(), token_feature_ids=<factory>, silencer=None, stash=None, hasher=<factory>)[source]¶

Bases: DecoratedFeatureDocumentParser

A document parser that persists previous parses using the hash of the text as a key. Caching is optional given the value of stash, which is useful in cases this class is extended using other use cases other than just caching.

__init__(name, delegate, token_decorators=(), sentence_decorators=(), document_decorators=(), token_feature_ids=<factory>, silencer=None, stash=None, hasher=<factory>)¶

clear()[source]¶: Clear the caching stash.

hasher: Hasher¶: Used to hash the natural langauge text in to string keys.

parse(text, *args, **kwargs)[source]¶

Parse text or a text as a list of sentences.

Parameters:

text (str) – either a string or a list of strings; if the former a document with one sentence will be created, otherwise a document is returned with a sentence for each string in the list
args – the arguments used to create the FeatureDocument instance
kwargs – the key word arguments used to create the FeatureDocument instance

Return type:

stash: Stash = None¶: The stash that persists the feature document instances. If this is not provided, no caching will happen.

class zensols.nlp.parser.Component(name, pipe_name=None, pipe_config=None, pipe_add_kwargs=<factory>, modules=(), initializers=())[source]¶

Bases: object

A pipeline component to be added to the spaCy model.

__init__(name, pipe_name=None, pipe_config=None, pipe_add_kwargs=<factory>, modules=(), initializers=())¶

init(model)[source]¶

Initialize the component and add it to the NLP pipe line. This base class implementation loads the module, then calls Language.add_pipe().

Parameters:: model (Language) – the model to add the spaCy model (nlp in their parlance)

initializers: Tuple[ComponentInitializer, ...] = ()¶: Instances to initialize upon this object’s initialization.

modules: Sequence[str] = ()¶: The module to import before adding component pipelines. This will register components mentioned in components when the resepctive module is loaded.

name: str¶: The section name.

pipe_add_kwargs: Dict[str, Any]¶: Arguments to add along with the call to add_pipe().

pipe_config: Dict[str, str] = None¶: The configuration to add with the config kwarg in the Language.add_pipe() call to the spaCy model.

pipe_name: str = None¶: The pipeline component name to add to the pipeline. If None, use name.

class zensols.nlp.parser.ComponentInitializer[source]¶

Bases: ABC

Called by Component to do post spaCy initialization.

abstract init_nlp_model(model, component)[source]¶: Do any post spaCy initialization on the the referred framework.

class zensols.nlp.parser.DecoratedFeatureDocumentParser(name, delegate, token_decorators=(), sentence_decorators=(), document_decorators=(), token_feature_ids=<factory>, silencer=None)[source]¶

Bases: FeatureDocumentParser

This class adapts the FeatureDocumentParser adaptors to the general case using a GoF decorator pattern. This is useful for any post processing needed on existing configured document parsers.

All decorators are processed in the following order:

Token
Sentence
Document

Token features are stored in the delegate for those that have them. Otherwise, they are stored in instances of this class.

__init__(name, delegate, token_decorators=(), sentence_decorators=(), document_decorators=(), token_feature_ids=<factory>, silencer=None)¶

decorate(doc)[source]¶

delegate: FeatureDocumentParser¶: Used to create the feature documents.

document_decorators: Sequence[FeatureDocumentDecorator] = ()¶: A list of decorators that can add, remove or modify features on a document.

name: str¶: The name of the parser, which is taken from the section name when created with a ConfigFactory and used for debugging.

parse(text, *args, **kwargs)[source]¶

Parse text or a text as a list of sentences.

Parameters:

text (str) – either a string or a list of strings; if the former a document with one sentence will be created, otherwise a document is returned with a sentence for each string in the list
args – the arguments used to create the FeatureDocument instance
kwargs – the key word arguments used to create the FeatureDocument instance

Return type:

sentence_decorators: Sequence[FeatureSentenceDecorator] = ()¶: A list of decorators that can add, remove or modify features on a sentence.

silencer: WarningSilencer = None¶: Optinally suppress warnings the parser generates.

token_decorators: Sequence[FeatureTokenDecorator] = ()¶: A list of decorators that can add, remove or modify features on a token.

token_feature_ids: Set[str]¶

The features to keep from spaCy tokens. See class documentation.

See:: TOKEN_FEATURE_IDS

class zensols.nlp.parser.FeatureDocumentDecorator[source]¶

Implementations can add, remove or modify features on a document.

abstract decorate(doc)[source]¶

class zensols.nlp.parser.FeatureDocumentParser[source]¶

Bases: PersistableContainer, Dictable

This class parses text in to instances of FeatureDocument instances using parse().

TOKEN_FEATURE_IDS: ClassVar[Set[str]] = frozenset({'children', 'dep', 'dep_', 'ent', 'ent_', 'ent_iob', 'ent_iob_', 'i', 'i_sent', 'idx', 'is_contraction', 'is_ent', 'is_pronoun', 'is_punctuation', 'is_space', 'is_stop', 'is_superlative', 'is_wh', 'lemma_', 'lexspan', 'norm', 'norm_len', 'pos_', 'sent_i', 'shape', 'shape_', 'tag', 'tag_'})¶: The default value for token_feature_ids.

__init__()¶

static default_instance()[source]¶

Create the parser as configured in the resource library of the package.

Return type:: FeatureDocumentParser

abstract parse(text, *args, **kwargs)[source]¶

Parse text or a text as a list of sentences.

Parameters:

text (str) – either a string or a list of strings; if the former a document with one sentence will be created, otherwise a document is returned with a sentence for each string in the list
args – the arguments used to create the FeatureDocument instance
kwargs – the key word arguments used to create the FeatureDocument instance

Return type:

class zensols.nlp.parser.FeatureSentenceDecorator[source]¶

Implementations can add, remove or modify features on a sentence.

abstract decorate(sent)[source]¶

class zensols.nlp.parser.FeatureSentenceFactory(token_decorators=())[source]¶

Bases: object

Create a FeatureSentence out of single tokens or split on whitespace. This is a utility class to create data structures when only single tokens are the source data.

For example, if you only have tokens that need to be scored with Unigram Rouge-1, use this class to create sentences, which is a subclass of TokenContainer.

__init__(token_decorators=())¶

create(tokens)[source]¶

Create a sentence from tokens.

Parameters:: tokens (Union[str, Iterable[str]]) – if a string, then split on white space
Return type:: FeatureSentence

token_decorators: Sequence[FeatureTokenDecorator] = ()¶: A list of decorators that can add, remove or modify features on a token.

class zensols.nlp.parser.FeatureTokenContainerDecorator[source]¶

Bases: ABC

Implementations can add, remove or modify features on a token container.

abstract decorate(container)[source]¶

class zensols.nlp.parser.FeatureTokenDecorator[source]¶

Bases: ABC

Implementations can add, remove or modify features on a token.

abstract decorate(token)[source]¶

class zensols.nlp.parser.WhiteSpaceTokenizerFeatureDocumentParser(sent_class=<class 'zensols.nlp.container.FeatureSentence'>, doc_class=<class 'zensols.nlp.container.FeatureDocument'>)[source]¶

Bases: FeatureDocumentParser

This class parses text in to instances of FeatureDocument instances tokenizing only by whitespace. This parser does no sentence chunking so documents have one and only one sentence for each parse.

__init__(sent_class=<class 'zensols.nlp.container.FeatureSentence'>, doc_class=<class 'zensols.nlp.container.FeatureDocument'>)¶

doc_class¶

The type of document instances to create.

alias of FeatureDocument

parse(text, *args, **kwargs)[source]¶

Parse text or a text as a list of sentences.

Parameters:

text (str) – either a string or a list of strings; if the former a document with one sentence will be created, otherwise a document is returned with a sentence for each string in the list
args – the arguments used to create the FeatureDocument instance
kwargs – the key word arguments used to create the FeatureDocument instance

Return type:

sent_class¶

The type of sentence instances to create.

alias of FeatureSentence

zensols.nlp.score module¶

Produces matching scores.

class zensols.nlp.score.BleuScoreMethod(reverse_sents=False, smoothing_function=None, weights=(0.25, 0.25, 0.25, 0.25), silence_warnings=False)[source]¶

Bases: ScoreMethod

The BLEU scoring method using the nltk package. The first sentences are the references and the second are the hypothesis.

__init__(reverse_sents=False, smoothing_function=None, weights=(0.25, 0.25, 0.25, 0.25), silence_warnings=False)¶

silence_warnings: bool = False¶: Silence the BLEU warning of n-grams not matching The hypothesis contains 0 counts of 3-gram overlaps...

smoothing_function: SmoothingFunction = None¶

This is an implementation of the smoothing techniques for segment-level BLEU scores.

Citation:

Chen and Cherry (2014) A Systematic Comparison of Smoothing Techniques for Sentence-Level BLEU. In WMT14.

weights: Tuple[float, ...] = (0.25, 0.25, 0.25, 0.25)¶

a tuple of float weights for unigrams, bigrams, trigrams and so on can be given: weights = (0.1, 0.3, 0.5, 0.1).

Type:: Weights for each n-gram. For example

class zensols.nlp.score.ErrorScore(method, exception, replace_score=None)[source]¶

Bases: Score

A replacement instance when scoring fails from a raised exception.

__init__(method, exception, replace_score=None)¶

asrow(meth)[source]¶

Return type:: Dict[str, float]

exception: Exception¶: The exception that was raised.

method: str¶: The method of the ScoreMethod that raised the exception.

replace_score: Score = None¶: The score to use in place of this score. Otherwise asrow() return a single numpy.nan like FloatScore.

class zensols.nlp.score.ExactMatchScoreMethod(reverse_sents=False, equality_measure='norm')[source]¶

Bases: ScoreMethod

A scoring method that return 1 for exact matches and 0 otherwise.

__init__(reverse_sents=False, equality_measure='norm')¶

equality_measure: str = 'norm'¶

The method by which to compare, which is one of:

norm: compare with TokenContainer.norm()
text: compare with TokenContainer.text
equal: compare using a Python object __eq__ equal compare,
which also compares the token values

class zensols.nlp.score.FloatScore(value)[source]¶

Bases: Score

Float container. This is needed to create the flat result container structure. Object creation becomes less import since most clients will use ScoreSet.asnumpy().

NAN_INSTANCE: ClassVar[FloatScore] = FloatScore(value=nan)¶: Used to add to ErrorScore for harmonic means replacements.

__init__(value)¶

asrow(meth)[source]¶

Return type:: Dict[str, float]

value: float¶: The value of score.

class zensols.nlp.score.HarmonicMeanScore(precision, recall, f_score)[source]¶

Bases: Score

A score having a precision, recall and the harmonic mean of the two, F-score.’

NAN_INSTANCE: ClassVar[HarmonicMeanScore] = HarmonicMeanScore(precision=nan, recall=nan, f_score=nan)¶: Used to add to ErrorScore for harmonic means replacements.

__init__(precision, recall, f_score)¶

f_score: float¶

precision: float¶

recall: float¶

class zensols.nlp.score.LevenshteinDistanceScoreMethod(reverse_sents=False, form='canon', normalize=True)[source]¶

Bases: ScoreMethod

A scoring method that computes the Levenshtein distance.

__init__(reverse_sents=False, form='canon', normalize=True)¶

form: str = 'canon'¶

The form of the of the text used for the evaluation, which is one of:

text: the original text with TokenContainer.text
norm: the normalized text using TokenContainer.norm()
canon: TokenContainer.canonical to normalize out whitespace for better comparisons

normalize: bool = True¶: Whether to normalize the return value as the distince / the max length of both sentences.

class zensols.nlp.score.RougeScoreMethod(reverse_sents=False, feature_tokenizer=True)[source]¶

Bases: ScoreMethod

The ROUGE scoring method using the rouge_score package.

__init__(reverse_sents=False, feature_tokenizer=True)¶

feature_tokenizer: bool = True¶: Whether to use the TokenContainer tokenization, otherwise use the rouge_score package.

class zensols.nlp.score.Score[source]¶

Bases: Dictable

Individual scores returned from ScoreMethod.

__init__()¶

asrow(meth)[source]¶

Return type:: Dict[str, float]

class zensols.nlp.score.ScoreContext(pairs, methods=None, norm=True, correlation_ids=None)[source]¶

Bases: Dictable

Input needed to create score(s) using Scorer.

__init__(pairs, methods=None, norm=True, correlation_ids=None)¶

correlation_ids: Tuple[Union[int, str]] = None¶: The IDs to correlate with each sentence pair, or None to skip correlating them. The length of this tuple must be that of pairs.

methods: Set[str] = None¶: A set of strings, each indicating the ScoreMethod used to score pairs.

norm: bool = True¶: Whether to use the normalized tokens, otherwise use the original text.

pairs: Tuple[Tuple[TokenContainer, TokenContainer]]¶

Sentence, span or document pairs to score (order matters for some scoring methods such as rouge). Depending on the scoring method the ordering of the sentence pairs should be:

(<summary>, <source>)

(<gold>, <prediction>)

(<references>, <candidates>)

See ScoreMethod implementations for more information about pair ordering.

validate()[source]¶

class zensols.nlp.score.ScoreMethod(reverse_sents=False)[source]¶

Bases: object

An abstract base class for scoring methods (bleu, rouge, etc).

__init__(reverse_sents=False)¶

classmethod is_available()[source]¶

Whether or not this method is available on this system.

Return type:: bool

classmethod missing_modules()[source]¶

Return a list of missing modules neede by this score method.

Return type:: Tuple[str, ...]

reverse_sents: bool = False¶: Whether to reverse the order of the sentences.

score(meth, context)[source]¶

Score the sentences in context using method identifer meth.

Parameters:

meth (str) – the identifer such as bleu
context (ScoreContext) – the context containing the data to score

Return type:

Iterable[Score]

Returns:

the results, which are usually float or Score

class zensols.nlp.score.ScoreResult(scores, correlation_id=None)[source]¶

Bases: Dictable

A result of scores created by a ScoreMethod.

__init__(scores, correlation_id=None)¶

correlation_id: Optional[str] = None¶: An ID for correlating back to the TokenContainer.

scores: Dict[str, Tuple[Score, ...]]¶: The scores by method name.

write(depth=0, writer=<_io.TextIOWrapper name='<stdout>' mode='w' encoding='utf-8'>)[source]¶

If the attribute _DICTABLE_WRITE_EXCLUDES is set, those attributes are removed from what is written in the write() method.

Note that this attribute will need to be set in all descendants in the instance hierarchy since writing the object instance graph is done recursively.

Parameters:

depth (int) – the starting indentation depth
writer (TextIOBase) – the writer to dump the content of this writable

class zensols.nlp.score.ScoreSet(results, correlation_id_col='id')[source]¶

Bases: Dictable

All scores returned from :class:`.Scorer’.

__init__(results, correlation_id_col='id')¶

as_dataframe(add_correlation=True)[source]¶

This gets data from as_numpy() and returns it as a Pandas dataframe.

Parameters:: add_correlation (bool) – whether to add the correlation ID (if there is one), using correlation_id_col
Return type:: pandas.DataFrame
Returns:: an instance of pandas.DataFrame of the results

as_numpy(add_correlation=True)[source]¶

Return the Numpy array with column descriptors of the results. Spacy depends on Numpy, so this package will always be availale.

Parameters:: add_correlation (bool) – whether to add the correlation ID (if there is one), using correlation_id_col
Return type:: Tuple[List[str], ndarray]

correlation_id_col: str = 'id'¶: The column name for the ScoreResult.correlation_id added to Numpy arrays and Pandas dataframes. If None, then the correlation IDS are used as the index.

property has_correlation_id: bool¶: Whether the results have correlation IDs.

results: Tuple[ScoreResult, ...]¶: A tuple with each element having the results of the respective sentence pair in ScoreContext.sents. Each elemnt is a dictionary with the method are the keys with results as the values as output of the ScoreMethod. This is created in Scorer.

write(depth=0, writer=<_io.TextIOWrapper name='<stdout>' mode='w' encoding='utf-8'>)[source]¶

If the attribute _DICTABLE_WRITE_EXCLUDES is set, those attributes are removed from what is written in the write() method.

Note that this attribute will need to be set in all descendants in the instance hierarchy since writing the object instance graph is done recursively.

Parameters:

depth (int) – the starting indentation depth
writer (TextIOBase) – the writer to dump the content of this writable

class zensols.nlp.score.Scorer(package_manager=None, methods=None, default_methods=None)[source]¶

Bases: object

A class that scores sentences using a set of registered methods (methods).

__init__(package_manager=None, methods=None, default_methods=None)¶

default_methods: Set[str] = None¶: Methods (keys from methods) to use when none are provided in the ScoreContext.meth in the call to score().

methods: Dict[str, ScoreMethod] = None¶: The registered scoring methods availale, which are accessed from ScoreContext.meth.

package_manager: PackageManager = None¶: The package manager used to install scoring methods. If this is None, then packages are not installed and scoring methods are not made available.

score(context)[source]¶

Score the sentences in context.

Parameters:: context (ScoreContext) – the context containing the data to score
Return type:: ScoreSet
Returns:: the results for each method indicated in context

exception zensols.nlp.score.ScorerError[source]¶

Bases: NLPError

Raised for any scoring errors (this module).

__annotations__ = {}¶

__module__ = 'zensols.nlp.score'¶

zensols.nlp.serial module¶

Serializes FeatureToken and TokenContainer instances using the Dictable interface.

class zensols.nlp.serial.Include(value, names=<not given>, *values, module=None, qualname=None, type=None, start=1, boundary=None)[source]¶

Bases: Enum

Indicates what to include at each level.

normal = 2¶: The normalized form of the text.

original = 1¶: The original text.

sentences = 4¶: The sentences of the FeatureDocument.

tokens = 3¶: The tokens of the TokenContainer.

class zensols.nlp.serial.Serialized(container, includes, feature_ids)[source]¶

Bases: Dictable

A base strategy class that can serialize TokenContainer instances.

__init__(container, includes, feature_ids)¶

container: TokenContainer¶: The container to be serialized.

feature_ids: Tuple[str, ...]¶: The feature IDs used when serializing tokens.

includes: Set[Include]¶: The things to be included at the level of the subclass serializer.

class zensols.nlp.serial.SerializedFeatureDocument(container, includes, feature_ids, sentence_includes)[source]¶

Bases: Serialized

A serializer for feature documents. The container has to be an instance of a FeatureDocument.

__init__(container, includes, feature_ids, sentence_includes)¶

sentence_includes: Set[Include]¶: The list of things to include in the sentences of the document.

class zensols.nlp.serial.SerializedTokenContainer(container, includes, feature_ids)[source]¶

Bases: Serialized

Serializes instance of TokenContainer. This is used to serialize spans and sentences.

__init__(container, includes, feature_ids)¶

class zensols.nlp.serial.SerializedTokenContainerFactory(sentence_includes, document_includes, feature_ids=None)[source]¶

Bases: Dictable

Creates instances of Serialized from instances of TokenContainer. These can then be used as Dictable instances, specifically with the asdict and asjson methods.

__init__(sentence_includes, document_includes, feature_ids=None)¶

create(container)[source]¶

Create a serializer from container (see class docs).

Parameters:: container (TokenContainer) – he container to be serialized
Return type:: Serialized
Returns:: an object that can be serialized using asdict and asjson method.

document_includes: Set[Union[Include, str]]¶: The things to be included in documents.

feature_ids: Tuple[str, ...] = None¶: The feature IDs used when serializing tokens.

sentence_includes: Set[Union[Include, str]]¶: The things to be included in sentences.

zensols.nlp.spannorm module¶

Normalize spans (of tokens) into strings by reconstructing based on language rules from the normalized form of the tokens. This is needed after any token manipulation from TokenNormalizer or other changes to FeatureToken.norm.

For now, only English is supported, but the module is provided for other languages and future enhancements of normalization configuration.

class zensols.nlp.spannorm.EnglishSpanNormalizer(post_space_skip=frozenset({'(', '-', '<', '[', '`', '{', '‘', '“'}), pre_space_skip=frozenset({"'d", "'ll", "'m", "'re", "'s", "'ve", '-', "n't"}), keep_space_skip=frozenset({'_'}), canonical_delimiter='|')[source]¶

Bases: SpanNormalizer

An implementation of a span normalizer for the Enlish language.

__init__(post_space_skip=frozenset({'(', '-', '<', '[', '`', '{', '‘', '“'}), pre_space_skip=frozenset({"'d", "'ll", "'m", "'re", "'s", "'ve", '-', "n't"}), keep_space_skip=frozenset({'_'}), canonical_delimiter='|')¶

canonical_delimiter: str = '|'¶: The token delimiter used in canonical.

get_canonical(tokens)[source]¶

A canonical representation of the container, which are non-space tokens separated by CANONICAL_DELIMITER.

Return type:: str

get_norm(tokens, use_norm)[source]¶

Create a string that follows the langauge spacing rules.

Parameters:

tokens (Iterable[FeatureToken]) – the tokens to normalize
use_norm (bool) – whether to use the token normalized or orthographic text

Return type:

str

keep_space_skip: Set[str] = frozenset({'_'})¶: Characters that retain space on both sides.

post_space_skip: Set[str] = frozenset({'(', '-', '<', '[', '`', '{', '‘', '“'})¶: Characters after which no space is added for span normalization.

pre_space_skip: Set[str] = frozenset({"'d", "'ll", "'m", "'re", "'s", "'ve", '-', "n't"})¶: Characters before whcih no space is added for span normalization.

class zensols.nlp.spannorm.SpanNormalizer[source]¶

Bases: object

Subclasses normalize feature tokens on a per spacy.Language. All subclasses must be re-entrant.

abstract get_canonical(tokens)[source]¶

A canonical representation of the container, which are non-space tokens separated by CANONICAL_DELIMITER.

Return type:: str

abstract get_norm(tokens, use_norm)[source]¶

Create a string that follows the langauge spacing rules.

Parameters:

tokens (Iterable[FeatureToken]) – the tokens to normalize
use_norm (bool) – whether to use the token normalized or orthographic text

Return type:

str

zensols.nlp.sparser module¶

The spaCy FeatureDocumentParser implementation.

class zensols.nlp.sparser.SpacyFeatureDocumentParser(config_factory, name, lang='en', model_name=None, token_feature_ids=<factory>, components=(), token_decorators=(), sentence_decorators=(), document_decorators=(), disable_component_names=None, token_normalizer=None, special_case_tokens=<factory>, doc_class=<class 'zensols.nlp.container.FeatureDocument'>, sent_class=<class 'zensols.nlp.container.FeatureSentence'>, token_class=<class 'zensols.nlp.tok.SpacyFeatureToken'>, remove_empty_sentences=None, reload_components=False, auto_install_model=False, package_manager=<factory>)[source]¶

Bases: FeatureDocumentParser

This langauge resource parses text in to Spacy documents. Loaded spaCy models have attribute doc_parser set enable creation of factory instances from registered pipe components (i.e. specified by Component).

Configuration example:

[doc_parser]
class_name = zensols.nlp.sparser.SpacyFeatureDocumentParser
lang = en
model_name = ${lang}_core_web_sm

Decorators are processed in the same way DecoratedFeatureDocumentParser.

__init__(config_factory, name, lang='en', model_name=None, token_feature_ids=<factory>, components=(), token_decorators=(), sentence_decorators=(), document_decorators=(), disable_component_names=None, token_normalizer=None, special_case_tokens=<factory>, doc_class=<class 'zensols.nlp.container.FeatureDocument'>, sent_class=<class 'zensols.nlp.container.FeatureSentence'>, token_class=<class 'zensols.nlp.tok.SpacyFeatureToken'>, remove_empty_sentences=None, reload_components=False, auto_install_model=False, package_manager=<factory>)¶

auto_install_model: Union[bool, str, Iterable[str]] = False¶: Whether to install models not already available. Note that this uses the pip command to download model requirements, which might have an adverse effect of replacing currently installed Python packages. This value is interpreted as a pip requirement(s) to install if a string or iterable of strings.

classmethod clear_models()[source]¶: Clears all cached models.

components: Sequence[Component] = ()¶: Additional Spacy components to add to the pipeline.

config_factory: ConfigFactory¶: A configuration parser optionally used by pipeline Component instances.

disable_component_names: Sequence[str] = None¶: Components to disable in the spaCy model when creating documents in parse().

doc_class¶

The type of document instances to create.

alias of FeatureDocument

document_decorators: Sequence[FeatureDocumentDecorator] = ()¶: A list of decorators that can add, remove or modify features on a document.

from_spacy_doc(doc, *args, text=None, **kwargs)[source]¶

Create s FeatureDocument from a spaCy doc.

Parameters:

doc (Doc) – the spaCy generated document to transform in to a feature document
text (str) – either a string or a list of strings; if the former a document with one sentence will be created, otherwise a document is returned with a sentence for each string in the list
args – the arguments used to create the FeatureDocument instance
kwargs – the key word arguments used to create the FeatureDocument instance

Return type:

get_dictable(doc)[source]¶

Return a dictionary object graph and pretty prints spaCy docs.

Return type:: Dictable

lang: str = 'en'¶: The natural language the identify the model.

property model: Language¶: The spaCy model. On first access, this creates a new instance using model_name.

model_name: str = None¶: The Spacy model name (defualts to en_core_web_sm); this is ignored if model is not None.

name: str¶: The name of the parser, which is taken from the section name when created with a ConfigFactory and used for debugging.

package_manager: PackageManager¶: The package manager used to install auto_install_model.

parse(text, *args, **kwargs)[source]¶

Parse text or a text as a list of sentences.

Parameters:

text (str) – either a string or a list of strings; if the former a document with one sentence will be created, otherwise a document is returned with a sentence for each string in the list
args – the arguments used to create the FeatureDocument instance
kwargs – the key word arguments used to create the FeatureDocument instance

Return type: