zensols.nlp package

Submodules

zensols.nlp.chunker module

Classes that segment text from FeatureDocument instances, but retain the original structure by preserving sentence and token indices.

class zensols.nlp.chunker.Chunker(doc, pattern, sub_doc=None, char_offset=None)[source]

Bases: object

Splits TokenContainer instances using a regular expression pattern. Matched containers (the container implementation depends on the subclass) are given if used as an iterable. The document of all parsed containers is given if used as a callable.

__init__(doc, pattern, sub_doc=None, char_offset=None)
char_offset: int = None

The 0-index absolute character offset where sub_doc starts. However, if the value is -1, then the offset used is the beginning character offset of the first token in the sub_doc.

doc: FeatureDocument

The document that contains the entire text (i.e. Note).

pattern: Pattern

The chunk regular expression. There should be a default for each subclass.

sub_doc: FeatureDocument = None

A document created from a lexical span of doc, which defaults to the global document. Providing this and char_offset allows use of a document without having to use TokenContainer.reindex().

abstract to_document(conts)[source]
Return type:

FeatureDocument

class zensols.nlp.chunker.ListItemChunker(doc, pattern=re.compile('^((?:[0-9-+]+|[a-zA-Z]+:)[^\\n]+)$', re.MULTILINE), sub_doc=None, char_offset=None)[source]

Bases: Chunker

A Chunker that splits list items and enumerated lists into separate sentences. Matched sentences are given if used as an iterable. This is useful when spaCy chunks lists into sentences incorrectly. Lists are found using a regular expression that matches lines that start with a decimal, or list characters such as - and +.

DEFAULT_SPAN_PATTERN: ClassVar[Pattern] = re.compile('^((?:[0-9-+]+|[a-zA-Z]+:)[^\\n]+)$', re.MULTILINE)

The default list item regular expression, which uses an initial character item notation or an initial enumeration digit.
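
To see which lines the default expression selects, the following minimal sketch applies the same pattern with the standard re module only. The sample text is made up for illustration, and no zensols classes are involved.

```python
import re

# Same expression as DEFAULT_SPAN_PATTERN above, written as a raw string.
LIST_ITEM = re.compile(r'^((?:[0-9-+]+|[a-zA-Z]+:)[^\n]+)$', re.MULTILINE)

# Hypothetical sample text, not taken from the library.
text = "Groceries\n- apples\n+ pears\n1. bread\nnote: buy soon\nplain line"

# Lines starting with digits, '-', '+' or a 'word:' prefix are matched;
# the heading and the plain line are skipped.
items = LIST_ITEM.findall(text)
for item in items:
    print(item)
```

Only the four list-like lines are printed; "Groceries" and "plain line" do not match because they start with neither an enumeration character nor a word followed by a colon.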

__init__(doc, pattern=re.compile('^((?:[0-9-+]+|[a-zA-Z]+:)[^\\n]+)$', re.MULTILINE), sub_doc=None, char_offset=None)
pattern: Pattern = re.compile('^((?:[0-9-+]+|[a-zA-Z]+:)[^\\n]+)$', re.MULTILINE)

The list regular expression, which defaults to DEFAULT_SPAN_PATTERN.

to_document(conts)[source]
Return type:

FeatureDocument

class zensols.nlp.chunker.ParagraphChunker(doc, pattern=re.compile('(.+?)(?:(?=\\n{2})|\\Z)', re.MULTILINE | re.DOTALL), sub_doc=None, char_offset=None)[source]

Bases: Chunker

A Chunker that splits a document into paragraphs. Matched paragraphs are given if used as an iterable. For this reason, this class will probably be used as an iterable, since clients will usually want just the separated paragraphs as documents.

DEFAULT_SPAN_PATTERN: ClassVar[Pattern] = re.compile('(.+?)(?:(?=\\n{2})|\\Z)', re.MULTILINE|re.DOTALL)

The default paragraph regular expression, which uses a positive lookahead for two newlines to avoid matching on paragraph spacing.

__init__(doc, pattern=re.compile('(.+?)(?:(?=\\n{2})|\\Z)', re.MULTILINE | re.DOTALL), sub_doc=None, char_offset=None)
pattern: Pattern = re.compile('(.+?)(?:(?=\\n{2})|\\Z)', re.MULTILINE|re.DOTALL)

The paragraph regular expression, which defaults to DEFAULT_SPAN_PATTERN.

to_document(conts)[source]

It usually makes sense to use instances of this class as an iterable rather than this (see class docs).

Return type:

FeatureDocument
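
As a rough illustration of how the default paragraph expression behaves, the sketch below runs the same pattern over a made-up string using only the standard re module. It does not use ParagraphChunker itself, and the stripping step is an addition for this example.

```python
import re

# Same expression as the ParagraphChunker DEFAULT_SPAN_PATTERN, as a raw string.
PARAGRAPH = re.compile(r'(.+?)(?:(?=\n{2})|\Z)', re.MULTILINE | re.DOTALL)

# Hypothetical sample text with blank lines between paragraphs.
text = "First paragraph,\nsecond line.\n\nSecond paragraph.\n\nThird."

# Each match stops just before a blank-line separator (or at the end of the
# text).  Later matches begin with the separator newlines, so they are
# stripped here for display.
paragraphs = [m.group(1).strip() for m in PARAGRAPH.finditer(text)]
for paragraph in paragraphs:
    print(repr(paragraph))
```

Single newlines inside a paragraph are kept, since only two consecutive newlines end a match.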

zensols.nlp.combine module

A class that combines features.

class zensols.nlp.combine.CombinerFeatureDocumentParser(name, delegate, token_decorators=(), sentence_decorators=(), document_decorators=(), token_feature_ids=<factory>, silencer=None, source_parsers=None, validate_features=<factory>, yield_features=<factory>, yield_feature_defaults=None, overwrite_features=<factory>, overwrite_nones=False, map_features=<factory>)[source]

Bases: DecoratedFeatureDocumentParser

A class that combines features from two FeatureDocumentParser instances. Features parsed using each source_parser are optionally copied or overwritten on a token by token basis in the feature document parsed by this instance.

The target tokens are sometimes added to or clobbered from the source, but not the other way around.

__init__(name, delegate, token_decorators=(), sentence_decorators=(), document_decorators=(), token_feature_ids=<factory>, silencer=None, source_parsers=None, validate_features=<factory>, yield_features=<factory>, yield_feature_defaults=None, overwrite_features=<factory>, overwrite_nones=False, map_features=<factory>)
map_features: List[Tuple[str, str, Any]]

Like yield_features but the feature ID can be different from the source to the target. Each tuple has the form:

(<source feature ID>, <target feature ID>, <default for missing>)
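
The tuple form can be pictured with a small sketch that models tokens as plain dictionaries. The helper name apply_map_features and the feature IDs are hypothetical, and the rule that an already-set target value is kept is an assumption carried over from the yield_features description rather than confirmed behavior.

```python
from typing import Any, Dict, List, Tuple


def apply_map_features(source: Dict[str, Any], target: Dict[str, Any],
                       map_features: List[Tuple[str, str, Any]]) -> None:
    """Copy mapped features from ``source`` to ``target`` (both plain dicts)."""
    for source_id, target_id, default in map_features:
        # Assumption: like yield_features, a value already set is kept.
        if target.get(target_id) is not None:
            continue
        target[target_id] = source.get(source_id, default)


source_token = {'ent_': 'PERSON'}
target_token = {'norm': 'Ada', 'tag_': 'NNP'}
apply_map_features(source_token, target_token,
                   [('ent_', 'src_ent_', None),   # renamed on copy
                    ('cui_', 'concept_', '-'),    # missing: default is used
                    ('pos_', 'tag_', 'X')])       # target already set: kept
print(target_token)
```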

overwrite_features: List[str]

A list of features to be copied/overwritten in the order given in the list.

overwrite_nones: bool = False

Whether to write None for missing overwrite_features. This always writes the target feature; if you only want to write when the source is not set or missing, then use yield_features.

parse(text, *args, **kwargs)[source]

Parse text or a text as a list of sentences.

Parameters:
  • text (str) – either a string or a list of strings; if the former a document with one sentence will be created, otherwise a document is returned with a sentence for each string in the list

  • args – the arguments used to create the FeatureDocument instance

  • kwargs – the key word arguments used to create the FeatureDocument instance

Return type:

FeatureDocument

source_parsers: List[FeatureDocumentParser] = None

The language resource used to parse documents and create token attributes.

validate_features: Set[str]

A set of features to compare across all tokens when copying. If any of the given features don’t match, a token mismatch error is raised.

yield_feature_defaults: Any = None

A default value to use when no yielded value is found. If None, do not add the feature if missing.

yield_features: Set[str]

A list of features to be copied (in order) if the target token is not set.
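
The difference between overwriting and yielding can be sketched with plain dictionaries standing in for source and target tokens. This is only an interpretation of the attribute descriptions above; the function name combine_token and the exact precedence rules are assumptions, not the library's implementation.

```python
from typing import Any, Dict, Iterable


def combine_token(source: Dict[str, Any], target: Dict[str, Any],
                  yield_features: Iterable[str] = (),
                  overwrite_features: Iterable[str] = (),
                  overwrite_nones: bool = False,
                  yield_feature_defaults: Any = None) -> None:
    # Overwrite: the target is always written when the source has the feature.
    for fid in overwrite_features:
        if fid in source:
            target[fid] = source[fid]
        elif overwrite_nones:
            target[fid] = None
    # Yield: copy only when the target has no value for the feature.
    for fid in yield_features:
        if target.get(fid) is None:
            value = source.get(fid, yield_feature_defaults)
            if value is not None:
                target[fid] = value


source = {'pos_': 'NOUN', 'ent_': 'ORG'}
target = {'pos_': 'X', 'norm': 'acme'}
combine_token(source, target,
              yield_features=['ent_', 'lemma_'],
              overwrite_features=['pos_', 'tag_'],
              overwrite_nones=True)
print(target)
```

In this sketch pos_ is clobbered, tag_ is written as None because it is missing in the source, ent_ is yielded, and lemma_ is left out because there is neither a source value nor a default.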

class zensols.nlp.combine.MappingCombinerFeatureDocumentParser(name, delegate, token_decorators=(), sentence_decorators=(), document_decorators=(), token_feature_ids=<factory>, silencer=None, source_parsers=None, validate_features=frozenset({'idx'}), yield_features=<factory>, yield_feature_defaults=None, overwrite_features=<factory>, overwrite_nones=False, map_features=<factory>, merge_sentences=True)[source]

Bases: CombinerFeatureDocumentParser

Maps the source to respective tokens in the target document using spaCy artifacts.

__init__(name, delegate, token_decorators=(), sentence_decorators=(), document_decorators=(), token_feature_ids=<factory>, silencer=None, source_parsers=None, validate_features=frozenset({'idx'}), yield_features=<factory>, yield_feature_defaults=None, overwrite_features=<factory>, overwrite_nones=False, map_features=<factory>, merge_sentences=True)
merge_sentences: bool = True

If False ignore sentences and map everything at the token level. Otherwise, the same hierarchy mapping as the super class is used. This is useful when sentence demarcations are not aligned across source document parsers and this parser.

validate_features: Set[str] = frozenset({'idx'})

A set of features to compare across all tokens when copying. If any of the given features don’t match, a token mismatch error is raised. The default is the token’s index in the document, which should not change in most cases.

zensols.nlp.component module

Components useful for reuse.

class zensols.nlp.component.EntityRecognizer(nlp, name, import_file, patterns)[source]

Bases: object

Base class for regular expression and spaCy match pattern named entity recognizers. Both subclasses allow for an optional label for each respective pattern or regular expression. If the label is provided, then the match is made a named entity with a label. In any case, a span is created on the token, and in some cases, retokenized.

__init__(nlp, name, import_file, patterns)
import_file: Optional[str]

An optional JSON file used to append the pattern configuration.

name: str

The component name.

nlp: Language

The NLP model.

patterns: List

A list of the regular expressions to find.

class zensols.nlp.component.PatternEntityRecognizer(nlp, name, import_file, patterns)[source]

Bases: EntityRecognizer

Adds entities based on regular expressions.

See:

Rule matching

__init__(nlp, name, import_file, patterns)
patterns: List[Tuple[str, List[List[Dict[str, Any]]]]]

The patterns given to the Matcher.
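
The nested type of patterns can be read as a list of (label, token patterns) pairs, where each token pattern is a list of per-token attribute dictionaries in the style of spaCy's rule-based Matcher. The following is a sketch of such a value only; the CITY label and the specific entries are made up for illustration.

```python
from typing import Any, Dict, List, Tuple

# One (label, token-patterns) pair; each inner list describes one sequence of
# tokens, and each dict constrains a single token.
patterns: List[Tuple[str, List[List[Dict[str, Any]]]]] = [
    ('CITY', [
        [{'LOWER': 'new'}, {'LOWER': 'york'}],   # two-token match
        [{'LOWER': 'chicago'}],                  # one-token match
    ]),
]

for label, token_patterns in patterns:
    for token_pattern in token_patterns:
        print(label, len(token_pattern), 'token(s)')
```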

class zensols.nlp.component.RegexEntityRecognizer(nlp, name, import_file, patterns)[source]

Bases: EntityRecognizer

Merges regular expression matches as a Span. After matches are found, re-tokenization merges them in to one token per match.

__init__(nlp, name, import_file, patterns)
patterns: List[Tuple[str, List[Pattern]]]

A list of the regular expressions to find.
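
A value of this type pairs each label with a list of compiled expressions. The sketch below builds one and collects labeled character spans with the standard re module; the labels, expressions and the find_spans helper are hypothetical, and the re-tokenization done by the component is not shown.

```python
import re
from typing import List, Tuple

patterns: List[Tuple[str, List[re.Pattern]]] = [
    ('PHONE', [re.compile(r'\d{3}-\d{4}')]),
    ('ID', [re.compile(r'[A-Z]{2}\d{3}')]),
]


def find_spans(text: str,
               patterns: List[Tuple[str, List[re.Pattern]]]
               ) -> List[Tuple[int, int, str]]:
    """Return (start, end, label) character spans for every match."""
    spans = []
    for label, regexes in patterns:
        for regex in regexes:
            for match in regex.finditer(text):
                spans.append((match.start(), match.end(), label))
    return sorted(spans)


spans = find_spans('Call 555-1234 about AB123.', patterns)
print(spans)
```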

class zensols.nlp.component.RegexSplitter(nlp, name, import_file, patterns)[source]

Bases: EntityRecognizer

Splits on regular expressions.

__init__(nlp, name, import_file, patterns)
patterns: List[Tuple[str, List[Pattern]]]

A list of the regular expressions to find.

zensols.nlp.component.create_patner_component(nlp, name, patterns, path=None)[source]
zensols.nlp.component.create_regexner_component(nlp, name, patterns, path=None)[source]
zensols.nlp.component.create_regexsplit_component(nlp, name, patterns, path=None)[source]
zensols.nlp.component.create_remove_sent_boundaries_component(doc)[source]

Remove sentence boundaries from tokens.

Parameters:

doc (Doc) – the spaCy document to remove sentence boundaries

zensols.nlp.component.create_whitespace_tokenizer_component(nlp, name)[source]

zensols.nlp.container module

Domain objects that define features associated with text.

class zensols.nlp.container.FeatureDocument(sents, text=None, spacy_doc=None)[source]

Bases: TokenContainer

A container class of tokens that make a document. This class contains a one to many of sentences. However, it can be treated like any TokenContainer to fetch tokens. Instances of this class iterate over FeatureSentence instances.

Parameters:

sents (Tuple[FeatureSentence, ...]) – the sentences defined for this document

_combine_documents(docs, cls, concat_tokens, **kwargs)[source]

Override if there are any fields in your dataclass. In most cases, the only time this is called is by an embedding vectorizer to batch multiple sentences in to a single document, so the only features that matter are at the sentence level.

Parameters:
  • docs (Tuple[FeatureDocument, ...]) – the documents to combine in to one

  • cls (Type[FeatureDocument]) – the class of the instance to create

  • concat_tokens (bool) – if True, each sentence of the returned document is the concatenated tokens of each respective document; otherwise, simply concatenate sentences in to one document

  • kwargs – additional keyword arguments to pass to the new feature document’s initializer

Return type:

FeatureDocument

EMPTY_DOCUMENT: ClassVar[FeatureDocument] = <>

A zero length document.

__init__(sents, text=None, spacy_doc=None)
clear()[source]

Clear all cached state.

clone(cls=None, **kwargs)[source]
Parameters:

kwargs – if copy_spacy is True, the spaCy document is copied to the clone, in addition to the parameters passed to the new clone's initializer

Return type:

TokenContainer

classmethod combine_documents(docs, concat_tokens=True, **kwargs)[source]

Coerce a tuple of token containers (either documents or sentences) in to one synthesized document.

Parameters:
  • docs (Iterable[FeatureDocument]) – the documents to combine in to one

  • cls – the class of the instance to create

  • concat_tokens (bool) – if True, each sentence of the returned document is the concatenated tokens of each respective document; otherwise, simply concatenate sentences in to one document

  • kwargs – additional keyword arguments to pass to the new feature document’s initializer

Return type:

FeatureDocument
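
The effect of concat_tokens can be pictured with nested lists, where a document is a list of sentences and a sentence is a list of token strings. This is a simplified sketch of the two modes described above, not the library's code.

```python
from typing import List

Sentence = List[str]
Document = List[Sentence]


def combine_documents(docs: List[Document],
                      concat_tokens: bool = True) -> Document:
    if concat_tokens:
        # One sentence per source document, holding all of its tokens.
        return [[tok for sent in doc for tok in sent] for doc in docs]
    # Otherwise keep every sentence and just chain them together.
    return [sent for doc in docs for sent in doc]


doc_a = [['Hello', '.'], ['Bye', '.']]
doc_b = [['Ok']]
merged = combine_documents([doc_a, doc_b], concat_tokens=True)
chained = combine_documents([doc_a, doc_b], concat_tokens=False)
print(merged)
print(chained)
```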

combine_sentences(sents=None)[source]

Combine the sentences in this document in to a new document with a single sentence.

Parameters:

sents (Iterable[FeatureSentence]) – the sentences to combine in the new document or all if None

Return type:

FeatureDocument

from_sentences(sents, deep=False)[source]

Return a new cloned document using the given sentences.

Parameters:
  • sents (Iterable[FeatureSentence]) – the sentences to add to the new cloned document

  • deep (bool) – whether or not to clone the sentences

See:

clone()

Return type:

FeatureDocument

get_overlapping_document(span, inclusive=True)[source]

Get the portion of the document that overlaps span. Sentences completely enclosed in a span are copied. Otherwise, new sentences are created from those tokens that overlap the span.

Parameters:
  • span (LexicalSpan) – indicates the portion of the document to retain

  • inclusive (bool) – whether to include +1 on the end component when checking for overlap

Return type:

FeatureDocument

Returns:

a new document that contains the 0 index offset of span

get_overlapping_sentences(span, inclusive=True)[source]

Return sentences that overlap with span from this document.

Parameters:
  • span (LexicalSpan) – indicates the portion of the document to retain

  • inclusive (bool) – whether to include +1 on the end component when checking for overlap

Return type:

Iterable[FeatureSentence]

get_overlapping_span(span, inclusive=True)[source]

Return a feature span that includes the lexical scope of span.

Return type:

TokenContainer

property max_sentence_len: int

Return the length of tokens from the longest sentence in the document.

sent_iter(*args, **kwargs)[source]
Return type:

Iterable[FeatureSentence]

sentence_for_token(token)[source]

Return the parent sentence that has token.

Return type:

FeatureSentence

sentence_index_for_token(token)[source]

Return index of the parent sentence having token.

Return type:

int

sentences_for_tokens(tokens)[source]

Find sentences having a set of tokens.

Parameters:

tokens (Tuple[FeatureToken, ...]) – the query used to finding containing sentences

Return type:

Tuple[FeatureSentence, ...]

Returns:

the document ordered tuple of sentences containing tokens

sents: Tuple[FeatureSentence, ...]

The sentences that make up the document.

set_spacy_doc(doc)[source]
spacy_doc: Doc = None

The parsed spaCy document this feature set is based on. As explained in FeatureToken, spaCy documents are heavyweight and problematic to pickle. For this reason, this attribute is dropped when pickled, and only here for ad-hoc predictions.

text: str = None

The original raw text of the document.

to_document()[source]

Coerce this instance in to a document.

Return type:

FeatureDocument

to_sentence(limit=9223372036854775807, contiguous_i_sent=False, delim='')[source]

Coerce this instance to a single sentence. No token data is updated, so FeatureToken.i_sent values keep their original indexes. These sentence indexes will be inconsistent when called on FeatureDocument unless contiguous_i_sent is set to True.

Parameters:
  • limit (int) – the max number of sentences to create (only the starting sentences are kept)

  • contiguous_i_sent (Union[str, bool]) – if True, ensures all tokens have a FeatureToken.i_sent value that is contiguous for the returned instance; if this value is reset, the token indices start from 0

  • delim (str) – a string added between each constituent sentence

Return type:

FeatureSentence

Returns:

an instance of FeatureSentence that represents this token sequence

token_iter(*args, **kwargs)[source]

Return an iterator over the token features.

Parameters:

args – the arguments given to itertools.islice()

Return type:

Iterable[FeatureToken]
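
Because the positional arguments are passed on to itertools.islice(), they follow its (stop) or (start, stop[, step]) conventions. The sketch below mimics that with a list of strings; the token_iter function here is a stand-in written for this example, not the library method.

```python
from itertools import islice
from typing import Iterable, List

tokens: List[str] = ['The', 'quick', 'brown', 'fox', 'jumps']


def token_iter(*args) -> Iterable[str]:
    # No arguments: iterate over everything; otherwise delegate to islice.
    if len(args) == 0:
        return iter(tokens)
    return islice(tokens, *args)


first_two = list(token_iter(2))      # stop=2
middle = list(token_iter(1, 4))      # start=1, stop=4
everything = list(token_iter())
print(first_two, middle)
```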

uncombine_sentences()[source]

Reconstruct the sentence structure that we combined in combine_sentences(). If that has not been done in this instance, then return self.

Return type:

FeatureDocument

update_entity_spans(include_idx=True)[source]

Update token entity to norm text. This is helpful when entities are embedded after splitting text, which becomes FeatureToken.norm values. However, the token spans still index the original entities that are multi-word, which leads to norms that are not equal to the text spans. This synchronizes the token span indexes with the norms.

Parameters:

include_idx (bool) – whether to update SpacyFeatureToken.idx as well

update_indexes()[source]

Update all FeatureToken.i attributes to those provided by tokens_by_i. This corrects the many-to-one token index mapping for split multi-word named entities.

See:

tokens_by_i

write(depth=0, writer=<_io.TextIOWrapper name='<stdout>' mode='w' encoding='utf-8'>, n_sents=9223372036854775807, n_tokens=0, include_original=False, include_normalized=True)[source]

Write the document and optionally sentence features.

Parameters:
  • n_sents (int) – the number of sentences to write

  • n_tokens (int) – the number of tokens to print across all sentences

  • include_original (bool) – whether to include the original text

  • include_normalized (bool) – whether to include the normalized text

class zensols.nlp.container.FeatureSentence(tokens, text=None, spacy_span=None)[source]

Bases: FeatureSpan

A container class of tokens that make a sentence. Instances of this class iterate over FeatureToken instances, and can create documents with to_document().

EMPTY_SENTENCE: ClassVar[FeatureSentence] = <>
__init__(tokens, text=None, spacy_span=None)
get_overlapping_span(span, inclusive=True)[source]

Return a feature span that includes the lexical scope of span.

Return type:

TokenContainer

to_document()[source]

Coerce this instance in to a document.

Return type:

FeatureDocument

to_sentence(limit=9223372036854775807, contiguous_i_sent=False, delim='')[source]

Coerce this instance to a single sentence. No token data is updated, so FeatureToken.i_sent values keep their original indexes. These sentence indexes will be inconsistent when called on FeatureDocument unless contiguous_i_sent is set to True.

Parameters:
  • limit (int) – the max number of sentences to create (only the starting sentences are kept)

  • contiguous_i_sent (Union[str, bool]) – if True, ensures all tokens have a FeatureToken.i_sent value that is contiguous for the returned instance; if this value is reset, the token indices start from 0

  • delim (str) – a string added between each constituent sentence

Return type:

FeatureSentence

Returns:

an instance of FeatureSentence that represents this token sequence

class zensols.nlp.container.FeatureSpan(tokens, text=None, spacy_span=None)[source]

Bases: TokenContainer

A span of tokens as a TokenContainer, much like spacy.tokens.Span.

__init__(tokens, text=None, spacy_span=None)
clone(cls=None, **kwargs)[source]

Clone an instance of this token container.

Parameters:
  • cls (Type[TokenContainer]) – the type of the new instance

  • kwargs – arguments to add to as attributes to the clone

Return type:

TokenContainer

Returns:

the cloned instance of this instance

property dependency_tree: Dict[FeatureToken, List[Dict[FeatureToken]]]
spacy_span: Span = None

The parsed spaCy span this feature set is based on.

See:

FeatureDocument.spacy_doc()

text: str = None

The original raw text of the span.

to_document()[source]

Coerce this instance in to a document.

Return type:

FeatureDocument

to_sentence(limit=9223372036854775807, contiguous_i_sent=False, delim='')[source]

Coerce this instance to a single sentence. No token data is updated, so FeatureToken.i_sent values keep their original indexes. These sentence indexes will be inconsistent when called on FeatureDocument unless contiguous_i_sent is set to True.

Parameters:
  • limit (int) – the max number of sentences to create (only the starting sentences are kept)

  • contiguous_i_sent (Union[str, bool]) – if True, ensures all tokens have a FeatureToken.i_sent value that is contiguous for the returned instance; if this value is reset, the token indices start from 0

  • delim (str) – a string added between each constituent sentence

Return type:

FeatureSentence

Returns:

an instance of FeatureSentence that represents this token sequence

token_iter(*args, **kwargs)[source]

Return an iterator over the token features.

Parameters:

args – the arguments given to itertools.islice()

Return type:

Iterable[FeatureToken]

property token_len: int

Return the number of tokens.

property tokens: Tuple[FeatureToken, ...]

The tokens that make up the span.

property tokens_by_i_sent: Dict[int, FeatureToken]

A map of tokens with keys as their sentential position offset and values as tokens.

See:

zensols.nlp.FeatureToken.i

update_entity_spans(include_idx=True)[source]

Update token entity to norm text. This is helpful when entities are embedded after splitting text, which becomes FeatureToken.norm values. However, the token spans still index the original entities that are multi-word, which leads to norms that are not equal to the text spans. This synchronizes the token span indexes with the norms.

Parameters:

include_idx (bool) – whether to update SpacyFeatureToken.idx as well

update_indexes()[source]

Update all FeatureToken.i attributes to those provided by tokens_by_i. This corrects the many-to-one token index mapping for split multi-word named entities.

See:

tokens_by_i

class zensols.nlp.container.TokenAnnotatedFeatureDocument(sents, text=None, spacy_doc=None)[source]

Bases: FeatureDocument

A feature document that contains token annotations. Sentences can be modeled with TokenAnnotatedFeatureSentence or just FeatureSentence since this sets the annotations attribute when combining.

__init__(sents, text=None, spacy_doc=None)
property annotations: Tuple[Any, ...]

A token level annotation, which is one-to-one to tokens.

combine_sentences(**kwargs) FeatureDocument

Combine all the sentences in this document in to a new document with a single sentence.

Return type:

FeatureDocument

class zensols.nlp.container.TokenAnnotatedFeatureSentence(tokens, text=None, spacy_span=None, annotations=())[source]

Bases: FeatureSentence

A feature sentence that contains token annotations.

__init__(tokens, text=None, spacy_span=None, annotations=())
annotations: Tuple[Any, ...] = ()

A token level annotation, which is one-to-one to tokens.

to_document()[source]

Coerce this instance in to a document.

Return type:

FeatureDocument

write(depth=0, writer=<_io.TextIOWrapper name='<stdout>' mode='w' encoding='utf-8'>, **kwargs)[source]

Write the text container.

Parameters:
  • include_original – whether to include the original text

  • include_normalized – whether to include the normalized text

  • n_tokens – the number of tokens to write

  • inline – whether to print the tokens on one line each

class zensols.nlp.container.TokenContainer[source]

Bases: PersistableContainer, TextContainer

A base class for token container classes such as FeatureSentence and FeatureDocument. In addition to the defined methods, each instance has a text attribute, which is the original text of the document.

property canonical: str

A canonical representation of the container, which are non-space tokens separated by CANONICAL_DELIMITER.

clear()[source]

Clear all cached state.

clone(cls=None, **kwargs)[source]

Clone an instance of this token container.

Parameters:
  • cls (Type[TokenContainer]) – the type of the new instance

  • kwargs – arguments to add to as attributes to the clone

Return type:

TokenContainer

Returns:

the cloned instance of this instance

property entities: Tuple[FeatureSpan, ...]

The named entities of the container with each multi-word entity as elements.

get_overlapping_span(span, inclusive=True)[source]

Return a feature span that includes the lexical scope of span.

Return type:

TokenContainer

get_overlapping_tokens(span, inclusive=True)[source]

Get all tokens that overlap lexical span span.

Parameters:
  • span (LexicalSpan) – the document 0-index character based inclusive span to compare with FeatureToken.lexspan

  • inclusive (bool) – whether to include +1 on the end component when checking for overlap

Return type:

Iterable[FeatureToken]

Returns:

a token sequence containing the 0 index offset of span
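
One way to picture the inclusive flag is as the difference between a closed and a half-open comparison of span end points. The sketch below is an assumption about that behavior, using a hypothetical Span tuple in place of LexicalSpan and made-up token offsets; the library's actual comparison may differ.

```python
from typing import Dict, List, NamedTuple


class Span(NamedTuple):
    """Hypothetical stand-in for LexicalSpan (character offsets)."""
    begin: int
    end: int


def overlaps(a: Span, b: Span, inclusive: bool = True) -> bool:
    if inclusive:
        # Touching end points count as an overlap.
        return a.begin <= b.end and b.begin <= a.end
    return a.begin < b.end and b.begin < a.end


# Token text mapped to its character span in "Obama was born".
tokens: Dict[str, Span] = {
    'Obama': Span(0, 5), 'was': Span(6, 9), 'born': Span(10, 14)}
query = Span(5, 9)

inclusive_hits: List[str] = [
    t for t, s in tokens.items() if overlaps(s, query, inclusive=True)]
exclusive_hits: List[str] = [
    t for t, s in tokens.items() if overlaps(s, query, inclusive=False)]
print(inclusive_hits, exclusive_hits)
```

Under this reading, the token ending exactly where the query begins is returned only when inclusive is True.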

property lexspan: LexicalSpan

The document indexed lexical span using idx.

map_overlapping_tokens(spans, inclusive=True)[source]

Return a tuple of tokens, each tuple in the range given by the respective span in spans.

Parameters:
  • spans (Iterable[LexicalSpan]) – the document 0-index character based inclusive spans to compare with FeatureToken.lexspan

  • inclusive (bool) – whether to include +1 on the end component when checking for overlap

Return type:

Iterable[Tuple[FeatureToken, ...]]

Returns:

a tuple of matching tokens for the respective span query

property norm: str

The normalized version of the sentence.

property norm_orth: str

The normalized version of the sentence using the original rather than the token normalized text.

norm_token_iter(*args, **kwargs)[source]

Return a list of normalized tokens.

Parameters:

args – the arguments given to itertools.islice()

Return type:

Iterable[str]

reindex(reference_token=None)[source]

Re-index tokens, which is useful for situations where a 0-index offset is assumed for sub-documents created with FeatureDocument.get_overlapping_document() or FeatureDocument.get_overlapping_sentences(). The following data are modified:

strip(in_place=True)[source]

Strip beginning and ending whitespace (see strip_tokens()) and text.

Return type:

TokenContainer

strip_token_iter(*args, **kwargs)[source]

Strip beginning and ending whitespace (see strip_tokens()) using token_iter().

Return type:

Iterable[FeatureToken]

static strip_tokens(token_iter)[source]

Strip beginning and ending whitespace. This uses is_space, which is True for spaces, tabs and newlines.

Parameters:

token_iter (Iterable[FeatureToken]) – a stream of tokens

Return type:

Iterable[FeatureToken]

Returns:

non-whitespace middle tokens
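
The stripping behavior can be sketched on a list of token strings, using str.isspace() as a stand-in for the is_space feature. It is assumed here that whitespace tokens in the middle of the sequence are kept and only the leading and trailing ones are dropped.

```python
from typing import Iterable, List


def strip_tokens(token_iter: Iterable[str]) -> List[str]:
    tokens = list(token_iter)
    start, end = 0, len(tokens)
    # Skip whitespace tokens at the beginning.
    while start < end and tokens[start].isspace():
        start += 1
    # Skip whitespace tokens at the end.
    while end > start and tokens[end - 1].isspace():
        end -= 1
    return tokens[start:end]


stripped = strip_tokens(['\n', ' ', 'Hello', ' ', 'world', '\t'])
only_space = strip_tokens([' ', '\n'])
print(stripped)
```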

abstract to_document(limit=9223372036854775807)[source]

Coerce this instance in to a document.

Return type:

FeatureDocument

abstract to_sentence(limit=9223372036854775807, contiguous_i_sent=False, delim='')[source]

Coerce this instance to a single sentence. No token data is updated, so FeatureToken.i_sent values keep their original indexes. These sentence indexes will be inconsistent when called on FeatureDocument unless contiguous_i_sent is set to True.

Parameters:
  • limit (int) – the max number of sentences to create (only the starting sentences are kept)

  • contiguous_i_sent (Union[str, bool]) – if True, ensures all tokens have a FeatureToken.i_sent value that is contiguous for the returned instance; if this value is reset, the token indices start from 0

  • delim (str) – a string added between each constituent sentence

Return type:

FeatureSentence

Returns:

an instance of FeatureSentence that represents this token sequence

abstract token_iter(*args, **kwargs)[source]

Return an iterator over the token features.

Parameters:

args – the arguments given to itertools.islice()

Return type:

Iterable[FeatureToken]

property token_len: int

Return the number of tokens.

property tokens: Tuple[FeatureToken, ...]

Return the token features as a tuple.

property tokens_by_i: Dict[int, FeatureToken]

A map of tokens with keys as their position offset and values as tokens. The entries also include named entity tokens that are grouped as multi-word tokens. This is helpful for multi-word entities that were split (for example with SplitTokenMapper), and thus, have many-to-one mapped indexes.

See:

zensols.nlp.FeatureToken.i

property tokens_by_idx: Dict[int, FeatureToken]

A map of tokens with keys as their character offset and values as tokens.

Limitations: Multi-word entities will have a mapping only for the first word of that entity if tokens were split by spaces (for example with SplitTokenMapper). However, tokens_by_i does not have this limitation.

See:

tokens_by_i

See:

zensols.nlp.FeatureToken.idx
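
The two maps differ only in their keys: token position versus character offset. A minimal sketch, with tokens modeled as dictionaries carrying hypothetical i and idx values for the text "I love Paris":

```python
from typing import Any, Dict, List

text = 'I love Paris'
tokens: List[Dict[str, Any]] = [
    {'norm': 'I', 'i': 0, 'idx': 0},
    {'norm': 'love', 'i': 1, 'idx': 2},
    {'norm': 'Paris', 'i': 2, 'idx': 7},
]

# Keyed by token position in the document.
tokens_by_i = {tok['i']: tok for tok in tokens}
# Keyed by the character offset where the token starts.
tokens_by_idx = {tok['idx']: tok for tok in tokens}

print(tokens_by_i[1]['norm'], tokens_by_idx[7]['norm'])
```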

abstract update_entity_spans(include_idx=True)[source]

Update token entity to norm text. This is helpful when entities are embedded after splitting text, which becomes FeatureToken.norm values. However, the token spans still index the original entities that are multi-word, which leads to norms that are not equal to the text spans. This synchronizes the token span indexes with the norms.

Parameters:

include_idx (bool) – whether to update SpacyFeatureToken.idx as well

update_indexes()[source]

Update all FeatureToken.i attributes to those provided by tokens_by_i. This corrects the many-to-one token index mapping for split multi-word named entities.

See:

tokens_by_i

write(depth=0, writer=<_io.TextIOWrapper name='<stdout>' mode='w' encoding='utf-8'>, include_original=False, include_normalized=True, n_tokens=9223372036854775807, inline=False, feature_ids=None)[source]

Write the text container.

Parameters:
  • include_original (bool) – whether to include the original text

  • include_normalized (bool) – whether to include the normalized text

  • n_tokens (int) – the number of tokens to write

  • inline (bool) – whether to print the tokens on one line each

write_text(depth=0, writer=<_io.TextIOWrapper name='<stdout>' mode='w' encoding='utf-8'>, include_original=False, include_normalized=True, limit=9223372036854775807)[source]

Write only the text of the container.

Parameters:
  • include_original (bool) – whether to include the original text

  • include_normalized (bool) – whether to include the normalized text

  • limit (int) – the max number of characters to print

zensols.nlp.dataframe module

Create Pandas dataframes from features. This must be imported by absolute module (zensols.nlp.dataframe).

class zensols.nlp.dataframe.FeatureDataFrameFactory(token_feature_ids=frozenset({'children', 'dep', 'dep_', 'ent', 'ent_', 'ent_iob', 'ent_iob_', 'i', 'i_sent', 'idx', 'is_contraction', 'is_ent', 'is_pronoun', 'is_punctuation', 'is_space', 'is_stop', 'is_superlative', 'is_wh', 'lemma_', 'lexspan', 'norm', 'norm_len', 'pos_', 'sent_i', 'shape', 'shape_', 'tag', 'tag_', 'text'}), priority_feature_ids=('text', 'norm', 'idx', 'sent_i', 'i', 'i_sent', 'tag', 'pos', 'is_wh', 'entity', 'dep', 'children'))[source]

Bases: object

Creates a Pandas dataframe of features from a document's annotations. Each feature ID is given a column in the output pandas.DataFrame.

__init__(token_feature_ids=frozenset({'children', 'dep', 'dep_', 'ent', 'ent_', 'ent_iob', 'ent_iob_', 'i', 'i_sent', 'idx', 'is_contraction', 'is_ent', 'is_pronoun', 'is_punctuation', 'is_space', 'is_stop', 'is_superlative', 'is_wh', 'lemma_', 'lexspan', 'norm', 'norm_len', 'pos_', 'sent_i', 'shape', 'shape_', 'tag', 'tag_', 'text'}), priority_feature_ids=('text', 'norm', 'idx', 'sent_i', 'i', 'i_sent', 'tag', 'pos', 'is_wh', 'entity', 'dep', 'children'))
priority_feature_ids: Tuple[str, ...] = ('text', 'norm', 'idx', 'sent_i', 'i', 'i_sent', 'tag', 'pos', 'is_wh', 'entity', 'dep', 'children')

Feature IDs that are used first in the column order in the output pandas.DataFrame.

token_feature_ids: Set[str] = frozenset({'children', 'dep', 'dep_', 'ent', 'ent_', 'ent_iob', 'ent_iob_', 'i', 'i_sent', 'idx', 'is_contraction', 'is_ent', 'is_pronoun', 'is_punctuation', 'is_space', 'is_stop', 'is_superlative', 'is_wh', 'lemma_', 'lexspan', 'norm', 'norm_len', 'pos_', 'sent_i', 'shape', 'shape_', 'tag', 'tag_', 'text'})

The feature IDs to add to the pandas.DataFrame.
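
The interplay of the two attributes can be sketched as a column ordering step: priority IDs that are present come first, followed by the remaining feature IDs. Sorting the remainder alphabetically is an assumption made for this example, and the order_columns helper is hypothetical.

```python
from typing import Iterable, List, Set


def order_columns(token_feature_ids: Set[str],
                  priority_feature_ids: Iterable[str]) -> List[str]:
    # Priority IDs first, but only those actually requested.
    first = [f for f in priority_feature_ids if f in token_feature_ids]
    # Assumption: the remaining columns follow in alphabetical order.
    rest = sorted(token_feature_ids - set(first))
    return first + rest


columns = order_columns({'tag_', 'text', 'norm', 'i', 'dep_'},
                        ('text', 'norm', 'idx', 'i'))
print(columns)
```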

zensols.nlp.decorate module

Contains useful classes for decorating feature sentences.

class zensols.nlp.decorate.CopyFeatureTokenContainerDecorator(feature_ids)[source]

Bases: FeatureTokenContainerDecorator

Copies feature(s) for each token in the container. For each token, each source / target tuple pair in feature_ids is copied. If the feature is missing (this does not include existing FeatureToken.NONE values), an exception is raised.

__init__(feature_ids)
decorate(container)[source]
feature_ids: Tuple[Tuple[str, str], ...]

The features to copy in the form ((<source>, <target>), …).
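A minimal sketch of the ((<source>, <target>), …) copy semantics, using plain dictionaries as hypothetical stand-ins for feature tokens. The copy_features function and the KeyError are assumptions made for illustration; the library raises its own exception type.

```python
from typing import Any, Dict, Iterable, Tuple


def copy_features(tokens: Iterable[Dict[str, Any]],
                  feature_ids: Tuple[Tuple[str, str], ...]) -> None:
    """Copy each source feature to its target name on every token."""
    for tok in tokens:
        for source, target in feature_ids:
            if source not in tok:
                # stand-in for the library's missing feature error
                raise KeyError(f"missing feature '{source}'")
            tok[target] = tok[source]


tokens = [{'norm': 'dog', 'tag_': 'NN'}, {'norm': 'barks', 'tag_': 'VBZ'}]
copy_features(tokens, (('norm', 'norm_copy'), ('tag_', 'orig_tag')))
```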

class zensols.nlp.decorate.FilterEmptySentenceDocumentDecorator(filter_space=True)[source]

Bases: FeatureDocumentDecorator

Filter zero length sentences.

__init__(filter_space=True)
decorate(doc)[source]
filter_space: bool = True

Whether to filter space tokens when comparing zero length sentences.

class zensols.nlp.decorate.FilterTokenSentenceDecorator(remove_stop=False, remove_space=False, remove_pronouns=False, remove_punctuation=False, remove_determiners=False, remove_empty=False)[source]

Bases: FeatureSentenceDecorator

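A decorator that filters tokens from sentences according to the remove_* fields (stop words, white space, pronouns, punctuation, determiners and empty tokens).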

See:

TokenContainer.strip()

__init__(remove_stop=False, remove_space=False, remove_pronouns=False, remove_punctuation=False, remove_determiners=False, remove_empty=False)
decorate(sent)[source]
remove_determiners: bool = False

Whether to remove determiners (e.g. the).

remove_empty: bool = False

Whether to remove 0-length tokens (using normalized text).

remove_pronouns: bool = False

Whether to remove pronouns (e.g. he).

remove_punctuation: bool = False

Whether to remove punctuation (e.g. periods).

remove_space: bool = False

Whether to remove white space (e.g. new lines).

remove_stop: bool = False

Whether to remove stop words.

class zensols.nlp.decorate.RemoveFeatureTokenContainerDecorator(exclude_feature_ids)[source]

Bases: FeatureTokenContainerDecorator

Removes features from each token in the container.

__init__(exclude_feature_ids)
decorate(container)[source]
exclude_feature_ids: Set[str]

The features to remove from the tokens.

class zensols.nlp.decorate.SplitTokenSentenceDecorator[source]

Bases: FeatureSentenceDecorator

A decorator that splits feature tokens by white space.

__init__()
decorate(sent)[source]
class zensols.nlp.decorate.StripTokenContainerDecorator[source]

Bases: FeatureTokenContainerDecorator

A decorator that strips whitespace from sentences (or TokenContainer).

See:

TokenContainer.strip()

__init__()
decorate(container)[source]
class zensols.nlp.decorate.UpdateTokenContainerDecorator(update_indexes=True, update_entity_spans=True, reindex=False)[source]

Bases: FeatureTokenContainerDecorator

Updates document indexes and spans (see fields).

__init__(update_indexes=True, update_entity_spans=True, reindex=False)
decorate(container)[source]
reindex: bool = False

Whether to invoke TokenContainer.reindex() after.

update_entity_spans: bool = True

Whether to update the document indexes with FeatureDocument.update_entity_spans().

update_indexes: bool = True

Whether to update the document indexes with FeatureDocument.update_indexes().

zensols.nlp.domain module

Interfaces, contracts and errors.

class zensols.nlp.domain.LexicalSpan(begin, end)[source]

Bases: Dictable

A lexical character span of text in a document. The span has two positions, begin and end, which are also accessible with the index operator in that order. The left (begin) is inclusive and the right (end) is exclusive to conform to Python array slicing conventions.

One span is less than the other when the beginning position is less. When the beginning positions are the same, the one with the smaller end position is less.

The length of the span is the distance between the end and the beginning positions.

EMPTY_SPAN: ClassVar[LexicalSpan] = (0, 0)

The span (0, 0).

__init__(begin, end)[source]

Initialize the interval.

Parameters:
  • begin (int) – the begin of the span

  • end (int) – the end of the span

property astuple: Tuple[int, int]

The span as a (begin, end) tuple.

classmethod from_token(tok)[source]

Create a span from a spaCy Token or Span.

Return type:

Tuple[int, int]

classmethod from_tuples(tups)[source]

Create spans from tuples.

Parameters:

tups (Iterable[Tuple[int, int]]) – an iterable of (<begin>, <end>) tuples

Return type:

Iterable[LexicalSpan]

static gaps(spans, end=None)[source]

Return the spans for the “holes” in spans. For example, if spans is ((0, 5), (10, 12), (15, 17)), then return ((5, 10), (12, 15)).

Parameters:
  • spans (Iterable[LexicalSpan]) – the spans used to find gaps

  • end (Optional[int]) – an end position for the last gap so that if the last item in spans end does not match, another is added

Return type:

List[LexicalSpan]

Returns:

a list of spans that “fill” any holes in spans
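The gap computation can be sketched with plain (begin, end) tuples. This is not the library's code: the gaps function below is a stand-alone illustration, and both the sorting of the input and the absence of a leading gap before the first span are assumptions.

```python
from typing import Iterable, List, Optional, Tuple

Span = Tuple[int, int]


def gaps(spans: Iterable[Span], end: Optional[int] = None) -> List[Span]:
    """Return the holes between spans, optionally up to ``end``."""
    holes: List[Span] = []
    last_end: Optional[int] = None
    # assumption: spans are processed in sorted order
    for begin, span_end in sorted(spans):
        if last_end is not None and begin > last_end:
            holes.append((last_end, begin))
        last_end = span_end if last_end is None else max(last_end, span_end)
    # add a trailing gap when the last span stops short of ``end``
    if end is not None and last_end is not None and last_end < end:
        holes.append((last_end, end))
    return holes


holes = gaps(((0, 5), (10, 12), (15, 17)))
holes_to_20 = gaps(((0, 5), (10, 12), (15, 17)), end=20)
```

With the spans from the description, holes is [(5, 10), (12, 15)], and passing end=20 appends (17, 20).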

narrow(other)[source]

Return the shortest span that inclusively fits in both this and other.

Parameters:

other (LexicalSpan) – the second span to narrow with this span

Returns:

a span so that beginning is maximized and end is minimized or None if the two spans do not overlap

Return type:

Optional[LexicalSpan]
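Narrowing amounts to an interval intersection. The sketch below uses plain tuples and a hypothetical narrow function; treating spans that merely touch as non-overlapping is an assumption that follows from the half-open convention, and the library may decide this boundary case differently.

```python
from typing import Optional, Tuple

Span = Tuple[int, int]


def narrow(a: Span, b: Span) -> Optional[Span]:
    """Return the intersection of two half-open spans, or None."""
    begin = max(a[0], b[0])  # maximize the beginning
    end = min(a[1], b[1])    # minimize the end
    # assumption: an empty intersection means the spans do not overlap
    return (begin, end) if begin < end else None


both = narrow((0, 10), (5, 15))
disjoint = narrow((0, 3), (5, 8))
```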

static overlaps(a0, a1, b0, b1, inclusive=True)[source]

Return whether or not one text span overlaps with another.

Parameters:

inclusive (bool) – whether to include +1 on the end component when checking for overlap

Returns:

any overlap detected returns True

overlaps_with(other, inclusive=True)[source]

Return whether or not one text span overlaps non-inclusively with another.

Parameters:
  • other (LexicalSpan) – the other location

  • inclusive (bool) – whether to include +1 on the end component when checking for overlap

Return type:

bool

Returns:

any overlap detected returns True

static widen(others)[source]

Take the span union by using the left most begin and the right most end.

Parameters:

others (Iterable[LexicalSpan]) – the spans to union

Return type:

Optional[LexicalSpan]

Returns:

the widest span that inclusively aggregates others, or None if an empty sequence is passed
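Widening is the complement of narrowing: take the smallest begin and the largest end. A minimal sketch with plain tuples, where widen is a hypothetical stand-alone function rather than the library's static method:

```python
from typing import Iterable, Optional, Tuple

Span = Tuple[int, int]


def widen(others: Iterable[Span]) -> Optional[Span]:
    """Return the span covering all given spans, or None if empty."""
    spans = list(others)
    if not spans:
        return None
    return (min(s[0] for s in spans), max(s[1] for s in spans))


widest = widen([(10, 12), (0, 5), (15, 17)])
```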

write(depth=0, writer=<_io.TextIOWrapper name='<stdout>' mode='w' encoding='utf-8'>)[source]

Write this instance as either a Writable or as a Dictable. If class attribute _DICTABLE_WRITABLE_DESCENDANTS is set as True, then use the write() method on children instead of writing the generated dictionary. Otherwise, write this instance by first creating a dict recursively using asdict(), then formatting the output.

If the attribute _DICTABLE_WRITE_EXCLUDES is set, those attributes are removed from what is written in the write() method.

Note that this attribute will need to be set in all descendants in the instance hierarchy since writing the object instance graph is done recursively.

Parameters:
  • depth (int) – the starting indentation depth

  • writer (TextIOBase) – the writer to dump the content of this writable

exception zensols.nlp.domain.MissingFeatureError(token, feature_id, msg=None)[source]

Bases: NLPError

Raised on attempting to access a nonexistent feature in FeatureToken.

__init__(token, feature_id, msg=None)[source]

Initialize.

Parameters:
  • token (FeatureToken) – the token for which access was attempted

  • feature_id (str) – the feature_id that is missing in token

__module__ = 'zensols.nlp.domain'
exception zensols.nlp.domain.NLPError[source]

Bases: APIError

Raised for any errors for this library.

__annotations__ = {}
__module__ = 'zensols.nlp.domain'
exception zensols.nlp.domain.ParseError[source]

Bases: APIError

Raised for any parsing errors.

__annotations__ = {}
__module__ = 'zensols.nlp.domain'
class zensols.nlp.domain.TextContainer[source]

Bases: Dictable

A writable class that has a text property or attribute. All subclasses need a norm attribute or property.

write(depth=0, writer=<_io.TextIOWrapper name='<stdout>' mode='w' encoding='utf-8'>, include_original=True, include_normalized=True)[source]

Write this instance as either a Writable or as a Dictable. If class attribute _DICTABLE_WRITABLE_DESCENDANTS is set as True, then use the write() method on children instead of writing the generated dictionary. Otherwise, write this instance by first creating a dict recursively using asdict(), then formatting the output.

If the attribute _DICTABLE_WRITE_EXCLUDES is set, those attributes are removed from what is written in the write() method.

Note that this attribute will need to be set in all descendants in the instance hierarchy since writing the object instance graph is done recursively.

Parameters:
  • depth (int) – the starting indentation depth

  • writer (TextIOBase) – the writer to dump the content of this writable

zensols.nlp.index module

A heuristic text indexing and search class.

class zensols.nlp.index.FeatureDocumentIndexer(doc)[source]

Bases: object

A utility class that indexes and searches for text in potentially whitespace mangled documents. It does this by trying more efficient means first, then resorts to methods that are more computationally expensive.

__init__(doc)
doc: FeatureDocument

The document to index.

property doc_tok_orths: Tuple[Tuple[str, FeatureToken], ...]

Return tuples of (<orthographic text>, <token>).

find(query, sent_ix=None)[source]

Find a sentence in document doc. If a sentence index is given, it treats the query as a sentence to find in doc.

Parameters:
  • query (TokenContainer) – the sentence to find in doc

  • sent_ix (int) – the sentence index hint if available

Return type:

TokenContainer

Returns:

the matched text from doc

property pack2ix: Dict[int, int]

Return a dictionary of character positions in the document (doc) text to respective positions in the same string without whitespace.

property packed_doc_text: str

Return the document’s (doc) no-space normalized text.
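The relationship between the packed text and the position mapping can be sketched as below. The pack function is hypothetical, and the direction of the mapping (packed position to original position) is an assumption based on the property name pack2ix; the library may store it differently.

```python
from typing import Dict, Tuple


def pack(text: str) -> Tuple[str, Dict[int, int]]:
    """Strip whitespace and map packed offsets to original offsets."""
    packed_chars = []
    pack2ix: Dict[int, int] = {}
    for ix, ch in enumerate(text):
        if ch.isspace():
            continue
        # the next packed position corresponds to original position ix
        pack2ix[len(packed_chars)] = ix
        packed_chars.append(ch)
    return ''.join(packed_chars), pack2ix


packed, pack2ix = pack('a b\n c')
```

A match found in the packed string can then be translated back to character offsets in the original text through the mapping.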

property text2sent: Dict[str, FeatureSentence]

Return a dictionary of sentence normalized text to respective sentence in doc.

zensols.nlp.nerscore module

Wraps the SemEval-2013 Task 9.1 NER evaluation API as a ScoreMethod.

From the David Batista blog post:

The SemEval’13 introduced four different ways to measure precision/recall/f1-score results based on the metrics defined by MUC:

  • Strict: exact boundary surface string match and entity type

  • Exact: exact boundary match over the surface string, regardless of the type

  • Partial: partial boundary match over the surface string, regardless of the type

  • Type: some overlap between the system tagged entity and the gold annotation is required

Each of these ways to measure the performance accounts for correct, incorrect, partial, missed and spurious in different ways. Let’s look in detail and see how each of the metrics defined by MUC falls into each of the scenarios described above.

see:

SemEval-2013 Task 9.1

see:

David Batista

class zensols.nlp.nerscore.SemEvalHarmonicMeanScore(precision, recall, f_score, correct, incorrect, partial, missed, spurious, possible, actual)[source]

Bases: HarmonicMeanScore

A harmonic mean score with the additional SemEval computed scores (see module zensols.nlp.nerscore docs).

NAN_INSTANCE: ClassVar[SemEvalHarmonicMeanScore] = SemEvalHarmonicMeanScore(precision=nan, recall=nan, f_score=nan, correct=nan, incorrect=nan, partial=nan, missed=nan, spurious=nan, possible=nan, actual=nan)

Used to add to ErrorScore for harmonic means replacements.

__init__(precision, recall, f_score, correct, incorrect, partial, missed, spurious, possible, actual)
actual: int
correct: int

The number of correct (COR): both are the same.

incorrect: int

The number of incorrect (INC): the output of a system and the golden annotation don’t match.

missed: int

The number of missed (MIS): a golden annotation is not captured by a system.

partial: int

The number of partial (PAR): system and the golden annotation are somewhat “similar” but not the same.

possible: int
spurious: int

The number of spurious (SPU): system produces a response which does not exist in the golden annotation.
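As a rough sketch of how the counts relate to the scores, the MUC-style definitions in the referenced blog post take possible as COR + INC + PAR + MIS and actual as COR + INC + PAR + SPU. The Counts class and precision_recall function below are hypothetical and only illustrate that arithmetic; the partial-credit variant adds half of the partial count to the numerator.

```python
from dataclasses import dataclass
from typing import Tuple


@dataclass
class Counts:
    """Hypothetical container for the five MUC-style counts."""
    correct: int
    incorrect: int
    partial: int
    missed: int
    spurious: int

    @property
    def possible(self) -> int:
        # number of annotations in the gold standard
        return self.correct + self.incorrect + self.partial + self.missed

    @property
    def actual(self) -> int:
        # number of annotations produced by the system
        return self.correct + self.incorrect + self.partial + self.spurious


def precision_recall(c: Counts, partial_credit: bool = False) -> Tuple[float, float]:
    hits = c.correct + (0.5 * c.partial if partial_credit else 0.0)
    precision = hits / c.actual if c.actual else 0.0
    recall = hits / c.possible if c.possible else 0.0
    return precision, recall


counts = Counts(correct=2, incorrect=1, partial=1, missed=1, spurious=2)
exact_p, exact_r = precision_recall(counts)
partial_p, partial_r = precision_recall(counts, partial_credit=True)
```

Returning 0.0 when a denominator is zero is a choice made for this sketch only.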

class zensols.nlp.nerscore.SemEvalScore(strict, exact, partial, ent_type)[source]

Bases: Score

Contains all four harmonic mean SemEval scores (see module zensols.nlp.nerscore docs). This score has four harmonic means providing various levels of accuracy.

NAN_INSTANCE: ClassVar[SemEvalScore] = SemEvalScore(strict=SemEvalHarmonicMeanScore(precision=nan, recall=nan, f_score=nan, correct=nan, incorrect=nan, partial=nan, missed=nan, spurious=nan, possible=nan, actual=nan), exact=SemEvalHarmonicMeanScore(precision=nan, recall=nan, f_score=nan, correct=nan, incorrect=nan, partial=nan, missed=nan, spurious=nan, possible=nan, actual=nan), partial=SemEvalHarmonicMeanScore(precision=nan, recall=nan, f_score=nan, correct=nan, incorrect=nan, partial=nan, missed=nan, spurious=nan, possible=nan, actual=nan), ent_type=SemEvalHarmonicMeanScore(precision=nan, recall=nan, f_score=nan, correct=nan, incorrect=nan, partial=nan, missed=nan, spurious=nan, possible=nan, actual=nan))
__init__(strict, exact, partial, ent_type)
asrow(meth)[source]
Return type:

Dict[str, float]

ent_type: SemEvalHarmonicMeanScore

Some overlap between the system tagged entity and the gold annotation is required.

exact: SemEvalHarmonicMeanScore

Exact boundary match over the surface string, regardless of the type.

partial: SemEvalHarmonicMeanScore

Partial boundary match over the surface string, regardless of the type.

strict: SemEvalHarmonicMeanScore

Exact boundary surface string match and entity type.

class zensols.nlp.nerscore.SemEvalScoreMethod(reverse_sents=False, labels=None)[source]

Bases: ScoreMethod

A SemEval-2013 Task 9.1 score (see module zensols.nlp.nerscore docs). This score has four harmonic means providing various levels of accuracy. Sentence pairs are ordered as (<gold>, <prediction>).

__init__(reverse_sents=False, labels=None)
labels: Optional[Set[str]] = None

The NER labels on which to evaluate. If not provided, text is evaluated under a (stubbed tag) label.

zensols.nlp.norm module

Normalize text and map spaCy documents.

class zensols.nlp.norm.FilterRegularExpressionMapper(regex='[ ]+', invert=False)[source]

Bases: TokenMapper

Filter tokens based on normalized form regular expression.

__init__(regex='[ ]+', invert=False)
invert: bool = False

If True then remove rather than keep everything that matches.

map_tokens(token_tups)[source]

Transform token tuples.

Return type:

Iterable[Tuple[Token, str]]

regex: Union[Pattern, str] = '[ ]+'

The regular expression to use for filtering tokens.
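The keep-or-remove behavior of regex and invert can be sketched on (token, normalized text) tuples. Here filter_tokens is a hypothetical function, plain strings stand in for spaCy tokens, and the use of re.match (rather than a search or full match) is an assumption.

```python
import re
from typing import Iterable, Iterator, Tuple

TokenTup = Tuple[str, str]  # (stand-in for a spaCy token, normalized text)


def filter_tokens(token_tups: Iterable[TokenTup], regex: str = r'[ ]+',
                  invert: bool = False) -> Iterator[TokenTup]:
    pat = re.compile(regex)
    for tok, norm in token_tups:
        matched = pat.match(norm) is not None
        # keep matches by default; keep non-matches when inverted
        if matched != invert:
            yield tok, norm


tups = [('T1', 'dog'), ('T2', '  ')]
kept = list(filter_tokens(tups))
kept_inverted = list(filter_tokens(tups, invert=True))
```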

class zensols.nlp.norm.FilterTokenMapper(remove_stop=False, remove_space=False, remove_pronouns=False, remove_punctuation=False, remove_determiners=False)[source]

Bases: TokenMapper

Filter tokens based on token (spaCy) attributes.

Configuration example:

[filter_token_mapper]
class_name = zensols.nlp.FilterTokenMapper
remove_stop = True
remove_punctuation = True
__init__(remove_stop=False, remove_space=False, remove_pronouns=False, remove_punctuation=False, remove_determiners=False)
map_tokens(token_tups)[source]

Transform token tuples.

Return type:

Iterable[Tuple[Token, str]]

remove_determiners: bool = False
remove_pronouns: bool = False
remove_punctuation: bool = False
remove_space: bool = False
remove_stop: bool = False
class zensols.nlp.norm.JoinTokenMapper(regex='[ ]', separator=None)[source]

Bases: object

Join tokens based on a regular expression. It does this by creating spans in the spaCy component (first in the tuple) and using the span text as the normalized token.

__init__(regex='[ ]', separator=None)
map_tokens(token_tups)[source]
Return type:

Iterable[Tuple[Token, str]]

regex: Union[Pattern, str] = '[ ]'

The regular expression to use for joining tokens.

separator: str = None

The string used to separate normalized tokens in matches. If None, use the token text.

class zensols.nlp.norm.LambdaTokenMapper(add_lambda=None, map_lambda=None)[source]

Bases: TokenMapper

Use a lambda expression to map a token tuple.

This is handy for specialized behavior that can be added directly to a configuration file.

Configuration example:

[lc_lambda_token_mapper]
class_name = zensols.nlp.LambdaTokenMapper
map_lambda = lambda x: (x[0], f'<{x[1].lower()}>')
__init__(add_lambda=None, map_lambda=None)
add_lambda: str = None
map_lambda: str = None
map_tokens(token_tups)[source]

Transform token tuples.

Return type:

Iterable[Tuple[Token, str]]
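To see what the map_lambda from the configuration example does, the sketch below applies it to two (token, normalized text) tuples. Plain strings stand in for spaCy tokens, and turning the configured string into a callable with eval is an assumption about how the mapper works internally.

```python
# the lambda source exactly as written in the configuration example
map_lambda = "lambda x: (x[0], f'<{x[1].lower()}>')"

# assumption: the configured string is evaluated into a callable
fn = eval(map_lambda)

# strings stand in for spaCy tokens in the first tuple position
token_tups = [('Tok1', 'Hello'), ('Tok2', 'WORLD')]
mapped = list(map(fn, token_tups))
```

Each normalized string is lower-cased and wrapped in angle brackets, while the token in the first position is passed through unchanged.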

class zensols.nlp.norm.LemmatizeTokenMapper(lemmatize=True, remove_first_stop=False)[source]

Bases: TokenMapper

Lemmatize tokens and optionally remove entity stop words.

Important: This completely ignores the normalized input token string and essentially just replaces it with the lemma found in the token instance.

Configuration example:

[lemma_token_mapper]
class_name = zensols.nlp.LemmatizeTokenMapper
Parameters:
  • lemmatize (bool) – lemmatize if True; this is an option to allow (only) the removal of the first stop word in named entities

  • remove_first_stop (bool) – whether to remove the first stop word in named entities when embed_entities is True

__init__(lemmatize=True, remove_first_stop=False)
lemmatize: bool = True
map_tokens(token_tups)[source]

Transform token tuples.

Return type:

Iterable[Tuple[Token, str]]

remove_first_stop: bool = False
class zensols.nlp.norm.MapTokenNormalizer(embed_entities=True, config_factory=None, mapper_class_list=<factory>)[source]

Bases: TokenNormalizer

A normalizer that applies a sequence of TokenMapper instances to transform the normalized token text. The members of the mapper_class_list are sections of the application configuration.

Configuration example:

[map_filter_token_normalizer]
class_name = zensols.nlp.MapTokenNormalizer
mapper_class_list = list: filter_token_mapper
__init__(embed_entities=True, config_factory=None, mapper_class_list=<factory>)
config_factory: ConfigFactory = None

The factory that created this instance and used to create the mappers.

mapper_class_list: List[str]

The configuration section names to create from the application configuration factory, which are added to mappers. This field is deprecated; use mappers instead.
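Applying a sequence of mappers amounts to function composition over the stream of (token, normalized text) tuples. In this sketch the two mappers and apply_mappers are hypothetical, and plain callables stand in for TokenMapper instances.

```python
from functools import reduce
from typing import Callable, Iterable, List, Tuple

TokenTup = Tuple[str, str]  # (stand-in for a spaCy token, normalized text)
Mapper = Callable[[Iterable[TokenTup]], Iterable[TokenTup]]


def lower_mapper(tups: Iterable[TokenTup]) -> Iterable[TokenTup]:
    return ((tok, norm.lower()) for tok, norm in tups)


def drop_empty_mapper(tups: Iterable[TokenTup]) -> Iterable[TokenTup]:
    return ((tok, norm) for tok, norm in tups if norm.strip())


def apply_mappers(tups: Iterable[TokenTup],
                  mappers: List[Mapper]) -> List[TokenTup]:
    # feed the output of each mapper to the next one, in list order
    return list(reduce(lambda acc, mapper: mapper(acc), mappers, tups))


result = apply_mappers([('T1', 'The'), ('T2', ' '), ('T3', 'Dog')],
                       [lower_mapper, drop_empty_mapper])
```

The order of the list matters, since each mapper only sees what the previous one produced.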

class zensols.nlp.norm.SplitEntityTokenMapper(token_unit_type=False, copy_attributes=('label', 'label_'))[source]

Bases: TokenMapper

Splits embedded entities (or any Span) into separate tokens. This is useful for splitting up entities as tokens after being grouped with TokenNormalizer.embed_entities. Note that embed_entities must be True to create the entities, as they come from spaCy as spans. This then can be used to create SpacyFeatureToken with spans that have the entity.

__init__(token_unit_type=False, copy_attributes=('label', 'label_'))
copy_attributes: Tuple[str, ...] = ('label', 'label_')

Attributes to copy from the span to the split token.

map_tokens(token_tups)[source]

Transform token tuples.

Return type:

Iterable[Tuple[Token, str]]

token_unit_type: bool = False

Whether to generate tokens for each split span or a one token span.

class zensols.nlp.norm.SplitTokenMapper(regex='[ ]')[source]

Bases: TokenMapper

Splits the normalized text on a per token basis with a regular expression.

Configuration example:

[split_token_mapper]
class_name = zensols.nlp.SplitTokenMapper
regex = r'[ ]'
__init__(regex='[ ]')
map_tokens(token_tups)[source]

Transform token tuples.

Return type:

Iterable[Tuple[Token, str]]

regex: Union[Pattern, str] = '[ ]'

The regular expression to use for splitting tokens.
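A minimal sketch of splitting on a per token basis: each normalized string is split with the regular expression and every piece is emitted as its own tuple. The split_tokens function is hypothetical, and pairing each piece with the original token is an assumption.

```python
import re
from typing import Iterable, Iterator, Tuple

TokenTup = Tuple[str, str]  # (stand-in for a spaCy token, normalized text)


def split_tokens(token_tups: Iterable[TokenTup],
                 regex: str = r'[ ]') -> Iterator[TokenTup]:
    pat = re.compile(regex)
    for tok, norm in token_tups:
        # assumption: every piece keeps a reference to the source token
        for part in pat.split(norm):
            yield tok, part


pieces = list(split_tokens([('T1', 'New York'), ('T2', 'city')]))
```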

class zensols.nlp.norm.SubstituteTokenMapper(regex='', replace_char='')[source]

Bases: TokenMapper

Replace a regular expression in normalized token text.

Configuration example:

[subs_token_mapper]
class_name = zensols.nlp.SubstituteTokenMapper
regex = r'[ \t]'
replace_char = _
__init__(regex='', replace_char='')
map_tokens(token_tups)[source]

Transform token tuples.

Return type:

Iterable[Tuple[Token, str]]

regex: str = ''

The regular expression to use for substitution.

replace_char: str = ''

The character that is used for replacement.
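The substitution from the configuration example (replace spaces and tabs with an underscore) can be sketched with re.sub. The substitute function is hypothetical and strings stand in for spaCy tokens.

```python
import re
from typing import Iterable, List, Tuple

TokenTup = Tuple[str, str]  # (stand-in for a spaCy token, normalized text)


def substitute(token_tups: Iterable[TokenTup], regex: str,
               replace_char: str) -> List[TokenTup]:
    pat = re.compile(regex)
    # only the normalized text changes; the token is passed through
    return [(tok, pat.sub(replace_char, norm)) for tok, norm in token_tups]


replaced = substitute([('T1', 'a b\tc')], r'[ \t]', '_')
```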

class zensols.nlp.norm.TokenMapper[source]

Bases: ABC

Abstract class used to transform token tuples generated from TokenNormalizer.normalize().

__init__()
abstract map_tokens(token_tups)[source]

Transform token tuples.

Return type:

Iterable[Tuple[Token, str]]

class zensols.nlp.norm.TokenNormalizer(embed_entities=True)[source]

Bases: object

Base token extractor returns tuples of tokens and their normalized version.

Configuration example:

[default_token_normalizer]
class_name = zensols.nlp.TokenNormalizer
embed_entities = False
__init__(embed_entities=True)
embed_entities: bool = True

Whether or not to replace tokens with their respective named entity version.

normalize(doc)[source]

Normalize spaCy document doc into (token, normal text) tuples.

Return type:

Iterable[Tuple[Token, str]]

zensols.nlp.parser module

Parse documents and generate features in an organized taxonomy.

class zensols.nlp.parser.CachingFeatureDocumentParser(name, delegate, token_decorators=(), sentence_decorators=(), document_decorators=(), token_feature_ids=<factory>, silencer=None, stash=None, hasher=<factory>)[source]

Bases: DecoratedFeatureDocumentParser

A document parser that persists previous parses using the hash of the text as a key. Caching is optional depending on the value of stash, which is useful when this class is extended for use cases other than caching.

__init__(name, delegate, token_decorators=(), sentence_decorators=(), document_decorators=(), token_feature_ids=<factory>, silencer=None, stash=None, hasher=<factory>)
clear()[source]

Clear the caching stash.

hasher: Hasher

Used to hash the natural language text into string keys.

parse(text, *args, **kwargs)[source]

Parse text given either as a single string or as a list of sentences.

Parameters:
  • text (str) – either a string or a list of strings; if the former a document with one sentence will be created, otherwise a document is returned with a sentence for each string in the list

  • args – the arguments used to create the FeatureDocument instance

  • kwargs – the key word arguments used to create the FeatureDocument instance

Return type:

FeatureDocument

stash: Stash = None

The stash that persists the feature document instances. If this is not provided, no caching will happen.
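The caching behavior can be sketched with a dictionary as the stash and a hash of the text as the key. Everything here is a stand-in: CachingParser and fake_parse are hypothetical, a dict replaces the Stash, and SHA-256 is an assumed hash function rather than what the library's Hasher uses.

```python
import hashlib
from typing import Callable, Dict, List, Optional


class CachingParser:
    """Hypothetical sketch: cache delegate results keyed by text hash."""

    def __init__(self, delegate: Callable[[str], List[str]],
                 stash: Optional[Dict[str, List[str]]] = None):
        self.delegate = delegate
        self.stash = stash

    def parse(self, text: str) -> List[str]:
        if self.stash is None:
            # no stash provided, so no caching happens
            return self.delegate(text)
        key = hashlib.sha256(text.encode('utf-8')).hexdigest()
        if key not in self.stash:
            self.stash[key] = self.delegate(text)
        return self.stash[key]

    def clear(self) -> None:
        if self.stash is not None:
            self.stash.clear()


calls: List[str] = []


def fake_parse(text: str) -> List[str]:
    calls.append(text)  # record each real parse
    return text.split()


parser = CachingParser(fake_parse, stash={})
first = parser.parse('a b')
second = parser.parse('a b')  # served from the stash
```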

class zensols.nlp.parser.Component(name, pipe_name=None, pipe_config=None, pipe_add_kwargs=<factory>, modules=(), initializers=())[source]

Bases: object

A pipeline component to be added to the spaCy model.

__init__(name, pipe_name=None, pipe_config=None, pipe_add_kwargs=<factory>, modules=(), initializers=())
init(model)[source]

Initialize the component and add it to the NLP pipeline. This base class implementation loads the module, then calls Language.add_pipe().

Parameters:

model (Language) – the spaCy model to which this component is added (nlp in spaCy parlance)

initializers: Tuple[ComponentInitializer, ...] = ()

Instances to initialize upon this object’s initialization.

modules: Sequence[str] = ()

The modules to import before adding component pipelines. This will register components mentioned in components when the respective module is loaded.

name: str

The section name.

pipe_add_kwargs: Dict[str, Any]

Arguments to add along with the call to add_pipe().

pipe_config: Dict[str, str] = None

The configuration to add with the config kwarg in the Language.add_pipe() call to the spaCy model.

pipe_name: str = None

The pipeline component name to add to the pipeline. If None, use name.

class zensols.nlp.parser.ComponentInitializer[source]

Bases: ABC

Called by Component to do post spaCy initialization.

abstract init_nlp_model(model, component)[source]

Do any post spaCy initialization on the referred framework.

class zensols.nlp.parser.DecoratedFeatureDocumentParser(name, delegate, token_decorators=(), sentence_decorators=(), document_decorators=(), token_feature_ids=<factory>, silencer=None)[source]

Bases: FeatureDocumentParser

This class adapts the FeatureDocumentParser adaptors to the general case using a GoF decorator pattern. This is useful for any post processing needed on existing configured document parsers.

All decorators are processed in the following order:
  1. Token

  2. Sentence

  3. Document

Token features are stored in the delegate for those that have them. Otherwise, they are stored in instances of this class.
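The decorator ordering can be sketched with nested lists standing in for a document of sentences of tokens. The parse_with_decorators function is hypothetical, plain callables replace the decorator classes, and running the sentence decorators right after the tokens of that sentence (rather than after all tokens of the document) is an assumption.

```python
from typing import Callable, Dict, List, Sequence

Token = Dict[str, str]
Sentence = List[Token]
Document = List[Sentence]

log: List[str] = []


def parse_with_decorators(doc: Document,
                          token_decorators: Sequence[Callable] = (),
                          sentence_decorators: Sequence[Callable] = (),
                          document_decorators: Sequence[Callable] = ()) -> Document:
    for sent in doc:
        for tok in sent:
            for dec in token_decorators:      # 1. token
                dec(tok)
        for dec in sentence_decorators:       # 2. sentence
            dec(sent)
    for dec in document_decorators:           # 3. document
        dec(doc)
    return doc


doc = [[{'norm': 'Hi'}, {'norm': 'there'}]]
parse_with_decorators(
    doc,
    token_decorators=[lambda t: log.append('token')],
    sentence_decorators=[lambda s: log.append('sentence')],
    document_decorators=[lambda d: log.append('document')])
```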

__init__(name, delegate, token_decorators=(), sentence_decorators=(), document_decorators=(), token_feature_ids=<factory>, silencer=None)
decorate(doc)[source]
delegate: FeatureDocumentParser

Used to create the feature documents.

document_decorators: Sequence[FeatureDocumentDecorator] = ()

A list of decorators that can add, remove or modify features on a document.

name: str

The name of the parser, which is taken from the section name when created with a ConfigFactory and used for debugging.

parse(text, *args, **kwargs)[source]

Parse text given either as a single string or as a list of sentences.

Parameters:
  • text (str) – either a string or a list of strings; if the former a document with one sentence will be created, otherwise a document is returned with a sentence for each string in the list

  • args – the arguments used to create the FeatureDocument instance

  • kwargs – the key word arguments used to create the FeatureDocument instance

Return type:

FeatureDocument

sentence_decorators: Sequence[FeatureSentenceDecorator] = ()

A list of decorators that can add, remove or modify features on a sentence.

silencer: WarningSilencer = None

Optionally suppress warnings the parser generates.

token_decorators: Sequence[FeatureTokenDecorator] = ()

A list of decorators that can add, remove or modify features on a token.

token_feature_ids: Set[str]

The features to keep from spaCy tokens. See class documentation.

See:

TOKEN_FEATURE_IDS

class zensols.nlp.parser.FeatureDocumentDecorator[source]

Bases: FeatureTokenContainerDecorator

Implementations can add, remove or modify features on a document.

abstract decorate(doc)[source]
class zensols.nlp.parser.FeatureDocumentParser[source]

Bases: PersistableContainer, Dictable

This class parses text into FeatureDocument instances using parse().

TOKEN_FEATURE_IDS: ClassVar[Set[str]] = frozenset({'children', 'dep', 'dep_', 'ent', 'ent_', 'ent_iob', 'ent_iob_', 'i', 'i_sent', 'idx', 'is_contraction', 'is_ent', 'is_pronoun', 'is_punctuation', 'is_space', 'is_stop', 'is_superlative', 'is_wh', 'lemma_', 'lexspan', 'norm', 'norm_len', 'pos_', 'sent_i', 'shape', 'shape_', 'tag', 'tag_'})

The default value for token_feature_ids.

__init__()
static default_instance()[source]

Create the parser as configured in the resource library of the package.

Return type:

FeatureDocumentParser

abstract parse(text, *args, **kwargs)[source]

Parse text given either as a single string or as a list of sentences.

Parameters:
  • text (str) – either a string or a list of strings; if the former a document with one sentence will be created, otherwise a document is returned with a sentence for each string in the list

  • args – the arguments used to create the FeatureDocument instance

  • kwargs – the key word arguments used to create the FeatureDocument instance

Return type:

FeatureDocument

class zensols.nlp.parser.FeatureSentenceDecorator[source]

Bases: FeatureTokenContainerDecorator

Implementations can add, remove or modify features on a sentence.

abstract decorate(sent)[source]
class zensols.nlp.parser.FeatureSentenceFactory(token_decorators=())[source]

Bases: object

Create a FeatureSentence out of single tokens or split on whitespace. This is a utility class to create data structures when only single tokens are the source data.

For example, if you only have tokens that need to be scored with Unigram Rouge-1, use this class to create sentences, which is a subclass of TokenContainer.

__init__(token_decorators=())
create(tokens)[source]

Create a sentence from tokens.

Parameters:

tokens (Union[str, Iterable[str]]) – if a string, then split on white space

Return type:

FeatureSentence

token_decorators: Sequence[FeatureTokenDecorator] = ()

A list of decorators that can add, remove or modify features on a token.

class zensols.nlp.parser.FeatureTokenContainerDecorator[source]

Bases: ABC

Implementations can add, remove or modify features on a token container.

abstract decorate(container)[source]
class zensols.nlp.parser.FeatureTokenDecorator[source]

Bases: ABC

Implementations can add, remove or modify features on a token.

abstract decorate(token)[source]
class zensols.nlp.parser.WhiteSpaceTokenizerFeatureDocumentParser(sent_class=<class 'zensols.nlp.container.FeatureSentence'>, doc_class=<class 'zensols.nlp.container.FeatureDocument'>)[source]

Bases: FeatureDocumentParser

This class parses text into FeatureDocument instances, tokenizing only by whitespace. This parser does no sentence chunking, so documents have one and only one sentence for each parse.

__init__(sent_class=<class 'zensols.nlp.container.FeatureSentence'>, doc_class=<class 'zensols.nlp.container.FeatureDocument'>)
doc_class

The type of document instances to create.

alias of FeatureDocument

parse(text, *args, **kwargs)[source]

Parse text given either as a single string or as a list of sentences.

Parameters:
  • text (str) – either a string or a list of strings; if the former a document with one sentence will be created, otherwise a document is returned with a sentence for each string in the list

  • args – the arguments used to create the FeatureDocument instance

  • kwargs – the key word arguments used to create the FeatureDocument instance

Return type:

FeatureDocument

sent_class

The type of sentence instances to create.

alias of FeatureSentence

zensols.nlp.score module

Produces matching scores.

class zensols.nlp.score.BleuScoreMethod(reverse_sents=False, smoothing_function=None, weights=(0.25, 0.25, 0.25, 0.25), silence_warnings=False)[source]

Bases: ScoreMethod

The BLEU scoring method using the nltk package. The first sentences are the references and the second are the hypothesis.

__init__(reverse_sents=False, smoothing_function=None, weights=(0.25, 0.25, 0.25, 0.25), silence_warnings=False)
silence_warnings: bool = False

Silence the BLEU warning of n-grams not matching: "The hypothesis contains 0 counts of 3-gram overlaps...".

smoothing_function: SmoothingFunction = None

This is an implementation of the smoothing techniques for segment-level BLEU scores.

Citation:

Chen and Cherry (2014) A Systematic Comparison of Smoothing Techniques for Sentence-Level BLEU. In WMT14.

weights: Tuple[float, ...] = (0.25, 0.25, 0.25, 0.25)

Weights for each n-gram. For example: a tuple of float weights for unigrams, bigrams, trigrams and so on can be given: weights = (0.1, 0.3, 0.5, 0.1).

class zensols.nlp.score.ErrorScore(method, exception, replace_score=None)[source]

Bases: Score

A replacement instance when scoring fails from a raised exception.

__init__(method, exception, replace_score=None)
asrow(meth)[source]
Return type:

Dict[str, float]

exception: Exception

The exception that was raised.

method: str

The method of the ScoreMethod that raised the exception.

replace_score: Score = None

The score to use in place of this score. Otherwise asrow() returns a single numpy.nan like FloatScore.

class zensols.nlp.score.ExactMatchScoreMethod(reverse_sents=False, equality_measure='norm')[source]

Bases: ScoreMethod

A scoring method that returns 1 for exact matches and 0 otherwise.

__init__(reverse_sents=False, equality_measure='norm')
equality_measure: str = 'norm'

The method by which to compare, which is one of:

  • norm: compare with TokenContainer.norm()

  • text: compare with TokenContainer.text

  • equal: compare using a Python object __eq__ equal compare, which also compares the token values

class zensols.nlp.score.FloatScore(value)[source]

Bases: Score

Float container. This is needed to create the flat result container structure. Object creation becomes less important since most clients will use ScoreSet.asnumpy().

NAN_INSTANCE: ClassVar[FloatScore] = FloatScore(value=nan)

Used to add to ErrorScore for harmonic means replacements.

__init__(value)
asrow(meth)[source]
Return type:

Dict[str, float]

value: float

The value of score.

class zensols.nlp.score.HarmonicMeanScore(precision, recall, f_score)[source]

Bases: Score

A score having a precision, recall and the harmonic mean of the two, F-score.

NAN_INSTANCE: ClassVar[HarmonicMeanScore] = HarmonicMeanScore(precision=nan, recall=nan, f_score=nan)

Used to add to ErrorScore for harmonic means replacements.

__init__(precision, recall, f_score)
f_score: float
precision: float
recall: float
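For reference, the relationship between the three fields can be sketched as follows. The helper f_score and its beta parameter are assumptions for illustration; how the library's score methods compute the value is not shown in this reference.

```python
def f_score(precision: float, recall: float, beta: float = 1.0) -> float:
    """Return the F-beta score; beta=1 gives the harmonic mean (F1)."""
    denom = (beta ** 2) * precision + recall
    if denom == 0:
        # both precision and recall are zero
        return 0.0
    return (1 + beta ** 2) * precision * recall / denom


f1 = f_score(precision=0.5, recall=1.0)
```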
class zensols.nlp.score.LevenshteinDistanceScoreMethod(reverse_sents=False, form='canon', normalize=True)[source]

Bases: ScoreMethod

A scoring method that computes the Levenshtein distance.

__init__(reverse_sents=False, form='canon', normalize=True)
form: str = 'canon'

The form of the text used for the evaluation, which is one of:

normalize: bool = True

Whether to normalize the return value as the distance divided by the maximum length of both sentences.
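A minimal sketch of the normalization described above, using a plain dynamic programming edit distance. The function names are hypothetical, and the library may delegate to an external package rather than compute the distance this way.

```python
def levenshtein(a: str, b: str) -> int:
    """Edit distance using the two-row dynamic programming formulation."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        curr = [i]
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            curr.append(min(prev[j] + 1,          # deletion
                            curr[j - 1] + 1,      # insertion
                            prev[j - 1] + cost))  # substitution
        prev = curr
    return prev[-1]


def normalized_distance(a: str, b: str) -> float:
    """Distance divided by the length of the longer string."""
    longest = max(len(a), len(b))
    return levenshtein(a, b) / longest if longest else 0.0
```

Dividing by the longer length keeps the value in the range 0 to 1, which makes scores comparable across sentence pairs of different lengths.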

class zensols.nlp.score.RougeScoreMethod(reverse_sents=False, feature_tokenizer=True)[source]

Bases: ScoreMethod

The ROUGE scoring method using the rouge_score package.

__init__(reverse_sents=False, feature_tokenizer=True)
feature_tokenizer: bool = True

Whether to use the TokenContainer tokenization, otherwise use the rouge_score package.

class zensols.nlp.score.Score[source]

Bases: Dictable

Individual scores returned from ScoreMethod.

__init__()
asrow(meth)[source]
Return type:

Dict[str, float]

class zensols.nlp.score.ScoreContext(pairs, methods=None, norm=True, correlation_ids=None)[source]

Bases: Dictable

Input needed to create score(s) using Scorer.

__init__(pairs, methods=None, norm=True, correlation_ids=None)
correlation_ids: Tuple[Union[int, str]] = None

The IDs to correlate with each sentence pair, or None to skip correlating them. The length of this tuple must be that of pairs.

methods: Set[str] = None

A set of strings, each indicating the ScoreMethod used to score pairs.

norm: bool = True

Whether to use the normalized tokens, otherwise use the original text.

pairs: Tuple[Tuple[TokenContainer, TokenContainer]]

Sentence, span or document pairs to score (order matters for some scoring methods such as rouge). Depending on the scoring method the ordering of the sentence pairs should be:

  • (<summary>, <source>)

  • (<gold>, <prediction>)

  • (<references>, <candidates>)

See ScoreMethod implementations for more information about pair ordering.

validate()[source]
class zensols.nlp.score.ScoreMethod(reverse_sents=False)[source]

Bases: ABC

An abstract base class for scoring methods (bleu, rouge, etc).

__init__(reverse_sents=False)
classmethod is_available()[source]

Whether or not this method is available on this system.

Return type:

bool

classmethod missing_modules()[source]

Return a list of missing modules needed by this score method.

Return type:

Tuple[str]

reverse_sents: bool = False

Whether to reverse the order of the sentences.

score(meth, context)[source]

Score the sentences in context using method identifier meth.

Parameters:
  • meth (str) – the identifier such as bleu

  • context (ScoreContext) – the context containing the data to score

Return type:

Iterable[Score]

Returns:

the results, which are usually float or Score

class zensols.nlp.score.ScoreResult(scores, correlation_id=None)[source]

Bases: Dictable

A result of scores created by a ScoreMethod.

__init__(scores, correlation_id=None)
correlation_id: Optional[str] = None

An ID for correlating back to the TokenContainer.

scores: Dict[str, Tuple[Score]]

The scores by method name.

write(depth=0, writer=<_io.TextIOWrapper name='<stdout>' mode='w' encoding='utf-8'>)[source]

Write this instance as either a Writable or as a Dictable. If class attribute _DICTABLE_WRITABLE_DESCENDANTS is set as True, then use the write() method on children instead of writing the generated dictionary. Otherwise, write this instance by first creating a dict recursively using asdict(), then formatting the output.

If the attribute _DICTABLE_WRITE_EXCLUDES is set, those attributes are removed from what is written in the write() method.

Note that this attribute will need to be set in all descendants in the instance hierarchy since writing the object instance graph is done recursively.

Parameters:
  • depth (int) – the starting indentation depth

  • writer (TextIOBase) – the writer to dump the content of this writable

class zensols.nlp.score.ScoreSet(results, correlation_id_col='id')[source]

Bases: Dictable

All scores returned from Scorer.

__init__(results, correlation_id_col='id')
as_dataframe(add_correlation=True)[source]

This gets data from as_numpy() and returns it as a Pandas dataframe.

Parameters:

add_correlation (bool) – whether to add the correlation ID (if there is one), using correlation_id_col

Return type:

pandas.DataFrame

Returns:

an instance of pandas.DataFrame of the results

as_numpy(add_correlation=True)[source]

Return the NumPy array with column descriptors of the results. spaCy depends on NumPy, so this package will always be available.

Parameters:

add_correlation (bool) – whether to add the correlation ID (if there is one), using correlation_id_col

Return type:

Tuple[List[str], ndarray]
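The shape of this return value (column descriptors plus a table of values) can be sketched with plain lists. The function flatten_results, its column ordering, and the use of lists in place of an ndarray are all assumptions made for illustration.

```python
from typing import Dict, List, Optional, Sequence, Tuple


def flatten_results(results: Sequence[Dict[str, float]],
                    correlation_ids: Optional[Sequence[str]] = None,
                    correlation_id_col: str = 'id',
                    ) -> Tuple[List[str], List[list]]:
    """Return column names and one row of values per scored pair."""
    # one column per method name seen in any result
    cols = sorted({name for row in results for name in row})
    rows = [[row.get(name, float('nan')) for name in cols] for row in results]
    if correlation_ids is not None:
        # prepend the correlation ID as the first column
        cols = [correlation_id_col] + cols
        rows = [[cid] + row for cid, row in zip(correlation_ids, rows)]
    return cols, rows


cols, rows = flatten_results(
    [{'bleu': 0.5, 'exact': 0.0}, {'bleu': 1.0, 'exact': 1.0}],
    correlation_ids=['s1', 's2'])
```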

correlation_id_col: str = 'id'

The column name for the ScoreResult.correlation_id added to Numpy arrays and Pandas dataframes. If None, then the correlation IDs are used as the index.

property has_correlation_id: bool

Whether the results have correlation IDs.

results: Tuple[ScoreResult]

A tuple with each element having the results of the respective sentence pair in ScoreContext.pairs. Each element is a dictionary with the method names as the keys and the results output by the ScoreMethod as the values. This is created in Scorer.

write(depth=0, writer=<_io.TextIOWrapper name='<stdout>' mode='w' encoding='utf-8'>)[source]

Write this instance as either a Writable or as a Dictable. If class attribute _DICTABLE_WRITABLE_DESCENDANTS is set as True, then use the write() method on children instead of writing the generated dictionary. Otherwise, write this instance by first creating a dict recursively using asdict(), then formatting the output.

If the attribute _DICTABLE_WRITE_EXCLUDES is set, those attributes are removed from what is written in the write() method.

Note that this attribute will need to be set in all descendants in the instance hierarchy since writing the object instance graph is done recursively.

Parameters:
  • depth (int) – the starting indentation depth

  • writer (TextIOBase) – the writer to dump the content of this writable

class zensols.nlp.score.Scorer(methods=None, default_methods=None)[source]

Bases: object

A class that scores sentences using a set of registered methods (methods).

__init__(methods=None, default_methods=None)
default_methods: Set[str] = None

Methods (keys from methods) to use when none are provided in ScoreContext.methods in the call to score().

methods: Dict[str, ScoreMethod] = None

The registered scoring methods available, which are accessed from ScoreContext.methods.

score(context)[source]

Score the sentences in context.

Parameters:

context (ScoreContext) – the context containing the data to score

Return type:

ScoreSet

Returns:

the results for each method indicated in context
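The dispatch described for Scorer (registered methods, a default set, and a replacement value when a method raises) can be sketched in plain Python. Everything here is hypothetical: score_pairs, the string pairs, and the toy scoring functions are illustrations and not the library's classes.

```python
import math
from typing import Callable, Dict, Optional, Sequence, Set, Tuple

Pair = Tuple[str, str]
ScoreFn = Callable[[str, str], float]


def score_pairs(pairs: Sequence[Pair],
                methods: Dict[str, ScoreFn],
                requested: Optional[Set[str]] = None,
                default_methods: Optional[Set[str]] = None,
                ) -> Tuple[Dict[str, float], ...]:
    """Return one ``{method name: score}`` dictionary per pair."""
    # fall back to the defaults, then to every registered method
    names = requested or default_methods or set(methods)
    results = []
    for a, b in pairs:
        row = {}
        for name in sorted(names):
            try:
                row[name] = methods[name](a, b)
            except Exception:
                # keep going and record a NaN in place of the failed score
                row[name] = math.nan
        results.append(row)
    return tuple(results)


methods = {
    'exact': lambda a, b: 1.0 if a == b else 0.0,
    'len_ratio': lambda a, b: len(a) / len(b),  # fails on an empty string
}
results = score_pairs([('a b', 'a b'), ('a', '')], methods)
```

Recording a NaN instead of aborting mirrors the role ErrorScore plays: one failing pair does not prevent the remaining pairs from being scored.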

exception zensols.nlp.score.ScorerError[source]

Bases: NLPError

Raised for any scoring errors (this module).


zensols.nlp.serial module

Serializes FeatureToken and TokenContainer instances using the Dictable interface.

class zensols.nlp.serial.Include(value, names=None, *, module=None, qualname=None, type=None, start=1, boundary=None)[source]

Bases: Enum

Indicates what to include at each level.

normal = 2

The normalized form of the text.

original = 1

The original text.

sentences = 4

The sentences of the FeatureDocument.

tokens = 3

The tokens of the TokenContainer.

class zensols.nlp.serial.Serialized(container, includes, feature_ids)[source]

Bases: Dictable

A base strategy class that can serialize TokenContainer instances.

__init__(container, includes, feature_ids)
container: TokenContainer

The container to be serialized.

feature_ids: Tuple[str, ...]

The feature IDs used when serializing tokens.

includes: Set[Include]

The things to be included at the level of the subclass serializer.

class zensols.nlp.serial.SerializedFeatureDocument(container, includes, feature_ids, sentence_includes)[source]

Bases: Serialized

A serializer for feature documents. The container has to be an instance of a FeatureDocument.

__init__(container, includes, feature_ids, sentence_includes)
sentence_includes: Set[Include]

The list of things to include in the sentences of the document.

class zensols.nlp.serial.SerializedTokenContainer(container, includes, feature_ids)[source]

Bases: Serialized

Serializes instances of TokenContainer. This is used to serialize spans and sentences.

__init__(container, includes, feature_ids)
class zensols.nlp.serial.SerializedTokenContainerFactory(sentence_includes, document_includes, feature_ids=None)[source]

Bases: Dictable

Creates instances of Serialized from instances of TokenContainer. These can then be used as Dictable instances, specifically with the asdict and asjson methods.

__init__(sentence_includes, document_includes, feature_ids=None)
create(container)[source]

Create a serializer from container (see class docs).

Parameters:

container (TokenContainer) – the container to be serialized

Return type:

Serialized

Returns:

an object that can be serialized using the asdict and asjson methods

document_includes: Set[Union[Include, str]]

The things to be included in documents.

feature_ids: Tuple[str, ...] = None

The feature IDs used when serializing tokens.

sentence_includes: Set[Union[Include, str]]

The things to be included in sentences.

zensols.nlp.spannorm module

Normalize spans (of tokens) into strings by reconstructing based on language rules from the normalized form of the tokens. This is needed after any token manipulation from TokenNormalizer or other changes to FeatureToken.norm.

For now, only English is supported, but the module is provided for other languages and future enhancements of normalization configuration.

class zensols.nlp.spannorm.EnglishSpanNormalizer(post_space_skip=frozenset({'(', '-', '<', '[', '`', '{', '‘', '“'}), pre_space_skip=frozenset({"'d", "'ll", "'m", "'re", "'s", "'ve", '-', "n't"}), keep_space_skip=frozenset({'_'}), canonical_delimiter='|')[source]

Bases: SpanNormalizer

An implementation of a span normalizer for the English language.

__init__(post_space_skip=frozenset({'(', '-', '<', '[', '`', '{', '‘', '“'}), pre_space_skip=frozenset({"'d", "'ll", "'m", "'re", "'s", "'ve", '-', "n't"}), keep_space_skip=frozenset({'_'}), canonical_delimiter='|')
canonical_delimiter: str = '|'

The token delimiter used in canonical.

get_canonical(tokens)[source]

A canonical representation of the container, which consists of the non-space tokens separated by CANONICAL_DELIMITER.

Return type:

str

get_norm(tokens, use_norm)[source]

Create a string that follows the language spacing rules.

Parameters:
  • tokens (Iterable[FeatureToken]) – the tokens to normalize

  • use_norm (bool) – whether to use the token normalized or orthographic text

Return type:

str

keep_space_skip: Set[str] = frozenset({'_'})

Characters that retain space on both sides.

post_space_skip: Set[str] = frozenset({'(', '-', '<', '[', '`', '{', '‘', '“'})

Characters after which no space is added for span normalization.

pre_space_skip: Set[str] = frozenset({"'d", "'ll", "'m", "'re", "'s", "'ve", '-', "n't"})

Characters before which no space is added for span normalization.
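A rough sketch of how the two skip sets could drive spacing when tokens are joined back into a string. The functions join_tokens and canonical are hypothetical and ignore other rules the real normalizer may apply (for example, closing punctuation and keep_space_skip).

```python
from typing import Iterable

POST_SPACE_SKIP = frozenset({'(', '-', '<', '[', '`', '{', '‘', '“'})
PRE_SPACE_SKIP = frozenset({"'d", "'ll", "'m", "'re", "'s", "'ve", '-', "n't"})


def join_tokens(tokens: Iterable[str]) -> str:
    """Join tokens with single spaces, honoring the two skip sets."""
    parts = []
    prev = None
    for tok in tokens:
        # add a space unless the previous token suppresses the space after
        # it or the current token suppresses the space before it
        if prev is not None and prev not in POST_SPACE_SKIP \
           and tok not in PRE_SPACE_SKIP:
            parts.append(' ')
        parts.append(tok)
        prev = tok
    return ''.join(parts)


def canonical(tokens: Iterable[str], delimiter: str = '|') -> str:
    """Join the non-space tokens with the canonical delimiter."""
    return delimiter.join(t for t in tokens if not t.isspace())
```

The hyphen appears in both sets, which is what lets a sequence such as well, -, known collapse back into a single hyphenated word.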

class zensols.nlp.spannorm.SpanNormalizer[source]

Bases: object

Subclasses normalize feature tokens on a per spacy.Language. All subclasses must be re-entrant.

abstract get_canonical(tokens)[source]

A canonical representation of the container, which consists of the non-space tokens separated by CANONICAL_DELIMITER.

Return type:

str

abstract get_norm(tokens, use_norm)[source]

Create a string that follows the language spacing rules.

Parameters:
  • tokens (Iterable[FeatureToken]) – the tokens to normalize

  • use_norm (bool) – whether to use the token normalized or orthographic text

Return type:

str

zensols.nlp.sparser module

The spaCy FeatureDocumentParser implementation.

class zensols.nlp.sparser.SpacyFeatureDocumentParser(config_factory, name, lang='en', model_name=None, token_feature_ids=<factory>, components=(), token_decorators=(), sentence_decorators=(), document_decorators=(), disable_component_names=None, token_normalizer=None, special_case_tokens=<factory>, doc_class=<class 'zensols.nlp.container.FeatureDocument'>, sent_class=<class 'zensols.nlp.container.FeatureSentence'>, token_class=<class 'zensols.nlp.tok.SpacyFeatureToken'>, remove_empty_sentences=None, reload_components=False, auto_install_model=False)[source]

Bases: FeatureDocumentParser

This language resource parses text into spaCy documents. Loaded spaCy models have the attribute doc_parser set to enable creation of factory instances from registered pipe components (i.e. specified by Component).

Configuration example:

[doc_parser]
class_name = zensols.nlp.sparser.SpacyFeatureDocumentParser
lang = en
model_name = ${lang}_core_web_sm
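A quick way to see how the ${lang} reference in the model name resolves, assuming the interpolation behaves like the standard library's configparser.ExtendedInterpolation. The library's own configuration loader is not used here.

```python
from configparser import ConfigParser, ExtendedInterpolation

CONFIG = """
[doc_parser]
class_name = zensols.nlp.sparser.SpacyFeatureDocumentParser
lang = en
model_name = ${lang}_core_web_sm
"""

parser = ConfigParser(interpolation=ExtendedInterpolation())
parser.read_string(CONFIG)
# ${lang} is looked up in the same section before the value is returned
model_name = parser['doc_parser']['model_name']
```

Changing lang to another language code would change the resolved model name without editing model_name itself.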

Decorators are processed in the same way as DecoratedFeatureDocumentParser.

__init__(config_factory, name, lang='en', model_name=None, token_feature_ids=<factory>, components=(), token_decorators=(), sentence_decorators=(), document_decorators=(), disable_component_names=None, token_normalizer=None, special_case_tokens=<factory>, doc_class=<class 'zensols.nlp.container.FeatureDocument'>, sent_class=<class 'zensols.nlp.container.FeatureSentence'>, token_class=<class 'zensols.nlp.tok.SpacyFeatureToken'>, remove_empty_sentences=None, reload_components=False, auto_install_model=False)
auto_install_model: bool = False

Whether to install models not already available. Note that this uses the pip command to download model requirements, which might have an adverse effect of replacing currently installed Python packages.

classmethod clear_models()[source]

Clears all cached models.

components: Sequence[Component] = ()

Additional Spacy components to add to the pipeline.

config_factory: ConfigFactory

A configuration parser optionally used by pipeline Component instances.

disable_component_names: Sequence[str] = None

Components to disable in the spaCy model when creating documents in parse().

doc_class

The type of document instances to create.

alias of FeatureDocument

document_decorators: Sequence[FeatureDocumentDecorator] = ()

A list of decorators that can add, remove or modify features on a document.

from_spacy_doc(doc, *args, text=None, **kwargs)[source]

Create a FeatureDocument from a spaCy doc.

Parameters:
  • doc (Doc) – the spaCy generated document to transform in to a feature document

  • text (str) – either a string or a list of strings; if the former a document with one sentence will be created, otherwise a document is returned with a sentence for each string in the list

  • args – the arguments used to create the FeatureDocument instance

  • kwargs – the key word arguments used to create the FeatureDocument instance

Return type:

FeatureDocument

get_dictable(doc)[source]

Return a dictionary object graph and pretty prints spaCy docs.

Return type:

Dictable

lang: str = 'en'

The natural language that identifies the model.

property model: Language

The spaCy model. On first access, this creates a new instance using model_name.

model_name: str = None

The spaCy model name (defaults to en_core_web_sm); this is ignored if model is not None.

name: str

The name of the parser, which is taken from the section name when created with a ConfigFactory and used for debugging.

parse(text, *args, **kwargs)[source]

Parse text or a text as a list of sentences.

Parameters:
  • text (str) – either a string or a list of strings; if the former a document with one sentence will be created, otherwise a document is returned with a sentence for each string in the list

  • args – the arguments used to create the FeatureDocument instance

  • kwargs – the key word arguments used to create the FeatureDocument instance

Return type:

FeatureDocument

parse_spacy_doc(text)[source]

Parse text in to a Spacy document.

Return type:

Doc

reload_components: bool = False

Removes, then re-adds components for cached models. This is helpful for when there are component configurations that change on reruns with a different application context but in the same Python interpreter session.

A spaCy component can get other instances via config_factory, but if this is False it will be paired with the first instance of this class and not the new ones created with a new configuration factory.

remove_empty_sentences: bool = None

Deprecated and will be removed from future versions. Use FilterSentenceFeatureDocumentDecorator instead.

sent_class

The type of sentence instances to create.

alias of FeatureSentence

sentence_decorators: Sequence[FeatureSentenceDecorator] = ()

A list of decorators that can add, remove or modify features on a sentence.

special_case_tokens: List

Tokens that will be parsed as one token, i.e. </s>.

to_spacy_doc(doc, norm=True, add_features=None)[source]

Convert a feature document back in to a spaCy document.

Note: not all data is copied–only text, pos_, tag_, lemma_ and dep_.

Parameters:
  • doc (FeatureDocument) – the feature document to convert

  • norm (bool) – whether to use the normalized text as the orth_ spaCy token attribute or text

  • add_features – whether to add POS, NER tags, lemmas, heads and dependencies

Return type:

Doc

Returns:

the spaCy document with data copied from doc

token_class

The type of token instances to create.

alias of SpacyFeatureToken

token_decorators: Sequence[FeatureTokenDecorator] = ()

A list of decorators that can add, remove or modify features on a token.

token_feature_ids: Set[str]

The features to keep from spaCy tokens.

See:

TOKEN_FEATURE_IDS

token_normalizer: TokenNormalizer = None

The token normalizer for methods that use it, i.e. features.

zensols.nlp.stemmer module

Stem text using the Porter stemmer.

class zensols.nlp.stemmer.PorterStemmerTokenMapper(stemmer=<factory>)[source]

Bases: TokenMapper

Use the Porter stemmer from NLTK to stem normalized tokens.

__init__(stemmer=<factory>)
map_tokens(token_tups)[source]

Transform token tuples.

stemmer: PorterStemmer

zensols.nlp.tok module

Feature token and related base classes

class zensols.nlp.tok.FeatureToken(i, idx, i_sent, norm)[source]

Bases: PersistableContainer, TextContainer

A container class for features about a token. Subclasses such as SpacyFeatureToken extract only a subset of features from the heavy spaCy C data structures, which are hard/expensive to pickle. Instances of this token class are almost always detached, meaning the underlying in-memory data structures have been copied as pure Python types to facilitate serialization of spaCy tokens.

Feature note: features i, idx and i_sent are always added to feature tokens, and always included, so that sentences can be reconstructed (see FeatureDocument.uncombine_sentences()).

FEATURE_IDS: ClassVar[Set[str]] = frozenset({'children', 'dep', 'dep_', 'ent', 'ent_', 'ent_iob', 'ent_iob_', 'i', 'i_sent', 'idx', 'is_contraction', 'is_ent', 'is_pronoun', 'is_punctuation', 'is_space', 'is_stop', 'is_superlative', 'is_wh', 'lemma_', 'lexspan', 'norm', 'norm_len', 'pos_', 'sent_i', 'shape', 'shape_', 'tag', 'tag_'})

All default available feature IDs.

FEATURE_IDS_BY_TYPE: ClassVar[Dict[str, Set[str]]] = {'bool': frozenset({'is_contraction', 'is_ent', 'is_pronoun', 'is_space', 'is_stop', 'is_superlative', 'is_wh'}), 'int': frozenset({'dep', 'ent', 'ent_iob', 'i', 'i_sent', 'idx', 'is_punctuation', 'norm_len', 'sent_i', 'shape', 'tag'}), 'list': frozenset({'children'}), 'object': frozenset({'lexspan'}), 'str': frozenset({'dep_', 'ent_', 'ent_iob_', 'lemma_', 'norm', 'pos_', 'shape_', 'tag_'})}

Map of class type to set of feature IDs.

NONE: ClassVar[str] = '-<N>-'

Default string for not a feature, or missing features.

REQUIRED_FEATURE_IDS: ClassVar[Set[str]] = frozenset({'i', 'i_sent', 'idx', 'lexspan', 'norm'})

Features retained regardless of configuration for basic functionality.

SKIP_COMPARE_FEATURE_IDS: ClassVar[Set[str]] = {}

A set of feature IDs to avoid comparing in __eq__().

TYPES_BY_FEATURE_ID: ClassVar[Dict[str, str]] = {'children': 'list', 'dep': 'int', 'dep_': 'str', 'ent': 'int', 'ent_': 'str', 'ent_iob': 'int', 'ent_iob_': 'str', 'i': 'int', 'i_sent': 'int', 'idx': 'int', 'is_contraction': 'bool', 'is_ent': 'bool', 'is_pronoun': 'bool', 'is_punctuation': 'int', 'is_space': 'bool', 'is_stop': 'bool', 'is_superlative': 'bool', 'is_wh': 'bool', 'lemma_': 'str', 'lexspan': 'object', 'norm': 'str', 'norm_len': 'int', 'pos_': 'str', 'sent_i': 'int', 'shape': 'int', 'shape_': 'str', 'tag': 'int', 'tag_': 'str'}

A map of feature ID to string type. This is used by FeatureToken.write_attributes() to dump the type features.
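The two class-level maps are inverses of one another. A small sketch of that relationship, using only an excerpt of the documented values; whether the library derives one map from the other is an assumption.

```python
from typing import Dict, FrozenSet

# a small excerpt of the documented mapping, not the full table
FEATURE_IDS_BY_TYPE: Dict[str, FrozenSet[str]] = {
    'bool': frozenset({'is_stop', 'is_space'}),
    'int': frozenset({'i', 'idx', 'i_sent'}),
    'str': frozenset({'norm', 'pos_'}),
}

# invert "type -> feature IDs" into "feature ID -> type"
TYPES_BY_FEATURE_ID: Dict[str, str] = {
    feature_id: type_name
    for type_name, feature_ids in FEATURE_IDS_BY_TYPE.items()
    for feature_id in feature_ids}
```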

WRITABLE_FEATURE_IDS: ClassVar[Tuple[str, ...]] = ('text', 'norm', 'idx', 'sent_i', 'i', 'i_sent', 'tag', 'pos', 'is_wh', 'entity', 'dep', 'children')

Feature IDs that are dumped on write() and write_attributes().

__init__(i, idx, i_sent, norm)
clone(cls=None, **kwargs)[source]

Clone an instance of this token.

Parameters:
  • cls (Type) – the type of the new instance

  • kwargs – arguments to add to as attributes to the clone

Return type:

FeatureToken

Returns:

the cloned instance of this instance

property default_detached_feature_ids: Set[str] | None

The default set of feature IDs used when cloning or detaching with clone() or detach().

detach(feature_ids=None, skip_missing=False, cls=None)[source]

Create a detached token (i.e. from spaCy artifacts).

Parameters:
  • feature_ids (Set[str]) – the features to write, which defaults to FEATURE_IDS

  • skip_missing (bool) – whether to only keep feature_ids

  • cls (Type[FeatureToken]) – the type of the new instance

Return type:

FeatureToken

get_feature(feature_id, expect=True, check_none=False, message=None)[source]

Return a feature by the feature ID.

Parameters:
  • feature_id (str) – the ID of the feature to retrieve

  • expect (bool) – whether to raise an error if the feature does not exist

  • message (str) – additional context to append to the error message

  • check_none (bool) – whether to check if the feature has an unset value such as NONE, as determined by is_none(), in which case None is returned

Raises:

MissingFeatureError – if expect is True and the feature does not exist

Return type:

Optional[Any]

get_features(feature_ids=None, skip_missing=False)[source]

Get features as a dict.

Parameters:
  • feature_ids (Iterable[str]) – the features to write, which defaults to FEATURE_IDS

  • skip_missing (bool) – whether to only keep feature_ids

Return type:

Dict[str, Any]

i: int

The index of the token within the parent document.

i_sent: int

The index of the token within the parent sentence. This is not to be confused with the index of the sentence to which the token belongs, which is sent_i.

idx: int

The character offset of the token within the parent document.
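The four positional features are easy to mix up, so here is a toy illustration on a two-sentence text. The tokenization is hand-written and the dictionaries are hypothetical stand-ins for tokens; no parser is involved.

```python
from typing import Dict, List

text = 'Dogs bark. Cats purr.'
sentences = [['Dogs', 'bark', '.'], ['Cats', 'purr', '.']]

rows: List[Dict[str, object]] = []
i = 0        # running token index within the document
offset = 0   # where to resume searching for the next token's characters
for sent_i, sent in enumerate(sentences):
    for i_sent, tok in enumerate(sent):
        idx = text.index(tok, offset)  # character offset in the document
        rows.append({'norm': tok, 'i': i, 'i_sent': i_sent,
                     'sent_i': sent_i, 'idx': idx})
        offset = idx + len(tok)
        i += 1
```

For the token Cats this gives i=3 (fourth token of the document), i_sent=0 (first token of its sentence), sent_i=1 (second sentence) and idx=11 (character offset).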

property is_detached: bool

Whether this token has been detached.

property is_none: bool

Return whether or not this token is represented as none or empty.

long_repr()[source]
Return type:

str

norm: str

Normalized text, which is the text/orth or the named entity if tagged as a named entity.

set_feature(feature_id, value)[source]

Set, or add if nonexistent, a feature to this token instance. If the token has been detached, it will be added to the default_detached_feature_ids.

Parameters:
  • feature_id (str) – the ID of the feature to set

  • value (Any) – the new or replaced value of the feature

split(positions)[source]

Split on text normal index positions. This needs and updates the idx and lexspan attributes.

Parameters:

positions (Iterable[int]) – 0-indexes into norm indicating where to split

Return type:

List[FeatureToken]

Returns:

new (cloned) tokens along the boundaries of positions
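A sketch of the bookkeeping a split involves, tracking only the text pieces and their character offsets. The helper split_norm is hypothetical; the real method returns cloned tokens and also updates lexspan.

```python
from typing import Iterable, List, Tuple


def split_norm(norm: str, idx: int,
               positions: Iterable[int]) -> List[Tuple[str, int]]:
    """Return ``(piece, character offset)`` for each piece of ``norm``."""
    # keep only interior positions, then add both ends as boundaries
    inner = sorted({p for p in positions if 0 < p < len(norm)})
    bounds = [0] + inner + [len(norm)]
    return [(norm[b:e], idx + b) for b, e in zip(bounds, bounds[1:])]


# a token 'well-known' starting at document character offset 10
pieces = split_norm('well-known', idx=10, positions=[4, 5])
```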

property text: str

The initial text before normalized by any TokenNormalizer.

to_vector(feature_ids=None)[source]

Return an iterable of feature data.

Return type:

Iterable[str]

write(depth=0, writer=<_io.TextIOWrapper name='<stdout>' mode='w' encoding='utf-8'>, include_type=True, feature_ids=None, inline=False)[source]

Write this instance as either a Writable or as a Dictable. If class attribute _DICTABLE_WRITABLE_DESCENDANTS is set as True, then use the write() method on children instead of writing the generated dictionary. Otherwise, write this instance by first creating a dict recursively using asdict(), then formatting the output.

If the attribute _DICTABLE_WRITE_EXCLUDES is set, those attributes are removed from what is written in the write() method.

Note that this attribute will need to be set in all descendants in the instance hierarchy since writing the object instance graph is done recursively.

Parameters:
  • depth (int) – the starting indentation depth

  • writer (TextIOBase) – the writer to dump the content of this writable

write_attributes(depth=0, writer=<_io.TextIOWrapper name='<stdout>' mode='w' encoding='utf-8'>, include_type=True, feature_ids=None, inline=False, include_none=True)[source]

Write feature attributes.

Parameters:
  • depth (int) – the starting indentation depth

  • writer (TextIOBase) – the writer to dump the content of this writable

  • include_type (bool) – if True write the type of value (if available)

  • feature_ids (Iterable[str]) – the features to write, which defaults to WRITABLE_FEATURE_IDS

  • inline (bool) – whether to print attributes all on the same line

class zensols.nlp.tok.SpacyFeatureToken(spacy_token, norm)[source]

Bases: FeatureToken

Contains and provides the same features as a spaCy Token.

__init__(spacy_token, norm)[source]
property children

A sequence of the token’s immediate syntactic children.

conll_iob_()[source]

Return the CoNLL formatted IOB tag, such as B-ORG for a beginning organization token.

Return type:

str
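The CoNLL tag format pairs the IOB marker with the entity label. A minimal sketch, where conll_iob is a hypothetical helper and the handling of tokens outside any entity (a bare O) is an assumption.

```python
from typing import Optional


def conll_iob(ent_iob: str, ent_label: Optional[str] = None) -> str:
    """Combine an IOB marker and an entity label into a CoNLL style tag."""
    if ent_iob == 'O' or not ent_label:
        # tokens outside any entity carry no label
        return 'O'
    return f'{ent_iob}-{ent_label}'
```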

property dep: int

Syntactic dependency relation.

property dep_: str

Syntactic dependency relation string representation.

property ent: int

Return the entity numeric value or 0 if this is not an entity.

property ent_: str

Return the entity string label or None if this token has no entity.

property ent_iob: int

Return the entity IOB tag, which is I for in, O for out, and B for begin.

property ent_iob_: str

Return the entity IOB nominal index for ent_iob.

property is_contraction: bool

Return True if this token is a contraction.

property is_pronoun: bool

Return True if this is a pronoun (i.e. ‘he’) token.

property is_punctuation: bool

Return True if this is a punctuation (i.e. ‘?’) token.

property is_space: bool

Return True if this token is white space only.

property is_stop: bool

Return True if this is a stop word.

property is_superlative: bool

Return True if this token is the superlative.

property is_wh: bool

Return True if this is a WH word (i.e. what, where).

property lemma_: str

Return the string lemma or text of the named entity if tagged as a named entity.

property lexspan: LexicalSpan

The document indexed lexical span using idx.

property norm_len: int

The length of the norm in characters.

property pos: int

The simple UPOS part-of-speech tag.

property pos_: str

The simple UPOS part-of-speech tag.

property sent_i: int

The index of the sentence to which the token belongs. This is not to be confused with the index of the token in the respective sentence, which is FeatureToken.i_sent.

This attribute does not exist in a spaCy token, and was named as such to follow the naming conventions of their API.

property shape: int

Transform of the token’s string, to show orthographic features. For example, “Xxxx” or “dd”.

property shape_: str

Transform of the token’s string, to show orthographic features. For example, “Xxxx” or “dd”.
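An approximation of this kind of orthographic shape transform, written from the general description (letters become x or X, digits become d, long runs are truncated). The function word_shape is a sketch and may differ from spaCy's implementation in edge cases.

```python
def word_shape(text: str) -> str:
    """Approximate an orthographic shape string for ``text``."""
    shape = []
    last = ''
    run = 0
    for char in text:
        if char.isalpha():
            shape_char = 'X' if char.isupper() else 'x'
        elif char.isdigit():
            shape_char = 'd'
        else:
            shape_char = char  # punctuation and symbols are kept as-is
        if shape_char == last:
            run += 1
        else:
            run = 0
            last = shape_char
        # keep at most four repeats of the same shape character
        if run < 4:
            shape.append(shape_char)
    return ''.join(shape)
```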

spacy_token: Union[Token, Span]

The parsed spaCy token (or span if entity) on which this feature set is based.

See:

FeatureDocument.spacy_doc()

property tag: int

Fine-grained part-of-speech text.

property tag_: str

Fine-grained part-of-speech text.

property token: Token

Return the spaCy token.

Module contents