zensols.nlp package¶
Submodules¶
zensols.nlp.chunker module¶
Classes that segment text from FeatureDocument instances, but retain the original structure by preserving sentence and token indices.
- class zensols.nlp.chunker.Chunker(doc, pattern, sub_doc=None, char_offset=None)[source]¶
Bases: object
Splits TokenContainer instances using a regular expression pattern. Matched containers (the implementation of the container is based on the subclass) are given if used as an iterable. The document of all parsed containers is given if used as a callable.
- __init__(doc, pattern, sub_doc=None, char_offset=None)¶
- char_offset: int = None¶
The 0-index absolute character offset where sub_doc starts. However, if the value is -1, then the offset used is the beginning character offset of the first token in the sub_doc.
- doc: FeatureDocument¶
The document that contains the entire text (i.e. Note).
- sub_doc: FeatureDocument = None¶
A document created from a lexical span of doc, which defaults to the global document. Providing this and char_offset allows use of a document without having to use TokenContainer.reindex().
- class zensols.nlp.chunker.ListItemChunker(doc, pattern=re.compile('^((?:[0-9-+]+|[a-zA-Z]+:)[^\\\\n]+)$', re.MULTILINE), sub_doc=None, char_offset=None)[source]¶
Bases: Chunker
A Chunker that splits list items and enumerated lists into separate sentences. Matched sentences are given if used as an iterable. This is useful when spaCy sentence-chunks lists incorrectly; it finds lists using a regular expression that matches lines starting with a decimal or with list characters such as - and +.
- DEFAULT_SPAN_PATTERN: ClassVar[Pattern] = re.compile('^((?:[0-9-+]+|[a-zA-Z]+:)[^\\n]+)$', re.MULTILINE)¶
The default list item regular expression, which uses an initial character item notation or an initial enumeration digit.
- __init__(doc, pattern=re.compile('^((?:[0-9-+]+|[a-zA-Z]+:)[^\\\\n]+)$', re.MULTILINE), sub_doc=None, char_offset=None)¶
- pattern: Pattern = re.compile('^((?:[0-9-+]+|[a-zA-Z]+:)[^\\n]+)$', re.MULTILINE)¶
The list item regular expression, which defaults to DEFAULT_SPAN_PATTERN.
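Usage sketch of the iterable/callable protocol described above (hedged: the sample text and top-level imports are illustrative assumptions, not prescribed by this API):
from zensols.nlp import FeatureDocument, FeatureDocumentParser
from zensols.nlp.chunker import ListItemChunker

# assumes the package's default configured parser
parser: FeatureDocumentParser = FeatureDocumentParser.default_instance()
doc: FeatureDocument = parser.parse('Todo:\n1. buy milk\n2. fix the car')
chunker = ListItemChunker(doc)
# as an iterable: each matched list item becomes its own sentence
for sent in chunker:
    print(sent.norm)
# as a callable: a document re-chunked along the matched items
rechunked: FeatureDocument = chunker()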
- class zensols.nlp.chunker.ParagraphChunker(doc, pattern=re.compile('(.+?)(?:(?=\\\\n{2})|\\\\Z)', re.MULTILINE | re.DOTALL), sub_doc=None, char_offset=None)[source]¶
Bases: Chunker
A Chunker that splits a document into its constituent paragraphs. Matched paragraphs are given if used as an iterable. For this reason, this class will probably be used as an iterable, since clients will usually want just the separated paragraphs as documents.
- DEFAULT_SPAN_PATTERN: ClassVar[Pattern] = re.compile('(.+?)(?:(?=\\n{2})|\\Z)', re.MULTILINE|re.DOTALL)¶
The default paragraph regular expression, which uses two-newline positive lookaheads to avoid matching on paragraph spacing.
- __init__(doc, pattern=re.compile('(.+?)(?:(?=\\\\n{2})|\\\\Z)', re.MULTILINE | re.DOTALL), sub_doc=None, char_offset=None)¶
- pattern: Pattern = re.compile('(.+?)(?:(?=\\n{2})|\\Z)', re.MULTILINE|re.DOTALL)¶
The paragraph regular expression, which defaults to DEFAULT_SPAN_PATTERN.
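A similar hedged sketch for paragraphs; the sample text is hypothetical:
from zensols.nlp import FeatureDocumentParser
from zensols.nlp.chunker import ParagraphChunker

parser = FeatureDocumentParser.default_instance()
doc = parser.parse('First paragraph.\n\nSecond paragraph.')
# each paragraph is yielded as its own container when iterated
paragraphs = list(ParagraphChunker(doc))
print(len(paragraphs))  # 2 for the two-paragraph text above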
zensols.nlp.combine module¶
A class that combines features.
- class zensols.nlp.combine.CombinerFeatureDocumentParser(name, delegate, token_decorators=(), sentence_decorators=(), document_decorators=(), token_feature_ids=<factory>, silencer=None, source_parsers=None, validate_features=<factory>, yield_features=<factory>, yield_feature_defaults=None, overwrite_features=<factory>, overwrite_nones=False, map_features=<factory>)[source]¶
Bases: DecoratedFeatureDocumentParser
A class that combines features from two FeatureDocumentParser instances. Features parsed using each source_parser are optionally copied or overwritten on a token-by-token basis in the feature document parsed by this instance. The target tokens are sometimes added to or clobbered from the source, but not the other way around.
- __init__(name, delegate, token_decorators=(), sentence_decorators=(), document_decorators=(), token_feature_ids=<factory>, silencer=None, source_parsers=None, validate_features=<factory>, yield_features=<factory>, yield_feature_defaults=None, overwrite_features=<factory>, overwrite_nones=False, map_features=<factory>)¶
- map_features: List[Tuple[str, str, Any]]¶
Like yield_features, but the feature ID can be different from the source to the target. Each tuple has the form: (<source feature ID>, <target feature ID>, <default for missing>)
- overwrite_features: List[str]¶
A list of features to be copied/overwritten in the order given in the list.
- overwrite_nones: bool = False¶
Whether to write None for missing overwrite_features. This always writes the target feature; if you want to write only when the source is not set or missing, then use yield_features.
- parse(text, *args, **kwargs)[source]¶
Parse text, given either as a single string or as a list of sentence strings.
- Parameters:
text (str) – either a string or a list of strings; if the former, a document with one sentence will be created, otherwise a document is returned with a sentence for each string in the list
args – the arguments used to create the FeatureDocument instance
kwargs – the keyword arguments used to create the FeatureDocument instance
- Return type:
- source_parsers: List[FeatureDocumentParser] = None¶
The language resources used to parse documents and create token attributes.
- validate_features: Set[str]¶
A set of features to compare across all tokens when copying. If any of the given features don't match, a token mismatch error is raised.
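A hedged construction sketch based only on the dataclass signature above; the choice of overwritten features (pos_, tag_) and the use of the default parser for both roles are illustrative assumptions:
from zensols.nlp import FeatureDocumentParser
from zensols.nlp.combine import CombinerFeatureDocumentParser

# in practice the delegate and source parsers would be configured differently
base = FeatureDocumentParser.default_instance()
source = FeatureDocumentParser.default_instance()
combiner = CombinerFeatureDocumentParser(
    name='combiner',
    delegate=base,
    source_parsers=[source],
    overwrite_features=['pos_', 'tag_'])
doc = combiner.parse('Features from the source parser overwrite the target.')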
- class zensols.nlp.combine.MappingCombinerFeatureDocumentParser(name, delegate, token_decorators=(), sentence_decorators=(), document_decorators=(), token_feature_ids=<factory>, silencer=None, source_parsers=None, validate_features=frozenset({'idx'}), yield_features=<factory>, yield_feature_defaults=None, overwrite_features=<factory>, overwrite_nones=False, map_features=<factory>, merge_sentences=True)[source]¶
Bases: CombinerFeatureDocumentParser
Maps the source to respective tokens in the target document using spaCy artifacts.
- __init__(name, delegate, token_decorators=(), sentence_decorators=(), document_decorators=(), token_feature_ids=<factory>, silencer=None, source_parsers=None, validate_features=frozenset({'idx'}), yield_features=<factory>, yield_feature_defaults=None, overwrite_features=<factory>, overwrite_nones=False, map_features=<factory>, merge_sentences=True)¶
zensols.nlp.component module¶
Components useful for reuse.
- class zensols.nlp.component.EntityRecognizer(nlp, name, import_file, patterns)[source]¶
Bases: object
A base class for regular expression and spaCy match-pattern named entity recognizers. Both subclasses allow an optional label for each respective pattern or regular expression. If the label is provided, then the match is made a named entity with that label. In any case, a span is created on the token, and in some cases, retokenized.
- __init__(nlp, name, import_file, patterns)¶
- nlp: Language¶
The NLP model.
- class zensols.nlp.component.PatternEntityRecognizer(nlp, name, import_file, patterns)[source]¶
Bases: EntityRecognizer
Adds entities based on regular expressions.
- See:
- __init__(nlp, name, import_file, patterns)¶
- class zensols.nlp.component.RegexEntityRecognizer(nlp, name, import_file, patterns)[source]¶
Bases: EntityRecognizer
Merges regular expression matches as a Span. After matches are found, re-tokenization merges them into one token per match.
- __init__(nlp, name, import_file, patterns)¶
- class zensols.nlp.component.RegexSplitter(nlp, name, import_file, patterns)[source]¶
Bases: EntityRecognizer
Splits on regular expressions.
- __init__(nlp, name, import_file, patterns)¶
zensols.nlp.container module¶
Domain objects that define features associated with text.
- class zensols.nlp.container.FeatureDocument(sents, text=None, spacy_doc=None)[source]¶
Bases: TokenContainer
A container class of tokens that make up a document. This class contains a one-to-many relationship with sentences. However, it can be treated like any TokenContainer to fetch tokens. Instances of this class iterate over FeatureSentence instances.
- Parameters:
sents (Tuple[FeatureSentence, ...]) – the sentences defined for this document
- _combine_documents(docs, cls, concat_tokens, **kwargs)[source]¶
Override if there are any fields in your dataclass. In most cases, the only time this is called is by an embedding vectorizer to batch multiple sentences into a single document, so the only features that matter are at the sentence level.
- Parameters:
docs (Tuple[FeatureDocument, ...]) – the documents to combine into one
cls (Type[FeatureDocument]) – the class of the instance to create
concat_tokens (bool) – if True, each sentence of the returned document is the concatenated tokens of each respective document; otherwise simply concatenate sentences into one document
kwargs – additional keyword arguments to pass to the new feature document’s initializer
- Return type:
- EMPTY_DOCUMENT: ClassVar[FeatureDocument] = <>¶
A zero-length document.
- __init__(sents, text=None, spacy_doc=None)¶
- clone(cls=None, **kwargs)[source]¶
- Parameters:
kwargs – if copy_spacy is True, the spaCy document is copied to the clone, in addition to parameters passed to the new clone's initializer
- Return type:
- classmethod combine_documents(docs, concat_tokens=True, **kwargs)[source]¶
Coerce a tuple of token containers (either documents or sentences) into one synthesized document.
- Parameters:
docs (Iterable[FeatureDocument]) – the documents to combine into one
cls – the class of the instance to create
concat_tokens (bool) – if True, each sentence of the returned document is the concatenated tokens of each respective document; otherwise simply concatenate sentences into one document
kwargs – additional keyword arguments to pass to the new feature document’s initializer
- Return type:
- combine_sentences(sents=None)[source]¶
Combine the sentences in this document into a new document with a single sentence.
- Parameters:
sents (Iterable[FeatureSentence]) – the sentences to combine in the new document, or all if None
- Return type:
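Example of the two combining operations above (a minimal sketch; the sample sentences are hypothetical):
from zensols.nlp import FeatureDocumentParser
from zensols.nlp.container import FeatureDocument

parser = FeatureDocumentParser.default_instance()
d1 = parser.parse('First document.')
d2 = parser.parse('Second document.')
# synthesize one document; concat_tokens=False keeps a sentence per source
combined = FeatureDocument.combine_documents((d1, d2), concat_tokens=False)
# collapse all sentences into a new single-sentence document
single = combined.combine_sentences()
print(len(combined.sents), len(single.sents))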
- from_sentences(sents, deep=False)[source]¶
Return a new cloned document using the given sentences.
- Parameters:
sents (Iterable[FeatureSentence]) – the sentences to add to the new cloned document
deep (bool) – whether or not to clone the sentences
- See:
- Return type:
- get_overlapping_document(span, inclusive=True)[source]¶
Get the portion of the document that overlaps span. Sentences completely enclosed in the span are copied. Otherwise, new sentences are created from those tokens that overlap the span.
- Parameters:
span (LexicalSpan) – indicates the portion of the document to retain
inclusive (bool) – whether to include +1 on the end component in the check
- Return type:
- Returns:
a new document that contains the 0-index offset of span
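Example (a hedged sketch; the text and span boundaries are hypothetical):
from zensols.nlp import FeatureDocumentParser
from zensols.nlp.domain import LexicalSpan

parser = FeatureDocumentParser.default_instance()
doc = parser.parse('The first sentence. The second sentence.')
# keep only the document portion covering the first 19 characters
sub_doc = doc.get_overlapping_document(LexicalSpan(0, 19))
print(sub_doc.text)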
- get_overlapping_sentences(span, inclusive=True)[source]¶
Return sentences that overlap with span from this document.
- Parameters:
span (LexicalSpan) – indicates the portion of the document to retain
inclusive (bool) – whether to include +1 on the end component in the check
- Return type:
- get_overlapping_span(span, inclusive=True)[source]¶
Return a feature span that includes the lexical scope of span.
- Return type:
- property max_sentence_len: int¶
Return the length of tokens from the longest sentence in the document.
- sentence_index_for_token(token)[source]¶
Return the index of the parent sentence having token.
- Return type:
- sentences_for_tokens(tokens)[source]¶
Find sentences having a set of tokens.
- Parameters:
tokens (Tuple[FeatureToken, ...]) – the query used to find containing sentences
- Return type:
- Returns:
the document ordered tuple of sentences containing tokens
- sents: Tuple[FeatureSentence, ...]¶
The sentences that make up the document.
- spacy_doc: Doc = None¶
The parsed spaCy document this feature set is based on. As explained in FeatureToken, spaCy documents are heavyweight and problematic to pickle. For this reason, this attribute is dropped when pickled, and is only here for ad hoc predictions.
- to_sentence(limit=9223372036854775807, contiguous_i_sent=False, delim='')[source]¶
Coerce this instance to a single sentence. No token data is updated, so FeatureToken.i_sent values keep their original indexes. These sentence indexes will be inconsistent when called on a FeatureDocument unless contiguous_i_sent is set to True.
- Parameters:
limit (int) – the max number of sentences to create (only the starting sentences are kept)
contiguous_i_sent (Union[str, bool]) – if True, ensures all tokens have a FeatureToken.i_sent value that is contiguous for the returned instance; if this value is reset, the token indices start from 0
delim (str) – a string added between each constituent sentence
- Return type:
- Returns:
an instance of FeatureSentence that represents this token sequence
- token_iter(*args, **kwargs)[source]¶
Return an iterator over the token features.
- Parameters:
args – the arguments given to itertools.islice()
- Return type:
- uncombine_sentences()[source]¶
Reconstruct the sentence structure that we combined in combine_sentences(). If that has not been done in this instance, then return self.
- Return type:
- update_entity_spans(include_idx=True)[source]¶
Update the token entity to the norm text. This is helpful when entities are embedded after splitting text, which becomes the FeatureToken.norm values. However, the token spans still index the original entities that are multi-word, which leads to norms that are not equal to the text spans. This synchronizes the token span indexes with the norms.
- Parameters:
include_idx (bool) – whether to update SpacyFeatureToken.idx as well
- update_indexes()[source]¶
Update all FeatureToken.i attributes to those provided by tokens_by_i. This corrects the many-to-one token index mapping for split multi-word named entities.
- See:
tokens_by_i
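The pieces above compose as follows (a hedged sketch; the text is hypothetical and the printed attributes are among the token features documented in zensols.nlp.parser):
from zensols.nlp import FeatureDocumentParser

parser = FeatureDocumentParser.default_instance()
doc = parser.parse('Obama was born in Hawaii. He was a senator.')
for sent in doc.sents:
    print(sent.norm)
# named entities as multi-word feature spans
for ent in doc.entities:
    print(ent.norm)
# the token index map described in update_indexes()
tok = doc.tokens_by_i[0]
print(tok.norm, tok.lexspan)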
- class zensols.nlp.container.FeatureSentence(tokens, text=None, spacy_span=None)[source]¶
Bases: FeatureSpan
A container class of tokens that make a sentence. Instances of this class iterate over FeatureToken instances, and can create documents with to_document().
- EMPTY_SENTENCE: ClassVar[FeatureSentence] = <>¶
- __init__(tokens, text=None, spacy_span=None)¶
- get_overlapping_span(span, inclusive=True)[source]¶
Return a feature span that includes the lexical scope of span.
- Return type:
- to_sentence(limit=9223372036854775807, contiguous_i_sent=False, delim='')[source]¶
Coerce this instance to a single sentence. No token data is updated, so FeatureToken.i_sent values keep their original indexes. These sentence indexes will be inconsistent when called on a FeatureDocument unless contiguous_i_sent is set to True.
- Parameters:
limit (int) – the max number of sentences to create (only the starting sentences are kept)
contiguous_i_sent (Union[str, bool]) – if True, ensures all tokens have a FeatureToken.i_sent value that is contiguous for the returned instance; if this value is reset, the token indices start from 0
delim (str) – a string added between each constituent sentence
- Return type:
- Returns:
an instance of FeatureSentence that represents this token sequence
- class zensols.nlp.container.FeatureSpan(tokens, text=None, spacy_span=None)[source]¶
Bases: TokenContainer
A span of tokens as a TokenContainer, much like spacy.tokens.Span.
- __init__(tokens, text=None, spacy_span=None)¶
- clone(cls=None, **kwargs)[source]¶
Clone an instance of this token container.
- Parameters:
cls (Type[TokenContainer]) – the type of the new instance
kwargs – arguments to add as attributes to the clone
- Return type:
- Returns:
the cloned instance of this instance
- property dependency_tree: Dict[FeatureToken, List[Dict[FeatureToken]]]¶
- spacy_span: Span = None¶
The parsed spaCy span this feature set is based on.
- See:
FeatureDocument.spacy_doc()
- to_sentence(limit=9223372036854775807, contiguous_i_sent=False, delim='')[source]¶
Coerce this instance to a single sentence. No token data is updated, so FeatureToken.i_sent values keep their original indexes. These sentence indexes will be inconsistent when called on a FeatureDocument unless contiguous_i_sent is set to True.
- Parameters:
limit (int) – the max number of sentences to create (only the starting sentences are kept)
contiguous_i_sent (Union[str, bool]) – if True, ensures all tokens have a FeatureToken.i_sent value that is contiguous for the returned instance; if this value is reset, the token indices start from 0
delim (str) – a string added between each constituent sentence
- Return type:
- Returns:
an instance of FeatureSentence that represents this token sequence
- token_iter(*args, **kwargs)[source]¶
Return an iterator over the token features.
- Parameters:
args – the arguments given to itertools.islice()
- Return type:
- property tokens: Tuple[FeatureToken, ...]¶
The tokens that make up the span.
- property tokens_by_i_sent: Dict[int, FeatureToken]¶
A map of tokens with keys as their position offset within the sentence and values as tokens.
- See:
zensols.nlp.FeatureToken.i
- update_entity_spans(include_idx=True)[source]¶
Update the token entity to the norm text. This is helpful when entities are embedded after splitting text, which becomes the FeatureToken.norm values. However, the token spans still index the original entities that are multi-word, which leads to norms that are not equal to the text spans. This synchronizes the token span indexes with the norms.
- Parameters:
include_idx (bool) – whether to update SpacyFeatureToken.idx as well
- update_indexes()[source]¶
Update all FeatureToken.i attributes to those provided by tokens_by_i. This corrects the many-to-one token index mapping for split multi-word named entities.
- See:
tokens_by_i
- class zensols.nlp.container.TokenAnnotatedFeatureDocument(sents, text=None, spacy_doc=None)[source]¶
Bases: FeatureDocument
A feature document that contains token annotations. Sentences can be modeled with TokenAnnotatedFeatureSentence or just FeatureSentence, since this sets the annotations attribute when combining.
- __init__(sents, text=None, spacy_doc=None)¶
- combine_sentences(**kwargs) FeatureDocument¶
Combine all the sentences in this document into a new document with a single sentence.
- Return type:
FeatureDocument
- class zensols.nlp.container.TokenAnnotatedFeatureSentence(tokens, text=None, spacy_span=None, annotations=())[source]¶
Bases: FeatureSentence
A feature sentence that contains token annotations.
- __init__(tokens, text=None, spacy_span=None, annotations=())¶
- write(depth=0, writer=<_io.TextIOWrapper name='<stdout>' mode='w' encoding='utf-8'>, **kwargs)[source]¶
Write the text container.
- Parameters:
include_original – whether to include the original text
include_normalized – whether to include the normalized text
n_tokens – the number of tokens to write
inline – whether to print the tokens on one line each
- class zensols.nlp.container.TokenContainer[source]¶
Bases: PersistableContainer, TextContainer
A base class for token container classes such as FeatureSentence and FeatureDocument. In addition to the defined methods, each instance has a text attribute, which is the original text of the document.
- property canonical: str¶
A canonical representation of the container, consisting of the non-space tokens separated by CANONICAL_DELIMITER.
- clone(cls=None, **kwargs)[source]¶
Clone an instance of this token container.
- Parameters:
cls (Type[TokenContainer]) – the type of the new instance
kwargs – arguments to add as attributes to the clone
- Return type:
- Returns:
the cloned instance of this instance
- property entities: Tuple[FeatureSpan, ...]¶
The named entities of the container, with each multi-word entity as an element.
- get_overlapping_span(span, inclusive=True)[source]¶
Return a feature span that includes the lexical scope of span.
- Return type:
- get_overlapping_tokens(span, inclusive=True)[source]¶
Get all tokens that overlap the lexical span span.
- Parameters:
span (LexicalSpan) – the document 0-index character-based inclusive span to compare with FeatureToken.lexspan
inclusive (bool) – whether to include +1 on the end component in the check
- Return type:
- Returns:
a token sequence containing the 0-index offset of span
- property lexspan: LexicalSpan¶
The document-indexed lexical span using idx.
- map_overlapping_tokens(spans, inclusive=True)[source]¶
Return a tuple of tokens, each tuple in the range given by the respective span in spans.
- Parameters:
spans (Iterable[LexicalSpan]) – the document 0-index character-based inclusive spans to compare with FeatureToken.lexspan
inclusive (bool) – whether to include +1 on the end component in the check
- Return type:
- Returns:
a tuple of matching tokens for the respective span query
- property norm_orth: str¶
The normalized version of the sentence using the original rather than the token-normalized text.
- reindex(reference_token=None)[source]¶
Re-index tokens, which is useful for situations where a 0-index offset is assumed for sub-documents created with FeatureDocument.get_overlapping_document() or FeatureDocument.get_overlapping_sentences(). The following data are modified:
FeatureToken.sent_i (see SpacyFeatureToken.sent_i)
FeatureToken.lexspan (see SpacyFeatureToken.lexspan)
- set_entity_offsets(offsets)[source]¶
Set entities as a sequence of non-inclusive character offsets of (<begin>, <end>).
- strip(in_place=True)[source]¶
Strip beginning and ending whitespace (see strip_tokens()) and text.
- Return type:
- strip_token_iter(*args, **kwargs)[source]¶
Strip beginning and ending whitespace (see strip_tokens()) using token_iter().
- Return type:
- static strip_tokens(token_iter)[source]¶
Strip beginning and ending whitespace. This uses is_space, which is True for spaces, tabs and newlines.
- Parameters:
token_iter (Iterable[FeatureToken]) – a stream of tokens
- Return type:
- Returns:
non-whitespace middle tokens
- abstract to_document(limit=9223372036854775807)[source]¶
Coerce this instance into a document.
- Return type:
- abstract to_sentence(limit=9223372036854775807, contiguous_i_sent=False, delim='')[source]¶
Coerce this instance to a single sentence. No token data is updated, so FeatureToken.i_sent values keep their original indexes. These sentence indexes will be inconsistent when called on a FeatureDocument unless contiguous_i_sent is set to True.
- Parameters:
limit (int) – the max number of sentences to create (only the starting sentences are kept)
contiguous_i_sent (Union[str, bool]) – if True, ensures all tokens have a FeatureToken.i_sent value that is contiguous for the returned instance; if this value is reset, the token indices start from 0
delim (str) – a string added between each constituent sentence
- Return type:
- Returns:
an instance of FeatureSentence that represents this token sequence
- abstract token_iter(*args, **kwargs)[source]¶
Return an iterator over the token features.
- Parameters:
args – the arguments given to itertools.islice()
- Return type:
- property tokens: Tuple[FeatureToken, ...]¶
Return the token features as a tuple.
- property tokens_by_i: Dict[int, FeatureToken]¶
A map of tokens with keys as their position offset and values as tokens. The entries also include named entity tokens that are grouped as multi-word tokens. This is helpful for multi-word entities that were split (for example with SplitTokenMapper), and thus have many-to-one mapped indexes.
- See:
zensols.nlp.FeatureToken.i
- property tokens_by_idx: Dict[int, FeatureToken]¶
A map of tokens with keys as their character offset and values as tokens.
Limitations: Multi-word entities will have a mapping only for the first word of that entity if tokens were split by spaces (for example with SplitTokenMapper). However, tokens_by_i does not have this limitation.
- See:
tokens_by_i
- See:
zensols.nlp.FeatureToken.idx
- abstract update_entity_spans(include_idx=True)[source]¶
Update the token entity to the norm text. This is helpful when entities are embedded after splitting text, which becomes the FeatureToken.norm values. However, the token spans still index the original entities that are multi-word, which leads to norms that are not equal to the text spans. This synchronizes the token span indexes with the norms.
- Parameters:
include_idx (bool) – whether to update SpacyFeatureToken.idx as well
- update_indexes()[source]¶
Update all FeatureToken.i attributes to those provided by tokens_by_i. This corrects the many-to-one token index mapping for split multi-word named entities.
- See:
tokens_by_i
- write(depth=0, writer=<_io.TextIOWrapper name='<stdout>' mode='w' encoding='utf-8'>, include_original=False, include_normalized=True, n_tokens=9223372036854775807, inline=False, feature_ids=None)[source]¶
Write the text container.
zensols.nlp.dataframe module¶
zensols.nlp.decorate module¶
Contains useful classes for decorating feature sentences.
- class zensols.nlp.decorate.CopyFeatureTokenContainerDecorator(feature_ids)[source]¶
Bases: FeatureTokenContainerDecorator
Copies feature(s) for each token in the container. For each token, each source/target tuple pair in feature_ids is copied. If the feature is missing (this does not include existing FeatureToken.NONE values), an exception is raised.
- __init__(feature_ids)¶
- class zensols.nlp.decorate.FilterEmptySentenceDocumentDecorator(filter_space=True)[source]¶
Bases: FeatureDocumentDecorator
Filter zero-length sentences.
- __init__(filter_space=True)¶
- class zensols.nlp.decorate.FilterTokenSentenceDecorator(remove_stop=False, remove_space=False, remove_pronouns=False, remove_punctuation=False, remove_determiners=False, remove_empty=False)[source]¶
Bases: FeatureSentenceDecorator
A decorator that filters tokens from sentences based on the remove_* fields.
- __init__(remove_stop=False, remove_space=False, remove_pronouns=False, remove_punctuation=False, remove_determiners=False, remove_empty=False)¶
- class zensols.nlp.decorate.RemoveFeatureTokenContainerDecorator(exclude_feature_ids)[source]¶
Bases: FeatureTokenContainerDecorator
Removes features from each token in the container.
- __init__(exclude_feature_ids)¶
- class zensols.nlp.decorate.SplitTokenSentenceDecorator[source]¶
Bases: FeatureSentenceDecorator
A decorator that splits feature tokens by whitespace.
- __init__()¶
- class zensols.nlp.decorate.StripTokenContainerDecorator[source]¶
Bases: FeatureTokenContainerDecorator
A decorator that strips whitespace from sentences (or any TokenContainer).
- __init__()¶
- class zensols.nlp.decorate.UpdateTokenContainerDecorator(update_indexes=True, update_entity_spans=True, reindex=False)[source]¶
Bases: FeatureTokenContainerDecorator
Updates document indexes and spans (see fields).
- __init__(update_indexes=True, update_entity_spans=True, reindex=False)¶
- update_entity_spans: bool = True¶
Whether to update the document entity spans with FeatureDocument.update_entity_spans().
- update_indexes: bool = True¶
Whether to update the document indexes with FeatureDocument.update_indexes().
zensols.nlp.domain module¶
Interfaces, contracts and errors.
- class zensols.nlp.domain.LexicalSpan(begin, end)[source]¶
Bases: Dictable
A lexical character span of text in a document. The span has two positions: begin and end, each of which is also indexed as an operator. The left (begin) is inclusive and the right (end) is exclusive to conform to Python array slicing conventions. One span is less than another when its beginning position is less; when the beginning positions are the same, the one with the smaller end position is less.
The length of the span is the distance between the end and the beginning positions.
- EMPTY_SPAN: ClassVar[LexicalSpan] = (0, 0)¶
The span (0, 0).
- static gaps(spans, end=None)[source]¶
Return the spans for the “holes” in spans. For example, if spans is ((0, 5), (10, 12), (15, 17)), then return ((5, 10), (12, 15)).
- Parameters:
spans (Iterable[LexicalSpan]) – the spans used to find gaps
end (Optional[int]) – an end position for the last gap, so that if the end of the last item in spans does not match, another gap is added
- Return type:
- Returns:
a list of spans that “fill” any holes in
spans
- narrow(other)[source]¶
Return the shortest span that inclusively fits in both this and other.
- Parameters:
other (LexicalSpan) – the second span to narrow with this span
- Returns:
a span so that the beginning is maximized and the end is minimized, or None if the two spans do not overlap
- Return type:
- static overlaps(a0, a1, b0, b1, inclusive=True)[source]¶
Return whether or not one text span overlaps with another.
- Parameters:
inclusive (bool) – whether to include +1 on the end component in the check
- Returns:
True if any overlap is detected
- overlaps_with(other, inclusive=True)[source]¶
Return whether or not one text span overlaps non-inclusively with another.
- Parameters:
other (LexicalSpan) – the other location
inclusive (bool) – whether to include +1 on the end component in the check
- Return type:
- Returns:
True if any overlap is detected
- static widen(others)[source]¶
Take the span union by using the leftmost begin and the rightmost end.
- Parameters:
others (Iterable[LexicalSpan]) – the spans to union
- Return type:
- Returns:
the widest span that inclusively aggregates others, or None if an empty sequence is passed
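The span operations above in a short, self-contained sketch (the positions are arbitrary):
from zensols.nlp.domain import LexicalSpan

a = LexicalSpan(0, 5)
b = LexicalSpan(3, 10)
print(a.overlaps_with(b))         # True: [0, 5) and [3, 10) intersect
print(a.narrow(b))                # (3, 5): begin maximized, end minimized
print(LexicalSpan.widen((a, b)))  # (0, 10): leftmost begin, rightmost end
# the "holes" between non-contiguous spans
print(LexicalSpan.gaps((LexicalSpan(0, 5), LexicalSpan(10, 12))))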
- write(depth=0, writer=<_io.TextIOWrapper name='<stdout>' mode='w' encoding='utf-8'>)[source]¶
Write this instance as either a Writable or as a Dictable. If the class attribute _DICTABLE_WRITABLE_DESCENDANTS is set as True, then use the write() method on children instead of writing the generated dictionary. Otherwise, write this instance by first creating a dict recursively using asdict(), then formatting the output.
If the attribute _DICTABLE_WRITE_EXCLUDES is set, those attributes are removed from what is written in the write() method.
Note that this attribute will need to be set in all descendants in the instance hierarchy since writing the object instance graph is done recursively.
- Parameters:
depth (int) – the starting indentation depth
writer (TextIOBase) – the writer to dump the content of this writable
- exception zensols.nlp.domain.MissingFeatureError(token, feature_id, msg=None)[source]¶
Bases: NLPError
Raised on attempting to access a non-existent feature in FeatureToken.
- __init__(token, feature_id, msg=None)[source]¶
Initialize.
- Parameters:
token (FeatureToken) – the token for which access was attempted
feature_id (str) – the feature_id that is missing in token
- __module__ = 'zensols.nlp.domain'¶
- exception zensols.nlp.domain.NLPError[source]¶
Bases: APIError
Raised for any errors in this library.
- __annotations__ = {}¶
- __module__ = 'zensols.nlp.domain'¶
- exception zensols.nlp.domain.ParseError[source]¶
Bases: APIError
Raised for any parsing errors.
- __annotations__ = {}¶
- __module__ = 'zensols.nlp.domain'¶
- class zensols.nlp.domain.TextContainer[source]¶
Bases: Dictable
A writable class that has a text property or attribute. All subclasses need a norm attribute or property.
- write(depth=0, writer=<_io.TextIOWrapper name='<stdout>' mode='w' encoding='utf-8'>, include_original=True, include_normalized=True)[source]¶
Write this instance as either a Writable or as a Dictable. If the class attribute _DICTABLE_WRITABLE_DESCENDANTS is set as True, then use the write() method on children instead of writing the generated dictionary. Otherwise, write this instance by first creating a dict recursively using asdict(), then formatting the output.
If the attribute _DICTABLE_WRITE_EXCLUDES is set, those attributes are removed from what is written in the write() method.
Note that this attribute will need to be set in all descendants in the instance hierarchy since writing the object instance graph is done recursively.
- Parameters:
depth (int) – the starting indentation depth
writer (TextIOBase) – the writer to dump the content of this writable
zensols.nlp.index module¶
A heuristic text indexing and search class.
- class zensols.nlp.index.FeatureDocumentIndexer(doc)[source]¶
Bases: object
A utility class that indexes and searches for text in potentially whitespace-mangled documents. It does this by trying more efficient means first, then resorts to methods that are more computationally expensive.
- __init__(doc)¶
- doc: FeatureDocument¶
The document to index.
- property doc_tok_orths: Tuple[Tuple[str, FeatureToken], ...]¶
Return tuples of (<orthographic text>, <token>).
- find(query, sent_ix=None)[source]¶
Find a sentence in document doc. If a sentence index is given, it treats the query as a sentence to find in doc.
- Parameters:
query (TokenContainer) – the sentence to find in doc
sent_ix (int) – the sentence index hint, if available
- Return type:
- Returns:
the matched text from
doc
- property pack2ix: Dict[int, int]¶
Return a dictionary of character positions in the document (
doc) text to respective positions in the same string without whitespace.
- property text2sent: Dict[str, FeatureSentence]¶
Return a dictionary of sentence normalized text to respective sentence in
doc.
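A hedged usage sketch (the document text, the whitespace mangling and the sentence index hint are all hypothetical):
from zensols.nlp import FeatureDocumentParser
from zensols.nlp.index import FeatureDocumentIndexer

parser = FeatureDocumentParser.default_instance()
doc = parser.parse('A first sentence.  A  whitespace   mangled sentence.')
indexer = FeatureDocumentIndexer(doc)
query = parser.parse('A whitespace mangled sentence.')
# find the query sentence in the possibly mangled document text
match = indexer.find(query, sent_ix=1)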
zensols.nlp.nerscore module¶
Wraps the SemEval-2013 Task 9.1 NER evaluation API as a
ScoreMethod.
From the David Batista blog post:
SemEval-2013 introduced four different ways to measure precision/recall/F1-score results based on the metrics defined by MUC:
Strict: exact boundary surface string match and entity type
Exact: exact boundary match over the surface string, regardless of the type
Partial: partial boundary match over the surface string, regardless of the type
Type: some overlap between the system tagged entity and the gold annotation is required
Each of these ways to measure the performance accounts for correct, incorrect, partial, missed and spurious in different ways. Let’s look in detail and see how each of the metrics defined by MUC falls into each of the scenarios described above.
- class zensols.nlp.nerscore.SemEvalHarmonicMeanScore(precision, recall, f_score, correct, incorrect, partial, missed, spurious, possible, actual)[source]¶
Bases: HarmonicMeanScore
A harmonic mean score with the additional SemEval computed scores (see the module zensols.nlp.nerscore docs).
- NAN_INSTANCE: ClassVar[SemEvalHarmonicMeanScore] = SemEvalHarmonicMeanScore(precision=nan, recall=nan, f_score=nan, correct=nan, incorrect=nan, partial=nan, missed=nan, spurious=nan, possible=nan, actual=nan)¶
Used to add to ErrorScore for harmonic mean replacements.
- __init__(precision, recall, f_score, correct, incorrect, partial, missed, spurious, possible, actual)¶
- incorrect: int¶
The number of incorrect (INC): the output of a system and the gold annotation don’t match.
- class zensols.nlp.nerscore.SemEvalScore(strict, exact, partial, ent_type)[source]¶
Bases: Score
Contains all four harmonic mean SemEval scores (see the module zensols.nlp.nerscore docs). This score has four harmonic means providing various levels of accuracy.
- NAN_INSTANCE: ClassVar[SemEvalScore] = SemEvalScore(strict=SemEvalHarmonicMeanScore(precision=nan, recall=nan, f_score=nan, correct=nan, incorrect=nan, partial=nan, missed=nan, spurious=nan, possible=nan, actual=nan), exact=SemEvalHarmonicMeanScore(precision=nan, recall=nan, f_score=nan, correct=nan, incorrect=nan, partial=nan, missed=nan, spurious=nan, possible=nan, actual=nan), partial=SemEvalHarmonicMeanScore(precision=nan, recall=nan, f_score=nan, correct=nan, incorrect=nan, partial=nan, missed=nan, spurious=nan, possible=nan, actual=nan), ent_type=SemEvalHarmonicMeanScore(precision=nan, recall=nan, f_score=nan, correct=nan, incorrect=nan, partial=nan, missed=nan, spurious=nan, possible=nan, actual=nan))¶
- __init__(strict, exact, partial, ent_type)¶
- ent_type: SemEvalHarmonicMeanScore¶
Some overlap between the system-tagged entity and the gold annotation is required.
- exact: SemEvalHarmonicMeanScore¶
Exact boundary match over the surface string, regardless of the type.
- partial: SemEvalHarmonicMeanScore¶
Partial boundary match over the surface string, regardless of the type.
- strict: SemEvalHarmonicMeanScore¶
Exact boundary surface string match and entity type.
- class zensols.nlp.nerscore.SemEvalScoreMethod(reverse_sents=False, labels=None)[source]¶
Bases: ScoreMethod
A SemEval-2013 Task 9.1 score (see the module zensols.nlp.nerscore docs). This score has four harmonic means providing various levels of accuracy. Sentence pairs are ordered as (<gold>, <prediction>).
- __init__(reverse_sents=False, labels=None)¶
zensols.nlp.norm module¶
Normalize text and map Spacy documents.
- class zensols.nlp.norm.FilterRegularExpressionMapper(regex='[ ]+', invert=False)[source]¶
Bases: TokenMapper
Filter tokens based on a regular expression over the normalized form.
- __init__(regex='[ ]+', invert=False)¶
- class zensols.nlp.norm.FilterTokenMapper(remove_stop=False, remove_space=False, remove_pronouns=False, remove_punctuation=False, remove_determiners=False)[source]¶
Bases: TokenMapper
Filter tokens based on token (spaCy) attributes.
Configuration example:
[filter_token_mapper]
class_name = zensols.nlp.FilterTokenMapper
remove_stop = True
remove_punctuation = True
- __init__(remove_stop=False, remove_space=False, remove_pronouns=False, remove_punctuation=False, remove_determiners=False)¶
- class zensols.nlp.norm.JoinTokenMapper(regex='[ ]', separator=None)[source]¶
Bases: object
Join tokens based on a regular expression. It does this by creating spans in the spaCy component (first in the tuple) and using the span text as the normalized token.
- __init__(regex='[ ]', separator=None)¶
- class zensols.nlp.norm.LambdaTokenMapper(add_lambda=None, map_lambda=None)[source]¶
Bases: TokenMapper
Use a lambda expression to map a token tuple.
This is handy for specialized behavior that can be added directly to a configuration file.
Configuration example:
[lc_lambda_token_mapper]
class_name = zensols.nlp.LambdaTokenMapper
map_lambda = lambda x: (x[0], f'<{x[1].lower()}>')
- __init__(add_lambda=None, map_lambda=None)¶
- class zensols.nlp.norm.LemmatizeTokenMapper(lemmatize=True, remove_first_stop=False)[source]¶
Bases: TokenMapper
Lemmatize tokens and optionally remove entity stop words.
Important: This completely ignores the normalized input token string and essentially just replaces it with the lemma found in the token instance.
Configuration example:
[lemma_token_mapper]
class_name = zensols.nlp.LemmatizeTokenMapper
- Parameters:
- __init__(lemmatize=True, remove_first_stop=False)¶
- class zensols.nlp.norm.MapTokenNormalizer(embed_entities=True, config_factory=None, mapper_class_list=<factory>)[source]¶
Bases: TokenNormalizer
A normalizer that applies a sequence of TokenMapper instances to transform the normalized token text. The members of the mapper_class_list are sections of the application configuration.
Configuration example:
[map_filter_token_normalizer]
class_name = zensols.nlp.MapTokenNormalizer
mapper_class_list = list: filter_token_mapper
- __init__(embed_entities=True, config_factory=None, mapper_class_list=<factory>)¶
- config_factory: ConfigFactory = None¶
The factory that created this instance, used to create the mappers.
- class zensols.nlp.norm.SplitEntityTokenMapper(token_unit_type=False, copy_attributes=('label', 'label_'))[source]¶
Bases: TokenMapper
Splits embedded entities (or any Span) into separate tokens. This is useful for splitting up entities as tokens after being grouped with TokenNormalizer.embed_entities. Note that embed_entities must be True to create the entities, as they come from spaCy as spans. This then can be used to create SpacyFeatureToken instances with spans that have the entity.
- __init__(token_unit_type=False, copy_attributes=('label', 'label_'))¶
- class zensols.nlp.norm.SplitTokenMapper(regex='[ ]')[source]¶
Bases: TokenMapper
Splits the normalized text on a per-token basis with a regular expression.
Configuration example:
[split_token_mapper]
class_name = zensols.nlp.SplitTokenMapper
regex = r'[ ]'
- __init__(regex='[ ]')¶
- class zensols.nlp.norm.SubstituteTokenMapper(regex='', replace_char='')[source]¶
Bases: TokenMapper
Replace a regular expression in normalized token text.
Configuration example:
[subs_token_mapper]
class_name = zensols.nlp.SubstituteTokenMapper
regex = r'[ \t]'
replace_char = _
- __init__(regex='', replace_char='')¶
- class zensols.nlp.norm.TokenMapper[source]¶
Bases: ABC
Abstract class used to transform token tuples generated from TokenNormalizer.normalize().
- __init__()¶
- class zensols.nlp.norm.TokenNormalizer(embed_entities=True)[source]¶
Bases: object
The base token extractor, which returns tuples of tokens and their normalized versions.
Configuration example:
[default_token_normalizer]
class_name = zensols.nlp.TokenNormalizer
embed_entities = False
- __init__(embed_entities=True)¶
zensols.nlp.parser module¶
Parse documents and generate features in an organized taxonomy.
- class zensols.nlp.parser.CachingFeatureDocumentParser(name, delegate, token_decorators=(), sentence_decorators=(), document_decorators=(), token_feature_ids=<factory>, silencer=None, stash=None, hasher=<factory>)[source]¶
Bases: DecoratedFeatureDocumentParser
A document parser that persists previous parses using the hash of the text as a key. Caching is optional given the value of stash, which is useful in cases where this class is extended for use cases other than just caching.
- __init__(name, delegate, token_decorators=(), sentence_decorators=(), document_decorators=(), token_feature_ids=<factory>, silencer=None, stash=None, hasher=<factory>)¶
- parse(text, *args, **kwargs)[source]¶
Parse text, given either as a single string or as a list of sentence strings.
- Parameters:
text (str) – either a string or a list of strings; if the former, a document with one sentence will be created, otherwise a document is returned with a sentence for each string in the list
args – the arguments used to create the FeatureDocument instance
kwargs – the keyword arguments used to create the FeatureDocument instance
- Return type:
- class zensols.nlp.parser.Component(name, pipe_name=None, pipe_config=None, pipe_add_kwargs=<factory>, modules=(), initializers=())[source]¶
Bases: object
A pipeline component to be added to the spaCy model.
- __init__(name, pipe_name=None, pipe_config=None, pipe_add_kwargs=<factory>, modules=(), initializers=())¶
- init(model, parser)[source]¶
Initialize the component and add it to the NLP pipeline. This base class implementation loads the module, then calls Language.add_pipe().
- Parameters:
model (Language) – the spaCy model to add the component to (nlp in their parlance)
parser (FeatureDocumentParser) – the owning parser of this component instance
- initializers: Tuple[ComponentInitializer, ...] = ()¶
Instances to initialize upon this object’s initialization.
- modules: Sequence[str] = ()¶
The modules to import before adding component pipelines. This will register components mentioned in components when the respective module is loaded.
- class zensols.nlp.parser.ComponentInitializer[source]¶
Bases: ABC
Called by Component to do post-spaCy initialization.
- class zensols.nlp.parser.DecoratedFeatureDocumentParser(name, delegate, token_decorators=(), sentence_decorators=(), document_decorators=(), token_feature_ids=<factory>, silencer=None)[source]¶
Bases: FeatureDocumentParser
This class adapts FeatureDocumentParser instances to the general case using a GoF decorator pattern. This is useful for any post-processing needed on existing configured document parsers.
- All decorators are processed in the following order:
Token
Sentence
Document
Token features are stored in the delegate for those that have them. Otherwise, they are stored in instances of this class.
- __init__(name, delegate, token_decorators=(), sentence_decorators=(), document_decorators=(), token_feature_ids=<factory>, silencer=None)¶
- delegate: FeatureDocumentParser¶
Used to create the feature documents.
- document_decorators: Sequence[FeatureDocumentDecorator] = ()¶
A list of decorators that can add, remove or modify features on a document.
- name: str¶
The name of the parser, which is taken from the section name when created with a ConfigFactory and is used for debugging.
- parse(text, *args, **kwargs)[source]¶
Parse text, given either as a single string or as a list of sentence strings.
- Parameters:
text (str) – either a string or a list of strings; if the former, a document with one sentence will be created, otherwise a document is returned with a sentence for each string in the list
args – the arguments used to create the FeatureDocument instance
kwargs – the keyword arguments used to create the FeatureDocument instance
- Return type:
- sentence_decorators: Sequence[FeatureSentenceDecorator] = ()¶
A list of decorators that can add, remove or modify features on a sentence.
- silencer: WarningSilencer = None¶
Optionally suppresses warnings the parser generates.
- token_decorators: Sequence[FeatureTokenDecorator] = ()¶
A list of decorators that can add, remove or modify features on a token.
- class zensols.nlp.parser.FeatureDocumentDecorator[source]¶
Bases: FeatureTokenContainerDecorator
Implementations can add, remove or modify features on a document.
- class zensols.nlp.parser.FeatureDocumentParser[source]¶
Bases: PersistableContainer, Dictable
This class parses text into FeatureDocument instances using parse().
- TOKEN_FEATURE_IDS: ClassVar[Set[str]] = frozenset({'children', 'dep', 'dep_', 'ent', 'ent_', 'ent_iob', 'ent_iob_', 'i', 'i_sent', 'idx', 'is_contraction', 'is_ent', 'is_pronoun', 'is_punctuation', 'is_space', 'is_stop', 'is_superlative', 'is_wh', 'lemma_', 'lexspan', 'norm', 'norm_len', 'pos_', 'sent_i', 'shape', 'shape_', 'tag', 'tag_'})¶
The default value for token_feature_ids.
- __init__()¶
- static default_instance()[source]¶
Create the parser as configured in the resource library of the package.
- Return type:
- abstract parse(text, *args, **kwargs)[source]¶
Parse text, given either as a single string or as a list of sentence strings.
- Parameters:
text (str) – either a string or a list of strings; if the former, a document with one sentence will be created, otherwise a document is returned with a sentence for each string in the list
args – the arguments used to create the FeatureDocument instance
kwargs – the keyword arguments used to create the FeatureDocument instance
- Return type:
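Basic end-to-end usage (a minimal sketch; the sample text is hypothetical and the printed attributes come from TOKEN_FEATURE_IDS above):
from zensols.nlp import FeatureDocument, FeatureDocumentParser

parser: FeatureDocumentParser = FeatureDocumentParser.default_instance()
doc: FeatureDocument = parser.parse('Obama was born in Hawaii.')
for tok in doc.token_iter():
    print(tok.norm, tok.pos_, tok.ent_)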
- class zensols.nlp.parser.FeatureSentenceDecorator[source]¶
Bases: FeatureTokenContainerDecorator
Implementations can add, remove or modify features on a sentence.
- class zensols.nlp.parser.FeatureSentenceFactory(token_decorators=())[source]¶
Bases: object
Create a FeatureSentence out of single tokens or split on whitespace. This is a utility class to create data structures when only single tokens are the source data.
For example, if you only have tokens that need to be scored with unigram ROUGE-1, use this class to create sentences, which is a subclass of TokenContainer.
- __init__(token_decorators=())¶
- create(tokens)[source]¶
Create a sentence from tokens.
- token_decorators: Sequence[FeatureTokenDecorator] = ()¶
A list of decorators that can add, remove or modify features on a token.
- class zensols.nlp.parser.FeatureTokenContainerDecorator[source]¶
Bases: ABC
Implementations can add, remove or modify features on a token container.
- class zensols.nlp.parser.FeatureTokenDecorator[source]¶
Bases: ABC
Implementations can add, remove or modify features on a token.
- class zensols.nlp.parser.WhiteSpaceTokenizerFeatureDocumentParser(sent_class=<class 'zensols.nlp.container.FeatureSentence'>, doc_class=<class 'zensols.nlp.container.FeatureDocument'>)[source]¶
Bases: FeatureDocumentParser
This class parses text into FeatureDocument instances, tokenizing only by whitespace. This parser does no sentence chunking, so documents have one and only one sentence for each parse.
- __init__(sent_class=<class 'zensols.nlp.container.FeatureSentence'>, doc_class=<class 'zensols.nlp.container.FeatureDocument'>)¶
- doc_class¶
The type of document instances to create.
alias of FeatureDocument
- parse(text, *args, **kwargs)[source]¶
Parse text, given either as a single string or as a list of sentence strings.
- Parameters:
text (str) – either a string or a list of strings; if the former, a document with one sentence will be created, otherwise a document is returned with a sentence for each string in the list
args – the arguments used to create the FeatureDocument instance
kwargs – the keyword arguments used to create the FeatureDocument instance
- Return type:
- sent_class¶
The type of sentence instances to create.
alias of FeatureSentence
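Example contrasting this parser with the spaCy-backed one (a hedged sketch; the sample text is hypothetical):
from zensols.nlp.parser import WhiteSpaceTokenizerFeatureDocumentParser

ws_parser = WhiteSpaceTokenizerFeatureDocumentParser()
doc = ws_parser.parse('tokens split only on whitespace. no sentence chunking')
print(len(doc.sents))  # always 1: this parser does no sentence chunking
print([t.norm for t in doc.token_iter()])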
zensols.nlp.score module¶
Produces matching scores.
- class zensols.nlp.score.BleuScoreMethod(reverse_sents=False, smoothing_function=None, weights=(0.25, 0.25, 0.25, 0.25), silence_warnings=False)[source]¶
Bases: ScoreMethod
The BLEU scoring method using the nltk package. The first sentences are the references and the second are the hypothesis.
- __init__(reverse_sents=False, smoothing_function=None, weights=(0.25, 0.25, 0.25, 0.25), silence_warnings=False)¶
- silence_warnings: bool = False¶
Silence the BLEU warning of n-grams not matching: The hypothesis contains 0 counts of 3-gram overlaps...
- smoothing_function: SmoothingFunction = None¶
This is an implementation of the smoothing techniques for segment-level BLEU scores.
Citation:
Chen and Cherry (2014) A Systematic Comparison of Smoothing Techniques for Sentence-Level BLEU. In WMT14.
- class zensols.nlp.score.ErrorScore(method, exception, replace_score=None)[source]¶
Bases: Score
A replacement instance when scoring fails from a raised exception.
- __init__(method, exception, replace_score=None)¶
- method: str¶
The method of the ScoreMethod that raised the exception.
- replace_score: Score = None¶
The score to use in place of this score. Otherwise, asrow() returns a single numpy.nan like FloatScore.
- class zensols.nlp.score.ExactMatchScoreMethod(reverse_sents=False, equality_measure='norm')[source]¶
Bases: ScoreMethod
A scoring method that returns 1 for exact matches and 0 otherwise.
- __init__(reverse_sents=False, equality_measure='norm')¶
- equality_measure: str = 'norm'¶
The method by which to compare, which is one of:
norm: compare with TokenContainer.norm()
text: compare with TokenContainer.text
equal: compare using Python object equality (__eq__), which also compares the token values
- class zensols.nlp.score.FloatScore(value)[source]¶
Bases: Score
Float container. This is needed to create the flat result container structure. Object creation becomes less important since most clients will use ScoreSet.asnumpy().
- NAN_INSTANCE: ClassVar[FloatScore] = FloatScore(value=nan)¶
Used to add to ErrorScore for harmonic mean replacements.
- __init__(value)¶
- class zensols.nlp.score.HarmonicMeanScore(precision, recall, f_score)[source]¶
Bases: Score
A score having a precision, recall and the harmonic mean of the two, the F-score.
- NAN_INSTANCE: ClassVar[HarmonicMeanScore] = HarmonicMeanScore(precision=nan, recall=nan, f_score=nan)¶
Used to add to ErrorScore for harmonic mean replacements.
- __init__(precision, recall, f_score)¶
- class zensols.nlp.score.LevenshteinDistanceScoreMethod(reverse_sents=False, form='canon', normalize=True)[source]¶
Bases: ScoreMethod
A scoring method that computes the Levenshtein distance.
- __init__(reverse_sents=False, form='canon', normalize=True)¶
- form: str = 'canon'¶
The form of the text used for the evaluation, which is one of:
text: the original text with TokenContainer.text
norm: the normalized text using TokenContainer.norm()
canon: TokenContainer.canonical to normalize out whitespace for better comparisons
- class zensols.nlp.score.RougeScoreMethod(reverse_sents=False, feature_tokenizer=True)[source]¶
Bases: ScoreMethod
The ROUGE scoring method using the rouge_score package.
-
feature_tokenizer:
bool= True¶ Whether to use the
TokenContainertokenization, otherwise use therouge_scorepackage.
- class zensols.nlp.score.Score[source]¶
Bases: Dictable
Individual scores returned from ScoreMethod.
- __init__()¶
- class zensols.nlp.score.ScoreContext(pairs, methods=None, norm=True, correlation_ids=None)[source]¶
Bases: Dictable
Input needed to create score(s) using Scorer.
- __init__(pairs, methods=None, norm=True, correlation_ids=None)¶
- correlation_ids: Tuple[Union[int, str]] = None¶
The IDs to correlate with each sentence pair, or None to skip correlating them. The length of this tuple must be that of pairs.
- methods: Set[str] = None¶
A set of strings, each indicating the ScoreMethod used to score pairs.
- pairs: Tuple[Tuple[TokenContainer, TokenContainer]]¶
Sentence, span or document pairs to score (order matters for some scoring methods such as ROUGE). Depending on the scoring method, the ordering of the sentence pairs should be:
(<summary>, <source>)
(<gold>, <prediction>)
(<references>, <candidates>)
See ScoreMethod implementations for more information about pair ordering.
- class zensols.nlp.score.ScoreMethod(reverse_sents=False)[source]¶
Bases: object
An abstract base class for scoring methods (BLEU, ROUGE, etc.).
- __init__(reverse_sents=False)¶
- classmethod is_available()[source]¶
Whether or not this method is available on this system.
- Return type:
- class zensols.nlp.score.ScoreResult(scores, correlation_id=None)[source]¶
Bases: Dictable
A result of scores created by a ScoreMethod.
- __init__(scores, correlation_id=None)¶
- correlation_id: Optional[str] = None¶
An ID for correlating back to the TokenContainer.
- write(depth=0, writer=<_io.TextIOWrapper name='<stdout>' mode='w' encoding='utf-8'>)[source]¶
Write this instance as either a Writable or as a Dictable. If the class attribute _DICTABLE_WRITABLE_DESCENDANTS is set as True, then use the write() method on children instead of writing the generated dictionary. Otherwise, write this instance by first creating a dict recursively using asdict(), then formatting the output.
If the attribute _DICTABLE_WRITE_EXCLUDES is set, those attributes are removed from what is written in the write() method.
Note that this attribute will need to be set in all descendants in the instance hierarchy since writing the object instance graph is done recursively.
- Parameters:
depth (int) – the starting indentation depth
writer (TextIOBase) – the writer to dump the content of this writable
- class zensols.nlp.score.ScoreSet(results, correlation_id_col='id')[source]¶
Bases: Dictable
All scores returned from Scorer.
- __init__(results, correlation_id_col='id')¶
- as_dataframe(add_correlation=True)[source]¶
This gets data from as_numpy() and returns it as a Pandas dataframe.
- Parameters:
add_correlation (bool) – whether to add the correlation ID (if there is one), using correlation_id_col
- Return type:
pandas.DataFrame
- Returns:
an instance of pandas.DataFrame of the results
- as_numpy(add_correlation=True)[source]¶
Return the NumPy array with column descriptors of the results. spaCy depends on NumPy, so this package will always be available.
- Parameters:
add_correlation (bool) – whether to add the correlation ID (if there is one), using correlation_id_col
- Return type:
- correlation_id_col: str = 'id'¶
The column name for the ScoreResult.correlation_id added to NumPy arrays and Pandas dataframes. If None, then the correlation IDs are used as the index.
- results: Tuple[ScoreResult, ...]¶
A tuple with each element having the results of the respective sentence pair in ScoreContext.sents. Each element is a dictionary with methods as keys and the output of the respective ScoreMethod as values. This is created in Scorer.
- write(depth=0, writer=<_io.TextIOWrapper name='<stdout>' mode='w' encoding='utf-8'>)[source]¶
Write this instance as either a Writable or as a Dictable. If the class attribute _DICTABLE_WRITABLE_DESCENDANTS is set as True, then use the write() method on children instead of writing the generated dictionary. Otherwise, write this instance by first creating a dict recursively using asdict(), then formatting the output.
If the attribute _DICTABLE_WRITE_EXCLUDES is set, those attributes are removed from what is written in the write() method.
Note that this attribute will need to be set in all descendants in the instance hierarchy since writing the object instance graph is done recursively.
- Parameters:
depth (int) – the starting indentation depth
writer (TextIOBase) – the writer to dump the content of this writable
- class zensols.nlp.score.Scorer(package_manager=None, methods=None, default_methods=None)[source]¶
Bases: object
A class that scores sentences using a set of registered methods (methods).
- __init__(package_manager=None, methods=None, default_methods=None)¶
- default_methods: Set[str] = None¶
Methods (keys from methods) to use when none are provided in ScoreContext.meth in the call to score().
- methods: Dict[str, ScoreMethod] = None¶
The registered scoring methods available, which are accessed from ScoreContext.meth.
- package_manager: PackageManager = None¶
The package manager used to install scoring methods. If this is None, then packages are not installed and scoring methods are not made available.
- score(context)[source]¶
Score the sentences in context.
- Parameters:
context (ScoreContext) – the context containing the data to score
- Return type:
- Returns:
the results for each method indicated in context
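A hedged scoring sketch built from the documented dataclass signatures; registering methods directly (rather than through the application configuration and package_manager) is an illustrative shortcut:
from zensols.nlp import FeatureDocumentParser
from zensols.nlp.score import (
    ExactMatchScoreMethod, LevenshteinDistanceScoreMethod,
    ScoreContext, Scorer)

parser = FeatureDocumentParser.default_instance()
gold = parser.parse('The cat sat on the mat.')
pred = parser.parse('A cat sat on a mat.')
scorer = Scorer(methods={
    'exact': ExactMatchScoreMethod(),
    'levenshtein': LevenshteinDistanceScoreMethod()})
score_set = scorer.score(ScoreContext(pairs=((gold, pred),)))
print(score_set.as_dataframe())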
zensols.nlp.serial module¶
Serializes FeatureToken and TokenContainer instances using the Dictable interface.
- class zensols.nlp.serial.Include(value, names=<not given>, *values, module=None, qualname=None, type=None, start=1, boundary=None)[source]¶
Bases:
Enum
Indicates what to include at each level.
- normal = 2¶
The normalized form of the text.
- original = 1¶
The original text.
- sentences = 4¶
The sentences of the
FeatureDocument.
- tokens = 3¶
The tokens of the
TokenContainer.
- class zensols.nlp.serial.Serialized(container, includes, feature_ids)[source]¶
Bases:
Dictable
A base strategy class that can serialize TokenContainer instances.
- __init__(container, includes, feature_ids)¶
-
container:
TokenContainer¶ The container to be serialized.
- class zensols.nlp.serial.SerializedFeatureDocument(container, includes, feature_ids, sentence_includes)[source]¶
Bases:
Serialized
A serializer for feature documents. The container has to be an instance of a FeatureDocument.
- __init__(container, includes, feature_ids, sentence_includes)¶
- class zensols.nlp.serial.SerializedTokenContainer(container, includes, feature_ids)[source]¶
Bases:
Serialized
Serializes instances of TokenContainer. This is used to serialize spans and sentences.
- __init__(container, includes, feature_ids)¶
- class zensols.nlp.serial.SerializedTokenContainerFactory(sentence_includes, document_includes, feature_ids=None)[source]¶
Bases:
Dictable
Creates instances of Serialized from instances of TokenContainer. These can then be used as Dictable instances, specifically with the asdict and asjson methods.
- __init__(sentence_includes, document_includes, feature_ids=None)¶
- create(container)[source]¶
Create a serializer from
container (see class docs).
- Parameters:
container (TokenContainer) – the container to be serialized
- Return type:
Serialized
- Returns:
an object that can be serialized using the asdict and asjson methods
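As a sketch of how the factory might be used (the constructor keywords follow __init__ above; the particular include sets, the parsed document and the asjson call are assumptions based on the Dictable interface):
from zensols.nlp.serial import Include, SerializedTokenContainerFactory

# choose what to emit at the document and sentence levels
fac = SerializedTokenContainerFactory(
    sentence_includes={Include.normal, Include.tokens},
    document_includes={Include.original, Include.sentences},
    feature_ids=('norm', 'pos_', 'lemma_'))
doc = parser.parse('Obama was president. He lived in DC.')
serialized = fac.create(doc)
print(serialized.asjson(indent=2))  # asjson comes from Dictable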
zensols.nlp.spannorm module¶
Normalize spans (of tokens) into strings by reconstructing based on language
rules from the normalized form of the tokens. This is needed after any token
manipulation from TokenNormalizer or other changes to
FeatureToken.norm.
For now only English is supported, but the module is structured to accommodate other languages and future enhancements to normalization configuration.
- class zensols.nlp.spannorm.EnglishSpanNormalizer(post_space_skip=frozenset({'(', '-', '<', '[', '`', '{', '‘', '“'}), pre_space_skip=frozenset({"'d", "'ll", "'m", "'re", "'s", "'ve", '-', "n't"}), keep_space_skip=frozenset({'_'}), canonical_delimiter='|')[source]¶
Bases:
SpanNormalizer
An implementation of a span normalizer for the English language.
- __init__(post_space_skip=frozenset({'(', '-', '<', '[', '`', '{', '‘', '“'}), pre_space_skip=frozenset({"'d", "'ll", "'m", "'re", "'s", "'ve", '-', "n't"}), keep_space_skip=frozenset({'_'}), canonical_delimiter='|')¶
- get_canonical(tokens)[source]¶
A canonical representation of the container: the non-space tokens separated by CANONICAL_DELIMITER.
- Return type:
str
- get_norm(tokens, use_norm)[source]¶
Create a string that follows the language spacing rules.
- Parameters:
tokens (Iterable[FeatureToken]) – the tokens to normalize
use_norm (bool) – whether to use the token normalized or orthographic text
- Return type:
str
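For example, a sketch that reconstructs text from tokens (the EnglishSpanNormalizer defaults are assumed usable as-is, and token_iter() comes from TokenContainer):
from zensols.nlp.spannorm import EnglishSpanNormalizer

span_norm = EnglishSpanNormalizer()
sent = parser.parse("He isn't coming (today).").sents[0]
# spacing rules: no space before "n't" and none after "("
text = span_norm.get_norm(sent.token_iter(), use_norm=True)
canon = span_norm.get_canonical(sent.token_iter())  # '|' separated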
- class zensols.nlp.spannorm.SpanNormalizer[source]¶
Bases:
object
Subclasses normalize feature tokens on a per spacy.Language basis. All subclasses must be re-entrant.
zensols.nlp.sparser module¶
The spaCy FeatureDocumentParser implementation.
- class zensols.nlp.sparser.SpacyComponent(name, pipe_name=None, pipe_config=None, pipe_add_kwargs=<factory>, modules=(), initializers=(), auto_install_model=False)[source]¶
Bases:
Component
A utility base class that supports installing pip dependencies and spaCy models.
- __init__(name, pipe_name=None, pipe_config=None, pipe_add_kwargs=<factory>, modules=(), initializers=(), auto_install_model=False)¶
-
auto_install_model:
Union[bool, str, Iterable[str]] = False¶ Whether to install models not already available. Note that this uses the pip command to download model requirements, which might have the adverse effect of replacing currently installed Python packages. If a string or an iterable of strings, the value is interpreted as the pip requirement(s) to install.
- init(model, parser)[source]¶
Initialize the component and add it to the NLP pipeline. This base class implementation loads the module, then calls Language.add_pipe().
- Parameters:
model (Language) – the spaCy model (nlp in spaCy parlance) to which this component is added
parser (FeatureDocumentParser) – the owning parser of this component instance
- class zensols.nlp.sparser.SpacyFeatureDocumentParser(config_factory, name, lang='en', model_name=None, token_feature_ids=<factory>, components=(), token_decorators=(), sentence_decorators=(), document_decorators=(), disable_component_names=None, token_normalizer=None, special_case_tokens=<factory>, doc_class=<class 'zensols.nlp.container.FeatureDocument'>, sent_class=<class 'zensols.nlp.container.FeatureSentence'>, token_class=<class 'zensols.nlp.tok.SpacyFeatureToken'>, remove_empty_sentences=None, reload_components=False, auto_install_model=False, package_manager=<factory>)[source]¶
Bases:
FeatureDocumentParser
This language resource parses text into spaCy documents. Loaded spaCy models have the attribute doc_parser set to enable creation of factory instances from registered pipe components (i.e. those specified by Component).
Configuration example:
[doc_parser]
class_name = zensols.nlp.sparser.SpacyFeatureDocumentParser
lang = en
model_name = ${lang}_core_web_sm
Decorators are processed in the same way as in DecoratedFeatureDocumentParser.
- __init__(config_factory, name, lang='en', model_name=None, token_feature_ids=<factory>, components=(), token_decorators=(), sentence_decorators=(), document_decorators=(), disable_component_names=None, token_normalizer=None, special_case_tokens=<factory>, doc_class=<class 'zensols.nlp.container.FeatureDocument'>, sent_class=<class 'zensols.nlp.container.FeatureSentence'>, token_class=<class 'zensols.nlp.tok.SpacyFeatureToken'>, remove_empty_sentences=None, reload_components=False, auto_install_model=False, package_manager=<factory>)¶
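A hedged instantiation sketch using the configuration example above (the file name parser.conf is an assumption; the factory classes come from zensols.config, and the model must already be installed unless auto_install_model is set):
from zensols.config import ImportIniConfig, ImportConfigFactory

# create the parser from the [doc_parser] section shown above
factory = ImportConfigFactory(ImportIniConfig('parser.conf'))
parser = factory.instance('doc_parser')
doc = parser.parse('Obama was born in Hawaii.')
for tok in doc.token_iter():
    print(tok.norm, tok.pos_)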
-
auto_install_model:
Union[bool, str, Iterable[str]] = False¶ Whether to install models not already available. Note that this uses the pip command to download model requirements, which might have the adverse effect of replacing currently installed Python packages. If a string or an iterable of strings, the value is interpreted as the pip requirement(s) to install.
-
config_factory:
ConfigFactory¶ A configuration parser optionally used by pipeline
Component instances.
-
disable_component_names:
Sequence[str] = None¶ Components to disable in the spaCy model when creating documents in
parse().
- doc_class¶
The type of document instances to create.
alias of
FeatureDocument
-
document_decorators:
Sequence[FeatureDocumentDecorator] = ()¶ A list of decorators that can add, remove or modify features on a document.
- from_spacy_doc(doc, *args, text=None, **kwargs)[source]¶
Create a
FeatureDocument from a spaCy doc.
- Parameters:
doc (Doc) – the spaCy generated document to transform into a feature document
text (
str) – either a string or a list of strings; if the former a document with one sentence will be created, otherwise a document is returned with a sentence for each string in the listargs – the arguments used to create the FeatureDocument instance
kwargs – the key word arguments used to create the FeatureDocument instance
- Return type:
- get_dictable(doc)[source]¶
Return a dictionary object graph that pretty prints spaCy docs.
- Return type:
Dictable
- property model: Language¶
The spaCy model. On first access, this creates a new instance using
model_name.
-
model_name:
str = None¶ The spaCy model name (defaults to
en_core_web_sm); this is ignored if model is not None.
-
name:
str¶ The name of the parser, which is taken from the section name when created with a
ConfigFactory and used for debugging.
-
package_manager:
PackageManager¶ The package manager used to install
auto_install_model.
- parse(text, *args, **kwargs)[source]¶
Parse text, either a single string or a list of sentence strings.
- Parameters:
text (
str) – either a string or a list of strings; if the former a document with one sentence will be created, otherwise a document is returned with a sentence for each string in the listargs – the arguments used to create the FeatureDocument instance
kwargs – the key word arguments used to create the FeatureDocument instance
- Return type:
-
reload_components:
bool = False¶ Removes, then re-adds, components for cached models. This is helpful when component configurations change across reruns with a different application context but in the same Python interpreter session.
A spaCy component can get other instances via
config_factory, but if this is False it will be paired with the first instance of this class and not the new ones created with a new configuration factory.
-
remove_empty_sentences:
bool = None¶ Deprecated and will be removed in future versions. Use
FilterSentenceFeatureDocumentDecorator instead.
- sent_class¶
The type of sentence instances to create.
alias of
FeatureSentence
-
sentence_decorators:
Sequence[FeatureSentenceDecorator] = ()¶ A list of decorators that can add, remove or modify features on a sentence.
- to_spacy_doc(doc, norm=True, add_features=None)[source]¶
Convert a feature document back into a spaCy document.
Note: not all data is copied; only text, pos_, tag_, lemma_ and dep_.
- Parameters:
doc (FeatureDocument) – the feature document to convert
norm (bool) – whether to use the normalized text as the orth_ spaCy token attribute or text
add_features – whether to add POS and NER tags, lemmas, heads and dependencies
- Return type:
Doc
- Returns:
the spaCy document with data copied from doc
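A round-trip sketch (assuming parser and a parsed document as in the examples above):
feat_doc = parser.parse('The dog ran.')
spacy_doc = parser.to_spacy_doc(feat_doc)   # a spacy.tokens.Doc
# back to a feature document, keeping the original text
feat_again = parser.from_spacy_doc(spacy_doc, text=feat_doc.text)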
- token_class¶
The type of token instances to create.
alias of
SpacyFeatureToken
-
token_decorators:
Sequence[FeatureTokenDecorator] = ()¶ A list of decorators that can add, remove or modify features on a token.
-
token_normalizer:
TokenNormalizer = None¶ The token normalizer for methods that use it, i.e.
features.
zensols.nlp.stemmer module¶
Stem text using the Porter stemmer.
zensols.nlp.tok module¶
Feature token and related base classes.
- class zensols.nlp.tok.FeatureToken(i, idx, i_sent, norm, lexspan)[source]¶
Bases:
PersistableContainer, TextContainer
A container class for features about a token. Subclasses such as SpacyFeatureToken extract only a subset of features from the heavy spaCy C data structures, which are hard/expensive to pickle. Instances of this token class are almost always detached, meaning the underlying in-memory data structures have been copied as pure Python types to facilitate serialization of spaCy tokens.
Feature note: the features i, idx and i_sent are always included so that sentences can be reconstructed (see FeatureDocument.uncombine_sentences()).
FEATURE_IDS:
ClassVar[Set[str]] = frozenset({'children', 'dep', 'dep_', 'ent', 'ent_', 'ent_iob', 'ent_iob_', 'i', 'i_sent', 'idx', 'is_contraction', 'is_ent', 'is_pronoun', 'is_punctuation', 'is_space', 'is_stop', 'is_superlative', 'is_wh', 'lemma_', 'lexspan', 'norm', 'norm_len', 'pos_', 'sent_i', 'shape', 'shape_', 'tag', 'tag_'})¶ All default available feature IDs.
-
FEATURE_IDS_BY_TYPE:
ClassVar[Dict[str,Set[str]]] = {'bool': frozenset({'is_contraction', 'is_ent', 'is_pronoun', 'is_space', 'is_stop', 'is_superlative', 'is_wh'}), 'int': frozenset({'dep', 'ent', 'ent_iob', 'i', 'i_sent', 'idx', 'is_punctuation', 'norm_len', 'sent_i', 'shape', 'tag'}), 'list': frozenset({'children'}), 'object': frozenset({'lexspan'}), 'str': frozenset({'dep_', 'ent_', 'ent_iob_', 'lemma_', 'norm', 'pos_', 'shape_', 'tag_'})}¶ Map of class type to set of feature IDs.
-
REQUIRED_FEATURE_IDS:
ClassVar[Set[str]] = frozenset({'i', 'i_sent', 'idx', 'lexspan', 'norm'})¶ Features retained regardless of configuration for basic functionality.
-
SKIP_COMPARE_FEATURE_IDS:
ClassVar[Set[str]] = {}¶ A set of feature IDs to avoid comparing in
__eq__().
-
TYPES_BY_FEATURE_ID:
ClassVar[Dict[str,str]] = {'children': 'list', 'dep': 'int', 'dep_': 'str', 'ent': 'int', 'ent_': 'str', 'ent_iob': 'int', 'ent_iob_': 'str', 'i': 'int', 'i_sent': 'int', 'idx': 'int', 'is_contraction': 'bool', 'is_ent': 'bool', 'is_pronoun': 'bool', 'is_punctuation': 'int', 'is_space': 'bool', 'is_stop': 'bool', 'is_superlative': 'bool', 'is_wh': 'bool', 'lemma_': 'str', 'lexspan': 'object', 'norm': 'str', 'norm_len': 'int', 'pos_': 'str', 'sent_i': 'int', 'shape': 'int', 'shape_': 'str', 'tag': 'int', 'tag_': 'str'}¶ A map of feature ID to string type. This is used by
FeatureToken.write_attributes() to dump the types of features.
-
WRITABLE_FEATURE_IDS:
ClassVar[Tuple[str,...]] = ('text', 'norm', 'idx', 'sent_i', 'i', 'i_sent', 'tag', 'pos', 'is_wh', 'entity', 'dep', 'children')¶ Feature IDs that are dumped on
write() and write_attributes().
- __init__(i, idx, i_sent, norm, lexspan)¶
- clone(cls=None, **kwargs)[source]¶
Clone an instance of this token.
- Parameters:
cls (
Type) – the type of the new instance
kwargs – arguments to add as attributes to the clone
- Return type:
FeatureToken
- Returns:
the cloned instance of this token
- property default_detached_feature_ids: Set[str] | None¶
The default set of feature IDs used when cloning or detaching with
clone() or detach().
- detach(feature_ids=None, skip_missing=False, cls=None)[source]¶
Create a detached token (i.e. from spaCy artifacts).
- Parameters:
feature_ids (
Set[str]) – the features to write, which defaults to FEATURE_IDS
skip_missing (bool) – whether to only keep feature_ids
cls (
Type[FeatureToken]) – the type of the new instance
- Return type:
FeatureToken
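For example, a sketch that keeps only a small feature set (REQUIRED_FEATURE_IDS is documented above; doc is assumed parsed as in earlier examples):
from zensols.nlp.tok import FeatureToken

tok = doc.tokens[0]
# retain the minimum plus the lemma; everything else is dropped
slim = tok.detach(
    feature_ids=FeatureToken.REQUIRED_FEATURE_IDS | {'lemma_'},
    skip_missing=True)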
- get_feature(feature_id, expect=True, check_none=False, message=None)[source]¶
Return a feature by the feature ID.
- Parameters:
feature_id (
str) – the ID of the feature to retrieve
expect (bool) – whether to raise an error if the feature does not exist
message (str) – additional context to append to the error message
check_none (bool) – whether to return the value even if it has an unset value such as NONE as determined by is_none(), in which case None is returned
- Raises:
MissingFeatureError – if
expect is True and the feature does not exist
- Return type:
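For example (a sketch; tok is assumed to be a parsed FeatureToken):
# return None rather than raising when the feature is absent
ent = tok.get_feature('ent_', expect=False)
# map unset sentinel values (see is_none()) to None
dep = tok.get_feature('dep_', check_none=True)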
-
i_sent:
int¶ The index of the token within the parent sentence.
This is not to be confused with the index of the sentence to which the token belongs, which is
sent_i.
-
lexspan:
LexicalSpan¶ The character offset beginning and end of the token. This is set as (
start, end) as (idx, idx + len(text)). The begin is usually the same as idx but can change when the text is normalized or when the token moves/is reindexed in the document.
- set_feature(feature_id, value)[source]¶
Set, or add if non-existent, a feature to this token instance. If the token has been detached, it will be added to the
default_detached_feature_ids.
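For example (a sketch continuing from the detach example above; the feature ID is hypothetical):
slim.set_feature('is_citation', True)  # 'is_citation' is a hypothetical ID
# the new ID joins default_detached_feature_ids, so it survives
# subsequent clone() or detach() calls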
- split(positions)[source]¶
Split on normalized text index positions. This uses and updates the idx and lexspan attributes.
- property text: str¶
The initial text before normalized by any
TokenNormalizer.
- write(depth=0, writer=<_io.TextIOWrapper name='<stdout>' mode='w' encoding='utf-8'>, include_type=True, feature_ids=None, inline=False)[source]¶
Write this instance as either a
Writable or as a Dictable. If class attribute _DICTABLE_WRITABLE_DESCENDANTS is set as True, then use the write() method on children instead of writing the generated dictionary. Otherwise, write this instance by first creating a dict recursively using asdict(), then formatting the output.
If the attribute _DICTABLE_WRITE_EXCLUDES is set, those attributes are removed from what is written in the write() method. Note that this attribute will need to be set in all descendants in the instance hierarchy since writing the object instance graph is done recursively.
- Parameters:
depth (int) – the starting indentation depth
writer (TextIOBase) – the writer to dump the content of this writable
- write_attributes(depth=0, writer=<_io.TextIOWrapper name='<stdout>' mode='w' encoding='utf-8'>, include_type=True, feature_ids=None, inline=False, include_none=True)[source]¶
Write feature attributes.
- Parameters:
depth (int) – the starting indentation depth
writer (TextIOBase) – the writer to dump the content of this writable
include_type (bool) – if True write the type of value (if available)
feature_ids (Iterable[str]) – the features to write, which defaults to WRITABLE_FEATURE_IDS
inline (bool) – whether to print attributes all on the same line
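For example (a sketch; tok is assumed to be a parsed token):
# print a compact, single-line view of a few features
tok.write_attributes(inline=True, feature_ids='norm pos_ lemma_'.split())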
- class zensols.nlp.tok.SpacyFeatureToken(spacy_token, norm)[source]¶
Bases:
FeatureToken
Contains and provides the same features as a spaCy Token.
- property children¶
A sequence of the token’s immediate syntactic children.
- conll_iob_()[source]¶
Return the CoNLL formatted IOB tag, such as
B-ORG for a beginning organization token.
- Return type:
str
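For example, a sketch that dumps a CoNLL-style token listing (doc is assumed parsed by a SpacyFeatureDocumentParser, so its tokens are SpacyFeatureToken instances):
for tok in doc.token_iter():
    print(tok.norm, tok.conll_iob_())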
- property lemma_: str¶
Return the string lemma, or the text of the named entity if tagged as a named entity.
- property sent_i: int¶
The index of the sentence to which the token belongs. This is not to be confused with the index of the token in the respective sentence, which is
FeatureToken.i_sent.
This attribute does not exist in a spaCy token, and was named as such to follow the naming conventions of their API.
- property shape: int¶
Transform of the tokens’s string, to show orthographic features. For example, “Xxxx” or “d.
- property shape_: str¶
Transform of the tokens’s string, to show orthographic features. For example, “Xxxx” or “d.
-
spacy_token:
Union[Token, Span]¶ The parsed spaCy token (or span if an entity) on which this feature set is based.
- See:
FeatureDocument.spacy_doc()
- property token: Token¶
Return the spaCy token.