zensols.nlp package#
Submodules#
zensols.nlp.chunker#
Classes that segment text from FeatureDocument
instances, but
retain the original structure by preserving sentence and token indices.
- class zensols.nlp.chunker.Chunker(doc, pattern, sub_doc=None, char_offset=None)[source]#
Bases:
object
Splits
TokenContainer
instances using the regular expression pattern
. Matched containers (the container implementation depends on the subclass) are given if used as an iterable. The document of all parsed containers is given if used as a callable.- __init__(doc, pattern, sub_doc=None, char_offset=None)#
-
char_offset:
int
= None# The 0-index absolute character offset where
sub_doc
starts. However, if the value is -1, then the offset is used as the beginning character offset of the first token in the sub_doc
.
-
doc:
FeatureDocument
# The document that contains the entire text (i.e.
Note
).
-
sub_doc:
FeatureDocument
= None# A lexical span created document of
doc
, which defaults to the global document. Providing this and char_offset
allows use of a document without having to use TokenContainer.reindex()
.
- class zensols.nlp.chunker.ListItemChunker(doc, pattern=re.compile('^((?:[0-9-+]+|[a-zA-Z]+:)[^\\\\n]+)$', re.MULTILINE), sub_doc=None, char_offset=None)[source]#
Bases:
Chunker
A
Chunker
that splits list items and enumerated lists into separate sentences. Matched sentences are given if used as an iterable. This is useful when spaCy chunks lists into sentences incorrectly; lists are found using a regular expression that matches lines starting with a decimal or with list characters such as -
and +
.-
DEFAULT_SPAN_PATTERN:
ClassVar
[Pattern
] = re.compile('^((?:[0-9-+]+|[a-zA-Z]+:)[^\\n]+)$', re.MULTILINE)# The default list item regular expression, which uses an initial character item notation or an initial enumeration digit.
- __init__(doc, pattern=re.compile('^((?:[0-9-+]+|[a-zA-Z]+:)[^\\\\n]+)$', re.MULTILINE), sub_doc=None, char_offset=None)#
-
pattern:
Pattern
= re.compile('^((?:[0-9-+]+|[a-zA-Z]+:)[^\\n]+)$', re.MULTILINE)# The list regular expression, which defaults to
DEFAULT_SPAN_PATTERN
.
-
DEFAULT_SPAN_PATTERN:
- class zensols.nlp.chunker.ParagraphChunker(doc, pattern=re.compile('(.+?)(?:(?=\\\\n{2})|\\\\Z)', re.MULTILINE | re.DOTALL), sub_doc=None, char_offset=None)[source]#
Bases:
Chunker
A
Chunker
that splits a document into paragraphs. Matched paragraphs are given if used as an iterable. For this reason, this class will probably be used as an iterable since clients will usually want just the separated paragraphs as documents.-
DEFAULT_SPAN_PATTERN:
ClassVar
[Pattern
] = re.compile('(.+?)(?:(?=\\n{2})|\\Z)', re.MULTILINE|re.DOTALL)# The default paragraph regular expression, which uses two newline positive lookaheads to avoid matching on paragraph spacing.
- __init__(doc, pattern=re.compile('(.+?)(?:(?=\\\\n{2})|\\\\Z)', re.MULTILINE | re.DOTALL), sub_doc=None, char_offset=None)#
-
pattern:
Pattern
= re.compile('(.+?)(?:(?=\\n{2})|\\Z)', re.MULTILINE|re.DOTALL)# The list regular expression, which defaults to
DEFAULT_SPAN_PATTERN
.
-
DEFAULT_SPAN_PATTERN:
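Usage sketch for the chunkers above (not from the library docs; the example text and the use of the default configured parser are assumptions): iterating a chunker yields the matched containers, while calling it returns one document built from all of them.
from zensols.nlp.parser import FeatureDocumentParser
from zensols.nlp.chunker import ParagraphChunker, ListItemChunker

# assumed: the default configured parser (needs an installed spaCy English model)
parser = FeatureDocumentParser.default_instance()
doc = parser.parse('First paragraph.\n\nSecond paragraph.\n\n- item one\n- item two')

# iterate to get each matched paragraph as its own document
for para in ParagraphChunker(doc):
    print(para.text)

# call to get one document whose sentences are the matched list items
relisted = ListItemChunker(doc)()
print(len(relisted.sents))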
zensols.nlp.combine#
A class that combines features.
- class zensols.nlp.combine.CombinerFeatureDocumentParser(name, delegate, token_decorators=(), sentence_decorators=(), document_decorators=(), token_feature_ids=<factory>, source_parsers=None, validate_features=<factory>, yield_features=<factory>, yield_feature_defaults=None, overwrite_features=<factory>, overwrite_nones=False, include_detached_features=True)[source]#
Bases:
DecoratedFeatureDocumentParser
A class that combines features from two
FeatureDocumentParser
instances. Features parsed using each source_parser
are optionally copied or overwritten on a token-by-token basis in the feature document parsed by this instance. The target tokens are sometimes added to or clobbered from the source, but not the other way around.
- __init__(name, delegate, token_decorators=(), sentence_decorators=(), document_decorators=(), token_feature_ids=<factory>, source_parsers=None, validate_features=<factory>, yield_features=<factory>, yield_feature_defaults=None, overwrite_features=<factory>, overwrite_nones=False, include_detached_features=True)#
-
include_detached_features:
bool
= True# Whether to include copied (yielded or overwritten) features as listed detected features. This controls what is compared, cloned and printed in
write()
.
-
overwrite_features:
List
[str
]# A list of features to be copied/overwritten in order given in the list.
-
overwrite_nones:
bool
= False# Whether to write
None
for missing overwrite_features
. This always writes the target feature; if you only want to write when the source is not set or missing, then use yield_features
.
- parse(text, *args, **kwargs)[source]#
Parse text or a text as a list of sentences.
- Parameters:
text (
str
) – either a string or a list of strings; if the former a document with one sentence will be created, otherwise a document is returned with a sentence for each string in the list
args – the arguments used to create the FeatureDocument instance
kwargs – the key word arguments used to create the FeatureDocument instance
- Return type:
-
source_parsers:
List
[FeatureDocumentParser
] = None# The language resource used to parse documents and create token attributes.
-
validate_features:
Set
[str
]# A set of features to compare across all tokens when copying. If any of the given features don’t match, a token mismatch error is raised.
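A minimal construction sketch (not from the library docs): the delegate and source parser below are placeholders for parsers that would normally be defined in application configuration and created by a ConfigFactory.
from zensols.nlp.parser import FeatureDocumentParser
from zensols.nlp.combine import CombinerFeatureDocumentParser

# placeholders: e.g. a general-purpose model and a domain NER model
target_parser: FeatureDocumentParser = ...
ner_parser: FeatureDocumentParser = ...

parser = CombinerFeatureDocumentParser(
    name='combiner',
    delegate=target_parser,
    source_parsers=[ner_parser],
    # copy/overwrite these features from the source on to the target tokens
    overwrite_features=['ent_', 'ent_iob_'])
doc = parser.parse('Apple hired a new CEO.')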
- class zensols.nlp.combine.MappingCombinerFeatureDocumentParser(name, delegate, token_decorators=(), sentence_decorators=(), document_decorators=(), token_feature_ids=<factory>, source_parsers=None, validate_features=frozenset({'idx'}), yield_features=<factory>, yield_feature_defaults=None, overwrite_features=<factory>, overwrite_nones=False, include_detached_features=True, merge_sentences=True)[source]#
Bases:
CombinerFeatureDocumentParser
Maps the source to respective tokens in the target document using spaCy artifacts.
- __init__(name, delegate, token_decorators=(), sentence_decorators=(), document_decorators=(), token_feature_ids=<factory>, source_parsers=None, validate_features=frozenset({'idx'}), yield_features=<factory>, yield_feature_defaults=None, overwrite_features=<factory>, overwrite_nones=False, include_detached_features=True, merge_sentences=True)#
zensols.nlp.component#
Components useful for reuse.
- class zensols.nlp.component.EntityRecognizer(nlp, name, import_file, patterns)[source]#
Bases:
object
Base class for regular expression and spaCy match-pattern named entity recognizers. Both subclasses allow an optional label for each respective pattern or regular expression. If the label is provided, then the match is made a named entity with a label. In any case, a span is created on the token, and in some cases, retokenized.
- __init__(nlp, name, import_file, patterns)#
-
nlp:
Language
# The NLP model.
- class zensols.nlp.component.PatternEntityRecognizer(nlp, name, import_file, patterns)[source]#
Bases:
EntityRecognizer
Adds entities based on regular expressions.
- See:
- __init__(nlp, name, import_file, patterns)#
- class zensols.nlp.component.RegexEntityRecognizer(nlp, name, import_file, patterns)[source]#
Bases:
EntityRecognizer
Merges regular expression matches as a
Span
. After matches are found, re-tokenization merges them into one token per match.- __init__(nlp, name, import_file, patterns)#
- class zensols.nlp.component.RegexSplitter(nlp, name, import_file, patterns)[source]#
Bases:
EntityRecognizer
Splits on regular expressions.
- __init__(nlp, name, import_file, patterns)#
zensols.nlp.container#
Domain objects that define features associated with text.
- class zensols.nlp.container.FeatureDocument(sents, text=None, spacy_doc=None)[source]#
Bases:
TokenContainer
A container class of tokens that make a document. This class contains a one-to-many relationship with its sentences. However, it can be treated like any
TokenContainer
to fetch tokens. Instances of this class iterate overFeatureSentence
instances.- Parameters:
sents (
Tuple
[FeatureSentence
,...
]) – the sentences defined for this document
- _combine_documents(docs, cls, concat_tokens, **kwargs)[source]#
Override if there are any fields in your dataclass. In most cases, the only time this is called is by an embedding vectorizer to batch multiple sentences into a single document, so the only features that matter are at the sentence level.
- Parameters:
docs (
Tuple
[FeatureDocument
,...
]) – the documents to combine into one
cls (
Type
[FeatureDocument
]) – the class of the instance to createconcat_tokens (
bool
) – if True
, each sentence of the returned document is the concatenated tokens of the respective document; otherwise simply concatenate sentences into one document
kwargs – additional keyword arguments to pass to the new feature document’s initializer
- Return type:
-
EMPTY_DOCUMENT:
ClassVar
[FeatureDocument
] = <># A zero length document.
- __init__(sents, text=None, spacy_doc=None)#
- clone(cls=None, **kwargs)[source]#
- Parameters:
kwargs – if copy_spacy is
True
, the spacy document is copied to the clone in addition parameters passed to new clone initializer- Return type:
- classmethod combine_documents(docs, concat_tokens=True, **kwargs)[source]#
Coerce a tuple of token containers (either documents or sentences) into one synthesized document.
- Parameters:
docs (
Iterable
[FeatureDocument
]) – the documents to combine into one
cls – the class of the instance to create
concat_tokens (
bool
) – if True
, each sentence of the returned document is the concatenated tokens of the respective document; otherwise simply concatenate sentences into one document
kwargs – additional keyword arguments to pass to the new feature document’s initializer
- Return type:
- combine_sentences(sents=None)[source]#
Combine the sentences in this document in to a new document with a single sentence.
- Parameters:
sents (
Iterable
[FeatureSentence
]) – the sentences to combine in the new document or all if None
- Return type:
- from_sentences(sents, deep=False)[source]#
Return a new cloned document using the given sentences.
- Parameters:
sents (
Iterable
[FeatureSentence
]) – the sentences to add to the new cloned document
deep (
bool
) – whether or not to clone the sentences
- See:
- Return type:
- get_overlapping_document(span, inclusive=True)[source]#
Get the portion of the document that overlaps
span
. Sentences completely enclosed in a span are copied. Otherwise, new sentences are created from those tokens that overlap the span.- Parameters:
span (
LexicalSpan
) – indicates the portion of the document to retain
inclusive (
bool
) – whether to check include +1 on the end component
- Return type:
- Returns:
a new document that contains the 0 index offset of
span
- get_overlapping_sentences(span, inclusive=True)[source]#
Return sentences that overlaps with
span
from this document.- Parameters:
span (
LexicalSpan
) – indicates the portion of the document to retain
inclusive (
bool
) – whether to check include +1 on the end component
- Return type:
- get_overlapping_span(span, inclusive=True)[source]#
Return a feature span that includes the lexical scope of
span
.- Return type:
- property max_sentence_len: int#
Return the length of tokens from the longest sentence in the document.
- sentence_index_for_token(token)[source]#
Return index of the parent sentence having
token
.- Return type:
- sentences_for_tokens(tokens)[source]#
Find sentences having a set of tokens.
- Parameters:
tokens (
Tuple
[FeatureToken
,...
]) – the query used to find the containing sentences- Return type:
- Returns:
the document ordered tuple of sentences containing tokens
-
sents:
Tuple
[FeatureSentence
,...
]# The sentences that make up the document.
-
spacy_doc:
Doc
= None# The parsed spaCy document this feature set is based on. As explained in
FeatureToken
, spaCy documents are heavyweight and problematic to pickle. For this reason, this attribute is dropped when pickled, and is only here for ad-hoc predictions.
- to_sentence(limit=9223372036854775807, contiguous_i_sent=False, delim='')[source]#
Coerce this instance to a single sentence. No token data is updated, so
FeatureToken.i_sent
values keep their original indexes. These sentence indexes will be inconsistent when called on FeatureDocument
unless contiguous_i_sent is set to True
.- Parameters:
limit (
int
) – the max number of sentences to create (only the starting sentences are kept)
contiguous_i_sent (
Union
[str
,bool
]) – if True
, ensures all tokens have a FeatureToken.i_sent
value that is contiguous for the returned instance; if this value is reset
, the token indices start from 0
delim (
str
) – a string added between each constituent sentence
- Return type:
- Returns:
an instance of
FeatureSentence
that represents this token sequence
- token_iter(*args, **kwargs)[source]#
Return an iterator over the token features.
- Parameters:
args – the arguments given to
itertools.islice()
- Return type:
- uncombine_sentences()[source]#
Reconstruct the sentence structure that we combined in
combine_sentences()
. If that has not been done in this instance, then return self
.- Return type:
- update_entity_spans(include_idx=True)[source]#
Update token entity to
norm
text. This is helpful when entities are embedded after splitting text, which becomes FeatureToken.norm
values. However, the token spans still index the original entities that are multi-word, which leads to norms that are not equal to the text spans. This synchronizes the token span indexes with the norms.- Parameters:
include_idx (
bool
) – whether to update SpacyFeatureToken.idx
as well
- update_indexes()[source]#
Update all
FeatureToken.i
attributes to those provided by tokens_by_i
. This corrects the many-to-one token index mapping for split multi-word named entities.- See:
tokens_by_i
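Usage sketch for FeatureDocument (assumes the default configured parser and an installed spaCy model; the text is an arbitrary example):
from zensols.nlp.parser import FeatureDocumentParser
from zensols.nlp.domain import LexicalSpan

parser = FeatureDocumentParser.default_instance()
doc = parser.parse('California is a state. Its capital is Sacramento.')

print(len(doc.sents))              # 2 sentences
print(doc.max_sentence_len)        # token count of the longest sentence

# the portion of the document overlapping a character span
sub = doc.get_overlapping_document(LexicalSpan(0, 21))
print(sub.text)

# collapse all sentences into a single-sentence document
single = doc.combine_sentences()
print(len(single.sents))           # 1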
- class zensols.nlp.container.FeatureSentence(tokens, text=None, spacy_span=None)[source]#
Bases:
FeatureSpan
A container class of tokens that make a sentence. Instances of this class iterate over
FeatureToken
instances, and can create documents with to_document()
.-
EMPTY_SENTENCE:
ClassVar
[FeatureSentence
] = <>#
- __init__(tokens, text=None, spacy_span=None)#
- get_overlapping_span(span, inclusive=True)[source]#
Return a feature span that includes the lexical scope of
span
.- Return type:
- to_sentence(limit=9223372036854775807, contiguous_i_sent=False, delim='')[source]#
Coerce this instance to a single sentence. No token data is updated, so
FeatureToken.i_sent
values keep their original indexes. These sentence indexes will be inconsistent when called on FeatureDocument
unless contiguous_i_sent is set to True
.- Parameters:
limit (
int
) – the max number of sentences to create (only the starting sentences are kept)
contiguous_i_sent (
Union
[str
,bool
]) – if True
, ensures all tokens have a FeatureToken.i_sent
value that is contiguous for the returned instance; if this value is reset
, the token indices start from 0
delim (
str
) – a string added between each constituent sentence
- Return type:
- Returns:
an instance of
FeatureSentence
that represents this token sequence
-
EMPTY_SENTENCE:
- class zensols.nlp.container.FeatureSpan(tokens, text=None, spacy_span=None)[source]#
Bases:
TokenContainer
A span of tokens as a
TokenContainer
, much like spacy.tokens.Span
.- __init__(tokens, text=None, spacy_span=None)#
- clone(cls=None, **kwargs)[source]#
Clone an instance of this token container.
- Parameters:
cls (
Type
[TokenContainer
]) – the type of the new instance
kwargs – arguments to add as attributes to the clone
- Return type:
- Returns:
the cloned instance of this instance
- property dependency_tree: Dict[FeatureToken, List[Dict[FeatureToken]]]#
-
spacy_span:
Span
= None# The parsed spaCy span this feature set is based.
- See:
FeatureDocument.spacy_doc()
- to_sentence(limit=9223372036854775807, contiguous_i_sent=False, delim='')[source]#
Coerce this instance to a single sentence. No token data is updated, so
FeatureToken.i_sent
values keep their original indexes. These sentence indexes will be inconsistent when called on FeatureDocument
unless contiguous_i_sent is set to True
.- Parameters:
limit (
int
) – the max number of sentences to create (only the starting sentences are kept)
contiguous_i_sent (
Union
[str
,bool
]) – if True
, ensures all tokens have a FeatureToken.i_sent
value that is contiguous for the returned instance; if this value is reset
, the token indices start from 0
delim (
str
) – a string added between each constituent sentence
- Return type:
- Returns:
an instance of
FeatureSentence
that represents this token sequence
- token_iter(*args, **kwargs)[source]#
Return an iterator over the token features.
- Parameters:
args – the arguments given to
itertools.islice()
- Return type:
- property tokens: Tuple[FeatureToken, ...]#
The tokens that make up the span.
- property tokens_by_i_sent: Dict[int, FeatureToken]#
A map of tokens with keys as their sentence-relative position offset and values as tokens.
- See:
zensols.nlp.FeatureToken.i
- update_entity_spans(include_idx=True)[source]#
Update token entity to
norm
text. This is helpful when entities are embedded after splitting text, which becomes FeatureToken.norm
values. However, the token spans still index the original entities that are multi-word, which leads to norms that are not equal to the text spans. This synchronizes the token span indexes with the norms.- Parameters:
include_idx (
bool
) – whether to update SpacyFeatureToken.idx
as well
- update_indexes()[source]#
Update all
FeatureToken.i
attributes to those provided by tokens_by_i
. This corrects the many-to-one token index mapping for split multi-word named entities.- See:
tokens_by_i
- class zensols.nlp.container.TokenAnnotatedFeatureDocument(sents, text=None, spacy_doc=None)[source]#
Bases:
FeatureDocument
A feature sentence that contains token annotations. Sentences can be modeled with
TokenAnnotatedFeatureSentence
or justFeatureSentence
since this sets the annotations attribute when combining.- __init__(sents, text=None, spacy_doc=None)#
- combine_sentences(**kwargs) FeatureDocument #
Combine all the sentences in this document into a new document with a single sentence.
- Return type:
FeatureDocument
- class zensols.nlp.container.TokenAnnotatedFeatureSentence(tokens, text=None, spacy_span=None, annotations=())[source]#
Bases:
FeatureSentence
A feature sentence that contains token annotations.
- __init__(tokens, text=None, spacy_span=None, annotations=())#
- write(depth=0, writer=<_io.TextIOWrapper name='<stdout>' mode='w' encoding='utf-8'>, **kwargs)[source]#
Write the text container.
- Parameters:
include_original – whether to include the original text
include_normalized – whether to include the normalized text
n_tokens – the number of tokens to write
inline – whether to print the tokens on one line each
- class zensols.nlp.container.TokenContainer[source]#
Bases:
PersistableContainer
,TextContainer
A base class for token container classes such as
FeatureSentence
andFeatureDocument
. In addition to the defined methods, each instance has atext
attribute, which is the original text of the document.- property canonical: str#
A canonical representation of the container, which is the sequence of non-space tokens separated by
CANONICAL_DELIMITER
.
- clone(cls=None, **kwargs)[source]#
Clone an instance of this token container.
- Parameters:
cls (
Type
[TokenContainer
]) – the type of the new instance
kwargs – arguments to add as attributes to the clone
- Return type:
- Returns:
the cloned instance of this instance
- property entities: Tuple[FeatureSpan, ...]#
The named entities of the container with each multi-word entity as elements.
- get_overlapping_span(span, inclusive=True)[source]#
Return a feature span that includes the lexical scope of
span
.- Return type:
- get_overlapping_tokens(span, inclusive=True)[source]#
Get all tokens that overlap lexical span
span
.- Parameters:
span (
LexicalSpan
) – the document 0-index character-based inclusive span to compare with FeatureToken.lexspan
inclusive (
bool
) – whether to check include +1 on the end component
- Return type:
- Returns:
a token sequence containing the 0 index offset of
span
- property lexspan: LexicalSpan#
The document indexed lexical span using
idx
.
- map_overlapping_tokens(spans, inclusive=True)[source]#
Return a tuple of tokens, each tuple in the range given by the respective span in
spans
.- Parameters:
spans (
Iterable
[LexicalSpan
]) – the document 0-index character-based inclusive spans to compare with FeatureToken.lexspan
inclusive (
bool
) – whether to check include +1 on the end component
- Return type:
- Returns:
a tuple of matching tokens for the respective
span
query
- reindex(reference_token=None)[source]#
Re-index tokens, which is useful for situations where a 0-index offset is assumed for sub-documents created with
FeatureDocument.get_overlapping_document()
or FeatureDocument.get_overlapping_sentences()
. The following data are modified: FeatureToken.sent_i
(see SpacyFeatureToken.sent_i
), FeatureToken.lexspan
(see SpacyFeatureToken.lexspan
)
- strip(in_place=True)[source]#
Strip beginning and ending whitespace (see
strip_tokens()
) and text
.- Return type:
- strip_token_iter(*args, **kwargs)[source]#
Strip beginning and ending whitespace (see
strip_tokens()
) using token_iter()
.- Return type:
- static strip_tokens(token_iter)[source]#
Strip beginning and ending whitespace. This uses
is_space
, which is True
for spaces, tabs and newlines.- Parameters:
token_iter (
Iterable
[FeatureToken
]) – an stream of tokens- Return type:
- Returns:
non-whitespace middle tokens
- abstract to_document(limit=9223372036854775807)[source]#
Coerce this instance into a document.
- Return type:
- abstract to_sentence(limit=9223372036854775807, contiguous_i_sent=False, delim='')[source]#
Coerce this instance to a single sentence. No token data is updated, so
FeatureToken.i_sent
values keep their original indexes. These sentence indexes will be inconsistent when called on FeatureDocument
unless contiguous_i_sent is set to True
.- Parameters:
limit (
int
) – the max number of sentences to create (only the starting sentences are kept)
contiguous_i_sent (
Union
[str
,bool
]) – if True
, ensures all tokens have a FeatureToken.i_sent
value that is contiguous for the returned instance; if this value is reset
, the token indices start from 0
delim (
str
) – a string added between each constituent sentence
- Return type:
- Returns:
an instance of
FeatureSentence
that represents this token sequence
- abstract token_iter(*args, **kwargs)[source]#
Return an iterator over the token features.
- Parameters:
args – the arguments given to
itertools.islice()
- Return type:
- property tokens: Tuple[FeatureToken, ...]#
Return the token features as a tuple.
- property tokens_by_i: Dict[int, FeatureToken]#
A map of tokens with keys as their position offset and values as tokens. The entries also include named entity tokens that are grouped as multi-word tokens. This is helpful for multi-word entities that were split (for example with
SplitTokenMapper
), and thus, have many-to-one mapped indexes.- See:
zensols.nlp.FeatureToken.i
- property tokens_by_idx: Dict[int, FeatureToken]#
A map of tokens with keys as their character offset and values as tokens.
Limitations: Multi-word entities will have a mapping only for the first word of that entity if tokens were split by spaces (for example with
SplitTokenMapper
). However, tokens_by_i
does not have this limitation.- See:
obj:tokens_by_i
- See:
zensols.nlp.FeatureToken.idx
- abstract update_entity_spans(include_idx=True)[source]#
Update token entity to
norm
text. This is helpful when entities are embedded after splitting text, which becomes FeatureToken.norm
values. However, the token spans still index the original entities that are multi-word, which leads to norms that are not equal to the text spans. This synchronizes the token span indexes with the norms.- Parameters:
include_idx (
bool
) – whether to update SpacyFeatureToken.idx
as well
- update_indexes()[source]#
Update all
FeatureToken.i
attributes to those provided by tokens_by_i
. This corrects the many-to-one token index mapping for split multi-word named entities.- See:
- write(depth=0, writer=<_io.TextIOWrapper name='<stdout>' mode='w' encoding='utf-8'>, include_original=False, include_normalized=True, n_tokens=9223372036854775807, inline=False)[source]#
Write the text container.
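A short sketch of the container API documented above (assumes doc is a FeatureDocument parsed as in the earlier examples):
from zensols.nlp.domain import LexicalSpan

print(doc.canonical)               # non-space tokens joined by the canonical delimiter
print(doc.entities)                # multi-word named entities as spans
print(doc.lexspan)                 # character span of the whole container

toks = doc.get_overlapping_tokens(LexicalSpan(0, 10))
print([t.norm for t in toks])

print(sorted(doc.tokens_by_idx.keys()))   # keyed by character offset
print(sorted(doc.tokens_by_i.keys()))     # keyed by document token index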
zensols.nlp.dataframe#
Create Pandas dataframes from features. This must be imported by absolute
module (zensols.nlp.dataframe
).
- class zensols.nlp.dataframe.FeatureDataFrameFactory(token_feature_ids=frozenset({'children', 'dep', 'dep_', 'ent', 'ent_', 'ent_iob', 'ent_iob_', 'i', 'i_sent', 'idx', 'is_contraction', 'is_ent', 'is_pronoun', 'is_punctuation', 'is_space', 'is_stop', 'is_superlative', 'is_wh', 'lemma_', 'lexspan', 'norm', 'norm_len', 'pos_', 'sent_i', 'shape', 'shape_', 'tag', 'tag_', 'text'}), priority_feature_ids=('text', 'norm', 'idx', 'sent_i', 'i', 'i_sent', 'tag', 'pos', 'is_wh', 'entity', 'dep', 'children'))[source]#
Bases:
object
Creates a Pandas dataframe of features from document annotations. Each feature ID is given a column in the output
pandas.DataFrame
.- __init__(token_feature_ids=frozenset({'children', 'dep', 'dep_', 'ent', 'ent_', 'ent_iob', 'ent_iob_', 'i', 'i_sent', 'idx', 'is_contraction', 'is_ent', 'is_pronoun', 'is_punctuation', 'is_space', 'is_stop', 'is_superlative', 'is_wh', 'lemma_', 'lexspan', 'norm', 'norm_len', 'pos_', 'sent_i', 'shape', 'shape_', 'tag', 'tag_', 'text'}), priority_feature_ids=('text', 'norm', 'idx', 'sent_i', 'i', 'i_sent', 'tag', 'pos', 'is_wh', 'entity', 'dep', 'children'))#
-
priority_feature_ids:
Tuple
[str
,...
] = ('text', 'norm', 'idx', 'sent_i', 'i', 'i_sent', 'tag', 'pos', 'is_wh', 'entity', 'dep', 'children')# Feature IDs that are used first in the column order in the output
pandas.DataFrame
.
-
token_feature_ids:
Set
[str
] = frozenset({'children', 'dep', 'dep_', 'ent', 'ent_', 'ent_iob', 'ent_iob_', 'i', 'i_sent', 'idx', 'is_contraction', 'is_ent', 'is_pronoun', 'is_punctuation', 'is_space', 'is_stop', 'is_superlative', 'is_wh', 'lemma_', 'lexspan', 'norm', 'norm_len', 'pos_', 'sent_i', 'shape', 'shape_', 'tag', 'tag_', 'text'})# The feature IDs to add to the
pandas.DataFrame
.
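The factory produces one column per configured feature ID. The sketch below builds an equivalent frame by hand from token attributes to show the idea (it assumes doc is a parsed FeatureDocument and that pandas is installed):
import pandas as pd

rows = [{'text': t.text, 'norm': t.norm, 'idx': t.idx, 'pos': t.pos_, 'ent': t.ent_}
        for t in doc.token_iter()]
df = pd.DataFrame(rows)
print(df.head())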
zensols.nlp.decorate#
Contains useful classes for decorating feature sentences.
- class zensols.nlp.decorate.CopyFeatureTokenContainerDecorator(feature_ids)[source]#
Bases:
FeatureTokenContainerDecorator
Copies feature(s) for each token in the container. For each token, each source / target tuple pair in
feature_ids
is copied. If the feature is missing (this does not include existing FeatureToken.NONE
values) an exception is raised.- __init__(feature_ids)#
- class zensols.nlp.decorate.FilterEmptySentenceDocumentDecorator(filter_space=True)[source]#
Bases:
FeatureDocumentDecorator
Filter zero length sentences.
- __init__(filter_space=True)#
- class zensols.nlp.decorate.FilterTokenSentenceDecorator(remove_stop=False, remove_space=False, remove_pronouns=False, remove_punctuation=False, remove_determiners=False, remove_empty=False)[source]#
Bases:
FeatureSentenceDecorator
A decorator that filters tokens from sentences based on the remove_* fields.
- __init__(remove_stop=False, remove_space=False, remove_pronouns=False, remove_punctuation=False, remove_determiners=False, remove_empty=False)#
- class zensols.nlp.decorate.SplitTokenSentenceDecorator[source]#
Bases:
FeatureSentenceDecorator
A decorator that splits feature tokens by white space.
- __init__()#
- class zensols.nlp.decorate.StripTokenContainerDecorator[source]#
Bases:
FeatureTokenContainerDecorator
A decorator that strips whitespace from sentences (or
TokenContainer
).- __init__()#
- class zensols.nlp.decorate.UpdateTokenContainerDecorator(update_indexes=True, update_entity_spans=True, reindex=False)[source]#
Bases:
FeatureTokenContainerDecorator
Updates document indexes and spans (see fields).
- __init__(update_indexes=True, update_entity_spans=True, reindex=False)#
-
update_entity_spans:
bool
= True# Whether to update the document indexes with
FeatureDocument.update_entity_spans()
.
-
update_indexes:
bool
= True# Whether to update the document indexes with
FeatureDocument.update_indexes()
.
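A minimal wiring sketch for the decorators above (the delegate parser and text are assumptions; in practice the decorators are usually declared in configuration):
from zensols.nlp.parser import DecoratedFeatureDocumentParser, FeatureDocumentParser
from zensols.nlp.decorate import (
    FilterTokenSentenceDecorator, FilterEmptySentenceDocumentDecorator)

delegate = FeatureDocumentParser.default_instance()
parser = DecoratedFeatureDocumentParser(
    name='decorated',
    delegate=delegate,
    sentence_decorators=(FilterTokenSentenceDecorator(
        remove_stop=True, remove_punctuation=True),),
    document_decorators=(FilterEmptySentenceDocumentDecorator(),))
doc = parser.parse('The cat sat on the mat.')
print([t.norm for t in doc.token_iter()])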
zensols.nlp.domain#
Interfaces, contracts and errors.
- class zensols.nlp.domain.LexicalSpan(begin, end)[source]#
Bases:
Dictable
A lexical character span of text in a document. The span has two positions:
begin
and end
, which can also be accessed by integer index. The left (begin) is inclusive and the right (end
) is exclusive to conform to Python array slicing conventions. One span is less than the other when the beginning position is less. When the beginning positions are the same, the one with the smaller end position is less.
The length of the span is the distance between the end and the beginning positions.
-
EMPTY_SPAN:
ClassVar
[LexicalSpan
] = (0, 0)# The span
(0, 0)
.
- static gaps(spans, end=None)[source]#
Return the spans for the “holes” in
spans
. For example, ifspans
is((0, 5), (10, 12), (15, 17))
, then return((5, 10), (12, 15))
.- Parameters:
spans (
Iterable
[LexicalSpan
]) – the spans used to find gapsend (
Optional
[int
]) – an end position for the last gap so that if the last item inspans
end does not match, another is added
- Return type:
- Returns:
a list of spans that “fill” any holes in
spans
- narrow(other)[source]#
Return the shortest span that inclusively fits in both this and
other
.- Parameters:
other (
LexicalSpan
) – the second span to narrow with this span- Return:
a span so that beginning is maximized and end is minimized or
None
if the two spans do not overlap- Return type:
- static overlaps(a0, a1, b0, b1, inclusive=True)[source]#
Return whether or not one text span overlaps with another.
- Parameters:
inclusive (
bool
) – whether to check include +1 on the end component- Returns:
any overlap detected returns
True
- overlaps_with(other, inclusive=True)[source]#
Return whether or not one text span overlaps non-inclusively with another.
- Parameters:
other (
LexicalSpan
) – the other locationinclusive (
bool
) – whether to check include +1 on the end component
- Return type:
- Returns:
any overlap detected returns
True
- static widen(others)[source]#
Take the span union by using the left most
begin
and the right mostend
.- Parameters:
others (
Iterable
[LexicalSpan
]) – the spans to union- Return type:
- Returns:
the widest span that inclusively aggregates
others
, or None if an empty sequence is passed
- write(depth=0, writer=<_io.TextIOWrapper name='<stdout>' mode='w' encoding='utf-8'>)[source]#
Write this instance as either a
Writable
or as aDictable
. If class attribute_DICTABLE_WRITABLE_DESCENDANTS
is set asTrue
, then use thewrite()
method on children instead of writing the generated dictionary. Otherwise, write this instance by first creating adict
recursively usingasdict()
, then formatting the output.If the attribute
_DICTABLE_WRITE_EXCLUDES
is set, those attributes are removed from what is written in thewrite()
method.Note that this attribute will need to be set in all descendants in the instance hierarchy since writing the object instance graph is done recursively.
- Parameters:
depth (
int
) – the starting indentation depthwriter (
TextIOBase
) – the writer to dump the content of this writable
-
EMPTY_SPAN:
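Example values following the definitions above (a sketch, not library documentation output):
from zensols.nlp.domain import LexicalSpan

a = LexicalSpan(0, 5)
b = LexicalSpan(3, 10)
print(a.overlaps_with(b))                 # True
print(a.narrow(b))                        # maximized begin, minimized end: (3, 5)
print(LexicalSpan.widen([a, b]))          # leftmost begin, rightmost end: (0, 10)
print(LexicalSpan.gaps([LexicalSpan(0, 5), LexicalSpan(10, 12)]))  # the hole: (5, 10)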
- exception zensols.nlp.domain.NLPError[source]#
Bases:
APIError
Raised for any errors for this library.
- __module__ = 'zensols.nlp.domain'#
- exception zensols.nlp.domain.ParseError[source]#
Bases:
APIError
Raised for any parsing errors.
- __annotations__ = {}#
- __module__ = 'zensols.nlp.domain'#
- class zensols.nlp.domain.TextContainer[source]#
Bases:
Dictable
A writable class that has a
text
property or attribute. All subclasses need anorm
attribute or property.- write(depth=0, writer=<_io.TextIOWrapper name='<stdout>' mode='w' encoding='utf-8'>, include_original=True, include_normalized=True)[source]#
Write this instance as either a
Writable
or as aDictable
. If class attribute_DICTABLE_WRITABLE_DESCENDANTS
is set asTrue
, then use thewrite()
method on children instead of writing the generated dictionary. Otherwise, write this instance by first creating adict
recursively usingasdict()
, then formatting the output.If the attribute
_DICTABLE_WRITE_EXCLUDES
is set, those attributes are removed from what is written in thewrite()
method.Note that this attribute will need to be set in all descendants in the instance hierarchy since writing the object instance graph is done recursively.
- Parameters:
depth (
int
) – the starting indentation depthwriter (
TextIOBase
) – the writer to dump the content of this writable
zensols.nlp.nerscore#
Wraps the SemEval-2013 Task 9.1 NER evaluation API as a
ScoreMethod
.
From the David Batista blog post:
The SemEval’13 introduced four different ways to measure precision/recall/f1-score results based on the metrics defined by MUC:
Strict: exact boundary surface string match and entity type
Exact: exact boundary match over the surface string, regardless of the type
Partial: partial boundary match over the surface string, regardless of the type
Type: some overlap between the system tagged entity and the gold annotation is required
Each of these ways to measure the performance accounts for correct, incorrect, partial, missed and spurious in different ways. Let’s look in detail and see how each of the metrics defined by MUC falls into each of the scenarios described above.
- see:
- see:
- class zensols.nlp.nerscore.SemEvalHarmonicMeanScore(precision, recall, f_score, correct, incorrect, partial, missed, spurious, possible, actual)[source]#
Bases:
HarmonicMeanScore
A harmonic mean score with the additional SemEval computed scores (see module
zensols.nlp.nerscore
docs).-
NAN_INSTANCE:
ClassVar
[SemEvalHarmonicMeanScore
] = SemEvalHarmonicMeanScore(precision=nan, recall=nan, f_score=nan, correct=nan, incorrect=nan, partial=nan, missed=nan, spurious=nan, possible=nan, actual=nan)# Used to add to ErrorScore for harmonic means replacements.
- __init__(precision, recall, f_score, correct, incorrect, partial, missed, spurious, possible, actual)#
-
incorrect:
int
# the output of a system and the golden annotation don’t match.
- Type:
The number of incorrect (INC)
-
NAN_INSTANCE:
- class zensols.nlp.nerscore.SemEvalScore(strict, exact, partial, ent_type)[source]#
Bases:
Score
Contains all four harmonic mean SemEval scores (see module
zensols.nlp.nerscore
docs). This score has four harmonic means providing various levels of accuracy.-
NAN_INSTANCE:
ClassVar
[SemEvalScore
] = SemEvalScore(strict=SemEvalHarmonicMeanScore(precision=nan, recall=nan, f_score=nan, correct=nan, incorrect=nan, partial=nan, missed=nan, spurious=nan, possible=nan, actual=nan), exact=SemEvalHarmonicMeanScore(precision=nan, recall=nan, f_score=nan, correct=nan, incorrect=nan, partial=nan, missed=nan, spurious=nan, possible=nan, actual=nan), partial=SemEvalHarmonicMeanScore(precision=nan, recall=nan, f_score=nan, correct=nan, incorrect=nan, partial=nan, missed=nan, spurious=nan, possible=nan, actual=nan), ent_type=SemEvalHarmonicMeanScore(precision=nan, recall=nan, f_score=nan, correct=nan, incorrect=nan, partial=nan, missed=nan, spurious=nan, possible=nan, actual=nan))#
- __init__(strict, exact, partial, ent_type)#
-
ent_type:
SemEvalHarmonicMeanScore
# Some overlap between the system tagged entity and the gold annotation is required.
-
exact:
SemEvalHarmonicMeanScore
# Exact boundary match over the surface string, regardless of the type.
-
partial:
SemEvalHarmonicMeanScore
# Partial boundary match over the surface string, regardless of the type.
-
strict:
SemEvalHarmonicMeanScore
# Exact boundary surface string match and entity type.
-
NAN_INSTANCE:
- class zensols.nlp.nerscore.SemEvalScoreMethod(reverse_sents=False, labels=None)[source]#
Bases:
ScoreMethod
A SemEval-2013 Task 9.1 score method (see module
zensols.nlp.nerscore
docs). This score has four harmonic means providing various levels of accuracy. Sentence pairs are ordered as (<gold>, <prediction>)
.- __init__(reverse_sents=False, labels=None)#
zensols.nlp.norm#
Normalize text and map Spacy documents.
- class zensols.nlp.norm.FilterRegularExpressionMapper(regex='[ ]+', invert=False)[source]#
Bases:
TokenMapper
Filter tokens based on normalized form regular expression.
- __init__(regex='[ ]+', invert=False)#
- class zensols.nlp.norm.FilterTokenMapper(remove_stop=False, remove_space=False, remove_pronouns=False, remove_punctuation=False, remove_determiners=False)[source]#
Bases:
TokenMapper
Filter tokens based on token (Spacy) attributes.
Configuration example:
[filter_token_mapper]
class_name = zensols.nlp.FilterTokenMapper
remove_stop = True
remove_punctuation = True
- __init__(remove_stop=False, remove_space=False, remove_pronouns=False, remove_punctuation=False, remove_determiners=False)#
- class zensols.nlp.norm.JoinTokenMapper(regex='[ ]', separator=None)[source]#
Bases:
object
Join tokens based on a regular expression. It does this by creating spans in the spaCy component (first in the tuple) and using the span text as the normalized token.
- __init__(regex='[ ]', separator=None)#
- class zensols.nlp.norm.LambdaTokenMapper(add_lambda=None, map_lambda=None)[source]#
Bases:
TokenMapper
Use a lambda expression to map a token tuple.
This is handy for specialized behavior that can be added directly to a configuration file.
Configuration example:
[lc_lambda_token_mapper]
class_name = zensols.nlp.LambdaTokenMapper
map_lambda = lambda x: (x[0], f'<{x[1].lower()}>')
- __init__(add_lambda=None, map_lambda=None)#
- class zensols.nlp.norm.LemmatizeTokenMapper(lemmatize=True, remove_first_stop=False)[source]#
Bases:
TokenMapper
Lemmatize tokens and optional remove entity stop words.
Important: This completely ignores the normalized input token string and essentially just replaces it with the lemma found in the token instance.
Configuration example:
[lemma_token_mapper]
class_name = zensols.nlp.LemmatizeTokenMapper
- Parameters:
- __init__(lemmatize=True, remove_first_stop=False)#
- class zensols.nlp.norm.MapTokenNormalizer(embed_entities=True, config_factory=None, mapper_class_list=<factory>)[source]#
Bases:
TokenNormalizer
A normalizer that applies a sequence of
TokenMapper
instances to transform the normalized token text. The members of themapper_class_list
are sections of the application configuration.Configuration example:
[map_filter_token_normalizer]
class_name = zensols.nlp.MapTokenNormalizer
mapper_class_list = list: filter_token_mapper
- __init__(embed_entities=True, config_factory=None, mapper_class_list=<factory>)#
-
config_factory:
ConfigFactory
= None# The factory that created this instance and used to create the mappers.
- class zensols.nlp.norm.SplitEntityTokenMapper(token_unit_type=False, copy_attributes=('label', 'label_'))[source]#
Bases:
TokenMapper
Splits embedded entities (or any
Span
) in to separate tokens. This is useful for splitting up entities as tokens after being grouped withTokenNormalizer.embed_entities
. Note,embed_entities
must beTrue
to create the entities as they come from spaCy as spans. This then can be used to createSpacyFeatureToken
with spans that have the entity.- __init__(token_unit_type=False, copy_attributes=('label', 'label_'))#
- class zensols.nlp.norm.SplitTokenMapper(regex='[ ]')[source]#
Bases:
TokenMapper
Splits the normalized text on a per token basis with a regular expression.
Configuration example:
[split_token_mapper]
class_name = zensols.nlp.SplitTokenMapper
regex = r'[ ]'
- __init__(regex='[ ]')#
- class zensols.nlp.norm.SubstituteTokenMapper(regex='', replace_char='')[source]#
Bases:
TokenMapper
Replace a regular expression in normalized token text.
Configuration example:
[subs_token_mapper]
class_name = zensols.nlp.SubstituteTokenMapper
regex = r'[ \t]'
replace_char = _
- __init__(regex='', replace_char='')#
- class zensols.nlp.norm.TokenMapper[source]#
Bases:
ABC
Abstract class used to transform token tuples generated from
TokenNormalizer.normalize()
.- __init__()#
- class zensols.nlp.norm.TokenNormalizer(embed_entities=True)[source]#
Bases:
object
Base token extractor returns tuples of tokens and their normalized version.
Configuration example:
[default_token_normalizer]
class_name = zensols.nlp.TokenNormalizer
embed_entities = False
- __init__(embed_entities=True)#
zensols.nlp.parser#
Parse documents and generate features in an organized taxonomy.
- class zensols.nlp.parser.CachingFeatureDocumentParser(name, delegate, token_decorators=(), sentence_decorators=(), document_decorators=(), token_feature_ids=<factory>, stash=None, hasher=<factory>)[source]#
Bases:
DecoratedFeatureDocumentParser
A document parser that persists previous parses using the hash of the text as a key. Caching is optional given the value of
stash
, which is useful when this class is extended for use cases other than just caching.- __init__(name, delegate, token_decorators=(), sentence_decorators=(), document_decorators=(), token_feature_ids=<factory>, stash=None, hasher=<factory>)#
- parse(text, *args, **kwargs)[source]#
Parse text or a text as a list of sentences.
- Parameters:
text (
str
) – either a string or a list of strings; if the former a document with one sentence will be created, otherwise a document is returned with a sentence for each string in the list
args – the arguments used to create the FeatureDocument instance
kwargs – the key word arguments used to create the FeatureDocument instance
- Return type:
- class zensols.nlp.parser.Component(name, pipe_name=None, pipe_config=None, pipe_add_kwargs=<factory>, modules=(), initializers=())[source]#
Bases:
object
A pipeline component to be added to the spaCy model.
- __init__(name, pipe_name=None, pipe_config=None, pipe_add_kwargs=<factory>, modules=(), initializers=())#
- init(model)[source]#
Initialize the component and add it to the NLP pipe line. This base class implementation loads the
module
, then callsLanguage.add_pipe()
.- Parameters:
model (
Language
) – the model to add the spaCy model (nlp
in their parlance)
-
initializers:
Tuple
[ComponentInitializer
,...
] = ()# Instances to initialize upon this object’s initialization.
-
modules:
Sequence
[str
] = ()# The modules to import before adding component pipelines. This will register components mentioned in
components
when the respective module is loaded.
- class zensols.nlp.parser.ComponentInitializer[source]#
Bases:
ABC
Called by
Component
to do post spaCy initialization.
- class zensols.nlp.parser.DecoratedFeatureDocumentParser(name, delegate, token_decorators=(), sentence_decorators=(), document_decorators=(), token_feature_ids=<factory>)[source]#
Bases:
FeatureDocumentParser
This class adapts the
FeatureDocumentParser
adaptors to the general case using a GoF decorator pattern. This is useful for any post processing needed on existing configured document parsers.- All decorators are processed in the following order:
Token
Sentence
Document
Token features are stored in the delegate for those that have them. Otherwise, they are stored in instances of this class.
- __init__(name, delegate, token_decorators=(), sentence_decorators=(), document_decorators=(), token_feature_ids=<factory>)#
-
delegate:
FeatureDocumentParser
# Used to create the feature documents.
-
document_decorators:
Sequence
[FeatureDocumentDecorator
] = ()# A list of decorators that can add, remove or modify features on a document.
-
name:
str
# The name of the parser, which is taken from the section name when created with a
ConfigFactory
and used for debugging.
- parse(text, *args, **kwargs)[source]#
Parse text or a text as a list of sentences.
- Parameters:
text (
str
) – either a string or a list of strings; if the former a document with one sentence will be created, otherwise a document is returned with a sentence for each string in the list
args – the arguments used to create the FeatureDocument instance
kwargs – the key word arguments used to create the FeatureDocument instance
- Return type:
-
sentence_decorators:
Sequence
[FeatureSentenceDecorator
] = ()# A list of decorators that can add, remove or modify features on a sentence.
-
token_decorators:
Sequence
[FeatureTokenDecorator
] = ()# A list of decorators that can add, remove or modify features on a token.
- class zensols.nlp.parser.FeatureDocumentDecorator[source]#
Bases:
FeatureTokenContainerDecorator
Implementations can add, remove or modify features on a document.
- class zensols.nlp.parser.FeatureDocumentParser[source]#
Bases:
PersistableContainer
,Dictable
This class parses text into
FeatureDocument
instances using parse()
.-
TOKEN_FEATURE_IDS:
ClassVar
[Set
[str
]] = frozenset({'children', 'dep', 'dep_', 'ent', 'ent_', 'ent_iob', 'ent_iob_', 'i', 'i_sent', 'idx', 'is_contraction', 'is_ent', 'is_pronoun', 'is_punctuation', 'is_space', 'is_stop', 'is_superlative', 'is_wh', 'lemma_', 'lexspan', 'norm', 'norm_len', 'pos_', 'sent_i', 'shape', 'shape_', 'tag', 'tag_'})# The default value for
token_feature_ids
.
- __init__()#
- static default_instance()[source]#
Create the parser as configured in the resource library of the package.
- Return type:
- abstract parse(text, *args, **kwargs)[source]#
Parse text or a text as a list of sentences.
- Parameters:
text (
str
) – either a string or a list of strings; if the former a document with one sentence will be created, otherwise a document is returned with a sentence for each string in the list
args – the arguments used to create the FeatureDocument instance
kwargs – the key word arguments used to create the FeatureDocument instance
- Return type:
-
TOKEN_FEATURE_IDS:
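Usage sketch (default_instance() uses the package's resource library configuration and needs an installed spaCy English model; the text is an arbitrary example):
from zensols.nlp.parser import FeatureDocumentParser

parser = FeatureDocumentParser.default_instance()
doc = parser.parse('Obama was born in Hawaii. He was the 44th president.')
for sent in doc:
    print(sent.text)
    for tok in sent.token_iter():
        print(' ', tok.norm, tok.pos_, tok.ent_)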
- class zensols.nlp.parser.FeatureSentenceDecorator[source]#
Bases:
FeatureTokenContainerDecorator
Implementations can add, remove or modify features on a sentence.
- class zensols.nlp.parser.FeatureSentenceFactory(token_decorators=())[source]#
Bases:
object
Create a
FeatureSentence
out of single tokens or split on whitespace. This is a utility class to create data structures when only single tokens are the source data. For example, if you only have tokens that need to be scored with Unigram Rouge-1, use this class to create sentences, which are instances of a subclass of
TokenContainer
.- __init__(token_decorators=())#
- create(tokens)[source]#
Create a sentence from tokens.
-
token_decorators:
Sequence
[FeatureTokenDecorator
] = ()# A list of decorators that can add, remove or modify features on a token.
- class zensols.nlp.parser.FeatureTokenContainerDecorator[source]#
Bases:
ABC
Implementations can add, remove or modify features on a token container.
- class zensols.nlp.parser.FeatureTokenDecorator[source]#
Bases:
ABC
Implementations can add, remove or modify features on a token.
- class zensols.nlp.parser.WhiteSpaceTokenizerFeatureDocumentParser(sent_class=<class 'zensols.nlp.container.FeatureSentence'>, doc_class=<class 'zensols.nlp.container.FeatureDocument'>)[source]#
Bases:
FeatureDocumentParser
This class parses text into
FeatureDocument
instances, tokenizing only by whitespace. This parser does no sentence chunking, so documents have one and only one sentence for each parse.- __init__(sent_class=<class 'zensols.nlp.container.FeatureSentence'>, doc_class=<class 'zensols.nlp.container.FeatureDocument'>)#
- doc_class#
The type of document instances to create.
alias of
FeatureDocument
- parse(text, *args, **kwargs)[source]#
Parse text or a text as a list of sentences.
- Parameters:
text (
str
) – either a string or a list of strings; if the former a document with one sentence will be created, otherwise a document is returned with a sentence for each string in the list
args – the arguments used to create the FeatureDocument instance
kwargs – the key word arguments used to create the FeatureDocument instance
- Return type:
- sent_class#
The type of sentence instances to create.
alias of
FeatureSentence
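Example (a sketch; no spaCy model is required since tokenization is whitespace only):
from zensols.nlp.parser import WhiteSpaceTokenizerFeatureDocumentParser

parser = WhiteSpaceTokenizerFeatureDocumentParser()
doc = parser.parse('tokens split only on whitespace')
print(len(doc.sents))                     # always 1: no sentence chunking
print([t.norm for t in doc.token_iter()])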
zensols.nlp.score#
Produces matching scores.
- class zensols.nlp.score.BleuScoreMethod(reverse_sents=False, smoothing_function=None, weights=(0.25, 0.25, 0.25, 0.25), silence_warnings=False)[source]#
Bases:
ScoreMethod
The BLEU scoring method using the
nltk
package. The first sentences are the references and the second are the hypothesis.- __init__(reverse_sents=False, smoothing_function=None, weights=(0.25, 0.25, 0.25, 0.25), silence_warnings=False)#
-
silence_warnings:
bool
= False# Silence the BLEU warning of n-grams not matching
The hypothesis contains 0 counts of 3-gram overlaps...
-
smoothing_function:
SmoothingFunction
= None# This is an implementation of the smoothing techniques for segment-level BLEU scores.
Citation:
Chen and Cherry (2014) A Systematic Comparison of Smoothing Techniques for Sentence-Level BLEU. In WMT14.
- class zensols.nlp.score.ErrorScore(method, exception, replace_score=None)[source]#
Bases:
Score
A replacement instance when scoring fails from a raised exception.
- __init__(method, exception, replace_score=None)#
-
method:
str
# The method of the
ScoreMethod
that raised the exception.
-
replace_score:
Score
= None# The score to use in place of this score. Otherwise
asrow()
return a singlenumpy.nan
likeFloatScore
.
- class zensols.nlp.score.ExactMatchScoreMethod(reverse_sents=False, equality_measure='norm')[source]#
Bases:
ScoreMethod
A scoring method that returns 1 for exact matches and 0 otherwise.
- __init__(reverse_sents=False, equality_measure='norm')#
-
equality_measure:
str
= 'norm'# The method by which to compare, which is one of:
norm
: compare with TokenContainer.norm()
text
: compare with TokenContainer.text
equal
: compare using a Python object__eq__
equal compare,which also compares the token values
- class zensols.nlp.score.FloatScore(value)[source]#
Bases:
Score
Float container. This is needed to create the flat result container structure. Object creation becomes less important since most clients will use
ScoreSet.asnumpy()
.-
NAN_INSTANCE:
ClassVar
[FloatScore
] = FloatScore(value=nan)# Used to add to ErrorScore for harmonic means replacements.
- __init__(value)#
-
NAN_INSTANCE:
- class zensols.nlp.score.HarmonicMeanScore(precision, recall, f_score)[source]#
Bases:
Score
A score having a precision, recall and the harmonic mean of the two, the F-score.
-
NAN_INSTANCE:
ClassVar
[HarmonicMeanScore
] = HarmonicMeanScore(precision=nan, recall=nan, f_score=nan)# Used to add to ErrorScore for harmonic means replacements.
- __init__(precision, recall, f_score)#
-
NAN_INSTANCE:
- class zensols.nlp.score.LevenshteinDistanceScoreMethod(reverse_sents=False, form='canon', normalize=True)[source]#
Bases:
ScoreMethod
A scoring method that computes the Levenshtein distance.
- __init__(reverse_sents=False, form='canon', normalize=True)#
-
form:
str
= 'canon'# The form of the text used for the evaluation, which is one of:
text
: the original text with TokenContainer.text
norm
: the normalized text using TokenContainer.norm()
canon
:TokenContainer.canonical
to normalize out whitespace for better comparisons
- class zensols.nlp.score.RougeScoreMethod(reverse_sents=False, feature_tokenizer=True)[source]#
Bases:
ScoreMethod
The ROUGE scoring method using the
rouge_score
package.- __init__(reverse_sents=False, feature_tokenizer=True)#
-
feature_tokenizer:
bool
= True# Whether to use the
TokenContainer
tokenization, otherwise use therouge_score
package.
- class zensols.nlp.score.Score[source]#
Bases:
Dictable
Individual scores returned from
ScoreMethod
.- __init__()#
- class zensols.nlp.score.ScoreContext(pairs, methods=None, norm=True, correlation_ids=None)[source]#
Bases:
Dictable
Input needed to create score(s) using
Scorer
.- __init__(pairs, methods=None, norm=True, correlation_ids=None)#
-
correlation_ids:
Tuple
[Union
[int
,str
]] = None# The IDs to correlate with each sentence pair, or
None
to skip correlating them. The length of this tuple must be that ofpairs
.
-
methods:
Set
[str
] = None# A set of strings, each indicating the
ScoreMethod
used to scorepairs
.
-
pairs:
Tuple
[Tuple
[TokenContainer
,TokenContainer
]]# Sentence, span or document pairs to score (order matters for some scoring methods such as rouge). Depending on the scoring method the ordering of the sentence pairs should be:
(<summary>, <source>)
(<gold>, <prediction>)
(<references>, <candidates>)
See
ScoreMethod
implementations for more information about pair ordering.
- class zensols.nlp.score.ScoreMethod(reverse_sents=False)[source]#
Bases:
ABC
An abstract base class for scoring methods (bleu, rouge, etc).
- __init__(reverse_sents=False)#
- classmethod is_available()[source]#
Whether or not this method is available on this system.
- Return type:
- class zensols.nlp.score.ScoreResult(scores, correlation_id=None)[source]#
Bases:
Dictable
A result of scores created by a
ScoreMethod
.- __init__(scores, correlation_id=None)#
-
correlation_id:
Optional
[str
] = None# An ID for correlating back to the
TokenContainer
.
- write(depth=0, writer=<_io.TextIOWrapper name='<stdout>' mode='w' encoding='utf-8'>)[source]#
Write this instance as either a
Writable
or as aDictable
. If class attribute_DICTABLE_WRITABLE_DESCENDANTS
is set asTrue
, then use thewrite()
method on children instead of writing the generated dictionary. Otherwise, write this instance by first creating adict
recursively usingasdict()
, then formatting the output.If the attribute
_DICTABLE_WRITE_EXCLUDES
is set, those attributes are removed from what is written in thewrite()
method.Note that this attribute will need to be set in all descendants in the instance hierarchy since writing the object instance graph is done recursively.
- Parameters:
depth (
int
) – the starting indentation depthwriter (
TextIOBase
) – the writer to dump the content of this writable
- class zensols.nlp.score.ScoreSet(results, correlation_id_col='id')[source]#
Bases:
Dictable
All scores returned from Scorer.
- __init__(results, correlation_id_col='id')#
- as_dataframe(add_correlation=True)[source]#
This gets data from
as_numpy()
and returns it as a Pandas dataframe.- Parameters:
add_correlation (bool) – whether to add the correlation ID (if there is one), using
correlation_id_col
- Return type:
pandas.DataFrame
- Returns:
an instance of
pandas.DataFrame
of the results
- as_numpy(add_correlation=True)[source]#
Return the Numpy array with column descriptors of the results. spaCy depends on Numpy, so this package will always be available.
- Parameters:
add_correlation (
bool
) – whether to add the correlation ID (if there is one), usingcorrelation_id_col
- Return type:
-
correlation_id_col:
str
= 'id'# The column name for the
ScoreResult.correlation_id
added to Numpy arrays and Pandas dataframes. IfNone
, then the correlation IDS are used as the index.
-
results:
Tuple
[ScoreResult
]# A tuple with each element having the results of the respective sentence pair in
ScoreContext.sents
. Each element is a dictionary with the method names as keys and the results as values, as output of the
. This is created inScorer
.
- write(depth=0, writer=<_io.TextIOWrapper name='<stdout>' mode='w' encoding='utf-8'>)[source]#
Write this instance as either a
Writable
or as aDictable
. If class attribute_DICTABLE_WRITABLE_DESCENDANTS
is set asTrue
, then use thewrite()
method on children instead of writing the generated dictionary. Otherwise, write this instance by first creating adict
recursively usingasdict()
, then formatting the output.If the attribute
_DICTABLE_WRITE_EXCLUDES
is set, those attributes are removed from what is written in thewrite()
method.Note that this attribute will need to be set in all descendants in the instance hierarchy since writing the object instance graph is done recursively.
- Parameters:
depth (
int
) – the starting indentation depthwriter (
TextIOBase
) – the writer to dump the content of this writable
- class zensols.nlp.score.Scorer(methods=None, default_methods=None)[source]#
Bases:
object
A class that scores sentences using a set of registered methods (
methods
).- __init__(methods=None, default_methods=None)#
-
default_methods:
Set
[str
] = None# Methods (keys from
methods
) to use when none are provided in theScoreContext.meth
in the call toscore()
.
-
methods:
Dict
[str
,ScoreMethod
] = None# The registered scoring methods available, which are accessed from
ScoreContext.meth
.
- score(context)[source]#
Score the sentences in
context
.- Parameters:
context (
ScoreContext
) – the context containing the data to score- Return type:
- Returns:
the results for each method indicated in
context
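Usage sketch (the method keys, parser and example text are assumptions; the nltk and rouge_score packages must be installed for the respective methods; in practice the scorer is usually created from configuration):
from zensols.nlp.parser import FeatureDocumentParser
from zensols.nlp.score import (
    Scorer, ScoreContext, BleuScoreMethod, RougeScoreMethod)

parser = FeatureDocumentParser.default_instance()
gold = parser.parse('The cat sat on the mat.')
pred = parser.parse('A cat sat on a mat.')

scorer = Scorer(methods={'bleu': BleuScoreMethod(), 'rouge': RougeScoreMethod()})
score_set = scorer.score(ScoreContext(
    pairs=((gold, pred),), methods={'bleu', 'rouge'}))
print(score_set.as_dataframe())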
zensols.nlp.serial#
Serializes FeatureToken
and TokenContainer
instances
using the Dictable
interface.
- class zensols.nlp.serial.Include(value, names=None, *, module=None, qualname=None, type=None, start=1, boundary=None)[source]#
Bases:
Enum
Indicates what to include at each level.
- normal = 2#
The normalized form of the text.
- original = 1#
The original text.
- sentences = 4#
The sentences of the
FeatureDocument
.
- tokens = 3#
The tokens of the
TokenContainer
.
- class zensols.nlp.serial.Serialized(container, includes, feature_ids)[source]#
Bases:
Dictable
A base strategy class that can serialize
TokenContainer
instances.- __init__(container, includes, feature_ids)#
-
container:
TokenContainer
# The container to be serialized.
- class zensols.nlp.serial.SerializedFeatureDocument(container, includes, feature_ids, sentence_includes)[source]#
Bases:
Serialized
A serializer for feature documents. The
container
has to be an instance of a FeatureDocument
.- __init__(container, includes, feature_ids, sentence_includes)#
- class zensols.nlp.serial.SerializedTokenContainer(container, includes, feature_ids)[source]#
Bases:
Serialized
Serializes instances of
TokenContainer
. This is used to serialize spans and sentences.- __init__(container, includes, feature_ids)#
- class zensols.nlp.serial.SerializedTokenContainerFactory(sentence_includes, document_includes, feature_ids=None)[source]#
Bases:
Dictable
Creates instances of
Serialized
from instances of TokenContainer
. These can then be used as Dictable
instances, specifically with the asdict
and asjson
methods.- __init__(sentence_includes, document_includes, feature_ids=None)#
- create(container)[source]#
Create a serializer from
container
(see class docs).- Parameters:
container (
TokenContainer
) – the container to be serialized
- Return type:
- Returns:
an object that can be serialized using
asdict
and asjson
methods.
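A sketch of serializing a parsed document to JSON; the include sets and feature IDs below are illustrative assumptions, and doc is assumed to be a FeatureDocument created by a document parser:
from zensols.nlp.serial import Include, SerializedTokenContainerFactory

# 'doc' is an assumed FeatureDocument from a document parser
factory = SerializedTokenContainerFactory(
    sentence_includes={Include.normal, Include.tokens},
    document_includes={Include.original, Include.sentences},
    feature_ids={'norm', 'pos_', 'lemma_'})
serialized = factory.create(doc)
print(serialized.asjson())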
zensols.nlp.spannorm#
Normalize spans (of tokens) into strings by reconstructing based on language
rules from the normalized form of the tokens. This is needed after any token
manipulation from TokenNormalizer
or other changes to
FeatureToken.norm
.
For now, only English is supported, but the module is structured to allow other languages and future enhancements to the normalization configuration.
- class zensols.nlp.spannorm.EnglishSpanNormalizer(post_space_skip=frozenset({'(', '-', '<', '[', '`', '{', '‘', '“'}), pre_space_skip=frozenset({"'d", "'ll", "'m", "'re", "'s", "'ve", '-', "n't"}), keep_space_skip=frozenset({'_'}), canonical_delimiter='|')[source]#
Bases:
SpanNormalizer
An implementation of a span normalizer for the English language.
- __init__(post_space_skip=frozenset({'(', '-', '<', '[', '`', '{', '‘', '“'}), pre_space_skip=frozenset({"'d", "'ll", "'m", "'re", "'s", "'ve", '-', "n't"}), keep_space_skip=frozenset({'_'}), canonical_delimiter='|')#
- get_canonical(tokens)[source]#
A canonical representation of the container, which is the non-space tokens separated by
CANONICAL_DELIMITER
.- Return type:
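A sketch of producing the canonical form of a sentence's tokens; sent is assumed to be a FeatureSentence taken from a parsed FeatureDocument created elsewhere:
from zensols.nlp.spannorm import EnglishSpanNormalizer

normalizer = EnglishSpanNormalizer()
# the sentence's tokens are joined on the canonical delimiter
# (by default '|' per the constructor above)
print(normalizer.get_canonical(sent.tokens))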
- class zensols.nlp.spannorm.SpanNormalizer[source]#
Bases:
object
Subclasses normalize feature tokens on a per
spacy.Language
basis. All subclasses must be re-entrant.
zensols.nlp.sparser#
The spaCy FeatureDocumentParser
implementation.
- class zensols.nlp.sparser.SpacyFeatureDocumentParser(config_factory, name, lang='en', model_name=None, token_feature_ids=<factory>, components=(), token_decorators=(), sentence_decorators=(), document_decorators=(), disable_component_names=None, token_normalizer=None, special_case_tokens=<factory>, doc_class=<class 'zensols.nlp.container.FeatureDocument'>, sent_class=<class 'zensols.nlp.container.FeatureSentence'>, token_class=<class 'zensols.nlp.tok.SpacyFeatureToken'>, remove_empty_sentences=None, reload_components=False, auto_install_model=False)[source]#
Bases:
FeatureDocumentParser
This language resource parses text into spaCy documents. Loaded spaCy models have the attribute
doc_parser
set to enable creation of factory instances from registered pipe components (i.e. specified by Component
). Configuration example:
[doc_parser]
class_name = zensols.nlp.sparser.SpacyFeatureDocumentParser
lang = en
model_name = ${lang}_core_web_sm
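A sketch of creating the parser from the section above with a zensols configuration factory; the configuration file name is hypothetical and the ImportIniConfig/ImportConfigFactory classes from zensols.config are an assumption to verify against the installed API:
from zensols.config import ImportIniConfig, ImportConfigFactory

# 'parser.conf' is a hypothetical file containing the [doc_parser] section
factory = ImportConfigFactory(ImportIniConfig('parser.conf'))
doc_parser = factory('doc_parser')
doc = doc_parser.parse('Obama was the 44th president. He lived in Chicago.')
doc.write()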
Decorators are processed in the same way as
DecoratedFeatureDocumentParser
.- __init__(config_factory, name, lang='en', model_name=None, token_feature_ids=<factory>, components=(), token_decorators=(), sentence_decorators=(), document_decorators=(), disable_component_names=None, token_normalizer=None, special_case_tokens=<factory>, doc_class=<class 'zensols.nlp.container.FeatureDocument'>, sent_class=<class 'zensols.nlp.container.FeatureSentence'>, token_class=<class 'zensols.nlp.tok.SpacyFeatureToken'>, remove_empty_sentences=None, reload_components=False, auto_install_model=False)#
-
auto_install_model:
bool
= False# Whether to install models not already available. Note that this uses the pip command to download model requirements, which might have an adverse effect of replacing currently installed Python packages.
-
config_factory:
ConfigFactory
# A configuration parser optionally used by pipeline
Component
instances.
-
disable_component_names:
Sequence
[str
] = None# Components to disable in the spaCy model when creating documents in
parse()
.
- doc_class#
The type of document instances to create.
alias of
FeatureDocument
-
document_decorators:
Sequence
[FeatureDocumentDecorator
] = ()# A list of decorators that can add, remove or modify features on a document.
- from_spacy_doc(doc, *args, text=None, **kwargs)[source]#
Create a
FeatureDocument
from a spaCy doc.- Parameters:
doc (
Doc
) – the spaCy generated document to transform into a feature document
text (
str
) – either a string or a list of strings; if the former a document with one sentence will be created, otherwise a document is returned with a sentence for each string in the list
args – the arguments used to create the FeatureDocument instance
kwargs – the key word arguments used to create the FeatureDocument instance
- Return type:
- get_dictable(doc)[source]#
Return a dictionary object graph that pretty prints spaCy docs.
- Return type:
- property model: Language#
The spaCy model. On first access, this creates a new instance using
model_name
.
-
model_name:
str
= None# The spaCy model name (defaults to
en_core_web_sm
); this is ignored if model
is not None
.
-
name:
str
# The name of the parser, which is taken from the section name when created with a
ConfigFactory
and used for debugging.
- parse(text, *args, **kwargs)[source]#
Parse text given either as a single string or as a list of sentence strings.
- Parameters:
text (
str
) – either a string or a list of strings; if the former a document with one sentence will be created, otherwise a document is returned with a sentence for each string in the list
args – the arguments used to create the FeatureDocument instance
kwargs – the key word arguments used to create the FeatureDocument instance
- Return type:
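Per the parameter description above, a list input yields one sentence per element while a single string is segmented by the spaCy pipeline; a short sketch using the doc_parser created in the configuration example:
doc = doc_parser.parse('The dog barked. The cat meowed.')
print(len(doc.sents))            # sentence count decided by the spaCy pipeline
doc2 = doc_parser.parse(['The dog barked.', 'The cat meowed.'])
print(len(doc2.sents))           # exactly 2: one sentence per list element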
-
reload_components:
bool
= False# Removes, then re-adds components for cached models. This is helpful for when there are component configurations that change on reruns with a different application context but in the same Python interpreter session.
A spaCy component can get other instances via
config_factory
, but if this is False
it will be paired with the first instance of this class and not the new ones created with a new configuration factory.
-
remove_empty_sentences:
bool
= None# Deprecated and will be removed in a future version. Use
FilterSentenceFeatureDocumentDecorator
instead.
- sent_class#
The type of sentence instances to create.
alias of
FeatureSentence
-
sentence_decorators:
Sequence
[FeatureSentenceDecorator
] = ()# A list of decorators that can add, remove or modify features on a sentence.
- to_spacy_doc(doc, norm=True, add_features=None)[source]#
Convert a feature document back into a spaCy document.
Note: not all data is copied–only text,
pos_
,tag_
,lemma_
anddep_
.- Parameters:
doc (
FeatureDocument
) – the feature document to convert
norm (
bool
) – whether to use the normalized text as the orth_
spaCy token attribute or text
- Param add_features:
whether to add POS, NER tags, lemmas, heads and dependencies
- Return type:
Doc
- Returns:
the feature document with copied data from
doc
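A round-trip sketch using the parser from the configuration example:
fdoc = doc_parser.parse('Dan threw the ball.')
sdoc = doc_parser.to_spacy_doc(fdoc)
print(type(sdoc).__name__)                 # Doc
print([t.orth_ for t in sdoc])             # normalized token text (norm=True)
# keep the original surface text instead of the normalized form
sdoc_orig = doc_parser.to_spacy_doc(fdoc, norm=False)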
- token_class#
The type of token instances to create.
alias of
SpacyFeatureToken
-
token_decorators:
Sequence
[FeatureTokenDecorator
] = ()# A list of decorators that can add, remove or modify features on a token.
-
token_normalizer:
TokenNormalizer
= None# The token normalizer for methods that use it, i.e.
features
.
zensols.nlp.stemmer#
Stem text using the Porter stemmer.
zensols.nlp.tok#
Feature token and related base classes.
- class zensols.nlp.tok.FeatureToken(i, idx, i_sent, norm)[source]#
Bases:
PersistableContainer
,TextContainer
A container class for features about a token. Subclasses such as
SpacyFeatureToken
extract only a subset of features from the heavy spaCy C data structures, which are hard/expensive to pickle. Feature note: features
i
,idx
andi_sent
are always added to feature tokens to be able to reconstruct sentences (see FeatureDocument.uncombine_sentences()
), and always included.-
FEATURE_IDS:
ClassVar
[Set
[str
]] = frozenset({'children', 'dep', 'dep_', 'ent', 'ent_', 'ent_iob', 'ent_iob_', 'i', 'i_sent', 'idx', 'is_contraction', 'is_ent', 'is_pronoun', 'is_punctuation', 'is_space', 'is_stop', 'is_superlative', 'is_wh', 'lemma_', 'lexspan', 'norm', 'norm_len', 'pos_', 'sent_i', 'shape', 'shape_', 'tag', 'tag_'})# All default available feature IDs.
-
FEATURE_IDS_BY_TYPE:
ClassVar
[Dict
[str
,Set
[str
]]] = {'bool': frozenset({'is_contraction', 'is_ent', 'is_pronoun', 'is_space', 'is_stop', 'is_superlative', 'is_wh'}), 'int': frozenset({'dep', 'ent', 'ent_iob', 'i', 'i_sent', 'idx', 'is_punctuation', 'norm_len', 'sent_i', 'shape', 'tag'}), 'list': frozenset({'children'}), 'object': frozenset({'lexspan'}), 'str': frozenset({'dep_', 'ent_', 'ent_iob_', 'lemma_', 'norm', 'pos_', 'shape_', 'tag_'})}# Map of class type to set of feature IDs.
-
REQUIRED_FEATURE_IDS:
ClassVar
[Set
[str
]] = frozenset({'i', 'i_sent', 'idx', 'norm'})# Features retained regardless of configuration for basic functionality.
-
SKIP_COMPARE_FEATURE_IDS:
ClassVar
[Set
[str
]] = {}# A set of feature IDs to avoid comparing in
__eq__()
.
-
TYPES_BY_FEATURE_ID:
ClassVar
[Dict
[str
,str
]] = {'children': 'list', 'dep': 'int', 'dep_': 'str', 'ent': 'int', 'ent_': 'str', 'ent_iob': 'int', 'ent_iob_': 'str', 'i': 'int', 'i_sent': 'int', 'idx': 'int', 'is_contraction': 'bool', 'is_ent': 'bool', 'is_pronoun': 'bool', 'is_punctuation': 'int', 'is_space': 'bool', 'is_stop': 'bool', 'is_superlative': 'bool', 'is_wh': 'bool', 'lemma_': 'str', 'lexspan': 'object', 'norm': 'str', 'norm_len': 'int', 'pos_': 'str', 'sent_i': 'int', 'shape': 'int', 'shape_': 'str', 'tag': 'int', 'tag_': 'str'}# A map of feature ID to string type. This is used by
FeatureToken.write_attributes()
to dump the type features.
-
WRITABLE_FEATURE_IDS:
ClassVar
[Tuple
[str
,...
]] = ('text', 'norm', 'idx', 'sent_i', 'i', 'i_sent', 'tag', 'pos', 'is_wh', 'entity', 'dep', 'children')# Feature IDs that are dumped on
write()
andwrite_attributes()
.
- __init__(i, idx, i_sent, norm)#
- clone(cls=None, **kwargs)[source]#
Clone an instance of this token.
- Parameters:
cls (
Type
) – the type of the new instance
kwargs – arguments to add as attributes to the clone
- Return type:
- Returns:
the cloned instance of this instance
- property default_detached_feature_ids: Set[str] | None#
The default set of feature IDs used when cloning or detaching with clone() or detach().
- detach(feature_ids=None, skip_missing=False, cls=None)[source]#
Create a detached token (i.e. detached from spaCy artifacts).
- Parameters:
feature_ids (
Set
[str
]) – the features to write, which defaults toFEATURE_IDS
skip_missing (
bool
) – whether to only keepfeature_ids
cls (
Type
[FeatureToken
]) – the type of the new instance
- Return type:
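A sketch of detaching lightweight copies of tokens from a parsed document (doc is assumed to be a FeatureDocument, and the feature ID set is purely illustrative):
for tok in doc.tokens:
    # keep only a few features; skip_missing drops anything not listed
    light = tok.detach(feature_ids={'norm', 'i', 'idx', 'i_sent', 'pos_'},
                       skip_missing=True)
    print(light.norm, light.pos_)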
-
i_sent:
int
# The index of the token within the parent sentence.
The index of the token in the respective sentence. This is not to be confused with the index of the sentence to which the token belongs, which is
sent_i
.
- split(positions)[source]#
Split on text normal index positions. This needs and updates the
idx
and lexspan
attributes.
- property text: str#
The initial text before normalized by any
TokenNormalizer
.
- write(depth=0, writer=<_io.TextIOWrapper name='<stdout>' mode='w' encoding='utf-8'>, include_type=True, feature_ids=None, inline=False)[source]#
Write this instance as either a
Writable
or as a Dictable
. If class attribute _DICTABLE_WRITABLE_DESCENDANTS
is set as True
, then use the write()
method on children instead of writing the generated dictionary. Otherwise, write this instance by first creating a dict
recursively using asdict()
, then formatting the output. If the attribute
_DICTABLE_WRITE_EXCLUDES
is set, those attributes are removed from what is written in the write()
method. Note that this attribute will need to be set in all descendants in the instance hierarchy since writing the object instance graph is done recursively.
- Parameters:
depth (
int
) – the starting indentation depth
writer (
TextIOBase
) – the writer to dump the content of this writable
- write_attributes(depth=0, writer=<_io.TextIOWrapper name='<stdout>' mode='w' encoding='utf-8'>, include_type=True, feature_ids=None, inline=False, include_none=True)[source]#
Write feature attributes.
- Parameters:
depth (
int
) – the starting indentation depthwriter (
TextIOBase
) – the writer to dump the content of this writableinclude_type (
bool
) – ifTrue
write the type of value (if available)feature_ids (
Iterable
[str
]) – the features to write, which defaults toWRITABLE_FEATURE_IDS
inline (
bool
) – whether to print attributes all on the same line
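A sketch of dumping a few features of one token (doc as in the earlier examples):
import sys

tok = doc.tokens[0]
# write selected features on a single indented line to stdout
tok.write_attributes(depth=1, writer=sys.stdout, inline=True,
                     feature_ids='norm pos_ tag_ lemma_'.split())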
- class zensols.nlp.tok.SpacyFeatureToken(spacy_token, norm)[source]#
Bases:
FeatureToken
Contains and provides the same features as a spaCy
Token
.- property children#
A sequence of the token’s immediate syntactic children.
- conll_iob_()[source]#
Return the CoNLL formatted IOB tag, such as
B-ORG
for a beginning organization token.- Return type:
- property lemma_: str#
Return the string lemma or text of the named entity if tagged as a named entity.
- property lexspan: LexicalSpan#
The document indexed lexical span using
idx
.
- property sent_i: int#
The index of the sentence to which the token belongs. This is not to be confused with the index of the token in the respective sentence, which is
FeatureToken.i_sent
. This attribute does not exist in a spaCy token, and was named as such to follow the naming conventions of their API.
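A short sketch contrasting the two indexes on a parsed FeatureDocument (assumed to be named doc):
for sent in doc.sents:
    for tok in sent.tokens:
        # sent_i: which sentence the token belongs to (document level)
        # i_sent: the token's position within that sentence
        print(tok.norm, tok.sent_i, tok.i_sent)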
- property shape: int#
Transform of the tokens’s string, to show orthographic features. For example, “Xxxx” or “d.
- property shape_: str#
Transform of the tokens’s string, to show orthographic features. For example, “Xxxx” or “d.
-
spacy_token:
Union
[Token
,Span
]# The parsed spaCy token (or span if entity) this feature set is based on.
- See:
FeatureDocument.spacy_doc()
- property token: Token#
Return the SpaCy token.