zensols.nlp package¶
Submodules¶
zensols.nlp.chunker module¶
Classes that segment text from FeatureDocument instances, but
retain the original structure by preserving sentence and token indices.
- class zensols.nlp.chunker.Chunker(doc, pattern, sub_doc=None, char_offset=None)[source]¶
- Bases: object - Splits TokenContainer instances using the regular expression pattern. Matched containers (the container implementation is based on the subclass) are given if used as an iterable. The document of all parsed containers is given if used as a callable. - __init__(doc, pattern, sub_doc=None, char_offset=None)¶
 - 
char_offset: int= None¶
- The 0-index absolute character offset where sub_doc starts. However, if the value is -1, then the offset is used as the beginning character offset of the first token in the sub_doc.
 - 
doc: FeatureDocument¶
- The document that contains the entire text (i.e. Note).
 - 
sub_doc: FeatureDocument= None¶
- A document created from a lexical span of doc, which defaults to the global document. Providing this and char_offset allows use of a document without having to use TokenContainer.reindex().
 
- class zensols.nlp.chunker.ListItemChunker(doc, pattern=re.compile('^((?:[0-9-+]+|[a-zA-Z]+:)[^\\\\n]+)$', re.MULTILINE), sub_doc=None, char_offset=None)[source]¶
- Bases: Chunker - A Chunker that splits list items and enumerated lists into separate sentences. Matched sentences are given if used as an iterable. This is useful when spaCy sentence chunks lists incorrectly; lists are found using a regular expression matching lines that start with a decimal or list characters such as - and +. - 
DEFAULT_SPAN_PATTERN: ClassVar[Pattern] = re.compile('^((?:[0-9-+]+|[a-zA-Z]+:)[^\\n]+)$', re.MULTILINE)¶
- The default list item regular expression, which uses an initial character item notation or an initial enumeration digit. 
 - __init__(doc, pattern=re.compile('^((?:[0-9-+]+|[a-zA-Z]+:)[^\\\\n]+)$', re.MULTILINE), sub_doc=None, char_offset=None)¶
 - 
pattern: Pattern= re.compile('^((?:[0-9-+]+|[a-zA-Z]+:)[^\\n]+)$', re.MULTILINE)¶
- The list item regular expression, which defaults to DEFAULT_SPAN_PATTERN.
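Example: a minimal usage sketch; the sample text and the use of FeatureDocumentParser.default_instance() are assumptions, not part of this class:

  from zensols.nlp.parser import FeatureDocumentParser
  from zensols.nlp.chunker import ListItemChunker

  # a parser configured from the package's resource library
  parser = FeatureDocumentParser.default_instance()
  doc = parser.parse('Todo:\n- buy apples\n- bake bread')
  # iterating gives each matched list item as a separate sentence
  for sent in ListItemChunker(doc):
      print(sent.text)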
 
- class zensols.nlp.chunker.ParagraphChunker(doc, pattern=re.compile('(.+?)(?:(?=\\\\n{2})|\\\\Z)', re.MULTILINE | re.DOTALL), sub_doc=None, char_offset=None)[source]¶
- Bases: Chunker - A Chunker that splits text into paragraphs. Matched paragraphs are given if used as an iterable. For this reason, this class will probably be used as an iterable, since clients will usually want just the separated paragraphs as documents. - 
DEFAULT_SPAN_PATTERN: ClassVar[Pattern] = re.compile('(.+?)(?:(?=\\n{2})|\\Z)', re.MULTILINE|re.DOTALL)¶
- The default paragraph regular expression, which uses two newline positive lookaheads to avoid matching on paragraph spacing. 
 - __init__(doc, pattern=re.compile('(.+?)(?:(?=\\\\n{2})|\\\\Z)', re.MULTILINE | re.DOTALL), sub_doc=None, char_offset=None)¶
 - 
pattern: Pattern= re.compile('(.+?)(?:(?=\\n{2})|\\Z)', re.MULTILINE|re.DOTALL)¶
- The paragraph regular expression, which defaults to DEFAULT_SPAN_PATTERN.
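Example: as with ListItemChunker, a short sketch (the text and parser setup are assumptions); iterating gives each separated paragraph:

  from zensols.nlp.parser import FeatureDocumentParser
  from zensols.nlp.chunker import ParagraphChunker

  parser = FeatureDocumentParser.default_instance()
  doc = parser.parse('First paragraph.\n\nSecond paragraph.')
  # each iterated item is a container covering one paragraph
  for para in ParagraphChunker(doc):
      print(repr(para.text))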
 
zensols.nlp.combine module¶
A class that combines features.
- class zensols.nlp.combine.CombinerFeatureDocumentParser(name, delegate, token_decorators=(), sentence_decorators=(), document_decorators=(), token_feature_ids=<factory>, silencer=None, source_parsers=None, validate_features=<factory>, yield_features=<factory>, yield_feature_defaults=None, overwrite_features=<factory>, overwrite_nones=False, map_features=<factory>)[source]¶
- Bases: DecoratedFeatureDocumentParser - A class that combines features from two FeatureDocumentParser instances. Features parsed using each source_parser are optionally copied or overwritten on a token-by-token basis in the feature document parsed by this instance. - The target tokens are sometimes added to or clobbered from the source, but not the other way around. - __init__(name, delegate, token_decorators=(), sentence_decorators=(), document_decorators=(), token_feature_ids=<factory>, silencer=None, source_parsers=None, validate_features=<factory>, yield_features=<factory>, yield_feature_defaults=None, overwrite_features=<factory>, overwrite_nones=False, map_features=<factory>)¶
 - 
map_features: List[Tuple[str,str,Any]]¶
- Like yield_features, but the feature ID can be different from the source to the target. Each tuple has the form: (<source feature ID>, <target feature ID>, <default for missing>)
 - 
overwrite_features: List[str]¶
- A list of features to be copied/overwritten in order given in the list. 
 - 
overwrite_nones: bool= False¶
- Whether to write None for missing overwrite_features. This always writes the target feature; if you only want to write when the source is not set or missing, then use yield_features.
 - parse(text, *args, **kwargs)[source]¶
- Parse text or a text as a list of sentences. - Parameters:
- text ( - str) – either a string or a list of strings; if the former a document with one sentence will be created, otherwise a document is returned with a sentence for each string in the list
- args – the arguments used to create the FeatureDocument instance 
- kwargs – the key word arguments used to create the FeatureDocument instance 
 
- Return type:
 
 - 
source_parsers: List[FeatureDocumentParser] = None¶
- The language resource used to parse documents and create token attributes. 
 - 
validate_features: Set[str]¶
- A set of features to compare across all tokens when copying. If any of the given features don’t match, a token mismatch error is raised.
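Example: an illustrative sketch only; target_parser and ner_parser are hypothetical, previously configured FeatureDocumentParser instances, and the chosen feature IDs are assumptions:

  from zensols.nlp.combine import CombinerFeatureDocumentParser

  # copy entity features parsed by `ner_parser` (hypothetical) onto the
  # tokens of the document parsed by `target_parser` (hypothetical)
  parser = CombinerFeatureDocumentParser(
      name='combiner',
      delegate=target_parser,
      source_parsers=[ner_parser],
      overwrite_features=['ent_', 'ent_iob_'])
  doc = parser.parse('Obama visited France.')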
 
- class zensols.nlp.combine.MappingCombinerFeatureDocumentParser(name, delegate, token_decorators=(), sentence_decorators=(), document_decorators=(), token_feature_ids=<factory>, silencer=None, source_parsers=None, validate_features=frozenset({'idx'}), yield_features=<factory>, yield_feature_defaults=None, overwrite_features=<factory>, overwrite_nones=False, map_features=<factory>, merge_sentences=True)[source]¶
- Bases: CombinerFeatureDocumentParser - Maps the source to respective tokens in the target document using spaCy artifacts. - __init__(name, delegate, token_decorators=(), sentence_decorators=(), document_decorators=(), token_feature_ids=<factory>, silencer=None, source_parsers=None, validate_features=frozenset({'idx'}), yield_features=<factory>, yield_feature_defaults=None, overwrite_features=<factory>, overwrite_nones=False, map_features=<factory>, merge_sentences=True)¶
 
zensols.nlp.component module¶
Components useful for reuse.
- class zensols.nlp.component.EntityRecognizer(nlp, name, import_file, patterns)[source]¶
- Bases: object - Base class for regular expression and spaCy match-pattern named entity recognizers. Both subclasses allow an optional label for each respective pattern or regular expression. If the label is provided, then the match is made a named entity with that label. In any case, a span is created on the token, and in some cases, retokenized. - __init__(nlp, name, import_file, patterns)¶
 - 
nlp: Language¶
- The NLP model. 
 
- class zensols.nlp.component.PatternEntityRecognizer(nlp, name, import_file, patterns)[source]¶
- Bases: EntityRecognizer - Adds entities based on spaCy match patterns. - See:
 - __init__(nlp, name, import_file, patterns)¶
 
- class zensols.nlp.component.RegexEntityRecognizer(nlp, name, import_file, patterns)[source]¶
- Bases: EntityRecognizer - Merges regular expression matches as a Span. After matches are found, re-tokenization merges them into one token per match. - __init__(nlp, name, import_file, patterns)¶
 
- class zensols.nlp.component.RegexSplitter(nlp, name, import_file, patterns)[source]¶
- Bases: EntityRecognizer - Splits on regular expressions. - __init__(nlp, name, import_file, patterns)¶
 
zensols.nlp.container module¶
Domain objects that define features associated with text.
- class zensols.nlp.container.FeatureDocument(sents, text=None, spacy_doc=None)[source]¶
- Bases: TokenContainer - A container class of tokens that make a document. This class contains a one-to-many relationship with sentences. However, it can be treated like any TokenContainer to fetch tokens. Instances of this class iterate over FeatureSentence instances. - Parameters:
- sents ( - Tuple[- FeatureSentence,- ...]) – the sentences defined for this document
 - _combine_documents(docs, cls, concat_tokens, **kwargs)[source]¶
- Override if there are any fields in your dataclass. In most cases, the only time this is called is by an embedding vectorizer to batch multiple sentences into a single document, so the only features that matter are at the sentence level. - Parameters:
- docs ( - Tuple[- FeatureDocument,- ...]) – the documents to combine in to one
- cls ( - Type[- FeatureDocument]) – the class of the instance to create
- concat_tokens ( - bool) – if True, each sentence of the returned document is the concatenated tokens of the respective document; otherwise, simply concatenate sentences into one document
- kwargs – additional keyword arguments to pass to the new feature document’s initializer 
 
- Return type:
 
 - 
EMPTY_DOCUMENT: ClassVar[FeatureDocument] = <>¶
- A zero length document. 
 - __init__(sents, text=None, spacy_doc=None)¶
 - clone(cls=None, **kwargs)[source]¶
- Parameters:
- kwargs – if copy_spacy is True, the spaCy document is copied to the clone in addition to parameters passed to the new clone's initializer
- Return type:
 
 - classmethod combine_documents(docs, concat_tokens=True, **kwargs)[source]¶
- Coerce a tuple of token containers (either documents or sentences) into one synthesized document. - Parameters:
- docs ( - Iterable[- FeatureDocument]) – the documents to combine in to one
- cls – the class of the instance to create 
- concat_tokens ( - bool) – if True, each sentence of the returned document is the concatenated tokens of the respective document; otherwise, simply concatenate sentences into one document
- kwargs – additional keyword arguments to pass to the new feature document’s initializer 
 
- Return type:
 
 - combine_sentences(sents=None)[source]¶
- Combine the sentences in this document into a new document with a single sentence. - Parameters:
- sents ( - Iterable[- FeatureSentence]) – the sentences to combine in the new document or all if- None
- Return type:
 
 - from_sentences(sents, deep=False)[source]¶
- Return a new cloned document using the given sentences. - Parameters:
- sents ( - Iterable[- FeatureSentence]) – the sentences to add to the new cloned document
- deep ( - bool) – whether or not to clone the sentences
 
- See:
- Return type:
 
 - get_overlapping_document(span, inclusive=True)[source]¶
- Get the portion of the document that overlaps - span. Sentences completely enclosed in a span are copied. Otherwise, new sentences are created from those tokens that overlap the span.- Parameters:
- span ( - LexicalSpan) – indicates the portion of the document to retain
- inclusive ( - bool) – whether to include +1 on the end component in the check
 
- Return type:
- Returns:
- a new document that contains the 0 index offset of - span
 
 - get_overlapping_sentences(span, inclusive=True)[source]¶
- Return sentences that overlaps with - spanfrom this document.- Parameters:
- span ( - LexicalSpan) – indicates the portion of the document to retain
- inclusive ( - bool) – whether to include +1 on the end component in the check
 
- Return type:
 
 - get_overlapping_span(span, inclusive=True)[source]¶
- Return a feature span that includes the lexical scope of - span.- Return type:
 
 - property max_sentence_len: int¶
- Return the length of tokens from the longest sentence in the document. 
 - sentence_index_for_token(token)[source]¶
- Return index of the parent sentence having - token.- Return type:
 
 - sentences_for_tokens(tokens)[source]¶
- Find sentences having a set of tokens. - Parameters:
- tokens ( - Tuple[- FeatureToken,- ...]) – the query used to find containing sentences
- Return type:
- Returns:
- the document ordered tuple of sentences containing tokens 
 
 - 
sents: Tuple[FeatureSentence,...]¶
- The sentences that make up the document. 
 - 
spacy_doc: Doc= None¶
- The parsed spaCy document this feature set is based on. As explained in FeatureToken, spaCy documents are heavyweight and problematic to pickle. For this reason, this attribute is dropped when pickled, and is only here for ad-hoc predictions.
 - to_sentence(limit=9223372036854775807, contiguous_i_sent=False, delim='')[source]¶
- Coerce this instance to a single sentence. No token data is updated, so FeatureToken.i_sent values keep their original indexes. These sentence indexes will be inconsistent when called on a FeatureDocument unless contiguous_i_sent is set to True. - Parameters:
- limit ( - int) – the max number of sentences to create (only starting kept)
- contiguous_i_sent ( - Union[- str,- bool]) – if True, ensures all tokens have a FeatureToken.i_sent value that is contiguous for the returned instance; if this value is reset, the token indices start from 0
- delim ( - str) – a string added between each constituent sentence
 
- Return type:
- Returns:
- an instance of - FeatureSentencethat represents this token sequence
 
 - token_iter(*args, **kwargs)[source]¶
- Return an iterator over the token features. - Parameters:
- args – the arguments given to - itertools.islice()
- Return type:
 
 - uncombine_sentences()[source]¶
- Reconstruct the sentence structure that we combined in - combine_sentences(). If that has not been done in this instance, then return- self.- Return type:
 
 - update_entity_spans(include_idx=True)[source]¶
- Update token entities to norm text. This is helpful when entities are embedded after splitting text, which becomes FeatureToken.norm values. However, the token spans still index the original entities that are multi-word, which leads to norms that are not equal to the text spans. This synchronizes the token span indexes with the norms. - Parameters:
- include_idx ( - bool) – whether to update- SpacyFeatureToken.idxas well
 
 - update_indexes()[source]¶
- Update all FeatureToken.i attributes to those provided by tokens_by_i. This corrects the many-to-one token index mapping for split multi-word named entities. - See:
- tokens_by_i
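Example: a short sketch of the document API (the parser setup and text are assumptions):

  from zensols.nlp.parser import FeatureDocumentParser
  from zensols.nlp.container import FeatureDocument

  parser = FeatureDocumentParser.default_instance()
  doc = parser.parse('The first sentence.  The second sentence.')
  # merge both sentences into a single-sentence document
  single = doc.combine_sentences()
  # coerce several documents into one synthesized document
  merged = FeatureDocument.combine_documents([doc, single])
  print(len(doc.sents), len(single.sents))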
 
 
- class zensols.nlp.container.FeatureSentence(tokens, text=None, spacy_span=None)[source]¶
- Bases: FeatureSpan - A container class of tokens that make a sentence. Instances of this class iterate over FeatureToken instances, and can create documents with to_document(). - 
EMPTY_SENTENCE: ClassVar[FeatureSentence] = <>¶
 - __init__(tokens, text=None, spacy_span=None)¶
 - get_overlapping_span(span, inclusive=True)[source]¶
- Return a feature span that includes the lexical scope of - span.- Return type:
 
 - to_sentence(limit=9223372036854775807, contiguous_i_sent=False, delim='')[source]¶
- Coerce this instance to a single sentence. No token data is updated, so FeatureToken.i_sent values keep their original indexes. These sentence indexes will be inconsistent when called on a FeatureDocument unless contiguous_i_sent is set to True. - Parameters:
- limit ( - int) – the max number of sentences to create (only starting kept)
- contiguous_i_sent ( - Union[- str,- bool]) – if True, ensures all tokens have a FeatureToken.i_sent value that is contiguous for the returned instance; if this value is reset, the token indices start from 0
- delim ( - str) – a string added between each constituent sentence
 
- Return type:
- Returns:
- an instance of - FeatureSentencethat represents this token sequence
 
 
- class zensols.nlp.container.FeatureSpan(tokens, text=None, spacy_span=None)[source]¶
- Bases: TokenContainer - A span of tokens as a TokenContainer, much like spacy.tokens.Span. - __init__(tokens, text=None, spacy_span=None)¶
 - clone(cls=None, **kwargs)[source]¶
- Clone an instance of this token container. - Parameters:
- cls ( - Type[- TokenContainer]) – the type of the new instance
- kwargs – arguments to add as attributes to the clone
 
- Return type:
- Returns:
- the cloned instance of this instance 
 
 - property dependency_tree: Dict[FeatureToken, List[Dict[FeatureToken]]]¶
 - 
spacy_span: Span= None¶
- The parsed spaCy span this feature set is based on. - See:
- FeatureDocument.spacy_doc()
 
 - to_sentence(limit=9223372036854775807, contiguous_i_sent=False, delim='')[source]¶
- Coerce this instance to a single sentence. No token data is updated, so FeatureToken.i_sent values keep their original indexes. These sentence indexes will be inconsistent when called on a FeatureDocument unless contiguous_i_sent is set to True. - Parameters:
- limit ( - int) – the max number of sentences to create (only starting kept)
- contiguous_i_sent ( - Union[- str,- bool]) – if True, ensures all tokens have a FeatureToken.i_sent value that is contiguous for the returned instance; if this value is reset, the token indices start from 0
- delim ( - str) – a string added between each constituent sentence
 
- Return type:
- Returns:
- an instance of - FeatureSentencethat represents this token sequence
 
 - token_iter(*args, **kwargs)[source]¶
- Return an iterator over the token features. - Parameters:
- args – the arguments given to - itertools.islice()
- Return type:
 
 - property tokens: Tuple[FeatureToken, ...]¶
- The tokens that make up the span. 
 - property tokens_by_i_sent: Dict[int, FeatureToken]¶
- A map of tokens with keys as their sentence-level position offset and values as tokens. - See:
- zensols.nlp.FeatureToken.i
 
 - update_entity_spans(include_idx=True)[source]¶
- Update token entities to norm text. This is helpful when entities are embedded after splitting text, which becomes FeatureToken.norm values. However, the token spans still index the original entities that are multi-word, which leads to norms that are not equal to the text spans. This synchronizes the token span indexes with the norms. - Parameters:
- include_idx ( - bool) – whether to update- SpacyFeatureToken.idxas well
 
 - update_indexes()[source]¶
- Update all FeatureToken.i attributes to those provided by tokens_by_i. This corrects the many-to-one token index mapping for split multi-word named entities. - See:
- tokens_by_i
 
 
- class zensols.nlp.container.TokenAnnotatedFeatureDocument(sents, text=None, spacy_doc=None)[source]¶
- Bases: FeatureDocument - A feature document that contains token annotations. Sentences can be modeled with TokenAnnotatedFeatureSentence or just FeatureSentence since this sets the annotations attribute when combining. - __init__(sents, text=None, spacy_doc=None)¶
 - combine_sentences(**kwargs) FeatureDocument¶
- Combine all the sentences in this document into a new document with a single sentence. - Return type:
- FeatureDocument 
 
 
- class zensols.nlp.container.TokenAnnotatedFeatureSentence(tokens, text=None, spacy_span=None, annotations=())[source]¶
- Bases: FeatureSentence - A feature sentence that contains token annotations. - __init__(tokens, text=None, spacy_span=None, annotations=())¶
 - write(depth=0, writer=<_io.TextIOWrapper name='<stdout>' mode='w' encoding='utf-8'>, **kwargs)[source]¶
- Write the text container. - Parameters:
- include_original – whether to include the original text 
- include_normalized – whether to include the normalized text 
- n_tokens – the number of tokens to write 
- inline – whether to print the tokens on one line each 
 
 
 
- class zensols.nlp.container.TokenContainer[source]¶
- Bases: PersistableContainer, TextContainer - A base class for token container classes such as FeatureSentence and FeatureDocument. In addition to the defined methods, each instance has a text attribute, which is the original text of the document. - property canonical: str¶
- A canonical representation of the container, which are non-space tokens separated by - CANONICAL_DELIMITER.
 - clone(cls=None, **kwargs)[source]¶
- Clone an instance of this token container. - Parameters:
- cls ( - Type[- TokenContainer]) – the type of the new instance
- kwargs – arguments to add as attributes to the clone
 
- Return type:
- Returns:
- the cloned instance of this instance 
 
 - property entities: Tuple[FeatureSpan, ...]¶
- The named entities of the container with each multi-word entity as elements. 
 - get_overlapping_span(span, inclusive=True)[source]¶
- Return a feature span that includes the lexical scope of - span.- Return type:
 
 - get_overlapping_tokens(span, inclusive=True)[source]¶
- Get all tokens that overlap lexical span - span.- Parameters:
- span ( - LexicalSpan) – the document 0-index character based inclusive span to compare with- FeatureToken.lexspan
- inclusive ( - bool) – whether to include +1 on the end component in the check
 
- Return type:
- Returns:
- a token sequence containing the 0 index offset of - span
 
 - property lexspan: LexicalSpan¶
- The document indexed lexical span using - idx.
 - map_overlapping_tokens(spans, inclusive=True)[source]¶
- Return a tuple of tokens, each tuple in the range given by the respective span in - spans.- Parameters:
- spans ( - Iterable[- LexicalSpan]) – the document 0-index character based inclusive spans to compare with- FeatureToken.lexspan
- inclusive ( - bool) – whether to include +1 on the end component in the check
 
- Return type:
- Returns:
- a tuple of matching tokens for the respective - spanquery
 
 - property norm_orth: str¶
- The normalized version of the sentence using the original rather than the token normalized text.
 - reindex(reference_token=None)[source]¶
- Re-index tokens, which is useful for situations where a 0-index offset is assumed for sub-documents created with FeatureDocument.get_overlapping_document() or FeatureDocument.get_overlapping_sentences(). The following data are modified:
- FeatureToken.sent_i(see- SpacyFeatureToken.sent_i)
- FeatureToken.lexspan(see- SpacyFeatureToken.lexspan)
 
 - set_entity_offsets(offsets)[source]¶
- Set entities as a sequence of non-inclusive character offsets of (<begin>, <end>).
 - strip(in_place=True)[source]¶
- Strip beginning and ending whitespace (see - strip_tokens()) and- text.- Return type:
 
 - strip_token_iter(*args, **kwargs)[source]¶
- Strip beginning and ending whitespace (see - strip_tokens()) using- token_iter().- Return type:
 
 - static strip_tokens(token_iter)[source]¶
- Strip beginning and ending whitespace. This uses - is_space, which is- Truefor spaces, tabs and newlines.- Parameters:
- token_iter ( - Iterable[- FeatureToken]) – a stream of tokens
- Return type:
- Returns:
- non-whitespace middle tokens 
 
 - abstract to_document(limit=9223372036854775807)[source]¶
- Coerce this instance into a document. - Return type:
 
 - abstract to_sentence(limit=9223372036854775807, contiguous_i_sent=False, delim='')[source]¶
- Coerce this instance to a single sentence. No token data is updated, so FeatureToken.i_sent values keep their original indexes. These sentence indexes will be inconsistent when called on a FeatureDocument unless contiguous_i_sent is set to True. - Parameters:
- limit ( - int) – the max number of sentences to create (only starting kept)
- contiguous_i_sent ( - Union[- str,- bool]) – if True, ensures all tokens have a FeatureToken.i_sent value that is contiguous for the returned instance; if this value is reset, the token indices start from 0
- delim ( - str) – a string added between each constituent sentence
 
- Return type:
- Returns:
- an instance of - FeatureSentencethat represents this token sequence
 
 - abstract token_iter(*args, **kwargs)[source]¶
- Return an iterator over the token features. - Parameters:
- args – the arguments given to - itertools.islice()
- Return type:
 
 - property tokens: Tuple[FeatureToken, ...]¶
- Return the token features as a tuple. 
 - property tokens_by_i: Dict[int, FeatureToken]¶
- A map of tokens with keys as their position offset and values as tokens. The entries also include named entity tokens that are grouped as multi-word tokens. This is helpful for multi-word entities that were split (for example with - SplitTokenMapper), and thus, have many-to-one mapped indexes.- See:
- zensols.nlp.FeatureToken.i
 
 - property tokens_by_idx: Dict[int, FeatureToken]¶
- A map of tokens with keys as their character offset and values as tokens. - Limitations: Multi-word entities will have a mapping only for the first word of that entity if tokens were split by spaces (for example with SplitTokenMapper). However, tokens_by_i does not have this limitation. - See:
- tokens_by_i
- See:
- zensols.nlp.FeatureToken.idx
 
 - abstract update_entity_spans(include_idx=True)[source]¶
- Update token entities to norm text. This is helpful when entities are embedded after splitting text, which becomes FeatureToken.norm values. However, the token spans still index the original entities that are multi-word, which leads to norms that are not equal to the text spans. This synchronizes the token span indexes with the norms. - Parameters:
- include_idx ( - bool) – whether to update- SpacyFeatureToken.idxas well
 
 - update_indexes()[source]¶
- Update all FeatureToken.i attributes to those provided by tokens_by_i. This corrects the many-to-one token index mapping for split multi-word named entities. - See:
 
 - write(depth=0, writer=<_io.TextIOWrapper name='<stdout>' mode='w' encoding='utf-8'>, include_original=False, include_normalized=True, n_tokens=9223372036854775807, inline=False, feature_ids=None)[source]¶
- Write the text container. 
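Example: a sketch of a few container accessors; the text, span offsets and parser setup are assumptions:

  from zensols.nlp.parser import FeatureDocumentParser
  from zensols.nlp.domain import LexicalSpan

  parser = FeatureDocumentParser.default_instance()
  doc = parser.parse('The quick brown fox jumped.')
  # a 0-index character span (begin inclusive, end exclusive)
  span = LexicalSpan(4, 15)
  for tok in doc.get_overlapping_tokens(span):
      print(tok.norm, tok.lexspan)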
 
zensols.nlp.dataframe module¶
zensols.nlp.decorate module¶
Contains useful classes for decorating feature sentences.
- class zensols.nlp.decorate.CopyFeatureTokenContainerDecorator(feature_ids)[source]¶
- Bases: FeatureTokenContainerDecorator - Copies feature(s) for each token in the container. For each token, each source/target tuple pair in feature_ids is copied. If the feature is missing (this does not include existing FeatureToken.NONE values), an exception is raised. - __init__(feature_ids)¶
 
- class zensols.nlp.decorate.FilterEmptySentenceDocumentDecorator(filter_space=True)[source]¶
- Bases: FeatureDocumentDecorator - Filter zero-length sentences. - __init__(filter_space=True)¶
 
- class zensols.nlp.decorate.FilterTokenSentenceDecorator(remove_stop=False, remove_space=False, remove_pronouns=False, remove_punctuation=False, remove_determiners=False, remove_empty=False)[source]¶
- Bases: FeatureSentenceDecorator - A decorator that filters tokens from sentences (see the remove_* fields for which token types are removed). - __init__(remove_stop=False, remove_space=False, remove_pronouns=False, remove_punctuation=False, remove_determiners=False, remove_empty=False)¶
 
- class zensols.nlp.decorate.RemoveFeatureTokenContainerDecorator(exclude_feature_ids)[source]¶
- Bases: FeatureTokenContainerDecorator - Removes features from each token in the container. - __init__(exclude_feature_ids)¶
 
- class zensols.nlp.decorate.SplitTokenSentenceDecorator[source]¶
- Bases: FeatureSentenceDecorator - A decorator that splits feature tokens by whitespace. - __init__()¶
 
- class zensols.nlp.decorate.StripTokenContainerDecorator[source]¶
- Bases: FeatureTokenContainerDecorator - A decorator that strips whitespace from sentences (or any TokenContainer). - __init__()¶
 
- class zensols.nlp.decorate.UpdateTokenContainerDecorator(update_indexes=True, update_entity_spans=True, reindex=False)[source]¶
- Bases: FeatureTokenContainerDecorator - Updates document indexes and spans (see fields). - __init__(update_indexes=True, update_entity_spans=True, reindex=False)¶
 - 
update_entity_spans: bool= True¶
- Whether to update the document entity spans with FeatureDocument.update_entity_spans().
 - 
update_indexes: bool= True¶
- Whether to update the document indexes with FeatureDocument.update_indexes().
 
zensols.nlp.domain module¶
Interfaces, contracts and errors.
- class zensols.nlp.domain.LexicalSpan(begin, end)[source]¶
- Bases: Dictable - A lexical character span of text in a document. The span has two positions: begin and end, each of which is indexed as an operator as well. The left (begin) is inclusive and the right (end) is exclusive to conform to Python array slicing conventions. - One span is less than the other when its beginning position is less. When the beginning positions are the same, the one with the smaller end position is less. - The length of the span is the distance between the end and the beginning positions. - 
EMPTY_SPAN: ClassVar[LexicalSpan] = (0, 0)¶
- The span - (0, 0).
 - static gaps(spans, end=None)[source]¶
- Return the spans for the “holes” in - spans. For example, if- spansis- ((0, 5), (10, 12), (15, 17)), then return- ((5, 10), (12, 15)).- Parameters:
- spans ( - Iterable[- LexicalSpan]) – the spans used to find gaps
- end ( - Optional[- int]) – an end position for the last gap; if the end of the last item in spans does not match, another gap is added
 
- Return type:
- Returns:
- a list of spans that “fill” any holes in - spans
 
 - narrow(other)[source]¶
- Return the shortest span that inclusively fits in both this and - other.- Parameters:
- other ( - LexicalSpan) – the second span to narrow with this span
- Returns:
- a span so that the beginning is maximized and the end is minimized, or None if the two spans do not overlap
- Return type:
 
 - static overlaps(a0, a1, b0, b1, inclusive=True)[source]¶
- Return whether or not one text span overlaps with another. - Parameters:
- inclusive ( - bool) – whether to include +1 on the end component in the check
- Returns:
- any overlap detected returns - True
 
 - overlaps_with(other, inclusive=True)[source]¶
- Return whether or not one text span overlaps with another. - Parameters:
- other ( - LexicalSpan) – the other location
- inclusive ( - bool) – whether to include +1 on the end component in the check
 
- Return type:
- Returns:
- any overlap detected returns - True
 
 - static widen(others)[source]¶
- Take the span union by using the left-most begin and the right-most end. - Parameters:
- others ( - Iterable[- LexicalSpan]) – the spans to union
- Return type:
- Returns:
- the widest span that inclusively aggregates - others, or None if an empty sequence is passed
 
 - write(depth=0, writer=<_io.TextIOWrapper name='<stdout>' mode='w' encoding='utf-8'>)[source]¶
- Write this instance as either a - Writableor as a- Dictable. If class attribute- _DICTABLE_WRITABLE_DESCENDANTSis set as- True, then use the- write()method on children instead of writing the generated dictionary. Otherwise, write this instance by first creating a- dictrecursively using- asdict(), then formatting the output.- If the attribute - _DICTABLE_WRITE_EXCLUDESis set, those attributes are removed from what is written in the- write()method.- Note that this attribute will need to be set in all descendants in the instance hierarchy since writing the object instance graph is done recursively. - Parameters:
- depth ( - int) – the starting indentation depth
- writer ( - TextIOBase) – the writer to dump the content of this writable
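Example: a small sketch of the span operations documented above; the printed values follow the docstrings:

  from zensols.nlp.domain import LexicalSpan

  a = LexicalSpan(0, 5)
  b = LexicalSpan(3, 10)
  print(a.overlaps_with(b))         # True: [0, 5) and [3, 10) overlap
  print(a.narrow(b))                # (3, 5): maximized begin, minimized end
  print(LexicalSpan.widen((a, b)))  # (0, 10): left-most begin, right-most end
  # gaps finds the 'holes' between spans
  print(LexicalSpan.gaps((LexicalSpan(0, 5), LexicalSpan(10, 12))))  # [(5, 10)]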
 
 
 
- exception zensols.nlp.domain.MissingFeatureError(token, feature_id, msg=None)[source]¶
- Bases: NLPError - Raised on attempting to access a non-existent feature in FeatureToken. - __init__(token, feature_id, msg=None)[source]¶
- Initialize. - Parameters:
- token ( - FeatureToken) – the token for which access was attempted
- feature_id ( - str) – the feature_id that is missing in- token
 
 
 - __module__ = 'zensols.nlp.domain'¶
 
- exception zensols.nlp.domain.NLPError[source]¶
- Bases: APIError - Raised for any errors from this library. - __annotations__ = {}¶
 - __module__ = 'zensols.nlp.domain'¶
 
- exception zensols.nlp.domain.ParseError[source]¶
- Bases: APIError - Raised for any parsing errors. - __annotations__ = {}¶
 - __module__ = 'zensols.nlp.domain'¶
 
- class zensols.nlp.domain.TextContainer[source]¶
- Bases: Dictable - A writable class that has a text property or attribute. All subclasses need a norm attribute or property. - write(depth=0, writer=<_io.TextIOWrapper name='<stdout>' mode='w' encoding='utf-8'>, include_original=True, include_normalized=True)[source]¶
- Write this instance as either a - Writableor as a- Dictable. If class attribute- _DICTABLE_WRITABLE_DESCENDANTSis set as- True, then use the- write()method on children instead of writing the generated dictionary. Otherwise, write this instance by first creating a- dictrecursively using- asdict(), then formatting the output.- If the attribute - _DICTABLE_WRITE_EXCLUDESis set, those attributes are removed from what is written in the- write()method.- Note that this attribute will need to be set in all descendants in the instance hierarchy since writing the object instance graph is done recursively. - Parameters:
- depth ( - int) – the starting indentation depth
- writer ( - TextIOBase) – the writer to dump the content of this writable
 
 
 
zensols.nlp.index module¶
A heuristic text indexing and search class.
- class zensols.nlp.index.FeatureDocumentIndexer(doc)[source]¶
- Bases: object - A utility class that indexes and searches for text in potentially whitespace-mangled documents. It does this by trying more efficient means first, then resorts to methods that are more computationally expensive. - __init__(doc)¶
 - 
doc: FeatureDocument¶
- The document to index. 
 - property doc_tok_orths: Tuple[Tuple[str, FeatureToken], ...]¶
- Return tuples of (<orthographic text>, <token>).
 - find(query, sent_ix=None)[source]¶
- Find a sentence in document - doc. If a sentence index is given, it treats the query as a sentence to find in- doc.- Parameters:
- query ( - TokenContainer) – the sentence to find in- doc
- sent_ix ( - int) – the sentence index hint if available
 
- Return type:
- Returns:
- the matched text from - doc
 
 - property pack2ix: Dict[int, int]¶
- Return a dictionary of character positions in the document ( - doc) text to respective positions in the same string without whitespace.
 - property text2sent: Dict[str, FeatureSentence]¶
- Return a dictionary of sentence normalized text to respective sentence in - doc.
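Example: a usage sketch; the text, parser setup, the sentence query and the None check on the result are assumptions:

  from zensols.nlp.parser import FeatureDocumentParser
  from zensols.nlp.index import FeatureDocumentIndexer

  parser = FeatureDocumentParser.default_instance()
  doc = parser.parse('A first sentence.  A second sentence with   odd spacing.')
  # the sentence to locate in the (possibly whitespace mangled) document
  query = parser.parse('A second sentence with odd spacing.').sents[0]
  indexer = FeatureDocumentIndexer(doc=doc)
  match = indexer.find(query)
  if match is not None:
      print(match.text)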
 
zensols.nlp.nerscore module¶
Wraps the SemEval-2013 Task 9.1 NER evaluation API as a
ScoreMethod.
From the David Batista blog post:
SemEval-2013 introduced four different ways to measure precision/recall/f1-score results based on the metrics defined by MUC:
Strict: exact boundary surface string match and entity type
Exact: exact boundary match over the surface string, regardless of the type
Partial: partial boundary match over the surface string, regardless of the type
Type: some overlap between the system tagged entity and the gold annotation is required
Each of these ways to measure the performance accounts for correct, incorrect, partial, missed and spurious in different ways. Let’s look in detail and see how each of the metrics defined by MUC falls into each of the scenarios described above.
- class zensols.nlp.nerscore.SemEvalHarmonicMeanScore(precision, recall, f_score, correct, incorrect, partial, missed, spurious, possible, actual)[source]¶
- Bases: HarmonicMeanScore - A harmonic mean score with the additional SemEval computed scores (see module zensols.nlp.nerscore docs). - 
NAN_INSTANCE: ClassVar[SemEvalHarmonicMeanScore] = SemEvalHarmonicMeanScore(precision=nan, recall=nan, f_score=nan, correct=nan, incorrect=nan, partial=nan, missed=nan, spurious=nan, possible=nan, actual=nan)¶
- Used to add to ErrorScore for harmonic means replacements. 
 - __init__(precision, recall, f_score, correct, incorrect, partial, missed, spurious, possible, actual)¶
 - 
incorrect: int¶
- The number of incorrect (INC): the output of a system and the golden annotation don’t match.
 
 
- class zensols.nlp.nerscore.SemEvalScore(strict, exact, partial, ent_type)[source]¶
- Bases: Score - Contains all four harmonic mean SemEval scores (see module zensols.nlp.nerscore docs). This score has four harmonic means providing various levels of accuracy. - 
NAN_INSTANCE: ClassVar[SemEvalScore] = SemEvalScore(strict=SemEvalHarmonicMeanScore(precision=nan, recall=nan, f_score=nan, correct=nan, incorrect=nan, partial=nan, missed=nan, spurious=nan, possible=nan, actual=nan), exact=SemEvalHarmonicMeanScore(precision=nan, recall=nan, f_score=nan, correct=nan, incorrect=nan, partial=nan, missed=nan, spurious=nan, possible=nan, actual=nan), partial=SemEvalHarmonicMeanScore(precision=nan, recall=nan, f_score=nan, correct=nan, incorrect=nan, partial=nan, missed=nan, spurious=nan, possible=nan, actual=nan), ent_type=SemEvalHarmonicMeanScore(precision=nan, recall=nan, f_score=nan, correct=nan, incorrect=nan, partial=nan, missed=nan, spurious=nan, possible=nan, actual=nan))¶
 - __init__(strict, exact, partial, ent_type)¶
 - 
ent_type: SemEvalHarmonicMeanScore¶
- Some overlap between the system tagged entity and the gold annotation is required. 
 - 
exact: SemEvalHarmonicMeanScore¶
- Exact boundary match over the surface string, regardless of the type. 
 - 
partial: SemEvalHarmonicMeanScore¶
- Partial boundary match over the surface string, regardless of the type. 
 - 
strict: SemEvalHarmonicMeanScore¶
- Exact boundary surface string match and entity type. 
 
- class zensols.nlp.nerscore.SemEvalScoreMethod(reverse_sents=False, labels=None)[source]¶
- Bases: ScoreMethod - A SemEval-2013 Task 9.1 score (see module zensols.nlp.nerscore docs). This score has four harmonic means providing various levels of accuracy. Sentence pairs are ordered as (<gold>, <prediction>). - __init__(reverse_sents=False, labels=None)¶
 
zensols.nlp.norm module¶
Normalize text and map spaCy documents.
- class zensols.nlp.norm.FilterRegularExpressionMapper(regex='[ ]+', invert=False)[source]¶
- Bases: TokenMapper - Filter tokens based on a regular expression over the normalized form. - __init__(regex='[ ]+', invert=False)¶
 
- class zensols.nlp.norm.FilterTokenMapper(remove_stop=False, remove_space=False, remove_pronouns=False, remove_punctuation=False, remove_determiners=False)[source]¶
- Bases: TokenMapper - Filter tokens based on token (spaCy) attributes. - Configuration example:

  [filter_token_mapper]
  class_name = zensols.nlp.FilterTokenMapper
  remove_stop = True
  remove_punctuation = True

- __init__(remove_stop=False, remove_space=False, remove_pronouns=False, remove_punctuation=False, remove_determiners=False)¶
 
- class zensols.nlp.norm.JoinTokenMapper(regex='[ ]', separator=None)[source]¶
- Bases: object - Join tokens based on a regular expression. It does this by creating spans in the spaCy component (first in the tuple) and using the span text as the normalized token. - __init__(regex='[ ]', separator=None)¶
 
- class zensols.nlp.norm.LambdaTokenMapper(add_lambda=None, map_lambda=None)[source]¶
- Bases: TokenMapper - Use a lambda expression to map a token tuple. - This is handy for specialized behavior that can be added directly to a configuration file. - Configuration example:

  [lc_lambda_token_mapper]
  class_name = zensols.nlp.LambdaTokenMapper
  map_lambda = lambda x: (x[0], f'<{x[1].lower()}>')

- __init__(add_lambda=None, map_lambda=None)¶
 
- class zensols.nlp.norm.LemmatizeTokenMapper(lemmatize=True, remove_first_stop=False)[source]¶
- Bases: TokenMapper - Lemmatize tokens and optionally remove entity stop words. - Important: This completely ignores the normalized input token string and essentially just replaces it with the lemma found in the token instance. - Configuration example:

  [lemma_token_mapper]
  class_name = zensols.nlp.LemmatizeTokenMapper

- Parameters:
 - __init__(lemmatize=True, remove_first_stop=False)¶
 
- class zensols.nlp.norm.MapTokenNormalizer(embed_entities=True, config_factory=None, mapper_class_list=<factory>)[source]¶
- Bases: TokenNormalizer - A normalizer that applies a sequence of TokenMapper instances to transform the normalized token text. The members of the mapper_class_list are sections of the application configuration. - Configuration example:

  [map_filter_token_normalizer]
  class_name = zensols.nlp.MapTokenNormalizer
  mapper_class_list = list: filter_token_mapper

- __init__(embed_entities=True, config_factory=None, mapper_class_list=<factory>)¶
 - 
config_factory: ConfigFactory= None¶
- The factory that created this instance and used to create the mappers. 
 
- class zensols.nlp.norm.SplitEntityTokenMapper(token_unit_type=False, copy_attributes=('label', 'label_'))[source]¶
- Bases: TokenMapper - Splits embedded entities (or any Span) into separate tokens. This is useful for splitting up entities as tokens after being grouped with TokenNormalizer.embed_entities. Note, embed_entities must be True to create the entities as they come from spaCy as spans. This then can be used to create SpacyFeatureToken instances with spans that have the entity. - __init__(token_unit_type=False, copy_attributes=('label', 'label_'))¶
 
- class zensols.nlp.norm.SplitTokenMapper(regex='[ ]')[source]¶
- Bases: TokenMapper - Splits the normalized text on a per-token basis with a regular expression. - Configuration example:

  [split_token_mapper]
  class_name = zensols.nlp.SplitTokenMapper
  regex = r'[ ]'

- __init__(regex='[ ]')¶
 
- class zensols.nlp.norm.SubstituteTokenMapper(regex='', replace_char='')[source]¶
- Bases: TokenMapper - Replace a regular expression in normalized token text. - Configuration example:

  [subs_token_mapper]
  class_name = zensols.nlp.SubstituteTokenMapper
  regex = r'[ \t]'
  replace_char = _

- __init__(regex='', replace_char='')¶
 
- class zensols.nlp.norm.TokenMapper[source]¶
- Bases: ABC - Abstract class used to transform token tuples generated from TokenNormalizer.normalize(). - __init__()¶
 
- class zensols.nlp.norm.TokenNormalizer(embed_entities=True)[source]¶
- Bases: object - Base token extractor that returns tuples of tokens and their normalized version. - Configuration example:

  [default_token_normalizer]
  class_name = zensols.nlp.TokenNormalizer
  embed_entities = False

- __init__(embed_entities=True)¶
 
zensols.nlp.parser module¶
Parse documents and generate features in an organized taxonomy.
- class zensols.nlp.parser.CachingFeatureDocumentParser(name, delegate, token_decorators=(), sentence_decorators=(), document_decorators=(), token_feature_ids=<factory>, silencer=None, stash=None, hasher=<factory>)[source]¶
- Bases: DecoratedFeatureDocumentParser - A document parser that persists previous parses using the hash of the text as a key. Caching is optional given the value of stash, which is useful in cases where this class is extended for use cases other than caching. - __init__(name, delegate, token_decorators=(), sentence_decorators=(), document_decorators=(), token_feature_ids=<factory>, silencer=None, stash=None, hasher=<factory>)¶
 - parse(text, *args, **kwargs)[source]¶
- Parse text or a text as a list of sentences. - Parameters:
- text ( - str) – either a string or a list of strings; if the former a document with one sentence will be created, otherwise a document is returned with a sentence for each string in the list
- args – the arguments used to create the FeatureDocument instance 
- kwargs – the key word arguments used to create the FeatureDocument instance 
 
- Return type:
 
 
- class zensols.nlp.parser.Component(name, pipe_name=None, pipe_config=None, pipe_add_kwargs=<factory>, modules=(), initializers=())[source]¶
- Bases: object - A pipeline component to be added to the spaCy model. - __init__(name, pipe_name=None, pipe_config=None, pipe_add_kwargs=<factory>, modules=(), initializers=())¶
 - init(model, parser)[source]¶
- Initialize the component and add it to the NLP pipeline. This base class implementation loads the modules, then calls Language.add_pipe(). - Parameters:
- model ( - Language) – the model to add the component to (nlp in spaCy parlance)
- parser ( - FeatureDocumentParser) – the owning parser of this component instance
 
 
 - 
initializers: Tuple[ComponentInitializer,...] = ()¶
- Instances to initialize upon this object’s initialization. 
 - 
modules: Sequence[str] = ()¶
- The modules to import before adding component pipelines. This will register components mentioned in components when the respective module is loaded.
 
- class zensols.nlp.parser.ComponentInitializer[source]¶
- Bases: ABC - Called by Component to do post-spaCy initialization.
- class zensols.nlp.parser.DecoratedFeatureDocumentParser(name, delegate, token_decorators=(), sentence_decorators=(), document_decorators=(), token_feature_ids=<factory>, silencer=None)[source]¶
- Bases: FeatureDocumentParser - This class adapts FeatureDocumentParser instances to the general case using a GoF decorator pattern. This is useful for any post-processing needed on existing configured document parsers. - All decorators are processed in the following order:
- Token 
- Sentence 
- Document 
 
 - Token features are stored in the delegate for those that have them. Otherwise, they are stored in instances of this class. - __init__(name, delegate, token_decorators=(), sentence_decorators=(), document_decorators=(), token_feature_ids=<factory>, silencer=None)¶
 - 
delegate: FeatureDocumentParser¶
- Used to create the feature documents. 
 - 
document_decorators: Sequence[FeatureDocumentDecorator] = ()¶
- A list of decorators that can add, remove or modify features on a document. 
 - 
name: str¶
- The name of the parser, which is taken from the section name when created with a - ConfigFactoryand used for debugging.
 - parse(text, *args, **kwargs)[source]¶
- Parse text or a text as a list of sentences. - Parameters:
- text ( - str) – either a string or a list of strings; if the former a document with one sentence will be created, otherwise a document is returned with a sentence for each string in the list
- args – the arguments used to create the FeatureDocument instance 
- kwargs – the key word arguments used to create the FeatureDocument instance 
 
- Return type:
 
 - 
sentence_decorators: Sequence[FeatureSentenceDecorator] = ()¶
- A list of decorators that can add, remove or modify features on a sentence. 
 - 
silencer: WarningSilencer= None¶
- Optionally suppress warnings the parser generates.
 - 
token_decorators: Sequence[FeatureTokenDecorator] = ()¶
- A list of decorators that can add, remove or modify features on a token. 
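Example: a sketch that wraps a delegate parser with a sentence decorator from zensols.nlp.decorate; the names and decorator choice are assumptions:

  from zensols.nlp.parser import (
      FeatureDocumentParser, DecoratedFeatureDocumentParser)
  from zensols.nlp.decorate import FilterTokenSentenceDecorator

  delegate = FeatureDocumentParser.default_instance()
  parser = DecoratedFeatureDocumentParser(
      name='filtered',
      delegate=delegate,
      # drop whitespace-only tokens from each parsed sentence
      sentence_decorators=(FilterTokenSentenceDecorator(remove_space=True),))
  doc = parser.parse('A  sentence   with extra   spaces.')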
 
- class zensols.nlp.parser.FeatureDocumentDecorator[source]¶
- Bases: - FeatureTokenContainerDecorator- Implementations can add, remove or modify features on a document. 
- class zensols.nlp.parser.FeatureDocumentParser[source]¶
- Bases: PersistableContainer, Dictable - This class parses text into FeatureDocument instances using parse(). - 
TOKEN_FEATURE_IDS: ClassVar[Set[str]] = frozenset({'children', 'dep', 'dep_', 'ent', 'ent_', 'ent_iob', 'ent_iob_', 'i', 'i_sent', 'idx', 'is_contraction', 'is_ent', 'is_pronoun', 'is_punctuation', 'is_space', 'is_stop', 'is_superlative', 'is_wh', 'lemma_', 'lexspan', 'norm', 'norm_len', 'pos_', 'sent_i', 'shape', 'shape_', 'tag', 'tag_'})¶
- The default value for - token_feature_ids.
 - __init__()¶
 - static default_instance()[source]¶
- Create the parser as configured in the resource library of the package. - Return type:
 
 - abstract parse(text, *args, **kwargs)[source]¶
- Parse text or a text as a list of sentences. - Parameters:
- text ( - str) – either a string or a list of strings; if the former a document with one sentence will be created, otherwise a document is returned with a sentence for each string in the list
- args – the arguments used to create the FeatureDocument instance 
- kwargs – the key word arguments used to create the FeatureDocument instance 
 
- Return type:
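Example: a minimal parse sketch using default_instance(); the text and the printed feature IDs are assumptions:

  from zensols.nlp.parser import FeatureDocumentParser

  parser = FeatureDocumentParser.default_instance()
  doc = parser.parse('Obama was born in Hawaii.  He was president.')
  for tok in doc.token_iter():
      # norm and pos_ are among the default token feature IDs
      print(tok.norm, tok.pos_)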
 
 
- class zensols.nlp.parser.FeatureSentenceDecorator[source]¶
- Bases: - FeatureTokenContainerDecorator- Implementations can add, remove or modify features on a sentence. 
- class zensols.nlp.parser.FeatureSentenceFactory(token_decorators=())[source]¶
- Bases: object - Create a FeatureSentence out of single tokens or split on whitespace. This is a utility class to create data structures when only single tokens are the source data. - For example, if you only have tokens that need to be scored with unigram ROUGE-1, use this class to create sentences, which is a subclass of TokenContainer. - __init__(token_decorators=())¶
 - create(tokens)[source]¶
- Create a sentence from tokens. 
 - 
token_decorators: Sequence[FeatureTokenDecorator] = ()¶
- A list of decorators that can add, remove or modify features on a token. 
 
- class zensols.nlp.parser.FeatureTokenContainerDecorator[source]¶
- Bases: - ABC- Implementations can add, remove or modify features on a token container. 
- class zensols.nlp.parser.FeatureTokenDecorator[source]¶
- Bases: - ABC- Implementations can add, remove or modify features on a token. 
- class zensols.nlp.parser.WhiteSpaceTokenizerFeatureDocumentParser(sent_class=<class 'zensols.nlp.container.FeatureSentence'>, doc_class=<class 'zensols.nlp.container.FeatureDocument'>)[source]¶
- Bases: FeatureDocumentParser - This class parses text into FeatureDocument instances, tokenizing only by whitespace. This parser does no sentence chunking, so documents have one and only one sentence for each parse. - __init__(sent_class=<class 'zensols.nlp.container.FeatureSentence'>, doc_class=<class 'zensols.nlp.container.FeatureDocument'>)¶
 - doc_class¶
- The type of document instances to create. - alias of - FeatureDocument
 - parse(text, *args, **kwargs)[source]¶
- Parse text or a text as a list of sentences. - Parameters:
- text ( - str) – either a string or a list of strings; if the former a document with one sentence will be created, otherwise a document is returned with a sentence for each string in the list
- args – the arguments used to create the FeatureDocument instance 
- kwargs – the key word arguments used to create the FeatureDocument instance 
 
- Return type:
 
 - sent_class¶
- The type of sentence instances to create. - alias of - FeatureSentence
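Example: a sketch; since this parser does no sentence chunking, the parse below yields a single-sentence document:

  from zensols.nlp.parser import WhiteSpaceTokenizerFeatureDocumentParser

  parser = WhiteSpaceTokenizerFeatureDocumentParser()
  doc = parser.parse('tokens split only on whitespace')
  print(len(doc.sents))                       # 1
  print([t.norm for t in doc.token_iter()])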
 
zensols.nlp.score module¶
Produces matching scores.
- class zensols.nlp.score.BleuScoreMethod(reverse_sents=False, smoothing_function=None, weights=(0.25, 0.25, 0.25, 0.25), silence_warnings=False)[source]¶
- Bases: ScoreMethod - The BLEU scoring method using the nltk package. In each pair, the first sentence is the reference and the second is the hypothesis. - __init__(reverse_sents=False, smoothing_function=None, weights=(0.25, 0.25, 0.25, 0.25), silence_warnings=False)¶
 - 
silence_warnings: bool= False¶
- Silence the BLEU warning of n-grams not matching: The hypothesis contains 0 counts of 3-gram overlaps...
 - 
smoothing_function: SmoothingFunction= None¶
- This is an implementation of the smoothing techniques for segment-level BLEU scores. - Citation: - Chen and Cherry (2014) A Systematic Comparison of Smoothing Techniques for Sentence-Level BLEU. In WMT14. 
 
- class zensols.nlp.score.ErrorScore(method, exception, replace_score=None)[source]¶
- Bases: - Score- A replacement instance when scoring fails from a raised exception. - __init__(method, exception, replace_score=None)¶
 - 
method: str¶
- The method of the - ScoreMethodthat raised the exception.
 - 
replace_score: Score= None¶
- The score to use in place of this score. Otherwise, asrow() returns a single numpy.nan like FloatScore.
 
- class zensols.nlp.score.ExactMatchScoreMethod(reverse_sents=False, equality_measure='norm')[source]¶
- Bases: ScoreMethod - A scoring method that returns 1 for exact matches and 0 otherwise. - __init__(reverse_sents=False, equality_measure='norm')¶
 - 
equality_measure: str= 'norm'¶
- The method by which to compare, which is one of: - norm: compare with TokenContainer.norm()
- text: compare with TokenContainer.text
- equal: compare using Python object equality (__eq__), which also compares the token values
 
 
 
- class zensols.nlp.score.FloatScore(value)[source]¶
- Bases: Score - Float container. This is needed to create the flat result container structure. Object creation becomes less important since most clients will use ScoreSet.asnumpy(). - 
NAN_INSTANCE: ClassVar[FloatScore] = FloatScore(value=nan)¶
- Used to add to ErrorScore for harmonic means replacements. 
 - __init__(value)¶
 
- class zensols.nlp.score.HarmonicMeanScore(precision, recall, f_score)[source]¶
- Bases: Score - A score having a precision, recall and the harmonic mean of the two, F-score. - 
NAN_INSTANCE: ClassVar[HarmonicMeanScore] = HarmonicMeanScore(precision=nan, recall=nan, f_score=nan)¶
- Used to add to ErrorScore for harmonic means replacements. 
 - __init__(precision, recall, f_score)¶
 
- class zensols.nlp.score.LevenshteinDistanceScoreMethod(reverse_sents=False, form='canon', normalize=True)[source]¶
- Bases: - ScoreMethod- A scoring method that computes the Levenshtein distance. - __init__(reverse_sents=False, form='canon', normalize=True)¶
 - 
form: str= 'canon'¶
- The form of the text used for the evaluation, which is one of: - text: the original text with TokenContainer.text
- norm: the normalized text using TokenContainer.norm()
- canon: TokenContainer.canonical to normalize out whitespace for better comparisons
 
 
- class zensols.nlp.score.RougeScoreMethod(reverse_sents=False, feature_tokenizer=True)[source]¶
- Bases: - ScoreMethod- The ROUGE scoring method using the - rouge_scorepackage.- __init__(reverse_sents=False, feature_tokenizer=True)¶
 - 
feature_tokenizer: bool= True¶
- Whether to use the TokenContainer tokenization, otherwise use the rouge_score package.
 
- class zensols.nlp.score.Score[source]¶
- Bases: - Dictable- Individual scores returned from - ScoreMethod.- __init__()¶
 
- class zensols.nlp.score.ScoreContext(pairs, methods=None, norm=True, correlation_ids=None)[source]¶
- Bases: - Dictable- Input needed to create score(s) using - Scorer.- __init__(pairs, methods=None, norm=True, correlation_ids=None)¶
 - 
correlation_ids: Tuple[Union[int,str]] = None¶
- The IDs to correlate with each sentence pair, or - Noneto skip correlating them. The length of this tuple must be that of- pairs.
 - 
methods: Set[str] = None¶
- A set of strings, each indicating the - ScoreMethodused to score- pairs.
 - 
pairs: Tuple[Tuple[TokenContainer,TokenContainer]]¶
- Sentence, span or document pairs to score (order matters for some scoring methods such as rouge). Depending on the scoring method the ordering of the sentence pairs should be: - (<summary>, <source>)
- (<gold>, <prediction>)
- (<references>, <candidates>)
 - See - ScoreMethodimplementations for more information about pair ordering.
 
- class zensols.nlp.score.ScoreMethod(reverse_sents=False)[source]¶
- Bases: - object- An abstract base class for scoring methods (bleu, rouge, etc). - __init__(reverse_sents=False)¶
 - classmethod is_available()[source]¶
- Whether or not this method is available on this system. - Return type:
 
 
- class zensols.nlp.score.ScoreResult(scores, correlation_id=None)[source]¶
- Bases: - Dictable- A result of scores created by a - ScoreMethod.- __init__(scores, correlation_id=None)¶
 - 
correlation_id: Optional[str] = None¶
- An ID for correlating back to the - TokenContainer.
 - write(depth=0, writer=<_io.TextIOWrapper name='<stdout>' mode='w' encoding='utf-8'>)[source]¶
- Write this instance as either a - Writableor as a- Dictable. If class attribute- _DICTABLE_WRITABLE_DESCENDANTSis set as- True, then use the- write()method on children instead of writing the generated dictionary. Otherwise, write this instance by first creating a- dictrecursively using- asdict(), then formatting the output.- If the attribute - _DICTABLE_WRITE_EXCLUDESis set, those attributes are removed from what is written in the- write()method.- Note that this attribute will need to be set in all descendants in the instance hierarchy since writing the object instance graph is done recursively. - Parameters:
- depth ( - int) – the starting indentation depth
- writer ( - TextIOBase) – the writer to dump the content of this writable
 
 
 
- class zensols.nlp.score.ScoreSet(results, correlation_id_col='id')[source]¶
- Bases: Dictable - All scores returned from Scorer. - __init__(results, correlation_id_col='id')¶
 - as_dataframe(add_correlation=True)[source]¶
- This gets data from - as_numpy()and returns it as a Pandas dataframe.- Parameters:
- add_correlation (bool) – whether to add the correlation ID (if there is one), using - correlation_id_col
- Return type:
- pandas.DataFrame 
- Returns:
- an instance of - pandas.DataFrameof the results
 
 - as_numpy(add_correlation=True)[source]¶
- Return the Numpy array with column descriptors of the results. spaCy depends on Numpy, so this package will always be available. - Parameters:
- add_correlation ( - bool) – whether to add the correlation ID (if there is one), using- correlation_id_col
- Return type:
 
 - 
correlation_id_col: str= 'id'¶
- The column name for the ScoreResult.correlation_id added to Numpy arrays and Pandas dataframes. If None, then the correlation IDs are used as the index.
 - 
results: Tuple[ScoreResult,...]¶
- A tuple with each element having the results of the respective sentence pair in ScoreContext.pairs. Each element is a dictionary with the method names as keys and the results as values, as output by the ScoreMethod. This is created in Scorer.
 - write(depth=0, writer=<_io.TextIOWrapper name='<stdout>' mode='w' encoding='utf-8'>)[source]¶
- Write this instance as either a - Writableor as a- Dictable. If class attribute- _DICTABLE_WRITABLE_DESCENDANTSis set as- True, then use the- write()method on children instead of writing the generated dictionary. Otherwise, write this instance by first creating a- dictrecursively using- asdict(), then formatting the output.- If the attribute - _DICTABLE_WRITE_EXCLUDESis set, those attributes are removed from what is written in the- write()method.- Note that this attribute will need to be set in all descendants in the instance hierarchy since writing the object instance graph is done recursively. - Parameters:
- depth ( - int) – the starting indentation depth
- writer ( - TextIOBase) – the writer to dump the content of this writable
 
 
 
- class zensols.nlp.score.Scorer(package_manager=None, methods=None, default_methods=None)[source]¶
- Bases: - object- A class that scores sentences using a set of registered methods ( - methods).- __init__(package_manager=None, methods=None, default_methods=None)¶
 - 
default_methods: Set[str] = None¶
- Methods (keys from methods) to use when none are provided in ScoreContext.methods in the call to score().
 - 
methods: Dict[str,ScoreMethod] = None¶
- The registered scoring methods available, which are accessed from ScoreContext.methods.
 - 
package_manager: PackageManager= None¶
- The package manager used to install scoring methods. If this is - None, then packages are not installed and scoring methods are not made available.
 - score(context)[source]¶
- Score the sentences in - context.- Parameters:
- context ( - ScoreContext) – the context containing the data to score
- Return type:
- ScoreSet
- Returns:
- the results for each method indicated in - context
 
 
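A minimal end-to-end sketch under stated assumptions: doc_parser and scorer are resources created by an application ConfigFactory (both names are hypothetical), and bleu is assumed to be a key registered in Scorer.methods:

from zensols.nlp.score import ScoreContext, ScoreSet

# parse the gold and predicted text into TokenContainer instances
gold = doc_parser.parse('The quick brown fox jumps over the lazy dog.')
pred = doc_parser.parse('A quick brown fox jumped over a lazy dog.')
# pair ordering matters for some methods: (<gold>, <prediction>)
context = ScoreContext(
    pairs=((gold, pred),),
    methods={'bleu'},            # keys registered in Scorer.methods
    correlation_ids=('s1',))     # one ID per pair
results: ScoreSet = scorer.score(context)
results.write()
print(results.as_dataframe())    # requires pandas to be installed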
zensols.nlp.serial module¶
Serializes FeatureToken and TokenContainer instances
using the Dictable interface.
- class zensols.nlp.serial.Include(value, names=<not given>, *values, module=None, qualname=None, type=None, start=1, boundary=None)[source]¶
- Bases: - Enum- Indicates what to include at each level. - normal = 2¶
- The normalized form of the text. 
 - original = 1¶
- The original text. 
 - sentences = 4¶
- The sentences of the - FeatureDocument.
 - tokens = 3¶
- The tokens of the - TokenContainer.
 
- class zensols.nlp.serial.Serialized(container, includes, feature_ids)[source]¶
- Bases: - Dictable- A base strategy class that can serialize TokenContainer instances.- __init__(container, includes, feature_ids)¶
 - 
container: TokenContainer¶
- The container to be serialized. 
 
- class zensols.nlp.serial.SerializedFeatureDocument(container, includes, feature_ids, sentence_includes)[source]¶
- Bases: - Serialized- A serializer for feature documents. The container has to be an instance of a FeatureDocument.- __init__(container, includes, feature_ids, sentence_includes)¶
 
- class zensols.nlp.serial.SerializedTokenContainer(container, includes, feature_ids)[source]¶
- Bases: - Serialized- Serializes instances of TokenContainer. This is used to serialize spans and sentences.- __init__(container, includes, feature_ids)¶
 
- class zensols.nlp.serial.SerializedTokenContainerFactory(sentence_includes, document_includes, feature_ids=None)[source]¶
- Bases: - Dictable- Creates instances of Serialized from instances of TokenContainer. These can then be used as Dictable instances, specifically with the asdict and asjson methods.- __init__(sentence_includes, document_includes, feature_ids=None)¶
 - create(container)[source]¶
- Create a serializer from - container(see class docs).- Parameters:
- container ( - TokenContainer) – the container to be serialized
- Return type:
- Serialized
- Returns:
- an object that can be serialized using the asdict and asjson methods
 
 
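A brief sketch of serializing a parsed document, assuming doc is a FeatureDocument and that the include sets and feature IDs below are illustrative choices rather than prescribed values:

from zensols.nlp.serial import Include, SerializedTokenContainerFactory

factory = SerializedTokenContainerFactory(
    sentence_includes={Include.normal, Include.tokens},
    document_includes={Include.original, Include.sentences},
    feature_ids=('norm', 'pos_'))
serialized = factory.create(doc)  # doc: a parsed FeatureDocument
print(serialized.asjson())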
zensols.nlp.spannorm module¶
Normalize spans (of tokens) into strings by reconstructing based on language
rules from the normalized form of the tokens.  This is needed after any token
manipulation from TokenNormalizer or other changes to
FeatureToken.norm.
For now, only English is supported, but the module provides a base for other languages and future enhancements to normalization configuration.
- class zensols.nlp.spannorm.EnglishSpanNormalizer(post_space_skip=frozenset({'(', '-', '<', '[', '`', '{', '‘', '“'}), pre_space_skip=frozenset({"'d", "'ll", "'m", "'re", "'s", "'ve", '-', "n't"}), keep_space_skip=frozenset({'_'}), canonical_delimiter='|')[source]¶
- Bases: - SpanNormalizer- An implementation of a span normalizer for the English language. - __init__(post_space_skip=frozenset({'(', '-', '<', '[', '`', '{', '‘', '“'}), pre_space_skip=frozenset({"'d", "'ll", "'m", "'re", "'s", "'ve", '-', "n't"}), keep_space_skip=frozenset({'_'}), canonical_delimiter='|')¶
 - get_canonical(tokens)[source]¶
- A canonical representation of the container: the non-space tokens separated by CANONICAL_DELIMITER. - Return type:
- str
 
 - get_norm(tokens, use_norm)[source]¶
- Create a string that follows the language spacing rules. - Parameters:
- tokens ( - Iterable[- FeatureToken]) – the tokens to normalize
- use_norm ( - bool) – whether to use the token normalized or orthographic text
 
- Return type:
- str
 
 
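A minimal sketch of reconstructing text from tokens, assuming sent is a parsed FeatureSentence whose token_iter() yields FeatureToken instances:

from zensols.nlp.spannorm import EnglishSpanNormalizer

normalizer = EnglishSpanNormalizer()
# rebuild text following English spacing rules (e.g. no space before "n't")
text: str = normalizer.get_norm(sent.token_iter(), use_norm=True)
# non-space tokens joined with the canonical '|' delimiter
canon: str = normalizer.get_canonical(sent.token_iter())
print(text, canon, sep='\n')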
- class zensols.nlp.spannorm.SpanNormalizer[source]¶
- Bases: - object- Subclasses normalize feature tokens on a per spacy.Language basis. All subclasses must be re-entrant.
zensols.nlp.sparser module¶
The spaCy FeatureDocumentParser implementation.
- class zensols.nlp.sparser.SpacyComponent(name, pipe_name=None, pipe_config=None, pipe_add_kwargs=<factory>, modules=(), initializers=(), auto_install_model=False)[source]¶
- Bases: - Component- A utility base class that supports installing pip dependencies and spaCy models. - __init__(name, pipe_name=None, pipe_config=None, pipe_add_kwargs=<factory>, modules=(), initializers=(), auto_install_model=False)¶
 - 
auto_install_model: Union[bool,str,Iterable[str]] = False¶
- Whether to install models not already available. Note that this uses the pip command to download model requirements, which might have an adverse effect of replacing currently installed Python packages. This value is interpreted as a pip requirement(s) to install if a string or iterable of strings. 
 - init(model, parser)[source]¶
- Initialize the component and add it to the NLP pipeline. This base class implementation loads the modules, then calls Language.add_pipe().- Parameters:
- model ( - Language) – the spaCy model (nlp in spaCy parlance) to which the component is added
- parser ( - FeatureDocumentParser) – the owning parser of this component instance
 
 
 
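A hypothetical configuration sketch in the style of the [doc_parser] example shown below; the section name and pipe choice are assumptions, and such components would be referenced from the parser's components field:

[sent_component]
class_name = zensols.nlp.sparser.SpacyComponent
pipe_name = sentencizer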
- class zensols.nlp.sparser.SpacyFeatureDocumentParser(config_factory, name, lang='en', model_name=None, token_feature_ids=<factory>, components=(), token_decorators=(), sentence_decorators=(), document_decorators=(), disable_component_names=None, token_normalizer=None, special_case_tokens=<factory>, doc_class=<class 'zensols.nlp.container.FeatureDocument'>, sent_class=<class 'zensols.nlp.container.FeatureSentence'>, token_class=<class 'zensols.nlp.tok.SpacyFeatureToken'>, remove_empty_sentences=None, reload_components=False, auto_install_model=False, package_manager=<factory>)[source]¶
- Bases: - FeatureDocumentParser- This language resource parses text into spaCy documents. Loaded spaCy models have the attribute - doc_parser set to enable creation of factory instances from registered pipe components (i.e. specified by Component).- Configuration example: - [doc_parser] class_name = zensols.nlp.sparser.SpacyFeatureDocumentParser lang = en model_name = ${lang}_core_web_sm- Decorators are processed in the same way as in DecoratedFeatureDocumentParser.- __init__(config_factory, name, lang='en', model_name=None, token_feature_ids=<factory>, components=(), token_decorators=(), sentence_decorators=(), document_decorators=(), disable_component_names=None, token_normalizer=None, special_case_tokens=<factory>, doc_class=<class 'zensols.nlp.container.FeatureDocument'>, sent_class=<class 'zensols.nlp.container.FeatureSentence'>, token_class=<class 'zensols.nlp.tok.SpacyFeatureToken'>, remove_empty_sentences=None, reload_components=False, auto_install_model=False, package_manager=<factory>)¶
 - 
auto_install_model: Union[bool,str,Iterable[str]] = False¶
- Whether to install models not already available. Note that this uses the pip command to download model requirements, which might have an adverse effect of replacing currently installed Python packages. This value is interpreted as a pip requirement(s) to install if a string or iterable of strings. 
 - 
config_factory: ConfigFactory¶
- A configuration factory optionally used by pipeline Component instances.
 - 
disable_component_names: Sequence[str] = None¶
- Components to disable in the spaCy model when creating documents in - parse().
 - doc_class¶
- The type of document instances to create. - alias of - FeatureDocument
 - 
document_decorators: Sequence[FeatureDocumentDecorator] = ()¶
- A list of decorators that can add, remove or modify features on a document. 
 - from_spacy_doc(doc, *args, text=None, **kwargs)[source]¶
- Create a FeatureDocument from a spaCy doc. - Parameters:
- doc ( - Doc) – the spaCy generated document to transform into a feature document
- text ( - str) – either a string or a list of strings; if the former a document with one sentence will be created, otherwise a document is returned with a sentence for each string in the list
- args – the arguments used to create the FeatureDocument instance 
- kwargs – the key word arguments used to create the FeatureDocument instance 
 
- Return type:
- FeatureDocument
 
 - get_dictable(doc)[source]¶
- Return a dictionary object graph that pretty prints spaCy docs. - Return type:
- Dictable
 
 - property model: Language¶
- The spaCy model. On first access, this creates a new instance using - model_name.
 - 
model_name: str= None¶
- The spaCy model name (defaults to en_core_web_sm); this is ignored if model is not None.
 - 
name: str¶
- The name of the parser, which is taken from the section name when created with a ConfigFactory and used for debugging.
 - 
package_manager: PackageManager¶
- The package manager used to install - auto_install_model.
 - parse(text, *args, **kwargs)[source]¶
- Parse text or a text as a list of sentences. - Parameters:
- text ( - str) – either a string or a list of strings; if the former a document with one sentence will be created, otherwise a document is returned with a sentence for each string in the list
- args – the arguments used to create the FeatureDocument instance 
- kwargs – the key word arguments used to create the FeatureDocument instance 
 
- Return type:
- FeatureDocument
 
 - 
reload_components: bool= False¶
- Removes, then re-adds components for cached models. This is helpful when component configurations change on reruns with a different application context but in the same Python interpreter session. - A spaCy component can get other instances via config_factory, but if this is False it will be paired with the first instance of this class and not the new ones created with a new configuration factory.
 - 
remove_empty_sentences: bool= None¶
- Deprecated and will be removed in a future version. Use FilterSentenceFeatureDocumentDecorator instead.
 - sent_class¶
- The type of sentence instances to create. - alias of - FeatureSentence
 - 
sentence_decorators: Sequence[FeatureSentenceDecorator] = ()¶
- A list of decorators that can add, remove or modify features on a sentence. 
 - to_spacy_doc(doc, norm=True, add_features=None)[source]¶
- Convert a feature document back into a spaCy document. - Note: not all data is copied; only text, pos_, tag_, lemma_ and dep_. - Parameters:
- doc ( - FeatureDocument) – the feature document to convert
- norm ( - bool) – whether to use the normalized text as the orth_ spaCy token attribute, or text otherwise
 
- Param add_features:
- whether to add POS, NER tags, lemmas, heads and dependencies
- Return type:
- Doc
- Returns:
- the spaCy document with copied data from doc
 
 - token_class¶
- The type of token instances to create. - alias of - SpacyFeatureToken
 - 
token_decorators: Sequence[FeatureTokenDecorator] = ()¶
- A list of decorators that can add, remove or modify features on a token. 
 - 
token_normalizer: TokenNormalizer= None¶
- The token normalizer for methods that use it, i.e. - features.
 
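A short usage sketch, assuming doc_parser was created by a ConfigFactory from the [doc_parser] configuration shown above:

doc = doc_parser.parse('Obama was born in Hawaii. He was president.')
for sent in doc.sents:
    for tok in sent.token_iter():
        # norm, pos_ and ent_ are among the default token feature IDs
        print(tok.norm, tok.pos_, tok.ent_)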
zensols.nlp.stemmer module¶
Stem text using the Porter stemmer.
zensols.nlp.tok module¶
Feature token and related base classes.
- class zensols.nlp.tok.FeatureToken(i, idx, i_sent, norm, lexspan)[source]¶
- Bases: - PersistableContainer,- TextContainer- A container class for features about a token. Subclasses such as SpacyFeatureToken extract only a subset of features from the heavy spaCy C data structures, which are hard/expensive to pickle. Instances of this token class are almost always detached, meaning the underlying in-memory data structures have been copied as pure Python types to facilitate serialization of spaCy tokens.- Feature note: features i, idx and i_sent are always added to feature tokens to be able to reconstruct sentences (see FeatureDocument.uncombine_sentences()), and are always included.
FEATURE_IDS: ClassVar[Set[str]] = frozenset({'children', 'dep', 'dep_', 'ent', 'ent_', 'ent_iob', 'ent_iob_', 'i', 'i_sent', 'idx', 'is_contraction', 'is_ent', 'is_pronoun', 'is_punctuation', 'is_space', 'is_stop', 'is_superlative', 'is_wh', 'lemma_', 'lexspan', 'norm', 'norm_len', 'pos_', 'sent_i', 'shape', 'shape_', 'tag', 'tag_'})¶
- All default available feature IDs. 
 - 
FEATURE_IDS_BY_TYPE: ClassVar[Dict[str,Set[str]]] = {'bool': frozenset({'is_contraction', 'is_ent', 'is_pronoun', 'is_space', 'is_stop', 'is_superlative', 'is_wh'}), 'int': frozenset({'dep', 'ent', 'ent_iob', 'i', 'i_sent', 'idx', 'is_punctuation', 'norm_len', 'sent_i', 'shape', 'tag'}), 'list': frozenset({'children'}), 'object': frozenset({'lexspan'}), 'str': frozenset({'dep_', 'ent_', 'ent_iob_', 'lemma_', 'norm', 'pos_', 'shape_', 'tag_'})}¶
- Map of class type to set of feature IDs. 
 - 
REQUIRED_FEATURE_IDS: ClassVar[Set[str]] = frozenset({'i', 'i_sent', 'idx', 'lexspan', 'norm'})¶
- Features retained regardless of configuration for basic functionality. 
 - 
SKIP_COMPARE_FEATURE_IDS: ClassVar[Set[str]] = {}¶
- A set of feature IDs to avoid comparing in - __eq__().
 - 
TYPES_BY_FEATURE_ID: ClassVar[Dict[str,str]] = {'children': 'list', 'dep': 'int', 'dep_': 'str', 'ent': 'int', 'ent_': 'str', 'ent_iob': 'int', 'ent_iob_': 'str', 'i': 'int', 'i_sent': 'int', 'idx': 'int', 'is_contraction': 'bool', 'is_ent': 'bool', 'is_pronoun': 'bool', 'is_punctuation': 'int', 'is_space': 'bool', 'is_stop': 'bool', 'is_superlative': 'bool', 'is_wh': 'bool', 'lemma_': 'str', 'lexspan': 'object', 'norm': 'str', 'norm_len': 'int', 'pos_': 'str', 'sent_i': 'int', 'shape': 'int', 'shape_': 'str', 'tag': 'int', 'tag_': 'str'}¶
- A map of feature ID to string type. This is used by FeatureToken.write_attributes() to dump the feature types.
 - 
WRITABLE_FEATURE_IDS: ClassVar[Tuple[str,...]] = ('text', 'norm', 'idx', 'sent_i', 'i', 'i_sent', 'tag', 'pos', 'is_wh', 'entity', 'dep', 'children')¶
- Feature IDs that are dumped on write() and write_attributes().
 - __init__(i, idx, i_sent, norm, lexspan)¶
 - clone(cls=None, **kwargs)[source]¶
- Clone an instance of this token. - Parameters:
- cls ( - Type) – the type of the new instance
- kwargs – arguments to add to as attributes to the clone 
 
- Return type:
- FeatureToken
- Returns:
- the cloned instance of this token
 
 - property default_detached_feature_ids: Set[str] | None¶
- The default set of feature IDs used when cloning or detaching with clone() or detach().
 - detach(feature_ids=None, skip_missing=False, cls=None)[source]¶
- Create a detached token (i.e. from spaCy artifacts). - Parameters:
- feature_ids ( - Set[- str]) – the features to write, which defaults to FEATURE_IDS
- skip_missing ( - bool) – whether to only keep feature_ids
- cls ( - Type[- FeatureToken]) – the type of the new instance
 
- Return type:
- FeatureToken
 
 - get_feature(feature_id, expect=True, check_none=False, message=None)[source]¶
- Return a feature by the feature ID. - Parameters:
- feature_id ( - str) – the ID of the feature to retrieve
- expect ( - bool) – whether to raise an error
- message ( - str) – additional context to append to the error message
- check_none ( - bool) – whether to check for unset values (such as NONE, as determined by is_none()), in which case None is returned instead of the value
 
- Raises:
- MissingFeatureError – if expect is True and the feature does not exist
- Return type:
 
 - 
i_sent: int¶
- The index of the token within the parent sentence. This is not to be confused with the index of the sentence to which the token belongs, which is sent_i.
 - 
lexspan: LexicalSpan¶
- The beginning and ending character offsets of the token. This is set as (start, end) = (idx, idx + len(text)). The begin is usually the same as idx, but can change when the text is normalized or when the token moves/is reindexed in the document.
 - set_feature(feature_id, value)[source]¶
- Set, or add if non-existent, a feature to this token instance. If the token has been detached, it will be added to the default_detached_feature_ids.
 - split(positions)[source]¶
- Split on normalized text index positions. This requires and updates the idx and lexspan attributes.
 - property text: str¶
- The initial text before normalized by any - TokenNormalizer.
 - write(depth=0, writer=<_io.TextIOWrapper name='<stdout>' mode='w' encoding='utf-8'>, include_type=True, feature_ids=None, inline=False)[source]¶
- Write this instance as either a - Writableor as a- Dictable. If class attribute- _DICTABLE_WRITABLE_DESCENDANTSis set as- True, then use the- write()method on children instead of writing the generated dictionary. Otherwise, write this instance by first creating a- dictrecursively using- asdict(), then formatting the output.- If the attribute - _DICTABLE_WRITE_EXCLUDESis set, those attributes are removed from what is written in the- write()method.- Note that this attribute will need to be set in all descendants in the instance hierarchy since writing the object instance graph is done recursively. - Parameters:
- depth ( - int) – the starting indentation depth
- writer ( - TextIOBase) – the writer to dump the content of this writable
 
 
 - write_attributes(depth=0, writer=<_io.TextIOWrapper name='<stdout>' mode='w' encoding='utf-8'>, include_type=True, feature_ids=None, inline=False, include_none=True)[source]¶
- Write feature attributes. - Parameters:
- depth ( - int) – the starting indentation depth
- writer ( - TextIOBase) – the writer to dump the content of this writable
- include_type ( - bool) – if True, write the type of value (if available)
- feature_ids ( - Iterable[- str]) – the features to write, which defaults to- WRITABLE_FEATURE_IDS
- inline ( - bool) – whether to print attributes all on the same line
 
 
 
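An illustrative sketch of detaching and feature access, assuming tok is a FeatureToken from a parsed document and the feature IDs below are drawn from FEATURE_IDS:

# keep only a subset of features in a light, picklable copy
detached = tok.detach(feature_ids={'norm', 'pos_', 'lemma_'})
print(detached.get_feature('pos_'))
# a missing feature raises MissingFeatureError only when expect=True
missing = tok.get_feature('no_such_feature', expect=False)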
- class zensols.nlp.tok.SpacyFeatureToken(spacy_token, norm)[source]¶
- Bases: - FeatureToken- Contains and provides the same features as a spaCy - Token.- property children¶
- A sequence of the token’s immediate syntactic children. 
 - conll_iob_()[source]¶
- Return the CoNLL formatted IOB tag, such as B-ORG for a beginning organization token. - Return type:
- str
 
 - property lemma_: str¶
- Return the string lemma, or the text of the named entity if tagged as a named entity.
 - property sent_i: int¶
- The index of the sentence to which the token belongs. This is not to be confused with the index of the token in the respective sentence, which is - FeatureToken.i_sent.- This attribute does not exist in a spaCy token, and was named as such to follow the naming conventions of their API. 
 - property shape: int¶
- Transform of the token's string to show orthographic features. For example, “Xxxx” or “dd”.
 - property shape_: str¶
- Transform of the token's string to show orthographic features. For example, “Xxxx” or “dd”.
 - 
spacy_token: Union[Token,Span]¶
- The parsed spaCy token (or span if entity) this feature set is based. - See:
- FeatureDocument.spacy_doc()
 
 - property token: Token¶
- Return the spaCy token.