Change Log¶

All notable changes to this project will be documented in this file.

The format is based on Keep a Changelog and this project adheres to Semantic Versioning.

Unreleased¶

1.12.8 - 2025-07-31¶

Changed¶

Removed unused dependencies: msgpack and smartopen.

1.12.7 - 2025-07-27¶

Changed¶

Fixed SpacyComponent hash bug.

1.12.6 - 2025-07-27¶

Added¶

TokenContainer.set_entity_offsets
A SpacyComponent that offers spaCy model and pip dependency installation.

1.12.5 - 2025-06-22¶

Changed¶

Set auto install models default to true.

1.12.4 - 2025-06-21¶

Added¶

Added feature to optionally auto load spaCy models as pip dependencies.

1.12.3 - 2025-06-19¶

Changed¶

Fixed PyPI package name.

1.12.2 - 2025-06-19¶

Changed¶

Switch from setuptools to Pixi.
Add scoring method module install.

1.12.1 - 2025-01-23¶

Changed¶

Fix bug when cloning FeatureToken.

1.12.0 - 2025-01-11¶

Removed¶

CombinerFeatureDocumentParser.include_detached_features, in favor of defaulting to FeatureToken.{get,set}_feature semantics.
Dropped support for Python 3.10.

Added¶

A text indexing and search class that finds feature spans in text with mangled white space.
Feature ID mapping in the aggregating parser CombinerFeatureDocumentParser class.

Changed¶

Replaced FeatureToken.{get,set}_value with a more robust {get,set}_feature.
Upgraded to zensols.util version 1.15.

1.11.1 - 2024-05-11¶

Added¶

A method FeatureToken.set_value that sets a value by attribute.
A token container decorator that copies features.

1.11.0 - 2024-04-14¶

Feature release with significant modification to feature merging document parsers.

Added¶

A composite parser that combines several parsers, each with their own rules of copying (or clobbering).

Changed¶

The combiner parser CombinerFeatureDocumentParser, and subclasses, are now optimized to avoid re-parsing for shared parsers. This is the case with the zensols.mednlp parsers that migrate features down to the same parser.
Fixed some features not copied in combiner parsers after a token clone.

Removed¶

The spaCy and combiner parsers are removed from the default zensols.nlp package import.

Changed¶

Add TokenContainer class to decorator hierarchy.
Rename classes:
- StripSentenceDecorator to StripTokenContainerDecorator
- UpdateDocumentDecorator to UpdateTokenContainerDecorator
Rename resource library configuration:
- strip_sentence_decorator to strip_token_container_decorator
- update_document_decorator to update_token_container_decorator
CombinerFeatureDocumentParser now extends DecoratedFeatureDocumentParser, with target_parser becoming delegate. Token features now come from the delegate, or from those stored in the DecoratedFeatureDocumentParser when they do not exist in the delegate.

1.10.0 - 2024-02-27¶

A class name typo is the impetus for this being a new minor release (even if the release is mostly for bug fixes).

Added¶

Add token level annotations to TokenAnnotatedFeatureDocument.
Yielded feature defaults in CombinerFeatureDocumentParser.

Changed¶

Fixed class name typo for TokenAnnotatedFeatureDocument.
Fixed bug on CombinerFeatureDocumentParser where Nones were not replaced by a source parser.
Added token_feature_ids to CombinerFeatureDocumentParser to facilitate token feature passing.
Fixed an end boundary edge case bug in lexical span gaps.
Minor bug fixes.

1.9.2 - 2024-01-11¶

Changed¶

The CachingFeatureDocumentParser is now configurable with decorators.

1.9.1 - 2024-01-04¶

Added¶

Added an API, parser components, and unit tests to split tokens.
Added the missing text column to the feature document Pandas dataframe.

Changed¶

Bug fixes to FeatureDocument sentence combining.
White space tokenization parser no longer inherits from the spaCy parser, and needs no configuration.

1.9.0 - 2023-12-05¶

Upgrade and Python deprecation release.

Changed¶

Upgrade to spaCy version 3.6.
Upgrade to zensols.util version 1.14.

Added¶

Support for Python 3.11.
Optional dependencies for scoring methods.

Removed¶

Support for Python 3.9.

1.8.1 - 2023-11-29¶

Added¶

A simple FeatureSentenceFactory that creates sentence instances from tokens.

Changed¶

FeatureToken bug fixes.
Reduce pickle data footprint.
Span normalization.
Reduce flake8 warnings; improve type hints and documentation.

1.8.0 - 2023-08-16¶

Functional update release with moderate risk to downstream packages.

Changed¶

TokenContainer.norm removes newlines of the normalized text.
FeatureToken hash function.
Fix text mangling in sub-document FeatureDocument.get_overlapping method.
Refactor hash and equal compare methods in TokenContainer.
Terse writing for TokenContainer and FeatureToken.

Added¶

Rule based paragraph and list item chunkers.
FeatureDocument.reindex and a method to clear cached state, with unit tests.

1.7.3 - 2023-06-29¶

Changed¶

FeatureToken detached features are transmitted by the CombinerFeatureDocumentParser.

1.7.2 - 2023-06-27¶

Changed¶

Move spaCy parser and supporting classes to a separate module.
Feature to auto load any missing spaCy models at runtime. This feature (doc_parser.auto_install_model) must be turned on to be used.

1.7.1 - 2023-06-20¶

Added¶

Feature to add None values to missing overwritten features in CombinerFeatureDocumentParser.

1.7.0 - 2023-06-07¶

Changed¶

Fixed type exception bug on Feature.to_sentence.
Fix raised exception for overlapped methods on 0-length documents.
Remove spaCy artifacts from parser decorators (i.e. SpacyFeatureDocumentDecorator -> FeatureDocumentDecorator) to generalize to non-spaCy document parsers and other components (deepnlp transformer embedding populators).

Added¶

Right lexical span inclusive parameter for all TokenContainer.get_overlapping* methods.
Empty versions of TokenContainer subclasses.
Added a default instance of a FeatureDocumentParser that does not require a resource library configuration.
A TokenContainer.canonical that provides a canonical representation of the token container.
A right inclusive flag on TokenContainer overlapping methods.
Container methods to update token spans for split entities and a decorator.
Levenshtein edit distance based scoring module.
Exact match scoring module.
SemEval-2013 Task 9.1 scoring module.

1.6.0 - 2023-04-05¶

Added¶

Backwards compatible scoring: error handling and correlation IDs.
More unit tests.
Handle errors during scoring and robustly provide scores when reporting.
Made token containers hashable.

Changed¶

Fixed token overlap on left side of lexical spans.

1.5.0 - 2023-01-23¶

Changed¶

Fix TokenContainer indexing bug with edge case on split on space.
Updated zensols.util to 1.12.1.

Added¶

Scoring framework. This includes BLEU via NLTK by default, and ROUGE via optional package support.
Contiguous sentence index (i_sent) in FeatureDocument.to_sentence.
Default feature ID set to FeatureToken.

Removed¶

Unused Levenshtein dependency.

1.4.1 - 2022-10-02¶

Changed¶

Fixed token indexing bug.

1.4.0 - 2022-09-30¶

Added¶

A document stash caching parser CachingFeatureDocumentParser.
The InterLap library to speed up overlapping token queries.
Sentence decorator and sentence split space decorator.

Changed¶

FeatureDocument.sents changed from a list to a tuple.
Add checks for FeatureDocument.sents and FeatureSentence.sent_tokens as tuples.
Better (English) normalization of text by adding more apostrophe/contraction syntax.
The FeatureToken.NONE constant changed from <none> to -<N>-.
Speed up FeatureToken equals.

Removed¶

Removed stemmer module from default imports. Use import zensols.nlp.stemmer.

1.3.0 - 2022-08-06¶

Added¶

Token indexing mappings accounting for (named entity) multi-word tokens.
IOB (iob_, iob) features.
Re-loadable components and component initializers.

Changed¶

Upgraded to spaCy 3.2.
Add spaCy tokens to spaCy feature tokens.
Bug fixes in combining and overlapping sentences.
Switched to shallow copy of document in overlapping sentence doc methods.

1.2.0 - 2022-06-16¶

Removed¶

Remove resource library regular_expression_escape:dollar configuration. Use zensols.util conf_esc:dollar as a replacement.

1.1.2 - 2022-06-14¶

Changed¶

Dependency bump.

1.1.1 - 2022-05-15¶

Changed¶

Dependency bump.

1.1.0 - 2022-05-04¶

Changed¶

Fix resource leaks and other bugs.
Persist original text along with FeatureDocument rather than reconstruct it from sentence and/or token text.

Added¶

A lexical overlapping utility module (overlap).
A token normalizer that merges tokens into spans (JoinTokenMapper).
Regular expression matching for entity and merge components (similar to JoinTokenMapper).
Add back TokenAnnotatedFeatureSentence for downstream packages.
Add token decorator to the spaCy parser to allow adding or modifying features on creation, separate from the parser class hierarchy.

1.0.1 - 2022-01-25¶

Added¶

Sentences and tokens accessible by index.

Changed¶

More robust regular expression for token splitting.
Mapping combiner is persistable with spaCy tokens and handles split named entities.

1.0.0 - 2021-10-22¶

First major development release.

Added¶

A FeatureDocumentCombiner that merges features from different document parsers.
Top level library NLPError.
A pipeline component and resource configuration library entry to remove sentence boundaries in a spaCy document.

Changed¶

Split out optional resource library content into mappers.conf.
The spaCy model has attribute langres set on LanguageResource to enable creation of factory instances from registered pipe components.
Fix issue with component creation with no pipeline arguments.

Removed¶

The DocStash instance as it was too simple for any practical application.

0.1.3 - 2021-09-21¶

Changed¶

Dependency bump.

Removed¶

zensols.nlp.lang.DocStash

0.1.2 - 2021-09-21¶

Changed¶

Make FeatureDocumentParser callable.
Fix memory leak in LanguageResource.

Added¶

Configuration Resource library.
Configuration for keyword arguments to the add_pipe_comp and example.

0.1.1 - 2021-09-07¶

Changed¶

Fixed bug with creating a dict from a FeatureToken.
Fixed/improved how Feature{Token,Sentence,Document} are converted to dictionaries (with asdict) and how they are written as text (with write).

Added¶

Creates a Pandas dataframe from token feature attributes.
Add back FeatureToken feature ID -> type for write dumping.
Add lexical location SpacyTokenFeatures.loc, the location in the document as a (starting, ending) range.

0.1.0 - 2021-08-16¶

This release simplifies the token attributes level classes in the features module by:

Using feature IDs instead of trying to make sense of the class property/attribute member data.
Using the FeatureDocumentParser and FeatureToken to copy spaCy resources to simple picklable Python classes.

Not only does this greatly reduce complexity in class hierarchy and data copy/move functionality, but speeds things up.

Changed¶

Attributes set on detached token features are no longer robust. Before, if a token feature ID was specified, but didn’t exist on the source token feature set, it would copy over a None. This now raises an AttributeError instead.
For TokenAttributes, creation of dicts (either by asdict or get_features) is now consistent with the set attributes and properties of the class. Only the features passed to these methods are included, which default to the FIELD_IDS of the class (which can be overridden at the class level).
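The stricter copy behavior described above can be illustrated with a minimal sketch. It does not use the zensols.nlp API: SourceToken, copy_features_lenient and copy_features_strict are hypothetical names, and the feature IDs are only examples. It contrasts copying a missing feature as None (the old behavior) with raising an AttributeError (the new behavior).

```python
from dataclasses import dataclass
from typing import Any, Dict, Iterable


@dataclass
class SourceToken:
    """Hypothetical stand-in for a parsed token (not a zensols.nlp class)."""
    norm: str
    pos_: str


def copy_features_lenient(tok: Any, feature_ids: Iterable[str]) -> Dict[str, Any]:
    """Old style: a feature ID missing on the source is copied as None."""
    return {fid: getattr(tok, fid, None) for fid in feature_ids}


def copy_features_strict(tok: Any, feature_ids: Iterable[str]) -> Dict[str, Any]:
    """New style: a feature ID missing on the source raises AttributeError."""
    return {fid: getattr(tok, fid) for fid in feature_ids}


tok = SourceToken(norm='dog', pos_='NN')

# 'ent_' is not set on the source token, so the lenient copy fills in None
lenient = copy_features_lenient(tok, ['norm', 'ent_'])

# the strict copy refuses to invent a value for the missing feature
try:
    copy_features_strict(tok, ['norm', 'ent_'])
    raised = False
except AttributeError:
    raised = True
```

Downstream code that relied on the None fill would need to request only feature IDs that exist on the source, or handle the AttributeError.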

Removed¶

The TokenAttributes.{string}features methods, which created dictionaries of individual attribute/property features. These methods are obviated by get_features, which returns all features in FIELD_IDS.
FeatureDocumentParser.additional_token_feature_ids to simplify token feature IDs passed to feature tokens.
The TokenAttributes class, as it was just a metadata member holder.

Added¶

A spaCy implementation of the TokenFeatures class that somewhat resembles TokenFeatures in the old class hierarchy.

0.0.15 - 2021-08-07¶

Changed¶

Upgrade from spaCy 2.x to 3.x.

Added¶

POS feature inclusion by default to support is_pronoun, which is needed after spaCy 3 changed how lemmatization works.
Move feature containers and parser from zensols.deepnlp, including test cases.
A sentence index feature (i_sent).
An index of sentence feature (sent_i).
Advanced spaCy configuration by adding component classes. This gives more control over configuring the spaCy pipeline.
Add feature containers (FeatureDocument) and parser (FeatureDocumentParser), which were moved over from zensols.deepnlp.

0.0.14 - 2021-04-29¶

Changed¶

Upgrade to zensols.util version 1.4.1.
Upgrade documentation API generation.
Pin the dependency to spaCy 2.3.5 until pip deps are fixed.
Added sentence index features to reconstruct sentences from documents.

0.0.13 - 2021-01-14¶

Changed¶

Fix component adds for spaCy > 2.0.
Add langres model to API documentation.

0.0.12 - 2020-12-29¶

Changed¶

Upgraded zenbuild.
Switched from Travis to GitHub workflows.
Tested with Python 3.9.1.

0.0.11 - 2020-12-09¶

Changed¶

Add basic token features for non-spaCy parse use cases.
Rename feature type to feature ID.
TokenFeatures is now a dictable, with to_dict renamed to asdict.

0.0.10 - 2020-12-09¶

Added¶

Sphinx documentation, which includes API docs.

Changed¶

Settable detached TokenAttributes instances.
Use dataclasses, which requires Python >= 3.7.

0.0.9 - 2020-05-10¶

Changed¶

Move lemmatizing out of the default token normalizer.
Update super method calls to the modern style (Python 3.7 at least).
Fix the bogus “can’t find smart_open.gcs” warning.
Remove language resource factory.
Upgrade to zensols.util 1.2.0 and get rid of custom factories.

Added¶

Feature to parse whole special tokens.
Added the Porter stemmer from NLTK.

Removed¶

Moved word2vec embedding (word2vec.py) to zensols.deepnlp library.
Moved feature normalization (fnorm.py) to zensols.deepnlp library.

0.0.8 - 2020-04-14¶

Changed¶

Upgrade to spaCy 2.2.4 and textacy 0.10.0.

0.0.7 - 2020-01-24¶

Added¶

Added the Porter stemmer from NLTK.

Changed¶

Better class naming for token mapper.
Features debugging bug fix.

0.0.6 - 2019-12-14¶

Changed¶

Fix Travis.

0.0.5 - 2019-12-14¶

Data classes are now used so Python 3.7 is now a requirement.

Added¶

Feature normalizers were added for neural networks.
Implemented a better strategy for using language resources with token normalization.

0.0.4 - 2019-11-21¶

Added¶

Adding detachable and picklable token feature set.

0.0.3 - 2019-07-31¶

Added¶

DocStash that parses documents as a factory stash.

0.0.2 - 2019-07-25¶

Added¶

Feature to disable spaCy pipeline components.
Add configuration for removing punctuation and determiners.

Changed¶

Skip textacy for document creation since it wasn’t used. This is more efficient.

0.0.1 - 2019-07-06¶

Added¶

Initial version.
