Change Log¶
All notable changes to this project will be documented in this file.
The format is based on Keep a Changelog and this project adheres to Semantic Versioning.
Unreleased¶
1.12.8 - 2025-07-31¶
Changed¶
Removed unused dependencies: msgpack and smartopen.
1.12.7 - 2025-07-27¶
Changed¶
Fixed SpacyComponent hash bug.
1.12.6 - 2025-07-27¶
Added¶
TokenContainer.set_entity_offsets.
A SpacyComponent that offers spaCy model and pip dependency installation.
1.12.5 - 2025-06-22¶
Changed¶
Set auto install models default to true.
1.12.4 - 2025-06-21¶
Added¶
Added feature to optionally auto load spaCy models as pip dependencies.
1.12.3 - 2025-06-19¶
Changed¶
Fixed pypi package name.
1.12.2 - 2025-06-19¶
Changed¶
Switch from setuptools to Pixi.
Add scoring method module install.
1.12.1 - 2025-01-23¶
Changed¶
Fix bug when cloning FeatureToken.
1.12.0 - 2025-01-11¶
Removed¶
CombinerFeatureDocumentParser.include_detached_features to default using FeatureToken.{get,set}_feature semantics.
Dropped support for Python 3.10.
Added¶
A text indexing and search class that finds feature spans in text with mangled white space.
Feature ID mapping in the aggregating parser CombinerFeatureDocumentParser class.
Changed¶
Replaced FeatureToken.{get,set}_value with a more robust {get,set}_feature.
Upgraded to zensols.util version 1.15.
1.11.1 - 2024-05-11¶
Added¶
A method FeatureToken.set_value that sets a value by attribute.
A token container decorator that copies features.
1.11.0 - 2024-04-14¶
Feature release with significant modification to feature merging document parsers.
Added¶
A composite parser that combines several parsers, each with their own rules of copying (or clobbering).
Changed¶
The combiner parser CombinerFeatureDocumentParser, and subclasses, are now optimized to avoid re-parsing for shared parsers. This is the case with the zensols.mednlp parsers that migrate features down to the same parser.
Fixed some features not copied in combiner parsers after a token clone.
Removed¶
The spaCy and combiner parsers are removed from the default zensols.nlp package import.
Changed¶
Add TokenContainer class to decorator hierarchy.
Rename classes:
StripSentenceDecorator to StripTokenContainerDecorator
UpdateDocumentDecorator to UpdateTokenContainerDecorator
Rename resource library configuration:
strip_sentence_decorator to strip_token_container_decorator
update_document_decorator to update_token_container_decorator
CombinerFeatureDocumentParser now extends from DecoratedFeatureDocumentParser with target_parser becoming delegate. Token features now come from the delegate or are stored in the DecoratedFeatureDocumentParser when they don’t exist in the delegate.
1.10.0 - 2024-02-27¶
A class name typo is the impetus for this being a new minor release (even if the release is mostly for bug fixes).
Added¶
Add token level annotations to TokenAnnotatedFeatureDocument.
Yielded feature defaults in CombinerFeatureDocumentParser.
Changed¶
Class name typo for TokenAnnotatedFeatureDocument.
Fixed bug on CombinerFeatureDocumentParser where Nones were not replaced by a source parser.
Added toaken_feature_ids to CombinerFeatureDocumentParser to facilitate token feature passing.
Lexical span gaps end boundary edge case bug fix.
Minor bug fixes.
1.9.2 - 2024-01-11¶
Changed¶
The CachingFeatureDocumentParser is now configurable with decorators.
1.9.1 - 2024-01-04¶
Added¶
Added an API, parser components, and unit tests to split tokens.
Added missing text column to the feature document Pandas dataframe.
Changed¶
Bug fixes to FeatureDocument sentence combining.
White space tokenization parser no longer inherits the spaCy parser, and needs no configuration.
1.9.0 - 2023-12-05¶
Upgrade and Python deprecation release.
Changed¶
Upgrade to spaCy version 3.6.
Upgrade to zensols.util version 1.14.
Added¶
Support for Python 3.11.
Optional dependencies for scoring methods.
Removed¶
Support for Python 3.9.
1.8.1 - 2023-11-29¶
Added¶
A simple FeatureSentenceFactory that creates sentence instances from tokens.
Changed¶
FeatureToken bug fixes.
Reduce pickle data footprint.
Span normalization.
Reduce flake8 warnings; improve type hints and documentation.
1.8.0 - 2023-08-16¶
Functional and downstream moderate risk update release.
Changed¶
TokenContainer.norm removes newlines of the normalized text.
FeatureToken hash function.
Fix text mangling in sub-document FeatureDocument.get_overlapping method.
Refactor hash and equal compare methods in TokenContainer.
Terse writing for TokenContainer and FeatureToken.
Added¶
Rule based paragraph and list item chunkers.
FeatureDocument.reindex and a method to clear cached state, with unit tests.
1.7.3 - 2023-06-29¶
Changed¶
FeatureToken detached features are transmitted by the CombinerFeatureDocumentParser.
1.7.2 - 2023-06-27¶
Changed¶
Move spaCy parser and supporting classes to a separate module.
Feature to auto load any missing spaCy models at runtime. This feature (doc_parser.auto_install_model) must be turned on to be used.
1.7.1 - 2023-06-20¶
Added¶
Feature to add None values to missing overwritten features in CombinerFeatureDocumentParser.
1.7.0 - 2023-06-07¶
Changed¶
Fixed type exception bug on Feature.to_sentence.
Fix raised exception for overlapped methods on 0-length documents.
Remove spaCy artifacts from parser decorators (i.e. SpacyFeatureDocumentDecorator -> FeatureDocumentDecorator) to generalize to non-spaCy document parsers and other components (deepnlp transformer embedding populators).
Added¶
Right lexical span inclusive parameter for all TokenContainer.get_overlapping* methods.
Empty versions of TokenContainer subclasses.
Added a default instance of a FeatureDocumentParser that does not require a resource library configuration.
A TokenContainer.canonical that provides a canonical representation of the token container.
A right inclusive flag on TokenContainer overlapping methods.
Container methods to update token spans for split entities and a decorator.
Levenshtein edit distance based scoring module.
Exact match scoring module.
SemEval-2013 Task 9.1 scoring module.
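For readers unfamiliar with the metric behind the Levenshtein edit distance based scoring module, the following is a minimal, self-contained sketch of the distance itself. It is not the module's code: the function names levenshtein and similarity, and the normalization to a 0..1 score, are assumptions made for illustration only.

```python
def levenshtein(a: str, b: str) -> int:
    """Return the minimum number of single character edits turning a into b."""
    # prev[j] is the distance between the already processed prefix of ``a``
    # and the prefix b[:j]
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        curr = [i]
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            curr.append(min(prev[j] + 1,          # deletion
                            curr[j - 1] + 1,      # insertion
                            prev[j - 1] + cost))  # substitution
        prev = curr
    return prev[-1]


def similarity(a: str, b: str) -> float:
    """Scale the distance to [0, 1], where 1 means identical strings.

    The scaling by the longer string is an assumed convention for this
    sketch, not necessarily what the scoring module reports.
    """
    longest = max(len(a), len(b))
    return 1.0 if longest == 0 else 1.0 - levenshtein(a, b) / longest
```

Calling levenshtein('kitten', 'sitting') counts two substitutions and one insertion.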
1.6.0 - 2023-04-05¶
Added¶
Backwards compatible scoring: error handling and correlation IDs.
More unit tests.
Handle errors during scoring and robustly provide scores when reporting.
Make token containers hashable.
Changed¶
Fixed token overlap on left side of lexical spans.
1.5.0 - 2023-01-23¶
Changed¶
Fix TokenContainer indexing bug with edge case on split on space.
Updated zensols.util to 1.12.1.
Added¶
Scoring framework. This includes BLEU via NLTK by default, and optionally ROUGE via optional package support.
Contiguous sentence index (i_sent) in FeatureDocument.to_sentence.
Default feature ID set to FeatureToken.
Removed¶
Unused Levenshtein dependency.
1.4.1 - 2022-10-02¶
Changed¶
Fixed token indexing bug
1.4.0 - 2022-09-30¶
Added¶
A document stash caching parser CachingFeatureDocumentParser.
The InterLap library to speed up overlapping token queries.
Sentence decorator and sentence split space decorator.
Changed¶
FeatureDocument.sents changed from a list to a tuple.
Add checks for FeatureDocument.sents and FeatureSentence.sent_tokens as tuples.
Better (English) normalization of text by adding more apostrophe/contraction syntax.
The FeatureToken.NONE constant changed from <none> to -<N>-.
Speed up FeatureToken equals.
Removed¶
Removed stemmer module from default imports. Use import zensols.nlp.stemmer.
1.3.0 - 2022-08-06¶
Added¶
Token indexing mappings accounting for (named entity) multi-word tokens.
IOB (iob_, iob) features.
Re-loadable components and component initializers.
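As background for the IOB features above, this is a small sketch of how inside/outside/beginning tags relate to multi-word named entity spans. It does not use the library: the function iob_tags and the (start, end) token index convention with an exclusive end are hypothetical and chosen only for illustration.

```python
from typing import List, Tuple


def iob_tags(n_tokens: int, entities: List[Tuple[int, int]]) -> List[str]:
    """Return one IOB tag per token.

    ``entities`` holds (start, end) token index pairs with an exclusive end;
    this convention is an assumption of the sketch.
    """
    tags = ['O'] * n_tokens          # tokens outside any entity
    for start, end in entities:
        tags[start] = 'B'            # first token of the entity
        for i in range(start + 1, end):
            tags[i] = 'I'            # remaining tokens of the entity
    return tags


tokens = ['Barack', 'Obama', 'visited', 'New', 'York', 'City']
# two multi-word entities: tokens 0-1 and tokens 3-5
tags = iob_tags(len(tokens), [(0, 2), (3, 6)])
```

Here the two entities cover tokens 0-1 and 3-5, so only 'visited' is tagged as outside.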
Changed¶
Upgraded to spaCy 3.2
Add spaCy tokens to spaCy feature tokens.
Bug fixes in combining and overlapping sentences.
Switched to shallow copy of document in overlapping sentence doc methods.
1.2.0 - 2022-06-16¶
Removed¶
Remove resource library regular_expression_escape:dollar configuration. Use zensols.util conf_esc:dollar as a replacement.
1.1.2 - 2022-06-14¶
Changed¶
Dependency bump.
1.1.1 - 2022-05-15¶
Changed¶
Dependency bump.
1.1.0 - 2022-05-04¶
Changed¶
Fix resource leaks and other bugs.
Persist original text along with FeatureDocument rather than reconstruct it from sentence and/or token text.
Added¶
A lexical overlapping utility module (overlap).
A token normalizer that merges tokens into spans (JoinTokenMapper).
Regular expression matching for entity and merge components (similar to JoinTokenMapper).
Add back TokenAnnotatedFeatureSentence for downstream packages.
Add token decorator to spaCy parser to allow adding or modifying features on creation, separate from the parser class hierarchy.
1.0.1 - 2022-01-25¶
Added¶
Sentences and tokens accessible by index.
Changed¶
More robust regular expression for token splitting.
Mapping combiner is persistable with spaCy tokens and handles split named entities.
1.0.0 - 2021-10-22¶
First major development release.
Added¶
A FeatureDocumentCombiner that merges features from different document parsers.
Top level library NLPError.
A pipeline component and resource configuration library entry to remove sentence boundaries in a spaCy document.
Changed¶
Split out optional resource library content into mappers.conf.
The spaCy model has attribute langres set on LanguageResource to enable creation of factory instances from registered pipe components.
Fix issue with component creation with no pipeline arguments.
Removed¶
The DocStash instance as it was too simple for any practical application.
0.1.3 - 2021-09-21¶
Changed¶
Dependency.
Removed¶
zensols.nlp.lang.DocStash
0.1.2 - 2021-09-21¶
Changed¶
Make FeatureDocumentParser callable.
Fix memory leak in LanguageResource.
Added¶
Configuration Resource library.
Configuration for keyword arguments to the add_pipe_comp and example.
0.1.1 - 2021-09-07¶
Changed¶
Fixed bug with creating a dict from a FeatureToken.
Fixed/improved how Feature{Token,Sentence,Document} are converted to dicts (with asdict) and how they are written as text (with write).
Added¶
Creates a Pandas dataframe from token feature attributes.
Add back FeatureToken feature ID -> type for write dumping.
Add lexical location SpacyTokenFeatures.loc in the document as a (starting, ending) range.
0.1.0 - 2021-08-16¶
This release simplifies the token attributes level classes in the features module by:
Using feature IDs instead of trying to make sense of the class property/attribute member data.
Using the FeatureDocumentParser and FeatureToken to copy spaCy resources to simple picklable Python classes.
Not only does this greatly reduce complexity in class hierarchy and data copy/move functionality, but speeds things up.
Changes¶
Attributes set on detached token features are no longer robust. Before, if a token feature ID was specified, but didn’t exist on the source token feature set, it would copy over a None. This now raises an AttributeError instead.
For TokenAttributes, creation of dicts (either by asdict or get_features) is now consistent with the set attributes and properties of the class. Only the features specified in the arguments passed to these methods are included, which default to the FIELD_IDS of the class (which can be overridden at a class level).
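The change in how missing feature IDs are handled can be illustrated with a short sketch. This is not the library's implementation: the functions detach_lenient and detach_strict and the stand-in source object are hypothetical, and only mimic the before and after behavior described above.

```python
from types import SimpleNamespace
from typing import Any, Dict, Iterable


def detach_lenient(source: Any, feature_ids: Iterable[str]) -> Dict[str, Any]:
    """Old behavior (sketch): a missing feature is copied over as None."""
    return {fid: getattr(source, fid, None) for fid in feature_ids}


def detach_strict(source: Any, feature_ids: Iterable[str]) -> Dict[str, Any]:
    """New behavior (sketch): a missing feature raises AttributeError."""
    return {fid: getattr(source, fid) for fid in feature_ids}


# stand-in for a source token feature set that has no ``ent_`` feature
source = SimpleNamespace(norm='dog', pos_='NOUN')

lenient = detach_lenient(source, ['norm', 'ent_'])

try:
    detach_strict(source, ['norm', 'ent_'])
    raised = False
except AttributeError:
    raised = True
```

Code that relied on the old None fill would need to request only feature IDs that exist on the source, or catch the AttributeError.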
Removed¶
The dictionary creation of attribute/property individual features methods TokenAttributes.{string}features. These methods are obviated by the get_features, which returns all features in FIELD_IDS.
FeatureDocumentParser.additional_token_feature_ids to simplify token feature IDs passed to feature tokens.
The TokenAttributes class, as it was just a metadata member holder.
Added¶
A spaCy implementation of the TokenFeatures class, that somewhat resembles the old TokenFeatures of the old class hierarchy.
0.0.15 - 2021-08-07¶
Changes¶
Upgrade from spaCy 2.x to 3.x.
Added¶
POS feature inclusion by default to support is_pronoun, which is needed after spaCy 3 changed how lemmatization works.
Move feature containers and parser from zensols.deepnlp, including test cases.
A sentence index feature (i_sent).
An index of sentence feature (sent_i).
Advanced spaCy configuration by adding component classes. This gives more control over configuring the spaCy pipeline.
Add feature containers (FeatureDocument) and parser (FeatureDocumentParser), which were moved over from zensols.deepnlp.
0.0.14 - 2021-04-29¶
Changes¶
Upgrade to zensols.util version 1.4.1.
Upgrade documentation API generation.
Nail dependencies to spacy 2.3.5 until pip deps are fixed.
Added sentence index features to reconstruct sentences from documents.
0.0.13 - 2021-01-14¶
Changes¶
Fix component adds for spacy > 2.0.
Add langres model to API documentation.
0.0.12 - 2020-12-29¶
Changed¶
Upgraded zenbuild.
Switched from Travis to GitHub workflows.
Tested with Python 3.9.1.
0.0.11 - 2020-12-09¶
Changed¶
Add basic token features for non-spacy parse use cases.
Rename feature type to feature id.
TokenFeatures is now a dictable with to_dict -> asdict.
0.0.10 - 2020-12-09¶
Added¶
Sphinx documentation, which includes API docs.
Changed¶
Settable detached TokenAttributes instances.
Use dataclasses, which therefore requires Python >= 3.7.
0.0.9 - 2020-05-10¶
Changed¶
Move lemmatizing out of default token normalizer.
Update super method calls to modern (at least) Python 3.7.
Fix annoying can’t find smart_open.gcs bogus warning.
Remove language resource factory.
Upgrade to zensols.util 1.2.0 and get rid of custom factories.
Added¶
Feature to parse whole special tokens.
Added porter stemmer from nltk.
Removed¶
Moved word2vec embedding (word2vec.py) to zensols.deepnlp library.
Moved feature normalization (fnorm.py) to zensols.deepnlp library.
0.0.8 - 2020-04-14¶
Changed¶
Upgrade to spaCy 2.2.4 and textacy 0.10.0.
0.0.7 - 2020-01-24¶
Added¶
Added the Porter stemmer from NLTK.
Changed¶
Better class naming for token mapper.
Features debugging bug fix.
0.0.6 - 2019-12-14¶
Changed¶
Fix Travis.
0.0.5 - 2019-12-14¶
Data classes are now used so Python 3.7 is now a requirement.
Added¶
Feature normalizers were added for neural networks.
Implemented a better strategy for using language resources with token normalization.
0.0.4 - 2019-11-21¶
Added¶
Adding detachable and picklable token feature set.
0.0.3 - 2019-07-31¶
Added¶
DocStash that parses documents as a factory stash.
0.0.2 - 2019-07-25¶
Added¶
Feature to disable SpaCy pipeline components.
Add configuration for removing punctuation and determiners.
Changed¶
Skip textacy for document creation since it wasn’t used. This is more efficient.
0.0.1 - 2019-07-06¶
Added¶
Initial version.