Change Log¶
All notable changes to this project will be documented in this file.
The format is based on Keep a Changelog and this project adheres to Semantic Versioning.
Unreleased¶
1.12.0 - 2025-01-11¶
Removed¶
CombinerFeatureDocumentParser.include_detached_features
to default usingFeatureToken.{get,set}_feature
semantics.Dropped support for Python 3.10.
Added¶
A text indexing and search class which find feature spans in text with mangled white space.
Feature ID mapping in the aggregating parser
CombinerFeatureDocumentParser
class.
Changed¶
Replaced
FeatureToken.{get,set}_value
with a more robust{get,set}_feature
.Upgraded to zensols.util version 1.15.
1.11.1 - 2024-05-11¶
Added¶
A method
FeatureToken.set_value
that sets a value by attribute.A token container decorator that copies features.
1.11.0 - 2024-04-14¶
Feature release with significant modification to feature merging document parsers.
Added¶
A composite parser that combines several parsers, each with their own rules of copying (or clobbering).
Changed¶
The combiner parser
CombinerFeatureDocumentParser
, and subclasses, are now optimized to avoid re-parsing for shared parsers. This is the case with the zensols.mednlp parsers that migrate features down to the same parser.Fixed some features not copied in combiner parsers after a token clone.
Removed¶
The spaCy and combiner parsers are removed from the default
zensols.nlp
package import.
Changed¶
Add
TokenContainer
class to decorator hierarchy.Rename classes:
StripSentenceDecorator
toStripTokenContainerDecorator
UpdateDocumentDecorator
toUpdateTokenContainerDecorator
Rename resource library configuration:
strip_sentence_decorator
tostrip_token_container_decorator
update_document_decorator
toupdate_token_container_decorator
CombinerFeatureDocumentParser
now extends fromDecoratedFeatureDocumentParser
withtarget_parser
becomingdelegate
. Token features now come from the delegate or stored in theDecoratedFeatureDocumentParser
when they don’t exist in the delegate.
1.10.0 - 2024-02-27¶
A class name typo is the impetuous for this being a new minor release (even if the release is mostly for bug fixes).
Added¶
Add token level annotations to
TokenAnnotatedFeatureDocument
.Yielded feature defaults in
CombinerFeatureDocumentParser
.
Changed¶
Class name typo for
TokenAnnotatedFeatureDocument
.Fixed bug on
CombinerFeatureDocumentParser
whereNone
s were not replaced by a source parser.Added
toaken_feature_ids
toCombinerFeatureDocumentParser
to facilitate token feature passing.Lexical span gaps end boundary edge case bug fix.
Minor bug fixes.
1.9.2 - 2024-01-11¶
Changed¶
The
CachingFeatureDocumentParser
is now configurable with decorators.
1.9.1 - 2024-01-04¶
Added¶
Added an API, parser components, and unit tests to split tokens.
Adding missing
text
column on the feature document Pandas dataframe.
Changed¶
Bug fixes to
FeatureDocument
sentence combining.White space tokenization parser no longer inherits the spaCy parser, and needs no configuration.
1.9.0 - 2023-12-05¶
Upgrade and Python deprecation release.
Changed¶
Upgrade to spaCy version 3.6.
Upgrade to zensols.util version 1.14.
Added¶
Support for Python 3.11.
Optional dependencies for scoring methods.
Removed¶
Support for Python 3.9.
1.8.1 - 2023-11-29¶
Added¶
A simple
FeatureSentenceFactory
that creates sentence instances from tokens.
Changed¶
FeatureToken
bug fixes.Reduce pickle data footprint.
Span normalization.
Reduce flake8 warning, typehints, documentation.
1.8.0 - 2023-08-16¶
Functional and downstream moderate risk update release.
Changed¶
TokenContainer.norm
removes newlines of the normalized text.FeatureToken
hash function.Fix text mangling in sub-document
FeatureDocument.get_overlapping
method.Refactor hash and equal compare methods in
TokenContainer
Terse writing for
TokenContainer
andFeatureToken
.
Added¶
Rule based paragraph and list item chunkers.
FeatureDocument.reindex
and method to clear cached state with unit tests.
1.7.3 - 2023-06-29¶
Changed¶
FeatureToken
detached features are transmitted by theCombinerFeatureDocumentParser
.
1.7.2 - 2023-06-27¶
Changed¶
Move spaCy parser and supporting classes to a separate module.
Feature to auto load any missing spaCy models at runtime. This feature
doc_parser.auto_install_model
must be turned on to be used.
1.7.1 - 2023-06-20¶
Added¶
Feature to add
None
values to missing overwritten features inCombinerFeatureDocumentParser
.
1.7.0 - 2023-06-07¶
Changed¶
Fixed type exception bug on
Feature.to_sentence
.Fix raised exception for overlapped methods on 0-length documents.
Remove spaCy artifacts from parser decorators (i.e.
SpacyFeatureDocumentDecorator
->FeatureDocumentDecorator
) to generalize to non-spaCy document parsers and other components (deepnlp
transformer embedding populators).
Added¶
Right lexical span inclusive parameter for all
TokenContainer.get_overlapping*
methods.Empty versions of
TokenContainer
subclasses.Added a default instance of a
FeatureDocumentParser
that does not require a resource library configuration.A
TokenContainer.canonical
that provides a canonical representation of the token container.A right inclusive flag on
TokenContainer
overlapping methods.Container methods to update token spans for split entities and a decorator.
Levenshtein edit distance based scoring module.
Exact match scoring module.
SemEval-2013 Task 9.1 scoring module.
1.6.0 - 2023-04-05¶
Added¶
Backwards compatible scoring: error handling and correlation IDs.
More unit tests.
Handle errors during scoring and robustly provide scores when reporting.
Make token containers are hashable.
Changed¶
Fixed token overlap on left side of lexical spans.
1.5.0 - 2023-01-23¶
Changed¶
Fix
TokenContainer
indexing bug with edge case on split on space.Updated zensols.util to 1.12.1.
Added¶
Scoring framework. This includes Bleu via NLTK by default, and optionally ROUGE via optional package support.
Contiguous sentence index (i_sent) in
FeatureDocument.to_sentece
.Default feature ID set to
FeatureToken
.
Removed:¶
Unused Levenshtein dependency.
1.4.1 - 2022-10-02¶
Changed¶
Fixed token indexing bug
1.4.0 - 2022-09-30¶
Added¶
A document stash caching parser
CachingFeatureDocumentParser
.The InterLap library to speed up overlapping token queries.
Sentence decorator and sentence split space decorator.
Changed¶
FeatureDocument.sents
changed from alist
to atuple
.Add checks for
FeatureDocument.sents
andFeatureSentence.sent_tokens
as tuples.Better (English) normalization of text by adding more apostrophe/contraction syntax.
The
FeatureToken.NONE
constant changed from<none>
to-<N>-
.Speed up
FeatureToken
equals.
Removed¶
Removed
stemmer
module from default imports. Useimport zensols.nlp.stemmer
.
1.3.0 - 2022-08-06¶
Added¶
Token indexing mappings accounting for (named entity) multi-word tokens.
IOB (
iob_
,iob
) features.Re-loadable components and component initializers.
Changed¶
Upgraded to spaCy 3.2
Add spaCy tokens to spaCy feature tokens.
Bug fixes in combining and overlapping sentences.
Switched to shallow copy of document in overlapping sentence doc methods.
1.2.0 - 2022-06-16¶
Removed¶
Remove resource library
regular_expression_escape:dollar
configuration. Use zensols.utilconf_esc:dollar
as a replacement.
1.1.2 - 2022-06-14¶
Changed¶
Dependency bump.
1.1.1 - 2022-05-15¶
Changed¶
Dependency bump.
1.1.0 - 2022-05-04¶
Changed¶
Fix resource leaks and other bugs.
Persist original text along with
FeatureDocument
rather than reconstruct it from sentence and/or token text.
Added¶
An lexical overlapping utility module (
overlap
).A token normalizer that merges tokens in to spans (
JoinTokenMapper
).Regular expression matching for entity and merge components (similar to
JoinTokenMapper
).Add back
TokenAnnotatedFeatureSentence
for down stream packages.Add token decorator to spacy parser to allow for add/modify features on creation separate from parser class hierarchy.
1.0.1 - 2022-01-25¶
Added¶
Sentences and tokens accessible by index.
Changed¶
More robust regular expression for token splitting.
Mapping combiner is persistable with spaCy tokens and handles split named entities.
1.0.0 - 2021-10-22¶
First major development release.
Added¶
A
FeatureDocumentCombiner
that merges features from different document parsers.Top level library
NLPError
.A pipeline component and resource configuration library entry to remove sentence boundaries in a spaCy document.
Changed¶
Split out optional resource library content in to
mappers.conf
.The spaCy model has attribute
langres
set onLanguageResource
to enable creation of factory instances from registered pipe components.Fix issue with component creation with no pipeline arguments.
Removed¶
The
DocStash
instance as it was too simple for any practical application.
0.1.3 - 2021-09-21¶
Changed¶
Dependency.
Removed¶
zensols.nlp.lang.DocStash
0.1.2 - 2021-09-21¶
Changed¶
Make
FeatureDocumentParser
callable.Fix memory leak in
LanguageResource
.
Added¶
Configuration Resource library.
Configuration for keyword arguments to the
add_pipe_comp
and example.
0.1.1 - 2021-09-07¶
Changed¶
Fixed bug with creating a
dict
from aFeatureToken
.Fixed/improved how
Feature{Token,Sentence,Document}
aredict
ified with (asdict
) and how they are written as text withwrite
.
Added¶
Creates a Pandas dataframe from token feature attributes.
Add back
FeatureToken
feature ID -> type for write dumpingAdd lexical location
SpacyTokenFeatures.loc
location in the document as an (starting, ending) range.
0.1.0 - 2021-08-16¶
This release simplifies the token attributes level classes in the features
module by:
Using feature IDs instead of trying to make sense of the class property/attribute member data.
Using the
FeatureDocumentParser
andFeatureToken
to copy spaCy resources to simple picklable Python classes.
Not only does this greatly reduce complexity in class hierarchy and data copy/move functionality, but speeds things up.
Changes¶
Attributes set on detached token features are no longer robust. Before, if a token feature ID was specified, but didn’t exist on the source token feature set, it would copy over a
None
. This now raises anAttributeError
instead.For
TokenAttributes
, creation ofdicts
(either byasdict
orget_features
) is now consistent with the set attributes and properties of the class. Only those specified passed to methods, which default toFIELD_IDS
of the class (which can be overridden at a class level).
Removed¶
The dictionary creation of attribute/property individual features methods
TokenAttributes.{string}features
. These methods are obviated by theget_features
, which returns all features inFIELD_IDS
.FeatureDocumentParser.additional_token_feature_ids
to simplify token feature IDs passed to feature tokens.The
TokenAttributes
class, as it was just a metadata member holder.
Added¶
A SpaCy implementation of the
TokenFeatures
class, that somewhat resembles the oldTokenFeatures
of the old class hierarchy.
0.0.15 - 2021-08-07¶
Changes¶
Upgrade from spaCy 2.x to 3.x.
Added¶
POS feature inclusion by default to support
is_pronoun
, which is needed after spaCy 3 changed how lemmatization works.Move feature containers and parser from
zensols.deepnlp
, including test cases.A sentence index feature (
i_sent
).An index of sentence feature (
sent_i
).Advanced spacy configuration by adding component classes. This gives more control over configuring the spaCy pipeline.
Add feature containers (
FeatureDocument
) and parser (FeatureDocumentParser
), which were moved over from zensols.deepnlp.
0.0.14 - 2021-04-29¶
Changes¶
Upgrade to zensols.util version 1.4.1.
Upgrade documentation API generation.
Nail dependencies to spacy 2.3.5 until pip deps are fixed.
Added sentence index features to reconstruct sentences from documents.
0.0.13 - 2021-01-14¶
Changes¶
Fix component adds for spacy > 2.0.
Add langres model to API documentation.
0.0.12 - 2020-12-29¶
Changed¶
Upgraded zenbuild.
Switched from Travis to GitHub workflows.
Tested with Python 3.9.1.
0.0.11 - 2020-12-09¶
Changed¶
Add basic token features for non-spacy parse use cases.
Rename feature type to feature id.
TokeFeatures
is now a dictable with to_dict -> asdict.
0.0.10 - 2020-12-09¶
Added¶
Sphinx documentation, which includes API docs.
Changed¶
Settable detached
TokenAttributes
instances.Make
dataclasses
, and therefore, needs >= Python 3.7.
0.0.9 - 2020-05-10¶
Changed¶
Home/master move lemmatizing out of default token normalizer.
Update super method calls to modern (at least) Python 3.7.
Fix annoying can’t find smart_open.gcs bogus warning.
Remove language resource factory.
Upgrade to zensols.util 1.2.0 and get rid of custom factories.
Added¶
Feature to parse whole special tokens.
Added porter stemmer from nltk.
Removed¶
Moved word2vec embedding (
word2vec.py
) to zensols.deepnlp library.Moved feature normalization (
fnorm.py
) to zensols.deepnlp library.
0.0.8 - 2020-04-14¶
Changed¶
Upgrade to
spaCy
2.2.4 andtextacy
0.10.0
0.0.7 - 2020-01-24¶
Added¶
Added the Porter stemmer from the [NTLK].
Changed¶
Better class naming for token mapper.
Features debugging bug fix.
0.0.6 - 2019-12-14¶
Changed¶
Fix Travis.
0.0.5 - 2019-12-14¶
Data classes are now used so Python 3.7 is now a requirement.
Added¶
Feature normalizers were added for neural networks.
Implemented a better strategy for using language resources with token normalization.
0.0.4 - 2019-11-21¶
Added¶
Adding detachable and picklable token feature set.
0.0.3 - 2019-07-31¶
Added¶
DocStash
that parses documents as a factory stash.
0.0.2 - 2019-07-25¶
Added¶
Feature to disable SpaCy pipeline components.
Add configuration for removing punctuation and determiners.
Changed¶
Skip textacy for document creation since it wasn’t used. This is more efficient.
0.0.1 - 2019-07-06¶
Added¶
Initial version.