zensols.mimicsid package#

Submodules#

zensols.mimicsid.anon#

Inheritance diagram of zensols.mimicsid.anon

Stashes that use annotated sections when available.

class zensols.mimicsid.anon.AnnotatedNoteStash(corpus, anon_resource, row_hadm_map_path)[source]#

Bases: ReadOnlyStash, PrimeableStash

A stash that returns Note instances by their unique row_id keys.

__init__(corpus, anon_resource, row_hadm_map_path)#
anon_resource: AnnotationResource#

Contains the annotations and ontology/metadata note to section data.

clear()[source]#

Delete all data from the stash.

Important: Exercise caution with this method, of course.

corpus: Corpus#

A container class for the resources that access the MIMIC-III corpus.

exists(row_id)[source]#

Return True if data with key name exists.

Implementation note: This Stash.exists() method is very inefficient and should be overridden.

Return type:

bool
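The implementation note on exists() can be illustrated with stand-in classes. This is a minimal sketch and not the zensols API: BaseStash and KeyedStash are hypothetical names, and the default exists() is assumed here to test membership by loading the whole value.

```python
from typing import Any, Dict, Iterable, Optional


class BaseStash:
    """Hypothetical stand-in for a stash base class (not the zensols API)."""

    def __init__(self, data: Dict[str, Any]):
        self._data = data
        self.load_count = 0

    def load(self, name: str) -> Optional[Any]:
        # count loads so the cost of each call is visible
        self.load_count += 1
        return self._data.get(name)

    def keys(self) -> Iterable[str]:
        return self._data.keys()

    def exists(self, name: str) -> bool:
        # assumed default: pay for a full load just to test membership
        return self.load(name) is not None


class KeyedStash(BaseStash):
    def exists(self, name: str) -> bool:
        # override: answer from the key index without loading the value
        return name in self._data


# made-up sample data keyed by a row_id string
slow = BaseStash({'101': 'note text'})
fast = KeyedStash({'101': 'note text'})
print(slow.exists('101'), slow.load_count)
print(fast.exists('101'), fast.load_count)
```

The override answers the same question without touching the stored value, which matters when each load parses a note or queries a database.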

keys()[source]#

Return an iterable of keys in the collection.

Return type:

Iterable[str]

load(row_id)[source]#

Load a data value from the pickled data with key name. Semantically, this method loads the data using the stash’s implementation. For example, DirectoryStash loads the data from a file if it exists, but factory type stashes will always re-generate the data.

See:

get()

Return type:

AnnotatedNote

prime()[source]#
row_hadm_map_path: Path#

The path to the note to admission ID mapping cached file.

property row_to_hadm_ids: Dict[str, str]#

A mapping of row to hospital admission IDs.

class zensols.mimicsid.anon.AnnotationNoteFactory(config_factory, category_to_note, mimic_default_note_section, anon_resource=None, annotated_note_section=None)[source]#

Bases: NoteFactory

Override to replace section with MedSecId annotations if they exist.

__init__(config_factory, category_to_note, mimic_default_note_section, anon_resource=None, annotated_note_section=None)#
annotated_note_section: str = None#

The section to use for creating a new annotated section, for those that are found in the annotation set.

anon_resource: AnnotationResource = None#

Contains the annotations and ontology/metadata note to section data.

create(note_event, section=None)[source]#

Create a new factory based instance of a Note from a NoteEvent.

Parameters:

note_event (NoteEvent) – the source data

Return type:

Note

class zensols.mimicsid.anon.AnnotationResource(installer)[source]#

Bases: Dictable

This class provides access to the .zip file that contains the JSON section identification annotations. It also has the ontology provided as a Pandas dataframe.

__init__(installer)#
static category_to_id(name)[source]#

Return the ID form for the category name.

Return type:

str

property corpus_path: Path#

The path to the annotations .zip file (see class docs).

get_annotation(note_event)[source]#

Get the raw annotation as a Python dict of dicts for a NoteEvent.

Return type:

Dict[str, Any]

installer: Installer#

Used to download the annotation set as a zip file and provide the location to the downloaded file.

property note_counts_by_admission: DataFrame#

The counts of each category and row IDs for each admission.

property note_ids: DataFrame#

Return a dataframe of hospital admission and corresponding note IDs.

property ontology: DataFrame#

A dataframe representing the note to section ontology. It contains the relation from notes to sections along with their respective descriptions.

class zensols.mimicsid.anon.NoteStash(delegate, corpus)[source]#

Bases: DelegateStash

Creates notes of type Note or AnnotatedNote depending on whether the note was annotated.

__init__(delegate, corpus)#
corpus: Corpus#

A container class for the resources that access the MIMIC-III corpus.

get(name, default=None)[source]#

Load an object or a default if key name doesn’t exist.

Implementation note: subclasses will probably want to override this method given the super method is cavalier about calling exists() and load(). Based on the implementation, this can be problematic.

Return type:

Any

load(row_id)[source]#

Load a data value from the pickled data with key name. Semantically, this method loads the data using the stash’s implementation. For example, DirectoryStash loads the data from a file if it exists, but factory type stashes will always re-generate the data.

See:

get()

Return type:

Note

zensols.mimicsid.app#

Inheritance diagram of zensols.mimicsid.app

Use the MedSecId section annotations with MIMIC-III corpus parsing.

class zensols.mimicsid.app.Application(config, facade_name='facade', model_path=None, config_factory_args=<factory>, config_overwrites=None, cache_global_facade=True, model_config_overwrites=None, config_factory=None, corpus=None, anon_resource=None, note_stash=None)[source]#

Bases: FacadeApplication

Use the MedSecId section annotations with MIMIC-III corpus parsing.

__init__(config, facade_name='facade', model_path=None, config_factory_args=<factory>, config_overwrites=None, cache_global_facade=True, model_config_overwrites=None, config_factory=None, corpus=None, anon_resource=None, note_stash=None)#
admission_notes(hadm_id, out_file=None, keeps=None)[source]#

Create a CSV of note information by admission.

Parameters:
  • hadm_id (str) – the admission ID

  • out_file (Path) – the output path

  • keeps (str) – a comma-delimited list of columns to keep in the output; defaults to all columns

Return type:

DataFrame
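The keeps parameter is described as a comma-delimited list of columns. The sketch below shows one way such a string could be applied to tabular rows before writing CSV. It uses only the standard library; keep_columns and to_csv are hypothetical helpers, the sample row is made up, and none of this is the library's own implementation.

```python
import csv
import io
from typing import Dict, List, Optional

Row = Dict[str, str]


def keep_columns(rows: List[Row], keeps: Optional[str]) -> List[Row]:
    """Hypothetical helper: restrict each row to the columns in ``keeps``."""
    if keeps is None:
        # no filter given: keep all columns
        return [dict(r) for r in rows]
    cols = [c.strip() for c in keeps.split(',') if c.strip()]
    return [{c: r[c] for c in cols} for r in rows]


def to_csv(rows: List[Row]) -> str:
    """Serialize rows to CSV text using the first row's keys as the header."""
    buf = io.StringIO()
    if rows:
        writer = csv.DictWriter(
            buf, fieldnames=list(rows[0].keys()), lineterminator='\n')
        writer.writeheader()
        writer.writerows(rows)
    return buf.getvalue()


# made-up sample row; the real column names may differ
rows = [{'row_id': '1', 'category': 'Discharge summary', 'hadm_id': '100'}]
print(to_csv(keep_columns(rows, 'row_id, category')))
```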

anon_resource: AnnotationResource = None#

Contains resources to access the MIMIC-III MedSecId annotations.

clear()[source]#

Remove all admission, note and section cached (parsed) data.

config_factory: ConfigFactory = None#

The config used to create facade instances.

corpus: Corpus = None#

A container class for the resources that access the MIMIC-III corpus.

dump_ontology(out_file=None)[source]#

Writes the ontology.

Parameters:

out_file (Path) – the output path

note_counts_by_admission(out_file=None)[source]#

Write the counts of each category and row IDs for each admission.

Parameters:

out_file (Path) – the output path

Return type:

DataFrame

note_stash: NoteStash = None#

A stash that returns Note instances by their unique row_id keys.

write_admission(hadm_id, out_dir=PosixPath('.'), output_format=NoteFormat.text)[source]#

Write all the notes of an admission.

Parameters:
  • hadm_id (str) – the admission ID

  • out_dir (Path) – the output directory

  • output_format (NoteFormat) – the output format of the note

write_note(row_id, out_file=None, output_format=NoteFormat.text)[source]#

Write an admission, note or section.

Parameters:
  • row_id (int) – the row ID of the note to write

  • out_file (Path) – the output path

  • output_format (NoteFormat) – the output format of the note

class zensols.mimicsid.app.PredOutputType(value, names=None, *, module=None, qualname=None, type=None, start=1, boundary=None)[source]#

Bases: Enum

The types of prediction output formats.

json = 2#
text = 1#
class zensols.mimicsid.app.PredictionApplication(config_factory=None, note_stash=None, section_predictor=None)[source]#

Bases: object

An application that predicts sections in file(s) on the file system, then dumps them back to the file system (or standard out).

__init__(config_factory=None, note_stash=None, section_predictor=None)#
config_factory: ConfigFactory = None#

The config factory used to help find the packed model.

note_stash: NoteStash = None#

A stash that returns Note instances by their unique row_id keys.

predict_sections(input_path, output_path=PosixPath('preds'), out_type=PredOutputType.text, file_limit=None)[source]#

Predict the section IDs of medical notes by file name or all files in a directory.

Parameters:
  • input_path (Path) – the path to the medical note(s) to annotate

  • output_path (Path) – where to write the prediction(s) or - for standard out

  • out_type (PredOutputType) – the prediction output format

  • file_limit (int) – the max number of documents to predict when the input path is a directory
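The parameters above imply two decisions: whether input_path is a single file or a directory capped by file_limit, and whether output_path is the - marker for standard out. A minimal sketch of that logic follows, assuming files are taken in sorted order (the ordering is not stated in the documentation). The helpers resolve_inputs and is_stdout are hypothetical and not part of the library.

```python
import tempfile
from pathlib import Path
from typing import List, Optional


def resolve_inputs(input_path: Path,
                   file_limit: Optional[int] = None) -> List[Path]:
    """Hypothetical helper: expand a file or directory into a list of files."""
    if input_path.is_dir():
        # assumption: process files in sorted name order
        files = sorted(p for p in input_path.iterdir() if p.is_file())
        return files if file_limit is None else files[:file_limit]
    # a single file is returned as is; file_limit does not apply
    return [input_path]


def is_stdout(output_path: Path) -> bool:
    """Hypothetical helper: treat the path ``-`` as standard out."""
    return str(output_path) == '-'


with tempfile.TemporaryDirectory() as tmp:
    root = Path(tmp)
    for name in ('a.txt', 'b.txt', 'c.txt'):
        (root / name).write_text('note')
    limited = [p.name for p in resolve_inputs(root, file_limit=2)]
    single = [p.name for p in resolve_inputs(root / 'a.txt')]

print(limited, single, is_stdout(Path('-')))
```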

repredict(row_id, output_path=PosixPath('preds'), out_type=PredOutputType.text)[source]#

Predict the section IDs of an existing MIMIC-III note.

Parameters:
  • row_id (int) – the row ID of the note to write

  • output_path (Path) – where to write the prediction(s) or - for standard out

  • out_type (PredOutputType) – the prediction output format

section_predictor: SectionPredictor = None#

The section name that contains the name of the SectionPredictor to create from the config_factory.

zensols.mimicsid.cli#

Inheritance diagram of zensols.mimicsid.cli

Command line entry point to the application.

class zensols.mimicsid.cli.ApplicationFactory(*args, **kwargs)[source]#

Bases: ApplicationFactory

The application factory for section identification.

__init__(*args, **kwargs)[source]#
classmethod annotation_resource()[source]#

Contains resources to access the MIMIC-III MedSecId annotations.

Return type:

AnnotationResource

classmethod corpus()[source]#

Return the MIMIC-III corpus data access object.

Return type:

Corpus

classmethod instance(name)[source]#

Return the application context.

Return type:

ConfigFactory

classmethod note_stash(host, port, db_name, user, password)[source]#

Return the note stash using the app context, which is populated with the Postgres DB login provided as the parameters.

Return type:

NoteStash

classmethod section_predictor()[source]#

Return the section predictor using the app context.

Return type:

SectionPredictor

zensols.mimicsid.cli.main(args=sys.argv, **kwargs)[source]#
Return type:

ActionResult

zensols.mimicsid.dapp#

Inheritance diagram of zensols.mimicsid.dapp

Distribution utility application.

class zensols.mimicsid.dapp.DistApplication(anon_resource, preempt_stash)[source]#

Bases: object

Utilities to train the models.

__init__(anon_resource, preempt_stash)#
anon_resource: AnnotationResource#

Contains resources to access the MIMIC-III MedSecId annotations.

preempt_notes(input_file=None, workers=None, max_adm=None)[source]#

Preemptively parse note documents across multiple processes.

Parameters:
  • input_file (Path) – a file of notes’ unique row_id IDs

  • workers (int) – the number of processes to use to parse notes

  • max_adm (int) – the maximum number of admission notes to process

preempt_stash: NoteDocumentPreemptiveStash#

A multi-processing stash used to preemptively parse notes.

zensols.mimicsid.domain#

Inheritance diagram of zensols.mimicsid.domain

Annotated section and note domain specific classes.

class zensols.mimicsid.domain.AgeType(value, names=None, *, module=None, qualname=None, type=None, start=1, boundary=None)[source]#

Bases: Enum

An enumeration of all possible ages identified by the physicians per note in the annotation set.

adult = 1#
newborn = 2#
pediatric = 3#
class zensols.mimicsid.domain.AnnotatedNote(row_id, subject_id, hadm_id, chartdate, charttime, storetime, category, description, cgid, iserror, text, context, annotation=None)[source]#

Bases: Note

An annotated note that contains instances of AnnotationSection. It also contains the age type taken from the annotations.

__init__(row_id, subject_id, hadm_id, chartdate, charttime, storetime, category, description, cgid, iserror, text, context, annotation=None)#
property age_type: AgeType#

The age type of the discharge note as annotated by the physicians.

annotation: Dict[str, Any] = None#

The annotation (JSON) parsed from the annotations zip file.

write_fields(depth=0, writer=<_io.TextIOWrapper name='<stdout>' mode='w' encoding='utf-8'>)[source]#

Write note header fields such as the row_id and category.

class zensols.mimicsid.domain.AnnotatedSection(id, name, container, header_spans, body_span, annotation=None)[source]#

Bases: Section

A section that uses the MedSecId annotations for section demarcation (header_span, header_spans and body_span) and identification (id).

Many of the header identifiers are found in multiple locations in the body of the text. In other cases there are no header spans at all. The header_spans field has all of them, and if there is at least one, the header_span is set to the first.

See the MedSecId paper for details.
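The relationship between header_spans and header_span described above can be sketched as a small data class. This is an illustration only: SectionSpans and the Span alias are hypothetical, and returning None when there are no header spans is an assumption, since the text only states what happens when at least one exists.

```python
from dataclasses import dataclass
from typing import Optional, Tuple

# hypothetical span type: (begin, end) character offsets
Span = Tuple[int, int]


@dataclass
class SectionSpans:
    """Hypothetical stand-in holding the spans of one annotated section."""
    header_spans: Tuple[Span, ...]
    body_span: Span

    @property
    def header_span(self) -> Optional[Span]:
        # the first header span if any exist; None is an assumed fallback
        return self.header_spans[0] if self.header_spans else None


# a header that appears twice in the note body (made-up offsets)
multi = SectionSpans(header_spans=((0, 15), (120, 135)), body_span=(0, 300))
# a section with no header at all
bare = SectionSpans(header_spans=(), body_span=(300, 450))
print(multi.header_span, bare.header_span)
```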

__init__(id, name, container, header_spans, body_span, annotation=None)#
annotation: Dict[str, Any] = None#

The raw annotation data parsed from the zip file containing the JSON.

class zensols.mimicsid.domain.MimicPredictedNote(*args, predicted_note, **kwargs)[source]#

Bases: Note

A note that comes from the MIMIC-III corpus with predicted sections. This takes an instance of PredictedNote created by the model during inference. It creates Section instances, and then discards the predicted note on pickling.

This method avoids having to serialize the FeatureDocument (PredictedNote.doc) twice.

__init__(*args, predicted_note, **kwargs)[source]#
exception zensols.mimicsid.domain.MimicSectionAssertError(a, b)[source]#

Bases: MimicSectionError

__init__(a, b)[source]#
__module__ = 'zensols.mimicsid.domain'#
exception zensols.mimicsid.domain.MimicSectionError[source]#

Bases: MimicError

__annotations__ = {}#
__module__ = 'zensols.mimicsid.domain'#
class zensols.mimicsid.domain.PredictedNote(predicted_sections, doc)[source]#

Bases: PersistableContainer, SectionContainer

A note with predicted sections.

__init__(predicted_sections, doc)#
doc: InitVar#

The used document that was parsed for prediction.

property predicted_sections: List[Section]#

The sections predicted by the model.

property text: str#

The entire note text.

property truncted_text: str#
class zensols.mimicsid.domain.SectionFilterType(value, names=None, *, module=None, qualname=None, type=None, start=1, boundary=None)[source]#

Bases: Enum

Indicates which sections to keep in SectionPredictor.

keep_all = 1#

Do not filter any sections.

keep_classified = 3#

Keep sections that have a section classification.

keep_non_empty = 2#

Keep sections that have headers, more than just whitespace, or both.
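The three filter modes can be sketched as a function over simple section records. The enumeration below is a local mirror of the documented values. Sec and filter_sections are hypothetical, and treating a section as classified when its id is not None is an assumption about what "has a section classification" means.

```python
from dataclasses import dataclass
from enum import Enum
from typing import List, Optional


class SectionFilterType(Enum):
    """Local mirror of the documented enumeration values."""
    keep_all = 1
    keep_non_empty = 2
    keep_classified = 3


@dataclass
class Sec:
    """Hypothetical section record used only for this sketch."""
    id: Optional[str]
    has_header: bool
    body: str


def filter_sections(secs: List[Sec], how: SectionFilterType) -> List[Sec]:
    if how is SectionFilterType.keep_all:
        # do not filter any sections
        return list(secs)
    if how is SectionFilterType.keep_non_empty:
        # keep sections with a header, non-whitespace body text, or both
        return [s for s in secs if s.has_header or s.body.strip()]
    # keep_classified: assumed to mean the section has an identifier
    return [s for s in secs if s.id is not None]


# made-up sections
secs = [
    Sec(id='history', has_header=True, body='Patient reports cough.'),
    Sec(id=None, has_header=False, body='   \n'),
    Sec(id=None, has_header=False, body='Unlabeled free text.'),
]
for how in SectionFilterType:
    print(how.name, len(filter_sections(secs, how)))
```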

zensols.mimicsid.model#

Inheritance diagram of zensols.mimicsid.model

Contains section ID model and prediction classes.

exception zensols.mimicsid.model.EmptyPredictionError[source]#

Bases: PredictionError

Raised when the model classifies all tokens as having no section.

__annotations__ = {}#
__init__()[source]#
__module__ = 'zensols.mimicsid.model'#
exception zensols.mimicsid.model.PredictionError[source]#

Bases: MimicSectionError

Raised for any issue predicting sections.

__annotations__ = {}#
__module__ = 'zensols.mimicsid.model'#
class zensols.mimicsid.model.SectionDataPoint(id, batch_stash, note, pred_doc=None)[source]#

Bases: DataPoint

A data point for the section ID model.

TOKEN_TYPES: ClassVar[Tuple[str]] = ('SEP', 'SPACE', 'COLON', 'NEWLINE', 'UPCASE', 'DOWNCASE', 'CAPITAL', 'PUNCTUATION', 'DIGIT', 'MIX')#

The list of types used as enumerated nominal values in labeled encoder vectorizer components.

__init__(id, batch_stash, note, pred_doc=None)#
property cuis: Tuple[str | None]#

The CUI feature.

property doc: FeatureDocument#

The document from where this data point originates.

property ents: Tuple[str]#

The named entity feature.

property feature_dataframe: DataFrame#

A dataframe used to create some of the features of this data point.

property headers: Tuple[str]#

The header label (section types per the paper).

property idxs: Tuple[int]#

The index feature.

property is_pred: bool#

Whether this data point is used for prediction.

note: AnnotatedNote#

The note contained by this data point.

pred_doc: FeatureDocument = None#

The parsed document used for prediction when using this data point for prediction.

property section_names: Tuple[str]#

The section names label (section types per the paper).

property ttypes: Tuple[str]#

The token type feature, which is the string value of TokenType.

class zensols.mimicsid.model.SectionFacade(config, config_factory=<property object>, progress_bar=True, progress_bar_cols='term', executor_name='executor', writer=<_io.TextIOWrapper name='<stdout>' mode='w' encoding='utf-8'>, predictions_dataframe_factory_class=<class 'zensols.deeplearn.result.pred.SequencePredictionsDataFrameFactory'>, suppress_transformer_warnings=True)[source]#

Bases: TokenClassifyModelFacade

The application model facade. This only adds the zensols.install package to the CLI output logging.

__init__(config, config_factory=<property object>, progress_bar=True, progress_bar_cols='term', executor_name='executor', writer=<_io.TextIOWrapper name='<stdout>' mode='w' encoding='utf-8'>, predictions_dataframe_factory_class=<class 'zensols.deeplearn.result.pred.SequencePredictionsDataFrameFactory'>, suppress_transformer_warnings=True)#
class zensols.mimicsid.model.SectionPredictionMapper(datas, batch_stash, vec_manager, label_feature_id, pred_attribute='pred', softmax_logit_attribute='softmax_logit')[source]#

Bases: ClassificationPredictionMapper

Predict sections from a FeatureDocument as a list of PredictedNote instances. It does this by creating data points of type SectionDataPoint that are used by the model.

__init__(datas, batch_stash, vec_manager, label_feature_id, pred_attribute='pred', softmax_logit_attribute='softmax_logit')#
map_results(result)[source]#

Map class predictions, logits, and documents generated during use of this instance. Each data point is aggregated across batches.

Return type:

List[PredictedNote]

Returns:

a Settings instance with classes, logits and docs attributes

class zensols.mimicsid.model.TokenType(value, names=None, *, module=None, qualname=None, type=None, start=1, boundary=None)[source]#

Bases: Enum

A custom token type feature that identifies whether the token is:

  • a separator

  • a space

  • a colon character (:)

  • upper case, lower case, or capitalized

  • punctuation (other than a colon)

  • all digits

  • anything else (MIX)

CAPITAL = 7#
COLON = 3#
DIGIT = 9#
DOWNCASE = 6#
MIX = 10#
NEWLINE = 4#
PUNCTUATION = 8#
SEP = 1#
SPACE = 2#
UPCASE = 5#
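The token categories above can be illustrated with a small classifier. This is a sketch under stated assumptions and not the library's feature code: the classify function is hypothetical, the order of the checks is a guess, and SEP is never returned because the documentation does not say what counts as a separator.

```python
import string
from enum import Enum


class TokenType(Enum):
    """Local mirror of the documented enumeration values."""
    SEP = 1
    SPACE = 2
    COLON = 3
    NEWLINE = 4
    UPCASE = 5
    DOWNCASE = 6
    CAPITAL = 7
    PUNCTUATION = 8
    DIGIT = 9
    MIX = 10


def classify(tok: str) -> TokenType:
    """Hypothetical classifier; SEP is omitted (its definition is unknown)."""
    if tok == ':':
        return TokenType.COLON
    if tok and set(tok) <= {'\n', '\r'}:
        # check newlines before generic whitespace
        return TokenType.NEWLINE
    if tok.isspace():
        return TokenType.SPACE
    if tok.isdigit():
        return TokenType.DIGIT
    if tok and all(c in string.punctuation for c in tok):
        return TokenType.PUNCTUATION
    if tok.isalpha():
        if tok.isupper():
            return TokenType.UPCASE
        if tok.islower():
            return TokenType.DOWNCASE
        if tok[0].isupper() and tok[1:].islower():
            return TokenType.CAPITAL
    # anything else, including mixed letters and digits
    return TokenType.MIX


for tok in ('HISTORY', 'History', 'history', ':', '2019', 'pH7'):
    print(repr(tok), classify(tok).name)
```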

zensols.mimicsid.pred#

Inheritance diagram of zensols.mimicsid.pred

Collates the predictions of both models.

class zensols.mimicsid.pred.PredictionNoteFactory(config_factory, category_to_note, mimic_default_note_section, anon_resource=None, annotated_note_section=None, mimic_pred_note_section=None, section_predictor_name=None)[source]#

Bases: AnnotationNoteFactory

A note factory that predicts so that HospitalAdmissionDbStash predicts missing sections.

Implementation note: The section_predictor_name is used with the application context factory config_factory since declaring it in the configuration creates an instance cycle.

__init__(config_factory, category_to_note, mimic_default_note_section, anon_resource=None, annotated_note_section=None, mimic_pred_note_section=None, section_predictor_name=None)#
config_factory: ConfigFactory#

The factory to get the section predictor.

mimic_pred_note_section: str = None#

The section name holding the configuration of the MimicPredictedNote class.

prime()[source]#

The MedSecId project primes by installing the model files.

property section_predictor: SectionPredictor#

The section predictor (see class docs).

section_predictor_name: InitVar[str] = None#

The name of the section predictor as an app config section name. See class docs.

class zensols.mimicsid.pred.SectionPredictor(name, config_factory, section_id_model_unpacker=None, header_model_unpacker=None, model_config=None, doc_parser=None, min_section_body_len=1, section_filter_type=SectionFilterType.keep_non_empty, auto_deallocate=True)[source]#

Bases: PersistableContainer, Primeable

Creates a complete prediction by collating the predictions of both the section ID (type) and header token models. If header_model_unpacker is not set, then only section identifiers (types) and body spans are predicted. In this case, all header spans are left empty.

Implementation note: when auto_deallocate is False you must wrap creations of this instance in dealloc() as this instance contains resources (FacadeApplication) that need deallocation. Their deallocation logic is invoked with this instance and deallocated by PersistableContainer.

__init__(name, config_factory, section_id_model_unpacker=None, header_model_unpacker=None, model_config=None, doc_parser=None, min_section_body_len=1, section_filter_type=SectionFilterType.keep_non_empty, auto_deallocate=True)#
auto_deallocate: bool = True#

Whether or not to deallocate resources after every call to predict(). See class docs.
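The two deallocation modes can be sketched with stand-ins. Nothing here is the zensols API: PredictorSketch and dealloc_sketch are hypothetical, the "prediction" is a placeholder that returns text lengths, and the context manager only imitates the idea of wrapping an instance so its resources are released on exit.

```python
from contextlib import contextmanager
from typing import Iterator, List


class PredictorSketch:
    """Hypothetical stand-in for an object that holds model resources."""

    def __init__(self, auto_deallocate: bool = True):
        self.auto_deallocate = auto_deallocate
        self.allocated = False

    def predict(self, doc_texts: List[str]) -> List[int]:
        self.allocated = True  # pretend a model was loaded
        try:
            # placeholder "prediction": the length of each text
            return [len(t) for t in doc_texts]
        finally:
            if self.auto_deallocate:
                self.deallocate()

    def deallocate(self) -> None:
        self.allocated = False


@contextmanager
def dealloc_sketch(inst: PredictorSketch) -> Iterator[PredictorSketch]:
    """Stand-in for a dealloc-style wrapper: always release on exit."""
    try:
        yield inst
    finally:
        inst.deallocate()


# with auto_deallocate off, resources persist between calls until exit
with dealloc_sketch(PredictorSketch(auto_deallocate=False)) as pred:
    lengths = pred.predict(['Chief Complaint: cough'])
    still_allocated = pred.allocated
released = not pred.allocated
print(lengths, still_allocated, released)
```

Keeping resources allocated between calls avoids reloading a model for every prediction, at the cost of requiring the caller to release them explicitly.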

config_factory: ConfigFactory#

The config factory used to help find the packed model.

deallocate()[source]#

Deallocate all resources for this instance.

doc_parser: FeatureDocumentParser = None#

Used for parsing documents for prediction. Defaults to using the model’s configured document parser.

header_model_unpacker: Optional[ModelUnpacker] = None#

The packer used to create the header token identifier model.

min_section_body_len: int = 1#

The minimum length of the body needed to make a section.

model_config: Configurable = None#

Configuration that overwrites the packaged model configuration.

name: str#

The name of this object instance definition in the configuration.

predict(doc_texts)[source]#

Collate the predictions of both the section ID (type) and header token models.

Parameters:

doc_texts (List[str]) – the text of the medical note to segment

Return type:

Tuple[SectionContainer]

Returns:

a list of the predictions as notes for each respective doc_texts

predict_from_docs(docs)[source]#
Return type:

List[PredictedNote]

prime()[source]#
section_filter_type: SectionFilterType = 2#

What sections to keep. See SectionFilterType.

section_id_model_unpacker: ModelUnpacker = None#

The packer used to create the section identifier model.

Module contents#

MIMIC-III corpus parsing and section prediction with MedSecId.

zensols.mimicsid.suppress_warnings()[source]#

The pretrained model uses a deprecated API.