zensols.mimic package#

Submodules#

zensols.mimic.adm#

Hospital admission/stay details.

class zensols.mimic.adm.HospitalAdmission(admission, patient, diagnoses, procedures)[source]#

Bases: PersistableContainer, Dictable

Represents the data collected for a patient over the course of their hospital admission. Note: this object keys notes by their MIMIC row_id as integers, not as strings like some note stashes do.

__init__(admission, patient, diagnoses, procedures)#
admission: Admission#

The admission record of this hospital admission.

diagnoses: Tuple[Diagnosis, ...]#

The ICD-9 diagnoses of the hospital admission.

property feature_dataframe: DataFrame#

The feature dataframe for the hospital admission, built from the feature dataframes of its constituent notes.

get_duplicate_notes(text_start=None)[source]#

Notes with the same note text, each in their respective set.

Parameters:

text_start (int) – the number of leading characters used to compare notes, or the entire note text if None

Return type:

Tuple[Set[str], ...]

Returns:

the sets of duplicate note row_ids, or an empty tuple if there are no duplicates
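The grouping described above can be sketched as follows. This is a minimal illustration and not the library's implementation: the `find_duplicate_sets` helper and the plain dictionary of row_id to note text are hypothetical stand-ins for the real Note objects.

```python
from collections import defaultdict
from typing import Dict, Optional, Set, Tuple


def find_duplicate_sets(texts: Dict[str, str],
                        text_start: Optional[int] = None
                        ) -> Tuple[Set[str], ...]:
    """Group row_ids whose (optionally truncated) text is identical.

    ``texts`` maps a note row_id to its text (hypothetical stand-in).
    """
    buckets: Dict[str, Set[str]] = defaultdict(set)
    for row_id, text in texts.items():
        # compare only the leading characters when text_start is given
        key = text if text_start is None else text[:text_start]
        buckets[key].add(row_id)
    # keep only the buckets that hold more than one note
    return tuple(ids for ids in buckets.values() if len(ids) > 1)


notes = {'1': 'chest pain', '2': 'chest pain', '3': 'ECG normal'}
dups = find_duplicate_sets(notes)
```

With no duplicates the sketch returns an empty tuple, which matches the documented return value.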

get_non_duplicate_notes(dup_sets, filter_fn=None)[source]#

Return non-duplicated notes.

Parameters:
  • dup_sets (Tuple[Set[str]]) – the duplicate sets generated from get_duplicate_notes()

  • filter_fn – if provided it is used to filter duplicates; if everything is filtered, a note from the respective duplicate set is chosen at random

Return type:

Tuple[Tuple[Note, bool], ...]

Returns:

a tuple of (<note>, <is duplicate>) pairs

See:

duplicate_notes
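A rough sketch of the selection logic described above, under stated assumptions: notes are represented only by their row_id strings, one representative is chosen per duplicate set, and `pick_non_duplicates` is a hypothetical helper rather than the library's code.

```python
import random
from typing import Callable, Iterable, List, Optional, Set, Tuple


def pick_non_duplicates(row_ids: Iterable[str],
                        dup_sets: Tuple[Set[str], ...],
                        filter_fn: Optional[Callable[[str], bool]] = None
                        ) -> Tuple[Tuple[str, bool], ...]:
    """Return (<row_id>, <is duplicate>) pairs (hypothetical stand-in)."""
    dup_ids: Set[str] = set().union(*dup_sets)
    # notes that belong to no duplicate set pass through unchanged
    result: List[Tuple[str, bool]] = [
        (r, False) for r in row_ids if r not in dup_ids]
    for dup_set in dup_sets:
        candidates = sorted(dup_set)
        if filter_fn is not None:
            kept = [r for r in candidates if filter_fn(r)]
            # fall back to the whole set if the filter removed everything
            candidates = kept or candidates
        # assumption: one representative per duplicate set
        result.append((random.choice(candidates), True))
    return tuple(result)


pairs = pick_non_duplicates(['1', '2', '3'], ({'1', '2'},),
                            filter_fn=lambda r: r == '2')
```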

property hadm_id: int#

The hospital admission unique identifier.

keys()[source]#
Return type:

Iterable[int]

property notes: Iterable[Note]#

The notes by the care givers.

property notes_by_category: Dict[str, Tuple[Note, ...]]#

All notes grouped by Note.category as keys, with a tuple of the respective notes as values.

patient: Patient#

The patient/subject.

procedures: Tuple[Procedure, ...]#

The ICD-9 procedures of the hospital admission.

write(depth=0, writer=<_io.TextIOWrapper name='<stdout>' mode='w' encoding='utf-8'>, include_admission=False, include_patient=False, include_diagnoses=False, include_procedures=False, **note_kwargs)[source]#

Write the admission and the notes of the admission.

Parameters:

note_kwargs – the keyword arguments given to Note.write_full()

write_full(depth=0, writer=<_io.TextIOWrapper name='<stdout>' mode='w' encoding='utf-8'>, **kwargs)[source]#

Write a verbose output of the admission.

Parameters:

kwargs – the keyword arguments given to write()

write_notes(depth=0, writer=<_io.TextIOWrapper name='<stdout>' mode='w' encoding='utf-8'>, note_limit=9223372036854775807, categories=None, include_note_id=False, **note_kwargs)[source]#

Write the notes of the admission.

Parameters:
  • note_limit (int) – the number of notes to write

  • include_note_id (bool) – whether to include the note identification info

  • categories (Set[str]) – the note categories to write

  • note_kwargs – the keyword arguments given to Note.write_full()

class zensols.mimic.adm.HospitalAdmissionDbFactoryStash(delegate, factory, enable_preemptive=True, dump_factory_nones=True, doc_stash=None, mimic_note_context=None)[source]#

Bases: FactoryStash, Primeable

A factory stash that configures NoteEvent instances so they can parse the MIMIC-III English text as FeatureDocument instances.

__init__(delegate, factory, enable_preemptive=True, dump_factory_nones=True, doc_stash=None, mimic_note_context=None)#
clear()[source]#

Delete all data from the stash.

Important: Exercise caution with this method, of course.

doc_stash: Stash = None#

Contains the documents that map to row_id.

load(hadm_id)[source]#

Load a data value from the pickled data with key name. Semantically, this method loads the data using the stash’s implementation. For example, DirectoryStash loads the data from a file if it exists, but factory type stashes will always re-generate the data.

See:

get()

Return type:

HospitalAdmission
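The factory stash idea (consult a delegate cache first, otherwise create the item and store it) can be sketched as below. This is a simplified illustration under assumptions: `SketchFactoryStash` and `build` are hypothetical names, a plain dictionary stands in for the delegate stash, and the real FactoryStash may differ in detail.

```python
from typing import Any, Callable, Dict, List


class SketchFactoryStash:
    """Hypothetical stand-in: look in a cache first, else build and store."""

    def __init__(self, delegate: Dict[str, Any],
                 factory: Callable[[str], Any]):
        self.delegate = delegate  # cached items keyed by hadm_id
        self.factory = factory    # creates an item when the cache misses

    def load(self, hadm_id: str) -> Any:
        item = self.delegate.get(hadm_id)
        if item is None:
            # cache miss: create the item and keep it for next time
            item = self.factory(hadm_id)
            self.delegate[hadm_id] = item
        return item


calls: List[str] = []


def build(hadm_id: str) -> dict:
    # record each creation so the caching effect is visible
    calls.append(hadm_id)
    return {'hadm_id': hadm_id}


stash = SketchFactoryStash({}, build)
first = stash.load('100')
second = stash.load('100')
```

The second call returns the cached object, so the factory runs only once per key.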

mimic_note_context: Settings = None#

Contains resources needed by new and re-hydrated notes, such as the document stash.

prime()[source]#
class zensols.mimic.adm.HospitalAdmissionDbStash(config_factory, mimic_note_factory, admission_persister, diagnosis_persister, patient_persister, procedure_persister, note_event_persister, note_stash, hospital_adm_name)[source]#

Bases: ReadOnlyStash, Primeable

A stash that creates HospitalAdmission instances. This instance is used by caching stashes per the default resource library configuration for this package.

__init__(config_factory, mimic_note_factory, admission_persister, diagnosis_persister, patient_persister, procedure_persister, note_event_persister, note_stash, hospital_adm_name)#
admission_persister: AdmissionPersister#

The persister for the admissions table.

config_factory: ConfigFactory#

The factory used to create domain objects (i.e. hospital admission).

diagnosis_persister: DiagnosisPersister#

The persister for the diagnosis table.

exists(hadm_id)[source]#

Return True if data with key name exists.

Implementation note: This Stash.exists() method is very inefficient and should be overridden.

Return type:

bool

hospital_adm_name: str#

The configuration section name of the HospitalAdmission used to load instances.

keys(**kwargs) Iterable[str]#

Return an iterable of keys in the collection.

Return type:

Iterable[str]

load(hadm_id)[source]#

Create a complete picture of a hospital stay with admission, patient and notes data.

Parameters:

hadm_id (str) – the ID that specifies the hospital admission to create

Return type:

HospitalAdmission

mimic_note_factory: NoteFactory#

The factory that creates Note instances for hospital admissions.

note_event_persister: NoteEventPersister#

The persister for the noteevents table.

note_stash: Stash#

Creates cached instances of Note.

patient_persister: PatientPersister#

The persister for the patients table.

prime()[source]#
procedure_persister: ProcedurePersister#

The persister for the procedure table.

class zensols.mimic.adm.NoteDocumentPreemptiveStash(delegate, config, name, chunk_size=0, workers=1, processor_class=<class 'zensols.multi.stash.PoolMultiProcessor'>, note_event_persister=None, adm_factory_stash=None)[source]#

Bases: MultiProcessDefaultStash

Contains the stash that preemptively creates Admission, Note and FeatureDocument cache files. This class is not useful for returning any data (see HospitalAdmissionDbFactoryStash).

__init__(delegate, config, name, chunk_size=0, workers=1, processor_class=<class 'zensols.multi.stash.PoolMultiProcessor'>, note_event_persister=None, adm_factory_stash=None)#
adm_factory_stash: HospitalAdmissionDbFactoryStash = None#

The factory to create the admission instances.

note_event_persister: NoteEventPersister = None#

The persister for the noteevents table.

prime()[source]#

If the delegate stash data does not exist, use this implementation to generate the data and process it in child processes.

process_keys(row_ids, workers=None, chunk_size=None)[source]#

Invoke the multi-processing system to preemptively parse and store note events for the IDs provided.

Parameters:
  • row_ids (Iterable[str]) – the admission IDs to parse and cache

  • workers (int) – the number of processes spawned to accomplish the work

  • chunk_size (int) – the size of each group of data sent to the child process to be handled

See:

MultiProcessStash
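To illustrate what chunk_size means for the work sent to each child process, here is a minimal sketch of splitting keys into groups. The `chunk_keys` helper is hypothetical and only models a positive chunk size; it does not reproduce how the library treats its default value of 0 or how it schedules workers.

```python
from typing import List, Sequence


def chunk_keys(keys: Sequence[str], chunk_size: int) -> List[List[str]]:
    """Split keys into consecutive groups of at most chunk_size items."""
    if chunk_size <= 0:
        # the sketch only covers explicit, positive chunk sizes
        raise ValueError('chunk_size must be positive in this sketch')
    return [list(keys[i:i + chunk_size])
            for i in range(0, len(keys), chunk_size)]


groups = chunk_keys(['1', '2', '3', '4', '5'], chunk_size=2)
```

Each group would then be handed to one child process, so a larger chunk size means fewer, larger units of work.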

zensols.mimic.app#

A utility library for parsing the MIMIC-III corpus

class zensols.mimic.app.Application(config_factory, doc_parser, corpus, preempt_stash)[source]#

Bases: object

A utility library for parsing the MIMIC-III corpus

__init__(config_factory, doc_parser, corpus, preempt_stash)#
clear()[source]#

Clear all cached admission and note parses.

config_factory: ConfigFactory#

Used to get temporary resources

corpus: Corpus#

Contains assets to access the MIMIC-III corpus via a database.

corpus_stats()[source]#

Print corpus statistics.

doc_parser: FeatureDocumentParser#

Used to parse command line documents.

preempt_notes(input_file, workers=None)[source]#

Preemptively parse notes as documents across multiple processes.

Parameters:
  • input_file (Path) – a file of notes’ unique row_id IDs

  • workers (int) – the number of processes to use to parse notes

preempt_stash: NoteDocumentPreemptiveStash#

A multi-processing stash used to preemptively parse notes.

show(sent)[source]#

Parse a sentence and print all features for each token.

Parameters:

sent (str) – the sentence to parse and generate features

uniform_sample_hadm_ids(limit=1)[source]#

Print a uniform random sample of admission hadm_ids.

Parameters:

limit (int) – the number to fetch

write_admission(hadm_id, out_dir=PosixPath('.'), output_format=NoteFormat.text)[source]#

Write all the notes of an admission.

Parameters:
  • hadm_id (str) – the hospital admission ID or - for a random ID

  • out_dir (Path) – the output directory

  • output_format (NoteFormat) – the output format of the note

write_admission_summary(hadm_id)[source]#

Write an admission’s note categories and section names.

Parameters:

hadm_id (str) – the hospital admission ID or - for a random ID

write_discharge_reports(limit=1, out_dir=PosixPath('.'))[source]#

Write discharge reports (as opposed to addendums).

Parameters:
  • limit (int) – the number to fetch

  • out_dir (Path) – the output directory

write_features(sent, out_file=None)[source]#

Parse a sentence as MIMIC data and write features to CSV.

Parameters:
  • sent (str) – the sentence to parse and generate features

  • out_file (Path) – the file to write

write_hadm_id_for_note(row_id)[source]#

Get the hospital admission ID (hadm_id) that has note row_id.

Parameters:

row_id (int) – the unique note identifier in the NOTEEVENTS table

Return type:

int

write_note(row_id, out_file=None, output_format=NoteFormat.text)[source]#

Write a note.

Parameters:
  • row_id (int) – the unique note identifier in the NOTEEVENTS table

  • output_format (NoteFormat) – the output format of the note

  • out_file (Path) – the file to write

zensols.mimic.cli#

Command line entry point to the application.

class zensols.mimic.cli.ApplicationFactory(*args, **kwargs)[source]#

Bases: ApplicationFactory

__init__(*args, **kwargs)[source]#
classmethod get_corpus()[source]#

Get the MIMIC-III corpus.

Return type:

Corpus

zensols.mimic.cli.main(args=sys.argv, **kwargs)[source]#
Return type:

ActionResult

zensols.mimic.corpus#

Discharge summary research and MIMIC-III data exploration.

class zensols.mimic.corpus.Corpus(config_factory, patient_persister, admission_persister, diagnosis_persister, note_event_persister, hospital_adm_stash, temporary_results_dir)[source]#

Bases: Dictable

A container class providing access to the MIMIC-III dataset using a relational database (by default PostgreSQL per the resource library configuration). It also has methods to dump corpus statistics.

See:

Resource Libraries

__init__(config_factory, patient_persister, admission_persister, diagnosis_persister, note_event_persister, hospital_adm_stash, temporary_results_dir)#
admission_persister: AdmissionPersister#

The persister for the admissions table.

clear(include_notes=True)[source]#

Clear all cached admission and note parses.

Parameters:

include_notes (bool) – whether to also clear the parsed notes cache

config_factory: ConfigFactory#

Used to clear the note event cache.

diagnosis_persister: DiagnosisPersister#

The persister for the diagnosis table.

get_hospital_adm_by_id(hadm_id)[source]#

Return a hospital admission by its unique identifier.

Return type:

HospitalAdmission

get_hospital_adm_for_note(row_id)[source]#

Return an admission that has note row_id.

Raise:

RecordNotFoundError if row_id is not found in the database

Return type:

HospitalAdmission

get_note_by_id(row_id)[source]#

Return the note (via the hospital admission) for row_id.

Raise:

RecordNotFoundError if row_id is not found in the database

Return type:

Note

hospital_adm_stash: HospitalAdmissionDbStash#

Creates hospital admission instances. Note that this might be a caching stash instance, but method calls are delegated through to the instance of HospitalAdmissionDbStash.

note_event_persister: NoteEventPersister#

The persister for the noteevents table.

patient_persister: PatientPersister#

The persister for the patients table.

temporary_results_dir: Path#

The path to create the output results. This is not used, but needs to stay until the next zensols.mimicsid model is retrained.

write(depth=0, writer=<_io.TextIOWrapper name='<stdout>' mode='w' encoding='utf-8'>)[source]#

Write this instance as either a Writable or as a Dictable. If class attribute _DICTABLE_WRITABLE_DESCENDANTS is set as True, then use the write() method on children instead of writing the generated dictionary. Otherwise, write this instance by first creating a dict recursively using asdict(), then formatting the output.

If the attribute _DICTABLE_WRITE_EXCLUDES is set, those attributes are removed from what is written in the write() method.

Note that this attribute will need to be set in all descendants in the instance hierarchy since writing the object instance graph is done recursively.

Parameters:
  • depth (int) – the starting indentation depth

  • writer (TextIOBase) – the writer to dump the content of this writable

write_hospital_admission(hadm_id, depth=0, writer=<_io.TextIOWrapper name='<stdout>' mode='w' encoding='utf-8'>, note_line_limit=9223372036854775807)[source]#

Write the hospital admission identified by hadm_id.

write_hosptial_count_admission(depth=0, writer=<_io.TextIOWrapper name='<stdout>' mode='w' encoding='utf-8'>, limit=9223372036854775807)[source]#

Write the counts for each hospital admission.

Parameters:

limit (int) – the limit on the return admission counts

See:

AdmissionPersister.get_admission_admission_counts()

write_note_event_counts(subject_id, depth=0, writer=<_io.TextIOWrapper name='<stdout>' mode='w' encoding='utf-8'>)[source]#

Print a list of hospital admissions by count of related notes in descending order.

See:

NoteEventPersister.get_note_counts_by_subject_id()

zensols.mimic.domain#

Domain classes for the corpus notes.

class zensols.mimic.domain.Admission(row_id, subject_id, hadm_id, admittime, dischtime, deathtime, admission_type, admission_location, discharge_location, insurance, language, religion, marital_status, ethnicity, edregtime, edouttime, diagnosis, hospital_expire_flag, has_chartevents_data)[source]#

Bases: MimicContainer

The ADMISSIONS table gives information regarding a patient’s admission to the hospital. Since each unique hospital visit for a patient is assigned a unique HADM_ID, the ADMISSIONS table can be considered as a definition table for HADM_ID. Information available includes timing information for admission and discharge, demographic information, the source of the admission, and so on.

Table source: Hospital database.

Table purpose: Define a patient’s hospital admission, HADM_ID.

Number of rows: 58976

Links to:
  • PATIENTS on SUBJECT_ID

See:

Dictionary

__init__(row_id, subject_id, hadm_id, admittime, dischtime, deathtime, admission_type, admission_location, discharge_location, insurance, language, religion, marital_status, ethnicity, edregtime, edouttime, diagnosis, hospital_expire_flag, has_chartevents_data)#
admission_location: str#

Admission location.

admission_type: str#

Type of admission, for example emergency or elective.

admittime: datetime#

Time of admission to the hospital.

deathtime: datetime#

Time of death.

diagnosis: str#

The DIAGNOSIS column provides a preliminary, free text diagnosis for the patient on hospital admission. The diagnosis is usually assigned by the admitting clinician and does not use a systematic ontology. As of MIMIC-III v1.0 there were 15,693 distinct diagnoses for 58,976 admissions. The diagnoses can be very informative (e.g. chronic kidney failure) or quite vague (e.g. weakness). Final diagnoses for a patient’s hospital stay are coded on discharge and can be found in the DIAGNOSES_ICD table. While this field can provide information about the status of a patient on hospital admission, it is not recommended to use it to stratify patients.

discharge_location: str#

Discharge location.

dischtime: datetime#

Time of discharge from the hospital.

edouttime: datetime#

See edregtime.

edregtime: datetime#

Time that the patient was registered and discharged from the emergency department.

ethnicity: str#

See insurance.

hadm_id: int#

Primary key. Identifies the hospital admission.

has_chartevents_data: int#

Hospital admission has at least one observation in the CHARTEVENTS table.

hospital_expire_flag: int#

This indicates whether the patient died within the given hospitalization. 1 indicates death in the hospital, and 0 indicates survival to hospital discharge.

insurance: str#

The INSURANCE, LANGUAGE, RELIGION, MARITAL_STATUS, ETHNICITY columns describe patient demographics. These columns occur in the ADMISSIONS table as they are originally sourced from the admission, discharge, and transfers (ADT) data from the hospital database. The values occasionally change between hospital admissions (HADM_ID) for a single patient (SUBJECT_ID). This is reasonable for some fields (e.g. MARITAL_STATUS, RELIGION), but less reasonable for others (e.g. ETHNICITY).

language: str#

See insurance.

marital_status: str#

See insurance.

religion: str#

See insurance.

subject_id: int#

Foreign key. Identifies the patient.

class zensols.mimic.domain.Diagnosis(row_id, icd9_code, short_title, long_title)[source]#

Bases: ICD9Container

Table source: Hospital database.

Table purpose: Contains ICD diagnoses for patients, most notably ICD-9 diagnoses.

Number of rows: 651,047

Links to:
  • PATIENTS on SUBJECT_ID

  • ADMISSIONS on HADM_ID

  • D_ICD_DIAGNOSES on ICD9_CODE

__init__(row_id, icd9_code, short_title, long_title)#
class zensols.mimic.domain.HospitalAdmissionContainer(row_id, hadm_id)[source]#

Bases: MimicContainer

Any data container that has a unique, non-null (inpatient) hospital admission identifier.

__init__(row_id, hadm_id)#
hadm_id: int#

Primary key. Identifies the hospital admission.

class zensols.mimic.domain.ICD9Container(row_id, icd9_code, short_title, long_title)[source]#

Bases: MimicContainer

A data container that has ICD-9 codes.

__init__(row_id, icd9_code, short_title, long_title)#
icd9_code: str#

ICD9 code for the diagnosis or procedure.

long_title: str#

Long title associated with the code.

short_title: str#

Short title associated with the code.

class zensols.mimic.domain.MimicContainer(row_id)[source]#

Bases: PersistableContainer, Dictable

Abstract base class for data containers, which are plain old Python objects that are CRUD’d from DAO persisters.

__init__(row_id)#
row_id: int#

Unique row identifier.

write(depth=0, writer=<_io.TextIOWrapper name='<stdout>' mode='w' encoding='utf-8'>, dct=None)[source]#

Write this instance as either a Writable or as a Dictable. If class attribute _DICTABLE_WRITABLE_DESCENDANTS is set as True, then use the write() method on children instead of writing the generated dictionary. Otherwise, write this instance by first creating a dict recursively using asdict(), then formatting the output.

If the attribute _DICTABLE_WRITE_EXCLUDES is set, those attributes are removed from what is written in the write() method.

Note that this attribute will need to be set in all descendants in the instance hierarchy since writing the object instance graph is done recursively.

Parameters:
  • depth (int) – the starting indentation depth

  • writer (TextIOBase) – the writer to dump the content of this writable

exception zensols.mimic.domain.MimicError[source]#

Bases: APIError

Raised for any application level error.

__module__ = 'zensols.mimic.domain'#
exception zensols.mimic.domain.MimicParseError(text)[source]#

Bases: MimicError

Raised for MIMIC note parsing errors.

__annotations__ = {}#
__init__(text)[source]#
__module__ = 'zensols.mimic.domain'#
class zensols.mimic.domain.NoteEvent(row_id, subject_id, hadm_id, chartdate, charttime, storetime, category, description, cgid, iserror, text, context)[source]#

Bases: MimicContainer

Table source: Hospital database.

Table purpose: Contains all notes for patients.

Number of rows: 2,083,180

Links to:
  • PATIENTS on SUBJECT_ID

  • ADMISSIONS on HADM_ID

  • CAREGIVERS on CGID

See:

Dictionary

__init__(row_id, subject_id, hadm_id, chartdate, charttime, storetime, category, description, cgid, iserror, text, context)#
category: str#

Category of the note, e.g. Discharge summary.

CATEGORY and DESCRIPTION define the type of note recorded. For example, a CATEGORY of ‘Discharge summary’ indicates that the note is a discharge summary, and the DESCRIPTION of ‘Report’ indicates a full report while a DESCRIPTION of ‘Addendum’ indicates an addendum (additional text to be added to the previous report).

cgid: int#

Foreign key. Identifies the caregiver.

chartdate: datetime#

Date when the note was charted.

CHARTDATE records the date at which the note was charted. CHARTDATE will always have a time value of 00:00:00.

CHARTTIME records the date and time at which the note was charted. If both CHARTDATE and CHARTTIME exist, then the date portions will be identical. All records have a CHARTDATE. A subset are missing CHARTTIME. More specifically, notes with a CATEGORY value of ‘Discharge Summary’, ‘ECG’, and ‘Echo’ never have a CHARTTIME, only CHARTDATE. Other categories almost always have both CHARTTIME and CHARTDATE, but there is a small amount of missing data for CHARTTIME (usually less than 0.5% of the total number of notes for that category).

STORETIME records the date and time at which a note was saved into the system. Notes with a CATEGORY value of ‘Discharge Summary’, ‘ECG’, ‘Radiology’, and ‘Echo’ never have a STORETIME. All other notes have a STORETIME.
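The CHARTDATE and CHARTTIME rules above can be expressed as a small consistency check. This is only a sketch: `chart_times_consistent` is a hypothetical helper, and it encodes just the two stated rules (CHARTDATE has a midnight time, and the date portions agree when CHARTTIME exists).

```python
from datetime import datetime
from typing import Optional


def chart_times_consistent(chartdate: datetime,
                           charttime: Optional[datetime]) -> bool:
    """Check the CHARTDATE/CHARTTIME rules described in the text."""
    # CHARTDATE always carries a 00:00:00 time component
    if chartdate.time() != datetime.min.time():
        return False
    # CHARTTIME may be missing, e.g. for discharge summaries
    if charttime is None:
        return True
    # when both exist, the date portions must be identical
    return chartdate.date() == charttime.date()


ok = chart_times_consistent(datetime(2150, 3, 1),
                            datetime(2150, 3, 1, 14, 30))
```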

charttime: datetime#

Date and time when the note was charted. Note that some notes (e.g. discharge summaries) do not have a time associated with them: these notes have NULL in this column.

See:

chartdate

context: InitVar#

Contains resources needed by new and re-hydrated notes, such as the document stash.

description: str#

A more detailed categorization for the note, sometimes entered by free-text.

property doc: FeatureDocument#

The parsed document of the note’s text.

get_normal_name(include_desc=True)[source]#

A normalized name of the note useful as a file name (sans extension).

Parameters:

include_desc (bool) – whether or not to add the note’s desc field, which adds an extra dash (-) for any subsequent file name parsing

Return type:

str

hadm_id: int#

Foreign key. Identifies the hospital admission.

property id: str#

The unique identifier of this note event.

iserror: bool#

Flag to highlight an error with the note.

property normal_name: str#

A normalized name of the note useful as a file name (sans extension).

storetime: datetime#

See chartdate.

subject_id: int#

Foreign key. Identifies the patient.

Identifiers which specify the patient: SUBJECT_ID is unique to a patient and HADM_ID is unique to a patient hospital stay.

See:

hadm_id

text: str#

Content of the note.

property truncted_text: str#

A beginning substring of the note’s text useful for debugging.

write(depth=0, writer=<_io.TextIOWrapper name='<stdout>' mode='w' encoding='utf-8'>, line_limit=9223372036854775807, write_divider=True, indent_fields=True, note_indent=1, include_fields=True)[source]#

Write the note event.

Parameters:
  • line_limit (int) – the number of lines to write from the note text

  • write_divider (bool) – whether to write a divider before the note text

  • indent_fields (bool) – whether to indent the fields of the note

  • note_indent (int) – the number of indentation levels used for the note fields

class zensols.mimic.domain.Patient(row_id, subject_id, gender, dob, dod, dod_hosp, dod_ssn, expire_flag)[source]#

Bases: MimicContainer

Table source: CareVue and Metavision ICU databases.

Table purpose: Defines each SUBJECT_ID in the database, i.e. defines a single patient.

Number of rows: 46,520

Links to:
  • ADMISSIONS on SUBJECT_ID

  • ICUSTAYS on SUBJECT_ID

__init__(row_id, subject_id, gender, dob, dod, dod_hosp, dod_ssn, expire_flag)#
dob: datetime#

Date of birth.

dod: datetime#

Date of death. Null if the patient was alive at least 90 days post hospital discharge.

dod_hosp: datetime#

Date of death recorded in the hospital records.

dod_ssn: datetime#

Date of death recorded in the social security records.

expire_flag: int#

Flag indicating that the patient has died.

gender: str#

Gender (one character: M/F).

row_id: int#

Unique row identifier.

subject_id: int#

Primary key. Identifies the patient.

class zensols.mimic.domain.Procedure(row_id, icd9_code, short_title, long_title)[source]#

Bases: ICD9Container

Table source: Hospital database.

Table purpose: Contains ICD procedures for patients, most notably ICD-9 procedures.

Number of rows: 240,095

Links to:
  • PATIENTS on SUBJECT_ID

  • ADMISSIONS on HADM_ID

  • D_ICD_PROCEDURES on ICD9_CODE

__init__(row_id, icd9_code, short_title, long_title)#
exception zensols.mimic.domain.RecordNotFoundError(actor, key_type, key)[source]#

Bases: MimicError

Raised on any domain/container class error.

__annotations__ = {}#
__init__(actor, key_type, key)[source]#
__module__ = 'zensols.mimic.domain'#
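Because MimicParseError and RecordNotFoundError both derive from MimicError, a single handler can cover either failure. The sketch below re-creates the hierarchy with stand-in classes so that it runs without the package installed; the message formats and the `lookup` helper are hypothetical and not taken from the library.

```python
class APIError(Exception):
    """Stand-in for the APIError base class named under Bases above."""


class MimicError(APIError):
    """Stand-in: raised for any application level error."""


class MimicParseError(MimicError):
    """Stand-in: raised for MIMIC note parsing errors."""

    def __init__(self, text: str):
        # hypothetical message format
        super().__init__(f'could not parse: {text[:20]}')


class RecordNotFoundError(MimicError):
    """Stand-in: raised when a record is missing."""

    def __init__(self, actor: object, key_type: str, key: int):
        # hypothetical message format
        super().__init__(f'no {key_type} record with key {key}')


def lookup(row_id: int) -> str:
    # pretend database holding a single note
    notes = {1: 'ECG normal'}
    if row_id not in notes:
        raise RecordNotFoundError(None, 'row_id', row_id)
    return notes[row_id]


try:
    lookup(2)
    outcome = 'found'
except MimicError as e:
    # one handler covers both parse and lookup failures
    outcome = f'{type(e).__name__}: {e}'
```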

zensols.mimic.note#

EHR related text documents.

class zensols.mimic.note.DefaultNoteFactory(config_factory, category_to_note, mimic_default_note_section)[source]#

Bases: NoteFactory

A note factory that creates only default notes.

See:

NoteFactory.create_default()

__init__(config_factory, category_to_note, mimic_default_note_section)#
create(note_event)[source]#

Create a new factory based instance of a Note from a NoteEvent.

Parameters:

note_event (NoteEvent) – the source data

Return type:

Note

class zensols.mimic.note.GapSectionContainer(delegate, filter_empty)[source]#

Bases: SectionContainer

A container that fills in missing sections of text from a note with additional sections.

__init__(delegate, filter_empty)#
delegate: Note#

The note with the sections to be filled.

filter_empty: bool#

Whether to filter empty sections.

class zensols.mimic.note.Note(row_id, subject_id, hadm_id, chartdate, charttime, storetime, category, description, cgid, iserror, text, context)[source]#

Bases: NoteEvent, SectionContainer

A container of Section instances, one for each section of the note event’s text, available through the sections property.

__init__(row_id, subject_id, hadm_id, chartdate, charttime, storetime, category, description, cgid, iserror, text, context)#
property section_annotator_type: SectionAnnotatorType#

A human readable string describing who or what annotated the note.

write(depth=0, writer=<_io.TextIOWrapper name='<stdout>' mode='w' encoding='utf-8'>)[source]#

Write the note event.

Parameters:
  • line_limit – the number of lines to write from the note text

  • write_divider – whether to write a divider before the note text

  • indent_fields – whether to indent the fields of the note

  • note_indent – the number of indentation levels used for the note fields

write_fields(depth=0, writer=<_io.TextIOWrapper name='<stdout>' mode='w' encoding='utf-8'>)[source]#

Write note header fields such as the row_id and category.

write_full(depth=0, writer=<_io.TextIOWrapper name='<stdout>' mode='w' encoding='utf-8'>, note_line_limit=9223372036854775807, section_line_limit=9223372036854775807, section_sent_limit=9223372036854775807, include_section_header=True, sections=None, include_fields=True, include_note_divider=True, include_section_divider=True)[source]#

Write the custom parts of the note.

Parameters:
  • note_line_limit (int) – the number of lines to write from the note text

  • section_line_limit (int) – the number of lines of the section’s body and number of sentences to output

  • par_limit – the number of paragraphs to output

  • sections (Set[str]) – the sections, by name, to write

  • include_section_header (bool) – whether to include the header

  • include_fields (bool) – whether to write the note fields

  • include_note_divider (bool) – whether to write dividers between notes

  • include_section_divider (bool) – whether to write dividers between sections

class zensols.mimic.note.NoteFactory(config_factory, category_to_note, mimic_default_note_section)[source]#

Bases: Primeable

Creates an instance of Note from NoteEvent.

__init__(config_factory, category_to_note, mimic_default_note_section)#
category_to_note: Dict[str, str]#

A mapping from a note’s category to the section name of its Note configuration.

config_factory: ConfigFactory#

The factory used to create notes.

create(note_event)[source]#

Create a new factory based instance of a Note from a NoteEvent.

Parameters:

note_event (NoteEvent) – the source data

Return type:

Note

create_default(note_event)[source]#

Like create() but always create the default (Note) note.

Parameters:

note_event (NoteEvent) – the source data

Return type:

Note

Returns:

always an instance of Note

mimic_default_note_section: str#

The section name holding the configuration of the class to create when there is no mapping in category_to_note.
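How category_to_note and mimic_default_note_section work together can be sketched as a dictionary lookup with a fallback. The section names and the `resolve_note_section` helper below are hypothetical examples, not values from the package's resource library.

```python
from typing import Dict


def resolve_note_section(category: str,
                         category_to_note: Dict[str, str],
                         default_section: str) -> str:
    """Pick the configuration section used to create a note."""
    # fall back to the default section when the category has no mapping
    return category_to_note.get(category, default_section)


# hypothetical mapping and section names, for illustration only
mapping = {'Discharge summary': 'discharge_summary_note'}
mapped = resolve_note_section('Discharge summary', mapping, 'default_note')
fallback = resolve_note_section('Nursing', mapping, 'default_note')
```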

prime()[source]#

The MedSecId project primes by installing the model files.

class zensols.mimic.note.NoteFormat(value, names=None, *, module=None, qualname=None, type=None, start=1, boundary=None)[source]#

Bases: Enum

Used in Note.format() for a parameterized method to write a note.

property ext: str#
json = 5#
markdown = 7#
raw = 2#
summary = 4#
text = 1#
verbose = 3#
yaml = 6#
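The members listed above behave like any standard Python Enum. The following stand-in copies the names and values from this page so that it runs without the package installed; `NoteFormatSketch` is a hypothetical name and the `ext` property is left out because its values are not given here.

```python
from enum import Enum


class NoteFormatSketch(Enum):
    """Stand-in that mirrors the member names and values listed above."""
    text = 1
    raw = 2
    verbose = 3
    summary = 4
    json = 5
    yaml = 6
    markdown = 7


# look a member up by name, e.g. from a command line argument
by_name = NoteFormatSketch['json']
# or by its numeric value
by_value = NoteFormatSketch(7)
```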
class zensols.mimic.note.ParagraphFactory[source]#

Bases: object

Splits a document into constituent paragraphs.

__init__()#
abstract create(sec)[source]#
Return type:

Iterable[FeatureDocument]

class zensols.mimic.note.Section(id, name, container, header_spans, body_span)[source]#

Bases: PersistableContainer, Dictable

A segment with an identifier that represents one section of a Note. An example of a section is the history of present illness in a discharge note.

FILTER_ENUMS: ClassVar[bool] = True#

Whether to filter enumerated lists as separate sentences.

__init__(id, name, container, header_spans, body_span)#
property body: str#

The section text.

property body_doc: FeatureDocument#

A feature document of this section’s body text.

body_span: LexicalSpan#

Like header_spans but for the section body. The body and name do not intersect.

property body_tokens: Iterable[FeatureToken]#
clone()[source]#
Return type:

Section

container: SectionContainer#

The container that has this section.

property doc: FeatureDocument#

A feature document of the section’s body text.

header_spans: Tuple[LexicalSpan, ...]#

The character offsets of the section headers. The first is usually the name of the section. If there are no headers, this is a 0-length tuple.

static header_to_name(s)[source]#

Convert a section header text to a section name.

Return type:

str

property header_tokens: Iterable[FeatureToken]#
property headers: Tuple[str, ...]#

The section text.

id: int#

The unique ID of the section.

property is_empty: bool#

Whether the content of the section is empty.

property lexspan: LexicalSpan#

The widest lexical extent of the sections, including headers.
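The relationship between header_spans, body_span and lexspan can be sketched with plain character offsets. This is an illustration under assumptions: `Span` is a hypothetical stand-in for LexicalSpan, and it treats spans as half-open [begin, end) offsets, which may not match the real class.

```python
from typing import NamedTuple, Tuple


class Span(NamedTuple):
    """Hypothetical half-open character span [begin, end)."""
    begin: int
    end: int


def widest(spans: Tuple[Span, ...]) -> Span:
    """Return the smallest span that covers all given spans."""
    return Span(min(s.begin for s in spans), max(s.end for s in spans))


note_text = 'HOSPITAL COURSE:\nStable overnight.'
header = Span(0, 15)             # the header text before the colon
body = Span(17, len(note_text))  # the text after the colon and newline
lexspan = widest((header, body))

header_text = note_text[header.begin:header.end]
body_text = note_text[body.begin:body.end]
```

The header and body spans do not intersect, and the widest span covers both plus the characters between them.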

name: Optional[str]#

The name of the section (e.g. hospital-course). This field is what’s called the type in the paper, which is not used since type is a built-in name in Python.

static name_to_header(s)[source]#

Convert a section name to a section header text. Note that this uses a heuristic method that might generate a string that does not match the original header text.

Return type:

str
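Why a name-to-header conversion may not reproduce the original header can be seen in a small sketch. Both functions below are hypothetical stand-ins written for this example; they are not the library's header_to_name() and name_to_header() implementations, whose exact rules are not documented here.

```python
import re


def header_to_name_sketch(header: str) -> str:
    """Hypothetical: lowercase and join alphanumeric runs with hyphens."""
    return "-".join(re.findall(r"[a-z0-9]+", header.lower()))


def name_to_header_sketch(name: str) -> str:
    """Hypothetical: replace hyphens with spaces, capitalize the first letter."""
    return name.replace("-", " ").capitalize()


original = "HOSPITAL COURSE:"
name = header_to_name_sketch(original)
header = name_to_header_sketch(name)
```

Case and punctuation are discarded on the way to the name, so the regenerated header ("Hospital course") differs from the original text.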

property note_text: str#

The entire parent note’s text.

property paragraphs: Tuple[FeatureDocument, ...]#

The list of paragraphs, each as a feature document, of this section’s body text.

property text: str#

Get the entire text of the section, which includes the headers.

write(depth=0, writer=<_io.TextIOWrapper name='<stdout>' mode='w' encoding='utf-8'>, body_line_limit=9223372036854775807, norm_line_limit=9223372036854775807, par_limit=0, sent_limit=0, include_header=True, include_id_name=True, include_header_spans=False, include_body_span=False)[source]#

Write a note section’s name, original body, normalized body and sentences with respective sentence entities.

Parameters:
  • body_line_limit (int) – the number of lines of the section’s body to output

  • norm_line_limit (int) – the number of lines of the section’s normalized (parsed) body to output

  • par_limit (int) – the number of paragraphs to output

  • sent_limit (int) – the number of sentences to output

  • include_header (bool) – whether to include the header

  • include_id_name (bool) – whether to write the section ID and name

write_as_item(depth=0, writer=<_io.TextIOWrapper name='<stdout>' mode='w' encoding='utf-8'>)[source]#

A terse output designed for list iteration.

write_sentences(depth=0, writer=<_io.TextIOWrapper name='<stdout>' mode='w' encoding='utf-8'>, container=None, limit=0)[source]#

Write all parsed sentences of the section with respective entities.

class zensols.mimic.note.SectionAnnotatorType(value, names=None, *, module=None, qualname=None, type=None, start=1, boundary=None)[source]#

Bases: Enum

The type of Section annotator for Note instances. The MedSecId project adds the human and model annotator types.

See:

MedSecId

HUMAN = 3#

A MedSecId human annotator.

MODEL = 4#

Predictions are provided by a MedSecId model.

NONE = 1#

Default for those without section identifiers.

REGULAR_EXPRESSION = 2#

Sections are automatically assigned by regular expressions.

class zensols.mimic.note.SectionContainer[source]#

Bases: Dictable

A note-like container base class that has sections. Note-based classes extend this base class. Iterating over an instance produces the sections in order of their position in the document.

DEFAULT_SECTION_NAME: ClassVar[str] = 'default'#

The name of the singleton section used when the note is not sectioned.

__init__()#
static category_to_id(s)[source]#

Convert a category string (e.g. Discharge summary) to a category ID (e.g. discharge-summary).

Return type:

str

property feature_dataframe: DataFrame#

A dataframe useful for features used in an ML model.

static id_to_category(s)[source]#

Convert a category ID (e.g. discharge-summary) to a category string (e.g. Discharge summary).

Return type:

str

property section_dataframe: DataFrame#

A Pandas dataframe containing the section’s name, header and body offset spans.

property sections: Dict[int, Section]#

A map from the unique section identifier to a note section.

property sections_by_name: Dict[str, Tuple[Section, ...]]#

A map from the name of a section (e.g. history of present illness in discharge notes) to the note sections with that name.

property sections_ordered: Tuple[Section, ...]#

Sections returned in order as they appear in the note.
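The relationship between the three section views (by ID, by name, and in document order) can be sketched as follows. Sec is a hypothetical stand-in for Section, and its begin field is an assumed character offset used only to order the sections; the real properties may be computed differently.

```python
from collections import defaultdict
from dataclasses import dataclass
from typing import Dict, Tuple


@dataclass(frozen=True)
class Sec:
    """Hypothetical stand-in for Section with only the fields used here."""
    id: int
    name: str
    begin: int  # assumed character offset where the section starts


def by_name(sections: Dict[int, Sec]) -> Dict[str, Tuple[Sec, ...]]:
    """Group sections by name; a name may occur more than once in a note."""
    grouped = defaultdict(list)
    for sec in sections.values():
        grouped[sec.name].append(sec)
    return {name: tuple(secs) for name, secs in grouped.items()}


def ordered(sections: Dict[int, Sec]) -> Tuple[Sec, ...]:
    """Order sections by where they start in the note text."""
    return tuple(sorted(sections.values(), key=lambda s: s.begin))


# made-up sections keyed by their unique ID
sections = {
    2: Sec(2, "hospital-course", 400),
    0: Sec(0, "chief-complaint", 0),
    1: Sec(1, "hospital-course", 120),
}
names = by_name(sections)
in_order = ordered(sections)
```

Mapping each name to a tuple rather than a single section allows for notes in which the same section name appears more than once.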

write(depth=0, writer=<_io.TextIOWrapper name='<stdout>' mode='w' encoding='utf-8'>)[source]#

Write this instance as either a Writable or as a Dictable. If class attribute _DICTABLE_WRITABLE_DESCENDANTS is set as True, then use the write() method on children instead of writing the generated dictionary. Otherwise, write this instance by first creating a dict recursively using asdict(), then formatting the output.

If the attribute _DICTABLE_WRITE_EXCLUDES is set, those attributes are removed from what is written in the write() method.

Note that this attribute will need to be set in all descendants in the instance hierarchy since writing the object instance graph is done recursively.

Parameters:
  • depth (int) – the starting indentation depth

  • writer (TextIOBase) – the writer to dump the content of this writable

write_by_format(depth=0, writer=<_io.TextIOWrapper name='<stdout>' mode='w' encoding='utf-8'>, note_format=<enum 'NoteFormat'>)[source]#

Write the note in the specified format.

Parameters:
  • depth (int) – the starting indentation depth

  • writer (TextIOBase) – the writer to dump the content of this writable

  • note_format (NoteFormat) – the format to use for the output

write_fields(depth=0, writer=<_io.TextIOWrapper name='<stdout>' mode='w' encoding='utf-8'>)[source]#

Write note header fields such as the row_id and category.

write_full(depth=0, writer=<_io.TextIOWrapper name='<stdout>' mode='w' encoding='utf-8'>, note_line_limit=9223372036854775807, section_line_limit=9223372036854775807, section_sent_limit=9223372036854775807, include_section_header=True, sections=None, include_fields=True, include_note_divider=True, include_section_divider=True)[source]#

Write the custom parts of the note.

Parameters:
  • note_line_limit (int) – the number of lines to write from the note text

  • section_line_limit (int) – the number of lines of the section’s body and the number of sentences to output

  • par_limit – the number of paragraphs to output

  • sections (Set[str]) – the sections, by name, to write

  • include_section_header (bool) – whether to include the header

  • include_fields (bool) – whether to write the note fields

  • include_note_divider (bool) – whether to write dividers between notes

  • include_section_divider (bool) – whether to write dividers between sections

write_human(depth=0, writer=<_io.TextIOWrapper name='<stdout>' mode='w' encoding='utf-8'>, normalize=False)[source]#

Generates a human readable version of the annotation. This calls the following methods in order: write_fields() and write_sections().

Parameters:
  • depth (int) – the starting indentation depth

  • writer (TextIOBase) – the writer to dump the content of this writable

  • normalize (bool) – whether to use the paragraphs’ normalized text (zensols.nlp.TokenContainer.norm) or the original text

write_markdown(depth=0, writer=<_io.TextIOWrapper name='<stdout>' mode='w' encoding='utf-8'>, normalize=False)[source]#

Generates markdown version of the annotation.

Parameters:
  • depth (int) – the starting indentation depth

  • writer (TextIOBase) – the writer to dump the content of this writable

  • normalize (bool) – whether to use the paragraphs’ normalized text (zensols.nlp.TokenContainer.norm) or the original text

write_sections(depth=0, writer=<_io.TextIOWrapper name='<stdout>' mode='w' encoding='utf-8'>, normalize=False)[source]#

Writes the sections of the container.

Parameters:
  • depth (int) – the starting indentation depth

  • writer (TextIOBase) – the writer to dump the content of this writable

  • normalize (bool) – whether to use the paragraphs’ normalized text (zensols.nlp.TokenContainer.norm) or the original text

zensols.mimic.parafac#

Inheritance diagram of zensols.mimic.parafac

Paragraph factories.

class zensols.mimic.parafac.ChunkingParagraphFactory(min_sent_len, min_list_norm_matches, max_sent_list_len, include_section_headers, filter_sent_text)[source]#

Bases: ParagraphFactory

A paragraph factory that uses zensols.nlp.chunker chunking to split paragraphs and MIMIC lists.

MIMIC_SPAN_PATTERN: ClassVar[Pattern] = re.compile('(.+?)(?:(?=[\\n.]{2})|\\Z)', re.MULTILINE|re.DOTALL)#

A regular expression for MIMIC notes that treats periods, as well as newlines, as paragraph separators.
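The behavior of this pattern can be tried directly with Python's re module. The pattern below is copied from the documented value; the sample text and the stripping of surrounding whitespace are assumptions of this sketch, and how create() actually applies the pattern is not shown here.

```python
import re

# the pattern as documented for MIMIC_SPAN_PATTERN
MIMIC_SPAN_PATTERN = re.compile(
    r'(.+?)(?:(?=[\n.]{2})|\Z)', re.MULTILINE | re.DOTALL)

# made-up text with a blank line between two paragraphs
text = "Chief complaint\n\nShortness of breath"

# each match runs lazily up to the next two-character run of newlines
# and/or periods, or to the end of the string
raw_spans = [m.group(1) for m in MIMIC_SPAN_PATTERN.finditer(text)]

# assumption of this sketch: trim whitespace and drop empty spans
paragraphs = [s.strip() for s in raw_spans if s.strip()]
```

Note that the second raw span still begins with the separating newlines, which is why this sketch strips each span before use.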

__init__(min_sent_len, min_list_norm_matches, max_sent_list_len, include_section_headers, filter_sent_text)#
create(sec)[source]#
Return type:

Iterable[FeatureDocument]

filter_sent_text: Set[str]#

A set of sentence norm values to filter from replaced documents.

include_section_headers: bool#

Whether to include section headers in the output.

max_sent_list_len: int#

The maximum length a sentence can be to keep it chunked as a list. Otherwise, very long sentences are formed from what appears to be front list syntax.

min_list_norm_matches: int#

The minimum number of list matches needed to use the list item chunked version of the section.

min_sent_len: int#

Minimum sentence length in tokens to be kept.

class zensols.mimic.parafac.WhitespaceParagraphFactory[source]#

Bases: ParagraphFactory

A simple paragraph factory that splits on whitespace.

SEPARATOR_REGEX: ClassVar[Pattern] = re.compile('\\n[\\s.]*\\n')#
create(sec)[source]#
Return type:

Iterable[FeatureDocument]
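What SEPARATOR_REGEX matches can be checked with a plain re.split call. The pattern is the documented one; the sample text is made up, and splitting the raw string is only one plausible way to use the pattern, since the real create() method returns FeatureDocument instances.

```python
import re

# the pattern as documented for SEPARATOR_REGEX
SEPARATOR_REGEX = re.compile(r'\n[\s.]*\n')

# made-up text: a blank line and a line holding only a period act as separators
text = "Patient stable.\n\nNo acute distress\n.\nFollow up in two weeks."

# split on the separators and drop any empty pieces
paragraphs = [p for p in SEPARATOR_REGEX.split(text) if p]
```

The period that ends the first sentence is kept because the pattern must start at a newline; only whitespace and periods between two newlines are consumed.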

zensols.mimic.persist#

Inheritance diagram of zensols.mimic.persist

Persisters for the MIMIC-III database.

class zensols.mimic.persist.AdmissionPersister(conn_manager, sql_file=None, row_factory='tuple', select_name=None, select_by_id_name=None, select_exists_name=None, insert_name=None, update_name=None, delete_name=None, keys_name=None, count_name=None, bean_class=None)[source]#

Bases: DataClassDbPersister

Manages instances of Admission.

__init__(conn_manager, sql_file=None, row_factory='tuple', select_name=None, select_by_id_name=None, select_exists_name=None, insert_name=None, update_name=None, delete_name=None, keys_name=None, count_name=None, bean_class=None)#
get_admission_counts(limit=9223372036854775807)[source]#

Return the number of hospital admissions for each subject.

Parameters:

limit (int) – the limit on the return admission counts

Return type:

Tuple[Tuple[int, int], ...]

Returns:

a list of tuples, each in the form (subject_id, count)

get_by_hadm_id(hadm_id)[source]#

Return the admission by its hospital admission ID.

Return type:

Admission

get_by_subject_id(subject_id)[source]#

Get admissions by patient ID.

Return type:

Tuple[Admission, ...]

get_hadm_ids(subject_id)[source]#

Get all hospital admission IDs (hadm_id) for a patient.

Return type:

Iterable[int]

uniform_sample_hadm_ids(limit)[source]#

Return a sample from the uniform distribution of admission IDs.

Return type:

Iterable[int]

class zensols.mimic.persist.DiagnosisPersister(conn_manager, sql_file=None, row_factory='tuple', select_name=None, select_by_id_name=None, select_exists_name=None, insert_name=None, update_name=None, delete_name=None, keys_name=None, count_name=None, bean_class=None)[source]#

Bases: DataClassDbPersister

Manages instances of Diagnosis.

__init__(conn_manager, sql_file=None, row_factory='tuple', select_name=None, select_by_id_name=None, select_exists_name=None, insert_name=None, update_name=None, delete_name=None, keys_name=None, count_name=None, bean_class=None)#
get_by_hadm_id(hadm_id)[source]#

Get ICD-9 diagnosis codes by hospital admission ID.

Return type:

Diagnosis

get_heart_failure_hadm_ids()[source]#

Return hospital admission IDs that are heart failure related.

Return type:

Tuple[int, ...]

class zensols.mimic.persist.NoteDocumentStash(doc_parser=None, note_db_persister=None)[source]#

Bases: ReadOnlyStash

Reads noteevents from the database and returns parsed documents.

__init__(doc_parser=None, note_db_persister=None)#
doc_parser: FeatureDocumentParser = None#

NER+L medical domain natural language parser.

exists(name)[source]#

Return True if data with key name exists.

Implementation note: This Stash.exists() method is very inefficient and should be overridden.

Return type:

bool

keys()[source]#

Return an iterable of keys in the collection.

Return type:

Iterable[str]

load(row_id)[source]#

Load a data value from the pickled data with key name. Semantically, this method loads the data using the stash’s implementation. For example, DirectoryStash loads the data from a file if it exists, but factory type stashes will always re-generate the data.

See:

get()

Return type:

FeatureDocument

note_db_persister: DbPersister = None#

Fetches the note text by key from the DB.

class zensols.mimic.persist.NoteEventPersister(conn_manager, sql_file=None, row_factory='tuple', select_name=None, select_by_id_name=None, select_exists_name=None, insert_name=None, update_name=None, delete_name=None, keys_name=None, count_name=None, bean_class=None, mimic_note_context=None, hadm_row_chunk_size=None)[source]#

Bases: DataClassDbPersister

Manages instances of NoteEvent.

__init__(conn_manager, sql_file=None, row_factory='tuple', select_name=None, select_by_id_name=None, select_exists_name=None, insert_name=None, update_name=None, delete_name=None, keys_name=None, count_name=None, bean_class=None, mimic_note_context=None, hadm_row_chunk_size=None)#
property categories: Tuple[str, ...]#

All unique categories.

get_discharge_reports(limit=9223372036854775807)[source]#

Return discharge reports (as opposed to addendums).

Parameters:

limit (int) – the limit of notes to return

Return type:

Tuple[NoteEvent]

get_hadm_id(row_id)[source]#

Return the hospital admission for a note.

Parameters:

row_id (int) – the unique ID of the note event

Return type:

Optional[int]

Returns:

the hospital admission unique ID hadm_id if row_id is in the database

get_hadm_ids(row_ids)[source]#

Return the hospital admission IDs for a set of notes.

Parameters:

row_ids – the unique IDs of the note events

Return type:

Iterable[int]

Returns:

the hospital admission unique IDs (hadm_id)

get_hadm_ids_all()[source]#

Get all hospital admission IDs that have at least one associated note.

Return type:

Iterable[int]

get_note_count(hadm_id)[source]#

Return the count of notes for a hospital admission.

Parameters:

hadm_id (int) – the hospital admission ID

Return type:

int

get_note_counts()[source]#

Return the count of notes for all hospital admissions.

Return type:

Tuple[int, ...]

get_note_counts_by_subject_id(subject_id)[source]#

Get counts of notes related to a subject.

Parameters:

subject_id (int) – the patient’s ID

Return type:

Tuple[Tuple[int, int], ...]

Returns:

tuple of (hadm_id, count) pairs for a subject

get_notes_by_category(category, limit=9223372036854775807)[source]#

Return notes by the category to which they belong.

Parameters:
  • category (str) – the category of the note (e.g. Radiology)

  • limit (int) – the limit of notes to return

Return type:

Tuple[NoteEvent, ...]

get_notes_by_hadm_id(hadm_id)[source]#

Return notes by hospital admission ID.

Parameters:

hadm_id (int) – the hospital admission ID

Return type:

Tuple[NoteEvent, ...]

get_row_ids_by_category(hadm_id, categories)[source]#
Return type:

Dict[str, List[int]]

get_row_ids_by_hadm_id(hadm_id)[source]#

Return all note row IDs for an admission ID.

Return type:

Tuple[int, ...]

get_row_ids_with_admissions()[source]#

Get note IDs associated with at least one admission.

Return type:

Iterable[int]

hadm_row_chunk_size: int = None#

The number of note IDs for each round trip to the DB in get_hadm_ids().
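The idea of sending a bounded number of IDs per database round trip can be sketched with a generic chunking helper. The function name chunked and the sample IDs are hypothetical; this is not the persister's own code, only an illustration of what a chunk size controls.

```python
from itertools import islice
from typing import Iterable, Iterator, List


def chunked(ids: Iterable[int], size: int) -> Iterator[List[int]]:
    """Yield successive lists of at most ``size`` IDs."""
    if size < 1:
        raise ValueError("size must be at least 1")
    it = iter(ids)
    while True:
        chunk = list(islice(it, size))
        if not chunk:
            return
        yield chunk


# made-up row IDs; each chunk would correspond to one database query
batches = list(chunked(range(1, 8), 3))
```

With seven IDs and a chunk size of three, the sketch produces three batches, the last holding the single remaining ID.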

mimic_note_context: Settings = None#

Contains resources needed by new and re-hydrated notes, such as the document stash.

class zensols.mimic.persist.PatientPersister(conn_manager, sql_file=None, row_factory='tuple', select_name=None, select_by_id_name=None, select_exists_name=None, insert_name=None, update_name=None, delete_name=None, keys_name=None, count_name=None, bean_class=None)[source]#

Bases: DataClassDbPersister

Manages instances of Patient.

__init__(conn_manager, sql_file=None, row_factory='tuple', select_name=None, select_by_id_name=None, select_exists_name=None, insert_name=None, update_name=None, delete_name=None, keys_name=None, count_name=None, bean_class=None)#
get_by_subject_id(subject_id)[source]#
Return type:

Patient

class zensols.mimic.persist.ProcedurePersister(conn_manager, sql_file=None, row_factory='tuple', select_name=None, select_by_id_name=None, select_exists_name=None, insert_name=None, update_name=None, delete_name=None, keys_name=None, count_name=None, bean_class=None)[source]#

Bases: DataClassDbPersister

Manages instances of Procedure.

__init__(conn_manager, sql_file=None, row_factory='tuple', select_name=None, select_by_id_name=None, select_exists_name=None, insert_name=None, update_name=None, delete_name=None, keys_name=None, count_name=None, bean_class=None)#
get_by_hadm_id(hadm_id)[source]#
Return type:

Procedure

zensols.mimic.regexnote#

Inheritance diagram of zensols.mimic.regexnote

Regular expression note parsing.

class zensols.mimic.regexnote.ConsultNote(row_id, subject_id, hadm_id, chartdate, charttime, storetime, category, description, cgid, iserror, text, context)[source]#

Bases: RegexNote

Contains sections for the discharge summary. There should be only one of these per hospital admission.

CATEGORY: ClassVar[str] = 'Consult'#
__init__(row_id, subject_id, hadm_id, chartdate, charttime, storetime, category, description, cgid, iserror, text, context)#
class zensols.mimic.regexnote.DischargeSummaryNote(row_id, subject_id, hadm_id, chartdate, charttime, storetime, category, description, cgid, iserror, text, context)[source]#

Bases: RegexNote

Contains sections for the discharge summary. There should be only one of these per hospital admission.

CATEGORY: ClassVar[str] = 'Discharge summary'#
__init__(row_id, subject_id, hadm_id, chartdate, charttime, storetime, category, description, cgid, iserror, text, context)#
class zensols.mimic.regexnote.EchoNote(row_id, subject_id, hadm_id, chartdate, charttime, storetime, category, description, cgid, iserror, text, context)[source]#

Bases: RegexNote

CATEGORY: ClassVar[str] = 'Echo'#
__init__(row_id, subject_id, hadm_id, chartdate, charttime, storetime, category, description, cgid, iserror, text, context)#
class zensols.mimic.regexnote.NursingOtherNote(row_id, subject_id, hadm_id, chartdate, charttime, storetime, category, description, cgid, iserror, text, context)[source]#

Bases: RegexNote

CATEGORY: ClassVar[str] = 'Nursing/other'#
__init__(row_id, subject_id, hadm_id, chartdate, charttime, storetime, category, description, cgid, iserror, text, context)#
class zensols.mimic.regexnote.PhysicianNote(row_id, subject_id, hadm_id, chartdate, charttime, storetime, category, description, cgid, iserror, text, context)[source]#

Bases: RegexNote

CATEGORY: ClassVar[str] = 'Physician'#
__init__(row_id, subject_id, hadm_id, chartdate, charttime, storetime, category, description, cgid, iserror, text, context)#
class zensols.mimic.regexnote.RadiologyNote(row_id, subject_id, hadm_id, chartdate, charttime, storetime, category, description, cgid, iserror, text, context)[source]#

Bases: RegexNote

CATEGORY: ClassVar[str] = 'Radiology'#
__init__(row_id, subject_id, hadm_id, chartdate, charttime, storetime, category, description, cgid, iserror, text, context)#
class zensols.mimic.regexnote.RegexNote(row_id, subject_id, hadm_id, chartdate, charttime, storetime, category, description, cgid, iserror, text, context)[source]#

Bases: Note

Base class used to collect subclass regular expressions captures and create sections from them.

__init__(row_id, subject_id, hadm_id, chartdate, charttime, storetime, category, description, cgid, iserror, text, context)#

zensols.mimic.tokenizer#

Inheritance diagram of zensols.mimic.tokenizer

Modify the spaCy parser configuration to deal with the MIMIC-III dataset.

class zensols.mimic.tokenizer.MimicTokenDecorator(token_entities=((re.compile('^First Name'), 'FIRSTNAME', 'PERSON'), (re.compile('^Last Name'), 'LASTNAME', 'PERSON'), (re.compile('^21\\\\d{2}-\\\\d{1,2}-\\\\d{1,2}$'), 'DATE', 'DATE')), token_replacements=())[source]#

Bases: FeatureTokenDecorator

Contains the MIMIC-III regular expressions and other patterns to annotate and normalize feature tokens. The class finds mask tokens and separators (such as a long string of dashes or asterisks).

Attribute onto_mapping is a mapping from the MIMIC symbol in token_entities (2nd value in tuple) to the OntoNotes 5 symbol, which is used as the NER symbol in spaCy.

MASK_REGEX: ClassVar[Pattern] = re.compile('\\[\\*\\*([^\\*]+)\\*\\*\\]')#

Matches mask tokens.

MASK_TOKEN_FEATURE: ClassVar[str] = 'mask'#

The value given to the TOKEN_FEATURE_ID feature for mask tokens (e.g. [**First Name**]).

ONTO_FEATURE_ID: ClassVar[str] = 'onto_'#

The feature ID to use for the OntoNotes 5 symbol (onto_mapping).

SEPARATOR_TOKEN_FEATURE: ClassVar[str] = 'separator'#

The value name of separators defined by SEP_REGEX.

SEP_REGEX: ClassVar[Pattern] = re.compile('(_{5,}|[*]{5,}|[-]{5,})')#

Matches text-based separators such as a long string of dashes.
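The separator pattern can be exercised on its own to see what counts as a separator. The pattern is the documented SEP_REGEX value; the helper find_separators and the sample strings are hypothetical and only illustrate the match rule of five or more identical underscore, asterisk or dash characters.

```python
import re
from typing import List

# the pattern as documented for SEP_REGEX
SEP_REGEX = re.compile(r'(_{5,}|[*]{5,}|[-]{5,})')


def find_separators(text: str) -> List[str]:
    """Return every separator run found in ``text`` (hypothetical helper)."""
    return SEP_REGEX.findall(text)


long_dashes = find_separators("Plan\n----------\nDischarge home")
too_short = find_separators("a -- b *** c")
mixed = find_separators("_____ *****")
```

Runs shorter than five characters, such as the double dash and triple asterisk above, are not treated as separators.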

TOKEN_FEATURE_ID: ClassVar[str] = 'mimic_'#

The feature ID to use for MIMIC-III tokens.

UNKNOWN_ENTITY: ClassVar[str] = '<UNKNOWN>'#

The mask normalized token form for unknown MIMIC entity text (e.g. First Name).

__init__(token_entities=((re.compile('^First Name'), 'FIRSTNAME', 'PERSON'), (re.compile('^Last Name'), 'LASTNAME', 'PERSON'), (re.compile('^21\\\\d{2}-\\\\d{1,2}-\\\\d{1,2}$'), 'DATE', 'DATE')), token_replacements=())#
decorate(token)[source]#
token_entities: Tuple[Tuple[Union[Pattern, str]], str, Optional[str]] = ((re.compile('^First Name'), 'FIRSTNAME', 'PERSON'), (re.compile('^Last Name'), 'LASTNAME', 'PERSON'), (re.compile('^21\\d{2}-\\d{1,2}-\\d{1,2}$'), 'DATE', 'DATE'))#

A list of pseudo token patterns and a string to replace with the respective match.

token_replacements: Tuple[Tuple[Union[Pattern, str], str]] = ()#

A list of token text to be replaced as the normalized token text.
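How the mask pattern and the default token_entities patterns relate can be sketched with the re module alone. The patterns are copied from the documented defaults; the function classify_masks, the sample sentence, and the fallback to UNKNOWN_ENTITY when nothing matches are assumptions of this sketch rather than a description of decorate().

```python
import re
from typing import List, Tuple

# patterns copied from the documented MASK_REGEX and token_entities defaults
MASK_REGEX = re.compile(r'\[\*\*([^\*]+)\*\*\]')
TOKEN_ENTITIES = (
    (re.compile(r'^First Name'), 'FIRSTNAME', 'PERSON'),
    (re.compile(r'^Last Name'), 'LASTNAME', 'PERSON'),
    (re.compile(r'^21\d{2}-\d{1,2}-\d{1,2}$'), 'DATE', 'DATE'),
)
UNKNOWN_ENTITY = '<UNKNOWN>'


def classify_masks(text: str) -> List[Tuple[str, str]]:
    """Pair each mask's inner text with the first matching MIMIC symbol.

    Falling back to UNKNOWN_ENTITY is an assumption of this sketch.
    """
    pairs = []
    for inner in MASK_REGEX.findall(text):
        symbol = UNKNOWN_ENTITY
        for pattern, mimic_symbol, _onto_symbol in TOKEN_ENTITIES:
            if pattern.match(inner):
                symbol = mimic_symbol
                break
        pairs.append((inner, symbol))
    return pairs


# made-up sentence using the de-identification mask syntax
text = ("Seen by [**First Name**] [**Last Name**] "
        "on [**2150-3-14**] at [**Hospital1**].")
pairs = classify_masks(text)
```

The third element of each token_entities tuple is the OntoNotes symbol, which this sketch ignores.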

class zensols.mimic.tokenizer.MimicTokenizerComponent(name, pipe_name=None, pipe_config=None, pipe_add_kwargs=<factory>, modules=(), initializers=())[source]#

Bases: Component

Modifies the spaCy tokenizer to split on colons (:) to capture more MIMIC-III mask tokens.

__init__(name, pipe_name=None, pipe_config=None, pipe_add_kwargs=<factory>, modules=(), initializers=())#
init(model)[source]#

Initialize the component and add it to the NLP pipeline. This base class implementation loads the module, then calls Language.add_pipe().

Parameters:

model (Language) – the spaCy model (nlp in their parlance) to which the component is added

Module contents#