zensols.mimic package#
Submodules#
zensols.mimic.adm#
Hospital admission/stay details.
- class zensols.mimic.adm.HospitalAdmission(admission, patient, diagnoses, procedures)[source]#
Bases:
PersistableContainer
,Dictable
Represents data collected for a patient over the course of their hospital admission. Note: this object keys notes by the
row_id
IDs used in the MIMIC dataset, as integers and not as strings like some note stashes.
- __init__(admission, patient, diagnoses, procedures)#
- property feature_dataframe: DataFrame#
The feature dataframe for the hospital admission as the constituent note feature dataframes.
- get_duplicate_notes(text_start=None)[source]#
Notes with the same note text, each in their respective set.
- get_non_duplicate_notes(dup_sets, filter_fn=None)[source]#
Return non-duplicated notes.
- Parameters:
dup_sets (
Tuple
[Set
[str
]]) – the duplicate sets generated from get_duplicate_notes()
filter_fn – if provided, it is used to filter duplicates; if everything is filtered, a note from the respective duplicate set is chosen at random
- Return type:
- Returns:
a tuple of
(<note>, <is duplicate>)
pairs
- See:
duplicate_notes
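The duplicate handling described above can be sketched without the library. This is a minimal, standalone illustration and not the package's implementation: the helper names `find_duplicate_sets` and `non_duplicates` are hypothetical, notes are reduced to an ID-to-text mapping, and picking the first surviving candidate per duplicate set is an assumption.

```python
import random
from collections import defaultdict
from typing import Callable, Dict, List, Optional, Set, Tuple


def find_duplicate_sets(notes: Dict[str, str]) -> Tuple[Set[str], ...]:
    """Group note IDs that share identical text; keep groups larger than one."""
    by_text: Dict[str, Set[str]] = defaultdict(set)
    for note_id, text in notes.items():
        by_text[text].add(note_id)
    return tuple(ids for ids in by_text.values() if len(ids) > 1)


def non_duplicates(notes: Dict[str, str],
                   dup_sets: Tuple[Set[str], ...],
                   filter_fn: Optional[Callable[[str], bool]] = None,
                   rng: Optional[random.Random] = None) -> List[Tuple[str, bool]]:
    """Return ``(note_id, is_duplicate)`` pairs with one entry per duplicate set."""
    rng = rng or random.Random()
    duplicated = set().union(*dup_sets)
    # notes that never appear in a duplicate set are passed through unchanged
    result = [(nid, False) for nid in notes if nid not in duplicated]
    for dup_set in dup_sets:
        candidates = sorted(dup_set)
        if filter_fn is not None:
            kept = [nid for nid in candidates if filter_fn(nid)]
            # if the filter removed everything, fall back to a random member
            candidates = kept if kept else [rng.choice(candidates)]
        result.append((candidates[0], True))
    return result


notes = {"1": "same text", "2": "same text", "3": "other text"}
dups = find_duplicate_sets(notes)
pairs = non_duplicates(notes, dups, filter_fn=lambda nid: nid == "2")
```

Here `pairs` keeps note `3` as a non-duplicate and note `2` as the chosen representative of the duplicated pair.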
- property notes_by_category: Dict[str, Tuple[Note, ...]]#
All notes by
Note.category
as keys, with the respective notes as values.
- write(depth=0, writer=<_io.TextIOWrapper name='<stdout>' mode='w' encoding='utf-8'>, include_admission=False, include_patient=False, include_diagnoses=False, include_procedures=False, **note_kwargs)[source]#
Write the admission and the notes of the admission.
- Parameters:
note_kwargs – the keyword arguments given to
Note.write_full()
- write_full(depth=0, writer=<_io.TextIOWrapper name='<stdout>' mode='w' encoding='utf-8'>, **kwargs)[source]#
Write a verbose output of the admission.
- Parameters:
kwargs – the keyword arguments given to write()
- write_notes(depth=0, writer=<_io.TextIOWrapper name='<stdout>' mode='w' encoding='utf-8'>, note_limit=9223372036854775807, categories=None, include_note_id=False, **note_kwargs)[source]#
Write the notes of the admission.
- Parameters:
note_limit (
int
) – the number of notes to write
include_note_id (
bool
) – whether to include the note identification info
note_kwargs – the keyword arguments given to
Note.write_full()
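The default `note_limit=9223372036854775807` is `2**63 - 1`, which is the value of `sys.maxsize` on 64-bit CPython builds, so in practice it means "no limit". A minimal sketch of how such a default can cap an iteration; `first_notes` is a hypothetical helper and not part of the package:

```python
import sys
from itertools import islice
from typing import Iterable, List


def first_notes(notes: Iterable[str], note_limit: int = sys.maxsize) -> List[str]:
    """Return at most ``note_limit`` notes; the default effectively returns all."""
    return list(islice(notes, note_limit))


all_notes = first_notes(["n1", "n2", "n3"])
two_notes = first_notes(["n1", "n2", "n3"], note_limit=2)
```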
- class zensols.mimic.adm.HospitalAdmissionDbFactoryStash(delegate, factory, enable_preemptive=True, dump_factory_nones=True, doc_stash=None, mimic_note_context=None)[source]#
Bases:
FactoryStash
,Primeable
A factory stash that configures
NoteEvent
instances so they can parse the MIMIC-III English text as FeatureDocument
instances.
- __init__(delegate, factory, enable_preemptive=True, dump_factory_nones=True, doc_stash=None, mimic_note_context=None)#
- clear()[source]#
Delete all data from the stash.
Important: Exercise caution with this method, of course.
- load(hadm_id)[source]#
Load a data value from the pickled data with key
name
. Semantically, this method loads the data using the stash’s implementation. For example, DirectoryStash
loads the data from a file if it exists, but factory type stashes will always re-generate the data.
- See:
get()
- Return type:
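The delegate-plus-factory idea behind a factory stash can be illustrated with a small standalone class. This is only a sketch of the general pattern under stated assumptions: `SketchFactoryStash` is hypothetical, a plain `dict` stands in for the delegate stash, and the real `FactoryStash` in `zensols.persist` may behave differently.

```python
from typing import Any, Callable, Dict


class SketchFactoryStash:
    """Look up a key in a delegate first; create and store it on a miss."""

    def __init__(self, delegate: Dict[str, Any], factory: Callable[[str], Any]):
        self.delegate = delegate      # stands in for a backing (caching) stash
        self.factory = factory        # creates the item when it is not cached
        self.factory_calls = 0

    def load(self, key: str) -> Any:
        item = self.delegate.get(key)
        if item is None:
            # cache miss: generate the data and remember it in the delegate
            self.factory_calls += 1
            item = self.factory(key)
            self.delegate[key] = item
        return item

    def clear(self) -> None:
        """Drop all cached data so later loads regenerate it."""
        self.delegate.clear()


stash = SketchFactoryStash({}, lambda hadm_id: f"admission-{hadm_id}")
first = stash.load("1234")
second = stash.load("1234")   # served from the delegate, no factory call
```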
- class zensols.mimic.adm.HospitalAdmissionDbStash(config_factory, mimic_note_factory, admission_persister, diagnosis_persister, patient_persister, procedure_persister, note_event_persister, note_stash, hospital_adm_name)[source]#
Bases:
ReadOnlyStash
,Primeable
A stash that creates
HospitalAdmission
instances. This instance is used by caching stashes per the default resource library configuration for this package.
- __init__(config_factory, mimic_note_factory, admission_persister, diagnosis_persister, patient_persister, procedure_persister, note_event_persister, note_stash, hospital_adm_name)#
-
admission_persister:
AdmissionPersister
# The persister for the
admissions
table.
-
config_factory:
ConfigFactory
# The factory used to create domain objects (i.e. hospital admission).
-
diagnosis_persister:
DiagnosisPersister
# The persister for the
diagnosis
table.
- exists(hadm_id)[source]#
Return
True
if data with key name
exists.
Implementation note: This
Stash.exists()
method is very inefficient and should be overridden.
- Return type:
-
hospital_adm_name:
str
# The configuration section name of the
HospitalAdmission
used to load instances.
- load(hadm_id)[source]#
Create a complete picture of a hospital stay with admission, patient and notes data.
- Parameters:
hadm_id (
str
) – the ID that specifies the hospital admission to create
- Return type:
-
mimic_note_factory:
NoteFactory
# The factory that creates
Note
for hospital admissions.
-
note_event_persister:
NoteEventPersister
# The persister for the
noteevents
table.
-
patient_persister:
PatientPersister
# The persister for the
patients
table.
-
procedure_persister:
ProcedurePersister
# The persister for the
procedure
table.
- class zensols.mimic.adm.NoteDocumentPreemptiveStash(delegate, config, name, chunk_size=0, workers=1, processor_class=<class 'zensols.multi.stash.PoolMultiProcessor'>, note_event_persister=None, adm_factory_stash=None)[source]#
Bases:
MultiProcessDefaultStash
Contains the stash that preemptively creates
Admission
, Note
and FeatureDocument
cache files. This class is not useful for returning any data (see HospitalAdmissionDbFactoryStash).
- __init__(delegate, config, name, chunk_size=0, workers=1, processor_class=<class 'zensols.multi.stash.PoolMultiProcessor'>, note_event_persister=None, adm_factory_stash=None)#
-
adm_factory_stash:
HospitalAdmissionDbFactoryStash
= None# The factory to create the admission instances.
-
note_event_persister:
NoteEventPersister
= None# The persister for the
noteevents
table.
- prime()[source]#
If the delegate stash data does not exist, use this implementation to generate the data and process in children processes.
zensols.mimic.app#
A utility library for parsing the MIMIC-III corpus
- class zensols.mimic.app.Application(config_factory, doc_parser, corpus, preempt_stash)[source]#
Bases:
object
A utility library for parsing the MIMIC-III corpus
- __init__(config_factory, doc_parser, corpus, preempt_stash)#
-
config_factory:
ConfigFactory
# Used to get temporary resources
-
doc_parser:
FeatureDocumentParser
# Used to parse command line documents.
- preempt_notes(input_file, workers=None)[source]#
Preemptively document parse notes across multiple threads.
-
preempt_stash:
NoteDocumentPreemptiveStash
# A multi-processing stash used to preemptively parse notes.
- show(sent)[source]#
Parse a sentence and print all features for each token.
- Parameters:
sent (
str
) – the sentence to parse and generate features
- uniform_sample_hadm_ids(limit=1)[source]#
Print a uniform random sample of admission hadm_ids.
- Parameters:
limit (
int
) – the number to fetch
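A uniform random sample without replacement can be sketched with the standard library. This is an in-memory illustration only: `uniform_sample` is a hypothetical helper, the IDs below are made up, and the package itself may well draw the sample in the database instead.

```python
import random
from typing import Iterable, List, Optional


def uniform_sample(hadm_ids: Iterable[str], limit: int = 1,
                   seed: Optional[int] = None) -> List[str]:
    """Return up to ``limit`` distinct IDs, each equally likely to be chosen."""
    ids = list(hadm_ids)
    rng = random.Random(seed)
    # never ask for more items than exist, which would raise ValueError
    return rng.sample(ids, min(limit, len(ids)))


example_ids = ["100001", "100003", "100006", "100007"]  # illustrative values
sample = uniform_sample(example_ids, limit=2, seed=0)
```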
- write_admission(hadm_id, out_dir=PosixPath('.'), output_format=NoteFormat.text)[source]#
Write all the notes of an admission.
- Parameters:
hadm_id (
str
) – the hospital admission ID, or - for a random ID
out_dir (
Path
) – the output directory
output_format (
NoteFormat
) – the output format of the note
- write_admission_summary(hadm_id)[source]#
Write an admission note categories and section names.
- Parameters:
hadm_id (
str
) – the hospital admission ID, or - for a random ID
- write_discharge_reports(limit=1, out_dir=PosixPath('.'))[source]#
Write discharge reports (as opposed to addendums).
- write_features(sent, out_file=None)[source]#
Parse a sentence as MIMIC data and write features to CSV.
- write_hadm_id_for_note(row_id)[source]#
Get the hospital admission ID (
hadm_id
) that has note row_id
.
- write_note(row_id, out_file=None, output_format=NoteFormat.text)[source]#
Write a note.
- Parameters:
row_id (
int
) – the unique note identifier in the NOTEEVENTS table
output_format (
NoteFormat
) – the output format of the note
out_file (
Path
) – the file to write
zensols.mimic.cli#
Command line entry point to the application.
- class zensols.mimic.cli.ApplicationFactory(*args, **kwargs)[source]#
Bases:
ApplicationFactory
zensols.mimic.corpus#
Discharge summary research and Mimic III data exploration.
- class zensols.mimic.corpus.Corpus(config_factory, patient_persister, admission_persister, diagnosis_persister, note_event_persister, hospital_adm_stash, temporary_results_dir)[source]#
Bases:
Dictable
A container class providing access to the MIMIC-III dataset using a relational database (by default PostgreSQL per the resource library configuration). It also has methods to dump corpus statistics.
- See:
- __init__(config_factory, patient_persister, admission_persister, diagnosis_persister, note_event_persister, hospital_adm_stash, temporary_results_dir)#
-
admission_persister:
AdmissionPersister
# The persister for the
admissions
table.
- clear(include_notes=True)[source]#
Clear all cached admission and note parses.
- Parameters:
include_notes (
bool
) – whether to also clear the parsed notes cache
-
config_factory:
ConfigFactory
# Used to clear the note event cache.
-
diagnosis_persister:
DiagnosisPersister
# The persister for the
diagnosis
table.
- get_hospital_adm_by_id(hadm_id)[source]#
Return a hospital admission by its unique identifier.
- Return type:
- get_hospital_adm_for_note(row_id)[source]#
Return an admission that has note
row_id
.
- Raise:
RecordNotFoundError if
row_id
is not found in the database
- Return type:
- get_note_by_id(row_id)[source]#
Return the note (via the hospital admission) for
row_id
.
- Raise:
RecordNotFoundError if
row_id
is not found in the database
- Return type:
-
hospital_adm_stash:
HospitalAdmissionDbStash
# Creates hospital admission instances. Note that this might be a caching stash instance, but method calls are delegated through to the instance of
HospitalAdmissionDbStash
.
-
note_event_persister:
NoteEventPersister
# The persister for the
noteevents
table.
-
patient_persister:
PatientPersister
# The persister for the
patients
table.
-
temporary_results_dir:
Path
# The path to create the output results. This is not used, but needs to stay until the next
zensols.mimicsid
is retrained.
- write(depth=0, writer=<_io.TextIOWrapper name='<stdout>' mode='w' encoding='utf-8'>)[source]#
Write this instance as either a Writable or as a Dictable. If class attribute _DICTABLE_WRITABLE_DESCENDANTS is set as True, then use the write() method on children instead of writing the generated dictionary. Otherwise, write this instance by first creating a dict recursively using asdict(), then formatting the output.
If the attribute _DICTABLE_WRITE_EXCLUDES is set, those attributes are removed from what is written in the write() method.
Note that this attribute will need to be set in all descendants in the instance hierarchy since writing the object instance graph is done recursively.
- Parameters:
depth (
int
) – the starting indentation depth
writer (
TextIOBase
) – the writer to dump the content of this writable
- write_hospital_admission(hadm_id, depth=0, writer=<_io.TextIOWrapper name='<stdout>' mode='w' encoding='utf-8'>, note_line_limit=9223372036854775807)[source]#
Write the hospital admission identified by
hadm_id
.
- write_hosptial_count_admission(depth=0, writer=<_io.TextIOWrapper name='<stdout>' mode='w' encoding='utf-8'>, limit=9223372036854775807)[source]#
Write the counts for each hospital admission.
- Parameters:
limit (
int
) – the limit on the returned admission counts
- See:
AdmissionPersister.get_admission_counts()
zensols.mimic.domain#
Domain classes for the corpus notes.
- class zensols.mimic.domain.Admission(row_id, subject_id, hadm_id, admittime, dischtime, deathtime, admission_type, admission_location, discharge_location, insurance, language, religion, marital_status, ethnicity, edregtime, edouttime, diagnosis, hospital_expire_flag, has_chartevents_data)[source]#
Bases:
MimicContainer
The ADMISSIONS table gives information regarding a patient’s admission to the hospital. Since each unique hospital visit for a patient is assigned a unique HADM_ID, the ADMISSIONS table can be considered as a definition table for HADM_ID. Information available includes timing information for admission and discharge, demographic information, the source of the admission, and so on.
Table source: Hospital database.
Table purpose: Define a patient’s hospital admission, HADM_ID.
Number of rows: 58976
- Links to:
PATIENTS on SUBJECT_ID
- See:
- __init__(row_id, subject_id, hadm_id, admittime, dischtime, deathtime, admission_type, admission_location, discharge_location, insurance, language, religion, marital_status, ethnicity, edregtime, edouttime, diagnosis, hospital_expire_flag, has_chartevents_data)#
-
diagnosis:
str
# The DIAGNOSIS column provides a preliminary, free text diagnosis for the patient on hospital admission. The diagnosis is usually assigned by the admitting clinician and does not use a systematic ontology. As of MIMIC-III v1.0 there were 15,693 distinct diagnoses for 58,976 admissions. The diagnoses can be very informative (e.g. chronic kidney failure) or quite vague (e.g. weakness). Final diagnoses for a patient’s hospital stay are coded on discharge and can be found in the DIAGNOSES_ICD table. While this field can provide information about the status of a patient on hospital admission, it is not recommended to use it to stratify patients.
-
edregtime:
datetime
# Time that the patient was registered and discharged from the emergency department.
-
has_chartevents_data:
int
# Hospital admission has at least one observation in the CHARTEVENTS table.
-
hospital_expire_flag:
int
# This indicates whether the patient died within the given hospitalization. 1 indicates death in the hospital, and 0 indicates survival to hospital discharge.
-
insurance:
str
# The INSURANCE, LANGUAGE, RELIGION, MARITAL_STATUS, ETHNICITY columns describe patient demographics. These columns occur in the ADMISSIONS table as they are originally sourced from the admission, discharge, and transfers (ADT) data from the hospital database. The values occasionally change between hospital admissions (HADM_ID) for a single patient (SUBJECT_ID). This is reasonable for some fields (e.g. MARITAL_STATUS, RELIGION), but less reasonable for others (e.g. ETHNICITY).
- class zensols.mimic.domain.Diagnosis(row_id, icd9_code, short_title, long_title)[source]#
Bases:
ICD9Container
Table source: Hospital database.
Table purpose: Contains ICD diagnoses for patients, most notably ICD-9 diagnoses.
Number of rows: 651,047
Links to:
PATIENTS on SUBJECT_ID
ADMISSIONS on HADM_ID
D_ICD_DIAGNOSES on ICD9_CODE
- __init__(row_id, icd9_code, short_title, long_title)#
- class zensols.mimic.domain.HospitalAdmissionContainer(row_id, hadm_id)[source]#
Bases:
MimicContainer
Any data container that has a unique identifier with an (inpatient) non-null identifier.
- __init__(row_id, hadm_id)#
- class zensols.mimic.domain.ICD9Container(row_id, icd9_code, short_title, long_title)[source]#
Bases:
MimicContainer
A data container that has ICD-9 codes.
- __init__(row_id, icd9_code, short_title, long_title)#
- class zensols.mimic.domain.MimicContainer(row_id)[source]#
Bases:
PersistableContainer
,Dictable
Abstract base class for data containers, which are plain old Python objects that are CRUD’d from DAO persisters.
- __init__(row_id)#
- write(depth=0, writer=<_io.TextIOWrapper name='<stdout>' mode='w' encoding='utf-8'>, dct=None)[source]#
Write this instance as either a
Writable
or as a Dictable. If class attribute _DICTABLE_WRITABLE_DESCENDANTS is set as True, then use the write() method on children instead of writing the generated dictionary. Otherwise, write this instance by first creating a dict recursively using asdict(), then formatting the output.
If the attribute _DICTABLE_WRITE_EXCLUDES is set, those attributes are removed from what is written in the write() method.
Note that this attribute will need to be set in all descendants in the instance hierarchy since writing the object instance graph is done recursively.
- Parameters:
depth (
int
) – the starting indentation depth
writer (
TextIOBase
) – the writer to dump the content of this writable
- exception zensols.mimic.domain.MimicError[source]#
Bases:
APIError
Raised for any application level error.
- __module__ = 'zensols.mimic.domain'#
- exception zensols.mimic.domain.MimicParseError(text)[source]#
Bases:
MimicError
Raised for MIMIC note parsing errors.
- __annotations__ = {}#
- __module__ = 'zensols.mimic.domain'#
- class zensols.mimic.domain.NoteEvent(row_id, subject_id, hadm_id, chartdate, charttime, storetime, category, description, cgid, iserror, text, context)[source]#
Bases:
MimicContainer
Table source: Hospital database.
Table purpose: Contains all notes for patients.
Number of rows: 2,083,180
- Links to:
PATIENTS on SUBJECT_ID
ADMISSIONS on HADM_ID
CAREGIVERS on CGID
- See:
- __init__(row_id, subject_id, hadm_id, chartdate, charttime, storetime, category, description, cgid, iserror, text, context)#
-
category:
str
# Category of the note, e.g. Discharge summary.
CATEGORY and DESCRIPTION define the type of note recorded. For example, a CATEGORY of ‘Discharge summary’ indicates that the note is a discharge summary, and the DESCRIPTION of ‘Report’ indicates a full report while a DESCRIPTION of ‘Addendum’ indicates an addendum (additional text to be added to the previous report).
-
chartdate:
datetime
# Date when the note was charted.
CHARTDATE records the date at which the note was charted. CHARTDATE will always have a time value of 00:00:00.
CHARTTIME records the date and time at which the note was charted. If both CHARTDATE and CHARTTIME exist, then the date portions will be identical. All records have a CHARTDATE. A subset are missing CHARTTIME. More specifically, notes with a CATEGORY value of ‘Discharge Summary’, ‘ECG’, and ‘Echo’ never have a CHARTTIME, only CHARTDATE. Other categories almost always have both CHARTTIME and CHARTDATE, but there is a small amount of missing data for CHARTTIME (usually less than 0.5% of the total number of notes for that category).
STORETIME records the date and time at which a note was saved into the system. Notes with a CATEGORY value of ‘Discharge Summary’, ‘ECG’, ‘Radiology’, and ‘Echo’ never have a STORETIME. All other notes have a STORETIME.
-
charttime:
datetime
# Date and time when the note was charted. Note that some notes (e.g. discharge summaries) do not have a time associated with them: these notes have NULL in this column.
- See:
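Because some categories never have a CHARTTIME, code that orders notes usually needs a fallback to CHARTDATE. A minimal sketch of that fallback, assuming a missing CHARTTIME is represented as `None`; `note_timestamp` is a hypothetical helper, not part of the package:

```python
from datetime import datetime
from typing import Optional


def note_timestamp(chartdate: datetime, charttime: Optional[datetime]) -> datetime:
    """Prefer the full chart time; fall back to the date-only chart date."""
    return charttime if charttime is not None else chartdate


chartdate = datetime(2150, 3, 1)                      # time part is 00:00:00
with_time = note_timestamp(chartdate, datetime(2150, 3, 1, 14, 30))
date_only = note_timestamp(chartdate, None)           # e.g. a discharge summary
```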
-
context:
InitVar
# Contains resources needed by new and re-hydrated notes, such as the document stash.
- property doc: FeatureDocument#
The parsed document of the
name
of the section.
- get_normal_name(include_desc=True)[source]#
A normalized name of the note useful as a file name (sans extension).
-
subject_id:
int
# Foreign key. Identifies the patient.
Identifiers which specify the patient: SUBJECT_ID is unique to a patient and HADM_ID is unique to a patient hospital stay.
- See:
hadm_id
- class zensols.mimic.domain.Patient(row_id, subject_id, gender, dob, dod, dod_hosp, dod_ssn, expire_flag)[source]#
Bases:
MimicContainer
Table source: CareVue and Metavision ICU databases.
Table purpose: Defines each SUBJECT_ID in the database, i.e. defines a single patient.
Number of rows: 46,520
Links to:
ADMISSIONS on SUBJECT_ID
ICUSTAYS on SUBJECT_ID
- __init__(row_id, subject_id, gender, dob, dod, dod_hosp, dod_ssn, expire_flag)#
- class zensols.mimic.domain.Procedure(row_id, icd9_code, short_title, long_title)[source]#
Bases:
ICD9Container
Table source: Hospital database.
Table purpose: Contains ICD procedures for patients, most notably ICD-9 procedures.
Number of rows: 240,095
Links to:
PATIENTS on SUBJECT_ID
ADMISSIONS on HADM_ID
D_ICD_PROCEDURES on ICD9_CODE
- __init__(row_id, icd9_code, short_title, long_title)#
zensols.mimic.note#
EHR related text documents.
- class zensols.mimic.note.DefaultNoteFactory(config_factory, category_to_note, mimic_default_note_section)[source]#
Bases:
NoteFactory
A note factory that creates only default notes.
- __init__(config_factory, category_to_note, mimic_default_note_section)#
- class zensols.mimic.note.GapSectionContainer(delegate, filter_empty)[source]#
Bases:
SectionContainer
A container that fills in missing sections of text from a note with additional sections.
- __init__(delegate, filter_empty)#
- class zensols.mimic.note.Note(row_id, subject_id, hadm_id, chartdate, charttime, storetime, category, description, cgid, iserror, text, context)[source]#
Bases:
NoteEvent
,SectionContainer
A container class of
Section
for each section of the text in the note events given by the property sections
.
- __init__(row_id, subject_id, hadm_id, chartdate, charttime, storetime, category, description, cgid, iserror, text, context)#
- property section_annotator_type: SectionAnnotatorType#
A human readable string describing who or what annotated the note.
- write(depth=0, writer=<_io.TextIOWrapper name='<stdout>' mode='w' encoding='utf-8'>)[source]#
Write the note event.
- Parameters:
line_limit – the number of lines to write from the note text
write_divider – whether to write a divider before the note text
indent_fields – whether to indent the fields of the note
note_indent – how many indentation levels to indent the note fields
- write_fields(depth=0, writer=<_io.TextIOWrapper name='<stdout>' mode='w' encoding='utf-8'>)[source]#
Write note header fields such as the
row_id
and category
.
- write_full(depth=0, writer=<_io.TextIOWrapper name='<stdout>' mode='w' encoding='utf-8'>, note_line_limit=9223372036854775807, section_line_limit=9223372036854775807, section_sent_limit=9223372036854775807, include_section_header=True, sections=None, include_fields=True, include_note_divider=True, include_section_divider=True)[source]#
Write the custom parts of the note.
- Parameters:
note_line_limit (
int
) – the number of lines to write from the note text
section_line_limit (
int
) – the number of lines of the section’s body and number of sentences to output
par_limit – the number of paragraphs to output
include_section_header (
bool
) – whether to include the header
include_fields (
bool
) – whether to write the note fields
include_note_divider (
bool
) – whether to write dividers between notes
include_section_divider (
bool
) – whether to write dividers between sections
- class zensols.mimic.note.NoteFactory(config_factory, category_to_note, mimic_default_note_section)[source]#
Bases:
Primeable
Creates an instance of
Note
from NoteEvent
.
- __init__(config_factory, category_to_note, mimic_default_note_section)#
-
category_to_note:
Dict
[str
,str
]# A mapping between notes’ category to section name for Note configuration.
-
config_factory:
ConfigFactory
# The factory used to create notes.
-
mimic_default_note_section:
str
# The section name holding the configuration of the class to create when there is no mapping in
category_to_note
.
- class zensols.mimic.note.NoteFormat(value, names=None, *, module=None, qualname=None, type=None, start=1, boundary=None)[source]#
Bases:
Enum
Used in
Note.format()
for a parameterized method to write a note.
- json = 5#
- markdown = 7#
- raw = 2#
- summary = 4#
- text = 1#
- verbose = 3#
- yaml = 6#
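A format name given as a string (for example from a command line) can be mapped to an enumeration member by name. The sketch below mirrors the member names and values listed above in a standalone `NoteFormatSketch` class so it runs without the package; the `parse_format` helper is hypothetical.

```python
from enum import Enum


class NoteFormatSketch(Enum):
    """Standalone mirror of the members listed for NoteFormat."""
    text = 1
    raw = 2
    verbose = 3
    summary = 4
    json = 5
    yaml = 6
    markdown = 7


def parse_format(name: str) -> NoteFormatSketch:
    """Look up a member by name and give a readable error for unknown names."""
    try:
        return NoteFormatSketch[name]
    except KeyError:
        valid = ", ".join(member.name for member in NoteFormatSketch)
        raise ValueError(f"unknown format {name!r}; expected one of: {valid}") from None


chosen = parse_format("markdown")
```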
- class zensols.mimic.note.ParagraphFactory[source]#
Bases:
object
Splits a document into constituent paragraphs.
- __init__()#
- class zensols.mimic.note.Section(id, name, container, header_spans, body_span)[source]#
Bases:
PersistableContainer
,Dictable
A section segment with an identifier that represents a section of a
Note
, one for each section. An example of a section is the history of present illness in a discharge note.
- __init__(id, name, container, header_spans, body_span)#
- property body_doc: FeatureDocument#
A feature document of this section’s body text.
-
body_span:
LexicalSpan
# Like
header_spans
but for the section body. The body and name do not intersect.
- property body_tokens: Iterable[FeatureToken]#
-
container:
SectionContainer
# The container that has this section.
- property doc: FeatureDocument#
A feature document of the section’s body text.
-
header_spans:
Tuple
[LexicalSpan
,...
]# The character offsets of the section headers. The first is usually the
name
of the section. If there are no headers, this is a 0-length tuple.
- property header_tokens: Iterable[FeatureToken]#
- property lexspan: LexicalSpan#
The widest lexical extent of the sections, including headers.
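The relationship between header spans, the body span and the widest extent can be sketched with plain character offsets. This is an illustration under stated assumptions: spans are modeled as half-open `(begin, end)` tuples rather than `LexicalSpan` instances, and `widest_extent` is a hypothetical helper.

```python
from typing import Tuple

Span = Tuple[int, int]  # (begin, end) character offsets, end exclusive (assumed)


def widest_extent(header_spans: Tuple[Span, ...], body_span: Span) -> Span:
    """Return the smallest span that covers every header and the body."""
    spans = (*header_spans, body_span)
    return (min(s[0] for s in spans), max(s[1] for s in spans))


text = "HISTORY OF PRESENT ILLNESS:\nPatient presented with chest pain."
header: Span = (0, text.index(":"))
body: Span = (text.index("\n") + 1, len(text))
extent = widest_extent((header,), body)
header_text = text[header[0]:header[1]]
body_text = text[body[0]:body[1]]
```

With no headers, `widest_extent((), body)` simply returns the body span, matching the 0-length tuple case.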
-
name:
Optional
[str
]# The name of the section (e.g.
hospital-course
). This field is what’s called the type
in the paper, which is not used since type
is a built-in name in Python.
- static name_to_header(s)[source]#
Convert a section name to a section header text. Note that this uses a heuristic method that might generate a string that does not match the original header text.
- Return type:
- property paragraphs: Tuple[FeatureDocument, ...]#
The list of paragraphs, each as a feature document, of this section’s body text.
- write(depth=0, writer=<_io.TextIOWrapper name='<stdout>' mode='w' encoding='utf-8'>, body_line_limit=9223372036854775807, norm_line_limit=9223372036854775807, par_limit=0, sent_limit=0, include_header=True, include_id_name=True, include_header_spans=False, include_body_span=False)[source]#
Write a note section’s name, original body, normalized body and sentences with respective sentence entities.
- Parameters:
body_line_limit (
int
) – the number of lines of the section’s body to output
norm_line_limit (
int
) – the number of lines of the section’s normalized (parsed) body to output
par_limit (
int
) – the number of paragraphs to output
sent_limit (
int
) – the number of sentences to output
include_header (
bool
) – whether to include the header
include_id_name (
bool
) – whether to write the section ID and name
- class zensols.mimic.note.SectionAnnotatorType(value, names=None, *, module=None, qualname=None, type=None, start=1, boundary=None)[source]#
Bases:
Enum
The type of
Section
annotator for Note
instances. The MedSecId project adds the human
and model
:
- See:
- NONE = 1#
Default for those without section identifiers.
- REGULAR_EXPRESSION = 2#
Sections are automatically assigned by regular expressions.
- class zensols.mimic.note.SectionContainer[source]#
Bases:
Dictable
A note like container base class that has sections. Note based classes extend this base class. Sections in order of their position in the document are produced when using this class as an iterable.
-
DEFAULT_SECTION_NAME:
ClassVar
[str
] = 'default'# The name of the singleton section when the note is not sectioned.
- __init__()#
- static category_to_id(s)[source]#
Convert a category string (e.g.
Discharge summary
) to a category ID (e.g. discharge-summary
).
- Return type:
- static id_to_category(s)[source]#
Convert a category ID (e.g.
discharge-summary
) to a category string (e.g. Discharge summary
).
- Return type:
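The two conversions can be sketched from the single example given above (`Discharge summary` and `discharge-summary`). The rules used here, lowercasing with spaces turned into hyphens and the reverse with only the first letter capitalized, are assumptions inferred from that example; the package's own static methods may differ.

```python
def category_to_id(s: str) -> str:
    """Sketch: 'Discharge summary' -> 'discharge-summary'."""
    return s.lower().replace(" ", "-")


def id_to_category(s: str) -> str:
    """Sketch: 'discharge-summary' -> 'Discharge summary'."""
    return s.replace("-", " ").capitalize()


cat_id = category_to_id("Discharge summary")
category = id_to_category(cat_id)
```

Under these assumed rules an all-caps category such as `ECG` would not round-trip (it would come back as `Ecg`), so a real implementation likely needs extra handling for such cases.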
- property section_dataframe: DataFrame#
A Pandas dataframe containing the section’s name, header and body offset spans.
- property sections_by_name: Dict[str, Tuple[Section, ...]]#
A map from the name of a section (i.e. history of present illness in discharge notes) to a note section.
- property sections_ordered: Tuple[Section, ...]#
Sections returned in order as they appear in the note.
- write(depth=0, writer=<_io.TextIOWrapper name='<stdout>' mode='w' encoding='utf-8'>)[source]#
Write this instance as either a Writable or as a Dictable. If class attribute _DICTABLE_WRITABLE_DESCENDANTS is set as True, then use the write() method on children instead of writing the generated dictionary. Otherwise, write this instance by first creating a dict recursively using asdict(), then formatting the output.
If the attribute _DICTABLE_WRITE_EXCLUDES is set, those attributes are removed from what is written in the write() method.
Note that this attribute will need to be set in all descendants in the instance hierarchy since writing the object instance graph is done recursively.
- Parameters:
depth (
int
) – the starting indentation depth
writer (
TextIOBase
) – the writer to dump the content of this writable
- write_by_format(depth=0, writer=<_io.TextIOWrapper name='<stdout>' mode='w' encoding='utf-8'>, note_format=<enum 'NoteFormat'>)[source]#
Write the note in the specified format.
- Parameters:
depth (
int
) – the starting indentation depth
writer (
TextIOBase
) – the writer to dump the content of this writable
note_format (
NoteFormat
) – the format to use for the output
- write_fields(depth=0, writer=<_io.TextIOWrapper name='<stdout>' mode='w' encoding='utf-8'>)[source]#
Write note header fields such as the
row_id
and category
.
- write_full(depth=0, writer=<_io.TextIOWrapper name='<stdout>' mode='w' encoding='utf-8'>, note_line_limit=9223372036854775807, section_line_limit=9223372036854775807, section_sent_limit=9223372036854775807, include_section_header=True, sections=None, include_fields=True, include_note_divider=True, include_section_divider=True)[source]#
Write the custom parts of the note.
- Parameters:
note_line_limit (
int
) – the number of lines to write from the note text
section_line_limit (
int
) – the number of lines of the section’s body and number of sentences to output
par_limit – the number of paragraphs to output
include_section_header (
bool
) – whether to include the header
include_fields (
bool
) – whether to write the note fields
include_note_divider (
bool
) – whether to write dividers between notes
include_section_divider (
bool
) – whether to write dividers between sections
- write_human(depth=0, writer=<_io.TextIOWrapper name='<stdout>' mode='w' encoding='utf-8'>, normalize=False)[source]#
Generates a human readable version of the annotation. This calls the following methods in order:
write_fields()
and write_sections()
.
- Parameters:
depth (
int
) – the starting indentation depth
writer (
TextIOBase
) – the writer to dump the content of this writable
normalize (
bool
) – whether to use the paragraphs’ normalized (zensols.nlp.TokenContainer.norm) or text
- write_markdown(depth=0, writer=<_io.TextIOWrapper name='<stdout>' mode='w' encoding='utf-8'>, normalize=False)[source]#
Generates markdown version of the annotation.
- Parameters:
depth (
int
) – the starting indentation depth
writer (
TextIOBase
) – the writer to dump the content of this writable
normalize (
bool
) – whether to use the paragraphs’ normalized (zensols.nlp.TokenContainer.norm) or text
- write_sections(depth=0, writer=<_io.TextIOWrapper name='<stdout>' mode='w' encoding='utf-8'>, normalize=False)[source]#
Writes the sections of the container.
- Parameters:
depth (
int
) – the starting indentation depth
writer (
TextIOBase
) – the writer to dump the content of this writable
normalize (
bool
) – whether to use the paragraphs’ normalized (zensols.nlp.TokenContainer.norm) or text
zensols.mimic.parafac#
Paragraph factories.
- class zensols.mimic.parafac.ChunkingParagraphFactory(min_sent_len, min_list_norm_matches, max_sent_list_len, include_section_headers, filter_sent_text)[source]#
Bases:
ParagraphFactory
A paragraph factory that uses
zensols.nlp.chunker
chunking to split paragraphs and MIMIC lists.
-
MIMIC_SPAN_PATTERN:
ClassVar
[Pattern
] = re.compile('(.+?)(?:(?=[\\n.]{2})|\\Z)', re.MULTILINE|re.DOTALL)# MIMIC regular expression adds period, which is used in notes to separate paragraphs.
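To see what the pattern shown above matches, it can be compiled and run with the standard `re` module on its own. This sketch only exercises the regular expression as printed; how `ChunkingParagraphFactory` consumes the resulting spans is not shown in this section, and the sample text is made up.

```python
import re

# Same expression and flags as printed for MIMIC_SPAN_PATTERN above.
MIMIC_SPAN_PATTERN = re.compile(
    r'(.+?)(?:(?=[\n.]{2})|\Z)', re.MULTILINE | re.DOTALL)

text = "First para line.\n\nSecond para."

# Each match lazily extends until the next two characters are both a newline
# or a period (in any combination), or until the end of the text is reached.
spans = [m.group(1) for m in MIMIC_SPAN_PATTERN.finditer(text)]
```

The first span stops before the period and blank line that end the first paragraph, and the matches taken together cover the whole input.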
- __init__(min_sent_len, min_list_norm_matches, max_sent_list_len, include_section_headers, filter_sent_text)#
-
max_sent_list_len:
int
# The maximum length a sentence can be to keep it chunked as a list. Otherwise very long sentences form from what appear to be front list syntax.
- class zensols.mimic.parafac.WhitespaceParagraphFactory[source]#
Bases:
ParagraphFactory
A simple paragraph factory that splits on whitespace.
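A whitespace-based paragraph split can be sketched in a few lines. This is an assumption about what "splits on whitespace" means here, namely that runs of blank lines separate paragraphs; `split_paragraphs` is a hypothetical helper that returns strings, whereas the factory itself works with feature documents.

```python
import re
from typing import List


def split_paragraphs(text: str) -> List[str]:
    """Split on one or more blank lines and drop empty paragraphs."""
    parts = re.split(r"\n\s*\n", text)
    return [p.strip() for p in parts if p.strip()]


note_text = "Chief complaint:\nchest pain\n\n\nHistory:\nnone reported\n"
paragraphs = split_paragraphs(note_text)
```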
zensols.mimic.persist#
Persisters for the MIMIC-III database.
- class zensols.mimic.persist.AdmissionPersister(conn_manager, sql_file=None, row_factory='tuple', select_name=None, select_by_id_name=None, select_exists_name=None, insert_name=None, update_name=None, delete_name=None, keys_name=None, count_name=None, bean_class=None)[source]#
Bases:
DataClassDbPersister
Manages instances of Admission.
- __init__(conn_manager, sql_file=None, row_factory='tuple', select_name=None, select_by_id_name=None, select_exists_name=None, insert_name=None, update_name=None, delete_name=None, keys_name=None, count_name=None, bean_class=None)#
- get_admission_counts(limit=9223372036854775807)[source]#
Return the counts of subjects for each hospital admission.
- class zensols.mimic.persist.DiagnosisPersister(conn_manager, sql_file=None, row_factory='tuple', select_name=None, select_by_id_name=None, select_exists_name=None, insert_name=None, update_name=None, delete_name=None, keys_name=None, count_name=None, bean_class=None)[source]#
Bases:
DataClassDbPersister
Manages instances of Diagnosis.
- __init__(conn_manager, sql_file=None, row_factory='tuple', select_name=None, select_by_id_name=None, select_exists_name=None, insert_name=None, update_name=None, delete_name=None, keys_name=None, count_name=None, bean_class=None)#
- class zensols.mimic.persist.NoteDocumentStash(doc_parser=None, note_db_persister=None)[source]#
Bases:
ReadOnlyStash
Reads noteevents from the database and returns parsed documents.
- __init__(doc_parser=None, note_db_persister=None)#
- doc_parser: FeatureDocumentParser = None#
NER+L medical domain natural language parser.
- exists(name)[source]#
Return True if data with key name exists.
Implementation note: This Stash.exists() method is very inefficient and should be overridden.
- Return type:
- load(row_id)[source]#
Load a data value from the pickled data with key name. Semantically, this method loads the data using the stash’s implementation. For example, DirectoryStash loads the data from a file if it exists, but factory type stashes will always re-generate the data.
- See:
get()
- Return type:
- note_db_persister: DbPersister = None#
Fetches the note text by key from the DB.
- class zensols.mimic.persist.NoteEventPersister(conn_manager, sql_file=None, row_factory='tuple', select_name=None, select_by_id_name=None, select_exists_name=None, insert_name=None, update_name=None, delete_name=None, keys_name=None, count_name=None, bean_class=None, mimic_note_context=None, hadm_row_chunk_size=None)[source]#
Bases:
DataClassDbPersister
Manages instances of NoteEvent.
- __init__(conn_manager, sql_file=None, row_factory='tuple', select_name=None, select_by_id_name=None, select_exists_name=None, insert_name=None, update_name=None, delete_name=None, keys_name=None, count_name=None, bean_class=None, mimic_note_context=None, hadm_row_chunk_size=None)#
- get_discharge_reports(limit=9223372036854775807)[source]#
Return discharge reports (as opposed to addendums).
- get_notes_by_category(category, limit=9223372036854775807)[source]#
Return notes by the category to which they belong.
- hadm_row_chunk_size: int = None#
The number of note IDs for each round trip to the DB in get_hadm_ids().
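The idea of a per-round-trip chunk size can be sketched generically as follows. This is a minimal illustration of batching IDs, assuming each chunk would become one database query; the chunk_ids helper is hypothetical, and how NoteEventPersister actually batches its queries is not shown in this reference.

```python
from itertools import islice
from typing import Iterable, Iterator, Tuple


def chunk_ids(ids: Iterable[int], chunk_size: int) -> Iterator[Tuple[int, ...]]:
    """Yield successive tuples of at most ``chunk_size`` IDs.

    Hypothetical helper for illustration only; not part of zensols.mimic.
    """
    if chunk_size < 1:
        raise ValueError('chunk_size must be a positive integer')
    it = iter(ids)
    while True:
        # Pull the next batch of IDs; an empty batch means we are done.
        chunk = tuple(islice(it, chunk_size))
        if not chunk:
            return
        yield chunk


if __name__ == '__main__':
    # Made-up note IDs; each printed tuple stands for one round trip.
    for chunk in chunk_ids([11, 12, 13, 14, 15], chunk_size=2):
        print(chunk)
```

A larger chunk size means fewer round trips at the cost of larger individual queries.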
- class zensols.mimic.persist.PatientPersister(conn_manager, sql_file=None, row_factory='tuple', select_name=None, select_by_id_name=None, select_exists_name=None, insert_name=None, update_name=None, delete_name=None, keys_name=None, count_name=None, bean_class=None)[source]#
Bases:
DataClassDbPersister
Manages instances of Patient.
- __init__(conn_manager, sql_file=None, row_factory='tuple', select_name=None, select_by_id_name=None, select_exists_name=None, insert_name=None, update_name=None, delete_name=None, keys_name=None, count_name=None, bean_class=None)#
- class zensols.mimic.persist.ProcedurePersister(conn_manager, sql_file=None, row_factory='tuple', select_name=None, select_by_id_name=None, select_exists_name=None, insert_name=None, update_name=None, delete_name=None, keys_name=None, count_name=None, bean_class=None)[source]#
Bases:
DataClassDbPersister
Manages instances of Procedure.
- __init__(conn_manager, sql_file=None, row_factory='tuple', select_name=None, select_by_id_name=None, select_exists_name=None, insert_name=None, update_name=None, delete_name=None, keys_name=None, count_name=None, bean_class=None)#
zensols.mimic.regexnote#
Regular expression note parsing.
- class zensols.mimic.regexnote.ConsultNote(row_id, subject_id, hadm_id, chartdate, charttime, storetime, category, description, cgid, iserror, text, context)[source]#
Bases:
RegexNote
Contains sections for the discharge summary. There should be only one of these per hospital admission.
- __init__(row_id, subject_id, hadm_id, chartdate, charttime, storetime, category, description, cgid, iserror, text, context)#
- class zensols.mimic.regexnote.DischargeSummaryNote(row_id, subject_id, hadm_id, chartdate, charttime, storetime, category, description, cgid, iserror, text, context)[source]#
Bases:
RegexNote
Contains sections for the discharge summary. There should be only one of these per hospital admission.
- __init__(row_id, subject_id, hadm_id, chartdate, charttime, storetime, category, description, cgid, iserror, text, context)#
- class zensols.mimic.regexnote.EchoNote(row_id, subject_id, hadm_id, chartdate, charttime, storetime, category, description, cgid, iserror, text, context)[source]#
Bases:
RegexNote
- __init__(row_id, subject_id, hadm_id, chartdate, charttime, storetime, category, description, cgid, iserror, text, context)#
- class zensols.mimic.regexnote.NursingOtherNote(row_id, subject_id, hadm_id, chartdate, charttime, storetime, category, description, cgid, iserror, text, context)[source]#
Bases:
RegexNote
- __init__(row_id, subject_id, hadm_id, chartdate, charttime, storetime, category, description, cgid, iserror, text, context)#
- class zensols.mimic.regexnote.PhysicianNote(row_id, subject_id, hadm_id, chartdate, charttime, storetime, category, description, cgid, iserror, text, context)[source]#
Bases:
RegexNote
- __init__(row_id, subject_id, hadm_id, chartdate, charttime, storetime, category, description, cgid, iserror, text, context)#
- class zensols.mimic.regexnote.RadiologyNote(row_id, subject_id, hadm_id, chartdate, charttime, storetime, category, description, cgid, iserror, text, context)[source]#
Bases:
RegexNote
- __init__(row_id, subject_id, hadm_id, chartdate, charttime, storetime, category, description, cgid, iserror, text, context)#
- class zensols.mimic.regexnote.RegexNote(row_id, subject_id, hadm_id, chartdate, charttime, storetime, category, description, cgid, iserror, text, context)[source]#
Bases:
Note
Base class used to collect subclass regular expressions captures and create sections from them.
- __init__(row_id, subject_id, hadm_id, chartdate, charttime, storetime, category, description, cgid, iserror, text, context)#
zensols.mimic.tokenizer#
Modify the spaCy parser configuration to deal with the MIMIC-III dataset.
- class zensols.mimic.tokenizer.MimicTokenDecorator(token_entities=((re.compile('^First Name'), 'FIRSTNAME', 'PERSON'), (re.compile('^Last Name'), 'LASTNAME', 'PERSON'), (re.compile('^21\\\\d{2}-\\\\d{1,2}-\\\\d{1,2}$'), 'DATE', 'DATE')), token_replacements=())[source]#
Bases:
FeatureTokenDecorator
Contains the MIMIC-III regular expressions and other patterns to annotate and normalize feature tokens. The class finds mask tokens and separators (such as a long string of dashes or asterisks).
Attribute onto_mapping is a mapping from the MIMIC symbol in token_entities (2nd value in tuple) to Onto Notes 5, which is used as the NER symbol in spaCy.
- MASK_TOKEN_FEATURE: ClassVar[str] = 'mask'#
The value given from entity TOKEN_FEATURE_ID for mask tokens (e.g. [**First Name**]).
- ONTO_FEATURE_ID: ClassVar[str] = 'onto_'#
The feature ID to use for the Onto Notes 5 (onto_mapping).
- SEPARATOR_TOKEN_FEATURE: ClassVar[str] = 'separator'#
The value name of separators defined by SEP_REGEX.
- SEP_REGEX: ClassVar[Pattern] = re.compile('(_{5,}|[*]{5,}|[-]{5,})')#
Matches text-based separators such as a long string of dashes.
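To make the separator pattern concrete, here is a minimal sketch that tests token text against the same expression shown for SEP_REGEX. The is_separator helper is hypothetical, and requiring the whole token to match (fullmatch) is an assumption for this example; the decorator itself may apply the pattern differently.

```python
import re

# Same pattern as documented for ``SEP_REGEX``: five or more underscores,
# asterisks, or dashes in a row.
SEP_REGEX = re.compile(r'(_{5,}|[*]{5,}|[-]{5,})')


def is_separator(token_text: str) -> bool:
    """Return whether the entire token is a run of one separator character.

    Hypothetical helper for illustration only; not part of zensols.mimic.
    """
    return SEP_REGEX.fullmatch(token_text) is not None


if __name__ == '__main__':
    for text in ('----------', '*****', '----', 'aspirin'):
        print(repr(text), is_separator(text))
```

Runs shorter than five characters, and runs that mix separator characters, do not match.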
- UNKNOWN_ENTITY: ClassVar[str] = '<UNKNOWN>'#
The mask normalized token form for unknown MIMIC entity text (i.e. First Name).
- __init__(token_entities=((re.compile('^First Name'), 'FIRSTNAME', 'PERSON'), (re.compile('^Last Name'), 'LASTNAME', 'PERSON'), (re.compile('^21\\\\d{2}-\\\\d{1,2}-\\\\d{1,2}$'), 'DATE', 'DATE')), token_replacements=())#
- token_entities: Tuple[Tuple[Union[Pattern, str]], str, Optional[str]] = ((re.compile('^First Name'), 'FIRSTNAME', 'PERSON'), (re.compile('^Last Name'), 'LASTNAME', 'PERSON'), (re.compile('^21\\d{2}-\\d{1,2}-\\d{1,2}$'), 'DATE', 'DATE'))#
A list of pseudo token patterns and a string to replace with the respective match.
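The default token_entities tuples can be read as (pattern, MIMIC symbol, Onto Notes symbol). The sketch below shows one way such a table might be applied to the text of a mask token. It is illustrative only: MASK_REGEX and classify_mask are hypothetical names, and falling back to the UNKNOWN_ENTITY value when no pattern matches is an assumption, not a description of how MimicTokenDecorator is implemented.

```python
import re
from typing import Optional, Tuple

# Default entries as documented for ``token_entities``.
TOKEN_ENTITIES = (
    (re.compile(r'^First Name'), 'FIRSTNAME', 'PERSON'),
    (re.compile(r'^Last Name'), 'LASTNAME', 'PERSON'),
    (re.compile(r'^21\d{2}-\d{1,2}-\d{1,2}$'), 'DATE', 'DATE'),
)
# Value documented for ``UNKNOWN_ENTITY``.
UNKNOWN_ENTITY = '<UNKNOWN>'
# Hypothetical pattern for the ``[**...**]`` mask syntax used in MIMIC notes.
MASK_REGEX = re.compile(r'\[\*\*(.+?)\*\*\]')


def classify_mask(token_text: str) -> Optional[Tuple[str, Optional[str]]]:
    """Return ``(mimic_symbol, onto_symbol)`` for a mask token, else ``None``.

    Hypothetical helper for illustration only; not part of zensols.mimic.
    """
    m = MASK_REGEX.fullmatch(token_text)
    if m is None:
        return None  # not a mask token at all
    inner = m.group(1)
    for pattern, mimic_symbol, onto_symbol in TOKEN_ENTITIES:
        if pattern.match(inner):
            return mimic_symbol, onto_symbol
    # Assumption: masks that match no pattern map to the unknown entity.
    return UNKNOWN_ENTITY, None


if __name__ == '__main__':
    for text in ('[**First Name8 (NamePattern2) **]', '[**2151-7-16**]',
                 '[**Hospital1 18**]', 'aspirin'):
        print(repr(text), classify_mask(text))
```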
- class zensols.mimic.tokenizer.MimicTokenizerComponent(name, pipe_name=None, pipe_config=None, pipe_add_kwargs=<factory>, modules=(), initializers=())[source]#
Bases:
Component
Modifies the spaCy tokenizer to split on colons (:) to capture more MIMIC-III mask tokens.
- __init__(name, pipe_name=None, pipe_config=None, pipe_add_kwargs=<factory>, modules=(), initializers=())#