MIMIC III Corpus Parsing¶
A utility library for parsing the MIMIC-III corpus. This uses spaCy and extends the zensols.mednlp to parse the MIMIC-III medical note dataset. Features include:
Creates both natural language and medical features from medical notes. The latter is generated using linked entity concepts parsed with MedCAT via zensols.mednlp.
Modifies the spaCy tokenizer to chunk masked tokens. For example,
[
,**
,First
,Name
**
]
becomes[**First Name**]
.Provides a clean Pythonic object oriented representation of MIMIC-III admissions and medical notes.
Interfaces MIMIC-III data as a relational database (either PostgreSQL or SQLite).
Paragraph chunking using the most common syntax/physician templates provided in the MIMIC-III dataset.
Documentation¶
See the full documentation. The API reference is also available.
Obtaining¶
The easiest way to install the command line program is via the pip
installer:
pip3 install zensols.mimic
Binaries are also available on pypi.
Installation¶
Install the package:
pip3 install zensols.mimic
Install the database (either PostgreSQL or SQLite).
Configuration¶
After a database is installed it must be configured in a new file ~/.mimicrc
that you create. This INI formatted file also specifies where to cache data:
[default]
# the directory where cached data is stored
data_dir = ~/directory/to/cached/data
If this file doesn’t exist, it must be specified with the --config
option.
SQLite¶
SQLite is the default database used for MIMIC-III access, but, it is slower and not as well tested compared to the PostgreSQL driver. See the SQLite database file using the SQLite instructions to create the SQLite file from MIMIC-III if you need database access.
Once you create the file, configure it with the API using the following
additional configuration in the --config
specified file is also necessary (or in
~/.mimicrc
):
[mimic_sqlite_conn_manager]
db_file = path: <some directory>/mimic3.sqlite3
PostgreSQL¶
PostgreSQL is the preferred way to access MIMIC-II for this API. The MIMIC-III
database can be loaded by following the PostgreSQL instructions, or consider
the PostgreSQL Docker image. Then configure the database by adding the
following to ~/.mimicrc
:
[mimic_default]
resources_dir = resource(zensols.mimic): resources
sql_resources = ${resources_dir}/postgres
conn_manager = mimic_postgres_conn_manager
[mimic_db]
database = <needs a value>
host = <needs a value>
port = <needs a value>
user = <needs a value>
password = <needs a value>
The Python PostgreSQL client package is also needed (not needed for the SQLite installs), which can be installed with:
pip3 install zensols.dbpg
Usage¶
The Corpus class is the data access object used to read and parse the corpus:
# get the MIMIC-III corpus data acceess object
>>> from zensols.mimic import ApplicationFactory
>>> corpus = ApplicationFactory.get_corpus()
# get an admission by hadm_id
>>> adm = corpus.hospital_adm_stash['165315']
# get the first discharge note (some have admissions have addendums)
>>> from zensols.mimic.regexnote import DischargeSummaryNote
>>> ds = adm.notes_by_category[DischargeSummaryNote.CATEGORY][0]
# dump the note as a human readable section-by-section
>>> ds.write()
row_id: 12144
category: Discharge summary
description: Report
annotator: regular_expression
----------------------0:chief-complaint (CHIEF COMPLAINT)-----------------------
Unresponsiveness
-----------1:history-of-present-illness (HISTORY OF PRESENT ILLNESS)------------
The patient is a ...
# get features of the note useful in ML models as a Pandas dataframe
>>> df = ds.feature_dataframe
# get only medical features (CUI, entity, NER and POS tag) for the HPI section
>>> df[(df['section'] == 'history-of-present-illness') & (df['cui_'] != '-<N>-')]['norm cui_ detected_name_ ent_ tag_'.split()]
norm cui_ detected_name_ ent_ tag_
15 history C0455527 history~of~hypertension concept NN
See the application example, which gives a fine grain way of configuring the API.
Medical Note Segmentation¶
This package uses regular expressions to segment notes. However, the zensols.mimicsid uses annotations and a model trained by clinical informatics physicians. Using this package gives this enhanced segmentation without any API changes.
Citation¶
If you use this project in your research please use the following BibTeX entry:
@inproceedings{landes-etal-2023-deepzensols,
title = "{D}eep{Z}ensols: A Deep Learning Natural Language Processing Framework for Experimentation and Reproducibility",
author = "Landes, Paul and
Di Eugenio, Barbara and
Caragea, Cornelia",
editor = "Tan, Liling and
Milajevs, Dmitrijs and
Chauhan, Geeticka and
Gwinnup, Jeremy and
Rippeth, Elijah",
booktitle = "Proceedings of the 3rd Workshop for Natural Language Processing Open Source Software (NLP-OSS 2023)",
month = dec,
year = "2023",
address = "Singapore, Singapore",
publisher = "Association for Computational Linguistics",
url = "https://aclanthology.org/2023.nlposs-1.16",
pages = "141--146"
}
Changelog¶
An extensive changelog is available here.
Community¶
Please star this repository and let me know how and where you use this API. Contributions as pull requests, feedback and any input is welcome.
License¶
Copyright (c) 2022 - 2025 Paul Landes