zensols.mednlp package

Submodules

zensols.mednlp.app module

A natural language medical domain parsing library.

class zensols.mednlp.app.Application(config_factory, doc_parser, library)[source]

Bases: Dictable

A natural language medical domain parsing library.

__init__(config_factory, doc_parser, library)
atom(cui)[source]

Search the UMLS database using UTS and show results.

Parameters:

cui (str) – the concept ID to search for (e.g. ‘C0242379’)

config_factory: ConfigFactory

Used to create a cTAKES stash.

ctakes(text_or_file, only_medical=False)[source]

Invoke cTAKES on a directory with text files.

Parameters:
  • text_or_file (str) – natural language to be processed

  • only_medical (bool) – only provide medical linked tokens

define(cui)[source]

Look up an entity by CUI. This takes a long time.

Parameters:

cui (str) – the concept ID to search for (e.g. ‘C0242379’)

doc_parser: FeatureDocumentParser

Parses and NER tags medical terms.

features(text_or_file, out=None, ids=None, only_medical=False)[source]

Dump features as CSV output.

Parameters:
  • text_or_file (str) – natural language to be processed

  • out (Path) – the path to output the CSV file or stdout if missing

  • ids (str) – the comma-separated feature IDs to output

  • only_medical (bool) – only provide medical linked tokens
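
As a rough illustration of the ids and out parameters above, the sketch below splits a comma-separated ID string and writes only those columns as CSV, to a file if a path is given and to stdout otherwise. It is not the library's implementation: parse_feature_ids, dump_csv, the stream argument and the sample row are hypothetical names and values used only for this example.

```python
import csv
import io
import sys
from pathlib import Path
from typing import Dict, Iterable, List, Optional, TextIO


def parse_feature_ids(ids: Optional[str]) -> List[str]:
    """Split a comma-separated ID string, dropping blanks and whitespace."""
    if not ids:
        return []
    return [i.strip() for i in ids.split(',') if i.strip()]


def _write(f: TextIO, rows: Iterable[Dict[str, object]],
           cols: List[str]) -> None:
    # extrasaction='ignore' drops any row keys that were not requested
    writer = csv.DictWriter(f, fieldnames=cols, extrasaction='ignore')
    writer.writeheader()
    writer.writerows(rows)


def dump_csv(rows: Iterable[Dict[str, object]], ids: str,
             out: Optional[Path] = None,
             stream: Optional[TextIO] = None) -> None:
    """Write the requested columns to ``out``, or to a stream (stdout by
    default) when no path is given."""
    cols = parse_feature_ids(ids)
    if out is not None:
        with open(out, 'w', newline='') as f:
            _write(f, rows, cols)
    else:
        _write(stream if stream is not None else sys.stdout, rows, cols)


# illustrative row only; not output produced by the parser
sample_rows = [{'norm': 'cancer', 'cui_': 'C0242379', 'pos_': 'NOUN'}]
buf = io.StringIO()
dump_csv(sample_rows, 'norm, cui_', stream=buf)
```

Passing a stream keeps the sketch testable without touching the file system; the real method only documents the out path.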

group(info, query=None)[source]

Get TUI group information.

Parameters:
  • info (GroupInfo) – the type of information to return

  • query (str) – comma delimited name list used to subset the output data

library: MedicalLibrary

Medical resource library that contains UMLS access, cui2vec, etc.

search(term)[source]

Search the UMLS database using UTS and show results.

Parameters:

term (str) – the term to search for (e.g. ‘lung cancer’)

show(text_or_file, only_medical=False)[source]

Parse and output medical entities.

Parameters:
  • text_or_file (str) – natural language to be processed

  • only_medical (bool) – only provide medical linked tokens

similarity(term)[source]

Get the cosine similarity between two CUIs.
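
For readers unfamiliar with the measure, here is a minimal, self-contained sketch of cosine similarity between two embedding vectors. The vectors are toy values, not real cui2vec embeddings, and cosine_similarity is a hypothetical helper rather than a function exported by this package.

```python
import math
from typing import Sequence


def cosine_similarity(a: Sequence[float], b: Sequence[float]) -> float:
    """Return the cosine of the angle between two equal-length vectors."""
    if len(a) != len(b):
        raise ValueError('vectors must have the same dimension')
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        raise ValueError('cosine similarity is undefined for a zero vector')
    return dot / (norm_a * norm_b)


# toy vectors standing in for concept embeddings
vec_a = [1.0, 2.0, 0.0]
vec_b = [2.0, 4.0, 0.0]   # same direction as vec_a
vec_c = [0.0, 0.0, 3.0]   # orthogonal to vec_a

same_direction = cosine_similarity(vec_a, vec_b)
orthogonal = cosine_similarity(vec_a, vec_c)
```

Vectors pointing the same way score close to 1 and orthogonal vectors score 0, regardless of their magnitudes.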

class zensols.mednlp.app.GroupInfo(value, names=<not given>, *values, module=None, qualname=None, type=None, start=1, boundary=None)[source]

Bases: Enum

Used to group TUI information in Application.group().

byname = 2
csv = 1
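
The two members above behave like any other Enum. The sketch below uses a stand-in class with the same member names and values (it does not import the real GroupInfo) to show lookup by name and by value; parse_group_info is a hypothetical helper.

```python
from enum import Enum


class GroupInfo(Enum):
    """Stand-in that mirrors the two members listed above."""
    csv = 1
    byname = 2


def parse_group_info(name: str) -> GroupInfo:
    """Map a string such as 'byname' to its enum member."""
    try:
        return GroupInfo[name]
    except KeyError:
        choices = ', '.join(m.name for m in GroupInfo)
        raise ValueError(
            f'unknown group info {name!r}; expected one of: {choices}'
        ) from None


by_name = parse_group_info('byname')   # lookup by member name
by_value = GroupInfo(1)                # lookup by member value
```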

zensols.mednlp.cli module

Command line entry point to the application.

class zensols.mednlp.cli.ApplicationFactory(*args, **kwargs)[source]

Bases: ApplicationFactory

__init__(*args, **kwargs)[source]
classmethod get_doc_parser()[source]

Get the default application’s document parser.

Return type:

FeatureDocumentParser

zensols.mednlp.cli.main(args=sys.argv, **kwargs)[source]
Return type:

ActionResult

zensols.mednlp.ctakes module

Parse and normalize discharge notes.

class zensols.mednlp.ctakes.CTakesParserStash(entry_point_bin, entry_point_cmd, home, source_dir, output_dir=None)[source]

Bases: ReadOnlyStash, Primeable, Dictable

Runs the cTAKES CUI entity linker on a directory of medical notes. For each medical text file, it generates an xmi file, which is then parsed by the ctakes_parser library.

This straightforward wrapper around the ctparser library automates the file system orchestration that needs to happen. Configure an instance of this class as an application configuration and use an ImportConfigFactory to create the objects. See the examples/ctakes directory for a quick start guide on how to use this class.

__init__(entry_point_bin, entry_point_cmd, home, source_dir, output_dir=None)
clear()[source]

Delete all data from the stash.

Important: Exercise caution with this method, of course.

entry_point_bin: Path

Entry point script into the cTAKES parser.

entry_point_cmd: str

Command line arguments passed to cTAKES.

exists(name)[source]

Return True if data with key name exists.

Implementation note: This Stash.exists() method is very inefficient and should be overridden.

Return type:

bool

home: Path

The directory where cTAKES is installed.

keys()[source]

Return an iterable of keys in the collection.

Return type:

Iterable[str]

load(name)[source]

Load a data value from the pickled data with key name. Semantically, this method loads the data using the stash’s implementation. For example, DirectoryStash loads the data from a file if it exists, but factory type stashes will always re-generate the data.

See:

get()

Return type:

DataFrame

output_dir: Path = None

The directory in which to output the xmi files.

prime()[source]
set_documents(docs)[source]

Set the documents to be parsed by cTAKES.

Parameters:

docs (Iterable[str]) – an iterable of string text documents to persist to the file system, and then be parsed by cTAKES.
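
To make the "persist to the file system" step concrete, here is a minimal sketch that writes an iterable of strings into a source directory, one file per document. The helper name write_documents and the numeric file naming scheme are assumptions made for illustration; the class may name and lay out files differently.

```python
import tempfile
from pathlib import Path
from typing import Iterable, List


def write_documents(docs: Iterable[str], source_dir: Path) -> List[Path]:
    """Write each document to its own text file and return the paths."""
    source_dir.mkdir(parents=True, exist_ok=True)
    paths: List[Path] = []
    for i, text in enumerate(docs):
        # hypothetical naming scheme: 0.txt, 1.txt, ...
        path = source_dir / f'{i}.txt'
        path.write_text(text, encoding='utf-8')
        paths.append(path)
    return paths


# write two short notes into a throwaway directory and read them back
with tempfile.TemporaryDirectory() as tmp:
    written = write_documents(
        ['Patient has lung cancer.', 'No acute distress.'],
        Path(tmp) / 'source')
    names = [p.name for p in written]
    second = written[1].read_text(encoding='utf-8')
```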

source_dir: Path

Contains a path to the source directory where the text documents live.

property source_stash: Stash

The stash that tracks the text documents that are to be parsed by cTAKES.

zensols.mednlp.cui2vec module

zensols.mednlp.domain module

Contains the classes for the medical token type and others.

exception zensols.mednlp.domain.MedNLPError[source]

Bases: APIError

Raised for any medical NLP specific reason in this library.

__module__ = 'zensols.mednlp.domain'

zensols.mednlp.lib module

Medical resource library that contains UMLS access, cui2vec, etc.

class zensols.mednlp.lib.MedicalLibrary(config_factory=None, medcat_resource=None, entity_linker_resource=None, uts_client=None)[source]

Bases: Dictable

A utility class that provides access to medical APIs.

__init__(config_factory=None, medcat_resource=None, entity_linker_resource=None, uts_client=None)
config_factory: ConfigFactory = None

The configuration factory used to create cTAKES and cui2vec instances.

property cui2vec_embedding: Cui2VecEmbedModel

The cui2vec embedding model.

entity_linker_resource: EntityLinkerResource = None

The entity linker resource.

get_atom(cui)[source]

Get the UMLS atoms of a CUI from UTS.

Parameters:
  • cui (str) – the concept ID used to query

  • preferred – if True only return preferred atoms

Return type:

Dict[str, str]

Returns:

a list of atom entries in dictionary form

get_entities(text)[source]

Return all concept entity data.

Return type:

Dict[str, Any]

Returns:

concepts as a multi-tiered dict

get_linked_entity(cui)[source]

Get a scispaCy linked entity.

Parameters:

cui (str) – the unique concept ID

Return type:

Entity

get_new_ctakes_parser_stash()[source]

Return a new instance of a ctakes parser stash.

Return type:

CTakesParserStash

get_relations(cui)[source]

Get the UMLS related concepts connected to a concept by ID.

Parameters:

cui (str) – the concept ID used to get related concepts

Return type:

List[Dict[str, Any]]

Returns:

a list of relation entries in dictionary form in the order returned by UTS

medcat_resource: MedCatResource = None

The MedCAT factory resource.

similarity_by_term(term, topn=5)[source]

Return similarities of a medical term.

Parameters:
  • term (str) – the medical term (e.g. heart disease)

  • topn (int) – the top N count similarities to return

Return type:

List[‘EntitySimilarity’]

uts_client: UTSClient = None

Queries UMLS data.

zensols.mednlp.parser module

Medical language parser.

class zensols.mednlp.parser.MedCatFeatureDocumentParser(config_factory, name, lang='en', model_name=None, token_feature_ids=frozenset({'children', 'context_similarity', 'cui', 'cui_', 'definition_', 'dep', 'dep_', 'detected_name_', 'ent', 'ent_', 'ent_iob', 'ent_iob_', 'i', 'i_sent', 'idx', 'is_concept', 'is_contraction', 'is_ent', 'is_pronoun', 'is_punctuation', 'is_space', 'is_stop', 'is_superlative', 'is_wh', 'lemma_', 'lexspan', 'norm', 'norm_len', 'pos_', 'pref_name_', 'sent_i', 'shape', 'shape_', 'sub_names', 'tag', 'tag_', 'tui_descs_', 'tuis', 'tuis_'}), components=(), token_decorators=(), sentence_decorators=(), document_decorators=(), disable_component_names=None, token_normalizer=None, special_case_tokens=<factory>, doc_class=<class 'zensols.nlp.container.FeatureDocument'>, sent_class=<class 'zensols.nlp.container.FeatureSentence'>, token_class=<class 'zensols.mednlp.tok.MedicalFeatureToken'>, remove_empty_sentences=None, reload_components=False, auto_install_model=False, package_manager=<factory>, medcat_resource=None)[source]

Bases: SpacyFeatureDocumentParser

A medical-based language resource that parses concepts.

TOKEN_FEATURE_IDS: ClassVar[Set[str]] = frozenset({'children', 'context_similarity', 'cui', 'cui_', 'definition_', 'dep', 'dep_', 'detected_name_', 'ent', 'ent_', 'ent_iob', 'ent_iob_', 'i', 'i_sent', 'idx', 'is_concept', 'is_contraction', 'is_ent', 'is_pronoun', 'is_punctuation', 'is_space', 'is_stop', 'is_superlative', 'is_wh', 'lemma_', 'lexspan', 'norm', 'norm_len', 'pos_', 'pref_name_', 'sent_i', 'shape', 'shape_', 'sub_names', 'tag', 'tag_', 'tui_descs_', 'tuis', 'tuis_'})

Default token feature ID set for the medical parser.

__init__(config_factory, name, lang='en', model_name=None, token_feature_ids=frozenset({'children', 'context_similarity', 'cui', 'cui_', 'definition_', 'dep', 'dep_', 'detected_name_', 'ent', 'ent_', 'ent_iob', 'ent_iob_', 'i', 'i_sent', 'idx', 'is_concept', 'is_contraction', 'is_ent', 'is_pronoun', 'is_punctuation', 'is_space', 'is_stop', 'is_superlative', 'is_wh', 'lemma_', 'lexspan', 'norm', 'norm_len', 'pos_', 'pref_name_', 'sent_i', 'shape', 'shape_', 'sub_names', 'tag', 'tag_', 'tui_descs_', 'tuis', 'tuis_'}), components=(), token_decorators=(), sentence_decorators=(), document_decorators=(), disable_component_names=None, token_normalizer=None, special_case_tokens=<factory>, doc_class=<class 'zensols.nlp.container.FeatureDocument'>, sent_class=<class 'zensols.nlp.container.FeatureSentence'>, token_class=<class 'zensols.mednlp.tok.MedicalFeatureToken'>, remove_empty_sentences=None, reload_components=False, auto_install_model=False, package_manager=<factory>, medcat_resource=None)
medcat_resource: MedCatResource = None

The MedCAT factory resource.

token_class

The class to use for instances created by features().

alias of MedicalFeatureToken

token_feature_ids: Set[str] = frozenset({'children', 'context_similarity', 'cui', 'cui_', 'definition_', 'dep', 'dep_', 'detected_name_', 'ent', 'ent_', 'ent_iob', 'ent_iob_', 'i', 'i_sent', 'idx', 'is_concept', 'is_contraction', 'is_ent', 'is_pronoun', 'is_punctuation', 'is_space', 'is_stop', 'is_superlative', 'is_wh', 'lemma_', 'lexspan', 'norm', 'norm_len', 'pos_', 'pref_name_', 'sent_i', 'shape', 'shape_', 'sub_names', 'tag', 'tag_', 'tui_descs_', 'tuis', 'tuis_'})

The features to keep from spaCy tokens.

See:

TOKEN_FEATURE_IDS

zensols.mednlp.resource module

MedCAT wrapper.

class zensols.mednlp.resource.MedCatResource(installer, vocab_resource, cdb_resource, mc_status_resource, umls_tuis, umls_groups, filter_tuis=None, filter_groups=None, spacy_enable_components=<factory>, cat_config=None, cache_global=True, requirements=(), package_manager=<factory>)[source]

Bases: Dictable

A factory class that creates MedCAT resources.

__init__(installer, vocab_resource, cdb_resource, mc_status_resource, umls_tuis, umls_groups, filter_tuis=None, filter_groups=None, spacy_enable_components=<factory>, cat_config=None, cache_global=True, requirements=(), package_manager=<factory>)
cache_global: InitVar = True

Whether or not to globally cache resources, which saves load time.

property cat: CAT

The MedCAT NER tagger instance.

When this property is accessed, all models are downloaded first, then loaded, if not already.

cat_config: Dict[str, Dict[str, Any]] = None

If provided, set the CDB configuration. Keys are general, preprocessing and all other attributes documented in the MedCAT Config.

cdb_resource: Resource

The cdb-medmen-v1.dat file.

clear()[source]
filter_groups: Set[str] = None

Just like filter_tuis but each element is treated as a group used to generate a list of CUIs from those mapped from name to tui in groups.

filter_tuis: Set[str] = None

Types used to filter linked CUIs (e.g. {'T047', 'T048'}).
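
The idea of filtering linked concepts by type can be sketched as a set intersection. The function filter_by_tuis, the dictionary layout of each concept and the placeholder CUIs below are all invented for this example and do not reflect how MedCAT or this class stores concepts.

```python
from typing import Dict, Iterable, List, Optional, Set


def filter_by_tuis(concepts: Iterable[Dict[str, object]],
                   filter_tuis: Optional[Set[str]]) -> List[Dict[str, object]]:
    """Keep concepts having at least one TUI in ``filter_tuis``.

    A value of ``None`` means no filtering, matching the default above.
    """
    if filter_tuis is None:
        return list(concepts)
    return [c for c in concepts if filter_tuis & set(c['tuis'])]


# placeholder concepts; the CUIs are not real identifiers
concepts = [
    {'cui': 'C0000001', 'tuis': ['T047']},
    {'cui': 'C0000002', 'tuis': ['T121']},
    {'cui': 'C0000003', 'tuis': ['T048', 'T033']},
]
kept = [c['cui'] for c in filter_by_tuis(concepts, {'T047', 'T048'})]
unfiltered = filter_by_tuis(concepts, None)
```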

property groups: DataFrame

A dataframe of TUIs, their abbreviations, descriptions and a group name associated with each.

installer: Installer

Installs and provides paths to the model files.

mc_status_resource: Resource

The mc_status directory.

package_manager: PackageManager

The package manager used to install requirements.

requirements: Tuple[str, ...] = ()

A list of spaCy pip dependencies (can include model direct-references) that will be installed if not already.

spacy_enable_components: Set[str]

By default, MedCAT disables several pipeline components. Some of these are needed for sentence chunking and other downstream tasks. Otherwise sentence indexing won’t work because sentence boundaries are missing.

See:

MedCAT Config

property tuis: Dict[str, str]

A mapping of type identifiers (TUIs) to their descriptions.

umls_groups: Path

Like umls_tuis but groups TUIs in groups.

umls_tuis: Path

The UMLS TUIs (types) mapping resource that maps from TUIs to descriptions.

See:

Semantic Types

vocab_resource: Resource

The path to the vocab.dat file.

zensols.mednlp.tok module

Contains the classes for the medical token type.

class zensols.mednlp.tok.MedicalFeatureToken(spacy_token, norm, res, ix2ent)[source]

Bases: SpacyFeatureToken

A set of token features that optionally contains a medical concept.

FEATURE_IDS: ClassVar[Set[str]] = frozenset({'context_similarity', 'cui', 'cui_', 'definition_', 'detected_name_', 'is_concept', 'pref_name_', 'sub_names', 'tui_descs_', 'tuis', 'tuis_'})

All default available feature IDs.

FEATURE_IDS_BY_TYPE: ClassVar[Dict[str, Set[str]]] = {'bool': frozenset({'is_concept'}), 'float': frozenset({'context_similarity'}), 'int': frozenset({'cui'}), 'list': frozenset({'sub_names', 'tuis'}), 'str': frozenset({'cui_', 'definition_', 'detected_name_', 'pref_name_', 'tui_descs_', 'tuis_'})}

Map of class type to set of feature IDs.
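
Since FEATURE_IDS_BY_TYPE maps a type name to a set of feature IDs, a common need is the reverse lookup from feature ID to type name. The sketch below copies the mapping shown above into a local constant and inverts it; invert_feature_types is a hypothetical helper, not part of the class.

```python
from typing import Dict, FrozenSet

# copied from the class attribute documented above
FEATURE_IDS_BY_TYPE: Dict[str, FrozenSet[str]] = {
    'bool': frozenset({'is_concept'}),
    'float': frozenset({'context_similarity'}),
    'int': frozenset({'cui'}),
    'list': frozenset({'sub_names', 'tuis'}),
    'str': frozenset({'cui_', 'definition_', 'detected_name_',
                      'pref_name_', 'tui_descs_', 'tuis_'}),
}


def invert_feature_types(
        by_type: Dict[str, FrozenSet[str]]) -> Dict[str, str]:
    """Return a mapping of feature ID to the name of its type."""
    return {fid: type_name
            for type_name, fids in by_type.items()
            for fid in fids}


type_by_feature = invert_feature_types(FEATURE_IDS_BY_TYPE)
```

Such a lookup is handy when deciding how to serialize a column, for example casting cui to an integer while leaving cui_ as a string.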

WRITABLE_FEATURE_IDS: ClassVar[Tuple[str, ...]] = ('text', 'norm', 'idx', 'sent_i', 'i', 'i_sent', 'tag', 'pos', 'is_wh', 'entity', 'dep', 'children', 'cui_')

Feature IDs that are dumped on write() and write_attributes().

__init__(spacy_token, norm, res, ix2ent)[source]
property context_similarity: float

The similarity of the concept.

property cui: int

Returns the numeric part of the concept ID.
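
A minimal sketch of the relationship between the string and numeric forms of a concept ID follows, assuming the numeric part is simply the digits after the leading 'C' and that the string form is zero-padded to seven digits. Both helpers are hypothetical and are not the property's actual implementation.

```python
def cui_to_int(cui_: str) -> int:
    """Return the digits after the leading 'C' as an integer."""
    digits = cui_[1:]
    if not (cui_.startswith('C') and digits.isascii() and digits.isdigit()):
        raise ValueError(f'not a concept ID: {cui_!r}')
    return int(digits)


def int_to_cui(cui: int) -> str:
    """Rebuild the string form, zero-padding to seven digits."""
    return f'C{cui:07d}'


numeric = cui_to_int('C0242379')
restored = int_to_cui(numeric)
```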

property cui_: str

The unique UMLS concept ID.

property definition_: str

The definition of the concept.

property detected_name_: str

The detected name of the concept.

property ent: int

Return the entity numeric value or 0 if this is not an entity.

property ent_: str

Return the entity string label or None if this token has no entity.

property is_concept: bool

True if this has a CUI and identifies a medical concept.

property pref_name_: str

The preferred name of the concept.

property sub_names: Tuple[str, ...]

Return other names for the concept.

property tui_descs_: str

Descriptions of tuis_.

property tuis: Tuple[str, ...]

The CUI types of the concept.

property tuis_: str

All CUI TUIs (types) of the concept sorted as a comma delimited list.

zensols.mednlp.uts module

Interface to the UMLS Terminology Services (UTS) RESTful service, which was taken from the UTS example repo.

See:

UTS GitHub repo

class zensols.mednlp.uts.Authentication(api_key, auth_endpoint='/cas/v1/api-key')[source]

Bases: object

A utility class to manage the authentication with the UTS system.

AUTH_URI = 'https://utslogin.nlm.nih.gov'

The authentication service endpoint URL.

SERVICE = 'http://umlsks.nlm.nih.gov'

The service endpoint URL.

__init__(api_key, auth_endpoint='/cas/v1/api-key')
api_key: str

The API key used for the RESTful NIH service.

auth_endpoint: str = '/cas/v1/api-key'

The path of the authentication service endpoint.
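
One plausible reading of AUTH_URI and auth_endpoint is that the full authentication URL is the base joined with the path. The sketch below shows only that join and makes no network request; whether the class builds the URL exactly this way is an assumption, and auth_url is a hypothetical helper.

```python
from urllib.parse import urljoin

AUTH_URI = 'https://utslogin.nlm.nih.gov'   # class attribute shown above
AUTH_ENDPOINT = '/cas/v1/api-key'           # default auth_endpoint shown above


def auth_url(base: str = AUTH_URI, endpoint: str = AUTH_ENDPOINT) -> str:
    """Join the service base URL with the endpoint path."""
    return urljoin(base, endpoint)


full_url = auth_url()
```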

getst(tgt)[source]
gettgt()[source]
exception zensols.mednlp.uts.AuthenticationError(api_key)[source]

Bases: UTSError

Thrown when authentication fails.

__annotations__ = {}
__init__(api_key)[source]
__module__ = 'zensols.mednlp.uts'
exception zensols.mednlp.uts.NoResultsError[source]

Bases: UTSError

Thrown when no results, usually for a CUI not found.

__annotations__ = {}
__module__ = 'zensols.mednlp.uts'
class zensols.mednlp.uts.UTSClient(api_key, version='2020AA', request_stash=None)[source]

Bases: object

MISSING_VALUE = '<missing>'

Value to store in the stash when there is a missing CUI.
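
Storing a sentinel for missing CUIs is a form of negative caching: a lookup that is known to fail is remembered so the remote service is not queried again. The sketch below illustrates the pattern with a plain dict standing in for the stash. The function cached_lookup, the fetch callable, the stand-in NoResultsError class and the choice to return None for a miss are assumptions, not the client's actual logic.

```python
from typing import Callable, Dict, List, Optional

MISSING_VALUE = '<missing>'   # sentinel value shown above


class NoResultsError(Exception):
    """Stand-in for the library exception of the same name."""


def cached_lookup(cui: str, stash: Dict[str, object],
                  fetch: Callable[[str], object]) -> Optional[object]:
    """Return the cached value for ``cui``, fetching it at most once."""
    if cui not in stash:
        try:
            stash[cui] = fetch(cui)
        except NoResultsError:
            # remember the miss so the service is not queried again
            stash[cui] = MISSING_VALUE
    value = stash[cui]
    return None if value == MISSING_VALUE else value


calls: List[str] = []


def always_missing(cui: str) -> object:
    """Fake remote call that records each request and finds nothing."""
    calls.append(cui)
    raise NoResultsError(cui)


stash: Dict[str, object] = {}
first = cached_lookup('C9999999', stash, always_missing)
second = cached_lookup('C9999999', stash, always_missing)
```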

NO_RESULTS_ERR = 'No results containing all your search terms were found.'

Error message from UTS indicating a missing CUI.

REL_ID_REGEX = re.compile('.*CUI\\/(.+)$')

Used to parse related CUIs in get_related_cuis().
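
The pattern above captures everything after the last CUI/ path segment. A small sketch of how it could be applied follows; the example URI is illustrative rather than a recorded UTS response, and related_cui is a hypothetical helper.

```python
import re
from typing import Optional

# same pattern as the class attribute shown above
REL_ID_REGEX = re.compile('.*CUI\\/(.+)$')


def related_cui(uri: str) -> Optional[str]:
    """Return the concept ID at the end of a URI, or None if absent."""
    match = REL_ID_REGEX.match(uri)
    return match.group(1) if match else None


# illustrative URIs only
found = related_cui(
    'https://uts-ws.nlm.nih.gov/rest/content/2020AA/CUI/C0242379')
not_found = related_cui('https://uts-ws.nlm.nih.gov/rest/search')
```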

URI = 'https://uts-ws.nlm.nih.gov'

The service URL endpoint.

__init__(api_key, version='2020AA', request_stash=None)
api_key: str

The API key used for the RESTful NIH service.

get_atoms(cui, preferred=True, expect=True)[source]

Get the UMLS atoms of a CUI from UTS.

Parameters:
  • cui (str) – the concept ID used to query

  • preferred (bool) – if True only return preferred atoms

Return type:

Union[Dict[str, str], List[Dict[str, str]]]

Returns:

a list of atom entries in dictionary form or a single dict if preferred is True

get_related_cuis(cui, expect=True)[source]

Get the UMLS related concept IDs connected to a concept by ID.

Parameters:

cui (str) – the concept ID used to get related concepts

Return type:

List[Tuple[str, Dict[str, Any]]]

Returns:

a list of tuples, each the related CUIs and the relation entry, in the order returned by UTS

get_relations(cui, expect=True)[source]

Get the UMLS related concepts connected to a concept by ID.

Parameters:

cui (str) – the concept ID used to get related concepts

Return type:

List[Dict[str, Any]]

Returns:

a list of relation entries in dictionary form in the order returned by UTS

request_stash: Stash = None
search_term(term, pages=1)[source]

Search for a string term in UMLS.

Parameters:

term (str) – the string term to match against

Return type:

List[Dict[str, str]]

Returns:

a list (one for each page), each with a dictionary of matching terms that have the name of the term, the ui (CUI), the uri of the term and the rootSource of the originating system

version: str = '2020AA'

The version of UMLS to use.

exception zensols.mednlp.uts.UTSError[source]

Bases: MedNLPError

An error thrown by the wrapper of the UTS system.

__annotations__ = {}
__module__ = 'zensols.mednlp.uts'
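
Because AuthenticationError and NoResultsError both derive from UTSError, which in turn derives from MedNLPError, callers can catch at whichever level suits them. The sketch below rebuilds the hierarchy with stand-in classes so that it runs without the package installed. The real MedNLPError extends APIError rather than Exception directly, and the error message and the describe helper are invented for the example.

```python
class MedNLPError(Exception):
    """Stand-in; the real class extends APIError."""


class UTSError(MedNLPError):
    """Stand-in for errors raised by the UTS wrapper."""


class AuthenticationError(UTSError):
    """Stand-in; takes the API key, as in the signature above."""

    def __init__(self, api_key: str):
        # hypothetical message; the real text may differ
        super().__init__('authentication failed for the given API key')
        self.api_key = api_key


class NoResultsError(UTSError):
    """Stand-in for the 'no results' error."""


def describe(err: Exception) -> str:
    """Classify an error from most to least specific."""
    if isinstance(err, NoResultsError):
        return 'no results'
    if isinstance(err, UTSError):
        return 'UTS failure'
    if isinstance(err, MedNLPError):
        return 'library failure'
    return 'unrelated failure'


labels = [describe(e) for e in (NoResultsError('C9999999'),
                                AuthenticationError('not-a-real-key'),
                                MedNLPError('other'),
                                KeyError('x'))]
```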

Module contents

zensols.mednlp.surpress_warnings()[source]

Suppress future warnings generated by spaCy and ScispaCy models.