zensols.mednlp package

Submodules

zensols.mednlp.app module

A natural language medical domain parsing library.

class zensols.mednlp.app.Application(config_factory, doc_parser, library)[source]

Bases: Dictable

A natural language medical domain parsing library.

__init__(config_factory, doc_parser, library)
atom(cui)[source]

Search the UMLS database using UTS and show results.

Parameters:

cui (str) – the concept ID to search for (e.g. ‘C0242379’)

config_factory: ConfigFactory

Used to create a cTAKES stash.

ctakes(text_or_file, only_medical=False)[source]

Invoke cTAKES on a directory with text files.

Parameters:
  • text_or_file (str) – natural language to be processed

  • only_medical (bool) – only provide medical linked tokens

define(cui)[source]

Look up an entity by CUI. This takes a long time.

Parameters:

cui (str) – the concept ID to search for (e.g. ‘C0242379’)

doc_parser: FeatureDocumentParser

Parses and NER tags medical terms.

features(text_or_file, out=None, ids=None, only_medical=False)[source]

Dump features as CSV output.

Parameters:
  • text_or_file (str) – natural language to be processed

  • out (Path) – the path to output the CSV file or stdout if missing

  • ids (str) – the comma-separated feature IDs to output

  • only_medical (bool) – only provide medical linked tokens
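
The CSV dump described above can be pictured with the following minimal sketch. It is not the library's implementation: the helper name features_to_csv, the dictionary-per-token layout and the sample values are assumptions made only for illustration. It shows how a comma-separated ids string might select the columns that are written.

```python
import csv
import io
from typing import Any, Dict, Iterable


def features_to_csv(rows: Iterable[Dict[str, Any]], ids: str) -> str:
    """Return CSV text with only the columns named in ``ids``.

    ``features_to_csv`` is a hypothetical helper, not part of zensols.mednlp.
    """
    # Split the comma-separated ID string and drop surrounding whitespace.
    columns = [name.strip() for name in ids.split(',') if name.strip()]
    buffer = io.StringIO()
    # Keys that are not requested are ignored rather than raising an error.
    writer = csv.DictWriter(buffer, fieldnames=columns,
                            extrasaction='ignore', lineterminator='\n')
    writer.writeheader()
    writer.writerows(rows)
    return buffer.getvalue()


# Made-up token features; the CUI below is a placeholder, not a real lookup.
sample_rows = [
    {'norm': 'lung', 'cui_': 'C0000001', 'is_concept': True},
    {'norm': 'the', 'cui_': '', 'is_concept': False},
]
csv_text = features_to_csv(sample_rows, 'norm, cui_')
print(csv_text, end='')
```

Writing to an in-memory buffer keeps the sketch independent of whether the caller sends the text to a file or to standard output.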

group(info, query=None)[source]

Get TUI group information.

Parameters:
  • info (GroupInfo) – the type of information to return

  • query (str) – comma delimited name list used to subset the output data

library: MedicalLibrary

Medical resource library that contains UMLS access, cui2vec, etc.

search(term)[source]

Search the UMLS database using UTS and show results.

Parameters:

term (str) – the term to search for (e.g. ‘lung cancer’)

show(text_or_file, only_medical=False)[source]

Parse and output medical entities.

Parameters:
  • text_or_file (str) – natural language to be processed

  • only_medical (bool) – only provide medical linked tokens

similarity(term)[source]

Get the cosine similarity between two CUIs.
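
For readers unfamiliar with the measure, the sketch below computes cosine similarity for two plain Python vectors. The function name cosine_similarity and the sample vectors are hypothetical and are not part of this package; how the method obtains the vectors for the CUIs is not shown here.

```python
import math
from typing import Sequence


def cosine_similarity(a: Sequence[float], b: Sequence[float]) -> float:
    """Return the cosine of the angle between vectors ``a`` and ``b``."""
    if len(a) != len(b):
        raise ValueError('vectors must have the same dimension')
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0 or norm_b == 0:
        raise ValueError('cosine similarity is undefined for a zero vector')
    return dot / (norm_a * norm_b)


# Two made-up three-dimensional vectors standing in for concept embeddings.
score = cosine_similarity([1.0, 2.0, 0.0], [2.0, 4.0, 0.0])
print(score)
```

Vectors that point in the same direction score close to 1, and orthogonal vectors score 0.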

class zensols.mednlp.app.GroupInfo(value, names=None, *, module=None, qualname=None, type=None, start=1, boundary=None)[source]

Bases: Enum

Used to group TUI information in Application.group().

byname = 2
csv = 1

zensols.mednlp.cli module

Command line entry point to the application.

class zensols.mednlp.cli.ApplicationFactory(*args, **kwargs)[source]

Bases: ApplicationFactory

__init__(*args, **kwargs)[source]
classmethod get_doc_parser()[source]

Get the default application’s document parser.

Return type:

FeatureDocumentParser

zensols.mednlp.cli.main(args=sys.argv, **kwargs)[source]

Return type:

ActionResult

zensols.mednlp.ctakes module

Parse and normalize discharge notes.

class zensols.mednlp.ctakes.CTakesParserStash(entry_point_bin, entry_point_cmd, home, source_dir, output_dir=None)[source]

Bases: ReadOnlyStash, Primeable, Dictable

Runs the cTAKES CUI entity linker on a directory of medical notes. For each medical text file, it generates an xmi file, which is then parsed by the ctakes_parser library.

This straightforward wrapper around the ctparser library automates the file system orchestration that needs to happen. Configure an instance of this class as an application configuration and use an ImportConfigFactory to create the objects. See the examples/ctakes directory for a quick start guide on how to use this class.

__init__(entry_point_bin, entry_point_cmd, home, source_dir, output_dir=None)
clear()[source]

Delete all data from the stash.

Important: Exercise caution with this method, of course.

entry_point_bin: Path

Entry point script into the cTAKES parser.

entry_point_cmd: str

Command line arguments passed to cTAKES.

exists(name)[source]

Return True if data with key name exists.

Implementation note: This Stash.exists() method is very inefficient and should be overridden.

Return type:

bool

home: Path

The directory where cTAKES is installed.

keys()[source]

Return an iterable of keys in the collection.

Return type:

Iterable[str]

load(name)[source]

Load a data value from the pickled data with key name. Semantically, this method loads the data using the stash’s implementation. For example, DirectoryStash loads the data from a file if it exists, but factory type stashes will always re-generate the data.

See:

get()

Return type:

DataFrame

output_dir: Path = None

The directory in which to output the xmi files.

prime()[source]
set_documents(docs)[source]

Set the documents to be parsed by cTAKES.

Parameters:

docs (Iterable[str]) – an iterable of string text documents to persist to the file system, and then be parsed by cTAKES.
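
As a rough picture of what persisting documents to the file system could look like, the sketch below writes each string to its own text file in a source directory. The helper persist_documents and the index-based file names are assumptions for illustration only; the naming scheme used by the stash is not documented here.

```python
import tempfile
from pathlib import Path
from typing import Iterable, List


def persist_documents(docs: Iterable[str], source_dir: Path) -> List[Path]:
    """Write each document to ``source_dir`` and return the created paths.

    Hypothetical helper; the file naming is an assumption.
    """
    source_dir.mkdir(parents=True, exist_ok=True)
    paths: List[Path] = []
    for index, text in enumerate(docs):
        path = source_dir / f'{index}.txt'
        path.write_text(text, encoding='utf-8')
        paths.append(path)
    return paths


# Use a temporary directory so the sketch leaves nothing behind.
with tempfile.TemporaryDirectory() as tmp:
    written = persist_documents(
        ['Patient reports chest pain.', 'No acute distress.'],
        Path(tmp) / 'source')
    names = sorted(p.name for p in written)
    first_text = written[0].read_text(encoding='utf-8')
print(names)
```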

source_dir: Path

Contains a path to the source directory where the text documents live.

property source_stash: Stash

The stash that tracks the text documents that are to be parsed by cTAKES.

zensols.mednlp.cui2vec module

This module contains the embedding subclass for cui2vec embeddings.

class zensols.mednlp.cui2vec.Cui2VecEmbedModel(name, cache=True, lowercase=False, path=None, installer=None, resource=None, dimension=500, vocab_size=109053)[source]

Bases: TextWordEmbedModel

This class uses the pretrained cui2vec embeddings.

__init__(name, cache=True, lowercase=False, path=None, installer=None, resource=None, dimension=500, vocab_size=109053)
dimension: str = 500

The word vector dimension.

vocab_size: int = 109053

Vocabulary size.

zensols.mednlp.domain module

Contains the classes for the medical token type and others.

exception zensols.mednlp.domain.MedNLPError[source]

Bases: APIError

Raised for any medical NLP specific reason in this library.

__module__ = 'zensols.mednlp.domain'

zensols.mednlp.lib module

Medical resource library that contains UMLS access, cui2vec, etc.

class zensols.mednlp.lib.MedicalLibrary(config_factory=None, medcat_resource=None, entity_linker_resource=None, uts_client=None)[source]

Bases: object

A utility class that provides access to medical APIs.

__init__(config_factory=None, medcat_resource=None, entity_linker_resource=None, uts_client=None)
config_factory: ConfigFactory = None

The configuration factory used to create cTAKES and cui2vec instances.

property cui2vec_embedding: Cui2VecEmbedModel

The cui2vec embedding model.

entity_linker_resource: EntityLinkerResource = None

The entity linker resource.

get_atom(cui)[source]

Get the UMLS atoms of a CUI from UTS.

Parameters:
  • cui (str) – the concept ID used to query

  • preferred – if True only return preferred atoms

Return type:

Dict[str, str]

Returns:

a list of atom entries in dictionary form

get_entities(text)[source]

Return all concept entity data.

Return type:

Dict[str, Any]

Returns:

concepts as a multi-tiered dict

get_linked_entity(cui)[source]

Get a scispaCy linked entity.

Parameters:

cui (str) – the unique concept ID

Return type:

Entity

get_new_ctakes_parser_stash()[source]

Return a new instance of a cTAKES parser stash.

Return type:

CTakesParserStash

get_relations(cui)[source]

Get the UMLS related concepts connected to a concept by ID.

Parameters:

cui (str) – the concept ID used to get related concepts

Return type:

List[Dict[str, Any]]

Returns:

a list of relation entries in dictionary form in the order returned by UTS

medcat_resource: MedCatResource = None

The MedCAT factory resource.

similarity_by_term(term, topn=5)[source]

Return similarities of a medical term.

Parameters:
  • term (str) – the medical term (e.g. heart disease)

  • topn (int) – the top N count similarities to return

Return type:

List[‘EntitySimilarity’]

uts_client: UTSClient = None

Queries UMLS data.

zensols.mednlp.parser module

Medical language parser.

class zensols.mednlp.parser.MedCatFeatureDocumentParser(config_factory, name, lang='en', model_name=None, token_feature_ids=frozenset({'children', 'context_similarity', 'cui', 'cui_', 'definition_', 'dep', 'dep_', 'detected_name_', 'ent', 'ent_', 'ent_iob', 'ent_iob_', 'i', 'i_sent', 'idx', 'is_concept', 'is_contraction', 'is_ent', 'is_pronoun', 'is_punctuation', 'is_space', 'is_stop', 'is_superlative', 'is_wh', 'lemma_', 'lexspan', 'norm', 'norm_len', 'pos_', 'pref_name_', 'sent_i', 'shape', 'shape_', 'sub_names', 'tag', 'tag_', 'tui_descs_', 'tuis', 'tuis_'}), components=(), token_decorators=(), sentence_decorators=(), document_decorators=(), disable_component_names=None, token_normalizer=None, special_case_tokens=<factory>, doc_class=<class 'zensols.nlp.container.FeatureDocument'>, sent_class=<class 'zensols.nlp.container.FeatureSentence'>, token_class=<class 'zensols.mednlp.tok.MedicalFeatureToken'>, remove_empty_sentences=None, reload_components=False, auto_install_model=False, medcat_resource=None)[source]

Bases: SpacyFeatureDocumentParser

A medical-based language resource that parses concepts.

TOKEN_FEATURE_IDS: ClassVar[Set[str]] = frozenset({'children', 'context_similarity', 'cui', 'cui_', 'definition_', 'dep', 'dep_', 'detected_name_', 'ent', 'ent_', 'ent_iob', 'ent_iob_', 'i', 'i_sent', 'idx', 'is_concept', 'is_contraction', 'is_ent', 'is_pronoun', 'is_punctuation', 'is_space', 'is_stop', 'is_superlative', 'is_wh', 'lemma_', 'lexspan', 'norm', 'norm_len', 'pos_', 'pref_name_', 'sent_i', 'shape', 'shape_', 'sub_names', 'tag', 'tag_', 'tui_descs_', 'tuis', 'tuis_'})

Default token feature ID set for the medical parser.

__init__(config_factory, name, lang='en', model_name=None, token_feature_ids=frozenset({'children', 'context_similarity', 'cui', 'cui_', 'definition_', 'dep', 'dep_', 'detected_name_', 'ent', 'ent_', 'ent_iob', 'ent_iob_', 'i', 'i_sent', 'idx', 'is_concept', 'is_contraction', 'is_ent', 'is_pronoun', 'is_punctuation', 'is_space', 'is_stop', 'is_superlative', 'is_wh', 'lemma_', 'lexspan', 'norm', 'norm_len', 'pos_', 'pref_name_', 'sent_i', 'shape', 'shape_', 'sub_names', 'tag', 'tag_', 'tui_descs_', 'tuis', 'tuis_'}), components=(), token_decorators=(), sentence_decorators=(), document_decorators=(), disable_component_names=None, token_normalizer=None, special_case_tokens=<factory>, doc_class=<class 'zensols.nlp.container.FeatureDocument'>, sent_class=<class 'zensols.nlp.container.FeatureSentence'>, token_class=<class 'zensols.mednlp.tok.MedicalFeatureToken'>, remove_empty_sentences=None, reload_components=False, auto_install_model=False, medcat_resource=None)
medcat_resource: MedCatResource = None

The MedCAT factory resource.

token_class

The class to use for instances created by features().

alias of MedicalFeatureToken

token_feature_ids: Set[str] = frozenset({'children', 'context_similarity', 'cui', 'cui_', 'definition_', 'dep', 'dep_', 'detected_name_', 'ent', 'ent_', 'ent_iob', 'ent_iob_', 'i', 'i_sent', 'idx', 'is_concept', 'is_contraction', 'is_ent', 'is_pronoun', 'is_punctuation', 'is_space', 'is_stop', 'is_superlative', 'is_wh', 'lemma_', 'lexspan', 'norm', 'norm_len', 'pos_', 'pref_name_', 'sent_i', 'shape', 'shape_', 'sub_names', 'tag', 'tag_', 'tui_descs_', 'tuis', 'tuis_'})

The features to keep from spaCy tokens.

See:

TOKEN_FEATURE_IDS

zensols.mednlp.resource module

MedCAT wrapper.

class zensols.mednlp.resource.MedCatResource(installer, vocab_resource, cdb_resource, mc_status_resource, umls_tuis, umls_groups, filter_tuis=None, filter_groups=None, spacy_enable_components=<factory>, cat_config=None, cache_global=True, requirements_dir=None, auto_install_models=())[source]

Bases: object

A factory class that creates MedCAT resources.

__init__(installer, vocab_resource, cdb_resource, mc_status_resource, umls_tuis, umls_groups, filter_tuis=None, filter_groups=None, spacy_enable_components=<factory>, cat_config=None, cache_global=True, requirements_dir=None, auto_install_models=())
auto_install_models: Tuple[str, ...] = ()

A list of spaCy models that will be installed if not already.

cache_global: InitVar = True

Whether or not to globally cache resources, which saves load time.

property cat: CAT

The MedCAT NER tagger instance.

When this property is accessed, all models are downloaded first, then loaded, if not already.

cat_config: Dict[str, Dict[str, Any]] = None

If provided, set the CDB configuration. Keys are general, preprocessing and all other attributes documented in the MedCAT Config.

cdb_resource: Resource

The cdb-medmen-v1.dat file.

clear()[source]
filter_groups: Set[str] = None

Just like filter_tuis, but each element is treated as a group used to generate a list of CUIs from those mapped from name to tui in groups.

filter_tuis: Set[str] = None

Types used to filter linked CUIs (e.g. {'T047', 'T048'}).
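
To make the filtering idea concrete, here is a minimal sketch that keeps only linked concepts whose types intersect a TUI set. The helper filter_by_tuis, the dictionary layout of each entity and the placeholder CUIs are assumptions for illustration; treating a missing filter as "keep everything" is also an assumption based on the None default.

```python
from typing import Any, Dict, Iterable, List, Optional, Set


def filter_by_tuis(entities: Iterable[Dict[str, Any]],
                   filter_tuis: Optional[Set[str]]) -> List[Dict[str, Any]]:
    """Keep entities that have at least one TUI in ``filter_tuis``.

    Hypothetical helper; a ``None`` or empty filter keeps every entity.
    """
    if not filter_tuis:
        return list(entities)
    return [e for e in entities if filter_tuis.intersection(e['tuis'])]


# Placeholder concepts; the CUIs are made up and carry no real meaning.
entities = [
    {'cui_': 'C0000001', 'tuis': ['T047']},
    {'cui_': 'C0000002', 'tuis': ['T121']},
    {'cui_': 'C0000003', 'tuis': ['T048', 'T121']},
]
kept = [e['cui_'] for e in filter_by_tuis(entities, {'T047', 'T048'})]
print(kept)
```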

property groups: DataFrame

A dataframe of TUIs, their abbreviations, descriptions and a group name associated with each.

installer: Installer

Installs and provides paths to the model files.

mc_status_resource: Resource

The mc_status directory.

requirements_dir: Path = None

The directory with the pip requirements files.

spacy_enable_components: Set[str]

By default, MedCAT disables several pipeline components. Some of these are needed for sentence chunking and other downstream tasks. Otherwise sentence indexing won’t work because sentence boundaries are missing.

See:

MedCAT Config

property tuis: Dict[str, str]

A mapping of type identifiers (TUIs) to their descriptions.

umls_groups: Resource

Like umls_tuis but groups TUIs in groups.

umls_tuis: Resource

The UMLS TUIs (types) mapping resource that maps from TUIs to descriptions.

See:

Semantic Types

vocab_resource: Resource

The path to the vocab.dat file.

zensols.mednlp.tok module

Contains the classes for the medical token type.

class zensols.mednlp.tok.MedicalFeatureToken(spacy_token, norm, res, ix2ent)[source]

Bases: SpacyFeatureToken

A set of token features that optionally contains a medical concept.

FEATURE_IDS: ClassVar[Set[str]] = frozenset({'context_similarity', 'cui', 'cui_', 'definition_', 'detected_name_', 'is_concept', 'pref_name_', 'sub_names', 'tui_descs_', 'tuis', 'tuis_'})

All default available feature IDs.

FEATURE_IDS_BY_TYPE: ClassVar[Dict[str, Set[str]]] = {'bool': frozenset({'is_concept'}), 'float': frozenset({'context_similarity'}), 'int': frozenset({'cui'}), 'list': frozenset({'sub_names', 'tuis'}), 'str': frozenset({'cui_', 'definition_', 'detected_name_', 'pref_name_', 'tui_descs_', 'tuis_'})}

Map of class type to set of feature IDs.
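
Because the map above is keyed by type name, looking up the type of a single feature requires inverting it. The sketch below does that with the values copied from FEATURE_IDS_BY_TYPE; the name TYPE_BY_FEATURE_ID is hypothetical and not an attribute of the class.

```python
from typing import Dict, FrozenSet

# Values copied from the FEATURE_IDS_BY_TYPE listing above.
FEATURE_IDS_BY_TYPE: Dict[str, FrozenSet[str]] = {
    'bool': frozenset({'is_concept'}),
    'float': frozenset({'context_similarity'}),
    'int': frozenset({'cui'}),
    'list': frozenset({'sub_names', 'tuis'}),
    'str': frozenset({'cui_', 'definition_', 'detected_name_',
                      'pref_name_', 'tui_descs_', 'tuis_'}),
}

# Invert the mapping: feature ID -> type name (hypothetical convenience map).
TYPE_BY_FEATURE_ID: Dict[str, str] = {
    feature_id: type_name
    for type_name, feature_ids in FEATURE_IDS_BY_TYPE.items()
    for feature_id in feature_ids
}

print(TYPE_BY_FEATURE_ID['cui'], TYPE_BY_FEATURE_ID['cui_'])
```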

WRITABLE_FEATURE_IDS: ClassVar[Tuple[str, ...]] = ('text', 'norm', 'idx', 'sent_i', 'i', 'i_sent', 'tag', 'pos', 'is_wh', 'entity', 'dep', 'children', 'cui_')

Feature IDs that are dumped on write() and write_attributes().

__init__(spacy_token, norm, res, ix2ent)[source]
property context_similarity: float

The similarity of the concept.

property cui: int

Returns the numeric part of the concept ID.

property cui_: str

The unique UMLS concept ID.
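
The relationship between cui_ and cui can be illustrated with a short sketch, assuming a concept ID is the letter C followed by digits (as in ‘C0242379’). The helper cui_to_int is hypothetical; the library's own conversion may differ.

```python
def cui_to_int(cui_: str) -> int:
    """Return the numeric part of a concept ID such as ``C0242379``.

    Hypothetical helper, assuming the ``C`` + digits format.
    """
    if not (cui_.startswith('C') and cui_[1:].isdigit()):
        raise ValueError(f'not a concept ID: {cui_!r}')
    # Leading zeros are dropped by the integer conversion.
    return int(cui_[1:])


numeric = cui_to_int('C0242379')
print(numeric)
```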

property definition_: str

The definition of the concept.

property detected_name_: str

The detected name of the concept.

property ent: int

Return the entity numeric value or 0 if this is not an entity.

property ent_: str

Return the entity string label or None if this token has no entity.

property is_concept: bool

True if this has a CUI and identifies a medical concept.

property pref_name_: str

The preferred name of the concept.

property sub_names: Tuple[str, ...]

Return other names for the concept.

property tui_descs_: str

Descriptions of tuis_.

property tuis: Tuple[str, ...]

The CUI type of the concept.

property tuis_: str

All CUI TUIs (types) of the concept sorted as a comma delimited list.

zensols.mednlp.uts module

Interface to the UMLS Terminology Services (UTS) RESTful service, which was taken from the UTS example repo.

See:

UTS GitHub repo

class zensols.mednlp.uts.Authentication(api_key, auth_endpoint='/cas/v1/api-key')[source]

Bases: object

A utility class to manage the authentication with the UTS system.

AUTH_URI = 'https://utslogin.nlm.nih.gov'

The authentication service endpoint URL.

SERVICE = 'http://umlsks.nlm.nih.gov'

The service endpoint URL.

__init__(api_key, auth_endpoint='/cas/v1/api-key')
api_key: str

The API key used for the RESTful NIH service.

auth_endpoint: str = '/cas/v1/api-key'

The path of the authentication service endpoint.

getst(tgt)[source]
gettgt()[source]

exception zensols.mednlp.uts.AuthenticationError(api_key)[source]

Bases: UTSError

Thrown when authentication fails.

__annotations__ = {}
__init__(api_key)[source]
__module__ = 'zensols.mednlp.uts'

exception zensols.mednlp.uts.NoResultsError[source]

Bases: UTSError

Thrown when no results, usually for a CUI not found.

__annotations__ = {}
__module__ = 'zensols.mednlp.uts'

class zensols.mednlp.uts.UTSClient(api_key, version='2020AA', request_stash=None)[source]

Bases: object

MISSING_VALUE = '<missing>'

Value to store in the stash when there is a missing CUI.

NO_RESULTS_ERR = 'No results containing all your search terms were found.'

Error message from UTS indicating a missing CUI.
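
One way to picture the role of MISSING_VALUE is a cache that remembers failed lookups so the service is not queried again for the same CUI. The sketch below is only an assumption about that pattern: cached_lookup, the dictionary used as the stash, and the stand-in NoResultsError class are hypothetical and do not reproduce the client's code.

```python
from typing import Callable, Dict, List

MISSING_VALUE = '<missing>'


class NoResultsError(Exception):
    """Stand-in for the library exception of the same name."""


def cached_lookup(cui: str, cache: Dict[str, str],
                  fetch: Callable[[str], str]) -> str:
    """Return the cached value for ``cui``, fetching it at most once."""
    if cui not in cache:
        try:
            cache[cui] = fetch(cui)
        except NoResultsError:
            # Remember the miss so later calls skip the remote request.
            cache[cui] = MISSING_VALUE
    value = cache[cui]
    if value == MISSING_VALUE:
        raise NoResultsError(cui)
    return value


calls: List[str] = []


def fake_fetch(cui: str) -> str:
    """Pretend remote call that never finds the CUI."""
    calls.append(cui)
    raise NoResultsError(cui)


cache: Dict[str, str] = {}
for _ in range(2):
    try:
        cached_lookup('C9999999', cache, fake_fetch)
    except NoResultsError:
        pass
print(calls, cache)
```

The second lookup raises the same error without calling the fetch function again.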

REL_ID_REGEX = re.compile('.*CUI\\/(.+)$')

Used to parse related CUIs in get_related_cuis().
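
The pattern above captures everything after the last "CUI/" in a string. The sketch below applies it to a made-up URL; the URL path layout is an assumption chosen for illustration and is not taken from the UTS documentation.

```python
import re

# Same pattern as the REL_ID_REGEX value shown above.
REL_ID_REGEX = re.compile('.*CUI\\/(.+)$')

# Hypothetical relation URL; only the trailing "CUI/<id>" part matters here.
url = 'https://uts-ws.nlm.nih.gov/rest/content/2020AA/CUI/C0242379'

match = REL_ID_REGEX.match(url)
related_cui = match.group(1) if match else None
print(related_cui)
```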

URI = 'https://uts-ws.nlm.nih.gov'

The service URL endpoint.

__init__(api_key, version='2020AA', request_stash=None)
api_key: str

The API key used for the RESTful NIH service.

get_atoms(cui, preferred=True, expect=True)[source]

Get the UMLS atoms of a CUI from UTS.

Parameters:
  • cui (str) – the concept ID used to query

  • preferred (bool) – if True only return preferred atoms

Return type:

Union[Dict[str, str], List[Dict[str, str]]]

Returns:

a list of atom entries in dictionary form or a single dict if preferred is True

Get the UMLS related concept IDs connected to a concept by ID.

Parameters:

cui (str) – the concept ID used to get related concepts

Return type:

List[Tuple[str, Dict[str, Any]]]

Returns:

a list of tuples, each the related CUIs and the relation entry, in the order returned by UTS

get_relations(cui, expect=True)[source]

Get the UMLS related concepts connected to a concept by ID.

Parameters:

cui (str) – the concept ID used to get related concepts

Return type:

List[Dict[str, Any]]

Returns:

a list of relation entries in dictionary form in the order returned by UTS

request_stash: Stash = None
search_term(term, pages=1)[source]

Search for a string term in UMLS.

Parameters:

term (str) – the string term to match against

Return type:

List[Dict[str, str]]

Returns:

a list (one for each page), each with a dictionary of matching terms that have the name of the term, the ui (CUI), the uri of the term and the rootSource of the originating system

version: str = '2020AA'

The version of UMLS to use.

exception zensols.mednlp.uts.UTSError[source]

Bases: MedNLPError

An error thrown by the wrapper of the UTS system.

__annotations__ = {}
__module__ = 'zensols.mednlp.uts'

Module contents

zensols.mednlp.surpress_warnings()[source]

Suppress future warnings generated by spaCy and ScispaCy models.