zensols.mednlp package#

Submodules#

zensols.mednlp.app#

Inheritance diagram of zensols.mednlp.app

A natural language medical domain parsing library.

class zensols.mednlp.app.Application(config_factory, doc_parser, library)[source]#

Bases: Dictable

A natural language medical domain parsing library.

__init__(config_factory, doc_parser, library)#
atom(cui)[source]#

Search the UMLS database using UTS and show results.

Parameters:

cui (str) – the concept ID to search for (e.g. ‘C0242379’)

config_factory: ConfigFactory#

Used to create a cTAKES stash.

ctakes(text_or_file, only_medical=False)[source]#

Invoke cTAKES on a directory with text files.

Parameters:
  • text_or_file (str) – natural language to be processed

  • only_medical (bool) – only provide medical linked tokens

define(cui)[source]#

Look up an entity by CUI. This takes a long time.

Parameters:

cui (str) – the concept ID to search for (e.g. ‘C0242379’)

doc_parser: FeatureDocumentParser#

Parses and NER tags medical terms.

features(text_or_file, out=None, ids=None, only_medical=False)[source]#

Dump features as CSV output.

Parameters:
  • text_or_file (str) – natural language to be processed

  • out (Path) – the path to output the CSV file or stdout if missing

  • ids (str) – the comma-separated feature IDs to output

  • only_medical (bool) – only provide medical linked tokens
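A minimal sketch of the kind of output described above, assuming each token is available as a plain dict of feature values. The helper `dump_features`, its `stream` argument and the sample rows are hypothetical and not part of zensols.mednlp; only the comma-separated `ids` string and the fallback to standard output when no path is given come from the parameter descriptions.

```python
import csv
import io
import sys
from pathlib import Path
from typing import Dict, Iterable, Optional, TextIO


def dump_features(rows: Iterable[Dict[str, object]], ids: str,
                  out: Optional[Path] = None,
                  stream: TextIO = sys.stdout) -> None:
    """Write only the requested feature columns as CSV (hypothetical helper)."""
    # split the comma-separated feature IDs, dropping surrounding whitespace
    cols = [c.strip() for c in ids.split(',') if c.strip()]

    def write(fh: TextIO) -> None:
        # keys that were not requested are silently skipped
        writer = csv.DictWriter(fh, fieldnames=cols, extrasaction='ignore')
        writer.writeheader()
        writer.writerows(rows)

    if out is None:
        # no path given: write to the stream (stdout by default)
        write(stream)
    else:
        with open(out, 'w', newline='') as fh:
            write(fh)


# made-up token rows, for illustration only
rows = [{'norm': 'cancer', 'cui_': 'C0242379', 'is_concept': True},
        {'norm': 'the', 'cui_': '', 'is_concept': False}]
buf = io.StringIO()
dump_features(rows, 'norm, cui_', stream=buf)
csv_text = buf.getvalue()
```

Passing an in-memory stream, as in the usage lines, makes it easy to inspect the result without touching the file system.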

group(info, query=None)[source]#

Get TUI group information.

Parameters:
  • info (GroupInfo) – the type of information to return

  • query (str) – comma delimited name list used to subset the output data

library: MedicalLibrary#

Medical resource library that contains UMLS access, cui2vec, etc.

search(term)[source]#

Search the UMLS database using UTS and show results.

Parameters:

term (str) – the term to search for (e.g. ‘lung cancer’)

show(text_or_file, only_medical=False)[source]#

Parse and output medical entities.

Parameters:
  • text_or_file (str) – natural language to be processed

  • only_medical (bool) – only provide medical linked tokens

similarity(term)[source]#

Get the cosine similarity between two CUIs.
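For reference, cosine similarity between two embedding vectors can be sketched as below. This is a generic, standard-library illustration of the measure, not the code this method runs; the function name and the plain-list vectors are assumptions (the library works with cui2vec embeddings).

```python
import math
from typing import Sequence


def cosine_similarity(a: Sequence[float], b: Sequence[float]) -> float:
    """Return the cosine of the angle between two equal-length vectors."""
    if len(a) != len(b):
        raise ValueError('vectors must have the same dimension')
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0 or norm_b == 0:
        # the angle is undefined when either vector has zero length
        raise ValueError('cosine similarity is undefined for a zero vector')
    return dot / (norm_a * norm_b)


parallel = cosine_similarity([1.0, 2.0, 3.0], [2.0, 4.0, 6.0])
orthogonal = cosine_similarity([1.0, 0.0], [0.0, 1.0])
```

Vectors pointing in the same direction score close to 1, and orthogonal vectors score 0.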

class zensols.mednlp.app.GroupInfo(value, names=None, *, module=None, qualname=None, type=None, start=1, boundary=None)[source]#

Bases: Enum

Used to group TUI information in Application.group()

byname = 2#
csv = 1#

zensols.mednlp.cli#

Inheritance diagram of zensols.mednlp.cli

Command line entry point to the application.

class zensols.mednlp.cli.ApplicationFactory(*args, **kwargs)[source]#

Bases: ApplicationFactory

__init__(*args, **kwargs)[source]#
classmethod get_doc_parser()[source]#

Get the default application’s document parser.

Return type:

FeatureDocumentParser

zensols.mednlp.cli.main(args=sys.argv, **kwargs)[source]#
Return type:

ActionResult

zensols.mednlp.ctakes#

Inheritance diagram of zensols.mednlp.ctakes

Parse and normalize discharge notes.

class zensols.mednlp.ctakes.CTakesParserStash(entry_point_bin, entry_point_cmd, home, source_dir, output_dir=None)[source]#

Bases: ReadOnlyStash, Primeable, Dictable

Runs the cTAKES CUI entity linker on a directory of medical notes. For each medical text file, it generates an xmi file, which is then parsed by the ctakes_parser library.

This straightforward wrapper around the ctparser library automates the file system orchestration that needs to happen. Configure an instance of this class as an application configuration and use an ImportConfigFactory to create the objects. See the examples/ctakes directory for a quick start guide on how to use this class.

__init__(entry_point_bin, entry_point_cmd, home, source_dir, output_dir=None)#
clear()[source]#

Delete all data from the stash.

Important: Exercise caution with this method, of course.

entry_point_bin: Path#

Entry point script into the cTAKES parser.

entry_point_cmd: str#

Command line arguments passed to cTAKES.

exists(name)[source]#

Return True if data with key name exists.

Implementation note: This Stash.exists() method is very inefficient and should be overridden.

Return type:

bool

home: Path#

The directory where cTAKES is installed.

keys()[source]#

Return an iterable of keys in the collection.

Return type:

Iterable[str]

load(name)[source]#

Load a data value from the pickled data with key name. Semantically, this method loads the data using the stash’s implementation. For example, DirectoryStash loads the data from a file if it exists, but factory type stashes will always re-generate the data.

See:

get()

Return type:

DataFrame

output_dir: Path = None#

The directory where to output the xmi files.

prime()[source]#
set_documents(docs)[source]#

Set the documents to be parsed by cTAKES.

Parameters:

docs (Iterable[str]) – an iterable of string text documents to persist to the file system, and then be parsed by cTAKES.
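The "persist to the file system" step can be pictured with the sketch below. It is an illustration only: the helper `persist_documents` and the `<index>.txt` naming scheme are assumptions, since this reference does not say how the stash names the files it writes to source_dir.

```python
import tempfile
from pathlib import Path
from typing import Iterable, List


def persist_documents(docs: Iterable[str], source_dir: Path) -> List[Path]:
    """Write each text document to its own file (hypothetical helper)."""
    source_dir.mkdir(parents=True, exist_ok=True)
    paths = []
    for i, text in enumerate(docs):
        # assumed naming scheme: 0.txt, 1.txt, ...
        path = source_dir / f'{i}.txt'
        path.write_text(text, encoding='utf-8')
        paths.append(path)
    return paths


with tempfile.TemporaryDirectory() as tmp:
    written = persist_documents(
        ['He has lung cancer.', 'No acute distress.'], Path(tmp) / 'notes')
    names = sorted(p.name for p in written)
    first_text = written[0].read_text(encoding='utf-8')
```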

source_dir: Path#

Contains a path to the source directory where the text documents live.

property source_stash: Stash#

The stash that tracks the text documents that are to be parsed by cTAKES.

zensols.mednlp.cui2vec#

Inheritance diagram of zensols.mednlp.cui2vec

This module contains the embedding subclass for cui2vec embeddings.

class zensols.mednlp.cui2vec.Cui2VecEmbedModel(name, cache=True, lowercase=False, path=None, installer=None, resource=None, dimension=500, vocab_size=109053)[source]#

Bases: TextWordEmbedModel

This class uses the pretrained cui2vec embeddings.

__init__(name, cache=True, lowercase=False, path=None, installer=None, resource=None, dimension=500, vocab_size=109053)#
dimension: str = 500#

The word vector dimension.

vocab_size: int = 109053#

Vocabulary size.

zensols.mednlp.domain#

Inheritance diagram of zensols.mednlp.domain

Contains the classes for the medical token type and others.

exception zensols.mednlp.domain.MedNLPError[source]#

Bases: APIError

Raised for any medical NLP specific reason in this library.

__module__ = 'zensols.mednlp.domain'#

zensols.mednlp.lib#

Inheritance diagram of zensols.mednlp.lib

Medical resource library that contains UMLS access, cui2vec, etc.

class zensols.mednlp.lib.MedicalLibrary(config_factory=None, medcat_resource=None, entity_linker_resource=None, uts_client=None)[source]#

Bases: object

A utility class that provides access to medical APIs.

__init__(config_factory=None, medcat_resource=None, entity_linker_resource=None, uts_client=None)#
config_factory: ConfigFactory = None#

The configuration factory used to create cTAKES and cui2vec instances.

property cui2vec_embedding: Cui2VecEmbedModel#

The cui2vec embedding model.

entity_linker_resource: EntityLinkerResource = None#

The entity linker resource.

get_atom(cui)[source]#

Get the UMLS atoms of a CUI from UTS.

Parameters:
  • cui (str) – the concept ID used to query

  • preferred – if True only return preferred atoms

Return type:

Dict[str, str]

Returns:

a list of atom entries in dictionary form

get_entities(text)[source]#

Return all concept entity data.

Return type:

Dict[str, Any]

Returns:

concepts as a multi-tiered dict

get_linked_entity(cui)[source]#

Get a scispaCy linked entity.

Parameters:

cui (str) – the unique concept ID

Return type:

Entity

get_new_ctakes_parser_stash()[source]#

Return a new instance of a ctakes parser stash.

Return type:

CTakesParserStash

get_relations(cui)[source]#

Get the UMLS related concepts connected to a concept by ID.

Parameters:

cui (str) – the concept ID used to get related concepts

Return type:

List[Dict[str, Any]]

Returns:

a list of relation entries in dictionary form in the order returned by UTS

medcat_resource: MedCatResource = None#

The MedCAT factory resource.

similarity_by_term(term, topn=5)[source]#

Return similarities of a medical term.

Parameters:
  • term (str) – the medical term (e.g. heart disease)

  • topn (int) – the top N count similarities to return

Return type:

List[‘EntitySimilarity’]

uts_client: UTSClient = None#

Queries UMLS data.

zensols.mednlp.parser#

Inheritance diagram of zensols.mednlp.parser

Medical language parser.

class zensols.mednlp.parser.MedCatFeatureDocumentParser(config_factory, name, lang='en', model_name=None, token_feature_ids=frozenset({'children', 'context_similarity', 'cui', 'cui_', 'definition_', 'dep', 'dep_', 'detected_name_', 'ent', 'ent_', 'ent_iob', 'ent_iob_', 'i', 'i_sent', 'idx', 'is_concept', 'is_contraction', 'is_ent', 'is_pronoun', 'is_punctuation', 'is_space', 'is_stop', 'is_superlative', 'is_wh', 'lemma_', 'lexspan', 'norm', 'norm_len', 'pos_', 'pref_name_', 'sent_i', 'shape', 'shape_', 'sub_names', 'tag', 'tag_', 'tui_descs_', 'tuis', 'tuis_'}), components=(), token_decorators=(), sentence_decorators=(), document_decorators=(), disable_component_names=None, token_normalizer=None, special_case_tokens=<factory>, doc_class=<class 'zensols.nlp.container.FeatureDocument'>, sent_class=<class 'zensols.nlp.container.FeatureSentence'>, token_class=<class 'zensols.mednlp.tok.MedicalFeatureToken'>, remove_empty_sentences=None, reload_components=False, auto_install_model=False, medcat_resource=None)[source]#

Bases: SpacyFeatureDocumentParser

A medical-based language resource that parses concepts.

TOKEN_FEATURE_IDS: ClassVar[Set[str]] = frozenset({'children', 'context_similarity', 'cui', 'cui_', 'definition_', 'dep', 'dep_', 'detected_name_', 'ent', 'ent_', 'ent_iob', 'ent_iob_', 'i', 'i_sent', 'idx', 'is_concept', 'is_contraction', 'is_ent', 'is_pronoun', 'is_punctuation', 'is_space', 'is_stop', 'is_superlative', 'is_wh', 'lemma_', 'lexspan', 'norm', 'norm_len', 'pos_', 'pref_name_', 'sent_i', 'shape', 'shape_', 'sub_names', 'tag', 'tag_', 'tui_descs_', 'tuis', 'tuis_'})#

Default token feature ID set for the medical parser.

__init__(config_factory, name, lang='en', model_name=None, token_feature_ids=frozenset({'children', 'context_similarity', 'cui', 'cui_', 'definition_', 'dep', 'dep_', 'detected_name_', 'ent', 'ent_', 'ent_iob', 'ent_iob_', 'i', 'i_sent', 'idx', 'is_concept', 'is_contraction', 'is_ent', 'is_pronoun', 'is_punctuation', 'is_space', 'is_stop', 'is_superlative', 'is_wh', 'lemma_', 'lexspan', 'norm', 'norm_len', 'pos_', 'pref_name_', 'sent_i', 'shape', 'shape_', 'sub_names', 'tag', 'tag_', 'tui_descs_', 'tuis', 'tuis_'}), components=(), token_decorators=(), sentence_decorators=(), document_decorators=(), disable_component_names=None, token_normalizer=None, special_case_tokens=<factory>, doc_class=<class 'zensols.nlp.container.FeatureDocument'>, sent_class=<class 'zensols.nlp.container.FeatureSentence'>, token_class=<class 'zensols.mednlp.tok.MedicalFeatureToken'>, remove_empty_sentences=None, reload_components=False, auto_install_model=False, medcat_resource=None)#
medcat_resource: MedCatResource = None#

The MedCAT factory resource.

token_class#

The class to use for instances created by features().

alias of MedicalFeatureToken

token_feature_ids: Set[str] = frozenset({'children', 'context_similarity', 'cui', 'cui_', 'definition_', 'dep', 'dep_', 'detected_name_', 'ent', 'ent_', 'ent_iob', 'ent_iob_', 'i', 'i_sent', 'idx', 'is_concept', 'is_contraction', 'is_ent', 'is_pronoun', 'is_punctuation', 'is_space', 'is_stop', 'is_superlative', 'is_wh', 'lemma_', 'lexspan', 'norm', 'norm_len', 'pos_', 'pref_name_', 'sent_i', 'shape', 'shape_', 'sub_names', 'tag', 'tag_', 'tui_descs_', 'tuis', 'tuis_'})#

The features to keep from spaCy tokens.

See:

TOKEN_FEATURE_IDS

zensols.mednlp.resource#

Inheritance diagram of zensols.mednlp.resource

MedCAT wrapper.

class zensols.mednlp.resource.MedCatResource(installer, vocab_resource, cdb_resource, mc_status_resource, umls_tuis, umls_groups, filter_tuis=None, filter_groups=None, spacy_enable_components=<factory>, cat_config=None, cache_global=True, requirements_dir=None, auto_install_models=())[source]#

Bases: object

A factory class that creates MedCAT resources.

__init__(installer, vocab_resource, cdb_resource, mc_status_resource, umls_tuis, umls_groups, filter_tuis=None, filter_groups=None, spacy_enable_components=<factory>, cat_config=None, cache_global=True, requirements_dir=None, auto_install_models=())#
auto_install_models: Tuple[str, ...] = ()#

A list of spaCy models that will be installed if not already.

cache_global: InitVar = True#

Whether or not to globally cache resources, which saves load time.

property cat: CAT#

The MedCAT NER tagger instance.

When this property is accessed, all models are downloaded first, then loaded, if not already.

cat_config: Dict[str, Dict[str, Any]] = None#

If provided, set the CDB configuration. Keys are general, preprocessing and all other attributes documented in the MedCAT Config.

cdb_resource: Resource#

The cdb-medmen-v1.dat file.

clear()[source]#
filter_groups: Set[str] = None#

Just like filter_tuis, but each element is treated as a group used to generate a list of CUIs from those mapped from name to tui in groups.

filter_tuis: Set[str] = None#

Types used to filter linked CUIs (e.g. {'T047', 'T048'}).
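A rough sketch of what filtering by TUI means, assuming each linked concept carries a tuple of TUIs. The function `filter_by_tuis`, the dict layout and the placeholder CUIs are hypothetical, and treating `None` as "no filtering" is an assumption based on the default value. In the library, the filtering is configured through MedCAT rather than applied by hand like this.

```python
from typing import Dict, Iterable, List, Optional, Set


def filter_by_tuis(concepts: Iterable[Dict[str, object]],
                   filter_tuis: Optional[Set[str]] = None
                   ) -> List[Dict[str, object]]:
    """Keep concepts that have at least one TUI in ``filter_tuis``."""
    if filter_tuis is None:
        # assumption: no filter set means every concept is kept
        return list(concepts)
    return [c for c in concepts if filter_tuis & set(c['tuis'])]


# placeholder concepts, for illustration only
concepts = [
    {'cui': 'C0000001', 'tuis': ('T047',)},
    {'cui': 'C0000002', 'tuis': ('T121', 'T109')},
    {'cui': 'C0000003', 'tuis': ('T048', 'T033')},
]
kept = [c['cui'] for c in filter_by_tuis(concepts, {'T047', 'T048'})]
```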

property groups: DataFrame#

A dataframe of TUIs, their abbreviations, descriptions and a group name associated with each.

installer: Installer#

Installs and provides paths to the model files.

mc_status_resource: Resource#

The mc_status directory.

requirements_dir: Path = None#

The directory with the pip requirements files.

spacy_enable_components: Set[str]#

By default, MedCAT disables several pipeline components. Some of these are needed for sentence chunking and other downstream tasks. Otherwise sentence indexing won’t work because sentence boundaries are missing.

See:

MedCAT Config

property tuis: Dict[str, str]#

A mapping of type identifiers (TUIs) to their descriptions.

umls_groups: Resource#

Like umls_tuis but groups TUIs in groups.

umls_tuis: Resource#

The UMLS TUIs (types) mapping resource that maps from TUIs to descriptions.

See:

Semantic Types

vocab_resource: Resource#

The path to the vocab.dat file.

zensols.mednlp.tok#

Inheritance diagram of zensols.mednlp.tok

Contains the classes for the medical token type.

class zensols.mednlp.tok.MedicalFeatureToken(spacy_token, norm, res, ix2ent)[source]#

Bases: SpacyFeatureToken

A set of token features that optionally contains a medical concept.

FEATURE_IDS: ClassVar[Set[str]] = frozenset({'context_similarity', 'cui', 'cui_', 'definition_', 'detected_name_', 'is_concept', 'pref_name_', 'sub_names', 'tui_descs_', 'tuis', 'tuis_'})#

All default available feature IDs.

FEATURE_IDS_BY_TYPE: ClassVar[Dict[str, Set[str]]] = {'bool': frozenset({'is_concept'}), 'float': frozenset({'context_similarity'}), 'int': frozenset({'cui'}), 'list': frozenset({'sub_names', 'tuis'}), 'str': frozenset({'cui_', 'definition_', 'detected_name_', 'pref_name_', 'tui_descs_', 'tuis_'})}#

Map of class type to set of feature IDs.
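When the type of a single feature is needed, the mapping can be inverted. The sketch below copies the value shown above into a local constant rather than importing it, so it runs without the package installed. The name `TYPE_BY_FEATURE_ID` is made up for this example.

```python
from typing import Dict, FrozenSet

# copied from the class attribute documented above
FEATURE_IDS_BY_TYPE: Dict[str, FrozenSet[str]] = {
    'bool': frozenset({'is_concept'}),
    'float': frozenset({'context_similarity'}),
    'int': frozenset({'cui'}),
    'list': frozenset({'sub_names', 'tuis'}),
    'str': frozenset({'cui_', 'definition_', 'detected_name_',
                      'pref_name_', 'tui_descs_', 'tuis_'}),
}

# invert: feature ID -> type name (hypothetical helper mapping)
TYPE_BY_FEATURE_ID: Dict[str, str] = {
    feature_id: type_name
    for type_name, feature_ids in FEATURE_IDS_BY_TYPE.items()
    for feature_id in feature_ids
}
```

With this, `TYPE_BY_FEATURE_ID['cui']` is `'int'` and `TYPE_BY_FEATURE_ID['cui_']` is `'str'`, which mirrors the convention that the underscore-suffixed feature is the string form.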

WRITABLE_FEATURE_IDS: ClassVar[Tuple[str, ...]] = ('text', 'norm', 'idx', 'sent_i', 'i', 'i_sent', 'tag', 'pos', 'is_wh', 'entity', 'dep', 'children', 'cui_')#

Feature IDs that are dumped on write() and write_attributes().

__init__(spacy_token, norm, res, ix2ent)[source]#
property context_similarity: float#

The similarity of the concept.

property cui: int#

Returns the numeric part of the concept ID.
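As an illustration of the relationship between cui_ and cui, the sketch below strips the leading letter from a UMLS concept ID, which is a `C` followed by seven digits. The function `cui_to_int` is hypothetical; how the property computes its value internally is not stated in this reference.

```python
import re

# a UMLS CUI: the letter C followed by exactly seven digits
_CUI_PATTERN = re.compile(r'C(\d{7})')


def cui_to_int(cui_: str) -> int:
    """Return the numeric part of a CUI string (hypothetical helper)."""
    m = _CUI_PATTERN.fullmatch(cui_)
    if m is None:
        raise ValueError(f'not a UMLS CUI: {cui_!r}')
    return int(m.group(1))


# 'C0242379' is the example ID used elsewhere in this reference
numeric = cui_to_int('C0242379')  # → 242379
```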

property cui_: str#

The unique UMLS concept ID.

property definition_: str#

The definition of the concept.

property detected_name_: str#

The detected name of the concept.

property ent: int#

Return the entity numeric value or 0 if this is not an entity.

property ent_: str#

Return the entity string label or None if this token has no entity.

property is_concept: bool#

True if this has a CUI and identifies a medical concept.

property pref_name_: str#

The preferred name of the concept.

property sub_names: Tuple[str, ...]#

Return other names for the concept.

property tui_descs_: str#

Descriptions of tuis_.

property tuis: Tuple[str, ...]#

The CUI type of the concept.

property tuis_: str#

All CUI TUIs (types) of the concept sorted as a comma delimited list.
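The sorted, comma-delimited form can be sketched as follows. The helper name is made up, and the exact separator (a bare comma with no space) is an assumption.

```python
from typing import Iterable


def join_tuis(tuis: Iterable[str]) -> str:
    """Join TUIs in sorted order as a comma-delimited string."""
    return ','.join(sorted(tuis))


joined = join_tuis(('T191', 'T047'))  # → 'T047,T191'
```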

zensols.mednlp.uts#

Inheritance diagram of zensols.mednlp.uts

Interface to the UMLS Terminology Services (UTS) RESTful service, which was taken from the UTS example repo.

See:

UTS GitHub repo

class zensols.mednlp.uts.Authentication(api_key, auth_endpoint='/cas/v1/api-key')[source]#

Bases: object

A utility class to manage the authentication with the UTS system.

AUTH_URI = 'https://utslogin.nlm.nih.gov'#

The authentication service endpoint URL.

SERVICE = 'http://umlsks.nlm.nih.gov'#

The service endpoint URL.

__init__(api_key, auth_endpoint='/cas/v1/api-key')#
api_key: str#

The API key used for the RESTful NIH service.

auth_endpoint: str = '/cas/v1/api-key'#

The path of the authentication service endpoint.
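The sketch below shows how the AUTH_URI constant and the auth_endpoint path would combine into a full URL. It assumes the two are simply joined, which this reference does not confirm, and it makes no network request. The function name `auth_url` is hypothetical.

```python
from urllib.parse import urljoin

# values copied from the attributes documented above
AUTH_URI = 'https://utslogin.nlm.nih.gov'
DEFAULT_AUTH_ENDPOINT = '/cas/v1/api-key'


def auth_url(auth_endpoint: str = DEFAULT_AUTH_ENDPOINT) -> str:
    """Build the full authentication URL (assumed composition)."""
    return urljoin(AUTH_URI, auth_endpoint)


url = auth_url()
```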

getst(tgt)[source]#
gettgt()[source]#
exception zensols.mednlp.uts.AuthenticationError(api_key)[source]#

Bases: UTSError

Thrown when authentication fails.

__annotations__ = {}#
__init__(api_key)[source]#
__module__ = 'zensols.mednlp.uts'#
exception zensols.mednlp.uts.NoResultsError[source]#

Bases: UTSError

Thrown when no results, usually for a CUI not found.

__annotations__ = {}#
__module__ = 'zensols.mednlp.uts'#
class zensols.mednlp.uts.UTSClient(api_key, version='2020AA', request_stash=None)[source]#

Bases: object

MISSING_VALUE = '<missing>'#

Value to store in the stash when there is a missing CUI.
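One way to read this: a sentinel lets a cache tell "looked up, nothing found" apart from "never looked up", so a missing CUI is not requested again. The sketch below shows that pattern with a plain dict standing in for the stash. The class `CachingLookup` and the fake backend are hypothetical and are not how UTSClient is implemented.

```python
from typing import Callable, Dict, Optional

MISSING_VALUE = '<missing>'


class CachingLookup:
    """Cache lookups, recording misses with a sentinel (illustration only)."""

    def __init__(self, fetch: Callable[[str], Optional[dict]]):
        self._fetch = fetch
        self._stash: Dict[str, object] = {}
        self.fetch_count = 0

    def get(self, cui: str) -> Optional[dict]:
        if cui not in self._stash:
            # only reach the backend the first time a CUI is requested
            self.fetch_count += 1
            result = self._fetch(cui)
            self._stash[cui] = MISSING_VALUE if result is None else result
        cached = self._stash[cui]
        return None if cached == MISSING_VALUE else cached


# fake backend with a single made-up entry
backend = {'C0242379': {'name': 'example entry'}}
lookup = CachingLookup(backend.get)
first = lookup.get('C9999999')
second = lookup.get('C9999999')
found = lookup.get('C0242379')
```

Both requests for the unknown ID return `None`, but the backend is consulted only once for it.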

NO_RESULTS_ERR = 'No results containing all your search terms were found.'#

Error message from UTS indicating a missing CUI.

REL_ID_REGEX = re.compile('.*CUI\\/(.+)$')#

Used to parse related CUIs in get_related_cuis().
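To show what this pattern captures, the sketch below applies it to a URI that ends in `CUI/<id>`. The sample URI is an assumed shape chosen for illustration, not a value taken from UTS output, and `related_cui` is a hypothetical helper.

```python
import re
from typing import Optional

# pattern copied from the class attribute documented above
REL_ID_REGEX = re.compile('.*CUI\\/(.+)$')


def related_cui(uri: str) -> Optional[str]:
    """Return the text after the last 'CUI/' in ``uri``, if any."""
    m = REL_ID_REGEX.match(uri)
    return None if m is None else m.group(1)


# assumed URI shape, for illustration only
sample = 'https://uts-ws.nlm.nih.gov/rest/content/2020AA/CUI/C0242379'
extracted = related_cui(sample)  # → 'C0242379'
```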

URI = 'https://uts-ws.nlm.nih.gov'#

The service URL endpoint.

__init__(api_key, version='2020AA', request_stash=None)#
api_key: str#

The API key used for the RESTful NIH service.

get_atoms(cui, preferred=True, expect=True)[source]#

Get the UMLS atoms of a CUI from UTS.

Parameters:
  • cui (str) – the concept ID used to query

  • preferred (bool) – if True only return preferred atoms

Return type:

Union[Dict[str, str], List[Dict[str, str]]]

Returns:

a list of atom entries in dictionary form, or a single dict if preferred is True

Get the UMLS related concept IDs connected to a concept by ID.

Parameters:

cui (str) – the concept ID used to get related concepts

Return type:

List[Tuple[str, Dict[str, Any]]]

Returns:

a list of tuples, each the related CUIs and the relation entry, in the order returned by UTS

get_relations(cui, expect=True)[source]#

Get the UMLS related concepts connected to a concept by ID.

Parameters:

cui (str) – the concept ID used to get related concepts

Return type:

List[Dict[str, Any]]

Returns:

a list of relation entries in dictionary form in the order returned by UTS

request_stash: Stash = None#
search_term(term, pages=1)[source]#

Search for a string term in UMLS.

Parameters:

term (str) – the string term to match against

Return type:

List[Dict[str, str]]

Returns:

a list (one for each page), each with a dictionary of matching terms that have the name of the term, the ui (CUI), the uri of the term and the rootSource of the originating system

version: str = '2020AA'#

The version of UMLS we want.

exception zensols.mednlp.uts.UTSError[source]#

Bases: MedNLPError

An error thrown by the wrapper of the UTS system.

__annotations__ = {}#
__module__ = 'zensols.mednlp.uts'#

Module contents#

zensols.mednlp.surpress_warnings()[source]#

Suppress future warnings generated by spaCy and ScispaCy models.