Medical NLP and Utility API#

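This API primarily wraps other APIs with the Zensols Framework to provide an easy and reproducible way to use and experiment with medical and clinical natural language text. It provides the following functionality, each covered by a section below: medical concept and entity linking, UMLS access via UTS, CUIs as word embeddings (cui2vec), and entity linking with cTAKES.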

The rest of this document is structured as a cookbook-style tutorial. Each sub-section describes one of the examples in the examples directory.

Important: many of the examples use the UMLS UTS service, which requires a key provided by NIH. If you do not have a key, request one and add it to the UTS key file.

Medical Concept and Entity Linking#

Concept linking with CUIs is provided using the same interface as the Zensols NLP parsing API. The resource library provided with this package creates a mednlp_doc_parser, as shown in the entity example. We start with the configuration file, features.conf, which begins by telling the CLI to import the Zensols NLP package and this (zensols.mednlp) package:

[import]
sections = list: imp_conf

[imp_conf]
type = importini
config_files = list:
    resource(zensols.nlp): resources/obj.conf,
    resource(zensols.nlp): resources/mapper.conf,
    resource(zensols.mednlp): resources/lang.conf

Next, configure the parser with specific features; otherwise, the parser retains all medical and non-medical features:

[mednlp_doc_parser]
token_feature_ids = set: norm, is_ent, cui, cui_, pref_name_, detected_name_, is_concept, ent_, ent
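
The set: prefix in token_feature_ids indicates that the comma-separated value should become a set of feature names. The sketch below only imitates that idea with the Python standard library so the resulting value is easy to picture. The parse_set_option helper is hypothetical and is not how the Zensols framework parses its configuration.

```python
import configparser

# The same section as above, inlined as a string for the illustration.
CONFIG = """
[mednlp_doc_parser]
token_feature_ids = set: norm, is_ent, cui, cui_, pref_name_, detected_name_, is_concept, ent_, ent
"""


def parse_set_option(raw: str) -> set:
    """Hypothetical helper: turn 'set: a, b, c' into {'a', 'b', 'c'}."""
    prefix, _, body = raw.partition(':')
    if prefix.strip() != 'set':
        raise ValueError(f'expected a "set:" prefix, got: {raw!r}')
    return {item.strip() for item in body.split(',') if item.strip()}


parser = configparser.ConfigParser()
parser.read_string(CONFIG)
feature_ids = parse_set_option(parser['mednlp_doc_parser']['token_feature_ids'])
print(sorted(feature_ids))
```

Only the features named in this set would be kept on each token. Everything else the parser could produce is dropped.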

Finally, declare the application, which is needed by the CLI glue code to invoke the class we will write afterward:

[app]
class_name = ${program:name}.Application
doc_parser = instance: mednlp_doc_parser

Next comes the application class:

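from dataclasses import dataclass, field
import itertools as it
from zensols.nlp import FeatureDocument, FeatureDocumentParser


@dataclass
class Application(object):
    # the parser instance set from the doc_parser entry of the [app] section
    doc_parser: FeatureDocumentParser = field()

    def show(self, sent: str = None):
        if sent is None:
            sent = 'He was diagnosed with kidney failure in the United States.'
        # parse the text into a document of tokens with their features
        doc: FeatureDocument = self.doc_parser(sent)
        print('first three tokens:')
        for tok in it.islice(doc.token_iter(), 3):
            print(tok.norm)
            tok.write_attributes(1, include_type=False)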

This uses the document parser to create the feature document, which has both the medical and linguistic features in tokens (provided by token_iter()) of the document.

Use the CLI API in the entry point to use the configuration and application class:

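from zensols.cli import CliHarness, ProgramNameConfigurator

if (__name__ == '__main__'):
    CliHarness(
        # the configuration file described above
        app_config_resource='features.conf',
        # sets the program name used by ${program:name} in the [app] section
        app_config_context=ProgramNameConfigurator(
            None, default='features').create_section(),
        proto_args='',
    ).run()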

Running the program prints the attributes of the first three tokens. The output for the third token is:

...
diagnosed
    cui=11900
    cui_=C0011900
    detected_name_=diagnosed
    ent=13188083023294932426
    ent_=concept
    i=2
    i_sent=2
    idx=7
    is_concept=True
    is_ent=True
    norm=diagnosed
    pref_name_=Diagnosis
...
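
In the output above, cui (11900) appears to be the integer form of the string identifier cui_ (C0011900). UMLS CUIs are written as the letter C followed by seven digits, so the two forms can be converted as sketched below. The helper names are hypothetical, and the relationship between the two features is inferred from this output rather than taken from the library's documentation.

```python
def cui_to_int(cui: str) -> int:
    """Strip the leading 'C' and parse the remaining digits."""
    if not (cui.startswith('C') and cui[1:].isdigit()):
        raise ValueError(f'not a CUI: {cui!r}')
    return int(cui[1:])


def int_to_cui(num: int) -> str:
    """Zero-pad to seven digits and prepend 'C'."""
    return f'C{num:07d}'


print(cui_to_int('C0011900'))  # 11900
print(int_to_cui(11900))       # C0011900
```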

See the full entity example for the complete code, which also outputs both the linguistic and medical features as a Pandas data frame.

UMLS Access via UTS#

NIH provides a very rough REST client, written with the requests library, as an example. This API takes that example and adds some “rigor” and structure in an easy-to-use class called UTSClient. It is configured by first defining the paths where fetched entities are cached:

[default]
# root directory given by the application, which is the parent directory
root_dir = ${appenv:root_dir}/..
# the directory to hold the cached UMLS data
cache_dir = ${root_dir}/cache

Next, import this package’s resource library (zensols.mednlp). Note that the references entry must list the sections whose data is substituted in this file (here uts and default):

[import]
references = list: uts, default
sections = list: imp_uts_key, imp_conf

[imp_conf]
type = importini
config_file = resource(zensols.mednlp): resources/uts.conf

[imp_uts_key]
type = json
default_section = uts
config_file = ${default:root_dir}/uts-key.json

The imp_uts_key section points to the file where you add your UTS key, which is given by NIH.

Now indicate where to cache the UMLS data before defining the application we will write afterward:

# UTS (UMLS access)
[uts]
cache_file = ${default:cache_dir}/uts-request.dat
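
To illustrate why a cache file is useful here, the sketch below stores the result of a (fake) remote request on disk so that a repeated lookup does not trigger a second request. It uses the standard library's shelve module and hypothetical names (cached_fetch, fake_request). It shows the general idea only and makes no claim about how UTSClient stores its data.

```python
import shelve
import tempfile
from pathlib import Path
from typing import Callable, Dict, List


def cached_fetch(cache_file: Path, key: str,
                 fetch: Callable[[str], Dict[str, str]]) -> Dict[str, str]:
    """Return the cached value for ``key``, calling ``fetch`` only on a miss."""
    with shelve.open(str(cache_file)) as db:
        if key not in db:
            db[key] = fetch(key)
        return db[key]


calls: List[str] = []


def fake_request(term: str) -> Dict[str, str]:
    """Stand-in for a remote call; records each invocation."""
    calls.append(term)
    return {'name': term.capitalize()}


with tempfile.TemporaryDirectory() as tmp:
    cache_file = Path(tmp) / 'uts-request.dat'
    first = cached_fetch(cache_file, 'heart', fake_request)
    second = cached_fetch(cache_file, 'heart', fake_request)

# the second lookup is served from disk, so only one request was made
print(first, second, len(calls))
```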

For brevity, the CLI application code and configuration are omitted (see Medical Concept and Entity Linking for more detail).

To use the API to first search a term, then print entity information, we can use the search_term method with get_atoms:

from dataclasses import dataclass
from pprint import pprint
from typing import Dict, List


@dataclass
class Application(object):
    ...
    def lookup(self, term: str = 'heart'):
        # terms are returned as a list of pages with dictionaries of data
        pages: List[Dict[str, str]] = self.uts_client.search_term(term)
        # get all term dictionaries from the first page
        terms: Dict[str, str] = pages[0]
        # get the concept unique identifier
        cui: str = terms['ui']

        # print atoms of this concept
        print('atoms:')
        pprint(self.uts_client.get_atoms(cui))

This yields the following output:

atoms:
{'ancestors': None,
 'classType': 'Atom',
 'code': 'https://uts-ws.nlm.nih.gov/rest/content/2020AA/source/MTH/NOCODE',
 'concept': 'https://uts-ws.nlm.nih.gov/rest/content/2020AA/CUI/C0018787',
 'contentViewMemberships': [{'memberUri': 'https://uts-ws.nlm.nih.gov/rest/content-views/2020AA/CUI/C1700357/member/A0066369',
                             'name': 'MetaMap NLP View',
                             'uri': 'https://uts-ws.nlm.nih.gov/rest/content-views/2020AA/CUI/C1700357'}],
 'name': 'Heart',
 'obsolete': 'false',
 'rootSource': 'MTH',
...
}
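
The atom dictionary is plain Python data, so individual fields can be picked out directly. As a small sketch, the hypothetical helpers below recover the CUI from the concept URL and interpret the obsolete flag, which is a string in the output above. The dictionary is an abridged copy of that output.

```python
from urllib.parse import urlparse

# Abridged copy of the atom dictionary printed above.
atom = {
    'classType': 'Atom',
    'concept': 'https://uts-ws.nlm.nih.gov/rest/content/2020AA/CUI/C0018787',
    'name': 'Heart',
    'obsolete': 'false',
    'rootSource': 'MTH',
}


def concept_cui(atom: dict) -> str:
    """Return the last path segment of the atom's ``concept`` URL."""
    return urlparse(atom['concept']).path.rstrip('/').rsplit('/', 1)[-1]


def is_obsolete(atom: dict) -> bool:
    """Interpret the ``obsolete`` field, given as a string, as a boolean."""
    return atom['obsolete'].lower() == 'true'


print(concept_cui(atom), is_obsolete(atom))
```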

See the full UTS example for the full example code.

Using CUI as Word Embeddings#

cui2vec was trained, and can be used, in the same way as word2vec. One such use is computing the similarity between UMLS CUIs. This API provides direct access to the vectors along with all the functionality available for cui2vec through the gensim package. This example computes the similarity between medical concepts. For brevity, the CLI application code and configuration are omitted (see UMLS Access via UTS for more detail).

Let’s jump right to how we import everything we need for the cui2vec example, which are the uts and cui2vec resource libraries:

[imp_conf]
type = importini
config_files = list:
    resource(zensols.mednlp): resources/uts.conf,
    resource(zensols.mednlp): resources/cui2vec.conf

The UTS configuration is given as in the UMLS Access via UTS section and the parser is configured as in the Medical Concept and Entity Linking section.

With the high-level classes provided by the configuration, the application class looks similar to what we have seen before. This time we define a similarity method (CLI action):

@dataclass
class Application(object):
    def similarity(self, term: str = 'heart disease', topn: int = 5):

Next, get the gensim KeyedVectors instance, which provides (among many other useful methods) one to compute the similarity between two words, or in our case, two medical CUIs:

        embedding: Cui2VecEmbedModel = self.cui2vec_embedding
        kv: KeyedVectors = embedding.keyed_vectors

Next we use UTS to get the term we’re searching on, use gensim to find similarities, and output them:

        # search for the term and take the CUI of the first result
        res: List[Dict[str, str]] = self.uts_client.search_term(term)
        cui: str = res[0]['ui']
        # the scores are similarities between vectors, not probabilities
        sims_by_word: List[Tuple[str, float]] = kv.similar_by_word(cui, topn)
        for rel_cui, sim in sims_by_word:
            # look up the name of each related concept
            rel_atom: Dict[str, str] = self.uts_client.get_atoms(rel_cui)
            rel_name = rel_atom.get('name', 'Unknown')
            print(f'{rel_name} ({rel_cui}): {sim * 100:.2f}%')

The output of a run with the search term heart contains the top five (topn) matches and their similarity to the search term:

Heart failure (C0018801): 72.03%
Atrial Premature Complexes (C0033036): 71.53%
Chronic myocardial ischemia (C0264694): 69.68%
Right bundle branch block (C0085615): 69.34%
First degree atrioventricular block (C0085614): 69.09%
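
The percentages above are similarity scores between embedding vectors. In gensim, similar_by_word ranks entries by cosine similarity. The sketch below reproduces that ranking in pure Python on toy data. The vectors and their single-letter keys are made up for illustration, whereas real cui2vec vectors are much larger and are keyed by CUIs.

```python
import math
from typing import Dict, List, Tuple

# Toy three-dimensional vectors keyed by made-up identifiers.
VECTORS: Dict[str, List[float]] = {
    'A': [1.0, 0.0, 0.0],
    'B': [0.9, 0.1, 0.0],
    'C': [0.0, 1.0, 0.0],
    'D': [-1.0, 0.0, 0.0],
}


def cosine(u: List[float], v: List[float]) -> float:
    """Cosine similarity: dot product divided by the product of the norms."""
    dot = sum(a * b for a, b in zip(u, v))
    return dot / (math.hypot(*u) * math.hypot(*v))


def similar_by_key(key: str, topn: int) -> List[Tuple[str, float]]:
    """Rank every other key by cosine similarity to ``key``, best first."""
    query = VECTORS[key]
    scored = [(k, cosine(query, v)) for k, v in VECTORS.items() if k != key]
    return sorted(scored, key=lambda pair: pair[1], reverse=True)[:topn]


for other, sim in similar_by_key('A', topn=2):
    print(f'{other}: {sim * 100:.2f}%')
```

Vectors that point in nearly the same direction score close to 100%, unrelated (orthogonal) vectors score 0%, and opposite vectors score -100%.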

See the full cui2vec example for the full example code.

Entity Linking with cTAKES#

This package provides an interface to cTAKES, which primarily manages the file system and invokes the Java program to produce results. It then uses the ctakes-parser to create a data frame of features and linked entities from tokens of the source text.

The configuration is a bit more involved since you have to indicate where the cTAKES program is installed, and provide your NIH key as detailed in the UMLS Access via UTS section:

[import]
# refer to sections for which we need substitution in this file
references = list: default, ctakes, uts
sections = list: imp_env, imp_uts_key, imp_conf

# expose the user HOME environment variable
[imp_env]
type = environment
section_name = env
includes = set: HOME

# import the zensols.mednlp UTS and cTAKES resource libraries
[imp_conf]
type = importini
config_files = list:
    resource(zensols.mednlp): resources/uts.conf,
    resource(zensols.mednlp): resources/ctakes.conf

# indicate where Apache cTAKES is installed
[ctakes]
home = ${env:home}/opt/app/ctakes-4.0.0.1
source_dir = ${default:cache_dir}/ctakes/source
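
The ${section:option} references in this configuration are substituted with values from other sections. The standard library's configparser offers a similar syntax through ExtendedInterpolation, which the sketch below uses to show how the two ctakes entries would resolve. The env and default values are placeholder paths chosen for the illustration, and the sketch is not a substitute for the framework's own configuration loader.

```python
import configparser

# Placeholder values for the sections that the [ctakes] entries refer to.
CONFIG = """
[env]
home = /home/user

[default]
cache_dir = /tmp/cache

[ctakes]
home = ${env:home}/opt/app/ctakes-4.0.0.1
source_dir = ${default:cache_dir}/ctakes/source
"""

parser = configparser.ConfigParser(
    interpolation=configparser.ExtendedInterpolation())
parser.read_string(CONFIG)

ctakes_home = parser['ctakes']['home']
source_dir = parser['ctakes']['source_dir']
print(ctakes_home)
print(source_dir)
```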

For brevity, the CLI application code and configuration are omitted, as is other configuration given in previous sections (see UMLS Access via UTS for more detail). See the full cTAKES example for the full example code.

The pertinent snippet to get the Pandas data frame from the medical text is very simple:

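from dataclasses import dataclass
from pathlib import Path
import pandas as pd


@dataclass
class Application(object):
    def entities(self, sent: str = None, output: Path = None):
        if sent is None:
            sent = 'He was diagnosed with kidney failure in the United States.'
        # give cTAKES the text to parse, one document per list element
        self.ctakes_stash.set_documents([sent])
        # get the data frame of the first document by its (string) ID
        df: pd.DataFrame = self.ctakes_stash['0']
        print(df)
        if output is not None:
            df.to_csv(output)
            print(f'wrote: {output}')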

The set_documents method expects a list of texts, which are saved to disk, one file per element in the list. cTAKES is then run on the directory where these files are saved. Indexing the Stash retrieves the first document by its element ID. Note: the element ID has to be a string to follow the Stash API.
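
The one-file-per-document layout with string element IDs can be pictured with the following sketch. The write_documents helper and the index-based file naming scheme are hypothetical. This is not the stash's implementation and only mirrors the behavior described above.

```python
import tempfile
from pathlib import Path
from typing import Dict, List


def write_documents(source_dir: Path, docs: List[str]) -> Dict[str, Path]:
    """Write each text to its own file and return paths keyed by string index."""
    source_dir.mkdir(parents=True, exist_ok=True)
    paths: Dict[str, Path] = {}
    for i, text in enumerate(docs):
        # hypothetical naming scheme: <index>.txt
        path = source_dir / f'{i}.txt'
        path.write_text(text, encoding='utf-8')
        paths[str(i)] = path
    return paths


with tempfile.TemporaryDirectory() as tmp:
    sent = 'He was diagnosed with kidney failure in the United States.'
    paths = write_documents(Path(tmp) / 'ctakes' / 'source', [sent])
    # the first document is looked up with the string key '0', not the int 0
    first_text = paths['0'].read_text(encoding='utf-8')

print(sorted(paths), first_text == sent)
```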