Inferencing and Training Large Language Model Tasks

PyPI Python 3.11

A large language model (LLM) API for task-specific training and inferencing. The API provides utility classes and configuration that streamline a project's access to machine-readable LLM responses, such as querying the LLM to produce JSON output, even when that output is truncated by output token limits. The package provides an API to train and inference with LLMs both as pretrained embeddings and as instruct models.

Features:

  • Create new LLMs with configuration without having to write code.

  • Three examples of “code-less” models: sentiment analysis, NER tagging and generation.

  • Cache LLM responses to avoid recomputing potentially costly prompts (optional feature).

  • Command-line interface to inference, pre-train and post-train LLMs.

  • Advanced API to read responses, accepting partial output truncated at max-token cutoffs.

  • Chat template integration when supported.

  • Extendable LLM interfaces with built-in support for Llama 3.

  • Easy-to-configure datasets processed by model trainers.
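The partial-output feature above concerns JSON that is cut off at the model's max-token limit. The package's own reader is configuration driven; as a rough illustration of the general technique only, the hypothetical helper below (not part of the package API) closes an unterminated string and balances any open brackets so the fragment parses:

```python
import json

def repair_partial_json(text: str) -> str:
    """Best-effort completion of truncated JSON by closing an open
    string and any unclosed arrays or objects (a sketch only)."""
    stack = []          # open brackets in order of appearance
    in_str = False      # currently inside a double-quoted string
    escape = False      # previous character was a backslash
    for ch in text:
        if in_str:
            if escape:
                escape = False
            elif ch == '\\':
                escape = True
            elif ch == '"':
                in_str = False
        elif ch == '"':
            in_str = True
        elif ch in '[{':
            stack.append(ch)
        elif ch in ']}':
            stack.pop()
    if in_str:
        text += '"'
    # close remaining open containers in reverse order of opening
    closers = {'[': ']', '{': '}'}
    return text + ''.join(closers[c] for c in reversed(stack))

truncated = '[{"label": "ORG", "span": [0, 3], "text": "UI'
print(json.loads(repair_partial_json(truncated)))
```

The repaired fragment parses to a single-element list whose `text` value is the truncated `"UI"`; downstream code can then decide whether a partial record is usable.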

Documentation

See the full documentation. The API reference is also available.

Obtaining

The library can be installed with pip from the PyPI repository:

pip3 install zensols.lmtask

A Conda environment can also be created with the environment.yml:

conda env create -f src/python/environment.yml

Usage

The package can be used from the command line to inference and train new models, or programmatically as an API.

Command Line

The command line can be used for inferencing. List the available tasks:

lmtask task

Generate text by inferencing using the Llama base model:

lmtask stream base_generate 'in a world long long away' \
  --override=lmtask_model_generate_args.temperature=0.9

Use named entity recognition (NER):

lmtask instruct ner 'UIC is in Chicago.' 
model_output_json:
    label: ORG
    span: [0, 3]
    text: UIC
    label: O
    span: [4, 6]
    text: is
    label: O
    span: [7, 9]
    text: in
    label: LOC
    span: [10, 15]
    text: Chicago
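The NER task returns a per-token tagging like the one above. Assuming `model_output_json` parses to a list of token dictionaries in that shape (an assumption for illustration; the hypothetical `extract_entities` helper below is not part of the package), the non-entity `O` tags can be filtered out to leave just the named entities:

```python
def extract_entities(tokens: list) -> list:
    """Drop 'O' tags, keeping (label, text, span) for each entity token."""
    return [(t['label'], t['text'], tuple(t['span']))
            for t in tokens if t['label'] != 'O']

# token dictionaries mirroring the command-line output above
tokens = [
    {'label': 'ORG', 'span': [0, 3], 'text': 'UIC'},
    {'label': 'O', 'span': [4, 6], 'text': 'is'},
    {'label': 'O', 'span': [7, 9], 'text': 'in'},
    {'label': 'LOC', 'span': [10, 15], 'text': 'Chicago'},
]
print(extract_entities(tokens))
# → [('ORG', 'UIC', (0, 3)), ('LOC', 'Chicago', (10, 15))]
```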

The command-line program can be used to train new models with just configuration. See the trainconf directory for example configuration files. Before you train, you might want to get a sample of the configured dataset:

lmtask -c trainconf/imdb.yml sample -m 1
________________________________________
<|begin_of_text|><|start_header_id|>system<|end_header_id|>

Cutting Knowledge Date: December 2023
Today Date: 26 Jul 2024

You are a sentiment classifier.<|eot_id|><|start_header_id|>user<|end_header_id|>

Only output the sentiment.

### Review:We always watch American movies with their particular...
### Sentiment:```positive```<|eot_id|>

To train a new sentiment model on the IMDB dataset:

lmtask -c trainconf/imdb.yml train

Python API

The Python API can be used to access tasks directly.

>>> import json
>>> from zensols.lmtask import ApplicationFactory
>>> from zensols.lmtask import InstructTaskRequest

# create the task factory
>>> fac = ApplicationFactory.get_task_factory()

# list configured tasks
>>> fac.write(short=True)
base_generate (base generate text)
instruct_generate (base generate text)
ner (tags named entities)
sentiment (classifies sentiment)

# create a sentiment analysis task
>>> task = fac.create('sentiment')

# inference
>>> sents = 'I love football.\nI hate olives.\nEarth is big.'
>>> res = task.process(InstructTaskRequest(instruction=sents))

# print the JSON result as formatted text
>>> print(json.dumps(res.model_output_json, indent=4))
[
    {
        "index": 0,
        "sentence": "I love football.",
        "label": "+"
    },
    {
        "index": 1,
        "sentence": "I hate olives.",
        "label": "-"
    },
    {
        "index": 2,
        "sentence": "Earth is big.",
        "label": "n"
    }
]

Datasets

The package features easy-to-configure datasets and data processing on them. For example, the following is taken from the IMDB training configuration example:

# a dataset factory instance used by the trainer (lmtask_trainer_hf)
lmtask_imdb_source:
  class_name: zensols.lmtask.dataset.LoadedTaskDatasetFactory
  # the dataset name (downloaded if not already); this can be a `pathlib.Path`,
  # Pandas dataframe or Zensols Stash
  source: stanfordnlp/imdb
  # use only the training split
  load_args:
    split: train
  # the task that consumes the data, which will format each datapoint
  # specifically for that task's model
  task: 'instance: lmtask_task_imdb'
  # preprocessing Python source code to add labels and subset the data (db.select)
  pre_process: |-
    ds = ds.map(lambda x: {'output': 'positive' if x['label'] == 1 else 'negative'})
    ds = ds.rename_column('text', 'instruction')
    ds = ds.shuffle(seed=0)
    # 7K takes 55m
    ds = ds.select(range(7_000))
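The `pre_process` block is plain Python run against the loaded dataset bound to `ds`. As a rough illustration of what those calls accomplish, the same transformations applied to a plain list of rows (a sketch in standard-library Python, not the dataset library the trainer actually uses, with made-up toy reviews) look like:

```python
import random

# toy rows standing in for the IMDB split (label 1 = positive)
rows = [
    {'text': 'Loved every minute.', 'label': 1},
    {'text': 'A tedious mess.', 'label': 0},
]

# map: derive an 'output' field from the integer label
rows = [dict(r, output='positive' if r['label'] == 1 else 'negative')
        for r in rows]
# rename_column: 'text' becomes 'instruction'
rows = [{('instruction' if k == 'text' else k): v for k, v in r.items()}
        for r in rows]
# shuffle with a fixed seed, then select a subset
random.Random(0).shuffle(rows)
rows = rows[:7_000]
print(sorted(rows[0].keys()))
# → ['instruction', 'label', 'output']
```

Each resulting row carries the `instruction`/`output` pair the task formats into the model's chat template, as shown in the sample output earlier.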

Changelog

An extensive changelog is available here.

Community

Please star this repository and let me know how and where you use this API. Contributions as pull requests, feedback, and any other input are welcome.

License

MIT License

Copyright (c) 2025 Paul Landes