# Inferencing and Training Large Language Model Tasks

[![PyPI][pypi-badge]][pypi-link]
[![Python 3.11][python311-badge]][python311-link]

A large language model (LLM) API to train and inference models for specific
tasks.  The API provides utility classes and configuration that streamline a
project's access to LLM responses that (can be) machine readable, such as
querying the LLM to produce JSON output--even if that output is partial due to
output token limits.  The package provides an API to train and interface with
LLMs as both pretrained embeddings and instruct models.

Features:

* Create new LLMs with configuration without having to write code.
* [Three examples](resources/tasks.yml) of "code-less" models: sentiment
  analysis, NER tagging and generation.
* Cache LLM responses to avoid recomputing potentially costly prompts
  (optional feature).
* [Command-line](#command-line) interface to inference, pre-train and
  post-train LLMs.
* [Advanced API](#python-api) to read responses and accept partial output at
  max token cutoffs.
* Chat template integration when supported.
* Extensible LLM interfaces with built-in support for Llama 3.
* [Easy to configure datasets](#datasets) processed by model trainers.


## Documentation

See the [full documentation](https://plandes.github.io/lmtask/index.html).
The [API reference](https://plandes.github.io/lmtask/api.html) is also
available.


## Obtaining

The library can be installed with pip from the [pypi] repository:
```bash
pip3 install zensols.lmtask
```

A Conda environment can also be created with the
[environment.yml](src/python/environment.yml):
```bash
conda env create -f src/python/environment.yml
```


## Usage

The package can be used from the command line to inference and train new
models, or programmatically as an API.


### Command Line

First, list the available tasks:
```bash
lmtask task
```

Generate text by inferencing with the Llama base model:
```bash
lmtask stream base_generate 'in a world long long away' \
    --override=lmtask_model_generate_args.temperature=0.9
```

Use named entity recognition (NER):
```bash
lmtask instruct ner 'UIC is in Chicago.'
model_output_json:
  - label: ORG
    span: [0, 3]
    text: UIC
  - label: O
    span: [4, 6]
    text: is
  - label: O
    span: [7, 9]
    text: in
  - label: LOC
    span: [10, 15]
    text: Chicago
```

The command-line program can also be used to train new models with just
configuration.  See the [trainconf](trainconf) directory for examples of
configuration files.

Before you train, you might want to get a sample of the configured dataset,
which is rendered with the model's chat template (see the sketch at the end of
this section):
```bash
lmtask -c trainconf/imdb.yml sample -m 1
________________________________________
<|begin_of_text|><|start_header_id|>system<|end_header_id|>

Cutting Knowledge Date: December 2023
Today Date: 26 Jul 2024

You are a sentiment classifier.<|eot_id|><|start_header_id|>user<|end_header_id|>

Only output the sentiment.

### Review:We always watch American movies with their particular...

### Sentiment:```positive```<|eot_id|>
```

To train a new sentiment model on the IMDB dataset:
```bash
lmtask -c trainconf/imdb.yml train
```
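The sampled datapoint above is the prompt after the model's chat template has
been applied (note the Llama 3 special tokens).  For reference, the following
is a minimal standalone sketch of the same kind of rendering using the Hugging
Face transformers API; the model repository name is an assumption, and this is
not the package's internal prompt rendering code:
```python
from transformers import AutoTokenizer

# assumption: the (gated) Llama 3.1 instruct tokenizer; any Hub model
# that ships a chat template works the same way
tokenizer = AutoTokenizer.from_pretrained('meta-llama/Llama-3.1-8B-Instruct')
messages = [
    {'role': 'system', 'content': 'You are a sentiment classifier.'},
    {'role': 'user', 'content': 'Only output the sentiment.'},
]
# render the prompt as text (tokenize=False) and append the assistant
# header so the model generates the answer (add_generation_prompt=True)
prompt = tokenizer.apply_chat_template(
    messages, tokenize=False, add_generation_prompt=True)
print(prompt)
```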
### Python API

The Python API can be used to access tasks directly:
```python
>>> import json
>>> from zensols.lmtask import ApplicationFactory
>>> from zensols.lmtask import InstructTaskRequest

# create the task factory
>>> fac = ApplicationFactory.get_task_factory()

# list configured tasks
>>> fac.write(short=True)
base_generate (base generate text)
instruct_generate (base generate text)
ner (tags named entities)
sentiment (classifies sentiment)

# create a sentiment analysis task
>>> task = fac.create('sentiment')

# inference
>>> sents = 'I love football.\nI hate olives.\nEarth is big.'
>>> res = task.process(InstructTaskRequest(instruction=sents))

# print the JSON result as formatted text
>>> print(json.dumps(res.model_output_json, indent=4))
[
    {
        "index": 0,
        "sentence": "I love football.",
        "label": "+"
    },
    {
        "index": 1,
        "sentence": "I hate olives.",
        "label": "-"
    },
    {
        "index": 2,
        "sentence": "Earth is big.",
        "label": "n"
    }
]
```


## Datasets

The package features easy to configure datasets and data processing on them.
For example, the following is taken from the
[IMDB training configuration](trainconf/imdb.yml) example:
```yaml
# a dataset factory instance used by the trainer (lmtask_trainer_hf)
lmtask_imdb_source:
  class_name: zensols.lmtask.dataset.LoadedTaskDatasetFactory
  # the dataset name (downloaded if not already); this can be a `pathlib.Path`,
  # Pandas dataframe or Zensols Stash
  source: stanfordnlp/imdb
  # use only the training split
  load_args:
    split: train
  # the task that consumes the data, which will format each datapoint
  # specifically for that task's model
  task: 'instance: lmtask_task_imdb'
  # preprocessing Python source code to add labels and subset the data
  # (ds.select)
  pre_process: |-
    ds = ds.map(lambda x: {'output': 'positive' if x['label'] == 1 else 'negative'})
    ds = ds.rename_column('text', 'instruction')
    ds = ds.shuffle(seed=0)
    # 7K takes 55m
    ds = ds.select(range(7_000))
```


## Changelog

An extensive changelog is available [here](CHANGELOG.md).


## Community

Please star this repository and let me know how and where you use this API.
Contributions as pull requests, feedback and any other input are welcome.


## License

[MIT License](LICENSE.md)

Copyright (c) 2025 Paul Landes


[pypi]: https://pypi.org/project/zensols.lmtask/
[pypi-link]: https://pypi.python.org/pypi/zensols.lmtask
[pypi-badge]: https://img.shields.io/pypi/v/zensols.lmtask.svg
[python311-badge]: https://img.shields.io/badge/python-3.11-blue.svg
[python311-link]: https://www.python.org/downloads/release/python-3110
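For readers unfamiliar with the Hugging Face `datasets` calls in the
`pre_process` block of the [Datasets](#datasets) section above, the following
is a minimal standalone sketch of the same transformation.  It assumes only
that the `datasets` package is installed and does not involve the
zensols.lmtask trainer at all:
```python
from datasets import load_dataset

# download the IMDB training split (or reuse the locally cached copy)
ds = load_dataset('stanfordnlp/imdb', split='train')
# map the integer label to the text output the task expects
ds = ds.map(lambda x: {'output': 'positive' if x['label'] == 1 else 'negative'})
# the task reads the review text from the 'instruction' column
ds = ds.rename_column('text', 'instruction')
# deterministically shuffle, then subset to 7,000 datapoints
ds = ds.shuffle(seed=0).select(range(7_000))
print(ds[0]['instruction'][:60], '->', ds[0]['output'])
```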